Think outside the box. “The Oracle Performance Tuning Process”

Dear Tom,

   I was asked to come out with the performance tuning process which will be used as a 
general guideline when conducting performance tuning for customer. I understand that 
there is no fix rules for perofrmance tuning. Hope you can provide some views on these. 
Thanks in advance.

   Step 1 : Baseline statistic collection.
   Step 2 : Define performance tuning Goals.
   Step 3 : Determine the area and the period of the bottleneck
            (when the slowness occurs, which particular program or 
             overall system slow down at certain time).
   Step 4 : Identify the main constraints
               - System Level
               - Database Level (Statspack report)
               - Program Logic Level (Statspack report)
               - Network Level
               - Client Resource Level 
   Step 5 : Remove Constraints
   Step 6 : Review and Repeat until Tuning goals met.


and we said…

step 2 is really step 0 -- before anyone gets on a plane, train or automobile.<br /><br /><br />step 3 assumes it is a bursty problem, many are not.<br /><br /><br />The problem with tuning is step 5 and 6.  If we tune before we deploy -- well, then you <br />never get to 5 and 6.  <br /><br /><br />Each and every exercise in post mortem tuning is an experience.  Everytime I do it -- it <br />is different.  I guess if I had to categorize i would:<br /><br />a) isolate to the best of your ability which of the 15 tiers people run this is happening <br />on.  Eg: if you have a 48 cpu database server and it is not breaking a sweat -- the <br />database isn't broken.  (i've been there where they are jumping up and down "database <br />slow, database slow", I look at database -- database is doing SQUAT, nothing, no work.  <br />Database is fast)<br /><br />b) make sure all of the code you ever write, your company writes is instrumented to the <br />gills.  Good code has as much debug/trace code in it as "real" code.  Consider what would <br />happen if Oracle didn't have:<br /><br />  sql_trace=true -- that is instruemented code<br />  v$ tables -- that is instrumentation<br />  events -- instruementation on steriods<br /><br />how would you do anything?  You need to make sure every byte of code that comes out of <br />your shop is as instruemented, if not more so, then the database.  Else you are blindly <br />finding your way through a maze.  Just try to pinpoint a bottleneck in a 14 tier really <br />cool system if the only instruemented piece of code is the database and the database is <br />doing fine!<br /><br />c) make sure you understand the problem -- is it a single user having an issue with a <br />single report?  is it everyone on a single screen?  is it some users with the entire <br />system?  is it all users with the entire system?  does it only happen at lunch on <br />wednesday?  does it only happen at 9am, 1pm, 5pm?  <br /><br />d) using the insstruemented trace dumps you get from each layer -- figure out where the <br />bottleneck might be.  This might be sql_trace=true for a single user in a single module.  <br />This might be statspack at the database level, this might be some diagnostic tool for <br />apache in the middle tier, this might be a java source code profiler, whatever.<br /><br /><br />e) remember -- each experience will be different from the last and totally different from <br />the next.  They are mysteries.  If you could codify "do steps 1, 2, 3, 4, 5 and repeat", <br />software would do this for us.  It requires some level of intuition and experience.  You <br />have to learn to ignore some things (eg: those guys that want to move data files around <br />to increase performance -- tell them this -- Ok, you go over there and play around, just <br />remember where everything is RIGHT NOW and be prepared to put it all back when we say so. <br /> Then, ignore them and get back to work ;)   Oh, I can count on one hand without any <br />fingers how many times moving datafiles did the trick -- but it is what everyone does for <br />some reason?<br /><br /><br />sorry -- this is a non-answer answer.  If you have my book Expert one on one Oracle -- in <br />there I relate a tuning story where it was 100% a user interface issue.  "Hey, the <br />database is hung, stupid database is broken".  Turned out it was users "using the <br />application in a manner the developers did not forsee".  The logic the programmers had <br />was:<br /><br /><br />o user swipes timecard<br />o that selects a record for update<br />o user hits enter key to confirm and walks away from terminal<br />o record committed.<br /><br />What users did:<br /><br />o hit enter key to clear screen from the last guy<br />o record committed<br />o user swipes timecard<br />o user walks away<br /><br /><br />So, guess what happens when the last guy from lunch comes in?  They leave their record <br />locked.  Then, a batch process wakes up in the background and tries to post the records <br />-- so it does a select for update on all of the records -- gets some but gets stuck on <br />the record this last guy didn't commit.  Now the people that have their records locked by <br />the batch process are stuck in line - cannot swipe their card.  queues build up, people <br />scream, then -- all of a sudden, all by itself -- problem goes away.<br /><br />It took two days to figure it out.  Had to goto the warehouse floor and observe how <br />people used the system.  It was a client program issue, nothing to do with the database.  <br />But, everyone was 100% sure "database is problem".  <br /><br />Think outside the box. 


About this entry