ITPub博客

首页 > Linux操作系统 > Linux操作系统 > Oracle Wait Event For Users

Oracle Wait Event For Users

原创 Linux操作系统 作者:yanggq 时间:2019-05-24 21:21:05 0 删除 编辑

ABSTRACT

Oracle Wait Events have been around since version 7.0.12 was released. Why did Oracle introduce this functionality? So that you can get an idea of exactly what events your process is waiting to happen before it can continue processing. If you know exactly what events you are waiting on, you will know exactly where your bottlenecks lie. From there, you can tune your system to make your process run as fast as possible. This paper addresses the Oracle Wait Events and shows you how to use them to diagnose and resolve user problems.

INTRODUCTION

This paper is an extension of a previous paper I wrote titled "A Guide to Solving User Problems". If you do not have a copy of this paper, feel free to ask me for a copy. In that paper, I briefly discussed the Wait Event interface. I quickly showed how this interface could be used to diagnose user problems. At that time, I indicated that I would write another paper that would explain the Wait Event interface in more detail. Hopefully, you've already read the guide mentioned above. And hopefully, you are interested in delving into the wide world of the Oracle Wait Event interface.

In later sections, we will get a chance to see different wait events. Each one has a different name. The hardest part of this topic to grasp is an understanding of the individual events and what action to take to improve the wait time for that event. While I expect that an understanding of each wait event may take a long time to master, I imagine that the other material presented in this paper will be very easy to grasp. So don't feel frustrated if you do not understand each and every event as it is discussed. Hopefully, you'll master that specific event in the future. I hope that you use this paper to understand the process rather than get bogged down in the details.


WHY WAIT EVENTS?

In the beginning, the only method to tune an Oracle database was to use what is called "ratio based tuning". You've seen them all before. People say, "my query is running slow and I've tuned it as best as I can. What could be the problem?" The normal response is "well, have you checked your buffer cache hit ratio?" So the individual runs off and issues a query similar to the following:


SQL> column hit_ratio format 990.99
SQL> select (1-(sum(decode(name,'physical reads',value,0)) /
2 (sum(decode(name,'db block gets',value,0)) +
3 sum(decode(name,'consistent gets',value,0))))) as hit_ratio
4 from v$sysstat;

HIT_RATIO
---------
0.81

Here, the user is trying to get a ratio of disk reads to buffer reads to determine how efficient the size of their buffer cache is. Then they apply a rule of thumb similar to "If the buffer cache hit ratio is below 95%, then increase the size of the buffer cache by increasing the DB_BLOCK_BUFFERS parameter." In my example, my buffer cache hit ratio is 81%. Applying the rule of thumb, I need to increase my buffer cache size.

Pick up the any one of the countless numbers of books being printed today, and they'll give you all sorts of queries to measure many different types of ratios. You have queries that calculate ratios to provide the overall effectiveness of many different components of the SGA, i.e. the Shared Pool, Buffer Cache, etc. These books go further with a rule of thumb for each ratio. If the ratio is over/under a certain percentage, do something

So what's the problem with that? This query only provides a measure of the "overall" effectiveness of the memory structure. This does not tell you how effective the buffer cache is for a specific process. The hit ratio is generally calculated since the database instance was started. So people adapted and changed their approach. They would query for specific values used to calculate the ratio. They would perform some processing. They would then query for their values again. The differences in the values would provide a "delta", or difference during that specific time period. How did that memory structure perform during that time period? This is where the tuning method starts to get difficult. How can you be sure that only the process in question influenced the memory structure? What time period do you use? Is 1 minute good enough? How about 5, 10, or 15 minutes? And in order to know where to look, you have to take ratios of every memory structure's performance in the SGA and see which one looks like a problem. Performance tuning this way becomes very complex and can leave you with many unanswered questions. There simply has to be a better way.

Don't be too upset with yourself if you've been tuning systems with this methodology. It's been preached this way for many years. Like I said, there are many books on the market which tell you to tune this way. Go to any Oracle conference and ask any vendor who is more than willing to sell you a tuning tool what time period they use to calculate deltas in their hit ratios. Either they will have the answer ready for you or they can find out it quickly. The last conference I went to, only one vendor said that they don't strongly rely on hit ratios in their tuning tools. This vendor, Luminate, was the only one who is strongly relying on wait events in their tuning tool.

Another thing you may have seen is the checklist approach to tuning. At first, you calculated a handful of ratios. Was the first ratio within an acceptable range? Check! Was the second ratio in an acceptable range? Check! And so on through the list. You get all the way through the list and every thing looks good, but you know that there is still a bottleneck. So someone new comes out with a new checklist. This new list is more complete. With this new list, you get your problem solved. But over time, this list is not thorough enough to adequately tune the system. So a new checklist is developed. After all these years, don't you think the one true, all-inclusive list would have been developed? With wait event based tuning, you will not rely on these checklists like you have in the past.

Hopefully, I've convinced you that there is a better way to tuning than relying on calculated ratios. I'm not saying that ratios do not have their place. They do. I still use ratios. But I only use them to get an overall picture of the health of my memory structures. So don't abandon them entirely. But keep in mind what they are good for. One of the things they don't show you is other non-memory related bottlenecks. For instance, what ratio would you use to show you if the synchronization of the online redo log files is your bottleneck?


HISTORY

A long time ago, long before I started working with Oracle products, some smart people within Oracle Corporation recognized the need for the wait event interface. Other database systems used ratios for tuning methodologies. Even Unix operating systems were tuned with ratios. Back then, the only computer systems utilizing wait events were mainframe systems. Juan Loaiza from Oracle Corporation, started pushing for wait events to be implemented in the database. Wait events first appeared with the 7.0.12 version of Oracle. And wait events have been around ever since. In Oracle 7.0, there were 104 different wait events. As the table below shows for various Oracle versions, the number of wait events increases with each release.


Version # of events
7.0 104
7.3.4 106
8.0.6 159
8.1.7 217
9.0.1 287
Table 1. Wait Events per Oracle Version

Even though wait events have been around since the earliest versions of Oracle 7, they were not widely used. Most people still used ratios in their tuning efforts. It was not until Anjo Kolk wrote his famous YAPP paper that a select few began to investigate using the wait event interface. Some of the first adopters of this methodology include Mogens Norgaard, Cary Millsap, and Craig Shallahamer. In fact, it is Craig Shallahamer's work that is behind Luminate's new product line. The Oracle experts are getting behind wait events! Maybe you should too!

CONFIGURATION

In order to effectively use wait events, you must first configure your database correctly. Luckily, it is very easy to do. The only thing you have to do is set TIMED_STATISTICS equal to TRUE for your database. You need to set this so that you can obtain timing information for the events. You do not even have to restart your database for this to take effect. Since Oracle 7.3, this has been a system-modifiable parameter. In SQL*Plus, issue the following:

SQL> alter system set timed_statistics=true;

System altered.

If you want this setting to persist across database shutdowns, then set this parameter in your INIT.ORA file. Contrary to popular belief, setting this parameter does not introduce excessive overhead on your system. This was true for older versions of Oracle that used a different timing method, but is no longer true. I always set this parameter in all of my databases.

And that's it! That's all the configuration changes you'll need to make. The rest is already set up for you.


TUNING GUIDELINES

This paper is more about the wait events than an overall guide to tuning. There are many good sources for overall tuning. But I offer these points when tuning:

1. Tune your SQL statements first. Most of your performance problems are directly tied to poorly tuned SQL statements. If your SQL statements are not properly tuned, then the methods in this paper will provide little benefit. Tune your application first, and then look to these methods. This paper does not discuss SQL tuning.
2. Tune with a purpose. Know before you start what you want to accomplish. If you just start tuning just because you want to tune, then you can quickly become lost. Tune with a specific goal in mind. For instance, I once had an import process that was taking a considerable amount of time. My goal was to find out what the bottleneck was and provide some relief for that bottleneck. In the end, a quicker import process was desired. I knew how long it took before I started to tune the process (over 48 hours) and I wanted to import in less than 12 hours.
3. There is always a bottleneck. You'll just have to live with this fact. Get over it! No matter what you do, there will always be a bottleneck. As you will see in this paper, when you reduce one bottleneck, another will pop up to take its place. Hopefully, this new bottleneck is not nearly as much of a restriction as the previous one.
4. Know when to stop tuning. Eventually, there comes a time when the amount of effort spent to squeeze a few more minutes of time is not worth it. The Law of Diminishing Returns eventually will come into play. In my import problem discussed above, I knew that a 12-hour import was acceptable for me. Any quicker might have been an unrealistic goal. So have a stopping point in mind when you begin your tuning process, otherwise, you'll tune your process forever.


THE V$ VIEWS

There are a few V$ views that we want to explore. This is one of the best places for getting tons of good information. I'll explain what each view is used for in this section, and in later sections, we'll see the views in action.


V$EVENT_NAME

This view contains all of the wait events in the database. How did I know how many events were in each version that I listed in the table above? I simply did a COUNT(*) on this view in a database of that version. This view not only shows the names of all events, but also gives a description of the P1, P2, and P3 parameters for that event. What are these parameters for? They give more specific information about that event. For instance, let's look at the parameter descriptions for a specific event.

SQL> column PARAMETER1 format a10
SQL> column PARAMETER2 format a10
SQL> column PARAMETER3 format a10
SQL> column name format a25
SQL> select * from v$event_name where name='buffer busy waits';

EVENT# NAME PARAMETER1 PARAMETER2 PARAMETER3
---------- ------------------------- ---------- ---------- ----------
92 buffer busy waits file# block# id

Here we see the information for the buffer busy waits event. Don't worry about what this specific event exactly means. But you can get an idea from the name, that the buffer is busy and that's why we are waiting. What block of information in the buffer is causing our problem? From here, we can see that the P1 and P2 parameters make a difference. They point out that P1 is the file# and P2 is the block#. If I had specific values for these parameters, I will instantly know which block in which file we are waiting on.


IDLE EVENTS

There are certain events that are deemed to be "idle events". While the session is waiting on these events to occur, these events are not contributing to a bottleneck. The system is sitting idle waiting for something before the processing can continue. The following list shows idle events, which we ignore in our analysis:

pmon timer smon timer rdbms ipc reply
SQL*Net message from client SQL*Net message to client
slave wait null event dispatcher timer
parallel query dequeue wait parallel query idle wait
virtual circuit client message
Table 2. List of Idle Events

Let's look at an example of why these idle events are ignored. Let's suppose that our connected session may be waiting for the user to press a key or click a button in their application. The system is waiting for the user's next command. While it is waiting, the session in the database is idle. The system has no choice but to wait for the user's next instruction. In this case, the session is waiting on the SQL*Net message from client event. Another example concerns SMON. SMON is a process that performs recovery and coalesces free space in tablespaces. SMON does its work, sleeps for a while, and then wakes up to see if it has work to do. While SMON is asleep, the session corresponding to the SMON process is waiting on the SMON timer event.


V$SESSION_WAIT

This view will show you exactly what a particular session is waiting on exactly at that moment. Let's look at a specific example.

SQL> select event,p1,p2,p3 from v$session_wait
2 where sid=32;

EVENT P1 P2 P3
------------------------------ ---------- ---------- ----------
db file sequential read 75 17575 1

Here, we can see that the session corresponding to SID 32 is waiting on the db file sequential read event. So what do the P1, P2, and P3 parameters signify? For that, we'll have to look it up in the V$EVENT_NAME view.

SQL> select * from v$event_name where name='db file sequential read';

EVENT# NAME PARAMETER1 PARAMETER2 PARAMETER3
---------- ---------------------------- ---------- ---------- ----------
94 db file sequential read file# block# blocks

Here, we can see a short description of each parameter. Unfortunately, V$EVENT_NAME does not give a description of this exact event. For that, we'll have to look it up in the Oracle Reference Guide, Appendix A. From that document, we find that this event is caused by, "The session waits while a sequential read from the database is performed. This event is also used for rebuilding the controlfile, dumping datafile headers, and getting the database file headers." I could have also gotten a description of the parameters from this document.

So let's put this information together. At that specific point in time I issued my query against V$SESSION_WAIT, that particular session was waiting on the db file sequential read event. This event is caused by the session waiting for a sequential read from a database file, i.e. and index read of an object. Which block was being read into the buffer cache? This block is in file 75 (P1), block 17575 (P2), and the event was trying to read 1 block (P3). Which object does this block belong to? We'll have to query the data dictionary to find out. Now that I understand what these three parameters indicate, I can use my knowledge of the data dictionary tables to query to find out this information.

SQL> select owner,segment_name from dba_extents
2 where file_id=75 and 17575 between block_id and block_id+blocks-1;

OWNER SEGMENT_NAME
---------- -----------------------------------
GEO SDE_BLK_644

From this query, it is clear exactly which database segment is being read that is causing the session to wait.

There are three more columns in V$SESSION_WAIT worth talking about.

SQL> select event,wait_time,seconds_in_wait,state
2 from v$session_wait where sid=32;

EVENT WAIT_TIME SECONDS_IN_WAIT STATE
------------------------------ ---------- --------------- --------------
db file sequential read 0 12 WAITING

In order to make any sense out of this query, you need to first examine the STATE column. If the state is WAITING, then you can examine the other two columns. The SECONDS_IN_WAIT column shows how many seconds the session is waiting on this event to pass. The WAIT_TIME column is a non-zero value indicating the session's last wait time. If the value is zero, then the session is currently waiting. If the STATE column is something other than WAITING, then ignore the contents of the other two columns. For instance:

SQL> select event,wait_time,seconds_in_wait,state
2 from v$session_wait where sid=32;

EVENT WAIT_TIME SECONDS_IN_WAIT STATE
--------------------------- ---------- --------------- --------------
SQL*Net message to client -1 0 WAITED SHORT TIME

Here, we see that the state column is WAITED SHORT TIME. In this state, the SECONDS_IN_WAIT column is meaningless.

V$SESSION_EVENT

So you now have an idea of how to determine exactly what a session is waiting on for that point in time. Unfortunately, you may not be able to capture pertinent information about all events the session is waiting on even if you repeatedly issue queries against V$SESSION_WAIT. Luckily for us, the V$SESSION_EVENT view captures this information for us.


SQL> select EVENT,TOTAL_WAITS,TIME_WAITED,AVERAGE_WAIT
2 from v$session_event where sid=32
3 order by AVERAGE_WAIT desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
free buffer waits 1 103 103
latch free 22 48 2.18181818
db file scattered read 71302 151725 2.12792068
direct path write 44 84 1.90909091
file open 60 18 .3
direct path read 24 7 .291666667
db file sequential read 8493855 1014829 .119478023
SQL*Net message from client 939903 42514 .045232327
SQL*Net message to client 939904 177 .000188317

9 rows selected.

Here, we can see all of the events this particular session has been waiting on for the life of the session. This view is instrumental in showing us which wait events are the biggest bottlenecks for our session. The TOTAL_WAITS column indicates the number of times this session waited on that particular event. The TIME_WAITED is the total time the session waited on this event. And the AVERAGE_WAIT column is the average wait time for the event. This value should equal the time waited divided by the total number of waits. Later in this paper, we'll work extensively with this view.

V$SYSTEM_EVENT

V$SYSTEM_EVENT is just like V$SESSION_EVENT but it captures wait event information for the entire system. There is no SID column in this view. An example of this view with idle events removed looks like:

SQL> select event,total_waits,time_waited,average_wait
2 from v$system_event
3 where event not in ('SQL*Net message from client',
4 'rdbms ipc message','pmon timer','smon timer',
5 'SQL*Net message to client','Null event',
6 'rdbms ipc reply')
7 order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
db file sequential read 1163362 251856 .216489794
db file scattered read 48284 113664 2.35407174
enqueue 2150 88771 41.2888372
control file parallel write 44211 79054 1.78810703
buffer busy waits 16089 55564 3.45353969
direct path write 8910 46300 5.19640853
SQL*Net more data to client 2359170 24918 .010562189
log file sync 1781 6504 3.65188097
latch free 8779 6429 .732315754
direct path read 8747 3596 .411112381
log file parallel write 4845 3342 .689783282
db file parallel write 433 2005 4.63048499
log file sequential read 500 194 .388
log file switch completion 5 156 31.2
library cache pin 206 144 .699029126
db file parallel read 59 141 2.38983051
sort segment request 2 102 51
file open 3030 71 .023432343
control file sequential read 1709 57 .033352838
process startup 7 48 6.85714286
log file single write 20 34 1.7
SQL*Net break/reset to client 1122 19 .016934046
file identify 81 17 .209876543
LGWR wait for redo copy 628 13 .020700637
refresh controlfile command 189 6 .031746032
library cache load lock 1 4 4
reliable message 1 0 0
SQL*Net more data from client 6 0 0
instance state change 2 0 0

29 rows selected.

This event shows the cumulative waits of all sessions since instance startup.


AVERAGE_WAIT OR TIME_WAITED?

In my research, I've seen people order the event in a descending order on one of two columns. They either order the results by the average wait time, or the total time waited. Personally, I don't take too much of a stand one way or another. I look at both orderings.

SQL> select event,total_waits,time_waited,average_wait
2 from v$session_event where sid=101
3* order by average_wait desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
SQL*Net message from client 48423 3320676 68.5764203
db file scattered read 1 11 11
direct path write 330 1726 5.23030303
log file sync 2 4 2
latch free 154 70 .454545455
direct path read 358 121 .337988827
db file sequential read 5995 1644 .274228524
file open 33 4 .121212121
SQL*Net more data to client 22771 155 .006806904
SQL*Net message to client 48424 10 .000206509
SQL*Net break/reset to client 6 0 0

11 rows selected.

In the query above, we ordered by the average wait, descending. Ignoring idle events, we can see that the event with the highest average wait time is the db file scattered read event. But this event only occurred once. So is it really contributing to the overall wait time? But the next event, direct path write, does carry a high average wait. And it is executed often enough. A high average wait with a good number of executions can be a significant bottleneck. So maybe we should be looking at the total wait time instead. Let's see how that query works.

SQL> select event,total_waits,time_waited,average_wait
2 from v$session_event where sid=31
3* order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
SQL*Net message from client 155559 264502 1.70033235
log file sync 3790 3855 1.0171504
free buffer waits 3 202 67.3333333
db file sequential read 926 133 .14362851
log file switch completion 14 118 8.42857143
SQL*Net message to client 155560 35 .000224994
SQL*Net more data from client 3812 33 .008656873
db file scattered read 159 29 .182389937
rdbms ipc reply 17 21 1.23529412
latch free 3 3 1
SQL*Net more data to client 646 2 .003095975
SQL*Net break/reset to client 34 2 .058823529
file open 22 0 0

13 rows selected.

This query is from a completely different session but we have ordered it by the total time waited. Again, we will ignore any idle waits. In this example, the log file sync event contributes the highest total time waited. Notice how low its average wait is. If we ordered by average wait time, then we would not even consider this query. Yet it is the event that contributes the most to our wait time. If we sorted by the overall time waited, then we would concentrate on the log file sync event. If we sorted by the average wait time, then we would concentrate on the free buffer waits event. So which do we look at first?

In the end, it doesn't matter which column you sort on. You'll want to look at both the total time waited and the average wait for each event. Then, you'll want to balance these numbers against the total number of waits for that event. If the average wait for an event is extremely high but the event occurred only once, why worry about it? But what if that wait contributed the most to the wait time? Then we'd want to worry about it. So I try to look at everything to determine which wait event is the most important to me.

NOW WHAT?

Ok. You've read everything up to this point. You know what a wait event is. You have a good grasp of the relevant V$ views. So what do you do now? How does this information help us? To understand that, we have to look at the following formula:

response time = service time + wait time + latency time

The time it takes a user to receive a response is equal to the time spent servicing the request (i.e. CPU time), the time the database waits for specific events to occur and the latency time it takes to transmit data from the database to the user's application interface (i.e. network transmission time). From the end user's perspective, they want to reduce the response time. Decreasing at least one of the three components mentioned above is the only way to improve response time. Unless we get better bandwidth or work hard with different network settings, we can't do too much to improve the latency time. Unless we get better and faster hardware, to improve sorting for example, we won't be able to improve service time either. So most of our efforts in database tuning tend to fall towards reducing the wait time. Keep in mind that if service time is the biggest contributor to the response time, then reducing the wait time will not result in large gains in decreasing the overall response time.

So how do we reduce the wait time? In Oracle databases, this is done by examining the V$ views mentioned in this paper and eliminating or reducing the contention causing a specific wait event to occur. Everything you've seen so far is pretty easy. You've looked at a session and its wait events. You've determined from V$SESSION_EVENT that db file sequential read is the event causing most of your bottleneck. How do you relax that restriction? That's the hardest part of this process.

As you explore wait events, it will quickly become apparent that the hardest part is to decide which course of action to take to reduce the effect that specific event is having on your overall wait time. For instance, the write complete waits event indicates that the session is waiting to flush a changed buffer to disk. There is a bottleneck here. How do we reduce this bottleneck? We either make larger redo log files, or modify the INIT.ORA parameter called log_checkpoint_interval. How do I know this? First off, it really helps if one has a thorough understanding of the Oracle RDBMS architecture. One needs to know how every piece of the database server architecture works and its interdependence on other pieces. In my example, I know that waiting to flush a changed buffer to disk is indicative of check pointing. Increasing the size of the online redo log files can reduce check pointing.

The best source of information is the Oracle Reference Guide, Appendix A. It gives a brief description of every wait event. Hopefully, from that description and your knowledge of the Oracle architecture, you can arrive at a successful resolution to the bottleneck. If you need further help, search for the event on Google at http://www.google.com.

So take a look at a session. Determine which event or two is contributing the most to your overall wait time. Take appropriate steps to reduce the wait time for that event. If you've successfully reduced this bottleneck, you will find that some other event pops up as the major bottleneck to your process. Don't be alarmed. This is supposed to happen. Remember that there is always a bottleneck. You now have to decide if this bottleneck is worth reducing. How do you know if it is worth it? Hopefully, you've already decided on a stopping point in your tuning efforts. This determines if you continue to tune further.


EXAMPLES IN ACTION

This section of the paper will show a few examples on how wait events can be used to tune Oracle processes. Hopefully, after reading this section, one can get a feel of how to use wait events in their tuning efforts.
EXAMPLE 1

We'll take a look at an example problem. For this problem, we know that we have some process that is experiencing a long response time. For our demonstration, we do not know what that process is doing. And even though it would be nice to know, we'll see how knowledge about the specific process is unnecessary for our example. As we look at this example, we'll see how to use wait events to reduce the process response time by reducing the process wait time.

SQL> select event,total_waits,time_waited,average_wait
2 from v$session_event where sid=7 order by average_wait desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
sort segment request 1 100 100
log file sync 3 18 6
SQL*Net message from client 3882 21453 6
db file scattered read 97 72 1
direct path write 2085 1320 1
async disk IO 13407 2471 0
SQL*Net break/reset to client 4 0 0
SQL*Net more data to client 1 0 0
direct path read 2041 176 0
SQL*Net message to client 3883 1 0
db file sequential read 85 6 0

11 rows selected.

Here we see a process that is being executed by the user corresponding to sid 7. The only thing I can tell you about this session is that a single SQL statement has been issued. Look at all the things that this statement had to wait on just for one single SQL statement! The SQL statement took 32 seconds to execute. Let's see if we can improve that time just by tuning the wait events.

The first thing to remember is to ignore the idle events. So we ignore the SQL*Net message from client and the SQL*Net message to client events. Look at the event with the highest average wait time, the sort segment request event. A careful reader may note that this event only occurred once and that it didn't contribute significantly to the overall wait time. If I were truly tuning this process, I'd probably start with either the direct path write or aysnc disk IO events. These two non-idle events contribute the most overall time to the process. But for demonstration purposes, I'd like to start with the sort segment request event.

So what does this event suggest? It suggests that this process had to acquire a sort segment and had to wait until that segment was obtained. Now use your knowledge of Oracle architecture. Where are sort segments stored and why are they obtained? If a SQL query needs to sort data, it first tries to sort the data in memory. If the memory allocated for the user's sort space (defined by sort_area_size) is insufficient, then the process must perform its sort on disk. This is done in a temporary tablespace. Before the process can sort on disk in the user's temporary tablespace, it must acquire space in this tablespace. This space is called a sort segment. So logically, it follows that this didn't have enough memory to perform its sort to satisfy the query. Since it didn't have enough memory, it had to acquire a sort segment on disk that took some time to complete; in our case it took 100 milliseconds.

So now we know where to look. Why was there not enough memory for the sort? Looking at the user's process, I found that they set their sort_area_size to 16K. We upped this to 1M and reran the query. This time, the query executed in 10 seconds. Let's look at the wait events for this process after we changed the sort_area_size.


SQL> select event,total_waits,time_waited,average_wait
2 from v$session_event where sid=7 order by average_wait desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
SQL*Net message from client 3881 33472 9
direct path write 117 297 3
log file sync 1 2 2
async disk IO 533 298 1
db file scattered read 97 3 0
direct path read 65 1 0
SQL*Net break/reset to client 1 0 0
SQL*Net message to client 3882 1 0

8 rows selected.

Notice that the sort segment request event is completely gone! By tuning this event, we were able to completely eliminate it from the picture. This may not always be the case. The event may still be there, but hopefully, we've reduced the wait time (average and/or overall).

EXAMPLE 2

In this example, we have the luxury of knowing exactly what the user is trying to accomplish before we even look at the wait events for that session. Our user is trying to create an index on a table with over 1 million rows.

SQL> create index db_objects_idx on db_objects(object_id)
2 tablespace peasland;

Index created.

Elapsed: 00:06:43.03

Here we can see that it took almost seven minutes to create this index. Now I know that this is not a terribly long time to wait for an index to be created. But let's see if we can improve on this time.


Let's start by looking at the wait events for this session. Remember to ignore idle events.


SQL> select event,TOTAL_WAITS,TIME_WAITED,AVERAGE_WAIT
2 from v$session_event where sid=9 order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
async disk IO 78400 19794 0
direct path write 13928 10063 1
SQL*Net message from client 30 6915 231
log buffer space 770 5514 7
db file sequential read 731 5189 7
db file scattered read 1827 4255 2
direct path read 12035 2695 0
free buffer waits 16 890 56
log file sync 2 8 4
latch free 1 1 1
SQL*Net message to client 31 0 0
SQL*Net break/reset to client 1 0 0

12 rows selected.

Here, we can see multiple I/O events that are giving us problems. The async disk IO, direct path write, db file sequential read, db file scattered read, and direct path read events are all I/O events.

We could handle these individually, but you will find that taking action on one event can often lead to improvements in related events. These are all I/O related events. What could be going on here? If we look at the process even further, we can find that the tablespace that will hold the index resides on the same disk volume that holds the table. In fact, both the index and table are in the same tablespace! It is obvious that we have disk head contention here. The most obvious way to resolve this contention is to create the index on a different disk volume.

SQL> create index db_objects_idx on db_objects(object_id)
2 tablespace peasland_idx;

Index created.

Elapsed: 00:04:38.44

Here, we created the index in a completely different tablespace (which is on a different disk volume). Notice that we went from 06:43 to 04:38, or a 31% improvement! Let's look at the wait events for this process now!

SQL> select event,TOTAL_WAITS,TIME_WAITED,AVERAGE_WAIT
2* from v$session_event where sid=9 order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
async disk IO 76843 17246 0
SQL*Net message from client 29 9342 322
direct path write 14246 9245 1
log buffer space 772 4707 6
direct path read 11593 2261 0
db file scattered read 1824 1650 1
db file sequential read 170 80 0
log file sync 2 12 6
SQL*Net message to client 30 0 0

9 rows selected.

Take particular notice of the db file sequential read and db file scattered read events. Before our resolution action, these two events totaled 9,444 milliseconds. After we built the index on a new disk volume, these two events totaled 1,730 milliseconds. This was significant savings and greatly contributed to our decrease in overall processing time. Also, the free buffer waits, latch free, and SQL*Net break/reset to client events are completely gone! Our other IO events we identified in the beginning of this example also experienced a reduction in processing time, though not as significant.

The point to note here is that one simple change in our process had an effect on multiple events. Just as the Oracle RDBMS architecture has many dependencies and interconnections, so do the wait events. Making one simple change can effect many different wait events.

EXAMPLE 3

This example is pretty basic, but it highlights one of the events, db file scattered read. Let's take a look at a current session that needs improvement. This user entered a query like the following:

SQL> set timing on
SQL> set autotrace traceonly
SQL> select * from db_objects where object_name = 'DBA_USERS';

512 rows selected.

Elapsed: 00:00:10.01

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (FULL) OF 'DB_OBJECTS'

This query took 10 seconds to complete. We'd like to improve this process. Now we could tune this query using SQL tuning methods. A reader familiar with these methods should arrive at the correct conclusion that a full table scan is being performed here and that placing an index on the OBJECT_NAME column would improve performance. But let's look at the wait events and see how this figures into our problem.

SQL> select event,TOTAL_WAITS,TIME_WAITED,AVERAGE_WAIT
2 from v$session_event where sid=9 order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
SQL*Net message from client 98 7588 77
db file scattered read 1903 656 0
log file sync 1 5 5
db file sequential read 19 2 0
SQL*Net message to client 99 0 0

Ignoring the idle waits, we can see that the biggest contributor is the db file scattered read event. This event signifies a full table scan. At first, this might seem to be a misnomer. But scattered reads mean full table scans. Let's tune the query and see how this improves. We'll add an index to this column and perform the test again.

SQL> create index db_objects_o_name_idx on db_objects(object_name)
2 tablespace peasland_idx;

Index created.

Now we'll restart the whole process and see if there was improvement.

SQL> set timing on
SQL> set autotrace traceonly
SQL> select * from db_objects where object_name = 'DBA_USERS';

512 rows selected.

Elapsed: 00:00:01.99

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'DB_OBJECTS'
2 1 INDEX (RANGE SCAN) OF 'DB_OBJECTS_O_NAME_IDX' (NON-UNIQUE)

Notice that the query went from 10 seconds to two seconds. Also notice by the execution plan that the query is now using the index we just created. We should see a different set of wait events too.

SQL>select event,TOTAL_WAITS,TIME_WAITED,AVERAGE_WAIT
2* from v$session_event where sid=9 order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
SQL*Net message from client 96 1959 20
db file sequential read 517 161 0
log file sync 1 6 6
db file scattered read 1 0 0
SQL*Net message to client 97 0 0

Careful examination will show that our biggest wait event for this process is now the db file sequential read event. While db file scattered read indicated a full table scan, db file sequential read indicates index reads. This should be expected since our query is now using an index. Also note that we waited much less time for the sequential reads than for scattered reads. This is why our query performed better.

EXAMPLE 4

For this example, let's jump straight to the wait events for the process.

SQL> select event,TOTAL_WAITS,TIME_WAITED,AVERAGE_WAIT
2* from v$session_event where sid=9 order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
SQL*Net message from client 123308 16221 0
db file sequential read 17831 849 0
SQL*Net message to client 123309 31 0
log file sync 1 4 4
db file scattered read 1 1 1

For the non-idle events, only one really sticks out. The db file sequential read event is the only significant non-idle event out there. And the total number of waits for this event is pretty high. What do we know about this event? We know that this signifies an index is used to read a table. We can verify this looking at the user's process.

SQL> set timing on
SQL> set autotrace traceonly
SQL> select * from db_objects where status='VALID';

1232640 rows selected.

Elapsed: 00:03:09.25

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'DB_OBJECTS'
2 1 INDEX (RANGE SCAN) OF 'DB_OBJECTS_STATUS_IDX' (NON-UNIQUE)

Statistics
----------------------------------------------------------
29 recursive calls
0 db block gets
262945 consistent gets
17826 physical reads
0 redo size
143263551 bytes sent via SQL*Net to client
16147956 bytes received via SQL*Net from client
123265 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1232640 rows processed

An index was used. In our case, it is a pretty bad index because the STATUS column only has two distinct values, VALID and INVALID. Also note that there were 17,826 physical reads done to satisfy this query. Let's remove the index.

SQL> drop index DB_OBJECTS_STATUS_IDX;

Index dropped.

Now, let's run the process again.

SQL> select * from db_objects where status='VALID';

1232640 rows selected.

Elapsed: 00:02:59.06

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (FULL) OF 'DB_OBJECTS'

Statistics
----------------------------------------------------------
155 recursive calls
12 db block gets
136773 consistent gets
14861 physical reads
0 redo size
143263551 bytes sent via SQL*Net to client
16147956 bytes received via SQL*Net from client
123265 SQL*Net roundtrips to/from client
4 sorts (memory)
0 sorts (disk)
1232640 rows processed

This time, a full table scan was performed. The elapsed time improved slightly as did the number of physical reads. Now that you know a full table scan was performed, you should already know that we will see db file scattered reads as our major wait event.

SQL> select event,TOTAL_WAITS,TIME_WAITED,AVERAGE_WAIT
2* from v$session_event where sid=9 order by time_waited desc;

EVENT TOTAL_WAITS TIME_WAITED AVERAGE_WAIT
------------------------------ ----------- ----------- ------------
SQL*Net message from client 123308 14932 0
db file scattered read 1883 633 0
SQL*Net message to client 123309 31 0
log file sync 1 5 5
db file sequential read 21 2 0

And the output above proves our point. We had a full table scan indicated by the scattered reads. We had less total waits and waited less total time for the same query. We just removed the index from the equation. It was causing contention in our process.

This example didn't really show too much improvement. Why? There was not significant improvement since the database we ran our process against a test platform with only one user connected to the database. This user had sole access to the buffer cache. Had other users need parts of the buffer cache, the difference in the two processes would have been more significant.


EXAMPLE SUMMARY

In this paper, we've seen a few simple example of how to use wait events to tune database processes. Let's summarize what we've learned with these examples.

Example 1 showed us a problem with the user's sort area. That user did not have enough sort area and had to allocate disk space to complete the sort taking more time than necessary to complete. By modifying that user's sort area space, we were able to sort entirely in memory and reduce the processing time.

Example 2 was probably our most complex example. We saw many different wait events that were I/O related. By looking at the process, we were able to conclude that the index being created was built on the same disk volume that holds the index's table. So we created the index on a different disk volume. I/O was a bottleneck because we had read (read data from the table) and write (build the index) operations to the same disk unit. Separating these operations to different units helped relieve some of the contention. This example also showed us how one simple action can have an effect on multiple wait events.

Examples 3 and 4 really go together. We learned that a scattered read meant a full table scan was being performed. We also learned that a sequential read means the table was read with the use of an index. Sometimes an index helps our query processing time. Other times, it hurts us. These examples also showed how we could never fully get rid of our bottlenecks. Remember, there is always a bottleneck. In Example 3, the scattered read was our bottleneck. We virtually eliminated this bottleneck by placing an index on the table. But in it's placed, the sequential read bottleneck popped up. Example 4 was the converse of Example 3.

None of these examples was truly difficult. Nor were these examples really deep. It is hard to come up with examples that can be explained in a short paper. For a more complex example of tuning a process with wait events, I have another paper to read. My paper titled "Tuning An Import With Wait Events (A Case Study)" shows a much more complex example. Feel free to ask me for a copy if you are interested in reading it.


CONCLUSION

This paper started off by giving a history of the Oracle Wait Event interface. We discussed why ratio tuning is not always the best tuning methodology. Wait Events instantly tell us exactly what events a process is waiting to complete before it can continue processing. There are a few V$ tables that we use to find out these wait events. Using this information, we can determine the best cause of action to resolve our bottlenecks and improve the performance of our process. I hope that the events themselves don't become too much of a burden on your pursuit of understanding and using this methodology. In time, you will become comfortable with these events as you work with them.


来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/26706/viewspace-64560/,如需转载,请注明出处,否则将追究法律责任。

上一篇: Trigger Body
请登录后发表评论 登录
全部评论

注册时间:2002-10-29

  • 博文量
    54
  • 访问量
    40942