Selectivity

~~~~~~~~~~~

Doc ID: 72539.1 68992.1

Selectivity is a measure of the proportion of a row source retrieved by

application of a particular predicate or combination of predicates.

Within the Oracle kernel it is expressed as a value between 0 and 1.

The closer the value is to 0 the more selective the predicate is.

Selectivity is only used by the CBO.

Basic Selectivity formula:

~~~~~~~~~~~~~~~~~~~~~~~~~~

Number of records satisfying a condition

Selectivity = -----------------------------------------

Total Number of records
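The basic formula above can be sketched directly; the row counts below are hypothetical illustration values, not from the note:

```python
def selectivity(matching_rows, total_rows):
    """Selectivity = rows satisfying the condition / total rows (0..1)."""
    return matching_rows / total_rows

# Hypothetical: 120 of 1000 rows satisfy the predicate
print(selectivity(120, 1000))  # 0.12 -- closer to 0 means more selective
```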

In the optimizer, selectivity is used to compare the usefulness of various

predicates in combination with base object costs.

Knowing the proportion of the total data set that a column predicate defines

is very helpful in defining actual access costs.

By default, column selectivity is based on the high and low values and the

number of values in the column with an assumption of even distribution of

data between these two points.

Histogram data can give better selectivity estimates for unevenly distributed

column data. There is more discussion regarding Histograms later.

Selectivity is also used to define the cardinality of a particular row source

once predicates have been applied. Cardinality is the expected number of rows

that will be retrieved from a row source. Cardinality is useful in determining

nested loop join and sort costs. Application of selectivity to the original

cardinality of the row source will produce the expected (computed) cardinality

for the row source.
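Applying selectivity to the original cardinality can be sketched as follows (the numbers are hypothetical):

```python
def computed_cardinality(base_cardinality, selectivity):
    # Expected rows retrieved from the row source after the predicate
    # is applied, rounded to whole rows.
    return round(base_cardinality * selectivity)

# Hypothetical row source of 10000 rows and a predicate of selectivity 0.05
print(computed_cardinality(10000, 0.05))  # 500
```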

Glossary of Terms:

~~~~~~~~~~~~~~~~~~

NDV Number of Distinct Values

Cardinality Number of rows

Selectivity Proportion of a dataset returned by a particular predicate (or

group of predicates)

In the following illustrations there are 2 tables (T1 & T2) with columns (c1).

Selectivities:

~~~~~~~~~~~~~~

Without histograms

~~~~~~~~~~~~~~~~~~

c1 = '4076' 1/NDV

c1 > '4076' (High - Value) / (High - Low)

c1 >= '4076' (High - Value) / (High - Low) + 1/NDV

c1 like '4076' 1/NDV
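These rules can be sketched as below, assuming an even distribution between LOW and HIGH; the column statistics (NDV, LOW, HIGH) are hypothetical:

```python
NDV, LOW, HIGH = 100, 0, 10000  # hypothetical column statistics

def sel_eq(ndv=NDV):
    return 1.0 / ndv                      # c1 = value  ->  1/NDV

def sel_gt(value, low=LOW, high=HIGH):
    # c1 > value: fraction of the LOW..HIGH range above the supplied value
    return (high - value) / (high - low)

def sel_ge(value, low=LOW, high=HIGH, ndv=NDV):
    # c1 >= value: the '>' selectivity plus the equality selectivity
    return sel_gt(value, low, high) + 1.0 / ndv

print(sel_eq())                  # 0.01
print(sel_gt(4076))              # 0.5924
print(round(sel_ge(4076), 4))    # 0.6024
```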

Join selectivity

~~~~~~~~~~~~~~~~

The selectivity of a join is defined as the selectivity of the most selective

join column adjusted by the proportion of not null values in each join column.

Sel = 1/max[NDV(t1.c1),NDV(t2.c2)] *

( (Card t1 - # t1.c1 NULLs) / Card t1) *

( (Card t2 - # t2.c2 NULLs) / Card t2)
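The join selectivity formula above, as a sketch; all table statistics below are made-up illustration values:

```python
def join_selectivity(ndv1, ndv2, card1, nulls1, card2, nulls2):
    # 1/max(NDV) adjusted by the proportion of not-null join values
    return (1.0 / max(ndv1, ndv2)) \
        * ((card1 - nulls1) / card1) \
        * ((card2 - nulls2) / card2)

# T1: 1000 rows, 50 distinct join values, 100 NULLs in t1.c1
# T2: 5000 rows, 200 distinct join values, no NULLs in t2.c2
print(round(join_selectivity(50, 200, 1000, 100, 5000, 0), 6))  # 0.0045
```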

Bind Variable selectivity

~~~~~~~~~~~~~~~~~~~~~~~~~

Bind variables present a special case because the optimizer has no idea what

the bind variable value is prior to query optimization. This does not present

a problem with equality predicates since a uniform distribution of data is

assumed and the selectivity is taken as 1/NDV for the column. However for

range predicates it presents a major issue because the optimizer does not

know where the range starts or stops. Because of this the optimizer has to

make some assumptions as follows:

c1 = :bind1 1/NDV

c1 > :bind1 Default of 5%

c1 >= :bind1 Default of 5%

c1 like :bind1 Default of 25%
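The assumptions above can be expressed as a simple lookup; this is a sketch of the rules, not the actual kernel logic:

```python
# Hardcoded defaults used when the optimizer cannot see the bind value
DEFAULTS = {'>': 0.05, '>=': 0.05, 'like': 0.25}

def bind_selectivity(operator, ndv):
    if operator == '=':
        return 1.0 / ndv       # equality: uniform distribution assumed
    return DEFAULTS[operator]  # range and LIKE predicates: defaults

print(bind_selectivity('=', 100))     # 0.01
print(bind_selectivity('>', 100))     # 0.05
print(bind_selectivity('like', 100))  # 0.25
```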

Bind variables

~~~~~~~~~~~~~~

Bind variables are place holders for query input values. They are a pointer to

a memory location where data value(s) will be placed.

Note that the presence of bind variables has no effect on queries that are

optimised using the RBO. They only affect CBO query optimization because the

CBO attempts to use column value information to determine the optimal access

path for the query.

When no values are supplied, the CBO may make a sub-optimal plan choice.

Advantages of bind variables:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

o When a bind variable, as opposed to a hardcoded value, is placed in a query,

the query text does not have to change each time the query is run with a new

value. This means that the statement does not need to be reparsed, can be

shared between sessions, and a separate copy does not have to be maintained

for each value used. The effect of this is to reduce the amount of space used

in the shared pool to store almost identical copies of SQL statements.

NB sharing also depends on other factors e.g.

o identical objects and object owners must be referenced

o bind variables must have the same datatype

o etc.

Disadvantages of bind variables:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

o When a SQL statement is optimized, the optimizer is unable to use the current

bind value. If it did then the plan chosen for that value may be excessively

poor for other values. Also the plan chosen would depend on which value was

supplied first. Because of this the optimizer must either choose the average

selectivity of that column (the density) or use defaults. This may result in

the generation of a sub-optimal plan.

The CBO is unable to determine accurate selectivities for range predicates

containing bind variables. The CBO uses column value data to adjust

selectivities. If it does not have any data values to do this with

(such as if bind variables are used) then this is not possible and assumptions

have to be made.

For queries with range predicates using bind variables, we have no way of

calculating the selectivity, so we use a hardcoded default value of 5%.

This is true irrespective of histograms as the CBO does not know the value of

the bind variable.

Selectivity for bind variables with 'like' predicates defaults to 25%.

Range Predicate Example:

~~~~~~~~~~~~~~~~~~~~~~~~

SELECT ename FROM emp WHERE empno > 9999;

SELECT ename FROM emp WHERE empno > :bind1;

Assuming the table has been analyzed, CBO knows the HIGH and LOW values for

empno and that the values are evenly distributed between these points.

For the first statement, CBO can determine the selectivity for the

where clause 'where empno >9999' - it uses the assumption that values

are evenly distributed to enable it to estimate the number of values between

the supplied value and the HIGH value.

For the second statement, it does not know what the value of :bind1 is,

so it is unable to use the same assumption and uses the default selectivity

of 5%.

Selectivity With Histograms

~~~~~~~~~~~~~~~~~~~~~~~~~~~

Histograms provide additional information about column selectivity for

columns whose distribution is non uniform. Histograms store information about

column data value ranges. Each range is stored in a single row and is often

called a 'bucket'. There are 2 different methods for storing histograms in

Oracle. If there are a small number of distinct

column values (i.e. less than the number of buckets), the column value

and the count of that value is stored. If not then a series of endpoints

are stored to enable more accurate selectivity to be determined.

The first method allows the accurate figures to be used. However with

inexact histograms the terms popular and non-popular value are introduced

and are used to help determine selectivity. A popular value is a value that

spans multiple endpoints whereas a non-popular value does not.

When building histograms, the information stored is interpreted differently

depending on whether the number of buckets requested is less than the number

of distinct values or equal to it. Specifically, ENDPOINT_NUMBER and

ENDPOINT_VALUE in dba/user/all_histograms have different meanings.

EXAMPLE

-------

Table TAB1

SQL> desc tab1

Name Null? Type

------------------------------- -------- ----

A NUMBER(6)

B NUMBER(6)

Column A contains unique values from 1 to 10000.

Column B contains 10 distinct values. The value '5' occurs 9991 times. Values

'1, 2, 3, 4, 9996, 9997, 9998, 9999, 10000' occur only once.

Test queries:

(1) select * from tab1 where b=5;

(2) select * from tab1 where b=3;

Both the above queries would use a FULL TABLE SCAN as there is no other

access method available.

Then we create an index on column B.

select lpad(INDEX_NAME,10), lpad(TABLE_NAME,10),

lpad(COLUMN_NAME,10), COLUMN_POSITION, COLUMN_LENGTH

from user_ind_columns

where table_name='TAB1'

SQL> /

LPAD(INDEX LPAD(TABLE LPAD(COLUM COLUMN_POSITION COLUMN_LENGTH

---------- ---------- ---------- --------------- -------------

TAB1_B TAB1 B 1 22

Now,

(1) select * from tab1 where b=5;

(2) select * from tab1 where b=3;

Both do an INDEX RANGE SCAN to get the ROWID to do a lookup in the table.

With an INDEX present, it would be preferable to do an INDEX RANGE SCAN for

query (2), but a FULL TABLE SCAN for query (1).

ANALYZING THE TABLE

-------------------

Next, analyze the table using compute statistics:

SQL> analyze table tab1 compute statistics;

From dba_tables:

NUM_ROWS BLOCKS EMPTY_BLOCKS AVG_SPACE CHAIN_CNT AVG_ROW_LEN

---------- ---------- ------------ ---------- ---------- -----------

10000 64 0 86 0 10

From dba_tab_columns:

NUM_DISTINCT LOW HIGH DENSITY NUM_NULLS NUM_BUCKETS LAST_ANALYZ SAMPLE_SIZE

------------ ---- ---- --------- ---------- ----------- ----------- -----------

10000 Full Full .0001 0 1 30-JUN-1999 10000

10 Full Full .1 0 1 30-JUN-1999 10000

For Oracle7, from user_histograms:

SQL> select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 10),

2 bucket_number, endpoint_value

3 from user_histograms

4 where table_name='TAB1';

TABLE_NAME COLUMN_NAME BUCKET_NUMBER ENDPOINT_VALUE

---------- ----------- ------------- --------------

TAB1 A 0 1

TAB1 A 1 10000

TAB1 B 0 1

TAB1 B 1 10000

For Oracle8, from user_tab_histograms:

SQL> select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 10),

2 bucket_number, endpoint_value

3 from user_tab_histograms

4 where table_name='TAB1';

Analyze has created 1 BUCKET for each column. So all values for the column

are in the same bucket. BUCKET_NUMBER identifies the bucket and

ENDPOINT_VALUE is the last column value in that bucket.

Now query (1) and (2) ; both do a FULL TABLE SCAN.

So, the fact that you have statistics about the table and columns does not

help the optimizer determine how many rows hold each value.

The reason it does a FULL TABLE SCAN is because there is a 1 BUCKET histogram

and any value selected for should be in that bucket.

CREATING HISTOGRAMS

-------------------

What you need now is to create histograms so the Optimizer knows how many

values occur for each column.

Query (1): select * from tab1 where b=5;

should do a FULL TABLE SCAN and

Query (2): select * from tab1 where b=3;

should do an INDEX RANGE SCAN

SQL> analyze table tab1 compute statistics for columns b size 10;

select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 5),

endpoint_number, endpoint_value

from user_histograms;

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE

---------- ----------- --------------- --------------

TAB1 B 1 1

TAB1 B 2 2

TAB1 B 3 3

TAB1 B 4 4

TAB1 B 9995 5

TAB1 B 9996 9996

TAB1 B 9997 9997

TAB1 B 9998 9998

TAB1 B 9999 9999

TAB1 B 10000 10000

So, now there are statistics on the table and on the columns.

You requested 10 buckets (size 10) and there are 10 distinct values.

The ENDPOINT_VALUE shows the column value and the ENDPOINT_NUMBER

shows the cumulative number of rows.

For example, ENDPOINT_VALUE 2 has ENDPOINT_NUMBER 2; the previous

ENDPOINT_NUMBER is 1, so the number of rows with value 2 is 2 - 1 = 1.

Another example is ENDPOINT_VALUE 5. Its ENDPOINT_NUMBER is 9995. The previous

bucket ENDPOINT_NUMBER is 4, so 9995 - 4 = 9991 rows containing the value 5.
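The arithmetic above (per-value row counts recovered from the cumulative ENDPOINT_NUMBERs) can be sketched as:

```python
# Histogram rows as (ENDPOINT_VALUE, cumulative ENDPOINT_NUMBER),
# taken from the user_histograms output above.
hist = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 9995), (9996, 9996),
        (9997, 9997), (9998, 9998), (9999, 9999), (10000, 10000)]

counts = {}
prev = 0
for value, endpoint in hist:
    # rows for this value = difference of consecutive cumulative counts
    counts[value] = endpoint - prev
    prev = endpoint

print(counts[5])  # 9991
print(counts[2])  # 1
```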

So, now QUERY (1) does in fact do a Full Table Scan.

SQL> select * from tab1 where b=5

SQL> /

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=10 Card=9991 Bytes=99910)

1 0 TABLE ACCESS (FULL) OF 'TAB1' (Cost=10 Card=9991 Bytes=99910)

And, QUERY (2) does do an Index Range Scan.

SQL> select * from tab1 where b=3

SQL> /

Execution Plan

----------------------------------------------------------

0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=6 Card=500 Bytes=5000)

1 0 TABLE ACCESS (BY ROWID) OF 'TAB1' (Cost=6 Card=500 Bytes=5000)

2 1 INDEX (RANGE SCAN) OF 'TAB1_B' (NON-UNIQUE)

This is fine if you have a low number of distinct values, but there can

be tables with a huge number of distinct values. You don't want to

create a bucket for each value. There would be too much overhead.

In this case you would request fewer buckets than distinct values.

CREATING HISTOGRAMS WITH LESS BUCKETS THAN DISTINCT VALUES

----------------------------------------------------------

SQL> analyze table tab1 compute statistics for columns b size 8;

SQL> select lpad(TABLE_NAME,10), lpad(COLUMN_NAME, 5),

2> endpoint_number, endpoint_value

3> from user_histograms;

LPAD(TABLE LPAD( ENDPOINT_NUMBER ENDPOINT_VALUE

---------- ----- --------------- --------------

TAB1 B 0 1

TAB1 B 7 5

TAB1 B 8 10000

Here, Oracle creates the requested number of buckets and places approximately

the same number of values into each. A frequently occurring value therefore

spans several buckets, so consecutive buckets share the same endpoint value.

The ENDPOINT_NUMBER is the actual bucket number and ENDPOINT_VALUE is

the endpoint value of the bucket determined by the column value.

From above, bucket 0 holds the low value for the column. Buckets 1 to 6 are

not displayed, in order to save space.

But we have bucket 1 with an endpoint of 5,

bucket 2 with an endpoint of 5,

bucket 3 with an endpoint of 5,

bucket 4 with an endpoint of 5,

bucket 5 with an endpoint of 5,

bucket 6 with an endpoint of 5,

bucket 7 with an endpoint of 5 AND

bucket 8 with an endpoint of 10000

So bucket 8 contains values between 5 and 10000.

All buckets contain the same number of values (which is why they are called

height-balanced histograms), except that the last bucket may have fewer

values than the other buckets.

If the data is uniform, you would not use histograms. However, if you request

the same number of buckets as distinct values, Oracle creates 1 bucket. If

you request fewer buckets, Oracle uses an algorithm to balance values into

each bucket, and any values that remain (which must be fewer than the number

stored in each height-balanced bucket) go into the last bucket.
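The balancing described above can be sketched as follows; this is an illustrative reconstruction of height-balanced endpoints, not Oracle's actual algorithm:

```python
def bucket_endpoints(values, num_buckets):
    """Sort the values, then record the value at each bucket boundary.
    Bucket 0 additionally records the column low value."""
    data = sorted(values)
    n = len(data)
    endpoints = [data[0]]  # bucket 0 holds the column low value
    for b in range(1, num_buckets + 1):
        endpoints.append(data[b * n // num_buckets - 1])
    return endpoints

# Column B from the example: 5 occurs 9991 times, nine other values once each
col_b = [1, 2, 3, 4] + [5] * 9991 + [9996, 9997, 9998, 9999, 10000]
print(bucket_endpoints(col_b, 8))  # [1, 5, 5, 5, 5, 5, 5, 5, 10000]
```

The popular value 5 spans buckets 1 through 7, matching the compressed user_histograms output shown above.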

STORING CHARACTER VALUES IN HISTOGRAMS

--------------------------------------

Character columns have some exceptional behaviour, inasmuch as histogram data

is stored only for the first 32 bytes of any string. Any predicates that

contain strings longer than 32 characters will not use histogram information,

and the selectivity will be 1 / DISTINCT.

Data in histogram endpoints is normalized to double precision floating point

arithmetic.

EXAMPLE

-------

SQL> select * from morgan;

A

----------

a

b

c

d

e

e

e

e

The table contains 5 distinct values. There is one occurrence of 'a', 'b', 'c'

and 'd' and 4 of 'e'.

Create a histogram with 5 buckets.

SQL> analyze table morgan compute statistics for columns a size 5;

Looking in user_histograms:

LPAD(TABLE LPAD( ENDPOINT_NUMBER ENDPOINT_VALUE

---------- ----- --------------- --------------

MORGAN A 1 5.0365E+35

MORGAN A 2 5.0885E+35

MORGAN A 3 5.1404E+35

MORGAN A 4 5.1923E+35

MORGAN A 8 5.2442E+35

So, ENDPOINT_VALUE 5.0365E+35 represents a

5.0885E+35 represents b

5.1404E+35 represents c

5.1923E+35 represents d

5.2442E+35 represents e

Then if you look at the cumulative values for ENDPOINT_NUMBER, the

corresponding ENDPOINT_VALUEs match the row counts ('e' has 8 - 4 = 4 rows).
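The endpoint values above can be approximated by treating the leading bytes of the string as a base-256 number; this is a sketch of the normalization, not Oracle's exact internal routine:

```python
def char_endpoint(s):
    # Pad/truncate to 15 bytes and treat them as a base-256 number,
    # so the first byte dominates: first_byte * 256^14 + ...
    b = s.encode('ascii')[:15].ljust(15, b'\x00')
    return sum(byte * 256 ** (14 - i) for i, byte in enumerate(b))

print('%.4E' % char_endpoint('a'))  # 5.0365E+35
print('%.4E' % char_endpoint('e'))  # 5.2442E+35
```

This reproduces the ENDPOINT_VALUEs shown for 'a' through 'e' above.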

Exact histograms

~~~~~~~~~~~~~~~~

c1 = '4076' count of value '4076' / Total Number of Rows

c1 > '4076' count of values > '4076' / Total Number of Rows

Inexact Histograms

~~~~~~~~~~~~~~~~~~

col = pop value # popular buckets / # buckets

col = non pop (Density)

col > value # buckets > value / # buckets

Rules for combining selectivity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let P1 and P2 be 2 distinct predicates of query Q

P1 AND P2

S(P1&P2) = S(P1) * S(P2)

P1 OR P2

S(P1|P2) = S(P1) + S(P2) -[S(P1) * S(P2)]
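The combination rules can be sketched as below, assuming P1 and P2 are independent predicates (the selectivity values are hypothetical):

```python
def sel_and(s1, s2):
    # S(P1 & P2) = S(P1) * S(P2)
    return s1 * s2

def sel_or(s1, s2):
    # S(P1 | P2) = S(P1) + S(P2) - S(P1) * S(P2)
    return s1 + s2 - s1 * s2

print(round(sel_and(0.1, 0.2), 4))  # 0.02
print(round(sel_or(0.1, 0.2), 4))   # 0.28
```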

Index Selectivity for concatenated indexes

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Starting with 10.2, when a concatenated index, with all its columns having

equality predicates, is used as an access path, the optimizer uses 1/NDK as

the selectivity (where NDK is the number of distinct keys in the index).

On 9.2.0.7 and 9.2.0.8 this may be enabled with event 38060 level 1.

On 10.1.0.4 and 10.1.0.5 this may be enabled with event 38059 level 1.

On 10.2 adjustments will be made to the selectivity to account for nulls

in the index keys. This also occurs on 10.1.0.5 (with event 38059).

Join cardinality

~~~~~~~~~~~~~~~~

Card(Pj) = Card(T1) * Card(T2) * Sel(Pj)
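The join cardinality formula can be sketched as below; the cardinalities and join selectivity are hypothetical illustration values:

```python
def join_cardinality(card_t1, card_t2, join_sel):
    # Card(Pj) = Card(T1) * Card(T2) * Sel(Pj), rounded to whole rows
    return round(card_t1 * card_t2 * join_sel)

# Hypothetical: 1000-row and 5000-row tables, join selectivity 0.0045
print(join_cardinality(1000, 5000, 0.0045))  # 22500
```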

From "ITPUB Blog": http://blog.itpub.net/16158219/viewspace-531687/ (please cite the source when reproducing).
