谈SQL QUERY涉及到的predicate

来源:kinfinger博客  发布日期:2012-09-13 09:02:03

本文主要从predicate的概念,分类,以及predicate度访问路径的影响来进行介绍,主要的目的是通过介绍让你对你的sql有一个大体的优化方向,顺便介绍了一下DB2的 runstat utility的部分语法。需要注意的是catalog中的统计信息表对访问路径的影响是很大的,虽然我们不明白具体的机制,但我们可以给DB2优化器提供足够的多信息,让优化器进行选择,达到最优的访问路径

本文涉及的内容主要涉及三张catalog表

SYSCOLUMNS(cardinality)

SYSCOLDIST(frequency)

SYSCOLDIST_HIST(histgram)

Predicate

Concept: A predicate specifies a condition that is true, false, or unknown about a given row

or group.

对于一个给定的行或是行组,predicate可以用来规定条件的真假或是未知,

它主要是用来描述数据的属性,它主要出现在WHERE,HAVING,ON clause

Predicate的一般形式是:

COLUMN_NAME  OPERATOR  CONDITION_VALUE

Predicate的结果只能是真,假,或是未知,这样对于一个特定的行进行匹配的时候,就可以确定该行是否满足该predicate,也就是说predicate是用来限定(qualified)行的

Predicate依据operator进行分类,主要包含: 

Subquery predicate : any predicate that includes another select statement

For example :

C1 in (select c10  from t_c where   xxxx )

Equal predicate:any predicate that is not a subquery predicate and has an equal operator and no not operator,also included  are predicates of the form

C1 is null

C is not distinct from

Range predicate:not a subquery predicate and contains one of the following operaotrs:

>

<=

<

<=

Like

Between and

IN predicate :  a predicate of the form :

Column in (list of values)

按照predicate 的结果对where条件的影响有可以分为:

Boolean term predicate:

A Boolean term predicate(BTP)  is a simple or compound predicate that, when it is evaluated false for a particular row, makes the entire where clause false for that particular row.

所谓BTP就是一个简单或是符合predicate,针对某一行,当该predicate为假时,整个where条件在那个特定的行业为假,该行也就不满足条件(not-qualified)

Non_boolean term  predicate

下面举例说明:

假设P1,P2,P3都是simple predicate

SELECT * FROM T1 WHERE P1 AND (P2 OR P3)

针对上面的SQL :

P1  simplge BTP

P2  SIMPLE NON-BTP

P3 SIMPLE NON-BTP

P2 OR P3   COMPUND BTP

P1 AND (P2 OR P3) COMPUND BTP

针对predicate,还需要解释一点就是having中的predicate是不用来选择access  path的

Not predicate: any predicate that is a not subquery predicate and contains a not operator.

一个sql中的predicate的处理可以包含两个stage,因此predicate分为两种

Stage1  predicate  --sargable(search argument-able )可用来搜索的参数吧,自己理解

Stage 2 predicate    ----residual (剩余的)

Predicate 是按一下顺序进行求值处理的:stage

1 首先应用可索引predicate,

2 stage 1 predicate中没有选作matching predicate的predicate,这样的predicate引用了index column,这个过程也被成为index screening。

3 剩余其它的stage 1 predicate

After  data page access,stage 1 prediate are applied to the data

4  stage 2 predicate are applied  on the returned data rows

在每一阶段,predicate是按如下的数据类型进行处理的:

1 equal predicate ( incluce in predicate that contain only one  item and between predicates that contain the save value twice

2 range predicate and predicate of the forms  : column is not null,多说一句,range predicate可以终止索引predicate的匹配(前面的博文已涉及)

3 all other  predicate type

下面我们涉及predicate的重点内容:predicate的filter Factors(FF)是如何影响访问路径,只有明白它,你才能写出更加高效的SQL QUERY.

所谓filter Factors,是一个介于0到1之间的值,它表明了predicate为真时,表中的行所占的比例,这些行被这些predicate所限定。DB2使用FF来估计predicate所限定的行的数目,通常对于一个simple predicate,FF是下面三个变量的函数:(参考predicate的构成)

Constant value in  the predicate

The operator in the predicte

Statistic on the column in the predicate
www.it165.net

上述三个变量,前两个是你可以控制的,第三个变量的统计信息是保存在catalog中的,而catalog中的统计信息你可以通过运行带有update 的ruanstat utility 来进行实时更新。下面给出一些简单的predicate的FF的default value

Predicate         FF

Col = constant        1/25

Col <> constant            1 – (1/25)

Col IS NULL               1/25

Col IS NOT DISTINCT FROM              1/25

Col IS DISTINCT FROM              1 – (1/25)

Col IN (constant list)             (number of constants)/25

Col Op constant            1/3

Col LIKE constant                 1/10

其实说白了,FF的值就是某一值出现在表中行的一个统计特性,即出现的概率值,默认某一个值出现的概率是服从均匀分布(uniform distribution).对于其他比较复杂的predicate,可以使用插值公式来进行计算(interpolation formulation)。

当你收集足够多的统计信息时,DB2优化器进行访问路径选择时,估计的也就越准确,从而可以选出最优的访问路径。

通常DB2收集两种统计信息

Frequency  : the percentage of rows in the table that that contains a value for a column of set of columns

Cardinality 

The number of distinct value in a set of  columns

这两个统计信息侧重点不同,frequency描述某一个值在表中所占的百分比 ,而cardinality描述了针对一列或是多列所包含的distinct值的多少,二者都能影响access path的选择。

其中frequency的统计信息放在SYSIBM.SYSCOLDIST 表中,需要关注的列:

COLVALUE            FREQUENCYF

该值可以通过runstat      utility来收集,在这里顺带解释一下runstat的语法,

索引包含的列数可以能是一列或是多列,因此针对列统计信息的收集的语法包含两种

COLUMN

ALL

Specifies that statistics are to be gathered on all columns in the table.

COLUMN_NAME

Specifies the columns on which statistics are to be gathered.

COLUMN-GUOUP SPEC

Indicates that the specified set of columns are to be treated as a group. This

option enables RUNSTATS to collect a cardinality value on the specified

column group. 针对该column group收集distinct  value

FREQVAL

Indicates, when specified with the COLGROUP option, that frequency statistics

are also to be gathered for the specified group of columns. (COLGROUP

indicates that cardinality statistics are to be gathered.) One group of statistics is

gathered for each column.

该参数同colgroup 一同使用,用来收集特定colgroup 的统计信息,其中colgroup表明了cardinality信息会被收集),colgroup中的每一列的统计信息都会被收集。

COUNT integer

Indicates the number of frequently occurring values to be collected from

the specified column group. For example, COUNT 20 means that DB2 collects 20 frequently occurring values from the column group. You must

specify a value for integer; no default value is assumed.

Be careful when specifying a high value for COUNT. Specifying a value of

1000 or more can increase the prepare time for some SQL statements.

MOST

Indicates that the utility is to collect the most frequently occurring values

for the specified set of columns when COLGROUP is specified.

BOTH

Indicates that the utility is to collect the most and the least frequently

occurring values for the specified set of columns when COLGROUP is

specified. If COUNT is n, the utility collects the n least frequently

occurring values and the n most frequently occurring values.

LEAST

Indicates that the utility is to collect the least frequently occurring values

for the specified set of columns when COLGROUP is specified.

总而言之,freval选项在启用colgroup后,用来收集colgroup中某个值出现次数前几(COUNTER)最多(MOST)或是最少(least),或者都有的值(both include most and least)值的统计信息的统计信息,同时还会收集colgroup每一列的cardinality的统计信息

DB2在选用统计信息的时候,如果针对某一列或是colgroup,你收集的前十个值,那么剩下的值会按均匀分布的概率进行估算。

通常针对col frequency统计信息的收集都是针对single col或是single multi-column ,采集前十或是某些特殊的统计信息,如

Example frequency statistics for values on the YRS_OF_EXPERIENCE column in

an EMPLOYEE table

VALUE   FREQUENCY

2      10%

25    15%

26    15%

27     7%

12    0.02%

13    0.01%

40     0.0001%

41     0.00001%

上述的统计信息收集了出现最多前五个,最少前五个的统计信息,但是面对如下查询

Equality predicate with unmatched value:

SELECT EMPID FROM EMPLOYEE T

WHERE T.YRS_OF_EXPERIENCE = 6;

v Range predicate:

SELECT T.EMPID FROM EMPLOYEE T

WHERE T.YRS_OF_EXPERIENCE BETWEEN 5 AND 10;

v Non-local predicate:

SELECT T1.EMPID FROM EMPLOYEE T1, OPENJOBS T2

WHERE T1.SPECIALTY = T2.AREA AND T1.YRS_OF_EXPERIENCE > T2.YRS_OF_EXPERIENCE;

Frequency对FF的估算并不能起作用,如果没有相关的统计信息,DB2只能使用interpolation来计算,为了解决这种问题,你还可以使用runstat utility来收集直方图统计信息以帮助解决FF的计算。在直方图统计信息中,DB2将某一个值的区间进行等分,区间的数目取决于你histogram option 中的NUMQUANTILES,每一个区间都有一个quantilenno,DB2针对每一个区间收集统计信息,下表是numquantiles=7 时,上一个表的直方图统计信息

QUANTILENO LOWVALUE HIGHVALUE CARDF FREQUENCYF

1       0     3    4       14%

2       4    15               8     14%

3      18    24    7                12%

4      25     25   1     15%

5      26    26     1    15%

6      27     30   4    16%

7      35     40   6     14%

然后DB2优化器会根据特定的规则,以及你所收集的到的统计信息来计算查询的成本,然后选择一个成本最低的计划来执行,并将执行计划放入plan_table 如果你在bind的时候使用explain(yes) 选项

关于predicate以及FF的计算就先说这么多么,以后会慢慢补充

PS:

INDEXALBE  PREDICATE

AN INDEXABLE PREDICATE CAN MATCH INDEX ENTEIES,

PREDICATES  THAT CAN NOT MATCH INDEX ENTRIES ARE SAID TO BE NON-INDEXABLE PREDICATE
 

Tag标签: SQL   QUERY   predicate  
  • 专题推荐

About IT165 - 广告服务 - 隐私声明 - 版权申明 - 免责条款 - 网站地图 - 网友投稿 - 联系方式
本站内容来自于互联网,仅供用于网络技术学习,学习中请遵循相关法律法规