PostgreSQL Source Code Walkthrough (169) - Query #89 (Lexical definitions in PG: scan.l) #2

Original post. Category: PostgreSQL. Author: husthxd. Date: 2019-04-16 11:31:35

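Given an input SQL statement, how does PostgreSQL parse it and identify the statement type, the base tables, the columns, and so on? The next several installments work through this step by step.
This installment covers PostgreSQL's lexical definition file (the Flex input file), which lives at src/backend/parser/scan.l.
As mentioned before, a Flex input file consists of four parts: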


%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines

This installment covers the second part, Definitions.

I. Definitions

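The part between Declarations and Rules is Definitions. Here you can give names to regular expressions (a kind of "macro definition"), and those names can then be used in the Rules section. For example: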


newline            [\n\r]

With this definition in place, a rule can write {newline} wherever it would otherwise spell out [\n\r].
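As a rough illustration of how named definitions compose, the sketch below expands {name} references recursively and wraps each expansion in parentheses, which is how Flex treats a referenced definition. The expand helper and the DEFS table are hypothetical names made up for this example, and the output is still Flex pattern syntax rather than a Python regular expression.

```python
import re

# A few definitions copied from scan.l, kept in Flex syntax.
DEFS = {
    "space": r"[ \t\n\r\f]",
    "newline": r"[\n\r]",
    "non_newline": r"[^\n\r]",
    "comment": r'("--"{non_newline}*)',
    "whitespace": r"({space}+|{comment})",
}

# Matches {name} but not repetition counts such as {1,3}.
NAME_REF = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(pattern, defs):
    """Replace every {name} with its parenthesized, fully expanded definition."""
    def repl(match):
        return "(" + expand(defs[match.group(1)], defs) + ")"
    return NAME_REF.sub(repl, pattern)


if __name__ == "__main__":
    print(expand("{newline}", DEFS))
    print(expand("{whitespace}*", DEFS))
```

The recursion mirrors the way whitespace is built from space and comment, which in turn is built from non_newline.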


// option settings
%option reentrant
%option bison-bridge
%option bison-locations
%option 8bit
%option never-interactive
%option nodefault
%option noinput
%option nounput
%option noyywrap
%option noyyalloc
%option noyyrealloc
%option noyyfree
%option warn
%option prefix="core_yy"
/*
 * OK, here is a short description of lex/flex rules behavior.
 * The longest pattern which matches an input string is always chosen.
 * For equal-length patterns, the first occurring in the rules list is chosen.
 * INITIAL is the starting state, to which all non-conditional rules apply.
 * Exclusive states change parsing rules while the state is active.  When in
 * an exclusive state, only those rules defined for that state apply.
 *
 * We use exclusive states for quoted strings, extended comments,
 * and to eliminate parsing troubles for numeric strings.
 * Exclusive states:
 *  <xb> bit string literal
 *  <xc> extended C-style comments
 *  <xd> delimited identifiers (double-quoted identifiers)
 *  <xh> hexadecimal numeric string
 *  <xq> standard quoted strings
 *  <xe> extended quoted strings (support backslash escape sequences)
 *  <xdolq> $foo$ quoted strings
 *  <xui> quoted identifier with Unicode escapes
 *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
 *  <xus> quoted string with Unicode escapes
 *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
 *  <xeu> Unicode surrogate pair in extended quoted string
 *
 * Remember to add an <<EOF>> case whenever you add a new exclusive state!
 * The default one is probably not the right thing.
 */
// INITIAL is the start state; every other state must be declared with %s or %x
%x xb
%x xc
%x xd
%x xh
%x xe
%x xq
%x xdolq
%x xui
%x xuiend
%x xus
%x xusend
%x xeu
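The matching policy described in the comment above (the longest match wins, and among matches of equal length the rule listed first wins) can be sketched in Python. The rule names and patterns below are toy examples, not PostgreSQL's actual rules, and Python's re engine only stands in for Flex here because each toy pattern is a single greedy alternative.

```python
import re

# Toy rules in priority order (hypothetical names, not taken from scan.l).
RULES = [
    ("KEYWORD_SELECT", re.compile(r"select")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("INTEGER", re.compile(r"[0-9]+")),
    ("SPACE", re.compile(r"[ \t\n\r\f]+")),
]


def next_token(text, pos):
    """Pick the longest match at pos; on a tie, the earlier rule is kept."""
    best = None
    for name, rx in RULES:
        m = rx.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so an earlier rule wins when lengths are equal.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best


def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end = tok
        out.append((name, text[pos:end]))
        pos = end
    return out


if __name__ == "__main__":
    print(tokenize("select selected 42"))
```

In this toy setup "select" ties between the first two rules and goes to the earlier one, while "selected" goes to IDENT because that match is longer.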
/*
 * In order to make the world safe for Windows and Mac clients as well as
 * Unix ones, we accept either \n or \r as a newline.  A DOS-style \r\n
 * sequence will be seen as two successive newlines, but that doesn't cause
 * any problems.  Comments that start with -- and extend to the next
 * newline are treated as equivalent to a single whitespace character.
 *
 * NOTE a fine point: if there is no newline following --, we will absorb
 * everything to the end of the input as a comment.  This is correct.  Older
 * versions of Postgres failed to recognize -- as a comment if the input
 * did not end with a newline.
 *
 * XXX perhaps \f (formfeed) should be treated as a newline as well?
 *
 * XXX if you change the set of whitespace characters, fix scanner_isspace()
 * to agree, and see also the plpgsql lexer.
 */
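// \t --> tab, \n --> line feed, \r --> carriage return, \f --> form feed
space            [ \t\n\r\f]        // space / tab / line feed / carriage return / form feed
horiz_space        [ \t\f]            // space / tab / form feed
newline            [\n\r]            // line feed / carriage return
non_newline        [^\n\r]            // any character except line feed and carriage return
// single-line comment
comment            ("--"{non_newline}*)
// whitespace: one or more space characters, or a comment
whitespace        ({space}+|{comment})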
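To make the newline and comment behaviour concrete, here is a small Python approximation of the whitespace and newline definitions. The helper names leading_whitespace and count_newlines are invented for this sketch. It shows that a -- comment with no trailing newline swallows the rest of the input, and that a DOS-style \r\n counts as two newlines.

```python
import re

SPACE = r"[ \t\n\r\f]"
COMMENT = r"(?:--[^\n\r]*)"          # "--" followed by non-newline characters
WHITESPACE = re.compile(rf"(?:{SPACE}+|{COMMENT})")
NEWLINE = re.compile(r"[\n\r]")


def leading_whitespace(text):
    """Return the run matched by the whitespace pattern at the start of text."""
    m = WHITESPACE.match(text)
    return m.group(0) if m else ""


def count_newlines(text):
    """Count newline characters the way the newline definition sees them."""
    return len(NEWLINE.findall(text))


if __name__ == "__main__":
    print(repr(leading_whitespace("-- no newline at end")))
    print(repr(leading_whitespace("-- c\nSELECT 1")))
    print(count_newlines("a\r\nb"))
```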
/*
 * SQL requires at least one newline in the whitespace separating
 * string literals that are to be concatenated.  Silly, but who are we
 * to argue?  Note that {whitespace_with_newline} should not have * after
 * it, whereas {whitespace} should generally have a * after it...
 */
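// special whitespace: one or more space characters, or a comment followed by a newline
special_whitespace        ({space}+|{comment}{newline})
// horizontal whitespace: a single horizontal space character, or a comment
horiz_whitespace        ({horiz_space}|{comment})
// zero or more horiz_whitespace, then a newline, then zero or more special_whitespace
whitespace_with_newline    ({horiz_whitespace}*{newline}{special_whitespace}*)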
/*
 * To ensure that {quotecontinue} can be scanned without having to back up
 * if the full pattern isn't matched, we include trailing whitespace in
 * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
 * except for {quote} followed by whitespace and just one "-" (not two,
 * which would start a {comment}).  To cover that we have {quotefail}.
 * The actions for {quotestop} and {quotefail} must throw back characters
 * beyond the quote proper.
 */
quote            '
quotestop        {quote}{whitespace}*
quotecontinue    {quote}{whitespace_with_newline}{quote}
quotefail        {quote}{whitespace}*"-"
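The rule that adjacent string literals are concatenated only when the separating whitespace contains a newline can be checked with a Python transcription of quotecontinue. This is an approximation under the assumption that Python's backtracking engine accepts the same inputs as Flex for these short strings; continues_literal is a hypothetical helper that takes the text starting at the closing quote of the first literal.

```python
import re

SPACE = r"[ \t\n\r\f]"
HORIZ_SPACE = r"[ \t\f]"
NEWLINE = r"[\n\r]"
COMMENT = r"(?:--[^\n\r]*)"
SPECIAL_WHITESPACE = rf"(?:{SPACE}+|{COMMENT}{NEWLINE})"
HORIZ_WHITESPACE = rf"(?:{HORIZ_SPACE}|{COMMENT})"
WHITESPACE_WITH_NEWLINE = (
    rf"(?:{HORIZ_WHITESPACE}*{NEWLINE}{SPECIAL_WHITESPACE}*)"
)
# quotecontinue: closing quote, whitespace containing a newline, opening quote
QUOTECONTINUE = re.compile(rf"'{WHITESPACE_WITH_NEWLINE}'")


def continues_literal(text_from_closing_quote):
    """True if the next literal would be concatenated with the current one."""
    return QUOTECONTINUE.match(text_from_closing_quote) is not None


if __name__ == "__main__":
    print(continues_literal("'\n'bar'"))              # newline between literals
    print(continues_literal("' 'bar'"))               # only a space
    print(continues_literal("'  -- note\n   'bar'"))  # comment, then newline
```

A trailing -- comment is allowed before the newline because horiz_whitespace includes comment.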
//<xb>
/* Bit string 
 * It is tempting to scan the string for only those characters
 * which are allowed. However, this leads to silently swallowed
 * characters if illegal characters are included in the string.
 * For example, if xbinside is [01] then B'ABCD' is interpreted
 * as a zero-length string, and the ABCD' is lost!
 * Better to pass the string forward and let the input routines
 * validate the contents.
 */
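xbstart            [bB]{quote}    // start: b or B followed by a single quote
xbinside        [^']*        // body: zero or more characters other than a single quote
/* Hexadecimal number */
//<xh> hexadecimal string
xhstart            [xX]{quote}    // start: x or X followed by a single quote
xhinside        [^']*        // body: zero or more characters other than a single quote
/* National character */
// national character literal (N'...'); note that the state list above has no <xn> state
xnstart            [nN]{quote}    // start: n or N followed by a single quote
/* Quoted string that allows backslash escapes */
//<xe>
xestart            [eE]{quote}    // start: e or E followed by a single quote
xeinside        [^\\']+        // body: one or more characters other than backslash and single quote
xeescape        [\\][^0-7]    // escape: backslash followed by any character other than 0-7
xeoctesc        [\\][0-7]{1,3}    // octal escape: backslash followed by one to three digits 0-7
xehexesc        [\\]x[0-9A-Fa-f]{1,2}    // hex escape: backslash, x, then one or two hex digits
// Unicode escape: backslash, then u and exactly 4 hex digits, or U and exactly 8 hex digits
xeunicode        [\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})    
// incomplete Unicode escape: too few hex digits after \u or \U
xeunicodefail    [\\](u[0-9A-Fa-f]{0,3}|U[0-9A-Fa-f]{0,7})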
/* Extended quote
 * xqdouble implements embedded quote, ''''
 */
xqstart            {quote}
xqdouble        {quote}{quote}
xqinside        [^']+
/* $foo$ style quotes ("dollar quoting")
 * The quoted string starts with $foo$ where "foo" is an optional string
 * in the form of an identifier, except that it may not contain "$",
 * and extends to the first occurrence of an identical string.
 * There is *no* processing of the quoted text.
 *
 * {dolqfailed} is an error rule to avoid scanner backup when {dolqdelim}
 * fails to match its trailing "$".
 */
//<xdolq>
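dolq_start        [A-Za-z\200-\377_]        // first character: ASCII letter, byte 0x80-0xFF (octal 200-377), or underscore
dolq_cont        [A-Za-z\200-\377_0-9]    // continuation: dolq_start plus digits
dolqdelim        \$({dolq_start}{dolq_cont}*)?\$    // delimiter $tag$; the tag is optional
dolqfailed        \${dolq_start}{dolq_cont}*    // failure: starts with $ but lacks the closing $
dolqinside        [^$]+                    // body: one or more characters other than $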
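The dollar-quoting description above (an optional tag between two $ characters, with the body running to the first identical delimiter and no processing of the text in between) can be sketched as follows. The function scan_dollar_quoted is a hypothetical helper, and using a plain substring search for the closing delimiter is a simplification of what the scanner's rules do. Byte strings are used so that the \200-\377 range maps directly onto \x80-\xff.

```python
import re

# dolqdelim: "$", an optional identifier-like tag, then "$"
DOLQDELIM = re.compile(
    rb"\$(?:[A-Za-z\x80-\xff_][A-Za-z\x80-\xff_0-9]*)?\$"
)


def scan_dollar_quoted(src, pos=0):
    """Return (delimiter, body, end_offset) or None if no delimiter starts at pos."""
    m = DOLQDELIM.match(src, pos)
    if m is None:
        return None
    delim = m.group(0)
    close = src.find(delim, m.end())  # first occurrence of the same delimiter
    if close < 0:
        raise ValueError("unterminated dollar-quoted string")
    return delim, src[m.end():close], close + len(delim)


if __name__ == "__main__":
    print(scan_dollar_quoted(b"$fn$ it's $1 $$ raw $fn$ rest"))
    print(scan_dollar_quoted(b"$$body$$"))
    print(scan_dollar_quoted(b"$1"))  # a parameter, not a delimiter
```

Quotes, other dollar signs, and even a different delimiter such as $$ inside the body are passed through untouched.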
/* Double quote
 * Allows embedded spaces and other special characters into identifiers.
 */
//<xd>
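dquote            \"                // double quote
xdstart            {dquote}        // start: a double quote
xdstop            {dquote}        // stop: a double quote
xddouble        {dquote}{dquote}// two adjacent double quotes (an embedded quote)
xdinside        [^"]+            // body: one or more characters other than a double quote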
/* Unicode escapes */
// UESCAPE clause: the keyword UESCAPE (case-insensitive), optional whitespace, then one quoted character
uescape            [uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
/* error rule to avoid backup */
uescapefail        [uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
/* Quoted identifier with Unicode escapes */
//<xui> (double-quoted)
xuistart        [uU]&{dquote}    // start: u or U, then &, then a double quote
/* Quoted string with Unicode escapes */
//<xus> (single-quoted)
xusstart        [uU]&{quote}    // start: u or U, then &, then a single quote
/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
xustop1        {uescapefail}?
xustop2        {uescape}
/* error rule to avoid backup */
xufailed        [uU]&
/* C-style comments
 *
 * The "extended comment" syntax closely resembles allowable operator syntax.
 * The tricky part here is to get lex to recognize a string starting with
 * slash-star as a comment, when interpreting it as an operator would produce
 * a longer match --- remember lex will prefer a longer match!  Also, if we
 * have something like plus-slash-star, lex will think this is a 3-character
 * operator whereas we want to see it as a + operator and a comment start.
 * The solution is two-fold:
 * 1. append {op_chars}* to xcstart so that it matches as much text as
 *    {operator} would. Then the tie-breaker (first matching rule of same
 *    length) ensures xcstart wins.  We put back the extra stuff with yyless()
 *    in case it contains a star-slash that should terminate the comment.
 * 2. In the operator rule, check for slash-star within the operator, and
 *    if found throw it back with yyless().  This handles the plus-slash-star
 *    problem.
 * Dash-dash comments have similar interactions with the operator rule.
 */
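xcstart            \/\*{op_chars}*            // start: slash-star followed by zero or more operator characters
xcstop            \*+\/                    // stop: one or more * followed by /
xcinside        [^*/]+                    // body: one or more characters other than * and /
digit            [0-9]                    // digit: 0-9
ident_start        [A-Za-z\200-\377_]        // identifier start: ASCII letter, byte 0x80-0xFF, or underscore
ident_cont        [A-Za-z\200-\377_0-9\$]    // identifier continuation: ident_start plus digits and $
identifier        {ident_start}{ident_cont}*    // identifier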
/* Assorted special-case operators and operator-like tokens */
typecast        "::"    // type-cast operator
dot_dot            \.\.    // dot-dot token
colon_equals    ":="    // assignment operator
/*
 * These operator-like tokens (unlike the above ones) also match the {operator}
 * rule, which means that they might be overridden by a longer match if they
 * are followed by a comment start or a + or - character. Accordingly, if you
 * add to this list, you must also add corresponding code to the {operator}
 * block to return the correct token in such cases. (This is not needed in
 * psqlscan.l since the token value is ignored there.)
 */
equals_greater    "=>"    // equals-greater (=>)
less_equals        "<="    // less than or equal to
greater_equals    ">="    // greater than or equal to
less_greater    "<>"    // not equal (SQL-standard spelling)
not_equals        "!="    // not equal
/*
 * "self" is the set of chars that should be returned as single-character
 * tokens.  "op_chars" is the set of chars that can make up "Op" tokens,
 * which can be one or more characters long (but if a single-char token
 * appears in the "self" set, it is not to be returned as an Op).  Note
 * that the sets overlap, but each has some chars that are not in the other.
 *
 * If you change either set, adjust the character lists appearing in the
 * rule for "operator"!
 */
self            [,()\[\].;\:\+\-\*\/\%\^\<\>\=]
op_chars        [\~\!\@\#\^\&\|\`\?\+\-\*\/\%\<\>\=]
operator        {op_chars}+
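The two-part fix for comment starts described in the C-style comments note earlier in the listing can be approximated in Python. This is only a sketch: scan is a hypothetical function, string slicing stands in for yyless(), and the real operator rule does more than this (for example, it also deals with embedded -- comment starts).

```python
import re

OP_CHARS = r"[~!@#^&|`?+\-*/%<>=]"
XCSTART = re.compile(rf"/\*{OP_CHARS}*")   # slash-star plus trailing op chars
OPERATOR = re.compile(rf"{OP_CHARS}+")


def scan(text, pos=0):
    """Return (kind, consumed_text) for a comment start or operator at pos."""
    xc = XCSTART.match(text, pos)
    op = OPERATOR.match(text, pos)
    if xc is not None:
        # xcstart matches exactly as much text as operator would and is
        # listed first, so it wins the tie. Everything after "/*" is put
        # back, in case it contains the "*/" that ends the comment.
        return ("xcstart", "/*")
    if op is None:
        return None
    matched = op.group(0)
    cut = matched.find("/*")
    if cut > 0:
        # A comment start inside the operator: keep only the part before it.
        matched = matched[:cut]
    return ("operator", matched)


if __name__ == "__main__":
    print(scan("/*=*/ x"))
    print(scan("+/* c */"))
    print(scan("<= 1"))
```

The first call shows why the extra characters must be put back: the text after the slash-star already contains the terminating star-slash.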
/* we no longer allow unary minus in numbers.
 * instead we pass it separately to parser. there it gets
 * coerced via doNegate() -- Leon aug 20 1999
 *
 * {decimalfail} is used because we would like "1..10" to lex as 1, dot_dot, 10.
 *
 * {realfail1} and {realfail2} are added to prevent the need for scanner
 * backup when the {real} rule fails to match completely.
 */
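integer            {digit}+    // integer
decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))    // decimal number
decimalfail        {digit}+\.\.    // digits followed by two dots, as in 1..10
real            ({integer}|{decimal})[Ee][-+]?{digit}+    // number with an exponent
realfail1        ({integer}|{decimal})[Ee]        // incomplete exponent: nothing after E
realfail2        ({integer}|{decimal})[Ee][-+]    // incomplete exponent: sign but no digits
param            \${integer}    // positional parameter such as $1
other            .            // any other single character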
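Why decimalfail exists is easiest to see by lexing "1..10" with and without it. The sketch below reuses the longest-match policy with a subset of the numeric definitions. It assumes that the decimalfail action throws back the two dots and reports the remaining digits as an integer; the tokenize function and its use_decimalfail switch are hypothetical, added only for the comparison.

```python
import re

INTEGER = r"[0-9]+"
DECIMAL = r"(?:[0-9]*\.[0-9]+|[0-9]+\.[0-9]*)"
RULES = [
    ("integer", re.compile(INTEGER)),
    ("decimal", re.compile(DECIMAL)),
    ("decimalfail", re.compile(r"[0-9]+\.\.")),
    ("real", re.compile(rf"(?:{INTEGER}|{DECIMAL})[Ee][-+]?[0-9]+")),
    ("dot_dot", re.compile(r"\.\.")),
]


def tokenize(text, use_decimalfail=True):
    pos, out = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in RULES:
            if name == "decimalfail" and not use_decimalfail:
                continue
            m = rx.match(text, pos)
            if m and m.end() > best_end:      # longest match, first rule on ties
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name == "decimalfail":
            # Assumed action: give back the two dots, keep the digits.
            best_name, best_end = "integer", best_end - 2
        out.append((best_name, text[pos:best_end]))
        pos = best_end
    return out


if __name__ == "__main__":
    print(tokenize("1..10"))
    print(tokenize("1..10", use_decimalfail=False))
    print(tokenize("1.5e3"))
```

Without decimalfail, the decimal rule greedily takes "1." and the remainder lexes as ".10", so the dot_dot token is never seen.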
/*
 * Dollar quoted strings are totally opaque, and no escaping is done on them.
 * Other quoted strings must allow some special characters such as single-quote
 *  and newline.
 * Embedded single-quotes are implemented both in the SQL standard
 *  style of two adjacent single quotes "''" and in the Postgres/Java style
 *  of escaped-quote "\'".
 * Other embedded escaped characters are matched explicitly and the leading
 *  backslash is dropped from the string.
 * Note that xcstart must appear before operator, as explained above!
 *  Also whitespace (comment) must appear before operator.
 */

II. References

Flex&Bison

Source: ITPUB Blog, http://blog.itpub.net/6906/viewspace-2641503/. Please credit the source when reposting; otherwise legal action may be pursued.
