PostgreSQL Source Code Walkthrough (171) - Query #91 (Lexical definitions in PG: scan.l) #4

Original | PostgreSQL | Author: husthxd | Date: 2019-04-18 14:31:59

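Given an SQL statement, how does PostgreSQL parse it and identify the statement type, the base tables, the columns, and so on? The next several sections work through this step by step.
This section covers PostgreSQL's lexical definition file (the Flex input file), src/backend/parser/scan.l.
As described earlier, a Flex input file consists of four parts: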


%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines

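This section covers the fourth part, User subroutines.

1. User subroutines

The user-defined routines follow the rules. The routines defined in scan.l mainly support scanning of the input SQL statement (error reporting, literal buffering, escape decoding) and perform scanner initialization and cleanup.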


/* LCOV_EXCL_STOP */
/*
 * Arrange access to yyextra for subroutines of the main yylex() function.
 * We expect each subroutine to have a yyscanner parameter.  Rather than
 * use the yyget_xxx functions, which might or might not get inlined by the
 * compiler, we cheat just a bit and cast yyscanner to the right type.
 */
#undef yyextra
#define yyextra  (((struct yyguts_t *) yyscanner)->yyextra_r)
/* Likewise for a couple of other things we need. */
/* Namely yylloc and yyleng, redirected the same way as yyextra. */
#undef yylloc
#define yylloc    (((struct yyguts_t *) yyscanner)->yylloc_r)
#undef yyleng
#define yyleng    (((struct yyguts_t *) yyscanner)->yyleng_r)
/*
 * scanner_errposition
 *        Report a lexer or grammar error cursor position, if possible.
 *
 * This is expected to be used within an ereport() call.  The return value
 * is a dummy (always 0, in fact).
 *
 * Note that this can only be used for messages emitted during raw parsing
 * (essentially, scan.l and gram.y), since it requires the yyscanner struct
 * to still be available.
 */
int
scanner_errposition(int location, core_yyscan_t yyscanner)
{
    int            pos;
    if (location < 0)
        return 0;                /* no-op if location is unknown */
    /* Convert byte offset to character number */
    pos = pg_mbstrlen_with_len(yyextra->scanbuf, location) + 1;
    /* And pass it to the ereport mechanism */
    return errposition(pos);
}
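
The byte-offset-to-character conversion performed by scanner_errposition() can be illustrated outside the backend. The following is only a minimal sketch that assumes the query text is valid UTF-8; utf8_chars_in_prefix() and cursor_position() are hypothetical names standing in for pg_mbstrlen_with_len() and the surrounding logic, not PostgreSQL functions.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_mbstrlen_with_len(), assuming valid UTF-8:
 * count the characters that start within the first `limit` bytes of `s`.
 * A byte starts a character unless it is a continuation byte (10xxxxxx).
 */
static int
utf8_chars_in_prefix(const char *s, int limit)
{
    int         nchars = 0;

    for (int i = 0; i < limit && s[i] != '\0'; i++)
    {
        if (((unsigned char) s[i] & 0xC0) != 0x80)
            nchars++;
    }
    return nchars;
}

/* Convert a 0-based byte offset into a 1-based character position. */
static int
cursor_position(const char *query, int byte_offset)
{
    if (byte_offset < 0)
        return 0;               /* unknown location, mirrors the no-op case */
    return utf8_chars_in_prefix(query, byte_offset) + 1;
}
```

With a two-byte character earlier in the query, the reported position is one less than byte offset + 1, which is why the conversion is needed before the value is handed to errposition().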
/*
 * scanner_yyerror
 *        Report a lexer or grammar error.
 *
 * The message's cursor position is whatever YYLLOC was last set to,
 * ie, the start of the current token if called within yylex(), or the
 * most recently lexed token if called from the grammar.
 * This is OK for syntax error messages from the Bison parser, because Bison
 * parsers report error as soon as the first unparsable token is reached.
 * Beware of using yyerror for other purposes, as the cursor position might
 * be misleading!
 */
void
scanner_yyerror(const char *message, core_yyscan_t yyscanner)
{
    const char *loc = yyextra->scanbuf + *yylloc;
    if (*loc == YY_END_OF_BUFFER_CHAR)
    {
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
        /* translator: %s is typically the translation of "syntax error" */
                 errmsg("%s at end of input", _(message)),
                 lexer_errposition()));
    }
    else
    {
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
        /* translator: first %s is typically the translation of "syntax error" */
                 errmsg("%s at or near \"%s\"", _(message), loc),
                 lexer_errposition()));
    }
}
/*
 * Called before any actual parsing is done
 * Initializes the scanner state for the given query string.
 */
core_yyscan_t
scanner_init(const char *str,
             core_yy_extra_type *yyext,
             const ScanKeyword *keywords,
             int num_keywords)
{
    Size        slen = strlen(str);
    yyscan_t    scanner;
    if (yylex_init(&scanner) != 0)
        elog(ERROR, "yylex_init() failed: %m");
    core_yyset_extra(yyext, scanner);
    yyext->keywords = keywords;
    yyext->num_keywords = num_keywords;
    yyext->backslash_quote = backslash_quote;
    yyext->escape_string_warning = escape_string_warning;
    yyext->standard_conforming_strings = standard_conforming_strings;
    /*
     * Make a scan buffer with special termination needed by flex.
     */
    yyext->scanbuf = (char *) palloc(slen + 2);
    yyext->scanbuflen = slen;
    memcpy(yyext->scanbuf, str, slen);
    yyext->scanbuf[slen] = yyext->scanbuf[slen + 1] = YY_END_OF_BUFFER_CHAR;
    yy_scan_buffer(yyext->scanbuf, slen + 2, scanner);
    /* initialize literal buffer to a reasonable but expansible size */
    yyext->literalalloc = 1024;
    yyext->literalbuf = (char *) palloc(yyext->literalalloc);
    yyext->literallen = 0;
    return scanner;
}
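
The "special termination needed by flex" in scanner_init() refers to the requirement of yy_scan_buffer() that the last two bytes of the buffer are YY_END_OF_BUFFER_CHAR (ASCII NUL). A minimal sketch of just that buffer layout follows; make_scan_buffer() is a hypothetical helper, and malloc() replaces palloc() so the code compiles outside the backend.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy `str` into a new buffer of strlen(str) + 2 bytes whose last two
 * bytes are NUL, the layout expected by flex's yy_scan_buffer().
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *
make_scan_buffer(const char *str, size_t *buflen)
{
    size_t      slen = strlen(str);
    char       *buf = malloc(slen + 2);

    if (buf == NULL)
        return NULL;
    memcpy(buf, str, slen);
    buf[slen] = '\0';           /* first end-of-buffer marker */
    buf[slen + 1] = '\0';       /* second end-of-buffer marker */
    *buflen = slen + 2;         /* the size passed to yy_scan_buffer() */
    return buf;
}
```

Note that scanner_init() stores slen (without the two terminators) in scanbuflen, but passes slen + 2 to yy_scan_buffer().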
/*
 * Called after parsing is done to clean up after scanner_init()
 */
void
scanner_finish(core_yyscan_t yyscanner)
{
    /*
     * We don't bother to call yylex_destroy(), because all it would do is
     * pfree a small amount of control storage.  It's cheaper to leak the
     * storage until the parsing context is destroyed.  The amount of space
     * involved is usually negligible compared to the output parse tree
     * anyway.
     *
     * We do bother to pfree the scanbuf and literal buffer, but only if they
     * represent a nontrivial amount of space.  The 8K cutoff is arbitrary.
     */
    if (yyextra->scanbuflen >= 8192)
        pfree(yyextra->scanbuf);
    if (yyextra->literalalloc >= 8192)
        pfree(yyextra->literalbuf);
}
static void
addlit(char *ytext, int yleng, core_yyscan_t yyscanner)
{
    /* enlarge buffer if needed */
    /* keep doubling until the existing data plus yleng bytes fit */
    if ((yyextra->literallen + yleng) >= yyextra->literalalloc)
    {
        do
        {
            yyextra->literalalloc *= 2;
        } while ((yyextra->literallen + yleng) >= yyextra->literalalloc);
        yyextra->literalbuf = (char *) repalloc(yyextra->literalbuf,
                                                yyextra->literalalloc);
    }
    /* append new data */
    memcpy(yyextra->literalbuf + yyextra->literallen, ytext, yleng);
    yyextra->literallen += yleng;
}
static void
addlitchar(unsigned char ychar, core_yyscan_t yyscanner)
{
    /* enlarge buffer if needed */
    if ((yyextra->literallen + 1) >= yyextra->literalalloc)
    {
        yyextra->literalalloc *= 2;
        yyextra->literalbuf = (char *) repalloc(yyextra->literalbuf,
                                                yyextra->literalalloc);
    }
    /* append new data */
    yyextra->literalbuf[yyextra->literallen] = ychar;
    yyextra->literallen += 1;
}
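
The growth policy shared by addlit() and addlitchar() is a doubling buffer that always keeps at least one spare byte (hence the >= comparison), so that a terminating NUL can be written later. A standalone sketch under those assumptions is shown here; the LitBuf type and the litbuf_* functions are hypothetical, and realloc() replaces repalloc().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the literalbuf fields of core_yy_extra_type. */
typedef struct LitBuf
{
    char       *data;
    size_t      len;            /* bytes currently stored */
    size_t      alloc;          /* bytes allocated */
} LitBuf;

/* Start with a small buffer; returns 0 on success, -1 on failure. */
static int
litbuf_init(LitBuf *b)
{
    b->alloc = 16;
    b->len = 0;
    b->data = malloc(b->alloc);
    return b->data != NULL ? 0 : -1;
}

/* Append n bytes, doubling the allocation until len + n < alloc holds. */
static int
litbuf_append(LitBuf *b, const char *src, size_t n)
{
    if (b->len + n >= b->alloc)
    {
        size_t      newalloc = b->alloc;
        char       *newdata;

        do
        {
            newalloc *= 2;
        } while (b->len + n >= newalloc);

        newdata = realloc(b->data, newalloc);
        if (newdata == NULL)
            return -1;
        b->data = newdata;
        b->alloc = newalloc;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

addlitchar() can double only once because it adds a single byte, whereas addlit() needs the loop since yleng may exceed the current allocation several times over.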
/*
 * Create a palloc'd copy of literalbuf, adding a trailing null.
 */
static char *
litbufdup(core_yyscan_t yyscanner)
{
    int            llen = yyextra->literallen;
    char       *new;
    new = palloc(llen + 1);
    memcpy(new, yyextra->literalbuf, llen);
    new[llen] = '\0';
    return new;
}
static int
process_integer_literal(const char *token, YYSTYPE *lval)
{
    /* Convert an integer literal; fall back to FCONST if it does not fit in an int. */
    int            val;
    char       *endptr;
    errno = 0;
    val = strtoint(token, &endptr, 10);
    if (*endptr != '\0' || errno == ERANGE)
    {
        /* integer too large, treat it as a float */
        lval->str = pstrdup(token);
        return FCONST;
    }
    lval->ival = val;
    return ICONST;
}
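
The fallback in process_integer_literal() can be tried in isolation. The sketch below assumes that strtoint() behaves like strtol() with an additional int-range check; TokenKind and classify_integer_literal() are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the ICONST / FCONST token codes. */
typedef enum
{
    TOK_ICONST,
    TOK_FCONST
} TokenKind;

/*
 * Classify a string of decimal digits: TOK_ICONST if it fits in an int
 * (the value is stored in *ival), TOK_FCONST otherwise.
 */
static TokenKind
classify_integer_literal(const char *token, int *ival)
{
    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(token, &endptr, 10);
    if (*endptr != '\0' || errno == ERANGE ||
        val < INT_MIN || val > INT_MAX)
        return TOK_FCONST;      /* too large for int: keep it as text */

    *ival = (int) val;
    return TOK_ICONST;
}
```

In the original routine the FCONST branch keeps the token text via pstrdup(), so no digits are lost when the value does not fit in an int.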
static unsigned int
hexval(unsigned char c)
{
    /* Return the numeric value of a single hexadecimal digit. */
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 0xA;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 0xA;
    elog(ERROR, "invalid hexadecimal digit");
    return 0;                    /* not reached */
}
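
hexval() is later combined with shifts to assemble a code point from the digits of a Unicode escape. A small sketch of the four-digit case follows; hex_digit_value() and parse_hex4() are hypothetical helpers, and unlike hexval(), which raises elog(ERROR), this version returns -1 on invalid input so that it can run standalone.

```c
/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int
hex_digit_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Combine the first four characters of s into a code point, most
 * significant digit first.  Returns -1 if any of them is not a hex
 * digit (including an early end of string).
 */
static long
parse_hex4(const char *s)
{
    long        value = 0;

    for (int i = 0; i < 4; i++)
    {
        int         digit = hex_digit_value((unsigned char) s[i]);

        if (digit < 0)
            return -1;
        value = (value << 4) | digit;
    }
    return value;
}
```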
static void
check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
{
    if (GetDatabaseEncoding() == PG_UTF8)
        return;
    if (c > 0x7F)
    {
        ADVANCE_YYLLOC(loc - yyextra->literalbuf + 3);    /* 3 for U&" */
        yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
    }
}
static bool
is_utf16_surrogate_first(pg_wchar c)
{
    return (c >= 0xD800 && c <= 0xDBFF);
}
static bool
is_utf16_surrogate_second(pg_wchar c)
{
    return (c >= 0xDC00 && c <= 0xDFFF);
}
static pg_wchar
surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
{
    return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}
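
The arithmetic in surrogate_pair_to_codepoint() is easier to follow with a worked example: the pair D83D DE00 yields (0x03D << 10) + 0x10000 + 0x200 = 0x1F600. The sketch below reproduces the same formula with uint32_t in place of pg_wchar; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* First (high) half of a UTF-16 surrogate pair: D800..DBFF. */
static int
is_high_surrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Second (low) half of a UTF-16 surrogate pair: DC00..DFFF. */
static int
is_low_surrogate(uint32_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/*
 * Each half carries 10 payload bits; the combined 20-bit value is an
 * offset from U+10000, the first code point beyond the BMP.
 */
static uint32_t
combine_surrogates(uint32_t high, uint32_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return ((high & 0x3FF) << 10) + 0x10000 + (low & 0x3FF);
}
```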
static void
addunicode(pg_wchar c, core_yyscan_t yyscanner)
{
    char        buf[8];
    if (c == 0 || c > 0x10FFFF)
        yyerror("invalid Unicode escape value");
    if (c > 0x7F)
    {
        if (GetDatabaseEncoding() != PG_UTF8)
            yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
        yyextra->saw_non_ascii = true;
    }
    unicode_to_utf8(c, (unsigned char *) buf);
    addlit(buf, pg_mblen(buf), yyscanner);
}
/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
static bool
check_uescapechar(unsigned char escape)
{
    if (isxdigit(escape)
        || escape == '+'
        || escape == '\''
        || escape == '"'
        || scanner_isspace(escape))
    {
        return false;
    }
    else
        return true;
}
/* like litbufdup, but handle unicode escapes */
static char *
litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner)
{
    char       *new;
    char       *litbuf,
               *in,
               *out;
    pg_wchar    pair_first = 0;
    /* Make literalbuf null-terminated to simplify the scanning loop */
    litbuf = yyextra->literalbuf;
    litbuf[yyextra->literallen] = '\0';
    /*
     * This relies on the subtle assumption that a UTF-8 expansion cannot be
     * longer than its escaped representation.
     */
    new = palloc(yyextra->literallen + 1);
    in = litbuf;
    out = new;
    while (*in)
    {
        if (in[0] == escape)
        {
            if (in[1] == escape)
            {
                if (pair_first)
                {
                    ADVANCE_YYLLOC(in - litbuf + 3);    /* 3 for U&" */
                    yyerror("invalid Unicode surrogate pair");
                }
                *out++ = escape;
                in += 2;
            }
            else if (isxdigit((unsigned char) in[1]) &&
                     isxdigit((unsigned char) in[2]) &&
                     isxdigit((unsigned char) in[3]) &&
                     isxdigit((unsigned char) in[4]))
            {
                pg_wchar    unicode;
                unicode = (hexval(in[1]) << 12) +
                    (hexval(in[2]) << 8) +
                    (hexval(in[3]) << 4) +
                    hexval(in[4]);
                check_unicode_value(unicode, in, yyscanner);
                if (pair_first)
                {
                    if (is_utf16_surrogate_second(unicode))
                    {
                        unicode = surrogate_pair_to_codepoint(pair_first, unicode);
                        pair_first = 0;
                    }
                    else
                    {
                        ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
                        yyerror("invalid Unicode surrogate pair");
                    }
                }
                else if (is_utf16_surrogate_second(unicode))
                    yyerror("invalid Unicode surrogate pair");
                if (is_utf16_surrogate_first(unicode))
                    pair_first = unicode;
                else
                {
                    unicode_to_utf8(unicode, (unsigned char *) out);
                    out += pg_mblen(out);
                }
                in += 5;
            }
            else if (in[1] == '+' &&
                     isxdigit((unsigned char) in[2]) &&
                     isxdigit((unsigned char) in[3]) &&
                     isxdigit((unsigned char) in[4]) &&
                     isxdigit((unsigned char) in[5]) &&
                     isxdigit((unsigned char) in[6]) &&
                     isxdigit((unsigned char) in[7]))
            {
                pg_wchar    unicode;
                unicode = (hexval(in[2]) << 20) +
                    (hexval(in[3]) << 16) +
                    (hexval(in[4]) << 12) +
                    (hexval(in[5]) << 8) +
                    (hexval(in[6]) << 4) +
                    hexval(in[7]);
                check_unicode_value(unicode, in, yyscanner);
                if (pair_first)
                {
                    if (is_utf16_surrogate_second(unicode))
                    {
                        unicode = surrogate_pair_to_codepoint(pair_first, unicode);
                        pair_first = 0;
                    }
                    else
                    {
                        ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
                        yyerror("invalid Unicode surrogate pair");
                    }
                }
                else if (is_utf16_surrogate_second(unicode))
                    yyerror("invalid Unicode surrogate pair");
                if (is_utf16_surrogate_first(unicode))
                    pair_first = unicode;
                else
                {
                    unicode_to_utf8(unicode, (unsigned char *) out);
                    out += pg_mblen(out);
                }
                in += 8;
            }
            else
            {
                ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
                yyerror("invalid Unicode escape value");
            }
        }
        else
        {
            if (pair_first)
            {
                ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
                yyerror("invalid Unicode surrogate pair");
            }
            *out++ = *in++;
        }
    }
    /* unfinished surrogate pair? */
    if (pair_first)
    {
        ADVANCE_YYLLOC(in - litbuf + 3);                /* 3 for U&" */
        yyerror("invalid Unicode surrogate pair");
    }
    *out = '\0';
    /*
     * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
     * codes; but it's probably not worth the trouble, since this isn't likely
     * to be a performance-critical path.
     */
    pg_verifymbstr(new, out - new, false);
    return new;
}
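
litbuf_udeescape() allocates the output with the same size as the input, relying on the "subtle assumption" stated in its comment. That assumption can be checked by hand: the 5-byte form (escape plus 4 hex digits) encodes at most U+FFFF, which needs 3 bytes in UTF-8; the 8-byte form (escape, +, 6 hex digits) needs at most 4 bytes; and a surrogate pair written as two 5-byte escapes also becomes 4 bytes. A minimal sketch of the length rule, with utf8_encoded_length() as a hypothetical helper:

```c
#include <stdint.h>

/* Number of bytes UTF-8 needs for a code point up to U+10FFFF. */
static int
utf8_encoded_length(uint32_t cp)
{
    if (cp <= 0x7F)
        return 1;
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;
    return 4;
}
```

In every case the decoded form is shorter than the escape that produced it, so writing into a buffer of literallen + 1 bytes cannot overflow.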
static unsigned char
unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
{
    switch (c)
    {
        case 'b':
            return '\b';
        case 'f':
            return '\f';
        case 'n':
            return '\n';
        case 'r':
            return '\r';
        case 't':
            return '\t';
        default:
            /* check for backslash followed by non-7-bit-ASCII */
            if (c == '\0' || IS_HIGHBIT_SET(c))
                yyextra->saw_non_ascii = true;
            return c;
    }
}
static void
check_string_escape_warning(unsigned char ychar, core_yyscan_t yyscanner)
{
    if (ychar == '\'')
    {
        if (yyextra->warn_on_first_escape && yyextra->escape_string_warning)
            ereport(WARNING,
                    (errcode(ERRCODE_NONSTANDARD_USE_OF_ESCAPE_CHARACTER),
                     errmsg("nonstandard use of \\' in a string literal"),
                     errhint("Use '' to write quotes in strings, or use the escape string syntax (E'...')."),
                     lexer_errposition()));
        yyextra->warn_on_first_escape = false;    /* warn only once per string */
    }
    else if (ychar == '\\')
    {
        if (yyextra->warn_on_first_escape && yyextra->escape_string_warning)
            ereport(WARNING,
                    (errcode(ERRCODE_NONSTANDARD_USE_OF_ESCAPE_CHARACTER),
                     errmsg("nonstandard use of \\\\ in a string literal"),
                     errhint("Use the escape string syntax for backslashes, e.g., E'\\\\'."),
                     lexer_errposition()));
        yyextra->warn_on_first_escape = false;    /* warn only once per string */
    }
    else
        check_escape_warning(yyscanner);
}
static void
check_escape_warning(core_yyscan_t yyscanner)
{
    if (yyextra->warn_on_first_escape && yyextra->escape_string_warning)
        ereport(WARNING,
                (errcode(ERRCODE_NONSTANDARD_USE_OF_ESCAPE_CHARACTER),
                 errmsg("nonstandard use of escape in a string literal"),
        errhint("Use the escape string syntax for escapes, e.g., E'\\r\\n'."),
                 lexer_errposition()));
    yyextra->warn_on_first_escape = false;        /* warn only once per string */
}
/*
 * Interface functions to make flex use palloc() instead of malloc().
 * It'd be better to make these static, but flex insists otherwise.
 */
void *
core_yyalloc(yy_size_t bytes, core_yyscan_t yyscanner)
{
    return palloc(bytes);
}
void *
core_yyrealloc(void *ptr, yy_size_t bytes, core_yyscan_t yyscanner)
{
    if (ptr)
        return repalloc(ptr, bytes);
    else
        return palloc(bytes);
}
void
core_yyfree(void *ptr, core_yyscan_t yyscanner)
{
    if (ptr)
        pfree(ptr);
}

2. References

Flex & Bison

Source: ITPUB Blog, link: http://blog.itpub.net/6906/viewspace-2641788/. If you repost this article, please credit the source; otherwise legal liability may be pursued.
