ITPub博客

首页 > 数据库 > PostgreSQL > PostgreSQL 源码解读(133)- MVCC#17(vacuum过程-lazy_vacuum_index函数#2)

PostgreSQL 源码解读(133)- MVCC#17(vacuum过程-lazy_vacuum_index函数#2)

原创 PostgreSQL 作者:husthxd 时间:2019-01-31 15:11:23 0 删除 编辑

本节简单介绍了PostgreSQL手工执行vacuum的处理流程,主要分析了B-Tree索引vacuum的处理过程,实现函数是btvacuumscan。

一、数据结构

IndexVacuumInfo
传递给ambulkdelete/amvacuumcleanup的输入参数结构体


/*
 * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
 * 传递给ambulkdelete/amvacuumcleanup的输入参数结构体
 *
 * num_heap_tuples is accurate only when estimated_count is false;
 * otherwise it's just an estimate (currently, the estimate is the
 * prior value of the relation's pg_class.reltuples field).  It will
 * always just be an estimate during ambulkdelete.
 * 在estimated_count为F的情况下,num_heap_tuples才是精确的.
 * 否则,该值只是一个故事(当前的实现是,该值是relation's pg_class.reltuples字段的上一个值).
 * 在ambulkdelete期间该值会一直都是估算值.
 */
typedef struct IndexVacuumInfo
{
    //index relation
    Relation    index;          /* the index being vacuumed */
    //是否只是ANALYZE(没有实际的vacuum)
    bool        analyze_only;   /* ANALYZE (without any actual vacuum) */
    //如为T,则num_heap_tuples是一个估算值
    bool        estimated_count;    /* num_heap_tuples is an estimate */
    //进度信息的日志等级
    int         message_level;  /* ereport level for progress messages */
    //在堆中仍存在的元组数
    double      num_heap_tuples;    /* tuples remaining in heap */
    //访问策略
    BufferAccessStrategy strategy;  /* access strategy for reads */
} IndexVacuumInfo;

IndexBulkDeleteResult
ambulkdelete/amvacuumcleanup返回的统计信息结构体


/*
 * Struct for statistics returned by ambulkdelete and amvacuumcleanup
 * ambulkdelete/amvacuumcleanup返回的统计信息结构体
 * 
 * This struct is normally allocated by the first ambulkdelete call and then
 * passed along through subsequent ones until amvacuumcleanup; however,
 * amvacuumcleanup must be prepared to allocate it in the case where no
 * ambulkdelete calls were made (because no tuples needed deletion).
 * Note that an index AM could choose to return a larger struct
 * of which this is just the first field; this provides a way for ambulkdelete
 * to communicate additional private data to amvacuumcleanup.
 * 该结构体通常由第一个ambulkdelete调用分配内存,传递到下一个处理过程,直至amvacuumcleanup;
 * 但是,在ambulkdelete没有调用时,amvacuumcleanup必须预分配(因为没有元组需要删除).
 * 注意索引访问方法(AM)可以选择返回一个更大的结构体,而该结构体是这个更大的结构体的第一个域;
 * 这为ambulkdelete提供了一个方法用于与需要额外私有数据的amvacuumcleanup函数通讯.
 *
 * Note: pages_removed is the amount by which the index physically shrank,
 * if any (ie the change in its total size on disk).  pages_deleted and
 * pages_free refer to free space within the index file.  Some index AMs
 * may compute num_index_tuples by reference to num_heap_tuples, in which
 * case they should copy the estimated_count field from IndexVacuumInfo.
 * 注意:pages_remove是索引物理收缩(shrank)的数量,如果有的话(即它在磁盘上的总大小的变化)。
 * pages_deleted和pages_free指的是索引文件中的空闲空间.
 * 某些索引访问方法(AMs)可能通过参考num_heap_tuples计算num_index_tuples,
 *   在这种情况下会拷贝从IndexVacuumInfo中拷贝estimated_count域.
 */
typedef struct IndexBulkDeleteResult
{
    //index中剩下的pages
    BlockNumber num_pages;      /* pages remaining in index */
    //在vacuum期间清除的元组数
    BlockNumber pages_removed;  /* # removed during vacuum operation */
    //num_index_tuples是一个估算值?
    bool        estimated_count;    /* num_index_tuples is an estimate */
    //剩余的元组数
    double      num_index_tuples;   /* tuples remaining */
    //在vacuum期间清除的元组数
    double      tuples_removed; /* # removed during vacuum operation */
    //索引中未使用的pages
    BlockNumber pages_deleted;  /* # unused pages in index */
    //可重用的pages
    BlockNumber pages_free;     /* # pages available for reuse */
} IndexBulkDeleteResult;

BTPageOpaque
在每个页面的尾部,我们存储了一个指针用于指向树中的兄弟


/*
 *  BTPageOpaqueData -- At the end of every page, we store a pointer
 *  to both siblings in the tree.  This is used to do forward/backward
 *  index scans.  The next-page link is also critical for recovery when
 *  a search has navigated to the wrong page due to concurrent page splits
 *  or deletions; see src/backend/access/nbtree/README for more info.
 *  BTPageOpaqueData -- 在每个页面的尾部,我们存储了一个指针用于指向树中的兄弟.
 *  这用于执行正向/反向索引扫描。
 *  当搜索由于并发的页面分裂或删除而导航到错误的页面时,
 *    下一页链接对于恢复也非常关键;
 *  有关更多信息,请参见src/backend/access/nbtree/README。 
 *
 *  In addition, we store the page's btree level (counting upwards from
 *  zero at a leaf page) as well as some flag bits indicating the page type
 *  and status.  If the page is deleted, we replace the level with the
 *  next-transaction-ID value indicating when it is safe to reclaim the page.
 *  此外,我们存储页面的btree级别(在叶子页面上从0开始计数)以及一些标志位,
 *    这些标志位指示页面类型和状态。
 *  如果页面被删除,我们将用next-transaction-ID值替换该级别,该值指示何时可以安全地回收页面。
 *
 *  We also store a "vacuum cycle ID".  When a page is split while VACUUM is
 *  processing the index, a nonzero value associated with the VACUUM run is
 *  stored into both halves of the split page.  (If VACUUM is not running,
 *  both pages receive zero cycleids.)  This allows VACUUM to detect whether
 *  a page was split since it started, with a small probability of false match
 *  if the page was last split some exact multiple of MAX_BT_CYCLE_ID VACUUMs
 *  ago.  Also, during a split, the BTP_SPLIT_END flag is cleared in the left
 *  (original) page, and set in the right page, but only if the next page
 *  to its right has a different cycleid.
 *  我们同样会存储"vacuum cycle ID".当页面在vacuum处理,索引被分裂时,
 *    与vacuum运行相关的非零值存储在分裂页面的两个部分中。
 *  (如果VACUUM没有运行,两个页面都接收到的cycleid均为0)
 *  这允许VACUUM检测页面是否在开始时就被分裂了,
 *    如果页面上次分裂的时间恰好是MAX_BT_CYCLE_ID VACUUMs值的倍数,则有很小的可能出现错误匹配。
 *  此外,在分裂期间,BTP_SPLIT_END标记在左侧(原始)页面中被清除,
 *    并在右侧页面中设置,但只有在其右侧的下一页具有不同的cycleid时才会这样做。
 *
 *  NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
 *  instead.
 * 注意:BTP_LEAF标记为是冗余的,因为level==0可以被测试.
 */
typedef struct BTPageOpaqueData
{
    //左兄弟,如为最左边的节点,则为P_NONE
    BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
    //右兄弟,如为最右边的节点,则为P_NONE
    BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
    union
    {
        //树层次,如为叶子节点,则为0
        uint32      level;      /* tree level --- zero for leaf pages */
        //如已删除,记录下一个事务ID
        TransactionId xact;     /* next transaction ID, if deleted */
    }           btpo;//联合体
    //标记位
    uint16      btpo_flags;     /* flag bits, see below */
    //最后一次分裂的vacuum cycle ID
    BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
typedef BTPageOpaqueData *BTPageOpaque;
/* Bits defined in btpo_flags */
#define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
#define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
#define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
#define BTP_META        (1 << 3)    /* meta-page */
#define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
#define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */

二、源码解读

lazy_vacuum_index->index_bulk_delete->…btbulkdelete->btvacuumscan
btvacuumscan扫描索引,执行vacuum
该函数的功能包括:
A.搜索符合vacuum callback条件的已删除的叶子元组;
B.搜索可删除的空页;
C.搜索老旧已删除可被回收的页面.
btbulkdelete和btvacuumcleanup函数都会调用该过程(后者仅在没有发生btbulkdelete调用时才会发生)

其主要处理逻辑如下:
1.初始化统计信息(IndexBulkDeleteResult结构体)
2.初始化vstate状态信息(BTVacState结构体)
3.构造临时上下文
4.循环遍历page
4.1获取relation锁
4.2遍历block,执行btvacuumpage
4.3如需要,多次遍历relation
5.WAL Record处理
6.删除临时上下文
7.处理空闲空间
8.更新统计信息


/*
 * btvacuumscan --- scan the index for VACUUMing purposes
 * btvacuumscan --- 为VACUUMing扫描索引
 *
 * This combines the functions of looking for leaf tuples that are deletable
 * according to the vacuum callback, looking for empty pages that can be
 * deleted, and looking for old deleted pages that can be recycled.  Both
 * btbulkdelete and btvacuumcleanup invoke this (the latter only if no
 * btbulkdelete call occurred).
 * 该函数的功能包括:
 *    A.搜索符合vacuum callback条件的已删除的叶子元组;
 *    B.搜索可删除的空页;
 *    C.搜索老旧已删除可被回收的页面.
 * btbulkdelete和btvacuumcleanup函数都会调用该过程(后者仅在没有发生btbulkdelete调用时才会发生)
 *
 * The caller is responsible for initially allocating/zeroing a stats struct
 * and for obtaining a vacuum cycle ID if necessary.
 * 调用者有责任初始化分配或者归零统计结构体,如需要获取一个vacuum cycle ID.
 */
static void
btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
             IndexBulkDeleteCallback callback, void *callback_state,
             BTCycleId cycleid, TransactionId *oldestBtpoXact)
{
    Relation    rel = info->index;
    BTVacState  vstate;
    BlockNumber num_pages;
    BlockNumber blkno;
    bool        needLock;
    /*
     * Reset counts that will be incremented during the scan; needed in case
     * of multiple scans during a single VACUUM command
     * 在扫描重置计数会被增加,在单个VACUUM命令期间多次扫描时需要.
     */
    stats->estimated_count = false;
    stats->num_index_tuples = 0;
    stats->pages_deleted = 0;
    /* Set up info to pass down to btvacuumpage */
    //设置传递给btvacuumpage函数的参数
    vstate.info = info;
    vstate.stats = stats;
    vstate.callback = callback;
    vstate.callback_state = callback_state;
    vstate.cycleid = cycleid;
    vstate.lastBlockVacuumed = BTREE_METAPAGE;  /* Initialise at first block */
    vstate.lastBlockLocked = BTREE_METAPAGE;
    vstate.totFreePages = 0;
    vstate.oldestBtpoXact = InvalidTransactionId;
    /* Create a temporary memory context to run _bt_pagedel in */
    //创建临时内存上下文用于运行_bt_pagedel
    vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
                                                  "_bt_pagedel",
                                                  ALLOCSET_DEFAULT_SIZES);
    /*
     * The outer loop iterates over all index pages except the metapage, in
     * physical order (we hope the kernel will cooperate in providing
     * read-ahead for speed).  It is critical that we visit all leaf pages,
     * including ones added after we start the scan, else we might fail to
     * delete some deletable tuples.  Hence, we must repeatedly check the
     * relation length.  We must acquire the relation-extension lock while
     * doing so to avoid a race condition: if someone else is extending the
     * relation, there is a window where bufmgr/smgr have created a new
     * all-zero page but it hasn't yet been write-locked by _bt_getbuf(). If
     * we manage to scan such a page here, we'll improperly assume it can be
     * recycled.  Taking the lock synchronizes things enough to prevent a
     * problem: either num_pages won't include the new page, or _bt_getbuf
     * already has write lock on the buffer and it will be fully initialized
     * before we can examine it.  (See also vacuumlazy.c, which has the same
     * issue.)  Also, we need not worry if a page is added immediately after
     * we look; the page splitting code already has write-lock on the left
     * page before it adds a right page, so we must already have processed any
     * tuples due to be moved into such a page.
     * 外部循环按照物理顺序遍历除元数据页面之外的所有索引页
     *   (我们希望内核能够配合提供预读以提高性能)。
     * 至关重要的是,我们要访问所有页,包括在开始扫描后添加的页,否则可能无法删除一些可删除的元组。
     * 因此,我们必须反复检查关系的大小。我们必须在获取关系扩展锁的同时避免竞争条件:
     *   如果其他人正在扩展该关系,则会出现一个窗口,其中bufmgr/smgr创建了一个新的初始化页面(全0),
     *   但是_bt_getbuf()尚未对其进行写锁定。
     * 如果我们成功地扫描了这样一个页面,我们就会错误地认为它可以被回收。
     * 使用锁可以同步足够的信息以防止出现此类问题:num_pages不包含新页面,
     *   或者_bt_getbuf已经在缓冲区上有写锁,在我们检查它之前,它将被完全初始化。
     * (参见vacuumlazy.c,里面有相同的主题和内容).
     * 此外,如果在查看后立即新增页面,也无需担心;
     *   在添加右页之前,页面分割代码已经在左页上设置了写锁,
     *   因此我们必须已经处理了将被移动到此类页中的所有元组。
     *
     * We can skip locking for new or temp relations, however, since no one
     * else could be accessing them.
     * 对于新的或临时relations可以跳过锁定,因为其他进程无法访问这些relations.
     */
    //是否需要锁定?
    needLock = !RELATION_IS_LOCAL(rel);
    //索引relation的第一个页是元数据页,需要跳过
    blkno = BTREE_METAPAGE + 1;
    for (;;)//循环
    {
        /* Get the current relation length */
        //获取当前relation的大小
        if (needLock)
            //如需要锁,则锁定
            LockRelationForExtension(rel, ExclusiveLock);
        //获取pages
        num_pages = RelationGetNumberOfBlocks(rel);
        if (needLock)
            //解锁
            UnlockRelationForExtension(rel, ExclusiveLock);
        /* Quit if we've scanned the whole relation */
        //如果已扫描了整个relation,则Quit
        if (blkno >= num_pages)
            break;
        /* Iterate over pages, then loop back to recheck length */
        //迭代扫描pages,然后回过头来重新检查大小
        for (; blkno < num_pages; blkno++)
        {
            btvacuumpage(&vstate, blkno, blkno);
        }
    }
    /*
     * Check to see if we need to issue one final WAL record for this index,
     * which may be needed for correctness on a hot standby node when non-MVCC
     * index scans could take place.
     * 检查我们是否需要为这个索引发出最后一条WAL记录,
     *   当可以进行非MVCC索引扫描时,可能需要在热备节点上正确地发出这条记录。
     *
     * If the WAL is replayed in hot standby, the replay process needs to get
     * cleanup locks on all index leaf pages, just as we've been doing here.
     * However, we won't issue any WAL records about pages that have no items
     * to be deleted.  For pages between pages we've vacuumed, the replay code
     * will take locks under the direction of the lastBlockVacuumed fields in
     * the XLOG_BTREE_VACUUM WAL records.  To cover pages after the last one
     * we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
     * against the last leaf page in the index, if that one wasn't vacuumed.
     * 如果在热备份中重放WAL,重放过程需要在所有索引页上获得清理锁,就像我们在这里所做的那样。
     * 但是,我们不会发布任何关于没有要删除项的页面的WAL记录。
     * 对于在已vacuumed页面之间的页面,
     *   重放代码将在XLOG_BTREE_VACUUM WAL记录中的lastBlockVacuumed字段下进行锁定。
     * 要覆盖最后一个vacuumed页面之后的页面,
     *   我们需要对索引中的最后一个叶子页面发出一个虚拟的XLOG_BTREE_VACUUM WAL记录,
     *   如果这个页面没有vacuumed的话。
     */
    if (XLogStandbyInfoActive() &&
        vstate.lastBlockVacuumed < vstate.lastBlockLocked)
    {
        Buffer      buf;
        /*
         * The page should be valid, but we can't use _bt_getbuf() because we
         * want to use a nondefault buffer access strategy.  Since we aren't
         * going to delete any items, getting cleanup lock again is probably
         * overkill, but for consistency do that anyway.
         * 页面应该是有效的,但是我们不能使用_bt_getbuf(),
         *   因为我们想使用非默认的缓冲区访问策略。
         * 因为我们不打算删除任何项,所以再次获得清理锁可能有点过头,但为了一致性,还是要这样做。
         */
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
                                 RBM_NORMAL, info->strategy);
        LockBufferForCleanup(buf);
        _bt_checkpage(rel, buf);
        _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
        _bt_relbuf(rel, buf);
    }
    //删除临时上下文
    MemoryContextDelete(vstate.pagedelcontext);
    /*
     * If we found any recyclable pages (and recorded them in the FSM), then
     * forcibly update the upper-level FSM pages to ensure that searchers can
     * find them.  It's possible that the pages were also found during
     * previous scans and so this is a waste of time, but it's cheap enough
     * relative to scanning the index that it shouldn't matter much, and
     * making sure that free pages are available sooner not later seems
     * worthwhile.
     * 如果我们发现任何可回收的页面(并将其记录在FSM中),则强制更新高级FSM页面,以确保能够找到它们。
     * 有可能这些页面也是在以前的扫描中找到的,所以这是浪费时间,
     *   但是相对于扫描索引来说,这个操作的成本足够低了,所以它不太重要,
     *   并且确保空闲页面是尽早可用的,而不是以后才有用。
     *
     * Note that if no recyclable pages exist, we don't bother vacuuming the
     * FSM at all.
     * 注意:如果没有可回收的页面,不需要纠结是否需要vacuuming FSM
     */
    if (vstate.totFreePages > 0)
        //处理空闲空间
        IndexFreeSpaceMapVacuum(rel);
    /* update statistics */
    //更新统计信息
    stats->num_pages = num_pages;
    stats->pages_free = vstate.totFreePages;
    if (oldestBtpoXact)
        *oldestBtpoXact = vstate.oldestBtpoXact;
}

lazy_vacuum_index->index_bulk_delete->…btbulkdelete->btvacuumscan->btvacuumpage
btvacuumpage —- VACUUM页面
btvacuumscan()调用该过程处理单个页面.在某些情况下,必须回过头来重新检查先前已扫描过的页面,该过程在需要的时候递归处理这种情况.

其主要处理逻辑如下:
1.初始化相关变量
2.调用ReadBufferExtended读取block到buffer中,锁定buffer,获取page
3.如果不是new page,则执行检查并获取BTPageOpaque
4.如块号与原始不同,正在进行递归处理,如page可回收或者可忽略或者不是叶子节点或者cycleid不同,则调用_bt_relbuf,返回;否则继续往下执行
5.执行相关判断
5.1如page可回收,则回收页面
5.2如page已删除,但不能回收,则更新统计信息
5.3如page为Half-dead,则尝试删除(设置delete_now标记为T)
5.4如page为叶子节点
5.4.1初始化变量
5.4.2锁定缓冲区
5.4.3记录已取得cleanup lock的最大叶子页编号
5.4.4检查我们是否需要递归回先前的页面
5.4.5扫描所有条目,看看哪些根据回调函数得到的需要删除的条目(写入到deletable数组中)
5.4.6如数组不为空,则调用_bt_delitems_vacuum,记录相关信息
如数组为空,判断页面是否在这个vacuum cycle中被分裂,清除btpo_cycleid标记,标记缓冲区为脏
5.4.7如果page为空,则试着删除(设置delete_now为T);否则计算活动元组
5.4.8如试着删除(delete_now为T),则调用_bt_pagedel删除,更新统计信息
否则调用_bt_relbuf
5.4.9判断recurse_to != P_NONE,如T,则重新启动,否则退出


/*
 * btvacuumpage --- VACUUM one page
 * btvacuumpage --- VACUUM页面
 * 
 * This processes a single page for btvacuumscan().  In some cases we
 * must go back and re-examine previously-scanned pages; this routine
 * recurses when necessary to handle that case.
 * btvacuumscan()调用该过程处理单个页面.
 * 在某些情况下,必须回过头来重新检查先前已扫描过的页面,该过程在需要的时候递归处理这种情况.
 *
 * blkno is the page to process.  orig_blkno is the highest block number
 * reached by the outer btvacuumscan loop (the same as blkno, unless we
 * are recursing to re-examine a previous page).
 * blkno是需处理的页面.orig_blkno是外层btvacuumscan循环最大的块号
 * (与blkno一样,除非我们需要递归检查先前的页面)
 */
static void
btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
{
    IndexVacuumInfo *info = vstate->info;//IndexVacuumInfo
    IndexBulkDeleteResult *stats = vstate->stats;//统计信息
    //typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
    IndexBulkDeleteCallback callback = vstate->callback;//回调函数
    void       *callback_state = vstate->callback_state;//回调函数状态
    Relation    rel = info->index;//index relation
    bool        delete_now;//现在删除?
    BlockNumber recurse_to;//递归处理的block
    Buffer      buf;//buffer
    Page        page;//page
    BTPageOpaque opaque = NULL;//
restart:
    delete_now = false;
    recurse_to = P_NONE;
    /* call vacuum_delay_point while not holding any buffer lock */
    //在没有持有任何buffer lock时,调用vacuum_delay_point
    vacuum_delay_point();
    /*
     * We can't use _bt_getbuf() here because it always applies
     * _bt_checkpage(), which will barf on an all-zero page. We want to
     * recycle all-zero pages, not fail.  Also, we want to use a nondefault
     * buffer access strategy.
     * 在这里不能使用_bt_getbuf()函数,因为该函数通常会调用_bt_checkpage(),
     *   该函数会braf on刚被初始化的page上.
     * 我们希望成功重用all-zero pages.
     * 而且,我们希望使用非默认的buffer访问策略.
     */
    buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
                             info->strategy);
    LockBuffer(buf, BT_READ);
    page = BufferGetPage(buf);
    if (!PageIsNew(page))
    {
        _bt_checkpage(rel, buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    }
    /*
     * If we are recursing, the only case we want to do anything with is a
     * live leaf page having the current vacuum cycle ID.  Any other state
     * implies we already saw the page (eg, deleted it as being empty).
     * 如果我们正在递归处理,需要处理的唯一情况是存在当前的vacuum cycle ID的活动叶子页节点。
     * 任何其他状态都意味着我们已经看到该页面(例如,删除后视为空)。
     */
    if (blkno != orig_blkno)
    {
        //block编号已改变
        if (_bt_page_recyclable(page) ||
            P_IGNORE(opaque) ||
            !P_ISLEAF(opaque) ||
            opaque->btpo_cycleid != vstate->cycleid)
        {
            _bt_relbuf(rel, buf);
            return;
        }
    }
    /* Page is valid, see what to do with it */
    //Page有效,看看需要做些什么
    if (_bt_page_recyclable(page))
    {
        /* Okay to recycle this page */
        //可以回收该页面了
        RecordFreeIndexPage(rel, blkno);
        vstate->totFreePages++;
        stats->pages_deleted++;
    }
    else if (P_ISDELETED(opaque))
    {
        /* Already deleted, but can't recycle yet */
        //该page已被删除,但不能回收
        //更新统计信息
        stats->pages_deleted++;
        /* Update the oldest btpo.xact */
        //更新最旧的btpo.xact
        if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
            TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
            vstate->oldestBtpoXact = opaque->btpo.xact;
    }
    else if (P_ISHALFDEAD(opaque))
    {
        /* Half-dead, try to delete */
        //Half-dead,尝试删除
        delete_now = true;
    }
    else if (P_ISLEAF(opaque))
    {
        //------- 叶子节点
        //偏移数组
        OffsetNumber deletable[MaxOffsetNumber];
        int         ndeletable;//计数
        OffsetNumber offnum,
                    minoff,
                    maxoff;
        /*
         * Trade in the initial read lock for a super-exclusive write lock on
         * this page.  We must get such a lock on every leaf page over the
         * course of the vacuum scan, whether or not it actually contains any
         * deletable tuples --- see nbtree/README.
         * 将初始读锁转换为此页上的超级独占写锁。
         * 在vacuum扫描过程中,我们必须在每个叶子页上获得这样的锁,
         *   不管它是否包含任何可删除的元组——参见nbtree/README。
         */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        LockBufferForCleanup(buf);
        /*
         * Remember highest leaf page number we've taken cleanup lock on; see
         * notes in btvacuumscan
         * 记录已取得cleanup lock的最大叶子页编号,详见btvacuumscan注释
         */
        if (blkno > vstate->lastBlockLocked)
            vstate->lastBlockLocked = blkno;
        /*
         * Check whether we need to recurse back to earlier pages.  What we
         * are concerned about is a page split that happened since we started
         * the vacuum scan.  If the split moved some tuples to a lower page
         * then we might have missed 'em.  If so, set up for tail recursion.
         * (Must do this before possibly clearing btpo_cycleid below!)
         * 检查我们是否需要递归回先前的页面.
         * 我们所关心的是自开始vacuum扫描以来发生的索引页面分裂.
         * 如果分裂将一些元组移动到层次较低的页面,那么我们可能会错过它们。
         * (在可能清除下面的btpo_cycleid之前,必须这样做!)
         */
        if (vstate->cycleid != 0 &&
            opaque->btpo_cycleid == vstate->cycleid &&
            !(opaque->btpo_flags & BTP_SPLIT_END) &&
            !P_RIGHTMOST(opaque) &&
            opaque->btpo_next < orig_blkno)
            recurse_to = opaque->btpo_next;
        /*
         * Scan over all items to see which ones need deleted according to the
         * callback function.
         * 扫描所有条目,看看哪些根据回调函数得到的需要删除的条目。
         */
        ndeletable = 0;
        //最小偏移
        minoff = P_FIRSTDATAKEY(opaque);
        //最大偏移
        maxoff = PageGetMaxOffsetNumber(page);
        if (callback)
        {
            //存在回调函数
            for (offnum = minoff;
                 offnum <= maxoff;
                 offnum = OffsetNumberNext(offnum))
            {
                //从小到大遍历偏移
                IndexTuple  itup;//索引元组
                ItemPointer htup;//行指针
                //获取索引元组
                itup = (IndexTuple) PageGetItem(page,
                                                PageGetItemId(page, offnum));
                htup = &(itup->t_tid);//获取行指针
                /*
                 * During Hot Standby we currently assume that
                 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
                 * only true as long as the callback function depends only
                 * upon whether the index tuple refers to heap tuples removed
                 * in the initial heap scan. When vacuum starts it derives a
                 * value of OldestXmin. Backends taking later snapshots could
                 * have a RecentGlobalXmin with a later xid than the vacuum's
                 * OldestXmin, so it is possible that row versions deleted
                 * after OldestXmin could be marked as killed by other
                 * backends. The callback function *could* look at the index
                 * tuple state in isolation and decide to delete the index
                 * tuple, though currently it does not. If it ever did, we
                 * would need to reconsider whether XLOG_BTREE_VACUUM records
                 * should cause conflicts. If they did cause conflicts they
                 * would be fairly harsh conflicts, since we haven't yet
                 * worked out a way to pass a useful value for
                 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
                 * applies to *any* type of index that marks index tuples as
                 * killed.
                 * 在热备份期间,目前假设XLOG_BTREE_VACUUM记录不会产生冲突。
                 * 只有当回调函数仅依赖于索引元组是否引用在初始堆扫描中删除的堆元组时,这种情况才成立。
                 * 当vacuum开始时,它得到一个OldestXmin的值。
                 * 拍摄较晚快照的后台进程可能具有一个RecentGlobalXmin,其xid比vacuum的最老的xmin还要晚,
                 *   因此,在OldestXmin之后删除的行版本可能被其他后台进程标记为已删除。
                 * 回调函数*可以*单独查看索引元组的状态,并决定删除索引元组,尽管目前没有。
                 * 如有,我们需要重新考虑XLOG_BTREE_VACUUM记录是否应该引起冲突.
                 * 如果它们确实导致冲突,那将是相当严重的冲突,
                 *   因为我们还没有找到在XLOG_BTREE_VACUUM记录上传递latestRemovedXid的有用值的方法。
                 * 这适用于任何将索引元组标记为killed的索引类型。
                 */
                if (callback(htup, callback_state))
                    //回调函数返回T,写入数组中
                    deletable[ndeletable++] = offnum;
            }
        }
        /*
         * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
         * call per page, so as to minimize WAL traffic.
         * 应用需要的删除.
         * 我们每个页面只发出一个_bt_delitems_vacuum()调用,以便最小化WAL流量。
         */
        if (ndeletable > 0)
        {
            //--------------- 如deletable数组不为空
            /*
             * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
             * all information to the replay code to allow it to get a cleanup
             * lock on all pages between the previous lastBlockVacuumed and
             * this page. This ensures that WAL replay locks all leaf pages at
             * some point, which is important should non-MVCC scans be
             * requested. This is currently unused on standby, but we record
             * it anyway, so that the WAL contains the required information.
             * 请注意,已发布的XLOG_BTREE_VACUUM WAL记录包含重放代码的所有信息,
             *   以允许重放代码在上一个lastblockvacuum和这个页面之间的所有页面上获得清理锁。
             * 这确保了WAL replay在某个时刻锁定所有的叶页面,这一点在请求非mvcc扫描时非常重要。
             * 这在待机状态下目前是未使用的,但是我们会记录它,以便WAL包含所需的信息。
             *
             * Since we can visit leaf pages out-of-order when recursing,
             * replay might end up locking such pages an extra time, but it
             * doesn't seem worth the amount of bookkeeping it'd take to avoid
             * that.
             * 因为在递归处理时,我们可以无序地访问叶子页面,
             *   所以重放可能会额外地锁定这些页面,但是似乎不值得为此花费大量的bookkeeping时间。
             */
            _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
                                vstate->lastBlockVacuumed);
            /*
             * Remember highest leaf page number we've issued a
             * XLOG_BTREE_VACUUM WAL record for.
             * 记住我们已经生成LOG_BTREE_VACUUM WAL record的最大叶子页面编号
             */
            if (blkno > vstate->lastBlockVacuumed)
                vstate->lastBlockVacuumed = blkno;
            stats->tuples_removed += ndeletable;
            /* must recompute maxoff */
            //重新计算maxoff
            maxoff = PageGetMaxOffsetNumber(page);
        }
        else
        {
            /*
             * If the page has been split during this vacuum cycle, it seems
             * worth expending a write to clear btpo_cycleid even if we don't
             * have any deletions to do.  (If we do, _bt_delitems_vacuum takes
             * care of this.)  This ensures we won't process the page again.
             * 如果页面在这个vacuum cycle中被分裂,
             *   那么即使我们没有任何删除工作要做,但似乎也值得花费一次写操作来清除btpo_cycleid。
             * (如果我们这样做,_bt_delitems_vacuum负责处理这个问题。)
             * 这确保我们不会再次处理该页面。
             *
             * We treat this like a hint-bit update because there's no need to
             * WAL-log it.
             * 进行这个处理如同hint-bit更新,因为不需要记录WAL Record.
             */
            if (vstate->cycleid != 0 &&
                opaque->btpo_cycleid == vstate->cycleid)
            {
                opaque->btpo_cycleid = 0;
                MarkBufferDirtyHint(buf, true);
            }
        }
        /*
         * If it's now empty, try to delete; else count the live tuples. We
         * don't delete when recursing, though, to avoid putting entries into
         * freePages out-of-order (doesn't seem worth any extra code to handle
         * the case).
         * 如果它现在是空的,试着删除;否则计算活动元组。
         * 但是,在递归时我们不会删除,
         *   以避免将条目无序地放入freePages中
         * (似乎不值得使用任何额外的代码来处理这种情况)。
         */
        if (minoff > maxoff)
            delete_now = (blkno == orig_blkno);
        else
            stats->num_index_tuples += maxoff - minoff + 1;
    }
    if (delete_now)
    {
        MemoryContext oldcontext;
        int         ndel;
        /* Run pagedel in a temp context to avoid memory leakage */
        //在临时内存上下文中执行pagedel避免内存泄漏
        MemoryContextReset(vstate->pagedelcontext);
        oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
        ndel = _bt_pagedel(rel, buf);
        /* count only this page, else may double-count parent */
        //只对该页面进行计数,否则会双倍计算父节点
        if (ndel)
        {
            stats->pages_deleted++;
            if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
                TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
                vstate->oldestBtpoXact = opaque->btpo.xact;
        }
        MemoryContextSwitchTo(oldcontext);
        /* pagedel released buffer, so we shouldn't */
        //pagedel会释放缓存,在这里不需要做这个事情
    }
    else
        _bt_relbuf(rel, buf);
    /*
     * This is really tail recursion, but if the compiler is too stupid to
     * optimize it as such, we'd eat an uncomfortably large amount of stack
     * space per recursion level (due to the deletable[] array). A failure is
     * improbable since the number of levels isn't likely to be large ... but
     * just in case, let's hand-optimize into a loop.
     * 这实际上是尾部递归,但是如果编译器笨到无法对其进行优化,
     *   那么每个递归级别都会消耗大量堆栈空间(由于deletable[]数组的存在)。
     * 失败是不可能的,因为级别的数量不大……但以防万一,我们手工优化成一个循环。
     */
    if (recurse_to != P_NONE)
    {
        blkno = recurse_to;
        goto restart;
    }
}

lazy_tid_reaped
回调函数,调用系统函数bsearch检查tid是否可以被删除?


/*
 *  lazy_tid_reaped() -- is a particular tid deletable?
 *
 *      This has the right signature to be an IndexBulkDeleteCallback.
 *
 *      Assumes dead_tuples array is in sorted order.
 */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVRelStats *vacrelstats = (LVRelStats *) state;
    ItemPointer res;
    //vac_cmp_itemptr是比较函数
    res = (ItemPointer) bsearch((void *) itemptr,
                                (void *) vacrelstats->dead_tuples,
                                vacrelstats->num_dead_tuples,
                                sizeof(ItemPointerData),
                                vac_cmp_itemptr);
    return (res != NULL);
}
/*
 * Comparator routines for use with qsort() and bsearch().
 * qsort()和bsearch()使用的比较函数
 * 比较块号和块内偏移,如一致则返回0,否则left < right,返回-1;left > right,返回1.
 */
static int
vac_cmp_itemptr(const void *left, const void *right)
{
    BlockNumber lblk,
                rblk;
    OffsetNumber loff,
                roff;
    lblk = ItemPointerGetBlockNumber((ItemPointer) left);
    rblk = ItemPointerGetBlockNumber((ItemPointer) right);
    if (lblk < rblk)
        return -1;
    if (lblk > rblk)
        return 1;
    loff = ItemPointerGetOffsetNumber((ItemPointer) left);
    roff = ItemPointerGetOffsetNumber((ItemPointer) right);
    if (loff < roff)
        return -1;
    if (loff > roff)
        return 1;
    return 0;
}

三、跟踪分析

测试脚本 : 删除数据,执行vacuum


14:24:12 (xdb@[local]:5432)testdb=# delete from t1 where id < 1300;
DELETE 100
14:24:23 (xdb@[local]:5432)testdb=# checkpoint;
CHECKPOINT
14:24:26 (xdb@[local]:5432)testdb=# 
14:25:28 (xdb@[local]:5432)testdb=# vacuum verbose t1;

btvacuumscan
启动gdb,设置断点


(gdb) b btvacuumscan
Breakpoint 1 at 0x509951: file nbtree.c, line 959.
(gdb) c
Continuing.
Breakpoint 1, btvacuumscan (info=0x7ffd33d29b70, stats=0x23ea988, callback=0x6bf507 <lazy_tid_reaped>, 
    callback_state=0x23eaaf8, cycleid=37964, oldestBtpoXact=0x7ffd33d29a40) at nbtree.c:959
959     Relation    rel = info->index;
(gdb)

输入参数


(gdb) p *info
$1 = {index = 0x7f6b76bcc688, analyze_only = false, estimated_count = true, message_level = 17, num_heap_tuples = 14444, 
  strategy = 0x2413708}
(gdb) p *stats
$2 = {num_pages = 0, pages_removed = 0, estimated_count = false, num_index_tuples = 0, tuples_removed = 0, 
  pages_deleted = 0, pages_free = 0}
(gdb) p *oldestBtpoXact
$3 = 869440096
(gdb) 
(gdb) p (LVRelStats *)callback_state
$4 = (LVRelStats *) 0x23eaaf8
(gdb) p *(LVRelStats *)callback_state
$5 = {hasindex = true, old_rel_pages = 124, rel_pages = 124, scanned_pages = 52, pinskipped_pages = 0, 
  frozenskipped_pages = 1, tupcount_pages = 52, old_live_tuples = 14444, new_rel_tuples = 14840, new_live_tuples = 14840, 
  new_dead_tuples = 0, pages_removed = 0, tuples_deleted = 100, nonempty_pages = 124, num_dead_tuples = 100, 
  max_dead_tuples = 36084, dead_tuples = 0x7f6b76ad7050, num_index_scans = 0, latestRemovedXid = 397077, 
  lock_waiter_detected = false}

1.初始化统计信息(IndexBulkDeleteResult结构体)


(gdb) n
969     stats->estimated_count = false;
(gdb) 
970     stats->num_index_tuples = 0;
(gdb) 
971     stats->pages_deleted = 0;
(gdb)

2.初始化vstate状态信息(BTVacState结构体)


974     vstate.info = info;
(gdb) 
975     vstate.stats = stats;
(gdb) 
976     vstate.callback = callback;
(gdb) 
977     vstate.callback_state = callback_state;
(gdb) 
978     vstate.cycleid = cycleid;
(gdb) 
979     vstate.lastBlockVacuumed = BTREE_METAPAGE;  /* Initialise at first block */
(gdb) 
980     vstate.lastBlockLocked = BTREE_METAPAGE;
(gdb) 
981     vstate.totFreePages = 0;
(gdb) 
982     vstate.oldestBtpoXact = InvalidTransactionId;
(gdb) 
(gdb) p vstate
$6 = {info = 0x7ffd33d29b70, stats = 0x23ea988, callback = 0x6bf507 <lazy_tid_reaped>, callback_state = 0x23eaaf8, 
  cycleid = 37964, lastBlockVacuumed = 0, lastBlockLocked = 0, totFreePages = 0, oldestBtpoXact = 0, 
  pagedelcontext = 0x23c1d00}

3.构造临时上下文


985     vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
(gdb)

4.循环遍历page
4.1获取relation锁


1012        needLock = !RELATION_IS_LOCAL(rel);
(gdb) p vstate
$6 = {info = 0x7ffd33d29b70, stats = 0x23ea988, callback = 0x6bf507 <lazy_tid_reaped>, callback_state = 0x23eaaf8, 
  cycleid = 37964, lastBlockVacuumed = 0, lastBlockLocked = 0, totFreePages = 0, oldestBtpoXact = 0, 
  pagedelcontext = 0x23c1d00}
(gdb) 
(gdb) n
1014        blkno = BTREE_METAPAGE + 1;
(gdb) 
1018            if (needLock)
(gdb) p needLock
$7 = true
(gdb) n
1019                LockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1020            num_pages = RelationGetNumberOfBlocks(rel);
(gdb) 
1021            if (needLock)
(gdb) p num_pages
$8 = 60
(gdb) n
1022                UnlockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1025            if (blkno >= num_pages)
(gdb) p blkno
$9 = 1
(gdb) n
1028            for (; blkno < num_pages; blkno++)
(gdb)

4.2遍历block,执行btvacuumpage


(gdb) 
1030                btvacuumpage(&vstate, blkno, blkno);
(gdb) 
1028            for (; blkno < num_pages; blkno++)
(gdb) 
1030                btvacuumpage(&vstate, blkno, blkno);
(gdb)

4.3如需要,多次遍历relation


(gdb) b nbtree.c:1018
Breakpoint 2 at 0x509a1f: file nbtree.c, line 1018.
(gdb) c
Continuing.
Breakpoint 2, btvacuumscan (info=0x7ffd33d29b70, stats=0x23ea988, callback=0x6bf507 <lazy_tid_reaped>, 
    callback_state=0x23eaaf8, cycleid=37964, oldestBtpoXact=0x7ffd33d29a40) at nbtree.c:1018
1018            if (needLock)
(gdb) n
1019                LockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1020            num_pages = RelationGetNumberOfBlocks(rel);
(gdb) 
1021            if (needLock)
(gdb) 
1022                UnlockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1025            if (blkno >= num_pages)
(gdb) p blkno
$11 = 60
(gdb) n
1026                break;
(gdb)

5.WAL Record处理


(gdb) n
1048        if (XLogStandbyInfoActive() &&
(gdb)

6.删除临时上下文


(gdb) 
1067        MemoryContextDelete(vstate.pagedelcontext);
(gdb)

7.处理空闲空间


(gdb) 
1081        if (vstate.totFreePages > 0)
(gdb) 
1082            IndexFreeSpaceMapVacuum(rel);
(gdb)

8.更新统计信息


(gdb) 
1085        stats->num_pages = num_pages;
(gdb) 
1086        stats->pages_free = vstate.totFreePages;
(gdb) 
1088        if (oldestBtpoXact)
(gdb) 
1089            *oldestBtpoXact = vstate.oldestBtpoXact;
(gdb) p oldestBtpoXact
$12 = (TransactionId *) 0x7ffd33d29a40
(gdb) p *oldestBtpoXact
$13 = 869440096
(gdb) p vstate.oldestBtpoXact
$14 = 397078
(gdb) n
1090    }
(gdb) p *stats
$15 = {num_pages = 60, pages_removed = 0, estimated_count = false, num_index_tuples = 8701, tuples_removed = 100, 
  pages_deleted = 7, pages_free = 6}
(gdb)

完成调用


(gdb) n
btbulkdelete (info=0x7ffd33d29b70, stats=0x23ea988, callback=0x6bf507 <lazy_tid_reaped>, callback_state=0x23eaaf8)
    at nbtree.c:880
880         _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
(gdb)

btvacuumpage


14:50:45 (xdb@[local]:5432)testdb=# vacuum verbose t1;
...............
(gdb) b btvacuumpage
Breakpoint 3 at 0x509b82: file nbtree.c, line 1106.
(gdb) 
(gdb) c
Continuing.
Breakpoint 3, btvacuumpage (vstate=0x7ffd33d298d0, blkno=1, orig_blkno=1) at nbtree.c:1106
1106        IndexVacuumInfo *info = vstate->info;
(gdb)

输入参数


(gdb) p *vstate
$16 = {info = 0x7ffd33d29b70, stats = 0x24157e8, callback = 0x6bf507 <lazy_tid_reaped>, callback_state = 0x2415958, 
  cycleid = 37965, lastBlockVacuumed = 0, lastBlockLocked = 0, totFreePages = 0, oldestBtpoXact = 0, 
  pagedelcontext = 0x23ea7a0}
(gdb)

1.初始化相关变量


(gdb) n
1107        IndexBulkDeleteResult *stats = vstate->stats;
(gdb) 
1108        IndexBulkDeleteCallback callback = vstate->callback;
(gdb) 
1109        void       *callback_state = vstate->callback_state;
(gdb) 
1110        Relation    rel = info->index;
(gdb) 
1115        BTPageOpaque opaque = NULL;
(gdb) 
1118        delete_now = false;
(gdb) 
1119        recurse_to = P_NONE;
(gdb) 
1122        vacuum_delay_point();
(gdb) p *info
$17 = {index = 0x7f6b76b0c268, analyze_only = false, estimated_count = true, message_level = 17, num_heap_tuples = 14840, 
  strategy = 0x2403478}
(gdb) p *stats
$18 = {num_pages = 0, pages_removed = 0, estimated_count = false, num_index_tuples = 0, tuples_removed = 0, 
  pages_deleted = 0, pages_free = 0}
(gdb) p rel
$19 = (Relation) 0x7f6b76b0c268
(gdb) p *rel
$20 = {rd_node = {spcNode = 1663, dbNode = 16402, relNode = 50823}, rd_smgr = 0x23d0270, rd_refcnt = 1, rd_backend = -1, 
  rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000', rd_statvalid = false, 
  rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x7f6b76bccd20, rd_att = 0x7f6b76bcc9b8, rd_id = 50823, 
  rd_lockInfo = {lockRelId = {relId = 50823, dbId = 16402}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, 
  rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, 
  rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, 
  rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, 
  rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x7f6b76bcc8d8, rd_indextuple = 0x7f6b76bcc8a0, 
  rd_amhandler = 330, rd_indexcxt = 0x236b340, rd_amroutine = 0x236b480, rd_opfamily = 0x236b598, rd_opcintype = 0x236b5b8, 
  rd_support = 0x236b5d8, rd_supportinfo = 0x236b600, rd_indoption = 0x236b738, rd_indexprs = 0x0, rd_indpred = 0x0, 
  rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x236b718, 
  rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x23c4198}
(gdb)

2.调用ReadBufferExtended读取block到buffer中,锁定buffer,获取page


(gdb) 
1130        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
(gdb) n
1132        LockBuffer(buf, BT_READ);
(gdb) 
1133        page = BufferGetPage(buf);
(gdb) 
1134        if (!PageIsNew(page))
(gdb) p *page
$21 = 1 '\001'
(gdb) p page
$22 = (Page) 0x7f6b4add8380 "\001"
(gdb) p *(PageHeader)page
$23 = {pd_lsn = {xlogid = 1, xrecoff = 1320733408}, pd_checksum = 0, pd_flags = 0, pd_lower = 28, pd_upper = 8168, 
  pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp = 0x7f6b4add8398}
(gdb)

3.如果不是new page,则执行检查并获取BTPageOpaque


(gdb) n
1136            _bt_checkpage(rel, buf);
(gdb) 
1137            opaque = (BTPageOpaque) PageGetSpecialPointer(page);
(gdb) p buf
$24 = 224
(gdb) n
1145        if (blkno != orig_blkno)
(gdb) p opaque
$25 = (BTPageOpaque) 0x7f6b4adda370
(gdb) p *opaque
$26 = {btpo_prev = 0, btpo_next = 33, btpo = {level = 397073, xact = 397073}, btpo_flags = 5, btpo_cycleid = 0}
(gdb)

4.如块号与原始不同,正在进行递归处理,如page可回收或者可忽略或者不是叶子节点或者cycleid不同,则调用_bt_relbuf,返回;否则继续往下执行


(gdb) n
1145        if (blkno != orig_blkno)

5.执行相关判断
5.1如page可回收,则回收页面


(gdb) n
1158        if (_bt_page_recyclable(page))
(gdb) 
1161            RecordFreeIndexPage(rel, blkno);
(gdb) 
1162            vstate->totFreePages++;
(gdb) p blkno
$27 = 1
(gdb) n
1163            stats->pages_deleted++;
(gdb)

5.2如page已删除,但不能回收,则更新统计信息


N/A

5.3如page为Half-dead,则尝试删除(设置delete_now标记为T)


N/A

5.5如试着删除(delete_now为T),则调用_bt_pagedel删除,更新统计信息
否则调用_bt_relbuf


1329        if (delete_now)
(gdb) 
1353            _bt_relbuf(rel, buf);

5.6判断recurse_to != P_NONE,如T,则重新启动,否则退出


1362        if (recurse_to != P_NONE)
(gdb) p recurse_to
$29 = 0
(gdb) p P_NONE
$30 = 0
(gdb) n
1367    }
(gdb)

进入page为叶子节点的逻辑
5.4如page为叶子节点


(gdb) del 
Delete all breakpoints? (y or n) y
(gdb) b nbtree.c:1182
Breakpoint 5 at 0x509e61: file nbtree.c, line 1182.
(gdb) c
Continuing.
Breakpoint 5, btvacuumpage (vstate=0x7ffd33d298d0, blkno=6, orig_blkno=6) at nbtree.c:1194
1194            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
(gdb)

5.4.1初始化变量


N/A

5.4.2锁定缓冲区


1194            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
(gdb) N
1195            LockBufferForCleanup(buf);
(gdb)

5.4.3记录已取得cleanup lock的最大叶子页编号


(gdb) 
1201            if (blkno > vstate->lastBlockLocked)
(gdb) p blkno
$31 = 6
(gdb) p vstate->lastBlockLocked
$32 = 0
(gdb) n
1202                vstate->lastBlockLocked = blkno;
(gdb)

5.4.4检查我们是否需要递归回先前的页面


(gdb) 
1211            if (vstate->cycleid != 0 &&
(gdb) p vstate->cycleid
$33 = 37965
(gdb) p opaque->btpo_cycleid
$34 = 0
(gdb) p vstate->cycleid
$35 = 37965
(gdb) 
(gdb) n
1212                opaque->btpo_cycleid == vstate->cycleid &&
(gdb) 
1211            if (vstate->cycleid != 0 &&
(gdb) 
1222            ndeletable = 0;
(gdb) 
1223            minoff = P_FIRSTDATAKEY(opaque);
(gdb) 
1224            maxoff = PageGetMaxOffsetNumber(page);
(gdb) 
1225            if (callback)
(gdb) p minoff
$36 = 2
(gdb) p maxoff
$37 = 174
(gdb)

5.4.5扫描所有条目,看看哪些根据回调函数得到的需要删除的条目(写入到deletable数组中)


(gdb) n
1227                for (offnum = minoff;
(gdb) 
1234                    itup = (IndexTuple) PageGetItem(page,
(gdb) 
1236                    htup = &(itup->t_tid);
(gdb) p *itup
$38 = {t_tid = {ip_blkid = {bi_hi = 0, bi_lo = 103}, ip_posid = 138}, t_info = 16}
(gdb) n
1259                    if (callback(htup, callback_state))
(gdb) p *htup
$41 = {ip_blkid = {bi_hi = 0, bi_lo = 103}, ip_posid = 138}
(gdb)

进入回调函数lazy_tid_reaped


(gdb) step
lazy_tid_reaped (itemptr=0x7f6b4addfd40, state=0x2415958) at vacuumlazy.c:2140
2140        LVRelStats *vacrelstats = (LVRelStats *) state;
(gdb)

调用bsearch判断是否满足条件,返回NULL,不满足


(gdb) n
2145                                    vacrelstats->num_dead_tuples,
(gdb) 
2143        res = (ItemPointer) bsearch((void *) itemptr,
(gdb) 
2144                                    (void *) vacrelstats->dead_tuples,
(gdb) 
2143        res = (ItemPointer) bsearch((void *) itemptr,
(gdb) 
2149        return (res != NULL);
(gdb) p res
$42 = (ItemPointer) 0x0
(gdb) 
(gdb) n
2150    }
(gdb) 
btvacuumpage (vstate=0x7ffd33d298d0, blkno=6, orig_blkno=6) at nbtree.c:1229
1229                     offnum = OffsetNumberNext(offnum))
(gdb)

5.4.6如数组不为空,则调用_bt_delitems_vacuum,记录相关信息
如数组为空,判断页面是否在这个vacuum cycle中被分裂,清除btpo_cycleid标记,标记缓冲区为脏


(gdb) del
Delete all breakpoints? (y or n) y
(gdb) b nbtree.c:1284
Breakpoint 7 at 0x50a035: file nbtree.c, line 1284.
(gdb) c
Continuing.
Breakpoint 7, btvacuumpage (vstate=0x7ffd33d298d0, blkno=48, orig_blkno=48) at nbtree.c:1284
1284                _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
(gdb) 
(gdb) n
1291                if (blkno > vstate->lastBlockVacuumed)
(gdb) p blkno
$43 = 48
(gdb) p vstate->lastBlockVacuumed
$44 = 0
(gdb) n
1292                    vstate->lastBlockVacuumed = blkno;
(gdb) 
1294                stats->tuples_removed += ndeletable;
(gdb) 
1296                maxoff = PageGetMaxOffsetNumber(page);
(gdb) 
1323            if (minoff > maxoff)
(gdb) p minoff
$45 = 2
(gdb) p maxoff
$46 = 67
(gdb) n
1326                stats->num_index_tuples += maxoff - minoff + 1;
(gdb)

DONE!

四、参考资料

PG Source Code

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/6906/viewspace-2565327/,如需转载,请注明出处,否则将追究法律责任。

下一篇: 闲话杂谈之境界
请登录后发表评论 登录
全部评论
长期从事政务、金融等行业产品研发和架构设计工作,对Oracle、PostgreSQL以及大数据等相关技术有深入研究。现就职于广州云图数据技术有限公司,系统架构师。

注册时间:2007-12-28

  • 博文量
    1169
  • 访问量
    3634699