Lucene Source Code Analysis: Analyzer Series, Introduction to IndexingChain (Part 1)

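Original post. Category: Linux Operating System. Author: 百联达. Date: 2013-07-29 13:34:24

Document indexing is carried out by the internal data-processing chain of DocumentsWriter. This post traces the code to show how that indexing chain is created.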

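1: The following code is a simple example of building an index. The statements we will trace are the two constructor calls, new IndexWriterConfig(...) and new IndexWriter(...) (marked in red in the original post).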

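import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexTest {
  public static void main(String[] args) {
    try {
      File fileDir = new File("F:\\document");
      // Traced below: builds the configuration object, including the default indexing chain.
      IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
      config.setInfoStream(System.out);
      config.setOpenMode(OpenMode.CREATE);
      // Traced below: the IndexWriter constructor creates the DocumentsWriter.
      IndexWriter writer = new IndexWriter(FSDirectory.open(new File("F:\\index")), config);
      for (File file : fileDir.listFiles()) {
        Document document = new Document();
        // File content is tokenized and indexed; the file name is indexed as a single stored term.
        document.add(new TextField("content", new FileReader(file)));
        document.add(new StringField("title", file.getName(), Store.YES));
        writer.addDocument(document);
      }
      writer.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}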

2: IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43))
This statement creates the configuration object through the IndexWriterConfig constructor. Stepping into the constructor:

  public IndexWriterConfig(Version matchVersion, Analyzer analyzer) {
    super(analyzer, matchVersion);
  }

we see that it calls the constructor of its parent class, LiveIndexWriterConfig. Tracing further:

  LiveIndexWriterConfig(Analyzer analyzer, Version matchVersion) {
    this.analyzer = analyzer;
    this.matchVersion = matchVersion;
    ramBufferSizeMB = IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB;
    maxBufferedDocs = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS;
    maxBufferedDeleteTerms = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DELETE_TERMS;
    readerTermsIndexDivisor = IndexWriterConfig.DEFAULT_READER_TERMS_INDEX_DIVISOR;
    mergedSegmentWarmer = null;
    termIndexInterval = IndexWriterConfig.DEFAULT_TERM_INDEX_INTERVAL; // TODO: this should be private to the codec, not settable here
    delPolicy = new KeepOnlyLastCommitDeletionPolicy();
    commit = null;
    openMode = OpenMode.CREATE_OR_APPEND;
    similarity = IndexSearcher.getDefaultSimilarity();
    mergeScheduler = new ConcurrentMergeScheduler();
    writeLockTimeout = IndexWriterConfig.WRITE_LOCK_TIMEOUT;
    indexingChain = DocumentsWriterPerThread.defaultIndexingChain;
    codec = Codec.getDefault();
    if (codec == null) {
      throw new NullPointerException();
    }
    infoStream = InfoStream.getDefault();
    mergePolicy = new TieredMergePolicy();
    flushPolicy = new FlushByRamOrCountsPolicy();
    readerPooling = IndexWriterConfig.DEFAULT_READER_POOLING;
    indexerThreadPool = new ThreadAffinityDocumentsWriterThreadPool(IndexWriterConfig.DEFAULT_MAX_THREAD_STATES);
    perThreadHardLimitMB = IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB;
  }

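The two assignments we care about here are indexingChain = DocumentsWriterPerThread.defaultIndexingChain and indexerThreadPool = new ThreadAffinityDocumentsWriterThreadPool(...) (marked in red in the original post).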
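
The constructor above only fills in defaults, and the sample code then overrides one of them with config.setOpenMode(OpenMode.CREATE). A minimal sketch of that defaults-then-override pattern follows. ConfigDefaultsSketch is a hypothetical stand-in, not Lucene's IndexWriterConfig, and it models only the openMode field; returning this from the setter is a choice made for the sketch.

```java
// Hypothetical, simplified stand-in for a config object; not Lucene's class.
class ConfigDefaultsSketch {
    enum OpenMode { CREATE, APPEND, CREATE_OR_APPEND }

    // Default assigned at construction time, mirroring the constructor traced above.
    private OpenMode openMode = OpenMode.CREATE_OR_APPEND;

    // The setter replaces the default and returns this so calls can be chained.
    ConfigDefaultsSketch setOpenMode(OpenMode mode) {
        if (mode == null) {
            throw new IllegalArgumentException("openMode must not be null");
        }
        this.openMode = mode;
        return this;
    }

    OpenMode getOpenMode() {
        return openMode;
    }

    public static void main(String[] args) {
        ConfigDefaultsSketch config = new ConfigDefaultsSketch();
        System.out.println("default:  " + config.getOpenMode());
        config.setOpenMode(OpenMode.CREATE);
        System.out.println("override: " + config.getOpenMode());
    }
}
```

Any field the caller does not touch keeps its constructor default, which is why the sample never has to mention the indexing chain or the thread pool explicitly.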

3: First, look at indexerThreadPool = new ThreadAffinityDocumentsWriterThreadPool(IndexWriterConfig.DEFAULT_MAX_THREAD_STATES), which builds the thread pool used for indexing. We trace the constructor of ThreadAffinityDocumentsWriterThreadPool:

  public ThreadAffinityDocumentsWriterThreadPool(int maxNumPerThreads) {
    super(maxNumPerThreads);
    assert getMaxThreadStates() >= 1;
  }

It calls the constructor of its parent class, DocumentsWriterPerThreadPool:

  DocumentsWriterPerThreadPool(int maxNumThreadStates) {
    if (maxNumThreadStates < 1) {
      throw new IllegalArgumentException("maxNumThreadStates must be >= 1 but was: " + maxNumThreadStates);
    }
    threadStates = new ThreadState[maxNumThreadStates];
    numThreadStatesActive = 0;
  }

At this point we can see that a ThreadState array is created, with a default size of 8. Analyzing ThreadState shows that each ThreadState is associated with one DocumentsWriterPerThread, and DocumentsWriterPerThread holds the key parts of the indexing chain.
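
To make the state right after this constructor concrete, here is a minimal sketch, assuming a simplified model of the pool. ThreadStatePoolSketch and Slot are hypothetical names, not Lucene classes. It shows that construction only validates the size and allocates the array; no slot holds a per-thread writer yet.

```java
// Hypothetical, simplified model of the pool constructor traced above; not Lucene's code.
class ThreadStatePoolSketch {
    static final int DEFAULT_MAX_THREAD_STATES = 8; // default size stated in the text

    /** Stand-in for ThreadState: one slot that will later hold a per-thread writer. */
    static final class Slot {
        Object perThreadWriter; // stays unset until the pool is initialized
    }

    private final Slot[] slots;
    private int numActive;

    ThreadStatePoolSketch(int maxSlots) {
        if (maxSlots < 1) {
            throw new IllegalArgumentException("maxSlots must be >= 1 but was: " + maxSlots);
        }
        slots = new Slot[maxSlots]; // the array is allocated, its entries are still null
        numActive = 0;
    }

    int capacity() {
        return slots.length;
    }

    int numActive() {
        return numActive;
    }

    boolean isSlotBound(int i) {
        return slots[i] != null;
    }

    public static void main(String[] args) {
        ThreadStatePoolSketch pool = new ThreadStatePoolSketch(DEFAULT_MAX_THREAD_STATES);
        System.out.println("capacity=" + pool.capacity() + ", active=" + pool.numActive()
                + ", slot 0 bound=" + pool.isSlotBound(0));
    }
}
```
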
4: Next we look at how each element of the ThreadState array gets associated with a DocumentsWriterPerThread. Go back to this line of the indexing sample:
  IndexWriter writer = new IndexWriter(FSDirectory.open(new File("F:\\index")), config);
Tracing into the IndexWriter constructor, we find the following statement, which creates the DocumentsWriter object:
  docWriter = new DocumentsWriter(codec, config, directory, this, globalFieldNumberMap, bufferedDeletesStream);

We continue into the DocumentsWriter constructor:

  DocumentsWriter(Codec codec, LiveIndexWriterConfig config, Directory directory, IndexWriter writer, FieldNumbers globalFieldNumbers,
      BufferedDeletesStream bufferedDeletesStream) {
    this.codec = codec;
    this.directory = directory;
    this.indexWriter = writer;
    this.infoStream = config.getInfoStream();
    this.similarity = config.getSimilarity();
    this.perThreadPool = config.getIndexerThreadPool();
    this.chain = config.getIndexingChain();
    this.perThreadPool.initialize(this, globalFieldNumbers, config);
    flushPolicy = config.getFlushPolicy();
    assert flushPolicy != null;
    flushPolicy.init(this);
    flushControl = new DocumentsWriterFlushControl(this, config);
  }

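The call this.perThreadPool.initialize(this, globalFieldNumbers, config) (marked in red in the original post) initializes the indexing thread pool. Here is what the initialization does: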

  void initialize(DocumentsWriter documentsWriter, FieldNumbers globalFieldMap, LiveIndexWriterConfig config) {
    this.documentsWriter.set(documentsWriter); // thread pool is bound to DW
    this.globalFieldMap.set(globalFieldMap);
    for (int i = 0; i < threadStates.length; i++) {
      final FieldInfos.Builder infos = new FieldInfos.Builder(globalFieldMap);
      threadStates[i] = new ThreadState(new DocumentsWriterPerThread(documentsWriter.directory, documentsWriter, infos, documentsWriter.chain));
    }
  }

As shown, every element of the pool's threadStates array is initialized and bound to its own DocumentsWriterPerThread instance.

5: Now look at the constructor of DocumentsWriterPerThread:

  public DocumentsWriterPerThread(Directory directory, DocumentsWriter parent,
      FieldInfos.Builder fieldInfos, IndexingChain indexingChain) {
    this.directoryOrig = directory;
    this.directory = new TrackingDirectoryWrapper(directory);
    this.parent = parent;
    this.fieldInfos = fieldInfos;
    this.writer = parent.indexWriter;
    this.infoStream = parent.infoStream;
    this.codec = parent.codec;
    this.docState = new DocState(this, infoStream);
    this.docState.similarity = parent.indexWriter.getConfig().getSimilarity();
    bytesUsed = Counter.newCounter();
    byteBlockAllocator = new DirectTrackingAllocator(bytesUsed);
    pendingDeletes = new BufferedDeletes();
    intBlockAllocator = new IntBlockAllocator(bytesUsed);
    initialize();
    consumer = indexingChain.getChain(this);
  }

The last statement of the constructor, consumer = indexingChain.getChain(this), gives each DocumentsWriterPerThread its own indexing chain.
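
The relationship between the single chain object held by DocumentsWriter and the per-writer consumers can be sketched as a factory. This is a simplified illustration under stated assumptions: Consumer, ChainFactory and PerThreadWriter are hypothetical stand-ins for DocConsumer, IndexingChain and DocumentsWriterPerThread, and none of Lucene's real state is modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; the names mirror roles in the text, not Lucene's API.
class PerThreadChainSketch {
    /** Stand-in for DocConsumer: the head of one indexing chain. */
    interface Consumer {
        String describe();
    }

    /** Stand-in for IndexingChain: a factory that builds a chain for one writer. */
    interface ChainFactory {
        Consumer getChain(PerThreadWriter writer);
    }

    /** Stand-in for DocumentsWriterPerThread. */
    static final class PerThreadWriter {
        final int id;
        final Consumer consumer;

        PerThreadWriter(int id, ChainFactory factory) {
            this.id = id;
            // Last step of construction, as in the constructor above:
            // ask the shared factory for a chain bound to this writer.
            this.consumer = factory.getChain(this);
        }
    }

    /** Mirrors initialize(): create one writer per slot, all using the same factory. */
    static List<PerThreadWriter> initialize(int slots, ChainFactory factory) {
        List<PerThreadWriter> writers = new ArrayList<>();
        for (int i = 0; i < slots; i++) {
            writers.add(new PerThreadWriter(i, factory));
        }
        return writers;
    }

    public static void main(String[] args) {
        ChainFactory defaultChain = writer -> () -> "chain for writer " + writer.id;
        for (PerThreadWriter w : initialize(3, defaultChain)) {
            System.out.println(w.consumer.describe());
        }
    }
}
```

The factory is shared, but every writer receives the result of its own getChain call, so per-document state inside one chain is never touched by another writer.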

6: Finally, let us look at what the indexing chain contains:

  static final IndexingChain defaultIndexingChain = new IndexingChain() {
    @Override
    DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) {
      final TermsHashConsumer termVectorsWriter = new TermVectorsConsumer(documentsWriterPerThread);
      final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();
      final InvertedDocConsumer termsHash = new TermsHash(documentsWriterPerThread, freqProxWriter, true,
                                                          new TermsHash(documentsWriterPerThread, termVectorsWriter, false, null));
      final NormsConsumer normsWriter = new NormsConsumer();
      final DocInverter docInverter = new DocInverter(documentsWriterPerThread.docState, termsHash, normsWriter);
      final StoredFieldsConsumer storedFields = new TwoStoredFieldsConsumers(
                                                      new StoredFieldsProcessor(documentsWriterPerThread),
                                                      new DocValuesProcessor(documentsWriterPerThread.bytesUsed));
      return new DocFieldProcessor(documentsWriterPerThread, docInverter, storedFields);
    }
  };
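
The nesting of the constructor calls in getChain is easier to read as a tree. The sketch below, offered as an illustration only, rebuilds that shape with a hypothetical Node class that records a name and its children; it performs no indexing, and the labels "outer" and "inner" for the two TermsHash instances are names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: each node only records its name and children; no indexing happens.
class ChainShapeSketch {
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        /** Depth-first listing, indented two spaces per level. */
        void collect(int depth, List<String> out) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            out.add(line.append(name).toString());
            for (Node child : children) {
                child.collect(depth + 1, out);
            }
        }
    }

    /** Wires the nodes in the same nesting order as getChain above. */
    static Node defaultChain() {
        Node termVectors = new Node("TermVectorsConsumer");
        Node freqProx = new Node("FreqProxTermsWriter");
        Node innerHash = new Node("TermsHash (inner)", termVectors);
        Node outerHash = new Node("TermsHash (outer)", freqProx, innerHash);
        Node norms = new Node("NormsConsumer");
        Node docInverter = new Node("DocInverter", outerHash, norms);
        Node storedFields = new Node("TwoStoredFieldsConsumers",
                new Node("StoredFieldsProcessor"), new Node("DocValuesProcessor"));
        return new Node("DocFieldProcessor", docInverter, storedFields);
    }

    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        defaultChain().collect(0, lines);
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe()) {
            System.out.println(line);
        }
    }
}
```

Running main prints an indented outline with DocFieldProcessor at the root and two branches under it: DocInverter for the inverted data (terms, term vectors, norms) and TwoStoredFieldsConsumers for stored fields and doc values.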

For the call flow of the indexing chain, see the figure below.

7: To summarize, every IndexWriter, when created, is given a thread pool with a default size of 8. The pool holds DocumentsWriterPerThread instances, and each of them has a default IndexingChain associated with it.

[Figure: ch.jpg, call flow of the indexing chain]

Source: ITPUB Blog, link: http://blog.itpub.net/28624388/viewspace-767366/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
