yangjava
diff --git a/‎_posts/2015-09-23-Linux下C实战.md‎
Lines changed: 2 additions & 2 deletions b/‎_posts/2015-09-23-Linux下C实战.md‎
diff --git a/‎_posts/2016-04-01-Netty 简介.md‎
Lines changed: 1 addition & 1 deletion b/‎_posts/2016-04-01-Netty 简介.md‎
diff --git a/‎_posts/2018-04-01-Lucene入门简介.md‎
Lines changed: 5 additions & 0 deletions b/‎_posts/2018-04-01-Lucene入门简介.md‎
diff --git a/‎_posts/2018-04-01-Lucene全文检索原理.md‎
Lines changed: 325 additions & 0 deletions b/‎_posts/2018-04-01-Lucene全文检索原理.md‎
diff --git a/‎_posts/2018-04-02-Lucene源码分析.md‎
Lines changed: 32 additions & 7 deletions b/‎_posts/2018-04-02-Lucene源码分析.md‎
diff --git a/‎_posts/2018-04-03-Lucene源码IndexWriter.md‎
Lines changed: 39 additions & 1 deletion b/‎_posts/2018-04-03-Lucene源码IndexWriter.md‎
diff --git a/‎_posts/2018-04-03-Lucene源码索引流程.md‎
Lines changed: 4 additions & 1 deletion b/‎_posts/2018-04-03-Lucene源码索引流程.md‎
diff --git a/‎_posts/2018-04-05-Lucene源码索引Terms.md‎
Lines changed: 7 additions & 0 deletions b/‎_posts/2018-04-05-Lucene源码索引Terms.md‎
diff --git a/‎_posts/2018-04-05-Lucene源码索引文件Document.md‎
Lines changed: 28 additions & 0 deletions b/‎_posts/2018-04-05-Lucene源码索引文件Document.md‎
@@ -1,8 +1,8 @@
 ---
 layout: post
-categories: [Linux,C]
+categories: [C]
 description: none
-keywords: Linux,C
+keywords: C
 ---
 # C Programming in Practice on Linux
 C is the traditional language for program development on UNIX. It adapts well across platforms and is easy to port.
 
@@ -1,6 +1,6 @@
 ---
 layout: post
-categories: Netty
+categories: [Netty]
 description: none
 keywords: Netty
 ---
 
@@ -172,4 +172,9 @@ Maven dependencies
     </dependencies>
 ```
 
+## References
+The blog of 刘超觉先
 
+Lucene Java DOC [https://lucene.apache.org/core/7_5_0/core/org/apache/lucene/codecs/lucene70/package-summary.html]
+
+Lucene in Action [https://www.manning.com/books/lucene-in-action-second-edition]
@@ -290,13 +290,38 @@ IndexSearcher is an important class that is the counterpart of IndexWriter; its methods are very
 - TopDocs
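 TopDocs holds the top N search results, each identified by a docId. Several important pieces of information can be read from it, such as the total number of hits, the array of result objects, and the maximum score.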
 
-
-
-
-
-
-
-
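+## Lucene components
+- A document to be indexed is represented by a Document object.
+- IndexWriter adds documents to the index through addDocument, which is how the index gets built.
+- Lucene's index is an inverted index.
+- When a user makes a request, a Query represents the user's query statement.
+- IndexSearcher searches the Lucene index through its search function.
+- IndexSearcher computes term weights and scores and returns the results to the user.
+- The set of documents returned to the user is represented by TopDocsCollector.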
+
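+## Workflow
+
+### The indexing process
+- Create an IndexWriter to write the index files. It takes several parameters: INDEX_DIR is where the index files are stored, and the Analyzer performs lexical analysis and language processing on documents.
+- Create a Document to represent the document to be indexed.
+- Add the different Fields to the document. A document carries several kinds of information, such as title, author, modification time, and content, and each kind is represented by its own Field. This example indexes two kinds: the file path and the file contents. SRC_FILE, passed to FileReader, is the source file to index.
+- IndexWriter calls addDocument to write the index into the index directory.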
+
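+### The search process
+- IndexReader reads the index from disk into memory; INDEX_DIR is where the index files are stored.
+- Create an IndexSearcher to prepare for searching.
+- Create an Analyzer to perform lexical analysis and language processing on the query.
+- Create a QueryParser to analyze the syntax of the query.
+- QueryParser calls parse to do the syntax analysis, builds a query syntax tree, and stores it in a Query.
+- IndexSearcher calls search on the Query syntax tree and obtains the results in a TopScoreDocCollector.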
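
The indexing and search steps above can be sketched with a toy inverted index. This is an illustrative stand-in built only on the Java standard library, not Lucene's implementation: the class `ToyIndex`, its `analyze` helper, and the ranking by raw term frequency are all assumptions made for the example, and real Lucene analysis and scoring are considerably more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> (docId -> term frequency). Hypothetical, not Lucene code. */
final class ToyIndex {
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Stand-in for an Analyzer: lowercases and splits on non-alphanumeric characters. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Stand-in for IndexWriter.addDocument: indexes the text and returns its docId. */
    int addDocument(String contents) {
        int docId = nextDocId++;
        for (String term : analyze(contents)) {
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Stand-in for IndexSearcher.search: top-n docIds for a single-term query. */
    List<Integer> search(String queryText, int n) {
        List<String> analyzed = analyze(queryText);
        if (analyzed.isEmpty()) {
            return new ArrayList<>();
        }
        Map<Integer, Integer> hits = postings.getOrDefault(analyzed.get(0), new TreeMap<>());
        List<Integer> docIds = new ArrayList<>(hits.keySet());
        // Higher term frequency first; ties are broken by lower docId.
        docIds.sort((a, b) -> {
            int byFreq = Integer.compare(hits.get(b), hits.get(a));
            return byFreq != 0 ? byFreq : Integer.compare(a, b);
        });
        return new ArrayList<>(docIds.subList(0, Math.min(n, docIds.size())));
    }
}
```

The query goes through the same `analyze` step as the documents, which mirrors why the search flow creates an Analyzer before parsing the query: a term can only match if both sides were normalized the same way.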
+
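+## Lucene package structure
+- The analysis module handles lexical analysis and language processing to produce Terms.
+- The index module handles index creation; IndexWriter lives here.
+- The store module handles reading and writing the index.
+- QueryParser handles syntax analysis of queries.
+- The search module handles searching the index.
+- The similarity module implements relevance scoring.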
 
 
 
 
@@ -7,6 +7,44 @@ keywords: Lucene
 # Lucene Source Code: IndexWriter
 This article goes inside IndexWriter to explore its core implementation.
 
+## Creating the IndexWriter object
+```
+IndexWriter indexWriter = new IndexWriter(directory, config);
+```
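+An IndexWriter object mainly holds the following kinds of information:
+
+### Used for indexing documents
+- Directory directory; points to the index directory.
+- SegmentInfos segmentInfos = new SegmentInfos(); holds the segment information. As you will notice, it corresponds almost one-to-one with the contents of segments_N.
+- IndexFileDeleter deleter; despite its name, this object does not delete documents. It manages the index files.
+- Lock writeLock; only one IndexWriter can be open on an index directory at a time, so a lock is needed.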
+
+### Used for merging segments (covered in detail in the article on segment merging)
+```
+  // Holds all SegmentInfo instances currently involved in
+  // merges
+  private final HashSet<SegmentCommitInfo> mergingSegments = new HashSet<>();
+  private final MergeScheduler mergeScheduler;
+  private final Set<SegmentMerger> runningAddIndexesMerges = new HashSet<>();
+  private final LinkedList<MergePolicy.OneMerge> pendingMerges = new LinkedList<>();
+  private final Set<MergePolicy.OneMerge> runningMerges = new HashSet<>();
+  private final List<MergePolicy.OneMerge> mergeExceptions = new ArrayList<>();
+  private long mergeGen;
+  private Merges merges = new Merges();
+```
+
+### Maintaining index integrity, consistency, and transactionality
+After IndexWriter has added or deleted documents, you can call commit to write the changes to the index files, or call rollback to discard every change made since the last commit.
+```
+  private List<SegmentCommitInfo> rollbackSegments;      // list of segmentInfo we will fallback to if the commit fails
+
+  private volatile SegmentInfos pendingCommit;            // set when a commit is pending (after prepareCommit() & before commit())
+```
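
The commit and rollback behavior described here can be pictured with a small model. The sketch below is an assumption-laden toy, not IndexWriter's code: `ToyCommitLog` is a hypothetical class, segment names are plain strings standing in for SegmentCommitInfo, and only the field name `rollbackSegments` is borrowed from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of commit/rollback over a list of segment names. Hypothetical, not Lucene code. */
final class ToyCommitLog {
    // Snapshot taken at the last commit; rollback() restores it.
    private List<String> rollbackSegments = new ArrayList<>();
    // Live view, including changes made since the last commit.
    private final List<String> segments = new ArrayList<>();

    void addSegment(String name) {
        segments.add(name);
    }

    void deleteSegment(String name) {
        segments.remove(name);
    }

    /** Makes the current state the new rollback point. */
    void commit() {
        rollbackSegments = new ArrayList<>(segments);
    }

    /** Discards every change made since the last commit. */
    void rollback() {
        segments.clear();
        segments.addAll(rollbackSegments);
    }

    /** Returns a copy so callers cannot mutate the internal state. */
    List<String> liveSegments() {
        return new ArrayList<>(segments);
    }
}
```

In this model, adding `_0`, committing, then adding `_1` and deleting `_0` before calling `rollback()` leaves only `_0` live, which is the "undo everything since the last commit" behavior the text describes.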
+
+
+
+
+
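 ### Concurrency model
 The core interfaces that IndexWriter provides are all thread-safe, and internally it applies dedicated concurrency optimizations to improve multi-threaded write performance. IndexWriter sets aside a separate space for each thread to write into, and that space is managed by DocumentsWriterPerThread.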
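
One way to picture "a separate space per thread" is a pool of private buffers that threads check out, write to, and hand back. The sketch below is a simplified assumption, not how DocumentsWriterPerThread is actually implemented: `ToyPerThreadPool`, `Buffer`, `obtain`, and `release` are hypothetical names chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy pool of per-thread write buffers. Hypothetical, not Lucene code. */
final class ToyPerThreadPool {
    /** One private buffer; only the thread that checked it out touches it. */
    static final class Buffer {
        final List<String> docs = new ArrayList<>();
    }

    private final Deque<Buffer> free = new ArrayDeque<>();
    private final List<Buffer> all = new ArrayList<>();

    /** Checks out an idle buffer, creating a new one if none is free. */
    synchronized Buffer obtain() {
        Buffer b = free.poll();
        if (b == null) {
            b = new Buffer();
            all.add(b);
        }
        return b;
    }

    /** Returns a buffer to the pool so another thread can reuse it. */
    synchronized void release(Buffer b) {
        free.push(b);
    }

    /** Adds one document; the append itself runs outside the pool lock. */
    void addDocument(String doc) {
        Buffer b = obtain();
        try {
            b.docs.add(doc);
        } finally {
            release(b);
        }
    }

    /** Total buffered documents; call only when no writer thread is active. */
    synchronized int totalDocs() {
        int n = 0;
        for (Buffer b : all) {
            n += b.docs.size();
        }
        return n;
    }

    synchronized int bufferCount() {
        return all.size();
    }
}
```

The pool lock is held only while a buffer is handed out or returned, so writers contend briefly on bookkeeping rather than on the write itself. The number of buffers never exceeds the number of threads writing at the same moment.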
 
@@ -26,7 +64,7 @@ The core interfaces that IndexWriter provides are all thread-safe, and internally it applies special
 ## add & update
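 The add interface adds new documents and the update interface updates existing ones. Lucene's update differs from a database update, though: a database looks a row up and modifies it, whereas Lucene looks the document up, deletes it, and adds a new one. Updating only some of the fields inside a document is not supported. The flow is delete by term first, then add document.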
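
The "delete by term, then add document" flow can be shown with a toy store. This is a sketch under stated assumptions rather than Lucene's code path: `ToyUpdatableIndex` and its methods are hypothetical, documents are plain field maps, and a term is modeled as an exact field/value match.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy document store showing update as delete-by-term followed by add. Hypothetical. */
final class ToyUpdatableIndex {
    // docId -> fields; insertion order follows increasing docIds.
    private final Map<Integer, Map<String, String>> docs = new LinkedHashMap<>();
    private int nextDocId = 0;

    /** Stores a copy of the fields under a fresh docId. */
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        docs.put(docId, new LinkedHashMap<>(fields));
        return docId;
    }

    /** Deletes every document whose field equals value; returns how many were removed. */
    int deleteByTerm(String field, String value) {
        int before = docs.size();
        docs.values().removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Whole-document replacement: there is no partial, in-place field update. */
    int updateDocument(String field, String value, Map<String, String> replacement) {
        deleteByTerm(field, value);
        return addDocument(replacement);
    }

    /** Returns the fields of a live document, or null if it was deleted or never existed. */
    Map<String, String> get(int docId) {
        return docs.get(docId);
    }

    List<Integer> liveDocIds() {
        return new ArrayList<>(docs.keySet());
    }
}
```

In this toy the replacement is stored under a fresh docId and the old one disappears, which is why the caller has to supply the complete new document rather than only the fields that changed.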
 
-Both the add and update interfaces of IndexWriter map to the udpate interface of DocumentsWriter. Its definition:
+Both the add and update interfaces of IndexWriter map to the update interface of DocumentsWriter. Its definition:
 ```
 long updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
     final Term delTerm) throws IOException, AbortingException
 
@@ -48,6 +48,7 @@ IndexWriterConfig provides some options for advanced users to tune performance and customize functionality
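 Lucene exposes management of commit points, which makes features such as snapshots possible. The DeletionPolicy configured by default keeps only the most recent commit point.
 - Similarity
 Relevance is at the heart of search, and Similarity is the abstract interface for relevance algorithms. Lucene ships implementations of TF-IDF and BM25. Relevance is computed both when data is written and when it is searched: the write-time computation is called index-time boosting, which computes the normalization and writes it into the index, and the search-time computation is called query-time boosting.
+Similarity similarity = Similarity.getDefault(); This affects the normalization factor part of scoring. A document's score has two parts: one computed at indexing time, independent of the query, and one computed at search time, dependent on the query.
 - MergePolicy
 Writing data produces many Segments inside Lucene, and a query searches multiple Segments and merges their results. The number of Segments therefore affects query efficiency to some degree, so Segments need to be merged. This process is called Merge, and MergePolicy decides when a Merge is triggered.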
 - MergeScheduler
@@ -394,8 +395,8 @@ IndexWriter::updateDocuments->DocumentsWriter::updateDocuments
     return postUpdate(flushingDWPT, hasEvents);
   }
 ```
-
 updateDocuments first calls the preUpdate function to handle data that has not yet been written to disk. The code is as follows.
+
 IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->preUpdate
 ```
   private boolean preUpdate() throws IOException, AbortingException {
@@ -418,6 +419,7 @@ IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->preUpdate
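 flushControl is the DocumentsWriterFlushControl created in the DocumentsWriter constructor. The preUpdate function takes DocumentsWriterPerThread instances from DocumentsWriterFlushControl one at a time. In Lucene only one IndexWriter can hold the file lock and operate on the index files, yet in practice documents need to be indexed by multiple threads, and each DocumentsWriterPerThread represents one document-indexing thread. Once a DocumentsWriterPerThread has been obtained, doFlush writes its in-memory index data to files on disk. The analysis of doFlush is left to a later chapter.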
 
 Back in DocumentsWriter's updateDocuments function, the next step obtains a DocumentsWriterPerThread through DocumentsWriterFlushControl's obtainAndLock function. The DocumentsWriterPerThread is wrapped in a ThreadState. The code of obtainAndLock is as follows:
+
 IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterFlushControl::obtainAndLock
 ```
   ThreadState obtainAndLock() {
@@ -534,6 +536,7 @@ DocumentsWriterPerThread's updateDocuments function first calls reserveOneDoc to check
 
 The processDocument function of DefaultIndexingChain
 DefaultIndexingChain is the default indexing chain. Its processDocument function is shown below.
+
 IndexWriter::updateDocuments->DocumentsWriter::updateDocuments->DocumentsWriterPerThread::updateDocuments->DefaultIndexingChain::processDocument
 ```
   public void processDocument() throws IOException, AbortingException {
 
@@ -0,0 +1,7 @@
+---
+layout: post
+categories: [Lucene]
+description: none
+keywords: Lucene
+---
+# Lucene Source Code: Index File Terms
@@ -0,0 +1,28 @@
+---
+layout: post
+categories: [Lucene]
+description: none
+keywords: Lucene
+---
+# Lucene Source Code: Index File Document
+
+## Creating a Document object and adding fields (Field)
+```
+        Document doc = new Document();
+        doc.add(new Field("title", text, TextField.TYPE_STORED));
+```
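+A Document object mainly consists of:
+- An ArrayList that holds all of the document's fields.
+- Each field has a name, a value, and some flag bits, which correspond to what is described in the fnm, fdx, and fdt files.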
+
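+## Field
+A Document consists of one or more Fields. A Field roughly corresponds to a database column, although one column may map to several Fields. Each Field has its own attributes, for example:
+- TextField for full text
+- StringField for plain strings
+- IntPoint for the numeric type int, LongPoint for long, and so on
+
+Fields can also be customized through their attributes. Earlier versions used a single Field type for all numeric values; since the move to BKD-Tree, each numeric type has its own Field, such as IntPoint, which reflects the concept of a point.
+
+When adding a field to a document there are many options, such as whether to tokenize it and whether to store it. Choose according to the actual need.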
+
+## Field.Store.*
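+Field is a key class during document indexing; it controls the field values being indexed. The Field.Store.* options control whether a field's original value is stored in the index so that it can be retrieved later. They do not decide whether the text is searchable through the inverted index.
+```
+Field.Store.YES // store the field's full content in the index files so it can be restored later (suitable for primary keys and titles)
+Field.Store.NO  // do not store the field's content; it can still be indexed, but the original cannot be restored via doc.get() (suitable for body content, which usually need not be stored)
+```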
+```
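
The difference between "indexed" and "stored" is easy to blur, so here is a minimal sketch that separates the two. It is a toy model and not Lucene's API: `ToyStoreDemo`, its `Store` enum, and the whitespace tokenizer are hypothetical stand-ins, and the toy indexes every field, whereas in Lucene that is a separate per-field choice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy single-document model contrasting "indexed" with "stored". Hypothetical, not Lucene code. */
final class ToyStoreDemo {
    enum Store { YES, NO }

    // field -> terms that can be matched by a search
    private final Map<String, List<String>> indexedTerms = new HashMap<>();
    // field -> original value, kept only when Store.YES was requested
    private final Map<String, String> storedValues = new HashMap<>();

    /** Always indexes the value; keeps the original text only for Store.YES. */
    void addField(String name, String value, Store store) {
        List<String> terms = indexedTerms.computeIfAbsent(name, k -> new ArrayList<>());
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        if (store == Store.YES) {
            storedValues.put(name, value);
        }
    }

    /** True if the field contains the term, whether or not the field was stored. */
    boolean matches(String field, String term) {
        List<String> terms = indexedTerms.get(field);
        return terms != null && terms.contains(term.toLowerCase(Locale.ROOT));
    }

    /** Returns the original value, or null when the field was not stored. */
    String get(String field) {
        return storedValues.get(field);
    }
}
```

A field added with `Store.NO` still answers `matches(...)` with true for its terms, but `get(...)` returns null for it, which matches the note above that unstored content can be searched but not restored.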