public class UniformSplitTermsWriter extends FieldsConsumer
The block construction is driven by two parameters, targetNumBlockLines and deltaNumLines.
Each block size (number of terms) is targetNumBlockLines +- deltaNumLines.
The algorithm computes the minimal distinguishing prefix (MDP) between each term and its previous term (alphabetically ordered). Then it selects, in the neighborhood of targetNumBlockLines and within deltaNumLines, the term with the minimal MDP. This term becomes the first term of the next block and its MDP is the block key. This block key is added to the terms dictionary trie.
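The selection step described above can be illustrated with a minimal sketch. It is not the Lucene implementation: the class and method names (UniformSplitSketch, mdpLength, nextBlockStart) are hypothetical, terms are plain Java strings compared by char rather than bytes, and ties are assumed to go to the first candidate.

```java
import java.util.List;

class UniformSplitSketch {

  /** Length of the minimal distinguishing prefix (MDP) of term relative to previous. */
  static int mdpLength(String previous, String term) {
    int limit = Math.min(previous.length(), term.length());
    int common = 0;
    while (common < limit && previous.charAt(common) == term.charAt(common)) {
      common++;
    }
    // One character beyond the shared prefix distinguishes term from previous.
    return Math.min(common + 1, term.length());
  }

  /**
   * Returns the index of the first term of the next block, or sortedTerms.size()
   * if all remaining terms go into the current block.
   */
  static int nextBlockStart(List<String> sortedTerms, int blockStart, int target, int delta) {
    // Candidate split points give a block size within target +- delta.
    int from = blockStart + target - delta;
    int to = Math.min(blockStart + target + delta, sortedTerms.size() - 1);
    int best = sortedTerms.size();
    int bestLength = Integer.MAX_VALUE;
    for (int i = from; i <= to; i++) {
      int length = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
      if (length < bestLength) { // assumption: keep the first candidate with the shortest MDP
        bestLength = length;
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> terms =
        List.of("apple", "apply", "apricot", "banana", "band", "bandit", "cat", "cater");
    int split = nextBlockStart(terms, 0, 4, 1);
    String first = terms.get(split);
    String blockKey = first.substring(0, mdpLength(terms.get(split - 1), first));
    System.out.println("next block starts at \"" + first + "\" with block key \"" + blockKey + "\"");
  }
}
```

With a target of 4 and a delta of 1, the candidates are "banana", "band" and "bandit"; "banana" has the shortest MDP relative to its predecessor, so it starts the next block with the one-character key "b".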
We call "dictionary" the trie structure held in memory, and "block file" the file on disk containing the block lines, with one term and its corresponding term state details per line.
When seeking a term, the dictionary seeks the floor leaf of the trie for the searched term and jumps to the corresponding file pointer in the block file. There, the block terms are scanned for the exact searched term.
The terms inside a block do not need to share a prefix. Only the block key is used to find the block from the dictionary trie. And the block key is selected because it is the locally smallest MDP. This makes the dictionary trie very compact.
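The lookup path can be sketched as follows. This is an illustration only: a java.util.TreeMap stands in for the dictionary trie (its floorEntry plays the role of the floor leaf), a block index stands in for the file pointer, and the class name FloorSeekSketch is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FloorSeekSketch {
  // Stand-in for the dictionary trie: block key -> block index (file pointer stand-in).
  private final TreeMap<String, Integer> dictionary = new TreeMap<>();
  // Stand-in for the block file: each block is a sorted list of terms.
  private final List<List<String>> blocks;

  FloorSeekSketch(List<String> blockKeys, List<List<String>> blocks) {
    this.blocks = blocks;
    for (int i = 0; i < blockKeys.size(); i++) {
      dictionary.put(blockKeys.get(i), i);
    }
  }

  /** Returns true if the term is found by scanning the block selected by the floor key. */
  boolean seekExact(String term) {
    Map.Entry<String, Integer> floor = dictionary.floorEntry(term);
    if (floor == null) {
      return false; // the term sorts before every block key
    }
    for (String line : blocks.get(floor.getValue())) {
      int cmp = line.compareTo(term);
      if (cmp == 0) {
        return true;
      }
      if (cmp > 0) {
        return false; // terms are sorted, so the term cannot appear later in the block
      }
    }
    return false;
  }

  public static void main(String[] args) {
    FloorSeekSketch index = new FloorSeekSketch(
        List.of("a", "b"),
        List.of(
            List.of("apple", "apply", "apricot"),
            List.of("banana", "band", "bandit", "cat", "cater")));
    System.out.println(index.seekExact("cat"));
    System.out.println(index.seekExact("bandana"));
  }
}
```

In this example "cat" lives in the block keyed by "b" even though it does not share that prefix; the floor lookup still lands on the right block because the key only has to sort after the last term of the previous block and no later than the first term of its own block.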
An interesting property of the Uniform Split technique is the nearly linear trade-off between memory usage and lookup performance. Decreasing the target block size makes the block scan faster but, since there are more blocks, increases the dictionary trie memory usage. Additionally, small blocks are faster to read from disk. A good sweet spot is a target block size of 32 with a delta of 3 (about 10%); these are the default values. Both can be tuned in the constructor.
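As a rough illustration of this trade-off, the sketch below uses a deliberately simple model that is an assumption, not a measurement: the number of blocks (and thus of block keys in the dictionary) is about the number of terms divided by the target size, and a plain linear scan visits about half of a block on average. The class and method names are hypothetical.

```java
class BlockSizeTradeoffSketch {

  /** Approximate number of blocks, i.e. block keys held by the dictionary. */
  static long estimatedBlocks(long numTerms, int targetNumBlockLines) {
    return (numTerms + targetNumBlockLines - 1) / targetNumBlockLines;
  }

  /** Average number of lines visited by a plain linear scan for a term present in the block. */
  static double averageScannedLines(int targetNumBlockLines) {
    return (targetNumBlockLines + 1) / 2.0;
  }

  public static void main(String[] args) {
    long numTerms = 1_000_000L;
    for (int target : new int[] {16, 32, 64}) {
      System.out.println("target=" + target
          + " blocks=" + estimatedBlocks(numTerms, target)
          + " avgScan=" + averageScannedLines(target));
    }
  }
}
```

Under this model, halving the target block size doubles the number of dictionary entries and halves the average scan length.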
There are additional optimizations:
Blocks can be compressed or encrypted with an optional BlockEncoder provided in the constructor.
The block file contains all the term blocks for each field sequentially. It also contains the fields metadata at the end of the file.
The dictionary file contains the trie (FST bytes) for each field sequentially.
Modifier and Type | Field and Description |
---|---|
protected BlockEncoder | blockEncoder |
protected IndexOutput | blockOutput |
static int | DEFAULT_DELTA_NUM_LINES: Default value for the maximum allowed delta variation of the block size (delta of the number of terms per block). |
static int | DEFAULT_TARGET_NUM_BLOCK_LINES: Default value for the target block size (number of terms per block). |
protected int | deltaNumLines |
protected IndexOutput | dictionaryOutput |
protected FieldInfos | fieldInfos |
protected FieldMetadata.Serializer | fieldMetadataWriter |
protected static int | MAX_NUM_BLOCK_LINES: Upper limit of the block size (maximum number of terms per block). |
protected int | maxDoc |
protected PostingsWriterBase | postingsWriter |
protected int | targetNumBlockLines |
Modifier | Constructor and Description |
---|---|
public | UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, BlockEncoder blockEncoder) |
public | UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder) |
protected | UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder, FieldMetadata.Serializer fieldMetadataWriter, String codecName, int versionCurrent, String termsBlocksExtension, String dictionaryExtension) |
Modifier and Type | Method and Description |
---|---|
void | close() |
protected static void | validateSettings(int targetNumBlockLines, int deltaNumLines): Validates the constructor settings. |
void | write(Fields fields, NormsProducer normsProducer) |
protected void | writeDictionary(IndexDictionary.Builder dictionaryBuilder): Writes the dictionary index (FST) to disk. |
protected void | writeEncodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) |
protected void | writeFieldsMetadata(int fieldsNumber, ByteBuffersDataOutput fieldsOutput) |
protected int | writeFieldTerms(BlockWriter blockWriter, DataOutput fieldsOutput, TermsEnum termsEnum, FieldInfo fieldInfo, NormsProducer normsProducer) |
protected BlockTermState | writePostingLine(TermsEnum termsEnum, FieldMetadata fieldMetadata, NormsProducer normsProducer): Writes the posting values for the current term in the given TermsEnum and updates the FieldMetadata stats. |
protected void | writeUnencodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) |
Methods inherited from class FieldsConsumer: merge
public static final int DEFAULT_TARGET_NUM_BLOCK_LINES
public static final int DEFAULT_DELTA_NUM_LINES
protected static final int MAX_NUM_BLOCK_LINES
protected final FieldInfos fieldInfos
protected final PostingsWriterBase postingsWriter
protected final int maxDoc
protected final int targetNumBlockLines
protected final int deltaNumLines
protected final BlockEncoder blockEncoder
protected final FieldMetadata.Serializer fieldMetadataWriter
protected final IndexOutput blockOutput
protected final IndexOutput dictionaryOutput
public UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, BlockEncoder blockEncoder) throws IOException
Parameters:
blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
Throws: IOException
public UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder) throws IOException
Parameters:
blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
Throws: IOException
protected UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder, FieldMetadata.Serializer fieldMetadataWriter, String codecName, int versionCurrent, String termsBlocksExtension, String dictionaryExtension) throws IOException
Parameters:
targetNumBlockLines - Target number of lines per block. Must be strictly greater than 0. The parameters can be pre-validated with validateSettings(int, int). There is one term per block line, with its corresponding details (TermState).
deltaNumLines - Maximum allowed delta variation of the number of lines per block. Must be greater than or equal to 0 and strictly less than targetNumBlockLines. The block size will be targetNumBlockLines +- deltaNumLines. The block size must always be less than or equal to MAX_NUM_BLOCK_LINES.
blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
Throws: IOException
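The parameter constraints listed for this constructor can be sketched as a standalone check. This is not the code of validateSettings(int, int): the class name, the exception type (IllegalArgumentException) and the value used for MAX_NUM_BLOCK_LINES are assumptions made for illustration.

```java
class SettingsValidationSketch {
  // Placeholder value; the real upper limit is defined by MAX_NUM_BLOCK_LINES.
  static final int MAX_NUM_BLOCK_LINES = 1_000;

  /** Throws IllegalArgumentException if the documented constraints are not met. */
  static void validate(int targetNumBlockLines, int deltaNumLines) {
    if (targetNumBlockLines <= 0) {
      throw new IllegalArgumentException(
          "targetNumBlockLines must be strictly greater than 0, got " + targetNumBlockLines);
    }
    if (deltaNumLines < 0 || deltaNumLines >= targetNumBlockLines) {
      throw new IllegalArgumentException(
          "deltaNumLines must be in [0, targetNumBlockLines), got " + deltaNumLines);
    }
    // Use long arithmetic so that the sum cannot overflow.
    if ((long) targetNumBlockLines + deltaNumLines > MAX_NUM_BLOCK_LINES) {
      throw new IllegalArgumentException(
          "targetNumBlockLines + deltaNumLines must be at most " + MAX_NUM_BLOCK_LINES);
    }
  }

  /** Convenience wrapper that reports validity as a boolean. */
  static boolean isValid(int targetNumBlockLines, int deltaNumLines) {
    try {
      validate(targetNumBlockLines, deltaNumLines);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isValid(32, 3));  // target 32 with delta 3, the documented sweet spot
    System.out.println(isValid(32, 32)); // delta must be strictly less than the target
  }
}
```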
protected static void validateSettings(int targetNumBlockLines, int deltaNumLines)
Validates the constructor settings.
Parameters:
targetNumBlockLines - Target number of lines per block. Must be strictly greater than 0.
deltaNumLines - Maximum allowed delta variation of the number of lines per block. Must be greater than or equal to 0 and strictly less than targetNumBlockLines. Additionally, targetNumBlockLines + deltaNumLines must be less than or equal to MAX_NUM_BLOCK_LINES.
public void write(Fields fields, NormsProducer normsProducer) throws IOException
Specified by: write in class FieldsConsumer
Throws: IOException
protected void writeFieldsMetadata(int fieldsNumber, ByteBuffersDataOutput fieldsOutput) throws IOException
Throws: IOException
protected void writeUnencodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) throws IOException
Throws: IOException
protected void writeEncodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) throws IOException
Throws: IOException
protected int writeFieldTerms(BlockWriter blockWriter, DataOutput fieldsOutput, TermsEnum termsEnum, FieldInfo fieldInfo, NormsProducer normsProducer) throws IOException
Throws: IOException
protected BlockTermState writePostingLine(TermsEnum termsEnum, FieldMetadata fieldMetadata, NormsProducer normsProducer) throws IOException
Writes the posting values for the current term in the given TermsEnum and updates the FieldMetadata stats.
Returns: the BlockTermState; or null if none.
Throws: IOException
protected void writeDictionary(IndexDictionary.Builder dictionaryBuilder) throws IOException
Writes the dictionary index (FST) to disk.
Throws: IOException
public void close() throws IOException
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable
Specified by: close in class FieldsConsumer
Throws: IOException
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.