Commit 2bbdc2b

Author: Hannah Bast
Smaller block size and add _col2LastId to block metadata
1. The block size used to be `1 << 23` (over 8M), which is too large because we always need to decompress at least one whole block, even when reading only a few triples. It is now 100'000, which still keeps the overall space consumption of the block metadata relatively small.
2. Add member `_col2LastId` to the block metadata because we need it for the delta triples (#916).
1 parent b3aa675 commit 2bbdc2b

3 files changed, with 22 additions and 4 deletions.

src/index/CompressedRelation.cpp

Lines changed: 3 additions & 1 deletion
@@ -367,6 +367,7 @@ void CompressedRelationWriter::addRelation(Id col0Id,
   }
   _currentBlockData._col0LastId = col0Id;
   _currentBlockData._col1LastId = col1And2Ids(col1And2Ids.numRows() - 1, 0);
+  _currentBlockData._col2LastId = col1And2Ids(col1And2Ids.numRows() - 1, 1);
   AD_CORRECTNESS_CHECK(_buffer.numColumns() == col1And2Ids.numColumns());
   auto bufferOldSize = _buffer.numRows();
   _buffer.resize(_buffer.numRows() + col1And2Ids.numRows());
@@ -396,7 +397,8 @@ void CompressedRelationWriter::writeRelationToExclusiveBlocks(
 
     _blockBuffer.push_back(CompressedBlockMetadata{
         std::move(offsets), actualNumRowsPerBlock, col0Id, col0Id, data[i][0],
-        data[i + actualNumRowsPerBlock - 1][0]});
+        data[i + actualNumRowsPerBlock - 1][0],
+        data[i + actualNumRowsPerBlock - 1][1]});
   }
 }
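
The two hunks above record the last `col1` and `col2` values of each block. As a rough illustration of that bookkeeping, here is a minimal, self-contained sketch. It is not QLever's code: `Row`, `BlockMeta`, and `splitIntoBlocks` are hypothetical names introduced only for this example, and the rows are assumed to be sorted by `(col1, col2)` already.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only (not QLever types).
using Id = std::uint64_t;

struct Row {
  Id col1;
  Id col2;
};

struct BlockMeta {
  Id col1FirstId;       // col1 of the first row in the block
  Id col1LastId;        // col1 of the last row in the block
  Id col2LastId;        // col2 of the last row in the block
  std::size_t numRows;  // number of rows in the block
};

// Split `rows` into consecutive blocks of at most `rowsPerBlock` rows and
// record the first/last IDs of each block. Returns an empty vector if
// `rowsPerBlock` is zero.
std::vector<BlockMeta> splitIntoBlocks(const std::vector<Row>& rows,
                                       std::size_t rowsPerBlock) {
  std::vector<BlockMeta> result;
  if (rowsPerBlock == 0) {
    return result;
  }
  for (std::size_t i = 0; i < rows.size(); i += rowsPerBlock) {
    const std::size_t n = std::min(rowsPerBlock, rows.size() - i);
    const Row& first = rows[i];
    const Row& last = rows[i + n - 1];
    result.push_back(BlockMeta{first.col1, last.col1, last.col2, n});
  }
  return result;
}
```

Calling `splitIntoBlocks(rows, 2)` on five sorted rows yields three blocks, the last of which holds a single row.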

src/index/CompressedRelation.h

Lines changed: 12 additions & 0 deletions
@@ -60,11 +60,21 @@ struct CompressedBlockMetadata {
   // For example, in the PSO permutation, col0 is the P and col1 is the S. The
   // col0 ID is not stored in the block. First and last are meant inclusively,
   // that is, they are both part of the block.
+  //
+  // NOTE: Strictly speaking, we don't need `_col0FirstId` and `_col1FirstId`.
+  // However, they are convenient to have and don't really harm with respect to
+  // space efficiency. For example, for Wikidata, we have only around 50K blocks
+  // with block size 8M and around 5M blocks with block size 80K; even the
+  // latter takes only half a GB in total.
   Id _col0FirstId;
   Id _col0LastId;
   Id _col1FirstId;
   Id _col1LastId;
 
+  // For our `DeltaTriples` (https://github.com/ad-freiburg/qlever/pull/916), we
+  // need to know the least significant `Id` of the last triple as well.
+  Id _col2LastId;
+
   // Two of these are equal if all members are equal.
   bool operator==(const CompressedBlockMetadata&) const = default;
 };
@@ -83,6 +93,7 @@ AD_SERIALIZE_FUNCTION(CompressedBlockMetadata) {
   serializer | arg._col0LastId;
   serializer | arg._col1FirstId;
   serializer | arg._col1LastId;
+  serializer | arg._col2LastId;
 }
 
 // The metadata of a whole compressed "relation", where relation refers to a
@@ -304,6 +315,7 @@ class CompressedRelationReader {
   static void decompressColumn(const std::vector<char>& compressedColumn,
                                size_t numRowsToRead, Iterator iterator);
 
+ public:
   // Read the block that is identified by the `blockMetaData` from the `file`,
   // decompress and return it.
   // If `columnIndices` is `nullopt`, then all columns of the block are read,
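
The new comment says the last triple's least significant `Id` is needed for the delta triples. One plausible reason, sketched below, is that the full last triple `(col0, col1, col2)` of each block lets a binary search decide which block a given triple falls into, even when several consecutive blocks share the same `(col0, col1)`. This is an assumption about the use case, not the implementation from #916; `Triple`, `BlockLast`, and `findBlockIndex` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for illustration only (not QLever types).
using Id = std::uint64_t;

struct Triple {
  Id col0;
  Id col1;
  Id col2;
};

// Only the "last triple" part of a block's metadata.
struct BlockLast {
  Id col0LastId;
  Id col1LastId;
  Id col2LastId;
};

// Return the index of the first block whose last triple is not smaller than
// `t` (in lexicographic order). Returns `blocks.size()` if `t` is larger than
// the last triple of every block. Assumes `blocks` is sorted.
std::size_t findBlockIndex(const std::vector<BlockLast>& blocks,
                           const Triple& t) {
  auto it = std::lower_bound(
      blocks.begin(), blocks.end(), t,
      [](const BlockLast& block, const Triple& triple) {
        return std::tie(block.col0LastId, block.col1LastId, block.col2LastId) <
               std::tie(triple.col0, triple.col1, triple.col2);
      });
  return static_cast<std::size_t>(it - blocks.begin());
}
```

With blocks ending in `(1, 5, 9)` and `(1, 5, 20)`, the triple `(1, 5, 10)` maps to the second block. Comparing only `col0` and `col1` could not distinguish the two.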

src/index/ConstantsIndexBuilding.h

Lines changed: 7 additions & 3 deletions
@@ -79,6 +79,10 @@ constexpr size_t QUEUE_SIZE_BEFORE_PARALLEL_PARSING = 10;
 // time
 constexpr size_t QUEUE_SIZE_AFTER_PARALLEL_PARSING = 10;
 
-// The uncompressed size in bytes of a block of the permutations. Currently 8MB
-// is chosen which is well suited for zstd compression
-constexpr size_t BLOCKSIZE_COMPRESSED_METADATA = 1ul << 23u;
+// The uncompressed size in bytes of a block of the permutations.
+//
+// NOTE: This used to be `1 << 23` (over 8M), which is fairly large (we always
+// need to decompress at least one whole block, even when reading only few
+// triples). With 100K, the total space for all the `CompressedBlockMetadata` is
+// still small compared to the rest of the index.
+constexpr size_t BLOCKSIZE_COMPRESSED_METADATA = 100'000;
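
The trade-off described in the note can be checked with back-of-the-envelope arithmetic: a smaller block size means more blocks and therefore more metadata. The sketch below is only an estimate. `numBlocks` and `metadataBytes` are hypothetical helpers, and both the total uncompressed size and the per-block metadata size are inputs the caller must supply; neither value is taken from the code in this commit.

```cpp
#include <cstddef>

// Number of blocks needed to hold `totalBytes` of uncompressed data when each
// block holds at most `blockSize` bytes (ceiling division). `blockSize` must
// be greater than zero.
constexpr std::size_t numBlocks(std::size_t totalBytes, std::size_t blockSize) {
  return (totalBytes + blockSize - 1) / blockSize;
}

// Estimated total size of the block metadata, assuming every block needs
// `bytesPerBlockMetadata` bytes (an assumed figure, to be measured).
constexpr std::size_t metadataBytes(std::size_t totalBytes,
                                    std::size_t blockSize,
                                    std::size_t bytesPerBlockMetadata) {
  return numBlocks(totalBytes, blockSize) * bytesPerBlockMetadata;
}
```

For example, the Wikidata figures quoted in the `CompressedRelation.h` comment (around 5M blocks taking about half a GB) would correspond to roughly 100 bytes of metadata per block.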
