Commit 70cd1e5

docs: document the new journal file format additions
1 parent bbcd38e commit 70cd1e5

1 file changed

+66
-25
lines changed

docs/JOURNAL_FILE_FORMAT.md

Lines changed: 66 additions & 25 deletions
@@ -59,9 +59,9 @@ in particular realize that they may include binary non-text data (though
5959
usually don't), and the same field might have multiple values assigned within
6060
the same entry.
6161

62-
This document describes the current format of systemd 195. The documented
62+
This document describes the current format of systemd 246. The documented
6363
format is compatible with the format used in the first versions of the journal,
64-
but received various compatible additions since.
64+
but received various compatible and incompatible additions since.
6565

6666
If you are wondering why the journal file format has been created in the first
6767
place instead of adopting an existing database implementation, please have a
@@ -73,7 +73,7 @@ thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054
7373

7474
* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
7575
* Offsets are always relative to the beginning of the file.
76-
* The 64bit hash function used is [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function), more specifically jenkins_hashlittle2() with the first 32bit integer it returns as higher 32bit part of the 64bit value, and the second one uses as lower 32bit part.
76+
* The 64bit hash function siphash24 is used for newer journal files (those with the HEADER_INCOMPATIBLE_KEYED_HASH flag set). For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically jenkins_hashlittle2() with the first 32bit integer it returns used as the higher 32bit part of the 64bit value, and the second one used as the lower 32bit part.
7777
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit
7878
* The format is designed to be read and written via memory mapping using multiple mapped windows.
7979
* All time values are stored in usec since the respective epoch.
@@ -174,6 +174,9 @@ _packed_ struct Header {
174174
/* Added in 189 */
175175
le64_t n_tags;
176176
le64_t n_entry_arrays;
177+
/* Added in 246 */
178+
le64_t data_hash_chain_depth;
179+
le64_t field_hash_chain_depth;
177180
};
178181
```
179182

@@ -218,6 +221,16 @@ entry has been written yet.
218221
**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
219222
file, referring to monotonic time of the boot identified by **boot_id**.
220223

224+
**data_hash_chain_depth** is a counter of the deepest chain in the data hash
225+
table, minus one. This is updated whenever a chain is found that is longer than
226+
the previous deepest chain found. Note that the counter is updated during hash
227+
table lookups, as the chains are traversed. This counter is used to determine
228+
when it is a good time to rotate the journal file, because hash collisions
229+
have become too frequent.
230+
231+
Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
232+
field hash table, minus one.
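
The counter update described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory array standing in for the objects of one hash chain; the names `toy_data` and `toy_chain_lookup` are invented for illustration and are not part of the format or of any systemd API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a DATA object. A real reader would
 * follow file offsets through mmap()ed windows instead of array indexes. */
struct toy_data {
        uint64_t hash;
        uint64_t next_hash_offset; /* 0 terminates the chain */
};

/* Walk one hash chain looking for `hash`, recording the deepest position
 * reached so far in *chain_depth. `objects` is indexed by "offset", with
 * index 0 reserved as the terminator. Returns the offset of the match,
 * or 0 if the chain contains no match. */
static uint64_t toy_chain_lookup(const struct toy_data *objects,
                                 uint64_t head, uint64_t hash,
                                 uint64_t *chain_depth) {
        uint64_t depth = 0; /* "minus one": the first object counts as 0 */

        for (uint64_t p = head; p != 0; p = objects[p].next_hash_offset, depth++) {
                if (depth > *chain_depth)
                        *chain_depth = depth;
                if (objects[p].hash == hash)
                        return p;
        }

        return 0;
}
```

Because the counter only grows while chains are traversed, a chain that was never looked up does not contribute to it. At which depth a writer decides to rotate the file is a policy choice that this sketch deliberately leaves out.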
233+
221234

222235
## Extensibility
223236

@@ -238,20 +251,30 @@ unconditionally exist in all revisions of the file format, all fields starting
238251
with "n_data" need to be explicitly checked for via a size check, since they
239252
were additions after the initial release.
240253

241-
Currently only two extensions flagged in the flags fields are known:
254+
Currently only five extensions flagged in the flags fields are known:
242255

243256
```c
244257
enum {
245-
HEADER_INCOMPATIBLE_COMPRESSED = 1
258+
HEADER_INCOMPATIBLE_COMPRESSED_XZ = 1 << 0,
259+
HEADER_INCOMPATIBLE_COMPRESSED_LZ4 = 1 << 1,
260+
HEADER_INCOMPATIBLE_KEYED_HASH = 1 << 2,
261+
HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
246262
};
247263

248264
enum {
249-
HEADER_COMPATIBLE_SEALED = 1
265+
HEADER_COMPATIBLE_SEALED = 1 << 0,
250266
};
251267
```
252268

253-
HEADER_INCOMPATIBLE_COMPRESSED indicates that the file includes DATA objects
254-
that are compressed using XZ.
269+
HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
270+
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
271+
indicates that the file includes DATA objects that are compressed with the LZ4
272+
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
273+
objects compressed with ZSTD.
274+
275+
HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
276+
hash function the keyed siphash24 hash function is used for the two hash
277+
tables, see below.
255278

256279
HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
257280
for Forward Secure Sealing.
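
As an illustration of how a reader might act on these flags, here is a minimal sketch. It assumes the incompatible flags word has already been read from the header and converted to host byte order, and that the reader rejects any file carrying an incompatible flag it does not know. `TOY_SUPPORTED_INCOMPATIBLE`, `toy_can_read()` and `toy_uses_keyed_hash()` are hypothetical names, not part of any real API.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

/* Hypothetical: the incompatible extensions this toy reader implements. */
#define TOY_SUPPORTED_INCOMPATIBLE               \
        (HEADER_INCOMPATIBLE_COMPRESSED_XZ |     \
         HEADER_INCOMPATIBLE_COMPRESSED_LZ4 |    \
         HEADER_INCOMPATIBLE_KEYED_HASH |        \
         HEADER_INCOMPATIBLE_COMPRESSED_ZSTD)

/* A file is readable only if every incompatible flag it sets is known. */
static bool toy_can_read(uint32_t incompatible_flags) {
        return (incompatible_flags & ~(uint32_t) TOY_SUPPORTED_INCOMPATIBLE) == 0;
}

/* Selects between keyed siphash24 and unkeyed Jenkins for the hash tables. */
static bool toy_uses_keyed_hash(uint32_t incompatible_flags) {
        return (incompatible_flags & HEADER_INCOMPATIBLE_KEYED_HASH) != 0;
}
```

A reader built without, say, ZSTD support would simply drop that bit from its supported mask and thereby refuse such files up front rather than failing later on an individual object.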
@@ -308,9 +331,9 @@ structure gracefully. (Checking what you read is a pretty good idea out of
308331
security considerations anyway.) This specifically includes checking offset
309332
values, and that they point to valid objects, with valid sizes and of the type
310333
and hash value expected. All code must be written with the fact in mind that a
311-
file with inconsistent structure file might just be inconsistent temporarily,
312-
and might become consistent later on. Payload OTOH requires less scrutiny, as
313-
it should only be linked up (and hence visible to readers) after it was
334+
file with inconsistent structure might just be inconsistent temporarily, and
335+
might become consistent later on. Payload OTOH requires less scrutiny, as it
336+
should only be linked up (and hence visible to readers) after it was
314337
successfully written to memory (though not necessarily to disk). On non-local
315338
file systems it is a good idea to verify the payload hashes when reading, in
316339
order to avoid annoyances with mmap() inconsistencies.
@@ -319,8 +342,8 @@ Clients intending to show a live view of the journal should use inotify() for
319342
this to watch for file changes. Since file writes done via mmap() do not
320343
result in inotify events, writers shall truncate the file to its current size after
321344
writing one or more entries, which results in inotify events being
322-
generated. Note that this is not used as transaction scheme (it doesn't protect
323-
anything), but merely for triggering wakeups.
345+
generated. Note that this is not used as a transaction scheme (it doesn't
346+
protect anything), but merely for triggering wakeups.
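
The wakeup trick described above can be sketched with plain POSIX calls. The function name `toy_post_change()` is hypothetical; the sketch only assumes an open, writable file descriptor for the journal file.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Truncate the file to the size it already has. The contents are left
 * untouched; the call exists only so that the kernel generates an inotify
 * event for readers watching the file. Returns 0 on success, or -1 with
 * errno set on failure. */
static int toy_post_change(int fd) {
        struct stat st;

        if (fstat(fd, &st) < 0)
                return -1;

        return ftruncate(fd, st.st_size);
}
```

A writer would call this after appending one or more entries through its memory mapping. Since the size does not change, a failure here costs readers a wakeup but does not affect the data already written.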
324347

325348
Note that inotify will not work on network file systems if reader and writer
326349
reside on different hosts. Readers which detect they are run on journal files
@@ -334,7 +357,9 @@ All objects carry a common header:
334357

335358
```c
336359
enum {
337-
OBJECT_COMPRESSED = 1
360+
OBJECT_COMPRESSED_XZ = 1 << 0,
361+
OBJECT_COMPRESSED_LZ4 = 1 << 1,
362+
OBJECT_COMPRESSED_ZSTD = 1 << 2,
338363
};
339364

340365
_packed_ struct ObjectHeader {
@@ -346,10 +371,13 @@ _packed_ struct ObjectHeader {
346371
};
347372

348373
The **type** field is one of the object types listed above. The **flags** field
349-
currently knows one flag: OBJECT_COMPRESSED. It is only valid for DATA objects
350-
and indicates that the data payload is compressed with XZ. If OBJECT_COMPRESSED
351-
is set for an object HEADER_INCOMPATIBLE_COMPRESSED must be set for the file as
352-
well. The **size** field encodes the size of the object including all its
374+
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
375+
OBJECT_COMPRESSED_ZSTD. These are only valid for DATA objects and indicate that
376+
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
377+
OBJECT_COMPRESSED_* flags is set for an object then the matching
378+
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
379+
flag must be set for the file as well. At most one of these three bits may be
380+
set. The **size** field encodes the size of the object including all its
353381
headers and payload.
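
The two constraints on the compression flags (at most one bit set, and the matching file-level flag present) can be checked as in the following minimal sketch. The helper names and the `TOY_` mask are hypothetical, and the header flags word is assumed to be in host byte order already.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
        OBJECT_COMPRESSED_XZ   = 1 << 0,
        OBJECT_COMPRESSED_LZ4  = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

#define TOY_OBJECT_COMPRESSION_MASK \
        (OBJECT_COMPRESSED_XZ | OBJECT_COMPRESSED_LZ4 | OBJECT_COMPRESSED_ZSTD)

/* True if zero or exactly one compression bit is set in the object flags. */
static bool toy_object_flags_valid(uint8_t flags) {
        unsigned c = flags & TOY_OBJECT_COMPRESSION_MASK;

        return (c & (c - 1)) == 0;
}

/* True if the file header announces the compression the object uses. */
static bool toy_object_matches_header(uint8_t flags, uint32_t incompatible_flags) {
        if (flags & OBJECT_COMPRESSED_XZ)
                return (incompatible_flags & HEADER_INCOMPATIBLE_COMPRESSED_XZ) != 0;
        if (flags & OBJECT_COMPRESSED_LZ4)
                return (incompatible_flags & HEADER_INCOMPATIBLE_COMPRESSED_LZ4) != 0;
        if (flags & OBJECT_COMPRESSED_ZSTD)
                return (incompatible_flags & HEADER_INCOMPATIBLE_COMPRESSED_ZSTD) != 0;

        return true; /* uncompressed objects need no header flag */
}
```

Note that the bit positions differ between the two enums: ZSTD is bit 2 in the object flags but bit 3 in the header flags, so the values cannot simply be compared directly.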
354382

355383

@@ -371,7 +399,12 @@ _packed_ struct DataObject {
371399
Data objects carry actual field data in the **payload[]** array, including a
372400
field name, a '=' and the field data. Example:
373401
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
374-
payload.
402+
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
403+
this is the siphash24 hash value of the payload, keyed by the file ID as stored
404+
in the `.file_id` field of the file header. If the flag is not set it is the
405+
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
406+
it makes the format more robust against attackers that want to trigger hash
407+
collisions in the hash table.
375408

376409
**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
377410
a hash collision happens (in a singly linked list, with an offset of 0
@@ -388,8 +421,9 @@ number of ENTRY objects that reference this object, i.e. the sum of all
388421
ENTRY_ARRAYS chained up from this object, plus 1.
389422

390423
The **payload[]** field contains the field name and data unencoded, unless
391-
OBJECT_COMPRESSED is set in the `ObjectHeader`, in which case the payload is
392-
LZMA compressed.
424+
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
425+
`ObjectHeader`, in which case the payload is compressed with the indicated
426+
compression algorithm.
393427

394428

395429
## Field Objects
@@ -448,10 +482,17 @@ identified by **boot_id**.
448482
The **xor_hash** field contains a binary XOR of the hashes of the payload of
449483
all DATA objects referenced by this ENTRY. This value is usable to check the
450484
contents of the entry, being independent of the order of the DATA objects in
451-
the array.
485+
the array. Note that even for files that have the
486+
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus use siphash24 as the otherwise
487+
used hash function) the hash function used for this field, as a singular
488+
exception, is the Jenkins lookup3 hash function. The XOR hash value is used to
489+
quickly compare the contents of two entries, and to establish a well-defined order
490+
between two entries that otherwise have the same sequence numbers and
491+
timestamps.
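
The order independence of the XOR value can be seen in a minimal sketch. It assumes the per-payload hashes have already been computed (with Jenkins lookup3, per the paragraph above) and are passed in as plain 64-bit values; `toy_xor_hash()` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR together the payload hashes of all DATA objects of one entry.
 * XOR is commutative and associative, so the result does not depend on
 * the order in which the hashes are supplied. */
static uint64_t toy_xor_hash(const uint64_t *payload_hashes, size_t n) {
        uint64_t x = 0;

        for (size_t i = 0; i < n; i++)
                x ^= payload_hashes[i];

        return x;
}
```

Two entries that reference the same set of DATA objects in different orders therefore produce the same value, which is what makes the field usable for quick content comparison.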
452492

453493
The **items[]** array contains references to all DATA objects of this entry,
454-
plus their respective hashes.
494+
plus their respective hashes (which are calculated the same way as in the DATA
495+
objects, i.e. keyed by the file ID).
455496

456497
In the file ENTRY objects are written ordered monotonically by sequence
457498
number. For continuous parts of the file written during the same boot
@@ -494,8 +535,8 @@ and create a new one.
494535

495536
The DATA_HASH_TABLE should be sized taking into account the maximum size the
496537
file is expected to grow, as configured by the administrator or disk space
497-
considerations. The FIELD_HASH_TABLE should be sized to a fixed size, as the
498-
number of fields should be pretty static it depends only on developers'
538+
considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
539+
number of fields should be pretty static as it depends only on developers'
499540
creativity rather than runtime parameters.
500541

501542
