LibarchiveAddingReadFilter
How to add a new read filter to libarchive.
Libarchive has a flexible read filter system that automatically detects various data encodings and creates the necessary filter pipeline to decode that data. This mechanism is key to bsdtar's ability to transparently handle a variety of common archive formats compressed with any of several popular compressors.
This brief article explains step-by-step how to add support for a new format, complete with the appropriate automatic detection logic. This can be used to add automatic support for new decompression algorithms, encoding methods such as uudecode or MIME/base64 encoding, decryption filters, and many other transformations. Many of the principles here also apply to adding support for new archive formats.
To illustrate, I'll present several different forms of a gzip decompression filter. The first version is only six lines but provides good support with full automatic format detection. Later versions improve the bidding and provide better portability and performance, culminating in code very close to what is currently in libarchive.
## Step 1: Wrap an existing external program

If there's an external program that supports your format already and the format has a distinctive initial signature, then you may be able to add basic support by simply wrapping the existing external program handler. For example, gzip streams have a two-byte initial signature and can be decompressed by calling the "gunzip" program. Here's a complete implementation of gzip decompression support:
```c
/* File: archive_read_support_compression_gzip.c */
#include "archive.h"

int
archive_read_support_compression_gzip(struct archive *a)
{
    return (archive_read_support_compression_program_signature
        (a, "gunzip", "\037\213", 2));
}
```

This implementation uses a standard module that feeds the data through an external program. It also registers a simple bidder that looks for a fixed file prefix. In this case, we've specified the two-byte prefix "\037\213", which is the "magic number" for gzip-compressed files.
Of course, if you're adding a new compression method, you'll need to add a prototype for `archive_read_support_compression_foo` to the `archive.h` public header file. Also note that each new read filter goes into a file of its own within the source tree.
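With that in place, clients enable the filter like any other. Here's a minimal sketch of how a caller might use it to list an archive (the filename and block size are arbitrary, and `list_archive` is just an illustrative helper):

```c
#include <stdio.h>
#include "archive.h"
#include "archive_entry.h"

/* List the entries of a (possibly gzip-compressed) archive. */
static void
list_archive(const char *filename)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;

    archive_read_support_compression_gzip(a); /* our new filter */
    archive_read_support_format_all(a);
    if (archive_read_open_filename(a, filename, 10240) == ARCHIVE_OK) {
        while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
            printf("%s\n", archive_entry_pathname(entry));
    }
    archive_read_finish(a);
}
```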
## Step 2: Implement a custom bidder

A two-byte initial signature is really only marginally acceptable. It would be better to verify more of the header data to avoid false positives. Also, there are many formats where you can't simply compare the first bytes to a fixed string. This is especially true of text formats such as uudecode, where you may have to look quite far ahead in the input in order to skip over leading garbage.

To do this, you'll need to write a custom handler for the internal bidding protocol, which makes our gzip decompressor example just a little more complex.
The basic bid protocol requires that you register a "bidder" which has two required items: a "bid" function that looks ahead at the input and returns a positive value if it recognizes the format and an "init" function that actually creates a suitable filter to handle the data.
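For reference, the bidder object is declared in the internal header `archive_read_private.h`. At the time of writing it looked roughly like the following (abridged and illustrative; only `bid` and `init` matter for this example):

```c
struct archive_read_filter_bidder {
    /* Configuration data for the bidder. */
    void *data;
    /* Taste the upstream data to decide whether we handle it. */
    int (*bid)(struct archive_read_filter_bidder *,
        struct archive_read_filter *);
    /* Initialize a newly allocated filter. */
    int (*init)(struct archive_read_filter *);
    /* Set an option for the bidder. */
    int (*options)(struct archive_read_filter_bidder *,
        const char *key, const char *value);
    /* Release the bidder's configuration data. */
    int (*free)(struct archive_read_filter_bidder *);
};
```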
Here's the first part of our revised gzip handler, which includes a couple of internal libarchive headers and defines the bid function:
```c
#include "archive.h"
#include "archive_read_private.h"

static int
gzip_bidder_bid(struct archive_read_filter_bidder *self,
    struct archive_read_filter *filter)
{
    const unsigned char *p;
    ssize_t avail, len;
    int bits = 0;
    int header_flags;

    (void)self; /* UNUSED */

    /* Start by looking at the first ten bytes of the header,
     * which is all fixed layout. */
    len = 10;
    p = __archive_read_filter_ahead(filter, len, &avail);
    if (p == NULL || avail == 0)
        return (0);
    if (p[0] != 037)
        return (0);
    bits += 8;
    if (p[1] != 0213)
        return (0);
    bits += 8;
    if (p[2] != 8) /* We only support deflation. */
        return (0);
    bits += 8;
    if ((p[3] & 0xE0) != 0) /* No reserved flags set. */
        return (0);
    bits += 3;
    header_flags = p[3];
    (void)header_flags; /* A fuller bidder would verify the optional
                         * fields described by these flags. */
    return (bits);
}
```

The above code is a little simpler than the current gzip bidder in the libarchive sources, but it's functional and illustrates the key points. First, all bidders make heavy use of `__archive_read_filter_ahead` to peek at data that has not yet been read. The amount of read-ahead is technically unlimited, but the libarchive core has to buffer as much as you ask for; it's poor form to look much further ahead than about 64k, and you should stay within a few kilobytes if possible. Second, good bidders often verify much more than a few initial bytes. For example, you can often compute and verify header CRCs, check bit fields that must have known values, or sanity-check the lengths of variable-sized fields. Finally, the return value should generally reflect how much checking you did: a good rule of thumb is to count the number of bits that you actually verified and return that as your bid.
After you've won the bid, your "init" function will be called to initialize the filter itself. For this example, we'll continue to rely on an external program by using a libarchive internal function to handle the initialization for us:
```c
static int
gzip_bidder_init(struct archive_read_filter *self)
{
    int r = __archive_read_program(self, "gunzip");

    self->code = ARCHIVE_COMPRESSION_GZIP;
    self->name = "gzip";
    return (r);
}
```

Note that this overrides the `code` and `name` fields so that users of libarchive can tell which filter was actually used. Look at the last line of `bsdtar -tvvf` output on an archive to see how these fields are used.
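These fields also surface through the public API. As a minimal illustration (assuming `a` is an archive handle on which reading has already begun):

```c
#include <stdio.h>
#include "archive.h"

/* Report the compression detected by the bidding machinery;
 * with our filter active on gzip input, this prints "gzip". */
static void
report_compression(struct archive *a)
{
    if (archive_compression(a) == ARCHIVE_COMPRESSION_GZIP)
        printf("Compression: %s\n", archive_compression_name(a));
}
```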
Finally, you need to actually register your bidder with the core read machinery, which involves first obtaining a bidder object, then customizing it:
```c
int
archive_read_support_compression_gzip(struct archive *_a)
{
    struct archive_read *a = (struct archive_read *)_a;
    struct archive_read_filter_bidder *bidder = __archive_read_get_bidder(a);

    if (bidder == NULL)
        return (ARCHIVE_FATAL);
    bidder->bid = gzip_bidder_bid;
    bidder->init = gzip_bidder_init;
    return (ARCHIVE_OK);
}
```

## Step 3: Do the work within libarchive

The code in "Step 2" above is actually sufficient to pass the libarchive test suite for gzip support, assuming you have the gunzip executable on your system. But there are two good reasons why you might want to handle the filtering directly within libarchive rather than call out to an external program:
- It's often faster. Shuffling data to and from an external program adds significant overhead. Especially for formats that can be decoded very efficiently (such as base64 or lzo), it's likely fastest to do it directly in the library.
- It's often more portable. If the filter implementation can be entirely within libarchive, then it will be available even on platforms that lack the external executable.
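The code below relies on a `private_data` structure that isn't shown in this article. A minimal definition consistent with how the fields are used, together with the extra headers the implementation needs, might look like this (assuming zlib):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <zlib.h>

struct private_data {
    z_stream stream;          /* zlib decompression state. */
    char header_done;         /* Nonzero once the gzip header is consumed. */
    unsigned char *out_block; /* Buffer holding decompressed output. */
    size_t out_block_size;
    int64_t total_out;        /* Total bytes decompressed so far. */
    char eof;                 /* Nonzero after end of stream is seen. */
};
```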
First, the initializer needs to create a fully custom filter that provides "read" and "close" functions and manages any required internal state. Here, the `self` variable is a pointer to the filter object (which is provided by the libarchive core) and `state` is the gzip decompressor's custom state information:
```c
static int
gzip_bidder_init(struct archive_read_filter *self)
{
    struct private_data *state;
    static const size_t out_block_size = 64 * 1024;
    void *out_block;

    self->code = ARCHIVE_COMPRESSION_GZIP;
    self->name = "gzip";

    state = (struct private_data *)calloc(1, sizeof(*state));
    out_block = (unsigned char *)malloc(out_block_size);
    if (state == NULL || out_block == NULL) {
        free(out_block);
        free(state);
        archive_set_error(&self->archive->archive, ENOMEM,
            "Can't allocate data for gzip decompression");
        return (ARCHIVE_FATAL);
    }

    self->data = state;
    state->out_block_size = out_block_size;
    state->out_block = out_block;
    self->read = gzip_filter_read;
    self->skip = NULL; /* not supported */
    self->close = gzip_filter_close;
    state->header_done = 0; /* We haven't consumed the gzip header yet. */
    return (ARCHIVE_OK);
}
```

A filter's "read" function provides a block of data each time it is called. In this case, the "init" function allocates a 64k block as part of the private state; the read function decompresses input into this region and returns it on each call. The `void **` pointer is an out parameter for the data; the return value is the size of the data, or a negative value (usually ARCHIVE_FATAL) if there's an error.
The general structure is to use the "read-ahead" function to look at upcoming data, then "consume" as much of that data as you actually used. The libarchive core handles all of the bookkeeping necessary to manage blocks from your upstream data providers. Note that the "read-ahead" function will tell you about all of the data that is immediately available, which leads to the common idiom of asking for at least 1 byte and then processing as much as is available.
```c
static ssize_t
gzip_filter_read(struct archive_read_filter *self, const void **p)
{
    struct private_data *state;
    size_t decompressed;
    ssize_t avail_in;
    int ret;

    state = (struct private_data *)self->data;

    if (state->eof)
        return (0);

    if (!state->header_done) {
        if (!consume_header(self)) {
            *p = NULL;
            return (0);
        }
        state->header_done = 1;
    }

    /* Decompress data into our output buffer. */
    state->stream.next_out = state->out_block;
    state->stream.avail_out = state->out_block_size;

    /* Get at least 1 byte from upstream.  (The cast discards the
     * const qualifier, since zlib's next_in is not const.) */
    state->stream.next_in = (unsigned char *)(uintptr_t)
        __archive_read_filter_ahead(self->upstream, 1, &avail_in);
    if (state->stream.next_in == NULL) {
        archive_set_error(&self->archive->archive,
            ARCHIVE_ERRNO_MISC, "truncated gzip input");
        return (ARCHIVE_FATAL);
    }
    state->stream.avail_in = avail_in;

    /* Decompress some of that data. */
    ret = inflate(&(state->stream), 0);
    switch (ret) {
    case Z_OK: /* Decompressor made some progress. */
        /* Consume whatever input was actually used. */
        __archive_read_filter_consume(self->upstream,
            avail_in - state->stream.avail_in);
        break;
    case Z_STREAM_END: /* Found end of stream. */
        __archive_read_filter_consume(self->upstream,
            avail_in - state->stream.avail_in);
        state->eof = 1;
        break;
    default:
        /* Return an error. */
        archive_set_error(&self->archive->archive,
            ARCHIVE_ERRNO_MISC, "gzip decompression failed");
        return (ARCHIVE_FATAL);
    }

    /* Return the block we decoded. */
    decompressed = state->stream.next_out - state->out_block;
    state->total_out += decompressed;
    if (decompressed == 0)
        *p = NULL;
    else
        *p = state->out_block;
    return (decompressed);
}
```

The above is simplified somewhat from the actual code in libarchive. In particular, the code currently in libarchive looks for a new stream to start just after the current stream ends, so that it can correctly handle gzip files that consist of multiple gzip streams concatenated together. The production code also tries to completely fill the 64k output buffer on each call so that downstream consumers have to do less block reassembly. (The "read-ahead" logic will combine multiple blocks by allocating temporary buffers and copying data as necessary to meet minimum read requests; this happens less often with larger data blocks.)
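The example above also references `consume_header` and `gzip_filter_close` without showing them. A minimal `consume_header` consistent with the code above, which accepts only headers with no optional fields (the production code parses the flag bits and skips the extra, name, comment, and header-CRC fields), might look like this:

```c
static int
consume_header(struct archive_read_filter *self)
{
    struct private_data *state = (struct private_data *)self->data;
    const unsigned char *p;
    ssize_t avail;

    /* The fixed-layout portion of the header is ten bytes. */
    p = __archive_read_filter_ahead(self->upstream, 10, &avail);
    if (p == NULL)
        return (0);
    /* For simplicity, accept only headers with no optional fields. */
    if (p[0] != 037 || p[1] != 0213 || p[2] != 8 || p[3] != 0)
        return (0);
    __archive_read_filter_consume(self->upstream, 10);

    /* Negative window bits tell zlib to expect raw deflate data,
     * since we've consumed the gzip header ourselves.  (With this
     * approach, the eight-byte gzip trailer is left unconsumed
     * after Z_STREAM_END; the production code verifies and
     * consumes it.) */
    if (inflateInit2(&state->stream, -15) != Z_OK)
        return (0);
    return (1);
}
```

The matching "close" function just tears down the zlib state and releases the private data. A sketch consistent with the initializer above:

```c
static int
gzip_filter_close(struct archive_read_filter *self)
{
    struct private_data *state = (struct private_data *)self->data;
    int ret = ARCHIVE_OK;

    /* inflateInit2 only ran if we got past the header. */
    if (state->header_done && inflateEnd(&state->stream) != Z_OK) {
        archive_set_error(&self->archive->archive,
            ARCHIVE_ERRNO_MISC, "Failed to clean up gzip decompressor");
        ret = ARCHIVE_FATAL;
    }
    free(state->out_block);
    free(state);
    return (ret);
}
```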
## Step 4: Add test cases

The libarchive test suite exercises many parts of libarchive to verify that they work correctly. Even if your code is perfect, a well-written test case will help ensure that no one else breaks your support with future changes. Changes to libarchive that include test cases are much more likely to be accepted.
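A sketch in the style of the existing tests might look like the following; the test name and reference file here are hypothetical (reference files are small archives checked into the test tree and unpacked by `extract_reference_file`):

```c
#include "test.h"

DEFINE_TEST(test_read_compression_gzip)
{
    /* Hypothetical reference file checked into the test tree. */
    const char *name = "test_read_compression_gzip.tgz";
    struct archive_entry *ae;
    struct archive *a;

    extract_reference_file(name);
    assert((a = archive_read_new()) != NULL);
    assertEqualIntA(a, ARCHIVE_OK, archive_read_support_compression_gzip(a));
    assertEqualIntA(a, ARCHIVE_OK, archive_read_support_format_all(a));
    assertEqualIntA(a, ARCHIVE_OK,
        archive_read_open_filename(a, name, 10240));
    assertEqualIntA(a, ARCHIVE_OK, archive_read_next_header(a, &ae));
    /* Verify that our filter's code and name were reported. */
    assertEqualInt(ARCHIVE_COMPRESSION_GZIP, archive_compression(a));
    assertEqualString("gzip", archive_compression_name(a));
    assertEqualInt(ARCHIVE_OK, archive_read_close(a));
    assertEqualInt(ARCHIVE_OK, archive_read_finish(a));
}
```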
The full gzip reader in the current libarchive code has extensive conditionals to provide as much support as possible on each particular platform:
- If the platform has zlib, it will use code similar to that in "Step 3" above to do full gzip decompression directly.
- If not, it will use code similar to that in "Step 2" above to run an external program, as sketched below. This provides gzip support as long as a gunzip executable is available at run time, even one installed after libarchive was built. In any case, without zlib it's hard to do much better than this.
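A sketch of how those conditionals might be arranged; here `gzip_bidder_init_zlib` is a hypothetical name for the Step 3 initializer, whereas the real source simply wraps whole regions of the file in `#ifdef HAVE_ZLIB_H`:

```c
static int
gzip_bidder_init(struct archive_read_filter *self)
{
#ifdef HAVE_ZLIB_H
    /* zlib was available at build time: use the native
     * implementation from Step 3 above. */
    return (gzip_bidder_init_zlib(self));
#else
    /* No zlib: fall back to the external program from Step 2. */
    int r = __archive_read_program(self, "gunzip");

    self->code = ARCHIVE_COMPRESSION_GZIP;
    self->name = "gzip";
    return (r);
#endif
}
```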
Many of the existing filters could use careful analysis and improvements:
- Careful benchmarking may point out speed improvements in some cases, though I don't know of any obvious low-hanging fruit at this point.
- Few of the filters fully support "skip" operations. These can provide big performance boosts when reading the table of contents of a large archive without actually extracting the bodies.
- Few of the filters provide full fallback to external programs when their libraries are unavailable.
- There are a few cases where filters could benefit from the new options handling, though mostly read filters don't need this.
Once you understand all of the above, you may be able to find ways to improve the libarchive core.
- The "read-ahead" and "consume" logic has been pretty thoroughly overhauled several times, but there may well be ways to improve it. If you do work on this, please test carefully; some of the logic here is very subtle.
- The read framework now provides seek capabilities which are used by the Zip and 7-Zip readers, but more work is needed: There are obvious performance issues and seeks do not currently propagate through filters.
- libarchive should support "set filter" operations in addition to the current "support filter." A "set filter" would unconditionally initialize the specified filter without going through the bidding phase and be more lenient about format errors. This would be useful in cases where the user knew the format and wanted to force a certain decoding.