Fix TarReader to handle binary-encoded checksum fields #122636
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
cc: @tmds
Tagging subscribers to this area: @dotnet/area-system-io
Pull request overview
This PR fixes TarReader to correctly handle tar archives that use GNU binary encoding (0x80 prefix) for the checksum field. Previously, the checksum parser only handled octal ASCII digits, causing InvalidDataException when reading valid tar files created by tools that use binary encoding for all numeric fields.
Key changes:
- Changed checksum parsing from `ParseOctal<uint>` to `ParseNumeric<int>` to support both octal and binary-encoded values
- Added comprehensive tests for binary-encoded checksum handling in both sync and async code paths
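To illustrate the two encodings mentioned above, here is a minimal sketch of a parser that accepts either an ASCII octal field or a GNU base-256 field marked by a leading 0x80 byte. It is not the `ParseNumeric` implementation in System.Formats.Tar; the names `TarFieldSketch` and `ParseNumericField` are hypothetical, and the sketch assumes only non-negative values need to be supported.

```csharp
using System;

// Octal field: "0001750" followed by NUL is 1000 in decimal.
Console.WriteLine(TarFieldSketch.ParseNumericField("0001750\0"u8)); // prints 1000

// Base-256 field: 0x80 marker, then big-endian 0x03E8 (also 1000).
byte[] binaryField = [0x80, 0, 0, 0, 0, 0, 0x03, 0xE8];
Console.WriteLine(TarFieldSketch.ParseNumericField(binaryField)); // prints 1000

// Hypothetical helper for illustration only (not part of System.Formats.Tar).
static class TarFieldSketch
{
    public static long ParseNumericField(ReadOnlySpan<byte> field)
    {
        if (field.IsEmpty)
        {
            throw new FormatException("Empty numeric field.");
        }

        if (field[0] == 0x80)
        {
            // GNU base-256: the remaining bytes hold the value in big-endian order.
            long value = 0;
            for (int i = 1; i < field.Length; i++)
            {
                value = checked(value * 256 + field[i]);
            }
            return value;
        }

        if ((field[0] & 0x80) != 0)
        {
            // Assumption: negative base-256 values are out of scope for this sketch.
            throw new NotSupportedException("Only the 0x80 (non-negative) marker is handled.");
        }

        // ASCII octal: skip leading spaces and trailing spaces or NUL bytes.
        int start = 0;
        int end = field.Length;
        while (start < end && field[start] == (byte)' ')
        {
            start++;
        }
        while (end > start && (field[end - 1] == (byte)' ' || field[end - 1] == 0))
        {
            end--;
        }

        long result = 0;
        for (int i = start; i < end; i++)
        {
            int digit = field[i] - (byte)'0';
            if (digit is < 0 or > 7)
            {
                throw new FormatException("Unable to parse number.");
            }
            result = checked(result * 8 + digit);
        }
        return result;
    }
}
```

A parser that only implements the octal branch rejects the second input, which matches the failure mode this PR describes for the checksum field.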
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs | Changed checksum parsing to use ParseNumeric<int> instead of ParseOctal<uint> to handle binary-encoded checksums |
| src/libraries/System.Formats.Tar/tests/TarTestsBase.cs | Added helper method CreateTarHeaderWithBinaryEncodedChecksum() to generate test headers with binary-encoded checksums |
| src/libraries/System.Formats.Tar/tests/TarReader/TarReader.GetNextEntry.Tests.cs | Added synchronous test Read_BinaryEncodedChecksum to verify binary checksum parsing |
| src/libraries/System.Formats.Tar/tests/TarReader/TarReader.GetNextEntryAsync.Tests.cs | Added asynchronous test Read_BinaryEncodedChecksum_Async to verify binary checksum parsing |
Do we know what tool/library outputs binary encoding for the checksum field?
Did someone check this? I don't think GNU tar accepts binary encoded checksums.

```csharp
using System.Diagnostics;

string tempPath = Path.GetTempFileName();
File.WriteAllBytes(tempPath, CreateTarHeaderWithBinaryEncodedChecksum());
Process.Start("tar", [ "xf", tempPath ]).WaitForExit();

static byte[] CreateTarHeaderWithBinaryEncodedChecksum()
{
    byte[] header = new byte[512];

    // Name: "test" (null-terminated)
    byte[] name = "test\0"u8.ToArray();
    name.CopyTo(header, 0);

    // Mode: "0000644\0" at offset 100
    byte[] mode = "0000644\0"u8.ToArray();
    mode.CopyTo(header, 100);

    // Uid: "0000000\0" at offset 108
    byte[] uid = "0000000\0"u8.ToArray();
    uid.CopyTo(header, 108);

    // Gid: "0000000\0" at offset 116
    byte[] gid = "0000000\0"u8.ToArray();
    gid.CopyTo(header, 116);

    // Size: "00000000000\0" at offset 124
    byte[] size = "00000000000\0"u8.ToArray();
    size.CopyTo(header, 124);

    // Mtime: "00000000000\0" at offset 136
    byte[] mtime = "00000000000\0"u8.ToArray();
    mtime.CopyTo(header, 136);

    // TypeFlag: '0' (regular file) at offset 156
    header[156] = (byte)'0';

    // Magic: "ustar\0" at offset 257
    byte[] magic = "ustar\0"u8.ToArray();
    magic.CopyTo(header, 257);

    // Version: "00" at offset 263
    byte[] version = "00"u8.ToArray();
    version.CopyTo(header, 263);

    // Calculate the correct checksum value.
    // During checksum calculation, the checksum field (offset 148-155) is treated as all spaces.
    int calculatedChecksum = 0;
    for (int i = 0; i < 512; i++)
    {
        if (i >= 148 && i < 156)
        {
            calculatedChecksum += (byte)' ';
        }
        else
        {
            calculatedChecksum += header[i];
        }
    }

    // Write checksum as binary-encoded (0x80 prefix) at offset 148.
    // The checksum field is 8 bytes. With 0x80 prefix, remaining 7 bytes store the value in big-endian.
    header[148] = 0x80;
    header[149] = 0;
    header[150] = 0;
    header[151] = 0;
    header[152] = (byte)((calculatedChecksum >> 24) & 0xFF);
    header[153] = (byte)((calculatedChecksum >> 16) & 0xFF);
    header[154] = (byte)((calculatedChecksum >> 8) & 0xFF);
    header[155] = (byte)(calculatedChecksum & 0xFF);

    return header;
}
```

outputs:
@TFTomSun, can you please share more details on the tar that's problematic, how it was created, etc.? Can you share it?
I found the ticket: #122635. (I overlooked it in the first comment apparently...) The code is probably missing gzip decompression. This works:

```csharp
using System.IO.Compression;

string outputDirectory = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(outputDirectory);
Console.WriteLine($"Outputting to {outputDirectory}");

HttpClient client = new HttpClient();
using Stream tarGzStream = await client.GetStreamAsync("https://github.com/qdrant/qdrant/releases/download/v1.16.2/qdrant-x86_64-unknown-linux-gnu.tar.gz");
await using var tarStream = new GZipStream(tarGzStream, CompressionMode.Decompress);
await using var tarReader = new System.Formats.Tar.TarReader(tarStream);

System.Formats.Tar.TarEntry entry;
while ((entry = await tarReader.GetNextEntryAsync()) != null)
{
    if (entry.EntryType is System.Formats.Tar.TarEntryType.SymbolicLink or System.Formats.Tar.TarEntryType.HardLink or System.Formats.Tar.TarEntryType.GlobalExtendedAttributes)
    {
        continue;
    }

    await entry.ExtractToFileAsync(Path.Combine(outputDirectory, entry.Name), overwrite: true);
}

Console.WriteLine("Done");
```
Description
`TarReader` throws `InvalidDataException: "Unable to parse number."` when reading tar files that use GNU binary encoding (0x80 prefix) for the checksum field. The checksum parsing used `ParseOctal`, which only handles ASCII octal digits, but some tar tools write all numeric fields, including the checksum, using the GNU binary format.
Changed checksum parsing from `ParseOctal<uint>` to `ParseNumeric<int>`, which already handles both octal and binary-encoded values (added in PR #101172 for large file support).
Customer Impact
Customers cannot read valid tar archives created by tools that use GNU binary encoding for all numeric fields. These archives work fine with GNU tar, 7-Zip, and other standard tools.
Regression
No. This is extending support for a tar format variant not previously handled.
Testing
- `Read_BinaryEncodedChecksum` and `Read_BinaryEncodedChecksum_Async` tests that create headers with 0x80-prefixed checksums
Risk
Low. `ParseNumeric` is already used for all other numeric fields (size, uid, gid, mtime, etc.). The checksum field uses the same encoding rules. Checksum values are small positive integers (max ~130K for 512-byte header) that fit comfortably in `int`.
Package authoring no longer needed in .NET 9
IMPORTANT: Starting with .NET 9, you no longer need to edit a NuGet package's csproj to enable building and bump the version.
Keep in mind that we still need package authoring in .NET 8 and older versions.