IPIP 0499: CID Profiles #499
Let's make the fanout match the max links from files and rename the profile to `-wide`; this will make it easier to discuss in ipfs/specs#499
Co-authored-by: Bumblefudge <bumblefudge@learningproof.xyz>
Import.* config params for controlling DAG width were added in: ipfs/kubo#10774
Thank you for kicking this off and filling in the initial state. I've incorporated specific "dag width" settings for Next:
Co-authored-by: Christian Paul <info@jaller.de>
I pushed a bunch of edits to move the conversation forward. This is sorely needed in the ecosystem, and the hope is that by building consensus we can improve developer experience when working with UnixFS and the overall health of the UnixFS ecosystem. Feedback is always appreciated.
1. UnixFS DAG layout (e.g. balanced, trickle)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` fanout (must be a power of 2)
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links
If this number is dynamic based on the lengths of the actual link entries in the dag, we will need to specify what algorithm that estimation follows. I would put such things in a special "ipfs legacy" profile to be honest, along with cidv0, non-raw leaves etc. We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.
So, each layout would have its own set of layout-params:
- balanced:
  - max-links: N
- trickle:
  - max-leaves-per-level: N
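As a rough sketch of the per-layout parameter idea above, each layout could carry its own parameter type. The class names, the `describe` helper, and the example values are hypothetical and only mirror the `max-links` and `max-leaves-per-level` names from the list; nothing here is taken from an existing implementation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class BalancedLayout:
    """Parameters that only make sense for a balanced DAG."""
    max_links: int


@dataclass(frozen=True)
class TrickleLayout:
    """Parameters that only make sense for a trickle DAG."""
    max_leaves_per_level: int


# A profile would hold exactly one of these, so parameters that do not
# apply to the chosen layout cannot be set by accident.
Layout = Union[BalancedLayout, TrickleLayout]


def describe(layout: Layout) -> str:
    """Render a layout and its parameters as a short, comparable string."""
    if isinstance(layout, BalancedLayout):
        return f"balanced(max-links={layout.max_links})"
    if isinstance(layout, TrickleLayout):
        return f"trickle(max-leaves-per-level={layout.max_leaves_per_level})"
    raise TypeError(f"unknown layout: {layout!r}")
```

Keeping the parameters attached to the layout, rather than in one flat list, would make it obvious which settings a given profile actually has to pin down.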
> We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.
Yeah, that's exactly what we're doing by defining this profile.
wait is kubo dynamically assigning HAMT Directory threshold, currently? i was assuming this was a static number!
The current spec mentions fanout but not threshold, so i'm a little confused what current implementations are doing and if it's even worth fitting into the profile system or just giving up and letting a significant portion of HAMT-sharded legacy data just be unprofiled/not-recreatable using the profiles...
@lidel Is this written down in any of the specs? Or is it just in the code at this point?
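To make the question in this thread concrete, here is a minimal sketch of what "an estimate of the block size by counting the size of links" could look like. Everything in it is an assumption for illustration: the `Link` type, the rule of adding the name length to the CID length, the 256 KiB value, and the strict `>` comparison are not confirmed anywhere in this conversation and are not a description of what Kubo or any other implementation does.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumed threshold for illustration only.
THRESHOLD_BYTES = 256 * 1024


@dataclass(frozen=True)
class Link:
    """Hypothetical directory entry: a name plus the binary form of a CID."""
    name: str
    cid_bytes: bytes


def estimated_directory_size(links: Iterable[Link]) -> int:
    """Estimate the block size by summing name and CID lengths per link.

    This ignores protobuf framing and the Tsize field, so it is an
    estimate rather than the size of the serialized block.
    """
    return sum(len(l.name.encode("utf-8")) + len(l.cid_bytes) for l in links)


def should_use_hamt(links: Iterable[Link], threshold: int = THRESHOLD_BYTES) -> bool:
    """Decide whether a directory would switch to a HAMTDirectory.

    Whether the comparison is `>` or `>=` is itself a detail a profile
    would have to state; `>` is an arbitrary choice here.
    """
    return estimated_directory_size(links) > threshold
```

Even in this small form, the result depends on the actual link names and CID lengths, which is why a profile relying on it would need the estimation rule written down.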
src/ipips/ipip-0499.md
Outdated
1. Whether empty directories are included in the DAG
   - Some implementations apply filtering before merkleizing filesystem entries in the DAG.
This is weird, because then we need to consider empty files, hidden files, unreadable files, symlinks and symlink follows, so probably need to mention all those as part of the profile too?
This is motivated by Git's default behaviour which ignores empty directories.
But we can mention here the rest.
wait, @hsanjuan, do you mean mentioning whether empty files, hidden files, etc affect the decision of whether a directory is empty, or do you mean that each of those files might be divergently handled by different implementations and should be a variable in the profile? I would much rather behavior for all of those file types be a UnixFS concern and specified in UnixFS spec, modulo any historic variations worth including in a profile...
Do we have existing implementations that support filtering differently on all of these? Because unless we do, I would really rather not specify all possible variants. And I agree with @bumblefudge: let's have two behaviours if possible, and punt to the UnixFS spec for how to describe them.
Yeah, these choices are being made today and it'd be nice to be explicit about them. e.g. Helia leaves all of these options at their default (false): https://github.com/ipfs/helia/blob/027bd3549da9ef5a6f07eaac346942cf24f3fc24/packages/unixfs/src/utils/glob-source.ts#L12-L42
But in filecoin-pin currently I've opted to include hidden files: https://github.com/filecoin-project/filecoin-pin/blob/9ab3f8c110ce0b6c6bf21c1fcdbcf84ade557953/src/core/unixfs/car-builder.ts#L30-L32 (I'm rethinking that choice now, but I'd like to know Kubo's defaults as well).
I'd prefer to align to a standard profile for file filtering so we collectively have "one standard default behaviour", but I understand it's a bit more work to explicate all of that. So maybe it can be a hand-wave for now and tightened up later because you could argue it's external to a unixfs spec and more about the choice of what to feed into a unixfsification process.
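A small sketch of why these filtering choices matter: the same directory listing produces a different set of entries to merkleize depending on two flags. The `FilterOptions` fields and the `select_entries` helper are hypothetical names, not the Helia or filecoin-pin API, and the rule that a directory counts as non-empty only if a kept file lives under it is an assumption modelled on the Git behaviour mentioned above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Tuple

Entry = Tuple[str, bool]  # (relative POSIX path, is_directory)


@dataclass(frozen=True)
class FilterOptions:
    """Hypothetical filtering knobs a profile might pin down."""
    include_hidden: bool = False
    include_empty_dirs: bool = False


def is_hidden(path: str) -> bool:
    """A path is hidden if any of its components starts with a dot."""
    return any(part.startswith(".") for part in PurePosixPath(path).parts)


def select_entries(entries: List[Entry], options: FilterOptions) -> List[Entry]:
    """Return the entries that would be fed into the DAG builder."""
    kept = [(p, d) for p, d in entries if options.include_hidden or not is_hidden(p)]
    if options.include_empty_dirs:
        return kept
    # Only files make their ancestor directories count as non-empty.
    non_empty = set()
    for path, is_dir in kept:
        if not is_dir:
            non_empty.update(str(parent) for parent in PurePosixPath(path).parents)
    return [(p, d) for p, d in kept if not d or p in non_empty]
```

Two tools that disagree on either flag would build different DAGs, and therefore different root CIDs, from the same folder even if every other profile parameter matched.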
src/ipips/ipip-0499.md
Outdated
1. UnixFS chunk size
1. UnixFS DAG layout (e.g. balanced, trickle)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` fanout (must be a power of 2)
this can alternatively be called "bitwidth" and you just use the number of bits for this, it's what we're doing in all the other hamts we have. So the default bitwidth is 8 = 256 leaves, bitwidth of 5 would be 32, etc.
is it worth mentioning this alias/terminology in the IPIP itself (i.e. to developer-users), or is "bitwidth" just a term of art among implementers/deep-heads?
Bit-width is a bit more intuitive imo, because it ties more deeply to how the HAMT is implemented.
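The relationship between the two terms in this thread is just a power of two, which a short sketch makes explicit. The function names are made up for illustration.

```python
def hamt_fanout(bitwidth: int) -> int:
    """Number of child slots per HAMT node for a given bitwidth."""
    if bitwidth < 1:
        raise ValueError("bitwidth must be at least 1")
    return 1 << bitwidth  # 2 ** bitwidth


def hamt_bitwidth(fanout: int) -> int:
    """Inverse mapping; rejects fanouts that are not a power of 2."""
    if fanout < 2 or fanout & (fanout - 1) != 0:
        raise ValueError("fanout must be a power of 2")
    return fanout.bit_length() - 1


print(hamt_fanout(8))  # 256
print(hamt_fanout(5))  # 32
```

Specifying the bitwidth rather than the fanout makes the "must be a power of 2" constraint impossible to violate, since every positive bitwidth maps to a valid fanout.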
Co-authored-by: Hector Sanjuan <code@hector.link>
Co-authored-by: Rod Vagg <rod@vagg.org>
1. UnixFS DAG layout (e.g. balanced, trickle etc...)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` bitwidth, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves).
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links
Suggested change:
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links. We do not include details about the estimation algorithm as we do not encourage implementations to support it.
Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID.
This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. They can be used to verify data across implementations, provide recommended settings depending on retrieval performance goals, and more.
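The effect described above can be illustrated with a toy model. The sketch below does not produce real CIDs or UnixFS blocks: `Profile` and `toy_root_digest` are hypothetical, the tree is a plain hash-of-hashes, and the `cid_version` and `layout` fields are carried along but not exercised. It only shows that changing the chunk size or DAG width changes the root digest for identical input bytes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical bundle of the parameters a profile would fix."""
    cid_version: int
    hash_algorithm: str  # any name accepted by hashlib.new
    chunk_size: int      # bytes per leaf
    dag_width: int       # max children per intermediate node
    layout: str          # recorded only; the toy always builds a balanced tree


def toy_root_digest(data: bytes, profile: Profile) -> str:
    """Hash fixed-size chunks, then hash groups of digests up to a single root."""
    if profile.chunk_size < 1 or profile.dag_width < 2:
        raise ValueError("chunk_size must be >= 1 and dag_width >= 2")

    def digest(block: bytes) -> bytes:
        return hashlib.new(profile.hash_algorithm, block).digest()

    step = profile.chunk_size
    level = [digest(data[i:i + step]) for i in range(0, len(data), step)]
    if not level:
        level = [digest(b"")]
    while len(level) > 1:
        width = profile.dag_width
        level = [digest(b"".join(level[i:i + width])) for i in range(0, len(level), width)]
    return level[0].hex()


data = bytes(range(256)) * 16  # 4096 bytes of sample input
small = Profile(1, "sha256", chunk_size=256, dag_width=4, layout="balanced")
large = Profile(1, "sha256", chunk_size=1024, dag_width=4, layout="balanced")
print(toy_root_digest(data, small) == toy_root_digest(data, large))  # False
```

Two parties that agree on a named profile would, in this model, always arrive at the same root for the same bytes, which is the reproducibility property the proposal is after.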