Add support for typed N-D array to simplify large array storage #267
Conversation
A header to speed up reading/writing large N-D arrays and save space.
Thank you!
This would be a good opportunity to introduce support for 64-bit arrays. If you want to put many gigabytes of data into these arrays, you don't worry that much about the size of the length field. On the other hand, the space would be wasted if you don't want to do that. Maybe sometime in the future the N-D array could replace some of the array types, and we could reclaim those for 64-bit N-D arrays.
I'll note that this would be a high priority for machine learning / deep learning applications, where you will want to ship around tensors too. I will still likely use MsgPack as it has many of the advantages I'd love, but if/when the overhead becomes problematic, packing the floats into extension bytes as suggested in #198 feels a tad odd. Not too crazy odd, but there appear to be enough domains and reasons that this seems a good idea to me.
Any news on this?
I am curious too. Is anyone in charge of further developing this specification? Happy to hear your opinions.
Having such types for large N-D arrays looks reasonable to me. That should improve the performance of such use cases. How do you feel about it?
@tagomoris, do you mind giving an example (like a 2x3x4 all-one uint8 matrix)? I am not entirely sure how one can use the proposed ext type for this.
@fangq Check the example of the Timestamp ext type. The reason I showed only that example is that I know there are needs for larger objects (gigabytes or more), but my idea doesn't focus on them; essentially, such data (> 4GB) requires totally different I/O and serialization/deserialization strategies. I don't think that existing msgpack libraries can support such huge I/O efficiently.
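For concreteness, here is a minimal sketch of how the requested 2x3x4 all-one uint8 array could be carried in an ext payload, in the spirit of the fixed-layout Timestamp type. The (ndim, dims, raw data) layout below is an assumption for illustration, not part of the msgpack spec or of any proposal in this thread:

```python
# Hypothetical sketch: one way a 2x3x4 uint8 array could be packed as a
# msgpack ext payload. The layout (1-byte ndim, uint32 per dimension,
# raw data) is assumed for illustration only.
import struct

def pack_ndarray_payload(dims, data):
    """Build an ext payload: 1-byte ndim, one big-endian uint32 per dim, raw bytes."""
    payload = struct.pack("B", len(dims))
    for d in dims:
        payload += struct.pack(">I", d)   # big-endian, like other msgpack integers
    return payload + bytes(data)

# a 2x3x4 uint8 array of all ones: 24 data bytes + 13 header bytes
payload = pack_ndarray_payload((2, 3, 4), [1] * (2 * 3 * 4))
assert len(payload) == 1 + 3 * 4 + 24
```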
@tagomoris, that is a viable solution; however, in many scientific applications, 4GB is unfortunately not an unfathomable upper bound for storing arrays; some of my colleagues are producing terabytes of data per day by recording images from high-resolution microscopes. In neuroimaging, for example, a common format, NIfTI-1 (.nii), was designed to store up to a 7-D array with each dimension up to 2^16-1 (as a short), but in 2011 this format had to be extended (NIfTI-2) to support large data arrays, with each dimension widened to a 64-bit integer: https://nifti.nimh.nih.gov/nifti-2

Defining a dedicated data container that allows msgpack to represent such data could be a future-proofing solution.
0xc1 is reserved so that it is never used anywhere. Some implementations depend on the fact that 0xc1 never appears. The ext format family, on the other hand, is reserved for extension.
@frsyuki, please see more prior discussions on this proposal at #268
but what is the real-world use case for this? What application can benefit from knowing that 0xc1 is never used? In other words, if we use 0xc1 to define new classes of complex objects, do you see this causing any problems?

@tagomoris, using 0xc1 for ext64 is fine, but one still needs to propose the extension format for the N-D array within this format. msgpack has the 1-D array as a first-class citizen; demoting the N-D array to an extension does not feel quite logical. On the other hand, my proposal does not necessarily under-utilize this resource - my proposal actually has 3 parts if you look at the above two threads.

You can certainly define a 0xc1+ext marker to enable a 64-bit ext, and this is another way to allow future extensions. If we are moving in this direction, perhaps we should give 0xc1 a new name - maybe "template" or "stencil"; this would give a lot more flexibility and extension capacity in the future.
The ext type could be used in the following manner: a 2-element fixarray ... How do applications deal with binaries bigger than 4GB? I suspect that they use chunked encoding, and this formatting fits that structure.
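The example bytes were lost above, but a guess at the shape of that idea is a 2-element wrapper pairing the shape with the flat data. This sketch uses the msgpack-python package; the wrapping convention itself is an assumption, not an official msgpack type:

```python
# Assumed convention for illustration: a 2-element fixarray
# [shape, flat-data], with the data travelling as a bin object.
import msgpack

shape = [2, 3, 4]
flat = bytes([1] * 24)                 # 2*3*4 uint8 elements
packed = msgpack.packb([shape, flat], use_bin_type=True)

outer_shape, outer_data = msgpack.unpackb(packed, raw=False)
assert outer_shape == shape and outer_data == flat
```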
Nowadays it is fairly common for an application to allocate/process 4GB of memory, especially in scientific applications. 64-bit machines and compilers allow allocating dynamic or static arrays with lengths defined via `size_t`: https://www.quora.com/How-much-memory-can-malloc-and-calloc-allocate
An example use of 0xc1: https://github.com/msgpack/msgpack-ruby/blob/1e35fb8a771339fc51a9a9c96e77046dfc086954/ext/msgpack/unpacker.c#L53-L61

One thing we need to think about is how existing applications and tools should deal with N-D arrays. They could (A) deal with them as an opaque ext type value, or (B) throw an exception. The approach I showed above is for (A). The intention of the ext type is that all format extensions follow (A). If we allow (B), we have many more options... (I'm still catching up with the proposal.)

Does it mean that real-world applications allocate memory larger than 4GB as one sequential region of memory?
msgpack is a JSON-like format optimized for small-to-medium-sized data.
Many other formats also support large arrays: for example, TIFF (https://www.loc.gov/preservation/digital/formats/fdd/fdd000328.shtml), and HDF5, which supports 64-bit integers as dimensions and 512GB/1TB file sizes (https://support.hdfgroup.org/HDF5/faq/limits.html).
@methane, one does not have to use it, but I don't see how it hurts to support it, if all we need is to provide a container marker and a format. On the other hand, part of my interest is to enable JSON to support complex and large datasets.
I think the N-D array (or, more generally, an array of fixed-length elements) is a good idea. I have some other use cases in mind where it can reduce the overhead of type tags.
I can propose the following idea as an alternative approach: ... With this, ... For example, to represent a 2-dimensional 4-row array, ... D=0 is useful to reduce size significantly, especially when we want to send an array of mid-size integers or floats like ... Do you still think supporting binaries bigger than 4GB is important?
@tagomoris, can you elaborate? Why can't it define data types larger than 4GB? It is up to the developer to decide, but can't we use ...
Re: @frsyuki
Reading your above code, it appears to me that, to do that, you can simply define the upper bytes in ..., then change your ... Again, the proposed syntax requires ... This will free up ...
@fangq That is the priority problem. My point is that the top-priority issue (in my opinion) is to support objects larger than 4GB as ext types. We can build new msgpack standard types on top of that (including N-D arrays) if we have ext64 and/or ext128. It can also satisfy others' requirements for larger objects (by using user-space (id > 0) ext types). IIUC, the idea of using the combination of ...
@tagomoris, I agree. I just want to add a few comments regarding future extensions. First, the current design of msgpack constructs, i.e. ... Your earlier proposal #267 (comment) can work, but I don't think we need ...; this results in a triplet ... In any case, the logic of ...

Agreed; adding N-D array support will address many real-world use cases, so giving it priority would be great.
We have the following options proposed here:
I think (B) is not a good direction, because I can imagine other use cases of extension types with data bigger than 4GB. (C) looks good. But for more extensibility, here is another proposal, (D): a type template is a nested variable-length object, but the total payload size is defined before seeing the payload. This is consistent with MessagePack's basic design. It also defines the sext type = -1.

For N-dimensional arrays, for example, it can serialize a 2-dimensional array (array[0] = [1,2,3,4,5,6], array[1] = [72,16,87,25,46,87,63]) as follows (0xc1 + type template + 6*1 + 7*1 = 25 bytes total). Its format is: ...

As shown above, all integers are in one sequential region of memory. A decoder can parse it with zero-copy, although the API design is tricky. Optionally, if typical applications have a requirement to deserialize an N-dimensional array as a special type (such as ...
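The byte-level listing did not survive in this thread, but the quoted 25-byte total can be sanity-checked. A minimal sketch, assuming the 0xc1 marker plus the type template together cost 12 bytes and each uint8 element costs exactly 1 byte (the 12-byte overhead is inferred from the arithmetic, not taken from the lost listing):

```python
# Sanity check of "0xc1 + type template + 6*1 + 7*1 = 25 bytes total".
row0 = [1, 2, 3, 4, 5, 6]
row1 = [72, 16, 87, 25, 46, 87, 63]
elements = len(row0) * 1 + len(row1) * 1   # uint8: one byte per element
overhead = 25 - elements                   # 0xc1 marker + type template
assert (elements, overhead) == (13, 12)
```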
Another use case is a 2D geometric area or path. Example: [(12.3456,34.5678),(12.3456,34.5678),(12.3456,34.5678)]

Expected examples of future type templates: ...

As we define these, data size becomes smaller overall for big data, because one type definition is shared by multiple objects. We also get a chance to extend MessagePack using the new 8-bit space, because 0xc1 is essentially a mode change. However, while the first spec should be small, this direction makes MessagePack implementations much more complicated (especially the implementation of dictionary-encoded types, which might be very complicated).
As I mentioned above, my goal was not to exclusively claim ... For the N-D array, I only intend to use ... As for your proposal (D), I am excited to see extensions towards more complex data support, but on the other hand, this is not the direction I would like this thread to steer towards, at least for now - supporting complex structural data requires a lot of deliberation, and it could take a long time to design and implement (see the stalled effort in UBJSON in #268 (comment)). In comparison, N-D arrays have a clear use case and are simple and straightforward to support. I again feel that this does not need to wait until the template support is fully cooked - see my previous post.
Could you provide an example for a higher-dimensional array, say a 3D array like this 2x3x4 uint8 array? Specifically, I would like to understand how the total number of dimensions (3 in this example), as well as the length of each dimension ([2,3,4] here), are stored in this proposed container.

will be ...
0xc1 is the only byte remaining. Once we use it, we have no further room to extend unless we make 0xc1 extensible enough. I would like to either keep it as-is or make it very extensible.
Agreed. That's also what I was hoping to expand in the future (again, mentioned in #268 (comment)).
I managed to read the proposed format - overall, the construct aligns with what I initially suggested. The main difference is a nested type template vs. a flat integer array to store the N-D dimensional data. I do have a few questions regarding some of the sample data - likely those are typos, but just to confirm, so that I can fully understand the format: first, in your above proposal, you defined a "type" byte immediately following ...; in your example for the 2x3x4 array in your above reply, I assume the 3rd ...
Is it possible to define the N-D array dimensional data in a flat array? It is a lot easier to process. The proposed method requires the decoder to read out the dimensions recursively, in depth - without a guarantee that the data array is rectangular unless all embedded levels are read (and compared). This can cause some overhead and complex parsing logic. Also, if someone abuses this nested structure and creates a deeply nested array, the recursion can cause a stack overflow. If the contained data is a simple rectangular N-D array, a flat array is probably easiest to decode. See my next reply for a revised proposal.
Here is my proposed sext construct: ...
A few examples:
- An N-D array is encoded to ...
- An array-of-structures (AoS) example is encoded to ... (see the size savings due to the removal of repeated field names)
- A structure-of-arrays (SoA) example: the data is the same as above, now encoded by columns instead of rows
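The encoded byte strings were lost above, but the field-name savings can still be illustrated with plain msgpack maps and arrays rather than the proposed sext construct (whose byte layout is not reproduced here). A minimal sketch using the msgpack-python package:

```python
# AoS repeats every field name once per record; SoA stores each field
# name once, followed by a column of values.
import msgpack

records = [{"x": i, "y": i * 2, "label_field": i * 3} for i in range(100)]
aos = msgpack.packb(records)

columns = {"x": [r["x"] for r in records],
           "y": [r["y"] for r in records],
           "label_field": [r["label_field"] for r in records]}
soa = msgpack.packb(columns)

print(len(aos), len(soa))   # SoA is much smaller: field names appear once
assert len(soa) < len(aos)
```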
I added my proposal for the nd-array at #311, using the ext type. I also believe that it's better to leave 0xc1 as it is. ... will be represented as ... With this format, a commonly used 3x3 float matrix can be represented with only 7 additional bytes.
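One plausible accounting of that 7-byte figure, assuming an ext16-style header plus a one-byte element type and two single-byte dimensions; the actual layout of #311 may differ, so the breakdown below is an inference, not quoted from the proposal:

```python
# Assumed breakdown of the "7 additional bytes" for a 3x3 float32 matrix.
data_bytes = 3 * 3 * 4            # nine float32 values
ext16_header = 1 + 2 + 1          # 0xc8 marker, 16-bit length, ext type id
ndarray_header = 1 + 1 + 1        # assumed: element type + two dims as fixints
assert ext16_header + ndarray_header == 7
print(data_bytes + 7, "bytes total")   # 43 bytes for the whole matrix
```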
@fangq Regarding commit 295b1ab.
@BambOoxX, an array element order flag can be added, like the optional one here: fangq/jdata@85994de#diff-533b679ae40e5ef5b4e1c18400b05cfa0abedb5efe5030ad022263c2c8537258R617-R619 However, that just adds an extra layer of complexity and somewhat contradicts the design style of msgpack, IMHO. Just as integers are stored in big-endian byte order, I think settling on one of the options may not necessarily be a bad thing for a language-independent data format. Maybe I missed something, but @cmpute's typed N-D array also assumes row-major, from my reading.
@fangq Regarding what @cmpute proposed, I just figured one could use one ext type for row-major and one for column-major. I just wanted to point out that serializing, in row-major order, a multi-dimensional array that is stored in memory in column-major order may lead to performance issues (over large data, of course), especially for high dimension counts. Choosing a specific setting may reduce the interest of some users if that choice does not agree with their main language for data storage. I am myself very far from these large-array issues, so it's really a highly uneducated guess, but most of my data is stored in column-major order. I guess I might have said nothing if you had kept it as the default :)
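A minimal sketch of that two-type idea using msgpack-python's ExtType; the ids 0x20 and 0x21 are invented here for illustration and are not assigned by any proposal:

```python
# Hypothetical: one user-space ext id per element order, so a decoder
# knows whether the payload bytes can be mapped into memory directly.
import msgpack

ROW_MAJOR, COL_MAJOR = 0x20, 0x21   # made-up ids for this example

def pack_array_bytes(raw: bytes, column_major: bool = False) -> bytes:
    type_id = COL_MAJOR if column_major else ROW_MAJOR
    return msgpack.packb(msgpack.ExtType(type_id, raw))

packed = pack_array_bytes(bytes(24), column_major=True)
ext = msgpack.unpackb(packed)
assert ext.code == COL_MAJOR and len(ext.data) == 24
```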
I agree that the array order and even the endianness matter a lot in this scenario, because they affect the ability to directly map a memory span as an array. With that said, it's pretty easy to add this in my proposal, since there are many available type slots. So we can have something like ...
I understand msgpack's decision to keep the endianness consistent across the standard, but it's also beneficial to offer this workaround to pass large amounts of data without conversion (other than manually packing them in ...
Also, as a side note, the de facto standard array library for Python, NumPy, provides functionality to represent an n-d array in different orders and endianness. This is more related to the format of the data itself than to a specific language.
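For instance (a minimal NumPy sketch; nothing here is msgpack-specific), the same logical array can be materialized in either element order and either byte order, which is why a format that fixes both may force copies:

```python
# The same logical values in row-major vs. column-major layout and in
# big-endian vs. little-endian byte order.
import numpy as np

a = np.arange(24, dtype=np.uint8).reshape(2, 3, 4)      # row-major (C order)
f = np.asfortranarray(a)                                # column-major copy
assert a.flags["C_CONTIGUOUS"] and f.flags["F_CONTIGUOUS"]

be = a.astype(np.dtype(">u2"))                          # big-endian uint16
le = be.astype(np.dtype("<u2"))                         # little-endian copy
assert be.tobytes() != le.tobytes()                     # raw bytes differ
assert (be == le).all()                                 # logical values equal
```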
I've updated my proposal, but it seems the owners of the spec are very conservative about this change.
Any new status on this?
This feature is definitely very handy nowadays.
Are there any updates with regard to the proposal? Packed arrays would be very handy to have, and simply sticking things into a binary type is inadequate because it breaks the benefits of using msgpack (namely, the ability to load the data in other languages).
A simple header to speed up reading/writing large N-D arrays and save space. Particularly useful when processing 2D image data and 3D or higher-dimensional data from scientific research.
A similar array header is supported by UBJSON and its extension to N-D arrays in the JData specification draft. A similarly strong need for such a feature from the UBJSON user community was discussed previously here.