@fangq fangq commented May 22, 2019

A simple header to speed up reading/writing large N-D arrays and save space. Particularly useful when processing 2D image data and 3D or higher-dimensional data from scientific research.

A similar array header is supported by UBJSON and was extended to N-D arrays in the JData specification draft. The UBJSON user community has also expressed a strong need for such a feature, as discussed previously here.

A header to speedup reading/writing large N-D arrays and save space
@fangq fangq mentioned this pull request May 23, 2019
@maxnoe

maxnoe commented May 23, 2019

Thank you!

@lhns

lhns commented Sep 12, 2019

This would be a good opportunity to introduce support for 64-bit arrays.

If you want to put many gigabytes of data into these arrays, you don't worry that much about the size of the length field. On the other hand, the space would be wasted if you don't want to do that.

Maybe sometime in the future N-D array could replace some of the array types and we could reclaim these for 64-bit N-D arrays.

@Smerity

Smerity commented Nov 5, 2019

I'll note that would be a high priority for machine learning / deep learning applications where you will want to ship around tensors too.

I will still likely use MsgPack as it has many of the advantages I'd love but if/when the overhead becomes problematic packing the floats into extension bytes as suggested in #198 feels a tad odd. Not too crazy odd but there appear to be enough domains and reasons that this is a good idea for me.

@maxnoe

maxnoe commented Feb 26, 2020

Any news on this?

@fangq
Author

fangq commented Feb 26, 2020

I am curious too. Is anyone in charge of further developing this specification? Happy to hear your opinions.

@tagomoris
Member

Having such types for large N-D array looks reasonable to me. That should improve the performance of such use-cases.
But at the same time, I think this type can be supported by extension types, like Timestamp. If we have very large N-D arrays, the overhead of ext types (just one or two bytes) doesn't make any difference in performance actually.

How do you feel about it?

@fangq
Author

fangq commented Feb 27, 2020

@tagomoris, do you mind giving an example (like a 2x3x4 all one uint8 matrix)? I am not entirely sure how one can use ext type to store an N-D array, where do you store the dimension sizes?

@tagomoris
Member

@fangq Check the example of Timestamp ext type.
The bytes start with 0xc7, followed by the length of the entire object after the ext type id, then the ext type id (-2 for N-D array, as a temporary value); the type, dim and dimension sizes then follow, as in your current proposal.

Typed N-D array stores an array with a specified data type and lengths in up to (2^8)-1 dimensions
+--------+--------+--------+--------+--------+====================================+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
|  0xc7  |  size  |   -2   |  type  |  dim   | dim uint32 integers (N1,N2,...,ND) |   N1*N2*...*ND values of type |
+--------+--------+--------+--------+--------+====================================+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+

The reason why I showed only the 0xc7 example is that the target of N-D array is clearly not fixed-size objects. For larger objects, we can use 0xc8 (up to (2^16)-1 bytes) or 0xc9 (up to (2^32)-1 bytes).
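
To make this layout concrete, here is a minimal Python sketch that packs the 2x3x4 all-ones uint8 matrix asked about earlier into the ext 8 form drawn above. It assumes the provisional ext type id -2, big-endian uint32 dimension sizes, and 0xcc (uint 8) as the element type marker. The function name `pack_nd_uint8_ext8` is made up for this illustration and is not part of any msgpack library.

```python
import struct

ND_EXT_TYPE = -2      # provisional ext type id suggested in the comment above
UINT8_MARKER = 0xCC   # msgpack marker for uint 8, reused here as the element type


def pack_nd_uint8_ext8(dims, values):
    """Pack a uint8 N-D array as: 0xc7, size, ext type, type, dim, dims, values."""
    count = 1
    for n in dims:
        count *= n
    if len(values) != count:
        raise ValueError("number of values does not match the dimensions")

    # Everything after the ext type id counts towards the ext 8 size byte.
    body = struct.pack("BB", UINT8_MARKER, len(dims))
    body += struct.pack(">%dI" % len(dims), *dims)   # dimension sizes, uint32
    body += bytes(values)                            # N1*N2*...*ND uint8 values
    if len(body) > 0xFF:
        raise ValueError("too large for ext 8; ext 16 or ext 32 would be needed")

    return struct.pack(">BBb", 0xC7, len(body), ND_EXT_TYPE) + body


encoded = pack_nd_uint8_ext8((2, 3, 4), [1] * 24)
print(len(encoded), encoded[:5].hex())   # 41 c726fecc03
```

The dimension sizes live inside the ext payload, directly after the `type` and `dim` bytes, so a decoder that does not know ext type -2 can still skip the whole object using the size byte.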

I know there are needs for larger objects (equal to or larger than gigabytes) and my idea doesn't focus on it, but essentially, such data (> 4GB) requires totally different I/O and serialization/deserialization strategies. I don't think that existing msgpack libraries can support such huge I/O efficiently.

@fangq
Author

fangq commented Feb 27, 2020

@tagomoris, that is a viable solution, however, in many scientific applications, 4GB data is unfortunately not an unfathomable upper bound for storing arrays; some of my colleagues are producing terabytes of data per day by recording images from high-resolution microscopes. In neuroimaging, for example, a common format, NIfTI-1 (.nii), was designed to store up to 7D array with each dimension up to 2^16-1 (as a short), but in 2011, this format had to be extended (NIfTI-2) to support large data arrays with each dimension extended to a 64bit integer.

https://nifti.nimh.nih.gov/nifti-2
https://www.nitrc.org/forum/forum.php?thread_id=2148&forum_id=1941

Defining a dedicated data container that allows msgpack to represent such data could be a future-proofing solution.
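
As a rough illustration of the size argument, the following Python sketch picks the smallest existing ext header (ext 8, ext 16 or ext 32) that can hold an array of a given shape, and reports when none fits. The shapes used are hypothetical examples, not measurements from any instrument, and `smallest_ext_marker` is a helper name invented here.

```python
# (marker, maximum payload size in bytes) for the existing ext 8/16/32 formats
EXT_LIMITS = (
    (0xC7, 2**8 - 1),
    (0xC8, 2**16 - 1),
    (0xC9, 2**32 - 1),
)


def smallest_ext_marker(shape, itemsize, header_bytes=0):
    """Return (marker, nbytes); marker is None if the data exceeds ext 32."""
    nbytes = itemsize
    for n in shape:
        nbytes *= n
    nbytes += header_bytes   # optional per-array header (type, dims, ...)
    for marker, limit in EXT_LIMITS:
        if nbytes <= limit:
            return marker, nbytes
    return None, nbytes


print(smallest_ext_marker((2, 3, 4), 1))          # small uint8 array: ext 8
print(smallest_ext_marker((4096, 4096, 512), 4))  # hypothetical float32 volume: no fit
```

The second call describes 2^35 bytes (32 GiB), which is beyond the (2^32)-1 byte limit of ext 32.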

@tagomoris
Member

@fangq If so, we should NOT use 0xc1 only for N-D arrays, but should use it for the type of ext 64, which can support many various huge objects, including N-D arrays.
@frsyuki What do you think about it?

@frsyuki
Member

frsyuki commented Feb 27, 2020

0xc1 is reserved to not be used anywhere. Some implementations depend on the fact that 0xc1 is never used. On the other hand, ext format family is reserved for extension.

@fangq
Author

fangq commented Feb 27, 2020

@frsyuki, please see more prior discussions on this proposal at #268
#270

0xc1 is reserved to not be used anywhere.

but what is the real-world use case for this? what application can benefit from knowing that 0xc1 is not used? in other words, by using 0xc1 to define new classes of complex objects, do you see this can cause any problems?

@tagomoris, using 0xc1 for ext64 is fine, but one still needs to propose the extension format for ND arrays within this format. Also, msgpack has the 1D array as a first-class citizen but would demote the ND array to an extension, which doesn't feel quite logical.

On the other hand, my proposal does not necessarily under-utilize this resource: it actually has 3 parts, if you look at the above two threads

  1. 0xc1+numerical type (0xca-0xd3) markers define an ND array of that type
  2. 0xc1+true ... 0xc1+false define dynamic-length objects, and
  3. 0xc1+array/map markers may enable future support of structure templates (composite records defined by a schema) (see Proposal: support typed N-D array #268 (comment))

You can certainly define a 0xc1+ext marker to enable 64bit ext, and this is another way to see future extensions. If we are moving in this direction, perhaps we should give 0xc1 a new name, maybe "template" or "stencil"; this will give a lot more flexibility and extension capacity in the future.

@frsyuki
Member

frsyuki commented Feb 27, 2020

Ext type could be used in following manner:

fixarray 2-element
ext-8 [a type tag that indicates that this is a sequence of N-D array’s buffer chunks] [dims...]
array-16/32 [number of chunks] [bin-32 and chunk] [bin-32 and chunk] ...

How do applications deal with binary bigger than 4GB? I suspect that they use chunked encoding, and this format fits that structure.
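
The chunked layout above could be sketched in Python roughly as follows. The comment does not fix the ext type id or the encoding of the dims, so this sketch assumes a hypothetical application-specific ext type id of 1 and big-endian uint32 dimension sizes; the tiny chunk size is only there to keep the example readable.

```python
import struct

CHUNKED_ND_EXT_TYPE = 1   # hypothetical ext type id, not assigned anywhere


def pack_chunked(dims, data, chunk_size):
    """fixarray(2): [ext 8 holding the dims, array 16 of bin 32 chunks]."""
    header = struct.pack(">%dI" % len(dims), *dims)
    if len(header) > 0xFF:
        raise ValueError("too many dimensions for ext 8")

    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if len(chunks) > 0xFFFF:
        raise ValueError("too many chunks for array 16; array 32 would be needed")

    out = bytearray([0x92])                                   # fixarray, 2 elements
    out += struct.pack(">BBb", 0xC7, len(header), CHUNKED_ND_EXT_TYPE)
    out += header                                             # ext 8 payload: dims
    out += struct.pack(">BH", 0xDC, len(chunks))              # array 16 of chunks
    for chunk in chunks:
        out += struct.pack(">BI", 0xC6, len(chunk)) + chunk   # bin 32 + bytes
    return bytes(out)


packed = pack_chunked((2, 5), bytes(range(10)), chunk_size=4)
print(len(packed))   # 40
```

Each bin 32 chunk is limited to (2^32)-1 bytes, but the number of chunks is not, so the total can exceed 4GB; a reader that needs one contiguous buffer has to copy the chunks back together.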

@fangq
Author

fangq commented Feb 27, 2020

How do applications deal with binary bigger than 4GB? I suspect that they use chunked encoding, and this format fits that structure.

Nowadays, it is fairly common for an application to allocate/process 4GB of memory, especially in scientific applications. 64bit machines and compilers allow allocating dynamic or static arrays whose length is defined via size_t, which on a 64bit machine goes up to SIZE_MAX=2^64-1 bytes.

https://www.quora.com/How-much-memory-can-malloc-and-calloc-allocate

@frsyuki
Member

frsyuki commented Feb 27, 2020

An example use of 0xc1: https://github.com/msgpack/msgpack-ruby/blob/1e35fb8a771339fc51a9a9c96e77046dfc086954/ext/msgpack/unpacker.c#L53-L61

One thing we need to think about is how existing applications and tools should deal with N-D arrays. They could (A) deal with them as an opaque Ext type value, or (B) throw an exception. The approach I showed above is for (A). The intention of the Ext type is to make all format extensions behave as (A).

If we allow (B), we have many different options...(I'm still catching up with the proposal).

@frsyuki
Member

frsyuki commented Feb 27, 2020

64bit machines and compilers allow allocating dynamic or static arrays whose length is defined via size_t, which on a 64bit machine goes up to SIZE_MAX=2^64-1 bytes

Does it mean applications in real world allocate memory larger than 4GB as a sequential region of memory?

@methane
Member

methane commented Feb 27, 2020

msgpack is a JSON-like format which is optimized for small to medium-sized data.
There are other data formats for large data in the world. I am not sure why msgpack should support 4GB+ data....

@fangq
Author

fangq commented Feb 27, 2020

The problem with defining ext 64 for ND arrays is the capped data-holding capacity. As I mentioned above, NIfTI-2 formatted data files, for example, support 7-dimensional arrays with each dimension specified by a 64bit length. Perhaps it is overkill in most cases, but it was defined based on clear consensus in the field. Using ext 64 to encode such data will not be enough.

Many other formats also support large arrays, such as TIFF (https://www.loc.gov/preservation/digital/formats/fdd/fdd000328.shtml). HDF5 supports 64bit integers as dimensions and 512GB/1TB file sizes (https://support.hdfgroup.org/HDF5/faq/limits.html
https://support.hdfgroup.org/HDF5/doc/TechNotes/BigDataSmMach.html).

@fangq
Author

fangq commented Feb 27, 2020

@methane, one does not have to use it, but I don't see why it hurts to support it, if all we need is to provide a container marker and a format.

on the other hand, part of my interest is to enable JSON to support complex and large data sets
https://github.com/fangq/jdata/blob/master/JData_specification.md#n-dimensional-array-storage-keywords

@frsyuki
Member

frsyuki commented Feb 27, 2020

I think N-D array (or generally array of fixed-length elements) is a good idea. I have some other use cases in mind that can reduce overhead of type tag with it.
Size bigger than 4GB is something we didn't expect as a single object. It's expected as a sequence of objects, but not as a single object.
MessagePack is not intended to be a file format but intended to be a generic object serialization format, which means that we have a lot more room to design file format on top of msgpack. Our company has designed a columnar-oriented file format using msgpack as a sequence of values in a column.
(I'm still thinking...)

@frsyuki
Member

frsyuki commented Feb 27, 2020

I can propose following idea as an alternative approach:

N-dimensional array extension type is assigned to extension type 2 (TBD).

+--------+--------+--------+--------+--------+--------+--------+--------+
|  0xc9  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|   2    |  type  |   D    |
+--------+--------+--------+--------+--------+--------+--------+--------+
+--------+--------+--------+--------+========+
|AAAAAAAA|AAAAAAAA|AAAAAAAA|AAAAAAAA|  data  |
+--------+--------+--------+--------+========+

* ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ is a big-endian 32-bit unsigned integer which represents byte size of following payload.
* 2 is an example of type tag for N-dimensional array
* D is 8-bit unsigned integer that represents number of dimensions (number of columns).
* D=0 is allowed to represent a single-dimension array.
* if type is between 0xca and 0xd3, the specified type can be found in the Overview section above.
* other type values are reserved for future extension of this specification.
* AAAAAAAA_AAAAAAAA_AAAAAAAA_AAAAAAAA is a big-endian 32-bit unsigned integer which represents number of rows.
* data is a sequence of following structure:

+========+========+========+========+
|   D1   |   D2   |   ...  |   DN   |
+========+========+========+========+

* DN is a fixed-length element whose type and length is defined by `type`.

With this, given (N-dimensional array) as an ext-type object defined as above, whose size is up to 4GB, a numpy.array object (or whatever you use to represent a large multidimensional numeric array) can be represented as the following array object:

[(N-D array), (N-D array), ..., (N-D array)]

For example, to represent a 2-dimensional 4-row array,

[ [[1,2649],[2,7832]],  [[3,3289],[4,9853]]  ]

D=0 is useful to reduce size significantly especially when we want to send an array of mid-size integers or floats like [812,381,371,932].
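
A minimal Python sketch of this alternative layout, using the D=0 example values above. The proposal leaves several details open, so the sketch makes its own assumptions: ext type id 2 (marked TBD above), uint 16 (0xcd) elements, the ext 32 size covering everything after the ext type byte, and D=0 meaning one value per row. `pack_uint16_rows` is a name invented for this illustration.

```python
import struct

ND_EXT_TYPE = 2   # "TBD" in the proposal; used here only as a placeholder
UINT16 = 0xCD     # msgpack marker for uint 16, reused as the element type


def pack_uint16_rows(values, ndims=0):
    """0xc9, size (uint32), ext type, element type, D, rows (uint32), data."""
    width = max(ndims, 1)   # assumption: D=0 stores a single value per row
    if len(values) % width:
        raise ValueError("values do not fill whole rows")
    rows = len(values) // width

    body = struct.pack(">BBI", UINT16, ndims, rows)
    body += struct.pack(">%dH" % len(values), *values)   # big-endian uint16 data
    return struct.pack(">BIb", 0xC9, len(body), ND_EXT_TYPE) + body


encoded = pack_uint16_rows([812, 381, 371, 932])
print(len(encoded), encoded.hex())
```

Under these assumptions the four values take 20 bytes, compared with 13 bytes for a plain fixarray of four uint 16 values (one marker byte plus three bytes per value). The fixed ext 32 header dominates for very short arrays, so the saving of one type byte per element only shows up for longer ones.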

Do you still think supporting binary bigger than 4GB is important?

@fangq
Author

fangq commented Feb 28, 2020

But I'm not a supporter of the idea to use the combination of bytes because ... it doesn't solve the problem - users cannot define their own data type larger than 4GB.

@tagomoris can you elaborate? Why can't it define data types larger than 4GB? It is up to the developer to decide, but can't we use 0xc1+0xc7 (ext 8) to represent ext 64, 0xc1+0xc8 for ext 128, and even 0xc1+0xc9 for ext 256? That still leaves the 0xc1+fixext flags open, which you can use for other types of complex extension data. No?

@fangq
Author

fangq commented Feb 29, 2020

Re: @frsyuki

An example use of 0xc1: https://github.com/msgpack/msgpack-ruby/blob/1e35fb8a771339fc51a9a9c96e77046dfc086954/ext/msgpack/unpacker.c#L53-L61

Reading your code above, it appears to me that 0xc1 is used as a marker to help decide whether one needs to read the next head byte or not. This can be made to handle both the old syntax (where 0xc1 is not used) and the new extended syntax (i.e. 0xc1+typemarker) with only a very minor change.

to do that, you can simply define the upper bytes in HEAD_BYTE_REQUIRED to be non-zero (variable b is an int, not a char), something like

#define HEAD_BYTE_REQUIRED 0xffffffc1
#define TYPE_BYTE_REQUIRED 0x000000c1

then change your get_head_byte to

static inline int get_head_byte(msgpack_unpacker_t* uk)
{
    int b = uk->head_byte;
    if(b == HEAD_BYTE_REQUIRED) {
        b = read_head_byte(uk);
    }else if(b == TYPE_BYTE_REQUIRED){
        b = read_head_byte(uk);
        /* do something to tell the caller that a composite marker was encountered */
    }
    return b;
}

Again, the proposed syntax requires that 0xc1 be followed by a type marker; this should have a marginal conflict, if any, with the logic in existing libraries, if 0xc1 is used at all.

This will free up 0xc1 for a lot of new possibilities and flexibility; the cost to pay is small.

@tagomoris
Member

@fangq That is the priority problem.

My point is, the top-priority issue (in my opinion) is to support objects larger than 4GB as ext types. We can build new msgpack standard types on it (including N-D arrays) if we have ext64 and/or ext128. And also it can support others' requirements for larger objects (by using user-space (id > 0) ext types).

IIUC, the idea to use the combination of 0xc1 and the next byte is to add N-D array at first, then we need to think about other types in left spaces. That is a relatively long way to support many various large objects (except for N-D array).

@fangq
Author

fangq commented Mar 2, 2020

@tagomoris, I agree. I just want to add a few comments regarding future extensions

First, the current design of msgpack constructs, i.e. type+length+payload (in the case of ext: ext+length+exttype+payload), becomes limiting when supporting large arrays (where length is better represented by a vector instead of a single large integer) such as in this proposal (#268), and dynamic-length objects for streaming as in my other proposal (#270). Therefore, it is worth considering a more flexible new construct to handle these scenarios.

Your earlier proposal #267 (comment) can work, but I don't think we need size because the size of the data shall be computed using dim/type. So, perhaps if you prefer to use ext to do this, you should probably just zero-out the size byte

+--------+--------+--------+--------+--------+====================================+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
|  0xc7  |    0   |   -2   |  type  |  dim   | dim uint32 integers (N1,N2,...,ND) |   N1*N2*...*ND values of type |
+--------+--------+--------+--------+--------+====================================+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+

This results in a triplet 0xc7,0,-2 as the starting marker of an ND array. Because ext requires the presence of type, this will not be confused with an ext object with 0-length.

In any case, the logic of ext has to be amended somehow in order to accommodate new data structures that are outside of the scope of a fixed length record.

IIUC, the idea to use the combination of 0xc1 and the next byte is to add N-D array at first, then we need to think about other types in left spaces.

Agreed, adding ND array support will address many real-world use cases, and thus giving it priority would be great.

@frsyuki
Member

frsyuki commented Mar 11, 2020

We have following options proposed here:

  • Keep backward compatibility & keep 32-bit size limit
    -> (A) Application uses chunked encoding and rebuilds sequential memory space by copying data
  • Break backward compatibility (old implementations can't read new format)
    -> (B) Add an N-D array type as 0xc1
    -> (C) Add a new ext type as 0xc1 (ext 64, or ext 128?)
    -> (D) (see below)

I think (B) is not a good direction because I can imagine other use cases of extension types with data bigger than 4GB. (C) looks good. But for more extensibility, here is another proposal, (D):

## Structured extension type

sext type family
+--------+--------+~~~~~~~~~~~~~~~~~~~~~+=========+
|  0xc1  |  type  |    type template    | payload |
+--------+--------+~~~~~~~~~~~~~~~~~~~~~+=========+

* 0xc1 indicates that following sequence enters "Structured extension" mode.

* type is a 8-bit signed integer that represents a structured extension type.
  Like extension type, [-128, -1] is reserved for predefined types,
  and [0, 127] is available for application-specific types.

* type template defines structure and size of payload.

* type template is either of following:

  array 16 type template
  +--------+--------+--------+===============+
  |  0xdc  |YYYYYYYY|YYYYYYYY| type template |
  +--------+--------+--------+===============+
  (payload size = N * sizeof(type template))

  array 32 type template
  +--------+--------+--------+--------+--------+===============+
  |  0xdd  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ| type template |
  +--------+--------+--------+--------+--------+===============+
  (payload size = N * sizeof(type template))

  float 32/64, uint 8-64, or int 8-64 type templates
  +-------------+
  | 0xca - 0xd3 |
  +-------------+
  (payload size = 1)

A type template is a nested variable-length object, but total payload size is defined before seeing payload. This is consistent with MessagePack's basic design.

Also defines sext type = -1:

### Simple structured object (Structured extension type = -1)

Structured extension type = -1 is deserialized as a regular object.
For example, both (0x92 1 2) and (0xc1 -1 0xdc 0 2 0xcc 1 2)
represent the identical object, a 2-element array [1, 2].

This is useful to improve efficiency.
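
A minimal Python sketch of how a decoder might handle this sext form, assuming the strict reading of the template definitions above: an array 16/32 template carries a 2- or 4-byte big-endian count followed by exactly one inner template, and a scalar template is a single marker in the range 0xca-0xd3. All function names are invented for this illustration.

```python
import struct

# msgpack numeric markers mapped to big-endian struct formats
SCALARS = {
    0xCA: ">f", 0xCB: ">d",
    0xCC: ">B", 0xCD: ">H", 0xCE: ">I", 0xCF: ">Q",
    0xD0: ">b", 0xD1: ">h", 0xD2: ">i", 0xD3: ">q",
}


def parse_template(buf, pos):
    """Parse one type template; return (template, position after it)."""
    marker = buf[pos]
    if marker == 0xDC:                                   # array 16 template
        (count,) = struct.unpack_from(">H", buf, pos + 1)
        inner, pos = parse_template(buf, pos + 3)
        return ("array", count, inner), pos
    if marker == 0xDD:                                   # array 32 template
        (count,) = struct.unpack_from(">I", buf, pos + 1)
        inner, pos = parse_template(buf, pos + 5)
        return ("array", count, inner), pos
    if marker in SCALARS:                                # numeric scalar template
        return ("scalar", SCALARS[marker]), pos + 1
    raise ValueError("unsupported template marker 0x%02x" % marker)


def read_payload(template, buf, pos):
    """Read the payload described by a template; return (value, new position)."""
    if template[0] == "scalar":
        (value,) = struct.unpack_from(template[1], buf, pos)
        return value, pos + struct.calcsize(template[1])
    _, count, inner = template
    items = []
    for _ in range(count):
        item, pos = read_payload(inner, buf, pos)
        items.append(item)
    return items, pos


def decode_sext(buf):
    if buf[0] != 0xC1:
        raise ValueError("not a structured extension")
    (sext_type,) = struct.unpack_from("b", buf, 1)       # signed type byte
    template, pos = parse_template(buf, 2)
    value, _ = read_payload(template, buf, pos)
    return sext_type, value


print(decode_sext(bytes([0xC1, 0xFF, 0xDC, 0, 2, 0xCC, 1, 2])))   # (-1, [1, 2])
```

Because the template is fully parsed before the payload, the total payload size is known up front, which is the property the comment above relies on.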

For N-Dimensional array, for example, it can serialize a 2-dimension array (array[0] = [1,2,3,4,5,6], array[1] = [72,16,87,25,46,87,63]) as following (0xc1 + type template + 6*1 + 7*1 = 25 bytes total):

0xc1 0xdc 0 2 0xdc 0 8 0xcc 0xdc 0 7 0xcc
1 2 3 4 5 6 72 16 87 25 46 87 63

Its format is:

Start structured extension
+--------+
|  0xc1  |
+--------+

  2-element array 16
  +--------+--------+--------+
  |  0xdc  |        2        |
  +--------+--------+--------+

    6-element array 16
    +--------+--------+--------+
    |  0xdc  |        8        |
    +--------+--------+--------+

      uint 8
      +------+
      | 0xcc |
      +------+

    7-element array 16
    +--------+--------+--------+
    |  0xdc  |        7        |
    +--------+--------+--------+

      uint 8 template
      +------+
      | 0xcc |
      +------+

  Payload
  +--------+--------+--------+--------+
  |   1    |   2    |   3    |   4    |
  +--------+--------+--------+--------+
  +--------+--------+--------+--------+
  |   5    |   6    |   72   |   16   |
  +--------+--------+--------+--------+
  +--------+--------+--------+--------+
  |   87   |   25   |   46   |   87   |
  +--------+--------+--------+--------+
  +--------+
  |   63   |
  +--------+

As shown above, all integers are in one sequential memory. Decoder can parse it with zero-copy, although API design is tricky.

Optionally, if usual applications have a requirement to deserialize an N-Dimensional array as a special type (such as numpy.ndarray, although most languages don't have such a type) instead of a regular array, we could define predefined sext type = -2 (is this better or worse...?):

N-Dimension array (Structured extension type = -2)

N-Dimension array 16
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~~~~~~~~~+=============+
|  0xc1  |   -2   |  0xdc  |XXXXXXXX|XXXXXXXX| N-dimension definitions |   payload   |
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~~~~~~~~~+=============+

  Dimension definition 16
  +--------+--------+--------+-------------+
  |  0xdc  |YYYYYYYY|YYYYYYYY| 0xca - 0xd3 |
  +--------+--------+--------+-------------+

  Dimension definition 32
  +--------+--------+--------+--------+--------+-------------+
  |  0xdd  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ| 0xca - 0xd3 |
  +--------+--------+--------+--------+--------+-------------+

* XXXXXXXX_XXXXXXXX stores number of dimensions.
* A dimension definition, dimension definition 16 or dimension 32, repeats for each dimension.
* A dimension definition stores number of elements in the dimension and type of the elements.
* 0xca - 0xd3 stores type of an element.

@frsyuki
Member

frsyuki commented Mar 11, 2020

Another use case is 2D geometric area or path. Example: [(12.3456,34.5678),(12.3456,34.5678),(12.3456,34.5678)]

+--------+--------+--------+--------+--------+
|  0xc1  |   -1   |  0xdc  |        3        |
+--------+--------+--------+--------+--------+
          ^ Simple structured object
                   ^ 3-element array 16

    +--------+--------+--------+------+
    |  0xdc  |        2        | 0xcb |
    +--------+--------+--------+------+
     ^ 2-element array 16       ^ float 64

    +=============================+
    | payload (3 * 2 * 8 bytes)   |
    +=============================+
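
The path layout drawn above can be produced with a few lines of Python. This is a sketch under the assumptions that the sext type byte is -1 (simple structured object) and that the payload is a flat run of big-endian float 64 values; `pack_path` is a name invented for this illustration.

```python
import struct


def pack_path(points):
    """0xc1, -1, array 16 of N, array 16 of 2, float 64 marker, then the payload."""
    header = struct.pack(">BbBHBHB",
                         0xC1, -1,             # structured extension, type -1
                         0xDC, len(points),    # N-element array 16 template
                         0xDC, 2,              # 2-element array 16 template
                         0xCB)                 # float 64 scalar template
    payload = b"".join(struct.pack(">dd", x, y) for x, y in points)
    return header + payload


encoded = pack_path([(12.3456, 34.5678)] * 3)
print(len(encoded))   # 57 = 9 header bytes + 3 * 2 * 8 payload bytes
```

For comparison, the same three points as plain nested msgpack arrays take 58 bytes (1 + 3 * (1 + 2 * 9)). The template form stores each coordinate in 8 bytes instead of 9, so the difference grows with the number of points.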

Expected examples of future type templates:

  Little-endian float 32/64, uint 8-64, or int 8-64 type templates
  +-------------+
  | 0x?? - 0x?? |
  +-------------+

  Little-endian uint 128
  +------+
  | 0x?? |
  +------+

  fixarray type template (very common)
  +--------+--------+==================+
  |????????|YYYYYYYY| N type templates |
  +--------+--------+==================+

  map 16 fixed-type template (less common)
  +--------+--------+--------+===============+===============+
  |  0x??  |YYYYYYYY|YYYYYYYY| type template | type template |
  +--------+--------+--------+===============+===============+

  map 16 fixed-keys, fixed-value-type template
  (less common because values can't be variable-length strings)
  +--------+--------+--------+~~~~~~~~~~~+===============+
  |  0x??  |YYYYYYYY|YYYYYYYY| N objects | type template |
  +--------+--------+--------+~~~~~~~~~~~+===============+

  map 16 fixed-key, variable-type template
  (less common because values can't be variable-length strings)
  +--------+--------+--------+~~~~~~~~~~~+==================+
  |  0x??  |YYYYYYYY|YYYYYYYY| N objects | N type templates |
  +--------+--------+--------+~~~~~~~~~~~+==================+

  dictionary-encoded array 16 template (less common)
  +--------+--------+--------+~~~~~~~~~~~+--------+--------+
  |  0x??  |XXXXXXXX|XXXXXXXX| N objects |YYYYYYYY|YYYYYYYY|
  +--------+--------+--------+~~~~~~~~~~~+--------+--------+
  XXXXXXXX_XXXXXXXX: number of dictionary items
  YYYYYYYY_YYYYYYYY: number of array elements
  Payload: sequence of array elements where an element is
           a dictionary index in uint 16

  dictionary-encoded map 16 template (very common)
  +--------+--------+--------+~~~~~~~~~~~+==================+
  |  0x??  |YYYYYYYY|YYYYYYYY| N objects | N type templates |
  +--------+--------+--------+~~~~~~~~~~~+==================+
  XXXXXXXX_XXXXXXXX: number of dictionary items
  YYYYYYYY_YYYYYYYY: number of key-value pairs
  Payload: sequence of key-value pairs where a key or value is
           a dictionary index in uint 16

  half-dictionary-encoded tuple 16 template (common for metadata + body structure)
  +--------+--------+--------+~~~~~~~~~~~+--------+--------+==================+
  |  0x??  |XXXXXXXX|XXXXXXXX| N objects |YYYYYYYY|ZZZZZZZZ| N type templates |
  +--------+--------+--------+~~~~~~~~~~~+--------+--------+==================+
  XXXXXXXX_XXXXXXXX: number of dictionary items
  YYYYYYYY: number of elements in dictionary
  ZZZZZZZZ: number of fixed-type elements
  (total number of elements is YYYYYYYY + ZZZZZZZZ)
  Payload: first 2 * YYYYYYYY bytes: dictionary index in uint 16
           following bytes: payload defined by type templates

  bin 8 type template
  +--------+--------+
  |  0x??  |XXXXXXXX|
  +--------+--------+
  XXXXXXXX: pre-defined length of bytes

  bin 64 type template
  +--------+--------+--------+--------+--------+--------+--------+--------+
  |  0x??  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
  +--------+--------+--------+--------+--------+--------+--------+--------+

With these definitions, the overall data size becomes smaller for big data because one type definition is shared by multiple objects. We also get a chance to extend MessagePack using the new 8-bit space because 0xc1 is essentially a mode change. However, while the first spec should be small, this direction makes MessagePack implementations much more complicated (especially, the implementation of dictionary-encoded types might be very complicated).

@fangq
Author

fangq commented Mar 11, 2020

-> (B) Add an N-D array type as 0xc1
I think (B) is not a good direction because I can imagine other use cases of extension types with data bigger than 4GB. (C) looks good. But for more extensibility, here is another proposal, (D):

as I mentioned above, my goal was not to exclusively claim 0xc1 for ND arrays. Instead, I was proposing to use 0xc1+type as a new extension towards more complex data "templates", and 0xc1 here is, as you mentioned, a modifier flag.

For ND arrays, I only intend to use 0xc1+[0xca-0xd3], i.e. the type marker is a simple numerical type. In a way, I consider this the simplest template, i.e. rectangular uniform-type packed data. 0xc1+other type can support more complex types, such as 0xc1+map or 0xc1+ext etc.

For your proposal D, I am excited to see extensions towards more complex data support, but on the other hand, this is not the direction I would like this thread to steer towards, at least for now - supporting complex structural data requires a lot of deliberation and could take a long time to design and implement (see the stalled effort in UBJSON in #268 (comment)). In comparison, ND arrays have a clear use case and are simple and straightforward to support. Again, I feel that this does not need to wait until the template support is fully cooked - see my previous post

#268 (comment)

@fangq
Author

fangq commented Mar 11, 2020

+--------+--------+--------+--------+--------+
|  0xc1  |   -1   |  0xdc  |        3        |
+--------+--------+--------+--------+--------+
          ^ Simple structured object
                   ^ 3-element array 16

    +--------+--------+--------+------+
    |  0xdc  |        2        | 0xcb |
    +--------+--------+--------+------+
     ^ 2-element array 16       ^ float 64

    +=============================+
    | payload (3 * 2 * 8 bytes)   |
    +=============================+

could you provide an example for a higher-dimensional array, say a 3D array like this 2x3x4 uint8 array?

  [
      [
          [1,9,6,0],
          [2,9,3,1],
          [8,0,9,6]
      ],
      [
          [6,4,2,7],
          [8,5,1,2],
          [3,3,2,6]
      ]
  ]

specifically, I would like to understand how the total number of dimensions (3 in this example), as well as the length of each dimensions ([2,3,4] here) are stored in this proposed container.

@frsyuki
Member

frsyuki commented Mar 11, 2020

  [
      [
          [1,9,6,0],
          [2,9,3,1],
          [8,0,9,6]
      ],
      [
          [6,4,2,7],
          [8,5,1,2],
          [3,3,2,6]
      ]
  ]

will be

Start structured extension
+--------+
|  0xc1  |
+--------+

  2-element array 16
  +--------+--------+--------+
  |  0xdc  |        2        |
  +--------+--------+--------+
  (following type template repeats for 2 times)

    3-element array 16
    +--------+--------+--------+
    |  0xdc  |        3        |
    +--------+--------+--------+
    (following type template repeats for 3 times)

      3-element array 16
      +--------+--------+--------+
      |  0xdc  |        3        |
      +--------+--------+--------+
      (following type template repeats for 3 times)

        4-element array 16
        +--------+--------+--------+
        |  0xdc  |        4        |
        +--------+--------+--------+
        (following type template repeats for 4 times)

          uint 8
          +------+
          | 0xcc |
          +------+

  Payload
  +--------+--------+--------+--------+
  |   1    |   9    |   6    |   0    |
  +--------+--------+--------+--------+
  +--------+--------+--------+--------+
  |   2    |   9    |   3    |   1    |
  +--------+--------+--------+--------+
  +--------+--------+--------+--------+
  |   8    |   0    |   9    |   6    |
  +--------+--------+--------+--------+
  +--------+--------+--------+--------+
  |   6    |   4    |   2    |   7    |
  +--------+--------+--------+--------+
  +--------+--------+--------+--------+
  |   8    |   5    |   1    |   2    |
  +--------+--------+--------+--------+
  +--------+--------+--------+--------+
  |   3    |   3    |   2    |   6    |
  +--------+--------+--------+--------+

@frsyuki
Member

frsyuki commented Mar 11, 2020

0xc1 is the only byte remaining. Once we use it, we don't have further room to extend unless we make 0xc1 extensible enough. I would like to keep it as is, or make it very extensible. Either.

@fangq
Author

fangq commented Mar 12, 2020

0xc1 is the only byte remaining. Once we use it, we don't have further room to extend unless we make 0xc1 extensible enough. I would like to keep it as is, or make it very extensible. Either.

Agreed. That's also what I was hoping to expand in the future (again, mentioned in #268 (comment))

sext type family

+--------+--------+~~~~~~~~~~~~~~~~~~~~~+=========+
|  0xc1  |  type  |    type template    | payload |
+--------+--------+~~~~~~~~~~~~~~~~~~~~~+=========+

I managed to read the proposed format - overall, the construct aligns with what I initially suggested. the main difference is a nested type template vs a flat integer array to store the ND dimensional data.

I do have a few questions regarding some sample data - likely those are typos, just to confirm so that I can fully understand the format.

first, in your above proposal, you defined a "type" byte, immediately following 0xc1. In the description, this byte can be set to -1, -2, .... However, in all following examples, I saw this type marker was set for some, but not all. is this actually needed?

In your example (array[0] = [1,2,3,4,5,6], array[1] = [72,16,87,25,46,87,63]), the 2nd 0xdc object should be followed by 6, rather than 8. I assume '8' was a typo.

for the 2x3x4 array example in your above reply, I assume the 3rd 0xdc object is redundant, and should be removed, correct?

A type template is a nested variable-length object, but total payload size is defined before seeing payload. This is consistent with MessagePack's basic design.

Is it possible to define the ND array dimensional data in a flat array? It is a lot easier to process. The proposed method requires the decoder to read out the dimensions recursively in depth, without a guarantee that the data array is rectangular unless all embedded levels are read (and compared). This can cause some overhead and complex parsing logic.

Also if someone abuses this nested structure, and creates a deeply embedded array, the recursion can also cause stack overflow. If the contained data is a simple rectangular ND array, a flat array is probably easiest to decode. see my next reply for a revised proposal.
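
To illustrate the cost difference described above, here is a minimal Python sketch. It uses nested Python lists as a stand-in for a nested description (an assumption; it is not the actual nested type template), and the helper names `shape_of_nested` and `element_count` are hypothetical. Recovering a shape from nesting means visiting every branch recursively, while a flat dimension vector only needs a product.

```python
from math import prod


def shape_of_nested(data):
    """Recursively derive the shape of a nested list, checking every branch."""
    if not isinstance(data, list):
        return ()  # reached a scalar leaf
    if not data:
        return (0,)
    first = shape_of_nested(data[0])
    for item in data[1:]:
        # Every sibling must have the same shape, so all levels get visited.
        if shape_of_nested(item) != first:
            raise ValueError("array is not rectangular")
    return (len(data),) + first


def element_count(dims):
    """With a flat dimension vector, the element count is just a product."""
    return prod(dims)


nested = [
    [[1, 9, 6, 0], [2, 9, 3, 1], [8, 0, 9, 6]],
    [[6, 4, 2, 7], [8, 5, 1, 2], [3, 3, 2, 6]],
]
print(shape_of_nested(nested))
print(element_count([2, 3, 4]))
```

The recursive version also has to be guarded against very deep nesting, which is the stack-overflow concern raised above; the flat version has no such depth.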

@fangq
Author

fangq commented Mar 12, 2020

Here is my proposed sext construct

Structured extension (sext) format family

+--------+--------+~~~~~~~~~~~~~~~~~~~~~+=========+
|  0xc1  |  type  |    type template    | payload |
+--------+--------+~~~~~~~~~~~~~~~~~~~~~+=========+

* 0xc1 indicates that following sequence enters "Structured extension" mode.

* type is an 8-bit unsigned integer specifying the data kind of the encoded data
structure

    * a type byte between 0xca and 0xd3 indicates an N-dimensional (ND) numerical 
array, the corresponding numerical data type can be found in the Summary section above
    * in such case, the type template of the ND array must be an array object, i.e. 
array 16 (0xdc) or array 32 (0xdd) of depth 1, containing only (unsigned) integer 
(0xcc-0xcf) elements.

    * a type byte between 0xdc and 0xdd indicates an array of structures (AoS)
    * in such case, the type and the following type template must be a valid array 
object with elements exclusively made of map objects, i.e. map 16 (0xde) or 
map 32 (0xdf). The map object may have a sub-field of any data type except 
type 0xc1, and may contain nested map objects. However, the data payload
for the even-numbered elements (the values) in such map objects must be removed.

    * a type byte between 0xde and 0xdf indicates a structure of arrays (SoA)
    * in such case, the type and the following type template must be a valid map 
object with elements exclusively made of array objects, i.e. array 16 (0xdc) or 
array 32 (0xdd). The array object may have a sub-field of only numerical types 
0xca-0xd3 or nested array objects. However, all sub-fields in the type template 
section may only contain type and size markers; the data payload must 
be removed.

    * a type byte of true (0xc3) indicates a variable length object. 
    * in such case, the type template is ignored, and the data payload follows 
directly. The data payload must be followed by 0xc0+false (0xc2) to indicate 
the end of the variable length object. 

    * a type byte between 0xc7 and 0xc9 indicates enhanced ext (eext) data records.
    * in such case, the type template is simply an extended length record, where
        | 0xc1 | 0xc7 |  64bit integer length | ext type | payload |    for ext 64 objects
        | 0xc1 | 0xc8 | 128bit integer length | ext type | payload |    for ext 128 objects
        | 0xc1 | 0xc9 | 256bit integer length | ext type | payload |    for ext 256 objects

    * a type byte of any other value is reserved for future extensions
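
The type-byte ranges listed above can be read as a dispatch table. The following is a minimal Python sketch of that reading; `classify_sext_type` and the returned labels are hypothetical names chosen for illustration and are not part of any MessagePack library.

```python
def classify_sext_type(type_byte):
    """Map the byte that follows 0xc1 to the kind of structure it announces."""
    if not 0 <= type_byte <= 0xFF:
        raise ValueError("type must fit in one unsigned byte")
    if 0xCA <= type_byte <= 0xD3:
        return "nd-array"          # numerical type of the N-D array
    if type_byte in (0xDC, 0xDD):
        return "aos"               # array of structures
    if type_byte in (0xDE, 0xDF):
        return "soa"               # structure of arrays
    if type_byte == 0xC3:
        return "variable-length"   # payload runs until the end marker
    if 0xC7 <= type_byte <= 0xC9:
        return "eext"              # enhanced ext with a wider length field
    return "reserved"              # kept for future extensions


print(classify_sext_type(0xCC))
print(classify_sext_type(0xDE))
```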

@fangq
Author

fangq commented Mar 12, 2020

A few examples

N-D array

[
     [
         [1,9,6,0],
         [2,9,3,1],
         [8,0,9,6]
     ],
     [
         [6,4,2,7],
         [8,5,1,2],
         [3,3,2,6]
     ]
 ]

is encoded to

0xc1                               <- start
0xcc                               <- type of ND array: uint8
0xdc 03                            <- total dimensional vector length: 3
    0xcc 2     0xcc 3     0xcc 4   <- dimensional data 2x3x4
1 9 6 0 2 9 3 1 8 0 9 6 6 4 2 7 8 5 1 2 3 3 2 6
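
As a rough illustration, the byte layout above could be produced with the Python sketch below. Several things are assumptions rather than part of the proposal: the function name `encode_ndarray_uint8` is hypothetical, the leading byte is 0xc1 per the format description, and the dimension vector uses MessagePack's standard array 16 layout (a two-byte big-endian count), which the example abbreviates as `0xdc 03`.

```python
import struct

SEXT = 0xC1     # structured-extension marker from the format description
UINT8 = 0xCC    # MessagePack uint 8 marker
ARRAY16 = 0xDC  # MessagePack array 16 marker


def encode_ndarray_uint8(dims, values):
    """Encode a uint8 N-D array: marker, type, flat dimension vector, raw data."""
    expected = 1
    for d in dims:
        expected *= d
    if len(values) != expected:
        raise ValueError("number of values does not match the dimensions")
    out = bytearray([SEXT, UINT8])
    # Assumption: array 16 carries a 2-byte big-endian element count.
    out += struct.pack(">BH", ARRAY16, len(dims))
    for d in dims:
        out += bytes([UINT8, d])  # each dimension as a uint 8 (must be < 256)
    out += bytes(values)          # payload: elements in row-major order
    return bytes(out)


values = [1, 9, 6, 0, 2, 9, 3, 1, 8, 0, 9, 6,
          6, 4, 2, 7, 8, 5, 1, 2, 3, 3, 2, 6]
encoded = encode_ndarray_uint8([2, 3, 4], values)
print(encoded.hex(" "))
```

Under these assumptions the header costs 11 bytes in front of the 24 data bytes.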

Array of structures (AoS) example

  Name    Age   Degree  Height
  ----  ------- ------  ------
  Andy    21     BS      69.2
  William 21     MS      71.0
  Om      22     BE      67.1

is encoded to (see the size savings due to the removal of repeated field names)

0xc1                           <- start
0xdc                           <- indicate an AoS
 03                            <- total number of structures
 0xde 04                       <- each structure has 4 elements
    0xd9 4 Name 0xd9           <- first field is Name, with a value of str 8
    0xd9 3 Age 0xcc            <- 2nd field is Age, with a value of uint8
    0xd9 6 Degree 0xd9         <- 3rd field is Degree, with a value of str 8
    0xd9 6 Height 0xca         <- 4th field is Height, with a value of float32
 4 Andy 21 2 BS 69.2 7 William 21 2 MS 71.0 2 Om 22 2 BE 67.1

Structure of arrays (SoA) example

data is same as above, now encoded by columns instead of rows

0xc1                               <- start
0xde                               <- indicate an SoA
 04                                <- total number of subfields
 0xd9 4 Name 0xdc 03 0xd9          <- first field is Name, with a value of str 8
 0xd9 3 Age 0xdc 03 0xcc           <- 2nd field is Age, with a value of uint8
 0xd9 6 Degree 0xdc 03 0xd9        <- 3rd field is Degree, with a value of str 8
 0xd9 6 Height 0xdc 03 0xca        <- 4th field is Height, with a value of float32
 4 Andy 7 William 2 Om 21 21 22 2 BS 2 MS 2 BE 69.2 71.0 67.1
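
The difference between the AoS and SoA payloads is only the order in which the same values are written out. A small Python sketch of that ordering, using the table above; the helper names `aos_payload` and `soa_payload` are hypothetical, and string length prefixes and binary packing are left out to keep the focus on ordering.

```python
rows = [
    {"Name": "Andy", "Age": 21, "Degree": "BS", "Height": 69.2},
    {"Name": "William", "Age": 21, "Degree": "MS", "Height": 71.0},
    {"Name": "Om", "Age": 22, "Degree": "BE", "Height": 67.1},
]


def aos_payload(records):
    """Row by row: all fields of record 0, then all fields of record 1, ..."""
    return [record[key] for record in records for key in record]


def soa_payload(records):
    """Column by column: every Name, then every Age, and so on."""
    fields = list(records[0])  # field order taken from the first record
    return [record[key] for key in fields for record in records]


print(aos_payload(rows))
print(soa_payload(rows))
```

In both cases the field names appear once, in the type template, instead of once per record.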

@cmpute

cmpute commented Jun 16, 2021

I added my proposal for nd-array at #311, using ext type. I also believe that it's better to leave 0xc1 as an escape character when msgpack is used together with other formats. I also explicitly designed the type for dim numbers, so that

[
     [
         [1,9,6,0],
         [2,9,3,1],
         [8,0,9,6]
     ],
     [
         [6,4,2,7],
         [8,5,1,2],
         [3,3,2,6]
     ]
 ]

will be represented as

0xc7                      <- ext marker
29                        <- ext payload size
-11                       <- ext type code for N-D array
0xcc                      <- value type uint 8
3                         <- dimension count
2 3 4                     <- dimension sizes
1 9 6 0 2 9 3 1 8 0 9 6 6 4 2 7 8 5 1 2 3 3 2 6

With this format, a commonly used 3x3 float matrix can be represented with only 7 additional bytes
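
The byte counts above can be checked with a short Python sketch. It assumes, as in the example, that the dimension count and each dimension size occupy one byte; the full proposal in #311 may define these fields differently. The function name `encode_ext_ndarray` is hypothetical.

```python
import struct

EXT8 = 0xC7             # MessagePack ext 8 marker
NDARRAY_EXT_TYPE = -11  # ext type code used in the example above
UINT8 = 0xCC
FLOAT32 = 0xCA


def encode_ext_ndarray(value_type, dims, data):
    """Wrap raw element bytes in an ext 8 record with a small N-D header."""
    # Assumption: dimension count and each size fit in a single byte.
    payload = bytes([value_type, len(dims)]) + bytes(dims) + data
    if len(payload) > 255:
        raise ValueError("payload too large for ext 8")
    # ext 8 layout: marker, payload length (uint8), type (int8), payload
    return struct.pack(">BBb", EXT8, len(payload), NDARRAY_EXT_TYPE) + payload


values = bytes([1, 9, 6, 0, 2, 9, 3, 1, 8, 0, 9, 6,
                6, 4, 2, 7, 8, 5, 1, 2, 3, 3, 2, 6])
encoded = encode_ext_ndarray(UINT8, [2, 3, 4], values)
print(encoded[:3].hex(" "), len(encoded))

# A 3x3 float32 matrix: 36 data bytes plus the header.
matrix = struct.pack(">9f", *[float(i) for i in range(9)])
overhead = len(encode_ext_ndarray(FLOAT32, [3, 3], matrix)) - len(matrix)
print(overhead)
```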

@BambOoxX

BambOoxX commented Aug 5, 2021

@fangq Regarding commit 295b1ab.
Since the row- vs column-major storage problem will always pop up at some point for exchange formats such as MsgPack with ND arrays, wouldn't it be better to store the actual storage order in the header (which would be possible with the solution proposed by @cmpute, I believe)?
This way, exchanging data between identically ordered programs would improve performance, wouldn't it?

@fangq
Author

fangq commented Aug 5, 2021

@BambOoxX, an array element order flag can be added, like the optional _ArrayOrder_ keyword I added for the JData specification (default is still row-major)

fangq/jdata@85994de#diff-533b679ae40e5ef5b4e1c18400b05cfa0abedb5efe5030ad022263c2c8537258R617-R619

however, that just adds an extra layer of complexity and somewhat contradicts the design style of msgpack, IMHO. Just like integers are stored in the Big-Endian byte order, I think deciding on one of the options may not necessarily be a bad thing for a language-independent data format.

maybe I missed something, but @cmpute's typed ND array also seems to assume row-major, from reading this
https://github.com/msgpack/msgpack/pull/311/files#diff-bc6661da34ecae62fbe724bb93fd69b91a7f81143f2683a81163231de7e3b545R723

@BambOoxX

BambOoxX commented Aug 5, 2021

@fangq Regarding what @cmpute proposed, I just figured one could use one ext type for row-major and one for column-major.

I just wanted to point out that serializing in row-major order a multi-dimensional array that is stored in memory in column-major order may lead to performance issues (over large data of course), especially for high numbers of dimensions.

Choosing a specific setting may reduce interest of some users in case this choice does not agree with their main language for data storage.

I am very far myself from these large array issues, so it's really a highly uneducated guess, but most of my data is stored in column-major order. I guess I may have said nothing if you had kept it the default :)

@cmpute

cmpute commented Aug 6, 2021

I agree that the array order and even endianness matter a lot in this scenario, because they affect the ability to directly map a memory span as an array. With that said, it's pretty easy to add this in my proposal since there are many available type slots. So we can have something like

  • row-major big-endian array with ext type -11
  • row-major little-endian array with ext type -12
  • column-major big-endian array with ext type -13
  • column-major little-endian array with ext type -14

I understand the decision of msgpack to make the endianness consistent across the standard, but it's also beneficial to offer this workaround to pass large amounts of data without conversion (other than manually packing them in bin or a custom ext type).
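
A minimal Python sketch of how the four suggested type codes could be selected and what each choice means for the bytes written. The type codes are the ones listed above; the table name `EXT_TYPES` and the helpers `flatten` and `pack_float32` are hypothetical and only meant to show the idea for a 2-D case.

```python
import struct

# (element order, byte order) -> ext type code, as listed above
EXT_TYPES = {
    ("row", "big"): -11,
    ("row", "little"): -12,
    ("column", "big"): -13,
    ("column", "little"): -14,
}


def flatten(matrix, order):
    """Flatten a 2-D list in row-major or column-major element order."""
    rows, cols = len(matrix), len(matrix[0])
    if order == "row":
        return [matrix[r][c] for r in range(rows) for c in range(cols)]
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def pack_float32(values, endian):
    """Pack floats as float32 in the requested byte order."""
    prefix = ">" if endian == "big" else "<"
    return struct.pack(f"{prefix}{len(values)}f", *values)


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
for (order, endian), code in EXT_TYPES.items():
    data = pack_float32(flatten(matrix, order), endian)
    print(code, order, endian, data.hex(" "))
```

A writer whose in-memory layout already matches one of the four combinations can emit its buffer unchanged, which is the conversion saving discussed above.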

@cmpute

cmpute commented Aug 6, 2021

Also as a side note, the de facto standard array library for Python, NumPy, provides functionality to represent n-d arrays in different orders and endianness. This is more related to the format of the data itself than to a specific language.

@cmpute

cmpute commented Aug 18, 2021

I've updated my proposal. But it seems the owners of the specs are very conservative about this change

@BambOoxX

BambOoxX commented Nov 1, 2022

Any new status on this ?

@mzy2240

mzy2240 commented Sep 15, 2023

this feature is definitely very handy nowadays

@codeinred

Are there any updates with regard to the proposal? Packed arrays would be very handy to have, and simply sticking things in a binary type is inadequate because it breaks the benefits of using msgpack (namely, the ability to load data in other languages)
