Commit fc7a382

trxcllnt authored and Brian Hulette committed
ARROW-2116: [JS] implement IPC writers
https://issues.apache.org/jira/browse/ARROW-2116
https://issues.apache.org/jira/browse/ARROW-2115

This PR represents a first pass at implementing the IPC writers for binary stream and file formats in JS.

I've also added scripts to do the `json-to-arrow`, `file-to-stream`, and `stream-to-file` steps of the integration tests. These scripts rely on a new feature in Node 10 (the next LTS version), so please update. My attempts to use a library to remain backwards-compatible with Node 9 were unsuccessful.

I've only done the APIs to serialize a preexisting Table to stream or file formats so far. We will want to refactor this soon to support end-to-end streaming.

Edit: Figured out why the integration tests weren't passing, fixed now 🥇

Author: ptaylor <paul.e.taylor@me.com>
Author: Paul Taylor <paul.e.taylor@me.com>
Author: lsb <leebutterman@gmail.com>

Closes apache#2035 from trxcllnt/js-buffer-writer and squashes the following commits:

261a864 <ptaylor> Merge branch 'master' into js-buffer-writer
917c2fc <ptaylor> test the ES5/UMD bundle in the integration tests
7a346dc <ptaylor> add a handy script for printing the alignment of buffers in a table
4594fe3 <ptaylor> align to 8-byte boundaries only
1a9864c <ptaylor> read message bodyLength from flatbuffer object
e34afaa <ptaylor> export the RecordBatchSerializer
b765b12 <ptaylor> speed up integration_test.py by only testing the JS source, not every compilation target
4ed6554 <ptaylor> Merge branch 'master' of https://github.com/apache/arrow into js-buffer-writer
f497f7a <ptaylor> measure maxColumnWidths across all recordBatches when printing a table
14e6b38 <ptaylor> cleanup: remove dead code
df43bc5 <ptaylor> make arrow2csv support streaming files from stdin, add rowsToString() method to RecordBatch
7924e67 <ptaylor> rename readNodeStream -> readStream, fromNodeStream -> fromReadableStream, add support for reading File format
efc7225 <ptaylor> fix perf tests
a06180b <ptaylor> don't run JS integration tests in src-only mode when --debug=true
ed85572 <ptaylor> fix instanceof ArrayBuffer in jest/node 10
2df1a4a <ptaylor> update google-closure-compiler, remove gcc-specific workarounds in the build
a6a7ab9 <ptaylor> put test tables into hoisted functions so it's easier to set breakpoints
a79334d <ptaylor> fix typo again after rebase
081fefc <ptaylor> remove bin from ts package.json
ccaf489 <ptaylor> remove stream-to-iterator
c0b88c2 <ptaylor> always write flatbuffer vectors
0be6de3 <ptaylor> use node v10.1.0 in travis
d4b8637 <ptaylor> add license headers
b52af25 <ptaylor> cleanup
3187732 <ptaylor> set bitmap alignment to 8 bytes if < 64 values
af9f4a8 <ptaylor> run integration tests in node 10.1
de81ac1 <ptaylor> Update JSTester to be an Arrow producer now too
832cc30 <ptaylor> add more js integration scripts for creating/converting arrow formats
263d06d <ptaylor> clean up js integration script
78cba38 <ptaylor> arrow2csv: support reading arrow streams from stdin
e75da13 <ptaylor> add support for reading streaming format via node streams
4e80851 <ptaylor> write correct recordBatch length
73a2fa9 <ptaylor> fix stream -> file, file -> stream, add tests
304e75d <ptaylor> fix magic string alignment in file reader, add file reader tests
402187e <ptaylor> add apache license headers
db02c1c <ptaylor> Add an integration test for binary writer
a242da8 <ptaylor> Add `Table.prototype.serialize` method to make ArrayBuffers from Tables
da0f457 <ptaylor> first pass at a working binary writer, only arrow stream format tested so far
508f4f8 <ptaylor> add getChildAt(n) methods to List and FixedSizeList Vectors to be more consistent with the other nested Vectors, make it easier to do the writer
a9d773d <ptaylor> move ValidityView into its own module, like ChunkedView is
85eb7ee <ptaylor> fix erroneous footer length check in reader
4333e54 <ptaylor> FileBlock constructor should accept Long | number, have public number fields
7fff99e <ptaylor> move IPC magic into its own module
d98e178 <ptaylor> add option to run gulp cmds with `-t src` to run jest against the `src` folder direct
aaec76b <ptaylor> fix @std/esm options for node10
18b9dd2 <lsb> Fix a typo
efb840f <Paul Taylor> fix typo
ae1f481 <Paul Taylor> align to 64-byte boundaries
c8ba1fe <Paul Taylor> don't write an empty buffer for NullVectors
43c671f <Paul Taylor> add Binary writer
6522cb0 <Paul Taylor> fix Data generics for FixedSizeList
ef1acc7 <Paul Taylor> read union buffers in the correct order
dc92b83 <Paul Taylor> fix typo
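
Several of the squashed commits concern buffer alignment ("align to 8-byte boundaries only", "set bitmap alignment to 8 bytes if < 64 values"). As a rough illustration of what padding a buffer to an 8-byte boundary means, here is a minimal Python sketch. The helper names `padded_length` and `pad_buffer` are hypothetical and are not taken from this commit; the actual writer is implemented in the JS sources and may handle alignment differently.

```python
def padded_length(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Adding (alignment - 1) and masking off the low bits rounds up.
    return (nbytes + alignment - 1) & ~(alignment - 1)


def pad_buffer(data: bytes, alignment: int = 8) -> bytes:
    """Append zero bytes so the result's length is a multiple of alignment."""
    return data + b"\x00" * (padded_length(len(data), alignment) - len(data))
```

Under this sketch a 9-byte buffer grows to 16 bytes with 8-byte alignment, while a buffer that is already a multiple of 8 is left unchanged.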
1 parent 1d9d893 commit fc7a382

52 files changed

Lines changed: 2121 additions & 4101 deletions

ci/travis_env_common.sh

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@
 # specific language governing permissions and limitations
 # under the License.

+# hide nodejs experimental-feature warnings
+export NODE_NO_WARNINGS=1
 export MINICONDA=$HOME/miniconda
 export PATH="$MINICONDA/bin:$PATH"
 export CONDA_PKGS_DIRS=$HOME/.conda_packages

integration/integration_test.py

Lines changed: 25 additions & 8 deletions
@@ -1092,35 +1092,52 @@ def file_to_stream(self, file_path, stream_path):
         os.system(cmd)

 class JSTester(Tester):
-    PRODUCER = False
+    PRODUCER = True
     CONSUMER = True

-    INTEGRATION_EXE = os.path.join(ARROW_HOME, 'js/bin/integration.js')
+    EXE_PATH = os.path.join(ARROW_HOME, 'js/bin')
+    VALIDATE = os.path.join(EXE_PATH, 'integration.js')
+    JSON_TO_ARROW = os.path.join(EXE_PATH, 'json-to-arrow.js')
+    STREAM_TO_FILE = os.path.join(EXE_PATH, 'stream-to-file.js')
+    FILE_TO_STREAM = os.path.join(EXE_PATH, 'file-to-stream.js')

     name = 'JS'

-    def _run(self, arrow_path=None, json_path=None, command='VALIDATE'):
-        cmd = [self.INTEGRATION_EXE]
+    def _run(self, exe_cmd, arrow_path=None, json_path=None, command='VALIDATE'):
+        cmd = [exe_cmd]

         if arrow_path is not None:
             cmd.extend(['-a', arrow_path])

         if json_path is not None:
             cmd.extend(['-j', json_path])

-        cmd.extend(['--mode', command])
+        cmd.extend(['--mode', command, '-t', 'es5', '-m', 'umd'])

         if self.debug:
             print(' '.join(cmd))

         run_cmd(cmd)

     def validate(self, json_path, arrow_path):
-        return self._run(arrow_path, json_path, 'VALIDATE')
+        return self._run(self.VALIDATE, arrow_path, json_path, 'VALIDATE')
+
+    def json_to_file(self, json_path, arrow_path):
+        cmd = ['node', self.JSON_TO_ARROW, '-a', arrow_path, '-j', json_path]
+        cmd = ' '.join(cmd)
+        if self.debug:
+            print(cmd)
+        os.system(cmd)

     def stream_to_file(self, stream_path, file_path):
-        # Just copy stream to file, we can read the stream directly
-        cmd = ['cp', stream_path, file_path]
+        cmd = ['cat', stream_path, '|', 'node', self.STREAM_TO_FILE, '>', file_path]
+        cmd = ' '.join(cmd)
+        if self.debug:
+            print(cmd)
+        os.system(cmd)
+
+    def file_to_stream(self, file_path, stream_path):
+        cmd = ['cat', file_path, '|', 'node', self.FILE_TO_STREAM, '>', stream_path]
         cmd = ' '.join(cmd)
         if self.debug:
             print(cmd)
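
The new `stream_to_file` and `file_to_stream` methods build a `cat … | node … > …` string and hand it to `os.system`. For comparison, the same redirection can be expressed without a shell by passing open file handles to `subprocess.run`. The sketch below is only an illustration of that alternative, not what the commit does; `pipe_through` is a hypothetical helper name.

```python
import subprocess
from typing import Sequence


def pipe_through(cmd: Sequence[str], src_path: str, dst_path: str) -> None:
    """Run cmd with src_path as its stdin and dst_path as its stdout.

    No shell is involved, so paths containing spaces or shell
    metacharacters need no quoting.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # check=True raises CalledProcessError on a non-zero exit status,
        # whereas os.system only returns the status for the caller to inspect.
        subprocess.run(list(cmd), stdin=src, stdout=dst, check=True)
```

Inside `JSTester`, a call such as `pipe_through(['node', self.STREAM_TO_FILE], stream_path, file_path)` would play the role of the shell pipeline, assuming the script reads stdin and writes stdout as the diff implies.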

js/DEVELOP.md

Lines changed: 2 additions & 194 deletions
@@ -64,13 +64,11 @@ This argument configuration also applies to `clean` and `test` scripts.

 * `npm run deploy`

-  Uses [learna](https://github.com/lerna/lerna) to publish each build target to npm with [conventional](https://conventionalcommits.org/) [changelogs](https://github.com/conventional-changelog/conventional-changelog/tree/master/packages/conventional-changelog-cli).
+  Uses [lerna](https://github.com/lerna/lerna) to publish each build target to npm with [conventional](https://conventionalcommits.org/) [changelogs](https://github.com/conventional-changelog/conventional-changelog/tree/master/packages/conventional-changelog-cli).

 # Updating the Arrow format flatbuffers generated code

-Once generated, the flatbuffers format code needs to be adjusted for our TS and JS build environments.
-
-## TypeScript
+Once generated, the flatbuffers format code needs to be adjusted for our build scripts.

 1. Generate the flatbuffers TypeScript source from the Arrow project root directory:
     ```sh
@@ -101,193 +99,3 @@ Once generated, the flatbuffers format code needs to be adjusted for our TS and
     ```
 1. Add `/* tslint:disable:class-name */` to the top of `Schema.ts`
 1. Execute `npm run lint` to fix all the linting errors
-
-## JavaScript (for Google Closure Compiler builds)
-
-1. Generate the flatbuffers JS source from the Arrow project root directory
-    ```sh
-    cd $ARROW_HOME
-
-    flatc --js --no-js-exports -o ./js/src/format ./format/*.fbs
-
-    cd ./js/src/format
-
-    # Delete Tensor_generated.js (skip this when we support Tensors)
-    rm Tensor_generated.js
-
-    # append an ES6 export to Schema_generated.js
-    echo "$(cat Schema_generated.js)
-    export { org };
-    " > Schema_generated.js
-
-    # import Schema's "org" namespace and
-    # append an ES6 export to File_generated.js
-    echo "import { org } from './Schema';
-    $(cat File_generated.js)
-    export { org };
-    " > File_generated.js
-
-    # import Schema's "org" namespace and
-    # append an ES6 export to Message_generated.js
-    echo "import { org } from './Schema';
-    $(cat Message_generated.js)
-    export { org };
-    " > Message_generated.js
-    ```
-1. Fixup the generated JS enums with the reverse value-to-key mappings to match TypeScript
-    `Message_generated.js`
-    ```js
-    // Replace this
-    org.apache.arrow.flatbuf.MessageHeader = {
-      NONE: 0,
-      Schema: 1,
-      DictionaryBatch: 2,
-      RecordBatch: 3,
-      Tensor: 4
-    };
-    // With this
-    org.apache.arrow.flatbuf.MessageHeader = {
-      NONE: 0, 0: 'NONE',
-      Schema: 1, 1: 'Schema',
-      DictionaryBatch: 2, 2: 'DictionaryBatch',
-      RecordBatch: 3, 3: 'RecordBatch',
-      Tensor: 4, 4: 'Tensor'
-    };
-    ```
-    `Schema_generated.js`
-    ```js
-    /**
-     * @enum
-     */
-    org.apache.arrow.flatbuf.MetadataVersion = {
-      /**
-       * 0.1.0
-       */
-      V1: 0, 0: 'V1',
-
-      /**
-       * 0.2.0
-       */
-      V2: 1, 1: 'V2',
-
-      /**
-       * 0.3.0 -> 0.7.1
-       */
-      V3: 2, 2: 'V3',
-
-      /**
-       * >= 0.8.0
-       */
-      V4: 3, 3: 'V4'
-    };
-
-    /**
-     * @enum
-     */
-    org.apache.arrow.flatbuf.UnionMode = {
-      Sparse: 0, 0: 'Sparse',
-      Dense: 1, 1: 'Dense',
-    };
-
-    /**
-     * @enum
-     */
-    org.apache.arrow.flatbuf.Precision = {
-      HALF: 0, 0: 'HALF',
-      SINGLE: 1, 1: 'SINGLE',
-      DOUBLE: 2, 2: 'DOUBLE',
-    };
-
-    /**
-     * @enum
-     */
-    org.apache.arrow.flatbuf.DateUnit = {
-      DAY: 0, 0: 'DAY',
-      MILLISECOND: 1, 1: 'MILLISECOND',
-    };
-
-    /**
-     * @enum
-     */
-    org.apache.arrow.flatbuf.TimeUnit = {
-      SECOND: 0, 0: 'SECOND',
-      MILLISECOND: 1, 1: 'MILLISECOND',
-      MICROSECOND: 2, 2: 'MICROSECOND',
-      NANOSECOND: 3, 3: 'NANOSECOND',
-    };
-
-    /**
-     * @enum
-     */
-    org.apache.arrow.flatbuf.IntervalUnit = {
-      YEAR_MONTH: 0, 0: 'YEAR_MONTH',
-      DAY_TIME: 1, 1: 'DAY_TIME',
-    };
-
-    /**
-     * ----------------------------------------------------------------------
-     * Top-level Type value, enabling extensible type-specific metadata. We can
-     * add new logical types to Type without breaking backwards compatibility
-     *
-     * @enum
-     */
-    org.apache.arrow.flatbuf.Type = {
-      NONE: 0, 0: 'NONE',
-      Null: 1, 1: 'Null',
-      Int: 2, 2: 'Int',
-      FloatingPoint: 3, 3: 'FloatingPoint',
-      Binary: 4, 4: 'Binary',
-      Utf8: 5, 5: 'Utf8',
-      Bool: 6, 6: 'Bool',
-      Decimal: 7, 7: 'Decimal',
-      Date: 8, 8: 'Date',
-      Time: 9, 9: 'Time',
-      Timestamp: 10, 10: 'Timestamp',
-      Interval: 11, 11: 'Interval',
-      List: 12, 12: 'List',
-      Struct_: 13, 13: 'Struct_',
-      Union: 14, 14: 'Union',
-      FixedSizeBinary: 15, 15: 'FixedSizeBinary',
-      FixedSizeList: 16, 16: 'FixedSizeList',
-      Map: 17, 17: 'Map'
-    };
-
-    /**
-     * ----------------------------------------------------------------------
-     * The possible types of a vector
-     *
-     * @enum
-     */
-    org.apache.arrow.flatbuf.VectorType = {
-      /**
-       * used in List type, Dense Union and variable length primitive types (String, Binary)
-       */
-      OFFSET: 0, 0: 'OFFSET',
-
-      /**
-       * actual data, either wixed width primitive types in slots or variable width delimited by an OFFSET vector
-       */
-      DATA: 1, 1: 'DATA',
-
-      /**
-       * Bit vector indicating if each value is null
-       */
-      VALIDITY: 2, 2: 'VALIDITY',
-
-      /**
-       * Type vector used in Union type
-       */
-      TYPE: 3, 3: 'TYPE'
-    };
-
-    /**
-     * ----------------------------------------------------------------------
-     * Endianness of the platform producing the data
-     *
-     * @enum
-     */
-    org.apache.arrow.flatbuf.Endianness = {
-      Little: 0, 0: 'Little',
-      Big: 1, 1: 'Big',
-    };
-    ```
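
The removed instructions asked for each generated JS enum to carry a reverse value-to-key entry alongside every key-to-value entry. As a language-neutral illustration of that shape, here is a small Python sketch that builds such a two-way mapping from the `MessageHeader` values listed in the removed text. The helper `with_reverse_mapping` is hypothetical and only mirrors the idea; it is not part of the Arrow build.

```python
def with_reverse_mapping(enum: dict) -> dict:
    """Return a copy of enum that also maps each value back to its key."""
    result = dict(enum)
    for key, value in enum.items():
        result[value] = key
    return result


# Values copied from the removed DEVELOP.md example.
MESSAGE_HEADER = with_reverse_mapping({
    "NONE": 0,
    "Schema": 1,
    "DictionaryBatch": 2,
    "RecordBatch": 3,
    "Tensor": 4,
})
```

With this mapping, `MESSAGE_HEADER["RecordBatch"]` yields the numeric value and `MESSAGE_HEADER[3]` yields the name, which is the behavior the hand-edited JS enums were meant to share with the TypeScript ones.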

js/bin/file-to-stream.js

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+#! /usr/bin/env node
+
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+const fs = require('fs');
+const path = require('path');
+
+const encoding = 'binary';
+const ext = process.env.ARROW_JS_DEBUG === 'src' ? '.ts' : '';
+const { util: { PipeIterator } } = require(`../index${ext}`);
+const { Table, serializeStream, fromReadableStream } = require(`../index${ext}`);
+
+(async () => {
+    // Todo (ptaylor): implement `serializeStreamAsync` that accepts an
+    // AsyncIterable<Buffer>, rather than aggregating into a Table first
+    const in_ = process.argv.length < 3
+        ? process.stdin : fs.createReadStream(path.resolve(process.argv[2]));
+    const out = process.argv.length < 4
+        ? process.stdout : fs.createWriteStream(path.resolve(process.argv[3]));
+    new PipeIterator(serializeStream(await Table.fromAsync(fromReadableStream(in_))), encoding).pipe(out);
+
+})().catch((e) => { console.error(e); process.exit(1); });
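
`file-to-stream.js` converts the Arrow file format into the stream format. One visible difference between the two is that the file format is wrapped in the `ARROW1` magic string at both ends, with the footer length stored just before the trailing magic. The Python sketch below uses that framing to guess whether a byte string is in file format. The function name `looks_like_arrow_file` is hypothetical, and the check is a heuristic on the framing only; it does not parse or validate the footer.

```python
ARROW_MAGIC = b"ARROW1"


def looks_like_arrow_file(data: bytes) -> bool:
    """Heuristically detect the Arrow file format by its magic framing."""
    # Smallest plausible framing: leading magic (6 bytes) + 2 padding bytes
    # + 4-byte footer length + trailing magic (6 bytes).
    min_size = len(ARROW_MAGIC) + 2 + 4 + len(ARROW_MAGIC)
    return (
        len(data) >= min_size
        and data.startswith(ARROW_MAGIC)
        and data.endswith(ARROW_MAGIC)
    )
```

A check of this kind could let a tool decide whether conversion is needed before handing input to a script like the one above.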
