@@ -32,31 +32,50 @@ is not designed only for querying files but can be extended to serve all
 possible data sources such as from inter-process communication or from other
 network locations, etc.

+.. contents::
+
 Getting Started
 ===============

+Currently supported file formats are:
+
+- Apache Arrow (``.arrow``)
+- Apache ORC (``.orc``)
+- Apache Parquet (``.parquet``)
+- Comma-Separated Values (``.csv``)
+
 Below shows a simplest example of using Dataset to query a Parquet file in Java:

 .. code-block:: Java

     // read data from file /opt/example.parquet
     String uri = "file:/opt/example.parquet";
-    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
-    DatasetFactory factory = new FileSystemDatasetFactory(allocator,
-        NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
-    Dataset dataset = factory.finish();
-    Scanner scanner = dataset.newScan(new ScanOptions(100)));
-    List<ArrowRecordBatch> batches = StreamSupport.stream(
-        scanner.scan().spliterator(), false)
-            .flatMap(t -> stream(t.execute()))
-            .collect(Collectors.toList());
-
-    // do something with read record batches, for example:
-    analyzeArrowData(batches);
-
-    // finished the analysis of the data, close all resources:
-    AutoCloseables.close(batches);
-    AutoCloseables.close(factory, dataset, scanner);
+    ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
+    try (
+        BufferAllocator allocator = new RootAllocator();
+        DatasetFactory datasetFactory = new FileSystemDatasetFactory(
+                allocator, NativeMemoryPool.getDefault(),
+                FileFormat.PARQUET, uri);
+        Dataset dataset = datasetFactory.finish();
+        Scanner scanner = dataset.newScan(options);
+        ArrowReader reader = scanner.scanBatches()
+    ) {
+        List<ArrowRecordBatch> batches = new ArrayList<>();
+        while (reader.loadNextBatch()) {
+            try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
+                final VectorUnloader unloader = new VectorUnloader(root);
+                batches.add(unloader.getRecordBatch());
+            }
+        }
+
+        // do something with read record batches, for example:
+        analyzeArrowData(batches);
+
+        // finished the analysis of the data, close all resources:
+        AutoCloseables.close(batches);
+    } catch (Exception e) {
+        e.printStackTrace();
+    }

 .. note::
     ``ArrowRecordBatch`` is a low-level composite Arrow data exchange format
@@ -65,6 +84,9 @@ Below shows a simplest example of using Dataset to query a Parquet file in Java:
     aware container ``VectorSchemaRoot`` by which user could be able to access
     decoded data conveniently in Java.

+    The ``ScanOptions batchSize`` argument takes effect only if it is set to a value
+    smaller than the number of rows in the record batch.
+
 .. seealso::
     Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.

@@ -104,7 +126,7 @@ within method ``Scanner::schema()``:
 .. code-block:: Java

     Scanner scanner = dataset.newScan(
-        new ScanOptions(100, Optional.of(new String[] {"id", "name"})));
+        new ScanOptions(32768, Optional.of(new String[] {"id", "name"})));
     Schema projectedSchema = scanner.schema();

 .. _java-dataset-projection:
@@ -119,20 +141,20 @@ in the projection list will be accepted. For example:
 .. code-block:: Java

     String[] projection = new String[] {"id", "name"};
-    ScanOptions options = new ScanOptions(100, Optional.of(projection));
+    ScanOptions options = new ScanOptions(32768, Optional.of(projection));

 If no projection is needed, leave the optional projection argument absent in
 ScanOptions:

 .. code-block:: Java

-    ScanOptions options = new ScanOptions(100, Optional.empty());
+    ScanOptions options = new ScanOptions(32768, Optional.empty());

 Or use the shortcut constructor:

 .. code-block:: Java

-    ScanOptions options = new ScanOptions(100);
+    ScanOptions options = new ScanOptions(32768);

 Then all columns will be emitted during scanning.

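The projection rule touched by this hunk can be modeled without Arrow at all. The sketch below is a minimal, hypothetical illustration: ``ProjectionSketch`` and ``emittedColumns`` are names invented here and are not part of the Dataset API. It assumes, as the text states, that an absent projection means every column is emitted. Keeping the file's column order and ignoring names the file does not contain are simplifications of the sketch, not documented behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Toy model of column projection; it does not call Arrow. */
final class ProjectionSketch {

    /**
     * Hypothetical helper: returns the column names a scan would emit.
     * An empty Optional stands for "no projection", so all columns pass through.
     */
    static List<String> emittedColumns(List<String> fileColumns,
                                       Optional<String[]> projection) {
        if (!projection.isPresent()) {
            return fileColumns; // no projection: every column is emitted
        }
        List<String> wanted = Arrays.asList(projection.get());
        // Assumption of this sketch: keep file order, drop everything not listed.
        return fileColumns.stream()
                .filter(wanted::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> fileColumns = Arrays.asList("id", "name", "age");
        System.out.println(
            emittedColumns(fileColumns, Optional.of(new String[] {"id", "name"})));
        System.out.println(emittedColumns(fileColumns, Optional.empty()));
    }
}
```

With the three-column example in ``main``, the first call keeps only ``id`` and ``name``, and the second call returns all three columns.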
@@ -210,21 +232,60 @@ be thrown during scanning.
     dataset instances. Once the Java buffers are created the passed allocator
     will become their parent allocator.

+Usage Notes
+===========
+
 Native Object Resource Management
-=================================
+---------------------------------
+
 As another result of relying on JNI, all components related to
-``FileSystemDataset`` should be closed manually to release the corresponding
-native objects after using. For example:
+``FileSystemDataset`` should be closed manually, or with try-with-resources, to
+release the corresponding native objects after use. For example:

 .. code-block:: Java

-    DatasetFactory factory = new FileSystemDatasetFactory(allocator,
-        NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
-    Dataset dataset = factory.finish();
-    Scanner scanner = dataset.newScan(new ScanOptions(100));
+    String uri = "file:/opt/example.parquet";
+    ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
+    try (
+        BufferAllocator allocator = new RootAllocator();
+        DatasetFactory factory = new FileSystemDatasetFactory(
+                allocator, NativeMemoryPool.getDefault(),
+                FileFormat.PARQUET, uri);
+        Dataset dataset = factory.finish();
+        Scanner scanner = dataset.newScan(options)
+    ) {
+
+        // do something
+
+    } catch (Exception e) {
+        e.printStackTrace();
+    }

-    // do something
+If the user forgets to close them, native objects may leak.

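Why try-with-resources suits this chain of objects can be shown with plain Java. The sketch below uses a hypothetical ``Tracked`` class as a stand-in for the factory, dataset, and scanner (it is not an Arrow type) and records the order in which resources are opened and closed. The Java language closes resources in the reverse order of their declaration, so the scanner is released before the dataset it came from, and the dataset before its factory.

```java
import java.util.ArrayList;
import java.util.List;

/** Demonstrates try-with-resources close order with stand-in resources. */
final class CloseOrderSketch {

    /** Hypothetical stand-in for a native-backed object; logs open and close. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Opens three resources in one try header and returns the event log. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked factory = new Tracked("factory", log);
             Tracked dataset = new Tracked("dataset", log);
             Tracked scanner = new Tracked("scanner", log)) {
            log.add("scan");
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The log shows the three opens, then ``scan``, then ``close scanner``, ``close dataset``, ``close factory``. The same ordering applies even when the body throws, which is what makes the pattern harder to get wrong than a manual ``close`` call.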
-    AutoCloseables.close(factory, dataset, scanner);
+BatchSize
+---------

-If user forgets to close them then native object leakage might be caused.
+The ``batchSize`` argument of ``ScanOptions`` is an upper limit on the number of rows in an individual batch.
+
+For example, let's try to read a Parquet file with gzip compression and 3 row groups:
+
+.. code-block::
+
+    # Let's configure ScanOptions as:
+    ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
+
+    $ parquet-tools meta data4_3rg_gzip.parquet
+    file schema: schema
+    age:         OPTIONAL INT64 R:0 D:1
+    name:        OPTIONAL BINARY L:STRING R:0 D:1
+    row group 1: RC:4 TS:182 OFFSET:4
+    row group 2: RC:4 TS:190 OFFSET:420
+    row group 3: RC:3 TS:179 OFFSET:838
+
+Here, we set the batchSize in ScanOptions to 32768. That is greater than the
+number of rows in the next batch (4, because the first row group has only 4
+rows), so the program gets only 4 rows. The scanner will not combine smaller
+batches to reach the limit, but it will split large batches to stay within the
+limit. If a row group had more than 32768 rows, it would be split into blocks
+of 32768 rows or fewer.
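The rule in the paragraph above (never merge small batches, always split oversized ones) can be sketched as a toy model in plain Java. ``BatchSizeSketch`` and ``batchRowCounts`` are hypothetical names; nothing here calls Arrow. The sketch also assumes that an oversized row group is cut into full-size pieces followed by one remainder, a detail the text does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the batchSize rule; it does not call Arrow. */
final class BatchSizeSketch {

    /**
     * Returns the row count of each emitted batch under the stated rule:
     * row groups are never merged, and a row group larger than batchSize
     * is cut into pieces of at most batchSize rows.
     */
    static List<Integer> batchRowCounts(int[] rowGroupRowCounts, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Integer> batches = new ArrayList<>();
        for (int rows : rowGroupRowCounts) {
            int remaining = rows;
            while (remaining > 0) {
                // Assumption: full-size pieces first, then the remainder.
                int take = Math.min(remaining, batchSize);
                batches.add(take);
                remaining -= take;
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        int[] rowGroups = {4, 4, 3}; // RC values from the parquet-tools output
        System.out.println(batchRowCounts(rowGroups, 32768)); // [4, 4, 3]
        System.out.println(batchRowCounts(rowGroups, 3));     // [3, 1, 3, 1, 3]
    }
}
```

With the file from the example, a limit of 32768 leaves the three row groups untouched, while a limit of 3 splits each 4-row group into a batch of 3 and a batch of 1 without ever combining the leftovers.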