Commit e8d54ea

ARROW-17789: [Java][Docs] Update Java Dataset documentation with latest changes (apache#14382)
Authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
1 parent ff0aa08 commit e8d54ea

1 file changed

Lines changed: 91 additions & 30 deletions

docs/source/java/dataset.rst

@@ -32,31 +32,50 @@ is not designed only for querying files but can be extended to serve all
3232
possible data sources such as from inter-process communication or from other
3333
network locations, etc.
3434

35+
.. contents::
36+
3537
Getting Started
3638
===============
3739

40+
Currently supported file formats are:
41+
42+
- Apache Arrow (``.arrow``)
43+
- Apache ORC (``.orc``)
44+
- Apache Parquet (``.parquet``)
45+
- Comma-Separated Values (``.csv``)
46+
3847
Below is a minimal example of using Dataset to query a Parquet file in Java:
3948

4049
.. code-block:: Java
4150
4251
// read data from file /opt/example.parquet
4352
String uri = "file:/opt/example.parquet";
44-
BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
45-
DatasetFactory factory = new FileSystemDatasetFactory(allocator,
46-
NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
47-
Dataset dataset = factory.finish();
48-
Scanner scanner = dataset.newScan(new ScanOptions(100)));
49-
List<ArrowRecordBatch> batches = StreamSupport.stream(
50-
scanner.scan().spliterator(), false)
51-
.flatMap(t -> stream(t.execute()))
52-
.collect(Collectors.toList());
53-
54-
// do something with read record batches, for example:
55-
analyzeArrowData(batches);
56-
57-
// finished the analysis of the data, close all resources:
58-
AutoCloseables.close(batches);
59-
AutoCloseables.close(factory, dataset, scanner);
53+
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
54+
try (
55+
BufferAllocator allocator = new RootAllocator();
56+
DatasetFactory datasetFactory = new FileSystemDatasetFactory(
57+
allocator, NativeMemoryPool.getDefault(),
58+
FileFormat.PARQUET, uri);
59+
Dataset dataset = datasetFactory.finish();
60+
Scanner scanner = dataset.newScan(options);
61+
ArrowReader reader = scanner.scanBatches()
62+
) {
63+
List<ArrowRecordBatch> batches = new ArrayList<>();
64+
while (reader.loadNextBatch()) {
65+
try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
66+
final VectorUnloader unloader = new VectorUnloader(root);
67+
batches.add(unloader.getRecordBatch());
68+
}
69+
}
70+
71+
// do something with read record batches, for example:
72+
analyzeArrowData(batches);
73+
74+
// finished the analysis of the data, close all resources:
75+
AutoCloseables.close(batches);
76+
} catch (Exception e) {
77+
e.printStackTrace();
78+
}
6079
6180
.. note::
6281
``ArrowRecordBatch`` is a low-level composite Arrow data exchange format
@@ -65,6 +84,9 @@ Below shows a simplest example of using Dataset to query a Parquet file in Java:
6584
aware container ``VectorSchemaRoot``, through which users can access
6685
decoded data conveniently in Java.
6786

87+
The ``batchSize`` argument of ``ScanOptions`` takes effect only if it is set to a value
88+
smaller than the number of rows in the record batch.
89+
6890
.. seealso::
6991
Load record batches with :doc:`VectorSchemaRoot <vector_schema_root>`.
7092

@@ -104,7 +126,7 @@ within method ``Scanner::schema()``:
104126
.. code-block:: Java
105127
106128
Scanner scanner = dataset.newScan(
107-
new ScanOptions(100, Optional.of(new String[] {"id", "name"})));
129+
new ScanOptions(32768, Optional.of(new String[] {"id", "name"})));
108130
Schema projectedSchema = scanner.schema();
109131
110132
.. _java-dataset-projection:
@@ -119,20 +141,20 @@ in the projection list will be accepted. For example:
119141
.. code-block:: Java
120142
121143
String[] projection = new String[] {"id", "name"};
122-
ScanOptions options = new ScanOptions(100, Optional.of(projection));
144+
ScanOptions options = new ScanOptions(32768, Optional.of(projection));
123145
124146
If no projection is needed, leave the optional projection argument absent in
125147
ScanOptions:
126148

127149
.. code-block:: Java
128150
129-
ScanOptions options = new ScanOptions(100, Optional.empty());
151+
ScanOptions options = new ScanOptions(32768, Optional.empty());
130152
131153
Or use the shortcut constructor:
132154

133155
.. code-block:: Java
134156
135-
ScanOptions options = new ScanOptions(100);
157+
ScanOptions options = new ScanOptions(32768);
136158
137159
Then all columns will be emitted during scanning.
138160

@@ -210,21 +232,60 @@ be thrown during scanning.
210232
dataset instances. Once the Java buffers are created the passed allocator
211233
will become their parent allocator.
212234

235+
Usage Notes
236+
===========
237+
213238
Native Object Resource Management
214-
=================================
239+
---------------------------------
240+
215241
As another result of relying on JNI, all components related to
216-
``FileSystemDataset`` should be closed manually to release the corresponding
217-
native objects after using. For example:
242+
``FileSystemDataset`` should be closed manually or with try-with-resources to
243+
release the corresponding native objects after use. For example:
218244

219245
.. code-block:: Java
220246
221-
DatasetFactory factory = new FileSystemDatasetFactory(allocator,
222-
NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
223-
Dataset dataset = factory.finish();
224-
Scanner scanner = dataset.newScan(new ScanOptions(100));
247+
String uri = "file:/opt/example.parquet";
248+
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
249+
try (
250+
BufferAllocator allocator = new RootAllocator();
251+
DatasetFactory factory = new FileSystemDatasetFactory(
252+
allocator, NativeMemoryPool.getDefault(),
253+
FileFormat.PARQUET, uri);
254+
Dataset dataset = factory.finish();
255+
Scanner scanner = dataset.newScan(options)
256+
) {
257+
258+
// do something
259+
260+
} catch (Exception e) {
261+
e.printStackTrace();
262+
}
225263
226-
// do something
264+
If the user forgets to close them, the native objects may leak.
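
The try-with-resources form matters because resources are closed in the reverse order of their declaration, even when the body throws. The following is a minimal sketch of that Java behaviour only. It does not use the Arrow API: ``Tracked`` and ``CloseOrderSketch`` are hypothetical stand-ins for the allocator, factory, dataset and scanner, introduced purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {

    // Hypothetical stand-in for a native-backed object such as a Dataset or Scanner.
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            // Record the order in which close() is invoked.
            log.add(name);
        }
    }

    // Opens four resources in the same order as the documentation example and
    // returns the order in which they were closed.
    static List<String> run(boolean failInBody) {
        List<String> closed = new ArrayList<>();
        try (
            Tracked allocator = new Tracked("allocator", closed);
            Tracked factory = new Tracked("factory", closed);
            Tracked dataset = new Tracked("dataset", closed);
            Tracked scanner = new Tracked("scanner", closed)
        ) {
            if (failInBody) {
                throw new IllegalStateException("simulated scan failure");
            }
        } catch (IllegalStateException e) {
            // All four resources are already closed when this block runs.
        }
        return closed;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [scanner, dataset, factory, allocator]
        System.out.println(run(true));  // [scanner, dataset, factory, allocator]
    }
}
```

Declaring the allocator first therefore means it is closed last, after the objects that were created from it.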
227265

228-
AutoCloseables.close(factory, dataset, scanner);
266+
BatchSize
267+
---------
229268

230-
If user forgets to close them then native object leakage might be caused.
269+
The ``batchSize`` argument of ``ScanOptions`` is a limit on the number of rows in an individual batch.
270+
271+
For example, let's try to read a Parquet file with gzip compression and 3 row groups:
272+
273+
.. code-block::
274+
275+
# Let's configure ScanOptions as:
276+
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
277+
278+
$ parquet-tools meta data4_3rg_gzip.parquet
279+
file schema: schema
280+
age: OPTIONAL INT64 R:0 D:1
281+
name: OPTIONAL BINARY L:STRING R:0 D:1
282+
row group 1: RC:4 TS:182 OFFSET:4
283+
row group 2: RC:4 TS:190 OFFSET:420
284+
row group 3: RC:3 TS:179 OFFSET:838
285+
286+
Here, we set the batchSize in ScanOptions to 32768. Because that is greater
287+
than the number of rows in the next batch (4, since the first row group has
288+
only 4 rows), the program gets only 4 rows. The scanner
289+
will not combine smaller batches to reach the limit, but it will split
290+
large batches to stay within the limit. So if a row group had more
291+
than 32768 rows, it would be split into blocks of 32768 rows or fewer.
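
The rule described above (never merge small batches, split oversized ones) can be sketched in plain Java. This is only a model of the described behaviour, not Arrow's scanner: ``BatchSizeSketch`` and ``batchRowCounts`` are hypothetical names, and the assumption that each row group initially yields one batch is taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;

class BatchSizeSketch {

    // Hypothetical helper: given the row count of each row group, return the
    // row count of each emitted batch under the "split but never combine" rule.
    static List<Integer> batchRowCounts(int[] rowGroupRowCounts, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Integer> batches = new ArrayList<>();
        for (int rows : rowGroupRowCounts) {
            int remaining = rows;
            // Split a row group into chunks of at most batchSize rows;
            // rows are never carried over into the next row group's batch.
            while (remaining > 0) {
                int next = Math.min(remaining, batchSize);
                batches.add(next);
                remaining -= next;
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        int[] rowGroups = {4, 4, 3}; // row counts from the parquet-tools output
        System.out.println(batchRowCounts(rowGroups, 32768)); // [4, 4, 3]
        System.out.println(batchRowCounts(rowGroups, 3));     // [3, 1, 3, 1, 3]
    }
}
```

With a limit of 32768 the three row groups come back unchanged as three batches, while a limit of 3 only splits the two 4-row groups.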
