
Commit b178e15

liyafan82 authored and emkornfield committed
ARROW-7277: [Java] [Doc] Add discussion about vector lifecycle
As discussed in https://issues.apache.org/jira/browse/ARROW-7254?focusedCommentId=16983284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983284, we need a discussion about the lifecycle of a vector.

Each vector has a lifecycle, and different operations should be performed in particular phases of the lifecycle. If we violate this, some unexpected results may be produced. This may cause some confusion for Arrow users. So we want to add a new section to the prose document, to make it clear and explicit.

Closes apache#5969 from liyafan82/fly_1203_doc and squashes the following commits:

48be293 <liyafan82> Resolve comments
3ad95de <liyafan82> Fix styles for other Java documents
030d9da <liyafan82> Add discussion about vector lifecycle

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
1 parent da0e218 commit b178e15

3 files changed

Lines changed: 159 additions & 46 deletions


docs/source/java/ipc.rst

Lines changed: 18 additions & 6 deletions
@@ -30,7 +30,9 @@ Arrow defines two types of binary formats for serializing record batches:
3030

3131
Writing and Reading Streaming Format
3232
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
33-
First, let's populate a :class:`VectorSchemaRoot` with a small batch of records::
33+
First, let's populate a :class:`VectorSchemaRoot` with a small batch of records
34+
35+
.. code-block:: Java
3436
3537
BitVector bitVector = new BitVector("boolean", allocator);
3638
VarCharVector varCharVector = new VarCharVector("varchar", allocator);
@@ -52,7 +54,9 @@ Now, we can begin writing a stream containing some number of these batches. For
5254
ArrowStreamWriter writer = new ArrowStreamWriter(root, /*DictionaryProvider=*/null, Channels.newChannel(out));
5355

5456

55-
Here we used an in-memory stream, but this could have been a socket or some other IO stream. Then we can do::
57+
Here we used an in-memory stream, but this could have been a socket or some other IO stream. Then we can do
58+
59+
.. code-block:: Java
5660
5761
writer.start();
5862
// write the first batch
@@ -78,7 +82,9 @@ could overwrite previous ones.
7882

7983
Now the :class:`ByteArrayOutputStream` contains the complete stream, which holds 5 record batches.
8084
We can read such a stream with :class:`ArrowStreamReader`. Note that the :class:`VectorSchemaRoot` within the
81-
reader will be loaded with new values on every call to :class:`loadNextBatch()`::
85+
reader will be loaded with new values on every call to :class:`loadNextBatch()`
86+
87+
.. code-block:: Java
8288
8389
try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
8490
Schema schema = reader.getVectorSchemaRoot().getSchema();
@@ -91,7 +97,9 @@ reader will be loaded with new values on every call to :class:`loadNextBatch()`:
9197
9298
}
9399
94-
Here we also give a simple example with dictionary encoded vectors::
100+
Here we also give a simple example with dictionary encoded vectors
101+
102+
.. code-block:: Java
95103
96104
DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
97105
// create dictionary and provider
@@ -147,7 +155,9 @@ Here we also give a simple example with dictionary encoded vectors::
147155
148156
Writing and Reading Random Access Files
149157
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
150-
The :class:`ArrowFileWriter` has the same API as :class:`ArrowStreamWriter`::
158+
The :class:`ArrowFileWriter` has the same API as :class:`ArrowStreamWriter`
159+
160+
.. code-block:: Java
151161
152162
ByteArrayOutputStream out = new ByteArrayOutputStream();
153163
ArrowFileWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(out));
@@ -163,7 +173,9 @@ The :class:`ArrowFileWriter` has the same API as :class:`ArrowStreamWriter`::
163173
164174
The difference between :class:`ArrowFileReader` and :class:`ArrowStreamReader` is that the input source
165175
must have a ``seek`` method for random access. Because we have access to the entire payload, we know the
166-
number of record batches in the file, and can read any at random::
176+
number of record batches in the file, and can read any at random
177+
178+
.. code-block:: Java
167179
168180
try (ArrowFileReader reader = new ArrowFileReader(
169181
new ByteArrayReadableSeekableByteChannel(out.toByteArray()), allocator)) {

docs/source/java/vector.rst

Lines changed: 132 additions & 37 deletions
@@ -31,7 +31,92 @@ Table with non-intuitive names (BigInt = 64 bit integer, etc).
3131

3232
It is important that a vector is allocated before attempting to read or write,
3333
:class:`ValueVector` "should" strive to guarantee this order of operation:
34-
allocate > mutate > set valuecount > access > clear (or allocate to start the process over)
34+
create > allocate > mutate > set value count > access > clear (or allocate to start the process over).
35+
We will go through a concrete example to demonstrate each operation in the next section.
36+
37+
Vector Life Cycle
38+
====================
39+
As discussed above, each vector goes through several steps in its life cycle,
40+
and each step is triggered by a vector operation. In particular, we have the following vector operations:
41+
42+
1. **Vector creation**: we create a new vector object, for example by calling the vector constructor.
43+
The following code creates a new ``IntVector`` through its constructor:
44+
45+
.. code-block:: Java
46+
47+
RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
48+
...
49+
IntVector vector = new IntVector("int vector", allocator);
50+
51+
By now, a vector object is created. However, no underlying memory has been allocated, so we need the
52+
following step.
53+
54+
2. **Vector allocation**: in this step, we allocate memory for the vector. For most vectors, we
55+
have two options: 1) if we know the maximum vector capacity, we can specify it by calling the
56+
``allocateNew(int)`` method; 2) otherwise, we should call the ``allocateNew()`` method, and a default
57+
capacity will be allocated for it. For our running example, we assume that the vector capacity never
58+
exceeds 10:
59+
60+
.. code-block:: Java
61+
62+
vector.allocateNew(10);
63+
64+
3. **Vector mutation**: now we can populate the vector with values we desire. For all vectors, we can populate
65+
vector values through vector writers (an example will be given in the next section). For primitive types,
66+
we can also mutate the vector by the set methods. There are two classes of set methods: 1) if we can
67+
be sure the vector has enough capacity, we can call the ``set(index, value)`` method; 2) if we are not sure
68+
about the vector capacity, we should call the ``setSafe(index, value)`` method, which will automatically
69+
take care of vector reallocation, if the capacity is not sufficient. For our running example, we know the
70+
vector has enough capacity, so we can call
71+
72+
.. code-block:: Java
73+
74+
vector.set(/*index*/5, /*value*/25);
75+
76+
4. **Set value count**: for this step, we set the value count of the vector by calling the
77+
``setValueCount(int)`` method:
78+
79+
.. code-block:: Java
80+
81+
vector.setValueCount(10);
82+
83+
After this step, the vector enters an immutable state. In other words, we should no longer mutate it,
84+
unless we reuse the vector by allocating it again (this will be discussed shortly).
85+
86+
5. **Vector access**: it is time to access vector values. Similarly, we have two options to access values:
87+
1) get methods and 2) vector readers. Vector readers work for all types of vectors, while get methods are
88+
only available for primitive vectors. A concrete example of a vector reader will be given in the next section.
89+
Below is an example of vector access by get method:
90+
91+
.. code-block:: Java
92+
93+
int value = vector.get(5); // value == 25
94+
95+
6. **Vector clear**: when we are done with the vector, we should clear it to release its memory. This is done by
96+
calling the ``close()`` method:
97+
98+
.. code-block:: Java
99+
100+
vector.close();
101+
102+
Some points to note about the steps above:
103+
104+
* The steps are not necessarily performed in a linear sequence. Instead, they can be in a loop. For example,
105+
when a vector enters the access step, we can also go back to the vector mutation step, and then set value
106+
count, access vector, and so on.
107+
108+
* We should try to make sure the above steps are carried out in order. Otherwise, the vector
109+
may be in an undefined state, and some unexpected behavior may occur. However, this restriction
110+
is not strict. That means it is possible that we violate the order above, but still get
111+
correct results.
112+
113+
* When mutating vector values through set methods, we should prefer ``set(index, value)`` methods to
114+
``setSafe(index, value)`` methods whenever possible, to avoid unnecessary performance overhead of handling
115+
vector capacity.
116+
117+
* All vectors implement the ``AutoCloseable`` interface, so they must be closed explicitly when they are
118+
no longer used, to avoid resource leaks. To ensure this, it is recommended to place vector-related operations
119+
into a try-with-resources block.
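
The trade-off between ``set(index, value)`` and ``setSafe(index, value)`` described in the notes above can be illustrated with a plain-Java stand-in. This is not Arrow code: the class ``CapacitySketch``, its exact-size allocation, and its doubling growth policy are assumptions made purely for illustration, and only the method names mirror those used in the text.

```java
import java.util.Arrays;

// A simplified, hypothetical model of a fixed-width vector's capacity handling.
// It is NOT Arrow's implementation; it only mirrors the method names in the text.
public class CapacitySketch {
    private int[] data = new int[0]; // stands in for the vector's data buffer

    /** Allocation step: reserve room for exactly `capacity` values (an assumption of this model). */
    public void allocateNew(int capacity) {
        data = new int[capacity];
    }

    public int getValueCapacity() {
        return data.length;
    }

    /** Unchecked-style write: the caller promises that index < capacity. */
    public void set(int index, int value) {
        data[index] = value; // no capacity handling, so no extra work per call
    }

    /** Checked write: grow the buffer (here by doubling) until the index fits, then write. */
    public void setSafe(int index, int value) {
        if (index >= data.length) {
            int newCapacity = Math.max(1, data.length);
            while (newCapacity <= index) {
                newCapacity *= 2;
            }
            data = Arrays.copyOf(data, newCapacity); // existing values are preserved
        }
        data[index] = value;
    }

    public int get(int index) {
        return data[index];
    }

    public static void main(String[] args) {
        CapacitySketch vector = new CapacitySketch();
        vector.allocateNew(10);
        vector.set(5, 25);      // capacity is known to be sufficient
        vector.setSafe(12, 7);  // capacity is not sufficient, so the model grows first
        System.out.println(vector.get(5) + " " + vector.get(12) + " " + vector.getValueCapacity());
    }
}
```

In this model the extra cost of ``setSafe`` is the capacity check on every call plus an occasional copy, which is the overhead the note above suggests avoiding when the capacity is already known.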
35120

36121
Building ValueVector
37122
====================
@@ -41,50 +126,58 @@ Note that the current implementation doesn't enforce the rule that Arrow objects
41126
set/setSafe APIs and concrete subclasses of FieldWriter for populating values.
42127

43128
For example, the code below shows how to build a :class:`BigIntVector`. In this case, we build a
44-
vector of the range 0 to 7 where the element that should hold the fourth value is nulled::
45-
46-
BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
47-
48-
BigIntVector vector = new BigIntVector("vector", allocator);
49-
vector.allocateNew(8);
50-
vector.set(0, 1);
51-
vector.set(1, 2);
52-
vector.set(2, 3);
53-
vector.setNull(3);
54-
vector.set(4, 5);
55-
vector.set(5, 6);
56-
vector.set(6, 7);
57-
vector.set(7, 8);
58-
vector.setValueCount(8); // this finalizes the vector by convention.
129+
vector of the range 0 to 7 where the element that should hold the fourth value is nulled
130+
131+
.. code-block:: Java
132+
133+
try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
134+
BigIntVector vector = new BigIntVector("vector", allocator)) {
135+
vector.allocateNew(8);
136+
vector.set(0, 1);
137+
vector.set(1, 2);
138+
vector.set(2, 3);
139+
vector.setNull(3);
140+
vector.set(4, 5);
141+
vector.set(5, 6);
142+
vector.set(6, 7);
143+
vector.set(7, 8);
144+
vector.setValueCount(8); // this finalizes the vector by convention.
145+
...
146+
}
59147
60148
The :class:`BigIntVector` holds two ArrowBufs. The first buffer holds the null bitmap, which consists
61149
here of a single byte with the bits 1|1|1|1|0|1|1|1 (the bit is 1 if the value is non-null).
62150
The second buffer contains all the above values. As the fourth entry is null, the value at that position
63151
in the buffer is undefined. Note that, unlike the set API, the setSafe API will check value capacity before setting
64152
values and reallocate buffers if necessary.
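
To make the null bitmap described above concrete, here is a small plain-Java sketch that packs validity flags into bytes using least-significant-bit numbering. It does not touch Arrow's buffers; the class and method names (``ValidityBitmapSketch``, ``pack``, ``isSet``) are hypothetical and exist only for this illustration.

```java
// Illustrative only: packs "is this element non-null?" flags into a bitmap,
// one bit per element, with element i stored in bit (i % 8) of byte (i / 8).
public class ValidityBitmapSketch {

    /** Builds a bitmap in which a 1 bit means the element is non-null. */
    public static byte[] pack(boolean[] valid) {
        byte[] bitmap = new byte[(valid.length + 7) / 8];
        for (int i = 0; i < valid.length; i++) {
            if (valid[i]) {
                bitmap[i / 8] |= 1 << (i % 8);
            }
        }
        return bitmap;
    }

    /** Returns true if the element at `index` is marked non-null. */
    public static boolean isSet(byte[] bitmap, int index) {
        return (bitmap[index / 8] & (1 << (index % 8))) != 0;
    }

    public static void main(String[] args) {
        // Eight elements, the fourth one (index 3) is null, as in the example above.
        boolean[] valid = {true, true, true, false, true, true, true, true};
        byte[] bitmap = pack(valid);
        // Written most-significant bit first, the single byte reads 11110111.
        System.out.println(Integer.toBinaryString(bitmap[0] & 0xFF)); // prints 11110111
        System.out.println(isSet(bitmap, 3)); // prints false
    }
}
```

Reading the printed byte from right to left gives the elements in index order, so the single 0 sits at the position of the fourth element.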
65153

66-
Here is how to build a vector using a writer::
67-
68-
BigIntVector vector = new BigIntVector("vector", allocator);
69-
BigIntWriter writer = new BigIntWriterImpl(vector);
70-
writer.setPosition(0);
71-
writer.writeBigInt(1);
72-
writer.setPosition(1);
73-
writer.writeBigInt(2);
74-
writer.setPosition(2);
75-
writer.writeBigInt(3);
76-
// writer.setPosition(3) is not called, which means the fourth value is null.
77-
writer.setPosition(4);
78-
writer.writeBigInt(5);
79-
writer.setPosition(5);
80-
writer.writeBigInt(6);
81-
writer.setPosition(6);
82-
writer.writeBigInt(7);
83-
writer.setPosition(7);
84-
writer.writeBigInt(8);
154+
Here is how to build a vector using a writer
155+
156+
.. code-block:: Java
157+
158+
try (BigIntVector vector = new BigIntVector("vector", allocator);
159+
BigIntWriter writer = new BigIntWriterImpl(vector)) {
160+
writer.setPosition(0);
161+
writer.writeBigInt(1);
162+
writer.setPosition(1);
163+
writer.writeBigInt(2);
164+
writer.setPosition(2);
165+
writer.writeBigInt(3);
166+
// writer.setPosition(3) is not called, which means the fourth value is null.
167+
writer.setPosition(4);
168+
writer.writeBigInt(5);
169+
writer.setPosition(5);
170+
writer.writeBigInt(6);
171+
writer.setPosition(6);
172+
writer.writeBigInt(7);
173+
writer.setPosition(7);
174+
writer.writeBigInt(8);
175+
}
85176
86177
There are get APIs and concrete subclasses of :class:`FieldReader` for accessing vector values. Note, however,
87-
that writers and readers are not as efficient as direct access::
178+
that writers and readers are not as efficient as direct access
179+
180+
.. code-block:: Java
88181
89182
// access via get API
90183
for (int i = 0; i < vector.getValueCount(); i++) {
@@ -106,7 +199,9 @@ to be declared is that writer/reader is not as efficient as direct access::
106199
Slicing
107200
=======
108201
Similar to the C++ implementation, it is possible to make zero-copy slices of vectors to obtain a vector
109-
referring to some logical sub-sequence of the data through :class:`TransferPair`::
202+
referring to some logical sub-sequence of the data through :class:`TransferPair`
203+
204+
.. code-block:: Java
110205
111206
IntVector vector = new IntVector("intVector", allocator);
112207
for (int i = 0; i < 10; i++) {

docs/source/java/vector_schema_root.rst

Lines changed: 9 additions & 3 deletions
@@ -30,7 +30,9 @@ of batches rather than creating a new :class:`VectorSchemaRoot` instance each ti
3030
may have no data (say it was transferred downstream or not yet populated).
3131

3232

33-
Here is an example of building a :class:`VectorSchemaRoot`::
33+
Here is an example of building a :class:`VectorSchemaRoot`
34+
35+
.. code-block:: Java
3436
3537
BitVector bitVector = new BitVector("boolean", allocator);
3638
VarCharVector varCharVector = new VarCharVector("varchar", allocator);
@@ -49,7 +51,9 @@ Here is the example of building a :class:`VectorSchemaRoot`::
4951
5052
The vectors within a :class:`VectorSchemaRoot` could be loaded/unloaded via :class:`VectorLoader` and :class:`VectorUnloader`.
5153
:class:`VectorLoader` and :class:`VectorUnloader` handle converting between :class:`VectorSchemaRoot` and :class:`ArrowRecordBatch` (the
52-
representation of a RecordBatch :doc:`IPC <../format/IPC.rst>` message). Examples as below::
54+
representation of a RecordBatch :doc:`IPC <../format/IPC.rst>` message). Examples as below
55+
56+
.. code-block:: Java
5357
5458
// create a VectorSchemaRoot root1 and convert its data into recordBatch
5559
VectorSchemaRoot root1 = new VectorSchemaRoot(fields, vectors);
@@ -61,7 +65,9 @@ representation of a RecordBatch :doc:`IPC <../format/IPC.rst>` message). Example
6165
VectorLoader loader = new VectorLoader(root2);
6266
loader.load(recordBatch);
6367
64-
A new :class:`VectorSchemaRoot` could be sliced from an existing instance with zero-copy::
68+
A new :class:`VectorSchemaRoot` could be sliced from an existing instance with zero-copy
69+
70+
.. code-block:: Java
6571
6672
// 0 indicates the start index (inclusive) and 5 indicates the length of the slice.
6773
VectorSchemaRoot newRoot = vectorSchemaRoot.slice(0, 5);
