Commit b2fa55d

Matthew Topol authored and emkornfield committed
ARROW-12045: [Go][Parquet] Initial Chunk of Parquet port to Go
This is a Go implementation of Parquet, based on the C++ implementation but tuned and optimized for Go. I spent the first couple of months of this year writing it, with the goals of native, easy integration with the Arrow library, high performance, and at minimum feature parity with the C++ implementation.

Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This initial chunk consists of an internal utils directory analogous to cpp/arrow/utils/ (bit readers/writers, bit run readers, etc.), which the Go implementation ultimately uses, with c2goasm used to reach the necessary performance in certain areas.

This is part 1 of the implementation. I'll wait for each chunk to get merged before pushing the next PR so that everything stays in sync.

CC: @emkornfield @wesm @sbinet @nickpoorman

Closes apache#9671 from zeroshade/arrow-7905-p1

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
1 parent 2c5e264 commit b2fa55d

71 files changed

Lines changed: 39498 additions & 0 deletions
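
The commit message mentions bit run readers among the internal utilities. As a rough illustration of the idea only (not the code in this commit), the sketch below walks a bitmap and reports maximal runs of identical bits, assuming least-significant-bit-first ordering within each byte. The names `bitRun`, `bitSet`, and `bitRuns` are hypothetical, and the loop works one bit at a time for clarity, where a performance-oriented reader would process whole words.

```go
package main

import "fmt"

// bitRun describes a maximal run of identical bits (hypothetical type).
type bitRun struct {
	Len int64 // number of consecutive bits in the run
	Set bool  // true if the bits in the run are 1
}

// bitSet reports whether bit i of buf is set, counting from the
// least-significant bit of each byte.
func bitSet(buf []byte, i int64) bool {
	return buf[i/8]&(1<<(uint(i)%8)) != 0
}

// bitRuns returns the runs covering the first n bits of buf.
// It is a naive bit-at-a-time sketch, not an optimized reader.
func bitRuns(buf []byte, n int64) []bitRun {
	var runs []bitRun
	for i := int64(0); i < n; {
		cur := bitSet(buf, i)
		start := i
		for i < n && bitSet(buf, i) == cur {
			i++
		}
		runs = append(runs, bitRun{Len: i - start, Set: cur})
	}
	return runs
}

func main() {
	// 0x0F sets bits 0-3; 0xF0 in the second byte sets bits 12-15.
	buf := []byte{0x0F, 0xF0}
	for _, r := range bitRuns(buf, 16) {
		fmt.Printf("%d bits, set=%t\n", r.Len, r.Set)
	}
}
```

Grouping bits into runs lets a caller handle long stretches of valid or null values in one step instead of testing every bit.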

ci/scripts/go_build.sh

Lines changed: 7 additions & 0 deletions
@@ -27,3 +27,10 @@ go get -d -t -v ./...
 go install -v ./...
 
 popd
+
+pushd ${source_dir}/parquet
+
+go get -d -t -v ./...
+go install -v ./...
+
+popd

ci/scripts/go_test.sh

Lines changed: 8 additions & 0 deletions
@@ -28,3 +28,11 @@ for d in $(go list ./... | grep -v vendor); do
 done
 
 popd
+
+pushd ${source_dir}/parquet
+
+for d in $(go list ./... | grep -v vendor); do
+  go test $d
+done
+
+popd

dev/release/rat_exclude_files.txt

Lines changed: 1 addition & 0 deletions
@@ -124,6 +124,7 @@ go/arrow/internal/cpu/*
 go/arrow/type_string.go
 go/*.tmpldata
 go/*.s
+go/parquet/go.sum
 js/.npmignore
 js/closure-compiler-scripts/*
 js/src/fb/*.ts

go/arrow/bitutil/bitutil.go

Lines changed: 3 additions & 0 deletions
@@ -30,6 +30,9 @@ var (
 // IsMultipleOf8 returns whether v is a multiple of 8.
 func IsMultipleOf8(v int64) bool { return v&7 == 0 }
 
+// IsMultipleOf64 returns whether v is a multiple of 64
+func IsMultipleOf64(v int64) bool { return v&63 == 0 }
+
 func BytesForBits(bits int64) int64 { return (bits + 7) >> 3 }
 
 // NextPowerOf2 rounds x to the next power of two.
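
Both `IsMultipleOf8` and the new `IsMultipleOf64` rely on the same masking trick: for a power-of-two n, a value is a multiple of n exactly when its low log2(n) bits are zero, so `v&(n-1) == 0` can stand in for `v%n == 0`. The sketch below generalizes this in a hypothetical helper, `isMultipleOf` (not part of the commit), and compares the mask against the modulo form.

```go
package main

import "fmt"

// isMultipleOf is a hypothetical generalization of IsMultipleOf8 and
// IsMultipleOf64. It is only valid when n is a positive power of two:
// n-1 is then a mask of the low log2(n) bits, all of which must be
// zero for v to be a multiple of n.
func isMultipleOf(v, n int64) bool { return v&(n-1) == 0 }

func main() {
	for _, v := range []int64{0, 63, 64, 128, 130, -64} {
		fmt.Printf("%4d: mask=%t modulo=%t\n", v, isMultipleOf(v, 64), v%64 == 0)
	}
}
```

The mask form also agrees with the modulo form for negative values, because two's complement keeps the low bits of a negative multiple of 64 at zero.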

go/parquet/.gitignore

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Binaries for programs and plugins
+*.exe
+*.exe~
+*.dll
+*.so
+*.dylib
+
+# Test binary, built with `go test -c`
+*.test
+
+# Output of the go coverage tool, specifically when used with LiteIDE
+*.out
+
+# Dependency directories (remove the comment below to include it)
+# vendor/

go/parquet/LICENSE.txt

Lines changed: 1987 additions & 0 deletions
Large diffs are not rendered by default.

go/parquet/doc.go

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package parquet provides an implementation of Apache Parquet for Go.
+//
+// Apache Parquet is an open-source columnar data storage format using the record
+// shredding and assembly algorithm to accommodate complex data structures which
+// can then be used to efficiently store the data.
+//
+// This implementation is a native go implementation for reading and writing the
+// parquet file format.
+//
+// Install
+//
+// You can download the library via:
+// go get -u github.com/apache/arrow/go/parquet
+//
+// In addition, two cli utilities are provided:
+// go install github.factset.com/mtopol/parquet-go/cmd/parquet_reader
+// go install github.factset.com/mtopol/parquet-go/cmd/parquet_schema
+//
+// Modules
+//
+// This top level parquet package contains the basic common types and reader/writer
+// properties along with some utilities that are used throughout the other modules.
+//
+// The file module contains the functions for directly reading/writing parquet files
+// including Column Readers and Column Writers.
+//
+// The metadata module contains the types for managing the lower level file/rowgroup/column
+// metadata inside of a ParquetFile including inspecting the statistics.
+//
+// The pqarrow module contains helper functions and types for converting directly
+// between Parquet and Apache Arrow formats.
+//
+// The schema module contains the types for manipulating / inspecting / creating
+// parquet file schemas.
+//
+// Primitive Types
+//
+// The Parquet Primitive Types and their corresponding Go types are Boolean (bool),
+// Int32 (int32), Int64 (int64), Int96 (parquet.Int96), Float (float32), Double (float64),
+// ByteArray (parquet.ByteArray) and FixedLenByteArray (parquet.FixedLenByteArray).
+//
+// Encodings
+//
+// The encoding types supported in this package are:
+// Plain, Plain/RLE Dictionary, Delta Binary Packed (only integer types), Delta Byte Array
+// (only ByteArray), Delta Length Byte Array (only ByteArray)
+//
+// Tip: Some platforms don't necessarily support all kinds of encodings. If you're not
+// sure what to use, just use Plain and Dictionary encoding.
+package parquet
+
+//go:generate thrift -o internal -r --gen go ../../cpp/src/parquet/parquet.thrift
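
The package documentation recommends Plain and Dictionary encoding as safe defaults. To show what dictionary encoding means conceptually, here is a minimal sketch that is independent of this package's API: each distinct value is stored once and every occurrence is replaced by a small integer index. The functions `dictEncode` and `dictDecode` are hypothetical names for illustration. The sketch stops at the index list, whereas the Parquet format goes on to pack those indices with its RLE/bit-packing hybrid.

```go
package main

import "fmt"

// dictEncode replaces each value with an index into a dictionary of
// distinct values, kept in first-seen order. (Illustrative only; the
// real encoder also packs the indices compactly.)
func dictEncode(values []string) (dict []string, indices []int32) {
	seen := make(map[string]int32)
	for _, v := range values {
		idx, ok := seen[v]
		if !ok {
			idx = int32(len(dict))
			seen[v] = idx
			dict = append(dict, v)
		}
		indices = append(indices, idx)
	}
	return dict, indices
}

// dictDecode reverses dictEncode by looking each index up in dict.
func dictDecode(dict []string, indices []int32) []string {
	out := make([]string, len(indices))
	for i, idx := range indices {
		out[i] = dict[idx]
	}
	return out
}

func main() {
	values := []string{"us", "de", "us", "us", "fr", "de"}
	dict, indices := dictEncode(values)
	fmt.Println(dict, indices)
	fmt.Println(dictDecode(dict, indices))
}
```

Dictionary encoding pays off when a column has few distinct values relative to its length. For mostly unique values the dictionary adds overhead, and Plain encoding is the simpler choice.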

go/parquet/go.mod

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+module github.com/apache/arrow/go/parquet
+
+go 1.15
+
+require (
+	github.com/apache/arrow/go/arrow v0.0.0-20210310173904-5de02e3697aa
+	github.com/klauspost/asmfmt v1.2.3
+	github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8
+	github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3
+	github.com/stretchr/testify v1.7.0
+	golang.org/x/exp v0.0.0-20210220032938-85be41e4509f
+	golang.org/x/sys v0.0.0-20210309074719-68d13333faf2
+	golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1
+	gonum.org/v1/gonum v0.8.2
+)
