Commit 4fe330a

wesm authored and emkornfield committed
ARROW-6678: [C++][Parquet] Binary data stored in Parquet metadata must be base64-encoded to be UTF-8 compliant
I have added a simple base64 implementation (Zlib license) to arrow/vendored from https://github.com/ReneNyffenegger/cpp-base64

Closes apache#5493 from wesm/ARROW-6678 and squashes the following commits:

c058e86 <Wes McKinney> Simplify, add MSVC exports
06f75cd <Wes McKinney> Fix Python unit test that needs to base64-decode now
eabb121 <Wes McKinney> Fix LICENSE.txt, add iwyu export
b3a584a <Wes McKinney> Add vendored base64 C++ implementation and ensure that Thrift KeyValue in Parquet metadata is UTF-8

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
1 parent 199d3cf commit 4fe330a

7 files changed

Lines changed: 206 additions & 3 deletions

LICENSE.txt

Lines changed: 28 additions & 0 deletions
@@ -1874,3 +1874,31 @@ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+----------------------------------------------------------------------
+
+cpp/src/arrow/vendored/base64.cpp has the following license
+
+ZLIB License
+
+Copyright (C) 2004-2017 René Nyffenegger
+
+This source code is provided 'as-is', without any express or implied
+warranty. In no event will the author be held liable for any damages arising
+from the use of this software.
+
+Permission is granted to anyone to use this software for any purpose, including
+commercial applications, and to alter it and redistribute it freely, subject to
+the following restrictions:
+
+1. The origin of this source code must not be misrepresented; you must not
+   claim that you wrote the original source code. If you use this source code
+   in a product, an acknowledgment in the product documentation would be
+   appreciated but is not required.
+
+2. Altered source versions must be plainly marked as such, and must not be
+   misrepresented as being the original source code.
+
+3. This notice may not be removed or altered from any source distribution.
+
+René Nyffenegger rene.nyffenegger@adp-gmbh.ch

cpp/src/arrow/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -145,6 +145,7 @@ set(ARROW_SRCS
     util/thread_pool.cc
     util/trie.cc
     util/utf8.cc
+    vendored/base64.cpp
     vendored/datetime/tz.cpp)
 
 # Add dependencies for third-party allocators.

cpp/src/arrow/util/base64.h

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <string>
+
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+ARROW_EXPORT
+std::string base64_encode(unsigned char const*, unsigned int len);
+
+ARROW_EXPORT
+std::string base64_decode(std::string const& s);
+
+} // namespace util
+} // namespace arrow
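
The two declarations in this header can be illustrated with a round trip. The sketch below is not the vendored implementation: it is a minimal stand-in, assuming the same signatures and the standard base64 alphabet with `=` padding, placed in a hypothetical `sketch` namespace so that it compiles without Arrow. The helper `round_trips` is likewise an assumption added for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-ins with the same signatures as the header above.
// This is a sketch for illustration, not the code vendored by the commit.
namespace sketch {

static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string base64_encode(unsigned char const* bytes, unsigned int len) {
  std::string out;
  out.reserve(((len + 2) / 3) * 4);
  for (unsigned int i = 0; i < len; i += 3) {
    // Pack up to three input bytes into a 24-bit group, zero-filling the tail.
    std::uint32_t group = static_cast<std::uint32_t>(bytes[i]) << 16;
    if (i + 1 < len) group |= static_cast<std::uint32_t>(bytes[i + 1]) << 8;
    if (i + 2 < len) group |= bytes[i + 2];
    // Emit four 6-bit symbols, replacing the missing ones with '=' padding.
    out += kAlphabet[(group >> 18) & 0x3f];
    out += kAlphabet[(group >> 12) & 0x3f];
    out += (i + 1 < len) ? kAlphabet[(group >> 6) & 0x3f] : '=';
    out += (i + 2 < len) ? kAlphabet[group & 0x3f] : '=';
  }
  return out;
}

std::string base64_decode(std::string const& s) {
  std::string out;
  std::uint32_t acc = 0;  // bit accumulator; only the low bits are ever read
  int bits = 0;           // number of not-yet-emitted bits held in acc
  for (char c : s) {
    const char* pos = (c == '\0') ? nullptr : std::strchr(kAlphabet, c);
    if (pos == nullptr) break;  // stop at '=' padding or any foreign character
    acc = (acc << 6) | static_cast<std::uint32_t>(pos - kAlphabet);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out += static_cast<char>((acc >> bits) & 0xff);
    }
  }
  return out;
}

// Encode then decode; returns true when the bytes survive unchanged.
bool round_trips(const std::string& bytes) {
  std::string encoded = base64_encode(
      reinterpret_cast<const unsigned char*>(bytes.data()),
      static_cast<unsigned int>(bytes.size()));
  return base64_decode(encoded) == bytes;
}

}  // namespace sketch
```

Note that the encoder takes a pointer and a length while the decoder takes and returns `std::string`, so binary payloads containing NUL bytes have to be built with an explicit length, for example `std::string(ptr, n)`.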

cpp/src/arrow/vendored/base64.cpp

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+/*
+   base64.cpp and base64.h
+
+   base64 encoding and decoding with C++.
+
+   Version: 1.01.00
+
+   Copyright (C) 2004-2017 René Nyffenegger
+
+   This source code is provided 'as-is', without any express or implied
+   warranty. In no event will the author be held liable for any damages
+   arising from the use of this software.
+
+   Permission is granted to anyone to use this software for any purpose,
+   including commercial applications, and to alter it and redistribute it
+   freely, subject to the following restrictions:
+
+   1. The origin of this source code must not be misrepresented; you must not
+      claim that you wrote the original source code. If you use this source code
+      in a product, an acknowledgment in the product documentation would be
+      appreciated but is not required.
+
+   2. Altered source versions must be plainly marked as such, and must not be
+      misrepresented as being the original source code.
+
+   3. This notice may not be removed or altered from any source distribution.
+
+   René Nyffenegger rene.nyffenegger@adp-gmbh.ch
+
+*/
+
+#include "arrow/util/base64.h"
+#include <iostream>
+
+namespace arrow {
+namespace util {
+
+static const std::string base64_chars =
+             "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+             "abcdefghijklmnopqrstuvwxyz"
+             "0123456789+/";
+
+
+static inline bool is_base64(unsigned char c) {
+  return (isalnum(c) || (c == '+') || (c == '/'));
+}
+
+std::string base64_encode(unsigned char const* bytes_to_encode, unsigned int in_len) {
+  std::string ret;
+  int i = 0;
+  int j = 0;
+  unsigned char char_array_3[3];
+  unsigned char char_array_4[4];
+
+  while (in_len--) {
+    char_array_3[i++] = *(bytes_to_encode++);
+    if (i == 3) {
+      char_array_4[0] = (char_array_3[0] & 0xfc) >> 2;
+      char_array_4[1] = ((char_array_3[0] & 0x03) << 4) + ((char_array_3[1] & 0xf0) >> 4);
+      char_array_4[2] = ((char_array_3[1] & 0x0f) << 2) + ((char_array_3[2] & 0xc0) >> 6);
+      char_array_4[3] = char_array_3[2] & 0x3f;
+
+      for(i = 0; (i <4) ; i++)
+        ret += base64_chars[char_array_4[i]];
+      i = 0;
+    }
+  }
+
+  if (i)
+  {
+    for(j = i; j < 3; j++)
+      char_array_3[j] = '\0';
+
+    char_array_4[0] = ( char_array_3[0] & 0xfc) >> 2;
+    char_array_4[1] = ((char_array_3[0] & 0x03) << 4) + ((char_array_3[1] & 0xf0) >> 4);
+    char_array_4[2] = ((char_array_3[1] & 0x0f) << 2) + ((char_array_3[2] & 0xc0) >> 6);
+
+    for (j = 0; (j < i + 1); j++)
+      ret += base64_chars[char_array_4[j]];
+
+    while((i++ < 3))
+      ret += '=';
+
+  }
+
+  return ret;
+
+}
+
+std::string base64_decode(std::string const& encoded_string) {
+  size_t in_len = encoded_string.size();
+  int i = 0;
+  int j = 0;
+  int in_ = 0;
+  unsigned char char_array_4[4], char_array_3[3];
+  std::string ret;
+
+  while (in_len-- && ( encoded_string[in_] != '=') && is_base64(encoded_string[in_])) {
+    char_array_4[i++] = encoded_string[in_]; in_++;
+    if (i ==4) {
+      for (i = 0; i <4; i++)
+        char_array_4[i] = base64_chars.find(char_array_4[i]) & 0xff;
+
+      char_array_3[0] = ( char_array_4[0] << 2 ) + ((char_array_4[1] & 0x30) >> 4);
+      char_array_3[1] = ((char_array_4[1] & 0xf) << 4) + ((char_array_4[2] & 0x3c) >> 2);
+      char_array_3[2] = ((char_array_4[2] & 0x3) << 6) + char_array_4[3];
+
+      for (i = 0; (i < 3); i++)
+        ret += char_array_3[i];
+      i = 0;
+    }
+  }
+
+  if (i) {
+    for (j = 0; j < i; j++)
+      char_array_4[j] = base64_chars.find(char_array_4[j]) & 0xff;
+
+    char_array_3[0] = (char_array_4[0] << 2) + ((char_array_4[1] & 0x30) >> 4);
+    char_array_3[1] = ((char_array_4[1] & 0xf) << 4) + ((char_array_4[2] & 0x3c) >> 2);
+
+    for (j = 0; (j < i - 1); j++) ret += char_array_3[j];
+  }
+
+  return ret;
+}
+
+} // namespace util
+} // namespace arrow

cpp/src/parquet/arrow/reader_internal.cc

Lines changed: 3 additions & 1 deletion
@@ -39,6 +39,7 @@
 #include "arrow/table.h"
 #include "arrow/type.h"
 #include "arrow/type_traits.h"
+#include "arrow/util/base64.h"
 #include "arrow/util/checked_cast.h"
 #include "arrow/util/int_util.h"
 #include "arrow/util/logging.h"
@@ -576,7 +577,8 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
   // The original Arrow schema was serialized using the store_schema option. We
   // deserialize it here and use it to inform read options such as
   // dictionary-encoded fields
-  auto schema_buf = std::make_shared<Buffer>(metadata->value(schema_index));
+  auto decoded = ::arrow::util::base64_decode(metadata->value(schema_index));
+  auto schema_buf = std::make_shared<Buffer>(decoded);
 
   ::arrow::ipc::DictionaryMemo dict_memo;
   ::arrow::io::BufferReader input(schema_buf);

cpp/src/parquet/arrow/writer.cc

Lines changed: 8 additions & 1 deletion
@@ -30,6 +30,7 @@
 #include "arrow/ipc/writer.h"
 #include "arrow/table.h"
 #include "arrow/type.h"
+#include "arrow/util/base64.h"
 #include "arrow/visitor_inline.h"
 
 #include "parquet/arrow/reader_internal.h"
@@ -577,7 +578,13 @@ Status GetSchemaMetadata(const ::arrow::Schema& schema, ::arrow::MemoryPool* poo
   ::arrow::ipc::DictionaryMemo dict_memo;
   std::shared_ptr<Buffer> serialized;
   RETURN_NOT_OK(::arrow::ipc::SerializeSchema(schema, &dict_memo, pool, &serialized));
-  result->Append(kArrowSchemaKey, serialized->ToString());
+
+  // The serialized schema is not UTF-8, which is required for Thrift
+  std::string schema_as_string = serialized->ToString();
+  std::string schema_base64 = ::arrow::util::base64_encode(
+      reinterpret_cast<const unsigned char*>(schema_as_string.data()),
+      static_cast<unsigned int>(schema_as_string.size()));
+  result->Append(kArrowSchemaKey, schema_base64);
   *out = result;
   return Status::OK();
 }
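
The comment in this hunk says the serialized schema is not UTF-8. A rough way to see why base64 helps: its output uses only 7-bit ASCII characters, and ASCII text is always valid UTF-8, whereas arbitrary binary can contain bytes such as 0xFF that never occur in UTF-8. The helper below is hypothetical (it is not part of Arrow or Parquet) and checks only this sufficient condition; it is not a full UTF-8 validator.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Arrow or Parquet: returns true when every
// byte is 7-bit ASCII. ASCII is a subset of UTF-8, so a true result is a
// sufficient (though not necessary) condition for the string being valid UTF-8.
bool is_seven_bit_ascii(const std::string& s) {
  for (unsigned char c : s) {
    if (c >= 0x80) return false;  // high bit set: not plain ASCII
  }
  return true;
}
```

For instance, the five illustrative bytes `FF FF FF FF 10` fail this check, while their base64 form `/////xA=` passes it.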

python/pyarrow/tests/test_extension_type.py

Lines changed: 4 additions & 1 deletion
@@ -372,7 +372,10 @@ def test_parquet(tmpdir, registered_period_type):
     meta = pq.read_metadata(filename)
     assert meta.schema.column(0).physical_type == "INT64"
     assert b"ARROW:schema" in meta.metadata
-    schema = pa.read_schema(pa.BufferReader(meta.metadata[b"ARROW:schema"]))
+
+    import base64
+    decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])
+    schema = pa.read_schema(pa.BufferReader(decoded_schema))
     assert schema.field("ext").metadata == {
         b'ARROW:extension:metadata': b'freq=D',
         b'ARROW:extension:name': b'pandas.period'}
