465 questions
Advice · 5 votes · 1 reply · 115 views
Parquet vs ORC in Iceberg
Hi, I have lately been interested in learning Iceberg. There is something I was not able to figure out, so I thought I would ask here.
I really want to know why Apache Parquet is the native file format used when ...
0 votes · 1 answer · 58 views
I'm writing repeated string values to a string column in an ORC file using Java, and while reading the ORC file back I encounter a NullPointerException
When I try to write the same value for every row of a string column in an ORC file, only the first row returns the written value; reading the remaining rows raises a NullPointerException. In some cases, ...
0 votes · 1 answer · 165 views
Apache ORC buffer size too small
I face the attached problem when reading an ORC file:
Is it possible to change this buffer size of 65536 to the needed 1817279?
Which configuration values do I have to adapt in order to set ...
1 vote · 1 answer · 149 views
How to use python to create ORC file compressed with ZLIB compression level 9?
I want to create an ORC file compressed with ZLIB compression level 9.
Thing is, when using pyarrow.orc, I can only choose between "Speed" and "Compression" mode,
and can't control ...
0 votes · 0 answers · 170 views
Apache Beam code to write output in ORC format
I am new to Apache Beam, and I have a use case where I need to write Java streaming code to read from a Kafka topic (from which I extract some CustomObject.class) and output the entries to HDFS in ...
1 vote · 1 answer · 2k views
I get a "Fatal Python error: Aborted" and no explanatory error message I can work with when I try to open a simple .orc file with pyarrow
I am using:
Win 10 Pro
Intel(R) Xeon(R) W-1250 CPU @ 3.30GHz / 16 GB RAM
Anaconda Navigator 2.5.0
Python 3.10.13 in venv
pyarrow 11.0.0
pandas 2.1.1
Running scripts in Spyder IDE 5.4.3
I want to open/...
0 votes · 1 answer · 227 views
Read ORC files from AWS S3 bucket in Flink app
We are using Flink version 1.13.5 and trying to read ORC files from an AWS S3 location. We are deploying our application in a self-managed Flink cluster. Please find the code below for ...
0 votes · 1 answer · 201 views
Binary format that allows storing multiple pandas dataframes with different columns, widths, rows
I have about 200 pandas dataframes, and every dataframe has some unique columns, or maybe completely different columns. Example:
df1 = pd.DataFrame({
'Product': ['Apple', 'Banana', 'Orange', 'Mango'],...
0 votes · 0 answers · 893 views
Detection and Cleaning of Struck-out Text in Handwriting
I have images where text is struck out and replaced by the words that follow. Sometimes it's just one line that gets struck out; other times, multiple lines are.
My expected output should be like this: remove ...
0 votes · 0 answers · 78 views
In Hadoop, why does the Parquet format occupy more space than the original txt in my test?
I am testing the impact of different data formats on Hive query efficiency (Windows 10, only my desktop). The original data is 400 txt files of almost the same size (169 MB in total). I first converted to ...
0 votes · 0 answers · 102 views
Issue downloading/parsing ORC File from S3, or from Local Path
I have an application deployed that is supposed to parse/download an ORC file from an S3 bucket.
I have tried multiple things, one of them being downloading the file locally in the app and trying to ...
0 votes · 0 answers · 438 views
How can I optimize ORC Snappy compression in Spark?
My ORC-with-Snappy dataset was 3.3 GB when it was originally constructed via a series of small writes to 128 KB files. It totals 400 million rows, has one timestamp column, and the rest ...
1 vote · 0 answers · 126 views
Pyspark error while writing large dataframe to file
I am trying to write my dataframe df_trans (which has about 10 million records) to a file, and I want to compare the performance of writing it as Parquet vs ORC vs CSV.
df_trans.write.mode('overwrite').parquet(...
0 votes · 0 answers · 207 views
To read ORC file from GCS bucket
To read an ORC file from a GCS bucket I'm using the code snippet below, where I create a Hadoop configuration and set the required file system attributes to use the GCS bucket:
val hadoopConf = new ...
2 votes · 1 answer · 424 views
Reading ORC does not trigger projection pushdown and predicate pushdown
I have a fileA in ORC with the following format:
key
id_1
id_2
value
value_1
....
value_30
If I use the following config:
'spark.sql.orc.filterPushdown' : true
And ...
1 vote · 1 answer · 371 views
"No enum constant org.apache.orc.CompressionKind.ZSTD" when inserting data into an ORC table compressed with ZSTD
I have created a table in Hive 3.1.3 as below:
CREATE EXTERNAL TABLE test_tez_orc_zstd
(
  id BIGINT
)
STORED AS ORC
LOCATION '...'
TBLPROPERTIES ('orc.compress'='ZSTD');
It is created, and then I wanted to ...
1 vote · 1 answer · 372 views
Does append mode in Spark with ORC as the storage sort the ORC file?
From my understanding the ORC filter is extremely fast because both file and stripe have
column-level aggregates: count, min, max, and sum.
However, it would seem that this metadata is useful only if the ...
0 votes · 1 answer · 103 views
How to reduce file size of PySpark output to that of Hive?
I am writing ORC Snappy files to a Google Cloud Storage bucket using PySpark and Hive. Hive produces a single-file output that is significantly smaller than the output produced by PySpark. How can I make ...
0 votes · 0 answers · 428 views
In which scenario is disabling Hadoop vectorized execution better than enabling it?
Vectorization in Hive is a feature (available from Hive 0.13.0) that, when enabled, reads a block of 1024 rows at a time rather than one row at a time. This improves CPU usage for operations like ...
0 votes · 2 answers · 452 views
PySpark overwriting external Hive table ORC file changes Hive table schema
I have an issue where, if I run PySpark code to save data to an external ORC file for a Hive table, it also overwrites the Hive table schema. What should I do to keep the original Hive schema after each overwrite?
...
1 vote · 1 answer · 94 views
Is there any way to rewrite the below code using a Scala value class or another concept?
I need to write two functions to get the output format and the output index for file conversion. As part of this, I wrote a TransformSettings class for these methods and set the default values. And in ...
1 vote · 1 answer · 2k views
Write Trino query data directly to S3
Currently we:
run a Trino query and fetch the data,
write this to the local filesystem,
upload this file to an S3 bucket.
For smaller data this is no issue, but currently, with large data volumes, this is posing ...
1 vote · 1 answer · 518 views
Write ORC using Pandas with all values of sequence None
I want to write a simple dataframe as an ORC file. The only sequence is of an integer type. If I set all values to None, an exception is raised on to_orc.
I understand that pyarrow cannot infer ...
0 votes · 1 answer · 240 views
Tesseract OCR with Python
I am looking for a way to get the text from the image below. I tried to use Tesseract, but the outputs weren't good at all (see code block below). Do I have to edit the picture to get a better output? ...
1 vote · 1 answer · 531 views
Unable to select count of rows of an ORC table through Hive Beeline command
I am using the following components: Hadoop 3.1.4, Hive 3.1.3 and Tez 0.9.2.
There is an ORC table from which I am trying to extract the count of rows. select count(*) from ORC_TABLE ...
2 votes · 1 answer · 3k views
ORC Split Generation issue with Hive Table
I'm using Hive version 3.1.3 on Hadoop 3.3.4 with Tez 0.9.2. When I create an ORC table that contains splits and try to query it, I get an ORC split generation failed exception. If I concatenate the ...
0 votes · 0 answers · 266 views
pyarrow ORC extension install
I am trying to use pyarrow with ORC, but I can't find how to build it with the ORC extension. Does anyone know how?
I am on Windows 10.
File ~\Miniconda3\lib\site-packages\owlna-0.0.1-py3.9.egg\owlna\table....
0 votes · 1 answer · 702 views
Super-slow Athena join query on a small amount of data
I have two databases, each containing a table stored in a single S3 file like part-00000-77654909-37c7-4c9e-8840-b2838792f98d-c000.snappy.orc, ~83 MB in size.
I'm trying to execute a primitive ...
1 vote · 0 answers · 72 views
Apache ORC file format: optional fields
Reading the message specifications in the ORC file format (https://orc.apache.org/specification/ORCv1/) I see every field is marked optional. Yet some of the fields are always required (such as the ...
1 vote · 1 answer · 471 views
Why can ORC support ACID in Hive?
I have read the documentation that says
"Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can ...
2 votes · 1 answer · 278 views
Kotlin: write string to ORC file with Apache ORC Java
I'm using Apache ORC 1.8. Following the short example in the documentation here: https://orc.apache.org/docs/core-java.html I'm failing to write a string out to the ORC file.
import org....
1 vote · 0 answers · 448 views
Read ORC file in Node.js project
We're in the middle of a migration, and the new code base is developed in Node.js, the old one being in Python. To read ORC files from Azure Data Lake we used pyorc in the Python project; I was looking for ...
-1 votes · 1 answer · 699 views
Read ORC file not stored on HDFS using PySpark
I connected to a data lake remotely, processed the data stored in Hadoop clusters using the Hive Beeline terminal, and stored the data on HDFS in ORC format.
Then I transferred this ORC file to ...
0 votes · 2 answers · 210 views
Hive is failing when joining external and internal tables
Our environment/versions
hadoop 3.2.3
hive 3.1.3
spark 2.3.0
Our internal table in Hive is defined as:
CREATE TABLE dw.CLIENT
(
client_id integer,
client_abbrev string,
client_name string,
...
1 vote · 2 answers · 370 views
Convert pandas df to ORC bytes
The following is generated by this line of code:
table_bytes = df.to_parquet()
table_bytes: b'PAR1\x15\x04\x15@\x15DL\x15\x08\x15\x04\x12\x00\x00 |\x03\x00\x00\x00Tom\x04\x00\x00\x00nick\x05\x00\x00\...
0 votes · 0 answers · 1k views
How do I compute the number of unique values in a pyarrow array?
I have a pyarrow int32 ChunkedArray containing 18 chunks that I got from an ORC file:
import pyarrow.dataset
import pyarrow.compute
t = pyarrow.dataset.dataset("my/orc/file", format="...
0 votes · 1 answer · 1k views
Error for column count mismatch for JSON, AVRO, ORC and PARQUET file formats
We are using a COPY command for loading data into Snowflake. With the CSV file format, there is a parameter 'ERROR_ON_COLUMN_COUNT_MISMATCH' to get an error if the columns present in the input CSV file do not ...
0 votes · 0 answers · 1k views
Reading ORC file in Spark with schema returns null values
I am trying to read ORC files from a Spark job. I have defined the below schema based on the output of printSchema:
df.printSchema():
root
|-- application: struct (nullable = true)
| |-- appserver: ...
0 votes · 1 answer · 169 views
Hive can't access tables after Spark recreates my ORC-stored tables
When I recreate a table in Spark using the command displayed by show create table mydb.mytable, I stop being able to use the table from Hive. This happens only for a few tables; the other tables I ...
0 votes · 3 answers · 296 views
How to move columns in pandas
I have a pyarrow table with a header like this: ['column1','column2','column3','column4','column5']
I want to swap and move column headers and data:
['column1','column2','column5','column3','column4']
...
1 vote · 2 answers · 7k views
Join two pyarrow tables
I have an ORC file with data as follows.
Table A:
Name age school address phone
tony 12 havard UUU 666
tommy 13 abc Null Null
john 14 cde ...
0 votes · 1 answer · 464 views
Based on file read and write speeds, which among ORC, Parquet & AVRO is best suited for each scenario? [closed]
I have been working with the Spark and Hadoop ecosystem for some years but never bothered to question my architects about why a certain file format is chosen before they provide any explanation to the ...
0 votes · 2 answers · 2k views
Error converting pandas dataframe to ORC using pyarrow
I'm trying to save a pandas DataFrame as a .orc file using pyarrow. The package versions are pandas==1.3.5 and pyarrow==6.0.1. My Python 3 version is 3.9.12.
Here is the code snippet:
import pandas ...
0 votes · 1 answer · 276 views
Error "declared column type INT for column id incompatible with ORC file column type string" when copying ORC to Redshift
I get this error when copying ORC to Redshift using the command:
from 's3://'
iam_role 'role'
format as orc;
3 votes · 0 answers · 444 views
Is there an option to directly delete rows in an ORC file in PySpark or Databricks?
Is there any option to directly delete rows from ORC files, given their structure?
I am using Azure Databricks.
With the query below I am reading the content of the ORC file, and I wanted to delete ...
0 votes · 1 answer · 402 views
What's the easiest way to get a table DDL from an ORC file?
With Spark I can do, for example:
spark.read.orc("/path/to/file").printSchema
But I would like to get something like the output of show create table in Hive. Is it possible?
0 votes · 1 answer · 910 views
Spark writing performance: CSV vs Snappy ORC
If I need to write a dataframe to disk, which format will perform better: CSV or ORC with Snappy?
On one hand, the CSV format avoids compression overhead, but on the other hand, Snappy will reduce the total ...
1 vote · 2 answers · 361 views
C++ Apache ORC is not filtering data correctly
I am posting a simple C++ Apache ORC file-reading program which:
reads data from an ORC file,
filters data based on a given string.
Sample Code:
#include <iostream>
#include <list>
#...
1 vote · 1 answer · 238 views
How to get the length of the ORC file being written?
I'm writing data to an ORC file and want to get the length of this file (including data that has been flushed and data still in the buffer cache). What should I do? I don't want to close ...
0 votes · 1 answer · 354 views
How to create a 0-byte ORC file
Can I create a 0-byte ORC file?
I'd like to test
if Hive can load a 0-byte file into an external table without an exception,
if Python can read a 0-byte ORC file without an exception.
for filename in glob.glob(...