465 questions
Advice · 5 votes · 1 reply · 115 views
Parquet vs ORC in Iceberg
Hi, I have lately been interested in learning Iceberg. There is something I was not able to figure out, so I thought I would ask here.
I really want to know why Apache Parquet is the native file format used when ...
0 votes · 1 answer · 58 views
I'm writing repeated string values to a string column in an ORC file using Java, and while reading the ORC file back I encounter a NullPointerException
When I try to write the same value for every row of a string column in an ORC file, only the first row returns the written value; reading the remaining rows raises a NullPointerException. In some cases, ...
0 votes · 1 answer · 165 views
Apache ORC buffer size too small
I face the attached problem when reading an ORC file:
Is it possible to change this buffer size of 65536 to the needed 1817279?
Which configuration values do I have to adapt in order to set ...
1 vote · 1 answer · 149 views
How to use python to create ORC file compressed with ZLIB compression level 9?
I want to create an ORC file compressed with ZLIB compression level 9.
Thing is, when using pyarrow.orc, I can only choose between "Speed" and "Compression" mode,
and can't control ...
0 votes · 0 answers · 170 views
Apache Beam code to write output in ORC format
I am new to Apache Beam, and I have a use case where I need to write Java streaming code to read from a Kafka topic (from which I extract some CustomObject.class) and output the entries to HDFS in ...
1 vote · 1 answer · 2k views
I get a "Fatal Python error: Aborted" and no explanatory error message I can work with when I try to open a simple .orc file with pyarrow
I am using:
Win 10 Pro
Intel(R) Xeon(R) W-1250 CPU @ 3.30GHz / 16 GB RAM
Anaconda Navigator 2.5.0
Python 3.10.13 in venv
pyarrow 11.0.0
pandas 2.1.1
Running scripts in Spyder IDE 5.4.3
I want to open/...
0 votes · 1 answer · 227 views
Read ORC files from AWS S3 bucket in Flink app
We are using Flink version 1.13.5 and trying to read ORC files from an AWS S3 location. We are deploying our application in a self-managed Flink cluster. Please find the code below for ...
0 votes · 1 answer · 201 views
Binary format that allows storing multiple pandas dataframes with different columns, widths, rows
I have about 200 pandas dataframes, and every dataframe has some unique columns, or maybe completely different columns. Example:
df1 = pd.DataFrame({
'Product': ['Apple', 'Banana', 'Orange', 'Mango'],...
0 votes · 0 answers · 893 views
Detection and Cleaning of Struck-out Text in Handwriting
I have images where text is struck out and replaced by the words that follow. Sometimes it's just one line that gets struck out; other times, multiple lines are.
My expected output should be like this: remove ...
0 votes · 0 answers · 78 views
In Hadoop, why does the Parquet format occupy more space than the original txt in my test?
I am testing the impact of different data formats on Hive query efficiency (Windows 10, only my desktop). The original data is 400 txt files of almost the same size (169 MB in total). I first converted to ...
0 votes · 0 answers · 102 views
Issue downloading/parsing ORC File from S3, or from Local Path
I have an application deployed that is supposed to parse/download an ORC file from an S3 bucket.
I have tried multiple things, one of them being downloading the file locally in the app and trying to ...
0 votes · 0 answers · 438 views
How can I optimize ORC Snappy compression in Spark?
My ORC-with-Snappy dataset was 3.3 GB when it was originally constructed via a series of small writes to 128 KB files. It totals 400 million rows, has one timestamp column, and the rest ...
1 vote · 0 answers · 126 views
Pyspark error while writing large dataframe to file
I am trying to write my dataframe df_trans (which has about 10 million records) to a file, and I want to compare the performance of writing it as Parquet vs ORC vs CSV.
df_trans.write.mode('overwrite').parquet(...
0 votes · 0 answers · 207 views
To read ORC file from GCS bucket
To read an ORC file from a GCS bucket I'm using the code snippet below, where I create a Hadoop configuration and set the required file system attributes to use the GCS bucket:
val hadoopConf = new ...
2 votes · 1 answer · 424 views
Reading ORC does not trigger projection pushdown and predicate pushdown
I have a fileA in ORC with the following format:
key
id_1
id_2
value
value_1
....
value_30
If I use the following config:
'spark.sql.orc.filterPushdown' : true
And ...
1 vote · 1 answer · 371 views
"No enum constant org.apache.orc.CompressionKind.ZSTD" when inserting data into an ORC table compressed with ZSTD
I have created a table in Hive 3.1.3 as below:
CREATE EXTERNAL TABLE test_tez_orc_zstd
(
  id BIGINT
)
STORED AS ORC
LOCATION '...'
TBLPROPERTIES ('orc.compress'='ZSTD');
It is created, and then I wanted to ...
1 vote · 1 answer · 372 views
Does append mode in Spark with ORC as the storage sort the ORC file?
From my understanding the ORC filter is extremely fast because both file and stripe have
column-level aggregates: count, min, max, and sum.
However, it would seem that this metadata is useful only if the ...
0 votes · 1 answer · 103 views
How to reduce file size of PySpark output to that of Hive?
I am writing ORC Snappy files to a Google Cloud Storage bucket using PySpark and Hive. Hive produces a single-file output that is significantly smaller than the output produced by PySpark. How can I make ...
0 votes · 0 answers · 428 views
In which scenario is disabling Hadoop vectorized execution better than enabling it?
Vectorization in Hive is a feature (available from Hive 0.13.0) that, when enabled, reads a block of 1024 rows at a time rather than one row at a time. This improves CPU usage for operations like ...
0 votes · 2 answers · 452 views
PySpark overwriting external Hive table ORC file changes Hive table schema
I have an issue where, if I run PySpark code to save data to an external ORC file for a Hive table, it also overwrites the Hive table schema. What should I do to keep the original Hive schema after each overwrite?
...
1 vote · 1 answer · 94 views
Is there any way to rewrite the below code using a Scala value class or another concept?
I need to write two functions to get the output format and the output index for file conversion. As part of this, I wrote a TransformSettings class for these methods and set the default values. And in ...
1 vote · 1 answer · 2k views
Write Trino query data directly to S3
Currently we:
run a Trino query and fetch the data,
write this to the local filesystem,
upload this file to an S3 bucket.
For smaller data this is no issue, but currently, with large data volumes, this is posing ...
1 vote · 1 answer · 518 views
Write ORC using Pandas with all values of sequence None
I want to write a simple dataframe as an ORC file. The only sequence is of an integer type. If I set all values to None, an exception is raised on to_orc.
I understand that pyarrow cannot infer ...
0 votes · 1 answer · 240 views
Tesseract OCR with Python
I am looking for a way to get the text from the image below. I tried to use Tesseract, but the outputs weren't good at all (see code block below). Do I have to edit the picture to get a better output? ...
1 vote · 1 answer · 531 views
Unable to select count of rows of an ORC table through Hive Beeline command
I am using the following components: Hadoop 3.1.4, Hive 3.1.3 and Tez 0.9.2.
There is an ORC table from which I am trying to extract the count of rows. select count(*) from ORC_TABLE ...
2 votes · 1 answer · 3k views
ORC Split Generation issue with Hive Table
I'm using Hive version 3.1.3 on Hadoop 3.3.4 with Tez 0.9.2. When I create an ORC table that contains splits and try to query it, I get an ORC split generation failed exception. If I concatenate the ...
0 votes · 0 answers · 266 views
pyarrow ORC extension install
I am trying to use pyarrow with ORC, but I can't find how to build it with the ORC extension. Does anyone know how?
I am on Windows 10.
File ~\Miniconda3\lib\site-packages\owlna-0.0.1-py3.9.egg\owlna\table....
0 votes · 1 answer · 702 views
Super-slow Athena join query on a small amount of data
I have two databases, each containing a table stored in a single S3 file like part-00000-77654909-37c7-4c9e-8840-b2838792f98d-c000.snappy.orc, ~83 MB in size.
I'm trying to execute a primitive ...
1 vote · 0 answers · 72 views
Apache ORC file format: optional fields
Reading the message specifications in the ORC file format (https://orc.apache.org/specification/ORCv1/) I see every field is marked optional. Yet some of the fields are always required (such as the ...
1 vote · 1 answer · 471 views
Why can ORC support ACID in Hive?
I have read the documentation that says
"Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can ...
2 votes · 1 answer · 278 views
Kotlin: write string to ORC file with Apache ORC Java
I'm using Apache ORC 1.8. Following the short example in the documentation here: https://orc.apache.org/docs/core-java.html I'm failing to write a string out to the ORC file.
import org....
1 vote · 0 answers · 448 views
Read ORC file in Node.js project
We're in the middle of a migration, and the new code base is developed in Node.js, the old one being in Python. To read ORC files from Azure Data Lake we used pyorc in the Python project; I was looking for ...
-1 votes · 1 answer · 699 views
Read ORC file not stored on HDFS using PySpark
I connected to a data lake remotely, processed the data stored in Hadoop clusters using the Hive Beeline terminal, and stored the data on HDFS in ORC format.
Then I transferred this ORC file to ...
0 votes · 2 answers · 210 views
Hive is failing when joining external and internal tables
Our environment/versions
hadoop 3.2.3
hive 3.1.3
spark 2.3.0
Our internal table in Hive is defined as:
CREATE TABLE dw.CLIENT
(
client_id integer,
client_abbrev string,
client_name string,
...
1 vote · 2 answers · 370 views
Convert pandas df to ORC bytes
The following is generated by this line of code:
table_bytes = df.to_parquet()
table_bytes: b'PAR1\x15\x04\x15@\x15DL\x15\x08\x15\x04\x12\x00\x00 |\x03\x00\x00\x00Tom\x04\x00\x00\x00nick\x05\x00\x00\...
0 votes · 0 answers · 1k views
How do I compute the number of unique values in a pyarrow array?
I have a pyarrow int32 ChunkedArray containing 18 chunks that I got from an ORC file:
import pyarrow.dataset
import pyarrow.compute
t = pyarrow.dataset.dataset("my/orc/file", format="...
0 votes · 1 answer · 1k views
Error for column count mismatch for JSON, AVRO, ORC and PARQUET file formats
We are using a COPY command for loading data into Snowflake. With the CSV file format, there is a parameter 'ERROR_ON_COLUMN_COUNT_MISMATCH' to get an error if the columns present in the input CSV file do not ...
0 votes · 0 answers · 1k views
Reading ORC file in Spark with schema returns null values
I am trying to read ORC files from a Spark job. I have defined the below schema based on the output of printSchema:
df.printSchema():
root
|-- application: struct (nullable = true)
| |-- appserver: ...
0 votes · 1 answer · 169 views
Hive can't access tables after Spark recreates my ORC-stored tables
When I recreate a table in Spark using the command displayed by show create table mydb.mytable, I stop being able to use the table from Hive. This happens only for a few tables; the other tables I ...
0 votes · 3 answers · 296 views
How to move columns in pandas
I have a pyarrow table with a header like this: ['column1','column2','column3','column4','column5']
I want to swap and move column headers and data:
['column1','column2','column5','column3','column4']
...
1 vote · 2 answers · 7k views
Join two pyarrow tables
I have an ORC file with data as follows.
Table A:
Name age school address phone
tony 12 havard UUU 666
tommy 13 abc Null Null
john 14 cde ...
0 votes · 1 answer · 464 views
Based on file read and write speeds, which among ORC, Parquet & AVRO is best suited for each scenario? [closed]
I have been working with the Spark and Hadoop ecosystem for some years but never bothered to question my architects about why a certain file format is chosen before they provide any explanation to the ...
0 votes · 2 answers · 2k views
Error converting pandas dataframe to ORC using pyarrow
I'm trying to save a pandas DataFrame as a .orc file using pyarrow. The package versions are pandas==1.3.5 and pyarrow==6.0.1. My Python 3 version is 3.9.12.
Here is the code snippet:
import pandas ...
0 votes · 1 answer · 276 views
Error "declared column type INT for column id incompatible with ORC file column type string" when copying ORC to Redshift
I get this error when copying ORC to Redshift using the command:
from 's3://'
iam_role 'role'
format as orc;
3 votes · 0 answers · 444 views
Is there an option to directly delete rows in an ORC file in PySpark or Databricks?
Is there any option to directly delete rows from ORC files, given their structure?
I am using Azure Databricks.
With the query below I am reading the content of the ORC file, and I wanted to delete ...
0 votes · 1 answer · 402 views
What's the easiest way to get a table DDL from an ORC file?
With Spark I can do, for example:
spark.read.orc("/path/to/file").printSchema
But I would like to get something like the output of show create table in Hive. Is it possible?
0 votes · 1 answer · 910 views
Spark writing performance: CSV vs Snappy ORC
If I need to write a dataframe to disk, which format will perform better: CSV or ORC with Snappy?
On one hand, the CSV format avoids compression overhead, but on the other hand, Snappy will reduce the total ...
1 vote · 2 answers · 361 views
C++ Apache ORC is not filtering data correctly
I am posting a simple C++ Apache ORC file-reading program which:
reads data from an ORC file,
filters data based on a given string.
Sample Code:
#include <iostream>
#include <list>
#...
1 vote · 1 answer · 238 views
How to get the length of the ORC file being written?
I'm writing data to an ORC file and want to get the length of this file (including data that has been flushed and data still in the buffer cache). What should I do? I don't want to close ...
0 votes · 1 answer · 354 views
How to create a 0-byte ORC file
Can I create a 0-byte ORC file?
I'd like to test
if Hive can load a 0-byte file into an external table without an exception,
if Python can read a 0-byte ORC file without an exception.
for filename in glob.glob(...