Advice · 5 votes · 1 reply · 115 views

Hi, I have been interested lately in learning Iceberg. There is something I was not able to get, so I thought I would ask here. I really want to know why Apache Parquet is the native file format used when ...
— katz daniel
0 votes · 1 answer · 58 views

When I try to write the same value for each row of a string column in an ORC file, only the first row returns the written value; while reading the remaining rows, I face a null pointer issue. In some cases, ...
— user1885418
0 votes · 1 answer · 165 views

I face the attached problem when reading an ORC file: is it possible to change this buffer size of 65536 to the needed one of 1817279? Which configuration values do I have to adapt in order to set ...
— Ruben Hartenstein
1 vote · 1 answer · 149 views

I want to create an ORC file compressed with ZLIB at compression level 9. The thing is, when using pyarrow.orc, I can only choose between "Speed" and "Compression" mode and can't control ...
— Y.S (1,862)
0 votes · 0 answers · 170 views

I am new to Apache Beam, and I have a use case where I need to write Java streaming code to read from a Kafka topic (from which I extract some CustomObject.class) and output the entries to HDFS in ...
— vamsi (325)
1 vote · 1 answer · 2k views

I am using: Win 10 Pro, Intel(R) Xeon(R) W-1250 CPU @ 3.30GHz, 16 GB RAM, Anaconda Navigator 2.5.0, Python 3.10.13 in a venv, pyarrow 11.0.0, pandas 2.1.1, running scripts in Spyder IDE 5.4.3. I want to open/...
— Esat Becco
0 votes · 1 answer · 227 views

We are using Flink version 1.13.5 and trying to read ORC files from an AWS S3 location, and we are deploying our application in a self-managed Flink cluster. Please find the code below for ...
— nirmal (107)
0 votes · 1 answer · 201 views

I have about 200 pandas DataFrames, and every DataFrame has some unique columns, or maybe completely different columns. Example: df1 = pd.DataFrame({ 'Product': ['Apple', 'Banana', 'Orange', 'Mango'],...
— Abdulrahman Sheikho
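For the many-DataFrames question above: pd.concat aligns frames on the union of their columns by default, filling cells absent from a source frame with NaN. A small sketch with two hypothetical frames (the data is illustrative, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"Product": ["Apple", "Banana"], "Price": [1.0, 0.5]})
df2 = pd.DataFrame({"Product": ["Orange"], "Origin": ["Spain"]})

# Outer concatenation (the default join): the result has the union of
# all columns; values missing from a source frame become NaN.
combined = pd.concat([df1, df2], ignore_index=True, sort=False)
print(sorted(combined.columns))  # ['Origin', 'Price', 'Product']
```

The same call scales to a list of 200 frames; columns shared by several frames line up, and each frame's unique columns survive as NaN elsewhere.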
0 votes · 0 answers · 893 views

I have images where the text is struck out and replaced by the next words. Sometimes it's just one line that gets struck out; other times, multiple lines are. The expected output should be like this: remove ...
— Do Chi Bao
0 votes · 0 answers · 78 views

I am testing the impact of different data formats on Hive query efficiency (Win10, only my desktop). The original data is 400 txt files of almost the same size (169 MB in total). I first converted to ...
— fei yang
0 votes · 0 answers · 102 views

I have an application deployed that is supposed to parse/download an ORC file from an S3 bucket. I have tried multiple things, one of them being downloading the file locally in the app and trying to ...
— FluffyGus
0 votes · 0 answers · 438 views

My ORC-with-Snappy-compression dataset was 3.3 GB when it was originally constructed via a series of small writes to 128 KB files. It totals 400 million rows, has one timestamp column, and the rest ...
— user19695124
1 vote · 0 answers · 126 views

I am trying to write my dataframe df_trans (which has about 10 million records) to file and want to compare the performance of writing it to Parquet vs. ORC vs. CSV. df_trans.write.mode('overwrite').parquet(...
— OhMoh24 (71)
0 votes · 0 answers · 207 views

To read an ORC file from a GCS bucket I'm using the code snippet below, where I'm creating a Hadoop configuration and setting the required file system attributes to use a GCS bucket: val hadoopConf = new ...
— Nitish N Banakar
2 votes · 1 answer · 424 views

I have a fileA in ORC with the following format: key, id_1, id_2, value, value_1, ..., value_30. If I use the following config: 'spark.sql.orc.filterPushdown': true And ...
— olaf (347)
1 vote · 1 answer · 371 views

I have created a table in Hive 3.1.3 as below: CREATE EXTERNAL TABLE test_tez_orc_zstd (id BIGINT) STORED AS ORC LOCATION '...' TBLPROPERTIES ('orc.compress'='ZSTD'); It is created, and then I wanted to ...
— CompEng (7,426)
1 vote · 1 answer · 372 views

From my understanding, the ORC filter is extremely fast because both the file and each stripe carry column-level aggregates: count, min, max, and sum. However, it would seem that this metadata is useful only if the ...
— olaf (347)
0 votes · 1 answer · 103 views

I am writing ORC Snappy files to a Google Cloud Storage bucket using PySpark and Hive. Hive produces a single-file output that is significantly smaller than the output produced by PySpark. How can I make ...
— Matt Weisman
0 votes · 0 answers · 428 views

Vectorization in Hive is a feature (available from Hive 0.13.0) that, when enabled, reads a block of 1024 rows rather than one row at a time. This improves CPU usage for operations like ...
— The Anh Nguyen
0 votes · 2 answers · 452 views

I have an issue: if I run PySpark code to save data to an external ORC file for a Hive table, it also overwrites the Hive table schema. What should I do to keep the original Hive schema after each overwrite? ...
— Robertas Kirka
1 vote · 1 answer · 94 views

I need to write two functions to get the output format and the output index for file conversion. As part of this, I wrote a TransformSettings class for these methods and set the default value. And in ...
— vsathyak
1 vote · 1 answer · 2k views

Currently we have a Trino query run and fetch data, write this to the local filesystem, and upload this file to an S3 bucket. For smaller data this is no issue, but currently with large data volumes this is posing ...
— Azima (4,171)
1 vote · 1 answer · 518 views

I want to write a simple dataframe as an ORC file. The only column is of an integer type. If I set all values to None, an exception is raised by to_orc. I understand that pyarrow cannot infer ...
— Blaf (2,378)
0 votes · 1 answer · 240 views

I am looking for a way to get the text from the image below. I tried to use Tesseract, but the output wasn't good at all (see code block below). Do I have to edit the picture to get a better output? ...
— Jannes (1)
1 vote · 1 answer · 531 views

I am using the following components: Hadoop 3.1.4, Hive 3.1.3, and Tez 0.9.2. There is an ORC table from which I am trying to extract the count of rows in the table: select count(*) from ORC_TABLE ...
— Afroz Baig
2 votes · 1 answer · 3k views

I'm using Hive version 3.1.3 on Hadoop 3.3.4 with Tez 0.9.2. When I create an ORC table that contains splits and try to query it, I get an "ORC split generation failed" exception. If I concatenate the ...
— Patrick Tucci
0 votes · 0 answers · 266 views

I am trying to use pyarrow with ORC, but I can't find how to build it with the ORC extension; does anyone know how? I am on Windows 10. File ~\Miniconda3\lib\site-packages\owlna-0.0.1-py3.9.egg\owlna\table....
— Devyl (695)
0 votes · 1 answer · 702 views

I have two databases, each containing a table stored in a single S3 file like part-00000-77654909-37c7-4c9e-8840-b2838792f98d-c000.snappy.orc of size ~83 MB. I'm trying to execute a primitive ...
— astef (9,738)
1 vote · 0 answers · 72 views

Reading the message specifications in the ORC file format (https://orc.apache.org/specification/ORCv1/), I see every field is marked optional, yet some of the fields are always required (such as the ...
— rde1 (11)
1 vote · 1 answer · 471 views

I have read the documentation that says, "Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can ...
— Kakarot (11)
2 votes · 1 answer · 278 views

I'm using Apache ORC 1.8. Following the short example in the documentation here: https://orc.apache.org/docs/core-java.html, I'm failing to write a string out to the ORC file. import org....
— jake wong (5,248)
1 vote · 0 answers · 448 views

We're in the middle of a migration; the new code base is developed in Node.js, the old one in Python. To read ORC files from Azure Data Lake we used pyorc for the Python project; I was looking for ...
— Kashyap (477)
-1 votes · 1 answer · 699 views

I connected to a data lake remotely, processed the data stored in Hadoop clusters using the Hive Beeline terminal, and stored the data on HDFS in ORC format. Then I transferred this ORC file to ...
— canon-ball
0 votes · 2 answers · 210 views

Our environment/versions: Hadoop 3.2.3, Hive 3.1.3, Spark 2.3.0. Our internal table in Hive is defined as CREATE TABLE dw.CLIENT ( client_id integer, client_abbrev string, client_name string, ...
— gocham (1)
1 vote · 2 answers · 370 views

The following is generated by this line of code: table_bytes = df.to_parquet() table_bytes: b'PAR1\x15\x04\x15@\x15DL\x15\x08\x15\x04\x12\x00\x00 |\x03\x00\x00\x00Tom\x04\x00\x00\x00nick\x05\x00\x00\...
— harsh solanki
0 votes · 0 answers · 1k views

I have a pyarrow int32 ChunkedArray containing 18 chunks that I got from an ORC file: import pyarrow.dataset import pyarrow.compute t = pyarrow.dataset.dataset("my/orc/file", format="...
— shoojoe (61)
0 votes · 1 answer · 1k views

We are using a COPY command for loading data into Snowflake. With the CSV file format, there is a parameter ERROR_ON_COLUMN_COUNT_MISMATCH to get an error if the columns present in the input CSV file do not ...
— Sarang Mane
0 votes · 0 answers · 1k views

I am trying to read ORC files from a Spark job. I have defined the below schema based on the output of df.printSchema():
root
 |-- application: struct (nullable = true)
 |    |-- appserver: ...
— Sam (1)
0 votes · 1 answer · 169 views

When I recreate a table in Spark using the command displayed by show create table mydb.mytable, I stop being able to use the table from Hive. This only happens for a few tables; the other tables I ...
— neves (40.6k)
0 votes · 3 answers · 296 views

I have a pyarrow table with a header like this: ['column1','column2','column3','column4','column5']. I want to swap and move the column headers and data: ['column1','column2','column5','column3','column4'] ...
1 vote · 2 answers · 7k views

I have an ORC file with data as follows. Table A:
Name   age  school  address  phone
tony   12   havard  UUU      666
tommy  13   abc     Null     Null
john   14   cde     ...
0 votes · 1 answer · 464 views

I have been working with the Spark and Hadoop ecosystem for some years but never bothered to question my architects about why a certain file format is chosen before they provide any explanation to the ...
— Metadata (2,153)
0 votes · 2 answers · 2k views

I'm trying to save a pandas DataFrame as an .orc file using pyarrow. The package versions are pandas==1.3.5 and pyarrow==6.0.1. My Python 3 version is 3.9.12. Here is the code snippet: import pandas ...
— Heyb (21)
0 votes · 1 answer · 276 views

I get the error "declared column type INT for column id incompatible with ORC file column type string query" when copying ORC to Redshift using the command: from 's3://' iam_role 'role' format as orc;
— lucaspompeun
3 votes · 0 answers · 444 views

Is there any option to directly delete rows from ORC files, given their structure? I am using Azure Databricks. With the query below I am reading the content of the ORC file, and want to delete ...
— Tim (1,521)
0 votes · 1 answer · 402 views

With Spark I can do, for example: spark.read.orc("/path/to/file").printSchema But I would like to get something like the output of show create table in Hive. Is that possible?
— Pavel Orekhov
0 votes · 1 answer · 910 views

If I need to write a dataframe to disk, which format will perform better: CSV or ORC with Snappy? On one hand, the CSV format avoids the compression overhead, but on the other hand Snappy will reduce the total ...
— Rex (159)
1 vote · 2 answers · 361 views

I am posting a simple C++ Apache ORC file-reading program which: reads data from an ORC file, then filters the data based on a given string. Sample code: #include <iostream> #include <list> #...
— Karan Kumar
1 vote · 1 answer · 238 views

I'm writing data to an ORC file. I want to get the length of this file (including data that has been flushed and data that's still in the buffer cache). What should I do, please? I don't want to close ...
— liliwei (344)
0 votes · 1 answer · 354 views

Can I create a 0-byte ORC file? I'd like to test whether Hive can load a 0-byte file into an external table without an exception, and whether Python can read a 0-byte ORC file without an exception. for filename in glob.glob(...
— jeewonb (75)
