1 vote
3 answers
56 views

We are running a pipeline on the managed Apache Flink runner on AWS. We are using Flink 1.19 and Beam 2.61.0. First I start the application with ...
ranidu harshana
Advice
0 votes
1 reply
56 views

I'm new to Apache Beam running on GCP, but my question is more theoretical than practical. I have a source Spanner table and a destination Spanner table, and I'm fetching data from the source table to ...
otto
  • 181
1 vote
1 answer
61 views

I noticed that the class PulsarMessage has private getters in version 2.69.0. Shouldn't they be public in order to access the topic names and/or the payload of the message? Artifact link: https://...
Vaibhav Chandra
1 vote
1 answer
42 views

I'm working with Apache Beam (2.62) and ran into confusing behavior with DoFn.process() when using yield from and TaggedOutput. When the do_something_second function yields multiple TaggedOutput values, ...
Dogil
  • 117
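The `yield from` behavior this question touches on can be reproduced with plain Python generators, since `DoFn.process()` is itself a generator: `yield from` forwards every item the delegate generator produces, one at a time, which is how multiple `TaggedOutput` values all reach the runner. A minimal sketch (the class is a stand-in for `apache_beam.pvalue.TaggedOutput`, and the function names only mirror the question's snippet):

```python
class TaggedOutput:
    """Stand-in for apache_beam.pvalue.TaggedOutput."""
    def __init__(self, tag, value):
        self.tag = tag
        self.value = value

def do_something_second(element):
    # Emits several tagged values per element, like a multi-output DoFn.
    yield TaggedOutput('evens', element * 2)
    yield TaggedOutput('evens', element * 4)

def process(element):
    # Equivalent to: for out in do_something_second(element): yield out
    yield from do_something_second(element)

outputs = [(o.tag, o.value) for o in process(3)]
print(outputs)  # → [('evens', 6), ('evens', 12)]
```

Each forwarded value keeps its own tag, so a downstream `.with_outputs(...)` sees every tagged element individually rather than one nested batch.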
-3 votes
1 answer
81 views

I'm trying to write a Python GCP Dataflow pipeline that processes records from a Spanner change stream and prints them out. I am running it locally and it appears to work, but it prints no records when I update a ...
Joe P
  • 525
0 votes
1 answer
119 views

In the latest Apache Beam 2.68.0, the behavior of Coders for non-primitive objects has changed (see the changelog here). As a result, I get a warning like this on GCP Dataflow: "Using ...
Praneeth Peiris
0 votes
1 answer
62 views

I'm currently upgrading the Apache Beam version for my Dataflow application from 2.51.0 to 2.67.0. As part of this process, I'm encountering a compatibility issue with the google-api-services-storage ...
Optimizer
  • 271
0 votes
1 answer
107 views

I am trying to build a Python-based Apache Beam pipeline which is going to read from Kafka. Kafka requires Truststore and Keystore JKS file based authentication. kafka_consumer_config = { "...
Bhargav Velisetti
0 votes
0 answers
56 views

I'm currently working on migrating from ZetaSQL to Calcite within an Apache Beam pipeline. I need to use specific transformations that are only available when the BigQuery dialect is enabled. I ...
Florian Ferreira
0 votes
0 answers
66 views

I've been trying for some time to get a Beam pipeline to do data transformations for a fairly simple machine learning transformation, but Apache Beam and TensorFlow Transform won't play nicely ...
George Chapman-Brown
0 votes
0 answers
62 views

I would like to use an ErrorHandler to catch all the errors that happen during my pipeline. I have seen that there is an interface which allows this: https://beam.apache.org/releases/javadoc/...
Dev Yns
  • 229
0 votes
0 answers
57 views

I have four regions (a, b, c, d) and I want to create a single data set concatenating all four and store it in c. How can this be done? I tried dbt-Python but had to hard-code a lot; looking for a ...
N_epiphany
0 votes
1 answer
56 views

I'm experiencing inconsistent behavior between Apache Beam's DirectRunner (local) and DataflowRunner (GCP) when using AvroCoder with an immutable class. Problem: I have an immutable class defined using ...
Nihal sharma
0 votes
1 answer
80 views

I'm trying to write to BigQuery using Apache Beam in Python, using the newest CDC features. However, I can't get the correct format of the objects in the ...
José Fonseca
1 vote
1 answer
60 views

I'm using Beam SqlTransform in python, trying to define/pass nullable fields. This code works just fine: with beam.Pipeline(options=options) as p: # ... # Use beam.Row to create a schema-aware ...
Yair Maron
  • 1,978
0 votes
1 answer
60 views

I read data from Bigtable using pipeline.apply. Using the data from step 1 as a side input, I again read from another Bigtable using pipeline.apply. Finally, after some other steps in the pipeline, I run this ...
Learner
  • 33
0 votes
1 answer
86 views

I'm trying to use RequestResponseIO on Dataflow to make parallel requests to an endpoint. As a test, I've created a Cloud Run helloworld endpoint which just receives these requests; it can handle up ...
O Bishop
1 vote
0 answers
39 views

I'm currently struggling to run a Beam pipeline on Windows using ReadFromKafka. I'm trying to use ReadFromKafka to consume data from a Kafka topic; my PC runs Windows and I already had some other errors ...
samuelfs
0 votes
1 answer
57 views

As per this link, it is not easy to convert a PCollection to a List (additional link). In Apache Spark, by contrast, collectAsList() converts all elements of a Dataset to a List. Why does Spark allow it with ...
Learner
  • 33
0 votes
1 answer
70 views

The below code snippet using BigtableIO returns com.google.bigtable.v2.Row, whereas com.google.cloud.bigtable.data.v2.models.Row, which is used in the other Bigtable Java clients, is more user-friendly ...
Learner
  • 33
1 vote
2 answers
114 views

Currently I have a Beam project using version 2.29.0. We are looking to upgrade to a version where Kafka commits are done at the end of the batch. I tried using version 2.64.0 but the Spark Runner is ...
Fabio
  • 625
1 vote
2 answers
93 views

I have a simple function that reads from Mongo using Apache Beam def create_mongo_pipeline(p: beam.Pipeline, mongo_uri: str, db: str, coll: str, cert_file: str, gcs_bucket: str) -> None: # Read ...
RDGuida
  • 576
-1 votes
1 answer
93 views

I have a DAG file to load data from MongoDB to BigQuery. I have tested my Apache Beam pipeline individually, but when I am trying to trigger it dynamically to do parallel processing of all collections ...
Ajay Kumar
1 vote
1 answer
76 views

I am learning Dataflow and created my first Dataflow template to read a JSON file. I am using the ReadFromJson function in the apache_beam.io package, but it is giving me the error "line 57, in run()...
DevX
  • 520
0 votes
0 answers
63 views

Below is my PostgreSQL table: CREATE TABLE customer_data ( id SERIAL PRIMARY KEY, name VARCHAR(100) NOT NULL, email VARCHAR(100), age INT, registration_date DATE, last_login ...
Satya
  • 87
0 votes
1 answer
259 views

I'm struggling to make my Flex Template work with Cloud Scheduler. I was able to create it and I can run it from my local machine, through Dataflow's "create job from template", or using a ...
Rui Bras Fernandes
0 votes
0 answers
75 views

My beam python on flinkrunner(PortableRunner) has the following code. from apache_beam.io.external.gcp.pubsub import ReadFromPubSub from dependency_injector import providers ... pubsub_reader = ...
Dogil
  • 117
0 votes
1 answer
203 views

I'm trying to run a dataflow job using flex template in docker. Here what I have: FROM python:3.11-slim COPY --from=apache/beam_python3.11_sdk:2.54.0 /opt/apache/beam /opt/apache/beam COPY --from=...
Rafael Paz
2 votes
1 answer
112 views

Apache Beam recently introduced Managed I/O APIs for Java and Python. What is the difference between Managed I/O and the regular Apache Beam connectors (sources and sinks)?
chamikara
  • 2,084
0 votes
2 answers
50 views

First question posted on stackoverflow so don't hesitate to give suggestions as to what tags to use for my post etc. I am trying to write an apache beam pipeline which imports timeseries data from a ...
jrood
  • 1
0 votes
1 answer
55 views

We use the JDBC cross-language transform to read data from MSSQL to BigQuery, and we noticed negative integers are being converted incorrectly. For example: if we have an INT column in the source with value (-1),...
Matar
  • 73
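The symptom described above is characteristic of a two's-complement value being reinterpreted as unsigned somewhere in the conversion chain; the garbled value tells you the bit width involved. A pure-Python illustration of the reinterpretation (not the connector's actual code):

```python
def as_unsigned(value: int, bits: int) -> int:
    """Reinterpret a signed integer's two's-complement bits as unsigned."""
    return value & ((1 << bits) - 1)

print(as_unsigned(-1, 16))  # → 65535
print(as_unsigned(-1, 32))  # → 4294967295
```

So if -1 arrives as 4294967295, a 32-bit INT was read as unsigned; 65535 would point at a 16-bit reinterpretation instead.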
0 votes
1 answer
54 views

I am trying to understand how Google Cloud Dataflow incurs costs when reading a file with beam.io.ReadFromText. From my understanding, every time something is read from a Google Cloud bucket, it incurs ...
Giancarlo Metitieri
1 vote
2 answers
127 views

We want to enable vertical autoscaling on our dataflow prime pipeline for a python container: https://cloud.google.com/dataflow/docs/vertical-autoscaling We're trying to run our pipeline through this ...
unitrium
-1 votes
2 answers
104 views

I'm using GCP batch Dataflow to process data that I'm picking from a table. The input here is table data, where I'm using a query in Java to get the data. After processing, when I'm trying to insert the ...
Insecupa
0 votes
1 answer
70 views

I tried to read BigQuery via Scio with: val tbleRows = sc.withName("Query BQ Table").bigQuerySelect(query) or val tbleRows = sc.withName("Query BQ Table").bigQueryStorage(query) In ...
Dirk Gasser
0 votes
1 answer
104 views

Apache Beam with Cloud Dataflow executors takes 5 minutes or more to cold-start the data pipeline. Is there any way to minimize the startup time? I tried optimizing the Dockerfile but it is still slow. ...
Farrukh Naveed Anjum
0 votes
1 answer
37 views

I am working with Apache Beam and encountering an issue when trying to access data from a PCollection that appears to be wrapped in _StateBackedIterable. I have a side input in the form of an ...
Sanjay
  • 76
0 votes
2 answers
216 views

I have a Beam/Dataflow pipeline that reads from Pub/Sub and writes to BigQuery with WriteToBigQuery. I convert all timestamps to apache_beam.utils.timestamp.Timestamp. I am sure all timestamps are ...
Jonathan
  • 802
1 vote
0 answers
35 views

My use case is to write all the parquet filenames to a separate metadata file after writing it to GCS at the end of each window. I have tried a set of different approaches, but with each approach I ...
Adheeban
0 votes
1 answer
56 views

Are bigquery.LoadJobConfig() and the file-loads method of Apache Beam's WriteToBigQuery the same? write_to_bq = ( csv_data | "Write to BigQuery" >> WriteToBigQuery( ...
ShubhGurukul
-1 votes
1 answer
59 views

I have a Beam application and it is running with the Spark runner. It encountered a data-loss issue when the application saves data to S3 storage. I looked into this page https://hadoop.apache.org/...
Jie Jason Li
0 votes
0 answers
109 views

Our team has a set of data pipelines built as DAGs triggered on Composer (Airflow) that run Beam (Dataflow) jobs. Across these dataflow pipelines, there are a set of common utilities engineers need to ...
Espresso Engineer
0 votes
1 answer
69 views

I have two unbounded sources (pubsub): main source: emits values frequently secondary source: sends an event which tells us to read a big query table, since there was a change in the table. I want ...
sanyi14ka
  • 829
1 vote
1 answer
42 views

I have a requirement to securely get database credentials, which I'm able to accomplish using a ParDo. However I'd like to use a ReadFromJdbc IO Connector, and I'm facing a challenge passing in the ...
Gina Carson
0 votes
1 answer
62 views

Our team is looking to use the pants build system which conveniently packages python code into a PEX with only the required dependent packages. I couldn't find any documentation however about how a ...
Matthew Albrecht
0 votes
1 answer
173 views

I am working on a simple Apache Beam pipeline using Python to process a text file and output a CSV. Below is my code: import apache_beam as beam p1 = beam.Pipeline() ...
Talha Shaikh
0 votes
1 answer
37 views

This is my code. I have a fixed amount of data in my Kafka topic, approx. 15k records. I want the data to be read in batches and the output written to a file. PCollection<KafkaRecord<byte[], byte[]>...
IndiePump
2 votes
0 answers
110 views

I have a test dataset to generate elements with timestamps in the next 20 minutes. I want to create 1 minute fixed windows, and then create a 15 minute sliding window every minute from this. n = 20 ...
Mark Chin
1 vote
1 answer
77 views

I am attempting to understand the lifecycle of a DoFn in more detail. I've added this counter to the init of my DoFn: > self.counter = Metrics.counter(self.__class__, 'counts') And correspondingly ...
Mark Chin
0 votes
1 answer
296 views

I'm working on an Apache Beam pipeline that processes data and writes it to BigQuery. The pipeline works perfectly when using the DirectRunner, but when I switch to the DataflowRunner, it completes ...
Spine Feast
