4,858 questions
1 vote · 3 answers · 56 views
Flink and Beam pipeline having duplicate messages in Kafka consumer
We are running a pipeline on the managed Apache Flink runner on AWS. The Flink version we are using is 1.19 and the Beam version is 2.61.0. First I start the application with ...
Advice
0 votes · 1 reply · 56 views
Apache Beam update of the source table
I'm new to Apache Beam running on GCP, but my question is more theoretical than practical.
I have a source Spanner table and a destination Spanner table, and I'm fetching data from the source table to ...
1 vote · 1 answer · 61 views
How to access topic/payload in PulsarMessage since it has private getters in Apache Beam Pulsar IO connector 2.69.0
I noticed that the class PulsarMessage has private getters in version 2.69.0. Shouldn't they be public in order to access the topic names and/or payload of the message? Artifact link: https://...
1 vote · 1 answer · 42 views
Apache Beam: yield from works for TaggedOutput, but yield beam.TaggedOutput in except is ignored
I'm working with Apache Beam (2.62) and ran into a confusing behavior with DoFn.process() when using yield from and TaggedOutput.
When do_something_second function yields multiple TaggedOutput, ...
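The behavior above often comes down to plain Python generator semantics rather than anything Beam-specific. A minimal stdlib sketch (no Beam; `inner`/`outer` are hypothetical stand-ins for the delegated function and the DoFn's process method) showing that an except branch around `yield from` can still yield, provided the consumer keeps iterating:

```python
def inner():
    # Delegated generator: yields once, then fails on resume.
    yield "ok"
    raise ValueError("boom")

def outer():
    # Mirrors a DoFn.process that wraps `yield from` in try/except.
    try:
        yield from inner()
    except ValueError:
        # This yield IS reached when the consumer keeps iterating.
        yield "fallback"

print(list(outer()))  # → ['ok', 'fallback']
```

In Beam the same mechanics apply, but a runner that stops consuming a DoFn's output generator early never resumes it, so a yield placed after the exception point may silently never run.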
-3 votes · 1 answer · 81 views
Why are there no records read from spanner change stream?
I'm trying to write a Python GCP dataflow that processes records from a Spanner change stream and prints them out. I am running it locally and it appears to work but prints no records when I update a ...
0 votes · 1 answer · 119 views
Apache Beam 2.68.0 throws "Using fallback deterministic coder for type" warning
In the latest Apache Beam 2.68.0, they have changed the behavior of Coders for non-primitive objects. (see the changelog here).
Therefore, I get a warning like this on GCP Dataflow.
"Using ...
0 votes · 1 answer · 62 views
Issue with getSoftDeletePolicy() in google-api-services-storage after Apache Beam Upgrade
I'm currently upgrading the Apache Beam version for my Dataflow application from 2.51.0 to 2.67.0. As part of this process, I'm encountering a compatibility issue with the google-api-services-storage ...
0 votes · 1 answer · 107 views
Dataflow Python SDK failing to authenticate to Kafka using truststore and keystore JKS files with custom Docker image
I am trying to build a Python-based Apache Beam pipeline which is going to read from Kafka. Kafka requires truststore and keystore JKS file based authentication.
kafka_consumer_config = {
"...
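For reference, two-way SSL authentication with JKS files on a Kafka consumer is usually expressed with the standard `ssl.*` properties; a sketch (broker address, paths, and passwords are placeholders, and the JKS files must be resolvable inside the worker's custom container image):

```python
# Standard Kafka consumer properties for two-way SSL with JKS files.
# All values are placeholders; the JKS files must be baked into the
# custom SDK container image so workers can resolve these local paths.
kafka_consumer_config = {
    "bootstrap.servers": "broker-1:9093",
    "security.protocol": "SSL",
    "ssl.truststore.location": "/opt/certs/truststore.jks",
    "ssl.truststore.password": "changeit",
    "ssl.keystore.location": "/opt/certs/keystore.jks",
    "ssl.keystore.password": "changeit",
    "ssl.key.password": "changeit",
}
```

A frequent failure mode is that the paths resolve on the launcher but not on the workers, which is why the files belong in the worker image rather than the launcher environment.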
0 votes · 0 answers · 56 views
Calcite - how to use the BigQuery dialect with Apache Beam
I'm currently working on migrating from ZetaSQL to Calcite within an Apache Beam pipeline.
I need to use specific transformations that are only available when the BigQuery dialect is enabled. I ...
0 votes · 0 answers · 66 views
Solving version conflict using Apache Beam with ML transforms library
I've been trying for some time to get a Beam pipeline to do data transformations for a fairly simple machine learning transformation, but Apache Beam and TensorFlow Transform won't play nicely ...
0 votes · 0 answers · 62 views
How to use the ErrorHandler interface in Apache Beam?
I would like to use an ErrorHandler to catch all the errors that happen during my pipeline.
I have seen that there is an interface which allows doing so: https://beam.apache.org/releases/javadoc/...
0 votes · 0 answers · 57 views
Creating a global dataset combining multiple regions in BigQuery using Apache Beam
I have four regions (a, b, c, d) and I want to create a single dataset concatenating all four and store it in c. How can this be done? I tried with dbt-Python but had to hardcode a lot; looking for a ...
0 votes · 1 answer · 56 views
AvroCoder requires default constructor in DirectRunner locally but works on GCP Dataflow - Why?
I'm experiencing inconsistent behavior between Apache Beam's DirectRunner (local) and DataflowRunner (GCP) when using AvroCoder with an immutable class.
Problem
I have an immutable class defined using ...
0 votes · 1 answer · 80 views
PCollection Objects Format for Apache Beam to write on BigQuery using CDC in Python
I'm trying to write to BigQuery using Apache Beam, in Python.
However, I want to use the newest CDC features to write to BigQuery,
and I can't get the correct format of the objects in the ...
1 vote · 1 answer · 60 views
How to define nullable fields for SqlTransform
I'm using Beam SqlTransform in Python, trying to define/pass nullable fields.
This code works just fine:
with beam.Pipeline(options=options) as p:
# ...
# Use beam.Row to create a schema-aware ...
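In the Python SDK, a common way to mark a schema field nullable is `typing.Optional` on a NamedTuple row type, which Beam's schema inference maps to a nullable field. A minimal sketch (the `Purchase` type is hypothetical, and the `beam.coders.registry.register_coder(Purchase, beam.coders.RowCoder)` call from the Beam schema docs is omitted so the snippet stays stdlib-only):

```python
import typing

class Purchase(typing.NamedTuple):
    item: str                        # required (non-nullable) field
    amount: typing.Optional[float]   # Optional -> nullable in the inferred schema

# Inspect the annotations Beam's schema inference would see.
hints = typing.get_type_hints(Purchase)
print(hints["amount"])  # typing.Optional[float]
```

Rows built with plain `beam.Row(...)` infer types from the values passed, so a field that is sometimes `None` is better declared explicitly on a typed row like the one above.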
0 votes · 1 answer · 60 views
Run and skip steps based on a condition in an Apache Beam pipeline which reads data from multiple tables
I read data from BigTable using pipeline.apply
Using data from 1 as a side input, I again read from another BigTable using pipeline.apply
Finally, after some other steps in the pipeline, I run this ...
0 votes · 1 answer · 86 views
How to maximise throughput with RequestResponseIO on GCP Dataflow
I'm trying to use RequestResponseIO on Dataflow to make parallel requests to an endpoint. As a test, I've created a Cloud Run helloworld endpoint which just receives these requests; it can handle up ...
1 vote · 0 answers · 39 views
Having difficulty running ReadFromKafka on Windows
I'm currently struggling to run a Beam pipeline on Windows using ReadFromKafka.
I'm trying to use ReadFromKafka to consume data from a Kafka topic; my PC runs Windows and I already had some other errors ...
0 votes · 1 answer · 57 views
What is the equivalent of apache spark's collectAsList method in apache beam?
As per this link, it is not easy to convert a PCollection to List.
Additional link
Whereas in Apache Spark, collectAsList() helps convert all elements of a dataset to a list.
Why Spark allows it with ...
0 votes · 1 answer · 70 views
What Row class is returned when Apache Beam BigtableIO is used?
The below code snippet using BigtableIO returns com.google.bigtable.v2.Row, whereas com.google.cloud.bigtable.data.v2.models.Row is the more user-friendly class used in other Bigtable Java clients
...
1 vote · 2 answers · 114 views
What version of apache-beam I can upgrade that supports spark runner and I would not get "onWindowExpiration is not supported"
Currently I have a Beam project using version 2.29.0. We are looking to upgrade to a version where Kafka commits are done at the end of the batch. I tried using version 2.64.0 but the Spark Runner is ...
1 vote · 2 answers · 93 views
Worker cannot find external file in Apache Beam
I have a simple function that reads from Mongo using Apache Beam
def create_mongo_pipeline(p: beam.Pipeline, mongo_uri: str, db: str, coll: str, cert_file: str, gcs_bucket: str) -> None:
# Read ...
-1 votes · 1 answer · 93 views
Unable to trigger dynamically created pipelines in airflow DAG
I have a DAG file to load data from MongoDB to BigQuery. I have tested my Apache Beam pipeline individually, but when I am trying to trigger it dynamically to do parallel processing of all collections ...
1 vote · 1 answer · 76 views
Apache Beam ValueError when using ReadFromJson
I am learning Dataflow and created my first Dataflow template to read a JSON file. I am using the ReadFromJson function in the apache_beam.io package but it is giving me the error "line 57, in run()...
0 votes · 0 answers · 63 views
Apache Beam write to Parquet in Python
Below is my PostgreSQL table:
CREATE TABLE customer_data (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(100),
age INT,
registration_date DATE,
last_login ...
0 votes · 1 answer · 259 views
Cloud Scheduler to trigger dataflow flex template
I'm struggling to make my Flex Template work with Cloud Scheduler.
I was able to create it and I can run it from my local machine, through dataflow "create job from template" or using a ...
0 votes · 0 answers · 75 views
Java expansion service (gRPC server) is UNAVAILABLE when ReadFromPubSub in a Beam application tries to access it
My Beam Python application on the FlinkRunner (PortableRunner) has the following code.
from apache_beam.io.external.gcp.pubsub import ReadFromPubSub
from dependency_injector import providers
...
pubsub_reader = ...
0 votes · 1 answer · 203 views
Dataflow Flex Template Docker issue: Cannot start an expansion service since neither Java nor Docker executables are available in the system
I'm trying to run a dataflow job using flex template in docker. Here what I have:
FROM python:3.11-slim
COPY --from=apache/beam_python3.11_sdk:2.54.0 /opt/apache/beam /opt/apache/beam
COPY --from=...
2 votes · 1 answer · 112 views
What’s the difference between regular Apache Beam connectors and Managed I/O?
Apache Beam recently introduced Managed I/O APIs for Java and Python. What is the difference between Managed I/O and the regular Apache Beam connectors (sources and sinks)?
0 votes · 2 answers · 50 views
Creating an exponential moving average using StatefulDoFns in apache-beam but keep running into TypeError: 'float' object is not iterable
First question posted on stackoverflow so don't hesitate to give suggestions as to what tags to use for my post etc.
I am trying to write an Apache Beam pipeline which imports timeseries data from a ...
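Independent of Beam's state APIs, the exponential moving average itself is a one-value recurrence: the carried state is a single float, not an iterable, which is often where a `TypeError: 'float' object is not iterable` creeps in (e.g. iterating over the state instead of over the buffered elements). A plain-Python sketch of the recurrence (the function name is hypothetical):

```python
def ema_stream(values, alpha=0.5):
    """Yield the running EMA: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    state = None  # a single float of carried state, not a list
    for x in values:
        state = x if state is None else alpha * x + (1 - alpha) * state
        yield state

print(list(ema_stream([4.0, 2.0, 2.0])))  # → [4.0, 3.0, 2.5]
```

In a stateful DoFn this single float maps naturally to a value-style state (something like `ReadModifyWriteStateSpec`) rather than a bag state, which would hand back an iterable.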
0 votes · 1 answer · 55 views
Apache Beam Cross-language JDBC (MSSQL) - incorrect negative Integer type conversion
We use JDBC cross-language transform to read data from MSSQL to BigQuery, and we noticed negative integers are being converted incorrectly.
For example: if we have an INT column in the source with value (-1),...
0 votes · 1 answer · 54 views
How does Dataflow charge for read operations from Cloud Storage?
I am trying to understand how Google Cloud Dataflow charges when reading a file with beam.io.ReadFromText. From my understanding, every time something is read from a Google Cloud bucket, it incurs ...
1 vote · 2 answers · 127 views
Vertical autoscaling dataflow experiments args don't get properly parsed
We want to enable vertical autoscaling on our dataflow prime pipeline for a python container:
https://cloud.google.com/dataflow/docs/vertical-autoscaling
We're trying to run our pipeline through this ...
-1 votes · 2 answers · 104 views
GCP Batch Dataflow - Records Dropped while inserting to BigQuery
I'm using GCP Batch Dataflow to process data that I'm picking from a table. The input here is table data, where I'm using a query in Java to get the data.
After processing, when I'm trying to insert the ...
0 votes · 1 answer · 70 views
How to avoid error creating staging dataset when reading from BigQuery with Scio?
I tried to read BigQuery via Scio with:
val tbleRows = sc.withName("Query BQ Table").bigQuerySelect(query)
or
val tbleRows = sc.withName("Query BQ Table").bigQueryStorage(query)
In ...
0 votes · 1 answer · 104 views
How can we optimize the Cloud Dataflow job to minimize the startup time?
Apache Beam with Cloud Dataflow executors takes 5 minutes or more to cold-start the data pipeline. Is there any way to minimize the startup time?
I tried optimizing the Dockerfile but it is still slow.
...
0 votes · 1 answer · 37 views
How to Access Data from _StateBackedIterable in Apache Beam?
I am working with Apache Beam and encountering an issue when trying to access data from a PCollection that appears to be wrapped in _StateBackedIterable.
I have a side input in the form of an ...
0 votes · 2 answers · 216 views
Beam/Dataflow pipeline writing to BigQuery fails to convert timestamps (sometimes)
I have a Beam/Dataflow pipeline that reads from Pub/Sub and writes to BigQuery with WriteToBigQuery. I convert all timestamps to apache_beam.utils.timestamp.Timestamp. I am sure all timestamps are ...
1 vote · 0 answers · 35 views
How do I write all the filenames written at the end of each window to a metadata file?
My use case is to write all the parquet filenames to a separate metadata file after writing it to GCS at the end of each window.
I have tried a set of different approaches, but with each approach I ...
0 votes · 1 answer · 56 views
Are bigquery.LoadJobConfig() and the file loads method of Apache Beam's write to BigQuery the same?
Are bigquery.LoadJobConfig() and the file loads method of Apache Beam's write to BigQuery the same?
write_to_bq = (
csv_data
| "Write to BigQuery" >> WriteToBigQuery(
...
-1 votes · 1 answer · 59 views
How to configure a Beam application with the Spark runner to use S3ACommitter?
I have a Beam application running with the Spark runner. It encountered a data loss issue, as this application saves data to S3 storage.
I looked into this page https://hadoop.apache.org/...
0 votes · 0 answers · 109 views
How to load common dependencies into dataflow?
Our team has a set of data pipelines built as DAGs triggered on Composer (Airflow) that run Beam (Dataflow) jobs.
Across these dataflow pipelines, there are a set of common utilities engineers need to ...
0 votes · 1 answer · 69 views
Join a rapidly and slowly changing unbounded sources in Apache Beam
I have two unbounded sources (pubsub):
main source: emits values frequently
secondary source: sends an event which tells us to read a BigQuery table, since there was a change in the table.
I want ...
1 vote · 1 answer · 42 views
How to pass credentials from a ParDo into a ReadFromJdbc IO Connector
I have a requirement to securely get database credentials, which I'm able to accomplish using a ParDo. However I'd like to use a ReadFromJdbc IO Connector, and I'm facing a challenge passing in the ...
0 votes · 1 answer · 62 views
How to run a PEX on Apache Beam + GCP Dataflow?
Our team is looking to use the pants build system which conveniently packages python code into a PEX with only the required dependent packages. I couldn't find any documentation however about how a ...
0 votes · 1 answer · 173 views
Apache Beam: TypeError: Could not determine schema for type hint Any
I am working on a simple Apache Beam pipeline using Python to process a text file and output a CSV. Below is my code:
import apache_beam as beam
p1 = beam.Pipeline()
...
0 votes · 1 answer · 37 views
Unable to Write processed Data to Parquet
This is my code. I have a fixed number of records in my Kafka topic, approx 15k. I want the data to be read in batches and the output written to a file
PCollection<KafkaRecord<byte[], byte[]>...
2 votes · 0 answers · 110 views
Apache Beam Window Inconsistency Between DirectRunner and DataflowRunner
I have a test dataset to generate elements with timestamps in the next 20 minutes. I want to create 1 minute fixed windows, and then create a 15 minute sliding window every minute from this.
n = 20
...
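Such a test dataset, n elements spaced one minute apart, can be generated with the stdlib alone before any Beam timestamps are attached (names here are illustrative):

```python
from datetime import datetime, timedelta, timezone

n = 20
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
# (value, event_time) pairs covering the next 20 minutes, one per minute.
elements = [(i, base + timedelta(minutes=i)) for i in range(n)]

print(elements[0][1], elements[-1][1])  # minute 0 and minute 19
```

In the pipeline these event times would then be attached with something like `beam.window.TimestampedValue`, so that the 1-minute fixed windows and the 15-minute sliding windows see the same notion of event time on both runners; differences usually come from how each runner assigns timestamps, not from the windowing itself.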
1 vote · 1 answer · 77 views
Apache Beam DoFn Init: Why do init values reset between stream inputs on DirectRunner?
I am attempting to understand the lifecycle of a DoFn in more detail.
I've added this counter to the init of my DoFn:
> self.counter = Metrics.counter(self.__class__, 'counts')
And correspondingly ...
0 votes · 1 answer · 296 views
Why does my Apache Beam Dataflow pipeline not write to BigQuery?
I'm working on an Apache Beam pipeline that processes data and writes it to BigQuery. The pipeline works perfectly when using the DirectRunner, but when I switch to the DataflowRunner, it completes ...