4,858 questions
1 vote · 3 answers · 56 views
Flink and Beam pipeline having duplicate messages in Kafka consumer
We are running a pipeline on the managed Apache Flink runner on AWS. The Flink version we are using is 1.19 and the Beam version is 2.61.0. First I start the application with ...
Advice
0 votes · 1 reply · 56 views
Apache Beam update of the source table
I'm new to Apache Beam running on GCP, but my question is more theoretical than practical.
I have a source Spanner table and a destination Spanner table, and I'm fetching data from the source table to ...
1 vote · 1 answer · 61 views
How to access topic/payload in PulsarMessage since it has private getters in Apache Beam Pulsar IO connector 2.69.0
I noticed that the class PulsarMessage has private getters in version 2.69.0. Shouldn't they be public in order to access the topic names and/or payload of the message? Artifact link: https://...
1 vote · 1 answer · 42 views
Apache Beam: yield from works for TaggedOutput, but yield beam.TaggedOutput in except is ignored
I'm working with Apache Beam (2.62) and ran into a confusing behavior with DoFn.process() when using yield from and TaggedOutput.
When do_something_second function yields multiple TaggedOutput, ...
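The behavior above often comes down to plain Python generator semantics rather than anything Beam-specific. A minimal stdlib sketch (no Beam; `inner`/`outer` are hypothetical stand-ins for the delegated function and the DoFn's process method) showing that an except branch around `yield from` can still yield, provided the consumer keeps iterating:

```python
def inner():
    # Delegated generator: yields once, then fails on resume.
    yield "ok"
    raise ValueError("boom")

def outer():
    # Mirrors a DoFn.process that wraps `yield from` in try/except.
    try:
        yield from inner()
    except ValueError:
        # This yield IS reached when the consumer keeps iterating.
        yield "fallback"

print(list(outer()))  # → ['ok', 'fallback']
```

In Beam the same mechanics apply, but a runner that stops consuming a DoFn's output generator early never resumes it, so a yield placed after the exception point may silently never run.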
-3 votes · 1 answer · 81 views
Why are there no records read from spanner change stream?
I'm trying to write a Python GCP dataflow that processes records from a Spanner change stream and prints them out. I am running it locally and it appears to work but prints no records when I update a ...
0 votes · 1 answer · 119 views
Apache Beam 2.68.0 throws "Using fallback deterministic coder for type" warning
In the latest Apache Beam 2.68.0, they have changed the behavior of Coders for non-primitive objects. (see the changelog here).
Therefore, I get a warning like this on GCP Dataflow.
"Using ...
0 votes · 1 answer · 62 views
Issue with getSoftDeletePolicy() in google-api-services-storage after Apache Beam Upgrade
I'm currently upgrading the Apache Beam version for my Dataflow application from 2.51.0 to 2.67.0. As part of this process, I'm encountering a compatibility issue with the google-api-services-storage ...
0 votes · 1 answer · 107 views
Dataflow Python SDK failing to authenticate to Kafka using truststore and keystore JKS files with custom Docker image
I am trying to build a Python-based Apache Beam pipeline which is going to read from Kafka. Kafka requires truststore and keystore JKS file based authentication.
kafka_consumer_config = {
"...
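For reference, two-way SSL authentication with JKS files on a Kafka consumer is usually expressed with the standard `ssl.*` properties; a sketch (broker address, paths, and passwords are placeholders, and the JKS files must be resolvable inside the worker's custom container image):

```python
# Standard Kafka consumer properties for two-way SSL with JKS files.
# All values are placeholders; the JKS files must be baked into the
# custom SDK container image so workers can resolve these local paths.
kafka_consumer_config = {
    "bootstrap.servers": "broker-1:9093",
    "security.protocol": "SSL",
    "ssl.truststore.location": "/opt/certs/truststore.jks",
    "ssl.truststore.password": "changeit",
    "ssl.keystore.location": "/opt/certs/keystore.jks",
    "ssl.keystore.password": "changeit",
    "ssl.key.password": "changeit",
}
```

A frequent failure mode is that the paths resolve on the launcher but not on the workers, which is why the files belong in the worker image rather than the launcher environment.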
0 votes · 0 answers · 56 views
Calcite - how to use the BigQuery dialect with Apache Beam
I'm currently working on migrating from ZetaSQL to Calcite within an Apache Beam pipeline.
I need to use specific transformations that are only available when the BigQuery dialect is enabled. I ...
0 votes · 0 answers · 66 views
Solving version conflict using Apache Beam with ML transforms library
I've been trying for some time to get a Beam pipeline to do data transformations for a fairly simple machine learning transformation, but Apache Beam and TensorFlow Transform won't play nicely ...
0 votes · 0 answers · 62 views
How to use the ErrorHandler interface in Apache Beam?
I would like to use an ErrorHandler to catch all the errors that happen during my pipeline.
I have seen that there is an interface which allows doing so: https://beam.apache.org/releases/javadoc/...
0 votes · 0 answers · 57 views
Creating a global dataset combining multiple regions in BigQuery using Apache Beam
I have four regions (a, b, c, d) and I want to create a single dataset concatenating all four and store it in c. How can this be done? I tried with dbt-Python but had to hardcode a lot; looking for a ...
0 votes · 1 answer · 56 views
AvroCoder requires default constructor in DirectRunner locally but works on GCP Dataflow - Why?
I'm experiencing inconsistent behavior between Apache Beam's DirectRunner (local) and DataflowRunner (GCP) when using AvroCoder with an immutable class.
Problem
I have an immutable class defined using ...
0 votes · 1 answer · 80 views
PCollection Objects Format for Apache Beam to write on BigQuery using CDC in Python
I'm trying to write to BigQuery using Apache Beam, in Python.
However, I want to use the newest CDC features to write to BigQuery,
and I can't get the correct format of the objects in the ...
1 vote · 1 answer · 60 views
How to define nullable fields for SqlTransform
I'm using Beam SqlTransform in Python, trying to define/pass nullable fields.
This code works just fine:
with beam.Pipeline(options=options) as p:
# ...
# Use beam.Row to create a schema-aware ...
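In the Python SDK, a common way to mark a schema field nullable is `typing.Optional` on a NamedTuple row type, which Beam's schema inference maps to a nullable field. A minimal sketch (the `Purchase` type is hypothetical, and the `beam.coders.registry.register_coder(Purchase, beam.coders.RowCoder)` call from the Beam schema docs is omitted so the snippet stays stdlib-only):

```python
import typing

class Purchase(typing.NamedTuple):
    item: str                        # required (non-nullable) field
    amount: typing.Optional[float]   # Optional -> nullable in the inferred schema

# Inspect the annotations Beam's schema inference would see.
hints = typing.get_type_hints(Purchase)
print(hints["amount"])  # typing.Optional[float]
```

Rows built with plain `beam.Row(...)` infer types from the values passed, so a field that is sometimes `None` is better declared explicitly on a typed row like the one above.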
0 votes · 1 answer · 60 views
Run and skip steps based on a condition in an Apache Beam pipeline which reads data from multiple tables
I read data from BigTable using pipeline.apply
Using data from 1 as a side input, I again read from another BigTable using pipeline.apply
Finally, after some other steps in the pipeline, I run this ...
0 votes · 1 answer · 86 views
How to maximise throughput with RequestResponseIO on GCP Dataflow
I'm trying to use RequestResponseIO on Dataflow to make parallel requests to an endpoint. As a test, I've created a Cloud Run helloworld endpoint which just receives these requests; it can handle up ...
1 vote · 0 answers · 39 views
Having difficulty running ReadFromKafka on Windows
I'm currently struggling to run a Beam pipeline on Windows using ReadFromKafka.
I'm trying to use ReadFromKafka to consume data from a Kafka topic; my PC runs Windows and I already had some other errors ...
0 votes · 1 answer · 57 views
What is the equivalent of apache spark's collectAsList method in apache beam?
As per this link, it is not easy to convert a PCollection to List.
Additional link
Whereas in Apache Spark, collectAsList() helps convert all elements of a dataset to a list.
Why Spark allows it with ...
0 votes · 1 answer · 70 views
What Row class is returned when Apache Beam BigtableIO is used?
The below code snippet using BigtableIO returns com.google.bigtable.v2.Row, whereas com.google.cloud.bigtable.data.v2.models.Row is the more user-friendly class used in other Bigtable Java clients
...
1 vote · 2 answers · 114 views
What version of apache-beam I can upgrade that supports spark runner and I would not get "onWindowExpiration is not supported"
Currently I have a Beam project using version 2.29.0. We are looking to upgrade to a version where Kafka commits are done at the end of the batch. I tried using version 2.64.0 but the Spark Runner is ...
1 vote · 2 answers · 93 views
Worker cannot find external file in Apache Beam
I have a simple function that reads from Mongo using Apache Beam
def create_mongo_pipeline(p: beam.Pipeline, mongo_uri: str, db: str, coll: str, cert_file: str, gcs_bucket: str) -> None:
# Read ...
-1 votes · 1 answer · 93 views
Unable to trigger dynamically created pipelines in airflow DAG
I have a DAG file to load data from MongoDB to BigQuery. I have tested my Apache Beam pipeline individually, but when I am trying to trigger it dynamically to do parallel processing of all collections ...
1 vote · 1 answer · 76 views
Apache Beam ValueError when using ReadFromJson
I am learning Dataflow and created my first Dataflow template to read a JSON file. I am using the ReadFromJson function in the apache_beam.io package but it is giving me the error "line 57, in run()...
0 votes · 0 answers · 63 views
Apache Beam write to Parquet in Python
Below is my PostgreSQL table:
CREATE TABLE customer_data (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(100),
age INT,
registration_date DATE,
last_login ...
0 votes · 1 answer · 259 views
Cloud Scheduler to trigger dataflow flex template
I'm struggling to make my Flex Template work with Cloud Scheduler.
I was able to create it and I can run it from my local machine, through dataflow "create job from template" or using a ...
0 votes · 0 answers · 75 views
Java expansion service (gRPC server) is UNAVAILABLE when ReadFromPubSub in a Beam application tries to access it
My Beam Python application on the FlinkRunner (PortableRunner) has the following code.
from apache_beam.io.external.gcp.pubsub import ReadFromPubSub
from dependency_injector import providers
...
pubsub_reader = ...
0 votes · 1 answer · 203 views
Dataflow Flex Template Docker issue: Cannot start an expansion service since neither Java nor Docker executables are available in the system
I'm trying to run a dataflow job using flex template in docker. Here what I have:
FROM python:3.11-slim
COPY --from=apache/beam_python3.11_sdk:2.54.0 /opt/apache/beam /opt/apache/beam
COPY --from=...
2 votes · 1 answer · 112 views
What’s the difference between regular Apache Beam connectors and Managed I/O?
Apache Beam recently introduced Managed I/O APIs for Java and Python. What is the difference between Managed I/O and the regular Apache Beam connectors (sources and sinks)?
0 votes · 2 answers · 50 views
Creating an exponential moving average using StatefulDoFns in apache-beam but keep running into TypeError: 'float' object is not iterable
First question posted on stackoverflow so don't hesitate to give suggestions as to what tags to use for my post etc.
I am trying to write an Apache Beam pipeline which imports timeseries data from a ...
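Independent of Beam's state APIs, the exponential moving average itself is a one-value recurrence: the carried state is a single float, not an iterable, which is often where a `TypeError: 'float' object is not iterable` creeps in (e.g. iterating over the state instead of over the buffered elements). A plain-Python sketch of the recurrence (the function name is hypothetical):

```python
def ema_stream(values, alpha=0.5):
    """Yield the running EMA: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    state = None  # a single float of carried state, not a list
    for x in values:
        state = x if state is None else alpha * x + (1 - alpha) * state
        yield state

print(list(ema_stream([4.0, 2.0, 2.0])))  # → [4.0, 3.0, 2.5]
```

In a stateful DoFn this single float maps naturally to a value-style state (something like `ReadModifyWriteStateSpec`) rather than a bag state, which would hand back an iterable.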
0 votes · 1 answer · 55 views
Apache Beam Cross-language JDBC (MSSQL) - incorrect negative Integer type conversion
We use JDBC cross-language transform to read data from MSSQL to BigQuery, and we noticed negative integers are being converted incorrectly.
For example: if we have an INT column in the source with value (-1),...
0 votes · 1 answer · 54 views
How does Dataflow charge for read operations from Cloud Storage?
I am trying to understand how Google Cloud Dataflow charges when reading a file with beam.io.ReadFromText. From my understanding, every time something is read from a Google Cloud bucket, it incurs ...
1 vote · 2 answers · 127 views
Vertical autoscaling dataflow experiments args don't get properly parsed
We want to enable vertical autoscaling on our dataflow prime pipeline for a python container:
https://cloud.google.com/dataflow/docs/vertical-autoscaling
We're trying to run our pipeline through this ...
-1 votes · 2 answers · 104 views
GCP Batch Dataflow - Records Dropped while inserting to BigQuery
I'm using GCP Batch Dataflow to process data that I'm picking from a table. The input here is table data, where I'm using a query in Java to get the data.
After processing, when I'm trying to insert the ...
0 votes · 1 answer · 70 views
How to avoid error creating staging dataset when reading from BigQuery with Scio?
I tried to read BigQuery via Scio with:
val tbleRows = sc.withName("Query BQ Table").bigQuerySelect(query)
or
val tbleRows = sc.withName("Query BQ Table").bigQueryStorage(query)
In ...
0 votes · 1 answer · 104 views
How can we optimize the Cloud Dataflow job to minimize the startup time?
Apache Beam with Cloud Dataflow executors takes 5 minutes or more to cold-start the data pipeline. Is there any way to minimize the startup time?
I tried optimizing the Dockerfile but it is still slow.
...
0 votes · 1 answer · 37 views
How to Access Data from _StateBackedIterable in Apache Beam?
I am working with Apache Beam and encountering an issue when trying to access data from a PCollection that appears to be wrapped in _StateBackedIterable.
I have a side input in the form of an ...
0 votes · 2 answers · 216 views
Beam/Dataflow pipeline writing to BigQuery fails to convert timestamps (sometimes)
I have a Beam/Dataflow pipeline that reads from Pub/Sub and writes to BigQuery with WriteToBigQuery. I convert all timestamps to apache_beam.utils.timestamp.Timestamp. I am sure all timestamps are ...
1 vote · 0 answers · 35 views
How do I write all the filenames written at the end of each window to a metadata file?
My use case is to write all the parquet filenames to a separate metadata file after writing it to GCS at the end of each window.
I have tried a set of different approaches, but with each approach I ...
0 votes · 1 answer · 56 views
Are bigquery.LoadJobConfig() and the file loads method of Apache Beam's write to BigQuery the same?
Are bigquery.LoadJobConfig() and the file loads method of Apache Beam's write to BigQuery the same?
write_to_bq = (
csv_data
| "Write to BigQuery" >> WriteToBigQuery(
...
-1 votes · 1 answer · 59 views
How to configure a Beam application with the Spark runner to use S3ACommitter?
I have a Beam application running with the Spark runner. It encountered a data loss issue, as this application saves data to S3 storage.
I looked into this page https://hadoop.apache.org/...
0 votes · 0 answers · 109 views
How to load common dependencies into dataflow?
Our team has a set of data pipelines built as DAGs triggered on Composer (Airflow) that run Beam (Dataflow) jobs.
Across these dataflow pipelines, there are a set of common utilities engineers need to ...
0 votes · 1 answer · 69 views
Join a rapidly and slowly changing unbounded sources in Apache Beam
I have two unbounded sources (pubsub):
main source: emits values frequently
secondary source: sends an event which tells us to read a BigQuery table, since there was a change in the table.
I want ...
1 vote · 1 answer · 42 views
How to pass credentials from a ParDo into a ReadFromJdbc IO Connector
I have a requirement to securely get database credentials, which I'm able to accomplish using a ParDo. However I'd like to use a ReadFromJdbc IO Connector, and I'm facing a challenge passing in the ...
0 votes · 1 answer · 62 views
How to run a PEX on Apache Beam + GCP Dataflow?
Our team is looking to use the pants build system which conveniently packages python code into a PEX with only the required dependent packages. I couldn't find any documentation however about how a ...
0 votes · 1 answer · 173 views
Apache Beam: TypeError: Could not determine schema for type hint Any
I am working on a simple Apache Beam pipeline using Python to process a text file and output a CSV. Below is my code:
import apache_beam as beam
p1 = beam.Pipeline()
...
0 votes · 1 answer · 37 views
Unable to Write processed Data to Parquet
This is my code. I have a fixed number of records in my Kafka topic, approx 15k. I want the data to be read in batches and the output written to a file
PCollection<KafkaRecord<byte[], byte[]>...
2 votes · 0 answers · 110 views
Apache Beam Window Inconsistency Between DirectRunner and DataflowRunner
I have a test dataset to generate elements with timestamps in the next 20 minutes. I want to create 1 minute fixed windows, and then create a 15 minute sliding window every minute from this.
n = 20
...
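Such a test dataset, n elements spaced one minute apart, can be generated with the stdlib alone before any Beam timestamps are attached (names here are illustrative):

```python
from datetime import datetime, timedelta, timezone

n = 20
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
# (value, event_time) pairs covering the next 20 minutes, one per minute.
elements = [(i, base + timedelta(minutes=i)) for i in range(n)]

print(elements[0][1], elements[-1][1])  # minute 0 and minute 19
```

In the pipeline these event times would then be attached with something like `beam.window.TimestampedValue`, so that the 1-minute fixed windows and the 15-minute sliding windows see the same notion of event time on both runners; differences usually come from how each runner assigns timestamps, not from the windowing itself.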
1 vote · 1 answer · 77 views
Apache Beam DoFn Init: Why do init values reset between stream inputs on DirectRunner?
I am attempting to understand the lifecycle of a DoFn in more detail.
I've added this counter to the init of my DoFn:
> self.counter = Metrics.counter(self.__class__, 'counts')
And correspondingly ...
0 votes · 1 answer · 296 views
Why does my Apache Beam Dataflow pipeline not write to BigQuery?
I'm working on an Apache Beam pipeline that processes data and writes it to BigQuery. The pipeline works perfectly when using the DirectRunner, but when I switch to the DataflowRunner, it completes ...