Client.insert_rows_json(): add option to disable best-effort deduplication #720

@pietrodn

Description

Currently, the Client.insert_rows_json() method for streaming inserts always attaches an insertId unique identifier to each row provided.
This row identifier can be user-provided; if the user doesn't supply any identifiers, the library automatically generates one per row using UUID4.

Here's the code:

    # From Client.insert_rows_json(): every row unconditionally gets an insertId.
    for index, row in enumerate(json_rows):
        info = {"json": row}
        if row_ids is not None:
            info["insertId"] = row_ids[index]
        else:
            info["insertId"] = str(uuid.uuid4())
        rows_info.append(info)

However, insert IDs are entirely optional, and there are valid use cases for omitting them. From the BigQuery documentation:

You can disable best effort de-duplication by not populating the insertId field for each row inserted. When you do not populate insertId, you get higher streaming ingest quotas in certain regions. This is the recommended way to get higher streaming ingest quota limits.

The BigQuery Python client library currently provides no way to omit the insertId fields; it would be nice to have a parameter for that.
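The change could be as small as skipping the insertId assignment when the caller opts out. Here is a minimal sketch of the row-building loop above with a hypothetical `omit_insert_ids` flag (the flag name and the standalone helper are assumptions for illustration, not existing library API):

```python
import uuid


def build_rows_info(json_rows, row_ids=None, omit_insert_ids=False):
    """Build the rows payload for the tabledata.insertAll request.

    When omit_insert_ids is True, no "insertId" key is set, so BigQuery
    skips best-effort de-duplication (and grants higher ingest quotas
    in certain regions, per the documentation quoted above).
    """
    rows_info = []
    for index, row in enumerate(json_rows):
        info = {"json": row}
        if omit_insert_ids:
            pass  # leave insertId out entirely
        elif row_ids is not None:
            info["insertId"] = row_ids[index]
        else:
            info["insertId"] = str(uuid.uuid4())
        rows_info.append(info)
    return rows_info
```

With the flag set, each entry in the payload contains only the `"json"` key, which is all the insertAll API requires.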

Metadata

Labels

api: bigquery (Issues related to the googleapis/python-bigquery API.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)
