Description
Currently, the Client.insert_rows_json() method for streaming inserts always attaches an insertId unique identifier to each row provided.
This row identifier can be user-provided; if the user doesn't provide any identifiers, the library automatically fills the row IDs by using UUID4.
Here's the code:
    for index, row in enumerate(json_rows):
        info = {"json": row}
        if row_ids is not None:
            info["insertId"] = row_ids[index]
        else:
            info["insertId"] = str(uuid.uuid4())
        rows_info.append(info)

However, insert IDs are entirely optional, and there are actually valid use cases not to use them. From the BigQuery documentation:
You can disable best effort de-duplication by not populating the insertId field for each row inserted. When you do not populate insertId, you get higher streaming ingest quotas in certain regions. This is the recommended way to get higher streaming ingest quota limits.
The BigQuery Python client library provides no way of omitting the insertIds. It would be nice to have a parameter for that.
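
To make the request concrete, here is a rough sketch of how the row-building loop could behave with an opt-out. It is a standalone illustration, not the library's actual code: the helper name build_rows_info and the omit_insert_ids parameter are hypothetical, and the real change might expose the option differently.

```python
import uuid
from typing import Any, Dict, List, Optional, Sequence


def build_rows_info(
    json_rows: Sequence[Dict[str, Any]],
    row_ids: Optional[Sequence[str]] = None,
    omit_insert_ids: bool = False,  # hypothetical parameter, not in the library
) -> List[Dict[str, Any]]:
    """Build the per-row payload entries, optionally without insertId."""
    if omit_insert_ids and row_ids is not None:
        # The two options contradict each other, so fail loudly.
        raise ValueError("row_ids cannot be combined with omit_insert_ids=True")

    rows_info = []
    for index, row in enumerate(json_rows):
        info: Dict[str, Any] = {"json": row}
        if omit_insert_ids:
            # Leave insertId out entirely; this is what disables
            # best effort de-duplication according to the quoted docs.
            pass
        elif row_ids is not None:
            info["insertId"] = row_ids[index]
        else:
            # Current default behavior: generate a UUID4 per row.
            info["insertId"] = str(uuid.uuid4())
        rows_info.append(info)
    return rows_info
```

With omit_insert_ids=True each entry would contain only the "json" key, while the default call keeps today's behavior, so existing users would not be affected.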