This piece of code in the online_write_batch function in the Postgres online store pushes data in batches to the online store.
I was wondering whether it makes more sense to put the conn.commit() inside the for loop, or after the for loop. I would love to hear the different trade-offs between the two options!
Copy of the code snippet here:
batch_size = 5000
for i in range(0, len(insert_values), batch_size):
cur_batch = insert_values[i : i + batch_size]
execute_values(
cur,
sql.SQL(
"""
INSERT INTO {}
(entity_key, feature_name, value, event_ts, created_ts)
VALUES %s
ON CONFLICT (entity_key, feature_name) DO
UPDATE SET
value = EXCLUDED.value,
event_ts = EXCLUDED.event_ts,
created_ts = EXCLUDED.created_ts;
""",
).format(sql.Identifier(_table_id(project, table))),
cur_batch,
page_size=batch_size,
)
conn.commit() # << This is the point of interest. Should this be placed in the loop or after?