-
Notifications
You must be signed in to change notification settings - Fork 99
Description
| self._current_row.append(_parse_value(value, field.type_)) |
We have a query returning 80.000 rows with 71 fields on the select list using Python3, Google Cloud Spanner API 1.17 (we tried 1.18, 1.19, and 2.x). I choose version 1.17 because the performance decreases with newer versions of API.
The query returns the data up to 0.9ms when I start to copy the rows from StreamedResultset iterator to a list.
I started to isolate the code, and I'm using an empty "for loop" to simulate the problem and discard any other application performance problem.
Code Snippet:
import threading
import time
import os
from google.cloud.spanner import Client, PingingPool
from google.cloud.spanner_v1 import instance, database
"""Queries sample data from the database using SQL."""
projectId = 'cerc2-datalake-int-01'
instanceId = 'datalake-int-spanner-01'
database_id = 'cerc_datalake_bk_int14'
spanner_client = Client(projectId)
instance = spanner_client.instance(instanceId)
database = instance.database(database_id)
# Read query from query.sql file.
file = open('query.sql',mode='r')
content = file.read()
file.close()
# Executing the query
start = time.time()
print('Starting query')
with database.snapshot() as snapshot:
results = snapshot.execute_sql(
content
)
end = time.time()
print(f'Query execution time: {end - start}')
# Count returned rows ( USING VERSION 2.x: Error in this piece of code, some timestamp conversion error into google-core-api - removed all datetime fields to test)
start = time.time()
i = 0
for row in results:
i+=1
end = time.time()
print(f'Iterate itens {end - start}')
Using cProfile libraries to profile the application, I realized that the call of method _parse_value_pb inside _merge_values into streamed.py is the slowest method in my application.
For testing purposes, I removed the streamed.py file line 106:
self._current_row.append(_parse_value_pb(value, field.type))
The line has been replaced by:
self._current_row.append(value, field.type)
The first scenario (with _parse_value_pb):
Query 0.9ms, resultset iteration: 25 seconds;
The second scenario, removing _parse_value_pb:
Query 0.9ms, resultset iteration: 6 seconds;
All these tests is running on my 2.4Ghz laptop, but when we use Appengine, this routine gets more than 120 seconds;
I've tested with google spanner library 2.1 version, I got 103 seconds instead of 25 seconds ( version 1.17), with the same behavior;
The performance is fair when the resultset has many rows and a few columns in each row. I tested up to 30 resultset columns, and this behavior is not a problem. In my case, I need to work with 71 columns in each row.