Skip to content

Slow _merge_values ( parsing values from protobuf to string) #207

@aguergolet

Description

@aguergolet

self._current_row.append(_parse_value(value, field.type_))

We have a query returning 80.000 rows with 71 fields on the select list using Python3, Google Cloud Spanner API 1.17 (we tried 1.18, 1.19, and 2.x). I choose version 1.17 because the performance decreases with newer versions of API.

The query returns the data up to 0.9ms when I start to copy the rows from StreamedResultset iterator to a list.

I started to isolate the code, and I'm using an empty "for loop" to simulate the problem and discard any other application performance problem.

Code Snippet:

import threading
import time
import os
from google.cloud.spanner import Client, PingingPool
from google.cloud.spanner_v1 import instance, database

"""Queries sample data from the database using SQL."""

projectId = 'cerc2-datalake-int-01'
instanceId = 'datalake-int-spanner-01'
database_id = 'cerc_datalake_bk_int14'

spanner_client = Client(projectId)

instance = spanner_client.instance(instanceId)
database = instance.database(database_id)

# Read query from query.sql file.
file = open('query.sql',mode='r')
content  = file.read()
file.close()

# Executing the query
start = time.time()
print('Starting query')
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        content 
    )
end = time.time()
print(f'Query execution time: {end - start}')

# Count returned rows ( USING VERSION 2.x:  Error in this piece of code, some timestamp conversion error into google-core-api - removed all datetime fields to test)
start = time.time()
i = 0    
for row in results:
    i+=1
end = time.time()
print(f'Iterate itens {end - start}')


Using cProfile libraries to profile the application, I realized that the call of method _parse_value_pb inside _merge_values into streamed.py is the slowest method in my application.

For testing purposes, I removed the streamed.py file line 106:

self._current_row.append(_parse_value_pb(value, field.type))

The line has been replaced by:

self._current_row.append(value, field.type)

The first scenario (with _parse_value_pb):
Query 0.9ms, resultset iteration: 25 seconds;

The second scenario, removing _parse_value_pb:
Query 0.9ms, resultset iteration: 6 seconds;

All these tests is running on my 2.4Ghz laptop, but when we use Appengine, this routine gets more than 120 seconds;

I've tested with google spanner library 2.1 version, I got 103 seconds instead of 25 seconds ( version 1.17), with the same behavior;

The performance is fair when the resultset has many rows and a few columns in each row. I tested up to 30 resultset columns, and this behavior is not a problem. In my case, I need to work with 71 columns in each row.

Metadata

Metadata

Assignees

Labels

api: spannerIssues related to the googleapis/python-spanner API.priority: p2Moderately-important priority. Fix may not be included in next release.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions