Slow _merge_values ( parsing values from protobuf to string)

https://github.com/googleapis/python-spanner/blob/4a58bcb4016977a8b2ae7fe886c4db40a13e85cd/google/cloud/spanner_v1/streamed.py#L103


We have a query returning 80.000 rows with 71 fields on the select list using Python3, Google Cloud Spanner API 1.17 (we tried 1.18, 1.19, and 2.x). I choose version 1.17 because the performance decreases with newer versions of API.

The query returns the data up to 0.9ms when I start to copy the rows from StreamedResultset iterator to a list.

I started to isolate the code, and I'm using an empty "for loop" to simulate the problem and discard any other application performance problem.

```
Code Snippet:

import threading
import time
import os
from google.cloud.spanner import Client, PingingPool
from google.cloud.spanner_v1 import instance, database

"""Queries sample data from the database using SQL."""

projectId = 'cerc2-datalake-int-01'
instanceId = 'datalake-int-spanner-01'
database_id = 'cerc_datalake_bk_int14'

spanner_client = Client(projectId)

instance = spanner_client.instance(instanceId)
database = instance.database(database_id)

# Read query from query.sql file.
file = open('query.sql',mode='r')
content  = file.read()
file.close()

# Executing the query
start = time.time()
print('Starting query')
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        content 
    )
end = time.time()
print(f'Query execution time: {end - start}')

# Count returned rows ( USING VERSION 2.x:  Error in this piece of code, some timestamp conversion error into google-core-api - removed all datetime fields to test)
start = time.time()
i = 0    
for row in results:
    i+=1
end = time.time()
print(f'Iterate itens {end - start}')


```
Using cProfile libraries to profile the application, I realized that the call of method _parse_value_pb inside _merge_values into streamed.py is the slowest method in my application. 

For testing purposes, I removed the streamed.py file line 106:
>> self._current_row.append(_parse_value_pb(value, field.type))

The line has been replaced by: 
>> self._current_row.append(value, field.type)

The first scenario (with _parse_value_pb):
Query 0.9ms, resultset iteration: 25 seconds;

The second scenario, removing _parse_value_pb:
Query 0.9ms, resultset iteration: 6 seconds;

All these tests is running on my 2.4Ghz laptop, but when we use Appengine, this routine gets more than 120 seconds;

I've tested with google spanner library 2.1 version, I got 103 seconds instead of 25 seconds ( version 1.17), with the same behavior;

The performance is fair when the resultset has many rows and a few columns in each row. I tested up to 30 resultset columns, and this behavior is not a problem. In my case, I need to work with 71 columns in each row.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Slow _merge_values ( parsing values from protobuf to string) #207

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Slow _merge_values ( parsing values from protobuf to string) #207

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions