feat: Streaming ingestion latency improvements (#292)

kpulipati29 merged 4 commits into feature/range-query-improvements
Conversation
rows_to_write = _convert_arrow_to_proto(table, feature_view, join_keys)
num_spark_driver_cores = int(os.environ.get("SPARK_DRIVER_CORES", 1))

if num_spark_driver_cores > 3:
Shouldn't this be num_spark_driver_cores > 4? Since at 3 cores you only have 1 process to use, so there is no parallelization?
Yes, I agree that at 3 cores there will be no parallelism if I leave 2 cores free. But the condition is num_spark_driver_cores > 3, meaning multiprocessing will be used if num_spark_driver_cores is 4 or 5 or 6, etc.
Not sure if I am missing something here.
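For reference, the gating logic being discussed can be sketched roughly as below. This is a hedged illustration, not the PR's exact code: the helper name compute_num_processes and the reserved_cores parameter are hypothetical; SPARK_DRIVER_CORES, the > 3 check, and the "reserve 2 cores" choice come from the snippets in this thread.

```python
import os

def compute_num_processes(reserved_cores: int = 2) -> int:
    """Sketch of the discussed gating: parallelize only when enough Spark
    driver cores remain after reserving some for the OS/background work."""
    num_spark_driver_cores = int(os.environ.get("SPARK_DRIVER_CORES", 1))
    if num_spark_driver_cores > 3:
        # e.g. 4 cores with 2 reserved still leaves 2 worker processes
        return num_spark_driver_cores - reserved_cores
    return 1  # fall back to single-process ingestion
```

With this shape, 3 cores yields a single process (no parallelism), while 4+ cores yields at least 2 workers, which matches the author's reading of the condition.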
num_processes = table.num_rows

# Input table is split into smaller chunks and processed in parallel
chunks = self.split_table(num_processes, table)
PyArrow has a built-in method table.to_batches(max_chunksize=None) that should be an optimized way to create chunks. Is there some reason not to use that?
did not come across that, let me check!
vanitabhagwat left a comment:
Can you add the document showing the improved latency resulting from these changes?
if table.num_rows < num_processes:
    num_processes = table.num_rows

# Input table is split into smaller chunks and processed in parallel
I think we should consider a minimum chunk/table size threshold for parallelization, or auto-tune based on table size, instead of relying only on Spark cores. A smaller table may not need this.
I think latency depends not only on table size but also on other factors like the number of features, feature types, etc., and if we hard-code thresholds we might end up constantly tweaking them as use cases grow. Whereas if we depend on resources, the user just has to add more resources (more cores, in this case) if they are not satisfied with the latency at a lower core count.
In this PR, multiprocessing is enabled if the number of cores > 2, and a smaller TPS/table size usually does not need more than 2 cores, in which case multiprocessing is not used. The only downside is that if the user allocates more cores than needed by mistake, for example 3 cores for 50 req/sec, then multiprocessing will be used even though it's not needed. But at the same time, using multiprocessing at lower TPS does not add any latency overhead, and the user can correct this behaviour by reducing the number of cores, which is the right thing to do.
rows_to_write = _convert_arrow_to_proto(table, feature_view, join_keys)
self.online_write_batch(
    self.repo_config, feature_view, rows_to_write, progress=None
Call process_chunk() here. Rename process_chunk() to process so it's generic.
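The suggested refactor could look roughly like the sketch below: one generic process() method whose body works both for the full table and for a single chunk. This is a hypothetical stand-in, not the PR's code: the Writer class, its written list, and the stubbed _convert_arrow_to_proto are invented here for illustration; only the method and field names they mimic come from the diff.

```python
def _convert_arrow_to_proto(table, feature_view, join_keys):
    # Stub for the real Feast helper referenced in the diff; here it just
    # tags each row so the flow is visible
    return [(feature_view, join_keys, row) for row in table]

class Writer:
    """Hypothetical minimal stand-in for the online store writer in this PR."""

    def __init__(self):
        self.written = []

    def online_write_batch(self, repo_config, feature_view, rows, progress=None):
        # Stub: record what would be written to the online store
        self.written.extend(rows)

    def process(self, table, feature_view, join_keys, repo_config=None):
        # Generic body: same code path whether `table` is the whole input
        # or one chunk produced by split_table()
        rows_to_write = _convert_arrow_to_proto(table, feature_view, join_keys)
        self.online_write_batch(repo_config, feature_view, rows_to_write, progress=None)
```

The point of the rename is that the serial path can then call process(table) directly while the parallel path maps process over the chunks, with no duplicated conversion/write logic.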
if num_spark_driver_cores > 3:
    # Leaving a couple of cores for operating system and other background processes
    num_processes = num_spark_driver_cores - 2
2 cores may be too many. We usually run with 1 driver core for materialization jobs. Just curious: any logic behind this number?
Got it. When I ran a test leaving 2 cores, the max CPU utilization went up to 64%, so I thought leaving 1 core would further increase CPU utilization. But I ran another test leaving 1 core and the CPU utilization did not change much. I changed it to leave 1 core now.
Sure, updated the description with the doc.
Merged commit 8d05d8b into feature/range-query-improvements
* fix: Update Docker images to use official sources for integration tests (#260)
* fix: Update Docker images to use official sources and improve workflow triggers
* docs: Update README to include instructions for running integration tests
---------
Co-authored-by: Bhargav Dodla <bdodla@expediagroup.com>
* fix: Only add sort key filter models to list if they meet the conditi… (#261)
* fix: Only add sort key filter models to list if they meet the conditions. Split read range int tests into its own file.
* add pauses in int test to allow for apply/materialize to compelete
* set int tests to not run in parallel
* fix: Adding Snake Case to Include Metadata Check (#264)
* adding snake case to include metadata check
* fixing formatting
* added trim space to string function
* fix: Move use_write_time_for_ttl config to sfv instead of an online store config (#267)
* fix: Move use_write_time_for_ttl config to SortedFeatureView instead of an online store config
* re-order lines in proto so indexes are in order
* fix: Pack and unpack repeated list values into and out of arrow array… (#263)
* fix: Pack and unpack repeated list values into and out of arrow arrays. Restructure integration tests to properly separate concerns
* Throw error when requested feature service includes both view types
* Re-use CopyProtoValuesToArrowArray instead of duplicating switch logic.
* fix: fix misplaced reset indexes (#266)
* feat: Adding Changes to Check if FV and SFV have Valid Updates (#254)
* Adding update changes
* fixing tests
* fixing linting
* addressing PR comments
* fixing unit test
* Added in sort order check
* using feast object
* removing return type from validation
* Adding another exception type
* Addressing PR comments to add an entity check
* fixing tests
* Addressing PR comments
* fixing test assertion
* fixing formatting
* fix: Validate requested features exist in view. (#269)
* fix: Validate requested features exist in view.
* add test cases for invalid features in feature service
* reduce time complexity and duplicate checks for feature validation
* use if-else blocks
* fix: Clean Up Error Messages (#274)
* writing clearer error messages
* addressing PR comments: fixing tests
* feat: Adding Alter Table to Support Schema Updates for Cassandra (#262)
* Adding update changes to alter table if fv schema changes
* addressing PR comments
* removing redundant lines
* addressing PR comments
* editing makefile to get python tests to run
* fixing go integration tests
* fixing formatting
* fixing failing tests
* changing number of workers to 1
* removing the igmore to run all python tests
* adding back the ignore statement, seems to be running tests twice
* reverting back Makefile
* removing ignore statement
* reverting back
* aadding to unit_tests.yml file
* aadding to unit_tests.yml file
* aadding to unit_tests.yml file
* MacOS tests removed due to MacOS runners not supporting docker
* feat: Separate entities from features in Range Query response (#265)
* feat: seperate entities into their own field in range query response
* feat: seperate entities from features in http range query response
* change range query response name back to results instead of features
* fix http server unit test
* add debugging logs
* modify getOnlineFeaturesRange logic
* fix grpc range query code and integration tests
* fix: Javadoc errors (#270)
* update the function description to fix javdoc errors.
* fix: Formatting
---------
Co-authored-by: vbhagwat <vbhagwat@expediagroup.com>
* fix: dont add entities to feature vectors
* fix: Formatting javadocs (#271)
* update the function description to fix javdoc errors.
* fix: Formatting
* formatting
* fix: formatting
* fix linting errors
* fix params
---------
Co-authored-by: vbhagwat <vbhagwat@expediagroup.com>
* fix: Formatting (#272)
* update the function description to fix javdoc errors.
* fix: Formatting
* formatting
* fix: formatting
* fix linting errors
* fix params
* fix javadoc error
---------
Co-authored-by: vbhagwat <vbhagwat@expediagroup.com>
* fix: Formatting (#273)
* update the function description to fix javdoc errors.
* fix: Formatting
* formatting
* fix: formatting
* fix linting errors
* fix params
* fix javadoc error
* Update FeastClient.java
---------
Co-authored-by: vbhagwat <vbhagwat@expediagroup.com>
* feat: Add getOnlineFeature and getOnlineFeaturesRange overloaded methods wi… (#275)
* feat: add getOnlineFeature and getOnlineFeaturesRange overloaded methods without project
Co-authored-by: vbhagwat <vbhagwat@expediagroup.com>
* update java client for separate entities in response
* Update tests to reflect changes
* fix processFeatureVector tests
* throw exception when more rows than entity values
* change based on pr feedback
* pr feedback
---------
Co-authored-by: omiranda <omiranda@expediagroup.com>
Co-authored-by: vanitabhagwat <92561664+vanitabhagwat@users.noreply.github.com>
Co-authored-by: vbhagwat <vbhagwat@expediagroup.com>
* fix: Keep duplicate requested features in response for range query. (#276)
* fix: Http range timestamp values should return in rfc3339 format (#277)
* fix: Http range timestamp values should return in rfc3339 format to match how get features timestamps are returned
* use the exact timestamp format arrow marshalling uses
* fix: Exception Handling in Updates Feature (#278)
* added another exception type
* raising exceptions in the http methods instead
* calling handle_exception method instead of a raise
* Added a TODO comment
* fix: Timestamps should be seconds percision. Return null status for found null values. (#279)
* fix: Timestamps should be seconds percision. Return null status for found null values.
* convert UnixTimestamp sort filters as time.Time
* use consistent time for type conversion test
* fix materializing timestamps as seconds and java converting time values into epoch seconds
* fix linting
* fix test
* fix test
* fix: Go server errors should be status errors with proper status codes. (#281)
* fix: Go server errors should be status errors with proper status codes. FeastClient handles status exceptions as differentiated FeastExceptions.
* fix and expand tests
* add more test cases
* fix: Validation order for multiple missing end keys should pass (#283)
* fix: Validation order for multiple missing end keys should pass
* fix int test and create test for only an equals filter
* add conversions of int32 to higher values for http
* fix: Handle null values in Arrow list conversion to Proto values (#280)
* fix: Handle null values in Arrow list conversion to Proto values
* fix: Fixed the nil representation in http
* fix: Correct variable name for null check in ArrowListToProtoList function
* fix: Null value handling for types.Values
* fix: Fixed the nil representation in http
* fix: Fixed the issue with NOT_FOUND values representation and repeated null values
* fix: Fixed the nil representation
* fix: Fixed the nil representation in http
* fix: Removed duplicate test
* fix: Added status handling for feature values and ensured consistency in value-status mapping
* fix: Fixed issue with timestamps in HTTP
* fix: Fixed failing tests
* fix: Simplified null value handling in type conversion
---------
Co-authored-by: Bhargav Dodla <bdodla@expediagroup.com>
* fix: When given a string for a bytes entity/sort key filter, attempt to convert as base64 string (#285)
* fix: Improve Arrow to proto conversion to handle null and empty arrays (#286)
Co-authored-by: Bhargav Dodla <bdodla@expediagroup.com>
* feat: Add versioning information to the server and expose it via a ne… (#288)
* feat: Add versioning information to the server and expose it via a new endpoint 1. Version information is exposed via endpoints 2. It published information to datadog as well 3. Version information is printed on pod startup
* fix: fixed test failures
* fix: fixed test failures
* fix: added comment
* fix: update GetVersionInfo method and related proto definitions
* fix: set server type in version information for HTTP, gRPC, and hybrid servers
* fix: refactor Datadog version info publishing to a separate function
* fix: improve error handling for statsd client flush and close operations
---------
Co-authored-by: Bhargav Dodla <bdodla@expediagroup.com>
* fix: Add http int tests and fix http error codes (#287)
* fix: Update Feature View Not Catching SortedFeatureView/FeatureView Errors (#291)
* adding a check to see if HTTPStatusError is being raised within _handle_exception and if it is, to return the same.
* addressing feedback
* feat: Only call relevant registry endpoint when getting feature views… (#289)
* feat: Only call relevant registry endpoint when getting feature views by feature refs or service
* update reference to old method signatures
* fix integration tests
* dedupe feature refs like methods used to
* deleting test that tries to reject fvs passed to rq, no longer doing this check
* small changes based on PR feedback
* fix test
* maintain order of fvs and features
* define sorted feature service for integration tests
* reuse regular fv reponse json for sorted fv integration test
* add sort key to expected features in integration test
* reorder feature schema to match order in expected response
* add unit tests for GetSortedFeatureView methods
* add util methods needed to construct fs with ordering
* changes based on pr feedback
* fix response json order
* reorder request jsoons
* reuse common feature variables
* fix other json objects
* refactor old code to use list instead of map
* add odfvs to fvsToUse more explicitly
* refactor again
---------
Co-authored-by: omiranda <omiranda@expediagroup.com>
* feat: Streaming ingestion latency improvements (#292)
* feat: Streaming ingestion latency improvements
* feat: remove blank lines
* refactor process method
* feat: leaving 1 core is enough for multiprocessing
---------
Co-authored-by: Bhargav Dodla <13788369+EXPEbdodla@users.noreply.github.com>
Co-authored-by: Bhargav Dodla <bdodla@expediagroup.com>
Co-authored-by: Manisha Sudhir <30449541+Manisha4@users.noreply.github.com>
Co-authored-by: kpulipati29 <kpulipati29@gmail.com>
Co-authored-by: omirandadev <136642003+omirandadev@users.noreply.github.com>
Co-authored-by: omiranda <omiranda@expediagroup.com>
Co-authored-by: vanitabhagwat <92561664+vanitabhagwat@users.noreply.github.com>
Co-authored-by: vbhagwat <vbhagwat@expediagroup.com>
What this PR does / why we need it:
Streaming ingestion latencies were ranging from 10s to 30s beyond 1K TPS, which is not desirable. The purpose of this PR is to improve latency by parallelizing the time-consuming portions of the code using multiprocessing. Most of the latency comes from the arrow_to_proto conversion and online_write_batch. To address that, this PR splits the input dataframe / Arrow table into smaller inputs, which are then processed in parallel on multiple cores/processes using multiprocessing.
The number of processes spawned is (number of cores allocated to the Spark driver - 2), as I left a couple of cores for the OS and other background processes.
In this PR, multiprocessing kicks in only if the number of cores allocated to the Spark driver is > 3. Otherwise, say the number of driver cores is 3 and we leave 2 cores for the OS and other processes: we would be left with 1 core, which is the same as not having multiprocessing. Also, at lower TPS (like 1K TPS), where 2-3 driver cores are used, the latency looks fine without multiprocessing, so it is not needed there.
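The overall flow described above (split the input, fan the chunks out to worker processes) can be sketched as below. This is a simplified illustration under stated assumptions: split_table here is a plain-list stand-in for the PR's method, and process stands in for the per-chunk arrow_to_proto conversion plus online_write_batch.

```python
from multiprocessing import Pool

def split_table(num_chunks, rows):
    # Simplified stand-in for this PR's split_table(): divide rows into
    # at most num_chunks roughly equal pieces
    size = (len(rows) + num_chunks - 1) // num_chunks
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def process(chunk):
    # Stand-in for per-chunk conversion + online store write; returns the
    # number of rows handled so the parent can verify completion
    return len(chunk)

if __name__ == "__main__":
    rows = list(range(10))
    chunks = split_table(4, rows)
    # Each chunk is handled by its own worker process
    with Pool(processes=len(chunks)) as pool:
        written = pool.map(process, chunks)
    print(sum(written))  # 10 rows processed in total
```

The real code additionally caps the process count at table.num_rows and falls back to the serial path when too few driver cores are available, as discussed in the review threads above.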
Perf test results with multiprocessing:
https://expediagroup.atlassian.net/wiki/spaces/PE/pages/790170632/Range+Query+-+Streaming+ingestion+load+test+results-+ScyllaDB#Performance-test-results-using-the-above-proposed-solution-(multiprocessing)%3A