Add rf_agg_approx_quantiles function#429
Conversation
Initial implementation with quantiles Signed-off-by: Jason T. Brown <jason@astraea.earth>
|
|
||
| describe("DataFrame.tileStats extension methods") { | ||
|
|
||
| val df = TestData.sampleGeoTiff.toDF() |
There was a problem hiding this comment.
would like to see a test of this with projected raster tile, raster ref, raster source etc. maybe in integration tests?
There was a problem hiding this comment.
We can get there with Expressions...
metasim
left a comment
There was a problem hiding this comment.
I working on some alternative approaches to discuss that I think will be more "idiomatic RasterFrames".
| self.columns.foldLeft(self)((df, c) ⇒ df.withColumnRenamed(c, s"$prefix$c").asInstanceOf[DF]) | ||
|
|
||
| /** */ | ||
| def tileStat(): RasterFrameStatFunctions = new RasterFrameStatFunctions(self) |
There was a problem hiding this comment.
Working on an alternative approach to discuss, one that takes an approach more in line with the other aggregate functions (i.e. rf_agg_foo_bar). While I know that Spark DataFrames uses this extension method approach in a couple of places (stats, na, ...) I think it adds an inconsistency that can feel arbitrary if the same thing can be achieved with columnar functions. na is a good example of one that can't, because it's effectively implementing a builder for NA handling, but I don't think were in that situation.
| * | ||
| * @since 2.0.0 | ||
| */ | ||
| def approxTileQuantile( |
There was a problem hiding this comment.
Is this copied? If so, we need to attribute and reference, etc.
|
|
||
| describe("DataFrame.tileStats extension methods") { | ||
|
|
||
| val df = TestData.sampleGeoTiff.toDF() |
There was a problem hiding this comment.
We can get there with Expressions...
| val result = df.tileStat().approxTileQuantile( | ||
| Array("tile", "tilePlus2"), | ||
| Array(0.25, 0.75), | ||
| 0.00001 |
There was a problem hiding this comment.
Is there any way to default this? Or is it strongly scale dependent?
…lt-in functions all the way to a custom aggregate expression.
|
See this for a couple alternate approaches that don't depend DataFrame extension methods. |
metasim
left a comment
There was a problem hiding this comment.
See alternate approaches and let's converge on one.
This reverts commit 456914d.
Alternate approaches to exposing quantile summaries API
Attempt to register with SQL Signed-off-by: Jason T. Brown <jason@astraea.earth>
|
@metasim tests failing on the SQL expression. Not sure how to work in all the different parameters. We could easily register some functions to compute default percentiles like median, quartiles, quantiles, deciles, etc at default relative error. |
|
@vpipkt I think enabling SQL support is going to take some more work, given the SQL function parameter requirements.... perhaps we need to look for something in the official API that uses non-columnar parameters to a function (does it exist)? For some reason I'm loath to make all of those parameters columnar, but perhaps that's the proper way to do it I think the current question is do we merge this without support for SQL, or do we work this out first? |
|
I am fine with a little divergence in the API between SQL and python/scala over this. Possible paths:
|
Signed-off-by: Jason T. Brown <jason@astraea.earth>
|
Latest commit should pass tests and is basically the option 1 above Recommend we add an issue for option 2, lower priority. |
Signed-off-by: Jason T. Brown <jason@astraea.earth>
| tile.foreachDouble(d => if (!isNoData(d)) result = result.insert(d)) | ||
| buffer.update(0, result.toRow) | ||
| } | ||
| else buffer |
There was a problem hiding this comment.
Compiler is complaining about this. Since the function returns Unit this line could be omitted?
There was a problem hiding this comment.
Yeh, that was me not thinking "mutably". It can go.
Signed-off-by: Jason T. Brown <jason@astraea.earth>
Initial implementation with quantiles
To Do