-
Notifications
You must be signed in to change notification settings - Fork 48
Expand file tree
/
Copy pathraster-join.pymd
More file actions
80 lines (58 loc) · 4.52 KB
/
raster-join.pymd
File metadata and controls
80 lines (58 loc) · 4.52 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
# Raster Join
```python, init, echo=False
from IPython.display import display
import pyrasterframes.rf_ipython
import pandas as pd
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
from pyspark.sql.functions import *
spark = create_rf_spark_session(**{
'spark.driver.memory': '4G',
'spark.ui.enabled': 'false'
})
```
## Description
A common operation for raster data is reprojecting or warping the data to a different @ref:[CRS][CRS] with a specific @link:[transform](https://gdal.org/user/raster_data_model.html#affine-geotransform) { open=new }. In many use cases, the particulars of the warp operation depend on another set of raster data. Furthermore, the warp is done to put both sets of raster data to a common set of grid to enable manipulation of the datasets together.
In RasterFrames, you can perform a **Raster Join** on two DataFrames containing raster data.
The operation will perform a _spatial join_ based on the [CRS][CRS] and [extent][extent] data in each DataFrame. By default it is a left join and uses an intersection operator.
For each candidate row, all _tile_ columns on the right hand side are warped to match the left hand side's [CRS][CRS], [extent][extent], and dimensions. Warping relies on GeoTrellis library code. You can specify the resampling method to be applied as one of: nearest_neighbor, bilinear, cubic_convolution, cubic_spline, lanczos, average, mode, median, max, min, or sum.
The operation is also an aggregate, with multiple intersecting right-hand side tiles `merge`d into the result. There is no guarantee about the ordering of tiles used to select cell values in the case of overlapping tiles.
When using the @ref:[`raster` DataSource](raster-join.md) you will automatically get the @ref:[CRS][CRS] and @ref:[extent][extent] information needed to do this operation.
## Example Code
Because the raster join is a distributed spatial join, indexing of both DataFrames using the [spatial index][spatial-index] is crucial for performance.
```python, example_raster_join
# Southern Mozambique December 29, 2016
modis = spark.read.raster('s3://astraea-opendata/MCD43A4.006/21/11/2016297/MCD43A4.A2016297.h21v11.006.2016306075821_B01.TIF',
spatial_index_partitions=True) \
.withColumnRenamed('proj_raster', 'modis')
landsat8 = spark.read.raster('https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/167/077/LC08_L1TP_167077_20161015_20170319_01_T1/LC08_L1TP_167077_20161015_20170319_01_T1_B4.TIF',
spatial_index_partitions=True) \
.withColumnRenamed('proj_raster', 'landsat')
rj = landsat8.raster_join(modis, resampling_method="cubic_convolution")
# Show some non-empty tiles
rj.select('landsat', 'modis', 'crs', 'extent') \
.filter(rf_data_cells('modis') > 0) \
.filter(rf_tile_max('landsat') > 0)
```
## Additional Options
The following optional arguments are allowed:
* `left_extent` - the column on the left-hand DataFrame giving the [extent][extent] of the tile columns
* `left_crs` - the column on the left-hand DataFrame giving the [CRS][CRS] of the tile columns
* `right_extent` - the column on the right-hand DataFrame giving the [extent][extent] of the tile columns
* `right_crs` - the column on the right-hand DataFrame giving the [CRS][CRS] of the tile columns
* `join_exprs` - a single column expression as would be used in the [`on` parameter of `join`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join)
* `resampling_method` - resampling algorithm to use in reprojection of right-hand tile column
Note that the `join_exprs` will override the join behavior described above. By default the expression is equivalent to:
```python, join_expr, evaluate=False
st_intersects(
st_geometry(left[left_extent]),
st_reproject(st_geometry(right[right_extent]), right[right_crs], left[left_crs])
)
```
Resampling method to use can be specified by passing one of the following strings into `resampling_method` parameter.
The point resampling methods are: `"nearest_neighbor"`, `"bilinear"`, `"cubic_convolution"`, `"cubic_spline"`, and `"lanczos"`.
The aggregating resampling methods are: `"average"`, `"mode"`, `"median"`, `"max"`, "`min`", or `"sum"`.
Note the aggregating methods are intended for downsampling. For example a 0.25 factor and `max` method returns the maximum value in a 4x4 neighborhood.
[CRS]: concepts.md#coordinate-reference-system--crs
[extent]: concepts.md#extent
[spatial-index]:raster-read.md#spatial-indexing-and-partitioning