-
Notifications
You must be signed in to change notification settings - Fork 48
Expand file tree
/
Copy pathvector-data.pymd
More file actions
123 lines (92 loc) · 5.25 KB
/
vector-data.pymd
File metadata and controls
123 lines (92 loc) · 5.25 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
# Vector Data
RasterFrames provides a variety of ways to work with spatial vector data (points, lines, and polygons) alongside raster data.
* DataSource for GeoJSON format
* Ability to convert between from [GeoPandas][GeoPandas] and Spark DataFrames
* In PySpark, geometries are [Shapely][Shapely] objects, providing a great deal of interoperability
* Many Spark functions for working with columns of geometries
* Vector data is also the basis for @ref:[zonal map algebra](zonal-algebra.md) operations.
```python, setup, echo=False
import pyrasterframes
from pyrasterframes.utils import create_rf_spark_session
import pyrasterframes.rf_ipython
import geopandas
import folium
spark = create_rf_spark_session('local[2]')
```
## GeoJSON DataSource
```python, read_geojson
from pyspark import SparkFiles
admin1_us_url = 'https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson'
spark.sparkContext.addFile(admin1_us_url) # this lets us read http scheme uri's in spark
df = spark.read.geojson(SparkFiles.get('admin1-us.geojson'))
df.printSchema()
```
The properties of each discrete geometry are available as columns of the DataFrame, along with the geometry itself.
## GeoPandas and RasterFrames
You can also convert a [GeoPandas][GeoPandas] GeoDataFrame to a Spark DataFrame, preserving the geometry column. This means that any vector format that can be read with [OGR][OGR] can be converted to a Spark DataFrame. In the example below, we expect the same schema as the DataFrame defined above by the GeoJSON reader. Note that in a GeoPandas DataFrame there can be heterogeneous geometry types in the column, which may fail Spark's schema inference.
```python, read_and_normalize
import geopandas
from shapely.geometry import MultiPolygon
def poly_or_mp_to_mp(g):
""" Normalize polygons or multipolygons to all be multipolygons. """
if isinstance(g, MultiPolygon):
return g
else:
return MultiPolygon([g])
gdf = geopandas.read_file(admin1_us_url)
gdf.geometry = gdf.geometry.apply(poly_or_mp_to_mp)
df2 = spark.createDataFrame(gdf)
df2.printSchema()
```
## Shapely Geometry Support
The `geometry` column will have a Spark user-defined type that is compatible with [Shapely][Shapely] when working with Python via PySpark. This means that when the data is collected to the driver, it will be a Shapely geometry object.
```python, show_geom
the_first = df.first()
print(type(the_first['geometry']))
```
Since it is a geometry we can do things like this:
```python, show_wkt
the_first['geometry'].wkt
```
You can also write user-defined functions that take geometries as input, output, or both, via user defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple **but inefficient** example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry. Observe in a sample of the data the geometry columns print as well known text (wkt).
```python, add_centroid
from pyspark.sql.functions import udf
from geomesa_pyspark.types import PointUDT
@udf(PointUDT())
def inefficient_centroid(g):
return g.centroid
df.select(df.state_code, inefficient_centroid(df.geometry))
```
## GeoMesa Functions and Spatial Relations
As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.
```python, native_centroid
from pyrasterframes.rasterfunctions import st_centroid
df.select(df.state_code, inefficient_centroid(df.geometry), st_centroid(df.geometry))
```
The RasterFrames vector functions and GeoMesa functions also provide a variety of spatial relations that are useful in combination with the geometric properties of projected rasters. In this example, we use the @ref:[built-in Landsat catalog](raster-catalogs.md#using-built-in-experimental-catalogs) which provides an extent. We will convert the extent to a polygon and filter to those within approximately 50 km of a selected point.
```python, spatial_relation, evaluate=True
from pyrasterframes.rasterfunctions import st_geometry, st_bufferPoint, st_intersects, st_point
from pyspark.sql.functions import lit
l8 = spark.read.format('aws-pds-l8-catalog').load()
l8 = l8.withColumn('geom', st_geometry(l8.bounds_wgs84)) # extent to polygon
l8 = l8.withColumn('paducah', st_point(lit(-88.628), lit(37.072))) # col of points
l8_filtered = l8 \
.filter(st_intersects(l8.geom, st_bufferPoint(l8.paducah, lit(50000.0)))) \
.filter(l8.acquisition_date > '2018-02-01') \
.filter(l8.acquisition_date < '2018-03-11')
```
```python, folium, echo=False
geo_df = geopandas.GeoDataFrame(
l8_filtered.select('geom', 'bounds_wgs84').toPandas(),
crs='EPSG:4326',
geometry='geom')
# display as folium / leaflet map
m = folium.Map()
layer = folium.GeoJson(geo_df.to_json())
m.fit_bounds(layer.get_bounds())
m.add_child(layer)
m
```
[GeoPandas]: http://geopandas.org
[OGR]: https://gdal.org/drivers/vector/index.html
[Shapely]: https://shapely.readthedocs.io/en/latest/manual.html