rasterframes/python/docs/vector-data.pymd at develop · locationtech/rasterframes

History

123 lines (92 loc) · 5.25 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

# Vector Data

RasterFrames provides a variety of ways to work with spatial vector data (points, lines, and polygons) alongside raster data.

* DataSource for GeoJSON format

* Ability to convert between from [GeoPandas][GeoPandas] and Spark DataFrames

* In PySpark, geometries are [Shapely][Shapely] objects, providing a great deal of interoperability

* Many Spark functions for working with columns of geometries

* Vector data is also the basis for @ref:[zonal map algebra](zonal-algebra.md) operations.

```python, setup, echo=False

import pyrasterframes

from pyrasterframes.utils import create_rf_spark_session

import pyrasterframes.rf_ipython

import geopandas

import folium

spark = create_rf_spark_session('local[2]')

```

## GeoJSON DataSource

```python, read_geojson

from pyspark import SparkFiles

admin1_us_url = 'https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson'

spark.sparkContext.addFile(admin1_us_url) # this lets us read http scheme uri's in spark

df = spark.read.geojson(SparkFiles.get('admin1-us.geojson'))

df.printSchema()

```

The properties of each discrete geometry are available as columns of the DataFrame, along with the geometry itself.

## GeoPandas and RasterFrames

You can also convert a [GeoPandas][GeoPandas] GeoDataFrame to a Spark DataFrame, preserving the geometry column. This means that any vector format that can be read with [OGR][OGR] can be converted to a Spark DataFrame. In the example below, we expect the same schema as the DataFrame defined above by the GeoJSON reader. Note that in a GeoPandas DataFrame there can be heterogeneous geometry types in the column, which may fail Spark's schema inference.

```python, read_and_normalize

import geopandas

from shapely.geometry import MultiPolygon

def poly_or_mp_to_mp(g):

""" Normalize polygons or multipolygons to all be multipolygons. """

if isinstance(g, MultiPolygon):

return g

else:

return MultiPolygon([g])

gdf = geopandas.read_file(admin1_us_url)

gdf.geometry = gdf.geometry.apply(poly_or_mp_to_mp)

df2 = spark.createDataFrame(gdf)

df2.printSchema()

```

## Shapely Geometry Support

The `geometry` column will have a Spark user-defined type that is compatible with [Shapely][Shapely] when working with Python via PySpark. This means that when the data is collected to the driver, it will be a Shapely geometry object.

```python, show_geom

the_first = df.first()

print(type(the_first['geometry']))

```

Since it is a geometry we can do things like this:

```python, show_wkt

the_first['geometry'].wkt

```

You can also write user-defined functions that take geometries as input, output, or both, via user defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple **but inefficient** example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry. Observe in a sample of the data the geometry columns print as well known text (wkt).

```python, add_centroid

from pyspark.sql.functions import udf

from geomesa_pyspark.types import PointUDT

@udf(PointUDT())

def inefficient_centroid(g):

return g.centroid

df.select(df.state_code, inefficient_centroid(df.geometry))

```

## GeoMesa Functions and Spatial Relations

As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.

```python, native_centroid

from pyrasterframes.rasterfunctions import st_centroid

df.select(df.state_code, inefficient_centroid(df.geometry), st_centroid(df.geometry))

```

The RasterFrames vector functions and GeoMesa functions also provide a variety of spatial relations that are useful in combination with the geometric properties of projected rasters. In this example, we use the @ref:[built-in Landsat catalog](raster-catalogs.md#using-built-in-experimental-catalogs) which provides an extent. We will convert the extent to a polygon and filter to those within approximately 50 km of a selected point.

```python, spatial_relation, evaluate=True

from pyrasterframes.rasterfunctions import st_geometry, st_bufferPoint, st_intersects, st_point

from pyspark.sql.functions import lit

l8 = spark.read.format('aws-pds-l8-catalog').load()

l8 = l8.withColumn('geom', st_geometry(l8.bounds_wgs84)) # extent to polygon

l8 = l8.withColumn('paducah', st_point(lit(-88.628), lit(37.072))) # col of points

l8_filtered = l8 \

.filter(st_intersects(l8.geom, st_bufferPoint(l8.paducah, lit(50000.0)))) \

.filter(l8.acquisition_date > '2018-02-01') \

.filter(l8.acquisition_date < '2018-03-11')

```

```python, folium, echo=False

geo_df = geopandas.GeoDataFrame(

l8_filtered.select('geom', 'bounds_wgs84').toPandas(),

crs='EPSG:4326',

geometry='geom')

# display as folium / leaflet map

m = folium.Map()

layer = folium.GeoJson(geo_df.to_json())

m.fit_bounds(layer.get_bounds())

m.add_child(layer)

```

[GeoPandas]: http://geopandas.org

[OGR]: https://gdal.org/drivers/vector/index.html

[Shapely]: https://shapely.readthedocs.io/en/latest/manual.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

vector-data.pymd

Latest commit

History

vector-data.pymd

File metadata and controls