-
Notifications
You must be signed in to change notification settings - Fork 48
Expand file tree
/
Copy pathtime-series.pymd
More file actions
106 lines (79 loc) · 4.72 KB
/
time-series.pymd
File metadata and controls
106 lines (79 loc) · 4.72 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
# Time Series
## Analysis Plan
```python setup, echo=False
from IPython.display import display
import pyrasterframes
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
import pyrasterframes.rf_ipython
from pyspark.sql.functions import udf, lit
from geomesa_pyspark.types import MultiPolygonUDT
# This job is more memory bound, so reduce the concurrent tasks.
spark = create_rf_spark_session("local[4]")
```
In this example, we will show how the flexibility of the DataFrame concept for raster data allows a simple and intuitive way to extract a time series from Earth observation data. We will continue our example from the @ref:[Zonal Map Algebra page](zonal-algebra.md).
We will summarize the change in @ref:[NDVI](local-algebra.md#computing-ndvi) over the spring and early summer of 2018 in the Cuyahoga Valley National Park in Ohio, USA.
```python vector, echo=False, results='hidden'
cat = spark.read.format('aws-pds-modis-catalog').load().repartition(200)
import requests
nps_filepath = '/tmp/parks.geojson'
nps_data_query_url = 'https://services1.arcgis.com/fBc8EJBxQRMcHlei/arcgis/rest/services/' \
'NPS_Park_Boundaries/FeatureServer/0/query' \
'?geometry=-82.451,41.075,-80.682,41.436&inSR=4326&outSR=4326&outFields=*&f=geojson'
r = requests.get(nps_data_query_url)
with open(nps_filepath,'wb') as f:
f.write(r.content)
park_vector = spark.read.geojson(nps_filepath)
@udf(MultiPolygonUDT())
def simplify(g, tol):
return g.simplify(tol)
park_vector = park_vector.withColumn('geo_simp', simplify('geometry', lit(0.001))) \
.select('geo_simp') \
.hint('broadcast')
```
## Catalog Read
As in our other example, we will query for a single known MODIS granule directly. We limit the vector data to the single park of interest. The longer time period selected should show the change in plant vigor as leaves emerge over the spring and into early summer. The definitions of `cat` and `park_vector` are as in the @ref:[Zonal Map Algebra page](zonal-algebra.md).
```python query_catalog
park_cat = cat \
.filter(
(cat.granule_id == 'h11v04') &
(cat.acquisition_date > lit('2018-02-19')) &
(cat.acquisition_date < lit('2018-07-01'))
) \
.crossJoin(park_vector.filter('UNIT_CODE == "CUVA"')) #only coyahuga
```
## Vector and Raster Data Interaction
We follow the same steps as the Zonal Map Algebra analysis: reprojecting the park geometry, filtering for intersection, rasterizing the geometry, and masking the NDVI by the _zone_ _tiles_. The code from that analysis is condensed here for reference.
```python raster_prep
raster_cols = ['B01', 'B02',] # red and near-infrared respectively
rf_park_tile = spark.read.raster(
park_cat.select(['acquisition_date', 'granule_id', 'geo_simp'] + raster_cols),
catalog_col_names=raster_cols) \
.withColumn('park_native', st_reproject('geo_simp', lit('EPSG:4326'), rf_crs('B01'))) \
.filter(st_intersects('park_native', rf_geometry('B01'))) \
.withColumn('dims', rf_dimensions('B01')) \
.withColumn('park_tile', rf_rasterize('park_native', rf_geometry('B01'), lit(1), 'dims.cols', 'dims.rows')) \
.withColumn('ndvi', rf_normalized_difference('B02', 'B01')) \
.withColumn('ndvi_masked', rf_mask('ndvi', 'park_tile'))
```
## Create Time Series
We next aggregate across the cell values to arrive at an average NDVI for each week of the year. We use `pyspark`'s built in `groupby` and time functions with a RasterFrames @ref:[aggregate function](aggregation.md) to do this. Note that the computation is creating a weighted average, which is weighted by the number of valid observations per week.
```python ndvi_time_series
from pyspark.sql.functions import col, year, weekofyear, month
time_series = rf_park_tile \
.groupby(
year('acquisition_date').alias('year'),
weekofyear('acquisition_date').alias('week')) \
.agg(rf_agg_mean('ndvi_masked').alias('ndvi'))
```
Finally, we will take a look at the NDVI over time.
```python time_series_display, evaluate=True
import matplotlib.pyplot as plt
time_series_pdf = time_series.toPandas()
time_series_pdf.sort_values('week', inplace=True)
plt.plot(time_series_pdf['week'], time_series_pdf['ndvi'], 'go-')
plt.xlabel('Week of year, 2018')
plt.ylabel('NDVI')
plt.title('Cuyahoga Valley NP Green-up')
```
We can see two fairly clear elbows in the curve at week 17 and week 21, indicating the start and end of the green up period. Estimation of such parameters is one technique [phenology](https://en.wikipedia.org/wiki/Phenology) researchers use to monitor changes in climate and environment.