This aims to reproduce the logic in TractZonalStatisticsDailyUSforSAM.pynb with Terra. For a single day and measure the process took around 17 minutes in the Pynb. Here we want to get a benchmark in Terra.
The general workflow is to:
import tract boundary
import PRISM raster
calculate zonal statistic for each tract unit
saves results
Lets first import some existing results as a reference point.
Reference results (ESRI)
We have preprocessed the ESRI results from a bunch of .sas7bdat files into a distributed database. Lets pull results from our sample: tract42101000500, tmean, 01/01/2019.
Okay. We now have our tract boundaries. Lets grab the PRISM data! We have already pre-processed all PRISM data into a standardized .tif for performance and to ensure correct projections. Lets pull this from that data lake.
Great! So now we can see that the zonal_stat for this tract. I mean its slightly different but for temperature being off by less than 0.1 degrees seems okay. But we will review this QC in an whole run … see next section.
That was quick for the 3000ish tracks in PA the computations took 4-5 seconds.
QC (PA)
Lets take a look at the correlation to old results for our PA results.
QC PA results with ESRI results
## Op. QC data structuresxwalk_esri_pa_results = df_reference %>%select(GEOID10, value_esri)df_qc_pa = df_results_terra_PA %>%rename(value_terra = value) %>%left_join(xwalk_esri_pa_results,by ='GEOID10')## Plotcorrelation_coefficient <-cor(df_qc_pa$value_terra, df_qc_pa$value_esri, use ="complete.obs")ggplot(df_qc_pa, aes(x = value_terra, y = value_esri)) +geom_point(alpha =0.6) +# Scatter plotgeom_smooth(method ="lm", se =FALSE, color ="blue") +# Fitted linetheme_minimal() +# Use a minimal theme for a nice looklabs(title ="PA/tdmean/2019-01-01: 3218 data points" ,subtitle =paste("Correlation coefficient:", round(correlation_coefficient, 6)),x ="Value Terra",y ="Value Esri") +stat_cor(method ="pearson", label.x =3) # Automatically add correlation coefficient text
We can see that the correlation coefficient is 0.9998. So we can see that the values are pretty much the same. Passes the initial smell test and could be passed to Steve and team for further validation.
Reproduce for whole whole unit (variable, date)
Heres the big test now. Lest try the benchmark for a whole unit of computation which is for every tract in the contigous US calculate zonal stats for a single measure - tdmean and single day 01/01/2019.
To the PA code we need to add one more layer of complexity to loop through all states. Lets turn the state specific computation into a function then just loop!
Function
Parallel runs are a little different as each function runs in its own environment, while global object detection usually works, we had some trouble so lets do a crated function just to explicitly define our dependencies to prevent issues in parallel processing. A few things crated function differs in
name scoping of all dependencies
explicitly defining object dependencies
SpatRaster and SpatVector objects are not able to be used in Parallelization (due to C++ stuff) so we need to import from disk during loops
So this function will work if the user has access to the SAM un-encrypted server where we are staging some pre-processed boudnaries.
Setup Function for state specific zonal calculation
## Create state functioncrate_calc_zonal_state =crate(sam_tiger_lake = sam_tiger_lake,sample = sample,function(state_tmp){## 1. Setup if (!file.exists(sample$prism_cached_SpatRaster)){file.copy(sample$prism_SpatRaster, sample$prism_cached_SpatRaster) } r = terra::rast(sample$prism_cached_SpatRaster)## 2. Get zones for state zone_path =file.path( sam_tiger_lake, glue::glue("tl_2010_{state_tmp}_tract10_albers")) z = terra::vect(zone_path)## 3. Calculate zonal stat df_zonal_stats = terra::zonal( r, z,fun="mean",na.rm=TRUE)## 4. Op dataframe result df_zonal_stats = dplyr::rename(df_zonal_stats, 'value'=1) df_results_terra_1 = tibble::as_tibble(terra::as.data.frame(z)) df_results_terra_2 = dplyr::select(df_results_terra_1, GEOID10) df_results_terra_3 = dplyr::mutate(df_results_terra_2,measure = sample$measure,date = sample$date) df_results_terra = dplyr::bind_cols(df_results_terra_3, df_zonal_stats)return(df_results_terra)})## Testhead(crate_calc_zonal_state('PA'))
GEOID10
measure
date
value
42003560500
tdmean
2019-01-01
7.104099
42003560400
tdmean
2019-01-01
7.127940
42003552400
tdmean
2019-01-01
7.067830
42003552300
tdmean
2019-01-01
7.072888
42003552200
tdmean
2019-01-01
7.199420
42003552100
tdmean
2019-01-01
7.067318
Benchmark
Legacy benchmark: 17 minutes
Terra benchmarks: UHC Workstation (max 6 cores)
1 core: 2.8 minutes
2 core: 1.5 minutes
4 cores: 1 minute
6 cores: minutes
1 core
So lets do this for a measure, day and all tracts. So essentially just loop through the state specific function.
For this computational block of ‘tmean’, ‘01/01/2019’ and all tracts in contiguous US lets validate the results compared to original results.
Code
## Combing dataxwalk_esri_results = df_reference %>%filter(date == sample$date, measure == sample$measure) %>%select(GEOID10, value_esri)df_qc_us = df_measure_day_results_4_core %>%bind_rows() %>%rename(value_terra = value) %>%left_join(xwalk_esri_pa_results, by ='GEOID10')# head(df_qc_us)## Get Correlationcorrelation_coefficient <-cor(df_qc_us$value_terra, df_qc_us$value_esri, use ="complete.obs")## Plot QCggplot(df_qc_us, aes(x = value_terra, y = value_esri)) +geom_point(alpha =0.6) +# Scatter plotgeom_smooth(method ="lm", se =FALSE, color ="blue") +# Fitted linetheme_minimal() +# Use a minimal theme for a nice looklabs(title ="USA/tmean/2019-01-01: 83571 data points",subtitle =paste("Correlation coefficient:", round(correlation_coefficient, 6)),x ="Value Terra",y ="Value Esri") +stat_cor(method ="pearson", label.x =3) # Automatically add correlation coefficient text
Summary
Clearly we see huge performance improvements - the 17 minute process in the legacy code is now down to 1 minute. Proposal for putting this into production.
Production function to accept arguments (measure, day, geogrpahic-level) for looping
Store results as Hive Partitioned similiar to reference results
Protocol for accessing partitioned PRISM zonal stats database
Create a flexible function for custom geographies (input is a an sf object)