SALURBAL Temperature v1

Loads a raw block into CCUH as an ETL block and DBT source.

1. Setup

First let's load dependencies.

```r
library(pacman)
p_load(tidyverse, arrow, cli, reactable, htmltools)
```
Import Pipeline Object

This pipeline's documentation page is SALURBAL Temperature v1 - page. Let's declare some local objects to reuse throughout this pipeline. The Pipeline Object below was copied from the SALURBAL Temperature v1 - page.

```r
## Pipeline Object
pipeline_instance = list(
  id = 'salurbal_TMPDAILY__v1'
)
```
Import upstream blocks

We just need one upstream block:

- SALURBAL Renovate Bundle v1 - page
  - a data parquet
  - a metadata parquet
  - we will import PA and GT and return a denormalized cube
```r
db_data = arrow::open_dataset('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/raw__salurbal_TMPDAILY__v1/TMPDAILY_v1.0_internal.parquet') %>%
  filter(iso2 %in% c('PA','GT')) %>%
  collect()

df_metadata = arrow::read_parquet('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/raw__salurbal_TMPDAILY__v1/TMPDAILY_v1.0_metadata.parquet')

df_raw = db_data %>%
  left_join(df_metadata)
```
2. Pre-processing

Here we do some EDA and any pre-processing needed.

2.1 Pre-process

We do three things:

- create a date column from its component pieces
- type cast value as numeric
- hard code the version (as this was not exported explicitly in the renovation)
```r
df_processed = df_raw %>%
  mutate(
    date = paste(month, day, year) %>%
      as.Date(format = "%m %d %Y"),
    value = as.numeric(value),
    version = 'v1.0')
```
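As a quick sanity check on the step above, we can confirm that no dates failed to parse and that value is now numeric. This is a minimal sketch, assuming `df_processed` as built above:

```r
# Sanity check: no dates should have failed to parse, and value should be numeric
stopifnot(
  !any(is.na(df_processed$date)),
  is.numeric(df_processed$value)
)
```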
2.2 EDA

Let's first do some EDA and visualize L1 city 102115.
```r
df_processed %>%
  filter(str_detect(salid, '102115')) %>%
  mutate(geo_tmp = paste0(geo, ' - ', salid)) %>%
  select(geo_tmp, salid, date, var_name, value) %>%
  ggplot(aes(date, value, color = var_name)) +
  facet_wrap(~geo_tmp) +
  geom_line() +
  labs(title = "Daily Temperature for SALID 102115",
       x = "Date",
       y = "Temperature (C)")
```
Looks okay to me. Let's do a quick inventory of availability before loading into DBT and the CCUH block.
```r
df_processed %>%
  count(geo, iso2, year) %>%
  group_by(iso2, geo) %>%
  summarize(
    years = n(),
    min_year = min(year),
    max_year = max(year)
  ) %>%
  ungroup() %>%
  arrange(iso2)
```
| iso2 | geo  | years | min_year | max_year |
|------|------|-------|----------|----------|
| GT   | L1AD | 20    | 1996     | 2015     |
| GT   | L1UX | 20    | 1996     | 2015     |
| GT   | L2   | 20    | 1996     | 2015     |
| PA   | L1AD | 20    | 1996     | 2015     |
| PA   | L1UX | 20    | 1996     | 2015     |
| PA   | L2   | 20    | 1996     | 2015     |
So we have L1AD, L1UX, and L2 data for 1996-2015 for both GT and PA. Quick look at the key metadata: source + public.
```r
df_metadata %>%
  count(var_name, var_def, source, public) %>%
  reactable()
```
All data is public access and there are two sources, so access is not an issue. Looks good; let's proceed with the CCUH block.
3. Data + Metadata

Data looks okay. Note that for this phase of DBT we will experiment with sources that are semantically tidy as well as transactionally tidy, so we actually export three parquet files to the DBT source location (naming sketched below):

- `{block_id}__semantic.parquet` (one row per observation)
- `{block_id}__codebook.parquet` (metadata on columns)
- `{block_id}__transactional.parquet` (one row per data point with all metadata wide)
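For illustration, here is how those file names follow from the pipeline object. `block_file` is a hypothetical helper for this sketch, not part of the pipeline:

```r
# Hypothetical helper showing the {block_id}__{layer}.parquet naming convention
block_file = function(block_id, layer) {
  paste0(block_id, '__', layer, '.parquet')
}

block_file(pipeline_instance$id, 'semantic')
#> [1] "salurbal_TMPDAILY__v1__semantic.parquet"
```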
3.1 Transactional data

Since this is a renovated SALURBAL dataset, it is already in transactional form, so we just export it as is.

```r
df_transactional = df_processed
```
```r
# Render long text columns as fixed-width cells with ellipsis overflow;
# the full value is still available as a hover tooltip via the title attribute.
ellipsis_cell = function(value) {
  div(style = list(
        maxWidth = "200px",  # adjust the width as needed
        whiteSpace = "nowrap",
        overflow = "hidden",
        textOverflow = "ellipsis"
      ),
      title = value, value)
}

head(df_transactional) %>%
  reactable(
    columns = list(
      dataset_notes = colDef(cell = ellipsis_cell),
      var_def = colDef(cell = ellipsis_cell),
      source = colDef(cell = ellipsis_cell),
      acknowledgements = colDef(cell = ellipsis_cell)
    )
  )
```
3.2 Semantic data

The semantic layer will be tidy with one row per observation, which for this daily SALURBAL dataset means one row per salid per day, with the absolute minimum of surrogate keys.

```r
df_semantic = df_transactional %>%
  select(var_name, iso2, strata_id, geo, salid, date, value) %>%
  pivot_wider(names_from = var_name, values_from = value) %>%
  arrange(geo, iso2, salid, date)
```
```r
head(df_semantic) %>%
  reactable()
```
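To confirm the intended grain (one row per salid per day), we can check that the composite keys uniquely identify each row. A minimal sketch, using the key columns selected above:

```r
# Expect zero rows: any output here means duplicate composite keys
df_semantic %>%
  count(iso2, strata_id, geo, salid, date) %>%
  filter(n > 1)
```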
3.3 Semantic codebook

Here is a codebook for our semantic data. Let's start with defining our SALURBAL composite keys.

```r
df_composite_keys_codebook = tribble(
  ~column_name, ~column_description, ~column_type,
  "iso2", "ISO 3166-1 alpha-2 country code", "character",
  "strata_id", "ID of the strata", "character",
  "geo", "Geographical information", "character",
  "salid", "SALURBAL ID", "character",
  "date", "Date of the observation", "date"
)
```
```r
df_composite_keys_codebook %>%
  reactable()
```
Next let's define our value columns. Note that in addition to traditional DBT metadata we will also include custom metadata pointing towards additional documentation.

```r
df_data_codebook = df_metadata %>%
  select(column_name = var_name,
         column_description = var_def) %>%
  distinct() %>%
  mutate(column_type = 'numeric',
         scope = "SALURBAL renovated",
         current_block = 'https://drexel-ccuh.notion.site/SALURBAL-Daily-Temperature-v1-2a927860bba2422d9ce87934868bd654',
         upstream_block = 'https://drexel-ccuh.notion.site/Raw-SALURBAL-Daily-Temperature-v1-bd313dc207a1438db5cb3a30df032f1b?pvs=74')
```
```r
df_data_codebook %>%
  reactable()
```
Let's bind these together for our final semantic codebook.

```r
df_codebook = bind_rows(df_composite_keys_codebook, df_data_codebook) %>%
  mutate(block_id = pipeline_instance$id)
```
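As a quick completeness check, every column in df_semantic should have a codebook entry. A minimal sketch, assuming the objects built above:

```r
# Expect character(0): any names returned here are undocumented columns
setdiff(names(df_semantic), df_codebook$column_name)
```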
4. Export

4.1 Export to Server

We copied this downstream block's Export Code property.

```r
## Downstream Object
paths_salurbal_TMPDAILY__v1 = list(
  id = 'salurbal_TMPDAILY__v1',
  path = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1.parquet',
  path_obt = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__obt.parquet',
  path_semantic = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__semantic.parquet',
  path_codebook = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__codebook.parquet',
  path_dbt_obt = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__obt.parquet',
  path_dbt_semantic = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__semantic.parquet',
  path_dbt_codebook = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__codebook.parquet'
)
```
## Export Function
= function(df, path){
export_block if (file.exists(path)) file.remove(path)
%>% arrow::write_parquet(path)
df
}
```r
## Exports
export_block(df_transactional, paths_salurbal_TMPDAILY__v1$path_obt)
export_block(df_codebook, paths_salurbal_TMPDAILY__v1$path_codebook)
export_block(df_semantic, paths_salurbal_TMPDAILY__v1$path_semantic)
export_block(df_transactional, paths_salurbal_TMPDAILY__v1$path_dbt_obt)
export_block(df_codebook, paths_salurbal_TMPDAILY__v1$path_dbt_codebook)
export_block(df_semantic, paths_salurbal_TMPDAILY__v1$path_dbt_semantic)
```
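A quick way to confirm the writes landed is to check that each target file now exists. A minimal sketch over the freeze paths from the object above:

```r
# Expect all TRUE: each exported parquet file should now exist on the server
file.exists(c(
  paths_salurbal_TMPDAILY__v1$path_obt,
  paths_salurbal_TMPDAILY__v1$path_semantic,
  paths_salurbal_TMPDAILY__v1$path_codebook
))
```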
4.2 Test Export

```r
## Import salurbal_TMPDAILY__v1
df_salurbal_TMPDAILY__v1 = arrow::read_parquet('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__semantic.parquet')
```
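Finally, a light round-trip check: the re-imported semantic table should have the same shape and columns as the df_semantic we exported. A minimal sketch:

```r
# Expect TRUE for both: same dimensions and same column names after the round trip
identical(dim(df_salurbal_TMPDAILY__v1), dim(df_semantic))
identical(names(df_salurbal_TMPDAILY__v1), names(df_semantic))
```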