SALURBAL Temperature v1

Loads the raw block into CCUH as an ETL block and DBT source.

1. Setup

First let's load dependencies.

```r
library(pacman)
p_load(tidyverse, arrow, cli, reactable, htmltools)
```

Import Pipeline Object
This pipeline's documentation page is found at SALURBAL Temperature v1 - page. Let's declare some local objects to reuse throughout this pipeline. The Pipeline Object below was copied from the SALURBAL Temperature v1 - page.
```r
## Pipeline Object
pipeline_instance = list(
  id = 'salurbal_TMPDAILY__v1'
)
```

Import upstream blocks
We just need one upstream block:
- SALURBAL Renovate Bundle v1 - page
  - a data parquet
  - a metadata parquet
  - we will import PA and GT and return a denormalized cube
```r
db_data = arrow::open_dataset('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/raw__salurbal_TMPDAILY__v1/TMPDAILY_v1.0_internal.parquet') %>%
  filter(iso2 %in% c('PA','GT')) %>%
  collect()
df_metadata = arrow::read_parquet('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/raw__salurbal_TMPDAILY__v1/TMPDAILY_v1.0_metadata.parquet')
df_raw = db_data %>%
  left_join(df_metadata)
```
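A quick sanity check worth running here (a sketch, not part of the original pipeline): the metadata join should attach columns without fanning out rows, so the row count must be unchanged.

```r
# Sketch: left_join() against metadata should not duplicate data rows.
stopifnot(nrow(df_raw) == nrow(db_data))
```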
2. Pre-processing

Here we do some EDA and any pre-processing needed.
2.1 Pre-process
We do three things:
- create a date column from component pieces
- type cast value as numeric
- hard-code the version (as this was not exported explicitly in the renovation)
```r
df_processed = df_raw %>%
  mutate(
    date = paste(month, day, year) %>%
      as.Date(format = "%m %d %Y"),
    value = as.numeric(value),
    version = 'v1.0')
```
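As a hedged aside (not part of the original pipeline), here is a quick check that the date construction behaves as intended: the format string `"%m %d %Y"` should parse the pasted components, and rows that fail to parse show up as NA.

```r
# Sketch: the format string parses space-separated month/day/year.
stopifnot(as.Date(paste(6, 15, 2000), format = "%m %d %Y") == as.Date("2000-06-15"))
# Rows where parsing failed become NA; expect zero.
sum(is.na(df_processed$date))
```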
2.2 EDA

Let's first do some EDA and visualize L1 city 102115.
```r
df_processed %>%
  filter(str_detect(salid,'203101')) %>%
  mutate(geo_tmp = paste0(geo, ' - ', salid)) %>%
  select(geo_tmp, salid, date, var_name, value) %>%
  ggplot(aes(date, value, color = var_name)) +
  facet_wrap(~geo_tmp) +
  geom_line() +
  labs(title = "Daily Temperature for SALID 102115",
       x = "Date",
       y = "Temperature (C)")
```

Looks okay to me. Let's do a quick inventory of availability before loading into DBT and the CCUH block.
```r
df_processed %>%
  count(geo, iso2, year) %>%
  group_by(iso2, geo) %>%
  summarize(
    years = n(),
    min_year = min(year),
    max_year = max(year)
  ) %>%
  ungroup() %>%
  arrange(iso2)
```

| iso2 | geo | years | min_year | max_year |
|---|---|---|---|---|
| GT | L1AD | 20 | 1996 | 2015 |
| GT | L1UX | 20 | 1996 | 2015 |
| GT | L2 | 20 | 1996 | 2015 |
| PA | L1AD | 20 | 1996 | 2015 |
| PA | L1UX | 20 | 1996 | 2015 |
| PA | L2 | 20 | 1996 | 2015 |
So we have L1AD, L1UX and L2 data for 1996-2015 for both GT and PA. Let's take a quick look at the key metadata: source and public access.
```r
df_metadata %>%
  count(var_name, var_def, source, public) %>%
  reactable()
```

All data is public access and there are two sources, so access is not an issue. Looks good; let's proceed with the CCUH load.
3. Data + Metadata
Data looks okay. Note that for this phase of DBT we will experiment with sources that are semantically tidy as well as transactionally tidy, so we actually export three parquet files to the DBT source location:
- {block_id}__semantic.parquet (one row per observation)
- {block_id}__codebook.parquet (metadata on columns)
- {block_id}__transactional.parquet (one row per data point with all metadata wide)
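For reference, a minimal sketch of that naming scheme (illustrative only; the transactional table is written under the `__obt`, "one big table", suffix in the export paths of section 4.1):

```r
# Sketch: DBT source file names are derived from the block id.
block_id = pipeline_instance$id
paste0(block_id, c('__semantic', '__codebook', '__obt'), '.parquet')
```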
3.1 Transactional data
Note that since this is a renovated SALURBAL dataset, the renovation work has already been done upstream, so we just export it as is.
```r
df_transactional = df_processed

# Reusable column definition: truncate long text with an ellipsis
# and show the full value on hover.
truncated_col = colDef(
  cell = function(value) {
    div(style = list(
      maxWidth = "200px", # Adjust the width as needed
      whiteSpace = "nowrap",
      overflow = "hidden",
      textOverflow = "ellipsis"
    ), title = value, value)
  }
)

head(df_transactional) %>%
  reactable(
    columns = list(
      dataset_notes = truncated_col,
      var_def = truncated_col,
      source = truncated_col,
      acknowledgements = truncated_col
    )
  )
```

3.2 Semantic data
The semantic layer will be tidy with one row per observation, which for SALURBAL means one row per salid per day for this daily dataset, with the absolute minimum of surrogate keys.
```r
df_semantic = df_transactional %>%
  select(var_name, iso2, strata_id, geo, salid, date, value) %>%
  pivot_wider(names_from = var_name, values_from = value) %>%
  arrange(geo, iso2, salid, date)
head(df_semantic) %>%
  reactable()
```
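One guard worth considering here (a sketch, not in the original code): `pivot_wider()` falls back to list-columns (with a warning) when the id columns do not uniquely identify each var_name, so confirm the result is fully atomic and keyed as expected.

```r
# Sketch: list-columns would indicate duplicate (key, var_name) pairs upstream.
stopifnot(!any(purrr::map_lgl(df_semantic, is.list)))
# The composite keys should uniquely identify each row.
stopifnot(nrow(df_semantic) ==
            nrow(distinct(df_semantic, iso2, strata_id, geo, salid, date)))
```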
3.3 Semantic codebook

Here is a codebook for our semantic data. Let's start with defining our SALURBAL composite keys.
```r
df_composite_keys_codebook = tribble(
  ~column_name, ~column_description, ~column_type,
  "iso2", "ISO 3166-1 alpha-2 country code", "character",
  "strata_id", "ID of the strata", "character",
  "geo", "Geographical information", "character",
  "salid", "SALURBAL ID", "character",
  "date", "Date of the observation", "date"
)
df_composite_keys_codebook %>%
  reactable()
```

Next let's define our value columns. Note that in addition to traditional DBT metadata we will also include custom metadata pointing to additional documentation.
```r
df_data_codebook = df_metadata %>%
  select(column_name = var_name,
         column_description = var_def) %>%
  distinct() %>%
  mutate(column_type = 'numeric',
         scope = "SALURBAL renovated",
         current_block = 'https://drexel-ccuh.notion.site/SALURBAL-Daily-Temperature-v1-2a927860bba2422d9ce87934868bd654',
         upstream_block = 'https://drexel-ccuh.notion.site/Raw-SALURBAL-Daily-Temperature-v1-bd313dc207a1438db5cb3a30df032f1b?pvs=74')
df_data_codebook %>%
  reactable()
```

Let's bind these together for our final semantic codebook.
```r
df_codebook = bind_rows(df_composite_keys_codebook, df_data_codebook) %>%
  mutate(block_id = pipeline_instance$id)
```
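As a hedged completeness check (a sketch, not in the original code), every column in df_semantic should have a codebook entry.

```r
# Sketch: columns missing from the codebook; expect character(0).
setdiff(names(df_semantic), df_codebook$column_name)
```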
4. Export

4.1 Export to Server
We copied this downstream block’s Export Code property.
```r
## Downstream Object
paths_salurbal_TMPDAILY__v1 = list(
  id = 'salurbal_TMPDAILY__v1',
  path = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1.parquet',
  path_obt = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__obt.parquet',
  path_semantic = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__semantic.parquet',
  path_codebook = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__codebook.parquet',
  path_dbt_obt = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__obt.parquet',
  path_dbt_semantic = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__semantic.parquet',
  path_dbt_codebook = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__codebook.parquet'
)

## Export Function
export_block = function(df, path){
  if (file.exists(path)) file.remove(path)
  df %>% arrow::write_parquet(path)
}
```
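Before running the exports below, one optional guard (a sketch, not in the original code): write_parquet will error rather than create a missing directory, so fail fast if any destination folder is absent.

```r
# Sketch: confirm every destination directory exists before exporting.
# paths_salurbal_TMPDAILY__v1[-1] drops the non-path 'id' element.
paths_salurbal_TMPDAILY__v1[-1] %>%
  purrr::walk(~ stopifnot(dir.exists(dirname(.x))))
```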
```r
## Exports
export_block(df_transactional, paths_salurbal_TMPDAILY__v1$path_obt)
export_block(df_codebook, paths_salurbal_TMPDAILY__v1$path_codebook)
export_block(df_semantic, paths_salurbal_TMPDAILY__v1$path_semantic)
export_block(df_transactional, paths_salurbal_TMPDAILY__v1$path_dbt_obt)
export_block(df_codebook, paths_salurbal_TMPDAILY__v1$path_dbt_codebook)
export_block(df_semantic, paths_salurbal_TMPDAILY__v1$path_dbt_semantic)
```

4.2 Test Export
```r
## Import salurbal_TMPDAILY__v1
df_salurbal_TMPDAILY__v1 = arrow::read_parquet('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__semantic.parquet')
```
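To close the loop, a hedged round-trip check (a sketch; compares the re-import against the in-memory semantic table):

```r
# Sketch: the re-imported semantic parquet should match what was exported.
stopifnot(identical(names(df_salurbal_TMPDAILY__v1), names(df_semantic)))
stopifnot(nrow(df_salurbal_TMPDAILY__v1) == nrow(df_semantic))
```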