SALURBAL Temperature v1

Loads raw blcok into CCUH as an ETL block and DBT source.

Author

Ran Li

Published

May 17, 2024

1. Setup

First let’s load dependencies.

library(pacman)
p_load(tidyverse, arrow, cli, reactable, htmltools)

Import Pipeline Object

This pipeline is documentation page is found at SALURBAL Temperature v1 - page. Lets declare some local objects to reuse through out this pipeline. Copied the Pipeline Object from the SALURBAL Temperature v1 - page

## Pipeline Object
pipeline_instance = list(
     id = 'salurbal_TMPDAILY__v1'
)

Import upstream blocks

We just need one upstream block blocks:

db_data  = arrow::open_dataset('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/raw__salurbal_TMPDAILY__v1/TMPDAILY_v1.0_internal.parquet') %>% 
  filter(iso2 %in% c('PA','GT')) %>% 
  collect()

df_metadata = arrow::read_parquet('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/raw__salurbal_TMPDAILY__v1/TMPDAILY_v1.0_metadata.parquet')

df_raw = db_data %>% 
  left_join(df_metadata)

2. Pre-processing

Here we import od EDA and do any pre-processing needed.

2.1 Pre-process

we do two things

  • create a date column from component pieces
  • type case value as numeric
  • hard code version (as this was not exported in the renovation explicitly)
df_processed = df_raw  %>% 
  mutate(
    date = paste(month, day, year) %>% 
      as.Date(format = "%m %d %Y"),
         value = as.numeric(value),
    version = 'v1.0') 

2.1 EDA

Lets first do some EDA first visualize L1 city 102115

df_processed %>%
  filter(str_detect(salid,'203101')) %>% 
  mutate(geo_tmp = paste0(geo, ' - ', salid)) %>% 
  select(geo_tmp, salid, date, var_name, value) %>% 
  ggplot(aes(date, value, color = var_name)) +
  facet_wrap(~geo_tmp) +
  geom_line() +
  labs(title = "Daily Temperature for SALID 102115",
       x = "Date",
       y = "Temperature (C)")

Looks okay to me. Lets do a quick invotentory of availablility before loaidng into DBT and CCUH block.

df_processed %>% 
  count(geo, iso2, year) %>% 
  group_by(iso2, geo) %>% 
  summarize(
    years = n(),
    min_year = min(year),
    max_year = max(year)
  ) %>% 
  ungroup() %>% 
  arrange(iso2)
iso2 geo years min_year max_year
GT L1AD 20 1996 2015
GT L1UX 20 1996 2015
GT L2 20 1996 2015
PA L1AD 20 1996 2015
PA L1UX 20 1996 2015
PA L2 20 1996 2015

so we have L1AD, L1UX and L2 data for 1996 - 2015 for GT and PA. Quick look at the key metadta - source + public.

df_metadata %>% 
  count(var_name, var_def, source, public) %>% 
  reactable()

All data is public access and there are two sources. Acecss is not an issue. looks good lets op for CCUH.

3. Data + Metadata

Data looks okay. Note taht for this phase of DBT we will experiement with sources that are semnatically tidy as well as transcational tidy. so we actually export three parquet files to the DBT source location.

  • {block_id}__semantic.parquet (one row per observation)
  • {block_id}__codebook.parquet (metadata on columns)
  • {block_id}__transcational.parquet (one row per data point with all metadata wide)

3.1 Transcational data

Note that since this is renovated sluarbl data set it is already renovated. So we just export this as is.

df_transactional = df_processed 


head(df_transactional) %>% 
  reactable(
    columns = list(
      dataset_notes = colDef(
        cell = function(value) {
          div(style = list(
            maxWidth = "200px",   # Adjust the width as needed
            whiteSpace = "nowrap",
            overflow = "hidden",
            textOverflow = "ellipsis"
          ), title = value, value)
        }
      ),
       var_def = colDef(
        cell = function(value) {
          div(style = list(
            maxWidth = "200px",   # Adjust the width as needed
            whiteSpace = "nowrap",
            overflow = "hidden",
            textOverflow = "ellipsis"
          ), title = value, value)
        }
      ),
       source = colDef(
        cell = function(value) {
          div(style = list(
            maxWidth = "200px",   # Adjust the width as needed
            whiteSpace = "nowrap",
            overflow = "hidden",
            textOverflow = "ellipsis"
          ), title = value, value)
        }
      ),
       acknowledgements = colDef(
        cell = function(value) {
          div(style = list(
            maxWidth = "200px",   # Adjust the width as needed
            whiteSpace = "nowrap",
            overflow = "hidden",
            textOverflow = "ellipsis"
          ), title = value, value)
        }
      )
    )
  )

2.2 Semantic data

Semantic layer will with tidy for one row per observation. which in the case of SALURBAL is one row per salid per day for this daily dataset with the absolute minimum of surrogate keys.

df_semantic = df_transactional %>% 
  select(var_name, iso2, strata_id, geo, salid, date, value) %>% 
  pivot_wider(names_from = var_name, values_from = value) %>% 
  arrange(geo, iso2, salid, date) 
  

head(df_semantic) %>% 
  reactable()

3.3 Semantic codebook

Here is a codebook for our semantic data. Lets start with defininig our SALURBAL composite keys.

df_composite_keys_codebook =  tribble(
  ~column_name, ~column_description, ~column_type,
  "iso2", "ISO 3166-1 alpha-2 country code", "character",
  "strata_id", "ID of the strata", "character",
  "geo", "Geographical information", "character",
  "salid", "SALURBAL ID", "character",
  "date", "Date of the observation", "date"
)

df_composite_keys_codebook %>% 
  reactable()

Nexts lets define our value columns. Note that inaddition to traditional DBT metadata we will alslo include custom metadata pointing towards additional documentation.

df_data_codebook = df_metadata %>% 
  select(column_name = var_name, 
         column_description = var_def) %>%
  distinct() %>% 
  mutate(column_type = 'numeric',
         scope = "SALURBAL renovated",
         current_block = 'https://drexel-ccuh.notion.site/SALURBAL-Daily-Temperature-v1-2a927860bba2422d9ce87934868bd654',
         upstream_block = 'https://drexel-ccuh.notion.site/Raw-SALURBAL-Daily-Temperature-v1-bd313dc207a1438db5cb3a30df032f1b?pvs=74')

df_data_codebook %>% 
  reactable()

Lets bind together for our final semantic codebook

df_codebook = bind_rows(df_composite_keys_codebook, df_data_codebook) %>% 
  mutate(block_id = pipeline_instance$id)

4. Export

4.1 Export to Server

We copied this downstream block’s Export Code property.

## Downstream Object
paths_salurbal_TMPDAILY__v1 = list(
   id = 'salurbal_TMPDAILY__v1',
   path = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1.parquet',
   path_obt = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__obt.parquet',
   path_semantic = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__semantic.parquet',
   path_codebook = '//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__codebook.parquet',
     path_dbt_obt = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__obt.parquet',
   path_dbt_semantic = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__semantic.parquet',
   path_dbt_codebook = '//files.drexel.edu/encrypted/SOPH/UHC/CCUH/DBT/ccuh-dbt/source/salurbal_TMPDAILY__v1__codebook.parquet'
)

## Export Function
export_block = function(df, path){
  if (file.exists(path)) file.remove(path)
  df %>% arrow::write_parquet(path)
}

## Exports
export_block(df_transactional, paths_salurbal_TMPDAILY__v1$path_obt)
export_block(df_codebook, paths_salurbal_TMPDAILY__v1$path_codebook)
export_block(df_semantic, paths_salurbal_TMPDAILY__v1$path_semantic)
export_block(df_transactional, paths_salurbal_TMPDAILY__v1$path_dbt_obt)
export_block(df_codebook, paths_salurbal_TMPDAILY__v1$path_dbt_codebook)
export_block(df_semantic, paths_salurbal_TMPDAILY__v1$path_dbt_semantic)

4.2 Test Export

## Import salurbal_TMPDAILY__v1
df_salurbal_TMPDAILY__v1 = arrow::read_parquet('//files.drexel.edu/colleges/SOPH/Shared/UHC/Projects/CCUH/freeze/v1.0/salurbal_TMPDAILY__v1/salurbal_TMPDAILY__v1__semantic.parquet')