5-13-24 - Complex parquet deployment to GitHub

Author

Ran Li

Published

May 12, 2024

## Dependencies
library(pacman) 
pacman::p_load(tidyverse, arrow)

The goal here is to test the feasibility of using GitHub as a publicly accessible parquet host. If we have a .parquet file hosted on a public GitHub repository, can we work with it directly?

## File Access

We found a publicly hosted parquet file on GitHub. Let's see if we can access it.

parquet_github_raw = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'

## Import

Indeed we can import! Exciting!!

dfa = parquet_github_raw %>%
  arrow::read_parquet()
dfa %>% slice(1)
deal_id book counterparty commodity_name commodity_code executed_date first_delivery_date last_delivery_date last_trading_date volume buy_sell trading_unit tenor delivery_window strategy
0 Book_7 Counterparty_3 api2coal ATW 2021-03-07 11:50:24 2022-01-01 2022-12-31 2021-12-31 23000 sell MT year Cal 22 NA

## Scan

tb = parquet_github_raw %>%
  arrow::open_dataset()

Mmm, this workflow doesn't work on a parquet stored on GitHub. Here's the error:

Error: Invalid: Unrecognized filesystem type in URI: https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq

Makes sense: you need a filesystem to run the query on. Storage solutions such as S3 and Azure Blob Storage offer this, but GitHub does not!
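One possible workaround (a sketch, untested here): download the file to a local temp path first, then point `open_dataset()` at the local filesystem. The `tmp` path below is hypothetical.

```r
# Sketch of a workaround: materialize the file locally, then scan it.
# (tmp is a throwaway temp path; this assumes network access.)
tmp = tempfile(fileext = '.parquet')
download.file(parquet_github_raw, tmp, mode = 'wb')
tb = arrow::open_dataset(tmp)
```

Of course this pulls down the whole file, so we lose the lazy-scan benefit that `open_dataset()` gives on a real filesystem like S3.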

## File Complexity

Let's initialize a tibble with list-columns, like the named lists we sometimes use.

df_etl = tibble(
  version = '0.1'
) %>% 
  mutate(
    vec1 = list(1:3),
    vec2 = list(c('a', 'b', 'c')),
    df1 = list(tibble(a = 1:3, b = c('a', 'b', 'c')))
  )
  
df_etl
version vec1 vec2 df1
0.1 1, 2, 3 a, b, c 1, 2, 3, a, b, c

Let's see if we can save it as parquet.

df_etl %>%
  arrow::write_parquet('df_etl.parquet')

Read it in?

dfa = arrow::read_parquet('df_etl.parquet')
dfa
version vec1 vec2 df1
0.1 1, 2, 3 a, b, c 1, 2, 3, a, b, c

We definitely can read it in… but is the data structure still usable?

dfa$vec1
<list<integer>[1]>
[[1]]
[1] 1 2 3
dfa$vec2
<list<character>[1]>
[[1]]
[1] "a" "b" "c"
dfa$df1[[1]]
a b
1 a
2 b
3 c

Well, everything works. The one downside is that each element still needs to be unlisted. Maybe we can process this into a named list? OMG, this is possible…

etl = dfa %>% unlist(use.names = T, recursive = F)
etl
$version
[1] "0.1"

$vec1
[1] 1 2 3

$vec2
[1] "a" "b" "c"

$df1
# A tibble: 3 × 2
      a b    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
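As a quick sanity check (a sketch; it assumes arrow round-trips these types exactly), we can compare the rebuilt list against the original values:

```r
# Sanity check sketch: the unlisted elements should match what we wrote.
stopifnot(identical(etl$vec1, 1:3))
stopifnot(identical(etl$vec2, c('a', 'b', 'c')))
stopifnot(isTRUE(all.equal(etl$df1, tibble(a = 1:3, b = c('a', 'b', 'c')))))
```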

This is awesome. We can now deploy a complex named list as a parquet file to GitHub.
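Putting it together, the full workflow might look like the sketch below; the repo path and raw URL are placeholders, not a real deployment.

```r
# End-to-end sketch (hypothetical repo and URL):
# 1. Serialize the complex tibble locally.
df_etl %>% arrow::write_parquet('df_etl.parquet')

# 2. Commit and push df_etl.parquet to a public GitHub repository (outside R).

# 3. Any R session can then rebuild the named list from the raw URL.
url = 'https://raw.githubusercontent.com/<user>/<repo>/main/df_etl.parquet'
etl = arrow::read_parquet(url) %>%
  unlist(use.names = TRUE, recursive = FALSE)
```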