5-13-24 - Complex parquet deployment to GitHub

Author

Ran Li

Published

May 12, 2024

## Dependencies
library(pacman) 
pacman::p_load(tidyverse, arrow)

The goal here is to test the feasibility of using GitHub as a publicly accessible parquet host. If we have a .parquet file hosted on a public GitHub repository, can we work with it directly?

## File Access

We found a publicly hosted parquet file on GitHub. Let's see if we can access it.

parquet_github_raw = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'

## Import

Indeed we can import! Exciting!!

dfa = parquet_github_raw %>%
  arrow::read_parquet()
dfa %>% slice(1)
deal_id book counterparty commodity_name commodity_code executed_date first_delivery_date last_delivery_date last_trading_date volume buy_sell trading_unit tenor delivery_window strategy
0 Book_7 Counterparty_3 api2coal ATW 2021-03-07 11:50:24 2022-01-01 2022-12-31 2021-12-31 23000 sell MT year Cal 22 NA

## Scan

tb = parquet_github_raw %>%
  arrow::open_dataset()

Mmm, this workflow doesn't work on a parquet stored on GitHub. Here's the error:

Error: Invalid: Unrecognized filesystem type in URI: https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq

Makes sense: you need a filesystem to run the query on. Storage solutions such as S3 and Azure Blob Storage offer this, but GitHub does not!
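One possible workaround (a sketch, untested here): download the file to a local temp path first, then point `open_dataset()` at the local filesystem. The `tmp` path below is hypothetical.

```r
# Sketch of a workaround: materialize the file locally, then scan it.
# (tmp is a throwaway temp path; this assumes network access.)
tmp = tempfile(fileext = '.parquet')
download.file(parquet_github_raw, tmp, mode = 'wb')
tb = arrow::open_dataset(tmp)
```

Of course this pulls down the whole file, so we lose the lazy-scan benefit that `open_dataset()` gives on a real filesystem like S3.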

## File Complexity

Let's initialize a tibble with list-columns, like the named lists we sometimes use.

df_etl = tibble(
  version = '0.1'
) %>% 
  mutate(
    vec1 = list(1:3),
    vec2 = list(c('a', 'b', 'c')),
    df1 = list(tibble(a = 1:3, b = c('a', 'b', 'c')))
  )
  
df_etl
version vec1 vec2 df1
0.1 1, 2, 3 a, b, c 1, 2, 3, a, b, c

Let's see if we can save it as parquet.

df_etl %>%
  arrow::write_parquet('df_etl.parquet')

Read it in?

dfa = arrow::read_parquet('df_etl.parquet')
dfa
version vec1 vec2 df1
0.1 1, 2, 3 a, b, c 1, 2, 3, a, b, c

We definitely can read it in… but is the data structure still usable?

dfa$vec1
<list<integer>[1]>
[[1]]
[1] 1 2 3
dfa$vec2
<list<character>[1]>
[[1]]
[1] "a" "b" "c"
dfa$df1[[1]]
a b
1 a
2 b
3 c

Well, everything works. The one downside is that each element still needs to be unlisted. Maybe we can process this into a named list? OMG, this is possible…

etl = dfa %>% unlist(use.names = T, recursive = F)
etl
$version
[1] "0.1"

$vec1
[1] 1 2 3

$vec2
[1] "a" "b" "c"

$df1
# A tibble: 3 × 2
      a b    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
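As a quick sanity check (a sketch; it assumes arrow round-trips these types exactly), we can compare the rebuilt list against the original values:

```r
# Sanity check sketch: the unlisted elements should match what we wrote.
stopifnot(identical(etl$vec1, 1:3))
stopifnot(identical(etl$vec2, c('a', 'b', 'c')))
stopifnot(isTRUE(all.equal(etl$df1, tibble(a = 1:3, b = c('a', 'b', 'c')))))
```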

This is awesome. We can now deploy a complex named list as a parquet file to GitHub.
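Putting it together, the full workflow might look like the sketch below; the repo path and raw URL are placeholders, not a real deployment.

```r
# End-to-end sketch (hypothetical repo and URL):
# 1. Serialize the complex tibble locally.
df_etl %>% arrow::write_parquet('df_etl.parquet')

# 2. Commit and push df_etl.parquet to a public GitHub repository (outside R).

# 3. Any R session can then rebuild the named list from the raw URL.
url = 'https://raw.githubusercontent.com/<user>/<repo>/main/df_etl.parquet'
etl = arrow::read_parquet(url) %>%
  unlist(use.names = TRUE, recursive = FALSE)
```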