## Dependencies
```r
library(pacman)
pacman::p_load(tidyverse, arrow)
```
# 5-13-24 - Complex parquet deployment to GitHub
The goal here is to test the feasibility of utilizing GitHub as a publicly accessible parquet host. If we have a .parquet file hosted on a public GitHub repository…

- Access the file
  - ✅ `read_parquet()`: can we read the file into memory without downloading it?
  - ❌ `open_dataset()`: can we scan the file and query it out of memory?
- Complexity of the file
  - ✅ can we store a named list of dataframes or vectors as a parquet? This would be used to store ETL objects or commonly used workbench objects.
## File Access
We found a publicly hosted parquet file on GitHub. Let's see if we can access it.
```r
# Raw GitHub URL pattern: https://raw.githubusercontent.com/<user>/<repo>/<branch>/<path>
parquet_github_raw = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'
```
## Import
Indeed we can import! Exciting!!
```r
dfa = parquet_github_raw %>%
  arrow::read_parquet()

dfa %>% slice(1)
```
| deal_id | book | counterparty | commodity_name | commodity_code | executed_date | first_delivery_date | last_delivery_date | last_trading_date | volume | buy_sell | trading_unit | tenor | delivery_window | strategy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Book_7 | Counterparty_3 | api2coal | ATW | 2021-03-07 11:50:24 | 2022-01-01 | 2022-12-31 | 2021-12-31 | 23000 | sell | MT | year | Cal 22 | NA |
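Worth noting: since `read_parquet()` materializes the file in memory, trimming columns at read time can help. A small sketch using the `col_select` argument of `arrow::read_parquet()` (not benchmarked here; the whole file is presumably still fetched over HTTP, but only the selected columns get decoded into the data frame):

```r
# Read only the columns we need from the hosted parquet
deals_slim = parquet_github_raw %>%
  arrow::read_parquet(col_select = c(deal_id, volume, buy_sell))
```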
## Scan
```r
tb = parquet_github_raw %>%
  arrow::open_dataset()
```
Mmm, this workflow doesn't work on a parquet stored on GitHub. Here's the error:

```
Error: Invalid: Unrecognized filesystem type in URI: https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq
```
Makes sense that you need a filesystem to run the query on. While storage solutions such as S3 and Blob storage offer this, GitHub does not!
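One workaround, sketched below under the assumption that a local copy is acceptable: download the file first, then point `open_dataset()` at the local path, which is a filesystem Arrow does recognize. The local filename here is arbitrary.

```r
# Materialize a local copy, then scan it lazily
local_path = 'random_deals.parq'
download.file(parquet_github_raw, local_path, mode = 'wb')

tb = arrow::open_dataset(local_path)

# Now the usual out-of-memory query workflow applies
tb %>%
  filter(buy_sell == 'sell') %>%
  select(deal_id, volume) %>%
  collect()
```

Of course this gives up the "no download" goal; it just recovers the lazy query ergonomics once the bytes are local.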
## File Complexity
Let's initialize the kind of named list we sometimes use, packed into a one-row tibble of list-columns.
```r
df_etl = tibble(
  version = '0.1'
) %>%
  mutate(
    # wrap each object in list() so it becomes a one-element list-column
    vec1 = list(1:3),
    vec2 = list(c('a', 'b', 'c')),
    df1 = list(tibble(a = 1:3, b = c('a', 'b', 'c')))
  )

df_etl
```
| version | vec1 | vec2 | df1 |
|---|---|---|---|
| 0.1 | 1, 2, 3 | a, b, c | 1, 2, 3, a, b, c |
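As an aside, this pattern generalizes: any named list can be packed into a one-row tibble of list-columns before writing. A hypothetical helper (`list_to_row` is not part of arrow, just a sketch):

```r
# Pack an arbitrary named list into a one-row tibble of list-columns,
# ready for write_parquet()
list_to_row = function(x) {
  tibble::as_tibble(lapply(x, list))
}

list_to_row(list(version = '0.1', vec1 = 1:3))
```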
Let's see if we can save it as a parquet.
```r
df_etl %>%
  arrow::write_parquet('df_etl.parquet')
```
Read it in?
```r
dfa = arrow::read_parquet('df_etl.parquet')
dfa
```
| version | vec1 | vec2 | df1 |
|---|---|---|---|
| 0.1 | 1, 2, 3 | a, b, c | 1, 2, 3, a, b, c |
We definitely can read it in… but is the data structure still usable?
```r
dfa$vec1
```

```
<list<integer>[1]>
[[1]]
[1] 1 2 3
```
```r
dfa$vec2
```

```
<list<character>[1]>
[[1]]
[1] "a" "b" "c"
```
```r
dfa$df1[[1]]
```

| a | b |
|---|---|
| 1 | a |
| 2 | b |
| 3 | c |
Well, everything works. The one downside is that each object still needs to be unlisted before use. Maybe we can process this into a named list? OMG. This is possible…
```r
etl = dfa %>% unlist(use.names = T, recursive = F)

etl
```
```
$version
[1] "0.1"

$vec1
[1] 1 2 3

$vec2
[1] "a" "b" "c"

$df1
# A tibble: 3 × 2
      a b
  <int> <chr>
1     1 a
2     2 b
3     3 c
```
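One subtlety: `recursive = F` is what keeps `df1` intact. With the default `recursive = TRUE`, `unlist()` keeps flattening all the way down to a single atomic vector, coercing everything to character and destroying the nested tibble:

```r
# Default recursive = TRUE flattens to one atomic (character) vector
dfa %>% unlist(use.names = T) %>% class()
```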
This is awesome. We can now deploy a complex named list as a parquet to GitHub.
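To wrap up, here is a sketch of the full round trip as a pair of helpers. The names `deploy_etl` and `fetch_etl` are made up for illustration, and pushing the written file to GitHub is left to your usual git workflow:

```r
# Hypothetical helpers for the workflow above; names are illustrative.

# Pack a named list into a one-row tibble of list-columns and write it
deploy_etl = function(obj_list, path) {
  obj_list %>%
    lapply(list) %>%
    tibble::as_tibble() %>%
    arrow::write_parquet(path)
}

# Read a parquet (e.g. from a raw GitHub URL) and restore the named list
fetch_etl = function(url) {
  arrow::read_parquet(url) %>%
    unlist(use.names = T, recursive = F)
}

deploy_etl(list(version = '0.1', df1 = tibble(a = 1:3)), 'df_etl.parquet')
# After committing and pushing df_etl.parquet to a public repo:
# etl = fetch_etl('https://raw.githubusercontent.com/<user>/<repo>/main/df_etl.parquet')
```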