Our notebooks and experiment scripts frequently repeat a pattern:
- Download a reference data set (if not already present)
- Read the data set with one of our reader functions
- Convert everything in the data set to DataFrames
We should wrap these three steps into a single function so that we and our users don't need to write this code over and over again.
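For illustration, the repeated pattern looks roughly like the sketch below. All names here (`DATA_PATH`, `download_raw`, `read_dataset`) are hypothetical stand-ins for our actual download and reader utilities, with a tiny fake data set substituted for the real download:

```python
import json
import os

import pandas as pd

DATA_PATH = "cache/dataset.json"  # hypothetical local copy of the raw data set


def download_raw(path: str) -> None:
    """Stand-in for the real download step: writes a tiny fake data set."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump([{"tokens": ["a", "b"]}, {"tokens": ["c"]}], f)


def read_dataset(path: str):
    """Stand-in for one of our reader functions."""
    with open(path) as f:
        return json.load(f)


# The three steps that every notebook currently repeats:
if not os.path.exists(DATA_PATH):               # 1. download if not present
    download_raw(DATA_PATH)
docs = read_dataset(DATA_PATH)                  # 2. read with a reader function
frames = [pd.DataFrame(doc) for doc in docs]    # 3. convert to DataFrames
```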
Suggested API:
- Main entry point at `tp.dataset.download_<data set name>()`, with optional arguments to specify:
  - cache directory
  - fold name
  - whether to return a DataFrame per document or a single stacked DataFrame
- Each `download_<name>()` function performs the following steps:
  - If the raw data set isn't present, download it
  - Convert the entire raw data set into DataFrames
  - Stack the DataFrames into a single large DataFrame (adding a leading column with the fold name) and write this DataFrame as a single Parquet file in the cache directory
  - Use the cached Parquet file for subsequent reads of the data set
  - If the user requested a DataFrame per document, split the single large DataFrame into multiple smaller ones
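A minimal sketch of the stacking and per-document-splitting logic described above. Every name here (`download_mydataset`, `_fetch_raw`, `_docs_to_dataframes`) is a hypothetical placeholder, not an existing `tp` API, and the Parquet read/write is only noted in comments so the example stays self-contained:

```python
import os

import pandas as pd


def _fetch_raw(cache_dir: str) -> None:
    """Placeholder: download the raw data set into cache_dir if absent."""
    os.makedirs(cache_dir, exist_ok=True)


def _docs_to_dataframes(fold: str):
    """Placeholder: read one fold of the raw data set and yield
    (document id, DataFrame) pairs via one of our reader functions."""
    yield "doc0", pd.DataFrame({"token": ["Hello", "world"]})
    yield "doc1", pd.DataFrame({"token": ["Goodbye"]})


def download_mydataset(cache_dir: str = "cache", fold: str = "train",
                       split_by_doc: bool = False):
    """Sketch of one proposed download_<name>() entry point.

    Stacks all documents of `fold` into a single DataFrame with leading
    `fold` and `doc_id` columns. A real implementation would write the
    stacked frame once as Parquet in cache_dir (e.g. via
    DataFrame.to_parquet) and reread that file on subsequent calls.
    """
    _fetch_raw(cache_dir)
    frames = []
    for doc_id, df in _docs_to_dataframes(fold):
        df = df.copy()
        df.insert(0, "doc_id", doc_id)
        df.insert(0, "fold", fold)  # leading column with the fold name
        frames.append(df)
    stacked = pd.concat(frames, ignore_index=True)
    if split_by_doc:
        # Optional return mode: one DataFrame per document
        return {doc_id: group.reset_index(drop=True)
                for doc_id, group in stacked.groupby("doc_id", sort=False)}
    return stacked
```

Keeping a single stacked frame as the canonical cached artifact means the per-document view is just a cheap `groupby` over the cached data rather than a second on-disk format.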