Create datasets package with shortcuts to acquire datasets as DataFrames

Our notebooks and experiment scripts frequently repeat a pattern:
* Download a reference data set (if not already present)
* Read the data set with one of our reader functions
* Convert everything in the data set to DataFrames

We should wrap these three steps into a single function so that we and our users don't need to write this code over and over again. 

Suggested API: 
* Main entry point at `tp.dataset.download_<name>()`, where `<name>` is the data set name, with optional arguments to specify:
   * cache directory
   * fold name
   * whether to return a DataFrame per document or a single stacked DataFrame
* Each `download_<name>()` function performs the following steps:
   * If the raw data set isn't present, download it
   * Convert the entire raw data set into DataFrames
   * Stack the DataFrames into a single large DataFrame, adding a leading column with the fold name, and write it as a single Parquet file in the cache directory
   * Use the cached Parquet file for subsequent reads of the data set
   * If the user requested a DataFrame per document, split the single large DataFrame into multiple smaller ones
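
A rough sketch of how one of these functions could look, to make the steps above concrete. Everything named here is hypothetical and not existing library API: `download_example`, the `fetch_raw` and `read_raw` callables (stand-ins for the real download and reader functions), the `example.parquet` file name, the `outputs` default, and the extra `doc_num` column, which I'm assuming we need so the stacked DataFrame can be split back into documents. Writing Parquet from pandas requires `pyarrow` or `fastparquet`.

```python
from pathlib import Path
from typing import Callable, Dict, List, Union

import pandas as pd


def stack_documents(docs_by_fold: Dict[str, List[pd.DataFrame]]) -> pd.DataFrame:
    """Stack per-document DataFrames, adding leading `fold` and `doc_num` columns."""
    pieces = []
    for fold, docs in docs_by_fold.items():
        for doc_num, doc in enumerate(docs):
            piece = doc.copy()
            # Insert doc_num first, then fold, so that fold ends up leftmost.
            piece.insert(0, "doc_num", doc_num)
            piece.insert(0, "fold", fold)
            pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


def split_documents(stacked: pd.DataFrame) -> List[pd.DataFrame]:
    """Split the rows of ONE fold back into one DataFrame per document."""
    return [
        doc.drop(columns=["fold", "doc_num"]).reset_index(drop=True)
        for _, doc in stacked.groupby("doc_num", sort=True)
    ]


def download_example(
    fetch_raw: Callable[[Path], None],
    read_raw: Callable[[Path], Dict[str, List[pd.DataFrame]]],
    cache_dir: Union[str, Path] = "outputs",
    fold: str = "train",
    per_document: bool = False,
) -> Union[pd.DataFrame, List[pd.DataFrame]]:
    """Hypothetical download_<name>() function following the steps in this issue."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    parquet_path = cache_dir / "example.parquet"

    if not parquet_path.exists():
        raw_dir = cache_dir / "raw"
        if not raw_dir.exists():
            # Download only if the raw data set isn't already present.
            fetch_raw(raw_dir)
        # Convert the raw data to DataFrames, stack them, and cache as Parquet.
        stack_documents(read_raw(raw_dir)).to_parquet(parquet_path, index=False)

    # All reads, including the first, go through the cached Parquet file.
    stacked = pd.read_parquet(parquet_path)
    selected = stacked[stacked["fold"] == fold].reset_index(drop=True)
    return split_documents(selected) if per_document else selected
```

With this shape, `download_example(fetch, read, fold="test", per_document=True)` would return a list of per-document DataFrames, and a second call would skip both the download and the conversion because the Parquet file already exists.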



Create datasets package with shortcuts to acquire datasets as DataFrames #199
