Create datasets package with shortcuts to acquire datasets as DataFrames #199

@frreiss

Our notebooks and experiment scripts frequently repeat a pattern:

  • Download a reference data set (if not already present)
  • Read the data set with one of our reader functions
  • Convert everything in the data set to DataFrames

We should wrap these three steps into a single function so that we and our users don't need to write this code over and over again.

Suggested API:

  • Main entry point attp.dataset.download_<data set name>(), with optional arguments to specify:
    • cache directory
    • fold name
    • whether to return a DataFrame per document or a single stacked DataFrame
  • Each download_<name>() function performs the following steps:
    • If the raw data set isn't present, download it
    • Convert the entire raw data set into DataFrames
    • Stack the DataFrames into a single large DataFrame (adding a leading column with the fold name) and write this DataFrame as a single Parquet file in the cache directory
    • Use the cached Parquet file for subsequent reads of the data set
    • If the user requested a DataFrame per document, split the single large DataFrame into multiple smaller ones
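
To make the proposed flow concrete, here is a rough sketch of the shared logic that each download_<name>() function could delegate to. Everything named here is hypothetical and not part of the existing code base: the `download_dataset` helper, the `fetch_raw` and `read_folds` callbacks (stand-ins for the real download code and reader functions), the `fold` and `doc_num` column names, and the cache file layout. It also assumes a Parquet engine such as pyarrow is installed.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

import pandas as pd


def download_dataset(
    name: str,
    fetch_raw: Callable[[Path], None],
    read_folds: Callable[[Path], Dict[str, List[pd.DataFrame]]],
    cache_dir: Union[str, Path] = "outputs",
    fold: Optional[str] = None,
    per_document: bool = False,
) -> Union[pd.DataFrame, List[pd.DataFrame]]:
    """Hypothetical helper: download, convert, cache as Parquet, and return."""
    cache_dir = Path(cache_dir)
    raw_dir = cache_dir / f"{name}_raw"          # assumed layout
    parquet_file = cache_dir / f"{name}.parquet"  # assumed layout

    if not parquet_file.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Step 1: download the raw data set only if it is not already present.
        if not raw_dir.exists():
            fetch_raw(raw_dir)
        # Step 2: convert the entire raw data set into per-document DataFrames.
        folds = read_folds(raw_dir)
        # Step 3: stack everything, with leading fold and document columns.
        pieces = []
        for fold_name, docs in folds.items():
            for doc_num, doc_df in enumerate(docs):
                piece = doc_df.copy()
                piece.insert(0, "doc_num", doc_num)
                piece.insert(0, "fold", fold_name)
                pieces.append(piece)
        stacked = pd.concat(pieces, ignore_index=True)
        stacked.to_parquet(parquet_file, index=False)

    # Step 4: every read, including the first, is served from the cached file.
    stacked = pd.read_parquet(parquet_file)
    if fold is not None:
        stacked = stacked[stacked["fold"] == fold].reset_index(drop=True)
    if not per_document:
        return stacked
    # Step 5: split the stacked DataFrame back into one DataFrame per document.
    return [
        doc_df.reset_index(drop=True)
        for _, doc_df in stacked.groupby(["fold", "doc_num"], sort=False)
    ]
```

A concrete download_<name>() function would then be a thin wrapper that fixes `name`, `fetch_raw`, and `read_folds` for one data set and exposes only the cache directory, fold name, and per-document options. Reading back from the Parquet file even on the first call keeps the returned dtypes identical between cold and warm caches.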
