corpus CLI #488

@pmeier

To be able to release the corpus API, we need a way for users to CRUD the corpora on a given source storage. To make our lives a little easier, we are not targeting a UI for this yet and should start with a CLI instead. Since the likely consumers of this feature are power users or admins, this is OK.

We can add a ragna corpus subcommand to the CLI. This in turn could have more subcommands:

  • ragna corpus list: List all available corpora
  • ragna corpus ingest: Ingest some documents into a given corpus (more on this later)
  • ragna corpus delete: Delete a given corpus
  • ragna corpus metadata: List all available metadata in a given corpus
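
To make the proposed layout concrete, here is a minimal sketch of the nested subcommands. It uses the standard library's argparse purely for illustration, which is not necessarily what Ragna's CLI is built on. The positional `corpus` argument and the `build_parser` helper are assumptions of this sketch; only the subcommand names, the positional root directory, and `--source-storage` come from the proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `ragna corpus <action>` parser."""
    parser = argparse.ArgumentParser(prog="ragna")
    commands = parser.add_subparsers(dest="command", required=True)

    corpus = commands.add_parser("corpus", help="Manage corpora")
    actions = corpus.add_subparsers(dest="action", required=True)

    # Options shared by every action; --source-storage is from the proposal.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--source-storage", help="Import string or display name")

    actions.add_parser("list", parents=[shared], help="List all available corpora")

    ingest = actions.add_parser(
        "ingest", parents=[shared], help="Ingest documents into a corpus"
    )
    ingest.add_argument("root", help="Root directory to recurse through")

    # The positional corpus name below is an assumption of this sketch.
    delete = actions.add_parser("delete", parents=[shared], help="Delete a corpus")
    delete.add_argument("corpus", help="Name of the corpus to delete")

    metadata = actions.add_parser(
        "metadata", parents=[shared], help="List all available metadata in a corpus"
    )
    metadata.add_argument("corpus", help="Name of the corpus to inspect")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, an invocation like `ragna corpus delete my-corpus --source-storage ragna.source_storages.Chroma` parses into `action="delete"`, `corpus="my-corpus"`, and the storage import string.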

Each command needs the source storage the action should be applied to. We have a few options here, and we could potentially implement all of them:

  • Add a --source-storage flag that accepts an import string similar to what we do in our config file, e.g. --source-storage ragna.source_storages.Chroma
  • Allow passing a --config and only accept a --source-storage listed there. Also allow passing the source storage by its display name similar to the API, since we know the options.
  • If we have a --config parameter and --source-storage is not passed, offer the user an interactive list of available source storages to select from
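
The first two options could be combined roughly as follows. This is only a sketch: `import_from_string`, `resolve_source_storage`, and `_display_name` are hypothetical helpers, and the way a class exposes its display name (an attribute or method called `display_name`, falling back to the class name) is an assumption rather than a statement about Ragna's actual API.

```python
import importlib
from typing import Iterable, Optional


def import_from_string(import_string: str) -> type:
    """Import an object from a dotted path such as 'package.module.Name'."""
    module_name, _, attribute = import_string.rpartition(".")
    if not module_name:
        raise ValueError(f"{import_string!r} is not a dotted import string")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def _display_name(cls: type) -> str:
    # Assumption: classes may expose `display_name` as a string or a callable.
    name = getattr(cls, "display_name", cls.__name__)
    return name() if callable(name) else name


def resolve_source_storage(
    value: str, configured: Optional[Iterable[type]] = None
) -> type:
    """Resolve --source-storage, optionally restricted to configured classes."""
    if configured is None:
        # No --config given: accept any importable class.
        return import_from_string(value)

    configured = list(configured)
    # With --config: first try to match by display name.
    for cls in configured:
        if _display_name(cls) == value:
            return cls

    # Otherwise treat the value as an import string, but only accept it
    # if the imported class is one of the configured ones.
    try:
        candidate = import_from_string(value)
    except (ValueError, ImportError, AttributeError):
        candidate = None
    if candidate is not None and any(candidate is cls for cls in configured):
        return candidate

    options = ", ".join(_display_name(cls) for cls in configured)
    raise ValueError(f"Unknown source storage {value!r}. Available: {options}")
```

The third option, an interactive prompt, would slot in where `value` is missing and `configured` is known, using the same list of display names as the choices.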

ragna corpus ingest is the trickiest of them. IMO a reasonable default behavior would be:

  • We recurse through a positionally supplied root directory
  • We call LocalDocument.from_path on each path for which we have an available DocumentHandler

From there on, it is just a matter of calling SourceStorage.store with the resulting documents and injecting them into Ragna's database.
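
The default flow described above could look roughly like this. To avoid guessing at signatures, the sketch does not call Ragna directly: `make_document` stands in for something like LocalDocument.from_path and `store` for something like SourceStorage.store, both passed in as plain callables. The function names and the idea of matching handlers by file suffix are assumptions of this sketch.

```python
from pathlib import Path
from typing import Any, Callable, Iterable, Iterator, List


def iter_ingestable_paths(
    root: Path, supported_suffixes: Iterable[str]
) -> Iterator[Path]:
    """Recurse through `root` and yield files with a supported suffix."""
    suffixes = {suffix.lower() for suffix in supported_suffixes}
    # Sorting makes the ingestion order deterministic.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path


def ingest(
    root: Path,
    supported_suffixes: Iterable[str],
    make_document: Callable[[Path], Any],
    store: Callable[[List[Any]], None],
) -> int:
    """Create a document per matching path, store them, return the count."""
    documents = [
        make_document(path)
        for path in iter_ingestable_paths(root, supported_suffixes)
    ]
    if documents:
        # Hand over all documents in one call rather than one by one.
        store(documents)
    return len(documents)
```

Injecting the two callables also keeps the open question below visible: anything that can turn a path into a document can be plugged in, not just local files.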

The tricky part comes when opening this up to other behavior than just the default:

  • Can we assume that by calling ragna corpus ingest you want to ingest files from local disk?
  • Should we enforce that every Document class has a from_path classmethod in order for us to create an arbitrary Document subclass if we have nothing more than a path?
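
For the second question, one possible way to enforce a from_path classmethod is an abstract classmethod. The classes below are hypothetical stand-ins and not Ragna's actual Document hierarchy; they only illustrate what such a requirement would mean for subclass authors.

```python
import abc
from pathlib import Path


class PathConstructibleDocument(abc.ABC):
    """Hypothetical base class requiring construction from a bare path."""

    @classmethod
    @abc.abstractmethod
    def from_path(cls, path: Path) -> "PathConstructibleDocument":
        ...


class TextFileDocument(PathConstructibleDocument):
    """Hypothetical subclass that satisfies the requirement."""

    def __init__(self, path: Path) -> None:
        self.path = path

    @classmethod
    def from_path(cls, path: Path) -> "TextFileDocument":
        return cls(Path(path))


class IncompleteDocument(PathConstructibleDocument):
    """Hypothetical subclass without from_path; it cannot be instantiated."""


document = TextFileDocument.from_path(Path("report.txt"))
```

The trade-off is that a document type that has no natural notion of a path would have to implement from_path anyway or could not subclass the base.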

I would love to hear from @nenb @blakerosenthal @dillonroach how this is done in the existing deployment.

Status: Done