corpus CLI

To be able to release the corpus API, we need a way for users to [CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete) the corpora on a given source storage. To make our lives a little easier, we are not targeting a UI for this yet and should start with a CLI instead. Since the likely consumers of this feature are power users or admins, this is ok.

We can add a `ragna corpus` subcommand to the CLI. This in turn could have more subcommands:

- `ragna corpus list`: List all available corpora
- `ragna corpus ingest`: Ingest some documents into a given corpus (more on this later)
- `ragna corpus delete`: Delete a given corpus
- `ragna corpus metadata`: List all available metadata in a given corpus
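
To make the proposed command tree concrete, here is a minimal sketch of the subcommand layout. It uses the stdlib `argparse` only to stay self-contained and says nothing about which CLI framework the real implementation should use. The positional `corpus_name` argument for `delete` and `metadata` is an assumption for illustration, not something decided above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed `ragna corpus` subcommands."""
    parser = argparse.ArgumentParser(prog="ragna")
    commands = parser.add_subparsers(dest="command", required=True)

    corpus = commands.add_parser("corpus", help="Manage corpora on a source storage")
    actions = corpus.add_subparsers(dest="action", required=True)

    actions.add_parser("list", help="List all available corpora")

    ingest = actions.add_parser("ingest", help="Ingest documents into a corpus")
    ingest.add_argument("root", help="Root directory to ingest documents from")

    # Hypothetical: identify the corpus by a positional name.
    delete = actions.add_parser("delete", help="Delete a given corpus")
    delete.add_argument("corpus_name")

    metadata = actions.add_parser("metadata", help="List metadata in a given corpus")
    metadata.add_argument("corpus_name")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["corpus", "delete", "my-corpus"])
    print(args.action, args.corpus_name)
```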

---

Each command needs the source storage the action should be applied to. We have a few options here, and we could potentially implement all of them:

- Add a `--source-storage` flag that accepts an import string similar to what we do in our config file, e.g. `--source-storage ragna.source_storages.Chroma`
- Allow passing a `--config` and only accept a `--source-storage` listed there. Also allow passing the source storage by its display name similar to the API, since we know the options.
- If we have a `--config` parameter and `--source-storage` is not passed, offer the user an interactive list of available source storages to select from
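
The first two options could be combined roughly as sketched below. This is a sketch under assumptions: `resolve_source_storage` and `import_from_string` are hypothetical helpers, and `configured` stands in for a mapping from display name to source storage class that would be derived from the config. The interactive selection from the third option is left out.

```python
import importlib
from typing import Mapping, Optional


def import_from_string(import_string: str) -> type:
    """Import an object from a dotted string such as `package.module.ClassName`."""
    module_name, _, attribute = import_string.rpartition(".")
    if not module_name:
        raise ValueError(f"{import_string!r} is not a dotted import string")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def resolve_source_storage(
    value: str, configured: Optional[Mapping[str, type]] = None
) -> type:
    """Resolve the value passed to `--source-storage`.

    Without a config, `value` has to be an import string. With a config,
    `value` may be a display name or an import string, but the result has to
    be one of the configured source storages.
    """
    if configured is None:
        return import_from_string(value)

    # Display names take precedence, since the config tells us the options.
    if value in configured:
        return configured[value]

    cls = import_from_string(value)
    if cls not in configured.values():
        raise ValueError(f"{value!r} is not listed in the config")
    return cls
```

Letting the display name take precedence keeps the CLI consistent with how the API identifies source storages, while the import string remains available for scripted use.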

---

`ragna corpus ingest` is the trickiest of them. IMO a reasonable default behavior would be:

- We recurse through a positionally supplied root directory
- We call `LocalDocument.from_path` on each path for which we have an available `DocumentHandler`

From there, it is just a matter of calling `SourceStorage.store` and inserting the documents into Ragna's database.
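
The directory walk of the default behavior could look roughly like the sketch below. It only covers path discovery: `iter_ingestable_paths` is a hypothetical helper, and the set of supported suffixes is passed in explicitly as an assumption, whereas in Ragna it would come from the available `DocumentHandler`s. Each yielded path would then be handed to `LocalDocument.from_path`.

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_ingestable_paths(
    root: Path, supported_suffixes: Iterable[str]
) -> Iterator[Path]:
    """Recursively yield the files under `root` that have a supported suffix.

    Suffixes are compared case-insensitively and include the leading dot,
    e.g. `.pdf`. Paths are yielded in sorted order to keep runs reproducible.
    """
    suffixes = {suffix.lower() for suffix in supported_suffixes}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path


if __name__ == "__main__":
    for path in iter_ingestable_paths(Path("."), [".txt", ".pdf"]):
        print(path)
```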

The tricky part comes when opening this up to behavior other than the default:

- Can we assume that by calling `ragna corpus ingest` you want to ingest files from local disk?
- Should we enforce that every `Document` class has a `from_path` classmethod in order for us to create an arbitrary `Document` subclass if we have nothing more than a path?
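
For the second question, one way the enforcement could work is an abstract classmethod. The sketch below uses hypothetical stand-in classes (`PathDocument`, `LocalFileDocument`, `IncompleteDocument`) rather than Ragna's actual `Document` hierarchy.

```python
import abc
from pathlib import Path


class PathDocument(abc.ABC):
    """Hypothetical stand-in for a `Document` base class."""

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    @abc.abstractmethod
    def from_path(cls, path: Path) -> "PathDocument":
        """Create a document from nothing more than a path."""


class LocalFileDocument(PathDocument):
    def __init__(self, name: str, path: Path) -> None:
        super().__init__(name)
        self.path = path

    @classmethod
    def from_path(cls, path: Path) -> "LocalFileDocument":
        return cls(name=path.name, path=path)


class IncompleteDocument(PathDocument):
    # Does not override `from_path`, so instantiating it raises a TypeError.
    pass


if __name__ == "__main__":
    document = LocalFileDocument.from_path(Path("docs/a.txt"))
    print(document.name)
```

Note that `abc` only enforces this at instantiation time: defining `IncompleteDocument` succeeds, and the error surfaces when an instance is created. If the check should happen earlier, the CLI would have to inspect the class explicitly.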

I would love to hear from @nenb @blakerosenthal @dillonroach how this is done in the existing deployment.

corpus CLI #488
