- Test out using Avro instead of Parquet
Representation of a dataset as graphs with embeddings
Loads a graph dataset
Converts a JSON context-qas dataset to a graph dataset
gd = GraphDataset.fromJson("articles.json", "articlesGD.parquet")
In format (jsonl file)
[
{
"context": "string",
"qas": [
{
"question": "string",
"answers": ["string"]
}
]
}
]
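Before converting, it can help to check that the input matches the layout above. The following is a minimal sketch using only the standard library; `validate_context_qas` and the sample record are hypothetical and are not part of sent_graph_rag.

```python
import json


def validate_context_qas(records):
    """Raise ValueError if `records` does not match the input layout above.

    Hypothetical helper, not part of sent_graph_rag. Returns the record count.
    """
    if not isinstance(records, list):
        raise ValueError("top level must be a list of records")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {i}: must be an object")
        if not isinstance(record.get("context"), str):
            raise ValueError(f"record {i}: 'context' must be a string")
        qas = record.get("qas")
        if not isinstance(qas, list):
            raise ValueError(f"record {i}: 'qas' must be a list")
        for j, qa in enumerate(qas):
            if not isinstance(qa, dict):
                raise ValueError(f"record {i}, qa {j}: must be an object")
            if not isinstance(qa.get("question"), str):
                raise ValueError(f"record {i}, qa {j}: 'question' must be a string")
            answers = qa.get("answers")
            if not (isinstance(answers, list)
                    and all(isinstance(a, str) for a in answers)):
                raise ValueError(f"record {i}, qa {j}: 'answers' must be a list of strings")
    return len(records)


# Illustrative record only; real data would be read from the input file.
sample = json.loads(
    '[{"context": "Paris is the capital of France.",'
    ' "qas": [{"question": "What is the capital of France?",'
    ' "answers": ["Paris"]}]}]'
)
n_records = validate_context_qas(sample)
```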
Out format (parquet file)
[
{
"context": "base64_encoded_graph_string",
"qas": [
{
"question": "string",
"question_embedding": [0.123, 0.456, ...], // Fixed size 125
"answers": ["str"]
}
]
}
]
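A record in the output layout can be sanity-checked in a similar way. This sketch assumes the `context` field is standard base64 text and that every `question_embedding` has exactly 125 entries, as stated above. How the graph itself is serialized into bytes is not specified here, so the demo uses placeholder bytes. `check_graph_record` is a hypothetical name.

```python
import base64
import binascii

EMBEDDING_SIZE = 125  # fixed size stated in the output format above


def check_graph_record(record):
    """Check one output record and return the decoded graph bytes.

    Hypothetical helper, not part of sent_graph_rag.
    """
    try:
        graph_bytes = base64.b64decode(record["context"], validate=True)
    except binascii.Error as exc:
        raise ValueError("'context' is not valid base64") from exc
    for i, qa in enumerate(record["qas"]):
        size = len(qa["question_embedding"])
        if size != EMBEDDING_SIZE:
            raise ValueError(f"qa {i}: embedding has {size} entries, expected {EMBEDDING_SIZE}")
    return graph_bytes


# Placeholder bytes stand in for a real serialized graph.
demo = {
    "context": base64.b64encode(b"placeholder graph bytes").decode("ascii"),
    "qas": [
        {
            "question": "string",
            "question_embedding": [0.0] * EMBEDDING_SIZE,
            "answers": ["str"],
        }
    ],
}
decoded = check_graph_record(demo)
```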
Loads a graph dataset
Converts a graph dataset to an answer-path dataset used for model training
In format (parquet file)
[
{
"context": "base64_encoded_graph_string",
"qas": [
{
"question": "string",
"question_embedding": [0.123, 0.456, ...], // Fixed size 125
"answers": ["str"]
}
]
}
]
Out format (parquet file)
[
{
"path": [Vector[float32,125]],
"options": [Vector[float32,125]],
"label": Int
}
]
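One way to picture a single training record is as a small dataclass. This is only a sketch: `AnswerPathExample` is a hypothetical name, and it assumes that `label` is the index of the correct entry in `options`, which the format above does not state explicitly.

```python
from dataclasses import dataclass
from typing import List

EMBEDDING_SIZE = 125  # vector size stated in the output format above


@dataclass
class AnswerPathExample:
    """One answer-path record (hypothetical class, not part of sent_graph_rag)."""

    path: List[List[float]]     # embeddings of the nodes visited so far
    options: List[List[float]]  # embeddings of the candidate next steps
    label: int                  # ASSUMPTION: index into `options`

    def __post_init__(self):
        for name in ("path", "options"):
            for vec in getattr(self, name):
                if len(vec) != EMBEDDING_SIZE:
                    raise ValueError(f"{name}: every vector needs {EMBEDDING_SIZE} entries")
        # This range check only holds under the index assumption above.
        if not 0 <= self.label < len(self.options):
            raise ValueError("label must index into options")


zero = [0.0] * EMBEDDING_SIZE
example = AnswerPathExample(path=[zero], options=[zero, zero, zero], label=2)
```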
Create Conda Environment (Required for graph-tool)
conda create --name sent_graph_rag -c conda-forge graph-tool python=3.11.11
conda activate sent_graph_rag
Install sent_graph_rag
pip install -e Desktop/sent_graph_rag
Install spaCy NLP model
python -m spacy download en_core_web_sm
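After the steps above, a quick check from inside the activated environment can confirm that everything is importable. This is a sketch using only the standard library. The import name `graph_tool` is the one graph-tool uses, and spaCy models install as packages named after the model. The import name `sent_graph_rag` is an assumption based on the directory name.

```python
from importlib.util import find_spec

# `sent_graph_rag` as an import name is an assumption taken from the folder name.
REQUIRED_MODULES = ("graph_tool", "spacy", "en_core_web_sm", "sent_graph_rag")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print("missing:", missing_modules() or "none")
```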
srun -N 1 -n 1 --pty /bin/bash
Start interactive session
srun --gres=gpu:1 -N 1 --mem=100G --time 12:00:00 --pty /bin/bash
Allocates 1 node with 1 GPU, 100 GB of memory, and a 12-hour time limit.