boostcampaitech7/level4-nlp-finalproject-hackathon-nlp-11-lv3

Level 4. Stock LLM Service Based on Securities Firm Reports

Project Overview

Project Topic

  1. Topic
    • Develop a stock LLM service based on securities firm reports
  2. Requirements
    • Extract information such as text and graphs from PDF documents
    • Build a data repository (GraphDB, VectorDB, etc.)
    • Implement a RAG system that finds the most relevant data for a query
    • Develop prompts
    • Generate answers
    • Q&A feature (used for the quantitative evaluation)
      • Implemented as a REST API
      • Input: query (the question)
      • Output: context (reference text), answer (the response)
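
The Q&A contract above (a `query` in, `context` and `answer` out) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual schema: the class names, the helper functions, and the choice to model `context` as a list of passages are all assumptions.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class QueryRequest:
    query: str  # the user's question


@dataclass
class QueryResponse:
    context: list[str]  # reference passages (list shape is an assumption)
    answer: str         # generated answer


def parse_request(body: str) -> QueryRequest:
    """Parse a JSON request body and reject a missing or empty query."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("request body must be a JSON object")
    query = data.get("query", "")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")
    return QueryRequest(query=query.strip())


def render_response(resp: QueryResponse) -> str:
    """Serialize the response; ensure_ascii=False keeps Korean text readable."""
    return json.dumps(asdict(resp), ensure_ascii=False)
```

Validating the query before retrieval keeps malformed requests from reaching the more expensive retrieval and generation steps.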

Dataset

  1. Provided data
    • 100 securities firm report files (PDF)

Evaluation Method

  1. Quantitative evaluation (50%)
    • Answer quality on the test-set queries, measured with G-Eval
  2. Qualitative evaluation (50%)
    • Creativity and usefulness of the service, completeness of development, source code quality, and level of documentation

👨🏻‍💻 Team Members and Roles

| Name | Role |
| --- | --- |
| 권기태 | API design and development, RESTful API implementation, OCR data post-processing |
| 권유진 | Evaluation data creation, OCR data post-processing, Web Design and presentation materials |
| 박무재 | RAG pipeline implementation, evaluation code implementation and experiments, evaluation data selection |
| 박정미 | Evaluation data creation, OCR data post-processing, Front-end |
| 이용준 | PM, refactoring and miscellaneous implementation, architecture design and serving |
| 정원식 | DocLayout module implementation, Embedding Model, Fine Tuning, presentation |

Project Approach

1. PDF OCR

📑 See the detailed PDF OCR description

[Figure: pdf-ocr_flowchart]

1.1 Run

python pdf_parser.py -i "./pdf/input_pdf_folder"
python data_postprocessor.py

2. RAG

📑 See the detailed RAG description

2.1 Run

cd app/RAG

# Evaluate retrieval
python main.py mode=retrieve

# Evaluate the generator
python main.py mode=generate

# Create and update the vector DB
python main.py mode=update_vectordb

2.2 Building the Evaluation Data

  • Purpose
    • Evaluate the Retriever's Top-K Accuracy and run G-Eval on both the Retriever and the Generator
  • Method
    • Question generation: use GPT to generate 10 text-based questions from the PDFs for each securities firm covering each stock
    • Query refinement: generate 100 queries per stock, then remove duplicates to select the final queries
    • Answer extraction: apply the refined queries to each securities firm's report to derive the answers
    • Ground Truth reinforcement: select a range of securities firms (5 to 6) per stock to improve Ground Truth quality
    • Table and figure questions: generate 10 additional questions based on tables and figures
  • Usage
    • Retrieval Top-K Accuracy uses all 1,843 items
    • G-Eval uses 75 samples out of the 1,843
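
Top-K Accuracy, as used above, counts a query as a hit when its ground-truth passage appears among the first K retrieved results. The sketch below assumes exactly one gold passage id per query, which the source does not state; the function name and argument layout are illustrative.

```python
def top_k_accuracy(ranked_ids, gold_ids, k):
    """Percentage of queries whose gold passage is within the first k results.

    ranked_ids: one list of retrieved passage ids per query, best first.
    gold_ids:   one ground-truth passage id per query (an assumption).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_ids)
```

Calling it for several values of `k` on the same ranked lists gives one row of a Top-1 to Top-50 comparison without re-running retrieval.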

2.3 Embedding Model Evaluation

| Model | Top-1 | Top-5 | Top-10 | Top-20 | Top-30 | Top-50 |
| --- | --- | --- | --- | --- | --- | --- |
| TF-IDF | 9.80 | 22.55 | 37.52 | 59.89 | 72.64 | 90.94 |
| BM25 | 12.20 | 28.84 | 42.33 | 63.59 | 79.85 | 96.12 |
| klue/roberta-large | 2.40 | 11.46 | 20.89 | 38.26 | 59.15 | 86.88 |
| klue/bert base | 5.73 | 17.38 | 30.50 | 49.35 | 66.73 | 87.62 |
| multilingual-e5-large-instruct | 11.09 | 29.94 | 44.92 | 66.17 | 80.41 | 94.82 |
| nlpai-lab/KoE5 | 15.16 | 38.26 | 53.42 | 71.72 | 81.52 | 93.53 |
| BAAI/bge-m3 | 15.34 | 41.22 | 56.38 | 73.94 | 84.84 | 96.30 |
| nlpai-lab/KURE-v1 | 16.64 | 42.41 | 58.41 | 76.53 | 85.03 | 95.38 |

nlpai-lab's KoE5 and KURE-v1 both performed well. After reviewing the documents actually retrieved, we chose KoE5 as the final model because its retrieval results were better for specific queries.

2.4 Embedding Model Fine-Tuning

  • Fine-tuning data: a translated version of virattt/financial-qa-10K

  • Training: separate Query Encoder and Passage Encoder, trained with Multiple Negatives Ranking Loss using In-Batch Negatives and no Hard Negatives

  • Results (Top-K Accuracy)

    | | KoE5 | Fine-Tuned Model |
    | --- | --- | --- |
    | Top-1 | 15.16 | 18.11 |
    | Top-5 | 38.26 | 43.07 |
    | Top-10 | 53.42 | 58.78 |
    | Top-20 | 71.72 | 75.60 |
    | Top-30 | 81.52 | 85.40 |
    | Top-50 | 93.53 | 95.93 |
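
The in-batch negatives objective described above can be sketched in plain Python. For query i, passage i is the positive and every other passage in the batch serves as a negative; the loss is the cross-entropy over scaled cosine similarities. The `scale` value of 20 and the function names are assumptions made for illustration, and the vectors stand in for the outputs of the two encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy with passage i as the positive for query i.

    All other passages in the batch act as negatives (no hard negatives).
    scale=20.0 is an assumed temperature, not a value from the source.
    """
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, p) for p in passage_vecs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)
```

Because the negatives come from the batch itself, larger batches give each query more negatives at no extra labeling cost.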

2.5 Vector Store

  • ChromaDB: it stores metadata alongside the vectors and supports filtering on it, so searches can be scoped to a single company, which improves the accuracy of the retrieved information. The DB can also be updated while the server is running, which makes it flexible. We chose it for these reasons.
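
The idea of company-scoped search and live updates can be illustrated with a small in-memory stand-in. This is not the ChromaDB API: the record layout, the `company` metadata key, and both function names are hypothetical, and ranking uses a plain dot product.

```python
def search(store, query_vec, company=None, top_k=3):
    """Rank records by dot-product similarity, optionally within one company."""
    candidates = [
        rec for rec in store
        if company is None or rec["metadata"].get("company") == company
    ]
    ranked = sorted(
        candidates,
        key=lambda rec: sum(a * b for a, b in zip(query_vec, rec["embedding"])),
        reverse=True,
    )
    return [rec["id"] for rec in ranked[:top_k]]


def upsert(store, record):
    """Add a record, or replace the one with the same id, in place."""
    for i, existing in enumerate(store):
        if existing["id"] == record["id"]:
            store[i] = record
            return
    store.append(record)
```

Filtering before ranking means chunks from other companies can never crowd out the relevant ones, however similar their embeddings are.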

2.6 Reranker

  • A Cross Encoder scores the similarity between each document and the query, and the documents are reordered by that score
  • Experiment

    | Model | Top-1 | Top-5 | Top-10 | Top-20 | Top-30 | Top-50 |
    | --- | --- | --- | --- | --- | --- | --- |
    | nlpai-lab/KoE5 | 15.16 | 38.26 | 53.42 | 71.72 | 81.52 | 93.53 |
    | nlpai-lab/KoE5 + BAAI/bge-reranker-v2-m3 | 19.78 | 43.25 | 61.55 | 77.08 | 85.58 | 95.75 |
    | nlpai-lab/KoE5 + Dongjin-kr/ko-reranker | 20.15 | 45.47 | 61.37 | 78.00 | 87.25 | 96.49 |

  • Adding a Reranker raised accuracy by about 5 percentage points or more at Top-1 through Top-30 (about 3 points at Top-50). We adopted Dongjin-kr/ko-reranker, which scored higher than BAAI/bge-reranker-v2-m3 at every cutoff except Top-10.
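
The retrieve-then-rerank step can be sketched as a function that rescores the retriever's candidates with a pairwise scorer and keeps the best ones. In the sketch below `token_overlap` is a toy stand-in for a Cross Encoder, used only so the example runs without model weights; both function names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Reorder candidate documents by score_fn(query, doc), best first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # sort on the score only, so ties keep the retriever's original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def token_overlap(query, doc):
    """Toy scorer: fraction of query tokens that also occur in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0
```

A real Cross Encoder would replace `token_overlap` with a model call that reads the query and document together, which is slower than embedding lookup and is why it is applied only to the retriever's shortlist.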

2.7 Generator

  • Prompt engineering
  • Query rewriting
    • When a question needs information about two or more companies, or is unsuitable as written, rewriting it improves retrieval performance
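
One way to handle the multi-company case is to fan a question out into one retrieval request per company mentioned. The source does not describe how rewriting is implemented, so the rule-based sketch below is only a stand-in: the function name, the substring matching, and the example company list are all assumptions.

```python
def fan_out_query(question, known_companies):
    """Return one (company, query) pair per company named in the question.

    With zero or one company mentioned, the question passes through as a
    single pair (company is None when nothing matches).
    """
    mentioned = [name for name in known_companies if name in question]
    if not mentioned:
        return [(None, question)]
    return [(name, question) for name in mentioned]
```

For example, `fan_out_query("Compare the operating margin of Naver and Kakao", ["Kakao", "Naver"])` yields two pairs, and each pair can then drive a separate company-filtered search whose results are merged before generation.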

2.8 Evaluation

  • G-Eval (Retrieval, Generator)
    • Metrics such as Top-K Accuracy and BLEU cannot evaluate properly in some situations, and people cannot grade every item by hand, so we chose G-Eval, an LLM-as-a-Judge approach.
    • We used the open-source DeepEval for fast implementation and smooth evaluation
    • Retrieval G-Eval results

      | Retrieval (top5) | Similarity | Essential information included | Question fulfillment | Relevance | Conciseness | total |
      | --- | --- | --- | --- | --- | --- | --- |
      | BAAI/bge-m3 | 2.52 | 3 | 2.34 | 1.92 | 1 | 10.81 |
      | nlpai-lab/KoE5 | 2.62 | 3 | 2.36 | 1.98 | 0.99 | 10.98 |
      | fine-tuned/nlpai-lab/KoE5 | 2.68 | 2.87 | 2.41 | 1.8 | 1.3 | 11.08 |

    • Generator G-Eval results

      | Generation | Relevance | Factual accuracy | Essential information included | Clarity and conciseness | Logical structure | No excessive detail | Appropriate source attribution | Format appropriateness | Additional integration | total |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | Top-5 | 2.7 | 2.7 | 2.8 | 2.4 | 1.6 | 1.7 | 1.2 | 0.4 | 0.6 | 16.2 |
      | Top-7 | 3.1 | 3.0 | 3.0 | 2.9 | 1.6 | 2.0 | 1.3 | 0.4 | 0.7 | 18.3 |
      | Top-10 | 3.0 | 2.9 | 2.8 | 2.6 | 1.7 | 1.7 | 1.1 | 0.4 | 0.7 | 17.0 |

3. API

📑 See the detailed API description

REST API development (Python API, Query API)

3.1 Run

cd app
uvicorn main:app --reload --host 0.0.0.0 --port 8000

3.2 Endpoint

  • query
  • documents
  • chatting

4. FE

4.1 Run

cd FE
npm install
npm run dev

4.2 Features

  • AI model selection (GPT-4o, GPT-4o-mini, Clova X)
  • Attached PDF documents are loaded into the vector DB for efficient search
  • Real-time chat that keeps the previous context
  • Widgets: KOSPI index, real-time exchange rates, latest economic news, stock-related information, latest news per stock

Results

Tech Stack

  • OCR: DocLayout-YOLO, Clova OCR, Upstage Parser API
  • VectorDB: ChromaDB
  • Retriever: LangChain
  • Generator: LangChain, LLM-based Answering Model (gpt-4o, Clova X)
  • Evaluation: G-Eval, Top-K Accuracy
  • API server: FastAPI
  • Web Front-end: React.js, Tailwind CSS

Teamwork & Collaboration Experience

  • Collaboration tools: tasks assigned and discussed through GitHub issues and discussions 🤝
  • Commit management: a GitHub commit message template keeps commits consistent and makes collaboration more efficient 📚

Project Workflow

  • Project management: completed work shared on Notion, progress discussed in Zoom meetings

About

level4-nlp-finalproject-hackathon-nlp-11-lv3 created by GitHub Classroom
