Changes from all commits
60 commits
9b8c645
Updated .bin to .safetensors
sergiopaniego Aug 27, 2024
1ae6600
[zh-TW] Added chapters 1-11
thliang01 Jun 2, 2025
50c2ce2
add toctree file
thliang01 Jun 2, 2025
d669d0a
Merge branch 'huggingface:main' into main
thliang01 Jun 10, 2025
5902cb1
[zh-TW] Added chapters 1-11
thliang01 Jun 2, 2025
30d57f5
add toctree file
thliang01 Jun 2, 2025
9bd5fa8
added te/chapter1/2-3
RahulKonda18 Jun 23, 2025
26fe94b
Update 2.mdx
RahulKonda18 Jun 24, 2025
15cf873
Merge pull request #976 from RahulKonda18/main
stevhliu Jul 3, 2025
b8ee3c0
Merge branch 'main' of https://github.com/thliang01/hf-llm-course
thliang01 Jul 4, 2025
1b7467e
Merge branch 'huggingface:main' into main
thliang01 Jul 4, 2025
064d497
Merge branch 'main' of github.com:huggingface/course into bin-to-safe…
sergiopaniego Jul 9, 2025
1bf51f2
Updated .bin to .safetensors
sergiopaniego Jul 9, 2025
75460f0
Merge pull request #729 from sergiopaniego/bin-to-safetensors
sergiopaniego Jul 9, 2025
0fb986b
Merge branch 'huggingface:main' into main
thliang01 Jul 9, 2025
86511e4
added te/chapter1/4-11
RahulKonda18 Jul 12, 2025
86d42b8
Merge pull request #1002 from RahulKonda18/main
stevhliu Jul 14, 2025
18bed1c
Merge pull request #952 from thliang01/main
stevhliu Jul 16, 2025
01c1e12
docs: ko: chapter3-3.mdx
Youngdong2 Jul 17, 2025
708e3ad
feat: nmt draft
Youngdong2 Jul 17, 2025
ee3f18f
docs: ko: processing_the_data.md
chhaewxn Jul 19, 2025
fcb2e2b
feat: nmt draft
chhaewxn Jul 19, 2025
9808a12
fix: manual edits
Youngdong2 Jul 19, 2025
bc73184
fix: manual edits
Youngdong2 Jul 19, 2025
565a630
fix: manual edits
chhaewxn Jul 19, 2025
ed05511
fix: toctree edits
chhaewxn Jul 19, 2025
b76ccec
fix: toctree edits
chhaewxn Jul 19, 2025
d584efc
Apply suggestions from code review
chhaewxn Jul 22, 2025
7e74bde
fix: manual edits
chhaewxn Jul 22, 2025
fbed6f7
Add contributor to Chinese (traditional) translation in README
thliang01 Jul 24, 2025
a745abc
Apply suggestions from code review
Youngdong2 Jul 26, 2025
50a63e7
Apply suggestions from code review
Youngdong2 Jul 26, 2025
6a5fad8
Update chapters/ko/chapter3/3.mdx
Youngdong2 Jul 26, 2025
2a782fe
Apply suggestions from code review
Youngdong2 Jul 26, 2025
1964953
Apply suggestions from code review
Youngdong2 Jul 26, 2025
7f4cda9
Apply suggestions from code review
Youngdong2 Jul 26, 2025
361c355
Merge pull request #1019 from thliang01/Add-new-author-in--zh-TW-
stevhliu Jul 29, 2025
3020c15
Merge pull request #1015 from chhaewxn/ko-processing_the_data.md
stevhliu Jul 29, 2025
a972744
Merge branch 'main' into ko-chapter3-3.mdx
Youngdong2 Aug 1, 2025
c4ff46d
docs: ko: understanding_learning_curves
chhaewxn Aug 3, 2025
e446938
feat: nmt draft
chhaewxn Aug 3, 2025
2120da2
fix: manual edits
chhaewxn Aug 3, 2025
e35c581
fix: manual edits
chhaewxn Aug 3, 2025
0a16ad4
Merge pull request #1014 from Youngdong2/ko-chapter3-3.mdx
stevhliu Aug 4, 2025
f8a1e74
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
c8553cd
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
b3d6cd2
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
8c139fe
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
df3c859
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
186985f
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
b4e6023
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
0d5d42b
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
4c5efe1
Apply suggestion from @seopp
chhaewxn Aug 9, 2025
88bc739
fix: apply suggestion from @AhnJoonSung
chhaewxn Aug 11, 2025
9124fc0
Merge pull request #1035 from chhaewxn/ko-understanding_learning_curv…
stevhliu Aug 12, 2025
3c9c7ff
feat(my): Add Myanmar translation for Chapter 0: Setup
kalixlouiis Aug 12, 2025
896777d
Merge pull request #1043 from kalixlouiis/main
stevhliu Aug 12, 2025
538e8c9
fix hfoptions blocks in unit 1 8.mdx
burtenshaw Sep 9, 2025
e2698dc
Merge pull request #1067 from huggingface/burtenshaw-patch-1
burtenshaw Sep 9, 2025
8287911
Merge branch 'release' into release-2025-9-9
burtenshaw Sep 9, 2025
4 changes: 2 additions & 2 deletions README.md
@@ -21,12 +21,12 @@ This repo contains the content that's used to create the **[Hugging Face course]
| [Korean](https://huggingface.co/course/ko/chapter1/1) (WIP) | [`chapters/ko`](https://github.com/huggingface/course/tree/main/chapters/ko) | [@Doohae](https://github.com/Doohae), [@wonhyeongseo](https://github.com/wonhyeongseo), [@dlfrnaos19](https://github.com/dlfrnaos19), [@nsbg](https://github.com/nsbg) |
| [Portuguese](https://huggingface.co/course/pt/chapter1/1) (WIP) | [`chapters/pt`](https://github.com/huggingface/course/tree/main/chapters/pt) | [@johnnv1](https://github.com/johnnv1), [@victorescosta](https://github.com/victorescosta), [@LincolnVS](https://github.com/LincolnVS) |
| [Russian](https://huggingface.co/course/ru/chapter1/1) (WIP) | [`chapters/ru`](https://github.com/huggingface/course/tree/main/chapters/ru) | [@pdumin](https://github.com/pdumin), [@svv73](https://github.com/svv73), [@blademoon](https://github.com/blademoon) |
| [Telugu]( https://huggingface.co/course/te/chapter0/1 ) (WIP) | [`chapters/te`](https://github.com/huggingface/course/tree/main/chapters/te) | [@Ajey95](https://github.com/Ajey95)
| [Telugu]( https://huggingface.co/course/te/chapter0/1 ) (WIP) | [`chapters/te`](https://github.com/huggingface/course/tree/main/chapters/te) | [@Ajey95](https://github.com/Ajey95), [@RahulKonda18](https://github.com/RahulKonda18)
| [Thai](https://huggingface.co/course/th/chapter1/1) (WIP) | [`chapters/th`](https://github.com/huggingface/course/tree/main/chapters/th) | [@peeraponw](https://github.com/peeraponw), [@a-krirk](https://github.com/a-krirk), [@jomariya23156](https://github.com/jomariya23156), [@ckingkan](https://github.com/ckingkan) |
| [Turkish](https://huggingface.co/course/tr/chapter1/1) (WIP) | [`chapters/tr`](https://github.com/huggingface/course/tree/main/chapters/tr) | [@tanersekmen](https://github.com/tanersekmen), [@mertbozkir](https://github.com/mertbozkir), [@ftarlaci](https://github.com/ftarlaci), [@akkasayaz](https://github.com/akkasayaz) |
| [Vietnamese](https://huggingface.co/course/vi/chapter1/1) | [`chapters/vi`](https://github.com/huggingface/course/tree/main/chapters/vi) | [@honghanhh](https://github.com/honghanhh) |
| [Chinese (simplified)](https://huggingface.co/course/zh-CN/chapter1/1) | [`chapters/zh-CN`](https://github.com/huggingface/course/tree/main/chapters/zh-CN) | [@zhlhyx](https://github.com/zhlhyx), [petrichor1122](https://github.com/petrichor1122), [@1375626371](https://github.com/1375626371) |
| [Chinese (traditional)](https://huggingface.co/course/zh-TW/chapter1/1) (WIP) | [`chapters/zh-TW`](https://github.com/huggingface/course/tree/main/chapters/zh-TW) | [@davidpeng86](https://github.com/davidpeng86) |
| [Chinese (traditional)](https://huggingface.co/course/zh-TW/chapter1/1) (WIP) | [`chapters/zh-TW`](https://github.com/huggingface/course/tree/main/chapters/zh-TW) | [@davidpeng86](https://github.com/davidpeng86), [@thliang01](https://github.com/thliang01) |
| [Romanian](https://huggingface.co/course/rum/chapter1/1) (WIP) | [`chapters/rum`](https://github.com/huggingface/course/tree/main/chapters/rum) | [@Sigmoid](https://github.com/SigmoidAI), [@eduard-balamatiuc](https://github.com/eduard-balamatiuc), [@FriptuLudmila](https://github.com/FriptuLudmila), [@tokyo-s](https://github.com/tokyo-s), [@hbkdesign](https://github.com/hbkdesign), [@grumpycatyo-collab](https://github.com/grumpycatyo-collab), [@Angroys](https://github.com/Angroys) |

### Translating the course into your language
4 changes: 2 additions & 2 deletions chapters/en/chapter2/3.mdx
@@ -46,12 +46,12 @@ This will save two files to your disk:
```
ls directory_on_my_computer

config.json pytorch_model.bin
config.json model.safetensors
```

If you look inside the *config.json* file, you'll see all the attributes needed to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and which 🤗 Transformers version you were using when you last saved the checkpoint.

The *pytorch_model.bin* file is known as the state dictionary; it contains all your model's weights. The two files work together: the configuration file is needed to know about the model architecture, while the model weights are the parameters of the model.
The *model.safetensors* file is known as the state dictionary; it contains all your model's weights. The two files work together: the configuration file describes the model architecture, while the weights file holds the model's parameters.
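
For example, here is a minimal end-to-end sketch (assuming a recent 🤗 Transformers release, which saves weights as *model.safetensors* by default, and using `bert-base-cased` purely as an example checkpoint):

```python
from transformers import AutoModel
import json

# Download a pretrained model and save it locally
model = AutoModel.from_pretrained("bert-base-cased")
model.save_pretrained("directory_on_my_computer")

# config.json describes the architecture; model.safetensors holds the weights
with open("directory_on_my_computer/config.json") as f:
    config = json.load(f)
print(config["model_type"], config["hidden_size"])
```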

To reuse a saved model, use the `from_pretrained()` method again:

46 changes: 41 additions & 5 deletions chapters/en/chapter2/8.mdx
@@ -15,16 +15,19 @@ TGI, vLLM, and llama.cpp serve similar purposes but have distinct characteristic
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png" alt="Flash Attention" />

<Tip title="How Flash Attention Works">

Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [Chapter 1.8](/course/chapter1/8), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.

The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.

While the benefits are most significant during training, Flash Attention's reduced VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
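
The tiling idea is easier to see in a toy sketch. The blocked pass below is a NumPy illustration of the online-softmax trick (not the actual fused CUDA kernel): it processes the keys block by block, so only one small tile of the score matrix exists at a time, yet it produces exactly the same output as the naive version.

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full (seq_len x seq_len) score matrix at once
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=32):
    # Processes keys/values block by block with an online softmax, so only a
    # (q_len x block) tile of scores is live at any time -- conceptually how
    # Flash Attention keeps its working set inside SRAM.
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full(q.shape[0], -np.inf)   # running row-wise max
    l = np.zeros(q.shape[0])           # running softmax denominator
    acc = np.zeros_like(q)             # running weighted sum of values
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        acc = acc * correction[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
```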

</Tip>

**vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.

<Tip title="How PagedAttention Works">

PagedAttention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. As discussed in [Chapter 1.8](/course/chapter1/8), during text generation, the model stores attention keys and values (KV cache) for each generated token to reduce redundant computations. The KV cache can become enormous, especially with long sequences or multiple concurrent requests.

vLLM's key innovation lies in how it manages this cache:
@@ -35,11 +38,13 @@ vLLM's key innovation lies in how it manages this cache:
4. **Memory Sharing**: For operations like parallel sampling, pages storing the KV cache for the prompt can be shared across multiple sequences.

The PagedAttention approach can deliver up to 24x higher throughput than traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
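
To make the paging idea concrete, here is a deliberately simplified sketch of the bookkeeping (hypothetical code, not vLLM's actual implementation): the cache is carved into fixed-size blocks, each sequence keeps a block table mapping its logical positions to physical blocks, and forked sequences share the prompt's blocks. The real engine also performs copy-on-write when forked sequences diverge.

```python
BLOCK_SIZE = 16  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one new token, allocating a block on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:   # current block is full (or none allocated yet)
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def fork(self, seq_id, new_seq_id):
        """Share the prompt's blocks across sequences (e.g. parallel sampling)."""
        self.block_tables[new_seq_id] = list(self.block_tables[seq_id])
        self.lengths[new_seq_id] = self.lengths[seq_id]

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                    # a 20-token prompt spans two blocks
    cache.append_token("request-0")
cache.fork("request-0", "request-1")   # both samples reuse the same two blocks
print(cache.block_tables)
```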

</Tip>

**llama.cpp** is a highly optimized C/C++ implementation originally designed for running LLaMA models on consumer hardware. It focuses on CPU efficiency with optional GPU acceleration and is ideal for resource-constrained environments. llama.cpp uses quantization techniques to reduce model size and memory requirements while maintaining good performance. It implements optimized kernels for various CPU architectures and supports basic KV cache management for efficient token generation.

<Tip title="How llama.cpp Quantization Works">

Quantization in llama.cpp reduces the precision of model weights from 32-bit or 16-bit floating point to lower precision formats like 8-bit integers (INT8), 4-bit, or even lower. This significantly reduces memory usage and improves inference speed with minimal quality loss.

Key quantization features in llama.cpp include:
@@ -49,9 +54,8 @@ Key quantization features in llama.cpp include:
4. **Hardware-Specific Optimizations**: Includes optimized code paths for various CPU architectures (AVX2, AVX-512, NEON)

This approach enables running billion-parameter models on consumer hardware with limited memory, making it perfect for local deployments and edge devices.
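
llama.cpp's GGUF formats use more sophisticated block-wise schemes (Q8_0, Q4_K, and friends), but the core idea can be sketched with a simple symmetric per-tensor INT8 quantizer:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: store int8 values plus one scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

print(f"fp32: {w.nbytes / 1e6:.0f} MB, int8: {q.nbytes / 1e6:.0f} MB")  # ~4x smaller
print(f"mean absolute error: {np.abs(w - dequantize(q, scale)).mean():.4f}")
```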
</Tip>


</Tip>

### Deployment and Integration

@@ -63,8 +67,6 @@ Let's move on to the deployment and integration differences between the framewor

**llama.cpp** prioritizes simplicity and portability. Its server implementation is lightweight and can run on a wide range of hardware, from powerful servers to consumer laptops and even some high-end mobile devices. With minimal dependencies and a simple C/C++ core, it's easy to deploy in environments where installing Python frameworks would be challenging. The server provides an OpenAI-compatible API while maintaining a much smaller resource footprint than other solutions.



## Getting Started

Let's explore how to use these frameworks for deploying LLMs, starting with installation and basic setup.
@@ -146,7 +148,9 @@ response = client.chat.completions.create(
)
print(response.choices[0].message.content)
```

</hfoption>

<hfoption value="llama.cpp" label="llama.cpp">

llama.cpp is easy to install and use, requiring minimal dependencies and supporting both CPU and GPU inference.
@@ -235,7 +239,9 @@ response = client.chat.completions.create(
)
print(response.choices[0].message.content)
```

</hfoption>

<hfoption value="vllm" label="vLLM">

vLLM is easy to install and use, with both OpenAI API compatibility and a native Python interface.
@@ -306,6 +312,7 @@ response = client.chat.completions.create(
)
print(response.choices[0].message.content)
```

</hfoption>

</hfoptions>
@@ -333,6 +340,7 @@ docker run --gpus all \
```

Use the InferenceClient for flexible text generation:

```python
from huggingface_hub import InferenceClient

@@ -380,7 +388,9 @@ response = client.chat.completions.create(
)
print(response.choices[0].message.content)
```

</hfoption>

<hfoption value="llama.cpp" label="llama.cpp">

For llama.cpp, you can set advanced parameters when launching the server:
@@ -487,7 +497,9 @@ output = llm(

print(output["choices"][0]["text"])
```

</hfoption>

<hfoption value="vllm" label="vLLM">

For advanced usage with vLLM, you can use the InferenceClient:
@@ -579,6 +591,7 @@ formatted_prompt = llm.get_chat_template()(chat_prompt) # Uses model's chat tem
outputs = llm.generate(formatted_prompt, sampling_params)
print(outputs[0].outputs[0].text)
```

</hfoption>

</hfoptions>
@@ -610,7 +623,9 @@ client.generate(
repetition_penalty=1.1, # Reduce repetition
)
```

</hfoption>

<hfoption value="llama.cpp" label="llama.cpp">

```python
@@ -635,7 +650,9 @@ output = llm(
repeat_penalty=1.1,
)
```

</hfoption>

<hfoption value="vllm" label="vLLM">

```python
Expand All @@ -648,6 +665,7 @@ params = SamplingParams(
)
llm.generate("Write a creative story", sampling_params=params)
```

</hfoption>

</hfoptions>
@@ -659,14 +677,17 @@ Both frameworks provide ways to prevent repetitive text generation:
<hfoptions id="inference-frameworks" >

<hfoption value="tgi" label="TGI">

```python
client.generate(
"Write a varied text",
repetition_penalty=1.1, # Penalize repeated tokens
no_repeat_ngram_size=3, # Prevent 3-gram repetition
)
```

</hfoption>

<hfoption value="llama.cpp" label="llama.cpp">

```python
Expand All @@ -686,7 +707,9 @@ output = llm(
presence_penalty=0.5, # Additional presence penalty
)
```

</hfoption>

<hfoption value="vllm" label="vLLM">

```python
Expand All @@ -695,6 +718,7 @@ params = SamplingParams(
frequency_penalty=0.1, # Penalize token frequency
)
```

</hfoption>

</hfoptions>
@@ -706,6 +730,7 @@ You can control generation length and specify when to stop:
<hfoptions id="inference-frameworks" >

<hfoption value="tgi" label="TGI">

```python
client.generate(
"Generate a short paragraph",
@@ -714,7 +739,9 @@ client.generate(
stop_sequences=["\n\n", "###"],
)
```

</hfoption>

<hfoption value="llama.cpp" label="llama.cpp">

```python
Expand All @@ -729,7 +756,9 @@ response = client.completions.create(
# Via direct library
output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])
```

</hfoption>

<hfoption value="vllm" label="vLLM">

```python
Expand All @@ -741,6 +770,7 @@ params = SamplingParams(
skip_special_tokens=True,
)
```

</hfoption>

</hfoptions>
@@ -752,6 +782,7 @@ Both frameworks implement advanced memory management techniques for efficient in
<hfoptions id="inference-frameworks" >

<hfoption value="tgi" label="TGI">

TGI uses Flash Attention 2 and continuous batching:

```sh
@@ -763,7 +794,9 @@ docker run --gpus all -p 8080:80 \
--max-batch-total-tokens 8192 \
--max-input-length 4096
```

</hfoption>

<hfoption value="llama.cpp" label="llama.cpp">

llama.cpp uses quantization and optimized memory layout:
@@ -789,7 +822,9 @@ For models too large for your GPU, you can use CPU offloading:
--n-gpu-layers 20 \ # Keep first 20 layers on GPU
--threads 8 # Use more CPU threads for CPU layers
```

</hfoption>

<hfoption value="vllm" label="vLLM">

vLLM uses PagedAttention for optimal memory management:
@@ -806,6 +841,7 @@ engine_args = AsyncEngineArgs(

llm = LLM(engine_args=engine_args)
```

</hfoption>

</hfoptions>
@@ -818,4 +854,4 @@ llm = LLM(engine_args=engine_args)
- [vLLM GitHub Repository](https://github.com/vllm-project/vllm)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
- [llama.cpp GitHub Repository](https://github.com/ggerganov/llama.cpp)
- [llama-cpp-python Repository](https://github.com/abetlen/llama-cpp-python)
4 changes: 2 additions & 2 deletions chapters/fa/chapter2/3.mdx
@@ -188,7 +188,7 @@ model.save_pretrained("directory_on_my_computer")
```
ls directory_on_my_computer

config.json pytorch_model.bin
config.json model.safetensors
```
{:else}
```
@@ -204,7 +204,7 @@ config.json tf_model.h5

{#if fw === 'pt'}

فایل *pytorch_model.bin* در واقع *دیکشنری وضعیت‌ها* است و حاوی تمام وزن‌های مدل شماست. این دو فایل به همراه هم کاربرد دارند؛ فایل تنظیمات برای دانستن معماری به کار رفته در مدل ضروری است و پارامترهای مدل هم که همان وزن‌های داخل فایل دوم هستند.
فایل *model.safetensors* در واقع *دیکشنری وضعیت‌ها* است و حاوی تمام وزن‌های مدل شماست. این دو فایل به همراه هم کاربرد دارند؛ فایل تنظیمات برای دانستن معماری به کار رفته در مدل ضروری است و پارامترهای مدل هم که همان وزن‌های داخل فایل دوم هستند.

{:else}

4 changes: 2 additions & 2 deletions chapters/fr/chapter2/3.mdx
@@ -164,7 +164,7 @@ Cela enregistre deux fichiers sur votre disque :
```
ls directory_on_my_computer

config.json pytorch_model.bin
config.json model.safetensors
```
{:else}
```
@@ -177,7 +177,7 @@ config.json tf_model.h5
Si vous jetez un coup d'œil au fichier *config.json*, vous reconnaîtrez les attributs nécessaires pour construire l'architecture du modèle. Ce fichier contient également certaines métadonnées, comme l'origine du *checkpoint* et la version de la bibliothèque 🤗 *Transformers* que vous utilisiez lors du dernier enregistrement du point *checkpoint*.

{#if fw === 'pt'}
Le fichier *pytorch_model.bin* est connu comme le *dictionnaire d'état*. Il contient tous les poids de votre modèle. Les deux fichiers vont de pair : la configuration est nécessaire pour connaître l'architecture de votre modèle, tandis que les poids du modèle sont les paramètres de votre modèle.
Le fichier *model.safetensors* est connu comme le *dictionnaire d'état*. Il contient tous les poids de votre modèle. Les deux fichiers vont de pair : la configuration est nécessaire pour connaître l'architecture de votre modèle, tandis que les poids du modèle sont les paramètres de votre modèle.

{:else}
Le fichier *tf_model.h5* est connu comme le *dictionnaire d'état*. Il contient tous les poids de votre modèle. Les deux fichiers vont de pair : la configuration est nécessaire pour connaître l'architecture de votre modèle, tandis que les poids du modèle sont les paramètres de votre modèle.
4 changes: 2 additions & 2 deletions chapters/it/chapter2/3.mdx
@@ -160,7 +160,7 @@ In questo modo si salvano due file sul disco:
```
ls directory_on_my_computer

config.json pytorch_model.bin
config.json model.safetensors
```
{:else}
```
@@ -174,7 +174,7 @@ Se si dà un'occhiata al file *config.json*, si riconoscono gli attributi necess

{#if fw === 'pt'}

Il file *pytorch_model.bin* è noto come *state dictionary*; contiene tutti i pesi del modello. I due file vanno di pari passo: la configurazione è necessaria per conoscere l'architettura del modello, mentre i pesi del modello sono i suoi parametri.
Il file *model.safetensors* è noto come *state dictionary*; contiene tutti i pesi del modello. I due file vanno di pari passo: la configurazione è necessaria per conoscere l'architettura del modello, mentre i pesi del modello sono i suoi parametri.

{:else}
