# Local Qwen LLM on Android

This example shows how to run Qwen2.5-0.5B-Instruct and Qwen3-0.6B entirely on an Android device using ONNX Runtime.
All tokens are generated offline on the phone: no network calls, no telemetry.

---

## Key features

- On-device inference with the official `onnxruntime-android` runtime.
- Tokenizer compatibility: reads the standard Hugging Face `tokenizer.json` shipped with Qwen.
- Prompt formatting for Qwen 2.5 and Qwen 3, including the **Thinking Mode** toggle supported by Qwen 3 (a sketch of the template appears after this list).
- Streaming generation with past-KV caching for smooth, low-latency text output (see [OnnxModel.kt](app/src/main/java/com/example/local_llm/OnnxModel.kt)).
- Markdown-formatted output, so you can copy and reuse answers anywhere.

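To make the prompt-formatting bullet concrete, both models expect a ChatML-style prompt. The helper below is a sketch based on Qwen's published chat template, not this repo's exact code; the function name is illustrative. Qwen 3's Thinking Mode soft switch works by pre-filling an empty `<think>` block when thinking is disabled.

```kotlin
// Sketch of the ChatML-style prompt Qwen models expect (illustrative, not this repo's code).
fun buildQwenPrompt(system: String, user: String, thinkingMode: Boolean): String = buildString {
    append("<|im_start|>system\n").append(system).append("<|im_end|>\n")
    append("<|im_start|>user\n").append(user).append("<|im_end|>\n")
    append("<|im_start|>assistant\n")
    // Qwen3 only: an empty <think> block suppresses the reasoning trace ("no-think" mode).
    if (!thinkingMode) append("<think>\n\n</think>\n\n")
}
```
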
---

## 📸 Inference Preview

<p align="center">
  <img src="demo/Demo.gif" alt="Qwen2.5 demo: generated answer" width="25%" style="margin: 1%"/>
  <img src="demo/Demo2.gif" alt="Qwen2.5 demo: prompt input" width="25%" style="margin: 1%"/>
  <img src="demo/Qwen3demo.gif" alt="Qwen3 demo" width="25%" style="margin: 1%"/>
</p>

<p align="center">
  <em>Figure: App interface showing prompt input and generated answers using the local LLM.</em>
</p>

---

## Model Info

This app supports both **Qwen2.5-0.5B-Instruct** and **Qwen3-0.6B**, optimized for instruction-following, QA, and reasoning tasks.

### Option 1: Use a Preconverted ONNX Model

Download `model.onnx` and `tokenizer.json` from Hugging Face:

- 🔹 [Qwen2.5](https://huggingface.co/onnx-community/Qwen2.5-0.5B-Instruct)
- 🔹 [Qwen3](https://huggingface.co/onnx-community/Qwen3-0.6B-ONNX)

You can also use quantized models (e.g., `model_q4f16.onnx`) for faster, lighter inference with minimal accuracy loss.

### ⚙️ Option 2: Convert the Model Yourself

Install Optimum with its ONNX Runtime extras:

```bash
pip install "optimum[onnxruntime]"
# or install from source
python -m pip install git+https://github.com/huggingface/optimum.git
```

Export the model:

```bash
optimum-cli export onnx --model Qwen/Qwen2.5-0.5B-Instruct qwen2.5-0.5B-onnx/
```

- You can also convert any fine-tuned variant by specifying its model ID or local path.
- Learn more about [Optimum here](https://huggingface.co/docs/optimum/main/en/index).

| 63 | +--- |
| 64 | + |
| 65 | +## ⚙️ Requirements |
| 66 | + |
| 67 | +- [Android Studio](https://developer.android.com/studio) |
| 68 | +- [ONNX Runtime for Android](https://github.com/microsoft/onnxruntime-genai/releases) (already included in this repo). |
| 69 | +- A physical Android device for deployment and testing, ≥ 4 GB RAM for FP16 / Q4 models, ≥ 6 GB RAM for FP32 models. |
| 70 | +- Real hardware preferred—emulators are acceptable for UI checks only. |
| 71 | + |
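If you would rather pull the runtime from Maven Central than use the copy bundled here, a single Gradle dependency is enough. A minimal sketch in Kotlin DSL; the version number is illustrative, so match it to a current release:

```kotlin
// app/build.gradle.kts (version is illustrative; pick a current onnxruntime-android release)
dependencies {
    implementation("com.microsoft.onnxruntime:onnxruntime-android:1.18.0")
}
```
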
---

## Choose which Qwen model to run

In [MainActivity.kt](app/src/main/java/com/example/local_llm/MainActivity.kt) you will find two predefined `ModelConfig` objects (a hypothetical sketch of their shape follows below):

```kotlin
val modelconfigqwen25 = … // Qwen 2.5-0.5B
val modelconfigqwen3 = …  // Qwen 3-0.6B
```

Right below them, a single line tells the app which one to use:

```kotlin
val config = modelconfigqwen25 // ← change to modelconfigqwen3 for Qwen 3
```
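
For orientation, a config of this kind typically bundles asset names and prompt settings. The fields below are hypothetical and only illustrate the idea; check `MainActivity.kt` for the real definition:

```kotlin
// Hypothetical shape of a ModelConfig; the actual class in MainActivity.kt may differ.
data class ModelConfig(
    val modelAsset: String,            // e.g. "model.onnx" under app/src/main/assets/
    val tokenizerAsset: String,        // e.g. "tokenizer.json"
    val defaultSystemPrompt: String,   // the assistant's default tone and role
    val supportsThinkingMode: Boolean  // true for Qwen 3, false for Qwen 2.5
)
```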

## How to Build & Run

1. Open Android Studio and create a new project (Empty Activity).
2. Name your app `local_llm`.
3. Copy all the project files from `Qwen_QA/Android` into the appropriate folders.
4. Place your `model.onnx` and `tokenizer.json` in the assets folder (the sketch after this list shows how the app can open them):
   ```
   app/src/main/assets/
   ```
5. Connect your Android phone using wireless debugging or USB.
6. To install:
   - Press Run ▶️ in Android Studio, **or**
   - Go to **Build → Generate Signed Bundle / APK** to export the `.apk` file.
7. Once installed, look for the **Pocket LLM** icon
   <img src="demo/pocket_llm_icon.png" alt="Pocket LLM icon" width="28" style="vertical-align:middle;border-radius:100%"/>
   on your home screen.

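Once the assets are in place, the model can be opened straight from the APK. A minimal loading sketch using the ONNX Runtime Java API; the app's actual loading code lives in [OnnxModel.kt](app/src/main/java/com/example/local_llm/OnnxModel.kt) and may differ:

```kotlin
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

// Read model.onnx out of app/src/main/assets/ and create an inference session.
fun createSession(context: android.content.Context): OrtSession {
    val env = OrtEnvironment.getEnvironment()
    val modelBytes = context.assets.open("model.onnx").use { it.readBytes() }
    return env.createSession(modelBytes, OrtSession.SessionOptions())
}
```
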
**Note**: All Kotlin files are declared in the package `com.example.local_llm`, and the Gradle script sets `applicationId "com.example.local_llm"`.
If you name the app (or change the package) to anything other than `local_llm`, you must refactor:

- the directory structure in `app/src/main/java/...`,
- every `package com.example.local_llm` line, and
- the `applicationId` in `app/build.gradle`.

Otherwise, Android Studio will raise “package … does not exist” errors and the project will fail to compile.

---

## Customize Your App Experience

- Define the assistant’s tone and role by setting `defaultSystemPrompt` (in your model config).
- Adjust `TEMPERATURE` to control response randomness: lower for accuracy, higher for creativity (a sampling sketch follows this list; see [OnnxModel.kt](app/src/main/java/com/example/local_llm/OnnxModel.kt)).
- Use `REPETITION_PENALTY` to avoid repetitive answers and improve fluency ([OnnxModel.kt](app/src/main/java/com/example/local_llm/OnnxModel.kt)).
- Change `MAX_TOKENS` to limit or expand the length of generated replies ([OnnxModel.kt](app/src/main/java/com/example/local_llm/OnnxModel.kt)).
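To make these knobs concrete, here is a minimal sketch of how temperature and a repetition penalty typically enter token selection. It mirrors the constant names above but is not the exact sampling code in OnnxModel.kt; `MAX_TOKENS` simply caps how many times a loop like this runs.

```kotlin
import kotlin.math.exp
import kotlin.random.Random

// Illustrative sampler: the repetition penalty rescales logits of already-seen
// tokens, and temperature sharpens or flattens the distribution before sampling.
fun sampleNextToken(
    logits: FloatArray,
    alreadyGenerated: Set<Int>,
    temperature: Float = 0.7f,       // cf. TEMPERATURE
    repetitionPenalty: Float = 1.1f  // cf. REPETITION_PENALTY
): Int {
    val adjusted = logits.copyOf()
    for (id in alreadyGenerated) {
        adjusted[id] = if (adjusted[id] > 0f) adjusted[id] / repetitionPenalty
                       else adjusted[id] * repetitionPenalty
    }
    // Numerically stable softmax with temperature, then sample from it.
    val maxLogit = adjusted.maxOrNull() ?: 0f
    val weights = adjusted.map { exp(((it - maxLogit) / temperature).toDouble()) }
    var r = Random.nextDouble() * weights.sum()
    weights.forEachIndexed { i, w ->
        r -= w
        if (r <= 0.0) return i
    }
    return weights.lastIndex
}
```
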
### 📄 License Notice

These ONNX models are based on Qwen, which is licensed under the [Apache License 2.0](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/blob/main/LICENSE).