Finetuning
- PEFT + TRL — Examples of using peft with trl to finetune 8-bit models with Low-Rank Adaptation (LoRA)
- Unsloth — Fine-tuning LLMs Guide | Unsloth Documentation
- Axolotl — Axolotl AI - Open Source Fine Tuning
- unsloth is great if you are low on resources and want faster training, but it has some rough edges here and there, and some things break in new releases; you have to dig into the source code to know what changed. It has been a lot more stable and better over the past few months, though. I used unsloth for fine-tuning Qwen 2.5 VL at work for a pretty complicated task and it worked great for us.
- trl + peft: this is more stable, and unsloth uses most of the trainers from trl with its own patching, so the APIs are similar in both cases (a minimal sketch follows this list).
- axolotl: this is a great option if you want ready-to-go configs and don’t want to hand-roll code, and it provides easy distributed GPU support. — via
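For the trl + peft path mentioned above, here is a minimal sketch: an 8-bit base model with a LoRA adapter trained through SFTTrainer. The model name, dataset, and hyperparameters are placeholders, and the exact SFTTrainer/SFTConfig arguments shift between trl releases, so treat this as a starting point rather than canonical usage.

```python
# Minimal LoRA SFT sketch with peft + trl; exact argument names vary by trl version.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder base model

# Load the base model in 8-bit (requires bitsandbytes) so it fits in less VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# LoRA: train small low-rank adapter matrices instead of the full weights.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Placeholder dataset with a plain "text" column.
dataset = load_dataset("imdb", split="train[:1%]")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="lora-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        dataset_text_field="text",  # older trl versions take this on SFTTrainer instead
    ),
)
trainer.train()
trainer.save_model("lora-out")  # saves just the LoRA adapter weights
```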
LLMs that you can run on the desktop or a “regular(ish) PC”.
A look at Apple’s new Transformer-powered predictive text model
the model being used by AppleSpell, an internal macOS application that checks for spelling and grammar mistakes as you type.
found the predictive text model in /System/Library/LinguisticData/RequiredAssets_en.bundle/AssetData/en.lm/unilm.bundle. The bundle contains multiple Espresso model files that are used while typing (Espresso appears to be the internal name for the part of CoreML that runs inference on models).
a set of 15,000 tokens in unilm.bundle/sp.dat that pretty clearly look like they form the vocabulary set for a large language model.
Read the rest of the blog post above to see how the tokenizer works and what the model architecture looks like (GPT-2-style?): about 34M parameters with a hidden size of 512 units, which makes it smaller than the smallest GPT-2 model.
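As a rough sanity check on those numbers, a back-of-the-envelope sketch assuming a standard GPT-2-style decoder; it ignores positional embeddings and biases, and the implied depth is solved for here, not taken from the post:

```python
# Back-of-envelope: how many GPT-2-style decoder blocks would it take to reach
# ~34M parameters with d_model = 512 and a ~15k-token vocabulary?
vocab_size = 15_000
d_model = 512
target_params = 34_000_000

embedding_params = vocab_size * d_model   # token embeddings (~7.7M, tied output head)
per_block_params = 12 * d_model ** 2      # ~4*d^2 attention + ~8*d^2 MLP per block (~3.1M)
implied_blocks = (target_params - embedding_params) / per_block_params

print(f"embeddings: {embedding_params / 1e6:.1f}M")
print(f"per block:  {per_block_params / 1e6:.1f}M")
print(f"implied depth: {implied_blocks:.1f} blocks")  # roughly 8, give or take
```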
Orca 2: Teaching Small Language Models How to Reason - Microsoft Research; see
M2 Max with 64GB RAM. It does ~50 tok/s on our q4 quantized 7b mistral fine-tune, with comparable speeds to GPT-4 via
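For comparison at home, a minimal way to measure that kind of local throughput with llama-cpp-python; the GGUF path below is a placeholder and the numbers depend entirely on hardware and quantization:

```python
# Rough tokens/sec measurement for a local q4-quantized GGUF model with llama-cpp-python.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-finetune.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all layers to Metal/GPU where available
)

start = time.time()
out = llm("Explain the difference between LoRA and full fine-tuning.", max_tokens=256)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```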
moondream is a computer-vision model that can answer real-world questions about images. It’s tiny by today’s standards, with only 1.6B parameters. That enables it to run on a variety of devices, including mobile phones and edge devices. (A usage sketch follows the applications list below.)
Apache 2.0. You can use moondream for commercial purposes.
Applications:
- Security
- Drone and Robotics
- Retail and shopping —
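A minimal moondream usage sketch via transformers. The remote-code API has changed across moondream revisions, so the method names below (encode_image / answer_question) are tied to earlier moondream2 checkpoints and should be treated as assumptions:

```python
# Visual question answering with moondream; encode_image/answer_question follow the
# earlier moondream2 remote code (newer revisions expose query()/caption() instead).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("shelf.jpg")  # placeholder image
enc = model.encode_image(image)
print(model.answer_question(enc, "How many items are on the shelf?", tokenizer))
```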
- Apache 2.0 license
- “Our goal is to create models that excel at RAG. Since RAG works by processing information at runtime, the main constraint is LLM size. For RAG, models don’t need to be huge; they just need strong text comprehension to give accurate answers when provided with the right context.”
- blog post: SLM Journey Unveiled — “In recent months, the landscape of language models has been enriched by the emergence of several small language models (e.g. TinyLlama, Phi2, Gemma, and StableLM2)”
Florence - a Microsoft Collection; SOTA 200M & 800M parameter vision foundation models. MIT licensed! The 200M checkpoint beats Flamingo 80B (a 400x bigger model) by a huge margin. Performs captioning, object detection and segmentation, OCR, phrase grounding, and more. Leverages the FLD-5B dataset: 5.4 billion annotations across 126 million images. Multi-task learning. Fine-tuned model checkpoints beat the likes of PaLI and PaLI-X.
“Florence2 200M, Qwen2 500M, MSFT InstructLM 500M With little fine-tuning they unlock so many creative and on-device use cases” via
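For reference, a sketch of running one Florence-2 task prompt (object detection) through transformers, following the pattern in the model card's sample usage; the image path is a placeholder:

```python
# Florence-2 object detection; the "<OD>" task tag and post_process_generation call
# follow the model card's sample usage.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("street.jpg")  # placeholder image
task = "<OD>"  # other tasks include "<CAPTION>", "<OCR>", "<DENSE_REGION_CAPTION>"

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(parsed)  # e.g. {'<OD>': {'bboxes': [...], 'labels': [...]}}
```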
Fine-tune Llama-3-8B with Llama-3-405B synthetic data
A simple notebook for fine-tuning a small model (Llama-3-8B) to be an expert in a specific domain by letting a larger, more capable model (Llama-3-405B) teach it, i.e. generate a synthetic dataset for that domain.
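The core recipe, sketched below: prompt the large teacher model to emit question/answer pairs for the target domain, dump them to JSONL, and fine-tune the small model on that file (e.g. with the LoRA sketch in the Finetuning section). The endpoint, model name, and prompt here are placeholder assumptions, not the notebook's actual code.

```python
# Sketch: generate a synthetic Q&A dataset with a large "teacher" model through an
# OpenAI-compatible API, then fine-tune the small model on the resulting JSONL.
# base_url, api_key, and the model name are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://your-inference-host/v1", api_key="...")
domain = "tax rules for freelancers"  # placeholder target domain

records = []
for _ in range(100):  # toy run; real distillation uses far more samples
    resp = client.chat.completions.create(
        model="llama-3.1-405b-instruct",  # placeholder teacher model name
        messages=[{
            "role": "user",
            "content": f"Write one question about {domain} and a detailed answer. "
                       "Return JSON with keys 'question' and 'answer'.",
        }],
        response_format={"type": "json_object"},  # not supported by every backend
    )
    records.append(json.loads(resp.choices[0].message.content))

with open("synthetic_domain_data.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")  # feed this file to an SFT run on the small model
```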
—
nisten/Biggie-SmoLlm-0.15B-Base · Hugging Face via
—
AMD Unveils Its First Small Language Model AMD-135… - AMD Community
Papers
we lay out the position that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. Our argumentation is grounded in the current level of capabilities exhibited by SLMs, the common architectures of agentic systems, and the economy of LM deployment. We further argue that in situations where general-purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. We discuss the potential barriers for the adoption of SLMs in agentic systems and outline a general LLM-to-SLM agent conversion algorithm.
Just an 8B model trained on calling tools and other LLMs to answer queries. It’s a great demo of what frontier SLMs will be about in 2026.
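A toy illustration of the heterogeneous-agents idea from the paper: let a small model handle calls by default and escalate to a larger model only when it is not confident. The endpoint, model names, and the "ESCALATE" convention are placeholder assumptions; real routers use stronger signals than a magic string.

```python
# Toy SLM-first routing: ask the small model, escalate to the large one only if it
# declines. Endpoint and model names are placeholders; "ESCALATE" is a crude signal.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # placeholder server
SMALL, LARGE = "small-8b-instruct", "large-70b-instruct"              # placeholder names

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "If you are not confident, reply exactly: ESCALATE"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

def answer(question: str) -> str:
    reply = ask(SMALL, question)
    if reply == "ESCALATE":            # real systems use routers, verifiers,
        reply = ask(LARGE, question)   # or logprob thresholds instead
    return reply

print(answer("Summarize our warranty policy in one sentence."))
```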
macOS desktop
- Quantized Gemma 2B running at 157 toks/sec in MLX on M1 Max laptop (see the sketch after this list)
- simonmysun/ell: A command-line interface for LLMs written in Bash.
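A minimal mlx-lm sketch for that kind of local Gemma run on Apple silicon; the quantized repo name is an assumption (mlx-community hosts many pre-converted 4-bit checkpoints):

```python
# Run a 4-bit quantized Gemma 2B locally on Apple silicon with mlx-lm.
from mlx_lm import load, generate

# Repo name is an assumption; mlx-community publishes pre-quantized MLX checkpoints.
model, tokenizer = load("mlx-community/gemma-2b-it-4bit")
text = generate(model, tokenizer, prompt="Write a haiku about laptops.", verbose=True)
print(text)
```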
Phone
- “phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.” (Abdin et al., 2024). (No code or model was announced with the paper.)
aiOS™ by Hyperspace “Organizing the World’s AI Agents. Join the world’s largest peer-to-peer AI network and start earning points”