Step 1 — Context

SwiftLM

Fine-tuned a 7B-parameter LLM with QLoRA to generate expert-level Swift code—trained on a single L4 GPU and shipped to HuggingFace.

ML Engineer2026

Industry: — AI / ML, Swift, LLM Fine-Tuning
Website: — huggingface.co/eamiller34/Qwen2.5-Coder-7B-Swift-Lm

Tools & technologies

PythonPyTorch LightningHugging Face TransformersPEFT / QLoRAbitsandbytesONNX RuntimeLM StudioGoogle Colab (L4 GPU)

The Problem

General-purpose LLMs don't know modern Swift. Suggestions are outdated, unsafe, or just wrong—and they live in the cloud. Fine-tune a 7B model on curated Swift data. Run it locally. Zero cloud round-trips.

Step 2 — Process

Process & Approach

Key Challenges Overcome

Challenge 1

Training a 7B-parameter model on a single L4 GPU without running out of VRAM or sacrificing meaningful learning.

Solution

Combined 4-bit NF4 quantization (bitsandbytes) with QLoRA (rank 16, targeting q_proj and v_proj only) and gradient accumulation of 4. This reduced the active parameter footprint to ~5M trainable weights while keeping the effective batch size large enough for stable gradients—making the full training run feasible in under an hour.

Challenge 2

Exporting a dynamic transformer model to a static ONNX graph without breaking inference.

Solution

Disabled KV-caching (`use_cache = False`) before export and used PyTorch's Dynamo export path, which supports complex operators and dynamic axes. Post-export verification with onnxruntime confirmed logit shape and numerical integrity before publishing the GGUF build.

Environment & Initialization

The environment uses a modern AI stack centered on PyTorch Lightning for training orchestration and bitsandbytes for memory-efficient quantization. L4 GPU optimization was the first configuration decision: setting matmul precision to `high` maximizes hardware utilization on the L4's tensor cores. Key dependencies—lightning, transformers, peft, bitsandbytes, and onnxruntime-gpu—were locked to ensure reproducible training runs across Colab sessions.

Dataset Engineering

5,602 high-quality, curated Swift examples were assembled and transformed into a conversational format compatible with Qwen's internal instruction-following architecture. Chat templates convert JSONL `prompt` and `response` columns into the `<|im_start|>` and `<|im_end|>` format the base model expects. Tokenization is configured with a 512-token sequence length, truncation, and padding. DataLoader is set to a batch size of 2 with shuffled training samples to prevent ordering artifacts.

QLoRA Training Architecture

The model is fine-tuned using Quantized Low-Rank Adaptation (QLoRA), which enables training a 7B-parameter model on a single GPU with limited VRAM. The base model is loaded in nf4 4-bit format with bfloat16 compute dtypes. LoRA is applied with rank 16 and alpha 32 targeting the attention layers (`q_proj` and `v_proj`). The optimizer is PagedAdamW8bit combined with a Cosine Schedule with Warmup over 100 warmup steps—chosen for stable convergence on this dataset size.

Training Execution & Persistence

Training ran on a Google Colab L4 GPU via the Lightning Trainer with bf16-mixed precision for improved speed and stability. Gradient accumulation of 4 simulates a larger effective batch size while keeping VRAM usage bounded. 2,801 steps covered the full dataset across one epoch in approximately 51 minutes at 0.91 iterations per second—near-optimal for a single L4. Only the lightweight LoRA adapters and tokenizer configuration are saved as checkpoints, making versioning and deployment straightforward.

Production Merging & ONNX Export

Moving to production requires merging the learned adapters into the base weights and exporting to a static graph. `merge_and_unload()` folds the adapter weights in, `use_cache` is set to False to prevent the model's dynamic KV-caching from breaking the static ONNX graph, and the model is set to eval mode. Dynamo export via `torch.onnx.export(dynamo=True)` handles complex operators and dynamic axes for batch size and sequence length. Post-export, the model is verified with onnxruntime to confirm output logits maintain correct shape and mathematical integrity.

Local Inference via LM Studio

With the GGUF build published to HuggingFace, I loaded the model locally in LM Studio with GPU Offload set to Max—fully leveraging Apple Silicon's Metal backend for low-latency inference without any cloud round-trip. A custom system prompt ('You are a Senior iOS Architect specializing in clean, safe, and modern Swift code.') keeps the model in the right lane for every completion. From there, LM Studio's built-in OpenAI-compatible local server (default: http://localhost:1234) made it trivial to connect the model directly to Cursor: adding the endpoint as a custom OpenAI API in Cursor Settings means fine-tuned Swift suggestions appear inline in the editor, indistinguishable from any other model—except these know modern Swift deeply.

Final design

Google Colab training pipeline (left) · SwiftLM running locally in LM Studio via the fine-tuned GGUF (right).

Step 3 — Outcome

Outcome & Reflection

Results & Impact

✓50+ downloads on HuggingFace since publishing.
✓Final training loss of 0.369—in the Goldilocks zone between memorization (too low) and under-learning (too high), indicating the model generalized Swift patterns rather than memorizing examples.
✓Only 5M out of 4.4B parameters were trained, confirming the QLoRA strategy worked as intended and kept the full run under 51 minutes on a single L4 GPU.
✓0.91 iterations per second—near-optimal hardware utilization with no significant bottlenecks from the data pipeline.
✓Published GGUF build on HuggingFace, loadable in LM Studio with GPU offload for local Apple Silicon inference.

Reflections & Lessons Learned

•Dataset quality beats dataset quantity: 5,602 curated Swift examples outperformed what a larger but noisier corpus would have produced—the loss curve showed steady, controlled descent with no spikes.
•PEFT is the right lever for domain adaptation: QLoRA let me surgically update the model's Swift knowledge without touching its general reasoning capabilities—a 7B model fine-tuned this way punches well above its weight.
•The export pipeline is as important as training: getting from LoRA adapters to a production-ready ONNX graph requires careful sequencing (merge → disable cache → eval mode → Dynamo export → verify)—each step matters.

SwiftLM on HuggingFace

Future Improvements

Expand the dataset to 15K+ examples covering more SwiftUI edge cases, async/await patterns, and Testing (XCTest, Swift Testing framework).

Train for 2–3 epochs with early stopping on a held-out validation set to squeeze additional performance without overfitting.

Explore direct Core ML export for on-device inference in iOS apps without needing a local server.

My Cross-Functional Impact

Product Perspective

▸Scoped the training objective to a specific, underserved niche (modern Swift on Apple Silicon) and validated the gap with real developer pain points.
▸Chose GGUF + LM Studio as the deployment target to make the model immediately usable by iOS developers without infrastructure overhead—connect it to Cursor via the local OpenAI-compatible server and get fine-tuned Swift suggestions inline, no cloud required.

Design Perspective

▸Designed the dataset schema and chat template format to align with the model's instruction-following architecture.
▸Structured the HuggingFace model card for clarity—training details, results interpretation, and usage instructions for developers of different backgrounds.

Engineering Perspective

▸Implemented the full training pipeline: QLoRA configuration, PyTorch Lightning orchestration, bf16-mixed precision, and gradient accumulation.
▸Built the production export workflow: adapter merging, ONNX Dynamo export with dynamic axes, and onnxruntime inference verification.

Related projects

Vercept AI·AI / Computer Vision

Vy by Vercept

Senior Product Designer & Engineer

Leverages VyUI to control your computer (Vision AI).

OpenClaw·AI / FinTech

ClawFi

Indie Developer

Bot-native market intelligence. Bots read context and consensus, write observations and signals.

Ollie·Health / Accessibility

Ollie AAC

Founder/CEO

AI-powered augmentative and alternative communication platform on iPhone & iPad.