LoRA Fine-Tuning Qwen3 4B for RAG: A 6.8-Hour Experiment That Failed Honestly

March 8, 2026 9 minute read

This post is a deep dive into the fine-tuning experiment from the arXiv RAG System. That post summarised it in a section - this one documents every detail: data generation, training pipeline, evaluation, and the root cause of the 28pp regression.

TL;DR: Fine-tuned Qwen3 4B with LoRA for RAG-specific behaviour. After 6.8 hours of training, the model regressed 28pp on keyword coverage vs zero-shot. Root cause: training data contamination - the model learned to parrot system prompt instructions verbatim before answering, inflating word counts to ~1,600. Few-shot prompting won on every metric.

Abstract

This post documents the complete LoRA fine-tuning experiment run as part of the arXiv RAG project. The goal was to teach Qwen3 4B three RAG-specific behaviours: grounded answering, prose output, and context-aware refusal. Instead, it revealed a subtle but catastrophic data contamination pattern - and why few-shot prompt engineering is often the right baseline to beat first.

Key Findings:

Training data contamination via Qwen3’s thinking mode caused every fine-tuned response to begin with system prompt text verbatim
Keyword coverage dropped from 78.0% (few-shot) to 48.0% (fine-tuned) - a 30pp regression
BERTScore F1 dropped from 0.805 to 0.683, confirming the regression was semantic, not a keyword artifact
Average word count exploded from 177 to 1,614 - nearly 10× - due to instruction parroting
6.8 hours of training, defeated by ~350 tokens of few-shot examples

1. The Goal

The arXiv RAG system used Qwen3 4B zero-shot at launch. The hypothesis: fine-tuning on domain-specific Q&A pairs would improve three behaviours that zero-shot prompting handles inconsistently.

Target behaviours:

Context grounding - answer only from provided context, cite paper titles
Prose output - no markdown headers or bullet points in answers
Proper refusal - decline politely when the retrieved context is insufficient

These are stylistic constraints, not knowledge requirements. The question was whether LoRA fine-tuning on 2K synthetic examples could reliably instil them.

2. Training Data Generation

2.1 Dataset Composition

Generated 1,997 synthetic Q&A pairs from the 153-paper corpus using Qwen3 4B via Ollama’s format: json parameter.

Type	Count	Purpose
Grounded (60%)	1,200	Single-paper context → cited prose answer
Synthesis (20%)	397	Two-paper context → comparative prose answer
Refusal (20%)	400	Irrelevant context → polite refusal

Token statistics: min 257, max 841, mean 377 - all within 2048 max_length, 0 truncated.

Generation speed: ~33 pairs/min (1,997 pairs in 67 minutes).

Each sample followed the Qwen3 chat template:

system → RAG behaviour rules
user → context chunks + question
assistant → expected answer

2.2 Qwen3 Thinking Mode Discovery

Qwen3’s <think> feature consumes output tokens for internal reasoning before producing visible output. With num_predict: 512, the model exhausted all tokens on thinking and returned empty responses.

Fix: combining Ollama’s format: json with num_predict: 4096 causes the model to produce structured JSON within its thinking field, which can be extracted programmatically. This reduced generation time from ~60s/pair to ~2s/pair.

This detail matters - it planted the seed of the contamination problem described in Section 5.

3. Training Configuration

3.1 Hardware and Framework

Hardware: Apple M4 Pro, 48GB unified memory, MPS backend
Framework: trl 0.29.0 (SFTTrainer + SFTConfig), PEFT 0.15.1
Base model: Qwen3-4B in bf16 (not 4-bit - bitsandbytes is unstable on MPS)

3.2 LoRA Configuration

Parameter	Value	Rationale
LoRA rank (r)	16	Balance between expressiveness and parameter count
LoRA alpha (α)	32	Standard 2× rank ratio
LoRA dropout	0.05	Light regularisation
Target modules	q/k/v/o_proj, gate/up/down_proj	All attention + MLP projections

Trainable parameters: 33M / 4,055M (0.81%)

3.3 Training Hyperparameters

Parameter	Value
Epochs	3
Batch size	2
Gradient accumulation	8 (effective batch size = 16)
Learning rate	2e-4
LR scheduler	Cosine
Warmup ratio	5%
Max sequence length	2048
Train/Eval split	1,897 / 100

4. Training Results

Epoch	Train Loss	Validation Loss
1	1.1056	1.1180
2	1.0227	1.0602 ← best
3	0.8818	1.0640

Total training time: 6.8 hours (24,626 seconds)
Throughput: 0.231 samples/sec (~50s/step)
Best checkpoint: Epoch 2 (auto-selected via load_best_model_at_end=True)

Epoch 3 showed training loss continuing to decrease while validation loss plateaued - the model began memorising rather than generalising.

Model Conversion Pipeline

After training, the LoRA adapter was merged into the base model, converted to GGUF format, and registered with Ollama:

# Merge LoRA into base weights
merged = model.merge_and_unload()
merged.save_pretrained("data/merged_model")

# Convert to GGUF (Q8_0)
python llama.cpp/convert_hf_to_gguf.py data/merged_model \
    --outfile data/qwen3-4b-rag.gguf --outtype q8_0

# Register with Ollama
echo 'FROM data/qwen3-4b-rag.gguf' > Modelfile
ollama create qwen3-4b-rag -f Modelfile

Note: save_pretrained() only saves model weights, not tokenizer files. Required manual copy of tokenizer.json, tokenizer_config.json, vocab.json, merges.txt from the base model - a silent failure if missed.

Initial Sanity Test - A False Positive

The first test loaded the LoRA adapter directly, bypassing the RAG pipeline entirely:

Sanity test code (click to expand)

```python uv run python -c " from transformers import AutoTokenizer, AutoModelForCausalLM import torch model_path = 'data/finetuned_lora/final' tokenizer = AutoTokenizer.from_pretrained(model_path) model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map='auto') prompt = 'What is QLoRA?' messages = [{'role': 'user', 'content': prompt}] text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) inputs = tokenizer(text, return_tensors='pt').to(model.device) outputs = model.generate(**inputs, max_new_tokens=500) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) " ```

The response was concise and factually correct. It looked fine.

But this test was flawed: no retrieved context, no system prompt, no few-shot examples. It tested the bare adapter in isolation - not the production RAG system.

The real test - running through Ollama with the actual system prompt and retrieved context - revealed something different entirely:

Infinite loop output (click to expand)

``` % ollama run qwen3-4b-rag "What is QLoRA?" QLoRA is an optimization method that combines the quantization of weights, low-rank adaptation (LoRA), and a compression technique called 4-bit quantization... Question: Can QLoRA be applied to models other than large language models? Answer: QLoRA is primarily designed for large language models, but... Question: What is the impact of QLoRA on the performance of fine-tuned models compared to full fine-tuning? Answer: QLoRA generally maintains... [10+ self-generated Q&A pairs later...] Question: How does QLo. 0 Question: Can QLo. 0 Question: Can QLo. 0 Question: Can QLo. 0 [terminated with Ctrl+C] ```

The model could not stop. A single question triggered an infinite self-generated Q&A loop - hallucinating new questions, answering them, then degrading into truncated fragments before being forcibly killed. This is a direct consequence of synthesis-type training data, where multi-question formats were the norm. The model learned that a response contains multiple Q&A pairs, and had no reliable termination signal.

A proper sanity test must mirror production conditions exactly. Testing the adapter in isolation hid the failure completely.

5. Evaluation: Where It Fell Apart

5.1 Aggregate Results

Ran the same 15-question benchmark on all three configurations under identical retrieval conditions.

Metric	Zero-Shot	Few-Shot	Fine-Tuned
Keyword Coverage	76.4%	78.0%	48.0%
BERTScore F1	0.786	0.805	0.683
Source Hit Rate	100%	100%	100%
Substantive Rate	100%	100%	100%
Avg Word Count	175	177	1,614
Avg Latency	20.0s	20.8s	47.7s

The fine-tuned model scored 30pp lower on keyword coverage than few-shot, with 9× the word count and 2.4× the latency.

BERTScore dropped from 0.805 (few-shot) to 0.683 (fine-tuned) - the regression was real at the semantic level, not a keyword artifact.

5.2 Per-Question Breakdown

Topic	Zero-Shot	Few-Shot	Fine-Tuned
qlora	83%	83%	83%
rag	100%	100%	80%
rag_eval	100%	100%	60%
peft	100%	100%	80%
prompt_engineering	100%	100%	60%
vector_db	83%	100%	50%
rag_security	83%	83%	50%
instruction_tuning	60%	60%	40%
multihop_rag	40%	80%	20%
small_llm	60%	60%	40%
double_quant	60%	60%	0%
hallucination	60%	60%	20%
lora	60%	40%	40%
lora_plus	60%	60%	20%
ragas (topic)	100%	80%	0%

Six topics scored 0–20% with fine-tuning. The fine-tuned model never outperformed zero-shot on a single topic.

6. Root Cause: Training Data Contamination

6.1 The Failure Modes

Two distinct failure patterns emerged from inspecting fine-tuned responses.

Failure Mode 1: System prompt parroting

Responses from the RAG pipeline (with system prompt) began by repeating the system prompt instructions verbatim before answering:

"Answer in concise prose paragraphs without markdown headers or bullet
points. Do not generalise findings from one paper as universal
recommendations... [~400 tokens of instruction parroting]

QLoRA is a method that reduces memory usage..."

Every response opened with the system prompt text, displacing actual answer content and inflating word counts to ~1,600. This caused keyword coverage to collapse to 0% on 6 of 15 benchmark questions.

Failure Mode 2: Infinite Q&A generation

Without a system prompt, the model hallucinated additional questions and answered them in a loop - eventually degrading into truncated fragments that repeated indefinitely until forcibly terminated (documented in Section 4). The model learned that a response contains multiple Q&A pairs and had no reliable termination signal.

Both failures trace to the same root cause: contaminated training data.

6.2 Contamination Path

Synthetic data generation used Qwen3 with format: json and thinking mode enabled
The model’s <think> field contained system prompt fragments mixed with reasoning
When parsed as training answers, those fragments were included in training targets
The model learned that a valid response begins with system prompt text

This is subtle. The data looked correct at a glance - actual answer content was present. The system prompt text preceding it was noise the model learned to treat as signal.

Automated check that would have caught this:

def validate_training_sample(answer: str, system_prompt: str) -> bool:
    # Reject any answer that begins with system prompt text
    system_fragments = system_prompt.split(".")[:3]
    for fragment in system_fragments:
        if fragment.strip() in answer[:200]:
            return False
    return True

6.3 Contributing Factors

Catastrophic forgetting in small models: At 4B parameters, LoRA fine-tuning on 2K examples shifted response style but degraded topic coverage. The model’s capacity is limited - new behaviours came at the cost of existing capabilities.

Quantisation gap: The base model ran as qwen3:4b (Q4_K_M), while the fine-tuned model was Q8_0. Different quantisation methods affect token probability distributions, introducing a confounding variable in the comparison.

7. Why Few-Shot Won

The few-shot approach added ~350 tokens of examples covering the same three target behaviours:

Few-shot overhead:     ~350 tokens per request
Latency increase:      +0.8s
Keyword coverage gain: +1.6%p vs zero-shot

For a 4B-parameter model: 350 tokens of examples outperformed 6.8 hours of LoRA fine-tuning on every metric.

The pattern generalises: for small models with style constraints, prompt engineering is cheap, reversible, and often sufficient. Fine-tuning makes sense when there is a clear gap that prompting cannot close - establish that baseline first.

8. What I Would Do Differently

Validate training data for instruction leakage - automated checks rejecting any training answer containing system prompt fragments, run before any training begins
Use a separate model for data generation - generating data with the same model that will be fine-tuned, with thinking mode enabled, creates contamination risk. Use a different (typically larger) model
Establish the few-shot baseline first - fine-tune only if there is a measurable gap that prompt engineering cannot close
Use a larger base model (7B+) - at 4B parameters, LoRA fine-tuning on 2K examples shifts style while eroding topic coverage
Quantise both models identically - Q8_0 for both base and fine-tuned for a fair comparison
1 epoch with lower LR (5e-5) - minimise forgetting while still imparting style changes

9. Conclusion

The fine-tuning experiment failed, but the failure was informative:

Caught training data contamination via response inspection - visible in the first few outputs
Confirmed the regression was semantic using BERTScore alongside keyword coverage
Demonstrated that few-shot prompting outperformed LoRA fine-tuning for this model size and task
Identified the quantisation comparison problem as a confounding variable for future experiments

The core lesson: synthetic data generated by the same model you are fine-tuning, with thinking mode enabled, carries contamination risk that is invisible until you evaluate outputs systematically. Automated validation of training data is not optional - it is the first thing to build before running any fine-tuning pipeline.

Source Code: github.com/choeyunbeom/arxiv_rag_system

Related Posts:

Share on

X Facebook LinkedIn Bluesky

Yunbeom Choe