We Tell You What It Forgot
Gemma 3 4B: FP16 vs actual Q4_0 quantization. Not just “is it different?” — but which semantic associations were preserved, weakened, lost, or reorganised.
Methodology
What is a “semantic edge”? A detected association between a probed concept and another token in the model's internal representation space. Each edge has a strength score at a specific layer.
How were Q4_0 weights compared? The Q4_0 GGUF was dequantized to float32 (preserving quantization damage), saved as safetensors, then audited with the same tool as the FP16 baseline. Cosine similarity between FP16 and dequantized Q4_0 model weights: 0.9959 — confirming the comparison is valid.
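The weight-level sanity check reduces to one dot product over flattened tensors. A minimal sketch of that check (function and array names are hypothetical, and a small synthetic perturbation stands in for real Q4_0 error):

```python
import numpy as np

def weight_cosine(fp16_weights: np.ndarray, deq_q4_weights: np.ndarray) -> float:
    """Global cosine similarity between two flattened weight vectors."""
    a = fp16_weights.astype(np.float32).ravel()
    b = deq_q4_weights.astype(np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy illustration: a small perturbation keeps global cosine near 1,
# which is why 0.9959 alone says little about semantic damage.
rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
w_q = w + rng.normal(scale=0.05, size=w.shape).astype(np.float32)
print(round(weight_cosine(w, w_q), 4))  # close to 1.0
```

As the structural audit below shows, a near-1.0 global cosine can coexist with heavy reorganisation of the internal topology.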
QAT vs Naive PTQ — Same Bit Width, Different Damage
Both are Q4_0 (4-bit). QAT was trained for it. PTQ was not. Standard metrics say QAT is better. Our audit shows the full picture.
| Metric | QAT Q4_0 | Naive PTQ Q4_0 | Interpretation |
|---|---|---|---|
| Semantic edge retention | 78% | 81% | PTQ preserves more semantic neighbors |
| Structural coverage | 87–96% | 89–100% | PTQ reaches more structural regions |
| Topology agreement | 12–15% | 14% | Similar internal reorganisation |
| Training probe | 65% | 66% | Nearly identical |
| Deployment probe | 53% | 54% | Nearly identical |
| Diagnosis | All SAFE | All SAFE | Both within safe bounds |
| Perplexity (Google) | Best | Worse | QAT wins on output quality |
Of the 145 semantic edges detected in FP16, 113 survived Q4_0 quantization, 32 were lost, and 24 new edges were gained. Retention is domain-dependent: robot and photosynthesis at 94%, Python and DNA near 65%.
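Classifying edges as retained, lost, or gained is plain set arithmetic over the two edge lists. A sketch using the headline totals (edge identifiers here are synthetic placeholders, not real probed tokens):

```python
def edge_diff(fp16_edges: set, q4_edges: set) -> dict:
    """Classify edges as retained, lost, or gained between two snapshots."""
    retained = fp16_edges & q4_edges
    return {
        "retained": len(retained),
        "lost": len(fp16_edges - q4_edges),
        "gained": len(q4_edges - fp16_edges),
        "retention": len(retained) / len(fp16_edges),
    }

# Toy sets reproducing the headline totals: 145 FP16 edges,
# 113 retained, 32 lost, 24 gained.
fp16 = {f"e{i}" for i in range(145)}
q4 = {f"e{i}" for i in range(113)} | {f"new{i}" for i in range(24)}
stats = edge_diff(fp16, q4)
print(stats)  # retention = 113/145 ≈ 0.779
```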
High-Signal Changes
The changes that matter most — high-confidence losses and gains with clear semantic meaning.
| Concept | Edge | Change | Type | Interpretation |
|---|---|---|---|---|
| Paris | German | LOST | core semantic | European language association lost |
| insurance | lawsuits | LOST | domain semantic | Legally relevant association lost |
| Python | None | LOST | programming keyword | Core Python keyword lost |
| Python | Initializer | LOST | programming keyword | Constructor-related term lost |
| DNA | measuring | LOST | domain semantic | Scientific methodology association lost |
| photosynthesis | plants | GAINED | core semantic | Biologically relevant gain |
| Paris | Americans | GAINED | core semantic | Geographic/cultural association emerged |
| Paris | Senegal | GAINED | core semantic | Francophone Africa — relevant gain |
| democracy | Disclosure | GAINED | domain semantic | Transparency — politically relevant gain |
| Python | Program | GAINED | core semantic | Programming-relevant gain |
| Python | script | GAINED | domain semantic | Scripting association emerged |
| DNA | Restriction | GAINED | domain semantic | Restriction enzymes — biologically relevant |
| DNA | developmental | GAINED | domain semantic | Developmental biology — relevant gain |
Retention by Concept
| Concept | Domain | FP16 | Q4_0 | Retention | Lost | Gained |
|---|---|---|---|---|---|---|
| robot | technology | 17 | 18 | 94.1% | 1 | 2 |
| photosynthesis | biology | 17 | 17 | 94.1% | 1 | 1 |
| quantization | technical | 11 | 11 | 90.9% | 1 | 1 |
| Paris | geography / culture | 18 | 19 | 77.8% | 4 | 5 |
| insurance | business / legal | 22 | 20 | 77.3% | 5 | 3 |
| democracy | politics / philosophy | 19 | 17 | 68.4% | 6 | 4 |
| Python | programming | 21 | 18 | 66.7% | 7 | 4 |
| DNA | science | 20 | 17 | 65.0% | 7 | 4 |
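The table's columns are internally consistent: Q4_0 = FP16 - lost + gained, and retention = (FP16 - lost) / FP16. A quick check over the rows above:

```python
rows = [
    # concept, fp16, q4, lost, gained
    ("robot", 17, 18, 1, 2),
    ("photosynthesis", 17, 17, 1, 1),
    ("quantization", 11, 11, 1, 1),
    ("Paris", 18, 19, 4, 5),
    ("insurance", 22, 20, 5, 3),
    ("democracy", 19, 17, 6, 4),
    ("Python", 21, 18, 7, 4),
    ("DNA", 20, 17, 7, 4),
]
for concept, fp16, q4, lost, gained in rows:
    assert q4 == fp16 - lost + gained, concept
    retention = 100 * (fp16 - lost) / fp16  # surviving edges / FP16 edges
    print(f"{concept:15s} {retention:5.1f}%")
```

The FP16, lost, and gained columns sum to 145, 32, and 24 respectively, matching the headline totals.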
Full Semantic Neighborhood Changes
All edge changes per concept — including multilingual, rare, and subword associations. Every edge is classified by type and confidence.
robot (technology)
- Space/robotics association weakened
- Unexpected — inspect manually
- Unexpected — inspect manually

photosynthesis (biology)
- Hungarian "south" — cross-lingual drift
- Biologically relevant gain

quantization (technical)
- Subword fragment dropped
- Unexpected — inspect manually

Paris (geography / culture)
- Latin/bridge — multilingual association dropped
- European language association lost
- Weak association dropped
- Political association weakened
- Inspect manually
- Geographic/cultural association emerged
- Francophone Africa — relevant gain
- Weak association
- Inspect manually

insurance (business / legal)
- Unrelated association dropped
- Italian suffix fragment — cross-lingual drift
- Legally relevant association lost
- Romance-language fragment
- Travel-insurance association dropped
- Chinese character — cross-lingual shift
- Spanish "systems" — financial context
- CSV/data processing association emerged

democracy (politics / philosophy)
- French "to succeed" — cross-lingual drift
- Subword fragment of "satisfaction"
- Proper noun dropped
- Unrelated association dropped
- Weak association dropped
- Political "deliver on promises" weakened
- Cross-lingual shift
- Governance/delegation association
- Fragment — inspect manually
- Transparency — politically relevant gain

Python (programming)
- Core Python keyword lost
- Python (snake) association lost
- French "father" — cross-lingual drift
- Proper noun dropped
- Subword fragment dropped
- Constructor-related term lost
- Turkish "people" — cross-lingual drift
- Proper noun — inspect manually
- Research methodology association
- Programming-relevant gain
- Scripting association emerged

DNA (science)
- Spanish "storage" — cross-lingual drift
- LaTeX command fragment
- Biblical/botanical term
- Proper noun dropped
- Weak association dropped
- German "freedom" — cross-lingual drift
- Scientific methodology association lost
- Inspect manually
- Restriction enzymes — biologically relevant
- Latin "go" — cross-lingual shift
- Developmental biology — relevant gain
Structural Audit — Per Layer
Independent structural analysis via the Growt API. Each of the 34 layers is audited separately. Two metrics: coverage (do Q4_0 features reach the same regions of the FP16 feature space?) and structural agreement (do the internal topologies match?).
Coverage ~98%: Q4_0 features reach nearly all regions of the FP16 feature space. The quantised model sees the same structural landscape.
Agreement ~18%: But the internal topology is significantly reorganised. Features that were neighbours in FP16 are no longer neighbours in Q4_0. This is the structural damage that cosine similarity (99.6%) completely misses.
Knowledge band weakest: L19 (14.4%) and L26 (13.4%) have the lowest agreement — consistent with the semantic audit showing DNA (65%) and Python (66.7%) most affected.
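The coverage-vs-agreement gap can be illustrated with a toy k-nearest-neighbour overlap metric. This is only a sketch of the metric family; it is not the Growt API's actual definition, and all names below are hypothetical:

```python
import numpy as np

def knn_sets(X: np.ndarray, k: int) -> list:
    """Index sets of each point's k nearest neighbours (excluding itself)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:k]) for row in d]

def topology_agreement(A: np.ndarray, B: np.ndarray, k: int = 5) -> float:
    """Mean k-NN overlap: 1.0 means identical local topology."""
    na, nb = knn_sets(A, k), knn_sets(B, k)
    return float(np.mean([len(a & b) / k for a, b in zip(na, nb)]))

rng = np.random.default_rng(1)
feats = rng.standard_normal((200, 16))
# Heavy perturbation: points stay in the same overall region (high
# "coverage"), but neighbourhoods are scrambled (low "agreement").
perturbed = feats + rng.standard_normal((200, 16)) * 1.5
print(topology_agreement(feats, feats))      # 1.0 by construction
print(topology_agreement(feats, perturbed))  # low: neighbours reordered
```

This is the intuition behind 98% coverage coexisting with ~18% agreement: the point cloud occupies the same space, but who is next to whom has changed.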
Two Methods, Same Conclusion
Semantic audit: 77.9% of semantic edges preserved. Technical concepts (robot 94%, photosynthesis 94%) intact. Abstract concepts degraded: Python lost “None”, DNA dropped to 65%. Tells you what was lost.
Structural audit: 98% coverage but only 18% structural agreement. The model reaches the same regions but its internal topology is reorganised. Knowledge layers (L19, L26) most affected. Tells you how much damage at each layer.
Probe check: all 31 active layers SAFE. Training probe 65–67%, deployment probe 51–55%. The Q4_0 model's structural map is weaker but still above the random baseline. Labels are token-derived from model internals, not artificial clusters.
The Q4_0 model (gemma-3-4b-it-qat-q4_0-gguf) was produced through Quantization-Aware Training (QAT) — the model was specifically trained to minimise damage at INT4 precision. This means the 22% edge loss shown here is the best case for Q4_0 quantization. Naive post-training quantization (PTQ) without QAT would show significantly more damage.
This audit probed 8 concepts across 6 domains. A production audit would probe hundreds or thousands of concepts relevant to your specific deployment domain. The methodology scales — the same API call that produced these results works on any model with extractable internal representations.
Want to know what your model forgot? Same API, same engine.