Transfer Oracle
Pilot Audit · 8 Concepts · Real Q4_0

We Tell You What It Forgot

Gemma 3 4B: FP16 vs actual Q4_0 quantization. Not just “is it different?” — but which semantic associations were preserved, weakened, lost, or reorganised.

Methodology

Model: google/gemma-3-4b-it
Baseline: BF16 (safetensors)
Quantized: Q4_0 (Google official GGUF)
Layers audited: 34
Concepts probed: 8
Cosine sanity check: 0.9959

What is a “semantic edge”? A detected association between a probed concept and another token in the model's internal representation space. Each edge has a strength score at a specific layer.
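The edge definition above can be sketched as a small data structure plus a set difference. This is a minimal illustration, not the tool's actual implementation: the `Edge` fields, the function names, and the rule that two edges match when concept and token are equal are all assumptions. The two "French" strengths come from the Paris results in this report; the other strengths and all layer numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One detected association. All field names are illustrative."""
    concept: str     # probed concept, e.g. "Paris"
    token: str       # associated token, e.g. "French"
    layer: int       # layer at which the association was detected
    strength: float  # strength score at that layer


def diff_edges(baseline, quantized):
    """Split edges into kept / lost / gained by (concept, token) identity."""
    base = {(e.concept, e.token) for e in baseline}
    quant = {(e.concept, e.token) for e in quantized}
    return {
        "kept": base & quant,
        "lost": base - quant,
        "gained": quant - base,
    }


def retention(baseline, quantized):
    """Fraction of baseline edges that still exist after quantization."""
    d = diff_edges(baseline, quantized)
    return len(d["kept"]) / (len(d["kept"]) + len(d["lost"]))


# Toy data: only the "French" strengths are taken from the report.
fp16 = [Edge("Paris", "French", 20, 27.3), Edge("Paris", "German", 20, 5.0)]
q4 = [Edge("Paris", "French", 20, 25.8), Edge("Paris", "Senegal", 20, 4.0)]
result = diff_edges(fp16, q4)
```

With this toy input, "German" is reported as lost, "Senegal" as gained, and retention is one kept edge out of two baseline edges.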

How were Q4_0 weights compared? The Q4_0 GGUF was dequantized to float32 (preserving quantization damage), saved as safetensors, then audited with the same tool as the FP16 baseline. Cosine similarity between FP16 and dequantized Q4_0 model weights: 0.9959 — confirming the comparison is valid.
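A weight-level cosine check of this kind could look like the sketch below. It assumes the similarity is computed over all tensors concatenated into one long vector; the page does not say whether the 0.9959 figure is global or averaged per tensor. Plain Python lists stand in for tensors loaded from the two safetensors files, and the tensor name and values are made up.

```python
import math


def weight_cosine(weights_a, weights_b):
    """Cosine similarity between two models, treating all tensors as one vector.

    Both arguments map tensor name -> flat list of floats.
    """
    dot = norm_a = norm_b = 0.0
    for name, a in weights_a.items():
        b = weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for tensor {name!r}")
        dot += sum(x * y for x, y in zip(a, b))
        norm_a += sum(x * x for x in a)
        norm_b += sum(y * y for y in b)
    return dot / math.sqrt(norm_a * norm_b)


# Made-up weights: one value differs slightly after the quantize/dequantize round trip.
baseline = {"layer0.w": [0.50, -0.25, 0.10, 0.80]}
dequantized = {"layer0.w": [0.50, -0.20, 0.10, 0.80]}
similarity = weight_cosine(baseline, dequantized)
```

A value close to 1.0 only says the weight vectors point in nearly the same direction; it does not say which associations changed.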

QAT vs Naive PTQ — Same Bit Width, Different Damage

Both are Q4_0 (4-bit). QAT was trained for it. PTQ was not. Standard metrics say QAT is better. Our audit shows the full picture.

Metric | QAT Q4_0 | Naive PTQ Q4_0 | Interpretation
Semantic edge retention | 78% | 81% | PTQ preserves more semantic neighbors
Structural coverage | 87–96% | 89–100% | PTQ reaches more structural regions
Topology agreement | 12–15% | 14% | Similar internal reorganisation
Training probe | 65% | 66% | Nearly identical
Deployment probe | 53% | 54% | Nearly identical
Diagnosis | All SAFE | All SAFE | Both within safe bounds
Perplexity (Google) | Best | Worse | QAT wins on output quality

QAT optimises for output quality by reorganising internal structure. PTQ just rounds the weights — worse outputs but more structural fidelity. Standard benchmarks (perplexity) only see the output quality. Transfer Oracle sees both: the structural reorganisation QAT introduces and the rounding damage PTQ causes. Two types of damage, not just two amounts.
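The phrase "just rounds the weights" can be made concrete with a simplified round-to-nearest quantizer. This is a sketch under stated assumptions: it uses blocks of 32 weights sharing one scale, as Q4_0 does, but the scale choice and rounding rule are simplified and are not the exact GGUF routine. Function names and the random test weights are illustrative.

```python
import random

BLOCK = 32  # Q4_0 groups weights into blocks of 32 that share one scale


def quantize_block(block):
    """Round one block to 4-bit signed integer codes with a shared scale."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Codes are clamped to the signed 4-bit range [-8, 7].
    codes = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Map integer codes back to floats; the rounding error stays in place."""
    return [scale * q for q in codes]


def round_trip(weights):
    """Quantize and dequantize every block, returning the damaged weights."""
    out = []
    for i in range(0, len(weights), BLOCK):
        scale, codes = quantize_block(weights[i:i + BLOCK])
        out.extend(dequantize_block(scale, codes))
    return out


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4 * BLOCK)]
restored = round_trip(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme each weight moves by at most half a quantization step of its block, which is the "rounding damage" referred to above. QAT differs in that the model is trained with the quantization in the loop, so the weights themselves change.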
77.9% of QAT Q4_0 semantic edges preserved

145 semantic edges in FP16. 113 survived Q4_0 quantization, 32 were lost, and 24 new edges were gained. Retention is domain-dependent: robot and photosynthesis at 94.1%, Python at 66.7% and DNA at 65.0%.

High-Signal Changes

The changes that matter most — high-confidence losses and gains with clear semantic meaning.

Concept | Edge | Change | Type | Interpretation
Paris | German | LOST | core semantic | European language association lost
insurance | lawsuits | LOST | domain semantic | Legally relevant association lost
Python | None | LOST | programming keyword | Core Python keyword lost
Python | Initializer | LOST | programming keyword | Constructor-related term lost
DNA | measuring | LOST | domain semantic | Scientific methodology association lost
photosynthesis | plants | GAINED | core semantic | Biologically relevant gain
Paris | Americans | GAINED | core semantic | Geographic/cultural association emerged
Paris | Senegal | GAINED | core semantic | Francophone Africa: relevant gain
democracy | Disclosure | GAINED | domain semantic | Transparency: politically relevant gain
Python | Program | GAINED | core semantic | Programming-relevant gain
Python | script | GAINED | domain semantic | Scripting association emerged
DNA | Restriction | GAINED | domain semantic | Restriction enzymes: biologically relevant
DNA | developmental | GAINED | domain semantic | Developmental biology: relevant gain

Retention by Concept

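Concept | Domain | FP16 | Q4_0 | Retention | Lost | Gained
robot | technology | 17 | 18 | 94.1% | 1 | 2
photosynthesis | biology | 17 | 17 | 94.1% | 1 | 1
quantization | technical | 11 | 11 | 90.9% | 1 | 1
Paris | geography / culture | 18 | 19 | 77.8% | 4 | 5
insurance | business / legal | 22 | 20 | 77.3% | 5 | 3
democracy | politics / philosophy | 19 | 17 | 68.4% | 6 | 4
Python | programming | 21 | 18 | 66.7% | 7 | 4
DNA | science | 20 | 17 | 65.0% | 7 | 4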
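The headline totals can be re-derived from the per-concept rows. The sketch below copies the counts from the table and assumes that retention is defined as kept edges divided by FP16 edges, with kept = FP16 − lost; that definition reproduces every percentage in the table, but it is inferred from the numbers rather than stated on the page.

```python
# (concept, fp16_edges, q4_edges, lost, gained), copied from the retention table
ROWS = [
    ("robot", 17, 18, 1, 2),
    ("photosynthesis", 17, 17, 1, 1),
    ("quantization", 11, 11, 1, 1),
    ("Paris", 18, 19, 4, 5),
    ("insurance", 22, 20, 5, 3),
    ("democracy", 19, 17, 6, 4),
    ("Python", 21, 18, 7, 4),
    ("DNA", 20, 17, 7, 4),
]


def summarize(rows):
    """Aggregate edge counts and overall retention across all concepts."""
    fp16 = sum(r[1] for r in rows)
    lost = sum(r[3] for r in rows)
    gained = sum(r[4] for r in rows)
    kept = fp16 - lost
    return {
        "fp16": fp16,
        "kept": kept,
        "lost": lost,
        "gained": gained,
        "retention_pct": round(100 * kept / fp16, 1),
    }


def per_concept_retention(rows):
    """Retention per concept, as a percentage rounded to one decimal."""
    return {
        name: round(100 * (fp16 - lost) / fp16, 1)
        for name, fp16, _q4, lost, _gained in rows
    }


# Consistency check: every Q4_0 edge is either a kept edge or a gained edge.
consistent = all(q4 == fp16 - lost + gained for _, fp16, q4, lost, gained in ROWS)
summary = summarize(ROWS)
```

Running this gives 145 FP16 edges, 113 kept, 32 lost, 24 gained and 77.9% overall retention, matching the figures quoted earlier.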

Full Semantic Neighborhood Changes

All edge changes per concept — including multilingual, rare, and subword associations. Every edge is classified by type and confidence.

robot

technology · 94.1% retention · 17 → 18 edges (16 kept, 1 lost, 2 gained)

Lost
- lunar, domain semantic (medium): Space/robotics association weakened

Gained
+ keygen, rare lexical (low): Unexpected, inspect manually
+ Cabernet, rare lexical (low): Unexpected, inspect manually

Strongest preserved edges (strength, FP16 → Q4_0)
bot: 9.9 → 9.3 (-6%)
computer: 9.2 → 8.6 (-7%)
android: 8.4 → 7.8 (-7%)

photosynthesis

biology · 94.1% retention · 17 → 17 edges (16 kept, 1 lost, 1 gained)

Lost
- dél, multilingual lexical (low): Hungarian "south", cross-lingual drift

Gained
+ plants, core semantic (high): Biologically relevant gain

Strongest preserved edges (strength, FP16 → Q4_0)
photos: 11.8 → 11.0 (-7%)
photo: 9.7 → 9.1 (-6%)
imagens: 9.1 → 8.5 (-7%)

quantization

technical · 90.9% retention · 11 → 11 edges (10 kept, 1 lost, 1 gained)

Lost
- yle, subword artifact (low): Subword fragment dropped

Gained
+ Bikini, rare lexical (low): Unexpected, inspect manually

Strongest preserved edges (strength, FP16 → Q4_0)
ization: 9.1 → 8.5 (-7%)
normalization: 6.9 → 6.5 (-6%)
stabilization: 6.2 → 5.8 (-6%)

Paris

geography / culture · 77.8% retention · 18 → 19 edges (14 kept, 4 lost, 5 gained)

Lost
- pons, multilingual lexical (medium): Latin "bridge", multilingual association dropped
- German, core semantic (high): European language association lost
- Purpose, rare lexical (low): Weak association dropped
- republics, domain semantic (medium): Political association weakened

Gained
+ Dug, rare lexical (low): Inspect manually
+ Americans, core semantic (high): Geographic/cultural association emerged
+ Senegal, core semantic (high): Francophone Africa, relevant gain
+ write, rare lexical (low): Weak association
+ midline, rare lexical (low): Inspect manually

Strongest preserved edges (strength, FP16 → Q4_0)
French: 27.3 → 25.8 (-5%)
français: 11.1 → 10.5 (-5%)
foreign: 10.6 → 9.9 (-7%)

insurance

business / legal · 77.3% retention · 22 → 20 edges (17 kept, 5 lost, 3 gained)

Lost
- grilled, rare lexical (low): Unrelated association dropped
- iazioni, multilingual lexical (low): Italian suffix fragment, cross-lingual drift
- lawsuits, domain semantic (high): Legally relevant association lost
- contrar, multilingual lexical (low): Romance-language fragment
- destination, rare lexical (low): Travel-insurance association dropped

Gained
+ Zhi, multilingual lexical (low): Chinese character, cross-lingual shift
+ sistemas, multilingual lexical (medium): Spanish "systems", financial context
+ writerow, programming keyword (medium): CSV/data processing association emerged

Strongest preserved edges (strength, FP16 → Q4_0)
policy: 18.5 → 17.3 (-6%)
Liability: 10.8 → 10.1 (-6%)
risks: 10.1 → 9.4 (-7%)

democracy

politics / philosophy · 68.4% retention · 19 → 17 edges (13 kept, 6 lost, 4 gained)

Lost
- réussir, multilingual lexical (medium): French "to succeed", cross-lingual drift
- atisf, subword artifact (low): Subword fragment of "satisfaction"
- Duffy, rare lexical (low): Proper noun dropped
- whale, rare lexical (low): Unrelated association dropped
- galore, rare lexical (low): Weak association dropped
- deliver, domain semantic (medium): Political "deliver on promises" weakened

Gained
+ akeda, multilingual lexical (low): Cross-lingual shift
+ Assignments, domain semantic (medium): Governance/delegation association
+ ilver, subword artifact (low): Fragment, inspect manually
+ Disclosure, domain semantic (high): Transparency, politically relevant gain

Strongest preserved edges (strength, FP16 → Q4_0)
political: 10.9 → 10.2 (-6%)
elections: 9.3 → 8.7 (-6%)
freedom: 8.7 → 8.2 (-6%)

Python

programming · 66.7% retention · 21 → 18 edges (14 kept, 7 lost, 4 gained)

Lost
- None, programming keyword (high): Core Python keyword lost
- Dragon, core semantic (medium): Python (snake) association lost
- Père, multilingual lexical (low): French "father", cross-lingual drift
- JIM, rare lexical (low): Proper noun dropped
- racetr, subword artifact (low): Subword fragment dropped
- Initializer, programming keyword (high): Constructor-related term lost
- insanlar, multilingual lexical (low): Turkish "people", cross-lingual drift

Gained
+ Liz, rare lexical (low): Proper noun, inspect manually
+ Qualitative, domain semantic (medium): Research methodology association
+ Program, core semantic (high): Programming-relevant gain
+ script, domain semantic (high): Scripting association emerged

Strongest preserved edges (strength, FP16 → Q4_0)
dict: 21.5 → 20.1 (-7%)
code: 12.7 → 11.8 (-7%)
Software: 10.1 → 9.4 (-7%)

DNA

science · 65.0% retention · 20 → 17 edges (13 kept, 7 lost, 4 gained)

Lost
- almacen, multilingual lexical (low): Spanish "storage", cross-lingual drift
- vspace, subword artifact (low): LaTeX command fragment
- onycha, rare lexical (low): Biblical/botanical term
- Ruff, rare lexical (low): Proper noun dropped
- extras, rare lexical (low): Weak association dropped
- freiheit, multilingual lexical (low): German "freedom", cross-lingual drift
- measuring, domain semantic (high): Scientific methodology association lost

Gained
+ babe, rare lexical (low): Inspect manually
+ Restriction, domain semantic (high): Restriction enzymes, biologically relevant
+ vade, multilingual lexical (low): Latin "go", cross-lingual shift
+ developmental, domain semantic (high): Developmental biology, relevant gain

Strongest preserved edges (strength, FP16 → Q4_0)
genetic: 27.0 → 25.2 (-7%)
gene: 12.7 → 11.8 (-7%)
inheritance: 10.2 → 9.5 (-7%)

Structural Audit — Per Layer

Independent structural analysis via Growt API. Each of the 34 layers audited separately. Two metrics: coverage (do Q4_0 features reach the same regions?) and structural agreement (do the internal topologies match?).

Band | Layers | Coverage | Agreement | Weakest
Syntax | L0 – L13 | 98.0% | 18.8% | L03 (88.9%)
Knowledge | L14 – L27 | 97.6% | 17.6% | L26 (13.4%)
Output | L28 – L33 | 97.8% | 19.6% | L31 (19.0%)

Coverage ~98%: Q4_0 features reach nearly all regions of the FP16 feature space. The quantised model sees the same structural landscape.

Agreement ~18%: But the internal topology is significantly reorganised. Features that were neighbours in FP16 are no longer neighbours in Q4_0. This is the structural damage that cosine similarity (99.6%) completely misses.

Knowledge band weakest: L19 (14.4%) and L26 (13.4%) have the lowest agreement — consistent with the semantic audit showing DNA (65%) and Python (66.7%) most affected.
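One way to make "features that were neighbours are no longer neighbours" measurable is a k-nearest-neighbour overlap score, sketched below. This is an assumption about what a topology agreement metric could look like, not the Growt API's documented method: the function names, the Euclidean distance, the Jaccard overlap, and the toy 1-D features are all illustrative.

```python
import math


def nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean, excluding i)."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return set(order[:k])


def neighbour_agreement(feats_a, feats_b, k=2):
    """Mean Jaccard overlap of each item's k-NN set in the two feature spaces."""
    scores = []
    for i in range(len(feats_a)):
        a = nearest(feats_a, i, k)
        b = nearest(feats_b, i, k)
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores)


# Toy 1-D features for four items; only item 1 moves between the two spaces.
fp16_feats = [(0.0,), (1.0,), (3.0,), (7.0,)]
q4_feats = [(0.0,), (2.5,), (3.0,), (7.0,)]

same = neighbour_agreement(fp16_feats, fp16_feats, k=1)
shifted = neighbour_agreement(fp16_feats, q4_feats, k=1)
```

Here a single moved point changes its own nearest neighbour, so agreement falls from 1.0 to 0.75 even though every point still lies in the same overall range. That is the sense in which coverage can stay high while agreement drops.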

Two Methods, Same Conclusion

Semantic audit

77.9% of semantic edges preserved. Technical concepts (robot 94%, photosynthesis 94%) intact. Abstract concepts degraded: Python lost “None”, DNA dropped to 65%. Tells you what was lost.

Structural audit

98% coverage but only 18% structural agreement. The model reaches the same regions but its internal topology is reorganised. Knowledge layers (L19, L26) most affected. Tells you how much damage at each layer.

Dual-probe audit (real labels)

All 31 active layers: SAFE. Training probe 65–67%, deployment probe 51–55%. The Q4_0 model's structural map is weaker but above random baseline. Labels are token-derived from model internals, not artificial clusters.

Band | Training probe | Deployment probe | Coverage | Integrity | Diagnosis
Syntax | 65.4% | 54.9% | 97% | 0.779 | All layers: SAFE
Knowledge | 66.4% | 52.8% | 95% | 0.781 | All layers: SAFE
Output | 67.2% | 51.6% | 97% | 0.792 | All layers: SAFE

Standard metrics report 99.6% cosine similarity between FP16 and Q4_0. Transfer Oracle finds 22% of semantic edges reorganised, knowledge band coverage at 95%, and Deployment probe dropping to 52.8% in knowledge layers. Whether that matters depends on your use case — and now you can check before deploying.
Important context

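The Q4_0 model (gemma-3-4b-it-qat-q4_0-gguf) was produced through Quantization-Aware Training (QAT): the model was specifically trained to minimise damage at INT4 precision. The 22% edge loss shown here is therefore the result for a model trained for Q4_0, which is the best case for output quality. It is not automatically the best case for structural fidelity: in the QAT vs naive PTQ comparison above, naive post-training quantization (PTQ) retained slightly more semantic edges (81% vs 78%) while scoring worse on perplexity.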

Pilot scope

This audit probed 8 concepts across 6 domains. A production audit would probe hundreds or thousands of concepts relevant to your specific deployment domain. The methodology scales — the same API call that produced these results works on any model with extractable internal representations.

Want to know what your model forgot? Same API, same engine.