We Tell You What It Forgot
Gemma 3 4B: FP16 vs actual Q4_0 quantization. Not just “is it different?” — but which semantic associations were preserved, weakened, lost, or reorganised.
Methodology
What is a “semantic edge”? A detected association between a probed concept and another token in the model's internal representation space. Each edge has a strength score at a specific layer.
How were Q4_0 weights compared? The Q4_0 GGUF was dequantized to float32 (preserving quantization damage), saved as safetensors, then audited with the same tool as the FP16 baseline. Cosine similarity between FP16 and dequantized Q4_0 model weights: 0.9959 — confirming the comparison is valid.
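The weight-level sanity check reduces to one dot product over flattened tensors. A minimal sketch of that check (function and array names are hypothetical, and a small synthetic perturbation stands in for real Q4_0 error):

```python
import numpy as np

def weight_cosine(fp16_weights: np.ndarray, deq_q4_weights: np.ndarray) -> float:
    """Global cosine similarity between two flattened weight vectors."""
    a = fp16_weights.astype(np.float32).ravel()
    b = deq_q4_weights.astype(np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy illustration: a small perturbation keeps global cosine near 1,
# which is why 0.9959 alone says little about semantic damage.
rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
w_q = w + rng.normal(scale=0.05, size=w.shape).astype(np.float32)
print(round(weight_cosine(w, w_q), 4))  # close to 1.0
```

As the structural audit below shows, a near-1.0 global cosine can coexist with heavy reorganisation of the internal topology.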
QAT vs Naive PTQ — Same Bit Width, Different Damage
Both are Q4_0 (4-bit). QAT was trained for it. PTQ was not. Standard metrics say QAT is better. Our audit shows the full picture.
| Metric | QAT Q4_0 | Naive PTQ Q4_0 | Interpretation |
|---|---|---|---|
| Semantic edge retention | 78% | 81% | PTQ preserves more semantic neighbors |
| Structural coverage | 87–96% | 89–100% | PTQ reaches more structural regions |
| Topology agreement | 12–15% | 14% | Similar internal reorganisation |
| Training probe | 65% | 66% | Nearly identical |
| Deployment probe | 53% | 54% | Nearly identical |
| Diagnosis | All SAFE | All SAFE | Both within safe bounds |
| Perplexity (Google) | Best | Worse | QAT wins on output quality |
Of the 145 semantic edges detected in FP16, 113 survived Q4_0 quantization, 32 were lost, and 24 new edges were gained. Retention is domain-dependent: robot and photosynthesis at 94%, Python and DNA near 65%.
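Classifying edges as retained, lost, or gained is plain set arithmetic over the two edge lists. A sketch using the headline totals (edge identifiers here are synthetic placeholders, not real probed tokens):

```python
def edge_diff(fp16_edges: set, q4_edges: set) -> dict:
    """Classify edges as retained, lost, or gained between two snapshots."""
    retained = fp16_edges & q4_edges
    return {
        "retained": len(retained),
        "lost": len(fp16_edges - q4_edges),
        "gained": len(q4_edges - fp16_edges),
        "retention": len(retained) / len(fp16_edges),
    }

# Toy sets reproducing the headline totals: 145 FP16 edges,
# 113 retained, 32 lost, 24 gained.
fp16 = {f"e{i}" for i in range(145)}
q4 = {f"e{i}" for i in range(113)} | {f"new{i}" for i in range(24)}
stats = edge_diff(fp16, q4)
print(stats)  # retention = 113/145 ≈ 0.779
```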
High-Signal Changes
The changes that matter most — high-confidence losses and gains with clear semantic meaning.
| Concept | Edge | Change | Type | Interpretation |
|---|---|---|---|---|
| Paris | German | LOST | core semantic | European language association lost |
| insurance | lawsuits | LOST | domain semantic | Legally relevant association lost |
| Python | None | LOST | programming keyword | Core Python keyword lost |
| Python | Initializer | LOST | programming keyword | Constructor-related term lost |
| DNA | measuring | LOST | domain semantic | Scientific methodology association lost |
| photosynthesis | plants | GAINED | core semantic | Biologically relevant gain |
| Paris | Americans | GAINED | core semantic | Geographic/cultural association emerged |
| Paris | Senegal | GAINED | core semantic | Francophone Africa — relevant gain |
| democracy | Disclosure | GAINED | domain semantic | Transparency — politically relevant gain |
| Python | Program | GAINED | core semantic | Programming-relevant gain |
| Python | script | GAINED | domain semantic | Scripting association emerged |
| DNA | Restriction | GAINED | domain semantic | Restriction enzymes — biologically relevant |
| DNA | developmental | GAINED | domain semantic | Developmental biology — relevant gain |
Retention by Concept
| Concept | Domain | FP16 | Q4_0 | Retention | Lost | Gained |
|---|---|---|---|---|---|---|
| robot | technology | 17 | 18 | 94.1% | 1 | 2 |
| photosynthesis | biology | 17 | 17 | 94.1% | 1 | 1 |
| quantization | technical | 11 | 11 | 90.9% | 1 | 1 |
| Paris | geography / culture | 18 | 19 | 77.8% | 4 | 5 |
| insurance | business / legal | 22 | 20 | 77.3% | 5 | 3 |
| democracy | politics / philosophy | 19 | 17 | 68.4% | 6 | 4 |
| Python | programming | 21 | 18 | 66.7% | 7 | 4 |
| DNA | science | 20 | 17 | 65.0% | 7 | 4 |
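The table's columns are internally consistent: Q4_0 = FP16 - lost + gained, and retention = (FP16 - lost) / FP16. A quick check over the rows above:

```python
rows = [
    # concept, fp16, q4, lost, gained
    ("robot", 17, 18, 1, 2),
    ("photosynthesis", 17, 17, 1, 1),
    ("quantization", 11, 11, 1, 1),
    ("Paris", 18, 19, 4, 5),
    ("insurance", 22, 20, 5, 3),
    ("democracy", 19, 17, 6, 4),
    ("Python", 21, 18, 7, 4),
    ("DNA", 20, 17, 7, 4),
]
for concept, fp16, q4, lost, gained in rows:
    assert q4 == fp16 - lost + gained, concept
    retention = 100 * (fp16 - lost) / fp16  # surviving edges / FP16 edges
    print(f"{concept:15s} {retention:5.1f}%")
```

The FP16, lost, and gained columns sum to 145, 32, and 24 respectively, matching the headline totals.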
Full Semantic Neighborhood Changes
All edge changes per concept — including multilingual, rare, and subword associations. Every edge is classified by type and confidence.
robot (technology)
- Space/robotics association weakened
- Unexpected — inspect manually
- Unexpected — inspect manually

photosynthesis (biology)
- Hungarian "south" — cross-lingual drift
- Biologically relevant gain

quantization (technical)
- Subword fragment dropped
- Unexpected — inspect manually

Paris (geography / culture)
- Latin/bridge — multilingual association dropped
- European language association lost
- Weak association dropped
- Political association weakened
- Inspect manually
- Geographic/cultural association emerged
- Francophone Africa — relevant gain
- Weak association
- Inspect manually

insurance (business / legal)
- Unrelated association dropped
- Italian suffix fragment — cross-lingual drift
- Legally relevant association lost
- Romance-language fragment
- Travel-insurance association dropped
- Chinese character — cross-lingual shift
- Spanish "systems" — financial context
- CSV/data processing association emerged

democracy (politics / philosophy)
- French "to succeed" — cross-lingual drift
- Subword fragment of "satisfaction"
- Proper noun dropped
- Unrelated association dropped
- Weak association dropped
- Political "deliver on promises" weakened
- Cross-lingual shift
- Governance/delegation association
- Fragment — inspect manually
- Transparency — politically relevant gain

Python (programming)
- Core Python keyword lost
- Python (snake) association lost
- French "father" — cross-lingual drift
- Proper noun dropped
- Subword fragment dropped
- Constructor-related term lost
- Turkish "people" — cross-lingual drift
- Proper noun — inspect manually
- Research methodology association
- Programming-relevant gain
- Scripting association emerged

DNA (science)
- Spanish "storage" — cross-lingual drift
- LaTeX command fragment
- Biblical/botanical term
- Proper noun dropped
- Weak association dropped
- German "freedom" — cross-lingual drift
- Scientific methodology association lost
- Inspect manually
- Restriction enzymes — biologically relevant
- Latin "go" — cross-lingual shift
- Developmental biology — relevant gain
Structural Audit — Per Layer
Independent structural analysis via the Growt API. Each of the 34 layers is audited separately. Two metrics: coverage (do Q4_0 features reach the same regions of the FP16 feature space?) and structural agreement (do the internal topologies match?).
Coverage ~98%: Q4_0 features reach nearly all regions of the FP16 feature space. The quantised model sees the same structural landscape.
Agreement ~18%: But the internal topology is significantly reorganised. Features that were neighbours in FP16 are no longer neighbours in Q4_0. This is the structural damage that cosine similarity (99.6%) completely misses.
Knowledge band weakest: L19 (14.4%) and L26 (13.4%) have the lowest agreement — consistent with the semantic audit showing DNA (65%) and Python (66.7%) most affected.
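The coverage-vs-agreement gap can be illustrated with a toy k-nearest-neighbour overlap metric. This is only a sketch of the metric family; it is not the Growt API's actual definition, and all names below are hypothetical:

```python
import numpy as np

def knn_sets(X: np.ndarray, k: int) -> list:
    """Index sets of each point's k nearest neighbours (excluding itself)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:k]) for row in d]

def topology_agreement(A: np.ndarray, B: np.ndarray, k: int = 5) -> float:
    """Mean k-NN overlap: 1.0 means identical local topology."""
    na, nb = knn_sets(A, k), knn_sets(B, k)
    return float(np.mean([len(a & b) / k for a, b in zip(na, nb)]))

rng = np.random.default_rng(1)
feats = rng.standard_normal((200, 16))
# Heavy perturbation: points stay in the same overall region (high
# "coverage"), but neighbourhoods are scrambled (low "agreement").
perturbed = feats + rng.standard_normal((200, 16)) * 1.5
print(topology_agreement(feats, feats))      # 1.0 by construction
print(topology_agreement(feats, perturbed))  # low: neighbours reordered
```

This is the intuition behind 98% coverage coexisting with ~18% agreement: the point cloud occupies the same space, but who is next to whom has changed.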
Two Methods, Same Conclusion
Semantic audit: 77.9% of semantic edges preserved. Technical concepts (robot 94%, photosynthesis 94%) intact. Abstract concepts degraded: Python lost “None”, DNA dropped to 65%. Tells you what was lost.
Structural audit: 98% coverage but only 18% structural agreement. The model reaches the same regions but its internal topology is reorganised. Knowledge layers (L19, L26) most affected. Tells you how much damage at each layer.
Probe check: all 31 active layers SAFE. Training probe 65–67%, deployment probe 51–55%. The Q4_0 model's structural map is weaker but still above the random baseline. Labels are token-derived from model internals, not artificial clusters.
The Q4_0 model (gemma-3-4b-it-qat-q4_0-gguf) was produced through Quantization-Aware Training (QAT) — the model was specifically trained to minimise damage at INT4 precision. This means the 22% edge loss shown here is the best case for Q4_0 quantization. Naive post-training quantization (PTQ) without QAT would show significantly more damage.
This audit probed 8 concepts across 6 domains. A production audit would probe hundreds or thousands of concepts relevant to your specific deployment domain. The methodology scales — the same API call that produced these results works on any model with extractable internal representations.
Want to know what your model forgot? Same API, same engine.