Duplicate Detection Fixtures

This sub-section exercises the near-duplicate page detector in the content_strategist scanner profile. Each fixture is labelled with its expected cluster behaviour so you can compare the actual summary.duplicates.clusters[] output against the spec.

Cluster A — Exact duplicates

Three pages with byte-identical body content. Chrome (nav/footer) and page titles differ. Expected: one cluster of size 3.

Page | Expected
/content/duplicates/cluster-a-1 | CLUSTER A (size 3)
/content/duplicates/cluster-a-2 | CLUSTER A (size 3)
/content/duplicates/cluster-a-3 | CLUSTER A (size 3)

Cluster B — Near-duplicates, one-sentence delta

Three pages sharing the same body, with a single sentence paraphrased on each variant page (B-1 keeps the original wording). Expected: one cluster of size 3 at the default distance threshold of 3.

Page | Delta | Expected
/content/duplicates/cluster-b-1 | Original sentence | CLUSTER B (size 3)
/content/duplicates/cluster-b-2 | Para 1 sentence 2 paraphrased | CLUSTER B (size 3)
/content/duplicates/cluster-b-3 | Para 1 sentence 3 paraphrased | CLUSTER B (size 3)
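
The threshold of 3 is easiest to read as a maximum Hamming distance between two fingerprints, which is the usual distance for SimHash. The sketch below is a minimal illustration under stated assumptions, not the scanner's implementation: the 64-bit width, the 3-word shingles, the BLAKE2b hash, the inclusive comparison, and all function names are hypothetical.

```python
import hashlib


def shingles(text: str, k: int = 3) -> list[str]:
    """Split text into overlapping k-word shingles (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text: str, bits: int = 64, k: int = 3) -> int:
    """Textbook SimHash: each shingle votes +1/-1 on every bit position."""
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a: str, text_b: str, threshold: int = 3) -> bool:
    """Assumes 'within distance 3' means distance <= 3 (inclusive)."""
    return hamming(simhash(text_a), simhash(text_b)) <= threshold
```

Identical bodies always give distance 0. Whether a one-sentence paraphrase stays within 3 bits depends on body length and shingle width, which is what the Cluster B fixtures are meant to confirm.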

Cluster C — Paragraph reordering

Two pages containing the same three paragraphs in a different order. SimHash is order-sensitive where shingles span a paragraph boundary, so the two fingerprints may differ by more than the threshold of 3 and clustering is not guaranteed. Verify empirically.

Page | Paragraph order | Expected
/content/duplicates/cluster-c-1 | P1 → P2 → P3 | VERIFY — may cluster with C-2
/content/duplicates/cluster-c-2 | P3 → P1 → P2 | VERIFY — may cluster with C-1
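
To see why reordering is a borderline case, it helps to count which shingles actually change. The following sketch assumes whitespace tokenisation and 3-word shingles (neither is confirmed for the scanner) and uses made-up paragraphs; the helper names are hypothetical.

```python
def word_shingles(text: str, k: int = 3) -> set[str]:
    """Set of overlapping k-word shingles; k=3 is an assumption."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_overlap(paras_a: list[str], paras_b: list[str], k: int = 3) -> tuple[int, int, int]:
    """Return (shared, only_in_a, only_in_b) shingle counts for two orderings."""
    a = word_shingles(" ".join(paras_a), k)
    b = word_shingles(" ".join(paras_b), k)
    return len(a & b), len(a - b), len(b - a)


# Placeholder paragraphs with unique words so that every shingle is distinct.
p1 = "alpha beta gamma delta epsilon"
p2 = "zeta eta theta iota kappa"
p3 = "lambda mu nu xi omicron"

# Same ordering as the fixtures: P1 -> P2 -> P3 versus P3 -> P1 -> P2.
print(shingle_overlap([p1, p2, p3], [p3, p1, p2]))  # (11, 2, 2)
```

Only the shingles that straddle a changed paragraph boundary differ. On real pages with long paragraphs that is a small fraction of the total, but a few changed shingles can still flip more than 3 fingerprint bits, hence the VERIFY label.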

Anti-cluster — Translated content

The same source text in English and Swedish. The two versions share almost no tokens, so their fingerprints should be far apart. Expected: no cluster.

Page | Language | Expected
/content/duplicates/anti-translation-en | English | NO CLUSTER
/content/duplicates/anti-translation-sv | Swedish | NO CLUSTER

Anti-cluster — Short pages (under 30-word floor)

Pages with fewer than 30 words of body content. Excluded from fingerprinting entirely. Expected: cluster_id: null.

Page | Content | Expected
/content/duplicates/anti-short-1 | "Thank you" confirmation stub (~10 words) | NO FINGERPRINT — cluster_id: null
/content/duplicates/anti-short-2 | "Page not found" stub (~12 words) | NO FINGERPRINT — cluster_id: null
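
The 30-word floor can be pictured as a simple gate in front of the fingerprinting step. This is a sketch only: the function name is hypothetical, and counting words by splitting on whitespace is an assumption about how the scanner tokenises body text.

```python
MIN_WORDS = 30  # floor stated in this fixture's description


def is_fingerprintable(body_text: str, min_words: int = MIN_WORDS) -> bool:
    """True when the body has at least `min_words` whitespace-separated words.

    "Fewer than 30 words" is excluded, so a body of exactly 30 words passes.
    """
    return len(body_text.split()) >= min_words
```

A page that fails this check never receives a fingerprint, which is why its cluster_id is expected to be null rather than a singleton cluster.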

Anti-cluster — Same topic, different copy

Two pages about Clarity Analytics with substantially different body text (pricing vs. security). Expected: no cluster. This confirms that the detector groups by copy similarity, not by topic.

Page | Topic | Expected
/content/duplicates/anti-different-1 | Pricing plans (Starter / Growth / Enterprise) | NO CLUSTER
/content/duplicates/anti-different-2 | Security and compliance (TLS, AES-256, SOC 2) | NO CLUSTER

Transitive cluster — Chain of 4

Each adjacent pair (T-1↔T-2, T-2↔T-3, T-3↔T-4) is within distance 3, while the T-1↔T-4 distance is likely above 3. Union-find should still merge all four pages into one cluster of size 4, because cluster membership is transitive.

Page | Sentences paraphrased vs T-1 | Expected
/content/duplicates/transitive-1 | 0 (base) | TRANSITIVE CLUSTER (size 4)
/content/duplicates/transitive-2 | 1 (para 1, sentence 1) | TRANSITIVE CLUSTER (size 4)
/content/duplicates/transitive-3 | 2 (para 1 s1 + para 2 s1) | TRANSITIVE CLUSTER (size 4)
/content/duplicates/transitive-4 | 3 (para 1 s1 + para 2 s1 + para 3 s1) | TRANSITIVE CLUSTER (size 4)
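
The chain behaviour follows directly from how union-find merges pairs. The sketch below is illustrative rather than the scanner's code: the fingerprints are hand-picked integers chosen so that each neighbour differs by exactly 3 bits, the all-pairs comparison is the simplest possible strategy, and the function name is hypothetical.

```python
from itertools import combinations


def cluster(fingerprints: dict[str, int], threshold: int = 3) -> list[list[str]]:
    """Group pages whose fingerprints are linked by a chain of close pairs."""
    parent = {page: page for page in fingerprints}

    def find(x: str) -> str:
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Merge every pair whose Hamming distance is within the threshold.
    for a, b in combinations(fingerprints, 2):
        if bin(fingerprints[a] ^ fingerprints[b]).count("1") <= threshold:
            union(a, b)

    groups: dict[str, list[str]] = {}
    for page in fingerprints:
        groups.setdefault(find(page), []).append(page)
    # Pages left alone are singletons, not clusters.
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hand-picked fingerprints: neighbours differ by 3 bits, T-1 and T-4 by 9.
example = {
    "transitive-1": 0b000000000,
    "transitive-2": 0b000000111,
    "transitive-3": 0b000111111,
    "transitive-4": 0b111111111,
    "singleton": 0xFFFF << 32,
}
print(cluster(example))
```

Although transitive-1 and transitive-4 are 9 bits apart in this example, they end up in the same group because each link in the chain was merged; the unrelated page is left out.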

Chrome-only difference

Two pages with identical body content but very different nav/footer/aside blocks. Chrome is stripped before hashing. Expected: cluster of size 2.

Page | Chrome | Expected
/content/duplicates/chrome-diff-1 | Minimal nav (2 links), minimal aside/footer | CHROME CLUSTER (size 2)
/content/duplicates/chrome-diff-2 | 15-link nav, rich aside, verbose footer | CHROME CLUSTER (size 2)
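
One way to picture chrome stripping is a parser pass that drops text inside nav, footer, and aside elements before the body is hashed. The sketch below uses Python's standard html.parser and is only an assumption about the approach: the tag set, class name, and whitespace normalisation are hypothetical, and the real scanner may identify chrome differently.

```python
from html.parser import HTMLParser

# Tags treated as chrome here; taken from the nav/footer/aside wording above.
CHROME_TAGS = {"nav", "footer", "aside"}


class BodyTextExtractor(HTMLParser):
    """Collects text that is not nested inside any chrome element."""

    def __init__(self) -> None:
        super().__init__()
        self._chrome_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self._chrome_depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self._chrome_depth > 0:
            self._chrome_depth -= 1

    def handle_data(self, data):
        if self._chrome_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace so layout differences do not matter.
        return " ".join(" ".join(self._parts).split())


def body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Two pages whose markup differs only inside those elements yield the same body text, and therefore the same fingerprint, which is the behaviour the chrome-diff fixtures check.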

Singleton control

A unique page about audit logging with no near-duplicates in this fixture set. Expected: cluster_id: null.

Page | Content | Expected
/content/duplicates/singleton | Audit logging feature description | NO CLUSTER — cluster_id: null