This sub-section exercises the near-duplicate page detector in the content_strategist scanner profile. Each fixture is labelled with its expected cluster behaviour so you can compare the actual summary.duplicates.clusters[] output against the spec.
Three pages with byte-identical body content. Chrome (nav/footer) and page titles differ. Expected: one cluster of size 3.
| Page | Expected |
|---|---|
| /content/duplicates/cluster-a-1 | CLUSTER A (size 3) |
| /content/duplicates/cluster-a-2 | CLUSTER A (size 3) |
| /content/duplicates/cluster-a-3 | CLUSTER A (size 3) |
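
Why byte-identical bodies must land in one cluster can be illustrated with a minimal SimHash sketch. The scanner's real tokenisation, shingle size, hash function, and fingerprint width are not stated in this section, so `simhash`, the 3-word shingles, the 64-bit width, and the use of BLAKE2b below are all assumptions made for illustration only.

```python
import hashlib


def simhash(text: str, bits: int = 64, k: int = 3) -> int:
    """Illustrative SimHash over k-word shingles (not the scanner's actual code)."""
    words = text.lower().split()
    # Overlapping k-word shingles; texts shorter than k words yield one shingle.
    shingles = [" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))]
    counts = [0] * bits
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)
```

Under these assumptions the fingerprint is a pure function of the body text, so three pages with the same body produce the same fingerprint (distance 0) regardless of their titles or chrome.
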
Three pages sharing the same body: B-1 is the unmodified original, and B-2 and B-3 each paraphrase one sentence. Expected: one cluster of size 3 at the default Hamming-distance threshold of 3.
| Page | Delta | Expected |
|---|---|---|
| /content/duplicates/cluster-b-1 | None (original text) | CLUSTER B (size 3) |
| /content/duplicates/cluster-b-2 | Para 1 sentence 2 paraphrased | CLUSTER B (size 3) |
| /content/duplicates/cluster-b-3 | Para 1 sentence 3 paraphrased | CLUSTER B (size 3) |
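
The threshold check itself can be sketched as a Hamming-distance comparison between two integer fingerprints. The function names are hypothetical, and treating "within distance 3" as an inclusive bound (`<=`) is an assumption based on the wording of this section.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def within_threshold(a: int, b: int, threshold: int = 3) -> bool:
    """True when two fingerprints are close enough to be linked (inclusive bound assumed)."""
    return hamming_distance(a, b) <= threshold
```

For example, `within_threshold(0b0000, 0b0111)` holds (3 differing bits), while `within_threshold(0b0000, 0b1111)` does not (4 differing bits).
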
Two pages with the same three paragraphs in different order. SimHash is order-sensitive only where shingles span a paragraph boundary, so the two fingerprints should be close, but clustering at threshold 3 is not guaranteed. Verify empirically.
| Page | Paragraph order | Expected |
|---|---|---|
| /content/duplicates/cluster-c-1 | P1 → P2 → P3 | VERIFY — may cluster with C-2 |
| /content/duplicates/cluster-c-2 | P3 → P1 → P2 | VERIFY — may cluster with C-1 |
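
The effect of reordering can be seen by comparing shingle sets directly. This is a toy sketch: it assumes 3-word shingles and uses three made-up six-word paragraphs with no repeated words, none of which comes from the actual fixtures.

```python
def word_shingles(words: list[str], k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word shingles (shingle size is an assumption)."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


# Toy paragraphs with unique words so every shingle is distinct.
p1 = "a1 a2 a3 a4 a5 a6".split()
p2 = "b1 b2 b3 b4 b5 b6".split()
p3 = "c1 c2 c3 c4 c5 c6".split()

order_c1 = word_shingles(p1 + p2 + p3)  # P1, P2, P3
order_c2 = word_shingles(p3 + p1 + p2)  # P3, P1, P2

shared = order_c1 & order_c2    # shingles inside paragraphs, plus the P1|P2 boundary
only_c1 = order_c1 - order_c2   # shingles spanning the P2|P3 boundary
only_c2 = order_c2 - order_c1   # shingles spanning the P3|P1 boundary
```

Here 14 of the 16 shingles in each ordering are shared and only the two that straddle the changed boundary differ. With realistic paragraph lengths the differing fraction is smaller still, but whether that keeps the fingerprints within 3 bits depends on the hash, which is why this fixture is marked VERIFY.
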
Same source text in English and Swedish. The two pages share almost no tokens, so their fingerprints are effectively unrelated and fall well outside the threshold. Expected: no cluster.
| Page | Language | Expected |
|---|---|---|
| /content/duplicates/anti-translation-en | English | NO CLUSTER |
| /content/duplicates/anti-translation-sv | Swedish | NO CLUSTER |
Pages with fewer than 30 words of body content. Excluded from fingerprinting entirely. Expected: cluster_id: null.
| Page | Content | Expected |
|---|---|---|
| /content/duplicates/anti-short-1 | "Thank you" confirmation stub (~10 words) | NO FINGERPRINT — cluster_id: null |
| /content/duplicates/anti-short-2 | "Page not found" stub (~12 words) | NO FINGERPRINT — cluster_id: null |
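
The short-page exclusion can be sketched as a simple word-count gate applied before fingerprinting. The helper name and the whitespace-based word count are assumptions; the scanner may count words differently.

```python
MIN_WORDS = 30  # pages with fewer words than this are not fingerprinted


def is_fingerprintable(body_text: str, min_words: int = MIN_WORDS) -> bool:
    """True when the body has at least min_words whitespace-separated words."""
    return len(body_text.split()) >= min_words
```

A page that fails this check would get no fingerprint and therefore report cluster_id: null, which is what both stubs above are expected to show.
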
Two pages about Clarity Analytics with substantially different body text (pricing vs. security). Expected: no cluster — proves the detector groups by copy similarity, not by topic.
| Page | Topic | Expected |
|---|---|---|
| /content/duplicates/anti-different-1 | Pricing plans (Starter / Growth / Enterprise) | NO CLUSTER |
| /content/duplicates/anti-different-2 | Security and compliance (TLS, AES-256, SOC 2) | NO CLUSTER |
The pairs T-1↔T-2, T-2↔T-3, and T-3↔T-4 are each expected to be within Hamming distance 3, while T-1↔T-4 is likely above 3. Because union-find merges transitively, all four pages should still end up in one cluster of size 4.
| Page | Sentences paraphrased vs T-1 | Expected |
|---|---|---|
| /content/duplicates/transitive-1 | 0 (base) | TRANSITIVE CLUSTER (size 4) |
| /content/duplicates/transitive-2 | 1 (para 1, sentence 1) | TRANSITIVE CLUSTER (size 4) |
| /content/duplicates/transitive-3 | 2 (para 1 s1 + para 2 s1) | TRANSITIVE CLUSTER (size 4) |
| /content/duplicates/transitive-4 | 3 (para 1 s1 + para 2 s1 + para 3 s1) | TRANSITIVE CLUSTER (size 4) |
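
The transitive behaviour can be sketched with a small union-find over pairwise comparisons. The function name, the all-pairs comparison, and the toy fingerprints in the usage note are assumptions for illustration; the scanner's actual candidate-pair strategy is not described here.

```python
from itertools import combinations


def cluster_pages(fingerprints: dict[str, int], threshold: int = 3) -> list[list[str]]:
    """Group pages whose fingerprints are linked by chains of near-duplicate pairs."""
    parent = {page: page for page in fingerprints}

    def find(page: str) -> str:
        # Follow parent links to the root, halving the path as we go.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for a, b in combinations(fingerprints, 2):
        if bin(fingerprints[a] ^ fingerprints[b]).count("1") <= threshold:
            parent[find(a)] = find(b)  # union the two sets

    groups: dict[str, list[str]] = {}
    for page in fingerprints:
        groups.setdefault(find(page), []).append(page)
    # Pages left alone in their set are not reported as clusters.
    return [sorted(members) for members in groups.values() if len(members) > 1]
```

With toy fingerprints `{"t1": 0b0, "t2": 0b111, "t3": 0b111111, "t4": 0b111111111}`, each neighbouring pair differs by 3 bits while t1 and t4 differ by 9, yet all four are returned as a single cluster because each link joins the same set.
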
Two pages with identical body content but very different nav/footer/aside blocks. Chrome is stripped before hashing. Expected: cluster of size 2.
| Page | Chrome | Expected |
|---|---|---|
| /content/duplicates/chrome-diff-1 | Minimal nav (2 links), minimal aside/footer | CHROME CLUSTER (size 2) |
| /content/duplicates/chrome-diff-2 | 15-link nav, rich aside, verbose footer | CHROME CLUSTER (size 2) |
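
Chrome stripping can be sketched with the standard-library HTML parser by discarding any text nested inside nav, footer, or aside elements. The class and function names are hypothetical, and the exact set of elements the scanner treats as chrome is an assumption drawn from the description above.

```python
from html.parser import HTMLParser

CHROME_TAGS = {"nav", "footer", "aside"}  # assumed chrome elements


class BodyTextExtractor(HTMLParser):
    """Collects text that is not nested inside a chrome element."""

    def __init__(self) -> None:
        super().__init__()
        self._chrome_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self._chrome_depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self._chrome_depth > 0:
            self._chrome_depth -= 1

    def handle_data(self, data):
        if self._chrome_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace so layout differences do not matter.
        return " ".join(" ".join(self._chunks).split())


def body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Two pages whose main content matches but whose nav, aside, and footer differ yield the same `body_text` result, so they hash identically and should form the cluster of size 2 listed above.
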
A unique page about audit logging with no near-duplicates in this fixture set. Expected: cluster_id: null.
| Page | Content | Expected |
|---|---|---|
| /content/duplicates/singleton | Audit logging feature description | NO CLUSTER — cluster_id: null |