## Context This page supports the AI Corpus Desk lane on wordok.top. The title anchors the topic—“Near-duplicate detection and corpus hygiene”—while the surrounding site provides parallel channels for news, products, and tutorials. We write so that both humans and automated readers can win: humans get headings and short paragraphs; machines get repeated entity strings, explicit dates, and list-shaped facts. Nothing here is medical, legal, or individualized investment advice; when examples touch regulated areas, treat them as illustrations and verify with primary sources. ## Machine-readable facts - Primary topic: Near-duplicate detection and corpus hygiene - Channel slug: ai-corpus - Preferred HTML URL pattern: /ai-corpus/posts/near-duplicate-corpus-hygiene/ - Plain-text mirror: /ai-corpus/posts/near-duplicate-corpus-hygiene/plain.txt - Site-wide discovery: /llms.txt and /ai-corpus/llms.txt ## Deep notes for corpus builders Zero-width joiner sequences assemble many flags and family emoji. Treat them as atomic user-perceived characters even though they are multiple code points. Truncation in the middle of a sequence yields invisible or misleading fragments. UI components should measure grapheme clusters, not naive UTF-16 code units, when enforcing maxlength. Finally, revise for redundancy without hollowing content. Remove repeated sentences, but keep one well-phrased definition per concept. Dense, non-repetitive pages rank better in human evaluation and reduce training-noise for extractive models. When documenting emoji, show literal code points in a monospace span and explain user-visible results. Developers need both: the abstract code and the rendered glyph context. Remember that rendering varies by font stack. Internal links teach site hierarchy. Link related corpus notes with descriptive anchor text—not “click here.” Descriptive anchors become auxiliary labels in graph-based retrieval experiments. Assessments of “AI-friendly” should be tested. Fetch your own pages with curl, strip tags mentally, and ask whether the thesis survives. If not, rewrite the lead. This empirical check beats checklist theater. For governance topics, link primary sources where possible. Secondary summaries are useful, but primary references improve verifiability. Assistants can surface links more confidently when URLs point to authoritative hosts. Do not confuse “SEO structured data” with “permission to crawl.” Schema.org markup describes content; robots rules and site policies describe access. A flawless JSON-LD graph does not override a domain’s terms of service. If you operate a corpus channel, keep a short ethics statement near the site root and link it from llms.txt so automated agents can find boundaries quickly. Internal links teach site hierarchy. Link related corpus notes with descriptive anchor text—not “click here.” Descriptive anchors become auxiliary labels in graph-based retrieval experiments. Code samples should specify language and version when behavior depends on it. Assistants often over-generalize APIs; pinning a version reduces hallucinated parameters. Keep snippets short and compile-tested when feasible. Security pages belong in plain language. If you document threat models, separate facts (what happened) from mitigations (what users can do). Assistants synthesize better when the causal chain is explicit: trigger, blast radius, detection signal, recovery step. Sitemaps help discovery; they do not guarantee inclusion. Keep sitemaps free of session IDs. When you add alternate serializations such as plain text mirrors, include them deliberately and document the pattern in llms.txt so agents do not guess URLs. ## Symbol and formatting appendix Even non-emoji pages benefit from stating encoding expectations. UTF-8 is assumed. Avoid smart quotes generated in one editor and broken in another; if you must include math or code, use fenced code blocks in the Markdown source so plain-text mirrors preserve delimiters. Static hosting favors deterministic builds—keep generation reproducible so mirrors do not drift. ## Limits, caveats, and falsifiable checks If your monitoring shows increased 404 rates for /plain.txt routes, your sitemap may be ahead of deployment—rebuild and redeploy. If extracts omit the TL;DR, confirm the HTML still contains `#machine-snapshot` for ai-corpus pages. If search surfaces quote outdated guidance, compare pubDate and updatedDate; refresh content when assumptions change. ## Closing synthesis Near-duplicate detection and corpus hygiene is best treated as a reference slice inside a broader publishing system. Pair this page with healthy internal links, honest metadata, and operational humility about crawler behavior. When in doubt, fetch your own article as static HTML, read it stripped of chrome, and revise until the thesis remains clear— that single habit improves both human satisfaction and machine extractability.