Duplicate content

GET /api/v1/crawl/{jobId}/duplicate-content

Finds groups of pages within a crawl whose visible body text is near-identical, so you can spot thin variants, boilerplate, and accidental copies. Each page's text is reduced to a 64-bit SimHash fingerprint, and pages whose fingerprints are within a Hamming distance of 4 bits or fewer are clustered together as near-duplicates.
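To make the fingerprint comparison concrete, here is a minimal sketch of a 64-bit SimHash and the 4-bit Hamming threshold. The tokenization (lowercase word tokens) and the per-token hash (first 8 bytes of MD5) are assumptions chosen for illustration; this page does not say how the service tokenizes or hashes text, so fingerprints from this sketch will not match the service's.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """64-bit SimHash over lowercase word tokens (tokenization is an assumption)."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Stable 64-bit token hash: first 8 bytes of MD5 (illustrative choice).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when its accumulated weight is positive.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 4) -> bool:
    """Apply the documented threshold: 4 differing bits or fewer."""
    return hamming_distance(a, b) <= max_distance
```

Identical text always yields identical fingerprints (distance 0), and small edits tend to flip only a few bits, which is why a low Hamming threshold captures near-duplicates.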

Path & query

Name | Type | Description
jobId (required) | string | The crawl job id.

Request
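The request takes no body and no parameters other than jobId in the path. A minimal sketch of calling it with the Python standard library follows. The base URL https://api.example.com and the Bearer token header are placeholders assumed for illustration; this page does not specify the API host or the authentication scheme.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

# Hypothetical base URL; substitute the real API host.
BASE_URL = "https://api.example.com"


def duplicate_content_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the endpoint URL, percent-encoding the job id for use in the path."""
    encoded = quote(job_id, safe="")
    return f"{base_url.rstrip('/')}/api/v1/crawl/{encoded}/duplicate-content"


def fetch_duplicate_content(job_id: str, token: Optional[str] = None) -> dict:
    """GET the duplicate-content report and decode the JSON body."""
    headers = {"Accept": "application/json"}
    if token:
        # Bearer auth is an assumption; this page does not document auth.
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(duplicate_content_url(job_id), headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Non-2xx responses raise urllib.error.HTTPError, so callers would typically catch it and inspect the status code.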

Response

200 · application/json
{
  "jobId": "crw_7Hk29fQ",
  "startUrl": "https://example.com",
  "host": "example.com",
  "status": "done",
  "maxPages": 50,
  "createdAt": "2026-06-02T13:24:49.414Z",
  "finishedAt": "2026-06-02T13:24:52.055Z",
  "duplicateGroups": [
    {
      "count": 2,
      "urls": ["https://example.com/a", "https://example.com/b"]
    }
  ]
}

Response fields

Field | Type | Description
duplicateGroups | array | Clusters of near-duplicate pages. Each cluster has count (how many pages are in it) and urls (the pages in the cluster, capped at 50). An empty array means no near-duplicate content was found.
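Because urls is capped at 50 entries, a consumer may want to flag clusters whose listed URLs do not cover the whole cluster. The sketch below summarizes a decoded response; it assumes that count reports the full cluster size even when urls is truncated, which this page implies but does not state outright. The function name and the output keys are illustrative, not part of the API.

```python
def summarize_duplicate_groups(report: dict) -> list:
    """Return one summary per cluster, largest clusters first."""
    summaries = []
    for group in report.get("duplicateGroups", []):
        urls = group.get("urls", [])
        count = group.get("count", len(urls))
        summaries.append({
            "count": count,
            "listed": len(urls),
            # True when the cluster has more pages than the capped urls list shows.
            "truncated": count > len(urls),
            "urls": urls,
        })
    summaries.sort(key=lambda item: item["count"], reverse=True)
    return summaries
```

An empty duplicateGroups array produces an empty list, matching the "no near-duplicate content was found" case.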

See Errors for status codes.