Duplicate content

GET /api/v1/crawl/{jobId}/duplicate-content

Finds groups of pages within a crawl whose visible body text is near-identical, so you can spot thin variants, boilerplate, and accidental copies. Each page's text is reduced to a 64-bit SimHash, and pages whose hashes are within a Hamming distance of 4 bits of one another are clustered together as near-duplicates.
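
As a rough illustration of the fingerprinting described above, the following Python sketch computes an unweighted 64-bit SimHash over word tokens and groups pages whose fingerprints differ by 4 bits or fewer. The tokenizer, the MD5-based token hash, the single-linkage grouping, and all function names are assumptions made for illustration; the service's actual implementation is not documented here and may differ.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Unweighted 64-bit SimHash over lowercase word tokens (illustrative only)."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: the first 8 bytes of MD5 serve as a stable 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cluster(fingerprints: dict[str, int], max_distance: int = 4) -> list[list[str]]:
    """Group URLs whose fingerprints are within max_distance bits.

    Assumption: single-linkage grouping via union-find; pairwise comparison
    is O(n^2), which is fine for a sketch but not for large crawls.
    """
    urls = list(fingerprints)
    parent = {url: url for url in urls}

    def find(url: str) -> str:
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path halving
            url = parent[url]
        return url

    for i, first in enumerate(urls):
        for second in urls[i + 1:]:
            if hamming(fingerprints[first], fingerprints[second]) <= max_distance:
                parent[find(first)] = find(second)

    groups: dict[str, list[str]] = {}
    for url in urls:
        groups.setdefault(find(url), []).append(url)
    # Only groups with at least two pages count as duplicates.
    return [members for members in groups.values() if len(members) > 1]
```

With single-linkage grouping, two pages can end up in the same cluster through a chain of intermediate pages even when their own fingerprints differ by more than 4 bits. Whether the endpoint behaves this way is not stated in this reference.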

Path & query

Name	Type	Description
`jobId` required	string	The crawl job id.

Request

Response

200 · application/json

{
  "jobId": "crw_7Hk29fQ",
  "startUrl": "https://example.com",
  "host": "example.com",
  "status": "done",
  "maxPages": 50,
  "createdAt": "2026-06-02T13:24:49.414Z",
  "finishedAt": "2026-06-02T13:24:52.055Z",
  "duplicateGroups": [
    {
      "count": 2,
      "urls": ["https://example.com/a", "https://example.com/b"]
    }
  ]
}

Response fields

Field	Type	Description
`duplicateGroups`	array	Clusters of near-duplicate pages. Each cluster has `count` (the number of pages in it) and `urls` (the URLs of the pages in the cluster, capped at 50). An empty array means no near-duplicate content was found.
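
To show how these fields might be consumed, here is a small Python sketch that turns a decoded response into one summary line per cluster. The function name `summarize_duplicates` is hypothetical. The sketch also assumes that `count` can exceed the length of `urls` when the 50-URL cap applies, which is an inference from the field descriptions rather than documented behavior.

```python
import json


def summarize_duplicates(payload: dict) -> list[str]:
    """Return one human-readable line per near-duplicate cluster."""
    lines = []
    for group in payload.get("duplicateGroups", []):
        urls = group["urls"]
        count = group["count"]
        # Assumption: a capped `urls` list may be shorter than `count`.
        note = "" if len(urls) == count else f" (showing {len(urls)})"
        lines.append(f"{count} near-duplicate pages{note}: " + ", ".join(urls))
    return lines


# The example response from this page, trimmed to the fields used here.
sample = json.loads("""
{
  "jobId": "crw_7Hk29fQ",
  "status": "done",
  "duplicateGroups": [
    {"count": 2, "urls": ["https://example.com/a", "https://example.com/b"]}
  ]
}
""")

for line in summarize_duplicates(sample):
    print(line)
```

An empty `duplicateGroups` array yields an empty list, matching the "no near-duplicate content was found" case.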

See Errors for status codes.