Duplicate content
GET /api/v1/crawl/{jobId}/duplicate-content
Finds groups of pages within a crawl whose visible body text is near-identical, so you can spot thin variants, boilerplate, and accidental copies. Each page's text is reduced to a 64-bit SimHash, and pages whose hashes differ by a Hamming distance of 4 bits or fewer are clustered together as near-duplicates.
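To make the idea concrete, here is a minimal sketch of 64-bit SimHash fingerprinting and Hamming-distance grouping. It is an illustration only: the tokenization (lowercase word tokens), the per-token hash (truncated MD5), and the single-linkage grouping are assumptions, since the service's actual choices are not documented here. The function names are hypothetical.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Return a 64-bit SimHash of the text (word tokenization is an assumption)."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Stable 64-bit token hash: first 8 bytes of the MD5 digest (assumed choice).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cluster(fingerprints: dict, max_distance: int = 4) -> list:
    """Group URLs whose fingerprints are within max_distance bits (single linkage)."""
    urls = list(fingerprints)
    parent = {u: u for u in urls}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            if hamming(fingerprints[a], fingerprints[b]) <= max_distance:
                parent[find(b)] = find(a)

    groups = {}
    for u in urls:
        groups.setdefault(find(u), []).append(u)
    # Only groups with at least two pages count as duplicates.
    return [g for g in groups.values() if len(g) > 1]
```

The pairwise comparison is quadratic in the number of pages, which is fine for a sketch but would likely be replaced by an index over fingerprint bit blocks on larger crawls.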
Path & query
| Name | Type | Description |
|---|---|---|
| jobId (required) | string | The crawl job ID. |
Request
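A request could be issued along these lines. This is a sketch using only the Python standard library; the base URL `https://api.example.com` is a placeholder, and any authentication headers your account requires are not covered by this page, so they are passed in as an optional, hypothetical `headers` argument.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real API host


def duplicate_content_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the endpoint URL, percent-encoding the job id as a path segment."""
    encoded = urllib.parse.quote(job_id, safe="")
    return f"{base_url}/api/v1/crawl/{encoded}/duplicate-content"


def fetch_duplicate_content(job_id: str, base_url: str = BASE_URL, headers=None) -> dict:
    """GET the duplicate-content report for a crawl job and decode the JSON body."""
    request = urllib.request.Request(
        duplicate_content_url(job_id, base_url),
        headers=headers or {},  # e.g. an auth header, if your setup needs one
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_duplicate_content("crw_7Hk29fQ")` would request `/api/v1/crawl/crw_7Hk29fQ/duplicate-content` on the configured host.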
Response
200 · application/json
{
  "jobId": "crw_7Hk29fQ",
  "startUrl": "https://example.com",
  "host": "example.com",
  "status": "done",
  "maxPages": 50,
  "createdAt": "2026-06-02T13:24:49.414Z",
  "finishedAt": "2026-06-02T13:24:52.055Z",
  "duplicateGroups": [
    {
      "count": 2,
      "urls": ["https://example.com/a", "https://example.com/b"]
    }
  ]
}
Response fields
| Field | Type | Description |
|---|---|---|
| duplicateGroups | array | Clusters of near-duplicate pages. Each cluster has count (how many pages are in it) and urls (the pages in the cluster, capped at 50). An empty array means no near-duplicate content was found. |
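As an illustration of consuming these fields, the sketch below walks `duplicateGroups` in a decoded response. The helper name `summarize_groups` and the `truncated` flag are hypothetical; the flag rests on the assumption that `count` can exceed the number of entries in `urls` when the 50-URL cap applies.

```python
def summarize_groups(payload: dict) -> list:
    """Return one summary dict per near-duplicate cluster in the response."""
    summaries = []
    for group in payload.get("duplicateGroups", []):
        urls = group.get("urls", [])
        count = group.get("count", len(urls))
        summaries.append({
            "count": count,
            "urls": urls,
            # Assumed meaning: more pages in the cluster than URLs listed.
            "truncated": count > len(urls),
        })
    return summaries


sample = {
    "jobId": "crw_7Hk29fQ",
    "status": "done",
    "duplicateGroups": [
        {"count": 2, "urls": ["https://example.com/a", "https://example.com/b"]}
    ],
}

for summary in summarize_groups(sample):
    print(summary["count"], "pages:", ", ".join(summary["urls"]))
```

An empty list from `summarize_groups` corresponds to the case where no near-duplicate content was found.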
See Errors for status codes.