What Is a Search Engine?
A search engine is a distributed software system that discovers, indexes, ranks, and serves web documents in response to user queries. When you type "best on-page SEO tool" into Google, Bing, or DuckDuckGo, the engine does not search the live web in real time. Instead, it looks up your query in a precomputed index of billions of documents — an index it has been building and refreshing for years — ranks the matching documents against hundreds of signals, and returns the top few in milliseconds.
Modern search engines are some of the largest engineering systems ever built. Google alone operates millions of servers across dozens of data centres, crawls billions of URLs per day, stores a copy of the indexable web several times over, and serves billions of queries every 24 hours. Bing, Yandex, Baidu, and the newer generative-AI engines run comparable (if smaller) infrastructures.
Understanding how a search engine works is the foundation of all technical SEO. If you understand crawling, indexing, ranking, and serving as discrete pipeline stages, almost every SEO problem resolves into a precise question at one of those stages.
A Brief History of Search Engines
The web has had something like a search engine since the moment it became possible to list more than one document. The field can be divided into four distinct eras.
1990-1994: Pre-Google Pioneers
Archie (1990), launched at McGill University, indexed FTP file listings and is often called the first search engine. Veronica and Jughead followed, indexing Gopher menus. The first web-specific engines — W3Catalog, Wandex, Aliweb, and JumpStation — appeared in 1993 and 1994. They were small, manually curated, and quickly overrun by the web's growth.
1994-1998: The First Mass-Market Era
Yahoo! launched as a curated directory in 1994, then added search. Lycos, Infoseek, AltaVista, Excite, and HotBot all scaled to tens of millions of documents. AltaVista's full-text search and fast crawl were revolutionary. Ranking in this era was primarily based on keyword matching — page titles, meta keyword tags, and on-page text frequency.
1998-2010: The Google / PageRank Era
Google launched in September 1998 with a radically better ranking algorithm: PageRank, which treated hyperlinks as votes. Larry Page and Sergey Brin's original paper ("The Anatomy of a Large-Scale Hypertextual Web Search Engine") is still worth reading. Google rapidly overtook every competitor and by the mid-2000s handled the majority of the world's searches.
2010-present: The Machine Learning Era
Google began replacing hand-tuned ranking signals with machine-learned models: RankBrain (2015), Neural Matching (2018), BERT (2019), MUM (2021), and the SGE / AI Overviews rollout (2023-2024). Rankings now depend on deep neural models that interpret query intent, entity relationships, and document semantics. Understanding what the user is actually asking matters more than matching keywords character-for-character.
The Core Pipeline: Crawl → Index → Rank → Serve
Every major search engine has the same four-stage pipeline. The internal implementations differ wildly, but the conceptual model is universal.
Stage 1: Crawling
A crawler (also called a spider or bot) is a program that fetches URLs from the web and passes the content to the rest of the pipeline. Googlebot is the most famous example. Other named crawlers include Bingbot, DuckDuckBot, YandexBot, Baiduspider, and Applebot.
A crawler starts with a seed set of URLs (historically DMOZ; today a mix of sitemaps, prior crawl data, and externally discovered links), fetches each URL, parses it, extracts all links, and adds those links to the crawl frontier. The frontier is prioritised by crawl budget: popular, fresh, and authoritative pages get crawled more often than obscure ones.
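Conceptually, the frontier behaves like a priority queue keyed on crawl-worthiness. A minimal sketch in Python — the scoring formula, weights, and field names are invented for illustration, not Google's actual logic:

```python
import heapq

class CrawlFrontier:
    """Toy crawl frontier: a priority queue of URLs, highest score first."""

    def __init__(self):
        self._heap = []     # (negative score, url) pairs for max-heap behaviour
        self._seen = set()  # URLs already enqueued, to avoid duplicates

    def score(self, url, popularity, days_since_change):
        # Illustrative assumption: popular, frequently changing pages crawl sooner.
        freshness = 1.0 / (1.0 + days_since_change)
        return 0.7 * popularity + 0.3 * freshness

    def add(self, url, popularity=0.0, days_since_change=365):
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-self.score(url, popularity, days_since_change), url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = CrawlFrontier()
frontier.add("https://example.com/", popularity=0.9, days_since_change=1)
frontier.add("https://example.com/old-page", popularity=0.1, days_since_change=400)
print(frontier.next_url())  # the popular, fresh homepage crawls first
```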
Important HTTP details Google respects
- robots.txt — controls crawl access. See the robots.txt guide for details.
- Status codes — 200 (OK), 301 (moved), 404 (not found), 410 (gone), 429 (too many requests), 5xx (server errors). Bots throttle themselves on 429 and 5xx.
- Last-Modified / If-Modified-Since — lets Google avoid re-downloading unchanged pages (a conditional-request sketch follows this list).
- Sitemaps — XML sitemaps advertise URLs with lastmod, changefreq, and priority hints.
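Here is a minimal sketch of that conditional-request exchange using Python's standard library. The URL is a placeholder, and the result depends on the server actually sending a Last-Modified header:

```python
import urllib.request
import urllib.error

url = "https://example.com/page"  # placeholder URL

# First fetch: note the Last-Modified header the server sends back.
with urllib.request.urlopen(url) as resp:
    last_modified = resp.headers.get("Last-Modified")

# Second fetch: send the timestamp back as If-Modified-Since.
req = urllib.request.Request(url, headers={"If-Modified-Since": last_modified or ""})
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, "- page changed, full body re-downloaded")
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("304 - not modified, nothing re-downloaded")
    else:
        raise
```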
Stage 2: Indexing
Once a URL is crawled, the rendered HTML is parsed and a huge set of features is extracted. Google renders JavaScript (via a headless Chromium) so single-page applications can be indexed, though rendering is a secondary, slower queue.
Features extracted include the title tag, meta description, heading structure, body text, image alts, structured data, internal and external links, canonical URL, robots meta, hreflang, and Core Web Vitals signals. See our indexing guide for a full list of signals extracted.
Documents are then stored in an inverted index: a mapping from terms to the documents containing them. Google's index is sharded across thousands of servers and contains trillions of postings (term-document pairs). Additional indices capture embeddings, entities, images, videos, and structured facts.
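Stripped of positions, stemming, compression, and sharding, an inverted index is just a mapping from term to posting list. A toy version over an invented three-document corpus:

```python
from collections import defaultdict

docs = {
    1: "how to submit a sitemap to google search console",
    2: "robots txt guide for crawl control",
    3: "submit your sitemap and check index coverage",
}

# Build the inverted index: term -> set of doc IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return doc IDs containing every query term (boolean AND retrieval)."""
    posting_sets = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*posting_sets)) if posting_sets else []

print(search("submit sitemap"))  # -> [1, 3]
```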
Stage 3: Ranking
When a query arrives, the engine identifies the "initial retrieval set" — the hundreds or thousands of documents that could plausibly match the query. It then ranks this set using a blend of signals: textual match, semantic match, backlinks, freshness, user engagement, E-E-A-T, page speed, device compatibility, language, location, and many more.
Ranking is where search engines differentiate. Google uses a sequence of models (BERT, MUM, neural matching, RankBrain, SpamBrain, helpful content classifier, reviews classifier, core ranking model) layered on top of classical IR signals.
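As a mental model only — production ranking is a stack of learned, non-linear models — the blend can be pictured as a weighted combination of per-document features. The feature names and weights below are invented for illustration:

```python
def rank_score(features: dict) -> float:
    """Toy linear ranking function. Weights are illustrative, not Google's."""
    weights = {
        "text_match": 0.30,      # classical term/BM25-style match
        "semantic_match": 0.25,  # embedding similarity to the query
        "link_authority": 0.25,  # PageRank-like signal
        "freshness": 0.10,
        "page_experience": 0.10, # Core Web Vitals, HTTPS, mobile-friendliness
    }
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

candidates = {
    "/deep-guide": {"text_match": 0.9, "semantic_match": 0.8, "link_authority": 0.6,
                    "freshness": 0.4, "page_experience": 0.9},
    "/thin-page":  {"text_match": 0.7, "semantic_match": 0.3, "link_authority": 0.2,
                    "freshness": 0.9, "page_experience": 0.5},
}
ranked = sorted(candidates, key=lambda url: rank_score(candidates[url]), reverse=True)
print(ranked)  # ['/deep-guide', '/thin-page']
```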
Stage 4: Serving
The final stage assembles the Search Engine Results Page (SERP). Modern SERPs are more than ten blue links: they include featured snippets, People Also Ask, image packs, video carousels, local packs, shopping results, knowledge panels, and increasingly, AI-generated overviews. See the SERP snippet generator to preview how your page will render.
The Main Search Engines in 2026
| Engine | Launched | Global share | Notes |
|---|---|---|---|
| Google | 1998 | ~91% | Dominant in every major market except China, Russia, and South Korea. |
| Bing | 2009 (Live Search prior) | ~4% | Powers Yahoo, DuckDuckGo, Ecosia, and most ChatGPT web results. Much larger than its own share implies. |
| Yandex | 1997 | ~1-2% | Dominant in Russia. Strong on Cyrillic and Turkish. Leaked source in 2023 provided an SEO bonanza. |
| Baidu | 2000 | ~1% | Dominant in mainland China. Prefers simplified Chinese, localised hosting, ICP licences. |
| Naver | 1999 | ~0.5% | Dominant in South Korea. Integrates blogs, cafes, and proprietary verticals. |
| DuckDuckGo | 2008 | ~0.7% | Privacy-focused meta-search. Pulls from Bing, Yandex, and its own crawler. |
| Brave Search | 2021 | <0.3% | Independent index; privacy focus. |
| Perplexity, You.com, Andi | 2022+ | <0.3% | "Answer engines" that use LLMs to synthesise responses from a web index. |
How Bing Differs from Google
Bing's core pipeline mirrors Google's, but with different weights. Bing leans more heavily on exact-match keywords in titles, social signals, and on-site engagement metrics. Bing also respects <meta name="keywords"> in a limited way and runs its own webmaster tools suite (Bing Webmaster Tools) that exposes more raw data than Google Search Console.
How Yandex Differs
Yandex leaked 1,922 ranking factors in January 2023, giving the SEO community a rare look under the hood. The leak confirmed that Yandex uses PageRank-like link analysis, on-page signals, user behaviour (clickstream data from Yandex Metrica and Browser), and regional host quality. It also revealed a "commercial factor" that essentially down-ranks pages that look like low-quality affiliate sites.
How Baidu Differs
Baidu is the dominant Chinese engine. To rank, you generally need: simplified Chinese content, a Chinese-registered domain (.cn), an ICP licence, and hosting physically inside mainland China. Baidu heavily favours its own properties (Baike, Tieba, Zhidao) in results.
The Rise of Generative Search (SGE, AI Overviews, and LLM Engines)
Between 2023 and 2026, the search experience has changed more dramatically than in any previous five-year period. Three threads of innovation are converging:
- Google's AI Overviews (formerly SGE). An LLM-generated summary at the top of the SERP, grounded on live retrieval from Google's index. Rolled out broadly in 2024.
- ChatGPT Search, Perplexity, You.com. Standalone answer engines that combine an LLM with a live web retrieval layer. They cite sources and drive meaningful referral traffic to well-optimised content.
- Bing Copilot. Microsoft's integrated Bing + GPT experience. Surfaces citations more prominently than pure ChatGPT Search.
The SEO implication is that "ranking" in 2026 can mean three distinct things: ranking in classical blue links, being included in an AI Overview, or being the cited source a chatbot hands back to the user. All three depend on the same fundamentals — clean HTML, structured data, strong topical authority, and fast, accessible pages — but they reward slightly different content patterns.
Ranking Factors: An Overview
Google has long said its ranking systems weigh more than 200 signals, and the leaked 2024 "Content Warehouse API" docs suggested thousands more features feed the wider system. Below is a non-exhaustive overview of the most consequential families.
Content Quality and Relevance
- Topical match: does the page actually answer the query?
- Depth: is the content comprehensive or thin?
- Freshness: for time-sensitive queries, newer usually wins.
- Duplicate content: Google deduplicates near-identical pages into a single canonical.
Authority Signals
- Backlinks: number and quality of sites linking to you.
- Domain authority (as a concept — Google does not use Moz's metric, but has analogous internal signals).
- E-E-A-T: Experience, Expertise, Authoritativeness, Trust — evaluated especially for YMYL ("Your Money or Your Life") queries.
Technical Signals
- HTTPS — a confirmed (if lightweight) ranking signal and effectively table stakes for competitive queries.
- Mobile-friendliness — Google uses mobile-first indexing.
- Page speed and Core Web Vitals — LCP, INP, CLS.
- Crawlability — correct robots.txt, canonicals, hreflang, sitemaps.
- Structured data — enables rich results.
User Engagement Signals
- Click-through rate from the SERP.
- Dwell time and scroll depth.
- Bounce rate (indirectly — Google has said it is not a direct factor but related signals are used).
- "Pogosticking" — when users click a result, come back, and click another.
A Simple Example: One Query, Four Stages
Let us trace a query — "how to submit a sitemap to google" — through the four stages.
- Crawl. Googlebot fetched the URL ranknibbler.com/how-to-submit-sitemap-to-google when it discovered it from the homepage.
- Index. Google extracted the title "How to submit a sitemap to Google...", the H1, all H2s, the body text, and all structured data. It stored postings for terms like "sitemap", "submit", "google", and "search console".
- Rank. When a user typed the query, Google's model scored every URL in its initial retrieval set. The RankNibbler page scored well because the body directly answered the query, the title matched, and the domain had a reasonable authority signal.
- Serve. Google assembled a SERP with a mix of featured snippet, organic results, a PAA ("People Also Ask") box, and possibly an AI Overview citing this page.
Vertical Search Engines
Not every search engine is a general-purpose one. Dozens of vertical engines dominate their niches.
- YouTube — second largest search engine by query volume. Ranking combines watch time, CTR, session length, and title/description matching.
- Amazon A9 / A10 — ranks products by sales velocity, conversion rate, and title/keyword relevance.
- App Store Search (Apple / Google Play) — ranks mobile apps by downloads, ratings, and on-store keyword match.
- Pinterest — a visual search engine for ideas, with heavy reliance on image metadata and pin descriptions.
- TikTok — increasingly used by Gen Z as a primary search engine for product and location discovery.
- GitHub Code Search — searches billions of lines of open-source code.
How Search Engines Make Money
Almost all general-purpose search engines are advertising-supported. Google Ads (formerly AdWords) and Microsoft Advertising run auctions on keywords; advertisers bid for placement above and alongside organic results. A single high-commercial-intent query can cost hundreds of dollars per click (think "mesothelioma lawyer").
A small number of engines take alternative models: DuckDuckGo sells contextual ads without user profiling; Kagi charges a monthly subscription; Ecosia uses ads but donates profits to reforestation; Brave Search embeds optional ads into Brave's browser.
Common Misconceptions
"Google searches the live web when I type a query."
No. It searches its pre-built index; the live web is crawled in advance, not at query time.
"More keywords on my page means higher ranking."
Not since roughly 2004. Keyword density is not a direct ranking factor; semantic match is.
"Submitting to Google guarantees indexing."
No. You can submit a URL via Search Console, but Google decides whether to crawl and index it based on quality signals.
"Search engines index my whole site automatically."
No. Large sites typically have 30-80% of URLs indexed, not 100%. The rest fall into "crawled not indexed" or "discovered not indexed." See our indexing guide.
Tools to Understand How Engines See Your Site
- RankNibbler audit — runs 30+ checks that simulate what Googlebot extracts.
- Title tag checker — confirms titles are indexable.
- Meta description checker — verifies snippet readiness.
- Heading structure checker — validates H1/H2/H3 hierarchy.
- Structured data checker — confirms schema is parseable.
- Canonical URL checker — ensures indexing routes to the right page.
- Robots directives checker — identifies noindex/nofollow issues.
- Website speed test — measures Core Web Vitals.
- Google Search Console (free) — canonical source for crawl errors, index coverage, and query performance.
- Bing Webmaster Tools (free) — analogous for Bing.
Case Study: Debugging a Deindexed Page
Say you notice a key page is no longer appearing for its primary query. A systematic debug walks through each pipeline stage:
- Serving. Does the page appear anywhere in results for site:yourdomain.com/path? If yes, it is indexed but ranking low. If no, skip down.
- Indexing. Open Search Console → URL Inspection. Is the URL indexed? What is the "Last crawl" date?
- Crawling. If "Crawled - currently not indexed," check whether content is thin or duplicative. If "Discovered - currently not indexed," check internal linking and crawl budget.
- Fetching. If Googlebot cannot reach the URL, check robots.txt, HTTP status codes, server stability, and geo-blocking.
Frequently Asked Questions
How many search engines are there?
Dozens of general-purpose engines exist globally, but only five or six account for more than 99% of search query volume: Google, Bing, Yandex, Baidu, Naver, and DuckDuckGo. Add vertical engines (YouTube, Amazon, App Store) and the number explodes into the hundreds.
What is the difference between a search engine and a browser?
A browser (Chrome, Firefox, Safari) is the application you use to view web pages. A search engine (Google, Bing) is the service you use to find pages. Chrome uses Google as its default search engine; Safari uses Google but lets you switch to Bing, Yahoo, Ecosia, or DuckDuckGo.
How often does Google crawl my site?
It depends on authority, change frequency, and crawl budget. A high-authority news site may be crawled every few minutes; a small static blog might be crawled once every few days. See Search Console's Crawl Stats report for exact numbers.
Do AI chatbots count as search engines?
Yes and no. ChatGPT Search, Perplexity, and You.com function as search engines — they retrieve, cite, and rank results. Pure LLMs without retrieval (like raw GPT-4 without browsing) are not search engines because they do not query an index of the live web.
What is the smallest search engine I should care about for SEO?
If you have a US/UK/EU audience, Google, Bing, and DuckDuckGo cover essentially all your organic surface. Add Yandex if you target Russia, Baidu for mainland China, Naver for Korea.
Can search engines see JavaScript content?
Google and Bing render JavaScript, but rendering is a secondary queue with latency measured in days, not seconds. For critical content, server-side render or pre-render.
How does a search engine decide what to show first?
It scores every candidate document with a blend of hundreds of features. The top-scoring document wins position one. "Scoring" is now primarily a learned function from machine learning models, not a handwritten formula.
Can I trust AI Overviews' citations?
Mostly, but verify. AI Overviews occasionally hallucinate quotes or attribute claims to sources that did not make them. Check the cited URL before quoting it downstream.
What is the "dark web" from a search engine perspective?
Content on Tor hidden services (.onion) is not crawled by mainstream engines. Specialised .onion engines exist (Ahmia, OnionSearch) but they are small, noisy, and mostly used for research.
Do search engines cost money to use?
Not for consumers. Google, Bing, DuckDuckGo, and most others are free at the point of use. Kagi ($10/mo) and Neeva (discontinued) are the rare subscription-based exceptions.
How do I remove a URL from Google?
Use Search Console's Removal Tool for short-term suppression (six months). For permanent removal, add noindex to the page, return a 404/410, or use robots.txt to block it (the last option requires care because blocked URLs can still appear with a "No information available" snippet).
Why does the same query give different results on different devices?
Google personalises results based on location, language, device class (mobile vs desktop), and a small amount of history. Two users on two devices can see meaningfully different SERPs.
Crawlers: A Closer Look
Most SEOs think of Googlebot as a monolithic entity, but in reality it is a fleet of specialised crawlers. Knowing which one is hitting your site helps you interpret server logs and Search Console reports.
| User agent | Purpose | Frequency |
|---|---|---|
| Googlebot Desktop | Desktop HTML discovery (secondary since 2019) | ~10% of fetches |
| Googlebot Smartphone | Mobile HTML discovery (primary since 2019) | ~80% of fetches |
| Googlebot Image | Image indexing for Google Images | Varies |
| Googlebot Video | Video indexing | Varies |
| Googlebot News | News publisher crawl | High, for news publishers |
| AdsBot-Google | Google Ads landing page quality | On demand |
| Mediapartners-Google | AdSense content matching | On demand |
| Google-InspectionTool | Search Console URL Inspection | Manual triggers |
| Google-Read-Aloud | Chrome's text-to-speech | User triggered |
| GoogleOther | Internal research and experiments | Low |
| Google-Extended | Bard/Gemini training signal (opt-out) | Varies |
Each of these user agents can be addressed in robots.txt individually. You can, for example, allow Googlebot Smartphone but disallow Google-Extended to opt out of LLM training without losing Search indexing.
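You can sanity-check how a per-agent policy resolves with Python's standard-library robots.txt parser. A minimal sketch against a hypothetical robots.txt that allows normal search crawling but opts out of LLM training:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: search crawling allowed, LLM-training crawler blocked.
robots_txt = """
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ("Googlebot", "Google-Extended"):
    allowed = parser.can_fetch(agent, "https://example.com/article")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
# Googlebot: allowed
# Google-Extended: blocked
```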
Rendering: How Search Engines Handle JavaScript
In 2026, Googlebot renders JavaScript for the vast majority of pages it crawls. The pipeline is two-pass:
- Initial HTML fetch. Googlebot requests the URL like any browser, receives raw HTML, and parses it.
- Rendering queue. If the HTML depends on JavaScript for critical content, the URL is added to a rendering queue.
- Rendered DOM index. A headless Chromium instance executes the JS, waits for network idle, and passes the rendered DOM back to the indexer.
Practical consequences: content that requires user interaction (click, scroll, tap) to appear will not be indexed. Content behind cookie consent walls may be skipped. Content that loads after a 10-15 second delay may be missed. Critical SEO content — title, meta description, H1, primary body text, canonicals, internal links — should be present in the initial HTML whenever possible.
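A quick way to audit this is to fetch the raw, unrendered HTML — what the first pass sees — and confirm your critical tags are already there. A rough sketch (the URL is a placeholder; a real audit would use a proper HTML parser rather than substring checks):

```python
import urllib.request

url = "https://example.com/your-key-page"  # placeholder URL

req = urllib.request.Request(url, headers={"User-Agent": "raw-html-check/1.0"})
with urllib.request.urlopen(req, timeout=10) as resp:
    raw_html = resp.read().decode("utf-8", errors="replace")

# If these are missing from the *unrendered* HTML, they only exist after JS runs
# and depend on the slower second-pass render to be indexed.
for marker in ("<title", "<h1", 'name="description"', 'rel="canonical"'):
    status = "present" if marker.lower() in raw_html.lower() else "MISSING in raw HTML"
    print(f"{marker:20} {status}")
```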
Server-Side Rendering (SSR) vs Client-Side Rendering (CSR)
- SSR delivers complete HTML on the first request. Googlebot sees the content immediately. Next.js, Nuxt, Remix, and Astro all support SSR by default.
- CSR delivers a minimal HTML shell that JS fills in. Googlebot sees the content only after the second-pass render. Classic React/Vue/Angular SPAs without SSR are CSR.
- Prerendering is a middle ground: a build-time snapshot of each route that Googlebot sees instead of the live app. Works for content sites with limited interactivity.
- Hybrid / incremental static regeneration (ISR) combines static build with periodic rebuilds — Next.js popularised this pattern.
Data Centres, Latency, and Global Crawl
Google's crawl originates from data centres distributed across several continents. Server logs typically show requests from Mountain View, Dublin, Frankfurt, Singapore, and Sao Paulo among others. Bing crawls from Redmond-adjacent ranges and Asian POPs. Yandex crawls primarily from Russia and some European locations. Baidu crawls mostly from within mainland China.
If you geo-restrict content or block specific IP ranges, you can accidentally block the crawler. Always whitelist the published crawler IP ranges (Google, Bing, and others publish these) rather than blocking broadly.
How Search Engines Handle Duplicate Content
The web is full of duplicates — syndicated articles, product descriptions copied across stores, printer-friendly versions, parameter variants. Search engines detect near-duplicates during indexing and collapse them into canonical clusters. Within a cluster, one URL is chosen as the "canonical representative" and the others are mostly ignored for ranking.
Signals Google uses for canonical selection include rel=canonical (strongest hint), sitemap inclusion, internal link count, HTTPS preference, cleaner URL, higher user engagement. You can influence but not force the outcome. See the duplicate content guide.
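Production systems use fingerprinting techniques such as SimHash or MinHash, but the underlying idea — compare overlapping word shingles between documents — fits in a few lines. An illustrative sketch with invented page text:

```python
def shingles(text, n=3):
    """Overlapping word n-grams ('shingles') of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Similarity between two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

page_a = "our blue widget ships free with a two year warranty included"
page_b = "our blue widget ships free with a two year warranty and returns"
page_c = "read our guide to submitting an xml sitemap to google"

sim_ab = jaccard(shingles(page_a), shingles(page_b))
sim_ac = jaccard(shingles(page_a), shingles(page_c))
print(f"A vs B: {sim_ab:.2f}  (near-duplicate, likely collapsed into one canonical)")
print(f"A vs C: {sim_ac:.2f}  (unrelated)")
```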
Search Engine Spam and Policy Enforcement
Every major engine has a spam team. At Google, it is a division called Search Quality. The SpamBrain system (2018+) uses machine learning to detect:
- Cloaking (serving different content to users vs crawlers).
- Doorway pages.
- Hidden text and links.
- Link schemes (link farms, private blog networks, link buying).
- Scraped content.
- Sneaky redirects.
- Hacked content.
- User-generated spam.
- Thin affiliate pages.
- Automatically generated low-quality content.
When SpamBrain or a human reviewer detects a violation, the site can receive a manual action visible in Search Console. Consequences range from a specific page being demoted to full site-wide deindexing. Recovery requires fixing the underlying issue and submitting a reconsideration request.
Comparative Ranking Philosophy Across Engines
| Engine | Philosophy | Notable biases |
|---|---|---|
| Google | Machine-learned blend of hundreds of signals, with heavy weighting on authority and intent match. | Favours brand signals, established domains, and content that answers the exact query. |
| Bing | Closer to classical IR with strong on-page keyword weighting and social signals. | Slight preference for exact-match domains and older content. |
| Yandex | Heavy emphasis on user engagement from its own ecosystem (Browser, Metrica, Mail). | Strong commercial-page detection; strict on duplicate thin affiliate. |
| Baidu | Prefers Chinese-hosted, Chinese-language content with ICP licence. | Favours Baidu's own properties. |
| DuckDuckGo | Meta-engine; no personalisation, no click-history feedback. | Results tend to closely mirror Bing's with privacy-oriented filtering. |
| Naver | Vertical-focused; blogs and cafes weighted heavily. | Korean-language dominance, Naver ecosystem favouritism. |
| Perplexity / LLM engines | Retrieve, synthesise, cite. Ranking is less about position and more about being quoted. | Favours clean factual content, direct claims, and citable authority. |
The Google Leak of 2024 and What It Revealed
In May 2024, roughly 2,500 pages of internal Google Content Warehouse API documentation leaked via a misconfigured GitHub repository. The leak, analysed by Rand Fishkin and Mike King, confirmed (or publicly revealed) several ranking-adjacent details:
- NavBoost — a system that uses click data from Chrome and the SERP to re-rank results.
- SiteAuthority — an internal signal that looks a lot like what the SEO community has been calling domain authority for years, despite Google having denied such a thing existed.
- Demotions — explicit signals for demoting low-quality sites, affiliate-heavy content, and sites with excessive anchor text manipulation.
- Panda-era signals still active as embedded features.
- Twiddlers — a concept for post-ranking adjustments based on freshness, diversity, or specific policies.
Whether the leaked features are actually used in production ranking is debated, but the leak provided the clearest public view of Google's internal architecture in over a decade.
How to Optimise for Each Pipeline Stage
Optimising for Crawling
- Maintain an accurate XML sitemap.
- Keep robots.txt lean and correct.
- Fix 5xx errors promptly — servers returning errors earn smaller crawl budgets.
- Eliminate redirect chains; every extra hop wastes crawl budget. Use the redirect checker, or count hops yourself with the sketch after this list.
- Link internally to deep pages so they are discoverable without relying on the sitemap.
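For the redirect-chain check mentioned above, a quick hop counter using the third-party requests library (the URL is a placeholder):

```python
import requests  # third-party: pip install requests

url = "https://example.com/old-path"  # placeholder URL

resp = requests.get(url, allow_redirects=True, timeout=10)

# resp.history holds every intermediate response in the redirect chain.
for i, hop in enumerate(resp.history, start=1):
    print(f"hop {i}: {hop.status_code} {hop.url} -> {hop.headers.get('Location')}")
print(f"final:  {resp.status_code} {resp.url}")

if len(resp.history) > 1:
    print("Redirect chain detected - point links and redirects at the final URL.")
```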
Optimising for Indexing
- Ensure every canonical URL returns 200 OK.
- Remove noindex from pages you want indexed.
- Eliminate duplicates with canonical URLs.
- Avoid thin pages — short, low-value content is frequently "crawled not indexed."
- Add structured data to help Google understand entity relationships.
Optimising for Ranking
- Write content that matches real search intent.
- Build backlinks through legitimate outreach.
- Improve E-E-A-T signals with author bios, credentials, and transparent sources.
- Optimise Core Web Vitals.
- Keep content current — staleness hurts.
Optimising for Serving (Rich Results)
- Add the right schema for your content type (Article, Product, FAQ, How-To, Recipe).
- Structure content to win featured snippets: definitions at the top, lists and tables for comparison queries.
- Optimise for AI Overview citation by writing direct, clear factual sentences.
- Craft compelling titles and descriptions that improve SERP CTR.
The Role of User Signals
Publicly, Google downplays user behaviour as a direct ranking factor. Internally, the leaked Content Warehouse docs and Yandex leak both confirm extensive use of click, dwell, and pogosticking data. A rough summary of how engines use user signals:
- Click-through rate — strong signal but heavily adjusted for position bias (a toy adjustment is sketched at the end of this section).
- Dwell time — how long before returning to the SERP. Very short dwells count against the page.
- Pogosticking — clicking result A, returning, clicking result B. Pattern suggests A was a bad match.
- Search refinement — if users refine their query after clicking, the first result failed to satisfy.
- Explicit feedback — "was this helpful?" widgets and similar.
You cannot game user signals directly. You earn them by publishing content that actually answers the query.
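To make the position-bias point concrete: one simple normalisation is to divide observed CTR by the CTR expected at that SERP position. The expected values below are invented for the example, not published figures:

```python
# Illustrative expected CTR by organic position (invented numbers, not published data).
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def adjusted_ctr(clicks, impressions, position):
    """Observed CTR relative to what that SERP position 'should' earn.
    Values above 1.0 suggest the result outperforms its position."""
    observed = clicks / impressions if impressions else 0.0
    return observed / EXPECTED_CTR.get(position, 0.03)

# A result at position 4 earning 9% CTR is outperforming; position 1 at 20% is under.
print(round(adjusted_ctr(clicks=90, impressions=1000, position=4), 2))   # ~1.29
print(round(adjusted_ctr(clicks=200, impressions=1000, position=1), 2))  # ~0.67
```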
Algorithm Updates: A Brief Timeline
| Year | Update | Focus |
|---|---|---|
| 2011 | Panda | Low-quality, thin content demoted. |
| 2012 | Penguin | Link schemes and over-optimised anchor text. |
| 2013 | Hummingbird | Semantic search and conversational queries. |
| 2015 | RankBrain | First major ML ranking model. |
| 2016 | Mobile-First Indexing announced | Smartphone crawl becomes primary (broad rollout followed from 2018). |
| 2018 | Medic | YMYL and health content scrutiny. |
| 2019 | BERT | Natural language understanding. |
| 2021 | MUM | Multimodal understanding; Page Experience update. |
| 2022 | Helpful Content Update | People-first content emphasis. |
| 2023 | SGE (beta) | Generative search experience. |
| 2024 | AI Overviews (GA) | Gemini-powered summaries replace SGE. |
| 2024 | Helpful Content Update integrated into core | Helpful content signals become a ranking factor group within the core algorithm. |
| 2025 | March 2025 Core Update | Large shifts in niche authority signals. |
Privacy-First Engines: DuckDuckGo, Brave, Kagi
A sub-ecosystem of engines optimises for privacy rather than ad revenue. Their market share is small but growing among technical users.
- DuckDuckGo combines Bing's index with its own crawler, with heavy privacy guarantees (no tracking, no cookies, no personalisation).
- Brave Search operates its own independent index with an optional "AI Answer" layer. Integrated directly into the Brave browser.
- Kagi is subscription-only ($10-25/mo). It combines results from Google, Bing, Marginalia, Teclis, and a custom index with user-configurable weights.
- Ecosia uses Bing's index but donates most of its ad revenue to reforestation.
- Mojeek and Marginalia run fully independent indexes at smaller scale, useful for niche queries.
The Future: Agent-Based Search
A trend accelerating in 2025-2026 is "agentic" search — LLM-powered agents that not only retrieve information but also take actions on behalf of users (book a flight, compare prices, file paperwork). Google, OpenAI, and Anthropic are all investing heavily. The SEO implication is that ranking in traditional SERPs will matter less, while being the source an agent trusts and cites will matter more. Practitioners will increasingly optimise for entity signals, structured data, and authoritative citability rather than keyword positioning.
Query Understanding: From Keywords to Intents
Early search engines matched queries to documents purely by keyword frequency. Modern engines translate queries into intents, entities, and probable answers. A query like "how old is the Eiffel Tower" is not parsed as four tokens; it is parsed as:
- Intent: factual question.
- Entity: Eiffel Tower (Wikidata Q243).
- Attribute: age / construction date.
- Answer type: number or date.
Google's model then retrieves candidate passages from its index, synthesises an answer, and serves it via a mix of featured snippet, Knowledge Graph, and AI Overview. The raw ten-blue-links are a fallback for when the model is less confident.
Entity Recognition and the Knowledge Graph
Google's Knowledge Graph is a massive structured database of real-world entities — people, companies, places, products, concepts — and their relationships. As of 2025 it contained over 5 billion entities and 500 billion facts. When you search for an entity, Google tries to recognise it in the query, retrieve its profile, and supplement organic results with a knowledge panel.
From an SEO perspective, being part of the Knowledge Graph is enormously valuable. Inclusion typically requires:
- A presence on Wikipedia or Wikidata.
- Consistent mentions across authoritative sources.
- Structured data on your own site (Organization, Person, Product) — see the JSON-LD sketch after this list.
- Unique, unambiguous identity (name + disambiguator).
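For the structured-data item, a minimal Organization JSON-LD block can be generated and dropped into the page head. All organisation details below are placeholders:

```python
import json

# Placeholder organisation details - swap in your own.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://example.com",
    "logo": "https://example.com/logo.png",
    "sameAs": [
        "https://en.wikipedia.org/wiki/Example_Co",
        "https://www.wikidata.org/wiki/Q000000",
        "https://www.linkedin.com/company/example-co",
    ],
}

# Embed this in the page <head> so crawlers can parse the entity unambiguously.
print('<script type="application/ld+json">')
print(json.dumps(organization, indent=2))
print("</script>")
```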
Multimodal Search
Google Lens, Circle to Search, and similar visual search features let users query with images instead of text. Bing Visual Search offers the same. Multimodal engines (Gemini, GPT-4V) can accept images, audio, and text as combined input. Optimising for multimodal search means:
- Descriptive image alt text.
- High-quality images with unique filenames.
- Image schema (ImageObject) with caption metadata.
- Audio transcripts for podcasts.
- Video chapter markers for long video content.
The Cost of Running a Search Engine
For perspective on scale: estimates from former Google engineers place the cost of operating Google Search at somewhere between $10 and $25 billion per year. This covers data centres, bandwidth, hardware, and a search-ranking team of several thousand engineers. The direct revenue from Search advertising is in the $160+ billion per year range. That roughly 6-16x ratio of revenue to cost is what funds the entire rest of Alphabet.
New entrants (Kagi, Brave, Perplexity) operate at orders of magnitude smaller scale and correspondingly smaller indexes. The economics of running an independent large-scale web index are brutal, which is why most "alternative" engines piggyback on Bing or Google's public APIs.
Opting Out of Indexing and Training
If you want to control whether your content is indexed or used for LLM training:
| Goal | Mechanism |
|---|---|
| Block search indexing | <meta name="robots" content="noindex"> |
| Block Google search specifically | <meta name="googlebot" content="noindex"> |
| Block crawling entirely | Disallow: / in robots.txt |
| Block Google LLM training | User-agent: Google-Extended + Disallow: / in robots.txt |
| Block ChatGPT / GPTBot | User-agent: GPTBot + Disallow: / in robots.txt |
| Block Anthropic ClaudeBot | User-agent: ClaudeBot + Disallow: / in robots.txt |
| Block Perplexity | User-agent: PerplexityBot + Disallow: / in robots.txt |
| Block Common Crawl | User-agent: CCBot + Disallow: / in robots.txt |
Each directive is independent. Blocking Google-Extended does not affect Google Search ranking.
Bot Verification: Is This Really Googlebot?
Many "Googlebot" requests in server logs are actually fake — spammers and scrapers masquerading as Googlebot to bypass simple bot-blocking rules. To verify a claim, do a reverse DNS lookup on the IP and then a forward lookup on the returned hostname. Genuine Googlebot requests come from *.googlebot.com or *.google.com. Anything else claiming to be Googlebot is fraudulent.
```
# Reverse lookup
host 66.249.66.1
# 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

# Forward lookup
host crawl-66-249-66-1.googlebot.com
# crawl-66-249-66-1.googlebot.com has address 66.249.66.1
```
Google publishes its crawler IP ranges as a JSON file (developers.google.com/search/apis/ipranges/googlebot.json) you can use for automated verification.
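A rough automated verification against that file, assuming it keeps its current shape (a prefixes array whose entries carry an ipv4Prefix or ipv6Prefix key):

```python
import ipaddress
import json
import urllib.request

RANGES_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"

def is_googlebot_ip(ip: str) -> bool:
    """Check an IP against Google's published Googlebot ranges.
    Assumes the JSON shape: {"prefixes": [{"ipv4Prefix": ...} or {"ipv6Prefix": ...}]}."""
    with urllib.request.urlopen(RANGES_URL, timeout=10) as resp:
        data = json.load(resp)
    addr = ipaddress.ip_address(ip)
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr and addr in ipaddress.ip_network(cidr):
            return True
    return False

print(is_googlebot_ip("66.249.66.1"))   # expected True (range from the example above)
print(is_googlebot_ip("203.0.113.50"))  # expected False (documentation-range IP)
```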
Search Engine Benchmarks: When Google Is Not Best
Google is not universally the best engine for every type of query. In specific verticals, specialised engines outperform:
- Academic / research: Google Scholar, Semantic Scholar, arXiv.
- Code: GitHub Code Search, grep.app, Sourcegraph.
- Pricing / deals: Kayak (travel), CamelCamelCamel (Amazon pricing).
- Legal: Westlaw, LexisNexis, CourtListener.
- Patents: Google Patents (yes, Google), Espacenet.
- Recipes: Allrecipes and Supercook can out-filter Google for dietary constraints.
- Forums: Google often surfaces Reddit via site: operator; native Reddit search is usually worse.
A Practical Primer on SERP-Feature Targeting
If you want to practise what this page preaches, pick a query where your site already ranks 4-15 and target a specific SERP feature. Pattern:
- Pick a query you want to win.
- Inspect the current SERP. What features appear? Which ones do you not own?
- Pick one feature you can realistically target (featured snippet, PAA, AI Overview citation).
- Rewrite the relevant content section to match the feature's trigger pattern.
- Validate with the SERP snippet generator.
- Republish and monitor Search Console for 2-4 weeks.
This is a repeatable, measurable cycle. Apply it weekly and accumulate wins.
Final Thoughts
A search engine is a pipeline: it crawls the web, indexes what it finds, ranks candidates against a query, and serves a results page. Every SEO problem reduces to a question about one of these stages. If your pages are not getting traffic, figure out whether they are being crawled, whether they are being indexed, whether they are being ranked, and whether they are being served. That mental model alone will solve a surprising fraction of the work.
The search landscape in 2026 is more fragmented than at any point in twenty years, but the fundamentals have not changed: publish high-quality, original content; make it technically clean; earn authority through editorial linking; and monitor how engines interpret your pages. Do those four things, and you will rank on whatever engine matters most to your audience.
Last updated: March 2026