
robots.txt isn’t control, it’s a signal

A sudden spike in traffic revealed a hard truth. Not all visitors are people, and not all bots play by the rules. This article breaks down why robots.txt isn’t real control, how modern sites are accessed by multiple systems, and what you actually need in place to manage it properly.

27 April 2026


I recently saw a 13,000% spike in traffic across two websites. One was my photography portfolio, the other a client’s online art store. 

Both showed the same pattern – Android WebView, paid social, mostly from Italy, Spain and Portugal.

Neither of us had run ads.

So the question was straightforward. If we didn’t send that traffic, who did?

There isn’t a clean answer, but the pattern isn’t unusual. It’s likely some form of paid campaign using our URLs as targets, not to promote the work, but to serve some other purpose.

That could be cheap traffic, scraping, testing ad funnels, or something less obvious. In most cases, you won’t find the source.

What it does highlight is something more important.

The web is no longer just for users

We still tend to think of websites as something built for people, but that’s no longer the whole picture.

Every site is now accessed by a mix of:

  • search engines
  • AI systems
  • aggregators
  • scrapers
  • ad platforms

All interacting with the same content, but for very different reasons. Some of that activity is useful, some is neutral, and some is clearly not in your interest.

The problem is, most sites treat all of it the same.

The default response: robots.txt

When this kind of traffic shows up, one of the first technical checks is often robots.txt, especially if you suspect your copyrighted images or artwork are being scraped to train AI models.

For clarity, robots.txt does one thing. It tells crawlers what they are allowed to access. It’s one of the few tools we have that directly addresses automated access.

Well-behaved crawlers like Googlebot and Bingbot will respect it. Everything else is optional.

That’s the main limitation. robots.txt is not enforcement, it’s just a policy file.
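
For context, this is all a robots.txt file is – a short list of stated rules. The paths below are placeholders:

```txt
# Ask every crawler to stay out of a private area (path is illustrative)
User-agent: *
Disallow: /private/

# No restrictions for Google's search crawler
User-agent: Googlebot
Allow: /
```

A crawler that chooses to ignore those lines sees exactly the same site.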

Why that matters

If all bots behaved like search engines, robots.txt would be enough.

They don’t.

In practice, you’re dealing with a mix of behaviours:

  • crawlers that follow rules
  • crawlers that partially follow rules
  • systems that ignore them entirely

Some don’t even identify themselves properly. Others mimic legitimate user agents.

And in cases like the spike we saw, traffic can arrive through ad platforms or in-app browsers, which completely bypass traditional crawling patterns.

At that point, robots.txt is irrelevant.

What robots.txt is still good for

Despite the limitations, it still has a role.

Used properly, it lets you define intent.

You can:

  • allow search engines to index your content
  • restrict sensitive or irrelevant areas
  • decide how known AI crawlers interact with your site

That last point is becoming more important.

There are now identifiable AI-related crawlers, and you can choose how to handle them. Whether you allow, restrict, or monitor them depends on your position.
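
As a sketch, a robots.txt that takes a position on AI crawlers might look like the example below. The user agent tokens (GPTBot, Google-Extended, CCBot) are real ones in use at the time of writing, but the list changes, so check each provider’s current documentation rather than treating this as complete:

```txt
# Search engines: full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training crawlers: blocked in this example
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```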

It’s not perfect, but it’s a clear starting point.

Where real control actually sits

If robots.txt is policy, then control sits elsewhere. In practice, it comes down to how your infrastructure handles traffic.

That includes:

  • rate limiting to manage request volume
  • bot filtering at the edge, often via services like Cloudflare
  • server-side rules to block suspicious patterns
  • monitoring behaviour, not just access

This is the difference that matters.

robots.txt relies on declared intent.
Infrastructure responds to actual behaviour.

If something is hitting your site at scale, ignoring your rules, or arriving through in-app WebViews, the only effective response is at that level.
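
What that looks like depends on your stack, and in practice most of it belongs in the web server or at the edge rather than in application code. Purely as an illustration of the idea, here is a minimal sketch of a server-side check: a sliding-window rate limit per IP, plus a refusal list for user agents you have already decided against. The thresholds and names are placeholders, not recommendations.

```python
import time
from collections import defaultdict, deque

# Placeholders: tune to your real traffic levels.
WINDOW_SECONDS = 60
MAX_REQUESTS = 120

# Recent request timestamps, keyed by client IP.
_recent: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str, user_agent: str) -> bool:
    """Return False for traffic this site has decided not to serve."""
    # Behavioural rule 1: user agents already identified as unwanted.
    # These names are illustrative, not a recommended blocklist.
    refused = ("ExampleScraper", "ExampleBadBot")
    if any(name.lower() in user_agent.lower() for name in refused):
        return False

    # Behavioural rule 2: per-IP request rate over a sliding window.
    now = time.time()
    window = _recent[client_ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False

    window.append(now)
    return True
```

The specifics don’t matter much. What matters is that the decision is based on observed behaviour – request rate, user agent, origin – not on anything the client declares.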

A more realistic setup

For most sites, a practical approach looks like this:

  • use robots.txt to define clear access rules for legitimate crawlers
  • explicitly allow search engines
  • make a conscious decision on AI crawlers rather than ignoring them

Alongside that:

  • enable bot protection and rate limiting
  • monitor traffic patterns, not just totals (see the sketch below)
  • investigate spikes rather than dismissing them

It’s not about blocking everything. It’s about understanding who is accessing your site and why.
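
Monitoring patterns rather than totals can be as simple as breaking an access log down by user agent. A minimal sketch, assuming a combined-format log at a path you would adjust for your own setup:

```python
from collections import Counter

def top_user_agents(log_path: str, limit: int = 10) -> list[tuple[str, int]]:
    """Count requests per user agent in a combined-format access log."""
    counts: Counter[str] = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # In the combined log format the user agent is the last quoted field.
            parts = line.rsplit('"', 2)
            if len(parts) == 3:
                counts[parts[1]] += 1
    return counts.most_common(limit)

if __name__ == "__main__":
    for agent, count in top_user_agents("/var/log/nginx/access.log"):
        print(f"{count:>8}  {agent}")
```

The same idea applies to referrers, countries or landing pages. The point is to see who is sending the traffic, not just how much of it there is.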

The shift

The main change here is conceptual. Websites used to be built primarily for users, with search engines as a secondary concern.

Now they sit in the middle of a much larger system.

They are accessed, interpreted and reused by multiple layers of automation, often without visibility or attribution. That changes the role of development.

It’s no longer just about building pages and optimising for search. It’s about managing access, intent and behaviour across a range of systems you don’t control.

Final point

robots.txt still matters, but only if you see it for what it is – a way to state your rules.

It is not a way to enforce them.

If you rely on it as a control mechanism, you’ll miss most of what’s actually happening.

And when traffic appears out of nowhere, that gap becomes very obvious.
