AI Crawlers Are Flooding the Web, and We're on the Front Lines

Websites face a digital infestation, and not the creepy-crawly kind.

The web development team at 42, Inc. has been on high alert recently, fending off an influx of AI crawler bots that scour websites to harvest content for training large language models (LLMs). These crawlers have been hitting our client sites by the hundreds, sometimes thousands, originating from a wide array of IP addresses in what appears to be a coordinated data grab by major AI developers.

"We can tell when a new AI model is in training," said one of our senior engineers. "Suddenly, we'd see a surge in bot traffic from hundreds of IPs. It's not cool, man."

The surge in bot traffic isn't just annoying. It's resource-draining. Automated traffic made up 51% of all web traffic in 2024, according to Imperva's 2025 Bad Bot Report, surpassing human traffic for the first time in a decade. Many bots behave poorly, getting stuck in search loops on content-rich sites, especially those with powerful internal search and navigation tools. The result? Increased server load, degraded performance, and higher bandwidth bills.

But blocking all bots isn't ideal. Search engines, accessibility tools, and even AI bots serve useful functions. That's why 42, Inc. has adopted a more nuanced defense: a lightweight metadata cushion.

What's a Metadata Cushion?

A metadata cushion is a server-side strategy that gives crawlers enough information to satisfy them without rendering the whole page. Instead of building and sending the entire React app or dynamic content, the server returns basic metadata in a fast, minimal format, such as JSON-LD or simplified HTML head tags.

A metadata cushion can ease server strain while still giving bots structured data to index, log, or, where permitted, train on.
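
To make the idea concrete, here is a minimal sketch of what a cushion payload could look like as JSON-LD. The `buildMetadataCushion` function and the fields on `page` are hypothetical names chosen for illustration, and the schema.org `WebPage` type is an assumption about what a crawler would find useful, not a description of our production format.

```javascript
// Hypothetical helper: builds a small JSON-LD document for one page.
// The field names on `page` are assumptions made for this sketch.
function buildMetadataCushion(page) {
    return {
        '@context': 'https://schema.org',
        '@type': 'WebPage',
        name: page.title,
        description: page.description,
        url: page.url
    };
}

// Example usage with placeholder values.
const cushion = buildMetadataCushion({
    title: 'Home',
    description: 'Welcome to our site',
    url: 'https://example.com/'
});

console.log(JSON.stringify(cushion));
```

A payload like this is a few hundred bytes and needs no rendering, which is where the savings over a full page build would come from.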

Technical Implementation

Note: We are going to get geeky for a moment!

In NGINX, we route known crawler traffic to a lightweight JSON metadata response:

                
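                # Goes in the http context: flag requests whose User-Agent matches a known crawler.
                map $http_user_agent $is_bot {
                    default 0;
                    ~*(googlebot|bingbot|facebookexternalhit|GPTBot) 1;
                }

                server {
                    location / {
                        # Hand flagged crawlers to the named location below.
                        error_page 418 = @bot_metadata;
                        if ($is_bot) {
                            return 418;
                        }

                        proxy_pass http://localhost:3000;
                        include proxy_params;
                    }

                    location @bot_metadata {
                        # default_type sets Content-Type for the body sent by "return";
                        # add_header would add a second Content-Type header instead.
                        default_type application/json;
                        return 200 '{"title":"Home","description":"Welcome to our site"}';
                    }
                }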

In a Node/Express + React setup, we can detect bot user agents and short-circuit the response:

                
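                const path = require('path');
                const express = require('express');

                const app = express();

                // Crawler user-agent patterns; extend the list as new bots appear.
                const botAgents = [/GPTBot/i, /bingbot/i, /facebookexternalhit/i];

                // Express 4 wildcard route; Express 5 expects a named wildcard such as '/*splat'.
                app.get('*', (req, res) => {
                    // The User-Agent header may be missing, so fall back to an empty string.
                    const userAgent = req.get('user-agent') || '';
                    const isBot = botAgents.some(bot => bot.test(userAgent));

                    if (isBot) {
                        // Short-circuit: send metadata only and skip the React bundle.
                        return res.json({
                            title: 'Home',
                            description: 'Welcome to our site. Metadata only.'
                        });
                    }

                    res.sendFile(path.join(__dirname, 'build', 'index.html'));
                });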
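
Because the response now depends on the User-Agent header, any shared cache or CDN in front of the app needs to know that, or it may hand the JSON cushion to a human visitor. The sketch below pulls the decision into a small pure function and marks the result with a `Vary: User-Agent` header. The function name `chooseResponse` and its return shape are hypothetical, chosen only to illustrate the idea.

```javascript
// Same illustrative patterns as the Express handler above.
const BOT_PATTERNS = [/GPTBot/i, /bingbot/i, /facebookexternalhit/i];

// Hypothetical helper: decides what to serve for a given User-Agent.
// Returns a plain object so it can be used from any framework.
function chooseResponse(userAgent, metadata) {
    const ua = userAgent || ''; // the header may be absent
    const isBot = BOT_PATTERNS.some(pattern => pattern.test(ua));

    return {
        isBot,
        // Tell caches that the body differs by User-Agent.
        headers: { Vary: 'User-Agent' },
        // Bots get the metadata; everyone else gets the app shell (null here).
        body: isBot ? JSON.stringify(metadata) : null
    };
}

const meta = { title: 'Home', description: 'Welcome to our site. Metadata only.' };
const botResult = chooseResponse('Mozilla/5.0 (compatible; GPTBot/1.0)', meta);
const humanResult = chooseResponse('Mozilla/5.0 (Windows NT 10.0) Firefox/125.0', meta);

console.log(botResult.isBot, humanResult.isBot);
```

In an Express handler, the returned headers could be applied with `res.set(result.headers)` before sending either the JSON body or the `index.html` file.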

Where AI Comes In

To manage this at scale, 42, Inc. uses machine learning to detect and respond to suspicious bot behavior in real time. AI helps us analyze traffic patterns, fingerprint suspicious activity, and adjust filters and response strategies dynamically, which makes our defenses smarter and more efficient.
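
The models themselves are beyond the scope of this article, but a much simpler stand-in shows the kind of signal they work from. The sketch below is a plain sliding-window request counter per IP. It is not machine learning and not our actual system; the class name `RequestRateTracker`, the default 60-second window, and the default threshold of 100 requests are all assumptions made for illustration.

```javascript
// Illustrative only: flags an IP once it exceeds `maxRequests` within `windowMs`.
class RequestRateTracker {
    constructor(windowMs = 60000, maxRequests = 100) {
        this.windowMs = windowMs;
        this.maxRequests = maxRequests;
        this.hits = new Map(); // ip -> array of request timestamps (ms)
    }

    // Records one request and returns true if the IP looks suspicious.
    record(ip, now = Date.now()) {
        const cutoff = now - this.windowMs;
        // Keep only timestamps that are still inside the window.
        const recent = (this.hits.get(ip) || []).filter(t => t > cutoff);
        recent.push(now);
        this.hits.set(ip, recent);
        return recent.length > this.maxRequests;
    }
}

// Example: a 1-second window with a limit of 3 requests.
const tracker = new RequestRateTracker(1000, 3);
const flags = [0, 100, 200, 300].map(t => tracker.record('203.0.113.7', t));

console.log(flags); // [ false, false, false, true ]
```

A counter like this only catches a single noisy address. Crawls spread across hundreds of IPs, as described earlier, are the reason pattern analysis and fingerprinting are needed on top of it.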

Still, the battle is ongoing. Every mitigation tactic takes time away from core services, time we'd rather spend improving user experience and delivering real value. But on today's internet, protecting your digital front door has become just as important as what's behind it.
