Microsoft search indexing can be so aggressive that it resembles DoS traffic

As part of my consulting business I take care of a number of web servers. This morning I woke up to a particularly crappy message about one of those servers:

possible DoS attack

Awesome, right? Ever notice how you never get these sorts of messages between the hours of 9 AM and 5 PM, Monday through Friday?

So I tried to SSH into the target server, and was pleased to find I was able to connect. Relieved that this was likely a false alarm, I dug into the Apache logs and found this:

40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 403 5
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 403 5
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /css/main.css HTTP/1.1" 403 5

Take note of the timeframe on these connections: six requests from the same IP address within one second, five of which were for the same file. Also note that the initial requests were successful - the 403 errors only began because my Apache config blocks suspicious traffic.
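
If you want to spot this kind of burst without eyeballing the logs, here is a minimal sketch in Python. It assumes log lines in the same Common Log Format shown above; the function names and the threshold of 5 requests per second are my own illustration, not anything Apache provides.

```python
import re
from collections import Counter

# Matches Common Log Format lines like the ones above
# (no referrer or user-agent fields).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def requests_per_second(lines):
    """Count requests per (client IP, timestamp) pair.

    Apache timestamps have one-second resolution, so identical
    timestamps mean the requests arrived within the same second.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that do not parse
            counts[(match.group("ip"), match.group("ts"))] += 1
    return counts

def find_bursts(lines, threshold=5):
    """Return {(ip, timestamp): count} for pairs at or above the threshold.

    The threshold is a hypothetical value chosen for illustration.
    """
    return {key: n for key, n in requests_per_second(lines).items() if n >= threshold}

# The six log lines from this article.
SAMPLE = [
    '40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146',
    '40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146',
    '40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146',
    '40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 403 5',
    '40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 403 5',
    '40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /css/main.css HTTP/1.1" 403 5',
]

if __name__ == "__main__":
    for (ip, ts), n in find_bursts(SAMPLE).items():
        print(f"{ip} made {n} requests at {ts}")
```

Grouping on the raw timestamp string keeps the sketch simple. A sliding window would be needed to catch bursts that straddle a second boundary.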

You've probably guessed who this IP address belongs to if you read the headline of this article:

NetRange: 40.74.0.0 - 40.125.127.255
NetName: MSFT
Organization: Microsoft Corporation (MSFT)

At first I thought this IP might be part of Microsoft's cloud server system, Azure, or some other product that might be operated by customers. However, that seemed unlikely as this host was going after the robots.txt file and nothing other than CSS. That is what search engine spiders do. And this IP very much looks like part of Microsoft's search infrastructure:

# host 40.77.167.20
20.167.77.40.in-addr.arpa domain name pointer msnbot-40-77-167-20.search.msn.com.
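
A PTR record on its own can be set by whoever controls the reverse zone, so the usual next step is a forward-confirmed reverse DNS check: resolve the hostname back to an address and make sure it matches. Below is a minimal sketch in Python. The function name and the injectable `reverse` and `forward` parameters are my own, added so the logic can be exercised without network access. The `.search.msn.com` suffix is taken from the lookup above.

```python
import socket

def is_verified_crawler(ip, suffix=".search.msn.com",
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check.

    Returns True only if the PTR name for `ip` ends in `suffix`
    AND that name resolves back to the same `ip`.
    """
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Normalise a possible trailing dot and letter case before comparing.
    if not hostname.rstrip(".").lower().endswith(suffix):
        return False

    try:
        addresses = forward(hostname)[2]  # list of IPv4 addresses
    except OSError:
        return False

    return ip in addresses
```

Calling `is_verified_crawler("40.77.167.20")` with the default resolvers performs two live DNS lookups, so the result depends on what DNS returns at the time you run it.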

The day after these weird connections, the same Microsoft IP came back with a more normal traffic pattern:

40.77.167.20 - - [20/Jan/2016:06:53:35 -0500] "GET /robots.txt HTTP/1.1" 200 237
40.77.167.20 - - [20/Jan/2016:06:53:36 -0500] "GET /index.html HTTP/1.1" 301 245

A standard installation of mod_evasive would result in a temporary blacklist for this kind of traffic. It is unclear if this behavior was intentional on the part of Microsoft, or if more rapid requests for files can be expected. The people who make their bread and butter spreading SEO gossip seem to agree that connectivity failures and web server 5xx errors can have an impact on search engine rankings. However, such reports should be taken as just that - gossip.
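
To make the mod_evasive point concrete, here is a rough Python sketch of per-page rate limiting in the same spirit. It is not the module's actual code and not my Apache config. The class name is made up, and the thresholds (2 requests per page per 1-second interval, 10-second block) are what I believe the stock DOSPageCount, DOSPageInterval and DOSBlockingPeriod settings to be, so treat them as assumptions.

```python
from collections import defaultdict

class PageRateLimiter:
    """Simplified per-(IP, page) limiter loosely modelled on mod_evasive."""

    def __init__(self, page_count=2, page_interval=1.0, blocking_period=10.0):
        self.page_count = page_count            # assumed DOSPageCount
        self.page_interval = page_interval      # assumed DOSPageInterval (seconds)
        self.blocking_period = blocking_period  # assumed DOSBlockingPeriod (seconds)
        self._hits = defaultdict(list)          # (ip, path) -> recent request times
        self._blocked_until = {}                # ip -> time the block expires

    def allow(self, ip, path, now):
        """Return True if the request is served, False if it is refused."""
        # An IP that is currently blocked is refused for every path.
        if self._blocked_until.get(ip, 0.0) > now:
            return False

        # Keep only the hits for this page that fall inside the interval.
        window = [t for t in self._hits[(ip, path)] if now - t < self.page_interval]
        window.append(now)
        self._hits[(ip, path)] = window

        # Too many hits on one page within the interval: block the IP.
        if len(window) > self.page_count:
            self._blocked_until[ip] = now + self.blocking_period
            return False
        return True

if __name__ == "__main__":
    limiter = PageRateLimiter()
    # Replay the burst from the logs: five robots.txt requests and one
    # CSS request, all within the same second (time 0.0).
    burst = ["/robots.txt"] * 5 + ["/css/main.css"]
    for path in burst:
        verdict = "served" if limiter.allow("40.77.167.20", path, now=0.0) else "refused"
        print(path, verdict)
```

Under these assumed thresholds the third identical request trips the limit, and everything else from that IP is refused until the blocking period runs out.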

Both Google and Bing report errors encountered during site indexing, through Search Console and Webmaster Tools respectively, but I wasn't able to find anything published by either company about how such errors affect search engine placement, even in vague terms. Hopefully this was a one-time error on Microsoft's part and not part of a new approach to indexing (fingers crossed).