Showing posts with label search. Show all posts
Showing posts with label search. Show all posts

Wednesday, January 20, 2016

Microsoft search indexing can be so aggressive that it resembles DoS traffic

As part of my consulting business I have a number of web servers I take care of. This morning, I woke up to receive a particularly crappy message related to one of those servers:

possible DoS attack

Awesome, right? Ever notice how you never get these sorts of messages between the hours of 9 AM and 5 PM, Monday through Friday?

So I tried to SSH into the target server, and was pleased to find I was able to connect. Relieved that this was likely a false alarm, I found this in the Apache logs:

40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 200 146
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 403 5
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /robots.txt HTTP/1.1" 403 5
40.77.167.20 - - [19/Jan/2016:19:43:15 -0500] "GET /css/main.css HTTP/1.1" 403 5

Take a note at the timeframe on these connections: six connections from the same IP address within 1 second, five of which were to the same file. Also note that the initial connections were successful - errors only began occurring because my Apache config blocks suspicious traffic.

You've probably guessed who this IP address belongs to if you read the headline to this article:

NetRange: 40.74.0.0 - 40.125.127.255
NetName: MSFT
Organization: Microsoft Corporation (MSFT)

At first I thought this IP might be part of Microsoft's cloud server system, Azure, or some other product that might be operated by customers. However, that seemed unlikely as this host was going after the robots.txt file and nothing else other than CSS. That is what search engine spiders do. And this IP very much looks like part of Microsoft's search infrastructure:

# host 40.77.167.20
20.167.77.40.in-addr.arpa domain name pointer msnbot-40-77-167-20.search.msn.com.
The day after these weird connections, the same Microsoft IP came back with a more normal traffic pattern:

40.77.167.20 - - [20/Jan/2016:06:53:35 -0500] "GET /robots.txt HTTP/1.1" 200 237
40.77.167.20 - - [20/Jan/2016:06:53:36 -0500] "GET /index.html HTTP/1.1" 301 245

A standard installation of mod_evasive would result in a temporary blacklist for this kindof traffic. It is unclear if this behavior was intentional on the part of Microsoft, or if more rapid requests for files can be expected. The people who make their bread and butter spreading SEO gossip seem to agree that connectivity failures & web server 50* errors can have an impact of search engine rankings. However, such reports should be taken as just that - gossip.

Both Google & Bing report errors encountered during site indexing through their Search Console and Webmaster Tools, but I wasn't able to find anything published by either Bing or Google about how such errors impact search engine placement even in vague terms. Hopefully this was a one-time error on Microsoft's part and not part of a new approach to indexing (fingers crossed).

Saturday, December 20, 2014

Windows 7 and Windows 8 Basics: Searching by File Size, Modification Date and Other File Properties

It was one of these days, not long ago, that I work up one day and realized that I had become an Old Man. Mine is the last generation that remembers a time prior to the internet. I remember using acoustic couplers. My first laptop, a Toshiba, had dual 5 1/2 inch floppy drives, but had no hard drive. I was so excited when I got my hands on that machine. It meant I could connect to networks using my acoustic coupler from a pay phone!

My ruminations on aging is at least somewhat related to the topic at hand. You see, among the memories rattling around my grey hair ensconced head are a few about searching Windows file systems for files of specific types. This sort of thing is very important, even just for every day normal computer usage.

When your computer starts running out of space, wouldn't it be nice to be able to find all of the really large files on that computer? Or perhaps you are looking for an important document you wrote - you can't remember the name of the file but you remember the week that you wrote it. Doing this in Windows XP is straight-forward, because the Windows XP search box (what Microsoft calls the "Search Companion") includes these more advanced functions, and accessing that search box is as simple as clicking the Start button and clicking Search from the resulting contextual menu. Such a search box typically looks similar to this:

Windows XP, Josh Wieder, search, dog
Ruff!
As you can see selecting size and date modification are simple in this format. However, Microsoft, in their infinite wisdom, decided to abandon this simple and straight forward menu, replacing it with a single magnifying glass icon without any options whatsoever:

Windows 7, Josh Wieder, Search bar
Searching mad stupid.
Without the simple and easy to use Search Companion, how are we supposed to look for files based on their properties instead of their name?

The answer, unfortunately for users only accustomed to graphical interfaces, is a series of command line arguments.

Here is a list of such the available search commands for Windows 7 and Windows 8, taken from the relevant Microsoft KB article:

Example search termUse this to find
System.FileName:~<"notes"
Files whose names begin with "notes." The ~< means "begins with."
System.FileName:="quarterly report"
Files named "quarterly report." The = means "matches exactly."
System.FileName:~="pro"
Files whose names contain the word "pro" or the characters pro as part of another word (such as "process" or "procedure"). The ~= means "contains."
System.Kind:<>picture
Files that aren't pictures. The <> means "is not."
System.DateModified:05/25/2010
Files that were modified on that date. You can also type "System.DateModified:2010" to find files changed at any time during that year.
System.Author:~!"herb"
Files whose authors don't have "herb" in their name. The ~! means "doesn't contain."
System.Keywords:"sunset"
Files that are tagged with the word sunset.
System.Size:<1mb
Files that are less than 1 MB in size.
System.Size:>1mb
Files that are more than 1 MB in size.

In addition to these commands, users can also use a series of Boolean command line operators to further refine searches:

OperatorExampleUse this to
AND
tropical AND island
Find files that contain both of the words "tropical" and "island" (even if those words are in different places in the file). In the case of a simple text search, this gives the same results as typing "tropical island."
NOT
tropical NOT island
Find files that contain the word "tropical," but not "island."
OR
tropical OR island
Find files that contain either of the words "tropical" or "island."

Although the commands themselves are non-intuitive, using them is straight-forward. Simply type the appropriate command into the Windows search box, either in the Start menu or in the top-right corner of a File Manager menu. Here is an example, where we have searched for all files larger than 100MB in size in the drive C:\

Windows 7, Josh Wieder, search terms
A search example in Windows 7
There are a variety of circumstances where Windows' search implementation will fail to meet a user's needs. First and foremost, the search function is resource intensive, inaccurate and slow. Compared to Linux's `grep`, `find` and `locate` commands, Windows Search is almost laughably bad, particularly when attempting to search for strings inside of files.

There are other tools available for Windows that vastly improve on the default Windows search function. My recommendation at this time is GrepWin built by Stefan Kiing, available for download at the Google Code site.

GrepWin allows users to search by simple strings, operators and terms like those we described above, providing faster more accurate responses than those available from Windows' default search. In addition to basic search functionality, GrepWin also accepts regular expressions as input. While cryptic, and with a steep initial learning curve, regular expressions are incredibly powerful and a fundamental part of modern computer programming. With regular expressions, you may find specific and complex patterns from large datasets efficiently and quickly. We will almost certainly explore regular expressions in depth with their own post (or perhaps series of posts).

Thats it for now on Windows-based searching. When we return to searching in the future, we will likely spend more time on searching databases, arrays and other data structures as well as providing more theoretical explanations for file system search.

Tuesday, November 25, 2014

How To Find Files Over a Certain Size Using Redhat/CentOS/Fedora Linux

Here is a quick tip for all of those Redhat/CentOS/Fedora users out there. Do you need to find all files over a certain size, either in a specific directory, your current directory, or in your entire computer/server?

No problem, just execute the following:

find / -type f -size +500000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

In the example above, I am looking for all files over 500MB in size (500000k, where k = kilobytes). The place where I have typed "/" in the above command indicates the path to search in. By selecting "/" I am searching in the entire filesystem; I could easily indicate a specific directory by changing my command as follows:

find /path/to/my/directory -type f -size +500000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

Alternatively, I could search in my current directory by replacing "/" with "." like so:

find . -type f -size +500000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

Easy!

NSA Leak Bust Points to State Surveillance Deal with Printing Firms

Earlier this week a young government contractor named Reality Winner was accused by police of leaking an internal NSA document to news outle...