Friday, August 26, 2016

Dell, how I hate thee (let me count the ways)

A Dell "feature" that appears to be designed to force customers to use only Dell parts reduced the speed of a set of SSDs one of my customers installed on their rack-mountable R900 server by a factor of 1000.

Before I get into this, there are some provisos. This server was using Linux kernel 2.6.32. The SSDs involved are Samsung 850 Pro SATA-style solid state disks. SSD support is not quite ready for prime time in the 2.6.32 kernel; NVMe support was first added in 3.3, TRIM wasn't available at all until 2.6.33, and a ton of other things we all take for granted like the device mapper are part of the 4.* kernel.

Consumer-level Samsung drives bring their own issues. Despite what the knuckle-heads on Reddit have to say about the topic, the Linux kernel still blacklists queued TRIM on every Samsung SSD in the 8** series. As of the latest GitHub commit for kernel 4.8 at the time of this writing, queued TRIM still doesn't work for these devices.

More importantly, the R900 isn't a new server. This is an 8-year-old box. There is a SAS backplane involved which, although having a theoretical max data transfer of 3.0 Gbps, was designed before SSDs were widely available, and introduces a bunch of contacts, wiring and complexity that is likely all screwed up and almost certainly not optimized for fat-guy Peta Belly Flops of computing power.

Initial benchmarking with fio and ioping, in addition to monitoring CPU iowait times with top and checking out iostat, had this server's SSDs performing *slower* than a similar server with 7500 RPM SATA disks in a ZFS pool.

I did a bunch of stuff to this box hoping to shake a few extra IOPS out of it. I installed Dell's dsu to get my hands on the latest drivers & firmware (under the mistaken belief that an update on either front had been released in the last decade). I had never physically seen this server; so there was a lot of lspci-ing and modprobe-ing.

Luckily, I stayed focused on the controller & backplane: a SAS 6/iR (FW 00.25.47.00.06.22.03.00) and an LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08) (FW 1.06), respectively. Eventually I stumbled upon this post that described - accurately - how the Dell was automatically stepping down the SATA port speed on non-Dell-certified disks from SATA II (3 Gbps) to SATA I (1.5 Gbps).
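For a sense of what that step-down costs on paper: SATA I and SATA II both use 8b/10b encoding, so only 8 of every 10 bits on the wire carry data. Here is a quick sketch of the arithmetic (the function name is made up for illustration, and this ignores protocol overhead):

```python
def usable_mb_per_s(line_rate_gbps):
    """Approximate payload bandwidth in MB/s for a link using 8b/10b encoding."""
    # 8b/10b: every 10 bits on the wire carry 8 bits of data.
    payload_bits_per_s = line_rate_gbps * 1e9 * 8 / 10
    # Convert bits per second to megabytes per second.
    return payload_bits_per_s / 8 / 1e6


if __name__ == "__main__":
    for name, rate in (("SATA I", 1.5), ("SATA II", 3.0)):
        print(f"{name}: {rate} Gbps on the wire, about {usable_mb_per_s(rate):.0f} MB/s of payload")
```

That is roughly 150 MB/s per port at SATA I against 300 MB/s at SATA II, so the forced step-down halves the ceiling on each disk before anything else goes wrong.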

Many moons ago, Dell's RAID cards would simply not allow users to install non-Dell disks. My experience with the R900 using BIOS version 1.2.0 would indicate that, although I am able to use non-certified disks without fatal errors, the backplane deliberately slows these disks down without reason, and in a way that is almost always transparent to the end user. I will hold off on making accusations here until I get my hands on the source code for this firmware, but the evidence up to this point is fairly damning. If anyone from Dell has an explanation for this sort of behavior, I would be happy to publish your feedback here.
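Since the slowdown is silent, one way to make it visible from the OS side is to read the link rate each phy actually negotiated. The sketch below assumes the kernel's SAS transport class exposes a negotiated_linkrate file per phy under /sys/class/sas_phy, with values such as "1.5 Gbit" or "3.0 Gbit". That path and format are an assumption here and were not checked against the 2.6.32 kernel on this box; the function names are hypothetical.

```python
from pathlib import Path


def negotiated_link_rates(sysfs_root="/sys/class/sas_phy"):
    """Map each phy name to the text of its negotiated_linkrate file (assumed layout)."""
    rates = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No SAS transport class on this machine: nothing to report.
        return rates
    for phy in sorted(root.iterdir()):
        attr = phy / "negotiated_linkrate"
        if attr.is_file():
            rates[phy.name] = attr.read_text().strip()
    return rates


def phys_below(rates, minimum_gbit=3.0):
    """Names of phys whose negotiated rate parses as '<number> Gbit' below the minimum."""
    slow = []
    for name, rate in rates.items():
        value, _, unit = rate.partition(" ")
        # Skip states such as 'Unknown' that are not a numeric rate.
        if unit == "Gbit" and float(value) < minimum_gbit:
            slow.append(name)
    return sorted(slow)


if __name__ == "__main__":
    rates = negotiated_link_rates()
    for name, rate in rates.items():
        print(name, rate)
    print("below 3.0 Gbit:", phys_below(rates))
```

If every phy with an SSD behind it reports 1.5 Gbit on hardware that should do 3.0, that is a hint the step-down described above is in play.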

There is a workaround for this issue, albeit an incredibly hack-y one. It involves the use of the (now defunct) lsiutil application (available here or direct mirror here). This application allows us to make calls directly to the backplane. In this case, the fix involves resetting the minimum link speed on the backplane from 1.5Gbps to 3.0Gbps.

Here's a step-by-step:

   - download the zip file
   - unzip the file in a directory of your choice; # unzip LSIUtil_1.62.zip -d /home/joshw/lsiutil/
   - navigate to the directory referencing your OS; # cd /home/joshw/lsiutil/Linux/
   - identify the version of the application matching your processor/OS bit type. For linux, there is a 32 bit, AMD64 and x86_64 version. I selected the x86_64 and applied an executable bit: # chmod +x lsiutil.x86_64
   - make sure you're root: # sudo su
   - run the application: # ./lsiutil.x86_64

You should see something like this:

# ./lsiutil.x86_64

LSI Logic MPT Configuration Utility, Version 1.62, January 14, 2009

1 MPT Port found

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  /proc/mpt/ioc0    LSI Logic SAS1068E B3     105      00192f00     0

Select a device:  [1-1 or 0 to quit]

Most likely you will only have one option here, so select it by pressing 1:

Select a device:  [1-1 or 0 to quit] 1

 1.  Identify firmware, BIOS, and/or FCode
 2.  Download firmware (update the FLASH)
 4.  Download/erase BIOS and/or FCode (update the FLASH)
 8.  Scan for devices
10.  Change IOC settings (interrupt coalescing)
13.  Change SAS IO Unit settings
16.  Display attached devices
20.  Diagnostics
21.  RAID actions
22.  Reset bus
23.  Reset target
42.  Display operating system names for devices
45.  Concatenate SAS firmware and NVDATA files
59.  Dump PCI config space
60.  Show non-default settings
61.  Restore default settings
66.  Show SAS discovery errors
69.  Show board manufacturing information
97.  Reset SAS link, HARD RESET
98.  Reset SAS link
99.  Reset port
 e   Enable expert mode in menus
 p   Enable paged mode
 w   Enable logging

From here select option 13

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 13

You will be immediately prompted with some configuration questions. Just press RETURN to keep the current / default values:
SATA Maximum Queue Depth:  [0 to 255, default is 8]
Device Missing Report Delay:  [0 to 2047, default is 0]
Device Missing I/O Delay:  [0 to 255, default is 0]

Eventually you will be dumped out here:

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     1.5      3.0    Enabled    Disabled  Auto
   1    Enabled     1.5      3.0    Enabled    Disabled  Auto
   2    Enabled     1.5      3.0    Enabled    Disabled  Auto
   3    Enabled     1.5      3.0    Enabled    Disabled  Auto
   4    Enabled     1.5      3.0    Enabled    Disabled  Auto
   5    Enabled     1.5      3.0    Enabled    Disabled  Auto
   6    Enabled     1.5      3.0    Enabled    Disabled  Auto
   7    Enabled     1.5      3.0    Enabled    Disabled  Auto

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit]

Select 8 to make changes to all of the available ports simultaneously

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit] 8

Again, you will be prompted for several values. You want to be very careful here as we only want to change one value - MinRate (this should be the second value you are prompted to modify). Every other value should remain default by pressing RETURN.

Link:  [0=Disabled, 1=Enabled, or RETURN to not change]
MinRate:  [0=1.5 Gbps, 1=3.0 Gbps, or RETURN to not change] 1
MaxRate:  [0=1.5 Gbps, 1=3.0 Gbps, or RETURN to not change]
Initiator:  [0=Disabled, 1=Enabled, or RETURN to not change]
Target:  [0=Disabled, 1=Enabled, or RETURN to not change]
Port configuration:  [1=Auto, 2=Narrow, 3=Wide, or RETURN to not change]

Once you've finished you will be dumped back to the port menu:

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     3.0      3.0    Enabled    Disabled  Auto
   1    Enabled     3.0      3.0    Enabled    Disabled  Auto
   2    Enabled     3.0      3.0    Enabled    Disabled  Auto
   3    Enabled     3.0      3.0    Enabled    Disabled  Auto
   4    Enabled     3.0      3.0    Enabled    Disabled  Auto
   5    Enabled     3.0      3.0    Enabled    Disabled  Auto
   6    Enabled     3.0      3.0    Enabled    Disabled  Auto
   7    Enabled     3.0      3.0    Enabled    Disabled  Auto

Press RETURN from here to save your changes.

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit]

You'll be prompted again for some other values; again keep the defaults or current values by pressing RETURN:

Persistence:  [0=Disabled, 1=Enabled, default is 1]
Physical mapping:  [0=None, 1=DirectAttach, 2=EnclosureSlot, default is 2]
Number of Target IDs to reserve:  [0 to 32, default is 8]

This will take you back to the main menu. Select 0 from here to save & quit:

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 0

... Which takes you back to the device menu. Hit 0 again and you are finally done:

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  /proc/mpt/ioc0    LSI Logic SAS1068E B3     105      00192f00     0

Select a device:  [1-1 or 0 to quit] 0
root at someServer in /home/joshw/Linux
#
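If you want to double-check the result without squinting at the table, the phy listing lsiutil prints is easy to parse. This is a minimal sketch that assumes the seven-column layout shown above; the function names are hypothetical and the two sample rows are copied from the before and after tables in this post.

```python
def parse_phy_table(text):
    """Parse lsiutil's phy table into a list of dicts, skipping the header and prompts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have seven columns and start with the phy number.
        if len(fields) == 7 and fields[0].isdigit():
            rows.append({
                "phy": int(fields[0]),
                "link": fields[1],
                "min_rate": float(fields[2]),
                "max_rate": float(fields[3]),
            })
    return rows


def phys_with_low_min_rate(rows, wanted=3.0):
    """Phy numbers whose MinRate is still below the wanted rate (in Gbps)."""
    return [row["phy"] for row in rows if row["min_rate"] < wanted]


SAMPLE = """\
PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     1.5      3.0    Enabled    Disabled  Auto
   1    Enabled     3.0      3.0    Enabled    Disabled  Auto
"""

if __name__ == "__main__":
    # Phy 0 is a 'before' row and phy 1 an 'after' row, so only phy 0 is reported.
    print(phys_with_low_min_rate(parse_phy_table(SAMPLE)))
```

Paste the real table from option 13 in place of SAMPLE; an empty list means every phy has its MinRate at 3.0.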


Here are some real-world benchmarks using fio showing before & after metrics.


BEFORE

1X 4GB FILE RANDOM READ/WRITE

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [21804KB/7072KB/0KB /s] [5451/1768/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=490: Fri Aug 26 13:37:42 2016
  read : io=3071.7MB, bw=20940KB/s, iops=5235, runt=150207msec
  write: io=1024.4MB, bw=6983.2KB/s, iops=1745, runt=150207msec
  cpu          : usr=1.79%, sys=11.75%, ctx=786417, majf=0, minf=24
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=786347/w=262229/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64


SEVERAL SMALLER FILES / RANDOM WRITE ONLY

fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=64M --numjobs=32 --runtime=60 --group_reporting --iodepth=16
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.13-83-g747b
Starting 32 processes
Jobs: 28 (f=28): [_(2),w(16),_(1),w(7),E(1),w(5)] [89.8% done] [0KB/311.7MB/0KB /s] [0/79.8K/0 iops] [eta 00m:05s]
randwrite: (groupid=0, jobs=32): err= 0: pid=4447: Fri Aug 26 13:05:05 2016
  write: io=2048.0MB, bw=47437KB/s, iops=11859, runt= 44209msec
    slat (usec): min=19, max=13715K, avg=2573.62, stdev=166998.51
    clat (usec): min=5, max=13728K, avg=38699.79, stdev=646546.36
     lat (usec): min=26, max=13729K, avg=41273.42, stdev=667725.13
    clat percentiles (usec):
     |  1.00th=[  454],  5.00th=[  540], 10.00th=[  580], 20.00th=[  636],
     | 30.00th=[  692], 40.00th=[  756], 50.00th=[  868], 60.00th=[ 1160],
     | 70.00th=[ 5536], 80.00th=[11328], 90.00th=[17536], 95.00th=[24704],
     | 99.00th=[41216], 99.50th=[52480], 99.90th=[12779520], 99.95th=[13697024],
     | 99.99th=[13697024]
    lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.02%
    lat (usec) : 500=2.57%, 750=36.65%, 1000=17.06%
    lat (msec) : 2=8.94%, 4=3.35%, 10=8.24%, 20=15.22%, 50=7.37%
    lat (msec) : 100=0.29%, 250=0.01%, >=2000=0.26%
  cpu          : usr=0.13%, sys=2.11%, ctx=67366, majf=0, minf=982
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=524288/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16


AFTER

1X 4GB FILE RANDOM READ/WRITE

test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m] [100.0% done] [74324K/24284K/0K /s] [18.6K/6071 /0  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=59026: Fri Aug 26 19:10:52 2016
  read : io=3070.6MB, bw=74180KB/s, iops=18545 , runt= 42386msec
  write: io=1025.5MB, bw=24775KB/s, iops=6193 , runt= 42386msec
  cpu          : usr=8.79%, sys=55.13%, ctx=796212, majf=0, minf=20
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=786053/w=262523/d=0, short=r=0/w=0/d=0



SEVERAL SMALLER FILES / RANDOM WRITE ONLY
Jobs: 32 (f=32)
randwrite: (groupid=0, jobs=32): err= 0: pid=36168: Fri Aug 26 17:09:26 2016
  write: io=2048.0MB, bw=1168.1MB/s, iops=299251 , runt=  1752msec
    slat (usec): min=7 , max=21074 , avg=91.01, stdev=468.57
    clat (usec): min=7 , max=21729 , avg=697.25, stdev=1302.17
     lat (usec): min=15 , max=21799 , avg=789.95, stdev=1385.85
    clat percentiles (usec):
     |  1.00th=[  118],  5.00th=[  390], 10.00th=[  482], 20.00th=[  506],
     | 30.00th=[  524], 40.00th=[  540], 50.00th=[  556], 60.00th=[  580],
     | 70.00th=[  588], 80.00th=[  604], 90.00th=[  620], 95.00th=[  636],
     | 99.00th=[10688], 99.50th=[10688], 99.90th=[14656], 99.95th=[20608],
     | 99.99th=[20864]
    bw (KB/s)  : min=24175, max=50576, per=3.10%, avg=37155.51, stdev=7603.34
    lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.27%, 250=2.72%
    lat (usec) : 500=13.04%, 750=82.28%, 1000=0.11%
    lat (msec) : 2=0.04%, 4=0.05%, 10=0.20%, 20=1.20%, 50=0.07%
  cpu          : usr=7.61%, sys=70.17%, ctx=2186, majf=0, minf=913
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=524288/d=0, short=r=0/w=0/d=0
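To put the before and after numbers side by side, here is the arithmetic on the IOPS figures quoted above. The values are copied from the fio output in this post; the names are just for illustration. Note the runs used different fio versions and the random-write job ran with --direct=0, so treat the ratios as indicative rather than exact.

```python
# (before, after) IOPS taken from the fio summaries above
RESULTS = {
    "4GB randrw, reads": (5235, 18545),
    "4GB randrw, writes": (1745, 6193),
    "32 x 64MB randwrite": (11859, 299251),
}


def speedup(before, after):
    """How many times faster the 'after' run was than the 'before' run."""
    return after / before


if __name__ == "__main__":
    for label, (before, after) in RESULTS.items():
        print(f"{label}: {before} -> {after} IOPS ({speedup(before, after):.1f}x)")
```

The mixed read/write test comes out around three and a half times faster, and the small-file random write test around twenty-five times faster.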

Monday, August 15, 2016

Building a new gaming PC (with some digs about NZXT)

Never let it be said that I do not support the aspirations of today's young people. My contribution to the next generation was helping a local teen build out a high powered gaming PC. It was my first time installing a closed-loop liquid cooling system (the Kraken x61). 

[Image: Building a gaming rig. Notice how decades of IT work have resulted in a Quasimodo hunch.]

The highlights of the PC included the following:

    - Intel i7 6700K CPU
    - NZXT Kraken x61 liquid cooling system
    - Nvidia GeForce GTX 1080 graphics card
    - ASUS Z170-E motherboard
    - NZXT S340 case
    - Corsair Vengeance LPX DDR4 RAM
    - Samsung EVO 850 SSD
    - EVGA Fully Modular GQ 650W power supply

During build-out I encountered two issues that weren't the result of my own fumbling, shaky hands. One of these issues I think is forgivable and the other is not.

The ASUS Z170-E motherboard and its associated BIOS are specifically designed with gaming and overclocking in mind. I'll save the overclocking for another time; what I had problems with is the BIOS' "QFan" monitoring system, which was unable to recognize the Kraken x61 CPU fan. Both the BIOS and motherboard appear to have more of a focus on open loop cooling systems - the Z170-E has a dedicated on-board four port connector labeled "WPUMP" for water pumps, and the QFan monitoring within the BIOS treats CPU fans and pumps as separate entities.

Booting with default BIOS settings continuously failed with an error message complaining of an invalid RPM lower limit assigned to the CPU Fan, and that I should resolve this by either disabling that lower limit within the BIOS or confirming that the CPU Fan was attached to the on-board CPU Fan Header. Of course, neither of these resolved the problem. I tried a ton of alternatives - plugging the CPU fan into the WPUMP header, changing the fan type from "auto" to "PWM", etc - without any success. Eventually I was forced to resort to Google, where I found a post on a message board suggesting to completely disable QFan's CPU Fan monitoring functionality (you can do this by pressing F9 to go to the Advanced Setup menu, then go to Monitor, scroll down to CPU Fan monitoring and press space to disable). Doing this resolved the issue.

Once I had Windows installed and I had installed NZXT's CAM application (through which I was able to manage and monitor CPU Fan speed and other metrics without any problems), I decided to update the BIOS in one more attempt to resolve this whole QFan business. I can't remember the BIOS versions involved offhand, other than that it was a major release - something similar to a leap from 0230 to 1205. I was pleasantly surprised to see that immediately after upgrading the BIOS the x61 CPU fan was recognized. With this confirmed, I powered off the system one last time to do some last-minute cable management (I'll post a pic at some point). However, when I rebooted a second time, the error reappeared even though the BIOS upgrade was successful - strangely, the BIOS screen became heavily pixelated. At that point I gave up and disabled QFan again, which once again got me through POST without issues (the pixelation disappeared).

This obstacle didn't bother me; I'm not sure whether the issue is with the Kraken or with the ASUS BIOS, but the fix was simple enough (even if it took me a while to figure out). The second obstacle I ran into, also related to an NZXT part, was more frustrating.

We went with the Kraken x61 fan for two reasons. First - because its claimed stats for air circulation along with its claimed decibel rating were among the best in its price range. And second - the case preferred by my gamer customer was an NZXT S340, so I rubbed both of my brain cells together and reasoned that NZXT coolers are bound to fit more easily into NZXT cases than other comparable coolers. Of course, I didn't reach this conclusion on my own. NZXT's documentation for their mid-size S340 case clearly says "Full 280mm radiator support for the latest Kraken cooler".

So let me be the first to say that, no, a 280mm radiator will not fit within the NZXT S340. There is, in fact, 280mm of fan space. You can easily install two separate 140mm fans. And you can also install a 280mm radiator in addition to those fans (after physically modifying the case). What will not fit as the case is designed are the two thick rubber tubes that are attached to every cooling radiator on the market. Compounding the bullsh*t nature of the claim is that NZXT only manufactures closed-loop coolers, and the x61 is the only 280mm cooler that NZXT makes.

I did get it to work (that's why I make the big bucks), but the installation should have been much easier. There are conceivably a few different ways to do this, but here is what I did:

     - I removed the front panel and used a pair of wire cutters to remove the series of small, completely pointless plastic tabs inside the panel that prevent the case from closing with the Kraken radiator inside.

     - There are two metal holes in the front of the S340 case where the fans are designed to be mounted. I threaded the CPU cooler and tubes through the top-most hole. This, in turn, makes it impossible to mount the fans where they are designed.

     - I then proceeded to install the top fan into the top hole, leaving enough space for the tubes. This means that only two screws can be used to secure the fans instead of 8. These two screws will connect the bottom of the top fan where the top of the bottom fan was designed to be screwed in.

     - This leaves *just* enough clearance for the bottom fan to be squeezed into the remaining space. To avoid shaking, I installed a pair of adhesive bumper strips to the bottom of this fan. Instead of using screws, a series of zip ties can be used to make sure the fans stay in place. This, in addition to gravity and pressure, will keep the fans and radiator in place.

This is very much a hack. There are other options for getting this to work, but the list is narrowed by the very small amount of clearance between the front of the case and the radiator itself.

Is it possible that I overlooked something significant in the install of the cooler with this case? Absolutely. If that is in fact the case, and I made a mistake, I'm still going to blame NZXT - because the case provided no documentation for how to install a 280mm cooler, and the documentation for the cooler only included instructions for larger cases (where the cooler is installed on top).

Anyway, I hope this post helps save readers some time and prevents a few headaches.

Sunday, August 14, 2016

Pandora account compromise warning message

Here is a copy of the email I was sent by Pandora to inform me that my account was compromised kind of but not really and it was totally not their fault.

Pandora account compromise confirmation

This is somewhat old news (I received this email July 6th) but the more copies of this online the better, IMO.

There are a number of things about this email that irritate me. First of all, the email is so incredibly vague that I have absolutely no idea what happened. Someone, somewhere posted my Pandora username (email address?) on the internet along with, presumably, one of the bazillion passwords associated with it. Who posted this information? Why? Where was it taken from? Was it stolen from one of Pandora's infrastructure providers?

If what Pandora implies in the email is true - that the compromise is completely unrelated to Pandora in any way - why are they sending me this email? Does Pandora scour the internet for the email addresses and account names of its many users? If Pandora had no responsibility for this breach and they sent me this message in order to be proactive to protect me - which is great - then why couldn't they be more forthcoming with detailed information? I get that many of Pandora's users are going to be non-technical, but you can include a link to a website with a comprehensive explanation of what happened or simply format the email to begin with a "tl;dr" version, followed by an exhaustive version for nerds.

There are no hard and fast rules for dealing with a compromise, but Pandora's message left me with many more questions than answers.

Thursday, August 4, 2016

Stay classy, Microsoft

Someone more cynical than myself might think that Microsoft's sudden 66% decrease of OneDrive storage space is a bait & switch - give away the space for free until users become dependent, then take it away and threaten to delete it, forcing those who have become accustomed to the free service to pony up and pay.


Sunday, July 31, 2016

Media, "Experts", too quick to assign responsibility for DNC hacks

I'd like to tell you a story. It's a story that doesn't particularly make me look very good. It was at a point in my career where I still had a lot to learn, and like many young people I thought I was smarter than I was. But it's a true story and there is an important point to it, so I'm telling it here even at the risk of looking a bit like a schmuck.

To tell the story, we have to go back in time. The year was 2006. There were still movies in the theaters that didn't have a single comic book character in them. George W. Bush was still best known for destroying the Middle East and not for his adorable stick-figure self-portraits. No one that worked outside of telecommunications or that didn't wallpaper their house in aluminum foil believed that the NSA was wiretapping everyone and everything. And I had just received a promotion.

I was working within the primary data center of an internet service provider. The company I was working for had a tiered engineering structure and I had just gone from Tier 1 to Tier 2. I would be making more money and accepting more responsibility in return.

A big part of that responsibility was investigating and resolving abuse complaints received by the ISP. Whether a company hosts servers, websites, emails or provides commercial internet service (this company provided all of the above), occasionally someone will do something on your network they aren't supposed to. Sometimes when someone does something naughty on your network, someone from another network notices. Maybe someone downloaded copyrighted material with P2P software and was caught: the copyright holder would send in a DMCA request. Maybe someone's website has been compromised and the hacker has started scanning the entire internet for a specific exploit; the admin of another network notices and sends an email begging to make the scanning stop. Or maybe someone has defrauded the company by using a stolen credit card and fake company details to sign up for a dedicated server, which in turn is used to send spam - one of the many IP reputation services sends over an automated email with examples of the messages. It had become part of my job to read these messages, investigate them where needed and determine how to handle them.

I was really excited about this promotion. When I was younger I had read books like The Cuckoo's Egg; now that was going to be my life. But there was a problem: at this point I knew quite a bit about web servers, but not so much about email servers. I knew even less about the even-at-the-time out-of-date and incredibly-proprietary custom qmail cluster that provided an enormous chunk of this company's email. So I started reading.

I read every RFC that referenced the SMTP protocol. Then I read how no one pays any attention to that shit. I read all about qmail. I learned how to read email headers. I learned how to tell when headers were forged and some of the tricks spammers used. I handled my first few dozen cases well and closed them quickly. 

But there was a problem. The cases I came across lacked drama. It wasn't like The Cuckoo's Egg. Although in a few cases I might have been able to find out exactly who was responsible for hacking a server or setting up an illegal spam service, there was nothing I could do with that information. Even in the rare circumstance where the person was actually in the United States, what was I going to do? Call 9-1-1? Call the State Attorney's Office? Call the FBI or the Secret Service? Despite what you might read in the funny papers, law enforcement is not equipped to investigate or prosecute the vast majority of "cybercrime" cases. Victims have no one to call; local, state and Federal police don't want to be involved unless there is a political or regulatory angle; and the most simple hacking case is almost always a mess of jurisdictional SNAFUs. You think Barney Fife knows how to get a warrant for those Ukrainian VPN logs? (He doesn't.) The fact is, when you read about a criminal computer crime investigation, you are essentially viewing a photograph of Big Foot.

But I desperately wanted to be a White Hat Cyber Cop. I wanted to take down a Cyber Porn ring or a bunch of Russian mobsters (Russian Business Network was my Moby Dick). But that just wasn't my job. My job was to help fix whatever had been broken, to make sure that my customers were able to safely resume doing business as normal, and to maybe make some recommendations to make the next hack a little harder to pull off without making everyone's life miserable.

One day I came across evidence that two servers owned by the same customer had been the source of a substantial amount of malicious network traffic. Somehow (this was a big network) this had been missed up to this point. It had been going on for months. These servers had been used to break into other servers on other networks; VPN tunnels would then be established and spam would be sent through the tunnels. Most of the time it looked like normal SSL traffic.

The more I investigated the situation the more I became convinced this customer was not the victim of these attacks, but was responsible for the attacks. There was no smoking gun, but in my mind everything pointed to the customer being the Bad Guy. I spoke to the technician who built the pair of servers for the customer, and the tech remembered the customer had a series of very specific, unusual requests for how the disks were supposed to be partitioned and for how the kernel was to be configured, similar to how I had seen customers set up a server that could be immediately wiped of any incriminating evidence. I checked out the websites hosted on the servers. The main website - I will never forget this - was an incredibly bare-bones CMS selling decorative rocks. Geodes, crystals, that sort of thing. That might not be so weird for someone with a $2 a month webhosting plan, but this guy had multiple dedicated servers; most of the customers getting servers were insurance companies, universities, doctors' offices, military contractors. And this guy. Selling rocks.

I sent the customer several warnings about the hacking; I gave him my best estimation of how he could lock down his server and told him he could hire us to secure it for him. The responses were spotty, and the hacking continued. Eventually, I made the case to management to cancel this customer's service. I was able to get them to agree to my assessment and the customer's account was canceled. 

It was almost immediately after that when I realized that I had completely misread the situation.

Sophisticated spammers know how to plan for having their service canceled. It's part of doing business for them. When they sign up for a 1 year contract they know they are only getting a few months of service out of it. Spammers have always been at the forefront of complex unattended installation, continuous data recovery, imaging and virtualization because they have to turn servers up fast and whenever the banhammer comes down they need to already be activating service at another provider.

When you cancel a spammer's server, they might send an email in asking why they can't reach their host, and when you tell them they've been spamming they will never contact you again. They're prepared, so there is no point in further discussion.

But the customer with the rock website contacted us, and when we told him he had been spamming he was completely devastated. He sent multiple emails. He called everyone at my company he could. It was clear he had no backups, no plan B. The servers were his livelihood. He begged us to reactivate them, at least long enough to make a backup.

I knew I had made a mistake. I was able to work out a compromise in which we built out a new server to replace his two older servers and helped him transfer his data over safely. The story had a happy ending; the customer got a reduced monthly rate, my company got to reduce the power usage in the data center and keep its profit margin the same, and we stopped the hacking. But the happy ending isn't what's important here.

What's important is that I was wrong. When it counted, I was paying more attention to what I wanted to find than I was to what I could find. I made intuitive leaps based on reasoning that didn't support those leaps. I wanted to be Clifford Stoll. I wanted to impress my boss. I wanted to Get the Bad Guys. Perhaps more important than any of these things, I wanted to have The Answer. More compelling than my fantasies of being a Cyber Cop was my fear of being incompetent. I thought that being competent meant always having the right solution.

I could have done my job more effectively by taking more time to review the evidence, and spending less time trying to "connect" a handful of dots that didn't lead anywhere meaningful. Although the story had a happy ending, it could just as easily have had a terrible ending. What if the downtime I caused that customer destroyed his business? 

Over the years I have taken this experience to heart. I've become very reluctant to use intuitive leaps to justify troubleshooting or infosec determinations. Computing provides us with a rare opportunity to work in a forum in which objective decision making is possible. There are right and wrong answers in computing; but there are also situations in which we don't have enough data to determine the difference between them. It's become easier for me to point out when there isn't enough information to resolve a problem (owning my own business has had no small part in this).

Alright, so that's the story. What on earth does all of this have to do with the DNC hacks?

Over the last week or so I've begun getting my hands on and reviewing the emails and attachments from the Democratic National Committee that have been leaked to the public by a shadowy figure(s) named Guccifer 2.0. This hack became international news beginning last month when the controversial "cyberwarfare" company Crowdstrike announced that the DNC had been hacked, and shortly afterward documents from the DNC began being leaked to a variety of different news outlets, from the Smoking Gun to Wikileaks.

From the very beginning of the DNC hack's injection into the news cycle, the blame for the incident has been squarely laid at the feet of Russian intelligence services. The Russian connection was established by Crowdstrike, who had been asked by the DNC to investigate a hack before the leaks began. Crowdstrike CTO Dmitri Alperovitch published a public report of the findings of their investigation, apparently at the behest of the DNC, in which samples of malware were provided that had links to other attacks that had already been attributed to Russian intelligence, like the compromise of the German Bundestag's network discovered earlier this year.

The attribution to Russian intelligence has gained steam over the last few weeks until we reached the point we are at now - where news outlets are reporting the Russian intelligence attribution as fact. It is primarily this that I take issue with. Please note that it may very well be the case that Russian intelligence is behind all this. My concern is that there is not nearly enough evidence to declare that attribution as fact.

Crowdstrike's report does not provide the required evidence to establish the attribution. Although the report provides a malware sample and a list of IP addresses associated with prior Russian intelligence-attributed hacks that Crowdstrike claims to have recovered through their investigation, these samples are provided without any form of context and in a format that makes it impossible for other researchers to attempt to replicate their findings. There is no explanation of how these samples were acquired. This is a bit like if your doctor told you that you have lung cancer, and as evidence offers you a picture of a cancer cell that's been cut out of a medical journal instead of, say, an X-Ray of your chest. The Crowdstrike report is an explanation of Crowdstrike's findings. It is not proof of Crowdstrike's findings.

There are a number of reasons why Crowdstrike would have opted to write the report in a way that cannot be objectively verified or peer reviewed. The first and foremost reason is that the DNC almost certainly asked them not to provide any information about their network. Another possibility (that is less defensible but I hear repeatedly) is that Crowdstrike would not want to reveal their "sources and methods".

And, to be fair, Crowdstrike provided their findings to three other companies - Fidelis, Mandiant and ThreatConnect - all of whom have apparently confirmed at least some of Crowdstrike's findings.

So I am willing to overlook the fact that Crowdstrike CEO George Kurtz has a long-standing history of making inflammatory accusations that are both demonstrably false and troublingly indicative of someone with little to no understanding of infosec. I am willing to overlook the fact that Crowdstrike's claim to fame was not for its skill in solving complex hacking investigations but for offering so-called "hack-back" retaliation services - a business opportunity that Crowdstrike was able to capture because their methodology was so ethically and legally questionable that no one else in the infosec community would have anything to do with it.

I am even willing to overlook the fact that Crowdstrike has corporate partnerships with two out of the three "independent" companies that confirmed their findings.




Let's take for granted that Crowdstrike's report is 100% accurate and Russian intelligence services did, in fact, compromise DNC systems.

Even if we take that for granted, it still doesn't mean that the DNC email leaks can be objectively attributed to Russian intelligence. 

Those who have read the Crowdstrike (or Fidelis) reports may notice that there is a lack of any mention of the DNC's email servers or evidence of large-scale file retrieval. It's quite likely that these details were left out as part of the concerns I listed already - that the DNC hopes to profit from security-through-obscurity and prevent even basic information about their network from going public. Reporters eager to demonstrate the Russian connection have relied primarily on the @pwnallthethings Twitter feed, maintained by Matt Tait (who, apropos of nothing, claims to have been "an information security specialist for GCHQ").

Tait's Twitter feed has been used to bridge the gap between the Crowdstrike report and the DNC documents leaks by Guccifer 2.0. Tait's primary contribution was discovering that a number of the documents released by Guccifer 2.0 had been modified, and that the individual who made these changes was using a version of Windows with the Russian Language pack enabled. When reporters and bloggers say that "metadata" within the Guccifer 2.0 documents proves a Russian intelligence connection, this is what they are talking about.
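
For anyone wondering what this kind of "metadata" looks like in practice, here is a minimal sketch in Python. It assumes an Office Open XML (.docx) file, which stores author and language fields as plain XML in `docProps/core.xml`; the leaked documents may well have used other formats. The helper names and the sample values ("Example User", "ru-RU") are made up for illustration and are not taken from any leaked file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(docx_bytes):
    """Return a few author-related fields from an OOXML file's core properties."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))

    def text(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "creator": text("dc:creator"),
        "last_modified_by": text("cp:lastModifiedBy"),
        "language": text("dc:language"),
    }


def make_sample_docx(last_modified_by, language):
    """Build a tiny stand-in package with hypothetical values (not a real leak)."""
    core = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}">'
        "<cp:lastModifiedBy>{who}</cp:lastModifiedBy>"
        "<dc:language>{lang}</dc:language>"
        "</cp:coreProperties>"
    ).format(cp=NS["cp"], dc=NS["dc"], who=last_modified_by, lang=language)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
    return buffer.getvalue()


sample = make_sample_docx("Example User", "ru-RU")
print(read_core_properties(sample))
```

Note that `make_sample_docx` writes whatever name and language it is handed. These fields record the settings of the software that last saved the file, and anyone can set them to anything.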

In addition to this finding, journalists relied on retweets from Tait's Twitter account for confirmation of other findings, such as the Bundestag link, as illustrated here:

As I was reading through Tait's tweets and his subsequent blog guest posts, I saw myself 10 years ago, with the rock reseller. The DNC hacks significantly increased Tait's cachet on social media, as can be seen here (the hack became public June 14th).

@pwnallthethings follower growth for July 2016
Just to be clear: I'm not alleging some sort of a conspiracy. I didn't accuse the rock seller of being a spammer because I hated him and wanted to get him. I went after him because it was a better story than the truth. It was more interesting than the truth. And there was evidence that confirmed my story, just as there is evidence pointing toward Russian Intelligence being behind the DNC leaks. It's just not enough evidence for us to claim it as a fact (yet).

Tait rejects the claim that his findings are influenced by bias:
Seems reasonable. But the trouble is that everyone is biased. I'm biased. You're biased. If you are human, and you have a subjective point of view of consciousness, you are biased. The way to handle this is not to deny it, but to account for it. I don't think Tait or the journalists who have used his findings as definitive proof that "Russians did it" have a bone to pick with Russia. It's just a damn good story. Who wouldn't want to be part of a spy novel?

Also, I use Tait here because the media has decided to rely on his findings so consistently, but he is not alone in transforming tenuous circumstantial findings into Objective Truth. Some of my personal favorites are:

   - Vice Magazine brought in linguists (I am very much avoiding the use of a hackneyed but still-amusing pun here) to analyze the transcript of an interview between a Vice reporter and Guccifer 2.0. Even the cherry-picked quotes provided by Vice made it clear that nothing could be proved from these transcripts other than that Guccifer 2.0 likely used Google Translate, but the article has been used as further "proof" that Guccifer 2.0 is Russian and not Romanian.

   - The version of MS Office used to modify leaked files appears to be cracked. Cracked versions of Office are "popular among Russians and Romanians". Because no one anywhere else in the world pirates Microsoft software (certainly I don't - stop looking at my torrents).

This is just silly, but it's taken as gospel by a media that is both hungry to spark a Cyber War and whose reporters frequently have the technical acumen of my 94-year-old grandmother.

So before we wrap this post up let's quickly review the fallacies that are used to confirm the Russian Connection:


THE RUSSIANS HACKED THE DNC, SO THE DNC LEAKS CAME FROM THE RUSSIANS

This is the big one. As I said earlier, I am taking for granted that Crowdstrike's report is God's Own Truth, and that a pair of separate Russian intelligence services hacked the DNC and had access to the DNC's network for up to a year.

Even if we accept that Russian Intelligence hacked the DNC, it does not mean that Russian Intelligence leaked the documents. Let's consider some scenarios.

The number 1 reason why networks and servers are compromised is that those networks / servers are vulnerable to compromise. That's such an obvious statement it comes across as a tautology. But it's not, and there are important consequences of this obvious statement. I am regularly called in to help companies that have discovered a breach in their IT infrastructure. Something that often happens is I find evidence of multiple compromises; either the victim is using multiple vulnerable software packages, or multiple parties have taken advantage of the same exploit, or the network was compromised a long time ago by a clever hacker who was able to maintain a presence on the network until some much-less-competent hacker came along and defaced a website or broke something.

One of the most compelling alternate explanations relies on a similar chain of events happening at the DNC. Russian intelligence had compromised the DNC for a long time using the sophisticated techniques described by CrowdStrike. The Russians stayed present in the network for a year in order to accomplish what intelligence services typically want to accomplish - compiling as much information as possible. Then, some knucklehead(s) named Guccifer 2.0 comes along and compromises an email server with the goal of accomplishing some hare-brained political goals known only to him/them. Guccifer 2.0, being a moron, sets off the bells and whistles that cause the DNC to contact CrowdStrike, who in turn discover the Russian intelligence presence.

There are other options. Remember that guy named Edward Snowden? Remember how he worked for a US intelligence agency? Remember how he leaked a bunch of documents to the media? Remember this other person Chelsea Manning? Remember how Chelsea released all of those cables that included detailed intelligence analyses of foreign countries? Remember how those documents had huge political implications in those countries, like maybe sparking the Arab Spring? The point is that leaks within intelligence services happen that aren't necessarily planned by that intelligence service. Those leaks can have devastating impacts on the elections of foreign countries. Here, Guccifer 2.0 is either a Russian intelligence employee or a hacker whose true target was Russian intelligence. There are a few options within this option - Guccifer 2.0 as working for another nation hoping to influence the US election and increase US/Russian tensions, or Guccifer 2.0 as a Russian intelligence employee who has for whatever reason a *huuuuuuuuuge* (get it?) man-crush on Trump. Some of these options are crazy. But they're no more crazy than the explanations of the Putin-Trump Axis of Evil floating through the media.


EVERYONE WHO SPEAKS RUSSIAN WORKS FOR THE GRU/FSB

It sounds silly when it's put into words, doesn't it? But this is what the "metadata" and "language analysis" come down to. Guccifer 2.0 is using Office with Russian language settings. Guccifer 2.0 is chatting the way a Russian would chat. ERGO Guccifer 2.0 is Russian. ERGO Guccifer 2.0 is really Russian Intelligence. I'm not sure how to explain how stupid this is, other than to just point out that, no, not everyone who speaks Russian is a GRU agent. Maybe visit Russia and meet some of them? There are some people who speak Russian who are butchers and bakers and candlestick makers. By golly, there are even people who speak Russian that don't live in Russia at all! I know, your mind is blown, right?


EVERY POLITICAL HACK IS STATE SPONSORED

Not every hacker is state-sponsored. Gee whiz, there are even *groups* of hackers who *cooperate* with each other and even *manipulate the media* and *lie about their identity* who are just teenagers somewhere. There is a rich, long-standing history of teenagers playing such pranks. Kids have been hacking for longer, and frequently using more sophisticated techniques, than governments have. Some of the first government "cyber warfare" programs were just field agents who got kids to hack for them and paid them in drugs. Really.

One of the most recent, well known examples of this is the lulzsec hacking group. lulzsec had a very pointed political agenda and targeted government agencies, law enforcement groups, media companies and others that opposed that agenda. The lulzsec political agenda did not fall into the binary Team Red / Team Blue archetypes that inform what passes for American political commentary, but it was there and it clearly was important to lulzsec and their supporters. Before the indictments began, there were plenty of rumors that lulzsec was state-sponsored.


If you've made it this far - congratulations. You're almost at the end. Let's wrap up.

Some companies tell us that there is evidence the DNC was hacked by Russian intelligence. That evidence hasn't been published. There is different evidence that Russian intelligence is behind the Guccifer 2.0 account. Most of that evidence turns out to be at best incredibly flimsy and circumstantial and at worst utterly irrelevant.

It may very well be the case that Russian intelligence is responsible for the DNC email leaks, but the fact remains that further investigation is required to confirm the identity of Guccifer 2.0. Attributing the attacks to the Russians before such an investigation can occur does an enormous disservice. The Cold War actually completely sucked. We should avoid repeating that experience based on the flimsy BS that has largely informed the coverage of the DNC hacks up to this point.

Reporters never open infected Wikileaks attachments

Since I've published my findings on malware in the GI Files Wikileaks file dumps and my subsequent attempts to encourage Wikileaks to label such malicious content, I've repeatedly been told by a variety of "Security Experts®" that no one will open infected attachments from email file dumps.

I plan on writing a post on how assumptions about user behavior are frequently inaccurate, and how assumptions about Wikileaks researchers analyzing email dumps are particularly prone to failure when they are based on the typical behavior of normal email users, but for now I'll just leave this here:

Saturday, July 30, 2016

524.dat & chrome_patch.hta [UPDATED]

    A few minutes ago I clicked a link to an article and I noticed something fishy. The new site attempted to automatically redirect my browser to this:


    This piece of garbage phishing page didn't even wait for me to be suckered by their super-convincing download link, and used a setTimeout() call to try to force my browser to download something called `9901224839027/1469890408944162/chrome_patch.hta`. 
    Here is chrome_patch.hta as it is seen in the wild:


    And here is chrome_patch.hta after we apply deobfuscation 101:


    As you can see, chrome_patch.hta downloads a .dat file `17/524.dat` and creates an executable `g2924808f66985de3a9ad1e3d743e0d.exe` before providing victims with a reassuring "Update completed" window.
    I've been seeing similar versions of this same method to force users to swallow the 524.dat payload, like this:
    I've found some complaints as far back as a month ago. I'm going to try to get my hands on these and look a bit closer as time permits and post the results here. I can't promise it will be all that interesting though as this script was pretty artless & obvious. If anyone's already seen the payload please share! Thanks.
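
If you want to check saved copies of pages for the forced-download trick described above, a rough sketch in Python might look like this. The regex is a deliberately loose heuristic of my own choosing (a `setTimeout(` call followed within 300 characters by a quoted path ending in `.hta`), and the surrounding HTML in the example is a made-up stand-in for the real page; only the `.hta` path comes from what I saw. Anything even mildly obfuscated will sail right past it.

```python
import re

# Loose heuristic: a setTimeout( call followed within 300 characters by a
# quoted string that ends in .hta. This is an assumption about what the
# pattern looks like, not a signature taken from the actual phishing page.
FORCED_HTA = re.compile(
    r"setTimeout\s*\(.{0,300}?['\"]([^'\"\s]+\.hta)['\"]",
    re.IGNORECASE | re.DOTALL,
)


def find_forced_hta_downloads(html):
    """Return every .hta path that appears shortly after a setTimeout( call."""
    return FORCED_HTA.findall(html)


# Hypothetical page source for illustration only.
suspicious_page = """
<script>
setTimeout(function () {
    window.location.href = "9901224839027/1469890408944162/chrome_patch.hta";
}, 1000);
</script>
"""
benign_page = "<script>setTimeout(function () { refresh(); }, 500);</script>"

print(find_forced_hta_downloads(suspicious_page))
print(find_forced_hta_downloads(benign_page))
```

The first call should report the `chrome_patch.hta` path and the second should come back empty. A hit is a reason to look closer, not proof of anything.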

UPDATE: It looks like someone uploaded the payload to malwr last week. Their PE scanner is about as good as it gets for automated scanning. Just looking through malwr's list of registry keys it looks like the payload adds ~5 domains to Windows' URL Security Zones or as I prefer to call it the Circle of Trust:


HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\SafeBoot\Option gets modified also, which is weird. This is the registry key that determines whether the next reboot will put Windows into Safe Mode or not. This could be an attempt to disable antivirus software and is a loud flashing sign that this payload is going to be a loud, obnoxious dick.
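
To make the triage step concrete, here is a small sketch in Python that flags the two kinds of registry writes mentioned above in a list of key paths, such as one copied out of a sandbox report. The input list is hypothetical (it is not malwr's actual output, and `example.test` is a placeholder domain), and the two substrings are just the ones discussed in this post: the `ZoneMap\Domains` branch used by URL Security Zones and the `SafeBoot\Option` key.

```python
# Substrings to look for, lowercased because registry paths are
# case-insensitive. The descriptions paraphrase this post, nothing more.
SUSPICIOUS_KEYS = {
    r"\internet settings\zonemap\domains": "adds a domain to URL Security Zones",
    r"\control\safeboot\option": "touches the Safe Mode boot option",
}


def flag_registry_writes(paths):
    """Return (path, reason) pairs for paths matching a known-suspicious key."""
    findings = []
    for path in paths:
        lowered = path.lower()
        for needle, reason in SUSPICIOUS_KEYS.items():
            if needle in lowered:
                findings.append((path, reason))
    return findings


# Hypothetical sandbox output for illustration only.
sample_paths = [
    r"HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion"
    r"\Internet Settings\ZoneMap\Domains\example.test",
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\SafeBoot\Option",
    r"HKEY_CURRENT_USER\Software\Microsoft\Notepad",
]

for path, reason in flag_registry_writes(sample_paths):
    print(reason, "->", path)
```

Substring matching is crude, but for a first pass over a few hundred keys it is enough to pull the interesting ones to the top.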

I also found a third version of chrome_patch.hta that is significantly different than the one I have and the other version I posted above; I think whoever is responsible for this is making some changes on the fly, or a few different people are tweaking it. The tweaks don't include changing the filenames (although some components have been removed in my version), and I've only seen it use two different domains to download from. Small potatoes.

ANOTHER UPDATE: I think I scared our hacker friend a bit. The domain name registration for the website used to host the phishing script & payload file has disappeared. Those files appear to have been removed from the server also, or at least taken offline or moved somewhere I can't find them. This is a pretty fast reaction from our hacker friend (< 24 hours from my post / reporting the issue to involved parties). It supports the idea I had earlier that hacker friend is actively developing this little project. If you're listening, hacker friend: why did you take your toys and go home?