Showing posts with label linux. Show all posts

Thursday, February 11, 2016

Recovering network access to EC2 instances

So you've screwed something up. You made a typo in your sshd_config file. You added a firewall rule, or a route, or some other thing, and lost your network access to your EC2 instance. And of course whatever you broke, you broke permanently - you wrote your firewall rules directly to /etc/sysconfig/iptables, you made your goofy change to /etc/sysconfig/network-scripts/whatever-interface; so rebooting won't make a damn bit of difference. You read the warnings, you know you shouldn't have. But you did anyway.

Oh, and you don't have any backups. Or you have backups from three months ago. Restoring from your crappy backups would mean hours to days of non-stop work and consistent downtime. Or Amazon or whatever other company you're using for backups actually broke your backups/lost your backups/never actually provided you with the backups you paid for.

Don't panic. You've got this. You remember that Amazon has some sort of Java-based something or other. It's got to be a virtual KVM. You log in to the web console and find out that the Java-based something is a completely worthless SSH client, and not a KVM at all.

You are going to be fired.

Unless you found this post. I will save your backside, sir and/or ma'am.

Well, I will save your backside provided your environment meets a few requirements. I will make them clear so that if you don't meet them you can get going somewhere else to find a solution ASAP. Here they are:

    - This is for Linux. If you are using Windows you are fired. Just kidding! You can mount Windows volumes in Linux, but reconfiguring network settings in this way is much more complicated since those settings are often stored in the registry rather than flat files. This walkthrough only covers the volume management side of things; if you're dealing with Windows, consider mounting the volume on a Linux VM and then using a tool like this one to modify the registry in the broken volume.
    - This only works for EBS volumes. There may be a way to do it with other types of volumes, but I haven't had to worry about it, and it will be much more complex than this if there is a way to do this with non-EBS instance store volumes.
    - I'm going to take for granted that you know how to start and stop an EC2 instance, and how to deploy an EC2 instance. I'm assuming this because you had to have done these things to make the instance you just broke. If you broke somebody else's instance and you don't know how to even restart the damn thing, well, first off - lol. And second, you're fired.
    - You need to either already have or be able to provision a second Linux EBS-backed EC2 instance in the same availability zone as your broken server.

Those should be the only requirements. It won't matter if your broken volume is magnetic or SSD. Here is what to do:

1. For this to work you need a second EBS-backed EC2 instance running Linux, other than the broken one, within the same region and availability zone (e.g. us-west-1a) as the broken server. It doesn't need to be the same "flavor" of Linux, but it makes things a lot easier if the kernel versions are pretty close to one another. If you do not already have one deployed, create one now. Make a note of the instance-id of the second server (if you created your instance a while back, the instance-id will look like this: i-123a45fe - if you just created your instance, the id will be longer, with 17 characters after the "i-", like this: i-1234567890abcdef0).

2. From the AWS Management Console, select Instances and then highlight the broken instance. Make a note of the instance-id. Then STOP the instance.

3. Next, select Volumes. If you haven't already, give the volumes of both the broken instance and your second troubleshooting instance a descriptive Name so you can quickly tell them apart. Make a note of the names and volume-ids and which instances they are connected to.

4. Highlight the volume of the BROKEN server, right-click, and select DETACH VOLUME.

5. Detaching volumes should be processed quickly, but your browser won't recognize the change right away. Refresh your screen to make sure the volume is detached. Then, right click the detached volume and select ATTACH VOLUME.

This will open a new window asking you which instance to attach the volume to and what to name the volume on the new server. Select your secondary, working server to attach the volume to. It should be alright to leave the default device label - it should be /dev/sdf. The only concern here is that you don't want to give the new volume a device name that is already assigned. If you only have one EBS volume attached to your server, it will automatically be assigned /dev/sda1. If you've customized volume management for your server, you know these settings; if you haven't, then this walkthrough will assume you use /dev/sdf for the broken disk volume label on the secondary server.

6. SSH to your working secondary server and make a new folder under /. You will be mounting the broken disk to this directory:

    # cd /
    # mkdir broken/

7. Here's where things can get a bit complicated, and where a lot of the walkthroughs available on this subject get things wrong. In Step 5 we created a volume label /dev/sdf for mounting the broken disk to our secondary server; but it won't show up as /dev/sdf on your secondary server.

You should have two EBS devices attached: /dev/sda1, which is the default volume, and /dev/sdf, which is the broken drive. /dev/sda1 will show up as /dev/xvda1 - the "s" is translated to "xv" to indicate that it is a virtual disk. /dev/sdf will show up as two additional devices: /dev/xvdf and /dev/xvdf1. You will want to use /dev/xvdf1.
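The naming rule described above can be sketched as a small shell helper. This is only an illustration of the "s" to "xv" translation; `xen_device_name` is a made-up name, and running `lsblk` on the secondary server remains the way to see what the kernel actually exposed.

```shell
# Hypothetical helper: translate the device name chosen in the console to
# the name described above (the leading "s" becomes "xv").
xen_device_name() {
    printf '%s\n' "$1" | sed 's|^/dev/sd|/dev/xvd|'
}

xen_device_name /dev/sdf     # prints /dev/xvdf
xen_device_name /dev/sda1    # prints /dev/xvda1

# To confirm on the secondary server itself, list the block devices:
# lsblk
```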

Where you go from here depends on the sort of filesystems that are in use. In most instances, the filesystem in use will be XFS. You can check the filesystem by running this command:

    # mount -l |grep xvd
    /dev/xvda1 on / type xfs (rw,relatime,attr2,inode64,noquota)

The filesystem is shown directly after the "type". This is important because attempting to mount the broken volume directly will fail when it uses XFS, like this:

    # mount /dev/xvdf1 /broken/
    mount: /dev/xvdf1 is write-protected, mounting read-only
    mount: unknown filesystem type '(null)'

Even though the error appears to indicate the volume was mounted "read-only", nothing gets mounted - the /broken/ directory will be empty and `mount -l` will not display /dev/xvdf1.

The problem here is that the filesystem must be specified by using the "-t" flag (using -t auto will also fail). Here is the correct command:

    # mount -t xfs /dev/xvdf1 /broken/

If successful, the command will output nothing. You can confirm by checking for content in the /broken/ directory and by running this:

    # mount -l |grep xvdf1
    /dev/xvdf1 on /broken type xfs (rw,relatime,attr2,inode64,noquota)
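If you would rather not hard-code the filesystem type, a rough sketch like the one below pulls it out of the `mount -l` output. `fs_type_of` is a hypothetical helper, and the sketch assumes the broken volume uses the same filesystem as the secondary server's root volume, which will not always be true.

```shell
# Hypothetical helper: read `mount -l` style lines on stdin and print the
# filesystem type of the given device (the word after "type").
fs_type_of() {
    awk -v dev="$1" '$1 == dev && $4 == "type" { print $5; exit }'
}

# Sample line copied from the mount output shown earlier in this step.
sample='/dev/xvda1 on / type xfs (rw,relatime,attr2,inode64,noquota)'
printf '%s\n' "$sample" | fs_type_of /dev/xvda1    # prints xfs

# On the secondary server this could feed the mount command (not run here):
# fstype="$(mount -l | fs_type_of /dev/xvda1)"
# mount -t "$fstype" /dev/xvdf1 /broken/
```

One more thing that may be worth trying if the mount still fails: when both servers were launched from the same image, the two XFS filesystems can share a UUID, and adding `-o nouuid` to the mount command tells XFS to ignore the duplicate.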

8. You can now navigate through the /broken/ directory as if it were / on the broken server. You can use /broken/var/log/ to identify errors, and rewrite configuration files like /broken/etc/sysconfig/network-scripts/. Be sure to remember to prepend /broken/ when navigating! It's easy to forget where you are and change something on your secondary working server, so don't do that, or else ...
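One way to make forgetting the prefix harder is to build every path through a tiny helper. This is just a sketch; `broken_path` and `BROKEN_ROOT` are names invented for this example.

```shell
BROKEN_ROOT="/broken"

# Hypothetical helper: map a path as it appears on the broken server to its
# location under the mount point on the secondary server.
broken_path() {
    case "$1" in
        /*) printf '%s%s\n' "$BROKEN_ROOT" "$1" ;;
        *)  echo "broken_path: need an absolute path" >&2; return 1 ;;
    esac
}

broken_path /etc/ssh/sshd_config    # prints /broken/etc/ssh/sshd_config

# Example use (not run here): syntax-check a repaired sshd config. Note that
# any key paths inside the file still resolve against the secondary server.
# sshd -t -f "$(broken_path /etc/ssh/sshd_config)"
```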

9. Once you have reversed whatever was broken, unmount the broken disk from the secondary server:

    # umount /broken

10. Detach the now-fixed volume from the secondary server.

11. Refresh your window and reattach the volume to the original server.

12. Restart the server and you should now be back in business.
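The console work in steps 2 through 5 and 10 through 12 can also be sketched with the AWS CLI. This is a minimal sketch, assuming the `aws` command is installed and configured for your account. The instance and volume IDs are placeholders, /dev/sda1 is assumed to be the broken server's original device name, and `run` is a made-up helper that only prints each command unless you set DRY_RUN=0.

```shell
# Placeholder IDs (assumptions): substitute your own.
BROKEN_INSTANCE="i-1234567890abcdef0"
RESCUE_INSTANCE="i-0fedcba0987654321"
BROKEN_VOLUME="vol-1234567890abcdef0"
ROOT_DEVICE="/dev/sda1"   # assumed original device name on the broken server

# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Steps 2-5: stop the broken server and hand its volume to the rescue server.
move_volume_to_rescue() {
    run aws ec2 stop-instances --instance-ids "$BROKEN_INSTANCE"
    run aws ec2 wait instance-stopped --instance-ids "$BROKEN_INSTANCE"
    run aws ec2 detach-volume --volume-id "$BROKEN_VOLUME"
    run aws ec2 wait volume-available --volume-ids "$BROKEN_VOLUME"
    run aws ec2 attach-volume --volume-id "$BROKEN_VOLUME" \
        --instance-id "$RESCUE_INSTANCE" --device /dev/sdf
}

# Steps 10-12: give the repaired volume back and start the server again.
return_volume_to_broken() {
    run aws ec2 detach-volume --volume-id "$BROKEN_VOLUME"
    run aws ec2 wait volume-available --volume-ids "$BROKEN_VOLUME"
    run aws ec2 attach-volume --volume-id "$BROKEN_VOLUME" \
        --instance-id "$BROKEN_INSTANCE" --device "$ROOT_DEVICE"
    run aws ec2 wait volume-in-use --volume-ids "$BROKEN_VOLUME"
    run aws ec2 start-instances --instance-ids "$BROKEN_INSTANCE"
}

# Dry run by default: this only prints the plan for the first half.
move_volume_to_rescue
```

Remember to unmount the disk on the secondary server (step 9) before calling `return_volume_to_broken`. The helper does not stop on a failed command, so read the output before moving on.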

There are so many reasons why this process is a huge pain in the ass as compared to a virtual KVM utility. I recently had to perform this procedure to resolve a networking issue on a server where most of the services were still responding - http & https were all fine, but SSH was dead. With a virtual console I could have repaired the issue without any downtime. Using this procedure forced me to bring down the server for 5 minutes or so to perform the repairs. That sucks. And up-to-date image backups would not have made anything better; reimaging the server may have shaved a minute or two off of the total downtime that was required to run this procedure, but there would still be downtime.

I'm not sure why Amazon has declined to implement this sort of feature; Rackspace and others make it available. My guess would be that there are security issues involved, but that's just a guess. In any case, hopefully this walkthrough helps out.

h/t Several of the images here were taken from a walkthrough by Mike Culver. Mike's screenshots were great and spared me having to take my own; unfortunately his walkthrough as currently published in Amazon's tutorials section fails in a variety of cases, including my recent one, which is why I wrote this.

Friday, October 2, 2015

Fedora Project's RHEL yum repo has been throwing errors since yesterday UPDATED

A few of my Red Hat servers run cron jobs to check for updates. Starting yesterday (Thursday October 1st, 2015) at around 3PM I encountered 503 unavailable errors when attempting to contact a Fedora Project URL that hosts the metalink for the rhui-REGION-rhel-server-releases repository - a core RHEL repository for EC2.

Could not get metalink error was
14: HTTPS Error 503 - Service Unavailable

3 hours later or so, the URL began responding again, but the problems remained. `yum` now reports corrupted update announcements from the repo:

Update notice RHSA-2014:0679 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
You should report this problem to the owner of the rhui-REGION-rhel-server-releases repository.
Update notice RHSA-2014:1327 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHEA-2015:0372 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHBA-2015:0335 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHEA-2015:0371 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHSA-2015:0416 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHBA-2015:0303 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHBA-2015:0556 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHSA-2015:0290 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHBA-2015:0596 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHBA-2015:0578 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHSA-2015:0716 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHSA-2015:1115 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHBA-2015:1533 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHSA-2015:1586 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.
Update notice RHSA-2015:1705 (from rhui-REGION-rhel-server-releases) is broken, or a bad duplicate, skipping.

I sent a tweet to Fedora to hopefully get some feedback. Because this wasn't a super critical issue, I've been slacking on troubleshooting as well. I will update here and/or provide a new post with more info.

UPDATE: I am increasingly convinced that this is an error with the repository and not something with my server. Check out the following command output:

Nothing marked as out of sync:
# yum distro-sync
Loaded plugins: amazon-id, rhui-lb
No packages marked for distribution synchronization

No problems listed by `package-cleanup`:
# package-cleanup --problems
Loaded plugins: amazon-id, rhui-lb
No Problems Found

`yum check` finds nothing:
# yum check
Not loading "rhnplugin" plugin, as it is disabled
Loading "amazon-id" plugin
Not loading "product-id" plugin, as it is disabled
Loading "rhui-lb" plugin
Not loading "subscription-manager" plugin, as it is disabled
Config time: 0.012
Yum version: 3.4.3
rpmdb time: 0.000
check all

The cache has been cleaned (repeatedly):
# yum clean all
Not loading "rhnplugin" plugin, as it is disabled
Loading "amazon-id" plugin
Not loading "product-id" plugin, as it is disabled
Loading "rhui-lb" plugin
Not loading "subscription-manager" plugin, as it is disabled
Config time: 0.021
Yum version: 3.4.3
Cleaning repos: epel rhui-REGION-client-config-server-7 rhui-REGION-rhel-server-optional rhui-REGION-rhel-server-releases rhui-REGION-rhel-server-rh-common
Cleaning up everything

No orphans:
# package-cleanup --orphans
Not loading "rhnplugin" plugin, as it is disabled
Loading "amazon-id" plugin
Not loading "product-id" plugin, as it is disabled
Loading "rhui-lb" plugin
Not loading "subscription-manager" plugin, as it is disabled
Config time: 0.012
Setting up Package Sacks
pkgsack time: 0.327
rpmdb time: 0.000

By default, EC2 instances automatically repopulate mirrorlist URLs configured in /etc/yum.repos.d/*.repo files using the region in which the instance is hosted, like this:


I've manually updated the relevant .repo file with my region and upped the debugging level variables for yum-cron to try to narrow things down a bit. No answers yet ...
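The manual edit amounts to swapping the REGION placeholder for a real region name on the mirrorlist lines. A rough sketch of that substitution follows; `fill_region` is a hypothetical helper, and the URL in the sample is made up for illustration rather than copied from a real .repo file.

```shell
# Hypothetical helper: read repo config text on stdin and replace the REGION
# placeholder with a concrete region, but only on mirrorlist= lines so the
# repository ids in the section headers are left alone.
fill_region() {
    sed "/^mirrorlist=/s/REGION/$1/g"
}

# Made-up sample content (the URL is NOT the real RHUI mirror address).
sample='[rhui-REGION-rhel-server-releases]
mirrorlist=https://mirror.REGION.example.com/rhel/server/7/mirrorlist'

printf '%s\n' "$sample" | fill_region us-west-1
# prints:
# [rhui-REGION-rhel-server-releases]
# mirrorlist=https://mirror.us-west-1.example.com/rhel/server/7/mirrorlist
```

Take a copy of the .repo file before editing it for real, since the instance may rewrite these URLs on its own.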

LATEST UPDATE (11-19): I believe I somewhat figured this out quite a while ago, but I just haven't had the time to update this post.

Amazon manages the licensing information for EC2 instances with operating systems that require it - like Windows and RHEL. So, the short answer is: Amazon broke it. I can't remember off-hand what the licensing agreement is in relation to this particular issue. I do know that I was still paying the exorbitant monthly rate for an RHEL-licensed instance. And I certainly received no notification that my RHEL license was expiring.

This was a very bad experience. The fact is, there are very few reasons why a non-enterprise scale user would ever use RHEL as opposed to CentOS. For Enterprise users that do require licensing, I would highly recommend looking into a Satellite-based updating solution. I'm not sure ATM what the logistics are of doing such a thing on a platform like Amazon, but I am sure to be doing my homework on the subject shortly.

Saturday, July 27, 2013

Saturday, March 9, 2013

Samba 4 and Linux Domain Controllers

Samba 4 is nothing short of amazing. Until recently I was only familiar with earlier versions and had done nothing more than mount cross-OS volumes (to create simple white-label NFS storage devices, for example). Version 4 has hacked some major portions of the Windows kernel functionality and re-worked them in Python.

For example, did you know that a Linux server can be an Active Directory Domain Controller? Install samba-tool and run the following command (assuming your domain already exists): 

# samba-tool domain join MY.DOMAIN DC -Uadministrator@my.domain --realm=MY.DOMAIN

Use the 'samdump' operator to send Kerberos data to standard output:

# samba-tool samdump

In no way would I recommend this outside of a testing / development environment - there are some key differences between Samba 4 AD and real AD (one issue documented so far is that Samba 4 uses some NT 4 notions that Windows simply emulates in recent versions, for example primary and secondary domain controller relationships).

In any event, I can see some use for testing, for example being able to closely integrate Linux-based network monitoring tools without cygwin.

Sunday, January 13, 2013

File Defragmentation Tools for Windows 2003/2008, Redhat/CentOS and Ubuntu

For managing fragmentation of NTFS (Windows Server 2003/2008, XP, Vista, and Windows 7):

For general disk defragmentation, the following utilities offer a substantial improvement in overall performance and efficacy over native operating system tools:
Auslogics Disk Defrag or Raxco PerfectDisk

For disks unsupported by the above tools, for frequently executed and/or locked files, or when you just want a straightforward command line utility that can easily be used as part of a shell script:
Contig from the Sysinternals Suite
Contig has been of particular value when managing backup servers - servers storing huge files with substantial writes on a regular basis. Being able to specify the backup files allows for properly scheduling defragmentation by backup job, and in the process eliminating the need for downtime on these systems as part of this manner of disk maintenance. Can also be used for per-file fragmentation analysis and reporting.

For managing fragmentation of ext4 file systems (newer versions of Redhat/CentOS, Ubuntu, Debian, etc):

e4defrag - Linux users (or at least the Linux users I know) have been waiting a long time for the use of an online defragmentation utility. We've all ignored it, pretended as though fragmentation didn't happen on our Linux machines, until the time came for a reboot after 2-3 years of uptime and read/writes forced an fsck that occurred at the worst possible time.

e2freefrag - Reports free space fragmentation for a filesystem, and can be run while the filesystem is mounted.

For managing fragmentation of ext3 file systems (slightly legacy versions of Redhat/CentOS, Ubuntu, Debian, etc):

Good luck! Your options are unfortunately a bit limited.

Many readers may ask: ext3 is a journalled filesystem, why even bother? Primarily, in order to increase IOPS (currently the primary performance bottleneck in terms of price per unit of measurement). Journalled filesystems have seek times just as NTFS does. Reducing those seek times improves performance. Further, unexpected system events can lead to the operating system forcing the journal to be processed. Regular maintenance helps to ensure this process is timely and that downtime is minimized as a result. I have often heard it said that this process "often takes only a second" and as a result can be safely disregarded. While I respect everyone's opinion, I have to very urgently disagree. Most of my experience has been in working in commercial data center environments with several thousand servers. At scale, the statistically insignificant becomes a regular headache. What often happens is only part of my concern as an administrator; disaster recovery is just as important in my opinion - safeguarding from improbable catastrophic scenarios and reducing their impact has always been part of my agenda.

That said, let's continue: ext3 requires you to unmount your partition to defragment it. IMO, ext3 is still the most widely used Linux filesystem. I highly recommend the e2fsprogs suite, which includes the following tools:

e2fsck - it's just fsck. Not a vulgar typo; performs a filesystem integrity check
mke2fs - creates filesystems
resize2fs - expands and contracts ext2, ext3 and ext4 file systems
tune2fs - modifies file system parameters
dumpe2fs - prints superblock and block group info to standard output or a pipe destination of choice
debugfs - an interactive filesystem debugger that can examine and change the state of an ext2, ext3 or ext4 filesystem when you need to perform emergency troubleshooting

For defragmentation, you will typically be using the following:

mount / umount - used to mount and unmount filesystems (mount is also widely known as featuring one of the more chuckle-inducing linux commands when in need of command syntax assistance, #man mount)
fsck - File System ChecK. Checks the specified file system for errors.
[note: modifying /etc/fstab allows you to specify which devices are mounted at boot]

Some solid non-OS included tools are:


Sunday, January 6, 2013

Pidgin Instant Messenger Log Data Location

Pidgin is a popular IM client. I've been using it for years, mostly because of its simplicity when used within alternate operating systems. I need a non-browser based IM client that I can use in Fedora and Windows with the ability to easily transfer log data between the two. My only complaint is that the log search function is not very great, and Pidgin does not provide you with the ability to locate or change the log file path within the application. For those of you who need to find Pidgin logs, here are the paths for both Linux and Windows.

Installations include an actual pidgeon. Rabies sold separately.
Linux-based operating systems store log data within the user's home directory like so: ~/.purple/logs

Windows XP stores your logs here: 
C:\Documents and Settings\username\Application Data\.purple\logs

Windows Vista and Windows 7 store your logs here:

When running Pidgin within Windows, Pidgin uses the PURPLEHOME environment variable to establish the log data location. You can easily modify this variable to establish a better log file location through the Control Panel.

Select System --> Advanced --> Environment Variables, find PURPLEHOME and adjust its path to your requirements.

Extra information can be found on Pidgin's Developer website.
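Since the built-in log search is not very great, on Linux you can search the log directory directly. Here is a small sketch; `search_pidgin_logs` is a made-up helper name, and it assumes the default ~/.purple/logs location unless you pass another directory.

```shell
# Hypothetical helper: list the log files that contain the search term
# (case-insensitive). The directory defaults to the Linux location above.
search_pidgin_logs() {
    term="$1"
    log_dir="${2:-$HOME/.purple/logs}"
    grep -ril -- "$term" "$log_dir"
}

# Example (not run here): search_pidgin_logs "lunch"
```

The optional second argument is handy if you have copied logs over from a Windows machine into some other folder.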

Thursday, April 12, 2012

Same Domain, Multiple Machines, SSL?

I saw a lot of misinformation about this on the inter-tubes recently, some of it intentionally misleading customers, some of it unintentional. This might be remedial for a lot of readers, but I'm posting a clarification here because it's worth it to help clear up the confusion. Here are some facts that should help people when first making the leap to securing multiple server environments:

Certificates are domain and private key specific. They are not machine specific. You are welcome to generate multiple SSL certificates for the same domain to host on separate servers. Think for a bit, this *has* to be true. When everyone goes to, are they hitting the same web server or SSL caching server? Of course not.*

The most common scenario where this would be valuable is with a load balanced web cluster, but I recently came across this in a deployment with web and mail component where the mail admin neglected to give their MTA a unique FQDN *and* the organization is using SSL/TLS for mail retrieval *and* the organization does not wish to use a self-signed certificate to this end.

You don't need to purchase multiple certificates to this end. Just export the certificate (along with its private key) to a PFX and import it to the next server. In IIS6, this process is outlined here: To use OpenSSL in Linux, here is a good guide:

 (*Yeah I know they are using hardware acceleration, smarty pants. Same argument applies, plus complexity of dealing with hardware tokens)
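For the OpenSSL route, the export and import can be sketched roughly as below. This is a self-contained illustration, not a production recipe: it generates a throwaway self-signed certificate to stand in for your real one, and the file names, the www.example.com subject and the "changeit" password are all placeholders.

```shell
set -e
work="$(mktemp -d)"

# Stand-in for your real certificate and key: a throwaway self-signed pair.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=www.example.com" \
    -keyout "$work/server.key" -out "$work/server.crt" 2>/dev/null

# On the first server: bundle the certificate and private key into a PFX.
openssl pkcs12 -export -passout pass:changeit \
    -inkey "$work/server.key" -in "$work/server.crt" -out "$work/server.pfx"

# On the next server: unpack the PFX back into PEM (key left unencrypted).
openssl pkcs12 -in "$work/server.pfx" -passin pass:changeit -nodes \
    -out "$work/imported.pem"

# The two fingerprints should match, showing it is the same certificate.
openssl x509 -in "$work/server.crt" -noout -fingerprint -sha256
openssl x509 -in "$work/imported.pem" -noout -fingerprint -sha256
```

Because the PFX carries the private key, move it between servers over a secure channel and delete it once the import is done.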

RAT Bastard

Earlier this week, several servers I maintain were targeted by automated attempts to upload a remote access trojan (RAT). The RAT is a simpl...