Recovering a Debian System after running rm /*

The Oxford English Dictionary defines an ohnosecond as:

a moment in which one realizes that one has made an error, typically by pressing the wrong button.

It’s more commonly referred to in Operations Management parlance as:

OHGODFUARGHFLKJAFWHATIDIDIDOARGHNO

It is, unfortunately, something that will happen to everyone at some point in their systems administration career, and the variations are almost endless. Some notable occurrences include:

  • Copying SSL libraries over from a Debian host to a RHEL host
  • Setting a new root password and immediately losing it
  • Copying over an out of date backup CMS to a production system
  • Running one of the many variations of ‘rm’ at the wrong level

Unfortunately in a recent scenario, a poor hypothetical sysadmin managed to issue:

rm /*

instead of:

 rm ./*

This removed every non-directory file at the / level. The impact of this varies between operating systems, and even between Linux distributions. We’re lucky that in this scenario there was no ‘-rf’ specified, or it’d have been ‘recover from backup’ time; however, the situation did (hypothetically) pose an interesting conundrum.

The Problem

In Debian x86_64 systems, /lib64 is a symlink to /lib, and you’ll find most applications (for instance, ‘ldd’) are linked to libraries in /lib64:

linux-vdso.so.1 => (0x00007fff251ff000)
libc.so.6 => /lib/libc.so.6 (0x00007f278eb38000)
/lib64/ld-linux-x86-64.so.2 (0x00007f278eea2000)
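
For reference, output like the above comes from pointing ldd at a dynamically linked binary – for example (the choice of binary here is arbitrary, and the addresses will differ per run):

ldd /bin/ls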

If /lib64 is not accessible, most applications will fail to run, because they won’t be able to find the myriad of libraries they depend on. There followed a brief investigation and some furious attempts to revive the system, with frankly disappointing results, including:

  • Using a statically linked symlink tool such as sln (available by default on CentOS and RHEL, but not on the affected Debian box)
  • Copying over sln via netcat and writing it out (proved surprisingly difficult)
  • Trying to copy over a symlink via rsync (couldn’t rsync/scp/sftp as they need to exec another process – which they can’t because of missing libraries)
  • Using BusyBox (needs dynamic linking)
  • Writing a linker in C, compiling it, getting it over there via a mixture of cat, echo \x{..}\x{..}, and other incantations (I lost the will to live around this point)

The Epiphany

I eventually remembered a slideshow – chmod -x chmod – which was surprisingly relevant. The more eagle-eyed may have noticed that we would end up missing one important dependency: ld-linux-x86-64.so.2.

ld-linux and ld-linux-x86-64 find and load the shared libraries a program needs, prepare the program to run, and then execute it. Most Linux binaries are dynamically linked, meaning that at runtime the libraries an application depends on are loaded from a shared location rather than compiled into the executable (unless the -static option was used during compilation). As static linking is quite unlikely with most modern distributions, this means that if you cannot access ld-linux.so, you’re in trouble. Luckily, you can still use ld-linux.so to execute arbitrary commands, and it will resolve their dependencies relative to your LD_LIBRARY_PATH at that point. A simple:

/lib/ld-2.11.1.so /bin/ln -s /lib/ /lib64/

restored the symlink and allowed normal execution of binaries again, leaving our hypothetical sysadmin off the hook – apart from having to write a mildly humiliating email to the rest of the operations team, who, understandably, responded a bit like this.
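
The same trick generalises: while the loader symlink is missing, you can run any dynamically linked binary by invoking the loader directly – a sketch, assuming the same loader filename as above and that the libraries still live in /lib:

# run a binary via the dynamic loader, telling it where to find the shared libraries
LD_LIBRARY_PATH=/lib /lib/ld-2.11.1.so /bin/ls /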

Photo Flying by felixtsao (CC)

Tumblr to WordPress Import – Maintaining Links

I recently migrated away from Tumblr, as I found it was heading more towards micro-blogging – reducing the size of the posting editor (seriously? LOOK AT THE PROPORTIONS) – which made embedding code snippets or writing lengthier posts pretty arduous. As an unapologetic geek, WordPress seemed like the natural choice.

The Tumblr to WordPress import process did a reasonably good job of importing everything – but I wanted to make sure I didn’t lose the (already indexed and linked) URLs. Unfortunately this wasn’t quite as easy, as defining custom permalinks on a per-post basis in WordPress still appears to be a manual process (via .htaccess). To handle it generically (without having to create a new alias for the mammoth number of posts I had (ahem)), I simply set the permalink format to the name of the post (which follows the same format as Tumblr) and defined the following RedirectMatch rule in my .htaccess:

RedirectMatch permanent ^\/post\/(\d+)\/(.*)$ /$2

From a URL such as:

http://alexjs.eu/post/36766766992/cors-headers-in-nginx

This will isolate:

cors-headers-in-nginx

and (permanently) redirect it to:

http://alexjs.eu/cors-headers-in-nginx

and, as such, play nicely with search engines.
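
A quick way to sanity-check the rule is to request one of the old-style URLs and look at the response headers – for example, using the post above:

curl -I http://alexjs.eu/post/36766766992/cors-headers-in-nginx
# expect HTTP/1.1 301 Moved Permanently with
# Location: http://alexjs.eu/cors-headers-in-nginx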

Photo Starting Life by jimdeane (CC)

CORS Headers in Nginx

Update: before going much further, note that there is now a much more comprehensive CORS walkthrough for nginx at enable-cors.org – check that out before following the below.

If you’ve deployed even a mildly complex web application in the last few years, you’ve probably had to care about CORS headers. They allow webpages to make requests to another domain, or the same domain on another scheme. Without them, you’ll find that trying to request other assets will be forbidden by your browser, and things won’t load.

They’re relatively simple to implement. You just add a header:

Access-Control-Allow-Origin: https://www.alexjs.im

to the HTTP responses of the assets you’d like to call from your webapp. Thanks to Michiel Kalkman’s gist you can easily achieve this in Nginx – with something relatively standards-compliant, too.

The problem, it seems, is that despite the W3C spec and RFC 6454 prescribing the use of a list of origins, not all browsers (e.g. Firefox) support multiple domains in an Access-Control-Allow-Origin header:

Access-Control-Allow-Origin: https://www.alexjs.im https://www.alexsmith.org

The easiest solution is to use a wildcard:

Access-Control-Allow-Origin: *

However, that has security implications. The best compromise I’ve found is to implement a simple whitelist in the Nginx config and match the request’s Origin header against it. I’ve put this in a public gist – and I’m testing it for deployment now.
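
For a flavour of what that looks like, here is a minimal sketch – it uses an nginx map rather than the exact regex whitelist in the gist, and the domains are placeholders:

# http context: map the request's Origin header to itself if whitelisted, else empty
map $http_origin $cors_origin {
    default                      "";
    "https://www.alexjs.im"      $http_origin;
    "https://www.alexsmith.org"  $http_origin;
}

# server/location context: nginx skips add_header when the value is an empty string
add_header Access-Control-Allow-Origin $cors_origin;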

I’ve not yet done any performance testing, so I’m not sure how efficient the Nginx regex engine is and what the overall effect on throughput/capacity is. I’ll probably forget to update this post with a bit of information once that’s complete.

Update:

This has been in production for a couple of months now, and we haven’t had any performance issues. For the throughput we require (<10 req/s) we’re able to handle the load on a single m1.small comfortably, so the nginx regex engine seems pretty efficient.

Celery and a failing MySQL Server

Celery is a distributed task queue for Python. It’s pretty useful, and a lot of apps I’m involved in deploying seem to be using it lately.

Something it seems to struggle with is stability: if the database disappears, its hostname can’t be resolved, or a single connection to it fails, it simply shuts down.

I needed this not to happen. When running things in “the cloud” (sorry) you’re very much at the mercy of other people controlling your networking/tin/everything, so you need to write applications that can tolerate a little bit of failure (even if the application was originally written this way to avoid split brain or similar). To get around this, we implemented monit. I am definitely not a fan of apps being restarted automatically, but it was the only trivial resolution in this situation. Just append this to your monit config and you should be sorted. My understanding is that there isn’t a better solution yet, but I’d be interested to hear if anyone has seen one.

# monit starts celeryd if it isn't running, restarts it if the pid changes
# unexpectedly, and backs off if it keeps flapping
check process celeryd with pidfile /var/run/celeryd.pid
    start program = "/etc/init.d/celeryd start" with timeout 10 seconds
    stop program  = "/etc/init.d/celeryd stop"
    if changed pid then restart
    if 5 restarts within 5 cycles then timeout
    alert youremailaddresshere
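
Once that’s appended, monit can syntax-check the control file and pick up the new stanza:

monit -t        # validate the configuration
monit reload    # reread it and start watching celeryd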

(I appreciate this is especially tedious, but this is for my reference)

Making nginx ignore query string parameters

When using nginx as a caching proxy, I found myself needing to ignore a particular parameter, both for the cache key and for the request passed to the backend. In this case the parameter I wanted to ignore was ‘uid’. Example URIs being:

http://myapplication.fqdn/foo.ext?env=bar&uid=baz&node=qux

or

http://myapplication.fqdn/foo.ext?uid=bar

To ignore it, at the top of my site configuration I put:

proxy_cache_key         "$scheme$host$uri$is_args$args";

in the server stanza:

# strip 'uid=...' and its '&' separator from the query string
if ($args ~ (.*?)(?:^|(&))uid=[^&]*(?:(\2.*)|&(.*))?) {
    set $args $1$3$4;
}
# if any arguments remain, prepend '?' so $args can be appended straight to $uri below
if ($args ~ (^\w)) {
    set $args ?$args;
}

and the location stanza:

proxy_pass              http://appservers$uri$args;

So now my backend servers see:

GET /foo.ext?env=bar&node=qux

or

GET /foo.ext

and few requests get through to them anyway, as the cache key flattens things appropriately.
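
For reference, a minimal sketch of how the pieces above fit together in one server block (the ‘appservers’ upstream and any proxy_cache zone are assumed to be defined elsewhere):

server {
    listen 80;

    # cache on scheme/host/uri plus the (already stripped) query string
    proxy_cache_key "$scheme$host$uri$is_args$args";

    # strip 'uid=...' and its separator from the query string
    if ($args ~ (.*?)(?:^|(&))uid=[^&]*(?:(\2.*)|&(.*))?) {
        set $args $1$3$4;
    }
    # re-add the leading '?' if any arguments remain
    if ($args ~ (^\w)) {
        set $args ?$args;
    }

    location / {
        proxy_pass http://appservers$uri$args;
    }
}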

Easy.

EDIT: The ‘easy’ bit is a lie, it seems. Thanks to @davidgl for pulling me out of regex hell – several of the revisions here were helped along by him.

fail2ban time offset issues

While trying to set up fail2ban, I found that even though my regexes and logs matched up, nothing was being caught or banned by fail2ban.

After a bit of investigation, it turned out that auth.log timestamps were being written in GMT, whereas fail2ban was expecting them in BST:

==> /var/log/auth.log <==
Oct 11 20:52:21 ns2 sshd[18119]: Invalid user test from 1.2.3.4
Oct 11 20:52:21 ns2 sshd[18119]: Failed none for invalid user test from 1.2.3.4 port 47862 ssh2
Oct 11 20:52:28 ns2 sshd[18119]: Failed password for invalid user test from 1.2.3.4 port 47862 ssh2
==> /var/log/fail2ban.log <==
2010-10-11 21:52:04,017 fail2ban.filter: DEBUG  /var/log/auth.log has been modified
2010-10-11 21:52:04,029 fail2ban.filter.datedetector: DEBUG  Sorting the template list

Fairly simple fix of:

rm /etc/localtime
ln -s /usr/share/zoneinfo/Europe/London /etc/localtime

and I am now successfully banning myself from accessing my server.
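
On Debian and its derivatives the same change can also be made with the packaged tool, which keeps /etc/timezone in step as well – an alternative to the manual symlink above:

dpkg-reconfigure tzdata    # interactively pick Europe/London (or wherever)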

MessageLabs Mail Filtering and Vague Errors

450 Requested action aborted [7.2] 20412, please visit www.messagelabs.com/support for more details about this error message.

It took a remarkably large amount of searching to find out what ‘[7.2]’ means in this error message, and why one of our mail servers kept getting its IP blacklisted – but if this happens to you, hopefully this will help you resolve it.

When MessageLabs returns a [7.2], it seems to mean that they’ve checked the IP address of the host connecting to their MX against the CBL and found it listed. Connections are dropped immediately, rather than individual messages being rejected:

# telnet cluster8a.eu.messagelabs.com 25
Trying 85.158.143.51...
Connected to cluster8a.eu.messagelabs.com (85.158.143.51).
Escape character is '^]'.
450 Requested action aborted [7.2] 20412, please visit www.messagelabs.com/support for more details about this error message.
Connection closed by foreign host.

The easiest way to get around this is to fix your mail server, then request delisting from the CBL.

On a completely unrelated note (ahem), it seems that you may be added to the CBL if you send an email to a Gmail address from a domain whose SPF records explicitly disallow the sending mail server (such as -all with no matching include); Google will automatically (?) submit the IP address to the CBL, and your problems will begin (again).
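
For illustration, this is the shape of SPF record that bites you – a hypothetical zone whose policy hard-fails any sender not covered by the include:

example.org.  IN TXT  "v=spf1 include:_spf.example-provider.com -all"
; anything sent from a host outside that include fails SPF outright (-all)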

I highly recommend robtex as a lazy way to check your hosts against blacklists.

VMWare ESX and a full SQL Server Database

Hypothetical situation. You installed VMWare ESX, possibly upgraded from 3.5 to 4, went with the embedded SQL Server, and Many Years Later the VirtualCenter server no longer starts. You look through the event logs and the best you can find is:

Faulting application vpxd.exe, version 4.0.10021.0, faulting module kernel32.dll, version 5.2.3790.4480, fault address 0x0000bef7.

So you decide to look at general application eventlog events rather than just for VMware:

Could not allocate space for object 'dbo.VPX_EVENT'.'PK_VPX_EVENT' in database 'VIM_VCDB' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

“Great”, you think, “I can just pass this over to a DBA and get them to increase the filegroup size.” Then you dig a bit deeper and look at the event log for SQL Server:

CREATE DATABASE or ALTER DATABASE failed because the resulting cumulative database size would exceed your licensed limit of 4096 MB per database.

“Oh no!” you sob. You really don’t want to try migrating to an enterprise database right now. Worry not, there’s a VMWare solution. The easy process is:

  • Install Microsoft SQL Server Management Studio Express
  • Download and extract VCDB_PURGE_MSSQL.zip
  • Make sure all VMWare VirtualCenter processes are stopped
  • Open Microsoft SQL Server Management Studio Express
  • File -> Open -> Choose the extracted sql script
  • Change the database from ‘master’ to ‘VIM_VCDB’ in the dropdown on the top bar
  • Press ‘Execute’
  • Evaluate the deleted rows, make sure it’s not more than you’d expect (ok, I didn’t do this)
  • Change SET @DELETE_DATA = 0 to SET @DELETE_DATA = 1 in the script
  • Press ‘Execute’ again.
  • Wait. Get a coffee. Get eight. It will eventually finish:
****************** SUMMARY *******************
Deleted 8400 rows from VPX_TASK table.
Deleted 2585209 rows from VPX_EVENT_ARG table.
Deleted 1662120 rows from VPX_EVENT table.
Deleted 0 rows from VPX_HIST_STAT1 table.
Deleted 0 rows from VPX_SAMPLE_TIME1 table.
Deleted 0 rows from VPX_HIST_STAT2 table.
Deleted 0 rows from VPX_SAMPLE_TIME2 table.
Deleted 0 rows from VPX_HIST_STAT3 table.
Deleted 0 rows from VPX_SAMPLE_TIME3 table.
Deleted 105331 rows from VPX_HIST_STAT4 table.
Deleted 373 rows from VPX_SAMPLE_TIME4 table.
  • Start VCenter Server. Wait. Try and connect. Hope. Pray.
  • Connect to VCenter Server
  • From the client, press Ctrl-Shift-I
  • Go to ‘Database Retention Policy’, and enable it.
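
If you’re curious how close you are to the 4096 MB limit before and after the purge, a quick check from the same Management Studio session (using the VIM_VCDB name from the error above):

-- report space used by the VirtualCenter database
USE VIM_VCDB;
EXEC sp_spaceused;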

Hopefully this will save someone a bit of googling.

Checking SSH Private Keys for Passphrases

Imposing ridiculously over-the-top security policies? Want to make sure any SSH private keys on your jump-off/administration server have a passphrase?

Don’t waste time trying to get expect working…

expect <<EOF
spawn ssh-keygen -f file -y
expect -timeout 1 "Enter passphrase:" {exit 1}
EOF

Just look at the damn file (thanks @ealexhudson and @Azquelt) and check whether it contains ‘Proc-Type: 4,ENCRYPTED’.

Keys without a passphrase:

root@a-server ~ # find /home/*/.ssh/ -name "id_*sa" -exec grep -L ENCRYPTED {} \; | wc -l
19

Keys with a passphrase:

root@a-server ~ # find /home/*/.ssh/ -name "id_*sa" -exec grep -l ENCRYPTED {} \; | wc -l
1

Lovely. This of course doesn’t solve the issue of checking, from the SSH public keys, whether the private keys have passphrases or not.
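
And if you want to know which keys (and therefore which users) are the offenders, rather than just how many, drop the wc:

# list the private keys that have no passphrase
find /home/*/.ssh/ -name "id_*sa" -exec grep -L ENCRYPTED {} \;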

LVM Stale NFS File Handles (Part 1)

So, here’s an interesting issue:

(initramfs) mount
rootfs on / type rootfs (rw)
none on /sys type sysfs (rw,nosuid,nodev,noexec)
none on /proc type proc (rw,nosuid,nodev,noexec)
udev on /dev type tmpfs (rw,size=10240k,mode=755)
/dev/pudding/root on /mnt type ext3 (rw,errors=continue,data=ordered)

So I’m using BusyBox, with an LVM volume mounted on /mnt. Happy?

(initramfs) ls /mnt
ls: /mnt/initrd.img.old: Stale NFS file handle
ls: /mnt/vmlinuz: Stale NFS file handle
ls: /mnt/vmlinuz.old: Stale NFS file handle

Only one directory was (a while ago) exported by NFS, and it isn’t one of the affected paths; the box has never mounted anything over NFS either. It seems the error can occur when a file is open and the disk falls out from underneath it: an ambiguous error code is returned, which gets interpreted as a stale file handle. Either way, the superblock on this particular filesystem is corrupted, so the next step will be to attempt a recovery using one of the backup superblocks. I’ll attempt this later and let you know how it goes. I’m sure you’ll be on the edge of your seats.
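
For the curious, that recovery attempt will look roughly like this – a sketch, assuming e2fsprogs is available from the rescue environment and the volume is unmounted first:

# find the locations of the backup superblocks
dumpe2fs /dev/pudding/root | grep -i superblock

# fsck the filesystem using one of the backups (32768 is typical for a 4k-block ext3 fs)
e2fsck -b 32768 /dev/pudding/root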