1.1 billion row database running Percona MySQL on EC2 with RAID 1 SSDs (160GB datadir)

f

Day 1

So today, we broke our hosting companies record.. 1.1 billion rows in a Percona MySQL InnoDB table! Sounds impressive, but reality soon kicks in when you can’t run SELECT() queries against it for fear of saturating our poor SAS disks and locking up the web application. And even harder when you’re working against the clock on deadlines.

With a total datadir size of around 160GB, 1,159,945,113 rows in a single table and no long term NoSQL plans on the horizon, our immediate options were limited. The affected database servers were limited to just 300 IOPS over SAS RAID 1, of which the disk IO showed a cool and steady 99.95% average IO utilization for a 24 hour period.

So in a small moment of caffeine fueled epiphany, we fired up the biggest bad ass EC2 instance possible (hi1.4xlarge), this is where things started to get fun. For those of you that don’t know about the hi1.4xlarge specs, let me fill you in;

  • 8 virtual cores, clocking in at a total of 35 ECU (EC2 Compute Units).
  • 60.5 GB of RAM.
  • 10 Gigabit Ethernet connectivity
  • 2 TB of local SSD-backed storage, visible to you as a pair of 1 TB volumes.
  • 120,000 random read IOPS (theoretical 240,000 IOPS in RAID 1)

Wait.. what? 240 THOUSAND IOPS? I thought “no, it’s gotta be a typo”, the last time I played with SSDs, the floor limit was around 4k IOPS.

After firing up a few tests, I damn near came when I saw the results scrolling by in my putty window.

(EDIT 8th Nov 2012): After being recommended to give ‘ioping’ tool a try, I re-ran these tests and sadly the numbers were not as exciting;

So for shits and giggles, I thought I’d see if I could max out the supposedly 10Gbit switch port – although finding a test file that would allow download at full 10gbit line speed proved extremely difficult. Eventually we found leaseweb mirrors, but even with 32 threads we only peaked at around 201mb/sec. (2gbit).

By this point I was having more fun with a server that costs 3 bucks an hour, than I ever could have done with a $500/hour high-class hooker on a Friday night.

To get the most performance possible, I used the following approach;

After a few hours of pure geek heaven, I calmed down and started thinking how we could overcome a few stumbling blocks;

  • If the instance is stopped, any data on the SSDs is discarded and it takes several hours to reload from EBS.
  • The server was not part of the managed hosting platform, and thus couldn’t be used directly in production code or use any live replication.
  • EC2 SSDs were not automatically placed into RAID 1 (they are a pair of 1TB drives), and require initial sync time.
  • All queries on the EC2 instance would have to be ran as one-off scripts, and couldn’t be sanely implemented into our existing code base at this stage.
  • The effective data was split over two existing databases, and merging them would be troublesome

So we took an LVM snapshot of databases and started to copy the datadir’s to EC2.. And this is where the fun started to turn into a nightmare of 70+ hours work.

I almost choked on my tea as I saw the 93796434 alarms flood my mailbox

Whilst attempting to resolve a transfer speed problem with the networks team, one of the firewalls locked up. This lasted several (very long) minutes, and once the fail over kicked in we decided to leave it until monday and settle for 10MB/sec.

Now, I’m not an impatient kinda guy, but this was like driving a Austin Metro after just having a test drive in a Lamborghini. Patiently, I sit and watch the rsync progress meter get closer and closer, teasing me with its damn fluxuating average speed.

After about 30 minutes of flicking to and from the putty window and realising it’s now 8pm, I reluctantly background my screen session and sign off for the night.

Day 2

At this point we had both a SQL dump and a datadir snapshot, but there was a problem. In order to run two datadirs, you either need to merge them, or you need to run two instances of MySQL, and I gave up pretty quickly trying to modify Debian’s init scripts to support multiple MySQL instances. So I opt’d to merge;

Annoyingly, using symlink’s seems to break Percona innobackupex copy-back tool fails with the following message;

To fix this, you have to modify the datadir var in my.cnf to point directly at your SSD mount to make it work. With that problem sorted, we then play the waiting game again;

By this stage I was beginning to wish I never ran that 10gbit line speed test, everything else felt so slow and slugish in comparison, and made our production systems feel like I was back in ’99!

Once that finished, we copied in the other datadir sub directories (using the first ib_logfile0), but as you can expect MySQL did not like this;

At this point we got tablespace id mismatch problems, and had to use the recovery toolkit;

http://www.mysqlperformanceblog.com/2011/05/13/connecting-orphaned-ibd-files/

For those that know me, you’ll know that the Percona Toolkit scares the shit out of me, and those 3 seconds it took to run ibdconnect were the longest 3 seconds of my life.

I breathed a sigh of relief when I saw the word “Successfully” at the end, but sadly this process did not work very well and any attempt to access the database resulted in a hard crash.

So the only alternative was to try and get two instances of MySQL working (VERY frustrating), or import the raw SQL. Now using single threaded exports on MySQL isn’t too bad, but using single threaded imports is extremely slow. It took approx 42 minutes to dump 1.1 billion rows across 160GB, so not too bad.

Interesting note, the row count in show table status LIES!! It lies hard. It almost made me delete the entire thing and start over – luckily I did a COUNT(*) check on production before hand

There are a few problems with the current multi-threaded import tools currently available;

  • mydumper does not compile very well across different platforms (See [bug here][1]).
  • mysqldump does not allow you to use multiple threads
  • both mysqldump and mydumper do not allow you to extract the indexes as ALTER statements (I’ll go into more detail about this in another blog post).
  • using mysql client to import/execute files does not handle connection problems gracefully, and you WILL end up with lost data!!

In the end, we used the following approach:

  • Dumped each database table data into its own individual file, with no schema or indexes.
  • Dumped each database table schema into it’s own individual file, with just the schema
  • Imported the schema files into the new database, then ran SQLYog’s “Schema Comparison” tool to manually generate the ALTER statements.
  • Imported the data files (concurently) into the new database (at this point there are no indexes, so it’s much faster)
  • Imported the indexes (sequentially) into the new database

We also had to apply the following my.cnf tuning to get a reasonable amount of performance;

After these tweaks were added in, we imported roughly 24,592,269 rows in about 5 minutes, with an ETA of 3 hours until finish – not bad!

Once the database was fully imported and confirmed working, I copied the datadir to an EBS volume, so it would be much faster to start up the next time.

Let me tell you, the above was the accumulation of several hours of frustrated fighting with Navicat/SQLYog/mysqldump/innodb/mydumper – and made me realise just how little tools there are for dealing with large data sets. Here are a couple of interesting blogs I found on the subject;

http://www.mysqlperformanceblog.com/2012/05/16/benchmarking-single-row-insert-performance-on-amazon-ec2/

http://www.dslreports.com/shownews/Mysql-and-a-billion-rows-using-innodb-87890

http://palominodb.com/blog/2011/08/02/mydumper-myloader-fast-backup-and-restore

http://37signals.com/svn/posts/3174-taking-the-pain-out-of-mysql-schema-changes

Results

Sadly, we can’t post any benchmarks as it would involve disclosing the database schema and data for the benchmarks to be relevant, and we ran out of time to create any meaningful graphs.

However, I can tell you that queries which originally took 40+ minutes, were now taking around 49 seconds.

This was my second time using SSDs in a production environment, and honestly, the performance boost was shocking.

Some annoyances

mysql client does not gracefully recover from errors when executing a large SQL dump;

EC2 Ubuntu 11 64bit instance does not have a working setuptools package, you should install from source instead.

Using mysqldump --threads is useless – see http://lists.mysql.com/mysql/227886

Here is a slightly annoying message you might see after trying to stop/start instances too quickly, if you get this message and you know you haven’t reached your limit – just wait about 5-10 minutes and the error will go away.

And if you break fstab, there is no recovery KVM. You have to boot another instance to repair it.. and if you deattach the root device, you will get this next time;

EC2 can also be bitchy when it comes to releasing/attaching disks, and using tools such as cfdisk.. sometimes, they hang, and that can cause data corruption (or in our case, DID cause data corruption!)