Huge Data vs. the Database: how the industry is adapting in the face of the data explosion

Note: This article was actually written back in May, after the UC, at the request of Linux Magazine; through a series of events it went unpublished. Between then and now Jeremy ended up doing a great job covering most of the topics, so in the end it was not needed. So I had this completed article and thought, what should I do with it? In the end I decided to publish it here. Also note I did update a few items.

As more companies move to MySQL and the demands for data increase, we push the bounds of the database further. The challenges large Web properties (who have pioneered many of the large MySQL deployments) faced when they stored 50GB of data and had 5,000 users were nothing like the challenges of storing 500GB of data supporting 100,000 users. Today, as we are seeing more and more 10+TB-sized datasets being used and accessed by millions of users, the same properties are again forced to think of new ways to maintain the performance, ease of use, and freedom that using MySQL has afforded them in the past. They have had to adapt and overcome these challenges to survive.

Solutions that work at one scale present new challenges at another. Engineers are considering all their options. These options include moving data to non-relational solutions or even caching large chunks of their data in Memcached. Additionally, many talented engineers have also looked to database designs of the past for clues, resurrecting older design methodologies like sharding to help keep things moving forward. Some are finding new uses for old technologies like replication by building complex master-master or massive read-write splitting setups to get the job done. So…what is the problem, and what is being done about it?

Most MySQL customers use either MyISAM or Innodb as their storage engine. These engines originated from those glorious days of yesteryear when we never thought we would need more than a 32-bit machine, an SMP machine meant two CPUs, and 64MB of memory meant you had a powerhouse. The Innodb storage engine was written way back in the mid-'90s. It’s a beautiful piece of coding that has really stood up over time. However, this meant some optimizations were made based on how to get good performance out of a single-CPU server with 128MB of RAM and a database that was only a few GB in size. I am not saying these engines have remained unchanged; on the contrary, they have changed dramatically over the years. But while new releases have helped improve performance on larger boxes, there are many places where old code has really hindered performance.

Let’s look a little deeper into Innodb. When you read through the code you find all kinds of interesting assumptions. When I read the comments in the code, I still find it amazing to see numerous references to how a piece of code will behave on a 100MHz Pentium machine. Even so, those are just comments. What about real scalability concerns? Using MySQL 5.0 & 5.1, we recommended that clients keep their servers under 8 cores. Why? In some cases you would see a performance regression on machines with more than 8 cores. Take a look:

Yes…that is a performance regression! Thankfully, this is being fixed in 5.4. (I need to retest this to refresh my graph, but others have already retested and found the scalability much improved.)

Another key example can be found in the 5.1 Innodb source:

        /* Flush a few oldest pages to make a new checkpoint younger */ 

        if (buf_get_modified_ratio_pct() > 70) { 

                /* If there are lots of modified pages in the buffer pool 
                (> 70 %), we assume we can afford reserving the disk(s) for 
                the time it requires to flush 100 pages */ 

                n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 100, 
                                                  ut_dulint_max); 
        } else { 
                /* Otherwise, we only flush a small number of pages so that 
                we do not unnecessarily use much disk i/o capacity from 
                other work */ 

                n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 10, 
                                                  ut_dulint_max); 
        } 

This code comes from the main Innodb server thread. In a nutshell: every so often this code will execute a flush of dirty pages to disk. The key is that the flush batch is capped at 100 pages (16KB each). That means that at peak you only flush ~1.5MB worth of dirty pages per pass. On a system with a 128MB buffer pool this may be okay, but with buffer pools now in excess of 128GB, flushing only 1.5MB at a time is not enough.
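To see why that hurts on a big box, here is a quick back-of-the-envelope sketch. It assumes the default 16KB page size and that the master thread gets through this loop roughly once a second, with no new pages being dirtied in the meantime; treat the numbers as rough illustrations, not measurements.

    PAGE_SIZE = 16 * 1024        # default Innodb page size, in bytes
    PAGES_PER_BATCH = 100        # the hard-coded batch size from the code above

    flush_per_pass = PAGE_SIZE * PAGES_PER_BATCH
    print("Flushed per pass: %.2f MB" % (flush_per_pass / 1024.0 / 1024.0))  # ~1.56 MB

    # How long to work through a heavily dirtied buffer pool at that rate?
    for bp_gb in (0.125, 16, 128):                   # 128MB, 16GB, 128GB buffer pools
        dirty_bytes = bp_gb * 1024 ** 3 * 0.70       # 70% dirty, per the threshold above
        passes = dirty_bytes / flush_per_pass        # roughly one pass per second (assumed)
        print("%7.3f GB buffer pool: ~%d passes, ~%.0f min"
              % (bp_gb, passes, passes / 60.0))

Under those (generous) assumptions a 128MB pool catches up in about a minute, while a 128GB pool needs the better part of a day, which is why the hard-coded 100 became such a problem on big machines.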

Over the past few years, fixing these types of internal issues has been a top priority throughout the community. Recognizing the limitations inherent in the most popular database engine (Innodb), companies like Google have been blazing a trail by hacking and adjusting code to meet the demands of their internal enterprise systems and then releasing the code back to the community. Google has pioneered features like semi-sync replication, performance enhancements like multiple background read/write threads, and scalability fixes for SMP systems. In fact, the flush code I posted above was originally patched by Google to allow Innodb to accept a dynamic variable.

Other companies and organizations are working on similar enhancements. While the name Google certainly attracts the most attention, several other companies and organizations are also developing and fixing MySQL and Innodb code in order to improve scalability and performance. Companies like Monty Program Ab and Percona have been attacking performance and scalability issues in several different ways. One approach is building new, enhanced storage engines. Percona’s XtraDB (which is based on Innodb) takes some of the best community-contributed patches and combines them with some of its own performance fixes and enhancements. Monty Program Ab is hard at work on the Maria storage engine, which is designed to be a better-performing replacement for MyISAM (as well as offer transactional support). PrimeBase is working on a completely fresh transactional design in its PBXT storage engine. Want to play with these patches and new storage engines? The OurDelta project takes the best third-party and community patches and builds binaries for mass consumption, meaning you can go test these out today.

Not wanting to take a back seat, Sun has also been spending a lot of time analyzing what the community has contributed and pushing changes into the MySQL code base to address its largest customers’ biggest challenges. At the 2009 MySQL User Conference, MySQL 5.4 was announced. MySQL 5.4 is largely a release aimed at scalability, addressing issues like multi-core scalability (no more 8-core limit!), I/O tuning, and general performance on higher-end systems. Many of the ideas for these performance enhancements come directly from the aforementioned patches and efforts.

Oracle – which owns Innobase – has also been building and enhancing the next generation of Innodb, called the Innodb Plugin. The Plugin is a refactoring of the Innodb code that acts as a drop-in replacement for the standard MySQL-supplied Innodb. One key feature is the ability to compress database pages, allowing for smaller disk footprints and higher memory utilization. Also, the latest Plugin release includes SMP scalability and performance features based on Google’s and Percona’s work.

But we are not just dealing with limitations within the code. Larger datasets breed a different set of challenges. The nemesis of DBAs everywhere is disk access: I/O is the slowest part of any database operation. The goal of most database designs is to have all of your hot data accessed in memory. Why? Memory access speeds are measured in nanoseconds, whereas disk speeds are measured in milliseconds (1ms = 1,000,000ns). Data sizes are exploding across the board as businesses demand more and more data be stored for longer periods of time. This explosion means less of the data can fit into memory, and more data is being read from disk – resulting in significant slowdowns in performance.
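A quick worked example shows how brutal that gap is. The latency figures below are assumptions picked just for illustration (roughly 100ns for a hit in memory, roughly 5ms for a random read from a spinning disk); the exact values vary by hardware, but the ratio is what matters.

    # Assumed, illustrative latencies: ~100ns for an in-memory hit,
    # ~5ms (5,000,000ns) for a random read from a spinning disk.
    MEM_NS = 100
    DISK_NS = 5 * 1000 * 1000

    for hit_ratio in (1.0, 0.999, 0.99, 0.95):
        avg_ns = hit_ratio * MEM_NS + (1 - hit_ratio) * DISK_NS
        print("hit ratio %.3f -> average page access ~%.0f ns (%.3f ms)"
              % (hit_ratio, avg_ns, avg_ns / 1000000.0))

Going from a 99.9% hit rate to 99% makes the average access roughly ten times slower under these assumptions, which is exactly what happens as a dataset outgrows memory.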

In order to compensate for increasing data size, many companies are looking outside the database for solutions. You don’t have to look very far for examples. Several of the sessions at the 2009 MySQL User Conference showed how people are looking to external technologies like Hadoop or Memcached to speed data access. Consider the proliferation of Memcached in the datacenter. Many of the top websites in the world rely on hundreds of Memcached servers to help boost data access times. One could argue that Memcached’s popularity is a direct result of poor data access times from today’s relational databases.
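For readers who have not used it, the pattern is usually plain cache-aside: check Memcached, and only fall through to MySQL on a miss. Here is a minimal sketch using the python-memcached and MySQLdb client libraries; the table, columns, hosts, and credentials are all made up for the example.

    import memcache
    import MySQLdb

    mc = memcache.Client(['127.0.0.1:11211'])
    db = MySQLdb.connect(host='127.0.0.1', user='app', passwd='secret', db='appdb')

    def get_user(user_id):
        """Cache-aside: try Memcached first, fall back to MySQL on a miss."""
        key = 'user:%d' % user_id
        row = mc.get(key)
        if row is None:                      # cache miss -- go to the database
            cur = db.cursor()
            cur.execute("SELECT name, email FROM users WHERE id = %s", (user_id,))
            row = cur.fetchone()
            cur.close()
            if row is not None:
                mc.set(key, row, time=300)   # keep it hot for five minutes
        return row

Every hit served out of Memcached is a query the database never sees, which is the whole point.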

Memcached is not enough for some companies, however; instead, they are looking to persistent key-value stores and column databases to boost performance for large applications. Projects like Voldemort, Tokyo Cabinet, Cassandra, HBase, and others are quickly gaining popularity and features. These solutions offer blazingly fast key lookups and are easy to scale up.

Other tools are also popping up that aim to help manage large amounts of data. One of the standard tricks we use in many large deployments is to shard data across several different systems. “Sharding” is basically breaking up your data into smaller chunks. A simple sharding example is having a dedicated database server for your West Coast clients and another one dedicated to your East Coast clients. This works well, but it does require application changes. Query proxies like MySQL Proxy and Spock Proxy are working to make sharding data more effective and easier to deploy.
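The application change is usually just a routing function that decides which server a given client lives on. The sketch below shows the idea; the host names are invented, and most real deployments hash on a user or customer id rather than geography.

    # Invented host names; the routing logic is the only interesting part.
    SHARDS = {
        'west': {'host': 'db-west-01', 'db': 'app'},
        'east': {'host': 'db-east-01', 'db': 'app'},
    }

    def shard_for_region(region):
        """Coast-style sharding: one dedicated server per region."""
        return SHARDS[region]

    def shard_for_user(user_id, shard_hosts=('db-shard-01', 'db-shard-02', 'db-shard-03')):
        """The more common approach: hash a user id across N shards."""
        return shard_hosts[user_id % len(shard_hosts)]

    print(shard_for_region('west'))    # {'host': 'db-west-01', 'db': 'app'}
    print(shard_for_user(1001))        # db-shard-03

Query proxies like the ones mentioned above aim to push that routing decision out of the application and into the middle tier.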

Yet another project designed to tackle the data size-versus-performance issue is the Waffle Grid Project. (Author’s note: I am one of the founders of this project, so forgive me if I am biased.) Waffle Grid takes Innodb and extends it by building a remote level 2 cache using Memcached. How this works: when a page is evicted from the Innodb buffer pool, it is sent remotely to Memcached. When a page is needed from disk, Innodb first checks Memcached, and only if it’s not found there is it read from disk. Essentially, you are replacing slow disk calls with faster remote memory calls. This allows you to cobble together the memory from several different machines and use it to boost the performance of one machine.
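Waffle Grid itself lives inside Innodb’s buffer pool code and is written in C; the Python below is only a conceptual sketch of the read and eviction paths described above, with invented function names, host name, and key format.

    import memcache

    # Conceptual sketch only -- not the actual Waffle Grid implementation.
    l2_cache = memcache.Client(['memcache-host-01:11211'])

    def read_page_from_disk(space_id, page_no):
        # Placeholder for the real tablespace read.
        return b'\x00' * 16384                # pretend we read a 16KB page

    def read_page(space_id, page_no):
        """On a local buffer pool miss, try the remote L2 cache before the disk."""
        key = 'ibpage:%d:%d' % (space_id, page_no)
        page = l2_cache.get(key)              # fast remote-memory lookup
        if page is None:
            page = read_page_from_disk(space_id, page_no)   # slow path
        return page

    def evict_page(space_id, page_no, page):
        """When a page falls out of the local buffer pool, push it to Memcached."""
        l2_cache.set('ibpage:%d:%d' % (space_id, page_no), page)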

In the end, it’s not just about new applications or software. Hardware and appliance vendors also see an opportunity. What is the most obvious way to overcome disk performance issues? Faster disk, of course…and this is the year of solid-state drives. Without the moving parts (“Look, Ma…no arms!”) of traditional disk drives, these components are scary fast on random read operations. Take a look at the database benchmarks when running on some solid state drives versus traditional drives:

In these tests, it took 8 traditional disks in a RAID setup to come close to the performance of a single Intel consumer SSD. With random read latencies of less than 1ms, solid state makes a compelling case as a weapon to attack I/O issues. Using solid-state or RAM-based storage to boost performance is not a new technique; in fact, companies like Texas Memory Systems have been building large DDR-based SANs for years. The big change has been the dramatic drop in the price of flash memory over the past 12-18 months. This has opened the door to other vendors and made really fast disk subsystems accessible to more companies.
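Some rough spindle math shows why that comparison looks the way it does. The per-drive figure below is a rule-of-thumb assumption (and I am assuming 15K RPM drives), not the benchmark numbers from the chart above.

    # Rule-of-thumb assumption: a 15K RPM drive sustains roughly 180 random
    # read IOPS, and striping across a RAID set scales random reads roughly
    # linearly. So an 8-disk array of 15K drives tops out around:
    IOPS_PER_15K_DRIVE = 180
    DRIVES_IN_ARRAY = 8
    print("~%d random read IOPS" % (IOPS_PER_15K_DRIVE * DRIVES_IN_ARRAY))  # ~1440

Anything that can deliver numbers in that range from a single device, with sub-millisecond latency, changes the economics of an I/O-bound database.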

Sun Microsystems is one of many vendors offering solid-state disk within its product lines. At the 2009 User Conference, Don MacAskill of SmugMug gave a keynote address in which he touted how fast Sun’s 7000 Series storage is when running MySQL. He said his array is delivering over 9600 IOPS, with most requests being handled in less than 166 microseconds. That is just fast. But Sun was not the only vendor whose wares were on display at the conference. Everywhere you turned, someone was advertising MySQL and Memcached appliances based on flash. Vendors like Virident and Schooner Information Technologies were on hand talking about how their flash-based appliances would accelerate Innodb performance on terabyte-size datasets. How fast are some of these new flash appliances? I had the opportunity to test a flash appliance from Violin Memory and was able to push Innodb to deliver 40K IOPS using it. By comparison, 4 Intel SSDs delivered about 3K IOPS in the same test.

Many of the companies I work with daily have encountered MySQL scalability issues, but there are many different forces all working to help you scale your MySQL database. Sun’s engineering staff is hard at work solving customer challenges, and so are hundreds of other dedicated community contributors around the world. Several hardware vendors are stepping up to the plate as well to help overcome some of the oldest database problems, and new tools and applications are filling the gaps.
