Archive for the ‘Tokyo’ Category

New Benchmark I am working on that tests MySQL -vs- NoSQL

2010-03-29

I am giving a talk in a couple of weeks at the 2010 MySQL User Conference that will touch on use cases for NoSQL tools -vs- more relational tools. The talk is entitled “Choosing the Right Tools for the Job, SQL or NOSQL”. It is NOT supposed to be a deep dive into the good, bad, and ugly of these solutions, but rather a way to discuss potential use cases for various solutions and where they may make a lot of sense. Being me, I still felt a need to at least do some minor benchmarking of these solutions. The series of posts I wrote last year over on mysqlperformanceblog.com comparing Tokyo Tyrant to both MySQL and Memcached was fairly popular. In fact, the initial set of benchmark scripts I used for that series has been put to good use since then, testing out things like a pair of Gear6 appliances, memcachedb, new memcached versions, and various memcached APIs.

When I started really digging into some of the other popular NoSQL solutions to expand my benchmarks, it became apparent that most of these tools have fairly well defined APIs for Ruby, while the APIs for Perl in some cases do not exist at all or are rather immature at this point. So I decided to rewrite my initial benchmark suite in Ruby. With the help of my co-presenter for this talk (Yves), we are writing a tool that will hopefully be able to run the same basic tests against a wide variety of solutions. Currently I have tests written for Tyrant, Memcached, Cassandra, and MySQL. We will be expanding these tests to include Redis and MongoDB for sure (maybe NDB); beyond that I am not 100% sure. The challenge is going to be writing code that not only tests basic features, but can also test the advanced features of these solutions. After all, a simple PK lookup can be done on all of these solutions, but that’s not necessarily the bread and butter of a solution like MongoDB or even Cassandra. It’s the extra features that make these more compelling. We will be releasing the code when it’s ready.

I have not started my more exhaustive benchmarks yet, as I am still writing parts of the benchmark, but I have been running a few tests. I generally hate publishing or mentioning results until I have taken the time to analyse them and ensure I did not miss anything, but what the hell. In a very short read-only test, using PK-based lookups to compare InnoDB -vs- Cassandra -vs- memcached (a really small data set that should easily fit into memory, all on my laptop, single node), I end up averaging ~1.2K reads per second from Cassandra, ~4K reads per second from InnoDB, and ~17K reads per second from memcached. As I set up more benchmarks I will test multi-node performance, tune the configs for the workload, etc., but it is interesting to see the early performance difference.
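To make "reads per second" concrete, here is a minimal sketch of how a read-only PK lookup loop can be timed. It is not the benchmark tool described above: the function name `measure_reads_per_second` is made up for illustration, and a plain Python dict stands in for the database so the snippet runs without any server.

```python
import random
import time

def measure_reads_per_second(get, keys, duration_s=1.0):
    """Call get(key) on randomly chosen keys for about duration_s seconds
    and return the observed reads per second."""
    reads = 0
    start = time.perf_counter()
    deadline = start + duration_s
    while time.perf_counter() < deadline:
        get(random.choice(keys))
        reads += 1
    elapsed = time.perf_counter() - start
    return reads / elapsed

# Stand-in store (assumption): a dict plays the role of the database.
store = {f"user:{i}": f"value-{i}" for i in range(10_000)}
keys = list(store)
rate = measure_reads_per_second(store.__getitem__, keys, duration_s=0.2)
print(f"{rate:.0f} reads per second")
```

To compare real stores, you would pass each client's lookup call as `get` and keep the key set and duration the same for every run.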

More later.

A few key Tokyo Cabinet Notes

2010-03-26

I wanted to publish a few interesting gotchas, facts, and settings that people who use or want to use Tokyo Cabinet/Tyrant should know.

A quick overview: Tokyo Tyrant is the network daemon that sits on top of Tokyo Cabinet. This means that in order to access Cabinet from another server you have to access it through Tyrant. In the context of this post, when I say Tokyo I mean the entire stack.

#1. Tokyo Cabinet allows for a single write thread. Multiple processes can try to write through Tyrant, but they will wait on each other. In order to get around this limitation you need to shard your data across several files. Using something like a memcached API on top of a hash table is one effective way to do this.
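As a rough illustration of the sharding idea, the sketch below hashes each key to pick one of several Tyrant instances, so that each instance has its own single writer. The host/port list is hypothetical (1978 is ttserver's default port, the others are invented), and no client library is used here; the function only decides where a key should go.

```python
import zlib

# Hypothetical Tyrant instances, each fronting its own Cabinet file.
SHARDS = [("127.0.0.1", 1978), ("127.0.0.1", 1979), ("127.0.0.1", 1980)]

def shard_for(key, shards=SHARDS):
    """Map a key to one shard using a stable CRC32 hash of the key."""
    index = zlib.crc32(key.encode("utf-8")) % len(shards)
    return shards[index]

host, port = shard_for("user:42")
print(f"user:42 -> {host}:{port}")
```

Plain modulo hashing like this is simple, but adding or removing a shard remaps most keys, which is one reason many memcached clients offer consistent hashing instead.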

#2. Tokyo is not durable. This means that in the event of a system crash you will lose data. You can call sync to flush data to disk, but this locks the writer process. Your best bet is to use replication to ensure you have a copy of the data, and to back up often.

#3. Settings for Tokyo Cabinet files can be set via Tokyo Tyrant by appending them to the cabinet file name, each one prefixed with #, e.g.:

/var/lib/tokyo/data.tch#BNUM=20000#xmsiz=10485760

Some of these settings only take effect on file creation or on optimize, so make sure you check the documentation.
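If you generate these strings from a script, a small helper keeps the `#name=value` format consistent. This is only a sketch: `cabinet_spec` is a made-up name, it does no validation of parameter names, and the parameter names are written in lowercase here.

```python
def cabinet_spec(path, **tuning):
    """Join a Cabinet file path and its tuning parameters with '#'."""
    parts = [path]
    for name, value in tuning.items():
        parts.append(f"{name}={value}")
    return "#".join(parts)

# 10 MB of extra mapped memory, expressed in bytes.
spec = cabinet_spec("/var/lib/tokyo/data.tch", bnum=20000, xmsiz=10 * 1024 * 1024)
print(spec)  # /var/lib/tokyo/data.tch#bnum=20000#xmsiz=10485760
```

The resulting string is what you would hand to ttserver as the database name.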

#4. By default there is a limit of 2GB per Cabinet file. This can be worked around by setting the large-file option for your table type. For instance, #opts=l (the HDBTLARGE flag in the C API) enables large files for the hash table. This setting takes effect on creation or when you optimize. You will corrupt your file if you hit 2GB without this setting. If you experience this, your best bet is to restore from a backup that is < 2GB and switch on the large file flag. (Note: if I am correct, you can only change an existing file to support large tables by using the cabinet mgr tools, i.e. running tchmgr optimize -tl cabinet.tch against an offline file.)

#5. Run optimize on a regular basis. I have seen files shrink by as much as 90% from running optimize.

* To optimize a table through Tyrant, run tcrmgr optimize -port xxx localhost (this will lock writes).
* To optimize a table with the Cabinet command line tools, use the mgr for the correct table type (i.e. tchmgr for the hash table).

#6. Increase the number of Tyrant threads from the default of 8 if you’re having issues with refused connections. This is done on the command line when starting Tyrant: ttserver -thnum 16

#7. Log your Tyrant errors to a log file by using the -log flag when starting Tyrant. By default, just setting the log will also log info/warning messages. Disable this by setting the -le flag, which tells Tyrant to log errors only.

#8. If you’re using a Cabinet “table” database, make sure you build the indexes you need, otherwise you’re probably going to get rather slow performance.

#9. The BNUM setting typically has the largest impact on performance. According to the docs, it “specifies the number of elements of the bucket array”. Every table type is a bit different, so check the docs for the exact settings.
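As I read the Tokyo Cabinet documentation for the hash database, the suggested bucket array size is roughly 0.5 to 4 times the number of records you expect to store. The sketch below just turns that rule of thumb into numbers; `suggested_bnum_range` is a hypothetical helper, and you should check the docs for your table type before relying on it.

```python
def suggested_bnum_range(expected_records):
    """Return (low, high) bnum candidates: about 0.5x to 4x the record count."""
    low = max(1, expected_records // 2)
    high = expected_records * 4
    return low, high

low, high = suggested_bnum_range(1_000_000)
print(f"bnum between {low} and {high}")
```

A larger bucket array costs more file space and memory up front but reduces hash collisions, so where you land in that range depends on how much the data set is expected to grow.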

#10. For hash tables, setting xmsiz can make a huge difference. This defines the size of the extra memory mapped for the file.