Tailsweep
Svenska UK

Meny

  • Hem
  • Tailsweep
  • Tailsweep Blog Search
  • Tailsweeps Blogg
  • Google group
  • AddThis Social Bookmark Button

Projekt

  • Mammatus
  • Parhely
  • Haloe
  • AbstractCache
  • Utils

Arkiv

  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008

Sidor

Kategorier

    AJAX
    Backup
    BigTable
    cache
    Geo
    haloe
    Hibernate
    Javascript
    Job
    Lucene
    Mail
    Monitor
    Monitoring
    MySQL
    optimization
    regex
    release
    SCM
    Server
    sharding
    Spatial
    Tools
    Uncategorized

Prenumerera

RSS Senaste nytt som RSS

External On Disk TupleSorter

There are times when the resultset of an application for example Lucene is just too large to sort in the RAM and Java will throw an OOE. The normal way to solve this is to buy more memory. I frankly think you could make your application a little more stable than that :) How would you for example feel if MySQL core dumped whenever the resultset was too large to sort in memory ?

So what to do when you cannot rely on Collections.sort in every situation ? Dump it to disk and merge sort it there.

Great, it should be hundreds of already released stuff regarding this I figured. Well I have googled a lot and almost no one seems to have published their disk based sorts. The exception is the open source databases like Derby etc but the code is so proprietary that it is almost impossible to reverse engineer back to something usable. Then I finally read an article written by Sammi Larbi which published semi functional code for a csv file sorter. This inspired me to create a generic Comparable Sorter something like the guys at Java Forum are discussing.

However a sorter which don’t handle the data which is connected to the sorted column is quite useless right ? It was then I came up with the idea that you should be able to sort Tuples by their keys and which can contain a payload of whatever (PK, the whole row, ref to a file etc). It took me a day to fix Sammi’s code for the CSV and convert it to a Tuple variant.

Since I already have written a bunch of serializers for primitive types (and some non-primitive as well). I could just hook that system into the tuple serialization process which improves serialization speed up to a 100 times depending on the data type.

It is now included in the trunk of utils-1.4-SNAPSHOT, sample here: TupleSorter

It is actually quite fast to sort. I tested it on my laptop which has a (slow) 5400 rpm disk and 2.0 GHz dual-core CPU and 2G RAM. It sorted 100K Tuple<Integer,Integer> in just above a second. This is’nt too bad at all, I expected worse results but need to test it more.

I still have a lot to learn from Peter Boncz which tells you why you should avoid Tuples (all db’s have tuples) and go for single value arrays. But I’m not smart enough :) Watch the MonetDB/X100 presentation and hopefully you will pick something up from this smart guy.

Etiketter:disk sort, external sort, tuple

3 months, 3 weeks ago , Thursday, July 31st, 2008
Postad i Uncategorized
Feed för det här inlägget
 Lämna svar
 trackback

2 Responses to “External On Disk TupleSorter”

  1. Ken Krugler Says:
    August 10th, 2008 at 8:37 pm

    How does your TupleSorter performance compare to sorting with Hadoop’s MapFile?

    See http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/hadoop/trunk/src/java/org/apache/hadoop/io/MapFile.java for the source.

    We make extensive use of this for sorting large data sets, and so far it’s worked pretty well. There are some limitations to MapFiles (e.g. you can’t append) so it’s not a perfect solution.

    Thanks,

    – Ken

  2. Marcus Herou Says:
    August 11th, 2008 at 3:23 am

    Yep I figured that you Hadoop guys had nice sorting algorithms for the Map Files and I guess the distributed map> process as well. I have not perf tested this sorting algo fully yet. I will do a benchmark with similar data and compare to the MapFile.

    Regarding append: Is’nt this already an open JIRA-CASE ? I think a have seen zillions of frustrated devs at the mailinglist asking for this.

Leave a Reply

*
To prove you're a person (not a spam script), type the security text shown in the picture. Click here to regenerate some new text.
Click to hear an audio file of the anti-spam word

Copyright © 2007 Tailsweep AB

Tailsweep development Blog is proudly powered by WordPress
Entries (RSS) and Comments (RSS).