Tailsweep
Svenska UK

Meny

  • Hem
  • Tailsweep
  • Tailsweep Blog Search
  • Tailsweeps Blogg
  • Google group
  • AddThis Social Bookmark Button

Projekt

  • Mammatus
  • Parhely
  • Haloe
  • AbstractCache
  • Utils

Arkiv

  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008

Sidor

Kategorier

    AJAX
    Backup
    BigTable
    Browser
    cache
    Geo
    haloe
    Hibernate
    Javascript
    Job
    Lucene
    Mail
    Monitor
    Monitoring
    MySQL
    optimization
    regex
    release
    SCM
    Server
    sharding
    Spatial
    Tools
    Uncategorized

Prenumerera

RSS Senaste nytt som RSS

Arkiv för July, 2008

External On Disk TupleSorter

Thursday, July 31st, 2008

There are times when the resultset of an application for example Lucene is just too large to sort in the RAM and Java will throw an OOE. The normal way to solve this is to buy more memory. I frankly think you could make your application a little more stable than that :) How would you for example feel if MySQL core dumped whenever the resultset was too large to sort in memory ?

So what to do when you cannot rely on Collections.sort in every situation ? Dump it to disk and merge sort it there.

Great, it should be hundreds of already released stuff regarding this I figured. Well I have googled a lot and almost no one seems to have published their disk based sorts. The exception is the open source databases like Derby etc but the code is so proprietary that it is almost impossible to reverse engineer back to something usable. Then I finally read an article written by Sammi Larbi which published semi functional code for a csv file sorter. This inspired me to create a generic Comparable Sorter something like the guys at Java Forum are discussing.

However a sorter which don’t handle the data which is connected to the sorted column is quite useless right ? It was then I came up with the idea that you should be able to sort Tuples by their keys and which can contain a payload of whatever (PK, the whole row, ref to a file etc). It took me a day to fix Sammi’s code for the CSV and convert it to a Tuple variant.

Since I already have written a bunch of serializers for primitive types (and some non-primitive as well). I could just hook that system into the tuple serialization process which improves serialization speed up to a 100 times depending on the data type.

It is now included in the trunk of utils-1.4-SNAPSHOT, sample here: TupleSorter

It is actually quite fast to sort. I tested it on my laptop which has a (slow) 5400 rpm disk and 2.0 GHz dual-core CPU and 2G RAM. It sorted 100K Tuple<Integer,Integer> in just above a second. This is’nt too bad at all, I expected worse results but need to test it more.

I still have a lot to learn from Peter Boncz which tells you why you should avoid Tuples (all db’s have tuples) and go for single value arrays. But I’m not smart enough :) Watch the MonetDB/X100 presentation and hopefully you will pick something up from this smart guy.

Tags: disk sort, external sort, tuple
Postad i Uncategorized | 3 Comments »

Copyright © 2007 Tailsweep AB

Tailsweep development Blog is proudly powered by WordPress
Entries (RSS) and Comments (RSS).