Tailsweep

Mammatus released!

September 7th, 2008, written by Marcus Herou

We have renamed the AbstractCache project to Mammatus, since the project is no longer only about caching but about storage in general. Mammatus (bumpy clouds) is a nice name for this kind of project. Check it out.

Posted in Uncategorized  Tags: cache, mammatus, storage
 No comments

Renaming of AbstractCache

August 3rd, 2008, written by Marcus Herou

Since the AbstractCache project is mostly about storing tuples in various implementations, I think that the name AbstractCache is misleading. The implementations CAN be caches but should just be treated as storage mechanisms for Tuples.

Can you come up with a new name for this project? Perhaps something involving the word Tuple?

Posted in cache  Tags: cache, renaming, tuple
 5 comments

BerkeleyDB support

August 3rd, 2008, written by Marcus Herou

I read about using Lucene as a database on the Lucene mailing list. Then someone threw in BerkeleyDB as an alternative. I thought: yeah right, an Oracle DB. It will probably be lightweight and easy to use. NOT!

I was wrong, however: it is easy as hell, and the tutorial on the Oracle web page is super. I’m surprised to see such a full-blown DB, with foreign keys, various constraints etc., having such an easy API.

Check it out

And here is my Tuple implementation

Posted in cache  Tags: berkeley db, cache, tuple
 No comments

Optimized SOLR indexing

August 3rd, 2008, written by Marcus Herou

I noticed a really fast and cool way of posting docs to SOLR on the SOLR mailing list.

Roughly the same config parameters are available in SOLR as in raw Lucene, but instead of committing on RAM usage they commit on time and on the number of documents. You should therefore estimate how much a document weighs on average and adjust maxBufferedDocs accordingly.
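As a rough illustration of that estimate, the sketch below divides a RAM budget by an average document size to get a candidate value for maxBufferedDocs. The class name, the method name and the 50 MB / 2 KB figures are assumptions chosen for the example; they are not taken from SOLR itself.

```java
class BufferedDocsEstimate {

    /**
     * Returns how many documents of the given average size fit into the RAM budget.
     * Both arguments are in bytes. The result is a starting point, not a tuned value.
     */
    static int estimateMaxBufferedDocs(long ramBudgetBytes, long avgDocBytes) {
        if (avgDocBytes <= 0) {
            throw new IllegalArgumentException("avgDocBytes must be positive");
        }
        return (int) Math.min(Integer.MAX_VALUE, ramBudgetBytes / avgDocBytes);
    }

    public static void main(String[] args) {
        long ramBudget = 50L * 1024 * 1024; // assumed budget: 50 MB
        long avgDoc = 2L * 1024;            // assumed average document size: 2 KB
        System.out.println(estimateMaxBufferedDocs(ramBudget, avgDoc)); // prints 25600
    }
}
```

Measuring the average size on a sample of real documents is probably more reliable than guessing it.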

By Jeremy Hinegardner

–clip clip–

If the xml files are available locally on the machine where the solr instances
lie you can instead tell solr to load the file from disk instead of transmitting
the file over http.

You have to set enableRemoteStreaming="true" in the solrconfig.xml and then your
curl request would I think be:

curl -d stream.file=/tmp/post.xml http://localhost:8983/solr/update

–clip clip–

Posted in Uncategorized  Tags: indexing performance, solr
 No comments

Optimized Lucene indexing

August 3rd, 2008, written by Marcus Herou

There are some nice indexing settings, which I think emerged in lucene-2.3.2, that you can use to increase the writing speed.

I use settings like this:

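// Keep only the last commit point and index at most 5000 terms per field
IndexWriter indexWriter = new IndexWriter(directory, this.getAnalyzer(), new KeepOnlyLastCommitDeletionPolicy(), new IndexWriter.MaxFieldLength(5000));

// Flush after 10000 buffered documents or 10000 buffered delete terms ...
indexWriter.setMaxBufferedDocs(10000);
indexWriter.setMaxBufferedDeleteTerms(10000);
indexWriter.setMergeFactor(10);
// Write separate segment files instead of one compound file
indexWriter.setUseCompoundFile(false);
// ... or when the buffered documents use 50 MB of RAM, whichever comes first
indexWriter.setRAMBufferSizeMB(50);
// Run segment merges in background threads
indexWriter.setMergeScheduler(new ConcurrentMergeScheduler());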

With these settings I managed to write 100000 documents of about 2K each in about 12 sec, roughly 8300 docs per sec. Optimization took 1.7 sec. All this on a slow laptop with a 5400 RPM SATA disk.

Update: I forgot to turn on tokenization. With it turned on, indexing took 30 sec and optimizing 6 sec, which is about 3 times slower.

If you are on a version prior to 2.3.2, you can call IndexWriter.ramSizeInBytes() on each add/update of a document to estimate when it is time to flush (commit) the IndexWriter, or create a background job which runs every 5 secs or so and does the same. Before 2.3.2 I typically used a combination of a background thread which committed on a ramThreshold and a foreground emergencyCommitThreshold. This worked really well, but since the code has now moved into the core of Lucene I think it is wise to give the Lucene guys a chance to sort it out.
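A minimal sketch of that two-threshold scheme could look like the following. It is not the original code: FlushableBuffer is a hypothetical stand-in for the two IndexWriter operations involved (reading the RAM usage and flushing), and the class name, thresholds and period are assumptions made for the example.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class ThresholdFlusher implements AutoCloseable {
    private final FlushableBuffer buffer;
    private final long emergencyThreshold;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    ThresholdFlusher(FlushableBuffer buffer, long ramThreshold,
                     long emergencyThreshold, long periodMillis) {
        this.buffer = buffer;
        this.emergencyThreshold = emergencyThreshold;
        // Background job: flush every period if the soft threshold is exceeded.
        scheduler.scheduleWithFixedDelay(() -> flushIfAbove(ramThreshold),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Foreground check: call from the indexing thread after every add/update. */
    void afterAdd() {
        flushIfAbove(emergencyThreshold);
    }

    /** Flushes when the buffer has reached the limit; returns true if it flushed. */
    synchronized boolean flushIfAbove(long limit) {
        if (buffer.ramSizeInBytes() >= limit) {
            buffer.flush();
            return true;
        }
        return false;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        AtomicLong bytes = new AtomicLong();
        AtomicLong flushes = new AtomicLong();
        // Fake buffer that only counts bytes, standing in for a real index writer.
        FlushableBuffer fake = new FlushableBuffer() {
            public long ramSizeInBytes() { return bytes.get(); }
            public void flush() { bytes.set(0); flushes.incrementAndGet(); }
        };
        try (ThresholdFlusher flusher = new ThresholdFlusher(fake, 32_000, 64_000, 5_000)) {
            for (int i = 0; i < 100; i++) {
                bytes.addAndGet(2_000); // pretend a 2 KB document was buffered
                flusher.afterAdd();
            }
        }
        System.out.println("flushes: " + flushes.get());
    }
}

/** Hypothetical abstraction of the writer: how much RAM is buffered, and a flush. */
interface FlushableBuffer {
    long ramSizeInBytes();
    void flush();
}
```

The background job keeps memory use near the soft threshold during normal operation, while the foreground check guards against bursts that arrive between two runs of the job.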

Posted in Uncategorized  Tags: indexing performance, Lucene, solr
 No comments

External On Disk TupleSorter

July 31st, 2008, written by Marcus Herou

There are times when the result set of an application, for example Lucene, is just too large to sort in RAM and Java will throw an OutOfMemoryError. The normal way to solve this is to buy more memory. Frankly, I think you could make your application a little more stable than that :) How would you feel, for example, if MySQL core dumped whenever the result set was too large to sort in memory?

So what do you do when you cannot rely on Collections.sort in every situation? Dump the data to disk and merge sort it there.
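The idea can be sketched roughly as follows: sort fixed-size chunks in memory, write each chunk to a temporary file as a sorted run, then merge the runs with a priority queue. This is an illustrative sketch on plain text lines, not the TupleSorter code; the class name ExternalSort, the temp-file naming and the chunk size are all assumptions.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSort {

    /** One open sorted run plus the line currently at its head. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;
        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }
        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }
        public int compareTo(Run other) { return head.compareTo(other.head); }
    }

    /** Sorts the lines of input into output, holding at most chunkSize lines in memory. */
    static void sort(Path input, Path output, int chunkSize) throws IOException {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));
        }
        merge(runs, output);
    }

    /** Sorts one chunk in memory and dumps it to a temporary file. */
    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        return run;
    }

    /** K-way merge: repeatedly emit the smallest head line among all runs. */
    private static void merge(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Run> queue = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path p : runs) {
                Run r = new Run(p);
                if (r.head != null) queue.add(r); else r.reader.close();
            }
            while (!queue.isEmpty()) {
                Run r = queue.poll();
                out.write(r.head);
                out.newLine();
                if (r.advance()) queue.add(r); else r.reader.close();
            }
        } finally {
            for (Run r : queue) r.reader.close(); // only non-empty after a failure
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("unsorted", ".txt");
        Path out = Files.createTempFile("sorted", ".txt");
        Files.write(in, Arrays.asList("pear", "apple", "fig", "kiwi", "banana"),
                StandardCharsets.UTF_8);
        sort(in, out, 2); // tiny chunk size to force several runs
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
        // prints [apple, banana, fig, kiwi, pear]
    }
}
```

Only one chunk plus one line per run is held in memory at any time, which is what keeps the sort from failing on large inputs.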

Great, I figured, there should be hundreds of already released solutions for this. Well, I have googled a lot and almost no one seems to have published their disk-based sorts. The exception is the open source databases like Derby, but their code is so tied to their own internals that it is almost impossible to reverse engineer it into something usable. Then I finally read an article by Sammi Larbi, who published semi-functional code for a CSV file sorter. This inspired me to create a generic Comparable sorter, something like what the guys at Java Forum are discussing.

However, a sorter which doesn’t handle the data connected to the sorted column is quite useless, right? That was when I came up with the idea that you should be able to sort Tuples by their keys, where each Tuple can carry a payload of whatever you like (the PK, the whole row, a reference to a file etc). It took me a day to fix Sammi’s code for the CSV sorter and convert it to a Tuple variant.

Since I have already written a bunch of serializers for primitive types (and some non-primitive ones as well), I could just hook that system into the tuple serialization process, which improves serialization speed by up to 100 times depending on the data type.
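To give a feel for why type-specific serializers help, the sketch below writes an integer key/value pair as two raw ints and compares the byte count with default Java object serialization. It only compares sizes, not speed, and it is not the serializer code from the utils project; the class IntTupleCodec and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class IntTupleCodec {

    /** Writes key and value as two raw 4-byte ints: 8 bytes, no class metadata. */
    static byte[] encode(int key, int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(8);
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(key);
            out.writeInt(value);
        }
        return bytes.toByteArray();
    }

    /** Reads back the pair written by encode as {key, value}. */
    static int[] decode(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new int[] { in.readInt(), in.readInt() };
        }
    }

    /** Same pair through default Java serialization, for comparison only. */
    static byte[] encodeWithJavaSerialization(int key, int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(Integer.valueOf(key));
            out.writeObject(Integer.valueOf(value));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("raw ints: " + encode(42, 7).length + " bytes");
        System.out.println("Java serialization: "
                + encodeWithJavaSerialization(42, 7).length + " bytes");
    }
}
```

The raw encoding skips the stream header and class descriptors that default serialization writes, so there is less to write to and read from disk for each tuple.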

It is now included in the trunk of utils-1.4-SNAPSHOT, sample here: TupleSorter

It is actually quite fast. I tested it on my laptop, which has a (slow) 5400 rpm disk, a 2.0 GHz dual-core CPU and 2 GB of RAM. It sorted 100K Tuple<Integer,Integer> in just over a second. This isn’t too bad at all; I expected worse results, but I need to test it more.

I still have a lot to learn from Peter Boncz, who explains why you should avoid Tuples (all DBs have tuples) and go for single-value arrays. But I’m not smart enough :) Watch the MonetDB/X100 presentation and hopefully you will pick something up from this smart guy.

Posted in Uncategorized  Tags: disk sort, external sort, tuple
 2 comments

Optimization on the rocks

June 18th, 2008, written by Marcus Herou

We have experienced an increased load on our database server over the last few weeks. After a little tweaking and adding some indices in the right places, we reduced the load by a factor of 10!

Watch and enjoy :)

Posted in Uncategorized   2 comments

Now we are using HBase

May 25th, 2008, written by Marcus Herou

A couple of days ago I wrote about HBase and stated that I would most likely reject HBase for the Tailsweep backend system, since I thought the performance of the underlying HDFS would be an issue. However, I could not resist the urge to create an implementation, because I really believe in the Hadoop and Lucene communities and the purity of the implementations which spring out of them.

I will now give HBase a test run at Tailsweep for storage of millions of feed items. The millions will become billions in the not too distant future, and for that I need a scalable architecture. Even though HBase might not get the optimal round-trip time for random access, I hope the overall throughput and scalability of a couple or more HBase servers will outperform any RDBMS and give the architecture the HA it needs. Frankly, MySQL will not cut it. It is super slow even today with only 10M documents.

I will need to performance test the impl at some point though. More about that later.

I have basically created a HashMap which uses HBase as its internal storage mechanism. This is a part of the AbstractCache project, where all implementations implement the Cache interface, which extends java.util.Map.

The Cache interface has some extra methods, such as keyIterator and valueIterator, typically used when you need to access huge amounts of data. In HBaseCache the iterators use the HScannerInterface to retrieve the keys and values.
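The shape of such an interface can be sketched as below. The post does not show the real signatures, so the generic parameters, the return types and the InMemoryCache class are assumptions; only the names Cache, keyIterator and valueIterator come from the text, and a HashMap stands in for HBase.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** In-memory stand-in: a real implementation would stream from its backing store. */
class InMemoryCache<K, V> extends HashMap<K, V> implements Cache<K, V> {

    @Override
    public Iterator<K> keyIterator() {
        return keySet().iterator();
    }

    @Override
    public Iterator<V> valueIterator() {
        return values().iterator();
    }

    public static void main(String[] args) {
        Cache<String, String> cache = new InMemoryCache<>();
        cache.put("testkey", "testvalue");

        // Ordinary Map access still works ...
        System.out.println(cache.get("testkey"));

        // ... and the iterators allow walking entries without materializing a key set.
        for (Iterator<String> keys = cache.keyIterator(); keys.hasNext(); ) {
            System.out.println(keys.next());
        }
    }
}

/** Assumed shape of the interface: a Map plus streaming iterators. */
interface Cache<K, V> extends Map<K, V> {
    Iterator<K> keyIterator();
    Iterator<V> valueIterator();
}
```

Exposing iterators next to the Map methods lets a store-backed implementation hand out entries one at a time, which matters when keySet() would have to load everything into memory.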

I can recommend that anyone who is new to HBase look at the implementation, since it is fairly straightforward and uses almost all CRUD operations in HBase.

It is built against HBase-0.1.2 and you need that jar on your classpath to get it working.

Example:

// HbaseCache implements java.util.Map, so it can be used through the Map interface
HbaseCache hbaseCache = new HbaseCache();
hbaseCache.setRegion("test");
hbaseCache.init();
Map cache = hbaseCache;
//cache.clear();
cache.put("testkey", "testvalue");
cache.put("testkey2", "testvalue2");

System.out.println(cache.get("testkey"));
System.out.println(cache.get("testkey2"));

// Remove one entry, then read it back and list the remaining keys
cache.remove("testkey");
System.out.println(cache.get("testkey"));
System.out.println(cache.keySet());
System.out.println(hbaseCache.keyIterator().next());
System.out.println(hbaseCache.valueIterator().next());
hbaseCache.destroy();

HbaseCache is part of AbstractCache

Tags: cache, hadoop, hdfs
 2 comments

Blogs are upgraded

May 24th, 2008, written by Marcus Herou

We have now upgraded the blogs to use wp-2.5. We installed Gengo as well; let’s see how well it plays with wp.

Posted in Uncategorized   No comments

Copyright © 2007 Tailsweep AB
