Tailsweep

Archive for the category ‘Uncategorized’


Replication in Mammatus

Sunday, December 14th, 2008

I have created a way of replicating state which is similar to MySQL replication.

We have several cases where we want to update a Btree on a central server and then have it replicated across all slave nodes.

Today we serialize a HashMap to disk and rsync it, and when a slave notices that the underlying file has changed it re-initializes itself from that file. This works, but it is not a smart way of doing it, since the slave needs to reload the entire state even though just one entry has been added. To solve that you need to add transaction logging and replicate those transactions.

So how does it work?

* TransactionLogger needs to be initialized on both master and slave.

* You write to the master file.

* The slave polls the master and sends its latest sequence number (trx id), called X.

* The master sends the delta entries from X to Y, where Y is the latest entry noted on the master when the slave initiated the request.
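
The polling steps above can be sketched in plain Java. This is not the Mammatus code: the names Master, Slave, Entry and deltaSince are made up for illustration, the network call is replaced by a direct method call, and the log lives in memory instead of in a file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeltaReplicationSketch {

    /** One logged write; seq plays the role of the trx id. */
    static final class Entry {
        final long seq;
        final String key;
        final Object value;

        Entry(long seq, String key, Object value) {
            this.seq = seq;
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical master: applies each write and appends it to a log. */
    static final class Master {
        private final Map<String, Object> store = new HashMap<>();
        private final List<Entry> log = new ArrayList<>();

        synchronized void put(String key, Object value) {
            store.put(key, value);
            log.add(new Entry(log.size() + 1, key, value)); // seq starts at 1
        }

        /** Returns the entries after seq x, up to the latest seq at call time. */
        synchronized List<Entry> deltaSince(long x) {
            return new ArrayList<>(log.subList((int) x, log.size()));
        }
    }

    /** Hypothetical slave: remembers the last seq it applied and asks for the rest. */
    static final class Slave {
        private final Map<String, Object> store = new HashMap<>();
        private long lastSeq = 0;

        void poll(Master master) {
            for (Entry e : master.deltaSince(lastSeq)) {
                store.put(e.key, e.value);
                lastSeq = e.seq;
            }
        }

        Object get(String key) {
            return store.get(key);
        }

        long lastSeq() {
            return lastSeq;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave();
        master.put("testing", "first");
        slave.poll(master);              // transfers one entry
        master.put("testing", "second");
        slave.poll(master);              // transfers only the new entry
        System.out.println(slave.get("testing") + " at seq " + slave.lastSeq());
    }
}
```

The point of the sequence number is that the second poll only ships the one new entry instead of the whole map.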

I wrote the transaction loggers as separate modules so you need to wire them up to make the storage synchronized.

On the slave you need a StateChangeListener and on the master you need to wrap the storage engine in a TransactionLoggerCacheStrategy.

Here is a fully working example Spring context file.

Example code:

public static void main(String[] args)
{
    String[] cfg = {"logManager.xml"};
    ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext(cfg);
    Cache cacheMaster = (Cache)ctx.getBean("masterCache");
    Cache cacheSlave = (Cache)ctx.getBean("slaveCache");

    cacheMaster.put("testing", new Date());
    while(true)
    {
        Date date = (Date)cacheSlave.get("testing");
        if(date != null)
        {
            System.out.println("Huzza!");
            System.exit(0);
        }
        try
        {
            Thread.sleep(1000);
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }
    }
}

Tags: mammatus, master
Posted in Uncategorized | No Comments »

Spring with Hadoop

Saturday, December 13th, 2008

We have really been struggling to find a way of launching Hadoop jobs while creating and wiring all components with Spring.

Finally we have arrived at a nice way of doing this, where we use the Hadoop Configuration to tell the jobs which Spring context files they should use.

Example

Client (from where you launch JobClient)

JobConf job = createJob();

job.set("configs", "classpath:ctx1.xml,classpath:ctx2.xml");

…..

Inside the public void configure(JobConf jobConf) method of a Mapper, Reducer or MapRunnable:

String[] configs = jobConf.get("configs").split(",");
ApplicationContext ctx = new ClassPathXmlApplicationContext(configs);

…Extract the beans you want and manually wire up the Job, e.g.

this.contentParsers = (ContentParsers)ctx.getBean("contentParsers");

For this to work you need to have all configurations in your jar-file which you tell hadoop to run with:

job.setJar(jarFile);

and if you want to add some dependency jar files use:

job.set("tmpjars", "/lib/jar1,/lib/jar2");

The tmpjars must reside in HDFS before you run the job. To copy them there, use:

${HADOOP_HOME}/bin/hadoop dfs -copyFromLocal your_working_dir/lib /

This will put the dir /lib in the HDFS root, which of course is just an example.

We use the same Spring context files in the dev, stage and prod environments. Environment-specific property files are used to filter the context files before they are packaged into the jar.

Example:

—clip context file—

<property name="numberOfUrlsPerCrawl" value="${numberOfUrlsPerCrawl}" />

—clip—

environment.local.properties

numberOfUrlsPerCrawl=100

environment.prod.properties

numberOfUrlsPerCrawl=100000
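
To make the filtering step concrete, here is a minimal sketch in plain Java of what such a filter does. It is only an illustration: the post does not say which build tool performs the filtering, and the class ContextFilterSketch and its filter method are made-up names.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextFilterSketch {

    // Matches ${name} and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value from props; unknown names are left untouched. */
    static String filter(String template, Properties props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Fall back to the placeholder itself when the property is missing.
            String value = props.getProperty(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stands in for environment.prod.properties.
        Properties prod = new Properties();
        prod.setProperty("numberOfUrlsPerCrawl", "100000");

        String line = "<property name=\"numberOfUrlsPerCrawl\" value=\"${numberOfUrlsPerCrawl}\" />";
        System.out.println(filter(line, prod));
    }
}
```

Running the same context file through the filter with the local properties instead would give value="100", so the jar for each environment carries its own settings.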

The client side is of course Spring-wired as well.

Tags: hadoop, spring
Posted in Uncategorized | No Comments »

Firefox swaps contents between iframes

Wednesday, November 26th, 2008

More and more users are switching over to Firefox these days, and we think that’s all good since we like the browser very much ourselves. But it has a bug in how it handles dynamically (JavaScript) rendered iframes. In short, under certain circumstances Firefox swaps contents between iframes.
When debugging with Firebug one can see that the iframe src doesn’t match the content actually shown in the iframe.

Since we, as a lot of other ad networks, display our ads in an iframe this can result in an ad from our network ending up in a placement belonging to another ad network and vice versa. This isn’t a good thing for either advertisers or site owners/bloggers.

A workaround for this is to reload the iframes onload.

(iframe.src=iframe.src)

But this isn’t good enough, mainly because it will cause 2 ad impressions per page impression, which gives incorrect statistics.
Of course we could append some flag to the src URL when reloading.

(e.g. dontCountThis=1)

But we also run third party ad scripts from several other ad networks…

So since there’s no sufficient solution that we can implement ourselves, we would like to encourage you all to go and vote for a fix for this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=388714

Posted in Browser, Uncategorized | 1 Comment »

Mammatus uses Haloe

Sunday, September 14th, 2008

We have got rid of the dependency on core-lucene in Haloe, and the LuceneMap now uses Haloe instead.

Tags: haloe, mammatus
Posted in Uncategorized | 2 Comments »

Haloe released!

Monday, September 8th, 2008

We are glad to announce that we have released yet another open source package, named Haloe. Do you want to ease the search integration of your code? Haloe addresses just that and currently provides a few different implementations such as plain Lucene, SOLR etc.

Tags: haloe, Lucene, solr
Posted in Uncategorized | 2 Comments »

Parhely released!

Monday, September 8th, 2008

I’m glad to announce that we have finally released Parhely, which is an ORM for HBase.

Posted in Uncategorized | No Comments »

Mammatus released!

Sunday, September 7th, 2008

We have renamed the AbstractCache project to Mammatus since the project isn’t only about caching anymore but storage in general. Mammatus (bumpy clouds) is a nice name for this kind of project. Check it out.

Tags: cache, mammatus, storage
Posted in Uncategorized | No Comments »

Optimized SOLR indexing

Sunday, August 3rd, 2008

I noticed a really fast and cool way of posting docs to SOLR on the solr mailing list.

Roughly the same config parameters are available in SOLR as in raw Lucene, but instead of committing on RAM usage they commit on time and on the number of documents. You should therefore estimate how much a document weighs on average and adjust maxBufferedDocs accordingly.
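
As a back-of-the-envelope illustration of that estimate, here is a small sketch. The 50 MB budget and the 2 KB average document size are assumptions chosen for the example, and the real in-memory footprint of a buffered document will differ from its raw size, so treat the result as a starting point to tune from.

```java
public class BufferSizingSketch {

    /** Number of documents that fit in the RAM budget at the given average size. */
    static int maxBufferedDocs(long ramBudgetBytes, long avgDocBytes) {
        return (int) (ramBudgetBytes / avgDocBytes);
    }

    public static void main(String[] args) {
        long budget = 50L * 1024 * 1024; // assumed RAM budget: 50 MB
        long avgDoc = 2L * 1024;         // assumed average document size: 2 KB
        System.out.println(maxBufferedDocs(budget, avgDoc)); // prints 25600
    }
}
```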

By Jeremy Hinegardner

–clip clip–

If the xml files are available locally on the machine where the solr instances
lie you can instead tell solr to load the file from disk instead of transmitting
the file over http.

You have to set enableRemoteStreaming="true" in the solrconfig.xml and then your
curl request would I think be:

curl -d stream.file=/tmp/post.xml http://localhost:8983/solr/update

–clip clip–

Tags: indexing performance, solr
Posted in Uncategorized | No Comments »

Optimized Lucene indexing

Sunday, August 3rd, 2008

There are some nice indexing settings which I think emerged in lucene-2.3.2 which you can use to increase the writing speed.

I use settings like this:

IndexWriter indexWriter = new IndexWriter(directory, this.getAnalyzer(), new KeepOnlyLastCommitDeletionPolicy(), new IndexWriter.MaxFieldLength(5000));

indexWriter.setMaxBufferedDocs(10000);
indexWriter.setMaxBufferedDeleteTerms(10000);
indexWriter.setMergeFactor(10);
indexWriter.setUseCompoundFile(false);
indexWriter.setRAMBufferSizeMB(50);
indexWriter.setMergeScheduler(new ConcurrentMergeScheduler());

With these settings I managed to write 100000 documents of about 2K each in about 12 sec, roughly 8000 docs per sec. Optimization took 1.7 sec. All this on a slow laptop with a 5400 RPM SATA disk.

Update: I forgot to turn on tokenization. With it on, indexing took 30 sec and optimization 6 sec, which is about 3 times slower.

If you are on a version prior to 2.3.2 you can call IndexWriter.ramSizeInBytes() on each add/update of a document to estimate when it is time to flush (commit) the IndexWriter, or create a background job which runs every 5 secs or so and does the same. Before 2.3.2 I typically used a combination of a background thread which committed on a ramThreshold and a foreground emergencyCommitThreshold. This worked really well, but since the code has now moved into the core of Lucene I think it is wise to give the Lucene guys a chance to sort it out.
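
The two-threshold scheme described above could look roughly like this. It is a sketch, not the code I used: BufferedWriterLike is a hypothetical stand-in for the writer (only the method name ramSizeInBytes is taken from Lucene), and FlushPolicySketch, backgroundCheck and afterAdd are made-up names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FlushPolicySketch {

    /** Hypothetical stand-in for the writer: reports buffered RAM and can flush it. */
    interface BufferedWriterLike {
        long ramSizeInBytes();
        void flush();
    }

    private final BufferedWriterLike writer;
    private final long ramThreshold;       // soft limit, checked in the background
    private final long emergencyThreshold; // hard limit, checked on every add

    FlushPolicySketch(BufferedWriterLike writer, long ramThreshold, long emergencyThreshold) {
        this.writer = writer;
        this.ramThreshold = ramThreshold;
        this.emergencyThreshold = emergencyThreshold;
    }

    /** Background check: flush once the soft threshold is reached. */
    synchronized boolean backgroundCheck() {
        return flushIfAtLeast(ramThreshold);
    }

    /** Foreground check, to be called after each add/update: flush only in an emergency. */
    synchronized boolean afterAdd() {
        return flushIfAtLeast(emergencyThreshold);
    }

    private boolean flushIfAtLeast(long limit) {
        if (writer.ramSizeInBytes() >= limit) {
            writer.flush();
            return true;
        }
        return false;
    }

    /** Schedules backgroundCheck every periodSeconds; the caller shuts the executor down. */
    ScheduledExecutorService startBackground(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                backgroundCheck();
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

The foreground check keeps a burst of adds from blowing the heap between two background runs, while the background job handles the normal case without slowing down the adding thread.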

Tags: indexing performance, Lucene, solr
Posted in Uncategorized | No Comments »

External On Disk TupleSorter

Thursday, July 31st, 2008

There are times when the result set of an application, for example Lucene, is just too large to sort in RAM, and Java will throw an OutOfMemoryError. The normal way to solve this is to buy more memory. Frankly, I think you could make your application a little more stable than that :) How would you feel, for example, if MySQL core dumped whenever the result set was too large to sort in memory?

So what do you do when you cannot rely on Collections.sort in every situation? Dump the data to disk and merge sort it there.

Great, there should be hundreds of already released implementations of this, I figured. Well, I have googled a lot and almost no one seems to have published their disk-based sorts. The exception is the open source databases like Derby etc., but that code is so tied to each database's internals that it is almost impossible to reverse engineer it into something usable. Then I finally read an article written by Sammi Larbi, who published semi-functional code for a CSV file sorter. This inspired me to create a generic Comparable sorter, something like what the guys at Java Forum are discussing.

However, a sorter which doesn’t handle the data connected to the sorted column is quite useless, right? That is when I came up with the idea that you should be able to sort Tuples by their keys, where each Tuple can carry a payload of whatever you like (PK, the whole row, ref to a file etc). It took me a day to fix Sammi’s code for the CSV sorter and convert it to a Tuple variant.
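
To show the general technique, here is a minimal sketch of an external merge sort over key/payload tuples. It is not the TupleSorter from utils: the class names are invented, the key is fixed to int and the payload to String, and the final merge returns a list for brevity where a real sorter would stream the result to disk.

```java
import java.io.*;
import java.util.*;

public class ExternalTupleSortSketch {

    /** Hypothetical tuple: sorted by key, payload just rides along. */
    static final class Tuple {
        final int key;
        final String payload;
        Tuple(int key, String payload) { this.key = key; this.payload = payload; }
    }

    static final Comparator<Tuple> BY_KEY = Comparator.comparingInt(t -> t.key);

    /** Cursor over one sorted run file; head is the next unread tuple or null at EOF. */
    static final class Run {
        final DataInputStream in;
        Tuple head;
        Run(File f) throws IOException {
            in = new DataInputStream(new BufferedInputStream(new FileInputStream(f)));
            advance();
        }
        void advance() throws IOException {
            try {
                head = new Tuple(in.readInt(), in.readUTF());
            } catch (EOFException e) {
                head = null;
                in.close();
            }
        }
    }

    /** Sorts one chunk in memory and writes it to a temp file as a sorted run. */
    static File writeRun(List<Tuple> chunk) throws IOException {
        chunk.sort(BY_KEY);
        File f = File.createTempFile("run", ".bin");
        f.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (Tuple t : chunk) {
                out.writeInt(t.key);
                out.writeUTF(t.payload);
            }
        }
        return f;
    }

    /** Sorts the input while holding at most chunkSize tuples in memory per run. */
    static List<Tuple> sort(Iterator<Tuple> input, int chunkSize) {
        try {
            // Phase 1: cut the input into sorted runs on disk.
            List<File> runs = new ArrayList<>();
            List<Tuple> chunk = new ArrayList<>();
            while (input.hasNext()) {
                chunk.add(input.next());
                if (chunk.size() == chunkSize) {
                    runs.add(writeRun(chunk));
                    chunk = new ArrayList<>();
                }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));

            // Phase 2: k-way merge, always taking the run with the smallest head.
            PriorityQueue<Run> heap =
                    new PriorityQueue<>((a, b) -> BY_KEY.compare(a.head, b.head));
            for (File f : runs) {
                Run r = new Run(f);
                if (r.head != null) heap.add(r);
            }
            List<Tuple> result = new ArrayList<>(); // a real sorter would stream this out
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                result.add(r.head);
                r.advance();
                if (r.head != null) heap.add(r);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Tuple> data = new ArrayList<>();
        for (int k : new int[] {5, 3, 9, 1, 7}) data.add(new Tuple(k, "row-" + k));
        for (Tuple t : sort(data.iterator(), 2)) System.out.println(t.key + " " + t.payload);
    }
}
```

Only one tuple per run has to be in memory during the merge, which is what keeps the memory use bounded no matter how large the input is.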

Since I had already written a bunch of serializers for primitive types (and some non-primitive ones as well), I could just hook that system into the tuple serialization process, which improves serialization speed by up to 100 times depending on the data type.

It is now included in the trunk of utils-1.4-SNAPSHOT, sample here: TupleSorter

It is actually quite fast to sort. I tested it on my laptop, which has a (slow) 5400 rpm disk, a 2.0 GHz dual-core CPU and 2 GB RAM. It sorted 100K Tuple<Integer,Integer> in just above a second. This isn’t too bad at all; I expected worse results, but I need to test it more.

I still have a lot to learn from Peter Boncz, who explains why you should avoid tuples (all DBs have tuples) and go for single-value arrays. But I’m not smart enough :) Watch the MonetDB/X100 presentation and hopefully you will pick something up from this smart guy.

Tags: disk sort, external sort, tuple
Posted in Uncategorized | 3 Comments »


Copyright © 2007 Tailsweep AB
