Tailsweep
Posts tagged ‘hadoop’

Spring with Hadoop

Saturday, December 13th, 2008

We have really been struggling to find a way to launch Hadoop jobs and to create and wire all their components with Spring.

We have finally arrived at a nice way of doing this: we use the Hadoop Configuration to tell the jobs which Spring context files they should load.

Example

Client (from where you launch JobClient)

JobConf job = createJob();

job.set("configs", "classpath:ctx1.xml,classpath:ctx2.xml");

…..

Inside the public void configure(JobConf jobConf) method of a Mapper, Reducer or MapRunnable:

String[] configs = jobConf.get("configs").split(",");
ApplicationContext ctx = new ClassPathXmlApplicationContext(configs);

…Extract the beans you want and manually wire up the Job. e.g.

this.contentParsers = (ContentParsers) ctx.getBean("contentParsers");
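
The hand-off between client and task boils down to joining the context locations into one comma-separated property and splitting it again in configure(). Here is a minimal sketch of that round trip. It uses java.util.Properties as a stand-in for JobConf so that it runs without Hadoop or Spring on the classpath. The names ContextLocations, store and load are made up for illustration and are not part of Hadoop or Spring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper; Properties stands in for Hadoop's JobConf here.
final class ContextLocations {
    static final String KEY = "configs";

    // Client side: store the locations as one comma-separated value.
    static void store(Properties conf, List<String> locations) {
        conf.setProperty(KEY, String.join(",", locations));
    }

    // Task side: split the value back into individual locations,
    // trimming whitespace and skipping empty entries.
    static String[] load(Properties conf) {
        String raw = conf.getProperty(KEY, "");
        List<String> result = new ArrayList<>();
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result.toArray(new String[0]);
    }
}
```

In the real job, the array returned by load is what would be handed to new ClassPathXmlApplicationContext(...). Trimming and skipping empty entries makes the lookup tolerant of a stray space or trailing comma in the property value.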

For this to work you need to have all configurations in your jar-file which you tell hadoop to run with:

job.setJar(jarFile);

and if you want to add some dependency jar files use:

job.set("tmpjars", "/lib/jar1,/lib/jar2");

where the tmpjars must already reside in HDFS before the job runs. To copy them there, use:

${HADOOP_HOME}/bin/hadoop dfs -copyFromLocal your_working_dir/lib /

This puts the lib directory in the HDFS root as /lib, which of course is just an example.

We use the same Spring context files in the dev, stage and prod environments. Environment-specific property files are used to filter the context files before they are packaged into the jar.

Example:

—clip context file—

<property name="numberOfUrlsPerCrawl" value="${numberOfUrlsPerCrawl}" />

—clip—

environment.local.properties

numberOfUrlsPerCrawl=100

environment.prod.properties

numberOfUrlsPerCrawl=100000
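
The post does not say which tool performs the filtering (build tools such as Ant and Maven both offer it), so what follows is only a sketch of the idea in plain Java. It replaces each ${key} placeholder with the value from an environment's properties. ContextFilter and filter are hypothetical names, and leaving unknown placeholders untouched is an assumption, not something the post specifies.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical build-time filter: replaces ${key} with values from a Properties object.
final class ContextFilter {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String filter(String template, Properties env) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            // Assumption: unknown placeholders are left as-is so missing keys are easy to spot.
            String value = env.getProperty(key, m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties prod = new Properties();
        prod.setProperty("numberOfUrlsPerCrawl", "100000");
        String line = "<property name=\"numberOfUrlsPerCrawl\" value=\"${numberOfUrlsPerCrawl}\" />";
        System.out.println(filter(line, prod));
    }
}
```

Running the filter once per environment, each time with a different property file, yields one ready-to-package context file per environment from the same template.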

The client side is of course Spring-wired as well.

Tags: hadoop, spring

Now we are using HBase

Sunday, May 25th, 2008

I wrote about HBase a couple of days ago and stated that I would most likely reject it for the Tailsweep backend system, since I thought the performance of the underlying HDFS would be an issue. However, I could not resist the urge to create an implementation, because I really believe in the Hadoop and Lucene communities and the purity of the implementations that spring from them.

I will now give HBase a test run at Tailsweep for storing millions of feed items. The millions will become billions in a not too distant future, and for that I need a scalable architecture. Even though HBase might not give the optimal round-trip time (RTT) for random access, I hope the overall throughput and scalability of a couple or more HBase servers will outperform any RDBMS and give the architecture the high availability (HA) it needs. Frankly, MySQL will not cut it. It is super slow even today with only 10M documents.

I will need to performance-test the implementation at some point, though. More about that later.

I have basically created a Map implementation, much like a HashMap, that uses HBase as its internal storage mechanism. It is part of the AbstractCache project, where all implementations implement the Cache interface, which extends java.util.Map.

The Cache interface has some extra methods, such as keyIterator and valueIterator, which are typically used when you need to access huge amounts of data. In HbaseCache the iterators use the HScannerInterface to retrieve the keys and values.
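
To make the shape of such an interface concrete, here is a minimal sketch. It is not the actual AbstractCache code: the generic signatures are assumptions, and InMemoryCache is a hypothetical stand-in backed by a HashMap rather than HBase. The idea behind the iterators is that a caller can walk keys or values one at a time instead of asking for a full keySet() of a very large store.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical shape of the interface; the real AbstractCache signatures may differ.
interface Cache<K, V> extends Map<K, V> {
    Iterator<K> keyIterator();
    Iterator<V> valueIterator();
}

// In-memory stand-in. An HBase-backed version would stream from a scanner instead.
final class InMemoryCache<K, V> extends HashMap<K, V> implements Cache<K, V> {
    private static final long serialVersionUID = 1L;

    @Override
    public Iterator<K> keyIterator() {
        return keySet().iterator();
    }

    @Override
    public Iterator<V> valueIterator() {
        return values().iterator();
    }
}
```

Because Cache extends Map, any code written against java.util.Map keeps working, while code that has to scan everything can use the iterators instead.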

I can recommend that anyone who is new to HBase look at the implementation, since it is fairly straightforward and uses almost all CRUD operations in HBase.

It is built against HBase-0.1.2, and you need that jar on your classpath to get it working.

Example:

HbaseCache hbaseCache = new HbaseCache();
hbaseCache.setRegion("test");
hbaseCache.init();
Map cache = hbaseCache;
//cache.clear();
cache.put("testkey", "testvalue");
cache.put("testkey2", "testvalue2");

System.out.println(cache.get("testkey"));
System.out.println(cache.get("testkey2"));

cache.remove("testkey");
System.out.println(cache.get("testkey"));
System.out.println(cache.keySet());
System.out.println(hbaseCache.keyIterator().next());
System.out.println(hbaseCache.valueIterator().next());
hbaseCache.destroy();

HbaseCache is part of AbstractCache

Tags: cache, hadoop, , hdfs

Copyright © 2007 Tailsweep AB
