Spring with Hadoop
Saturday, December 13th, 2008
We have been struggling to find a good way to launch Hadoop jobs while creating and wiring all components with Spring.
We finally arrived at a clean approach: we use the Hadoop Configuration to tell each job which Spring context files it should load.
Example
Client (from where you launch JobClient)
JobConf job = createJob();
job.set("configs", "classpath:ctx1.xml,classpath:ctx2.xml");
…..
Inside the public void configure(JobConf jobConf) method of a Mapper, Reducer or MapRunnable:
String[] configs = jobConf.get("configs").split(",");
ApplicationContext ctx = new ClassPathXmlApplicationContext(configs);
… Extract the beans you want and wire up the job manually, e.g.
this.contentParsers = (ContentParsers) ctx.getBean("contentParsers");
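The plain split(",") above fails with a NullPointerException when the "configs" property is missing, and it passes stray whitespace on to Spring. A minimal sketch of a more defensive parse follows. ContextLocations is a hypothetical helper introduced here for illustration; it is not part of Hadoop or Spring.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hadoop or Spring). */
final class ContextLocations {
    private ContextLocations() {}

    /** Splits a comma-separated property value into trimmed, non-empty locations. */
    static String[] parse(String value) {
        if (value == null) {
            throw new IllegalStateException("job property 'configs' is not set");
        }
        List<String> locations = new ArrayList<>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                locations.add(trimmed); // skip empty entries such as "a,,b"
            }
        }
        if (locations.isEmpty()) {
            throw new IllegalStateException("job property 'configs' lists no context files");
        }
        return locations.toArray(new String[0]);
    }
}
```

In configure(), the result could then be handed to Spring as new ClassPathXmlApplicationContext(ContextLocations.parse(jobConf.get("configs"))).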
For this to work, all context files must be packaged in the jar file that you tell Hadoop to run:
job.setJar(jarFile);
and if you want to add dependency jar files, use:
job.set("tmpjars", "/lib/jar1,/lib/jar2");
The tmpjars must already reside in HDFS before the job runs. To copy them there, use:
${HADOOP_HOME}/bin/hadoop dfs -copyFromLocal your_working_dir/lib /
This puts the directory /lib in the HDFS root, which of course is just an example.
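With more than a couple of dependencies, typing the comma-separated tmpjars value by hand gets error-prone. A small sketch of building it from a list of jar names follows. TmpJars is a hypothetical helper, and the directory and jar names are simply the example values from above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds the comma-separated "tmpjars" value. */
final class TmpJars {
    private TmpJars() {}

    /** Prefixes each jar name with the HDFS directory and joins the paths with commas. */
    static String join(String hdfsLibDir, List<String> jarNames) {
        String base = hdfsLibDir.endsWith("/") ? hdfsLibDir : hdfsLibDir + "/";
        List<String> paths = new ArrayList<>();
        for (String name : jarNames) {
            paths.add(base + name);
        }
        return String.join(",", paths);
    }
}
```

The client could then call job.set("tmpjars", TmpJars.join("/lib", jarNames)), where jarNames is whatever list of dependency jars was copied to HDFS.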
We use the same Spring context files in the dev, stage and prod environments, together with environment-specific property files. The property values are filtered into the context files before they are packaged into the jar.
Example:
--clip context file--
<property name="numberOfUrlsPerCrawl" value="${numberOfUrlsPerCrawl}" />
--clip--
environment.local.properties
numberOfUrlsPerCrawl=100
environment.prod.properties
numberOfUrlsPerCrawl=100000
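The post does not say which tool performs the filtering (build tools commonly offer this kind of resource filtering), so the following is only a sketch of the substitution step itself. ContextFilter is a hypothetical class: it replaces each ${key} placeholder with the value from a java.util.Properties object and fails loudly when a key is missing.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of filtering ${key} placeholders in a context file. */
final class ContextFilter {
    // Matches ${key} and captures the key name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    private ContextFilter() {}

    /** Returns the template with every ${key} replaced by its property value. */
    static String filter(String template, Properties props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = props.getProperty(key);
            if (value == null) {
                throw new IllegalArgumentException("no value for placeholder: " + key);
            }
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Filtering the context line above with the properties from environment.local.properties would yield value="100", and with environment.prod.properties value="100000".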
The client side is, of course, Spring-wired as well.


