How do you change the per-process ulimit without rebooting? We have not found a direct way, but we do have a workaround.
As root:
ulimit -n 65536
su $user -p
Done! The -p preserves root's environment.
Tailsweep is developing at an enormous pace and we need to strengthen our development team with more developers.
Tailsweep is a data-driven company that handles large amounts of data in every aspect. If you have experience writing programs that process large amounts of data (preferably with the technologies mentioned below), or simply have the following two simple qualities:
then you are most likely the right person for the job and you will enjoy working with us. The "requirements" mentioned below are only there to give a hint about which technologies we use. Above all we are looking for people who fit in at the company, who love to develop software and who are good at it. Everything else is really beside the point.
The three areas you will work in are:
Experience with the technologies mentioned below earns you a gold star:
The language we mainly develop in is Java, so it is important that you master it, but other specialized skills also weigh heavily, for example experience with a search engine, a statistics system or similar.
We write essentially all our templates in Velocity, so it is of course a plus if you have seen that template language before.
We run, develop and work on Ubuntu Linux. We use the same OS locally as on the production platform to make sure that no strange OS-related bugs that could not be tested locally find their way into prod.
Other technical skills that count in your favor
I also list some other tools and technologies that are used heavily but are only trivia in this context
Examples of projects to get started with at Tailsweep
Sound interesting? Then you will like working at Tailsweep.
Send an email to job at tailsweep.com with your CV and I will contact you and set up a meeting.
Best regards
//Marcus Herou, CTO Tailsweep AB
Do you add dependency support for your Hadoop jobs by configuring the "tmpjars" property?
This means that your jar files need to be located on HDFS and are loaded by Hadoop at runtime.
If you do, your app will be significantly slower to start. You can reduce the startup time from 1 minute to less than 10 seconds by patching the mapred/org/apache/hadoop/mapred/TaskRunner.java class to load the files from a local repository instead of from HDFS.
Find the place where the classpath is built in that source file (line 272 in hadoop-0.18.3) and insert the code snippet at the marked position:
classPath.append(sep);
classPath.append(workDir);
–SNIPPET_HERE–
// Build exec child jmv args.
Vector<String> vargs = new Vector<String>(8);
File jvm = // use same jvm as parent
new File(new File(System.getProperty("java.home"), "bin"), "java");
vargs.add(jvm.toString());
Here is the snippet:
<code>
String additionalClassPath = conf.get("mapred.additional.class.path");
if (additionalClassPath != null)
{
    String[] localfiles = additionalClassPath.split(",");
    for(int i = 0; i < localfiles.length; i++)
    {
        String localfile = localfiles[i].trim();
        LOG.info("Adding " + localfile);
        classPath.append(sep);
        classPath.append(localfile);
    }
}
</code>
Then just build the new Hadoop jar by issuing "ant jar". Make sure that you have the same jar on all nodes as well as on the jobtracker.
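The splitting and appending step of the patch can be tried outside Hadoop. The following is a minimal, self-contained sketch of the same logic. The class name AdditionalClassPath and the method append are made up for illustration and are not part of Hadoop, and the sep parameter stands in for the separator variable used in TaskRunner. Unlike the snippet above, the sketch also skips empty entries, which is our own small addition.

```java
/** Stand-alone illustration of the classpath patch; not part of Hadoop. */
final class AdditionalClassPath {

    /**
     * Appends every entry of a comma-separated list to an existing classpath.
     * A null list (property not set) leaves the classpath unchanged.
     */
    static String append(String classPath, String additionalClassPath, String sep) {
        StringBuilder result = new StringBuilder(classPath);
        if (additionalClassPath != null) {
            for (String entry : additionalClassPath.split(",")) {
                String localFile = entry.trim();
                if (localFile.isEmpty()) {
                    continue; // skip empty entries, e.g. from a doubled comma
                }
                result.append(sep).append(localFile);
            }
        }
        return result.toString();
    }
}
```

With the hypothetical paths /opt/lib/a.jar and /opt/lib/b.jar, calling AdditionalClassPath.append("/work", "/opt/lib/a.jar, /opt/lib/b.jar", ":") returns /work:/opt/lib/a.jar:/opt/lib/b.jar. On the client you would then set the same comma-separated list on the job with job.set("mapred.additional.class.path", ...), and the listed files must exist at the same local path on every node.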
We proudly announce that Mammatus has support for transactional replication of configurable KeyValueStore(s), something similar to Cassandra (where is it these days?) or Voldemort.
Our pagehit/adhit tracking services at script.tailsweep.com use this feature and handle about 1000 web requests per second, so you can say that it is quite stress tested :).
Look in the MasterSlaveTest class for examples.
I really love having backups, but hate paying for them, since deep down in my gut it somehow feels like wasted money. So how do you get the most bang for the buck?
Buy some simple 1TB USB2 drives, plug them into one of your servers and mount them as regular drives. Simple as that.
Want to have RAID? No problem, this is what we did.
Find the device names by issuing:
sudo fdisk -l
The two drives came out as /dev/sdb1 and /dev/sdc1
Here is the magic:
mknod /dev/md0 b 9 0
mdadm -C -v /dev/md0 -l 1 -n 2 /dev/sdb1 /dev/sdc1
mkfs.ext3 -L/usb_drive1 /dev/md0
tune2fs -c 0 /dev/md0
tune2fs -i 0 /dev/md0
tune2fs -o journal_data_writeback /dev/md0
Mount it.
mount /dev/md0 /srv/backup
That is really it
This is how it looks now in our cabinet, really ugly but what the heck, who cares haha.
We have now started to experiment with Hive. It makes perfect sense, since what we have built internally is basically Hive, but in the form of zillions of Hadoop jobs.
How nice would it be to just clean your data, create a CSV version of the actual log, inject it into Hive and then apply various SQL commands which output the results in a format of your choice?
Sounds like a data warehouse? Well, it more or less is, but it has the computing power of all machines in the cluster, which makes it very useful. We are currently using MonetDB and it is blazing fast, but it performs poorly on a machine with little memory (which is no surprise) and it also claims all the memory it can find, so we limit it with some tricks to keep it from swapping out the machine completely.
We had trouble figuring out how to get Solr to handle external scores. Thanks to Grant Ingersoll and Yonik Seeley we have now figured this out.
The solution: ExternalFileField + FunctionQuery
This is how I tested this setup.
# solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib">
<cores adminPath="/admin/cores">
<core name="test" instanceDir="test" />
</cores>
</solr>
# Schema, a pkId (blog entry) belongs to a blogId (the blog)
<schema name="test" version="1.1">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="integer" class="solr.IntField" omitNorms="true"/>
<fieldType name="float" class="solr.FloatField" omitNorms="true"/>
<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="float"/>
<fieldType name="blogRankFile" keyField="blogId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="float"/>
</types>
<fields>
<field name="pkId" type="string" indexed="true" stored="true" required="true" />
<field name="blogId" type="integer" indexed="true" stored="true" required="true" />
<field name="entryRank" type="entryRankFile" />
<field name="blogRank" type="blogRankFile" />
</fields>
<uniqueKey>pkId</uniqueKey>
<defaultSearchField>pkId</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
</schema>
# dataDir/external_blogRank.txt
1=2.0
2=1.0
3=3.0
4=1.0
# Add doc file, save it as /tmp/add.xml
<add>
<doc><field name="pkId">1</field><field name="blogId">1</field></doc>
<doc><field name="pkId">2</field><field name="blogId">1</field></doc>
<doc><field name="pkId">3</field><field name="blogId">2</field></doc>
<doc><field name="pkId">4</field><field name="blogId">3</field></doc>
<doc><field name="pkId">5</field><field name="blogId">4</field></doc>
</add>
# Add some data
curl http://127.0.0.1:8110/solr/test/update --data-binary @/tmp/add.xml -H "Content-Type: text/xml"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">239</int></lst>
</response>
# Commit
curl http://127.0.0.1:8110/solr/test/update -H "Content-Type: text/xml" --data-binary '<commit />'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">6</int></lst>
</response>
# Issue query, should return all entries, those with the highest blogRank first
mahe@mahe-laptop:~$ GET "http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q=*:* _val_:\"log(blogRank)\""
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
<lst name="params">
<str name="start">0</str>
<str name="indent">on</str>
<str name="q">*:* _val_:"log(blogRank)"</str>
<str name="rows">100</str>
</lst>
</lst>
<result name="response" numFound="5" start="0">
<doc>
<int name="blogId">3</int>
<str name="pkId">4</str>
</doc>
<doc>
<int name="blogId">1</int>
<str name="pkId">1</str>
</doc>
<doc>
<int name="blogId">1</int>
<str name="pkId">2</str>
</doc>
<doc>
<int name="blogId">2</int>
<str name="pkId">3</str>
</doc>
<doc>
<int name="blogId">4</int>
<str name="pkId">5</str>
</doc>
</result>
</response>
Badabom badabing!
Update:
An even better query (thanks to Yonik), which takes the actual internal scoring into account as well:
GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'
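The boost query contains characters such as {, !, $ and spaces, which some HTTP clients will not send unencoded. If you issue the query from Java, one way to handle this is to URL-encode the parameter values first. This is only a sketch: the class BoostQueryUrl and its methods are made up for illustration and are not part of Solr or SolrJ, and the host, port and core name are the ones from the test setup above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds the boost query URL shown above; a sketch, not part of Solr or SolrJ. */
final class BoostQueryUrl {

    /** URL-encodes a single parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    /**
     * @param baseUrl    select handler, e.g. http://127.0.0.1:8110/solr/test/select
     * @param boostField name of the ExternalFileField to boost by
     * @param userQuery  the regular query, passed to Solr as the qq parameter
     */
    static String build(String baseUrl, String boostField, String userQuery) {
        String q = "{!boost b=" + boostField + " v=$qq}";
        return baseUrl
                + "?indent=on&start=0&rows=100"
                + "&q=" + encode(q)
                + "&qq=" + encode(userQuery)
                + "&debugQuery=on";
    }
}
```

Calling BoostQueryUrl.build("http://127.0.0.1:8110/solr/test/select", "blogRank", "title:solr") gives a URL whose q parameter is %7B%21boost+b%3DblogRank+v%3D%24qq%7D and whose qq parameter is title%3Asolr.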
I have created a way of replicating state which is similar to MySQL replication.
We have several cases where we want to update a Btree on a central server and then have it replicated across all slave nodes.
Today we serialize a HashMap to disk and rsync it; when the slaves notice that the underlying file has changed, they re-initialize themselves from it. This works, but it is not a smart way of doing it, since the entire state has to be reloaded even though just one entry has been added. To solve that you need to add transaction logging and replicate those transactions.
So how does it work ?
* TransactionLogger needs to be initialized on both master and slave.
* You write to the master file.
* The slave polls the master and sends its latest sequence number (trx id), called X.
* The master sends the delta entries from X to Y, where Y is the latest entry noted on the master when the slave initiated the request.
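The steps above can be sketched as a toy, in-memory model. This is not the Mammatus code or API: the class names Master, Slave and Entry, and the methods deltaSince and poll, are invented for illustration, and the "network" is just a direct method call. It only shows the idea of shipping the entries after sequence number X instead of reloading the whole state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of sequence-number based replication; not the Mammatus API. */
final class DeltaReplicationSketch {

    /** One logged write together with its sequence number (trx id). */
    static final class Entry {
        final long seq;
        final String key;
        final String value;

        Entry(long seq, String key, String value) {
            this.seq = seq;
            this.key = key;
            this.value = value;
        }
    }

    /** Master side: applies writes and records them in an append-only log. */
    static final class Master {
        private final Map<String, String> store = new HashMap<String, String>();
        private final List<Entry> log = new ArrayList<Entry>();

        synchronized void put(String key, String value) {
            store.put(key, value);
            log.add(new Entry(log.size() + 1, key, value)); // sequence numbers start at 1
        }

        /** Returns the entries after sequence number x, up to the latest one (Y). */
        synchronized List<Entry> deltaSince(long x) {
            // The entry with seq n sits at index n - 1, so entries after x start at index x.
            return new ArrayList<Entry>(log.subList((int) x, log.size()));
        }
    }

    /** Slave side: remembers the last applied sequence number and polls for more. */
    static final class Slave {
        private final Map<String, String> store = new HashMap<String, String>();
        private long lastSeq = 0;

        /** One poll cycle; returns the number of entries that were applied. */
        int poll(Master master) {
            List<Entry> delta = master.deltaSince(lastSeq);
            for (Entry e : delta) {
                store.put(e.key, e.value);
                lastSeq = e.seq;
            }
            return delta.size();
        }

        String get(String key) {
            return store.get(key);
        }
    }
}
```

After two puts on the master, the first poll applies two entries; after one more put, the next poll applies only that single entry, and a further poll applies nothing.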
I wrote the transaction loggers as separate modules so you need to wire them up to make the storage synchronized.
On the slave you need a StateChangeListener and on the master you need to wrap the storage engine in a TransactionLoggerCacheStrategy.
Here is a fully working example spring context file.
Example code:
public static void main(String[] args)
{
    String[] cfg = {"logManager.xml"};
    ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext(cfg);
    Cache cacheMaster = (Cache)ctx.getBean("masterCache");
    Cache cacheSlave = (Cache)ctx.getBean("slaveCache");
    // Write to the master only
    cacheMaster.put("testing", new Date());
    // Poll the slave once per second until the entry has been replicated
    while(true)
    {
        Date date = (Date)cacheSlave.get("testing");
        if(date != null)
        {
            System.out.println("Huzza!");
            System.exit(0);
        }
        try
        {
            Thread.sleep(1000);
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }
    }
}
We have really been struggling to find a way of launching Hadoop jobs while creating and wiring all components with Spring.
We have finally arrived at a nice way of doing this, where we use the Hadoop Configuration to tell the jobs which Spring context files they should use.
Example
Client (from where you launch JobClient)
JobConf job = createJob();
job.set("configs", "classpath:ctx1.xml,classpath:ctx2.xml");
…..
Inside the public void configure(JobConf jobConf) method of a Mapper, Reducer or MapRunnable:
String[] configs = jobConf.get("configs").split(",");
ApplicationContext ctx = new ClassPathXmlApplicationContext(configs);
…Extract the beans you want and manually wire up the job, e.g.
this.contentParsers = (ContentParsers)ctx.getBean("contentParsers");
For this to work, all configuration files need to be in the jar file which you tell Hadoop to run with:
job.setJar(jarFile);
and if you want to add some dependency jar files, use:
job.set("tmpjars", "/lib/jar1,/lib/jar2");
where the tmpjars must reside in HDFS before the job is run.
Use ${HADOOP_HOME}/bin/hadoop dfs -copyFromLocal your_working_dir/lib /
This will put the dir /lib in the HDFS root, which of course is just an example.
We use the same Spring context files in the dev, stage and prod environments, together with environment-specific property files that we use to filter the context files before packaging them into the jar.
Example:
--- clip context file ---
<property name="numberOfUrlsPerCrawl" value="${numberOfUrlsPerCrawl}" />
--- clip ---
environment.local.properties
numberOfUrlsPerCrawl=100
environment.prod.properties
numberOfUrlsPerCrawl=100000
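To make the filtering step concrete, here is a minimal sketch of what replacing the ${...} placeholders amounts to. In practice a build tool would normally do this for you; the class ContextFilter and its filter method are invented for illustration and are not what our build uses.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Replaces ${name} placeholders with values from a property set; a sketch only. */
final class ContextFilter {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Returns the template with every placeholder replaced, failing on unknown names. */
    static String filter(String template, Properties env) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = env.getProperty(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Loading environment.local.properties into a Properties object and filtering the context line above yields value="100", while environment.prod.properties yields value="100000". Failing on a missing property, rather than leaving the placeholder in place, is a design choice of this sketch.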
The client side is of course Spring-wired as well.