Tailsweep

Cheap backup

May 5th, 2009 Written by Marcus Herou

I really love having backups, but I hate paying for them, since deep down in my gut it somehow feels like wasted money. So how do you get the most bang for the buck?

Buy some simple 1TB USB2 drives and just plug them into one of your servers and mount them as regular drives. Simple as that.

Want RAID? No problem, this is what we did.

Find the device names by issuing:

sudo fdisk -l

The two drives came out as /dev/sdb1 and /dev/sdc1

Here is the magic:

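# Create the block device node for the array (major 9 = md, minor 0)
mknod /dev/md0 b 9 0
# Create a two-disk RAID 1 (mirror) array from the USB drives
mdadm -C -v /dev/md0 -l 1 -n 2 /dev/sdb1 /dev/sdc1
# Create an ext3 filesystem with the label /usb_drive1
mkfs.ext3 -L/usb_drive1 /dev/md0
# Disable the mount-count and time-based filesystem checks
tune2fs -c 0 /dev/md0
tune2fs -i 0 /dev/md0
# Use writeback journaling as the default mount option
tune2fs -o journal_data_writeback /dev/md0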

Mount it.

mount /dev/md0 /srv/backup

That is really it :)

This is how it looks now in our cabinet, really ugly but what the heck, who cares haha.

Posted in Uncategorized   No comments

Tailsweep goes Hive

April 27th, 2009 Written by Marcus Herou

We have now started to experiment with Hive. It makes perfect sense, since what we have built internally is basically Hive, but in the form of zillions of Hadoop jobs.

How nice would it be to just clean your data, create a CSV version of the actual log, load it into Hive and then run SQL commands that output the results in a format of your choice?

Sounds like a data warehouse? Well, it more or less is, but it has the computing power of all machines in the cluster, which makes it very useful. We are currently using MonetDB and it is blazing fast, but it performs poorly on a machine with little memory (which is no surprise) and it also claims all the memory it can find, so we limit it with some tricks to keep it from swapping out the machine completely.

Posted in Uncategorized  Tags: hive, monetdb
 1 comment »

Solr external scoring

April 24th, 2009 Written by Marcus Herou

We had trouble figuring out how to get Solr to handle external scores. Thanks to Grant Ingersoll and Yonik Seeley we have now figured it out.

The solution: ExternalFileField + FunctionQuery

This is how I tested this setup.

# solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib">
 <cores adminPath="/admin/cores">
        <core name="test" instanceDir="test" />
 </cores>
</solr>

# Schema, a pkId (blog entry) belongs to a blogId (the blog)
<schema name="test" version="1.1">
    <types>
   	<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    	<fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    	<fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    	<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="float"/>
	<fieldType name="blogRankFile" keyField="blogId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField" valType="float"/>
    </types>
    <fields>
	<field name="pkId" type="string" indexed="true" stored="true" required="true" />
	<field name="blogId" type="integer" indexed="true" stored="true" required="true" />
	<field name="entryRank" type="entryRankFile" />
	<field name="blogRank" type="blogRankFile" />
    </fields>
    <uniqueKey>pkId</uniqueKey>
    <defaultSearchField>pkId</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
</schema>

# dataDir/external_blogRank.txt
1=2.0
2=1.0
3=3.0
4=1.0

# Add doc file, save it as /tmp/add.xml
<add>
    <doc><field name="pkId">1</field><field name="blogId">1</field></doc>
    <doc><field name="pkId">2</field><field name="blogId">1</field></doc>
    <doc><field name="pkId">3</field><field name="blogId">2</field></doc>
    <doc><field name="pkId">4</field><field name="blogId">3</field></doc>
    <doc><field name="pkId">5</field><field name="blogId">4</field></doc>
</add>

# Add some data
curl http://127.0.0.1:8110/solr/test/update --data-binary @/tmp/add.xml -H "Content-Type: text/xml"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">239</int></lst>
</response>

# Commit
curl http://127.0.0.1:8110/solr/test/update -H "Content-Type: text/xml" --data-binary '<commit />'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">6</int></lst>
</response>

# Issue query, should return all entries, those with the highest blogRank first

mahe@mahe-laptop:~$ GET "http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q=*:* _val_:\"log(blogRank)\""

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
<lst name="params">
<str name="start">0</str>
<str name="indent">on</str>
<str name="q">*:* _val_:"log(blogRank)"</str>
<str name="rows">100</str>
</lst>
</lst>
<result name="response" numFound="5" start="0">
<doc>
<int name="blogId">3</int>
<str name="pkId">4</str>
</doc>
<doc>
<int name="blogId">1</int>
<str name="pkId">1</str>
</doc>
<doc>
<int name="blogId">1</int>
<str name="pkId">2</str>
</doc>
<doc>
<int name="blogId">2</int>
<str name="pkId">3</str>
</doc>
<doc>
<int name="blogId">4</int>
<str name="pkId">5</str>
</doc>
</result>
</response>

Badabom badabing!
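
The order of the result can be checked by hand. The following is a minimal Java sketch, not Solr code: the class and method names are made up for illustration, and it only assumes that the function query scores each entry by the base-10 log of its blog's rank, with the field's defVal of 0 for blogs missing from the file and ties kept in insertion order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr: recomputes the expected ordering by hand.
final class ExternalRankSketch {

    // Parses "key=value" lines in the format of external_blogRank.txt.
    static Map<String, Float> parseRanks(List<String> lines) {
        Map<String, Float> ranks = new LinkedHashMap<>();
        for (String line : lines) {
            String[] kv = line.trim().split("=", 2);
            if (kv.length == 2) {
                ranks.put(kv[0], Float.parseFloat(kv[1]));
            }
        }
        return ranks;
    }

    // docs maps pkId -> blogId; returns the pkIds sorted by log10(blogRank), highest first.
    // Blogs without a rank get 0 (the defVal in the schema). The sort is stable,
    // so entries with equal scores keep their insertion order.
    static List<String> order(Map<String, String> docs, Map<String, Float> ranks) {
        List<String> pkIds = new ArrayList<>(docs.keySet());
        Comparator<String> byScore = Comparator.comparingDouble(
                pk -> Math.log10(ranks.getOrDefault(docs.get(pk), 0f)));
        pkIds.sort(byScore.reversed());
        return pkIds;
    }

    public static void main(String[] args) {
        Map<String, Float> ranks = parseRanks(Arrays.asList("1=2.0", "2=1.0", "3=3.0", "4=1.0"));
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "1");
        docs.put("2", "1");
        docs.put("3", "2");
        docs.put("4", "3");
        docs.put("5", "4");
        System.out.println(order(docs, ranks)); // prints [4, 1, 2, 3, 5]
    }
}
```

With the sample data from this post the sketch gives the same pkId order as the Solr response: 4, 1, 2, 3, 5.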

Update:

An even better query (thanks to Yonik), which takes the actual internal score into account as well:

GET 'http://127.0.0.1:8110/solr/test/select?indent=on&start=0&rows=100&q={!boost b=blogRank v=$qq}&qq=title:solr&debugQuery=on'

Posted in Lucene, Uncategorized  Tags: externalfilefield, function query
 5 comments »

Replication in Mammatus

December 14th, 2008 Written by Marcus Herou

I have created a way of replicating state which is similar to MySQL replication.

We have several cases where we want to update a Btree on a central server and then have it replicated across all slave nodes.

Today we serialize a HashMap to disk and rsync it, and when a slave notices that the underlying file has changed it re-initializes itself from that file. This works, but it is not a smart way of doing it, since the slave needs to reload the entire state even though just one entry has been added. To solve that you need to add transaction logging and replicate those transactions.

So how does it work?

* TransactionLogger needs to be initialized on both master and slave.

* You write to the master file.

* The slave polls the master and sends its latest sequence number (trx id), called X.

* The master sends the delta entries from X to Y where Y is the latest entry noted on the master when the client initiated the request.
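
The steps above can be sketched in a few lines of Java. This is only an in-memory illustration of sequence-number based delta replication, not the Mammatus implementation: the classes Master, Slave and Entry and their methods are hypothetical, and the real slave polls over the network.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of sequence-number based replication; not Mammatus code.
final class ReplicationSketch {

    // One logged write: its sequence number (trx id) plus the key and value written.
    static final class Entry {
        final long seq;
        final String key;
        final Object value;

        Entry(long seq, String key, Object value) {
            this.seq = seq;
            this.key = key;
            this.value = value;
        }
    }

    static final class Master {
        private final Map<String, Object> store = new HashMap<>();
        private final List<Entry> log = new ArrayList<>();

        // Every write goes to the store and to the transaction log (seq starts at 1).
        synchronized void put(String key, Object value) {
            store.put(key, value);
            log.add(new Entry(log.size() + 1, key, value));
        }

        // Returns the entries with sequence numbers in (x, y], where y is the
        // latest entry on the master at the time of the request.
        synchronized List<Entry> deltaSince(long x) {
            int y = log.size();
            return new ArrayList<>(log.subList((int) x, y));
        }
    }

    static final class Slave {
        private final Map<String, Object> store = new HashMap<>();
        private long lastSeq = 0; // X: the latest sequence number applied locally

        // Sends X to the master and applies only the delta that comes back.
        void poll(Master master) {
            for (Entry e : master.deltaSince(lastSeq)) {
                store.put(e.key, e.value);
                lastSeq = e.seq;
            }
        }

        Object get(String key) {
            return store.get(key);
        }

        long lastSeq() {
            return lastSeq;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave();
        master.put("testing", "v1");
        slave.poll(master); // transfers entry 1
        master.put("other", "v2");
        slave.poll(master); // transfers only entry 2, not the whole state
        System.out.println(slave.get("testing") + " " + slave.get("other") + " seq=" + slave.lastSeq());
    }
}
```

The point of the design is visible in the second poll: only the entry written since the previous poll is transferred, instead of the whole serialized state.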

I wrote the transaction loggers as separate modules so you need to wire them up to make the storage synchronized.

On the slave you need a StateChangeListener and on the master you need to wrap the storage engine in a TransactionLoggerCacheStrategy.

Here is a fully working example spring context file.

Example code:

public static void main(String[] args)
{
    // Load the Spring context that wires up the master and slave caches
    String[] cfg = {"logManager.xml"};
    ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext(cfg);
    Cache cacheMaster = (Cache)ctx.getBean("masterCache");
    Cache cacheSlave = (Cache)ctx.getBean("slaveCache");

    // Write to the master, then check the slave once a second until the entry shows up
    cacheMaster.put("testing", new Date());
    while(true)
    {
        Date date = (Date)cacheSlave.get("testing");
        if(date != null)
        {
            System.out.println("Huzza!");
            System.exit(0);
        }
        try
        {
            Thread.sleep(1000);
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }
    }
}

Posted in Uncategorized  Tags: mammatus, master, replication, slave
 No comments

Spring with Hadoop

December 13th, 2008 Written by Marcus Herou

We have really been struggling to find a way of launching Hadoop jobs while creating and wiring all components with Spring.

We have finally arrived at a nice way of doing this, where we use the Hadoop Configuration to tell the jobs which Spring context files they should use.

Example

Client (from where you launch JobClient)

JobConf job = createJob();

job.set("configs", "classpath:ctx1.xml,classpath:ctx2.xml");

…..

Inside the public void configure(JobConf jobConf) method of a Mapper, Reducer or MapRunnable:

String[] configs = jobConf.get("configs").split(",");
ApplicationContext ctx = new ClassPathXmlApplicationContext(configs);

…Extract the beans you want and manually wire up the job, e.g.

this.contentParsers = (ContentParsers)ctx.getBean("contentParsers");

For this to work you need to have all configurations in your jar-file which you tell hadoop to run with:

job.setJar(jarFile);

and if you want to add some dependency jar files use:

job.set("tmpjars", "/lib/jar1,/lib/jar2");

where the tmpjars must reside in HDFS before running the job.

To copy them there, use:

${HADOOP_HOME}/bin/hadoop dfs -copyFromLocal your_working_dir/lib /

This puts the directory /lib in the HDFS root, which of course is just an example.

We use the same Spring context files in the dev, stage and prod environments, together with environment-specific property files that we use to filter the context files before packaging them into the jar.

Example:

--- clip context file ---

<property name="numberOfUrlsPerCrawl" value="${numberOfUrlsPerCrawl}" />

--- clip ---

environment.local.properties

numberOfUrlsPerCrawl=100

environment.prod.properties

numberOfUrlsPerCrawl=100000
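
The post does not say which tool does the filtering; build tools can typically do this for you. Purely as an illustration of the idea, here is a minimal Java sketch that replaces ${name} placeholders with values from a java.util.Properties object. The class ContextFilterSketch and its filter method are hypothetical names, not part of Spring or Hadoop.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of placeholder filtering; not the build step used in the post.
final class ContextFilterSketch {

    // Matches ${name} and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces every ${name} that has a value in env; unknown placeholders are left untouched.
    static String filter(String template, Properties env) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = env.getProperty(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            // quoteReplacement keeps '$' and '\' in the value from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<property name=\"numberOfUrlsPerCrawl\" value=\"${numberOfUrlsPerCrawl}\" />";

        Properties local = new Properties();
        local.setProperty("numberOfUrlsPerCrawl", "100");
        Properties prod = new Properties();
        prod.setProperty("numberOfUrlsPerCrawl", "100000");

        System.out.println(filter(line, local));
        System.out.println(filter(line, prod));
    }
}
```

Running the same context line through the local and the prod properties yields value="100" and value="100000" respectively, which is how one set of context files can serve all environments.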

The client side is of course Spring-wired as well.

Posted in Uncategorized  Tags: hadoop, spring
 No comments

Firefox swaps contents between iframes

November 26th, 2008 Written by Peter Gustafsson

More and more users are switching to Firefox these days, and we think that's all good since we like the browser very much ourselves. But it has a bug in how it handles dynamically (JavaScript) rendered iframes. In short, Firefox sometimes, under certain circumstances, swaps contents between iframes.
When debugging with Firebug one can see that the iframe src doesn't match the content actually shown in the iframe.

Since we, as a lot of other ad networks, display our ads in an iframe this can result in an ad from our network ending up in a placement belonging to another ad network and vice versa. This isn’t a good thing for either advertisers or site owners/bloggers.

A workaround for this is to reload the iframes onload.

(iframe.src=iframe.src)

But this isn't good enough, mainly because it causes two ad impressions per page impression, which gives incorrect statistics.
Of course we could append some flag to the src URL when reloading.

(e.g. dontCountThis=1)

But we also run third party ad scripts from several other ad networks…

So since there’s no sufficient solution that we can implement we would like to encourage you all to please go and vote for a fix for this bug -> https://bugzilla.mozilla.org/show_bug.cgi?id=388714

Posted in Browser, Uncategorized   4 comments »

Sharding

October 24th, 2008 Written by Marcus Herou

For quite some time we were undecided whether to choose Hypertable, HBase, MySQL Cluster or a homemade sharded solution. Guess where we ended up :)

We have extended Parhely to support sharding and we now default to 10 shards (about 1G each) per machine. I can recommend the book High Performance MySQL, since it contains many advanced hands-on tips which you can apply to many data structures. We will for instance apply some of them to a sharded Lucene (Solr) implementation.

The upside with shards is obvious:

* Better write performance

* More read performance (if you are wise in your sharding choice)

Downsides:

* Slower delete-by-query, since you need to remove data both from the master and from all shards matching the query. We solve it by first selecting all deletable ids and then batch deleting them in chunks.

* Maintenance - It is a lot easier to have one database. But if you have over 100G data that is a mess too.
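
The chunked delete mentioned above could look roughly like the following Java sketch. It is not Parhely code: the table name entries, the id column and the chunk size are assumptions for illustration, and it only builds the SQL strings instead of executing them against a shard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "select the ids first, then delete in chunks"; not Parhely code.
final class ChunkedDeleteSketch {

    // Splits the ids into consecutive chunks of at most size elements.
    static <T> List<List<T>> chunks(List<T> ids, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size) {
            out.add(new ArrayList<>(ids.subList(i, Math.min(i + size, ids.size()))));
        }
        return out;
    }

    // Builds one DELETE statement for a chunk of numeric ids.
    static String deleteStatement(String table, List<Long> chunk) {
        StringBuilder sql = new StringBuilder("DELETE FROM ").append(table).append(" WHERE id IN (");
        for (int i = 0; i < chunk.size(); i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append(chunk.get(i));
        }
        return sql.append(")").toString();
    }

    public static void main(String[] args) {
        // Pretend these ids came back from the "select all deletable ids" query.
        List<Long> ids = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L);
        for (List<Long> chunk : chunks(ids, 3)) {
            System.out.println(deleteStatement("entries", chunk));
        }
    }
}
```

Keeping each statement small bounds how long any single delete holds locks on a shard, at the cost of more round trips.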

We stuck with MySQL because of time constraints, but we have written most of the source code in a way that makes it a lot easier to move to PostgreSQL or perhaps MonetDB (a very good column store). Since we have not tested MonetDB enough, and since I was a little afraid of how eagerly it claims as much memory as the machine allows, we chose MySQL (better safe than sorry).

Why did we not use HiveDB? Well, it is focused on MySQL and is Java only. It does not seem generic enough, but who am I to judge? We need to sit in the driver's seat in this application since it is our baby.

Posted in MySQL, optimization, sharding  Tags: MySQL, performance, sharding
 1 comment »

Mammatus uses Haloe

September 14th, 2008 Written by Marcus Herou

We have got rid of the dependency on core-lucene in Haloe, and the LuceneMap now uses Haloe instead.

Posted in Uncategorized  Tags: haloe, mammatus
 2 comments »

Haloe updated

September 14th, 2008 Written by Marcus Herou

We have added some tests to the trunk of Haloe. Running them made it apparent that certain use cases did not work.

We also added a LoadBalancedSolrDocumentIndexer which can be used to search against symmetrical indexes.

Posted in haloe  Tags: haloe
 No comments

Haloe released!

September 8th, 2008 Written by Marcus Herou

Glad to announce that we have released yet another open source package, named Haloe. Do you want to make it easier to integrate search into your code? Well, Haloe addresses just that, and currently provides a few different implementations such as plain Lucene, Solr etc.

Posted in General  Tags: haloe, Lucene, solr
 2 comments »

Copyright © 2007 Tailsweep AB
