Author Archive

Announcing Pynoceros

Tim Wintle - October 30th, 2009

I haven’t mentioned any free time coding on this blog for a while, but I thought some people might be interested in a project I released a couple of days ago.

Pynoceros is a Python port of the JavaScript parser from the Rhino JavaScript interpreter. (It’s a fairly straightforward language conversion at the moment – so don’t expect it to be pythonic!)

Why? Well, Rhino is a stable and well-used code base, and it forms the basis for the YUI Compressor – which I’ve often wanted a Python version of.

My version of the YUI Compressor is almost done (I’ve just got to find time to add all the license information and prepare the release), but in the meantime I thought I’d release Pynoceros for anyone who’s interested.

Marketers over-valuing Twitter?

Tim Wintle - October 18th, 2009

I’ve been arguing for a while that some marketers massively over-rate Twitter when trying to measure on-line opinion. A majority of the “social media monitoring tools” put far too much emphasis on Twitter in my opinion, and now two press releases from Hitwise strongly support my argument.

To summarise, I feel that focusing on Twitter ends up creating a very bad sample for any kind of opinion research, practically ignoring the effect of Facebook, Bebo, Myspace, YouTube, search engines, news sites, email, blogs, forums, instant messaging, and all the millions of other websites on-line.

What is more, I believe that focusing on twitter so strongly is what throws twitter’s collective opinion out of line with the rest of the internet. Online marketers going to twitter to measure internet opinion is like a market research team only inviting people who work for market research companies to give feedback on a product.

(more…)

US bloggers required to disclose Endorsements

Tim Wintle - October 13th, 2009

The US FTC recently updated their guidelines on endorsements and testimonials, bringing in specific examples regarding bloggers who are endorsing a product as part of an advertising campaign.

The guidelines were last updated in 1980, and it appears the FTC wanted to clarify the implications for online advertising.

An example from the updated guidelines:

The advertiser requests that a blogger try a new body lotion and write a review of the product on her blog. Although the advertiser does not make any specific claims about the lotion’s ability to cure skin conditions and the blogger does not ask the advertiser whether there is substantiation for the claim, in her review the blogger writes that the lotion cures eczema and recommends the product to her blog readers who suffer from this condition. The advertiser is subject to liability for misleading or unsubstantiated representations made through the blogger’s endorsement. The blogger also is subject to liability for misleading or unsubstantiated representations made in the course of her endorsement. The blogger is also liable if she fails to disclose clearly and conspicuously that she is being paid for her services.

In order to limit its potential liability, the advertiser should ensure that the advertising service provides guidance and training to its bloggers concerning the need to ensure that statements they make are truthful and substantiated. The advertiser should also monitor bloggers who are being paid to promote its products and take steps necessary to halt the continued publication of deceptive representations when they are discovered.

According to the Wall Street Journal,

Regulators say they haven’t seen a wave of abuses involving endorsements by bloggers but wanted to establish clear rules to prevent any problems in the future.

The FTC announcement can be read here, where you can also find the full FTC guidelines.

Do you really need those Keyword args? (python optimisation)

Tim Wintle - September 14th, 2009

I’ve been reading the python interpreter fairly closely recently (more on that to come in a later blog post), and I was surprised by how much optimisation there is around function calls, dispatching them differently depending on their parameters (expected and supplied)*.

I was going to include this in a later post, but to keep that shorter here’s a quick example. Let’s take these two massively simplified functions.


def a(spam, eggs):
    return

def b(spam, eggs=None):
    return

Calling a(0,1) is fast because the interpreter skips the keyword argument tests and pushes the parameters directly onto the stack for the new function.

Next came b(0,1) – which takes roughly 10% longer on my machine as there is more for the interpreter to set up.

Calling with keyword arguments is far slower though – with:

a(spam = 0, eggs = 1)

and

b(spam = 0, eggs = 1)

both taking 50% longer than the fastest option (with a negligible difference between the two). Obviously a large part of the 50% increase is setting up the names of the parameters on the stack – but if you read the interpreter source you’ll see there’s far more than that at play (including quick lookup of “self” for bound methods etc.).

(obviously most non-trivial functions will spend a significant time within the function body – which will reduce the relative performance boost – but there’s sure to be the odd situation where it’s worth knowing this.)
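If you want to check these numbers on your own machine, a minimal timing sketch using the standard timeit module might look like the following. The helper names time_call and compare are made up for this example, and the ratios depend heavily on the interpreter version and hardware, so don't expect to reproduce the exact figures quoted above.

```python
import timeit


def a(spam, eggs):
    return


def b(spam, eggs=None):
    return


# The four call styles compared in the post.
CALLS = ["a(0, 1)", "b(0, 1)", "a(spam=0, eggs=1)", "b(spam=0, eggs=1)"]


def time_call(stmt, number=200_000, repeat=5):
    """Return the best observed time per call, in seconds, for `stmt`."""
    timer = timeit.Timer(stmt, globals=globals())
    # Take the minimum of several runs to reduce noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number


def compare(number=200_000):
    """Time every call style and print it relative to a(0, 1)."""
    results = {stmt: time_call(stmt, number=number) for stmt in CALLS}
    baseline = results["a(0, 1)"]
    for stmt, seconds in results.items():
        print(f"{stmt:22s} {seconds * 1e9:8.1f} ns  ({seconds / baseline:.2f}x)")
    return results


if __name__ == "__main__":
    compare()
```

The last column is the slowdown relative to the plain positional call, which is the figure being discussed here.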

* n.b. – I was looking at the 2.5 tag of python, but as far as I can tell none of this code seems to have changed so far in the 3.x trunk.

The Team…

Tim Wintle - July 17th, 2009

As time goes by it’s getting tougher and tougher to get a photo of the people that make up Team Rubber, so I thought I’d use the opportunity of a fire drill today to try to snap a photo of the office.

Unfortunately it seems half the office was out today (and our fire procedure isn’t strict enough to get the London office to travel down to Bristol just to line up outside for ten minutes) – but here’s a snap of those of us that were in.

[Photo: the team, 17 July 2009]

For those interested – here is the last time we actually managed to snap the majority of the team (in December ‘07)

[Photo: the team, December 2007]

Python tail-optimisation using byteplay

Tim Wintle - April 20th, 2009

(I’m going to start off by emphasizing that this is not for production use, it is just a bit of harmless fun while I was looking at the structure of python’s bytecode, and I thought it might be interesting reading for others)

There have been quite a few hacks in the past to add tail-call optimisation to Python – normally in cross-interpreter Python – but while I was playing with the byteplay module I thought I’d have a go at writing a function that re-compiles a function with basic tail-call optimisation inserted.

My method is basic, and converts tail-recursive calls in a (pure) function into jump statements.
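To illustrate what that conversion amounts to, here is a hand-written, source-level sketch. It does not use byteplay or touch any bytecode; it only shows a tail-recursive function next to the loop that replacing the tail call with a jump is equivalent to. The factorial example and both function names are chosen purely for illustration.

```python
def factorial_recursive(n, acc=1):
    """Tail-recursive form: the recursive call is the last thing evaluated."""
    if n <= 1:
        return acc
    return factorial_recursive(n - 1, acc * n)


def factorial_jump(n, acc=1):
    """Hand-written equivalent of turning the tail call into a jump."""
    while True:
        if n <= 1:
            return acc
        # Rebind the arguments, then "jump" back to the top of the function
        # instead of pushing a new stack frame.
        n, acc = n - 1, acc * n
```

Both functions return the same values, but the second runs in constant stack space, so it keeps working at depths where the recursive version would hit the interpreter's recursion limit.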

(more…)

Innovative viral marketing

Tim Wintle - March 24th, 2009

I just got sent an interesting request from a friend of mine – he’s applying for a job and as part of the application process he’s been asked to film a job interview talking about the company, put it on YouTube, and drive traffic to it to see who can get the most views.

The result is that thousands of people will be getting emails from friends asking them to watch a video of one of their friends recommending the company (a great way to mix viral marketing and word of mouth!).

You can see my friend Alex telling the worst joke ever here. (It’s worth it – it’s about dwarfs.)

We can’t hide it – gender test

Tim Wintle - March 24th, 2009

Just found the gender genie – a web app that tries to determine if the author of a passage of text is male or female.

It seems to work very well – it worked out that Adam was male from his last post, that Helen was female from this post, and that Ben was male from this post.

On the other hand, it did say that Michaela was male from this post – so it’s not perfect.

Nothing stays hidden on-line

Tim Wintle - March 17th, 2009

All these blog posts from colleagues in Texas; how come this hasn’t been posted here yet?

[Photo: Nothing stays hidden on-line]

(Thanks to Clare Reddington for pointing out the flickr photostream)

YouTube, Google and counting views.

Tim Wintle - March 13th, 2009

As many people in industries such as ours will have noticed – YouTube is being slow at updating the view count for some videos at the moment. Luckily we have our own numbers to go by, so it’s not affecting us as much as it is affecting many companies, but I thought I’d put my explanation up here so we can refer people to it.

According to YouTube, this seems to be due to an algorithm change made on the 25th of February. (They have made similar comments elsewhere.)

Quote:

We’ve made a change in our public-facing view counts across the site that will enable us to consistently reflect what is considered a ‘view,’ based upon video consumption, video streaming and spam filtering. This only affects view counts from February 25 moving forward.

Implementing this change also caused view count updates to slow down a bit in general; many people have noticed this and we’re aware of the issue.

This raises some very interesting points (these are my observations; they have not been confirmed with Google and may not reflect the opinions of Team Rubber):

First, for people who don’t deal with software like this every day (like I do for the viral ad network), I’ll explain the common way that numbers like this are updated:

  • There are one or more “tracking servers”, running all over the place – these are the servers that actually record a “view”, “hit”, or “action” – and they simply record lots of information about each action, which will be looked over later.
  • Every few minutes the main algorithm runs over all the data it hasn’t looked at yet and updates the numbers that are shown on the dashboards.

The important thing to notice is that the views are recorded right at the beginning and they will be updated at some point. Even if the main algorithm is stopped entirely for a few days, it will carry on in the future if you’re patient.
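As a rough illustration of that two-stage design, here is a minimal sketch. The class names TrackingLog and ViewCounter are hypothetical, everything lives in memory, and a real system would spread this across many machines; the point is only that recording and counting are separate steps.

```python
from collections import Counter


class TrackingLog:
    """Append-only record of raw events, as a tracking server might keep."""

    def __init__(self):
        self.events = []  # list of (video_id, timestamp) tuples

    def record(self, video_id, timestamp):
        # Recording is cheap: store the raw event and do nothing else.
        self.events.append((video_id, timestamp))


class ViewCounter:
    """The periodic "main algorithm" that turns raw events into totals."""

    def __init__(self, log):
        self.log = log
        self.cursor = 0          # index of the first event not yet processed
        self.totals = Counter()  # the numbers shown on the dashboard

    def run_batch(self):
        """Process every event recorded since the previous run."""
        new_events = self.log.events[self.cursor:]
        for video_id, _timestamp in new_events:
            self.totals[video_id] += 1
        self.cursor += len(new_events)
        return len(new_events)


log = TrackingLog()
counter = ViewCounter(log)
log.record("video-a", 1)
counter.run_batch()
# Even if run_batch() is not called for a long time, views keep being logged...
log.record("video-a", 2)
log.record("video-b", 3)
# ...and the next run picks up everything that was missed.
counter.run_batch()
```

Because the log is never thrown away, a stalled counter only delays the totals; it does not lose views.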

Prioritizing videos (“Why does this only happen once I reach 200/300 views?”)

You may have noticed that the number of views per video has always been updated quicker for videos with few views than for videos with more views. For example, a newly uploaded video will normally update its view count within a few minutes of being watched, whereas a video that has already had several thousand views will update its view count more slowly.

This suggests that when Google run their main script, they tend to update the numbers for videos with fewer views more often than for videos with a higher number of views – and leave the other data to be processed less often (say every few hours).

This makes a lot of sense, because people with 50 views are more likely to be watching their numbers every few minutes to see if they have another 5 views than people who have had 200,000 views – who may only care about their views increasing by 1,000. It keeps users happier.

This explains why we (and others affected by this issue) have seen view counts rising as normal until they get above 200-300 views – at which point the numbers appear “stuck”.
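A minimal sketch of this kind of tiered refresh schedule follows. The 300-view threshold and both intervals are numbers made up for the example (loosely based on the "few minutes" and "few hours" guesses above), not anything YouTube has published.

```python
# Hypothetical tiers: (upper bound on views, seconds between refreshes).
REFRESH_TIERS = [
    (300, 5 * 60),                 # small videos: refresh every 5 minutes
    (float("inf"), 3 * 60 * 60),   # everything else: refresh every 3 hours
]


def refresh_interval(view_count):
    """Return how often (in seconds) a video's public count is refreshed."""
    for max_views, interval in REFRESH_TIERS:
        if view_count < max_views:
            return interval


def is_due(view_count, seconds_since_update):
    """True if the video should be included in the next counting run."""
    return seconds_since_update >= refresh_interval(view_count)
```

Under these assumed tiers, a video with 50 views that was last refreshed ten minutes ago is due for an update, while one with 200,000 views is not, which matches counts that rise normally and then appear "stuck" once a video crosses the threshold.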

Balancing the work (“Why doesn’t this affect all videos?”)

Clearly a massive site like YouTube getting so many views needs more than one computer running to update these numbers. I’m going to assume that Google run this over their normal map-reduce system.

They may have tens, hundreds, or even thousands of computers running their view-counting algorithm (and I don’t expect to ever find out…), but all views for a video have to be counted by the same computer, so they need some manner of splitting up the millions of views they have recorded into batches of work to be done.

They almost certainly do this using some form of hash function – you can picture this as saying that every video on YouTube is grouped into various buckets – each of these buckets will have its views processed on the same machine (or at the same time).

The problem comes when a hash function doesn’t split up the items equally (i.e. one “bucket” has significantly more or fewer videos in it than another one). This appears to be the problem here – only some videos have been affected, and my assumption is that this is because one of these “buckets” has ended up with far more views than the others – meaning that one set of machines (or one job) gets over-loaded and ends up being incredibly slow.
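The bucketing idea, and how one bucket can end up overloaded, can be sketched as follows. This is an illustration under my own assumptions: SHA-256 stands in for whatever hash function is really used, and the video IDs and view counts are invented.

```python
import hashlib


def bucket_for(video_id, num_buckets):
    """Map a video ID to a bucket so all of its views are counted together."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def bucket_loads(views_per_video, num_buckets):
    """Return the total number of views each bucket has to process."""
    loads = [0] * num_buckets
    for video_id, views in views_per_video.items():
        loads[bucket_for(video_id, num_buckets)] += views
    return loads


# Invented workload: 1,000 small videos plus one very popular one.
views = {f"video-{i}": 10 for i in range(1000)}
views["viral-video"] = 1_000_000

loads = bucket_loads(views, num_buckets=8)
busiest = loads.index(max(loads))
print("views per bucket:", loads)
print("busiest bucket:", busiest)
```

Here the videos themselves are spread fairly evenly, but the bucket that happens to contain the popular video has vastly more views to process than the rest, so every other video sharing that bucket waits behind it.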

Lessons Learned

For me, working with a similar system to the above, the number one thing that I have learned is that for tasks like this that might be incredibly sensitive to hash functions it’s not safe to assume that a hash function that’s theoretically good is going to remain good.

I don’t know if they are able to, but the situation would be better if YouTube chose the hash function at the beginning of each main job, i.e. each time that they run the main script that updates the information on the dashboards, they choose to use a different hash function. This way, if a video ends up in a bucket that’s overloaded one time, it will end up in a different bucket next time (which shouldn’t be overloaded).
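One simple way to get "a different hash function each run" is to mix a per-job salt into the key before hashing. The sketch below assumes that approach; the job_salt parameter and the function names are hypothetical, and SHA-256 is again just a stand-in.

```python
import hashlib


def bucket_for(video_id, num_buckets, job_salt):
    """Bucket a video using a hash that changes with every job's salt."""
    key = f"{job_salt}:{video_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def assignments(video_ids, num_buckets, job_salt):
    """Return the video -> bucket mapping used for one run of the job."""
    return {v: bucket_for(v, num_buckets, job_salt) for v in video_ids}


video_ids = [f"video-{i}" for i in range(100)]
run_1 = assignments(video_ids, num_buckets=16, job_salt="job-0001")
run_2 = assignments(video_ids, num_buckets=16, job_salt="job-0002")

# Count how many videos landed in a different bucket on the second run.
moved = sum(1 for v in video_ids if run_1[v] != run_2[v])
print(f"{moved} of {len(video_ids)} videos changed bucket between runs")
```

Within a single run the mapping is still stable, so all views for a video are counted together, but a video stuck next to an overloaded neighbour in one run will most likely be somewhere else in the next.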

Of course, this is all theoretical, and is based on a large number of assumptions – YouTube may perform their hashing at a far earlier stage, and they may not be able to change the hash function each time they run the job.

Tim Wintle