So yesterday a job description at my previous employer popped
up in my Facebook stream, which reminded me of the programming exercise that we included in
the interview process just before I left the company. In short, it comes down to this:
- Funda has an API that lets you do queries; the response is paged, max. 25 objects at a time
- The API is rate limited at about 100 req./minute
- Request all pages for a given query
- Count the times a realtor ID is in the result
- Aggregate and sum the realtor IDs and create a top 10 list of realtors with the most objects
Scraping this is pretty easy, but the rate limiting got me thinking. A
great library for doing queue work like this (create a large list of URLs
to scrape, then fetch them four at a time or so) is async by caolan, but it lacks real rate limiting.
Room for improvement!
## Expanding async

The async library already has a pretty convenient way to create dynamically
sized queues with concurrency, in the form of:
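That primitive is `async.queue(worker, concurrency)`: you hand it a worker function and a concurrency limit, push tasks onto it, and it runs at most `concurrency` workers at once. A toy re-implementation of the idea (for illustration only, not async's actual code):

```javascript
// Toy sketch of what async.queue does: a worker function plus a
// concurrency limit; pushed tasks run at most `concurrency` at a time.
function makeQueue(worker, concurrency) {
  var tasks = [];
  var running = 0;

  function next() {
    while (running < concurrency && tasks.length > 0) {
      running++;
      run(tasks.shift());
    }
  }

  function run(item) {
    worker(item.task, function () {
      running--;
      if (item.callback) item.callback();
      next(); // a slot freed up, pull the next task
    });
  }

  return {
    push: function (task, callback) {
      tasks.push({ task: task, callback: callback });
      setImmediate(next); // start work asynchronously
    }
  };
}
```

With the real library this is simply `var queue = async.queue(worker, 4);` followed by `queue.push(task)`.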
To add rate limiting to queues I created a mixin that adds some methods to async, building
a small event-loop-like structure that fires every X ms,
where X is of course dictated by the max. speed at which we can query the target website.
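A sketch of that idea (my illustration, not necessarily the original mixin's code): the patched `push` buffers tasks, and a `setInterval` timer firing every `ms` milliseconds feeds them to the underlying queue one at a time. It works on anything with a `push(task, callback)` method, such as the queues created by `async.queue`:

```javascript
// Sketch of a rateLimit mixin: buffer pushed tasks and release them
// to the underlying queue from a timer that fires every `ms` ms.
function addRateLimit(queue) {
  queue.rateLimit = function (ms) {
    var realPush = queue.push.bind(queue); // keep the original push
    var buffer = [];
    var timer = null;

    queue.push = function (task, callback) {
      buffer.push({ task: task, callback: callback });
      if (!timer) {
        // the small "event loop" that fires every `ms` milliseconds
        timer = setInterval(function () {
          if (buffer.length === 0) {
            clearInterval(timer); // nothing left; stop ticking
            timer = null;
            return;
          }
          var item = buffer.shift();
          realPush(item.task, item.callback);
        }, ms);
      }
    };

    return queue; // chainable
  };
  return queue;
}
```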
The usage stays the same, but the queue variable now has a chainable
method 'rateLimit'. Executing the same code as before,
but rate limited to 1 request per second, will give a sorted response: even
though we have a concurrency of four, processing an
item takes at most 1 second, so the previous item will always have finished first.
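To see why the output comes back sorted: with one dispatch per interval and each request finishing inside that interval, item N is always done before item N+1 even starts. A self-contained simulation of that timing (interval shrunk to 50 ms so it finishes quickly; the real thing uses one request per second):

```javascript
var finished = [];

[1, 2, 3, 4].forEach(function (n, i) {
  // dispatch one item per 50 ms "tick"
  setTimeout(function () {
    // simulate a request that takes less than the interval
    setTimeout(function () { finished.push(n); }, 10);
  }, i * 50);
});

setTimeout(function () {
  console.log(finished.join(', ')); // arrives in order despite concurrency
}, 300);
```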
## Transforming it into real-world code

The response that we get from Funda has a 'Paging' parameter
that contains the next URL that we can call. If it's empty, we've
reached the end of our set. In pseudo code:
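In runnable form the pseudo code could look like this. `fetchPage` is a hypothetical helper (passed in so it's easy to swap out) that GETs a URL and hands back the parsed JSON body, and I'm assuming the listing array is called `Objects`:

```javascript
// Walk the paged result set: call onObject for every listing, then
// follow Paging.VolgendeUrl until it is empty.
function scrapeAll(fetchPage, url, onObject, done) {
  fetchPage(url, function (err, response) {
    if (err) return done(err);

    response.Objects.forEach(onObject);

    if (response.Paging && response.Paging.VolgendeUrl) {
      // non-empty next URL: keep going
      scrapeAll(fetchPage, response.Paging.VolgendeUrl, onObject, done);
    } else {
      done(null); // empty: we've reached the end of the set
    }
  });
}
```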
## Counting realtor IDs

Because the purpose of the assignment is to count the realtor IDs, we'll
add a simple object map where we gather all the data:
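For instance (I'm assuming the realtor ID field is called `MakelaarId` here; adjust to the actual response):

```javascript
// Tally realtor IDs in a plain object map.
var counts = {};

function countRealtor(listing) {
  var id = listing.MakelaarId; // assumed field name
  counts[id] = (counts[id] || 0) + 1;
}

// Turn the map into a top 10 list, most objects first.
function top10(counts) {
  return Object.keys(counts)
    .map(function (id) { return { id: id, count: counts[id] }; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, 10);
}
```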
## Hooking it together

There are a few small things left to do: first, we'll need to incorporate
the base URL; then we'll need to normalize the URLs we receive from
'VolgendeUrl' and maybe do some sanitizing. The final script
will look something like this:
## Running it

To run it: execute the following commands on your local system or on