So yesterday a job description at my previous employer popped
up in my Facebook stream, which reminded me of the programming exercise that we included in
the interview process just before I left the company. In short, it comes down to this:
- Funda has an API that lets you do queries; the response is paged, max. 25 objects at a time
- The API is rate limited at about 100 req./minute
- Request all pages for a given query
- Count the times a realtor ID is in the result
- Aggregate and sum the realtor IDs and create a top 10 list of realtors with the most objects
Scraping this is pretty easy, but the rate limiting got me thinking. A
great library for doing queue work like this (create a large list of URLs
to scrape, then fetch them four at a time or so) is async by caolan, but it lacks real rate limiting.
Room for improvement!
## Expanding async

The async library already has a pretty convenient way to create dynamically
sized queues with concurrency, in the form of:
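That primitive is `async.queue(worker, concurrency)`: you hand it a worker function and a concurrency limit, push tasks onto it, and it runs at most `concurrency` workers at once. A toy re-implementation of the idea (for illustration only, not async's actual code):

```javascript
// Toy sketch of what async.queue does: a worker function plus a
// concurrency limit; pushed tasks run at most `concurrency` at a time.
function makeQueue(worker, concurrency) {
  var tasks = [];
  var running = 0;

  function next() {
    while (running < concurrency && tasks.length > 0) {
      running++;
      run(tasks.shift());
    }
  }

  function run(item) {
    worker(item.task, function () {
      running--;
      if (item.callback) item.callback();
      next(); // a slot freed up, pull the next task
    });
  }

  return {
    push: function (task, callback) {
      tasks.push({ task: task, callback: callback });
      setImmediate(next); // start work asynchronously
    }
  };
}
```

With the real library this is simply `var queue = async.queue(worker, 4);` followed by `queue.push(task)`.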
To add rate limiting to queues I created a mixin that adds some methods to async, building
a small event-loop-like structure that fires every X ms,
where X is of course dictated by the max. speed at which we can query the target website.
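A sketch of that idea (my illustration, not necessarily the original mixin's code): the patched `push` buffers tasks, and a `setInterval` timer firing every `ms` milliseconds feeds them to the underlying queue one at a time. It works on anything with a `push(task, callback)` method, such as the queues created by `async.queue`:

```javascript
// Sketch of a rateLimit mixin: buffer pushed tasks and release them
// to the underlying queue from a timer that fires every `ms` ms.
function addRateLimit(queue) {
  queue.rateLimit = function (ms) {
    var realPush = queue.push.bind(queue); // keep the original push
    var buffer = [];
    var timer = null;

    queue.push = function (task, callback) {
      buffer.push({ task: task, callback: callback });
      if (!timer) {
        // the small "event loop" that fires every `ms` milliseconds
        timer = setInterval(function () {
          if (buffer.length === 0) {
            clearInterval(timer); // nothing left; stop ticking
            timer = null;
            return;
          }
          var item = buffer.shift();
          realPush(item.task, item.callback);
        }, ms);
      }
    };

    return queue; // chainable
  };
  return queue;
}
```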
The usage stays the same, but the queue variable now has a chainable
method 'rateLimit'. Executing the same code as before,
but rate limited to 1 request per second, will give a sorted response: even
though we have a concurrency of four, processing an
item takes at most 1 second, so the previous item will always have finished first.
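To see why the output comes back sorted: with one dispatch per interval and each request finishing inside that interval, item N is always done before item N+1 even starts. A self-contained simulation of that timing (interval shrunk to 50 ms so it finishes quickly; the real thing uses one request per second):

```javascript
var finished = [];

[1, 2, 3, 4].forEach(function (n, i) {
  // dispatch one item per 50 ms "tick"
  setTimeout(function () {
    // simulate a request that takes less than the interval
    setTimeout(function () { finished.push(n); }, 10);
  }, i * 50);
});

setTimeout(function () {
  console.log(finished.join(', ')); // arrives in order despite concurrency
}, 300);
```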
## Transforming it into real-world code

The response that we get from Funda has a 'Paging' parameter
that contains the next URL that we can call. If it's empty, we've
reached the end of our set. In pseudo code:
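In runnable form the pseudo code could look like this. `fetchPage` is a hypothetical helper (passed in so it's easy to swap out) that GETs a URL and hands back the parsed JSON body, and I'm assuming the listing array is called `Objects`:

```javascript
// Walk the paged result set: call onObject for every listing, then
// follow Paging.VolgendeUrl until it is empty.
function scrapeAll(fetchPage, url, onObject, done) {
  fetchPage(url, function (err, response) {
    if (err) return done(err);

    response.Objects.forEach(onObject);

    if (response.Paging && response.Paging.VolgendeUrl) {
      // non-empty next URL: keep going
      scrapeAll(fetchPage, response.Paging.VolgendeUrl, onObject, done);
    } else {
      done(null); // empty: we've reached the end of the set
    }
  });
}
```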
## Counting realtor IDs

Because the purpose of the assignment is to count the realtor IDs, we'll
add a simple object map where we gather all the data:
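For instance (I'm assuming the realtor ID field is called `MakelaarId` here; adjust to the actual response):

```javascript
// Tally realtor IDs in a plain object map.
var counts = {};

function countRealtor(listing) {
  var id = listing.MakelaarId; // assumed field name
  counts[id] = (counts[id] || 0) + 1;
}

// Turn the map into a top 10 list, most objects first.
function top10(counts) {
  return Object.keys(counts)
    .map(function (id) { return { id: id, count: counts[id] }; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, 10);
}
```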
## Hooking it together

There are a few small things left to do: first, we'll need to incorporate
the base URL; then we'll need to normalize the URLs we receive from
'VolgendeUrl' and maybe do some sanitizing. The final script
will look something like this:
## Running it

To run it: execute the following commands on your local system or on