A few weeks ago I launched a wee website that shows you all the Number 1 songs on your birthday. It does so by scraping the Official Charts website grabbing the chart topper on your birthday each year. This worked well but wasn't very fast - especially as you get older!
I decided to write a quick script to go off and gather all the data for each day between the chart records starting (in 1953) and today (ok, a few weeks ago really...). That's over 24,800 days!
This article is going to look at how I managed to make that happen without melting my computer, using Promise batching.
The Problem
The lookup method I wrote, which scrapes the data for a specific day (getNumberOne(year, month, day)), returns a Promise. Version 1 of the website would build an array of the years needed (now - birth year), map them to an array of Promises, and pass that to Promise.all.
Promise.all(years.map((y) => getNumberOne(y, month, day)));
I'd wait for the Promises to resolve and then render the output.
This works OK for a small-ish set (most people using this are going to be under 100, right?), but when I built an array of 24,000 promises and fired it at Promise.all, it quickly failed.
Promise.all will try to run everything at once, no questions asked. That's fine for most cases, but with 24,000 simultaneous requests it really has no chance of success.
Batch by Batch
Clearly, the solution is to break this up into smaller batches. Somewhere under 100 at a time seemed like a good starting point. After all, I only intended to run this once, store the data then pull future queries from there, so it didn't need to be quick.
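You could hand-roll this kind of batching in a few lines. Here's a rough sketch (not the code I used); the worker callback stands in for the real getNumberOne lookup:

```javascript
// Hand-rolled batching: split the work into chunks and await each
// chunk with Promise.all before starting the next one.
async function processInBatches(items, batchSize, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // only `batchSize` promises are in flight at any one time
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```

The downside of this approach is that each chunk waits for its slowest member before the next chunk starts, which is one reason a proper pool (below) is nicer.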
A quick Google brought me to supercharge/promise-pool which lets you run an array of promises in parallel batches.
After installing PromisePool (npm install @supercharge/promise-pool --save
) I required it into my script:
const PromisePool = require('@supercharge/promise-pool');
Then I needed to build an array of all 24,000 dates that I want to look up:
// `date` comes in as a `yyyy-mm-dd` date string
// (months are zero-indexed in the Date constructor, hence the -1)
const d = new Date(date.slice(0, 4), date.slice(5, 7) - 1, date.slice(8, 10));
const now = new Date();
const dates = [];
// increment the date by 1 day until it equals today
for (; d <= now; d.setDate(d.getDate() + 1)) {
dates.push({
year: d.getFullYear(),
month: `0${d.getMonth() + 1}`.slice(-2), // getMonth() is zero-indexed
day: `0${d.getDate()}`.slice(-2),
});
}
Now I can use PromisePool to set up a 100-at-a-time batch process.
const { results } = await PromisePool
.withConcurrency(100)
.for(dates)
.process(async (d) => {
return getNumberOneRemote(d.year, d.month, d.day);
});
The withConcurrency method tells it how many Promises I want to run in parallel (i.e. batches of 100 requests).
Then I pass the array of dates to the for method.
Finally, the process method accepts a function which will be run for each of the dates when it's that one's turn for processing.
In this case I call the getNumberOneRemote function, which does the scraping, and return its Promise.
It really was that simple!
Once it completed I converted the results into an object format so I could use the date string as the lookup key, then saved it to a file for use later.
// build an object with the date strings as the key
const saveData = {};
results.forEach((r) => {
saveData[r.date] = r;
});
// save to a file
fs.writeFileSync('./data/number-ones.json', JSON.stringify(saveData, null, 2));
I think this ran for about 15 minutes before completing successfully, so actually pretty fast considering what I was asking it to do!
See It In Action
You can check out all the final code in the project's GitHub repository, and take a look at the supercharge/promise-pool project too.