I Don’t Need No Stinking API – Web Scraping in 2016 and Beyond

Social media APIs and their rate limits have not been nice to me recently, especially Instagram's. Who needs an API anyway?

Sites are getting increasingly smart about blocking scraping and data mining attempts. AngelList even detects PhantomJS (I have not seen other sites do this). But if you are automating the exact actions you would perform in a browser, can that be blocked?

First off, in terms of concurrency, or the amount of horsepower you get for your hard-earned $$$ – Selenium sucks. It's simply not built for what you would consider 'scraping'. But with sites being built with more and more smarts these days, the only truly reliable way to mine data off the internets is to use browser automation.

My stack is pretty much all JavaScript (there go a few readers 😑😆): WebdriverIO, Node.js and a bunch of NPM packages, including antigate (thanks to Troy Hunt's "Breaking CAPTCHA with automated humans"). That said, I'm sure most of my techniques can be applied to any flavour of the Selenium 2 driver. It just happens that I find JavaScript optimal for browser automation.
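To give a feel for what browser automation with this stack looks like, here is a minimal sketch, not my actual scraper. It assumes a recent WebdriverIO release with the promise-based API (the 2016-era releases used a chained style instead), and the `scrapePage` helper, the `chrome` capability and the `https://example.com` URL are all placeholders made up for illustration.

```javascript
// Hypothetical helper: visits a page in a real browser and collects
// the title plus the visible text of every link on it.
async function scrapePage(browser, url) {
  await browser.url(url);                 // navigate like a real user would
  const title = await browser.getTitle(); // title of the rendered page
  const links = await browser.$$('a');    // every anchor element on the page
  const linkTexts = [];
  for (const link of links) {
    linkTexts.push(await link.getText()); // visible text of each link
  }
  return { title, linkTexts };
}

// Assumed wiring: start a session, scrape one page, always clean up.
async function main() {
  // Loaded lazily so scrapePage can be exercised without a browser installed.
  const { remote } = await import('webdriverio');
  const browser = await remote({
    capabilities: { browserName: 'chrome' }, // placeholder browser choice
  });
  try {
    const result = await scrapePage(browser, 'https://example.com');
    console.log(result);
  } finally {
    await browser.deleteSession(); // close the browser even if scraping fails
  }
}
```

Calling `main()` requires the `webdriverio` package and a local browser driver. Because `scrapePage` only depends on the browser object it is handed, the same function should work against whichever driver session you start.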
