ezyang's blog

the arc of software bends towards understanding

Interactive scraping with Jupyter and Puppeteer

One of the annoying things about scraping websites is bouncing back and forth between two places: the browser, where you use Dev Tools to work out which selectors will pull out the data you want, and your actual scraping script, which is usually a batch program that may have to run through several steps before it reaches the one you are debugging. A batch script is fine once your scraper is up and running, but while developing, it’s really handy to pause the scraping process at some page and fiddle around with the DOM to see what to do.
