Web Scraping for Journalists
Scraping – getting a computer to capture information from online sources – is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted.
Paul Bradshaw will show you how to scrape content from the web and find stories that otherwise might have been missed.
This two-day workshop in scraping is designed for reporters with no knowledge of scraping or programming, and provides essential skills for getting original stories by compiling data across a range of online sources. By the end of the workshop, you will be able to use specialist scraping tools (without programming) and begin to write your own, more advanced, scrapers. You will also be able to communicate with programmers on relevant projects. (See below for more information and technical requirements; you must bring your own laptop.)
Paul Bradshaw runs the MA in Data Journalism and the MA in Multiplatform and Mobile Journalism at Birmingham City University, and also works as a consulting data journalist with the BBC England Data Unit. A journalist, writer and trainer, he has worked with news organisations including The Guardian, Telegraph, Mirror, Der Tagesspiegel and The Bureau of Investigative Journalism. He publishes the Online Journalism Blog, is the co-founder of the award-winning investigative journalism network HelpMeInvestigate.com, and has been listed on both Journalism.co.uk’s list of leading innovators in media and the US Poynter Institute’s list of the 35 most influential people in social media.
His books include Finding Stories in Spreadsheets, Scraping for Journalists, The Data Journalism Heist, Snapchat for Journalists and the Online Journalism Handbook.
Tuesday, 4 December: Scraping basics
10:30-11:15am Introduction: What scraping is and how news organisations are using it
11:30am-12:15pm Pitching story ideas involving scraping
12:15-1pm Scraping basics: finding structure in HTML and URLs
2-3:45pm Simple scraping jobs: checking a webpage every day; identifying information using XPath
4-5pm Introduction to scraping tools: Outwit Hub
Wednesday, 5 December: Looking at what’s available
9-10am Advanced Outwit Hub: scraping multiple pages
10-10:15am What’s possible with programming: APIs, regex and loops
10:30am-12pm Scraping text that fits a pattern: regex
1-2pm Advanced scraping options: coding, PDFs and spreadsheets
2-5pm Project surgery: your scraping challenges
Scraping is the process of automatically collating information from the web. It might mean grabbing entries across hundreds of webpages, or fetching and combining dozens of spreadsheets or thousands of PDFs.
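To give a flavour of what this looks like in code, here is a minimal sketch in Python using only the standard library. It is an illustration and not material from the workshop: the sample HTML, the council names and the helper names (`CellCollector`, `scrape_rows`, `amounts_in_pounds`) are all hypothetical, and a real scraper would download the page rather than read it from a string.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this HTML from a URL.
SAMPLE_HTML = """
<table>
  <tr><td>Council A</td><td>£1,200</td></tr>
  <tr><td>Council B</td><td>£350</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            if not self.rows:             # tolerate a cell with no enclosing row
                self.rows.append([])
            self.rows[-1].append("")      # start a new, empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # append text to the current cell


def scrape_rows(html):
    """Use the structure of the HTML (rows and cells) to pull out a table."""
    parser = CellCollector()
    parser.feed(html)
    return parser.rows


def amounts_in_pounds(text):
    """Use a regex to find every pound amount and return it as an integer."""
    return [int(m.replace(",", "")) for m in re.findall(r"£([\d,]+)", text)]


rows = scrape_rows(SAMPLE_HTML)
amounts = amounts_in_pounds(SAMPLE_HTML)
```

The first function relies on the structure of the page (table rows and cells), while the second relies on a text pattern (a pound sign followed by digits). Most scraping jobs combine the two approaches in some way.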
The results have led to exclusive stories for organisations ranging from the Bureau of Investigative Journalism and Trinity Mirror, to DC Thomson, Channel 4 and the BBC.
Delegates will use their own laptops and should have a Google Drive account and have downloaded the free version of OutWit Hub ahead of the course. A GitHub account would also be useful.
The software is all free. However, the free version of OutWit Hub only allows you to scrape 100 rows, so you may want to pay for the full version; you can decide after you have learnt how to use it on the course.