Scrapers Used on Github

Sebastian Muñoz-Najar Galvez (see code for this project here)

The objective of this project is to create a database of all available GitHub repositories explicitly devoted to the development or implementation of web scrapers, in order to (1) identify the languages used for scraping and (2) identify the themes and websites frequently scraped. Scrapers are a genre of code used to collect, aggregate, and organize information from a website. Scrapers capitalize on regular patterns of site layout and other principles of progressive enhancement to automate requests and aggregate information that is available only piecemeal on a site. An alternative to scraping is interacting with an API, where one is available. The web is not an archive through and through; some regions resist archival work (see ‘Swiss Scraper’ below). It therefore becomes relevant to identify the regions of the web that have been scraped, and how researchers went about doing so.

Working with GitHub’s API

GitHub’s API is a very thorough archive of repositories, users, and code. Authorized applications can make 30 search requests per minute, and a search of GitHub’s repositories returns a JSON document listing up to 1,000 elements. However, the total number of results for any given query may exceed 1,000, so for such queries it is necessary to make several ordered requests. I used the date of creation to segment my query of scraping repositories. This process involved a great deal of trial and error, since I did not know how many scrapers were built in any particular interval of time. The keywords for every request were ‘scrape OR scraper OR scraping’. The API looked for...
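As a minimal sketch of this date-segmented querying, the following Python snippet uses the `requests` library against GitHub’s repository search endpoint. The endpoint, the 30-requests-per-minute search limit, and the 1,000-result cap come from the text above; the function name, the date window, and the token are illustrative placeholders, not the project’s actual code.

import time
import requests

API_URL = "https://api.github.com/search/repositories"
HEADERS = {
    "Accept": "application/vnd.github+json",
    # Placeholder: substitute a real personal access token.
    "Authorization": "token YOUR_TOKEN_HERE",
}

def search_scrapers(created_from, created_to):
    """Collect up to 1,000 repositories matching the scraping keywords
    created within the given window (YYYY-MM-DD strings)."""
    query = f"scrape OR scraper OR scraping created:{created_from}..{created_to}"
    repos = []
    for page in range(1, 11):  # 10 pages x 100 results = the 1,000-result cap
        resp = requests.get(
            API_URL,
            headers=HEADERS,
            params={"q": query, "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        repos.extend(items)
        if len(items) < 100:  # final page reached before hitting the cap
            break
        time.sleep(2)  # stay under the 30-requests-per-minute search limit
    return repos

# Illustrative date segment; values are placeholders.
batch = search_scrapers("2014-01-01", "2014-03-31")

In practice, whenever the `total_count` field in a response exceeds 1,000, the date window would be narrowed and the query rerun, mirroring the trial-and-error segmentation described above.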