Sebastian Muñoz-Najar Galvez
(See code for this project here)
The objective of this project is to create a database of all available GitHub repositories explicitly devoted to the development or implementation of web scrapers, in order to identify (1) the languages used for scraping and (2) the themes and websites frequently scraped.
Scrapers are a genre of code used to collect, aggregate and organize information from a website. Scrapers capitalize on regular patterns of site layout and other principles of progressive enhancement design to automate requests and aggregate information available piecemeal on a site. An alternative to scrapers is interaction with an API, where available.
The web is not an archive through and through; some regions resist archival work (see ‘Swiss Scraper’ below). It therefore becomes relevant to identify the regions of the web that have been scraped, and how researchers went about doing so.
Working with GitHub’s API
GitHub’s API is a very thorough archive of repositories, users and code. Authorized applications can make 30 search requests per minute, and a search of GitHub’s repositories returns a JSON document listing at most 1,000 results. However, the total number of results for a given query may exceed 1,000. For such queries it is therefore necessary to split the search into several ordered requests, each small enough to stay under the limit.
I used the date of creation to segment my query of scraping repositories. This process involved a great deal of trial and error, since I didn’t know how many scrapers were built in any particular interval of time.
The keywords for every request were ‘scrape OR scraper OR scraping’. The API looked for matches in the title, description and README file of every repository created between 2008 and the present day. The final database has N = 17,366 repositories.
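The date segmentation described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the 30-day window, the helper names `date_windows` and `search_url`, and the use of the `in:` and `created:` search qualifiers are assumptions about how such a query could be expressed. In practice the window size would have to be tuned by trial and error so that each window stays under the 1,000-result limit.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"
KEYWORDS = "scrape OR scraper OR scraping"


def date_windows(start, end, days):
    """Yield (first_day, last_day) pairs that cover start..end without overlap."""
    current = start
    while current <= end:
        last = min(current + timedelta(days=days - 1), end)
        yield current, last
        current = last + timedelta(days=1)


def search_url(first_day, last_day, page=1, per_page=100):
    """Build one search request URL restricted to a creation-date window."""
    query = (
        f"{KEYWORDS} in:name,description,readme "
        f"created:{first_day.isoformat()}..{last_day.isoformat()}"
    )
    params = {"q": query, "per_page": per_page, "page": page}
    return f"{SEARCH_URL}?{urlencode(params)}"


# Hypothetical example: split the first quarter of 2008 into 30-day windows.
windows = list(date_windows(date(2008, 1, 1), date(2008, 3, 31), days=30))
urls = [search_url(first, last) for first, last in windows]
```

Each URL would then be requested page by page, at a pace that respects the rate limit. If a window still reports more than 1,000 results, it needs to be split into shorter windows.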
Getting the README docs
Once I had every repository with its respective statistics and the URLs to its content, I scraped GitHub for each repo’s README file. No sophisticated technique was necessary to scrape GitHub; however, I did need to set a sleep time of 2 seconds between requests in order not to time out.
I saved this second database to a JSON file (10MB) after applying some basic cleanup to the collected text.
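A throttled collection loop of this kind might look like the sketch below. The function names, the injected `fetch` callable and the particular cleanup rules (dropping HTML tags, collapsing whitespace) are illustrative assumptions; the original cleanup steps are not specified.

```python
import json
import re
import time


def clean_text(raw):
    """Basic cleanup: drop HTML tags and collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", without_tags).strip()


def collect_readmes(urls, fetch, delay=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    readmes = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        readmes[url] = clean_text(fetch(url))
    return readmes


def save_readmes(readmes, path):
    """Write the {url: cleaned_text} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(readmes, handle, ensure_ascii=False)
```

Here `fetch` stands for any function that takes a URL and returns the page text, for instance a thin wrapper around `urllib.request.urlopen`. Passing it in as an argument keeps the pacing logic separate from the network code.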
Scrapers and the language wars
Developers and data analysts disagree as to which language is faster and more readable for particular tasks such as scraping. These debates are usually quite heated and polarized; they may involve core disagreements as to how to delimit a spectrum of tasks or how to evaluate the value of code (optimization vs. readability, for instance). To ground this debate, there is actually quite a bit of research on which languages are favored within different fields.
For scrapers, I found the following results:
From the graphs above one may very well conclude that Python has gotten more popular as a language for scrapers in the last few years. The question that follows is, popular among whom? Are coders converting to Python for this particular task? Or is there a new group of coders trained in Python who have taken over scraping? In order to tackle these questions I conducted a cohort analysis following the tutorial written by Greg Reda.
Before analyzing how each cohort employs different languages, I wanted to know whether people writing scrapers actually continue to do so over time. The following heat map shows cohorts defined by the month when they created their first scraper (y-axis) and records the proportion of each cohort that wrote new scrapers in the following months (x-axis).
After 2009, cohorts start being more persistent in writing scrapers, but it’s always a small group (not larger than 10% of the original cohort) that continues to devote repositories explicitly to scrapers. This is of course not the same as saying that they stop writing scrapers altogether, since a scraper is a simple form of code that can be integrated into more complex scripts. It is, however, an indicator of scrapers losing their novelty fast: for most cohorts, after one month scrapers become integrated into more complex tasks or are abandoned altogether.
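The retention figures behind such a heat map can be computed along the following lines. This is a plain-Python sketch rather than the pandas workflow of the tutorial, and it assumes the data has been reduced to a list of hypothetical `(user, 'YYYY-MM')` pairs, one per scraper repository.

```python
from collections import defaultdict


def month_index(year_month):
    """Turn 'YYYY-MM' into a running month count so months can be subtracted."""
    year, month = year_month.split("-")
    return int(year) * 12 + int(month) - 1


def cohort_retention(repos):
    """repos: list of (user, 'YYYY-MM') pairs, one per scraper repository.

    Returns {cohort_month: {months_since_first: share_of_cohort_active}}.
    """
    # A user's cohort is the month of their first scraper repository.
    first_month = {}
    for user, created in repos:
        if user not in first_month or created < first_month[user]:
            first_month[user] = created

    # active[cohort][offset] = set of users who created a repo at that offset
    active = defaultdict(lambda: defaultdict(set))
    for user, created in repos:
        cohort = first_month[user]
        offset = month_index(created) - month_index(cohort)
        active[cohort][offset].add(user)

    retention = {}
    for cohort, by_offset in active.items():
        size = len(by_offset[0])  # everyone is active in their first month
        retention[cohort] = {
            offset: len(users) / size
            for offset, users in sorted(by_offset.items())
        }
    return retention


# Made-up records, for illustration only.
example = [
    ("ana", "2014-01"), ("ana", "2014-02"),
    ("ben", "2014-01"),
    ("cy", "2014-02"), ("cy", "2014-04"),
]
retention = cohort_retention(example)
```

In this toy input, half of the January 2014 cohort is active again one month later. Each row of the heat map corresponds to one cohort and each column to one offset.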
Now, focusing specifically on the use of Python, I inspected changes in the proportion of users in each cohort that employ this language to write scrapers. In the following heat map each cohort starts with a proportion of Python users. This proportion is rarely higher than 40%. Nevertheless, this may comprise 2-10 Python users for the early cohorts and up to 500 for the later ones. If the hue gets darker when inspecting the heat map from left to right, that would mean that a larger proportion of each cohort adopts Python over time. If the hue gets darker when inspecting the heat map from top to bottom, that would mean that new cohorts use Python more consistently than older ones.
The diagonal line does in fact get darker as one goes further down. But there’s little evidence of Python being adopted by older cohorts or, in fact, of Python being used consistently over time. This heat map must be read in relation to the one above. People simply don’t continue to devote repos to scrapers after their first few scripts, and the small proportion of people who do may not be using Python. How to explain the line graph above? I’d say that every month a large new cohort of coders starts scraping, and the largest coherent group (not necessarily the majority!) is using Python.
What are people scraping?
I clustered scraping repos using the Ward algorithm. The distance between repos was determined by the cosine similarity of the tokens in their README files (tokens weighted using tf-idf). I compared the results with those of all the other linkage methods available in scipy.cluster.hierarchy.linkage, and I tried different upper and lower thresholds for document frequency as well as several lists of stopwords. The best results found are presented in the dendrogram below.
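The distance computation that feeds the clustering can be illustrated with the standard-library sketch below. It is not the project’s code. The tokenizer, the smoothed idf formula, the parameter names `min_df` and `max_df`, and the toy README strings are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text, stopwords=frozenset()):
    """Lowercase the text and keep alphabetic tokens not in the stopword list."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]


def tfidf_vectors(docs, stopwords=frozenset(), min_df=1, max_df=1.0):
    """Return one {token: weight} dict per document.

    min_df is a minimum document count; max_df is a maximum document share.
    """
    counts = [Counter(tokenize(doc, stopwords)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(token for count in counts for token in count)
    vocabulary = {
        token for token, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    }
    return [
        {
            # term frequency times a smoothed inverse document frequency
            token: tf * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0)
            for token, tf in count.items() if token in vocabulary
        }
        for count in counts
    ]


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def condensed_distances(vectors):
    """Pairwise distances in flat (i < j) order: (0,1), (0,2), ..., (1,2), ..."""
    return [
        cosine_distance(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


# Made-up README snippets, for illustration only.
readmes = [
    "Scraper for soccer match results",
    "Soccer league results scraper",
    "Download apartment listings from a housing site",
]
stopwords = frozenset({"for", "from", "a", "scraper"})
vectors = tfidf_vectors(readmes, stopwords)
distances = condensed_distances(vectors)
```

A flat list in this order is the condensed form that `scipy.cluster.hierarchy.linkage` accepts as input, for example with `method="ward"`. SciPy’s documentation notes that Ward linkage is only correctly defined for Euclidean distances, so pairing it with cosine distances is best read as a heuristic.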
(The clusters shown below are based on repositories written in Python exclusively)
There’s still much to be improved in the clustering algorithm used to produce the dendrogram above. Regarding stopwords, for instance, I decided to add terms that referred to specific libraries to the stopword list, since I was more interested in what people scrape than in the specifics of how they go about doing so. This is the main reason I focused on Python repos (I wouldn’t be able to identify most technical references in other languages). A more robust clustering algorithm could have been employed on all repositories, and a more exhaustive stopword list could be compiled.
There are many avenues of research into programmer cultures that can be explored by scraping GitHub and similar communities such as Stack Overflow or r/python. In particular, I’d be interested in scraping GitHub for Python code and exploring the development and diffusion of Pythonic style, or the coherence of style across different tasks such as scraping and machine learning. The basic intuition is that the techniques of distant reading used to analyze the development of style in literature could also be employed for code.