Introducing: Data Analysis Studio for Summer 2016

We cover a lot of ground in the Lede Program: you might start without knowing much at all about your computer, and by the end you’re using the command line to run Python machine learning libraries, hex-binning geocoded CSV files, fetching quartiles out of dataframes, and running joins on SQL databases. That’s a lot of tech, really, really quickly. Any time you have that many tools at your disposal, three big questions appear:

When should I use what tool?
How do I keep it all straight?
What if I want to learn more about X?

To focus on answering these questions we’re introducing a new course for Summer 2016: Data Analysis Studio, a project-based course focused on beginning-to-end workflow and applying your newfound skills. You’ll spend seven weeks crafting projects from start to finish, learning new skills and refining the ones you’ve gained through your other coursework. Frequent critiques will not only make your projects stronger, but also introduce you and other students to new methods of analyzing data, building visuals and avoiding common pitfalls. Let’s take a look at how Data Analysis Studio answers those “big three questions.”

When should I use what tool?

Early on in your data career, a lot of tools look the same, and much of their nuance is still hidden. For example, if you’re storing data, when is a CSV file best? What about a SQL database? Why would you need a shapefile? While in-class explanations and working through assignments help you learn the basics, it can be difficult to understand the real-world implications of your choices until you’ve actually made them. Data Analysis Studio not only gives you practice using these skills in a realistic environment, but feedback from instructors and visiting professionals keeps you moving in the right direction as you...

Mapping home-delivery postal service in Switzerland

Fanny Giroud

In Switzerland, La Poste has a legal obligation to make sure that everybody has access to a nearby post office. In some remote areas, though, La Poste has invented a system to provide postal services directly at home through the mailman, who will ring the bell at your door if you have placed a little sign on your mailbox. The villages where this home-delivery service is in place now make up a third of all postal access points in Switzerland. While La Poste is saving money by shutting down post offices and replacing them with home-delivery systems, people who aren’t at home all day long (unlike the unemployed or retired) can’t easily send a package or pay a bill. La Poste refused to give me the complete list of these villages, so I decided to scrape all the points on the map that they provide (which is not the best for visualization). As of August 26th, 2015, it looks like La Poste is migrating its websites to a new platform, and the map that we are interested in is still up, but doesn’t seem to appear on the new website (see notebook for links). In recent weeks, it also seemed like the map was undergoing lots of maintenance work: the query URL parameters have been modified; the number of villages has increased by two; one error point with coordinates in Somalia was removed; and layers of security certificates have been added. These changes have slowed me down significantly, as I believed that the code was creating mistakes, but I added code for every possible mistake...
Exploring Global Terrorism

Aliza Goldberg

RAND Corporation’s database of all terrorist attacks from 1968 to 2009 reveals that the most common weapon used by terrorists was explosives and that the most terrorist attacks occurred in 2006. The most fatalities from a single attack happened on 9/11, but the most overall deaths from terrorist attacks happened in Iraq. The number of attacks over the course of 41 years was truly staggering, as seen from map visualizations of the database.

RAND Corporation, a well-known American think tank focused on international military affairs, began this database in 1980 after forming the Cabinet Committee to Combat Terrorism in 1972. RAND uses the common academic definition of terrorism, quoting terrorism scholar Bruce Hoffman to specify that terrorism is:

1. violent
2. meant to create fear
3. intended to coerce counteraction
4. politically motivated
5. against civilians
6. by either a group or an individual

Since terrorism can be difficult to classify, the data may be skewed. Other terrorist attacks may have gone unreported. Borders and names of countries have changed over the last 41 years, which may have led to some analysis errors. The database ends in 2009, so later developments, such as the rise of ISIL, are not included.

I cleaned the database to turn the dates into recognizable times and, only for the analysis of terrorist groups, to eliminate the “unknown” and “other” perpetrators. With bar graphs, I showed the “year of terror” (2006), the weapons used, how fatal those weapons were and the top deadliest countries. Using a Mapquest API key, I geocoded all of the terrorist attacks by city or country. I used a pivot table...
Scrapers Used on Github

Sebastian Muñoz-Najar Galvez

(See code for this project here)

The objective of this project is to create a database of all available GitHub repositories explicitly devoted to the development or implementation of web scrapers in order to identify (1) the languages used for scraping and (2) the themes and websites frequently scraped. Scrapers are a genre of code used to collect, aggregate and organize information from a website. Scrapers capitalize on regular patterns of site layout and other principles of progressive enhancement design to automate requests and aggregate information available piecemeal on a site. An alternative to scrapers is interaction with an API, where available. The web is not an archive through and through; some regions resist archival work (see ‘Swiss Scraper’ below). It therefore becomes relevant to identify the regions of the web that have been scraped, and how researchers went about doing so.

Working with GitHub’s API

GitHub’s API is a very thorough archive of repositories, users and code. Authorized applications can make 30 requests per minute, and a search of GitHub’s repositories returns a JSON document with a list of up to 1000 elements. However, the total number of results from any given query may be over 1000. Therefore, for queries with more than 1000 results it is necessary to make several ordered requests. I used the date of creation to segment my query of scraping repositories. This process involved a great deal of trial and error, since I didn’t know how many scrapers were built in any particular interval of time. The key words for every request were ‘scrape OR scraper OR scraping’. The API looked for...
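The trial and error of segmenting by creation date can be partly automated. What follows is only a minimal sketch, not the author's code: it assumes a hypothetical count_results callable that reports how many repositories match a date window, and it bisects the window until each piece fits under the 1000-element cap. The created: date-range qualifier is GitHub's search syntax; fake_count, SEARCH_CAP and the 2015 date range are made up for illustration.

```python
from datetime import date, timedelta

SEARCH_CAP = 1000  # the post notes a search returns a list of up to 1000 elements
KEYWORDS = "scrape OR scraper OR scraping"  # key words quoted in the post


def build_query(start, end):
    """Search string restricted to repositories created from start to end."""
    # created:YYYY-MM-DD..YYYY-MM-DD is GitHub's date-range search qualifier.
    return f"{KEYWORDS} created:{start.isoformat()}..{end.isoformat()}"


def split_windows(start, end, count_results):
    """Bisect [start, end] until each window has at most SEARCH_CAP results.

    count_results(start, end) is a hypothetical callable that returns the
    total number of hits for a window, e.g. read from one API response.
    """
    if start == end or count_results(start, end) <= SEARCH_CAP:
        return [(start, end)]
    middle = start + timedelta(days=(end - start).days // 2)
    return (split_windows(start, middle, count_results)
            + split_windows(middle + timedelta(days=1), end, count_results))


def fake_count(start, end):
    """Stand-in for the API: pretends 10 scrapers are created every day."""
    return 10 * ((end - start).days + 1)


# Illustrative run over an assumed one-year range.
windows = split_windows(date(2015, 1, 1), date(2015, 12, 31), fake_count)
queries = [build_query(start, end) for start, end in windows]
```

In a real run, count_results would itself cost one request per call, so pausing about two seconds between requests would keep the script under the 30 requests per minute mentioned above.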
Airbnb Data

Adam Stoddard

The Airbnb marketplace is very diverse: apartments and other housing could consist of anything from a bedbug-ridden couch to a glamorous full-floor penthouse. How do we quantify this database? Some of the most interesting factors to look at are text-based. Airbnb includes text descriptions of the apartments and ‘about’ the host. Analyzed with both cosine similarity and topic modeling, the descriptions of apartments are about what you would expect: they consist of the words people use to describe housing, such as beds, baths, location, access, nearby restaurants, subways, bars, etc. But the topic modeling on host descriptions can be enlightening, allowing us to see how people think of themselves. Some hosts group themselves into categories, which could involve being a “professional” who “enjoys” “traveling”, or an “artist” in “Brooklyn” who spends time with “girlfriends.”

Host topic modeling using gensim:

0 place family friends much girlfriends school good entrepreneur ive give
1 really de going also huge always living et vous things
2 living brooklyn ny manhattan people two well park best walk
3 things people time reading make moved meeting year amazing see
4 month great place also architect kyle couple married home ny
5 live favorite years enjoy garden travel home living like life
6 travel professional easy going organized time make clean park slope
7 great work host travel neighborhood see manhattan good currently trip
8 stayed living writer good dogs editor comfortable shows owner magazine
9 great restaurants yorker home space please event traveling way easy

What does the marketplace look like? The following histogram shows the number of bedrooms: One-bedrooms clearly dominate, with far more units than either studios or two...
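For readers unfamiliar with the cosine similarity mentioned above for comparing listing descriptions, here is a minimal sketch using plain word counts. It is an illustration under assumptions, not the author's pipeline: the tokenization is a simple lowercase split, and the sample descriptions are invented.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Invented descriptions, for illustration only.
listing_one = "two beds one bath near subway and bars"
listing_two = "one bed one bath near subway and restaurants"
score = cosine_similarity(listing_one, listing_two)
```

A score near 1 means two descriptions use nearly the same words in the same proportions, and a score of 0 means they share no words at all.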

Political Donors in Norway

Gunn Kari Hegvik

As data on private political donors in Norway are defective and have not been collected and sorted, I want to build a database of all donors going back to 2011. I want to finish this before the national election in 2017. My first step is to scrape the donor data for the Conservative Party, the party of the current prime minister. I chose them first because they have the most donors, and second because their donors tend to be wealthy people and companies in shipping, real estate and investment banking. Inspired by the NYTimes story “Small Pool of Rich Donors Dominates Election Giving”, I wanted to find out how many families dominate election giving to the Conservative Party. I also wanted to find out who the most faithful donors are, which donors stopped donating, and who the newcomers are. Unlike in the US, where donor data are made public on given dates, Norwegian political parties have to make donations public no more than four weeks after the donation was made. So the second part of my project was to build a newsbot, using Mandrill, that would email me if any changes were made to the Conservative Party’s 2015 donors website. To build the bot, I was inspired by quakebot, the LATimes newsbot that we worked on for the first part of summer. I set the bot to print out a sentence that can be published as part of a story if one or several new donors are made public. The newsbot scrapes the 2015 site every five minutes and runs from an EC2 server. As there is an election for local municipalities coming...
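The core of such a newsbot is comparing one scrape with the next and turning any difference into a sentence. The sketch below is a guess at that step under simple assumptions, not the author's bot: each scrape is taken to be a plain list of donor names, the donor names and sentence wording are invented, and the scraping, scheduling and Mandrill email steps are left out.

```python
def find_new_donors(previous, current):
    """Return donors in the current scrape that were absent from the previous one."""
    seen = set(previous)
    return [name for name in current if name not in seen]


def story_sentence(new_donors, party="the Conservative Party"):
    """Build a publishable sentence, or None when nothing changed."""
    if not new_donors:
        return None
    if len(new_donors) == 1:
        return f"{new_donors[0]} has been made public as a new donor to {party}."
    names = ", ".join(new_donors[:-1]) + " and " + new_donors[-1]
    return f"{names} have been made public as new donors to {party}."


# Hypothetical scrape results, five minutes apart.
earlier = ["Donor A", "Donor B"]
later = ["Donor A", "Donor B", "Donor C"]
sentence = story_sentence(find_new_donors(earlier, later))
```

Returning None when nothing changed lets the calling script skip the email entirely on the many five-minute runs where the page is unchanged.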
Civic Engagement Measures

Rashida Kamal

Purpose & Goals

While there are challenges to measuring the resilience of a particular community, academics, NGOs and government agencies have identified several critical indicators of resilience. In their paper, “Measuring Capacities for Community Resilience,” Kathleen Sherrieb et al. have proposed four such indicators: social capital, economic development, communication, and community competency. For this project, I was particularly interested in social capital, defined as both formal and informal networks of social support. I looked at volunteer rates across the United States and a few other questions around civic life from the Current Population Survey. Specifically, I was curious to see if these items were affected by different demographic distributions from community to community. Given that several major metropolitan areas in the U.S. have experienced or are currently experiencing gentrification, it would be interesting to see how civic life changes as a community changes.

The Dataset & Methodology

The Current Population Survey is conducted by the U.S. Census Bureau and Bureau of Labor Statistics. For the most part, the survey is concerned with data around employment, but in September and November of each year, a Volunteer Supplement and a Civic Engagement Supplement are conducted in addition to the main survey. Unfortunately, while data from the Volunteer Supplement is available for 2014, 2013, 2012, 2011, and 2010, only the 2013, 2011, and 2010 Civic Engagement data is readily available on the Current Population Survey FTP. Each year’s data for each survey includes over 100,000 individuals. The data was in a .dat fixed-width file, made intelligible by the accompanying documentation. Each type of response for each question was assigned a...
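A fixed-width file has no delimiters, so each field is recovered by its column positions as listed in the documentation. Here is a minimal sketch of that idea. The field names, column positions and sample record are all hypothetical and do not come from the actual Current Population Survey layout.

```python
# Hypothetical layout: (field name, first column, last column), 1-indexed and
# inclusive, in the style a data dictionary would list them.
LAYOUT = [
    ("household_id", 1, 5),
    ("state_code", 6, 7),
    ("volunteered", 8, 9),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped string fields."""
    return {name: line[first - 1:last].strip() for name, first, last in layout}


def parse_lines(lines, layout=LAYOUT):
    """Parse every non-blank line of a fixed-width file."""
    return [parse_record(line, layout) for line in lines if line.strip()]


# Invented sample records, for illustration only.
sample = ["0004236 1", "0004306 2", ""]
records = parse_lines(sample)
```

Values are kept as strings here because coded answers often carry leading zeros or padding that a numeric conversion would lose.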
Predicting disease spread based on climate change

Meghan Bongartz

Code available at: https://github.com/mbongartz/final-project

The conversation about disease in the United States tends to revolve only around those diseases which pose a current threat or problem. This means that we spend far more time on average talking about measles than Ebola, but it also makes it far more terrifying when Ebola is being talked about, because it means that the disease is suddenly posing a threat and we do not have the infrastructure to deal with an outbreak. For some diseases that we don’t currently consider threats in the United States, the way they spread makes it difficult to predict when they could become problems. However, there are other diseases that may spread or move with climate change, and we should be able to plan for these. My goal was to investigate the risk for spread of tropical diseases in the United States as climate changes over time. There is a plethora of diseases that could be impacted by climate change for various reasons, but I narrowed my area of interest down to vector-borne diseases and, more specifically, mosquito-borne diseases, because mosquitoes show a stronger climate preference than some other vectors such as ticks. I looked at two different vectors: Tiger Mosquitos and Southern House Mosquitos. Before addressing the vectors, though, I needed data on the rate at which climate is changing in the United States. This was available from the National Oceanic and Atmospheric Administration here: http://www.ncdc.noaa.gov/cag/time-series/us. NOAA has information about temperature and precipitation since 1895 that can be downloaded in a nicely formatted CSV file; however, the type of information,...

Teacher Diversity

Katie Worth

Schools in the United States, despite decades of nominally trying to diversify their workforce, still mostly employ white teachers. In a 2011 report titled “Teacher Diversity Matters” by the Center for American Progress, author Ulrich Boser noted that at some point in the next several years, non-Hispanic white children in America’s public schools will be outnumbered by children of color, and in fact, that’s already the case in some states, like California, where 72 percent of students are of color. This diversity is not reflected among the teachers who provide these children with an education: Only 17 percent of the country’s teaching force is made up of people of color. In California, just 29 percent of teachers are people of color. In fact, more than 20 states have a disparity of more than 25 percentage points between the diversity of students and teachers. There is evidence that this yawning gap between the ethnic and racial heritage of the student body and that of the people who teach them has an impact on the quality of education. A follow-up to the 2011 study, “Teacher Diversity Revisited,” published in 2014, notes that teachers of color can serve as role models for students of color, and help them feel more at home in schools. Further, non-white students have better educational outcomes if they are taught by teachers of color. The benefits aren’t just for students of color, either: White students with diverse teachers profit educationally from interacting with people and authority figures who look different than they do. But this phenomenon hasn’t been explored on a data level before, so...
Is there a housing bubble in Switzerland?

Michael and Arthur

1. Financial crisis 2008 and its impact on Switzerland

The financial crisis was caused by a bubble on the US housing market; it made banks tremble.
The impact was strong and worldwide.
There are many studies in the US; our reference for this project is «Bay Area Blues: The Effect of the Housing Crisis». They say expensive houses are less affected than cheap houses.
There is an ongoing surge of average house prices in Switzerland in general, but a lack of local data.

2. What are the criteria for a bubble?

You see strong growth in loan volume and in real estate prices.
If both grow faster than GDP over years, you can talk about a bubble.
If we find data that confirm the two criteria, it is worth investigating further; if not, we have to give up.

3. What did we do?

We looked at the public data on the websites of the Swiss National Bank and the Office for Statistics for the period from 2004 to 2014:

Inflation: 5.5 %
GDP: 32.4 %
House prices: 54.5 %
Housing credits: 86.6 %
Regional range: 36.8 % to 60.4 % (8 regions published)

Conclusion: There must be regions where you can talk about a bubble, so let’s go on and find local data. Officially, you don’t get statistics on local sales. Finally, we found a private company specialized in the housing business that gave us some data.

4. The...
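The two criteria from section 2 can be checked against the national 2004 to 2014 figures from section 3 with a few lines of arithmetic. This is only a sketch under a simplifying assumption: it compares cumulative growth over the whole period rather than year-by-year growth, and the function and variable names are made up for illustration.

```python
# Cumulative growth 2004-2014 in percent, as quoted in section 3.
growth = {
    "inflation": 5.5,
    "gdp": 32.4,
    "house_prices": 54.5,
    "housing_credits": 86.6,
}


def meets_bubble_criteria(figures):
    """True when both house prices and housing credits outgrew GDP."""
    return (figures["house_prices"] > figures["gdp"]
            and figures["housing_credits"] > figures["gdp"])


def excess_over_gdp(figures):
    """Percentage points by which each indicator outgrew GDP."""
    return {
        key: round(figures[key] - figures["gdp"], 1)
        for key in ("house_prices", "housing_credits")
    }


bubble_suspected = meets_bubble_criteria(growth)
excess = excess_over_gdp(growth)
```

On these figures both indicators outgrew GDP, by 22.1 and 54.2 percentage points respectively, which matches the conclusion that it is worth looking for local data.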