Introducing: Data Analysis Studio for Summer 2016

We cover a lot of ground in the Lede Program – you might start without knowing much at all about your computer, and by the end you’re using the command line to run Python machine learning libraries, hex-binning geocoded CSV files, fetching quartiles out of dataframes, and running joins on SQL databases. That’s a lot of tech, really, really quickly. Any time you have that many tools at your disposal, three big questions appear: When should I use what tool? How do I keep it all straight? What if I want to learn more about X?

To focus on answering these questions, we’re introducing a new course for Summer 2016: Data Analysis Studio, a project-based course focused on beginning-to-end workflow and applying your newfound skills. You’ll spend seven weeks crafting projects from start to finish, learning new skills and refining the ones you’ve gained through your other coursework. Frequent critiques will not only make your projects stronger, but also introduce you and other students to new methods of analyzing data, building visuals and avoiding common pitfalls. Let’s take a look at how Data Analysis Studio answers those “big three questions.”

When should I use what tool?

Early on in your data career, a lot of tools look the same, and much of their nuance is still hidden. For example, if you’re storing data, when is a CSV file best? What about a SQL database? Why would you need a shapefile? While in-class explanations and working through assignments help you learn the basics, it can be difficult to understand the real-world implications of your choices until you’ve actually made them. Data Analysis Studio gives you practice using these skills in a realistic environment, and feedback from instructors and visiting professionals keeps you moving in the right direction as you...
Exploring Global Terrorism

Aliza Goldberg

RAND Corporation’s database of all terrorist attacks from 1968-2009 reveals that the most common weapon used by terrorists was explosives and that the most attacks occurred in 2006. The most fatalities from a single attack happened on 9/11, but the most overall deaths from terrorist attacks happened in Iraq. The number of attacks over the course of 41 years was truly staggering, as seen from map visualizations of the database.

RAND Corporation, a well-known American think tank focused on international military affairs, began this database in 1980 after forming the Cabinet Committee to Combat Terrorism in 1972. RAND uses the common academic definition of terrorism, quoting terrorism scholar Bruce Hoffman to specify that terrorism is:

1. violent
2. meant to create fear
3. intended to coerce counteraction
4. politically motivated
5. against civilians
6. by either a group or an individual

Since terrorism can be difficult to classify, the data may be skewed. Other terrorist attacks may have gone unreported. Borders and names of countries have changed over the last 41 years, which may have led to some analysis errors. The database ends in 2009, so terrorist activity since then, such as the rise of ISIL, is not included.

I cleaned the database to convert the dates into a recognizable format and, only for the analysis of terrorist groups, to exclude the “unknown” and “other” perpetrators. With bar graphs, I showed the “year of terror” (2006), the weapons used, how fatal those weapons were and the deadliest countries. Using a Mapquest API key, I geocoded all of the terrorist attacks by city or country. I used a pivot table...
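The cleaning steps described above (converting dates, setting aside “unknown” and “other” perpetrators for the group analysis, and counting attacks per year) might look roughly like the pandas sketch below. The column names and the sample rows are invented for illustration; they are not taken from the RAND database or from the author's notebook.

```python
import pandas as pd

# Invented sample rows; the real database's columns and values may differ.
attacks = pd.DataFrame({
    "date": ["1975-02-09", "1998-05-20", "2006-03-03", "2006-07-17"],
    "perpetrator": ["Unknown", "Group A", "Other", "Group B"],
    "weapon": ["Explosives", "Firearms", "Explosives", "Firearms"],
    "fatalities": [0, 2, 3, 5],
})

# Turn the date strings into real timestamps and pull out the year.
attacks["date"] = pd.to_datetime(attacks["date"], errors="coerce")
attacks["year"] = attacks["date"].dt.year

# Attacks per year, the kind of count behind a "year of terror" bar graph.
attacks_per_year = attacks.groupby("year").size()

# Only for the analysis of groups: drop unattributed perpetrators.
named_groups = attacks[~attacks["perpetrator"].isin(["Unknown", "Other"])]

# Total fatalities per weapon type, using the full (unfiltered) data.
fatalities_by_weapon = attacks.groupby("weapon")["fatalities"].sum()
```

Keeping the perpetrator filter in a separate dataframe means the yearly and weapon totals still include the unattributed attacks, which matches the "only for analysis of terrorist groups" caveat.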
Civic Engagement Measures

Rashida Kamal

Purpose & Goals

While there are challenges to measuring the resilience of a particular community, academics, NGOs and government agencies have identified several critical indicators of resilience. In their paper, “Measuring Capacities for Community Resilience,” Kathleen Sherrieb et al. have proposed four such indicators: social capital, economic development, communication, and community competency. For this project, I was particularly interested in social capital, defined as both formal and informal networks of social support. I looked at volunteer rates across the United States and a few other questions around civic life from the Current Population Survey. Specifically, I was curious to see if these items were affected by different demographic distributions from community to community. Given that several major metropolitan areas in the U.S. have experienced or are currently experiencing gentrification, it would be interesting to see how civic life changes as a community changes.

The Dataset & Methodology

The Current Population Survey is conducted by the U.S. Census Bureau and Bureau of Labor Statistics. For the most part, the survey is concerned with data around employment, but in September and November of each year, a Volunteer Supplement and a Civic Engagement Supplement are conducted in addition to the main survey. Unfortunately, while data from the Volunteer Supplement is available for 2014, 2013, 2012, 2011, and 2010, only the 2013, 2011, and 2010 Civic Engagement data is readily available on the Current Population Survey FTP. Each year’s data for each survey includes over 100,000 individuals. The data was in a .dat fixed-width file, made intelligible by the accompanying documentation. Each type of response for each question was assigned a...
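As a rough illustration of reading a fixed-width .dat file like the one described above, here is a minimal pandas sketch. The column positions, column names and response codes are hypothetical stand-ins; the real ones come from the survey's accompanying documentation.

```python
import io
import pandas as pd

# Three invented fixed-width records: chars 0-3 = respondent id,
# char 4 = coded answer, chars 5-6 = age. Positions are hypothetical.
raw = io.StringIO(
    "0001134\n"
    "0002251\n"
    "0003127\n"
)

survey = pd.read_fwf(
    raw,
    colspecs=[(0, 4), (4, 5), (5, 7)],  # half-open [start, end) positions
    names=["respondent_id", "volunteered_code", "age"],
    header=None,
)

# Hypothetical codebook: translate numeric response codes into labels.
codebook = {1: "Yes", 2: "No"}
survey["volunteered"] = survey["volunteered_code"].map(codebook)

# Share of respondents in this toy sample who answered "Yes".
volunteer_rate = (survey["volunteered"] == "Yes").mean()
```

Keeping the raw code column alongside the labeled one makes it easier to check the mapping against the documentation later.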
Is there a housing bubble in Switzerland?

Michael and Arthur

1. The 2008 financial crisis and its impact on Switzerland
The financial crisis was caused by a bubble in the US housing market, and it made banks tremble.
The impact was strong and worldwide.
There are many studies in the US; our reference for this project is «Bay Area Blues: The Effect of the Housing Crisis». It finds that expensive houses were less affected than cheap houses.
Average house prices in Switzerland keep surging in general, but local data is lacking.

2. What are the criteria for a bubble?
You see strong growth in loan volume and in real estate prices.
If both grow faster than GDP over several years, you can talk about a bubble.
If we find data that confirms the two criteria, it is worth investigating further; if not, we have to give up.

3. What did we do?
We looked at the public data available on the websites of the Swiss National Bank and the Office for Statistics for the period from 2004 to 2014.
Inflation: 5.5 %
GDP: 32.4 %
House prices: 54.5 %
Housing credits: 86.6 %
Regional range: from 36.8 % to 60.4 % (8 regions published)
Conclusion: There must be regions where you can talk about a bubble, so let’s go on and find local data.
Officially, no statistics on local sales are available. Finally, we found a private company specializing in the housing business that gave us some data.

4. The...
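The two criteria from section 2 can be checked against the 2004 to 2014 figures from section 3 with a few lines of Python. This is only a sketch: it compares cumulative growth over the whole period, which is a simplification of "faster than GDP over years", and the function name is our own.

```python
# Cumulative growth 2004-2014 in percent, as listed in section 3.
GDP_GROWTH = 32.4
HOUSE_PRICE_GROWTH = 54.5
HOUSING_CREDIT_GROWTH = 86.6
REGIONAL_RANGE = (36.8, 60.4)  # lowest and highest of the 8 published regions


def meets_bubble_criteria(price_growth, credit_growth, gdp_growth=GDP_GROWTH):
    """Return True if both prices and credit volume outgrew GDP."""
    return price_growth > gdp_growth and credit_growth > gdp_growth


# National level: both criteria are met.
national_signal = meets_bubble_criteria(HOUSE_PRICE_GROWTH, HOUSING_CREDIT_GROWTH)

# Even the lowest regional figure is above GDP growth.
all_regions_above_gdp = all(g > GDP_GROWTH for g in REGIONAL_RANGE)

# How far each indicator outran GDP, in percentage points.
price_gap = round(HOUSE_PRICE_GROWTH - GDP_GROWTH, 1)
credit_gap = round(HOUSING_CREDIT_GROWTH - GDP_GROWTH, 1)
```

With these numbers both gaps are positive, which is consistent with the conclusion that it is worth looking for local data.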
NYC Taxi Complaint Data

Elliot Ramos

Refusing to see the truth at refusals

In October of 2013 and January of 2014, I obtained a series of files from TLC, 311 and DoITT. The agencies collaborated to provide an extensive set of data that included key fields not found on the open data portal for New York City: specifically, the “descriptor” fields, which include TLC’s categorization for taxi complaints, and the verbatim narrative field, which is filled out by a 311 dispatcher or via an online form submitted by residents. The data set is extensive and required months of manual work at some points. For the purposes of the class project, I’m focusing on the analysis and slicing of the data using pandas.

Here is the data provided by the City of New York. At first, the city provided two files, split into complaints with summons and complaints without summons. Specific locations were not provided, but Service Request numbers were. The data goes back to January 2010 by incident date; a handful of earlier records were included in this set and were excluded from the overall analysis as noise.

Excel files: Using Excel, those files were stacked on top of each other, with a flag set to TRUE if the record resulted in a summons. Subsequent requests were made to provide additional data that had Service Request numbers and the Open Data portal Unique IDs. This allowed the data to be merged, using csvkit, with data from the Open Data site, which includes location data such as x, y coordinates.

Open Data taxi complaint set: https://data.cityofnewyork.us/Social-Services/311-Taxi-Complaints/uppf-z66u

Subsequent requests were made to fill...
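The stacking and merging described above were done in Excel and csvkit; the same steps could be sketched in pandas roughly as follows. The column names (`sr_number`, `incident_date`, `x`, `y`) and all sample rows are invented for illustration and do not reflect the actual files.

```python
import pandas as pd

# Invented stand-ins for the two files the city provided.
with_summons = pd.DataFrame({
    "sr_number": ["SR-1", "SR-2"],
    "incident_date": ["2010-03-01", "2009-12-15"],
})
without_summons = pd.DataFrame({
    "sr_number": ["SR-3"],
    "incident_date": ["2011-06-20"],
})

# Flag each file before stacking so the origin of every row survives.
with_summons["summons"] = True
without_summons["summons"] = False
complaints = pd.concat([with_summons, without_summons], ignore_index=True)

# Drop the handful of records dated before January 2010.
complaints["incident_date"] = pd.to_datetime(complaints["incident_date"])
complaints = complaints[complaints["incident_date"] >= pd.Timestamp("2010-01-01")]

# Invented stand-in for the Open Data extract that carries coordinates.
open_data = pd.DataFrame({"sr_number": ["SR-1"], "x": [987654], "y": [210987]})

# Left join keeps every complaint; `_merge` shows which rows found a match.
merged = complaints.merge(open_data, on="sr_number", how="left", indicator=True)
```

A left join with `indicator=True` makes it easy to count how many complaints could not be matched to a location, instead of silently dropping them.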