Introducing: Data Analysis Studio for Summer 2016

We cover a lot of ground in the Lede Program – you might start without knowing much at all about your computer, and by the end you’re using the command line to run Python machine learning libraries, hex-binning geocoded CSV files, fetching quartiles out of dataframes, and running joins on SQL databases. That’s a lot of tech, really, really quickly.

Any time you have that many tools at your disposal, three big questions appear: When should I use what tool? How do I keep it all straight? What if I want to learn more about X?

To focus on answering these questions, we’re introducing a new course for Summer 2016: Data Analysis Studio, a project-based course focused on beginning-to-end workflow and applying your newfound skills. You’ll spend seven weeks crafting projects from start to finish, learning new skills and refining the ones you’ve gained through your other coursework. Frequent critiques will not only make your projects stronger, but also introduce you and other students to new methods of analyzing data, building visuals and avoiding common pitfalls.

Let’s take a look at how Data Analysis Studio answers those “big three questions.”

When should I use what tool?

Early on in your data career, a lot of tools look the same, and much of their nuance is still hidden. For example, if you’re storing data, when is a CSV file best? What about a SQL database? Why would you need a shapefile? While in-class explanations and working through assignments help you learn the basics, it can be difficult to understand the real-world implications of your choices until you’ve actually made them. Data Analysis Studio not only gives you practice using these skills in a realistic environment, but feedback from instructors and visiting professionals keeps you moving in the right direction as you...

Weibo Text Mining

The objective of our project is to evaluate and predict public attitudes toward a specific social issue using posts from Weibo, an online social platform often described as a Chinese Twitter. See code for this project here

Topic

A man’s brutal beating of a female driver divides the Chinese public after different car videos emerge. Public opinion on this topic split into two camps:
– The woman deserved it
– The man lost his mind

Data

7,000 tweets from May 03 to June 03, including usernames, ids, publish date and time, repost counts, like counts, content, etc.

Data Collection

– Access to the Weibo API: to apply natural language processing techniques to Weibo content, we first tried the Weibo API and later web scraping to collect what people posted on this topic. Both attempts failed to produce a dataset because these sources provide very little data.
– We then found a dataset that someone had already compiled and posted online; it contains over 7,000 tweets on this topic.
– We used TF-IDF to extract Chinese keywords from those 7,000+ tweets.

Method

– Supervised Learning: randomly select 1/10 of the tweets from the database and label the attitude of each one (1: The woman deserved it; -1: The man lost his mind). Read the tweets, decide the attitude of the content, and skip the ones with a murky attitude (e.g., “I think both A and B were wrong, I can’t decide who is at more fault.”).

Processing Data

– Clean data: we need to get rid of the reposted content and also pay attention to the punctuation in special...
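
The labeling sample and the TF-IDF keyword step described above can be sketched roughly as follows. This is a minimal standard-library sketch, not the project's actual code: the function names `sample_for_labeling` and `tfidf_keywords` are hypothetical, the weighting used (term frequency times the log of inverse document frequency) is just one common TF-IDF variant, and each tweet is assumed to be already split into a list of words. Chinese text would first need a word segmenter, which is not shown here.

```python
import math
import random
from collections import Counter


def sample_for_labeling(tweets, fraction=0.1, seed=0):
    """Randomly pick a fraction of the tweets for hand labeling.

    The fixed seed is an assumption made so the sample is reproducible.
    """
    rng = random.Random(seed)
    k = max(1, round(len(tweets) * fraction))
    return rng.sample(tweets, k)


def tfidf_keywords(tokenized_docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document.

    tokenized_docs is a list of token lists, one per tweet.
    """
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        # Score = (share of the document taken by the term)
        #         * log(number of documents / documents containing the term)
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords
```

A term that appears in every tweet scores zero under this weighting, so words shared by the whole dataset drop out and the terms that distinguish one tweet from the rest rise to the top. With 7,000 tweets, `sample_for_labeling` at the default fraction returns 700 of them to read and label.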