The objective of our project is to evaluate and predict public attitudes toward a specific social issue using posts from Weibo, a Chinese microblogging platform similar to Twitter.

See code for this project here


Man’s brutal beating of female driver divides Chinese public after different car videos emerge.

Public opinion on this topic split into two camps:

– The woman deserved it

– The man lost his mind


Over 7,000 tweets posted from May 3 to June 3, with usernames, IDs, publish date and time, repost counts, like counts, content, and other fields.

Data Collection

– Access to the Weibo API

To apply natural language processing techniques to Weibo content, we first tried the Weibo API and then web scraping to collect what people posted on this topic. Both attempts failed because they returned very little data.

– We then found a dataset that someone had already compiled and posted online; it contains over 7,000 tweets on this topic.

– We used TF-IDF to extract the Chinese keywords from these 7,000+ tweets.
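As a rough illustration of the TF-IDF step, here is a minimal pure-Python sketch. It assumes the tweets have already been segmented into word lists; the sample tweets and the `tfidf_keywords` helper are hypothetical and not taken from the project. The write-up does not say which tool produced the keywords (jieba, for instance, ships its own extractor, `jieba.analyse.extract_tags`), so treat this only as a picture of the idea: a word scores high when it is frequent in one tweet but rare across the others.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        # Score = relative term frequency * log inverse document frequency.
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results

# Hypothetical, already-segmented tweets (not from the real dataset).
docs = [
    ["女司机", "变道", "危险"],
    ["男司机", "殴打", "暴力"],
    ["女司机", "殴打", "视频"],
]
keywords = tfidf_keywords(docs, top_k=2)
```

In the first toy tweet, "女司机" also appears in another tweet, so it ranks below the two words that are unique to that tweet.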


-Supervised Learning

Randomly select one tenth of the tweets from the dataset and label the attitude each one expresses.

1: The woman deserved it;

-1: The man lost his mind

Read each tweet, decide which attitude it expresses, and skip the tweets whose attitude is ambiguous (e.g., “I think both A and B were wrong, I can’t decide who is at more fault.”).
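The sampling and labeling procedure could be sketched as follows. The tweet IDs, the `sample_for_labeling` helper, and the use of `None` for skipped tweets are all illustrative assumptions; the actual labels were assigned by hand.

```python
import random

def sample_for_labeling(tweet_ids, fraction=0.1, seed=0):
    """Draw a reproducible random sample of tweets for manual labeling."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    k = min(len(tweet_ids), max(1, round(len(tweet_ids) * fraction)))
    return rng.sample(tweet_ids, k)

tweet_ids = list(range(7000))  # hypothetical IDs standing in for the dataset
to_label = sample_for_labeling(tweet_ids)

# Manual labels: 1 = "The woman deserved it", -1 = "The man lost his mind",
# None = ambiguous attitude (assumed marker for tweets that are skipped).
manual_labels = {to_label[0]: 1, to_label[1]: -1, to_label[2]: None}
training_set = {tid: y for tid, y in manual_labels.items() if y is not None}
```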

Processing Data

-clean data

We need to remove the reposted content and handle the punctuation produced by Chinese input methods, such as full-width punctuation marks.
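A minimal sketch of this cleaning step is shown below. It rests on two assumptions that the write-up does not confirm: that reposted content follows the `//@username:` marker that Weibo places before quoted posts, and that the punctuation issue refers to full-width characters. The `clean_tweet` helper and the sample text are hypothetical.

```python
import re
import unicodedata

def clean_tweet(text):
    """Drop the reposted part of a tweet and normalize punctuation."""
    # Assumption: everything from the first "//@" onward is reposted content.
    own_text = text.split("//@", 1)[0]
    # NFKC folds full-width forms such as "！" and "，" into their ASCII
    # equivalents; the ideographic full stop "。" is left unchanged.
    normalized = unicodedata.normalize("NFKC", own_text)
    # Collapse runs of whitespace (including the folded full-width space).
    return re.sub(r"\s+", " ", normalized).strip()

raw = "太过分了！！　//@用户A:女司机变道太危险"
cleaned = clean_tweet(raw)  # "太过分了!!"
```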

-text mining

Word Segmentation

Before: # 成都司机女司机变道遭殴打 #

After: # 成都 司机 女司机 变道 遭 殴打 #
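To show what segmentation does mechanically, here is a sketch of forward maximum matching, a much simpler dictionary-based technique than the one the jieba package uses. The tiny vocabulary and the `forward_max_match` helper are hypothetical and chosen only so that the example reproduces the split above; in practice a call such as `jieba.lcut(sentence)` would take its place.

```python
def forward_max_match(text, vocab, max_len=3):
    """Greedy segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical mini dictionary covering the example sentence.
vocab = {"成都", "司机", "女司机", "变道", "殴打"}
segmented = " ".join(forward_max_match("成都司机女司机变道遭殴打", vocab))
# segmented == "成都 司机 女司机 变道 遭 殴打"
```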

-data mining

We first import the training data from the Excel spreadsheet into a pandas DataFrame.

Then we prepare for text analysis by using the “jieba” package to segment the Chinese sentences from Weibo. After extracting features, we build a model on the training data and evaluate it with 10-fold cross-validation.
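The write-up does not name the features or the model, so the following is only a sketch of how 10-fold cross-validation works, using a bag-of-words multinomial Naive Bayes classifier as an assumed stand-in. All function names and the toy data are hypothetical, and the code assumes at least `k` labeled samples.

```python
import math
import random
from collections import Counter

def train_nb(samples):
    """Fit word counts per class from (tokens, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in tokens:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(samples, k=10, seed=0):
    """Mean accuracy over k folds, each fold held out once for testing."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_nb(train)
        correct = sum(predict_nb(model, toks) == y for toks, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Toy, perfectly separable data (hypothetical, not the project's labels).
samples = [(["变道", "危险"], 1)] * 10 + [(["殴打", "暴力"], -1)] * 10
mean_accuracy = cross_validate(samples, k=10)
```

Each tweet is used for testing exactly once and for training in the other nine rounds, which gives a less optimistic accuracy estimate than scoring the model on its own training data.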

Split the tweets into two datasets by publish date: before and after the release of the second video (May 6th).
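The date-based split could look like the sketch below. The year 2015, the record layout, and the choice to count tweets posted on May 6 itself as "after" are assumptions made for illustration; the write-up gives only the month and day.

```python
from datetime import date

# Assumed year; the source states only "May 6th".
SECOND_VIDEO_RELEASE = date(2015, 5, 6)

# Hypothetical records standing in for rows of the dataset.
tweets = [
    {"id": 1, "published": date(2015, 5, 4)},
    {"id": 2, "published": date(2015, 5, 6)},
    {"id": 3, "published": date(2015, 5, 20)},
]

# Tweets from the release day onward go into the "after" set (assumption).
before = [t for t in tweets if t["published"] < SECOND_VIDEO_RELEASE]
after = [t for t in tweets if t["published"] >= SECOND_VIDEO_RELEASE]
```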

Test again:

Model Application

Apply the model to 5,000 tweets that were not used in training, and let the machine predict the attitude of each one (1 → The woman deserved it; -1 → The man lost his mind).

Percentage of tweets with predicted attitude of “The woman deserved it”:
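This percentage is a simple ratio over the predicted labels. A minimal sketch, with a hypothetical `share_of_label` helper and made-up predictions rather than the project's actual output:

```python
def share_of_label(predictions, label=1):
    """Percentage of predictions equal to the given label."""
    if not predictions:
        return 0.0
    matches = sum(1 for p in predictions if p == label)
    return 100.0 * matches / len(predictions)

# Hypothetical model output: 1 = "The woman deserved it", -1 = the opposite.
predictions = [1, -1, -1, 1, -1]
pct = share_of_label(predictions)  # 2 of 5 predictions are 1
```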