Purpose & Goals
While measuring the resilience of a particular community is challenging, academics, NGOs, and government agencies have identified several critical indicators of resilience. In their paper, “Measuring Capacities for Community Resilience,” Kathleen Sherrieb et al. propose four such indicators: social capital, economic development, communication, and community competency. For this project, I was particularly interested in social capital, defined as both formal and informal networks of social support. I looked at volunteer rates across the United States and a few other questions around civic life from the Current Population Survey. Specifically, I was curious to see whether these items were affected by different demographic distributions from community to community. Given that several major metropolitan areas in the U.S. have experienced or are currently experiencing gentrification, it would be interesting to see how civic life changes as a community changes.
The Dataset & Methodology
The Current Population Survey is conducted by the U.S. Census Bureau and the Bureau of Labor Statistics. For the most part, the survey is concerned with employment data, but in September and November of each year, a Volunteer Supplement and a Civic Engagement Supplement are conducted in addition to the main survey. Unfortunately, while data from the Volunteer Supplement is available for 2010 through 2014, only the 2010, 2011, and 2013 Civic Engagement data is readily available on the Current Population Survey FTP.
Each year’s data for each survey includes over 100,000 individuals. The data came in a .dat fixed-width file, made intelligible by the accompanying documentation. Each type of response for each question was assigned a numerical value. The biggest challenge of this project was making the data analysis-ready. To clean the data, I first used the documentation to read and convert the fixed-width file into a dataframe, keeping the survey questions that I might be interested in. The second data cleaning challenge was re-coding the numerical values associated with each question. For some questions, I thought it would be more helpful to split responses into different categories before taking counts.
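A minimal sketch of those two cleaning steps, using plain Python rather than a dataframe so that it runs without any libraries. The column positions, field names, and code book values below are hypothetical placeholders; the real ones come from each supplement's documentation.

```python
# Hypothetical layout: field name -> (start, end) character positions
# (0-based, end-exclusive). Real positions come from the documentation.
LAYOUT = {
    "state_fips": (0, 2),
    "sex": (2, 4),
    "volunteer_status": (4, 6),
}

# Hypothetical code book: numeric response code -> readable label.
CODEBOOK = {
    "sex": {1: "male", 2: "female"},
    "volunteer_status": {1: "volunteer", 2: "non-volunteer", -1: "not in universe"},
}


def parse_line(line):
    """Slice one fixed-width record into integer fields."""
    return {name: int(line[start:end]) for name, (start, end) in LAYOUT.items()}


def recode(record):
    """Swap numeric codes for labels wherever the code book has an entry."""
    return {
        name: CODEBOOK.get(name, {}).get(value, value)
        for name, value in record.items()
    }


# Three made-up records in the hypothetical layout above.
sample_lines = ["36 2 1", "49 1 2", "36 1-1"]
records = [recode(parse_line(line)) for line in sample_lines]
for record in records:
    print(record)
```

Fields without a code book entry (here, the state code) pass through unchanged, which makes it easy to re-code one question at a time.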
One important statistical consideration in working with this dataset is weighting. Because the Current Population Survey is only a sample of the population it hopes to represent, the survey responses must be weighted to account for the differences between the sample and the population. Within this sample, each individual response is assigned a weight that is essentially the inverse of the probability that the person is in the sample. As the response rates of the supplements are slightly different from those of the main survey, the questions on the supplements are weighted separately. It is important to remember that the proportions produced from this data are population estimates (and for publication, it would be important to report the standard error with each estimate).
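To illustrate why the weights matter, here is a small sketch of a weighted proportion: the sum of the weights of matching respondents divided by the sum of all weights. The respondents and weights are made up for illustration, and the sketch does not attempt the standard error calculation.

```python
def weighted_proportion(responses, is_match):
    """Estimate a population proportion from (value, weight) pairs."""
    total = sum(weight for _, weight in responses)
    matched = sum(weight for value, weight in responses if is_match(value))
    return matched / total


# Made-up respondents: (answer, supplement weight).
responses = [
    ("volunteer", 1500.0),
    ("non-volunteer", 3000.0),
    ("volunteer", 500.0),
    ("non-volunteer", 1000.0),
]

# Raw share of respondents who volunteer, ignoring weights.
unweighted = sum(1 for value, _ in responses if value == "volunteer") / len(responses)
# Weighted estimate of the share of the population that volunteers.
weighted = weighted_proportion(responses, lambda value: value == "volunteer")

print(unweighted)
print(round(weighted, 3))
```

In this toy sample half of the respondents volunteer, but because the non-volunteers carry larger weights, the population estimate comes out closer to one third.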
Initially, when I looked through the available items from the survey, I was excited to see that the county of each respondent had been recorded. My original intention had been to focus more narrowly on volunteer and civic life measures in New York City. However, upon closer analysis of how the weights were calculated, it became clear that it would not be statistically sound to make population estimates at the county level. The weights are based on state-level population and, according to the U.S. Census Bureau, can reasonably be used to provide estimates for areas as large as states and the 12 largest Metropolitan Statistical Areas (MSAs), as defined by the Office of Management and Budget.
The final transformation of my data was creating a dictionary for each state and MSA. I recorded the proportion of given categories within a state for demographic questions and the volunteer and civic engagement questions I was interested in. I also looked at how the demographics broke down in volunteers in each of those geographical areas. Highlights from this analysis are reported under “Findings.”
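The per-area dictionaries could be built along these lines. This is a sketch under assumptions: the row format, the field names, and the area labels are hypothetical, and each row is assumed to already carry its supplement weight.

```python
from collections import defaultdict


def proportions_by_area(rows, question):
    """Return {area: {category: weighted share}} for one survey question."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row["area"]][row[question]] += row["weight"]

    result = {}
    for area, counts in totals.items():
        area_total = sum(counts.values())
        result[area] = {
            category: weight / area_total for category, weight in counts.items()
        }
    return result


# Made-up rows; uppercase labels stand for states, lowercase for MSAs.
rows = [
    {"area": "UT", "sex": "female", "weight": 300.0},
    {"area": "UT", "sex": "male", "weight": 100.0},
    {"area": "nyc", "sex": "female", "weight": 250.0},
    {"area": "nyc", "sex": "male", "weight": 250.0},
]

summary = proportions_by_area(rows, "sex")
print(summary["UT"])
print(summary["nyc"])
```

Running the same function on only the rows for volunteers gives the demographic breakdown of volunteers in each area.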
Next, I tried to build a machine learning model that would predict volunteer rates from a community’s demographics. To do this, I labeled the volunteer rate for each state and MSA (across available years) as high volunteerism (3), moderate volunteerism (2), or low volunteerism (1). A rate was labeled high if it fell above the 75th percentile of all volunteer rates in my dataset and low if it fell below the 25th percentile. I trained a Random Forest Classifier on 75% of my labeled data and, after cross validation, found an accuracy score of approximately 72%. I ran some additional test data to see how the model would make predictions. However, even for extremely skewed demographic proportions, the model seemed to make the same predictions. It is possible that one of the demographic features is having an unanticipated influence on the outcome; further testing is likely necessary to isolate the exact issue.
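The labeling rule can be sketched with the standard library as follows. The rates are made up, and the choice of the "inclusive" quantile method is an assumption; the classifier itself is left out of the sketch. The last lines compute the accuracy of always guessing the most common label, which may be a useful baseline to compare against when a model keeps returning the same prediction.

```python
from collections import Counter
from statistics import quantiles


def label_rates(rates):
    """Label each rate 1 (low), 2 (moderate), or 3 (high) by quartile."""
    # quantiles(..., n=4) returns the 25th, 50th, and 75th percentiles.
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    labels = []
    for rate in rates:
        if rate > q3:
            labels.append(3)  # high volunteerism
        elif rate < q1:
            labels.append(1)  # low volunteerism
        else:
            labels.append(2)  # moderate volunteerism
    return labels


# Made-up volunteer rates, one per area-year.
rates = [0.18, 0.21, 0.24, 0.26, 0.27, 0.29, 0.33, 0.41, 0.44]
labels = label_rates(rates)
print(labels)

# Accuracy of a "model" that always predicts the most common label.
most_common_label, count = Counter(labels).most_common(1)[0]
baseline = count / len(labels)
print(most_common_label, round(baseline, 3))
```

Because roughly half of the rows fall between the 25th and 75th percentiles by construction, always predicting "moderate" already scores around 50%, so an accuracy figure is easier to interpret next to that baseline.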
Findings
I examined three items from the supplements:
- Who volunteers?
- Who talks to their neighbors (a lot)?
- Who helps their neighbors?
For each of these items, I looked at the head and tail of the dataframe when sorted by the rates associated with these items. Note that the lowercase items refer to Metropolitan Statistical Areas, whereas the uppercase items refer to states. Some states and MSAs appear multiple times because each row represents the values for a given area in a given year (from the 2010, 2011, and 2013 datasets).
Perhaps fitting in with American geographical stereotypes, Midwesterners are particularly involved in volunteer work. Disappointingly, the NYC Metropolitan Area (which actually encompasses parts of NY, NJ, and PA) had consistently low volunteer rates in the years represented by this data. It would be interesting to see if the rate was higher in 2012, given the response efforts after Superstorm Sandy.
Who talks to their neighbors (a lot)?
Who helps their neighbors?
As you’ll notice, Utah appears multiple times across these questions as having high rates of civic engagement. I took a closer look at Utah’s demographic information:
In Utah, white females with some college or a Bachelor’s degree volunteer at slightly higher rates. We will see that this pattern is also consistent with NYC MSA volunteers:
Though the NYC MSA has a much lower volunteer rate than Utah, we see that roughly the same types of people are volunteering. An interesting deviation, however, is the proportion of volunteers with an income of 150k or more. Though this group represents 17% of the NYC general population, it makes up 31% of its volunteers.
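One way to put a single number on that gap is a representation ratio: a group's share of volunteers divided by its share of the general population, where 1.0 means the group volunteers in proportion to its size. The function name is just for illustration; the two percentages are the NYC figures quoted above.

```python
def representation_ratio(share_of_volunteers, share_of_population):
    """Ratio above 1.0 means the group is over-represented among volunteers."""
    return share_of_volunteers / share_of_population


# 150k+ group in the NYC MSA: 31% of volunteers, 17% of the population.
ratio = representation_ratio(0.31, 0.17)
print(round(ratio, 2))
```

By this measure the 150k+ group appears among NYC volunteers at a bit under twice the rate its population share would suggest.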
For the NYC MSA, the other civic life questions (who talks to their neighbors and who helps their neighbors) showed similar demographic distributions, though the 150k+ category of talkers and helpers dropped back down to the proportion expected given their distribution in the general population.
Lessons Learned + Other Ideas
- Narrower focus: even after the data had been cleaned and re-categorized, there was still a ton of data to potentially work with. It would have been better to focus on just one area after a bit of EDA. For example, it would have been interesting to focus on the NYC MSA to see subtle changes in rates and demographics over time. Or, though Civic Engagement Supplement data is not available for 2012, it would have been interesting to look at the volunteer data to see if there was a boost in the volunteer rate that year (and perhaps whether surrounding areas like New Jersey and Pennsylvania saw a similar boost). In class, Allison brought up a great point that not all service is robust, involved service. Looking more specifically at NYC would make it more feasible to look at the other questions and data points in the Volunteer Supplement.
- A few other ideas for what to explore next:
  - It may be worth looking at the percent change in demographics from year to year for a given community to understand how changing demographics impact civic life measures.
  - I could use the data from the Volunteer Supplement, vectorize demographic information for each respondent, and create an algorithm that would predict whether a given person is a volunteer (rather than predicting rates across a whole community). This may be useful if non-profits or government agencies would like to target service opportunities to those who are most likely to volunteer (or boost volunteerism amongst others).
  - I didn’t explore many of the questions from the supplements. I could look at what proportion of people go to public meetings, participate in civic groups, volunteer through their church, etc.
  - Other cities: I could look more closely at the data for San Francisco, Detroit, Chicago, DC, etc.