Refusing to see the truth at refusals
In October of 2013 and January of 2014, I obtained a series of files from TLC, 311 and DoITT. The agencies collaborated to provide an extensive set of data that included key fields not found on the open data portal for New York City: specifically the “descriptor” fields, which include TLC’s categorization for taxi complaints, and the verbatim narrative field, which is filled out by a 311 dispatcher or via the online form that residents submit.
The data set is extensive and required months of manual work at some points. For the purposes of the class project, I’m focusing on the analysis and slicing of the data using pandas.
Here is the data provided by the city of New York. At first, they provided two files, split into complaints with summonses and complaints without. Specific locations were not provided, but Service Request numbers were. The data goes back to January 2010 by incident date; a handful of earlier records were included in this set and were excluded from the overall analysis as noise.
Using Excel, those files were stacked atop each other, with a flag set to TRUE if the record resulted in a summons.
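The same stacking step could also be done in pandas rather than Excel. This is a minimal sketch, not the workflow actually used: the toy rows and the column names (`service_request`, `incident_date`, `summons`) are assumptions standing in for the two files the city provided.

```python
import pandas as pd

# Toy stand-ins for the two files; column names are assumptions.
with_summons = pd.DataFrame({
    "service_request": ["SR-001", "SR-002"],
    "incident_date": ["2012-03-01", "2012-03-02"],
})
without_summons = pd.DataFrame({
    "service_request": ["SR-003", "SR-004", "SR-005"],
    "incident_date": ["2012-03-02", "2012-03-04", "2012-03-05"],
})

# Flag each source before stacking so the origin of every row survives.
with_summons["summons"] = True
without_summons["summons"] = False

# Stack the two tables and rebuild a clean 0..n-1 index.
complaints = pd.concat([with_summons, without_summons], ignore_index=True)
print(complaints)
```

Setting the flag before the concat means no row can lose track of which file it came from.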
Subsequent requests were made for additional data that carried both Service Request numbers and the Open Data portal Unique IDs. This allowed the files to be merged, using csvkit, with the open data set, which includes location data such as x, y coordinates.
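The merge itself was done with csvkit, but a rough pandas equivalent looks like the sketch below. The column names (`unique_key`, `x_coord`, `y_coord`, `descriptor`) and all values are made up for illustration. A left join keeps every complaint from the city files, and the `indicator` column shows which ones found no match in the open data set.

```python
import pandas as pd

# Hypothetical city-provided records carrying both identifiers.
city_files = pd.DataFrame({
    "service_request": ["SR-001", "SR-002", "SR-003"],
    "unique_key": [101, 102, 103],
    "descriptor": ["Driver Complaint", "Refusal", "Refusal"],
})

# Hypothetical open data records carrying the location fields.
open_data = pd.DataFrame({
    "unique_key": [101, 103, 104],
    "x_coord": [987654.0, 1001234.0, 995000.0],
    "y_coord": [212345.0, 198765.0, 201000.0],
})

# Left join: keep every city record, attach coordinates where the key matches.
merged = city_files.merge(open_data, on="unique_key", how="left", indicator=True)

# Rows with no open data match end up with missing coordinates.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["service_request", "unique_key"]])
```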
Open Data taxi complaint set:
Subsequent requests were made to fill in the gap of records from October 2013 to January 2014, thus the records are really only reliable for 2010-2013, giving me four years’ worth of complaints. While the records do include January 2014 data, it was also excluded from the analysis.
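Restricting the analysis to that 2010-2013 window is a short filter in pandas. A minimal sketch, assuming a hypothetical `incident_date` column and toy rows:

```python
import pandas as pd

# Toy rows: one stray pre-2010 record, two in-window, one from January 2014.
complaints = pd.DataFrame({
    "incident_date": ["2009-12-30", "2010-01-01", "2013-12-31", "2014-01-15"],
})
complaints["incident_date"] = pd.to_datetime(complaints["incident_date"])

# Filter on the year so time-of-day on the boundary dates cannot drop rows.
in_window = complaints["incident_date"].dt.year.between(2010, 2013)
analysis = complaints[in_window]
print(analysis)
```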
Further, summons data was not included with the gap files for Oct-Jan. Not wanting to press my luck by angering the information officers, and not having an immediate need for that field, I did not ask for it. I’ve left the summons field in there for 2010-2013, but did not use it.
1) Cab refusals and cab complaints in general have declined from 2010 to 2013 (and appear to be trending that way)
This could be indicative of a lot of transit issues at play. Usually weather and time of day are the biggest factors with cab refusals (try hailing a cab when it starts to rain). But since the data was collected, New York City has implemented CitiBike, a bike-sharing program, and the green boro cabs, which serve areas above the Manhattan central business district and the outer boros (although recent data suggests they’re heavily serving gentrifying areas such as Astoria, Harlem, Park Slope and Williamsburg). Ride services such as Lyft and Uber have become popular and have been a point of consternation for the taxi industry. The decline in complaints may reflect a declining reliance on the taxi infrastructure.
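The year-over-year decline in finding 1 comes down to a per-year tally. The sketch below shows the shape of that calculation on toy rows. The `incident_date` and `descriptor` column names are assumptions, and the counts are illustrative, not the real figures.

```python
import pandas as pd

# Toy complaints; dates and categories are illustrative only.
complaints = pd.DataFrame({
    "incident_date": pd.to_datetime([
        "2010-05-01", "2010-06-11", "2010-09-30",
        "2011-02-14", "2011-07-04",
        "2012-08-20", "2012-11-02",
        "2013-03-17",
    ]),
    "descriptor": ["Refusal", "Refusal", "Fare", "Refusal",
                   "Fare", "Refusal", "Fare", "Refusal"],
})
complaints["year"] = complaints["incident_date"].dt.year

# All complaints per year, then refusals only.
per_year = complaints.groupby("year").size()
refusals_per_year = (
    complaints[complaints["descriptor"] == "Refusal"].groupby("year").size()
)

# Fractional change from one year to the next (NaN for the first year).
yoy_change = per_year.pct_change()
print(per_year, refusals_per_year, yoy_change, sep="\n\n")
```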
2) Cab refusals, like all cab rides in general, are heavily concentrated in Manhattan
This is to be expected: the sheer number of cab rides in Manhattan all but ensures a correspondingly high number of complaints.
3) Complaints peak at 4pm on weekdays and weekends, during shift change, and again on weekend evenings as cabs become scarcer
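One way to surface this kind of peak is to cross-tabulate complaints by hour of day against weekday versus weekend. A minimal sketch on toy timestamps follows. The `created` column name is an assumption, and the rows are invented to show the mechanics, not to reproduce the finding.

```python
import pandas as pd

# Toy complaint timestamps (June 3, 2013 was a Monday).
complaints = pd.DataFrame({
    "created": pd.to_datetime([
        "2013-06-03 16:05",  # Monday
        "2013-06-03 16:40",  # Monday
        "2013-06-05 16:15",  # Wednesday
        "2013-06-08 16:30",  # Saturday
        "2013-06-08 23:10",  # Saturday
        "2013-06-09 09:00",  # Sunday
    ]),
})

complaints["hour"] = complaints["created"].dt.hour
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
complaints["day_type"] = (complaints["created"].dt.dayofweek >= 5).map(
    {True: "weekend", False: "weekday"}
)

# Rows are hours, columns are weekday/weekend, cells are complaint counts.
by_hour = pd.crosstab(complaints["hour"], complaints["day_type"])
peak_weekday_hour = by_hour["weekday"].idxmax()
print(by_hour)
print("Peak weekday hour:", peak_weekday_hour)
```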
The number of complaints relative to rides is extraordinarily small. Any given day can have tens of thousands of cab rides in the city. With this data set, we can see there were 72,506 complaints total from 2010 to 2014. Of those, the largest category was refused rides, totaling 16,136 for the 4-year period. For the 2013 data set, refusals make up only 3,393 complaints for that year.
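To put those totals in proportion, here is the arithmetic using only the figures quoted above. The per-day figure simply spreads the 2013 refusals evenly over 365 days.

```python
# Figures taken from the text above.
total_complaints = 72_506
refusals_total = 16_136
refusals_2013 = 3_393

# Refusals as a fraction of all complaints.
refusal_share = refusals_total / total_complaints

# Average refusal complaints per day in 2013.
refusals_per_day_2013 = refusals_2013 / 365

print(f"Refusals as share of all complaints: {refusal_share:.1%}")  # 22.3%
print(f"Refusal complaints per day in 2013: {refusals_per_day_2013:.1f}")  # 9.3
```

That works out to roughly nine refusal complaints a day, against tens of thousands of daily rides.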
4) Brooklynites like to complain. A LOT.
It’s important to remember that the people who use cabs are usually, though not exclusively, those with some amount of disposable income. Of the 2013 complaints, which were classified by intended destination (if reasonably discernible!), riders refused a trip to Brooklyn were still the largest group of complainers:
The k-means categorization clusters reveal a funny number of Brooklyn references as well.
Also, when mapped out, the point of origin for a lot of the complaints is still in Brooklyn. This is indicative of issues with inter-boro transit, which may have been addressed by Green Cabs and services such as Uber.
5) Location data is flawed.
Unlike the GPS-based data of the taxi rides data set, a good chunk of this data has transposed locations. The above map shows complaints located in Brooklyn about trying to GET to Brooklyn.
6) Things I’m sad I didn’t have time to do…
- Download historic weather data from the weather.io api, and create a dictionary of days with precipitation and compare it to days with complaints to determine the correlation of complaints to rain and snow days.
- Get the most recent data for 2014 and 2015, bind them to areas such as census tracts, then track the rate of decline per tract. Then take the public data sets for 2013, 2014 and 2015 rides, see if the overall number of cab rides has remained the same for those tracts, and compare it with green cab and Uber data (which is also recently available for 2015!). It would offer a view of whether, amid the Uber debate, people are opting to use other ride options, translating to fewer complaints. Heck, even the Citibike rides are tracked, too!
- Create a Twitter bot that, when users tweet a medallion number at it, replies with the complaints for that medallion.
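The weather idea in the first bullet could start from something like the sketch below. Everything here is hypothetical: the `precipitation` dictionary is hand-written rather than pulled from any weather API, and the complaint dates are toy values. It only shows how a day-by-day lookup could be lined up against daily complaint counts.

```python
import pandas as pd

# Hypothetical lookup of date -> whether there was precipitation that day.
precipitation = {
    "2013-06-01": False,
    "2013-06-02": True,
    "2013-06-03": False,
    "2013-06-04": True,
}

# Toy complaint dates, one entry per complaint.
complaint_days = [
    "2013-06-01",
    "2013-06-02", "2013-06-02", "2013-06-02",
    "2013-06-03",
    "2013-06-04", "2013-06-04",
]

# One row per day: rain flag plus the number of complaints logged that day.
daily = pd.DataFrame({
    "date": list(precipitation.keys()),
    "rained": [int(wet) for wet in precipitation.values()],
})
counts = pd.Series(complaint_days).value_counts()
daily["complaints"] = daily["date"].map(counts).fillna(0).astype(int)
daily["weather"] = daily["rained"].map({1: "wet", 0: "dry"})

# Average complaints on wet vs dry days, and the Pearson correlation.
mean_by_weather = daily.groupby("weather")["complaints"].mean()
correlation = daily["rained"].corr(daily["complaints"])
print(mean_by_weather)
print("Correlation:", round(correlation, 2))
```

With real data, the number of rides per day would also need to be controlled for, since fewer or more trips on rainy days would shift the complaint count on its own.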