Unified New York City Taxi and Uber data
Code in support of this post: Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
This repo provides scripts to download, process, and analyze data for over 1.3 billion taxi and Uber trips originating in New York City. The data is stored in a PostgreSQL database, with PostGIS handling the spatial calculations, in particular mapping latitude/longitude coordinates to census tracts.
The yellow and green taxi data comes from the NYC Taxi & Limousine Commission, and Uber data comes via FiveThirtyEight, who obtained it via a FOIL request. In August 2016, the TLC began providing for-hire vehicle trip records as well.
Your mileage may vary, but on my MacBook Air, this process took about 3 days to complete. The unindexed database takes up 330 GB on disk. Adding indexes for improved query performance increases total disk usage by another 100 GB.
1. Install PostgreSQL and PostGIS
Both are available via Homebrew on Mac OS X
2. Download raw taxi data
3. Initialize database and set up schema
4. Import taxi data into database and map to census tracts
5. Optional: download and import Uber data from FiveThirtyEight's GitHub repository, and TLC for-hire vehicle records
Additional Postgres and R scripts for analysis are in the analysis/ folder, or you can do your own!
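To give a feel for what "map to census tracts" means in step 4, here is a minimal Python sketch of a point-in-polygon lookup using ray casting. It is only an illustration: the repo relies on PostGIS for this, and the function names, tract IDs, and rectangular polygons below are all hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's y-coordinate can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets the horizontal line through the point.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_tract(point: Point, tracts: Dict[str, Sequence[Point]]) -> Optional[str]:
    """Return the ID of the first tract containing the point, or None if there is none."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(point, polygon):
            return tract_id
    return None


# Hypothetical rectangular "tracts"; real census tracts are irregular polygons.
EXAMPLE_TRACTS = {
    "tract_a": [(-74.00, 40.70), (-73.99, 40.70), (-73.99, 40.71), (-74.00, 40.71)],
    "tract_b": [(-73.99, 40.70), (-73.98, 40.70), (-73.98, 40.71), (-73.99, 40.71)],
}
```

A pickup outside every polygon returns None, which mirrors how trips outside the shapefile's coverage end up without a census tract.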
- trips table contains all yellow and green taxi trips, plus Uber pickups from April 2014 through September 2014. Each trip has a cab_type_id, which references the cab_types table and refers to one of yellow, green, or uber. Each trip maps to a census tract for pickup and dropoff
- nyct2010 table contains NYC census tracts plus the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areas
- taxi_zones table contains the TLC's official taxi zone boundaries. Starting in July 2016, the TLC no longer provides pickup and dropoff coordinates. Instead, each trip comes with taxi zone pickup and dropoff location IDs
- uber_trips_2015 table contains Uber pickups from Jan–Jun, 2015. These are kept in a separate table because they don't have specific latitude/longitude coordinates, only location IDs. The location IDs are stored in the taxi_zone_lookups table, which also maps them (approximately) to neighborhood tabulation areas
- fhv_trips table contains all FHV trip records made available by the TLC
- central_park_weather_observations has summary weather data by date
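As a rough illustration of how cab_type_id ties a trip to its cab type, the sketch below builds a toy version of the two tables and counts trips per type. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and everything except the trips, cab_types, and cab_type_id names is an assumption made for the example (the type column, the pickup_tract column, the ID values, and the sample rows).

```python
import sqlite3

# In-memory SQLite database standing in for the real PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cab_types (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE trips (
        id INTEGER PRIMARY KEY,
        cab_type_id INTEGER REFERENCES cab_types (id),
        pickup_tract TEXT  -- hypothetical column, for illustration only
    );
    INSERT INTO cab_types (id, type) VALUES (1, 'yellow'), (2, 'green'), (3, 'uber');
    INSERT INTO trips (cab_type_id, pickup_tract) VALUES
        (1, 'tract_a'), (1, 'tract_b'), (2, 'tract_a'), (3, 'tract_a');
""")


def trips_by_cab_type(connection: sqlite3.Connection) -> dict:
    """Join trips to cab_types and count the trips for each cab type."""
    rows = connection.execute("""
        SELECT cab_types.type, COUNT(*)
        FROM trips
        JOIN cab_types ON trips.cab_type_id = cab_types.id
        GROUP BY cab_types.type
        ORDER BY cab_types.type
    """).fetchall()
    return dict(rows)


counts = trips_by_cab_type(conn)
```

The same join-and-group pattern should carry over to the real database, though the actual column names there may differ.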
Other data sources
These are bundled with the repository, so no need to download separately, but:
- Shapefile for NYC census tracts and neighborhood tabulation areas comes from Bytes of the Big Apple
- Shapefile for taxi zone locations comes from the TLC
- Mapping of FHV base numbers to names comes from the TLC
- Central Park weather data comes from the National Climatic Data Center
Data issues encountered
- Remove carriage returns and empty lines from TLC data before passing to Postgres
- Some raw data files have extra columns with empty data, so I had to create dummy columns such as junk2 to absorb them
- Two of the yellow taxi raw data files had a small number of rows containing extra columns. I discarded these rows
- The official NYC neighborhood tabulation areas (NTAs) included in the shapefile are not exactly what I would have expected. Some of them are bizarrely large and contain more than one neighborhood, e.g. "Hudson Yards-Chelsea-Flat Iron-Union Square", while others are confusingly named, e.g. "North Side-South Side" for what I'd call "Williamsburg", and "Williamsburg" for what I'd call "South Williamsburg". In a few instances I modified NTA names, but I kept the NTA geographic definitions
- The shapefile includes only NYC census tracts. Trips to New Jersey, Long Island, Westchester, and Connecticut are not mapped to census tracts, with the exception of the Newark Airport, for which I manually added a fake census tract
- The Uber 2015 and FHV data use location IDs instead of latitude/longitude. The location IDs do not exactly overlap with the NYC neighborhood tabulation areas (NTAs) or census tracts, but I did my best to map Uber location IDs to NYC NTAs
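Two of the cleanup steps above (removing carriage returns and empty lines, and discarding rows with extra columns) can be sketched in a few lines of Python. This is not the repo's actual import code, just one way the idea could look; the function names, the three-column sample, and the expected_columns parameter are hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, List


def strip_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Remove carriage returns and skip lines that are empty once stripped."""
    for line in lines:
        cleaned = line.replace("\r", "").rstrip("\n")
        if cleaned.strip():
            yield cleaned


def rows_with_expected_width(lines: Iterable[str], expected_columns: int) -> List[List[str]]:
    """Parse the cleaned lines as CSV and keep only rows with the expected column count."""
    reader = csv.reader(strip_blank_lines(lines))
    return [row for row in reader if len(row) == expected_columns]


# Hypothetical three-column sample; real TLC files have many more columns.
SAMPLE = "vendor,pickup,dropoff\r\n\r\n1,a,b\r\n2,c,d,extra,extra\r\n\n3,e,f\r\n"
cleaned_rows = rows_with_expected_width(io.StringIO(SAMPLE), expected_columns=3)
```

Discarding by column count only makes sense when the malformed rows are a small minority, as in the two yellow taxi files; files where every row carries extra empty columns call for dummy columns instead.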
Why not use BigQuery or Redshift?
Google BigQuery and Amazon Redshift would probably provide significant performance improvements over PostgreSQL. A lot of the data is already available on BigQuery, but in scattered tables, and each trip has only latitude and longitude coordinates, not census tracts and neighborhoods. PostGIS seemed like the easiest way to map coordinates to census tracts. Once the mapping is complete, it might make sense to load the data back into BigQuery or Redshift to make the analysis faster. Note that BigQuery and Redshift cost some amount of money, while PostgreSQL and PostGIS are free.
TLC summary statistics
There's a Ruby script in the tlc_statistics/ folder that imports data from the TLC's summary statistics reports.
Questions or issues: firstname.lastname@example.org, or open a GitHub issue