Citibike, a bike share program in NYC, collects, maintains, and makes public data about the entire bikeshare system and all rides made. On a personal level, Citibike allows users to download their trip history from the "Trips" link in their user profile. The available personal data consists of each trip's starting dock, starting time, ending dock, ending time and the duration of the trip. Fine location data is notably missing -- the bikes themselves do not contain any sort of location awareness.
While exploring this data and working with some tools that are relatively new to me (Jupyter Notebook, pandas, etc.), I'll investigate some theories that I have about the way I ride, and some of the ways that Citibike as a whole operates. For example, when presenting the user with a "distance" for each completed ride, the Citibike app assumes that the rider travels at 7.456 miles per hour. Do I ride faster or slower than that? Are my distance numbers underestimated or overestimated?
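To make that assumption concrete: if the app really does use a fixed speed, its "distance" is just the ride duration multiplied by 7.456 mph. Here is a minimal sketch under that assumption. The helper names `app_distance_miles` and `actual_speed_mph` are my own, not anything from Citibike, and the ride in the example is made up.

```python
ASSUMED_SPEED_MPH = 7.456  # fixed speed the app appears to assume (roughly 12 km/h)

def app_distance_miles(duration_minutes):
    """Distance the app would presumably report for a ride of this duration."""
    return ASSUMED_SPEED_MPH * duration_minutes / 60

def actual_speed_mph(distance_miles, duration_minutes):
    """Average speed for a ride whose real route length is known."""
    return distance_miles / (duration_minutes / 60)

# Made-up ride: 15 minutes over a route that is really 2.5 miles long.
print(app_distance_miles(15))     # distance the app would show
print(actual_speed_mph(2.5, 15))  # speed actually ridden
```

If my real speed is above 7.456 mph, the app's distance for that ride comes out lower than the true route length, which is the thing I want to check.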
First, I'll start by reading a text file containing the trips exported from my user profile. Then, I'll clean up some of the data types and isolate unique trips by start and end stations.
import pandas as pd
import numpy as np
import googlemaps
import config
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
gmaps = googlemaps.Client(key=config.gmaps_api_key)
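# Read the export into pandas with one line of the file per row and no header, so the first line is read as data.
# Newer pandas versions reject sep="\n" in read_csv, so read the lines directly and skip blank ones.
with open('trips.txt') as f:
    trips_df = pd.DataFrame([line.rstrip('\n') for line in f if line.strip()])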
# Reshape from (number_of_trips * 5) x 1 to number_of_trips x 5; -1 lets NumPy infer the number of trips.
trips_df = pd.DataFrame(trips_df.values.reshape(-1, 5),
                        columns=['start_time', 'start_loc', 'end_time', 'end_loc', 'duration'])
# Convert strings of times to datetimes
trips_df[['start_time','end_time']] = trips_df[['start_time','end_time']].apply(pd.to_datetime)
# Replace the imported duration string with a timedelta calculated from the start and end times.
trips_df['duration'] = trips_df['end_time'] - trips_df['start_time']
trips_df.head()
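The export has one field per line and five lines per trip, so the reshape turns a flat column into one row per trip. Here is a small sketch of the same idea on made-up values. The station names and timestamps are illustrative and not from my export, and the real export's date format may differ.

```python
import numpy as np
import pandas as pd

# Made-up sample in the same layout: five consecutive lines per trip.
lines = [
    "2017-01-02 08:01:00", "Station A", "2017-01-02 08:13:00", "Station B", "12 min",
    "2017-01-03 18:30:00", "Station B", "2017-01-03 18:45:00", "Station A", "15 min",
]

columns = ['start_time', 'start_loc', 'end_time', 'end_loc', 'duration']

# -1 lets NumPy infer the number of trips; reshape raises if len(lines) is not a multiple of 5.
sample_df = pd.DataFrame(np.array(lines).reshape(-1, 5), columns=columns)

# Parse the timestamps, then recompute duration as a timedelta.
sample_df[['start_time', 'end_time']] = sample_df[['start_time', 'end_time']].apply(pd.to_datetime)
sample_df['duration'] = sample_df['end_time'] - sample_df['start_time']
print(sample_df)
```

Recomputing `duration` from the two timestamps avoids having to parse the free-text duration field at all.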
# Count number of trips that start and end at same station. For these trips, assume no meaningful travel. Sum durations for later reporting.
non_trips = sum(trips_df['start_loc'] == trips_df['end_loc'])
non_trips_time = trips_df[trips_df['start_loc'] == trips_df['end_loc']]['duration'].sum()
# Drop these trips, reset index.
trips_df.drop(trips_df[trips_df['start_loc'] == trips_df['end_loc']].index, inplace = True)
trips_df.reset_index(inplace = True, drop = True)
# Append city and state to locations to avoid ambiguity.
trips_df[['start_loc','end_loc']] = trips_df[['start_loc','end_loc']].astype(str) + ", New York, NY"
trips_df.head()
# Group and count rows where start_loc and end_loc have the same values. Reset index and rename column to be count.
# Create a unique df so as to minimize the number of hits on the Google Maps API.
unique = trips_df.groupby(['start_loc','end_loc']).size().reset_index().rename(columns= {0:'count'})
unique.head()
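The grouping step can be tried on a toy table to see what it produces. This is only a sketch on made-up station names; `reset_index(name='count')` is an equivalent shorthand for the reset-and-rename used above.

```python
import pandas as pd

# Made-up trips: A -> B three times, B -> A once.
sample = pd.DataFrame({
    'start_loc': ['A', 'A', 'B', 'A'],
    'end_loc':   ['B', 'B', 'A', 'B'],
})

# One row per (start, end) pair with the number of trips, most frequent first.
pair_counts = (sample.groupby(['start_loc', 'end_loc'])
                     .size()
                     .reset_index(name='count')
                     .sort_values('count', ascending=False))
print(pair_counts)
```

Note that A to B and B to A are counted as different trips, which is what I want here since one-way streets can make the two directions follow different routes.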
Finding the unique trips serves two purposes: it's interesting to see which trips I take most frequently, and it reduces the number of hits on the Google Maps API.
Here, I'll use the Google Maps Directions API to get bike directions. Since I rarely ride the wrong way on one-way streets or go too far out of my way to find a bike lane, let's assume that I ride the suggested routes for my trips.
# Define function that will use Google Maps API to retrieve assumed routes for all unique trips.
def get_bike_directions(row):
    origin = row['start_loc']
    destination = row['end_loc']
    mode = "bicycling"
    return gmaps.directions(origin, destination, mode=mode)
directions = unique.apply(get_bike_directions, axis = 1)
unique['directions'] = directions
# Parse returned JSON to extract overall distance and overall duration of route.
unique['dist'] = unique['directions'].apply(lambda x : x[0]["legs"][0]['distance']['text'])
unique['google_time'] = unique['directions'].apply(lambda x : x[0]["legs"][0]['duration']['text'])
unique.head()
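Next to each `text` field, the Directions response also carries a numeric `value`: meters for distance and seconds for duration. The sketch below pulls those out of a hand-written dictionary that mimics only the part of the response used here. The numbers are made up, `leg_summary` is a hypothetical helper name, and like the code above it assumes the first route and its first leg are the ones of interest.

```python
METERS_PER_MILE = 1609.344

# Hand-written stand-in for the list returned by gmaps.directions(...); values are made up.
sample_directions = [{
    "legs": [{
        "distance": {"text": "1.2 mi", "value": 1931},  # value is in meters
        "duration": {"text": "9 mins", "value": 540},   # value is in seconds
    }]
}]

def leg_summary(directions):
    """Return (miles, minutes) for the first leg of the first route."""
    leg = directions[0]["legs"][0]
    miles = leg["distance"]["value"] / METERS_PER_MILE
    minutes = leg["duration"]["value"] / 60
    return miles, minutes

miles, minutes = leg_summary(sample_directions)
print(round(miles, 2), minutes)
```

Working from the numeric values avoids string slicing later on and does not depend on which unit the text happens to be written in.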
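Here, I left-join "all trips" against "unique trips" so that every ride picks up the distance and Google's time estimate for its route. A note: it would be more efficient to do the little bit of cleaning on the smaller "unique trips" table prior to the join.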
trips_df = pd.merge(trips_df, unique, how='left', left_on=['start_loc','end_loc'], right_on=['start_loc','end_loc']).drop(['count'], axis =1)
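# Strip the trailing " mi" from the distance text (this assumes every distance is reported in miles).
trips_df['dist'] = trips_df['dist'].apply(lambda x : x[:-3])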
trips_df = trips_df.rename(columns={'dist':'dist_miles'})
trips_df['speed_mph'] = trips_df['dist_miles'].astype(float) / (trips_df['duration']/np.timedelta64(1, 'h'))
trips_df.head()
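As a sketch of the clean-before-joining order mentioned in the note above, here is the same merge and speed calculation on made-up data. The table names `trips` and `routes` and all values are illustrative only.

```python
import pandas as pd

# Made-up rides and a made-up per-route lookup table.
trips = pd.DataFrame({
    'start_loc': ['A', 'A', 'B'],
    'end_loc':   ['B', 'B', 'A'],
    'duration':  pd.to_timedelta([12, 10, 15], unit='min'),
})
routes = pd.DataFrame({
    'start_loc': ['A', 'B'],
    'end_loc':   ['B', 'A'],
    'dist':      ['2.0 mi', '2.5 mi'],
})

# Clean the small route table first, so the string parsing happens once per route.
routes['dist_miles'] = routes['dist'].str.replace(' mi', '', regex=False).astype(float)

# A left join keeps every ride and attaches its route distance.
merged = trips.merge(routes[['start_loc', 'end_loc', 'dist_miles']],
                     how='left', on=['start_loc', 'end_loc'])

# Dividing a timedelta by one hour gives the duration in hours as a float.
merged['speed_mph'] = merged['dist_miles'] / (merged['duration'] / pd.Timedelta(hours=1))
print(merged)
```

Because both tables share the key column names, `on=` can stand in for the matching `left_on=` and `right_on=` lists.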
I'll plot the ride speeds as a histogram to get a sense of the distribution of my ride speeds.
num_bins = 20
fig, ax = plt.subplots()
ax.hist(trips_df['speed_mph'], bins=num_bins)
ax.set_xlabel('Speeds (mph)')
ax.set_ylabel('Number of Rides')
ax.set_title('Histogram of Ride Speeds')
fig.tight_layout()
plt.show()
Seems like a pretty safe bet that I beat Citibike's assumed 7.456 mph. Let's take a look at those outliers.
trips_df[trips_df.speed_mph > 12]
Luckily, these are my own rides, so I can recall that I did indeed take shortcuts on id=4 and id=46 above. Also, at 23 mph, id=20 is a particularly egregious failure of my hacky attempt to prep the station locations for the Google Maps API.
# Based on histogram above, assume that rides over 13 mph are erroneous -- i.e., I took a shortcut down a one-way compared
# to what Google suggested, or, for the > 20 mph ride, the address/pathfinding is wrong because of Brooklyn vs NY, NY.
trips_df = trips_df.drop(trips_df[trips_df.speed_mph > 13].index)
num_bins = 20
fig, ax = plt.subplots()
ax.hist(trips_df['speed_mph'], bins=num_bins)
ax.set_xlabel('Speeds (mph)')
ax.set_ylabel('Number of Rides')
ax.set_title('Histogram of Ride Speeds')
fig.tight_layout()
plt.show()
trips_df['speed_mph'].describe()
As guessed, on average, I ride faster than the Citibike app expects and the in-app distance numbers are underestimated.
trips_df.loc[trips_df.speed_mph.idxmin()]
As I recall, the slowest trip I took was one where I had to fiddle with the seat, and I docked the bike as soon as I could.
# Total riding time in hours.
trip_time = trips_df['duration'].sum().total_seconds() / 3600
print("The total time spent riding Citibike is %.1f hours." % (trip_time))
print("The total distance traveled using Citibike is %.1f miles." % (trip_time*trips_df['speed_mph'].mean()))
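One caveat on the distance figure: total time multiplied by the mean of the per-ride speeds is an estimate, not the same thing as adding up the per-ride distances. The two can differ when long and short rides are ridden at different speeds. A small sketch on made-up numbers shows the gap:

```python
import pandas as pd

# Two made-up rides: a short fast one and a long slow one.
rides = pd.DataFrame({
    'dist_miles': [1.0, 3.0],
    'hours':      [0.1, 0.5],
})
rides['speed_mph'] = rides['dist_miles'] / rides['hours']

total_hours = rides['hours'].sum()

# Estimate used above: total time multiplied by the mean of per-ride speeds.
estimate = total_hours * rides['speed_mph'].mean()

# Direct total: just add up the route distances.
direct = rides['dist_miles'].sum()

print("estimate: %.1f miles, direct sum: %.1f miles" % (estimate, direct))
```

On the real data, the direct version would be the sum of the `dist_miles` column converted to float.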