While this project emphasizes communicating data findings, it's essential to first perform data wrangling to ensure the dataset is tidy and the findings are accurate.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
We'll first load and take a look at the dataset.
df = pd.read_csv('201902-fordgobike-tripdata.csv')
df.head()
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 2535 | Customer | NaN | NaN | No |
2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes |
We'll look at the structure and contents of the DataFrame, to help identify issues like missing values or data types that may need to be adjusted.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object 
 15  bike_share_for_all_trip  183412 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
- The 'start_time' and 'end_time' fields should be changed to the datetime data type.
- The 'member_birth_year' field should be converted to an integer data type.
- The 'user_type' and 'member_gender' fields could be changed to a category data type.
- The 'bike_share_for_all_trip' field could be changed to a boolean data type.
- The 'bike_id', 'start_station_id', and 'end_station_id' fields should be changed to strings.
We'll check for duplicate rows.
sum(df.duplicated())
0
Next, we'll count the missing (null) values in each column of the DataFrame.
df.isnull().sum()
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
Several columns contain missing (null) values.
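To see how much of the dataset these nulls actually affect, the counts can be expressed as percentages of the total rows. A minimal sketch on a toy frame (the real call would be `df.isnull().mean() * 100` on the trip data):

```python
import numpy as np
import pandas as pd

# toy stand-in for the trip DataFrame (not the real data)
df = pd.DataFrame({
    'start_station_id': [1.0, np.nan, 3.0, 4.0],
    'member_birth_year': [1984.0, np.nan, np.nan, 1990.0],
})

# isnull() gives a boolean frame; the column-wise mean is the null fraction
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

For the real dataset, the 8,265 missing `member_birth_year` values work out to roughly 4.5% of the 183,412 rows.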
Let's summarize the DataFrame's key statistics, such as count, mean, standard deviation, and quartiles.
df.describe()
duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year | |
---|---|---|---|---|---|---|---|---|---|
count | 183412.000000 | 183215.000000 | 183412.000000 | 183412.000000 | 183215.000000 | 183412.000000 | 183412.000000 | 183412.000000 | 175147.000000 |
mean | 726.078435 | 138.590427 | 37.771223 | -122.352664 | 136.249123 | 37.771427 | -122.352250 | 4472.906375 | 1984.806437 |
std | 1794.389780 | 111.778864 | 0.099581 | 0.117097 | 111.515131 | 0.099490 | 0.116673 | 1664.383394 | 10.116689 |
min | 61.000000 | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 |
25% | 325.000000 | 47.000000 | 37.770083 | -122.412408 | 44.000000 | 37.770407 | -122.411726 | 3777.000000 | 1980.000000 |
50% | 514.000000 | 104.000000 | 37.780760 | -122.398285 | 100.000000 | 37.781010 | -122.398279 | 4958.000000 | 1987.000000 |
75% | 796.000000 | 239.000000 | 37.797280 | -122.286533 | 235.000000 | 37.797320 | -122.288045 | 5502.000000 | 1992.000000 |
max | 85444.000000 | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 |
As we can see, the earliest recorded birth year for members is 1878, which appears to be an error. We will investigate these inaccurate birth years more thoroughly in the cleaning section of this report.
Let's check how many riders would be older than 85 (birth year before 1934, given that these trips took place in early 2019).
df.query("member_birth_year < 1934")
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1285 | 148 | 2019-02-28 19:29:17.6270 | 2019-02-28 19:31:45.9670 | 158.0 | Shattuck Ave at Telegraph Ave | 37.833279 | -122.263490 | 173.0 | Shattuck Ave at 55th St | 37.840364 | -122.264488 | 5391 | Subscriber | 1900.0 | Male | Yes |
5197 | 217 | 2019-02-28 13:51:46.2380 | 2019-02-28 13:55:24.1270 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 71.0 | Broderick St at Oak St | 37.773063 | -122.439078 | 5801 | Subscriber | 1931.0 | Male | No |
5266 | 384 | 2019-02-28 13:35:05.4280 | 2019-02-28 13:41:30.2230 | 84.0 | Duboce Park | 37.769200 | -122.433812 | 71.0 | Broderick St at Oak St | 37.773063 | -122.439078 | 6608 | Subscriber | 1931.0 | Male | No |
5447 | 147 | 2019-02-28 13:08:56.9350 | 2019-02-28 13:11:24.0620 | 84.0 | Duboce Park | 37.769200 | -122.433812 | 72.0 | Page St at Scott St | 37.772406 | -122.435650 | 5018 | Subscriber | 1931.0 | Male | No |
10827 | 1315 | 2019-02-27 19:21:34.4360 | 2019-02-27 19:43:30.0080 | 343.0 | Bryant St at 2nd St | 37.783172 | -122.393572 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 6249 | Subscriber | 1900.0 | Male | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
177708 | 1527 | 2019-02-01 19:09:28.3870 | 2019-02-01 19:34:55.9630 | 343.0 | Bryant St at 2nd St | 37.783172 | -122.393572 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 5286 | Subscriber | 1900.0 | Male | No |
177885 | 517 | 2019-02-01 18:38:40.4710 | 2019-02-01 18:47:18.3920 | 25.0 | Howard St at 2nd St | 37.787522 | -122.397405 | 30.0 | San Francisco Caltrain (Townsend St at 4th St) | 37.776598 | -122.395282 | 2175 | Subscriber | 1902.0 | Female | No |
177955 | 377 | 2019-02-01 18:23:33.4110 | 2019-02-01 18:29:50.7950 | 26.0 | 1st St at Folsom St | 37.787290 | -122.394380 | 321.0 | 5th St at Folsom | 37.780146 | -122.403071 | 5444 | Subscriber | 1933.0 | Female | Yes |
182830 | 428 | 2019-02-01 07:45:05.9340 | 2019-02-01 07:52:14.9220 | 284.0 | Yerba Buena Center for the Arts (Howard St at ... | 37.784872 | -122.400876 | 67.0 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 5031 | Subscriber | 1901.0 | Male | No |
183388 | 490 | 2019-02-01 00:39:53.1120 | 2019-02-01 00:48:03.3380 | 61.0 | Howard St at 8th St | 37.776513 | -122.411306 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 5411 | Subscriber | 1927.0 | Male | No |
187 rows × 16 columns
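The ages implied by this query can be made explicit. A small sketch, assuming all trips occurred in 2019, with toy values standing in for `df['member_birth_year']`:

```python
import pandas as pd

# toy birth years; the real series is df['member_birth_year']
birth_year = pd.Series([1900.0, 1931.0, 1984.0, 1990.0])

# implied rider age, assuming the trips are from 2019
age = 2019 - birth_year
over_85 = int((age > 85).sum())
print(over_85)  # number of riders implied to be older than 85
```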
Data Cleaning
The initial step in the cleaning process involved making a copy of the original dataset.
# make a copy
df_clean = df.copy()
We'll remove the rows with null values and reset the index.
# dropping rows with null values
df_clean.dropna(inplace=True)
df_clean.reset_index(drop=True, inplace = True)
Let's review the dataset overview to see the changes
# Display DataFrame info
df_clean.info()
# Check for null counts
null_counts = df_clean.isnull().sum()
print(null_counts)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174952 entries, 0 to 174951
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             174952 non-null  int64  
 1   start_time               174952 non-null  object 
 2   end_time                 174952 non-null  object 
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object 
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object 
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64  
 12  user_type                174952 non-null  object 
 13  member_birth_year        174952 non-null  float64
 14  member_gender            174952 non-null  object 
 15  bike_share_for_all_trip  174952 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 21.4+ MB
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64
Next, we'll apply the data type changes noted earlier.
# Define a dictionary of columns and their desired data types
dtype_mapping = {
'start_time': 'datetime64[ns]',
'end_time': 'datetime64[ns]',
'member_birth_year': 'int',
'bike_id': 'object',
'start_station_id': 'object',
'end_station_id': 'object',
'user_type': 'category',
'member_gender': 'category',
'bike_share_for_all_trip': 'category'
}
# Apply the conversions
for column, dtype in dtype_mapping.items():
df_clean[column] = df_clean[column].astype(dtype)
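One caveat: converting object columns to `datetime64[ns]` via `astype` has become stricter in recent pandas releases, so `pd.to_datetime` is the more robust route for the two timestamp columns. A hedged alternative sketch, with toy data standing in for the real columns:

```python
import pandas as pd

# toy stand-in with the same timestamp format as the dataset
df_clean = pd.DataFrame({
    'start_time': ['2019-02-28 17:32:10.1450', '2019-02-28 18:53:21.7890'],
    'end_time': ['2019-03-01 08:01:55.9750', '2019-03-01 06:42:03.0560'],
})

# pd.to_datetime parses the strings and raises on malformed rows
for col in ['start_time', 'end_time']:
    df_clean[col] = pd.to_datetime(df_clean[col])

print(df_clean.dtypes)
```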
Let's check the overview of the dataframe
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174952 entries, 0 to 174951
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  object        
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  object        
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  object        
 12  user_type                174952 non-null  category      
 13  member_birth_year        174952 non-null  int64         
 14  member_gender            174952 non-null  category      
 15  bike_share_for_all_trip  174952 non-null  category      
dtypes: category(3), datetime64[ns](2), float64(4), int64(2), object(5)
memory usage: 17.9+ MB
The earliest birth year listed is 1878, and several birth years imply riders well over 90 years old, which is highly improbable. We will investigate and remove records with implausibly early birth years, as these appear to be errors.
Let's take a look at the statistics for the member birth years in the dataset.
df_clean.member_birth_year.describe()
count    174952.000000
mean       1984.803135
std          10.118731
min        1878.000000
25%        1980.000000
50%        1987.000000
75%        1992.000000
max        2001.000000
Name: member_birth_year, dtype: float64
Let's look at a histogram of member birth years, using 5-year bins from 1880 to 2000 to visualize the frequency distribution of the data.
# 5-year bins spanning the observed birth years (starting the range at 0 only adds empty bins)
bin_edges = np.arange(1875, df_clean['member_birth_year'].max() + 5, 5)
plt.hist(data = df_clean, x = 'member_birth_year', bins = bin_edges)
plt.xlim(1880, 2000)
plt.xlabel('Member Birth Year')
plt.ylabel('Frequency');
Next, we'll look for outliers in member birth years by visualizing the data with a box plot.
# look for outliers
sb.boxplot(x=df_clean['member_birth_year'])
<Axes: xlabel='member_birth_year'>
Let's calculate the first and third quartiles (Q1 and Q3) and the interquartile range (IQR) for member birth years to identify potential outliers.
# Using the Interquartile Range to understand outliers
q1 = df_clean.member_birth_year.quantile(0.25)
q3 = df_clean.member_birth_year.quantile(0.75)
iqr = q3 - q1
print(q1)
print(q3)
print(iqr)
1980.0
1992.0
12.0
We'll calculate the lower whisker for identifying outliers in member birth years, which is determined by subtracting 1.5 times the interquartile range (IQR) from Q1.
lower_whisker = q1 - 1.5 * iqr
print(lower_whisker)
1962.0
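For completeness, the upper whisker is the mirror image, `q3 + 1.5 * iqr`; values outside either whisker are the candidate outliers. A self-contained sketch on toy birth years (the real series is `df_clean['member_birth_year']`):

```python
import pandas as pd

# toy birth years standing in for df_clean['member_birth_year']
years = pd.Series([1900, 1955, 1980, 1987, 1992, 2001])

q1, q3 = years.quantile(0.25), years.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# flag values falling outside either whisker
outliers = years[(years < lower) | (years > upper)]
print(lower, upper, outliers.tolist())
```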
Let's filter the dataset to examine members born before 1962.
df_clean.query("member_birth_year < 1962")
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959 | Male | No |
40 | 116 | 2019-02-28 23:44:00.988 | 2019-02-28 23:45:57.482 | 104.0 | 4th St at 16th St | 37.767045 | -122.390833 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 823 | Subscriber | 1959 | Male | No |
62 | 681 | 2019-02-28 23:19:37.366 | 2019-02-28 23:30:58.862 | 43.0 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6333 | Subscriber | 1959 | Male | No |
196 | 547 | 2019-02-28 22:25:51.137 | 2019-02-28 22:34:58.970 | 76.0 | McCoppin St at Valencia St | 37.771662 | -122.422423 | 43.0 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 6333 | Subscriber | 1961 | Female | No |
297 | 217 | 2019-02-28 21:58:47.639 | 2019-02-28 22:02:24.693 | 149.0 | Emeryville Town Hall | 37.831275 | -122.285633 | 153.0 | 59th St at Horton St | 37.840945 | -122.291360 | 5210 | Subscriber | 1961 | Male | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
174844 | 459 | 2019-02-01 05:15:05.178 | 2019-02-01 05:22:44.272 | 104.0 | 4th St at 16th St | 37.767045 | -122.390833 | 30.0 | San Francisco Caltrain (Townsend St at 4th St) | 37.776598 | -122.395282 | 3446 | Subscriber | 1959 | Male | No |
174852 | 373 | 2019-02-01 04:42:44.709 | 2019-02-01 04:48:58.076 | 131.0 | 22nd St at Dolores St | 37.755000 | -122.425728 | 129.0 | Harrison St at 20th St | 37.758862 | -122.412544 | 5427 | Subscriber | 1958 | Male | No |
174853 | 100 | 2019-02-01 04:46:54.805 | 2019-02-01 04:48:34.843 | 80.0 | Townsend St at 5th St | 37.775235 | -122.397437 | 67.0 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 3138 | Subscriber | 1950 | Male | No |
174926 | 400 | 2019-02-01 00:46:47.276 | 2019-02-01 00:53:27.596 | 220.0 | San Pablo Ave at MLK Jr Way | 37.811351 | -122.273422 | 337.0 | Webster St at 19th St | 37.806970 | -122.266588 | 3487 | Subscriber | 1945 | Male | Yes |
174929 | 490 | 2019-02-01 00:39:53.112 | 2019-02-01 00:48:03.338 | 61.0 | Howard St at 8th St | 37.776513 | -122.411306 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 5411 | Subscriber | 1927 | Male | No |
5781 rows × 16 columns
Let's look at the bottom 1% of birth years.
df.member_birth_year.describe(percentiles = [.01])
count    175147.000000
mean       1984.806437
std          10.116689
min        1878.000000
1%         1955.000000
50%        1987.000000
max        2001.000000
Name: member_birth_year, dtype: float64
We'll look for birth years prior to 1955.
df_clean.query("member_birth_year < 1955")
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
476 | 235 | 2019-02-28 21:17:57.047 | 2019-02-28 21:21:52.631 | 34.0 | Father Alfred E Boeddeker Park | 37.783988 | -122.412408 | 58.0 | Market St at 10th St | 37.776619 | -122.417385 | 5202 | Subscriber | 1954 | Male | No |
956 | 384 | 2019-02-28 19:56:45.837 | 2019-02-28 20:03:10.473 | 250.0 | North Berkeley BART Station | 37.873558 | -122.283093 | 257.0 | Fifth St at Delaware St | 37.870407 | -122.299676 | 1671 | Subscriber | 1954 | Male | No |
1033 | 303 | 2019-02-28 19:49:38.120 | 2019-02-28 19:54:42.044 | 43.0 | San Francisco Public Library (Grove St at Hyde... | 37.778768 | -122.415929 | 76.0 | McCoppin St at Valencia St | 37.771662 | -122.422423 | 6333 | Subscriber | 1945 | Male | Yes |
1238 | 148 | 2019-02-28 19:29:17.627 | 2019-02-28 19:31:45.967 | 158.0 | Shattuck Ave at Telegraph Ave | 37.833279 | -122.263490 | 173.0 | Shattuck Ave at 55th St | 37.840364 | -122.264488 | 5391 | Subscriber | 1900 | Male | Yes |
1295 | 1362 | 2019-02-28 19:02:33.643 | 2019-02-28 19:25:16.561 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 97.0 | 14th St at Mission St | 37.768265 | -122.420110 | 48 | Subscriber | 1954 | Male | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
174773 | 191 | 2019-02-01 06:32:38.467 | 2019-02-01 06:35:50.222 | 15.0 | San Francisco Ferry Building (Harry Bridges Pl... | 37.795392 | -122.394203 | 20.0 | Mechanics Monument Plaza (Market St at Bush St) | 37.791300 | -122.399051 | 3108 | Subscriber | 1947 | Male | No |
174808 | 966 | 2019-02-01 05:57:01.688 | 2019-02-01 06:13:08.313 | 126.0 | Esprit Park | 37.761634 | -122.390648 | 16.0 | Steuart St at Market St | 37.794130 | -122.394430 | 1338 | Subscriber | 1952 | Male | No |
174853 | 100 | 2019-02-01 04:46:54.805 | 2019-02-01 04:48:34.843 | 80.0 | Townsend St at 5th St | 37.775235 | -122.397437 | 67.0 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 3138 | Subscriber | 1950 | Male | No |
174926 | 400 | 2019-02-01 00:46:47.276 | 2019-02-01 00:53:27.596 | 220.0 | San Pablo Ave at MLK Jr Way | 37.811351 | -122.273422 | 337.0 | Webster St at 19th St | 37.806970 | -122.266588 | 3487 | Subscriber | 1945 | Male | Yes |
174929 | 490 | 2019-02-01 00:39:53.112 | 2019-02-01 00:48:03.338 | 61.0 | Howard St at 8th St | 37.776513 | -122.411306 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 5411 | Subscriber | 1927 | Male | No |
1680 rows × 16 columns
Rather than using the IQR lower whisker (1962), we'll drop the rows with birth years before 1955 (the bottom 1%) and reset the index.
df_clean.drop(df_clean[df_clean.member_birth_year < 1955].index, inplace=True)
df_clean.reset_index(drop=True, inplace = True)
Let's confirm that no such records remain.
df_clean.query("member_birth_year < 1955")
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
---|
We'll look at an overview of the DataFrame again and check for nulls.
# Display DataFrame info
df_clean.info()
# Check for null counts
null_counts = df_clean.isnull().sum()
print(null_counts)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173272 entries, 0 to 173271
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             173272 non-null  int64         
 1   start_time               173272 non-null  datetime64[ns]
 2   end_time                 173272 non-null  datetime64[ns]
 3   start_station_id         173272 non-null  object        
 4   start_station_name       173272 non-null  object        
 5   start_station_latitude   173272 non-null  float64       
 6   start_station_longitude  173272 non-null  float64       
 7   end_station_id           173272 non-null  object        
 8   end_station_name         173272 non-null  object        
 9   end_station_latitude     173272 non-null  float64       
 10  end_station_longitude    173272 non-null  float64       
 11  bike_id                  173272 non-null  object        
 12  user_type                173272 non-null  category      
 13  member_birth_year        173272 non-null  int64         
 14  member_gender            173272 non-null  category      
 15  bike_share_for_all_trip  173272 non-null  category      
dtypes: category(3), datetime64[ns](2), float64(4), int64(2), object(5)
memory usage: 17.7+ MB
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64
df_clean.describe()
duration_sec | start_time | end_time | start_station_latitude | start_station_longitude | end_station_latitude | end_station_longitude | member_birth_year | |
---|---|---|---|---|---|---|---|---|
count | 173272.000000 | 173272 | 173272 | 173272.000000 | 173272.000000 | 173272.000000 | 173272.000000 | 173272.000000 |
mean | 703.878549 | 2019-02-15 21:21:32.505712128 | 2019-02-15 21:33:16.883208960 | 37.771060 | -122.351657 | 37.771257 | -122.351228 | 1985.171747 |
min | 61.000000 | 2019-02-01 00:00:20.636000 | 2019-02-01 00:04:52.058000 | 37.317298 | -122.453704 | 37.317298 | -122.453704 | 1955.000000 |
25% | 323.000000 | 2019-02-08 08:30:43.535000064 | 2019-02-08 08:41:02.230500096 | 37.770407 | -122.411901 | 37.770407 | -122.411647 | 1980.000000 |
50% | 510.000000 | 2019-02-15 21:32:02.059000064 | 2019-02-15 21:45:20.187500032 | 37.780760 | -122.398279 | 37.781010 | -122.397437 | 1987.000000 |
75% | 788.000000 | 2019-02-22 11:19:06.316000 | 2019-02-22 11:33:13.886749952 | 37.797320 | -122.283093 | 37.797673 | -122.286533 | 1992.000000 |
max | 84548.000000 | 2019-02-28 23:59:18.548000 | 2019-03-01 08:01:55.975000 | 37.880222 | -121.874119 | 37.880222 | -121.874119 | 2001.000000 |
std | 1647.305625 | NaN | NaN | 0.100682 | 0.118001 | 0.100587 | 0.117560 | 9.378430 |
What is the structure of your dataset?
The dataset has 173,272 entries after cleaning the data. The dataset contains these features:
- duration_sec
- start_time
- end_time
- start_station_id
- start_station_name
- start_station_latitude
- start_station_longitude
- end_station_id
- end_station_name
- end_station_latitude
- end_station_longitude
- bike_id
- user_type
- member_birth_year
- member_gender
- bike_share_for_all_trip
What is/are the main feature(s) of interest in your dataset?
The main features of interest in the dataset are those that effectively predict bike trip duration and influence trip frequency. Understanding these key factors will help us identify the most significant predictors for both aspects of bike usage.
What features in the dataset do you think will help support your investigation into your feature(s) of interest?
In my investigation into the features of interest regarding bike rides, I anticipate that the following dataset features will provide valuable insights:
Time of Day: I expect this feature to significantly influence both the frequency and duration of bike rides, with peak usage occurring during commuting hours.
Day of the Week: Similar to time of day, this feature is likely to affect ride patterns, with certain days showing higher activity levels.
User Type: Differentiating between subscribers and customers will help in understanding how user engagement affects riding behavior.
Member Birth Year: Analyzing the birth year of riders may provide insights into generational differences in bike usage.
Gender: Exploring the relationship between gender and ride frequency/duration will help identify any disparities in bike usage patterns.
These features collectively will support a comprehensive analysis of the factors influencing bike ride behavior.
Univariate Exploration
In this section, we will analyze the distributions of individual variables. First, we’ll focus on the distribution of bike trip durations.
Here, we'll create a histogram of trip durations in seconds
bin_edges = np.arange(0, df_clean['duration_sec'].max() + 100, 100)
plt.hist(data = df_clean, x = 'duration_sec', bins = bin_edges)
plt.xlim(0, 3000)
plt.xlabel('Trip Duration (seconds)')
plt.ylabel('Frequency');
The histogram shows the frequency distribution of trip durations: most trips last between 0 and 1,000 seconds, and frequency drops sharply for longer durations.
# gauging bin limits are appropriate for the plot
df_clean['duration_sec'].describe()
count    173272.000000
mean        703.878549
std        1647.305625
min          61.000000
25%         323.000000
50%         510.000000
75%         788.000000
max       84548.000000
Name: duration_sec, dtype: float64
The data follows a highly skewed, long-tailed distribution, with most trip durations under 1,000 seconds. To gain clearer insights, a logarithmic transformation is appropriate.
bin_edges = 10 ** np.arange(1, 5 + 0.1, 0.1)
ticks = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000]
plt.hist(data = df_clean, x = 'duration_sec', bins=bin_edges)
plt.xscale('log')
plt.xticks(ticks, ticks)
plt.xlabel('Trip Duration (seconds)')
plt.ylabel('Frequency');
On a logarithmic scale, the distribution of trip durations looks roughly unimodal, with a prominent peak near 650 seconds. To make the findings easier to read, we'll add a column expressing trip duration in minutes, a unit that is generally more intuitive than seconds.
# create a duration column in minutes
df_clean['duration_min'] = df_clean['duration_sec'] / 60
Histogram of trip duration in minutes
bin_edges = np.arange(0, df_clean['duration_min'].max() + 5, 5)
plt.hist(data = df_clean, x = 'duration_min', bins = bin_edges)
plt.xlim(0, 100)
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency');
Since the distribution is just as skewed when expressed in minutes, a logarithmic scale is again the most suitable option.
bin_edges = 10 ** np.arange(0, 3 + 0.1, 0.1)
ticks = [1, 3, 10, 30, 100, 300, 1000]
plt.hist(data = df_clean, x = 'duration_min', bins=bin_edges)
plt.xscale('log')
plt.xticks(ticks, ticks)
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency');
When we look at the bike trip durations on a log scale in minutes, we see a peak around 9 to 10 minutes.
We'll take a look at the time-of-day feature by extracting the start hour from the start time.
# extract the start hour of the trip from the start time column
df_clean['start_hour'] = df_clean['start_time'].dt.strftime('%H')
df_clean['start_hour'] = df_clean['start_hour'].astype(int)
# test
df_clean.head()
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | duration_min | start_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52185 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984 | Male | No | 869.750000 | 17 |
1 | 61854 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972 | Male | No | 1030.900000 | 12 |
2 | 36490 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989 | Other | No | 608.166667 | 17 |
3 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974 | Male | Yes | 26.416667 | 23 |
4 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959 | Male | No | 29.883333 | 23 |
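As an aside, the same extraction can be done in one step with the `.dt.hour` accessor, which yields integers directly and skips the `strftime`/`astype` round trip. A sketch on toy timestamps:

```python
import pandas as pd

# toy stand-in for the cleaned trip DataFrame
df_clean = pd.DataFrame({
    'start_time': pd.to_datetime([
        '2019-02-28 17:32:10', '2019-02-28 07:15:00', '2019-02-01 00:05:34',
    ]),
})

# .dt.hour returns the hour as an integer; no string round trip needed
df_clean['start_hour'] = df_clean['start_time'].dt.hour
print(df_clean['start_hour'].tolist())
```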
We will create a countplot showing the frequency of bike rides by hour of the day.
# plot the frequency of bike rides by hour of the day
plt.figure(figsize=[10,6])
base_color = sb.color_palette()[0]
sb.countplot(data = df_clean, x = 'start_hour', color = base_color)
plt.xlabel('Start Hour')
plt.title('Frequency of Bike Rides by Hour of the Day')
Text(0.5, 1.0, 'Frequency of Bike Rides by Hour of the Day')
The visualization demonstrates a significant increase in bike trips during the morning rush from 7 to 9 AM and in the evening from 4 to 7 PM.
To gain further insights, we'll extract the day of the week to see if these patterns align with weekday commuting behavior.
# extracting the day of the week from the start_time column
df_clean['day'] = df_clean['start_time'].dt.strftime('%A')
Let's check the dataframe for the changes
df_clean
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | duration_min | start_hour | day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52185 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984 | Male | No | 869.750000 | 17 | Thursday |
1 | 61854 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972 | Male | No | 1030.900000 | 12 | Thursday |
2 | 36490 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989 | Other | No | 608.166667 | 17 | Thursday |
3 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974 | Male | Yes | 26.416667 | 23 | Thursday |
4 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959 | Male | No | 29.883333 | 23 | Thursday |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173267 | 480 | 2019-02-01 00:04:49.724 | 2019-02-01 00:12:50.034 | 27.0 | Beale St at Harrison St | 37.788059 | -122.391865 | 324.0 | Union Square (Powell St at Post St) | 37.788300 | -122.408531 | 4832 | Subscriber | 1996 | Male | No | 8.000000 | 0 | Friday |
173268 | 313 | 2019-02-01 00:05:34.744 | 2019-02-01 00:10:48.502 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 66.0 | 3rd St at Townsend St | 37.778742 | -122.392741 | 4960 | Subscriber | 1984 | Male | No | 5.216667 | 0 | Friday |
173269 | 141 | 2019-02-01 00:06:05.549 | 2019-02-01 00:08:27.220 | 278.0 | The Alameda at Bush St | 37.331932 | -121.904888 | 277.0 | Morrison Ave at Julian St | 37.333658 | -121.908586 | 3824 | Subscriber | 1990 | Male | Yes | 2.350000 | 0 | Friday |
173270 | 139 | 2019-02-01 00:05:34.360 | 2019-02-01 00:07:54.287 | 220.0 | San Pablo Ave at MLK Jr Way | 37.811351 | -122.273422 | 216.0 | San Pablo Ave at 27th St | 37.817827 | -122.275698 | 5095 | Subscriber | 1988 | Male | No | 2.316667 | 0 | Friday |
173271 | 271 | 2019-02-01 00:00:20.636 | 2019-02-01 00:04:52.058 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 37.0 | 2nd St at Folsom St | 37.785000 | -122.395936 | 1057 | Subscriber | 1989 | Male | No | 4.516667 | 0 | Friday |
173272 rows × 19 columns
Next, we'll convert the day-of-week column to an ordered categorical so that plots and groupings follow the calendar from Monday to Sunday.
# order the days of the week
df_clean['day'] = pd.Categorical(df_clean['day'], categories= ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday'],ordered=True)
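To see what the ordered categorical buys us, here is a minimal sketch on toy data (the day names match the notebook; the sample values are made up):

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Without an ordered categorical, string days sort alphabetically
s = pd.Series(['Sunday', 'Friday', 'Monday'])
print(sorted(s))  # ['Friday', 'Monday', 'Sunday']

# With an ordered categorical, sorting (and seaborn's axis order) follows the calendar
cat = pd.Categorical(s, categories=days, ordered=True)
print(pd.Series(cat).sort_values().tolist())  # ['Monday', 'Friday', 'Sunday']
```

Seaborn's `countplot` respects this category order, which is why the bars appear Monday through Sunday rather than alphabetically.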
This bar plot visualizes the frequency of bike rides for each day of the week
# plotting the frequency of bike rides by the day of the week
plt.figure(figsize=[10,6])
base_color = sb.color_palette()[0]
sb.countplot(data = df_clean, x = 'day', color = base_color)
plt.xlabel('Start Day')
plt.title('Frequency of Bike Rides by Day of the Week')
plt.xticks(rotation=15);
The graph clearly indicates that weekdays are the prime time for bike trips
The next focus will be on categorizing users to analyze whether subscribers tend to make more trips than one-time customers.
# frequency of bike rides by user type
base_color = sb.color_palette()[0]
user_order = df_clean['user_type'].value_counts().index
sb.countplot(data = df_clean, x = 'user_type', color = base_color, order = user_order)
plt.xlabel('User')
plt.title('Frequency of Bike Rides by User Type');
It turns out that program subscribers use bikes more frequently than occasional users. We’ll delve into whether one-time customers embark on longer bike rides than subscribers. This analysis will be featured in the next section through a bivariate exploration.
Next, we'll look at member age. We'll create an age column from the 'member_birth_year' column
#creating age column from member birth year (2019 - member birth year since this data is from 2019)
df_clean['age'] = (2019 - df_clean['member_birth_year'])
# checking this worked
df_clean
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | duration_min | start_hour | day | age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52185 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984 | Male | No | 869.750000 | 17 | Thursday | 35 |
1 | 61854 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972 | Male | No | 1030.900000 | 12 | Thursday | 47 |
2 | 36490 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989 | Other | No | 608.166667 | 17 | Thursday | 30 |
3 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974 | Male | Yes | 26.416667 | 23 | Thursday | 45 |
4 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959 | Male | No | 29.883333 | 23 | Thursday | 60 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173267 | 480 | 2019-02-01 00:04:49.724 | 2019-02-01 00:12:50.034 | 27.0 | Beale St at Harrison St | 37.788059 | -122.391865 | 324.0 | Union Square (Powell St at Post St) | 37.788300 | -122.408531 | 4832 | Subscriber | 1996 | Male | No | 8.000000 | 0 | Friday | 23 |
173268 | 313 | 2019-02-01 00:05:34.744 | 2019-02-01 00:10:48.502 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 66.0 | 3rd St at Townsend St | 37.778742 | -122.392741 | 4960 | Subscriber | 1984 | Male | No | 5.216667 | 0 | Friday | 35 |
173269 | 141 | 2019-02-01 00:06:05.549 | 2019-02-01 00:08:27.220 | 278.0 | The Alameda at Bush St | 37.331932 | -121.904888 | 277.0 | Morrison Ave at Julian St | 37.333658 | -121.908586 | 3824 | Subscriber | 1990 | Male | Yes | 2.350000 | 0 | Friday | 29 |
173270 | 139 | 2019-02-01 00:05:34.360 | 2019-02-01 00:07:54.287 | 220.0 | San Pablo Ave at MLK Jr Way | 37.811351 | -122.273422 | 216.0 | San Pablo Ave at 27th St | 37.817827 | -122.275698 | 5095 | Subscriber | 1988 | Male | No | 2.316667 | 0 | Friday | 31 |
173271 | 271 | 2019-02-01 00:00:20.636 | 2019-02-01 00:04:52.058 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 37.0 | 2nd St at Folsom St | 37.785000 | -122.395936 | 1057 | Subscriber | 1989 | Male | No | 4.516667 | 0 | Friday | 30 |
173272 rows × 20 columns
# plotting age distribution
bin_edges = np.arange(10, df_clean['age'].max()+2, 2)
plt.hist(data = df_clean, x = 'age', bins = bin_edges)
plt.xlim(10, 70)
plt.xlabel('Member Age (years)')
plt.ylabel('Frequency');
sb.violinplot(data = df_clean, y = 'age')
plt.ylabel('Member Age (years)');
The graphs above indicate a noticeable peak around the age of 30.
Next, we’ll analyze gender to determine if one demographic uses FordGo Bikes more than the others.
# plotting gender
base_color = sb.color_palette()[0]
gender_order = df_clean['member_gender'].value_counts().index
sb.countplot(data = df_clean, x = 'member_gender', color = base_color, order = gender_order)
plt.xlabel('Gender')
plt.title('Frequency of Bike Rides by Gender');
The data shows that a significant number of bike riders are male. We will later examine how gender impacts the length of bike trips.
We will now focus on examining the number of bike trips that were part of the Bike Share for All program.
base_color = sb.color_palette()[0]
share_order = df_clean['bike_share_for_all_trip'].value_counts().index
sb.countplot(data = df_clean, x = 'bike_share_for_all_trip', color = base_color, order = share_order)
plt.xlabel('Bike Share for All')
plt.title('Frequency of Bike Share for All Trips');
The data above indicates that most trips were not part of the Bike Share for All program.
In the final part of the univariate exploration, we will calculate the distance using the provided start and end longitude and latitude coordinates.
def haversine_np(start_station_longitude, start_station_latitude, end_station_longitude, end_station_latitude):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees).
All args must be of equal length.
"""
# Convert decimal degrees to radians
start_station_longitude, start_station_latitude, end_station_longitude, end_station_latitude = map(np.radians,
[start_station_longitude, start_station_latitude, end_station_longitude, end_station_latitude])
# Differences in coordinates
dlon = end_station_longitude - start_station_longitude
dlat = end_station_latitude - start_station_latitude
# Haversine formula
a = np.sin(dlat/2.0)**2 + np.cos(start_station_latitude) * np.cos(end_station_latitude) * np.sin(dlon/2.0)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)) # Using arctan2 for stability
# Radius of the Earth in kilometers
radius_earth_km = 6371.0
km = radius_earth_km * c
return km
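As a quick sanity check on the formula, we can feed it the coordinates of the first trip in the dataset (Montgomery St BART Station to Commercial St at Montgomery St), which should come out at roughly half a kilometre. The function is restated compactly here so the snippet runs on its own:

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    return 6371.0 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Coordinates taken from the first row of the dataset
d = haversine_np(-122.400811, 37.789625, -122.402923, 37.794231)
print(d)  # roughly 0.54 km -- a couple of city blocks
```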
We'll create a distance column
# creating a distance column
df_clean['distance'] = haversine_np(df_clean['start_station_longitude'],df_clean['start_station_latitude'],df_clean['end_station_longitude'],df_clean['end_station_latitude'])
# checking that this worked
df_clean
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | ... | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | duration_min | start_hour | day | age | distance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52185 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | ... | 4902 | Customer | 1984 | Male | No | 869.750000 | 17 | Thursday | 35 | 0.544709 |
1 | 61854 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | ... | 5905 | Customer | 1972 | Male | No | 1030.900000 | 12 | Thursday | 47 | 2.704545 |
2 | 36490 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | ... | 6638 | Subscriber | 1989 | Other | No | 608.166667 | 17 | Thursday | 30 | 0.260739 |
3 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | ... | 4898 | Subscriber | 1974 | Male | Yes | 26.416667 | 23 | Thursday | 45 | 2.409301 |
4 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | ... | 5200 | Subscriber | 1959 | Male | No | 29.883333 | 23 | Thursday | 60 | 3.332203 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173267 | 480 | 2019-02-01 00:04:49.724 | 2019-02-01 00:12:50.034 | 27.0 | Beale St at Harrison St | 37.788059 | -122.391865 | 324.0 | Union Square (Powell St at Post St) | 37.788300 | ... | 4832 | Subscriber | 1996 | Male | No | 8.000000 | 0 | Friday | 23 | 1.464766 |
173268 | 313 | 2019-02-01 00:05:34.744 | 2019-02-01 00:10:48.502 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 66.0 | 3rd St at Townsend St | 37.778742 | ... | 4960 | Subscriber | 1984 | Male | No | 5.216667 | 0 | Friday | 35 | 1.402716 |
173269 | 141 | 2019-02-01 00:06:05.549 | 2019-02-01 00:08:27.220 | 278.0 | The Alameda at Bush St | 37.331932 | -121.904888 | 277.0 | Morrison Ave at Julian St | 37.333658 | ... | 3824 | Subscriber | 1990 | Male | Yes | 2.350000 | 0 | Friday | 29 | 0.379066 |
173270 | 139 | 2019-02-01 00:05:34.360 | 2019-02-01 00:07:54.287 | 220.0 | San Pablo Ave at MLK Jr Way | 37.811351 | -122.273422 | 216.0 | San Pablo Ave at 27th St | 37.817827 | ... | 5095 | Subscriber | 1988 | Male | No | 2.316667 | 0 | Friday | 31 | 0.747282 |
173271 | 271 | 2019-02-01 00:00:20.636 | 2019-02-01 00:04:52.058 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 37.0 | 2nd St at Folsom St | 37.785000 | ... | 1057 | Subscriber | 1989 | Male | No | 4.516667 | 0 | Friday | 30 | 0.710395 |
173272 rows × 21 columns
Now let's check for outliers
df_clean.distance.describe()
count    173272.000000
mean          1.691558
std           1.096204
min           0.000000
25%           0.910955
50%           1.430675
75%           2.225687
max          69.469241
Name: distance, dtype: float64
df_clean.query("distance > 20")
duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | ... | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | duration_min | start_hour | day | age | distance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
105822 | 6945 | 2019-02-12 14:28:44.402 | 2019-02-12 16:24:30.158 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 300.0 | Palm St at Willow St | 37.317298 | ... | 4780 | Subscriber | 1985 | Female | No | 115.75 | 14 | Tuesday | 34 | 69.469241 |
1 rows × 21 columns
In this section, we'll remove this outlier
df_clean = df_clean.drop(df_clean[df_clean.distance > 20].index)
df_clean.distance.describe()
count    173271.000000
mean          1.691167
std           1.084047
min           0.000000
25%           0.910955
50%           1.430675
75%           2.225687
max          15.673955
Name: distance, dtype: float64
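The drop-by-index pattern used above is interchangeable with a boolean mask; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'distance': [0.5, 1.2, 69.5, 2.0]})

dropped = df.drop(df[df.distance > 20].index)  # pattern used in the notebook
masked = df[df.distance <= 20]                 # equivalent boolean mask

print(dropped.equals(masked))  # True
```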
Let's plot the distance distribution
# plotting distance distribution
bin_edges = np.arange(0, df_clean['distance'].max()+0.5, 0.5)
plt.hist(data = df_clean, x = 'distance', bins = bin_edges)
plt.xlabel('Distance (km)')
plt.xlim(0,10)
plt.ylabel('Frequency');
Here, we see the frequency distribution of distances traveled, with most trips covering between 0 and 2 kilometers, and the frequency significantly decreasing for longer distances.
Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶
The duration of bike trips was highly skewed to the right, so a log scale was applied, under which the distribution appears roughly log-normal. To enhance comprehension, the duration of bike trips in minutes was added to the dataset. The resulting graph clearly indicates that most bike rides are under 30 minutes, with an average duration of around 9-10 minutes.
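The log transform described here can be sketched on synthetic right-skewed data (the notebook applied it to the real duration column; the values below are simulated):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Simulated trip durations in minutes: right-skewed, median around 9
durations = rng.lognormal(mean=np.log(9), sigma=0.7, size=10_000)

# Log-spaced bin edges so every bar covers an equal ratio rather than an equal width
bin_edges = 10 ** np.arange(np.log10(durations.min()),
                            np.log10(durations.max()) + 0.05, 0.05)
plt.hist(durations, bins=bin_edges)
plt.xscale('log')
plt.xlabel('Duration (min)')
plt.ylabel('Frequency')
```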
Additionally, a distance column was created using longitude and latitude data, revealing that the majority of bike trips were under 2 km.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶
The time of day and day of the week were examined, indicating that bikes were primarily used for commuting, with peak usage occurring from 07:00-09:00 and 16:00-19:00 on weekdays.
An age column was included to assess the ages of members. The analysis showed a peak of riders around 30 years old. To tidy the data, individuals over the age of 65 were removed, representing the top 1% of the dataset.
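The top-1% trim described above can be expressed with a quantile cutoff; a sketch on simulated ages (in the notebook the 65-year threshold was read off the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ages = pd.Series(rng.normal(35, 10, size=5000).round().clip(18, 90))

cutoff = ages.quantile(0.99)   # age at the 99th percentile
trimmed = ages[ages <= cutoff]

print(len(trimmed) / len(ages))  # roughly 0.99
```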
Moreover, it was observed that there were more subscribers than customers taking bike trips, and a higher number of male riders compared to female or other categories. Most trips were not part of the bike share for all trip scheme.
Bivariate Exploration¶
We will now embark on a bivariate analysis to examine potential relationships within the data.
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 173271 entries, 0 to 173271
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             173271 non-null  int64
 1   start_time               173271 non-null  datetime64[ns]
 2   end_time                 173271 non-null  datetime64[ns]
 3   start_station_id         173271 non-null  object
 4   start_station_name       173271 non-null  object
 5   start_station_latitude   173271 non-null  float64
 6   start_station_longitude  173271 non-null  float64
 7   end_station_id           173271 non-null  object
 8   end_station_name         173271 non-null  object
 9   end_station_latitude     173271 non-null  float64
 10  end_station_longitude    173271 non-null  float64
 11  bike_id                  173271 non-null  object
 12  user_type                173271 non-null  category
 13  member_birth_year        173271 non-null  int64
 14  member_gender            173271 non-null  category
 15  bike_share_for_all_trip  173271 non-null  category
 16  duration_min             173271 non-null  float64
 17  start_hour               173271 non-null  int64
 18  day                      173271 non-null  category
 19  age                      173271 non-null  int64
 20  distance                 173271 non-null  float64
dtypes: category(4), datetime64[ns](2), float64(6), int64(4), object(5)
memory usage: 24.5+ MB
We'll designate numeric and categorical variables to simplify plotting.
# assigning numeric and categoric variables
numeric_vars = ['duration_min', 'age', 'distance']
categoric_vars = ['member_gender', 'user_type', 'bike_share_for_all_trip', 'day', 'start_hour']
To start, we will analyze the numeric variables to uncover any existing relationships or correlations.
g = sb.PairGrid(data = df_clean, vars = numeric_vars)
g.map_diag(plt.hist, bins = 20)
g.map_offdiag(plt.scatter, alpha = 0.3)
<seaborn.axisgrid.PairGrid at 0x15bbf9dd0>
The plots indicate that there are no strong correlations among the numeric variables. We will investigate the relationships between the numeric and categorical variables, as well as those between different categorical variables.
Duration Min¶
We will look at the relationship between the duration of bike trips (measured in minutes) among other features
def boxgrid(x, y, **kwargs):
default_color = sb.color_palette()[0]
# Call boxplot with explicit parameters
sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['member_gender', 'user_type', 'bike_share_for_all_trip'],
                height=3, aspect=1.5)
g.map(boxgrid)
# Apply the y-limit to the grid's axes (a bare plt.ylim before mapping targets a new, empty figure)
g.set(ylim=(0, 50))
plt.show()
The plot clearly shows that, on average, females and individuals identifying as "other" take longer bike trips than males. Interestingly, customers tend to take longer trips than subscribers, suggesting that when customers opt for a one-time ride, they ride for longer. Additionally, the bike share program does not significantly influence trip duration, so it will not be considered in the remainder of the analysis.
Next, we will explore the features associated with the time of day and day of the week
def boxgrid(x, y, **kwargs):
default_color = sb.color_palette()[0]
# Call boxplot with explicit parameters
sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['day', 'start_hour'],
                height=3, aspect=1.5)
g.map(boxgrid)
# Apply the y-limit to the grid's axes (a bare plt.ylim before mapping targets a new, empty figure)
g.set(ylim=(0, 50))
# Rotate x-axis labels properly
for ax in g.axes.flatten():
# Get current tick positions
ticks = ax.get_xticks()
ax.set_xticks(ticks) # Ensure ticks are set
ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha='right')
plt.show();
The day of the week influenced trip duration, with longer bike rides occurring on weekends. Additionally, the most extended bike trips happened in the afternoon, specifically between 14:00 and 15:00.
Distance¶
We will take a brief look at the other features and how they relate to each other.
def boxgrid(x, y, **kwargs):
default_color = sb.color_palette()[0]
# Call boxplot with explicit parameters
sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})
g = sb.PairGrid(data=df_clean, y_vars='distance', x_vars=['member_gender', 'user_type', 'bike_share_for_all_trip'],
                height=3, aspect=1.5)
g.map(boxgrid)
g.set(ylim=(0, 5))
plt.show()
The analysis above reveals that some variables had a small effect on distance traveled. Notably, females and those who chose 'other' as their gender traveled farther, as did customers compared to subscribers. Furthermore, individuals not engaged in the bike share program also tended to cover longer distances.
We'll take a brief look at distance along with the other variables.
def boxgrid(x, y, **kwargs):
default_color = sb.color_palette()[0]
sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})
g = sb.PairGrid(data=df_clean, y_vars='distance', x_vars=['day', 'start_hour'], height=3, aspect=1.5)
g.map(boxgrid)
g.set(ylim=(0, 5))
for ax in g.axes.flat:
# Get the current x-ticks
ticks = ax.get_xticks()
# Set the new labels for the current ticks
ax.set_xticks(ticks) # Ensure ticks are set before labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha='right')
plt.show();
The analysis above shows that the day of the week does not really affect the distance traveled. However, it is evident that longer distances were covered in the mornings, particularly between 7 AM and 8 AM. Since there are no compelling relationships regarding distance to explore further, we will now turn our attention to age and its relation to other variables.
Age¶
def boxgrid(x, y, **kwargs):
default_color = sb.color_palette()[0]
# Call boxplot with explicit parameters
sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})
g = sb.PairGrid(data=df_clean, y_vars='age', x_vars=['member_gender', 'user_type', 'bike_share_for_all_trip'],
                height=3, aspect=1.5)
g.map(boxgrid)
plt.show()
The boxplot above indicates that the majority of members are in their early 30s, regardless of gender or user type. Participants in the bike share for all scheme tend to be younger on average than those who are not involved.
def boxgrid(x, y, **kwargs):
default_color = sb.color_palette()[0]
# Call boxplot with explicit parameters
sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})
plt.xticks(rotation=25)
g = sb.PairGrid(data=df_clean, y_vars='age', x_vars=['day', 'start_hour'],
                height=3, aspect=1.5)
g.map(boxgrid)
plt.show()
The plot above shows that younger individuals are more likely to take bike trips on weekends. On average, these trips occur late at night and in the early morning hours. We will exclude 'age,' 'distance,' and 'bike share for all' from further analysis, as no significant relationships were identified with these variables.
User Type and Gender¶
Let's examine the relationship between user type and gender.
#clustered bar chart
ax = sb.countplot(data = df_clean, x = 'user_type', hue = 'member_gender', hue_order=['Male', 'Female', 'Other'])
ax.legend(loc = 2, framealpha = 1)
<matplotlib.legend.Legend at 0x15ce5c150>
This clustered bar chart clearly shows that subscribers took more bike rides than customers. The majority of both groups were male.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶
During this part of the investigation, it was found that, on average, individuals who identified as female or as "other" took longer bike rides than males. Customers also generally took longer rides than subscribers, as did those not participating in the bike share for all scheme. The day of the week affected trip duration, with longer rides occurring on weekends, and the longest bike trips started between 2-3 PM.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶
In terms of the day of the week, younger individuals tended to take more bike trips on weekends. Additionally, trips made very late at night and during the early morning hours were predominantly by younger people.
From the demographic analysis, it is clear that subscribers took more bike rides than customers, and the majority of both groups were male.
When examining other features in the dataset, it appears that those who identified as female or other traveled slightly greater distances than males. Customers also traveled slightly further than subscribers on average, and those not involved in the bike share for all scheme tended to cover greater distances as well. Most members are in their early 30s, and those participating in the bike share for all scheme are generally younger than those who are not.
Moving forward, 'age', 'distance', and 'bike share for all' will not be included in any further analysis, as they do not relate to the features of interest and do not demonstrate strong relationships with other features in the dataset.
Multivariate Exploration¶
Next, we'll utilize multivariate plots to examine how the duration and frequency of bike trips relate to various categorical variables, including gender, user type, time of day, and day of the week.
default_blue = sb.color_palette()[0]
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['member_gender', 'user_type', 'day'])
g.map(sb.violinplot, inner='quartile', color=default_blue)
# Set the y-limit after mapping so it applies to the grid's axes
g.set(ylim=(0, 50))
for ax in g.fig.axes:
    ax.tick_params(axis='x', rotation=30)
plt.show()
default_blue = sb.color_palette()[0]
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['member_gender', 'user_type', 'day'], height=4)
g.map(sb.boxplot, color=default_blue)
g.set(ylim=(0, 50))
for ax in g.fig.axes:
    ax.tick_params(axis='x', rotation=30)
plt.show()
Lets create a clustered bar chart illustrating the relationship between duration, user type, and gender
# clustered bar chart
default_colors = ['#1f77b4', '#ff7f0e', '#2ca02c'] # Blue, orange, green
user_type_order = ['Subscriber', 'Customer']
plt.figure(figsize=[10, 6])
ax = sb.barplot(data=df_clean, x='user_type', y='duration_min', hue='member_gender', hue_order=['Male', 'Female', 'Other'], palette=default_colors, order=user_type_order)
ax.legend(loc=2, ncol=1, framealpha=1, title='member_gender')
plt.show()
The data indicates that females and others took longer bike trips than males, regardless of their status as customers or subscribers.
A clustered bar chart will be created to explore the relationship between duration, day of the week, and gender.
# clustered bar chart
plt.figure(figsize = [12, 8])
ax = sb.barplot(data = df_clean, x = 'day', y = 'duration_min', hue = 'member_gender', hue_order=['Male', 'Female', 'Other'], palette=default_colors)
ax.legend(loc = 2, ncol = 2, framealpha = 1, title = 'member_gender')
<matplotlib.legend.Legend at 0x15cf82690>
As shown above, females and individuals identifying as other took longer bike rides than males on each day of the week, and all genders had longer bike trips on weekends.
We will explore the relationship between duration, user type, and the day of the week.
# clustered bar chart
plt.figure(figsize = [12, 8])
ax = sb.barplot(data = df_clean, x = 'day', y = 'duration_min', hue = 'user_type', palette=default_colors[:2])
ax.legend(loc = 2, ncol = 2, framealpha = 1, title = 'user type')
<matplotlib.legend.Legend at 0x159917610>
Facet Plot
Let's try a Facet Plot for more clarity
# Define the order for the days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Convert the 'day' column to a categorical type with the specified order
df_clean['day'] = pd.Categorical(df_clean['day'], categories=day_order, ordered=True)
# Create a FacetGrid
g = sb.FacetGrid(df_clean, col='user_type', col_wrap=2, height=4, aspect=1.5)
# Create the bar plot within each facet
g.map_dataframe(sb.barplot, x='day', y='duration_min', hue='user_type', palette=default_colors[:2], order=day_order)
# Add legends and titles
g.add_legend(title='User Type', bbox_to_anchor=(1.05, 1), loc='upper left')
g.set_axis_labels('Day', 'Duration (minutes)')
g.set_titles(col_template='{col_name}')
# Show the plot
plt.tight_layout()
plt.show()
As seen above, customers had longer bike rides than subscribers on every day of the week, and both groups took longer trips on the weekends.
Frequency of bike trips based on the day of the week and time of day¶
Next, a heatmap will be created to visualize the frequency of bike trips based on the day of the week and time of day.
# Generate a second dataframe for visualization
df_clean2 = pd.pivot_table(df_clean[['day', 'start_hour', 'duration_min']], index=['day', 'start_hour'], aggfunc='count')
# Unstacking to achieve the appropriate format for the heatmap.
df_clean3 = df_clean2.unstack(level=0)
# Generate new labels for the hours.
am_hrs = [f"{hr}am" for hr in range(1, 12)]
pm_hrs = [f"{hr}pm" for hr in range(1, 12)]
complete_hrs = ["12am"] + am_hrs + ["12pm"] + pm_hrs
# Abbreviated names for the days of the week
day_abbr = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
# Create the heatmap
sb.set_context("talk")
f, ax = plt.subplots(figsize=(11, 15))
ax = sb.heatmap(df_clean3, annot=True, fmt="d", linewidths=.5, ax=ax, xticklabels=day_abbr, yticklabels=complete_hrs, cmap="viridis")
ax.axes.set_title("Heatmap of Ride Counts by Day and Hour of Day", fontsize=24, y=1.01)
ax.set(xlabel='Day of Week', ylabel='Starting Hour of Ride');
plt.show()
The heatmap above reveals that on weekdays, the majority of bike trips occur between 6 AM and 9 AM, as well as 4 PM to 7 PM.
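The pivot_table/unstack step that feeds this heatmap can be sketched on toy data (column names match the notebook; the rows are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'day':          ['Mon', 'Mon', 'Mon', 'Tue'],
    'start_hour':   [8, 8, 17, 8],
    'duration_min': [5.0, 7.0, 6.0, 4.0],
})

# Count trips per (day, start_hour) pair, then unstack days into columns
counts = pd.pivot_table(toy, index=['day', 'start_hour'],
                        values='duration_min', aggfunc='count')
grid = counts.unstack(level=0)

print(grid)
# Mon 8am -> 2 trips, Mon 5pm -> 1, Tue 8am -> 1; missing combinations become NaN
```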
Time of day, Day of the week, and Duration (minutes)¶
Now let's visualize the relationship between time of day, day of the week, and bike ride duration.
plt.figure(figsize=[12, 12])
cat_means = df_clean.groupby(['day', 'start_hour'], observed=False).mean(numeric_only=True)['duration_min']
cat_means = cat_means.reset_index(name='duration_min_avg')
cat_means = cat_means.pivot(index='start_hour', columns='day', values='duration_min_avg')
sb.heatmap(cat_means, annot=True, fmt='.3f',
cbar_kws={'label': 'Average Duration of Bike Trip (minutes)'}, xticklabels=day_abbr, yticklabels=complete_hrs, cmap="viridis_r")
plt.xlabel("Day of the Week")
plt.ylabel("Starting Hour of the Bike Ride")
plt.title("Heatmap of Ride Duration by Day and Hour of Day", fontsize=24, y=1.01);
The heatmap above shows that bike rides tend to be a bit longer on weekends compared to weekdays. It also reveals that the longest trips, on average, occur during the early morning hours.
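The aggregation behind this second heatmap (group, average, then pivot) works the same way on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'day':          ['Mon', 'Mon', 'Sat', 'Sat'],
    'start_hour':   [8, 8, 14, 14],
    'duration_min': [5.0, 7.0, 20.0, 30.0],
})

# Average duration per (day, start_hour), then pivot days into columns
means = toy.groupby(['day', 'start_hour'])['duration_min'].mean()
means = means.reset_index(name='duration_min_avg')
grid = means.pivot(index='start_hour', columns='day', values='duration_min_avg')

print(grid)
# Mon 8am averages 6.0 min; Sat 2pm averages 25.0 min; other cells are NaN
```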
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶
The investigation into bike trip length and frequency revealed notable relationships, particularly regarding the impact of the day of the week and time of day. The analysis showed that both factors influenced bike trip frequency, with most trips occurring on weekdays during commuting hours, as people utilized bikes for their daily commutes. It was evident that these features supported each other in highlighting patterns of bike usage.
Were there any interesting or surprising interactions between features?¶
While it was expected that more bike trips would occur during commuting hours, it was surprising to find that longer bike trips on average were more common during weekends compared to weekdays. Additionally, an unexpected finding was that some of the longest bike trips during weekdays took place between 1 AM and 3 AM. The data also revealed interesting trends related to gender. All genders tended to take longer bike rides on weekends, but those identifying as 'female' or 'other' consistently took longer trips throughout the week. Furthermore, customers generally took longer trips than subscribers on all days.
Conclusions¶
During the exploration, we saw that most bike trips lasted under 30 minutes, averaging around 9 minutes, with most trips under 2 km. Users primarily rode during weekdays and commuting hours. Most users were subscribers, but customers tended to take longer trips. The service was most popular among those in their mid-twenties to mid-thirties, with usage declining with age. Males used the service more, but females and those identifying as 'other' had longer trips. Most trips were not part of the Bike Share for All program. Bivariate analysis showed weak correlations among age, duration, and distance. Longer trips occurred on weekends and in the afternoon. The exploration confirmed that trips were most common during commuting hours, with unexpectedly long trips occurring late at night. Interestingly, females and 'other' users traveled greater distances, as did customers compared to subscribers. Mornings, especially 7-8 AM, saw longer distances traveled. Most users were in their early 30s, with those in the Bike Share for All program being younger. Younger riders were more active on weekends and took late-night trips.
# save as a csv
df_clean.to_csv('2019-fordgobike-data-clean.csv', index = False)