Part I - Ford GoBike Data Exploration¶

Jose Murguia¶

Introduction¶

This document explores a dataset of approximately 183,412 individual trips from the Ford GoBike bike-sharing system.

Preliminary Wrangling¶

While this project emphasizes communicating data findings, it's essential to first perform data wrangling to ensure the dataset is tidy and the findings are accurate.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb


%matplotlib inline

We'll first load the dataset and look at the first few rows.

In [2]:
df = pd.read_csv('201902-fordgobike-tripdata.csv')
df.head()
Out[2]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes

We'll look at the structure and contents of the DataFrame, to help identify issues like missing values or data types that may need to be adjusted.

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object 
 15  bike_share_for_all_trip  183412 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
  • The 'start_time' and 'end_time' fields should be changed to the datetime data type.
  • The 'member_birth_year' field should be converted to an integer data type.
  • The 'user_type' and 'member_gender' fields could be changed to a category data type.
  • The 'bike_share_for_all_trip' field could be changed to a boolean data type.
  • The 'bike_id', 'start_station_id', and 'end_station_id' fields should be changed to strings.

We'll check for duplicate rows.

In [4]:
sum(df.duplicated())
Out[4]:
0

Next, we'll count the total number of missing (null) values in each column of the DataFrame.

In [5]:
df.isnull().sum()
Out[5]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64

Several columns contain missing data: the four station id/name columns have 197 nulls each, and the two member columns have 8,265 each.
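A quick way to confirm that these nulls co-occur (the 197 station nulls on the same rows, and likewise the 8,265 member nulls) is to compare the null masks column by column. A minimal sketch on a toy frame (the values below are illustrative, not from the real data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the null pattern above: station id and name are
# missing together, and the two member columns are missing together
toy = pd.DataFrame({
    'start_station_id':   [21.0, np.nan, 86.0],
    'start_station_name': ['A',  np.nan, 'B'],
    'member_birth_year':  [1984.0, 1972.0, np.nan],
    'member_gender':      ['Male', 'Male', np.nan],
})

# True if the two station columns are null on exactly the same rows
station_nulls_align = toy['start_station_id'].isnull().equals(
    toy['start_station_name'].isnull())
member_nulls_align = toy['member_birth_year'].isnull().equals(
    toy['member_gender'].isnull())
print(station_nulls_align, member_nulls_align)  # → True True
```

Run on the real df, the same comparison tells us whether dropping rows by one column would remove the same rows as dropping by its partner column.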

Let's summarize the DataFrame's key statistical properties, such as count, mean, standard deviation, and quartiles.

In [6]:
df.describe()
Out[6]:
duration_sec start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year
count 183412.000000 183215.000000 183412.000000 183412.000000 183215.000000 183412.000000 183412.000000 183412.000000 175147.000000
mean 726.078435 138.590427 37.771223 -122.352664 136.249123 37.771427 -122.352250 4472.906375 1984.806437
std 1794.389780 111.778864 0.099581 0.117097 111.515131 0.099490 0.116673 1664.383394 10.116689
min 61.000000 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000
25% 325.000000 47.000000 37.770083 -122.412408 44.000000 37.770407 -122.411726 3777.000000 1980.000000
50% 514.000000 104.000000 37.780760 -122.398285 100.000000 37.781010 -122.398279 4958.000000 1987.000000
75% 796.000000 239.000000 37.797280 -122.286533 235.000000 37.797320 -122.288045 5502.000000 1992.000000
max 85444.000000 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000

As we can see, the earliest recorded birth year for members is 1878, which appears to be an error. We will investigate these inaccurate birth years more thoroughly in the cleaning section of this report.
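Since all trips occurred in February 2019, a birth year maps to an approximate age by simple subtraction. A hedged sketch on illustrative values (the `approx_age` column is our own addition for illustration, not part of the dataset):

```python
import pandas as pd

# Illustrative birth years, including the suspicious minimum seen above
toy = pd.DataFrame({'member_birth_year': [1878.0, 1931.0, 1987.0]})

# Trips are from February 2019, so approximate age = 2019 - birth year
toy['approx_age'] = 2019 - toy['member_birth_year']
print(toy['approx_age'].tolist())  # → [141.0, 88.0, 32.0]
```

A birth year of 1878 would make the rider 141 years old, which is clearly impossible.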

Let's check how many riders would be at least 85 years old (born before 1934).

In [7]:
df.query("member_birth_year < 1934")
Out[7]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
1285 148 2019-02-28 19:29:17.6270 2019-02-28 19:31:45.9670 158.0 Shattuck Ave at Telegraph Ave 37.833279 -122.263490 173.0 Shattuck Ave at 55th St 37.840364 -122.264488 5391 Subscriber 1900.0 Male Yes
5197 217 2019-02-28 13:51:46.2380 2019-02-28 13:55:24.1270 70.0 Central Ave at Fell St 37.773311 -122.444293 71.0 Broderick St at Oak St 37.773063 -122.439078 5801 Subscriber 1931.0 Male No
5266 384 2019-02-28 13:35:05.4280 2019-02-28 13:41:30.2230 84.0 Duboce Park 37.769200 -122.433812 71.0 Broderick St at Oak St 37.773063 -122.439078 6608 Subscriber 1931.0 Male No
5447 147 2019-02-28 13:08:56.9350 2019-02-28 13:11:24.0620 84.0 Duboce Park 37.769200 -122.433812 72.0 Page St at Scott St 37.772406 -122.435650 5018 Subscriber 1931.0 Male No
10827 1315 2019-02-27 19:21:34.4360 2019-02-27 19:43:30.0080 343.0 Bryant St at 2nd St 37.783172 -122.393572 375.0 Grove St at Masonic Ave 37.774836 -122.446546 6249 Subscriber 1900.0 Male No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
177708 1527 2019-02-01 19:09:28.3870 2019-02-01 19:34:55.9630 343.0 Bryant St at 2nd St 37.783172 -122.393572 375.0 Grove St at Masonic Ave 37.774836 -122.446546 5286 Subscriber 1900.0 Male No
177885 517 2019-02-01 18:38:40.4710 2019-02-01 18:47:18.3920 25.0 Howard St at 2nd St 37.787522 -122.397405 30.0 San Francisco Caltrain (Townsend St at 4th St) 37.776598 -122.395282 2175 Subscriber 1902.0 Female No
177955 377 2019-02-01 18:23:33.4110 2019-02-01 18:29:50.7950 26.0 1st St at Folsom St 37.787290 -122.394380 321.0 5th St at Folsom 37.780146 -122.403071 5444 Subscriber 1933.0 Female Yes
182830 428 2019-02-01 07:45:05.9340 2019-02-01 07:52:14.9220 284.0 Yerba Buena Center for the Arts (Howard St at ... 37.784872 -122.400876 67.0 San Francisco Caltrain Station 2 (Townsend St... 37.776639 -122.395526 5031 Subscriber 1901.0 Male No
183388 490 2019-02-01 00:39:53.1120 2019-02-01 00:48:03.3380 61.0 Howard St at 8th St 37.776513 -122.411306 81.0 Berry St at 4th St 37.775880 -122.393170 5411 Subscriber 1927.0 Male No

187 rows × 16 columns


Data Cleaning¶

The initial step in the cleaning process is to make a copy of the original dataset.

In [8]:
# make a copy
df_clean = df.copy()

We'll remove rows with null values from the dataset and reset the index.

In [9]:
# dropping rows with null values
df_clean.dropna(inplace=True)
In [10]:
df_clean.reset_index(drop=True, inplace = True)

Let's review the dataset overview to see the changes

In [11]:
# Display DataFrame info
df_clean.info()

# Check for null counts
null_counts = df_clean.isnull().sum()
print(null_counts)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174952 entries, 0 to 174951
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             174952 non-null  int64  
 1   start_time               174952 non-null  object 
 2   end_time                 174952 non-null  object 
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object 
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object 
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64  
 12  user_type                174952 non-null  object 
 13  member_birth_year        174952 non-null  float64
 14  member_gender            174952 non-null  object 
 15  bike_share_for_all_trip  174952 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 21.4+ MB
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64

Next, we'll convert the columns to the data types we identified earlier.

In [12]:
# Define a dictionary of columns and their desired data types
dtype_mapping = {
    'start_time': 'datetime64[ns]',
    'end_time': 'datetime64[ns]',
    'member_birth_year': 'int',
    'bike_id': 'object',
    'start_station_id': 'object',
    'end_station_id': 'object',
    'user_type': 'category',
    'member_gender': 'category',
    'bike_share_for_all_trip': 'category'
}

# Apply the conversions
for column, dtype in dtype_mapping.items():
    df_clean[column] = df_clean[column].astype(dtype)

Let's check the overview of the dataframe

In [13]:
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174952 entries, 0 to 174951
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  object        
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  object        
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  object        
 12  user_type                174952 non-null  category      
 13  member_birth_year        174952 non-null  int64         
 14  member_gender            174952 non-null  category      
 15  bike_share_for_all_trip  174952 non-null  category      
dtypes: category(3), datetime64[ns](2), float64(4), int64(2), object(5)
memory usage: 17.9+ MB

The earliest birth year listed is 1878, and several member birth years would make riders well over 90 years old, which is highly improbable. We'll examine the distribution of birth years to decide on a sensible cutoff for removing these erroneous records.

Let's take a look at the statistics for the member birth years in the dataset.

In [14]:
df_clean.member_birth_year.describe()
Out[14]:
count    174952.000000
mean       1984.803135
std          10.118731
min        1878.000000
25%        1980.000000
50%        1987.000000
75%        1992.000000
max        2001.000000
Name: member_birth_year, dtype: float64

Let's look at a histogram of member birth years, using 5-year bins from 1880 to 2000 to visualize the frequency distribution of the data.

In [15]:
# 5-year bins starting just below the earliest birth year
bin_edges = np.arange(1875, df_clean['member_birth_year'].max() + 5, 5)
plt.hist(data = df_clean, x = 'member_birth_year', bins = bin_edges)
plt.xlim(1880, 2000)
plt.xlabel('Member Birth Year')
plt.ylabel('Frequency');
[Figure: histogram of member birth years, 1880-2000]

Next, we'll look for outliers in member birth years by visualizing the data with a box plot.

In [16]:
# look for outliers
sb.boxplot(x=df_clean['member_birth_year'])
Out[16]:
<Axes: xlabel='member_birth_year'>
[Figure: box plot of member birth years]

Let's calculate the first and third quartiles (Q1 and Q3) and the interquartile range (IQR) for member birth years to identify potential outliers.

In [17]:
# Using the Interquartile Range to understand outliers
q1 = df_clean.member_birth_year.quantile(0.25)
q3 = df_clean.member_birth_year.quantile(0.75)
iqr = q3 - q1
print(q1)
print(q3)
print(iqr)
1980.0
1992.0
12.0

We'll calculate the lower whisker for identifying outliers in member birth years, which is determined by subtracting 1.5 times the interquartile range (IQR) from Q1.

In [18]:
lower_whisker = q1 - 1.5 * iqr
print(lower_whisker)
1962.0
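For completeness, the upper Tukey fence is Q3 + 1.5 × IQR. With the quartiles computed above it lands at 2010, beyond the maximum birth year of 2001, so only the low end of the distribution produces outliers. A quick sketch:

```python
# Quartiles as computed above for member_birth_year
q1, q3 = 1980.0, 1992.0
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr  # values above this would be, but the max is 2001
print(lower_fence, upper_fence)  # → 1962.0 2010.0
```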

Let's filter the dataset to examine members born before 1962.

In [19]:
df_clean.query("member_birth_year < 1962")
Out[19]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
4 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959 Male No
40 116 2019-02-28 23:44:00.988 2019-02-28 23:45:57.482 104.0 4th St at 16th St 37.767045 -122.390833 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 823 Subscriber 1959 Male No
62 681 2019-02-28 23:19:37.366 2019-02-28 23:30:58.862 43.0 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 70.0 Central Ave at Fell St 37.773311 -122.444293 6333 Subscriber 1959 Male No
196 547 2019-02-28 22:25:51.137 2019-02-28 22:34:58.970 76.0 McCoppin St at Valencia St 37.771662 -122.422423 43.0 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 6333 Subscriber 1961 Female No
297 217 2019-02-28 21:58:47.639 2019-02-28 22:02:24.693 149.0 Emeryville Town Hall 37.831275 -122.285633 153.0 59th St at Horton St 37.840945 -122.291360 5210 Subscriber 1961 Male No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
174844 459 2019-02-01 05:15:05.178 2019-02-01 05:22:44.272 104.0 4th St at 16th St 37.767045 -122.390833 30.0 San Francisco Caltrain (Townsend St at 4th St) 37.776598 -122.395282 3446 Subscriber 1959 Male No
174852 373 2019-02-01 04:42:44.709 2019-02-01 04:48:58.076 131.0 22nd St at Dolores St 37.755000 -122.425728 129.0 Harrison St at 20th St 37.758862 -122.412544 5427 Subscriber 1958 Male No
174853 100 2019-02-01 04:46:54.805 2019-02-01 04:48:34.843 80.0 Townsend St at 5th St 37.775235 -122.397437 67.0 San Francisco Caltrain Station 2 (Townsend St... 37.776639 -122.395526 3138 Subscriber 1950 Male No
174926 400 2019-02-01 00:46:47.276 2019-02-01 00:53:27.596 220.0 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 337.0 Webster St at 19th St 37.806970 -122.266588 3487 Subscriber 1945 Male Yes
174929 490 2019-02-01 00:39:53.112 2019-02-01 00:48:03.338 61.0 Howard St at 8th St 37.776513 -122.411306 81.0 Berry St at 4th St 37.775880 -122.393170 5411 Subscriber 1927 Male No

5781 rows × 16 columns

Let's look at the bottom 1% of birth years.

In [20]:
df.member_birth_year.describe(percentiles = [.01])
Out[20]:
count    175147.000000
mean       1984.806437
std          10.116689
min        1878.000000
1%         1955.000000
50%        1987.000000
max        2001.000000
Name: member_birth_year, dtype: float64

We'll look for birth years prior to 1955.

In [21]:
df_clean.query("member_birth_year < 1955")
Out[21]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
476 235 2019-02-28 21:17:57.047 2019-02-28 21:21:52.631 34.0 Father Alfred E Boeddeker Park 37.783988 -122.412408 58.0 Market St at 10th St 37.776619 -122.417385 5202 Subscriber 1954 Male No
956 384 2019-02-28 19:56:45.837 2019-02-28 20:03:10.473 250.0 North Berkeley BART Station 37.873558 -122.283093 257.0 Fifth St at Delaware St 37.870407 -122.299676 1671 Subscriber 1954 Male No
1033 303 2019-02-28 19:49:38.120 2019-02-28 19:54:42.044 43.0 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 76.0 McCoppin St at Valencia St 37.771662 -122.422423 6333 Subscriber 1945 Male Yes
1238 148 2019-02-28 19:29:17.627 2019-02-28 19:31:45.967 158.0 Shattuck Ave at Telegraph Ave 37.833279 -122.263490 173.0 Shattuck Ave at 55th St 37.840364 -122.264488 5391 Subscriber 1900 Male Yes
1295 1362 2019-02-28 19:02:33.643 2019-02-28 19:25:16.561 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 97.0 14th St at Mission St 37.768265 -122.420110 48 Subscriber 1954 Male No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
174773 191 2019-02-01 06:32:38.467 2019-02-01 06:35:50.222 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 20.0 Mechanics Monument Plaza (Market St at Bush St) 37.791300 -122.399051 3108 Subscriber 1947 Male No
174808 966 2019-02-01 05:57:01.688 2019-02-01 06:13:08.313 126.0 Esprit Park 37.761634 -122.390648 16.0 Steuart St at Market St 37.794130 -122.394430 1338 Subscriber 1952 Male No
174853 100 2019-02-01 04:46:54.805 2019-02-01 04:48:34.843 80.0 Townsend St at 5th St 37.775235 -122.397437 67.0 San Francisco Caltrain Station 2 (Townsend St... 37.776639 -122.395526 3138 Subscriber 1950 Male No
174926 400 2019-02-01 00:46:47.276 2019-02-01 00:53:27.596 220.0 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 337.0 Webster St at 19th St 37.806970 -122.266588 3487 Subscriber 1945 Male Yes
174929 490 2019-02-01 00:39:53.112 2019-02-01 00:48:03.338 61.0 Howard St at 8th St 37.776513 -122.411306 81.0 Berry St at 4th St 37.775880 -122.393170 5411 Subscriber 1927 Male No

1680 rows × 16 columns

Rather than using the IQR-based cutoff of 1962, we'll remove rows with birth years before 1955 (the bottom 1%) and reset the index.

In [22]:
df_clean.drop(df_clean[df_clean.member_birth_year < 1955].index, inplace=True)
In [23]:
df_clean.reset_index(drop=True, inplace = True)

Let's check that no records remain after the removal.

In [24]:
df_clean.query("member_birth_year < 1955")
Out[24]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip

We'll look at an overview of the DataFrame again and check for nulls.

In [25]:
# Display DataFrame info
df_clean.info()

# Check for null counts
null_counts = df_clean.isnull().sum()
print(null_counts)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173272 entries, 0 to 173271
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             173272 non-null  int64         
 1   start_time               173272 non-null  datetime64[ns]
 2   end_time                 173272 non-null  datetime64[ns]
 3   start_station_id         173272 non-null  object        
 4   start_station_name       173272 non-null  object        
 5   start_station_latitude   173272 non-null  float64       
 6   start_station_longitude  173272 non-null  float64       
 7   end_station_id           173272 non-null  object        
 8   end_station_name         173272 non-null  object        
 9   end_station_latitude     173272 non-null  float64       
 10  end_station_longitude    173272 non-null  float64       
 11  bike_id                  173272 non-null  object        
 12  user_type                173272 non-null  category      
 13  member_birth_year        173272 non-null  int64         
 14  member_gender            173272 non-null  category      
 15  bike_share_for_all_trip  173272 non-null  category      
dtypes: category(3), datetime64[ns](2), float64(4), int64(2), object(5)
memory usage: 17.7+ MB
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64
In [26]:
df_clean.describe()
Out[26]:
duration_sec start_time end_time start_station_latitude start_station_longitude end_station_latitude end_station_longitude member_birth_year
count 173272.000000 173272 173272 173272.000000 173272.000000 173272.000000 173272.000000 173272.000000
mean 703.878549 2019-02-15 21:21:32.505712128 2019-02-15 21:33:16.883208960 37.771060 -122.351657 37.771257 -122.351228 1985.171747
min 61.000000 2019-02-01 00:00:20.636000 2019-02-01 00:04:52.058000 37.317298 -122.453704 37.317298 -122.453704 1955.000000
25% 323.000000 2019-02-08 08:30:43.535000064 2019-02-08 08:41:02.230500096 37.770407 -122.411901 37.770407 -122.411647 1980.000000
50% 510.000000 2019-02-15 21:32:02.059000064 2019-02-15 21:45:20.187500032 37.780760 -122.398279 37.781010 -122.397437 1987.000000
75% 788.000000 2019-02-22 11:19:06.316000 2019-02-22 11:33:13.886749952 37.797320 -122.283093 37.797673 -122.286533 1992.000000
max 84548.000000 2019-02-28 23:59:18.548000 2019-03-01 08:01:55.975000 37.880222 -121.874119 37.880222 -121.874119 2001.000000
std 1647.305625 NaN NaN 0.100682 0.118001 0.100587 0.117560 9.378430

What is the structure of your dataset?¶

After cleaning, the dataset has 173,272 entries with the following 16 features:

  • duration_sec
  • start_time
  • end_time
  • start_station_id
  • start_station_name
  • start_station_latitude
  • start_station_longitude
  • end_station_id
  • end_station_name
  • end_station_latitude
  • end_station_longitude
  • bike_id
  • user_type
  • member_birth_year
  • member_gender
  • bike_share_for_all_trip

What is/are the main feature(s) of interest in your dataset?¶

The main features of interest in the dataset are those that effectively predict bike trip duration and influence trip frequency. Understanding these key factors will help us identify the most significant predictors for both aspects of bike usage.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

In my investigation into the features of interest regarding bike rides, I anticipate that the following dataset features will provide valuable insights:

  • Time of Day: I expect this feature to significantly influence both the frequency and duration of bike rides, with peak usage occurring during commuting hours.

  • Day of the Week: Similar to time of day, this feature is likely to affect ride patterns, with certain days showing higher activity levels.

  • User Type: Differentiating between subscribers and customers will help in understanding how user engagement affects riding behavior.

  • Member Birth Year: Analyzing the birth year of riders may provide insights into generational differences in bike usage.

  • Gender: Exploring the relationship between gender and ride frequency/duration will help identify any disparities in bike usage patterns.

These features collectively will support a comprehensive analysis of the factors influencing bike ride behavior.

Univariate Exploration¶

In this section, we will analyze the distributions of individual variables. First, we’ll focus on the distribution of bike trip durations.

Here, we'll create a histogram of trip durations in seconds

In [27]:
bin_edges = np.arange(0, df_clean['duration_sec'].max() + 100, 100)
plt.hist(data = df_clean, x = 'duration_sec', bins = bin_edges)
plt.xlim(0, 3000)
plt.xlabel('Trip Duration (seconds)')
plt.ylabel('Frequency');
[Figure: histogram of trip durations in seconds, 0-3,000]

From this histogram, we see the frequency distribution of trip durations: most trips last between 0 and 1,000 seconds, and the frequency decreases sharply for longer durations.

In [28]:
# check that the bin limits are appropriate for the plot
df_clean['duration_sec'].describe()
Out[28]:
count    173272.000000
mean        703.878549
std        1647.305625
min          61.000000
25%         323.000000
50%         510.000000
75%         788.000000
max       84548.000000
Name: duration_sec, dtype: float64

The data follows a highly skewed, long-tailed distribution, with most trip durations under 1,000 seconds. To gain clearer insights, a logarithmic transformation is appropriate.

In [29]:
bin_edges = 10 ** np.arange(1, 5 + 0.1, 0.1)
ticks = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000]
plt.hist(data = df_clean, x = 'duration_sec', bins=bin_edges)
plt.xscale('log')
plt.xticks(ticks, ticks)
plt.xlabel('Trip Duration (seconds)')
plt.ylabel('Frequency');
[Figure: histogram of trip durations in seconds, log scale]

On a logarithmic scale, the distribution of bike trip durations looks approximately normal, with a prominent peak near 650 seconds. To make the findings clearer, we'll add a column expressing trip durations in minutes, a unit generally easier for readers to interpret than seconds.

In [30]:
# create a duration column in minutes
df_clean['duration_min'] = df_clean['duration_sec'] / 60

Histogram of trip duration in minutes

In [31]:
bin_edges = np.arange(0, df_clean['duration_min'].max() + 5, 5)
plt.hist(data = df_clean, x = 'duration_min', bins = bin_edges)
plt.xlim(0, 100)
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency');
[Figure: histogram of trip durations in minutes]

Since the data shows the same strong skew in minutes as it did in seconds, a logarithmic scale is again the most suitable option.

In [32]:
bin_edges = 10 ** np.arange(0, 3 + 0.1, 0.1)
ticks = [1, 3, 10, 30, 100, 300, 1000]
plt.hist(data = df_clean, x = 'duration_min', bins=bin_edges)
plt.xscale('log')
plt.xticks(ticks, ticks)
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency');
[Figure: histogram of trip durations in minutes, log scale]

When we look at the bike trip durations on a log scale in minutes, we see a peak around 9 to 10 minutes.


We'll take a look at the time-of-day feature by extracting the start hour from the start_time column.

In [33]:
# extract the start hour of the trip from the start time column
df_clean['start_hour'] = df_clean['start_time'].dt.strftime('%H')
In [34]:
df_clean['start_hour'] = df_clean['start_hour'].astype(int)
In [35]:
# test
df_clean.head()
Out[35]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip duration_min start_hour
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984 Male No 869.750000 17
1 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972 Male No 1030.900000 12
2 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989 Other No 608.166667 17
3 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974 Male Yes 26.416667 23
4 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959 Male No 29.883333 23

We will create a countplot showing the frequency of bike rides by hour of the day.

In [36]:
# plot the frequency of bike rides by hour of the day
plt.figure(figsize=[10,6])
base_color = sb.color_palette()[0]
sb.countplot(data = df_clean, x = 'start_hour', color = base_color)
plt.xlabel('Start Hour')
plt.title('Frequency of Bike Rides by Hour of the Day')
Out[36]:
Text(0.5, 1.0, 'Frequency of Bike Rides by Hour of the Day')
[Figure: count of bike rides by start hour]

The visualization demonstrates a significant increase in bike trips during the morning rush from 7 to 9 AM and in the evening from 4 to 7 PM.
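One way to quantify this pattern is the share of trips starting within those commute windows. A sketch on toy hours (the 7-9 AM / 4-7 PM window definition is our own, and `hours` stands in for `df_clean['start_hour']`):

```python
import pandas as pd

# Toy start hours standing in for df_clean['start_hour']
hours = pd.Series([8, 8, 9, 12, 17, 17, 18, 23])

# Commute windows: 7-9 AM and 4-7 PM (our own definition for illustration)
commute = hours.isin([7, 8, 9, 16, 17, 18, 19])
print(commute.mean())  # fraction of trips starting in commute windows
```

On the real data, a high commute-window share would support the commuter-usage interpretation.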

To gain further insights, we'll extract the day of the week to see if these patterns align with weekday commuting behavior.

In [37]:
# extracting the day of the week from the start_time column
df_clean['day'] = df_clean['start_time'].dt.strftime('%A')

Let's check the dataframe for the changes

In [38]:
df_clean
Out[38]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip duration_min start_hour day
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984 Male No 869.750000 17 Thursday
1 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972 Male No 1030.900000 12 Thursday
2 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989 Other No 608.166667 17 Thursday
3 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974 Male Yes 26.416667 23 Thursday
4 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959 Male No 29.883333 23 Thursday
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173267 480 2019-02-01 00:04:49.724 2019-02-01 00:12:50.034 27.0 Beale St at Harrison St 37.788059 -122.391865 324.0 Union Square (Powell St at Post St) 37.788300 -122.408531 4832 Subscriber 1996 Male No 8.000000 0 Friday
173268 313 2019-02-01 00:05:34.744 2019-02-01 00:10:48.502 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 66.0 3rd St at Townsend St 37.778742 -122.392741 4960 Subscriber 1984 Male No 5.216667 0 Friday
173269 141 2019-02-01 00:06:05.549 2019-02-01 00:08:27.220 278.0 The Alameda at Bush St 37.331932 -121.904888 277.0 Morrison Ave at Julian St 37.333658 -121.908586 3824 Subscriber 1990 Male Yes 2.350000 0 Friday
173270 139 2019-02-01 00:05:34.360 2019-02-01 00:07:54.287 220.0 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 216.0 San Pablo Ave at 27th St 37.817827 -122.275698 5095 Subscriber 1988 Male No 2.316667 0 Friday
173271 271 2019-02-01 00:00:20.636 2019-02-01 00:04:52.058 24.0 Spear St at Folsom St 37.789677 -122.390428 37.0 2nd St at Folsom St 37.785000 -122.395936 1057 Subscriber 1989 Male No 4.516667 0 Friday

173272 rows × 19 columns

Next, we'll categorize the days of the week in the dataframe to ensure they are ordered from Monday to Sunday for better data analysis.

In [39]:
# order the days of the week
df_clean['day'] = pd.Categorical(df_clean['day'],
                                 categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                             'Friday', 'Saturday', 'Sunday'],
                                 ordered=True)
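Marking the column as an ordered categorical also makes sorting and comparisons follow weekday order rather than alphabetical order; a minimal sketch on toy values:

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# A small stand-in series for df_clean['day']
s = pd.Series(['Sunday', 'Monday', 'Friday'])
s = pd.Series(pd.Categorical(s, categories=day_order, ordered=True))

# Sorting now follows weekday order, not alphabetical order
assert s.sort_values().tolist() == ['Monday', 'Friday', 'Sunday']

# Ordered categoricals also support comparisons against a category label
assert (s < 'Saturday').tolist() == [False, True, True]
```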

This bar plot visualizes the frequency of bike rides for each day of the week

In [40]:
# plotting the frequency of bike rides by the day of the week
plt.figure(figsize=[10,6])
base_color = sb.color_palette()[0]
sb.countplot(data = df_clean, x = 'day', color = base_color)
plt.xlabel('Start Day')
plt.title('Frequency of Bike Rides by Day of the Week')
plt.xticks(rotation=15);
No description has been provided for this image

The graph clearly indicates that weekdays are the prime time for bike trips.

The next focus will be on categorizing users to analyze whether subscribers tend to make more trips than one-time customers.

In [41]:
# frequency of bike rides by user type
base_color = sb.color_palette()[0]
user_order = df_clean['user_type'].value_counts().index
sb.countplot(data = df_clean, x = 'user_type', color = base_color, order = user_order)
plt.xlabel('User')
plt.title('Frequency of Bike Rides by User Type');
No description has been provided for this image

It turns out that program subscribers use bikes more frequently than occasional users. We’ll delve into whether one-time customers embark on longer bike rides than subscribers. This analysis will be featured in the next section through a bivariate exploration.

Next, we'll look at member age. We'll create an age column from the 'member_birth_year' column

In [42]:
#creating age column from member birth year (2019 - member birth year since this data is from 2019)
df_clean['age'] = (2019 - df_clean['member_birth_year'])
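Since the reference year is hardcoded above, one defensive alternative is to derive it from the trip timestamps instead; a small sketch with hypothetical rows:

```python
import pandas as pd

# Toy frame; in the notebook the year would come from df_clean['start_time']
df = pd.DataFrame({
    'start_time': pd.to_datetime(['2019-02-28', '2019-02-01']),
    'member_birth_year': [1984, 1996],
})

# Derive the reference year from the data rather than hardcoding 2019
ref_year = int(df['start_time'].dt.year.max())
df['age'] = ref_year - df['member_birth_year']
assert ref_year == 2019
assert df['age'].tolist() == [35, 23]
```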
In [43]:
# checking this worked
df_clean
Out[43]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip duration_min start_hour day age
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984 Male No 869.750000 17 Thursday 35
1 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972 Male No 1030.900000 12 Thursday 47
2 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989 Other No 608.166667 17 Thursday 30
3 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974 Male Yes 26.416667 23 Thursday 45
4 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959 Male No 29.883333 23 Thursday 60
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173267 480 2019-02-01 00:04:49.724 2019-02-01 00:12:50.034 27.0 Beale St at Harrison St 37.788059 -122.391865 324.0 Union Square (Powell St at Post St) 37.788300 -122.408531 4832 Subscriber 1996 Male No 8.000000 0 Friday 23
173268 313 2019-02-01 00:05:34.744 2019-02-01 00:10:48.502 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 66.0 3rd St at Townsend St 37.778742 -122.392741 4960 Subscriber 1984 Male No 5.216667 0 Friday 35
173269 141 2019-02-01 00:06:05.549 2019-02-01 00:08:27.220 278.0 The Alameda at Bush St 37.331932 -121.904888 277.0 Morrison Ave at Julian St 37.333658 -121.908586 3824 Subscriber 1990 Male Yes 2.350000 0 Friday 29
173270 139 2019-02-01 00:05:34.360 2019-02-01 00:07:54.287 220.0 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 216.0 San Pablo Ave at 27th St 37.817827 -122.275698 5095 Subscriber 1988 Male No 2.316667 0 Friday 31
173271 271 2019-02-01 00:00:20.636 2019-02-01 00:04:52.058 24.0 Spear St at Folsom St 37.789677 -122.390428 37.0 2nd St at Folsom St 37.785000 -122.395936 1057 Subscriber 1989 Male No 4.516667 0 Friday 30

173272 rows × 20 columns

In [44]:
# plotting age distribution 
bin_edges = np.arange(10, df_clean['age'].max()+2, 2)
plt.hist(data = df_clean, x = 'age', bins = bin_edges)
plt.xlim(10, 70)
plt.xlabel('Member Age (years)')
plt.ylabel('Frequency');
No description has been provided for this image
In [45]:
sb.violinplot(data = df_clean, y = 'age')
plt.ylabel('Member Age (years)');
No description has been provided for this image

The graphs above indicate a noticeable peak around the age of 30.

Next, we’ll analyze gender to determine if one demographic uses Ford GoBikes more than the others.

In [46]:
# plotting gender 
base_color = sb.color_palette()[0]
gender_order = df_clean['member_gender'].value_counts().index
sb.countplot(data = df_clean, x = 'member_gender', color = base_color, order = gender_order)
plt.xlabel('Gender')
plt.title('Frequency of Bike Rides by Gender');
No description has been provided for this image

The data shows that a significant number of bike riders are male. We will later examine how gender impacts the length of bike trips.

We will now focus on examining the number of bike trips that were part of the Bike Share for All program.

In [47]:
base_color = sb.color_palette()[0]
share_order = df_clean['bike_share_for_all_trip'].value_counts().index
sb.countplot(data = df_clean, x = 'bike_share_for_all_trip', color = base_color, order = share_order)
plt.xlabel('Bike Share for All')
plt.title('Frequency of Bike Share for All Trips');
No description has been provided for this image

The data above indicates that most trips were not part of the Bike Share for All program.

In the final part of the univariate exploration, we will calculate the distance using the provided start and end longitude and latitude coordinates.

In [48]:
def haversine_np(start_station_longitude, start_station_latitude, end_station_longitude, end_station_latitude):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees).
    All args must be of equal length.    
    """
    # Convert decimal degrees to radians
    start_station_longitude, start_station_latitude, end_station_longitude, end_station_latitude = map(np.radians, 
        [start_station_longitude, start_station_latitude, end_station_longitude, end_station_latitude])
    
    # Differences in coordinates
    dlon = end_station_longitude - start_station_longitude
    dlat = end_station_latitude - start_station_latitude
    
    # Haversine formula
    a = np.sin(dlat/2.0)**2 + np.cos(start_station_latitude) * np.cos(end_station_latitude) * np.sin(dlon/2.0)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))  # Using arctan2 for stability
    
    # Radius of the Earth in kilometers
    radius_earth_km = 6371.0
    km = radius_earth_km * c
    
    return km
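As a quick sanity check of the haversine formula (restated here with short argument names so the snippet is self-contained), identical points should yield zero distance, and a quarter of the equator should be exactly `2 * pi * 6371 / 4` km:

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    # Same haversine computation as above, on decimal-degree inputs
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return 6371.0 * c

# Identical points give zero distance
assert haversine_np(0.0, 0.0, 0.0, 0.0) == 0.0

# A 90-degree arc along the equator is a quarter circumference (~10,007.5 km)
assert abs(haversine_np(0.0, 0.0, 90.0, 0.0) - np.pi * 6371.0 / 2) < 1e-6
```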

We'll create a distance column

In [49]:
# creating a distance column 
df_clean['distance'] = haversine_np(df_clean['start_station_longitude'],
                                    df_clean['start_station_latitude'],
                                    df_clean['end_station_longitude'],
                                    df_clean['end_station_latitude'])
In [50]:
# checking that this worked 
df_clean
Out[50]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... bike_id user_type member_birth_year member_gender bike_share_for_all_trip duration_min start_hour day age distance
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 ... 4902 Customer 1984 Male No 869.750000 17 Thursday 35 0.544709
1 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 ... 5905 Customer 1972 Male No 1030.900000 12 Thursday 47 2.704545
2 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 ... 6638 Subscriber 1989 Other No 608.166667 17 Thursday 30 0.260739
3 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 ... 4898 Subscriber 1974 Male Yes 26.416667 23 Thursday 45 2.409301
4 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 ... 5200 Subscriber 1959 Male No 29.883333 23 Thursday 60 3.332203
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173267 480 2019-02-01 00:04:49.724 2019-02-01 00:12:50.034 27.0 Beale St at Harrison St 37.788059 -122.391865 324.0 Union Square (Powell St at Post St) 37.788300 ... 4832 Subscriber 1996 Male No 8.000000 0 Friday 23 1.464766
173268 313 2019-02-01 00:05:34.744 2019-02-01 00:10:48.502 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 66.0 3rd St at Townsend St 37.778742 ... 4960 Subscriber 1984 Male No 5.216667 0 Friday 35 1.402716
173269 141 2019-02-01 00:06:05.549 2019-02-01 00:08:27.220 278.0 The Alameda at Bush St 37.331932 -121.904888 277.0 Morrison Ave at Julian St 37.333658 ... 3824 Subscriber 1990 Male Yes 2.350000 0 Friday 29 0.379066
173270 139 2019-02-01 00:05:34.360 2019-02-01 00:07:54.287 220.0 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 216.0 San Pablo Ave at 27th St 37.817827 ... 5095 Subscriber 1988 Male No 2.316667 0 Friday 31 0.747282
173271 271 2019-02-01 00:00:20.636 2019-02-01 00:04:52.058 24.0 Spear St at Folsom St 37.789677 -122.390428 37.0 2nd St at Folsom St 37.785000 ... 1057 Subscriber 1989 Male No 4.516667 0 Friday 30 0.710395

173272 rows × 21 columns

Now let's check for outliers

In [51]:
df_clean.distance.describe()
Out[51]:
count    173272.000000
mean          1.691558
std           1.096204
min           0.000000
25%           0.910955
50%           1.430675
75%           2.225687
max          69.469241
Name: distance, dtype: float64
In [52]:
df_clean.query("distance > 20")
Out[52]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... bike_id user_type member_birth_year member_gender bike_share_for_all_trip duration_min start_hour day age distance
105822 6945 2019-02-12 14:28:44.402 2019-02-12 16:24:30.158 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 300.0 Palm St at Willow St 37.317298 ... 4780 Subscriber 1985 Female No 115.75 14 Tuesday 34 69.469241

1 rows × 21 columns

The query found a single trip covering nearly 70 km, so we'll remove this outlier

In [53]:
df_clean = df_clean.drop(df_clean[df_clean.distance > 20].index)
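Equivalently, the outlier could be kept out with a boolean mask rather than `drop`; a small sketch on toy data:

```python
import pandas as pd

# Toy frame standing in for df_clean
df = pd.DataFrame({'distance': [0.5, 1.2, 69.4]})

# Boolean indexing is equivalent to dropping the matching index labels
filtered = df[df['distance'] <= 20].reset_index(drop=True)
assert filtered['distance'].tolist() == [0.5, 1.2]
```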
In [54]:
df_clean.distance.describe()
Out[54]:
count    173271.000000
mean          1.691167
std           1.084047
min           0.000000
25%           0.910955
50%           1.430675
75%           2.225687
max          15.673955
Name: distance, dtype: float64

Let's plot the distance distribution

In [55]:
# plotting distance distribution
bin_edges = np.arange(0, df_clean['distance'].max()+0.5, 0.5)
plt.hist(data = df_clean, x = 'distance', bins = bin_edges)
plt.xlabel('Distance (km)')
plt.xlim(0,10)
plt.ylabel('Frequency');
No description has been provided for this image

Here, we see the frequency distribution of distances traveled, with most trips covering between 0 and 2 kilometers, and the frequency significantly decreasing for longer distances.

In [ ]:
 

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

The duration of bike trips was highly right-skewed, so a log scale was applied, under which the distribution appears approximately log-normal. To aid interpretation, a trip duration in minutes was added to the dataset. The resulting graph clearly indicates that most bike rides are under 30 minutes, with a typical duration of around 9-10 minutes.

Additionally, a distance column was created using longitude and latitude data, revealing that the majority of bike trips were under 2 km.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

The time of day and day of the week were examined, indicating that bikes were primarily used for commuting, with peak usage occurring from 07:00-09:00 and 16:00-19:00 on weekdays.

An age column was included to assess the ages of members. The analysis showed a peak of riders around 30 years old. To tidy the data, individuals over the age of 65 were removed, representing the top 1% of the dataset.

Moreover, it was observed that there were more subscribers than customers taking bike trips, and a higher number of male riders compared to female or other categories. Most trips were not part of the bike share for all trip scheme.

Bivariate Exploration¶

We will now embark on a bivariate analysis to examine potential relationships within the data.

In [56]:
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 173271 entries, 0 to 173271
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             173271 non-null  int64         
 1   start_time               173271 non-null  datetime64[ns]
 2   end_time                 173271 non-null  datetime64[ns]
 3   start_station_id         173271 non-null  object        
 4   start_station_name       173271 non-null  object        
 5   start_station_latitude   173271 non-null  float64       
 6   start_station_longitude  173271 non-null  float64       
 7   end_station_id           173271 non-null  object        
 8   end_station_name         173271 non-null  object        
 9   end_station_latitude     173271 non-null  float64       
 10  end_station_longitude    173271 non-null  float64       
 11  bike_id                  173271 non-null  object        
 12  user_type                173271 non-null  category      
 13  member_birth_year        173271 non-null  int64         
 14  member_gender            173271 non-null  category      
 15  bike_share_for_all_trip  173271 non-null  category      
 16  duration_min             173271 non-null  float64       
 17  start_hour               173271 non-null  int64         
 18  day                      173271 non-null  category      
 19  age                      173271 non-null  int64         
 20  distance                 173271 non-null  float64       
dtypes: category(4), datetime64[ns](2), float64(6), int64(4), object(5)
memory usage: 24.5+ MB

We'll designate numeric and categorical variables to simplify plotting.

In [57]:
# assigning numeric and categoric variables
numeric_vars = ['duration_min', 'age', 'distance']
categoric_vars = ['member_gender', 'user_type', 'bike_share_for_all_trip', 'day', 'start_hour']

To start, we will analyze the numeric variables to uncover any existing relationships or correlations.

In [58]:
g = sb.PairGrid(data = df_clean, vars = numeric_vars)
g.map_diag(plt.hist, bins = 20)
g.map_offdiag(plt.scatter, alpha = 0.3)
Out[58]:
<seaborn.axisgrid.PairGrid at 0x15bbf9dd0>
No description has been provided for this image

The plots indicate that there are no strong correlations among the numeric variables. We will investigate the relationships between the numeric and categorical variables, as well as those between different categorical variables.

Duration Min¶

We will look at how the duration of bike trips (measured in minutes) relates to the other features

In [59]:
def boxgrid(x, y, **kwargs):
    default_color = sb.color_palette()[0]
    # Call boxplot with explicit parameters
    sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})

plt.figure(figsize=[16, 16])
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['member_gender', 'user_type', 'bike_share_for_all_trip'],
                height=3, aspect=1.5)
plt.ylim(0, 50)
g.map(boxgrid)
plt.show()
<Figure size 1600x1600 with 0 Axes>
No description has been provided for this image

The plot clearly shows that, on average, females and individuals identifying as "other" take longer bike trips than males. Interestingly, customers tend to take longer trips than subscribers, suggesting that when customers opt for a one-time ride, they ride for longer. Additionally, the bike share program does not significantly influence trip duration, so it will not be considered in the remainder of the analysis.

Next, we will explore the features associated with the time of day and day of the week

In [60]:
def boxgrid(x, y, **kwargs):
    default_color = sb.color_palette()[0]
    # Call boxplot with explicit parameters
    sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})

plt.figure(figsize=[16, 16])
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['day', 'start_hour'],
                height=3, aspect=1.5)
plt.ylim(0, 50)

g.map(boxgrid)

# Rotate x-axis labels properly
for ax in g.axes.flatten():
    # Get current tick positions
    ticks = ax.get_xticks()
    ax.set_xticks(ticks)  # Ensure ticks are set
    ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha='right')

plt.show();
<Figure size 1600x1600 with 0 Axes>
No description has been provided for this image

The day of the week influenced trip duration, with longer bike rides occurring on weekends. Additionally, the longest bike trips on average started in the afternoon, specifically between 14:00 and 15:00.

Distance¶

We will take a brief look at how distance traveled relates to the other features.

In [61]:
def boxgrid(x, y, **kwargs):
    default_color = sb.color_palette()[0]
    # Call boxplot with explicit parameters
    sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})

plt.figure(figsize=[16, 16])
g = sb.PairGrid(data=df_clean, y_vars='distance', x_vars=['member_gender', 'user_type' ,'bike_share_for_all_trip'],
                height=3, aspect=1.5)
plt.ylim(0, 5)
g.map(boxgrid)
plt.show()
<Figure size 1600x1600 with 0 Axes>
No description has been provided for this image

The analysis above reveals that some variables had a small effect on distance traveled. Notably, females and those who chose 'other' as their gender traveled farther, as did customers compared to subscribers. Furthermore, individuals not engaged in the bike share program also tended to cover longer distances.

We'll take a brief look at distance along with the other variables.

In [62]:
def boxgrid(x, y, **kwargs):
    default_color = sb.color_palette()[0]
    sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})

plt.figure(figsize=[16, 16])
g = sb.PairGrid(data=df_clean, y_vars='distance', x_vars=['day', 'start_hour'], height=3, aspect=1.5)

plt.ylim(0, 5)
g.map(boxgrid)

for ax in g.axes.flat:
    # Get the current x-ticks
    ticks = ax.get_xticks()
    # Set the new labels for the current ticks
    ax.set_xticks(ticks)  # Ensure ticks are set before labels
    ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha='right')

plt.show();
<Figure size 1600x1600 with 0 Axes>
No description has been provided for this image

The analysis above shows that the day of the week has little effect on the distance traveled. However, it is evident that longer distances were covered in the mornings, particularly between 7 and 8 AM. Since there are no compelling relationships regarding distance to explore further, we will now turn our attention to age and its relation to other variables.

Age¶

In [63]:
def boxgrid(x, y, **kwargs):
    default_color = sb.color_palette()[0]
    # Call boxplot with explicit parameters
    sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})

plt.figure(figsize=[16, 16])
g = sb.PairGrid(data=df_clean, y_vars='age', x_vars=['member_gender', 'user_type', 'bike_share_for_all_trip'],
                height=3, aspect=1.5)

g.map(boxgrid)
plt.show()
<Figure size 1600x1600 with 0 Axes>
No description has been provided for this image

The boxplot above indicates that the majority of members are in their early 30s, regardless of gender or user type. Participants in the bike share for all scheme tend to be younger on average than those who are not involved.

In [64]:
def boxgrid(x, y, **kwargs):
    default_color = sb.color_palette()[0]
    # Call boxplot with explicit parameters
    sb.boxplot(x=x, y=y, color=default_color, **{k: v for k, v in kwargs.items() if k != 'color'})
    

    plt.xticks(rotation=25) 

plt.figure(figsize=[16, 16])
g = sb.PairGrid(data=df_clean, y_vars='age', x_vars=['day', 'start_hour'],
                height=3, aspect=1.5)

g.map(boxgrid)
plt.show()
<Figure size 1600x1600 with 0 Axes>
No description has been provided for this image

The plot above shows that younger individuals are more likely to take bike trips on weekends. On average, these trips occur late at night and in the early morning hours. We will exclude 'age,' 'distance,' and 'bike share for all' from further analysis, as no significant relationships were identified with these variables.

User Type and Gender¶

Let's examine the relationship between user type and gender.

In [65]:
#clustered bar chart
ax = sb.countplot(data = df_clean, x = 'user_type', hue = 'member_gender', hue_order=['Male', 'Female', 'Other'])
ax.legend(loc = 2, framealpha = 1)
Out[65]:
<matplotlib.legend.Legend at 0x15ce5c150>
No description has been provided for this image

This clustered bar chart clearly shows that subscribers took more bike rides than customers. The majority of both groups were male.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

During this part of the investigation, it was found that, on average, individuals who identified as female or as "other" took longer bike rides than males. Additionally, customers generally took longer rides compared to subscribers, and those not participating in the bike share for all scheme also had longer bike trips. The day of the week impacted trip duration, with longer rides occurring on weekends, and the hours between 2 and 3 PM saw the longest bike trips.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

In terms of the day of the week, younger individuals tended to take more bike trips on weekends. Additionally, trips made very late at night and during the early morning hours were predominantly by younger people.

From the demographic analysis, it is clear that subscribers took more bike rides than customers, and the majority of both groups were male.

When examining other features in the dataset, it appears that those who identified as female or other traveled slightly greater distances than males. Customers also traveled slightly further than subscribers on average, and those not involved in the bike share for all scheme tended to cover greater distances as well. Most members are in their early 30s, and those participating in the bike share for all scheme are generally younger than those who are not.

Moving forward, 'age', 'distance', and 'bike share for all' will not be included in any further analysis, as they do not relate to the features of interest and do not demonstrate strong relationships with other features in the dataset.

Multivariate Exploration¶

Next, we'll utilize multivariate plots to examine how the duration and frequency of bike trips relate to various categorical variables, including gender, user type, time of day, and day of the week.

In [66]:
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['member_gender', 'user_type', 'day'])
plt.ylim(0, 50)
rotation = 30


for ax in g.fig.axes:
    ax.tick_params(axis='x', rotation=rotation)


default_blue = sb.color_palette()[0]


g.map(sb.violinplot, inner='quartile', color=default_blue)


plt.show()
No description has been provided for this image
In [67]:
fig = plt.figure(figsize=(12, 8))  # Adjust width and height as needed
g = sb.PairGrid(data=df_clean, y_vars='duration_min', x_vars=['member_gender', 'user_type', 'day'], height=4)  # You can set height per subplot

plt.ylim(0, 50)
rotation = 30


for ax in g.fig.axes:
    ax.tick_params(axis='x', rotation=rotation)


default_blue = sb.color_palette()[0]


g.map(sb.boxplot, color=default_blue)


plt.show()
<Figure size 1200x800 with 0 Axes>
No description has been provided for this image

Let's create a clustered bar chart illustrating the relationship between duration, user type, and gender

In [68]:
# clustered bar chart
default_colors = ['#1f77b4', '#ff7f0e', '#2ca02c']  # Blue, orange, green
user_type_order = ['Subscriber', 'Customer']
plt.figure(figsize=[10, 6])
ax = sb.barplot(data=df_clean, x='user_type', y='duration_min', hue='member_gender', hue_order=['Male', 'Female', 'Other'], palette=default_colors, order=user_type_order)
ax.legend(loc=2, ncol=1, framealpha=1, title='member_gender')

plt.show()
No description has been provided for this image

The data indicates that females and others took longer bike trips than males, regardless of their status as customers or subscribers.

A clustered bar chart will be created to explore the relationship between duration, day of the week, and gender.

In [69]:
# clustered bar chart
plt.figure(figsize = [12, 8])
ax = sb.barplot(data = df_clean, x = 'day', y = 'duration_min', hue = 'member_gender', hue_order=['Male', 'Female', 'Other'], palette=default_colors)
ax.legend(loc = 2, ncol = 2, framealpha = 1, title = 'member_gender')
Out[69]:
<matplotlib.legend.Legend at 0x15cf82690>
No description has been provided for this image
In [ ]:
 

As shown above, females and individuals identifying as other took longer bike rides than males on each day of the week, and all genders had longer bike trips on weekends.

We will explore the relationship between duration, user type, and the day of the week.

In [70]:
# clustered bar chart
plt.figure(figsize = [12, 8])
ax = sb.barplot(data = df_clean, x = 'day', y = 'duration_min', hue = 'user_type', palette=default_colors[:2])
ax.legend(loc = 2, ncol = 2, framealpha = 1, title = 'user type')
Out[70]:
<matplotlib.legend.Legend at 0x159917610>
No description has been provided for this image

Facet Plot
Let's try a facet plot for more clarity

In [71]:
# Define the order for the days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Convert the 'day' column to a categorical type with the specified order
df_clean['day'] = pd.Categorical(df_clean['day'], categories=day_order, ordered=True)

# Create a FacetGrid
g = sb.FacetGrid(df_clean, col='user_type', col_wrap=2, height=4, aspect=1.5)

# Create the bar plot within each facet
g.map_dataframe(sb.barplot, x='day', y='duration_min', hue='user_type', palette=default_colors[:2], order=day_order)

# Add legends and titles
g.add_legend(title='User Type', bbox_to_anchor=(1.05, 1), loc='upper left')
g.set_axis_labels('Day', 'Duration (minutes)')
g.set_titles(col_template='{col_name}')

# Show the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

As seen above, customers had longer bike rides than subscribers on every day of the week, and both groups took longer trips on the weekends.

Frequency of bike trips based on the day of the week and time of day¶

Next, a heatmap will be created to visualize the frequency of bike trips based on the day of the week and time of day.

In [72]:
# Generate a second dataframe for visualization
df_clean2 = pd.pivot_table(df_clean[['day', 'start_hour', 'duration_min']], index=['day', 'start_hour'], aggfunc='count')

# Unstacking to achieve the appropriate format for the heatmap.
df_clean3 = df_clean2.unstack(level=0)

# Generate new labels for the hours.
am_hrs = [f"{hr}am" for hr in range(1, 12)]
pm_hrs = [f"{hr}pm" for hr in range(1, 12)]
complete_hrs = ["12am"] + am_hrs + ["12pm"] + pm_hrs

# Abbreviated names for the days of the week
day_abbr = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']

# Create the heatmap
sb.set_context("talk")
f, ax = plt.subplots(figsize=(11, 15))
ax = sb.heatmap(df_clean3, annot=True, fmt="d", linewidths=.5, ax=ax, xticklabels=day_abbr, yticklabels=complete_hrs, cmap="viridis")
ax.axes.set_title("Heatmap of Ride Counts by Day and Hour of Day", fontsize=24, y=1.01)
ax.set(xlabel='Day of Week', ylabel='Starting Hour of Ride');



plt.show()
No description has been provided for this image

The heatmap above reveals that on weekdays, the majority of bike trips occur between 6 AM and 9 AM, as well as 4 PM to 7 PM.
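The hour-label construction used for the heatmap axes can be checked on its own; a quick standalone sketch confirming it yields 24 labels in clock order:

```python
# Same label construction as in the heatmap cell above
am_hrs = [f"{hr}am" for hr in range(1, 12)]
pm_hrs = [f"{hr}pm" for hr in range(1, 12)]
complete_hrs = ["12am"] + am_hrs + ["12pm"] + pm_hrs

assert len(complete_hrs) == 24
assert complete_hrs[0] == "12am" and complete_hrs[12] == "12pm"
assert complete_hrs[13] == "1pm" and complete_hrs[23] == "11pm"
```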

Time of day, Day of the week, and Duration (minutes)¶

Now let's visualize the relationship between time of day, day of the week, and bike ride duration.

In [73]:
plt.figure(figsize=[12, 12])
cat_means = df_clean.groupby(['day', 'start_hour'], observed=False).mean(numeric_only=True)['duration_min']
cat_means = cat_means.reset_index(name='duration_min_avg')
cat_means = cat_means.pivot(index='start_hour', columns='day', values='duration_min_avg')
sb.heatmap(cat_means, annot=True, fmt='.3f',
           cbar_kws={'label': 'Average Duration of Bike Trip (minutes)'}, xticklabels=day_abbr, yticklabels=complete_hrs, cmap="viridis_r")
plt.xlabel("Day of the Week")
plt.ylabel("Starting Hour of the Bike Ride")
plt.title("Heatmap of Ride Duration by Day and Hour of Day", fontsize=24, y=1.01);
No description has been provided for this image

The heatmap above shows that bike rides tend to be a bit longer on weekends compared to weekdays. It also reveals that the longest trips, on average, occur during the early morning hours.
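The groupby-then-pivot step that feeds the duration heatmap can be illustrated on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy data mimicking the day / start_hour / duration_min columns
df = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Sunday'],
    'start_hour': [8, 8, 14],
    'duration_min': [10.0, 20.0, 30.0],
})

# Average duration per (day, hour), then pivot hours to rows and days to columns
means = (df.groupby(['day', 'start_hour'])['duration_min']
           .mean()
           .reset_index(name='duration_min_avg')
           .pivot(index='start_hour', columns='day', values='duration_min_avg'))

assert means.loc[8, 'Monday'] == 15.0
assert means.loc[14, 'Sunday'] == 30.0
```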

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

The investigation into bike trip length and frequency revealed notable relationships, particularly regarding the impact of the day of the week and time of day. The analysis showed that both factors influenced bike trip frequency, with most trips occurring on weekdays during commuting hours, as people utilized bikes for their daily commutes. It was evident that these features supported each other in highlighting patterns of bike usage.

Were there any interesting or surprising interactions between features?¶

While it was expected that more bike trips would occur during commuting hours, it was surprising to find that longer bike trips on average were more common during weekends compared to weekdays. Additionally, an unexpected finding was that some of the longest bike trips during weekdays took place between 1 AM and 3 AM. The data also revealed interesting trends related to gender. All genders tended to take longer bike rides on weekends, but those identifying as 'female' or 'other' consistently took longer trips throughout the week. Furthermore, customers generally took longer trips than subscribers on all days.

Conclusions¶

During the exploration, we saw that most bike trips lasted under 30 minutes, averaging around 9 minutes, with most trips under 2 km. Users primarily rode during weekdays and commuting hours, with unexpectedly long trips occurring late at night. Most users were subscribers, but customers tended to take longer trips. The service was most popular among those in their mid-twenties to mid-thirties, with usage declining with age. Males used the service more, but females and those identifying as 'other' took longer trips. Most trips were not part of the Bike Share for All program.

Bivariate and multivariate analysis showed weak correlations among age, duration, and distance. Longer trips occurred on weekends and in the afternoon. Females and 'other' users also traveled greater distances, as did customers compared to subscribers. Mornings, especially 7-8 AM, saw the longest distances traveled. Most users were in their early 30s, with those in the Bike Share for All program being younger on average. Younger riders were more active on weekends and took late-night trips.

In [74]:
# save as a csv 
df_clean.to_csv('2019-fordgobike-data-clean.csv', index = False)