shift() Method: Shift Values in Column Up or Down¶

Date published: 2019-04-21

Category: Data Analysis

Subcategory: Data Wrangling

Tags: shift method, pandas, python

The shift() method for a pandas series helps shift values in a column up or down. This is similar to using the SQL window functions for LAG() and LEAD(). You can learn about these SQL window functions via Mode's SQL tutorial.

In this tutorial, I'll walk through an example of using the shift() pandas series method for analyzing bike rides.

Setup Code¶

Import Module¶

In [143]:

                
                    Copied!
                    
import pandas as pd
import datetime
import random
import numpy as np
import pandas as pd
import datetime
import random
import numpy as np

Example 1: Duration Idle Time Between Bike Rides¶

In this example, I assume a service in which I lend out a single bike for people to ride throughout the day in San Francisco. Each record in the dataset is the start and end time for a ride.

Create Bike Rides Data¶

The code below creates a list of start and end times of rides.

In [144]:

                
                    Copied!
                    
                        
                        
                    
                    

            
random.seed(38)
ride_start_time = datetime.datetime.today()
ride_end_time = ride_start_time + datetime.timedelta(minutes=20)
ride_start_times = []
ride_end_times = []
count_of_rides = 6

for ride in range(count_of_rides):
    minutes_between_rides = random.randint(8, 100)
    ride_start_time = ride_end_time + datetime.timedelta(minutes=minutes_between_rides)
    
    minutes_riding = random.randint(5, 25)
    ride_end_time = ride_start_time + datetime.timedelta(minutes=minutes_riding)
    
    ride_start_times.append(ride_start_time)
    ride_end_times.append(ride_end_time)
random.seed(38)
ride_start_time = datetime.datetime.today()
ride_end_time = ride_start_time + datetime.timedelta(minutes=20)
ride_start_times = []
ride_end_times = []
count_of_rides = 6

for ride in range(count_of_rides):
    minutes_between_rides = random.randint(8, 100)
    ride_start_time = ride_end_time + datetime.timedelta(minutes=minutes_between_rides)
    
    minutes_riding = random.randint(5, 25)
    ride_end_time = ride_start_time + datetime.timedelta(minutes=minutes_riding)
    
    ride_start_times.append(ride_start_time)
    ride_end_times.append(ride_end_time)

Create a pandas dataframe given the lists ride_start_times and ride_end_times.

In [145]:

                
                    Copied!
                    
data = {'ride_start_time': ride_start_times,
        'ride_end_time':  ride_end_times
       }
df_bike_rides = pd.DataFrame(data)
data = {'ride_start_time': ride_start_times,
        'ride_end_time':  ride_end_times
       }
df_bike_rides = pd.DataFrame(data)

View the newly created dataframe df_bike_rides.

In [146]:

                
                    Copied!
                    
df_bike_rides
df_bike_rides

Out[146]:

	ride_start_time	ride_end_time
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347

Find the Mean Duration of Time, in Minutes, Bike is Left Idle Between Rides¶

For example, the first ride ended around 4:24PM and was next used for a ride at 5:52PM. I want to calculate a new column that states there was approximately 86 minutes, equivalent to approximately 5160 seconds, of idle time between these rides.

I want to do a column-by-column comparison. I use the shift() method to create a new column in df_bike_rides that's a shift of value in ride_end_time down one period.

In [147]:

                
                    Copied!
                    
df_bike_rides['previous_ride_end_time'] = df_bike_rides['ride_end_time'].shift(periods=1)
df_bike_rides['previous_ride_end_time'] = df_bike_rides['ride_end_time'].shift(periods=1)

View df_bike_rides.

In [148]:

                
                    Copied!
                    
df_bike_rides
df_bike_rides

Out[148]:

	ride_start_time	ride_end_time	previous_ride_end_time
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	NaT
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	2019-04-21 21:41:29.711347
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	2019-04-21 22:51:29.711347
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	2019-04-21 23:23:29.711347
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	2019-04-22 01:19:29.711347
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	2019-04-22 02:20:29.711347

The shift() method for a pandas series is similar to a window function in SQL using LAG() and LEAD(). The same operation above would look like the following in SQL:

SELECT *, LAG(ride_end_time) OVER (ORDER BY ride_start_time ASC) AS previous_ride_end_time
FROM df_bike_rides
ORDER BY ride_start_time ASC

Given this new column for previous_ride_end_time, I can subtract the time between a new bike ride's start time and the previous ride's end time. The result is the duration the bike was idle between rides.

In [149]:

                
                    Copied!
                    
df_bike_rides['duration_bike_idle_between_rides'] = df_bike_rides['ride_start_time'] - df_bike_rides['previous_ride_end_time']
df_bike_rides['duration_bike_idle_between_rides'] = df_bike_rides['ride_start_time'] - df_bike_rides['previous_ride_end_time']

View df_bike_rides.

In [150]:

                
                    Copied!
                    
df_bike_rides
df_bike_rides

Out[150]:

	ride_start_time	ride_end_time	previous_ride_end_time	duration_bike_idle_between_rides
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	NaT	NaT
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	2019-04-21 21:41:29.711347	01:02:00
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	2019-04-21 22:51:29.711347	00:16:00
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	2019-04-21 23:23:29.711347	01:37:00
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	2019-04-22 01:19:29.711347	00:55:00
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	2019-04-22 02:20:29.711347	01:25:00

The new column duration_bike_idle_between_rides shows the duration of idle bike time between rides in the format HH-MM-SS. The value of 01:02:00 is equivalent to saying 1 hour and 2 minutes. Below, I convert that timedelta format into a single numerical value of minutes. I utilize the dt accessor and total_seconds() method to calculate the total seconds a bike is idle between rides. Then I divide this value by 60 to get a value in minutes.

In [151]:

                
                    Copied!
                    
df_bike_rides['minutes_bike_idle_between_rides'] = df_bike_rides['duration_bike_idle_between_rides'].dt.total_seconds()/60
df_bike_rides['minutes_bike_idle_between_rides'] = df_bike_rides['duration_bike_idle_between_rides'].dt.total_seconds()/60

View df_bike_rides below with a new column for the minutes_bike_idle_between_rides.

In [152]:

                
                    Copied!
                    
df_bike_rides
df_bike_rides

Out[152]:

	ride_start_time	ride_end_time	previous_ride_end_time	duration_bike_idle_between_rides	minutes_bike_idle_between_rides
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	NaT	NaT	NaN
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	2019-04-21 21:41:29.711347	01:02:00	62.0
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	2019-04-21 22:51:29.711347	00:16:00	16.0
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	2019-04-21 23:23:29.711347	01:37:00	97.0
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	2019-04-22 01:19:29.711347	00:55:00	55.0
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	2019-04-22 02:20:29.711347	01:25:00	85.0

I calculate the mean minutes_bike_idle_between_rides value as 63 minutes.

In [153]:

                
                    Copied!
                    
avg_minutes_bikes_idle_between_rides = df_bike_rides['minutes_bike_idle_between_rides'].mean()
avg_minutes_bikes_idle_between_rides
avg_minutes_bikes_idle_between_rides = df_bike_rides['minutes_bike_idle_between_rides'].mean()
avg_minutes_bikes_idle_between_rides

Out[153]:

63.0

Example 2: Duration Idle Time Between Bike Rides Per Unique Bike¶

This example below is similar to the one above. However, I assume I now operate a fleet of 2 bikes and rent them out for people to ride to specific stations in the city of San Francisco.

Below I create a pandas dataframe with details on bike ride times, the bike id and the start and end station.

In [154]:

                
                    Copied!
                    
                        
                        
                    
                    

            
data = {'ride_start_time': ride_start_times,
        'ride_end_time':  ride_end_times,
        'bike_id': [1, 22, 1, 1, 22, 22],
        'start_station': ["21st & Folsom", "21st & Folsom", "4th & King", "24th & Valencia", "4th & King",  "16th and Mission"],
        'end_station':  ["4th & King", "4th & King", "24th & Valencia", "Embarcadero & Market",  "16th and Mission", "4th & King"]
       }
df_bike_sharing = pd.DataFrame(data)
data = {'ride_start_time': ride_start_times,
        'ride_end_time':  ride_end_times,
        'bike_id': [1, 22, 1, 1, 22, 22],
        'start_station': ["21st & Folsom", "21st & Folsom", "4th & King", "24th & Valencia", "4th & King",  "16th and Mission"],
        'end_station':  ["4th & King", "4th & King", "24th & Valencia", "Embarcadero & Market",  "16th and Mission", "4th & King"]
       }
df_bike_sharing = pd.DataFrame(data)

View df_bike_sharing.

In [155]:

                
                    Copied!
                    
df_bike_sharing
df_bike_sharing

Out[155]:

	ride_start_time	ride_end_time	bike_id	start_station	end_station
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	1	21st & Folsom	4th & King
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	22	21st & Folsom	4th & King
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	1	4th & King	24th & Valencia
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	1	24th & Valencia	Embarcadero & Market
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	22	4th & King	16th and Mission
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	22	16th and Mission	4th & King

I sort df_bike_sharing first by the bike_id column and then the ride_start_time column.

In [156]:

                
                    Copied!
                    
df_bike_sharing.sort_values(by=['bike_id', 'ride_start_time'], inplace=True)
df_bike_sharing.sort_values(by=['bike_id', 'ride_start_time'], inplace=True)

In [157]:

                
                    Copied!
                    
df_bike_sharing
df_bike_sharing

Out[157]:

	ride_start_time	ride_end_time	bike_id	start_station	end_station
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	1	21st & Folsom	4th & King
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	1	4th & King	24th & Valencia
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	1	24th & Valencia	Embarcadero & Market
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	22	21st & Folsom	4th & King
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	22	4th & King	16th and Mission
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	22	16th and Mission	4th & King

For the bike_id column, I shift down values by 1 to create a new column called previous_bike_id. I do this so I can easily compare a bike id to the previous ride's ID to identify the last ride by a bike id for a day.

Similar to LAG()

In [158]:

                
                    Copied!
                    
df_bike_sharing['previous_bike_id'] = df_bike_sharing['bike_id'].shift(periods=1)
df_bike_sharing['previous_bike_id'] = df_bike_sharing['bike_id'].shift(periods=1)

In [159]:

                
                    Copied!
                    
df_bike_sharing
df_bike_sharing

Out[159]:

	ride_start_time	ride_end_time	bike_id	start_station	end_station	previous_bike_id
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	1	21st & Folsom	4th & King	NaN
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	1	4th & King	24th & Valencia	1.0
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	1	24th & Valencia	Embarcadero & Market	1.0
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	22	21st & Folsom	4th & King	1.0
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	22	4th & King	16th and Mission	22.0
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	22	16th and Mission	4th & King	22.0

In SQL, this operation would be:

SELECT *, LAG(bike_id) OVER (ORDER BY bike_id ASC, ride_start_time ASC) AS previous_bike_id
FROM df_bike_sharing 
ORDER BY bike_id ASC, ride_start_time ASC

For the ride_end_time column, I shift down values by 1 to create a new column called previous_ride_end_time. I do this so I can later find the idle bike time between rides.

In [160]:

                
                    Copied!
                    
df_bike_sharing['previous_ride_end_time'] = df_bike_sharing['ride_end_time'].shift(periods=1)
df_bike_sharing['previous_ride_end_time'] = df_bike_sharing['ride_end_time'].shift(periods=1)

In [161]:

                
                    Copied!
                    
df_bike_sharing
df_bike_sharing

Out[161]:

	ride_start_time	ride_end_time	bike_id	start_station	end_station	previous_bike_id	previous_ride_end_time
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	1	21st & Folsom	4th & King	NaN	NaT
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	1	4th & King	24th & Valencia	1.0	2019-04-21 21:41:29.711347
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	1	24th & Valencia	Embarcadero & Market	1.0	2019-04-21 23:23:29.711347
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	22	21st & Folsom	4th & King	1.0	2019-04-22 01:19:29.711347
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	22	4th & King	16th and Mission	22.0	2019-04-21 22:51:29.711347
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	22	16th and Mission	4th & King	22.0	2019-04-22 02:20:29.711347

We only want to calculate a duration the bike was idle in a row if it's a comparison for the same bike id. Below I show the code to replace all values in a row with NaN or NaT if the condition is false.

In [162]:

                
                    Copied!
                    
df_bike_sharing.where(df_bike_sharing['bike_id'] == df_bike_sharing['previous_bike_id'])
df_bike_sharing.where(df_bike_sharing['bike_id'] == df_bike_sharing['previous_bike_id'])

Out[162]:

	ride_start_time	ride_end_time	bike_id	start_station	end_station	previous_bike_id	previous_ride_end_time
0	NaT	NaT	NaN	NaN	NaN	NaN	NaT
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	1.0	4th & King	24th & Valencia	1.0	2019-04-21 21:41:29.711347
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	1.0	24th & Valencia	Embarcadero & Market	1.0	2019-04-21 23:23:29.711347
1	NaT	NaT	NaN	NaN	NaN	NaN	NaT
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	22.0	4th & King	16th and Mission	22.0	2019-04-21 22:51:29.711347
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	22.0	16th and Mission	4th & King	22.0	2019-04-22 02:20:29.711347

With the above change, I calculate ride_start_time minus previous_ride_end_time where the above condition holds True. With that calculation, I create a new column called duration_bike_idle.

In [163]:

                
                    Copied!
                    
df_bike_sharing['duration_bike_idle'] = (df_bike_sharing['ride_start_time']-df_bike_sharing['previous_ride_end_time']).where(df_bike_sharing['bike_id'] == df_bike_sharing['previous_bike_id'])
df_bike_sharing['duration_bike_idle'] = (df_bike_sharing['ride_start_time']-df_bike_sharing['previous_ride_end_time']).where(df_bike_sharing['bike_id'] == df_bike_sharing['previous_bike_id'])

In [165]:

                
                    Copied!
                    
df_bike_sharing
df_bike_sharing

Out[165]:

	ride_start_time	ride_end_time	bike_id	start_station	end_station	duration_bike_idle
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	1	21st & Folsom	4th & King	NaT
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	1	4th & King	24th & Valencia	01:26:00
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	1	24th & Valencia	Embarcadero & Market	01:37:00
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	22	21st & Folsom	4th & King	NaT
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	22	4th & King	16th and Mission	03:23:00
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	22	16th and Mission	4th & King	01:25:00

I drop previous_bike_id and previous_ride_end_time since they were intermediary outputs used for calculations. They're not necessary for a final presentation of the critical details.

In [164]:

                
                    Copied!
                    
df_bike_sharing.drop(['previous_bike_id', 'previous_ride_end_time'], axis=1, inplace=True)
df_bike_sharing.drop(['previous_bike_id', 'previous_ride_end_time'], axis=1, inplace=True)

The new column duration_bike_idle_seconds shows the duration of idle bike time between rides in the format HH-MM-SS. The value of 01:26:00 is equivalent to saying 1 hour and 26 minutes. Below, I convert that timedelta format into a single numerical value of minutes. I utilize the dt accessor and total_seconds() method to calculate the total seconds a bike is idle between rides of the same bike id. Then I divide this value by 60 to get a value in minutes.

In [166]:

                
                    Copied!
                    
df_bike_sharing['duration_bike_idle_seconds'] = df_bike_sharing['duration_bike_idle'].dt.total_seconds()
df_bike_sharing['duration_bike_idle_seconds'] = df_bike_sharing['duration_bike_idle'].dt.total_seconds()

In [167]:

                
                    Copied!
                    
df_bike_sharing
df_bike_sharing

Out[167]:

	ride_start_time	ride_end_time	bike_id	start_station	end_station	duration_bike_idle	duration_bike_idle_seconds
0	2019-04-21 21:23:29.711347	2019-04-21 21:41:29.711347	1	21st & Folsom	4th & King	NaT	NaN
2	2019-04-21 23:07:29.711347	2019-04-21 23:23:29.711347	1	4th & King	24th & Valencia	01:26:00	5160.0
3	2019-04-22 01:00:29.711347	2019-04-22 01:19:29.711347	1	24th & Valencia	Embarcadero & Market	01:37:00	5820.0
1	2019-04-21 22:43:29.711347	2019-04-21 22:51:29.711347	22	21st & Folsom	4th & King	NaT	NaN
4	2019-04-22 02:14:29.711347	2019-04-22 02:20:29.711347	22	4th & King	16th and Mission	03:23:00	12180.0
5	2019-04-22 03:45:29.711347	2019-04-22 03:55:29.711347	22	16th and Mission	4th & King	01:25:00	5100.0

Here is the average seconds each bike_id is idle during the day in seconds between the first and last ride. I group by the bike_id column and calculate the mean of the duration_bike_idle_seconds_time column. I reset the index and rename the columns so this final output is easier to understand.

In [168]:

                
                    Copied!
                    
df_bike_sharing.groupby('bike_id')['duration_bike_idle_seconds'].mean().reset_index().rename(columns={'duration_bike_idle_seconds': 'avg_seconds_idle_between_rides'})
df_bike_sharing.groupby('bike_id')['duration_bike_idle_seconds'].mean().reset_index().rename(columns={'duration_bike_idle_seconds': 'avg_seconds_idle_between_rides'})

Out[168]:

	bike_id	avg_seconds_idle_between_rides
0	1	5490.0
1	22	8640.0

Bike id of 22 was left idle longer between rides than the bike id of 1.