ROW_NUMBER Window Function in Pandas

One of the most powerful data manipulation methods in modern SQL are window functions (Sorry MySQL). They are a bit of a departure from traditional SQL queries in that they allow you to perform calculations across a set of rows, sort of like your standard aggregation functions but without having to group output into a single row.

In trying to emulate the ROW_NUMBER() window function for an analysis I was completing in Python, but I was shocked to discover that Pandas does not have a clear (at least not clear to me) method for accomplishing this relatively common task. In R’s dplyr package there is a row_number() function as well as your standard complement of window functions.

This is the equivalent SQL I was trying to emulate in Pandas

SELECT
user_id,
start_time,
session_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_time) AS session_number
FROM
df

Here is some equivalent python to accomplish the same task:

g = df.sort(['user_id','start_time', 'start_time']).groupby('user_id')
df['session_number'] = g['start_time'].rank(method='first', na_option='top')

First we group the data frame by our partition column and then we apply the pandas rank function to the start_time GroupSeries. We end up with each session numbered by the order in which it was created.

Category: Product #: Regular price:$ (Sale ends ) Available from: Condition: Good ! Order now!

One thought on “ROW_NUMBER Window Function in Pandas

Leave a Reply

Your email address will not be published. Required fields are marked *