One of the most powerful data manipulation methods in modern SQL are window functions (Sorry MySQL). They are a bit of a departure from traditional SQL queries in that they allow you to perform calculations across a set of rows, sort of like your standard aggregation functions but without having to group output into a single row.
In trying to emulate the ROW_NUMBER() window function for an analysis I was completing in Python, but I was shocked to discover that Pandas does not have a clear (at least not clear to me) method for accomplishing this relatively common task. In R’s dplyr package there is a row_number() function as well as your standard complement of window functions.
This is the equivalent SQL I was trying to emulate in Pandas
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_time) AS session_number
Here is some equivalent python to accomplish the same task:
g = df.sort(['user_id','start_time', 'start_time']).groupby('user_id')
df['session_number'] = g['start_time'].rank(method='first', na_option='top')
First we group the data frame by our partition column and then we apply the pandas rank function to the start_time GroupSeries. We end up with each session numbered by the order in which it was created.