ROW_NUMBER Window Function in Pandas

One of the most powerful data manipulation methods in modern SQL are window functions (Sorry MySQL). They are a bit of a departure from traditional SQL queries in that they allow you to perform calculations across a set of rows, sort of like your standard aggregation functions but without having to group output into a single row.

In trying to emulate the ROW_NUMBER() window function for an analysis I was completing in Python, but I was shocked to discover that Pandas does not have a clear (at least not clear to me) method for accomplishing this relatively common task. In R’s dplyr package there is a row_number() function as well as your standard complement of window functions.

This is the equivalent SQL I was trying to emulate in Pandas

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_time) AS session_number

Here is some equivalent python to accomplish the same task:

g = df.sort(['user_id','start_time', 'start_time']).groupby('user_id')
df['session_number'] = g['start_time'].rank(method='first', na_option='top')

First we group the data frame by our partition column and then we apply the pandas rank function to the start_time GroupSeries. We end up with each session numbered by the order in which it was created.

Read More

MinHash in Pure Pig and Hive

I recently undertook a project where I needed to compute the similarity between about 300 million objects. The vast majority of these objects would have no features in common, but to compute the similarity of each combination of 300 million objects would require 9×1016 comparisons, which is safe to say, a lot of comparisons and also mostly a waste as the overwhelming majority of those comparisons will have nothing in common.

Now there are many tools out there that can solve this issue much more intelligently than brute forcing the solution. Among them are Spark/MlLib, Mahout, RHadoop and H20’s Sparkling Water to name a few. Mostly because I wanted to see if I could, I set out to solve this using only core Pig and Hive which could be run on a vanilla EMR using only what the standard AMI provides.

Read More

Amazon S3 – Amazon’s ‘Sorta’ Simple Storage Solution

Ever since Amazon released S3 (Simple Storage Solution) in 2006, it has revolutionized the way companies store and analyze data. However, there are a few ways a developer can get tripped up using this otherwise simple service.

One major perk of using S3 is that data can be seamlessly transferred to and from HDFS, which when combined with Amazon’s Elastic Map Reduce (EMR), makes Big Data analytics extremely accessible. However, since S3 is not the typical POSIX filesystem that many developers have come to expect, it can lead to some unexpected behavior.

Read the rest on the AWeber Engineering Blog

Read More