MinHash in Pure Pig and Hive

I recently undertook a project where I needed to compute the similarity between about 300 million objects. The vast majority of these objects have no features in common, but comparing each of 300 million objects against every other would require on the order of 9×10¹⁶ comparisons, which is, it's safe to say, a lot of comparisons, and mostly wasted effort, since the overwhelming majority of those pairs share nothing.

Now there are many tools out there that can solve this problem much more intelligently than brute force. Among them are Spark/MLlib, Mahout, RHadoop, and H2O's Sparkling Water, to name a few. Mostly because I wanted to see if I could, I set out to solve it using only core Pig and Hive, so the whole pipeline could run on a vanilla EMR cluster using nothing beyond what the standard AMI provides.

Min-Hash

There is a well-known implementation of Locality-Sensitive Hashing called min-hash, which is explained really well here if you're interested. The gist of min-hash is that you apply a number of hash functions to each feature of an object and keep the minimum value produced by each hash function (hence min-hash). You then split the resulting min-hashes into groups called bands, where each band generally consists of 4 or 5 hashes, and only compare objects that have at least one matching band. This reduces the work per pair from comparing every field to comparing a fixed number of hash bands; for objects like a body of text, which may have thousands of words, that means 50 or so band comparisons instead of thousands of field comparisons.
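To get a feel for the trade-off: with b bands of r hashes each, the standard LSH analysis says two objects with Jaccard similarity s share at least one band with probability 1 − (1 − s^r)^b. For the numbers used below (200 hashes split into 50 bands of 4), a pair with similarity 0.5 becomes a candidate with probability 1 − (1 − 0.5⁴)⁵⁰ ≈ 0.96, while a pair with similarity 0.1 sneaks through only about 0.5% of the time.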

Step 1: Hash

First we need to hash each of the fields. Since Pig ships with a convenient hashing function in Piggybank (installed by default on EMR), I wrote a Pig script to accomplish this portion of the project.

-- Register Piggybank and alias its FNV string-hash UDF
REGISTER '/home/hadoop/pig/lib/piggybank.jar';
DEFINE hash org.apache.pig.piggybank.evaluation.string.HashFNV();
set default_parallel 16;

-- HashFNV hashes strings, so load the feature as a chararray
a = LOAD 's3://my-data/' USING PigStorage(',') AS (uid:chararray, feature:chararray);
-- Drop duplicate (uid, feature) pairs; PARALLEL belongs on reduce-side operators like DISTINCT, not FOREACH
a = DISTINCT a PARALLEL 32;
hashed = FOREACH a GENERATE uid, hash(feature) AS feature_hashed;
STORE hashed INTO 's3://my-data/hashed/' USING PigStorage(',');
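The output is one uid,hash row per distinct uid/feature pair; for a hypothetical input row like user123,green (not real data, just to show the shape), you would get back user123 followed by a long integer FNV hash. Those rows are what the Hive steps below load.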

Step 2: 200 Hashes

In order to easily turn our one hash into 200 hashes, we need to generate 200 random integers. We will later XOR the original hash values with each of these random integers to create 200 distinct hash functions.

-- Load the data from step 1
CREATE EXTERNAL TABLE hashes (uid STRING, hash BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data/hashed';

-- Load our dummy table (any table with at least 200 rows will do;
-- it only exists to give ROW_NUMBER() rows to number)
CREATE EXTERNAL TABLE dummy_table (ds STRING)
LOCATION 's3://dummy_table/';

-- Create a table to hold our XOR'ed hashes
CREATE TABLE random_xor (minhash_id INT, band_id INT, rand_xor BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/lsh/random_xor/';

INSERT INTO TABLE random_xor
SELECT
ROW_NUMBER() OVER () AS minhash_id,
floor((ROW_NUMBER() OVER () - 1) / 4) AS band_id,
floor(rand() * 2147483647) AS rand_xor
FROM
dummy_table
LIMIT 200;
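As a quick, optional sanity check, you can confirm the 200 rows really landed in 50 bands of 4 apiece:

-- Expect 50 rows, each with cnt = 4
SELECT band_id, COUNT(*) AS cnt
FROM random_xor
GROUP BY band_id;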

Step 3: XOR and Hash Bands

This is where the min-hash comes into play. We will XOR each hash with each of the 200 random integers. Then, for each uid and each of the 200 derived hash functions, we take the minimum value. Finally, we combine each group of four min-hashes into a band (matching the division by 4 above).

CREATE EXTERNAL TABLE hash_bands (uid STRING, band_id INT, hash_band ARRAY<BIGINT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
LOCATION 's3://my-data/hash_bands/';

INSERT OVERWRITE TABLE hash_bands
SELECT
a.uid,
a.band_id,
sort_array(collect_set(a.min_hash)) AS hash_band
FROM
(SELECT
h.uid,
x.minhash_id,
x.band_id,
MIN(h.hash ^ x.rand_xor) AS min_hash
FROM
hashes h CROSS JOIN
random_xor x
GROUP BY
h.uid,
x.minhash_id,
x.band_id
) a
GROUP BY
a.uid,
a.band_id;
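Before the final join, it's worth a quick check that the banding worked; every uid should have exactly 50 rows, one per band:

-- Expect zero rows back
SELECT uid, COUNT(*) AS bands
FROM hash_bands
GROUP BY uid
HAVING COUNT(*) <> 50;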

Step 4: Find Similar Bands

Now that each object has one row per uid and hash band, we can compare uids to see if they have any matching hash bands. We do this in Hive with a self-join on the hash_bands table we just created. This query outputs every distinct pair of uids that share at least one hash band, along with a count of how many bands they share.

SELECT
a.uid,
b.uid,
COUNT(a.uid) AS similar_hash_bands
FROM
hash_bands a
INNER JOIN hash_bands b ON a.band_id = b.band_id AND a.hash_band = b.hash_band
WHERE
a.uid < b.uid
GROUP BY
a.uid,
b.uid;
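If the candidate list comes back noisy, one small tweak (my addition, not part of the original pipeline) is to require agreement on more than one band by adding a HAVING clause to the same query:

-- Keep only pairs that match on at least 2 of the 50 bands
SELECT
a.uid,
b.uid,
COUNT(*) AS similar_hash_bands
FROM
hash_bands a
INNER JOIN hash_bands b ON a.band_id = b.band_id AND a.hash_band = b.hash_band
WHERE
a.uid < b.uid
GROUP BY
a.uid,
b.uid
HAVING
COUNT(*) >= 2;

Raising the threshold trades recall for precision, so it's worth experimenting with on your own data.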

And voilà! You're done. We've greatly reduced the work required to compare a large number of sparsely populated records, all within the friendly confines of a default Amazon EMR installation.
