MySQL : Make Your Query Runs (at least) 16800% Faster. :: One of our data analyst was asked to analyze a bunch of data from 2 tables. One of it  contained nearly 200 million records, and the other about 120 million records. The query to run against this dataset is doing several JOINs on non-unique fields, causing the resultset to be estimated at trillions of records.

As some of you may have already guessed - a week went by, and the query still hasn't finished :)

I was asked to help, so I tried almost every trick that I knew, including (but not limited to) :

# vertical partitioning : split the columns into a table each
# horizontal partitioning : splitting the table into 10 smaller tables
# putting the tables on fast SSDs
# putting the tables on RAM disk : on several occasions, I actually managed to saturate the memory bandwidth......
# enlarging the JOIN buffers : up to tens of gigabytes.
# etc, etc.

Still no joy. The query just won't finish. And I've been sleeping on my work desk for almost a week.

One day before the deadline (just like in the movies, huh?), another of our data analyst suggested that I hashes the fields being compared, and then compare the hashes instead.

To my surprise, it works GREAT. It's like a miracle.

The query which wasn't finished in a week, now finished in an hour. 
We got the job done, at the very least, 16800% faster.

Mind = Blown :)

So, here's how it's done, hopefully you'll find it useful too later.
Enjoy !

======
ORIGINAL QUERY

SELECT a.f1, a.f2, b.f1, b.f2  
FROM t1 AS a, t2 AS b 
WHERE a.f1=b.f1 AND a.f2=b.f2 AND a.f3=b.f3 AND a.f4=b.f4 AND a.f5=b.f5;


======
CREATE THE HASHES FROM EACH TABLE

CREATE TABLE hash1 (KEY(uid), KEY(hash)) ENGINE=MyISAM AS 
(
SELECT uid, MD5(CONCAT(f1,f2,f3,f4,f5)) AS hash  
FROM t1 
)

CREATE TABLE hash2 (KEY(uid), KEY(hash)) ENGINE=MyISAM AS 
(
SELECT uid, MD5(CONCAT(f1,f2,f3,f4,f5)) AS hash  
FROM t2 
)


======
COMPARE THE HASHES 
This is where the magic happens :) 
we compare the hashes, and then we store the uid related to those hashes.

CREATE TABLE hash_result (KEY(uid1), KEY(uid2)) ENGINE=MyISAM AS 
(
SELECT a.uid AS uid1, b.uid AS uid2 
FROM hash1 AS a, hash2 AS b 
WHERE a.hash=b.hash 
)


======
CREATE THE END RESULT  
Collect the records with the corresponding uid :

CREATE TABLE end_result ENGINE=MyISAM AS 
(
SELECT a.f1, a.f2, b.f1, b.f2 
FROM t1 AS a, t2 AS b, hash_result AS c 
WHERE a.uid=c.uid1 AND b.uid=c.uid2 
)
Shared publiclyView activity