I was trying to run a simple join query on MariaDB (MySQL) and its performance was horrendous. Here’s how I cut down the query’s run time from over 380 hours to under 12 hours by executing part of it with two simple Unix commands.
Below is the query, which forms part of a more complex GHTorrent analysis I implemented using the simple-rolap relational online analytical processing framework.
select distinct
  project_commits.project_id,
  date_format(created_at, '%x%v1') as week_commit
from project_commits
left join commits
on project_commits.commit_id = commits.id;
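In the format string, %x is the year belonging to the ISO week, %v is the two-digit ISO week number, and the trailing 1 is a literal character, so each commit is keyed by the week in which it was made. A quick way to see the resulting key from the shell, using a hypothetical date:

mysql -N -e "select date_format('2018-08-05', '%x%v1') as week_commit;"
# prints: 2018311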
Both join fields are indexed. However, MariaDB implements the join with a full scan of project_commits and an index lookup on commits. This can be seen in the output of EXPLAIN.
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| id   | select_type | table           | type   | possible_keys | key     | key_len | ref                                 | rows       | Extra           |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
|    1 | SIMPLE      | project_commits | ALL    | NULL          | NULL    | NULL    | NULL                                | 5417294109 | Using temporary |
|    1 | SIMPLE      | commits         | eq_ref | PRIMARY       | PRIMARY | 4       | ghtorrent.project_commits.commit_id |          1 |                 |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
The sizes of the two tables are relatively large: project_commits contains 5 billion rows and commits 847 million rows. Also, the server's memory size is relatively small (16GB). This probably means that index lookups hit the (unfortunately magnetic) disk, and therefore performance took a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which at that point had already run for more than half a day, would take another 373 hours to complete.
/home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11
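pmonitor derives such an estimate from how far a process has advanced through the file it is reading. A minimal sketch of the check; the -f flag for naming the watched file is my assumption about its exact interface:

# Estimate completion from read progress on the temporary table's
# data file (the -f flag name is assumed):
pmonitor -f '/home/mysql/ghtorrent/project_commits#P#p0.MYD'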
To me this seemed excessive, because the I/O time requirements for a sort-merge join were orders of magnitude below the projected execution time. An answer I got to a question I posted on dba.stackexchange.com helpfully offered numerous things to try, but no conviction regarding their efficacy. I tried the first suggestion, but the results weren't promising. As experimenting with each suggestion could easily take at least half a day, I proceeded with a way I knew would work efficiently and reliably.
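To see why a sequential approach should win, consider a back-of-the-envelope estimate: the exported data comes to roughly 110GB (see the file sizes below), and even a magnetic disk streams sequential data at around 100MB/s (an assumed, conservative figure), so one full pass costs about 110,000MB / 100MB/s ≈ 1,100 seconds, i.e. under 20 minutes. A handful of such sorting and merging passes adds up to hours, not the projected weeks.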
I exported the two tables into files, joined them with the Unix join command, piped the result through uniq to remove duplicate rows, and imported the result back into the database. The process started at 20:41, and the import (including index building) had finished by 9:53 of the next day. Here are the precise steps I followed.
I first exported the fields of the two tables I wanted to join, sorted on the join field. To ensure the sort order was compatible with that used by the Unix tools, I cast the field into a character type.
I saved the output of the following SQL query into the file commits_week.txt.
select cast(id as char) as cid,
  date_format(created_at, '%x%v1') as week_commit
from commits
order by cid;
I also saved the output of the following SQL query into the file project_commits.txt:
select cast(commit_id as char) as cid, project_id
from project_commits
order by cid;
This generated the following files.
-rw-r--r-- 1 dds dds 15G Aug 4 21:09 commits_week.txt
-rw-r--r-- 1 dds dds 93G Aug 5 00:36 project_commits.txt
Crucially, I ran the mysql client with the --quick option, to avoid it running out of memory as it tried to gather all results before outputting them.
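Concretely, each export can be run along the following lines. The ghtorrent database name comes from the EXPLAIN output above; the -N flag (suppress the column-name header, so that no stray header row lands in the sorted file) is my assumption about the exact invocation:

mysql --quick -N ghtorrent >commits_week.txt <<'EOF'
select cast(id as char) as cid,
  date_format(created_at, '%x%v1') as week_commit
from commits
order by cid;
EOF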
Second, I joined the two text files using the Unix join command. This scans linearly through both files and combines the records whose first field matches. As the files are already sorted, this can be done very efficiently: at the speed of I/O. I also piped the output of join into uniq to eliminate duplicate records. This handled the distinct clause of the original SQL query. Again, on already sorted output this can be done through a simple linear scan.
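The semantics are easy to see on a toy run of the same pipeline, using two small hypothetical sorted files:

printf 'c1 2018311\nc2 2018321\n' >weeks.txt
printf 'c1 10\nc1 10\nc2 20\n' >projects.txt
join weeks.txt projects.txt | uniq
# c1 2018311 10
# c2 2018321 20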
Here is the Unix shell command I ran.
join commits_week.txt project_commits.txt | uniq >joined_commits.txt
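One caveat: join compares keys according to the current locale, so a locale whose collation differs from the order in which the database sorted the keys can make join silently miss matches. Forcing the C (byte-order) locale is a safe defensive variant of the same pipeline:

LC_ALL=C join commits_week.txt project_commits.txt | uniq >joined_commits.txt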
After the processing, which took just an hour, I had the desired result in a file.
-rw-r--r-- 1 dds dds 133G Aug 5 01:40 joined_commits.txt
Finally, I imported the text file back into the database as a table.
create table half_life.week_commits_all (
  project_id INT(11) not null,
  week_commit CHAR(7)) ENGINE=MyISAM;

load data local infile 'joined_commits.txt'
into table half_life.week_commits_all
fields terminated by ' ';
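Note that load data local infile works only when the client enables (and the server permits) the local-infile capability. A hypothetical way to run the import, with the statements above saved in a file import.sql:

mysql --local-infile=1 half_life <import.sql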
Ideally, MariaDB should support sort-merge joins and its optimizer should employ them when the runtime of alternative strategies is projected to be excessive. Until that time, using Unix shell commands designed in the 1970s can provide a huge performance boost.