A simple Hive SQL query run on a 50GB size employee log table is running for hours.
select dept,count(distinct emp_id) from emp_log group by dept;
There are just 4-5 departments and a huge number of employees per department.
It was run with Hive 0.14 + Tez on 1TB memory. Is there any way to optimize this code block for better performance?
Modification 1
Tested with collect_list replacing distinct.
SELECT dept, size(collect_list(emp_id)) nb_emps
FROM emp_log
GROUP BY dept
Got the below error,
Status: Failed Vertex failed, vertexName=Reducer 2,vertexId=vertex_1446976653619_0043_1_02, diagnostics=[Task failed,taskId=task_1446976653619_0043_1_02_000282, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:
ALTER TABLE `emp_log` ADD INDEX `emp_log_idx_dept` (`dept`);
SELECT
emp_log.dept,
count(DISTINCT emp_log.emp_id)
FROM
emp_log
GROUP BY
emp_log.dept
ORDER BY
NULL