Thursday, January 15, 2009

SQL Group By

Besides doing SQL joins the other thing you'll be asked in job interviews is to use a GROUP BY clause in some fashion. Even though GROUP BY is a rare sight in the wild, potential employers still want to know that you know how to use it.

Let's take a look at our standard job interview tables.
+---------------+--------------+
| Field | Type |
+---------------+--------------+
| employee_id | int(11) |
| department_id | int(11) |
| name | varchar(255) |
| salary | int(11) |
+---------------+--------------+

+---------------+--------------+
| Field | Type |
+---------------+--------------+
| department_id | int(11) |
| name | varchar(255) |
+---------------+--------------+
We want to list all all departments along with how many employees are in each one. Here's the answer.
select
d.name,
count(e.employee_id)
from
department d
left outer join
employee e using (department_id)
group by
d.department_id;
To better understand how GROUP BY works it sometimes helps to construct the query without the COUNT function and GROUP BY clause to see what the raw result set looks like. In this case the query would look something like this.
select
d.name,
e.employee_id
from
department d
left outer join
employee e using (department_id);
Which produces a result set like the following.
department               employee id
----------------------- --------------
accounting 1
accounting 2
accounting 3
accounting 4
accounting 5
accounting 6
human resources 7
human resources 8
human resources 9
human resources 10
human resources 11
human resources 12
human resources 13
human resources 14
information technology 15
information technology 16
information technology 17
information technology 18
information technology 19
information technology 20
marketing 21
marketing 22
marketing 23
marketing 24
research & development [NULL]
engineering 25
engineering 26
engineering 27
engineering 28
engineering 29
Note, because we did a LEFT OUTER JOIN we got a row for "Research & Development"" even though there aren't any corresponding employees. Now, let's add the COUNT function and see what happens.
select
d.name,
count(e.employee_id)
from
department d
left outer join
employee e using (department_id);
This query results in...
department     count
------------ ------
accounting 29
We only get one result. That's because we told the query to count all the rows in the result set. In addition to the count we also told it to return the department name. Since d.name in this case is completely ambiguous MySQL just picks the first one from the uncounted/uncollapsed result set.

Now instead of counting all the rows we want to count the rows for each department. This is where GROUP BY comes into play.
select
d.name,
count(e.employee_id)
from
department d
left outer join
employee e using (department_id)
group by
d.department_id;
This query results in...
department                 employee id
----------------------- ------------
accounting 6
human resources 8
information technology 6
marketing 4
research & development 0
engineering 5
Now MySQL returns a row for each unique department. It also picks the department name from the first result in each group to return for the 'department' column.

It's layoff season and management wants to know which departments have people making six figures. Let's look at adding a WHERE clause.
select
d.name,
count(e.employee_id) as emp_count
from
department d
inner join
employee e using (department_id)
where
e.salary >= 100000
group by
d.department_id;
Results in...
department                        emp_count
------------------------- ----------
accounting 1
human resources 1
information technology 1
engineering 4
So almost every department has at least one person who makes 100K or more. That's not interesting. What management really wants to know is which departments have more than 2 people making six figures. If an interviewer asks you this chances are good they're slightly evil. Here's how you beat the evil.
select
d.name,
count(e.employee_id) as emp_count
from
department d
left outer join
employee e using (department_id)
where
e.salary >= 100000
group by
d.department_id
having
emp_count > 2;
Results in...
department              emp_count
------------- ----------
engineering 4
The HAVING clause! It works in conjunction with with the GROUP BY to further filter the result set. The GROUP BY clause takes the raw result set and performs a calculation and filters the results to a smaller set of rows. The HAVING clause picks up reduced result set and applies an additional filter. The column used in the HAVING condition must be an explicitly selected column. In this case that's either d.name or emp_count.

Tip: It's a good idea to name your calculated columns to make them easier to reference later in the query if required.

That should give you enough ammunition to pass any GROUP BY questions an interviewer throws at you. If you do get a crazy question not covered in the article let me know if the comments and I'll add it. Good luck.

No comments: