Coding Tank: SQL Group By

Besides doing SQL joins the other thing you'll be asked in job interviews is to use a GROUP BY clause in some fashion. Even though GROUP BY is a rare sight in the wild, potential employers still want to know that you know how to use it.

Let's take a look at our standard job interview tables.

+---------------+--------------+
| Field         | Type         |
+---------------+--------------+
| employee_id   | int(11)      | 
| department_id | int(11)      | 
| name          | varchar(255) | 
| salary        | int(11)      | 
+---------------+--------------+

+---------------+--------------+
| Field         | Type         |
+---------------+--------------+
| department_id | int(11)      | 
| name          | varchar(255) | 
+---------------+--------------+

We want to list all all departments along with how many employees are in each one. Here's the answer.

select
   d.name,
   count(e.employee_id)
from
   department d
left outer join
   employee e using (department_id)
group by
   d.department_id;

To better understand how GROUP BY works it sometimes helps to construct the query without the COUNT function and GROUP BY clause to see what the raw result set looks like. In this case the query would look something like this.

select
   d.name,
   e.employee_id
from
   department d
left outer join
   employee e using (department_id);

Which produces a result set like the following.

department               employee id
-----------------------  --------------
accounting                1
accounting                2
accounting                3
accounting                4
accounting                5
accounting                6
human resources           7
human resources           8
human resources           9
human resources          10
human resources          11
human resources          12
human resources          13
human resources          14
information technology   15
information technology   16
information technology   17
information technology   18
information technology   19
information technology   20
marketing                21
marketing                22
marketing                23
marketing                24
research & development   [NULL]
engineering              25
engineering              26
engineering              27
engineering              28
engineering              29

Note, because we did a LEFT OUTER JOIN we got a row for "Research & Development"" even though there aren't any corresponding employees. Now, let's add the COUNT function and see what happens.

select
   d.name,
   count(e.employee_id)
from
   department d
left outer join
   employee e using (department_id);

This query results in...

department     count
------------   ------
accounting     29

We only get one result. That's because we told the query to count all the rows in the result set. In addition to the count we also told it to return the department name. Since d.name in this case is completely ambiguous MySQL just picks the first one from the uncounted/uncollapsed result set.

Now instead of counting all the rows we want to count the rows for each department. This is where GROUP BY comes into play.

select
   d.name,
   count(e.employee_id)
from
   department d
left outer join
   employee e using (department_id)
group by
   d.department_id;

This query results in...

department                 employee id
-----------------------    ------------
accounting                 6
human resources            8
information technology     6
marketing                  4
research & development     0
engineering                5

Now MySQL returns a row for each unique department. It also picks the department name from the first result in each group to return for the 'department' column.

It's layoff season and management wants to know which departments have people making six figures. Let's look at adding a WHERE clause.

select
   d.name,
   count(e.employee_id) as emp_count
from
   department d
inner join
   employee e using (department_id)
where
   e.salary >= 100000
group by
   d.department_id;

Results in...

department                        emp_count
-------------------------         ----------
accounting                        1
human resources                   1
information technology            1
engineering                       4

So almost every department has at least one person who makes 100K or more. That's not interesting. What management really wants to know is which departments have more than 2 people making six figures. If an interviewer asks you this chances are good they're slightly evil. Here's how you beat the evil.

select
   d.name,
   count(e.employee_id) as emp_count
from
   department d
left outer join
   employee e using (department_id)
where
   e.salary >= 100000
group by
   d.department_id
having 
   emp_count > 2;

Results in...

department              emp_count
-------------           ----------
engineering             4

The HAVING clause! It works in conjunction with with the GROUP BY to further filter the result set. The GROUP BY clause takes the raw result set and performs a calculation and filters the results to a smaller set of rows. The HAVING clause picks up reduced result set and applies an additional filter. The column used in the HAVING condition must be an explicitly selected column. In this case that's either d.name or emp_count.

Tip: It's a good idea to name your calculated columns to make them easier to reference later in the query if required.

That should give you enough ammunition to pass any GROUP BY questions an interviewer throws at you. If you do get a crazy question not covered in the article let me know if the comments and I'll add it. Good luck.

Coding Tank

Thursday, January 15, 2009

SQL Group By

No comments:

Followers

Blog Archive

About Me