Let's take a look at our standard job interview tables.
We want to list all all departments along with how many employees are in each one. Here's the answer.+---------------+--------------+
| Field | Type |
+---------------+--------------+
| employee_id | int(11) |
| department_id | int(11) |
| name | varchar(255) |
| salary | int(11) |
+---------------+--------------+
+---------------+--------------+
| Field | Type |
+---------------+--------------+
| department_id | int(11) |
| name | varchar(255) |
+---------------+--------------+
To better understand how GROUP BY works it sometimes helps to construct the query without the COUNT function and GROUP BY clause to see what the raw result set looks like. In this case the query would look something like this.select
d.name,
count(e.employee_id)
from
department d
left outer join
employee e using (department_id)
group by
d.department_id;
Which produces a result set like the following.select
d.name,
e.employee_id
from
department d
left outer join
employee e using (department_id);
Note, because we did a LEFT OUTER JOIN we got a row for "Research & Development"" even though there aren't any corresponding employees. Now, let's add the COUNT function and see what happens.department employee id
----------------------- --------------
accounting 1
accounting 2
accounting 3
accounting 4
accounting 5
accounting 6
human resources 7
human resources 8
human resources 9
human resources 10
human resources 11
human resources 12
human resources 13
human resources 14
information technology 15
information technology 16
information technology 17
information technology 18
information technology 19
information technology 20
marketing 21
marketing 22
marketing 23
marketing 24
research & development [NULL]
engineering 25
engineering 26
engineering 27
engineering 28
engineering 29
This query results in...select
d.name,
count(e.employee_id)
from
department d
left outer join
employee e using (department_id);
We only get one result. That's because we told the query to count all the rows in the result set. In addition to the count we also told it to return the department name. Since d.name in this case is completely ambiguous MySQL just picks the first one from the uncounted/uncollapsed result set.department count
------------ ------
accounting 29
Now instead of counting all the rows we want to count the rows for each department. This is where GROUP BY comes into play.
This query results in...select
d.name,
count(e.employee_id)
from
department d
left outer join
employee e using (department_id)
group by
d.department_id;
Now MySQL returns a row for each unique department. It also picks the department name from the first result in each group to return for the 'department' column.department employee id
----------------------- ------------
accounting 6
human resources 8
information technology 6
marketing 4
research & development 0
engineering 5
It's layoff season and management wants to know which departments have people making six figures. Let's look at adding a WHERE clause.
Results in...select
d.name,
count(e.employee_id) as emp_count
from
department d
inner join
employee e using (department_id)
where
e.salary >= 100000
group by
d.department_id;
So almost every department has at least one person who makes 100K or more. That's not interesting. What management really wants to know is which departments have more than 2 people making six figures. If an interviewer asks you this chances are good they're slightly evil. Here's how you beat the evil.department emp_count
------------------------- ----------
accounting 1
human resources 1
information technology 1
engineering 4
Results in...select
d.name,
count(e.employee_id) as emp_count
from
department d
left outer join
employee e using (department_id)
where
e.salary >= 100000
group by
d.department_id
having
emp_count > 2;
The HAVING clause! It works in conjunction with with the GROUP BY to further filter the result set. The GROUP BY clause takes the raw result set and performs a calculation and filters the results to a smaller set of rows. The HAVING clause picks up reduced result set and applies an additional filter. The column used in the HAVING condition must be an explicitly selected column. In this case that's either d.name or emp_count.department emp_count
------------- ----------
engineering 4
Tip: It's a good idea to name your calculated columns to make them easier to reference later in the query if required.
That should give you enough ammunition to pass any GROUP BY questions an interviewer throws at you. If you do get a crazy question not covered in the article let me know if the comments and I'll add it. Good luck.
No comments:
Post a Comment