Coding Tank: January 2009

Monday, January 26, 2009

Ruby or Python

I'm trying to figure out if I should learn Ruby or Python next. Both seem like popular languages and ideally I'd like to learn them both. But one has to come first. Since my current motivation for learning anything new is to make myself more valuable to potential employers I'm going to use that as the basis for my choice.

I loaded up Craigslist (SF Bay Area) this morning and performed a few searches.

All Jobs
Ruby = 125
Python = 133

This would seem to tip the scales in favor of Python. But I'm a web guy so I'm specifically interested in Internet Engineering jobs. Let's see how the numbers break down when I apply that filter.

Internet Engineering Jobs
Ruby = 45
Python = 32

That would seem to favor Ruby. Let's try the search terms for their respective web application frameworks.

Internet Engineering Jobs
Rails = 28
Django = 10

That seals it. I'm learning Ruby. Not that I'm trying to pit one language against the other. I'm just making a practical decision about which one to learn first. I've heard very positive things about both would like to eventually get into both.

Thursday, January 15, 2009

SQL Group By

Besides doing SQL joins the other thing you'll be asked in job interviews is to use a GROUP BY clause in some fashion. Even though GROUP BY is a rare sight in the wild, potential employers still want to know that you know how to use it.

Let's take a look at our standard job interview tables.

+---------------+--------------+
| Field         | Type         |
+---------------+--------------+
| employee_id   | int(11)      | 
| department_id | int(11)      | 
| name          | varchar(255) | 
| salary        | int(11)      | 
+---------------+--------------+

+---------------+--------------+
| Field         | Type         |
+---------------+--------------+
| department_id | int(11)      | 
| name          | varchar(255) | 
+---------------+--------------+

We want to list all all departments along with how many employees are in each one. Here's the answer.

select
   d.name,
   count(e.employee_id)
from
   department d
left outer join
   employee e using (department_id)
group by
   d.department_id;

To better understand how GROUP BY works it sometimes helps to construct the query without the COUNT function and GROUP BY clause to see what the raw result set looks like. In this case the query would look something like this.

select
   d.name,
   e.employee_id
from
   department d
left outer join
   employee e using (department_id);

Which produces a result set like the following.

department               employee id
-----------------------  --------------
accounting                1
accounting                2
accounting                3
accounting                4
accounting                5
accounting                6
human resources           7
human resources           8
human resources           9
human resources          10
human resources          11
human resources          12
human resources          13
human resources          14
information technology   15
information technology   16
information technology   17
information technology   18
information technology   19
information technology   20
marketing                21
marketing                22
marketing                23
marketing                24
research & development   [NULL]
engineering              25
engineering              26
engineering              27
engineering              28
engineering              29

Note, because we did a LEFT OUTER JOIN we got a row for "Research & Development"" even though there aren't any corresponding employees. Now, let's add the COUNT function and see what happens.

select
   d.name,
   count(e.employee_id)
from
   department d
left outer join
   employee e using (department_id);

This query results in...

department     count
------------   ------
accounting     29

We only get one result. That's because we told the query to count all the rows in the result set. In addition to the count we also told it to return the department name. Since d.name in this case is completely ambiguous MySQL just picks the first one from the uncounted/uncollapsed result set.

Now instead of counting all the rows we want to count the rows for each department. This is where GROUP BY comes into play.

select
   d.name,
   count(e.employee_id)
from
   department d
left outer join
   employee e using (department_id)
group by
   d.department_id;

This query results in...

department                 employee id
-----------------------    ------------
accounting                 6
human resources            8
information technology     6
marketing                  4
research & development     0
engineering                5

Now MySQL returns a row for each unique department. It also picks the department name from the first result in each group to return for the 'department' column.

It's layoff season and management wants to know which departments have people making six figures. Let's look at adding a WHERE clause.

select
   d.name,
   count(e.employee_id) as emp_count
from
   department d
inner join
   employee e using (department_id)
where
   e.salary >= 100000
group by
   d.department_id;

Results in...

department                        emp_count
-------------------------         ----------
accounting                        1
human resources                   1
information technology            1
engineering                       4

So almost every department has at least one person who makes 100K or more. That's not interesting. What management really wants to know is which departments have more than 2 people making six figures. If an interviewer asks you this chances are good they're slightly evil. Here's how you beat the evil.

select
   d.name,
   count(e.employee_id) as emp_count
from
   department d
left outer join
   employee e using (department_id)
where
   e.salary >= 100000
group by
   d.department_id
having 
   emp_count > 2;

Results in...

department              emp_count
-------------           ----------
engineering             4

The HAVING clause! It works in conjunction with with the GROUP BY to further filter the result set. The GROUP BY clause takes the raw result set and performs a calculation and filters the results to a smaller set of rows. The HAVING clause picks up reduced result set and applies an additional filter. The column used in the HAVING condition must be an explicitly selected column. In this case that's either d.name or emp_count.

Tip: It's a good idea to name your calculated columns to make them easier to reference later in the query if required.

That should give you enough ammunition to pass any GROUP BY questions an interviewer throws at you. If you do get a crazy question not covered in the article let me know if the comments and I'll add it. Good luck.

Monday, January 12, 2009

SQL Joins

If you are interviewing for a web engineering position you are guaranteed to be asked to write an SQL join or two. This is an easy way for the interviewer to verify that you really do know how to write SQL and aren't lying about it. When I learned SQL back in the old days (early 90's) we didn't have fancy join statements. We did everything in the WHERE clause and we liked it. Seriously, I liked it. I had a hard time adjusting to reading and writing join statements. But over the last couple of years I've become accustomed to it. Let's look at a few different types of joins you can do. First let's define a couple of tables. Here's are your standard job interview test tables. NOTE: This article is written from a MySQL point of view.


+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| employee_id   | int(11)      | NO   | PRI | NULL    | auto_increment | 
| department_id | int(11)      | NO   |     | NULL    |                | 
| name          | varchar(255) | NO   |     | NULL    |                | 
+---------------+--------------+------+-----+---------+----------------+

+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| department_id | int(11)      | NO   | PRI | NULL    | auto_increment | 
| name          | varchar(255) | NO   |     | NULL    |                | 
+---------------+--------------+------+-----+---------+----------------+

I populated these tables with some random data. Most employees have corresponding records in the department table but some don't. Additionally, there are some departments without any associated employees.

Let's look at a few questions and answers to demonstrate the various types of joins you can perform on these two tables.

List all employees along with their department.


select
   e.name,
   d.name
from
   employee e
inner join
   department d using (department_id)

This uses an 'inner join' since the result set will only include rows from both tables where department_id matches. We could have omitted 'inner' and just used 'join'. The two are equivalent. However, I prefer to use 'inner join' since it explicitly states what is happening.

List all employees along with their department. Include employees who don't have a department.


select
   e.name,
   d.name
from
   employee e
left outer join
   department d using (department_id)

When performing a join the 'left' table is the table specified in the 'from' clause. To include all rows from this table whether or not there is a matching record in the department table you need to use a 'left outer join'. Technically you could just write 'left join' but again I prefer the more verbose syntax since it leaves no question about the intent of the query.

List all employees along with their department. Also include departments which don't have any employees.


select
   e.name,
   d.name
from
   employee e
right outer join
   department d using (department_id)

Using a 'right outer join' will force all rows from the department table to be represented in the result set. Right joins in a production environment are extremely rare. Personally, I've never seen or used one in any of the code I've worked on in my 13 years of experience. The concept is confusing and performance is poor. I strongly recommend against using them. But in case you're asked to explain a 'right join' in an interview you now know.

I'm not going to discuss the 'full outer join' since MySQL doesn't support it and like the 'right join' you shouldn't be using it anyway. But it does what you would expect it to. Include rows from both left and right tables regardless if there's a match or not.

Something else you shouldn't ever do in MySQL is substitute 'cross join' for 'inner join'. Seriously, in MySQL the two are syntactically equivalent which makes no sense since the phrase 'cross join' has always meant a cartesian product in my experience.

Thursday, January 8, 2009

SQL Injection

SQL Injection is kind of old news. We've been hearing about it for years and things like cross-site scripting is more interesting now anyway. But how many of you have actually been paying attention to how SQL injections attacks happen? How many of you are actually doing something about it?

An SQL injection is when someone passes arbitrary SQL to your application. This usually happens through the html forms on your site. For example, you have a user login form where a user types in their email address and password. Without knowing anything about the database structure of the application I could enter something like this in the email field.

blah' or 'n'=='n

or worse...

'; drop table user;--

When your code drops those strings into the middle of an SQL statement bad things are going to happen. I'm not going to spend a lot of time explaining how to mount an SQL injection attack. People much smarter than me have already done so. Fire up Google and learn how it's done by the pros. The question is, are you protected? Here's how.

Sanitize Input
All data entering your database should be considered suspect. Any value than can vary in an SQL statement needs to be sanitized before it's used. Don't try to code this routine yourself. Use what the database vendor already gives you. If you're using MySQL then you're going to want to use mysql_real_escape_string()

Hide Error Messages
The public should never ever see error messages generated by the application server. Those are for developers to look at and no one else. These messages give away all kinds of information useful to attackers. Many times you're going to want to avoid even indicating an error occurred. This paper shows how an attacker was able to use the presence or absence of a 500 error page to know if they were able to generate valid SQL or not. How to handle errors and bad user input is entirely application dependent. Use your best judgement.

Limit Database Permissions
The web application should connect to the database as a user with a limited set of permissions. If the user anonymous you're going to want to use a read-only db user in that case. Again, this is application dependent. Use your judgment but think about database permissions. Think about it hard.

Parameterized Queries
I saved the best for last. In my experience 99% of all web application piece together SQL on the fly. This is bad for performance and even worse for security. I'm not judging because I've written tons of code like that myself. However, there is a better way. Using parameterized queries gives you some performance gains as well as a huge improvement in security. No longer can your original statement be modified into something you never intended.

Parameterized queries is where you construct your SQL statement using '?' placeholders for variable data and then bind the actual values to those placeholders. Once again, an example speaks far better than my prose. This is in PHP5.


$sql = "select first_name, last_name from user where email = ?";
$stmt = $mysqli->prepare($sql);

$email = "someone@yahoo.com";
$stmt->bind_param("s", $email);

$stmt->execute();
$stmt->bind_result($first, $last);

while($stmt->fetch()) {
 print("$first $last");
}

Monday, January 5, 2009

Cookie Security

Describe the security model for browser cookies?

That question was asked of me in a job interview recently. Unfortunately, I didn't have a good answer because I haven't actually worked with the details of creating and reading cookies for a long time. When you don't exercise a muscle it loses its tone. If you don't exercise knowledge it fades. That's what happened here. Unless you're coding new sites frequently you're probably not dealing with cookie security that often. It was time for me to hit the books and brush up on my knowledge.

So, what is the security model for browser cookies? Here goes.

First, browsers will only send cookies to the site which set the cookie. In this case we define "site" by a domain name and path. If a cookie is set with a domain of news.google.com then only news.google.com can read that cookie. A cookie set with mail.google.com can only be read by mail.google.com.

What if you want to share cookies among subdomains in a given domain? For example you want news.google.com and mail.google.com to read the same cookie. If the cookie is set to a subset of the fully qualified name then the cookie will be shared among any server whose tail matches the domain of the cookie.

For example: a cookie with a domain of google.com will be sent to mail.google.com as well as www.google.com and news.google.com.

Furthermore, a cookie with a domain of news.google.com will be sent to foo.news.google.com and foo.bar.news.google.com.

When setting the domain for a cookie you must specify a domain name and not just a top-level-domain. Meaning, you can't set a cookie to just .com or .edu. Browsers won't let you do this. But what about co.uk? That's a TLD but it has two parts so according to the specification it's fair game. Older browsers do in fact allow this to happen. However, newer browsers have restrictions to prevent cookies being tied to these particular two-part TLDs although each browser implements these restrictions a little differently.

So now we understand domains and tail matching. We can also restrict a cookie to a particular path on the server. For example, if a cookie's path is set to /blah/ it will only be available to requests with the /blah/ directory and any sub-directory such as /blah/hooga/.

There is one extra layer of built-in protection that's worth noting. When creating a cookie you can specify if it is to be sent only over HTTPS or not. However, since the majority of my work exists in the plain HTTP world this feature doesn't do me a lot of good.

We now understand the rules for which cookies get sent to which servers but there's a couple of other problems. First, cookies are sent as clear text across the net and can be eavesdropped on. Second, users can modify their cookies to contain values you may not want. How do we protect against eavesdropping and tampering?

Let's address the eavesdropping question first. We'll use two-way encryption to reduce the chance of eavesdropping. We're using two-way encryption because we want to send an encrypted value over the net but we need to decrypt the value so we can actually read it once the cookie has landed on the server. This doesn't make the cookie bullet proof because it can still be decrypted by a 3rd party with brute-force. Because of this and other reasons, you should store as little personal information as possible in cookies. If this cookie is intended to recognize a logged in user you may want to minimally include userid and ip address.

Moving on... let's say a cookie has been modified by whatever means and it may contain bad values. Values that might allow a user to impersonate other users. Like admins. We don't want that.

Here is where we're going to use one-way hashing on the cookie value itself. Let's use the logged in user example. We want to store username in a cookie for users who are logged in so the site will recognize them between sessions. We'll take the username, say, 'bob', add a salt to it and generate a hash using whatever algorithm floats your boat. Now we'll append the hash to the username and that becomes are full cookie value. Of course, you'll need a delimiter between the individual fields so you can pick them apart when you need to read it later. For example...

bob|cac991e4b010585f61ed2e40641ec77e

This is our basic cookie value. This is what we're going to encrypt and decrypt. Upon decryption we're going to rehash the username and make sure it matches up with the hash sent in the cookie. If the hashes match we're good cookie. If not, then you know something fishy is going on.

Coding Tank