The Power of Left Joins: Filtering an Insert

I have always considered the inner join to be more intuitive than other types of table joins. In many, perhaps most, cases when you go after data from multiple tables, you are only interested in pulling or manipulating the data that is matched across those tables. By contrast, a left join is going to return all of the data from first table, and if the second table does not have a match, it simply returns NULL values. This can cause problems if you are trying to perform an operation that requires that real data be returned without exception.

I began to see the utility of left joins the first time I needed to load data to a table, but couldn’t be sure that the data didn’t already exist in the table. Essentially, what I wanted to do was filter out the data that already existed and only load new data. To illustrate how this works, consider a customer table with the following columns:

id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
address VARCHAR(100),
city VARCHAR(100),
state VARCHAR(3),
zip VARHCAR(9),
phone VARCHAR(10)

The data that we want to load is in a table called customer_staging, a table that holds data temporarily. It contains the same set of columns, but differing primary keys. If you just insert from the staging table to the customer table, there is a good chance  that you are going to duplicate existing customer data, causing all kinds of headaches when you go to query those customers, and a mess that you will need to clean up (more on how to do that cleanup later).

To prevent this, you can construct a query that only pulls those records that do not have a match in the other table. In other words, you want the opposite of an inner join. The form of such a query as a SELECT statement could be written as follows:

SELECT cs.first_name, cs.last_name, cs.address, cs.city, cs.state, cs.zip, cs.phone
FROM customer_staging cs
LEFT JOIN customer c
ON c.first_name = cs.first_name
AND c.last_name = cs.last_name
AND c.address = cs.address
AND c.city = cs.city
AND c.state = cs.state
AND c.zip = cs.zip
AND c.phone = cs.phone
WHERE c.id IS NULL;

Rather than pulling those records that match between the two tables, the displayed results will be those records from the customer_staging table that do not have a match in the customer table. You can then turn this SELECT query into an INSERT query:

INSERT INTO customer (first_name, last_name, address, city, state, zip, phone)
SELECT cs.first_name, cs.last_name, cs.address, cs.city, cs.state, cs.zip, cs.phone
FROM customer_staging cs
LEFT JOIN customer c
ON c.first_name = cs.first_name
AND c.last_name = cs.last_name
AND c.address = cs.address
AND c.city = cs.city
AND c.state = cs.state
AND c.zip = cs.zip
AND c.phone = cs.phone
WHERE c.id IS NULL;

Using a left join in this way preserves the data integrity of your customer table against duplicate data. Note that how you define the criteria of your join can be very important. I recommend to match on as many fields as you can reasonably expect to relate uniquely to your data, but no more. In this case, adding more constraints to the join is actually going to make it more likely that you will accidentally insert duplicate data. For example, you might have customers change their phone number on occasion. If this is a possibility, and you do not want to enter them as a new row in your customer table, you would want to remove that criteria from the above query.

On the other hand, using too few criteria can result in the misidentification of duplicates and filtering out of results that you want to be included. For example, if you only filter on address information, a new customer moving into an address previously occupied by an old customer would be rejected. Using additional constraints makes problems like this far less likely.

In future posts, I will be covering other cool and interesting uses of left joins.

Advertisements