SQL WHERE IN List: How to Reference Long Lists in SQL

As your data grows, adopting best practices for scalability becomes crucial. Working with large datasets can be challenging, especially when you’re unsure how to manage or query them efficiently. There are many situations where you may need to generate insights based on a SQL list of values, but doing this without hard-coding can be tricky.

SQL offers an efficient solution to navigate this scenario. The WHERE IN list SQL statement allows you to filter large datasets by checking whether a column’s value matches any value in a specified list. In this blog post, we’ll explore how to use the SQL WHERE IN list clause to filter long lists in your data while maintaining SQL best practices.

Prerequisites

Basic knowledge of SQL.
Sample data (See how you can generate one).
IDE to run SQL queries. (I’ll be using Quadratic as my IDE).

Basic syntax

The WHERE IN list SQL clause provides an efficient way to filter records by checking whether a value matches any item within a specified list. It serves as a cleaner and more scalable alternative to writing multiple OR conditions. Here’s the basic syntax:

SELECT column1, column2
FROM table_name
WHERE column_name IN (value1, value2, value3, ...);

This syntax selects the columns to display in the result, specifies the table to pull data from, and then filters through the table. It then returns only the table where column_name equals any of the values inside the parentheses.

Using WHERE IN with numbers

We can use the WHERE IN clause with numbers. Here’s an example:

SELECT *
FROM Employees
WHERE department_id IN (1, 4, 6, 9);

Here, SQL will look for information in the Employee table and then return employees whose department_id is 1, 4, 6, or 9. In simple terms, we’re asking the question, “Show me all employees who work in department 1, 4, 6, or 9.”

Using WHERE IN with strings

We can also use the WHERE IN clause to filter out data using strings. Here:

SELECT *
FROM Products
WHERE category IN ('Electronics', 'Furniture', 'Books');

This SQL query searches the Products table and then brings out products under the categories of Electronics, Furniture, or Books.

Using WHERE IN with subqueries

Subqueries are useful when you want to add an extra bit of filtering to your data. Suppose we want to filter out employees who work with an active department. We can do that by using subqueries:

SELECT employee_id, full_name, department_id
FROM Employees
WHERE department_id IN (
    SELECT department_id
    FROM Departments
    WHERE status = 'Active'
);

In this query, we begin by selecting the employee_id, full_name, and department_id columns from the Employees table. We then apply a filter to return only those employees whose department_id appears in the list provided by the subquery.

The subquery runs first. It retrieves all department_id values from the Departments table where the status is set to "Active". Once this list is generated, the outer query selects only the employees whose departments match those active department IDs. The final result will be only employees in active departments.

Checking multiple columns using WHERE IN

We can filter multiple columns at the same time using SQL query WHERE IN list with tuple comparison. For example, if we want to retrieve employees who belong to specific department + location combinations, we can compare both columns against a list of approved pairs. Here:

SELECT employee_id, full_name, department_id, location
FROM Employees
WHERE (department_id, location) IN (
    SELECT dept_id, location
    FROM ApprovedDeptLocations
);

In this query, the subquery runs first and returns all valid department–location combinations. The outer query then uses this list to filter the Employees table, returning only those employees whose department_id and location match one of the approved combinations provided by the subquery.

NOT IN EXCLUSION

The NOT IN clause is used to return rows that do not match any value in the list. We’ll use the same example, but this time with the NOT IN clause:

SELECT employee_id, full_name, department_id
FROM Employees
WHERE department_id NOT IN (
    SELECT department_id
    FROM Departments
    WHERE status = 'Active'
);

Here, the subquery first retrieves the active department IDs, then the outer query returns employees whose department_id isn’t on the list.

Handling NULL values

You need to be careful when handling NULL values, especially in NOT IN operations. Any comparison with a NULL value returns UNKNOWN, and not TRUE or FALSE. This means that NOT IN operation can break when NULL appears. Here’s an example:

SELECT *
FROM employees
WHERE department_id NOT IN (
    SELECT department_id
    FROM departments
);

Here, if any department_id is NULL, the query fails. The ideal approach is to filter out NULL values. Here:

SELECT *
FROM employees
WHERE department_id NOT IN (
    SELECT department_id
    FROM departments
    WHERE department_id IS NOT NULL
);

This query explicitly handles NULL values to prevent unexpected results.

Referencing data in Quadratic using SQL parameterized queries

First, what are parameterized queries? Parametrized queries use placeholders instead of embedding raw values directly. SQL query parameterization helps to improve performance, prevent SQL injection, and keep your code cleaner and more maintainable.

Quadratic provides an efficient way of parameterizing SQL query from long lists of values. With it, you can use the WHERE IN clause, run your SQL query, and immediately see the updates reflected in your table.

To begin, create a single Python cell that formats all your values as a comma-separated list wrapped in single quotes. This prepares the data so it can be easily plugged into your SQL query. Here’s an example of how to structure and format your list for referencing:

import pandas as pd

df = q.cells("Sample_Data")

google_users = df[df['UTM_Source'] == 'google']

emails = google_users['email'].tolist()
formatted_emails = ', '.join([f"'{email}'" for email in emails])

formatted_emails

This query extracts all user emails with a UTM_Source = 'google' and returns them as a comma-separated list in a single cell. If you prefer not to format this manually, Quadratic AI can automatically generate the Python code that structures your values correctly in a single cell – just prompt “List all user emails whose UTM_Source is “google” in a single Python cell, comma separated and single quotes around each value”.

After this is done, we can then reference that particular cell using the WHERE IN clause:

SELECT customer, number_of_seats
FROM demo_stripe_subscriptions
WHERE customer IN ({{ K2 }})

This query retrieves data from the stripe_subscriptions table and returns the customer’s name and the number of seats included in their subscription. It filters only to show customers whose names match the value in cell K2. Here’s the result in Quadratic:

SQL WHERE value IN list

This method improves the logic of your queries without repeating conditions or hardcoding values.

Conclusion

Referencing long lists in SQL is crucial for extracting meaningful insights from large and complex datasets. Data analysts not only need to understand how to use these techniques, but also how to apply them using best practices.

In this blog post, we walked through the SQL WHERE IN list clause and how it helps filter large datasets efficiently. We explored how to use it with strings, numbers, and subqueries. We also saw how to filter based on multiple columns, how to apply NOT IN for exclusions, and how to correctly handle NULL values. Finally, we saw how to streamline this process in Quadratic using a parameterized query in Python for safer and more efficient list referencing.

SQL WHERE IN List: How to reference long lists in SQL