Sunday, December 14, 2014

Working with Age Groups in SQL

Following on from my previous post SQL - Beyond the standard ASC / DESC ORDER BY which introduced the CASE WHEN statement, this post tackles the problem of handling age groups with SQL and introduces a technique I call “query wrapping”.

When writing queries to generate reports, it is a common requirement to aggregate records into groups and a typical grouping is age. The following example outlines the process of aggregating a set of employee records into the following age groups:


Under 18
18 – 25
26 – 35
36 – 55
56+


For this post, the sample employee list I am using looks like this:
 


Before assigning an employee an age group, the employee’s age needs to be calculated. To do this the DATEDIFF T-SQL function can be used. MSDN defines the syntax as:
 

DATEDIFF(datepart, startdate, enddate)
 

To calculate an employee’s age, the number of years in between their date of birth and today’s date is required:
 

DATEDIFF(yy, DateOfBirth, GETDATE())

Note “yy” specifies that the number of years between the two dates is required and GETDATE() will return today's date.
 

The DATEDIFF function can be used in conjunction with a CASE WHEN statement to create a query which will add an age group label to each employee and can in turn be grouped upon:
 

SELECT
    CASE
        WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) <= 17
            THEN 'Under 18'
        WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) BETWEEN 18 AND 25
            THEN '18-25'
        WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) BETWEEN 26 AND 35
            THEN '26-35'
        WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) BETWEEN 36 AND 55
            THEN '36-55'
        WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) > 55
            THEN '56+'
    END AS YearGroup
FROM Employees

This query will generate the following result set:



This result set will be the basis to determining how many employees are in each age group. To create the age group aggregation it is required to group on the result set displayed above. How do you group on the result set of another query? You use that result set as the input to your GROUP BY query. It is possible to dynamically generate a query's FROM data-set from an SQL query instead of referencing an actual table. Take a look at the query below to see this in action:
 


SELECT AgeGroups.YearGroup AS [Year Group]
    , COUNT(AgeGroups.YearGroup) AS [Group Count]
FROM
(
    SELECT
        CASE
            WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) <= 17
                THEN 'Under 18'
            WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) BETWEEN 18 AND 25
                THEN '18-25'
            WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) BETWEEN 26 AND 35
                THEN '26-35'
            WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) BETWEEN 36 AND 55
                THEN '36-55'
            WHEN DATEDIFF(yy, DateOfBirth, GETDATE()) > 55
                THEN '56+'
        END AS YearGroup
    FROM Employees
) AgeGroups
GROUP BY AgeGroups.YearGroup

The query previously written to add an age group label against each employee has been included as the FROM dataset for the GROUP BY query. I think of this as wrapping the group by statement around the result set of the inner age group query. This is an important technique to understand as it allows you to build up very powerful SQL queries by wrapping queries around one another. Data-sets which are being JOINED on to can also be created dynamically using a separate SQL statement opening up a limitless number of querying options.
 

This is where thinking in sets when working in SQL helps. Think of the inner age group labeling query as generating your initial data-set and the outer group by query filtering (or in this case aggregating) down this result set further to something closer to what is required.
 

When executed, the updated query produces the required aggregated result set:
 


When writing an SQL query it is usually habit to start at the top and work down--what you want as the end result. I find it easier to write the inner most queries first where I am pulling in all the data I require to work with in its most primitive form. I then start to wrap these queries in other queries which will better shape the data into the end result that I require and look for ways to JOIN this data on to other data-sets as required.
 

Hopefully this has illustrated the technique of “query wrapping” whilst demonstrating how to work with age groups in SQL and helped open a couple more SQL querying doors for you.