Title of the Blog


Rants, musings, and other bits of information that may or may not be useful and/or interesting.


Determining Concurrency With SQL

This week, I was presented with an interesting task: Determine concurrent user metrics for one of the applications I support. While this sounds like a simple metric to gather, the actual implementation is pretty interesting.

The Data

Our user session history is stored in a database table that looks like:

start_date                  end_date  
18-oct-2013 09:18:32.132    18-oct-2013 09:36:12.045
18-oct-2013 09:25:34.267    18-oct-2013 09:45:10.988
18-oct-2013 09:37:36.002    18-oct-2013 10:22:07.324
18-oct-2013 10:12:45.723    18-oct-2013 10:20:55.229
...

There are other columns as well, but they're not really relevant to this exercise.

So what we have is events with defined start and end times in different fields, and we want to use that to determine how these records overlap in time.

Step 1: Identify the Sample Set

Let say we want to get average and peak concurrency for a specific time frame. For this example, lets use Oct 18th, 2013. This gives us a 24-hour span to work with.

Now, to filter down to the appropriate records for this timeframe, it's tempting to simply window it with:

SELECT
    *
FROM
    session_history
WHERE
    start_date BETWEEN '18-oct-2013' AND '19-oct-2013';

However, this isn't actually the window we want. Why? Well, I think it's easier to visualize:

user_sessions

The orange bars represent active user sessions over time. The green area is the timeframe we care about. What happens if we use the filter above to get our initial dataset? Well, we end up throwing out any sessions that started before our sample range, but extended into it. The same would happen if you used the end_date instead of the start_date. This means our concurrency numbers can report lower than they actually are.

So to properly select our sample set, we need to pull records that fulfill the following conditions:

  • The session ends after the beginning of our sample range
  • The session begins before the end of our sample range

The resultant SQL looks like:

SELECT
    *
FROM
    session_history
WHERE
       start_date < '19-oct-2013'
    AND end_date > '18-oct-2013';

This gives us any session that was active during the sample range. Now we start the transform:

Step 2: Create Separate Start/End Events

The next thing we want to do is split each session into two separate start and end events. We're going to do this by taking our sample set, and unioning it to itself:

WITH sessions AS (
    SELECT *
    FROM session_history
    WHERE start_date < '19-oct-2013'
       AND end_date > '18-oct-2013')
SELECT
    1 AS event,
    start_date AS event_date
FROM
    sessions begin
UNION ALL
SELECT
    -1 AS event,
    end_date AS event_date
FROM
    sessions end;

When run against the sample set above, you end up with:

event    event_date              
 1        18-oct-2013 09:18:32.132
 1        18-oct-2013 09:25:34.267
-1        18-oct-2013 09:36:12.045
 1        18-oct-2013 09:37:36.002
-1        18-oct-2013 09:45:10.988
 1        18-oct-2013 10:12:45.723
-1        18-oct-2013 10:20:55.229
-1        18-oct-2013 10:22:07.324

Now, you might be seeing where this is going. We have separate, timestamped events for every session start and end, and a column which has '1' for a session start, and '-1' for a session end.

Step 3: Add an Aggregate

So all the magic is going to happen using that event column we just created. Think of it like a counter. When a session starts, we add 1 to the active session total. When a session ends, we subtract 1 from the total. So now we simply have to start keeping count. We can do this with an aggregate function:

WITH sessions AS (
    SELECT *
    FROM session_history
    WHERE start_date < '19-oct-2013'
       AND end_date > '18-oct-2013')
SELECT
    q.*,
    SUM( event ) OVER ( ORDER BY event_date ASC )
        AS concurrent
FROM    
    (SELECT
        1 AS event,
        start_date AS event_date
    FROM
        sessions begin
    UNION ALL
    SELECT
        -1 AS event,
        end_date AS event_date
    FROM
        sessions end) q;

I should mention that all SQL in this article is written for Oracle. Function syntax may vary depending on your flavor of SQL.

Okay, what the heck is this thing doing now? We're taking the result set of the last step, which has separate start/end events, indicated with 1 and -1 respectively. Then, instead of using the typical aggregate implementation of SUM(), we're adding an analytic clause. This causes it to keep the record set intact, and append a rolling sum as a new column. The ORDER BY distinction tells it how it should order the set before doing summation.

When run against our sample events, the output is as follows:

event   event_date                  concurrent              
 1      18-oct-2013 09:18:32.132    1
 1      18-oct-2013 09:25:34.267    2
-1      18-oct-2013 09:36:12.045    1
 1      18-oct-2013 09:37:36.002    2
-1      18-oct-2013 09:45:10.988    1
 1      18-oct-2013 10:12:45.723    2
-1      18-oct-2013 10:20:55.229    1
-1      18-oct-2013 10:22:07.324    0

Pretty cool, eh? We've generated very accurate concurrency numbers that report as often as our users log in and out. Now, it's just a matter of determining the aggregates we were initially looking for.

Step 4: Re-Filter and Calculate Final Metrics

We're on the home stretch. We've got our concurrency numbers, and now we just need to get an AVG() and MAX(). Almost. Remember how way back at the beginning we made sure to include events that etended outside of our sample range? Well, now that we've got our concurrency numbers, we don't need them anymore. So when applying our aggregates, we also want to filter out any events that fall outside of our timeframe. (This is necessary, as they won't actually be accurate, and will skew our final calculations)

So our final SQL is going to look like this:

WITH sessions AS (
    SELECT *
    FROM session_history
    WHERE start_date < '19-oct-2013'
       AND end_date > '18-oct-2013')
SELECT
    AVG( concurrent ) AS avg_concurrent,
    MAX( concurrent ) AS max_concurrent
FROM
    (SELECT
        q.*,
        SUM( event ) OVER ( ORDER BY event_date ASC )
            AS concurrent
    FROM    
        (SELECT
            1 AS event,
            start_date AS event_date
        FROM
            sessions begin
        UNION ALL
        SELECT
            -1 AS event,
            end_date AS event_date
        FROM
            sessions end) q)
WHERE
    event_date BETWEEN '18-oct-2013' AND '19-oct-2013';

Which when applied to our sample dataset results in:

avg_concurrent    max_concurrent
1.25              2

We could also apply a grouping to our aggregate (Say, by hour) and see a hourly trend of concurrent user sessions. It's relatively easy to modify this query to answer quite a lot of questions. You really only need to adjust the date filters, and what kind of aggregate you want at the end. The rest is essentially a template.

This query is also quite re-usable. You can apply it to just about anything transactional that you might want to trend. As long as you have a start/end time, (or just a start time and duration) you can use this to determine concurrency.

comments powered by Disqus