Saturday, May 12, 2012

Create running totals with data.

This is a fairly common requirement when reporting on data. The data analyst will typically extract data using sql and then use the SUM formula within Excel (with the aid of fixed and relative references, that are copied down). For example, supposing we are looking at a table that records number of starters by dates. We could have data like the following:
"record_date";"yr1starter"
"2004-01-01";"820"
"2004-01-02";"1891"
"2004-01-03";"1667"
"2004-01-04";"1607"
"2004-01-05";"1542"
We copy this into a spreadsheet (normally by using text-to-columns so the data is in separate columns). In order to get a running balance we can then use the following formula: =SUM(B$2:B2) for cell C2. We then copy down, so for instance the formula in cell C3 becomes: =SUM(B$2:B3) It's not the only way to do this in Excel - you could always add the subtotal above to the current record field, but that runs the danger of a gap in the figures. The attached graphic shows what the cumulative formula list will look like
Excel will normally display the calculations for example:
yr1starters cumulative total
820 820
1891 2711
1667 4378
1607 5985
1542 7527
However, it is possible to create cumulative/running totals in SQL (via an inner query, normally called a sub-query). I will first show this using MySql syntax:
SET @runtotal:=0;
SELECT qry1.r_date, qry1.day_starts,
(@runtotal := @runtotal + qry1.day_starts) AS runningtotal
FROM
(SELECT
record_date AS r_date, yr1starts AS day_starts
FROM startstops
WHERE record_date between '2004-01-01' AND '2004-01-31'
ORDER BY record_date) AS qry1
The key is to set the initial variable @runtototal to 0, so it can then total up the running total (the ORDER BY clause at the end of the select statement is also important). Incidentally, this query (and other ways to achieve it using Mysql) can be seen at StackOverflow. It is actually a bit more difficult to achieve this using sql*server. One way is to use a cursor, which I feel is OTT for this. Another way is to use a cross-join, and so join the table to itself. Performance can be an issue here if the data set is large (helped possibly by indexes)
SELECT a.record_date, a.yr1starter, SUM(b.yr1starter) AS running_total
FROM dbo.startstops a
CROSS JOIN dbo.startstops b
WHERE (b.record_date <= a.record_date)
AND a.record_date between '2004-01-01' AND '2004-01-31' AND b.record_date between '2004-01-01' AND '2004-01-31'
GROUP BY a.record_date,a.yr1starter ORDER BY a.record_date,a.yr1starter
You can read about different ways of calculating running totals in sql*server (including the usage of cursors) at the sqlteam link.

No comments:

Post a Comment