Showing posts with label data analysis. Show all posts
Showing posts with label data analysis. Show all posts

Saturday, May 12, 2012

Create running totals with data.

This is a fairly common requirement when reporting on data. The data analyst will typically extract data using sql and then use the SUM formula within Excel (with the aid of fixed and relative references, that are copied down). For example, supposing we are looking at a table that records number of starters by dates. We could have data like the following:
"record_date";"yr1starter"
"2004-01-01";"820"
"2004-01-02";"1891"
"2004-01-03";"1667"
"2004-01-04";"1607"
"2004-01-05";"1542"
We copy this into a spreadsheet (normally by using text-to-columns so the data is in separate columns). In order to get a running balance we can then use the following formula: =SUM(B$2:B2) for cell C2. We then copy down, so for instance the formula in cell C3 becomes: =SUM(B$2:B3) It's not the only way to do this in Excel - you could always add the subtotal above to the current record field, but that runs the danger of a gap in the figures. The attached graphic shows what the cumulative formula list will look like
Excel will normally display the calculations for example:
yr1starters cumulative total
820 820
1891 2711
1667 4378
1607 5985
1542 7527
However, it is possible to create cumulative/running totals in SQL (via an inner query, normally called a sub-query). I will first show this using MySql syntax:
SET @runtotal:=0;
SELECT qry1.r_date, qry1.day_starts,
(@runtotal := @runtotal + qry1.day_starts) AS runningtotal
FROM
(SELECT
record_date AS r_date, yr1starts AS day_starts
FROM startstops
WHERE record_date between '2004-01-01' AND '2004-01-31'
ORDER BY record_date) AS qry1
The key is to set the initial variable @runtototal to 0, so it can then total up the running total (the ORDER BY clause at the end of the select statement is also important). Incidentally, this query (and other ways to achieve it using Mysql) can be seen at StackOverflow. It is actually a bit more difficult to achieve this using sql*server. One way is to use a cursor, which I feel is OTT for this. Another way is to use a cross-join, and so join the table to itself. Performance can be an issue here if the data set is large (helped possibly by indexes)
SELECT a.record_date, a.yr1starter, SUM(b.yr1starter) AS running_total
FROM dbo.startstops a
CROSS JOIN dbo.startstops b
WHERE (b.record_date <= a.record_date)
AND a.record_date between '2004-01-01' AND '2004-01-31' AND b.record_date between '2004-01-01' AND '2004-01-31'
GROUP BY a.record_date,a.yr1starter ORDER BY a.record_date,a.yr1starter
You can read about different ways of calculating running totals in sql*server (including the usage of cursors) at the sqlteam link.

Monday, December 27, 2010

How to replicate a "GROUP BY" clause in Excel

As a Data analyst, I often use the “group by” clause in sql to aggregate records against a key field. For instance, to bring back records where a field has greater than 10 entries I’d use a statement like “select city, count(*) from city_sales group by city having count(*) > 10”
However I’m currently in a role where I can’t use a database :( So, can Excel do the business? Well yes, pivot tables can provide a workaround. It’s a bit more involved than using a decent database (and obviously you’d need that database if you’re dealing with hundreds of thousands of rows) but at least it’s an option.
Here’s a short example. Supposing I have various sales records for different cities (see data below). I’ve just listed a few records, but there could be 100s or 1000s of records. Supposing I wanted to see the total sales figure, for those cities with greater than 3 sales?









What I’d do is create a couple of pivot tables. The first one involves putting the city in the row label, and then putting the inv_num field in the data values area. Note that I have to change the “Value field settings” here to show the count (and not the sum). It’s then easy enough to change the sort by to a descending sort on the count of inv_num.














I now create a similar pivot table, but this time I put the Amount field into the Data Value, and this time keep the formula as “Sum of”.

It’s now a simple matter to copy the relevant data, and then use VLOOKUP to bring back the sum of sales for those cities you’ve brought back.













Remember I can also put a field in the Report filter (equivalent to a Where clause in sql) so I could (for example) select only cities from a specified area.
I still prefer sql by far for this type of calculation, but if Excel is your only tool, at least we have this option of the pivot tables.
Hope this is of use to someone out there!
Mark