MarkFightsCode: excel

Showing posts with label excel. Show all posts

Saturday, May 12, 2012

Create running totals with data.

This is a fairly common requirement when reporting on data. The data analyst will typically extract data using sql and then use the SUM formula within Excel (with the aid of fixed and relative references, that are copied down). For example, supposing we are looking at a table that records number of starters by dates. We could have data like the following:

"record_date";"yr1starter"

"2004-01-01";"820"

"2004-01-02";"1891"

"2004-01-03";"1667"

"2004-01-04";"1607"

"2004-01-05";"1542"

We copy this into a spreadsheet (normally by using text-to-columns so the data is in separate columns). In order to get a running balance we can then use the following formula: =SUM(B$2:B2) for cell C2. We then copy down, so for instance the formula in cell C3 becomes: =SUM(B$2:B3) It's not the only way to do this in Excel - you could always add the subtotal above to the current record field, but that runs the danger of a gap in the figures. The attached graphic shows what the cumulative formula list will look like

Excel will normally display the calculations for example:

yr1starters cumulative total

820 820

1891 2711

1667 4378

1607 5985

1542 7527

However, it is possible to create cumulative/running totals in SQL (via an inner query, normally called a sub-query). I will first show this using MySql syntax:

SET @runtotal:=0;

SELECT qry1.r_date, qry1.day_starts,

(@runtotal := @runtotal + qry1.day_starts) AS runningtotal

FROM

(SELECT

record_date AS r_date, yr1starts AS day_starts

FROM startstops

WHERE record_date between '2004-01-01' AND '2004-01-31'

ORDER BY record_date) AS qry1

The key is to set the initial variable @runtototal to 0, so it can then total up the running total (the ORDER BY clause at the end of the select statement is also important). Incidentally, this query (and other ways to achieve it using Mysql) can be seen at StackOverflow. It is actually a bit more difficult to achieve this using sql*server. One way is to use a cursor, which I feel is OTT for this. Another way is to use a cross-join, and so join the table to itself. Performance can be an issue here if the data set is large (helped possibly by indexes)

SELECT a.record_date, a.yr1starter, SUM(b.yr1starter) AS running_total

FROM dbo.startstops a

CROSS JOIN dbo.startstops b

WHERE (b.record_date <= a.record_date)

AND a.record_date between '2004-01-01' AND '2004-01-31' AND b.record_date between '2004-01-01' AND '2004-01-31'

GROUP BY a.record_date,a.yr1starter ORDER BY a.record_date,a.yr1starter

You can read about different ways of calculating running totals in sql*server (including the usage of cursors) at the sqlteam link.

Monday, December 27, 2010

How to replicate a "GROUP BY" clause in Excel

As a Data analyst, I often use the “group by” clause in sql to aggregate records against a key field. For instance, to bring back records where a field has greater than 10 entries I’d use a statement like “select city, count(*) from city_sales group by city having count(*) > 10”
However I’m currently in a role where I can’t use a database :( So, can Excel do the business? Well yes, pivot tables can provide a workaround. It’s a bit more involved than using a decent database (and obviously you’d need that database if you’re dealing with hundreds of thousands of rows) but at least it’s an option.
Here’s a short example. Supposing I have various sales records for different cities (see data below). I’ve just listed a few records, but there could be 100s or 1000s of records. Supposing I wanted to see the total sales figure, for those cities with greater than 3 sales?

What I’d do is create a couple of pivot tables. The first one involves putting the city in the row label, and then putting the inv_num field in the data values area. Note that I have to change the “Value field settings” here to show the count (and not the sum). It’s then easy enough to change the sort by to a descending sort on the count of inv_num.

I now create a similar pivot table, but this time I put the Amount field into the Data Value, and this time keep the formula as “Sum of”.

It’s now a simple matter to copy the relevant data, and then use VLOOKUP to bring back the sum of sales for those cities you’ve brought back.

Remember I can also put a field in the Report filter (equivalent to a Where clause in sql) so I could (for example) select only cities from a specified area.
I still prefer sql by far for this type of calculation, but if Excel is your only tool, at least we have this option of the pivot tables.
Hope this is of use to someone out there!
Mark

Friday, July 3, 2009

Duplication versus effort

As someone who’s used SQL over many years, I have a habit of importing data into a database whenever I want to list duplicates (or sole entries as opposed to duplicates). Although this is normally easily done in sql (via the Group By clause) it can be a bad habit, as other programs can also accomplish this for you.
For example, if you program in unix or linux, and you want to show unique lines in a file (ie remove duplicate entries), you can use the following command:
sort yrfilename | uniq
That’s not all, you can also use switches to manipulate the data further:
sort yrfilename | uniq –d # show the duplicate records
sort yrfilename | uniq –u # displays only the nonduplicate entries
sort yrfilename | uniq – c # count for each record
Now not everyone is on a Unix based operating system, but never fear, there are various ways to look for duplicate entries in Excel too (I’d be surprised if there wasn’t an easy way to do it in Powershell either). A useful way to do this on a data set (in more modern versions of Excel) is to choose the Data Menu, then select the Advanced Filter option. Here you can either filter the list in-place, or copy to another location (my preferred solution). So long as you select the “Unique records only” option, you lose the duplicate lines. Alternatively you can just select the “Remove Duplicates” menu option in the Data Menu, and you’ll be given the choice of which columns to check for duplicate information.
Incidentally, in older versions of Excel, I would use conditional formatting to show which fields were duplicated (I’d sort the column first). If the entries I were checking were in column A, I’d use something like || Formula Is = countif(A:A, A2) > 1 || and copy down.
I imagine there are lots of different ways of accomplishing the rooting out of duplicate entries (Perl, sed, awk anyone?). I think it’s always good to have an alternative way of achieving what you want – one day the database may be down, or Excel disabled by a virus. The more options you have under your belt the better!

MarkFightsCode