Friday, July 3, 2009

Duplication versus effort

As someone who’s used SQL over many years, I have a habit of importing data into a database whenever I want to list duplicates (or sole entries as opposed to duplicates). Although this is normally easily done in sql (via the Group By clause) it can be a bad habit, as other programs can also accomplish this for you.
For example, if you program in unix or linux, and you want to show unique lines in a file (ie remove duplicate entries), you can use the following command:
sort yrfilename | uniq
That’s not all, you can also use switches to manipulate the data further:
sort yrfilename | uniq –d # show the duplicate records
sort yrfilename | uniq –u # displays only the nonduplicate entries
sort yrfilename | uniq – c # count for each record
Now not everyone is on a Unix based operating system, but never fear, there are various ways to look for duplicate entries in Excel too (I’d be surprised if there wasn’t an easy way to do it in Powershell either). A useful way to do this on a data set (in more modern versions of Excel) is to choose the Data Menu, then select the Advanced Filter option. Here you can either filter the list in-place, or copy to another location (my preferred solution). So long as you select the “Unique records only” option, you lose the duplicate lines. Alternatively you can just select the “Remove Duplicates” menu option in the Data Menu, and you’ll be given the choice of which columns to check for duplicate information.
Incidentally, in older versions of Excel, I would use conditional formatting to show which fields were duplicated (I’d sort the column first). If the entries I were checking were in column A, I’d use something like || Formula Is = countif(A:A, A2) > 1 || and copy down.
I imagine there are lots of different ways of accomplishing the rooting out of duplicate entries (Perl, sed, awk anyone?). I think it’s always good to have an alternative way of achieving what you want – one day the database may be down, or Excel disabled by a virus. The more options you have under your belt the better!

No comments:

Post a Comment