How to quickly group FlatFile records and get a grouped Field maximum

· by Vincent De Smet · Read in about 2 min · (412 Words)

If the data was in a RDBMS you’d simply GROUP BY Field1 and SELECT MAX(Field2).

However, all I got are: FlatFiles, little time and PowerShell, so Let’s GO!

First, I have to filter down records in the file based on the first 3 characters of each line, only ANV records were of interest to me, let me load these records in memory real fast:

$ANV=gc SSRF-FAH.BAK | ? { $_.substring(0,3) -eq "ANV" }

Great, let’s have a look at what the first few records we just loaded look like:

$ANV | Select -First 10

Which gives me:

ANVA182007                                                                   20130425...
ANVA182600                                                                   20120829...
ANVA182607                                                                   20130425...
ANVA188001                                                                   20120829...
ANVA188010                                                                   20120829...
ANVA188011                                                                   20120829...
ANVA188012                                                                   20120829...
ANVA188014                                                                   20120829...
ANVA188016                                                                   20120829...
ANVA190000                                                                   20120829...

I only care about the first 20 characters and only the last 2 record fields listed below:

Field Start Length
RecordType 00 03
Category 03 02
Code 05 15

Let’s create some custom properties to display this data

$ANV | Select `
@{Name="Category";Expression={$_.Substring(3,2)}}, `
@{Name="Code";Expression={$_.Substring(5,15)}} `
| Select -First 10

Above we used HashTables (a data structure which stores Key-Value pairs). In PowerShell you write HashTables with the @{<key>=<value>} syntax. The Select-Object cmdlet accepts a HashTable with Name and Expression Fields to create custom (a.k.a calculated) properties.

I actually only want the length of the codes, so getting the Length after trimming the field gives me the following snippet:

 $ANV | Select `
 @{Name="Category";Expression={$_.Substring(3,2)}}, `
 @{Name="CodeLength";Expression={$_.Substring(5,15).trim().Length}} `
 | Select -First 10

Now let’s group by the Category Field:

$ANV | Select `
@{Name="Category";Expression={$_.Substring(3,2)}}, `
@{Name="CodeLength";Expression={$_.Substring(5,15).trim().Length}} | `
Group Category | Select -First 10

This gives us the following

Count Name                      Group
----- ----                      -----
   16 A1                        {@{Category=A1; CodeLength=5}, @{Category=A1; Code...}}
   13 A3                        {@{Category=A3; CodeLength=4}, @{Category=A3; Code...}}
  530 A4                        {@{Category=A4; CodeLength=1}, @{Category=A4; Code...}}
  351 A6                        {@{Category=A6; CodeLength=4}, @{Category=A6; Code...}}
    1 A7                        {@{Category=A7; CodeLength=5}}
    3 C4                        {@{Category=C4; CodeLength=1}, @{Category=C4; Code...}}
  216 I1                        {@{Category=I1; CodeLength=3}, @{Category=I1; Code...}}
    3 I3                        {@{Category=I3; CodeLength=1}, @{Category=I3; Code...}}
    8 I4                        {@{Category=I4; CodeLength=2}, @{Category=I4; Code...}}
   40 M0                        {@{Category=M0; CodeLength=4}, @{Category=M0; Code...}}

Great, only thing left to do is get the maximum CodeLength per Category… most Google results tell you to sort and select the first ForEach group. % is a shortcut for ForEach:

$ANV | Select `
@{Name="Category";Expression={$_.Substring(3,2)}}, `
@{Name="CodeLength";Expression={$_.Substring(5,15).trim().Length}} | `
Group Category | %{ $_.Group | Sort CodeLength -Descending | Select -First 1 }

Exactly what I needed:

Category CodeLength
-------- ----------
A1                5
A3                4
A4                9
A6                5
A7                5
C4                1
...

Now back to the boring parts :(