If the data was in a RDBMS you’d simply GROUP BY Field1
and SELECT MAX(Field2)
.
However, all I got are: FlatFiles, little time and PowerShell, so Let’s GO!
First, I have to filter down records in the file based on the first 3 characters of each line, only ANV
records were of interest to me, let me load these records in memory real fast:
$ANV=gc SSRF-FAH.BAK | ? { $_.substring(0,3) -eq "ANV" }
Great, let’s have a look at what the first few records we just loaded look like:
$ANV | Select -First 10
Which gives me:
ANVA182007 20130425...
ANVA182600 20120829...
ANVA182607 20130425...
ANVA188001 20120829...
ANVA188010 20120829...
ANVA188011 20120829...
ANVA188012 20120829...
ANVA188014 20120829...
ANVA188016 20120829...
ANVA190000 20120829...
I only care about the first 20 characters and only the last 2 record fields listed below:
Field | Start | Length |
---|---|---|
RecordType | 00 | 03 |
Category | 03 | 02 |
Code | 05 | 15 |
Let’s create some custom properties to display this data
$ANV | Select `
@{Name="Category";Expression={$_.Substring(3,2)}}, `
@{Name="Code";Expression={$_.Substring(5,15)}} `
| Select -First 10
Above we used HashTables (a data structure which stores Key-Value pairs). In PowerShell you write HashTables with the
@{<key>=<value>}
syntax. TheSelect-Object
cmdlet accepts a HashTable withName
andExpression
Fields to create custom (a.k.a calculated) properties.
I actually only want the length of the codes, so getting the Length after trimming the field gives me the following snippet:
$ANV | Select `
@{Name="Category";Expression={$_.Substring(3,2)}}, `
@{Name="CodeLength";Expression={$_.Substring(5,15).trim().Length}} `
| Select -First 10
Now let’s group by the Category
Field:
$ANV | Select `
@{Name="Category";Expression={$_.Substring(3,2)}}, `
@{Name="CodeLength";Expression={$_.Substring(5,15).trim().Length}} | `
Group Category | Select -First 10
This gives us the following
Count Name Group
----- ---- -----
16 A1 {@{Category=A1; CodeLength=5}, @{Category=A1; Code...}}
13 A3 {@{Category=A3; CodeLength=4}, @{Category=A3; Code...}}
530 A4 {@{Category=A4; CodeLength=1}, @{Category=A4; Code...}}
351 A6 {@{Category=A6; CodeLength=4}, @{Category=A6; Code...}}
1 A7 {@{Category=A7; CodeLength=5}}
3 C4 {@{Category=C4; CodeLength=1}, @{Category=C4; Code...}}
216 I1 {@{Category=I1; CodeLength=3}, @{Category=I1; Code...}}
3 I3 {@{Category=I3; CodeLength=1}, @{Category=I3; Code...}}
8 I4 {@{Category=I4; CodeLength=2}, @{Category=I4; Code...}}
40 M0 {@{Category=M0; CodeLength=4}, @{Category=M0; Code...}}
Great, only thing left to do is get the maximum CodeLength
per Category
… most Google results tell you to sort and select the first ForEach
group. %
is a shortcut for ForEach
:
$ANV | Select `
@{Name="Category";Expression={$_.Substring(3,2)}}, `
@{Name="CodeLength";Expression={$_.Substring(5,15).trim().Length}} | `
Group Category | %{ $_.Group | Sort CodeLength -Descending | Select -First 1 }
Exactly what I needed:
Category CodeLength
-------- ----------
A1 5
A3 4
A4 9
A6 5
A7 5
C4 1
...
Now back to the boring parts :(