Fastest way to parse thousands of small files in PowerShell

I have over 16000 inventory log files ranging in size from 3-5 KB on a network share.
Sample file looks like this:
## System Info
SystemManufacturer:=:Dell Inc.
SystemModel:=:OptiPlex GX620
SystemType:=:X86-based PC
ChassisType:=:6 (Mini Tower)
## System Type
isLaptop=No
I need to put them into a DB, so I started parsing them and creating a custom object for each one that I can later use to check for duplicates, normalize, etc.
The initial parse with the code snippet below took about 7.5 minutes:
Foreach ($invlog in $invlogs) {
    $content = gc $invlog.FullName -ReadCount 0
    foreach ($line in $content) {
        if ($line -match '^#|^\s*$') { continue }
        $invitem, $value = $line -split ':=:'
        [PSCustomObject]@{Name = $invitem; Value = $value}
    }
}
I started optimizing it, and after several rounds of trial and error I ended up with this, which takes 2 minutes and 4 seconds:
Foreach ($invlog in $invlogs) {
    foreach ($line in ([System.IO.File]::ReadLines("$($invlog.FullName)") -match '^\w')) {
        $invitem, $value = $line -split ':=:'
        [PSCustomObject]@{Name = $invitem; Value = $value} #2.04mins
    }
}
I also tried using a hashtable instead of a PSCustomObject, but to my surprise it took much longer (5 mins 26 secs):
Foreach ($invlog in $invlogs) {
    $hash = @{}
    foreach ($line in ([System.IO.File]::ReadLines("$($invlog.FullName)") -match '^\w')) {
        $invitem, $value = $line -split ':=:'
        $hash[$invitem] = $value #5.26mins
    }
}
What would be the fastest method to use here?

See if this is any faster:
Foreach ($invlog in $invlogs) {
    @(gc $invlog.FullName -ReadCount 0) -notmatch '^#|^\s*$' |
      foreach {
        $invitem, $value = $_ -split ':=:'
        [PSCustomObject]@{Name = $invitem; Value = $value}
      }
}
The -match and -notmatch operators, when applied to an array, return all the elements that satisfy the match, so you can eliminate having to test each line individually for the lines to exclude.
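For example (a quick sketch you can paste at a prompt):
# -notmatch applied to an array acts as a filter, returning only the
# elements that do NOT match the pattern:
'## System Info', '', 'SystemModel:=:OptiPlex GX620' -notmatch '^#|^\s*$'
# -> SystemModel:=:OptiPlex GX620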
Are you really wanting to create a PS Object for every line, or just one for every file?
If you want one object per file, see if this is any quicker:
The multi-line regex eliminates the line array, and a filter is used in place of the foreach to create the hash entries.
$regex = [regex]'(?ms)^(\w+):=:([^\r]+)'
filter make-hash { @{$_.groups[1].value = $_.groups[2].value} }
Foreach ($invlog in $invlogs) {
    $regex.matches([io.file]::ReadAllText($invlog.fullname)) | make-hash
}
The objective of switching to the multi-line regex and [io.file]::ReadAllText() is to simplify what PowerShell is doing with the file input internally. The result of [io.file]::ReadAllText() is a single string object, which is a much simpler type than the array of strings that [io.file]::ReadAllLines() produces, and it requires less overhead to construct internally.

A filter is essentially just the Process block of a function: it runs once for every object that comes to it from the pipeline, so it emulates the action of ForEach-Object but actually runs slightly faster (I don't know the internals well enough to tell you exactly why).

Both of these changes require more coding and only produce a marginal increase in performance. In my testing, switching to the multi-line regex gained about .1 ms per file, and changing from ForEach-Object to the filter another .1 ms. You probably don't see these techniques used very often because of the low return compared to the additional coding work required, but it becomes significant when you start to multiply those fractions of a millisecond by 16K iterations.
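If you want to verify that filter-vs-cmdlet difference on your own hardware, here's a minimal sketch using Measure-Command (the exact numbers will obviously vary):
# A pass-through filter vs. ForEach-Object over the same input
filter pass-thru { $_ }
Measure-Command { 1..100000 | ForEach-Object { $_ } }   # cmdlet version
Measure-Command { 1..100000 | pass-thru }               # filter version, usually slightly faster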

Try this:
Foreach ($invlog in $invlogs) {
    $output = @{}
    foreach ($line in ([IO.File]::ReadLines("$($invlog.FullName)") -ne '')) {
        if ($line.Contains(":=:")) {
            # Note: in Windows PowerShell, Split(":=:") binds to String.Split(char[]),
            # splitting on each of ':' and '='; the -ne '' drops the empty tokens that result.
            $item, $value = $line.Split(":=:") -ne ''
            $output[$item] = $value
        }
    }
    New-Object PSObject -Property $output
}
As a general rule, Regex is sometimes cool but always slower.
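You can put a rough number on that with Measure-Command (a sketch; timings will vary by machine):
$line = 'SystemModel:=:OptiPlex GX620'
Measure-Command { foreach ($i in 1..100000) { $null = $line.Contains(':=:') } }  # string method
Measure-Command { foreach ($i in 1..100000) { $null = $line -match ':=:' } }     # regex operator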

Wouldn't you want an object per system, and not per key-value pair? :S
Like this. By replacing Get-Content with the .NET method you could probably save some time.
Get-ChildItem -Filter *.txt -Path <path to files> | ForEach-Object {
    $ht = @{}
    Get-Content $_ | Where-Object { $_ -match ':=:' } | ForEach-Object {
        $ht[($_ -split ':=:')[0].Trim()] = ($_ -split ':=:')[1].Trim()
    }
    [pscustomobject]$ht
}
ChassisType    SystemManufacturer SystemType   SystemModel
-----------    ------------------ ----------   -----------
6 (Mini Tower) Dell Inc.          X86-based PC OptiPlex GX620
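A side note, not from the original answer: a plain hashtable does not guarantee key order, which is why the columns above come out shuffled relative to the file. An ordered dictionary preserves insertion order:
$ht = [ordered]@{}                        # [ordered] keeps keys in insertion order
$ht['SystemManufacturer'] = 'Dell Inc.'
$ht['SystemModel']        = 'OptiPlex GX620'
[pscustomobject]$ht                       # columns now appear in file order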

Related

Why Isn't This Counting Correctly | PowerShell

Right now, I have a CSV file which contains 3,800+ records. This file contains a list of server names, followed by an abbreviation stating if the server is a Windows server, Linux server, etc. The file also contains comments or documentation, where each line starts with "#", stating it is a comment. What I have so far is as follows.
$file = Get-Content .\allsystems.csv
$arraysplit = @()
$arrayfinal = @()
[int]$windows = 0
foreach ($thing in $file){
    if ($thing.StartsWith("#")) {
        continue
    }
    else {
        $arraysplit = $thing.Split(":")
        $arrayfinal = @($arraysplit[0], $arraysplit[1])
    }
}
foreach ($item in $arrayfinal){
    if ($item[1] -contains 'NT'){
        $windows++
    }
    else {
        continue
    }
}
$windows
The goal of this script is to count the total number of Windows servers. My issue is that the first "foreach" block works fine, but the second one results in "$Windows" being 0. I'm honestly not sure why this isn't working. Two example lines of data are as follows:
example:LNX
example2:NT
If the goal is to count the Windows servers, why do you need the array?
Can't you just say something like:
foreach ($thing in $file)
{
    if ($thing -notmatch "^#" -and $thing -match "NT") { $windows++ }
}
$arrayfinal = @($arraysplit[0], $arraysplit[1])
This overwrites the array on every iteration.
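A quick demonstration of the effect:
$arr = @()
foreach ($i in 1..3) { $arr = @($i, $i * 2) }   # plain assignment replaces the array each pass
$arr   # -> 3 6  (only the last iteration survives)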
Changing it to += gave another issue. It simply appended each individual element. I used this post's info to fix it, sort of forcing a 2d array: How to create array of arrays in powershell?.
$file = Get-Content .\allsystems.csv
$arraysplit = @()
$arrayfinal = @()
[int]$windows = 0
foreach ($thing in $file){
    if ($thing.StartsWith("#")) {
        continue
    }
    else {
        $arraysplit = $thing.Split(":")
        $arrayfinal += ,$arraysplit
    }
}
foreach ($item in $arrayfinal){
    if ($item[1] -contains 'NT'){
        $windows++
    }
    else {
        continue
    }
}
$windows
1
I also changed the file around and added more instances of both NT and other random garbage. Seems it works fine.
I'd avoid making another foreach loop just to bump the occurrence count. Your $arrayFinal also gets rewritten every time, so I used an ArrayList.
$file = Get-Content "E:\Code\PS\myPS\2018\Jun\12\allSystems.csv"
$arrayFinal = New-Object System.Collections.ArrayList($null)
foreach ($thing in $file){
    if ($thing.StartsWith("#")) {
        continue
    }
    else {
        $arraysplit = $thing -split ":"
        if($arraysplit[1] -match "NT" -or $arraysplit[1] -match "Windows")
        {
            $arrayFinal.Add($arraysplit[1]) | Out-Null
        }
    }
}
Write-Host "Entries with 'NT' or 'Windows': $($arrayFinal.Count)"
I'm not sure if you want to keep 'example', 'example2'..., so I have skipped adding them to $arrayFinal, assuming the goal is to count the "NT" or "Windows" occurrences.
The goal of this script is to count the total number of Windows servers.
I'd suggest the easy way: using cmdlets built for this.
$csv = Get-Content -Path .\allsystems.csv |
    Where-Object { -not $_.StartsWith('#') } |
    ConvertFrom-Csv -Delimiter ':' -Header 'server', 'servertype'
@($csv.servertype).Where({ $_.Equals('NT') }).Count
# Compatibility mode:
# ($csv.servertype | Where-Object { $_.Equals('NT') }).Count
Replace the header names and 'NT' with whatever matches your data.

While loop does not produce pipeline output

It appears that a While loop does not produce an output that can continue in the pipeline. I need to process a large (many GiB) file. In this trivial example, I want to extract the second field, sort on it, then get only the unique values. What am I not understanding about the While loop and pushing things through the pipeline?
In the *NIX world this would be a simple:
cut -d "," -f 2 rf.txt | sort | uniq
In PowerShell this would be not quite as simple.
The source data.
PS C:\src\powershell> Get-Content .\rf.txt
these,1,there
lines,3,paragraphs
are,2,were
The script.
PS C:\src\powershell> Get-Content .\rf.ps1
$sr = New-Object System.IO.StreamReader("$(Get-Location)\rf.txt")
while ($line = $sr.ReadLine()) {
    Write-Verbose $line
    $v = $line.split(',')[1]
    Write-Output $v
} | sort
$sr.Close()
The output.
PS C:\src\powershell> .\rf.ps1
At C:\src\powershell\rf.ps1:7 char:3
+ } | sort
+ ~
An empty pipe element is not allowed.
+ CategoryInfo : ParserError: (:) [], ParseException
+ FullyQualifiedErrorId : EmptyPipeElement
You're making it a bit more complicated than it needs to be. You have a CSV without headers. The following should work:
Import-Csv .\rf.txt -Header f1,f2,f3 | Select-Object -ExpandProperty f2 -Unique | Sort-Object
Nasir's workaround looks like the way to go here.
If you want to know what was going wrong in your code: while loops (and do/while/until loops) don't consistently return values to the pipeline the way that other statements in PowerShell do (actually, that is true, and I'll keep the examples of it, but scroll down for the real reason it wasn't working for you).
ForEach-Object -- a cmdlet, not a built-in language feature/statement; does return objects to the pipeline.
1..3 | % { $_ }
foreach -- statement; does return.
foreach ($i in 1..3) { $i }
if/else -- statement; does return.
if ($true) { 1..3 }
for -- statement; does return.
for ( $i = 0 ; $i -le 3 ; $i++ ) { $i }
switch -- statement; does return.
switch (2)
{
1 { 'one' }
2 { 'two' }
3 { 'three' }
}
But for some reason, these other loops seem to act unpredictably.
This loops forever, returning $i (which stays at 0; no incrementing is going on):
$i = 0; while ($i -le 3) { $i }
Returns nothing, but $i does get incremented:
$i = 0; while ($i -le 3) { $i++ }
If you wrap the inner expression in parentheses, it does get returned:
$i = 0; while ($i -le 3) { ($i++) }
But as it turns out (I'm learning a bit as I go here), while's strange return semantics have nothing to do with your error; you just can't pipe statements into functions/cmdlets, regardless of their return value.
foreach ($i in 1..3) { $i } | measure
will give you the same error.
You can "get around" this by making the entire statement a sub-expression with $():
$( foreach ($i in 1..3) { $i } ) | measure
That would work for you in this case. Or, in your while loop, instead of using Write-Output you could just add each item to an array and then sort it afterward:
$arr = @()
while ($line = $sr.ReadLine()) {
    Write-Verbose $line
    $v = $line.split(',')[1]
    $arr += $v
}
$arr | sort
I know you're dealing with a large file here, so maybe you're thinking that by piping to sort line by line you'll avoid a large memory footprint. In many cases piping does work that way in PowerShell, but the thing about sorting is that you need the whole set to sort it, so the Sort-Object cmdlet will be collecting each item you pass to it anyway and then do the actual sorting at the end; I'm not sure you can avoid that at all. Admittedly, letting Sort-Object build that collection instead of building the array yourself might be more efficient depending on how it's implemented, but I don't think you'll save much RAM.
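If the += on a growing array bothers you for a many-GiB file, a generic List grows far more cheaply. A minimal sketch of the same loop (same rf.txt as above):
$sr = New-Object System.IO.StreamReader("$(Get-Location)\rf.txt")
$list = New-Object System.Collections.Generic.List[string]
while ($line = $sr.ReadLine()) {
    $list.Add($line.Split(',')[1])    # second comma-separated field
}
$sr.Close()
$list | sort -Unique                  # sort and de-duplicate in one step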
Another solution:
Get-Content -Path C:\temp\rf.txt | select @{Name="Mycolumn";Expression={($_ -split ",")[1]}} | select Mycolumn -Unique | sort Mycolumn

Powershell/Batch Files: Verify that a file contains at least one entry from a list of strings

Here is my current issue: I have a list of 1800 customer numbers (ie 123456789). I need to determine which of these numbers show up in another, much larger (4 gb) file. The larger file is a fixed-width file of all customer information. I know how I would do this in SQL, but like I said it's a flat file.
When searching for individual numbers, I was using a command I found elsewhere on this site which worked very well:
get-content CUSTOMERINFO.txt -ReadCount 1000 | foreach { $_ -match "123456789" }
However, I do not have the expertise to translate this into another command, or a batch file, which would load list.txt and search all lines in customerinfo.txt for the requisite strings.
Time is not a major constraint, as this is running on a test server and will be a once-off project.
Thank you very much for any help you can provide.
So I appreciate everyone's help. Everybody gave me helpful info that let me get to my final solution, so I appreciate it. Especially to the guy who asked if this was a codewriting request, because it made me realize I needed to just write some code.
For anyone else who runs into the same problem, here is the code I ended up using:
$searchList = Get-Content .\list.txt   # renamed from $matches, which shadows an automatic variable
foreach ($entry in $searchList) {
    $results = get-content FiletoSearch -ReadCount 1000 | foreach { $_ -match $entry }
    if ($results -eq $null) {
        $entry
    }
    else {
        "found"
    }
}
This gives a 'found' entry for everything that was found (which is information I don't need), and gives back the value searched for when it's not found (which is information I do need).
The -match operator can test against multiple values at once; you separate them with a bar (|) character, the regex alternation operator.
e.g.
get-content CUSTOMERINFO.txt -ReadCount 1000 | foreach { $_ -match "DEF|YZ" }
You can also read the contents of a file and join its lines with a character of your choice. So if list.txt is a list of values to search for, such as
DEF
XY
Then you can read it and convert it to a bar-separated list using the join operator:
(Get-Content list.txt) -join "|"
Put them together and you should have your solution:
$listSearch = (Get-Content list.txt) -join "|";
get-content CUSTOMERINFO.txt -ReadCount 1000 | foreach { $_ -match $listSearch}
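One caveat to add here (my addition, not part of the original answer): if the values in list.txt could contain regex metacharacters (., *, + and so on), escape each one before joining. For plain customer numbers it makes no difference, but it keeps the pattern safe:
# [regex]::Escape() neutralizes any regex metacharacters in the search terms
$listSearch = (Get-Content list.txt | foreach { [regex]::Escape($_) }) -join "|"
get-content CUSTOMERINFO.txt -ReadCount 1000 | foreach { $_ -match $listSearch }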

Pulling a substring for each line in file

Using PowerShell, I am simply trying to pull 15 characters, starting from the 37th position, of any record that begins with a 6. I'd like to loop through and generate a record for each instance so it can later be put into an output file. But I can't seem to hit the correct syntax just to return the 15 characters. I know I am missing something obvious; I've been at this for a while. Here is my script:
$content = Get-Content -Path .\tmfhsyst*.txt | Where-Object { $_.StartsWith("6") }
foreach ($line in $content)
{
    $val102 = $line.substring(36,15)
}
write-output $val102
Just as Bill_Stewart pointed out, you need to move your Write-Output line inside the ForEach loop. A possibly better way to do it would just be to pipe it:
Get-Content -Path .\tmfhsyst*.txt | Where-Object { $_.StartsWith("6") } | foreach{$_.substring(36,15)}
That should give you the output you desired.
Using Substring() has the disadvantage that it will raise an error if the string is shorter than start index + substring length. You can avoid this with a regular expression match:
Get-Content -Path .\tmfhsyst*.txt | % { if ($_ -match '^6.{35}(.{15})') { $matches[1] } }
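To see the difference in behavior on a line that is too short (a throwaway example):
'6short' -match '^6.{35}(.{15})'      # -> False; the line is just skipped, no error
'6short'.Substring(36,15)             # -> throws ArgumentOutOfRangeException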

Get all lines containing a string in a huge text file - as fast as possible?

In PowerShell, how can I read, as fast as possible, the last line (or all lines) containing a specific string in a huge text file (about 200,000 lines / 30 MB)?
I'm using:
get-content myfile.txt | select-string -pattern "my_string" -encoding ASCII | select -last 1
But it's very slow (about 16-18 seconds).
I did tests without the last pipe ("select -last 1"), but the time is about the same.
Is there a faster way to get the last occurrence (or all occurrences) of a specific string in a huge file?
Perhaps it's simply the time it needs...
Or is there any possibility of reading the file faster from the end, since I want the last occurrence?
Thanks
Try this:
get-content myfile.txt -ReadCount 1000 |
foreach { $_ -match "my_string" }
That will read your file in chunks of 1000 records at a time and find the matches in each chunk. This gives you better performance because you aren't wasting a lot of CPU time on memory management, since there are only 1000 lines at a time in the pipeline.
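Since you only want the last occurrence, you can still tack your select back on; the chunked matching feeds it just the matching lines:
get-content myfile.txt -ReadCount 1000 |
    foreach { $_ -match "my_string" } |
    select -last 1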
Have you tried:
gc myfile.txt | % { if($_ -match "my_string") {write-host $_}}
Or, you can create a "grep"-like function:
function grep($f,$s) {
    gc $f | % { if ($_ -match $s) { write-host $_ } }
}
Then you can just issue: grep .\myfile.txt "my_string"
$reader = New-Object System.IO.StreamReader("myfile.txt")
$lines = @()
if ($reader -ne $null) {
    while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        if ($line.Contains("my_string")) {
            $lines += $line
        }
    }
}
$lines | Select-Object -Last 1
Have you tried using [System.IO.File]::ReadAllLines()? This method is more "raw" than the PowerShell-esque approach, since we're plugging directly into the Microsoft .NET Framework types.
$Lines = [System.IO.File]::ReadAllLines("myfile.txt");
[Regex]::Matches(($Lines -join "`n"), 'my_string_pattern');
I wanted to extract the lines that contained "failed" and also write those lines to a new file; I'll add the full command for this:
get-content log.txt -ReadCount 1000 |
    foreach { $_ -match "failed" } | Out-File C:\failes.txt