How can I make this PowerShell script parse large files faster? - powershell

I have the following PowerShell script that parses some very large files for ETL purposes. For starters, my test file is about 30 MB; larger files of around 200 MB are expected. So I have a few questions.
The script below works, but it takes a very long time to process even a 30 MB file.
PowerShell Script:
$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$array = @()
$content = gc $path\$infile |
select -skip 4 |
where {$_ -match "[|].*[|].*"} |
foreach {$_ -replace "^[|]","" -replace "[|]$",""}
$header = $content[0]
$array = $content[0]
for ($i = 1; $i -le $content.length; $i+=1) {
    if ($array[$i] -ne $content[0]) {$array += $content[$i]}
}
$array | out-file $path\$outfile -encoding ASCII
DataFile Excerpt:
---------------------------
|Data statistics|Number of|
|-------------------------|
|Records passed | 93,118|
---------------------------
02/14/2012 Production Operations and Confirmations 2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Production Operations and Confirmations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|ProductionOrderNumber|MaterialNumber |ModifiedDate|Plant|OperationRoutingNumber|WorkCenter|OperationStatus|IsActive| WbsElement|SequenceNumber|OperationNumber|OperationDescription |OperationQty|ConfirmedYieldQty|StandardValueLabor|ActualDirectLaborHrs|ActualContractorLaborHrs|ActualOvertimeLaborHrs|ConfirmationNumber|
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|180849518 |011255486L1 |02/08/2012 |2101 | 9901123118|56B30 |I9902 | |SOC10MA2302SOCJ31| |0140 |Operation 1 | 1 | 0 | 0.0 | | 499.990 | | 9908651250|
|180849518 |011255486L1 |02/08/2012 |2101 | 9901123118|56B30 |I9902 | |SOC10MA2302SOCJ31|14 |9916 |Operation 2 | 1 | 0 | 499.0 | | | | 9908532289|
|181993564 |011255486L1 |02/09/2012 |2101 | 9901288820|56B30 |I9902 | |SOC10MD2302SOCJ31|14 |9916 |Operation 1 | 1 | 0 | 499.0 | | 399.599 | | 9908498544|
|180885825 |011255486L1 |02/08/2012 |2101 | 9901162239|56B30 |I9902 | |SOC10MG2302SOCJ31| |0150 |Operation 3 | 1 | 0 | 0.0 | | 882.499 | | 9908099659|
|180885825 |011255486L1 |02/08/2012 |2101 | 9901162239|56B30 |I9902 | |SOC10MG2302SOCJ31|14 |9916 |Operation 4 | 1 | 0 | 544.0 | | | | 9908858514|
|181638583 |990104460I0 |02/10/2012 |2101 | 9902123289|56G99 |I9902 | |SOC11MAR105SOCJ31| |0160 |Operation 5 | 1 | 0 | 1,160.0 | | | | 9914295010|
|181681218 |990104460B0 |02/08/2012 |2101 | 9902180981|56G99 |I9902 | |SOC11MAR328SOCJ31|0 |9910 |Operation 6 | 1 | 0 | 916.0 | | | | 9914621885|
|181681036 |990104460I0 |02/09/2012 |2101 | 9902180289|56G99 |I9902 | |SOC11MAR108SOCJ31| |0180 |Operation 8 | 1 | 0 | 1.0 | | | | 9914619196|
|189938054 |011255486A2 |02/10/2012 |2101 | 9999206805|5AD99 |I9902 | |RS08MJ2305SOCJ31 | |0599 |Operation 8 | 1 | 0 | 0.0 | | | | 9901316289|
|181919894 |012984532A3 |02/10/2012 |2101 | 9902511433|A199399Z |I9902 | |SOC12MCB101SOCJ31|0 |9935 |Operation 9 | 1 | 0 | 0.5 | | | | 9916914233|
|181919894 |012984532A3 |02/10/2012 |2101 | 9902511433|A199399Z |I9902 | |SOC12MCB101SOCJ31|22 |9951 |Operation 10 | 1 | 0 | 68.080 | | | | 9916914224|

Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).
Try this (not tested extensively):
$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$batch = 1000
[regex]$match_regex = '^\|.+\|.+\|.+'
[regex]$replace_regex = '^\|(.+)\|$'
$header_line = (Select-String -Path $path\$infile -Pattern $match_regex -list).line
[regex]$header_regex = [regex]::escape($header_line)
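# Use the same encoding for both writes so the header and the data rows match.
$header_line.trim('|') | Set-Content $path\$outfile -Encoding ASCII
Get-Content $path\$infile -ReadCount $batch |
ForEach {
    $_ -match $match_regex -NotMatch $header_regex -Replace $replace_regex ,'$1' | Out-File $path\$outfile -Append -Encoding ASCII
}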
That's a compromise between memory usage and speed. The -match and -replace operators will work on an array, so you can filter and replace an entire array at once without having to foreach through every record. The -readcount will cause the file to be read in chunks of $batch records, so you're basically reading in 1000 records at a time, doing the match and replace on that batch then appending the result to your output file. Then it goes back for the next 1000 records. Increasing the size of $batch should speed it up, but it will make it use more memory. Adjust that to suit your resources.
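As a minimal sketch of the array behaviour described above, the operators can be chained on a small in-memory array. The sample lines and the variable names $lines, $headerRegex and $result are made up for illustration; real input would arrive in batches from Get-Content -ReadCount.

```powershell
# Hypothetical sample batch (not taken from the asker's file).
$lines = @(
    '|ColA|ColB|ColC|',      # header
    'page title',            # noise line without pipes
    '|1   |2   |3   |',
    '|ColA|ColB|ColC|',      # repeated header
    '|4   |5   |6   |'
)
$headerRegex = [regex]::Escape($lines[0])

# With an array on the left-hand side, -match and -notmatch act as filters,
# and -replace is applied to every element that is left.
$result = $lines -match '^\|.+\|.+\|.+' -notmatch $headerRegex -replace '^\|(.+)\|$', '$1'
$result
```

Only the two data rows survive, each with its leading and trailing | removed.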

The Get-Content cmdlet does not perform as well as a StreamReader when dealing with very large files. You can read a file line by line using a StreamReader like this:
$path = 'C:\A-Very-Large-File.txt'
$r = [IO.File]::OpenText($path)
while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()
    # Process $line here...
}
$r.Dispose()
Some performance comparisons:
Measure-Command {Get-Content .\512MB.txt > $null}
Total Seconds: 49.4742533
Measure-Command {
    $r = [IO.File]::OpenText('512MB.txt')
    while ($r.Peek() -ge 0) {
        $r.ReadLine() > $null
    }
    $r.Dispose()
}
Total Seconds: 27.666803
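Applied to the filtering task in the question, the same reader loop could be paired with a StreamWriter so that neither the input nor the output is held in memory. This is only a sketch under assumptions: the sample lines, the temp-file paths and the rule "the first pipe-delimited line is the header" are illustrative and not taken from the original posts.

```powershell
# Hypothetical sample input written to a temp file so the sketch is self-contained.
$inFile  = [IO.Path]::GetTempFileName()
$outFile = [IO.Path]::GetTempFileName()
@(
    'report title',
    '|ColA|ColB|ColC|',
    '|----------------|',
    '|1   |2   |3   |',
    '|ColA|ColB|ColC|',
    '|4   |5   |6   |'
) | Set-Content -Path $inFile -Encoding ASCII

$reader = [IO.File]::OpenText($inFile)
$writer = New-Object IO.StreamWriter($outFile, $false, [Text.Encoding]::ASCII)
$header = $null
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        if ($line -notmatch '^\|.+\|.+\|.+') { continue }   # skip titles and separator lines
        if ($null -eq $header) { $header = $line }          # assumed: first data-like line is the header
        elseif ($line -eq $header) { continue }             # drop repeated headers
        $writer.WriteLine($line.Trim('|'))
    }
}
finally {
    # Always release the file handles, even if processing fails.
    $reader.Dispose()
    $writer.Dispose()
}
$result = Get-Content -Path $outFile
```

The output file then holds the header once, followed by the data rows, all without the outer | characters.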

This is almost a non-answer...I love PowerShell...but I will not use it to parse log files, especially large log files. Use Microsoft's Log Parser.
C:\>type input.txt | logparser "select substr(field1,1) from STDIN" -i:TSV -nskiplines:14 -headerrow:off -iseparator:spaces -o:tsv -headers:off -stats:off

Related

remove all characters in every line starting from the character "\" in a text file using powershell

I have a 50-line text file ($file1) like the one below, and I need to remove the characters starting from a specific character, "\", until the end of the line.
Sample text file:
| Area | vserver | file-id |connection-id | session-id | open-mode | path |
| manphsan01 | manphs101 | 9980 | 4278018043 | 5065142205921760710 | rw | Share01\Mandaue\Data01 |
| manphsan01 | manphs101 | 1790 | 4278020659 | 5065142205921763223 | rwd | FinanceDept\ARCHIVING |
| manphsan01 | manphs101 | 1824 | 4278020659 | 5065142205921763223 | rwd | Share01\Cebu\Year2022 |
| manphsan01 | manphs101 | 1976 | 4278020659 | 5065142205921763223 | rwd | SGSDept\General\Document |
My desired output should look like:
| Area | vserver | file-id |connection-id | session-id | open-mode | path |
| manphsan01 | manphs101 | 9980 | 4278018043 | 5065142205921760710 | rw | Share01 |
| manphsan01 | manphs101 | 1790 | 4278020659 | 5065142205921763223 | rwd | Finance |
| manphsan01 | manphs101 | 1824 | 4278020659 | 5065142205921763223 | rwd | Share01 |
| manphsan01 | manphs101 | 1976 | 4278020659 | 5065142205921763223 | rwd | SGSDept |
The command I used is:
$var = Get-content $file1
$var.Substring(0, $var.IndexOf('\')) | FT -AutoSize
or
$var.Substring(0, $var.IndexOf('backslash')) | FT -AutoSize
My command works if my data is only one line, but it doesn't work with multiple lines. I am not sure why the backslash is not showing in the command when I posted it.
Any ideas how to make this work?
You can get away with plain-text processing if you can assume that only one field on each line of your structured text file contains \ and that it and everything after it - up until the next field delimiter, | - should be removed:
# Transforms all matching lines and outputs them.
# Pipe to Set-Content to save back to a file; use -Encoding as needed.
(Get-Content $file) -replace '\\.+?(?= \|)'
The above uses a -replace operation with a regex to remove the unwanted part of matching lines (lines that don't match are passed through as-is).
For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
As for what you tried:
$var = Get-content $file1 stores the individual lines of file $file1 as an array in variable $var.
To process the resulting lines one by one, you need a loop construct, such as a foreach statement or the ForEach-Object cmdlet; e.g. foreach ($line in $var) { ... }
While $line.Substring(0, $line.IndexOf('\')) works in principle, it will cause a statement-terminating error (exception) for every $line value that contains no \ character, as Theo notes, notably with your file's header line.
While this could easily be fixed with try { $line.Substring(0, $line.IndexOf('\')) } catch { $line }, the bigger problem is that it would remove everything through the end of the line, which contradicts your desired output, which shows that the next field separator, |, should be retained.
The above -replace operation fixes both these problems; note that it implicitly loops over the array of input lines and performs the replacement operation on each, returning an array of (potentially) transformed lines.
Also note that a formatting cmdlet such as Format-Table (alias FT) should only be used for for-display output; it doesn't produce usable data - see this answer for more information; also, it has no formatting effect on strings.
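To see the implicit looping in action, here is a small sketch that runs the same -replace expression on an in-memory array instead of a file. The shortened sample lines and the variable names $lines and $trimmed are assumptions made for illustration.

```powershell
# Hypothetical, shortened sample lines modelled on the question's file.
$lines = @(
    '| Area | path |',
    '| manphsan01 | Share01\Mandaue\Data01 |',
    '| manphsan01 | SGSDept\General\Document |'
)

# -replace runs on each element; the header has no backslash and passes through unchanged.
# The lookahead (?= \|) stops the match before the closing " |" so the separator is kept.
$trimmed = $lines -replace '\\.+?(?= \|)'
$trimmed
```

The header comes back unchanged, while the two data lines become '| manphsan01 | Share01 |' and '| manphsan01 | SGSDept |'.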

PowerShell: Incremental Counter in Foreach Loop

I have a foreach loop that iterates over an Array and calls a function which also has another foreach loop inside with an incremental counter, however it doesn't seem to be working as expected?
Array contents:
| Username | Username2 |
|----------|-----------|
| p1 | p2 |
| p3 | p4 |
Code:
function insertIntoLunchJobs($arrayOfRows) {
    $counter = 1
    foreach ($i in $arrayOfRows) {
        $i
        $counter++
        $counter
    }
}
Output:
| Username | Username2 |
|----------|-----------|
| p1 | p2 |
| 2 | |
| p3 | p4 |
| 2 | |
Desired result:
| Username | Username2 |
|----------|-----------|
| p1 | p2 |
| 2 | |
| p3 | p4 |
| 3 | |
Any ideas?
TIA
I'm literally copy pasting your code. I don't see any errors here:
$arr = @'
Username,Username2
p1,p2
p3,p4
'@ | ConvertFrom-Csv
function insertIntoLunchJobs($arrayOfRows) {
    $counter = 1
    foreach ($i in $arrayOfRows) {
        $i
        $counter++
        $counter
    }
}
insertIntoLunchJobs -arrayOfRows $arr
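The question does not show the calling code, so the following is only a guess at how the reported output could arise. If the outer foreach passes one row per call, $counter is re-initialised to 1 on every call. The function name Show-RowsWithCounter and the plain-string rows are hypothetical stand-ins.

```powershell
# Same logic as the function in the question, under a hypothetical name.
function Show-RowsWithCounter($arrayOfRows) {
    $counter = 1
    foreach ($i in $arrayOfRows) {
        $i
        $counter++
        $counter
    }
}

$rows = 'p1', 'p3'

# Assumed calling pattern: one call per row, so the counter restarts each time.
$perRow = foreach ($row in $rows) { Show-RowsWithCounter $row }

# Passing the whole array in a single call lets the counter keep counting.
$once = Show-RowsWithCounter $rows

$perRow -join ' '   # p1 2 p3 2
$once -join ' '     # p1 2 p3 3
```

If this matches the real script, either pass the whole array once or keep the counter outside the function.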

tasklist -v truncation of output

Evidently the Windows dos cmd "tasklist -v " is truncating lines after so many characters.
My perl program is reading in special command processes to compare against processes stored in my database. I am trying to make sure expected processes are running.
Unfortunately the script fails since one of my 50 or so processes is truncated by "tasklist -v".
Is there an alternative command?
Thanks,
Sammy
The following code demonstrates using the tasklist /fo table command as piped input for further processing.
Tip: help tasklist
use strict;
use warnings;
my $regex = qr/^(?<name>.*?)\s+(?<pid>\d+)\s+(?<session_name>\S+)\s+(?<session>\d+)\s+(?<mem>.*)/;
$^ = 'STDOUT_TOP';
open my $pipe, 'tasklist /fo table|';
/$regex/ && write for <$pipe>;
close $pipe;
$~ = 'STDOUT_BOTTOM';
write;
exit 0;
format STDOUT_TOP =
+----------------------------------+------------+----------+---------+-----------+
| Name | PID | SessName | Session | Memory |
+----------------------------------+------------+----------+---------+-----------+
.
format STDOUT =
| @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< | @>>>>>>>>> | @<<<<<<< | @>>>>>> | @>>>>>>>> |
$+{name}, $+{pid}, $+{session_name}, $+{session}, $+{mem}
.
format STDOUT_BOTTOM =
+----------------------------------+------------+----------+---------+-----------+
.
Output
+----------------------------------+------------+----------+---------+-----------+
| Name | PID | SessName | Session | Memory |
+----------------------------------+------------+----------+---------+-----------+
| System Idle Process | 0 | Services | 0 | 8 K |
| System | 4 | Services | 0 | 7,452 K |
| Registry | 100 | Services | 0 | 28,664 K |
| smss.exe | 412 | Services | 0 | 368 K |
| csrss.exe | 552 | Services | 0 | 2,256 K |
| csrss.exe | 776 | Console | 1 | 2,496 K |
| wininit.exe | 796 | Services | 0 | 1,420 K |
| winlogon.exe | 860 | Console | 1 | 5,084 K |
| services.exe | 940 | Services | 0 | 5,964 K |
..............
| RuntimeBroker.exe | 7392 | Console | 1 | 8,604 K |
| dwm.exe | 1224 | Console | 1 | 70,144 K |
| chrome.exe | 10580 | Console | 1 | 103,584 K |
| svchost.exe | 12152 | Services | 0 | 7,496 K |
| LockApp.exe | 2620 | Console | 1 | 39,392 K |
| RuntimeBroker.exe | 3104 | Console | 1 | 30,508 K |
| chrome.exe | 452 | Console | 1 | 54,088 K |
| svchost.exe | 7460 | Services | 0 | 7,408 K |
| svchost.exe | 5744 | Services | 0 | 11,540 K |
♀+----------------------------------+------------+----------+---------+-----------+
| Name | PID | SessName | Session | Memory |
+----------------------------------+------------+----------+---------+-----------+
| WmiPrvSE.exe | 6200 | Services | 0 | 10,612 K |
| perl.exe | 2520 | Console | 1 | 8,948 K |
| tasklist.exe | 4808 | Console | 1 | 8,940 K |
+----------------------------------+------------+----------+---------+-----------+

Continuation Four in a row game

I figured the array out, but now I want to Write-Host the value of an array element, for example $CreateGrid[1,1], inside the grid below:
Write-host " A B C D E F G H I J "
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 1"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 2"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 3"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 4"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 5"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 6"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 7"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 8"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| | | | | | | | | | | 9"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
Write-host "| $CreateGrid[1,1] | | | | | | | | | | 10"
Write-host "+---+---+---+---+---+---+---+---+---+---+ "
However when I try this, I get the following output for the value:
( System.Object[] System.Object[] System.Object[] System.Object[] System.Object[] System.Object[] System.Object[] System.Object[] System.Object[] System.Object[] System.Object[] [1,1])
How would I get around this, or is there a cleverer way?
In short I want to include the positional value of the array in the Grid shown above.
EDIT:
$CreateBoard = New-object "Array[,]" 10,10
Function Add-ToColumn{
    param ([Int] $columnnum,[String] $player)
    PROCESS{
        if (0..9 -notcontains $columnnum){"Invalid move";return}
        #0 is the bottom, 9 is the top
        for($i = 0; $i -le 9; $i++)
        {
            if ($CreateBoard[$columnnum, $i] -eq $null)
            {
                $CreateBoard[$columnnum, $i] = $player
                "Coin placed in $columnnum, $i coins in the column!"
                return
            }
        }
        #if you get here, column is full
        "Invalid move"
    }
}
When you put an expression in a double quoted string, the parser stops at the first non-variable name character. So:
"$CreateGrid[1,1]"
is processed as if it were
"$CreateGrid" + "[1,1]"
and as $CreateGrid is (or at least appears to be) an array it performs a ToString on each member and concatenates the results (as this is a two dimensional array, each enumerated member is an array, hence System.Object[] multiple times).
If you use the subexpression syntax ($(...)) inside the string, the whole contained expression is processed as a PowerShell expression (e.g., you can put a pipeline in there):
"$($CreateGrid[1,1])"

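A short sketch of the difference between the two forms of string expansion, using a small two-dimensional string array. The 2x2 size and the names $grid, $plain and $sub are assumptions chosen for illustration.

```powershell
# Hypothetical 2x2 grid with one cell filled in.
$grid = New-Object 'string[,]' 2, 2
$grid[1, 1] = 'X'

# Only $grid is expanded here; "[1,1]" is kept as literal text.
$plain = "| $grid[1,1] |"

# $(...) evaluates the full indexing expression before it is inserted.
$sub = "| $($grid[1,1]) |"

$plain
$sub
```

$sub is '| X |', whereas $plain still ends with the literal text [1,1].
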
Batch Incremented File Rename

I'm trying to rename each file I have in a directory to an incremented value based on the current directory listing, so that
-------------------------
|-------------------------|
| | B1S1A800.ext |
| | B100M803.ext |
| | B100N807.ext |
| | B101S800.ext |
| | B102S803.ext |
-------------------------
Would instead look like:
-------------------------
|-------------------------|
| | 1.ext |
| | 2.ext |
| | 3.ext |
| | 4.ext |
| | 5.ext |
-------------------------
How would one go about achieving this in PowerShell?
This is a way:
$files = Get-ChildItem c:\yourpath\*.ext
$id = 1
$files | foreach {
    Rename-Item -Path $_.fullname -NewName (($id++).tostring() + $_.extension)
}
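A variation that can be tried safely in a throwaway folder first. It is a sketch under assumptions: the temp directory and the three sample file names (borrowed from the question's listing) are for illustration only, and the explicit Sort-Object is added so the numbering does not depend on the order in which Get-ChildItem happens to return the files.

```powershell
# Create a scratch directory with a few empty sample files.
$dir = Join-Path ([IO.Path]::GetTempPath()) ([Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
'B1S1A800', 'B100M803', 'B100N807' | ForEach-Object {
    New-Item -ItemType File -Path (Join-Path $dir "$_.ext") | Out-Null
}

# Sort first (this also collects the full listing before any rename happens),
# then number the files in that order.
$id = 1
Get-ChildItem -Path $dir -Filter *.ext | Sort-Object Name | ForEach-Object {
    Rename-Item -LiteralPath $_.FullName -NewName ('{0}{1}' -f ($id++), $_.Extension)
}

$names = (Get-ChildItem -Path $dir | Sort-Object Name).Name
$names
```

Adding -WhatIf to Rename-Item previews the renames without changing anything.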