How to Use the PowerShell Pipeline to Avoid Large Objects?

I'm using a custom function to essentially do a DIR command (recursive file listing) on an 8TB drive (thousands of files).
My first iteration was:
$results = $PATHS | % {Get-FolderItem -Path "$($_)" } | Select Name,DirectoryName,Length,LastWriteTime
$results | Export-Csv -Path $csvfile -Force -Encoding UTF8 -NoTypeInformation -Delimiter "|"
This resulted in a HUGE $results variable and slowed the system to a crawl, with the PowerShell process spiking to 99%-100% CPU as the processing went on.
I decided to use the power of the pipeline to WRITE to the CSV file directly (presumably freeing up the memory) instead of saving to an intermediate variable, and came up with this:
$PATHS | % {Get-FolderItem -Path "$($_)" } | Select Name,DirectoryName,Length,LastWriteTime | ConvertTo-CSV -NoTypeInformation -Delimiter "|" | Out-File -FilePath $csvfile -Force -Encoding UTF8
This seemed to be working fine (the CSV file was growing and the CPU seemed stable), but then it abruptly stopped when the CSV file size hit ~200MB, and the error written to the console was "The pipeline has been stopped".
I'm not sure the CSV file size had anything to do with the error message, but I'm unable to process this large directory with either method! Any suggestions on how to allow this process to complete successfully?

Get-FolderItem runs robocopy to list the files and converts its output into a PSObject array. This is a slow operation and, strictly speaking, isn't required for the actual task. Pipelining also adds big overhead compared to the foreach statement; with thousands or hundreds of thousands of repetitions, that becomes noticeable.
The approach below speeds up the process beyond anything pipelining and standard PowerShell cmdlets can offer; it writes the info for 400,000 files on an SSD drive in 10 seconds. It combines:
.NET Framework 4 or newer (included since Windows 8, installable on Windows 7/XP): IO.DirectoryInfo's EnumerateFileSystemInfos() to enumerate the files in a non-blocking, pipeline-like fashion;
PowerShell 3 or newer, as it's faster than PS2 overall;
the foreach statement, which doesn't need to create a ScriptBlock context for each item and is thus much faster than the ForEach-Object cmdlet;
IO.StreamWriter to write each file's info immediately in a non-blocking pipeline-like fashion;
\\?\ prefix trick to lift the 260 character path length restriction;
manual queuing of directories to process to get past "access denied" errors, which otherwise would stop naive IO.DirectoryInfo enumeration;
progress reporting.
function List-PathsInCsv([string[]]$PATHS, [string]$destination) {
    $prefix = '\\?\' # UNC prefix lifts the 260-character path length restriction
    $writer = [IO.StreamWriter]::new($destination, $false, [Text.Encoding]::UTF8, 1MB)
    $writer.WriteLine('Name|Directory|Length|LastWriteTime')
    $queue = [Collections.Generic.Queue[string]]($PATHS -replace '^', $prefix)
    $numFiles = 0
    while ($queue.Count) {
        $dirInfo = [IO.DirectoryInfo]$queue.Dequeue()
        try {
            $dirEnumerator = $dirInfo.EnumerateFileSystemInfos()
        } catch {
            Write-Warning ("$_".replace($prefix, '') -replace '^.+?: "(.+?)"$', '$1')
            continue
        }
        $dirName = $dirInfo.FullName.replace($prefix, '')
        foreach ($entry in $dirEnumerator) {
            if ($entry -is [IO.FileInfo]) {
                $writer.WriteLine([string]::Join('|', @(
                    $entry.Name
                    $dirName
                    $entry.Length
                    $entry.LastWriteTime
                )))
            } else {
                $queue.Enqueue($entry.FullName)
            }
            if (++$numFiles % 1000 -eq 0) {
                Write-Progress -activity Digging -status "$numFiles files, $dirName"
            }
        }
    }
    $writer.Close()
    Write-Progress -activity Digging -Completed
}
Usage:
List-PathsInCsv 'c:\windows', 'd:\foo\bar' 'r:\output.csv'

Don't use robocopy; use a native PowerShell command, like this:
$PATHS = 'c:\temp', 'c:\temp2'
$csvfile='c:\temp\listresult.csv'
$PATHS | % {Get-ChildItem $_ -file -recurse } | Select Name,DirectoryName,Length,LastWriteTime | export-csv $csvfile -Delimiter '|' -Encoding UTF8 -NoType
Short version, for the non-purists:
$PATHS | % {gci $_ -file -rec } | Select Name,DirectoryName,Length,LastWriteTime | epcsv $csvfile -D '|' -E UTF8 -NoT

Related

Replacing \x00 (ASCII 0 , NUL) with empty string for a huge csv using PowerShell

I have this code that works like a charm for small files. It just dumps the whole file into memory, replaces NUL, and writes back to the same file. This is not really practical for huge files, where the file size is larger than the available memory. Can someone help me convert it to a streaming model so that it won't choke on huge files?
Get-ChildItem -Path "Drive:\my\folder\path" -Depth 2 -Filter *.csv |
Foreach-Object {
$content = Get-Content $_.FullName
#Replace NUL and save content back to the original file
$content -replace "`0","" | Set-Content $_.FullName
}
The way you have this structured, the entire file contents have to be read into memory. Note that reading a file into memory uses 3-4x the file size in RAM, as documented here.
Without getting into .NET classes, particularly [System.IO.StreamReader], Get-Content is actually very memory efficient; you just have to leverage the pipeline so you don't build up the data in memory.
Note: if you do decide to try StreamReader, the article will give you some syntax clues. Moreover, that topic has been covered by many others on the web.
Get-ChildItem -Path "C:\temp" -Depth 2 -Filter *.csv |
ForEach-Object{
$CurrentFile = $_
$TmpFilePath = Join-Path $CurrentFile.Directory.FullName ($CurrentFile.BaseName + "_New" + $CurrentFile.Extension)
Get-Content $CurrentFile.FullName |
ForEach-Object{ $_ -replace "`0","" } |
Add-Content $TmpFilePath
# Now that you've got the new file you can rename it & delete the original:
Remove-Item -Path $CurrentFile.FullName
Rename-Item -Path $TmpFilePath -NewName $CurrentFile.Name
}
This is a streaming model: Get-Content streams inside the outer ForEach-Object loop. There may be other ways to do it, but I chose this so I could keep track of the names and do the file swap at the end...
Note: Per the same article, Get-Content is quite slow in terms of speed. However, your original code was likely already suffering that burden. You can speed it up a bit using the -ReadCount XXXX parameter, which sends some number of lines down the pipe at a time. That of course uses more memory, so you'd have to find a level that keeps you within the bounds of your available RAM (a sketch follows below). The performance improvement with -ReadCount is mentioned in this answer's comments.
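As a rough sketch of that idea (the batch size of 5000 is an arbitrary placeholder, not a recommendation), the inner pipeline from the example above could become:
# Sketch: same pipeline as above, but Get-Content emits batches of up to 5000 lines;
# -replace works element-wise on each batch, so the rest of the pipeline is unchanged.
Get-Content $CurrentFile.FullName -ReadCount 5000 |
    ForEach-Object{ $_ -replace "`0","" } |
    Add-Content $TmpFilePath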
Update Based on Comments:
Here's an example of using StreamReader/Writer to perform the same operations from the previous example. This should be just as memory efficient as Get-Content, but should be much faster.
Get-ChildItem -Path "C:\temp" -Depth 2 -Filter *.csv |
ForEach-Object{
$CurrentFile = $_.FullName
$CurrentName = $_.Name
$TmpFilePath = Join-Path $_.Directory.FullName ($_.BaseName + "_New" + $_.Extension)
$StreamReader = [System.IO.StreamReader]::new( $CurrentFile )
$StreamWriter = [System.IO.StreamWriter]::new( $TmpFilePath )
While( !$StreamReader.EndOfStream )
{
$StreamWriter.WriteLine( ($StreamReader.ReadLine() -replace "`0","") )
}
$StreamReader.Close()
$StreamWriter.Close()
# Now that you've got the new file you can rename it & delete the original:
Remove-Item -Path $CurrentFile
Rename-Item -Path $TmpFilePath -NewName $CurrentName
}
Note: I have some sense this issue is rooted in encoding. The stream constructors do accept an Encoding object as an argument.
Available Encodings:
[System.Text.Encoding]::BigEndianUnicode
[System.Text.Encoding]::Default
[System.Text.Encoding]::Unicode
[System.Text.Encoding]::UTF32
[System.Text.Encoding]::UTF7
[System.Text.Encoding]::UTF8
So if you wanted to instantiate the streams with, for example, UTF8:
$StreamReader = [System.IO.StreamReader]::new( $CurrentFile, [System.Text.Encoding]::UTF8 )
$StreamWriter = [System.IO.StreamWriter]::new( $TmpFilePath, [System.Text.Encoding]::UTF8 )
The streams default to UTF-8. The system default, by contrast, is typically an ANSI code page such as Windows-1252; you can check with the snippet below.
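A trivial sketch for checking what "default" means on a given machine (the result differs between Windows PowerShell and PowerShell 7+):
# ANSI code page (e.g. Windows-1252) on .NET Framework / Windows PowerShell; UTF-8 on .NET / PowerShell 7+
[System.Text.Encoding]::Default.EncodingName
[System.Text.Encoding]::Default.CodePage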
This would be the simplest way using the least memory: one line at a time, written to another file. But it needs double the disk space.
get-content file.txt | % { $_ -replace "`0" } | set-content file2.txt

How can I (efficiently) match content (lines) of many small files with content (lines) of a single large file and update/recreate them

I've tried solving the following case:
many small text files (in subfolders) need their content (lines) matched to lines that exist in another (large) text file. The small files then need to be updated or copied with those matching Lines.
I was able to come up with some running code for this, but I need to improve it or use a completely different method, because it is extremely slow and would take >40h to get through all the files.
One idea I already had was to use a SQL Server to bulk-import all files in a single table with [relative path],[filename],[jap content] and the translation file in a table with [jap content],[eng content] and then join [jap content] and bulk-export the joined table as separate files using [relative path],[filename]. Unfortunately I got stuck right at the beginning due to formatting and encoding issues so I dropped it and started working on a PowerShell script.
Now in detail:
Over 40k txt files spread across multiple subfolders, with multiple lines each; every line can exist in multiple files.
Content:
UTF-8 encoded Japanese text that can also contain special characters like \\[*+(), each line ending with a tab character. They look like CSV files, but they don't have headers.
One large file with >600k lines containing the translations for the small files. Every line is unique within this file.
Content:
Again UTF-8 encoded Japanese text. Each line is formatted like this (without the brackets):
[Japanese Text][tab][English Text]
Example:
テスト[1] Test [1]
The end result should be a copy or an updated version of all these small files, with their lines replaced by the matching lines from the translation file, while maintaining their relative paths.
What I have at the moment:
$translationfile = 'B:\Translation.txt'
$inputpath = 'B:\Working'
$translationarray = [System.Collections.ArrayList]@()
$translationarray = @(Get-Content $translationfile -Encoding UTF8)
Get-Childitem -path $inputpath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
    $_.Name
    $filepath = ($_.Directory.FullName).substring(2)
    $filearray = [System.Collections.ArrayList]@()
    $filearray = @(Get-Content -path $_.FullName -Encoding UTF8)
    $filearray = $filearray | ForEach-Object {
        $result = $using:translationarray -match ("^$_" -replace '[[+*?()\\.]','\$&')
        if ($result) {
            $_ = $result
        }
        $_
    }
    If(!(test-path B:\output\$filepath)) {New-Item -ItemType Directory -Force -Path B:\output\$filepath}
    #$("B:\output\"+$filepath+"\")
    $filearray | Out-File -FilePath $("B:\output\" + $filepath + "\" + $_.Name) -Force -Encoding UTF8
} -ThrottleLimit 10
I would appreciate any help and ideas, but please keep in mind that I rarely write scripts, so anything too complex might fly right over my head.
Thanks
As zett42 states, using a hash table is your best option for mapping the Japanese-only phrases to the dual-language lines.
Additionally, use of .NET APIs for file I/O can speed up the operation noticeably.
# Be sure to specify all paths as full paths, not least because .NET's
# current directory usually differs from PowerShell's.
$translationfile = 'B:\Translation.txt'
$inPath = 'B:\Working'
$outPath = (New-Item -Type Directory -Force 'B:\Output').FullName
# Build the hashtable mapping the Japanese phrases to the full lines.
# Note that ReadLines() defaults to UTF-8.
$ht = @{ }
foreach ($line in [IO.File]::ReadLines($translationfile)) {
    $ht[$line.Split("`t")[0] + "`t"] = $line
}
Get-ChildItem $inPath -Recurse -File -Filter *.txt | Foreach-Object -Parallel {
    # Translate the lines to the matching lines including the translation
    # via the hashtable.
    # NOTE: If an input line isn't represented as a key in the hashtable,
    # it is passed through as-is.
    $lines = foreach ($line in [IO.File]::ReadLines($_.FullName)) {
        ($using:ht)[$line] ?? $line
    }
    # Synthesize the output file path, ensuring that the target dir. exists.
    $outFilePath = (New-Item -Force -Type Directory ($using:outPath + $_.Directory.FullName.Substring(($using:inPath).Length))).FullName + '/' + $_.Name
    # Write to the output file.
    # Note: If you want UTF-8 files *with BOM*, use -Encoding utf8bom
    Set-Content -Encoding utf8 $outFilePath -Value $lines
} -ThrottleLimit 10
Note: Your use of ForEach-Object -Parallel implies that you're using PowerShell [Core] 7+, where BOM-less UTF-8 is the consistent default encoding (unlike in Windows PowerShell, where default encodings vary wildly).
Therefore, in lieu of the .NET [IO.File]::ReadLines() API in a foreach loop, you could also use the more PowerShell-idiomatic switch statement with the -File parameter for efficient line-by-line text-file processing.
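For illustration, here is a rough, untested sketch of that switch -File variant; it reuses the hashtable and file variables from the example above and still assumes PowerShell 7+ for the ?? operator:
$file = $_.FullName
$lines = switch -File $file {
    # inside the switch body, $_ is the current line of the file
    default { ($using:ht)[$_] ?? $_ }
}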

PowerShell: Logging foreach changes

I have put together a script inspired by a number of sources. The purpose of the PowerShell script is to scan a directory for files (.SQL), copy all of them to a new directory (retaining the originals), scan each file against a list file (CSV format, containing 2 columns: OldValue,NewValue), and replace any strings that match. What works: moving, modifying, log creation.
What doesn't work:
Recording the changes made by the script in the .log file.
Sample usage: .\ConvertSQL.ps1 -List .\EVar.csv -Files \SQLFiles\Rel_1
Param (
[String]$List = "*.csv",
[String]$Files = "*.sql"
)
function Get-TimeStamp {
return "[{0:dd/MM/yyyy} {0:HH:mm:ss}]" -f (Get-Date)
}
$CustomFiles = "$Files\CUSTOMISED"
IF (-Not (Test-Path $CustomFiles))
{
MD -Path $CustomFiles
}
Copy-Item "$Files\*.sql" -Recurse -Destination "$CustomFiles"
$ReplacementList = Import-Csv $List;
Get-ChildItem $CustomFiles |
ForEach-Object {
$LogFile = "$CustomFiles\$_.$(Get-Date -Format dd_MM_yyyy).log"
Write-Output "$_ has been modified on $(Get-TimeStamp)." | Out-File "$LogFile"
$Content = Get-Content -Path $_.FullName;
foreach ($ReplacementItem in $ReplacementList)
{
$Content = $Content.Replace($ReplacementItem.OldValue, $ReplacementItem.NewValue)
}
Set-Content -Path $_.FullName -Value $Content
}
Thank you very much.
Edit: I've cleaned up a bit and removed my test logging files.
Here's the snippet of code that I've been testing with little success. I put the following right under $Content = $Content.Replace($ReplacementItem.OldValue, $ReplacementItem.NewValue):
if ( $_.FullName -like '*TEST*' ) {
"This is a test." | Add-Content $LogFile
}
I've also tried to pipe out the Set-Content using Out-File. The outputs I end up with are either a full copy of the contents of my CSV file or the SQL file itself. I'll continue reading up on different methods. Out of hundreds to a thousand or so lines, I simply want to be able to identify which variables in the SQL have been changed.
Instead of piping output to Add-Content, pipe the log output to: Out-File -Append
Edit: compare the content using the Compare-Object cmdlet and evaluate its output to identify where the content in each string object differs.
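A rough sketch of that idea, reusing the variable names from the script above (capture the original content before the replacement loop, then log only the differences afterwards):
$OriginalContent = Get-Content -Path $_.FullName
# ... run the replacement loop that produces $Content ...
Compare-Object -ReferenceObject $OriginalContent -DifferenceObject $Content |
    ForEach-Object { "$(Get-TimeStamp) [$($_.SideIndicator)] $($_.InputObject)" } |
    Out-File $LogFile -Append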

How to modify my powershell script to search for all jpg/jpeg on a machine

I am working on building a PowerShell script that will find all the jpeg/jpg files on a machine. This is what I have so far-
# PowerShell script to list the DLL files under the C drive
$Dir = get-childitem C:\ -recurse
# $Dir |get-member
$List = $Dir | where {$_.extension -eq ".jpg"}
$List |ft fullname |out-file C:\Users\User1\Desktop\dll.txt
# List | format-table name
The only problem is that some of the files I am looking for don't have the extension jpg/jpeg. I know that you can look in the header of the file and if it says ÿØÿà then it is a jpeg/jpg but I don't know how to incorporate this into the script.
Any help would be appreciated. Thanks so much!
I'm not sure how to use PowerShell-native commands to look at file headers; I will do some research on it because it sounds fun. Until then, I can suggest a shorter version of your initial command, reducing it to a one-liner.
Get-ChildItem -Recurse -include *.jpg | Format-table -Property Fullname | Out-file C:\Users\User1\Desktop\Jpg.txt
or
ls -r -inc *.jpg | ft Fullname
EDITED: removed redundant code, thanks @nick.
I'll let you know what I find if I find anything at all.
Chris
The following will retrieve files with either a .jpg/.jpeg extension or that contain a JPEG header in the first four bytes:
[Byte[]] $jpegHeader = 255, 216, 255, 224;
function IsJpegFile([System.IO.FileSystemInfo] $file)
{
# Exclude directories
if ($file -isnot [System.IO.FileInfo])
{
return $false;
}
# Include files with either a .jpg or .jpeg extension, case insensitive
if ($file.Extension -match '^\.jpe?g$')
{
return $true;
}
# Read up to the first $jpegHeader.Length bytes from $file
[Byte[]] $fileHeader = @(
Get-Content -Path $file.FullName -Encoding Byte -ReadCount 0 -TotalCount $jpegHeader.Length
);
if ($fileHeader.Length -ne $jpegHeader.Length)
{
# The length of the file is less than the JPEG header length
return $false;
}
# Compare each byte in the file header to the JPEG header
for ($i = 0; $i -lt $fileHeader.Length; $i++)
{
if ($fileHeader[$i] -ne $jpegHeader[$i])
{
return $false;
}
}
return $true;
}
[System.IO.FileInfo[]] $jpegFiles = @(
Get-ChildItem -Path 'C:\' -Recurse `
| Where-Object { IsJpegFile $_; }
);
$jpegFiles | Format-Table 'FullName' | Out-File 'C:\Users\User1\Desktop\dll.txt';
Note that the -Encoding and -TotalCount parameters of the Get-Content cmdlet are used to read only the first four bytes of each file, not the entire file. This is an important optimization as it avoids basically reading every byte of file data on your C: drive.
This should give you all files starting with the sequence "ÿØÿà":
$ref = [byte[]]@(255, 216, 255, 224)
Get-ChildItem C:\ -Recurse | ? { -not $_.PSIsContainer } | % {
$header = [System.IO.File]::ReadAllBytes($_.FullName)[0..3]
if ( (compare $ref $header) -eq $null ) {
$_.FullName
}
} | Out-File "C:\Users\User1\Desktop\dll.txt"
To find if the header starts with ÿØÿà, use:
[System.String]$imgInfo = get-content $_ # where $_ is a .jpg file such as "pic.jpg"
if($imgInfo.StartsWith("ÿØÿà"))
{
#It's a jpeg, start processing...
}
Hope this helps
I would recommend querying the Windows Search index for the JPEGs, rather than trying to sniff file contents. Searching the system index by filename is insanely fast; the downside is that you must search indexed locations.
I wrote a Windows Search querying script using the Windows SDK sample at \samples\windowssearch\oledb; you would want to query using the imaging properties. However, I'm not certain off the top of my head whether the search index uses the imaging API to look at unknown files or files without extensions. Explorer seems to know my JPEG thumbnails and metadata without the jpg extension, so I'm guessing the indexer is as clever as Explorer.
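For illustration only, here is an untested sketch of querying the index from Windows PowerShell through the Search.CollatorDSO OLE DB provider; treat the exact property names and the extension-based filter as assumptions rather than a verified query:
# Sketch: query the Windows Search index (SYSTEMINDEX) for indexed .jpg/.jpeg items
$conn = New-Object System.Data.OleDb.OleDbConnection -ArgumentList "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';"
$cmd = $conn.CreateCommand()
$cmd.CommandText = "SELECT System.ItemPathDisplay FROM SYSTEMINDEX " +
                   "WHERE System.ItemType = '.jpg' OR System.ItemType = '.jpeg'"
$conn.Open()
$reader = $cmd.ExecuteReader()
while ($reader.Read()) { $reader.GetString(0) }   # full path of each indexed JPEG
$reader.Close()
$conn.Close()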
How about this one?
jp*g is used to match both jpg and jpeg images.
$List = Get-ChildItem "C:\*.jp*g" -Recurse
$List |ft fullname |out-file C:\Users\User1\Desktop\dll.txt

Remove Top Line of Text File with PowerShell

I am trying to just remove the first line of about 5000 text files before importing them.
I am still very new to PowerShell so not sure what to search for or how to approach this. My current concept using pseudo-code:
set-content file (get-content unless line contains amount)
However, I can't seem to figure out how to do something like contains.
While I really admire the answer from @hoge, both for a very concise technique and for a wrapper function to generalize it (and I encourage upvotes for it), I am compelled to comment on the other two answers that use temp files (it gnaws at me like fingernails on a chalkboard!).
Assuming the file is not huge, you can force the pipeline to operate in discrete sections (thereby obviating the need for a temp file) with judicious use of parentheses:
(Get-Content $file | Select-Object -Skip 1) | Set-Content $file
... or in short form:
(gc $file | select -Skip 1) | sc $file
It is not the most efficient in the world, but this should work:
get-content $file |
select -Skip 1 |
set-content "$file-temp"
move "$file-temp" $file -Force
Using variable notation, you can do it without a temporary file:
${C:\file.txt} = ${C:\file.txt} | select -skip 1
function Remove-Topline ( [string[]]$path, [int]$skip=1 ) {
if ( -not (Test-Path $path -PathType Leaf) ) {
throw "invalid filename"
}
ls $path |
% { iex "`${$($_.fullname)} = `${$($_.fullname)} | select -skip $skip" }
}
I just had to do the same task, and gc | select ... | sc took over 4 GB of RAM on my machine while reading a 1.6 GB file. It didn't finish for at least 20 minutes after reading the whole file in (as reported by Read Bytes in Process Explorer), at which point I had to kill it.
My solution was to use a more .NET approach: StreamReader + StreamWriter.
See this answer for a great discussion of the performance: In Powershell, what's the most efficient way to split a large text file by record type?
Below is my solution. Yes, it uses a temporary file, but in my case, it didn't matter (it was a freaking huge SQL table creation and insert statements file):
PS> (measure-command{
$i = 0
$ins = New-Object System.IO.StreamReader "in/file/pa.th"
$outs = New-Object System.IO.StreamWriter "out/file/pa.th"
while( !$ins.EndOfStream ) {
$line = $ins.ReadLine();
if( $i -ne 0 ) {
$outs.WriteLine($line);
}
$i = $i+1;
}
$outs.Close();
$ins.Close();
}).TotalSeconds
It returned:
188.1224443
Inspired by AASoft's answer, I went out to improve it a bit more:
Avoid the loop variable $i and the comparison with 0 in every loop
Wrap the execution into a try..finally block to always close the files in use
Make the solution work for an arbitrary number of lines to remove from the beginning of the file
Use a variable $p to reference the current directory
These changes lead to the following code:
$p = (Get-Location).Path
(Measure-Command {
# Number of lines to skip
$skip = 1
$ins = New-Object System.IO.StreamReader ($p + "\test.log")
$outs = New-Object System.IO.StreamWriter ($p + "\test-1.log")
try {
# Skip the first N lines, but allow for fewer than N, as well
for( $s = 1; $s -le $skip -and !$ins.EndOfStream; $s++ ) {
$ins.ReadLine()
}
while( !$ins.EndOfStream ) {
$outs.WriteLine( $ins.ReadLine() )
}
}
finally {
$outs.Close()
$ins.Close()
}
}).TotalSeconds
The first change brought the processing time for my 60 MB file down from 5.3s to 4s. The rest of the changes are more cosmetic.
$x = get-content $file
$x[1..$x.count] | set-content $file
Just that much. A long, boring explanation follows. Get-Content returns an array. We can "index into" array variables, as demonstrated in this and other Scripting Guys posts.
For example, if we define an array variable like this,
$array = @("first item","second item","third item")
so $array returns
first item
second item
third item
then we can "index into" that array to retrieve only its 1st element
$array[0]
or only its 2nd
$array[1]
or a range of index values from the 2nd through the last.
$array[1..$array.count]
I just learned from a website:
Get-ChildItem *.txt | ForEach-Object { (get-Content $_) | Where-Object {(1) -notcontains $_.ReadCount } | Set-Content -path $_ }
Or you can use the aliases to make it short, like:
gci *.txt | % { (gc $_) | ? { (1) -notcontains $_.ReadCount } | sc -path $_ }
Another approach to removing the first line from a file, using the multiple-assignment technique. Refer to the link.
$firstLine, $restOfDocument = Get-Content -Path $filename
$modifiedContent = $restOfDocument
$modifiedContent | Out-String | Set-Content $filename
-Skip didn't work for me, so my workaround is:
$LinesCount = $(get-content $file).Count
get-content $file |
select -Last $($LinesCount-1) |
set-content "$file-temp"
move "$file-temp" $file -Force
Following on from Michael Soren's answer.
If you want to edit all .txt files in the current directory and remove the first line from each.
Get-ChildItem (Get-Location).Path -Filter *.txt |
Foreach-Object {
(Get-Content $_.FullName | Select-Object -Skip 1) | Set-Content $_.FullName
}
For smaller files you could use this:
& C:\windows\system32\more +1 oldfile.csv > newfile.csv | out-null
... but it's not very effective at processing my example file of 16 MB; it doesn't seem to terminate and release the lock on newfile.csv.