Read and write to same txt file in loop with StreamReader - powershell

I have a working script in PowerShell:
$file = Get-Content -Path HKEY_USERS.txt -Raw
foreach($line in [System.IO.File]::ReadLines("EXCLUDE_HKEY_USERS.txt"))
{
    $escapedLine = [Regex]::Escape($line)
    $pattern = $("(?sm)^$escapedLine.*?(?=^\[HKEY)")
    $file -replace $pattern, ' ' | Set-Content HKEY_USERS-filtered.txt
    $file = Get-Content -Path HKEY_USERS-filtered.txt -Raw
}
For each line in EXCLUDE_HKEY_USERS.txt it performs some changes in the file HKEY_USERS.txt. So with every loop iteration it writes to this file and re-reads the same file to pick up the changes. However, Get-Content is notorious for memory leaks, so I wanted to refactor it to StreamReader and StreamWriter, but I'm having a hard time making it work.
As soon as I do:
$filePath = 'HKEY_USERS-filtered.txt';
$sr = New-Object IO.StreamReader($filePath);
$sw = New-Object IO.StreamWriter($filePath);
I get:
New-Object : Exception calling ".ctor" with "1" argument(s): "The process cannot access the file
'HKEY_USERS-filtered.txt' because it is being used by another process."
So it looks like I cannot use StreamReader and StreamWriter on same file simultaneously. Or can I?

tl;dr
Get-Content -Raw reads a file as a whole, into a single string; it is fast and adds little memory overhead.
[System.IO.File]::ReadLines() is a faster and more memory-efficient alternative to line-by-line reading with Get-Content (without -Raw), but you need to ensure that the input file is passed as a full path, because .NET's working directory usually differs from PowerShell's.
Convert-Path resolves a given relative path to a full, file-system-native one.
A PowerShell-native alternative to using [System.IO.File]::ReadLines() is the switch statement with the -File parameter, which performs similarly well while avoiding the working-directory discrepancy pitfall, and offers additional features.
There is no need to save the modified file content to disk after each iteration - just update the $file variable, and, after exiting the loop, save the value of $file to the output file.
$fileContent = Get-Content -Path HKEY_USERS.txt -Raw
# Be sure to specify a *full* path.
$excludeFile = Convert-Path -LiteralPath 'EXCLUDE_HKEY_USERS.txt'
foreach($line in [System.IO.File]::ReadLines($excludeFile)) {
    $escapedLine = [Regex]::Escape($line)
    $pattern = "(?sm)^$escapedLine.*?(?=^\[HKEY)"
    # Modify the content and save the result back to variable $fileContent
    $fileContent = $fileContent -replace $pattern, ' '
}
# After all modifications have been performed, save to the output file
$fileContent | Set-Content HKEY_USERS-filtered.txt
Building on Santiago Squarzon's helpful comments:
Get-Content does not cause memory leaks, but it can consume a lot of memory that isn't garbage-collected until an unpredictable later point in time.
The reason is that - unless the -Raw switch is used - it decorates each line read with PowerShell ETS (Extended Type System) properties containing metadata about the file of origin, such as its path (.PSPath) and the line number (.ReadCount).
This both consumes extra memory and slows the command down - GitHub issue #7537 asks for a way to opt out of this wasteful decoration, as it typically isn't needed.
However, reading with -Raw is efficient, because the entire file content is read into a single, multi-line string, which means that the decoration is only performed once.
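If you want to gauge the difference on your own files, a rough (non-rigorous) comparison with Measure-Command might look like the sketch below; the file name is a placeholder, and the numbers will vary with file size and PowerShell version:
# Rough timing sketch - substitute your own file.
$path = Convert-Path .\HKEY_USERS.txt   # full path, needed for the .NET call below
(Measure-Command { Get-Content -Path $path }).TotalMilliseconds         # line by line, each line ETS-decorated
(Measure-Command { Get-Content -Path $path -Raw }).TotalMilliseconds    # one string, decorated only once
(Measure-Command { [System.IO.File]::ReadLines($path) | Out-Null }).TotalMilliseconds  # .NET lazy line enumeration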
So it looks like I cannot use StreamReader and StreamWriter on same file simultaneously. Or can I?
No, you cannot. You cannot simultaneously read from a file and overwrite it.
To update / replace an existing file you have two options (note that, for a fully robust solution, all attributes of the original file (except the last write time and size) should be retained, which requires extra work):
Read the old content into memory in full, perform the desired modification in memory, then write the modified content back to the original file, as shown in the top section.
There is a slight risk of data loss, however, namely if the process of writing back to the file gets interrupted.
More safely, write the modified content to a temporary file and, upon successful completion, replace the original file with the temporary one.
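A minimal sketch of that safer variant, reusing $fileContent and the file names from the question (New-TemporaryFile requires PowerShell 5+; as noted above, the original file's attributes are still not preserved):
# Write the modified content to a temp file first.
$tempFile = New-TemporaryFile
$fileContent | Set-Content -LiteralPath $tempFile.FullName
# Replace the target only after the temp file was written successfully.
Move-Item -LiteralPath $tempFile.FullName -Destination HKEY_USERS-filtered.txt -Force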

Related

Powershell: Efficient way to delete first 10 rows of a HUGE textfile

I need to delete the first couple of lines of a .txt file in PowerShell. There are already plenty of questions and answers here on how to do it. Most of them copy the whole file content into memory, cut out the first x lines and then save the content into the text file again.
However, in my case the text files are huge (500MB+), so loading them completely into memory just to delete the first couple of lines takes very long and feels like a huge waste of resources.
Is there a more elegant approach? If you only want to read the first x lines, you can use
Get-Content in.csv -Head 10
which only reads the first 10 lines. Is there something similar for deletion?
Here is another way to do it using StreamReader and StreamWriter; as noted in the comments, it's important to know the encoding of your file for this use case.
See Remarks from the Official Documentation:
The StreamReader object attempts to detect the encoding by looking at the first four bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, big-endian Unicode, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
If you need to specify an Encoding you can target the StreamReader(String, Encoding) Constructor. For example:
$reader = [System.IO.StreamReader]::new('path\to\input.csv', [System.Text.Encoding]::UTF8)
As noted previously in Remarks, this might not be needed for common encodings.
An alternative to the code below, as Brice points out in his comment, could be to use $reader.ReadToEnd() after skipping the first 10 lines; however, that would read the entire remaining contents of the file into memory before writing to the new file. I haven't used this method for this answer, since mklement0's helpful answer provides a very similar solution and this answer is intended to be a memory-friendly one.
try {
    $reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
    $writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')
    # skip 10 lines
    foreach($i in 1..10) {
        $null = $reader.ReadLine()
    }
    while(-not $reader.EndOfStream) {
        $writer.WriteLine($reader.ReadLine())
    }
}
finally {
    ($reader, $writer).foreach('Dispose')
}
It's also worth noting zett42's helpful comment: using the $reader.ReadBlock(Char[], Int32, Int32) method and $writer.Write(..) instead of $writer.WriteLine(..) could be an even faster and still memory-friendly alternative that reads and writes in chunks.
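A rough sketch of that chunked variant, building on the code above; the 64KB buffer size is an arbitrary choice and the paths are placeholders:
try {
    $reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
    $writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')
    # skip 10 lines, as before
    foreach ($i in 1..10) {
        $null = $reader.ReadLine()
    }
    # copy the rest in chunks instead of line by line
    $buffer = [char[]]::new(64KB)
    while (($read = $reader.ReadBlock($buffer, 0, $buffer.Length)) -gt 0) {
        $writer.Write($buffer, 0, $read)
    }
}
finally {
    ($reader, $writer).foreach('Dispose')
}
Because the remaining content is copied as raw characters, this variant also happens to preserve the original newline format of the copied portion.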
You're essentially attempting to remove the starting bytes of the file without modifying the remaining bytes, Raymond C has a good read posted here about why that can't be done.
The underlying abstract model for storage of file contents is in the form of a chunk of bytes, each indexed by the file offset. The reason appending bytes and truncating bytes is so easy is that doing so doesn’t alter the file offsets of any other bytes in the file. If a file has ten bytes and you append one more, the offsets of the first ten bytes stay the same. On the other hand, deleting bytes from the front or middle of a file means that all the bytes that came after the deleted bytes need to “slide down” to close up the space. And there is no “slide down” file system function.
As Mike Anthony's helpful answer explains, there is no system-level function that efficiently implements what you're trying to do, so you have no choice but to rewrite your file.
While memory-intensive, the following solution is reasonably fast:
Read the file as a whole into memory, as a single string, using Get-Content's -Raw switch...
This is orders of magnitude faster than the line-by-line streaming that Get-Content performs by default.
... then use regex processing to strip the first 10 lines ...
... and save the trimmed content back to disk.
Important:
Since this rewrites the file in place, be sure to have a backup copy of your file.
Use -Encoding with Get-Content / Set-Content to correctly interpret the input / control the output character encoding (PowerShell fundamentally doesn't preserve the information about the character encoding of a file that was read with Get-Content). Without -Encoding, the default encoding is the system's active ANSI code page in Windows PowerShell, and, more sensibly, BOM-less UTF-8 in PowerShell (Core) 7+.
# Use -Encoding as needed.
(Get-Content -Raw in.csv) -replace '^(?:.*\r?\n){10}' |
Set-Content -NoNewLine in.csv
If the file is too large to fit into memory:
If you happen to have WSL installed, an efficient, streaming tail solution is possible:
Note:
Your input file must use a character encoding in which a LF character is represented as a single 0xA byte - which is true of most single-byte encodings and also of the variable-width UTF-8 encoding, but not of, say, UTF-16.
You must output to a different file (which you can later replace the input file with).
bash.exe -c 'tail +11 in.csv > out.csv'
Otherwise, line-by-line processing is required.
Note: I'm leaving aside other viable approaches, namely those that either read and write the file in large blocks, as zett42 recommends, or an approach that collects (large) groups of output lines before writing them to the output file in a single operation, as shown in Theo's helpful answer.
Caveat:
All line-by-line processing approaches risk inadvertently changing the newline format of the original file: on writing the lines back to a file, it is invariably the platform-native newline format that is used (CRLF on Windows, LF on Unix-like platforms).
Also, the information as to whether the input file had a trailing newline or not is lost.
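If preserving the newline format matters, one possible mitigation - shown only as a sketch - is to sample the start of the file, decide between CRLF and LF, and configure a StreamWriter accordingly; this assumes the newline format is consistent throughout the file and uses in.csv / out.csv as placeholder names:
$inPath = Convert-Path in.csv
$reader = [System.IO.StreamReader]::new($inPath)
$writer = [System.IO.StreamWriter]::new("$PWD/out.csv")
try {
    # Sample the first 4KB to detect the newline format and make the writer reproduce it.
    $probe = [char[]]::new(4KB)
    $null = $reader.ReadBlock($probe, 0, $probe.Length)
    $writer.NewLine = if ((-join $probe) -match "`r`n") { "`r`n" } else { "`n" }
    # Rewind and process line by line, skipping the first 10 lines.
    $reader.BaseStream.Position = 0
    $reader.DiscardBufferedData()
    $lineNo = 0
    while (-not $reader.EndOfStream) {
        $line = $reader.ReadLine()
        if (++$lineNo -le 10) { continue }
        $writer.WriteLine($line)
    }
}
finally {
    ($reader, $writer).foreach('Dispose')
}
The trailing-newline information is still lost, and the usual character-encoding considerations apply.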
Santiago's helpful answer shows a solution based on .NET APIs, which performs well by PowerShell standards.
Brice came up with an elegant and significant optimization that lets a .NET method perform the (lazy) iteration over the file's lines, which is much faster than looping in PowerShell code:
[System.IO.File]::WriteAllLines(
    "$pwd/out.csv",
    [Linq.Enumerable]::Skip(
        [System.IO.File]::ReadLines("$pwd/in.csv"),
        10
    )
)
For the sake of completeness, here's a comparatively slower, PowerShell-native solution using a switch statement with the -File parameter for fast line-by-line reading (much faster than Get-Content):
& {
    $i = 0
    switch -File in.csv {
        default { if (++$i -ge 11) { $_ } }
    }
} | Set-Content out.csv # use -Encoding as needed
Note:
Since switch doesn't allow specifying a character encoding for the input file, this approach only works if the character encoding is correctly detected / assumed by default. While BOM-based files will be read correctly, note that switch makes different assumptions about BOM-less files based on the PowerShell edition: in Windows PowerShell, the system's active ANSI code page is assumed; in PowerShell (Core) 7+, it is UTF-8.
Because language statements cannot directly serve as pipeline input, the switch statement must be called via a script block (& { ... }).
Streaming the resulting lines to Set-Content via the pipeline is what slows this solution down. Passing the new file content as an argument to Set-Content's -Value parameter would drastically speed up the operation - but that would again require that the file fit into memory as a whole:
# Faster reformulation, but *the input file must fit into memory as a whole*.
# `switch` offers a lot of flexibility. If that isn't needed
# and reading the file in full is acceptable, the
# Get-Content -Raw solution at the top is the fastest PowerShell solution.
Set-Content out.csv $(
    $i = 0
    switch -File in.csv {
        default { if (++$i -ge 11) { $_ } }
    }
)
There may be another alternative: using switch to read the files line by line while buffering up to a certain maximum number of lines in a List.
This would be lean on memory consumption and at the same time limit the number of disk writes to speed up the process.
Something like this perhaps:
$maxBuffer = 10000 # the maximum number of lines to buffer
$linesBuffer = [System.Collections.Generic.List[string]]::new()
# get an array of the files you need to process
$files = Get-ChildItem -Path 'X:\path\to\the\input\files' -Filter '*.txt' -File
foreach ($file in $files) {
    # initialize a counter for omitting the first 10 lines and clear the buffer
    $omitCounter = 0
    $linesBuffer.Clear()
    # create a new file path by appending '_New' to the input file's basename
    $outFile = '{0}\{1}_New{2}' -f $file.DirectoryName, $file.BaseName, $file.Extension
    switch -File $file.FullName {
        default {
            if ($omitCounter -ge 10) {
                if ($linesBuffer.Count -eq $maxBuffer) {
                    # write out the buffer to the new file and clear it for the next batch
                    Add-Content -Path $outFile -Value $linesBuffer
                    $linesBuffer.Clear()
                }
                $linesBuffer.Add($_)
            }
            else { $omitCounter++ } # no output, just increment the counter
        }
    }
    # here, check if there is still some data left in the buffer
    if ($linesBuffer.Count) { Add-Content -Path $outFile -Value $linesBuffer }
}

In Powershell I'm receiving an "OutOfMemoryException" when working with files over 1gb

I am doing some file clean up before loading into my data warehouse and have run into a file sizing issue:
(Get-Content -path C:\Workspace\workfile\myfile.txt -Raw) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
My file is about 2GB. I am receiving the following error and not sure how to correct.
Get-Content : Exception of type 'System.OutOfMemoryException' was
thrown, ........
I am NOT a coder, but I do like learning so am building my own data warehouse. So if you do respond, keep my experience level in mind :)
A performant way of reading a text file line by line - without loading the entire file into memory - is to use a switch statement with the -File parameter.
A performant way of writing a text file is to use a System.IO.StreamWriter instance.
As Mathias points out in his answer, using verbatim \" with the regex-based -replace operator actually replaces " alone, due to the escaping rules of regexes. While you could address that with '\\"', in this case a simpler and better-performing alternative is to use the [string] type's Replace() method, which operates on literal substrings.
To put it all together:
# Note: Be sure to use a *full* path, because .NET's working dir. usually
# differs from PowerShell's.
$streamWriter = [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
switch -File C:\Workspace\workfile\myfile.txt {
    default { $streamWriter.WriteLine($_.Replace('\"', '"')) }
}
$streamWriter.Close()
Note: If you're using an old version of Windows PowerShell, namely version 4 or below, use
New-Object System.IO.StreamWriter 'C:\Workspace\workfile\myfileCLEAN.txt'
instead of
[System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
Get-Content -Raw makes PowerShell read the entire file into a single string.
.NET can't store individual objects over 2GB in size in memory, and each character in a string takes up 2 bytes, so after reading the first ~1 billion characters (roughly equivalent to a 1GB ASCII-encoded text file), it reaches the memory limit.
Remove the -Raw switch, -replace is perfectly capable of operating on multiple input strings at once:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
Beware that -replace is a regex operator, and if you want to remove \ from a string, you need to escape it:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
While this will work, it'll still be slow due to the fact that we're still loading >2GB of data into memory before applying -replace and writing to the output file.
Instead, you might want to pipe the output from Get-Content to the ForEach-Object cmdlet:
Get-Content -Path C:\Workspace\workfile\myfile.txt | ForEach-Object {
    $_ -replace '\\"', '"'
} | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
This allows Get-Content to start pushing output prior to finishing reading the file, and PowerShell therefore no longer needs to allocate as much memory as before, resulting in faster execution.
Get-Content loads the whole file into memory.
Try processing line by line to improve memory utilization.
$infile = "C:\Workspace\workfile\myfile.txt"
$outfile = "C:\Workspace\workfile\myfileCLEAN.txt"
foreach ($line in [System.IO.File]::ReadLines($infile)) {
    Add-Content -Path $outfile -Value ($line -replace '\\"', '"')
}
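Note that Add-Content reopens the output file for every line, which makes this quite slow on large files. A sketch of the same idea that keeps a single StreamWriter open for the whole run (essentially converging on the StreamWriter approach shown earlier):
$infile = "C:\Workspace\workfile\myfile.txt"
$outfile = "C:\Workspace\workfile\myfileCLEAN.txt"
$writer = [System.IO.StreamWriter]::new($outfile)
try {
    foreach ($line in [System.IO.File]::ReadLines($infile)) {
        # literal string replacement, no regex involved
        $writer.WriteLine($line.Replace('\"', '"'))
    }
}
finally {
    $writer.Dispose()
}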

Windows Powershell - delete a line by line number

I have a large CSV file (1.6 GB). How can I delete a specific line, e.g. line 1005?
Note: The solutions below eliminate a single line from any text-based file by line number. As marsze points out, additional considerations may apply to CSV files, where care must be taken not to eliminate the header row, and rows may span multiple lines if they have values with embedded newlines; use of a CSV parser is a better choice in that case.
If performance isn't paramount, here's a memory-friendly pipeline-based way to do it:
Get-Content file.txt |
Where-Object ReadCount -ne 1005 |
Set-Content -Encoding Utf8 new-file.txt
Get-Content adds a (somewhat obscurely named) .ReadCount property to each line it outputs, which contains the 1-based line number.
Note that the input file's character encoding isn't preserved by Get-Content, so you should control Set-Content's output encoding explicitly, as shown above, using UTF-8 as an example.
Unless you read the whole file into memory, you must output to a new file, at least temporarily; you can then replace the original file with the temporary output file with
Move-Item -Force new-file.txt file.txt
A faster, but memory-intensive alternative based on direct use of the .NET framework, which also allows you to update the file in place:
$file = 'file.txt'
$lines = [IO.File]::ReadAllLines("$PWD/$file")
Set-Content -Encoding UTF8 $file -Value $lines[0..1003 + 1005..($lines.Count-1)]
Note the need to use "$PWD/$file", i.e., to explicitly prepend the current directory path to the relative path stored in $file, because the .NET framework's idea of what the current directory is differs from PowerShell's.
While $lines = Get-Content $file would be functionally equivalent to $lines = [IO.File]::ReadAllLines("$PWD/$file"), it would perform noticeably poorer.
0..1003 creates an array of indices from 0 to 1003; + concatenates that array with indices 1005 through the rest of the input array; note that array indices are 0-based, whereas line numbers are 1-based.
Also note how the resulting array is passed to Set-Content as a direct argument via -Value, which is faster than passing it via the pipeline (... | Set-Content ...), where element-by-element processing would be performed.
Finally, a memory-friendly method that is faster than the pipeline-based method:
$file = 'file.txt'
$outFile = [IO.File]::CreateText("$PWD/new-file.txt")
$lineNo = 0
try {
    foreach ($line in [IO.File]::ReadLines("$PWD/$file")) {
        if (++$lineNo -eq 1005) { continue }
        $outFile.WriteLine($line)
    }
} finally {
    $outFile.Dispose()
}
Note the use of "$PWD/..." in the .NET API calls, which ensures that a full path is passed, which is necessary, because .NET's working directory usually differs from PowerShell's.
As with the pipeline-based command, you may have to replace the original file with the new file afterwards.
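To see that working-directory discrepancy for yourself, here is a quick sketch (C:\Windows is just an example directory):
Set-Location C:\Windows          # changes PowerShell's current location only
(Get-Location).Path              # -> C:\Windows
[Environment]::CurrentDirectory  # -> typically still the directory the session started in
# Therefore, relative paths passed to .NET APIs may resolve against the wrong directory;
# prepending "$PWD/" (or using Convert-Path) avoids the problem.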

Powershell script write back to sources from drag and drop

I need to create a powershell script that removes quotes from CSV files in a user friendly drag and drop way. I have the basics of the script down courtesy of this page:
http://blogs.technet.com/b/heyscriptingguy/archive/2011/11/02/remove-unwanted-quotation-marks-from-csv-files-by-using-powershell.aspx
And I've already successfully made .ps1 files drag and droppable courtesy of this Stack Overflow question:
Drag and Drop to a Powershell script
The author of the answer implies that it's just as easy to drop a single file, many files, and folders with lots of files in them. However, I have yet to figure this out in a way that can also write back to the source file. Here's my current code:
Param([string[]]$file)
(gc $file) | % {$_ -replace '"', ""} | out-file C:\Users\pfoster\Desktop\Output\test.txt -Fo -En ascii
Currently, this will only accept a single file, and it outputs the result as a .txt to a specified file regardless of the source file type (I can change that to CSV easily, but I'd like the script to mirror the source). Ideally, I'd like it to accept files and folders, and to rewrite the source files. I have a feeling this would involve Get-ChildItem, but I'm not sure how to implement that in the current scenario. I've also tried out-file $file and that didn't work either.
Thanks for the help!
For writing the modified content back to the original files try something like this:
foreach ($file in $args) {
    (Get-Content $file) -replace '"', '' | Out-File $file -Encoding ASCII -Force
}
Use a foreach loop, because you need the file name in more than one place in the pipeline. Reading the content in a grouping expression (the parentheses around Get-Content) and then piping the modified content into the Out-File cmdlet makes sure that the output file is only written after the content has already been read.
Don't use a redirection operator ((Get-Content $file) >$file), because that would first open the file for writing (effectively truncating it) and afterwards read the content from the now empty file.
Beware that this approach may cause problems with large files, because each file is read completely into RAM before it's processed and written back to disk. If a file doesn't fit into the available RAM, the computer will start swapping, thus causing significant performance degradation.
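Since you also want to be able to drop folders, here is a sketch that first expands any directory arguments into the files they contain; the *.csv filter and the in-place rewrite are assumptions based on your description, so adjust as needed:
Param(
    # Collect every dropped item (files and/or folders) into one array.
    [Parameter(ValueFromRemainingArguments)]
    [string[]] $Paths
)

# Expand folder arguments into their CSV files; pass file arguments through as-is.
$files = foreach ($path in $Paths) {
    if (Test-Path -LiteralPath $path -PathType Container) {
        Get-ChildItem -LiteralPath $path -Filter *.csv -File -Recurse  # drop -Recurse for the top level only
    }
    else {
        Get-Item -LiteralPath $path
    }
}

foreach ($file in $files) {
    # Read first, then overwrite the same file (mind the large-file caveat above).
    (Get-Content -LiteralPath $file.FullName) -replace '"', '' |
        Out-File -FilePath $file.FullName -Encoding ascii -Force
}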

Find and Replace in a Large File

I want to find a piece of text in a large xml file and want to replace with some other text. The size of the file is around (50GB). I want to do this in command line. I am looking at PowerShell and want to know if it can handle the large size.
Currently I am trying something like this, but it does not like it:
Get-Content C:\File1.xml | Foreach-Object {$_ -replace "xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"", ""} | Set-Content C:\File1.xml
The text I want to replace is xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" with an empty string "".
Questions
Can PowerShell handle large files?
I don't want the replace to happen in memory and prefer streaming, assuming that will not bring the server to its knees.
Are there any other approaches I can take (different tools/strategy)?
Thanks
I had a similar need (and similar lack of powershell experience) but cobbled together a complete answer from the other answers on this page plus a bit more research.
I also wanted to avoid the regex processing, since I didn't need it either -- just a simple string replace -- but on a large file, so I didn't want it loaded into memory.
Here's the command I used (adding linebreaks for readability):
Get-Content sourcefile.txt
| Foreach-Object {$_.Replace('http://example.com', 'http://another.example.com')}
| Set-Content result.txt
Worked perfectly! Never sucked up much memory (it very obviously didn't load the whole file into memory), and just chugged along for a few minutes then finished.
Aside from worrying about reading the file in chunks to avoid loading it into memory, you need to dump to disk often enough that you aren't storing the entire contents of the resulting file in memory.
Get-Content sourcefile.txt -ReadCount 10000 |
    Foreach-Object {
        $line = $_.Replace('http://example.com', 'http://another.example.com')
        Add-Content -Path result.txt -Value $line
    }
The -ReadCount <number> parameter sets the number of lines to read at a time. ForEach-Object then writes each batch of lines as it is read. For a 30GB file filled with SQL inserts, I topped out around 200MB of memory and 8% CPU, whereas piping it all into Set-Content hit 3GB of memory before I killed it.
It does not like it because you can't read from a file and write back to it at the same time using Get-Content/Set-Content. I recommend using a temp file and then at the end, rename file1.xml to file1.xml.bak and rename the temp file to file1.xml.
Yes, as long as you don't try to load the whole file at once. Processing line by line will work but is going to be a bit slow. Use the -ReadCount parameter and set it to 1000 to improve performance.
Which command line? PowerShell? If so then you can invoke your script like so .\myscript.ps1 and if it takes parameters then c:\users\joe\myscript.ps1 c:\temp\file1.xml.
In general, for regexes I would use single quotes if you don't need to reference PowerShell variables. Then you only need to worry about regex escaping and not PowerShell escaping as well. If you need to use double quotes, then the backtick character is the escape char in double quotes, e.g. "`$p1 is set to $p1". In your example, single quoting simplifies your regex to (note: forward slashes aren't metacharacters in regex):
'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
Absolutely you want to stream this since 50GB won't fit into memory. However, this poses an issue if you process line-by-line. What if the text you want to replace is split across multiple lines?
If you don't have the split line issue then I think PowerShell can handle this.
This is my take on it, building on some of the other answers here:
Function ReplaceTextIn-File {
    Param(
        $infile,
        $outfile,
        $find,
        $replace
    )

    if (-Not $outfile) {
        $outfile = $infile
    }

    $temp_out_file = "$outfile.temp"

    Get-Content $infile | Foreach-Object { $_.Replace($find, $replace) } | Set-Content $temp_out_file

    if (Test-Path $outfile) {
        Remove-Item $outfile
    }

    Move-Item $temp_out_file $outfile
}
And called like so:
ReplaceTextIn-File -infile "c:\input.txt" -find 'http://example.com' -replace 'http://another.example.com'
The escape character in PowerShell strings is the backtick ( ` ), not backslash ( \ ). I'd give an example, but the backtick is also used by the wiki markup. :(
The only thing you should have to escape is the quotes - the periods and such should be fine without.