Use .NET for fast read/write of large files - PowerShell

I am trying to search through a number of large files and replace parts of the text, but I keep running into errors.
I tried this, but sometimes I'll get an 'out of memory' error in PowerShell:
#region The Setup
$file = "C:\temp\168MBfile.txt"
$hash = @{
    ham   = 'bacon'
    toast = 'pancakes'
}
#endregion The Setup
$obj = [System.IO.StreamReader]$file
$contents = $obj.ReadToEnd()
$obj.Close()
foreach ($key in $hash.Keys) {
    $contents = $contents -replace [regex]::Escape($key), $hash[$key]
}
try {
    $obj = [System.IO.StreamWriter]$file
    $obj.Write($contents)
} finally {
    if ($obj -ne $null) {
        $obj.Close()
    }
}
Then I tried this (in the ISE), but it crashes with a popup message (sorry, I don't have the error on hand) and tries to restart the ISE:
$arraylist = New-Object System.Collections.ArrayList
$obj = [System.IO.StreamReader]$file
while (!$obj.EndOfStream) {
    $line = $obj.ReadLine()
    foreach ($key in $hash.Keys) {
        $line = $line -replace [regex]::Escape($key), $hash[$key]
    }
    [void]$arraylist.Add($line)
}
$obj.Close()
$arraylist
And finally, I came across something like this, but I'm not sure how to use it properly, and I am not even sure if I am going about this the right way:
$sourcestream = [System.IO.File]::Open($file)
$newstream = [System.IO.File]::Create($file)
$sourcestream.Stream.CopyTo($newstream)
$sourcestream.Close()
Any advice would be greatly appreciated.

You can start with a -ReadCount of 1000 and tweak it based on the performance you get:
Get-Content textfile -ReadCount 1000 |
    ForEach-Object { do something } |
    Set-Content textfile
or
(Get-Content textfile -ReadCount 1000) -replace 'something','withsomething' |
    Set-Content textfile
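If you want to stay closer to the .NET route from the question, a line-by-line StreamReader/StreamWriter pass into a temporary file keeps memory use flat no matter how big the file is. This is only a sketch of that idea; the .tmp file name and the final Move-Item step are my additions, not part of the original question:
$file = "C:\temp\168MBfile.txt"
$temp = "$file.tmp"   # hypothetical temp file next to the original
$hash = @{ ham = 'bacon'; toast = 'pancakes' }
$reader = [System.IO.StreamReader]::new($file)
$writer = [System.IO.StreamWriter]::new($temp)
try {
    while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        foreach ($key in $hash.Keys) {
            $line = $line -replace [regex]::Escape($key), $hash[$key]
        }
        $writer.WriteLine($line)
    }
} finally {
    $reader.Close()
    $writer.Close()
}
# swap the rewritten copy in place of the original
Move-Item -LiteralPath $temp -Destination $file -Force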

Related

Comparing two text files and outputting the differences in PowerShell

So I'm new to the PowerShell scripting world and I'm trying to compare a list of IPs in a text file against a database of IPs. If an IP from (file) does not exist in the (database) file, put it in a new file, let's call it compared.txt. When I tried to run the script, I didn't get any result. What am I missing here?
$file = Get-Content "C:\Users\zack\Desktop\file.txt"
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
foreach($line1 in $file){
    $check = 0
    foreach($line2 in $database)
    {
        if($line1 != $line2)
        {
            $check = 1
        }
        else
        {
            $check = 0
            break
        }
    }
    if ($check == 1 )
    {
        $line2 | Out-File "C:\Users\zack\Desktop\compared.txt"
    }
}
There is a problem with your use of comparison operators: unlike in C#, equality and inequality in PowerShell are -eq and -ne, and since PowerShell is a case-insensitive language, there are also case-sensitive counterparts, -ceq and -cne.
There is also a problem with your code's logic; a simple working version of it would be:
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
# iterate each line in `file.txt`
$result = foreach($line1 in Get-Content "C:\Users\zack\Desktop\file.txt") {
    # iterate each line in `database.txt`
    # this happens on each iteration of the outer loop
    $check = foreach($line2 in $database) {
        # if this line of `file.txt` is the same as this line of `database.txt`
        if($line1 -eq $line2) {
            # we don't need to keep checking, output this boolean
            $true
            # and break the inner loop
            break
        }
    }
    # if above condition was NOT true
    if(-not $check) {
        # output this line, can be `$line1` or `$line2` (same thing here)
        $line1
    }
}
$result | Set-Content path\to\comparisonresult.txt
However, there are even more simplified ways you could achieve the same results:
Using containment operators:
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
$result = foreach($line1 in Get-Content "C:\Users\zack\Desktop\file.txt") {
    if($line1 -notin $database) {
        $line1
    }
}
$result | Set-Content path\to\comparisonresult.txt
Using Where-Object:
$database = Get-Content "C:\Users\zack\Desktop\database.txt"
Get-Content "C:\Users\zack\Desktop\file.txt" | Where-Object { $_ -notin $database } |
Set-Content path\to\comparisonresult.txt
Using a HashSet<T> and its ExceptWith method (note: this will also get rid of duplicates in your file.txt):
$file = [System.Collections.Generic.HashSet[string]]@(
    Get-Content "C:\Users\zack\Desktop\file.txt"
)
$database = [string[]]@(Get-Content "C:\Users\zack\Desktop\database.txt")
$file.ExceptWith($database)
$file | Set-Content path\to\comparisonresult.txt
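If you want to keep the order and any duplicates in file.txt but still avoid re-scanning the database for every line, you could also put database.txt into a HashSet and use its Contains method (note that Contains is case-sensitive by default, unlike -notin). This is a sketch of that variation rather than part of the answer above:
$database = [System.Collections.Generic.HashSet[string]]::new(
    [string[]](Get-Content "C:\Users\zack\Desktop\database.txt")
)
Get-Content "C:\Users\zack\Desktop\file.txt" |
    Where-Object { -not $database.Contains($_) } |
    Set-Content path\to\comparisonresult.txt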

How to sort 30 million CSV records in PowerShell

I am using an OleDbConnection to sort the first column of a CSV file. The OleDb query runs successfully for up to 9 million records within about 6 minutes. But when I execute it for 10 million records, I get the following alert message:
Exception calling "ExecuteReader" with "0" argument(s): "The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or
there is not enough temporary storage space on the disk to store the query result."
Is there any other solution to sort 30 million records using PowerShell?
Here is my script:
$OutputFile = "D:\Performance_test_data\output1.csv"
$stream = [System.IO.StreamWriter]::new( $OutputFile )
$sb = [System.Text.StringBuilder]::new()
$sw = [Diagnostics.Stopwatch]::StartNew()
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='D:\Performance_test_data\';Extended Properties='Text;HDR=Yes;CharacterSet=65001;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from 1crores.csv order by col6"
$conn.open()
$data = $cmd.ExecuteReader()
echo "Query has been completed!"
$stream.WriteLine( "col1,col2,col3,col4,col5,col6")
while ($data.read())
{
$stream.WriteLine( $data.GetValue(0) +',' + $data.GetValue(1)+',' + $data.GetValue(2)+',' + $data.GetValue(3)+',' + $data.GetValue(4)+',' + $data.GetValue(5))
}
echo "data written successfully!!!"
$stream.close()
$sw.Stop()
$sw.Elapsed
$cmd.Dispose()
$conn.Dispose()
You can try using this:
$CSVPath = 'C:\test\CSVTest.csv'
$Delimiter = ';'
# list we use to hold the results
$ResultList = [System.Collections.Generic.List[Object]]::new()
# Create a stream (I use OpenText because it returns a streamreader)
$File = [System.IO.File]::OpenText($CSVPath)
# Read and parse the header
$HeaderString = $File.ReadLine()
# Get the properties from the string, replace quotes
$Properties = $HeaderString.Split($Delimiter).Replace('"',$null)
$PropertyCount = $Properties.Count
# now read the rest of the data, parse it, build an object and add it to a list
while ($File.EndOfStream -ne $true)
{
    # Read the line
    $Line = $File.ReadLine()
    # split the fields and replace the quotes
    $LineData = $Line.Split($Delimiter).Replace('"',$null)
    # Create a hashtable with the properties (we convert this to a PSCustomObject later on). I use an ordered hashtable to keep the order
    $PropHash = [System.Collections.Specialized.OrderedDictionary]@{}
    # for loop to add the properties and values
    for ($i = 0; $i -lt $PropertyCount; $i++)
    {
        $PropHash.Add($Properties[$i],$LineData[$i])
    }
    # Now convert the data to a PSCustomObject and add it to the list
    $ResultList.Add($([PSCustomObject]$PropHash))
}
# Now you can sort this list using Linq:
Add-Type -AssemblyName System.Linq
# Sort using propertyname (my sample data had a prop called "Name")
$Sorted = [Linq.Enumerable]::OrderBy($ResultList, [Func[object,string]] { $args[0].Name })
Instead of using Import-Csv, I've written a quick parser which uses a StreamReader, parses the CSV data on the fly, and puts it in a PSCustomObject. This is then added to a list.
Edit: fixed the LINQ sample.
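To write the sorted result back to a CSV file, you can pipe the Linq result straight into Export-Csv (the output path here is just an example):
$Sorted | Export-Csv 'C:\test\CSVTest_sorted.csv' -NoTypeInformation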
Putting performance aside, and to at least come to a solution that works (meaning one that doesn't hang due to memory shortage), I would rely on the PowerShell pipeline. The issue, though, is that for sorting objects you will need to stall the pipeline, as the last object might potentially become the first object.
To resolve this, I would first do a coarse division on the first character(s) of the property concerned. Once that is done, fine-sort each coarse division and append the results:
Function Sort-BigObject {
    [CmdletBinding()] param(
        [Parameter(ValueFromPipeLine = $True)]$InputObject,
        [Parameter(Position = 0)][String]$Property,
        [ValidateRange(1,9)]$Coarse = 1,
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
    )
    Begin {
        $TemporaryFiles = [System.Collections.SortedList]::new()
    }
    Process {
        if ($InputObject.$Property) {
            $Grain = $InputObject.$Property.SubString(0, $Coarse)
            if (!$TemporaryFiles.Contains($Grain)) { $TemporaryFiles[$Grain] = New-TemporaryFile }
            $InputObject | Export-Csv $TemporaryFiles[$Grain] -Encoding $Encoding -Append
        } else { $InputObject } # objects with an empty sort property are output immediately
    }
    End {
        Foreach ($TemporaryFile in $TemporaryFiles.Values) {
            Import-Csv $TemporaryFile -Encoding $Encoding | Sort-Object $Property
            Remove-Item -LiteralPath $TemporaryFile
        }
    }
}
Usage
(Don't assign the stream to a variable and don't use parentheses.)
Import-Csv .\1crores.csv | Sort-BigObject <PropertyName> | Export-Csv .\output.csv
If the temporary files still get too big to handle, you might need to increase the -Coarse parameter.
Caveats (improvement considerations)
Objects with an empty sort property will be output immediately.
The sort column is presumed to be a (single) string column.
I presume the performance is poor (I didn't do a full test on 30 million records, but 10,000 records take about 8 seconds, which suggests about 8 hours). Consider replacing native PowerShell cmdlets with .NET streaming methods, buffering/caching file input and output, or parallel processing.
You could try SQLite:
$OutputFile = "D:\Performance_test_data\output1.csv"
$sw = [Diagnostics.Stopwatch]::StartNew()
sqlite3 output1.db '.mode csv' '.import 1crores.csv 1crores' '.headers on' ".output $OutputFile" 'Select * from 1crores order by 最終アクセス日時'
echo "data written successfully!!!"
$sw.Stop()
$sw.Elapsed
I have added a new answer as this is a completely different approach to tackling this issue.
Instead of creating temporary files (which presumably causes a lot of file opens and closures), you might consider creating an ordered list of indices and then going over the input file (-InputFile) multiple times, each time processing a selective number of lines (-BufferSize = 1Gb; you might have to tweak this "memory usage vs. performance" parameter):
Function Sort-Csv {
    [CmdletBinding()] param(
        [string]$InputFile,
        [String]$Property,
        [string]$OutputFile,
        [Char]$Delimiter = ',',
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default,
        [Int]$BufferSize = 1Gb
    )
    Begin {
        if ($InputFile.StartsWith('.\')) { $InputFile = Join-Path (Get-Location) $InputFile }
        $Index = 0
        $Dictionary = [System.Collections.Generic.SortedDictionary[string, [Collections.Generic.List[Int]]]]::new()
        Import-Csv $InputFile -Delimiter $Delimiter -Encoding $Encoding | Foreach-Object {
            if (!$Dictionary.ContainsKey($_.$Property)) { $Dictionary[$_.$Property] = [Collections.Generic.List[Int]]::new() }
            $Dictionary[$_.$Property].Add($Index++)
        }
        $Indices = [int[]]($Dictionary.Values | ForEach-Object { $_ })
        $Dictionary = $Null # we only need the sorted index list
    }
    Process {
        $Start = 0
        $ChunkSize = [int]($BufferSize / (Get-Item $InputFile).Length * $Indices.Count / 2.2)
        While ($Start -lt $Indices.Count) {
            [System.GC]::Collect()
            $End = $Start + $ChunkSize - 1
            if ($End -ge $Indices.Count) { $End = $Indices.Count - 1 }
            $Chunk = @{}
            For ($i = $Start; $i -le $End; $i++) { $Chunk[$Indices[$i]] = $i }
            $Reader = [System.IO.StreamReader]::new($InputFile, $Encoding)
            $Header = $Reader.ReadLine()
            $Count = 0
            For ($i = 0; ($Line = $Reader.ReadLine()) -and $Count -lt $ChunkSize; $i++) {
                if ($Chunk.Contains($i)) { $Chunk[$i] = $Line; $Count++ }
            }
            $Reader.Dispose()
            if ($OutputFile) {
                if ($OutputFile.StartsWith('.\')) { $OutputFile = Join-Path (Get-Location) $OutputFile }
                $Writer = [System.IO.StreamWriter]::new($OutputFile, ($Start -ne 0), $Encoding)
                if ($Start -eq 0) { $Writer.WriteLine($Header) }
                For ($i = $Start; $i -le $End; $i++) { $Writer.WriteLine($Chunk[$Indices[$i]]) }
                $Writer.Dispose()
            } else {
                $Start..$End | ForEach-Object { $Header } { $Chunk[$Indices[$_]] } | ConvertFrom-Csv -Delimiter $Delimiter
            }
            $Chunk = $Null
            $Start = $End + 1
        }
    }
}
Basic usage
Sort-Csv .\Input.csv <PropertyName> -Output .\Output.csv
Sort-Csv .\Input.csv <PropertyName> | ... | Export-Csv .\Output.csv
Note that for 1crores.csv it will probably just export the full file at once unless you set the -BufferSize to a lower amount, e.g. 500Kb.
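Applied to the file from the question (assuming the sort column is col6, as in the OLE DB query), the call would look something like:
Sort-Csv .\1crores.csv col6 -OutputFile .\output1.csv -BufferSize 500Kb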
I downloaded GNU sort.exe from here: http://gnuwin32.sourceforge.net/packages/coreutils.htm It also requires libiconv2.dll and libintl3.dll from the dependency zip. I basically did this within cmd.exe, and it used a little less than a gig of RAM and took about 5 minutes. It's a 500 MB file of about 30 million random numbers. This command can also merge sorted files with --merge. You can also specify begin and end key positions for sorting with --key. It automatically uses temp files.
.\sort.exe < file1.csv > file2.csv
It actually works in a similar way to the Windows sort from the cmd prompt. The Windows sort also has a /+n option to specify at which character column to start the sort.
sort.exe < file1.csv > file2.csv
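If the sort has to be on a specific CSV column rather than the whole line, GNU sort can split on a delimiter and sort by field; a rough example, assuming a comma delimiter and sorting on the sixth column (the header line would still need to be stripped or handled separately):
.\sort.exe -t "," -k 6,6 file1.csv -o file2.csv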

Search a large .log for a specific string quickly without StreamReader

Problem: I need to search a large log file that is currently being used by another process. I cannot stop this other process or put a lock on the .log file. I need to quickly search this file, and I can't read it all into memory. I get that StreamReader() is the fastest, but I can't figure out how to avoid it attempting to grab a lock on the file.
$p = "Seachterm:Search"
$files = "\\remoteserver\c\temp\tryingtofigurethisout.log"
$SearchResult= Get-Content -Path $files | Where-Object { $_ -eq $p }
The below doesn't work because I can't get a lock on the file.
$reader = New-Object System.IO.StreamReader($files)
$lines = @()
if ($reader -ne $null) {
    while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        if ($line.Contains($p)) {
            $lines += $line
        }
    }
}
$lines | Select-Object -Last 1
This takes too long:
get-content $files -ReadCount 500 |
foreach { $_ -match $p }
I would greatly appreciate any pointers in how I can go about quickly and efficiently (memory wise) searching a large log file.
Perhaps this will work for you. It tries to read the lines of the file as fast as possible, but with a difference from your second approach (which is approximately the same as what [System.IO.File]::ReadAllLines() would do).
To collect the lines, I use a List object, which will perform faster than appending to an array using +=:
$p = "Seachterm:Search"
$path = "\\remoteserver\c$\temp\tryingtofigurethisout.log"
if (!(Test-Path -Path $path -PathType Leaf)) {
    Write-Warning "File '$path' does not exist"
}
else {
    try {
        $fileStream = [System.IO.FileStream]::new($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
        $streamReader = [System.IO.StreamReader]::new($fileStream)
        # or use this syntax:
        # $fileMode = [System.IO.FileMode]::Open
        # $fileAccess = [System.IO.FileAccess]::Read
        # $fileShare = [System.IO.FileShare]::ReadWrite
        # $fileStream = New-Object -TypeName System.IO.FileStream $path, $fileMode, $fileAccess, $fileShare
        # $streamReader = New-Object -TypeName System.IO.StreamReader -ArgumentList $fileStream
        # use a List object of type String or an ArrayList to collect the strings quickly
        $lines = New-Object System.Collections.Generic.List[string]
        # read the lines as fast as you can and add them to the list
        while (!$streamReader.EndOfStream) {
            $lines.Add($streamReader.ReadLine())
        }
        # close and dispose the objects used
        $streamReader.Close()
        $fileStream.Dispose()
        # do the 'Contains($p)' after reading the file to not slow that part down
        $lines.ToArray() | Where-Object { $_.Contains($p) } | Select-Object -Last 1
    }
    catch [System.IO.IOException] {}
}
Basically, it does what your second code does, but with the difference that when using just the StreamReader, the file is opened with [System.IO.FileShare]::Read, whereas this code opens the file with [System.IO.FileShare]::ReadWrite.
Note that there may be exceptions thrown using this because another application has write permissions to the file, hence the try{...} catch{...}.
Hope that helps
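Since the question says the file can't be read into memory all at once, here is a sketch of the same FileStream/FileShare idea that keeps only the last matching line instead of collecting every line into a list (variable names carried over from the code above):
$lastMatch = $null
$fileStream = [System.IO.FileStream]::new($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
$streamReader = [System.IO.StreamReader]::new($fileStream)
try {
    while (!$streamReader.EndOfStream) {
        $line = $streamReader.ReadLine()
        # remember only the most recent line containing the search term
        if ($line.Contains($p)) { $lastMatch = $line }
    }
}
finally {
    $streamReader.Dispose()
    $fileStream.Dispose()
}
$lastMatch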

Why Isn't This Counting Correctly | PowerShell

Right now, I have a CSV file which contains 3,800+ records. This file contains a list of server names, followed by an abbreviation stating if the server is a Windows server, Linux server, etc. The file also contains comments or documentation, where each line starts with "#", stating it is a comment. What I have so far is as follows.
$file = Get-Content .\allsystems.csv
$arraysplit = @()
$arrayfinal = @()
[int]$windows = 0
foreach ($thing in $file){
    if ($thing.StartsWith("#")) {
        continue
    }
    else {
        $arraysplit = $thing.Split(":")
        $arrayfinal = @($arraysplit[0], $arraysplit[1])
    }
}
foreach ($item in $arrayfinal){
    if ($item[1] -contains 'NT'){
        $windows++
    }
    else {
        continue
    }
}
$windows
The goal of this script is to count the total number of Windows servers. My issue is that the first "foreach" block works fine, but the second one results in "$Windows" being 0. I'm honestly not sure why this isn't working. Two example lines of data are as follows:
example:LNX
example2:NT
If the goal is to count the Windows servers, why do you need the array?
Can't you just say something like:
foreach ($thing in $file)
{
    if ($thing -notmatch "^#" -and $thing -match "NT") { $windows++ }
}
$arrayfinal = @($arraysplit[0], $arraysplit[1])
This replaces the array on every iteration of the loop.
Changing it to += gave another issue. It simply appended each individual element. I used this post's info to fix it, sort of forcing a 2d array: How to create array of arrays in powershell?.
$file = Get-Content .\allsystems.csv
$arraysplit = @()
$arrayfinal = @()
[int]$windows = 0
foreach ($thing in $file){
    if ($thing.StartsWith("#")) {
        continue
    }
    else {
        $arraysplit = $thing.Split(":")
        $arrayfinal += ,$arraysplit
    }
}
foreach ($item in $arrayfinal){
    if ($item[1] -contains 'NT'){
        $windows++
    }
    else {
        continue
    }
}
$windows
1
I also changed the file around and added more instances of both NT and other random garbage. Seems it works fine.
I'd avoid making another foreach loop just to bump the occurrence count. Your $arrayfinal also gets rewritten every time, so I used an ArrayList.
$file = Get-Content "E:\Code\PS\myPS\2018\Jun\12\allSystems.csv"
$arrayFinal = New-Object System.Collections.ArrayList($null)
foreach ($thing in $file){
if ($thing.StartsWith("#")) {
continue
}
else {
$arraysplit = $thing -split ":"
if($arraysplit[1] -match "NT" -or $arraysplit[1] -match "Windows")
{
$arrayfinal.Add($arraysplit[1]) | Out-Null
}
}
}
Write-Host "Entries with 'NT' or 'Windows' $($arrayFinal.Count)"
I'm not sure if you want to keep 'example', 'example2'... so I have skipped adding them to $arrayfinal, assuming the goal is to count "NT" or "Windows" occurrences.
The goal of this script is to count the total number of Windows servers.
I'd suggest the easy way: using cmdlets built for this.
$csv = Get-Content -Path .\file.csv |
    Where-Object { -not $_.StartsWith('#') } |
    ConvertFrom-Csv
@($csv.servertype).Where({ $_.Equals('NT') }).Count
# Compatibility mode:
# ($csv.servertype | Where-Object { $_.Equals('NT') }).Count
Replace servertype and 'NT' with whatever that header/value is called.
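If you also want a breakdown per server type rather than just the Windows count, Group-Object gives it in one line (using the same assumed servertype header):
$csv | Group-Object -Property servertype -NoElement | Sort-Object Count -Descending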

Parsing and splitting files based on the string

I have a very large file (hence .ReadLines) which I need to parse quickly and efficiently, splitting it into other files. For each line which contains a keyword, I need to copy that line and append it to a specific file. This is what I have so far; the script runs, but the files aren't getting populated.
$filename = "C:\dev\powershell\test1.csv"
foreach ($line in [System.IO.File]::ReadLines($filename)) {
    if ($line | %{$_ -match "Apple"}){Out-File -Append Apples.txt}
    elseif($line | %{$_ -match "Banana"}){Out-File -Append Bananas.txt}
    elseif($line | %{$_ -match "Pear"}){Out-File -Append Pears.txt}
}
Example content of the csv file:
Apple,Test1,Cross1
Apple,Test2,Cross2
Apple,Test3,Cross3
Banana,Test4,Cross4
Pear,Test5,Cross5
I want Apples.txt to contain:
Apple,Test1,Cross1
Apple,Test2,Cross2
Apple,Test3,Cross3
A couple of things:
Your if conditions don't need % / ForEach-Object; -match will do on its own:
foreach ($line in [System.IO.File]::ReadLines($filename)) {
    if($line -match "Apple"){
        # output to apple.txt
    }
    elseif($line -match "Banana"){
        # output to banana.txt
    }
    # etc...
}
The files aren't getting populated because you're not actually sending any output to Out-File:
foreach ($line in [System.IO.File]::ReadLines($filename)) {
    if($line -match "Apple"){
        # send $line to the file
        $line | Out-File apple.txt -Append
    }
    # etc...
}
If your files are really massive and you expect a lot of matching lines, I'd recommend using a StreamWriter for the output files - otherwise Out-File will be opening and closing the file all the time:
$OutFiles = @{
    'apple'  = New-Object System.IO.StreamWriter $PWD\apples.txt
    'banana' = New-Object System.IO.StreamWriter $PWD\bananas.txt
    'pear'   = New-Object System.IO.StreamWriter $PWD\pears.txt
}
foreach ($line in [System.IO.File]::ReadLines($filename)) {
    foreach($keyword in $OutFiles.Keys){
        if($line -match $keyword){
            $OutFiles[$keyword].WriteLine($line)
            continue
        }
    }
}
foreach($Writer in $OutFiles.Values){
    try{
        $Writer.Close()
    }
    finally{
        $Writer.Dispose()
    }
}
This way you also only have to maintain the $OutFiles hashtable if you need to update the keywords, for example.