Search large .log for specific string quickly without streamreader - powershell

Problem: I need to search a large log file that is currently being used by another process. I cannot stop this other process or put a lock on the .log file. I need to quickly search this file, and I can't read it all into memory. I get that StreamReader() is the fastest, but I can't figure out how to avoid it attempting to grab a lock on the file.
$p = "Seachterm:Search"
$files = "\\remoteserver\c\temp\tryingtofigurethisout.log"
$SearchResult= Get-Content -Path $files | Where-Object { $_ -eq $p }
The code below doesn't work because I can't get a lock on the file.
$reader = New-Object System.IO.StreamReader($files)
$lines = @()
if ($reader -ne $null) {
while (!$reader.EndOfStream) {
$line = $reader.ReadLine()
if ($line.Contains($p)) {
$lines += $line
}
}
}
$lines | Select-Object -Last 1
This takes too long:
get-content $files -ReadCount 500 |
foreach { $_ -match $p }
I would greatly appreciate any pointers in how I can go about quickly and efficiently (memory wise) searching a large log file.

Perhaps this will work for you. It reads the lines of the file as fast as possible, but with one difference from your second approach (which is roughly the same as what [System.IO.File]::ReadAllLines() would do).
To collect the lines, I use a List object, which performs faster than appending to an array with +=.
$p = "Seachterm:Search"
$path = "\\remoteserver\c$\temp\tryingtofigurethisout.log"
if (!(Test-Path -Path $path -PathType Leaf)) {
Write-Warning "File '$path' does not exist"
}
else {
try {
$fileStream = [System.IO.FileStream]::new($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
$streamReader = [System.IO.StreamReader]::new($fileStream)
# or use this syntax:
# $fileMode = [System.IO.FileMode]::Open
# $fileAccess = [System.IO.FileAccess]::Read
# $fileShare = [System.IO.FileShare]::ReadWrite
# $fileStream = New-Object -TypeName System.IO.FileStream $path, $fileMode, $fileAccess, $fileShare
# $streamReader = New-Object -TypeName System.IO.StreamReader -ArgumentList $fileStream
# use a List object of type String or an ArrayList to collect the strings quickly
$lines = New-Object System.Collections.Generic.List[string]
# read the lines as fast as you can and add them to the list
while (!$streamReader.EndOfStream) {
$lines.Add($streamReader.ReadLine())
}
# close and dispose the objects used
$streamReader.Close()
$fileStream.Dispose()
# do the 'Contains($p)' after reading the file to not slow that part down
$lines.ToArray() | Where-Object { $_.Contains($p) } | Select-Object -Last 1
}
catch [System.IO.IOException] {}
}
Basically, it does what your second snippet does, with one difference: a bare StreamReader opens the file with [System.IO.FileShare]::Read, whereas this code opens it with [System.IO.FileShare]::ReadWrite, so it can read the file even while the other process holds it open for writing.
Note that exceptions may still be thrown while reading, because another application is writing to the file at the same time; hence the try{...} catch{...}.
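If memory is a tight constraint, you could also filter while reading instead of collecting every line first. A minimal sketch of that variation, reusing $p and $path from above (untested against your log, so treat it as a starting point):
$fileStream   = [System.IO.FileStream]::new($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
$streamReader = [System.IO.StreamReader]::new($fileStream)
$lastMatch = $null
while (!$streamReader.EndOfStream) {
    $line = $streamReader.ReadLine()
    # keep only the most recent matching line instead of the whole file
    if ($line.Contains($p)) { $lastMatch = $line }
}
$streamReader.Close()
$fileStream.Dispose()
$lastMatch
This trades a little per-line overhead for a flat memory footprint, since only one line is held at a time.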
Hope that helps

Related

Powershell: exclusive lock for a file during multiple set-content and get-content operations

I am trying to write a script which runs on multiple client machines and writes to a single text file on a network share.
I want to ensure that only one machine can manipulate the file at any one time, whilst the other machines run a loop to check if the file is available.
The script runs this first:
Set-Content -Path $PathToHost -Value (get-content -Path $PathToHost | Select-String -Pattern "$HostName " -NotMatch) -ErrorAction Stop
Which removes some lines if they are matching the criteria. Then I want to append a new line with this:
Add-Content $PathToHost "$heartbeat$_" -ErrorAction Stop
The problem is that between the execution of those two commands another client has access to the file and begins to write to the file as well.
I have explored the solution here: Locking the file while writing in PowerShell
$PathToHost = "C:\file.txt"
$mode = "Open"
$access = "ReadWrite"
$share = "None"
$file = [System.IO.File]::Open($PathToHost, $mode, $access, $share)
$file.close()
Which can definitely lock the file, but I am not sure how to proceed to then read and write to the file.
Any help is much appreciated.
EDIT: Solution as below thanks to twinlakes' answer
$path = "C:\Users\daniel_mladenov\hostsTEST.txt"
$mode = "Open"
$access = "ReadWrite"
$share = "none"
$file = [System.IO.File]::Open($path, $mode, $access, $share)
$fileread = [System.IO.StreamReader]::new($file, [Text.Encoding]::UTF8)
# Counts number of lines in file
$imax=0
while ($fileread.ReadLine() -ne $null){
$imax++
}
echo $imax
#resets read position to beginning
$fileread.basestream.position = 0
#reads content of whole file and discards matching lines
$content = @()
for ($i=0; $i -lt $imax; $i++){
$ContentLine = $fileread.ReadLine()
If($ContentLine -notmatch "$HostIP\s" -and $ContentLine -notmatch "$HostName\s"){
$content += $ContentLine
}
}
echo $content
#Writes remaining lines back to file
$filewrite = [System.IO.StreamWriter]::new($file)
$filewrite.basestream.position = 0
for ($i=0; $i -lt $content.length; $i++){
$filewrite.WriteLine($content[$i])
}
$filewrite.WriteLine($heartbeat)
$filewrite.Flush()
$file.SetLength($file.Position) #trims file to the content which has been written, discarding any content past that point
$file.close()
$file is a System.IO.FileStream object. You will need to call the write method on that object, which requires a byte array.
$string = # the string to write to the file
$bytes = [Text.Encoding]::UTF8.GetBytes($string)
$file.Write($bytes, 0, $bytes.Length)
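For the part about other machines waiting until the file is free, a rough sketch (the one-second delay is only illustrative, and the variable names come from your snippet) is to keep attempting the exclusive open until it succeeds:
$file = $null
while ($null -eq $file) {
    try {
        # exclusive open: fails with an IOException while another machine holds the lock
        $file = [System.IO.File]::Open($PathToHost, 'Open', 'ReadWrite', 'None')
    }
    catch [System.IO.IOException] {
        Start-Sleep -Seconds 1
    }
}
# ... read/modify/write via the stream as shown above, then release the lock
$file.Close()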

Word com object failing

SCRIPT PURPOSE
The idea behind the script is to recursively extract the text from a large amount of documents and update a field in an Azure SQL database with the extracted text. Basically we are moving away from Windows Search of document contents to an SQL full text search to improve the speed.
ISSUE
When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows. Here is the section of the script that processes the files:
foreach ($list in (Get-ChildItem ( join-path $PSScriptRoot "\FileLists\*" ) -include *.txt )) {
## Word object
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
$word.DisplayAlerts = 0
Write-Output ""
Write-Output "################# Parsing $list"
Write-Output ""
$query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
VALUES "
foreach ($file in (Get-Content $list)) {
if ($file -like "*-*" -and $file -notlike "*~*") {
Write-Output "Processing: $($file)"
Try {
$doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, "ttt")
if ($doc) {
$fileName = [io.path]::GetFileNameWithoutExtension($file)
$fileName = $filename + ".txt"
$doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
$doc.Close()
$4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
$text = Get-Content -raw "$env:TEMP\$fileName"
$text = $text.replace("'", "''")
$query += "
('$text', $4ID),"
Remove-Item -Force "$env:TEMP\$fileName"
<# Upload to azure #>
$query = $query.Substring(0,$query.Length-1)
$query += ";"
Invoke-Sqlcmd @params -Query $Query -ErrorAction "SilentlyContinue"
$query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
VALUES "
}
}
Catch {
Write-Host "$($file) failed to process" -ForegroundColor RED;
continue
}
}
}
Remove-Item -Force $list.FullName
Write-Output ""
Write-Output "Uploading to azure"
Write-Output ""
<# Upload to azure #>
Invoke-Sqlcmd @params -Query $setQuery -ErrorAction "SilentlyContinue"
$word.Quit()
TASKKILL /F /IM WINWORD.EXE
}
Basically it parses through a folder of .txt files that contain x amount of document paths, creates a T-SQL update statement and runs against an Azure SQL database after each file is fully parsed. The files are generated with the following:
if (!($continue)) {
if ($pdf){
$files = (Get-ChildItem -force -recurse $documentFolder -include *.pdf).fullname
}
else {
$files = (Get-ChildItem -force -recurse $documentFolder -include *.doc, *.docx).fullname
}
$files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
$i=0; Get-Content $documentFile -ReadCount $interval | %{$i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt")}
}
The $interval variable defines how many files are extracted for each upload to Azure. Initially I had the Word object created outside the loop and never closed until the end. Unfortunately this doesn't work: every time the script hits a file it cannot open, every file that follows fails, until it reaches the end of the inner foreach loop foreach ($file in (Get-Content $list)) {.
This means that to get the expected outcome I have to run this with an interval of 1, which takes far too long.
This is a shot in the dark
But to me it sounds like it is failing because the Word COM object is prompting you for some action since it cannot open the file, so all following items in the loop also fail. This might explain why it works when you set $Interval to 1: in that case the COM object is closed and reopened every time, which takes forever (I ran into the same thing with Excel).
What you can do is, in your catch statement, close and open a new Word COM object, which should let you continue with the loop (though it will be a bit slower if it has to reopen the COM object a lot).
If you want to debug the problem further, make the COM object visible and slowly step through your program without interacting with Word. This will show you what is happening in Word and whether any prompts are causing the application to hang.
Of course, if you want to run it at full speed, you will need to detect which documents you can't open beforehand, or you could multithread it by opening several Word COM objects, which would allow you to load several documents at a time.
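For illustration, the catch block could be adjusted roughly like this (a sketch reusing the variables from your script, not tested against your documents):
Catch {
    Write-Host "$($file) failed to process" -ForegroundColor RED
    # tear down the (possibly hung) Word instance and start a fresh one before moving on
    try { $word.Quit() } catch {}
    $word = New-Object -ComObject word.application
    $word.Visible = $false
    $word.DisplayAlerts = 0
    continue
}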
As for...
ISSUE
When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows.
... then test for this as noted here...
How to check if a word file has a password?
$filename = "C:\path\to\your.doc"
$wd = New-Object -COM "Word.Application"
try {
$doc = $wd.Documents.Open($filename, $null, $null, $null, "")
} catch {
Write-Host "$filename is password-protected!"
}
... and skip the file to avoid the failure of the remaining files.

Powershell Mass Rename files with an Excel reference list

I need help with PowerShell.
I will have to start renaming files on a weekly basis; I will be renaming more than 100 a week, each with a dynamic name.
The files I want to rename are in a folder named Scans, located at "C:\Documents\Scans", and they are in order of the time they were scanned.
I have an Excel file located at "C:\Documents\Mapping\New File Name.xlsx".
The workbook has only one sheet and the new names are in column A, with x rows. Like mentioned above, each cell will have a different value.
Please add comments to your suggestions so that I may understand what is going on, since I'm new to coding.
Thank you all for your time and help.
Although I agree with Ad Kasenally that it would be easier to use CSV files, here's something that may work for you.
$excelFile = 'C:\Documents\Mapping\New File Name.xlsx'
$scansFolder = 'C:\Documents\Scans'
########################################################
# step 1: get the new filenames from the first column in
# the Excel spreadsheet into an array '$newNames'
########################################################
$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false
$workbook = $excel.Workbooks.Open($excelFile)
$worksheet = $workbook.Worksheets.Item(1)
$newNames = @()
$i = 1
while ($worksheet.Cells.Item($i, 1).Value() -ne $null) {
$newNames += $worksheet.Cells.Item($i, 1).Value()
$i++
}
$excel.Quit()
# IMPORTANT: clean-up used Com objects
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($worksheet) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($workbook) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($excel) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
########################################################
# step 2: rename the 'scan' files
########################################################
$maxItems = $newNames.Count
if ($maxItems) {
$i = 0
Get-ChildItem -Path $scansFolder -File -Filter 'scan*' | # get a list of FileInfo objects in the folder
Sort-Object { [int]($_.BaseName -replace '\D+', '') } | # sort by the numeric part of the filename
Select-Object -First ($maxItems) | # select no more than there are items in the $newNames array
ForEach-Object {
try {
Rename-Item -Path $_.FullName -NewName $newNames[$i] -ErrorAction Stop
Write-Host "File '$($_.Name)' renamed to '$($newNames[$i])'"
$i++
}
catch {
throw
}
}
}
else {
Write-Warning "Could not get any new filenames from the $excelFile file.."
}
You may want to have 2 columns in the excel file:
original file name
target file name
From there you can save the file as a csv.
Use Import-Csv to pull the data into Powershell and a ForEach loop to cycle through each row with a command like move $item.original $item.target.
There are abundant threads describing using import-csv with forEach.
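A minimal sketch of that CSV approach, using Rename-Item rather than move (the mapping.csv path and the 'original'/'target' column names are just examples):
# each CSV row is expected to have 'original' and 'target' columns
$map = Import-Csv 'C:\Documents\Mapping\mapping.csv'
foreach ($item in $map) {
    Rename-Item -Path (Join-Path 'C:\Documents\Scans' $item.original) -NewName $item.target
}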
Good luck.

Use .NET for fast read/write of large files

I am trying to search through a number of large files and replace parts of the text, but I keep running into errors.
I tried this, but sometimes I'll get an 'out of memory' error in PowerShell:
#region The Setup
$file = "C:\temp\168MBfile.txt"
$hash = @{
ham = 'bacon'
toast = 'pancakes'
}
#endregion The Setup
$obj = [System.IO.StreamReader]$file
$contents = $obj.ReadToEnd()
$obj.Close()
foreach ($key in $hash.Keys) {
$contents = $contents -replace [regex]::Escape($key), $hash[$key]
}
try {
$obj = [System.IO.StreamWriter]$file
$obj.Write($contents)
} finally {
if ($obj -ne $null) {
$obj.Close()
}
}
Then I tried this (in the ISE), but it crashes with a popup message (sorry, I don't have the error on hand) and tries to restart the ISE:
$arraylist = New-Object System.Collections.ArrayList
$obj = [System.IO.StreamReader]$file
while (!$obj.EndOfStream) {
$line = $obj.ReadLine()
foreach ($key in $hash.Keys) {
$line = $line -replace [regex]::Escape($key), $hash[$key]
}
[void]$arraylist.Add($line)
}
$obj.Close()
$arraylist
And finally, I came across something like this, but I'm not sure how to use it properly, and I am not even sure I am going about this the right way:
$sourcestream = [System.IO.File]::Open($file)
$newstream = [System.IO.File]::Create($file)
$sourcestream.Stream.CopyTo($newstream)
$sourcestream.Close()
Any advice would be greatly appreciated.
You can start with a readcount of 1000 and tweak it based on the performance you get:
get-content textfile -Readcount 1000 |
foreach-object {do something} |
set-content textfile
or
(get-content textfile -Readcount 1000) -replace 'something','withsomething' |
set-content textfile
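If you specifically want the .NET route without loading the whole file, one hedged sketch (reusing $file and $hash from your snippet) is to stream line by line into a temporary file and swap it in afterwards:
$tempFile = [System.IO.Path]::GetTempFileName()
$reader = [System.IO.StreamReader]::new($file)
$writer = [System.IO.StreamWriter]::new($tempFile)
while (!$reader.EndOfStream) {
    $line = $reader.ReadLine()
    foreach ($key in $hash.Keys) {
        $line = $line -replace [regex]::Escape($key), $hash[$key]
    }
    $writer.WriteLine($line)
}
$reader.Close()
$writer.Close()
# replace the original only after the rewrite has finished
Move-Item -Path $tempFile -Destination $file -Force
This keeps only one line in memory at a time, at the cost of temporary disk space roughly equal to the file size.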

Optimize Word document keyword search

I'm trying to search for keywords across a large number of MS Word documents and return the results to a file. I've got a working script, but I wasn't aware of the scale, and what I've got isn't nearly efficient enough; it would take days to plod through everything.
The script as it stands now takes keywords from CompareData.txt and runs it through all the files in a specific folder, then appends it to a file.
So when I'm done I will know how many files have each specific keyword.
[cmdletBinding()]
Param(
$Path = "C:\willscratch\"
) #end param
$findTexts = (Get-Content c:\scratch\CompareData.txt)
Foreach ($Findtext in $FindTexts)
{
$matchCase = $false
$matchWholeWord = $true
$matchWildCards = $false
$matchSoundsLike = $false
$matchAllWordForms = $false
$forward = $true
$wrap = 1
$application = New-Object -comobject word.application
$application.visible = $False
$docs = Get-childitem -path $Path -Recurse -Include *.docx
$i = 1
$totaldocs = 0
Foreach ($doc in $docs)
{
Write-Progress -Activity "Processing files" -status "Processing $($doc.FullName)" -PercentComplete ($i /$docs.Count * 100)
$document = $application.documents.open($doc.FullName)
$range = $document.content
$null = $range.movestart()
$wordFound = $range.find.execute($findText,$matchCase,
$matchWholeWord,$matchWildCards,$matchSoundsLike,
$matchAllWordForms,$forward,$wrap)
if($wordFound)
{
$doc.fullname
$document.Words.count
$totaldocs ++
} #end if $wordFound
$document.close()
$i++
} #end foreach $doc
$application.quit()
"There are $totaldocs total files with $findText" | Out-File -Append C:\scratch\output.txt
#clean up stuff
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null
Remove-Variable -Name application
[gc]::collect()
[gc]::WaitForPendingFinalizers()
}
What I'd like to do is figure out a way to search each file for everything in CompareData.txt once, rather than iterate through it a bunch of times. If I was dealing with a small set of data, the approach I've got would get the job done - but I've come to find out that both the data in CompareData.txt and the source Word file directory will be very large.
Any ideas on how to optimize this?
Right now you're doing this (pseudocode):
foreach $Keyword {
create Word Application
foreach $File {
load Word Document from $File
find $Keyword
}
}
That means that if you have 100 keywords and 10 documents, you're opening and closing 100 instances of Word and loading a thousand Word documents before you're done.
Do this instead:
create Word Application
foreach $File {
load Word Document from $File
foreach $Keyword {
find $Keyword
}
}
So you only launch one instance of Word and only load each document once.
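In PowerShell that restructuring might look roughly like this, reusing the search-option variables from your script (a sketch, not a drop-in replacement):
$application = New-Object -ComObject word.application
$application.Visible = $false
foreach ($doc in $docs) {
    $document = $application.Documents.Open($doc.FullName)
    foreach ($findText in $findTexts) {
        # reset the range each time, because Find.Execute collapses it to the match
        $range = $document.Content
        $wordFound = $range.Find.Execute($findText,$matchCase,
            $matchWholeWord,$matchWildCards,$matchSoundsLike,
            $matchAllWordForms,$forward,$wrap)
        if ($wordFound) { "$($doc.FullName) contains '$findText'" }
    }
    $document.Close()
}
$application.Quit()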
As noted in the comments, you may optimize the whole process by using the OpenXML SDK, rather than launching Word:
(assuming you've installed OpenXML SDK in its default location)
# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'
# Grab the keywords and file names
$Keywords = Get-Content C:\scratch\CompareData.txt
$Documents = Get-childitem -path $Path -Recurse -Include *.docx
# hashtable to store results per document
$KeywordMatches = @{}
# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]
foreach($Docx in $Documents)
{
# create array to hold matched keywords
$KeywordMatches[$Docx.FullName] = @()
# open document, wrap content stream in streamreader
$Document = $WordDoc::Open($Docx.FullName, $false)
$DocumentStream = $Document.MainDocumentPart.GetStream()
$DocumentReader = New-Object System.IO.StreamReader $DocumentStream
# read entire document
$DocumentContent = $DocumentReader.ReadToEnd()
# test for each keyword
foreach($Keyword in $Keywords)
{
$Pattern = [regex]::Escape($KeyWord)
$WordFound = $DocumentContent -match $Pattern
if($WordFound)
{
$KeywordMatches[$Docx.FullName] += $Keyword
}
}
$DocumentReader.Dispose()
$Document.Dispose()
}
Now, you can show the keyword match count for each document:
$KeywordMatches.GetEnumerator() | Select-Object @{n='File';E={$_.Key}}, @{n='Count';E={$_.Value.Count}}
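And if you also want the per-keyword totals from your original output (how many files contain each keyword), a short follow-up sketch:
foreach ($Keyword in $Keywords) {
    $count = 0
    foreach ($matched in $KeywordMatches.Values) {
        if ($matched -contains $Keyword) { $count++ }
    }
    "There are $count files with $Keyword"
}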