Is this the best way to replace text in all of an object's properties in PowerShell?

I have a large CSV file in which some fields have a newline embedded. Excel 2016 produces errors when importing a CSV whose rows contain fields with an embedded newline.
Based on this post, I wrote code to replace any new line in any field with a space. Below is a code block that duplicates the functionality and issue. Option 1 works. Option 2, which is commented out, casts my object to a string. I was hoping Option 2 might run faster.
Question: Is there a better way to do this to optimize for performance processing very large files?
$array = @([PSCustomObject]@{"ID"="1"; "Name"="Joe`nSmith"},
           [PSCustomObject]@{"ID"="2"; "Name"="Jasmine Baker"})
$array = $array | ForEach-Object {
    # Option 1: produces an object, but is the code optimized?
    foreach ($n in $_.PSObject.Properties.Name) {
        $_.PSObject.Properties[$n].Value = `
            $_.PSObject.Properties[$n].Value -replace "`n", " "
    }
    # Option 2: produces a string, not an object
    #$_ = $_ -replace "`n"," "
    $_
}
Keep in mind that in my real-world use case, each row has > 15 fields and any combination of them may have one or more new lines embedded.

Use the fast TextFieldParser to read, process, and build the CSV from the file (PowerShell 3+):
Add-Type -AssemblyName Microsoft.VisualBasic
$parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser 'r:\1.csv'
$parser.SetDelimiters(',')
$header = $parser.ReadFields()
$CSV = while (!$parser.EndOfData) {
    $i = 0
    $row = [ordered]@{}
    foreach ($field in $parser.ReadFields()) {
        $row[$header[$i++]] = $field.Replace("`n", ' ')
    }
    [PSCustomObject]$row
}
Or modify each field in-place in an already existing CSV array:
foreach ($row in $CSV) {
    foreach ($field in $row.PSObject.Properties) {
        $field.Value = $field.Value.Replace("`n", ' ')
    }
}
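Either way, the cleaned rows can be written back out; a minimal sketch, assuming a hypothetical output path of r:\1_clean.csv:
# -NoTypeInformation omits the '#TYPE ...' header line that Export-Csv adds by default
$CSV | Export-Csv 'r:\1_clean.csv' -NoTypeInformation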
Notes:
foreach statement is much faster than piping to ForEach-Object (also aliased as foreach)
$stringVariable.Replace() is faster than the -replace operator
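Both claims are easy to check on your own data with Measure-Command; a rough sketch (absolute timings will vary by machine and input):
$s = ('x' * 100 + "`n") * 10000   # sample string with embedded newlines
(Measure-Command { foreach ($i in 1..20) { $null = $s -replace "`n", ' ' } }).TotalMilliseconds
(Measure-Command { foreach ($i in 1..20) { $null = $s.Replace("`n", ' ') } }).TotalMilliseconds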

Related

Check if a condition is met by a line within a TXT but "in an advanced way"

I have a huge TXT file (1300 megabytes). I want to build code that does two things:
1. Every line contains a unique ID at the beginning. For all lines with the same unique ID, I want to check whether the conditions are met for that "group" of IDs. (This tells me: for how many lines with unique ID X have all conditions been met?)
2. When the script is finished, I want to remove all lines from the TXT where the condition was met, so I can rerun the script with another condition set to "narrow down" the whole document.
After a few cycles I will finally have a set of conditions that applies to all lines in the document.
My current approach seems very slow (one cycle takes hours).
If you find an easier way to do that, feel free to recommend it.
Help is welcome :)
Code so far (does not fulfill everything from 1 & 2):
foreach ($item in $liste)
{
    # Check conditions
    if (($item -like "*XXX*") -and ($item -like "*YYY*") -and ($item -notlike "*ZZZ*")) {
        # Add a line to a document to see which lines match the condition
        Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
        # Retrieve the unique ID from the line and feed the array.
        $array += $item.Split("/")[1]
        # Remove the line from the final document
        $liste = $liste -replace $item, ""
    }
}
# Pipe the "new cleaned" list somewhere
$liste | Set-Content -Path "C:\NewListToWorkWith.txt"
# Show me the counts
$array | group | % { $h = @{} } { $h[$_.Name] = $_.Count } { $h } | Out-File "C:\Desktop\count.txt"
Demo lines:
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
Performance considerations:
Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"
Try to avoid wrapping cmdlet pipelines
See also: Mastering the (steppable) pipeline
$array += $item.Split("/")[1]
Try to avoid using the increase assignment operator (+=) to create a collection
See also: Why should I avoid using the increase assignment operator (+=) to create a collection
$liste = $liste -replace $item, ""
This is a very expensive operation considering that you are reassigning (copying) a long list ($liste) with each iteration.
Besides, it is bad practice to change an array that you are currently iterating over.
$array | group | ...
Group-Object is a rather slow cmdlet; you would do better to collect (or count) the items on the fly (where you currently do $array += $item.Split("/")[1]) using a hashtable, something like:
$Name = $item.Split("/")[1]
if (!$HashTable.Contains($Name)) { $HashTable[$Name] = [Collections.Generic.List[String]]::new() }
$HashTable[$Name].Add($Item)
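Putting all of those points together, here is a sketch of the whole cycle (paths other than the question's output files are hypothetical) that streams the file once with a StreamReader, counts matching IDs on the fly, and writes non-matching lines out directly instead of rebuilding $liste:
$HashTable = @{}
$reader = [System.IO.StreamReader]::new('C:\Desktop\list.txt')       # hypothetical input path
$writer = [System.IO.StreamWriter]::new('C:\NewListToWorkWith.txt')
while ($null -ne ($item = $reader.ReadLine())) {
    if ($item -like '*XXX*' -and $item -like '*YYY*' -and $item -notlike '*ZZZ*') {
        $Name = $item.Split('/')[1]
        $HashTable[$Name] = 1 + $HashTable[$Name]   # $null counts as 0 on the first hit
    }
    else {
        $writer.WriteLine($item)                    # keep only the lines that did not match
    }
}
$reader.Close()
$writer.Close()
$HashTable.GetEnumerator() | Out-File 'C:\Desktop\count.txt'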
To minimize memory usage it may be better to read one line at a time and check if it already exists. In the code below I used a StringReader; you can replace it with a StreamReader for reading from a file. I'm checking whether the entire string exists, but you may want to split the line. Notice that I have duplicates in the input but not in the dictionary. See the code below:
$rows= #"
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
"#
$dict = [System.Collections.Generic.Dictionary[int, System.Collections.Generic.List[string]]]::new()
$reader = [System.IO.StringReader]::new($rows)
while ($null -ne ($row = $reader.ReadLine()))
{
    $hash = $row.GetHashCode()
    if ($dict.ContainsKey($hash))
    {
        # check if the list contains the string
        if ($dict[$hash].Contains($row))
        {
            # string is a duplicate
        }
        else
        {
            # add the string to the list if it is not there yet
            # (indexing the dictionary returns the list itself, so no .Value is needed)
            $list = $dict[$hash]
            $list.Add($row)
        }
    }
    else
    {
        # add a new hash value to the dictionary
        $list = [System.Collections.Generic.List[string]]::new()
        $list.Add($row)
        $dict.Add($hash, $list)
    }
}
$dict
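Note that if the goal is only duplicate detection, a HashSet[string] does the hashing and collision handling internally, so the dictionary-of-lists bookkeeping can be skipped; a minimal sketch using the same $rows input:
$seen = [System.Collections.Generic.HashSet[string]]::new()
$reader = [System.IO.StringReader]::new($rows)
while ($null -ne ($row = $reader.ReadLine())) {
    # Add() returns $false when the string is already in the set
    if (-not $seen.Add($row)) {
        Write-Host "duplicate: $row"
    }
}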

PS Object unescape character

I have a small error when running my code. I assign a string to a custom object, but it's parsing the string by itself and throwing an error.
Code:
foreach ($item in $hrdblistofobjects) {
    [string]$content = Get-Content -Path $item
    [string]$content = $content.Replace("[", "").Replace("]", "")
    #here is line 43 which is shown in the error as well
    foreach ($object in $listofitemsdb) {
        $result = $content -match $object
        $OurObject = [PSCustomObject]@{
            ObjectName     = $null
            TestObjectName = $null
            Result         = $null
        }
        $OurObject.ObjectName = $item
        $OurObject.TestObjectName = $object #here is line 52 which is the other part of the error
        $OurObject.Result = $result
        $Resultsdb += $OurObject
    }
}
This code loads an item and checks if an object exists within that item, basically whether one string exists within another string, and then saves the result to a variable. I am using this code for other objects and items, but they don't have that \p part, which I am assuming is the issue. I can't put $object into single quotes for obvious reasons (this was suggested on the internet, but in my case it's not possible). So is there any other option for unescaping \p? I tried $object.Replace("\PMS","\\PMS") but that did not work either (this was suggested somewhere too).
EDIT:
$Resultsdb = @(foreach ($item in $hrdblistofobjects) {
    [string]$content = Get-Content -Path $item
    [string]$content = $content.Replace("[", "").Replace("]", "")
    foreach ($object in $listofitemsdb) {
        [PSCustomObject]@{
            ObjectName     = $item
            TestObjectName = $object
            Result         = $content -match $object
        }
    }
})
$Resultsdb is not defined as an array, hence you get that error when you try to add one object to another object that doesn't implement the addition operator.
You shouldn't be appending to an array in a loop anyway. That will perform poorly, because with each iteration it creates a new array with the size increased by one, copies all elements from the existing array, puts the new item in the new free slot, and then replaces the original array with the new one.
A better approach is to just output your objects in the loop and collect the loop output in a variable:
$Resultsdb = foreach ($item in $hrdblistofobjects) {
    ...
    foreach ($object in $listofitemsdb) {
        [PSCustomObject]@{
            ObjectName     = $item
            TestObjectName = $object
            Result         = $content -match $object
        }
    }
}
Run the loop in an array subexpression if you need to ensure that the result is an array; otherwise it will be empty or a single object when the loop returns fewer than two results.
$Resultsdb = @(foreach ($item in $hrdblistofobjects) {
    ...
})
Note that you need to suppress other output on the default output stream in the loop, so that it doesn't pollute your result.
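To illustrate the pollution problem, a small self-contained sketch:
$result = foreach ($i in 1..3) {
    "noise $i"                        # stray output is captured alongside the real objects
    [PSCustomObject]@{ Value = $i }
}
$result.Count   # 6, not 3

$result = foreach ($i in 1..3) {
    $null = "noise $i"                # assigning to $null suppresses it
    [PSCustomObject]@{ Value = $i }
}
$result.Count   # 3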
I changed the match part to this and it's working fine: $result = $content -match $object.Replace("\PMS","\\PMS").
Sorry for errors in posting. I will amend that.
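As a more general fix than escaping \PMS by hand: -match treats its right-hand operand as a regular expression, so [regex]::Escape can neutralize every metacharacter in one go; a sketch:
# escapes \, ., +, and all other regex metacharacters, so -match looks for the literal text
$result = $content -match [regex]::Escape($object)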

How can I modify part of a row in a file?

I have a file which contains multiple rows with strings like this:
DTSTART:20190716T180000
DTEND:20190716T180000
I want to modify every DTEND row. I want to replace the 180000 with 190000.
The parts between DTEND: and 180000 are different each time. Does anyone know how I can change the string in PowerShell?
here's one way to do the job. [grin] it finds a line that starts with DTEND, grabs the timestamp, converts it to a [datetime] object, adds one hour to it, reformats that to the same layout as the original, builds a new line, and then outputs it to the $Results collection.
the collection can be sent to a file or screen as desired.
# fake reading in a text file
# in real life, use Get-Content
$InStuff = @'
DTSTART:20190716T180000
DTEND:20190716T180000
'@ -split [System.Environment]::NewLine
$Marker = 'DTEND'
$HoursToAdd = 1
$Results = foreach ($IS_Item in $InStuff)
{
    if ($IS_Item -match "^$Marker")
    {
        $Prefix, $OldTimeStamp = $IS_Item.Split(':')
        $NewTimeStamp = [datetime]::ParseExact($OldTimeStamp, 'yyyyMMddTHHmmss', $Null).
            AddHours($HoursToAdd).
            ToString('yyyyMMddTHHmmss')
        ($Prefix, $NewTimeStamp) -join ':'
    }
    else
    {
        $IS_Item
    }
}
$Results
output ...
DTSTART:20190716T180000
DTEND:20190716T190000
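for the real file, swap the fake input for Get-Content and send $Results to Set-Content; a tiny sketch with hypothetical paths:
$InStuff = Get-Content 'C:\calendar.ics'         # replaces the fake here-string input
# (the $Results foreach loop above stays unchanged)
$Results | Set-Content 'C:\calendar_fixed.ics'   # write the adjusted lines back out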

Fastest way to match two large arrays of objects by key in Powershell

I have two PowerShell arrays of objects generated via Import-Csv, and I must match them by one of their properties. Specifically, it is a 1:n relationship, so currently I'm following this pattern:
foreach ($line in $array1) {
    $match = $array2 | where {$_.key -eq $line.key} # could be 1 or n results
    ... # process here the 1 to n lines
}
This is not very efficient (both tables have many columns), and it takes an unacceptably long time for our needs. Is there a faster way to perform this match?
Both data sources come from a CSV file, so using something other than Import-Csv would also be welcome.
Thanks
The standard method is to index the data using a hashtable (or dictionary/map in other languages).
function buildIndex($csv, [string]$keyName) {
    $index = @{}
    foreach ($row in $csv) {
        $key = $row.($keyName)
        $data = $index[$key]
        if ($data -is [Collections.ArrayList]) {
            $data.Add($row) >$null
        } elseif ($data) {
            $index[$key] = [Collections.ArrayList]@($data, $row)
        } else {
            $index[$key] = $row
        }
    }
    $index
}
$csv1 = Import-Csv 'r:\1.csv'
$csv2 = Import-Csv 'r:\2.csv'
$index2 = buildIndex $csv2 'key'   # note: function arguments are space-separated, not comma-separated
foreach ($row in $csv1) {
    $matchedInCsv2 = $index2[$row.key]
    foreach ($row2 in $matchedInCsv2) {
        # ........
    }
}
Also, if you need speed and iterate a big collection, avoid | pipelining as it's many times slower than foreach/while/do statements. And don't use anything with a ScriptBlock like where {$_.key -eq $line.key} in your code because execution context creation adds a ridiculously big overhead compared to the simple code inside.
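If you'd rather not hand-roll the index, Group-Object can build a comparable lookup in one call; likely slower than the buildIndex function above, but concise. A sketch assuming the same 'key' column:
# -AsHashTable returns a hashtable of key => collection of matching rows; -AsString forces string keys
$index2 = $csv2 | Group-Object -Property key -AsHashTable -AsString
foreach ($row in $csv1) {
    foreach ($row2 in $index2[$row.key]) {
        # process each matching row here
    }
}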

How to properly string replace in PowerShell without appending the replaced variable to a new line?

I'm pretty new to PowerShell/programming so bear with me. I have this bug that appends the newly renamed path to a new line without the rest of the path.
The console output:
/content/pizza/en/ingredients/
helloworld/menu-eng.html
What I want:
/content/pizza/en/ingredients/helloworld/menu-eng.html
What the code below is supposed to do is rename a bunch of paths. Right now testName is hard-coded, but after I get this to work properly it will be dynamic.
My code:
$testName = "helloworld"
$text = (Get-Content W:\test\Rename\rename.csv) | Out-String
$listOfUri = Import-Csv W:\test\Rename\rename.csv
foreach ($element in $listOfUri) {
if ($element -match "menu-eng.html") {
$elementString = $element.'ColumnTitle' | Out-String
$elementString = $elementString.Replace('menu-eng.html', '')
$varPath1 = $elementString
$elementString = $elementString.Insert('', 'http://www.pizza.com')
$elementName = ([System.Uri]$elementString).Segments[-1]
$elementString = $elementString.Replace($elementName, '')
$elementString = $elementString.Replace('http://www.pizza.com', '')
$varPath2 = $elementString.Insert($elementString.Length, $testName + '/')
$text = $text.Replace($varPath1.Trim(), $varPath2)
}
}
$text
Assuming your .csv file looks like this:
ColumnTitle,Idk
/content/pizza/en/ingredients/SPAM/menu-eng.html,Stuff
Then:
$testName = 'helloworld'
foreach ($row in Import-Csv d:\rename.csv) {
    $bit = $row.'ColumnTitle'.Split('/')[-2]
    $row.'ColumnTitle'.Replace($bit, $testName)
}
I have no real idea what all the rest of your code is for. In particular, as per my earlier comment, your line:
$text = (Get-Content W:\test\Rename\rename.csv) | Out-String
is making $text into a single multi-line string of all the lines in the file, including the headers. You can still use .Replace() on it in PowerShell, but it's going to do the replace across every line. I can't quite see how that gives you the output you get, but it will give you multiple lines for every line in the input file.
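A quick way to see the difference; a small sketch:
$lines = Get-Content W:\test\Rename\rename.csv   # array of lines
$text  = $lines | Out-String                     # one multi-line string
$lines.GetType().Name                            # Object[]
$text.GetType().Name                             # String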