Working with CSV - DataSplits - PowerShell

I have a script written that pulls a number of CSV files from an FTP server and downloads them to a network location.
The content of this CSV file follows the example I have provided in this link: File Example
In short working with this file I need to:
Using the 12 alphanumeric characters that follow Ords: on line two, define a variable which will be used later in a query. (A)
GB0000000001
Would become
$OrderVariable = "GB0000000001"
I have read about
.TrimStart([Characters_to_remove])
but am unsure how it would skip the first row and then how I would remove everything following the next 12 letters.
Using the entire line two information, excluding the Ords: prefix, define this as a variable, e.g.
GB0000000001 – Promotion Event
would become
$TitleEvent = "GB0000000001 – Promotion Event"
The CSV contains all the customers that an email needs to be sent to e.g.
D|300123123|BBA
D|300321312|DDS
D|A0123950|BBA
D|A0999950|ZZG
These items I would expect to be written into a hashtable, which I thought would be simple enough, except I cannot find any way to exclude everything that precedes them.
$mytable = Import-Csv -Path $filePath -Header D,Client,Suffix
$HashTable = @{}
foreach ($r in $mytable) {
    $HashTable[$r.Client] = $r.Suffix
}
UPDATE
I have managed to get most of this element into a variable with the following
$target = "\\Messaging"
cd $target
$Clients = Import-Csv example.txt | where {$_ -like "*D|*"}
$Clients = $Clients[1..($Clients.count - 1)]
$Clients | Export-Csv "Test.csv" -NoTypeInformation
But I cannot get it to import with custom headers or without the first "H|" delimitation...
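In case it helps, a sketch of one way to get past that: filter the raw text first, then parse only the D| rows as pipe-delimited CSV (the header names are made up):
$Clients = Get-Content example.txt |
    Where-Object { $_ -like 'D|*' } |
    ConvertFrom-Csv -Delimiter '|' -Header 'RecordType', 'Client', 'Suffix'
$HashTable = @{}
foreach ($r in $Clients) { $HashTable[$r.Client] = $r.Suffix }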
End of update 1
I believe this is roughly what is going to be required, as the only element that I will need to define and use in a later query is the Client itself.
The next step would define all the text that remains as the message content:
This is a Promotion Event and action needs to be taken by you. The
deadline for your instruction is 2pm on 12 September 2016.
The deadline for this event has been extended.
To notify us of your instruction you can send a secure message.
This can differ massively on each occasion, so it cannot simply be a removal of X number of lines; the content will always follow the Ords: line (line two) and end at the start of the D| delimited block.
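A line-based sketch of that idea, assuming line one is a header, line two is the Ords: line, and the message runs up to the first D| row:
$lines = Get-Content example.txt
# Index of the first client row; the message body sits between line two and it
$firstClient = 0..($lines.Count - 1) | Where-Object { $lines[$_] -like 'D|*' } | Select-Object -First 1
$MessageContent = ($lines[2..($firstClient - 1)] -join "`n").Trim()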
Most of the other code I need to put together I am 'fairly confident' with (famous last words), and I have a fully working script that is pulling the files I need; I am just not great at working with CSVs once I have them.

The data format is flexible, without a global table/grid structure, so let's use regular expressions (see the breakdown below), which are quite a universal method of parsing such texts.
$text = [IO.File]::ReadAllText('inputfile.txt', [Text.Encoding]::UTF8)
$data = ([regex]('Ords: (?<order>.+?) [-–—] (?<title>.+)[\r\n]+' +
                 '(?<info>[\s\S]+?)[\r\n]+' +
                 '(?<clients>D\|[\s\S]+?)[\r\n]+' +
                 'T\|(?<T>\d+)')
    ).Matches($text) |
    forEach {
        $g = $_.groups
        @{
            order   = $g['order'].value
            info    = $g['info'].value -join ' '
            clients = $g['clients'].value -split '[\r\n]+' |
                where { $_ -match 'D\|(.+?)\|(.+)' } |
                forEach {
                    @{
                        id     = $matches[1]
                        suffix = $matches[2]
                    }
                }
            T = $g['T'].value
        }
    }
$data is now a record (or an array of records if the file has multiple entries):
Name Value
---- -----
T 000004
info This is a Promotion Event and action needs to be take...
order GB0000000001
clients {System.Collections.Hashtable, System.Collections.Has...
$data.clients is an array of records:
Name Value
---- -----
id 300123123
suffix BBA
id 300321312
suffix DDS
id A0123950
suffix BBA
id A0999950
suffix ZZG
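A hypothetical follow-up showing how these records might then feed the later query and the e-mail loop (property names as produced above):
$OrderVariable = $data.order               # "GB0000000001"
$HashTable = @{}
foreach ($c in $data.clients) {
    $HashTable[$c.id] = $c.suffix          # 300123123 -> BBA, etc.
}
foreach ($id in $HashTable.Keys) {
    "Client $id gets an email for order $OrderVariable ($($data.info))"
}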

Related

How to speed up processing of ~million lines of text in log file

I am trying to parse a very large log file that consists of space-delimited text across about 16 fields. Unfortunately the app logs a blank line between each legitimate one (effectively doubling the lines I must process). It also causes fields to shift, because it uses space both as a delimiter and for empty fields. I couldn't get around this in LogParser. Fortunately PowerShell affords me the option to reference fields from the end as well, making it easier to get later fields affected by the shift.
After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).
The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).
I'm not doing much in the code itself. I'm doing the following:
Splitting line via space as delimiter
One of the fields is an IP address that I am reverse-DNS resolving, which is obviously going to be slow. So I have wrapped this in more code to create an in-memory arraylist cache of previously resolved IPs, and I pull from it when possible. The IPs are largely the same, so after a few hundred lines resolution shouldn't be an issue any longer.
Saving the needed array elements into my pscustomobject
Adding pscustomobject to arraylist to be used later.
During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.
All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.
Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?
Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A583FE0 UDP Snd 127.0.0.1 6c94 R Q [8385 A DR NXDOMAIN] AAAA (4)pool(3)ntp(3)org(0)
10/31/2022 12:38:45 PM 2D00 PACKET 000000B25A582050 UDP Snd 127.0.0.1 3d9d R Q [8081 DR NOERROR] A (4)pool(3)ntp(3)org(0)
NOTE: the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081 DR NOERROR] (3 fields)
Other examples would be the "R Q" where sometimes it's " Q".
$Logfile = "C:\Temp\log.txt"
[System.Collections.ArrayList]$LogEntries = #()
[System.Collections.ArrayList]$DNSCache = #()
# Initialize log iteration counter
$i = 1
# Get Log data. Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d+\/"}}
$LogDataTotalLines = $LogData.Length
# Process each log entry
$LogData | ForEach-Object {
$PercentComplete = [math]::Round(($i/$LogDataTotalLines * 100))
Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete
# Split line using space, including sequential spaces, as delimiter.
# NOTE: Due to how app logs events, some fields may be blank leading split yielding different number of columns. Fortunately the fields we desire
# are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
$temp = $_ -Split '\s+'
# Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
If ($DNSCache.IP -notcontains $temp[8]) {
$DNSEntry = [PSCustomObject]#{
IP = $temp[8]
DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
}
# Add DNSEntry to DNSCache collection
$DNSCache.Add($DNSEntry) | Out-Null
# Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE: value could be blank.
$ResolvedDNSName = $DNSEntry.DNSName
} Else {
# DNSCache contains resolved IP already. Find and Use it.
$ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
}
$LogEntry = [PSCustomObject]#{
Datetime = $temp[0] + " " + $temp[1] + " " + $temp[2] # Combines first 3 fields Date, Time, AM/PM
ClientIP = $temp[8]
ClientDNSName = $ResolvedDNSName
QueryType = $temp[-2] # Second to last entry of array
QueryName = ($temp[-1] -Replace "\(\d+\)",".") -Replace "^\.","" # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
}
# Add LogEntry to LogEntries collection
$LogEntries.Add($LogEntry) | Out-Null
$i++
}
Here is a more optimized version you can try.
What changed?:
Removed Write-Progress, especially because it's not known whether Windows PowerShell is used; PowerShell versions below 6 take a big performance hit from Write-Progress
Changed $DNSCache to Generic Dictionary for fast lookups
Changed $LogEntries to Generic List
Switched from Get-Content to switch -Regex -File
$Logfile = 'C:\Temp\log.txt'
$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)
# Process each log entry
switch -Regex -File ($Logfile) {
    '^\d+\/' {
        # Split line using space, including sequential spaces, as delimiter.
        # NOTE: Due to how the app logs events, some fields may be blank, leading to the split yielding a different number of columns.
        # Fortunately the fields we desire are in static positions not affected by this, except for the last 2,
        # which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'
        $ip = [string] $temp[8]
        $resolvedDNSRecord = $DNSCache[$ip]
        if ($null -eq $resolvedDNSRecord) {
            $resolvedDNSRecord = [PSCustomObject]@{
                IP = $ip
                DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
            }
            $DNSCache[$ip] = $resolvedDNSRecord
        }
        $LogEntry = [PSCustomObject]@{
            Datetime = $temp[0] + ' ' + $temp[1] + ' ' + $temp[2] # Combines first 3 fields: Date, Time, AM/PM
            ClientIP = $ip
            ClientDNSName = $resolvedDNSRecord.DNSName
            QueryType = $temp[-2] # Second to last entry of array
            QueryName = ($temp[-1] -Replace '\(\d+\)', '.') -Replace '^\.', '' # Last entry of array. Replace any "(#)" characters with period and remove first period for friendly name
        }
        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry)
    }
}
If it's still slow, there is still the option to use Start-ThreadJob as a multithreading approach with chunked lines (like 10000 per job).
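A rough sketch of that chunked approach, assuming the ThreadJob module is available; DNS resolution is left out here, since a shared cache would need thread-safe handling:
$chunkSize = 10000
$lines = @(Get-Content $Logfile -ReadCount 0) -match '^\d+/'
$jobs = for ($offset = 0; $offset -lt $lines.Count; $offset += $chunkSize) {
    $chunk = $lines[$offset..([Math]::Min($offset + $chunkSize, $lines.Count) - 1)]
    Start-ThreadJob {
        param($batch)
        foreach ($line in $batch) {
            $temp = $line -split '\s+'
            [PSCustomObject]@{
                Datetime  = $temp[0..2] -join ' '
                ClientIP  = $temp[8]
                QueryType = $temp[-2]
                QueryName = ($temp[-1] -replace '\(\d+\)', '.') -replace '^\.', ''
            }
        }
    } -ArgumentList (, $chunk)
}
$LogEntries = $jobs | Receive-Job -Wait -AutoRemoveJob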

How can I loop through each record of a text file to replace a string of characters

I have a large .txt file containing records in which a date string in each record needs to be incremented by 2 days; the result then replaces the field to its right, which contains dashes (--------). For example, a record contains the following data:
1440149049845_20191121000000 11/22/2019 -------- 0.000 0.013
I am replacing the -------- dashes with 11/24/2019 (2 days added to the date 11/22/2019) so that it shows as:
1440149049845_20191121000000 11/22/2019 11/24/2019 0.000 0.013
I have the replace working on a single record but need to loop through the entire .txt file to update all of the records. Here is what I tried:
$inputRecords = get-content '\\10.12.7.13\vipsvr\Rancho\MRDF_Report\_Report.txt'
foreach ($line in $inputRecords)
{
$item -match '\d{2}/\d{2}/\d{4}'
$inputRecords -replace '-{2,}',([datetime]$matches.0).adddays(2).tostring('MM/dd/yyyy') -replace '\b0\.000\b','0.412'
}
I get a PS error stating: "Cannot convert null to type "System.DateTime""
I'm sorry but why are we using RegEx for something this simple?
I can see it if there are differently formatted lines in the file: you'd want to make sure you aren't manipulating unintended lines, but that's not indicated in the question. Even so, it doesn't seem like you need to match anything within the line itself. It seems the data is delimited on spaces, which would make a simple split a lot easier.
Example:
$File = "C:\temp\Test.txt"
$Output =
ForEach( $Line in Get-Content $File)
{
$TmpArray = $Line.Split(' ')
$TmpArray[2] = (Get-Date $TmpArray[1]).AddDays(2).ToString('M/dd/yyyy')
$TmpArray -join ' '
}
For the 3rd element in the array, do the calculation and reassign the value...
Notice there's no use of the += operator, which is very slow compared to simply assigning the output to a variable. I wouldn't make a thing of it, but we don't know how big the file is... Also, a lowercase 'mm/dd/yyyy' format would mean minutes, resulting in 00 for the month, for example '00/22/2019', so I used 'M/dd/yyyy'.
You can still add logic to skip unnecessary lines if it's needed...
You can send $Output to a file with something like $Output | Out-File <FilePath>
Or this can be converted to a single pipeline that outputs directly to a file, using | ForEach-Object {...} instead of ForEach (.. in ..). If the file is truly huge and holding $Output in memory is an issue, this is a good alternative; a sketch follows.
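Something like this, under the same assumptions as the example above (the output path is made up):
Get-Content "C:\temp\Test.txt" | ForEach-Object {
    $TmpArray = $_.Split(' ')
    $TmpArray[2] = (Get-Date $TmpArray[1]).AddDays(2).ToString('M/dd/yyyy')
    $TmpArray -join ' '
} | Out-File "C:\temp\Test_updated.txt"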
Let me know if that helps.
You mostly had the right idea; here are a few suggested changes, though not exactly in this order:
Use a new file instead of trying to replace the old file.
Iterate a line at a time, replace the ------, write to the new file.
Use '-match' instead of '-replace', because, as you will see below, you need to manipulate the capture more than a simple '-replace' allows.
Use [datetime]::parseexact instead of trying to just force cast the captured text.
[string[]]$inputRecords = Get-Content ".\linesource.txt"
[string]$outputRecords = ""
foreach ($line in $inputRecords) {
    [string]$newLine = ""
    [regex]$logPattern = "^([\d_]+) ([\d/]+) (-+) (.*)$"
    if ($line -match $logPattern) {
        $origDate = [datetime]::ParseExact($Matches[2], 'MM/dd/yyyy', $null)
        $replacementDate = $origDate.AddDays(2)
        $newLine = $Matches[1]
        $newLine += " " + $origDate.ToString('MM/dd/yyyy')
        $newLine += " " + $replacementDate.ToString('MM/dd/yyyy')
        $newLine += " " + $Matches[4]
    } else {
        $newLine = $line
    }
    $outputRecords += "$newLine`n"
}
$outputRecords.ToString()
Even if you don't use the whole solution, hopefully at least parts of it will be helpful to you.
Using the suggested code from adamt8 and Steven, I added 2 echo statements to show what gets displayed in the variables $logpattern and $line, since it is not recognizing the pattern of characters to be updated. This is what the echo displays:
Options MatchTimeout RightToLeft
CalNOD01 1440151020208_20191205000000 12/06/2019 12/10/2019
None -00:00:00.0010000 False
CalNOD01 1440151020314_20191205000000 12/06/2019 --------
None -00:00:00.0010000 False
This is the rendered output:
CalNOD01 1440151020208_20191205000000 12/06/2019 12/10/2019
CalNOD01 1440151020314_20191205000000 12/06/2019 --------

Parse MDT Log using PowerShell

I am trying to set up a log which would pull different information from another log file, to log assets built by MDT, using PowerShell. I can extract a line of the log using a simple Get-Content | Select-String to get the lines I need, so the output looks like this:
[LOG[Validate Domain Credentials [domain\user]]LOG]!
time="16:55:42.000+000" date="10-20-2017" component="Wizard"
context="" type="1" thread="" file="Wizard"
I am curious whether there is a way of capturing things like domain\user, time, and date in separate variables, so they can later be combined with other data captured in a similar way and written to the output file on a single line.
This is how you could do it:
$line = Get-Content "<your_log_path>" | Select-String "Validate Domain Credentials" | select -First 1
$regex = '\[(?<domain>[^\\[]+)\\(?<user>[^]]+)\].*time="(?<time>[^"]*)".*date="(?<date>[^"]*)".*component="(?<component>[^"]*)".*context="(?<context>[^"]*)".*type="(?<type>[^"]*)".*thread="(?<thread>[^"]*)".*file="(?<file>[^"]*)"'
if ($line -match $regex) {
    $user = $Matches.user
    $date = $Matches.date
    $time = $Matches.time
    # ... now do stuff with your variables ...
}
You might want to build in some error checking etc. (e.g. when no line is found or does not match etc.)
Also you could greatly simplify the regex if you only need those 3 values. I designed it so that all fields from the line are included.
Also, you could convert the values into more appropriate types, which (depending on what you want to do with them afterwards) might make handling them easier:
$type = [int]$Matches.type
$credential = New-Object System.Net.NetworkCredential($Matches.user, $null, $Matches.domain)
$datetime = [DateTime]::ParseExact(($Matches.date + $Matches.time), "MM-dd-yyyyHH:mm:ss.fff+000", [CultureInfo]::InvariantCulture)
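And since the goal was to combine these with other captured data on a single output line, a hypothetical finishing step (the output path is made up):
[pscustomobject]@{
    User     = $Matches.user
    Domain   = $Matches.domain
    DateTime = $datetime
    File     = $Matches.file
} | Export-Csv 'C:\Logs\MDT_summary.csv' -NoTypeInformation -Append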

CSV input, powershell pulling $null value rows from targeted column

I am trying to create a script to create Teams in Microsoft Teams from data in a CSV file.
The CSV file has the following columns: Team_name, Team_owner, Team_Description, Team_class
The script should grab each Team_name row value and use it to create a variable, then use that variable to query whether the team exists in Teams and, if not, create it using the data in the other columns.
The problem I am having is my foreach loop seems to be collecting rows without values. I simplified the testing by first trying to identify the values and monitoring the output.
Here is the test script
$Team_infocsv = Import-csv -path $path Teams_info.csv
# $Team_infocsv | Foreach-object{
foreach ($line in $Team_infocsv) {
    $owner = $line.Team_owner
    Write-Host "Team Owner: $owner"
    $teamname = $line.Team_name
    Write-Host "Team Name: $teamname"
    $team_descr = $line.Team_Description
    Write-Host "Team Description: $team_descr"
    $teamclass = $line.Team_class
    Write-Host "Team Class: $teamclass"
}
I only have two rows of data, yet along with the two lines I requested, extra output from rows without values is returned.
There's no problem with your code per se, except:
Teams_info.csv is specified in addition to $path after Import-Csv -Path, which I presume is a typo.
$path could conceivably - and accidentally - be an array of file paths, and if the additional file(s) has entirely different columns, you'd get empty values for the first file's columns.
If not, the issue must be with the contents of Teams_info.csv, so I suggest you examine that; piping to Format-Custom as shown below will also help you detect the case where $path is unexpectedly an array of file paths:
Here's a working example of a CSV file resembling your input - created ad hoc - that you can compare to your input file.
# Create sample file.
@'
"Team_name","Team_owner","Team_Description","Team_class"
"Team_nameVal1","Team_ownerVal1","Team_DescriptionVal1","Team_classVal1"
"Team_nameVal2","Team_ownerVal2","Team_DescriptionVal2","Team_classVal2"
'@ > test.csv
# Import the file and examine the objects that get created.
# Note the use of Format-Custom.
Import-Csv test.csv | Format-Custom
The above yields:
class PSCustomObject
{
Team_name = Team_nameVal1
Team_owner = Team_ownerVal1
Team_Description = Team_DescriptionVal1
Team_class = Team_classVal1
}
class PSCustomObject
{
Team_name = Team_nameVal2
Team_owner = Team_ownerVal2
Team_Description = Team_DescriptionVal2
Team_class = Team_classVal2
}
Format-Custom produces a custom view (a non-table and non-list view) as defined by the type of the instances being output; in the case of the [pscustomobject] instances that Import-Csv outputs, you get the above view, which is a convenient way of getting at least a quick sense of the objects' content (you may still have to dig deeper to distinguish empty strings from $nulls, ...).
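If the import does turn out to contain blank rows, a quick check along these lines (column names taken from the question) can isolate them:
# Show any imported rows where the targeted columns are empty or missing
Import-Csv Teams_info.csv | Where-Object {
    [string]::IsNullOrWhiteSpace($_.Team_name) -and [string]::IsNullOrWhiteSpace($_.Team_owner)
}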

How to add a column to an existing CSV row in PowerShell?

I'm trying to write a simple usage logger into my script that would store information about the time when user opened the script, finished using the script and the user name.
The first part of the logger, where I gather the first two data points, works fine and adds the two necessary columns with values to the CSV file. Yet when I run the second part of the logger, it does not add a new column to my existing CSV file.
#Code I will add at the very beginning of my script
$FileNameDate = Get-Date -Format "MMM_yyyy"
$FilePath = "C:\Users\Username\Desktop\Script\Logs\${FileNameDate}_MonthlyLog.csv"
$TimeStamp = (Get-Date).toString("dd/MMM/yyyy HH:mm:ss")
$UserName = [string]($env:UserName)
$LogArray = @()
$LogArrayDetails = @{
    Username = $UserName
    StartDate = $TimeStamp
}
$LogArray += New-Object PSObject -Property $LogArrayDetails | Export-Csv $FilePath -NoTypeInformation -Append
#Code I will add at the very end of my script
$logArrayFinishDetails = @{FinishDate = $TimeStamp}
$LogCsv = Import-Csv $FilePath | Select Username, StartDate, @{$LogArrayFinishDetails} | Export-Csv $FilePath -NoTypeInformation -Append
CSV file should look like this when the script is closed:
Username StartDate FinishDate
anyplane 08/Apr/2018 23:47:55 08/Apr/2018 23:48:55
Yet it looks like this:
StartDate Username
08/Apr/2018 23:47:55 anyplane
The other weird thing is that it puts the StartDate first while I clearly stated in $LogArrayDetails that Username goes first.
Assuming that you only ever want to record the most recent run [see bottom if you want to record multiple runs] (PSv3+):
# Log start of execution.
[pscustomobject] @{ Username = $env:USERNAME; StartDate = $TimeStamp } |
    Export-Csv -NoTypeInformation $FilePath
# Perform script actions...
# Log end of execution.
(Import-Csv $FilePath) |
    Select-Object *, @{ n='FinishDate'; e={ (Get-Date).toString("dd/MMM/yyyy HH:mm:ss") } } |
    Export-Csv -NoTypeInformation $FilePath
As noted in boxdog's helpful answer, using -Append with Export-Csv won't add additional columns.
However, since you're seemingly attempting to rewrite the entire file, there is no need to use -Append at all.
To ensure that the old version of the file has been read in full before you attempt to replace it with Export-Csv, be sure to enclose your Import-Csv $FilePath call in (...).
This is not strictly necessary with a 1-line file such as in this case, but a good habit to form for such rewrites; do note that this approach is somewhat brittle in general, as something could go wrong while rewriting the file, resulting in potential data loss.
@{ n='FinishDate'; e={ (Get-Date).toString("dd/MMM/yyyy HH:mm:ss") } } is an example of a calculated property/column that is appended to the preexisting columns (selected with *).
The other weird thing is that it puts the StartDate first while I clearly stated in $LogArrayDetails that Username goes first.
You've used a hashtable (@{ ... }) to declare the columns for the output CSV, but the order in which a hashtable's entries are enumerated is not guaranteed.
In PSv3+, you can use an ordered hashtable instead ([ordered] @{ ... }) to achieve predictable enumeration, which you also get if you convert the hashtable to a custom object by casting to [pscustomobject], as shown above.
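A quick comparison sketch: the plain hashtable's column order may come out differently from how it was written, while the cast preserves it:
New-Object PSObject -Property @{ Username = 'anyplane'; StartDate = '08/Apr/2018 23:47:55' }   # order not guaranteed
[pscustomobject] @{ Username = 'anyplane'; StartDate = '08/Apr/2018 23:47:55' }                # order preserved (PSv3+)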
If you do want to append to the existing file, you can use the following, but note that:
this approach does not scale well, because the entire log file is read into memory every time (and converted to objects), though limiting the entries to a month's worth should be fine.
as stated, the approach is brittle, as things can go wrong while rewriting the file; consider simply writing 2 rows per execution instead, which allows you to append to the file line by line (see the sketch after the code below).
there's no concurrency management, so the assumption is that only ever one instance of the script is run at a time.
$FilePath = './t.csv'
$TimeStamp = (Get-Date).toString("dd/MMM/yyyy HH:mm:ss")
$env:USERNAME = $env:USER   # presumably so $env:USERNAME also works on Unix-like platforms
# Log start of execution. Note the empty 'FinishDate' property
# to ensure all rows ultimately have the same column structure.
[pscustomobject] @{ Username = $env:USERNAME; StartDate = $TimeStamp; FinishDate = '' } |
    Export-Csv -NoTypeInformation -Append $FilePath
# Perform script actions...
# Log end of execution:
# Read the entire existing file...
$logRows = Import-Csv $FilePath
# ... update the last row's .FinishDate property ...
$logRows[-1].FinishDate = (Get-Date).toString("dd/MMM/yyyy HH:mm:ss")
# ... and rewrite the entire file, keeping only the last 30 entries.
$logRows[-30..-1] | Export-Csv -NoTypeInformation $FilePath
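For completeness, a sketch of the two-rows-per-execution alternative mentioned above, which only ever appends and never rewrites the file (the Event column is an assumption):
# Log start of execution as its own row...
[pscustomobject] @{ Username = $env:USERNAME; Event = 'Start'; Timestamp = (Get-Date).toString('dd/MMM/yyyy HH:mm:ss') } |
    Export-Csv -NoTypeInformation -Append $FilePath
# Perform script actions...
# ...then log the end as a second row; appending is safe because no columns change.
[pscustomobject] @{ Username = $env:USERNAME; Event = 'Finish'; Timestamp = (Get-Date).toString('dd/MMM/yyyy HH:mm:ss') } |
    Export-Csv -NoTypeInformation -Append $FilePath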
Because your CSV already has a structure (i.e. defined headers), PowerShell honours this when appending and doesn't add additional columns. It is (sort of) explained in this excerpt from the Export-Csv help:
When you submit multiple objects to Export-CSV, Export-CSV organizes
the file based on the properties of the first object that you submit.
If the remaining objects do not have one of the specified properties,
the property value of that object is null, as represented by two
consecutive commas. If the remaining objects have additional
properties, those property values are not included in the file.
You could include the FinishDate property in the original file (even though it would be empty), but the best option might be to export your output to a different CSV at the end, perhaps deleting the original after import then recreating it with the additional data. In fact, just removing the -Append will likely give the result you want.