Removing improper CR LF split lines from a pipe-delimited .txt flat file with a PowerShell script - powershell

Hope all is well! I came across a bit of a tricky issue with a flat file exported from Oracle PBCS that has carriage return problems. End users, when inputting data into PBCS, will often press Enter in a specific data field input screen. When the data gets exported, each record holds all the data elements for a specific data point (intersection), much like a SQL record. The element where the user pressed Enter causes that record to break at that point, shifting the rest of the data elements in that record onto the next line. This is very bad, as each record must have the same number of elements, and it causes downstream issues in a mapping. In effect, one unique record becomes two broken records.
I need a PowerShell script that looks at the improper CR LF (Windows system) and re-forms each unique record. However, the majority of the records in the flat file are fine, so the code will have to discern the "mostly good" cases from the "very bad" ones.
My flat file is pipe-delimited and has a header record. The header may not need special treatment, as I am simply trying to fix the data; a solution could potentially count the property values of the header record, using the pipe delimiter, to determine how to reassemble broken records from that property count, but I am not sure that is necessary.
I will be honest: I tried some Jython scripts to no avail, so, given that I have employed a couple of PowerShell scripts for other purposes in the past, I felt I would use PowerShell again. I have the basis of a script written for a CSV file, but it isn't quite working:
$file = Get-Content 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN.txt'
$file | ForEach-Object {
    foreach ($property in $_.PSObject.Properties) {
        $property.Value = ($property.Value).replace("`r","").replace("`n","")
    }
}
$file | Out-File -Append 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN_FINAL.txt'
Here are a few examples of what the before and after cases would look like if I could get this code to work.
This is supposed to be one record. As you can see, beginning with "$43K from above", the user pressed Enter several times. As you can see, it is pipe-delimited; the record below is split across several lines exactly where the breaks occur, since I can't show line numbers as in Notepad++. The idea is that this should all be on one line.
Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above
$92,903
$14,907
The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT
This is what the output should look like, all on one line:
Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above $92,903 $14,907 The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT
In other cases the line breaks just once; it is all determined by how many times the user pressed Enter.
As you can see in the data, the line splits at those points, and fixing that is the point of the PowerShell script. Other lines are just fine.

So, after checking locally, you should be able to just import the file as a CSV, then loop through everything and remove CRLF from each property on each record, and output to a new file (or the same one, but it's safer to output to a new file).
$Records = Import-Csv C:\Path\To\File.csv -Delimiter '|'
$Properties = $Records[0].PSObject.Properties.Name
ForEach ($Record in $Records) {
    ForEach ($Property in $Properties) {
        # strip any embedded CR or LF characters from the field
        $Record.$Property = $Record.$Property -replace "[\r\n]"
    }
}
$Records | Export-Csv C:\Path\To\NewFile.csv -Delimiter '|' -NoTypeInformation
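One caveat: Export-Csv will typically wrap every field in quotes on output, which may or may not suit your downstream mapping. If you need to keep the original formatting apart from the rejoined lines, here is a minimal sketch of the header-count idea floated in the question; it counts only the pipes that sit outside double quotes (your sample has pipes inside quoted fields) and merges lines until a record is complete. The paths and the space used to rejoin the pieces are taken from your example.
function Get-DelimiterCount([string]$Text) {
    # count pipe delimiters, ignoring any pipes inside double-quoted sections
    $inQuotes = $false
    $count = 0
    foreach ($ch in $Text.ToCharArray()) {
        if ($ch -eq '"') { $inQuotes = -not $inQuotes }
        elseif ($ch -eq '|' -and -not $inQuotes) { $count++ }
    }
    $count
}
$lines    = Get-Content 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN.txt'
$expected = Get-DelimiterCount $lines[0]   # the header record defines the field count
$buffer   = ''
$lines | ForEach-Object {
    # rejoin broken pieces with a space, matching the desired output shown above
    $buffer = if ($buffer) { "$buffer $_" } else { $_ }
    if ((Get-DelimiterCount $buffer) -ge $expected) { $buffer; $buffer = '' }
} | Set-Content 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN_FINAL.txt'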

Related

Using Powershell to remove illegal CRLF from csv row

Gentle Reader,
I have a year's worth of vendor CSV files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are malformed, and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have.
I'm restricted to using either C# (as a script task in SSIS) or Powershell.
Each file has no header but the schema is known and built into the SSIS package connection.
Each file has approximately 35k rows, with roughly a few dozen malformed rows per file.
Each properly formed row consists of 122 columns and 121 commas.
Rows are NOT text qualified.
Example: (data cleaned of PII)
555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]
PowerShell's Get-Content (I think...) reads the file into an array where each row is delimited by the CRLF terminator. This means (again, I think) that each piece of a malformed row will be treated as its own array element, without respect to how many "columns" it holds.
C#'s StreamReader also uses CRLF as a marker, but a StreamReader object has a few methods available, like Peek and Read, that may be useful.
Please, Oh Wise Ones, point me in the direction of least resistance: a PowerShell script to process malformed CSV files such that CRLFs that are not true end-of-line markers are removed.
Thank you.
Based on vonPryz's design (below), but in (native¹) PowerShell:
$Delimiters = 121
Get-Content .\OldFile.csv | ForEach-Object { $Line = '' } {
    # a stray CR LF added nothing, so the pieces concatenate directly
    # (the sample's continuation lines already begin with their own comma)
    $Line += $_
    $TotalMatches = ($Line | Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters) {
        $Line          # a complete row: emit it
        $Line = ''     # and start collecting the next one
    }
} | Set-Content .\NewFile.csv
¹ I guess performance might be improved by avoiding += and using .NET methods along with text streams.
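Along those lines, a rough sketch of the same logic with a StreamReader/StreamWriter and a StringBuilder (not benchmarked; the paths are placeholders):
$Delimiters = 121
$reader = [System.IO.StreamReader]::new("$PWD\OldFile.csv")
$writer = [System.IO.StreamWriter]::new("$PWD\NewFile.csv")
$sb = [System.Text.StringBuilder]::new()
while ($null -ne ($chunk = $reader.ReadLine())) {
    [void]$sb.Append($chunk)              # ReadLine() already strips the CR LF
    $line = $sb.ToString()
    if (($line.Split(',').Count - 1) -ge $Delimiters) {
        $writer.WriteLine($line)          # a complete row: write it out
        [void]$sb.Clear()                 # and start collecting the next one
    }
}
$reader.Close()
$writer.Close()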
Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. Once it's you who wrote the garbage data into the database, congratulations, it's now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have a written agreement that you didn't break the data and it was broken to start with. I've seen such problems in ETL processing all too often.
Quick and dirty pseudocode, without error handling, edge-case processing, performance guarantees and whatnot, and with some substring index assumptions, goes like so:
while(dataInFile)
    line = readline()
    :parseLine
    commasInLine = countCommas(line)
    if commasInLine == rightAmount
        addLineInOKBuffer(line)
    else
        commasNeeded = rightAmount - commasInLine
        if commasNeeded < 0
            # too many commas, two lines are combined
            lastCommaLocation = getLastCommaIndex(line, commasNeeded)
            addLineInOKBuffer(line.substring(0, lastCommaLocation))
            line = line.substring(lastCommaLocation, line.end)
            goto :parseLine
        else
            # too few commas, need to read the next line too
            line = line.removeCrLf() + readline()
            goto :parseLine
The idea is that you first read a line and count how many commas it contains. If the count matches what's expected, the row is not broken; store it in a buffer containing good data.
If you have too many commas, then the row contains at least two different records. Find the index where the first record ends, extract it and store it in the good-data buffer. Then remove the already-processed part of the line and start again.
If you have too few commas, then the row has been split by a newline. Read the next line from the file, join it with the current line and start the parsing again, beginning with counting the commas.
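In PowerShell, the "too many commas" branch could look like the sketch below. It assumes, as in the sample rows, that a record's last field is empty, so the first record ends right at the comma that completes its 121 delimiters:
function Split-MergedLine([string]$Line, [int]$RightAmount = 121) {
    while ((($Line -split ',').Count - 1) -gt $RightAmount) {
        # find the position just past the comma that completes the first record
        $idx = 0
        foreach ($i in 1..$RightAmount) { $idx = $Line.IndexOf(',', $idx) + 1 }
        $Line.Substring(0, $idx)          # emit the completed first record
        $Line = $Line.Substring($idx)     # then re-parse the remainder
    }
    $Line                                 # whatever is left (complete or partial)
}
Each emitted record lands on the pipeline, so the function can slot into the line-joining loop from the answer above.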

Parsing a text file or an html file to create a table

I have a simple issue with a .msg file from Outlook. I discovered that the code someone helped me with was not working, since the HTML body of the .msg file varies between different emails even though they come from the same source. So my next option was to save the email as a .txt and an .html file. Since I have no knowledge of HTML, I have no idea how to grab the table as it is structured in the HTML, but in the text file I found something easy. For example, this is the data from one table:
Summary
Date
Good mail
Rule matches
Spam
Malware
2019-10-22
4927
4519
2078
0
2019-10-23
4783
4113
1934
0
This is from the text file. Summary is the keyword: after that keyword, the next 5 lines are the columns of the table, and each following group of 5 lines is a row. This goes up to 7 rows in total, so headers and then 7 rows.
Now what I want to do is create a table from this text, using the first 5 lines after Summary as my columns. Since each .msg is different, these 5 columns will change order randomly in each file, so I want to account for that. My best attempt was to use ConvertFrom-String to create a table, but I have little idea how to format the table under the conditions set above.
The problem is this simple: I have a table in the txt file, shown as above, with 5 columns, and besides the headers each column contains 7 rows. There is also the condition that the email contains more data after this table, so I need to stop there and grab just that part, which should be easy.
How can I use ConvertFrom-String to create the table using those 5 columns? How can I set the delimiter to a newline, and how can I set the first 5 lines as the column headers?
I think trying to make this work with ConvertFrom-String is adding more work than necessary. But here is an alternative that works with your sample set.
$text = Get-Content -Path File.txt
$formattedText = if ($text[0] -match '^Summary') {
    for ($i = 1; $i -lt $text.Count; $i += 5) {
        # join each group of 5 lines into one comma-separated row
        $text[$i..($i+4)] -join ','
    }
}
$formattedText | ConvertFrom-Csv | ConvertTo-Html
Explanation:
If we assume your text data is in File.txt, Get-Content is used to read the data as an array ($text). If the first line begins with Summary, the file will be parsed.
The for loop is used to advance 5 lines during each iteration until the end of the file. The loop begins with the $text values at indexes 1, 2, 3, 4, and 5 joined together by a ,. Then the index ($i) is increased by 5 and the next five index values are joined together. Each iteration creates a new line of comma-separated values. The reason for the , join is just so we can use the simple ConvertFrom-Csv later.
ConvertFrom-Csv converts the CSV data into an array of objects, with the first row becoming those objects' property names.
Finally, the array is piped to ConvertTo-Html, which will output all of the objects in a table.
Note: if you want to resize the table or add extra formatting, you may need to do that after the HTML is generated. If your data contains commas, you will need a different delimiter when joining the strings; you will then need to pass that delimiter to ConvertFrom-Csv with the -Delimiter parameter.
Adaptation:
The code is fairly flexible. If you need to work with more than five properties, the $i+=5 will need to reflect the number of properties you are cycling through. The same change applies to $text[$i..($i+4)]: you want the .. range to span two values that differ by your property count minus one, as in the sketch below.
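For example, a sketch with the property count parameterized ($ColumnCount and the file name are assumptions):
$ColumnCount = 5
$text = Get-Content -Path File.txt
$formattedText = if ($text[0] -match '^Summary') {
    for ($i = 1; $i -lt $text.Count; $i += $ColumnCount) {
        # join each group of $ColumnCount lines into one comma-separated row
        $text[$i..($i + $ColumnCount - 1)] -join ','
    }
}
$formattedText | ConvertFrom-Csv | ConvertTo-Html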

Powershell - Efficient Way to Return Line Numbers from Logs

I have an extremely large log file (max 1 GB) which is appended to throughout the day. There are various strings within this log which I would like to search for (which I can already achieve using Select-String); however, I am scanning the whole file on every sweep, which is inefficient and a tad unnecessary.
Ideally I want to scan only the last 5 minutes of the log for these strings on each sweep. Unfortunately, not every row of the log file contains a timestamp. I initially thought of doing a wildcard Select-String for the last 5 minutes' timestamps combined with the strings of interest, but that would miss some occurrences. My only other idea at the moment is to determine the line numbers of interest, $FromLineNumber (5 minutes before the current system time) and $ToLineNumber (the very last line number of the log file), and then Select-String only between those two line numbers.
As an example, say I want to search between line 50 and the final line of the log. I am able to return $FromLineNumber, but I'm struggling to grab $ToLineNumber for the final row of the log.
Q. How do I return only the line number of the final row of a log file?
So far I have tried returning this with Get-Content $path -tail -1 (object type LineNumber); however, this always returns blank values, even with various switches and variations. I can only return line numbers via the Select-String cmdlet, but I do not have a specific string to use that relates to the final row of the log. Am I misusing this cmdlet per its original design, and if so... is there any other alternative to return the last line number?
Continued... once I have determined the line number range to search between, would I isolate those rows using a Get-Content loop between $FromLineNumber and $ToLineNumber first, to filter down to this smaller selection, and then pipe that into Select-String, or is there a more efficient way to achieve this? I suspect that looping through thousands of lines would be demanding on resources, so I'm keen to know if there is a better way.
Here is the answer to the first question
From https://blogs.technet.microsoft.com/heyscriptingguy/2011/10/09/use-a-powershell-cmdlet-to-count-files-words-and-lines/
If I want to know how many lines are contained in the file, I use the Measure-Object cmdlet with the -Line switch. This command is shown here:
Get-Content C:\fso\a.txt | Measure-Object -Line
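Applied to the question, that count gives you the last line number. A sketch, assuming $path and $FromLineNumber are already set ('string of interest' is a placeholder, and Select-Object's indexes are zero-based):
$ToLineNumber = (Get-Content $path | Measure-Object -Line).Lines
Get-Content $path |
    Select-Object -Index ($FromLineNumber..($ToLineNumber - 1)) |
    Select-String -Pattern 'string of interest'
Note that this still reads the file twice, so on a 1 GB log it is not cheap; it only narrows what Select-String has to match against.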

powershell - replace line in .txt file

I am using PowerShell and I need to replace a line in a .txt file.
The .txt file always has different number at the end of the line.
For example:
...............................txt (first)....................................
appversion= 10.10.1
............................txt (a second time)................................
appversion= 10.10.2
...............................txt (third)...................................
appversion= 10.10.5
I need to replace appversion plus the number behind it (the number is always different). I have set the required value in a variable.
How do I do this?
Part of the issue you are having, which I see from your comments, is that you are trying to replace text in a file and save it back to the same file while you are still reading it.
I will show a similar solution while addressing this. Again, we are going to use -replace's functionality as an array operator.
$NewVersion = "Awesome"
$filecontent = Get-Content C:\temp\file.txt
$filecontent -replace '(^appversion=.*\.).*',"`$1$NewVersion" | Set-Content C:\temp\file.txt
This regex will match lines starting with "appversion=" and everything up until the last period. Since we are storing the text in memory we can write it back to the same file. Change $NewVersion to a number ... unless that is your versioning structure.
I'm not sure which part of the numbers, if any, you are trying to preserve. If you intend to change the whole number, then you can just change the .*\. to a space; that way you ignore everything after the equals sign.
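That is, under the same setup as above:
# replace everything after "appversion= " with the new value
$filecontent -replace '(^appversion= ).*', "`$1$NewVersion" | Set-Content C:\temp\file.txt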
Yes, you can, with regex.
Let's call $myString and $verNumber the variables holding the text and the version number:
$myString = "appversion= 10.10.1";
$verNumber = 7;
You can use the -replace operator to keep the version prefix and replace only the last subversion number, this way:
$myString -replace 'appversion= (\d+)\.(\d+)\.(\d+)', "appversion= `$1.`$2.$verNumber";

Powershell: search backwards from end of file

My script reads a log file once a minute and selects (and acts upon) the lines where the timestamp begins with the previous minute.
This is easy (the regex is simply "^$timestamp"), but when the log gets big it can take a while.
My thinking is the lines I want will always be near the bottom of the file, so I'd be searching far fewer lines if I started at the bottom and searched upwards, stopping when I get to the minute prior to the one I'm interested in.
My question is, how can I search from the bottom of the file instead of the top? Can I even say "read line $length", or even "read line n" (if so I could do a sort of binary search thing to find the length of the file and work backwards from there)?
Last question: would this even be faster (I'd still like to know how to do it even if it wouldn't be faster)?
Ideally, I'd like to do this all in my own code without installing anything extra.
Thanks
Get-Content bigfile.txt -Tail 10
This works on huge files nearly instantly, without any big memory usage.
I did it with a 22 GB text file in my testing.
Doing something like "Get-Content bigfile.txt | Select -Last 10" works, but it seems to have to load all of the lines (or objects, in PowerShell) and then do the select.
May I suggest just changing the regex to match Get-Date plus whatever time period you want?
For example (and this is without your log, so I apologize):
$a = Get-Date
$hr = $a.Hour
$min = $a.Minute
Then work off those values to build out the regex to select the times you want, as in the sketch below. And if you don't already use it, this website is awesome for building regexes quickly and easily: http://gskinner.com/RegExr/
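For instance, a minimal sketch, assuming lines start with a timestamp like 2019-10-22 14:05 (the timestamp format and the file name are guesses, since the log wasn't shown):
# build an anchored regex for the previous minute and select matching lines
$prev = (Get-Date).AddMinutes(-1)
$pattern = '^' + [regex]::Escape($prev.ToString('yyyy-MM-dd HH:mm'))
Select-String -Path .\your.log -Pattern $pattern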
I've got another fix; I think you will like this one.
$a = get-content .\biglog.text
Use the length to slice the array from back to front. Change Write-Host to Select-String with your regex, or whatever you want to do in reverse:
foreach($x in ($a.length - 1)..0){ write-host $a[$x] }
Another option, again after the Get-Content cmdlet: this one just reverses the array, so you can then read $a from bottom to top:
[array]::Reverse($a)
If you only want the last bit of the file, depending on the format, you can just do this:
Get-Content C:\Windows\WindowsUpdate.log | Select -last 10
This will return the last 10 lines found in the file.