Using Powershell to remove illegal CRLF from csv row

Gentle Reader,
I have a year's worth of vendor csv files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are mal-formed and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have.
I'm restricted to using either C# (as a script task in SSIS) or Powershell.
Each file has no header but the schema is known and built into the SSIS package connection.
Each file has approx 35k rows and roughly a few dozen mal-formed rows per file.
Each properly formed row consists of 122 columns, 121 commas.
Rows are NOT text qualified.
Example: (data cleaned of PII)
555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]
Powershell Get-Content (I think...) reads the file into an array where each row is terminated by a CRLF. This means (again, I think) that mal-formed rows will be treated as separate array elements without respect to how many "columns" they hold.
C# StreamReader also uses CRLF as a marker, but a StreamReader object has a few methods available, like Peek and Read, that may be useful.
Please, Oh Wise Ones, point me in the direction of least resistance: a PowerShell script to process the mal-formed csv files such that CRLFs that are not true end-of-line markers are removed.
Thank you.

Based on @vonPryz's design, but in (native¹) PowerShell:
$Delimiters = 121
Get-Content .\OldFile.csv | ForEach-Object { $Line = '' } {
    # Append continuation lines directly: per the question, the stray CRLF should simply be
    # removed (the sample continuation lines already carry their own leading commas).
    if ($Line) { $Line += $_ } else { $Line = $_ }
    $TotalMatches = ($Line | Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters) {
        $Line            # emit the completed record
        $Line = ''
    }
} | Set-Content .\NewFile.Csv
1) I guess performance might be improved by avoiding += and using .NET methods along with text streams (StreamReader/StreamWriter).
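As a rough illustration of footnote 1 (a sketch only, not part of the answer above; it assumes PowerShell 5+ for the ::new() syntax, and the file paths are placeholders):
$reader = [System.IO.StreamReader]::new('C:\Data\OldFile.csv')
$writer = [System.IO.StreamWriter]::new('C:\Data\NewFile.csv')
$sb     = [System.Text.StringBuilder]::new()
$delimiters = 121
$count  = 0
try {
    while ($null -ne ($next = $reader.ReadLine())) {
        [void]$sb.Append($next)                        # appending drops the embedded CRLF
        $count += ($next.ToCharArray() -eq ',').Count  # running total of delimiters seen
        if ($count -ge $delimiters) {                  # record is complete: flush it
            $writer.WriteLine($sb.ToString())
            [void]$sb.Clear()
            $count = 0
        }
    }
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}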

Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. And since it's you who wrote the garbage data into the database, congratulations, it's now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have a written agreement that you didn't break the data and that it was broken to start with. I've seen such problems in ETL processing all too often.
A quick and dirty pseudocode - without error handling or edge-case processing, with naive substring index assumptions, and with no performance guarantees whatsoever - goes like so:
while(dataInFile)
    line = readline()
    :parseLine
    commasInLine = countCommas(line)
    if commasInLine == rightAmount
        addLineInOKBuffer(line)
    else
        commasNeeded = rightAmount - commasInLine
        if commasNeeded < 0
            # too many commas, two records are combined
            lastCommaLocation = getLastCommaIndex(line, commasNeeded)
            addLineInOKBuffer(line.substring(0, lastCommaLocation))
            line = line.substring(lastCommaLocation, line.end)
            goto :parseLine
        else
            # too few commas, need to read the next line too
            line = line.removeCrLf() + readline()
            goto :parseLine
The idea is that first you look for a line and count how many commas there are. If the count matches what's expected, the row is not broken. Store it in a buffer containing good data.
If you have too many commas, then the row contains at least two different elements. Then find the index of where the first element ends, extract it and store it in the good data buffer. Then remove the already processed part of the line and start again.
If you have too few commas, then the row has been split by a newline. Read the next line from the file, join it with the current line and start the parsing again from counting the commas.
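As a minimal illustration of the comma-counting decision at the heart of this design, a PowerShell sketch (the 121-delimiter target comes from the question; the sample line is a deliberately truncated fragment):
$expected = 121
$line = '555222,555222444888,1,CASUAL50'            # truncated fragment, for illustration only
$commas = ([regex]::Matches($line, ',')).Count       # count the delimiters in this candidate line
if     ($commas -eq $expected) { 'complete record' }
elseif ($commas -lt $expected) { 'too few commas - append the next physical line and recount' }
else                           { 'too many commas - the line holds more than one record' }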

Related

Removing CR LF improper split line on .txt pipe-delimited flat with Powershell script

Hope all is well! I came across a bit of a tricky issue with a flat file exported from Oracle PBCS that has some carriage return problems. End users, when inputting data into PBCS, will often press Enter in a specific data field input screen. When the data gets exported, each record holds all the data elements for a specific data point (intersection) - think of it like a SQL record - and the element where the user pressed Enter causes that record to break at that point, shifting the rest of the data elements in that record onto the next line. This is very bad, as each record must have the same number of elements, and it causes downstream issues in a mapping. In effect one unique record becomes two broken records.
I need a Powershell script that looks at the improper CR LF (Windows system) and reforms each unique record. However, the majority of the records in the flat file are fine, so the code will have to be able to discern the "mostly good" from the "very bad" cases.
My flat file is pipe delimited and has a header record. The header may not need to be considered, as I am simply trying to address the fix - a solution could potentially look at the number of property values in the header record to determine how to reform broken records based on a property count using the pipe delimiter - but I am not sure that is necessary.
I will be honest - there are Jython scripts I tried to no avail - so I felt, given that I have employed a couple of Powershell scripts for other reasons in the past, that I would use this approach again. I have the basis of a script for a csv file - but it isn't quite working.
$file = Get-Content 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN.txt'
$file | Foreach-Object {
    foreach ($property in $_.PSObject.Properties) {
        $property.Value = ($property.Value).replace("`r","").replace("`n","")
    }
}
$file | Out-File -Append 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN_FINAL.txt'
Here are a few examples of what the before and after case would be if I could get this code to work.
This is supposed to be one record - as you see, beginning with "$43K from..." the user pressed Enter several times. As you see it is pipe delimited - I show the break across separate lines below to make clear what I mean, since this isn't Notepad++. The idea is that this should all just be on one line.
Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above
$92,903
$14,907
The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT
This is what the output should look like (I have attached screenshots instead). All on one line.
Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above $92,903 $14,907 The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT
In other cases the line breaks just once - it is all defined by how many times the user presses Enter.
As you see in the data image, you can see how the line splits - this is the point of the PowerShell script. As you see next to that screenshot, the other lines are just fine.
So after checking locally, you should be able to just import the file as a CSV, then loop through everything and remove CRLF from each property on each record, and output to a new file (or the same file, but it's safer to output to a new one).
$Records = Import-Csv C:\Path\To\File.csv -Delimiter '|'
$Properties = $Records[0].psobject.properties.name
ForEach($Record in $Records){
    ForEach($Property in $Properties){
        $Record.$Property = $Record.$Property -replace "[\r\n]"
    }
}
$Records | Export-Csv C:\Path\To\NewFile.csv -Delimiter '|' -NoTypeInformation

Using perl to split over multiple lines

I'm trying to write a perl script to process a log4net log file. The fields in the log file are separated by a semi-colon. My end goal is to capture each field and populate a mysql table.
Usually I have lines that look a little like this (all on a single line)
DEBUG;2017-06-13T03:56:38,316-05:00;2017-06-13 08:56:38,316;79ab0b95-7f58-
44a8-a2c6-1f8feba1d72d;(null);WorkerStartup 1;"Starting services."
These are easy to process. I can simply split by semicolon to get the information I need.
However, occasionally the "message" field at the end may span several lines, especially if there is a stack trace. I would want to capture the entire message as a single column. I cannot simply split by semicolon, because the next lines would typically look like:
at some.random.classname
at another.classname
...
Can someone give some tips how to solve this problem?
The following solution relies on the fact that the number of " characters in a complete record is even ($p =~ y/"// % 2); this odd-quote-count test may be swapped for any other condition that indicates the field is not yet complete.
The number of columns in the split is fixed at 7 (to allow ; in the last field) and may be changed, for example to @array = map { s/;$//r } $p =~ /\G(?:"[^"]*"|[^;])*;/g; (the /r flag needs Perl 5.14+).
The file is read line by line, but a line is only processed (sub process) when it is complete; the $p variable stores the accumulated previous line(s), and the last record is processed in the END block.
perl -ne '
    # emit the accumulated record in $p, split into at most 7 fields
    sub process {
        @array = split /;/, $p, 7;
        # do something with @array
        print ((join "\n---\n", @array), "\n");
    }
    # an odd number of " means the record in $p is not complete yet: keep appending
    if (($p =~ y/"//) % 2) {
        $p .= $_;
        next;
    }
    process() if defined $p;   # skip the very first iteration, when $p is still empty
    $p = $_;
    END { process }
' < logfile.txt

Powershell - Efficient Way to Return Line Numbers from Logs

I have an extremely large log file (max 1GB) which is appended to throughout the day. There are various strings within this log which I would like to search for (that I can already achieve using Select-String) however I am scanning the whole file on every sweep which is inefficient and a tad unnecessary.
Ideally I want to scan only the last 5 minutes of the log for these strings on each sweep. Unfortunately not every row of the log file contains a timestamp, so a wildcard Select-String for the last 5 minutes' timestamps combined with the strings of interest will miss some occurrences. My only other idea at the moment is to determine the line numbers of interest, $FromLineNumber (5 mins before the current system time) and $ToLineNumber (the very last line number of the log file), and then only Select-String between those two line numbers.
As an example, say I want to search between line 50 and the final line of the log. I am able to return the line number for $FromLineNumber, but I'm struggling to grab $ToLineNumber for the final row of the log.
Q. How do I return only the line number of the final row of a log file?
So far I have tried returning this with Get-Content $path -Tail -1 (object type linenumber), however this always returns blank values, even with various switches and variations. I can only return line numbers via the Select-String cmdlet, but I do not have a specific string that relates to the final row of the log. Am I misusing this cmdlet per its original design and, if so, is there any other alternative to return the last line number?
Continued... Once I have determined the line-number range to search between, would I isolate those rows using a Get-Content loop between $FromLineNumber and $ToLineNumber first, to filter down to this smaller selection, and then pipe this into Select-String? Or is there a more efficient way to achieve this? I suspect that looping through thousands of lines would be demanding on resources, so I'm keen to know if there is a better way.
Here is the answer to the first question
From https://blogs.technet.microsoft.com/heyscriptingguy/2011/10/09/use-a-powershell-cmdlet-to-count-files-words-and-lines/
If I want to know how many lines are contained in the file, I use the Measure-Object cmdlet with the -Line switch. This command is shown here:
Get-Content C:\fso\a.txt | Measure-Object -Line
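Building on that, a rough sketch for the follow-up part of the question (line 50 is taken from the question's own example; the path and the search strings are placeholders):
$path     = 'C:\Logs\app.log'                       # placeholder log path
$toLine   = (Get-Content $path | Measure-Object -Line).Lines
$fromLine = 50                                      # wherever the 5-minute window starts
Get-Content $path |
    Select-Object -Skip ($fromLine - 1) -First ($toLine - $fromLine + 1) |
    Select-String -Pattern 'ERROR', 'WARN'          # placeholder strings of interest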

Powershell break a file up on character count

I have a binary file that I need to process, but it contains no line breaks in it.
The data is arranged, within the file, into 104 character blocks and then divided into its various fields by character count alone (no delimiting characters).
I'd like to firstly process the file, so that there is a line break (`n) every 104 characters, but after much web searching and a lot of disappointment, I've found nothing useful yet. (Unless I ditch PowerShell and use awk.)
Is there a Split option that understands character counts?
Not only would it allow me to create the file with nice easy to read lines of 104 chars, but it would also allow me to then split these lines into their component fields.
Can anyone help please, without *nix options?
Cheers :)
$s = Get-Content YourFileName | Out-String
$a = $s.ToCharArray()
$a[0..103] # will return an array of the first 104 chars
You can get your string back the following way; the Replace removes the space characters, which is what the array element separators turn into when the char array is cast back to a string:
$ns = ([string]$a[0..103]).replace(" ","")
Using the V4 Where method with Split option:
$text = 'abcdefghi'
While ($text)
{
$x,$text = ([char[]]$text).where({$_},'Split',3)
$x -join ''
}
abc
def
ghi
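Coming back to the original goal of a line break every 104 characters, a regex-based sketch (needs PowerShell 3+ for -Raw; file names are placeholders, and the data is assumed to be readable as text):
$raw    = Get-Content .\input.dat -Raw                               # the whole file as one string
$chunks = [regex]::Matches($raw, '.{1,104}', 'Singleline').Value     # 104-character slices
$chunks | Set-Content .\output.txt                                   # one slice per line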

Powershell: search backwards from end of file

My script reads a log file once a minute and selects (and acts upon) the lines where the timestamp begins with the previous minute.
This is easy (the regex is simply "^$timestamp"), but when the log gets big it can take a while.
My thinking is the lines I want will always be near the bottom of the file, so I'd be searching far fewer lines if I started at the bottom and searched upwards, stopping when I get to the minute prior to the one I'm interested in.
My question is, how can I search from the bottom of the file instead of the top? Can I even say "read line $length", or even "read line n" (if so I could do a sort of binary search thing to find the length of the file and work backwards from there)?
Last question: would this even be faster (I'd still like to know how to do it even if it wouldn't be faster)?
Ideally, I'd like to do this all in my own code without installing anything extra.
Thanks
get-content bigfile.txt -tail 10
This works on huge files nearly instantly, without any big memory usage.
I did it with a 22 GB text file in my testing.
Doing something like "Get-Content bigfile.txt | select -Last 10" works, but it seems to have to load all of the lines (or objects, in PowerShell) and then do the select.
May I suggest just changing the regex to equal Get-Date + whatever time period you want?
For example (and this is without your log, so I apologize):
$a = Get-Date
$hr = $a.Hour
$min = $a.Minute
Then work off those values to build out the regex to select the times you want. And if you don't already use it, this website is awesome for building regexes quickly and easily: http://gskinner.com/RegExr/
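For example, a sketch along those lines - the 'yyyy-MM-dd HH:mm' timestamp format and the file name are assumptions, so match them to the real log:
$prev    = (Get-Date).AddMinutes(-1)                          # the previous minute
$pattern = '^' + [regex]::Escape($prev.ToString('yyyy-MM-dd HH:mm'))
Select-String -Path .\biglog.txt -Pattern $pattern            # only lines from that minute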
Got another fix, I think you will like this..
$a = get-content .\biglog.text
Use the length to slice the array from back to front. Change Write-Host to Select-String with your regex, or whatever you want to do in reverse:
foreach($x in ($a.Length - 1)..0){ write-host $a[$x] }
Another option, again after the Get-Content cmdlet: this one just reverse-orders the array, so that you are reading $a from bottom to top.
[array]::Reverse($a)
If you only want the last bit of the file, depending on the format, you can just do this:
Get-Content C:\Windows\WindowsUpdate.log | Select -last 10
This will return the last 10 lines found in the file.
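If the tail is known to cover the window of interest, the two ideas combine naturally (the 5000-line tail and the pattern below are arbitrary placeholders):
Get-Content .\biglog.txt -Tail 5000 |
    Select-String -Pattern '^2018-01-19 08:5'     # placeholder regex for the target minute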