PowerShell: break a file up on character count

I have a binary file that I need to process, but it contains no line breaks in it.
The data is arranged, within the file, into 104-character blocks, which are then divided into their various fields by character count alone (no delimiting characters).
I'd like to first process the file so that there is a line break (`n) every 104 characters, but after much web searching and a lot of disappointment, I've found nothing useful yet. (Unless I ditch PowerShell and use awk.)
Is there a Split option that understands character counts?
Not only would it allow me to create the file with nice easy to read lines of 104 chars, but it would also allow me to then split these lines into their component fields.
Can anyone help please, without *nix options?
Cheers :)

$s = get-content YourFileName | Out-String
$a = $s.ToCharArray()
$a[0..103] # will return an array of first 104 chars
You can get your string back the following way; the replace removes the space characters (which is what the array element separators turn into when the array is cast to a string):
$ns = ([string]$a[0..103]).replace(" ","")
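If the goal is to rewrite the whole file with a break every 104 characters, a regex split is one way to do it. This is only a sketch, assuming the file can be read as plain text; YourFileName and Output.txt are placeholders:
$raw = Get-Content YourFileName -Raw                       # the whole file as one string (PowerShell 3+)
$raw -split '(.{104})' -ne '' | Set-Content Output.txt     # the capture group keeps each 104-char block; -ne '' drops the empty strings between them
Each resulting line can then be carved into its fields with .Substring() offsets.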

Using the V4 Where method with Split option:
$text = 'abcdefghi'
While ($text)
{
    $x,$text = ([char[]]$text).where({$_},'Split',3)
    $x -join ''
}
abc
def
ghi
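Applied to the question's file, the same pattern just needs the raw text and a chunk size of 104. A sketch, with YourFileName and Output.txt standing in for the real paths:
$text = Get-Content YourFileName -Raw                        # the whole file as a single string
$lines = while ($text) {
    $x,$text = ([char[]]$text).where({$_},'Split',104)       # peel off the next 104 characters
    $x -join ''
}
$lines | Set-Content Output.txt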

Related

Using Powershell to remove illegal CRLF from csv row

Gentle Reader,
I have a year's worth of vendor csv files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are mal-formed and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have.
I'm restricted to using either C# (as a script task in SSIS) or Powershell.
Each file has no header but the schema is known and built into the SSIS package connection.
Each file has approx 35k rows and roughly a few dozen mal-formed rows per file.
Each properly formed row consists of 122 columns, 121 commas.
Rows are NOT text qualified.
Example: (data cleaned of PII)
555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]
PowerShell Get-Content (I think...) reads the file into an array where each row is identified by the CRLF as the terminator. This means (again, I think) that mal-formed rows will be treated as an element of the array without respect to how many "columns" they hold.
C# StreamReader also uses CRLF as a marker, but a StreamReader object has a few methods available, like Peek and Read, that may be useful.
Please, Oh Wise Ones, point me in the direction of least resistance: a PowerShell script to process mal-formed csv files such that CRLFs that are not true end-of-line markers are removed.
Thank you.
Based on @vonPryz' design but in (native¹) PowerShell:
$Delimiters = 121
Get-Content .\OldFile.csv |ForEach-Object { $Line = '' } {
    if ($Line) { $Line += ',' + $_ } else { $Line = $_ }
    $TotalMatches = ($Line |Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters) {
        $Line
        $Line = ''
    }
} |Set-Content .\NewFile.Csv
¹) I guess performance might be improved by avoiding += and using .NET methods along with text streams
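One way that optimization might look (a sketch only, PowerShell 5+ syntax, untested against the real files; the paths are placeholders and the ',' join rule mirrors the pipeline version above):
$Required = 121
$InPath   = Join-Path $PWD 'OldFile.csv'    # .NET resolves relative paths against its own current directory, so build full paths
$OutPath  = Join-Path $PWD 'NewFile.csv'
$Reader   = [System.IO.StreamReader]::new($InPath)
$Writer   = [System.IO.StreamWriter]::new($OutPath)
$Buffer   = [System.Text.StringBuilder]::new()
$Commas   = 0
while ($null -ne ($Line = $Reader.ReadLine())) {
    if ($Buffer.Length) { [void]$Buffer.Append(','); $Commas++ }   # same join rule as above
    [void]$Buffer.Append($Line)
    $Commas += ($Line.ToCharArray() -eq ',').Count                 # count only the new line's commas
    if ($Commas -ge $Required) {                                   # enough delimiters: emit the mended row
        $Writer.WriteLine($Buffer.ToString())
        [void]$Buffer.Clear()
        $Commas = 0
    }
}
$Reader.Close()
$Writer.Close()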
Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. Since it's you who wrote the garbage data in the database, congratulations, it's now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have in writing an agreement that you didn't break the data and it was broken to start with. I've seen such problems on ETL processing all too often.
Quick and dirty pseudocode, without error handling, edge-case processing, substring index assumptions, performance guarantees and whatnot, goes like so:
while(dataInFile)
    line = readline()
    :parseLine
    commasInLine = countCommas(line)
    if commasInLine == rightAmount
        addLineInOKBuffer(line)
    else
        commasNeeded = rightAmount - commasInLine
        if commasNeeded < 0
            # too many commas, two lines are combined
            lastCommaLocation = getLastCommaIndex(line, commasNeeded)
            addLineInOKBuffer(line.substring(0, lastCommaLocation))
            line = line.substring(lastCommaLocation, line.end)
            goto :parseLine
        else
            # too few commas, need to read next line too
            line = line.removeCrLf() + readline()
            goto :parseLine
The idea is that first you look for a line and count how many commas there are. If the count matches what's expected, the row is not broken. Store it in a buffer containing good data.
If you have too many commas, then the row contains at least two different elements. Then find the index of where the first element ends, extract it and store it in the good data buffer. Then remove already processed part of the line and start again.
If you have too few commas, then the row is split by a newline. Read the next line from the file, join it with the current line and start the parsing again from counting the commas.

replacing text in Powershell every alternate match

I have looked at this question, and it's close to what I need to do, but the text I need to replace is inconsistent.
I need to replace "`r`n with ", but only on the first of the 2 adjacent lines
example: (the full file is 50k lines and up to 500 chars wide)
ID,Name,LinkedRecords
54429,Abe,
54247,Jonathan,"
63460|63461"
54249,Teresa,
54418,Cody,
58046,Joseph,
58243,David,
,Barry,"
74330"
C8876,Simon,
X_10934,David,
should become
ID,Name,LinkedRecords
54429,Abe,
54247,Jonathan,"63460|63461"
54249,Teresa,
54418,Cody,
58046,Joseph,
58243,David,
,Barry,"74330"
C8876,Simon,
X_10934,David,
I can see this will probably be useful, but I'm having a hard time getting the command to work as desired
If the `r`n characters are literal, then you can do the following:
[System.IO.File]::ReadAllText('c:\path\file.txt') -replace '(?<=,")`r`n\r?\n' |
Set-Content c:\path\file.txt
If `r`n are actual carriage return and line feed chars, then you can do the following:
[System.IO.File]::ReadAllText('c:\path\file.txt') -replace '(?<=,")\r\n' |
Set-Content c:\path\file.txt
Note if memory becomes an issue, a different approach may be needed.
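If the file is too big for ReadAllText, a rough line-by-line alternative is to hold back any line that stops on an opening quote and glue the next line onto it. A sketch only, assuming a broken record spans exactly two physical lines; the paths are placeholders:
$pending = $null
Get-Content 'c:\path\file.txt' | ForEach-Object {
    $line = if ($pending) { $pending + $_ } else { $_ }
    if ($line -match ',"$') {    # the line ends right after an opening quote: hold it back
        $pending = $line
    } else {                     # complete line: pass it through
        $pending = $null
        $line
    }
} -End { if ($pending) { $pending } } | Set-Content 'c:\path\newfile.txt'
Writing to a different file also avoids reading and writing the same file at once.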

String variable position being overwritten in write-host

If I run the below code, $SRN can be written as output or added to another variable, but trying to include either another variable or regular text causes it to be overwritten from the beginning of the line. I'm assuming it's something to do with how I'm assigning $autocode and $SRN initially but can't tell what it's trying to do.
# Load the property set to allow us to get to the email body.
$item.load($psPropertySet) # Load the data.
$bod = ($item.Body.Text -creplace '(?m)^\s*\r?\n','') -split "\n" # Get the body text, remove blank lines, split on line breaks to create an array (otherwise it is a single string).
$autocode = $bod[4].split('-')[2] # Get line 4 (should be Title), split on dash, look for 3rd element, this should contain our automation code.
$SRN = $bod[1] -replace 'ID: ','' # Get line 2 (should be ID), find and replace the preceding text.
# Skip processing if autocode does not match our list of handled ones.
if ($autocode -cin $autocodes)
{
    write-host "$SRN $autocode"
    write-host "$autocode $SRN"
    write-host "$SRN test"
    $var = "$SRN $autocode"
    $var
}
The code results in this; you can see that if $SRN isn't at the start of the line it is fine. I'm unsure where the extra spaces come from either:
KRNE8385
KRNE SR1788385
test8385
KRNE8385
I would expect to see this:
SR1788385 KRNE
KRNE SR1788385
SR1788385 test
SR1788385 KRNE
LotPings pointed me down the right path: both variables still had either "0D" or "\r" in them. My regex replace was only getting rid of them on blank lines, and I had split the array on "\n" only. Changing line 3 in the original code to the below appears to have resolved the issue. This was my first time seeing Format-Hex, but it appears to be excellent for troubleshooting such issues.
$bod = ($item.Body.Text -creplace '(?m)^\s*\r?\n','') -split "\r\n"
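For anyone chasing the same thing, Format-Hex (PowerShell 5+) makes the stray carriage return easy to spot; a quick check, assuming $SRN and $autocode hold the suspect values:
$SRN | Format-Hex        # a 0D byte in the output is a leftover carriage return
$autocode | Format-Hex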

Read from text file one character at a time

I'm trying to convert characters in a text file based on what type they are:
Letters > L
Numbers > #
Is there a way to iterate through a file on a per-character basis? The only way I can get it to work currently is nested loops iterating through individual lines within the file. If there's a simpler way, it would cut out a lot of code I'll have to wade through.
You can use Get-Content -Encoding Byte and convert from the byte value back to a character:
Get-Content foo.txt -Encoding Byte | foreach { [char]$_ }
You can use Get-Content -Raw and cast the result to [char[]]. Not recommended for large files.
Both options above will give you all characters, including line breaks. Option 1 will not work with Unicode for obvious reasons; option 2 will.
Then there is the variant you mention already: Iterate twice, once by lines, once by character:
Get-Content foo.txt | foreach { [char[]] $_ | foreach { ... } }
If you don't need line breaks as characters I'd prefer this version since it should have reasonable runtime and memory requirements (e.g. it won't try to fit the whole file into memory).
get-content myfile.txt | foreach { $_.ToCharArray() }
This flattens the contents of your file into a long array of characters.
If you are processing very large files, the fastest (programmatic) method I have found is to use .NET StreamReader and StreamWriter. Utilizing these objects will allow you to read line-at-a-time into a string, perform manipulation, and then write to a new file line-at-a-time. At the end, delete your original and rename the new file accordingly.
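A minimal sketch of that pattern for the letter/number conversion above; the file names are made up, and the point is the read-a-line, transform, write-a-line loop rather than the particular -replace:
$reader = New-Object System.IO.StreamReader 'C:\data\input.txt'
$writer = New-Object System.IO.StreamWriter 'C:\data\output.txt'
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        # transform each line: letters become L, digits become #
        $writer.WriteLine(($line -replace '[A-Za-z]', 'L' -replace '[0-9]', '#'))
    }
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}
# afterwards, delete the original and rename output.txt, as described above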
If you don't need to programmatically solve this and can utilize regular expressions, I recommend UltraEdit. I don't know what wizardry they utilize, but it is MUCH faster at reading files than what I've managed to do in PowerShell.

powershell - replace line in .txt file

I am using PowerShell and I need to replace a line in a .txt file.
The .txt file always has a different number at the end of the line.
For example:
...............................txt (first)....................................
appversion= 10.10.1
............................txt (a second time)................................
appversion= 10.10.2
...............................txt (third)...................................
appversion= 10.10.5
I need to replace appversion plus the number behind it (the number is always different). I have set the required value in a variable.
How do I do this?
Part of the issue you are getting, which I see from your comments, is that you are trying to replace text in a file and save it back to the same file while you are still reading it.
I will try to show a similar solution while addressing this. Again we are going to use -replace's functionality as an array operator.
$NewVersion = "Awesome"
$filecontent = Get-Content C:\temp\file.txt
$filecontent -replace '(^appversion=.*\.).*',"`$1$NewVersion" | Set-Content C:\temp\file.txt
This regex will match lines starting with "appversion=" and everything up until the last period. Since we are storing the text in memory we can write it back to the same file. Change $NewVersion to a number ... unless that is your versioning structure.
Not sure about which part of the numbers, if any, you are trying to preserve. If you intend to change the whole number, then you can just change .*\. in the pattern to a space. That way you ignore everything after the equals sign.
Yes, you can with regex.
Let's call $myString and $verNumber the variables with the text and the version number:
$myString = "appversion= 10.10.1";
$verNumber = 7;
You can use -replace operator to get the version part and replace only last subversion number this way
$mystring -replace 'appversion= (\d+).(\d+).(\d+)', "appversion= `$1.`$2.$verNumber";
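To apply that to the whole file, read it in first; the parentheses around Get-Content make sure the file is fully read before Set-Content writes back to it. A sketch with a placeholder path (the dots are escaped only to be strict):
$verNumber = 7;
(Get-Content C:\temp\file.txt) -replace 'appversion= (\d+)\.(\d+)\.(\d+)', "appversion= `$1.`$2.$verNumber" | Set-Content C:\temp\file.txt;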