Removing "SUB" characters (ASCII=26, which is outside of normally searchable special characters) with powershell - powershell

I'm trying to remove all of the SUB (substitute, ASCII 26) characters from a large text file.
I want to import a large file into SAS, but SAS bombs out (actually it just stops and reports success, which is much worse) when it reaches an unusual character that looks like a "->" when viewed in Excel. Using the CODE() function in Excel, it identifies this character as 26, which I believe is ASCII 26, the SUB (substitute) character.
Anyway, I'd like to remove all these "->" characters from the file so I can import it into SAS, so I was thinking I could use PowerShell (one of the few tools available to me).
I'm new to PowerShell, but none of the documentation I've been able to find on Select-String has information on writing hex or arbitrary ASCII characters, just a fixed list of normal special characters that doesn't include the character I'm struggling with.
Any ideas how I can remove all the SUB characters from a text file using powershell?

You can use \xnn in a regex to match an arbitrary character code expressed as hex; decimal 26 = 0x1A.
For Unicode, the format is \unnnn
Clear-Content $outputfile
Get-Content $inputfile -ReadCount 1000 |
    foreach {
        # Strip every SUB (U+001A) character from this batch of lines
        $_ -replace '\u001a' |
            Add-Content $outputfile
    }
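For files that are small enough to read into memory in one go, a single-pass variant is also possible; this is a sketch using the \xnn form mentioned above, where $inputfile and $outputfile are placeholders for your own paths:
# Read the whole file as one string, strip every SUB (0x1A) character, and write it back out.
(Get-Content $inputfile -Raw) -replace '\x1a' |
    Set-Content $outputfile -NoNewline
Note that -Raw requires PowerShell 3 or later and -NoNewline requires PowerShell 5 or later.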

Related

Is there a way to set the default EOL separator for Out-File in PowerShell 5.1?

I am looking for a way to set the default EOL marker to 0x0A when writing text with Out-File.
On the internet, I found tons of examples that either replace 0x0D 0x0A after a file is already written, or -join the lines on 0x0A and then write the concatenated text into the file.
I find both approaches a bit clumsy as I'd just like to write the files with the redirection operator >.
So, is there a way to set the EOL style in PowerShell 5.1?
No, unfortunately, as of PowerShell 7.2, there is no way to make PowerShell use a different newline (EOL, line break) format.
It is the platform-native newline character or sequence (as reflected in [Environment]::NewLine) that is invariably used - both for separating multiple input objects and for the trailing newline by default.
To control the newline format, you need to:
Join the input objects explicitly with the newline character (sequence) of choice, using the -join operator, followed by another instance if a trailing newline is desired ...
... and use the -NoNewline switch of Set-Content / Out-File (in lieu of >) to prevent a trailing platform-native newline from being appended (see the sketch below).
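A minimal sketch of that approach, using made-up sample lines and an output file name of my own choosing:
# Write three sample lines with LF-only ("`n") newlines, plus one trailing LF.
$lines = 'line 1', 'line 2', 'line 3'
(($lines -join "`n") + "`n") |
    Set-Content -NoNewline -Path out.txt    # -NoNewline suppresses the platform-native trailing newline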
As for potential future enhancements:
GitHub feature request #2872 suggests adding a parameter to Set-Content, specifically, to allow specifying the newline format; the request has been green-lighted (a long time ago), but has yet to be implemented - however, I think it isn't comprehensive enough, and it wouldn't help with Out-File / >; see next point.
GitHub feature request #3855 more generally asked for a -Delimiter parameter (to mirror Get-Content's existing parameter by that name) to be added to Set-Content / Add-Content, Out-File, and Out-String.
Unfortunately, the proposal was rejected; had it been accepted, you would have been able to configure > - a virtual alias of Out-File - to use LF-only newlines, for instance, as follows:
# WISHFUL THINKING
$PSDefaultParameterValues['Out-File:Delimiter'] = "`n"
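As an aside (not part of the proposal above): $PSDefaultParameterValues does already work this way for parameters that exist today; for instance, you can preset Out-File's -Encoding, which also affects the > operator, since it is effectively an alias of Out-File:
# Works today: make Out-File - and therefore > - default to UTF-8 output.
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
'hello' > .\test.txt   # now written as UTF-8 (with a BOM in Windows PowerShell 5.1)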

Swap string order in one line or swap line order in PowerShell

I need to swap the places of two or more regex-matched strings in one line, or of some lines, in a txt file in PowerShell.
In Notepad++ I just find ^(String 1.*)\r\n(String 2.*)\r\n(String 3.*)$ and replace it with \3\r\n\1\r\n\2:
String 1 aksdfh435##%$dsf
String 2 aksddfgdfg$dsf
String 3 aksddfl;gksf
Turns to:
String 3 aksddfl;gksf
String 1 aksdfh435##%$dsf
String 2 aksddfgdfg$dsf
So how can I do it in PowerShell? And if possible, can I use the command by calling powershell -command from cmd?
It's basically exactly the same in PowerShell, e.g.:
$Content = @'
Unrelated data 1
Unrelated data 2
aksdfh435##%$dsf
aksddfgdfg$dsf
aksddfl;gksf
Unrelated data 3
'@
$LB = [System.Environment]::NewLine
$String1= [regex]::Escape('aksdfh435##%$dsf')
$String2= [regex]::Escape('aksddfgdfg$dsf')
$String3= [regex]::Escape('aksddfl;gksf')
$RegexMatch = "($String1.*)$LB($String2.*)$LB($String3.*)$LB"
$Content -replace $RegexMatch,"`$3$LB`$1$LB`$2$LB"
outputs:
Unrelated data 1
Unrelated data 2
aksddfl;gksf
aksdfh435##%$dsf
aksddfgdfg$dsf
Unrelated data 3
I used [System.Environment]::NewLine since it resolves to the default line break on whatever system you're on, and assigned it to a variable for easier-to-read code. Either \r\n or `r`n would have worked as well: the former when using single quotes and the latter (using backticks) when using double quotes. The backtick is also what I use to escape $1, $2 and so on, that being the format for referring to the first, second, and third capture groups from the regex.
I also use the [regex]::Escape('STRING') method to escape the strings to avoid special characters messing things up.
To use file input instead, replace $Content with something like this:
$Content = Get-Content -Path 'C:\script\lab\Tests\testfile.txt' -Raw
and replace the last line with something like:
$Content -replace $RegexMatch,"`$3$LB`$1$LB`$2$LB" | Set-Content -Path 'C:\script\lab\Tests\testfile.txt'
In PowerShell it is not very different.
The replacement string needs to be inside double quotes (") here because of the newline escape sequences, and because of that you need to backtick-escape the backreference variables $1, $2 and $3:
$str -replace '^(String 1.*)\r?\n(String 2.*)\r?\n(String 3.*)$', "`$3`r`n`$1`r`n`$2"
This is assuming your $str is a single multiline string as the question implies.
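As for the question's second part, calling this from cmd.exe: quoting a regex like the one above directly after powershell -command gets awkward, so one option (a sketch; the paths are placeholders, not taken from the answers) is to save the logic as a script and invoke it with -File:
# swap.ps1 - assumes the file contains the three "String N ..." lines shown in the question.
# [^\r\n]* is used instead of .* so the captures don't also pick up a stray carriage return.
$path = 'C:\script\lab\Tests\testfile.txt'
$str = Get-Content -Path $path -Raw
$str -replace '^(String 1[^\r\n]*)\r?\n(String 2[^\r\n]*)\r?\n(String 3[^\r\n]*)', "`$3`r`n`$1`r`n`$2" |
    Set-Content -Path $path -NoNewline
From cmd.exe it could then be run as powershell -NoProfile -ExecutionPolicy Bypass -File C:\script\lab\swap.ps1.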

ConvertTo-Json and ConvertFrom-Json with special characters

I have a file containing some properties, and the values of some of them contain escape characters, for example some URLs and regex patterns.
When reading the content and converting it back to JSON, with or without unescaping, the result is not correct: if I convert back to JSON with unescaping, some regular expressions break; if I convert without unescaping, URLs and some regular expressions break.
How can I solve the problem?
Minimal Complete Verifiable Example
Here are some simple code blocks to allow you simply reproduce the problem:
Content
$fileContent =
#"
{
"something": "http://domain/?x=1&y=2",
"pattern": "^(?!(\\`|\\~|\\!|\\#|\\#|\\$|\\||\\\\|\\'|\\\")).*"
}
"#
With Unescape
If I read the content and then convert it back to JSON using the following command:
$fileContent | ConvertFrom-Json | ConvertTo-Json | %{[regex]::Unescape($_)}
The output (which is wrong) would be:
{
"something": "http://domain/?x=1&y=2",
"pattern": "^(?!(\|\~|\!|\#|\#|\$|\||\\|\'|\")).*"
}
Without Unescape
If I read the content and then convert it back to JSON using the following command:
$fileContent | ConvertFrom-Json | ConvertTo-Json
The output (which is wrong) would be:
{
"something": "http://domain/?x=1\u0026y=2",
"pattern": "^(?!(\\|\\~|\\!|\\#|\\#|\\$|\\||\\\\|\\\u0027|\\\")).*"
}
Expected Result
The expected result should be same as the input file content.
I decided not to use Unescape and instead replace the Unicode \uxxxx escape sequences with their character values, and now it works properly:
$fileContent =
#"
{
"something": "http://domain/?x=1&y=2",
"pattern": "^(?!(\\`|\\~|\\!|\\#|\\#|\\$|\\||\\\\|\\'|\\\")).*"
}
"#
$fileContent | ConvertFrom-Json | ConvertTo-Json | %{
    [Regex]::Replace($_,
        "\\u(?<Value>[a-zA-Z0-9]{4})", {
            param($m) ([char]([int]::Parse($m.Groups['Value'].Value,
                [System.Globalization.NumberStyles]::HexNumber))).ToString() } )}
Which generates the expected output:
{
"something": "http://domain/?x=1&y=\\2",
"pattern": "^(?!(\\|\\~|\\!|\\#|\\#|\\$|\\||\\\\|\\'|\\\")).*"
}
If you don't want to rely on Regex (from @Reza Aghaei's answer), you could import the Newtonsoft JSON library. The benefit is the default StringEscapeHandling property which escapes control characters only. Another benefit is avoiding the potentially dangerous string replacements you would be doing with Regex.
This StringEscapeHandling is also the default handling in PowerShell Core (version 6 and up), because it has used Newtonsoft internally since that version. So another alternative would be to use ConvertFrom-Json and ConvertTo-Json from PowerShell Core.
Your code would look something like this if you import the Newtonsoft JSON library:
[Reflection.Assembly]::LoadFile("Newtonsoft.Json.dll")
$json = Get-Content -Raw -Path file.json -Encoding UTF8 # read file
$unescaped = [Newtonsoft.Json.Linq.JObject]::Parse($json) # similar to ConvertFrom-Json
$escapedElementValue = [Newtonsoft.Json.JsonConvert]::ToString($unescaped.apiName.Value) # similar to ConvertTo-Json
$escapedCompleteJson = [Newtonsoft.Json.JsonConvert]::SerializeObject($unescaped) # similar to ConvertTo-Json
Write-Output "Variable passed = $escapedElementValue"
Write-Output "Same JSON as Input = $escapedCompleteJson"
Note:
Applying [regex]::Unescape() isn't called for, as JSON's escaping is unrelated to regex escaping.
That is, $fileContent | ConvertFrom-Json | ConvertTo-Json should work as-is, but doesn't due to a quirk in Windows PowerShell, which caused the & in your input string to be represented as its equivalent escape sequence on re-conversion, \u0026; the quirk similarly affects ' (\u0027), < (\u003c) and > (\u003e).
tl;dr
The problem does not affect PowerShell (Core) 6+ (the install-on-demand, cross-platform PowerShell edition), which uses a different implementation of the ConvertTo-Json and ConvertFrom-Json cmdlets, namely, as of PowerShell 7.2.x, one based on Newtonsoft.JSON (whose direct use is shown in r3verse's answer). There, your sample roundtrip command works as expected.
Only ConvertTo-Json in Windows PowerShell is affected (the bundled-with-Windows PowerShell edition whose latest and final version is 5.1). But note that the JSON representation - while unexpected - is technically correct.
A simple, but robust solution focused only on unescaping those Unicode escape sequences that ConvertTo-Json unexpectedly creates - namely for & ' < > - while ruling out false positives:
# The following sample JSON with undesired Unicode escape sequences for `& < > '`
# was created with Windows PowerShell's ConvertTo-Json as follows:
# ConvertTo-Json "Ten o'clock at <night> & later. \u0027 \\u0027"
$json = '"Ten o\u0027clock at \u003cnight\u003e \u0026 later. \\u0027 \\\\u0027"'
[regex]::replace(
$json,
'(?<=(?:^|[^\\])(?:\\\\)*)\\u(00(?:26|27|3c|3e))',
{ param($match) [char] [int] ('0x' + $match.Groups[1].Value) },
'IgnoreCase'
)
The above outputs the desired JSON representation, without the unnecessary escaping of &, ', <, and >, and without having falsely replaced the escaped substrings \\u0027 and \\\\u0027:
"Ten o'clock at <night> & later. \\u0027 \\\\u0027"
Background information:
ConvertTo-Json in Windows PowerShell unexpectedly represents the following ASCII-range characters by their Unicode escape sequences in JSON strings:
& (Unicode escape sequence: \u0026)
' (\u0027)
< and > (\u003c and \u003e)
There's no good reason to do so (these characters only require escaping in HTML/XML text).
However, any compliant JSON parser - including ConvertFrom-Json - converts these escape sequences back to the characters they represent.
In other words: While the JSON text created by Windows PowerShell's ConvertTo-Json is unexpected and can impede readability, it is technically correct and - while not identical - equivalent to the original representation in terms of the data it represents.
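A quick way to observe this (the sample string is made up; the output comment reflects the behavior described above, in Windows PowerShell 5.1):
# Windows PowerShell 5.1 turns & ' < > into Unicode escape sequences:
ConvertTo-Json "Tom & Jerry's <list>"
# -> "Tom \u0026 Jerry\u0027s \u003clist\u003e"
# PowerShell (Core) 6+ leaves these characters as-is.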
Fixing the readability problem:
As an aside: While [regex]::Unescape(), whose purpose is to unescape regexes only, also converts Unicode escape sequences to the characters they represent, it is fundamentally unsuited to selectively unescaping Unicode escape sequences in JSON strings, given that all other \ escapes must be preserved in order for the JSON string to remain syntactically valid.
While your answer works well in general, it has limitations (aside from the easily corrected problem that a-zA-Z should be a-fA-F, to limit matching to those letters that are valid hex digits):
It doesn't rule out false positives, such as \\u0027 or \\\\u0027 (\\ escapes \, so that the u0027 part becomes a verbatim string and must not be treated as an escape sequence).
It converts all Unicode escape sequences, which presents two problems:
Escape sequences representing characters that require escaping would also be converted to their verbatim character representations, which would break the JSON; \u005c, for instance, represents \, which itself requires escaping.
For non-BMP Unicode characters that must be represented as pairs of Unicode escape sequences (so-called surrogate pairs), your solution would mistakenly try to unescape each half of the pair separately.
For a robust solution that overcomes these limitations, see this answer
(surrogate pairs are left as Unicode escape sequences, Unicode escape sequences
whose characters require escaping are converted to \-based (C-style) escapes, such as \n, if possible).
However, if the only requirement is to unescape those Unicode escape sequences
that Windows PowerShell's ConvertTo-Json unexpectedly creates, the solution at the top is sufficient.

Strip all non letters and numbers from a string in Powershell?

I have a variable that holds a string which receives strange characters like hearts.
Besides that point, I wanted to know anyhow: how do I leave a string with only letters and numbers, discarding the rest or replacing it with nothing (not adding a space or anything)?
My first thought was using a regular expression, but I wanted to know if PowerShell had something more native that does it automatically.
Let's say you have a variable like this:
$temp = '^gdf35#&fhd^^h%(#$!%sdgjhsvkushdv'
you can use the -replace operator to remove the non-word characters, like this:
$temp -replace "\W"
The result will be:
gdf35fhdhsdgjhsvkushdv
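One caveat worth noting (a side note, not from the answer above): \W treats the underscore as a word character, so underscores survive the replacement:
# The hyphen and exclamation mark are removed, but the underscore is kept.
'ab_c-d!' -replace "\W"   # -> ab_cd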
Consider white-listing approved characters, and replace anything that isn't whitelisted:
$temp = 'This-Is-Example!"£$%^&*.Number(1)'
$temp -replace "[^a-zA-Z0-9]"
ThisIsExampleNumber1
This gives added flexibility if you ever do want to include non-alphanumeric characters which may be expected as part of the string (e.g. dot, dash, space).
$temp = 'This-is-example!"£$%^&*.Number(2)'
$temp -replace "[^a-zA-Z0-9.-]"
This-is-example.Number2

Split lines to words then save as a new file

Say I have a text file test.txt on the C: drive.
On the face of things, we seem to be merely talking about text-based files, containing only
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128
characters that the American Standard Code for Information Interchange (ASCII)
allowed them. The focus is on mid-1980's textfiles and the world as it was then,
but even these files are sometime retooled 1960s and 1970s works, and offshoots
of this culture exist to this day.
I want to split all lines into words and then save the result as a new file. In the new file, each line contains only one word.
Thus the new file will be:
On
the
face
of
things
we
seem
to
....
The delimiter is whitespace, and please skip all punctuation marks.
You haven't even tried. Next time I'm voting to close the question. PowerShell uses 99% of C# syntax and "all" .NET classes are available, so if you know C#, you will come far in PowerShell by spending 5 minutes on Google and trying some commands.
#create array
$words = @()
#read file
$lines = [System.IO.File]::ReadAllLines("C:\Users\Frode\Desktop\in.txt")
#split words
foreach ($line in $lines) {
    $words += $line.Split(" ,.", [System.StringSplitOptions]::RemoveEmptyEntries)
}
#save words
[System.IO.File]::WriteAllLines("C:\Users\Frode\Desktop\out.txt", $words)
In PowerShell you could also do it like this:
Get-Content .\in.txt | ForEach-Object {
$_.Split(" ,.", [System.StringSplitOptions]::RemoveEmptyEntries)
} | Set-Content out.txt
$Text = @'
On the face of things, we seem to be merely talking about
text-based files, containing only the letters of the English Alphabet
(and the occasional punctuation mark). On deeper inspection, of
course, this isn't quite the case. What this site offers is a glimpse
into the history of writers and artists bound by the 128 characters
that the American Standard Code for Information Interchange (ASCII)
allowed them. The focus is on mid-1980's textfiles and the world as it
was then, but even these files are sometime retooled 1960s and 1970s
works, and offshoots of this culture exist to this day.
'@
[regex]::split($Text, '\W+')
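A hedged refinement of this one-liner (not part of the original answer): drop the empty strings that the split can produce at the edges and write one word per line to a file; the output path is a placeholder:
[regex]::Split($Text, '\W+') |
    Where-Object { $_ } |            # discard empty entries caused by leading/trailing delimiters
    Set-Content 'C:\words.txt'       # made-up output path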
Here's a solution using regular expressions, that will:
Remove special characters
Parse words based on word boundaries (\b in regex)
Code:
$Text = @'
On the face of things, we seem to be merely talking about text-based files, containing only
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128
characters that the American Standard Code for Information Interchange (ASCII)
allowed them. The focus is on mid-1980's textfiles and the world as it was then,
but even these files are sometime retooled 1960s and 1970s works, and offshoots
of this culture exist to this day.
'@;
# Remove special characters
$Text = $Text -replace '\(|\)|''|\.|,','';
# Match words
$MatchList = ([Regex]'(?<word>\b\w+\b)').Matches($Text);
# Get just the text values of the matches
$WordList = $MatchList | % { $PSItem.Groups['word'].Value; };
# Examine the 'Count' of words
$WordList.Count
Result looks like:
$WordList[0..9];
On
the
face
of
things
we
seem
to
be
merely
I wouldn't bother splitting the string, since you write the result back to a file anyway. Just replace all punctuation (and perhaps parentheses as well) with spaces, replace all consecutive whitespace with newlines, and write everything back to a file:
$in = 'C:\test.txt'
$out = 'C:\test2.txt'
(Get-Content $in | Out-String) -replace '[.,;:?!()]',' ' -replace '\s+',"`r`n" |
Set-Content $out