Split lines to words then save as a new file - powershell

Say I have a text file test.txt in C drive.
On the face of things, we seem to be merely talking about text-based files, containing only
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128
characters that the American Standard Code for Information Interchange (ASCII)
allowed them. The focus is on mid-1980's textfiles and the world as it was then,
but even these files are sometime retooled 1960s and 1970s works, and offshoots
of this culture exist to this day.
I want to split all lines to words then save it as a new file. In the new file, each line only contains one word.
Thus the new file will be:
On
the
face
of
things
we
seem
to
....
The delimiter is a white space and please skip all punctuation marks.

You haven't even tried. The next time I'm voting for closed question. Powershell uses 99% of c# syntax and "all" .Net classes are available, so if you know c#, you will come far in PowerShell by using 5 minutes on google and trying some commands.
#create array
$words = #()
#read file
$lines = [System.IO.File]::ReadAllLines("C:\Users\Frode\Desktop\in.txt")
#split words
foreach ($line in $lines) {
$words += $line.Split(" ,.", [System.StringSplitOptions]::RemoveEmptyEntries)
}
#save words
[System.IO.File]::WriteAllLines("C:\Users\Frode\Desktop\out.txt", $words)
In PowerShell you could also do it like this:
Get-Content .\in.txt | ForEach-Object {
$_.Split(" ,.", [System.StringSplitOptions]::RemoveEmptyEntries)
} | Set-Content out.txt

$Text = #'
On the face of things, we seem to be merely talking about
text-based files, containing only the letters of the English Alphabet
(and the occasional punctuation mark). On deeper inspection, of
course, this isn't quite the case. What this site offers is a glimpse
into the history of writers and artists bound by the 128 characters
that the American Standard Code for Information Interchange (ASCII)
allowed them. The focus is on mid-1980's textfiles and the world as it
was then, but even these files are sometime retooled 1960s and 1970s
works, and offshoots of this culture exist to this day.
'#
[regex]::split($Text, ‘\W+’)

Here's a solution using regular expressions, that will:
Remove special characters
Parse words based on word boundaries (\b in regex)
Code:
$Text = #'
On the face of things, we seem to be merely talking about text-based files, containing only
the letters of the English Alphabet (and the occasional punctuation mark).
On deeper inspection, of course, this isn't quite the case. What this site
offers is a glimpse into the history of writers and artists bound by the 128
characters that the American Standard Code for Information Interchange (ASCII)
allowed them. The focus is on mid-1980's textfiles and the world as it was then,
but even these files are sometime retooled 1960s and 1970s works, and offshoots
of this culture exist to this day.
'#;
# Remove special characters
$Text = $Text -replace '\(|\)|''|\.|,','';
# Match words
$MatchList = ([Regex]'(?<word>\b\w+\b)').Matches($Text);
# Get just the text values of the matches
$WordList = $MatchList | % { $PSItem.Groups['word'].Value; };
# Examine the 'Count' of words
$WordList.Count
Result looks like:
$WordList[0..9];
On
the
face
of
things
we
seem
to
be
merely

I wouldn't bother splitting the string, since you write the result back to a file anyway. Just replace all punctuation (and perhaps parentheses as well) with spaces, replace all consecutive whitespace with newlines, and write everything back to a file:
$in = 'C:\test.txt'
$out = 'C:\test2.txt'
(Get-Content $in | Out-String) -replace '[.,;:?!()]',' ' -replace '\s+',"`r`n" |
Set-Content $out

Related

Split text after each end of line [duplicate]

This question already has answers here:
Powershell: split string with newline, then use -contains
(3 answers)
Closed 1 year ago.
I have a script that works perfectly fine with Powershell 5.x, but does not work anymore on Powershell Core (7.2.1)
The problem happens when I try to split a text (copy&past from an email)..
It all comes down to this part of the code:
$test="blue
green
yellow
"
#$test.Split([Environment]::NewLine)
$x = $test.Split([Environment]::NewLine)
$x[0]
$x[1]
In Powershell 5 the value for $x[0]==blue and $x[1]==green
But in Powershell Core the split doesn't do anything and $x[1] is "non existent".
In Powershell 7 the line breaks are handled differently (that's at least what I assume), but I couldn't find a solution to it..
I tried it with changing the code to
$rows = $path.split([Environment]::NewLine) and $path.Split([System.Environment]::NewLine, [System.StringSplitOptions]::RemoveEmptyEntries) but that doesn't change anything..
Also, when I use a "here-string"
$test = #'
green
yellow
blue
white
'#
$x= $test -split "`r`n", 5, "multiline"
Everything excepts $x[0] is empty (i.e $x[2])
I was already looking here: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_split?view=powershell-7.2
And here: powershell -split('') specify a new line
And here: WT: Paste multiple lines to Windows Terminal without executing
So far I have not found a solution to my problem.
Any help is appreciated.
EDIT: I found a hint about that problem, but don't understand the implications of it yet: https://n-v-o.github.io/2021-06-10-String-Method-in-Powershell-7/
EDIT 2:
Thanks everyone for participating in answering my question.
First I thought I'm going to write a long explanation why my question is different then the duplicated answer from #SantiagoSquarzon. But while reading the answers to my question and the other question I noticed I was doing something differently..
Apparently there is something differnt when I use
$splits = $test -split '`r?`n' # doesn't work in 5.1 and 7.2.1
$splits = $test -split '\r?\n' # works in 5.1 and 7.2.1 as suggested from Santiago and others
BUT
$splits = $test.Split("\r?\n") # doesn't work in 5.1 and 7.2.1
$splits = $test.Split("`r?`n") # doesn't work in 5.1 and 7.2.1
$splits = $test.Split([char[]]"\r\n") # doesnt' work in 7.2.1
$splits = $test.Split([char[]]"`r`n") # works in 7.2.1
tl;dr:
Use -split '\r?\n to split multiline text into lines irrespective of whether Windows-format CRLF or Unix-format LF newlines are used (it even handles a mix of these formats in a single string).
If you additionally want to handle CR-only newlines (which would be unusual, but appears to be the case for you), use -split '\r?\n|\r'
On Windows, with CRLF newlines only, .Split([Environment]::NewLine) only works as intended in PowerShell (Core) 7+, not in Windows PowerShell (and, accidentally, in Windows PowerShell only with CR-only newlines, as in your case.) To explicitly split by CR only, .Split("`r") would happen to work as intended in both editions, due to splitting by a single character only.
# Works on both Unix and Windows, in both PowerShell editions.
# Input string contains a mix of CRLF and LF and CR newlines.
"one`r`ntwo`nthree`rfour" -split '\r?\n|\r' | ForEach-Object { "[$_]" }
Output:
[one]
[two]
[three]
[four]
This is the most robust approach, as you generally can not rely on input text to use the platform-native newline format, [Environment]::NewLine; see the bottom section for details.
Note:
The above uses PowerShell's -split operator, which operates on regexes (regular expressions), which enables the flexible matching logic shown above.
This regex101.com page explains the \r?\n|\r regex and allows you to experiment with it.
By contrast, the System.String.Split() .NET method only splits by literal strings, which, while faster, limits you to finding verbatim separators.
The syntax implications are:
Regex constructs such as escape sequences \r (CR) and \n (LF) are only supported by the .NET regex engine and therefore only by -split (and other PowerShell contexts where regexes are being used); ditto for regex metacharacters ? (match the preceding subexpression zero or one time) and | (alternation; match the subexpression on either side).
Inside strings (which is how regexes must be represented in PowerShell, preferably inside '...'), these sequences and characters have no special meaning, neither to PowerShell itself nor to the .Split() method, which treats them all verbatim.
By contrast, the analogous escape sequences "`r" (CR) and "`n" (LF) are PowerShell features, available in expandable strings, i.e. they work only inside "..." - not also inside verbatim strings, '...' - and are expanded to the characters they represent before the target operator, method, or command sees the resulting string.
This answer discusses -split vs. .Split() in more depth and recommends routine use of -split.
As for what you tried:
Use [Environment]::NewLine only if you are certain that the input string uses the platform-native newline format. Notably, multiline string literals entered interactively at the PowerShell prompt use Unix-format LF newlines even on Windows (the only exception is the obsolescent Windows-only ISE, which uses CRLF).
String literals in script files (*.ps1) use the same newline format that the script is saved in - which may or may not be the platform's format.
Additionally, as you allude to in your own answer, the addition of a string parameter overload in the System.String.Split() method in .NET Core / .NET 5+ - and therefore PowerShell (Core) v6+ - implicitly caused a breaking change relative to Windows PowerShell: specifically, .Split('ab') splits by 'a' or 'b' - i.e. by any of the individual characters that make up the string - in Windows PowerShell, whereas it splits by the whole string, 'ab', in PowerShell (Core) v6+.
Such implicit breaking changes are rare, but they do happen, and they're outside PowerShell's control.
For that reason, you should always prefer PowerShell-native features for long-term stability, which in this case means preferring the -split operator to the .Split() .NET method.
That said, sometimes .NET methods are preferable for performance reasons; you can make them work robustly, but only if carefully match the exact data types of the method overloads of interest, which may require cast; see below.
See this answer for more information, including a more detailed explanation of the implicit breaking change.
Your feedback on -split '\r?\n' not working for you and the solutions in your own answer suggest that your input string - unusually - uses CR-only newlines.
Your answer's solutions would not work as expected with Windows-format CRLF-format text, because splitting would happen for each CR and LF in isolation, which would result in extra, empty elements in the output array (each representing the empty string "between" a CRLF sequence).
If you did want to split by [Environment]::NewLine on Windows - i.e. by CRLF - and you wanted to stick with the .Split() method, in order to make it work in Windows PowerShell too, you'd need to call the overload that expects a [string[]] argument, indicating that each string (even if only one) is to be used as a whole as the separator - as opposed to splitting by any of its individual characters:
# On Windows, split by CRLF only.
# (Would also work on Unix with LF-only text.)
# In PowerShell (Core) 7+ only, .Split([Environment]::NewLine) would be enough.
"one`r`ntwo`r`nthree".Split([string[]] [Environment]::NewLine, [StringSplitOptions]::None) |
ForEach-Object { "[$_]" }
Output:
[one]
[two]
[three]
While this is obviously more ceremony than using -split '\r?\n', it does have the advantage of performing better - although that will rarely matter. See the next section for a generalization of this approach.
Using an unambiguous .Split() call for improved performance:
Note:
This is only necessary if -split '\r?\n' or -split '\r?\n|\r' turns out to be too slow in practice, which won't happen often.
To make this work robustly, in both PowerShell editions as well as long-term, you must carefully match the exact data types of the .Split() overload of interest.
The command below is the equivalent of -split '\r?\n|\r', i.e. it matches CRLF, LF, and CR newlines. Adapt the array of strings for more restrictive matching.
# Works on both Unix and Windows, in both PowerShell editions
"one`r`ntwo`nthree`rfour".Split(
[string[]] ("`r`n", "`n", "`r"),
[StringSplitOptions]::None
) | ForEach-Object { "[$_]" }
The reason: When pasting text into the terminal, it matters which terminal you are using. The default powershell 5.1, ISE terminals, and most other Windows software separates new lines with both carriage return \r and newline \n characters. We can check by converting to bytes:
# 5.1 Desktop
$test = "a
b
c"
[byte[]][char[]]$test -join ','
97,13,10,98,13,10,99
#a,\r,\n, b,\r,\n, c
Powershell Core separates new lines with only a newline \n character
# 7.2 Core
$test = "a
b
c"
[byte[]][char[]]$test -join ','
97,10,98,10,99
On Windows OS, [Environment]::NewLine is \r\n no matter which console. On Linux, it is \n.
The solution: split multiline strings on either \r\n or \n (but not on only \r). The easy way here is with regex like #Santiago-squarzon suggests:
$splits = $test -split '\r?\n'
$splits[0]
a
$splits[1]
b
Thanks to this site I found a solution:
https://n-v-o.github.io/2021-06-10-String-Method-in-Powershell-7/
In .NET 4, the string class only had methods that took characters as
parameter types. PowerShell sees this and automagically converts it,
to make life a little easier on you. Note there’s an implied ‘OR’ (|)
here as it’s an array of characters.
Why is PowerShell 7 behaving differently? In .NET 5, the string class
has some additional parameters that accept strings. PowerShell 7 does
not take any automatic action.
In order to fix my problem, I had to use this:
$test.Split("`r").Split("`n") #or
$test.Split([char[]]"`r`n")

Text file search for match strings regex

I am trying to understand how regex works and what are the possibilities of working with it.
So I have a txt file and I am trying to search for 8 char long strings containing numbers. for now I use a quite simple option:
clear
Get-ChildItem random.txt | Select-String -Pattern [0-9][a-z] | foreach {$_.line}
It sort of works but I am trying to find a better option. ATM it takes too long to read through the left out text since it writes entire lines and it does not filter them by length.
You can use a lookahead to assert that a string contains at least 1 digit, then specify the length of the match and finally anchor it with ^ (start of string) and $ (end of string) if the string is on a line of its own, or \b (word boundary) if it's part of an HTML document as your comments seem to suggest:
Get-ChildItem C:\files\ |Select-String -Pattern '^(?=.*\d)\w{8}$'
Get-ChildItem C:\files\ |Select-String -Pattern '\b(?=.*\d)\w{8}\b'
The pattern [0-9][a-z] matches a digit followed by a letter. If you want to match a sequence of 8 characters use .{8}. The dot in regular expressions matches any character except newlines. A number in curly brackets matches the preceding expression the given number of times.
If you want to match non-whitespace characters use \S instead of .. If you want to match only digits and letters use [0-9a-z] (a character class) instead of ..
For a more thorough introduction please go find a tutorial. The subject is way too complex to be covered by a single answer on SO.
What you're currently searching for is a single number ranging from 0-9 followed by a single lowercase letter ranging from a-z.
this, for example, will match any 8 char long strings containing only alphanumeric characters.
\w{8}
i often forget what some regex classes are, and it may be useful to you as a learning tool, but i use this as a point of reference: http://regexr.com/
It can also validate what you're typing inline via a text field so you can see if what you're doing works or not.
If you need more of a tutorial than a reference, i found this extremely useful when i learned: regexone.com

Strip all non letters and numbers from a string in Powershell?

I have a variable that has a string which recieves strange characters like hearts.
Besides that point, I wanted to know anyhow: How do I leave a string with only letters and numbers, discarding the rest or replacing it with nothing (not adding a space or anything)
My first thought was using a regular expression but I wanted to know if Powershell had something more native that does it automatically.
Let say you have variable like this:
$temp = '^gdf35#&fhd^^h%(#$!%sdgjhsvkushdv'
you can use the -replace method to replace only the Non-word characters like this:
$temp -replace "\W"
The result will be:
gdf35fhdhsdgjhsvkushdv
Consider white-listing approved characters, and replace anything that isn't whitelisted:
$temp = 'This-Is-Example!"£$%^&*.Number(1)'
$temp -replace "[^a-zA-Z0-9]"
ThisIsExampleNumber1
This gives added flexibility if you ever do want to include non-alphanumeric characters which may be expected as part of the string (e.g. dot, dash, space).
$temp = 'This-is-example!"£$%^&*.Number(2)'
$temp -replace "[^a-zA-Z0-9.-]"
This-is-example.Number2

Removing "SUB" characters (ASCII=26, which is outside of normally searchable special characters) with powershell

I'm trying to remove all of the SUB (substitute, ASCII=26) characters from a large text file
I want to import a large file into sas, but sas bombs out (actually it just stops and reports a success, which is much worse) when it reaches an ususual character that looks like a "->" when viewed in excel. Using the Code() function in excel, it identifies this character as a "26" which I believe is the ASCII 26 or SUB (substitute) character.
Anyway, I'd like to remove all these "->" characters from the file so I can import them into sas, so was thinking I could use powershell (one of the few tools available to me).
I'm new to powershell, but none of the documentation I've been able to find on Select-String has information on writing hex or arbitiray ascii characters, just a fixed list of normal special characters that doesn't include the character I'm struggling with.
Any ideas how I can remove all the SUB characters from a text file using powershell?
You can use \xnn in a regex to match arbitrary character codes expressed as hex. 026 = x1a
For Unicode, the format is \unnnn
Clear-Content $outputfile
Get-Content $inputfile -ReadCount 1000 |
foreach {
$_ -replace '\u001a' |
Add-Content $outputfile
}

Convert a CSV file with embedded commas into a bash array by line efficiently

Normally, I do something like
IFS=','
columns=( $LINE )
where $LINE is a line from a csv file I'm reading.
However, how do I handle a csv file with embedded commas? I have to handle several hundred gigs of file so everything needs to be done quickly, i.e., no multiple readings of a line, definitely no loops (last time I tried that slowed it down several factors).
The general structure of the code is as follows
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Preferably, I need something that goes
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Any tips would be appreciated. Otherwise, I'll probably switch to using another language to handle this stuff.
Probably embedded commas is just the first obvious problem that you encountered while parsing those CSV files.
Future problems that might popped are:
embedded newline separator characters
embedded utf8 chars
special treatment for whitespaces, empty fields, spaces around commas, undef values
I generally tend to follow the philosophy that If there is a (reputable) module that parses some
format you have to parse, use it instead of making a homebrew
I don't think there is such a thing for bash, but there are some for Perl. I'd go for Text::CSV_XS. Being written in C I expect it to be very fast.
You can use sed or something similar to convert the commas within quotes to some other sequence or punctuation. If you don't care about the stuff in quotes then you do not even need to change them back. You can do this on the whole file:
sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g' input.csv > intermediate.csv
or on each line:
line=$(echo $line | sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g')
This isn't a complete answer, but it's a possible approach.
Find a character that never occurs in the input file. Use a C program that parses the CSV file and prints lines to standard output with a different delimiter. Writing that program is left as an exercise, but I'm sure there's CSV-parsing C source code out there. Pipe the output of the C program into your script.
For example:
FILENAME=$1
new_c_program $FILENAME | while read LINE
do
IFS="|"
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
A minor point: I'd pick a name other than $newline; newline suggests an end-of-line marker rather than an entire line.
Another minor point: you have a "Useless Use Of cat" in the code in your question. You could replace this:
cat $FILENAME | while read LINE
do
...
done
by this:
while read LINE
do
...
done < $FILENAME
But if you replace cat by the hypothetical C program I suggested, you still need the pipe.