Using PowerShell To Count Sentences In A File

I am having an issue with my PowerShell script, which counts the number of sentences in a file. I am using the following code:
foreach ($Sentence in (Get-Content file))
{
    $i = $Sentence.Split("?")
    $n = $Sentence.Split(".")
    $Sentences += $i.Length
    $Sentences += $n.Length
}
The total number of sentences should be 61, but I am getting 71. Could someone please help me out with this? I have $Sentences set to zero beforehand as well.
Thanks

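foreach ($Sentence in (Get-Content file))
{
    $i = $Sentence -split '[?.]'
    $Sentences += $i.Length - 1
}
I edited your code a bit.
The .NET Split() method you were using treats its argument as literal characters, not as a regular expression. To split on a pattern, use the -split operator instead. With -split, a bare . is a regex metacharacter that means "any character", so it needs to be escaped or placed inside a character class.
So you should split the string on '[?.]' or '[?\.]' or similar, and subtract one from the fragment count on each line, because splitting at N terminators yields N + 1 fragments.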

When counting sentences, what you are looking for is where each sentence ends. Splitting, though, returns a collection of sentence fragments around those end characters, with the ends themselves represented by the gaps between elements. Therefore, the number of sentences equals the number of gaps, which is one less than the number of fragments in the split result.
Of course, as Keith Hill pointed out in a comment above, the actual splitting is unnecessary when you can count the ends directly.
foreach( $Sentence in (Get-Content test.txt) ) {
    # Split at every occurrence of '.' and '?', and count the gaps.
    # The [char[]] cast makes each character a separate separator regardless of PowerShell version.
    $Split = $Sentence.Split( [char[]]'.?' )
    $SplitSentences += $Split.Count - 1
    # Count every occurrence of '.' and '?'.
    $Ends = [char[]]$Sentence -match '[.?]'
    $CountedSentences += $Ends.Count
}
Contents of test.txt file:
Is this a sentence? This is a
sentence. Is this a sentence?
This is a sentence. Is this a
very long sentence that spans
multiple lines?
Also, to clarify on the remarks to Vasili's answer: the PowerShell -split operator interprets a string as a regular expression by default, while the .NET Split method only works with literal string values.
For example:
'Unclosed [bracket?' -split '[?]' will treat [?] as a regular expression character class and match the ? character, returning the two strings 'Unclosed [bracket' and ''
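'Unclosed [bracket?'.Split( '[?]' ) will, in Windows PowerShell, call the Split(char[]) overload and match each [, ?, and ] character, returning the three strings 'Unclosed ', 'bracket', and ''. (PowerShell 7 resolves the same call to a Split(string) overload and looks for the literal text [?] instead; cast the argument to [char[]] to get per-character splitting there.)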
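A minimal sketch of the direct-count approach, assuming the whole text is available as a single string so that a sentence spanning several lines is still counted once. The sample text is inlined as a here-string with the same contents as test.txt, and the variable names are illustrative only. With a real file, Get-Content -Raw would supply the string.

```powershell
# Sample text with the same contents as test.txt (inlined so the sketch is self-contained).
$text = @'
Is this a sentence? This is a
sentence. Is this a sentence?
This is a sentence. Is this a
very long sentence that spans
multiple lines?
'@

# Count every '.' and '?' directly; inside a character class neither needs escaping.
$sentenceCount = [regex]::Matches($text, '[.?]').Count

# The same information via splitting: N terminators produce N + 1 fragments.
$fragmentCount = ($text -split '[.?]').Count

"Sentences: $sentenceCount"
"Fragments: $fragmentCount"
```

For this sample the sketch reports 5 sentences and 6 fragments. It treats every . and ? as a sentence end, so text containing abbreviations or decimal numbers would be overcounted.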

Related

How to find the positions of all instances of a string in a specific line of a txt file?

Say that I have a .txt file with lines of multiple dates/times:
5/5/2020 5:45:45 AM
5/10/2020 12:30:03 PM
And I want to find the position of all slashes in one line, then move on to the next.
So for the first line I would want it to return the value:
1 3
And for the second line I would want:
1 4
How would I go about doing this?
I currently have:
$firstslashpos = Get-Content .\Documents\LoggedDates.txt | ForEach-Object{
$_.IndexOf("/")}
But that gives me only the first "/" on each line, and gives me that result for all lines at once. I need it to loop where I can figure out the space between each "/" for each line.
Sorry if I worded this badly.

You can indeed use the String.IndexOf() method for this!
function Find-SubstringIndex
{
    param(
        [string]$InputString,
        [string]$Substring
    )
    $indices = @()
    # start at position zero
    $offset = 0
    # Keep calling IndexOf() to find the next occurrence of the substring
    # stop when IndexOf() returns -1
    while(($i = $InputString.IndexOf($Substring, $offset)) -ne -1){
        # Keep track of the index at which the substring was found
        $indices += $i
        # Update the offset, we'll want to start searching for the next index _after_ this one
        $offset = $i + $Substring.Length
    }
    # Output the collected indices to the caller
    return $indices
}
Now you can do:
Get-Content listOfDates.txt | ForEach-Object {
    $indices = Find-SubstringIndex -InputString $_ -Substring '/'
    Write-Host "Found slash at indices: $($indices -join ',')"
}

A concise solution is to use [regex]::Matches(), which finds all matches of a given regular expression in a given string and returns a collection of match objects that also indicate the index (character position) of each match:
# Create a sample file.
@'
5/5/2020 5:45:45 AM
5/10/2020 12:30:03 PM
'@ > sample.txt
Get-Content sample.txt | ForEach-Object {
    # Get the indices of all '/' instances.
    $indices = [regex]::Matches($_, '/').Index
    # Output them as a list (string), separated with spaces.
    "$indices"
}
The above yields:
1 3
1 4
Note:
Input lines that contain no / instances at all will result in empty lines.
If, rather than strings, you want to output the indices as arrays (collections), use , [regex]::Matches($_, '/').Index as the only statement in the ForEach-Object script block. The unary form of , (the array constructor operator) ensures, by way of a transient auxiliary array, that the collection returned by the method call is output as a whole. If you omit the , the indices are output one by one, resulting in a flat array when collected in a variable.
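The question also asks for the space between the slashes on each line. A minimal sketch of that step, building on the [regex]::Matches() indices, could look like the following. The two sample lines are taken from the question, while the variable names ($lines, $distances, $gaps) are illustrative assumptions.

```powershell
# Two sample lines in the same format as the question's file.
$lines = '5/5/2020 5:45:45 AM', '5/10/2020 12:30:03 PM'

$gaps = foreach ($line in $lines) {
    # @(...) guarantees an array even when there is only one match.
    $indices = @([regex]::Matches($line, '/').Index)

    # Distance between each slash and the one before it.
    $distances = for ($k = 1; $k -lt $indices.Count; $k++) {
        $indices[$k] - $indices[$k - 1]
    }

    # One space-separated string per input line.
    "$distances"
}

$gaps
```

For the two sample lines this yields 2 and 3. A line with fewer than two slashes produces an empty string.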

In PowerShell, how do I copy the last alphabet characters from a string which also has numbers in it to create a variable?

For example, if the string is blahblah02baboon, I need to get the "baboon" separated from the rest, so that the variable contains only the characters "baboon". Every string I need to do this with has alphabet characters first, then 2 numbers, then more alphabet characters, so it should be the same process every time.
Any advice would be greatly appreciated.

My advice is to learn about regular expressions.
'blahblah02baboon' -replace '\D*\d*(\w*)', '$1'

Or use regex
$MyString = "01baaab01blah02baboon"
# Match any character which is not a digit
$Result = [regex]::matches($MyString, "\D+")
# Take the last result
$LastResult = $Result[$Result.Count-1].Value
# Output
Write-Output "My last result = $LastResult"
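Since the question states a fixed layout (letters, two digits, letters), the pattern can also be anchored to that layout with a capturing group. This is only a sketch under that assumption; the variable name $Suffix is illustrative, and strings that do not follow the layout leave it set to $null.

```powershell
$MyString = 'blahblah02baboon'

# Assumed layout: letters, exactly two digits, then the trailing letters to capture.
if ($MyString -match '^[A-Za-z]+\d{2}([A-Za-z]+)$') {
    # $Matches[1] holds the text captured by the first pair of parentheses.
    $Suffix = $Matches[1]
} else {
    $Suffix = $null
}

"Suffix = $Suffix"
```

Unlike taking the last run of non-digits, this version rejects input that does not match the expected shape, which may or may not be what you want.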

Extract the nth to nth characters of an string object

I have a filename and I wish to extract two portions of this and add into variables so I can compare if they are the same.
$name = 'FILE_20161012_054146_Import_5785_1234.xml'
So I want...
$a = 5785
$b = 1234
if ($a -eq $b) {
    # do stuff
}
I have tried to extract the 36th up to the 39th character
Select-Object {$_.Name[35,36,37,38]}
but I get
{5, 7, 8, 5}
I have considered splitting, but it looks messy.

There are several ways to do this. One of the most straightforward, as PetSerAl suggested, is with .Substring():
$_.name.Substring(35,4)
Another way is with square brackets, as you tried to do, but that gives you an array of [char] objects, not a string. You can use -join to turn it back into a string, and a range to make the indexing easier:
$_.name[35..38] -join ''
For what you're doing, matching a pattern, you could also use a regular expression with capturing groups:
if ($_.name -match '_(\d{4})_(\d{4})\.xml$') {
    if ($Matches[1] -eq $Matches[2]) {
        # ...
    }
}
This way can be very powerful, but you need to learn more about regex if you're not familiar with it. In this case it's looking for an underscore _ followed by 4 digits (0-9), followed by an underscore and four more digits, followed by .xml at the end of the string. The digits are wrapped in parentheses so they are captured separately to be referenced later (in $Matches).
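A self-contained sketch of the capturing-group approach, run against the sample filename from the question rather than a pipeline object. The variable names $first, $second and $same are illustrative.

```powershell
$name = 'FILE_20161012_054146_Import_5785_1234.xml'

# Capture the two 4-digit groups immediately before the .xml extension.
if ($name -match '_(\d{4})_(\d{4})\.xml$') {
    $first  = $Matches[1]
    $second = $Matches[2]

    # The captures are strings, so this is a string comparison.
    $same = $first -eq $second

    "first=$first second=$second same=$same"
}
```

For the sample name the two groups are 5785 and 1234, so $same is $false. Because the pattern is anchored at the end of the string, it does not depend on the length of the earlier parts of the filename.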

Yet another approach: each of the four expressions below returns the substring 1234.
$FileName = "FILE_20161012_054146_Import_5785_1234.xml"
# $FileName
$FileName.Substring(33,4) # Substring method (zero-based)
-join $FileName[33..36] # indexing from the beginning (zero-based)
-join $FileName[-8..-5] # reverse indexing:
# e.g. $FileName[-1] returns the last character
$FileArr = $FileName.Split([char[]]"_.") # Split on each of '_' and '.' (depends only on the filename "pattern template")
$FileArr[$FileArr.Count - 2] # does not depend on the lengths of the tokens

Perl Regex to match words with more than 2 characters

I am new to Perl and working on a regex to match only words with 3 or more letters. Here is the program I am trying. I tried adding \w{3,} since it should match 3 or more characters, but it is still matching words with fewer than 3 characters. For example, if I give "This is a Pattern", I want my $field to match only "This" and "Pattern" and skip "is" and "a".
#!/usr/bin/perl
while (<STDIN>) {
    foreach my $reg_part (split(/\s+/, $_)) {
        if ($reg_part =~ /([^\w\#\.]*)?([\w{3,}\#\(\)\+\$\.]+)(?::(.+))?/) {
            print "reg_part = $reg_part \n";
            my ($mod, $field, $pat) = ($1, $2, $3);
            print "#$mod#$field#$pat#$negate#\n";
        }
    }
}
exit(0);
What am I missing?

You have
[\w{3,}...]+
which is the same as
[{},3\w...]+
I think you want
(?:\w{3,}|[\$\#()+.])+

Break your regular expression up.
You know you want three word characters, so specify :-
# Match three word characters.
\w{3}
After that, you don't really care if the word has more characters, but you won't block it either.
# Match 0 or more word characters
\w*
Finally, you want to ensure that you have boundaries to catch the end of words. So, putting it all together. To match a word with at least three word characters, possibly more, use:-
# Word boundaries at start and end
\b\w{3}\w*\b
Note - \w matches alphanumeric characters and the underscore - if it's just alpha you need:-
# Alpha only
\b[A-Za-z]{3}[A-Za-z]*\b
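The difference between the word-boundary pattern and a quantifier placed inside a character class can be checked quickly. The sketch below uses PowerShell (the language used elsewhere on this page) and the .NET regex engine rather than Perl. For these simple constructs the two engines behave the same, but treat it as an illustration, not as Perl code. The variable names are hypothetical.

```powershell
$text = 'This is a Pattern'

# \b\w{3}\w*\b : a whole word of at least three word characters.
$words = [regex]::Matches($text, '\b\w{3}\w*\b') | ForEach-Object { $_.Value }

# Inside a character class, {3,} is not a quantifier: the class just matches
# single characters (\w, '{', '3', ',' or '}'), so short words still match.
$classMatches = [regex]::Matches($text, '[\w{3,}]+') | ForEach-Object { $_.Value }

"Word-boundary pattern:   $($words -join ', ')"
"Character-class pattern: $($classMatches -join ', ')"
```

The first pattern yields This and Pattern only, while the second yields all four words.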

What's happening in this Perl foreach loop?

I have this Perl code:
foreach (@tmp_cycledef)
{
    chomp;
    my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
    $cycle_code =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
    $close_day =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
    $first_date =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
    #print "$cycle_code, $close_day, $first_date\n";
    $cycledef{$cycle_code} = [ $close_day, split(/-/,$first_date) ];
}
The value of tmp_cycledef comes from output of an SQL query:
select cycle_code,cycle_close_day,to_char(cycle_first_date,'YYYY-MM-DD')
from cycle_definition d
order by cycle_code;
What exactly is happening inside the for loop?

Huh, I'm surprised no one fixed it for you :)
It looks like the person who wrote this was trying to trim leading and trailing whitespace from each field. It's a really odd way to do that, and for some reason he was overly concerned with interior whitespace in each field despite his anchors.
I think that should be the same as trimming the whitespace around the delimiter in the split:
foreach (@tmp_cycledef)
{
    s/^\s+//; s/\s+$//; #leading and trailing whitespace on the whole string
    my ($cycle_code, $close_day, $first_date) = split(/\s*\|\s*/, $_, 3);
    $cycledef{$cycle_code} = [ $close_day, split(/-/,$first_date) ];
}
The key to thinking about split is considering which parts of the string you want to throw away, not just what separates the fields that you want.

For the regex part, s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/ strips leading and trailing whitespace.

Each row in @tmp_cycledef is a string in the format "cycle_code | close_day | first_date".
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
This splits the string into three parts. The regular expressions that follow strip leading and trailing whitespace from each part.
The last statement of the loop creates an entry in the hash %cycledef, keyed by $cycle_code. The entry is an array reference laid out as follows:
[ $close_day, YYYY, MM, DD ]
where $first_date = "YYYY-MM-DD".
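To make the resulting layout concrete, here is a rough PowerShell analogue of the loop (PowerShell being the language used elsewhere on this page). It is not the original Perl. The sample row is hypothetical, since the real rows come from the SQL query, and the variable names simply mirror the Perl ones.

```powershell
# Hypothetical sample row; real rows come from the SQL query in the question.
$rows = @(' A1 | 15 | 2020-01-31 ')

$cycledef = @{}
foreach ($row in $rows) {
    # Split on '|' into at most three fields, then trim whitespace from each field.
    $cycleCode, $closeDay, $firstDate = ($row -split '\|', 3) | ForEach-Object { $_.Trim() }

    # Value layout: close day followed by the year, month and day parts of the date.
    $cycledef[$cycleCode] = @($closeDay) + ($firstDate -split '-')
}

$cycledef['A1'] -join ', '
```

For the sample row, the entry stored under the key A1 holds the four elements 15, 2020, 01 and 31.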

@tmp_cycledef: the output of the SQL query is stored in this array.
foreach (@tmp_cycledef): for every element in this array.
chomp: removes the trailing \n character from each element.
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
Splits the element into 3 parts and assigns each part to its own variable. The arguments of split are split(/PATTERN/, EXPR, LIMIT).
$cycle_code =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$close_day =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$first_date =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
These substitutions strip leading and trailing whitespace from each variable.

my god, it's been such a long time since I've read perl... but I'll give it a shot.
you grab a record from @tmp_cycledef, chomp off the newline at the end, and split it up into the three variables. then, like S.Mark said, each substitution regex strips off the leading and trailing whitespace from each of the three variables. Finally, the values get stored in a hash as an array reference, with some debugging code commented out right above it.
hth

Your query gives a set of rows that are stored in the array @tmp_cycledef.
We iterate over each row in the result using foreach (@tmp_cycledef).
The result rows might have a trailing newline character; we get rid of it using chomp.
Next we split the row (which is now in $_) on the pipe and assign the first 3 pieces to $cycle_code, $close_day and $first_date respectively.
The split pieces might have leading and trailing whitespace; the next 3 lines remove it from the 3 variables.
Finally we make an entry in the hash %cycledef. The key used is $cycle_code and the value is an array whose first element is $close_day and whose remaining elements are the pieces obtained by splitting $first_date on the hyphen.