Count matches between specific substrings - perl

I have a string
1AAAAaaa>###_1BBbbbbbbb>###_2CCCCCCCCccccc
Data blocks begins with "a number" and end with >.
I need to calculate in how many of those blocks lower case letters outnumber upper case.
As an answer I want to get
there are x places between number and >, where lowercase is over 50%.
I understand how to do it for the whole string, but not for the separate regions.

You can separate each target section of the string into an array using split.
Then iterate through the array and do your count.
my $string = 'AAAAaaa>1BBbbbbbbb>2CCCCCCCCccccc>3DDDDDDDDDddd>4FFFFfffffff>';
my #targets = split(/(?=\d+\w+>)/, $string);
my $successes = 0;
foreach my $target (#targets){
my $target_lc = $target =~ tr/a-z//;
my $target_uc = $target =~ tr/A-Z//;
if($target_lc > $target_uc){
$successes++;
}
}
print $successes;
OUTPUT = 2

Related

Regular Expression Matching Perl for first case of pattern

I have multiple variables that have strings in the following format:
some_text_here__what__i__want_here__andthen_someĀ 
I want to be able to assign to a variable the what__i__want_here portion of the first variable. In other words, everything after the FIRST double underscore. There may be double underscores in the rest of the string but I only want to take the text after the FIRST pair of underscores.
Ex.
If I have $var = "some_text_here__what__i__want_here__andthen_some", I would like to assign to a new variable only the second part like $var2 = "what__i__want_here__andthen_some"
I'm not very good at matching so I'm not quite sure how to do it so it just takes everything after the first double underscore.
my $text = 'some_text_here__what__i__want_here';
# .*? # Match a minimal number of characters - see "man perlre"
# /s # Make . match also newline - see "man perlre"
my ($var) = $text =~ /^.*?__(.*)$/s;
# $var is not defined when there is no __ in the string
print "var=${var}\n" if defined($var);
You might consider this an example of where split's third parameter is useful. The third parameter to split constrains how many elements to return. Here is an example:
my #examples = (
'some_text_here__what__i_want_here',
'__keep_this__part',
'nothing_found_here',
'nothing_after__',
);
foreach my $string (#examples) {
my $want = (split /__/, $string, 2)[1];
print "$string => ", (defined $want ? $want : ''), "\n";
}
The output will look like this:
some_text_here__what__i_want_here => what__i_want_here
__keep_this__part => keep_this__part
nothing_found_here =>
nothing_after__ =>
This line is a little dense:
my $want = (split /__/, $string, 2)[1];
Let's break that down:
my ($prefix, $want) = split /__/, $string, 2;
The 2 parameter tells split that no matter how many times the pattern /__/ could match, we only want to split one time, the first time it's found. So as another example:
my (#parts) = split /#/, "foo#bar#baz#buzz", 3;
The #parts array will receive these elements: 'foo', 'bar', 'baz#buzz', because we told it to stop splitting after the second split, so that we get a total maximum of three elements in our result.
Back to your case, we set 2 as the maximum number of elements. We then go one step further by eliminating the need for my ($throwaway, $want) = .... We can tell Perl we only care about the second element in the list of things returned by split, by providing an index.
my $want = ('a', 'b', 'c', 'd')[2]; # c, the element at offset 2 in the list.
my $want = (split /__/, $string, 2)[1]; # The element at offset 1 in the list
# of two elements returned by split.
You use brackets to capature then reorder the string, the first set of brackets () is $1 in the next part of the substitution, etc ...
my $string = "some_text_here__what__i__want_here";
(my $newstring = $string) =~ s/(some_text_here)(__)(what__i__want_here)/$3$2$1/;
print $newstring;
OUTPUT
what__i__want_here__some_text_here

How can I split a word into its component letters?

I am working with Perl and I have an array with only one word:
#example = ("helloword")
I want to generate another array in which each element is a letter from the word:
#example2 = ("h", "e", "l"...)
I need to do that because I need to count the numbers of "h", "e"... How can I do this?
To count how many times letter occurred in a string,
print "helloword" =~ tr/h//; # for 'h' letter
otherwise you can split string and assign list to an array,
my #example2 = split //, $example[0];
I don't completely grasp exactly what you need to count, but perhaps you can take pieces from this example, which uses a hash to store the letters and counts of each...
use warnings;
use strict;
my #array = 'helloworld';
my %letters;
$letters{$_}++ for split //, $array[0];
my $total;
while (my ($k, $v) = each %letters){
$total += $v;
print "$k: $v\n";
}
print "Total letters in string: $total\n",
Output:
w: 1
d: 1
l: 3
o: 2
e: 1
r: 1
h: 1
Total letters in string: 10
Try using this code, found here: http://www.comp.leeds.ac.uk/Perl/split.html
#chars = split(//, $word);
You can of course use split(//,"helloworld"), but that's not as efficient as unpack. Figuring out the template to provide to unpack can be somewhat steep, but this should work for you: unpack('(A)*',"helloworld"). For example:
perl -e 'print(join("\n",unpack("(A)*","helloworld")),"\n")'
h
e
l
l
o
w
o
r
l
d
To count the number of letters, you could either assume that every character of a "word" you split the string up into is a letter and simply evaluate the list in scalar context (or use 'length'), e.g. print(scalar(#letters),"\n"); or print(length(#letters),"\n"), OR you could create a count variable and increment it in a map when a letter pattern is matched, e.g.:
my $cnt = 0;
foreach(#chars){$cnt++ if(/\w/)}
print("$cnt\n");
Or you can use the same evaluation of a list in scalar trick with a grep:
print(scalar(grep {/\w/} #chars),"\n");
There are of course, in perl, other ways to do it.
EDIT: In case I misinterpreted the question, and you want to know how many of each letter there is in the string, then this should suffice:
$cnt = 0;
foreach(unpack("(A)*","helloworld")))
{
next unless(/\w/);
$hash->{$_}->{ORD} = $cnt++ unless(exists($hash->{$_}));
$hash->{$_}->{CNT}++;
}
foreach(sort {$hash->{$a}->{ORD} <=> $hash->{$b}->{ORD}}
keys(%$hash))
{print("$_\t$hash->{$_}->{CNT}\n")}
This solution has the advantage of keeping the unique letters in the order of their first occurrence in the word they were found in.

Using PowerShell To Count Sentences In A File

I am having an issue with my PowerShell Program counting the number of sentences in a file I am using. I am using the following code:
foreach ($Sentence in (Get-Content file))
{
$i = $Sentence.Split("?")
$n = $Sentence.Split(".")
$Sentences += $i.Length
$Sentences += $n.Length
}
The total number of sentences I should get is 61 but I am getting 71, could someone please help me out with this? I have Sentences set to zero as well.
Thanks
foreach ($Sentence in (Get-Content file))
{
$i = $Sentence.Split("[?\.]")
$Sentences = $i.Length
}
I edited your code a bit.
The . that you were using needs to be escaped, otherwise Powershell recognises it as a Regex dotall expression, which means "any character"
So you should split the string on "[?\.]" or similar.
When counting sentences, what you are looking for is where each sentence ends. Splitting, though, returns a collection of sentence fragments around those end characters, with the ends themselves represented by the gap between elements. Therefore, the number of sentences will equal the number of gaps, which is one less the number of fragments in the split result.
Of course, as Keith Hill pointed out in a comment above, the actual splitting is unnecessary when you can count the ends directly.
foreach( $Sentence in (Get-Content test.txt) ) {
# Split at every occurrence of '.' and '?', and count the gaps.
$Split = $Sentence.Split( '.?' )
$SplitSentences += $Split.Count - 1
# Count every occurrence of '.' and '?'.
$Ends = [char[]]$Sentence -match '[.?]'
$CountedSentences += $Ends.Count
}
Contents of test.txt file:
Is this a sentence? This is a
sentence. Is this a sentence?
This is a sentence. Is this a
very long sentence that spans
multiple lines?
Also, to clarify on the remarks to Vasili's answer: the PowerShell -split operator interprets a string as a regular expression by default, while the .NET Split method only works with literal string values.
For example:
'Unclosed [bracket?' -split '[?]' will treat [?] as a regular expression character class and match the ? character, returning the two strings 'Unclosed [bracket' and ''
'Unclosed [bracket?'.Split( '[?]' ) will call the Split(char[]) overload and match each [, ?, and ] character, returning the three strings 'Unclosed ', 'bracket', and ''

Concatenating strings from a multidimensional array overwrites the target string in Perl

I've built a two dimension array with string values. There are always 12 columns but the number of rows vary. Now I'd like to build a string of each row but when I run the following code:
$outstring = "";
for ($i=0; $i < $ctrLASTROW + 1; $i++) {
for ($k=0; $k < 12; $k++){
$datastring = $DATATABLE[$i][$k]);
$outstring .= $datastring;
}
}
$outstring takes the first value. Then on the second inner loop and subsequent loops the value in $outstring gets overlaid. For example the first value is "DATE" then the next time when the value "ABC" gets fed to it. Rather than being the hoped for "DATEABC" it's "ABCE". The "E" is the fourth character of DATE. I figure I'm missing the scalar / list issue but I've tried who knows how many variations to no avail. When I first started I tried the concatenation directly from the #DATATABLE. Same problem. Only quicker.
When you have a problem such as two strings DATE and ABC being concatenated, and the end result is ABCE, or one of the strings overwriting the other, a likely scenario is that you have a file from another OS, with the line endings \r\n, which are chomped, resulting in the string DATE\rABC when concatenated, which then becomes ABCE when printed.
In other words:
my $foo = "DATE\r\n";
my $bar = "ABC\r\n"; # \r\n line endings from file
chomp($foo, $bar); # removes \n but leaves \r
print $foo . $bar; # prints ABCE
To confirm, use
use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper $DATATABLE[$i][$k]; # prints $VAR1 = "DATE\rABC\r";
To resolve, instead of chomp use a regex such as:
$foo =~ s/[\r\n]+\z//;

Skipping particular positions in a string using substitution operator in perl

Yesterday, I got stuck in a perl script. Let me simplify it, suppose there is a string (say ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD), first I've to break it at every position where "E" comes, and secondly, break it specifically where the user wants to be at. But, the condition is, program should not cut at those sites where E is followed by P. For example there are 6 Es in this sequence, so one should get 7 fragments, but as 2 Es are followed by P one will get 5 only fragments in the output.
I need help regarding the second case. Suppose user doesn't wants to cut this sequence at, say 5th and 10th positions of E in the sequence, then what should be the corresponding script to let program skip these two sites only? My script for first case is:
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
$otext=~ s/([E])/$1=/g; #Main cut rule.
$otext=~ s/=P/P/g;
#output = split( /\=/, $otext);
print "#output";
Please do help!
To split on "E" except where it's followed by "P", you should use Negative look-ahead assertions.
From perldoc perlre "Look-Around Assertions" section:
(?!pattern)
A zero-width negative look-ahead assertion.
For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
# E E EP E EP E
my #output=split(/E(?!P)/, $otext);
use Data::Dumper; print Data::Dumper->Dump([\#output]);"
$VAR1 = [
'ABCD',
'ABCD',
'ABCDEPABCD',
'ABCDEPABCD',
'ABCD'
];
Now, in order to NOT cut at occurences #2 and #4, you can do 2 things:
Concoct a really fancy regex that automatically fails to match on given occurence. I will leave that to someone else to attempt in an answer for completeness sake.
Simply stitch together the correct fragments.
I'm too brain-dead to come up with a good idiomatic way of doing it, but the simple and dirty way is either:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
my #output_final;
for(my $i=0; $i < #output; $i++) {
if ($no_cuts{$i}) {
$output_final[-1] .= $output[$i];
} else {
push #output_final, $output[$i];
}
}
print Data::Dumper->Dump([\#output_final];
$VAR1 = [
'ABCD',
'ABCDABCDEPABCD',
'ABCDEPABCDABCD'
];
Or, simpler:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
for(my $i=0; $i < #output; $i++) {
$output[$i-1] .= $output[$i];
$output[$i]=undef; # Make the slot empty
}
my #output_final = grep {$_} #output; # Skip empty slots
print Data::Dumper->Dump([\#output_final];
$VAR1 = [
'ABCD',
'ABCDABCDEPABCD',
'ABCDEPABCDABCD'
];
Here's a dirty trick that exploits two facts:
normal text strings never contain null bytes (if you don't know what a null byte is, you should as a programmer: http://en.wikipedia.org/wiki/Null_character, and nb. it is not the same thing as the number 0 or the character 0).
perl strings can contain null bytes if you put them there, but be careful, as this may screw up some perl internal functions.
The "be careful" is just a point to be aware of. Anyway, the idea is to substitute a null byte at the point where you don't want breaks:
my $s = "ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD";
my #nobreak = (4,9);
foreach (#nobreak) {
substr($s, $_, 1) = "\0";
}
"\0" is an escape sequence representing a null byte like "\t" is a tab. Again: it is not the character 0. I used 4 and 9 because there were E's in those positions. If you print the string now it looks like:
ABCDABCDABCDEPABCDEABCDEPABCDEABCD
Because null bytes don't display, but they are there, and we are going to swap them back out later. First the split:
my #a = split(/E(?!P)/, $s);
Then swap the zero bytes back:
$_ =~ s/\0/E/g foreach (#a);
If you print #a now, you get:
ABCDEABCDEABCDEPABCD
ABCDEPABCD
ABCD
Which is exactly what you want. Note that split removes the delimiter (in this case, the E); if you intended to keep those you can tack them back on again afterward. If the delimiter is from a more dynamic regex it is slightly more complicated, see here:
http://perlmeme.org/howtos/perlfunc/split_function.html
"Example 9. Keeping the delimiter"
If there is some possibility that the #nobreak positions are not E's, then you must also keep track of those when you swap them out to make sure you replace with the correct character again.