Counting occurance of the word "the" giving me all different answers - perl

So I have a simple script to read in a text file from the command line, and I want to count the number of "the"s but I've been getting weird numbers.
while(<>){
$wordcount= split(/\bthe\b/, $_);}
print "\"the\" occurs $wordcount times in $ARGV";
So using that I get 10 occurrences, but if I use /\bthe\b/i I get 12. /\Bthe\b/ gives me 6 I believe. There are 11 occurrences in my test txt. Am I just an idiot? Should $wordcount just be started at 1 or 0? Also is it bad practice to use split this way? The code works fine for actually counting the words, but not when counting an exact string. New to perl so any and all abuse is appreciated. Thanks
Edit: also I know it's not adding, but now I get that $wordcount is being treated more like an array, so it worked for a previous iteration, though it was definitely poor form.

Use a regex in a list context to pull the count of matches:
my $wordcount = 0;
while (<>) {
$wordcount += () = /\bthe\b/g;
}
print qq{"the" occurs $wordcount times in $ARGV\n};
Reference: perlfaq4 - How can I count the number of occurrences of a substring within a string?

split splits the string into a list based on the regex provided. Your count comes from the fact you've put split in scalar context. From perldoc -f split:
split Splits the string EXPR into a list of strings and returns the
list in list context, or the size of the list in scalar context.
Given the string "The quick brown fox jumps over the lazy dog" I'd expect your $wordcount to be 2, which would be correct.
The quick brown fox jumps over the lazy dog
^^^============================^^^========= -> two fields
However if you had "A bird and the quick brown fox jumps over the lazy dog" you'd end up with 3 which is not correct.
A bird and the quick brown fox jumps over the lazy dog
===========^^^============================^^^========= -> three fields
First of all you absolutely would want \b as that matches a word boundary. \B matches things that aren't word boundaries so you'd be matching any word that contained "the" instead of the word "the".
Secondly you just want to count the occurrences - you do that by counting the matches of the entire string
$wordcount = () = $string =~ /\bthe\b/gi
$wordcount becomes the list in scalar context, () is a list you aren't actually capturing since you don't want the matches. $string is the string to match against. You're matching "the" at word boundaries and gi is the whole string (global), case insensitive.

With the /i flag, 'The' would be included, but not without it.
\B is a non-word boundary, so would only find things like "clothe", and not "the".
Yes, it is bad practice to use split that way. Properly, if you just want a count, do this:
$wordcount = () = split ...;
split in scalar context does something that seemed like a good idea originally, but doesn't seem so good anymore, so avoid it. The above incantation calls it in list context but assigns the number of elements found to $wordcount.
But the elements produced by splitting on the aren't what you want; you want a count of times the was found. So do (possibly with /ig instead of just /g):
$wordcount = () = /\bthe\b/g;
Note that you probably want +=, not =, to get a total for all lines.

sample.txt
Ajith
kumar
Ajith
my name is Ajith and Ajith
lastname is kumar
code
use Data::Dumper;
print "Enter your string = ";
my $input = <>; ## User input
chomp $input; ## The chomp() function will remove (usually) any newline character from the end of a string
my %count;
open FILE, "<sample.txt" or die $!; ## To read the data from a file
my #data = <FILE>;
for my $d (#data) {
my #array = split ('\s', $d); ##To split the more than one word in a line
for my $a (#array) {
$count{$a}++; ## Counter
}
}
print Dumper "Result: " . $count{$input};
The above code get the input vai command prompt, then search the word into the given text file "sample.txt", then display the output how many times it appears in the text file (sample.txt)
Note: User Input must be in "Case sensitive ".
INTPUT from the USER
Enter your string = Ajith
OUTPUT
$VAR1 = 'Result: 4';

print "Enter the string: ";
chomp($string = <>);
die "Error opening file" unless(open(fil,"filename.txt"));
my #file = <fil>;
my #mt;
foreach (#file){
#s = map split,$_;
push(#mt,#s);
}
$s = grep {m/$string/gi} #mt;
print "Total no., of $string is:: $s\n";
In this give the output what you expect.

Related

Add a new line in a variable using perl

I am trying to add a new line in a variable after certain number of words. For example: If we have a variable:
$x = "This a variable, start a new line here, This is a new line.";
If I print the above variable
print $x;
I should get the below output:
This is a variable,
start a new line here,
This is a new line.
How can I achieve this in Perl from the variable itself?
I do not agree to the formula "after certain number of words".
Note that the first target line has 4 words, whereas remaining 2 have
5 words each.
Actually you need to replace each comma and following sequence of
spaces (if any) with a comma and \n.
So the intuitive way to do it is:
$x =~ s/,\s*/,\n/g;
The simplest way is to split the string on comma followed by a space and then
join the word groups with a comma followed by a newline.
my $x = "This a variable, start a new line here, This is a new line.";
print join(",\n", split /, /, $x) . "\n";
output
This a variable,
start a new line here,
This is a new line.
For solving the general, how do I reformat this string with line breaks after n-columns? problem, use the Text::Wrap library (as suggested by #ikegami):
use Text::Wrap;
my $x = "The quick brown fox jumped over the lazy dog.";
$Text::Wrap::columns = 15;
# wrap() needs an array of words
my #words = split /\s+/, $x;
# Initial tab, subsequent tab values set to '' (think indent amount)
print wrap('', '', #words) . "\n";
output
The quick
brown fox
jumped over
the lazy dog.
You probably want to use regular expressions. You can do this:
$x =~ s/^(\S+\s+){3}\K/\n/;
Or if this is about the commas and not the spaces:
$x =~ s/^([^,]+,+){2}\s*\K/\n/;
(in this case I also remove any potential space that would be after the comma)
You can also configure separately how many words or comma you want, by putting this in a variable:
my $nbwords = 7; # add a line after the 7th word
$x =~ s/^(\S+\s+){$nbwords}\K/\n/;
Now, that would keep the last space so you may want to do this:
my $nbwords = 7; # add a line after the 7th word
$nbwords--; # becomes 6 because there is another word after that we match as well
$x =~ s/^(\S+\s+){$nbwords}\S+\K\s+/\n/;
You should probably learn to use Regexps but just to explain the above:
\s is any space character (like space, tab, line feed, etc)
\S (uppercase) is any character except a space character
+ means any number of characters of that type described with what is before. So \s+ means any number of consecutive space characters.
{123} means 123 times that type of character ...
{3,80} means 3 to 80 times. So + is equivalent to {1,} (one to unlimited)
\K means that whatever is before will not be replaced, only what is after.

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file #server1 that shows a bunch of sentences (white space is the field separator) on many lines in the file.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my #uniqmatchedline = split(/ /, #allmatchedlines);
my $lastfield = $uniqmatchedline[-1]\n";
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my #uniqmatchedline = split(/ /, #allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my #uniqmatchedline = map { split(/ /, $_) } #allmatchedlines;
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my %existing;
my #uniqmatchedline = grep !$existing{$_}++, #allmatchedlines; #this will return the unique lines
my #lastfields = map { ((split / /, $_)[-1]) . "\n" } #uniqmatchedline ; #this maps the last field in each line into an array
print for #lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq)
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\#server1 cat $file";
open my $fh, '-|', $cmd // die "Error opening $cmd: $!"; # /
while (<$fh>) {
chomp;
push #allmatchedlines, $_ if /mysearch/;
}
close(output1);
my #unique_matched_lines = uniq #allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns pid, so an "error" doesn't at all relate to what happened with the command itself. See open. (The # / merely turns off wrong syntax highlighting.) Also note that # under "..." indicates an array and thus need be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
For more comments please see the original post below.
The statement #allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push #allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of #allmatchedlines – which is 1 as explained above. You should have
my #words_in_matched_lines = map { split } #allmatchedlines;
When all this is straightened out, you'll have words in the array #uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my #unique_elems = uniq #whole_array;

Perl tr operator is transliterating based on the variable's name not its value

I'm using Perl 5.16.2 to try to count the number of occurrences of a particular delimiter in the $_ string. The delimiter is passed to my Perl program via the #ARGV array. I verify that it is correct within the program. My instruction to count the number of delimiters in the string is:
$dlm_count = tr/$dlm//;
If I hardcode the delimiter, e.g. $dlm_count = tr/,//; the count comes out correctly. But when I use the variable $dlm, the count is wrong. I modified the instruction to say
$dlm_count = tr/$dlm/\t/;
and realized from how the tabs were inserted in the string that the operation was substituting every instance of any of the four characters "$", "d", "l", or "m" to \t — i.e. any of the four characters that made up my variable name $dlm.
Here is a sample program that illustrates the problem:
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm = ",";
my $dlm_count = tr/$dlm/\t/;
print "The count is $dlm_count\n";
print "The modified string is $_\n";
There are only two commas in the $_ string, but this program prints the following:
The count is 3
The modified string is abc efghij,k ,nopqrstuvwxyz
Why is the $dlm token being treated as a literal string of four characters instead of as a variable name?
You cannot use tr that way, it doesn't interpolate variables. It runs strictly character by character replacement. So this
$string =~ tr/a$v/123/
is going to replace every a with 1, every $ with 2, and every v with 3. It is not a regex but a transliteration. From perlop
Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation. That means that if you want to use variables, you must use an eval():
eval "tr/$oldlist/$newlist/";
die $# if $#;
eval "tr/$oldlist/$newlist/, 1" or die $#;
The above example from docs hints how to count. For $dlms in $string
$dlm_count = eval "\$string =~ tr/$dlm//";
The $string is escaped so to not be interpolated before it gets to eval. In your case
$dlm_count = eval "tr/$dlm//";
You can also use tools other than tr (or regex). For example, with string being in $_
my $dlm_count = grep { /$dlm/ } split //;
When split breaks $_ by the pattern that is empty string (//) it returns the list of all characters in it. Then the grep block tests each against $dlm so returning the list of as many $dlm characters as there were in $_. Since this is assigned to a scalar, $dlm_count is set to the length of that list, which is the count of all $dlm.
In the section of the docs on perlop 'Quote Like Operators', it states:
Because the transliteration table is built at compile time, neither
the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote
interpolation. That means that if you want to use variables, you must
use an eval():
As documented and as you discovered, tr/// doesn't interpolate. The simple solution is to use s/// instead.
my $dlm = ",";
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm_count = s/\Q$dlm/\t/g;
If the transliteration is being performed in a loop, the following might speed things up noticeably:
my $dlm = ",";
my $tr = eval "sub { tr/\Q$dlm\E/\\t/ }";
for (...) {
my $dlm_count = $tr->();
...
}
Although several answers have hinted at the eval() idiom for tr///, none have the form that covers cases where the string has tr syntax characters in it, e.g.- (hyphen):
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm = ",";
my $dlm_count = eval sprintf "tr/%s/%s/", map quotemeta, $dlm, "\t";
But as others have noted, there are lots of ways to count characters in Perl that avoid eval(), here's another:
my $dlm_count = () = m/$dlm/go;

Perl colon split show one character

Perl colon split show one character when using it in variables.
Code
#!/usr/bin/perl
my $serialnumber= "0123456789";
my $macaddr = "a1:b2:c3:d4:e5:f6";
my $macaddress = split /:/, $macaddr;
print $macaddress;
print "\n";
The result is only "6" or last character of the text.
See image...
Sean explained why you are getting 6. However, instead of using split, just substitute to get rid of colons:
my $macaddr = "a1:b2:c3:d4:e5:f6";
$macaddr =~ s/://g;
print $macaddr;
# a1b2c3d4e5f6
The first paragraph of perldoc -f split says:
Splits the string EXPR into a list of strings and returns the
list in list context, or the size of the list in scalar context.
Be assigning the result of split to the scalar variable $macaddress, you're providing scalar context, and so you're getting the size of the list back. It's just a coincidence that this happens to be the same as the last character of your input string.
You could use split in list context:
my #macaddress = split /:/, $macaddr;
print join "", #macaddress;

Skipping particular positions in a string using substitution operator in perl

Yesterday, I got stuck in a perl script. Let me simplify it, suppose there is a string (say ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD), first I've to break it at every position where "E" comes, and secondly, break it specifically where the user wants to be at. But, the condition is, program should not cut at those sites where E is followed by P. For example there are 6 Es in this sequence, so one should get 7 fragments, but as 2 Es are followed by P one will get 5 only fragments in the output.
I need help regarding the second case. Suppose user doesn't wants to cut this sequence at, say 5th and 10th positions of E in the sequence, then what should be the corresponding script to let program skip these two sites only? My script for first case is:
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
$otext=~ s/([E])/$1=/g; #Main cut rule.
$otext=~ s/=P/P/g;
#output = split( /\=/, $otext);
print "#output";
Please do help!
To split on "E" except where it's followed by "P", you should use Negative look-ahead assertions.
From perldoc perlre "Look-Around Assertions" section:
(?!pattern)
A zero-width negative look-ahead assertion.
For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
# E E EP E EP E
my #output=split(/E(?!P)/, $otext);
use Data::Dumper; print Data::Dumper->Dump([\#output]);"
$VAR1 = [
'ABCD',
'ABCD',
'ABCDEPABCD',
'ABCDEPABCD',
'ABCD'
];
Now, in order to NOT cut at occurences #2 and #4, you can do 2 things:
Concoct a really fancy regex that automatically fails to match on given occurence. I will leave that to someone else to attempt in an answer for completeness sake.
Simply stitch together the correct fragments.
I'm too brain-dead to come up with a good idiomatic way of doing it, but the simple and dirty way is either:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
my #output_final;
for(my $i=0; $i < #output; $i++) {
if ($no_cuts{$i}) {
$output_final[-1] .= $output[$i];
} else {
push #output_final, $output[$i];
}
}
print Data::Dumper->Dump([\#output_final];
$VAR1 = [
'ABCD',
'ABCDABCDEPABCD',
'ABCDEPABCDABCD'
];
Or, simpler:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
for(my $i=0; $i < #output; $i++) {
$output[$i-1] .= $output[$i];
$output[$i]=undef; # Make the slot empty
}
my #output_final = grep {$_} #output; # Skip empty slots
print Data::Dumper->Dump([\#output_final];
$VAR1 = [
'ABCD',
'ABCDABCDEPABCD',
'ABCDEPABCDABCD'
];
Here's a dirty trick that exploits two facts:
normal text strings never contain null bytes (if you don't know what a null byte is, you should as a programmer: http://en.wikipedia.org/wiki/Null_character, and nb. it is not the same thing as the number 0 or the character 0).
perl strings can contain null bytes if you put them there, but be careful, as this may screw up some perl internal functions.
The "be careful" is just a point to be aware of. Anyway, the idea is to substitute a null byte at the point where you don't want breaks:
my $s = "ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD";
my #nobreak = (4,9);
foreach (#nobreak) {
substr($s, $_, 1) = "\0";
}
"\0" is an escape sequence representing a null byte like "\t" is a tab. Again: it is not the character 0. I used 4 and 9 because there were E's in those positions. If you print the string now it looks like:
ABCDABCDABCDEPABCDEABCDEPABCDEABCD
Because null bytes don't display, but they are there, and we are going to swap them back out later. First the split:
my #a = split(/E(?!P)/, $s);
Then swap the zero bytes back:
$_ =~ s/\0/E/g foreach (#a);
If you print #a now, you get:
ABCDEABCDEABCDEPABCD
ABCDEPABCD
ABCD
Which is exactly what you want. Note that split removes the delimiter (in this case, the E); if you intended to keep those you can tack them back on again afterward. If the delimiter is from a more dynamic regex it is slightly more complicated, see here:
http://perlmeme.org/howtos/perlfunc/split_function.html
"Example 9. Keeping the delimiter"
If there is some possibility that the #nobreak positions are not E's, then you must also keep track of those when you swap them out to make sure you replace with the correct character again.