Matching different instances of the same word - perl

How can I match different forms of the same word?
For example, given the word synergy, how do I also match synergizes, synergism, synergically, and synergistic?
I could write the following:
while(<IN>)
{
chomp $_;
my $line= $_;
$word="Synergy";
if($line=~m/$word(\w+)/i)
{
$line=~s/$word/<Effect>$word<\/Effect>/ig;
}
}

What you probably want to do is called stemming.
However, for this to work, you have to stem all the words in the text plus the one you search for. Hopefully all of your listed words result in the same stem. I haven't tested it yet.
use Lingua::Stem;
my $stemmer = Lingua::Stem->new( -locale => 'EN-UK' );
# first convert text to list of words
my @words;
while(<IN>) {
    push @words, split(/\b/, $_); # you can do better here
}
# now stem all words.
my $stemmed_words = $stemmer->stem(@words);
# results in an array ref of stems in the same order as the words have been.
# now stem your search
my $stemmed_search = $stemmer->stem($word);
# and do the search from above inside stemmed array.
Now it depends on what you want. If you want to exchange all those words with something, you have to get the indexes of the matched (stemmed) words and do the replacement at the same positions inside your text.

You're going to want to use regular expressions. It is unclear exactly what criteria you want to match on, but in your example all the derived words start with "synergi", so if ($string =~ /\bsynergi\w*\b/i) would find all lines containing a word that starts with "synergi".
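To go from finding such lines to tagging each matching word, as the loop in the question tries to do, a substitution with a capture group is one option. This is a minimal sketch, not the asker's code: the sub name tag_word_forms and the prefix "synerg" (chosen so that "synergy" itself also matches) are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every word that starts with $prefix in <Effect> tags.
sub tag_word_forms {
    my ($line, $prefix) = @_;
    # \Q...\E quotes any regex metacharacters in the prefix;
    # \w* extends the match to the end of the word.
    $line =~ s{\b(\Q$prefix\E\w*)}{<Effect>$1</Effect>}gi;
    return $line;
}

print tag_word_forms("Synergy and synergistic effects", "synerg"), "\n";
```

A bare prefix like this will also tag unrelated words that happen to share it, which is the trade-off against stemming.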

You might want to check out Text::Soundex. For example,
use Text::Soundex;
# The following all return S562
print soundex("synergizes"), "\n";
print soundex("synergism"), "\n";
print soundex("synergically"), "\n";
print soundex("synergistic"), "\n";
Further reading: Perldoc Text::Soundex

Related

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file @server1 that shows a bunch of sentences (white space is the field separator) on many lines in the file.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my @allmatchedlines;
open(output1, "ssh user1@server1 cat /tmp/myfile.txt |");
while(<output1>) {
    chomp;
    @allmatchedlines = $_ if /mysearch/;
}
close(output1);
my @uniqmatchedline = split(/ /, @allmatchedlines);
my $lastfield = $uniqmatchedline[-1];
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my @uniqmatchedline = split(/ /, @allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my @uniqmatchedline = map { split(/ /, $_) } @allmatchedlines;
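The scalar-context behaviour can be seen in isolation with a one-element array. This is only an illustrative sketch; the sample data and the variable names @wrong and @right are made up for the demonstration.

```perl
use strict;
use warnings;

my @lines = ("alpha beta gamma");

# split's second argument is evaluated in scalar context,
# so the array collapses to its element count before splitting.
my @wrong = split / /, @lines;

# Splitting each element separately yields the actual fields.
my @right = map { split / /, $_ } @lines;

print "wrong: @wrong\n";
print "right: @right\n";
```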
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my @allmatchedlines;
open(output1, "ssh user1\@server1 cat /tmp/myfile.txt |");
while(<output1>) {
    chomp;
    push @allmatchedlines, $_ if /mysearch/; # push, so earlier matches are not overwritten
}
close(output1);
my %existing;
my @uniqmatchedline = grep !$existing{$_}++, @allmatchedlines; #this will return the unique lines
my @lastfields = map { ((split / /, $_)[-1]) . "\n" } @uniqmatchedline ; #this maps the last field in each line into an array
print for @lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq);
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\@server1 cat $file";
open(my $fh, '-|', $cmd) // die "Error opening $cmd: $!"; # /
my @allmatchedlines;
while (<$fh>) {
    chomp;
    push @allmatchedlines, $_ if /mysearch/;
}
close $fh;
my @unique_matched_lines = uniq @allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns the pid, so an "error" here doesn't relate at all to what happened with the command itself. See open. (The # / merely turns off wrong syntax highlighting.) Also note that @ inside "..." indicates an array and thus needs to be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
For more comments please see the original post below.
The statement @allmatchedlines = $_ if /mysearch/; assigns to the array on every iteration, overwriting whatever was in it. So you end up with only the last line that matched mysearch. You want push @allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of @allmatchedlines – which is 1 as explained above. You should have
my @words_in_matched_lines = map { split } @allmatchedlines;
When all this is straightened out, you'll have words in the array @uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my @unique_elems = uniq @whole_array;
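Putting the pieces together without the ssh pipe, the whole task might look like the sketch below. It assumes that "remove the same lines" means dropping duplicate matching lines, and it swaps List::MoreUtils for a plain %seen hash so that only core Perl is needed; the sub name and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: last field of each distinct line matching $pattern.
sub last_fields_of_unique_matches {
    my ($pattern, @lines) = @_;
    my %seen;
    my @unique = grep { !$seen{$_}++ } grep { /$pattern/ } @lines;
    # split ' ' splits on runs of whitespace and ignores leading whitespace.
    return map { (split ' ', $_)[-1] } @unique;
}

my @sample = (
    "host1 mysearch result_a",
    "host2 something else",
    "host1 mysearch result_a",
    "host3 mysearch result_b",
);
print "$_\n" for last_fields_of_unique_matches(qr/mysearch/, @sample);
```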

How do I count the "real" words in a text with Perl?

I've run into a text processing problem. I've an article, and I'd like to find out how many "real" words there are.
Here is what I mean by "real". Articles usually contain various punctuation marks such as dashes, and commas, dots, etc. What I'd like to find out is how many words there are, skipping like "-" dashes and "," commas with spaces, etc.
I tried doing this:
my @words = split ' ', $article;
print scalar @words, "\n";
But that includes various punctuations that have spaces in them as words.
So I'm thinking of using this:
my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";
This would match all words that have characters or numbers in them. What do you think, would this be good enough way to count words in an article?
Does anyone know maybe of a module on CPAN that does this?
Try splitting on \W (any non-word character), and also treat _ as a separator.
Solution
use strict;
my $article = 'abdc, dd_ff, 11i-11, ff44';
# case David's, but it didn't work with I'm or There's
$article =~ s/\'//g;
my $number_words = scalar (split /[\W_]+/, $article);
print $number_words;
I think your solution is about as good as you're going to get without resorting to something elaborate.
You could also write it as
my @words = $article =~ /\S*\w\S*/g;
or count the words in a file by writing
use feature 'say';
my $n = 0;
while (<>) {
    my @words = /\S*\w\S*/g;
    $n += @words;
}
say "$n words found";
Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.
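For trying sample text quickly, the same pattern can be wrapped in a small function. This is a sketch only: the name count_real_words is made up, and note that \w also matches an underscore, so a token such as "_" would still count as a word.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated tokens containing a word character.
sub count_real_words {
    my ($text) = @_;
    # Assigning the match list to () and then to a scalar yields the match count.
    my $count = () = $text =~ /\S*\w\S*/g;
    return $count;
}

print count_real_words("Well - this is, um , a test !"), "\n";
```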

how can i fetch the whole word on the basis of index no of that string in perl

I have a line like this:
comments:[I#1278327] is related to office communicator.i fixed the bug to declare it null at first time.
Here I am searching for the index of I#, and then I want the whole word, i.e. [I#1278327]. I'm doing it like this:
open(READ1,"<letter.txt");
while(<READ1>)
{
if(index($_,"I#")!=-1)
{
$indexof=index($_,"I#");
print $indexof,"\n";
$string=substr($_,$indexof);##i m cutting that string first from index of I# to end then...
$string=substr($string,0,index($string," "));
$lengthof=length($string);
print $lengthof,"\n";
print $string,"\n";
print $_,"\n";
}
}
Is there any API in Perl to find the word length directly after finding the index of I# in that line?
You could do something like:
$indexof=index($_,"I#");
$index2 = index($_,' ',$indexof);
$lengthof = $index2 - $indexof;
However, the bigger issue is you are using Perl as if it were BASIC. A more perlish approach to the task of printing selected lines:
use strict;
use warnings;
open my $read, '<', 'letter.txt' or die "Cannot open letter.txt: $!"; # safer version of open
LINE:
while (<$read>) {
print "$1 - $_" if (/(I#.*?) /);
}
I would use a regex instead, a regex will allow you to match a pattern ("I#") and also capture other data from the string:
$_ =~ m/I#(\d+)/;
The line above will match and set $1 to the number.
See perldoc perlre
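If the goal is the whole bracketed token and its length, a capture group that includes the brackets gives both in one step. A minimal sketch, assuming the token always has the form [I#digits]; the sub name extract_ticket is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bracketed ticket token, or undef if absent.
sub extract_ticket {
    my ($line) = @_;
    # Without the /x modifier, '#' is a literal character in the pattern.
    return $line =~ /(\[I#\d+\])/ ? $1 : undef;
}

my $line   = "comments:[I#1278327] is related to office communicator.";
my $ticket = extract_ticket($line);
print "$ticket has length ", length($ticket), "\n" if defined $ticket;
```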

count number of times string repeated in file perl

I am new to Perl, by the way. I have a Perl script that needs to count the number of times a string appears in the file. The script gets the word from the file itself.
I need it to grab the first word in the file and then search the rest of the file to see if it is repeated anywhere else. If it is repeated I need it to return the amount of times it was repeated. If it was not repeated, it can return 0. I need it to then get the next word in the file and check this again.
I will grab the first word from the file, search the file for repeats of that word, grab the second word from
the file, search the file for repeats of that word, grab the third word from the file, search the file for repeats of that word.
So far I have a while loop that is grabbing each word I need, but I do not know how to get it to search for repeats without resetting the position of my current line. So how do I do this? Any ideas or suggestions are greatly appreciated! Thanks in advance!
while (<theFile>) {
my $line1 = $_;
my $startHere = rindex($line1, ",");
my $theName = substr($line1, $startHere + 1, length($line1) - $startHere);
#print "the name: ".$theName."\n";
}
Use a hashtable;
my %wordCount = ();
while(my $line = <theFile>)
{
    chomp($line);
    my @words = split(' ', $line);
    foreach my $word(@words)
    {
        $wordCount{$word} += 1;
    }
}
# output
foreach my $key(keys %wordCount)
{
    print "Word: $key Repeat_Count: " . ($wordCount{$key} - 1) . "\n";
}
The $wordCount{$key} - 1 in the output accounts for the first time a word was seen; words that appear only once in the file will have a count of 0.
Unless this is actually homework and/or you have to achieve the results in the specific manner you describe, this is going to be FAR more efficient.
Edit: From your comment below:
Each word i am searching for is not "the first word" it is a certain word on the line. Basically i have a csv file and i am skipping to the third value and searching for repeats of it.
I would still use this approach. What you would want to do is:
split on , since this is a CSV file
Pull out the 3rd word in the array on each line and store the words you are interested in in their own hash table
At the end, iterate through the "search word" hash table, and pull out the counts from the wordcount table
So:
my @words = split(',', $line);
$searchTable{$words[2]} = 1;
...
foreach my $key(keys %searchTable)
{
print "Word: $key Repeat_Count: " . ($wordCount{$key} - 1) . "\n";
}
You'll have to adjust according to what rules you have around counting words that repeat in the third column. You could just remove them from @words before the loop that inserts into your wordCount hash.
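One possible reading of the CSV variant is sketched below. It assumes that only the third column matters and that a "repeat" is any occurrence after the first; the sub name and sample rows are hypothetical, and a plain split on commas will not handle quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper: for each distinct third-column value, how many
# additional times it occurs in the third column.
sub third_column_repeats {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my @fields = split /,/, $copy;
        next unless defined $fields[2];   # skip lines with fewer than three fields
        $count{ $fields[2] }++;
    }
    return map { ($_ => $count{$_} - 1) } keys %count;
}

my %repeats = third_column_repeats("1,a,smith\n", "2,b,jones\n", "3,c,smith\n");
print "Word: $_ Repeat_Count: $repeats{$_}\n" for sort keys %repeats;
```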
my $word = <theFile>;
chomp($word); # assuming the word is on a line by itself
my $wordcount = 0;
foreach my $line (<theFile>) {
$line =~ s/$word/$wordcount++/eg;
}
print $wordcount."\n";
Look up the regex flag 'e' for more on what this does. I didn't test the code, but something like it should work. For clarification, the 'e' flag evaluates the replacement part of the substitution as Perl code before replacing, so with that flag you should be able to make this work.
Now that I understand what you are asking for, the above solution won't work. What you can do is use sysread to read the entire file into a buffer and run the same substitution on that, but you will have to get the first word off manually, or you can just decrement after the fact. This is because sysread and regular (buffered) reads on a filehandle are handled differently, so try this:
my $word = <theFile>;
chomp($word); # assuming the word is on a line by itself
my $wordcount = 0;
my $srline = '';
# some arbitrary very long length, longer than the file
# (looping is also possible)
sysread(theFile, $srline, 10000000);
$srline =~ s/$word/$wordcount++/eg;
$wordcount--; # I think that the first word will still be in here, causing issues, you should test.
print $wordcount."\n";
Now, given that I read your comment responding to your question, I don't think that your current algorithm is optimal, and you probably want a hash storing up all of the counts for words in a file. This would probably be best done using something like the following:
my %counts = ();
foreach my $line (<theFile>) {
$line =~ s/(\w+)/$counts{$1}++/eg;
}
# now %counts contains key-value pair words for everything in the file.
To find count of all words present in the file you can do something like:
#!/usr/bin/perl
use strict;
use warnings;
my %count_of;
while (my $line = <>) { #read from file or STDIN
foreach my $word (split /\s+/, $line) {
$count_of{$word}++;
}
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
print "'$word': $count_of{$word}\n";
}
__END__

How do I remove a a list of character sequences from the beginning of a string in Perl?

I have to read lines from a file and store them into a hash in Perl. Many of these lines have special character sequences at the beginning that I need to remove before storing. These character sequences are
| || ### ## @@||
For example, if it is ||https://ads, I need to get https://ads; if ###http, I need to get http.
I need to exclude these character sequences. I want to do this by having all the character sequences to exclude in a array and then check if the line starts with these character sequences and remove those. What is a good way to do this?
I've gone as far as:
our $ad_file = "C:/test/list.txt";
our %ads_list_hash = ();
my $lines = 0;
# List of lines to ignore
my @strip_characters = qw /| || ### ## @@||/;
# Create a list of substrings in the easylist.txt file
open my $ADS, '<', $ad_file or die "can't open $ad_file";
while(<$ADS>) {
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}
close $ADS;
I need to add the logic to remove the @strip_characters from the beginning of each line if any of them are present.
Probably a bit too complex and general for the task, but still..
my $strip = join "|", map {quotemeta} @strip_characters;
# avoid bare [] etc. in the RE
# ... later, in the while()
s/^(?:$strip)+//o;
# /o means "compile $strip into the regex once and for all"
Why don't you do it with a regex? Something like
$line =~ s/^[#\@ |]+//;
should work.
If you want to remove a list of characters (according to your title), then a very simple regular expression will work.
Within the loop, add the following regular expression
while( <$ADS> ) {
    chomp;
    s/^[#\@ \|]+//;
    $ads_list_hash{$lines++} = $_;
}
Note that the pipe character ('|') is escaped.
However, it appears that you want to remove a list of expressions. You can do the following
while( <$ADS> ) {
chomp;
    s/^((\|)|(\|\|)|(###)|(##)|(\@\@\|\|))+//;
    $ads_list_hash{$lines++} = $_;
}
You said that the list of expressions is stored in an array of words. In your sample code, you create this array with 'qw'. If the list of expressions isn't known at compile time, you can build a regular expression in a variable and use it. Each expression has to go through quotemeta, otherwise a '|' inside it would be read as alternation.
my @strip_expression = ... # get an array of strip expressions
my $re = '^((' . join(')|(', map { quotemeta } @strip_expression) . '))+';
and then, use the following statement in the loop:
s/$re//;
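A self-contained version of this idea is sketched below, under a few assumptions: the prefix list is only a subset of the question's sequences used as sample data, the alternatives are sorted longest-first and passed through quotemeta, and the names $strip_re and strip_prefixes are hypothetical.

```perl
use strict;
use warnings;

# A subset of the question's prefixes, used here only as sample data.
my @prefixes = ('|', '||', '###', '##');

# Longest alternatives first, each quoted so '|' is matched literally.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @prefixes;
my $strip_re = qr/^(?:$alternation)+/;

# Hypothetical helper: remove any run of the prefixes from the start of a line.
sub strip_prefixes {
    my ($line) = @_;
    $line =~ s/$strip_re//;
    return $line;
}

print strip_prefixes($_), "\n" for '||https://ads', '###http', 'plain';
```

Compiling the pattern once with qr// keeps the per-line work to a single substitution.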
Finally, one thing unrelated to the question: it would be much more appropriate to use an array instead of a hash to map consecutive integers to strings. Unless you have some other requirement, better use:
our @ads_list; # no need to initialize the array (or the hash) with an empty list
...
while( <$ADS> ) {
    chomp;
    s/.../;
    push @ads_list, $_;
}
$ads_list_hash{$lines} = $_;
$lines ++;
Don't do that. If you want an array, use an array:
push @ads_lines, $_;
Shawn's Rule of Programming #7: When creating data structures: if preserving the order is important, use an array; otherwise use a hash.
Because substitutions return whether or not they did anything you can use a
substitution to search the string for your pattern and remove it if it's there.
while( <$ADS> ) {
    next unless s/^\s*(?:[#]{2,3}|(?:\@\@)?[|]{1,2})\s*//;
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}