perl search sentence for keywords - perl

If I want to find a keyword in a sentence using Perl I have something like this:
foreach $line (#lines)
{
if ($line =~/keyword/)
{
print $line;
}
}
If I want to see if more keywords are in the sentence how should I change the matching?

There are several solutions. The easiest is to use something like /keyword.*keyword/.
When you want to count number of the keywords in a string (not simply check if there two keywords) you can do something like:
for(#lines){
my $n = 0;
$n++ while /(keyword)/g;
print if ($n>2);
}
By the way, your code can be more concise:
for (#lines) {
print if /keyword/;
}
That is the same.

If you want to process each match of all matches (g modifier):
my $number_of_matches = 0;
foreach $line (#lines)
{
while ( $line =~ m/keyword/g )
{
do_something_you_need();
$number_of_matches++;
}
}

my #words=map {split / /;} #lines;
foreach my $el(#keywords) {
#match=grep {$el eq $_} #words;
}

Do you want to see if the sentence contains other (different) keywords, or do you want to check whether it contains the same keyword multiple times.
For the first, you can write
if ($line =~ /keyword1|keyword2|keyword3/) { ... }
and for the second, it looks like this
my $n = () = $line =~ /keyword/g;

Related

How to join lines in between 2 conditions

I'm listing a program to customly clean up a phonebook .vcf file.
I can't figure out how to place in a single string variable everything between start with /^NOTE.+:/ and end before /X-ACCOUNT.*:/. Please consider there were presence of \n I'd like preserve for future contact formatting:
ADR;WORK;X-SYNCMLREF507891:;;NUENSCfff STRASSE 3-5;RIESCHWEILER;;66509;Germania
ORG:aaaa & CO. KG
NOTE;ENCODING=QUOTED-PRINTABLE:Hall 30 F35=0A*ddddd#jggggine.com =0A014 =
EURO =0A*kkkk#ggggne.com =0A=
Bjjj#rrrLINE.COM
X-ACCOUNT:Local Phone Account;Phone
maybe using any foreach loop. but I can't concatenate only what I need and stop at the right point:
foreach $in (<IN>){
if($in =~ /^NOTE.*:|=$/){
unless($in =~ /^[A-Z]+:/){
$in =~ s/(.+=$)\n//;
$inn .= $1;
}
}
$a = $a;
}
Any recomandations ?
More or less this:
$inn .= $in if (($in =~ /^NOTE/ .. $in =~ /^X-ACCOUNT/) && $in !~ /^X-ACCOUNT/);
Using implicit $_ would make it shorter, perhaps:
while(<IN>) {
$inn .= $_ if (/^NOTE/ .. /^X-ACCOUNT/ and !/^X-ACCOUNT/);
See perlop for Range Operators and perhaps also perlop for why and is lower precedence than &&.
Less one-liney, perhaps more understandable : add a flag which indicates "currently accumulating lines in a variable"
$accumulating = 0;
foreach $in (<IN>){
if ($in =~ /^NOTE/) {
$accumulating = 1;
}
if ($in =~ /^X-ACCOUNT:/) {
$accumulating = 0;
# possibly : add some code which does something with $inn
}
if ($accumulating) {
$inn .= $in;
}
}

Perl question. Need a way to replace two variables on the same line

I am trying to write a replace script in Perl, and I have it working halfway, but it seems that I cannot replace two strings in the same line.
I have a json file named foo.json that contains the following line: "title": "%CLIENT% Dashboard Web Map %WEBMAPENV%",
Now, I have a second file named env.txt that contains all the variables that I wish to use. In this file, there is an entry called: %WEBMAPENV%=(Test-Dev). My goal is to have PERL read the file env.txt, and replace BOTH "%CLIENT% and %WEBMAPENV% simultaneously.
Here is my code so far:
my $envFilePath = "$dirScripts/env/env.txt";
# Reading Firebase variables from Test environment file.
open($fh, "<", $envFilePath);
while (my $line=<$fh>) {
if ($line eq "\n") {
next;
}
if ($line =~ m/^(%\w+%)=/) {
$cur_key = $1;
$line =~ s/$cur_key=//;
$replacements{$cur_key} = $line;
} else {
$replacements{$cur_key} .= $line;
}
}
...
my $targetFilePath3 = "$dirHome/foo.json";
tie my #v_lines, 'Tie::File', $targetFilePath3, autochomp => 0 or die $!;
replaceEnvVars(#v_lines);
# Replace the environment variables as part of the setup.
sub replaceEnvVars {
for my $line (#_) {
if ($line =~ m/(%\w+%)/) {
my $key = $1;
if (defined($replacements{$key})) {
my $value = $replacements{$key};
chomp $value;
$line =~ s/$key/$value/g;
}
}
}
untie #_;
}
I am only able to substitute one variable per line, but I need to be able to handle 2.
Can any offer some help?
Derek
You only check for one.
if ($line =~ m/(%\w+%)/) { ... }
Solution:
# Clean up %replacements before using it.
chomp for values %replacements;
for my $line (#_) {
$line =~ s{(%\w+%)}{ $replacements{$1} // $1 }eg;
}
By adding a loop inside of s/// (through the use of /g) rather than a loop around s///, this one doesn't mess up if the values contain %.
/e means the replacement will be run as Perl code.
// is the "defined-or" operator. It works like || but looks for defined rather than truth.
See the Perl Regex Tutorial for more.

Amend perl script so that words are matched on a word for word basis

I have been using this perl script (thanks to Jeff Schaller) to match 3 or more words in the title fields of two separate csv files.
Original question here:
https://unix.stackexchange.com/questions/283942/matching-3-or-more-words-from-fields-in-separate-csv-files?noredirect=1#comment494461_283942
I have also added some exception functionality following advice from meuh:
#!/bin/perl
my #csv2 = ();
open CSV2, "<csv2" or die;
#csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (#csv2) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
$csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
my #titlewords = split /\s+/, $title; #/ get words
my #new; #add exception words which shouldn't be matched
foreach my $t (#titlewords){
push(#new, $t) if $t !~ /^(and|if|where)$/i;
}
#titlewords = #new;
my $desired = 3;
my $matched = 0;
foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};
foreach my $word (#titlewords) {
++$count if $value =~ /\b$word\b/i;
last if $count >= $desired;
}
if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n" if $matched;
}
close CSV1;
During my testing, one issue I've found that I would like to tweak is that if csv2 contains a single common word such as the, if this is replicated in csv1 three or more times then three positive matches is found. To clarify:
If csv1 contains:
1216454,the important people feel the same way as the others, 15445454, 45445645
^ i.e. there are three insatnces of the in the above line
If csv2 contains:
14564564,the tallest man on earth,546456,47878787
^ i.e. there is one instance of the in this line
Then I would like only one word to be classed as matching, and there be no output (based on my desired number of matching words- 3 ) because there is only one instance of the matching word in one of the files.
However if:
csv1 contained:
1216454,the important people feel the same way as the others,15445454, 45445645
and csv2 contained:
15456456,the only way the man can sing the blues,444545,454545
Then, as there are three matching words in each (i.e. 3 instances of the word the in each title, then I would like this to be classed as a matching title based on my desired number of matching words being 3 or more, thus generating the output:
1216454,the important people feel the same way as the others,15445454, 45445645
15456456,the only way the man can sing the blues,444545,454545
I would like to amend the script so that if there is one instance of a word in a csv, and multiple instances of the same word in the other csv then that is classed as only one match. However, if there were say 3 instance of the word the in both files, then it should still be classed as three matches. Basically I would like matches to be on a word for word basis.
Everything about the script other than this is perfect so I would rather not go back to the drawing board completely as I am happy with everything other than this.
I hope I've explained it ok, if anyone need any clarification let me know.
If you just wan to count unique matches, you can use a hash instead of a list to collect the words from csv1, just like you do for csv2, and then also count the occurrences of each word separately:
#!/usr/bin/env perl
my #csv2 = ();
open CSV2, "<csv2" or die;
#csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (#csv2) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
$csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
my %words;
$words{$_}++ for split /\s+/, $title; #/ get words
## Collect unique words
my #titlewords = keys(%words);
my #new; #add exception words which shouldn't be matched
foreach my $t (#titlewords){
push(#new, $t) if $t !~ /^(and|if|where)$/i;
}
#titlewords = #new;
my $desired = 3;
my $matched = 0;
foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};
foreach my $word (#titlewords) {
my #matches = ( $value=~/\b$word\b/ig );
my $numIncsv2 = scalar(#matches);
#matches = ( $title=~/\b$word\b/ig );
my $numIncsv1 = scalar(#matches);
++$count if $value =~ /\b$word\b/i;
if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
$count = $desired+1;
last;
}
}
if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n" if $matched;
}
close CSV1;

Using a regular expression with nested for loops, using Perl

I have two arrays:
#file_list holds a list of files in a directory, and
#name_list holds some names.
For example, these arrays could contain
#file_list = ('Bob_car', 'Bob_house', 'Bob_work', 'Fred_car', 'Fred_house', 'Fred_work', ...);
#name_list = ('Bob', 'Fred', ...);
(the real data is not that simple).
My goal is to compare each file with every name and see if they match. They match if the file string starts with the name.
I could then use these matches to sort the files into new directories, based on their corresponding name.
Here is my code:
for ( my $i = 0; $i < scalar #file_list ; $i++ )
{
for ( my $j = 0; $j < #name_list ; $j++ )
{
if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
{
print "$file_list[ $i ] goes with $name_list[ $j ]\n";
}
else
{
print "no match\n";
}
}
}
However, I don't get any matches. I've tested the individual loops and they are working. Else, is there something off about the regex?
About how the arrays were made:
For #name_list, the file containing the names is organized in a seemingly random way, just because of how it was used for something else. The names in that file are on several different lines, with lots of blank lines in between and lots of blank entries within lines. Names can appear more than once.
I used the following code to make #name_list:
while (my $line = <$OriginalFILE>)
{
chomp $line;
my #current_line = split( "\t", $line );
for ( my $i = 0; $i < scalar #current_line ; $i ++ )
{
if ( $current_line[ $i ] =~ m/^\s*$/ )
{
# print "$current_line[$i] is blank\n";
}
else
{
push( #raw_name_list, $current_line[ $i ] );
}
} # end of for
} # while
# collect list without repeat instances of the same name
my %unique = ();
foreach my $name (#raw_name_list)
{
$unique{$name} ++;
}
my #name_list = keys %unique;
foreach my $name ( #name_list )
{
# print "$name\n";
chomp $name;
unless(mkdir $name, 0700)
{
die "Unable to create directory called $name\n";
}
}
The array #file_list was made using:
opendir(DIR, $ARGV[1]);
my #file_list = grep ! /^\./, readdir DIR;
closedir(DIR);
# print #file_list;
#amon, here is what i did to test the loops and regex:
FILE: for my $file (#transposed_files) {
print "$file\n";
for my $name (#transposedunique) {
print "i see this $name\n";
if ($file =~ /^\Q$name\E/) {
print "$file goes with $name\n";
next FILE;
}
}
#print "no match for $file\n";
}
oh, and I transposed the arrays, so that they would print to an outfile into separate rows.
Short version: You are building your name array wrong. Look at this line:
$unique{name} ++;
You are just incrementing the name entry of the hash. You probably wanted the $name variable.
The Longer Version
On English, and Foreach-Loops
Your code is a bit unperlish and looks more like C than like Perl. Perl is much closer to English than you might think. From the original wording of your question:
take the first element from #file_list and then to compare that to each element in #name_list
You wrote this as
for (my $i = 0; $i < #file_list; $i++) {
for (my $j = 0; $j < #name_list; $j++) {
...; # compare $file_list[$i] with $name_list[$j]
}
}
I'd rather do
for my $file (#file_list) {
for my $name (#name_list) {
...; # compare $file with $name
}
}
and save myself from the hassle of array subscripting.
Building Correct Regexes
Your code contains the following test:
$file_list[ $i ] =~ m/^$name_list[ $j ]/
This will not do what you think if $name_list[$j] contains special characters like (, ., +. You can match the literal contents of a variable by enclosing it in \Q ... \E. This would make the code
$file =~ /^\Q$name\E/
(if used with my variant of the loop).
You could also go the nifty route and compare the leading substring directly:
$name eq substr $file, 0, length($name)
This expresses the same condition.
On Loop Control
I will make two assumptions:
You are only interested in the first matching name for any file
You only want to print the no match message if no name was found
Perl allows us to break out of arbitrary loops, or restart the current iteration, or go directly to the next iteration, without using flags, as you would do in other languages. All we have to do is to label our loops like LABEL: for (...).
So once we have a match, we can start our search for the next file. Also, we only want to print no match if we left the inner loop without going to the next file. This code does it:
FILE: for my $file (#file_list) {
for my $name (#name_list) {
if ($file =~ /^\Q$name\E/) {
print "$file goes with $name\n";
next FILE;
}
}
print "no match for $file\n";
}
The Zen of Negation
In your file parsing code, you express a condition
if ($field =~ /^\s*$/) {
} else {
# do this stuff only if the field does not consist only of
# zero or more whitespace characters
}
That description is far to complex. How about
if ($field =~ /\S/) {
# do this stuff only if the field contains a non-whitespace character.
}
The same condition, but simpler, and more efficient.
Simplify your Parse
In short, your file parsing code can be condensed to
my %uniq;
while (<$OriginalFILE>) {
chomp;
$uniq{$_} = undef for grep /\S/, split /\t/;
}
my #name_list = sort { length($b) <=> length($a) } keys %uniq;
The split function takes a regex as first argument, and will split on $_ if no other string is specified. It returns a list of fields.
The grep function takes a condition and a list, and will return all elements of a list that match the condition. The current element is in $_, which regexes match by default. For explanation of the regex, see above.
Note: This still allows for the fields to contain whitespace, even in leading position. To split on all whitespace, you can give split the special argument of a string containing a single space: split ' '. This would make the grep unneccessary.
The for loop can also be used as a statement modifier, i.e. like EXPR for LIST. The current element is in $_. We assign something to the $_ entry in our %uniq hash (which is already initialized to the empty hash). This could be a number, but undef works as well.
The keys are returned in a seemingly random order. But as multiple names could match a file, but we only want to select one match, we will have to match the most specific name first. Therefore, I sort the names after their length in descending order.
Your code seems to work for me. All I did was construct two arrays like this:
my #file_list = qw/Bob_car Bob_house Bob_work Fred_car Fred_house Fred_work/;
my #name_list = qw/Fred Bob Mary/;
Then running your code produces output like this:
no match
Bob_car goes with Bob
no match
no match
Bob_house goes with Bob
no match
no match
Bob_work goes with Bob
no match
Fred_car goes with Fred
no match
no match
Fred_house goes with Fred
no match
no match
Fred_work goes with Fred
no match
no match
So it looks like it's working.
A common problem with reading input from files or from a user is forgetting to strip the newline character from the end of the input. This could be your problem. If so, have a read about perldoc -f chomp, and just chomp each value as you add it to your array.
I'm always interested in doing things in efficient way so every time I see O(N^2) algorithm rings bells for me. Why it should be O(N*M) and not O(N+M)?
my $re = join('|',map quotemeta, #name_list);
$re = qr/$re/;
for my $file (#file_list) {
if($file =~ /^($re)/) {
my $name = $1;
... do what you need
}
}
its look something wrong in loop.
follow comments in code
for ( my $i = 0; $i < scalar #file_list ; $i++ )
{
#use some string variable assign it ""
for ( my $j = 0; $j < #name_list ; $j++ )
{
if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
{
# assign string variable to founded name_list[$j]
break loop
}
}
# check condition if string not equal to "" match found print your requirement with string value else match not found
}

How to matching sequential words in a sentence using Perl?

Is there better way to match words other than this method, im trying to find the word in the array that occur in any of the sentences.
my $count = 0;
my #strings = (
"i'm going to find the occurrence of two words going if possible",
"i'm going to find the occurrence of two words if impossible",
"to find a solution to this problem",
"i will try my best for a way to match this problem"
);
#neurot = qw(going match possible);
my $com_neu = '\b'.join('\b|\b', #neurot).'\b';
foreach my $sentence (#string){
#l = $sentence =~ /($com_neu)/gi;
foreach my $list (#l){
if($list =~ m/\w['\w-]*/){
print $list;
$count++;
}
}
print $count;
}
Output:
String 1: going going possible
String 2: going
String 3:
String 4: match
please help me with a faster way.
Thanks.
Another approach could be to use hash to match the words:
my %neurot_hash = map { lc($_) => 1 } qw(going match possible);
for my $sentence (#strings) {
for my $found (grep { $neurot_hash{ lc($_) } } $sentence =~ /\w['\w-]*/gi) {
print $found, " ";
}
print "\n";
}
For data you provided this method is ~ 7% faster. But keep in mind that the data set is very small, so YMMV.
what about the 'smart-match' operator?
foreach my $elem (#neurot){
if(/$elem/i ~~ #strings){
print "Found $elem\n";
}
}
The same as bvr answer, but perhaps cleaner
my %neurot_hash = map { lc($_) => 1 } qw(going match possible);
for my $sentence (#strings) {
my #words = split /[^\w']/, $sentence;
#I am not sure if you want to take "i'm" as a separate word.
#Apparently, stackoverflow does not like '.
my #found = grep { exists $neurot_hash{ lc($_) } } #words;
print join (" ", #found);
print "\n";
}