How to join lines in between 2 conditions - perl

I'm writing a program to clean up a phonebook .vcf file in a custom way.
I can't figure out how to put into a single string variable everything from the line matching /^NOTE.+:/ up to, but not including, the line matching /X-ACCOUNT.*:/. Note that the text contains \n characters, which I'd like to preserve for formatting the contact later:
ADR;WORK;X-SYNCMLREF507891:;;NUENSCfff STRASSE 3-5;RIESCHWEILER;;66509;Germania
ORG:aaaa & CO. KG
NOTE;ENCODING=QUOTED-PRINTABLE:Hall 30 F35=0A*ddddd@jggggine.com =0A014 =
EURO =0A*kkkk@ggggne.com =0A=
Bjjj@rrrLINE.COM
X-ACCOUNT:Local Phone Account;Phone
Maybe with a foreach loop, but I can't manage to concatenate only what I need and stop at the right point:
foreach $in (<IN>){
if($in =~ /^NOTE.*:|=$/){
unless($in =~ /^[A-Z]+:/){
$in =~ s/(.+=$)\n//;
$inn .= $1;
}
}
$a = $a;
}
Any recommendations?

More or less this:
$inn .= $in if (($in =~ /^NOTE/ .. $in =~ /^X-ACCOUNT/) && $in !~ /^X-ACCOUNT/);
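Using the implicit $_ would make it shorter, perhaps:
while(<IN>) {
$inn .= $_ if (/^NOTE/ .. /^X-ACCOUNT/ and !/^X-ACCOUNT/);
}
See perlop for Range Operators, and perhaps also perlop for why "and" is needed here: it has lower precedence than "..", whereas && binds more tightly than ".." and would attach to the right-hand operand of the range instead.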
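To try the range (flip-flop) operator without a file, here is a minimal sketch that runs on an in-memory list of lines. The helper name collect_note and the shortened sample data are made up for illustration; only the matching logic comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every line from a /^NOTE/ line up to,
# but not including, the following /^X-ACCOUNT/ line.
sub collect_note {
    my @lines = @_;
    my $note  = '';
    for (@lines) {
        # The flip-flop becomes true on /^NOTE/ and false again after
        # /^X-ACCOUNT/; the second test drops the X-ACCOUNT line itself.
        $note .= $_ if (/^NOTE/ .. /^X-ACCOUNT/) and !/^X-ACCOUNT/;
    }
    return $note;
}

# Abbreviated, made-up sample in the shape of the vCard from the question.
my @vcard = (
    "ORG:aaaa & CO. KG\n",
    "NOTE;ENCODING=QUOTED-PRINTABLE:Hall 30 F35=0A014 =\n",
    "EURO =0A=\n",
    "last note line\n",
    "X-ACCOUNT:Local Phone Account;Phone\n",
);
print collect_note(@vcard);
```

The newlines are kept because each line is appended unchanged. Note that the flip-flop keeps its state between iterations, so input that never reaches an X-ACCOUNT line would leave it switched on.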

Less one-liney, but perhaps more understandable: add a flag that indicates "currently accumulating lines in a variable":
$accumulating = 0;
foreach $in (<IN>){
if ($in =~ /^NOTE/) {
$accumulating = 1;
}
if ($in =~ /^X-ACCOUNT:/) {
$accumulating = 0;
# possibly : add some code which does something with $inn
}
if ($accumulating) {
$inn .= $in;
}
}

Related

Amend perl script so that words are matched on a word for word basis

I have been using this perl script (thanks to Jeff Schaller) to match 3 or more words in the title fields of two separate csv files.
Original question here:
https://unix.stackexchange.com/questions/283942/matching-3-or-more-words-from-fields-in-separate-csv-files?noredirect=1#comment494461_283942
I have also added some exception functionality following advice from meuh:
#!/bin/perl
my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (@csv2) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
$csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
my @titlewords = split /\s+/, $title; #/ get words
my @new; #add exception words which shouldn't be matched
foreach my $t (@titlewords){
push(@new, $t) if $t !~ /^(and|if|where)$/i;
}
@titlewords = @new;
my $desired = 3;
my $matched = 0;
foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};
foreach my $word (@titlewords) {
++$count if $value =~ /\b$word\b/i;
last if $count >= $desired;
}
if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n" if $matched;
}
close CSV1;
During testing, I found one issue I would like to fix: if csv2 contains a single common word such as "the", and that word appears three or more times in csv1, then three positive matches are found. To clarify:
If csv1 contains:
1216454,the important people feel the same way as the others, 15445454, 45445645
^ i.e. there are three instances of "the" in the above line
If csv2 contains:
14564564,the tallest man on earth,546456,47878787
^ i.e. there is one instance of the in this line
Then I would like only one word to be classed as matching, and there be no output (based on my desired number of matching words- 3 ) because there is only one instance of the matching word in one of the files.
However if:
csv1 contained:
1216454,the important people feel the same way as the others,15445454, 45445645
and csv2 contained:
15456456,the only way the man can sing the blues,444545,454545
Then, as there are three matching words in each (i.e. 3 instances of the word "the" in each title), I would like this to be classed as a matching title, based on my desired number of matching words being 3 or more, thus generating the output:
1216454,the important people feel the same way as the others,15445454, 45445645
15456456,the only way the man can sing the blues,444545,454545
I would like to amend the script so that if there is one instance of a word in one csv and multiple instances of the same word in the other csv, that is classed as only one match. However, if there were, say, 3 instances of the word "the" in both files, then it should still be classed as three matches. Basically, I would like matches to be on a word-for-word basis.
Everything about the script other than this is perfect so I would rather not go back to the drawing board completely as I am happy with everything other than this.
I hope I've explained it OK; if anyone needs any clarification, let me know.
If you just want to count unique matches, you can use a hash instead of a list to collect the words from csv1, just like you do for csv2, and then also count the occurrences of each word separately:
#!/usr/bin/env perl
my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (@csv2) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
$csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
my %words;
$words{$_}++ for split /\s+/, $title; #/ get words
## Collect unique words
my @titlewords = keys(%words);
my @new; #add exception words which shouldn't be matched
foreach my $t (@titlewords){
push(@new, $t) if $t !~ /^(and|if|where)$/i;
}
@titlewords = @new;
my $desired = 3;
my $matched = 0;
foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};
foreach my $word (@titlewords) {
my @matches = ( $value=~/\b$word\b/ig );
my $numIncsv2 = scalar(@matches);
@matches = ( $title=~/\b$word\b/ig );
my $numIncsv1 = scalar(@matches);
++$count if $value =~ /\b$word\b/i;
if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
$count = $desired+1;
last;
}
}
if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n" if $matched;
}
close CSV1;
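As a separate illustration of the word-for-word idea from the question, here is a minimal sketch that scores two titles by letting each word contribute the smaller of its two occurrence counts. This is not the script above: the helper name word_for_word_matches is hypothetical, and splitting on non-word characters is an assumption about how titles should be tokenised.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: count word-for-word matches between two titles.
# Each word contributes min(occurrences in $t1, occurrences in $t2).
sub word_for_word_matches {
    my ($t1, $t2) = @_;
    my (%in1, %in2);
    $in1{ lc $_ }++ for grep { $_ ne '' } split /\W+/, $t1;
    $in2{ lc $_ }++ for grep { $_ ne '' } split /\W+/, $t2;
    my $total = 0;
    for my $word (keys %in1) {
        next if $word =~ /^(?:and|if|where)$/;   # exception words from the question
        $total += min($in1{$word}, $in2{$word} // 0);
    }
    return $total;
}

my $csv1_title = 'the important people feel the same way as the others';
print word_for_word_matches($csv1_title, 'the tallest man on earth'), "\n";                # 1
print word_for_word_matches($csv1_title, 'the only way the man can sing the blues'), "\n"; # 4
```

With a threshold of 3, the first pair would be rejected (one shared "the") and the second accepted (three times "the" plus "way").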

Save the result of an expression to be used in a separate block

As you can see, I want to store the result of the first substitution in $enkel. I use this $enkel in the print of if.
But when I want to use this $enkel in the elsif print, it has no value. I actually want to use both $dubbel and $enkel in the elsif print.
Is there a way so that Perl stores this permanently in $dubbel, so it can be used in other prints?
if ($inputwoord =~ /((aa|uu|ee|oo)[^aeiour])$/) {
($enkel = $inputwoord) =~ s/([aueo])\1/$1/g;
print "$enkel$buig\n$inputwoord$gen\n$enkel$comp\n$enkel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
}
elsif ($inputwoord =~ /[^aeiou][aeiou]([pktgnmlf])$/) {
($dubbel = $inputwoord) =~ s /([pktgnmlf]$)/$1$1/g;
print "$dubbel$buig\n$inputwoord$gen\n$dubbel$comp\n$dubbel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
} # consonantgeminatie
You must calculate the value of $enkel outside the if statement if you want to use it in both the if and the elsif clauses, and the same goes for $dubbel. Like this:
($enkel = $inputwoord) =~ s/([aueo])\1/$1/g;
($dubbel = $inputwoord) =~ s/([pktgnmlf]$)/$1$1/g;
if ($inputwoord =~ /((aa|uu|ee|oo)[^aeiour])$/) {
print "$enkel$buig\n$inputwoord$gen\n$enkel$comp\n$enkel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
}
elsif ($inputwoord =~ /[^aeiou][aeiou]([pktgnmlf])$/) {
print "$dubbel$buig\n$inputwoord$gen\n$dubbel$comp\n$dubbel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
} # consonantgeminatie

how to replace a div[ ] element with another div[ ] element?

for ($i=0; $i<10; $i++)
{
my $v1 = $sel->get_text("//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td/div/div");
my $v2 = $sel->get_text("//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td[2]/div/div");
print ($v1 . $v2);
}
For every iteration, it has to find the 14th element starting from div[10] and replace it with the next-higher div[ ] index (e.g. if the 14th element is div, replace it with div[2]; in the next iteration, find the 14th element, i.e. div[2], and replace it with div[3], and so on).
I couldn't do it with pattern matching. Is there a way to find that particular element and replace it using a regex? How can I do it?
my $a = "//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td/div/div";
my @arr = split ('/' , $a);
print "@arr \n";
my $size1 = @arr;
print "$size1\n";
print $arr[16];
foreach my $a2 (@arr)
{
print "$a2 \n";
}
my $b = "//body[\@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td[2]/div/div";
The two variables mentioned in the question above as v1 and v2 (edited to $a and $b) both need the modification applied. I think I'm close to what you've described. Can you please help me further?
use 5.010;
my $xpath = q(//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div/table/tbody/tr/td/div/div);
for my $i (0..10) {
my @nodes = split qr'/', $xpath;
$nodes[16] .= "[$i]" unless 0 == $i;
say join '/', @nodes;
}
Results:
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[1]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[2]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[3]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[4]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[5]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[6]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[7]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[8]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[9]/table/tbody/tr/td/div/div
//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[10]/table/tbody/tr/td/div/div
Ummm, all elements are separated by /, right? So you can use the native split function to split the portion of the text following div[10] on /. Store the result in an array @arr. Join it back together to find the length of that portion, say $len. Find the index of div[10] in the original string, say $orig_index. Then take the 14th element and do a regex match to see which format it is in:
if ($arr[13] =~ /^div\[(\d+)\]$/) {
# already indexed: increase the index by one
$arr[13] = 'div[' . ($1 + 1) . ']';
}
else {
# bare div: start at index 2
$arr[13] = 'div[2]';
}
Now that the 14th element is changed, join the array to get the new string for the portion between div[10] and the 14th element:
{
local $" = '/'; # list separator used when interpolating "@arr"
$newstring = "@arr";
}
substr($originalstring, $orig_index, $len, $newstring); # 4-argument substr replaces in place
I think that will do.
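The increment-or-start-at-2 logic described above can be sketched as one small function. The name bump_step is hypothetical, and the sketch assumes that no predicate in the path contains a / character, since it splits the whole XPath on /. Index 16 is the position of the 14th element counted from div[10], as in the first answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $xpath in which the step at
# position $index (counted after splitting on '/') has its numeric
# index increased by one. A bare step such as "div" becomes "div[2]".
sub bump_step {
    my ($xpath, $index) = @_;
    my @steps = split m{/}, $xpath, -1;    # -1 keeps trailing empty fields
    if ($steps[$index] =~ /^(\w+)\[(\d+)\]$/) {
        $steps[$index] = $1 . '[' . ($2 + 1) . ']';
    }
    else {
        $steps[$index] .= '[2]';
    }
    return join '/', @steps;
}

my $xpath = q(//body[@id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div/table/tbody/tr/td/div/div);
for my $i (1 .. 3) {
    $xpath = bump_step($xpath, 16);    # div -> div[2] -> div[3] -> div[4]
    print "$xpath\n";
}
```

The same call can be applied to the second path (the one ending in td[2]/div/div), because the step being changed sits at the same position in both.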

perl search sentence for keywords

If I want to find a keyword in a sentence using Perl I have something like this:
foreach $line (@lines)
{
if ($line =~/keyword/)
{
print $line;
}
}
If I want to see if more keywords are in the sentence how should I change the matching?
There are several solutions. The easiest is to use something like /keyword.*keyword/.
If you want to count the number of keywords in a string (not simply check whether there are two), you can do something like:
for(@lines){
my $n = 0;
$n++ while /(keyword)/g;
print if ($n>2);
}
By the way, your code can be more concise:
for (@lines) {
print if /keyword/;
}
It does the same thing.
If you want to process every match (using the g modifier):
my $number_of_matches = 0;
foreach $line (@lines)
{
while ( $line =~ m/keyword/g )
{
do_something_you_need();
$number_of_matches++;
}
}
my @words=map {split / /;} @lines;
foreach my $el(@keywords) {
@match=grep {$el eq $_} @words;
}
Do you want to see if the sentence contains other (different) keywords, or do you want to check whether it contains the same keyword multiple times?
For the first, you can write
if ($line =~ /keyword1|keyword2|keyword3/) { ... }
and for the second, it looks like this
my $n = () = $line =~ /keyword/g;
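Combining both cases, here is a minimal sketch that reports how often each of several keywords occurs in a line. The helper name keyword_counts is made up, and whole-word, case-insensitive matching is an assumption; drop the \b anchors or the i modifier if substrings or exact case matter.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each keyword occurs in $line,
# matching whole words case-insensitively.
sub keyword_counts {
    my ($line, @keywords) = @_;
    my %count;
    for my $kw (@keywords) {
        # The empty-list assignment forces list context, so $n receives
        # the number of matches rather than a true/false value.
        my $n = () = $line =~ /\b\Q$kw\E\b/gi;
        $count{$kw} = $n if $n;
    }
    return %count;
}

my %found = keyword_counts('the cat saw the other cat and the dog', qw(cat dog bird));
print "$_: $found{$_}\n" for sort keys %found;    # cat: 2, dog: 1
```

Keywords that do not occur are left out of the result, so checking whether any keyword matched is just a test for a non-empty hash.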

How to match sequential words in a sentence using Perl?

Is there a better way to match words than this method? I'm trying to find the words from the array that occur in any of the sentences.
my $count = 0;
my @strings = (
"i'm going to find the occurrence of two words going if possible",
"i'm going to find the occurrence of two words if impossible",
"to find a solution to this problem",
"i will try my best for a way to match this problem"
);
@neurot = qw(going match possible);
my $com_neu = '\b'.join('\b|\b', @neurot).'\b';
foreach my $sentence (@strings){
@l = $sentence =~ /($com_neu)/gi;
foreach my $list (@l){
if($list =~ m/\w['\w-]*/){
print $list;
$count++;
}
}
print $count;
}
Output:
String 1: going going possible
String 2: going
String 3:
String 4: match
Please help me find a faster way.
Thanks.
Another approach could be to use a hash to match the words:
my %neurot_hash = map { lc($_) => 1 } qw(going match possible);
for my $sentence (@strings) {
for my $found (grep { $neurot_hash{ lc($_) } } $sentence =~ /\w['\w-]*/gi) {
print $found, " ";
}
print "\n";
}
For data you provided this method is ~ 7% faster. But keep in mind that the data set is very small, so YMMV.
What about the 'smart-match' operator?
foreach my $elem (@neurot){
if(/$elem/i ~~ @strings){
print "Found $elem\n";
}
}
The same as bvr's answer, but perhaps cleaner:
my %neurot_hash = map { lc($_) => 1 } qw(going match possible);
for my $sentence (@strings) {
my @words = split /[^\w']/, $sentence;
#I am not sure if you want to take "i'm" as a separate word.
#Apparently, stackoverflow does not like '.
my @found = grep { exists $neurot_hash{ lc($_) } } @words;
print join (" ", @found);
print "\n";
}