I have a sample of text:
my $text = 'a bb cc xx aa a b c a';
and a list of terms that might be in the text:
my @words = ('bb cc',
'a bb cc',
'xx aa a b',
'a b',
'a'
);
I need to find the occurrences of these words, using the longest matches possible, and not marking anything twice. So if I marked the matches in the text above, it would look like this:
<a bb cc> <xx aa a b> c <a>
Notice that I did not mark bb cc, because that is part of the larger match a bb cc.
Any ideas on a way to do this? I feel like it should have been encountered many times before.
A simple substitution should do, you'll have to sort by length:
my $re = '('.join('|', sort {length $b <=> length $a} map(quotemeta,@words)).')';
$text =~ s/$re/<$1>/g;
say $text;
The output is as expected on Perl 5.20.2; I can't check other versions right now.
The quotemeta call isn't actually needed for the examples you gave; it's there to escape characters that have special meaning in a regex.
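To see why the sort by length matters, here is a minimal sketch that runs the same substitution twice: once with the alternatives sorted longest first, and once with the list simply reversed (an arbitrary ordering chosen for illustration, not something from the question). Perl's alternation takes the first alternative that matches at a given position, so a short term listed early can shadow a longer one.

```perl
use strict;
use warnings;

my $text  = 'a bb cc xx aa a b c a';
my @words = ('bb cc', 'a bb cc', 'xx aa a b', 'a b', 'a');

# Longest alternatives first, as in the answer above.
my $sorted = join '|',
    sort { length($b) <=> length($a) } map { quotemeta($_) } @words;

# Illustrative bad ordering: shortest term ('a') ends up first.
my $reversed = join '|', map { quotemeta($_) } reverse @words;

(my $good = $text) =~ s/($sorted)/<$1>/g;
(my $bad  = $text) =~ s/($reversed)/<$1>/g;

print "$good\n";   # <a bb cc> <xx aa a b> c <a>
print "$bad\n";    # <a> <bb cc> <xx aa a b> c <a>
```

With the reversed list, the leading a is consumed on its own, so a bb cc can no longer match as a whole.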
Below is my code; please review.
use strict;
my @people = qw{a b c d e f};
foreach (@people){
    print $_,"$people[$_]\n";
}
Below is the output,
[~/perl]$ perl test.pl
aa #why is the output of $people[$_] not the same as $_?
ba
ca
da
ea
fa
Thanks for your asking.
$_ is the actual element you're looking at. $people[$_] is getting the $_th element out of @people. It's intended to be used with numerical indices, so it numifies its argument. Taking a letter and "converting" it to a number just converts to zero since it can't parse a number out of it. So $people[$_] is just $people[0] when $_ is not a number. If you try the same experiment where some of the list elements are actually numerical, you'll get some more interesting results.
Try:
use strict;
my @people = qw{a b c 3 2 1};
foreach (@people){
    print $_,"$people[$_]\n";
}
Output:
aa
ba
ca
33
2c
1b
In the first three cases, we couldn't parse a, b, or c as numbers, so we got the zeroth element. Then, we can actually convert 3, 2, and 1 to numbers, so we get elements 3, 2, and then 1.
EDIT: As mentioned in a comment by @ikegami, put use warnings at the top of your file in addition to use strict to get warned about this sort of thing.
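As a rough sketch of both points: the first part captures the warning that use warnings raises when a non-numeric string is used as an index, and the second part loops over indices, which is probably what was wanted if both position and element are needed. The variable names and the warning handler are illustrative only.

```perl
use strict;
use warnings;

my @people = qw{a b c d e f};

# Collect warnings in an array instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $name   = 'b';
my $picked = $people[$name];    # 'b' numifies to 0, so this is $people[0]
print "picked: $picked\n";      # picked: a
print "warning: $_" for @warnings;

# Iterate over indices to get position and element together.
my @lines;
for my $i (0 .. $#people) {
    push @lines, "$i: $people[$i]";
}
print "$_\n" for @lines;
```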
How can I split data on a particular letter while keeping that letter at the end of the preceding piece?
My Perl code:
$data ="abccddaabcdebb";
@split = split('b',"$data");
foreach (@split){
    print "$_\n";
}
This code produces output, but the output I expect is:
ab
ccddaab
cdeb
b
How can I do this?
You can use lookbehind to keep the b:
$data ="abccddaabcdebb";
@split = split(/(?<=b)/, $data);
foreach (@split){
    print "$_\n";
}
will print out
ab
ccddaab
cdeb
b
You'll need a positive lookbehind if you want to keep the letter b, because the delimiter itself is excluded from the resulting list.
my $data ="abccddaabcdebb";
my @split = split(/(?<=b)/, $data);
foreach (@split) {
    print "$_\n";
}
From perldoc -f split
Anything in EXPR that matches PATTERN is taken to be a separator that separates the EXPR into substrings (called "fields") that do not include the separator.
The first parameter of split defines what separates the elements you want to extract. b doesn't separate the elements you want, since it's actually part of what you want.
You could specify the split after b using
my @parts = split /(?<=b)/, $s;
You could also use
my @parts = $s =~ /[^b]*b/g;
Side note:
split /(?<=b)/
splits
a b c b b
at three spots
a b|c b|b|
so it results in four strings
ab
cb
b
Empty string
Fortunately, split removes trailing blank strings from its result by default, so it results in the three desired strings instead.
ab
cb
b
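A small sketch of the side note, using the string abcbb (the example above without the spaces): a negative LIMIT keeps the trailing empty field that split would otherwise drop. The second half compares the two approaches on an input that does not end in b (abcbbx, chosen here for illustration), where the match-based version leaves out the final piece.

```perl
use strict;
use warnings;

# Default: the trailing empty field is removed.
my @default  = split /(?<=b)/, 'abcbb';
# Negative LIMIT: the trailing empty field is kept.
my @keep_all = split /(?<=b)/, 'abcbb', -1;

print scalar(@default),  "\n";   # 3
print scalar(@keep_all), "\n";   # 4

# On input that does not end in b, the two approaches differ:
my @split_parts = split /(?<=b)/, 'abcbbx';    # ab, cb, b, x
my @match_parts = 'abcbbx' =~ /[^b]*b/g;       # ab, cb, b

print join(',', @split_parts), "\n";
print join(',', @match_parts), "\n";
```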
I am parsing a string in a subroutine that specifies a fixed number of parameters and two optional parameters. N.B. I also specify the parameter string being used.
This parameter string is of the form:
local_fs_name rem_fs_name timeout diff_limit hi hihi (rem_hi) (rem_hihi)
so definitely six parameters with two optional parameters for a max of eight.
Should the upper limit be set to the maximum number of parameters or one more than the maximum, i.e. eight or nine?
The only reasons to limit the number of fields split returns that I can think of are either for efficiency purposes (and your subroutine would have to be called a lot with very many more parameters than required for this to matter) or if you really want to keep the separators in the final field.
You shouldn't be using split to verify the number of parameters. Fetch all of them into an array and then verify the contents of the array. Something like this:
my $params = 'local_fs_name rem_fs_name timeout diff_limit hi hihi rem_hi rem_hihi';
my @params = split ' ', $params;
if (@params < 6 or @params > 8) {
    die "Usage: mysub local_fs_name rem_fs_name timeout diff_limit hi hihi [rem_hi [rem_hihi]]\n";
}
}
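One way to take this a step further, sketched under assumptions: parse_params is a hypothetical helper name, and the sample values in the call are made up. It splits without a limit, checks the count, and then names the fields with a hash slice, so the two optional parameters simply come back undefined when they are absent.

```perl
use strict;
use warnings;

# Hypothetical helper: split, validate the count, then name the fields.
sub parse_params {
    my ($param_string) = @_;
    my @params = split ' ', $param_string;
    if (@params < 6 or @params > 8) {
        die "Usage: mysub local_fs_name rem_fs_name timeout diff_limit hi hihi [rem_hi [rem_hihi]]\n";
    }
    my %p;
    # Hash slice: keys beyond the supplied values are set to undef.
    @p{qw(local_fs_name rem_fs_name timeout diff_limit hi hihi rem_hi rem_hihi)} = @params;
    return %p;
}

# Made-up sample values with only the six required parameters.
my %p = parse_params('/data /mnt/data 30 5 80 90');
print "hihi = $p{hihi}\n";
print "rem_hi is ", (defined $p{rem_hi} ? $p{rem_hi} : 'not set'), "\n";
```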
It's not a style (best practice) question.
split ' ', $_
and
split ' ', $_, 6
produce different results when 7+ args are provided.
>perl -E"say +( split ' ', 'a b c d e f g' )[5]"
f
>perl -E"say +( split ' ', 'a b c d e f g', 6 )[5]"
f g
My best guess is that you don't want to limit.
Then there's the question of whether you want to keep trailing fields or not.
>perl -E"@a=split(' ', 'a b c d e ' ); say 0+@a;"
5
>perl -E"@a=split(' ', 'a b c d e ', -1); say 0+@a;"
6
My best guess is trailing whitespace isn't significant.
I have a file with around 25000 records; each record has more than 13 entries, all drug names. I want to form all possible pair combinations of these entries. E.g., if a line has three entries A, B, C, I should form the combinations 1) A B 2) A C 3) B C. Below is the code I got from the internet; it works only if a single line is assigned to an array:
use Math::Combinatorics;
my @n = qw(a b c);
my $combinat = Math::Combinatorics->new(
    count => 2,
    data => [@n],
);
while ( my @combo = $combinat->next_combination ) {
    print join( ' ', @combo ) . "\n";
}
The code I am using, it doesn't produce any output:
open IN, "drugs.txt" or die "Cannot open the drug file";
open OUT, ">Combination.txt";
use Math::Combinatorics;
while (<IN>) {
    chomp $_;
    @Drugs = split /\t/, $_;
    @n = $Drugs[1];
    my $combinat = Math::Combinatorics->new(
        count => 2,
        data => [@n],
    );
    while ( my @combo = $combinat->next_combination ) {
        print join( ' ', @combo ) . "\n";
    }
    print "\n";
}
Can you please suggest me a solution to this problem?
You're setting @n to be an array containing only the second value of the @Drugs array. Try passing data => \@Drugs to the Math::Combinatorics constructor instead.
Also, use strict; use warnings; blahblahblah.
All pairs from an array are straightforward to compute. Using drugs A, B, and C as from your question, you might think of them forming a square matrix.
AA AB AC
BA BB BC
CA CB CC
You probably do not want the “diagonal” pairs AA, BB, and CC. Note that the remaining elements are symmetrical. For example, element (0,1) is AB and (1,0) is BA. Here again, I assume these are the same and that you do not want duplicates.
To borrow a term from linear algebra, you want the upper triangle. Doing it this way eliminates duplicates by construction, assuming that each drug name on a given line is unique. An algorithm for this is below.
1. Select in turn each drug q on the line. For each of these, perform steps 2 and 3.
2. Beginning with the drug immediately following q and then for each drug r in the rest of the list, perform step 3.
3. Record the pair (q, r).
The recorded list is the list of all unique pairs.
In Perl, this looks like
#! /usr/bin/env perl
use strict;
use warnings;
sub pairs {
    my @a = @_;
    my @pairs;
    foreach my $i (0 .. $#a) {
        foreach my $j ($i+1 .. $#a) {
            push @pairs, [ @a[$i,$j] ];
        }
    }
    wantarray ? @pairs : \@pairs;
}
my $line = "Perlix\tScalaris\tHashagra\tNextium";
for (pairs split /\t/, $line) {
    print "@$_\n";
}
Output:
Perlix Scalaris
Perlix Hashagra
Perlix Nextium
Scalaris Hashagra
Scalaris Nextium
Hashagra Nextium
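To apply the same nested-loop idea to every line of a file, something like the following sketch might work. It assumes tab-separated drug names, one record per line, and reads from an in-memory string standing in for drugs.txt so that it runs on its own; the sample names A to E are placeholders.

```perl
use strict;
use warnings;

# Return every unordered pair from the argument list (upper triangle only).
sub pairs {
    my @items = @_;
    my @pairs;
    for my $i (0 .. $#items) {
        for my $j ($i + 1 .. $#items) {
            push @pairs, [ @items[$i, $j] ];
        }
    }
    return @pairs;
}

# Assumption: placeholder data standing in for the contents of drugs.txt.
my $sample = "A\tB\tC\nD\tE\n";
open my $in, '<', \$sample or die "Cannot open sample data: $!";

my @output;
while (my $line = <$in>) {
    chomp $line;
    my @drugs = split /\t/, $line;            # one record per line
    push @output, join(' ', @$_) for pairs(@drugs);
}
close $in;

print "$_\n" for @output;
```

For the real file, replacing the in-memory open with open my $in, '<', 'drugs.txt' should be enough, and the pairs can be printed to an output filehandle instead of being collected.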
I've answered something like this before for someone else. For them, they had a question on how to combine a list of letters into all possible words.
Take a look at How Can I Generate a List of Words from a group of Letters Using Perl. In it, you'll see an example of using Math::Combinatorics from my answer and the correct answer that ikegami had. (He did something rather interesting with regular expressions).
I'm sure one of these will lead you to the answer you need. Maybe when I have more time, I'll flesh out an answer specifically for your question. I hope this link helps.
I need to apply a regexp filtration to affect only pieces of text within quotes and I'm baffled.
$in = 'ab c "d e f" g h "i j" k l';
#...?
$inquotes =~ s/\s+/_/g; #arbitrary regexp working only on the pieces inside quote marks
#...?
$out = 'ab c "d_e_f" g h "i_j" k l';
(the final effect can strip/remove the quotes if that makes it easier, 'ab c d_e_f g...)
You could figure out some cute trick that looks like line noise.
Or you could keep it simple and readable, and just use split and join. Using the quote mark as a field separator, operate on every other field:
my @pieces = split /\"/, $in, -1;
foreach my $i (0 .. $#pieces) {
    next unless $i % 2;
    $pieces[$i] =~ s/\s+/_/g;
}
my $out = join '"', @pieces;
If you want to use just a regex, the following should work:
my $in = q(ab c "d e f" g h "i j" k l);
$in =~ s{"(.+?)"}{$1 =~ s/\s+/_/gr}eg;
print "$in\n";
(You said the "s may be dropped :) )
HTH,
Paul
Something like
s/\"([\s\w]*)\"/
should match the quoted chunks. My perl regex syntax is a little rusty, but shouldn't just placing quote literals around what you're capturing do the job? You've then got your quoted string d e f inside the first capture group, so you can do whatever you want to it... What kind of 'arbitrary operation' are you trying to do to the quoted strings?
Hmm.
You might be better off matching the quoted strings, then passing them to another regex, rather than doing it all in one.
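Matching the quoted chunks and handing each one to a second substitution can be sketched as below. This variant keeps the quote marks by adding them back in the replacement, and it relies on the /r flag, which needs Perl 5.14 or later. The inner s/\s+/_/ stands in for whatever arbitrary operation is wanted.

```perl
use strict;
use warnings;
use 5.014;    # for the non-destructive /r flag

my $in = 'ab c "d e f" g h "i j" k l';

# Outer s///e finds each quoted chunk; the inner s///r rewrites a copy of it.
(my $out = $in) =~ s{"([^"]*)"}{ '"' . ($1 =~ s/\s+/_/gr) . '"' }eg;

print "$out\n";    # ab c "d_e_f" g h "i_j" k l
```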