Split one sentence with 「 and square brackets into multiple - perl

I have sentences of this pattern in the dictionary text data:
I have 「an absolute [a deadly] abhorrence of 「laziness [greasy food].
Is there a way that I can split it into 4 sentences as follows to make it more searchable in the dictionary (using Perl)?
I have an absolute abhorrence of laziness.
I have an absolute abhorrence of greasy food.
I have a deadly abhorrence of laziness.
I have a deadly abhorrence of greasy food.

An interesting problem. Here is one solution.
For now replace the open paren 「 by < and adjust the sentence.† Take an example string:
word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end
Split the string into tokens: substrings containing alternatives <...[...], and substrings with groups of words around them. Once we are here, break each alternatives-substring into the two alternatives and put that in an arrayref. So we'll have an array with:
('word', ['a A1', 'b b1'], 'and more', ['a A2', 'b b2'],
'but', ['a A3', 'b b3'], 'end')
Identify indices of alternatives (1,3,5 here)
Create all combinations of these indices (as a set, so find the set of all subsets, the power set). For the indices in a subset we take the first alternative when composing a sentence, for those not in the subset we take the second (or the other way round)
Go through the tokens array and print, selecting the alternatives as described above
I use Algorithm::Combinatorics for combinations but there are of course other libraries.
A program with a test sentence introduced above (and only ascii characters)
use warnings;
use strict;
use feature 'say';
use List::Util qw(any none);
use Algorithm::Combinatorics qw(subsets);
my $str = q(word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end);
say $str;
my @tokens =
    map { /^</ ? [ /<([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(<[^\[]+ \[[^\]]+\])/x, $str;
#say "@tokens";
my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;
#say "@idx";
my @subsets = subsets( \@idx );
for my $ss (@subsets) {
    my @take_0 = @$ss;
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}
This greatly simplifies things, considering all the kinds of sentence structure and details found in natural languages. There is plenty of room for code improvement, and there's a bit of cleanup to do (extra spaces, for one), but it does print all combinations with alternative phrases.
The library can generate one item at a time: when invoked in scalar context its functions return an iterator, on which ->next gives the next item. This is important for very large sets of items.
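The one-item-at-a-time idea can also be sketched without the library. This is only an illustration, not the module's implementation: it assumes the tokens are already parsed as described above, and it treats each subset as a bitmask, so it only suits sentences with a modest number of alternatives. The helper name expand is made up for this example.

```perl
use strict;
use warnings;

# Tokens as in the answer: plain strings, and array refs holding two alternatives.
my @tokens = ('I have ', ['an absolute', 'a deadly'],
              ' abhorrence of ', ['laziness', 'greasy food'], '.');

# Number of alternative slots in the sentence.
my $n = grep { ref } @tokens;

# expand($mask): bit k of $mask selects the alternative used for the k-th slot
# (0 = first alternative, 1 = second). Hypothetical helper for illustration.
sub expand {
    my ($mask) = @_;
    my $slot = 0;    # counts the alternative slots seen so far
    my $out  = '';
    for my $t (@tokens) {
        $out .= ref($t) ? $t->[ ($mask >> $slot++) & 1 ] : $t;
    }
    return $out;
}

# Each mask is one subset; only one sentence is held in memory at a time.
for my $mask (0 .. 2**$n - 1) {
    print expand($mask), "\n";
}
```

Counting from 0 to 2**$n - 1 visits every subset exactly once, which is the same power set the library produces, just in a different order.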
Here is a program with the sentence given in the question. (The solution above has ascii (<) instead of the 「 character, as some systems still have problems with Unicode. Other than that the program is the same.)
use warnings;
use strict;
use feature 'say';
use List::Util qw(any none);
use Algorithm::Combinatorics qw(subsets);
use utf8;
use open qw(:std :encoding(UTF-8));
my $str = q(I have 「an absolute [a deadly] abhorrence of 「laziness [greasy food].);
say $str;
my @tokens =
    map { /^「/ ? [ /「([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(「[^\[]+ \[[^\]]+\])/x, $str;
my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;
my @subsets = subsets( \@idx );
for my $ss (@subsets) {
    my @take_0 = @$ss;
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}

First, parse into
my @def = (
    [ "I have " ],
    [ "an absolute", "a deadly" ],
    [ " abhorrence of " ],
    [ "laziness", "greasy food" ],
    [ "." ],
);
This can be achieved using the following, a validating parser:
my @def;
for ( $str ) {
    / \G ( [^「]+ ) /xgc
        and push @def, [ $1 ];
    if ( / \G 「 /xgc ) {
        / \G ( [^「\[\]]+ ) [ ] \[ ( [^「\[\]]+ ) \] /xgc
            or die( "Bad sequence at offset ".( pos() - 1 )."\n" );
        push @def, [ $1, $2 ];
        redo;
    }
    /\G \z /xgc
        and last;
    die( "Should not happen" );
}
Then find the product. This can be achieved using the following:
use Algorithm::Loops qw( NestedLoops );
my $iter = NestedLoops( \@def );
while ( my @parts = $iter->() ) {
    say join "", @parts;
}
or
use Algorithm::Loops qw( NestedLoops );
NestedLoops( \@def, sub { say join "", @_; } );
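If installing a CPAN module is not an option, the product step can be sketched in core Perl. This is a minimal illustration rather than a replacement for Algorithm::Loops: it builds the whole result list in memory, and the function name product is made up for this example.

```perl
use strict;
use warnings;

# Same shape as the parsed definition in the answer: one array ref per position.
my @def = (
    [ 'I have ' ],
    [ 'an absolute', 'a deadly' ],
    [ ' abhorrence of ' ],
    [ 'laziness', 'greasy food' ],
    [ '.' ],
);

# product(\@lists): return every way of picking one element per list, joined
# into a single string. Earlier positions vary slowest.
sub product {
    my ($lists) = @_;
    my @results = ('');
    for my $choices (@$lists) {
        # Extend every partial sentence with every choice at this position.
        @results = map { my $head = $_; map { $head . $_ } @$choices } @results;
    }
    return @results;
}

print "$_\n" for product(\@def);
```

The result list grows by a factor of the number of choices at each position, so for long definitions an iterator such as the one NestedLoops returns is the safer choice.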

Related

Perl - longest common prefix of 2 or more strings?

How can I create a Perl subroutine which would take in an array and find the longest common prefix for 2 or more of its elements? (strings)
I have this code:
sub longest_common_prefix {
$prefix = shift;
for (@_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
But it only works if you are looking for the longest common prefix of all strings.
For example, if I pass an array with the following strings:
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
I want it to return aaa as the answer.
Thanks!
I'd use a modified trie.
Normally, one could use the following to add to a trie:
sub add {
my $p = \shift;
my $s = shift;
$p = \( $$p->{$_} ) for split(//, $s);
$$p->{''} = 1;
}
But we need two modifications:
All prefixes of a string must be added when adding a string. For example, adding abc should also add a and ab to the trie.
When adding to the trie, we want to return the length of previously-existing part of the path taken.
So we need:
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
$p = \( $$p->{$_} );
++$cp_len if $$p->{''};
$$p->{''} = 1;
}
return $cp_len;
}
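To see what the returned length looks like in practice, here is a small sketch of the same idea written with plain hash references instead of the reference-to-reference walk. It is an illustrative variant, not the answer's code, and the name add_with_prefixes is made up. Because every node on a string's path is created when the string is added, all of its prefixes are in the trie implicitly.

```perl
use strict;
use warnings;

# add_with_prefixes($trie, $s): walk $s through the trie one character at a
# time, creating missing nodes, and return how many leading characters were
# already present from earlier insertions.
sub add_with_prefixes {
    my ($trie, $s) = @_;
    my $node   = $trie;
    my $cp_len = 0;
    for my $ch (split //, $s) {
        ++$cp_len if exists $node->{$ch};   # this step of the path existed before
        $node = $node->{$ch} //= {};        # descend, creating the node if needed
    }
    return $cp_len;
}

my %trie;
for my $word (qw( abc abd xyz abcde )) {
    printf "%-5s shares %d leading character(s) with earlier words\n",
        $word, add_with_prefixes(\%trie, $word);
}
```

Once a character is missing, every deeper node is new as well, so the count stops growing at the first mismatch.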
Combine (an optimized version of) this with an algorithm to find the longest strings in a list and with an algorithm to remove duplicate strings from a list to get the following solution:
use strict;
use warnings;
use feature qw( say );
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
++$cp_len if exists($$p->{$_});
$p = \( $$p->{$_} );
}
return $cp_len;
}
my $t;
my $lcp_len = 0; # lcp = longest common prefix
my %lcps;
while (<>) {
chomp;
my $cp_len = add($t, $_)
or next;
if ($cp_len >= $lcp_len) {
if ($cp_len > $lcp_len) {
$lcp_len = $cp_len;
%lcps = ();
}
$lcps{ substr($_, 0, $cp_len) } = 1;
}
}
my @lcps = sort keys %lcps;
if (@lcps) {
say "Longest common prefix(es): @lcps";
} else {
say "No common prefix";
}
Data:
abc
abc
abcd
abcde
hijklx
hijkly
mnopqx
mnopqy
Output:
Longest common prefix(es): hijkl mnopq
The time taken by the above is proportional to the number of input characters.
One way would be to store the information in a hash. In this example, I set the hash key to the length of each prefix, and the value being the actual prefix found.
Note that this method overwrites a key and value if a same-length prefix exists, so you'll always get the last prefix found of the longest length (sort() takes care of finding the longest one).
The regex says "find the first character in the string and capture it, and use that char found in a second capture, and capture as many as there are". This string is then join()ed into a scalar and put into the hash.
use warnings;
use strict;
my %prefixes;
while (<DATA>){
my $prefix = join '', /^(.)(\1+)/;
$prefixes{length $prefix} = $prefix;
}
my $longest = (sort {$b <=> $a} keys %prefixes)[0];
print "$prefixes{$longest}\n";
__DATA__
aaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
Output:
aaa
You can keep a hash of an array of words keyed by the first character. By definition, if you have words starting with the same letter, those words share at least a one character common prefix of that one letter. Then reduce to the single longest prefix by stepping through the hash by character:
use strict; use warnings;
sub lcp {
    (join("\0", @_) =~ /^ ([^\0]*) [^\0]* (?:\0 \1 [^\0]*)* $/sx)[0];
}
my %HoA;
my $longest='';
while (my $line=<DATA>){
    $line =~ s/^\s+|\s+$//g ;
    push @{ $HoA{substr $line, 0, 1} }, $line if $line=~/^[a-zA-Z]/;
}
for my $key ( sort (keys %HoA )) {
    if (scalar @{ $HoA{$key} } > 1){
        my $lon=lcp(@{ $HoA{$key} });
        my $s = join ', ', map { qq/"$_"/ } @{ $HoA{$key} };
        print "lcp: \"$lon\" for ($s)\n";
        if (length($lon) > length($longest)) {
            $longest=$lon;
        }
    }
    else{
        print "$key: no common prefix\n";
    }
}
print "\nlongest common prefix is \"$longest\"\n";
__DATA__
aardvark
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
interspecies
interstellar
interstate
Prints:
lcp: "aa" for ("aardvark", "aaaBGFB", "aaaJJJJ", "aaaHGHG")
lcp: "inters" for ("interspecies", "interstellar", "interstate")
j: no common prefix
longest common prefix is "inters"

How to tell if a string can be transformed into another string by removing or uppercasing lowercase letters?

In perl, I have two input strings, for this example, ahueFFggLKy and HFFGLK. I want to be able to iterate through all of the possible combinations of my input without lowercase letter groups (a, h, u, e, g...ah, au...hegy, etc) so in each iteration lowercase letters are removed and the remaining lowercase letters are uppercased:
ah:
ueFFggLKy (UEFFGGLKY)
^^
au:
h eFFggLKy (HEFFGGLKY)
^ ^
hegy:
a u FF gLKy | a u FFg LKy (AUFFGLKY)
^ ^ ^ | ^ ^ ^
auegy:
h FF gLK | h FFg LK (HFFGLK)
^ ^^ ^ ^ ^ ^^ ^ ^ -^--^-
The last option (auegy) is the answer, and I want to be able to iterate over letters to determine if I am able to convert ahueFFggLKy to HFFGLK without modifying any of the capital letters. This example would return "YES".
If inputs like fOoBar and BAR come up, I am not successfully able to convert fOoBar to BAR because the O in fOoBar is capitalized. My program would return "NO".
Can someone provide me with a perl example of how this would be done?
I think I have understood your requirement: the first string may be transformed by either deleting or upper-casing any lower-case letter, and you wish to know whether the second string can be derived from the first in this way
I suggest that you can transform the second string to a regex pattern to achieve this. If every upper-case letter in the second string must be matched by the corresponding upper or lower-case letter in the first, with any number of intervening lower-case letters, then the transformation is possible. Otherwise it is not
This program implements the idea
use strict;
use warnings 'all';
use feature 'say';
my @pairs = (
    [ qw/ ahueFFggLKy HFFGLK / ],
    [ qw/ fOoBar BAR / ],
);
for my $pair ( @pairs ) {
    my ($s1, $s2) = @$pair;
    printf "%s => %s -- %s\n", $s1, $s2, contains($s1, $s2) ? 'YES' : 'NO';
}
sub contains {
    my ($s1, $s2) = @_;
    my $re = join ' \p{Ll}* ', map { "(?i: $_ )" } $s2 =~ /\p{Lu}/g;
    $re = qr/ ^ \p{Ll}* $re \p{Ll}* $ /x;
    $s1 =~ $re;
}
output
ahueFFggLKy => HFFGLK -- YES
fOoBar => BAR -- NO
To read an array like @pairs from STDIN you could write something like this
my @pairs;
{
    local $/;
    my @input = split ' ', <>;
    push @pairs, [ splice @input, 0, 2 ] while @input > 1;
}
A rather inelegant solution, but it seems to output what you need.
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ all };
my ($str1, $str2) = qw( ahueFFggLKy HFFGLK );
my @small_indices;
push @small_indices, pos($str1) - 1 while $str1 =~ /[[:lower:]]/g;
my @present = (0) x @small_indices;
until (all { $_ } @present) {
    my $try = $str1;
    for my $i (reverse 0 .. $#present) {
        substr $try, $small_indices[$i], 1,
               $present[$i] ? substr $str1, $small_indices[$i], 1
                            : q();
    }
    if (uc $try eq $str2) {
        print $present[$_] ? q() : substr $str1, $small_indices[$_], 1
            for 0 .. $#present;
        print ":\n";
        my $j = 0;
        for my $i (0 .. length($str1) - 1) {
            my $char = substr $str1, $i, 1;
            if ($char eq uc $char || $present[$j++]) {
                print $char;
            } else {
                print '.';
            }
        }
        print "\n";
    }
    my $idx = 0;
    $present[$idx] = 0, ++$idx while $present[$idx];
    $present[$idx] = 1;
}
It builds an indicator function @present, which says which lowercase letters are present in the string. All possible values of @present are iterated by adding 1 to the binary number corresponding to the function.
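The "adding 1 to a binary number" step can be looked at in isolation. The sketch below is an illustration only: it stores the number least-significant flag first, as the answer does, and the function name increment is made up for this example.

```perl
use strict;
use warnings;

# increment(\@bits): add 1 to a binary number stored as an array of 0/1 flags,
# least significant flag first. Returns false once the counter has wrapped
# around to all zeros, true otherwise.
sub increment {
    my ($bits) = @_;
    for my $i (0 .. $#$bits) {
        if ($bits->[$i]) { $bits->[$i] = 0 }             # 1 + 1: clear and carry on
        else             { $bits->[$i] = 1; return 1 }   # 0 + 1: set and stop
    }
    return 0;
}

# Enumerate every keep/remove pattern for three lowercase letters.
my @present = (0) x 3;
do { print "@present\n" } while increment(\@present);
```

With three flags this prints eight lines, from 0 0 0 through 1 1 1, one for each subset of letters that could be kept.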

How to print data in column form in Perl?

I have a program that prints the contents of arrays in rows. I would like it to print each array in a column next to each other.
This is the code:
#!/usr/local/bin/perl
use strict;
use warnings;
my @M_array;
my @F_array;
open (my $input, "<", 'ssbn1898.txt');
while ( <$input> ) {
    chomp;
    my ( $name, $id ) = split ( /,/ );
    if ( $id eq "M" ) {
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );
print "M: @M_array \n";
print "F: @F_array \n";
Is this possible or am I trying to do something that can't be done?
Desired format:
M F
Namem1 Namef1
Namem2 Namef2
You can put whatever separator you like between your data by using the join function. The example below formats the data in your arrays separated by tabs:
...
use List::MoreUtils qw/pairwise/;
my $separator = "\t";
print join($separator, qw(M F)), "\n";
print join(
    "\n",
    pairwise { ( $a // '') . $separator . ( $b // '') } @M_array, @F_array
), "\n";
...
I think you should use Perl formats. Have a look at the Perl documentation. You may want to use the @* format field in your case.
I extended your code in order to print the desired output at the end
use strict;
use warnings;
my @M_array;
my @F_array;
open (my $input, "<", 'ssbn1898.txt');
while ( <$input> ) {
    chomp;
    my ( $name, $id ) = split ( /,/ );
    if ( $id eq "M" ) {
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );
unshift @M_array, 'M';
unshift @F_array, 'F';
my $namem;
my $namef;
my $max = 0;
# numeric comparison (>), not string comparison (gt), to find the longest name
$max = (length($_) > $max ? length($_) : $max) for @M_array;
my $w = '@' . '<' x $max;
# the format definition must start in column 0, so it is not indented
eval "
format STDOUT =
$w @*
\$namem, \$namef
.
";
while ( @M_array or @F_array) {
    $namem = shift @M_array || '';
    $namef = shift @F_array || '';
    write;
}
join is probably the simplest approach to take; tabs will align your columns as long as the values are shorter than a tab stop.
join ( "\t", @array ),
Alternatively, perl allows formatting via (s)printf:
printf ( "%-10s %-10s", "first", "second" );
Or a more detailed 'format'
Given what you're trying to do is put your two arrays into columns though:
#!/usr/local/bin/perl
use strict;
use warnings;
my $format = "%-10s\t%-10s\n";
my @M_array = qw ( M1 M2 M3 M4 M5 );
my @F_array = qw ( F1 F2 F3 );
my $maxrows = $#M_array > $#F_array ? $#M_array : $#F_array;
printf ( $format, "M", "F" );
for my $rownum ( 0..$maxrows ) {
printf ( $format, $M_array[$rownum] // '', $F_array[$rownum] // '' );
}
This will print a header row, and then loop through your arrays printing one line at a time. // is the defined-or operator: it returns its left operand if that is defined, and its right operand otherwise. It's only available in newer perls though*. In older versions || will do the trick - it's almost the same, but handles '' and 0 slightly differently.
* Perl 5.10 onward, so it is pretty safe, but worth mentioning because some systems are still rocking around with perl 5.8 on them.
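A quick way to see the difference between the two operators is to feed both the values that trip || up. This is a small sketch; the helper names cell_dor and cell_or are made up for the example, and '-' is an arbitrary placeholder.

```perl
use strict;
use warnings;

# Hypothetical helpers: fill an empty table cell with '-'.
sub cell_dor { my ($v) = @_; return $v // '-' }   # replaces only undef
sub cell_or  { my ($v) = @_; return $v || '-' }   # replaces undef, '' and 0

for my $v (0, '', undef, 'x') {
    my $shown = defined($v) ? "'$v'" : 'undef';
    printf "%-6s  //  gives '%s'   ||  gives '%s'\n",
        $shown, cell_dor($v), cell_or($v);
}
```

For a column of names the two behave the same, but for numeric data a genuine 0 would be replaced by the placeholder under ||, which is why // is the safer default.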
You may format output with the sprintf function, but there are some more problems to solve: What if the arrays don't have the same number of entries? For this, you need a placeholder. How many characters must fit into a column? How should it be aligned? Some code for illustration:
#!/usr/bin/perl
use strict;
use warnings;
my @m = (1, 2, 3);
my @f = (11, 22, 33, 44);
# calculate how many rows to display
my $max = @m;
if (@m < @f) {
    $max = @f;
}
# placeholder for missing data
my $none = '-';
# formatting 20 chars per column, left aligned
my $fmt = "%-20s%-20s\n";
# print header
print sprintf($fmt, "M", "F");
# print data rows
foreach my $i (0..$max-1) {
print sprintf($fmt, ($m[$i] or $none), ($f[$i] or $none));
}
If you are interested in more sophisticated formatting (for instance center-aligned text), you should switch to the special formatting capabilities Perl provides for report generation.
Borrowing from @HunterMcMillen
use strict;
use warnings;
use feature "say";
local $, = "\t"; # separator when printing list
my $i = (@F_array > @M_array) ? $#F_array : $#M_array;
say qw(M F);
# index with the loop variable $_, not with the upper bound $i
say $M_array[$_] // "", $F_array[$_] // "" for 0 .. $i;
I guess Text::Table is the module you need. It is not part of the core Perl distribution, so you need to install it from CPAN first. Go through the documentation below:
Documentation of Text::Table
You need to pass the content of each row as a list to the add() method, and it will do the alignment for you.

Alternate between upper and lowercase, PERL

I want to alternate between upper and lower case, however I only managed to get the whole string upper or lower, or the first character.
I have not found a proper function to execute what I need. Please have a look and help me out. Cheers.
#!/usr/bin/perl
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my @myarray = split("", $mystring);
print ucfirst("@myarray");
A more general approach using function factory
use strict;
use warnings;
sub periodic {
    my @subs = @_;
    my $i = 0;
    return sub {
        $i = 0 if $i > $#subs;
        return $subs[$i++]->(@_);
    };
}
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my $f = periodic(
sub { uc pop },
sub { lc pop },
# sub { .. },
# sub { .. },
);
$mystring =~ s/([a-z])/ $f->($1) /egi;
print $mystring, "\n";
output
ThIs Is My StRiNg I wAnT eAcH cHaRaCtEr To AlTeRnAtE bEtWeEn UpPeR aNd LoWeRcAsE
How about:
use feature 'say';
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my @myarray = split("", $mystring);
my $cnt = 1;
for (@myarray) {
    next unless /[a-z]/i;    # skip anything that is not a letter
    $_ = ($cnt%2 ? uc($_) : lc($_));
    $cnt++;
}
say join('',@myarray);
Output:
ThIs Is My StRiNg I wAnT eAcH cHaRaCtEr To AlTeRnAtE bEtWeEn UpPeR aNd LoWeRcAsE
My first thought was to use a regex substitution. Try this:
use strict;
use warnings;
my $str = "this string, I will change";
# Ignore whitespace and punctuation.
$str =~ s/(\w)(\w)/\L$1\U$2/g;
# Or include all characters in the uc/lc alternation.
# $str =~ s/(.)(.)/\L$1\U$2/g;
print $str, "\n";
If, for some reason, you wish to avoid regexes, try:
my $str = "this string, I will change";
my @ary;
my $count = 0;
for my $glyph ( split //, lc $str ) {
    $glyph = uc $glyph if $count % 2;
    push @ary, $glyph;
    $count++;
}
print join( "", @ary ), "\n";
Try this:
use strict;
use warnings;
use 5.016;
use Data::Dumper;
my $str = 'hello';
my $x = 0;
$str =~ s/(.)/($x++ % 2 == 0) ? "\U$1" : "\L$1"/eg;
say $str;
--output:--
HeLlO
Save script below with name alter.pl
#!/usr/bin/perl
print $ARGV[0] =~ s/([a-z])([^a-z]*)([a-z])/uc($1).$2.lc$3/egri;
And run script by command
$ perl alter.pl "this is my string I want each character to alternate between upper and lowercase"
Output
ThIs Is My StRiNg I wAnT eAcH cHaRaCtEr To AlTeRnAtE bEtWeEn UpPeR aNd LoWeRcAse
You have some good answers already but I thought I'd chip in because I hadn't seen map yet.
print map { $c++ % 2 ? lc : uc } split ( //, $mystring );
splits $mystring into characters (split //);
uses map to apply a function to each letter.
uses $c++ to autoincrement, then take a modulo 2 to decide if this should be uppercase or lower case.
join the resultant array.
Gives:
#!c:\Strawberry\perl\bin
use strict;
use warnings;
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my $c;
print join ( "", map { $c++ % 2 ? lc : uc } split ( //, $mystring ));
Prints:
ThIs iS My sTrInG I WaNt eAcH ChArAcTeR To aLtErNaTe bEtWeEn uPpEr aNd lOwErCaSe
map is a useful function that applies some code to each element in a list, and then 'returns' the list that's produced. So if we treat your string as a list of characters, it works nicely.
Try this. A simple if/else condition is enough for this:
my $mystring = "this is my string I want each character to alternate between upper and lowercase";
my @xz = split( '', $mystring );
for ( my $i = 0; $i < scalar @xz; $i++ ) {
    if ( $i % 2 ) {
        print uc $xz[$i];
    }
    else {
        print $xz[$i];
    }
}

Using Perl to create another Perl file

I have an input file that looks like
*firsttitle
nameA
nameB
nameC
*secondtitle
xnameA
xnameB
xnameC
I want to create a Perl script that takes this file and basically will create another perl script that looks like
#!/usr/bin/perl
use strict;
use warnings;
my %tags = (
"firsttitle" => [ qw (nameA nameB nameC) ],
"secondtitle" => [ qw (xnameA xnameB xnameC) ]);
my $rx = join '|', keys %tags;
while (<>) {
s/^\s*($rx):\s*(\d+)/$1: $tags{$1}[$2]/;
print;
}
My thought process is that I first have to print out the regular Perl code (#!, use, etc.), then add "my %tags = (". Then take the input file and look for the *, which is the lookup key for the hash, and parse everything after it until the next * or end of file. If it's another *, then do it again. If it's EOF, then add ");" and end. And then finish by printing the last bit of Perl code. Help/ideas would be appreciated. If you're going to post code snippets, could you go through and explain what each part is doing? Thanks!
Very simple script. First just parse through the input file. Lines that start with * will be titles, and all the following lines up until the next *-line will be values. We put this into a hash of arrays.
The map statement gives us a list of the hash key (the title) and its values joined together with spaces. We put this in an array for printing. The printing itself is done with printf, which can be a bit difficult to use, since meta characters will mess us up. Any % that is meant to be literal must be written as %%. I also changed single quotes from the original to double quotes. I use single quotes on the printf pattern to avoid accidental interpolation of variables.
An alternative, possibly a better one, is not to use printf at all and simply concatenate the strings in the normal fashion.
use strict;
use warnings;
my ($title, %hash);
while (<DATA>) {
    chomp;
    if (/^\*(.+)$/) {
        $title = $1;
    } else {
        push @{$hash{$title}}, $_;
    }
}
my @args = ( map { $_, join(' ', @{$hash{$_}}) } keys %hash );
printf '#!/usr/bin/perl
use strict;
use warnings;
my %%tags = (
"%s" => [ qw ( %s ) ],
"%s" => [ qw ( %s ) ]);
my $rx = join "|", keys %%tags;
while (<>) {
s/^\s*($rx):\s*(\d+)/$1: $tags{$1}[$2]/;
print;
}', @args;
__DATA__
*firsttitle
nameA
nameB
nameC
*secondtitle
xnameA
xnameB
xnameC
Update:
This will use a different method of printing, which will be more stable.
my @args = ( map { " '$_' => [ qw ( @{$hash{$_}} ) ],\n" } keys %hash );
print '#!/usr/bin/perl
use strict;
use warnings;
my %tags = (
', @args, '
);
my $rx = join "|", keys %tags;
while (<>) {
s/^\s*($rx):\s*(\d+)/$1: $tags{$1}[$2]/;
print;
}';
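One detail to be aware of: keys returns the titles in no particular order, so the generated %tags entries may come out in a different order from run to run. If a stable generated file matters, the entries can be built in sorted order. A minimal sketch, with a made-up helper name tags_block and the hash filled in by hand rather than parsed from the input:

```perl
use strict;
use warnings;

# tags_block(\%hash): render the body of the generated %tags hash, one line per
# title, in sorted key order so repeated runs produce identical output.
sub tags_block {
    my ($hash) = @_;
    my $out = '';
    for my $title (sort keys %$hash) {
        $out .= qq(    "$title" => [ qw( @{ $hash->{$title} } ) ],\n);
    }
    return $out;
}

# Stand-in for the hash of arrays built by the parsing loop.
my %hash = (
    secondtitle => [qw(xnameA xnameB xnameC)],
    firsttitle  => [qw(nameA nameB nameC)],
);
print "my %tags = (\n", tags_block(\%hash), ");\n";
```

Sorting alphabetically is an assumption made here for simplicity; keeping the titles in input-file order would need a separate array recording the order in which they were seen.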