Parsing files that use synonyms

Parsing files that use synonyms - perl

If I had a text file with the following:
Today (is|will be) a (great|good|nice) day.
Is there a simple way I can generate a random output like:
Today is a great day.
Today will be a nice day.
Using Perl or UNIX utils?

Closures are fun:
#!/usr/bin/perl
use strict;
use warnings;
my #gens = map { make_generator($_, qr~\|~) } (
'Today (is|will be) a (great|good|nice) day.',
'The returns this (month|quarter|year) will be (1%|5%|10%).',
'Must escape %% signs here, but not here (%|#).'
);
for ( 1 .. 5 ) {
print $_->(), "\n" for #gens;
}
sub make_generator {
my ($tmpl, $sep) = #_;
my #lists;
while ( $tmpl =~ s{\( ( [^)]+ ) \)}{%s}x ) {
push #lists, [ split $sep, $1 ];
}
return sub {
sprintf $tmpl, map { $_->[rand #$_] } #lists
};
}
Output:
C:\Temp> h
Today will be a great day.
The returns this month will be 1%.
Must escape % signs here, but not here #.
Today will be a great day.
The returns this year will be 5%.
Must escape % signs here, but not here #.
Today will be a good day.
The returns this quarter will be 10%.
Must escape % signs here, but not here %.
Today is a good day.
The returns this month will be 1%.
Must escape % signs here, but not here %.
Today is a great day.
The returns this quarter will be 5%.
Must escape % signs here, but not here #.

Code:
#!/usr/bin/perl
use strict;
use warnings;
my $template = 'Today (is|will be) a (great|good|nice) day.';
for (1..10) {
print pick_one($template), "\n";
}
exit;
sub pick_one {
my ($template) = #_;
$template =~ s{\(([^)]+)\)}{get_random_part($1)}ge;
return $template;
}
sub get_random_part {
my $string = shift;
my #parts = split /\|/, $string;
return $parts[rand #parts];
}
Logic:
Define template of output (my $template = ...)
Enter loop to print random output many times (for ...)
Call pick_one to do the work
Find all "(...)" substrings, and replace them with random part ($template =~ s...)
Print generated string
Getting random part is simple:
receive extracted substring (my $string = shift)
split it using | character (my #parts = ...)
return random part (return $parts[...)
That's basically all. Instead of using function you could put the same logic in s{}{}, but it would be a bit less readable:
$template =~ s{\( ( [^)]+ ) \)}
{ my #parts = split /\|/, $1;
$parts[rand #parts];
}gex;

Sounds like you may be looking for Regexp::Genex. From the module's synopsis:
#!/usr/bin/perl -l
use Regexp::Genex qw(:all);
$regex = shift || "a(b|c)d{2,4}?";
print "Trying: $regex";
print for strings($regex);
# abdd
# abddd
# abdddd
# acdd
# acddd
# acdddd

Use a regex to match each parenthetical (and the text inside it).
Use a string split operation (pipe delimiter) on the text inside of the matched parenthetical to get each of the options.
Pick one randomly.
Return it as the replacement for that capture.

Smells like a recursive algorithm
Edit: misread and thought you wanted all possibilities
#!/usr/bin/python
import re, random
def expand(line, all):
result = re.search('\([^\)]+\)', line)
if result:
variants = result.group(0)[1:-1].split("|")
for v in variants:
expand(line[:result.start()] + v + line[result.end():], all)
else:
all.append(line)
return all
line = "Today (is|will be) a (great|good|nice) day."
all = expand(line, [])
# choose a random possibility at the end:
print random.choice(all)
A similar construct that produces a single random line:
def expand_rnd(line):
result = re.search('\([^\)]+\)', line)
if result:
variants = result.group(0)[1:-1].split("|")
choice = random.choice(variants)
return expand_rnd(
line[:result.start()] + choice + line[result.end():])
else:
return line
Will fail however on nested constructs

Related

Perl module / subroutine to remove shared substring of strings

Given a list/array of strings (in particular, UNIX paths), remove the shared part, eg:
./dir/fileA_header.txt
./dir/fileA_footer.txt
I probably will strip the directory before using the function, but strictly speacking this won't change much.
I'd like to know a method to either remove the shared parts (./dir/fileA_) or remove the not-shared part.
Thank you for your help!

This is a bit of a hack, but if you don't need to support Unicode strings (that is, if all characters have a value below 256), you can use xor to get the length of the longest common prefix of two strings:
my $n = do {
($str1 ^ $str2) =~ /^\0*/;
$+[0]
};
You can apply this operation in a loop to get the common prefix of a list of strings:
use v5.12.0;
use warnings;
sub common_prefix {
my $prefix = shift;
for my $str (#_) {
($prefix ^ $str) =~ /^\0*/;
substr($prefix, $+[0]) = '';
}
return $prefix;
}
my #paths = qw(
./dir/fileA_header.txt
./dir/fileA_footer.txt
);
say common_prefix(#paths);
Output: ./dir/fileA_

Regular expression to print a string from a command outpout

I have written a function that uses regex and prints the required string from a command output.
The script works as expected. But it's does not support a dynamic output. currently, I use regex for "icmp" and "ok" and print the values. Now, type , destination and return code could change. There is a high chance that command doesn't return an output at all. How do I handle such scenarios ?
sub check_summary{
my ($self) = #_;
my $type = 0;
my $return_type = 0;
my $ipsla = $self->{'ssh_obj'}->exec('show ip sla');
foreach my $line( $ipsla) {
if ( $line =~ m/(icmp)/ ) {
$type = $1;
}
if ( $line =~ m/(OK)/ ) {
$return_type = $1;
}
}
INFO ($type,$return_type);
}
command Ouptut :
PSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago

Updated to some clarifications -- we need only the last line
As if often the case, you don't need a regex to parse the output as shown. You have space-separated fields and can just split the line and pick the elements you need.
We are told that the line of interest is the last line of the command output. Then we don't need the loop but can take the last element of the array with lines. It is still unclear how $ipsla contains the output -- as a multi-line string or perhaps as an arrayref. Since it is output of a command I'll treat it as a multi-line string, akin to what qx returns. Then, instead of the foreach loop
my #lines = split '\n', $ipsla; # if $ipsla is a multi-line string
# my #lines = #$ipsla; # if $ipsla is an arrayref
pop #lines while $line[-1] !~ /\S/; # remove possible empty lines at end
my ($type, $return_type) = (split ' ', $lines[-1])[1,4];
Here are some comments on the code. Let me know if more is needed.
We can see in the shown output that the fields up to what we need have no spaces. So we can split the last line on white space, by split ' ', $lines[-1], and take the 2nd and 5th element (indices 1 and 4), by ( ... )[1,4]. These are our two needed values and we assign them.
Just in case the output ends with empty lines we first remove them, by doing pop #lines as long as the last line has no non-space characters, while $lines[-1] !~ /\S/. That is the same as
while ( $lines[-1] !~ /\S/ ) { pop #lines }
Original version, edited for clarifications. It is also a valid way to do what is needed.
I assume that data starts after the line with only dashes. Set a flag once that line is reached, process the line(s) if the flag is set. Given the rest of your code, the loop
my $data_start;
foreach (#lines)
{
if (not $data_start) {
$data_start = 1 if /^\s* -+ \s*$/x; # only dashes and optional spaces
}
else {
my ($type, $return_type) = (split)[1,4];
print "type: $type, return code: $return_type\n";
}
}
This is a sketch until clarifications come. It also assumes that there are more lines than one.

I'm not sure of all possibilities of output from that command so my regular expression may need tweaking.
I assume the goal is to get the values of all columns in variables. I opted to store values in a hash using the column names as the hash keys. I printed the results for debugging / demonstration purposes.
use strict;
use warnings;
sub check_summary {
my ($self) = #_;
my %results = map { ($_,undef) } qw(Code ID Type Destination Stats Return_Code Last_Run); # Put results in hash, use column names for keys, set values to undef.
my $ipsla = $self->{ssh_obj}->exec('show ip sla');
foreach my $line (#$ipsla) {
chomp $line; # Remove newlines from last field
if($line =~ /^([*^~])([0-9]+)\s+([a-z]+)\s+([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\s+([[:alnum:]=]+)\s+([A-Z]+)\s+([^\s].*)$/) {
$results{Code} = $1; # Code prefixing ID
$results{ID} = $2;
$results{Type} = $3;
$results{Destination} = $4;
$results{Stats} = $5;
$results{Return_Code} = $6;
$results{Last_Run} = $7;
}
}
# Testing
use Data::Dumper;
print Dumper(\%results);
}
# Demonstrate
check_summary();
# Commented for testing
#INFO ($type,$return_type);
Worked on the submitted test line.
EDIT:
Regular expressions allow you to specify patterns instead of the exact text you are attempting to match. This is powerful but complicated at times. You need to read the Perl Regular Expression documentation to really learn them.
Perl regular expressions also allow you to capture the matched text. This can be done multiple times in a single pattern which is how we were able to capture all the columns with one expression. The matches go into numbered variables...
$1
$2

Want to extract the first letter of each word

I basically have a variable COUNTRY along with variables SUBJID and TREAT and I want to concatenate it like this ABC002-123 /NZ/ABC.
Suppose if the COUNTRY variable had the value 'New Zealand'. I want to extract the first letter of each word, But I want extract only the first two letters of the value when there is only one word in the COUNTRY variable. I wanted a to know how to simply the below code. If possible in perl programming.
If COUNTW(COUNTRY) GT 1 THEN
CAT_VAR=
UPCASE(SUBJID||"/"||CAT(SUBSTR(SCAN(COUNTRY,1,' '),1,1),
SUBSTR(SCAN(COUNTRY,2,' '),1,1))||"/"||TREAT);

my #COUNTRY = ("New Zealand", "Germany");
# 'NZ', 'GE'
my #two_letters = map {
my #r = /\s/ ? /\b(\w)/g : /(..)/;
uc(join "", #r);
} #COUNTRY;

The SAS Perl Regular Expression solution is to use CALL PRXNEXT along with PRXPOXN or CALL PRXPOSN (or a similar function, if you prefer):
data have;
infile datalines truncover;
input #1 country $20.;
datalines;
New Zealand
Australia
Papua New Guinea
;;;;
run;
data want;
set have;
length country_letter $5.;
prx_1 = prxparse('~(?:\b([a-z])[a-z]*\b)+~io');
length=0;
start=1;
stop = length(country);
position=0;
call prxnext(prx_1,start,stop,country,position,length);
do while (position gt 0);
matchletter = prxposn(prx_1,1,country);
country_letter = cats(country_letter,matchletter);
call prxnext(prx_1,start,stop,country,position,length);
put i= position= start= stop=;
end;
run;

I realize the OP might not be interested in another answer, but for other users browsing this thread and not wanting to use Perl expressions I suggest the following simple solution (for the original COUNTRY variable):
FIRST_LETTERS = compress(propcase(COUNTRY),'','l');
The propcase functions capitalizes the first letters of each word and puts the other ones in lower case. The compress function with 'l' modifier deletes all lower case letters.
COUNTRY may have any number of words.

How about this:
#!/usr/bin/perl
use warnings;
use strict;
my #country = ('New Zealand', 'Germany', 'Tanzania', 'Mozambique', 'Irish Repuublic');
my ($one_word_letters, $two_word_letters, #initials);
foreach (#country){
if ($_ =~ /\s+/){ # Captures CAPs if 'country' contains a space
my ($first_letter, $second_letter) = ($_ =~ /([A-Z])/g);
my ($two_word_letters) = ($first_letter.$second_letter);
push #initials, $two_word_letters; # Add to array for later
}
else { ($one_word_letters) = ($_ =~ /([A-Z][a-z])/); # If 'country' is only one word long, then capture first two letters (CAP+noncap)
push #initials, $one_word_letters; # Add this to the same array
}
}
foreach (#initials){ # Print contents of the capture array:
print "$_\n";
}
Outputs:
NZ
Ge
Ta
Mo
IR
This should do the job provided there really are no 3 word countries. Easily fixed if there are though...

This should do.
#!/usr/bin/perl
$init = &getInitials($ARGV[0]);
if($init)
{
print $init . "\n";
exit 0;
}
else
{
print "invalid name\n";
exit 1;
}
1;
sub getInitials {
$name = shift;
$name =~ m/(^(\S)\S*?\s+(\S)\S*?$)|(^(\S\S)\S*?$)/ig;
if( defined($1) and $1 ne '' ) {
return uc($2.$3);
} elsif( defined($4) and $4 ne '' ) {
return uc($5);
} else {
return 0;
}
}

How can i detect symbols using regular expression in perl?

Please how can i use regular expression to check if word starts or ends with a symbol character, also how to can i process the text within the symbol.
Example:
(text) or te-xt, or tex't. or text?
change it to
(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?
help me out?
Thanks

I assume that "word" means alphanumeric characters from your example? If you have a list of permitted characters which constitute a valid word, then this is enough:
my $string = "x1 .text1; 'text2 \"text3;\"";
$string =~ s/([a-zA-Z0-9]+)/<t>$1<\/t>/g;
# Add more to character class [a-zA-Z0-9] if needed
print "$string\n";
# OUTPUT: <t>x1</t> .<t>text1</t>; '<t>text2</t> "<t>text3</t>;"
UPDATE
Based on your example you seem to want to DELETE dashes and apostrophes, if you want to delete them globally (e.g. whether they are inside the word or not), before the first regex, you do
$string =~ s/['-]//g;

I am using DVK's approach here, but with a slight modification. The difference is that her/his code would also put the tags around all words that don't contain/are next to a symbol, which (according to the example given in the question) is not desired.
#!/usr/bin/perl
use strict;
use warnings;
sub modify {
my $input = shift;
my $text_char = 'a-zA-Z0-9\-\''; # characters that are considered text
# if there is no symbol, don't change anything
if ($input =~ /^[a-zA-Z0-9]+$/) {
return $input;
}
else {
$input =~ s/([$text_char]+)/<t>$1<\/t>/g;
return $input;
}
}
my $initial_string = "(text) or te-xt, or tex't. or text?";
my $expected_string = "(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?";
# version BEFORE edit 1:
#my #aux;
# take the initial string apart and process it one word at a time
#my #string_list = split/\s+/, $initial_string;
#
#foreach my $string (#string_list) {
# $string = modify($string);
# push #aux, $string;
#}
#
# put the string together again
#my $final_string = join(' ', #aux);
# ************ EDIT 1 version ************
my $final_string = join ' ', map { modify($_) } split/\s+/, $initial_string;
if ($final_string eq $expected_string) {
print "it worked\n";
}
This strikes me as a somewhat long-winded way of doing it, but it seemed quicker than drawing up a more sophisticated regex...
EDIT 1: I have incorporated the changes suggested by DVK (using map instead of foreach). Now the syntax highlighting is looking even worse than before; I hope it doesn't obscure anything...

This takes standard input and processes it to and prints on Standard output.
while (<>) {
s {
( [a-zA-z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;
print ;
}
You might need to change the bit to match the concept of word.
I have use the x modifeid to allow the regexx to be spaced over more than one line.
If the input is in a Perl variable, try
$string =~ s{
( [a-zA-z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;

How can I expand a string like "1..15,16" into a list of numbers?

I have a Perl application that takes from command line an input as:
application --fields 1-6,8
I am required to display the fields as requested by the user on command line.
I thought of substituting '-' with '..' so that I can store them in array e.g.
$str = "1..15,16" ;
#arr2 = ( $str ) ;
#arr = ( 1..15,16 ) ;
print "#arr\n" ;
print "#arr2\n" ;
The problem here is that #arr works fine ( as it should ) but in #arr2 the entire string is not expanded as array elements.
I have tried using escape sequences but no luck.
Can it be done this way?

If this is user input, don't use string eval on it if you have any security concerns at all.
Try using Number::Range instead:
use Number::Range;
$str = "1..15,16" ;
#arr2 = Number::Range->new( $str )->range;
print for #arr2;
To avoid dying on an invalid range, do:
eval { #arr2 = Number::Range->new( $str )->range; 1 } or your_error_handling
There's also Set::IntSpan, which uses - instead of ..:
use Set::IntSpan;
$str = "1-15,16";
#arr2 = Set::IntSpan->new( $str )->elements;
but it requires the ranges to be in order and non-overlapping (it was written for use on .newsrc files, if anyone remembers what those are). It also allows infinite ranges (where the string starts -number or ends number-), which the elements method will croak on.

You're thinking of #arr2 = eval($str);
Since you're taking input and evaluating that, you need to be careful.
You should probably #arr2 = eval($str) if ($str =~ m/^[0-9.,]+$/)
P.S. I didn't know about the Number::Range package, but it's awesome. Number::Range ftw.

I had the same problem in dealing with the output of Bit::Vector::to_Enum. I solved it by doing:
$range_string =~ s/\b(\d+)-(\d+)\b/expand_range($1,$2)/eg;
then also in my file:
sub expand_range
{
return join(",",($_[0] .. $_[1]));
}
So "1,3,5-7,9,12-15" turns into "1,3,5,6,7,9,12,13,14,15".
I tried really hard to put that expansion in the 2nd part of the s/// so I wouldn't need that extra function, but I couldn't get it to work. I like this because while Number::Range would work, this way I don't have to pull in another module for something that should be trivial.

#arr2 = ( eval $str ) ;
Works, though of course you have to be very careful with eval().

You could use eval:
$str = "1..15,16" ;
#arr2 = ( eval $str ) ;
#arr = ( 1..15,16 ) ;
print "#arr\n" ;
print "#arr2\n" ;
Although if this is user input, you'll probably want to do some validation on the input string first, to make sure they haven't input anything dodgy.

Use split:
#parts = split(/\,/, $fields);
print $parts[0];
1-6
print $parts[1];
8
You can't just put a string containing ',' in an array, and expect it to turn to elements (except if you use some Perl black magic, but we won't go into that here)
But Regex and split are your friends.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Parsing files that use synonyms - perl

If I had a text file with the following: Today (is|will be) a (great|good|nice) day. Is there a simple way I can generate a random output like: Today is a great day. Today will be a nice day. Using Perl or UNIX utils?

Sounds like you may be looking for Regexp::Genex. From the module's synopsis: #!/usr/bin/perl -l use Regexp::Genex qw(:all); $regex = shift || "a(b|c)d{2,4}?"; print "Trying: $regex"; print for strings($regex); # abdd # abddd # abdddd # acdd # acddd # acdddd

Use a regex to match each parenthetical (and the text inside it). Use a string split operation (pipe delimiter) on the text inside of the matched parenthetical to get each of the options. Pick one randomly. Return it as the replacement for that capture.

Related

Perl module / subroutine to remove shared substring of strings

Regular expression to print a string from a command outpout

Want to extract the first letter of each word

How can i detect symbols using regular expression in perl?

How can I expand a string like "1..15,16" into a list of numbers?

Categories

Resources