Perl - lookahead assertion - perl

I'm trying to write a Perl script for this purpose:
a = ~b & ~c; ==> a = (~b) & (~c);
a = ~b & (~c); ==> a = (~b) & (~c);
So I used lookahead assertions to insert parentheses like this. Here is the test code.
#!/usr/local/bin/perl5 -w
use strict;
use warnings;
my $line;
my @lines;
@lines = (
    "assign a = ~b & ~c;",
    "assign a = (~b) & (~c);",
    "assign a = ( ~b & ~c );",
    "assign a = (b & ~c );"
);
foreach $line (@lines) {
    print " $line\n";
    $line =~ s/(?!\(\s*)~\w+(?!\s*\))/\($&\)/g;
    print ">> $line\n\n";
}
It works for some inputs, but not for all of them:
assign a = ~b & ~c;
>> assign a = (~b) & (~c); <== OK
assign a = (~b) & (~c);
>> assign a = (~b) & (~c); <== OK
assign a = ( ~b & ~c);
>> assign a = ( (~b) & ~c); <== X. I want ( (~b) & (~c));
assign a = ( ~b & ~c );
>> assign a = ( (~b) & ~c ); <== X. I want ( (~b) & (~c) );
Would you let me know how to fix the script? Thank you.

Using lookahead and lookbehind assertions doesn't really gain you anything here. Breaking the task into two steps makes it easier, in my opinion: one step to capture variables prefixed by ~ (together with any parentheses directly around them), and a second step to check whether they are already surrounded by balanced parentheses.
use strict;
use warnings;
while (<DATA>) {
chomp(my $src = <DATA>);
chomp(my $test = <DATA>);
$src =~ s{([(]?~\w+[)]?)}{
my $str = $1;
$str =~ /^\(.*\)$/ ? $str : "($str)";
}eg;
print "test $test\n";
print $src eq $test ? ' ok ' : ' FAIL! ';
print "$src\n";
}
__DATA__
Test:
a = ~b & ~c;
a = (~b) & (~c);
Test:
a = (~b) & (~c);
a = (~b) & (~c);
Test:
a = ( ~b & ~c);
a = ( (~b) & (~c));
Test:
a = ( ~b & ~c );
a = ( (~b) & (~c) );
results:
test a = (~b) & (~c);
ok a = (~b) & (~c);
test a = (~b) & (~c);
ok a = (~b) & (~c);
test a = ( (~b) & (~c));
ok a = ( (~b) & (~c));
test a = ( (~b) & (~c) );
ok a = ( (~b) & (~c) );

You can't easily do what you're asking using a single regular expression.
The problem is that there is no way to count nested parentheses without writing a recursive regex pattern, so at the end of ~c a simple regex cannot know how many parentheses are needed to close the expression.
This is possible with a more complex regex, but it would also be much easier to tokenize the string in a Perl loop.
Do you have to deal with stuff like a & ~b & c | (d | ~e & f)?
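The "tokenize the string in a Perl loop" idea could look roughly like the sketch below. The helper name wrap_by_tokens and the token rules are assumptions made for illustration; it only covers the simple cases from the question and makes no attempt at operator precedence or at expressions like the one asked about above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): wrap every ~name token in
# parentheses unless it already sits directly between "(" and ")".
sub wrap_by_tokens {
    my ($expr) = @_;

    # Split into tokens: ~name, plain words, runs of whitespace, or any
    # other single character. Joining the tokens restores the input.
    my @tok = $expr =~ /~\w+|\w+|\s+|./gs;

    for my $i (0 .. $#tok) {
        next unless $tok[$i] =~ /^~\w+$/;
        my $prev = $i > 0     ? $tok[ $i - 1 ] : '';
        my $next = $i < $#tok ? $tok[ $i + 1 ] : '';
        next if $prev eq '(' && $next eq ')';    # already wrapped
        $tok[$i] = "($tok[$i])";
    }
    return join '', @tok;
}

print wrap_by_tokens('a = ( ~b & ~c);'), "\n";    # a = ( (~b) & (~c));
```

Because each token knows its neighbours, the "already wrapped" test becomes a plain string comparison instead of a lookaround.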

You can do this with one regex. Here it is:
$line =~ s/(?|([^\(])(~\w+)(.)|(.)(~\w+)([^\)]))/$1\($2\)$3/g;
Your regex wasn't doing what you thought.
$line =~ s/(?!\(\s*)~\w+(?!\s*\))/\($&\)/g;
The first part, (?!\(\s*)~, does nothing useful. Remember that lookaheads and lookbehinds are zero-width assertions; I like to think of them as matching the space between the characters. (?!\(\s*)~ means that you want to match a "~" character, but in the space right before the "~" you want to look ahead and make sure you don't see a "(" followed by spaces. Well, if you are in the space right before a "~", the next character is the "~" itself, so you'll never see a "(" there and the assertion always passes. And if you're at a "(", the negative lookahead fails (like you want), but you'd never have matched the "~" at that position anyway.
You are trying to match if the character before is not a "(" AND the character after is not a ")". But what you want is to match if the character before is not a "(" OR the character after is not a ")". So you need two alternatives: one that matches when there is no "(" in front, and one that matches when there is no ")" after.
I used a branch reset group. The (?| tells the engine to number the captured submatches like this:
(?|([^\(])(~\w+)(.)|(.)(~\w+)([^\)]))
   $1     $2    $3  $1 $2    $3
instead of this:
([^\(])(~\w+)(.)|(.)(~\w+)([^\)])
$1     $2    $3  $4 $5    $6
I used (.) so that the ~\w+ part is always $2, then just put "(" and ")" around $2 in the output.
My output:
assign a = ~b & ~c;
assign a = (~b) & (~c);
assign a = (~b) & (~c);
assign a = (~b) & (~c);
assign a = ( ~b & ~c );
assign a = ( (~b) & (~c) );
assign a = (~b & ~c );
assign a = ((~b) & (~c) );
assign a = ( ~b & ~c );
assign a = ( (~b) & (~c) );
assign a = ( ~b & ~c);
assign a = ( (~b) & (~c));
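The same "not preceded by ( OR not followed by )" condition can also be sketched with a real lookbehind and lookahead, so that no neighbouring character has to be captured and put back. This is an alternative illustration, not the regex from the answer; the sub name wrap_negations is made up, and the pattern assumes Perl 5.10 or later for the branch reset group and the possessive quantifier.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap ~name in parentheses unless it is directly
# enclosed as "(~name)".
sub wrap_negations {
    my ($line) = @_;
    # Alternative 1: ~name not preceded by "(".
    # Alternative 2: ~name not followed by ")".
    # \w++ is possessive, so the engine cannot shorten the name just to
    # make the lookahead pass; (?| ... ) keeps the name in $1 either way.
    $line =~ s/(?|(?<!\()(~\w++)|(~\w++)(?!\)))/($1)/g;
    return $line;
}

print wrap_negations($_), "\n" for
    'assign a = ~b & ~c;',
    'assign a = (~b) & (~c);',
    'assign a = ( ~b & ~c);';
```

Like the regex above, this only looks at the characters immediately next to the ~name, so "( ~b )" with inner spaces is still wrapped.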

sprintf pad string on right with dash

I need to pad a string on the right with dashes ('-'). e.g. convert 'M' to 'M-----'.
sprintf "%-6s", "M"; gives me 'M     '. I tried printf "%-6-s", "M"; and printf "%--6s", "M";, but neither of those works.
Can this be done with sprintf and if so, how?
It can't be done with sprintf alone. (sprintf will only pad with spaces or with zeroes.)
sprintf("%-6s", $s) =~ tr/ /-/r
or
substr($s.("-" x 6), 0, 6)
or
$s . ("-" x (6-length($s)))
sprintf only supports padding with 0 and space, so no. You can pad with one of those and then replace the padding, but then you run the risk of also replacing padding characters that were already in the original string. For example, sprintf('%-6s', '  M') =~ s/ /-/gr produces --M---.
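A minimal sketch of that pitfall, next to a version that only appends the missing characters. Both sub names are hypothetical and only serve the comparison.

```perl
use strict;
use warnings;

# Pads with spaces via sprintf, then turns every space into $pad_char,
# including spaces that were already part of $text.
sub pad_via_sprintf {
    my ($text, $pad_char, $width) = @_;
    (my $padded = sprintf '%-*s', $width, $text) =~ s/ /$pad_char/g;
    return $padded;
}

# Appends exactly the missing number of pad characters and leaves the
# original text untouched.
sub pad_via_repeat {
    my ($text, $pad_char, $width) = @_;
    my $missing = $width - length $text;
    return $missing > 0 ? $text . $pad_char x $missing : $text;
}

print pad_via_sprintf('  M', '-', 6), "\n";    # --M---
print pad_via_repeat('  M', '-', 6), "\n";     #   M---
```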
From the FAQ:
If you need to pad with a character other than blank or zero you can
use one of the following methods. They all generate a pad string with
the x operator and combine that with $text. These methods do not
truncate $text.
Left and right padding with any character, creating a new string:
my $padded = $pad_char x ( $pad_len - length( $text ) ) . $text;
my $padded = $text . $pad_char x ( $pad_len - length( $text ) );
Left and right padding with any character, modifying $text directly:
substr( $text, 0, 0 ) = $pad_char x ( $pad_len - length( $text ) );
$text .= $pad_char x ( $pad_len - length( $text ) );
If you do it often, you could wrap it in a subroutine.
use feature 'say';    # needed for the say calls below
sub pad {
    my ($str, $padding, $length) = @_;
    my $pad_length = $length - length $str;
    $pad_length = 0 if $pad_length < 0;
    $padding x= $pad_length;
    return $str . $padding;
}
say pad('M', '-', 6);
say pad('MMMMMM', '-', 6);
say pad('12345', '-', 6);
say pad('1234567', '-', 6);
say pad('    ', '-', 6);
Output:
M-----
MMMMMM
12345-
1234567
    --

perl array from csv file creating newline in unexpected place

Hi, I have a few scripts that convert an xlsx file to a tab-separated file, then remove any commas and duplicates, and then split it by commas. (I do this to make sure users have not put any commas in a column.)
I then do some stuff and convert it back to an xlsx file. This has always worked fine. But instead of opening and closing files all the time, I thought I would push the file to an array and then convert it to an xlsx at the end. Unfortunately, when I try to convert back to an xlsx file, it creates a newline in the space between the names. If I output to a CSV file, then open it and convert it to an xlsx file, it works fine.
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::BasicRead;
use Excel::Writer::XLSX;
local $" = "'\n'";
open( STDERR, ">&STDOUT" );
#covert to csv
my $xlsx_WSD = ( "C:\\Temp\\testing_file.xlsx"),, 1;
my @csvtemp;
if ( -e $xlsx_WSD ) {
my $ss = new Spreadsheet::BasicRead($xlsx_WSD) or die;
my $col = '';
my $row = 0;
while ( my $data = $ss->getNextRow() ) {
$row++;
$col= join( "\t", @$data );
push @csvtemp, $col . "\n" if ( $col ne "" );
}
}
else {
print " C:\\Temp\\testing_file.xlsx file EXISTS ...!!\n";
print " please investigate and use the restore option if required !..\n";
exit;
}
;
my @arraynew;
my %seen;
our $Header_row = shift (@csvtemp);
foreach (@csvtemp){
    chomp;
    $_ =~ s/,//g;
    $_ =~ s/\t/,/g;
    # print $_ . "\n" if !$seen{$_}++ ;
    push @arraynew, $_ . "\n" if !$seen{$_}++ ; #remove any dupes
}
#convert back to xlsx
my $workbook = Excel::Writer::XLSX->new("C:\\Temp\\testing_filet.xlsx");
my $worksheet = $workbook->add_worksheet();
my ( $x, $y ) = ( 0, 0 );
while (<@arraynew>) {
    my @list = split /,/;
    foreach my $c (@list) {
        $worksheet->write( $x, $y++, $c );
    }
    $x++;
    $y = 0;
}
__DATA__
Animal keeper M/F Years START DATE FRH FSM
GIRAFFE JAMES LE M 5 10/12/2007 Y
HIPPO JACKIE LEAN F 6 11/12/2007 Y
ZEBRA JAMES LEHERN M 7 12/12/2007 Y
GIRAFFE AMIE CAHORT M 5 13/12/2012 Y
GIRAFFE MICKY JAMES M 5 14/06/2007 Y
MEERKAT JOHN JONES M 9 15/12/2007 v v
LEOPPARD JIM LEE M 8 16/12/2002
unexpected result
GIRAFFE JAMES
LE M 5 10/12/2007 Y
"
HIPPO" JACKIE
LEAN F 6 11/12/2007 Y
"
ZEBRA" JAMES
LEHERN M 7 12/12/2007 Y
"
GIRAFFE" AMIE
CAHORT M 5 13/12/2012 Y
"
GIRAFFE" MICKY
JAMES M 5 14/06/2007 Y
"
MEERKAT" JOHN
JONES M 9 15/12/2007 v v
"
LEOPPARD" JIM
LEE M 8 16/12/2002
Since you are running this on Windows, have you considered using Win32::OLE instead?
use strict;
use Win32::OLE;
my $app = Win32::OLE->GetActiveObject('Excel.Application')
|| Win32::OLE->new('Excel.Application', 'Quit');
my $wb = $app->Workbooks->Open("C:/Temp/testing_file.xlsx");
my $ws = $wb->ActiveSheet;
my $max_row = $ws->UsedRange->Rows->Count;
my $max_col = $ws->UsedRange->Columns->Count;
my ($row, %already) = (1);
while ($row <= $max_row) {
    my ($col, @output) = (1);
    while ($col <= $max_col) {
        my $val = $ws->Cells($row, $col)->{Text};
        if ($val =~ /[,\t]/) {
            $val =~ tr/,//d;
            $val =~ tr/\t/,/;
            $ws->Cells($row, $col)->{Value} = $val;
        }
        $output[$col - 1] = $val;
        $col++;
    }
    if ($already{join "|", @output}++) {
        $ws->Rows($row)->EntireRow->Delete;
        $max_row--;
    } else {
        $row++;
    }
}
$wb->SaveAs("C:\\temp\\testing_filet.xlsx");
This is an issue with end of line characters.
There are three conventions for marking the end of a line: with \n on Unix, \r\n on Windows and \r on Mac. It looks as though your script is assuming the Mac convention while input and output use the Windows convention.
So after reading the input, a leading \n appears on all lines except the first. As long as this is also the case with the output lines prior to composing them with \r, you end up with an output file with perfectly \r\n-delimited lines. Clearly it's better to make your script wary of what line ending convention the input is using and ensure it uses the same for splitting the lines and composing the output.
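One way to make a script tolerant of all three conventions is to strip whichever line ending is present right after reading, and to add a single, known ending when writing. The helper below is a hypothetical sketch of the stripping half; it is not taken from the question's script.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending, whether it is
# \r\n (Windows), \n (Unix) or \r (old Mac).
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\r\n|\n|\r)\z//;
    return $line;
}

my @rows = map { strip_eol($_) } ("GIRAFFE\r\n", "HIPPO\n", "ZEBRA\r");
print join('|', @rows), "\n";    # GIRAFFE|HIPPO|ZEBRA
```

Unlike chomp, which removes the value of $/ (a single "\n" by default), this does not depend on the platform the script runs on.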

Perl cannot open file; from Ploteig script

I am a total novice when it comes to computer programming and perl, so please forgive me if this question is simple!
I am trying to run a Perl script (called ploteig, a component of Eigenstrat, a free genetics program) that works fine until it gets to line 96:
open (YY, ">$dfile") || die "can't open $dfile\n" ;
I am given the error that the file is unable to be opened and the script dies.
Below, I have provided the entire code for you (since honestly, I have no idea what part of the code could be influencing the inability to open the file). The code uses input from a file created previously with Eigenstrat, example of 4 rows, 12 columns:
#eigvals: 20.388 7.503 4.033 2.929 2.822 2.726 2.700 2.590 2.451 2.365
GREY_BI_011_COMSTOCK_11 0.0164 0.0164 0.0382 -0.1283 -0.0658 0.0406 0.0322 0.0105 -0.0851 -0.0625 Case
GREY_BI_014_COMSTOCK_14 0.0191 0.0094 0.0567 -0.0250 0.0804 -0.0531 -0.0165 0.0321 0.1130 -0.0025 Control
GREY_BI_015_COMSTOCK_15 0.0221 -0.0042 -0.0031 0.0091 0.1448 0.0351 0.0430 0.0359 0.0049 0.0791 Control
(rows represent individual sample pca scores, columns specific pcas. First column sample names, last column case or control status)
Additionally, I call the code as follows:
perl ploteig -i combogreyout.pca.evec -p Case:Control -s Out -c 1:2 -x -o utploteig.xtxt -k
I am really unsure where to go from here. I tried changing the file permissions and ensuring it was in the working directory, but it wouldn’t allow me to change permissions and everything pointed to being in the correct directory. However, I am unsure if either of these are the real problem.
I would very much appreciate any help anyone can give me!
Thank you SO much!
#!/usr/bin/perl -w
### ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k] [-y] [-z sep] [-f fixgreen]
use Getopt::Std ;
use File::Basename ;
## pops : separated -x = make postscript and pdf -z use another separator
## -k keep intermediate files
## NEW if pops is a file names are read one per line
getopts('i:o:p:c:s:d:z:t:xkyf',\%opts) ;
$postscmode = $opts{"x"} ;
$oldkeystyle = $opts{"y"} ;
$kflag = $opts{"k"} ;
$keepflag = 1 if ($kflag) ;
$keepflag = 1 unless ($postscmode) ;
$dofixgreen = ( exists $opts{"f"} ? $opts{"f"} : 0 );
$zsep = ":" ;
if (defined $opts{"z"}) {
$zsep = $opts{"z"} ;
$zsep = "\+" if ($zsep eq "+") ;
}
$title = "" ;
if (defined $opts{"t"}) {
$title = $opts{"t"} ;
}
if (defined $opts{"i"}) {
$infile = $opts{"i"} ;
}
else {
usage() ;
exit 0 ;
}
open (FF, $infile) || die "can't open $infile\n" ;
@L = (<FF>) ;
chomp @L ;
$nf = 0 ;
foreach $line (@L) {
next if ($line =~ /\#/) ;
@Z = split " ", $line ;
$x = @Z ;
$nf = $x if ($nf < $x) ;
}
printf "## number of fields: %d\n", $nf ;
$popcol = $nf-1 ;
if (defined $opts{"p"}) {
$pops = $opts{"p"} ;
}
else {
die "p parameter compulsory\n" ;
}
$popsname = setpops ($pops) ;
print "$popsname\n" ;
$c1 = 1; $c2 =2 ;
if (defined $opts{"c"}) {
$cols = $opts{"c"} ;
($c1, $c2) = split ":", $cols ;
die "bad c param: $cols\n" unless (defined $cols) ;
}
$stem = "$infile.$c1:$c2" ;
if (defined $opts{"s"}) {
$stem = $opts{"s"} ;
}
$gnfile = "$stem.$popsname.xtxt" ;
if (defined $opts{"o"}) {
$gnfile = $opts{"o"} ;
}
@T = () ; ## trash
open (GG, ">$gnfile") || die "can't open $gnfile\n" ;
print GG "## " unless ($postscmode) ;
print GG "set terminal postscript color\n" ;
print GG "set title \"$title\" \n" ;
print GG "set key outside\n" unless ($oldkeystyle) ;
print GG "set xlabel \"eigenvector $c1\" \n" ;
print GG "set ylabel \"eigenvector $c2\" \n" ;
print GG "plot " ;
$np = @P ;
$lastpop = $P[$np-1] ;
$d1 = $c1+1 ;
$d2 = $c2+1 ;
foreach $pop (@P) {
$dfile = "$stem:$pop" ;
push @T, $dfile ;
print GG " \"$dfile\" using $d1:$d2 title \"$pop\" " ;
print GG ", \\\n" unless ($pop eq $lastpop) ;
chomp $dfile;
open (YY, ">$dfile") || die "can't open $dfile\n" ;
foreach $line (@L) {
next if ($line =~ /\#/) ;
@Z = split " ", $line ;
next unless (defined $Z[$popcol]) ;
next unless ($Z[$popcol] eq $pop) ;
print YY "$line\n" ;
}
close YY ;
}
print GG "\n" ;
print GG "## " if ($postscmode) ;
print GG "pause 9999\n" ;
close GG ;
if ($postscmode) {
$psfile = "$stem.ps" ;
if ($gnfile =~ /xtxt/) {
$psfile = $gnfile ;
$psfile =~ s/xtxt/ps/ ;
}
system "gnuplot < $gnfile > $psfile" ;
if ( $dofixgreen ) {
system "fixgreen $psfile" ;
}
system "ps2pdf $psfile " ;
}
unlink (@T) unless $keepflag ;
sub usage {
print "ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k]\n" ;
print "-i eigfile input file first col indiv-id last col population\n" ;
print "## as output by smartpca in outputvecs \n" ;
print "-c a:b a, b columns to plot. 1:2 would be common and leading 2 eigenvectors\n" ;
print "-p pops Populations to plot. : delimited. eg -p Bantu:San:French\n" ;
print "## pops can also be a filename. List populations 1 per line\n" ;
print "[-s stem] stem will start various output files\n" ;
print "[-o ofile] ofile will be gnuplot control file. Should have xtxt suffix\n";
print "[-x] make ps and pdf files\n" ;
print "[-k] keep various intermediate files although -x set\n" ;
print "## necessary if .xtxt file is to be hand edited\n" ;
print "[-y] put key at top right inside box (old mode)\n" ;
print "[-t] title (legend)\n" ;
print "[-f] fix green and yellow colors\n";
print "The xtxt file is a gnuplot file and can be easily hand edited. Intermediate files
needed if you want to make your own plot\n" ;
}
sub setpops {
my ($pops) = @_ ;
local (@a, $d, $b, $e) ;
if (-e $pops) {
open (FF1, $pops) || die "can't open $pops\n" ;
@P = () ;
foreach $line (<FF1>) {
($a) = split " ", $line ;
next unless (defined $a) ;
next if ($a =~ /\#/) ;
push @P, $a ;
}
$out = join ":", @P ;
print "## pops: $out\n" ;
($b, $d , $e) = fileparse($pops) ;
return $b ;
}
@P = split $zsep, $pops ;
return $pops ;
}
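The die message on the failing open line only reports the file name, not the reason for the failure. As a diagnostic sketch (the helper open_for_write is hypothetical and not part of ploteig), Perl's $! variable holds the operating system's error text and can be added to the message:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of ploteig: open a file for writing and
# report the operating system's reason if that fails.
sub open_for_write {
    my ($path) = @_;
    open(my $fh, '>', $path)
        or die "can't open $path for writing: $!\n";
    return $fh;
}

# Example with a directory that is assumed not to exist on this machine.
my $fh = eval { open_for_write('no_such_dir_for_demo/out.txt') };
print "open failed: $@" unless $fh;
```

Applied to the script above, the equivalent change would be to append $! to the existing "can't open $dfile" message, which should show whether the problem is the file name, the directory, or permissions.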

Concatenating strings with spaces in Perl. Any built-ins?

In Perl I can concatenate multiple strings with spaces between them as follows:
my $long_string = $one_string . " " . $another_string . " " . $yet_another_string . " " .
$and_another_string . " " . $the_lastr_string;
However, typing this is a bit cumbersome.
Is there a built-in that can make this task easier?
e.g. something like:
concatenate_with_spaces($one_string, $another_string, $yet_another_string, ...)
You want join:
my $x = 'X';
my @vars = ( 1, 'then', 'some' );
my $long_string = join ' ', $x, 2, @vars; # "X 2 1 then some"
Zaid has given the idiomatic solution, using join. However, there are more ways to do it.
my @vars = ($one, $two, $three);
my $str1 = "@vars"; # Using array interpolation
my $str2 = "$one $two $three"; # interpolating scalars directly
Interpolating an array uses the predefined variable $" (list separator), which is by default set to space. When interpolating variables, you do not need to use . to concatenate spaces to your strings, they can be used directly in a double quoted string.
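Because interpolation uses $", changing it locally changes the separator. A small sketch of that, with a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: join items by interpolating an array with a
# temporarily changed list separator ($").
sub join_by_interpolation {
    my ($sep, @items) = @_;
    local $" = $sep;    # restored automatically when the sub returns
    return "@items";
}

print join_by_interpolation(' ',  'one', 'two', 'three'), "\n";    # one two three
print join_by_interpolation(', ', 'one', 'two', 'three'), "\n";    # one, two, three
```

For anything other than the default space, join is usually the clearer choice; this only illustrates what the interpolation does.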
my @list_of_strings = ($one_string, $two_strings );
my $string = join(' ', @list_of_strings );
print $string;

How can I extract digits at the end of a string using Perl?

I have a string of the form:
"jflsdlf f fas253k46l ;sf635jsf sd;lfwio sfkljflsk-=fsd f 24461 425 "
Toward the end it contains eight digits. There may be spaces between the digits, but there are always eight digits in the end. How do we obtain each of these digits separately using Perl?
Get input:
my $input = "jflsdlf f fas253k46l ;sf635jsf sd;lfwio sfkljflsk-=fsd f 24461 425 ";
now, extract all digits:
my @all_digits = $input =~ /(\d)/g;
Now, get the last 8 from it:
my @last_8_digits = @all_digits[-8..-1];
Get rid of the non-digits and then take a substring from the back:
$string="jflsdlf f fas253k46l ;sf635jsf sd;lfwio sfkljflsk-=fsd f 24461 425 ";
$string =~ s/[^[:digit:]]//g;
print substr ( $string ,-8);
The easiest thing to do conceptually is to apply a normalization step to the string before you extract the digits. In the example you've shown, that might be as easy as just removing all of the whitespace first. In case you need the string later, I'll do that to a copy. Once you have the normalized copy, just grab the eight digits at the end:
my $string = "jflsdlf f fas253k46l ;sf635jsf sd;lfwio sfkljflsk-=fsd f 24461 425 ";
my $copy = $string;
$copy =~ s/\s+//g; # "normalize" string
my @digits;
if( $copy =~ m/(\d{8})\z/ )
{
    @digits = split //, $1;
}
print "digits are @digits\n";
/(\d\s*){8}$/
should do it. Don't forget to strip out the whitespace in each of the captures.
Here's a solution that should work with any input
my $input = "dlf f fas253k46l ;sf635jsf sd;lfwio sfkljflsk-=fsd f 24461 425 ";
if ($input =~ /((?:\d\s*){8})$/) { # grab last 8 digits and any space
my @nums = split /\s+|/ => $1; # throw away space and separate each digit
print "@nums\n"; # 2 4 4 6 1 4 2 5
}
You can use the following code
my $string="jflsdlf f fas253k46l ;sf635jsf sd;lfwio sfkljflsk-=fsd f 24461 425 ";
my @array=split(/ / , $string);
print "$array[$#array-1]";
print "$array[$#array]\n";
if (m/(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*$/) {
($d1, $d2, $d3, $d4, $d5, $d6, $d7, $d8) = ($1, $2, $3, $4, $5, $6, $7, $8);
}