Compound print statement overwrites part of variable - perl

I have some very bizarre behavior in a script that I wrote and have used for years but, for some reason, fails to run on one particular file.
Recognizing that the script is failing to identify a key that should be in a hash, I added some test print statements to read the keys. My normal strategy involves placing asterisks before and after the variable to detect potential hidden characters. Clearly, the keys are corrupt. Relevant code block:
foreach my $fastaRecord (@GenomeList) {
my ($ID, $Seq) = split(/\n/, $fastaRecord, 2);
# uncomment next line to strip everything off sequence
# header except trailing numeric identifiers
# $ID =~ s/.+?(\d+$)/$1/;
$Seq =~ s/[^A-Za-z-]//g; # remove any kind of new line characters
$RefSeqLen = length($Seq);
$GenomeLenHash{$ID} = $RefSeqLen;
print "$ID\n";
print "*$ID**\n";
}
This produces the following output:
supercont3
**upercont3
Mitochondrion
**itochondrion
Chr1
**hr1
Chr2
**hr2
Chr3
**hr3
Chr4
**hr4
Normally, I'd suspect "illegal" newline characters. However, I manually replaced all newlines in the input file to try to solve the problem. What in the input file could be causing the script to behave this way? I could imagine that, despite my efforts, there is still an illegal newline after the ID variable. But then why are the first asterisk and the newline after the double asterisk apparently not printed, and why is the double asterisk printed at the beginning of the line, overwriting the first asterisk as well as the first character of the variable's value?

When you see these sorts of effects, look at the data in a file or in a hexdump. The terminal is going to hide data if it interprets backspaces, carriage returns, and ANSI escape sequences.
% perl script.pl | hexdump -C
Here's a simple example. I echo a, b, carriage return, then c. My terminal sees the carriage return and moves the cursor to the beginning of the line. After that, the output continues. The c masks the a:
% echo $'ab\rc'
cb
With a hex dump, I can see the 0d that represents the carriage return:
% echo $'ab\rc' | hexdump -C
00000000 61 62 0d 63 0a |ab.c.|
00000005
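If hexdump is not at hand, the same check can be done inside the script. This is only a sketch: show_hidden is a made-up helper name, and it simply rewrites every byte outside the printable ASCII range as a \x escape so that the terminal cannot interpret it.

```perl
use strict;
use warnings;

# Hypothetical helper: make control characters visible instead of
# letting the terminal act on them.
sub show_hidden {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    return $s;
}

print show_hidden("ab\rc\n"), "\n";   # prints: ab\x0dc\x0a
```

Printing show_hidden($ID) in place of $ID would reveal a trailing \x0d if one is there.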
Also, when you try to remove "any sort of newline" from $Seq, you might just remove vertical whitespace:
$target =~ s/\v//g;
You might also use the generalized newline, \R:
$target =~ s/\R//g;
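Applied to the loop in the question, a likely explanation is that the file has CRLF line endings, so split(/\n/, ...) leaves a \r at the end of $ID. The question does not confirm this; it is an assumption based on the output shown. A minimal sketch with a made-up record (the sequence data is hypothetical, and \R needs Perl 5.10 or later):

```perl
use strict;
use warnings;

# Hypothetical FASTA record with Windows-style (CRLF) line endings.
my $fastaRecord = "supercont3\r\nACGT\r\nTTGA\r\n";

# Splitting on \n alone leaves the carriage return attached to the ID.
my ($raw_id) = split /\n/, $fastaRecord, 2;

# Splitting on \R treats \r\n as a single line ending, so the ID is clean.
my ($ID, $Seq) = split /\R/, $fastaRecord, 2;
$Seq =~ s/[^A-Za-z-]//g;

printf "raw ID length:   %d\n", length($raw_id);   # 11, includes the \r
printf "clean ID length: %d\n", length($ID);       # 10
print "*$ID**\n";                                  # *supercont3**
```

With the \r still attached, the terminal prints *supercont3, returns to the start of the line, and then writes ** over the first two columns, which matches the **upercont3 in the question.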

Related

what does this $tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge; (perl code) mean?

What does this mean?
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
This comes from a CVE study blog whose code is written in Perl. I know this is a regular expression and that the content in the second {} should replace what the first {} matches, but I do NOT get what '\\'.($2 || $1) means.
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
It is a substitution operator s/// applied to the string $tok, with the modifiers sge. The delimiters of the operator have been changed from / to {}. Let's break that regex down:
s{
\\(.) # (1) match a backslash followed by 1 character, capture
| # (2) or
( # (3) start capture parens
[\$\#] # (4) either a literal $ or #
| # (5) or
\\$ # (6) backslash at the end of line (including newline)
) # end capture parens
}{ # replace with
'\\'.($2 || $1)} # (7) backslash concatenated with either capture 2 or 1
sge; # (8) s = . matches newline, g = match multiple times, e = eval
Judging (at a glance) from the rest of that blog code, this code is not written by someone skilled at Perl. So I will take their comments at face value:
# must protect unescaped "$" and "#" symbols, and "\" at end of string
The eval (8) is apparently to concatenate a backslash with either capture group 2 (2) or 1 (1), depending on which is "true". Or rather, which one matched the string.
Looking closer at the code, (1) and (6) are very similar. The latter one will trigger only at the end of a line that does not have a newline, whereas the first one will handle all other cases, including end of line with a newline (because of /s modifier).
(1) will match any escaped character, such as \1, \$ or \\: anything with a backslash followed by a character. If we look at the replacement part (7), we see that this capture group is the fallback, which will only trigger if the second capture group fails. The second capture group also only matches if the first fails. Confusing? Maybe a little.
(2) triggers if the matching character is not a backslash followed by a character. Now we are looking for a literal $ or #. Or failing that, a backslash at the end of line. But wait a minute, we already checked for backslash? Yes, but this is an edge case.
In the case of (1) matching, $2 will be undefined, and $1, the first capture group, a single character, will be put back into the text. The backslash that was before it will be removed in (1), and then put back in (7). This will not really do anything, just make the regex not destroy already escaped characters.
In the case of (2) matching, it will either be an end of line backslash that is consumed (6) and put back (7), or it will be a $ or # which is consumed (4) and put back (7), with a backslash in front.
So basically what the OP says in the comment is happening.
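To see the effect, here is a small sketch that runs the substitution on a made-up token. The sample string is an assumption chosen to hit each case: an unescaped $, an already escaped \$, an unescaped #, and a backslash at the very end.

```perl
use strict;
use warnings;

# Hypothetical input. In single quotes only \\ and \' are escapes,
# so this is:  a$b \$c #d\
my $tok = 'a$b \$c #d\\';

$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;

print "$tok\n";   # prints: a\$b \$c \#d\\
```

The already escaped \$ passes through unchanged, the bare $ and # each gain a backslash, and the trailing backslash is doubled.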

Perl - unknown end of line character

I want to read an input file line by line, but this file has unknown ending character.
The vim editor does not recognize it either: it displays this character as ^A and continues immediately with the characters of the next line. The same goes for Perl. It tries to load all the lines at once, because it ignores this strange end-of-line character.
How can I set this character as end of line for perl? I don't want to use any special module for it (because of our strict system), I just want to define the character (maybe in hex code) of end of line.
Another option is to convert the file to a new one with a proper end-of-line character (replacing the odd ones). Can I do that in some easy way (something like sed on the input file)? But everything needs to be done in Perl.
Is it possible?
Now, my reading part looks like:
open (IN, $in_file);
$event=<IN>; # read one line
The ^A character you mention is the "start of heading" character. You can set the special Perl variable $/ to this character. Although, if you want your code to be readable and editable by the guy who comes after you (and uses another editor), I would do something like this:
use English;
local $INPUT_RECORD_SEPARATOR = "\cA"; # 'start of heading' character
while (<>)
{
chomp; # remove the unwanted 'start of heading' character
print $_ . "\n";
}
From Perldoc:
$INPUT_RECORD_SEPARATOR
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is.
More on special character escaping on PerlMonks.
Oh and if you want, you can enter the "start of heading" character in VI, in insert mode, by pressing CTRL+V, then CTRL+A.
edit: added local per Drt's suggestion
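For a quick test without the real file, the same idea can be tried on an in-memory filehandle. This is a sketch only: the record contents are invented, and it sets $/ directly to "\x01", which is the same character as "\cA".

```perl
use strict;
use warnings;

# Hypothetical data: three records separated by \x01 instead of newlines.
my $data = "first\x01second\x01third\x01";

open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my @events;
{
    local $/ = "\x01";              # record separator for this block only
    while (my $event = <$in>) {
        chomp $event;               # chomp removes whatever $/ holds
        push @events, $event;
    }
}
close $in;

print "$_\n" for @events;
```

Keeping the local inside a bare block means the separator reverts to the default newline as soon as the block ends.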

Decipher this sed one-liner

I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.
I'll note that this script does not appear to work with my copy of sed (GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C it works fine.
Here's an annotated version of the script. sed basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed:
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d).
You can get it to work independently of locale by changing [ -~] to [[:print:]].
The code doesn't work for me, perhaps due to some locale setting, but this does (the only change is [ -~] replaced by [^\n]):
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while (my $pattern = <>) {
chomp $pattern;
$pattern = "$pattern\n$hold"; # G
$pattern =~ s/(\n)/$1$1/; # s/\n/&&/
if ($pattern =~ /^([^\n]*\n).*\n\1/s) { # /…/ (the /s lets . match newlines, as it does in sed)
next; # d
}
$pattern =~ s/\n//; # s/\n//
$hold = $pattern; # h
$pattern =~ /^([^\n]*\n?)/; print $1; # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceded it.
s/\n/&&/: Turn the newline which separates them into a double newline, so that we can match consecutive and non-consecutive duplicates in the same way; see the next step.
/^\([^\n]*\n\).*\n\1/: Look through the current text for the following: at the beginning of all the lines (^) look for a first line including its trailing newline (\([^\n]*\n\)), then anything (.*), then a newline (\n), and then that same first line including its newline repeated again (\1). If two consecutive lines are the same, then the .* in the regular expression will match the empty string, but the two \n will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.
d: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.
s/\n//: Otherwise, we continue and next turn the double newline back into a single newline.
h: We include the current line in our list of all unique lines.
P: And finally print this new unique line, up to the newline character.
For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short, _[$0] is a counter of how many times each distinct line has appeared. The first time a line ($0) is seen the counter is 0, so !_[$0] is true and the line is printed. On every later appearance _[$0] >= 1, so !_[$0] is false and the line is skipped.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)
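The same counting idea carries over to Perl with a hash. This is a sketch rather than a translation of the sed script: unique_in_order and the sample history are made-up names and data.

```perl
use strict;
use warnings;

# Return the given lines with later duplicates dropped, order preserved.
sub unique_in_order {
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a line shows up, so the
    # negation keeps only first occurrences.
    return grep { !$seen{$_}++ } @_;
}

my @history = ("ls", "cd /tmp", "ls", "make", "cd /tmp", "ls");
my @unique  = unique_in_order(@history);

print "$_\n" for @unique;
```

Unlike the sed version, which rescans every line kept so far for each new input line, the hash lookup does not depend on how many unique lines have been seen.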

Read a string of alphanumeric characters after a ;

I'm teaching myself Perl, so I'm pretty new to this language. I've been reading over and over about regular expressions, but I can't figure out how to apply them here. I want to do the following:
Let's say I have a file named "testfile".
this files contains 3 lines,
test this is the first line
test: this is the first line
test; this is the third line
How can I read and print only the third line, and of that only what comes after the ; without the leading space? So basically "this is the third line".
This is what I'm thinking to do $string =~ m/this is the third/
This was edited incorrectly. In the first and second lines there should be a space before "test"; in the third one there shouldn't be. So I want to skip the white space.
Grabbing from STDIN, it might look like this:
while ( <> ) {
print $1 if /^test; (.*\n)/;
}
You might find that YAPE::Regex::Explain is a handy tool:
Using Axeman's regular expression:
#!/usr/bin/env perl
use strict;
use warnings;
use YAPE::Regex::Explain;
my $expr = q(/^test; (.*\n)/);
print YAPE::Regex::Explain->new( $expr )->explain;
The regular expression:
(?-imsx:/^test; (.*\n)/)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  test;                    'test; '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    \n                       '\n' (newline)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
If you only want the third line, simply count the lines and, on the third one, do:
s/.*;\s*//;
will remove everything until the ; and any white space after it. Note, however, that if the third line contains another ';' in it then you'll be in trouble. So if that's a possibility but there is no chance that one will exist earlier, then do this:
s/[^;]*;\s*//;
Which will delete only up until the first ';' (and trailing whitespace).
I suspect, however, in the long run you want to match all lines that contain some particular format and it won't always be "just the third". If that's the case then:
while(<>) {
if (/;\s*(.*)/) {
print "$1\n"; # .* stops before the newline, so add one back
}
}
Will get you closer to your end-goal.
Another way to achieve this is to try to remove everything up to the first ; (and any spaces straight afterwards), and only print the line if there was something to remove.
s/.*?;\s*//;
This line basically says: "match any characters (but as few as possible), then a semicolon, then any spaces, and replace it all with nothing".
You can then make a program that reads from STDIN:
while (<>) {
print if s/.*?;\s*//;
}
You can turn these into a nice one-liner on the command line too:
perl -ne 'print if s/.*?;\s*//;'
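Since the question's clarification says the first two lines begin with a space and the third does not, anchoring the match at the start of the line picks out only the third. A sketch with the sample lines held in an array instead of being read from the file (the exact line contents are an assumption based on the question):

```perl
use strict;
use warnings;

# Hypothetical contents of "testfile".
my @lines = (
    " test this is the first line\n",
    " test: this is the second line\n",
    "test; this is the third line\n",
);

my @found;
for my $line (@lines) {
    # ^ skips lines with leading whitespace; \s* eats the space after
    # the semicolon; (.*) captures the rest, without the newline.
    push @found, $1 if $line =~ /^test;\s*(.*)/;
}

print "$_\n" for @found;   # prints: this is the third line
```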

Perl - remove carriage return and append next line

What if I have a record in an otherwise good file that has a carriage return in it?
Ex:
1,2,3,4,5 ^M
,6,7,8,9,10
and I wanted to make it
1,2,3,4,5,6,7,8,9,10
In general, if you have a string with a stray newline at the end that you want to get rid of, you can use chomp on it (note that you can pass it an lvalue, so wrapping it around an assignment is legal):
my $string = my $string2 = "blah\n";
chomp $string;
# this works too:
chomp(my $string3 = $string2);
Note that if the string has a trailing "\r\n", chomp won't take the \r as well, unless you modify $/.
So if all of that is too complicated, and you need to remove all occurrences of \n, \r\n and \r (maybe you're processing lines from a variety of architectures all at once?), you can fall back to good old tr:
$string =~ tr/\r\n//d;
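The three approaches can be compared side by side. This is a sketch on an invented string; the s/\R\z// variant is an addition not mentioned above and needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $line = "blah\r\n";   # hypothetical line with a CRLF ending

chomp(my $chomped = $line);             # default $/ is "\n": the \r survives
(my $stripped = $line) =~ s/\R\z//;     # \R matches "\r\n" as one unit
(my $squeezed = $line) =~ tr/\r\n//d;   # deletes every \r and \n anywhere

printf "%d %d %d\n", length($chomped), length($stripped), length($squeezed);
# prints: 5 4 4
```

Note that tr/\r\n//d also removes line endings in the middle of a string, while the other two only touch the end.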
Say we have a file that contains a ctrl-M (aka \r on some platforms):
$ cat input
1,2,3
4,5,6
,7,8,9
10,11,12
This is explicit with od:
$ od -c input
0000000 1 , 2 , 3 \n 4 , 5 , 6 \r \n , 7 ,
0000020 8 , 9 \n 1 0 , 1 1 , 1 2 \n
0000035
Remove each offending character and join its line with the next by running
$ perl -pe 's/\cM\cJ?//g' input
1,2,3
4,5,6,7,8,9
10,11,12
or redirect to a new file with
$ perl -pe 's/\cM\cJ?//g' input >updated-input
or overwrite it in place (plus a backup in input.bak) with
$ perl -i.bak -pe 's/\cM\cJ?//g' input
Making the \cJ optional handles the case when a file ends with ctrl-M but not ctrl-J.
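The same substitution can be checked on a string inside a script. In this sketch $input simply mirrors the bytes shown by od above:

```perl
use strict;
use warnings;

# Same bytes as the example file: one stray \r before the second \n.
my $input = "1,2,3\n4,5,6\r\n,7,8,9\n10,11,12\n";

# Remove each ctrl-M together with the ctrl-J after it, if there is one.
(my $joined = $input) =~ s/\cM\cJ?//g;

print $joined;
```

Lines that end in a plain \n are left alone, so only the broken record is merged with its continuation.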
s/[\r\n]//g
Only do this if you want to combine a line with the next.
Assuming the carriage return is right before the line feed:
perl -pi.bak -e 's/\r\n//' your_file_name
This will join only lines with a carriage return at the end of the line to the next line.
Every line is ended with some terminator sequence, either
CRLF (\r\n = 13, 10) on Windows/DOS
LF (\n = 10) on Unix
CR (\r = 13) on classic Mac OS
If some lines are OK, you should say which system the file comes from and which system the Perl script runs on; otherwise the risk is that you remove every end of line and merge all the lines of your file.
As ^M is the CR character, if you see such a character at the end of one line and nothing special on the other lines, you are probably using some kind of Unix (Linux?) and some copy/paste has polluted one line with an additional \r at the end of the line.
If this is the case:
perl -pi -e 's/\r\n$//g' filetomodify
will do the trick and merge only the line containing both CR and LF with the next line, leaving the other lines untouched.
More Information Needed
More information is needed about the underlying data and what your definition of carriage return is. Is the data in Linux or Windows? Really, do you mean carriage return/line feed, or just line feed?
Some Options:
$text =~ tr/\r//d; → this is the fastest way to delete carriage returns (without the d modifier, tr only counts them)
$text =~ tr/\n//d; → this is the fastest way to delete newlines
$text =~ s/\n//g; → this is probably what you're looking for: it removes the internal \n characters so that the text appears as one line (without g, only the first \n is removed)
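One detail worth checking before relying on tr here: with an empty replacement list and no d modifier it only counts the characters and leaves the string as it was. A short sketch on a made-up string:

```perl
use strict;
use warnings;

my $text = "a\r\nb\r\nc";   # hypothetical two-line-break sample

my $count = ($text =~ tr/\r//);         # counts the \r, changes nothing
(my $no_cr = $text) =~ tr/\r//d;        # d modifier: actually deletes them
(my $one_line = $no_cr) =~ s/\n//g;     # g modifier: every \n, not just one

print "$count\n";      # prints: 2
print "$one_line\n";   # prints: abc
```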