how to shift based on a regular expression using perl? - perl

How can I shift the top element off an array based on a regular expression in Perl? These are data records, meaning I have the input record separator ($/) set to
$/='#';
for example, the following text file contains this data record.
#dddddddddd
ccccccccccc
eeeeeeeeeee
fffffffffff
I would like to remove the # sign and keep the text, for example:
dddddddddd
ccccccccccc
eeeeeeeeeee
fffffffffff

If you just want to manipulate a text file, a one-liner seems like the best solution. This will edit the file and keep a backup in "inputfile.txt.bak".
perl -pi.bak -we 's/^#//' inputfile.txt
Or you can do a shell redirection:
perl -wpe 's/^#//' inputfile.txt > outputfile.txt
These will try to alter all the lines in the file. If you just want the first line altered you need something different:
perl -wpe 's/^#// if ($. == 1);' inputfile.txt > outputfile.txt

Don't confuse shift with regex substitution.
shift removes the first element from an array; it does not remove characters from a string.
A regex substitution can remove the leading '#' character.
The first element of the array would be $array[0].
If a regex substitution is applied to this first element, the '#' is removed:
my @array = ( '#dddddddddd', 'ccccccccccc', 'eeeeeeeeeee', 'fffffffffff' );
$array[0] =~ s/^#//;
print $array[0]; # 'dddddddddd'

This does not seem to be related to arrays. It appears you are just dealing with strings.
This removes a leading hash mark for the string $line:
$line =~ s/^\#//;

Related

Bash one-liner to insert text marker into the fourth and all consecutive tabs of fields populated with text

This is a Bash/.bat terminal script for Mac.
I'm trying to add text ("!!XX!!") into a group of tab-delimited .txt files in a folder, but I only want to add it at the 4th and all following occurrences of the tab in each .txt file, and then only if those cells have text in them. So, the end result would be something like the following (assuming the 7th cell/field/bit of info is blank). So turn this:
text01
text02
text03
text04
text05
text06
... into this:
text01 [TAB] text02 [TAB] text03 [TAB] text04!!XX!! [TAB] text05!!XX!! [TAB] text06!!XX!! [TAB]
The text marker "!!XX!!" is so that another script in a different system can run on the file and perform special system-compatible/custom line formatting at each occurrence of "!!XX!!", but I don't want to populate the first three fields/tab-delimited text (because it's not needed there) or the empty fields (because it's not wanted there).
I'm already replacing each line return with a tab, so it is possible to do it there, though my preference is to do it later to the tab-delimited text b/c of weird issues with the line returns/formatting coming in from .rtf files. Below is what I am using to replace each line return with a TAB (and, yes, that is an actual line return and tab in there, which seems to work best because... Macs?):
perl -pi -w -e 's/
/ /g' *.txt;
Thanks in advance :)
This post assumes an input file that has lines with tab-separated fields, where each field starting from (and including) the fourth need be edited if it has something.
One way
perl -F"\t" -wlane'
for (3..$#F) { $F[$_] .= "!XX!" if defined $F[$_] }; print join("\t", @F)
' file
(In the tcsh shell those ! need to be escaped with a backslash.) Once you've tested enough, add the -i switch to change the input file in place (-i.bak keeps a backup).
This uses Perl's -a switch to break input lines by what is given under the -F switch (or by whitespace by default), and the resulting array is in @F. See switches in perlrun.
Then it iterates from the fourth field to the last. I use the syntax $#ary for the index of the last element of array @ary.
I don't know what counts for cells that "have text in them" so above I test a field for defined-ness; thus this will append even for an empty string. Adjust as suitable.
Or use a regex, which allows more flexibility here. For example,
for (3..$#F) { $F[$_] =~ s/.+\K/!XX!/ }
This matches all characters and then adds !XX! (keeping what it matched, by \K assertion). Using regex allows and demands to specify more precisely what is accepted there; the shown pattern will match even for whitespace alone, but not for empty string. To not touch fields with whitespace only, and to strip trailing spaces if any
for (3..$#F) { $F[$_] =~ s/.+\S\K\s*/XX/ };
Again, adjust to your details.
I don't quite understand the discussion of newlines and what is wanted of them; the above one-liner goes line by line. If that's not what you need please clarify. I don't have Macs to test, so I can't comment on all that.
A self-contained example for ready testing and tweaking
printf "t1\tt2\tt3\tt4\t\tt6 \t \n" |
perl -F"\t" -wlanE'for (3..$#F) { $F[$_] =~ s/.+\S\K\s*/XX/ } say for @F'
where I print each field on a separate line for easier inspection. The last tab in input is followed by trailing spaces only -- this results in an empty field, but with no text marker added (as asked for in a comment).
with GNU sed
$ echo text{01..07}$'\t' | sed -E 's/([^\t]+)(\t|$)/\1!!xx!!\2/4g'
text01 text02 text03 text04!!xx!! text05!!xx!! text06!!xx!! text07!!xx!!
or
$ echo text{01..07}$'\t' | sed -E 's/\t([^\t]+)/\t\1!!xx!!/3g'
Assuming each text file contains 7 lines, you can do
paste -s *.txt | awk '
BEGIN {FS=OFS="\t"}
{for (i=4; i<=NF; i++) if ($i != "") $i = $i "!!XX!!"; print}
'
Here is an awk:
echo text{01..10}$'\t' |
awk -v OFS=$'\t' '{for(i=1;i<=NF;i++) printf "%s%s", $i, i>=4 ? "XXX\t" : i<NF ? OFS : ORS }'
With perl, I would do this:
echo text{01..10}$'\t' |
perl -lpE '$cnt=0; s/\h+/++$cnt>=4 ? "XXX\t" : "\t"/ge;'
Both print:
text01 text02 text03 text04XXX text05XXX text06XXX text07XXX text08XXX text09XXX text10XXX

Change formatting on paragraphs, with perl

I have a number of paragraphs that have returns at the end of a line. I do not want returns at the end of lines, I will let the layout program take care of that. I would like to remove the returns, and replace them with spaces.
The issue is that I do want returns in between paragraphs. So, if there is more than one return in a row (2, 3, etc) I would like to keep two returns.
This would allow for there to be paragraphs, with one blank line between them, but all other formatting for lines would be removed. This would allow the layout program to worry about the line breaks, and not have the breaks determined by a set number of characters, as they are now.
I would like to use Perl to accomplish this change, but am open to other methods.
example text:
This is a test.
This is just a test.
This too is a test.
This too is just a test.
would become:
This is a test. This is just a test.
This too is a test. This too is just a test.
Can this be done easily?
Using a perl one-liner. Replace 2 or more newlines with just 2. Replace each remaining single newline with a space:
perl -0777 -pe 's{(\n{2})\n*|\n}{$1//" "}eg' file.txt > newfile.txt
Switches:
-0777: Slurps the entire file
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
I came up with another solution and also wanted to explain what your regex was matching.
Matt@MattPC ~/perl/testing/8
$ cat input.txt
This is a test.
This is just a test.
This too is a test.
This too is just a test.
another test.
test.
Matt@MattPC ~/perl/testing/8
$ perl -e '$/ = undef; $_ = <>; s/(?<!\n)\n(?!\n)/ /g; s/\n{2,}/\n\n/g; print' input.txt
This is a test. This is just a test.
This too is a test. This too is just a test.
another test. test.
I basically just wrote a perl program and mashed it into a one-liner. It would normally look like this.
# First two lines read in the whole file
$/ = undef;
$_ = <>;
# This regex replaces every `\n` by a space
# if it is neither preceded nor followed by a `\n`
s/(?<!\n)\n(?!\n)/ /g;
# This replaces every two or more \n by two \n
s/\n{2,}/\n\n/g;
# finally print $_
print;
perl -p -i -e 's/(\w+|\s+)[\r\n]/$1 /g' abc.txt
Part of the problem here is what you are matching. (\w+|\s+) matches one or more word characters, which is the same as [a-zA-Z0-9_], OR one or more whitespace characters, which is the same as [\t\n\f\r ].
This wouldn't match your input, since you aren't matching periods, and no line consists of only whitespace or only word characters (even the blank lines would need two whitespace characters to match, since we have [\r\n] at the end).

put all separate paragraphs of a file into a separate line

I have a file that contains sequence data, where each new paragraph (separated by two blank lines) contains a new sequence:
#example
ASDHJDJJDMFFMF
AKAKJSJSJSL---
SMSM-....SKSKK
....SK
SKJHDDSNLDJSCC
AK..SJSJSL--HG
AHSM---..SKSKK
-.-GHH
and I want to end up with a file looking like:
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
each sequence is the same length (if that helps).
I would also be looking to do this over multiple files stored in different directories.
I have just tried
sed -e '/./{H;$!d;}' -e 'x;/regex/!d' ./text.txt
however this just deleted the entire file :S
Any help would be appreciated. It doesn't have to be in sed; if you know how to do it in perl or something else then that's also great.
Thanks.
All you're asking to do is convert a file of blank-lines-separated records (RS) where each field is separated by newlines into a file of newline-separated records where each field is separated by nothing (OFS). Just set the appropriate awk variables and recompile the record:
$ awk '{$1=$1}1' RS= OFS= file
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
awk '
/^[[:space:]]*$/ {if (line) print line; line=""; next}
{line=line $0}
END {if (line) print line}
'
perl -00 -pe 's/\n//g; $_.="\n"'
For multiple files:
# adjust your glob pattern to suit,
# don't be shy to ask for assistance
for file in */*.txt; do
newfile="/some/directory/$(basename "$file")"
perl -00 -pe 's/\n//g; $_.="\n"' "$file" > "$newfile"
done
A Perl one-liner, if you prefer:
perl -nle 'BEGIN{$/=""};s/\n//g;print $_' file
The $/ variable is the equivalent of awk's RS variable. When set to the empty string ("") it causes two or more empty lines to be treated as one empty line. This is the so-called "paragraph mode" of reading. For each record read, all newline characters are removed. The -l switch adds a newline to the end of each output string, thus giving the desired result.
Just find the double line breaks first (\n\n, or \r\n\r\n) and replace them with a special marker like :$:
After that, replace every remaining line break with an empty string to get the whole file on one line.
Next, replace your special marker with a single line break :)

Read a string of alphanumeric characters after a ;

I'm teaching myself Perl, so I'm pretty new to this language. I've been reading over and over about regular expressions but I can't figure out the right context. I want to do the following:
Let's say I have a file named "testfile".
This file contains 3 lines:
test this is the first line
test: this is the first line
test; this is the third line
How can I read and print out only the third one, and everything after the ; without the space? So basically "This is the third line".
This is what I'm thinking to do $string =~ m/this is the third/
This was edited incorrectly. In the first and second lines there should be a space before "test"; in the third one there shouldn't. So I want to skip the white space.
Grabbing from STDIN, it might look like this:
while ( <> ) {
print $1 if /^test; (.*\n)/;
}
You might find that YAPE::Regex::Explain is a handy tool:
Using Axeman's regular expression:
#!/usr/bin/env perl
use strict;
use warnings;
use YAPE::Regex::Explain;
my $expr = q(/^test; (.*\n)/);
print YAPE::Regex::Explain->new( $expr )->explain;
The regular expression:
(?-imsx:/^test; (.*\n)/)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  test;                    'test; '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    \n                       '\n' (newline)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
If you only want the third line, then simply counting the lines and then doing:
s/.*;\s*//;
will remove everything until the ; and any white space after it. Note, however, that if the third line contains another ';' in it then you'll be in trouble. So if that's a possibility but there is no chance that one will exist earlier, then do this:
s/[^;]*;\s*//;
Which will delete only up until the first ';' (and trailing whitespace).
I suspect, however, in the long run you want to match all lines that contain some particular format and it won't always be "just the third". If that's the case then:
while(<>) {
if (/;\s*(.*)/) {
print $1;
}
}
Will get you closer to your end-goal.
Another way to achieve this is to try to remove everything up to the first ; (and any spaces straight afterwards), and only print the line if there was something to remove.
s/.*?;\s*//;
This line basically says: "match any characters (but as few as possible), then a semicolon, then any spaces, and replace it all with nothing".
You can then make a program that reads from STDIN:
while (<>) {
print if s/.*?;\s*//;
}
You can turn these into a nice one-liner on the command line too:
perl -ne 'print if s/.*?;\s*//;'

What do these various pieces of syntax mean?

I'm trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* in between the tag searches? What does /xs do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I'd appreciate it.
The circumflex ^ is regex for "start of line", and $ is regex for "end of line"; so sed will only remove lines which contain exactly "EOR:" and nothing else.
The Perl script is basically perl -00 -ne '/(re)g(ex)/ && print "re ex\n"' with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one.
(.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the matched expression to be captured in a back reference; the backrefs are named $1, $2, etc, as many as there are capturing groups; the numbers correspond to the order of the opening parentheses (so if you apply (a(b)) to the string "ab", $1 will be "ab" and $2 will be "b").
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I imagine these are gene sequences, not "tags"?)
It doesn't really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.
Let's see...
Yes, sed will empty the lines with EOR:
The first / in the Perl script means a regexp pattern. Concretely, it is searching for a pattern in the form below
The regex ends with "xs": x allows the pattern to be spread over several lines with whitespace and comments, and s lets . match newlines, so the regex can match across multiple lines of the input
The script also prints the strings found after the tags (see below). $1 and $2 hold what matched inside the first pair of parentheses ($1) and the second ($2).
The form is this one:
TAGA01:<spaces><string1>
<whatever here>
TAGCC08:<spaces><string2>
In this case, $1 is <string1> and $2 is <string2>.