What do these various pieces of syntax mean? - perl

I'm trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* between the tag searches? What does /xs do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I'd appreciate it.

The circumflex ^ is regex for "start of line", and $ is regex for "end of line"; so sed will only remove lines which contain exactly "EOR:" and nothing else.
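As a rough illustration of the anchors, here is a minimal sketch in Python rather than sed (Python's re module stands in for sed, and the sample text is made up). With re.MULTILINE, ^ and $ anchor at every line boundary, which approximates sed's line-by-line processing:

```python
import re

# Hypothetical sample input: only the line that is exactly "EOR:" should be emptied.
text = "TAGA01: foo\nEOR:\nNOTE EOR: stays\n"

# re.MULTILINE lets ^ and $ match at each line boundary, like sed working per line.
result = re.sub(r"^EOR:$", "", text, flags=re.MULTILINE)
print(repr(result))
```

The line that merely contains EOR: somewhere in the middle is left alone, because the anchors require the whole line to match.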
The Perl script is basically perl -00 -ne '/(re)g(ex)/ && print "re ex\n"' with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one.
(.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the matched text to be captured; the captures are available as $1, $2, etc, one for each pair of parentheses; the numbers correspond to the order of the opening parenthesis (so if you apply (a(b)) to the string "ab", $1 will be "ab" and $2 will be "b").
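The group numbering and the greedy versus non-greedy difference can be checked with a small sketch. This uses Python's re module as a stand-in for Perl (group(1) and group(2) play the role of $1 and $2), and the TAG: input line is hypothetical:

```python
import re

# Group numbers follow the order of the opening parentheses.
m = re.match(r"(a(b))", "ab")
outer, inner = m.group(1), m.group(2)

# Greedy vs non-greedy on the same made-up input; re.DOTALL plays the role of /s.
line = "TAG: first\nsecond\n"
greedy = re.search(r"TAG:\s+(.*)\n", line, flags=re.DOTALL).group(1)
lazy = re.search(r"TAG:\s+(.*?)\n", line, flags=re.DOTALL).group(1)

print(outer, inner, repr(greedy), repr(lazy))
```

The greedy version runs on to the last newline, while the non-greedy one stops at the first.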
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I imagine these are gene sequences, not "tags"?)
It doesn't really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.
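For readers who want to experiment, here is a sketch of the whole pipeline in a single program. It is written in Python, not Perl, so it only approximates what perl -00 does; the record contents are invented, and re.VERBOSE and re.DOTALL are Python's counterparts of /x and /s:

```python
import re

# Hypothetical records; tag names come from the question, the values are made up.
data = (
    "TAGA01: alpha\n"
    "OTHER: x\n"
    "TAGCC08: beta\n"
    "EOR:\n"
    "TAGA01: gamma\n"
    "TAGCC08: delta\n"
    "EOR:\n"
)

# Step 1 (the sed part): empty the lines that are exactly "EOR:".
data = re.sub(r"^EOR:$", "", data, flags=re.MULTILINE)

# Step 2 (roughly perl -00): split into paragraphs on blank lines.
# Perl keeps the trailing newline on each record, so one is re-added here.
paragraphs = [p + "\n" for p in re.split(r"\n{2,}", data) if p.strip()]

pattern = re.compile(
    r"""
    TAGA01:\s+(.*?)\n    # first capture: rest of the TAGA01 line
    .*                   # anything in between, newlines included
    TAGCC08:\s+(.*?)\n   # second capture: rest of the TAGCC08 line
    """,
    re.VERBOSE | re.DOTALL,
)

results = []
for paragraph in paragraphs:
    m = pattern.search(paragraph)
    if m:
        results.append((m.group(1), m.group(2)))
        print(m.group(1), m.group(2))
```

Each paragraph yields one pair of captured values, matching what the one-liner prints as "$1 $2".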

Let's see...
Yes, sed will empty the lines with EOR:
The first / in the Perl script means a regexp pattern. Concretely, it is searching for a pattern in the form below
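The regex ends with "xs": the x lets the regex be spread over several lines (whitespace and # comments inside it are ignored), and the s makes . match newlines too, so the regex can match across multiple lines of the input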
The script also will print as output the strings found in the tags (see below). The $1 and $2 mean the elements contained in the first pair of parentheses ($1) and in the second ($2).
The form is this one:
TAGA01:<spaces><string1>
<whatever here>
TAGCC08:<spaces><string2>
In this case, $1 is <string1> and $2 is <string2>.

Related

Perl one-liner: deleting a line with pattern matching

I am trying to delete a bunch of lines in a file if they match a particular pattern, which is variable.
I am trying to delete any line which matches abc12, abc13, etc.
I tried writing a C-shell script, and this is the code:
#!/bin/csh
foreach $x (12 13 14 15 16 17)
perl -ni -e 'print unless /abc$x/' filename
end
This doesn't work, but when I use the one-liner without a variable (abc12), it works.
I am not sure if there is something wrong with the pattern matching or if there is something else I am missing.
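Yes, it's the fact you're using single quotes. The shell does not expand $x inside them, so Perl receives the literal text /abc$x/ and interpolates its own (empty) variable $x instead of your loop value.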
Of course, you're also doing it very inefficiently, because you're processing each file multiple times.
If you're looking to remove lines abc12 to abc17 you can do this all in one go:
perl -n -i.bak -e 'print unless m/abc1[234567]/' filename
Try this
perl -n -i.bak -e 'print unless m/abc1[2-7]/' filename
using the range [2-7] only removes the need to type [234567] which has the effect of saving you three keystrokes.
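To check which lines a class like abc1[2-7] would remove, here is a small sketch in Python (standing in for the Perl one-liner; the sample lines are invented):

```python
import re

# Made-up sample lines; only abc12 through abc17 should be dropped.
lines = ["abc11 keep", "abc12 drop", "abc17 drop", "abc18 keep", "xyz keep"]

pattern = re.compile(r"abc1[2-7]")

# Same logic as "print unless /abc1[2-7]/": keep a line only if it does not match.
kept = [ln for ln in lines if not pattern.search(ln)]
print(kept)
```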
man 1 bash: Pattern Matching
[...] Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched.
A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

meaning of the following regular expressions written in perl

Here is a piece of code
while($l=~/(\\\s*)$/) {
statements;
}
$l contains a line of text taken from a file; in effect, this code goes through the lines of the file.
Questions:
I don't clearly understand what the condition in the while is doing. I think it is trying to match a group consisting of a \ followed by some amount of whitespace at the end of the line, and that the loop should stop whenever a line ends with \ and maybe some whitespace. I am not sure of it.
I came across the statement $a =~ s/^(.*$)/$1/ . I understand that ^ forces matching at the beginning of the string, but (.*$) would mean match all the characters up to the end of the string. Does it mean that the statement is trying to find whether any group of characters at the end is the same as a group of characters at the beginning of the text?
It is interesting to note that this statement:
while ( $l =~ /(\\\s*)$/ ) {
Is an infinite loop unless $l is altered inside the loop so that the regex no longer matches. As has already been mentioned by others, this is what it matches:
( ... ) a capture group, captures string to $1 (that's the number one, not lower case L)
\\ matches a literal backslash
\s* matches 0 or more whitespace characters.
$ matches at the end of the string, or just before a trailing newline.
Since you do not have the /g modifier, this regex will not iterate through matches, it will simply check if there is a match, resetting the regex each iteration, thereby causing an endless loop.
The statement
$a =~ s/^(.*$)/$1/
Looks rather pointless. It captures a string of characters up until end of string, then replaces it with itself. The captured text is stored in $1 and is simply replaced. The only marginally useful thing about this regex is that:
It matches up until newline \n, and nothing further, which may be of some use to a parser. A period . matches any character except newline, unless the /s modifier is present on the regex.
It captures the line in $1 for future use. However, a simple /^(.*$)/ would do the same.
1. the while
Usually while (regex) is used with the /g modifier, otherwise, if it matches, you get an infinite loop (unless you exit the loop, like using last).
statements would be executed continuously in an infinite loop.
In your case, adding the g
while($l=~/(\\\s*)$/g)
will have the while make only one loop, due to the $ - making a match unique (whatever matches up to the end of string is unique, as $ marks the end, and there is nothing after...).
2. $a =~ s/^(.*$)/$1/
This is a substitution. If the string ^.*$ matches (and it will, since ^.*$ matches (almost, see comment) anything) it is replaced with... $1 or what's inside the (), ie itself, since the match occurs from 1st char to the end of string
^ means beginning of string
(.*) means all chars
$ end of string
so that will replace $a with itself - probably not what you want.
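That the substitution leaves the string unchanged can be confirmed with a short sketch. It uses Python's re.sub in place of Perl's s///, with a made-up input string:

```python
import re

s = "some text"

# count=1 mirrors a Perl s/// without the /g flag; \1 is Python's spelling of $1.
replaced = re.sub(r"^(.*$)", r"\1", s, count=1)
print(replaced == s)
```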
it matches a literal backslash followed by 0 or more spaces followed by the end of the line.
it executes statements for all the lines in that text file that contain a \, followed by zero or more spaces ( \s* ), at the end of the line ($).
It matches lines that end with a backslash character, ignoring any trailing whitespace characters.
Ending a line with a backslash is used in some languages and data files to indicate that the line is being continued on the next line. So I suspect this is part of a parser that merges these continuation lines.
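If the guess about continuation lines is right, the surrounding code might do something like the following. This is only a sketch in Python of that assumed behaviour, not the original Perl; merge_continuations and the sample lines are hypothetical:

```python
import re

# Same pattern as in the question: a backslash plus optional whitespace at the end.
CONTINUATION = re.compile(r"(\\\s*)$")

def merge_continuations(lines):
    """Join each line that ends in a backslash with the line that follows it."""
    merged, buffer = [], ""
    for line in lines:
        line = line.rstrip("\n")
        m = CONTINUATION.search(line)
        if m:
            # Drop the backslash (and trailing whitespace) and keep accumulating.
            buffer += line[:m.start()]
        else:
            merged.append(buffer + line)
            buffer = ""
    if buffer:
        merged.append(buffer)
    return merged

sample = ["one \\\n", "two\\  \n", "three\n", "four\n"]
merged = merge_continuations(sample)
print(merged)
```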
If you enter a regular expression at RegExr and hover your mouse over the pieces, it displays the meaning of each piece in a tooltip.
(\\\s*)$ this regex means --- a \ followed by zero or more number of white space characters which is followed by end of the line. Since you have your regex in (...), you can extract what you matched using $1, if you need.
http://rubular.com/r/dtHtEPh5DX
EDIT -- based on your update
$a =~ s/^(.$)/$1/ --- this is search and replace. Your regex matches a line which contains exactly one character (the . matches any character except a new-line, see http://www.regular-expressions.info/dot.html). Since you use (...), the character that matched is captured in $1, and the replacement writes it back into $a
EDIT -- you changed your regex so here is the updated answer
$a =~ s/^(.*$)/$1/ -- same as above except now it matches zero or more characters (except new-line)

how to use sed/awk to remove words with multiple pattern count

I have a file of string records where one of the fields - delimited by "," - can contain one or more "-" inside it.
The goal is to delete the field value if it contains more than two "-".
i am trying to recoup my past knowledge of sed/awk but can't make much headway
==========
info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner
==========
expected outcome:
info,whitepaper,Data-Centers
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers
thanks
Try
sed -r 's/(^|,)([^,-]+-){3,}[^,]+(,|$)/\3/g'
or if you're into slashes
sed 's/\(^\|,\)\([^,-]\+-\)\{3,\}[^,]\+\(,\|$\)/\3/g'
Explanation:
I'm using the most basic sed command: substitution. The syntax is: s/pattern/replacement/flags.
Here pattern is (^|,)([^,-]+-){3,}[^,]+(,|$), replacement is \3, flags is g.
The g flag means global replacement (all matching parts are replaced, not only the first in line).
In pattern:
brackets () create a group. Somewhat like in math. They also allow to refer to a group with a number later.
^ and $ mean beginning and end of the string.
| means "or", so (^|,) means "comma or beginning of the string".
square brackets [] mean a character class, ^ inside means negation. So [^,-] means "anything but comma or hyphen". Note that usually the hyphen has a special meaning in character classes: [a-z] means all lowercase letters. But here it's just a hyphen because it's not in the middle.
+ after an expression means "match it 1 or more times" (like * means match it 0 or more times).
{N} means "match it exactly N times". {N,M} is "from N to M times". {3,} means "three times or more". + is equivalent to {1,}.
So this is it. The replacement is just \3. This refers to the third group in (), in this case (,|$). This will be the only thing left after the substitution.
P.S. the -r option just changes what characters need to be escaped: without it, all of ( ) { } | and + are treated as regular chars unless you escape them with \. Conversely, to match literal ( with -r option you'll need to escape it.
P.P.S. Here's a reference for sed. man sed is your friend as well.
Let me know if you have further questions.
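The same pattern can be tried outside sed. Here is a minimal sketch in Python, whose re syntax is close to sed -r for this expression; the three records come from the question, and the fourth test string is made up to show a field being removed from the middle of a line:

```python
import re

# Same pattern as the sed -r command; \3 keeps only the trailing comma (or nothing).
pattern = re.compile(r"(^|,)([^,-]+-){3,}[^,]+(,|$)")

records = [
    "info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane",
    "info,whitepaper,Data-Centers,the-evolution-center",
    "info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner",
]

cleaned = [pattern.sub(r"\3", record) for record in records]
middle = pattern.sub(r"\3", "a,b-c-d-e,f")

for line in cleaned:
    print(line)
```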
You could try perl instead of sed or awk:
perl -F, -lane 'print join ",", grep { !/-.*-.*-/ } @F' < file.txt
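The split, filter, join idea behind that one-liner looks like this as a Python sketch (drop_hyphenated_fields is a hypothetical helper name; the records are the ones from the question):

```python
import re

def drop_hyphenated_fields(line):
    """Split on commas and drop any field containing three or more hyphens."""
    fields = line.split(",")
    return ",".join(f for f in fields if not re.search(r"-.*-.*-", f))

records = [
    "info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane",
    "info,whitepaper,Data-Centers,the-evolution-center",
    "info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner",
]

filtered = [drop_hyphenated_fields(r) for r in records]
for line in filtered:
    print(line)
```

Working on fields instead of one big regex avoids having to reason about the commas around the match.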
This might work for you:
sed 's/,\{,1\}[^,-]*\(-[^,]*\)\{3,\}//g' file
sed 's/\(^\|,\)\([^,]*-\)\{3\}[^,]*\(,\|$\)//g'
This should work in more cases:
sed 's/,$/\n/g;s/\(^\|,\|\n\)\([^,\n]*-\)\{3\}[^,\n]*\(,\|\n\|$\)/\3/g;s/,$//;s/\n/,/g'

Confining Substitution to Match Space Using sed?

Is there a way to substitute only within the match space using sed?
I.e. given the following line, is there a way to substitute only the "." chars that are contained within the matching single quotes and protect the "." chars that are not enclosed by single quotes?
Input:
'ECJ-4YF1H10.6Z' ! 'CAP' ! '10.0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2
Desired result:
'ECJ-4YF1H10-6Z' ! 'CAP' ! '10_0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2
Or is this just a job to which perl or awk might be better suited?
Thanks for your help,
Mark
Give the following a try which uses the divide-and-conquer technique:
sed "s/\('[^']*'\)/\n&\n/g;s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g;s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g;s/\n//g" inputfile
Explanation:
s/\('[^']*'\)/\n&\n/g - Add newlines before and after each pair of single quotes with their contents
s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g - Using a newline and the single quotes to key on, replace the dot with a dash for strings that end in "Z"
s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g - Using a newline and the single quotes to key on, replace the dot with an underscore for strings that end in "uF"
s/\n//g - Remove the newlines added in the first step
You can restrict the command to acting only on certain lines:
sed "/foo/{s/\('[^']*'\)/\n&\n/g;s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g;s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g;s/\n//g}" inputfile
where you would substitute some regex in place of "foo".
Some versions of sed like to be spoon fed (instead of semicolons between commands, use -e):
sed -e "/foo/{s/\('[^']*'\)/\n&\n/g" -e "s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g" -e "s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g" -e "s/\n//g}" inputfile
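In a language with substitution callbacks, the "only inside the quotes" restriction can be expressed directly. The following is a sketch in Python, not sed; fix_quoted is a hypothetical helper, and unlike the sed commands above it replaces every dot inside a matching quoted token, not just the first:

```python
import re

def fix_quoted(match):
    """Rewrite dots only inside one quoted token, keyed on its suffix."""
    token = match.group(0)
    if token.endswith("Z'"):
        return token.replace(".", "-")
    if token.endswith("uF'"):
        return token.replace(".", "_")
    return token

line = "'ECJ-4YF1H10.6Z' ! 'CAP' ! '10.0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2"

# The pattern matches each single-quoted token; text outside the quotes is never touched.
result = re.sub(r"'[^']*'", fix_quoted, line)
print(result)
```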
$ cat phoo1234567_sedFix.sed
#! /bin/sed -f
/'[0-9][0-9]\.[0-9][a-zA-Z][a-zA-Z]'/s/'\([0-9][0-9]\)\.\([0-9][a-zA-Z][a-zA-Z]\)'/'\1_\2'/
This answers your specific question. If the pattern you need to fix isn't always like the example you provided, then you'll need multiple copies of this line, with the regular expressions modified to match your new change targets.
Note that the cmd is in 2 parts: "/'[0-9][0-9]\.[0-9][a-zA-Z][a-zA-Z]'/" says the line must match this pattern, while the trailing "s/'\([0-9][0-9]\)\.\([0-9][a-zA-Z][a-zA-Z]\)'/'\1_\2'/" is the part that does the substitution (the quotes are repeated in the replacement because they are part of the match). You can add a 'g' after the final '/' to make this substitution happen on all instances of this pattern in each line.
The \(\) pairs in match pattern get converted into the numbered buffers on the substitution side of the command (i.e. \1 \2). This is what gives sed power that awk doesn't have.
If you're going to do much of this kind of work, I highly recommend O'Reilly's Sed And Awk book. The time spent going through how sed works will be paid back many times.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, or give it a + (or -) as a useful answer.
this is a job most suitable for awk or any language that supports breaking/splitting strings.
IMO, using sed for this task, which is regex based , while doable, is difficult to read and debug, hence not the most appropriate tool for the job. No offense to sed fanatics.
awk '{
    for(i=1;i<=NF;i++) {
        # \047 is the octal escape for a single quote
        if ($i ~ /\047/ ){
            # use a regex literal so that only literal dots are replaced
            gsub(/\./,"_",$i)
        }
    }
}1' file
The above says for each field (the field separator is whitespace by default), check to see if there is a single quote, and if there is, substitute "_" for each ".". This method is simple and doesn't need a complicated regex.
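The field-by-field approach translates almost word for word into other languages. Here is a sketch in Python of the intent described above (replace literal dots only in whitespace-separated fields that contain a single quote); like awk, it rejoins the fields with single spaces:

```python
line = "'ECJ-4YF1H10.6Z' ! 'CAP' ! '10.0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2"

# Split on whitespace, as awk does by default.
fields = line.split()

# Only fields containing a single quote get their dots replaced.
fixed = [f.replace(".", "_") if "'" in f else f for f in fields]

result = " ".join(fixed)
print(result)
```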

What's the clearest way to replace trailing backslash \ with \n?

I want multi-line strings in java, so I seek a simple preprocessor to convert C-style multi-lines into single lines with a literal '\n'.
Before:
System.out.println("convert trailing backslashes\
this is on another line\
\
\
above are two blank lines\
But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
After:
System.out.println("convert trailing backslashes\nthis is on another line\n\n\nabove are two blank lines\nBut don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
I thought sed would do it well, but sed is line-based, so replacing the '\' and the newline that follows it (effectively joining the two lines) is not very natural in sed. I adapted sredden79's oneliner to the following - it works, it's clever, but it's not clear:
sed ':a { $!N; s/\\\n/\\n/; ta }'
The substitute is of escaped literal backslash, newline with escaped literal backslash, n. :a is a label and ta is goto label if the substitute found a match; $ means the last line, and $! is the opposite (i.e. all lines but the last). N means to append the next line to the pattern space (thus making the \n character visible.)
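Once the whole text is in one buffer, which is what the N loop achieves inside sed, the join is a single replacement. Here is a sketch in Python with a made-up source string (a Java-like statement, much shorter than the example above):

```python
# Hypothetical input: each "\\\n" below is a backslash followed by a real newline.
source = 'println("first\\\nsecond\\\n\\\nlast");\n'

# Replace backslash + newline with the two characters backslash and n.
joined = source.replace("\\\n", "\\n")
print(joined)
```

The output is a single line in which every continuation has become a literal \n.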
EDIT here's a variation to keep compiler error line numbers etc accurate: it turns each extended line into "..."+\n (and handles the first and last lines of the String correctly):
sed ':a { $!N; s/\\\n/\\n"+\n"/; ta }'
giving:
System.out.println("convert trailing backslashes\n"+
"this is on another line\n"+
"\n"+
"\n"+
"above are two blank lines\n"+
"But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
EDIT Actually, it would be better have Perl/Python style multi-line, where it starts and ends with a special code on one line (""" for python, I think).
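The line-number-preserving variation from the first EDIT (each continued line turned into "..."+ followed by a real newline) can be sketched the same way. This is Python with a made-up input string, not the sed command itself:

```python
# Hypothetical input: each "\\\n" below is a backslash followed by a real newline.
source = 'print("one\\\ntwo\\\n\\\nthree");\n'

# Backslash + newline becomes: literal \n, closing quote, +, real newline, opening quote.
split = source.replace("\\\n", '\\n"+\n"')
print(split)
```

Because every real newline is kept, the number of lines does not change, so compiler error line numbers still point at the right place.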
Is there a simpler, saner, clearer way (maybe not using sed)?
Is there a simpler, saner, clearer way.
Forget the pre-processor, live with the limitation, complain about it (so that it will maybe be fixed in Java 7 or 8), and use an IDE to ease the pain.
Other alternatives (too troublesome I suppose, but still better than messing with the compilation process):
use a JVM-based language that does support here-docs
externalize the string into a resource file
A perl one-liner:
perl -0777 -pe 's/\\\n/\\n/g'
This will read either stdin or the file(s) named after it on the command line and write the output to stdout.
If you're using an editor that supports filtering, like vi or emacs, just filter your text through the above command and you're done:
If you're using Windows and have to worry about \r :
C:\> perl -0777 -pe "s/\\\r?\n/\\n/g"
although I think win32 Perl handles \r itself so this may be unnecessary.
The -0777 option is a special case of the -0 (that's a zero) option that defines the line or record separator. In this case, it means that we don't want any separator so read the entire file in as a single string.
The -pe option is a combination of -p (process line-by-line and print the result) and -e (next argument is (a line of) the program to execute)
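A Perl script to do what you asked for:
while (<>) {
    chomp;              # strip the newline
    print $_;           # the line, including any trailing backslash
    if (/\\$/) {
        print "n";      # trailing backslash plus "n" forms a literal \n
    } else {
        print "\n";     # ordinary line: restore the newline
    }
}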
sed 's/\x5c\x5c$/\x22\x5c\x5cn\x22/'
Hex for backslash and double quote is \x5c and \x22 respectively - it needs to be escaped so \x5c is doubled and the $ anchors to the end of the line.
Updated again per OP comment:
sed "{:a;N;\$!b a};s/\x5c\x5c\n/\x5c\x5cn/g"
The :a creates a label, N appends the next line to the pattern space, and $!b a branches back to the label :a on every line except the last ($!).
Once it's all loaded, a single substitution replaces every backslash-newline pair with a literal '\n', using the hex ASCII code \x5c for the backslash.