How to modify the matched pattern - perl

Just wondering if there is a handy way to modify the matched pattern variable in a Perl one-liner. For instance, in the string abcdef I'd like to replace def with e (output abce) using a command that looks like this:
echo "abcdef" | perl -pne 's/(def)/{command that trims first and last character of $1 and returns it as a string for perl to use it as a replacement}/'
It would be easy to use such functionality to perform various formatting tasks. Can we do this in sed?

This is easy in Perl with the /e flag:
echo 'abcdef' | perl -pe 's/(def)/substr $1, 1, -1/e'
The /e flag tells Perl to evaluate the replacement part as Perl code instead of treating it as a string. You can put arbitrary code in there.
But your concrete task (trimming the first and last character) can also be done like this:
echo 'abcdef' | perl -pe 's/d(e)f/$1/'
(Also, perl -p already implies -n. No need to specify both.)

Related

How to replace part of multi-line string with sed

I want to replace newlines in a string with <br/> everywhere except inside a triple-backtick block (```)
String:
This is some requirements just talking about c++ with
`char` types supported by C++:
```
using SecureString = BasicSecureString<char>;
using WSecureString = BasicSecureString<wchar_t>;
using U16SecureString = BasicSecureString<char16_t>;
using U32SecureString = BasicSecureString<char32_t>;
```
And continuing to write
stuff
Expected result:
This is some requirements just talking about c++ with<br/>`char` types supported by C++:<br/>```
using SecureString = BasicSecureString<char>;
using WSecureString = BasicSecureString<wchar_t>;
using U16SecureString = BasicSecureString<char16_t>;
using U32SecureString = BasicSecureString<char32_t>;
```<br/>And continuing to write<br/>stuff
What I currently have is something like:
sed --null-data '/```.*```/!s/\n/<br\/>/g'
But it's only working on inputs which don't include the backticks.
Does someone have any hints?
With perl or awk
perl -0777 -pe 's#```.*?```(*SKIP)(*F)|\n#<br/>#sg'
awk '/```/{f=!f} {ORS = f ? RS : "<br/>"} 1'
The perl solution is similar to what you tried with sed.
-0777 slurps the entire input as a single string, similar to sed -z (although -z works by using ASCII NUL as the line separator)
```.*?```(*SKIP)(*F) to prevent this matching portion from being changed
|\n specifies what should be matched
<br/> replacement string
s flag to allow . to match newline characters as well
With awk, the output record separator is changed based on the value of f, which toggles whenever an input line contains triple backticks. The advantage of this approach is that the whole input doesn't have to be slurped.
If you don't wish to change the last newline in the file, use
perl -0777 -pe 's#```.*?```(*SKIP)(*F)|\n(?!\z)#<br/>#sg'

Perl one-liner to extract groups of characters

I am trying to extract a group of characters with a Perl one-liner, but I have been unsuccessful:
echo "hello_95_.txt" | perl -ne 's/.*([0-9]+).*/\1/'
Returns nothing, while I would like it to return 95. How can I do this with Perl?
Update:
Note that, in contrast to the suggested duplicate, I am interested in how to do this from the command-line. Surely this looks like a subtle difference, but it's not straightforward unless you already know how to effectively use Perl one-liners.
Since people are asking, eventually I want to learn to use Perl to write powerful one-liners, but most immediately I need a one-liner to extract consecutive digits from each line in a large text file.
perl -pe's/\D*(\d+).*/$1/'
or
perl -nE'/\d+/&&say$&'
or
perl -nE'say/(\d+)/'
or
perl -ple's/\D//g'
or maybe
perl -nE'$,=" ";say/\d+/g'
Well first, you need to use the -p rather than the -n switch.
And you need to amend your regular expression, as in:
echo "hello_95_.txt" | perl -pe 's/^.*?([0-9]+).*$/\1/'
which matches the shortest possible run of characters from the start of the line (non-greedy), followed by one or more digits, followed by any remaining characters to the end of the line.
Note that while '\1' is accepted in the replacement and is more familiar to sed/awk users, '$1' is the preferred form in Perl. So you might wish to use:
echo "hello_95_.txt" | perl -pe 's/^.*?([0-9]+).*$/$1/'
instead. Keep the Perl code in single quotes: inside double quotes the shell would expand $1 itself (normally to an empty string) before Perl ever sees it.

Insert between each occurrence of two characters

If I can have somewhere in my input a series of two or more characters (in my case, >), how can I insert something between each occurrence of >?
For example: >> to >foo>, but also:
>>> to >foo>foo> and:
>>>> to >foo>foo>foo>.
Using 's/>>/>foo>/g' gives me of course >foo>>foo>, which is not what I need.
In other words, how can I push a character back to the pattern space, or match a character without consuming it (does that make any sense?)
Using Perl, you can do it iteratively
$ echo '>>>>' | perl -pe 's/>>/>foo>/ while />>/'
>foo>foo>foo>
or use a look-ahead assertion, which does not consume the 2nd >
$ echo '>>>>' | perl -pe 's/>(?=>)/>foo/g'
>foo>foo>foo>
This should also work
sed ':b; s/>>/>foo>/; tb'
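A note on how the sed loop works, since sed has no look-ahead: with the g flag, matching resumes after the text that was just consumed, so the second > of each pair is never reused. The loop instead restarts the search from the beginning of the line after every successful substitution. The sketch below spells the same loop out with separate -e options, which should also be accepted by sed implementations that do not allow a label and other commands to share one semicolon-separated string.

```shell
# :b defines a label; tb jumps back to it whenever the preceding s/// changed something.
echo '>>>>' | sed -e ':b' -e 's/>>/>foo>/' -e 'tb'    # prints: >foo>foo>foo>
```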

Remove from the beginning till certain part in a string

I work with strings like
abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf
and I need to get a new string in which everything from the beginning up to the last occurrence of "_" is removed, keeping that "_" and the characters after it (there can be 3, 4, or any number of them)
so in this case I would get
_adf
How could I do it with "sed" or another bash tool?
Regular expression pattern matching is greedy. Hence ^.*_ will match all characters up to and including the last _. Then just put the underscore back in:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*_/_/'
sed -E 's/^(.*)_([^_]*)$/_\2/' < input.txt
Do you need to modify the string, or just find everything after the last underscore? The regex to find the last _{anything} would be /(_[^_]+)$/ ($ matches the end of the string), or if you also want to match a trailing underscore with nothing after it, /(_[^_]*)$/.
Unless you really need to modify the string in place instead of just finding this piece, or you really want to do this from the command line instead of a script, this regex is a bit simpler (you tagged this with perl, so I wasn't sure quite how committed to using just the command line as opposed to a simple script you were).
If you do need to modify the file in place, use sed -E -i 's/.*(_[^_]+)$/\1/' myfile. The -E flag enables extended regular expressions, so that ( ) and + act as grouping and repetition, and the leading .* makes the substitution discard everything before the last underscore. The -i flag overwrites the old file with the new content (this is the GNU sed syntax; BSD/macOS sed requires an argument, as in -i ''). If you want to create a new file and not clobber the old one, use sed -E 's/.../.../' oldfile > newfile. A g after the s/// would replace every match on each line instead of only the first; since this pattern is anchored at the end of the line, it can match at most once per line, so g makes no difference here.
If the string is not by itself at the end of the line but is embedded in other text and followed by whitespace, replace the $ with \s, which matches a whitespace character (the end of a word).
If you have strings like these in bash variables (I don't see that specified in the question), you can use parameter expansion:
s="abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf"
t="_${s##*_}"
echo "$t" # ==> _adf
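For reference, a quick sketch of the related expansion operators on a shorter, made-up string: ## and %% strip the longest match of the pattern, while # and % strip the shortest. All four are standard POSIX shell parameter expansions, so they are not limited to bash.

```shell
s="a_b_c"
echo "_${s##*_}"    # longest prefix matching *_ removed  -> _c
echo "${s#*_}"      # shortest prefix matching *_ removed -> b_c
echo "${s%_*}"      # shortest suffix matching _* removed -> a_b
echo "${s%%_*}"     # longest suffix matching _* removed  -> a
```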
In Perl, you could do this:
my $string = "abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf";
if ( $string =~ m/(_[^_]+)$/ ) {
print $1;
}
[Edit]
A Perl one liner approach (ie, can be run from bash directly):
perl -lne 'm/(_[^_]+)$/ && print $1;' infile > outfile
Or using substitution:
perl -pe 's/.*(_[^_]+)$/$1/' infile > outfile
Just group the last non-underscore characters preceded by the last underscore with \(_[^_]*\), then reference this group with \1:
sed 's/^.*\(_[^_]*\)$/\1/'
Result:
$ echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | sed 's/^.*\(_[^_]*\)$/\1/'
_adf
A Perl way:
echo 'abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf' | \
perl -e 'print ((split/(_)/,<>)[-2..-1])'
output:
_adf
Just for fun:
echo abc_dsdsds_ss_gsgsdsfsdf_ewew_wewewewewew_adf | tr _ '\n' | tail -n 1 | rev | tr '\n' _ | rev

How to use sed-awk-gawk to display a matched string

I've got a file called 'res' that's 29374 characters of http data in a one-line string. Inside it, there are several http links, but I only want to display those that end in '/idNNNNNNNNN' where N is a digit. In fact I'm only interested in the string 'idNNNNNNNNN'.
I've tried with:
cat res | sed -n '0,/.*\(id[0-9]*\).*/s//\1/p'
but I get the whole file.
Do you know a way to do it?
perl -n -E 'say $1 while m!/id(\d{9})!g' input-file
should work. That assumes exactly 9 digits; that's the {9} in the above. You can match 8 or 9 ({8,9}), 8 or more ({8,}), up to 9 ({0,9}), etc.
Example of this working:
$ echo -n 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | perl -n -E 'say $1 while m!id(\d{0,9})!g'
231313
23123
That's with the 0 to 9 variant, of course.
If you're stuck with a pre-5.10 perl, use -e instead of -E and print "$1\n" instead of say $1.
How it works
First are the two command-line arguments to Perl. -n tells Perl to read input from standard input or files given on the command line, line by line, setting $_ to each line. $_ is Perl's default target for a lot of things, including regular expression matches. -E tells Perl that the next argument is the program text, with the newer language features such as say enabled (vs. -e, which does not enable the 5.10 extensions).
So, looking at the one liner: say means to print out some value, followed by a newline. $1 is the first regular expression capture (captures are made by parentheses in regular expressions). while is a looping construct, which you're probably familiar with. m is the match operator, the ! after it is the regular expression delimiter (normally, you see / here, but since the pattern contains / it's easier to use something else, so you don't have to escape the / as \/). /id(\d{9}) is the regular expression to match. Keep in mind that the delimiter is !, so the / is not special, it just matches a literal /. The parentheses form a capture group, so $1 will be the number. The ! is the delimiter, followed by g which means to match as many times as possible (as opposed to once). This is what makes it pick up all the URLs in the line, not just the first. As long as there is a match, the m operator will return a true value, so the loop will continue (and run that say $1, printing out the match).
Two-sed solution
I think this is one way to do this with only sed. Much more complicated!
echo 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | \
sed 's!http://!\nhttp://!g' | \
sed 's!^.*/id\([0-9]*\).*$!\1!'
perl -ne 'print "$1\n" while m/\/(id\d+)/g' res
The trouble is that sed and grep and awk work on lines, and you've only got one line. So, you probably need to split things up so you have more than one line -- then you can make the normal tools work.
tr ':' '\012' < res |
sed -n 's%.*/\(id[0-9][0-9]*\).*%\1%p'
This takes advantage of URLs containing colons and maps colons to newlines with tr, then uses sed to pick up anything up to a slash, followed by id and one or more digits, followed by anything, and prints out the id and digit string (only). Since these only occur in URLs, they will only appear one per line and relatively near the start of the line too.
Here's a solution using only one invocation of sed:
sed -n 's| |\n|g;/^http/{s|http://[^/]*/id\([0-9]*\)|\1|;P};D' inputfile
Explanation:
s| |\n|g; - Divide and conquer
/^http/{ - If pattern space begins with "http"
s|http://[^/]*/id\([0-9]*\)|\1|; - capture the id
P - Print the string preceding the first newline
}; - end if
D - Delete the string preceding the first newline regardless of whether it contains "http"
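A quick way to watch the P/D loop at work is to run the same script on a short sample line. This is only a sketch: the input is made up, and it assumes GNU sed, since \n in the replacement and the compact {...;P};D form are not portable to every sed.

```shell
# Each D removes the first segment and restarts the script on what is left,
# so the segments of the single input line are processed one at a time.
echo 'junk http://foo/id231313 junk http://bar/id23123 asda' |
  sed -n 's| |\n|g;/^http/{s|http://[^/]*/id\([0-9]*\)|\1|;P};D'
# prints:
# 231313
# 23123
```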
Edit:
This version uses the same technique but is more selective.
sed -n 's|http://|\n&|g;/^\n*http/{s|\n*http://[^/]*/id\([0-9]*\)|\1\n|;P};D' inputfile