Extract substrings between strings - sed

I have a file with text as follows:
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
I want to extract all strings between the ### markers.
My desired output would be something like this:
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
I have tried the following:
grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'
This almost works but only seems to grab the first instance per line, so the first line in my output only grabs
interest1 moreinterest1
rather than
interest1 moreinterest1
interest2

Here is a single awk command that does this. It uses ### as the field separator and prints each even-numbered field:
awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Here is an alternative grep + sed solution:
grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'
This assumes there are no # characters in between ### markers.
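To see what that assumption means in practice, here is a small sketch. The sample input is made up for illustration: its second line has a single # inside the markers, so [^#]* cannot span it and that line yields no match.

```shell
# Hypothetical sample: the second line has a '#' inside the markers.
input='a ###one### b ###two###
c ###C# code### d'

# grep -o prints each match on its own line; sed then strips the markers.
out=$(printf '%s\n' "$input" | grep -oE '###[^#]*###' | sed -E 's/^###|###$//g')
printf '%s\n' "$out"
# prints:
# one
# two
```

If your data can contain a lone # between markers, one of the other answers is probably a better fit.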

With GNU awk for multi-char RS:
$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6

You can use pcregrep:
pcregrep -o1 '###(.*?)###' file
The regex, ###(.*?)###, matches ###, then captures into Group 1 any zero or more chars other than line break chars, as few as possible, and then matches ###.
The -o1 option will output the Group 1 value only.
See the regex demo online.

sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file
Replacing "###" with newline, D, then conditionally branching to P if a second replacement of "###" is successful.

This might work for you (GNU sed):
sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file
Replace all occurrences of ###'s by newlines.
If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.

Related

Replacing all occurrence after nth occurrence in a line in perl

I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with the occurrence number is not working as expected; instead, it is replacing all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
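Splitting into single characters with an empty FS is, as far as I know, a common awk extension rather than guaranteed behaviour. A sketch of the same idea that only relies on substr() and length() could look like this (the variable names res, seen and ch are mine):

```shell
input=':account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus'

# Walk each line one character at a time and keep only the first n delimiters.
out=$(printf '%s\n' "$input" | awk -v c=':' -v n=2 '{
  res = ""; seen = 0
  for (i = 1; i <= length($0); i++) {
    ch = substr($0, i, 1)
    if (ch == c && ++seen > n) continue   # drop delimiters after the nth
    res = res ch
  }
  print res
}')
printf '%s\n' "$out"
```

It is slower than the field-based version on long lines, but it does not depend on how a given awk treats FS="".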
With GNU awk, you can use gensub. This is based entirely on the samples shown, where the goal is to remove every : from the 3rd occurrence onwards: gensub splits the line into a leading part (up to and including the 2nd colon) and the rest, and all colons are then removed from the rest.
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
firstPart=restPart=""
firstPart=gensub(regex, "\\1\\2", "1", $0)
restPart=gensub(regex,"\\3","1",$0)
gsub(/:/,"",restPart)
print firstPart restPart
}
' Input_file
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ( undef, $first, @rest ) = split /:/;
print ":$first:", join ( "", @rest ),"\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
You can use a Perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
See the online demo and the regex demo.
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
And only run the replacement if the current line starts with :account_id: (see the trailing if /^:account_id:/).
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
See this online demo. Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result} - iterates over the fields and if the field number is greater than 2, just append the current field value to result, else, append the value + output field separator; then print the result.
With GNU AWK, if n is fixed and equal to 2, I would do it the following way. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as field separator and nothing as output field separator, this itself does remove all : so I add : which have to be preserved: 1st (before second column) and 2nd (after second column). Beware that I tested it solely for this data, so if you would want to use it you should firstly test it with more possible inputs.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from third occurrence of the copy to the third occurrence of the amended line.
N.B. A newline is the best delimiter to use with sed, because the line presented to sed's commands is initially free of newlines. The important property of the delimiter, however, is that it is unique, so any character will do as long as it is not found anywhere in the data set.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
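The loop version is easy to parameterise. Below is a rough sketch; strip_after_nth is a hypothetical helper name, and it assumes the delimiter is a single character with no special meaning in a regex or bracket expression.

```shell
# strip_after_nth DELIM N: keep the first N delimiters on each line, drop the rest.
# Hypothetical helper; DELIM must be one ordinary character such as ':' or ','.
strip_after_nth() {
  # Each pass removes one delimiter following the first N; ta repeats while it succeeds.
  sed -E -e ':a' -e "s/^(([^$1]*$1){$2}[^$1]*)$1/\\1/" -e 'ta'
}

out=$(printf '%s\n' ':account_id:12345:6789:Melbourne:Aus' | strip_after_nth : 2)
printf '%s\n' "$out"
# prints :account_id:123456789MelbourneAus
```

The label is given in its own -e so that sed implementations which do not accept a semicolon after a label name can still parse it.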
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

How to replace a block of code between two patterns with blank lines?

I am trying to replace a block of code between two patterns with blank lines.
I tried using the command below:
sed '/PATTERN-1/,/PATTERN-2/d' input.pl
But it only removes the lines between the patterns
PATTERN-1 : "=head"
PATTERN-2 : "=cut"
input.pl contains below text
=head
hello
hello world
world
morning
gud
=cut
Required output :
=head
=cut
Can anyone help me on this?
$ awk '/=cut/{f=0} {print (f ? "" : $0)} /=head/{f=1}' file
=head
=cut
To modify the given sed command, try
$ sed '/=head/,/=cut/{//! s/.*//}' ip.txt
=head
=cut
//! selects the lines in the range other than its start and end lines: the empty regex // reuses the last regex that was matched. Whether it picks up both range addresses dynamically or only one of them statically may depend on the sed implementation; it works on GNU sed.
s/.*// clears these lines
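If you would rather not depend on how a particular sed handles the empty regex, the two boundary lines can be excluded explicitly. This is only a sketch under the assumption that the markers start at the beginning of a line; the sample input is shortened.

```shell
input='=head
hello
hello world
=cut
after'

# Inside the range, blank every line that is neither the start nor the end marker.
out=$(printf '%s\n' "$input" |
  sed -e '/^=head/,/^=cut/{' -e '/^=head/!{' -e '/^=cut/!s/.*//' -e '}' -e '}')
printf '%s\n' "$out"
```

The two lines between =head and =cut come out empty, and the line after =cut is untouched.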
awk '/=cut/{found=0}found{print "";next}/=head/{found=1}1' infile
# OR
# ^ to take care of line starts with regexp
awk '/^=cut/{found=0}found{print "";next}/^=head/{found=1}1' infile
Explanation:
awk '/=cut/{ # if line contains regexp
found=0 # set variable found = 0
}
found{ # if variable found is nonzero value
print ""; # print ""
next # go to next line
}
/=head/{ # if line contains regexp
found=1 # set variable found = 1
}1 # 1 at the end does default operation
# print current line/row/record
' infile
Test Results:
$ cat infile
=head
hello
hello world
world
morning
gud
=cut
$ awk '/=cut/{found=0}found{print "";next}/=head/{found=1}1' infile
=head
=cut
This might work for you (GNU sed):
sed '/=head/,/=cut/{//!z}' file
Zap the lines between =head and =cut.

Joining lines in order of different blocks in the same text file

I have a file split in blocks like the following:
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGG
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTT
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTA
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
TATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
GGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
I've truncated/wrapped the lines for clarity's sake, but imagine very long lines. The point of my question is that I want a final file that looks like this:
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGGAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTTAGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGTTATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
Where this new block has:
the same number of lines as the initial blocks,
each of the lines of the resulting block is a concatenation of the lines with the same line-number in the initial blocks.
this concatenation should be in-order (i.e. "1st line of 1st block" + "1st line of 2nd block", etc.).
Is it possible to achieve this final block using sed and/or awk, could you show me how it could be done?
In bash with paste:
$ paste <(head -4 file) <(tail -4 file) | tr -d '\t'
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGGAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTTAGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGTTATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
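try this (it assumes the two 4-line blocks are separated by one blank line, so the second block starts at line 6):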
awk -vOFS="" '$0{a[NR]=$0}END{for(i=1;i<=NR/2;i++)print a[i],a[i+5]}' file
test with your example:
kent$ cat tmp.txt
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGG
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTT
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTA
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
TATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
GGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
kent$ awk -vOFS="" '$0{a[NR]=$0}END{for(i=1;i<=NR/2;i++)print a[i],a[i+5]}' tmp.txt
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGGAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTTAGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGTTATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
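Or, assuming the blocks are separated by blank lines, with awk in paragraph mode (works for any number of blocks):
awk -F'\n' -v RS= '{for (i=1;i<=NF;i++) str[i] = str[i] $i} END {for (i=1;i<=NF;i++) print str[i]}' file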
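Here is a tiny check of the paragraph-mode idea on made-up two-line blocks. I am assuming blocks are separated by a blank line, and I save the field count in a variable n (my own name) instead of relying on NF inside END.

```shell
input='AB
CD

EF
GH'

# RS= puts awk in paragraph mode: each blank-line-separated block is one record,
# and with newline as the field separator each line of the block is one field.
out=$(printf '%s\n' "$input" | awk -F'\n' -v RS= '
  { for (i = 1; i <= NF; i++) str[i] = str[i] $i; n = NF }
  END { for (i = 1; i <= n; i++) print str[i] }')
printf '%s\n' "$out"
# prints:
# ABEF
# CDGH
```

Each record appends its i-th line to str[i], so the blocks are joined in the order they appear in the file.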

divide each line in equal part

I would be happy if anyone could suggest a command (a sed or AWK one-liner) to divide each line of a file into an equal number of parts. For example, divide each line into 4 parts.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
(.{4}) captures four characters at a time, so the line is cut into chunks of four characters each (not into four equal parts)
\1 refers to the captured group which is surrounded by the parenthesis ( ) and adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
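If what you want is four parts rather than four-character chunks, the chunk width has to be computed from the line length first. A rough sketch using fold and paste follows; split_parts is a hypothetical helper name, and it assumes single-byte characters.

```shell
# split_parts N: read lines on stdin and cut each into pieces of ceil(length/N) characters.
# Hypothetical helper; assumes plain ASCII input.
split_parts() {
  parts=$1
  while IFS= read -r line; do
    len=${#line}
    width=$(( (len + parts - 1) / parts ))   # ceiling of len / parts
    [ "$width" -gt 0 ] || width=1            # guard for empty lines
    # fold wraps the line at the computed width; paste joins the pieces with spaces
    printf '%s\n' "$line" | fold -w "$width" | paste -sd ' ' -
  done
}

out=$(printf '%s\n' 'ATGCATHLMNPHLNTPLML' | split_parts 4)
printf '%s\n' "$out"
# prints ATGCA THLMN PHLNT PLML
```

Note that a ceiling width does not always give exactly N pieces; a 9-character line with N=4 gets width 3 and therefore 3 pieces.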
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
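If you have a POSIX-compliant awk you can omit the --posix. Older GNU awk releases (before 4.0) need --posix or --re-interval to enable interval expressions such as {5}; newer releases enable them by default. Since gawk seems to be the most commonly used implementation, I've given the solution in terms of gawk.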
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a define the loop label a
/^\n/bb if we reach a newline we are done and branch to label b
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any length of line; however, if the line length is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
coreutils
A GNU coreutils alternative, field-width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
Value of cut_arg is in the above case:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile

sed - comment a matching line and x lines after it

I need help with using sed to comment out a matching line and the 4 lines that follow it in a text file.
my text file is like this:
[myprocess-a]
property1=1
property2=2
property3=3
property4=4
[anotherprocess-b]
property1=gffgg
property3=gjdl
property2=red
property4=djfjf
[myprocess-b]
property1=1
property4=4
property2=2
property3=3
I want to prefix # to every line containing the text '[myprocess' and to the 4 lines that follow it.
expected output:
#[myprocess-a]
#property1=1
#property2=2
#property3=3
#property4=4
[anotherprocess-b]
property1=gffgg
property3=gjdl
property2=red
property4=djfjf
#[myprocess-b]
#property1=1
#property4=4
#property2=2
#property3=3
Greatly appreciate your help on this.
You can do this by applying a regular expression to a set of lines:
sed -e '/myprocess/,+4 s/^/#/'
This matches each line containing 'myprocess' plus the 4 lines after it, and inserts a '#' at the beginning of each of those five lines.
(The addr,+N address form is a GNU sed extension, which is probably why it's not in any of the "sed one liner" cheatsheets I know.)
sed '/\[myprocess/ { N;N;N;N; s/^/#/gm }' input_file
Using string concatenation and default action in awk.
http://www.gnu.org/software/gawk/manual/html_node/Concatenation.html
awk '/myprocess/{f=1} f>5{f=0} f{f++; $0="#" $0} 1' foo.txt
or, if the block always ends with an empty line:
awk '/myprocess/{f=1} !NF{f=0} f{$0="#" $0} 1' foo.txt
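A variant of the counter idea with the number of following lines passed in as a variable, in case it is not always 4. The variable names n and c are my own, and the sample input is shortened, so n=2 is used here.

```shell
input='[myprocess-a]
p1=1
p2=2
[anotherprocess-b]
p1=x
[myprocess-b]
p1=1'

# On a header match, arm the counter for the header plus n following lines;
# while the counter is positive, prefix the line with # and count down.
out=$(printf '%s\n' "$input" |
  awk -v n=2 '/\[myprocess/ { c = n + 1 } c > 0 { $0 = "#" $0; c-- } 1')
printf '%s\n' "$out"
# prints:
# #[myprocess-a]
# #p1=1
# #p2=2
# [anotherprocess-b]
# p1=x
# #[myprocess-b]
# #p1=1
```

For the file in the question you would pass -v n=4. Matching on the literal [ keeps headers like [anotherprocess-b] from ever being considered.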