remove strings between two characters with sed - sed

I have a file of this type:
16:00 [61]Al-Najma - Al-Rifaa [62]5.06 [63]3.55 [64]1.57 4
and I want to remove all the strings inside square brackets in order to obtain
16:00 Al-Najma - Al-Rifaa 5.06 3.55 1.57 4
I am trying with sed in this manner:
sed 's/\[.*]//g' file1 > file2
but I obtain
16:00 1.57 4
and with
sed 's/\[.[1234567890]]//g' file1 > file2
which does not work if the string contains more than 2 digits.
How can I do this?

Your pattern allows only one digit; adding a star behind the digit class widens it to any number of digits.
sed 's/\[.[1234567890]*]//g' file1 > file2
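For example, run against the sample line from the question this should produce the desired result:
$ echo '16:00 [61]Al-Najma - Al-Rifaa [62]5.06 [63]3.55 [64]1.57 4' | sed 's/\[.[1234567890]*]//g'
16:00 Al-Najma - Al-Rifaa 5.06 3.55 1.57 4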
alternative:
sed 's/\[[^]]*\]//g' file1 > file2
That means: after the opening "[", everything but the "]" is OK, for as many characters as there are (the "*"), up to the closing "]".
For further reading on sed: http://www.grymoire.com/Unix/Sed.html

Your first regex does not work because the quantifier * is greedy, meaning it matches as many characters as possible. Since . also matches brackets, it continues to match until the last closing bracket ] it can find.
So you basically have two options: Use a non-greedy quantifier or restrict the types of characters you can match. You have tried the second solution. I would go with using a negated character class instead:
sed 's/\[[^]]*\]//g'
sed does not have non-greedy quantifiers, but perl does:
perl -lpwe 's/\[.*?\]//g'
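For example, on the sample line from the question, the greedy and non-greedy versions should behave like this:
$ echo '16:00 [61]Al-Najma - Al-Rifaa [62]5.06 [63]3.55 [64]1.57 4' | sed 's/\[.*]//g'
16:00 1.57 4
$ echo '16:00 [61]Al-Najma - Al-Rifaa [62]5.06 [63]3.55 [64]1.57 4' | perl -lpwe 's/\[.*?\]//g'
16:00 Al-Najma - Al-Rifaa 5.06 3.55 1.57 4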

Does escaping the closing ] help?
sed 's/\[.*\]//g' file1 > file2

You already got the sed answer, so I will add another one using awk:
awk '
BEGIN {
    FS = "\\[[^]]*\\]";
    OFS = " "
}
{
    for (i=1; i<=NF; i++)
        printf "%s", $i
}
END {
    printf "\n"
}
' <<<"16:00 [61]Al-Najma - Al-Rifaa [62]5.06 [63]3.55 [64]1.57 4"
Output:
16:00 Al-Najma - Al-Rifaa 5.06 3.55 1.57 4

using awk:
$ echo '16:00 [61]Al-Najma - Al-Rifaa [62]5.06 [63]3.55 [64]1.57 4' | awk -F ' ?\[[0-9]*\]' '$1=$1'
16:00 Al-Najma - Al-Rifaa 5.06 3.55 1.57 4

This might work for you:
echo "16:00 [61]Al-Najma - Al-Rifaa [62]5.06 [63]3.55 [64]1.57 4" |
sed 's/\[[^]]*\]//g'
16:00 Al-Najma - Al-Rifaa 5.06 3.55 1.57 4

Related

Extract substrings between strings

I have a file with text as follows:
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
I want to extract all strings between the ### markers.
My desired output would be something like this:
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
I have tried the following:
grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'
This almost works but only seems to grab the first instance per line, so the first line in my output only grabs
interest1 moreinterest1
rather than
interest1 moreinterest1
interest2
Here is a single awk command to achieve this: it makes ### the field separator and prints each even-numbered field:
awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Here is an alternative grep + sed solution:
grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'
This assumes there are no # characters in between ### markers.
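For instance, on the first sample line alone it picks up both pieces, which is where the original attempt fell short:
$ echo '###interest1 moreinterest1### sometext ###interest2###' | grep -oE '###[^#]*###' | sed -E 's/^###|###$//g'
interest1 moreinterest1
interest2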
With GNU awk for multi-char RS:
$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
You can use pcregrep:
pcregrep -o1 '###(.*?)###' file
The regex - ###(.*?)### - matches ###, then captures into Group 1 zero or more chars other than line break chars, as few as possible, and then ### matches ###.
The -o1 option will output the Group 1 value only.
sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file
Replacing "###" with newline, D, then conditionally branching to P if a second replacement of "###" is successful.
This might work for you (GNU sed):
sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file
Replace all occurrences of ### with newlines.
If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.
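For instance, on the first sample line alone this should give both values, one per line:
$ echo '###interest1 moreinterest1### sometext ###interest2###' | sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}'
interest1 moreinterest1
interest2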

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar' (the field can contain several ;-separated words), so using awk $25=='foobar' is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!
Do not use xargs and sed; use awk, the other tool common on so many machines, and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
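As a minimal illustration (using a made-up three-column file and column 2 instead of 25, purely to keep the example short):
$ printf 'a foobar x\nb other y\n' | awk '{if($2=="foobar"){print NR" "$0}}'
1 a foobar x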
EDIT1 to match extended question:
Use what @shellter and @Jotne suggested but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] allows only ;, ' or a space immediately before and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command (which checks field 2 of that test file): awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt
To get the lines where the 25th field is exactly foobar:
awk '$25=="foobar"' input.txt
$25 the 25th field
== equal to
"foobar" the string foobar
Since no action is specified, the default of printing the complete line is used, same as {print $0}.
Or
awk '$25~/^foobar$/' input.txt
This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file

sed to insert characters in a specific column

Can anyone tell me how to add chr in front of every value in the first column?
file in
1 34566 34765
2 45678 45789
3 34567 34799
X 67895 66900
Y 34567 34890
file out
chr1 34566 34765
chr2 45678 45789
chr3 34567 34799
chrX 67895 66900
chrY 34567 34890
I can't figure out how to make sed -i apply to a specific column. I'm not good with the syntax so if you could break down your explanation I would be grateful. Also, would it be better to use awk for this?
With sed:
sed 's/^/chr/' file.in > file.out
You don't need the -i flag because you aren't overwriting the input file file.in.
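If you do want to change the file in place (as the -i in the question suggests), GNU sed can do it directly; note that BSD/macOS sed expects a (possibly empty) backup suffix after -i:
sed -i 's/^/chr/' file.in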
With awk:
awk '{print "chr"$0}' file.in > file.out
sed 's|^|chr|' file_in > file_out
This does a substitution (s) at the beginning of each line (^), replacing it with the characters "chr". The pipes (|) are just the delimiters of the substitution.
With awk:
awk '{print "chr"$0}' file
To store the changes back to the file:
awk '{print "chr"$0}' file > tmp && mv tmp file
$ cat file.in
1 34566 34765
2 45678 45789
3 34567 34799
X 67895 66900
Y 34567 34890
$ awk '$1="chr"$1 {print}' file.in
chr1 34566 34765
chr2 45678 45789
chr3 34567 34799
chrX 67895 66900
chrY 34567 34890
In essence, you assign the value of "chr"$1 back to $1, regardless of whether there are spaces before column 1 or not, and still print out the rest of the line.
sed 's/^\s*/chr/'
just to be sure that no whitespace comes before your value. If you are sure there is none, there is no need for the \s*.
awk '{print "chr"$0}'
Works for the file in question because the value to prefix sits at the very start of each line. For a file with an explicit field separator (a pipe-delimited file, for instance) you can use:
awk 'BEGIN { FS = "|";OFS="|"} {$97="chr"$97;print}'
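For instance, the same idea applied to column 2 of a small pipe-delimited sample (the column number is only for illustration):
$ echo 'a|1|x' | awk 'BEGIN { FS = "|";OFS="|"} {$2="chr"$2;print}'
a|chr1|x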

sed — joining a range of selected lines

I'm a beginner to sed. I know that it's possible to apply a command (or a set of commands) to a certain range of lines like so
sed '/[begin]/,/[end]/ [some command]'
where [begin] is a regular expression that designates the beginning line of the range and [end] is a regular expression that designates the ending line of the range (and is included in the range).
I'm trying to use this to specify a range of lines in a file and join them all into one line. Here's my best try, which didn't work:
sed '/[begin]/,/[end]/ {
N
s/\n//
}
'
I'm able to select the set of lines I want without any problem, but I just can't seem to merge them all into one line. If anyone could point me in the right direction, I would be really grateful.
One way using GNU sed:
sed -n '/begin/,/end/ { H;g; s/^\n//; /end/s/\n/ /gp }' file.txt
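For example, fed a small sample with the begin/end markers on their own lines (the same construction used in a later answer below), this should print the joined range only:
$ seq 9 | sed -e '3s/^/[begin]\n/' -e '6s/$/\n[end]/' | sed -n '/begin/,/end/ { H;g; s/^\n//; /end/s/\n/ /gp }'
[begin] 3 4 5 6 [end]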
This is straightforward if you want to select some lines and join them. Use Steve's answer or my pipe-to-tr alternative:
sed -n '/begin/,/end/p' | tr -d '\n'
It becomes a bit trickier if you want to keep the other lines as well. Here is how I would do it (with GNU sed):
join.sed
/\[begin\]/ {
:a
/\[end\]/! { N; ba }
s/\n/ /g
}
So the logic here is:
When a [begin] line is encountered, start collecting lines into the pattern space with a loop.
When [end] is found, stop collecting and join the lines.
Example:
seq 9 | sed -e '3s/^/[begin]\n/' -e '6s/$/\n[end]/' | sed -f join.sed
Output:
1
2
[begin] 3 4 5 6 [end]
7
8
9
I like your question. I also like Sed. Regrettably, I do not know how to answer your question in Sed; so, like you, I am watching here for the answer.
Since no Sed answer has yet appeared here, here is how to do it in Perl:
perl -we 'my $flag = 0; while (<>) { chomp; if (/\[begin\]/) {$flag = 1;} print if $flag; if (/\[end\]/) {print "\n" if $flag; $flag = 0;} } print "\n" if $flag;'
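A quick check on the same constructed sample used above; note that this variant drops the lines outside the range and joins without spaces:
$ seq 9 | sed -e '3s/^/[begin]\n/' -e '6s/$/\n[end]/' | perl -we 'my $flag = 0; while (<>) { chomp; if (/\[begin\]/) {$flag = 1;} print if $flag; if (/\[end\]/) {print "\n" if $flag; $flag = 0;} } print "\n" if $flag;'
[begin]3456[end]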

divide each line in equal part

I would be happy if anyone could suggest a command (a sed or awk one-liner) to divide each line of a file into an equal number of parts. For example, divide each line into 4 parts.
Input:
ATGCATHLMNPHLNTPLML
Output:
ATGCA THLMN PHLNT PLML
This should work using GNU sed:
sed -r 's/(.{4})/\1 /g'
-r is needed to use extended regular expressions
.{4} matches every four characters
\1 refers to the group captured by the parentheses ( ); the replacement adds a space behind this group
g makes sure that the replacement is done as many times as possible on each line
A test; this is the input and output in my terminal:
$ echo "ATGCATHLMNPHLNTPLML" | sed -r 's/(.{4})/\1 /g'
ATGC ATHL MNPH LNTP LML
I suspect awk is not the best tool for this, but:
gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4);
gsub( ".{"l"}", "& " ) } 1' input-file
If you have a POSIX-compliant awk you can omit the --posix, but --posix is necessary for GNU awk, and since that seems to be the most commonly used implementation, I've given the solution in terms of gawk.
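A test run on the sample line should give the desired four parts:
$ echo "ATGCATHLMNPHLNTPLML" | gawk --posix '{ l = sprintf( "%d", 1 + (length()-1)/4); gsub( ".{"l"}", "& " ) } 1'
ATGCA THLMN PHLNT PLML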
This might work for you (GNU sed):
sed 'h;s/./X/g;s/^\(.*\)\1\1\1/\1 \1 \1 \1/;G;s/\n/&&/;:a;/^\n/bb;/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta;s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta;:b;s/\n//g' file
Explanation:
h copy the pattern space (PS) to the hold space (HS)
s/./X/g replace every character in the PS with the same non-space character (in this case X)
s/^\(.*\)\1\1\1/\1 \1 \1 \1/ split the line into 4 parts (space separated)
G append a newline followed by the contents of the HS to the PS
s/\n/&&/ double the newline (to be later used as markers)
:a introduce a loop label
/^\n/bb if we reach a newline we are done and branch to label b
/^ /s/ \(.*\n.*\)\n\(.\)/\1 \n\2/;ta; if the first character is a space add a space to the real line at this point and repeat
s/^.\(.*\n.*\)\n\(.\)/\1\2\n/;ta any other character just bump along and repeat
:b;s/\n//g all done just remove the markers and print out the result
This works for any length of line; however, if the line is not exactly divisible by 4, the last portion will contain the remainder as well.
perl
perl might be a better choice here:
export cols=4
perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
This re-calculates field-width for every line.
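With cols exported as above, a test run should look like this (note there is a trailing space after the last chunk):
$ echo "ATGCATHLMNPHLNTPLML" | perl -ne 'chomp; $fw = 1 + int length()/$ENV{cols}; while(/(.{1,$fw})/gm) { print $1 . " " } print "\n"'
ATGCA THLMN PHLNT PLML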
coreutils
A GNU coreutils alternative; the field width is chosen based on the first line of infile:
cols=4
len=$(( $(head -n1 infile | wc -c) - 1 ))
fw=$(echo "scale=0; 1 + $len / $cols" | bc)
cut_arg=$(paste -d- <(seq 1 $fw $len) <(seq $fw $fw $len) | head -c-1 | tr '\n' ',')
The value of cut_arg in the above case is:
1-5,6-10,11-15,16-
Now cut the line into appropriate chunks:
cut --output-delimiter=' ' -c $cut_arg infile