delete string between two strings in one line - perl

I am trying to delete everything between angle brackets <>. I can do it if a line has only one <...> pair, but if a line has more than one, it deletes everything between the outermost < and >.
echo "hi, <how> are you" | sed 's/<.*>//'
result: hi, are you
echo "hi, <how> are <you>? " | sed 's/<.*>//'
result: hi, ?
The first echo works fine, but if a sentence has more than one <> pair, the pairs are not matched separately.
expected input: 1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>
expected outcome: 1 2 3 4 .... 1000
thanks

Using awk:
# using gsub - recommended
$ echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk 'gsub(/<[^>]*>/,"")'
1 2 3 4 ...... 1000
# OR using FS and OFS
$ echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk -F'<[^>]*>' -v OFS='' '$1=$1'
1 2 3 4 ...... 1000

The following awk will be helpful to you:
echo "hi, <how> are <you>? " | awk '{for(i=1;i<=NF;i++){if($i~/<.*>/){$i=""}}} 1'
OR
echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk '{for(i=1;i<=NF;i++){if($i~/<.*>/){$i=""}}} 1'
Explanation: Simply go through all the fields of the line (a for loop from i=1 to NF, the number of fields); if a field's value matches the regex <.*>, nullify it by setting it to the empty string.

* matches zero or more characters greedily, so <.*> spans from the first < to the last >. Use the negated character class instead, with the g flag to remove every occurrence: <[^>]*>
echo "hi, <how> are <you>? " | sed 's/<[^>]*>//g'
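A short sketch contrasting the two on the question's numbered sample; consuming the space before each bracket keeps the survivors single-spaced:

```shell
# Greedy: one match from the FIRST "<" to the LAST ">"
echo "1 <a> 2 <b> 3 <c>" | sed 's/<.*>//'
# Negated class + g flag: each <...> removed separately,
# and the leading " *" also eats the space before each bracket
echo "1 <a> 2 <b> 3 <c>" | sed 's/ *<[^>]*>//g'   # -> 1 2 3
```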


Replace tab inside double quotes as space Sed, Regexp

Hello Sed/Regexp experts, Need some help,
I have a file with below contents, need to replace tabs as space inside double quotes.
Note \t is tab.
1 \t 2 \t 3 \t "4 \t 5 \t 6" \t 7
Expected output:
1 \t 2 \t 3 \t "4 5 6" \t 7
I tried matching the quotes and replacing the tabs with spaces, but it replaces the whole content inside the quotes.
sed '/\s/s/".*"/" "/' 1.txt
Thanks
Here is a sed solution using label:
sed -E -e :a -e 's/("[^\t"]*)\t([^"]*")/\1 \2/; ta' file
1 2 3 "4 5 6" 7
However, it is easier to do this in awk by using " as the field delimiter and changing every even-numbered field (which is the part inside quotes):
awk '
BEGIN {FS=OFS="\""}
{
for (i=2; i<=NF; i+=2)
gsub(/\t/, " ", $i)
} 1' file
1 2 3 "4 5 6" 7
With your shown samples only, please try the following awk code. Written and tested in GNU awk, using awk's RT variable to deal with the values between "...".
awk -v RS='"[^*]*"' 'RT{gsub(/\t/,OFS,RT);ORS=RT;print};END{ORS="";print}' Input_file
with python, using indexes and regex (re.sub):
import re
st = r'1 2 3 "4 5 6" 7'
l_ind = st.index('"')
r_ind = st.rindex('"')
new_st = st[:l_ind] + re.sub(r'\s+', r' ', st[l_ind:r_ind]) + st[r_ind:]
1 2 3 "4 5 6" 7
another version using re.sub and re.findall
re.sub(r'".*?"',re.sub(r'\s+', r' ', re.findall(r'".*?"', st)[0]), st)
1 2 3 "4 5 6" 7
re.findall(r'".*?"', st)[0] - finds the first double-quoted string
re.sub(r'\s+', r' ', ...) - compresses each run of whitespace to one space inside the double-quoted string
re.sub(r'".*?"', ...) - substitutes the original double-quoted string with the new one
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"\t]*"[^"]*)*"[^"\t]*)\t/\1 /;ta' file
Replace the first tab within matched double quotes with a space and repeat until failure.
N.B. This solution caters for lines with multiple matching double quotes.
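To try the first sed loop with literal tabs (a sketch, assuming GNU sed for \t support in the pattern):

```shell
# Create the sample line with real tabs (\t in the question stands for a tab)
printf '1\t2\t3\t"4\t5\t6"\t7\n' > file
# :a marks a label; "ta" jumps back to it whenever the substitution succeeded,
# so one tab inside the quotes is replaced per pass until none remain
sed -E -e :a -e 's/("[^\t"]*)\t([^"]*")/\1 \2/; ta' file
```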

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar', but it can sit among other words and ';' separators, so using awk $25=='foobar' on the whole column is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!
Do not use xargs and sed, use the other tool common on so many machines and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
EDIT1 to match extended question:
Use what @shellter and @Jotne suggested but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] allows only ;, ' or a space immediately before and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt
To get the lines where the 25th field is exactly foobar:
awk '$25=="foobar"' input.txt
$25 the 25th field
== equal to
"foobar" the literal string to match
Since no action is specified, the default prints the complete line, same as {print $0}
Or
awk '$25~/^foobar$/' input.txt
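A quick sketch of the exact-match requirement (using field 2 instead of 25 for brevity; the data and file name are made up):

```shell
printf 'max foobar john\ntom foobar35 x\n' > cols.txt
awk '$2=="foobar"' cols.txt    # exact: prints only "max foobar john"
awk '$2~/foobar/' cols.txt     # substring: would also print the foobar35 line
```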
This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file

Find duplicate records in file

I have a text file with lines like below:
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3
How can I find duplicate domains like domainx.com with sed or awk?
With GNU awk you can do:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file
1 domainz.com
2 domainx.com
1 domainy.de
You can use sort to order the output i.e. ascending numerical with -n:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file | sort -n
1 domainy.de
1 domainz.com
2 domainx.com
Or just to print duplicate domains:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a)if (a[k]>1) print k}' file
domainx.com
Here:
sed -n '/#domainx.com/ p' yourfile.txt
(Actually, grep is what you should use for that.)
Want to number them? Add | nl to the end.
Using the mini list you gave, the sed line with | nl outputs this:
1 name1#domainx.com, name1
2 name3#domainx.com, name3
What if you need to count how many repetitions each domain has? Try this:
for line in `sed -n 's/.*#\([^,]*\).*/\1/p' yourfile.txt|sort|uniq` ; do
echo "$line `grep -c $line yourfile.txt`"
done
The output of that is:
domainx.com 2
domainy.de 1
domainz.com 1
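The same counts can be produced in a single pass, avoiding the per-domain grep (a sketch; the output format is count-first, as printed by uniq -c):

```shell
# Extract the domain after '#', sort, and let uniq -c count each run
sed -n 's/.*#\([^,]*\).*/\1/p' yourfile.txt | sort | uniq -c
```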
Print only duplicate domains
awk -F"[#,]" 'a[$2]++==1 {print $2}' file
domainx.com
Print a "*" in front of lines that repeat an earlier domain:
awk -F"[#,]" '{a[$2]++;if (a[$2]>1) f="* ";print f$0;f=x}' file
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
* name3#domainx.com, name3
This version paints all lines with a duplicated domain in red:
awk -F"[#,]" '{a[$2]++;b[NR]=$0;c[NR]=$2} END {for (i=1;i<=NR;i++) print ((a[c[i]]>1)?"\033[1;31m":"\033[0m") b[i] "\033[0m"}' file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red
Improved version (reading the file twice):
awk -F"[#,]" 'NR==FNR{a[$2]++;next} a[$2]>1 {$0="\033[1;31m" $0 "\033[0m"}1' file file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red
If you have GNU grep available, you can use the PCRE matcher to do a positive look-behind to extract the domain name. After that sort and uniq can find duplicate instances:
<infile grep -oP '(?<=#)[^,]*' | sort | uniq -d
Output:
domainx.com

Using SED to delete a line that has multiple fields to match

I have a file that has 12 columns of data. I would like to delete / remove the entire line if column 5 equals "A" and column 12 equals "Z". Is this possible using SED?
You can. Suppose your columns are separated by spaces:
sed -i -e '/^\([^ ]* *\)\{4\}A *\([^ ]* *\)\{6\}Z/d' file
The -i flag is used to edit the file in place.
The pattern [^ ]* * matches zero or more (indicated by the asterisk) characters that aren't spaces (indicated by the space character after the ^ in the brackets) followed by zero or more spaces.
Placing this pattern between backslashed parenthesis, we can group it into a single expression, and we can then use backslashed braces to repeat the expression. Four times initially, then match an A followed by spaces, then the pattern again repeated six times, then the Z.
Hope this helps =)
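A quick check on made-up 12-column data, using a ^ anchor so the column count starts at the beginning of the line (without it, an A/Z pair at later columns could also trigger the delete):

```shell
# First line should be deleted (col 5 = A, col 12 = Z), second kept
printf '1 2 3 4 A 6 7 8 9 10 11 Z\n1 2 3 4 B 6 7 8 9 10 11 Z\n' > file
sed '/^\([^ ]* *\)\{4\}A *\([^ ]* *\)\{6\}Z/d' file
```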
You can do this with sed, but it is much easier with awk:
awk '! ( $5 == "A" && $12 == "Z" )' input-file
or
awk '$5 != "A" || $12 != "Z"' input-file
perl -F -ane 'print unless($F[4] eq "A" and $F[11] eq "Z")' your_file
tested below:
> cat temp
1 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 A 6 7 8 9 10 11 Z
> perl -F -ane 'print unless($F[4] eq "A" and $F[11] eq "Z")' temp
1 2 3 4 5 6 7 8 9 10 11 12
>

Prevent column shift after character removal

Currently I am using the following oneliner for the removal of special characters:
sed 's/[-$*=+()]//g'
However sometimes it occurs that a column only contains the special character *.
How can I prevent the column from shifting if it only contains *?
Would it be possible to use a placeholder, so that whenever columns two and/or four consist only of *, each * is replaced by N?
From:
6 cc-g*$ 10 cc+c
6 c$c$*g$q 10 ***
6 *c*c$$qq 10 ccc
6 ** 10 c$cc
6 ** 10 *
To possibly:
6 ccg 10 ccc
6 ccgq 10 NNN
6 ccqq 10 ccc
6 NN 10 ccc
6 NN 10 N
Try this in awk:
awk '{ if($2 ~ /^[*]+$/) { gsub ( /[*]/,"N",$2); } if($4 ~ /^[*]+$/ ){ gsub ( /[*]/,"N",$4); } print }' your_file.txt | sed 's/[-$*=+()]//g'
I hope this will help you.
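Running that pipeline over the sample appears to reproduce the expected output (a sketch; the file name is assumed):

```shell
cat > your_file.txt <<'EOF'
6 cc-g*$ 10 cc+c
6 c$c$*g$q 10 ***
6 *c*c$$qq 10 ccc
6 ** 10 c$cc
6 ** 10 *
EOF
# awk turns all-* fields 2 and 4 into N's first, so the later sed
# only strips special characters from the mixed fields
awk '{ if($2 ~ /^[*]+$/) gsub(/[*]/,"N",$2); if($4 ~ /^[*]+$/) gsub(/[*]/,"N",$4); print }' your_file.txt |
  sed 's/[-$*=+()]//g'
```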
One way using perl. Traverse all fields of each line and substitute special characters unless the field only has * characters. After that print them separated with one space.
perl -ane '
for my $pos ( 0 .. $#F ) {
$F[ $pos ] =~ s/[-\$*=+()]//g unless $F[ $pos ] =~ m/\A\*+\Z/;
}
printf qq|%s\n|, join qq| |, @F;
' infile
Assuming infile has the content of the question, output will be:
6 ccg 10 ccc
6 ccgq 10 ***
6 ccqq 10 ccc
6 ** 10 ccc
6 ** 10 *
This might work for you (GNU sed):
sed 'h;s/\S*\s*\(\S*\).*/\1/;:a;/^\**$/y/*/N/;s/[*$+=-]//g;H;g;/\n.*\n/bb;s/\(\S*\s*\)\{3\}\(\S*\).*/\2/;ba;:b;s/^\(\S*\s*\)\(\S*\)\([^\n]*\)\n\(\S*\)/\1\4\3/;s/\(\S*\)\n\(.*\)/\2/' file