delete string between two strings in one line - perl

I am trying to delete everything between angle brackets <>. I can do it if a line has only one <...> pair, but if a line has more than one, it deletes everything between the outermost < and >.
echo "hi, <how> are you" | sed 's/<.*>//'
result: hi, are you
echo "hi, <how> are <you>? " | sed 's/<.*>//'
result: hi, ?
The first echo works fine, but if a sentence has more than one <> pair, the pairs are not matched separately.
expected input: 1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>
expected outcome: 1 2 3 4 .... 1000
thanks

Using awk:
# using gsub - recommended
$ echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk 'gsub(/<[^>]*>/,"")'
1 2 3 4 ...... 1000
# OR using FS and OFS
$ echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk -F'<[^>]*>' -v OFS='' '$1=$1'
1 2 3 4 ...... 1000

The following awk will be helpful to you:
echo "hi, <how> are <you>? " | awk '{for(i=1;i<=NF;i++){if($i~/<.*>/){$i=""}}} 1'
OR
echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk '{for(i=1;i<=NF;i++){if($i~/<.*>/){$i=""}}} 1'
Explanation: Simply go through all the fields of the line (a for loop from i=1 to NF, the number of fields); if a field's value matches the regex <.*>, nullify it by setting it to the empty string.

* matches zero or more characters greedily, so <.*> spans from the first < to the last >. Use the negated character class instead, with the g flag to remove every occurrence: <[^>]*>
echo "hi, <how> are <you>? " | sed 's/<[^>]*>//g'
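A short sketch contrasting the two on the question's numbered sample; consuming the space before each bracket keeps the survivors single-spaced:

```shell
# Greedy: one match from the FIRST "<" to the LAST ">"
echo "1 <a> 2 <b> 3 <c>" | sed 's/<.*>//'
# Negated class + g flag: each <...> removed separately,
# and the leading " *" also eats the space before each bracket
echo "1 <a> 2 <b> 3 <c>" | sed 's/ *<[^>]*>//g'   # -> 1 2 3
```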


Replace tab inside double quotes as space Sed, Regexp

Hello Sed/Regexp experts, Need some help,
I have a file with below contents, need to replace tabs as space inside double quotes.
Note \t is tab.
1 \t 2 \t 3 \t "4 \t 5 \t 6" \t 7
Expected output:
1 \t 2 \t 3 \t "4 5 6" \t 7
I tried matching the quotes and replacing the tabs with spaces, but it replaces the whole content inside the quotes.
sed '/\s/s/".*"/" "/' 1.txt
Thanks
Here is a sed solution using label:
sed -E -e :a -e 's/("[^\t"]*)\t([^"]*")/\1 \2/; ta' file
1 2 3 "4 5 6" 7
However, it is easier to do this in awk by using " as the field delimiter and changing every even-numbered field (which is the part inside quotes):
awk '
BEGIN {FS=OFS="\""}
{
for (i=2; i<=NF; i+=2)
gsub(/\t/, " ", $i)
} 1' file
1 2 3 "4 5 6" 7
With your shown samples only, please try the following awk code. Written and tested in GNU awk, using awk's RT variable to deal with the values between "...".
awk -v RS='"[^*]*"' 'RT{gsub(/\t/,OFS,RT);ORS=RT;print};END{ORS="";print}' Input_file
with python, using indexes and regex (re.sub):
import re
st = r'1 2 3 "4 5 6" 7'
l_ind = st.index('"')
r_ind = st.rindex('"')
new_st = st[:l_ind] + re.sub(r'\s+', r' ', st[l_ind:r_ind]) + st[r_ind:]
1 2 3 "4 5 6" 7
another version using re.sub and re.findall
re.sub(r'".*?"',re.sub(r'\s+', r' ', re.findall(r'".*?"', st)[0]), st)
1 2 3 "4 5 6" 7
re.findall(r'".*?"', st)[0] - finds the first double-quoted string
re.sub(r'\s+', r' ', ...) - compresses each run of whitespace to one space inside the double-quoted string
re.sub(r'".*?"', ...) - substitutes the original double-quoted string with the new one
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"\t]*"[^"]*)*"[^"\t]*)\t/\1 /;ta' file
Replace the first tab within matched double quotes with a space and repeat until failure.
N.B. This solution caters for lines with multiple matching double quotes.
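To try the first sed loop with literal tabs (a sketch, assuming GNU sed for \t support in the pattern):

```shell
# Create the sample line with real tabs (\t in the question stands for a tab)
printf '1\t2\t3\t"4\t5\t6"\t7\n' > file
# :a marks a label; "ta" jumps back to it whenever the substitution succeeded,
# so one tab inside the quotes is replaced per pass until none remain
sed -E -e :a -e 's/("[^\t"]*)\t([^"]*")/\1 \2/; ta' file
```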

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar', but it can sit among other words and ';' separators, so using awk $25=='foobar' on the whole column is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!
Do not use xargs and sed, use the other tool common on so many machines and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
EDIT1 to match extended question:
Use what @shellter and @Jotne suggested but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] allows only ;, ' or a space immediately before and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt
To get the lines where the 25th field is exactly foobar:
awk '$25=="foobar"' input.txt
$25 the 25th field
== equal to
"foobar" the literal string to match
Since no action is specified, the default prints the complete line, same as {print $0}
Or
awk '$25~/^foobar$/' input.txt
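A quick sketch of the exact-match requirement (using field 2 instead of 25 for brevity; the data and file name are made up):

```shell
printf 'max foobar john\ntom foobar35 x\n' > cols.txt
awk '$2=="foobar"' cols.txt    # exact: prints only "max foobar john"
awk '$2~/foobar/' cols.txt     # substring: would also print the foobar35 line
```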
This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file

Find duplicate records in file

I have a text file with lines like below:
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3
How can I find duplicate domains like domainx.com with sed or awk?
With GNU awk you can do:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file
1 domainz.com
2 domainx.com
1 domainy.de
You can use sort to order the output i.e. ascending numerical with -n:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file | sort -n
1 domainy.de
1 domainz.com
2 domainx.com
Or just to print duplicate domains:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a)if (a[k]>1) print k}' file
domainx.com
Here:
sed -n '/#domainx.com/ p' yourfile.txt
(Actually, grep is what you should use for that.)
Want to number them? Add | nl to the end.
Using the mini list you gave, the sed line with | nl outputs this:
1 name1#domainx.com, name1
2 name3#domainx.com, name3
What if you need to count how many repetitions each domain has? Try this:
for line in `sed -n 's/.*#\([^,]*\).*/\1/p' yourfile.txt|sort|uniq` ; do
echo "$line `grep -c $line yourfile.txt`"
done
The output of that is:
domainx.com 2
domainy.de 1
domainz.com 1
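The same counts can be produced in a single pass, avoiding the per-domain grep (a sketch; the output format is count-first, as printed by uniq -c):

```shell
# Extract the domain after '#', sort, and let uniq -c count each run
sed -n 's/.*#\([^,]*\).*/\1/p' yourfile.txt | sort | uniq -c
```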
Print only duplicate domains
awk -F"[#,]" 'a[$2]++==1 {print $2}' file
domainx.com
Print a "*" in front of lines that repeat an earlier domain:
awk -F"[#,]" '{a[$2]++;if (a[$2]>1) f="* ";print f$0;f=x}' file
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
* name3#domainx.com, name3
This version paints all lines with a duplicated domain in red:
awk -F"[#,]" '{a[$2]++;b[NR]=$0;c[NR]=$2} END {for (i=1;i<=NR;i++) print ((a[c[i]]>1)?"\033[1;31m":"\033[0m") b[i] "\033[0m"}' file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red
Improved version (reading the file twice):
awk -F"[#,]" 'NR==FNR{a[$2]++;next} a[$2]>1 {$0="\033[1;31m" $0 "\033[0m"}1' file file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red
If you have GNU grep available, you can use the PCRE matcher to do a positive look-behind to extract the domain name. After that sort and uniq can find duplicate instances:
<infile grep -oP '(?<=#)[^,]*' | sort | uniq -d
Output:
domainx.com

Using SED to delete a line that has multiple fields to match

I have a file that has 12 columns of data. I would like to delete / remove the entire line if column 5 equals "A" and column 12 equals "Z". Is this possible using SED?
You can. Suppose your columns are separated by spaces:
sed -i -e '/^\([^ ]* *\)\{4\}A *\([^ ]* *\)\{6\}Z/d' file
The -i flag is used to edit the file in place.
The pattern [^ ]* * matches zero or more (indicated by the asterisk) characters that aren't spaces (indicated by the space character after the ^ in the brackets) followed by zero or more spaces.
Placing this pattern between backslashed parenthesis, we can group it into a single expression, and we can then use backslashed braces to repeat the expression. Four times initially, then match an A followed by spaces, then the pattern again repeated six times, then the Z.
Hope this helps =)
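A quick check on made-up 12-column data, using a ^ anchor so the column count starts at the beginning of the line (without it, an A/Z pair at later columns could also trigger the delete):

```shell
# First line should be deleted (col 5 = A, col 12 = Z), second kept
printf '1 2 3 4 A 6 7 8 9 10 11 Z\n1 2 3 4 B 6 7 8 9 10 11 Z\n' > file
sed '/^\([^ ]* *\)\{4\}A *\([^ ]* *\)\{6\}Z/d' file
```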
You can do this with sed, but it is much easier with awk:
awk '! ( $5 == "A" && $12 == "Z" )' input-file
or
awk '$5 != "A" || $12 != "Z"' input-file
perl -F -ane 'print unless($F[4] eq "A" and $F[11] eq "Z")' your_file
tested below:
> cat temp
1 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 A 6 7 8 9 10 11 Z
> perl -F -ane 'print unless($F[4] eq "A" and $F[11] eq "Z")' temp
1 2 3 4 5 6 7 8 9 10 11 12
>

Prevent column shift after character removal

Currently I am using the following oneliner for the removal of special characters:
sed 's/[-$*=+()]//g'
However sometimes it occurs that a column only contains the special character *.
How can I prevent the column from shifting if it only contains *?
Would it be possible to use a placeholder, so that whenever columns two and/or four consist only of *, each * is replaced by N?
From:
6 cc-g*$ 10 cc+c
6 c$c$*g$q 10 ***
6 *c*c$$qq 10 ccc
6 ** 10 c$cc
6 ** 10 *
To possibly:
6 ccg 10 ccc
6 ccgq 10 NNN
6 ccqq 10 ccc
6 NN 10 ccc
6 NN 10 N
Try this in awk:
awk '{ if($2 ~ /^[*]+$/) { gsub ( /[*]/,"N",$2); } if($4 ~ /^[*]+$/ ){ gsub ( /[*]/,"N",$4); } print }' your_file.txt | sed 's/[-$*=+()]//g'
I hope this will help you.
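Running that pipeline over the sample appears to reproduce the expected output (a sketch; the file name is assumed):

```shell
cat > your_file.txt <<'EOF'
6 cc-g*$ 10 cc+c
6 c$c$*g$q 10 ***
6 *c*c$$qq 10 ccc
6 ** 10 c$cc
6 ** 10 *
EOF
# awk turns all-* fields 2 and 4 into N's first, so the later sed
# only strips special characters from the mixed fields
awk '{ if($2 ~ /^[*]+$/) gsub(/[*]/,"N",$2); if($4 ~ /^[*]+$/) gsub(/[*]/,"N",$4); print }' your_file.txt |
  sed 's/[-$*=+()]//g'
```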
One way using perl. Traverse all fields of each line and substitute special characters unless the field only has * characters. After that print them separated with one space.
perl -ane '
for my $pos ( 0 .. $#F ) {
$F[ $pos ] =~ s/[-\$*=+()]//g unless $F[ $pos ] =~ m/\A\*+\Z/;
}
printf qq|%s\n|, join qq| |, @F;
' infile
Assuming infile has the content of the question, output will be:
6 ccg 10 ccc
6 ccgq 10 ***
6 ccqq 10 ccc
6 ** 10 ccc
6 ** 10 *
This might work for you (GNU sed):
sed 'h;s/\S*\s*\(\S*\).*/\1/;:a;/^\**$/y/*/N/;s/[*$+=-]//g;H;g;/\n.*\n/bb;s/\(\S*\s*\)\{3\}\(\S*\).*/\2/;ba;:b;s/^\(\S*\s*\)\(\S*\)\([^\n]*\)\n\(\S*\)/\1\4\3/;s/\(\S*\)\n\(.*\)/\2/' file