Find duplicate records in file

Find duplicate records in file - sed

I have a text file with lines like below:
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3
How can I find duplicate domains like domainx.com with sed or awk?

With GNU awk you can do:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file
1 domainz.com
2 domainx.com
1 domainy.de
You can use sort to order the output i.e. ascending numerical with -n:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file | sort -n
1 domainy.de
1 domainz.com
2 domainx.com
Or just to print duplicate domains:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a)if (a[k]>1) print k}' file
domainx.com

Here:
sed -n '/#domainx.com/ p' yourfile.txt
(Actually is grep what you should use for that)
Would you like to count them? add an |nl to the end.
Using that minilist you gave, using the sed line with |nl, outputs this:
1 name1#domainx.com, name1
2 name3#domainx.com, name3
What if you need to count how many repetitions have each domain? For that try this:
for line in `sed -n 's/.*#\([^,]*\).*/\1/p' yourfile.txt|sort|uniq` ; do
echo "$line `grep -c $line yourfile.txt`"
done
The output of that is:
domainx.com 2
domainy.de 1
domainz.com 1

Print only duplicate domains
awk -F"[#,]" 'a[$2]++==1 {print $2}'
domainx.com
Print a "*" in front of line that are listed duplicated.
awk -F"[#,]" '{a[$2]++;if (a[$2]>1) f="* ";print f$0;f=x}'
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
* name3#domainx.com, name3
This version paints all line with duplicate domain in color red
awk -F"[#,]" '{a[$2]++;b[NR]=$0;c[NR]=$2} END {for (i=1;i<=NR;i++) print ((a[c[i]]>1)?"\033[1;31m":"\033[0m") b[i] "\033[0m"}' file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red
Improved version (reading the file twice):
awk -F"[#,]" 'NR==FNR{a[$2]++;next} a[$2]>1 {$0="\033[1;31m" $0 "\033[0m"}1' file file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red

If you have GNU grep available, you can use the PCRE matcher to do a positive look-behind to extract the domain name. After that sort and uniq can find duplicate instances:
<infile grep -oP '(?<=#)[^,]*' | sort | uniq -d
Output:
domainx.com

Related

Extract substrings between strings

I have a file with text as follows:
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
I want to extract all strings between ### .
My desired output would be something like this:
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
I have tried the following:
grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'
This almost works but only seems to grab the first instance per line, so the first line in my output only grabs
interest1 moreinterest1
rather than
interest1 moreinterest1
interest2

Here is a single awk command to achieve this that makes ### field separator and prints each even numbered field:
awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Here is an alternative grep + sed solution:
grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'
This assumes there are no # characters in between ### markers.

With GNU awk for multi-char RS:
$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6

You can use pcregrep:
pcregrep -o1 '###(.*?)###' file
The regex - ###(.*?)### - matches ###, then captures into Group 1 any zero o more chars other than line break chars, as few as possible, and ### then matches ###.
o1 option will output Group 1 value only.
See the regex demo online.

sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file
Replacing "###" with newline, D, then conditionally branching to P if a second replacement of "###" is successful.

This might work for you (GNU sed):
sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file
Replace all occurrences of ###'s by newlines.
If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.

xargs and sed to extract specific lines

I want to extract lines that have a particular pattern, in a certain column. For example, in my 'input.txt' file, I have many columns. I want to search the 25th column for 'foobar', and extract only those lines that have 'foobar' in the 25th column. I cannot do:
grep foobar input.txt
because other columns may also have 'foobar', and I don't want those lines. Also:
the 25th column will have 'foobar' as part of a string (i.e. it could be 'foobar ; muller' or 'max ; foobar ; john', or 'tom ; foobar35')
I would NOT want 'tom ; foobar35'
The word in column 25 must be an exact match for 'foobar' (and ; so using awk $25=='foobar' is not an option.
In other words, if column 25 had the following lines:
foobar ; muller
max ; foobar ; john
tom ; foobar35
I would want only lines 1 & 2.
How do I use xargs and sed to extract these lines? I am stuck at:
cut -f25 input.txt | grep -nw foobar | xargs -I linenumbers sed ???
thanks!

Do not use xargs and sed, use the other tool common on so many machines and do this:
awk '{if($25=="foobar"){print NR" "$0}}' input.txt
print NR prints the line number of the current match so the first column of the output will be the line number.
print $0 prints the current line. Change it to print $25 if you only want the matching column. If you only want the output, use this:
awk '{if($25=="foobar"){print $0}}' input.txt
EDIT1 to match extended question:
Use what #shellter and #Jotne suggested but add string delimiters.
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' '$25~/foobar/' input.txt
[^ ]* matches all characters that are not a space.
'[^']*' matches everything inside single quotes.
EDIT2 to exclude everything but foobar:
awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$25~/[;' ]foobar[;' ]/" input.txt
[;' ] only allows ;, ' and in front and after foobar.
Tested with this file:
1 "1 ; 1" 4
2 'kom foobar' 33
3 "ll;3" 3
4 '1; foobar' asd
7 '5 ;foobar' 2
7 '5;foobar' 0
2 'kom foobar35' 33
2 'kom ; foobar' 33
2 'foobar ; john' 33
2 'foobar;paul' 33
2 'foobar1;paul' 33
2 'foobarli;paul' 33
2 'afoobar;paul' 33
and this command awk -vFPAT="([^ ]*)|('[^']*')" -vOFS=' ' "\$2~/[;' ]foobar[;' ]/" input.txt

To get the line with foobar as part of the 25 field.
awk '$25=="foobar"' input.txt
$25 25th filed
== equal to
"foobar"
Since no action spesified, print the complete line will be done, same as {print $0}
Or
awk '$25~/^foobar$/' input.txt

This might work for you (GNU sed):
sed -En 's/\S+/\n&\n/25;s/\n(.*foobar.*)\n/\1/p' file
Surround the 25th field by newlines and pattern match for foobar between newlines.
If you only want to match the word foobar use:
sed -En 's/\S+/\n&\n/25;s/\n(.*\<foobar\>.*)\n/\1/p' file

Replace complete line getting number from variable

I have a file with a certain line, let's say...
AAA BBB CCC
I need to replace that entire line, after finding it, so I did:
q1=`grep -Hnm 1 "AAA" FILE | cut -d : -f 2`
That outputs me the line number of the first occurrence (in q1), because it has more than one occurrence, now, here comes my problem... In a previous step I was using this sed to replace a certain line in the file:
sed -e '3s/.*/WHATEVER/' FILE
To replace (in the example, line 3) the full line with WHATEVER, but now if I try to use $q1 instead of the "3" indicating the line number it doesn't work:
sed -e '$q1s/.*/WHATEVER/' FILE
It's probably a stupid syntax mistake, any help is welcome; thanks in advance

Try:
sed -e "${q1}s/.*/WHATEVER/" FILE

I'd use awk for this:
awk '/AAA/ && !r {print "WHATEVER"; r=1; next} {print}' <<END
a
b
AAA BBB CCC
d
e
AAA foo bar
f
END
a
b
WHATEVER
d
e
AAA foo bar
f

If you want to replace the first occurrence of a string in a file, you could use this awk script:
awk '/occurrence/ && !f++ {print "replacement"; next}1' file
The replacement will only be printed the first time, as !f++ will only evaluate to true once (on subsequent evaluations, f will be greater than zero so !f will be false. The 1 at the end is always true, so for each line other than the matched one, awk does the default action and prints the line.
Testing it out:
$ cat file
blah
blah
occurrence 1 and some other stuff
blah
blah
some more stuff and occurrence 2
blah
$ awk '/occurrence/ && !f++ {print "replacement"; next}1' file
blah
blah
replacement
blah
blah
some more stuff and occurrence 2
blah
The "replacement" string could easily be set to the value of a shell variable in the following way:
awk -v rep="$r" '/occurrence/ && !f++ {print rep; next}1' file
where $r is a shell variable.
Using the same file as above and the example variable in your comment:
$ q2="x=\"Second\""
$ awk -v rep="$q2" '/occurrence/ && !f++ {print rep; next}1' file
blah
blah
x="Second"
stuff
blah
blah
some more stuff and occurrence 2
blah

sed "${q1} c\
WHATEVER" YourFile
but you can directly use
sed '/YourPatternToFound/ {s/.*/WHATEVER/
:a
N;$!ba
}' YourFile

search and print the value inside tags using script

I have a file like this. abc.txt
<ra><r>12.34</r><e>235</e><a>34.908</a><r>23</r><a>234.09</a><p>234</p><a>23</a></ra>
<hello>sadfaf</hello>
<hi>hiisadf</hi>
<ra><s>asdf</s><qw>345</qw><a>345</a><po>234</po><a>345</a></ra>
What I have to do is I have to find <ra> tag and for inside <ra> tag there is <a> tag whose valeus I have to store the values inside of into some variables which I need to process further. How should I do this.?
values inside tag within tag are:
34.908,234.09,23
345,345

This awk should do:
cat file
<ra><r>12.34</r><e>235</e><a>34.908</a><r>23</r><a>234.09</a><p>234</p><a>23</a></ra><a>12344</a><ra><e>45</e><a>666</a></ra>
<hello>sadfaf</hello>
<hi>no print from this line</hi><a>256</a>
<ra><s>asdf</s><qw>345</qw><a>345</a><po>234</po><a>345</a></ra>
awk -v RS="<" -F">" '/^ra/,/\/ra/ {if (/^a>/) print $2}' file
34.908
234.09
23
666
345
345
It take in care if there are multiple <ra>...</ra> groups in one line.
A small variation:
awk -v RS=\< -F\> '/\/ra/ {f=0} f&&/^a/ {print $2} /^ra/ {f=1}' file
34.908
234.09
23
666
345
345
How does it work:
awk -v RS="<" -F">" ' # This sets record separator to < and gives a new line for every <
/^ra/,/\/ra/ { # within the record starting witn "ra" to record ending with "/ra" do
if (/^a>/) # if line starts with an "a" do
print $2}' # print filed 2
To see how changing RS works try:
awk -v RS="<" '$1=$1' file
ra>
r>12.34
/r>
e>235
/e>
a>34.908
/a>
r>23
/r>
a>234.09
/a>
p>234
...
To store it in an variable you can do as BMW suggested:
var=$(awk ...)
var=$(awk -v RS=\< -F\> '/\/ra/ {f=0} f&&/^a/ {print $2} /^ra/ {f=1}' file)
echo $var
34.908 234.09 23 666 345 345
echo "$var"
34.908
234.09
23
666
345
345
Since its many values, you can use an array:
array=($(awk -v RS=\< -F\> '/\/ra/ {f=0} f&&/^a/ {print $2} /^ra/ {f=1}' file))
echo ${array[2]}
23
echo ${var2[0]}
34.908
echo ${var2[*]}
34.908 234.09 23 666 345 345

Use gnu grep's Lookahead and Lookbehind Zero-Length Assertions
grep -oP "(?<=<ra>).*?(?=</ra>)" file |grep -Po "(?<=<a>).*?(?=</a>)"
explanation
the first grep will get the content in ra tag. Even there are several ra tags in one line, it still can identified.
The second grep get the content in a tag

Combine columns from different files

I have two text files that have these structures:
File 1
Column1:Column2
Column1:Column2
...
File 2
Column3
Column3
...
I would like to create a file that has this file structure:
Column1:Column3
Column1:Column3
...
Open to any suggestions, but it would be nice if the solution can be done from a Bash shell, or sed / awk / perl / etc...

cut -d: -f1 "File 1" | paste -d: - "File 2"
This cuts field 1 from File 1 (delimited by a colon) and pastes it with the only column in File 2, separating the output fields with a colon.

Here's an awk solution. It assumes file1 and file2 have an equal number of lines.
awk -F : '{ printf "%s:",$1; getline < "file2"; print }' < file1

Since a pure bash implementation hasn't been suggested, also assuming an equal number of lines (bash v4 only):
mapfile -t file2 < file2
index=0
while IFS=: read -r column1 _; do
echo "$column1:${file2[index]}"
((index++))
done < file1
bash v3:
IFS=$'\n' read -r -d '' file2 < file2
index=0
while IFS=: read -r column1 _; do
echo "$column1:${file2[index]}"
((index++))
done < file1