Count of unique lines based on first field in file - sed

I am trying to get a count of unique lines, output to a file, based on the first
field, where the input lines look like:
Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
Forms.js /forms/Forms1.js http://www.gumby.com/test.htm 404
Forms.js /forms/Forms2.js http://www.gumby.com/test.htm 404
Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
Interpret.js /forms/Interpret2.js http://www.gumby.com/test.htm 404
Interpret.js /forms/Interpret3.js http://www.gumby.com/test.htm 404
To something like this:
3 Forms.js /forms/Forms.js http://www.gumby.com.mx/test.htm 404
3 Interpret.js /forms/Interpret.js http://www.gumby.com.mx/test.htm 404
I have been trying various combinations of sort and uniq, but haven't hit on it yet.
I can get distinct lines using the whole line, but I just want the first field.
I am currently using Cygwin. I am not awk-literate, but I
suspect that is the route to go. Does anyone have a handy solution?

This:
<infile awk '{ h[$1]++ } END { for(k in h) print h[k], k }'
Will get you:
3 Forms.js
3 Interpret.js
If you also want to keep the first hit use:
awk '!h[$1] { g[$1]=$0 } { h[$1]++ } END { for(k in g) print h[k], g[k] }'
Output:
3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
Tested with GNU awk.
Note that this does not require input to be sorted. Also note that the results are unordered.
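If you need the results in a particular order, piping the output through sort is enough; for example, ordering by the key field (a small addition, not part of the original answer):
<infile awk '!h[$1] { g[$1]=$0 } { h[$1]++ } END { for(k in g) print h[k], g[k] }' | sort -k2
Use sort -rn instead to order by count, highest first.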

Awk is the tool for this but if you want to be clever with uniq:
$ column -t file | uniq -w12 -c
3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
column -t aligns all the columns so we get a fixed width for column one.
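The -w12 corresponds to the padded width of the first column, i.e. the length of the longest first field ("Interpret.js" is 12 characters). If you are unsure what width to pass, a quick awk one-liner can find it (a sketch):
$ awk '{ if (length($1) > m) m = length($1) } END { print m }' file
12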
Or, as a hack if column isn't available, append the first column to the end of each line with awk, use uniq -c -f4 to count uniqueness on that last column, and then use awk again to drop it and print the original fields.
$ awk '{print $0, $1}' file | uniq -c -f4 | awk '{$NF=""; NF--; print}'
3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
It would be nice if uniq -f worked like -f4,4 or -f1,1.
Or you could use rev to reverse each line so uniq -c -f3 can be used, and then rev back (you get the count at the end, however, and if you don't have column you probably don't have rev):
$ rev file | uniq -c -f3 | rev
Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404 3
Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404 3
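If you do want the count back at the front, GNU awk can move the last field for you (a sketch building on the command above; NF-- rebuilds the line without the trailing count):
$ rev file | uniq -c -f3 | rev | awk '{c=$NF; NF--; print c, $0}'
3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404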

$ awk '!c[$1]++{v[$1]=$0} END{for (i in c) print c[i],v[i]}' file
3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
The above uses the common awk idiom '!array[$n]++' to tell whether a key value ($n, where the key can be $0, $1, $4 $5, or any other combination of fields) has been seen before.
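As a minimal illustration of the idiom, awk '!seen[$1]++' file on its own prints just the first line for each distinct first field, because the expression is true only the first time a key is encountered:
$ awk '!seen[$1]++' file
Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404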

Assuming file.txt contains your sample input:
sort file.txt | awk -f counts.awk
returns:
3:Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3:Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
awk script file:
cat counts.awk
# output format is:
#+ TimesFirstFieldIsRepeated:FirstMatchingLineContents
BEGIN {
    plmatch="";   # first field of the previous line
    outline="";   # first line of the current group
    n=1;          # occurrences of the current first field
}
{
    # first field changed: flush the previous group (not on the very first line)
    if($1 != plmatch && NR != 1)
    {
        print n ":" outline;
        n=1;
    }
    if($1 == plmatch)
    {
        n+=1;
    }
    else
    {
        outline=$0;   # remember the first line of the new group
    }
    plmatch=$1;
}
END {
    # flush the last group
    print n ":" outline;
}

I'd just cut -f 1 | uniq -c. That won't give you the whole line, but if the lines differ, printing any one of them wouldn't make much sense anyway. Depends on what you want to achieve.
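Spelled out, and assuming the fields are space-separated, that would look something like the following; the sort is there because uniq only merges adjacent lines (a sketch):
$ cut -d' ' -f1 file | sort | uniq -c
      3 Forms.js
      3 Interpret.js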

You can count occurrences of the first field with cut, but what do you want to print after this field?
cat file | cut -d " " -f 1 | uniq -c


Unique count of a value in a zipped file based on other constraints on surrounding lines

I have a log file.
It has data like this:
Operation=ABC,
CustomerId=12,
..
..
..
Counters=qwe=1,wer=2,mbn=4,Hello=0,
----
Operation=CQW,
CustomerId=10,
Time=blah,
..
..
Counters=qwe=1,wer=2,mbn=4,Hello=0,jvnf=2,njfs=4
----
Operation=ABC,
CustomerId=12,
Metric=blah
..
..
Counters=qwe=1,wer=2,mbn=4,Hello=1, uisg=2,vieus=3
----
Operation=ABC,
CustomerId=12,
Metric=blah
..
..
Counters=qwe=1,wer=2,mbn=4,Hello:0, uisg=2,vieus=3
----
Now, I want to find all the unique CustomerIds where Operation=ABC and Hello=0 (in Counters).
All of this info is contained in .gz files in a directory.
So here is what I've tried, just to retrieve the number of times "Hello=0" appears in the lines near Operation=ABC.
zgrep -A 20 "Operation=ABC" * | grep "Hello=0" | wc -l
This gave me the number of times that "Hello=0" was found for Operation=ABC. (about 250)
In order to get unique customer Ids, I tried this:
zgrep -A 20 "Operation=ABC" * | grep "Hello=0" -B 10 | grep "CustomerId" | uniq -c
This gave me no results. What am I getting wrong here?
Actually, this works. I was just being impatient.
zgrep -A 20 "Operation=ABC" * | grep "Hello=0" -B 10 | grep "CustomerId" | uniq -c
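One caveat, not in the original post: uniq -c only collapses adjacent duplicates, so if the same CustomerId can show up in non-adjacent blocks, adding a sort first makes the counts reliable (a sketch):
zgrep -A 20 "Operation=ABC" * | grep "Hello=0" -B 10 | grep "CustomerId" | sort | uniq -c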
You need not use that many greps and zgreps; we could do it within a single awk.
awk -F'=' '
/^--/{                        # end of a record block
  if(val==3){                 # Operation=ABC, a new CustomerId and Hello=0 all seen
    print value
  }
  val=value=""
}
/Operation=ABC/{
  val++
}
/CustomerId/{
  if(!a[$NF]++){              # count this CustomerId only the first time it is seen
    val++
  }
}
/Hello=0/{
  val++
}
{
  value=(value?value ORS:"")$0   # collect the lines of the current block
}
END{
  if(val==3 && value){           # flush the last block if it matched
    print value
  }
}' <(gzip -dc input_file.gz)
Output will be as follows (tested with your sample only):
Operation=ABC,
CustomerId=12,
..
..
..
Counters=qwe=1,wer=2,mbn=4,Hello=0,
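Since the data is spread across several .gz files in a directory, the same program can read all of them in one pass; for example, save the awk body above to a file (blocks.awk is just a placeholder name here) and run (a sketch):
gzip -dc /path/to/dir/*.gz | awk -F'=' -f blocks.awk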

what is the best way to extract filled data from a static form?

I have some federal PDF forms with filled data in them. Let's say, for example, i765, and I have the data of this form available in a text format, with duly filled-in details. How can I extract the data from this form with minimum parsing? Let's say, how can I write a script that identifies the "difference", which in itself is nothing but the filled-in information.
For example, if a line contains:
SSN: (whitespace) and the actual filled-in form has SSN: ABC!##456
then the filled-in information is nothing but ABC!##456, which is just the difference between the strings. Is there a known approach that I can follow? Any pointers are much appreciated.
If we are talking about Linux tools, then you could try various solutions, like:
$ join -t"=" -a1 -o 0,2.2 <(sort emptyform) <(sort filledform) # "=" is used as delimiter
Or even awk without sorting requirements:
$ awk 'BEGIN{FS=OFS="="}NR==FNR{a[$1]=$2;next}{if ($1 in a) {print;delete a[$1]}} \
END{print "\n Missing fields:";for (i in a) print i,a[i]}' empty filled
Testing:
cat <<EOF >empty
Name=""
Surname=""
Age=""
Address=""
Kids=""
Married=""
EOF
cat <<EOF >filled
Name="George"
Surname="Vasiliou"
Age="42"
Address="Europe"
EOF
join -t"=" -a1 -o 0,2.2 <(sort empty) <(sort filled)
#Output:
Address="Europe"
Age="42"
Kids=
Married=
Name="George"
Surname="Vasiliou"
awk output
awk 'BEGIN{FS=OFS="="}NR==FNR{a[$1]=$2;next}{if ($1 in a) {print;delete a[$1]}} \
END{print "\nnot completed fields:";for (i in a) print i,a[i]}' empty filled
Name="George"
Surname="Vasiliou"
Age="42"
Address="Europe"
not completed fields:
Married=""
Kids=""
In particular, in awk, if you remove the print from {if ($1 in a) {print;delete a[$1]}}, the END section will print out only the missing fields for you, as in the sketch below.
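Spelled out, that variant would look like this (a sketch of the same command with the print removed; the order of the for (i in a) output may vary):
$ awk 'BEGIN{FS=OFS="="}NR==FNR{a[$1]=$2;next}{if ($1 in a) delete a[$1]} \
END{print "Missing fields:";for (i in a) print i,a[i]}' empty filled
Missing fields:
Married=""
Kids=""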
Another alternative, with a nice visual interface, is the diff utility:
$ diff -y <(sort empty) <(sort filled)
Address="" | Address="Europe"
Age="" | Age="42"
Kids="" | Name="George"
Married="" | Surname="Vasiliou"
Name="" <
Surname="" <

Substring pattern matching in two files

I have an input flat file like this with many rows:
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n5ut5s 1 0 Message-Type=Authen OK,User-Name=joe7#it.test.com,NAS-IP-Address=4.196.63.55,Caller-ID=az-4d-31-89-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n6ut5s 1 0 Message-Type=Authen OK,User-Name=bobe#jg.test.com,NAS-IP-Address=4.197.43.55,Caller-ID=az-4d-4q-x8-92-80,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 abg8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=jerry777#it.test.com,NAS-IP-Address=7.196.63.55,Caller-ID=az-4d-n6-4e-y2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aca8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc777o.#it.test.com,NAS-IP-Address=4.196.263.55,Caller-ID=a4-4e-31-99-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc77#xed.test.com,NAS-IP-Address=4.136.163.55,Caller-ID=az-4d-4w-b5-s2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
I'm trying to grep the email addresses from the input file to see if they already exist in the master file.
Master flat file looks like this:
a44e31999290;frc777o.#it.test.com;20150403
az4d4qx89280;bobe#jg.test.com;20150403
0dbgd0fed04t;rrfuf#us.test.com;20150403
28cbe9191d53;rttuu4en#us.test.com;20150403
az4d4wb5s290;frc77#xed.test.com;20150403
d89695174805;ccis6n#cn.test.com;20150403
If the email doesn't exist in master I want a simple count.
So using the examples I hope to see: count=3, because bobe#jg.test.com and frc77#xed.test.com already exist in master but the others don't.
I tried various combinations of grep (example below from my last tests), but it is not working. I'm using grep within a Perl script to first capture emails and then count them, but all I really need is the count of emails from the input file that don't exist in master.
grep -o -P '(?<=User-Name=\).*(?=,NAS-IP-)' $infile $mstr > $new_emails;
Any help would be appreciated, Thanks.
I would use this approach in awk:
$ awk 'FNR==NR {a[$2]; next}
       {if ($4 in a) c++}
       END{print c}' FS=';' master FS='[,=]' file
3
This works by setting a different field separator for each input file and storing / matching the emails, then printing the final count.
For the master file we use ; and get the 2nd field:
$ awk -F";" '{print $2}' master
frc777o.#it.test.com
bobe#jg.test.com
rrfuf#us.test.com
rttuu4en#us.test.com
frc77#xed.test.com
ccis6n#cn.test.com
For the file file (the one with all the info) we use either , or = and get the 4th field:
$ awk -F[,=] '{print $4}' file
joe7#it.test.com
bobe#jg.test.com
jerry777#it.test.com
frc777o.#it.test.com
frc77#xed.test.com
I think the below does what you want as a one-liner with diff and perl:
diff <( perl -F';' -anE 'say $F[1]' master | sort -u ) <( perl -pe 'm/User-Name=([^,]+),/; $_ = "$1\n"' data | sort -u ) | grep '^>' | perl -pe 's/> //;'
The diff <( command_a | sort -u ) <( command_b | sort -u ) | grep '^>' pattern lets you handle the set difference of the command outputs.
perl -F';' -anE 'say $F[1]' just splits each line of the file on ';' and prints the second field on its own line.
perl -pe 'm/User-Name=([^,]+),/; $_ = "$1\n"' gets the specific field you wanted ignoring the surrounding key= and prints on a new line implicitly.
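Since the question only asks for a count of the addresses missing from master, you can count the '>' lines instead of printing them (a small variation on the pipeline above):
diff <( perl -F';' -anE 'say $F[1]' master | sort -u ) <( perl -pe 'm/User-Name=([^,]+),/; $_ = "$1\n"' data | sort -u ) | grep -c '^>'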

search and print the value inside tags using script

I have a file like this (abc.txt):
<ra><r>12.34</r><e>235</e><a>34.908</a><r>23</r><a>234.09</a><p>234</p><a>23</a></ra>
<hello>sadfaf</hello>
<hi>hiisadf</hi>
<ra><s>asdf</s><qw>345</qw><a>345</a><po>234</po><a>345</a></ra>
What I have to do is find the <ra> tags, and inside each <ra> tag there are <a> tags whose values I have to store into some variables, which I need to process further. How should I do this?
The values inside the <a> tags within each <ra> tag are:
34.908,234.09,23
345,345
This awk should do:
cat file
<ra><r>12.34</r><e>235</e><a>34.908</a><r>23</r><a>234.09</a><p>234</p><a>23</a></ra><a>12344</a><ra><e>45</e><a>666</a></ra>
<hello>sadfaf</hello>
<hi>no print from this line</hi><a>256</a>
<ra><s>asdf</s><qw>345</qw><a>345</a><po>234</po><a>345</a></ra>
awk -v RS="<" -F">" '/^ra/,/\/ra/ {if (/^a>/) print $2}' file
34.908
234.09
23
666
345
345
It takes care of the case where there are multiple <ra>...</ra> groups on one line.
A small variation:
awk -v RS=\< -F\> '/\/ra/ {f=0} f&&/^a/ {print $2} /^ra/ {f=1}' file
34.908
234.09
23
666
345
345
How does it work:
awk -v RS="<" -F">" ' # This sets record separator to < and gives a new line for every <
/^ra/,/\/ra/ { # within the record starting with "ra" to the record ending with "/ra" do
if (/^a>/) # if the record starts with "a>" do
print $2}' # print field 2
To see how changing RS works try:
awk -v RS="<" '$1=$1' file
ra>
r>12.34
/r>
e>235
/e>
a>34.908
/a>
r>23
/r>
a>234.09
/a>
p>234
...
To store it in a variable you can do as BMW suggested:
var=$(awk ...)
var=$(awk -v RS=\< -F\> '/\/ra/ {f=0} f&&/^a/ {print $2} /^ra/ {f=1}' file)
echo $var
34.908 234.09 23 666 345 345
echo "$var"
34.908
234.09
23
666
345
345
Since there are many values, you can use an array:
array=($(awk -v RS=\< -F\> '/\/ra/ {f=0} f&&/^a/ {print $2} /^ra/ {f=1}' file))
echo ${array[2]}
23
echo ${array[0]}
34.908
echo ${array[*]}
34.908 234.09 23 666 345 345
Use GNU grep's lookahead and lookbehind zero-length assertions:
grep -oP "(?<=<ra>).*?(?=</ra>)" file |grep -Po "(?<=<a>).*?(?=</a>)"
Explanation:
The first grep will get the content in the ra tag. Even if there are several ra tags in one line, they can still be identified.
The second grep gets the content in the a tag.
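To actually get the values into shell variables, as the question asks, the output of that pipeline can be captured into a bash array (a sketch using the sample file above):
values=($(grep -oP "(?<=<ra>).*?(?=</ra>)" file | grep -Po "(?<=<a>).*?(?=</a>)"))
echo "${values[0]}"
34.908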

Find duplicate records in file

I have a text file with lines like below:
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3
How can I find duplicate domains like domainx.com with sed or awk?
With GNU awk you can do:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file
1 domainz.com
2 domainx.com
1 domainy.de
You can use sort to order the output, e.g. ascending numerically with -n:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a) print a[k],k}' file | sort -n
1 domainy.de
1 domainz.com
2 domainx.com
Or just to print duplicate domains:
$ awk -F'[#,]' '{a[$2]++}END{for(k in a)if (a[k]>1) print k}' file
domainx.com
Here:
sed -n '/#domainx.com/ p' yourfile.txt
(Actually, grep is what you should use for that.)
Would you like to count them? Add | nl to the end.
Using that mini-list you gave, the sed line with | nl outputs this:
1 name1#domainx.com, name1
2 name3#domainx.com, name3
What if you need to count how many repetitions each domain has? For that, try this:
for line in `sed -n 's/.*#\([^,]*\).*/\1/p' yourfile.txt|sort|uniq` ; do
echo "$line `grep -c $line yourfile.txt`"
done
The output of that is:
domainx.com 2
domainy.de 1
domainz.com 1
Print only duplicate domains:
awk -F"[#,]" 'a[$2]++==1 {print $2}' file
domainx.com
Print a "*" in front of line that are listed duplicated.
awk -F"[#,]" '{a[$2]++;if (a[$2]>1) f="* ";print f$0;f=x}'
name1#domainx.com, name1
info#domainy.de, somename
name2#domainz.com, othername
* name3#domainx.com, name3
This version paints all lines with a duplicate domain in red:
awk -F"[#,]" '{a[$2]++;b[NR]=$0;c[NR]=$2} END {for (i=1;i<=NR;i++) print ((a[c[i]]>1)?"\033[1;31m":"\033[0m") b[i] "\033[0m"}' file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red
Improved version (reading the file twice):
awk -F"[#,]" 'NR==FNR{a[$2]++;next} a[$2]>1 {$0="\033[1;31m" $0 "\033[0m"}1' file file
name1#domainx.com, name1 <-- This line is red
info#domainy.de, somename
name2#domainz.com, othername
name3#domainx.com, name3 <-- This line is red
If you have GNU grep available, you can use the PCRE matcher to do a positive look-behind to extract the domain name. After that sort and uniq can find duplicate instances:
<infile grep -oP '(?<=#)[^,]*' | sort | uniq -d
Output:
domainx.com
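If you also want to know how many times each duplicated domain occurs, GNU uniq lets you combine -d with -c (a sketch):
<infile grep -oP '(?<=#)[^,]*' | sort | uniq -dc
      2 domainx.com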