Count Duplicate URLs, fastest method possible - text-processing

I'm still working with this huge list of URLs, all the help I have received has been great.
At the moment I have the list looking like this (17000 URLs though):
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=3
I can filter out the duplicates no problem with a couple of methods, awk etc. What I am really looking to do is remove the duplicate URLs while also taking a count of how many times each URL exists in the list, printing the count next to the URL with a pipe separator. After processing, the list should look like this:
url | count
http://www.example.com/page?CONTENT_ITEM_ID=1 | 2
http://www.example.com/page?CONTENT_ITEM_ID=2 | 2
http://www.example.com/page?CONTENT_ITEM_ID=3 | 3
What method would be the fastest way to achieve this?

This is probably as fast as you can get without writing code.
$ cat foo.txt
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=3
$ sort foo.txt | uniq -c
2 http://www.example.com/page?CONTENT_ITEM_ID=1
2 http://www.example.com/page?CONTENT_ITEM_ID=2
3 http://www.example.com/page?CONTENT_ITEM_ID=3
Did a bit of testing, and it's not particularly fast, although for 17k it'll take a little more than 1 second (on a loaded P4 2.8GHz machine).
$ wc -l foo.txt
174955 foo.txt
vinko@mithril:~/i3media/2008/product/Pending$ time sort foo.txt | uniq -c
54482 http://www.example.com/page?CONTENT_ITEM_ID=1
48212 http://www.example.com/page?CONTENT_ITEM_ID=2
72261 http://www.example.com/page?CONTENT_ITEM_ID=3
real 0m23.534s
user 0m16.817s
sys 0m0.084s
$ wc -l foo.txt
14955 foo.txt
$ time sort foo.txt | uniq -c
4233 http://www.example.com/page?CONTENT_ITEM_ID=1
4290 http://www.example.com/page?CONTENT_ITEM_ID=2
6432 http://www.example.com/page?CONTENT_ITEM_ID=3
real 0m1.349s
user 0m1.216s
sys 0m0.012s
Although O(n) wins the game hands down, as usual. I tested S.Lott's solution:
$ cat pythoncount.py
from collections import defaultdict
myFile = open( "foo.txt", "ru" )
fq = defaultdict( int )
for n in myFile:
    fq[n] += 1
for n in fq.items():
    print "%s|%s" % (n[0].strip(), n[1])
$ wc -l foo.txt
14955 foo.txt
$ time python pythoncount.py
http://www.example.com/page?CONTENT_ITEM_ID=2|4290
http://www.example.com/page?CONTENT_ITEM_ID=1|4233
http://www.example.com/page?CONTENT_ITEM_ID=3|6432
real 0m0.072s
user 0m0.028s
sys 0m0.012s
$ wc -l foo.txt
1778955 foo.txt
$ time python pythoncount.py
http://www.example.com/page?CONTENT_ITEM_ID=2|504762
http://www.example.com/page?CONTENT_ITEM_ID=1|517557
http://www.example.com/page?CONTENT_ITEM_ID=3|756636
real 0m2.718s
user 0m2.440s
sys 0m0.072s

Are you going to do this over and over again? If not, then "fastest" as in fastest to implement is probably:
sort </file/of/urls | uniq --count | awk '{ print $2, " | ", $1}'
(not tested, I'm not near a UNIX command line)

In perl
[disclaimer: not able to test this code at the moment]
while (<>) {
    chomp;
    $occurences{$_}++;
}
foreach $url (sort keys %occurences) {
    printf "%s|%d\n", $url, $occurences{$url};
}

See Converting a list of tuples into a dict in python.
Essentially, you're doing the same thing with an int instead of a list.
This may be faster than the system sort because it's O(n). However, it's also Python, not C.
from collections import defaultdict
myFile = open( "urlFile", "ru" )
fq = defaultdict( int )
for n in myFile:
    fq[n] += 1
for url, count in fq.iteritems():
    print url.rstrip(), "|", count
On my little Dell D830, this processes 17,000 URLs in 0.015 seconds.

Here is another version in Python:
import fileinput, itertools

urls = sorted(fileinput.input())
for url, sameurls in itertools.groupby(urls):
    print url.rstrip(), "|", sum(1 for _ in sameurls)
Example:
$ cat foo.txt
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=1
http://www.example.com/page?CONTENT_ITEM_ID=2
http://www.example.com/page?CONTENT_ITEM_ID=3
http://www.example.com/page?CONTENT_ITEM_ID=3
$ python countuniq.py foo.txt
http://www.example.com/page?CONTENT_ITEM_ID=1 | 2
http://www.example.com/page?CONTENT_ITEM_ID=2 | 2
http://www.example.com/page?CONTENT_ITEM_ID=3 | 3
Performance:
C:\> timethis "sort urls17000.txt|uniq -c"
...
TimeThis : Elapsed Time : 00:00:00.688
C:\> timethis python countuniq.py urls17000.txt
...
TimeThis : Elapsed Time : 00:00:00.625
C:\> timethis python slott.py urls17000.txt
...
TimeThis : Elapsed Time : 00:00:00.562
C:\> timethis perl toolkit.pl urls17000.txt
...
TimeThis : Elapsed Time : 00:00:00.187
Conclusion: All solutions are under 1 second. The pipe is the slowest, S.Lott's solution is faster than the above Python version, and toolkit's Perl solution is the fastest.
C:\> timethis perl toolkit.pl urls1778955.txt
...
TimeThis : Elapsed Time : 00:00:17.656
C:\> timethis "sort urls1778955.txt|uniq -c"
...
TimeThis : Elapsed Time : 00:01:54.234
$ wc urls1778955.txt
1778955 1778955 81831930 urls1778955.txt
Hashing beats sorting for a large number of URLs.
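As a present-day footnote (not part of the original answers), the same hashing approach in Python 3 can use collections.Counter; the filename urls.txt is just a placeholder:

from collections import Counter

counts = Counter()
with open("urls.txt") as f:          # placeholder filename
    for url in f:
        counts[url.strip()] += 1

for url, count in counts.items():
    print("%s|%s" % (url, count))

Counter is a dict underneath, so this is the same O(n) single-pass counting as the defaultdict versions above.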

Related

Substring pattern matching in two files

I have an input flat file like this with many rows:
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n5ut5s 1 0 Message-Type=Authen OK,User-Name=joe7#it.test.com,NAS-IP-Address=4.196.63.55,Caller-ID=az-4d-31-89-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n6ut5s 1 0 Message-Type=Authen OK,User-Name=bobe#jg.test.com,NAS-IP-Address=4.197.43.55,Caller-ID=az-4d-4q-x8-92-80,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 abg8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=jerry777#it.test.com,NAS-IP-Address=7.196.63.55,Caller-ID=az-4d-n6-4e-y2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aca8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc777o.#it.test.com,NAS-IP-Address=4.196.263.55,Caller-ID=a4-4e-31-99-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc77#xed.test.com,NAS-IP-Address=4.136.163.55,Caller-ID=az-4d-4w-b5-s2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
I'm trying to grep the email addresses from input file to see if they already exist in the master file.
Master flat file looks like this:
a44e31999290;frc777o.#it.test.com;20150403
az4d4qx89280;bobe#jg.test.com;20150403
0dbgd0fed04t;rrfuf#us.test.com;20150403
28cbe9191d53;rttuu4en#us.test.com;20150403
az4d4wb5s290;frc77#xed.test.com;20150403
d89695174805;ccis6n#cn.test.com;20150403
If the email doesn't exist in master I want a simple count.
So using the examples I hope to see: count=3, because bobe#jg.test.com and frc77#xed.test.com already exist in master but the others don't.
I tried various combinations of grep, example below from last tests but it is not working.. I'm using grep within a perl script to first capture emails and then count them but all I really need is the count of emails from input file that don't exist in master.
grep -o -P '(?<=User-Name=\).*(?=,NAS-IP-)' $infile $mstr > $new_emails;
Any help would be appreciated, Thanks.
I would use this approach in awk:
$ awk 'FNR==NR {FS=";"; a[$2]; next}
{FS="[,=]"; if ($4 in a) c++}
END{print c}' master file
3
This works by setting different field separators and storing / matching the emails. Then, printing the final sum.
For master file we use ; and get the 2nd field:
$ awk -F";" '{print $2}' master
frc777o.#it.test.com
bobe#jg.test.com
rrfuf#us.test.com
rttuu4en#us.test.com
frc77#xed.test.com
ccis6n#cn.test.com
For file file (the one with all the info) we use either , or = and get the 4th field:
$ awk -F[,=] '{print $4}' file
joe7#it.test.com
bobe#jg.test.com
jerry777#it.test.com
frc777o.#it.test.com
frc77#xed.test.com
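For comparison only (not one of the original answers), a Python 3 sketch of the same idea loads the master emails into a set and counts while scanning the log in one pass. The filenames master and file mirror the awk example, and the User-Name= regex is an assumption about the log format:

import re

# Build a set of known emails from the master file (2nd ;-separated field).
with open("master") as m:
    known = {line.split(";")[1] for line in m if ";" in line}

in_master = not_in_master = 0
with open("file") as f:                     # the log/input file
    for line in f:
        match = re.search(r"User-Name=([^,]+),", line)
        if match:
            if match.group(1) in known:
                in_master += 1
            else:
                not_in_master += 1

print(in_master, not_in_master)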
Think the below does what you want as a one liner with diff and perl:
diff <( perl -F';' -anE 'say @F[1]' master | sort -u ) <( perl -pe 'm/User-Name=([^,]+),/; $_ = "$1\n"' data | sort -u ) | grep '^>' | perl -pe 's/> //;'
The diff <( command_a | sort -u ) <( command_b | sort -u ) | grep '^>' pattern lets you take the set difference of the two commands' output.
perl -F';' -anE 'say @F[1]' just splits each line of the file on ';' and prints the second field on its own line.
perl -pe 'm/User-Name=([^,]+),/; $_ = "$1\n"' extracts the specific field you wanted, ignoring the surrounding key= and trailing comma, and prints it on its own line implicitly.

How to get wheel users with specific prefix letters

cat /etc/group | grep wheel
wheel:x:10:I0173203,i04317303,raccount,d454523,c564566,C555533,D2354546
I want to extract only the users that start with c/C, i/I or d/D.
How do I get this desired output?
I0173203 i04317303 d454523 c564566 C555533 D2354546
I would use awk for this:
$ awk -F[:,] '/^wheel/ {
    for (i=4; i<=NF; i++) if ($i ~ /^[cCiIdD]/) printf "%s%s", $i, (i==NF?RS:OFS)
}' /etc/group
I0173203 i04317303 d454523 c564566 C555533 D2354546
You can also use perl:
perl -nle '@m=(m/[:,]([iIcCdD]\w+)/g) if $_=~/^wheel/ }{ print "@m"' /etc/group
cat /etc/group | grep wheel | sed 's/^.*:\(.*\)$/\1/g' | sed 's/,/\n/g' | egrep '^[cCiIdD].*'
Run the first command in the chain and look at the results. Then add the second, look at the results, and so on.
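If a small script is easier to adapt than a pipeline, a Python 3 sketch of the same extraction (not one of the original answers) could look like this; it assumes the usual name:passwd:gid:member,member,... layout of /etc/group:

with open("/etc/group") as f:
    for line in f:
        fields = line.rstrip("\n").split(":")
        if fields[0] == "wheel":
            # members are comma-separated in the 4th field
            members = fields[3].split(",") if len(fields) > 3 and fields[3] else []
            print(" ".join(m for m in members if m and m[0] in "cCiIdD"))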

How to add timestamp to pipe output?

I need to add a timestamp in front of the output of a long-executing command (a "tcpdump", in my use-case...).
It - very simplified - looks like this one:
(echo A1; sleep 3; echo B2) | perl -MPOSIX -pe 'print strftime "%T ", localtime $^T; s/\d//'
which gives this kind of output:
16:10:24 A
16:10:24 B
i.e.: perl's localtime is (obviously) called when perl is invoked.
Instead I need this kind of result:
16:10:24 A
16:10:27 B
i.e.: time stamp should be relative to the input's generation time...
Any smart (or not so smart :-) solution?
Just remove the $^T from your Perl command. That way, you will use the current time instead of the process start time. See the docs for $^T.
However, a more elegant formulation with Perl would be:
... | perl -MPOSIX -ne's/\d//; print strftime("%T ", localtime), $_'
You could pipe the output to:
awk '{ print strftime("%T"), $0; }'
Example:
while : ; do echo hey; sleep 1; done | awk '{ print strftime("%T"), $0; }'
20:49:58 hey
20:49:59 hey
20:50:00 hey
20:50:01 hey
20:50:02 hey
20:50:03 hey
20:50:04 hey
20:50:05 hey
Alternatively, you could use ts:
ts '%T'
(echo A1; sleep 3; echo B2) | perl -MPOSIX -pe 'print strftime "%T ", localtime; s/\d//'
Works excellently for me. Why did you add $^T there?
on Linux, you can use tcpdump -tttt to print the timestamp before each output line.
# tcpdump -tttt -c 1 2>/dev/null
2013-12-13 23:42:12.044426 IP 10.0.2.15.ssh > 10.0.2.2.53466: Flags [P.], seq 464388005:464388121, ack 16648998, win 65535, length 116
If your tcpdump doesn't have the -tttt option, you should use the awk approach from devnull.
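For completeness, a rough Python 3 equivalent of the awk/ts approaches (not from the original answers); it stamps each line at the moment it arrives and flushes so the output isn't delayed by buffering. The script name stamp.py is hypothetical:

import sys
import time

# read stdin line by line; the timestamp is taken when each line arrives
for line in sys.stdin:
    sys.stdout.write(time.strftime("%H:%M:%S ") + line)   # %H:%M:%S == %T
    sys.stdout.flush()   # keep output flowing when writing into another pipe

Usage would be along the lines of: (echo A1; sleep 3; echo B2) | python3 stamp.py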

sed: replace spaces within quotes with underscores

I have input (for example, from ifconfig run0 scan on OpenBSD) that has some fields that are separated by spaces, but some of the fields themselves contain spaces (luckily, such fields that contain spaces are always enclosed in quotes).
I need to distinguish between the spaces within the quotes, and the separator spaces. The idea is to replace spaces within quotes with underscores.
Sample data:
%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3
nwid Websense chan 6 bssid 00:22:7f:xx:xx:xx 59dB 54M short_preamble,short_slottime
nwid ZyXEL chan 8 bssid cc:5d:4e:xx:xx:xx 5dB 54M privacy,short_slottime
nwid "myTouch 4G Hotspot" chan 11 bssid d8:b3:77:xx:xx:xx 49dB 54M privacy,short_slottime
Which doesn't end up processed the way I want, since I haven't replaced the spaces within the quotes with the underscores yet:
%cat /tmp/ifconfig_scan | fgrep nwid | cut -f3 |\
cut -s -d ' ' -f 2,4,6,7,8 | sort -n -k4
"myTouch Hotspot" 11 bssid d8:b3:77:xx:xx:xx
ZyXEL 8 cc:5d:4e:xx:xx:xx 5dB 54M
Websense 6 00:22:7f:xx:xx:xx 59dB 54M
For a sed-only solution (which I don't necessarily advocate), try:
echo 'a b "c d e" f g "h i"' |\
sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
a b "c_d_e" f g "h_i"
Translation:
Start at the beginning of the line.
Look for zero or more complete quoted sections (junk"junk"junk, where junk contains no quotes), followed by an opening quote and its contents so far (junk"junk), followed by a space.
Replace the final space with _.
If successful, jump back to the beginning.
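The same substitution is arguably easier to follow in Python (a sketch for comparison, not one of the benchmarked solutions): match each quoted region and rewrite only its spaces via a callback.

import re
import sys

def underscore_spaces(match):
    # rewrite only the quoted region that was matched
    return match.group(0).replace(" ", "_")

for line in sys.stdin:
    sys.stdout.write(re.sub(r'"[^"]*"', underscore_spaces, line))

For example, echo 'a b "c d e" f g "h i"' piped through this prints a b "c_d_e" f g "h_i".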
try this:
awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" file
it works for multi quotation parts in a line:
echo '"first part" foo "2nd part" bar "the 3rd part comes" baz'| awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\""
"first_part" foo "2nd_part" bar "the_3rd_part_comes" baz
EDIT alternative form:
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' file
Another awk to try:
awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"
Removing the quotes:
awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=
Some additional testing with a triple-size test file, further to the earlier tests done by @steve. I had to transform the sed statement a little bit so that non-GNU seds could process it as well. I included awk (bwk), gawk3, gawk4 and mawk:
$ for i in {1..1500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' ; done > test
$ time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null
real 0m27.802s
user 0m27.588s
sys 0m0.177s
$ time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m6.565s
user 0m6.500s
sys 0m0.059s
$ time gawk3 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m21.486s
user 0m18.326s
sys 0m2.658s
$ time gawk4 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m14.270s
user 0m14.173s
sys 0m0.083s
$ time mawk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m4.251s
user 0m4.193s
sys 0m0.053s
$ time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m13.229s
user 0m13.141s
sys 0m0.075s
$ time gawk3 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m33.965s
user 0m26.822s
sys 0m7.108s
$ time gawk4 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m15.437s
user 0m15.328s
sys 0m0.087s
$ time mawk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m4.002s
user 0m3.948s
sys 0m0.051s
$ time sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null
real 5m14.008s
user 5m13.082s
sys 0m0.580s
$ time gsed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null
real 4m11.026s
user 4m10.318s
sys 0m0.463s
mawk rendered the fastest results...
You'd be better off with perl. The code is much more readable and maintainable:
perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'
With your input, the results are:
a b "c_d_e" f g "h_i"
Explanation:
-p # enable printing
-e # the following expression...
s # begin a substitution
: # the first substitution delimiter
"[^"]*" # match a double quote followed by anything not a double quote any
# number of times followed by a double quote
: # the second substitution delimiter
($x=$&)=~s/ /_/g; # copy the pattern match ($&) into a variable ($x), then
# substitute a space for an underscore globally on $x. The
# variable $x is needed because capture groups and
# patterns are read only variables.
$x # return $x as the replacement.
: # the last delimiter
g # perform the nested substitution globally
e # make sure that the replacement is handled as an expression
Some testing:
for i in {1..500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' >> test; done
time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null
real 0m8.301s
user 0m8.273s
sys 0m0.020s
time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null
real 0m4.967s
user 0m4.924s
sys 0m0.036s
time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null
real 0m4.336s
user 0m4.244s
sys 0m0.056s
time sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test >/dev/null
real 2m26.101s
user 2m25.925s
sys 0m0.100s
NOT AN ANSWER, just posting awk equivalent code for @steve's perl code in case anyone's interested (and to help me remember this in future):
@steve posted:
perl -pe 's:"[^\"]*":($x=$&)=~s/ /_/g;$x:ge'
and from reading @steve's explanation, the briefest awk equivalent to that perl code (NOT the preferred awk solution - see @Kent's answer for that) would be the GNU awk:
gawk '{
    head = ""
    while ( match($0,"\"[^\"]*\"") ) {
        head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
        $0 = substr($0,RSTART+RLENGTH)
    }
    print head $0
}'
which we get to by starting from a POSIX awk solution with more variables:
awk '{
    head = ""
    tail = $0
    while ( match(tail,"\"[^\"]*\"") ) {
        x = substr(tail,RSTART,RLENGTH)
        gsub(/ /,"_",x)
        head = head substr(tail,1,RSTART-1) x
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}'
and saving a line with GNU awk's gensub():
gawk '{
    head = ""
    tail = $0
    while ( match(tail,"\"[^\"]*\"") ) {
        x = gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
        head = head substr(tail,1,RSTART-1) x
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}'
and then getting rid of the variable x:
gawk '{
    head = ""
    tail = $0
    while ( match(tail,"\"[^\"]*\"") ) {
        head = head substr(tail,1,RSTART-1) gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}'
and then getting rid of the variable "tail" if you don't need $0, NF, etc, left hanging around after the loop:
gawk '{
    head = ""
    while ( match($0,"\"[^\"]*\"") ) {
        head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
        $0 = substr($0,RSTART+RLENGTH)
    }
    print head $0
}'

How can I extract a predetermined range of lines from a text file on Unix?

I have a ~23000 line SQL dump containing several databases worth of data. I need to extract a certain section of this file (i.e. the data for a single database) and place it in a new file. I know both the start and end line numbers of the data that I want.
Does anyone know a Unix command (or series of commands) to extract all lines from a file between say line 16224 and 16482 and then redirect them into a new file?
sed -n '16224,16482p;16483q' filename > newfile
From the sed manual:
p -
Print out the pattern space (to the standard output). This command is usually only used in conjunction with the -n command-line option.
n -
If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with the next line of input. If
there is no more input then sed exits without processing any more
commands.
q -
Exit sed without processing any more commands or input.
Note that the current pattern space is printed if auto-print is not disabled with the -n option.
and
Addresses in a sed script can be in any of the following forms:
number
Specifying a line number will match only that line in the input.
An address range can be specified by specifying two addresses
separated by a comma (,). An address range matches lines starting from
where the first address matches, and continues until the second
address matches (inclusively).
sed -n '16224,16482 p' orig-data-file > new-file
Where 16224,16482 are the start line number and end line number, inclusive. This is 1-indexed. -n suppresses echoing the input as output, which you clearly don't want; the numbers indicate the range of lines to make the following command operate on; the command p prints out the relevant lines.
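As an aside (not one of the original answers), the same inclusive 1-indexed range can be done in Python with itertools.islice; the filenames mirror the sed example above. Note that islice is 0-indexed with an exclusive stop, hence the 16223:

import itertools

# lines 16224..16482 inclusive (1-indexed) become islice bounds 16223..16482
with open("orig-data-file") as src, open("new-file", "w") as dst:
    dst.writelines(itertools.islice(src, 16223, 16482))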
Quite simple using head/tail:
head -16482 in.sql | tail -259 > out.sql
using sed:
sed -n '16224,16482p' in.sql > out.sql
using awk:
awk 'NR>=16224&&NR<=16482' in.sql > out.sql
You could use 'vi' and then the following command:
:16224,16482w!/tmp/some-file
Alternatively:
cat file | head -n 16482 | tail -n 259
EDIT: Just to add an explanation: you use head -n 16482 to display the first 16482 lines, then use tail -n 259 to get the last 259 lines (16224 through 16482, inclusive) out of that first output.
There is another approach with awk:
awk 'NR==16224, NR==16482' file
If the file is huge, it can be good to exit after reading the last desired line. This way, it won't read the following lines unnecessarily:
awk 'NR==16224, NR==16482-1; NR==16482 {print; exit}' file
awk 'NR==16224, NR==16482; NR==16482 {exit}' file
perl -ne 'print if 16224..16482' file.txt > new_file.txt
People trying to wrap their heads around computing an interval for the head | tail combo are overthinking it.
Here's how you get the "16224 -- 16482" range without computing anything:
cat file | head -n +16482 | tail -n +16224
Explanation:
The + instructs the head/tail command to "go up to / start from" (respectively) the specified line number as counted from the beginning of the file.
Similarly, a - instructs them to "go up to / start from" (respectively) the specified line number as counted from the end of the file
The solution shown above simply uses head first, to 'keep everything up to the top number', and then tail second, to 'keep everything from the bottom number upwards', thus defining our range of interest (with no need to compute an interval).
Standing on the shoulders of boxxar, I like this:
sed -n '<first line>,$p;<last line>q' input
e.g.
sed -n '16224,$p;16482q' input
The $ means "last line", so the first command makes sed print all lines starting with line 16224 and the second command makes sed quit after printing line 16482. (Adding 1 for the q-range in boxxar's solution does not seem to be necessary.)
I like this variant because I don't need to specify the ending line number twice. And I measured that using $ does not have detrimental effects on performance.
# print section of file based on line numbers
sed -n '16224 ,16482p' # method 1
sed '16224,16482!d' # method 2
cat dump.txt | head -16482 | tail -259
should do the trick. The downside of this approach is that you need to do the arithmetic to determine the argument for tail and to account for whether you want the 'between' to include the ending line or not.
sed -n '16224,16482p' < dump.sql
Quick and dirty:
head -16482 < file.in | tail -259 > file.out
Probably not the best way to do it but it should work.
BTW: 259 = 16482-16224+1.
I wrote a Haskell program called splitter that does exactly this: have a read through my release blog post.
You can use the program as follows:
$ cat somefile | splitter 16224-16482
And that is all that there is to it. You will need Haskell to install it. Just:
$ cabal install splitter
And you are done. I hope that you find this program useful.
We can even do this to check at the command line:
cat filename | sed 'n1,n2!d' > abc.txt
For Example:
cat foo.pl|sed '100,200!d' > abc.txt
Using ruby:
ruby -ne 'puts "#{$.}: #{$_}" if $. >= 32613500 && $. <= 32614500' < GND.rdf > GND.extract.rdf
I wanted to do the same thing from a script using a variable and achieved it by putting quotes around the $variable to separate the variable name from the p:
sed -n "$first","$count"p imagelist.txt >"$imageblock"
I wanted to split a list into separate folders and found the initial question and answer a useful step. (The split command is not an option on the old OS I have to port code to.)
Just benchmarking 3 of the solutions given above, which work for me:
awk
sed
"head+tail"
Credit for the 3 solutions goes to:
@boxxar
@avandeursen
@wds
@manveru
@sibaz
@SOFe
@fedorqui 'SO stop harming'
@Robin A. Meade
I'm using a huge file I found on my server:
# wc fo2debug.1.log
10421186 19448208 38795491134 fo2debug.1.log
38 GB in 10.4 million lines.
And yes, I have a logrotate problem. : ))
Make your bets!
Getting 256 lines from the beginning of the file.
# time sed -n '1001,1256p;1256q' fo2debug.1.log | wc -l
256
real 0m0,003s
user 0m0,000s
sys 0m0,004s
# time head -1256 fo2debug.1.log | tail -n +1001 | wc -l
256
real 0m0,003s
user 0m0,006s
sys 0m0,000s
# time awk 'NR==1001, NR==1256; NR==1256 {exit}' fo2debug.1.log | wc -l
256
real 0m0,002s
user 0m0,004s
sys 0m0,000s
Awk won. Technical tie in second place between sed and "head+tail".
Getting 256 lines at the end of the first third of the file.
# time sed -n '3473001,3473256p;3473256q' fo2debug.1.log | wc -l
256
real 0m0,265s
user 0m0,242s
sys 0m0,024s
# time head -3473256 fo2debug.1.log | tail -n +3473001 | wc -l
256
real 0m0,308s
user 0m0,313s
sys 0m0,145s
# time awk 'NR==3473001, NR==3473256; NR==3473256 {exit}' fo2debug.1.log | wc -l
256
real 0m0,393s
user 0m0,326s
sys 0m0,068s
Sed won. Followed by "head+tail" and, finally, awk.
Getting 256 lines at the end of the second third of the file.
# time sed -n '6947001,6947256p;6947256q' fo2debug.1.log | wc -l
256
real 0m0,525s
user 0m0,462s
sys 0m0,064s
# time head -6947256 fo2debug.1.log | tail -n +6947001 | wc -l
256
real 0m0,615s
user 0m0,488s
sys 0m0,423s
# time awk 'NR==6947001, NR==6947256; NR==6947256 {exit}' fo2debug.1.log | wc -l
256
real 0m0,779s
user 0m0,650s
sys 0m0,130s
Same results.
Sed won. Followed by "head+tail" and, finally, awk.
Getting 256 lines near the end of the file.
# time sed -n '10420001,10420256p;10420256q' fo2debug.1.log | wc -l
256
real 1m50,017s
user 0m12,735s
sys 0m22,926s
# time head -10420256 fo2debug.1.log | tail -n +10420001 | wc -l
256
real 1m48,269s
user 0m42,404s
sys 0m51,015s
# time awk 'NR==10420001, NR==10420256; NR==10420256 {exit}' fo2debug.1.log | wc -l
256
real 1m49,106s
user 0m12,322s
sys 0m18,576s
And suddenly, a twist!
"Head+tail" won. Followed by awk and, finally, sed.
(some hours later...)
Sorry guys!
My analysis above ends up being an example of a basic flaw in doing an analysis.
The flaw is not knowing in depth the resources used for the analysis.
In this case, I used a log file to analyze the performance of a search for a certain number of lines within it.
Using 3 different techniques, searches were made at different points in the file, comparing the performance of the techniques at each point and checking whether the results varied depending on the point in the file where the search was made.
My mistake was to assume that there was a certain homogeneity of content in the log file.
The reality is that long lines appear more frequently at the end of the file.
Thus, the apparent conclusion that longer searches (closer to the end of the file) are better with a given technique may be biased. In fact, that technique may simply be better at dealing with longer lines. This remains to be confirmed.
I was about to post the head/tail trick, but actually I'd probably just fire up emacs. ;-)
esc-x goto-line ret 16224
mark (ctrl-space)
esc-x goto-line ret 16482
esc-w
open the new output file, ctl-y
save
It lets me see what's happening.
I would use:
awk 'FNR >= 16224 && FNR <= 16482' my_file > extracted.txt
FNR contains the record (line) number of the line being read from the file.
Using ed:
ed -s infile <<<'16224,16482p'
-s suppresses diagnostic output; the actual commands are in a here-string. Specifically, 16224,16482p runs the p (print) command on the desired line address range.
I wrote a small bash script that you can run from your command line, so long as you update your PATH to include its directory (or you can place it in a directory that is already contained in the PATH).
Usage: $ pinch filename start-line end-line
#!/bin/bash
# Display line number ranges of a file to the terminal.
# Usage: $ pinch filename start-line end-line
# By Evan J. Coon

FILENAME=$1
START=$2
END=$3

ERROR="[PINCH ERROR]"

# Check that the number of arguments is 3
if [ $# -lt 3 ]; then
    echo "$ERROR Need three arguments: Filename Start-line End-line"
    exit 1
fi

# Check that the file exists.
if [ ! -f "$FILENAME" ]; then
    echo -e "$ERROR File does not exist. \n\t$FILENAME"
    exit 1
fi

# Check that start-line is not greater than end-line
if [ "$START" -gt "$END" ]; then
    echo -e "$ERROR Start line is greater than End line."
    exit 1
fi

# Check that start-line is positive.
if [ "$START" -lt 0 ]; then
    echo -e "$ERROR Start line is less than 0."
    exit 1
fi

# Check that end-line is positive.
if [ "$END" -lt 0 ]; then
    echo -e "$ERROR End line is less than 0."
    exit 1
fi

NUMOFLINES=$(wc -l < "$FILENAME")

# Check that end-line is not greater than the number of lines in the file.
if [ "$END" -gt "$NUMOFLINES" ]; then
    echo -e "$ERROR End line is greater than number of lines in file."
    exit 1
fi

# The distance from the end of the file to end-line
ENDDIFF=$(( NUMOFLINES - END ))

# For larger files, this will run more quickly. If the distance from the
# end of the file to the end-line is less than the distance from the
# start of the file to the start-line, then start pinching from the
# bottom as opposed to the top.
if [ "$START" -lt "$ENDDIFF" ]; then
    < "$FILENAME" head -n $END | tail -n +$START
else
    < "$FILENAME" tail -n +$START | head -n $(( END-START+1 ))
fi

# Success
exit 0
This might work for you (GNU sed):
sed -ne '16224,16482w newfile' -e '16482q' file
or taking advantage of bash:
sed -n $'16224,16482w newfile\n16482q' file
Since we are talking about extracting lines of text from a text file, I will give a special case where you want to extract all lines that match a certain pattern.
myfile content:
=====================
line1 not needed
line2 also discarded
[Data]
first data line
second data line
=====================
sed -n '/Data/,$p' myfile
will print the [Data] line and everything after it. If you want the text from the first line up to the pattern, you type: sed -n '1,/Data/p' myfile. Furthermore, if you know two patterns (they had better be unique in your text), both the beginning and end line of the range can be specified with matches.
sed -n '/BEGIN_MARK/,/END_MARK/p' myfile
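A Python 3 sketch of the same pattern-delimited range (an illustration, not from the answer), printing from the first line matching BEGIN_MARK through the first subsequent line matching END_MARK:

import sys

printing = False
for line in sys.stdin:
    if not printing and "BEGIN_MARK" in line:
        printing = True
    if printing:
        sys.stdout.write(line)
        if "END_MARK" in line:
            break   # stop after the first range; sed would keep toggling for later ranges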
I've compiled some of the highest-rated solutions for sed, perl and head+tail, plus my own code for awk, focusing on performance through the pipe, using LC_ALL=C to ensure all candidates run as fast as possible, with a 2-second sleep gap between runs.
The gaps are somewhat noticeable:
abs time awk/app speed ratio
----------------------------------
0.0672 sec : 1.00x mawk-2
0.0839 sec : 1.25x gnu-sed
0.1289 sec : 1.92x perl
0.2151 sec : 3.20x gnu-head+tail
I haven't had a chance to test Python or the BSD variants of those utilities.
(fg && fg && fg && fg) 2>/dev/null;
echo;
( time ( pvE0 < "${m3t}"
| LC_ALL=C mawk2 '
BEGIN {
_=10420001-(\
__=10420256)^(FS="^$")
} _<NR {
print
if(__==NR) { exit }
}' ) | pvE9) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2;
(fg && fg && fg && fg) 2>/dev/null
echo;
( time ( pvE0 < "${m3t}"
| LC_ALL=C gsed -n '10420001,10420256p;10420256q'
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2; (fg && fg && fg && fg) 2>/dev/null
echo
( time ( pvE0 < "${m3t}"
| LC_ALL=C perl -ne 'print if 10420001..10420256'
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2; (fg && fg && fg && fg) 2>/dev/null
echo
( time ( pvE0 < "${m3t}"
| LC_ALL=C ghead -n +10420256
| LC_ALL=C gtail -n +10420001
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
in0: 1.51GiB 0:00:00 [2.31GiB/s] [2.31GiB/s] [============> ] 81%
out9: 42.5KiB 0:00:00 [64.9KiB/s] [64.9KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C mawk2 ; )
0.43s user 0.36s system 117% cpu 0.672 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
out9: 42.5KiB 0:00:00 [51.7KiB/s] [51.7KiB/s] [ <=> ]
in0: 1.51GiB 0:00:00 [1.84GiB/s] [1.84GiB/s] [==========> ] 81%
( pvE 0.1 in0 < "${m3t}" |LC_ALL=C gsed -n '10420001,10420256p;10420256q'; )
0.68s user 0.34s system 121% cpu 0.839 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
in0: 1.85GiB 0:00:01 [1.46GiB/s] [1.46GiB/s] [=============>] 100%
out9: 42.5KiB 0:00:01 [33.5KiB/s] [33.5KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C perl -ne 'print if 10420001..10420256'; )
1.10s user 0.44s system 119% cpu 1.289 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
in0: 1.51GiB 0:00:02 [ 728MiB/s] [ 728MiB/s] [=============> ] 81%
out9: 42.5KiB 0:00:02 [19.9KiB/s] [19.9KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}"
| LC_ALL=C ghead -n +10420256
| LC_ALL=C gtail -n ; )
1.98s user 1.40s system 157% cpu 2.151 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
The -n in the accepted answers works. Here's another way, in case you're so inclined.
cat $filename | sed "${linenum}p;d";
This does the following:
pipe in the contents of a file (or feed in the text however you want).
sed selects the given line, prints it
d is required to delete lines; otherwise sed assumes all lines will eventually be printed. i.e., without the d, you would get every line printed, with the selected line printed twice, because the ${linenum}p part asks for it to be printed. I'm pretty sure the -n is basically doing the same thing as the d here.
I was looking for an answer to this but I had to end up writing my own code which worked. None of the answers above were satisfactory.
Consider that you have a very large file and certain line numbers that you want to print out, but the numbers are not in order. You can do the following:
My relatively large file
for letter in {a..k} ; do echo $letter; done | cat -n > myfile.txt
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
Specific line numbers I want:
shuf -i 1-11 -n 4 > line_numbers_I_want.txt
10
11
4
9
To print these line numbers, do the following.
awk '{system("head myfile.txt -n " $0 " | tail -n 1")}' line_numbers_I_want.txt
What the above does is head the file to the given line number and then take the last line of that output using tail.
If you want your line numbers in order, sort them first (-n is numeric sort), then get the lines.
cat line_numbers_I_want.txt | sort -n | awk '{system("head myfile.txt -n " $0 " | tail -n 1")}'
4 d
9 i
10 j
11 k
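The repeated head | tail calls spawn a couple of processes per wanted line; a single-pass Python 3 sketch (not part of the original answer) does the same with a set of wanted line numbers. It prints in file order, so sort the output afterwards if you need the original request order:

# read the wanted line numbers into a set, then scan the file once
with open("line_numbers_I_want.txt") as f:
    wanted = {int(n) for n in f if n.strip()}

with open("myfile.txt") as f:
    for lineno, line in enumerate(f, start=1):
        if lineno in wanted:
            print(line, end="")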
Maybe you would be so kind as to give this humble script a chance ;-)
#!/usr/bin/bash
# Usage:
#   body n m|-m
from=$1
to=$2
if [ $to -gt 0 ]; then
    # count $from the beginning of the file $to the selected line
    awk "NR >= $from && NR <= $to {print}"
else
    # count $from the beginning of the file, skipping the trailing $to lines
    awk '
    BEGIN {lines=0; from='$from'; to='$to'}
    {++lines}
    NR >= from {line[lines]=$0}
    END {
        for (i = from; i < lines + to + 1; i++) {
            print line[i]
        }
    }'
fi
Outputs:
$ seq 20 | ./body.sh 5 15
5
6
7
8
9
10
11
12
13
14
15
$ seq 20 | ./body.sh 5 -5
5
6
7
8
9
10
11
12
13
14
15
You could use the sed command in your case; it is pretty fast.
As mentioned, let's assume the range is between lines 16224 and 16482:
# get lines 16224 to 16482 and print them into filename.txt
sed -n '16224,16482p' file.txt > filename.txt
# Additional info to showcase other possible scenarios:
# get the 16224th line only and write it to filename.txt
sed -n '16224p' file.txt > filename.txt
# get lines 16224 and 16300 only and write them to filename.txt
sed -n '16224p;16300p;' file.txt > filename.txt