Simple tool to find the most recurrent terms in a text [closed] - text-processing

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I have a text and I would like to extract the most recurrent terms, even if made up by more than one word (i.e.: managing director, position, salary, web developer).
I would need a library or an installable executable, more than a web service.
I came across some complex tools (such as Topia's Term Extraction, MAUI) that require training. There are overcomplicated for my purpose and I find them difficult to use by me.
I just need a piece of software that extracts the most recurrent terms in a text.
Thanks.

Do you use linux? I use these shell functions
# copyright by Werner Rudolph <werner (at) artistoex (dot) net>
# copying and distributing of the following source code
# is permitted, as long as this note is preserved.
# ftr CHAR1 CHAR2
# translate delimiter char in frequency list
#
ftr()
{
sed -r 's/^( *[0-9]+)'"$1"'/\1'"$2"'/'
}
# valid-collocations -- find valid collocations in inputstream
# reads records COUNT<SPC>COLLOCATION from inputstream
# writes records with existing collocations to stdout.
valid-collocations ()
{
#sort -k 2 -m - "$coll" |uniq -f 1 -D|accumulate
local delimiter="_"
ftr ' ' $delimiter |
join -t $delimiter -o 1.1 0 -1 2 -2 1 - /tmp/wordsets-helper-collocations |
ftr $delimiter ' '
}
# ngrams MAX [MIN]
#
# Generates all n-grams (for each MIN <= n <= MAX, where MIN defaults to 2)
# from inputstream
#
# reads word list, as generated by
#
# $ words < text
#
# from stdin. For each WORD in wordlist, it writes MAX-1 records
#
# COUNT<TAB>WORD<SPC>SUCC_1<SPC>
# COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>
# :
# COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>...<SPC>SUCC_MAX-2
# COUNT<TAB>WORD<SPC>SUCC_1<SPC>SUCC_2<SPC>...<SPC>SUCC_MAX-1
#
# to stdout, where word SUCC follows word WORD, and SUCC_n follows
# SUCC_n-1 in input stream COUNT times.
ngrams ()
{
local max=$1
local min=${2:-2};
awk 'FNR > 1 {print old " " $0} {old=$1}' | if (( $max - 1 > 1 )); then
if (( $min <= 2 )); then
tee >( ngrams $(( $max - 1 )) $(( $min - 1 )) );
else
ngrams $(( $max - 1 )) $(( $min - 1 ));
fi;
else
cat;
fi
}
words() {
grep -Eo '\<([a-zA-Z]'"'"'?){'${1:-3}',}\>'|grep -v "[A-Z]"
}
parse-collocations() {
local freq=${1:-0}
local length=${2:-4}
words | ngrams $length | sort | uniq -c |
awk '$1 > '"$freq"' { print $0; }' |
valid-collocations
}
Where parse-collocation is the actual function to use. It accepts two optional parameters: The first sets the maximum recurring frequency of the terms to be skipped from the result (defaults to 0, i.e. consider all terms). The second parameter sets the maximum term length to search for. The function will read the text from stdin and print the terms to stdout line by line. It requires a dictionary file at /tmp/wordsets-helper-collocations (download one here):
Usage example:
$ parse-collocation < some-text
would be pretty much what you want. However, if you don't want terms to be matched with a dictionary, you can use this one
$ words < some-text | ngrams 3 4 | sort | uniq -c |sort -nr
ngrams's first parameter sets the minimum term length whereas its second (optional) parameter sets the maximum term length.

Related

sort and extract certain number of rows from a file containing dates

i have in a txt file, date like:
yyyymmdd
raw data are like:
20171115
20171115
20180903
...
20201231
They are more than 100k rows. i am trying to keep in one file the "newest" 10k lines, and in a separate file the 10k "oldest" 10k lines.
I guess this must be a two steps process:
sort lines,
then extract the 10k rows that are on top, the "newest = most recent dates" and the 10k rows that are towards the end of the file ie the "oldest = most ancient dates"
How could i achieve it using awk?
I even tried with perl no luck though, so a perl one liner would be highly accepted as well.
Edit: i would prefer a clean clever solution so that i learn from,
and not an optimization of my attempts.
example with perl
#dates = ('20170401', '20170721', '20200911');
#ordered = sort { &compare } #dates;
sub compare {
$a =~ /(\d{4})(\d{2})(\d{2})/;
$c = $3 . $2 . $1;
$b =~ /(\d{4})(\d{2})(\d{2})/;
$c = $3 . $2 . $1;
$c <=> $d;
}
print "#ordered\n";
This is an answer using perl.
If you want the oldest on top, you can use the standard sort order:
#dates = sort #dates;
Reverse sort order, with the newest on top:
#dates = sort { $b <=> $a } #dates;
# ^^^
# |
# numerical three-way comparison returning -1, 0 or +1
You can then extract 10000 of the entries from the top:
my $keep = 10000;
my #top = splice #dates, 0, $keep;
And 10000 from the bottom:
$keep = #dates unless(#dates >= $keep);
my #bottom = splice #dates, -$keep;
#dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.
You can then save the two arrays to files if you want:
sub save {
my $filename=shift;
open my $fh, '>', $filename or die "$filename: $!";
print $fh join("\n", #_) . "\n" if(#_);
close $fh;
}
save('top', #top);
save('bottom', #bottom);
A command-line script ("one"-liner) with Perl
perl -MPath::Tiny=path -we'
$f = shift; $n = shift//2; # filename; number of lines or default
#d = sort +(path($f)->lines); # sort lexicographically, ascending
$n = int #d/2 if 2*$n > #d; # top/bottom lines, up to half of file
path("bottom.txt")->spew(#d[0..$n-1]); # write files, top/bottom $n lines
path("top.txt") ->spew(#d[$#d-$n+1..$#d])
' dates.txt 4
Comments
Needs a filename, and can optionally take the number of lines to take from top and bottom; in this example 4 is passed (with default 2), for easy tests with small files. Don't need to check for the filename since the library used to read it, Path::Tiny, does that
For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the libary is a class, so that =path may be just removed
Sorting is alphabetical but that is fine with dates in this format; oldest dates come first but that doesn't matter since we'll split off what we need. To enforce numerical sorting, and once at it to sort in descending order, use sort { $b <=> $a } #d;. See sort
We check whether there is enough lines in the file for the desired number of lines to shave off from the (sorted) top and bottom ($n). If there isn't then that's set to half the file
The syntax $#ary is the last index of the array #ary and that is used to count off $n items from the back of the array with lines #d
This is written as a command-line program ("one-liner") merely because that was asked for. But that much code would be far more comfortable in a script.
Given that your lines with dates will sort lexicographically, it is simple. Just use sort then split.
Given:
cat infile
20171115
20171115
20180903
20100101
20211118
20201231
You can sort then split that input file into files of 3 lines each:
split -l3 <(sort -r infile) dates
# -l10000 for a 10,000 line split
The result:
for fn in dates*; do echo "${fn}:"; cat "$fn"; done
datesaa:
20211118
20201231
20180903
datesab:
20171115
20171115
20100101
# files are names datesaa, datesab, datesac, ... dateszz
# if you only want two blocks of 10,000 dates,
# just throw the remaining files away.
Given you may have significantly more lines than you are interested in, you can also sort to a intermediate file then use head and tail to get the newest and oldest respectively:
sort -r infile >dates_sorted
head -n10000 dates_sorted >newest_dates
tail -n10000 dates_sorted >oldest_dates
Assumptions:
dates are not unique (per OPs comment)
results are dumped to two files newest and oldest
newest entries will be sorted in descending order
oldest entries will be sorted in ascending order
there's enough memory on the host to load the entire data file into memory (in the form of an awk array)
Sample input:
$ cat dates.dat
20170415
20171115
20180903
20131115
20141115
20131115
20141115
20150903
20271115
20271105
20271105
20280903
20071115
20071015
20070903
20031115
20031015
20030903
20011115
20011125
20010903
20010903
One idea using GNU awk:
x=5
awk -v max="${x}" '
{ dates[$1]++ }
END { count=0
PROCINFO["sorted_in"]="#ind_str_desc" # find the newest "max" dates
for (i in dates) {
for (n=1; n<=dates[i]; n++) {
if (++count > max) break
print i > "newest"
}
if (count > max) break
}
count=0
PROCINFO["sorted_in"]="#ind_str_asc" # find the oldest "max" dates
for (i in dates) {
for (n=1; n<=dates[i]; n++) {
if (++count > max) break
print i > "oldest"
}
if (count > max) break
}
}
' dates.dat
NOTE: if duplicate date shows up as rows #10,000 and #10,001, the #10,001 entry will not be included in the output
This generates:
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20280903
20271115
20271105
20271105
20180903
Here is a quick and dirty Awk attempt which collects the ten smallest and the ten largest entries from the file.
awk 'BEGIN { for(i=1; i<=10; i++) max[i] = min[i] = 0 }
NR==1 { max[1] = min[1] = $1; next }
(!max[10]) || ($1 > max[10]) {
for(i=1; i<=10; ++i) if(!max[i] || (max[i] < $1)) break
for(j=9; j>=i; --j) max[j+1]=max[j]
max[i] = $1 }
(!min[10]) !! ($1 < min[10]) {
for(i=1; i<=10; ++i) if (!min[i] || (min[i] > $1)) break
for(j=9; j>=i; --j) min[j+1]=min[j]
min[i] = $1 }
END { for(i=1; i<=10; ++i) print max[i];
print "---"
for(i=1; i<=10; ++i) print min[i] }' file
For simplicity, this has some naïve assumptions (numbers are all positive, there are at least 20 distinct numbers, duplicates should be accounted for).
This avoids external dependencies by using a brute-force sort in native Awk. We keep two sorted arrays min and max with ten items in each, and shift off the values which no longer fit as we populate them with the largest and smallest numbers.
It should be obvious how to extend this to 10,000.
Same assumptions as with my [other answer], except newest data is in ascending order ...
One idea using sort and head/tail:
$ sort dates.dat | tee >(head -5 > oldest) | tail -5 > newest
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20180903
20271105
20271105
20271115
20280903
OP can add another sort if needed (eg, tail -5 | sort -r > newest).
For large datasets OP may also want to investigate other sort options, eg, -S (allocate more memory for sorting), --parallel (enable parallel sorting), etc.

printf zero padded string

The format of MAC addresses varies with the platform.
E.g. on HPUX I could get something like:
0:0:c:7:ac:1e
While Linux gives me
00:00:0c:07:ac:1e
I used to use awk in a kornshell script on CentOS5 to format this to 00000c07ac1e like shown below.
MAC="0:0:c:7:ac:1e"
echo $MAC | awk -F: '{printf( "%02s%02s%02s%02s%02s%02s\n", $1,$2,$3,$4,$5,$6)}'
Unfortunately our admin server now is Ubuntu 14LTS with a newer version of awk which doesn't support the zero padding in the %s format anymore and I get an undesired 0 0 c 7ac1e
So I now switched to perl and do:
echo $MAC | perl -ne '{#A=split(":"); printf( "%02s%02s%02s%02s%02s%02s", #A)}'
As this may break too in upcoming releases I am looking for a more robust but still compact way to format the string.
Your Perl snippet will not break in future releases. This is basic functionality. Changing it will break many, many programs. (Plus, Perl has a mechanism for introducing backwards incompatible changes without breaking existing program.)
Cleaned up:
echo "$MAC" | perl -ne'#F=split(/:/); printf("%02s%02s%02s%02s%02s%02s\n", #F)'
Shorter:
echo "$MAC" | perl -ne'printf "%02s%02s%02s%02s%02s%02s\n", split /:/'
Without the repetition:
echo "$MAC" | perl -ple'$_ = join ":", map sprintf("%02s", $_), split /:/'
There's -a if you want something more awkish:
echo "$MAC" | perl -F: -aple'$_ = join ":", map sprintf("%02s", $_), #F'
Bit long but should be pretty robust
awk -F: '{for(i=1;i<=NF;i++){while(length($i)<2)$i=0$i;printf "%s",$i;}print ""}'
How it works
1.Loop through fields
2.Whilst the field is less than 2 characters long add zeros to the front
3.print the field
4.print newline character at end.
If you were dealing with a number rather than hex, you could use %.Xd to indicate you want at least X digits.
$ awk -F: '{printf( "%.2d%.2d\n", $1, $2)}' <<< "0:23"
0023
^^
two digits
From The GNU Awk User’s Guide #5.5.3 Modifiers for printf Formats:
.prec
A period followed by an integer constant specifies the precision to
use when printing. The meaning of the precision varies by control
letter:
%d, %i, %o, %u, %x, %X
Minimum number of digits to print.
In this case, you need a more general approach to deal with each one of the blocks of the MAC address. You can loop through the elements and add a 0 in case their length is just 1:
awk -F: '{for (i=1;i<=NF;i++) #loop through the elements
{
if (length($i)==1) #if length is 1
printf("0") #add a 0
printf ("%s", $i) #print the rest
}
print "" #print a new line at the end
}' <<< "0:0:c:7:ac:1e"
This returns:
00000c07ac1e
^^ ^^ ^^
^^ ^^ ^^
Note awk '...' <<< "$MAC" is the same as echo "$MAC" | awk '...'.

Using the bash sort command within variable-length filenames

I am trying to numerically sort a series of files output by the ls command which match the pattern either ABCDE1234A1789.RST.txt or ABCDE12345A1789.RST.txt by the '789' field.
In the example patterns above, ABCDE is the same for all files, 1234 or 12345 are digits that vary but are always either 4 or 5 digits in length. A1 is the same length for all files, but value can vary so unfortunately it can't be used as a delimiter. Everything after the first . is the same for all files. Something like:
ls -l *.RST.txt | sort -k +9.13 | awk '{print $9} ' > file-list.txt
will match the shorter filenames but not the longer ones because of the variable length of characters before the field I want to sort by.
Is there a way to accomplish sorting all files without first padding the shorter-length files to make them all the same length?
Perl to the rescue!
perl -e 'print "$_\n" for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'
If your perl is more recent (5.10 or newer), you can shorten it to
perl -E 'say for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'
Because of the parts of the filename which you've identified as unchanging, you can actually build a key which sort will use:
$ echo ABCDE{99999,8765,9876,345,654,23,21,2,3}A1789.RST.txt \
| fmt -w1 \
| sort -tE -k2,2n --debug
ABCDE2A1789.RST.txt
_
___________________
ABCDE3A1789.RST.txt
_
___________________
ABCDE21A1789.RST.txt
__
etc.
What this does is tell sort to separate the fields on character E, then use the 2nd field numerically. --debug arrived in coreutils 8.6, and can be very helpful in seeing exactly what sort is doing.
The conventional way to do this in bash is to extract your sort field. Except for the sort command, the following is implemented in pure bash alone:
sort_names_by_first_num() {
shopt -s extglob
for f; do
first_num="${f##+([^0-9])}";
first_num=${first_num%[^0-9]*};
[[ $first_num ]] && printf '%s\t%s\n' "$first_num" "$f"
done | sort -n | while IFS='' read -r name; do name=${name#*$'\t'}; printf '%s\n' "$name"; done
}
sort_names_by_first_num *.RST.txt
That said, newline-delimiting filenames (as this question seems to call for) is a bad practice: Filenames on UNIX filesystems are allowed to contain newlines within their names, so separating them by newlines within a list means your list is unable to contain a substantial subset of the range of valid names. It's much better practice to NUL-delimit your lists. Doing that would look like so:
sort_names_by_first_num() {
shopt -s extglob
for f; do
first_num="${f##+([^0-9])}";
first_num=${first_num%[^0-9]*};
[[ $first_num ]] && printf '%s\t%s\0' "$first_num" "$f"
done | sort -n -z | while IFS='' read -r -d '' name; do name=${name#*$'\t'}; printf '%s\0' "$name"; done
}
sort_names_by_first_num *.RST.txt

How to quickly find and replace many items on a list without replacing previously replaced items in BASH?

I want to perform about many find and replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method which is quite slow, because I have to do the whole find and replace twice:
(1) Convert the CSV to three files, e.g.:
a.csv b.csv c.csv
orange 0001 fruit2
carrot 0002 vegetable1
apple 0003 fruit3
pear 0004 fruit4
ink 0005 item1
table 0006 item 2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
for i in `sed -n "$a"p ./b.csv`; do
for j in `sed -n "$a"p ./a.csv`; do
sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
a=`expr $a + 1`
done
done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a perl solution which is doing the replacement in "one phase".
#!/usr/bin/perl
use strict;
my %map = (
orange => "fruit2",
carrot => "vegetable1",
apple => "fruit3",
pear => "fruit4",
ink => "item1",
table => "item2",
);
my $repl_rx = '(' . join("|", map { quotemeta } keys %map) . ')';
my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
"orange" "fruit2"
"carrot" "vegetable1"
"apple" "fruit3"
"pear" "fruit4"
"ink" "item1"
"table" "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
[orange]=fruit2
[carrot]=vegetable1
[apple]=fruit3
[pear]=fruit4
[ink]=item1
[table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
matched=false
for key in "${!map[#]}"; do
if [[ ${str:$i:${#key}} = $key ]]; then
str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
((i+=${#map[$key]}))
matched=true
break
fi
done
$matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[#]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[#]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die ("first parameter must be the replacement list file") unless defined ($repls);
my $tmpFmt = "###%d###";
open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;
my #replsList;
my $i = 0;
while (<$replsFile>) {
chomp;
my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
if (defined($from) && defined($to)) {
push(#replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
}
}
while (<>) {
foreach my $r (#replsList) {
s/$r->[0]/$r->[1]/g;
}
foreach my $r (#replsList) {
s/$r->[1]/$r->[2]/g;
}
print;
}
I would guess that most of your slowness is coming from creating so many sed commands, which each need to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running 1 sed per file per step.
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
cmd=""
for i in `sed -n "$a"p ./a.csv`; do
for j in `sed -n "$a"p ./b.csv`; do
cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
a=`expr $a + 1`
done
done
sed -i "$cmd" ./file.txt
done
Doing it twice is probably not your problem. If you managed to just do it once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try)
But continuing down the path of other posters, the next step might be pipelining. Write a little program that replaces two columns, then run that program twice, simultaneously. The first run swaps out strings in column1 with strings in column2, the next swaps out strings in column2 with strings in column3.
Your command line would be like this
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;
open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");
my #replace_pairs;
# read in the list of things to replace
while(<REPLACEFILE>) {
chomp();
my #cols = split /\t/, $_;
my $to_replace = $cols[$before_replace_colnum];
my $replace_with = $cols[$after_replace_colnum];
push #replace_pairs, [$to_replace, $replace_with];
}
# read input from stdin, do swapping
while(<STDIN>) {
# loop over all replacement strings
foreach my $replace_pair (#replace_pairs) {
my($to_replace,$replace_with) = #{$replace_pair};
$_ =~ s/${to_replace}/${replace_with}/g;
}
print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
read countmd5sum x < <(md5sum <<< $count)
count=$(( $count + 1 ))
bigfrom="$bigfrom;s/$from/$countmd5sum/g"
bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum, to get some unique token. But some other mechanism can also be used to generate such token; like reading from /dev/urandom or shuf -n1 -i 10000000-20000000
A awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
Here, it first generates the sed script, using the csv as input file.
Then uses another sed instance to operate on input.txt
Notes:
The intermediate file generated - sed_script.sed can be re-used again, unless the input csv file changes.
####<number>#### is chosen as some pattern, which is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted. In this case each character in the substitution string is replaced by itself and a \n. Finally once all substitutions have taken place the \n's are removed leaving the finished string.
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace ( based on the sample data ):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This a single pass, awk only answer with very little regex.
It reads the "repl.csv" file into an associative array ( see BEGIN{} ), then attempts to match on prefixes of each word when the length of the word is bound by key length limits, trying to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
while( getline repline < "repl.csv" ) {
split( repline, replarr, "," )
replassocarr[ replarr[1] ] = replarr[2]
# set some bounds on the replace word sizes
if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
minKeyLen = length( replarr[1] )
if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
maxKeyLen = length( replarr[1] )
}
close( "repl.csv" )
}
{
i = 1
while( i <= NF ) { print_word( $i, i == NF ); i++ }
}
function print_word( w, end ) {
wl = length( w )
for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
key = substr( w, 1, j )
wl = length( key )
if( wl >= minKeyLen && key in replassocarr ) {
printf( "%s%s%s", replassocarr[ key ],
substr( w, j+1 ), !end ? " " : "\n" )
return
}
}
printf( "%s%s", w, !end ? " " : "\n" )
}
function prefix_len_bound( len, jlen ) {
return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeylen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course any "savings" of not looking the replassocarr go away when the words to be replaced goes to length=1 or if the average word length is much greater than the words to replace.

How can I extract a predetermined range of lines from a text file on Unix?

I have a ~23000 line SQL dump containing several databases worth of data. I need to extract a certain section of this file (i.e. the data for a single database) and place it in a new file. I know both the start and end line numbers of the data that I want.
Does anyone know a Unix command (or series of commands) to extract all lines from a file between say line 16224 and 16482 and then redirect them into a new file?
sed -n '16224,16482p;16483q' filename > newfile
From the sed manual:
p -
Print out the pattern space (to the standard output). This command is usually only used in conjunction with the -n command-line option.
n -
If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with the next line of input. If
there is no more input then sed exits without processing any more
commands.
q -
Exit sed without processing any more commands or input.
Note that the current pattern space is printed if auto-print is not disabled with the -n option.
and
Addresses in a sed script can be in any of the following forms:
number
Specifying a line number will match only that line in the input.
An address range can be specified by specifying two addresses
separated by a comma (,). An address range matches lines starting from
where the first address matches, and continues until the second
address matches (inclusively).
sed -n '16224,16482 p' orig-data-file > new-file
Where 16224,16482 are the start line number and end line number, inclusive. This is 1-indexed. -n suppresses echoing the input as output, which you clearly don't want; the numbers indicate the range of lines to make the following command operate on; the command p prints out the relevant lines.
Quite simple using head/tail:
head -16482 in.sql | tail -258 > out.sql
using sed:
sed -n '16224,16482p' in.sql > out.sql
using awk:
awk 'NR>=16224&&NR<=16482' in.sql > out.sql
You could use 'vi' and then the following command:
:16224,16482w!/tmp/some-file
Alternatively:
cat file | head -n 16482 | tail -n 258
EDIT:- Just to add explanation, you use head -n 16482 to display first 16482 lines then use tail -n 258 to get last 258 lines out of the first output.
There is another approach with awk:
awk 'NR==16224, NR==16482' file
If the file is huge, it can be good to exit after reading the last desired line. This way, it won't read the following lines unnecessarily:
awk 'NR==16224, NR==16482-1; NR==16482 {print; exit}' file
awk 'NR==16224, NR==16482; NR==16482 {exit}' file
perl -ne 'print if 16224..16482' file.txt > new_file.txt
People trying to wrap their heads around computing an interval for the head | tail combo are overthinking it.
Here's how you get the "16224 -- 16482" range without computing anything:
cat file | head -n +16482 | tail -n +16224
Explanation:
The + instructs the head/tail command to "go up to / start from" (respectively) the specified line number as counted from the beginning of the file.
Similarly, a - instructs them to "go up to / start from" (respectively) the specified line number as counted from the end of the file
The solution shown above simply uses head first, to 'keep everything up to the top number', and then tail second, to 'keep everything from the bottom number upwards', thus defining our range of interest (with no need to compute an interval).
Standing on the shoulders of boxxar, I like this:
sed -n '<first line>,$p;<last line>q' input
e.g.
sed -n '16224,$p;16482q' input
The $ means "last line", so the first command makes sed print all lines starting with line 16224 and the second command makes sed quit after printing line 16428. (Adding 1 for the q-range in boxxar's solution does not seem to be necessary.)
I like this variant because I don't need to specify the ending line number twice. And I measured that using $ does not have detrimental effects on performance.
# print section of file based on line numbers
sed -n '16224 ,16482p' # method 1
sed '16224,16482!d' # method 2
cat dump.txt | head -16224 | tail -258
should do the trick. The downside of this approach is that you need to do the arithmetic to determine the argument for tail and to account for whether you want the 'between' to include the ending line or not.
sed -n '16224,16482p' < dump.sql
Quick and dirty:
head -16428 < file.in | tail -259 > file.out
Probably not the best way to do it but it should work.
BTW: 259 = 16482-16224+1.
I wrote a Haskell program called splitter that does exactly this: have a read through my release blog post.
You can use the program as follows:
$ cat somefile | splitter 16224-16482
And that is all that there is to it. You will need Haskell to install it. Just:
$ cabal install splitter
And you are done. I hope that you find this program useful.
Even we can do this to check at command line:
cat filename|sed 'n1,n2!d' > abc.txt
For Example:
cat foo.pl|sed '100,200!d' > abc.txt
Using ruby:
ruby -ne 'puts "#{$.}: #{$_}" if $. >= 32613500 && $. <= 32614500' < GND.rdf > GND.extract.rdf
I wanted to do the same thing from a script using a variable and achieved it by putting quotes around the $variable to separate the variable name from the p:
sed -n "$first","$count"p imagelist.txt >"$imageblock"
I wanted to split a list into separate folders and found the initial question and answer a useful step. (split command not an option on the old os I have to port code to).
Just benchmarking 3 solutions given above, that works to me:
awk
sed
"head+tail"
Credits on the 3 solutions goes to:
#boxxar
#avandeursen
#wds
#manveru
#sibaz
#SOFe
#fedorqui 'SO stop harming'
#Robin A. Meade
I'm using a huge file I find in my server:
# wc fo2debug.1.log
10421186 19448208 38795491134 fo2debug.1.log
38 Gb in 10.4 million lines.
And yes, I have a logrotate problem. : ))
Make your bets!
Getting 256 lines from the beginning of the file.
# time sed -n '1001,1256p;1256q' fo2debug.1.log | wc -l
256
real 0m0,003s
user 0m0,000s
sys 0m0,004s
# time head -1256 fo2debug.1.log | tail -n +1001 | wc -l
256
real 0m0,003s
user 0m0,006s
sys 0m0,000s
# time awk 'NR==1001, NR==1256; NR==1256 {exit}' fo2debug.1.log | wc -l
256
real 0m0,002s
user 0m0,004s
sys 0m0,000s
Awk won. Technical tie in second place between sed and "head+tail".
Getting 256 lines at the end of the first third of the file.
# time sed -n '3473001,3473256p;3473256q' fo2debug.1.log | wc -l
256
real 0m0,265s
user 0m0,242s
sys 0m0,024s
# time head -3473256 fo2debug.1.log | tail -n +3473001 | wc -l
256
real 0m0,308s
user 0m0,313s
sys 0m0,145s
# time awk 'NR==3473001, NR==3473256; NR==3473256 {exit}' fo2debug.1.log | wc -l
256
real 0m0,393s
user 0m0,326s
sys 0m0,068s
Sed won. Followed by "head+tail" and, finally, awk.
Getting 256 lines at the end of the second third of the file.
# time sed -n '6947001,6947256p;6947256q' fo2debug.1.log | wc -l
A256
real 0m0,525s
user 0m0,462s
sys 0m0,064s
# time head -6947256 fo2debug.1.log | tail -n +6947001 | wc -l
256
real 0m0,615s
user 0m0,488s
sys 0m0,423s
# time awk 'NR==6947001, NR==6947256; NR==6947256 {exit}' fo2debug.1.log | wc -l
256
real 0m0,779s
user 0m0,650s
sys 0m0,130s
Same results.
Sed won. Followed by "head+tail" and, finally, awk.
Getting 256 lines near the end of the file.
# time sed -n '10420001,10420256p;10420256q' fo2debug.1.log | wc -l
256
real 1m50,017s
user 0m12,735s
sys 0m22,926s
# time head -10420256 fo2debug.1.log | tail -n +10420001 | wc -l
256
real 1m48,269s
user 0m42,404s
sys 0m51,015s
# time awk 'NR==10420001, NR==10420256; NR==10420256 {exit}' fo2debug.1.log | wc -l
256
real 1m49,106s
user 0m12,322s
sys 0m18,576s
And suddenly, a twist!
"Head+tail" won. Followed by awk and, finally, sed.
(some hours later...)
Sorry guys!
My analysis above ends up being an example of a basic flaw in doing an analysis.
The flaw is not knowing in depth the resources used for the analysis.
In this case, I used a log file to analyze the performance of a search for a certain number of lines within it.
Using 3 different techniques, searches were made at different points in the file, comparing the performance of the techniques at each point and checking whether the results varied depending on the point in the file where the search was made.
My mistake was to assume that there was a certain homogeneity of content in the log file.
The reality is that long lines appear more frequently at the end of the file.
Thus, the apparent conclusion that longer searches (closer to the end of the file) are better with a given technique, may be biased. In fact, this technique may be better when dealing with longer lines. What remains to be confirmed.
I was about to post the head/tail trick, but actually I'd probably just fire up emacs. ;-)
esc-x goto-line ret 16224
mark (ctrl-space)
esc-x goto-line ret 16482
esc-w
open the new output file, ctl-y
save
Let's me see what's happening.
I would use:
awk 'FNR >= 16224 && FNR <= 16482' my_file > extracted.txt
FNR contains the record (line) number of the line being read from the file.
Using ed:
ed -s infile <<<'16224,16482p'
-s suppresses diagnostic output; the actual commands are in a here-string. Specifically, 16224,16482p runs the p (print) command on the desired line address range.
I wrote a small bash script that you can run from your command line, so long as you update your PATH to include its directory (or you can place it in a directory that is already contained in the PATH).
Usage: $ pinch filename start-line end-line
#!/bin/bash
# Display line number ranges of a file to the terminal.
# Usage: $ pinch filename start-line end-line
# By Evan J. Coon
FILENAME=$1
START=$2
END=$3
ERROR="[PINCH ERROR]"
# Check that the number of arguments is 3
if [ $# -lt 3 ]; then
echo "$ERROR Need three arguments: Filename Start-line End-line"
exit 1
fi
# Check that the file exists.
if [ ! -f "$FILENAME" ]; then
echo -e "$ERROR File does not exist. \n\t$FILENAME"
exit 1
fi
# Check that start-line is not greater than end-line
if [ "$START" -gt "$END" ]; then
echo -e "$ERROR Start line is greater than End line."
exit 1
fi
# Check that start-line is positive.
if [ "$START" -lt 0 ]; then
echo -e "$ERROR Start line is less than 0."
exit 1
fi
# Check that end-line is positive.
if [ "$END" -lt 0 ]; then
echo -e "$ERROR End line is less than 0."
exit 1
fi
NUMOFLINES=$(wc -l < "$FILENAME")
# Check that end-line is not greater than the number of lines in the file.
if [ "$END" -gt "$NUMOFLINES" ]; then
echo -e "$ERROR End line is greater than number of lines in file."
exit 1
fi
# The distance from the end of the file to end-line
ENDDIFF=$(( NUMOFLINES - END ))
# For larger files, this will run more quickly. If the distance from the
# end of the file to the end-line is less than the distance from the
# start of the file to the start-line, then start pinching from the
# bottom as opposed to the top.
if [ "$START" -lt "$ENDDIFF" ]; then
< "$FILENAME" head -n $END | tail -n +$START
else
< "$FILENAME" tail -n +$START | head -n $(( END-START+1 ))
fi
# Success
exit 0
This might work for you (GNU sed):
sed -ne '16224,16482w newfile' -e '16482q' file
or taking advantage of bash:
sed -n $'16224,16482w newfile\n16482q' file
Since we are talking about extracting lines of text from a text file, I will give an special case where you want to extract all lines that match a certain pattern.
myfile content:
=====================
line1 not needed
line2 also discarded
[Data]
first data line
second data line
=====================
sed -n '/Data/,$p' myfile
Will print the [Data] line and the remaining. If you want the text from line1 to the pattern, you type: sed -n '1,/Data/p' myfile. Furthermore, if you know two pattern (better be unique in your text), both the beginning and end line of the range can be specified with matches.
sed -n '/BEGIN_MARK/,/END_MARK/p' myfile
I've compiled some of the highest rated solutions for sed, perl, head+tail, plus my own code for awk, and focusing on performance via the pipe, while using LC_ALL=C to ensure all candidates at their fastest possible, allocating 2-second sleep gap in between.
The gaps are somewhat noticeable :
abs time awk/app speed ratio
----------------------------------
0.0672 sec : 1.00x mawk-2
0.0839 sec : 1.25x gnu-sed
0.1289 sec : 1.92x perl
0.2151 sec : 3.20x gnu-head+tail
Haven't had chance to test python or BSD variants of those utilities.
(fg && fg && fg && fg) 2>/dev/null;
echo;
( time ( pvE0 < "${m3t}"
| LC_ALL=C mawk2 '
BEGIN {
_=10420001-(\
__=10420256)^(FS="^$")
} _<NR {
print
if(__==NR) { exit }
}' ) | pvE9) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2;
(fg && fg && fg && fg) 2>/dev/null
echo;
( time ( pvE0 < "${m3t}"
| LC_ALL=C gsed -n '10420001,10420256p;10420256q'
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2; (fg && fg && fg && fg) 2>/dev/null
echo
( time ( pvE0 < "${m3t}"
| LC_ALL=C perl -ne 'print if 10420001..10420256'
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2; (fg && fg && fg && fg) 2>/dev/null
echo
( time ( pvE0 < "${m3t}"
| LC_ALL=C ghead -n +10420256
| LC_ALL=C gtail -n +10420001
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
in0: 1.51GiB 0:00:00 [2.31GiB/s] [2.31GiB/s] [============> ] 81%
out9: 42.5KiB 0:00:00 [64.9KiB/s] [64.9KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C mawk2 ; )
0.43s user 0.36s system 117% cpu 0.672 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
out9: 42.5KiB 0:00:00 [51.7KiB/s] [51.7KiB/s] [ <=> ]
in0: 1.51GiB 0:00:00 [1.84GiB/s] [1.84GiB/s] [==========> ] 81%
( pvE 0.1 in0 < "${m3t}" |LC_ALL=C gsed -n '10420001,10420256p;10420256q'; )
0.68s user 0.34s system 121% cpu 0.839 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
in0: 1.85GiB 0:00:01 [1.46GiB/s] [1.46GiB/s] [=============>] 100%
out9: 42.5KiB 0:00:01 [33.5KiB/s] [33.5KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C perl -ne 'print if 10420001..10420256'; )
1.10s user 0.44s system 119% cpu 1.289 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
in0: 1.51GiB 0:00:02 [ 728MiB/s] [ 728MiB/s] [=============> ] 81%
out9: 42.5KiB 0:00:02 [19.9KiB/s] [19.9KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}"
| LC_ALL=C ghead -n +10420256
| LC_ALL=C gtail -n ; )
1.98s user 1.40s system 157% cpu 2.151 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
The -n in the accept answers work. Here's another way in case you're inclined.
cat $filename | sed "${linenum}p;d";
This does the following:
pipe in the contents of a file (or feed in the text however you want).
sed selects the given line, prints it
d is required to delete lines, otherwise sed will assume all lines will eventually be printed. i.e., without the d, you will get all lines printed by the selected line printed twice because you have the ${linenum}p part asking for it to be printed. I'm pretty sure the -n is basically doing the same thing as the d here.
I was looking for an answer to this but I had to end up writing my own code which worked. None of the answers above were satisfactory.
Consider you have very large file and have certain line numbers that you want to print out but the numbers are not in order. You can do the following:
My relatively large file
for letter in {a..k} ; do echo $letter; done | cat -n > myfile.txt
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
Specific line numbers I want:
shuf -i 1-11 -n 4 > line_numbers_I_want.txt
10
11
4
9
To print these line numbers, do the following.
awk '{system("head myfile.txt -n " $0 " | tail -n 1")}' line_numbers_I_want.txt
What the above does is to head the n line then take the last line using tail
If you want your line numbers in order, sort ( is -n numeric sort) first then get the lines.
cat line_numbers_I_want.txt | sort -n | awk '{system("head myfile.txt -n " $0 " | tail -n 1")}'
4 d
9 i
10 j
11 k
Maybe, you would be so kind to give this humble script a chance ;-)
#!/usr/bin/bash
# Usage:
# body n m|-m
from=$1
to=$2
if [ $to -gt 0 ]; then
# count $from the begin of the file $to selected line
awk "NR >= $from && NR <= $to {print}"
else
# count $from the begin of the file skipping tailing $to lines
awk '
BEGIN {lines=0; from='$from'; to='$to'}
{++lines}
NR >= $from {line[lines]=$0}
END {for (i = from; i < lines + to + 1; i++) {
print line[i]
}
}'
fi
Outputs:
$ seq 20 | ./body.sh 5 15
5
6
7
8
9
10
11
12
13
14
15
$ seq 20 | ./body.sh 5 -5
5
6
7
8
9
10
11
12
13
14
15
You could use sed command in your case and is pretty fast.
As mentioned lets assume the range is: between 16224 and 16482 lines
#get the lines from 16224 to 16482 and prints the values into filename.txt file
sed -n '16224 ,16482p' file.txt > filename.txt
#Additional Info to showcase other possible scenarios:
#get the 16224 th line and writes the value to filename.txt
sed -n '16224p' file.txt > filename.txt
#get the 16224 and 16300 line values only and write to filename.txt.
sed -n '16224p;16300p;' file.txt > filename.txt