sort and extract a certain number of rows from a file containing dates - perl

I have dates in a txt file, in the form:
yyyymmdd
The raw data look like:
20171115
20171115
20180903
...
20201231
There are more than 100k rows. I am trying to keep the "newest" 10k lines in one file, and the "oldest" 10k lines in a separate file.
I guess this must be a two-step process:
sort the lines,
then extract the 10k rows at the top (the "newest", i.e. the most recent dates) and the 10k rows towards the end of the file (the "oldest", i.e. the most ancient dates).
How could I achieve this using awk?
I even tried with Perl, though with no luck, so a Perl one-liner would be highly accepted as well.
Edit: I would prefer a clean, clever solution that I can learn from, not an optimization of my attempts.
Example with Perl:
@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $d = $3 . $2 . $1;
    $c <=> $d;
}
print "@ordered\n";

This is an answer using Perl.
If you want the oldest on top, you can use the standard sort order:
@dates = sort @dates;
Reverse sort order, with the newest on top:
@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
#    numerical three-way comparison returning -1, 0 or +1
You can then extract 10000 of the entries from the top:
my $keep = 10000;
my @top = splice @dates, 0, $keep;
And 10000 from the bottom:
$keep = @dates unless (@dates >= $keep);
my @bottom = splice @dates, -$keep;
@dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.
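A tiny worked example of the splice mechanics (my own illustration, using $keep = 2 on a five-element list):
my @dates  = qw(20201231 20180903 20171115 20171115 20100101);   # already sorted, newest first
my $keep   = 2;
my @top    = splice @dates, 0, $keep;   # (20201231 20180903)
my @bottom = splice @dates, -$keep;     # (20171115 20100101)
# @dates is now (20171115), the middle that was left behind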
You can then save the two arrays to files if you want:
sub save {
    my $filename = shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if (@_);
    close $fh;
}
save('top', @top);
save('bottom', @bottom);
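Putting the pieces above together, here is a minimal self-contained sketch (the input file name dates.txt is my assumption; the rest mirrors the snippets above):
#!/usr/bin/perl
use strict;
use warnings;

# read the dates, one per line, from the assumed input file
open my $in, '<', 'dates.txt' or die "dates.txt: $!";
chomp(my @dates = <$in>);
close $in;

# newest first: numeric descending order works for yyyymmdd
@dates = sort { $b <=> $a } @dates;

my $keep = 10000;
my @top = splice @dates, 0, $keep;                 # newest lines (takes fewer if the file is short)

$keep = @dates if @dates < $keep;                  # don't reach past the start of what is left
my @bottom = $keep ? splice(@dates, -$keep) : ();  # oldest lines

save('top', @top);
save('bottom', @bottom);

sub save {                                         # same helper as above
    my $filename = shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if (@_);
    close $fh;
}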

A command-line script ("one"-liner) with Perl
perl -MPath::Tiny=path -we'
  $f = shift; $n = shift//2;              # filename; number of lines or default
  @d = sort +(path($f)->lines);           # sort lexicographically, ascending
  $n = int @d/2 if 2*$n > @d;             # top/bottom lines, up to half of file
  path("bottom.txt")->spew(@d[0..$n-1]);  # write files, top/bottom $n lines
  path("top.txt")   ->spew(@d[$#d-$n+1..$#d])
' dates.txt 4
Comments
Needs a filename, and can optionally take the number of lines to take from the top and the bottom; in this example 4 is passed (the default is 2), for easy tests with small files. We don't need to check for the filename, since the library used to read it, Path::Tiny, does that
For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the library is a class, so =path may simply be removed
Sorting is alphabetical, but that is fine with dates in this format; the oldest dates come first, but that doesn't matter since we'll split off what we need from each end. To enforce numerical sorting, and while at it to sort in descending order, use sort { $b <=> $a } @d;. See sort
We check whether there are enough lines in the file for the desired number of lines to shave off the (sorted) top and bottom ($n). If there aren't, $n is set to half the file
The syntax $#ary is the last index of the array @ary, and that is used to count off $n items from the back of the array of lines @d
This is written as a command-line program ("one-liner") merely because that was asked for. But that much code would be far more comfortable in a script.
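For reference, a sketch of the same logic as a standalone script (my own restatement of the one-liner above; the default file name and count are my assumptions):
#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny qw(path);

my $file = shift // 'dates.txt';   # input file, defaulting to an assumed name
my $n    = shift // 2;             # lines to take from each end

my @d = sort +(path($file)->lines);   # lexicographic ascending sort is fine for yyyymmdd
$n = int @d/2 if 2*$n > @d;           # never take more than half from each end

path("bottom.txt")->spew(@d[0 .. $n-1]);      # oldest $n lines
path("top.txt")->spew(@d[$#d-$n+1 .. $#d]);   # newest $n lines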

Given that your lines with dates will sort lexicographically, it is simple. Just use sort then split.
Given:
cat infile
20171115
20171115
20180903
20100101
20211118
20201231
You can sort then split that input file into files of 3 lines each:
split -l3 <(sort -r infile) dates
# -l10000 for a 10,000 line split
The result:
for fn in dates*; do echo "${fn}:"; cat "$fn"; done
datesaa:
20211118
20201231
20180903
datesab:
20171115
20171115
20100101
# files are named datesaa, datesab, datesac, ... dateszz
# if you only want two blocks of 10,000 dates,
# just throw the remaining files away.
Given that you may have significantly more lines than you are interested in, you can also sort to an intermediate file, then use head and tail to get the newest and oldest respectively:
sort -r infile >dates_sorted
head -n10000 dates_sorted >newest_dates
tail -n10000 dates_sorted >oldest_dates

Assumptions:
dates are not unique (per OP's comment)
results are dumped to two files newest and oldest
newest entries will be sorted in descending order
oldest entries will be sorted in ascending order
there's enough memory on the host to load the entire data file into memory (in the form of an awk array)
Sample input:
$ cat dates.dat
20170415
20171115
20180903
20131115
20141115
20131115
20141115
20150903
20271115
20271105
20271105
20280903
20071115
20071015
20070903
20031115
20031015
20030903
20011115
20011125
20010903
20010903
One idea using GNU awk:
x=5
awk -v max="${x}" '
    { dates[$1]++ }
END { count=0
      PROCINFO["sorted_in"]="@ind_str_desc"       # find the newest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if (++count > max) break
              print i > "newest"
          }
          if (count > max) break
      }
      count=0
      PROCINFO["sorted_in"]="@ind_str_asc"        # find the oldest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if (++count > max) break
              print i > "oldest"
          }
          if (count > max) break
      }
    }
' dates.dat
NOTE: if a duplicate date shows up as rows #10,000 and #10,001, the #10,001 entry will not be included in the output
This generates:
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20280903
20271115
20271105
20271105
20180903

Here is a quick and dirty Awk attempt which collects the ten smallest and the ten largest entries from the file.
awk 'BEGIN { for(i=1; i<=10; i++) max[i] = min[i] = 0 }
    NR==1 { max[1] = min[1] = $1; next }
    (!max[10]) || ($1 > max[10]) {
        for(i=1; i<=10; ++i) if(!max[i] || (max[i] < $1)) break
        for(j=9; j>=i; --j) max[j+1]=max[j]
        max[i] = $1 }
    (!min[10]) || ($1 < min[10]) {
        for(i=1; i<=10; ++i) if (!min[i] || (min[i] > $1)) break
        for(j=9; j>=i; --j) min[j+1]=min[j]
        min[i] = $1 }
    END { for(i=1; i<=10; ++i) print max[i];
        print "---"
        for(i=1; i<=10; ++i) print min[i] }' file
For simplicity, this has some naïve assumptions (numbers are all positive, there are at least 20 distinct numbers, duplicates should be accounted for).
This avoids external dependencies by using a brute-force sort in native Awk. We keep two sorted arrays min and max with ten items in each, and shift off the values which no longer fit as we populate them with the largest and smallest numbers.
It should be obvious how to extend this to 10,000.
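For comparison (my own illustration, not part of the original answer), here is the same bounded-buffer idea in Perl, with the limit in a variable so that extending it to 10,000 only means changing $n:
#!/usr/bin/perl
use strict;
use warnings;

my $n = 10;            # how many entries to keep at each end
my (@min, @max);       # @min kept ascending, @max kept descending

while (<>) {
    chomp;
    # insert into @max if it is not full yet or the value beats its smallest entry
    if (@max < $n || $_ gt $max[-1]) {
        my $i = 0;
        $i++ while $i < @max && $max[$i] ge $_;
        splice @max, $i, 0, $_;
        pop @max if @max > $n;
    }
    # insert into @min if it is not full yet or the value undercuts its largest entry
    if (@min < $n || $_ lt $min[-1]) {
        my $i = 0;
        $i++ while $i < @min && $min[$i] le $_;
        splice @min, $i, 0, $_;
        pop @min if @min > $n;
    }
}
print "$_\n" for @max;
print "---\n";
print "$_\n" for @min;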

Same assumptions as with my [other answer], except newest data is in ascending order ...
One idea using sort and head/tail:
$ sort dates.dat | tee >(head -5 > oldest) | tail -5 > newest
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20180903
20271105
20271105
20271115
20280903
OP can add another sort if needed (eg, tail -5 | sort -r > newest).
For large datasets OP may also want to investigate other sort options, eg, -S (allocate more memory for sorting), --parallel (enable parallel sorting), etc.

Related

Matching and splitting a specific line from a .txt file

I'm looking for concise Perl equivalents (to use within scripts rather than one-liners) to a few things I'd otherwise do in bash/awk:
Count=$(awk '/reads/ && ! seen {print $1; seen=1}' < input.txt)
Which trawls through a specified .txt file that contains a multitude of lines including some in this format:
8523723 reads; of these:
1256265 reads; of these:
2418091 reads; of these:
Printing '8523723' and ignoring the remainder of the matchable lines (as I only wish to act on the first matched instance).
Secondly:
Count2=$(awk '/paired/ {sum+=$1} END{print sum}' < input.txt)
25 paired; of these:
15 paired; of these:
Which would create a running total of the numbers on each matched line, printing 40.
The first one is:
while (<>) {
    if (/reads/) {
        print +(split)[0];   # print the first whitespace-separated field, like awk's $1
        last;
    }
}
The second one is:
use feature 'say';   # needed for say (or: use v5.10;)

my $total = 0;
while (<>) {
    if (/(\d+) paired/) {
        $total += $1;
    }
}
say $total;
You could, no doubt, golf them both. But these versions are readable :-)
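For instance, a sketch of golfed one-liner forms (my illustration, not from the original answer), wrapped the same way as the awk commands in the question:
Count=$(perl -ne 'if (/reads/) { print +(split)[0]; last }' input.txt)
Count2=$(perl -ne '$t += $1 if /(\d+) paired/; END { print $t+0 }' input.txt)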

How to sort column A uniquely based on descending order of column B in unix/perl/tcl?

I have a csv file like the one below.
Column A, Column B
cat,30
cat,40
dog,10
elephant,23
dog,3
elephant,37
How would I uniquely sort column A, based on the largest corresponding value in
column B?
The result I would like to get is,
Column A, Column B
cat,40
elephant,37
dog,10
awk to the rescue!
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '!a[$1]++'
Column A, Column B
cat,40
dog,10
elephant,37
If you want your specific output, it needs a little more coding because of the header line:
$ sort -t, -k1,1 -k2nr filename | awk -F, 'NR==1{print "999999\t"$0;next} !a[$1]++{print $2"\t"$0}' | sort -k1nr | cut -f2-
Column A, Column B
cat,40
elephant,37
dog,10
Another alternative with removing header upfront and adding it back at the end
$ h=$(head -1 filename); sed 1d filename | sort -t, -k1,1 -k2nr | awk -F, '!a[$1]++' | sort -t, -k2nr | sed '1i'"$h"''
Perlishly:
#!/usr/bin/env perl
use strict;
use warnings;

# print header row
print scalar <>;

my %seen;
# iterate the magic filehandle (file specified on command line or
# stdin - e.g. like grep/sed)
while (<>) {
    chomp;   # strip trailing linefeed
    # split this line on ','
    my ( $key, $value ) = split /,/;
    # save this value if the previous is lower or non-existent
    if ( not defined $seen{$key}
        or $seen{$key} < $value )
    {
        $seen{$key} = $value;
    }
}
# sort, comparing values in %seen
foreach my $key ( sort { $seen{$b} <=> $seen{$a} } keys %seen ) {
    print "$key,$seen{$key}\n";
}
I've +1'd karakfa's answer. It's simple and elegant.
My answer is an extension of karakfa's header handling. If you like it, please feel free to +1 my answer, but "best answer" should go to karakfa. (Unless of course you prefer one of the other answers! :] )
If your input is as you've described in your question, then we can recognize the header by seeing that $2 is not numeric. Thus, the following does not take the header into consideration:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '!a[$1]++'
You might alternately strip the header with:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '$2~/^[0-9]+$/&&!a[$1]++'
This slows things down quite a bit, since a regex may take longer to evaluate than a simple array assignment and numeric test. I'm using a regex for the numeric test in order to permit a 0, which would otherwise evaluate to "false".
Next, if you want to keep the header, but print it first, you can process your output at the end of the stream:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '$2!~/^[0-9]+$/{print;next} !a[$1]++{b[$1]=$0} END{for(i in b){print b[i]}}'
The last option, to achieve the same effect without storing the extra array in memory, would be to process your input a second time. This is more costly in terms of IO, but less costly in terms of memory:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, 'NR==FNR&&$2!~/^[0-9]+$/{print;nextfile} $2~/^[0-9]+$/&&!a[$1]++' filename -
Another perl
perl -MList::Util=max -F, -lane '
if ($.==1) {print; next}
$val{$F[0]} = max $val{$F[0]}, $F[1];
} {
print "$_,$val{$_}" for reverse sort {$val{$a} <=> $val{$b}} keys %val;
' file
One possible Tcl solution:
# read the contents of the file into a list of lines
set f [open data.csv]
set lines [split [string trim [chan read $f]] \n]
chan close $f
# detach the header
set lines [lassign $lines header]
# map the list of lines to a list of tuples
set tuples [lmap line $lines {split $line ,}]
# use an associative array to get unique tuples in a flat list
array set uniqueTuples [concat {*}[lsort -index 1 -integer $tuples]]
# reassemble the tuples, sorted by name
set tuples [lmap {a b} [lsort -stride 2 -index 0 [array get uniqueTuples]] {list $a $b}]
# map the tuples to csv lines and insert the header
set lines [linsert [lmap tuple $tuples {join $tuple ,}] 0 $header]
# convert the list of lines into a data string
set data [join $lines \n]
This solution assumes a simplified data set where there are no quoted elements. If there are quoted elements, the csv module should be used instead of the split command.
Another solution, inspired by the Perl solution:
puts [gets stdin]
set seen [dict create]
while {[gets stdin line] >= 0} {
lassign [split $line ,] key value
if {![dict exists $seen $key] || [dict get $seen $key] < $value} {
dict set seen $key $value
}
}
dict for {key val} [lsort -stride 2 -index 0 $seen] {
puts $key,$val
}
Documentation: chan, concat, dict, gets, if, join, lassign, linsert, lmap, lmap replacement, lsort, open, set, split, string, while

how do you select a column from a text file using perl

I want to subtract values in one column from another column and add up the differences. How do I do this in Perl? I am new to Perl, hence I am unable to figure out how to go about it. Kindly help me.
The first thing is to separate the data into columns. In this case, the columns are separated by a space; split(/ /) will return a list of the columns.
To subtract one from the other, you pull the values out of the list and subtract them.
At the end, you add the difference to the running sum and then loop over the rest of the data.
#!/usr/bin/perl
use strict;
my $sum = 0;
while (<DATA>) {
    my @vals = split(/ /);
    my $diff = $vals[1] - $vals[0];
    $sum += $diff;
}
print $sum,"\n";
__DATA__
1 3
3 5
5 7
This will print out 6 --- (3 - 1) + (5 - 3) + (7 - 5)
FYI, if you combine the autosplit (-a), loop (-n) and command-line program (-e) arguments (see perlrun), you can shorten this to a one-liner, much like awk:
perl -ane '$sum += $F[1] - $F[0]; END { print $sum }' filename

How to determine if one or more columns in a large text file are sorted or not sorted

I have a text file bigger than 1 GB. The file has 4 columns delimited by TAB.
Col1: Guid
Col2: Date-time (yy-mm-yyyy 0000000000)
Col3: String
Col4: String
I want to determine if one or more of its columns are sorted or not sorted.
Is there any quick way to do that? Maybe using Perl or some unix command? Or anything similar?
I have files on large servers and on my local windows machine, so memory or cpu speed or OS is not an issue.
Just use the -c option of sort to check for sorted order and the -k to specify on which column:
$ sort -c -k2,2 file
sort: file:2: disorder: Col2: Date-time (yy-mm-yyyy 0000000000)
Or -C to suppress output and test the exit code. You may also want to specify the type of sort depending on the data, like -n for numeric sort, -V for version sort, etc.
Many versions of sort have an option to check whether a file is sorted or not. For example, using the version on my laptop (Debian), I can do this:
if sort -C -k 2,2 somefile
then
# something
else
# something else
fi
to check whether the second column of a file is sorted. The exit code of sort indicates success or failure.
First determine the column, then use awk. For example, for the second column:
awk '{print $2}' OFS="\t" test.tmp > unsorted_file.dat
awk '{print $2}' OFS="\t" test.tmp | sort > sorted_file.dat
diff sorted_file.dat unsorted_file.dat
If diff produces no output, that column is already sorted.
Just split the line into columns and compare them with the values in the previous line. If the previous value is greater than the one in the current line, the column is not sorted.
#! /usr/bin/perl
use strict;
use warnings;
my @sorted = (1, 1, 1, 1);
my $first = <>;   # read the first line
my @prev = split(/\t/, $first);
while (<>) {
    my @cols = split(/\t/);
    for (my $i = 0; $i < 4; ++$i) {
        $sorted[$i] = 0 if ($prev[$i] gt $cols[$i]);
    }
    @prev = @cols;
}
for (my $i = 0; $i < 4; ++$i) {
    my $not = $sorted[$i] ? '' : 'not ';
    print "Column $i is ${not}sorted\n";
}
Test file.txt
a a a a
b b b b
c c c c
d d d d
e e e a
f d f f
g g g g
Call as
perl script.pl file.txt
will give you
Column 0 is sorted
Column 1 is not sorted
Column 2 is sorted
Column 3 is not sorted
This compares textually and tests for ascending order. If you need another order or a different comparison, you must adapt the inner for loop accordingly.
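If you only need a yes/no answer for a single column, here is a one-liner sketch in the same spirit (my addition; checking column index 1, i.e. the second TAB-delimited column, is an example choice):
perl -F'\t' -ane 'exit 1 if defined $p && $F[1] lt $p; $p = $F[1]' file \
  && echo "column 2 is sorted" || echo "column 2 is not sorted"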

How to quickly find and replace many items on a list without replacing previously replaced items in BASH?

I want to perform a large number of find and replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method which is quite slow, because I have to do the whole find and replace twice:
(1) Convert the CSV to three files, e.g.:
a.csv b.csv c.csv
orange 0001 fruit2
carrot 0002 vegetable1
apple 0003 fruit3
pear 0004 fruit4
ink 0005 item1
table 0006 item2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
for i in `sed -n "$a"p ./b.csv`; do
for j in `sed -n "$a"p ./a.csv`; do
sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
a=`expr $a + 1`
done
done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a Perl solution which does the replacement in "one phase".
#!/usr/bin/perl
use strict;
my %map = (
    orange => "fruit2",
    carrot => "vegetable1",
    apple  => "fruit3",
    pear   => "fruit4",
    ink    => "item1",
    table  => "item2",
);
my $repl_rx = '(' . join("|", map { quotemeta } keys %map) . ')';
my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
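One refinement worth noting (my addition, not part of the original answer): hash keys come back in arbitrary order, so to honor the question's longest-to-shortest preference, the alternation can be built from length-sorted keys, which makes the regex prefer the longest match at any position:
my $repl_rx = '(' . join("|",
    map { quotemeta }
    sort { length($b) <=> length($a) } keys %map) . ')';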
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
"orange" "fruit2"
"carrot" "vegetable1"
"apple" "fruit3"
"pear" "fruit4"
"ink" "item1"
"table" "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
    [orange]=fruit2
    [carrot]=vegetable1
    [apple]=fruit3
    [pear]=fruit4
    [ink]=item1
    [table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
    matched=false
    for key in "${!map[@]}"; do
        if [[ ${str:$i:${#key}} = $key ]]; then
            str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
            ((i+=${#map[$key]}))
            matched=true
            break
        fi
    done
    $matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[@]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[@]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in Perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die ("first parameter must be the replacement list file") unless defined ($repls);
my $tmpFmt = "###%d###";
open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;
my @replsList;
my $i = 0;
while (<$replsFile>) {
    chomp;
    my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
    if (defined($from) && defined($to)) {
        push(@replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
    }
}
while (<>) {
    foreach my $r (@replsList) {
        s/$r->[0]/$r->[1]/g;
    }
    foreach my $r (@replsList) {
        s/$r->[1]/$r->[2]/g;
    }
    print;
}
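Assuming the script above is saved as twophase.pl (a name I'm choosing here) and the replacement list uses double-quoted fields, as the regex expects ("orange","fruit2" and so on), it would be invoked like:
perl twophase.pl replace-list.csv file.txt > replaced.txt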
I would guess that most of your slowness is coming from creating so many sed commands, which each need to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running 1 sed per file per step.
a=1
b=`wc -l < ./a.csv`
cmd=""
while [ $a -le $b ]
do
    for i in `sed -n "$a"p ./a.csv`; do
        for j in `sed -n "$a"p ./b.csv`; do
            cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
            echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
done
sed -i "${cmd# ; }" ./file.txt
Doing it twice is probably not your problem. If you managed to just do it once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try)
But continuing down the path of other posters, the next step might be pipelining. Write a little program that replaces two columns, then run that program twice, simultaneously. The first run swaps out strings in column1 with strings in column2, the next swaps out strings in column2 with strings in column3.
Your command line would be like this
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;
open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");
my @replace_pairs;
# read in the list of things to replace
while(<REPLACEFILE>) {
    chomp();
    my @cols = split /\t/, $_;
    my $to_replace = $cols[$before_replace_colnum];
    my $replace_with = $cols[$after_replace_colnum];
    push @replace_pairs, [$to_replace, $replace_with];
}
# read input from stdin, do swapping
while(<STDIN>) {
    # loop over all replacement strings
    foreach my $replace_pair (@replace_pairs) {
        my($to_replace,$replace_with) = @{$replace_pair};
        $_ =~ s/${to_replace}/${replace_with}/g;
    }
    print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
read countmd5sum x < <(md5sum <<< $count)
count=$(( $count + 1 ))
bigfrom="$bigfrom;s/$from/$countmd5sum/g"
bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum, to get some unique token. But some other mechanism can also be used to generate such token; like reading from /dev/urandom or shuf -n1 -i 10000000-20000000
A awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
Here, it first generates the sed script, using the csv as input file.
Then uses another sed instance to operate on input.txt
Notes:
The intermediate file generated - sed_script.sed can be re-used again, unless the input csv file changes.
####<number>#### is chosen as some pattern, which is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted. In this case each character in the substitution string is replaced by itself and a \n. Finally once all substitutions have taken place the \n's are removed leaving the finished string.
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace (based on the sample data):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This is a single-pass, awk-only answer with very little regex.
It reads the "repl.csv" file into an associative array ( see BEGIN{} ), then attempts to match on prefixes of each word when the length of the word is bound by key length limits, trying to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
while( getline repline < "repl.csv" ) {
split( repline, replarr, "," )
replassocarr[ replarr[1] ] = replarr[2]
# set some bounds on the replace word sizes
if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
minKeyLen = length( replarr[1] )
if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
maxKeyLen = length( replarr[1] )
}
close( "repl.csv" )
}
{
i = 1
while( i <= NF ) { print_word( $i, i == NF ); i++ }
}
function print_word( w, end ) {
wl = length( w )
for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
key = substr( w, 1, j )
wl = length( key )
if( wl >= minKeyLen && key in replassocarr ) {
printf( "%s%s%s", replassocarr[ key ],
substr( w, j+1 ), !end ? " " : "\n" )
return
}
}
printf( "%s%s", w, !end ? " " : "\n" )
}
function prefix_len_bound( len, jlen ) {
return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeyLen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course any "savings" from not looking in replassocarr go away when the words to be replaced go down to length 1, or if the average word length is much greater than the words to replace.