Dear friends
I have the following:
PARAM=1,2,3=,4,5,6,=,7#,8,9
How can I use sed/awk to count the number of "=" characters between PARAM and the "#" character?
For example:
PARAM=1,2,3=,4,5,6,=,7#,8,9
Then sed/awk should return 3
OR
PARAM=1,2,3=,4=,5=,6,=,7#,=8,9
Then sed/awk should return 5
THX
yael
You can use this one-liner. There's no need to use split() as in the other answer; just use gsub(), which returns the count of replacements it made. Also, set the field delimiter to "#", so you only need to deal with the first field.
$ echo "PARAM=1,2,3=,4,5,6,=,7#,8,9" | awk -F"#" '{print gsub("=","",$1)}'
3
$ echo "PARAM=1,2,3=,4=,5=,6,=,7#,=8,9" | awk -F"#" '{print gsub("=","",$1)}'
5
Here is an awk script that finds the count using field separators/split. It sets the field separator to the # symbol and then splits the first field (the text to the left of the first #) on the = character. An odd approach possibly, but it is one method. Note that it assumes there are no = characters to the left of PARAM; if that is a bad assumption, this will not work.
BEGIN{ FS="#" }
/PARAM.*#/{
n = split( $1, a, "=" );
printf( "Count = %d\n", n-1 );
}
It can be done with one line as well:
[]$ export LINE=PARAM=1,2=3,4=5#=6
[]$ echo $LINE | awk 'BEGIN{FS="#"}/PARAM.*#/{n=split($1,a,"="); print n-1;}'
3
I need to replace all occurrences of a string after nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with the occurrence number is not working as expected; instead, it is replacing all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
    FS=OFS=""
}
{
    j=0
    for (i=0; ++i<=NF;)
        if ($i==c && j++>=n) $i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
With GNU awk you can use gensub. This is based entirely on your samples, where you want to remove : from the 3rd occurrence onwards: gensub splits the matched line into two parts, and all colons are then removed from the second part (everything from the 3rd colon onwards).
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
    firstPart = restPart = ""
    firstPart = gensub(regex, "\\1\\2", "1", $0)
    restPart  = gensub(regex, "\\3", "1", $0)
    gsub(/:/, "", restPart)
    print firstPart restPart
}
' Input_file
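Running it on the two sample lines should produce:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus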
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job: what you have there is colon-delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my ( undef, $first, @rest ) = split /:/;
    print ":$first:", join( "", @rest ), "\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
You can use a Perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
See the online demo and the regex demo.
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match:
  ^ - start of string (here, a line)
  (?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
  (*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
The replacement only runs if the current line starts with :account_id: (see the if /^:account_id:/ condition).
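Applied to the sample file, this produces the desired output:
$ perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus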
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
See this online demo. Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result - iterates over the fields; if the field number is greater than 2, it appends just the current field value to result, otherwise it appends the value plus the output field separator; then it prints the result.
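Against the sample test.txt this prints:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus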
I would use GNU AWK in the following way, assuming n is fixed and equal to 2. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: use : as the field separator and nothing as the output field separator; this by itself removes all :, so I add back the ones that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested this solely on this data, so if you want to use it you should first test it with more possible inputs.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the two lines by removing everything from the newline (marking the third occurrence) in the copy through the newline in the amended line.
N.B. A newline is the best delimiter to use with sed, as the lines presented to sed's commands are initially devoid of newlines. However, the important property of the delimiter is that it is unique, so it can be any character as long as it is not found anywhere in the data set.
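As a quick sanity check against one of the sample lines (GNU sed, since \n is used in the replacement):
$ echo ':account_id:12345:6789:Melbourne:Aus' | sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//'
:account_id:123456789MelbourneAus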
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I have a csv file like the one below.
Column A, Column B
cat,30
cat,40
dog,10
elephant,23
dog,3
elephant,37
How would I uniquely sort column A, based on the largest corresponding value in column B?
The result I would like to get is:
Column A, Column B
cat,40
elephant,37
dog,10
awk to the rescue!
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '!a[$1]++'
Column A, Column B
cat,40
dog,10
elephant,37
If you want your specific output order, it needs a little more coding because of the header line.
$ sort -t, -k1,1 -k2nr filename | awk -F, 'NR==1{print "999999\t"$0;next} !a[$1]++{print $2"\t"$0}' | sort -k1nr | cut -f2-
Column A, Column B
cat,40
elephant,37
dog,10
Another alternative: remove the header up front and add it back at the end.
$ h=$(head -1 filename); sed 1d filename | sort -t, -k1,1 -k2nr | awk -F, '!a[$1]++' | sort -t, -k2nr | sed '1i'"$h"''
Perlishly:
#!/usr/bin/env perl
use strict;
use warnings;
# print header row
print scalar <>;

my %seen;

# iterate the magic filehandle (file specified on command line or
# stdin - e.g. like grep/sed)
while (<>) {
    chomp;    # strip trailing linefeed
    # split this line on ','
    my ( $key, $value ) = split /,/;
    # save this value if the previous is lower or non-existent
    if ( not defined $seen{$key}
        or $seen{$key} < $value )
    {
        $seen{$key} = $value;
    }
}

# sort, comparing values in %seen
foreach my $key ( sort { $seen{$b} <=> $seen{$a} } keys %seen ) {
    print "$key,$seen{$key}\n";
}
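Given the sample file, this prints the header followed by the keys in descending order of their maxima:
Column A, Column B
cat,40
elephant,37
dog,10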
I've +1'd karakfa's answer. It's simple and elegant.
My answer is an extension of karakfa's header handling. If you like it, please feel free to +1 my answer, but "best answer" should go to karakfa. (Unless of course you prefer one of the other answers! :] )
If your input is as you've described in your question, then we can recognize the header by seeing that $2 is not numeric. Thus, the following does not take the header into consideration:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '!a[$1]++'
You might alternately strip the header with:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '$2~/^[0-9]+$/&&!a[$1]++'
This slows things down quite a bit, since a regex may take longer to evaluate than a simple array assignment and numeric test. I'm using a regex for the numeric test in order to permit a 0, which would otherwise evaluate to "false".
Next, if you want to keep the header, but print it first, you can process your output at the end of the stream:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '$2!~/^[0-9]+$/{print;next} !a[$1]++{b[$1]=$0} END{for(i in b){print b[i]}}'
A last option, to achieve the same effect without storing the extra array in memory, would be to process your input a second time. This is more costly in terms of IO, but less costly in terms of memory:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, 'NR==FNR&&$2!~/^[0-9]+$/{print;nextfile} $2~/^[0-9]+$/&&!a[$1]++' filename -
Another Perl:
perl -MList::Util=max -F, -lane '
    if ($. == 1) { print; next }
    $val{$F[0]} = max $val{$F[0]}, $F[1];
    } {
    print "$_,$val{$_}" for reverse sort { $val{$a} <=> $val{$b} } keys %val;
' file
One possible Tcl solution:
# read the contents of the file into a list of lines
set f [open data.csv]
set lines [split [string trim [chan read $f]] \n]
chan close $f
# detach the header
set lines [lassign $lines header]
# map the list of lines to a list of tuples
set tuples [lmap line $lines {split $line ,}]
# use an associative array to get unique tuples in a flat list
array set uniqueTuples [concat {*}[lsort -index 1 -integer $tuples]]
# reassemble the tuples, sorted by name
set tuples [lmap {a b} [lsort -stride 2 -index 0 [array get uniqueTuples]] {list $a $b}]
# map the tuples to csv lines and insert the header
set lines [linsert [lmap tuple $tuples {join $tuple ,}] 0 $header]
# convert the list of lines into a data string
set data [join $lines \n]
This solution assumes a simplified data set where there are no quoted elements. If there are quoted elements, the csv module should be used instead of the split command.
Another solution, inspired by the Perl solution:
puts [gets stdin]
set seen [dict create]
while {[gets stdin line] >= 0} {
    lassign [split $line ,] key value
    if {![dict exists $seen $key] || [dict get $seen $key] < $value} {
        dict set seen $key $value
    }
}
dict for {key val} [lsort -stride 2 -index 0 $seen] {
    puts $key,$val
}
Documentation: chan, concat, dict, gets, if, join, lassign, linsert, lmap, lmap replacement, lsort, open, set, split, string, while
The format of MAC addresses varies with the platform.
E.g. on HPUX I could get something like:
0:0:c:7:ac:1e
While Linux gives me
00:00:0c:07:ac:1e
I used to use awk in a Korn shell script on CentOS 5 to format this to 00000c07ac1e, as shown below.
MAC="0:0:c:7:ac:1e"
echo $MAC | awk -F: '{printf( "%02s%02s%02s%02s%02s%02s\n", $1,$2,$3,$4,$5,$6)}'
Unfortunately, our admin server is now Ubuntu 14 LTS with a newer version of awk that no longer supports zero padding in the %s format, and I get an undesired 0 0 c 7ac1e
So I now switched to perl and do:
echo $MAC | perl -ne '{@A=split(":"); printf( "%02s%02s%02s%02s%02s%02s", @A)}'
As this may break too in upcoming releases I am looking for a more robust but still compact way to format the string.
Your Perl snippet will not break in future releases. This is basic functionality; changing it would break many, many programs. (Plus, Perl has a mechanism for introducing backwards-incompatible changes without breaking existing programs.)
Cleaned up:
echo "$MAC" | perl -ne'#F=split(/:/); printf("%02s%02s%02s%02s%02s%02s\n", #F)'
Shorter:
echo "$MAC" | perl -ne'printf "%02s%02s%02s%02s%02s%02s\n", split /:/'
Without the repetition:
echo "$MAC" | perl -ple'$_ = join ":", map sprintf("%02s", $_), split /:/'
There's -a if you want something more awkish:
echo "$MAC" | perl -F: -aple'$_ = join ":", map sprintf("%02s", $_), #F'
A bit long, but it should be pretty robust:
awk -F: '{for(i=1;i<=NF;i++){while(length($i)<2)$i=0$i;printf "%s",$i;}print ""}'
How it works:
1. Loop through the fields.
2. While the field is less than 2 characters long, add zeros to the front.
3. Print the field.
4. Print a newline character at the end.
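For example, with the sample address:
$ echo "0:0:c:7:ac:1e" | awk -F: '{for(i=1;i<=NF;i++){while(length($i)<2)$i=0$i;printf "%s",$i;}print ""}'
00000c07ac1e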
If you were dealing with decimal numbers rather than hex, you could use %.Xd to indicate that you want at least X digits:
$ awk -F: '{printf( "%.2d%.2d\n", $1, $2)}' <<< "0:23"
0023
^^
two digits
From The GNU Awk User’s Guide #5.5.3 Modifiers for printf Formats:
.prec
A period followed by an integer constant specifies the precision to
use when printing. The meaning of the precision varies by control
letter:
%d, %i, %o, %u, %x, %X
Minimum number of digits to print.
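Note that this only works for decimal input: a hex component such as "ac" converts to 0 under awk's string-to-number rules, which is why %.2d alone is not enough for a full MAC address:
$ awk 'BEGIN{ printf("%.2d\n", "ac") }'
00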
In this case, you need a more general approach to deal with each one of the blocks of the MAC address. You can loop through the elements and add a 0 in case their length is just 1:
awk -F: '{
    for (i=1; i<=NF; i++) {   # loop through the elements
        if (length($i)==1)    # if length is 1
            printf("0")       # prepend a 0
        printf("%s", $i)      # print the field itself
    }
    print ""                  # print a newline at the end
}' <<< "0:0:c:7:ac:1e"
This returns:
00000c07ac1e
^^ ^^ ^^
Note awk '...' <<< "$MAC" is the same as echo "$MAC" | awk '...'.
The title may be confusing, here's what I'm trying to do:
File1
12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5
File2
Tom 12 John 921 Mike 813
Output
Tom=John:5,Mike:5
File2 has the string values for the numbers in file1, and I want to match the numbers and replace them with the string values. I tried this with my limited knowledge of awk, but couldn't do it.
Any help appreciated.
Here's one way using GNU awk. Run like:
awk -f script.awk file1 file2
Contents of script.awk:
BEGIN {
    FS="[ =:,]"
}

FNR==NR {
    a[$1]=$0
    next
}

$2 in a {
    split(a[$2],b)
    for (i=3;i<=NF-1;i+=2) {
        for (j=2;j<=length(b)-1;j+=2) {
            if ($(i+1) == b[j]) {
                line = (line ? line "," : "") $i ":" b[j+1]
            }
        }
    }
    print $1 "=" line
    line = ""
}
Results:
Tom=John:5,Mike:5
Alternatively, here's the one-liner:
awk -F "[ =:,]" 'FNR==NR { a[$1]=$0; next } $2 in a { split(a[$2],b); for (i=3;i<=NF-1;i+=2) for (j=2;j<=length(b)-1;j+=2) if ($(i+1) == b[j]) line = (line ? line "," : "") $i ":" b[j+1]; print $1 "=" line; line = "" }' file1 file2
Explanation:
Change awk's field separator to either a space, equals sign, colon or comma.
'FNR==NR { ... }' is only true for the first file in the arguments list (see the small demonstration after this explanation).
So when processing file1, awk will add column '1' to an array, and we assign the whole line as the value of this array element.
'next' will simply skip processing the rest of the script and read the next line of input.
When awk has finished reading the input in file1, it will continue reading file2. However, this also resets 'FNR' to '1', so awk will skip processing the 'FNR==NR' block for file2 because it is no longer true.
So for file2: if column '2' can be found in the array mentioned above:
Split the value of the array element into another array. This essentially splits up the whole line in file1.
Now create two loops.
The first will loop through all the names in file2
And the second will loop through all the values in the (second) array (this essentially loops over all the fields in file1).
Now, when a value succeeding a name in file2 is equal to one of the key numbers in file1, create a line construct that looks like 'name:number_following_key_number_from_file1'.
When more names and values are found during the loops, the ternary construct '( ... ? ... : ... )' adds these elements onto the end of the line. It's like an if statement: if there's already a line, add a comma onto the end of it; else don't do anything.
When all the loops are complete, print out column '1' and the line. Then empty the line variable so that it can be used again.
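As a minimal illustration of the FNR==NR idiom with the two sample files:
$ awk 'FNR==NR { print "from file1:", $0; next } { print "from file2:", $0 }' file1 file2
from file1: 12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5
from file2: Tom 12 John 921 Mike 813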
HTH. Good luck.
The following may work as a template:
skrynesaver@busybox ~/ perl -e '$values="12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5";
$data = "Tom 12 John 921 Mike 813";
($line,$values) = split /=/, $values;
@values = split /,/, $values;
$values{$line} = "=";
map { $_ =~ /(\d+)(:\d+)/; $values{$1} = "$2"; } @values;
if ($data =~ /\w+\s$line\s/) {
    $data =~ s/(\w+)\s(\d+)\s?/$1$values{$2}/g;
}
print "$data\n";
'
Tom=John:5Mike:5
skrynesaver@busybox ~/
I want to perform many find-and-replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method, which is quite slow because I have to do the whole find-and-replace twice:
(1) Convert the CSV to three files, e.g.:
a.csv b.csv c.csv
orange 0001 fruit2
carrot 0002 vegetable1
apple 0003 fruit3
pear 0004 fruit4
ink 0005 item1
table 0006 item2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    for i in `sed -n "$a"p ./b.csv`; do
        for j in `sed -n "$a"p ./a.csv`; do
            sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
            echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a Perl solution which does the replacement in "one phase".
#!/usr/bin/perl
use strict;
my %map = (
    orange => "fruit2",
    carrot => "vegetable1",
    apple  => "fruit3",
    pear   => "fruit4",
    ink    => "item1",
    table  => "item2",
);
my $repl_rx = '(' . join("|", map { quotemeta } keys %map) . ')';
my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
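Run on the sample sentence, this prints I like to eat fruit3s and vegetable1s. One caveat: Perl alternation tries branches left to right, and keys %map comes back in no particular order, so if one key were a prefix of another you would want to sort the keys by descending length when building $repl_rx.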
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
    "orange" "fruit2"
    "carrot" "vegetable1"
    "apple"  "fruit3"
    "pear"   "fruit4"
    "ink"    "item1"
    "table"  "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
    [orange]=fruit2
    [carrot]=vegetable1
    [apple]=fruit3
    [pear]=fruit4
    [ink]=item1
    [table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
    matched=false
    for key in "${!map[@]}"; do
        if [[ ${str:$i:${#key}} = $key ]]; then
            str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
            ((i+=${#map[$key]}))
            matched=true
            break
        fi
    done
    $matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[@]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[@]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die("first parameter must be the replacement list file") unless defined($repls);
my $tmpFmt = "###%d###";

open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;

my @replsList;
my $i = 0;
while (<$replsFile>) {
    chomp;
    my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
    if (defined($from) && defined($to)) {
        push(@replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
    }
}

while (<>) {
    foreach my $r (@replsList) {
        s/$r->[0]/$r->[1]/g;
    }
    foreach my $r (@replsList) {
        s/$r->[1]/$r->[2]/g;
    }
    print;
}
I would guess that most of your slowness is coming from creating so many sed commands, which each need to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running 1 sed per file per step.
a=1
b=`wc -l < ./a.csv`
cmd=""
while [ $a -le $b ]
do
    for i in `sed -n "$a"p ./a.csv`; do
        for j in `sed -n "$a"p ./b.csv`; do
            cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
            echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
done
sed -i "$cmd" ./file.txt
Doing it twice is probably not your problem. If you managed to do it just once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try).
But continuing down the path of the other posters, the next step might be pipelining. Write a little program that replaces two columns, then run two instances of that program simultaneously in a pipeline. The first swaps out strings in column 1 with strings in column 2, the next swaps out strings in column 2 with strings in column 3.
Your command line would look like this:
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum  = $ARGV[2] - 1;

open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");

my @replace_pairs;
# read in the list of things to replace
while (<REPLACEFILE>) {
    chomp();
    my @cols = split /\t/, $_;
    my $to_replace   = $cols[$before_replace_colnum];
    my $replace_with = $cols[$after_replace_colnum];
    push @replace_pairs, [$to_replace, $replace_with];
}

# read input from stdin, do swapping
while (<STDIN>) {
    # loop over all replacement strings
    foreach my $replace_pair (@replace_pairs) {
        my ($to_replace, $replace_with) = @{$replace_pair};
        $_ =~ s/${to_replace}/${replace_with}/g;
    }
    print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
    read countmd5sum x < <(md5sum <<< $count)
    count=$(( $count + 1 ))
    bigfrom="$bigfrom;s/$from/$countmd5sum/g"
    bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum to get a unique token, but some other mechanism could also be used to generate such a token, like reading from /dev/urandom or shuf -n1 -i 10000000-20000000.
An awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
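With the sample CSV from the question, the generated sed script contains all the token substitutions followed by all the token expansions:
$ cat /tmp/sed_script.sed
s/orange/####1####/
s/carrot/####2####/
s/apple/####3####/
s/pear/####4####/
s/ink/####5####/
s/table/####6####/
s/####1####/fruit2/
s/####2####/vegetable1/
s/####3####/fruit3/
s/####4####/fruit4/
s/####5####/item1/
s/####6####/item2/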
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
It first generates the sed script, using the csv as the input file.
It then uses another sed instance to operate on input.txt.
Notes:
The intermediate file generated, sed_script.sed, can be re-used, unless the input csv file changes.
####<number>#### is chosen as some pattern, which is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted: each character in the substitution string is replaced by itself followed by \n. Finally, once all substitutions have taken place, the \n's are removed, leaving the finished string.
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace (based on the sample data):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This is a single-pass, awk-only answer with very little regex.
It reads the "repl.csv" file into an associative array (see the BEGIN block), then attempts to match on prefixes of each word, with the prefix length bounded by the key length limits, trying to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
    while( getline repline < "repl.csv" ) {
        split( repline, replarr, "," )
        replassocarr[ replarr[1] ] = replarr[2]
        # set some bounds on the replace word sizes
        if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
            minKeyLen = length( replarr[1] )
        if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
            maxKeyLen = length( replarr[1] )
    }
    close( "repl.csv" )
}

{
    i = 1
    while( i <= NF ) { print_word( $i, i == NF ); i++ }
}

function print_word( w, end ) {
    wl = length( w )
    for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
        key = substr( w, 1, j )
        wl = length( key )
        if( wl >= minKeyLen && key in replassocarr ) {
            printf( "%s%s%s", replassocarr[ key ],
                substr( w, j+1 ), !end ? " " : "\n" )
            return
        }
    }
    printf( "%s%s", w, !end ? " " : "\n" )
}

function prefix_len_bound( len, jlen ) {
    return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeyLen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course any "savings" of not looking the replassocarr go away when the words to be replaced goes to length=1 or if the average word length is much greater than the words to replace.