Gather the data with similar columns - perl

I want to filter the data from a text file in unix.
I have a text file in unix as below:
A 200
B 300
C 400
A 100
B 600
B 700
How could I modify/create the data as below from the above data using awk?
A 200 100
B 300 600 700
C 400
I am not that good with awk, and I believe awk/perl is best for this.

awk 'END {
    for (R in r)
        print R, r[R]
}
{
    r[$1] = $1 in r ? r[$1] OFS $2 : $2
}' infile
If the order of the values in the first field is important,
more code will be needed.
The solution will depend on your awk implementation and version.
Explanation:
r[$1] = $1 in r ? r[$1] OFS $2 : $2
Set the value of the array element r[$1] to:
if the key $1 is already present ($1 in r), the existing value with OFS and $2 appended;
otherwise, just the value of $2.
expression ? value-if-true : value-if-false is the ternary operator.
See ternary operation for more.
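If keeping the keys in their original input order matters, here is one possible sketch (my own addition, not from the answer above; assumes a POSIX-ish awk):
awk '!seen[$1]++ { order[++n] = $1 }                   # remember each key the first time it appears
     { val[$1] = ($1 in val) ? val[$1] OFS $2 : $2 }   # same create-or-append logic as above
     END { for (i = 1; i <= n; i++) print order[i], val[order[i]] }' infile
For the sample input this prints A 200 100, B 300 600 700 and C 400, with the keys in the order they first appear.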

You could do it like this, but with Perl there's always more than one way to do it:
my %hash;
while (<>) {
    my ($letter, $int) = split(" ");
    push @{ $hash{$letter} }, $int;
}
for my $key (sort keys %hash) {
    print "$key " . join(" ", @{ $hash{$key} }) . "\n";
}
It should work like this:
$ cat data.txt | perl script.pl
A 200 100
B 300 600 700
C 400

Not language-specific. More like pseudocode, but here's the idea:
- Get all lines in an array
- Set up a target dictionary of arrays
- Go through the array:
- Split the string using ' ' (space) as the delimiter into an array, parts
- Check whether there is already a dictionary entry for `parts[0]` (e.g. 'A'); if not, create it
- Add `parts[1]` (e.g. 100) to `dictionary(parts[0])`
And that's it! :-)
I'd do it, probably in Python, but that's rather a matter of taste.
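A minimal awk sketch of that pseudocode (essentially what the awk answers in this thread already do; the names are mine):
awk '{
    # parts[0] and parts[1] from the pseudocode are $1 and $2 here
    dict[$1] = ($1 in dict) ? dict[$1] " " $2 : $2   # create the entry or append to it
}
END {
    for (key in dict)        # note: for-in traversal order is not guaranteed
        print key, dict[key]
}' infile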

Using GNU awk (for asort), sorting the output inside it:
awk '
{ data[$1] = (data[$1] ? data[$1] " " : "") $2 }
END {
    for (i in data) {
        idx[++j] = i
    }
    n = asort(idx);
    for ( i=1; i<=n; i++ ) {
        print idx[i] " " data[idx[i]]
    }
}
' infile
Using external program sort:
awk '
{ data[$1] = (data[$1] ? data[$1] " " : "") $2 }
END {
    for (i in data) {
        print i " " data[i]
    }
}
' infile | sort
For both commands the output is:
A 200 100
B 300 600 700
C 400

Using sed:
Content of script.sed:
## First line. Newline will separate data, so add it after the content.
## Save it in 'hold space' and read next one.
1 {
s/$/\n/
h
b
}
## Append content of 'hold space' to current line.
G
## Search if first char (\1) in line was saved in 'hold space' (\4) and add
## the number (\2) after it.
s/^\(.\)\( *[0-9]\+\)\n\(.*\)\(\1[^\n]*\)/\3\4\2/
## If last substitution succeed, goto label 'a'.
ta
## Here last substitution failed, so it is the first appearance of the
## letter, add it at the end of the content.
s/^\([^\n]*\n\)\(.*\)$/\2\1/
## Label 'a'.
:a
## Save content to 'hold space'.
h
## In last line, get content of 'hold space', remove last newline and print.
$ {
x
s/\n*$//
p
}
Run it like:
sed -nf script.sed infile
And result:
A 200 100
B 300 600 700
C 400

This might work for you:
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]*\)\( .*\)\n\1/\1\2/;ta;P;D'
A 200 100
B 300 600 700
C 400

Related

Large volume of queries on a large CSV file

I am trying to query a large CSV file (100 GB, roughly 1.1 billion records) for partial matches in its url column. I aim to query for about 23000 possible matches.
Example input:
url,answer,rrClass,rrType,tlp,firstSeenTimestamp,lastSeenTimestamp,minimumTTLSec,maximumTTLSec,count
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
maps.google.com.malwaredomain.com.,118.193.165.236,in,a,white,1442148766000,1442148766000,600,600,1
whois.ducmates.blogspot.com.,216.58.194.193,in,a,white,1535969280784,1535969280784,44,44,1
Queries are of the following pattern: /^.*[someurl].*$/. The [someurl] values come from a different file and can be assumed to be an array of size 23000.
Matching Queries:
awk -F, '$1 ~ /^.*google\.com\.$/' > file1.out
awk -F, '$1 ~ /^.*nokiantires\.com\.$/' > file2.out
awk -F, '$1 ~ /^.*woodpapersilk\.com\.$/' > file3.out
awk -F, '$1 ~ /^.*xn--.*$/' > file4.out
Queries that match nothing:
awk -F, '$1 ~ /^.*seasonvintage\.com\.$/' > file5.out
awk -F, '$1 ~ /^.*java2s\.com\.$/' > file6.out
file1.out:
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
file2.out:
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
file3.out:
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
file4.out:
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
file5.out and file6.out are both empty as nothing matches
I have also uploaded these inputs and outputs as a gist.
Essentially each query extracts a partial match in the url column.
Currently I use the following code with awk to search for possible matches:
awk -F, '$1 ~ /^.*xn--.*$/' file.out > filter.csv
This solution returns a valid response but it takes 14 minutes to query for one example. Unfortunately I am looking to query for 23000 possible matches.
As such I am looking for a more workable and efficient solution.
I have thought of/tried the following:
Can I include all the queries in one huge regex, or does that make it even less efficient? (A sketch of this idea is shown right after this question.)
I have tried using MongoDB but it does not work well with just a single machine.
I have an AWS voucher that has about $30 left. Is there any particular AWS solution that could help here?
What would be a more workable solution to process these queries on said csv file?
Many thanks
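As a point of reference for the "one huge regex" idea above: a single alternation can be built from the query file in one awk pass, something like the rough sketch below (the file names are placeholders, every query line is used verbatim as an ERE fragment, and the per-query output files are lost). It saves the 23000 separate passes over the 100 GB file, but every record is still tested against one very large regex, which can itself be slow.
awk -F, '
    NR == FNR {                                   # first file: collect the query patterns
        pats = (pats == "" ? "" : pats "|") $0    # build one big alternation
        next
    }
    FNR > 1 && $1 ~ pats                          # data file: skip the header, one combined regex test per record
' queries records > all_matches.out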
Given what we know so far, and guessing at the answers to a couple of questions, I'd approach this by separating the queries into "queries that can be matched by a hash lookup" (which is all but one of the queries in your posted example) and "queries that need a regexp comparison to match" (just xn--.*$ in your example), and then evaluating them as such when reading your records. That way any $1 that can be matched by an almost instantaneous hash lookup against all hashable queries is handled like that, and only the few that need a regexp match are handled sequentially in a loop:
$ cat ../queries
google.com.$
nokiantires.com.$
woodpapersilk.com.$
xn--.*$
seasonvintage.com.$
java2s.com.$
$ cat ../records
url,answer,rrClass,rrType,tlp,firstSeenTimestamp,lastSeenTimestamp,minimumTTLSec,maximumTTLSec,count
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
maps.google.com.malwaredomain.com.,118.193.165.236,in,a,white,1442148766000,1442148766000,600,600,1
whois.ducmates.blogspot.com.,216.58.194.193,in,a,white,1535969280784,1535969280784,44,44,1
$ cat ../tst.awk
BEGIN { FS="," }
NR==FNR {
    query = $0
    outFile = "file" ++numQueries ".out"
    printf "" > outFile; close(outFile)
    if ( query ~ /^[^.]+[.][^.]+[.][$]$/ ) {
        # simple end of field string, can be hash matched
        queriesHash[query] = outFile
    }
    else {
        # not a simple end of field string, must be regexp matched
        queriesRes[query] = outFile
    }
    next
}
FNR>1 {
    matchQuery = ""
    if ( match($1,/[^.]+[.][^.]+[.]$/) ) {
        fldKey = substr($1,RSTART,RLENGTH) "$"
        if ( fldKey in queriesHash ) {
            matchType = "hash"
            matchQuery = fldKey
            outFile = queriesHash[matchQuery]
        }
    }
    if ( matchQuery == "" ) {
        for ( query in queriesRes ) {
            if ( $1 ~ query ) {
                matchType = "regexp"
                matchQuery = query
                outFile = queriesRes[matchQuery]
                break
            }
        }
    }
    if ( matchQuery != "" ) {
        print "matched:", matchType, matchQuery, $0, ">>", outFile | "cat>&2"
        print >> outFile; close(outFile)
    }
}
$ ls
$
$ tail -n +1 *
tail: cannot open '*' for reading: No such file or directory
$ awk -f ../tst.awk ../queries ../records
matched: hash google.com.$ maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2 >> file1.out
matched: hash google.com.$ drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2 >> file1.out
matched: hash nokiantires.com.$ nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1 >> file2.out
matched: regexp xn--.*$ xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38 >> file4.out
$ ls
file1.out file2.out file3.out file4.out file5.out file6.out
$
$ tail -n +1 *
==> file1.out <==
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
==> file2.out <==
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
==> file3.out <==
==> file4.out <==
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
==> file5.out <==
==> file6.out <==
$
The initial printf "" > outFile; close(outFile) is just to ensure you get an output file per query even if that query doesn't match, just like you asked for in your example.
If you're using GNU awk then it can manage the multiple open output files for you and then you can make these changes:
printf "" > outFile; close(outFile) -> printf "" > outFile
print >> outFile; close(outFile) -> print > outFile
which will be more efficient because then the output file isn't being opened+closed on every print.
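In context, the two lines of the script would then read as below (a sketch assuming GNU awk; with 23000 output files most other awks would run into the per-process open-file limit without the close() calls, which is why the original version closes after every write):
printf "" > outFile    # in the NR==FNR block; gawk keeps the file handle open
print > outFile        # in the FNR>1 block; no open/close per printed record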

Filter unique lines

I have a file type called fasta which contains a header "> 12122" followed by a string. I would like to remove duplicated strings in the file and keep only one of the duplicated strings (it doesn't matter which) and the corresponding header.
In the example below the string AGGTTCCGGATAAGTAAGAGCC is duplicated.
in:
>17-46151
AGGTTCCGGATAAGTAAGAGCC
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
out:
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
if order is mandatory
# Fields are delimited by newline
awk -F "\n" '
    BEGIN {
        # Records are delimited by ">"
        RS = ">"
    }
    # skip the first (empty) "record" caused by the leading ">"
    NR > 1 {
        # if the string is not known yet, add it to the "Order" list array
        if ( ! ( $2 in L ) ) O[++a] = $2
        # remember the (last) label/string pair
        L[$2] = $1
    }
    # after reading the file
    END {
        # display each (last known) pair, based on the order
        for ( i=1; i<=a; i++ ) printf( ">%s\n%s\n", L[O[i]], O[i])
    }
' YourFile
if order is not mandatory
awk -F "\n" 'BEGIN{RS=">"}NR>1{L[$2]=$1}END{for (l in L) printf( ">%s\n%s\n", L[l], l)}' YourFile
$ awk '{if(NR%2) p=$0; else a[$0]=p}END{for(i in a)print a[i] ORS i}' file
>18-41148
TCTTAACCCGGACCAGAAACTA
>32-24116
TAGCATATCGAGCCTGAGAACA
>1-242
AGGTTCCGGATAAGTAAGAGCC
>43-16054
GTCCCACTCCGTAGATCTGTTC
>42-16312
TGATACGGATGTTATACGCAGC
Explained:
{
    if (NR%2)        # every first (of 2) line goes into p
        p = $0
    else             # every second line is the hash key
        a[$0] = p
}
END {
    for (i in a)     # output every unique key and its header
        print a[i] ORS i
}
Here's a quick one-line awk solution for you. It should produce output more immediately than the other answers because it works line by line rather than queuing up the data (and looping through it) until the end:
awk 'NR % 2 == 0 && !seen[$0]++ { print last; print } { last = $0 }' file
Explanation:
NR % 2 == 0 runs only on even numbered records (lines, NR)
!seen[$0]++ stores and increments values and returns true only when there were no values in the seen[] hash (!0 is 1, !1 is 0, !2 is 0, etc.)
(Skipping to the end) last is set to the value of each line after we're otherwise done with it
{ print last; print } will print last (the header) and then the current line (gene code)
Note: while this preserves the original order, it shows the first uniquely seen instance while the expected output showed the final uniquely seen instance:
>17-46151
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
If you want the final uniquely seen instance, you can reverse the file before passing to awk and then reverse it back afterwards:
tac file |awk … |tac

hash using sha1sum using awk

I have a "pipe-separated" file that has about 20 columns. I want to just hash the first column which is a number like account number using sha1sum and return the rest of the columns as is.
Whats the best way I can do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column has the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
What the Best Way™ is, is up for debate. One way to do it with awk is:
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\ -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
    OFS = FS # set output field separator to field separator; we will use
             # it because we meddle with the fields.
}
NR == 1 {    # first line: just print headers.
    print
}
NR != 1 {    # from there on do the hash/replace
    # this constructs a shell command (and runs it) that echoes the field
    # (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
    # and gets it back into awk with getline (into the variable hash)
    # the gsub bit is to prevent the shell from barfing if there's an apostrophe
    # in one of the fields.
    gsub(/'/, "'\\''", $1);
    command = ("echo '" $1 "' | sha1sum -b | cut -d\\ -f 1")
    command | getline hash
    close(command)
    # then replace the field and print the result.
    $1 = hash
    print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell commands (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.
Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47
function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
}
Here's the flow break down:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function break down:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after the sha1sum, so I discard it by splitting the output
Return the first field of the sha1sum output.
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it by calling awk.script data (or ./awk.script data if you use bash).
EDIT by EdMorton:
Sorry for the edit, but your script above is the right approach; it just needs some tweaks to make it more robust, and this is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }
function encodeData( fld, cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/,"",output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed; the additional arguments to your function make them local, so they are always null/zero on each invocation; the if around the getline means you won't return the previous success value if it fails (see http://awk.info/?tip/getline); and the rest is maybe more style/preference, with a bit of a performance improvement.

Match a string from File1 in File2 and replace the string in File1 with corresponding matched string in File2

The title may be confusing, here's what I'm trying to do:
File1
12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5
File2
Tom 12 John 921 Mike 813
Output
Tom=John:5,Mike:5
File2 has the names for the numbers in file1, and I want to match and replace the numbers with the corresponding string values. I tried this with my limited knowledge of awk, but couldn't do it.
Any help appreciated.
Here's one way using GNU awk. Run like:
awk -f script.awk file1 file2
Contents of script.awk:
BEGIN {
    FS="[ =:,]"
}
FNR==NR {
    a[$1]=$0
    next
}
$2 in a {
    split(a[$2],b)
    for (i=3;i<=NF-1;i+=2) {
        for (j=2;j<=length(b)-1;j+=2) {
            if ($(i+1) == b[j]) {
                line = (line ? line "," : "") $i ":" b[j+1]
            }
        }
    }
    print $1 "=" line
    line = ""
}
Results:
Tom=John:5,Mike:5
Alternatively, here's the one-liner:
awk -F "[ =:,]" 'FNR==NR { a[$1]=$0; next } $2 in a { split(a[$2],b); for (i=3;i<=NF-1;i+=2) for (j=2;j<=length(b)-1;j+=2) if ($(i+1) == b[j]) line = (line ? line "," : "") $i ":" b[j+1]; print $1 "=" line; line = "" }' file1 file2
Explanation:
Change awk's field separator to either a space, equals sign, colon or comma.
'FNR==NR { ... }' is only true for the first file in the arguments list.
So when processing file1, awk will add column '1' to an array and we assign the whole line as a value to this array element.
'next' will simply skip processing the rest of the script, and read the next line of input.
When awk has finished reading the input in file1, it will continue reading file2. However, this also resets 'FNR' to '1', so awk will skip processing the 'FNR==NR' block for file2 because it is no longer true.
So for file2: if column '2' can be found in the array mentioned above:
Split the value of the array element into another array. This essentially splits up the whole line in file1.
Now create two loops.
The first will loop through all the names in file2
And the second will loop through all the values in the (second) array (this essentially loops over all the fields in file1).
Now when a value succeeding a name in file2 is equal to one of the key numbers in file1, create a line construct that looks like: 'name:number_following_key_number_from_file1'.
When more names and values are found during the loops, the ternary construct '( ... ? ... : ...)' adds these elements onto the end of the line. It's like an if statement; if there's already a line, add a comma onto the end of it, else don't do anything.
When all the loops are complete, print out column '1' and the line. Then empty the line variable so that it can be used again.
HTH. Good luck.
The following may work as a template:
skrynesaver@busybox ~/ perl -e '$values="12=921:5,895:5,813:5,853:5,978:5,807:5,1200:5,1067:5,827:5";
$data = "Tom 12 John 921 Mike 813";
($line,$values)=split/=/,$values;
@values=split/,/,$values;
$values{$line}="=";
map{$_=~/(\d+)(:\d+)/;$values{$1}="$2";}@values;
if ($data=~/\w+\s$line\s/){
    $data=~s/(\w+)\s(\d+)\s?/$1$values{$2}/g;
}
print "$data\n";
'
Tom=John:5Mike:5
skrynesaver@busybox ~/

How to quickly find and replace many items on a list without replacing previously replaced items in BASH?

I want to perform many find and replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method which is quite slow, because I have to do the whole find and replace twice:
(1) Convert the CSV to three files, e.g.:
a.csv b.csv c.csv
orange 0001 fruit2
carrot 0002 vegetable1
apple 0003 fruit3
pear 0004 fruit4
ink 0005 item1
table 0006 item2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    for i in `sed -n "$a"p ./b.csv`; do
        for j in `sed -n "$a"p ./a.csv`; do
            sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
            echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a perl solution which does the replacement in "one phase".
#!/usr/bin/perl
use strict;
my %map = (
    orange => "fruit2",
    carrot => "vegetable1",
    apple  => "fruit3",
    pear   => "fruit4",
    ink    => "item1",
    table  => "item2",
);
my $repl_rx = '(' . join("|", map { quotemeta } keys %map) . ')';
my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
"orange" "fruit2"
"carrot" "vegetable1"
"apple" "fruit3"
"pear" "fruit4"
"ink" "item1"
"table" "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
    [orange]=fruit2
    [carrot]=vegetable1
    [apple]=fruit3
    [pear]=fruit4
    [ink]=item1
    [table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
    matched=false
    for key in "${!map[@]}"; do
        if [[ ${str:$i:${#key}} = $key ]]; then
            str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
            ((i+=${#map[$key]}))
            matched=true
            break
        fi
    done
    $matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[@]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[@]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die ("first parameter must be the replacement list file") unless defined ($repls);
my $tmpFmt = "###%d###";
open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;
my @replsList;
my $i = 0;
while (<$replsFile>) {
    chomp;
    my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
    if (defined($from) && defined($to)) {
        push(@replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
    }
}
while (<>) {
    foreach my $r (@replsList) {
        s/$r->[0]/$r->[1]/g;
    }
    foreach my $r (@replsList) {
        s/$r->[1]/$r->[2]/g;
    }
    print;
}
I would guess that most of your slowness is coming from creating so many sed commands, which each need to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running 1 sed per file per step.
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    cmd=""
    for i in `sed -n "$a"p ./a.csv`; do
        for j in `sed -n "$a"p ./b.csv`; do
            cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
            echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
    sed -i "$cmd" ./file.txt
done
Doing it twice is probably not your problem. If you managed to just do it once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try)
But continuing down the path of other posters, the next step might be pipelining. Write a little program that replaces two columns, then run that program twice, simultaneously. The first run swaps out strings in column1 with strings in column2, the next swaps out strings in column2 with strings in column3.
Your command line would be like this
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;
open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");
my @replace_pairs;
# read in the list of things to replace
while(<REPLACEFILE>) {
    chomp();
    my @cols = split /\t/, $_;
    my $to_replace = $cols[$before_replace_colnum];
    my $replace_with = $cols[$after_replace_colnum];
    push @replace_pairs, [$to_replace, $replace_with];
}
# read input from stdin, do swapping
while(<STDIN>) {
    # loop over all replacement strings
    foreach my $replace_pair (@replace_pairs) {
        my($to_replace,$replace_with) = @{$replace_pair};
        $_ =~ s/${to_replace}/${replace_with}/g;
    }
    print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
    read countmd5sum x < <(md5sum <<< $count)
    count=$(( $count + 1 ))
    bigfrom="$bigfrom;s/$from/$countmd5sum/g"
    bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum to get a unique token, but some other mechanism can also be used to generate such a token, like reading from /dev/urandom or shuf -n1 -i 10000000-20000000.
An awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
Here, it first generates the sed script, using the csv as input file.
Then uses another sed instance to operate on input.txt
Notes:
The intermediate file generated, sed_script.sed, can be re-used, unless the input csv file changes.
####<number>#### is chosen as some pattern, which is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
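For the sample replace-list.csv shown in the question, the awk variant would generate a sed_script.sed along these lines (every word is first mapped to a unique ####N#### token, then every token to its replacement):
s/orange/####1####/
s/carrot/####2####/
s/apple/####3####/
s/pear/####4####/
s/ink/####5####/
s/table/####6####/
s/####1####/fruit2/
s/####2####/vegetable1/
s/####3####/fruit3/
s/####4####/fruit4/
s/####5####/item1/
s/####6####/item2/
Since ####N#### never occurs in the input text, the second half of the script cannot touch text produced by the first half.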
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted. In this case, each character in the substitution string is replaced by itself followed by \n. Finally, once all substitutions have taken place, the \n's are removed, leaving the finished string.
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace (based on the sample data):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This is a single-pass, awk-only answer with very little regex.
It reads the "repl.csv" file into an associative array (see BEGIN{}), then attempts to match on prefixes of each word when the length of the word is bound by the key length limits, trying to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
    while( getline repline < "repl.csv" ) {
        split( repline, replarr, "," )
        replassocarr[ replarr[1] ] = replarr[2]
        # set some bounds on the replace word sizes
        if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
            minKeyLen = length( replarr[1] )
        if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
            maxKeyLen = length( replarr[1] )
    }
    close( "repl.csv" )
}
{
    i = 1
    while( i <= NF ) { print_word( $i, i == NF ); i++ }
}
function print_word( w, end ) {
    wl = length( w )
    for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
        key = substr( w, 1, j )
        wl = length( key )
        if( wl >= minKeyLen && key in replassocarr ) {
            printf( "%s%s%s", replassocarr[ key ],
                substr( w, j+1 ), !end ? " " : "\n" )
            return
        }
    }
    printf( "%s%s", w, !end ? " " : "\n" )
}
function prefix_len_bound( len, jlen ) {
    return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeyLen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course, any "savings" from not looking in replassocarr go away when the words to be replaced go down to length 1, or if the average word length is much greater than that of the words to replace.