Perl / grep / awk -- splitting multiple results, if statement checking for second string - perl

I am trying to grep a file for the first two matches of a string (there will only ever be a maximum of two matches), including some context (grep -B 1 -A 5), split each resulting set of 7 lines into its own variable, and write an if statement based on whether or not each set contains a different string.
In some cases, the file may contain only one match.
I know how to grep for the two matches, but not how to split them into separate variables. I can also write an if statement to check whether a variable is empty (indicating the lack of a second match), but I am not sure how to check each variable to see if it contains the second string. Any assistance would be helpful. Thanks!
Example:
grep -B1 -A5 "Resolution:" file.txt
Color LCD:
Resolution: 1440 x 900
Pixel Depth: 32-Bit Color (ARGB8888)
Main Display: Yes
Mirror: Off
Online: Yes
Built-In: Yes
LED Cinema Display:
Resolution: 1920 x 1200
Depth: 32-Bit Color
Core Image: Hardware Accelerated
Mirror: Off
Online: Yes
Quartz Extreme: Supported
Desired result based on whether or not each match set contains "Main Display":
$mainDisplay = Color LCD
$secondDisplay = LED Cinema Display (or null indicating no second match)

Your file is valid YAML, so if you have the YAML Perl module installed, here is a one-liner:
eval $(perl -MYAML -0777 -e '$r=Load(<>);map { exists($r->{$_}->{"Main Display"}) ? print "main=\"$_\";\n" : print "second=\"$_\";\n" } keys %$r' < filename.txt)
echo =$main= =$second=
So, after the eval, the shell variables main and second are set.
Or, specifically for your OS X case, with the system_profiler command:
eval $(
system_profiler SPDisplaysDataType |\
grep -B1 -A5 'Resolution:' |\
perl -MYAML -0777 -e '$r=Load(<>);map { printf "%s=\"%s\"\n", exists($r->{$_}->{"Main Display"}) ? "main" : "second", $_ } keys %$r'
)
echo =$main=$second=

my($first, $second) = split /--\n/, qx/grep -B1 -A5 foo data.text/;
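To get from those chunks to the desired variables, a minimal sketch (variable names are illustrative, and it assumes each chunk starts with the display-name line that grep -B1 pulled in):
my ($mainDisplay, $secondDisplay);
for my $chunk (grep { defined && length } $first, $second) {
    my ($name) = $chunk =~ /^\s*(.+):/;        # e.g. "Color LCD"
    if ($chunk =~ /Main Display/) { $mainDisplay   = $name }
    else                          { $secondDisplay = $name }   # stays undef when there is only one match
}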

awk:
awk -F : '
/^[^[:space:]]/ {current = $1; devices[$1]++}
$1 ~ /Main Display/ {main = current}
END {
    for (d in devices)
        if (d == main)
            print "mainDisplay=\"" d "\""
        else
            print "secondDisplay=\"" d "\""
}
'
outputs
mainDisplay="Color LCD"
secondDisplay="LED Cinema Display"
which you can capture and eval in the shell.

Here's a Perl solution. Use it like so: script.pl Resolution:. The default search is "Resolution:".
The values are stored in %values, for example:
$values{"Color LCD:"}{Resolution} eq "1440 x 900";
use strict;
use warnings;
my $grep = shift || "Resolution:";
my %values;
my $pre;
while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ /$grep/) {
        my @data;
        push @data, scalar <DATA> for (0 .. 4);
        chomp @data;
        for my $pair ($line, @data) {
            if ($pair =~ /^([^:]+): (.*)$/) {
                $values{$pre}{$1} = $2;
            } else { die "Unexpected data: $pair" }
        }
    } else {
        $pre = $line;
    }
}
use Data::Dumper;
print Dumper \%values;
__DATA__
Color LCD:
Resolution: 1440 x 900
Pixel Depth: 32-Bit Color (ARGB8888)
Main Display: Yes
Mirror: Off
Online: Yes
Built-In: Yes
LED Cinema Display:
Resolution: 1920 x 1200
Depth: 32-Bit Color
Core Image: Hardware Accelerated
Mirror: Off
Online: Yes
Quartz Extreme: Supported
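From %values, picking out the main and secondary display is then a matter of checking for the "Main Display" key (a sketch; note that the loop above stores the keys with their trailing colon):
my ($mainDisplay, $secondDisplay);
for my $display (keys %values) {
    (my $name = $display) =~ s/:\s*$//;     # "Color LCD:" -> "Color LCD"
    if (exists $values{$display}{'Main Display'}) {
        $mainDisplay = $name;
    } else {
        $secondDisplay = $name;             # remains undef if there is no second display
    }
}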

Related

lowercase everything except content between single quotes - perl

Is there a way in Perl to replace all text in an input line except the parts within single quotes (there could be more than one quoted section) using a regex? I have achieved this with the code below, but would like to see if it can be done with a regex and map.
while (<>) {
    my $m = 0;
    for (split(//)) {
        if (/'/ and ! $m) {
            $m = 1;
            print;
        }
        elsif (/'/ and $m) {
            $m = 0;
            print;
        }
        elsif ($m) {
            print;
        }
        else {
            print lc;
        }
    }
}
**Sample input:**
and (t.TARGET_TYPE='RAC_DATABASE' or (t.TARGET_TYPE='ORACLE_DATABASE' and t.TYPE_QUALIFIER3 != 'racinst'))
**Sample output:**
and (t.target_type='RAC_DATABASE' or (t.target_type='ORACLE_DATABASE' and t.type_qualifier3 != 'racinst'))
You can give this a shot. All one regexp.
$str =~ s/(?:^|'[^']*')\K[^']*/lc($&)/ge;
Or, cleaner and more documented (this is semantically equivalent to the above)
$str =~ s/
    (?:
        ^ |        # Match either the start of the string, or
        '[^']*'    # some text in quotes.
    )\K            # Then ignore that part,
                   # because we want to leave it be.
    [^']*          # Take the text after it, and
                   # lowercase it.
/lc($&)/gex;
The g flag tells the regexp to run as many times as necessary. e tells it that the substitution portion (lc($&), in our case) is Perl code, not just text. x lets us put those comments in there so that the regexp isn't total gibberish.
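As a self-contained check against the sample line from the question (a small test sketch):
use strict;
use warnings;

my $str = "and (t.TARGET_TYPE='RAC_DATABASE' or (t.TARGET_TYPE='ORACLE_DATABASE' and t.TYPE_QUALIFIER3 != 'racinst'))";
$str =~ s/(?:^|'[^']*')\K[^']*/lc($&)/ge;
print "$str\n";
# prints: and (t.target_type='RAC_DATABASE' or (t.target_type='ORACLE_DATABASE' and t.type_qualifier3 != 'racinst'))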
Aren't you playing too hard with regexes for such a simple job?
Why not let the kid split handle it today?
#!/usr/bin/perl
while (<>)
{
    @F = split "'";
    @F = map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..$#F);
    print join "'", @F;
}
The above is written for clarity. We often reasonably join the last two lines into:
print join "'", map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..$#F);
Or, to enjoy it even more, make it a one-liner (in a bash shell). In concept, it looks like:
perl -pF/'/ -e '$_ = join "'", map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..$#F);' YOUR_FILE
In reality, however, we need to respect the shell and do some (hard) escaping work:
perl -pF/\'/ -e '$_ = join "'"'"'", map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..$#F);' YOUR_FILE
(The single quote inside the single-quoted program has to become five characters: '"'"')
If it doesn't help your job, it helps sleep.
One more variant with a Perl one-liner, using hex \x27 for the single quotes:
$ cat sql_str.txt
and (t.TARGET_TYPE='RAC_DATABASE' or (t.TARGET_TYPE='ORACLE_DATABASE' and t.TYPE_QUALIFIER3 != 'racinst'))
$ perl -ne ' { @F=split(/\x27/); for my $val (0..$#F) { $F[$val]=lc($F[$val]) if $val%2==0 } ; print join("\x27",@F) } ' sql_str.txt
and (t.target_type='RAC_DATABASE' or (t.target_type='ORACLE_DATABASE' and t.type_qualifier3 != 'racinst'))
$

How to exec self with command-line arguments added to stdin?

EDIT: just to clarify, I want to know how one would implement the piping of content to an exec'd process, putting aside the question of whether Perl offers better ways to achieve the same end-result that don't involve this technique.
It's easier to describe what I want to do with a toy zsh script example:
#!/usr/bin/env zsh
# -----------------------------------------------------------------------------
# handle command-line arguments if any
# nb: ${(%):-%x} is zsh-speak for "yours truly"
# (( $# % 2 )) && (print -rl -- "$#"; [[ -t 0 ]] || cat) | exec ${(%):-%x}
(( $# > 0 )) && ([[ -t 0 ]] || cat; print -rl -- "$#") | exec ${(%):-%x}
# -----------------------------------------------------------------------------
# standard operation below this line
nl
The script above is pretty much a pass-through wrapper for the nl ("number lines") utility, except that, if command-line arguments are present, it will append them to its stdin. For example:
$ seq 3 | /tmp/nlwrapper.sh
1 1
2 2
3 3
$ seq 3 | /tmp/nlwrapper.sh foo bar baz
1 1
2 2
3 3
4 foo
5 bar
6 baz
Note that
the command-line arguments could have just as easily been prepended to stdin (in fact, if one uncomments the script's 7th line, the resulting script will prepend or append the command-line arguments to stdin depending on whether their number is odd or even, respectively); I am interested in both functionalities.
the script consists of two entirely independent sections: a preamble that handles the command-line arguments (if any), and the body (in this toy example consisting of a single line) that takes care of the script's main functionality (numbering lines); this is an essential design feature.
To elaborate on these two points a bit further: the "body" section knows nothing about the business with the command-line arguments. The code that implements this handling of command-line arguments could be prepended pretty much "as-is" to any zsh script that processes stdin. Moreover, changes to how the command-line arguments are handled (prepend vs append, etc) leave everything below the comment # standard operation below this line untouched. The two sections are truly independent of each other.
What would be the equivalent of the preamble above in a Perl script?
I know that the Perl-equivalent of the script above would have the general form
#!/usr/bin/env perl
use strict;
use English;
# -----------------------------------------------------------------------------
# handle command-line arguments if any
if ( @ARGV > 0 ) {
    # (mumble, mumble) ... -t STDIN ... exec $PROGRAM_NAME;
}
# -----------------------------------------------------------------------------
# standard operation below this line
printf "%6d\t$_", $., $_ while <>;
My problem is implementing this bit:
([[ -t 0 ]] || cat; print -rl -- "$#") |
I do know that
the [[ -t 0 ]] test in Perl is -t STDIN;
the cat part could be implemented with print while <>; and
the print -rl -- "$#" bit could be implemented with CORE::say for @ARGV
What I don't know is how to put these elements together to get the desired functionality.
Unless the Perl script needs to be as cryptic as the shell script, I would do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use autouse Carp => qw( croak );
use IO::Interactive qw( is_interactive );

run( \*STDIN, is_interactive() ? [] : (\@ARGV, [qw(ya ba da ba doo)]) );

sub run {
    my $appender = mk_appender(@_);
    printf "%6d\t%s", @$_ while $_ = $appender->();
}

sub mk_appender {
    my $fh = shift;
    my $line_number = 0;
    my $i = 0;
    my @readers = (
        sub { scalar <$fh> },
        (
            map { my $argv = $_; sub {
                ($i < @$argv) ? $argv->[$i++] . "\n" : ();
            }} @_
        ),
    );
    return sub {
        @readers or return;
        while ( @readers ) {
            my $line = $readers[0]->();
            return [++$line_number, $line] if defined $line;
            shift @readers;
            $i = 0;
        }
        return;
    };
}
Output:
$ seq 3 | perl t.pl foo bar baz
1 1
2 2
3 3
4 foo
5 bar
6 baz
7 ya
8 ba
9 da
10 ba
11 doo
Some advantages of using the facilities offered by Perl instead of trying to replicate the shell script include the fact that you can avoid spawning additional processes, reading through the entire standard input twice etc.
I also write about reading from multiple files at the same time in How to sum data from multiple files in Perl?
If you are intent on reading the same input twice, see pipe and Bidirectional communication with yourself.
Might be easier just using cat with some common shell features. (bash used here.)
cat <( seq 3 ) <( printf 'foo\nbar\nbaz\n' ) | prog
Solution:
use POSIX qw( );

sub munge_stdin {
    pipe(my $r, my $w) or die("pipe: $!");
    $w->autoflush();
    local $SIG{CHLD} = 'IGNORE';
    defined( my $pid = fork() ) or die("fork: $!");
    if (!$pid) {
        eval {
            close($r) or die("close pipe: $!");
            while (<STDIN>) {
                print($w $_) or die("print: $!");
            }
            for (@ARGV) {
                print($w "$_\n") or die("print: $!");
            }
            POSIX::_exit(0);
        };
        warn($@);
        POSIX::_exit(1);
    }
    close($w) or die("close pipe: $!");
    open(STDIN, '<&', $r) or die("dup: $!");
    @ARGV = ();
}

munge_stdin();
print while <>; # or: exec("cat") or die("exec: $!");
This solution supports exec() in the Perl script.
This solution won't deadlock no matter how large the inputs are.
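Plugged into the question's skeleton, the preamble stays on its own above the "standard operation" line (a sketch; munge_stdin is the sub defined above, and the @ARGV guard mirrors the zsh (( $# > 0 )) test):
#!/usr/bin/env perl
use strict;
use warnings;
# -----------------------------------------------------------------------------
# handle command-line arguments if any
munge_stdin() if @ARGV;
# -----------------------------------------------------------------------------
# standard operation below this line
printf "%6d\t%s", $., $_ while <>;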

grep variables and give informative output

I want to see how many times a specific word was mentioned in the file, and in how many of its lines.
My dummy example looks like this:
cat words
blue
red
green
yellow
cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
I am doing this:
for i in $(cat words); do grep "$i" text | wc >> output; done
cat output
2 2 51
0 0 0
1 1 26
0 0 0
But what I actually want to get is:
1. The word that was used as the search variable;
2. In how many lines (in addition to the total text hits) the word was found.
Preferable output looks like this:
blue 3 2
red 0 0
green 1 1
yellow 0 0
$1 - the variable that was grep'ed
$2 - how many times the variable was found in the text
$3 - in how many lines the variable was found
I hope someone can help me do this with grep, awk or sed, as they are fast enough for the large data set, but a Perl one-liner would help me too.
Edit
Tried this
for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*
and it kind of looks nice, but some of the words are longer than 300 characters, so I can't create a file named after each word.
You can use the grep option -o, which prints only the matched parts of a matching line, with each match on a separate output line.
while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo $line $wordcount $linecount
done < words | column -t
You can put it all on one line to make it a one-liner.
If column gives the "column too long" error, you can use printf provided you know the maximum number of characters. Use the below instead of echo and remove the pipe to column:
printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount
Replace the 20 with your max word length and the other numbers as well if you need to.
Here is a similar Perl solution, but written as a complete script.
#!/usr/bin/perl
use 5.012;

die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless @ARGV;

my $wordsfile = shift @ARGV;
my @wordlist = do {
    open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
    map {chomp; length() ? $_ : ()} <$words_fh>;
};

my %words;
while (<>) {
    for my $word (@wordlist) {
        my $cnt = 0;
        $cnt++ for /\Q$word\E/g;
        $words{$word}[0] += $cnt;
        $words{$word}[1] += 1&!! $cnt; # trick to force 1 or 0
    }
}

# sorts output by frequency; remove `sort {...}` to get unsorted output
for my $key (sort {$words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b} keys %words) {
    say join "\t", $key, @{ $words{$key} };
}
Example output:
blue 3 2
green 1 1
red 0 0
yellow 0 0
Advantage over bash script: every file is only read once.
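Assuming the script above is saved as, say, count_words.pl (the name is just illustrative), it would be run against the sample files like this:
$ perl count_words.pl words text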
This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent on stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:
perl -E 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for @w' < text
This requires perl 5.10 or later, but changing it to support 5.8 and earlier is trivial. (Change the -E to -e, change say to print, and add a \n at the end of each line of output.)
Output:
blue 3 2
red 0 0
green 1 1
yellow 0 0
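For pre-5.10 perls, the adjustment described above would look like this (only -E becomes -e and say becomes print with an explicit newline):
perl -e 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; print "$_\t$r{$_}[0]\t" . scalar(keys %{$r{$_}[1]}) . "\n" for @w' < text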
An awk (gawk) one-liner could save you from the grep puzzle:
awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
format the code a bit:
awk 'NR==FNR{ n[$0]; l[$0]; next }
     {
       for (w in n) {
         s = $0
         t = gsub(w, "#", s)
         n[w] += t
         l[w] += t > 0 ? 1 : 0
       }
     }
     END{ for (x in n) print x, n[x], l[x] }' words text
test with your example:
kent$ awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow 0 0
red 0 0
green 1 1
blue 3 2
if you want to format your output, you could just pipe the awk output to column -t
so it looks like:
yellow 0 0
red 0 0
green 1 1
blue 3 2
awk '
NR==FNR { words[$0]; next }
{
    for (word in words) {
        count = gsub(word,word)
        if (count) {
            counts[word] += count
            lines[word]++
        }
    }
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' words text

How to compress 4 consecutive blank lines into one single line in Perl

I'm writing a Perl script to read a log and rewrite it into a new log, removing empty lines whenever 4 or more consecutive blank lines occur. In other words, I have to compress any run of 4 or more consecutive blank lines into one single line, but runs of 1, 2 or 3 blank lines must keep their original format. I have tried to find a solution online, but all I can find is
perl -00 -pe ''
or
perl -00pe0
Also, I have seen a vim example to delete blocks of 4 empty lines, :%s/^\n\{4}//, which matches what I'm looking for, but it is vim, not Perl. Can anyone help with this? Thanks.
To collapse 4+ consecutive Unix-style EOLs to a single newline:
$ perl -0777 -pi.bak -e 's|\n{4,}|\n|g' file.txt
An alternative flavor using look-behind:
$ perl -0777 -pi.bak -e 's|(?<=\n)\n{3,}||g' file.txt
use strict;
use warnings;

my $cnt = 0;

sub flush_ws {
    $cnt = 1 if ($cnt >= 4);
    while ($cnt > 0) { print "\n"; $cnt--; }
}

while (<>) {
    if (/^$/) {
        $cnt++;
    } else {
        flush_ws();
        print $_;
    }
}
flush_ws();
Your -0 hint is a good one, since you can use -0777 to slurp the whole file in -p mode. Read more about these switches in perlrun. So this one-liner should do the trick:
$ perl -0777 -pe 's/\n{5,}/\n\n/g'
If there are up to four new lines in a row, nothing happens. Five newlines or more (four empty lines or more) are replaced by two newlines (one empty line). Note the /g switch here to replace not only the first match.
Deparsed code:
BEGIN { $/ = undef; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
    s/\n{5,}/\n\n/g;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
HTH! :)
One way using GNU awk, setting the record separator to NUL:
awk 'BEGIN { RS="\0" } { gsub(/\n{5,}/,"\n")}1' file.txt
This assumes that your definition of empty excludes whitespace-only lines.
This will do what you need. It counts consecutive blank lines in $n; when a non-blank line arrives, a run of four or more blank lines is collapsed to a single newline before that line is printed:
perl -ne 'if (/\S/) {$n = 1 if $n >= 4; print "\n" x $n, $_; $n = 0} else {$n++}' myfile

How to quickly find and replace many items on a list without replacing previously replaced items in BASH?

I want to perform many find-and-replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method which is quite slow, because I have to do the whole find and replace twice:
(1) Convert the CSV to three files, e.g.:
a.csv b.csv c.csv
orange 0001 fruit2
carrot 0002 vegetable1
apple 0003 fruit3
pear 0004 fruit4
ink 0005 item1
table 0006 item 2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    for i in `sed -n "$a"p ./b.csv`; do
        for j in `sed -n "$a"p ./a.csv`; do
            sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
            echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a Perl solution which does the replacement in "one phase".
#!/usr/bin/perl
use strict;

my %map = (
    orange => "fruit2",
    carrot => "vegetable1",
    apple  => "fruit3",
    pear   => "fruit4",
    ink    => "item1",
    table  => "item2",
);

my $repl_rx = '(' . join("|", map { quotemeta } keys %map) . ')';

my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
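To drive this from the actual CSV instead of a hard-coded hash, %map could be populated from the file first (a sketch assuming plain two-column rows with no embedded commas or quotes; the file name is illustrative):
my %map;
open my $csv, '<', 'replace-list.csv' or die "replace-list.csv: $!";
while (my $row = <$csv>) {
    chomp $row;
    my ($from, $to) = split /,/, $row, 2;   # first column is the search string, rest is the replacement
    $map{$from} = $to if defined $to;
}
close $csv;
If any search string is a prefix of another, build $repl_rx from the keys sorted longest-first, sort { length $b <=> length $a } keys %map, so that the longer match wins.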
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
"orange" "fruit2"
"carrot" "vegetable1"
"apple" "fruit3"
"pear" "fruit4"
"ink" "item1"
"table" "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
    [orange]=fruit2
    [carrot]=vegetable1
    [apple]=fruit3
    [pear]=fruit4
    [ink]=item1
    [table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
    matched=false
    for key in "${!map[@]}"; do
        if [[ ${str:$i:${#key}} = $key ]]; then
            str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
            ((i+=${#map[$key]}))
            matched=true
            break
        fi
    done
    $matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[@]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[@]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die ("first parameter must be the replacement list file") unless defined ($repls);
my $tmpFmt = "###%d###";
open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;
my @replsList;
my $i = 0;
while (<$replsFile>) {
    chomp;
    my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
    if (defined($from) && defined($to)) {
        push(@replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
    }
}
while (<>) {
    foreach my $r (@replsList) {
        s/$r->[0]/$r->[1]/g;
    }
    foreach my $r (@replsList) {
        s/$r->[1]/$r->[2]/g;
    }
    print;
}
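Given the parsing regex above, the replacement list needs double-quoted fields (lines like "orange","fruit2"), and a run might look like this (the script and file names are illustrative):
$ perl two_phase_replace.pl replace-list.csv file.txt > replaced.txt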
I would guess that most of your slowness is coming from creating so many sed commands, which each need to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running 1 sed per file per step.
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    cmd=""
    for i in `sed -n "$a"p ./a.csv`; do
        for j in `sed -n "$a"p ./b.csv`; do
            cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
            echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
            a=`expr $a + 1`
        done
    done
    sed -i "$cmd" ./file.txt
done
Doing it twice is probably not your problem. If you managed to just do it once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try)
But continuing down the path of other posters, the next step might be pipelining. Write a little program that replaces two columns, then run that program twice, simultaneously. The first run swaps out strings in column1 with strings in column2, the next swaps out strings in column2 with strings in column3.
Your command line would be like this
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;

open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");

my @replace_pairs;
# read in the list of things to replace
while(<REPLACEFILE>) {
    chomp();
    my @cols = split /\t/, $_;
    my $to_replace = $cols[$before_replace_colnum];
    my $replace_with = $cols[$after_replace_colnum];
    push @replace_pairs, [$to_replace, $replace_with];
}

# read input from stdin, do swapping
while(<STDIN>) {
    # loop over all replacement strings
    foreach my $replace_pair (@replace_pairs) {
        my($to_replace,$replace_with) = @{$replace_pair};
        $_ =~ s/${to_replace}/${replace_with}/g;
    }
    print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
    read countmd5sum x < <(md5sum <<< $count)
    count=$(( $count + 1 ))
    bigfrom="$bigfrom;s/$from/$countmd5sum/g"
    bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum to get a unique token, but some other mechanism can also be used to generate such a token, like reading from /dev/urandom or shuf -n1 -i 10000000-20000000.
An awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
Here, it first generates the sed script, using the csv as the input file.
Then it uses another sed instance to operate on input.txt.
Notes:
The intermediate file generated, sed_script.sed, can be re-used, unless the input csv file changes.
####<number>#### is chosen as a pattern that is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted. In this case each character in the substitution string is replaced by itself and a \n. Finally once all substitutions have taken place the \n's are removed leaving the finished string.
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace (based on the sample data):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This is a single-pass, awk-only answer with very little regex.
It reads the "repl.csv" file into an associative array (see BEGIN{}), then attempts to match on prefixes of each word, bounding the prefix length by the key length limits, to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
    while( getline repline < "repl.csv" ) {
        split( repline, replarr, "," )
        replassocarr[ replarr[1] ] = replarr[2]
        # set some bounds on the replace word sizes
        if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
            minKeyLen = length( replarr[1] )
        if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
            maxKeyLen = length( replarr[1] )
    }
    close( "repl.csv" )
}

{
    i = 1
    while( i <= NF ) { print_word( $i, i == NF ); i++ }
}

function print_word( w, end ) {
    wl = length( w )
    for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
        key = substr( w, 1, j )
        wl = length( key )
        if( wl >= minKeyLen && key in replassocarr ) {
            printf( "%s%s%s", replassocarr[ key ],
                    substr( w, j+1 ), !end ? " " : "\n" )
            return
        }
    }
    printf( "%s%s", w, !end ? " " : "\n" )
}

function prefix_len_bound( len, jlen ) {
    return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeyLen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course, any "savings" from not looking in replassocarr go away when the words to be replaced get down to length=1, or if the average word length is much greater than that of the words to replace.