Reading from DATA via the typical slurp works. Trying to use DATA as a filehandle on which I can do a seek does not work. Is anyone able to point me the to the obvious mistake I must be making?
Code:
#!/usr/bin/env perl
use strict;
use warnings;
if ($ARGV[0] eq 'seek' ) {
my $log_fh = \*DATA;
$log_fh->seek(64,0);
print "\n-- 64 --\n",join ("", <$log_fh> );
} else {
while (<DATA>) {
print $_;
}
}
exit;
__DATA__
01234567890123456789
1234567890123456789
1234567890123456789
12
X <- That X is the 64th char in
this file.
Y <- That Y is the 106th char in this file.
junk
more junk.
bye!
$ perl file_from_data.pl slurp
01234567890123456789
1234567890123456789
1234567890123456789
12
X <- That X is the 64th char in
this file.
Y <- That Y is the 106th char in this file.
junk
more junk.
bye!
Running the while() loop:
$ perl file_from_data.pl slurp
01234567890123456789
1234567890123456789
1234567890123456789
12
X <- That X is the 64th char in
this file.
Y <- That Y is the 106th char in this file.
junk
more junk.
bye!
Running the seek(), it appears to not start at DATA but the start of the script:
$ perl file_from_data.pl seek
-- 64 --
'seek' ) {
my $log_fh = \*DATA;
$log_fh->seek(64,0);
print "\n-- 64 --\n",join ("", <$log_fh> );
} else {
while (<DATA>) {
print $_;
}
}
exit;
__DATA__
01234567890123456789
1234567890123456789
1234567890123456789
12
X <- That X is the 64th char in
this file.
Y <- That Y is the 106th char in this file.
junk
more junk.
bye!
This is an old Perl:
$ perl -v
This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-
thread-multi
Running the seek(), it appears to not start at DATA but the start of the script
I don't think you're making any mistake at all. That's exactly what happens. DATA is a filehandle that is open on your source file. Before your first read() from that filehandle, the file pointer is positioned immediately after the __DATA__ token in the file. But you can use seek() to move the file pointer to any position at all in the file.
I guess it would be harder to implement a "special case" filehandle that wasn't able to move back before its initial position.
I have 2 separate files namely A & B containing same header lines but 2 and 1 column respectively. I want to take inverse log2 of the 2nd column or 1st column in separate files but keep the other description intact. I am having some thing like this.. values in file A $1 and $2 are separated by delimiter tab
file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 0.781985
16 0.810993
20 0.769601
24 0.733831
file B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
0.721985
0.610993
0.760123
0.573831
I expect an output like this. file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.7194950944
16 1.754418585
20 1.7047982296
24 1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr2
for file B (in this file values are just copy paste of file A)
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.7194950944
1.754418585
1.7047982296
1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig rep1.bar.wig graphType=bar
variableStep chrom=chr2
This awk script does the calculation that you want:
awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' file
This matches lines that contain only digits, periods and any space characters, substituting the value of the last field $NF for 2 raised to the power of $NF. The format specifier %.12f can be modified to give you the required number of decimal places. The 1 at the end is shorthand for {print}.
Testing it out on your new files:
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.719495094445
16 1.754418584953
20 1.704798229573
24 1.663049372620
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.649449947457
1.527310087388
1.693635012985
1.488470882686
So here's the Perl version:
use strict;
open IN, $ARGV[0];
while (<IN>) {
chomp;
if (/^(.*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
my $power = 2 ** $2;
print("$1\t" . $power . "\n");
} elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
my $power = 2 ** $1;
print($power . "\n");
} else { # echo all other stuff
print;
print ("\n");
}
}
close IN;
If you run <file>.pl <datafile> (replace with appropriate names) it will convert one file so the lines have 2**<2nd value>). It simply echoes the lines that do not match the number pattern.
This is the modified little script of #ThomasKilian
Thanks to him for providing the framework.
use strict;
open IN, $ARGV[0];
while (<IN>) {
chomp;
if (/^(\d*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
my $power = 2 ** $2;
$power= sprintf("%.12f", $power);
print("$1\t" . $power . "\n");
} elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
my $power = 2 ** $1;
$power= sprintf("%.12f", $power);
print($power . "\n");
} else { # echo all other stuff
print;
print ("\n");
}
}
close IN;
I have a very large file of size about 300-500 MB. I need to search for String1 in that file first. Then search for String2 starting from the position of String1. Then again search for String3 starting from the position of String2. For example,
String1 = "abc"
String2 = "123"
String3 = "opq"
File :
def
123
opq
opq
123
opq
abc //come here first
blah blah
123 //come here next
blah
opq //read this finally and print
afg
123
blah blah
123
def
Methods I followed,
I tried reading the file line by line and searching for the matching pattern.
It was a very slow method (had to wait for minutes).
Then I stored the whole file into an array and grepped the matching lines to get the final line.
It was quite fast in searching but slower in loading the file into an array. The memory consumed is also high.
Is there an efficient method to perform such a task?
Using a perl one liner and range operators:
perl -ne 'print("$. $_") && exit if (/abc/ .. 1) && (/123/ .. 1) && /opq/' file
Outputs:
11 opq //read this finally and print
Explanation:
Switches:
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
Code:
print("$. $_"): prints the line number $. followed by the current line $_
exit: Terminate processing after desired line is found.
if (/abc/ .. 1) && (/123/ .. 1) && /opq/: Find patterns in order.
Addendum - for including functionality inside a script
I would advise against shelling out to another perl process to achieve this functionality. Instead just convert this to a non-commandline version:
use strict;
use warnings;
use autodie;
open my $fh, '<', 'file';
while (<$fh>) {
if ((/abc/ .. 1) && (/123/ .. 1) && /opq/) {
print "$. $_";
last;
}
}
I want to perform about many find and replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.
E.g.:
orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2
Original file:
"I like to eat apples and carrots"
Resulting output file:
"I like to eat fruit3s and vegetable1s."
However, I want to ensure that if one part of text has already been replaced, that it doesn't mess with text that was already replaced. In other words, I don't want it to appear like this (it matched "table" from within vegetable1):
"I like to eat fruit3s and vegeitem21s."
Currently, I am using this method which is quite slow, because I have to do the whole find and replace twice:
(1) Convert the CSV to three files, e.g.:
a.csv b.csv c.csv
orange 0001 fruit2
carrot 0002 vegetable1
apple 0003 fruit3
pear 0004 fruit4
ink 0005 item1
table 0006 item 2
(2) Then, replace all items from a.csv in file.txt with the matching column in b.csv, using ZZZ around the words to make sure there is no mistake later in matching the numbers:
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
for i in `sed -n "$a"p ./b.csv`; do
for j in `sed -n "$a"p ./a.csv`; do
sed -i "s/$i/ZZZ$j\ZZZ/g" ./file.txt
echo "Instances of '"$i"' replaced with '"ZZZ$j\ZZZ"' ("$a"/"$b")."
a=`expr $a + 1`
done
done
done
(3) Then running this same script again, but to replace ZZZ0001ZZZ with fruit2 from c.csv.
Running the first replacement takes about 2 hours, but as I must run this code twice to avoid editing the already replaced items, it takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text already replaced?
Here's a perl solution which is doing the replacement in "one phase".
#!/usr/bin/perl
use strict;
my %map = (
orange => "fruit2",
carrot => "vegetable1",
apple => "fruit3",
pear => "fruit4",
ink => "item1",
table => "item2",
);
my $repl_rx = '(' . join("|", map { quotemeta } keys %map) . ')';
my $str = "I like to eat apples and carrots";
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";
Tcl has a command to do exactly this: string map
tclsh <<'END'
set map {
"orange" "fruit2"
"carrot" "vegetable1"
"apple" "fruit3"
"pear" "fruit4"
"ink" "item1"
"table" "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END
I like to eat fruit3s and vegetable1s
This is how to implement it in bash (requires bash v4 for the associative array)
declare -A map=(
[orange]=fruit2
[carrot]=vegetable1
[apple]=fruit3
[pear]=fruit4
[ink]=item1
[table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
matched=false
for key in "${!map[#]}"; do
if [[ ${str:$i:${#key}} = $key ]]; then
str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
((i+=${#map[$key]}))
matched=true
break
fi
done
$matched || ((i++))
done
echo "$str"
I like to eat apples and carrots
I like to eat fruit3s and vegetable1s
This will not be speedy.
Clearly, you may get different results if you order the map differently. In fact, I believe the order of "${!map[#]}" is unspecified, so you might want to specify the order of the keys explicitly:
keys=(orange carrot apple pear ink table)
# ...
for key in "${keys[#]}"; do
One way to do it would be to do a two-phase replace:
phase 1:
s/orange/##1##/
s/carrot/##2##/
...
phase 2:
s/##1##/fruit2/
s/##2##/vegetable1/
...
The ##1## markers should be chosen so that they don't appear in the original text or the replacements of course.
Here's a proof-of-concept implementation in perl:
#!/usr/bin/perl -w
#
my $repls = $ARGV[0];
die ("first parameter must be the replacement list file") unless defined ($repls);
my $tmpFmt = "###%d###";
open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;
my #replsList;
my $i = 0;
while (<$replsFile>) {
chomp;
my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
if (defined($from) && defined($to)) {
push(#replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
}
}
while (<>) {
foreach my $r (#replsList) {
s/$r->[0]/$r->[1]/g;
}
foreach my $r (#replsList) {
s/$r->[1]/$r->[2]/g;
}
print;
}
I would guess that most of your slowness is coming from creating so many sed commands, which each need to individually process the entire file. Some minor adjustments to your current process would speed this up a lot by running 1 sed per file per step.
a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
cmd=""
for i in `sed -n "$a"p ./a.csv`; do
for j in `sed -n "$a"p ./b.csv`; do
cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
echo "Instances of '"$i"' replaced with '"ZZZ${j}ZZZ"' ("$a"/"$b")."
a=`expr $a + 1`
done
done
sed -i "$cmd" ./file.txt
done
Doing it twice is probably not your problem. If you managed to just do it once using your basic strategy, it would still take you an hour, right? You probably need to use a different technology or tool. Switching to Perl, as above, might make your code a lot faster (give it a try)
But continuing down the path of other posters, the next step might be pipelining. Write a little program that replaces two columns, then run that program twice, simultaneously. The first run swaps out strings in column1 with strings in column2, the next swaps out strings in column2 with strings in column3.
Your command line would be like this
cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt
And replace.pl would be like this (similar to other solutions)
#!/usr/bin/perl -w
my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;
open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");
my #replace_pairs;
# read in the list of things to replace
while(<REPLACEFILE>) {
chomp();
my #cols = split /\t/, $_;
my $to_replace = $cols[$before_replace_colnum];
my $replace_with = $cols[$after_replace_colnum];
push #replace_pairs, [$to_replace, $replace_with];
}
# read input from stdin, do swapping
while(<STDIN>) {
# loop over all replacement strings
foreach my $replace_pair (#replace_pairs) {
my($to_replace,$replace_with) = #{$replace_pair};
$_ =~ s/${to_replace}/${replace_with}/g;
}
print STDOUT $_;
}
A bash+sed approach:
count=0
bigfrom=""
bigto=""
while IFS=, read from to; do
read countmd5sum x < <(md5sum <<< $count)
count=$(( $count + 1 ))
bigfrom="$bigfrom;s/$from/$countmd5sum/g"
bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv
sed "${bigfrom:1}$bigto" input_file.txt
I have chosen md5sum, to get some unique token. But some other mechanism can also be used to generate such token; like reading from /dev/urandom or shuf -n1 -i 10000000-20000000
A awk+sed approach:
awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
A cat+sed+sed approach:
cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt
Mechanism:
Here, it first generates the sed script, using the csv as input file.
Then uses another sed instance to operate on input.txt
Notes:
The intermediate file generated - sed_script.sed can be re-used again, unless the input csv file changes.
####<number>#### is chosen as some pattern, which is not present in the input file. Change this pattern if required.
cat -n | is not UUOC :)
This might work for you (GNU sed):
sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file
Convert the csv file into a sed script. The trick here is to replace the substitution string with one which will not be re-substituted. In this case each character in the substitution string is replaced by itself and a \n. Finally once all substitutions have taken place the \n's are removed leaving the finished string.
There are a lot of cool answers here already. I'm posting this because I'm taking a slightly different approach by making some large assumptions about the data to replace ( based on the sample data ):
Words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is exactly represented in the csv
This a single pass, awk only answer with very little regex.
It reads the "repl.csv" file into an associative array ( see BEGIN{} ), then attempts to match on prefixes of each word when the length of the word is bound by key length limits, trying to avoid looking in the associative array whenever possible:
#!/bin/awk -f
BEGIN {
while( getline repline < "repl.csv" ) {
split( repline, replarr, "," )
replassocarr[ replarr[1] ] = replarr[2]
# set some bounds on the replace word sizes
if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
minKeyLen = length( replarr[1] )
if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
maxKeyLen = length( replarr[1] )
}
close( "repl.csv" )
}
{
i = 1
while( i <= NF ) { print_word( $i, i == NF ); i++ }
}
function print_word( w, end ) {
wl = length( w )
for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
key = substr( w, 1, j )
wl = length( key )
if( wl >= minKeyLen && key in replassocarr ) {
printf( "%s%s%s", replassocarr[ key ],
substr( w, j+1 ), !end ? " " : "\n" )
return
}
}
printf( "%s%s", w, !end ? " " : "\n" )
}
function prefix_len_bound( len, jlen ) {
return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeylen)
}
Based on input like:
I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink
It yields output like:
I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1
Of course any "savings" of not looking the replassocarr go away when the words to be replaced goes to length=1 or if the average word length is much greater than the words to replace.
Hey all, sorry for posting this here I could not find an answer anywhere and my solution is not working. I have a log file being written in the following fashion (don't ask):
=============================
04-12-2011 11:37:10 SOMETHING_GOES_HERE
Variable1
Something Goes Here
=============================
04-12-2011 11:37:20 SOMETHING_GOES_HERE
Variable2
Anything different may be here
=============================
04-12-2011 11:37:30 SOMETHING_GOES_HERE
Variable3
is altogether different here
=============================
What I'd like to do (either in perl or awk as this is an RTOS) is:
Take a look at the file, if Variable1 exists, then start at Variable2 and print everything between the equal signs:
E.g.
=============================
04-12-2011 11:37:10 SOMETHING_GOES_HERE
Variable1
Mary had a little lamb
=============================
04-12-2011 11:37:20 SOMETHING_GOES_HERE
Variable2
The cow jumped over the moon
=============================
awk '/Mary had/{getline;getline;getline;print}'
will only print
04-12-2011 11:37:20 SOMETHING_GOES_HERE
but what I need is everything between the equal signs. I tried butchering up a perl script which isn't working either. Any thoughts?
Alright, this worked (sort of)
#!perl -w
use strict;
my $LOGFILE = "/home/mydir/MyTestFile";
open my $fh, "<$LOGFILE" or die("could not open log file: $!");
my $in = 0;
while(<$fh>)
{
$in = 1 if /Variable1/i;
print if($in);
$in = 0 if /Variable2/i;
}
In the sense that a lot was stripped out. Now another q I have is selective printing a-la awk. Typically, I can get the line before something using:
echo "
test
hello
foo
bar" | awk '/foo/{print x};{x=$0}'
Will print test, however haven't found a way to get the word test (will always be a different word, but the word foo will always remain). Any takers (by the way many thanks in advance)
local $/ = "\n=============================\n";
while (<>) {
chomp;
...
}
Alternative:
my $rec = '';
while (<>) {
if (!/^=============================$/) {
$rec .= $_;
next;
}
...
$rec = '';
}
$/, chomp
I'm not totally clear on if "Variable1/2" etc are constant strings or actually varying quanitites, but does this work:
if [ $(grep Variable1 $file | wc -l) -gt 0 ]; then
sed -n '/Variable2/,/=======/' $file
fi