Change a number arithmetically in a text file using perl - perl

I have a bunch of numbers in a text file as follows (example):
r0 = 204
r1 = 205
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 205
r1 = 206
I would like to divide any line with r0 = by 100 so that the line will then read
r0 = 2.04
I would like to do this for all lines with r0 and also for r1. Is there a way to do this in perl?
This is my attempt, but it doesn't work, mainly because I've never used Perl before, which is why I'm asking such a simple question:
#!/usr/bin/perl
$string= r0\s+=\s+\\(d+)
$num= $1/100
$num2= r0\s+=\s+\\$num
s/$string/$num2;
A one-liner I could run from bash would be much better, though. I know it'll involve the s/find/replace/ function, but I'm not sure how to specify the integer part.

perl -i -pe 's#^(r[01]\s*=\s*)(\d+)$#$1.$2/100#e' filename
The options mean:
-p = Run the code in a loop that prints the modified input
-e = Execute the code in the first argument
-i = Replace the input file(s) with the output
The regular expression bits mean:
^ = beginning of line
r[01] = r0 or r1
\s*=\s* = any amount of whitespace, an =, and any amount of whitespace
\d+ = digits
$ = end of line
The replacement uses the e modifier, which means that it should be executed as a Perl expression. $1 and $2 are the contents of the two capture groups: $1 is everything before the number, $2 is the number. $2/100 divides the number by 100, and . concatenates the two pieces together.
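For readers who want to see the same capture-and-compute idea outside Perl, here is a minimal Python sketch. It is not part of the original answer: the function name `scale_line` and the `divisor` parameter are made up for illustration, and the `:g` formatting is an assumption about how the result should be printed.

```python
import re

# Same pattern as the Perl one-liner: group 1 is everything before the
# number, group 2 is the number itself.
PATTERN = re.compile(r"^(r[01]\s*=\s*)(\d+)$")

def scale_line(line, divisor=100):
    """Divide the value on an r0/r1 line by `divisor`; return other lines unchanged."""
    # The lambda plays the role of Perl's /e modifier: the replacement
    # is computed from the captured groups instead of being a fixed string.
    return PATTERN.sub(
        lambda m: f"{m.group(1)}{int(m.group(2)) / divisor:g}", line
    )

for sample in ["r0 = 204", "r1 = 205", "max_gap = 20u", "min = 0"]:
    print(scale_line(sample))
```

Lines that do not match the pattern, such as `max_gap = 20u`, pass through untouched, just as they do with `-p` in Perl.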

As a one-liner:
perl -pi -e 's{^r[01]\s*=\s*\K(\d+)$}{$1/100}e' filename.txt

Here is an awk solution:
awk '/^r[01]/ {$3/=100} 1' file
r0 = 2.04
r1 = 2.05
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 2.05
r1 = 2.06

Related

Matlab; how to extract information from a header's file (text file)

I have many text files that have 35 lines of header followed by a large matrix with data of an image (that info can be ignored and do not need to read it at the moment). I want to be able to read the header lines and extract information contained on those lines. For instance the first few lines of the header are..
File Version Number: 1.0
Date: 06/05/2015
Time: 10:33:44 AM
===========================================================
Beam Voltage (-kV) = 13.000
Filament (W) = 4.052
Cond. (-kV) = 8.885
CenterX1 (V) = 10.7
CenterY1 (V) = -45.9
Objective (%) = 71.40
OctupoleX = -0.4653
OctupoleY = -0.1914
Angle (deg) = 0.00
.
I would like to be able to open this text file and read the value of the date and time the file was created, the filament power, the condenser voltage, the angle, etc., and save these in variables or send them to a text box in a GUI program.
I have tried several things, but since the values I want to extract sometimes come after a '=', sometimes after a ':' and sometimes simply after a ' ', I do not know how to approach this. Perhaps read each line and look for a matching word?
Any help would be much appreciated.
Thanks,
Alex
This is not particularly difficult, and one of the ways to do it would be to parse line-by-line as you suggested. Something like this:
MAX_LINES_TO_READ = 35;
fid = fopen('input.txt');
lineCount = 0;
dateString = '';
beamVoltage = 0;
while ~feof(fid)
    line = fgetl(fid);
    lineCount = lineCount + 1;
    %//check conditions for skipping loop body
    if isempty(line)
        continue
    elseif lineCount > MAX_LINES_TO_READ
        break
    end
    %//find headers you are interested in
    if strfind(line, 'Date')
        %//find the first location of the header separator
        idx = find(line == ':', 1);
        %//extract substring starting from 1 char after separator
        %//note: the trim is to get rid of leading/trailing whitespace
        dateString = strtrim(line(idx + 1 : end));
    elseif strfind(line, 'Beam Voltage')
        idx = find(line == '=', 1);
        beamVoltage = str2double(line(idx + 1 : end));
    end
end
fclose(fid);
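The same line-by-line idea can be generalized so that every header field is collected at once instead of one `elseif` per field. The following is only a sketch in Python, given for comparison. It assumes that the first ':' or '=' on a line separates the name from the value; the names `HEADER_RE` and `parse_header` are hypothetical.

```python
import re

# Assumption: the first ':' or '=' on a line separates name and value.
# Lines without such a separator (rulers, lone dots) are skipped.
HEADER_RE = re.compile(r"^\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$")

def parse_header(lines, max_lines=35):
    """Return a dict mapping header names to their values (as strings)."""
    fields = {}
    for line in lines[:max_lines]:
        m = HEADER_RE.match(line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

header = [
    "File Version Number: 1.0",
    "Date: 06/05/2015",
    "Time: 10:33:44 AM",
    "===========================================================",
    "Beam Voltage (-kV) = 13.000",
    "Angle (deg) = 0.00",
    ".",
]
fields = parse_header(header)
print(fields["Date"], fields["Time"], float(fields["Beam Voltage (-kV)"]))
```

Because only the first separator is used, a value such as `10:33:44 AM` keeps its own colons. Numeric fields stay strings until they are converted explicitly, as with `float` above.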

If statement inside awk to change a value

I have the following file
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
I want to do the following task
Check if after the sequence ERG= there is a number or a string.
If it's a string, find the sequence SI1 L and change the value after that, using values that the user inputs.
If it's a number, change the number using values that the user inputs.
Note that if after ERG= there is a number, there will be no SI1 L sequence.
For instance number 2 can be accomplished using the following
#! /bin/bash
vals=(0.02 0.03 0.04 0.05)
for val in "${vals[@]}"; do
awk -vval="$val" '$1=="SI1"{$3=val}1' 20
done
How can the above algorithm be achieved?
#!/bin/bash
val="$*"
awk -v val="$val" '
BEGIN { i=1; split (val,v," ") }
# If it is a string, find the sequence SI1 L and change the value after that, using values that the user inputs
/SDEF POS.*ERG=[a-zA-Z]+/ { flag="y" ; }
/SI1 L/ { if (flag=="y") { $3=v[i]; i++; flag="n"; } }
# If it is a number, change the number using values that the user inputs.
/SDEF POS.*ERG=[0-9.]+/ { sub(/ERG=[0-9.]+/, "ERG=" v[i]); i++ }
1
' file
Hints:
If the first rule finds ERG= followed by one or more letters ([a-zA-Z]+), it sets the flag.
The /SI1 L/ rule only fires if the flag is set. Once it has fired, it clears the flag again, so that any later /SI1 L/ line is left alone.
.* matches zero or more of any character.
[A-Za-z]+ matches one or more alphabetic characters, in lower or upper case.
awk -F '[[:blank:]=]' -v string_value="foo" -v number_value=42 '
/ERG=/ {
for (i=1; i<NF; i++)
if ($i == "ERG") {
isstring = ($(i+1) ~ /[^[:digit:]]/)
break
}
if (!isstring)
$(i+1) = number_value
}
/SI1 L/ && isstring { $NF = string_value }
1
' filename
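The decision logic from the question (number after ERG=: replace it in place; name after ERG=: replace the value on the SI1 L line instead) can also be written out step by step. This is a hedged sketch in Python, not a replacement for the awk answers. The function names `is_number` and `set_energy` are made up for illustration, and the new value is passed as a string so that its formatting is preserved.

```python
import re

ERG_RE = re.compile(r"(ERG=)(\S+)")

def is_number(token):
    """True if the token parses as a float (assumed meaning of 'a number')."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def set_energy(lines, value):
    """Return a copy of `lines` with the energy replaced by `value` (a string)."""
    out = []
    erg_is_name = False  # set when ERG= refers to a distribution such as d1
    for line in lines:
        m = ERG_RE.search(line)
        if line.startswith("SDEF") and m:
            if is_number(m.group(2)):
                # Case "number": overwrite it directly on the SDEF line.
                line = ERG_RE.sub(lambda mm: mm.group(1) + value, line, count=1)
            else:
                # Case "string": remember to patch the SI1 L line later.
                erg_is_name = True
        elif erg_is_name and line.startswith("SI1 L"):
            line = "SI1 L " + value
            erg_is_name = False
        out.append(line)
    return out

deck = [
    "SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2",
    "SI1 L 0.020",
    "SP1 1",
]
print("\n".join(set_energy(deck, "0.03")))
```

Calling `set_energy` once per user-supplied value and writing each result to its own file would reproduce the loop from the question.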

How do I determine the maximum range for perl's range iterator?

I can exceed perl's range iteration bounds like so, with or without -Mbigint:
$» perl -E 'say $^V; say for (0..shift)' 1e19
v5.16.2
Range iterator outside integer range at -e line 1.
How can I determine this upper limit, without simply trying until I exceed it?
It's an IV.
>> similarly works on integers, so you can use
my $max_iv = -1 >> 1;
my $min_iv = -(-1 >> 1) - 1;
They can also be derived from the size of an IV.
my $max_iv = (1 << ($iv_bits-1)) - 1;
my $min_iv = -(1 << ($iv_bits-1));
The size of an IV can be obtained using
use Config qw( %Config );
my $iv_bits = 8 * $Config{ivsize};
or
my $iv_bits = 8 * length pack 'j', 0;
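As a worked check of the two formulas above, the arithmetic can be reproduced with arbitrary-precision integers. The sketch below uses Python purely as a calculator; `iv_bounds` is a hypothetical helper, and the 64-bit and 32-bit widths are assumed examples rather than something read from a Perl build.

```python
def iv_bounds(iv_bits):
    """Smallest and largest value of a signed integer that is iv_bits wide."""
    max_iv = (1 << (iv_bits - 1)) - 1
    min_iv = -(1 << (iv_bits - 1))
    return min_iv, max_iv

for bits in (32, 64):
    lo, hi = iv_bounds(bits)
    print(bits, lo, hi)

# 1e19 from the question lies above the 64-bit maximum, which is
# consistent with the "outside integer range" error shown there.
print(1e19 > iv_bounds(64)[1])
```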

Get regions from a file that are part of regions in other file (Without loops)

I have two files:
regions.txt: First column is the chromosome name, second and third are start and end position.
1 100 200
1 400 600
2 600 700
coverage.txt: First column is chromosome name, again second and third are start and end positions, and last column is the score.
1 100 101 5
1 101 102 7
1 103 105 8
2 600 601 10
2 601 602 15
This file is huge: about 15 GB, with about 300 million lines.
I basically want to get the mean of all scores in coverage.txt that are in each region in regions.txt.
In other words, start at the first line in regions.txt; if there is a line in coverage.txt that has the same chromosome, whose start is >= the region start and whose end is <= the region end, then save its score to a new array. After searching all of coverage.txt, print the region's chromosome, start, end, and the mean of all scores that were found.
Expected output:
1 100 200 6.7 which is (5+7+8)/3
1 400 600 0 no match in coverage.txt
2 600 700 12.5 which is (10+15)/2
I built the following MATLAB script, which takes a very long time since I have to loop over coverage.txt many times. I don't know how to write a fast equivalent in awk.
My MATLAB script:
fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');
while feof(ft) == 0
linet = fgetl(ft);
scant = textscan(linet, '%d%d%d');
tchr = scant{1};
tx = scant{2};
ty = scant{3};
coverages = [];
frewind(fc);
while feof(fc) == 0
linec = fgetl(fc);
scanc = textscan(linec, '%d%d%d%d');
cchr = scanc{1};
cx = scanc{2};
cy = scanc{3};
cov = scanc{4};
if (cchr == tchr) && (cx >= tx) && (cy <= ty)
coverages = cat(2, coverages, cov);
end
end
covmed = median(coverages);
fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed);
end
Any suggestions for an alternative using AWK, Perl, etc. are welcome. I would also be pleased if someone could teach me how to get rid of all the loops in my MATLAB script.
Thanks
Here is a Perl solution. I use hashes (aka dictionaries) to access the various ranges via the chromosome, thus reducing the number of loop iterations.
This is potentially efficient, as I don't do a full loop over regions.txt on every input line. Efficiency could perhaps be increased further when multithreading is used.
#!/usr/bin/perl
use strict;
use warnings;

my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile or die "Can't open $rangefile: $!";
# construct the ranges. The chromosome is used as range key.
# each range is stored as [start, end, count, sum]
my %ranges;
while (<$rFH>) {
    chomp;
    my @field = split /\s+/;
    push @{$ranges{$field[0]}}, [@field[1,2], 0, 0];
}
close $rFH;
# iterate over all the input
while (my $line = <STDIN>) {
    chomp $line;
    my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
    # only loop over ranges with matching chromosome
    foreach my $range (@{$ranges{$chrom}}) {
        if ($$range[0] <= $lower and $upper <= $$range[1]) {
            $$range[2]++;
            $$range[3] += $value;
            last; # break out of foreach early because ranges don't overlap
        }
    }
}
# create the report
foreach my $chrom (sort {$a <=> $b} keys %ranges) {
    foreach my $range (@{$ranges{$chrom}}) {
        my $value = $$range[2] ? $$range[3]/$$range[2] : 0;
        printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value;
    }
}
Example invocation:
$ perl script.pl regions.txt <coverage.txt >output.txt
Output on the example input:
1 100 200 6.7
1 400 600 0.0
2 600 700 12.5
(because (5+7+8)/3 = 6.66…)
Normally, I would load the files into R and calculate it there, but given that one of them is so huge, this would become a problem. Here are some thoughts that might help you solve it.
Consider splitting coverage.txt by chromosome. This would make the calculations less demanding.
Instead of looping over coverage.txt, first read regions.txt fully into memory (I assume it is much smaller). For each region, keep a score and a number.
Process coverage.txt line by line. For each line, determine the chromosome and the region that this particular stretch belongs to. This will require some footwork, but if regions.txt is not too large, it might be more efficient. Add the score to the score of the region and increment number by one.
An alternative, most efficient way requires both files to be sorted first by chromosome, then by position.
1. Take a line from regions.txt. Record the chromosome and positions. If there is a line remaining from the previous loop, go to 3; otherwise go to 2.
2. Take a line from coverage.txt.
3. Check whether it is within the current region.
   yes: add the score to the region, increment number. Move to 2.
   no: divide score by number, write the current region to output, go to 1.
This last method requires some fine tuning, but will be most efficient: it goes through each file only once and needs to keep almost nothing in memory.
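The single-pass procedure for sorted input can be sketched as follows. This is an illustration in Python under stated assumptions, not a tuned implementation: both inputs are assumed to be sorted by chromosome and then by start, regions are assumed not to overlap, chromosome names are assumed to compare in the same order as the files are sorted, and coverage intervals are assumed to have positive length. The function name `mean_scores` is hypothetical.

```python
def mean_scores(regions, coverage):
    """Mean coverage score per region in one pass over both sorted inputs.

    regions:  iterable of (chrom, start, end)
    coverage: iterable of (chrom, start, end, score)
    """
    results = []
    cov_iter = iter(coverage)
    pending = next(cov_iter, None)  # coverage row carried over between regions
    for chrom, start, end in regions:
        total, count = 0.0, 0
        while pending is not None:
            c_chrom, c_start, c_end, score = pending
            # Row starts at or after the end of this region (or on a later
            # chromosome): keep it for the next region.
            if (c_chrom, c_start) >= (chrom, end):
                break
            if c_chrom == chrom and c_start >= start and c_end <= end:
                total += score
                count += 1
            pending = next(cov_iter, None)
        results.append((chrom, start, end, total / count if count else 0.0))
    return results

regions = [(1, 100, 200), (1, 400, 600), (2, 600, 700)]
coverage = [(1, 100, 101, 5), (1, 101, 102, 7), (1, 103, 105, 8),
            (2, 600, 601, 10), (2, 601, 602, 15)]
for row in mean_scores(regions, coverage):
    print("%d %d %d %.1f" % row)
```

For the real 15 GB file, `coverage` would be a generator that parses one line at a time, so that only the current row is held in memory.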
Here's one way using join and awk. Run like:
join regions.txt coverage.txt | awk -f script.awk - regions.txt
Contents of script.awk:
FNR==NR && $4>=$2 && $5<=$3 {
sum[$1 FS $2 FS $3]+=$6
cnt[$1 FS $2 FS $3]++
next
}
{
if ($1 FS $2 FS $3 in sum) {
printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]
}
else if (NF == 3) {
print $0 " 0"
}
}
Results:
1 100 200 6.7
1 400 600 0
2 600 700 12.5
Alternatively, here's the one-liner:
join regions.txt coverage.txt | awk 'FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++; next } { if ($1 FS $2 FS $3 in sum) printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 " 0" }' - regions.txt
Here is a simple MATLAB way to bin your coverage into regions:
% extract the regions extents
bins = regions(:,2:3)';
bins = bins(:);
% extract the coverage - only the start is needed
covs = coverage(:,2);
% use histc to place the coverage start into proper regions
% this line counts how many coverages there are in a region
% and assigns them proper region ids.
[h, i]= histc(covs(:), bins(:));
% sum the scores into correct regions (second output of histc gives this)
total = accumarray(i, coverage(:,4), [numel(bins),1]);
% average the score in regions (first output of histc is useful)
avg = total./h;
% remove every second entry - our regions are defined by start/end
avg = avg(1:2:end);
Now this works assuming that the regions are non-overlapping, but I guess that is the case. Also, every entry in the coverage file has to fall into some region.
Also, it is trivial to 'block' this approach over coverages, if you want to avoid reading in the whole file. You only need the bins, your regions file, which presumably is small. You can process the coverages in blocks, incrementally add to total and compute the average in the end.

Prefix match in MATLAB

Hey guys, I have a very simple problem in MATLAB:
I have some strings which are like this:
Pic001
Pic002
Pic003
004
Not every string starts with the prefix "Pic". So how can I cut off the "Pic" part so that only the numbers at the end remain, giving all my strings the same format?
Greets, poeschlorn
If 'Pic' only ever occurs as a prefix in your strings and nowhere else within the strings then you could use STRREP to remove it like this:
>> x = {'Pic001'; 'Pic002'; 'Pic003'; '004'}
x =
'Pic001'
'Pic002'
'Pic003'
'004'
>> x = strrep(x, 'Pic', '')
x =
'001'
'002'
'003'
'004'
If 'Pic' can occur elsewhere in your strings and you only want to remove it when it occurs as a prefix then use STRNCMP to compare the first three characters of your strings:
>> x = {'Pic001'; 'Pic002'; 'Pic003'; '004'}
x =
'Pic001'
'Pic002'
'Pic003'
'004'
>> for ii = find(strncmp(x, 'Pic', 3))'
x{ii}(1:3) = [];
end
>> x
x =
'001'
'002'
'003'
'004'
strings = {'Pic001'; 'Pic002'; 'Pic003'; '004'};
numbers = regexp(strings, '(PIC)?(\d*)','match');
for cc = 1:length(numbers);
fprintf('%s\n', char(numbers{cc}));
end;
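The difference between removing "Pic" everywhere and removing it only as a prefix, as discussed in the STRREP and STRNCMP answer, can be shown in a few lines. This sketch is in Python for comparison only; `strip_prefix` is a made-up helper name.

```python
def strip_prefix(name, prefix="Pic"):
    """Remove `prefix` only when the string actually starts with it."""
    return name[len(prefix):] if name.startswith(prefix) else name

names = ["Pic001", "Pic002", "Pic003", "004"]
print([strip_prefix(n) for n in names])

# A plain replace would also strip an occurrence in the middle of a string,
# which is the case the prefix check guards against.
print(strip_prefix("MyPic005"), "MyPic005".replace("Pic", ""))
```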