I have the following file
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
I want to do the following task
Check if after the sequence ERG= there is a number or a string.
If it's a string, find the sequence SI1 L and change the value after that, using values that the user inputs.
If it's a number, change the number using values that the user inputs.
Note that if after ERG= there is a number, there will be no SI1 L sequence.
For instance number 2 can be accomplished using the following
#!/bin/bash
vals=(0.02 0.03 0.04 0.05)
for val in "${vals[@]}"; do
    awk -v val="$val" '$1=="SI1"{$3=val}1' 20
done
How can the above algorithm be achieved?
#!/bin/bash
# Join all command-line arguments into one space-separated string.
val="$*"
awk -v val="$val" '
    BEGIN { i = 1; split(val, v, " ") }
    # If it is a string, find the sequence SI1 L and change the value after it, using the values that the user inputs.
    /SDEF POS.*ERG=[a-zA-Z]+/ { flag = "y" }
    /SI1 L/ { if (flag == "y") { $3 = v[i]; i++; flag = "n" } }
    # If it is a number (digits with an optional decimal point), replace it with the next user value.
    /SDEF POS.*ERG=[0-9.]+/ { sub(/ERG=[0-9.]+/, "ERG=" v[i]); i++ }
    1
' file
Hints:
If the rule finds ERG= followed by one or more letters ([a-zA-Z]+), it sets the flag.
The /SI1 L/ rule only fires if the flag is set. When it fires, it unsets the flag again, so that any following /SI1 L/ line does not trigger it a second time.
.* stands for zero or more arbitrary characters
[A-Za-z]+ stands for one or more alphabetic characters in lower or upper case
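awk -v string_value="foo" -v number_value=42 '
    # Work on the raw text so that the "=" signs in the line are preserved.
    match($0, /ERG=[^[:blank:]]+/) {
        value = substr($0, RSTART + 4, RLENGTH - 4)
        # Anything other than digits and a decimal point counts as a string.
        isstring = (value ~ /[^[:digit:].]/)
        if (!isstring)
            $0 = substr($0, 1, RSTART + 3) number_value substr($0, RSTART + RLENGTH)
    }
    /SI1 L/ && isstring { $NF = string_value }
    1
' filename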
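For comparison, here is a minimal Python sketch of the same decision logic. It is an illustration, not a drop-in replacement for the awk answers: the function name set_energy is made up, it assumes the SDEF line appears before the SI1 L line, and it treats only plain decimal values such as 0.020 as numbers (no exponent notation).

```python
import re

# "ERG=" followed by its value (everything up to the next whitespace).
ERG_RE = re.compile(r"ERG=(\S+)")
# Plain decimal number such as 20, 0.020 or .5 (assumption: no exponents).
NUMBER_RE = re.compile(r"\d+(\.\d*)?|\.\d+")


def set_energy(lines, new_value):
    """Return a copy of `lines` with the source energy replaced by `new_value`.

    If ERG= is followed by a number, the number itself is replaced.
    Otherwise the value on the first following "SI1 L" line is replaced.
    """
    out = []
    uses_distribution = False
    for line in lines:
        match = ERG_RE.search(line)
        if match:
            if NUMBER_RE.fullmatch(match.group(1)):
                # Numeric energy: splice the new value into the SDEF line.
                line = line[:match.start(1)] + new_value + line[match.end(1):]
            else:
                # Distribution reference such as d1: remember to patch SI1 L.
                uses_distribution = True
        elif uses_distribution and line.split()[:2] == ["SI1", "L"]:
            line = f"SI1 L {new_value}"
            uses_distribution = False
        out.append(line)
    return out
```

Calling set_energy(lines, "0.05") once per user value, and writing each result to its own file, would mirror the loop over vals in the question.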
I am trying to determine whether all the files/folders in a directory have an ascending numbering pattern at the same place throughout their names.
If the numbers were always at a constant place in every case, this would be super easy:
ls $HOME/dir
1. Some String
2. Some String- Part 4
3. Some String- Part 5
Here I would simply use something like
ls $HOME/dir | sort -V | grep -Eo '^[0-9]'
The command outputs 1 2 3, so it is easy to conclude that the files/folders have an ascending numbering pattern.
Now there are 2 problems here:
It's not necessary that these numbers are always at the start like above
There can sometimes be random numbers in between
==========================================
ls $HOME/dir
Lecture 1 - Some String
Lecture 2 - Some String - Part 4
Lecture 3 - Some String - Part 5
Expected Output - 1 2 3
The main thing is that I need grep to output numbers only if they are present in ascending order at the very same position in the filenames throughout
==========================================
ls $HOME/dir
1. Some String
Some String - Part 2
Some String - Part 3
For something like this, grep shouldn't output anything at all, because even though the names contain ascending numbers, they are not at the same place throughout
==========================================
PS: The 'Some String' part in all my examples would be different for each file/folder. Only the position of the ascending numbers being constant (if any) is to be considered
One more final example
ls $HOME/dir
CB) Lecture 1 xyz
CB) Lecture 2 abc-part 8
CB) Lecture 3 pqr-part 9
Expected Output - 1 2 3
Here is one solution using AWK:
printnumbers.awk
BEGIN {
    numberRegex = "[0-9]+([^A-Za-z0-9]|$)"
}
NR == 1 {
    numberPos = match($0, numberRegex)
}
match($0, numberRegex) == numberPos {
    matchedString = substr($0, RSTART, RLENGTH)
    match(matchedString, "[0-9]+")
    result = result substr(matchedString, RSTART, RLENGTH) "\n"
    next
}
{
    result = ""
    exit 1
}
END {
    printf("%s", result)
}
Then run
$ ls $HOME/dir | sort -V | awk -f printnumbers.awk
Edit 2021-05-14
A second approach is to split each line into fields with non-digits as separators. Then each field is either a number or an empty string. For each line we check the fields to see if a sequence of consecutive numbers starting from one is formed.
Here is the logic:
BEGIN {
    FS = "[^0-9]+"
}
{
    for (i = 1; i <= NF; i++) {
        numbersConsecutive[i] = ($i == NR) && ((NR == 1) || numbersConsecutive[i])
    }
    if (NF > numbersConsecutiveLen) {
        numbersConsecutiveLen = NF
    }
}
END {
    consecutiveNumbersFound = 0
    for (i = 1; i <= numbersConsecutiveLen; i++) {
        if (numbersConsecutive[i]) {
            consecutiveNumbersFound = 1
        }
    }
    if (consecutiveNumbersFound) {
        for (i = 1; i <= NR; i++) {
            print i
        }
    }
}
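The field-splitting idea can also be sketched in Python, which may make the bookkeeping easier to follow. This is only an illustration under a few assumptions: the names are already in version-sorted order (as after sort -V), "same position" means the same field index after splitting on non-digits, and the function names are made up. Unlike the awk version, a field index is dropped as soon as one line lacks it.

```python
import re


def ascending_positions(names):
    """Return the field indexes whose number equals the 1-based line number on every line."""
    candidates = None
    for line_no, name in enumerate(names, start=1):
        # Splitting on runs of non-digits leaves only numbers and empty strings.
        fields = re.split(r"[^0-9]+", name)
        matching = {i for i, field in enumerate(fields)
                    if field and int(field) == line_no}
        # A position survives only if it matched on all lines seen so far.
        candidates = matching if candidates is None else candidates & matching
    return candidates or set()


def numbering(names):
    """Return [1, 2, ..., n] if some position counts up from 1, else an empty list."""
    return list(range(1, len(names) + 1)) if ascending_positions(names) else []
```

For the "Lecture 1 / Lecture 2 / Lecture 3" listing this yields the numbers 1 to 3, while the listing that mixes "1. Some String" with "Part 2" and "Part 3" yields an empty list.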
I'm trying to write code that counts how many upper and lower case characters are in a string. Here's my code.
I've tried converting the result to a string, but it's not working.
def up_low(string):
    result1 = 0
    result2 = 0
    for x in string:
        if x == x.upper():
            result1 + 1
        elif x == x.lower():
            result2 + 1
    print('You have ' + str(result1) + ' upper characters and ' +
          str(result2) + ' lower characters!')

up_low('Hello Mr. Rogers, how are you this fine Tuesday?')
I expect my outcome to calculate the upper and lower characters. Right now I'm getting "You have 0 upper characters and 0 lower characters!".
It's not adding up to result1 and result2.
It seems your error is in the assignment: you are missing an '=' symbol (e.g. result1 += 1)
for x in string:
    if x == x.upper():
        result1 += 1
    elif x == x.lower():
        result2 += 1
The problem is in the lines result1 + 1 and result2 + 1. Each of these is an expression, not an assignment. In other words, the incremented value is computed, but it is never stored anywhere.
The solution is to work the assignment operator = into there somewhere.
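A minimal corrected sketch, assuming the intent is to count letters only. Besides using +=, it swaps in str.isupper() and str.islower(), because spaces and punctuation also satisfy x == x.upper() and would otherwise be counted as upper case. Returning the two counts is an addition made here for illustration.

```python
def up_low(string):
    """Count upper- and lower-case letters, print a summary, and return both counts."""
    upper = 0
    lower = 0
    for ch in string:
        if ch.isupper():
            upper += 1      # augmented assignment stores the new value
        elif ch.islower():
            lower += 1
    print(f"You have {upper} upper characters and {lower} lower characters!")
    return upper, lower


up_low('Hello Mr. Rogers, how are you this fine Tuesday?')
```

For the sample sentence this reports 4 upper and 33 lower characters.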
I have a bunch of numbers in a text file as follows (example):
r0 = 204
r1 = 205
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 205
r1 = 206
I would like to divide any line with r0 = by 100 so that the line will then read
r0 = 20.4
I would like to do this for all lines with r0 and also for r1. Is there a way to do this in perl?
This is my attempt, but it doesn't work, mainly because I've never used perl before, which is why I'm asking such a simple question
#!/usr/bin/perl
$string= r0\s+=\s+\\(d+)
$num= $1/100
$num2= r0\s+=\s+\\$num
s/$string/$num2;
A one-liner I could run from bash would be much better, though. I know it'll involve the s/find/replace function, but I'm not sure how to specify the integer part
perl -pi -e 's#^(r[01]\s*=\s*)(\d+)$#$1.$2/100#e' filename
The options mean:
-p = Run the code in a loop that prints the modified input
-e = Execute the code in the first argument
-i = Replace the input file(s) with the output
The regular expression bits mean:
^ = beginning of line
r[01] = r0 or r1
\s*=\s* = any amount of whitespace, an =, and any amount of whitespace
\d+ = digits
$ = end of line
The replacement uses the e modifier, which means that it should be executed as a Perl expression. $1 and $2 are the contents of the two capture groups: $1 is everything before the number, $2 is the number. $2/100 divides the number by 100, and . concatenates the two pieces together.
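The same "match, compute, substitute" idea can be sketched in Python, where a function passed to re.sub plays the role of Perl's e modifier. This is only an illustration: the helper name scale_line and the divisor parameter are made up here, and the %g-style formatting is a choice made for the sketch.

```python
import re

# Group 1: everything before the number; group 2: the integer itself.
LINE_RE = re.compile(r"^(r[01]\s*=\s*)(\d+)$")


def scale_line(line, divisor=100):
    """Divide the integer on an 'r0 = N' or 'r1 = N' line by `divisor`."""
    def replace(match):
        value = int(match.group(2)) / divisor
        # Keep the prefix unchanged and append the scaled number.
        return f"{match.group(1)}{value:g}"
    return LINE_RE.sub(replace, line)
```

Lines that do not match the pattern, such as max_gap = 20u, are returned unchanged.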
As a one-liner:
perl -pi -e 's{^r[01]\s*=\s*\K(\d+)$}{$1/10}e' filename.txt
Here is an awk solution:
awk '/^r[01]/ {$3/=100} 1' file
r0 = 2.04
r1 = 2.05
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 2.05
r1 = 2.06
I am trying to complete the following task:
Create a script that will repeatedly create a random integer K in the range of 0 to 20 until every case has been entered at least once. You have 3 possible cases.
Case A is entered when the range of the random integer K is between or equal to 0 and 7.
Case B is entered when the range of the random integer K is between or equal to 8 and 14.
Case C is entered when the range of the random integer K is between or equal to 15 and 20.
Rules:
When a case is entered you must print to the user “Congratulations you entered Case (A, B, or C)”.
You can only enter each case once. If the program attempts to enter the same case more than once, you must print. “Invalid, that case has already been entered”.
The program will end once all the cases have been entered and the program will print “Good job, you have entered all cases”.
If the program attempts to enter any already entered cases more than 3 times (3 total times not just for one specific case), the program will end and print to the user “That random generator wasn’t random enough”.
Here is the code I have so far. It has taken me a couple of hours to debug. Am I approaching this the wrong way? Please let me know.
K = round(rand*(20))
flag = 0;
counterA =0;
counterB=0;
counterC=0;
switch K
case {0,1,2,3,4,5,6,7}
fprintf('Congratulations you entered Case A\n')
flag = 1;
counterA = 1
case {8,9,10,11,12,13,14}
fprintf('Congratulations you entered Case B\n')
flag =2;
counterB = 1
case {15,16,17,18,19,20}
fprintf ('Congratulations you entered Case C\n')
flag = 3;
counterC = 1
end
while flag == 1 || flag == 2 || flag ==3
K = round(rand*(20))
if K >=0 && K<=7 && flag==1
disp ('Invalid, that case has already been entered')
counterA = counterA+1
elseif K >=8 && K<=14 && flag ==2
disp ('Invalid, that case has already been entered')
counterB=counterB+1
elseif K >=15 && K<=20 && flag==3
disp ('Invalid, that case has already been entered')
counterC =counterC+1
elseif K >=0 && K<=7 && flag ~=1
counterA =counterA+1
flag == 1;
if counterA==1&&counterB~=2 ||counterA==1&&counterC~=2
fprintf('COngrats guacamole A\n')
end
elseif K >=8 && K<=14 && flag ~=2
counterB=counterB+1
flag == 2;
if counterB ==1&&counterA~=2||counterB==1&&counterC~=2
fprintf('COngratsavacado B\n')
end
elseif K >=15 && K<=20 && flag~=3
counterC=counterC+1
flag == 3;
if counterC==1&&counterA~=2||counterC==1&&counterB~=2
fprintf ('Congratscilantro C\n')
end
end
if counterA==1 && counterB==1 && counterC==1
flag=100;
disp('DONE')
elseif counterA == 3|| counterB==3 || counterC==3
disp ('That random generator wasnt random enough')
flag =99;
elseif counterA==2||counterB==2||counterC==2
disp('Inval')
end
Some words about your code:
Don't use variable names like counterA, counterB, counterC; use an array with 3 elements instead. In this case you only need a total limit, so one variable is enough.
rand*20 generates random values between 0 and 20, but round(rand*20) gives 0 and 20 only half the probability of the other integers. Use randi if you need integers.
Use "Smart Indent" to format your code cleanly; it makes it easier to read.
This is not a full solution; the part with the 3 errors is missing. I think you will get this on your own.
caseNames={'A','B','C'};
caseEntered=[false,false,false];
% continue while at least one case has not been entered (the limit check is left out)
while ~all(caseEntered)
    K = randi([0,20]);
    switch K
        case {0,1,2,3,4,5,6,7}
            cs=1;
        case {8,9,10,11,12,13,14}
            cs=2;
        case {15,16,17,18,19,20}
            cs=3;
    end
    if caseEntered(cs)
        % case has previously been entered, TBD
    else
        % case is entered for the first time
        fprintf('Congratulations you entered Case %s\n',caseNames{cs});
        caseEntered(cs)=true;
    end
end
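One possible way to add the missing limit, sketched in Python rather than MATLAB. The names case_of and play are hypothetical, the draw parameter exists only so the sketch can be exercised with a fixed sequence, and reading "more than 3 times" as "stop on the fourth repeat" is an assumption about the task wording.

```python
import random


def case_of(k):
    """Map an integer K in 0..20 to its case letter."""
    if k <= 7:
        return "A"
    if k <= 14:
        return "B"
    return "C"


def play(draw=lambda: random.randint(0, 20), max_repeats=3):
    """Draw values until all cases are entered or the repeat limit is exceeded.

    Returns the list of messages that would be printed to the user.
    """
    entered = set()
    repeats = 0          # one shared counter for all cases
    messages = []
    while len(entered) < 3:
        case = case_of(draw())
        if case in entered:
            repeats += 1
            if repeats > max_repeats:
                messages.append("That random generator wasn't random enough")
                return messages
            messages.append("Invalid, that case has already been entered")
        else:
            entered.add(case)
            messages.append(f"Congratulations you entered Case {case}")
    messages.append("Good job, you have entered all cases")
    return messages


if __name__ == "__main__":
    print("\n".join(play()))
```

Like randi([0,20]), random.randint(0, 20) includes both end points, so every integer from 0 to 20 is equally likely.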
I have two files:
regions.txt: First column is the chromosome name, second and third are start and end position.
1 100 200
1 400 600
2 600 700
coverage.txt: First column is chromosome name, again second and third are start and end positions, and last column is the score.
1 100 101 5
1 101 102 7
1 103 105 8
2 600 601 10
2 601 602 15
This file is huge: it is about 15GB with about 300 million lines.
I basically want to get the mean of all scores in coverage.txt that are in each region in regions.txt.
In other words: start at the first line in regions.txt; if there is a line in coverage.txt which has the same chromosome, whose start-coverage is >= start-region and whose end-coverage is <= end-region, then save its score to a new array. After searching all of coverage.txt, print the region chromosome, start, end, and the mean of all scores that have been found.
Expected output:
1 100 200 14.6 which is (5+7+8)/3
1 400 600 0 no match at coverages.txt
2 600 700 12.5 which is (10+15)/2
I built the following MATLAB script, which takes a very long time since I have to loop over coverage.txt many times. I don't know how to write a fast, equivalent awk script.
My matlab script
fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');
while feof(ft) == 0
linet = fgetl(ft);
scant = textscan(linet, '%d%d%d');
tchr = scant{1};
tx = scant{2};
ty = scant{3};
coverages = [];
frewind(fc);
while feof(fc) == 0
linec = fgetl(fc);
scanc = textscan(linec, '%d%d%d%d');
cchr = scanc{1};
cx = scanc{2};
cy = scanc{3};
cov = scanc{4};
if (cchr == tchr) && (cx >= tx) && (cy <= ty)
coverages = cat(2, coverages, cov);
end
end
covmed = median(coverages);
fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed);
end
Any suggestions for an alternative using AWK, Perl, etc. are welcome. I will also be pleased if someone can teach me how to get rid of all the loops in my MATLAB script.
Thanks
Here is a Perl solution. I use hashes (aka dictionaries) to access the various ranges via the chromosome, thus reducing the number of loop iterations.
This is potentially efficient, as I don't do a full loop over regions.txt on every input line. Efficiency could perhaps be increased further when multithreading is used.
#!/usr/bin/perl
my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile or die "Can't open $rangefile";
# construct the ranges. The chromosome is used as range key.
# each range is stored as [start, end, count, sum of scores]
my %ranges;
while (<$rFH>) {
    chomp;
    my @field = split /\s+/;
    push @{$ranges{$field[0]}}, [@field[1,2], 0, 0];
}
close $rFH;
# iterate over all the input
while (my $line = <STDIN>) {
    chomp $line;
    my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
    # only loop over ranges with matching chromosome
    foreach my $range (@{$ranges{$chrom}}) {
        if ($$range[0] <= $lower and $upper <= $$range[1]) {
            $$range[2]++;
            $$range[3] += $value;
            last; # break out of foreach early because ranges don't overlap
        }
    }
}
# create the report
foreach my $chrom (sort {$a <=> $b} keys %ranges) {
    foreach my $range (@{$ranges{$chrom}}) {
        my $value = $$range[2] ? $$range[3]/$$range[2] : 0;
        printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value;
    }
}
Example invocation:
$ perl script.pl regions.txt <coverage.txt >output.txt
Output on the example input:
1 100 200 6.7
1 400 600 0.0
2 600 700 12.5
(because (5+7+8)/3 = 6.66…, not the 14.6 given in the question)
Normally, I would load the files into R and calculate it, but given that one of them is so huge, this would become a problem. Here are some thoughts that might help you solve it.
Consider splitting coverage.txt by chromosomes. This would make the calculations less demanding.
Instead of looping over coverage.txt, you first read the regions.txt full into memory (I assume it is much smaller). For each region, you keep a score and a number.
Process coverage.txt line by line. For each line, you determine the chromosome and the region that this particular stretch belongs to. This will require some footwork, but if regions.txt is not too large, it might be more efficient. Add the score to the score of the region and increment number by one.
An alternative, most efficient way requires both files to be sorted first by chromosome, then by position.
1. Take a line from regions.txt. Record the chromosome and positions. If there is a line remaining from the previous loop, go to 3; otherwise go to 2.
2. Take a line from coverage.txt.
3. Check whether it is within the current region.
   yes: add the score to the region, increment number. Move to 2.
   no: divide score by number, write the current region to output, go to 1.
This last method requires some fine tuning, but will be most efficient: it goes through each file only once and needs to keep almost nothing in memory.
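The single-pass idea can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a tuned implementation: both inputs are already parsed into tuples and sorted by chromosome and then start position, regions do not overlap, chromosome names compare in the same order as the sort, and the function name is made up.

```python
def mean_scores_sorted(regions, coverage):
    """Merge two sorted streams and return (chrom, start, end, mean) per region.

    regions:  iterable of (chrom, start, end)
    coverage: iterable of (chrom, start, end, score)
    """
    cov_iter = iter(coverage)
    pending = next(cov_iter, None)   # coverage row carried over between regions
    results = []
    for chrom, start, end in regions:
        total, count = 0.0, 0
        while pending is not None:
            c_chrom, c_start, c_end, score = pending
            if (c_chrom, c_start) > (chrom, end):
                break                # belongs to a later region, keep it pending
            if c_chrom == chrom and c_start >= start and c_end <= end:
                total += score
                count += 1
            # Rows before the region or only partly inside it are skipped.
            pending = next(cov_iter, None)
        results.append((chrom, start, end, total / count if count else 0.0))
    return results
```

Because coverage is consumed through an iterator, it can come straight from a file object that is parsed line by line, so the 15GB file never has to be held in memory.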
Here's one way using join and awk (join expects both files to be sorted on the first column). Run like:
join regions.txt coverage.txt | awk -f script.awk - regions.txt
Contents of script.awk:
FNR==NR && $4>=$2 && $5<=$3 {
    sum[$1 FS $2 FS $3]+=$6
    cnt[$1 FS $2 FS $3]++
    next
}
{
    if ($1 FS $2 FS $3 in sum) {
        printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]
    }
    else if (NF == 3) {
        print $0 " 0"
    }
}
Results:
1 100 200 6.7
1 400 600 0
2 600 700 12.5
Alternatively, here's the one-liner:
join regions.txt coverage.txt | awk 'FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++; next } { if ($1 FS $2 FS $3 in sum) printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 " 0" }' - regions.txt
Here is a simple MATLAB way to bin your coverage into regions:
% extract the regions extents
bins = regions(:,2:3)';
bins = bins(:);
% extract the coverage - only the start is needed
covs = coverage(:,2);
% use histc to place the coverage start into proper regions
% this line counts how many coverages there are in a region
% and assigns them proper region ids.
[h, i]= histc(covs(:), bins(:));
% sum the scores into correct regions (second output of histc gives this)
total = accumarray(i, coverage(:,4), [numel(bins),1]);
% average the score in regions (first output of histc is useful)
avg = total./h;
% remove every second entry - our regions are defined by start/end
avg = avg(1:2:end);
Now this works assuming that the regions are non-overlapping, but I guess that is the case. Also, every entry in the coverage file has to fall into some region.
Also, it is trivial to 'block' this approach over coverages, if you want to avoid reading in the whole file. You only need the bins, your regions file, which presumably is small. You can process the coverages in blocks, incrementally add to total and compute the average in the end.
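The block-wise variant can be sketched in Python with a binary search standing in for histc. This is an illustration under assumptions: regions are non-overlapping, rows are already parsed into tuples, and the function names are invented for the sketch. Unlike the MATLAB snippet, it also keys regions by chromosome and checks the end position.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end) regions by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end in regions:
        index[chrom].append((start, end))
    for spans in index.values():
        spans.sort()
    return index


def add_block(index, totals, counts, block):
    """Accumulate one block of (chrom, start, end, score) rows."""
    for chrom, c_start, c_end, score in block:
        spans = index.get(chrom, [])
        # Rightmost region whose start is <= the coverage start.
        pos = bisect_right(spans, (c_start, float("inf"))) - 1
        if pos >= 0 and c_end <= spans[pos][1]:
            key = (chrom, *spans[pos])
            totals[key] += score
            counts[key] += 1


def region_means(regions, totals, counts):
    """Return (chrom, start, end, mean) per region; 0.0 when nothing matched."""
    out = []
    for key in regions:
        n = counts.get(key, 0)
        out.append((*key, totals[key] / n if n else 0.0))
    return out
```

A caller would create totals = defaultdict(float) and counts = defaultdict(int) once, call add_block for each chunk read from coverage.txt, and call region_means at the end.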