Using AWK to select records from file1 with obs. # in file2

I have these two simple test files:
This is fil1, containing the records I want to select from:
01
02
07
05
10
20
30
25
This is keepNR, containing the record numbers I want to extract from fil1
1
4
7
What I want are these records from fil1
01 (observation/record # 1)
05 (observation/record # 4)
30 (observation/record # 7)
I am a novice at AWK, but I have tried these programs:
Using this, I can see that the observations are there:
awk 'FNR==NR {a[$1]; next } { for (elem in a) { print "elem=",elem,"FNR=",FNR,"$1=",$1 }} ' keepNR fil1
I had hoped this would work, but I get more than the 2 records:
awk 'FNR==NR {a[$1]; next } { for (elem in a) { if (FNR ~ a[elem]) print elem,FNR,$1; next }} END{ for (elem in a) { print "END:", elem }}' keepNR fil1
1 1 01
1 2 02
1 3 07
1 4 05
1 5 10
1 6 20
1 7 30
1 8 25
I first tried using == instead of ~, but then I got no result, as you can see here:
gg@gg:~/bin/awktest$ awk 'FNR==NR {a[$1]; next } { for (elem in a) { if (FNR == a[elem]) print elem,FNR,$1; next }} ' keepNR fil1
gg@gg:~/bin/awktest$
I have also tried (FNR - a[elem])==0 with no output
So I have 2 questions:
Why does if (FNR ~ a[elem]) work, but if (FNR == a[elem]) does not?
Why do I get 8 lines of output, instead of 2?
If you have a better solution, I would love to see it
Kind Regards :)

You never assign a value to a[$1], so its value is the empty string: FNR==NR { a[$1]; next } only creates the key.
for (elem in a) sets elem to the keys of a.
if (FNR == a[elem]) compares against the value in a[elem]. The value is empty, so there is no match.
if (FNR ~ a[elem]) tests whether FNR matches the empty string used as a regex (//), which matches anything, so it is always true.
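For illustration only (this variant is my own sketch, not part of the original question): if you store the line number as the value, the == comparison has something non-empty to compare against and works as expected:

awk 'FNR==NR { a[$1] = $1; next } { for (elem in a) if (FNR == a[elem]) print elem, FNR, $1 }' keepNR fil1

Here a[elem] holds the wanted record number itself, so FNR == a[elem] is true exactly for records 1, 4 and 7.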
A simpler method is to test if FNR is a key of a:
awk '
FNR==NR { a[$1]; next }
FNR in a { print "FNR="FNR, "$1="$1 }
' keepNR fil1
which should output:
FNR=1 $1=01
FNR=4 $1=05
FNR=7 $1=30

How determine if files/folders have a ascending order numbering pattern?

I am trying to determine if all the files/folders present in a directory have an ascending numbering pattern at the same place throughout their names.
If the numbers were always present at a constant place in every name, this would be super easy:
ls $HOME/dir
1. Some String
2. Some String- Part 4
3. Some String- Part 5
Here I would simply use something like
ls $HOME/dir | sort -V | grep -Eo '^[0-9]'
The command will output 1 2 3, and it is easy to conclude that the files/folders have an ascending numbering pattern.
Now there are 2 problems here:
It's not necessary that these numbers are always at the start like above
There could sometimes be random numbers in between
==========================================
ls $HOME/dir
Lecture 1 - Some String
Lecture 2 - Some String - Part 4
Lecture 3 - Some String - Part 5
Expected Output - 1 2 3
The main thing is that I need grep to output numbers only if they are present in ascending order at the very same position in the filenames throughout
==========================================
ls $HOME/dir
1. Some String
Some String - Part 2
Some String - Part 3
For something like this, grep shouldn't output anything at all, because even though the names contain ascending numbers, they are not at the same place throughout
==========================================
PS: The 'Some String' part in all my examples would be different for each file/folder. Only the position of the ascending numbers (if any) being constant is to be considered.
One more final example:
ls $HOME/dir
CB) Lecture 1 xyz
CB) Lecture 2 abc-part 8
CB) Lecture 3 pqr-part 9
Expected Output - 1 2 3
Here is one solution using AWK:
printnumbers.awk
BEGIN {
    # a run of digits followed by a non-alphanumeric character or end of line
    numberRegex = "[0-9]+([^A-Za-z0-9]|$)"
}
NR == 1 {
    # remember where the number sits in the first name
    numberPos = match($0, numberRegex)
}
match($0, numberRegex) == numberPos {
    # number at the same position: extract it and collect it
    matchedString = substr($0, RSTART, RLENGTH)
    match(matchedString, "[0-9]+")
    result = result substr(matchedString, RSTART, RLENGTH) "\n"
    next
}
{
    # position mismatch: discard everything and signal failure
    result = ""
    exit 1
}
END {
    printf("%s", result)
}
Then run
$ ls $HOME/dir | sort -V | awk -f printnumbers.awk
Edit 2021-05-14
A second approach is to split each line into fields with non-digits as separators, so that every field is a number (apart from a possible empty field at either end). For each field position we then check, line by line, whether that position holds the line number, i.e. whether it forms the consecutive sequence 1, 2, 3, … across all lines.
Here is the logic:
BEGIN {
    # split on runs of non-digits so every field is a number
    FS = "[^0-9]+"
}
{
    # a field position stays "consecutive" only if it equals the
    # line number on every line seen so far
    for (i = 1; i <= NF; i++) {
        numbersConsecutive[i] = ($i == NR) && ((NR == 1) || numbersConsecutive[i])
    }
    if (NF > numbersConsecutiveLen) {
        numbersConsecutiveLen = NF
    }
}
END {
    # if any field position carried 1, 2, 3, ... print that sequence
    consecutiveNumbersFound = 0
    for (i = 1; i <= numbersConsecutiveLen; i++) {
        if (numbersConsecutive[i]) {
            consecutiveNumbersFound = 1
        }
    }
    if (consecutiveNumbersFound) {
        for (i = 1; i <= NR; i++) {
            print i
        }
    }
}
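Assuming the second program is saved as, say, consecutive.awk (the file name is my choice), it is run the same way as the first:

$ ls $HOME/dir | sort -V | awk -f consecutive.awk

It prints the numbers 1 2 3 … (one per line) when some field position carries the consecutive numbers, and nothing otherwise.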

Find last Friday’s Date in Perl 6?

I want to generate a sequence that ends on the previous week's Friday: if today is Monday through Thursday, that is last Friday; if today is Saturday or Sunday, I still want the Friday of the previous week, not the one that just passed. That is, assuming that today is 2018-05-09, then last Friday is 2018-05-04.
If today is 2018-05-12, then last Friday is also 2018-05-04. So I write:
(Date.today, *.earlier(:1day) ... ( *.day-of-week==5 && *.week[1]+1==Date.today.week[1] )).tail # Output: 2018-05-06
But the result is 2018-05-06 instead of 2018-05-04.
Then I used a Junction:
(Date.today, *.earlier(:1day) ... all( *.day-of-week==5, *.week[1]+1==Date.today.week[1] )).tail # Output: 2018-05-04
Why is && wrong in the first situation? The documentation of the ... operator says:
The right-hand side will have an endpoint, which can be Inf or * for "infinite" lists (whose elements are only produced on demand), an expression which will end the sequence when True, or other elements such as Junctions.
What's wrong with the && operator?
The problem is your ending condition
*.day-of-week==5 && *.week[1]+1==Date.today.week[1]
That is two WhateverCode lambdas that each take 1 argument.
*.day-of-week==5
*.week[1]+1==Date.today.week[1]
Since a code object is a truthy value, the && operator discards the first lambda and returns the second one, so only the week-number test is ever used. Stepping backwards, that test first succeeds on the Sunday of the previous week, which is where the sequence stops.
Even if the code were a single lambda it wouldn't work like you expect, as it would then be a lambda that takes two arguments (one per *).
The right way to do this check is to use some sort of block.
{.day-of-week==5 && .week-number+1 == Date.today.week-number}
It might be a good idea to wrap it in a subroutine so that you can test it.
sub last-friday ( Date:D $date ) {
# cache it so that it doesn't have to be looked up on each iteration
my $week-number = $date.week-number - 1;
(
$date,
*.earlier( :1day )
...
{
.day-of-week == 5
&&
.week-number == $week-number
}
).tail
}
say last-friday Date.new: :year(2018), :month( 5), :day( 9); # 2018-05-04
say last-friday Date.new: :year(2018), :month( 5), :day(12); # 2018-05-04
say Date.today.&last-friday; # 2018-05-04
You could also just calculate the proper date.
sub last-friday ( Date:D $date ) {
$date.earlier:
days => (
$date.day-of-week # reset to the previous Sunday
+ 2 # go two days earlier to get Friday
)
}
say last-friday Date.new: :year(2018), :month( 5), :day( 9); # 2018-05-04
say last-friday Date.new: :year(2018), :month( 5), :day(12); # 2018-05-04
say Date.today.&last-friday; # 2018-05-04
This sequence will go from today to last Friday:
say Date.today, *.earlier(:1day) ... *.day-of-week==5
# OUTPUT: «(2018-05-09 2018-05-08 2018-05-07 2018-05-06 2018-05-05 2018-05-04)␤»
Also
say Date.new("2018-05-11"), *.earlier(:1day) ... *.day-of-week==5;
# OUTPUT: «(2018-05-11)␤»
The last argument of the ... operator is the sequence terminator, or a condition it must meet. If the sequence must end on a Friday, the simplest condition is the one above.

Bash script: how to change the date+hours format and add missing days with a copy of the line above

I have csv file with data:
"smth","txt","33","01-06-2015 00:00"
"smth","txt","33","02-06-2015 09:06"
"smth","txt","34","03-06-2015 09:54"
"smth","txt","34","04-06-2015 00:09"
"smth","txt","33","05-06-2015 00:09"
"smth","txt","32","07-06-2015 00:09"
"smth","txt","30","08-06-2015 10:26"
"smth","txt","31","09-06-2015 12:09"
"smth","txt","30","10-06-2015 13:17"
It should have 30 lines for the 30 days of June. 06-06-2015 is missing, and so are 11-06-2015 through 30-06-2015. I need to add a line after 05-06-2015 with that line's data but dated 06-06-2015, and fill in the missing days 11-30 June with the same data as 10-06-2015.
The output CSV file format should look like this:
smth#txt#33#2015-06-01
The field with the number 33 is random, so it will not always be 33.
Update 22-06-2015:
Some of my CSV files have data like:
"smth","txt","33","01-06-2015 00:00"
"smth","txt","33","02-06-2015 09:06"
"smth","txt","34","03-06-2015 09:54"
"smth","txt","34","04-06-2015 00:09"
"smth","txt","33","05-06-2015 00:09"
"smth","txt","32","07-06-2015 00:09"
"smth","txt","30","08-06-2015 10:26"
"smth","txt","31","09-06-2015 12:09"
"smth","txt","30","10-06-2015 13:17"
"smth2","txt","33","01-06-2015 00:00"
"smth2","txt","33","02-06-2015 09:06"
"smth2","txt","34","03-06-2015 09:54"
"smth2","txt","34","04-06-2015 00:09"
"smth2","txt","33","05-06-2015 00:09"
"smth2","txt","32","07-06-2015 00:09"
"smth2","txt","30","08-06-2015 10:26"
"smth2","txt","31","09-06-2015 12:09"
"smth2","txt","30","10-06-2015 13:17"
So the result should be: 01-30 06-2015 of "smth" and then 01-30 06-2015 of "smth2".
Below is an example (don't look at the numbers in column 3; it should work the way you made it):
smth#txt#33#2015-06-01
smth#txt#33#2015-06-02
smth#txt#33#2015-06-03
smth#txt#33#2015-06-04
smth#txt#33#2015-06-05
smth#txt#33#2015-06-06
smth#txt#33#2015-06-07
smth#txt#33#2015-06-08
smth#txt#33#2015-06-09
smth#txt#33#2015-06-10
smth#txt#33#2015-06-11
smth#txt#33#2015-06-12
smth#txt#33#2015-06-13
smth#txt#33#2015-06-14
smth#txt#33#2015-06-15
smth#txt#33#2015-06-16
smth#txt#33#2015-06-17
smth#txt#33#2015-06-18
smth#txt#33#2015-06-19
smth#txt#33#2015-06-20
smth#txt#33#2015-06-21
smth#txt#33#2015-06-22
smth#txt#33#2015-06-23
smth#txt#33#2015-06-24
smth#txt#33#2015-06-25
smth#txt#33#2015-06-26
smth#txt#33#2015-06-27
smth#txt#33#2015-06-28
smth#txt#33#2015-06-29
smth#txt#33#2015-06-30
smth2#txt#33#2015-06-01
smth2#txt#33#2015-06-02
smth2#txt#33#2015-06-03
smth2#txt#33#2015-06-04
smth2#txt#33#2015-06-05
smth2#txt#33#2015-06-06
smth2#txt#33#2015-06-07
smth2#txt#33#2015-06-08
smth2#txt#33#2015-06-09
smth2#txt#33#2015-06-10
smth2#txt#33#2015-06-11
smth2#txt#33#2015-06-12
smth2#txt#33#2015-06-13
smth2#txt#33#2015-06-14
smth2#txt#33#2015-06-15
smth2#txt#33#2015-06-16
smth2#txt#33#2015-06-17
smth2#txt#33#2015-06-18
smth2#txt#33#2015-06-19
smth2#txt#33#2015-06-20
smth2#txt#33#2015-06-21
smth2#txt#33#2015-06-22
smth2#txt#33#2015-06-23
smth2#txt#33#2015-06-24
smth2#txt#33#2015-06-25
smth2#txt#33#2015-06-26
smth2#txt#33#2015-06-27
smth2#txt#33#2015-06-28
smth2#txt#33#2015-06-29
smth2#txt#33#2015-06-30
Please help me with that; show me the path to creating a bash script that makes my life easier :)
Bash solution - too complicated for my taste, I'd reach for a more powerful language like Perl.
#!/bin/bash
remove_doublequotes () {
line=("${line[#]#\"}")
line=("${line[#]%\"}")
}
fix_timestamp () {
line[3]=${line[3]:6:4}-${line[3]:3:2}-${line[3]:0:2}
}
read_next=0
printed=0
# Extract the date from the first line to get the number of days in the month.
IFS=, read -a line
year=${line[3]:7:4}
month=${line[3]:4:2}
day=${line[3]:1:2}
if [[ $day != 01 ]] ; then
echo "First day missing." >&2
exit 1
fi
cal=$(echo $(cal "$month" "$year"))
last_day=${cal##* }
remove_doublequotes
fix_timestamp
for day in $(seq 1 $last_day) ; do
day=$(printf %02d $day)
if (( read_next )) ; then
if IFS=, read -a line ; then
remove_doublequotes
fix_timestamp
printed=0
else # Fill in the missing day at the month end.
line=("${last_line[#]}")
fi
fi
if [[ ${line[3]} == *"-$day" ]] ; then # Current line should be printed.
(IFS=#; echo "${line[*]}")
read_next=1
last_line=("${line[#]}")
printed=1
else # Fake the report.
insert=("${last_line[#]}")
insert[3]=${insert[3]:0:8}$day
(IFS=#; echo "${insert[*]}")
read_next=0 # We still have to print the line later.
fi
done
if (( ! printed )) ; then # Input contains extra lines.
echo "Line '${line[#]}' not processed" >&2
exit 1
fi
Here's a ruby solution. It does not matter if the first record of your data is the first of the month.
require 'date'
require 'csv'
# store the data in a hash, keyed by date
new = {}
data = CSV.parse(File.read(ARGV.shift))
data.each do |row|
d = DateTime.parse(row[-1])
new[d.to_date] = row
end
# fill in all the missing dates for this month
row = data[0]
d = DateTime.parse(row[-1])
date = Date.new(d.year, d.month, 1)
while date.month == d.month
if new.has_key?(date)
row = new[date]
else
new[date] = row[0..-2] + [date.strftime("%d-%m-%Y %H:%M")]
end
date += 1
end
# print the CSV
new.keys.sort.each do |key|
puts CSV.generate_line(new[key], :force_quotes=>true)
end
Run it like: ruby program.rb file.csv
outputs
"smth","txt","33","01-06-2015 00:00"
"smth","txt","33","02-06-2015 09:06"
"smth","txt","34","03-06-2015 09:54"
"smth","txt","34","04-06-2015 00:09"
"smth","txt","33","05-06-2015 00:09"
"smth","txt","33","06-06-2015 00:00"
"smth","txt","32","07-06-2015 00:09"
"smth","txt","30","08-06-2015 10:26"
"smth","txt","31","09-06-2015 12:09"
"smth","txt","30","10-06-2015 13:17"
"smth","txt","30","11-06-2015 00:00"
"smth","txt","30","12-06-2015 00:00"
"smth","txt","30","13-06-2015 00:00"
"smth","txt","30","14-06-2015 00:00"
"smth","txt","30","15-06-2015 00:00"
"smth","txt","30","16-06-2015 00:00"
"smth","txt","30","17-06-2015 00:00"
"smth","txt","30","18-06-2015 00:00"
"smth","txt","30","19-06-2015 00:00"
"smth","txt","30","20-06-2015 00:00"
"smth","txt","30","21-06-2015 00:00"
"smth","txt","30","22-06-2015 00:00"
"smth","txt","30","23-06-2015 00:00"
"smth","txt","30","24-06-2015 00:00"
"smth","txt","30","25-06-2015 00:00"
"smth","txt","30","26-06-2015 00:00"
"smth","txt","30","27-06-2015 00:00"
"smth","txt","30","28-06-2015 00:00"
"smth","txt","30","29-06-2015 00:00"
"smth","txt","30","30-06-2015 00:00"
A GNU awk version.
BEGIN {FS = OFS = ","}
{
datetime = gensub(/^"|"$/, "", "g", $NF)
split(datetime, a, /[- :]/)
day = mktime( a[3] " " a[2] " " a[1] " 0 0 0" )
data[day] = $0
}
NR == 1 {
month = strftime("%m", day)
year = strftime("%Y", day)
row = $0
}
END {
mday = 1
while ( (day = mktime(year " " month " " mday++ " 0 0 0"))
&& strftime("%m", day) == month
) {
if (day in data) {
$0 = row = data[day]
}
else {
$0 = row
$NF = strftime("\"%d-%m-%Y %H:%M\"", day)
}
print
}
}
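Note that gensub(), mktime() and strftime() are GNU extensions, so this needs gawk. Assuming the program is saved as fill.awk (the name is my choice), run it as:

$ gawk -f fill.awk file.csv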

If statement inside awk to change a value

I have the following file
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
I want to do the following tasks:
1. Check if after the sequence ERG= there is a number or a string.
2. If it's a string, find the sequence SI1 L and change the value after that, using values that the user inputs.
3. If it's a number, change the number using values that the user inputs.
Note that if after ERG= there is a number, there will be no SI1 L sequence.
For instance, task number 2 can be accomplished using the following:
#! /bin/bash
vals=(0.02 0.03 0.04 0.05)
for val in "${vals[#]}"; do
awk -vval="$val" '$1=="SI1"{$3=val}1' 20
done
How can the above algorithm be achieved?
#!/bin/bash
val="$#"
awk -v val="$val" '
BEGIN { i=1; split (val,v," ") }
# If it is a string, find the sequence SI1 L and change the value after that, using values that the user inputs
/SDEF POS.*ERG=[a-zA-Z]+/ { flag="y" ; }
/SI1 L/ { if (flag=="y") { $3=v[i]; i++; flag="n"; } }
# If it is a number, change the number using values that the user inputs.
/SDEF POS.*ERG=[0-9]+ / { sub(/ERG=[0-9]*/, "ERG="v[i],$0);i++; }
1
' file
Hints:
If the first rule finds ERG= followed by at least one letter ([a-zA-Z]+), it sets the flag.
The /SI1 L/ rule only triggers if the flag is set. When it triggers, it unsets the flag again, so that any following /SI1 L/ line doesn't trigger once more.
.* stands for any number (0 or more) of arbitrary characters
[A-Za-z]+ stands for one or more alphabetic characters, lower- or uppercase
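For completeness, the script above takes the replacement values as command-line arguments; assuming it is saved as change_erg.sh (a made-up name), a call could look like:

$ ./change_erg.sh 0.02 0.03 0.04 0.05

Each matched SDEF line consumes the next value v[i], in the order the arguments were given.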
awk -F '[[:blank:]=]' -v string_value="foo" -v number_value=42 '
/ERG=/ {
for (i=1; i<NF; i++)
if ($i == "ERG") {
isstring = ($(i+1) ~ /[^[:digit:]]/)
break
}
if (!isstring)
$(i+1) = number_value
}
/SI1 L/ && isstring { $NF = string_value }
1
' filename

Get regions from a file that are part of regions in other file (Without loops)

I have two files:
regions.txt: First column is the chromosome name, second and third are start and end position.
1 100 200
1 400 600
2 600 700
coverage.txt: First column is chromosome name, again second and third are start and end positions, and last column is the score.
1 100 101 5
1 101 102 7
1 103 105 8
2 600 601 10
2 601 602 15
This file is huge: it is about 15 GB, with about 300 million lines.
I basically want to get the mean of all scores in coverage.txt that are in each region in regions.txt.
In other words: start at the first line in regions.txt; if a line in coverage.txt has the same chromosome, a coverage start >= the region start, and a coverage end <= the region end, then save its score to a new array. After searching all of coverage.txt, print the region's chromosome, start, end, and the mean of all scores that were found.
Expected output:
1 100 200 6.7 which is (5+7+8)/3
1 400 600 0 no match in coverage.txt
2 600 700 12.5 which is (10+15)/2
I built the following MATLAB script, which takes a very long time since I have to loop over coverage.txt many times. I don't know how to write a similarly fast AWK script.
My matlab script
fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');
while feof(ft) == 0
linet = fgetl(ft);
scant = textscan(linet, '%d%d%d');
tchr = scant{1};
tx = scant{2};
ty = scant{3};
coverages = [];
frewind(fc);
while feof(fc) == 0
linec = fgetl(fc);
scanc = textscan(linec, '%d%d%d%d');
cchr = scanc{1};
cx = scanc{2};
cy = scanc{3};
cov = scanc{4};
if (cchr == tchr) && (cx >= tx) && (cy <= ty)
coverages = cat(2, coverages, cov);
end
end
covmed = median(coverages);
fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed);
end
Any suggestions for an alternative using AWK, Perl, etc. would be appreciated. I will also be pleased if someone can teach me how to get rid of all the loops in my MATLAB script.
Thanks
Here is a Perl solution. I use hashes (aka dictionaries) to access the various ranges via the chromosome, thus reducing the number of loop iterations.
This is potentially efficient, as I don't do a full loop over regions.txt for every input line. Efficiency could perhaps be increased further with multithreading.
#!/usr/bin/perl
my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile or die "Can't open $rangefile";

# construct the ranges. The chromosome is used as range key.
my %ranges;
while (<$rFH>) {
    chomp;
    my @field = split /\s+/;
    push @{ $ranges{$field[0]} }, [ @field[1,2], 0, 0 ];
}
close $rFH;

# iterate over all the input
while (my $line = <STDIN>) {
    chomp $line;
    my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
    # only loop over ranges with matching chromosome
    foreach my $range (@{ $ranges{$chrom} }) {
        if ($$range[0] <= $lower and $upper <= $$range[1]) {
            $$range[2]++;
            $$range[3] += $value;
            last; # break out of foreach early because ranges don't overlap
        }
    }
}

# create the report
foreach my $chrom (sort { $a <=> $b } keys %ranges) {
    foreach my $range (@{ $ranges{$chrom} }) {
        my $value = $$range[2] ? $$range[3] / $$range[2] : 0;
        printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value;
    }
}
Example invocation:
$ perl script.pl regions.txt <coverage.txt >output.txt
Output on the example input:
1 100 200 6.7
1 400 600 0.0
2 600 700 12.5
(because (5+7+8)/3 = 6.66…)
Normally, I would load the files into R and calculate it, but given that one of them is so huge, this would become a problem. Here are some thoughts that might help you solve it.
Consider splitting coverage.txt by chromosomes. This would make the calculations less demanding.
Instead of looping over coverage.txt, first read regions.txt fully into memory (I assume it is much smaller). For each region, keep a running score and a count.
Process coverage.txt line by line. For each line, determine the chromosome and the region that this particular stretch belongs to. This will require some footwork, but if regions.txt is not too large, it might be more efficient. Add the score to the region's score and increment its count by one.
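A minimal AWK rendering of that in-memory idea, as a sketch only (field layout as in the question; regions assumed non-overlapping and regions.txt assumed small enough for memory):

awk '
FNR == NR {                  # first file: load all regions into memory
    n++
    chrom[n] = $1; lo[n] = $2; hi[n] = $3
    next
}
{
    # second file: linear scan of the regions for each coverage line
    for (i = 1; i <= n; i++)
        if ($1 == chrom[i] && $2 >= lo[i] && $3 <= hi[i]) {
            sum[i] += $4; cnt[i]++
            break
        }
}
END {
    for (i = 1; i <= n; i++)
        printf "%s %s %s %.1f\n", chrom[i], lo[i], hi[i], cnt[i] ? sum[i] / cnt[i] : 0
}' regions.txt coverage.txt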
An alternative, most efficient way requires both files to be sorted first by chromosome, then by position:
1. Take a line from regions.txt. Record the chromosome and positions. If there is a line left over from the previous loop, go to 3; otherwise go to 2.
2. Take a line from coverage.txt.
3. Check whether it is within the current region.
yes: add the score to the region, increment the count. Go to 2.
no: divide the score by the count, write the current region to the output, go to 1.
This last method requires some fine-tuning, but will be the most efficient -- it goes through each file only once and stores almost nothing in memory.
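And here is a sketch of that sorted single-pass variant in AWK (again my own illustration, untested at 15 GB scale). It assumes both files were sorted the same way, e.g. with sort -k1,1 -k2,2n on each, and only advances a single region pointer instead of rescanning:

awk '
BEGIN { r = 1 }
FNR == NR {                  # regions.txt: kept in memory for the final report
    n++
    chrom[n] = $1; lo[n] = $2; hi[n] = $3
    next
}
{
    # move the pointer past regions that end before this coverage line
    while (r <= n && (chrom[r] != $1 ? chrom[r] < $1 : hi[r] < $3))
        r++
    if (r <= n && chrom[r] == $1 && $2 >= lo[r] && $3 <= hi[r]) {
        sum[r] += $4; cnt[r]++
    }
}
END {
    for (i = 1; i <= n; i++)
        printf "%s %s %s %.1f\n", chrom[i], lo[i], hi[i], cnt[i] ? sum[i] / cnt[i] : 0
}' regions.txt coverage.txt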
Here's one way using join and awk. Run like:
join regions.txt coverage.txt | awk -f script.awk - regions.txt
Contents of script.awk:
# joined lines (first input): sum up scores that fall inside their region
FNR==NR && $4>=$2 && $5<=$3 {
    sum[$1 FS $2 FS $3]+=$6
    cnt[$1 FS $2 FS $3]++
    next
}
# remaining lines, in particular regions.txt (second input):
# print each region with its mean, or 0 when nothing matched
{
    if ($1 FS $2 FS $3 in sum) {
        printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]
    }
    else if (NF == 3) {
        print $0 " 0"
    }
}
Results:
1 100 200 6.7
1 400 600 0
2 600 700 12.5
Alternatively, here's the one-liner:
join regions.txt coverage.txt | awk 'FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++; next } { if ($1 FS $2 FS $3 in sum) printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 " 0" }' - regions.txt
Here is a simple MATLAB way to bin your coverage into regions:
% extract the regions extents
bins = regions(:,2:3)';
bins = bins(:);
% extract the coverage - only the start is needed
covs = coverage(:,2);
% use histc to place the coverage start into proper regions
% this line counts how many coverages there are in a region
% and assigns them proper region ids.
[h, i]= histc(covs(:), bins(:));
% sum the scores into correct regions (second output of histc gives this)
total = accumarray(i, coverage(:,4), [numel(bins),1]);
% average the score in regions (first output of histc is useful)
avg = total./h;
% remove every second entry - our regions are defined by start/end
avg = avg(1:2:end);
Now this works assuming that the regions are non-overlapping, but I guess that is the case. Also, every entry in the coverage file has to fall into some region.
Also, it is trivial to 'block' this approach over coverages if you want to avoid reading in the whole file. You only need the bins, which come from your regions file, which is presumably small. You can process the coverages in blocks, incrementally add to total, and compute the average at the end.