Merge data based on column match - sed

I have records like the following (input). There are 8 fields, separated by tabs. From these records I need to generate a new file as shown below, merging rows whose columns 1, 2, 4 and 5 match.
Input file:
-10 68120047 . X Y . Pass A=0.0257732
-10 68120047 . X Y . Pass B=0.0263158
-10 68120047 . X Y . Pass C=0.0280899
Output:
-10 68120047 . X Y . Pass A=0.0257732;B=0.0263158;C=0.0280899

Your example's columns 1, 2, 4 and 5 don't all match: you have y and Y.
Try this one-liner, which uses $1 and $2 as the key; you could extend the key to $1, $2, $4 and $5 too.
awk '{r=$NF;k=$1$2;a[k]=a[k]?a[k]";"r:$0}END{for(x in a)print a[x]}' file
with your content in file:
kent$  awk '{r=$NF;k=$1$2;a[k]=a[k]?a[k]";"r:$0}END{for(x in a)print a[x]}' file 
-10  68120047    .   X   Y   .   Pass    A=0.0257732;B=0.0263158;C=0.0280899
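For comparison, the same grouping can be sketched in Python, keyed on columns 1, 2, 4 and 5 as the question asks. This is only an illustration: the function name merge_rows is made up here, and it assumes every row has 8 whitespace-separated fields with the value to merge in the last one.

```python
def merge_rows(lines):
    """Join the last field of rows that share columns 1, 2, 4 and 5."""
    merged = {}  # key -> [leading fields of the first row, last-column values]
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = (fields[0], fields[1], fields[3], fields[4])
        if key not in merged:
            merged[key] = [fields[:-1], []]
        merged[key][1].append(fields[-1])
    # dicts keep insertion order, so output follows first appearance
    return ["\t".join(head + [";".join(vals)]) for head, vals in merged.values()]


rows = [
    "-10 68120047 . X Y . Pass A=0.0257732",
    "-10 68120047 . X Y . Pass B=0.0263158",
    "-10 68120047 . X Y . Pass C=0.0280899",
]
for out in merge_rows(rows):
    print(out)
```

Unlike the awk `for (x in a)` loop, whose iteration order is unspecified, this version emits groups in the order they first appear in the input.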


Calculating Factorials using QBasic

I'm writing a program that calculates the factorials of 5 numbers and outputs the results in tabular form, but I keep getting zeros.
Factorial formula: n! = n×(n-1)!
I tried:
CLS
DIM arr(5) AS INTEGER
FOR x = 1 TO 5
INPUT "Enter Factors: ", n
NEXT x
f = 1
FOR i = 1 TO arr(n)
f = f * i
NEXT i
PRINT
PRINT "The factorial of input numbers are:";
PRINT
FOR x = 1 TO n
PRINT f(x)
NEXT x
END
and I'm expecting:
Numbers Factorials
5 120
3 6
6 720
8 40320
4 24
You made some mistakes:
FOR i = 1 TO arr(n)
Here n only holds the last value typed at the INPUT prompt, and you never stored any values into arr.
PRINT f(x)
Here you read from an array f that is not defined anywhere in your code.
Possible solution to calculate arrays of factorials:
CLS
DIM arr(5) AS INTEGER
DIM ans(5) AS LONG
FOR x = 1 TO 5
INPUT "Enter Factors: ", arr(x)
f& = 1
FOR i = 1 TO arr(x)
f& = f& * i
NEXT i
ans(x) = f&
NEXT x
PRINT
PRINT "The factorial of input numbers are:";
PRINT
PRINT "Numbers", "Factorials"
FOR x = 1 TO 5
PRINT arr(x), ans(x)
NEXT x
END
I don't have a BASIC interpreter right in front of me, but I think this is what you're looking for:
CLS
DIM arr(5) AS INTEGER
DIM ans(5) AS LONG 'You need a separate array to store results in.
FOR x = 1 TO 5
INPUT "Enter Factors: ", arr(x)
NEXT x
FOR x = 1 to 5
f& = 1
FOR i = 1 TO arr(x)
f& = f& * i
NEXT i
ans(x) = f&
NEXT x
PRINT
PRINT "The factorial of input numbers are:";
PRINT
PRINT "Numbers", "Factorials"
FOR x = 1 TO 5
PRINT STR$(arr(x)), ans(x)
NEXT x
END
Just a comment though: in programming, you should avoid reusing variables unless you are short on memory. It can be done right, but it creates many opportunities for hard-to-find bugs in larger programs.
Possible solution to calculate arrays of factorials and square roots:
CLS
PRINT "Number of values";: INPUT n
DIM arr(n) AS INTEGER
DIM ans(n) AS LONG
FOR x = 1 TO n
PRINT "Enter value"; x;: INPUT arr(x)
f& = 1
FOR i = 1 TO arr(x)
f& = f& * i
NEXT i
ans(x) = f&
NEXT x
PRINT
PRINT "The factorial/square root of input numbers are:";
PRINT
PRINT "Number", "Factorial", "Squareroot"
FOR x = 1 TO n
PRINT arr(x), ans(x), SQR(arr(x))
NEXT x
END
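The idea shared by the answers above (keep the inputs in one array, compute each factorial with its own accumulator, and store the results in a second array) can be sketched in Python as follows. The function names are illustrative only, and the input values are taken from the expected output in the question rather than read from the keyboard.

```python
def factorial(n):
    """Compute n! with a loop, mirroring the FOR i = 1 TO arr(x) loop."""
    f = 1  # reset the accumulator for every number
    for i in range(1, n + 1):
        f *= i
    return f


def factorial_table(numbers):
    """Return (number, factorial) pairs, one per input value."""
    return [(n, factorial(n)) for n in numbers]


print("Numbers\tFactorials")
for n, f in factorial_table([5, 3, 6, 8, 4]):
    print(f"{n}\t{f}")
```

Python integers do not overflow, so there is no counterpart here to choosing LONG over INTEGER for the results.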

Scilab Storing Into Array [Channel Coding]

This is part of my code:
e=0;
c=0;
n=10000;
for t=zeros(1:n)
//state1
x=rand();
if(x<=0.95) then disp(t);
c=c+1;
elseif(x>0.95)
//state2
x=rand();
if(x<=0.99) then disp(t)
c=c+1;
//state3
elseif(x>0.99) then disp(t=1)
e=e+1;
arr(e)=t; //store error bits only
end
end
end
disp(c);
disp(e);
for z=1:e //loop the earlier arr(s)
disp(arr(z)) //display all arr of s
end
clear();
What I am trying to do is generate 10000 zeros.
Out of these 10000 zeros, a few will have errors; for example, I might get 9990 zeros and 10 ones.
Currently, I have an array storing only the ones. Now I'm a bit lost on how to store both zeros and ones in the same array.
Let's say that in the current run I end up with 10 ones (the zeros that contain an error bit). At this part of the code, all the zeros that have turned into ones are stored in arr(e). Therefore the output would be
0
0
0
0
0
0
0
0
0
0
But what I want is something like this:
arr[1] = 0
.
.
.
arr[250] = 1
.
.
.
arr[749] = 1
.
.
.
arr[1234] = 1
.
.
.
arr[5463] = 1
.
.
.
arr[6678] = 1
.
.
.
arr[8890] = 1
.
.
.
arr[9987] = 1
.
.
.
arr[10000] = 0
This shows that the error bits occur at positions 250, 749, 1234, 5463, 6678, 8890 and 9987.
Thank you
All you have to do is:
e = [250 749 1234 5463 6678 8890 9987];
arr = zeros(10000,1);
arr(e) = 1;
e defines where you want the values in arr to be changed to 1. You simply just use e to index into arr and set the corresponding positions to 1. That's it... nothing really to it!
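As a rough Python counterpart to the index assignment above, the sketch below builds the full array of zeros and then sets the listed positions to 1. The name mark_errors is hypothetical, and the positions are treated as 1-based (as in Scilab), so they are shifted by one for Python's 0-based lists.

```python
def mark_errors(n, error_positions):
    """Return a list of n zeros with a 1 at each 1-based error position."""
    arr = [0] * n
    for pos in error_positions:
        arr[pos - 1] = 1  # convert the 1-based position to a 0-based index
    return arr


arr = mark_errors(10000, [250, 749, 1234, 5463, 6678, 8890, 9987])
print(sum(arr))  # number of error bits stored in the array
```

The same pattern works while simulating: allocate the whole array first, then write each generated bit at its own index instead of appending only the errors.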

If statement inside awk to change a value

I have the following file
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
I want to do the following task:
1. Check whether the sequence ERG= is followed by a number or a string.
2. If it's a string, find the sequence SI1 L and change the value after it, using values that the user inputs.
3. If it's a number, change the number using values that the user inputs.
Note that if ERG= is followed by a number, there will be no SI1 L sequence.
For instance, step 2 can be accomplished using the following:
#! /bin/bash
vals=(0.02 0.03 0.04 0.05)
for val in "${vals[@]}"; do
awk -vval="$val" '$1=="SI1"{$3=val}1' 20
done
How can the above algorithm be achieved?
#!/bin/bash
val="$*"  # all command-line arguments as one space-separated string
awk -v val="$val" '
BEGIN { i=1; split (val,v," ") }
# If it is a string, find the sequence SI1 L and change the value after that, using values that the user inputs
/SDEF POS.*ERG=[a-zA-Z]+/ { flag="y" ; }
/SI1 L/ { if (flag=="y") { $3=v[i]; i++; flag="n"; } }
# If it is a number, change the number using values that the user inputs.
/SDEF POS.*ERG=[0-9]+ / { sub(/ERG=[0-9]*/, "ERG="v[i],$0);i++; }
1
' file
Hints:
If the first rule finds ERG= followed by one or more letters ([a-zA-Z]+), it sets the flag.
The /SI1 L/ rule only triggers if the flag is set. Once triggered, it unsets the flag again, so that any following /SI1 L/ line doesn't trigger it again.
.* stands for zero or more of any character.
[A-Za-z]+ stands for one or more alphabetic characters in lower or upper case.
awk -F '[[:blank:]=]' -v string_value="foo" -v number_value=42 '
/ERG=/ {
for (i=1; i<NF; i++)
if ($i == "ERG") {
isstring = ($(i+1) ~ /[^[:digit:]]/)
break
}
if (!isstring)
$(i+1) = number_value
}
/SI1 L/ && isstring { $NF = string_value }
1
' filename
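The branching both answers implement can also be sketched in Python, which may make the two cases easier to follow. This is an illustration under assumptions, not a drop-in replacement: set_energy is a made-up name, a "number" is taken to mean digits and dots only, and the file is assumed to have been read into a list of lines already.

```python
import re


def set_energy(lines, value):
    """Put value after ERG= if it is numeric, else into the SI1 L entry."""
    out = []
    erg_is_name = False  # True once ERG= was seen with a non-numeric value
    for line in lines:
        m = re.search(r"ERG=(\S+)", line)
        if m:
            if re.fullmatch(r"[0-9.]+", m.group(1)):
                # numeric case: overwrite the number in place
                line = line[:m.start(1)] + str(value) + line[m.end(1):]
            else:
                erg_is_name = True
        elif erg_is_name and re.match(r"SI1\s+L\b", line):
            # string case: replace the first value after "SI1 L"
            line = re.sub(r"^(SI1\s+L\s+)\S+",
                          lambda k: k.group(1) + str(value), line)
            erg_is_name = False  # change only the first SI1 L line
        out.append(line)
    return out


deck = [
    "SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2",
    "SI1 L 0.020",
    "SP1 1",
]
print("\n".join(set_energy(deck, 0.03)))
```

Editing the matched span directly, rather than reassigning a field as awk does, leaves the rest of the line's spacing untouched.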

Get regions from a file that are part of regions in other file (Without loops)

I have two files:
regions.txt: First column is the chromosome name, second and third are start and end position.
1 100 200
1 400 600
2 600 700
coverage.txt: First column is chromosome name, again second and third are start and end positions, and last column is the score.
1 100 101 5
1 101 102 7
1 103 105 8
2 600 601 10
2 601 602 15
This file is huge: about 15GB, with about 300 million lines.
I basically want the mean of all scores in coverage.txt that fall within each region in regions.txt.
In other words, start at the first line in regions.txt; if there is a line in coverage.txt which has the same chromosome, start-coverage >= start-region, and end-coverage <= end-region, then save its score to a new array. After searching all of coverage.txt, print the region's chromosome, start, end, and the mean of all scores that have been found.
Expected output:
1 100 200 14.6 which is (5+7+8)/3
1 400 600 0 no match at coverages.txt
2 600 700 12.5 which is (10+15)/2
I built the following MATLAB script, which takes a very long time since I have to loop over coverage.txt many times. I don't know how to write a fast equivalent in awk.
My MATLAB script:
fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');
while feof(ft) == 0
linet = fgetl(ft);
scant = textscan(linet, '%d%d%d');
tchr = scant{1};
tx = scant{2};
ty = scant{3};
coverages = [];
frewind(fc);
while feof(fc) == 0
linec = fgetl(fc);
scanc = textscan(linec, '%d%d%d%d');
cchr = scanc{1};
cx = scanc{2};
cy = scanc{3};
cov = scanc{4};
if (cchr == tchr) && (cx >= tx) && (cy <= ty)
coverages = cat(2, coverages, cov);
end
end
covmed = median(coverages);
fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed);
end
Any suggestions for an alternative using AWK, Perl, etc.? I would also be pleased if someone could teach me how to get rid of all the loops in my MATLAB script.
Thanks
Here is a Perl solution. I use hashes (aka dictionaries) to access the various ranges via the chromosome, thus reducing the number of loop iterations.
This is potentially efficient, as I don't do a full loop over regions.txt on every input line. Efficiency could perhaps be increased further when multithreading is used.
#!/usr/bin/perl
my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile or die "Can't open $rangefile";
# construct the ranges. The chromosome is used as range key.
my %ranges;
while (<$rFH>) {
    chomp;
    my @field = split /\s+/;
    push @{$ranges{$field[0]}}, [@field[1,2], 0, 0];
}
close $rFH;
# iterate over all the input
while (my $line = <STDIN>) {
    chomp $line;
    my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
    # only loop over ranges with matching chromosome
    foreach my $range (@{$ranges{$chrom}}) {
        if ($$range[0] <= $lower and $upper <= $$range[1]) {
            $$range[2]++;
            $$range[3] += $value;
            last; # break out of foreach early because ranges don't overlap
        }
    }
}
# create the report
foreach my $chrom (sort {$a <=> $b} keys %ranges) {
    foreach my $range (@{$ranges{$chrom}}) {
        my $value = $$range[2] ? $$range[3]/$$range[2] : 0;
        printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value;
}
}
Example invocation:
$ perl script.pl regions.txt <coverage.txt >output.txt
Output on the example input:
1 100 200 6.7
1 400 600 0.0
2 600 700 12.5
(because (5+7+8)/3 = 6.66…)
Normally, I would load the files into R and calculate it, but given that one of them is so huge, this would become a problem. Here are some thoughts that might help you solving it.
Consider splitting coverage.txt by chromosomes. This would make the calculations less demanding.
Instead of looping over coverage.txt, you first read the regions.txt full into memory (I assume it is much smaller). For each region, you keep a score and a number.
Process coverage.txt line by line. For each line, you determine the chromosome and the region that this particular stretch belongs to. This will require some footwork, but if regions.txt is not too large, it might be more efficient. Add the score to the score of the region and increment number by one.
An alternative, and the most efficient, way requires both files to be sorted first by chromosome, then by position.
1. Take a line from regions.txt. Record the chromosome and positions. If there is a line remaining from the previous loop, go to 3; otherwise go to 2.
2. Take a line from coverage.txt.
3. Check whether it is within the current region.
   yes: add the score to the region, increment number. Go to 2.
   no: divide score by number, write the current region to output, go to 1.
This last method requires some fine-tuning, but will be the most efficient: it goes through each file only once and stores almost nothing in memory.
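The sorted, single-pass procedure described above can be sketched in Python as a generator. The name mean_scores and the tuple layouts are assumptions made for this example; it also assumes both inputs are sorted by (chromosome, start) and that neither the regions nor the coverage intervals overlap among themselves. In-memory lists stand in for the two files here; with real data the same function could consume rows parsed lazily from each file.

```python
def mean_scores(regions, coverage):
    """Yield (chrom, start, end, mean score) for each region in one pass."""
    cov = iter(coverage)
    pending = next(cov, None)  # coverage row not yet assigned to a region
    for chrom, start, end in regions:
        total = count = 0
        while pending is not None:
            c_chrom, c_start, c_end, score = pending
            if (c_chrom, c_start) < (chrom, start):
                pass  # row lies before this region: discard it
            elif c_chrom == chrom and c_end <= end:
                total += score  # row lies inside this region
                count += 1
            else:
                break  # row belongs further on: keep it for the next region
            pending = next(cov, None)
        yield chrom, start, end, (total / count if count else 0.0)


regions = [(1, 100, 200), (1, 400, 600), (2, 600, 700)]
coverage = [(1, 100, 101, 5), (1, 101, 102, 7), (1, 103, 105, 8),
            (2, 600, 601, 10), (2, 601, 602, 15)]
for chrom, start, end, mean in mean_scores(regions, coverage):
    print(chrom, start, end, round(mean, 1))
```

Only one coverage row and one pair of running totals are held at a time, which is what keeps memory use independent of the size of coverage.txt.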
Here's one way using join and awk. Run like:
join regions.txt coverage.txt | awk -f script.awk - regions.txt
Contents of script.awk:
FNR==NR && $4>=$2 && $5<=$3 {
sum[$1 FS $2 FS $3]+=$6
cnt[$1 FS $2 FS $3]++
next
}
{
if ($1 FS $2 FS $3 in sum) {
printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]
}
else if (NF == 3) {
print $0 " 0"
}
}
Results:
1 100 200 6.7
1 400 600 0
2 600 700 12.5
Alternatively, here's the one-liner:
join regions.txt coverage.txt | awk 'FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++; next } { if ($1 FS $2 FS $3 in sum) printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 " 0" }' - regions.txt
Here is a simple MATLAB way to bin your coverage into regions:
% extract the regions extents
bins = regions(:,2:3)';
bins = bins(:);
% extract the coverage - only the start is needed
covs = coverage(:,2);
% use histc to place the coverage start into proper regions
% this line counts how many coverages there are in a region
% and assigns them proper region ids.
[h, i]= histc(covs(:), bins(:));
% sum the scores into correct regions (second output of histc gives this)
total = accumarray(i, coverage(:,4), [numel(bins),1]);
% average the score in regions (first output of histc is useful)
avg = total./h;
% remove every second entry - our regions are defined by start/end
avg = avg(1:2:end);
Now this works assuming that the regions are non-overlapping, but I guess that is the case. Also, every entry in the coverage file has to fall into some region.
Also, it is trivial to 'block' this approach over coverages, if you want to avoid reading in the whole file. You only need the bins, your regions file, which presumably is small. You can process the coverages in blocks, incrementally add to total and compute the average in the end.
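To illustrate the binning idea outside MATLAB, here is a minimal Python sketch that uses the standard bisect module in place of histc. It makes the same assumptions as the answer (sorted, non-overlapping regions, and only the start of each coverage interval is used), and like the MATLAB code it handles a single chromosome. The name bin_means is made up for this example.

```python
from bisect import bisect_right


def bin_means(regions, coverage):
    """regions: sorted (start, end) pairs; coverage: (start, end, score) rows."""
    # flatten the extents into one edge list: start1, end1, start2, end2, ...
    edges = [edge for start, end in regions for edge in (start, end)]
    totals = [0.0] * len(regions)
    counts = [0] * len(regions)
    for start, _end, score in coverage:
        slot = bisect_right(edges, start) - 1  # index of the last edge <= start
        if slot >= 0 and slot % 2 == 0:        # even slots lie inside a region
            totals[slot // 2] += score
            counts[slot // 2] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


means = bin_means([(100, 200), (400, 600)],
                  [(100, 101, 5), (101, 102, 7), (103, 105, 8), (450, 451, 4)])
print([round(m, 1) for m in means])
```

Because the totals and counts are plain running sums, the coverage rows can be fed in blocks, as the answer suggests, without holding the whole file in memory.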

Prefix match in MATLAB

Hey guys, I have a very simple problem in MATLAB:
I have some strings which are like this:
Pic001
Pic002
Pic003
004
Not every string starts with the prefix "Pic". So how can I cut off the "Pic" part so that only the numbers at the end remain, giving all my strings the same format?
Greets, poeschlorn
If 'Pic' only ever occurs as a prefix in your strings and nowhere else within the strings then you could use STRREP to remove it like this:
>> x = {'Pic001'; 'Pic002'; 'Pic003'; '004'}
x =
'Pic001'
'Pic002'
'Pic003'
'004'
>> x = strrep(x, 'Pic', '')
x =
'001'
'002'
'003'
'004'
If 'Pic' can occur elsewhere in your strings and you only want to remove it when it occurs as a prefix then use STRNCMP to compare the first three characters of your strings:
>> x = {'Pic001'; 'Pic002'; 'Pic003'; '004'}
x =
'Pic001'
'Pic002'
'Pic003'
'004'
>> for ii = find(strncmp(x, 'Pic', 3))'
x{ii}(1:3) = [];
end
>> x
x =
'001'
'002'
'003'
'004'
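Alternatively, use a regular expression:
strings = {'Pic001'; 'Pic002'; 'Pic003'; '004'};
% regexp is case-sensitive, so match the trailing digits rather than the prefix
numbers = regexp(strings, '\d+$', 'match', 'once');
for cc = 1:length(numbers)
fprintf('%s\n', numbers{cc});
end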
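For readers more comfortable with Python, the prefix-only removal (the STRNCMP variant above) can be sketched like this. The function name strip_prefix is illustrative; the check is case-sensitive, and only a leading occurrence of the prefix is removed.

```python
def strip_prefix(names, prefix="Pic"):
    """Drop prefix from each string that starts with it; leave others as-is."""
    return [s[len(prefix):] if s.startswith(prefix) else s for s in names]


print(strip_prefix(["Pic001", "Pic002", "Pic003", "004"]))
```

Unlike a plain search-and-replace, this leaves any later occurrence of the prefix inside a string alone.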