Computing the average of variables with missing values in Stata

I know how to calculate the average of variables without missing values, but I am not sure how to calculate it when some values are missing. For example, we have 6 area halls as follows:
area_hall_1 area_hall_2 area_hall_3 area_hall_4 area_hall_5 area_hall_6
580 580 650 . . .
1000 1000 . . .
825 825 . . . .
912 912 . . . .
670 . . . . .
790 . . . . .
750 900 1000 1000 900 750

The reported (or rather implied) problem makes no sense whatsoever. Consider the data posted (an extra missing value is needed in the second observation).
. clear
. input area_hall_1 area_hall_2 area_hall_3 area_hall_4 area_hall_5 area_hall_6
area_ha~1 area_ha~2 area_ha~3 area_ha~4 area_ha~5 area_ha~6
1. 580 580 650 . . .
2. 1000 1000 . . . .
3. 825 825 . . . .
4. 912 912 . . . .
5. 670 . . . . .
6. 790 . . . . .
7. 750 900 1000 1000 900 750
8. end
. egen area_hall_mean = rowmean(area_hall_?)
. egen area_hall_count = rownonmiss(area_hall_?)
. l *_mean *_count , sep(0)
+---------------------+
| area_h~n area_h~t |
|---------------------|
1. | 603.3333 3 |
2. | 1000 2 |
3. | 825 2 |
4. | 912 2 |
5. | 670 1 |
6. | 790 1 |
7. | 883.3333 6 |
+---------------------+
. di (580+580+650)/3
603.33333
The egen function rowmean() ignores missing values. How could it do otherwise? The only other possibility is to report that a mean cannot be calculated because there are missing values. That is defensible, but not at all typical Stata style. So the means reported are exactly those the OP wants. An independent calculation with display shows that the means reported are those desired. (A profound sceptic is at liberty to inspect the code with viewsource _growmean.ado.)
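For readers who want to check the arithmetic outside Stata, here is a minimal Python sketch of the same idea: average only the non-missing entries of each row and count them. It is not Stata's implementation; the names row_mean and row_nonmiss are hypothetical, and None stands in for Stata's missing value.

```python
# Rows from the question; None stands in for Stata's missing value (.).
rows = [
    [580, 580, 650, None, None, None],
    [1000, 1000, None, None, None, None],
    [825, 825, None, None, None, None],
    [912, 912, None, None, None, None],
    [670, None, None, None, None, None],
    [790, None, None, None, None, None],
    [750, 900, 1000, 1000, 900, 750],
]


def row_nonmiss(row):
    """Count the non-missing entries in a row (hypothetical helper)."""
    return sum(value is not None for value in row)


def row_mean(row):
    """Mean of the non-missing entries; None if every entry is missing."""
    present = [value for value in row if value is not None]
    return sum(present) / len(present) if present else None


means = [row_mean(row) for row in rows]
counts = [row_nonmiss(row) for row in rows]

for mean, count in zip(means, counts):
    print(f"{mean:.4f} {count}")
```

The first row gives 603.3333 with a count of 3 and the last gives 883.3333 with a count of 6, in line with the listing above.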


How to delete a pattern in a file that have the same structure but different content?

I am working with a file (.gff3) in which this pattern appears (where # corresponds to numbers):
TRINITY_DN###_c0_g1~~
example:
BAN_TRINITY_DN0_c0_g1_i1 transdecoder gene 1 580 . + . ID=TRINITY_DN0_c0_g1~~TRINITY_DN0_c0_g1_i1.p1;Name=ORF%20type%3A5prime_partial%20len%3A190%20%28%2B%29%2Cscore%3D182.16
BAN_TRINITY_DN0_c0_g1_i1 transdecoder mRNA 1 580 . + . ID=TRINITY_DN0_c0_g1_i1.p1;Parent=TRINITY_DN0_c0_g1~~TRINITY_DN0_c0_g1_i1.p1;Name=ORF%20type%3A5prime_partial%20len%3A190%20%28%2B%29%2Cscore%3D182.16
BAN_TRINITY_DN0_c0_g1_i1 transdecoder exon 1 580 . + . ID=TRINITY_DN0_c0_g1_i1.p1.exon1;Parent=TRINITY_DN0_c0_g1_i1.p1
BAN_TRINITY_DN0_c0_g1_i1 transdecoder CDS 1 570 . + 0 ID=cds.TRINITY_DN0_c0_g1_i1.p1;Parent=TRINITY_DN0_c0_g1_i1.p1
BAN_TRINITY_DN0_c0_g1_i1 transdecoder three_prime_UTR 571 580 . + . ID=TRINITY_DN0_c0_g1_i1.p1.utr3p1;Parent=TRINITY_DN0_c0_g1_i1.p1
BAN_TRINITY_DN101_c0_g1_i1 transdecoder gene 1 230 . - . ID=TRINITY_DN101_c0_g1~~TRINITY_DN101_c0_g1_i1.p1;Name=ORF%20type%3Ainternal%20len%3A77%20%28-%29%2Cscore%3D24.09
BAN_TRINITY_DN101_c0_g1_i1 transdecoder mRNA 1 230 . - . ID=TRINITY_DN101_c0_g1_i1.p1;Parent=TRINITY_DN101_c0_g1~~TRINITY_DN101_c0_g1_i1.p1;Name=ORF%20type%3Ainternal%20len%3A77%20%28-%29%2Cscore%3D24.09
BAN_TRINITY_DN101_c0_g1_i1 transdecoder exon 1 230 . - . ID=TRINITY_DN101_c0_g1_i1.p1.exon1;Parent=TRINITY_DN101_c0_g1_i1.p1
BAN_TRINITY_DN101_c0_g1_i1 transdecoder CDS 3 230 . - 0 ID=cds.TRINITY_DN101_c0_g1_i1.p1;Parent=TRINITY_DN101_c0_g1_i1.p1
I'd like to simply delete the pattern, so the output would be something like this:
BAN_TRINITY_DN0_c0_g1_i1 transdecoder gene 1 580 . + . ID=TRINITY_DN0_c0_g1_i1.p1;Name=ORF%20type%3A5prime_partial%20len%3A190%20%28%2B%29%2Cscore%3D182.16
BAN_TRINITY_DN0_c0_g1_i1 transdecoder mRNA 1 580 . + . ID=TRINITY_DN0_c0_g1_i1.p1;Parent=TRINITY_DN0_c0_g1_i1.p1;Name=ORF%20type%3A5prime_partial%20len%3A190%20%28%2B%29%2Cscore%3D182.16
BAN_TRINITY_DN0_c0_g1_i1 transdecoder exon 1 580 . + . ID=TRINITY_DN0_c0_g1_i1.p1.exon1;Parent=TRINITY_DN0_c0_g1_i1.p1
BAN_TRINITY_DN0_c0_g1_i1 transdecoder CDS 1 570 . + 0 ID=cds.TRINITY_DN0_c0_g1_i1.p1;Parent=TRINITY_DN0_c0_g1_i1.p1
BAN_TRINITY_DN0_c0_g1_i1 transdecoder three_prime_UTR 571 580 . + . ID=TRINITY_DN0_c0_g1_i1.p1.utr3p1;Parent=TRINITY_DN0_c0_g1_i1.p1
BAN_TRINITY_DN101_c0_g1_i1 transdecoder gene 1 230 . - . ID=TRINITY_DN101_c0_g1_i1.p1;Name=ORF%20type%3Ainternal%20len%3A77%20%28-%29%2Cscore%3D24.09
BAN_TRINITY_DN101_c0_g1_i1 transdecoder mRNA 1 230 . - . ID=TRINITY_DN101_c0_g1_i1.p1;Parent=TRINITY_DN101_c0_g1_i1.p1;Name=ORF%20type%3Ainternal%20len%3A77%20%28-%29%2Cscore%3D24.09
BAN_TRINITY_DN101_c0_g1_i1 transdecoder exon 1 230 . - . ID=TRINITY_DN101_c0_g1_i1.p1.exon1;Parent=TRINITY_DN101_c0_g1_i1.p1
BAN_TRINITY_DN101_c0_g1_i1 transdecoder CDS 3 230 . - 0 ID=cds.TRINITY_DN101_c0_g1_i1.p1;Parent=TRINITY_DN101_c0_g1_i1.p1
I tried to use sed to do it. However, the pattern changes in size and composition, and I don't know how to delete the characters while taking this into account (I'm quite new to bash).
Does anyone have an idea how to do this?
Something like
sed 's/TRINITY_DN[0-9]*_c0_g1~~//' input.gff3
[0-9]* matches any number of consecutive digits.
If you are willing to write the pattern you want to delete as a regular expression, just run:
PATTERN='TRINITY_DN[0-9][0-9][0-9]_c0_g1~~'
sed "/$PATTERN/s///g" file.gff3
I assumed the pattern could occur several times on one line. If this is not the case, remove the g at the end of the first argument of the sed command.
If you don't know how many digits you will have after TRINITY_DN, you can replace [0-9][0-9][0-9] with [0-9]*.
If you want another syntax for describing your patterns (e.g. # instead of [0-9]), please specify.
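The same substitution can be sketched in Python with the standard re module, which may be convenient if the file is already being processed in a script. The function name strip_prefix and the shortened sample line are assumptions made for illustration; the regular expression is the one from the sed answers.

```python
import re

# Same pattern as in the sed answers: TRINITY_DN, any run of digits, _c0_g1~~
PREFIX = re.compile(r"TRINITY_DN[0-9]*_c0_g1~~")


def strip_prefix(line):
    """Remove every occurrence of the pattern from one line (hypothetical helper)."""
    return PREFIX.sub("", line)


# Shortened attribute column, used only for illustration.
sample = "ID=TRINITY_DN101_c0_g1~~TRINITY_DN101_c0_g1_i1.p1;Name=ORF"
print(strip_prefix(sample))
```

Lines that do not contain the ~~ marker pass through unchanged, because the pattern only matches when the two tildes follow _c0_g1.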

How can I match values in one file to ranges from another?

There are two input files, as the following lines show.
Columns 3 and 4 in input1 hold a range (such as 1 to 78 in the first row)
Column 2 in input2 holds a single position value (such as 32 in the first row) that falls within one of the ranges in columns 3 and 4 of input1; that range identifies the corresponding value in column 2 of input1, in this case B100002.
I want to generate a file that contains, for each position in input2, the matching value from column 2 of input1 and the position relative to the start of its range.
For example, 358-344 + 1 = 15 is the relative position value for B100043
input1:
Scaffold_1 B100002 1 78
Scaffold_1 B100041 179 243
Scaffold_1 B100043 344 418
Scaffold_1 B100045 519 583
Scaffold_1 B100058 684 751
Scaffold_1 B100059 852 915
Scaffold_1 B100066 1016 1079
Scaffold_1 B100080 1180 1246
Scaffold_1 B100111 1347 1413
Scaffold_1 B100118 1514 1585
Scaffold_2 B123465 31531 31595
input2:
Scaffold_1 32
Scaffold_1 358
Scaffold_2 31533
Required output:
B100002 32
B100043 15
B123465 2
This is my solution:
Convert input1 to input_1 and input2 to input_2 (tab-separated).
Use bedtools and awk to generate the output file that I want.
input_1:
Scaffold_1 . B100002 1 78 . . . .
Scaffold_1 . B100041 179 243 . . . .
Scaffold_1 . B100043 344 418 . . . .
Scaffold_1 . B100045 519 583 . . . .
Scaffold_1 . B100058 684 751 . . . .
Scaffold_1 . B100059 852 915 . . . .
Scaffold_1 . B100066 1016 1079 . . . .
Scaffold_1 . B100080 1180 1246 . . . .
Scaffold_1 . B100111 1347 1413 . . . .
Scaffold_1 . B100118 1514 1585 . . . .
Scaffold_1 . B101068 9218 9290 . . . .
Scaffold_2 . B123465 31531 31595 . . . .
input_2:
Scaffold_1 . . 31 33 . . . .
Scaffold_1 . . 357 359 . . . .
Scaffold_2 . . 31532 31534 . . . .
bedtools intersect -wb -a test2 -b test1 | awk '{print $12,($5-$13)}'
B100002 32
B100043 15
B123465 3
How can I use awk or perl to achieve my purpose? (I have to change file format when I use bedtools.)
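If the data files are not huge, there is a simpler way (join expects both files to be sorted on the first field, and the comparisons are inclusive so that positions on a range boundary also match):
$ join input1 input2 | awk '$3<=$5 && $5<=$4 {print $2, $5-$3+1}'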
B100002 32
B100043 15
B123465 3
This Perl code seems to solve your problem.
It uses a common idiom: load the entire dictionary in input1.txt into an in-memory data structure (here %data, which is indexed by the scaffold ID) and then process the object data to gather information from the dictionary.
Assuming your input1 isn't enormous, this should work fine. It's impossible to key data structures on a range, so every candidate range must be checked to see whether the index falls at or above the start and at or below the end.
If there is a match, the ID is printed together with the result of the arithmetic to calculate the one-based relative index.
Note that your required result for the entry Scaffold_2 31533 should be 3 and not 2.
use strict;
use warnings 'all';
use autodie;
use Data::Dump;

# Load every range from input1.txt, grouped by scaffold ID
my %data;
{
    open my $fh, '<', 'input1.txt';
    while ( <$fh> ) {
        next unless /\S/;
        my ($scaff, $code, $start, $end) = split;
        push @{ $data{$scaff} }, { start => $start, end => $end, code => $code };
    }
}

# Look up each position from input2.txt among the ranges of its scaffold
open my $fh, '<', 'input2.txt';
while ( <$fh> ) {
    my ($scaff, $index) = split;
    my $items = $data{$scaff} or die qq{No such scaffold "$scaff"};
    for my $item ( @$items ) {
        next unless $index >= $item->{start} and $index <= $item->{end};
        printf "%s\t%d\n",
            $item->{code},
            $index - $item->{start} + 1;
        last;
    }
}
output
B100002 32
B100043 15
B123465 3
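The same lookup can be sketched in Python for comparison. This is only an illustration under a few assumptions: the function name relative_positions is hypothetical, the data is embedded as strings (a subset of input1) rather than read from files, and an unknown scaffold is skipped silently rather than raising an error as the Perl version does.

```python
from collections import defaultdict

# A subset of input1 and all of input2 from the question, embedded as text.
input1 = """\
Scaffold_1 B100002 1 78
Scaffold_1 B100041 179 243
Scaffold_1 B100043 344 418
Scaffold_2 B123465 31531 31595
"""
input2 = """\
Scaffold_1 32
Scaffold_1 358
Scaffold_2 31533
"""


def relative_positions(ranges_text, positions_text):
    """Return (code, one-based offset) for each position that falls in a range."""
    # Group the ranges by scaffold ID so only relevant ranges are scanned.
    ranges = defaultdict(list)
    for line in ranges_text.splitlines():
        if not line.strip():
            continue
        scaffold, code, start, end = line.split()
        ranges[scaffold].append((int(start), int(end), code))

    results = []
    for line in positions_text.splitlines():
        if not line.strip():
            continue
        scaffold, position = line.split()
        position = int(position)
        # Linear scan: take the first range that contains the position.
        for start, end, code in ranges.get(scaffold, []):
            if start <= position <= end:
                results.append((code, position - start + 1))
                break
    return results


result = relative_positions(input1, input2)
for code, offset in result:
    print(code, offset)
```

The linear scan over each scaffold's ranges is the same approach as above; for very large range tables, sorting the ranges and using a binary search would be one possible refinement.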

Create features (long vector) with scala

I have a big CSV file (~2 GB) that contains a parameter X with around 1000 records for each day.
What I want to do is transform this column into a set of feature vectors of length 1000 (one for each day).
For example:
==> Day 1 Day P1
1 1
1 2
1 5
1 9
1 .
1 .
1 .
1 6
==> Day 2 1 4
2 1
2 2
2 5
2 7
2 .
2 .
2 .
2 8
Will be transformed to:
d1 1 2 5 9 . . . 6
d2 4 1 2 5 . . . 8
.
.
.
dn
How can I do that in Scala?
I know that there will be memory issues; I'll try to store the result in multiple steps.
Here is what I've tried so far:
df_data.map(x => (x(1),x(3))).filter(x=> x._1== 1).zipWithIndex.map(x=> (x._1._1,(x._2,x._1._2))).groupByKey()
Now I get something like:
(1, (0,val1),(1,val2),(2,val3),...,(n,valn))

Dividing complex shapes into contiguous sub-shapes in MATLAB

I have a 3D shape loaded into MATLAB as a 3D matrix. The matrix is fairly large, e.g. 250x250x250. The shape is defined within the matrix by numbers >0 but <=1, so all positive numbers in the matrix are "shape", and all zeros are "non-shape". The shape is contiguous. A simplified (8x8) example of one plane of such a shape is shown below:
0 0 0 0 0 0 0 0
0 0 1 .5 .1 .2 1 0
0 0 0 0 0 .3 0 0
0 0 .2 .3 1 1 1 1
0 0 0 .8 1 0 0 0
0 .2 .1 1 0 1 0 0
0 .1 .9 .9 .9 0 0 0
0 0 0 0 0 0 0 0
I need to split this shape into 2 sub-shapes where the sum of values of the two sub-shapes is roughly equal, and where the two sub-shapes are contiguous. So a valid division could be [N.B. zeros replaced by '.' for visual clarity]:
. . . . . . . .
. . B B B B B .
. . . . . B . .
. . A A B B B B
. . . A A . . .
. A A A . A . .
. A A A A . . .
. . . . . . . .
But the following division would be invalid because not all of the values in sub-shape B can be directly joined up with each other.
. . . . . . . .
. . B B B A A .
. . . . . A . .
. . B B A A A A
. . . B A . . .
. B B B . A . .
. B B B B . . .
. . . . . . . .
My real-world example is in 3 dimensions and much larger. Any ideas how I could divide my shape into 2 contiguous sub-shapes? By extension, how can I divide it into 3 contiguous sub-shapes if I wanted to, again where the sum of values in the sub-shapes is approximately equal?

Octave generate combination subsets

Given a number N, I would like to create a matrix of x columns with every combination of a subset of N. For example, if N is 16 and x is 3 then I should get a matrix of 560 rows and each row will have 3 columns and contain a unique combination from the numbers 1 to 16.
Can I use a function zzz(N,x)?
I will be generating a lot of them with different N and x values so a for loop will slow things down.
Just use the nchoosek function:
N = 16;
x = 3;
nchoosek(1:N, x)
returns 560 rows like this:
. . .
. . .
. . .
1 2 13
1 2 14
1 2 15
1 2 16
1 3 4
1 3 5
1 3 6
1 3 7
. . .
. . .
. . .
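As a cross-check on the row count and ordering, the same enumeration can be sketched in Python with itertools.combinations from the standard library. The function name combination_rows is hypothetical and used only for illustration.

```python
from itertools import combinations


def combination_rows(n, x):
    """All x-element combinations of 1..n, in lexicographic order (hypothetical helper)."""
    return [list(combo) for combo in combinations(range(1, n + 1), x)]


rows = combination_rows(16, 3)
print(len(rows))     # 560
print(rows[0])       # [1, 2, 3]
print(rows[10:15])   # the rows from 1 2 13 through 1 3 4
```

The count is the binomial coefficient C(16, 3) = 16·15·14 / 6 = 560, which agrees with the number of rows quoted above.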