Information on XXX.cnt obtained by an RSEM analysis - stat

After an RSEM-1.3.3 analysis, I obtained a file "XXX.cnt" in a newly created "XXX.stat" directory.
The content of XXX.cnt is shown below.
0 2726098 0 2726098
1534055 1192043 1993977
9793897 1
0 0
1 732121
2 410181
3 513309
4 610475
5 90206
6 81551
7 63620
8 44947
9 33029
10 21745
11 22282
12 21545
13 13324
14 17247
.
.
.
What do these numbers mean?
Thank you in advance for your kindness.

The format and the meaning of each field are described in "cnt_file_description.txt" in the RSEM directory.
http://deweylab.github.io/RSEM/rsem-calculate-expression.html#OUTPUT
https://github.com/bli25broad/RSEM_tutorial
Here is the content of cnt_file_description.txt:
# '#' marks the start of comments (till the end of the line)
# *.cnt file contains alignment statistics based purely on the alignment results obtained from aligners
N0 N1 N2 N_tot
# N0, number of unalignable reads; N1, number of alignable reads; N2, number of filtered reads due to too many alignments; N_tot = N0 + N1 + N2
nUnique nMulti nUncertain
# nUnique, number of reads aligned uniquely to a gene; nMulti, number of reads aligned to multiple genes; nUnique + nMulti = N1;
# nUncertain, number of reads aligned to multiple locations in the given reference sequences, which include isoform-level multi-mapping reads
nHits read_type
# nHits, number of total alignments.
# read_type: 0, single-end read, no quality score; 1, single-end read, with quality score; 2, paired-end read, no quality score; 3, paired-end read, with quality score
# The next section counts reads by the number of alignments they have. Each line contains two values separated by a TAB character. The first value is number of alignments. 'Inf' refers to reads filtered due to too many alignments. The second value is the number of reads that contain such many alignments
0 N0
...
number_of_alignments number_of_reads_with_that_many_alignments
...
Inf N2
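As a rough illustration of the layout described above, here is a minimal Python sketch that reads the text of a *.cnt file into named fields. The function name `parse_cnt` and the dictionary keys are my own choices, not part of RSEM, and the sample text is just the first lines of the file from the question (the alignment-count table is truncated there).

```python
def parse_cnt(text):
    """Parse the text of an RSEM *.cnt file (layout per cnt_file_description.txt)."""
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            rows.append(line.split())

    n0, n1, n2, n_tot = (int(v) for v in rows[0])
    n_unique, n_multi, n_uncertain = (int(v) for v in rows[1])
    n_hits, read_type = (int(v) for v in rows[2])

    # Remaining rows: number_of_alignments -> number_of_reads.
    # The special key 'Inf' (reads filtered for too many alignments) stays a string.
    histogram = {(k if k == "Inf" else int(k)): int(v) for k, v in rows[3:]}

    return {
        "N0": n0, "N1": n1, "N2": n2, "N_tot": n_tot,
        "nUnique": n_unique, "nMulti": n_multi, "nUncertain": n_uncertain,
        "nHits": n_hits, "read_type": read_type,
        "histogram": histogram,
    }


# First lines of the file from the question (histogram truncated).
sample = """0 2726098 0 2726098
1534055 1192043 1993977
9793897 1
0 0
1 732121
2 410181
"""

stats = parse_cnt(sample)
print(stats["N1"], stats["nUnique"] + stats["nMulti"])
```

For the file in the question this reads as: no unalignable reads, 2726098 alignable reads (1534055 aligned uniquely to a gene plus 1192043 aligned to multiple genes), 9793897 alignments in total, and read_type 1 (single-end reads with quality scores).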


Hashing functions and Universal Hashing Family

I need to determine whether the following set of hash functions is universal or not:
Let U be the set of keys {000, 001, 002, 003, ..., 999}, i.e. all the numbers between 0 and 999, padded with leading zeros where needed. Let n = 10 and let a be an integer with 1 <= a <= 9. We denote by ha(x) the rightmost digit of the number a*x.
For example, h2(123) = 6, because, 2 * 123 = 246.
We also denote H = {h1, h2, h3, ... ,h9} as our set of hash functions.
Is H universal? Prove it.
I know I need to calculate the probability for collision of 2 different keys and check if it's smaller or equal to 1/n (which is 1/10), so I tried to separate into cases - if a is odd or even, because when a is even the last digit of a*x will be 0/2/4/6/8, else it could be anything. But it didn't help me so much as I'm stuck on it.
Would be very glad for some help here.
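The check described above (the collision probability of two distinct keys, taken over a uniformly chosen function from H, compared against 1/n) can also be brute-forced before attempting a proof. The following is only a sketch under the assumptions that a ranges over 1..9 and that ha(x) = (a*x) mod 10; the function names are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations


def h(a, x):
    """Rightmost digit of a*x."""
    return (a * x) % 10


def worst_collision_probability(keys, multipliers):
    """Return (max over pairs x != y of Pr_a[h_a(x) == h_a(y)], a pair attaining it)."""
    multipliers = list(multipliers)
    worst_count, worst_pair = 0, None
    for x, y in combinations(keys, 2):
        count = sum(1 for a in multipliers if h(a, x) == h(a, y))
        if count > worst_count:
            worst_count, worst_pair = count, (x, y)
            if count == len(multipliers):
                break  # collides under every function; cannot be beaten
    return Fraction(worst_count, len(multipliers)), worst_pair


worst, pair = worst_collision_probability(range(1000), range(1, 10))
print(worst, pair, worst <= Fraction(1, 10))
```

The search stops at the keys 000 and 010: both a*0 and a*10 end in 0 for every a, so this pair collides under all nine functions, which is far above 1/10.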

Detect contiguous numbers - MATLAB

I wrote a program that produces binary vectors like this:
out = [0,1,1,0,1,1,1,0,0,0,1,0];
I want to check whether nine 1s occur consecutively in out. For example, when the output is:
out_2 = [0,0,0,0,1,1,1,1,1,1,1,1,1];
or
out_3 = [0,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,0,0,0,1,1,0];
the condition variable should be set to 1. We don't know the position at which the run of ones starts in out; it is random. I only want to detect whether such a run of repeated ones exists (one occurrence or more).
PS. We are looking for a general answer that also finds runs of other repeated numbers (not only 1, and not only in binary data; this is just an example).
You can use convolution to solve such r-contiguous detection cases.
Case #1 : To find contiguous 1s in a binary array -
check = any(conv(double(input_arr),ones(r,1))>=r)
Sample run -
input_arr =
0 0 0 0 1 1 1 1 1 1 1 1 1
r =
9
check =
1
Case #2 : For detecting any number as contiguous, you could modify it a bit, like so -
check = any(conv(double(diff(input_arr)==0),ones(1,r-1))>=r-1)
Sample run -
input_arr =
3 5 2 4 4 4 5 5 2 2
r =
3
check =
1
To save Stackoverflow from further duplicates, also feel free to look into related problems -
Fast r-contiguous matching (based on location similarities).
r-contiguous matching, MATLAB.
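For readers without MATLAB, the same two cases can be sketched in plain Python. Convolving with ones(r) is the same as taking a sliding-window sum of width r, so the sketch below uses a running window sum instead of a convolution call; the function names are my own.

```python
def has_run_of_ones(arr, r):
    """Case #1: True if the binary sequence arr contains r consecutive 1s."""
    arr = list(arr)
    window = sum(arr[:r])  # sum of the first window (shorter than r if arr is short)
    if window >= r:
        return True
    for i in range(r, len(arr)):
        window += arr[i] - arr[i - r]  # slide the window one step to the right
        if window >= r:
            return True
    return False


def has_run_of_equal(arr, r):
    """Case #2: True if arr contains r consecutive equal values (r >= 2)."""
    arr = list(arr)
    # Same trick as diff(input_arr)==0: mark neighbours that are equal,
    # then look for r-1 consecutive marks.
    same = [int(a == b) for a, b in zip(arr, arr[1:])]
    return has_run_of_ones(same, r - 1)


out = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
out_2 = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(has_run_of_ones(out, 9), has_run_of_ones(out_2, 9))
print(has_run_of_equal([3, 5, 2, 4, 4, 4, 5, 5, 2, 2], 3))
```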

Format of training data in CSV type for encog 3.0 and using it

I wonder how I can create a CSV file for storing training data in encog. Currently I have 200 features (f) as inputs and multiple outputs (o) (for example author A, B, C, ...). How should I organize the CSV file? Should it look like this?
f1, f2, f3 ... f200, o1
f1, f2, f3 ... f200, o2
f1, f2, f3 ... f200, o3
Some of my questions are:
Can o1, o2 and o3 be strings (authors' names)?
Will the training CSV file and the testing CSV file have the same format?
Is it possible to feed the NN directly with the CSV file, or must it be converted to a multi-dimensional array as in this example? Since I have 200 features as inputs, this would be quite difficult.
double XOR_INPUT[][] = {
    {0, 0},
    {1, 0},
    {0, 1},
    {1, 1}
};
How do I normalize the data in the CSV file (to the range -1 to +1) using the encog framework?
Thank you very much.
No. A neural network operates only on floating-point numbers, preferably 0 to 1 (output) or -1 to 1 (input). For strings, use 1-of-n encoding.
For example, if your outputs are 'a', 'b', 'c', set them to
1 0 0 = 'a'
0 1 0 = 'b'
0 0 1 = 'c'
You can also add a null class if necessary, for no result found.
You can read the data from csv, but encog is looking for everything in a 2d double array (or more correctly an 'array of arrays').
To simplify things, start with say 10 features.
Normalization is done per feature. For each feature, the formula for normalizing a datapoint a is:
2 * ((a - min) / range) - 1
where range = max - min of that feature.
All input datapoints then fall in the range -1 to 1.
Maybe post a real example of the data, that might give a better impression of what you need to do.
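To make the two preprocessing steps concrete, here is a small sketch in plain Python (not encog code; the function names are hypothetical). It scales each feature column to [-1, 1] with 2 * (a - min) / (max - min) - 1 and turns string labels into 1-of-n vectors. The resulting lists of lists have the same "array of arrays" shape mentioned above.

```python
def normalize_columns(rows):
    """Scale every feature column of rows (list of lists of numbers) to [-1, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        # A constant column has range 0; map it to 0.0 to avoid dividing by zero.
        [2.0 * (v - lo) / rg - 1.0 if rg else 0.0
         for v, lo, rg in zip(row, mins, ranges)]
        for row in rows
    ]


def one_of_n(labels):
    """Return (sorted class names, one 1-of-n vector per label)."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    vectors = [
        [1.0 if index[label] == i else 0.0 for i in range(len(classes))]
        for label in labels
    ]
    return classes, vectors


features = [[0, 10], [5, 20], [10, 30]]   # made-up toy data, 2 features
authors = ["a", "b", "c"]

print(normalize_columns(features))
print(one_of_n(authors))
```

Note that the minimum and range should be taken from the training data and reused for the test data, so both files are scaled the same way.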

Splitting up number by certain amount

I'm trying to split up numbers by a given value (4000) and have the numbers placed in an array
Example:
max value given is: 8202
So the split_array should be split by 4000 unless it gets to the end and it's less than 4000
in which case it just goes to the end.
start_pos, end_pos
0,4000
4001,8001
8002,8202
so the first row in the array would be
[0 4000]
second row would be
[4001 8001]
third row would be
[8002 8202]
Please note that the max value can change from 8202 to any other number, such as 16034, but it is never a decimal.
How can I go about doing this using MATLAB / Octave?
This should produce what you want
n = 8202;
a = [0:4001:n; [4000:4001:n-1 n]]'
returns
a =
0 4000
4001 8001
8002 8202
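The same splitting rule, written out as an explicit loop in Python in case the one-liner is hard to follow. This is only a sketch; `split_ranges` and its `step` parameter are names chosen for illustration.

```python
def split_ranges(n, step=4000):
    """Split 0..n into consecutive [start, end] rows, each spanning at most `step`."""
    rows = []
    start = 0
    while start <= n:
        end = min(start + step, n)  # the last row simply stops at n
        rows.append([start, end])
        start = end + 1             # the next row begins right after this one
    return rows


for row in split_ranges(8202):
    print(row)
```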

How to create a unique integer from 3 different integer values (1 Oracle Long, 1 Date field, 1 Short)

The first number is already an Oracle LONG,
the second is a Date (SQL DATE, no extra timestamp info), and the last is a Short value in the range 1000-100'000.
How can I create some sort of hash value that is unique for each combination?
I don't want string concatenation followed by conversion to long,
because of cases like this:
Day Month
12 1 --> 121
1 12 --> 121
When you have a few numeric values and need to have a single "unique" (that is, statistically improbable duplicate) value out of them you can usually use a formula like:
h = (a*P1 + b)*P2 + c
where P1 and P2 are well-chosen numbers. If you know the ranges, use them: P1 must exceed the largest value of b and P2 the largest value of c (e.g. if you know 'b' is always in the 1-31 range, you can use P1=32). When you know nothing particular about the allowable ranges of a, b, c, a common approach is to pick big prime numbers for P1 and P2, which tends to keep collisions rare.
For an optimal solution the math is a bit more complex than that, but using prime numbers you can usually have a decent solution.
For example, Java implementation for .hashCode() for an array (or a String) is something like:
h = 0;
for (int i = 0; i < a.length; ++i)
h = h * 31 + a[i];
Personally, though, I would have chosen a prime bigger than 31, since values inside a String can easily collide: a difference of 31 between character codes is quite common, e.g.:
"BB".hashCode() == "Aa".hashCode() == 2112
Your
12 1 --> 121
1 12 --> 121
problem is easily fixed by zero-padding your input numbers to the maximum width expected for each input field.
For example, if the first field can range from 0 to 10000 and the second field can range from 0 to 100, your example becomes:
00012 001 --> 00012001
00001 012 --> 00001012
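Zero-padding to fixed widths and the formula h = (a*P1 + b)*P2 + c with range-sized multipliers are the same idea: each field gets its own "digit position". A minimal Python sketch of this for the three fields in the question follows. The span constants are assumptions of mine (dates are converted to day numbers with `date.toordinal()`, and the third value is assumed to lie in 0..100000), and Python integers never overflow, whereas a 64-bit column or a Java long would need the ranges checked against its limit.

```python
from datetime import date

DAY_SPAN = 4_000_000   # assumption: larger than any date.toordinal() value
SHORT_SPAN = 100_001   # assumption: third value lies in 0..100000


def pack(a, d, s):
    """Combine (a, date d, s) into one integer: (a*DAY_SPAN + day)*SHORT_SPAN + s."""
    day = d.toordinal()
    if a < 0 or not 0 <= day < DAY_SPAN or not 0 <= s < SHORT_SPAN:
        raise ValueError("value outside the assumed range")
    return (a * DAY_SPAN + day) * SHORT_SPAN + s


def unpack(key):
    """Recover (a, date, s) from a key produced by pack()."""
    rest, s = divmod(key, SHORT_SPAN)
    a, day = divmod(rest, DAY_SPAN)
    return a, date.fromordinal(day), s


k1 = pack(42, date(2020, 1, 12), 1000)   # 12 January
k2 = pack(42, date(2020, 12, 1), 1000)   # 1 December
print(k1 != k2, unpack(k1))
```

Because every field stays inside its own span, distinct combinations always give distinct keys, and the key can be decoded again, unlike a general hash.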
In python, you can use this:
#pip install pairing
import pairing as pf
n = [12,6,20,19]
print(n)
key = pf.pair(pf.pair(n[0], n[1]),
              pf.pair(n[2], n[3]))
print(key)
m = [pf.depair(pf.depair(key)[0]),
     pf.depair(pf.depair(key)[1])]
print(m)
Output is:
[12, 6, 20, 19]
477575
[(12, 6), (20, 19)]