DESeq2 - organizing data: count data with more columns than metadata rows, remove TCGA IDs that don't match between datasets

I am in the process of running a differential expression analysis (DEA) with DESeq2 on some lung squamous cell carcinoma data from Broad Firehose.
I am using the RNA-seq raw counts for the count data; my metadata is generated from Broad Firehose CNV data.
The two datasets are related by TCGA IDs and each contains matching IDs, but the raw counts data has several "extra" TCGA IDs that don't exist in the metadata: counts length = 552, meta length = 501.
I need to get my metadata and raw counts in the same order, keep only the matching TCGA IDs, and drop the IDs/samples that don't match.
I've been trying different approaches using match and %in%. I can identify the positions where the raw count data does not have the same TCGA IDs as the metadata, but I can't wrap my head around how to take the raw counts and drop the IDs/samples that do not match the samples in the metadata.
Any ideas on how to match up two datasets and eliminate the rows/columns that don't match would help.
colnames(lusc_reads)
[1] "TCGA-18-3406" "TCGA-18-3407" "TCGA-18-3408" "TCGA-18-3409" "TCGA-18-3410"
rownames(lusc_meta)
[1] "TCGA-60-2722" "TCGA-43-7657" "TCGA-58-A46N" "TCGA-NC-A5HL" "TCGA-63-A5MB"
match(colnames(lusc_reads), rownames(lusc_meta))
[1] 318 265 114 372 353 150 8 287 215 57 199 268 239 179 164 249 383 17 274

Assuming lusc_reads and lusc_meta are data frames, this should work:
# keep only the count columns whose TCGA IDs appear in the metadata
samples_to_keep <- rownames(lusc_meta)
lusc_reads_new <- lusc_reads[colnames(lusc_reads) %in% samples_to_keep]
To put them in the same order as the metadata:
# reorder the remaining columns to match the metadata row order
lusc_reads_new <- lusc_reads_new[, match(samples_to_keep, colnames(lusc_reads_new))]
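As a sanity check afterwards, and to sketch how this feeds into DESeq2 (the design column cnv_group below is a placeholder for whatever grouping variable your CNV metadata actually holds):
# columns of the counts should now line up 1:1 with the metadata rows
all(colnames(lusc_reads_new) == rownames(lusc_meta))   # should be TRUE

library(DESeq2)
# cnv_group is a hypothetical column name - replace with your own condition
dds <- DESeqDataSetFromMatrix(countData = lusc_reads_new,
                              colData   = lusc_meta,
                              design    = ~ cnv_group)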

Related

kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
     tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string of the path to the directory where all your flat tables are stored, then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This avoids having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a file symbol) and returns its number of rows without keeping the table in memory afterwards.
The best solution, if you have control of the process that saves these files down, is to add an extra step of saving the meta information (the row count) somewhere separate at the same time as saving the raw data. This would allow you to quickly access the table size from that file instead of reading the full table in each time.
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
UPDATE: I would not recommend doing this and strongly suggest the approaches above, but out of curiosity, after looking into read1, here's an example of what a hacky solution might look like:
f:{
  / read a chunk of x bytes from the front of file y
  b:read1(y;0;x);
  / validate it looks like a flat unkeyed table (flip of a dict with symbol keys)
  if[not 0x62630b~b[2 4 5];'`$"not a table"];
  / number of column names: little-endian 4-byte int at offset 7
  cc:first first((),"i";(),4)1:b 7+til 4;
  / find the end of the null-terminated column names; if the chunk is too small, retry with double the size
  if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
  / split the names on the null separators
  c:`$"\000" vs "c"$b[11+til ce];
  / row count from the header of the first column
  n:first first((),"i";(),4)1:b[(20+ce)+til 4];
  :`columns`rows!(c;n);
 }[2000]
The q binary file format isn't documented anywhere; the only way to figure it out is to save different things and see how the bytes change. It's also subject to change between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
- reads a chunk from the front of the file
- validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
- reads the count of symbols in the list of columns as a little-endian 4-byte int
- scans through the null-terminated strings for that many zero bytes
- if the columns are so numerous or verbose that we don't have them all in this chunk, it doubles the size of the chunk and tries again
- turns the strings into symbols
- using the offset we get from the end of the column list, skips a bit more of the header for the mixed list of columns
- then reads the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and column headings which will vary from table to table.
q)`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b0005000000610062006300616100626200000005000000090003000000
(key byte offsets: 0, 7, 11 and 31; see the breakdown below)
bytes   | example                    | meaning
--------|----------------------------|------------------------------------------------------------
0 - 5   | 0xff016200630b             | object is a flat table
7 - 10  | 0x05000000                 | number of columns (5)
11 - 22 | 0x610062006300616100626200 | ascii values of the column names "a","b","c","aa","bb", each followed by a one-byte null separator
23 - 30 | 0x0000050000000900         | 8 bytes that can be skipped
31 - 34 | 0x03000000                 | 4 bytes for the row count of the first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the least-significant byte comes first: reading the hex dump left to right, the most-significant byte of a number is the right-hand one. Doing the same in decimal for the number 100 would give 001, with the hundreds (most significant) on the right, then the tens, and finally the ones on the left. In the binary file, each group of 2 hex digits is one byte.
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset (where to start reading from) and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying that the output should be an integer and that the input should be cut into separate 4-byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into an integer gives us the row count. Since this only has to read in 4 bytes instead of the whole file it is a lot quicker.
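For convenience this read can be wrapped in a small helper; just a sketch of the same idea, assuming the flat tables follow the layout above:
q)rowcount:{first first(enlist "i";enlist 4)1:(x;31;4)}   / 4-byte little-endian int at offset 31
q)rowcount `:test
3i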
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead you can read in the number of elements in one of the columns from bytes 8-11 like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have more efficient by using the following:
counttables:{count each get each hsym `$basepath,/:system "ls ",basepath}
This will improve the speed of the count by not including the extra read of the data as well as the join which you are currently doing. You are correct though that if the tables were saved splayed you would only have to read in one column, making it much more efficient.
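If you want to keep the filenames alongside the counts, a sketch (assuming basepath is defined as in the question) would be:
q)(`$system "ls ",basepath)!counttables[]
which gives the same filename-to-row-count dictionary as the key-based approach above.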
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

bulk import 80 lines of data via API

I have a tool that every x hours creates a set of y lines that I would simply like to add to a column in a specific smartsheet. Then, every x hours, I would like to overwrite these values with the new ones, which can have a different number of lines.
As I read the API, in order to add or update anything I need to get all the row and column IDs of the smartsheet in question.
Isn't there an easy way to formulate a JSON payload with a set of data and a column name so that it just auto-adds the rows as needed?
Data example is:
21
23
43
23
12
23
43
23
12
34
54
23
and then it could be:
23
23
55
4
322
12
3
455
3
AUTO
I really find it hard to believe that I need to read so much information into a script to be able to add just a row of data. Nothing fancy.
Looking into sticking to just using cURL or Python
Thanks
If you want to add this data as new rows, this is fairly simple. It's only if you would like to replace existing data in existing rows that you would need to specify the row id.
The python SDK allows you to specify just a single column id, like so:
row_a = smartsheet.models.Row()
row_a.cells.append({
    'column_id': 642523719853956,
    'value': 'New Status',
    'strict': False
})
For more details, please see the API documentation regarding adding rows.
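For a rough end-to-end sketch with the Python SDK (the access token, sheet id and column id below are placeholders, and deleting the previous run's rows before re-adding is just one way to handle the overwrite every x hours):
import smartsheet

client = smartsheet.Smartsheet('YOUR_ACCESS_TOKEN')   # placeholder token
SHEET_ID = 1234567890123456                           # placeholder sheet id
COLUMN_ID = 642523719853956                           # placeholder column id

new_values = [21, 23, 43, 23, 12]                     # whatever your tool produced this run

# remove the rows added last run so the new batch replaces them
sheet = client.Sheets.get_sheet(SHEET_ID)
old_row_ids = [row.id for row in sheet.rows]
if old_row_ids:
    client.Sheets.delete_rows(SHEET_ID, old_row_ids)

# build one Row per value; only the column id and value are needed
rows = []
for value in new_values:
    row = smartsheet.models.Row()
    row.to_bottom = True
    row.cells.append({'column_id': COLUMN_ID, 'value': value, 'strict': False})
    rows.append(row)

client.Sheets.add_rows(SHEET_ID, rows)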

reshape and merge in stata

I have three data sets:
First, education.dta. It contains individuals (students) over many years with their achieved education from 1990-2000. Originally it is in wide format, but I can easily reshape it to long. It is shown in wide format below:
id educ_90 educ_91 ... educ_00 cohort
1 0 1 1 87
2 1 1 2 75
3 0 0 2 90
Second, graduate.dta. It contains information on when individuals (students) finished high school. However, this data set does not cover several years; it is only a "snapshot" of the individual when they finish high school, plus characteristics of the individual students such as background (for example parents' occupation).
id schoolid county cohort ...
1 11 123 87
2 11 123 75
3 22 243 90
The third data set is called teachers.dta. It contains information about all teachers at the high schools, such as their education, whether they work full or part time, gender... This data set is long.
id schoolid county year education
22 11 123 2011 1
21 11 123 2001 1
23 22 243 2015 3
Now I want to merge these three data sets.
First, I want to merge education.dta and graduate.dta on id.
Problem when education.dta is wide: I manage to merge education.dta and graduate.dta. Then I make a loop so that each variable from graduate.dta takes the same value over all years, for example:
forv j=1990/2000 {
    gen county`j'=.
    replace county`j'=county
}
However, afterwards when reshaping to long, Stata reports that variable id does not uniquely identify the observations.
Further, I have tried to first reshape education.dta to long, and thereafter merge either 1:m or m:1 with education as the master, using graduate.dta.
However, Stata again reports that id is not unique. How do I deal with this?
In next step I want to merge the above with teachers.dta on schoolid.
I want my final dataset in long format.
Thanks for your help :)
I am not certain that I have exactly the format of your data; it would be helpful if you gave us a toy dataset to look at using dataex (which could even help you figure out the problem yourself!).
But to start, because you are seeing that id is not unique, you need to figure out why there might be multiple ids in any of the datasets. Can someone in graduate.dta or education.dta appear more than once? help duplicates will probably be useful to explore the data in this way.
Because you want your dataset in long format, I suggest reshaping education.dta to long first, then doing something like merge m:1 id using "graduate.dta" (once you figure out why some observations show up more than once) and then, finally, something like merge 1:1 schoolid year using "teachers.dta", and you will have your final dataset.
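In case it helps, here is a rough sketch of that sequence (file and variable names are taken from the excerpts above; I use merge m:1 for the teacher step since the long education data has many students per school and year, and note that the education years are two-digit while teachers.dta shows four-digit years, so the year variable may need recoding before the last merge):
* check whether id really is unique in graduate.dta
use graduate.dta, clear
duplicates report id

* reshape education to long, then merge on id, then on school and year
use education.dta, clear
reshape long educ_, i(id) j(year)            // year takes the values 90, 91, ..., 0
merge m:1 id using graduate.dta, nogenerate  // many year-rows per student, one graduate.dta row
merge m:1 schoolid year using teachers.dta   // assumes one teacher record per school and year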

Cannot allocate memory for a column of compound floats on a partitioned table

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element). eg each element looks like
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
EDIT
.Q.ind on large days fails with the same error.
ie given I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
{0!$[#.Q.pm;p3;(?).]#[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
#
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;())
I should note I am using the free 32-bit version
I think this is all just a combination of the free 32-bit memory limitation, the fact that your row counts are possibly large, and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column: either the column itself (in the non-nested case) or the nested-index column.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table only contains one more row than yesterday's, the memory requirements per column could double. Take a simple example:
In the free 32-bit version (Windows) you can create this many floats and it only uses ~1.07GB of memory:
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit
q)\ts 134217727?1.0
wsfull
So even a small amount of rows in the difference between one day and the next can be very significant if you're near the boundary of allocatable powers of two.
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) are index 3, index 5, 10, etc.
Say I want the first 3 rows
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10
Then I know that I need the first 10 floats from the col1# file and need to split them at indices 3 and 5. So I can read the col1# file partially and split it correctly:
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what KDB does under the covers anyway so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!
Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0])
Your best bet is to map a date partition in with a select, and then query within that chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of the nested floats).
Perhaps also consider 32bit floats, if some decimal accuracy can be sacrificed.
EDIT
So after the comments, I guess the best way is to go through each partition a number of lines at a time with .Q.ind.
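A rough sketch of what that chunked .Q.ind loop might look like (the table, column and block size are placeholders, and depending on how the nested-index file is mapped this may still run into the 32-bit limit):
blk:1000000                                  / rows per chunk
t:select from table where date=2015.02.02    / mapped partition
n:count t
starts:blk*til ceiling n%blk
res:raze {[t;blk;s] exec column[;0] from .Q.ind[t;s+til blk&count[t]-s]}[t;blk] each starts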
Just to give my 2 cents on this, I had a similar error but with a 64-bit instance.
I suspected that the memory needed to be de-fragmented as it was running for almost a year.
Bouncing the instance solved the issue, and released a lot of virtual memory

SAS: Combining two data sets with different format

I have two datasets that are formatted differently
data1 looks like:
data1:
YYMM test1
1101 98
1102 98
1103 94
1104 92
1105 99
1106 91
data2 is just a single grand mean that looks like:
data2:
GM
95
I would like to combine the two and have something that looks like this:
WANT:
YYMM test1 GM
1101 98 95
1102 98 95
1103 94 95
1104 92 95
1105 99 95
1106 91 95
I'm sure there are different ways to go about configuring this but I thought I should make the 95 into a column and merge with data1.
Do I have to use a macro for this simple task? Please shed some light!
One straightforward way is to just merge without a by statement and use retain:
data WANT (drop=temp);
    merge DATA1 DATA2 (rename=(GM=temp));
    retain GM;
    if _N_=1 then GM=temp;
run;
So basically you put the two datasets together.
Because there is no by-statement, it will join together the first record of both datasets, the second record of both datasets and so on.
At the first record (if _N_=1), you grab the average and you put it in a variable whose last value will be remembered (retain GM).
So in record 2, 3 etc, the value will still be what you put into it at record 1.
To keep it all clean, I renamed your GM variable on the input, so the name was available to use for the retained variable. And of course, I dropped the redundant variable.
You can also approach this issue with a macro variable or PROC SQL. But it's better to keep it simple.
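For reference, the PROC SQL version mentioned above is just a cross join against the one-row dataset; a sketch would be:
proc sql;
    create table want as
    select a.*, b.GM
    from data1 as a, data2 as b;
quit;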
Here's a similar way that's slightly simpler.
data want;
    set data1;
    if _n_=1 then set data2;
run;