SAS: Combining two data sets with different formats - merge

I have two datasets that are formatted differently.
data1 looks like:
data1:
YYMM test1
1101 98
1102 98
1103 94
1104 92
1105 99
1106 91
data2 is just a single grand mean that looks like:
data2:
GM
95
I would like to combine the two and have something that looks like this:
WANT:
YYMM test1 GM
1101 98 95
1102 98 95
1103 94 95
1104 92 95
1105 99 95
1106 91 95
I'm sure there are different ways to go about this, but I thought I should make the 95 into a column and merge it with data1.
Do I have to use a macro for this simple task? Please show me some light!

One straightforward way is to merge without a BY statement and use RETAIN:
data WANT (drop=temp);
  merge DATA1 DATA2 (rename=(GM=temp));
  retain GM;
  if _N_=1 then GM=temp;
run;
So basically you put the two datasets together.
Because there is no BY statement, SAS pairs the first record of both datasets, the second record of both datasets, and so on.
At the first record (if _N_=1), you grab the average and put it in a variable whose last value will be remembered (retain GM).
So in records 2, 3, etc., the value will still be what you assigned at record 1.
To keep it all clean, I renamed your GM variable on input, so the name was available for the retained variable. And of course, I dropped the redundant temp variable.
You can also approach this with a macro variable or PROC SQL. But better to keep it simple.
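For comparison, the same "read the single value once and carry it onto every row" logic can be sketched in plain Python (hypothetical data structures, using the values from the question):

```python
# Plain-Python sketch of the RETAIN idea above: read the grand mean once,
# then attach it to every monthly record (data taken from the question).
data1 = [{"YYMM": 1101, "test1": 98}, {"YYMM": 1102, "test1": 98},
         {"YYMM": 1103, "test1": 94}, {"YYMM": 1104, "test1": 92},
         {"YYMM": 1105, "test1": 99}, {"YYMM": 1106, "test1": 91}]
data2 = {"GM": 95}

# Like RETAIN: the single value is read once and reused for every row.
want = [dict(row, GM=data2["GM"]) for row in data1]
```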

Here's a similar way that's slightly simpler. The second SET executes only on the first iteration, and variables read with SET are automatically retained, so GM keeps its value for every row:
data want;
set data1;
if _n_=1 then set data2;
run;

Sorting KDB table while excluding Total row

I have noticed that when using xasc, capital letters take precedence over lower case.
I'm trying to exclude Total from being considered when doing the sort, and want to avoid using "lower" and then recapitalizing it again. I have my solution below, but it's rather poor code:
t:flip (`active`price`price2)!(`def`abc`xyz`hij`Total;12j, 44j, 468j, 26j, 550j;49j, 83j, 716j, 25j, 873j)
Thinking there's a better way than this
(`active xasc select from t where not active=`Total),select from t where active=`Total
Although it does not match the sort order of your example answer, if you're looking to sort in true lexicographical order ignoring capitals, you could do the following:
q)t:([]active:`def`abc`xyz`hij`Total;price:12 44 468 26 550;price2:49 83 716 25 873)
q)t iasc lower t`active
active price price2
-------------------
abc 44 83
def 12 49
hij 26 25
Total 550 873
xyz 468 716
Otherwise, if you're looking to have the Total row at the bottom after the sort, then you will need to append it after sorting - given your example table:
q)(select[<active]from t where active<>`Total),select from t where active=`Total
active price price2
-------------------
abc 44 83
def 12 49
hij 26 25
xyz 468 716
Total 550 873
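The same "sort everything else, then append Total" idea, sketched in Python with the question's data for illustration:

```python
# Sort case-insensitively, keeping the Total row at the bottom
# (data copied from the q table in the question).
rows = [("def", 12, 49), ("abc", 44, 83), ("xyz", 468, 716),
        ("hij", 26, 25), ("Total", 550, 873)]

# Sort the non-Total rows by their lowercased label...
ordered = sorted((r for r in rows if r[0] != "Total"),
                 key=lambda r: r[0].lower())
# ...then append the Total row(s) at the end.
ordered += [r for r in rows if r[0] == "Total"]
```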
There isn't really a much cleaner way to do it, but this approach ensures Total is at the bottom without needing two selects (it needs a group and a sort instead):
q)raze`active xasc/:t group`Total=t`active
active price price2
-------------------
abc 44 83
def 12 49
hij 26 25
xyz 468 716
Total 550 873
Matthew's is probably the best all-round solution.
If you know Total is always going to end up first after the sort, then:
{1_x,1#x}`active xasc t / sort, join the first row to the end, drop the first row
is a pretty concise solution. This is obviously not ideal if you don't have control over the active column's contents, as other uppercase entries would make it unpredictable.

DESeq2 - organizing data - count data with more columns than meta data row, remove TCGA IDs that don't match between datasets

I am in the process of running a DEA with DESeq2 on some lung squamous cell carcinoma data from Broad Firehose.
I have used the RNA-seq raw counts for the count data; my metadata is generated from Broad Firehose data on CNV.
The two datasets are related by TCGA IDs. Each dataset contains matching IDs, but the raw counts data has several "extra" TCGA IDs that don't exist in the metadata: counts length = 552, meta length = 501.
I need to get my metadata and raw counts data in the same order, keep only the matching TCGA IDs, and drop the IDs/samples that don't match.
I've been trying different approaches using match and %in%. I can identify the positions where the raw count data does not have the same TCGA IDs as the metadata, but can't wrap my head around how to take the raw counts and drop the IDs/samples that do not match the samples in the metadata.
Any ideas on how to match up the two datasets and eliminate the rows/columns that don't match would help.
colnames(lusc_reads)
[1] "TCGA-18-3406" "TCGA-18-3407" "TCGA-18-3408" "TCGA-18-3409" "TCGA-18-3410"
rownames(lusc_meta)
[1] "TCGA-60-2722" "TCGA-43-7657" "TCGA-58-A46N" "TCGA-NC-A5HL" "TCGA-63-A5MB"
match(colnames(lusc_reads), rownames(lusc_meta))
[1] 318 265 114 372 353 150 8 287 215 57 199 268 239 179 164 249 383 17 274
Assuming lusc_reads and lusc_meta are data frames, this should work:
samples_to_keep <- rownames(lusc_meta)
lusc_reads_new <- lusc_reads[, colnames(lusc_reads) %in% samples_to_keep]
To put them in the same order as the metadata:
lusc_reads_new <- lusc_reads_new[, match(samples_to_keep, colnames(lusc_reads_new))]
You can confirm the alignment DESeq2 expects with all(colnames(lusc_reads_new) == rownames(lusc_meta)).

How to solve the below scenario using transformer loop or anything in datastage

My data is like below, in one column, coming from a file.
Source_data (this is the column name)
CUSTOMER 15
METER 8
METERStatement 1
READING 1
METER 56
Meterstatement 14
Reading 5
Reading 6
Reading 7
CUSTOMER 38
METER 24
METERStatement 1
READING 51
CUSTOMER 77
METER 38
READING 9
I want the output data to be like below in one column
CUSTOMER 15 METER 8 METERStatement 1 READING 1
CUSTOMER 15 METER 56 Meterstatement 14 Reading 5
CUSTOMER 15 METER 56 Meterstatement 14 Reading 6
CUSTOMER 15 METER 56 Meterstatement 14 Reading 7
CUSTOMER 38 METER 24 Meterstatement 1 Reading 51
CUSTOMER 77 METER 38 'pad 100 spaces' Reading 9
I am trying to solve this by reading the transformer looping documentation but could not figure out an actual solution. Anything helps. Thank you all.
Yes, this can be solved within a transformer stage.
Concatenation is done with ":".
So use a stage variable to concatenate the input until a new "METER" or "CUSTOMER" row comes up.
Save the "CUSTOMER" in a second stage variable in case it does not change.
Use a condition to only output the rows where a "READING" exists.
Reset the concatenated string once a "READING" has been processed.
I guess you want the padding for missing fields in general; you could do these checks in separate stage variables. You have to store the previous item in order to know what is missing, and maybe even more state if two consecutive items could be missing.
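The steps above can be sketched in plain Python (the real work would live in DataStage stage variables, with ":" for concatenation; field names are taken from the question):

```python
# Plain-Python sketch of the stage-variable logic: carry the current
# CUSTOMER / METER / METERStatement context forward, emit one flattened
# row per READING, and pad 100 spaces when the statement is missing.
def flatten(lines):
    customer = meter = statement = ""
    out = []
    for line in lines:
        key = line.split()[0].upper()
        if key == "CUSTOMER":
            customer, meter, statement = line, "", ""  # new customer resets everything
        elif key == "METER":
            meter, statement = line, ""                # new meter resets its statement
        elif key == "METERSTATEMENT":
            statement = line
        elif key == "READING":
            out.append(f"{customer} {meter} {statement or ' ' * 100} {line}")
    return out
```

Feeding it the sample column from the question produces the six flattened rows of the expected output, including the 100-space pad for the customer with no meter statement.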

bulk import 80 lines of data via API

I have a tool that every x hours creates a set of y lines that I would simply like to add to a column in a specific smartsheet. Then every x hours I would like to overwrite these values with the new ones, which can have a different number of lines.
As I read the API, in order to add or update anything I need to get all the row and column IDs of the smartsheet in question.
Isn't there an easy way to formulate a JSON with a set of data and a column name and have it auto-add the rows as needed?
Data example is:
21
23
43
23
12
23
43
23
12
34
54
23
and then it could be:
23
23
55
4
322
12
3
455
3
AUTO
I really find it hard to believe that I need to read so much information into a script just to add a row of data. Nothing fancy.
Looking into sticking to just using cURL or Python
Thanks
If you want to add this data as new rows, it's fairly simple. It's only if you want to replace existing data in existing rows that you need to specify the row ID.
The Python SDK lets you specify just a single column ID, like so:
row_a = smartsheet.models.Row()
row_a.cells.append({
    'column_id': 642523719853956,
    'value': 'New Status',
    'strict': False
})
For more details, please see the API documentation regarding adding rows.
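If you'd rather stick with cURL, you can build the JSON body for the Add Rows endpoint from a plain list of values. A sketch (the column ID here is the example value from the snippet above; substitute your sheet's real column ID):

```python
import json

# Build the Add Rows request body from a bare list of values:
# one row per value, each row holding a single cell in the target column.
column_id = 642523719853956  # example id; use your sheet's real column id
values = [21, 23, 43, 23, 12]

rows = [{"toBottom": True, "cells": [{"columnId": column_id, "value": v}]}
        for v in values]
body = json.dumps(rows)
```

POST that body to the sheet's rows endpoint. To overwrite on the next run you would still need to delete the previously added rows first; their IDs are returned in the add-rows response, so the script can remember them without fetching the whole sheet.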

reshape and merge in stata

I have three data sets:
First, education.dta. It contains individuals (students) over the years 1990-2000 with their achieved education. Originally it is in wide format, but I can easily reshape it to long. It is presented as wide below:
id educ_90 educ_91 ... educ_00 cohort
1 0 1 1 87
2 1 1 2 75
3 0 0 2 90
Second, graduate.dta. It contains information on when individuals (students) finished high school. However, this data set does not cover several years, only a "snapshot" of the individual when they finish high school, plus characteristics of the individual students such as background (for example, parents' occupation).
id schoolid county cohort ...
1 11 123 87
2 11 123 75
3 22 243 90
The third data set is teachers.dta. It contains information about all teachers at high schools, such as their education, whether they work full or part time, gender... This data set is long.
id schoolid county year education
22 11 123 2011 1
21 11 123 2001 1
23 22 243 2015 3
Now I want to merge these three data sets.
First, I want to merge education.dta and graduate.dta on id.
Problem when education.dta is wide: I manage to merge education.dta and graduate.dta. Then I make a loop so that all the variables from graduate.dta take the same value over all years, for example:
forv j=1990/2000 {
  gen county`j' = .
  replace county`j' = county
}
However, afterwards, when reshaping to long, Stata reports that variable id does not uniquely identify the observations.
Further, I have tried to first reshape education.dta to long, and thereafter merge either 1:m or m:1 with education as master, using graduate.dta.
However, Stata again reports that id is not unique. How do I deal with this?
In next step I want to merge the above with teachers.dta on schoolid.
I want my final dataset in long format.
Thanks for your help :)
I am not certain that I have exactly the format of your data; it would be helpful if you gave us a toy dataset to look at using dataex (which could even help you figure out the problem yourself!).
But to start: because you are seeing that id is not unique, you need to figure out why there might be multiple observations per id in any of the datasets. Can someone in graduate.dta or education.dta appear more than once? help duplicates will probably be useful for exploring the data in this way.
Because you want your dataset in long format, I suggest reshaping education.dta to long first, then doing something like merge m:1 id using "graduate.dta" (once you figure out why some observations are showing up more than once), and then, finally, something like merge 1:1 schoolid year using "teacher.dta", and you will have your final dataset.
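The suggested order of operations (reshape to long first, then a many-to-one merge on id) can be illustrated with a rough plain-Python sketch on toy data:

```python
# Toy data mimicking the two Stata files (hypothetical values).
educ_wide = {1: {"educ_90": 0, "educ_91": 1},
             2: {"educ_90": 1, "educ_91": 1}}
graduate = {1: {"schoolid": 11, "county": 123},
            2: {"schoolid": 11, "county": 123}}

# "reshape long": one record per (id, year).
educ_long = [{"id": i, "year": 1900 + int(col.split("_")[1]), "educ": v}
             for i, row in educ_wide.items() for col, v in row.items()]

# "merge m:1 id": many yearly records each pick up the single
# graduate record for their id.
merged = [dict(rec, **graduate[rec["id"]]) for rec in educ_long]
```

The point of the m:1 direction is visible here: the long education data has several records per id, while graduate has exactly one, so the merge duplicates the graduate characteristics across years rather than failing on a non-unique key.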