SAS hash table - multiple values for a single key and find_next

I am using a hash table to look up values and determine whether there is a match. Below is my test code.
What I am trying to do is check whether students in roster have entries in grade_roster; if a student has entries in grade_roster, I attach the grade_on_report to a new dataset that is a copy of roster.
name is my key and the grades are the values. One name may have several grades, i.e. multiple values for a single key. I was able to find all names in roster that have a match in grade_roster using find_next(), but I could not attach the right grade to the new dataset.
It seems that whenever find_next() is called, the value for the key is set to the next item in the list, so the last value in the list ends up assigned to every row with that key.
Here is my code:
data roster;
input name $ course $ grade_on_paper $;
datalines;
Mary English A
Mary German B
Josh English B
Lily Spanish B
Lucy Physics C
John Music A
Eric Math A
Eric Music B
;
run;
data grade_roster;
input name $ course $ grade_on_report $;
datalines;
Mary English A
Mary German B
John Music A
Eric Math A
Eric Music B
;
run;
data assign_grade;
set roster;
format grade_on_report $1.;
declare hash ht1(dataset:"grade_roster", multidata:"Y");
ht1.defineKey("name");
ht1.defineData("grade_on_report");
ht1.defineDone();
rc = ht1.find();
do while(rc = 0);
rc = ht1.find_next();
end;
run;
What I got was this:
   name  course   grade_on_paper  grade_on_report  name_found
1  Mary  English  A               B                Y
2  Mary  German   B               B                Y
3  Josh  English  B
4  Lily  Spanish  B
5  Lucy  Physics  C
6  John  Music    A               A                Y
7  Eric  Math     A               B                Y
8  Eric  Music    B               B                Y
What I want is:
   name  course   grade_on_paper  grade_on_report  name_found
1  Mary  English  A               A                Y
2  Mary  German   B               B                Y
3  Josh  English  B
4  Lily  Spanish  B
5  Lucy  Physics  C
6  John  Music    A               A                Y
7  Eric  Math     A               A                Y
8  Eric  Music    B               B                Y
Note: name and course together are not unique identifiers. It seems like they are unique identifiers in this particular test code but they are not unique identifiers in the actual dataset I am working on. The goal is to use name as the only key in ht1.defineKey() and get the correct result.
Any help would be appreciated. Thank you!

If you are trying to find data on a specific student and course, you need to make both name and course part of the key.
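data assign_grade;
  set roster;
  format grade_on_report $1.;
  /* load the hash once, on the first iteration, rather than on every row */
  if _n_ = 1 then do;
    declare hash ht1(dataset:"grade_roster", multidata:"Y");
    ht1.defineKey("name", "course");
    ht1.defineData("grade_on_report");
    ht1.defineDone();
  end;
  /* look up the current name/course pair */
  rc = ht1.find();
  do while(rc = 0);
    rc = ht1.find_next();
  end;
run;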
I tested this on slightly different data:
data roster;
input name $ course $ grade_on_paper $;
datalines;
Mary English A
Mary German B
Josh English B
Lily Spanish B
Lucy Physics C
John Music A
Eric Math A
Eric Music B
;
run;
data grade_roster;
input name $ course $ grade_on_report $;
datalines;
Mary German B
Mary English A
John Music A
Eric Math C
Eric Music B
;
run;
and got what I expected.
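To see why the choice of key matters, the lookup can be modeled outside SAS. This is a minimal Python sketch, not SAS code: a dictionary of lists stands in for the multidata hash, and the helper names lookup_by_name and lookup_by_name_and_course are hypothetical. It assumes that each successful find()/find_next() call overwrites grade_on_report, which is what the output in the question suggests.

```python
from collections import defaultdict

# Data from the question (course kept so rows can be told apart).
roster = [
    ("Mary", "English"), ("Mary", "German"), ("Josh", "English"),
    ("Lily", "Spanish"), ("Lucy", "Physics"), ("John", "Music"),
    ("Eric", "Math"), ("Eric", "Music"),
]
grade_roster = [
    ("Mary", "English", "A"), ("Mary", "German", "B"),
    ("John", "Music", "A"), ("Eric", "Math", "A"), ("Eric", "Music", "B"),
]


def lookup_by_name(roster, grade_roster):
    """Model of keying on name only and walking every value for that key."""
    table = defaultdict(list)
    for name, _course, grade in grade_roster:
        table[name].append(grade)
    result = []
    for name, course in roster:
        grade = ""
        for g in table.get(name, []):
            grade = g  # every step overwrites, so the last value wins
        result.append((name, course, grade))
    return result


def lookup_by_name_and_course(roster, grade_roster):
    """Model of keying on (name, course): at most one value per key here."""
    table = {(name, course): grade for name, course, grade in grade_roster}
    return [(name, course, table.get((name, course), ""))
            for name, course in roster]


for row in lookup_by_name(roster, grade_roster):
    print(row)
for row in lookup_by_name_and_course(roster, grade_roster):
    print(row)
```

The first function reproduces the unwanted result (both Mary rows and both Eric rows get B); the second gives the wanted one. If (name, course) can repeat in the real data, the composite key alone will not be enough and some further field would be needed to tell the rows apart.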

Related

Genomic Ranges - Merge Overlaps in Single File (R STUDIO)

I would like to find overlapping regions in a file and merge them, keeping the earlier start and the later stop (i.e. merge two regions into one).
I intend to use GenomicRanges, but I am not sure how to write the script.
This is what the file fileA.txt contains:
chr start end value
chr1 58708485 58708713 1
chr1 58709084 58710538 2
chr1 98766295 98766639 3
chr1 98766902 98770338 4
Script:
library(GenomicRanges)
query = with(fileA.txt, GRanges(chr, IRanges(start=start, end=end)))
subject = with(fileA.txt, GRanges(chr, IRanges(start=start, end=end)))
hits = findOverlaps(gr1)
ranges(query)[queryHits(hits)] = ranges(subject)[subjectHits(hits)]
I am not sure how to set query and subject for a single file. Also, since the object is a file, does it need quotes or a specific format (are bedGraph and txt fine?) in order to be recognised in the script?
Thank you a lot in advance for your help!
K.
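The merge itself is a standard interval-merging step, sketched here in Python rather than R to show the logic. This is only an illustration under assumptions: the function name merge_overlaps is hypothetical, coordinates are treated as inclusive, the value column is ignored, and regions that merely touch without sharing a position are left separate. In GenomicRanges, the reduce() function is the usual way to collapse overlapping ranges in a single GRanges object, so no separate query and subject should be needed.

```python
from collections import defaultdict


def merge_overlaps(regions):
    """Merge overlapping (chrom, start, end) regions within each chromosome.

    Keeps the earliest start and the latest end of each overlapping group.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps the open group: extend it to the later end.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Made-up coordinates, chosen so that the first two regions overlap.
example = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500)]
print(merge_overlaps(example))
```

Note that the four rows shown in fileA.txt do not overlap one another, so this sketch would return them unchanged.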

Tcl scripting using grep,sed

I have to find a particular string first, then pick up a value from that line; the value changes after every run. How can I do this?
Stage pro begin_time { size is 54mb and 3 sec }
Stage pro1 begin_time {.....}
I have a very big file. I have to look specifically for the begin_time string first, then find its value and store it somewhere. How can I do this?
I have tried using grep, but I don't know how to store the value 3 here. I have many similar cases; how can I handle this situation?
Thanks for the reply. I want to store the value 3 against pro, and some other value against pro1. Also, the time mentioned here is a start time. I have the same information for each of the stages pro and pro1, but with an end time. I have to extract the end time as well, and then calculate the net time elapsed, i.e. the time between the start time and the end time.
cat aaa
Stage pro begin_time { size is 54mb and 3 sec }
Stage pro1 begin_time {.....}
cat aaa |awk '$3=="begin_time" {print $9}' |grep .
3
If you wish to store it:
var=$(cat aaa |awk '$3=="begin_time" {print $9}' |grep .)
echo $var
3
If you wish to map pro with 3 then,
cat aaa | awk '$3=="begin_time" {print $2,$9}' |grep .
pro 3
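For the follow-up about elapsed time, here is a rough Python sketch (not Tcl, and not part of the awk answer above). It rests on assumptions the question does not confirm: that the end lines look exactly like the begin lines but use a keyword end_time, and that the time is the number right before "sec". The end_time line in the sample and the function name elapsed_per_stage are made up for illustration.

```python
import re

# Stage name, begin/end keyword, and the number that directly precedes "sec".
LINE_RE = re.compile(
    r"^Stage\s+(\S+)\s+(begin_time|end_time)\s*\{.*?(\d+(?:\.\d+)?)\s*sec"
)


def elapsed_per_stage(lines):
    """Return {stage: end - begin} for stages that have both times."""
    times = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # no time on this line, e.g. "{.....}"
        stage, kind, seconds = m.group(1), m.group(2), float(m.group(3))
        times.setdefault(stage, {})[kind] = seconds
    return {
        stage: t["end_time"] - t["begin_time"]
        for stage, t in times.items()
        if "begin_time" in t and "end_time" in t
    }


log = [
    "Stage pro begin_time { size is 54mb and 3 sec }",
    "Stage pro1 begin_time {.....}",
    "Stage pro end_time { size is 54mb and 10 sec }",  # hypothetical end line
]
print(elapsed_per_stage(log))
```

For a very large file the same function can be fed the open file object line by line instead of a list, so the whole file never has to sit in memory.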

Title the Result from SphinxSearch

Is there any way to title-case the search results?
Example:
select name from person ;
Will give
abhilash joseph c
But I want it to be 'Abhilash Joseph C'.
Can I write a UDF for Sphinx? How can I do it?
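One option that sidesteps the UDF question is to title-case the names in the client after the query returns. A minimal Python sketch, assuming the result rows arrive as plain strings; it uses only the standard library and no Sphinx API, and the function name title_case is hypothetical.

```python
import string


def title_case(name):
    """Capitalize the first letter of each whitespace-separated word."""
    return string.capwords(name)


rows = ["abhilash joseph c"]  # stand-in for the rows returned by the query
print([title_case(row) for row in rows])
```

string.capwords also lowercases the rest of each word and collapses repeated spaces, which may or may not be wanted for names such as "McDonald".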

Stata mmerge update replace gives wrong output

I wanted to test what happened if I replace a variable with a different data type:
clear
input id x0
1 1
2 13
3 .
end
list
save tabA, replace
clear
input id str5 x0
1 "1"
2 "23"
3 "33"
end
list
save tabB, replace
use tabA, clear
mmerge id using tabB, type(1:1) update replace
list
The result is:
     +--------------------------------------------------+
     | id   x0                                   _merge |
     |--------------------------------------------------|
  1. |  1    1   in both, master agrees with using data |
  2. |  2   13   in both, master agrees with using data |
  3. |  3    .   in both, master agrees with using data |
     +--------------------------------------------------+
This seems very strange to me. I expected breakdown or disagreement. Is this a bug or am I missing something?
mmerge is user-written (Jeroen Weesie, SSC, 2002).
If you use the official merge in an up-to-date Stata, you will get what you expect.
. merge 1:1 id using tabB, update replace
x0 is str5 in using data
r(106);
I have not looked inside mmerge. My own guess is that what you see is a feature from the author's point of view, namely that it's not a problem if one variable is numeric and one variable is string so long as their contents agree. But why are you not using merge directly? There was a brief period several years ago when mmerge had some advantages over merge, but that's long past. BTW, I agree in wanting my merges to be very conservative and not indulgent on variable types.

print column and count the string

I have a large space-delimited text file with these columns:
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
I want to count, for each name, how many subjects were passed (the name list is long, and each name should appear only once in the output). For example, John passed 3 and Jack passed 1. The result should be printed as
Name Passcount
John 3
Jack 1
Kelly 2
Can anybody help with an awk or Perl script? Thanks in advance.
You can try something like this -
awk '
BEGIN{ print "Name\tPasscount"}
NR>1{if ($3=="pass") a[$1]++}
END{ for (x in a) print x"\t"a[x]}' file
Test:
$ cat file
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
$ awk 'BEGIN{ print "Name\tPasscount"} NR>1{if ($3=="pass") a[$1]++}END{ for (x in a) print x"\t"a[x]}' file
Name Passcount
Jack 1
kelly 2
John 3
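Two details in the test output above differ from the requested output: the rows come out in no particular order (awk's for-in loop does not guarantee one), and Jack/jack are treated as different names. Here is a small Python sketch of the same count that keeps first-seen order and folds case. Treating names as case-insensitive is my assumption, not something the question states, and the function name pass_counts is hypothetical.

```python
def pass_counts(lines):
    """Count 'pass' results per name, in order of first appearance."""
    counts = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        name = fields[0].capitalize()  # assumption: jack and Jack are the same
        counts.setdefault(name, 0)
        if fields[2].lower() == "pass":
            counts[name] += 1
    return counts


data = """Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass""".splitlines()

print("Name\tPasscount")
for name, count in pass_counts(data).items():
    print(f"{name}\t{count}")
```

Unlike the awk version, a name with no passes at all is still listed here, with a count of 0.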