Reshape and merge in Stata

I have three data sets:
First, called education.dta. It contains individuals (students) over many years, with their attained education for the years 1990-2000. Originally it is in wide format, but I can easily reshape it to long. In wide format it looks like:
id  educ_90  educ_91  ...  educ_00  cohort
1   0        1        ...  1        87
2   1        1        ...  2        75
3   0        0        ...  2        90
Second, called graduate.dta. It contains information on when individuals (students) finished high school. However, this data set does not span several years; it is only a "snapshot" of each individual when they finish high school, together with characteristics of the student such as background (for example, parents' occupation).
id schoolid county cohort ...
1 11 123 87
2 11 123 75
3 22 243 90
The third data set is called teachers.dta. It contains information about all teachers at the high schools, such as their education, whether they work full or part time, gender, and so on. This data set is long.
id schoolid county year education
22 11 123 2011 1
21 11 123 2001 1
23 22 243 2015 3
Now I want to merge these three data sets.
First, I want to merge education.dta and graduate.dta on id.
Problem when education.dta is wide: I manage to merge education.dta and graduate.dta. Then I make a loop so that all the variables from graduate.dta take the same value over all years, for example:
forv j = 1990/2000 {
    gen county`j' = .
    replace county`j' = county
}
However, afterwards, when reshaping to long, Stata reports that variable id does not uniquely identify the observations.
Further, I have tried to first reshape education.dta to long, and thereafter merge either 1:m or m:1 with education as master, using graduate.dta.
However, Stata again reports that id is not unique. How do I deal with this?
In the next step I want to merge the above with teachers.dta on schoolid.
I want my final dataset in long format.
Thanks for your help :)

I am not certain that I have exactly the format of your data; it would be helpful if you gave us a toy dataset to look at using dataex (and that could even help you figure out the problem yourself!).
But to start: because you are seeing that id is not unique, you need to figure out why there might be multiple rows per id in any of the datasets. Can someone in graduate.dta or education.dta appear more than once? help duplicates will probably be useful for exploring the data in this way.
Because you want your dataset in long format, I suggest reshaping education.dta to long first, then doing something like merge m:1 id using "graduate.dta" (once you figure out why some observations show up more than once), and then finally something like merge m:1 schoolid year using "teachers.dta" (or joinby schoolid year if there are several teachers per school and year), and you will have your final dataset.
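A minimal sketch of that pipeline, using the file and variable names above (untested without the actual data, so treat it as a starting point):

* check why id repeats before merging
use "graduate.dta", clear
duplicates report id

* 1) reshape education.dta to long
use "education.dta", clear
reshape long educ_, i(id) j(year)
rename educ_ educ
* the suffixes are two-digit (90...99, 00), so recode to four-digit years
replace year = cond(year < 50, year + 2000, year + 1900)

* 2) attach the time-invariant graduate info to every student-year row
merge m:1 id using "graduate.dta", keep(master match) nogenerate

* 3) attach teacher data; rename the teacher id in teachers.dta first
*    (e.g. rename id teacherid) so it does not clash with the student id.
*    m:1 assumes one teacher row per school and year; if there are
*    several teachers per school, joinby schoolid year creates all pairings
merge m:1 schoolid year using "teachers.dta"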

Related

Perform calculation on a different number of rows for each ID

Hopefully someone smart here can help me with a description of how to solve this issue. I am relatively new to SPSS and want to select cases with a certain requirement.
I have a group of IDs who have taken a math test multiple times. We have 1000 IDs, and each person (ID) has done the test 10 times. First I wanted to select how many of these persons have scored at least 40/50 at least once on this test. I have managed to do so.
Here is the problem: I now want to calculate the average score of all the tests each individual has done after the first time they scored at least 40 points.
Example: ID no. 8 has scores of 34, 35, 27, 37, 32, 45, 41, 32, 34, 47.
These are all in 10 different rows, so each ID appears in 10 rows: ID 1 in 10 rows, ID 2 in 10 other rows, and so on.
Like this:
ID 1 Score 34
ID 1 Score 35
ID 1 Score 27
As you can see, the person scored at least 40 points for the first time on the 6th test. I want to take the average from that point onward, so in this case (45+41+32+34+47)/5.
I also want to determine whether to consider a person "smart" or not. A smart person is someone with at least 2 math scores of 40+ (they don't have to be consecutive; 2 separate occasions is fine).
How do I do that?
In the following code I assume you have a variable that identifies the order of the tests for each ID:
sort cases by ID TestNumber.
compute ScoreOver40 = (score >= 40). /* flags all scores >= 40.
compute seq = ScoreOver40.
* once seq is 1, all following seq values for the same ID also become 1.
if ID = lag(ID) and lag(seq) = 1 seq = 1.
if seq = 1 scoresAfterFirst40 = score.
* now aggregate by ID.
dataset declare agg.
aggregate /outfile=agg /break=ID
    /meanAfterFirst40 = mean(scoresAfterFirst40)
    /NumScores40P = sum(ScoreOver40).
In the new dataset called agg you should find, for every ID, the mean of the scores from the first score of 40 or more onward, and the count of scores of 40 or more across all 10 tests.
EDIT:
Now you can use the aggregated data for further analysis. For example, you can determine which IDs had two or more high (40+) scores:
dataset activate agg.
compute GoodAtMaths = (NumScores40P >= 2).
exe.

If you have a data object with many features (100+), is it better to expand this data by adding columns or by storing it in row format? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
So, if you can imagine a person was simultaneously enrolled in 100 different courses and had just received final grades for all courses, would it be better practice to store that information like so (wide columns):
personID  Math  Science  English  ...
1         90    88       98       ...
2         91    98       90       ...
(and 97 other columns)
Or like this:
personID  Grade  Subject
1         90     Math
1         88     Science
1         98     English
This brings the number of columns down significantly (from 100 to 3).
Creating 100+ columns on any table is almost always a bad idea; every database has a limit on the number of columns per table. Instead, you can model it the following way:
StudentsMarkDetails:

StudentsMarkDetailsID  personID  SubjectId  MarksObtained  ExamID
1                      1         3          64             1
2                      2         4          36             1
3                      1         4          36             2

SubjectMaster:

SubjectId  SubjectName
1          Maths
2          Science
3          English
Definitely option two: a small, fixed number of columns.
The reasons are many. Here are a few:
maintenance: if you have 100 columns now, tomorrow you'll have 101. Adding a column requires a schema change, which is painful.
access: to get a value, you have to hard-code the column name. This is a form of maintenance problem in the code that accesses it, be it queries or app code.
queries: basic queries become impossible. Write the query that returns the average grade for a student and the problem becomes immediately obvious (see the sketch just below this list).
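To make that last point concrete, here is a rough comparison, assuming hypothetical tables grades_wide(personID, Math, Science, English, ...) and grades_long(personID, Subject, Grade):

-- long format: one short query, unchanged no matter how many subjects exist
SELECT personID, AVG(Grade) AS avg_grade
FROM grades_long
GROUP BY personID;

-- wide format: every column must be written out by hand, and the
-- query has to be edited the moment a 101st course is added
SELECT personID,
       (Math + Science + English /* + 97 more columns... */) / 100.0 AS avg_grade
FROM grades_wide;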
Here are two solid rules I made for myself that I never break:
Prefer more rows over more columns.
Prefer more columns over more tables.
It is better to store the data with one row per person and per course (a minimal table sketch follows this list). Why? Here are some reasons:
Different people may have different courses.
Most databases have a limit on the number of columns in a table.
You may want to store additional information per score, such as the date/time it was entered.
It is easy to add new courses or to change existing scores, if necessary.
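A minimal sketch of such a long-format table (all names are illustrative):

CREATE TABLE scores (
    personID   INT          NOT NULL,
    course     VARCHAR(50)  NOT NULL,
    score      DECIMAL(5,2) NULL,   -- NULL until a grade is entered
    entered_at DATETIME     NULL,   -- extra per-score info is easy to add
    PRIMARY KEY (personID, course)
);

New courses and new students are just new rows, so none of the reasons above ever forces a schema change.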

Select value in table in tableau

I am quite new to Tableau, so have patience with me :)
I have two tables.
Table one (T1) contains all my data, the first column being Year-Week, like 2014-01, 2014-02, and so on. A quick question regarding this: how do I make Tableau treat this as a date, and not as a string?
T1 contains a lot of data that looks like this:
YearWeek  Spend  TV  Movies
2014-01   5000   42  12
2014-02   4800   41  32
2014-03   2000   24  14
...
2015-24   7000   45  65
I have another table (T2) that contains values I want to multiply with the T1 columns. T2 looks like:
NAME      TV  Movies
Weight    2   5
Response  6   3
Ad        7   2
Version   1   0
I want to create a calculated field (TVNEW) that takes the TV values from T1, adds Response(TV) to them, and multiplies the result by Weight(TV).
So something like this:
(T1[TV] + T2[TV][Response]) * T2[TV][Weight]
Row by row, this works out to:
(42+6)*2
(41+6)*2
(24+6)*2
...
(45+6)*2
So the calculation should take a specific value from T2, and do the calculation for each value in T1[TV]
Thanks in advance
The easy answer to your question is: no, not natively.
What you want to do sounds like accessing a 2-dimensional array, and that's not really what Tableau is intended for. Additionally, you have 2 completely independent tables without a common attribute to JOIN on. Tableau is just not meant to work that way.
I cannot think of a way to dynamically extract that value. (I assume your example is just that, an example, and that in your real case you don't use just two values in the calculation; otherwise you could create 2 parameters to use in your calculated fields.)
When I look at your tables, it looks like you could transpose and join them so that they ideally look like this (edit: a comment says transposing is not an option):
Medium  Value  YearWeek  Spend
Movies  12     2014-01   5,000
Movies  32     2014-02   4,800
Movies  14     2014-03   2,000
Movies  65     2015-24   7,000
TV      42     2014-01   5,000
TV      41     2014-02   4,800
TV      24     2014-03   2,000
TV      45     2015-24   7,000

and

Medium  Weight  Response  Ad  Version
TV      2       6         7   1
Movies  5       3         2   0
Depending on the systems you work with, you could already put this into one CSV file or table, so you wouldn't have to do a JOIN in Tableau.
Now you can create the first table natively in Tableau (from version 9.0 onwards): open your data source, choose the columns TV and Movies in the Data Source Preview, click on the small triangle, and then on Pivot. (At this point you can also choose the YearWeek column, click on the triangle, and then Split to create separate fields for Year and Week. You won't be able to assign the type date to them, but that shouldn't put you at any disadvantage.)
For the second table I can think of two possibilities:
you have access to a tool that can transpose your table (Excel can do that; see: Convert matrix to 3-column table ('reverse pivot', 'unpivot', 'flatten', 'normalize')). Once you have done that, you can open it in Tableau and join the two tables on Medium
You could create calculated fields depending on the medium:
Field: Weight
CASE [Medium]
WHEN 'TV' THEN 2
WHEN 'Movies' THEN 5
END
And accordingly for Response, Ad, and Version.
Obviously that is only reasonable if you really just need a handful of values.
Once this is done it's only a matter of creating a calculated field with
([Value]+[Response])*[Weight]
and this will calculate all the values for your table.

Summarizing each subgroup in Group Footer - Crystal Reports

I'm trying to work on a report for a client. Basically I need something like this:
Group 1: Customer ID
Group 2: Truck ID
CustID  Vehicle ID  Detention Time
------  ----------  --------------
ABX     100         60
                    35
                    20
        TOTAL:      115
        200         80
                    15
        TOTAL:      95
        300         10
        TOTAL:      10

TOTALS FOR CUSTOMER ABX
        100         115
        200         95
        300         10
Is there any way to accomplish this without a subreport? I was hoping for a "summary field" that would let me summarize more than just a single value.
Thanks!
(FYI using Crystal Reports 2008)
Use a crosstab; place it in the report-footer section.
There might be a better way to do this, but the one that comes to mind is to use two arrays: one to store the truck IDs and another to store the corresponding totals. In each inner group (Truck ID), just tack on another array element and store that truck's total time. To display, you could cast the values to strings, attach a newline character after each entry, and set the field to "Can Grow". So altogether you'd need three formulas: one to initialize the arrays (in GH1), one to update the arrays with sum({truck.time}, {truck.ID}) (in GF2), and one to display each entry (in GF1); a rough sketch follows below.
With that being said, CR has terrible support for containers: you're limited to one-dimensional, non-dynamic arrays capped at 1000 items. It doesn't sound like these would be big problems for what you're trying to do, but you will need to redim preserve the arrays unless you know ahead of time how many trucks you'll have per customer.
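Here is a rough sketch of those three formulas in Crystal syntax ({truck.ID} and {truck.time} are the field names used above; everything else is illustrative and untested):

// formula placed in GH1: reset the arrays for each customer
whileprintingrecords;
stringvar array truckIDs;
numbervar array truckTotals;
numbervar n := 0;
""

// formula placed in GF2: append this truck's ID and total detention time
whileprintingrecords;
stringvar array truckIDs;
numbervar array truckTotals;
numbervar n := n + 1;
redim preserve truckIDs[n];
redim preserve truckTotals[n];
truckIDs[n] := totext({truck.ID}, 0);   // totext in case the ID is numeric
truckTotals[n] := sum({truck.time}, {truck.ID});
""

// formula placed in GF1 (set the field to "Can Grow"): one line per truck
whileprintingrecords;
stringvar array truckIDs;
numbervar array truckTotals;
numbervar n;
numbervar i;
stringvar out := "";
for i := 1 to n do (
    out := out + truckIDs[i] + "  " + totext(truckTotals[i], 0) + chr(10)
);
out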

Calculating change in leaders for baseball stats in MSSQL

Imagine I have an MSSQL 2005 table (bbstats) that updates weekly, showing various cumulative categories of baseball accomplishments for a team.
week 1
Player  H   SO  HR
Sammy   7   11  2
Ted     14  3   0
Arthur  2   15  0
Zach    9   14  3

week 2
Player  H   SO  HR
Sammy   12  16  4
Ted     21  7   1
Arthur  3   18  0
Zach    12  18  3
I wish to highlight textually where there has been a change of leader in each category. So after week 2 there would be nothing to report for hits (H); Zach has joined Arthur for most strikeouts (SO) at 18; and Sammy is the new leader in home runs (HR) with 4.
So I would want to set up a process something like:
a) save the past data (week 1) as table bbstatsPrior,
b) update bbstats with the new results - I do not need assistance with this,
c) compare the tables for the player(s), including ties, with the max value in each column, and output only where they differ,
d) move on to the next column and repeat.
In any real world example there would be significantly more columns to calculate for
Thanks
Responding to Brent's comments: I am really after any changes in the leaders for each category.
So I would have something like
select top 1 with ties player
from bbstatsPrior
order by H desc
and
select top 1 with ties player,H
from bbstats
order by H desc
I then want to compare the players from each query (do I need to use temp tables?). If they differ, I want to output the result of the second select statement. For the H category, Ted is the leader in both tables, but for the other categories there are changes between the weeks.
I can then loop through the columns using
select name from sys.all_columns sc
where sc.object_id=object_id('bbstats') and name <>'player'
If the number of stats doesn't change often, you could easily write a single query to get this data: join bbstats to bbstatsPrior where bbstatsPrior.week < bbstats.week and bbstats.week = @weekNumber, then do a simple comparison between bbstats.H and bbstatsPrior.H to get your difference.
If the stats change often, you could use dynamic SQL to do this for all columns that match a certain pattern, or that appear in a list of columns built from sys.columns for that table.
You could add a column for each stat column to designate the leader, using a correlated subquery to find the max value for that column and check whether it equals the current record.
This might get you started (see the sketch below), but I'd recommend posting what you currently have to achieve this, and the community can help you from there.
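For a single category, a rough sketch of that comparison in T-SQL (using the two tables as described in the question; untested, and per-column, with H swapped out for each stat):

-- current leaders for H; output them only if the leader set gained a new player
WITH curr AS (
    SELECT TOP 1 WITH TIES Player, H
    FROM bbstats
    ORDER BY H DESC
),
prev AS (
    SELECT TOP 1 WITH TIES Player
    FROM bbstatsPrior
    ORDER BY H DESC
)
SELECT c.Player, c.H
FROM curr AS c
WHERE EXISTS (SELECT 1
              FROM curr AS c2
              WHERE c2.Player NOT IN (SELECT Player FROM prev));

Against the sample data this returns nothing for H (Ted leads both weeks), Arthur and Zach for SO, and Sammy for HR; repeating it per column is where the sys.columns loop or dynamic SQL comes in.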