Perform calculation on a different number of rows for each ID

Perform calculation on a different number of rows for each ID - encoding

Hopefully someone smart here can help me with a description how to solve this issue. I am relative new to SPSS and want to select cases with a certain requirement.
I have a group of Identeties who has made a mathtest multiple of times. We have 1000 ID where each person (ID) has done the test 10 times. Now i wanna select how many of these persons have scored atleast 40/50 once in this test. I have managed to do so.
Here is the problem. I now wanna calculate the average score of all the tests every individual has done after the first time they scored atleast 40 points.
Example: ID nr 8 has a score of; 34,35,27,37,32,45,41,32,34,47
These are all in 10 different rows. So ID nr 1 appears in 10 different rows. ID 2 in 10 other rows and so on.
Like this:
ID 1 Score 34
ID 1 Score 35
ID 1 Score 27
As you can see the person has scored atleast 40p at the 6th time. And i wanna take the average henceforth from this point. So in this case (45+41+32+34+47)/6.
I also wanna know how if i consider a person "smart" or not. A smart person is someone with atleast 2 mathscores with 40p+ (dosent have to be after each other, 2 seperate occations is ok)
How do i do that?

In the following code I assume you have a variable that identifies the order of the tests for each ID:
sort cases by ID TestNumber.
compute ScoreOver40 = score >= 40. /* this identifies all scores g/e 40.
compute seq=ScoreOver40.
* if seq was 1, all following seq values for same ID will also become 1.
if ID = lag(ID) and lag(seq)=1 seq=1.
if seq=1 scoresAfterFirst40 = score.
*now to aggregate by ID.
dataset declare agg.
aggregate /out=agg /break=ID /meanAfterFirst40=mean(scoresAfterFirst40)
/NumScores40P = sum(ScoreOver40).
In the new dataset called agg you should find for every ID the mean of scores after the first score of 40 or more, and the count of scores of 40 or more in all the 10 tests.
EDIT:
Now you can use the aggregated data for further analysis. For example, you can determine which IDs had two or more high (40+) scores:
dataset activate agg.
compute GoodAtMaths = (NumScores40P >= 2).
exe.

Related

Tableau: Distinct count of a field which occurs more than once

I have a field customer_id and I need to track the number of unique users and repeat users. For example the table is as below:
customer_id
11
22
33
11
44
22
Here, the no. of unique users is 4 (11,22,33,44) and number of repeat users are 2 (11,22).
I am calculating unique users as COUNTD([customer_id]).
How can I calculate repeat users? It is basically the distinct count of the values which appear more than once. I tried with the following expression:
COUNTD(IF COUNT([customer_id]) > 1
THEN [customer_id]
END)
but I'm getting an error: Cannot mix aggregate and non-aggregate arguments comparisons or results in IF expressions
How else can I calculate the repeat users?
Thanks in advance.

According to your filter needs, you can rely on LOD using FIXED/INCLUDE:
{ FIXED [Customer Id] : if sum({ FIXED [Customer Id] : COUNT([Customer Id])}) > 1 then 1 end }
Basically, in the inner LOD you count the occourrences, and then you just take in consideration records having 2+ (>1) of them:

A simple alternative to Fabio's answer can also do the job. Just create a calculated field
COUNT([customer id]) >1
and add this to filter shelf.
You can filter out false candidates to remove unique users and taking returning customers only.

reshape and merge in stata

I have three data sets:
First, called education.dta. It contains individuals(students) over many years with their achieved educations from yr 1990-2000. Originally it is in wide format, but I can easily reshape it to long. It is presented as wide under:
id educ_90 educ_91 ... educ_00 cohort
1 0 1 1 87
2 1 1 2 75
3 0 0 2 90
Second, called graduate.dta. It contains information of when individuals(students) have finished high school. However, this data set do not contain several years only a "snapshot" of the individ when they finish high school and characteristics of the individual students such as backgroung (for ex parents occupation).
id schoolid county cohort ...
1 11 123 87
2 11 123 75
3 22 243 90
The third data set is called teachers.dta. It contains informations about all teachers at high school such as their education, if they work full or part time, gender... This data set is long.
id schoolid county year education
22 11 123 2011 1
21 11 123 2001 1
23 22 243 2015 3
Now I want to merge these three data sets.
First, I want to merge education.dta and graduate.dta on id.
Problem when education.dta is wide: I manage to merge education and graduation.dta. Then I make a loop so that all the variables in graduation.dta takes the same over all years, for eksample:
forv j=1990/2000 {
gen county j´=.
replace countyj´=county
}
However, afterwards when reshaping to long stata reposts that variable id does not uniquely identify the observations.
further, I have tried to first reshape education.dta to long, and thereafter merge either 1:m or m:1 with education as master, using graduation.dta.
However stata again reposts that id is not unique. How do I deal with this?
In next step I want to merge the above with teachers.dta on schoolid.
I want my final dataset in long format.
Thanks for your help :)

I am not certain that I have exactly the format of your data, it would be helpful if you gave us a toy dataset to look at using dataex (and could even help you figure out the problem yourself!)
But to start, because you are seeing that id is not unique, you need to figure out why there might be multiple ids in any of the datasets. Can someone in graduate.dta or education.dta appear more than once? help duplicates will probably be useful to explore the data in this way.
Because you want your dataset in long format I suggest reshaping education.dta to long first, then doing something like merge m:1 id using "graduate.dta" (once you figure out why some observations are showing up more than once) and then, finally something like merge 1:1 schoolid year using "teacher.dta" and you will have your final dataset.

Select value in table in tableau

I am quite new to Tableau, so have patience with me :)
I have two tables,
Table one (T1) contains all my data with the first row being Year-Week, like 2014-01, 2014-02, and so on. Quick question regarding this, how do I make Tableau consider this as a date, and not as string?
T1 contains a lot of data that looks like this:
YearWeek Spend TV Movies
2014-01 5000 42 12
2014-02 4800 41 32
2014-03 2000 24 14
....
2015-24 7000 45 65
I have another table (T2) that contains information regarding some values I want to multiply with the T1 columns, T2 looks like:
NAME TV Movies
Weight 2 5
Response 6 3
Ad 7 2
Version 1 0
I want to create a calculated field (TVNEW) that takes the values from T1 of TV, and adds Response(TV) to it, and times it with the weight(TV),
So something like this:
(T1[TV]+T2[TV[Response]])*T2[TV[Weight]]
This looks like this for the rows:
(42+6)*2
(41+6)*2
(24+6)*2
...
(45+6)*2
So the calculation should take a specific value from T2, and do the calculation for each value in T1[TV]
Thanks in advance

The easy answer to your question will be: No, not natively.
What you want to do sounds like accessing a 2 dimensional array and that's not really the intention of Tableau. Additionally you have 2 completely independent tables without a common attribute to JOIN on. Tableau is just not meant to work that way.
I cannot think of a way to dynamically extract that value (I assume your example is just that, an example; and in your case you don't just use two values in the calculation, otherwise you could create 2 parameters that you can use in your calculated fields)
When I look at your tables it looks like you could transpose and join them that they ideally look like this: (Edit: Comment says transposing is not an option)
Medium Value YearWeek Spend
Movies 12 2014-01 5,000
Movies 32 2014-02 4,000
Movies 14 2014-03 2,000
Movies 65 2015-24 7,000
TV 42 2014-01 5,000
TV 41 2014-02 4,000
TV 24 2014-03 2,000
TV 45 2015-24 7,000
and
Medium Weight Response Ad Version
TV 2 6 7 1
Movies 5 3 2 0
Depending on the systems you work with you could already put it in one CSV or table so you wouldn't have to do a JOIN in Tableau.
Now you can create the first table natively in Tableau (from Version 9.0 onwards), if you open your data source, in the Data Source Preview choose the columns TV and Movies, click on the small triangle and then on Pivot. (At this point you can also choose the YearWeek column click on the triangle and Split to create a seperate field for Year and Week. You won't be able to assign the type date to it put that shouldn't give you any disadvantages.)
For the second table I can think of two possibilities:
you have access to a tool that can transpose your table (Excel can do that see: Convert matrix to 3-column table ('reverse pivot', 'unpivot', 'flatten', 'normalize') Once you have done that you can open it in Tableau and join the two tables on Medium
You could create calculated fields depending on the medium:
Field: Weight
CASE [Medium]
WHEN 'TV' THEN 2
WHEN 'Movies' THEN 5
END
And accordingly for Response, Ad and Version
Obviously that is only reasonable if you really just need a handfull of values.
Once this is done it's only a matter of creating a calculated field with
([Value]+[Response])*[Weight]
And this will calculate all the values for your table

how do i subtotal on groups based on a condition

My report is grouped on clinic, staffname with subtotals by clinic. I need to count patients by staff where they had more than 1 admit date. I can get the correct grand total, but on the detail and subtotals, it is a progressive number.
Here's what I want
clinic1
staffname1 10
staffname2 95
subtotal 105
clinic2
staffname3 6
subtotal 6
grand total 111
Here is what I get:
clinic1
staffname1 10
staffname2 105
subtotal 105
clinic2
staffname3 111
subtotal 111
grand total 111

A lot of this may depend on the structure of your data, e.g., what is in your "detail" level. I am also assuming that you want to count of how many patients had more than one admit date, not the total number of admits for patients with multiple admits. Given that, and assuming a patient will only appear once per admit date, then this should work:
Group by patient also, so it's clinic -> staff -> patient, but suppress that group.
Create a formula to count if the number of records in each patient group is more than 1, something like this: if count({patient},{patient}) > 1 then 1 else 0
Take the formula you just created, and use it to make a summary field wherever you want a total, e.g., in the staff head it will give you a count or that staff member, in clinic it will do so for clinic, etc.
Something else to consider: I'm guessing this could be intended to gauge staff on their quality by seeing how many patients had to seek additional treatment. Even if that's not the full intent, whatever this is being used for could be skewed by staff encountering more/fewer patients. For example, staff that have 10 readmits out of 100 visits would look worse than someone who only had 5 readmits, but had also only seen 20 patients.
So: Along with the metric on which you are requesting information, I would also add a ratio metric. In the staff header, this would be straightforward: count({patients}) / distinctcount({patients}) which will give you the ratio of distinct to repeat visits. Keep in mind also that this could skew high for a staff member that had, for example, 50 patients, but one of them came back a dozen times.

To get the count of the fields that has Greater than 1.
Assuming the count field is a database field and value coming directly from database
Create a formula #Count and write below code. Place the forumla in details.
:
if(databaswefield.count > 1)
then 1
else 0
Now take the summary of the #count formula in required group footers.
Let me know if you are looking for something different.
Edit..............................................................
If I got your comment correctly.
Though you take distinct count you can use that in your calculations by storing the value in a shared variable. Something like below and you can retrive value from that variable
Shared Numbervar count;
count:=distinctcount(patientid)

SQL Sum and Group By for a running Tally?

I'm completely rewriting my question to simplify it. Sorry if you read the prior version. (The previous version of this question included a very complex query example that created a distraction from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID StudentID StudentName LengthInMinutes
1 1 Chuck 120
2 2 George 60
3 2 George 30
4 1 Chuck 60
5 1 Chuck 10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID StudentID StudentName LengthInMinutes PriorLessonMinutes
1 1 Chuck 120 0
2 2 George 60 0
3 2 George 30 60 (The sum Length from row 2 only)
4 1 Chuck 60 120 (The sum Length from row 1 only)
5 1 Chuck 10 180 (The sum of Length from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and to GROUP BY Student, but that isn't right because unless I'm mistaken it would include a sum of minutes for all rows for each student, including rows that come after the row which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in my code that receives it, (although this would force me to retrieve all rows unnecessarily) but that's obviously inefficient. I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to write such a query together. Any guidance?

This is a great opportunity to use Windowed Aggregates. The trick is that you need SQL Server 2012 Express. If you can get it, then this is the query you are looking for:
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes). If you insist on zeroes, use COALESCE function to turn NULLs into zeroes.
I suggest using a nested query to limit the number of rows returned:
select * from
(
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply the filter to the inner query, as pruning the rows that don't affect the computation early (like data for other students) will improve the performance.

I feel the following code will serve your purpose.Check it:-
select Students.StudentID ,Students.First, Students.Last,sum(Lessons.LengthInMinutes)
as TotalPriorMinutes from lessons,students
where Lessons.StartDateTime < getdate()
and Lessons.StudentID = Students.StudentID
and StartDateTime >= '20090130 00:00:00' and StartDateTime < '20790101 00:00:00'
group by Students.StudentID ,Students.First, Students.Last