I have a field customer_id and I need to track the number of unique users and repeat users. For example the table is as below:
customer_id
11
22
33
11
44
22
Here, the number of unique users is 4 (11, 22, 33, 44) and the number of repeat users is 2 (11, 22).
I am calculating unique users as COUNTD([customer_id]).
How can I calculate repeat users? It is basically the distinct count of the values which appear more than once. I tried with the following expression:
COUNTD(IF COUNT([customer_id]) > 1
THEN [customer_id]
END)
but I'm getting an error: Cannot mix aggregate and non-aggregate arguments comparisons or results in IF expressions
How else can I calculate the repeat users?
Thanks in advance.
Depending on your filtering needs, you can use an LOD expression with FIXED or INCLUDE:
{ FIXED [Customer Id] : if sum({ FIXED [Customer Id] : COUNT([Customer Id])}) > 1 then 1 end }
Basically, the inner LOD counts the occurrences of each customer, and the outer expression keeps only the customers that have two or more (> 1) of them.
A simple alternative to Fabio's answer can also do the job. Just create a calculated field
COUNT([customer id]) >1
and add this to filter shelf.
You can then filter out the False values to drop one-time users and keep only returning customers.
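To sanity-check the numbers outside Tableau, here is a minimal Python sketch of the same idea (count occurrences per id, then keep the ids seen more than once). It is not Tableau syntax, and the variable names are only illustrative:

```python
from collections import Counter

# Sample customer_id values from the question.
customer_ids = [11, 22, 33, 11, 44, 22]

# Occurrences per id; this plays the role of the inner FIXED count.
occurrences = Counter(customer_ids)

# Unique users: number of distinct ids.
unique_users = len(occurrences)

# Repeat users: distinct ids that occur more than once.
repeat_users = sorted(cid for cid, n in occurrences.items() if n > 1)

print(unique_users)  # 4
print(repeat_users)  # [11, 22]
```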
Hopefully someone smart here can help me with a description of how to solve this issue. I am relatively new to SPSS and want to select cases that meet a certain requirement.
I have a group of identities (IDs) who have taken a math test multiple times. We have 1000 IDs, and each person (ID) has done the test 10 times. Now I want to select how many of these persons have scored at least 40/50 once in this test. I have managed to do so.
Here is the problem. I now want to calculate the average score of all the tests every individual has done after the first time they scored at least 40 points.
Example: ID nr 8 has scores of 34, 35, 27, 37, 32, 45, 41, 32, 34, 47.
These are all in 10 different rows. So ID nr 1 appears in 10 different rows, ID 2 in 10 other rows, and so on.
Like this:
ID 1 Score 34
ID 1 Score 35
ID 1 Score 27
As you can see, the person scored at least 40p on the 6th attempt, and I want to take the average from this point onward. So in this case (45+41+32+34+47)/5.
I also want to know whether to consider a person "smart" or not. A smart person is someone with at least 2 math scores of 40p+ (they don't have to be consecutive; 2 separate occasions is OK).
How do I do that?
In the following code I assume you have a variable that identifies the order of the tests for each ID:
sort cases by ID TestNumber.
compute ScoreOver40 = score >= 40. /* this identifies all scores g/e 40.
compute seq=ScoreOver40.
* if seq was 1, all following seq values for same ID will also become 1.
if ID = lag(ID) and lag(seq)=1 seq=1.
if seq=1 scoresAfterFirst40 = score.
*now to aggregate by ID.
dataset declare agg.
aggregate /out=agg /break=ID /meanAfterFirst40=mean(scoresAfterFirst40)
/NumScores40P = sum(ScoreOver40).
In the new dataset called agg you should find for every ID the mean of scores after the first score of 40 or more, and the count of scores of 40 or more in all the 10 tests.
EDIT:
Now you can use the aggregated data for further analysis. For example, you can determine which IDs had two or more high (40+) scores:
dataset activate agg.
compute GoodAtMaths = (NumScores40P >= 2).
exe.
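For readers who want to check the logic on the example from the question, here is a rough Python sketch of what the syntax above computes for a single ID. It is not SPSS; the function name summarize and the threshold parameter are made up for illustration, and like the SPSS code it includes the first 40+ score itself in the mean:

```python
def summarize(scores, threshold=40):
    """Return (mean from the first score >= threshold onward, count of such scores)."""
    high = [s >= threshold for s in scores]
    count_high = sum(high)
    if count_high == 0:
        # No score ever reached the threshold, so there is no mean to report.
        return None, 0
    first = high.index(True)   # position of the first score >= threshold
    tail = scores[first:]      # that score and everything after it
    return sum(tail) / len(tail), count_high


# Scores of ID nr 8 from the question, in test order.
scores = [34, 35, 27, 37, 32, 45, 41, 32, 34, 47]
mean_after, n_high = summarize(scores)
good_at_maths = n_high >= 2

print(mean_after, n_high, good_at_maths)  # 39.8 3 True
```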
In my data set, I am looking for values that have both positive and negative results under the amount category. For example, one entity can be a bank account, and there is money coming in (positive number) and money going out (negative number).
SELECT
description,
account_subtype_id,
subcategory_id,
(case when amount > 0 then 1 end) AS amount_p,
(case when amount < 0 then 0 end) AS amount_n
FROM mx.transactions
LIMIT 100
;
This approach doesn't help much because now my data looks like:
bank_A 1 null
bank_A null 0
But I really want to get something like:
bank_A 1 0
because this will be really helpful for my analysis.
Actually, if there is a way to do the following, it would be even better:
For example, an entity has
Bank_A $500 -$300 -- (these two results both are from the amount column)
If you want just one row per description, you need to group by description and use aggregate functions. A clean way to check whether there's any positive or negative amount would be to check whether min(amount) and max(amount) are less/greater than 0:
SELECT
description,
min(amount) < 0 AS amount_n,
max(amount) > 0 AS amount_p
FROM mx.transactions
GROUP BY description
These tests will give you true and false values, but you can use them in your CASE/IF statements if you want something else. Or to get the actual values rather than testing against 0, just use min and max directly.
It looks like you've got multiple columns potentially acting as your bank_A identifier. If that's the case, you can GROUP BY all of them.
SELECT
description,
account_subtype_id,
subcategory_id,
min(amount),
max(amount)
FROM mx.transactions
GROUP BY description, account_subtype_id, subcategory_id
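If you want to try the GROUP BY idea without touching the real data, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for the asker's database, the table is reduced to two columns, and the sample rows (including bank_B) are invented for illustration. Note that SQLite returns the comparisons as 1/0 rather than true/false:

```python
import sqlite3

# In-memory SQLite database standing in for mx.transactions (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (description TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("bank_A", 500), ("bank_A", -300), ("bank_B", 120)],  # made-up sample rows
)

# One row per description: flags for any negative/positive amount,
# plus the actual lowest and highest amounts.
rows = conn.execute(
    """
    SELECT description,
           min(amount) < 0 AS amount_n,
           max(amount) > 0 AS amount_p,
           min(amount) AS lowest,
           max(amount) AS highest
    FROM transactions
    GROUP BY description
    ORDER BY description
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
# ('bank_A', 1, 1, -300, 500)
# ('bank_B', 0, 1, 120, 120)
```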
I am currently working on Pentaho and I have the following problem:
I want to get a "rolling distinct count" on a value, which ignores the "group by" performed by Business Analytics. For instance:
Date Field
2013-01-01 A
2013-02-05 B
2013-02-06 A
2013-02-07 A
2013-03-02 C
2013-04-03 B
When I use a classical "distinct count" aggregator in my schema, sum it, and then add "month" to column, I get:
Month Count Sum
2013-01 1 1
2013-02 2 3
2013-03 1 4
2013-04 1 5
What I would like to get would be:
Month Sum
2013-01 1
2013-02 2
2013-03 3
2013-04 3
which is the distinct count of all Fields so far. Does anyone have any idea on this topic?
My database is in PostgreSQL, and I'm looking for any solution under PDI, PSW, PBA or PME.
Thank you!
A naive approach in PDI is the following:
Sort the rows by the Field column
Add a sequence for changing values in the Field column
Map all sequence values > 1 to zero
These first 3 steps effectively flag the first time a value was seen (no matter the date).
Sort the rows by year/month
Sum the mapped sequence values by year+month
Get a Cumulative Sum of all the previous sums
These last 3 steps aggregate the newly seen values per month, then keep a cumulative sum.
I posted a Gist of this transformation here.
A more efficient solution is to parallelize the two sorts, then join at the latest point possible. I posted this one as it is easier to explain, but it shouldn't be too difficult to take this transformation and make it more parallel.
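As a rough illustration of the steps above (flag first occurrences, sum per month, cumulative sum), here is a plain Python sketch run on the sample data from the question. It is not a PDI transformation, and it uses a set of already-seen values instead of the sort-plus-sequence trick, which should give the same flags:

```python
from itertools import accumulate

# Sample (Date, Field) rows from the question.
rows = [
    ("2013-01-01", "A"),
    ("2013-02-05", "B"),
    ("2013-02-06", "A"),
    ("2013-02-07", "A"),
    ("2013-03-02", "C"),
    ("2013-04-03", "B"),
]

# Flag the first time each Field value is seen and sum the flags per month.
seen = set()
new_per_month = {}
for date, field in sorted(rows):      # ISO dates sort chronologically
    month = date[:7]                  # "YYYY-MM"
    new_per_month.setdefault(month, 0)
    if field not in seen:
        seen.add(field)
        new_per_month[month] += 1

# Cumulative sum of the per-month counts of newly seen values.
months = sorted(new_per_month)
running = accumulate(new_per_month[m] for m in months)
rolling_distinct = dict(zip(months, running))

print(rolling_distinct)
# {'2013-01': 1, '2013-02': 2, '2013-03': 3, '2013-04': 3}
```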
I have a simple table:
ID - JID - AMOUNT
1 - 1 - 100
2 - 2 - 50
3 - 2 - -25
4 - 3 - 100
5 - 3 - -50
I want to end up with:
JID - FIRSTBALANCE
1 - 100
2 - 50
3 - 100
Because Firebird is so insanely difficult when it comes to aggregation, this doesn't work:
SELECT jid, amount as firstBalance
FROM table
GROUP BY jid
How can I get it so it groups by JID, and automatically set the value of firstbalance to the first value in the table?
It depends on what you mean by "automatically set the value of firstbalance to the first value in the table". From your example of the desired result, I assume you consider the row with the lowest ID value within a given JID group to be the "first" one, so
SELECT DISTINCT JID,
(SELECT amount FROM table s WHERE s.JID = o.JID ORDER BY s.ID ROWS 1)
FROM table o
should work.
Firebird does not contain a first() or a last() aggregate function. This has been requested and denied by the team because it is ambiguous which item would be chosen: you would need to specify an ORDER BY clause for the items that get aggregated.
The answer you selected gets you the max(amount) not the first(amount). This is not what you asked for (though possibly it is what you wanted).
For future Googlers/Bingers here's how you get the first item. It's not a terrific solution, and it can be slow.
select distinct a.jid,
(select first 1 b.amount
from table b
where b.jid = a.jid
order by b.id) as amount
from table a
order by a.jid
It will retrieve the three JID values and, for each one, the first amount found in ID order.
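To see why "first" and "max" are not the same thing, here is a small Python sketch (not Firebird SQL) over the table from the question. The helper names and the second, counterexample data set are made up for illustration:

```python
def first_per_group(rows):
    """First amount per JID, where "first" means the lowest ID in the group."""
    result = {}
    for row_id, jid, amount in sorted(rows):   # ascending ID
        result.setdefault(jid, amount)         # keep only the first amount seen
    return result


def max_per_group(rows):
    """Largest amount per JID, i.e. what max(amount) ... GROUP BY jid returns."""
    result = {}
    for _row_id, jid, amount in rows:
        result[jid] = max(amount, result.get(jid, amount))
    return result


# (ID, JID, AMOUNT) rows from the question.
table = [(1, 1, 100), (2, 2, 50), (3, 2, -25), (4, 3, 100), (5, 3, -50)]
print(first_per_group(table))  # {1: 100, 2: 50, 3: 100}
print(max_per_group(table))    # {1: 100, 2: 50, 3: 100}

# Hypothetical data where the first amount is not the largest one.
other = [(1, 1, 50), (2, 1, 75)]
print(first_per_group(other))  # {1: 50}
print(max_per_group(other))    # {1: 75}
```

For the sample table both approaches agree only because each group happens to start with its largest amount.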
Don't hold your breath for this to get built into Firebird. When asked about a positional aggregate in the past, the response was:
"I have a great deal of trouble with that concept because position isn't a relational concept and the introduction of positional operators will significantly inhibit efforts to improve performance by performing operations in parallel."
This is what I was looking for:
SELECT jid, max(amount) as firstBalance
FROM table
GROUP BY jid
I'm completely rewriting my question to simplify it. Sorry if you read the prior version. (The previous version of this question included a very complex query example that created a distraction from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID StudentID StudentName LengthInMinutes
1 1 Chuck 120
2 2 George 60
3 2 George 30
4 1 Chuck 60
5 1 Chuck 10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID StudentID StudentName LengthInMinutes PriorLessonMinutes
1 1 Chuck 120 0
2 2 George 60 0
3 2 George 30 60 (The sum Length from row 2 only)
4 1 Chuck 60 120 (The sum Length from row 1 only)
5 1 Chuck 10 180 (The sum of Length from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and to GROUP BY Student, but that isn't right because unless I'm mistaken it would include a sum of minutes for all rows for each student, including rows that come after the row which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in my code that receives it, (although this would force me to retrieve all rows unnecessarily) but that's obviously inefficient. I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to put such a query together. Any guidance?
This is a great opportunity to use Windowed Aggregates. The catch is that you need at least SQL Server 2012 Express. If you can get it, then this is the query you are looking for:
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes). If you insist on zeroes, use the COALESCE function to turn NULLs into zeroes.
I suggest using a nested query to limit the number of rows returned:
select * from
(
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply the filter to the inner query, as pruning the rows that don't affect the computation early (like data for other students) will improve the performance.
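To make the order of operations concrete, here is a rough Python sketch (not T-SQL) of the same computation on the sample lessons: the running total of prior minutes is built over all rows first, and the filter is applied afterwards. The function name is hypothetical, and unlike the SQL above this sketch yields 0 rather than NULL for a student's first lesson:

```python
from collections import defaultdict

# (LessonID, StudentID, StudentName, LengthInMinutes) rows from the question.
lessons = [
    (1, 1, "Chuck", 120),
    (2, 2, "George", 60),
    (3, 2, "George", 30),
    (4, 1, "Chuck", 60),
    (5, 1, "Chuck", 10),
]


def with_prior_minutes(rows):
    """Append to each row the sum of minutes of the student's earlier lessons."""
    totals = defaultdict(int)          # running total per StudentID
    out = []
    for lesson_id, student_id, name, minutes in sorted(rows):  # by LessonID
        out.append((lesson_id, student_id, name, minutes, totals[student_id]))
        totals[student_id] += minutes  # current row counts only for later rows
    return out


all_rows = with_prior_minutes(lessons)

# Filter AFTER the totals are computed, like the outer query above.
filtered = [row for row in all_rows if row[0] > 3]

print([row[4] for row in all_rows])  # [0, 0, 60, 120, 180]
print(filtered)  # [(4, 1, 'Chuck', 60, 120), (5, 1, 'Chuck', 10, 180)]
```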
I think the following code will serve your purpose. Check it:
select Students.StudentID ,Students.First, Students.Last,sum(Lessons.LengthInMinutes)
as TotalPriorMinutes from lessons,students
where Lessons.StartDateTime < getdate()
and Lessons.StudentID = Students.StudentID
and StartDateTime >= '20090130 00:00:00' and StartDateTime < '20790101 00:00:00'
group by Students.StudentID ,Students.First, Students.Last