SPSS - merging files with duplicate cases of ID variable and new cases/variables - merge

I have an administrative dataset for store visits from multiple years that I am trying to merge into one under the ID variable.
Each dataset has duplicates of an ID that occur during different store visits, annotated by Date. Some of the more recent data files also have new variables (Y) not contained in the old data files. Datasets from different years will also contain different cases, indicated by different IDs. Also, some variables may have the same value for a given ID at different dates. I want the merged file to retain these duplicates.
Example data files:
File 1
ID Date X
1 3 4
1 5 3
2 1 4
File 2
ID Date X Y
1 6 4 2
1 7 1 5
2 8 4 7
3 7 2 3
I want the merged file to continue listing ALL duplicate cases, as such:
ID Date X Y
1 3 4 .
1 5 3 .
1 6 4 2
1 7 1 5
2 1 4 .
2 8 4 7
3 7 2 3
I then plan to restructure (CASESTOVARS /AUTOFIX=0) the merged file so that it looks like this:
ID Date.1 Date.2 Date.3 Date.4 X.1 X.2 X.3 X.4 Y.1 Y.2 Y.3 Y.4
1 3 5 6 7 4 3 4 1 . . 2 5
2 1 8 . . 4 4 . . . 7 . .
3 7 . . . 2 . . . 3 . . .
I am having trouble with the initial merging process, however. I have tried looking up the safest way to merge files when they both have duplicate cases in order to make sure no data are lost in the process. It seems that the "Add Variables" method results in lost values for duplicate variables.
Thanks!
EDIT: If I used the "Add Variables" function and used both the ID and Date variables as the key variables, would that help avoid deletion of duplicate cases?

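Why not try Add Cases (ADD FILES) instead of Add Variables (MATCH FILES)? Add Cases stacks the rows of both files, so no duplicate IDs are dropped, and variables that exist only in the newer files (Y) are set to system-missing for rows from the older files. If no ID occurs more than once with the same date, the stacked file should work fine with CASESTOVARS.
If there are such cases, you'll need to decide what to do with them before you can proceed with CASESTOVARS.
One way would be to AGGREGATE by ID and Date and decide how to combine the data variables for each such case, e.g. by adding them up.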
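A minimal Python sketch of the stack-then-restructure idea, using the two example files from the question. This is only an illustration of the logic, not SPSS syntax: the helper names (`add_cases`, `duplicate_keys`, `cases_to_vars`) are made up here, and `None` stands in for system-missing.

```python
from collections import defaultdict

# Example rows from File 1 and File 2 in the question; File 1 has no Y.
file1 = [
    {"ID": 1, "Date": 3, "X": 4},
    {"ID": 1, "Date": 5, "X": 3},
    {"ID": 2, "Date": 1, "X": 4},
]
file2 = [
    {"ID": 1, "Date": 6, "X": 4, "Y": 2},
    {"ID": 1, "Date": 7, "X": 1, "Y": 5},
    {"ID": 2, "Date": 8, "X": 4, "Y": 7},
    {"ID": 3, "Date": 7, "X": 2, "Y": 3},
]


def add_cases(*files):
    """Stack the rows of all files; variables absent from a file become None."""
    columns = []
    for rows in files:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    stacked = [
        {name: row.get(name) for name in columns}
        for rows in files
        for row in rows
    ]
    return sorted(stacked, key=lambda r: (r["ID"], r["Date"]))


def duplicate_keys(rows):
    """Return the (ID, Date) pairs that occur more than once."""
    counts = defaultdict(int)
    for r in rows:
        counts[(r["ID"], r["Date"])] += 1
    return [key for key, n in counts.items() if n > 1]


def cases_to_vars(rows, value_names=("Date", "X", "Y")):
    """Build one row per ID with Date.1, Date.2, ..., X.1, ... columns."""
    by_id = defaultdict(list)
    for r in rows:
        by_id[r["ID"]].append(r)
    width = max(len(visits) for visits in by_id.values())
    wide = []
    for id_, visits in by_id.items():
        out = {"ID": id_}
        for name in value_names:
            for i in range(width):
                out[f"{name}.{i + 1}"] = visits[i][name] if i < len(visits) else None
        wide.append(out)
    return wide


merged = add_cases(file1, file2)
wide = cases_to_vars(merged)
```

Checking `duplicate_keys(merged)` before restructuring is the step where you would decide whether any aggregation by ID and Date is needed.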

Related

I am trying to combine 6 columns into two, I can't use concatenate, Google Sheets is constantly updating the rows

So I am trying to combine 6 columns:
Number Date Number Date Number Date
1 1/12/21 2 2/20/21 3 3/5/21
2 2/12/21 3 2/27/21 4 4/1/21
3 1/15/20 4 1/20/21 1 3/30/21
4 1/4/21 1 2/25/21 2 4/2/21
What I am trying to accomplish is to combine these six columns into two, with only the latest date displayed for each number:
Number Date
1 3/30/21
2 4/2/21
3 3/5/21
4 4/1/21
To get the latest date, I have tried to use
=query('scba fill practise'!A:G,"select A, max(G) group by A")
To get all my numbers to constantly update, I've used
=UNIQUE({A3:A;C3:C;E3:E})
Maybe something like this?
=QUERY({'scba fill practise'!A2:B4;'scba fill practise'!C2:D4;'scba fill practise'!E2:F4}, "SELECT Col1, MAX(Col2) GROUP BY Col1")
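The same logic can be sketched outside Sheets: stack the three (Number, Date) column pairs into one list, then keep the maximum date per number. The Python below is only an illustration using the sample rows from the question; `latest_per_number` is a hypothetical helper name, and the dates are assumed to be in month/day/two-digit-year form.

```python
from datetime import datetime

# Sample sheet rows: three (Number, Date) column pairs side by side.
rows = [
    (1, "1/12/21", 2, "2/20/21", 3, "3/5/21"),
    (2, "2/12/21", 3, "2/27/21", 4, "4/1/21"),
    (3, "1/15/20", 4, "1/20/21", 1, "3/30/21"),
    (4, "1/4/21", 1, "2/25/21", 2, "4/2/21"),
]


def latest_per_number(rows):
    """Stack the column pairs and keep the latest date for each number."""
    latest = {}
    for row in rows:
        # Walk the (Number, Date) pairs: columns 0-1, 2-3, 4-5.
        for i in range(0, len(row), 2):
            number, text = row[i], row[i + 1]
            if number is None or not text:
                continue  # skip empty cells in rows that are not filled yet
            date = datetime.strptime(text, "%m/%d/%y").date()
            if number not in latest or date > latest[number]:
                latest[number] = date
    return dict(sorted(latest.items()))


result = latest_per_number(rows)
```

Skipping empty cells corresponds to filtering out blank rows when the ranges in the formula are left open-ended.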

Grouping multiple values

Grouping multiple values on Details section
I have got an output from SQL query:
ID Value
1 1
1 3
1 5
1 7
1 9
2 1
2 4
3 1
3 2
3 3
On each page I just want to show one ID and the whole list of values assigned to that ID. The next page should show the next ID and its values.
As you can see, ID 1 has 5 values, ID 2 has only 2 values, and ID 3 has 3 values. In other words, the number of values per ID can differ.
I don't know the name of this kind of grouping. If someone can name it, I will be able to search the Internet for the solution.
If someone knows how to do this and shares the knowledge, I will really appreciate it.
Best regards,
Volcano
You should add a group (Insert Group) for ID and put Value in the detail section. Make sure to start each group on a new page: in Section Expert, select your group header or footer, then tick New Page Before or New Page After.

SPSS - Create dummy for top volume months within customer grouping

I need to create a dummy for the top purchase months within each customer ID. That is, if a month is one of the four months within the year in which the customer purchased the most, it is coded 1, otherwise 0.
Example of data, cust id, order date, volume and new variable dummy:
This code creates some sample data:
data list free/ID volume (2f4).
begin data
1 100 1 500 2 1 2 2 2 3 2 90 1 600 1 90 1 870 2 9 2 8 2 10
end data.
Using the sample data in the question, this code will create a new variable containing the dummy according to your definition:
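* Rank volume in descending order within each ID, so that rank 1 is the highest volume.
RANK VARIABLES=volume (D) BY ID /RANK.
* RANK names the new rank variable Rvolume by default.
compute high4=(Rvolume<=4).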
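For readers who want to check the logic outside SPSS, here is a small Python sketch of the same idea on the sample data: rank volumes from highest to lowest within each ID and flag the top four. It is a simplification (the function name `flag_top_n` is made up, and tied volumes simply share the better rank, which is not how SPSS handles ties by default).

```python
from collections import defaultdict

# (ID, volume) pairs, in the order they appear in the sample data.
data = [
    (1, 100), (1, 500), (2, 1), (2, 2), (2, 3), (2, 90),
    (1, 600), (1, 90), (1, 870), (2, 9), (2, 8), (2, 10),
]


def flag_top_n(rows, n=4):
    """Return (ID, volume, dummy), where dummy is 1 for the n highest volumes per ID."""
    by_id = defaultdict(list)
    for id_, volume in rows:
        by_id[id_].append(volume)
    flagged = []
    for id_, volume in rows:
        # Descending rank: 1 + number of strictly larger volumes for the same ID.
        rank = 1 + sum(1 for other in by_id[id_] if other > volume)
        flagged.append((id_, volume, 1 if rank <= n else 0))
    return flagged


flags = flag_top_n(data)
```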

Get consecutive sequence number in ireport

I need to display row number sequence of each group.
I have used $V{PAGE_COUNT} and evaluation time as now
The report data that I am getting is
Group A
1.
2
3
4
...........
page ends ......
Group A
1
2
3
4
page ends ---------
Group B
1
2
3
4
5
page ends....
But my requirement is
Group A
1.
2
3
4
...........
page ends
Group A
5
6
7
8
9
page ends .......
Group B
1
2
3
4
5
page ends....
I need all rows of the same group to form one continuous sequence, and the sequence should restart from 1 when the group changes.
You should use the GroupName_COUNT variable in this case.
A quote from the JasperReports Ultimate Guide:
When declaring a report group, the engine automatically creates a count variable that calculates the number of records that make up the current group (that is, the number of records processed between group ruptures).
The name of this variable is derived from the name of the group it corresponds to, suffixed with the _COUNT sequence. It can be used like any other report variable, in any report expression, even in the current group expression, as shown in the BreakGroup group of the /demo/samples/jasper sample.
More info is here: Data Grouping
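To illustrate the behaviour described in the quote, here is a small Python sketch (not JasperReports code) of a counter that restarts only when the group value changes. Page breaks play no part in it, which is the difference from a page-level count. The record list and the function name `number_within_group` are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical records, already sorted by group: nine in group A, five in group B.
records = ["A"] * 9 + ["B"] * 5


def number_within_group(groups):
    """Number each record 1, 2, 3, ... and restart when the group value changes."""
    numbered = []
    for name, members in groupby(groups):
        for count, _ in enumerate(members, start=1):
            numbered.append((name, count))
    return numbered


numbered = number_within_group(records)
```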

Aggregate path counts using HierarchyID

Business problem - understand process fallout using analytics data.
Here is what we have done so far:
Build a dictionary table with every possible process step
Find each process "start"
Find the last step for each start
Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User Fallout Step HierarchyID.ToString()
A 1/1/1
B 1/1/1/1/1
C 1/1/1/1
D 1/1/1
E 1/1
What this means is that five users (A-E) started the process. Assume only User B finished, the other four did not. Since this is a simple example (without branching) we want the output to look as follows:
Step Unique Users
1 5
2 5
3 4
4 2
5 1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?
Restructuring the data may help with this. For example, structuring the data like this:
User Step Process#
---- ---- --------
A 1 1
A 2 1
A 3 1
B 1 2
B 2 2
B 3 2
B 4 2
B 5 2
E 1 3
E 2 3
E 1 4
E 2 4
E 3 4
Allows you to run the following query:
select step,
       count(distinct process#) as process_iterations,
       count(distinct [user]) as unique_users -- [user] is bracketed because USER is a reserved word in SQL Server
from stepdata
group by step
order by step;
which returns:
Step Process_Iterations Unique_Users
---- ------------------ ------------
1 4 3
2 4 3
3 3 3
4 1 1
5 1 1
I'm not familiar with hierarchyid, but splitting out that data into chunks for analysis looks like the sort of problem numbers tables are very good for. Join a numbers table against the individual substrings in the fallout and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations.
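As a rough illustration of the split-and-count idea, the Python sketch below expands each fallout path from the question into the steps it passed through and counts users per step. It assumes one path per user and no branching, as in the simple example; `users_per_step` is a hypothetical name, and this does not use any HierarchyID functions.

```python
from collections import Counter

# Fallout path per user, as in the question's report output.
fallout = {
    "A": "1/1/1",
    "B": "1/1/1/1/1",
    "C": "1/1/1/1",
    "D": "1/1/1",
    "E": "1/1",
}


def users_per_step(paths):
    """Count how many users reached each step, given each user's final path."""
    counts = Counter()
    for path in paths.values():
        # The depth of the path is the last step reached; ignore empty
        # segments in case the string has leading or trailing slashes.
        depth = len([part for part in path.split("/") if part])
        for step in range(1, depth + 1):
            counts[step] += 1
    return dict(sorted(counts.items()))


step_counts = users_per_step(fallout)
```

In SQL, the loop over `range(1, depth + 1)` is the part a numbers table would play: joining each path to the numbers 1 through its depth yields one row per user and step, which can then be grouped by step.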