How can I merge two data sets when the using data set is a subset of the master data set? Stata - merge

I have three data sets I would like to merge: one contains data on all members of a household and the other two contain more detailed data on women and men respectively.
I would like to link my detailed data to the household data. See an example of the data sets below. I need to use household ID (HH ID) and individual ID (IND ID) to link to bolded observations. Once the women and men data are merged there will still be some missing observations that do appear in the household data (which I would like to keep).
Household data set
HH ID   IND ID
-----   ------
1       1
1       2
1       3
1       4
2       1
2       2
2       3
3       1
Women's data set
HH ID   IND ID
-----   ------
1       1
1       3
2       2
3       1
I have tried the following 1:1 merge
merge 1:1 hhid indid using "womens_data.dta"
and got the following error message,
variables hhid indid do not uniquely identify observations in the master data
r(459);
What should I be doing instead?
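The error says that at least one (hhid, indid) pair occurs more than once in the master file, so a reasonable first step is to find those repeats (in Stata, `duplicates report hhid indid` summarises them). Once the keys are unique, keeping every household row and attaching the women's details where they exist is a left join. The sketch below illustrates that logic in plain Python rather than Stata; the `age` values are hypothetical and exist only so the join has something to attach.

```python
from collections import Counter

# (hhid, indid) keys from the sample household data in the question.
household = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1)]

# Women's data keyed by (hhid, indid). The "age" column is hypothetical.
women = {
    (1, 1): {"age": 34},
    (1, 3): {"age": 61},
    (2, 2): {"age": 29},
    (3, 1): {"age": 45},
}


def duplicate_keys(keys):
    """Return each key that appears more than once, with its count."""
    return {key: n for key, n in Counter(keys).items() if n > 1}


def left_join(master_keys, using):
    """Keep every master row; attach the using row where the key matches, else None."""
    return [(key, using.get(key)) for key in master_keys]


# An empty dict means the keys uniquely identify the master rows.
print(duplicate_keys(household))

merged = left_join(household, women)
for key, detail in merged:
    print(key, detail)
```

If `duplicate_keys` returns anything for the real household file, those rows need to be resolved (or the key extended) before a one-to-one match can work.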

Related

Recursive CTE in Postgres - SUM at each parent node

I'm storing a hierarchical product adjacency list and a separate table for all the sales of those products.
Currently, I'm trying to present the user with a "Sales Report" showing the total quantity and sum sold for each product and for each parent level.
In the sales table, I do not have info on the sales per group, thus, I have info available only at the level of the child. From what I have read I need to use recursive CTE, and I tried creating some queries, without any success.
Example of my dataset:
F - folders
P - products
Products table:
id   name   parentid
--   ----   --------
1    F1     NULL
2    F2     1
3    P1     2
4    P2     2
Sales table:
id   quant   sum
--   -----   ---
3    3       90
4    2       100
What I need to obtain in the report:
id   name   parentid   quant   sum
--   ----   --------   -----   ---
1    F1     NULL       5       190
2    F2     1          5       190
3    P1     2          3       90
4    P2     2          2       100
Logically, I understand that I need to fetch each row, and recursively go through all its children in order to SUM the quant and the sum, however I have no clue how to write it.
I'd be thankful for any guidance on where I could read more about recursive CTEs, or anything that can help my situation.
Cheers!
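One possible shape for such a query: a recursive CTE first expands every node into (ancestor, descendant) pairs, counting each node as its own descendant, and the sales are then joined on the descendant and grouped by the ancestor. The sketch below runs the query in SQLite through Python's `sqlite3` module so that it is self-contained; the table names `products` and `sales` are assumptions, and the query has not been tested against Postgres here, although it uses only standard `WITH RECURSIVE` syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, parentid INTEGER);
    CREATE TABLE sales (id INTEGER, quant INTEGER, "sum" INTEGER);
    INSERT INTO products VALUES (1, 'F1', NULL), (2, 'F2', 1), (3, 'P1', 2), (4, 'P2', 2);
    INSERT INTO sales VALUES (3, 3, 90), (4, 2, 100);
""")

QUERY = """
WITH RECURSIVE tree(ancestor, descendant) AS (
    SELECT id, id FROM products              -- every node is its own descendant
    UNION ALL
    SELECT t.ancestor, p.id                  -- walk one level further down
    FROM tree AS t
    JOIN products AS p ON p.parentid = t.descendant
)
SELECT pr.id, pr.name, pr.parentid,
       COALESCE(SUM(s.quant), 0) AS quant,
       COALESCE(SUM(s."sum"), 0) AS total
FROM products AS pr
JOIN tree AS t ON t.ancestor = pr.id
LEFT JOIN sales AS s ON s.id = t.descendant  -- sales.id holds the product id
GROUP BY pr.id, pr.name, pr.parentid
ORDER BY pr.id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The `COALESCE` calls are a design choice: a folder with no sales anywhere beneath it reports 0 rather than NULL.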

Number of Order From Customers

This seems like a simple question, but I'm having trouble building my graph. I'm trying to get the number of customers who made 1 order, 2 orders, 3 orders, etc.
Sample Data Source:
Customer ID| Order ID| Date Ordered
A 10 06/01/2019
A 11 06/02/2019
A 12 06/02/2019
B 15 06/05/2019
B 16 06/05/2019
B 17 06/05/2019
C 20 06/06/2019
C 21 06/06/2019
I can easily get the graph to show that Customer A made 3 orders, Customer B made 3 orders, Customer C made 2 orders, etc.
What I'm trying to show is how many customers placed a given number of orders. In the sample data: 1 order = 0 customers, 2 orders = 1 customer, 3 orders = 2 customers. So on the X axis I'm trying to show (1 Order, 2 Orders, 3 Orders, 4 Orders, etc.).
I tried calculations such as IF COUNT([CustomerID]) > 2 then '1 order', but I can't seem to get it right. Any advice will be helpful. Thanks in advance.
Maybe you can try using LOD expressions and create a new calculated field like this:
{INCLUDE [Customer ID]: COUNTD([Order ID])}
And then use that field to show that info.
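The calculation has two stages: count the distinct orders per customer, then count how many customers share each order count. As a rough cross-check outside Tableau, the same logic can be sketched in Python on the sample data (the variable names are illustrative only):

```python
from collections import Counter

# (customer id, order id) pairs from the sample data in the question.
orders = [
    ("A", 10), ("A", 11), ("A", 12),
    ("B", 15), ("B", 16), ("B", 17),
    ("C", 20), ("C", 21),
]

# Stage 1: distinct orders per customer.
order_ids = {}
for customer, order_id in orders:
    order_ids.setdefault(customer, set()).add(order_id)
orders_per_customer = {customer: len(ids) for customer, ids in order_ids.items()}

# Stage 2: number of customers for each order count.
customers_per_count = Counter(orders_per_customer.values())

for n in range(1, max(customers_per_count) + 1):
    print(f"{n} order(s): {customers_per_count[n]} customer(s)")
```

A `Counter` returns 0 for a missing key, which is why the "1 order" bucket still appears in the output with no customers in it.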

SPSS - Create dummy for top volume months within customer grouping

I need to create a dummy for the top purchase months within each customer ID. That is, if a month is one of the four months within the year in which the customer purchased the most, it is coded 1, otherwise 0.
Example of data, cust id, order date, volume and new variable dummy:
This code creates some sample data:
data list free/ID volume (2f4).
begin data
1 100 1 500 2 1 2 2 2 3 2 90 1 600 1 90 1 870 2 9 2 8 2 10
end data.
Using the sample data above, this code will create a new variable containing the dummy according to your definition. The ranking is descending (D), so that rank 1 goes to the largest volume within each ID:
RANK VARIABLES=volume (D) BY ID /RANK.
compute high4=(Rvolume<=4).
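To check what the dummy should look like, the intended result (flag the four largest volumes within each ID) can be sketched in Python on the same sample data. The function name `flag_top_n` is illustrative, and ties are broken by original row order here, which is an assumption and differs from SPSS's default of assigning mean ranks to ties.

```python
from collections import defaultdict

# (ID, volume) pairs, in the order given by the sample "begin data" block.
rows = [
    (1, 100), (1, 500), (2, 1), (2, 2), (2, 3), (2, 90),
    (1, 600), (1, 90), (1, 870), (2, 9), (2, 8), (2, 10),
]


def flag_top_n(rows, n=4):
    """Return (id, volume, dummy) rows; dummy is 1 for the n largest volumes per id."""
    by_id = defaultdict(list)
    for idx, (cust_id, volume) in enumerate(rows):
        by_id[cust_id].append((volume, idx))

    top_rows = set()
    for items in by_id.values():
        # Largest volume first; ties fall back to original row order (assumption).
        ranked = sorted(items, key=lambda item: (-item[0], item[1]))
        top_rows.update(idx for _, idx in ranked[:n])

    return [(cust_id, volume, int(idx in top_rows))
            for idx, (cust_id, volume) in enumerate(rows)]


flagged = flag_top_n(rows)
for row in flagged:
    print(row)
```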

"Inserting" Records into Fields from a Database Feed

So the background to this is I'm trying to create a survival curve based on a database feed from the directions here.
What I have so far is the three calculated fields shown below. Patient ID is not a calculated field and is not necessary for the survival analysis, but I believe it could be useful for this question. For reference, there are about 20,000 unique patients.
Patient ID | Time | Censor | Group
Id1 3 0 1
Id2 8 0 2
Id3 1 1 1
Id4 3 1 1
Id5 11 0 1
Id5 7 1 2
What I would like to do is insert two records (one for each group), so that the data looks like this:
Patient ID | Time | Censor | Group | Link
           | 0    |        | 1     |
           | 0    |        | 2     |
Id1        | 3    | 0      | 1     | link
Id2        | 8    | 0      | 2     | link
Id3        | 1    | 1      | 1     | link
Id4        | 3    | 1      | 1     | link
Id5        | 11   | 0      | 1     | link
Id5        | 7    | 1      | 2     | link
I tried, unsuccessfully, to create an Excel spreadsheet with these base records to union with the database columns; however, an Excel spreadsheet does not appear to be able to union with a database.
My next idea is to find 2 of the 20,000 patients where I can create a calculated field along these lines (not sure this is feasible in Tableau, please excuse my syntax):
IF [Patient ID] = Id3 THEN [TIME] = 0 AND [CENSOR] IS NULL
END
and then a [Link] calculated formula:
IF [Patient ID] = Id3 THEN NULL
ELSE "link"
END
Any help would be appreciated. Would like to avoid inserting these records in the database.
The easiest option is an outer join to your Excel workbook. Cross-database joins are a new feature in Tableau version 10.
Once the data sets are combined, you can build the business logic with a filter or calculated field based on the presence or absence of the Excel data.
http://www.tableau.com/about/blog/2016/7/integrate-your-data-cross-database-joins-56724

Aggregate path counts using HierarchyID

Business problem - understand process fallout using analytics data.
Here is what we have done so far:
1. Build a dictionary table with every possible process step
2. Find each process "start"
3. Find the last step for each start
4. Join dictionary table to last step to find path to final step
In the final report output we end up with a list of paths for each start to each final step:
User   Fallout Step HierarchyID.ToString()
----   -----------------------------------
A      1/1/1
B      1/1/1/1/1
C      1/1/1/1
D      1/1/1
E      1/1
This means that five users (A-E) started the process. Assume only User B finished; the other four did not. Since this is a simple example (without branching), we want the output to look as follows:
Step Unique Users
1 5
2 5
3 4
4 2
5 1
The easiest solution I could think of is to take each hierarchyID.ToString(), parse that out into a set of subpaths, JOIN back to the dictionary table, and output using GROUP BY.
Given the volume of data, I'd like to use the built-in HierarchyID functions, e.g. IsAncestorOf.
Any ideas or thoughts how I could write this? Maybe a recursive CTE?
Restructuring the data may help with this. For example, structuring the data like this:
User Step Process#
---- ---- --------
A 1 1
A 2 1
A 3 1
B 1 2
B 2 2
B 3 2
B 4 2
B 5 2
E 1 3
E 2 3
E 1 4
E 2 4
E 3 4
allows you to run the following query:
select step,
       count(distinct process#) as process_iterations,
       count(distinct [user]) as unique_users  -- bracketed: USER is a reserved word in T-SQL
from stepdata
group by step
order by step;
which returns:
Step Process_Iterations Unique_Users
---- ------------------ ------------
1 4 3
2 4 3
3 3 3
4 1 1
5 1 1
I'm not familiar with hierarchyid, but splitting out that data into chunks for analysis looks like the sort of problem numbers tables are very good for. Join a numbers table against the individual substrings in the fallout and it shouldn't be too hard to treat the whole thing as a table and analyse it on the fly, without any non-set operations.
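The splitting idea can be sketched quickly outside the database: expand each fallout path into all of its prefixes and count the distinct users that reach each prefix. This is only a Python illustration of the logic, not a HierarchyID or numbers-table implementation; `users_per_prefix` is a hypothetical helper, and empty segments are skipped so that paths written with leading or trailing slashes are also accepted.

```python
from collections import defaultdict

# (user, fallout path) pairs from the example report in the question.
fallout = [
    ("A", "1/1/1"),
    ("B", "1/1/1/1/1"),
    ("C", "1/1/1/1"),
    ("D", "1/1/1"),
    ("E", "1/1"),
]


def users_per_prefix(rows):
    """Map each path prefix (as a tuple of segments) to its number of distinct users."""
    seen = defaultdict(set)
    for user, path in rows:
        parts = tuple(segment for segment in path.split("/") if segment)
        # A user who fell out at depth d has passed through every prefix of length 1..d.
        for depth in range(1, len(parts) + 1):
            seen[parts[:depth]].add(user)
    return {prefix: len(users) for prefix, users in seen.items()}


counts = users_per_prefix(fallout)
for prefix in sorted(counts, key=len):
    print(len(prefix), "/".join(prefix), counts[prefix])
```

Keying on the full prefix rather than on depth alone keeps the counts separate when the process branches, at the cost of one entry per distinct sub-path.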