Merging 2 datasets when I observe 2 variables over multiple time periods - merge

I'm trying to merge two datasets that look like this:
Dataset 1
unique_id client_id year var1 var2
0001 10001 2000 . .
0001 10001 2001 . .
0002 10002 2000 . .
0002 10002 2001 . .
Dataset 2
client_id year var3 var4
10001 2000 . .
10001 2001 . .
10001 2002 . .
10002 2000 . .
10003 2001 . .
I observe each unique ID over multiple time periods and each client over multiple time periods.
I want to merge the datasets so that, for each unique_id in a given year, I have the corresponding client_id with var3 and var4 for that same year. The final dataset should look like this:
unique_id client_id year var1 var2 var3(of client i in year t) .......
0001 10001 2000 . . .
0001 10001 2001 . . .
0002 10001 2000 . . .
Please let me know if I need to provide more information.

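use "dataset1", clear
* m:1 because several unique_id rows may share a client_id-year pair;
* dataset2 must be unique on client_id year
merge m:1 client_id year using "dataset2", keep(master match)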
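To sanity-check the result outside Stata, here is a minimal Python sketch of the same lookup on (client_id, year). The var values are placeholders invented for illustration, and keeping every Dataset 1 row while dropping unmatched Dataset 2 rows is an assumption about the desired output.

```python
# Dataset 1 rows: (unique_id, client_id, year, var1, var2).
# The var values are made-up placeholders, not data from the question.
dataset1 = [
    ("0001", "10001", 2000, "a1", "b1"),
    ("0001", "10001", 2001, "a2", "b2"),
    ("0002", "10002", 2000, "a3", "b3"),
    ("0002", "10002", 2001, "a4", "b4"),
]
# Dataset 2 rows: (client_id, year, var3, var4).
dataset2 = [
    ("10001", 2000, "c1", "d1"),
    ("10001", 2001, "c2", "d2"),
    ("10001", 2002, "c3", "d3"),
    ("10002", 2000, "c4", "d4"),
    ("10003", 2001, "c5", "d5"),
]

# Index Dataset 2 by its key; (client_id, year) must be unique there.
lookup = {}
for client_id, year, var3, var4 in dataset2:
    key = (client_id, year)
    if key in lookup:
        raise ValueError(f"duplicate key in dataset2: {key}")
    lookup[key] = (var3, var4)

# Keep every Dataset 1 row; attach var3/var4 when the key exists, else None.
merged = []
for unique_id, client_id, year, var1, var2 in dataset1:
    var3, var4 = lookup.get((client_id, year), (None, None))
    merged.append((unique_id, client_id, year, var1, var2, var3, var4))

for row in merged:
    print(row)
```

Client 10002 has no 2001 row in Dataset 2, so that row keeps empty var3 and var4 rather than being dropped.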

Related

Merging two dataset in Stata with inconsistent ID

I have 2 dataset as follows:
input province municipality year str5 muni_name population
1 1 2000 AAA 1000
1 2 2000 AAB 5000
2 1 2000 AAC 1500
2 2 2000 AAA 3000
3 1 2000 AAA 5600
end
input province municipality year str5 muni_name population
1 1 2010 AAA 2000
1 2 2010 AAB 6000
10 5 2010 AAC 3500
10 6 2010 AAA 7000
11 1 2010 AAA 6000
end
In each dataset, observations are uniquely identified by the combination of province and municipality. However, the province or municipality code may change from year to year. For example, province 2, municipality 1 in 2000 is the same place as province 10, municipality 5 in 2010. We can tell this because the muni_name is the same. The challenge is that muni_name does not uniquely identify an observation: some municipality names appear in multiple provinces (AAA is found in provinces 1, 2 and 3 in 2000).
I'd like to have the final merged dataset as follows:
input province municipality year str5 muni_name population
1 1 2000 AAA 1000
1 1 2010 AAA 2000
1 2 2000 AAB 5000
1 2 2010 AAB 6000
10 5 2000 AAC 1500
10 5 2010 AAC 3500
10 6 2000 AAA 3000
10 6 2010 AAA 7000
11 1 2000 AAA 5600
11 1 2010 AAA 6000
end
I'd like the merged data to be uniquely identified by province, municipality and year. Where the province and municipality codes conflict, I'd like to replace them with the codes from the most recent year.
What is the best way to do this? My current idea is as follows. I'd like to treat a match as a 'perfect match' if province, municipality and muni_name all coincide (AAA and AAB in province 1 are examples). Among the observations that are not perfect matches, I'd like to match on muni_name where it is not duplicated (call this a 'semi-perfect match'). In this example, AAC is such a case.
Lastly, for a duplicated muni_name (AAA) that is not a perfect match, I'd match based on the province of other perfect or semi-perfect matches. Note that AAC and one of the AAA municipalities are in the same province. Since AAC is matched to province 10 in 2010, the AAA in the same province should also be matched to province 10 in 2010.
How can I code this matching strategy in Stata?
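The strategy described above can be prototyped before committing to Stata code. The following is a rough Python sketch of the matching logic only, not Stata and not a definitive implementation. The function name build_crosswalk is hypothetical, and two assumptions go beyond the question: the unique-name and same-province rules are repeated until no new match appears (this is what lets province 3's AAA pair with province 11's AAA once the other AAA rows are taken), and each old province is assumed to map to a single new province.

```python
from collections import Counter

# (province, municipality, muni_name) rows copied from the question.
old_rows = [(1, 1, "AAA"), (1, 2, "AAB"), (2, 1, "AAC"), (2, 2, "AAA"), (3, 1, "AAA")]
new_rows = [(1, 1, "AAA"), (1, 2, "AAB"), (10, 5, "AAC"), (10, 6, "AAA"), (11, 1, "AAA")]


def build_crosswalk(old, new):
    """Return {(old_province, old_municipality): (new_province, new_municipality)}."""
    crosswalk = {}
    new_left = set(new)

    # Tier 1 (perfect match): province, municipality and name all coincide.
    for row in old:
        if row in new_left:
            crosswalk[row[:2]] = row[:2]
            new_left.discard(row)
    old_left = [row for row in old if row[:2] not in crosswalk]

    # Tiers 2 and 3, repeated until a full pass adds no new match.
    progress = True
    while progress and old_left:
        progress = False
        old_names = Counter(name for _, _, name in old_left)
        new_names = Counter(name for _, _, name in new_left)
        # Province mapping implied by matches found in earlier passes
        # (assumes one new province per old province).
        province_map = {o[0]: n[0] for o, n in crosswalk.items()}
        for row in list(old_left):
            province, _, name = row
            candidates = [r for r in new_left if r[2] == name]
            if old_names[name] != 1 or new_names[name] != 1:
                # Tier 3: the name is ambiguous, so require the mapped province.
                target = province_map.get(province)
                candidates = [r for r in candidates if r[0] == target]
            if len(candidates) == 1:
                crosswalk[row[:2]] = candidates[0][:2]
                new_left.discard(candidates[0])
                old_left.remove(row)
                progress = True
    return crosswalk


crosswalk = build_crosswalk(old_rows, new_rows)
for old_code, new_code in sorted(crosswalk.items()):
    print(old_code, "->", new_code)
```

Once a crosswalk like this exists, the 2000 rows can be relabelled with the 2010 codes and appended to the 2010 rows. Rows left unmatched after the loop would need manual review.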

PostgreSQL: Count Number of Occurrences in Columns

BACKGROUND
I have three large tables (employee_info, driver_info, school_info) that I have joined together on common attributes using a series of LEFT OUTER JOIN operations. After each join, the resulting number of records increased slightly, indicating that there are duplicate IDs in the data. To try and find all of the duplicates in the IDs, I dumped the ID columns into a temp table like so:
Original Dump of ID Columns
first_name  last_name  employee_id  driver_id  school_id
Mickey      Mouse      1234         abcd       wxyz
Donald      Duck       2423         heca       qwer
Mary        Poppins    1111         acbe       aaaa
Wiley       Cayote     1234         strf       aaaa
Daffy       Duck       1256         acbe       pqrs
Bugs        Bunny      9999         strf       yxwv
Pink        Panther    2222         zzzz       zzaa
Michael     Archangel  0000         rstu       aaaa
In this overly simplified example, you will see that IDs 1234 (employee_id), strf (driver_id), and aaaa (school_id) are each duplicated at least once. I would like to add a count column for each of the ID columns, and populate them with the count for each ID used, like so:
ID Columns with Counts
first_name  last_name  employee_id  employee_id_count  driver_id  driver_id_count  school_id  school_id_count
Mickey      Mouse      1234         2                  abcd       1                wxyz       1
Donald      Duck       2423         1                  heca       1                qwer       1
Mary        Poppins    1111         1                  acbe       1                aaaa       3
Wiley       Cayote     1234         2                  strf       2                aaaa       3
Daffy       Duck       1256         1                  acbe       1                pqrs       1
Bugs        Bunny      9999         1                  strf       2                yxwv       1
Pink        Panther    2222         1                  zzzz       1                zzaa       1
Michael     Archangel  0000         1                  rstu       1                aaaa       3
You can see that IDs 1234 and strf each have 2 in the count, and aaaa has 3. After generating this table, my goal is to pull out all records where any of the counts are greater than 1, like so:
All Records with One or More Duplicate IDs
first_name  last_name  employee_id  employee_id_count  driver_id  driver_id_count  school_id  school_id_count
Mickey      Mouse      1234         2                  abcd       1                wxyz       1
Mary        Poppins    1111         1                  acbe       1                aaaa       3
Wiley       Cayote     1234         2                  strf       2                aaaa       3
Bugs        Bunny      9999         1                  strf       2                yxwv       1
Michael     Archangel  0000         1                  rstu       1                aaaa       3
Real World Perspective
In my real-world work, the JOIN'd table contains 100 columns, 15 different ID fields and over 30,000 records, and the final table came out 28 records larger than the original. This may seem like a small amount, but each of the 28 represents a broken link that we must fix.
Is there a simple way to get the counts populated like in the second table above? I have been wrestling with this for hours already, and have not been able to make this work. I tried some aggregate functions, but they cannot be used in table UPDATE operations.
The COUNT function, when used as an analytic function, can do what you want here, e.g.
WITH cte AS (
    SELECT *,
        COUNT(employee_id) OVER (PARTITION BY employee_id) AS employee_id_count,
        COUNT(driver_id) OVER (PARTITION BY driver_id) AS driver_id_count,
        COUNT(school_id) OVER (PARTITION BY school_id) AS school_id_count
    FROM yourTable
)
SELECT *
FROM cte
WHERE
    employee_id_count > 1 OR
    driver_id_count > 1 OR
    school_id_count > 1;
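To see what the per-column counts should come out to on the sample rows, here is a small Python sketch of the same idea: count each ID within its own column, then keep rows where any count exceeds 1. It is only a cross-check of the logic, not a replacement for the query; the names ID_COLUMNS and flagged are made up for this example.

```python
from collections import Counter

# Sample rows from the question:
# (first_name, last_name, employee_id, driver_id, school_id)
rows = [
    ("Mickey", "Mouse", "1234", "abcd", "wxyz"),
    ("Donald", "Duck", "2423", "heca", "qwer"),
    ("Mary", "Poppins", "1111", "acbe", "aaaa"),
    ("Wiley", "Cayote", "1234", "strf", "aaaa"),
    ("Daffy", "Duck", "1256", "acbe", "pqrs"),
    ("Bugs", "Bunny", "9999", "strf", "yxwv"),
    ("Pink", "Panther", "2222", "zzzz", "zzaa"),
    ("Michael", "Archangel", "0000", "rstu", "aaaa"),
]

# Position of each ID column within a row (hypothetical helper mapping).
ID_COLUMNS = {"employee_id": 2, "driver_id": 3, "school_id": 4}

# One Counter per ID column, like COUNT(...) OVER (PARTITION BY ...).
counts = {
    name: Counter(row[idx] for row in rows)
    for name, idx in ID_COLUMNS.items()
}

# Keep rows where at least one ID occurs more than once in its column.
flagged = [
    row
    for row in rows
    if any(counts[name][row[idx]] > 1 for name, idx in ID_COLUMNS.items())
]

for row in flagged:
    print(row, [counts[name][row[idx]] for name, idx in ID_COLUMNS.items()])
```

Note that in the sample rows driver_id acbe is shared by Mary Poppins and Daffy Duck, so this sketch counts it twice and also flags Daffy Duck.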

Tableau Count Records Between Date Ranges

I am trying to get a count of records between dates.
My data has records from 01/01/2020 to 04/01/2020.
I have set up two parameters, Start-date & End-date
I only want to count the records that are between my start (01/01/2020) and end date (01/31/2020).
Sample Data
Sheet_ID Supervisor_ID Category_ID Date
OB-111 1111 1 01/01/2020
OB-112 1111 4 03/01/2020
OB-113 1111 2 01/01/2020
OB-114 2222 2 01/01/2020
OB-115 2222 2 01/21/2020
I am trying to show the following
Supervisor_ID Category_ID Count
1111 1 1
1111 2 1
2222 2 2
Thank you in advance!
Create a calculated field as follows:
IF [Date] >= [Start-date] AND [Date] <= [End-date] THEN 1 END
Sum this field to get the count.
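As a cross-check of what that calculation should return, here is a short Python sketch of the same flag-and-sum logic on the sample data. It assumes the dates are in MM/DD/YYYY format and that the two parameters are set to 01/01/2020 and 01/31/2020.

```python
from collections import Counter
from datetime import date

# Sample data: (Sheet_ID, Supervisor_ID, Category_ID, Date), dates read as MM/DD/YYYY.
records = [
    ("OB-111", 1111, 1, date(2020, 1, 1)),
    ("OB-112", 1111, 4, date(2020, 3, 1)),
    ("OB-113", 1111, 2, date(2020, 1, 1)),
    ("OB-114", 2222, 2, date(2020, 1, 1)),
    ("OB-115", 2222, 2, date(2020, 1, 21)),
]

# Stand-ins for the two Tableau parameters.
start_date = date(2020, 1, 1)
end_date = date(2020, 1, 31)

counts = Counter()
for sheet_id, supervisor_id, category_id, day in records:
    # Mirrors IF ... THEN 1 END: rows outside the range contribute nothing.
    if start_date <= day <= end_date:
        counts[(supervisor_id, category_id)] += 1

for (supervisor_id, category_id), n in sorted(counts.items()):
    print(supervisor_id, category_id, n)
```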

How can I do subtraction between two rows based on their id value?

I have the following table, which holds the hour readings of equipment for each shift. It contains over 1500 rows. I want to subtract each shift's reading from the next shift's reading so that I can find the hours worked for that equipment.
Id Shift Eqpmt HourReading
-- ----- ----- ------------
1 Shift1 E21 2488
2 Shift1 E52 36882
3 Shift1 Q53 2384
4 Shift1 S54 44874
. . . .
. . . .
11 Shift2 E21 2500
12 Shift2 E52 36900
13 Shift2 Q53 2388
14 Shift2 S54 44875
. . . .
. . . .
. . . .
select
distinct sh.Shift
,sh.Eqpmt
,(a.HourReading-sh.HourReading) WorkedHrs
from sh
join (select id,Shift, Eqpmt, HourReading from sh) a on
a.Eqpmt=sh.Eqpmt and a.id>sh.id
I tried the above script, but it subtracts every shift's value from every later one and gives me over 30000 records. I added the id column myself to make this operation possible, but it still doesn't work.
This is what I actually want to get:
Shift Eqpmt WorkedHours
------ ----- ------------
Shift1 E21 12
Shift1 E52 18
Shift1 Q53 4
Shift1 S54 1
I think your attempted query would work for this sample data, but I take it you have more than 2 shifts in the actual data (e.g. "Shift3", "Shift4", etc). Therefore I think you might want to look at using lead.
Something like:
SELECT *
FROM
(
SELECT shift, eqpmt, lead(hourreading) OVER (PARTITION BY eqpmt ORDER BY id) - hourreading as workedhours
FROM sh
) a
WHERE workedhours is not null;
The lead(hourreading) OVER (PARTITION BY eqpmt ORDER BY id) bit will get the next row (ordered by id) that has the same eqpmt value. I wrapped it in the outer query checking for workedhours is not null so that the last shift (that isn't finished yet) doesn't show up in the result.
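To illustrate what lead does per equipment, here is a small Python sketch that pairs each reading with the next reading for the same Eqpmt (ordered by Id) and subtracts. It uses only the eight rows visible in the question and is a model of the logic, not of how the database executes the query.

```python
from collections import defaultdict

# Visible sample rows: (Id, Shift, Eqpmt, HourReading).
readings = [
    (1, "Shift1", "E21", 2488),
    (2, "Shift1", "E52", 36882),
    (3, "Shift1", "Q53", 2384),
    (4, "Shift1", "S54", 44874),
    (11, "Shift2", "E21", 2500),
    (12, "Shift2", "E52", 36900),
    (13, "Shift2", "Q53", 2388),
    (14, "Shift2", "S54", 44875),
]

# PARTITION BY eqpmt ORDER BY id: group rows per equipment in Id order.
by_eqpmt = defaultdict(list)
for row in sorted(readings):
    by_eqpmt[row[2]].append(row)

worked_hours = []
for eqpmt, rows in by_eqpmt.items():
    # zip(rows, rows[1:]) pairs each row with its successor, like lead();
    # the last row of each partition has no successor and is skipped.
    for current, following in zip(rows, rows[1:]):
        worked_hours.append((current[1], eqpmt, following[3] - current[3]))

for shift, eqpmt, hours in worked_hours:
    print(shift, eqpmt, hours)
```

With more shifts in the data, each shift is compared only with the one immediately after it, which avoids the every-pair blow-up of the self-join.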

How to summarize the data generated in crystal report

I have created a Crystal report whose sample output is below:
C_name P_name Code P. Rev F. rev C.rev
ABC AAA ABC-1-1 100 1100 1100
ABC AAA ABC-1-2 200 1200 1200
ABC AAA ABC-1-3 300 1300 2300
XYZ BBB XYZ-1-1 200 1200 2200
XYZ BBB XYZ-1-2 150 1150 3150
DEF CCC DEF-5-1 400 1400 1400
DEF CCC DEF-5-6 100 1100 2100
DEF CCC DEF-5-9 200 1200 4200
DEF DDD DEF-8-11 300 1300 2300
DEF DDD DEF-8-12 400 1400 400
Now I want to add up the values and show them against the maximum value of Code. For example, ABC has 3 codes, of which ABC-1-3 is the latest. So I want to display a single row for these 3 records, with the revenue values of all 3 records added up. The final output should look like this:
ABC AAA ABC-1-3 600 3600 4600
XYZ BBB XYZ-1-2 350 2350 5350
DEF CCC DEF-5-9 700 3700 7700
DEF DDD DEF-8-12 700 2700 2700
Please help..
Thanks
You can insert a group, selecting P_name as the group field. Then, under Running Total Fields, create a new field for each of P. Rev, F. rev and C.rev: in the Evaluate and Reset sections, choose "On change of group", and set the type of summary to Sum. Do the same for the Code field, but set the type of summary to Maximum. Now you can add the new fields to your group and see the result.
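To check the totals the report should show, here is a minimal Python sketch of the same grouping: one row per P_name, with the revenues summed and the maximum Code kept. The helper code_key is hypothetical; it compares the numeric parts of a code as numbers, on the assumption that all codes within a group share the same PREFIX-n-n shape.

```python
# Sample report rows: (C_name, P_name, Code, P_rev, F_rev, C_rev).
rows = [
    ("ABC", "AAA", "ABC-1-1", 100, 1100, 1100),
    ("ABC", "AAA", "ABC-1-2", 200, 1200, 1200),
    ("ABC", "AAA", "ABC-1-3", 300, 1300, 2300),
    ("XYZ", "BBB", "XYZ-1-1", 200, 1200, 2200),
    ("XYZ", "BBB", "XYZ-1-2", 150, 1150, 3150),
    ("DEF", "CCC", "DEF-5-1", 400, 1400, 1400),
    ("DEF", "CCC", "DEF-5-6", 100, 1100, 2100),
    ("DEF", "CCC", "DEF-5-9", 200, 1200, 4200),
    ("DEF", "DDD", "DEF-8-11", 300, 1300, 2300),
    ("DEF", "DDD", "DEF-8-12", 400, 1400, 400),
]


def code_key(code):
    """Sort key that compares numeric parts as numbers (so 12 > 9)."""
    return tuple(int(part) if part.isdigit() else part for part in code.split("-"))


# One summary per P_name: [C_name, max Code, sum P_rev, sum F_rev, sum C_rev].
summary = {}
for c_name, p_name, code, p_rev, f_rev, c_rev in rows:
    if p_name not in summary:
        summary[p_name] = [c_name, code, 0, 0, 0]
    group = summary[p_name]
    if code_key(code) > code_key(group[1]):
        group[1] = code
    group[2] += p_rev
    group[3] += f_rev
    group[4] += c_rev

for p_name, (c_name, code, p_rev, f_rev, c_rev) in summary.items():
    print(c_name, p_name, code, p_rev, f_rev, c_rev)
```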