How can I merge two data sets with ID variation in Stata

I have the following two data sets.
The first one, from children, looks like this.
ID year Q1 Q2 Q3 Q4 ....
101 2014 1 2 2 2
101 2016 1 2 2 2
101 2017 1 2 2 2
101 2018 1 2 2 2
401 2014 1 2 2 2
401 2015 1 2 3 3
401 2016 1 2 2 2
401 2017 1 2 1 1
401 2018 1 2 2 2
402 2014 1 1 0 3
402 2015 1 1 2 2
402 2016 1 1 2 2
402 2017 1 1 3 3
402 2018 1 1 2 3
Here's the second one from their parents.
ID year Q101 Q102
1 2014 1 3
1 2015 1 3
1 2016 1 3
1 2017 1 2
1 2018 1 2
2 2014 2 .
2 2015 1 2
2 2016 . .
2 2017 1 3
2 2018 2 .
4 2014 1 3
4 2015 1 3
4 2016 1 3
4 2017 1 3
4 2018 1 3
So the parent ID can be matched to the child ID with its last two digits deleted. It seems that parent ID 4 has two children.
I tried
merge 1:m ID
using the kids data as the master data set, but it didn't work.
Thank you.

Getting good answers is made more likely by (a) attempting code and showing what you tried and (b) giving data in the form of code anybody using Stata can run. The code here follows from editing your post and is close to what you could get directly by using dataex as explained in the Stata tag wiki or indeed at help dataex in an up-to-date Stata or one in which you installed dataex from SSC.
clear
input ID year Q1 Q2 Q3 Q4
101 2014 1 2 2 2
101 2016 1 2 2 2
101 2017 1 2 2 2
101 2018 1 2 2 2
401 2014 1 2 2 2
401 2015 1 2 3 3
401 2016 1 2 2 2
401 2017 1 2 1 1
401 2018 1 2 2 2
402 2014 1 1 0 3
402 2015 1 1 2 2
402 2016 1 1 2 2
402 2017 1 1 3 3
402 2018 1 1 2 3
end
gen IDP = floor(ID/100)
save children
clear
input ID year Q101 Q102
1 2014 1 3
1 2015 1 3
1 2016 1 3
1 2017 1 2
1 2018 1 2
2 2014 2 .
2 2015 1 2
2 2016 . .
2 2017 1 3
2 2018 2 .
4 2014 1 3
4 2015 1 3
4 2016 1 3
4 2017 1 3
4 2018 1 3
end
rename ID IDP
merge 1:m IDP year using children
list
+----------------------------------------------------------------------+
| IDP year Q101 Q102 ID Q1 Q2 Q3 Q4 _merge |
|----------------------------------------------------------------------|
1. | 1 2014 1 3 101 1 2 2 2 matched (3) |
2. | 1 2015 1 3 . . . . . master only (1) |
3. | 1 2016 1 3 101 1 2 2 2 matched (3) |
4. | 1 2017 1 2 101 1 2 2 2 matched (3) |
5. | 1 2018 1 2 101 1 2 2 2 matched (3) |
|----------------------------------------------------------------------|
6. | 2 2014 2 . . . . . . master only (1) |
7. | 2 2015 1 2 . . . . . master only (1) |
8. | 2 2016 . . . . . . . master only (1) |
9. | 2 2017 1 3 . . . . . master only (1) |
10. | 2 2018 2 . . . . . . master only (1) |
|----------------------------------------------------------------------|
11. | 4 2014 1 3 401 1 2 2 2 matched (3) |
12. | 4 2015 1 3 401 1 2 3 3 matched (3) |
13. | 4 2016 1 3 402 1 1 2 2 matched (3) |
14. | 4 2017 1 3 401 1 2 1 1 matched (3) |
15. | 4 2018 1 3 402 1 1 2 3 matched (3) |
|----------------------------------------------------------------------|
16. | 4 2014 1 3 402 1 1 0 3 matched (3) |
17. | 4 2015 1 3 402 1 1 2 2 matched (3) |
18. | 4 2016 1 3 401 1 2 2 2 matched (3) |
19. | 4 2017 1 3 402 1 1 3 3 matched (3) |
20. | 4 2018 1 3 401 1 2 2 2 matched (3) |
+----------------------------------------------------------------------+
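As far as the merge is concerned, the essentials are identifiers with the same name(s) in both datasets and the correct pattern for merging. The parent identifier is only implied by the children dataset, which is why IDP is generated there before saving. With the parent data as master, each combination of IDP and year occurs once in the master and possibly several times in the children data, hence 1:m.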
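For readers who want to see the matching logic outside Stata, here is a minimal sketch in plain Python. It is an illustration only, not a replacement for merge: the helper parent_id and the list layout are assumptions made for this example, only a subset of the rows is used, and unmatched children rows ("using only" in Stata) are not handled because none occur in this subset.

```python
# Children rows: (ID, year, Q1, Q2, Q3, Q4); a subset of the data above.
children = [
    (101, 2014, 1, 2, 2, 2),
    (101, 2016, 1, 2, 2, 2),
    (401, 2014, 1, 2, 2, 2),
    (402, 2014, 1, 1, 0, 3),
]
# Parent rows: (ID, year, Q101, Q102); missing values are None.
parents = [
    (1, 2014, 1, 3),
    (1, 2015, 1, 3),
    (1, 2016, 1, 3),
    (2, 2014, 2, None),
    (4, 2014, 1, 3),
]


def parent_id(child_id):
    """Drop the last two digits of the child ID (same idea as floor(ID/100))."""
    return child_id // 100


# Index the children by the shared key (parent ID, year).
by_key = {}
for row in children:
    by_key.setdefault((parent_id(row[0]), row[1]), []).append(row)

# One parent row can match several children rows (the 1:m pattern).
merged = []
for idp, year, q101, q102 in parents:
    matches = by_key.get((idp, year), [])
    if not matches:
        merged.append((idp, year, q101, q102, None, "master only"))
    for child in matches:
        merged.append((idp, year, q101, q102, child[0], "matched"))

for row in merged:
    print(row)
```

Parent 4 in 2014 appears twice in the result, once for each of its children 401 and 402, while parent 1 in 2015 and parent 2 stay unmatched.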

Related

Creating a unique ID variable using information from two other variables

I have a large dataset of children and their parents (could be one or two),
collected over multiple waves.
The child has a unique ID, but the parents
have just been called parent1 "1" or parent2 "2", so they do not have their own unique ID.
I would like to make a new variable like "New ParentID" below which
gives each parent a unique ID.
ChildID = c("1","1","1","1","1","2","2","2","2","3","3","3","3","3","3")
StudyWave = c("1","2","3","4","5","2","3","4","6","1","2","3","4","5","6")
ParentID = c("1","2","1","2","1","1","1","1","1","1","2","1","2","1","2")
NewParentID = c("1","2","1","2","1","3","3","3","3","4","5","4","5","4","5")
data = cbind.data.frame(ChildID, StudyWave, ParentID, NewParentID)
data
ChildID StudyWave ParentID NewParentID
1 1 1 1 1
2 1 2 2 2
3 1 3 1 1
4 1 4 2 2
5 1 5 1 1
6 2 2 1 3
7 2 3 1 3
8 2 4 1 3
9 2 6 1 3
10 3 1 1 4
11 3 2 2 5
12 3 3 1 4
13 3 4 2 5
14 3 5 1 4
15 3 6 2 5
Many thanks in advance for any suggestions - I am stuck.
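One possible way to get the numbering shown in NewParentID, sketched here in Python rather than R, is to give each distinct (ChildID, ParentID) pair the next free number in order of first appearance. This assumes the rows are ordered as in the example; the names new_ids and new_parent_id are made up for this sketch.

```python
child_id = ["1", "1", "1", "1", "1", "2", "2", "2", "2",
            "3", "3", "3", "3", "3", "3"]
parent_id = ["1", "2", "1", "2", "1", "1", "1", "1", "1",
             "1", "2", "1", "2", "1", "2"]

# Map each (child, parent) pair to a running number the first time it is seen.
new_ids = {}
new_parent_id = []
for pair in zip(child_id, parent_id):
    if pair not in new_ids:
        new_ids[pair] = str(len(new_ids) + 1)
    new_parent_id.append(new_ids[pair])

print(new_parent_id)
```

The study wave plays no part in the numbering, since a parent keeps the same ID across waves.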

Using a growth formula for grouped observations

I have a dataset which is shown below:
clear
input year price growth id
2008 5 -0.444 1
2009 . . 1
2010 7 -0.222 1
2011 9 0 1
2011 8 -0.111 1
2012 9 0 1
2013 11 0.22 1
2012 10 0 2
2013 12 0.2 2
2013 . . 2
2014 13 0.3 2
2015 17 0.7 2
2015 16 0.6 2
end
I want to generate variable growth which is the growth of price. The growth formula is:
growth = (price of the current year - price of the base year) / price of the base year
The base year is always 2012.
How can I generate this growth variable for each group of observation (by id)?

The base price can be picked out directly by egen:
bysort id: egen price_b = total(price * (year == 2012))
generate wanted = (price - price_b) / price_b
Notice that total is used along with the assumption that, for each id, you have only one observation with year = 2012.
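The same two steps (pick out the base-year price per id, then apply the growth formula) can be sketched in Python for comparison. This is only an illustration under the same assumption of one base-year observation per id; missing prices are represented as None, and the variable names are chosen for this example.

```python
# (year, price, id) rows from the question; None marks a missing price.
rows = [
    (2008, 5, 1), (2009, None, 1), (2010, 7, 1), (2011, 9, 1),
    (2011, 8, 1), (2012, 9, 1), (2013, 11, 1),
    (2012, 10, 2), (2013, 12, 2), (2013, None, 2), (2014, 13, 2),
    (2015, 17, 2), (2015, 16, 2),
]
BASE_YEAR = 2012

# Step 1: find the base-year price for each id.
base_price = {}
for year, price, gid in rows:
    if year == BASE_YEAR and price is not None:
        base_price[gid] = price

# Step 2: growth relative to the base-year price; missing stays missing.
wanted = []
for year, price, gid in rows:
    base = base_price.get(gid)
    if price is None or base is None:
        wanted.append(None)
    else:
        wanted.append((price - base) / base)

for (year, price, gid), growth in zip(rows, wanted):
    print(gid, year, price, growth)
```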

The following works for me:
bysort id: generate obs = _n
generate double wanted = .
levelsof id, local(ids)
foreach x of local ids {
summarize obs if id == `x' & year == 2012, meanonly
bysort id: replace wanted = (price - price[`=obs[r(min)]']) / ///
price[`=obs[r(min)]'] if id == `x'
}
If the id values are consecutive, then the following will be faster:
forvalues i = 1 / 2 {
summarize obs if id == `i' & year == 2012, meanonly
bysort id: replace wanted = (price - price[`=obs[r(min)]']) / ///
price[`=obs[r(min)]'] if id == `i'
}
Results:
list, sepby(id)
+-----------------------------------------------+
| year price growth id obs wanted |
|-----------------------------------------------|
1. | 2008 5 -.444 1 1 -.44444444 |
2. | 2009 . . 1 2 . |
3. | 2010 7 -.222 1 3 -.22222222 |
4. | 2011 9 0 1 4 0 |
5. | 2011 8 -.111 1 5 -.11111111 |
6. | 2012 9 0 1 6 0 |
7. | 2013 11 .22 1 7 .22222222 |
|-----------------------------------------------|
8. | 2012 10 0 2 1 0 |
9. | 2013 12 .2 2 2 .2 |
10. | 2013 . . 2 3 . |
11. | 2014 13 .3 2 4 .3 |
12. | 2015 17 .7 2 5 .7 |
13. | 2015 16 .6 2 6 .6 |
+-----------------------------------------------+

Perl function localtime giving incorrect values for years between 1964 and 1967

I was getting some whacky values from localtime function in Perl. The following is some code for which I get incorrect values.
In particular, this code is meant to determine the weekday for the first of each year.
#!/usr/bin/perl
use strict 'vars';
use Time::Local;
use POSIX qw(strftime);
mytable();
sub mytable {
print "Year" . " "x4 . "Jan 1st (localtime)" . " "x4 . "Jan 1st (Gauss)\n";
foreach my $year ( 1964 .. 2017 )
{
my $janlocaltime = evalweekday( 1,1,$year);
my $jangauss = gauss($year);
my $diff = $jangauss - $janlocaltime;
printf "%4s%10s%-12s ",$year,"",$janlocaltime;
printf "%12s",$jangauss;
printf " <----- ERROR: off by %2s", $diff if ( $diff != 0 );
print "\n";
}
}
sub evalweekday {
## Using "localtime"
my ($day,$month,$year) = @_;
my $epoch = timelocal(0,0,0, $day,$month-1,$year-1900);
my $weekday = ( localtime($epoch) ) [6];
return $weekday;
}
sub gauss {
## Alternative approach
my ($year) = @_;
my $weekday =
( 1 + 5 * ( ( $year - 1 ) % 4 )
+ 4 * ( ( $year - 1 ) % 100 )
+ 6 * ( ( $year - 1 ) % 400 )
) % 7;
return $weekday;
}
Here is the output which shows the years with incorrect values:
Year Jan 1st (localtime) Jan 1st (Gauss)
1964 2 3 <----- ERROR: off by 1
1965 4 5 <----- ERROR: off by 1
1966 5 6 <----- ERROR: off by 1
1967 6 0 <----- ERROR: off by -6
1968 1 1
1969 3 3
1970 4 4
1971 5 5
1972 6 6
1973 1 1
1974 2 2
1975 3 3
1976 4 4
1977 6 6
1978 0 0
1979 1 1
1980 2 2
1981 4 4
1982 5 5
1983 6 6
1984 0 0
1985 2 2
1986 3 3
1987 4 4
1988 5 5
1989 0 0
1990 1 1
1991 2 2
1992 3 3
1993 5 5
1994 6 6
1995 0 0
1996 1 1
1997 3 3
1998 4 4
1999 5 5
2000 6 6
2001 1 1
2002 2 2
2003 3 3
2004 4 4
2005 6 6
2006 0 0
2007 1 1
2008 2 2
2009 4 4
2010 5 5
2011 6 6
2012 0 0
2013 2 2
2014 3 3
2015 4 4
2016 5 5
2017 0 0
In fact, the errors seem to extend as far back as 1900, but I just haven't verified that they are in fact wrong prior to 1964.
perl --version returns the following:
This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)
Copyright 1987-2013, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
I'm not sure whether it's relevant, but my operating system is macOS Sierra Version 10.12.3.
I've read through the documentation, but I don't see anything (or I'm being blind) regarding values returned prior to 1968. I've also tried to do a websearch but am not pulling up anything beyond the typical misunderstandings of array values and the numbering of months and days of the year.
Could someone help me out and explain what I'm getting wrong? Or, if this is an issue with my version of Perl, let me know what I can do to fix it.

This is likely to do with how negative epoch values are handled in Time::Local. Have a look at perldoc Time::Local #Negative-Epoch-Values
On my Linux box (perl 5.20), your code demonstrates the issue nicely. If you print out the epoch value received, you will see the issue, namely that the epoch returned by timelocal becomes huge instead of more negative:
Year Epoch Jan 1st (localtime) Jan 1st (Gauss)
1964 2966342400 2 3 <----- ERROR: off by 1
1965 2997964800 4 5 <----- ERROR: off by 1
1966 3029500800 5 6 <----- ERROR: off by 1
1967 3061036800 6 0 <----- ERROR: off by -6
1968 -63185400 1 1
1969 -31563000 3 3
1970 -27000 4 4
1971 31509000 5 5
1972 63045000 6 6
Why don't you try using the DateTime library instead:
use DateTime;
my $dt = DateTime->new(
year => 1966, # Real Year
day => 1, # 1-31
month => 1, # 1-12
hour => 0, # 0-23
second => 0, # 0-59
);
print $dt->dow . "\n";
6
6 = Saturday, which matches Wikipedia: Jan 1, 1966 (Saturday)
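As an independent cross-check of the Gauss formula from the question, here is a small Python sketch that compares it with a calendar library that does not go through epoch seconds. The function names are made up for this example; weekdays are numbered with Sunday = 0 to match Perl's localtime.

```python
from datetime import date


def weekday_calendar(year):
    """Weekday of 1 January, Sunday = 0, taken from the standard library."""
    return date(year, 1, 1).isoweekday() % 7


def weekday_gauss(year):
    """Weekday of 1 January, Sunday = 0, from the formula in the question."""
    y = year - 1
    return (1 + 5 * (y % 4) + 4 * (y % 100) + 6 * (y % 400)) % 7


# Years for which the two methods disagree.
mismatches = [y for y in range(1964, 2018)
              if weekday_calendar(y) != weekday_gauss(y)]
print(mismatches)              # prints []
print(weekday_calendar(1966))  # prints 6 (Saturday)
```

The two methods agree for every year from 1964 to 2017, which supports the reading that the Gauss column in the question is right and the localtime column is the one affected by the epoch handling.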

output for different changing variables

I have 4 variables a, b, c, d, where a takes the values 1,2; b takes 1,2,3; c takes 1,2,3,4; and d takes 1,2,3,4,5. By varying each element I want to produce an output value, i.e.
a b c d output
1 1 1 1 1
1 1 1 2 2
1 1 1 3 3
1 1 1 4 4
1 1 1 5 5
Now set c to its next value and vary d over all its values, i.e.
a b c d output
1 1 2 1 6
1 1 2 2 7
1 1 2 3 8
1 1 2 4 9
1 1 2 5 10
Now change c to 3 and do the same, getting the output 11,12,13,14,15. When c reaches its maximum, change b, i.e.
a b c d output
1 2 1 1 16
1 2 1 2 17
1 2 1 3 18
1 2 1 4 19
1 2 1 5 20
then
a b c d output
1 2 2 1 21
1 2 2 2 22
1 2 2 3 23
1 2 2 4 24
1 2 2 5 25
I want to proceed like this and get the output for all combinations of a, b, c, d. How can I do this, or is there an equation for it, in MATLAB? Above, a, b, c, d have 2, 3, 4, 5 values, i.e. in increasing order, but in the general case the counts need not be increasing, e.g. a, b, c, d could have 7, 4, 9, 13 values.

A possible algorithm is to build the combinations column by column, considering the number of times each value has to be repeated, starting from the array d.
Define:
len_a the length of the array a
len_b the length of the array b
len_c the length of the array c
len_d the length of the array d
You need to replicate the d array len_a * len_b * len_c times.
Each value of the array c needs to be repeated len_d times, giving a block of len_c * len_d rows that covers the "right side" combinations; this block then has to be replicated len_a * len_b times to account for the "left side" to come.
A similar approach applies to the definition of the arrays a and b.
To have the set of combinations in a "random" sequence, it is sufficient to
use the randperm function.
% Define the input arrays
a=1:2;
len_a=length(a);
b=1:3;
len_b=length(b);
c=1:4;
len_c=length(c);
d=1:5;
len_d=length(d);
% Generate the fourth column of the table
%
d1=repmat(d',len_a*len_b*len_c,1)
%
% Generate the third column of the table
c1=repmat(reshape(bsxfun(@plus,zeros(len_d,1),[1:len_c]),len_c*len_d,1),len_a*len_b,1)
%
% Generate the second column of the table
b1=repmat(reshape(bsxfun(@plus,zeros(len_c*len_d,1),[1:len_b]),len_b*len_c*len_d,1),len_a,1)
%
% Generate the first column of the table
a1=reshape(bsxfun(@plus,zeros(len_b*len_c*len_d,1),[1:len_a]),len_a*len_b*len_c*len_d,1)
%
% Merge the columns and add the counter in the fifth column
combination_set_1=[a1 b1 c1 d1 (1:len_a*len_b*len_c*len_d)']
% Randomize the rows
combination_set_2=combination_set_1(randperm(len_a*len_b*len_c*len_d),:)
Output:
1 1 1 1 1
1 1 1 2 2
1 1 1 3 3
1 1 1 4 4
1 1 1 5 5
1 1 2 1 6
1 1 2 2 7
1 1 2 3 8
1 1 2 4 9
1 1 2 5 10
1 1 3 1 11
1 1 3 2 12
1 1 3 3 13
1 1 3 4 14
1 1 3 5 15
1 1 4 1 16
1 1 4 2 17
1 1 4 3 18
1 1 4 4 19
1 1 4 5 20
1 2 1 1 21
1 2 1 2 22
1 2 1 3 23
1 2 1 4 24
1 2 1 5 25
1 2 2 1 26
1 2 2 2 27
1 2 2 3 28
1 2 2 4 29
1 2 2 5 30
1 2 3 1 31
1 2 3 2 32
1 2 3 3 33
1 2 3 4 34
1 2 3 5 35
1 2 4 1 36
1 2 4 2 37
1 2 4 3 38
1 2 4 4 39
1 2 4 5 40
1 3 1 1 41
1 3 1 2 42
1 3 1 3 43
1 3 1 4 44
1 3 1 5 45
1 3 2 1 46
1 3 2 2 47
1 3 2 3 48
1 3 2 4 49
1 3 2 5 50
1 3 3 1 51
1 3 3 2 52
1 3 3 3 53
1 3 3 4 54
1 3 3 5 55
1 3 4 1 56
1 3 4 2 57
1 3 4 3 58
1 3 4 4 59
1 3 4 5 60
2 1 1 1 61
2 1 1 2 62
2 1 1 3 63
2 1 1 4 64
2 1 1 5 65
2 1 2 1 66
2 1 2 2 67
2 1 2 3 68
2 1 2 4 69
2 1 2 5 70
2 1 3 1 71
2 1 3 2 72
2 1 3 3 73
2 1 3 4 74
2 1 3 5 75
2 1 4 1 76
2 1 4 2 77
2 1 4 3 78
2 1 4 4 79
2 1 4 5 80
2 2 1 1 81
2 2 1 2 82
2 2 1 3 83
2 2 1 4 84
2 2 1 5 85
2 2 2 1 86
2 2 2 2 87
2 2 2 3 88
2 2 2 4 89
2 2 2 5 90
2 2 3 1 91
2 2 3 2 92
2 2 3 3 93
2 2 3 4 94
2 2 3 5 95
2 2 4 1 96
2 2 4 2 97
2 2 4 3 98
2 2 4 4 99
2 2 4 5 100
2 3 1 1 101
2 3 1 2 102
2 3 1 3 103
2 3 1 4 104
2 3 1 5 105
2 3 2 1 106
2 3 2 2 107
2 3 2 3 108
2 3 2 4 109
2 3 2 5 110
2 3 3 1 111
2 3 3 2 112
2 3 3 3 113
2 3 3 4 114
2 3 3 5 115
2 3 4 1 116
2 3 4 2 117
2 3 4 3 118
2 3 4 4 119
2 3 4 5 120
Hope this helps.
Qapla'
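For comparison, the same table can be sketched in Python, where itertools.product already iterates with the last array varying fastest. The sketch also includes a closed-form "equation" for the output value, as the question asked for one; the helper name output_index is an assumption of this example, and it works on zero-based positions within each array, so the array lengths do not need to be increasing.

```python
from itertools import product

a = [1, 2]
b = [1, 2, 3]
c = [1, 2, 3, 4]
d = [1, 2, 3, 4, 5]

# Enumerate all combinations; d varies fastest, a slowest.
rows = [combo + (n,) for n, combo in enumerate(product(a, b, c, d), start=1)]


def output_index(ia, ib, ic, idx_d):
    """Output value from zero-based positions in a, b, c, d (mixed radix)."""
    return ((ia * len(b) + ib) * len(c) + ic) * len(d) + idx_d + 1


print(rows[0])    # (1, 1, 1, 1, 1)
print(rows[20])   # (1, 2, 1, 1, 21)
print(output_index(0, 1, 0, 0))  # 21
```

Shuffling the rows, as randperm does above, could be done with random.shuffle(rows).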

Reporting in Crystal Reports with Formulas

I want to create a formula for a table that looks like this, but I don't know which formulas to use to scan these files.
ID UDOfficer DATE
1 6 Jan
1 7 Jan
1 9 Jan
2 6 June
3 6 April
4 6 May
5 5 Dec
6 7 Nov
7 6 April
7 4 April
What I want to create:
A formula to put in a crosstab's column that captures UDOfficer = 6 versus all others. If an ID appears with UDOfficer 6 as well as others (e.g. 6, 7 and 9), the "all others" column must not count that ID, since it was already counted for UDOfficer 6.
OUTPUTCrosstable
DATE UDOFF6 UDOFFOTHER
JAN 1 0
APR 2 0
MAY 1 0
JUN 1 0
NOV 0 1
DEC 0 1

You can use grouping together with if/else to meet your requirement.
First create a group on Date.
Create 2 formulas and place them in the details section to count the occurrences.
Create a formula @UDOFF6 with the code below:
if UDOfficer =6
then 1
else 0
Create another formula @UDOFFOTHER with the code below:
if UDOfficer <> 6
then 1
else 0
Take the sum of both formulas in the group footer of Date, and your problem will be solved.
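Note that the two formulas count detail rows, whereas the question also asks that an ID already counted under officer 6 is not counted again under the others. To make that counting rule concrete, here is a sketch in Python (not Crystal syntax); the variable names are made up for this example, and each ID is counted once per date.

```python
# (ID, UDOfficer, DATE) rows from the question.
rows = [
    (1, 6, "Jan"), (1, 7, "Jan"), (1, 9, "Jan"), (2, 6, "June"),
    (3, 6, "April"), (4, 6, "May"), (5, 5, "Dec"), (6, 7, "Nov"),
    (7, 6, "April"), (7, 4, "April"),
]

# Collect the set of officers seen for each (date, ID) pair.
officers = {}
for rec_id, officer, month in rows:
    officers.setdefault((month, rec_id), set()).add(officer)

# Count each ID once per date: under officer 6 if present, else under "other".
counts = {}
for (month, rec_id), seen in officers.items():
    udoff6, other = counts.get(month, (0, 0))
    if 6 in seen:
        udoff6 += 1
    else:
        other += 1
    counts[month] = (udoff6, other)

for month, (udoff6, other) in counts.items():
    print(month, udoff6, other)
```

The resulting counts agree with the crosstab the question asks for, for example 1 and 0 for January and 2 and 0 for April.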
