Merging two datasets with variables in one corresponding observations in the other - merge

I ran into a challenge today and I hope to get some help.
I want to merge 2 data sets. Dataset1 contains the member roster with 3 variables, the batch_id, member_num in the batch and occupation. Dataset2 consists of member license statuses.
The challenge here is that in dataset2 the member_num is represented as a variable in a way that member_num_x in dataset2 corresponds to "x" observation under variable member_num in dataset1. I need to merge these two datasets so that in the end I have one dataset that has batch_id, member_num, occupation, and license status for each member.
Dataset1
| batch_id | member_num | occupation |
| -------- | -------- | --------
| A01 | 1 | Driver |
| A01 | 2 | Driver |
| A01 | 3 | Driver |
| A01 | 4 | Driver |
| A02 | 1 | Navigator |
| A02 | 2 | Navigator |
Dataset2
| batch_id |member_num_1|member_num_2|member_num_3|member_num_4|
| -------- | -------- | -------- | -------- | -------- |
| A01 | Yes | NA | Yes | No |
| A02 | No | | NA |
Desired Output
| batch_id | member_num | occupation | License_status
| -------- | -------- | --------
| A01 | 1 | Driver | Yes
| A01 | 2 | Driver | NA
| A01 | 3 | Driver | Yes
| A01 | 4 | Driver | No
| A02 | 1 | Navigator | No
| A02 | 2 | Navigator | NA
I have tried using the merge command in Stata but there is no option to do this particular kind of merging. The options that are there use unique variables (almost same as joins on primary keys).

You need to reshape d2 into long format and then merge/link on batch_id, member_num
Approach 1 (using frames)
clear
use d1
frame create d2
frame d2: use d2
frame d2: reshape long member_num_, i(batch_id) j(member_num)
frlink 1:1 batch_id member_num, frame(d2)
frget License_status = member_num_, from(d2)
Approach 2 (using merge)
clear
use d2
reshape long member_num_, i(batch_id) j(member_num)
rename member_num_ License_status
tempfile d2long
save `d2long',replace
use d1,clear
merge 1:1 batch_id member_num using `d2long',nogenerate keep(1 3)
Output:
batch_id member~m occupat~n d2 Licens~s
1. A01 1 Driver 1 Yes
2. A01 2 Driver 2 NA
3. A01 3 Driver 3 Yes
4. A01 4 Driver 4 No
5. A02 1 Navigator 5 No
6. A02 2 Navigator 6 NA
Input:
d1.dta:
batch_id member~m occupat~n
1. A01 1 Driver
2. A01 2 Driver
3. A01 3 Driver
4. A01 4 Driver
5. A02 1 Navigator
6. A02 2 Navigator
d2.dta:
batch_id member~1 member~2 member~3 member~4
1. A01 Yes NA Yes No
2. A02 No NA

The final answer with the minor edit:
clear
use "file_Dataset2.dta" reshape long member_num_, i(batch_id) j(member_num) rename member_num_ License_status
tempfile d2long
save d2long',replace
use "file_Dataset1.dta",clear
merge m:1 batch_id member_num using d2long',nogenerate keep(1 3)

Related

Avarage per group in PySpark

I have PySpark dataframe below:
cust | amount |
----------------
A | 5 |
A | 1 |
A | 3 |
B | 4 |
B | 4 |
B | 2 |
C | 2 |
C | 1 |
C | 7 |
C | 5 |
I need to group by column 'cust' and calculates the average per group.
Expected result:
cust | avg_amount
-------------------
A | 3
B | 3.333
C | 7.5
I've been using the code as below but giving me the error.
data.withColumn("avg_amount", F.avg("amount"))
Any idea how I can make this average?
Use groupBy to count the number of transactions and the average of amount by customer:
from pyspark.sql import functions as F
data = data.groupBy("cust")\
.agg(
F.count("*").alias("amount"),
F.avg("amount").alias("avg_amount")
)

how to migrate relational tables to dynamoDB table

I am new at DynamoDB, in my current project, I am trying to migrate most relational tables to Dynamo DB. I am facing a tricky scenario which I don't know how to solve
In Posgresql, 2 tables:
Student
id | name | age | address | phone
---+--------+-----+---------+--------
1 | Alex | 18 | aaaaaa | 88888
2 | Tome | 19 | bbbbbb | 99999
3 | Mary | 18 | ccccc | 00000
4 | Peter | 20 | dddddd | 00000
Registration
id | class | student | year
---+--------+---------+---------
1 | A1 | 1 | 2018
2 | A1 | 3 | 2018
3 | A1 | 4 | 2017
4 | B1 | 2 | 2018
My query:
select s.id, s.name, s.age, s.address, s.phone
from Registration r inner join Student s on r.student = s.id
where r.class = 'A1' and r.year = '2018'
Result:
id | name | age | address | phone
---+--------+-----+---------+--------
1 | Alex | 18 | aaaaaa | 88888
3 | Mary | 18 | ccccc | 00000
So, how can I design the dynamoDB table to achieve this result? in extend for CRUD
Any advice is appreciated
DynamoDB table design is going to depend largely on your access patterns. Without knowing the full requirements and queries needed by your app, it's not going to be possible to write a proper answer. But given your example here's a table design that might work:
| (GSI PK) |
(P. Key) | (Sort) | (GSI Sort)
studentId | itemType | name | age | address | phone | year
----------+----------+--------+-----+---------+-------+------
1 | Details | Alex | 18 | aaaaaa | 88888 |
1 | Class_A1 | | | | | 2018
2 | Details | Tome | 19 | bbbbbb | 99999 |
2 | Class_B1 | | | | | 2018
3 | Details | Mary | 18 | ccccc | 00000 |
3 | Class_A1 | | | | | 2018
4 | Details | Peter | 20 | dddddd | 00000 |
4 | Class_A1 | | | | | 2017
Note the global secondary index with the partition key on the item type and the sort key on the year.
With this design we have a few query options:
1) Get student for a given id: GetItem(partitionKey: studentId, sortkey: Details)
2) Get all classes for a given student id: Query(partitionKey: studentId, sortkey: STARTS_WITH("Class"));
3) Get all students in class A1 and year 2018: Query(GSI partitionkey: "Class_A1", sortkey: equals(2018))
For global secondary indexes, the partition and sort key don't need to be unique therefore you can have many Class_A1, 2018 combos. If you haven't already read the Best Practices for DyanmoDB I highly recommend reading it in full.

how to flatten rows to columns in postgreSQL

using postgresql 9.3 I have a table that shows indivual permits issued across a single year below:
permit_typ| zipcode| address| name
-------------+------+------+-----
CONSTRUCTION | 20004 | 124 fake streeet | billy joe
SUPPLEMENTAL | 20005 | 124 fake streeet | james oswald
POST CARD | 20005 | 124 fake streeet | who cares
HOME OCCUPATION | 20007 | 124 fake streeet | who cares
SHOP DRAWING | 20009 | 124 fake streeet | who cares
I am trying to flatten this so it looks like
CONSTRUCTION | SUPPLEMENTAL | POST CARD| HOME OCCUPATION | SHOP DRAWING | zipcode
-------------+--------------+-----------+----------------+--------------+--------
1 | 2 | 3 | 5 | 6 | 20004
1 | 2 | 3 | 5 | 6 | 20005
1 | 2 | 3 | 5 | 6 | 20006
1 | 2 | 3 | 5 | 6 | 20007
1 | 2 | 3 | 5 | 6 | 20008
have been trying to use Crosstab but its a bit above my rusty SQL experiance. anybody have any ideas
I usually approach this type of query using conditional aggregation. In Postgres, you can do:
select zipcode,
sum( (permit_typ = 'CONSTRUCTION')::int) as Construction,
sum( (permit_typ = 'SUPPLEMENTAL')::int) as SUPPLEMENTAL,
. . .
from t
group by zipcode;

Merge multiple tables with a common column name

I am trying to merge multiple tables that have a common column name which need not have the same values across the tables. For ex,
-tmp1-
id dat
1 234
2 432
3 412
-tmp2-
id nom
1 jim
2
3 ryan
4 jack
-tmp3-
id pin
1 gi23
2 x4ed
3 yit42
8 hiu11
If above are the input, the output needs to be,
id dat nom pin
1 234 jim gi23
2 432 x4ed
3 412 ryan yit42
4 jack
8 hiu11
Thanks in advance.
postgresql 8.2.15 on greenplum from R(pass-through queries)
use FULL JOIN ... USING (id) syntax.
please see example: http://sqlfiddle.com/#!12/3aff2/1
this is how diffrent join types work (provided that tab1.row3 meets joining condition with tab2.row1, and tab1.row3 meets tab2.row2):
| tab1 | | tab2 | | JOIN | | LEFT JOIN | | RIGHT JOIN | | FULL JOIN |
-------- -------- ------------------------- ------------------------- ------------------------- -------------------------
| row1 | | tab1.row1 | | tab1.row1 |
| row2 | | tab1.row2 | | tab1.row2 |
| row3 | | row1 | | tab1.row3 | tab2.row1 | | tab1.row3 | tab2.row1 | | tab1.row3 | tab2.row1 | | tab1.row3 | tab2.row1 |
| row4 | | row2 | | tab1.row4 | tab2.row2 | | tab1.row4 | tab2.row2 | | tab1.row4 | tab2.row2 | | tab1.row4 | tab2.row2 |
| row3 | | tab2.row3 | | tab2.row3 |
| row4 | | tab2.row4 | | tab2.row4 |

I'm a bit new to PostgreSQL and need how to construct complex query

I need to list all the cities you can get to after stopping off at exactly one other city, starting off from any city of my choice. And list with it the distance to the final city and the intermediate city.
The tables in the database consist of cities, with the attributes:
| city_id | name |
1 Edinburgh
2 Newcastle
3 Manchester
citypairs:
| citypair_id | city_id |
1 1
1 2
2 1
2 3
3 2
3 3
and distances:
| citypair_id | distance |
1 1234
2 1324
3 1324
and trains:
| train_id | departure_city_id | destination_city_id |
1 1 2
2 2 3
3 1 3
4 3 2
I haven't put any of the data in but basically if a city.name is chosen at random by me I need to find out which cities I can get to from this city if I go via another city (i.e. in two journeys) and then the distance to the final and intermediate city.
How would you, or how should I, go about forming a query to return the desired table?
Edited to include data and a missing table! As an example you can go from Edinburgh(1) to Manchester(3) via Newcastle(2) and you can go from Edinburgh to Newcastle via Manchester, however you can not go from Manchester to Edinburgh via Newcastle (since a train departs from 3, arrives at 2, but no train from 2 arrives in 1) and this route should not be returned from the query. Apologies for any confusion beforehand.
I've got a CTE that builds a tree of all the destinations.
WITH RECURSIVE trip AS (
SELECT c.city_id AS start_city,
ARRAY[c.city_id] AS route,
cast(c.name AS varchar(100)) AS route_text,
c.city_id AS leg_start_city,
c.city_id AS leg_end_city,
0 AS trip_count,
0 AS leg_length,
0 AS total_length
FROM cities c
UNION ALL
SELECT
trip.start_city,
trip.route || t.destination_city_id,
cast(trip.route_text || ',' || c.name AS varchar(100)),
t.departure_city_id,
t.destination_city_id,
trip.trip_count + 1,
d.distance,
trip.total_length + d.distance
FROM trains t
INNER JOIN trip
ON t.departure_city_id = trip.leg_end_city
INNER JOIN citypairs cps
ON t.departure_city_id = cps.city_id
INNER JOIN citypairs cpe
ON t.destination_city_id = cpe.city_id AND
cpe.citypair_id = cps.citypair_id
INNER JOIN distances d
ON cps.citypair_id = d.citypair_id
INNER JOIN cities c
ON t.destination_city_id = c.city_id
WHERE NOT (array[t.destination_city_id] <# trip.route))
SELECT *
FROM trip
WHERE trip_count = 2
AND start_city = (SELECT city_id FROM cities WHERE name = 'Edinburgh');
The CTE starts from each city (in the non-recursive part at the start), then determines all the destination cities it can go to. It keeps a track of all the cities its been to in an array (the route column), so it won't loop back to itself again. As it progresses, it keeps track of the overall trip distance, and the number of trains taken (in trip_count).
As it goes through the tree, it keeps a running total of the distance.
This gives results of
| START_CITY | ROUTE | ROUTE_TEXT | LEG_START_CITY | LEG_END_CITY | TRIP_COUNT | LEG_LENGTH | TOTAL_LENGTH |
--------------------------------------------------------------------------------------------------------------------------------
| 1 | 1,2,3 | Edinburgh,Newcastle,Manchester | 2 | 3 | 2 | 1324 | 2558 |
| 1 | 1,3,2 | Edinburgh,Manchester,Newcastle | 3 | 2 | 2 | 1324 | 2648 |
If you change remove the final WHERE clause it'll show all the possible trips in the data, likewise you can change the trip_count to find all single train destinations etc.
| START_CITY | ROUTE | ROUTE_TEXT | LEG_START_CITY | LEG_END_CITY | TRIP_COUNT | LEG_LENGTH | TOTAL_LENGTH |
--------------------------------------------------------------------------------------------------------------------------------
| 1 | 1 | Edinburgh | 1 | 1 | 0 | 0 | 0 |
| 2 | 2 | Newcastle | 2 | 2 | 0 | 0 | 0 |
| 3 | 3 | Manchester | 3 | 3 | 0 | 0 | 0 |
| 1 | 1,2 | Edinburgh,Newcastle | 1 | 2 | 1 | 1234 | 1234 |
| 1 | 1,3 | Edinburgh,Manchester | 1 | 3 | 1 | 1324 | 1324 |
| 2 | 2,3 | Newcastle,Manchester | 2 | 3 | 1 | 1324 | 1324 |
| 3 | 3,2 | Manchester,Newcastle | 3 | 2 | 1 | 1324 | 1324 |
| 1 | 1,2,3 | Edinburgh,Newcastle,Manchester | 2 | 3 | 2 | 1324 | 2558 |
| 1 | 1,3,2 | Edinburgh,Manchester,Newcastle | 3 | 2 | 2 | 1324 | 2648 |
The cast( ... as varchar(100)) is a bit hacky, and I'm not sure why it was needed, but I haven't had a chance to get around that yet.
The SQL is here for testing: http://sqlfiddle.com/#!1/93964/24
The first part is easy:
SELECT c2.name
FROM cities AS c
JOIN trains t ON c.city_id=t.departure_city_id
JOIN trains t2 ON t.destination_city_id=t2.departure_city_id
JOIN cities AS c2 ON t2.destination_city_id=c2.city_id
WHERE c2.city_id!=c.city_id
AND c.name='Edinburgh';
http://sqlfiddle.com/#!12/a656f/14
In PG 9.1+ you could even do it with a recursive CTE for any number of cities in between. The distances are a little more complicated and you probably would be better off transforming city_pairs into actual pairs.