How to merge data in Stata - merge

I'm learning Stata and trying to understand merging. Can someone explain the difference between the different kinds of merge (1:1, 1:m, m:1, m:m)?

In case the Stata manual is unclear, here's a quick overview.
First, it's important to clear up the terminology.
A merge basically connects rows in two datasets (Stata calls them observations) based on a specified variable or list of variables, called key variables. You have to start with one dataset already in memory (Stata calls this the master dataset), and you merge another dataset into it (the other dataset is called the using dataset). What you're left with is a single dataset containing all of the variables from the master, plus any variables from the using that didn't already exist in the master. It also gains a new variable called _merge that indicates, for each row, whether the key was found only in the master, only in the using, or in both. Unless you specify otherwise, the merged dataset contains all rows from the master and the using, regardless of whether the key variables matched between the two.
The concept of a "unique identifier" is also important. If a variable (or combination of variables) takes a different value in every row, it uniquely identifies rows. This is what distinguishes 1:1, 1:m, m:1, and m:m.
1:1 means the key variable uniquely identifies rows in both datasets. You will be left with all of the rows from both datasets in memory.
1:m means the key variable uniquely identifies rows in the master dataset, but not in the using dataset. You will still be left with all of the rows from both datasets, but if a key value appears multiple times in the using dataset, the matching master row is repeated so that each of those using rows gets a match (a sketch of this case follows the worked examples below).
m:1 is the opposite of 1:m: the key variable uniquely identifies rows in the using dataset, but not in the master dataset.
m:m is rarely what you want. The key variable doesn't uniquely identify rows in either dataset, and instead of forming every pairwise combination, Stata matches observations within each key group in the order they appear (first to first, second to second, and so on), which is seldom meaningful. If you actually want all pairwise combinations within each key, use joinby instead.
Example:
** make a dataset and save it as a tempfile called `b'. Note that k uniquely identifies rows
clear
set obs 3
gen k = _n
gen b = "b"
list
     +-------+
     | k   b |
     |-------|
  1. | 1   b |
  2. | 2   b |
  3. | 3   b |
     +-------+
tempfile b
save `b'
** make another dataset and merge `b' to it. Note that k uniquely identifies rows here too
clear
set obs 3
gen k = _n
gen a = "a"
list
     +-------+
     | k   a |
     |-------|
  1. | 1   a |
  2. | 2   a |
  3. | 3   a |
     +-------+
merge 1:1 k using `b'
list
     +--------------------------+
     | k   a   b         _merge |
     |--------------------------|
  1. | 1   a   b    matched (3) |
  2. | 2   a   b    matched (3) |
  3. | 3   a   b    matched (3) |
     +--------------------------+
** make another dataset and merge `b' to it. Note that k does not uniquely identify rows and that k=2 and k=3 do not exist in the master dataset
clear
set obs 3
gen k = 1
gen a = "a"
list
+-------+
| k a |
|-------|
1. | 1 a |
2. | 1 a |
3. | 1 a |
+-------+
merge m:1 k using `b'
list
     +-----------------------------+
     | k   a   b            _merge |
     |-----------------------------|
  1. | 1   a   b       matched (3) |
  2. | 1   a   b       matched (3) |
  3. | 1   a   b       matched (3) |
  4. | 2       b    using only (2) |
  5. | 3       b    using only (2) |
     +-----------------------------+
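For completeness, here is a sketch of the 1:m case, continuing the same session (the tempfile name `c' is mine and not part of the original example):
** make a non-unique dataset, save it, then merge it onto `b' with 1:m
** (k uniquely identifies rows in the master `b', but not in the using `c')
clear
set obs 3
gen k = 1
gen a = "a"
tempfile c
save `c'
use `b', clear
merge 1:m k using `c'
list
** k==1 from `b' is now repeated three times (matched), while k==2 and k==3 are master only (1)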

How do I identify the value causing a skewed task in my Foundry job?

I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?
My Python Transforms code looks like this:
from transforms.api import Input, Output, transform

@transform(
    ...
)
def my_compute_function(...):
    ...
    df = df.join(df_2, ["joint_col"])
    ...
Theory
Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include but are not limited to: joins, windows, groupBys.
These operations move data across your Executors based on the values in the columns involved. This means that when one of the DataFrames has many repeated values in the column that dictates the exchange, all of those rows end up in the same task, which increases its size.
Example
Let's consider the following example distribution of data for your join:
DataFrame 1 (df1)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1     |
| key_1 | 2     |
| key_1 | 3     |
| key_1 | 1     |
| key_1 | 2     |
| key_2 | 1     |
DataFrame 2 (df2)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1     |
| key_1 | 2     |
| key_1 | 3     |
| key_1 | 1     |
| key_2 | 2     |
| key_3 | 1     |
These DataFrames when joined together on col_1 will have the following data distributed across the executors:
Task 1:
Receives: 5 rows of key_1 from df1
Receives: 4 rows of key_1 from df2
Total Input: 9 rows of data sent to task_1
Result: 5 * 4 = 20 rows of output data
Task 2:
Receives: 1 row of key_2 from df1
Receives: 1 row of key_2 from df2
Total Input: 2 rows of data sent to task_2
Result: 1 * 1 = 1 row of output data
Task 3:
Receives: 1 row of key_3 from df2
Total Input: 1 row of data sent to task_3
Result: 1 * 0 = 0 rows of output data (no matching key found in df1)
If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
Identification
The question now becomes how we identify that key_1 is the culprit of the skew since this isn't visible in Spark (the underlying engine powering the join).
If we look at the above example, we see that all we need to know is the actual counts per key of the joint column. This means we can:
Aggregate each side of the join on the joint key and count the rows per key
Multiply the counts of each side of the join to determine the output row counts
The easiest way to do this is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:
Add df1 as input to a first path
Add Pivot Table board, using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df1_, resulting in an output from the path which is only df1_col_1 and df1_COUNT.
Add df2 as input to a second path
Add Pivot Table board, again using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df2_, resulting in an output from the path which is only df2_col_1 and df2_COUNT.
Create a third path, using the result of the first path (df1_col_1 and df1_COUNT)
Add a Join board, making the right side of the join the result of the second path (df2_col_1 and df2_COUNT). Ensure the join type is Full join
Add all columns from the right side (you don't need to add a prefix; all the column names are already unique)
Configure the join board to join on df1_col_1 equals df2_col_1
Add an Expression board to create a new column, output_row_count, which multiplies the two COUNT columns together
Add a Sort board that sorts on output_row_count descending
If you now preview the resulting data, you will have a sorted list of the keys, from both sides of the join, that are causing the skew
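Alternatively, if you'd rather compute the same counts in code than in Contour, here is a minimal PySpark sketch of the idea (df1, df2, and col_1 refer to the example DataFrames above; swap in your real DataFrames and join column):
from pyspark.sql import functions as F

# Count rows per join key on each side of the join
df1_counts = df1.groupBy("col_1").agg(F.count("*").alias("df1_count"))
df2_counts = df2.groupBy("col_1").agg(F.count("*").alias("df2_count"))

# Full outer join the per-key counts and estimate the output rows per key
skew_report = (
    df1_counts.join(df2_counts, on="col_1", how="full_outer")
    .withColumn(
        "output_row_count",
        F.coalesce(F.col("df1_count"), F.lit(0)) * F.coalesce(F.col("df2_count"), F.lit(0)),
    )
    .orderBy(F.col("output_row_count").desc())
)

# The keys with the largest output_row_count are the ones driving the skew
skew_report.show()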

Tag unique rows?

I have some data from different systems which can be joined only in a certain case because of different granularity between the data sets.
Given three columns:
call_date, login_id, customer_id
How can I efficiently 'flag' each row that has a unique combination of values across those three columns? I didn't want to SELECT DISTINCT because I do not know which of the rows actually matches up with the other. I want to know which records (combinations of columns) exist only once on a single date.
For example, if a customer called in 5 times on a single date and ordered a product, I do not know which of those specific call records ties back to the product order (lack of timestamps in the raw data). However, if a customer only called in once on a specific date and had a product order, I know for sure that the order ties back to that call record. (This is just an example - I am doing something similar across about 7 different tables from different source data).
| timestamp           | customer_id | login_name | score | unique |
|---------------------|-------------|------------|-------|--------|
| 01/24/2017 18:58:11 | 441987      | abc123     | .25   | TRUE   |
| 03/31/2017 15:01:20 | 783356      | abc123     | 1     | FALSE  |
| 03/31/2017 16:51:32 | 783356      | abc123     | 0     | FALSE  |
| call_date  | customer_id | login_name | order | unique |
|------------|-------------|------------|-------|--------|
| 01/24/2017 | 441987      | abc123     | 0     | TRUE   |
| 03/31/2017 | 783356      | abc123     | 1     | TRUE   |
In the above example, I would only want to join rows where the 'uniqueness' is True for both tables. So on 1/24, I know that there was no order for the call which had a score of 0.25.
To find whether a row (or some set of its columns) is unique within the set of rows, you can make use of PostgreSQL window functions.
SELECT *,
       (count(*) OVER (PARTITION BY b, c, d) = 1) AS unique_within_b_c_d_columns
FROM unnest(ARRAY[
         row(1, 2, 3, 1),
         row(2, 2, 3, 2),
         row(3, 2, 3, 2),
         row(4, 2, 3, 4)
     ]) AS t(a int, b int, c int, d int)
Output:
| a | b | c | d | unique_within_b_c_d_columns |
|---|---|---|---|------------------------------|
| 1 | 2 | 3 | 1 | true                         |
| 2 | 2 | 3 | 2 | false                        |
| 3 | 2 | 3 | 2 | false                        |
| 4 | 2 | 3 | 4 | true                         |
In the PARTITION BY clause you specify the list of columns you want to compare on. Note that in the example above, column a doesn't take part in the comparison.
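Applied to your case, a sketch might look like this (the table name calls is a placeholder; substitute your actual table and adjust the column list as needed):
SELECT *,
       (count(*) OVER (PARTITION BY call_date, login_id, customer_id) = 1) AS unique_row
FROM calls;  -- 'calls' is a placeholder table name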

How to automatically calculate the SUS Score for a given spreadsheet in LibreOffice Calc?

I have several spreadsheets for a SUS-Score usability test.
They have this form:
|                                            | Strongly disagree |   |   |   | Strongly agree |
| I think that I would use this system often | x                 |   |   |   |                |
| I found the system too complex             |                   | x |   |   |                |
| (..)                                       |                   |   |   |   | x              |
| (...)                                      | x                 |   |   |   |                |
To calculate the SUS-Score you have 3 rules:
Odd item: Pos - 1
Even item: 5 - Pos
Add up the item scores and multiply the sum by 2.5
So for the first entry (odd item) you have: Pos - 1 = 1 - 1 = 0
Second item (even): 5 - Pos = 5 - 2 = 3
Now I have several of those spreadsheets and want to calculate the SUS-Score automatically. I've changed the x to a 1 and tried to use IF(F5=1,5-1). But I would need an IF-condition for every column: =IF(F5=1;5-1;IF(E5=1;4-1;IF(D5=1;3-1;IF(C5=1;2-1;IF(B5=1;1-1))))), so is there an easier way to calculate the score, based on the position in the table?
I would use a helper table and then SUM() all the cells of the helper table and multiply by 2.5. This formula (modified as needed, see notes below) can start your helper table and be copy-pasted to fill out the entire table:
=IF(D2="x";IF(MOD(ROW();2)=1;5-D$1;D$1-1);"")
Here D is an answer column.
Depending on which row (odd or even) your answers start in, you may need to change the =1 after the MOD function to =0.
This assumes the position numbers are in row 1; if they are in a different row, change the number after the $ accordingly.
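If you want to sanity-check the helper table's result outside of Calc, the scoring rules translate directly into a short script. Here is an illustrative Python sketch (not LibreOffice syntax; the positions list is made-up example data):
# positions[i] holds the 1-5 answer position for SUS item i+1 (made-up example values)
positions = [1, 2, 5, 1, 3, 2, 4, 1, 5, 2]

score = 0
for item_number, pos in enumerate(positions, start=1):
    if item_number % 2 == 1:   # odd item: Pos - 1
        score += pos - 1
    else:                      # even item: 5 - Pos
        score += 5 - pos

sus = score * 2.5              # add the item scores, multiply by 2.5
print(sus)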

Talend - Append two rows from a delimited file

How can I append two rows of a delimited file?
For example, I have:
a | b | c | d
e | f | g | h
and I want:
a | b | c | d | e | f | g | h
This new file may or may not be saved after the transformation.
Is there any possibility that you have a join condition or relation between these two rows? Or is it always just two rows? Say your file contains 4 rows (how would you like to merge them then?):
a|b|c
d|e|f
x|y|z
m|g|s
If you have a way to relate these rows, then it will be easier using tMap.
OK, the information you have shared in the comments helps.
Try this:
tFileInputDelimited_1 (read all rows from the file) --> filter_01 (keep only 'TX' rows) --> tMap (add a sequence starting with 1,1)
The output of this tMap will have all the columns plus a sequence column with values 1, 2, 3, ... for row 1, row 2, row 3, and so on.
Similarly, build another pipeline:
tFileInputDelimited_2 (read all rows from the file) --> filter_02 (keep only 'RX' rows) --> tMap (add a sequence starting with 1,1)
Again, the output of this tMap will have all the columns plus a sequence column with values 1, 2, 3, ... for row 1, row 2, row 3, and so on.
Now feed both of these pipelines into a tMap, join them on the sequence column, and select all the columns you need from them into a single output.

Query to remove all redundant entries from a table

I have a Postgres table that describes relationships between entities; the table is populated by a process which I cannot modify. This is an example of that table:
+-----+-----+
| e1  | e2  |
|-----+-----|
| A   | B   |
| C   | D   |
| D   | C   |
| ... | ... |
+-----+-----+
I want to write a SQL query that will remove all unnecessary relationships from the table; for example, the relationship [D, C] is redundant because it's already defined by [C, D].
I have a query that deletes using a self join, but this removes both rows of the pair, e.g.:
DELETE FROM foo USING foo b WHERE foo.e2 = b.e1 AND foo.e1 = b.e2;
Results in:
+-----+-----+
| e1  | e2  |
|-----+-----|
| A   | B   |
| ... | ... |
+-----+-----+
However, I need a query that will leave me with one of the relationships, it doesn't matter which relationship remains, either [C, D] or [D, C] but not both.
I feel like there is a simple solution here but it's escaping me.
A general solution is to use the always unique pseudo-column ctid:
DELETE FROM foo USING foo b
WHERE foo.e2 = b.e1
  AND foo.e1 = b.e2
  AND foo.ctid > b.ctid;
Incidentally it keeps the tuple whose physical location is nearest to the first data page of the table.
Assuming that exact duplicate rows are constrained against, there will always be at most two rows for a given relationship: (C,D) and (D,C) in your example. The same constraint also means that in any redundant pair the two columns hold distinct values: the pair (C,C) might be legal, but it cannot be duplicated, so it needs no special handling.
Assuming that the datatype involved has a sane definition of >, you can add a condition that the row to be deleted is the one where the first column > the second column, and leave the other untouched.
In your sample query, this would mean adding AND foo.e1 > foo.e2.
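Putting that together, a sketch of the modified statement from your question:
DELETE FROM foo USING foo b
WHERE foo.e2 = b.e1
  AND foo.e1 = b.e2
  AND foo.e1 > foo.e2;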