Combine columns from two sources - azure-data-factory

I have two sources resulting from some transformations in a data flow.
I have tried using a join, but it replicates the data; no matter which join type I select, the output looks the same.
I have tried a union as well, but a union either creates nulls in the columns (if done by name) or extra rows (if done by position).
Shouldn't the join just concatenate the columns, since the IDs are the same in both tables?
This is how the desired output should look. I want to concatenate the version column onto the first source so that it looks like this:
ID name value version
111 file1 0.1 3
111 file2 0.82 15
111 file3 2.2 2

Both of your sources have only one matching column (ID), and it is not unique.
When you join the two sources on the ID column, each row of source1 joins with all the matching rows of source2.
Here, every row (111) of source1 joins with all 3 matching rows (111) of source2, hence the result is 9 rows, with the version values repeated for each row of source1.
To get only the 3 rows of your expected result, you need a unique matching key in each source.
Add a Window transformation to both sources and compute rowNumber() partitioned by the ID column (a SQL sketch of the same idea follows the steps below).
Source1->window1:
Window1 data preview:
Source2->window2:
Window2 data preview:
Add a Join transformation to join the two window outputs on the ID and row-number (rank) columns.
Join data preview:
Add a Select transformation to remove the unwanted columns.
Select data preview:
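As promised, here is a rough SQL equivalent of these steps (a sketch with illustrative table and column names, not the data flow script itself): number the rows within each ID on both sides, then join on ID plus the row number.

-- Number rows per ID on each side, then pair them up one-to-one.
WITH w1 AS (
    SELECT id, name, value,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) AS rn
    FROM source1
),
w2 AS (
    SELECT id, version,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY version) AS rn
    FROM source2
)
SELECT w1.id, w1.name, w1.value, w2.version
FROM w1
JOIN w2
  ON w1.id = w2.id
 AND w1.rn = w2.rn;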

That is expected with a join. For example, when you join tables in SQL, you also supply the target projection as part of the SELECT statement. What you need to do here is add a Select transformation after your Join transformation. There, you will reduce the projection to just the columns you would like to retain. You'll be able to choose which side (left or right) you would like to keep for the ID column.
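As a SQL analogy (illustrative table names), the projection is where you decide which columns, and which side's ID, survive the join:

-- Keep the left side's ID; take only version from the right side.
SELECT s1.ID, s1.name, s1.value, s2.version
FROM source1 s1
JOIN source2 s2
  ON s1.ID = s2.ID;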

SELECT query returns more records than exist

Background
I have a table with raster data (grib_data) created by using raster2pgsql.
I have created a second table (turb_mod) with a subset of the points in grib_data that have a value above a certain threshold.
This subset table (turb_mod) has been created with the following query:
WITH turb AS (
    SELECT rid, rast, (ST_PixelAsPoints(rast)).val AS val
    FROM grib_data
)
SELECT rid, rast INTO turb_mod
FROM turb
WHERE val > 0.5;
The response when creating the table is "SELECT 53", indicating that the table turb_mod now holds 53 rows.
Problem
If I now try to return the raster data from turb_mod using the query below, it returns all records from the original table, not the 53 that I am expecting:
SELECT (ST_PixelAsPoints(rast)).x AS x FROM turb_mod;
Questions
Why does my query not return only the 53 records?
Is there a better way to create a table with a selection of raster points from the original table? I want to use the subset to apply further geospatial functions like spatial clustering.
In your final SELECT, you're calling ST_PixelAsPoints, which is a set-returning function. An output row is generated for each element of the function's result set (see the PostgreSQL documentation on set-returning functions), so the row count can differ from that of your source table, turb_mod.
Your query is functionally equivalent to this (preferred) syntax:
SELECT points.x
FROM turb_mod
JOIN LATERAL ST_PixelAsPoints(rast) AS points ON TRUE;
This syntax shows more clearly what's happening, and also shows how you might include more columns from the function's output, which may help to answer your second point.
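On your second question, one possibility (a sketch, assuming you want one row per pixel rather than whole rasters; turb_points is a hypothetical name) is to materialize the filtered pixels as ordinary point rows, which makes later per-point operations such as spatial clustering straightforward:

-- Explode each raster into pixel points, keep only pixels above the
-- threshold, and store them as plain point geometries.
CREATE TABLE turb_points AS
SELECT g.rid, p.x, p.y, p.val, p.geom
FROM grib_data g
JOIN LATERAL ST_PixelAsPoints(g.rast) AS p ON TRUE
WHERE p.val > 0.5;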

Drop function not working after left outer join in pyspark

My pyspark version is 2.1.1. I am trying to join two dataframes (left outer join), each having two columns, id and priority. I am creating my dataframes like this:
a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)
b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)
The schema of c_df comes out as DataFrame[id: int, priority: int, id: int, priority: int].
The drop function is not removing the column.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)
then the priority column from a_df does get dropped.
I'm not sure if this is a version issue or something else, but it feels very weird that the drop function behaves like this.
I know the workaround is to remove the unwanted columns before the join, but I'm still not sure why drop isn't working here.
Thanks in advance.
Duplicate column names after a join in pyspark lead to unpredictable behavior, and I've read that you should disambiguate the names before joining; see these Stack Overflow questions: Spark Dataframe distinguish columns with duplicated name and Pyspark Join and then column select is showing unexpected output. I'm sorry to say I can't find why pyspark doesn't work as you describe.
But the Databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
From the Databricks docs:
If you perform a join in Spark and don't specify your join correctly you'll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how to perform a join so that you don't have duplicated columns.
When you join, you can instead use an alias (that's typically what I use), or you can pass the join columns as a list or str:
df = left.join(right, ["id"])
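Here is a minimal sketch of the alias approach mentioned above, applied to your example (the column choices in the final select are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

a_df = spark.sql("select 123 as id, 1 as priority").alias("a")
b_df = spark.sql("select 123 as id, 1 as priority union select 112 as id, 1 as priority").alias("b")

# Referencing columns through the aliases keeps the duplicate names unambiguous,
# so the select below retains only a's columns (plus whatever you need from b).
c_df = (a_df.join(b_df, F.col("a.id") == F.col("b.id"), "left")
            .select("a.id", "a.priority"))
c_df.show()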

Merging in powerquery

Although I selected Full Outer join, I couldn't get all the rows from both tables.
How can I get all rows from both tables (all 12,093 rows)?
Maybe another join type would help?
let
    Source = Table.NestedJoin(
        #"Beton Irsaliye Kumulatif", {"Proje No & Adi", "Firma Kodu"},
        #"Beton Muhasebe Kumulatif", {"Proje No & Adi", "Hesap No"},
        "Beton Muhasebe Kumulatif", JoinKind.FullOuter)
in
    Source
Your merge is accounting for all your rows. It's just that 4 of the rows in the first table don't have matches in the second table.
Here's a simple example of what is happening. Here, I have two tables: Table1 and Table2. Both have 10 rows. In fact, both are exactly the same.
If I choose to do a Full Outer join with these, using Col1 and Col2 for matching, I'll see this:
It tells me that 10 of the rows from the first table (Table1) match rows of the second table (Table2).
Now, if I change the last two rows of Table1 (specifically, the last two rows of Col2 of Table1) like this:
Then when I try to do a Full Outer join the same way, I'll see this:
Only 8 of the rows from the first table (Table1) match rows of the second table (Table2).
But when I continue with the merge, I see Table1's information in a table, with Table2's matching information embedded as nested tables in that table's "NewColumn" column:
When I then expand "NewColumn", I see all the info from Table1, as before, all the matching info from Table2, and also the rows that don't have matches between the two tables.
All rows of both tables are accounted for.
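For reference, the merge-then-expand sequence in M looks roughly like this (a sketch with illustrative table and column names):

let
    // Full outer merge on two key columns; non-matching rows from both sides survive.
    Merged = Table.NestedJoin(Table1, {"Col1", "Col2"}, Table2, {"Col1", "Col2"}, "NewColumn", JoinKind.FullOuter),
    // Expanding the nested column surfaces Table2's values alongside Table1's.
    Expanded = Table.ExpandTableColumn(Merged, "NewColumn", {"Col1", "Col2"}, {"Table2.Col1", "Table2.Col2"})
in
    Expanded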

Updating multiple rows in one table based on multiple rows in a second

I have two tables, table1 and table2, both of which contain columns that store PostGIS geometries. What I want to do is see where the geometry stored in any row of table2 geometrically intersects the geometry stored in any row of table1, and update a count column in table1 with the number of intersections. So if the geometry in row 1 of table1 intersects the geometries stored in 5 rows of table2, I want to store a count of 5 in a separate column of table1. The tricky part for me is that I want to do this for every row of table1 at the same time.
I have the following:
UPDATE circles SET intersectCount = intersectCount + 1 FROM rectangles
WHERE ST_Intersects(circles.geom, rectangles.geom);
...which doesn't seem to be working. I'm not too familiar with postgres (or SQL in general), and I'm wondering if I can do this all in one statement or if I need a few. I have some ideas for how I would do this with multiple statements (or a for loop), but I'm really looking for a concise solution. Any help would be much appreciated.
Thanks!
Something like:
update t1 set ctr = helper.cnt
from (
    select t1.id, count(*) as cnt
    from t1
    join t2 on st_intersects(t1.col, t2.col)
    group by t1.id
) helper
where helper.id = t1.id;
btw: Your version does not work because each target row can be updated only once in a single UPDATE statement; when several rectangles match a circle, the increment is applied once, not once per match.
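Mapped onto the question's tables (circles, rectangles), a correlated-subquery variant also works, and additionally writes 0 for circles that intersect nothing (the grouped-join version above leaves such rows untouched):

-- Count intersecting rectangles per circle; circles with no matches get 0.
UPDATE circles c
SET intersectCount = (
    SELECT count(*)
    FROM rectangles r
    WHERE ST_Intersects(c.geom, r.geom)
);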

Duplicate values returned with joins

I was wondering if there is a way, using a T-SQL join statement (or any other available option), to only display certain values. I will try to explain exactly what I mean.
My database has tables called job, consign, dechead, and decitem. Job, consign, and dechead will only ever have one row per record, but decitem can have multiple rows, all tied to dechead with a foreign key. I am writing a query that pulls various values from each table. This is fine with all the tables except decitem. From dechead I need to pull an invoice value, and from decitem I need to grab the net weights. When the results are returned, if a dechead has multiple child decitem rows, it displays all values from both tables. What I need it to do is display the dechead values only once, and then all the decitem values.
e.g.
1 ¦123¦£2000¦15.00¦1
2 ¦--¦------¦20.00¦2
3 ¦--¦------¦25.00¦3
Line 1 displays values from dechead and the first line/join from decitem. Lines 2 and 3 just display values from decitem. If I then export the query to, say, Excel, I do not have duplicate values in the first two fields of lines 2 and 3.
e.g.
1 ¦123¦£2000¦15.00¦1
2 ¦123¦£2000¦20.00¦2
3 ¦123¦£2000¦25.00¦3
Thanks in advance.
Check out GROUP BY for your RDBMS: http://msdn.microsoft.com/en-US/library/ms177673%28v=SQL.90%29.aspx
This is a task best left for the application, but if you must do it in SQL, try this:
SELECT
    CASE WHEN RowVal = 1 THEN dt.Col1 ELSE NULL END AS Col1,
    CASE WHEN RowVal = 1 THEN dt.Col2 ELSE NULL END AS Col2,
    dt.Col3,
    dt.Col4
FROM (
    SELECT
        Col1, Col2, Col3, Col4,
        ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1, Col4) AS RowVal
    FROM ...rest of your big query here...
) dt
ORDER BY dt.Col1, dt.Col4
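For a self-contained illustration, here is the same trick on hypothetical data shaped like the question's desired output (table and column names invented for the demo):

-- Three decitem lines under one dechead invoice.
DECLARE @rows TABLE (invoice INT, amount MONEY, net_weight DECIMAL(6,2), line INT);
INSERT INTO @rows VALUES (123, 2000, 15.00, 1), (123, 2000, 20.00, 2), (123, 2000, 25.00, 3);

SELECT
    CASE WHEN rn = 1 THEN CAST(t.invoice AS VARCHAR(10)) ELSE '' END AS invoice,
    CASE WHEN rn = 1 THEN CAST(t.amount  AS VARCHAR(10)) ELSE '' END AS amount,
    t.net_weight,
    t.line
FROM (
    SELECT invoice, amount, net_weight, line,
           ROW_NUMBER() OVER (PARTITION BY invoice ORDER BY line) AS rn
    FROM @rows
) t
ORDER BY t.invoice, t.line;
-- invoice and amount appear only on the first line of each group,
-- matching the first example in the question.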