ADF - Dataflow, using Join to send new values - azure-data-factory

There are two tables.
tbl_1 is the source data:
ID | Submission_id
--------------------
1 | A00_1
2 | A00_2
3 | A00_3
4 | A00_4
5 | A00_5
6 | A00_6
7 | A00_7
tbl_2 is the destination. In this table, Submission_id is a unique key.
ID | Submission_id
--------------------
1 | A00_1
2 | A00_2
3 | A00_3
4 | A00_4
tbl_1 is the input and tbl_2 is the destination (sink). The expected result is that only A00_5, A00_6 & A00_7 are sent to tbl_2. The Join and the Alter Row transformations are configured accordingly (screenshots omitted).
Expected output:
tbl_2
ID | Submission_id
--------------------
1 | A00_1
2 | A00_2
3 | A00_3
4 | A00_4
5 | A00_5 -->(new)
6 | A00_6 -->(new)
7 | A00_7 -->(new)
But the output from the Alter Row transformation contains every Submission_id, even though the not-equal comparison is stated in the Alter Row condition:
notEquals(DC__Submission_ID_BigInt, SrcStgDestination#{_Submission_ID}).
How can this be solved in an Azure Data Flow using 'Join'?

I tried the same procedure and got the same result (all rows getting inserted). We were able to perform the join in the desired way but couldn't proceed further to get the required output. You can instead use the approach given below, which is achieved using joins.
In general, when we want to get the records from table1 that are not present in table2, we execute the following query (in SQL Server):
SELECT t1.id, t1.submission_id
FROM t1
LEFT OUTER JOIN t2 ON t1.submission_id = t2.submission_id
WHERE t2.submission_id IS NULL
In the Dataflow, we were able to achieve the join successfully (same procedure as yours). Now, instead of using an Alter Row transformation, I used a Filter transformation (to achieve the t2.submission_id IS NULL condition). I used the following expression (condition) to filter:
isNull(d1#submission_id) && isNull(d1#id)
Now proceed to configure the sink (tbl_2). The data preview will then show only the new records (A00_5, A00_6 and A00_7).
Publish and run the dataflow activity in your pipeline to get the desired results.
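For comparison, the same logic expressed directly in SQL would be an anti-join insert along these lines (a sketch, assuming the tbl_1 and tbl_2 tables shown above):
-- Insert only those Submission_ids from tbl_1 that are not already in tbl_2.
INSERT INTO tbl_2 (ID, Submission_id)
SELECT t1.ID, t1.Submission_id
FROM tbl_1 AS t1
LEFT OUTER JOIN tbl_2 AS t2 ON t1.Submission_id = t2.Submission_id
WHERE t2.Submission_id IS NULL;
This is exactly what the Join + Filter combination in the dataflow reproduces.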

Related

How to convert nested json to data frame with kdb+

I am trying to get the data from cryptostats as shown below; it gives me back nested JSON. I want it in a table format. How do I do that?
query:"https://api.cryptostats.community/api/v1/fees/oneDayTotalFees/2023-02-07";
raw:.Q.hg query;
res:.j.k raw;
To get the JSON file, use https://api.cryptostats.community/api/v1/fees/oneDayTotalFees/2023-02-07
To view the JSON in a table format, use https://jsongrid.com/json-grid
The final result should be a kdb+ table that has all the columns from the nested JSON output.
They are all dictionaries
q)distinct type each res[`data]
,99h
But they do not collapse to a table because they do not all have matching keys
q)distinct key each res[`data]
`id`bundle`results`metadata`errors
`id`bundle`results`metadata
Looking at a row where errors is populated we can see it is a dictionary
q)res[`data;0;`errors]
oneDayTotalFees| "Error executing oneDayTotalFees on compound: Date incomplete"
You can create a prototype dictionary with a blank errors key in it and join (,) each piece of data onto it. This results in uniform dictionaries, which will be promoted to a table (type 98h):
q)table:(enlist[`errors]!enlist (`$())!()),/:res`data
q)type table
98h
Row which already had errors is unaffected:
q)table 0
errors | (,`oneDayTotalFees)!,"Error executing oneDayTotalFees..
id | "compound"
bundle | 0n
results | (,`oneDayTotalFees)!,0n
metadata| `source`icon`name`category`description`feeDescription;..
Row which previously did not have errors now has a valid empty dictionary
q)table 1
errors | (`symbol$())!()
id | "swapr-ethereum"
bundle | "swapr"
results | (,`oneDayTotalFees)!,24.78725
metadata| `category`name`icon`bundle`blockchain`description`feeDescription..
https://kx.com/blog/kdb-q-insights-parsing-json-files/
https://code.kx.com/q/ref/join/
https://code.kx.com/q/kb/faq/#construction
https://code.kx.com/q/basics/datatypes/
https://code.kx.com/q/ref/maps/#each-left-and-each-right
If you want to explore nested objects you can index at depth (see blog post linked above). If you have many sparse keys leaving it like this is efficient for storage:
q)select tokenSymbol:metadata[::;`tokenSymbol] from table where not ""~/:metadata[::;`tokenSymbol]
tokenSymbol
-----------
"HNY"
If you do wish to explode a nested field you can run similar to:
q)table:table,'{flip c!flip table[`metadata]#\:(c:distinct raze key each table[`metadata])}[]
q)meta table
c | t f a
----------------| -----
errors |
id | C
bundle | C
results |
metadata |
source | C
icon | C
name | C
category | C
description | C
feeDescription | C
blockchain | C
website | C
tokenTicker | C
tokenCoingecko | C
protocolLaunch | C
tokenLaunch | C
adapter | C
subtitle | C
events | C
shortName | C
protocolShutdown| C
tokenSymbol | C
subcategory | C
tokenticker | C
tokencoingecko | C
Care needs to be taken when filling in nulls and keeping consistent types of data in each column. In this dataset the events tag inside metadata is tabular data:
q)select distinct type each events from table
events
------
10
98
0
This would need to be cleaned similar to:
q)table:update events:count[i]#enlist ([] date:();description:()) from table where not 98h=type each events
The data returned from the API contains dictionaries with two distinct sets of keys:
q)distinct key each res`data
`id`bundle`results`metadata`errors
`id`bundle`results`metadata
One simple way to convert this to a table is to enlist each dictionary first, converting them to tables, then joining with uj:
q)(uj/)enlist each res`data
id bundle results metadata ..
-----------------------------------------------------------------------------..
"compound" 0n (,`oneDayTotalFees)!,0n `source`i..
"swapr-ethereum" "swapr" (,`oneDayTotalFees)!,24.78725 `category..
...
This works because uj generalises the join operator (,), allowing different schemas with common elements to be combined.

Aggregate function to extract all fields based on maximum date

In one table I have duplicate values that I would like to group, exporting only the fields from the row where the value in the "published_at" field is the most up-to-date (the latest date possible). Do I understand correctly that if I use the MAX aggregate function, the corresponding fields I extract will come from the row with the max value found, or will it take the first row found in the table?
Let me demonstrate this with a simple example (in the real-world example I am also joining two different tables). I would like to group by id and extract all fields, but only those relating to the max published_at field. My query would be:
SELECT "t1"."id", "t1"."field", MAX("t1"."published_at") as "published_at"
FROM "t1"
GROUP By "t1"."id"
| id | field | published_at |
---------------------------------
| 1 | document1 | 2022-01-10 |
| 1 | document2 | 2022-01-11 |
| 1 | document3 | 2022-01-12 |
The result I want is:
1 - document3 - 2022-01-12
Also, one question: why am I getting the error "ERROR: column "t1"."field" must appear in the GROUP BY clause or be used in an aggregate function"? And can I use the MAX function on a string type column?
If you want the latest row for each id, you can use DISTINCT ON. For example:
select distinct on (id) *
from t
order by id, published_at desc
If you just want the latest row in the whole result set you can use LIMIT. For example:
select *
from t
order by published_at desc
limit 1
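For comparison, the same "latest row per id" result can also be obtained with a window function (a sketch, not part of the original answer; it assumes the t1 table from the question):
-- Rank the rows of each id by published_at, newest first, and keep the top one.
SELECT id, field, published_at
FROM (
    SELECT t1.*,
           row_number() OVER (PARTITION BY id ORDER BY published_at DESC) AS rn
    FROM t1
) ranked
WHERE rn = 1;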

Delete the duplicate rows but keep latest using JOINS throwing Error

I have tried all the ways to delete the duplicate data except one, but all of them take a lot of time to run; a JOIN was the only one that took very little time. The issue is that I am able to run the SELECT query, but the DELETE is not working in pgAdmin 4 (PostgreSQL 14.0). How will I be able to resolve this issue?
delete s1 FROM persons s1,
persons s2
where
s1.personid > s2.personid
AND s1.lastname = s2.lastname
order by personid desc
limit 100
It throws an error saying: syntax error at or near "s1".
How can I solve this issue?
personid | firstname | lastname | email
------------------------------------------
1        | shanny    | edward   | shane#123
2        | abc       | way      | abc#123
You may use the following exists logic when deleting:
DELETE
FROM persons p1
WHERE EXISTS (
    SELECT 1
    FROM persons p2
    WHERE p2.lastname = p1.lastname AND
          p2.personid < p1.personid
);
The above logic will spare, for each last name group of records, a single record having the smallest personid value.
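In PostgreSQL, the join-style delete attempted in the question would normally be written with DELETE ... USING instead; here is a sketch of the same logic (not part of the original answer):
-- Delete every row that shares a lastname with a row having a smaller personid,
-- i.e. keep only the smallest personid in each lastname group.
DELETE FROM persons p1
USING persons p2
WHERE p1.lastname = p2.lastname
  AND p1.personid > p2.personid;
Note that, unlike the MySQL form, PostgreSQL's DELETE does not accept ORDER BY or LIMIT.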

Postgres Query for Beginners

Ok, I deleted my previous post and will try this again. I am sure I don't know the topic, and I'm not sure if this is a loop, or if I should use a stored function, or how to get what I'm looking for. Here's sample data and expected output:
I have a single table A. The table has the following fields: date created, unique person key, type, location.
I need a Postgres query that says: for any given month (parameter, based on date created) and a given location (parameter, based on the location field), return the fields below where the unique person key is duplicated within +/- 30 days of the date created in the given month, for the same type but across all locations.
Example Data
Date Created | Unique Person | Type | Location
---------------------------------------------------
2/5/2017 | 1 | Admit | Hospital1
2/6/2017 | 2 | Admit | Hospital2
2/15/2017 | 1 | Admit | Hospital2
2/28/2017 | 3 | Admit | Hospital2
3/3/2017 | 2 | Admit | Hospital1
3/15/2017 | 3 | Admit | Hospital3
3/20/2017 | 4 | Admit | Hospital1
4/1/2017 | 1 | Admit | Hospital2
Output for the month of March for Hospital1:
DateCreated| UniquePerson | Type | Location | +-30days | OtherLoc.
------------------------------------------------------------------------
3/3/2017 | 2 | Admit| Hospital1 | 2/6/2017 | Hospital2
Output for the month of March for Hospital2:
None, because no one was seen at Hospital2 in March
Output for the month of March for Hospital3:
DateCreated| UniquePerson | Type | Location | +-30days | otherLoc.
------------------------------------------------------------------------
3/15/2017 | 3 | Admit| Hospital3 | 2/28/2017 | Hospital2
Version 1
I would use a WITH clause. Please notice that I've added a column id that is a primary key, to simplify the query. It is just there to prevent rows from being matched with themselves.
WITH x AS (
    SELECT
        id,
        date_created,
        unique_person_id,
        type,
        location
    FROM
        a
    WHERE
        location = 'Hospital1' AND
        date_trunc('month', date_created) = date_trunc('month', '2017-03-01'::date)
)
SELECT
    x.date_created,
    x.unique_person_id,
    x.type,
    x.location,
    a.date_created AS "+-30days",
    a.location AS other_location
FROM
    x
    JOIN a
        USING (unique_person_id, type)
WHERE
    x.id != a.id AND
    abs(x.date_created - a.date_created) <= 30;
Now a little bit of explanation:
First we select, let's say, the reference data with the WITH clause. Think of it as a temporary table that we can reference in the main query. In the given example it could be a "main visit" at the given hospital.
Then we join the "main visits" with other visits of the same person and type (JOIN condition) that happen within 30 days of each other (WHERE condition).
Notice that the WITH query holds the limits you want to check (location and date). I use the date_trunc function, which truncates the date to the specified precision (a month in this case).
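For example, here is a quick illustration of what date_trunc does (a minimal sketch, not part of the original answer):
SELECT date_trunc('month', DATE '2017-03-15');  -- returns 2017-03-01 00:00:00
Any date in March 2017 collapses to the first instant of the month, so two dates compare equal exactly when they fall in the same month.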
Version 2
As Laurenz Albe suggested, there is no special need to use a WITH clause. Right, so here is a second version.
SELECT
    x.date_created,
    x.unique_person_id,
    x.type,
    x.location,
    a.date_created AS "+-30days",
    a.location AS other_location
FROM
    a AS x
    JOIN a
        USING (unique_person_id, type)
WHERE
    x.location = 'Hospital1' AND
    date_trunc('month', x.date_created) = date_trunc('month', '2017-03-01'::date) AND
    x.id != a.id AND
    abs(x.date_created - a.date_created) <= 30;
This version is shorter than the first one but, in my opinion, the first is easier to understand. I don't have a big enough set of data to test on, and I wonder which one runs faster (the query planner shows similar values for both).

Union summary statistics with query result in SQLAlchemy?

I have a PostgreSQL table that stores readings from power meters. I use SQLAlchemy and psycopg2 to query the database. Some large sites can have multiple power meters, and I have a query that returns timestamped data, aggregated by facility:
Raw table:
timestamp | meter_id | facility_id | reading
1:00:00 | 1 | 1 | 1.0
1:00:00 | 2 | 1 | 1.5
1:00:00 | 3 | 2 | 2.1
1:00:30 | 1 | 1 | 1.1
1:00:30 | 2 | 1 | 1.6
1:00:30 | 3 | 2 | 2.2
Aggregated:
timestamp | facility_1 | facility_2
1:00:00 | 2.5 | 2.1
1:00:30 | 2.7 | 2.2
The query I use for this looks like this:
SELECT
    reading.timestamp,
    sum(reading.reading) FILTER (WHERE reading.facility_id = 1) AS facility_1,
    sum(reading.reading) FILTER (WHERE reading.facility_id = 2) AS facility_2
FROM reading
WHERE
    reading.timestamp >= '1:00:00' AND reading.timestamp < '1:01:00'
    AND reading.facility_id IN (1, 2)
GROUP BY reading.timestamp
(Sorry for any SQL errors, I've simplified the problem a little for clarity). I often need to downsample the data for display, which I do by wrapping the above query in a FROM...AS... clause and binning the data into larger time intervals. Before doing that, though, I'd like to grab some summary statistics from my derived "facilities" table--min reading, max reading, avg reading, etc., similar to what's described in this blog post. However, I can't figure out how to use SQLAlchemy to get this data--I keep getting psycopg2 errors from the resulting SQL. My SQLAlchemy version of the above query is:
selects = [Reading.timestamp,
sqlalchemy.func.sum(Reading.reading).filter(Reading.facility_id==1),
sqlalchemy.func.sum(Reading.reading).filter(Reading.facility_id==2)
]
base_query = db.session.query(*selects). \
group_by(Reading.timestamp). \
filter(Reading.facility_id.in_([1, 2])). \
filter(and_(Reading.timestamp>=start_time, Reading.timestamp<=end_time)). \
order_by(Reading.timestamp)
I can get summary statistics with something like this:
subq = base_query.subquery()
avg_selects = [sqlalchemy.func.avg(col) for col in subq.columns]
avg_query = db.session.query(*avg_selects)
Which will return a single row with the average of all columns from my original query. However, I can't figure out how to combine this with my original query--I end up having to get the statistics separately, which feels like a huge waste (these queries can go over many rows). Queries like the one below always return errors:
all = base_query.union(avg_query).all()
ProgrammingError: (psycopg2.ProgrammingError) syntax error at or near "UNION"
LINE 4: ...reading.timestamp ORDER BY reading.timestamp UNION SELE...
I feel like my understanding of SQLAlchemy's subquery system is weak, but I haven't been able to make headway from the subquery tutorial in SQLAlchemy's documentation. Ideas?
The answer was right in the error message--I needed to remove the ORDER BY clause from the base query and apply it outside of the UNION. I am using a dummy timestamp for the summary statistics row to ensure that it sits at the top of the query result in a predictable position after ordering on timestamp. The following code works:
from datetime import datetime
from sqlalchemy.sql import expression, func, and_, asc
from models import Reading

# db (the SQLAlchemy session), start_time and end_time come from the
# surrounding application, as in the question.
selects = [Reading.timestamp.label("timestamp_"),
           func.sum(Reading.reading).filter(Reading.facility_id == 1),
           func.sum(Reading.reading).filter(Reading.facility_id == 2)
           ]
base_query = db.session.query(*selects). \
    group_by(Reading.timestamp). \
    filter(Reading.facility_id.in_([1, 2])). \
    filter(and_(Reading.timestamp >= start_time, Reading.timestamp <= end_time))
subq = base_query.subquery()
# Summary row: a dummy timestamp plus the average of every aggregate column.
avg_selects = [expression.bindparam('dummy_date', datetime(1980, 1, 1)).label("timestamp_")]
avg_selects += [func.avg(col) for col in subq.columns[1:]]
avg_query = db.session.query(*avg_selects)
# ORDER BY is applied to the UNION as a whole, not to either subquery.
full_query = base_query.union(avg_query).order_by(asc("timestamp_"))
I would be happy to hear more graceful ways of accomplishing this. The query is wrapped in a function that takes arbitrary lists of facility IDs; the "columns" trick is the only way I've figured out to make it work with arbitrary columns (as long as the first column is always timestamps).
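For reference, the SQL this produces is roughly of the following shape (a hand-written sketch with the two facility columns spelled out and the time-range filter omitted, not the exact SQLAlchemy output):
SELECT reading.timestamp AS timestamp_,
       sum(reading.reading) FILTER (WHERE reading.facility_id = 1) AS facility_1,
       sum(reading.reading) FILTER (WHERE reading.facility_id = 2) AS facility_2
FROM reading
WHERE reading.facility_id IN (1, 2)
GROUP BY reading.timestamp
UNION
SELECT TIMESTAMP '1980-01-01' AS timestamp_,
       avg(sub.facility_1),
       avg(sub.facility_2)
FROM (
    SELECT reading.timestamp,
           sum(reading.reading) FILTER (WHERE reading.facility_id = 1) AS facility_1,
           sum(reading.reading) FILTER (WHERE reading.facility_id = 2) AS facility_2
    FROM reading
    WHERE reading.facility_id IN (1, 2)
    GROUP BY reading.timestamp
) AS sub
ORDER BY timestamp_ ASC;
The key point is that the single ORDER BY applies to the whole UNION, and the dummy 1980 timestamp makes the summary row sort ahead of the real data.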