Stream Joins based on distinct values from a previous stream - pyspark

I have a main table, Table A, which stores every event that occurs along with some details.
For each "event" in Table A, there is a separate table with that name (the event table).
I am reading Table A as a stream and ideally need to join it to the event tables, so that I end up with a single table containing the rows of Table A together with their respective event details.
In the example below there are two distinct event types, each with its own table and schema.
Example:
Table A:
id, time, detail_1, detail_2, event
Table event 1:
id, detail_6, detail_8, detail_9
Table event 2:
id, detail 11, detail 12
How do I union it so I have a single table in the end with the corresponding details from Table A, event 1 and event 2?
Here is what I was trying to do:
df = (
    spark.readStream.format("delta")
    .option("ignoreChanges", "true")
    .load(f"{table_name}")
)
event_types = df.select("event").distinct().collect()
for row in event_types:
    event = row[0].replace(" ", "_").replace(":","").lower()
    if event in ["task", "module", "_created", "test_created", "test_to_be_deleted"]:
        df_event = spark.readStream.format("delta").option("ignoreChanges", "true").load(f"{event}")
        joined_df = df.join(df_event, Seq("message_id"),"inner")
df.writeStream.format("delta").outputMode("append").option(
    "checkpointLocation",
    f"{table}",
).trigger(once=True).foreachBatch(apple_a_bunch_of_changes).start()
Is there a better way to do this?
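One possible direction, sketched under assumptions rather than tested against real tables: a streaming DataFrame does not support collect(), so the distinct-event lookup can move inside foreachBatch, where each micro-batch is a static DataFrame. The helper names (normalize_event, make_enricher), the target_path argument, and the use of message_id as the join key are hypothetical; the event list is copied from the attempt above.

```python
from functools import reduce

# Event tables to enrich from (copied from the list in the question).
KNOWN_EVENTS = {"task", "module", "_created", "test_created", "test_to_be_deleted"}


def normalize_event(name):
    """Turn an event label into a table name, the same way the question does."""
    return name.replace(" ", "_").replace(":", "").lower()


def make_enricher(spark, target_path, key="message_id"):
    """Build a foreachBatch callback. All three arguments are assumptions."""

    def enrich(batch_df, batch_id):
        # batch_df is a static DataFrame here, so distinct().collect() is allowed.
        rows = batch_df.select("event").distinct().collect()
        labels = [r[0] for r in rows if r[0] is not None]
        parts = []
        for label in labels:
            event = normalize_event(label)
            if event not in KNOWN_EVENTS:
                continue
            # Static (non-streaming) read of the event table; path layout is assumed.
            df_event = spark.read.format("delta").load(event)
            subset = batch_df.filter(batch_df["event"] == label)
            parts.append(subset.join(df_event, on=key, how="inner"))
        if not parts:
            return
        # Event tables differ in schema; missing columns become null (Spark 3.1+).
        combined = reduce(
            lambda left, right: left.unionByName(right, allowMissingColumns=True),
            parts,
        )
        (
            combined.write.format("delta")
            .mode("append")
            .option("mergeSchema", "true")
            .save(target_path)
        )

    return enrich
```

It would be wired in roughly as df.writeStream.option("checkpointLocation", ...).trigger(once=True).foreachBatch(make_enricher(spark, target_path)).start(). Note that the inner join drops Table A rows whose event-table row has not been written yet.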

Related

Is it possible to copy data from one table to another when the data in a specific column matches in both tables?

I have a table A with the following columns and data:
Tag_ = P-111
Description_ = Pump
HP_ = 100
RPM_ = 1,800
I have another table B with same columns:
Tag_ = P-111
Description_
HP_
RPM_
When I enter a value in the Tag_ column of Table B and the same value exists in the Tag_ column of Table A, is there a way to set up a trigger or automatic command that copies the data from the other columns of Table A into Table B?
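A minimal sketch of the trigger idea, assuming SQLite driven from Python's sqlite3 module (the question does not name a database, so the engine, the column types, and the trigger name fill_b_from_a are all assumptions). The trigger fires after an insert into B and copies the remaining columns from A when the Tag_ matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (Tag_ TEXT PRIMARY KEY, Description_ TEXT, HP_ INTEGER, RPM_ INTEGER);
CREATE TABLE B (Tag_ TEXT PRIMARY KEY, Description_ TEXT, HP_ INTEGER, RPM_ INTEGER);
INSERT INTO A VALUES ('P-111', 'Pump', 100, 1800);

-- Hypothetical trigger: after a row lands in B, copy the other columns from A
-- when A holds the same Tag_. Rows without a match are left untouched.
CREATE TRIGGER fill_b_from_a AFTER INSERT ON B
WHEN EXISTS (SELECT 1 FROM A WHERE A.Tag_ = NEW.Tag_)
BEGIN
    UPDATE B
    SET Description_ = (SELECT Description_ FROM A WHERE A.Tag_ = NEW.Tag_),
        HP_          = (SELECT HP_ FROM A WHERE A.Tag_ = NEW.Tag_),
        RPM_         = (SELECT RPM_ FROM A WHERE A.Tag_ = NEW.Tag_)
    WHERE Tag_ = NEW.Tag_;
END;
""")

# Only the tag is entered; the trigger fills in the rest for a known tag.
conn.execute("INSERT INTO B (Tag_) VALUES ('P-111')")
conn.execute("INSERT INTO B (Tag_) VALUES ('P-999')")
matched = conn.execute("SELECT * FROM B WHERE Tag_ = 'P-111'").fetchone()
unmatched = conn.execute("SELECT * FROM B WHERE Tag_ = 'P-999'").fetchone()
```

Other database engines have their own trigger syntax, so the CREATE TRIGGER statement would need adapting.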

Is it possible to create a stream in ksqlDB to emit an ID that detects changes from 3 different tables that are joined together?

Is it possible to create a stream that detects changes in 3 different tables? For example, I have Table A, which contains ids for Table B and Table C. If I construct my join query correctly, could I emit an event that contains Table A's id whenever there is a change in Table B or C?
Table A
id, b_id, c_id, field_abc, field_xyz
Table B
id, foo
Table C
id, bar
I want a stream that emits Table A's ids whenever there is a change in any of those 3 tables. Is this possible?
For example, if field_abc, foo, or bar were to change, I want Table A's id to be emitted to a stream.
I recently ran into an issue similar to the one you're describing. Currently this isn't possible using streams or tables, due to limitations in ksqlDB. We did find a way to achieve the same result, though.
Our solution was to give the connector a custom query that builds a 3-way joined table and combines the last-modified timestamps of the 3 tables.
CREATE SOURCE CONNECTOR xyz_change WITH (
    'connector.class' = '${v_connector_class}',
    'connection.url' = '${v_connection_url}',
    'connection.user' = '${v_connection_user}',
    'connection.password' = '${v_connection_pass}',
    'topic.prefix' = 'jdbc_abc_change',
    'mode' = 'timestamp+incrementing',
    'numeric.mapping' = 'best_fit',
    'incrementing.column.name' = 'id',
    'timestamp.column.name' = 'last_modified',
    'key' = 'id',
    'key.converter' = '${v_converter_long}',
    'query' = 'select id, last_modified from(select a.id as id, GREATEST(a.last_modified, COALESCE(b.last_modified,from_unixtime(0)), COALESCE(c.last_modified,from_unixtime(0))) as last_modified from aaa a LEFT JOIN bbb b on a.fk_id = b.id LEFT JOIN ccc c on a.fk_id = c.id ) sub'
);
With this you're able to create any streams/tables you need off of it.
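To see why the joined query surfaces Table A's id when only B or C changes, here is a small simulation using Python's sqlite3 module. It is a sketch under assumptions: it uses the b_id and c_id columns from the question rather than the answer's fk_id, stores last_modified as plain integers, and relies on SQLite's multi-argument max() in place of GREATEST. The changed_ids helper is hypothetical and only mimics the connector's timestamp polling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER, c_id INTEGER, last_modified INTEGER);
CREATE TABLE b (id INTEGER PRIMARY KEY, foo TEXT, last_modified INTEGER);
CREATE TABLE c (id INTEGER PRIMARY KEY, bar TEXT, last_modified INTEGER);
INSERT INTO a VALUES (1, 10, 20, 100), (2, 11, NULL, 100);
INSERT INTO b VALUES (10, 'x', 100), (11, 'y', 100);
INSERT INTO c VALUES (20, 'z', 100);
""")

# Same shape as the connector query: the newest timestamp across the three
# tables becomes the row's last_modified. COALESCE guards the LEFT JOIN nulls,
# because max(...) with a NULL argument would itself return NULL.
CHANGE_QUERY = """
SELECT a.id AS id,
       max(a.last_modified,
           COALESCE(b.last_modified, 0),
           COALESCE(c.last_modified, 0)) AS last_modified
FROM a
LEFT JOIN b ON a.b_id = b.id
LEFT JOIN c ON a.c_id = c.id
"""


def changed_ids(conn, since):
    """Return ids of Table A rows whose combined timestamp is newer than `since`."""
    sql = f"SELECT id FROM ({CHANGE_QUERY}) AS sub WHERE last_modified > ? ORDER BY id"
    return [row[0] for row in conn.execute(sql, (since,))]


# Only Table B changes, yet the id of the referencing Table A row is reported.
conn.execute("UPDATE b SET foo = 'changed', last_modified = 200 WHERE id = 11")
ids = changed_ids(conn, 100)
```

This only works if every table maintains a last_modified column that is updated on each write, as the connector configuration above assumes.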

Create column with aggregated value with calculation in PBI

Imagine you have two tables:
Table User:
ID, Name
Table Orders:
ID, UserID
I'm trying to create a new column in table User which should contain aggregated values of distinct count of Order.IDs.
Calculated column:
OrderCount = CALCULATE(DISTINCTCOUNT(Orders[Id]))
Alternatively if you don't/can't have a relationship between the two tables:
OrderCount2 = CALCULATE(DISTINCTCOUNT(Orders[Id]),FILTER(Orders, Orders[UserId] = User[Id]))
If all you need is to display it in some visualisation, you can use Orders[Id] directly by setting the aggregate option to Count (Distinct) in Values under Visualizations side pane.
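For readers less familiar with DAX, the calculated column amounts to a distinct count of order ids per user. A plain-Python sketch of that logic follows; the sample rows and the order_count helper are made up for illustration.

```python
# Made-up sample data mirroring the User and Orders tables.
users = [{"Id": 1, "Name": "Ann"}, {"Id": 2, "Name": "Bob"}, {"Id": 3, "Name": "Cy"}]
orders = [
    {"Id": 100, "UserId": 1},
    {"Id": 101, "UserId": 1},
    {"Id": 101, "UserId": 1},  # duplicate order id, counted once
    {"Id": 102, "UserId": 2},
]


def order_count(user_id, orders):
    """Distinct order ids for one user, like FILTER + DISTINCTCOUNT."""
    return len({o["Id"] for o in orders if o["UserId"] == user_id})


# One value per user row, as a calculated column would produce.
# Note: DAX returns BLANK rather than 0 for a user with no orders.
counts = {u["Name"]: order_count(u["Id"], orders) for u in users}
```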

Postgres reverse inheritance? (inherit rows from parent table)

I have a number of records that are common to all schemas. I place these records in a table in a shared schema, and would like each of the child schemas to inherit the rows from this shared parent table.
Suppose I have the following schemas:
CREATE SCHEMA parent;
CREATE SCHEMA a;
CREATE SCHEMA b;
CREATE TABLE parent.component (product_id serial PRIMARY KEY, title text);
CREATE TABLE a.component () INHERITS (parent.component);
CREATE TABLE b.component () INHERITS (parent.component);
INSERT INTO parent.component(title) VALUES ('parent');
INSERT INTO a.component(title) VALUES ('a_test') ,('a_test2') ;
INSERT INTO b.component(title) VALUES ('b_test') ,('b_test2');
Is there a way to get the union of the rows from the parent and from either a.component or b.component when I issue a select on either a or b?
So for example:
SELECT * FROM a.component;
returns rows:
id | title
---+---------
1  | parent
2  | a_test
3  | a_test2
PostgreSQL has multiple inheritance, so a table can be the child of many tables. You could try to inherit in the other direction!
But maybe a simple UNION ALL query is a simpler and better solution.
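A rough illustration of the UNION ALL suggestion, run against SQLite from Python so it is self-contained. This is an assumption-laden stand-in for PostgreSQL: the schemas are emulated with attached in-memory databases, the tables are plain (no INHERITS), and the ids are inserted by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attached in-memory databases stand in for the PostgreSQL schemas.
for schema in ("parent", "a"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")

conn.executescript("""
CREATE TABLE parent.component (product_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE a.component (product_id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO parent.component VALUES (1, 'parent');
INSERT INTO a.component VALUES (2, 'a_test'), (3, 'a_test2');
""")

# Shared rows from the parent schema followed by the rows specific to schema a.
rows = conn.execute("""
SELECT product_id, title FROM parent.component
UNION ALL
SELECT product_id, title FROM a.component
ORDER BY product_id
""").fetchall()
```

In PostgreSQL with the INHERITS setup from the question, a plain select on parent.component already includes the child rows, so the first branch would need FROM ONLY parent.component to avoid returning them twice.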

Embedded Select for From value

Having difficulty framing my question for Google.
I am trying to embed a select statement which pulls partition table names from a view. I want to cycle through these tables and do a search within them for a value count.
I have:
SELECT COUNT(objectA)
FROM (SELECT partitiontablename
      FROM partitions
      WHERE tablename = 'x')
AS tableNameQuery
WHERE objectB = 1
I am getting ERROR: column "objectB" does not exist
The partition tables do have objectB (they all share the same table structure). Can you point out what I am doing wrong?
Thank you!
Try this query:
SELECT COUNT(objectA)
FROM (
    SELECT partitiontablename, objectB, objectA
    FROM partitions
    WHERE tablename = 'x'
) AS tableNameQuery
WHERE objectB = 1
The subquery in your query retrieves only the partitiontablename column, so the outer query sees only that column and cannot see objectB.
The same problem applies to objectA, which is used in COUNT() in the outer query.
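The scoping rule can be reproduced with a small sqlite3 sketch. It assumes, as the query above does, that partitions itself exposes objectA and objectB; the sample rows are made up. If those columns exist only in the partition tables themselves, the table names would have to be looped over with dynamic SQL instead, which this sketch does not cover.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE partitions (
    tablename TEXT, partitiontablename TEXT, objectA INTEGER, objectB INTEGER
);
INSERT INTO partitions VALUES
    ('x', 'x_p1', 5, 1), ('x', 'x_p2', 6, 1), ('x', 'x_p3', 7, 0), ('y', 'y_p1', 9, 1);
""")

# The derived table exposes only partitiontablename, so the outer query fails.
narrow = """
SELECT COUNT(objectA)
FROM (SELECT partitiontablename FROM partitions WHERE tablename = 'x') AS tableNameQuery
WHERE objectB = 1
"""
try:
    conn.execute(narrow)
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)  # the engine reports an unknown column

# Selecting the needed columns in the subquery makes them visible outside it.
wide = """
SELECT COUNT(objectA)
FROM (
    SELECT partitiontablename, objectB, objectA
    FROM partitions
    WHERE tablename = 'x'
) AS tableNameQuery
WHERE objectB = 1
"""
count = conn.execute(wide).fetchone()[0]
```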