How to compare two tables in KSQL using a sub-query? - apache-kafka

I've created two tables in ksqlDB. Both of them have a similar structure:
|ID  |TIMESTAMP  |PAYLOAD                   |
+----+-----------+--------------------------+
|"1" |1664248879 |{"ID":"1","channel":"1"}  |
|"2" |1664248879 |{"ID":"2","channel":"2"}  |
|"5" |1664248879 |{"ID":"5","channel":"3"}  |
|"6" |1664248879 |{"ID":"6","channel":"6"}  |
Now I need to find the difference between the two tables. I've tried a few queries and learned that sub-queries are not allowed in ksqlDB. Is it possible to achieve this with a KSQL query?
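One way this is often handled in ksqlDB, sketched here with placeholder names TABLE_A and TABLE_B and assuming both tables are keyed on ID, is a LEFT JOIN plus a null check rather than a sub-query:

-- Rows of TABLE_A whose key is missing from TABLE_B (placeholder table names).
-- ksqlDB table-table joins must be on the primary key, assumed here to be ID.
SELECT a.ID, a.PAYLOAD
FROM TABLE_A a
LEFT JOIN TABLE_B b ON a.ID = b.ID
WHERE b.ID IS NULL
EMIT CHANGES;

Running the same query with the tables swapped gives the keys present only in the other table, and extending the WHERE clause with a.PAYLOAD <> b.PAYLOAD would also flag keys whose payloads differ.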

Related

Spark Scala: finding a value in another dataframe

Hello, I'm fairly new to Spark and I need help with this little exercise. I want to find certain values in another dataframe, but if those values aren't present I want to reduce the length of each value until I find a match. I have these dataframes:
----------------
|values_to_find|
----------------
| ABCDE |
| CBDEA |
| ACDEA |
| EACBA |
----------------
------------------
| list | Id |
------------------
| EAC | 1 |
| ACDE | 2 |
| CBDEA | 3 |
| ABC | 4 |
------------------
And I expect the next output:
--------------------------------
| Id | list | values_to_find |
--------------------------------
| 4 | ABC | ABCDE |
| 3 | CBDEA | CBDEA |
| 2 | ACDE | ACDEA |
| 1 | EAC | EACBA |
--------------------------------
For example, ABCDE isn't present, so I reduce its length by one (ABCD); again it doesn't match anything, so I reduce it again and this time I get ABC, which matches, so I use that value to join and form a new dataframe. There is no need to worry about duplicate values when reducing the length, but I need to find the exact match. Also, I would like to avoid using a UDF if possible.
I'm using a foreach to get every value in the first dataframe, and I can do a substring there (if there is no match), but I'm not sure how to look up these values in the second dataframe. What's the best way to do it? I've seen tons of UDFs that could do the trick, but I want to avoid that, as stated before.
df1.foreach { row =>
  val shortened = row.getString(0).substring(0, 4) // shorten the value; the lookup in the second dataframe is still missing
}
Edit: those dataframes are examples; I have many more values, so the solution should be dynamic... iterate over some values and find their match in another dataframe, with the catch that I need to reduce their length if they're not present.
Thanks for the help!
You can register the dataframe as a temporary view and write the SQL. Is this the first time you are implementing this scenario in Spark, or have you already implemented it in a legacy system before Spark? With Spark you have the freedom to write a UDF in Scala or to use SQL. Sorry, I don't have a solution handy, so I'm just giving a pointer.
The following should help you:
import spark.implicits._ // needed for toDF on local Seqs
val dataDF1 = Seq((4, "ABC"), (3, "CBDEA"), (2, "ACDE"), (1, "EAC")).toDF("Id", "list")
val dataDF2 = Seq("ABCDE", "CBDEA", "ACDEA", "EACBA").toDF("compare")
dataDF1.createOrReplaceTempView("table1")
dataDF2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 on table1.list like concat('%',SUBSTRING(table2.compare,1,3),'%')").show()
Output:
+---+-----+-------+
| Id| list|compare|
+---+-----+-------+
| 4| ABC| ABCDE|
| 3|CBDEA| CBDEA|
| 2| ACDE| ACDEA|
| 1| EAC| EACBA|
+---+-----+-------+
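If the match needs to be dynamic rather than hard-coded to the first three characters (as the asker's edit requires), one sketch, assuming the entries in list are always exact prefixes of the values to find as in the example data, is to join on a prefix condition and keep the longest matching list per value. It can be run through spark.sql against the same table1/table2 temp views:

select Id, list, compare as values_to_find
from (
  select t1.Id, t1.list, t2.compare,
         row_number() over (partition by t2.compare
                            order by length(t1.list) desc) as rn
  from table1 t1
  join table2 t2
    on substring(t2.compare, 1, length(t1.list)) = t1.list
) ranked
where rn = 1

For ABCDE this keeps the Id 4 / ABC row, because ABC is the longest entry in list that is a prefix of it.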

Does SQL have a way to group rows without squashing the group into a single row?

I want to do a single query that outputs an array of arrays of table rows. Think along the lines of <table><rowgroup><tr><tr><tr><rowgroup><tr><tr>. Is SQL capable of this? (specifically, as implemented in MariaDB, though migration to AWS RDS might occur one day)
The GROUP BY statement alone does not do this; it creates one row per group.
Here's an example of what I'm thinking of…
SELECT * FROM memes;
+------------+----------+
| file_name | file_ext |
+------------+----------+
| kittens | jpeg |
| puppies | gif |
| cats | jpeg |
| doggos | mp4 |
| horses | gif |
| chickens | gif |
| ducks | jpeg |
+------------+----------+
SELECT * FROM memes GROUP BY file_ext WITHOUT COLLAPSING GROUPS;
+------------+----------+
| file_name | file_ext |
+------------+----------+
| kittens | jpeg |
| cats | jpeg |
| ducks | jpeg |
+------------+----------+
| puppies | gif |
| horses | gif |
| chickens | gif |
+------------+----------+
| doggos | mp4 |
+------------+----------+
I've been using MySQL for ~20 years and have not come across this functionality before but maybe I've just been looking in the wrong place ¯\_(ツ)_/¯
I haven't seen an array rendering such as the one you want, but you can simulate it with multiple GROUP BY / GROUP_CONCAT() clauses.
For example:
select concat('[', group_concat(g), ']') as a
from (
  select concat('[', group_concat(file_name), ']') as g
  from memes
  group by file_ext
) x
Result:
a
---------------------------------------------------------
[[puppies,horses,chickens],[kittens,cats,ducks],[doggos]]
See running example at DB Fiddle.
You can tweak the delimiters, such as the ',' separator and the '[' and ']' brackets.
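For instance, swapping in explicit separators is a small variation of the query above, against the same memes table:

select concat('[', group_concat(g separator ' ; '), ']') as a
from (
  select concat('[', group_concat(file_name separator ' | '), ']') as g
  from memes
  group by file_ext
) x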
SELECT ... ORDER BY file_ext will come close to your second output.
Using GROUP BY ... WITH ROLLUP would let you do subtotals under each group, which is not what you wanted either, but it would give you extra lines where you want the breaks.
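A minimal sketch of that ROLLUP variant against the same memes table; the rows where file_name is NULL are the per-group subtotal lines, and the final row with both columns NULL is the grand total:

select file_ext, file_name, count(*) as files
from memes
group by file_ext, file_name with rollup;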

SQL parameter table

I suspect this question is already well-answered, but perhaps due to limited SQL vocabulary I have not managed to find what I need. I have a database with many code:description mappings in a single 'parameter' table. I would like to define a query or procedure that returns all (or an arbitrary list of) coded values in a given 'content' table together with their descriptions from the parameter table. I don't want to alter the original data; I just want to display friendly results.
Is there a standard way to do this?
Can it be accomplished with SELECT or are other statements required?
Here is a sample query for a single coded field:
SELECT TOP (5)
newid() as id,
B.BRIDGE_STATUS,
P.SHORTDESC
FROM
BRIDGE B
LEFT JOIN PARAMTRS P ON P.TABLE_NAME = 'BRIDGE'
AND P.FIELD_NAME = 'BRIDGE_STATUS'
AND P.PARMVALUE = B.BRIDGE_STATUS
ORDER BY
id
I want to produce 'decoded' results like:
| id                                   | BRIDGE_STATUS |
|--------------------------------------|---------------|
| BABCEC1E-5FE2-46FA-9763-000131F2F688 | Active        |
| 758F5201-4742-43C6-8550-000571875265 | Active        |
| 5E51634C-4DD9-4B0A-BBF5-00087DF71C8B | Active        |
| 0A4EA521-DE70-4D04-93B8-000CD12B7F55 | Inactive      |
| 815C6C66-8995-4893-9A1B-000F00F839A4 | Proposed      |
Rather than original, coded data like:
| id | BRIDGE_STATUS |
|--------------------------------------|---------------|
| F50214D7-F726-4996-9C0C-00021BD681A4 | 3 |
| 4F173E40-54DC-495E-9B84-000B446F09C3 | 3 |
| F9C216CD-0453-434B-AFA0-000C39EFA0FB | 3 |
| 5D09554E-201D-4208-A786-000C537759A1 | 1 |
| F0BDB9A4-E796-4786-8781-000FC60E200C | 4 |
but for an arbitrary number of columns.
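One common way to handle that, sketched below, is simply one aliased LEFT JOIN to PARAMTRS per coded column; BRIDGE_TYPE here is a hypothetical second coded column, added only to show the pattern:

-- BRIDGE_TYPE is a hypothetical second coded column; each coded column
-- gets its own aliased join to PARAMTRS.
SELECT TOP (5)
    NEWID() AS id,
    ps.SHORTDESC AS BRIDGE_STATUS,
    pt.SHORTDESC AS BRIDGE_TYPE
FROM BRIDGE B
LEFT JOIN PARAMTRS ps
       ON ps.TABLE_NAME = 'BRIDGE'
      AND ps.FIELD_NAME = 'BRIDGE_STATUS'
      AND ps.PARMVALUE  = B.BRIDGE_STATUS
LEFT JOIN PARAMTRS pt
       ON pt.TABLE_NAME = 'BRIDGE'
      AND pt.FIELD_NAME = 'BRIDGE_TYPE'
      AND pt.PARMVALUE  = B.BRIDGE_TYPE
ORDER BY id;

Wrapping a SELECT like this in a view keeps the original data untouched while presenting the friendly descriptions.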

Enabling Parallelization in Spark with Partition Pushdown in MemSQL

I have a columnstore table in MemSQL that has a schema similar to the one below:
CREATE TABLE key_metrics (
source_id TEXT,
date TEXT,
metric1 FLOAT,
metric2 FLOAT,
…
SHARD KEY (source_id, date) USING CLUSTERED COLUMNSTORE
);
I have a Spark application (running with Spark Job Server) that queries the MemSQL table. Below is a simplified form of the kind of Dataframe operation I am doing (in Scala):
sparkSession
  .read
  .format("com.memsql.spark.connector")
  .options(Map("path" -> "dbName.key_metrics"))
  .load()
  .filter(col("source_id").equalTo("12345678"))
  .filter(col("date").isin(Seq("2019-02-01", "2019-02-02", "2019-02-03"): _*))
I have confirmed by looking at the physical plan that these filter predicates are being pushed down to MemSQL.
I have also checked that there is a pretty even distribution of the partitions in the table:
+---------------+-------------+--------------+--------+------------+
| DATABASE_NAME | TABLE_NAME  | PARTITION_ID | ROWS   | MEMORY_USE |
+---------------+-------------+--------------+--------+------------+
| dbName        | key_metrics | 0            | 784012 | 0          |
| dbName        | key_metrics | 1            | 778441 | 0          |
| dbName        | key_metrics | 2            | 671606 | 0          |
| dbName        | key_metrics | 3            | 748569 | 0          |
| dbName        | key_metrics | 4            | 622241 | 0          |
| dbName        | key_metrics | 5            | 739029 | 0          |
| dbName        | key_metrics | 6            | 955205 | 0          |
| dbName        | key_metrics | 7            | 751677 | 0          |
+---------------+-------------+--------------+--------+------------+
My question is regarding partition pushdown. It is my understanding that with it, we can use all the cores of the machines and leverage parallelism for bulk loading. According to the docs, this is done by creating as many Spark tasks as there are MemSQL database partitions.
However, when running the Spark pipeline and observing the Spark UI, it seems that only one Spark task is created, which makes a single query to the DB and runs on a single core.
I have made sure that the following properties are set as well:
spark.memsql.disablePartitionPushdown = false
spark.memsql.defaultDatabase = "dbName"
Is my understanding of partition pushdown incorrect? Is there some other configuration that I am missing?
Would appreciate your input on this.
Thanks!
SingleStore credentials have to be the same on all nodes to take advantage of partition pushdown. If you do have the same credentials throughout all the nodes, please try installing the latest version of the Spark connector, because this often occurs due to compatibility issues between the Spark connector and SingleStore.

MongoDB model for cross vendor time series data

I know my problem seems better solved by RDBMS models, but I really want to deploy it using MongoDB because I have potentially irregular fields to add to each record in the future and I also want to practice my NoSQL database skills.
PE ratio and PB ratio data provided by one vendor:
| Vendor5_ID| PE| PB|date |
|----------:|----:|-----:|:----------|
| 210| 3.90| 2.620|2017-08-22 |
| 210| 3.90| 2.875|2017-08-22 |
| 228| 3.85| 2.320|2017-08-22 |
| 214| 3.08| 3.215|2017-08-22 |
| 187| 3.15| 3.440|2017-08-22 |
| 181| 2.76| 3.460|2017-08-22 |
Price data and analyst coverage provided by another vendor:
|Symbol | Price| Analyst|date |
|:------|-----:|-------:|:----------|
|AAPL | 160| 6|2017-08-22 |
|MSFT | 160| 6|2017-08-22 |
|GOOG | 108| 4|2017-08-22 |
And I have key-mapping data:
| uniqueID|Symbol |from |to |
|--------:|:------|:----------|:----------|
| 1|AAPL |2016-01-10 |2017-08-22 |
| 2|MSFT |2016-01-10 |2017-08-22 |
| 3|GOOG |2016-01-10 |2017-08-22 |
| uniqueID| Vendor5_ID|from |to |
|--------:|----------:|:----------|:----------|
| 1| 210|2016-01-10 |2017-08-22 |
| 2| 228|2016-01-10 |2017-08-22 |
| 3| 214|2016-01-10 |2017-08-22 |
I want to execute time-range queries fast. I came up with the idea of storing each column as a collection:
db.PE:
{
_id,
uniqueID,
Vendor5_ID,
value,
date
}
db.PB:
{
_id,
uniqueID,
Vendor5_ID,
value,
date
}
db.Price:
{
_id,
uniqueID,
Symbol,
value,
date
}
db.Analyst:
{
_id,
uniqueID,
Symbol,
value,
date
}
Is this a good solution? What model do you think is best if there is far more data to add from different vendors?
I would consider using a nested-table or child-table approach. I am not sure to what extent Mongo supports this. I would consider using Oracle NoSQL Database for this use case, with its nested-table support, TTL, and higher throughput (because of BDB as the storage engine). With nested tables you could store PE and PB with timestamps in the child/nested table while the parent table continues to hold the symbol/vendor_id and any other details. This will ensure that your queries stay on the same shard; putting them in a different collection will not guarantee the same shard.