How to use a temporal table join in Flink SQL batch mode? - flink-sql

Because of data delays, I want to revise data at T+1 by running a temporal table join with Flink SQL in batch mode. The official Flink documentation says joins are supported in batch mode, but I get the following error when executing the SQL: org.apache.flink.table.api.TableException: unexpected correlate variable $cor1 in the plan
set table.dynamic-table-options.enabled=true;
with ods as (select ts, id, name from tbl_ods /*+ OPTIONS('read.streaming.enabled'='false') */ where pdate=xxx),
dim as (select ts, id, city from tbl_dim /*+ OPTIONS('read.streaming.enabled'='false') */ where pdate=xxx)
select
ods.id,
ods.name,
dim.city
from ods
left join dim for system_time as of dim.ts as dm on ods.id=dm.id;
Flink version: 1.15.2
Hudi version: 0.12.2
Is the temporal table join not supported in Flink SQL batch mode, or do I need to change the SQL?
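For reference, in the temporal join syntax shown in the Flink documentation, FOR SYSTEM_TIME AS OF references a time attribute of the left (probe) table rather than the dimension table, and the selected column should use the join alias. A hedged sketch of the query in that form (this only illustrates the documented syntax; whether it avoids the $cor1 planner error in batch mode is exactly what is being asked):
set table.dynamic-table-options.enabled=true;
with ods as (select ts, id, name from tbl_ods /*+ OPTIONS('read.streaming.enabled'='false') */ where pdate=xxx),
dim as (select ts, id, city from tbl_dim /*+ OPTIONS('read.streaming.enabled'='false') */ where pdate=xxx)
select
  ods.id,
  ods.name,
  dm.city                                            -- reference the join alias dm, not dim
from ods
left join dim for system_time as of ods.ts as dm     -- time attribute taken from the probe (left) side
  on ods.id = dm.id;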

Related

How does Spark SQL execute a SQL query with a join operation?

For the following operation, which runs a SQL statement through Spark SQL to join two tables in PostgreSQL:
val df = spark.read.jdbc(url, "select * from table_1 join table_2 on a where x", connproperties)
Will the database engine execute the join and send back the joined result, or will the database send all records of table_1 and table_2 to the Spark job so that Spark does the joining? Is there documentation that explains this behavior? Thanks!
The PostgreSQL database will only return a single result set for a single query. If you use valid SQL, that can be the joined result, or nothing if no records match your conditions.
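One pattern for making sure PostgreSQL executes the join (rather than Spark pulling both tables) is to pass a parenthesized subquery with an alias as the table argument of spark.read.jdbc, since that argument is treated as a table expression, not a bare SELECT. A minimal sketch with hypothetical column names and join condition:
-- pass this whole string (including the parentheses and alias) as the
-- second argument of spark.read.jdbc; PostgreSQL then runs the join and
-- returns only the joined rows
(select t1.id, t1.x, t2.y
 from table_1 t1
 join table_2 t2 on t1.id = t2.id
 where t1.x is not null) as joined_tables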

Can't we join two tables and fetch data in Kafka?

I have joined two tables and fetched data using the Postgres source connector, but every time it failed with the same issue.
I have run the same query directly in Postgres and it runs without any issue. Is fetching data by joining tables not possible in Kafka?
I solved this issue by using a subquery. The problem was that when I used an alias, the aliased column was interpreted as a whole column name, which caused the error. Here is my query:
select * from (
  select p."Id" as "p_Id", p."CreatedDate" as p_createddate, p."ClassId" as p_classid,
         c."Id" as c_id, c."CreatedDate" as c_createddate, c."ClassId" as c_classid
  from "PolicyIssuance" p
  left join "ClassDetails" c on p."DocumentNumber" = c."DocumentNo"
) as result

Error executing Apache BEAM sql query - Use a Window.into or Window.triggering transform prior to GroupByKey

How do I include a Window.into or Window.triggering transform prior to GroupByKey in Beam SQL?
I have the following 2 tables:
1st table
CREATE TABLE table1(
field1 varchar
,field2 varchar
)
2nd table
CREATE TABLE table2(
field1 varchar
,field3 varchar
)
And I am writing the result into a 3rd table:
CREATE TABLE table3(
field1 varchar
,field3 varchar
)
The first 2 tables read data from a Kafka stream; I am joining them and inserting the data into the third table using the following query. The first 2 tables are unbounded.
INSERT INTO table3
(field1,
field3)
SELECT a.field1,
b.field3
FROM table1 a
JOIN table2 b
ON a.field1 = b.field1
I am getting the following error:
Caused by: java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
    at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:126)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:74)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple.apply(KeyedPCollectionTuple.java:107)
    at org.apache.beam.sdk.extensions.joinlibrary.Join.innerJoin(Join.java:59)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.standardJoin(BeamJoinRel.java:217)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.buildBeamPipeline(BeamJoinRel.java:161)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamProjectRel.buildBeamPipeline(BeamProjectRel.java:68)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamAggregationRel.buildBeamPipeline(BeamAggregationRel.java:80)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamIOSinkRel.buildBeamPipeline(BeamIOSinkRel.java:64)
    at org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:127)
    at com.dss.tss.v2.client.BeamSqlCli.compilePipeline(BeamSqlCli.java:95)
    at com.dss.test.v2.client.SQLCli.main(SQLCli.java:100)
This is a current implementation limitation of Beam SQL. You need to define windows and then join the inputs per window.
A couple of examples of how to do joins and windowing in Beam SQL:
complex SQL query with HOP window and joins;
test which defines a window in Java outside of SQL and then applies query with join;
examples of other window functions can be found here;
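As a rough illustration only, in the spirit of the first example above (assuming each input also has an event-time column, here called event_ts, which the DDL above does not include, and using the legacy TUMBLE group-by syntax of that Beam SQL generation; not verified against that exact version): window each input inside the query so the join is evaluated per fixed window instead of in the GlobalWindow.
INSERT INTO table3 (field1, field3)
SELECT a.field1, b.field3
FROM
  (SELECT field1,
          TUMBLE_START(event_ts, INTERVAL '1' HOUR) AS window_start
     FROM table1
    GROUP BY field1, TUMBLE(event_ts, INTERVAL '1' HOUR)) a
JOIN
  (SELECT field1, field3,
          TUMBLE_START(event_ts, INTERVAL '1' HOUR) AS window_start
     FROM table2
    GROUP BY field1, field3, TUMBLE(event_ts, INTERVAL '1' HOUR)) b
  ON a.field1 = b.field1 AND a.window_start = b.window_start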
Background
The problem is caused by the fact that it is hard to define a join operation for unbounded data streams in general; it is not limited to Beam SQL.
Imagine, for example, a data processing system that receives inputs from two sources and has to match records between them. From a high-level perspective, such a system has to keep all the data it has seen so far, and for each new record it has to go over all records from the other input source to see whether there is a match. This works fine when the data sources are finite and small; in a simple case you could just load everything into memory, match the data from the sources, and produce the output.
With streaming data you cannot keep caching it forever. What if the data never stops coming? It is also unclear when you want to emit the results. If you have an outer join, when do you decide that there is no matching record in the other input?
For example, see the explanation of unbounded PCollections in the GroupByKey section of the Beam programming guide. Joins in Beam (including Beam SQL joins) are usually implemented on top of it using CoGroupByKey.
All of these questions can probably be answered for a specific pipeline, but it is hard to solve them in the general case. The current approach in the Beam SDK and Beam SQL is to delegate this to the user to solve for the concrete business case. Beam allows users to decide what data to aggregate together into a window, how long to wait for late data, and when to emit the results. There are also things like state cells and timers for more granular control. This lets a programmer writing a pipeline define the behavior explicitly and work around these problems, at the cost of (a lot of) extra complexity.
Beam SQL is implemented on top of regular Beam SDK concepts and is bound by the same limitations, but it has more limitations of its own. For example, you don't have SQL syntax to define triggers, state, or custom windows, and you cannot write a custom ParDo that could, say, keep state in an external service.

Spark SQL - pyspark api vs sql queries

All,
I have a question regarding writing a Spark SQL program: is there a performance difference between writing
SQLContext.sql("select count(*) from (select distinct col1, col2 from table) t")
and using the PySpark API: df.select("col1", "col2").distinct().count()?
I would like to hear suggestions on the correct way to convert very large queries (around 1000 lines, joining 10+ tables) into a PySpark program.
I come from a SQL background and we are working on converting existing logic to Hadoop, hence SQL is handy.

flink Table SQL Api

I want to know whether we can write a query that uses two tables (a join) with the Flink Table and SQL API.
I am new to Flink. I want to create two tables from two different data sets, query them, and produce another data set.
My query would be something like select ... from table1, table2 ..., so can we write a query like this that queries two or more tables?
Thanks
Flink's Table API supports join operations (full, left, right, inner joins) on batch tables (e.g. those created from a DataSet).
SELECT c, g FROM Table3, Table5 WHERE b = e
For streaming tables (e.g. those created from a DataStream), Flink does not yet support join operations. But the Flink community is actively working to add them in the near future.
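The same result can also be written with explicit join syntax; a minimal sketch reusing the (hypothetical) tables and columns from the example above:
SELECT c, g
FROM Table3
JOIN Table5 ON b = e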