I want to know whether we can write a query using two tables (a join) in Flink's Table and SQL API.
I am new to Flink. I want to create two tables from two different data sets, query them, and produce another dataset.
My query would be like select ... from table1, table2 ..., so can we write a query like this that queries two or more tables?
Thanks
Flink's Table API supports join operations (full, left, right, inner joins) on batch tables (e.g. those created from a DataSet).
SELECT c, g FROM Table3, Table5 WHERE b = e
For streaming tables (e.g. those created from a DataStream), Flink does not yet support join operations. But the Flink community is actively working to add them in the near future.
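For illustration, a minimal batch sketch along these lines might look like the following. It assumes two placeholder DataSets ds1 and ds2, and the exact environment and registration method names (BatchTableEnvironment.create, registerDataSet, sqlQuery, scan) vary between Flink versions:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

// Register the two DataSets as tables with named fields.
tEnv.registerDataSet("Table3", ds1, "a, b, c");
tEnv.registerDataSet("Table5", ds2, "d, e, f, g, h");

// Join the two registered tables with SQL.
Table result = tEnv.sqlQuery("SELECT c, g FROM Table3, Table5 WHERE b = e");

// Or, equivalently, with the Table API:
// Table result = tEnv.scan("Table3").join(tEnv.scan("Table5")).where("b = e").select("c, g");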
Related
I have joined two tables and fetched data using the Postgres source connector, but every time it gave the same issue, i.e.
I have run the same query in Postgres and it runs without any issue. Is fetching data by joining tables not possible in Kafka?
I solved this issue by using a subquery. The problem was that
when I used an alias, the aliased column was interpreted as a whole as a column name, which is where the problem came from. Here is my query:
select * from (select p."Id" as "p_Id", p."CreatedDate" as p_createddate, p."ClassId" as p_classid, c."Id" as c_id, c."CreatedDate" as c_createddate, c."ClassId" as c_classid from "PolicyIssuance" p left join "ClassDetails" c on p."DocumentNumber" = c."DocumentNo") as result
I have two tables, and by joining those 2 tables I need to take the distinct count of one particular column in OrientDB. Since joins are not supported in OrientDB, please suggest how joins work in OrientDB and also how to use edges for joins in OrientDB.
How do I include a Window.into or Window.triggering transform prior to GroupByKey in Beam SQL?
I have the following 2 tables:
1st table
CREATE TABLE table1(
field1 varchar
,field2 varchar
)
2nd Table
CREATE TABLE table2(
field1 varchar
,field3 varchar
)
And I am writing the result in a 3rd Table
CREATE TABLE table3(
field1 varchar
,field3 varchar
)
The first 2 tables are reading data from a Kafka stream, and I am doing a join on these tables and inserting the data into the third table using the following query. The first 2 tables are unbounded.
INSERT INTO table3
(field1,
field3)
SELECT a.field1,
b.field3
FROM table1 a
JOIN table2 b
ON a.field1 = b.field1
I am getting the following error:
Caused by: java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
    at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:126)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:74)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple.apply(KeyedPCollectionTuple.java:107)
    at org.apache.beam.sdk.extensions.joinlibrary.Join.innerJoin(Join.java:59)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.standardJoin(BeamJoinRel.java:217)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.buildBeamPipeline(BeamJoinRel.java:161)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamProjectRel.buildBeamPipeline(BeamProjectRel.java:68)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamAggregationRel.buildBeamPipeline(BeamAggregationRel.java:80)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamIOSinkRel.buildBeamPipeline(BeamIOSinkRel.java:64)
    at org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:127)
    at com.dss.tss.v2.client.BeamSqlCli.compilePipeline(BeamSqlCli.java:95)
    at com.dss.test.v2.client.SQLCli.main(SQLCli.java:100)
This is a current implementation limitation of Beam SQL. You need to define windows and then join the inputs per-window.
A couple of examples of how to do joins and windowing in Beam SQL:
a complex SQL query with a HOP window and joins;
a test which defines a window in Java outside of SQL and then applies a query with a join (sketched below);
examples of other window functions can be found here.
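As a rough sketch of the second approach (define the window in Java, then run the SQL join over the windowed inputs): table1Rows and table2Rows below stand for your two unbounded PCollection<Row> inputs, the one-minute fixed window is arbitrary, and SqlTransform.query is the entry point in newer Beam releases (older releases used BeamSql.query):

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

// Put both unbounded inputs into the same fixed window...
PCollection<Row> windowedT1 = table1Rows.apply(
    Window.<Row>into(FixedWindows.of(Duration.standardMinutes(1))));
PCollection<Row> windowedT2 = table2Rows.apply(
    Window.<Row>into(FixedWindows.of(Duration.standardMinutes(1))));

// ...so the join is evaluated per window instead of over the global window.
PCollection<Row> joined = PCollectionTuple
    .of(new TupleTag<>("table1"), windowedT1)
    .and(new TupleTag<>("table2"), windowedT2)
    .apply(SqlTransform.query(
        "SELECT a.field1, b.field3 FROM table1 a JOIN table2 b ON a.field1 = b.field1"));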
Background
The problem is caused by the fact that it is hard to define a join operation for unbounded data streams in general; this is not limited to Beam SQL.
Imagine, for example, a data processing system that receives inputs from 2 sources and then has to match records between them. From a high-level perspective, such a system has to keep all the data it has seen so far, and then for each new record it has to go over all records in the second input source to see if there is a match. This works fine when you have finite and small data sources; in the simple case you could just load everything into memory, match the data from the sources, and produce output.
With streaming data you cannot keep caching it forever. What if the data never stops coming? And it is unclear when you should emit the results. What if you have an outer join operation: when do you decide that you don't have a matching record in the other input?
For example, see the explanation for unbounded PCollections in the GroupByKey section of the Beam guide. Joins in Beam are usually implemented on top of it using CoGroupByKey (Beam SQL joins as well).
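Purely as an illustration of that building block (not the code Beam SQL itself generates), a CoGroupByKey-based join of two collections keyed by the join key looks roughly like this; leftByKey and rightByKey are placeholder inputs:

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Tags identify which input each grouped value came from.
final TupleTag<String> leftTag = new TupleTag<>();
final TupleTag<Integer> rightTag = new TupleTag<>();

// leftByKey is a PCollection<KV<String, String>>, rightByKey a PCollection<KV<String, Integer>>.
PCollection<KV<String, CoGbkResult>> grouped = KeyedPCollectionTuple
    .of(leftTag, leftByKey)
    .and(rightTag, rightByKey)
    .apply(CoGroupByKey.create());
// For each key, the CoGbkResult holds the matching values from both inputs, which is exactly
// where the "keep everything until a match shows up" problem surfaces for unbounded streams.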
All of the questions above can probably be answered for a specific pipeline, but it is hard to solve them in the general case. The current approach in the Beam SDK and Beam SQL is to delegate this to the user to solve for their concrete business case. Beam lets users decide what data to aggregate together into a window, how long to wait for late data, and when to emit the results. There are also things like state cells and timers for more granular control. This allows a programmer writing a pipeline to explicitly define the behavior and work around these problems somewhat, with (a lot of) extra complexity.
Beam SQL is implemented on top of regular Beam SDK concepts and is bound by the same limitations, but it also has limitations of its own. For example, you don't have a SQL syntax to define triggers, state, or custom windows, and you cannot write a custom ParDo that could keep state in an external service.
I am implementing a project where MySQL data is imported to HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a DataFrame by inferring the schema and registering them as temp tables. I have a few questions about doing this...
1. There are several joins that need to be implemented for the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
Is there another way to join all the data frames effectively based on conditions?
2. Is it the correct way to convert the tables to data frames and query on top of them, or is there a better way to approach this type of joins and querying in Spark?
I had a similar problem and I ended up using:
import scala.collection.mutable.ListBuffer

val df_list = ListBuffer[DataFrame]()  // append each table's DataFrame to this buffer first
df_list.toList.reduce((a, b) =>
  a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
You could also write a free-form SQL statement in Sqoop and join everything there, or use Spark JDBC to do the same job.
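A minimal sketch of the Spark JDBC route; the connection URL, credentials, and table names below are placeholders to adapt, and priority/bills would be loaded the same way as accounts:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("mysql-joins").getOrCreate();

// Read one MySQL table over JDBC; repeat for the other tables (priority, bills, ...).
Dataset<Row> accounts = spark.read().format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb")
    .option("dbtable", "accounts")
    .option("user", "dbuser")
    .option("password", "dbpass")
    .load();

// Join the resulting DataFrames with explicit conditions instead of a SQL string.
Dataset<Row> joined = accounts
    .join(priority, accounts.col("id").equalTo(priority.col("id")))
    .join(bills, accounts.col("name").equalTo(bills.col("name")));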
I have a query like this:
Select A.table1.atr1, ... , B.table1.atr1
from A.table1
join B.table1 on (A.table1.atr1 = B.table1.atr2)
join B.table2 on (B.table1.atr2 = B.table2.atr2)
...(some similar joins)
join A.table2 on (A.table1.atr1 = A.table2.atr2)
where ...
A and B are JDBC data sources. I wonder how Teiid handles multiple joins on the same database: are they pushed down to the database? Is the join order between tables from A and B important? In my example I am using a join between A and B, then between B and B, and then between A and A. Do I need to rearrange the order or create 2 temporary tables on database A and database B?
If the database supports joins, yes, they can be pushed down. During query planning the Teiid optimizer checks the capabilities of the source and decides whether the join can be pushed down or needs to be processed in the Teiid engine. Based on that, it rewrites the query.