Error executing Apache BEAM sql query - Use a Window.into or Window.triggering transform prior to GroupByKey - apache-beam

How do I include a Window.into or Window.triggering transform prior to GroupByKey in Beam SQL?
I have the following 2 tables:
1st table
CREATE TABLE table1(
field1 varchar
,field2 varchar
)
2nd Table
CREATE TABLE table2(
field1 varchar
,field3 varchar
)
And I am writing the result in a 3rd Table
CREATE TABLE table3(
field1 varchar
,field3 varchar
)
The first 2 tables read data from a Kafka stream, and I am doing a join on them and inserting the result into the third table using the following query. The first 2 tables are unbounded.
INSERT INTO table3
(field1,
field3)
SELECT a.field1,
b.field3
FROM table1 a
JOIN table2 b
ON a.field1 = b.field1
I am getting the following error:
Caused by: java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
    at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
    at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:126)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey.expand(CoGroupByKey.java:74)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
    at org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple.apply(KeyedPCollectionTuple.java:107)
    at org.apache.beam.sdk.extensions.joinlibrary.Join.innerJoin(Join.java:59)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.standardJoin(BeamJoinRel.java:217)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamJoinRel.buildBeamPipeline(BeamJoinRel.java:161)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamProjectRel.buildBeamPipeline(BeamProjectRel.java:68)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamAggregationRel.buildBeamPipeline(BeamAggregationRel.java:80)
    at org.apache.beam.sdk.extensions.sql.impl.rel.BeamIOSinkRel.buildBeamPipeline(BeamIOSinkRel.java:64)
    at org.apache.beam.sdk.extensions.sql.impl.planner.BeamQueryPlanner.compileBeamPipeline(BeamQueryPlanner.java:127)
    at com.dss.tss.v2.client.BeamSqlCli.compilePipeline(BeamSqlCli.java:95)
    at com.dss.test.v2.client.SQLCli.main(SQLCli.java:100)

This is a current implementation limitation of Beam SQL. You need to define windows and then join the inputs per-window.
A couple of examples of how to do joins and windowing in Beam SQL (a minimal Java sketch of the second approach follows this list):
a complex SQL query with a HOP window and joins;
a test which defines a window in Java outside of SQL and then applies a query with a join;
examples of other window functions can be found here.
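As a rough illustration of the second approach, here is a minimal Java sketch, assuming the SqlTransform API of newer Beam releases (older releases used BeamSql.query); table1Rows and table2Rows are hypothetical placeholders for the Row collections read from Kafka:
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

// Window both unbounded inputs identically so the join runs per window
// instead of in the global window (which is what causes the error above).
PCollection<Row> t1Windowed = table1Rows.apply("WindowTable1",
    Window.<Row>into(FixedWindows.of(Duration.standardMinutes(5))));
PCollection<Row> t2Windowed = table2Rows.apply("WindowTable2",
    Window.<Row>into(FixedWindows.of(Duration.standardMinutes(5))));

// Register the windowed collections under the names used in the SQL text.
PCollection<Row> joined = PCollectionTuple
    .of(new TupleTag<>("table1"), t1Windowed)
    .and(new TupleTag<>("table2"), t2Windowed)
    .apply(SqlTransform.query(
        "SELECT a.field1, b.field3 FROM table1 a JOIN table2 b ON a.field1 = b.field1"));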
Background
The problem is caused by the fact that it's hard to define a join operation for unbounded data streams in general; this is not limited to Beam SQL.
Imagine, for example, a data processing system that receives inputs from 2 sources and then has to match records between them. From a high-level perspective, such a system has to keep all the data it has seen so far, and then for each new record it has to go over all records from the other input source to see if there's a match. This works fine when you have finite and small data sources. In the simple case you could just load everything into memory, match the data from the sources, and produce output.
With streaming data you cannot keep caching it forever. What if the data never stops coming? It is also unclear when you want to emit results. What if you have an outer join: when do you decide that you don't have a matching record in the other input?
For example, see the explanation for unbounded PCollections in the GroupByKey section of the Beam programming guide. Joins in Beam are usually implemented on top of it using CoGroupByKey (Beam SQL joins as well).
All of these questions can probably be answered for a specific pipeline, but it's hard to solve them in the general case. The current approach in the Beam SDK and Beam SQL is to delegate this to the user to solve for the concrete business case. Beam lets users decide what data to aggregate together into a window, how long to wait for late data, and when to emit the results. There are also features like state cells and timers for more granular control. This allows a programmer writing a pipeline to explicitly define the behavior and work around these problems, with (a lot of) extra complexity.
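For illustration, a minimal Java sketch of that kind of explicit control (window, trigger, allowed lateness) on one input; the trigger choice and durations are arbitrary, and table1Rows is the same hypothetical placeholder as above:
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

PCollection<Row> controlled = table1Rows.apply(
    Window.<Row>into(FixedWindows.of(Duration.standardMinutes(1)))
        // fire when the watermark passes the end of the window ...
        .triggering(AfterWatermark.pastEndOfWindow()
            // ... and fire again for records that arrive late
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        // keep window state around for 30 more minutes to accept late data
        .withAllowedLateness(Duration.standardMinutes(30))
        // late panes include everything seen so far, not just the new records
        .accumulatingFiredPanes());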
Beam SQL is implemented on top of regular Beam SDK concepts and is bound by the same limitations, but it has more limitations of its own. For example, there is no SQL syntax to define triggers, state, or custom windows, and you cannot write a custom ParDo that could keep state in an external service.

Related

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which have a significant number of their columns as numbers; these numbers correspond to a table with the real values. We are talking 9,500 different values (e.g. '502=yes' or '1413=Graduate Student').
Normally I would just add a WHERE clause and show where they are equal, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine, but I manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this, AND some need DECODE statements that would have 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible for you, so I'll leave that part aside.
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by doing it as regular joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (this may help the optimizer create a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license permits, you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option, if your lookup table is fairly static, is to cache the lookup table in the application. Read TEST_TABLE from the database and look up the descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified.
If you don't want to do all these joins you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if it is not found.
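For example, usage against the question's table might look like this (column names taken from the question; decoding several columns just means calling the function once per column):
SELECT TEST_ID
, lookup(TEST_STATUS) AS TEST_STATUS
, lookup(ANOTHER_STATUS) AS ANOTHER_STATUS
FROM TEST_TABLE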
With your mentioned 10k id/text pairs and an index on the ID field, this shouldn't be a performance issue, since that amount of data should easily be cached in the corresponding bufferpool.

Flink Table SQL API

I want to know whether we can write a query using two tables (a join) with the Flink Table and SQL API.
I am new to Flink. I want to create two tables from two different data sets, query them, and produce another data set.
My query would be like SELECT ... FROM table1, table2 ... So can we write a query like this, over two tables or more?
Thanks
Flink's Table API supports join operations (full, left, right, inner joins) on batch tables (e.g. those created from a DataSet).
SELECT c, g FROM Table3, Table5 WHERE b = e
For streaming tables (e.g. those created from a DataStream), Flink does not yet support join operations. But the Flink community is actively working to add them in the near future.
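As a rough sketch of the batch case in Java (API details vary considerably across Flink versions; the registerDataSet/sqlQuery calls, the table names, and the field names here are illustrative, along the lines of the legacy 1.x Table API):
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

// Two small in-memory data sets standing in for the real inputs
DataSet<Tuple3<Integer, Long, String>> ds1 = env.fromElements(
    Tuple3.of(1, 10L, "foo"), Tuple3.of(2, 20L, "bar"));
DataSet<Tuple2<Long, String>> ds2 = env.fromElements(
    Tuple2.of(10L, "x"), Tuple2.of(20L, "y"));

// Register them as tables with named fields, then join them in SQL
tEnv.registerDataSet("Table3", ds1, "a, b, c");
tEnv.registerDataSet("Table5", ds2, "e, g");
Table result = tEnv.sqlQuery("SELECT c, g FROM Table3, Table5 WHERE b = e");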

Execution Plan on a View looking at Partitioned Tables

I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always performed dynamic SQL to capture a start date and end date, find out which partitions those fall into, and then loop through each of those partitions... but it's starting to become such a hassle, and I've learned that this is probably not the best way to do it in terms of maintenance, troubleshooting, and performance.
I decided to build a view that would UNION ALL my sales partitions together. However, I don't want selecting from the view to scan all of the partitions on execution, as that would defeat the whole purpose of partitioning the tables. Because of this, I added check constraints on the date to each of my sales tables. This way, when I select from the view, it knows which tables to access instead of scanning every table.
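For reference, a minimal sketch of that setup, using the table names above (constraint names and date ranges are illustrative):
-- Each monthly table carries a CHECK constraint on its date range
ALTER TABLE factdailysales_201503 WITH CHECK
    ADD CONSTRAINT ck_factdailysales_201503_date
    CHECK ([Date] >= '20150301' AND [Date] < '20150401');
GO

-- The view simply UNION ALLs the monthly tables together
CREATE VIEW Sales_Orig AS
    SELECT * FROM factdailysales_201501
    UNION ALL
    SELECT * FROM factdailysales_201502
    UNION ALL
    SELECT * FROM factdailysales_201503;
GO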
Here are some example queries:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
This query has the execution plan of only pulling from the partitions that I need.
The problem I'm facing right now is that most of the time, when my team writes stored procedures, they will more than likely write their queries so that a date variable is passed into the WHERE clause.
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
However, when a variable is passed in, the execution plan now scans ALL of the partitions in the view, causing the query to take way longer than when I hard-coded the date.
I suppose I could do dynamic SQL again and insert the date string into the SELECT statement, but that would bring me back to where I started: trying to get rid of dynamic SQL for this simple sales query in the first place.
So my question is, am I setting this up wrong? Am I on the right track? It seems that the view can't take in a variable for the check constraint and ends up scanning every table. Is there another approach anyone would recommend? Maybe my original solution of just looping through partitions via dynamic SQL is the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay, this might work. It's a table-valued function that only accesses tables according to your @Start and @End parameters, so it only touches the "partitions" it needs. I figured you could take this concept and write some dynamic SQL to create all of the IF statements.
Now of course new tables are added every day, so how does that tie in? Well, I think the best way would be to alter the function every day, adding the next sales table. That way querying it stays simple, and you could use the same dynamic SQL you used to create the function to alter it, which should be relatively straightforward.
Note: I added default values that are the min and max of the DATE data type. That way you could query, for example, everything from 20140101 onward, or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (@Start DATE = '17530101', @End DATE = '99991231')
RETURNS @factTotalSales TABLE
(
    datesVal DATE
)
AS
BEGIN
    IF(CAST('20150101' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150101
    END
    IF(CAST('20150102' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150102
    END
    IF(CAST('20150103' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150103
    END
    RETURN;
END
GO
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
Is this the ideal solution? No. The ideal would be to combine all tables into one and have good indexes. I know you said that wouldn't work because of the way other code has been written. Hear me out. Now, perhaps this is off the wall, but let's say you do combine the tables; obviously there are old scripts looking for specific daily sales tables. Maybe you could create views with the dailySales names that access factTotalSales. Or you could create synonyms for factTotalSales that would correspond to each factDailySales table.
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed is a better way of doing it instead of forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gave you some ideas.
Five years later: OPTION (RECOMPILE).
The planner needs to have access to the constants to eliminate the table entirely from the query plan. With a variable, without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
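Applied to the variable-driven query from the question, that is simply (a sketch; the hint is the only change):
DECLARE @SD DATE = '2015-03-01';

SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
OPTION (RECOMPILE);  -- plan is compiled with @SD known, so unneeded partitions are eliminated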
While this means the query plan is larger, as it has to include all tables, it does not mean that all tables are actually scanned: look at the IO stats, as table-scan elimination occurs even if the scan shows up in the query plan.
The 'Number Of Executions' in the query plan will be 0 when the tables are not scanned: unfortunately, these branches are still reported as a non-zero percentage cost "Table Scan" node in the query plan & UI, which will appear high proportionally if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the actually used base tables increases.
This same optimization/elimination occurs when the view is not a Partitioned View (eg. base tables are missing partition column in PK), yet the underlying tables have a suitable Check Constraint on the filtered column. It also occurs when the view selects a constant value to establish the partition that is not otherwise stored in the table. With a constant in the query or recompiled plan the tables will be eliminated entirely. With a variable the tables will still not actually be scanned and thus eliminated logically during query execution.
The use of a proper Partitioned View is only really beneficial to allow direct Inserts & Updates, with the major caveat that it requires the partition column to be in each table's PK and disallows the use of an identity column (making a Partitioned View largely useless IMHO). SQL Server handles the optimizations very similarly for the other quasi-Partitioned View cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)

Cassandra efficient table walk

I'm currently working on a benchmark (which is part of my bachelor thesis) that compares SQL and NoSQL databases based on an abstract data model and abstract queries, to achieve a fair implementation on all systems.
I'm currently working on the implementation of a query that is specified as follows:
I have a table in Cassandra that is specified as follows:
CREATE TABLE allocated(
partition_key int,
financial_institution varchar,
primary_uuid uuid,
report_name varchar,
view_name varchar,
row_name varchar,
col_name varchar,
amount float,
PRIMARY KEY (partition_key, report_name, primary_uuid));
This table contains about 100,000,000 records (~300GB).
We now need to calculate the sum for the field "amount" for every possible combination of report_name, view_name, col_name and row_name.
In SQL this would be quite easy: just SELECT SUM(amount) and GROUP BY the fields you want.
However, since Cassandra does not support these operations (which is perfectly fine), I need to achieve this in another way.
Currently I achieve this by doing a full-table walk, processing each record and storing the sum in a HashMap in Java for each combination.
The prepared statement I use is as follows:
SELECT
partition_key,
financial_institution,
report_name,
view_name,
col_name,
row_name,
amount
FROM allocated;
That works partially on machines with lots of RAM for both Cassandra and the Java app, but crashes on smaller machines.
Now I'm wondering whether it's possible to achieve this in a faster way?
I could imagine using the partition_key, which also serves as the Cassandra partition key, and doing this for every partition (I have 5 of them).
I also thought of doing this multithreaded, by assigning every partition and report to a separate thread and running them in parallel. But I guess this would cause a lot of overhead on the application side.
Now to the actual question: Would you recommend another execution strategy to achieve this?
Maybe I still think too much in a SQL-like way.
Thank you for your support.
Here are two ideas that may help you.
1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.
First query (Q1):
select whatever_columns from allocated limit 1000;
Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk = sk_val and tk > tk_val limit 1000;
The above query will look for records with the same pk and sk, but with the next values of tk. Keep repeating as long as you keep getting 1000 records. When you get anything less, you drop the tk condition and use greater-than on sk. Here is the query (Q3):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk > sk_val limit 1000;
Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):
select whatever_columns from allocated where token(pk) > token(pk_val) limit 1000;
Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....
You are done when Q4 returns less than 1000.
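Mapped onto the allocated table from the question (pk = partition_key, sk = report_name, tk = primary_uuid), the page queries might look roughly like this; the ?s are bind markers and 1000 is just the example fetch size:
-- Q1: first page
SELECT partition_key, report_name, view_name, row_name, col_name, amount
FROM allocated LIMIT 1000;

-- Q2: same partition_key and report_name, later primary_uuid values
SELECT partition_key, report_name, view_name, row_name, col_name, amount
FROM allocated
WHERE token(partition_key) = token(?) AND report_name = ? AND primary_uuid > ?
LIMIT 1000;

-- Q3: same partition_key, later report_name values
SELECT partition_key, report_name, view_name, row_name, col_name, amount
FROM allocated
WHERE token(partition_key) = token(?) AND report_name > ?
LIMIT 1000;

-- Q4: later partition_key values
SELECT partition_key, report_name, view_name, row_name, col_name, amount
FROM allocated
WHERE token(partition_key) > token(?)
LIMIT 1000;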
2) I am assuming that 'report_name, view_name, col_name and row_name' are not unique, and that's why you maintain a HashMap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in Cassandra where the key is a combination of these four values (maybe delimited). If there were only three, you could simply have used a composite key for those three. Now, you also need a column called amounts, which is a list. As you scan the allocated table (using the approach above), for each row you do the following:
update amounts_table set amounts = amounts + [whatever_amount] where my_primary_key = four_col_values_delimited;
Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).
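A hypothetical sketch of that aggregate table and the list append (names and the delimiter are illustrative):
-- One row per report_name|view_name|row_name|col_name combination
CREATE TABLE amounts_table (
    group_key text PRIMARY KEY,
    amounts list<float>
);

-- Appending to a CQL list uses list syntax on the right-hand side
UPDATE amounts_table
SET amounts = amounts + [123.45]
WHERE group_key = 'reportA|view1|row7|colC';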
Sorry if my description is confusing. Please let me know if this helps.

MAX(), DISTINCT and group by in Cassandra

I am trying to remodel a SQL database in Cassandra such that I can find the Cassandra equivalent of the SQL queries. I use CQL 3 and Cassandra v1.2. I modeled the DB design in Cassandra so that it supports the ORDER BY clauses, and denormalized tables to support the join operation. However, I am at sea when it comes to DISTINCT, SUM() and GROUP BY equivalents.
SELECT a1,MAX(b1) FROM demo1 group by a1.
SELECT DISTINCT (a2) FROM demo2 where b2='sea'
SELECT sum(a3), sum(b3) from demo3 where c3='water' and d3='ocean'
This has been like a showstopper for my work for the past couple of days. Is there a way in Cassandra that I can model the DB schema to support queries of this kind? I can't think of any way in Cassandra. How can such queries be implemented using Cassandra?
I read that a Hive layer over Cassandra can possibly make these queries work. I am just wondering if that is the only way such queries can be supported in Cassandra. Please advise on any other possible methods.
With Cassandra you solve these kinds of problems by doing more work when you insert your data -- which sounds like it would be slow, but Cassandra is designed for fast writes, and you're probably going to read the data many more times than you write it so it makes sense when you consider the whole system.
I can't tell you exactly how to create your tables to model your problem because it will depend a lot on the details. You need to figure out a schema that lets you get the data without performing any on-the-fly aggregations. Think about how you would create views for the queries in an RDBMS, and then try to think how you would insert data directly into those views, not into the underlying tables. That's roughly how you model things in Cassandra.
Although this is an old question, it appears in Google search results pretty high. So I wanted to give an update.
Cassandra 2.2+ supports user-defined functions and user-defined aggregates. WARNING: this does not mean that you don't have to do data modeling anymore (as was pointed out by @Theo); rather, it just allows you to slightly preprocess your data upon retrieval.
SELECT DISTINCT (a2) FROM demo2 where b2='sea'
To implement DISTINCT, you should define a function and an aggregate. I'll call both the function and the aggregate uniq rather than distinct to emphasize the fact that they are user defined.
CREATE OR REPLACE FUNCTION uniq(state set<text>, val text)
CALLED ON NULL INPUT RETURNS set<text> LANGUAGE java
AS 'state.add(val); return state;';
CREATE OR REPLACE AGGREGATE uniq(text)
SFUNC uniq STYPE set<text> INITCOND {};
Then you use it as follows:
SELECT uniq(a2) FROM demo2 where b2='sea';
SELECT sum(a3), sum(b3) from demo3 where c3='water' and d3='ocean'
SUM is provided out of the box and works as you would expect. See system.sum.
SELECT a1,MAX(b1) FROM demo1 group by a1
GROUP BY is a tricky one. Actually, there is no way to group result rows by some column. But what you can do is to create a map<text, int> and to group them manually in the map. Based on an example from Christopher Batey's blog, group-by and max:
CREATE OR REPLACE FUNCTION state_group_and_max(state map<text, int>, type text, amount int)
CALLED ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java AS '
Integer val = (Integer) state.get(type);
if (val == null) val = amount; else val = Math.max(val, amount);
state.put(type, val);
return state;
' ;
CREATE OR REPLACE AGGREGATE state_group_and_max(text, int)
SFUNC state_group_and_max
STYPE map<text, int>
INITCOND {};
Then you use it as follows:
SELECT state_group_and_max(a1, b1) FROM demo1;
Notes
As mentioned above, you still have to invest some time in data modeling; don't overuse these features.
You have to set enable_user_defined_functions=true in your cassandra.yaml to enable the features
You can overload the functions to support grouping by columns of different types.
References:
Great UDF and UDA examples by Christopher Batey, and a few more
Datastax docs on UDF and UDA
User Defined Functions in Cassandra 3.0 (Planet Cassandra Blog)
Cassandra 3.10 now supports GROUP BY on partition key and clustering key. You can refer to this link for more detail.
Cassandra doesn't support operations like this. You can use something like Hive on top or there's a (non-free) product from Acunu that may do what you need.
The other solution is to do the work yourself. For example, you can sum things by reading in all the data from certain rows and summing. Or maintain a Cassandra counter to increment on the fly.
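A hypothetical sketch of the counter approach for the demo3 query (table and column names are illustrative; CQL counters are integers, so amounts would need to be scaled, e.g. to cents):
-- All non-key columns in a counter table must be counters
CREATE TABLE demo3_sums (
    c3 text,
    d3 text,
    a3_sum counter,
    b3_sum counter,
    PRIMARY KEY (c3, d3)
);

-- Each time a new demo3 row arrives, bump the running sums for its group
UPDATE demo3_sums
SET a3_sum = a3_sum + 500, b3_sum = b3_sum + 125
WHERE c3 = 'water' AND d3 = 'ocean';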