I have an issue with ksqlDB: when I try a pull query, all I get is "Query terminated" with no results. If I run it as a push query it works, but what I want is the current "state" of the TABLE.
the table is created off a stream:
create table MY_TABLE as
  select metadata->"DATA1", metadata->"DATA2", count(metadata->"DATA2") as DATA2s
  from My_STREAM
  where metadata->"DATA2" = 'key'
  group by metadata->"DATA1";
and if I do a
select * from MY_TABLE;
the result is
query terminated
if I do
select * from MY_TABLE EMIT CHANGES;
I get the desired output, except I only want the current "state". What am I missing?
The versions of ksql and cli are:
CLI v0.25.1, Server v0.25.1
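For reference, a pull query is normally issued against the table's key, which here is the GROUP BY column. A hedged sketch, assuming the key column ends up named DATA1 and 'some-value' is an existing key:

-- keyed pull query: returns the current state for a single key
SELECT * FROM MY_TABLE WHERE DATA1 = 'some-value';

-- an unkeyed pull query is a full table scan; depending on the server config,
-- it may first need table scans enabled (assumption: defaults not overridden)
SET 'ksql.query.pull.table.scan.enabled'='true';
SELECT * FROM MY_TABLE;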
I have the following pipeline with a range of activities (pipeline diagram omitted).
I keep getting the following error from my Lookup activity:
Failure happened on 'Source' side.
ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A database operation failed with the following error: 'Invalid column name 'updated_at'.',Source=,''Type=System.Data.SqlClient.SqlException,Message=Invalid column name 'updated_at'.,Source=.Net SqlClient Data Provider,SqlErrorNumber=207,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=207,State=1,Message=Invalid column name 'updated_at'.,},],'
I roughly know what the problem is: the Lookup isn't looping through the individual tables to find the column name 'updated_at'. But I don't understand why.
The Lookup activity 'Lookup New Watermark' has the following query:
SELECT MAX(updated_at) as NewWatermarkvalue FROM @{item().Table_Name}
The ForEach activity 'For Each Table' has the following Items value:
@activity('Find SalesDB Tables').output.value
The Lookup activity 'Find SalesDB Tables' has the following query:
SELECT QUOTENAME(table_schema)+'.'+QUOTENAME(table_name) AS Table_Name FROM information_Schema.tables WHERE table_name not in ('watermarktable', 'database_firewall_rules')
The only thing I can see that is wrong with the 'Lookup New Watermark' activity is that it's not looping through the tables. Can someone let me know what is needed?
Just to show the column exists, I adjusted the connection to point directly at a single table (screenshots omitted). The Lookup was then able to find the updated_at column on dbo.Products, but couldn't locate the updated_at column on the other 4 tables.
Therefore, I suspect the problem is that the Lookup activity isn't iterating over the tables automatically.
The error occurs when the following query is used on a table that does not have an updated_at column:
SELECT MAX(updated_at) as NewWatermarkvalue FROM @{item().Table_Name}
The Items field in the ForEach activity was given the value @activity('FindSalesDBTables').output.value (which returns a list of table names). Inside the ForEach, the above query is executed once per table:
-- first iteration
SELECT MAX(updated_at) as NewWatermarkvalue FROM <table_1>
-- second iteration
SELECT MAX(updated_at) as NewWatermarkvalue FROM <table_2>
...
When the query runs against a table that does not have an updated_at column, it fails with this error. The following is a demonstration.
I created 2 tables for the demonstration, called t1 and t2:
create table t1(id int, updated_at int)
create table t2(id int, up int)
I used a Lookup activity to get the list of table names using the following query:
SELECT QUOTENAME(table_schema)+'.'+QUOTENAME(table_name) AS Table_Name FROM information_Schema.tables WHERE table_name not in ('watermarktable', 'database_firewall_rules','ipv6_database_firewall_rules')
Inside the ForEach activity (looping through @activity('lookup1').output.value), I tried the same query as given:
SELECT MAX(updated_at) as NewWatermarkvalue FROM @{item().Table_Name}
After debugging the pipeline, we can observe that it produces the same error: the iteration for t1 (which has the updated_at column) succeeds, while the iteration for t2 (which does not) fails with the 'Invalid column name' error (debug output omitted). If you publish and run this pipeline, it will fail with the same error.
Therefore, check whether the updated_at column exists in the particular table (the current ForEach item) before querying it.
Inside the ForEach, use a Lookup with the following query. It returns the length of the column in bytes if the column exists in the table, and NULL otherwise. Use this result along with an If Condition activity.
select COL_LENGTH('@{item().Table_Name}','updated_at') as column_exists
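For example, against the demo tables above (updated_at in t1 is an int, which is 4 bytes), the check behaves like this:

select COL_LENGTH('[dbo].[t1]','updated_at') as column_exists  -- returns 4
select COL_LENGTH('[dbo].[t2]','updated_at') as column_exists  -- returns NULL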
Use the following condition in the If Condition activity. If it returns false, the particular table contains the updated_at column and we can work with it:
@equals(activity('check for column in table').output.firstRow['column_exists'],null)
(The debug output for the t1 and t2 iterations confirms this; screenshots omitted.)
You can then continue with the other required activities inside the False branch of the If Condition activity, using the above process.
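Alternatively, the check can be folded into the watermark query itself so a single Lookup handles both cases. A minimal T-SQL sketch, assuming the same @{item().Table_Name} interpolation; the dynamic SQL is only compiled when the IF branch is taken, so tables without the column no longer raise 'Invalid column name':

IF COL_LENGTH('@{item().Table_Name}', 'updated_at') IS NOT NULL
    -- column exists: compute the new watermark
    EXEC('SELECT MAX(updated_at) AS NewWatermarkvalue FROM @{item().Table_Name}')
ELSE
    -- column missing: return a NULL watermark instead of failing
    SELECT NULL AS NewWatermarkvalue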
When I use Flink SQL to execute the following statement, the error is reported as follows:
Request
Group the data in user_behavior_kafka_table by the user_id field, then take the row with the largest ts value in each group.
Executed SQL
SELECT user_id,item_id,ts FROM user_behavior_kafka_table AS a
WHERE ts = (select max(b.ts)
FROM user_behavior_kafka_table AS b
WHERE a.user_id = b.user_id );
Flink version
1.11.2
Error message
AppendStreamTableSink doesn't support consuming update changes which is produced by node Join(joinType=[InnerJoin], where=[((user_id = user_id0) AND (ts = EXPR$0))], select=[user_id, item_id, ts, user_id0, EXPR$0], leftInputSpec=[NoUniqueKey], rightInputSpec=[JoinKeyContainsUniqueKey])
Job deployment
On YARN
Table data
user_behavior_kafka_table data, consumed from the Kafka topic:
{"user_id":"aaa","item_id":"11-222-333","comment":"aaa access item at","ts":100}
{"user_id":"ccc","item_id":"11-222-334","comment":"ccc access item at","ts":200}
{"user_id":"ccc","item_id":"11-222-333","comment":"ccc access item at","ts":300}
{"user_id":"bbb","item_id":"11-222-334","comment":"bbb access item at","ts":200}
{"user_id":"aaa","item_id":"11-222-333","comment":"aaa access item at","ts":200}
{"user_id":"aaa","item_id":"11-222-334","comment":"aaa access item at","ts":400}
{"user_id":"ccc","item_id":"11-222-333","comment":"ccc access item at","ts":400}
{"user_id":"vvv","item_id":"11-222-334","comment":"vvv access item at","ts":200}
{"user_id":"bbb","item_id":"11-222-333","comment":"bbb access item at","ts":300}
{"user_id":"aaa","item_id":"11-222-334","comment":"aaa access item at","ts":300}
{"user_id":"ccc","item_id":"11-222-333","comment":"ccc access item at","ts":100}
{"user_id":"bbb","item_id":"11-222-334","comment":"bbb access item at","ts":100}
Expected result in user_behavior_hive_table:
{"user_id":"aaa","item_id":"11-222-334","comment":"aaa access item at","ts":400}
{"user_id":"bbb","item_id":"11-222-333","comment":"bbb access item at","ts":300}
{"user_id":"ccc","item_id":"11-222-333","comment":"ccc access item at","ts":400}
{"user_id":"vvv","item_id":"11-222-334","comment":"vvv access item at","ts":200}
To get the results you expect from that query, it needs to be executed in batch mode. As a streaming query, it's something the Flink SQL planner can't cope with; and even if it could, it would produce a stream of results in which the last result for each user_id matches the expected output, preceded by additional, intermediate results.
For example, for user aaa, these results would appear:
aaa 11-222-333 100
aaa 11-222-333 200
aaa 11-222-334 400
but the row where ts=300 would be skipped, since it was never the row with the max value for ts.
If you want to make this work in streaming mode, try reformulating it as a top-n query:
SELECT user_id, item_id, ts
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS row_num
FROM user_behavior_kafka_table)
WHERE row_num = 1;
I believe this should work, but I'm not in a position to easily test it.
Trying to get a list of COPY commands run on a particular date and the tables that were updated for each COPY command.
Working with this query:
select
slc.query as query_id,
trim(slc.filename) as file,
slc.curtime as updated,
slc.lines_scanned as rows,
sq.querytxt as querytxt
from stl_load_commits slc
join stl_query sq on sq.query = slc.query
where trunc(slc.curtime) = '2020-05-07';
How can we get the table that was updated for each COPY command? Maybe using a Redshift RegEx function on querytxt? Or joining to another system table to find the table id or name?
This regex will extract the table (or schema.table) name from stl_query.querytxt:
select
slc.query as query_id,
trim(slc.filename) as file,
slc.curtime as updated,
slc.lines_scanned as rows,
sq.querytxt as querytxt,
REGEXP_REPLACE(LOWER(sq.querytxt), '^copy (analyze )?(\\S+).*$', '$2') AS t
from stl_load_commits slc
join stl_query sq on sq.query = slc.query
where trunc(updated) = '2020-05-07';
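For example, applied to a made-up querytxt, the pattern keeps only the second capture group (the copy target); the optional 'analyze ' group presumably accounts for Redshift's internal COPY ANALYZE statements:

select REGEXP_REPLACE(
         LOWER('COPY myschema.orders FROM ''s3://bucket/data.csv'' IAM_ROLE ''arn:aws:iam::123456789012:role/load'''),
         '^copy (analyze )?(\\S+).*$', '$2') as t;
-- t -> 'myschema.orders'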
I've got a problem with creating a table or stream in KSQL.
I've done everything as shown in the official examples and I don't get why my code doesn't work.
Example from https://docs.confluent.io/current/ksql/docs/tutorials/examples.html#joining :
CREATE TABLE pageviews_per_region_per_session AS
SELECT regionid,
windowStart(),
windowEnd(),
count(*)
FROM pageviews_enriched
WINDOW SESSION (60 SECONDS)
GROUP BY regionid;
NOW MY CODE:
I've tried to run the SELECT in the command prompt and it WORKS WELL:
SELECT count(*) as attempts_count, "computer", (WINDOWSTART() / 1000) as row_time
FROM LOG_FLATTENED
WINDOW TUMBLING (SIZE 20 SECONDS)
WHERE "event_id" = 4625
GROUP BY "computer"
HAVING count(*) > 2;
But when I try to create the table based on this select (from ksql command-line tool):
CREATE TABLE `incorrect_logins` AS
SELECT count(*) as attempts_count, "computer", (WINDOWSTART() / 1000) as row_time
FROM LOG_FLATTENED
WINDOW TUMBLING (SIZE 20 SECONDS)
WHERE "event_id" = 4625
GROUP BY "computer"
HAVING count(*) > 2;
I GET AN ERROR: io.confluent.ksql.util.KsqlStatementException: Column COMPUTER cannot be resolved. But this column exists, and the SELECT without the CREATE TABLE statement works perfectly.
I'm using the latest stable KSQL image (confluentinc/cp-ksql-server:5.3.1)
First of all, I apologize for my bad English; if anything I say isn't clear enough, don't hesitate to reply and I'll try to explain it better.
I don't know a lot of KSQL, but I'll try to help you, based on my experience creating STREAMs like your TABLE.
1) As you probably know, KSQL processes everything as uppercase unless you specify otherwise.
2) KSQL doesn't support double quotes in a SELECT inside a CREATE query; in fact, KSQL will ignore these characters and handle your field as an uppercase column. That's why the error shows COMPUTER and not "computer".
A workaround for this issue is the following.
First, create an empty table with the lowercase fields:
CREATE TABLE "incorrect_logins" ("attempts_count" INTEGER, "computer" VARCHAR, "row_time" INTEGER) WITH (KAFKA_TOPIC='topic_that_you_want', VALUE_FORMAT='avro')
(If the topic doesn't exist, you'll have to create it first.)
Once the table has been created, you can insert data into it using your SELECT query:
INSERT INTO "incorrect_logins" SELECT count(*) as "attempts_count", "computer", (WINDOWSTART() / 1000) as "row_time"
FROM LOG_FLATTENED
WINDOW TUMBLING (SIZE 20 SECONDS)
WHERE "event_id" = 4625
GROUP BY "computer"
HAVING count(*) > 2;
Hope it helps you!
I get an error in Confluent 5.0.0.
ksql>CREATE TABLE order_per_hour AS SELECT after->order_id,count(*) FROM transaction WINDOW SESSION(60 seconds) GROUP BY after->order_id;
The error is simply: name is null
after is a struct field in the schema.
A simple SELECT query without GROUP BY works fine.
I've submitted a PR to add support for this to KSQL: https://github.com/confluentinc/ksql/pull/2076
Hope this helps,
Andy
Currently you can only use column names in the GROUP BY clause. As a workaround, you can write your query as follows:
CREATE STREAM foo AS SELECT after->order_id as o_id FROM transaction;
CREATE TABLE order_per_hour AS SELECT o_id,count(*) FROM foo WINDOW SESSION(60 seconds) GROUP BY o_id;
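For what it's worth, once the PR above landed, later versions accept the expression directly in the GROUP BY; a hedged sketch, assuming a version that includes it:

CREATE TABLE order_per_hour AS
  SELECT after->order_id AS order_id, count(*) AS order_count
  FROM transaction
  WINDOW SESSION (60 SECONDS)
  GROUP BY after->order_id;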