Data not syncing from MySQL to Elasticsearch after processing through Kafka - apache-kafka

We are trying to send data from MySQL to Elasticsearch (ETL) through Kafka.
In MySQL we have multiple tables which we need to aggregate into a specific format before we can send them to Elasticsearch.
For that we used Debezium to connect to MySQL and Elasticsearch, and transformed the data with KSQL.
We created streams for both tables, re-partitioned them, and created a table for one entity, but after joining we did not get data from both tables.
In short, we are trying to join two MySQL tables through KSQL and send the result to Elasticsearch using Debezium.
Table 1: items
Table 2: item_images
CREATE STREAM items_from_debezium (id integer, tenant_id integer, name string, sku string, barcode string, qty integer, type integer, archived integer)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.items',VALUE_FORMAT='json');
CREATE STREAM images_from_debezium (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.item_images',VALUE_FORMAT='json');
CREATE STREAM items_flat
WITH (KAFKA_TOPIC='ITEMS_REPART',VALUE_FORMAT='json',PARTITIONS=1) as SELECT * FROM items_from_debezium PARTITION BY id;
CREATE STREAM images_flat
WITH (KAFKA_TOPIC='IMAGES_REPART',VALUE_FORMAT='json',PARTITIONS=1) as SELECT * FROM images_from_debezium PARTITION BY item_id;
CREATE TABLE item_images (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='IMAGES_REPART',VALUE_FORMAT='json',KEY='item_id');
SELECT item_images.id,item_images.image,item_images.thumbnail,items_flat.id,items_flat.name,items_flat.sku,items_flat.barcode,items_flat.type,items_flat.archived,items_flat.qty
FROM items_flat left join item_images on items_flat.id=item_images.item_id
limit 10;
We expect data from both tables, but the columns coming from item_images are all null.
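One way to narrow this down (a hedged sketch, not a confirmed fix; topic, table and column names are the ones from the statements above) is to check that the repartitioned images topic is really keyed by item_id and that the item_images table holds rows before the item rows arrive:
-- The key printed on the left should be the item_id value.
PRINT 'IMAGES_REPART' FROM BEGINNING;
-- Confirm the table side is actually populated.
SELECT * FROM item_images LIMIT 5;
In a stream-table join, an items_flat row only picks up image columns if the matching item_images row was already in the table when the item row arrived, so re-sending an items row after the images are loaded is a useful test.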

Related

Create a table from a topic using a WHERE clause in KSQL

I'm using the latest version of the ksqlDB server, 0.29.2 I guess. I'm trying to create a table that reads from a specific topic which receives lots of events, but I'm only interested in specific ones. Each JSON event has a property named "evenType", so I want to continually filter the events and create a specific table to store the client data (phone number, email, etc.) and keep the client info up to date.
I created a stream called orders_inputs only for testing purposes, and then I tried to create this table, but I got this error:
create table orders(orderid varchar PRIMARY KEY, itemid varchar) WITH (KAFKA_TOPIC='ORDERS', PARTITIONS=1, REPLICAS=1) as select orderid, itemid from orders_inputs where type='t1';
line 1:120: mismatched input 'as' expecting ';'
Statement: create table orders(orderid varchar PRIMARY KEY, itemid varchar) WITH (KAFKA_TOPIC='ORDERS', PARTITIONS=1, REPLICAS=1) as select orderid, itemid from orders_inputs where type='t1';
Caused by: line 1:120: mismatched input 'as' expecting ';'
Caused by: org.antlr.v4.runtime.InputMismatchException
If you want to create a table that contains the results of a SELECT query over a stream, you can use CREATE TABLE AS SELECT:
https://docs.confluent.io/5.2.1/ksql/docs/developer-guide/create-a-table.html#create-a-ksql-table-with-streaming-query-results
e.g.
CREATE TABLE orders AS
SELECT orderid, itemid FROM orders_inputs
WHERE type='t1';
You can specify the primary key when creating the stream orders_inputs: https://docs.confluent.io/5.4.4/ksql/docs/developer-guide/syntax-reference.html#message-keys
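For example, a sketch in the newer syntax (ksqlDB 0.10+, which also covers the asker's 0.29.2) where the key column is declared directly in the schema; it assumes the topic is named orders_inputs and that orderid is stored in the record key:
CREATE STREAM orders_inputs (
orderid VARCHAR KEY,
itemid VARCHAR,
type VARCHAR
) WITH (KAFKA_TOPIC='orders_inputs', VALUE_FORMAT='JSON');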
Otherwise, you can specify the primary key when creating a table from a topic:
https://docs.confluent.io/5.2.1/ksql/docs/developer-guide/create-a-table.html#create-a-table-with-selected-columns
e.g.
CREATE TABLE orders
(orderid VARCHAR PRIMARY KEY,
itemid VARCHAR)
WITH (KAFKA_TOPIC = 'orders',
VALUE_FORMAT='JSON');
However, you would then have to query the table and filter on type='t1' yourself.
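e.g. (a sketch; it assumes a type VARCHAR column is also added to the table schema above, and uses the EMIT CHANGES push-query syntax of recent ksqlDB versions such as 0.29):
SELECT orderid, itemid
FROM orders
WHERE type='t1'
EMIT CHANGES;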

KSQL Persistent Query not writing data to KSQL Table

I have two KSQL tables, each having the same key. I am running the following query on them:
CREATE TABLE TEMP1 AS SELECT
b.MFG_DATE,
b.rowtime as bd_rowtime,
s.rowtime as sd_rowtime,
b.EXPIRY_DATE as EXP_DATE,
b.BATCH_NO as BATCH_NO,
s.rowkey as SD_ID
FROM GR_SD4 s
INNER JOIN GR_BD4 b ON b.rowkey = s.rowkey
PARTITION BY s.rowkey;
The resulting table does not get populated with data, but when I run the SELECT query on its own it does return data. I am confused about what could be the reason for the table not being populated.
The issue may be related to the PARTITION BY clause in your query. Since you are joining two tables, the resulting table will have a composite primary key (rowkey, s.rowkey). The PARTITION BY clause should be updated to reflect this, i.e. PARTITION BY rowkey, s.rowkey. This should ensure that the data is correctly partitioned and can be inserted into the table.
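As a first diagnostic (a hedged sketch, not a confirmed fix; it assumes the default sink topic name, which matches the table name TEMP1), it is worth confirming that the persistent query is actually running and that something reaches its sink topic:
-- List persistent queries and check the status of the one backing TEMP1.
SHOW QUERIES;
-- Inspect the changelog topic that CREATE TABLE TEMP1 AS ... writes to.
PRINT 'TEMP1' FROM BEGINNING;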

pgAdmin doesn't show user tables from YugabyteDB

I have installed YugabyteDB and created a local cluster using this command:
./bin/yugabyted start
The database is up and running. Then I created the keyspaces and tables by running the following command:
cqlsh -f resources/IoTData.cql
IoTData.cql contains the following:
// Create keyspace
CREATE KEYSPACE IF NOT EXISTS TrafficKeySpace;
// Create tables
CREATE TABLE IF NOT EXISTS TrafficKeySpace.Origin_Table (vehicleId text, routeId text, vehicleType text, longitude text, latitude text, timeStamp timestamp, speed double, fuelLevel double, PRIMARY KEY ((vehicleId), timeStamp)) WITH default_time_to_live = 3600;
CREATE TABLE IF NOT EXISTS TrafficKeySpace.Total_Traffic (routeId text, vehicleType text, totalCount bigint, timeStamp timestamp, recordDate text, PRIMARY KEY (routeId, recordDate, vehicleType));
CREATE TABLE IF NOT EXISTS TrafficKeySpace.Window_Traffic (routeId text, vehicleType text, totalCount bigint, timeStamp timestamp, recordDate text, PRIMARY KEY (routeId, recordDate, vehicleType));
CREATE TABLE IF NOT EXISTS TrafficKeySpace.Poi_Traffic(vehicleid text, vehicletype text, distance bigint, timeStamp timestamp, PRIMARY KEY (vehicleid));
// Select from the tables
SELECT count(*) FROM TrafficKeySpace.Origin_Table;
SELECT count(*) FROM TrafficKeySpace.Total_Traffic;
SELECT count(*) FROM TrafficKeySpace.Window_Traffic;
SELECT count(*) FROM TrafficKeySpace.Poi_Traffic;
// Truncate the tables
TRUNCATE TABLE TrafficKeySpace.Origin_Table;
TRUNCATE TABLE TrafficKeySpace.Total_Traffic;
TRUNCATE TABLE TrafficKeySpace.Window_Traffic;
TRUNCATE TABLE TrafficKeySpace.Poi_Traffic;
The YB-Master Admin UI shows me that the tables are created, but when I use the pgAdmin client to browse data from that database it doesn't show those tables.
In order to connect to YugabyteDB I used these properties:
database : yugabyte
user : yugabyte
password : yugabyte
host : localhost
port : 5433
Why doesn't the client show the tables I have created?
The reason is that the two layers can't interact with each other: the tables above were created through cqlsh, which talks to the YCQL API, while pgAdmin connects to the YSQL layer (port 5433). YSQL data/tables cannot be read from YCQL clients and vice versa.
This is also explained in the FAQ:
The YugabyteDB APIs are currently isolated and independent from one another. Data inserted or managed by one API cannot be queried by the other API. Additionally, Yugabyte does not provide a way to access the data across the APIs.
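A quick way to see the split in practice (a hedged sketch; it assumes the default local ports, 9042 for YCQL and 5433 for YSQL):
-- In cqlsh (YCQL) the keyspace and tables created above are visible:
DESCRIBE KEYSPACE TrafficKeySpace;
-- In ysqlsh or pgAdmin (YSQL, port 5433) the same tables simply do not exist:
\dt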

KSQL Windowed Aggregation Stream

I am trying to group events by one of their properties and over time using KSQL windowed aggregation, specifically the session window.
I have a STREAM created from a Kafka topic with the TIMESTAMP property specified.
When I try to create a STREAM with a Session Windowing with a query like:
CREATE STREAM SESSION_STREAM AS
SELECT ...
FROM EVENT_STREAM
WINDOW SESSION (5 MINUTES)
GROUP BY ...;
I always get the error:
Your SELECT query produces a TABLE. Please use CREATE TABLE AS SELECT statement instead.
Is it possible to create a STREAM with a Windowed Aggregation?
When I try, as suggested, to create a TABLE and then a STREAM that contains all the session-starting events, with a query like:
CREATE STREAM SESSION_START_STREAM AS
SELECT *
FROM SESSION_TABLE
WHERE WINDOWSTART=WINDOWEND;
KSQL informs me that:
KSQL does not support persistent queries on windowed tables
How can I create a STREAM of the events that start a session window in KSQL?
Your CREATE STREAM statement, if switched to a CREATE TABLE statement, will create a table that is constantly being updated. The sink topic SESSION_STREAM will contain the stream of changes to the table, i.e. its changelog.
ksqlDB models this as a TABLE because it has TABLE semantics, i.e. only a single row can exist in the table for any specific key. However, the changelog will contain the STREAM of changes that have been applied to the table.
If what you want is a topic containing all the sessions then something like this will create that:
-- create a stream with a new 'data' topic:
CREATE STREAM DATA (USER_ID INT)
WITH (kafka_topic='data', value_format='json');
-- create a table that tracks user interactions per session:
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS)
GROUP BY USER_ID;
This will create a SESSIONS topic that contains the changes to the SESSIONS table: i.e. its changelog.
If you want to convert this into a stream of session start events then, unfortunately, ksqlDB doesn't yet let you create a stream directly from the table, but you can create a stream over the table's changelog topic:
-- Create a stream over the existing `SESSIONS` topic.
-- Note it states the window_type is 'Session'.
CREATE STREAM SESSION_STREAM (ROWKEY INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
-- Create a stream of window start events:
CREATE STREAM SESSION_STARTS AS
SELECT * FROM SESSION_STREAM
WHERE WINDOWSTART = WINDOWEND;
Note, with the upcoming 0.10 release you'll be able to name the key column in the SESSION_STREAM correctly:
CREATE STREAM SESSION_STREAM (USER_ID INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
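To sanity-check the result (a hedged sketch; EMIT CHANGES is the push-query syntax required in recent ksqlDB releases), one can watch the session-start events as they are emitted:
SELECT * FROM SESSION_STARTS EMIT CHANGES LIMIT 5;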

Errors while saving Spark DataFrame to HBase using Apache Phoenix

I'm trying to save a JSON RDD into HBase using the Apache Phoenix Spark plugin: df.saveToPhoenix(tableName, zkUrl = Some(quorumAddress)). The table looks like:
CREATE TABLE IF NOT EXISTS person (
ID BIGINT NOT NULL PRIMARY KEY,
NAME VARCHAR,
SURNAME VARCHAR) SALT_BUCKETS = 40, COMPRESSION='GZ';
I have about 100,000-2,000,000 records in tables of this kind. Some of them save normally, but some fail with this error:
java.lang.RuntimeException: org.apache.phoenix.exception.PhoenixIOException:
callTimeout=1200000, callDuration=2902566: row 'PERSON' on table 'SYSTEM.CATALOG' at
region=SYSTEM.CATALOG,,1443172839381.a593d4dbac97863f897bca469e8bac66.,
hostname=hadoop-02,16020,1443292360474, seqNum=339
at org.apache.phoenix.mapreduce.PhoenixRecordWriter.close(PhoenixRecordWriter.java:62)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1043)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1294)
What could that mean? Are there any other ways to bulk-insert data from a DataFrame into HBase?