With KSQL, why does my table keep data with older ROWTIME and discard updates with newer ROWTIME? - confluent-platform

I have a process that feeds relatively simple vehicle data into a Kafka topic. The records are keyed by registration, and the values contain things like latitude/longitude etc. plus a field called DateTime, which is a timestamp based on the sensor that took the readings (not the producer or the cluster).
My data arrives out of order in general, and especially so if I keep pumping the same test data set into the vehicle-update-log topic over and over. The data set contains two records for the vehicle I'm testing with.
My expectation is that when I do a select on the table, it will return one row with the most recent data, based on the ROWTIME of the records. (I've verified that ROWTIME is being set correctly.)
What happens instead is that the result contains both rows (for the same primary key), and the last row has the oldest ROWTIME.
I'm confused; I thought ksql would keep only the most recent update. Must I now write additional logic on the client side to pick the latest of the data I get?
I created the table like this:
CREATE TABLE vehicle_updates
(
Latitude DOUBLE,
Longitude DOUBLE,
DateTime BIGINT,
Registration STRING PRIMARY KEY
)
WITH
(
KAFKA_TOPIC = 'vehicle-update-log',
VALUE_FORMAT = 'JSON_SR',
TIMESTAMP = 'DateTime'
);
Here is my query:
SELECT
registration,
ROWTIME,
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS', 'Africa/Johannesburg') AS rowtime_formatted
FROM vehicle_updates
WHERE registration = 'BT66MVE'
EMIT CHANGES;
Results while no data is flowing:
+------------------------------+------------------------------+------------------------------+
|REGISTRATION |ROWTIME |ROWTIME_FORMATTED |
+------------------------------+------------------------------+------------------------------+
|BT66MVE |1631532052000 |2021-09-13 13:20:52.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
Here's the same query, but I'm pumping the data set into the topic again while the query is running. I'm surprised to be getting the older record as an update.
Results while feeding data:
+------------------------------+------------------------------+------------------------------+
|REGISTRATION |ROWTIME |ROWTIME_FORMATTED |
+------------------------------+------------------------------+------------------------------+
|BT66MVE |1631532052000 |2021-09-13 13:20:52.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
|BT66MVE |1631527147000 |2021-09-13 11:59:07.000 |
What gives?

In the end, it's an issue in Kafka Streams that is not easy to resolve: https://issues.apache.org/jira/browse/KAFKA-10493 (we are already working on a long-term solution for it, though).
While event-time-based processing is a central design pillar, there are some gaps that still need to be closed.
The underlying issue is that Kafka itself was originally designed around log-append order only. Timestamps were added later (in the 0.10 release). There are still some gaps today (e.g., https://issues.apache.org/jira/browse/KAFKA-7061) in which "offset order" is dominant. You are hitting one of those cases.
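If you need something in ksqlDB in the meantime, one partial workaround (just a sketch; the stream and table names below are my own, not from your setup) is to declare a stream over the same topic and aggregate it into a table that tracks the greatest sensor timestamp seen per registration. Consumers then have a per-key watermark to compare against, so they can discard stale updates themselves:
-- Sketch only: a stream over the same topic, plus a table that remembers
-- the greatest DateTime seen per registration.
CREATE STREAM vehicle_updates_stream
(
Registration STRING KEY,
Latitude DOUBLE,
Longitude DOUBLE,
DateTime BIGINT
)
WITH
(
KAFKA_TOPIC = 'vehicle-update-log',
VALUE_FORMAT = 'JSON_SR'
);

CREATE TABLE vehicle_latest_seen AS
SELECT
Registration,
MAX(DateTime) AS max_datetime
FROM vehicle_updates_stream
GROUP BY Registration;
This doesn't make the table itself timestamp-aware, but it avoids relying on offset order for the "what is the newest reading?" question.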

Related

Spark Delta Table Updates

I am working in a Microsoft Azure Databricks environment using Spark SQL and PySpark.
I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column, which is either NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process has another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to keep it simple, and from which the mappings are read.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing a not-null value). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. In addition to the valid-status records, the downstream job should also check whether a mapping is now available for the errored records and, if so, take those forward as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the Delta table/lake directly with the correct mapping and set the status column to "available_for_reprocessing", and then have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using Delta.
I was looking at https://docs.databricks.com/delta/delta-update.html and the update example there only shows a simple update with constant values, not updates driven by multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for the last, say, 30 days, get the mapping for the errored records, and write the dataframe back into the Delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at https://docs.databricks.com/delta/delta-update.html, the example provided is only for very simple updates. Without a unique key, it is also impossible to update specific records. Can someone please help?
I use PySpark, or can use Spark SQL, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records based on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.status = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)
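The same trick works for deletes, too. For example, something like this (again just a sketch, and rows_to_remove is a placeholder name):
DELETE FROM your_db.table_name AS a
WHERE EXISTS
(
SELECT *
FROM your_db.rows_to_remove AS b -- hypothetical table listing the records to delete
WHERE
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)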
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
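Applied to the scenario in the question, a sketch of the same pattern might look roughly like this (remapped_errors and the lookup/mapping column names are placeholders, not the real ones):
MERGE INTO your_db.table_name AS target
USING remapped_errors AS src -- hypothetical source: errored rows that now resolve against the refreshed mapping
ON target.lookup_column_a = src.lookup_column_a
AND target.lookup_column_b = src.lookup_column_b
AND target.status IS NOT NULL -- restrict the merge to the previously errored rows
WHEN MATCHED THEN UPDATE SET
target.status = 'available_for_reprocessing'
-- ...and set the newly resolved mapping columns from src here as well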

Hbase reading freezes for a few records when reading with partial rowkey

I am reading data from HBase through Spark. The code runs fine when reading the data using a prefix filter with a complete rowkey or using GET, but it freezes if I use a partial prefix of the rowkey. The rowkey structure is md5OfAkey_Akey_txDate_someKey. I want to read all data matching "Akeys" into a data frame. The table has a single column family, 50 column qualifiers, and around 200 million records. When I read using md5OfAkey_Akey_txDate the code gets stuck, while if I construct the whole key it runs fine. But I do not want to pass the whole rowkey, as I want to read all data for a particular account (Akey) and transaction date (txDate). Any help would be appreciated.

ksql - creating a stream from a json array

My kafka topic is pushing data in this format (coming from collectd):
[{"values":[100.000080140372],"dstypes":["derive"],"dsnames":["value"],"time":1529970061.145,"interval":10.000,"host":"k5.orch","plugin":"cpu","plugin_instance":"23","type":"cpu","type_instance":"idle","meta":{"network:received":true}}]
It's a combination of arrays, ints and floats... and the whole thing is inside a JSON array. As a result I'm having a heck of a time using KSQL to do anything with this data.
When I create a 'default' stream as
create stream cd_temp with (kafka_topic='ctd_test', value_format='json');
I get this result:
ksql> describe cd_temp;
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
-------------------------------------
Any select will return the ROWTIME and an 8-digit hex value for ROWKEY.
I've spent some time trying to extract the JSON fields, to no avail. What concerns me is this:
ksql> print 'ctd_test' from beginning;
Format:JSON
com.fasterxml.jackson.databind.node.ArrayNode cannot be cast to com.fasterxml.jackson.databind.node.ObjectNode
Is it possible that this topic can't be used in ksql? Is there a technique for unpacking the outer array to get to the interesting bits inside?
At the time of writing (June 2018), KSQL can't handle a JSON message where the whole thing is embedded inside a top-level array. There is a GitHub issue to track this. I'd suggest adding a +1 vote on that issue to bump its priority.
Also, I notice that your create stream statement is not defining the schema of the JSON message. While this won't help in this situation, it is something you'll need for other JSON input formats, i.e. your create statement should be something like:
create stream cd_temp (values ARRAY<DOUBLE>, dstypes ARRAY<VARCHAR>, etc) with (kafka_topic='ctd_test', value_format='json');

HBase - rowkey basics

NOTE: It's only been a few hours since I started with HBase, and I come from an RDBMS background :P
I have an RDBMS-like table CUSTOMERS with the following columns:
CUSTOMER_ID STRING
CUSTOMER_NAME STRING
CUSTOMER_EMAIL STRING
CUSTOMER_ADDRESS STRING
CUSTOMER_MOBILE STRING
I have thought of the following HBase equivalent :
table: CUSTOMERS
rowkey: CUSTOMER_ID
column family: CUSTOMER_INFO
columns: NAME, EMAIL, ADDRESS, MOBILE
From whatever I have read, a primary key in an RDBMS table is roughly similar to a HBase table's rowkey. Accordingly, I want to keep CUSTOMER_ID as the rowkey.
My questions are dumb and straightforward:
Irrespective of whether I use a shell command or the HBaseAdmin Java class, how do I define the rowkey? I didn't find anything to do it either in the shell or in the HBaseAdmin class (something like HBaseAdmin.createSuperKey(...)).
Given an HBase table, how do I determine the rowkey details, i.e. which values are used as the rowkey?
I understand that rowkey design is a critical thing. Suppose a customer id receives values like CUST_12345, CUST_34434 and so on; how will HBase use the rowkey to decide in which region particular rows reside (assuming the region concept is similar to DB horizontal partitioning)?
***Edited to add sample code snippet
I'm simply trying to create one row for the customer table using 'put' in the shell. I did this :
hbase(main):011:0> put 'CUSTOMERS', 'CUSTID12345', 'CUSTOMER_INFO:NAME','Omkar Joshi'
0 row(s) in 0.1030 seconds
hbase(main):012:0> scan 'CUSTOMERS'
ROW COLUMN+CELL
CUSTID12345 column=CUSTOMER_INFO:NAME, timestamp=1365600052104, value=Omkar Joshi
1 row(s) in 0.0500 seconds
hbase(main):013:0> put 'CUSTOMERS', 'CUSTID614', 'CUSTOMER_INFO:NAME','Prachi Shah', 'CUSTOMER_INFO:EMAIL','Prachi.Shah#lntinfotech.com'
ERROR: wrong number of arguments (6 for 5)
Here is some help for this command:
Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates. To put a cell value into table 't1' at
row 'r1' under column 'c1' marked with the time 'ts1', do:
hbase> put 't1', 'r1', 'c1', 'value', ts1
hbase(main):014:0> put 'CUSTOMERS', 'CUSTID12345', 'CUSTOMER_INFO:EMAIL','Omkar.Joshi#lntinfotech.com'
0 row(s) in 0.0160 seconds
hbase(main):015:0>
hbase(main):016:0* scan 'CUSTOMERS'
ROW COLUMN+CELL
CUSTID12345 column=CUSTOMER_INFO:EMAIL, timestamp=1365600369284, value=Omkar.Joshi#lntinfotech.com
CUSTID12345 column=CUSTOMER_INFO:NAME, timestamp=1365600052104, value=Omkar Joshi
1 row(s) in 0.0230 seconds
As put takes at most 5 arguments, I was not able to figure out how to insert the entire row in one put command. This results in incremental versions of the same row, which isn't required, and I'm not sure whether CUSTOMER_ID is being used as the rowkey!
Thanks and regards!
You don't; the key (and any other column, for that matter) is a byte array - you can put whatever you want there, even encapsulate sub-entities.
Not sure I understand that - each value is stored as key + column family + column qualifier + datetime + value - so the key is there.
HBase figures out which region a record will go to as it goes. When regions get too big, it splits them. Also, from time to time, when there's too much junk, HBase performs compactions to rearrange the files. You can control that by pre-splitting the table yourself, which is something you should definitely think about in the future. However, since it seems you are just starting out with HBase, you can begin by letting HBase take care of that. Once you understand your usage patterns and data better, you will probably want to revisit this.
You can read/hear a little about HBase schema design here and here

hbase rowkey design

I am moving from MySQL to HBase due to increasing data.
I am designing a rowkey for an efficient access pattern.
I want to achieve 3 goals.
Get all results of email address
Get all results of email address + item_type
Get all results of a particular email address + item_id
I have 4 attributes to choose from
user email
reverse timestamp
item_type
item_id
What should my rowkey look like to get rows efficiently?
Thanks
Assuming your main access is by email, you can have your main table key as
email + reverse time + item_id (assuming item_id gives you uniqueness)
You can have an additional "index" table with email+item_type+reverse time+item_id and email+item_id as keys that map to the first table (so retrieving by these is a two-step process).
Maybe you are already headed in the right direction as far as concatenated row keys go; in any case, the following comes to mind from your post:
The partitioning key likely consists of your reverse timestamp plus the most frequently queried natural key - would that be the email? Let us suppose so: then choose the prefix based on which of the two (reverse timestamp vs. email) provides the most balanced / non-skewed distribution of your data. That makes your region servers happier.
Choose based on the better-balanced distribution of records: reverse timestamp plus the most frequently queried natural key, e.g. reversetimestamp-email or email-reversetimestamp. In that manner you will avoid hotspotting on your region servers.
As for good performance on the additional (secondary) indexes: that is not "baked into" HBase yet; there is a design doc for it (look under SecondaryIndexing in the wiki).
But you can build your own in a couple of ways:
a) use a coprocessor to write the item_type as the rowkey to a separate table, with a column containing the original user_email-reverse timestamp (or vice versa) fact table rowkey
b) if disk space is not an issue and/or the rows are small, just go ahead and duplicate the entire row in the second (and third, for the item_id case) tables.