How can I query a map field in KSQL? - apache-kafka

I have created a stream from a topic in KSQL. The stream has the fields shown below. I can query the other fields, for example: select category from fake-data-119. I would like to know how I can get a single item from the map field, for example: status?
The data coming from the source look like this:
ProducerRecord(topic=fake-data-119, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value={"deviceId": 16, "category": "visibility sensors", "timeStamp": "Tue Jun 19 10:11:10 CEST 2018", "deviceProperties": {"visibility": "72", "status": "true"}}, timestamp=null)
ProducerRecord(topic=fake-data-119, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value={"deviceId": 6, "category": "fans", "timeStamp": "Tue Jun 19 10:11:11 CEST 2018", "deviceProperties": {"temperature": "22", "rotationSense": "1", "status": "false", "frequency": "56"}}, timestamp=null)
ProducerRecord(topic=fake-data-119, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value={"deviceId": 23, "category": "air quality monitors", "timeStamp": "Tue Jun 19 10:11:12 CEST 2018", "deviceProperties": {"coPpm": "136", "status": "false", "Co2Ppm": "450"}}, timestamp=null)
I am using the statement below to create the stream:
CREATE STREAM fakeData119 WITH (KAFKA_TOPIC='fake-data-119', VALUE_FORMAT='AVRO');
Field | Type
---------------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
DEVICEID | INTEGER
CATEGORY | VARCHAR(STRING)
TIMESTAMP | VARCHAR(STRING)
DEVICEPROPERTIES | MAP[VARCHAR(STRING),VARCHAR(STRING)]
---------------------------------------------------------
ksql> select * from fakeData119;
1529394182864 | null | 6 | fans | Tue Jun 19 09:43:02 CEST 2018 | {temperature=36, rotationSense=1, status=false, frequency=72}
1529394183869 | null | 5 | fans | Tue Jun 19 09:43:03 CEST 2018 | {temperature=23, rotationSense=1, status=true, frequency=76}
1529394184872 | null | 16 | visibility sensors | Tue Jun 19 09:43:04 CEST 2018 | {visibility=14, status=true}
1529394185875 | null | 25 | air quality monitors | Tue Jun 19 09:43:05 CEST 2018 | {coPpm=280, status=false, Co2Ppm=170}

You can get items in the map in the following way:
select deviceproperties['status'] from fakedata119
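For example, a minimal sketch (assuming the fakeData119 stream described above) that pulls a couple of entries out of the map; keys that are absent for a given record, such as temperature on a visibility sensor, should simply come back as null:
-- select individual entries from the DEVICEPROPERTIES map by key
SELECT deviceid,
       category,
       deviceproperties['status'] AS status,
       deviceproperties['temperature'] AS temperature
FROM fakedata119;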

Related

SQL Query to display Calculated fields on a year, monthly basis

I need help writing this SQL query (PostgreSQL) to display results in the form below:
--------------------------------------------------------------------------------
State | Jan '17 | Feb '17 | Mar '17 | Apr '17 | May '17 ... Dec '18
--------------------------------------------------------------------------------
Principal Outs. |700,839 |923,000 |953,000 |6532,293 | 789,000 ... 913,212
Disbursal Amount |23,000 |25,000 |23,992 | 23,627 | 25,374 ... 23,209
Interest |113,000 |235,000 |293,992 |322,627 |323,374 ... 267,209
There are multiple tables but I would be okay joining them.
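One common approach for this kind of month-by-month layout is conditional aggregation; here is a rough sketch under hypothetical names (a ledger table with metric, amount and txn_date columns), not your actual schema:
-- pivot months into columns with filtered aggregates (PostgreSQL 9.4+)
SELECT metric,
       sum(amount) FILTER (WHERE date_trunc('month', txn_date) = DATE '2017-01-01') AS "Jan '17",
       sum(amount) FILTER (WHERE date_trunc('month', txn_date) = DATE '2017-02-01') AS "Feb '17"
       -- ... repeat one filtered sum per month through Dec '18
FROM ledger
GROUP BY metric;
The same idea works with sum(CASE WHEN ... THEN amount END) on older PostgreSQL versions, or with the crosstab() function from the tablefunc extension.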

Cannot create stream in KSQL

I have the stream below and I want to create another stream from it. I am trying the command below and I am getting the following error. Am I missing something?
ksql> create stream down_devices_stream as select * from fakedata119 where deviceProperties['status']='false';
Failed to generate code for SqlPredicate.filterExpression: (FAKEDATA119.DEVICEPROPERTIES['status'] = 'false')schema:org.apache.kafka.connect.data.SchemaBuilder#6e18dbbfisWindowedKey:false
Caused by: Line 1, Column 180: Operator "<=" not allowed on reference operands
ksql> select * from fakedata119;
1529505497087 | null | 19 | visibility sensors | Wed Jun 20 16:38:17 CEST 2018 | {visibility=74, status=true}
1529505498087 | null | 7 | fans | Wed Jun 20 16:38:18 CEST 2018 | {temperature=44, rotationSense=1, status=false, frequency=49}
1529505499088 | null | 28 | air quality monitors | Wed Jun 20 16:38:19 CEST 2018 | {coPpm=257, status=false, Co2Ppm=134}
1529505500089 | null | 4 | fans | Wed Jun 20 16:38:20 CEST 2018 | {temperature=42, rotationSense=1, status=true, frequency=51}
1529505501089 | null | 23 | air quality monitors | Wed Jun 20 16:38:21 CEST 2018 | {coPpm=158, status=true, Co2Ppm=215}
ksql> describe fakedata119;
Field | Type
---------------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
DEVICEID | INTEGER
CATEGORY | VARCHAR(STRING)
TIMESTAMP | VARCHAR(STRING)
DEVICEPROPERTIES | MAP[VARCHAR(STRING),VARCHAR(STRING)]
Without seeing your input data, I have guessed that it looks something like this:
{
  "id": "a42",
  "category": "foo",
  "timestamp": "2018-06-21 10:04:57 BST",
  "deviceID": 42,
  "deviceProperties": {
    "status": "false",
    "foo": "bar"
  }
}
And if so, you are better off using EXTRACTJSONFIELD to access the nested values and build predicates.
CREATE STREAM test (Id VARCHAR, category VARCHAR, timeStamp VARCHAR, \
deviceID INTEGER, deviceProperties VARCHAR) \
WITH (KAFKA_TOPIC='test_map2', VALUE_FORMAT='JSON');
ksql> SELECT EXTRACTJSONFIELD(DEVICEPROPERTIES,'$.status') AS STATUS FROM fakeData223;
false
ksql> SELECT * FROM fakeData223 \
WHERE EXTRACTJSONFIELD(DEVICEPROPERTIES,'$.status')='false';
1529572405759 | null | a42 | foo | 2018-06-21 10:04:57 BST | 42 | {"status":"false","foo":"bar"}
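Putting that together, a sketch of the originally intended down_devices_stream using this workaround (assuming the test stream created above, where deviceProperties is just a VARCHAR holding the JSON) would be:
-- filter on the nested status field via EXTRACTJSONFIELD instead of map access
CREATE STREAM down_devices_stream AS \
  SELECT * FROM test \
  WHERE EXTRACTJSONFIELD(deviceProperties, '$.status') = 'false';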
I've logged the error you've found as a bug to track here: https://github.com/confluentinc/ksql/issues/1474
I've added a test to cover this use case:
https://github.com/confluentinc/ksql/pull/1476/files
Interestingly, this passes on our master and upcoming 5.0 branches, but fails on 4.1.
So... it looks like this is an issue in the version you're using, but the good news is that it's fixed in the upcoming release. Plus you can use Robin's workaround above for now.
Happy querying!
Andy

How to get the lag of a column in a Spark streaming dataframe?

I have data streaming into my Spark Scala application in this format
id mark1 mark2 mark3 time
uuid1 100 200 300 Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:58 PDT 2017
I have it read into columns id, mark1, mark2, mark3 and time. The time is converted to datetime format as well.
I want to group this by id and get the lag of mark1, which gives the previous row's mark1 value.
Something like this:
id mark1 mark2 mark3 prev_mark time
uuid1 100 200 300 null Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 100 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 null Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:58 PDT 2017
Consider the dataframe to be markDF. I have tried:
val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
which fails, saying that non-time windows cannot be applied on streaming/appending datasets/frames.
I have also tried:
val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
to get a window over a few rows, which did not work either. A streaming time window, something like:
window("timestamp", "10 minutes")
cannot be used as the window for lag. I am super confused about how to do this. Any help would be awesome!!
I would advise you to change the time column into a String, as below:
+-----+-----+-----+-----+----------------------------+
|id |mark1|mark2|mark3|time |
+-----+-----+-----+-----+----------------------------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+
root
|-- id: string (nullable = true)
|-- mark1: integer (nullable = false)
|-- mark2: integer (nullable = false)
|-- mark3: integer (nullable = false)
|-- time: string (nullable = true)
After that, doing the following should work:
df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
which will give you the output:
+-----+-----+-----+-----+----------------------------+---------+
|id |mark1|mark2|mark3|time |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|null |
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|100 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|null |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|150 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|150 |
+-----+-----+-----+-----+----------------------------+---------+
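If you prefer SQL, roughly the same thing can be expressed through Spark SQL after registering the dataframe as a temporary view (the view name mark_view below is hypothetical):
-- lag(mark1) over a per-id window ordered by time, equivalent to the withColumn version above
SELECT id, mark1, mark2, mark3, time,
       lag(mark1, 1) OVER (PARTITION BY id ORDER BY time) AS prev_mark
FROM mark_view
Note that ordering by the string time column sorts lexically, just as in the Scala version above, so converting it to a proper timestamp first would be safer if the rows can span different days or months.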

function to calculate aggregate sum count in postgresql

Is there a function that calculates the total count for the complete month, like below? I am not sure if Postgres has one. I am looking for the grand total value.
2012-08=# select date_trunc('day', time), count(distinct column) from table_name group by 1 order by 1;
date_trunc | count
---------------------+-------
2012-08-01 00:00:00 | 22
2012-08-02 00:00:00 | 34
2012-08-03 00:00:00 | 25
2012-08-04 00:00:00 | 30
2012-08-05 00:00:00 | 27
2012-08-06 00:00:00 | 31
2012-08-07 00:00:00 | 23
2012-08-08 00:00:00 | 28
2012-08-09 00:00:00 | 28
2012-08-10 00:00:00 | 28
2012-08-11 00:00:00 | 24
2012-08-12 00:00:00 | 36
2012-08-13 00:00:00 | 28
2012-08-14 00:00:00 | 23
2012-08-15 00:00:00 | 23
2012-08-16 00:00:00 | 30
2012-08-17 00:00:00 | 20
2012-08-18 00:00:00 | 30
2012-08-19 00:00:00 | 20
2012-08-20 00:00:00 | 24
2012-08-21 00:00:00 | 20
2012-08-22 00:00:00 | 17
2012-08-23 00:00:00 | 23
2012-08-24 00:00:00 | 25
2012-08-25 00:00:00 | 35
2012-08-26 00:00:00 | 18
2012-08-27 00:00:00 | 16
2012-08-28 00:00:00 | 11
2012-08-29 00:00:00 | 22
2012-08-30 00:00:00 | 26
2012-08-31 00:00:00 | 17
(31 rows)
--------------------------------
Total | 12345
As best I can guess from your question and comments, you want sub-totals of the distinct counts by month. You can't do this with group by date_trunc('month', time) because that'll do a count(distinct column) that's distinct across all days, not a sum of the per-day distinct counts.
For this you need a subquery or CTE:
WITH day_counts(day, day_col_count) AS (
  select date_trunc('day', time), count(distinct column)
  from table_name group by 1
)
SELECT 'Day', day, day_col_count
FROM day_counts
UNION ALL
SELECT 'Month', date_trunc('month', day), sum(day_col_count)
FROM day_counts
GROUP BY 2
ORDER BY 2;
My earlier guess before comments was: Group by month?
select date_trunc('month', time), count(distinct column)
from table_name
group by date_trunc('month', time)
order by time
Or are you trying to include running totals or subtotal lines? For running totals you need to use sum as a window function (see the sketch after the subtotal query below). Subtotals are just a pain, as SQL doesn't really lend itself to them; you need to UNION two queries and then wrap them in an outer ORDER BY.
select
  date_trunc('day', time)::text as "date",
  count(distinct column) as count
from table_name
group by 1
union
select
  'Total',
  count(distinct column)
from table_name
group by 1, date_trunc('month', time)
order by "date" = 'Total', 1

Mongodb replica set polluted logs and arbiter in "initial startup"

I'm running a replica set with MongoDB v2.0.3; here is the latest status:
+-----------------------------------------------------------------------------------------------------------------------+
| Member |id|Up| cctime |Last heartbeat|Votes|Priority| State | Messages | optime |skew|
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27002 |0 |1 |13 hrs |2 secs ago |1 |1 |PRIMARY | |4f619079:2|1 |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27003 |1 |1 |12 hrs |1 sec ago |1 |1 |SECONDARY | |4f619079:2|1 |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27001 |2 |1 |2.5e+02 hrs|1 sec ago |1 |0 |SECONDARY (hidden)| |4f619079:2|-1 |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27000 (me)|3 |1 |2.5e+02 hrs| |1 |1 |ARBITER |initial startup|0:0 | |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27004 |4 |1 |9.5 hrs |2 secs ago |1 |1 |SECONDARY | |4f619079:2|-1 |
+-----------------------------------------------------------------------------------------------------------------------+
I'm puzzled by the following:
1) The arbiter always reports the same message, "initial startup", and an "optime" of 0:0.
What does "initial startup" mean, and is it normal for this message not to change?
Why is the "optime" always "0:0"?
2) What information does the skew column convey?
I've set up my replicas according to MongoDB's documentation and data seems to replicate across the set nicely, so no problem with that.
Another thing is that the logs across all MongoDB hosts are polluted with entries like these:
Thu Mar 15 03:25:29 [initandlisten] connection accepted from 127.0.0.1:38599 #112781
Thu Mar 15 03:25:29 [conn112781] authenticate: { authenticate: 1, nonce: "99e2a4a5124541b9", user: "__system", key: "417d42d26643b2c2d014b89900700263" }
Thu Mar 15 03:25:32 [clientcursormon] mem (MB) res:12 virt:244 mapped:32
Thu Mar 15 03:25:34 [conn112779] end connection 127.0.0.1:38586
Thu Mar 15 03:25:34 [initandlisten] connection accepted from 127.0.0.1:38602 #112782
Thu Mar 15 03:25:34 [conn112782] authenticate: { authenticate: 1, nonce: "a021e521ac9e19bc", user: "__system", key: "14507310174c89cdab3b82decb52b47c" }
Thu Mar 15 03:25:36 [conn112778] end connection 127.0.0.1:38585
Thu Mar 15 03:25:36 [initandlisten] connection accepted from 127.0.0.1:38604 #112783
Thu Mar 15 03:25:37 [conn112783] authenticate: { authenticate: 1, nonce: "58bcf511e040b760", user: "__system", key: "24c5b20886f6d390d1ea8ea1c61fd109" }
Thu Mar 15 03:26:00 [conn112781] end connection 127.0.0.1:38599
Thu Mar 15 03:26:00 [initandlisten] connection accepted from 127.0.0.1:38615 #112784
Thu Mar 15 03:26:00 [conn112784] authenticate: { authenticate: 1, nonce: "8a8f24fe012a03fe", user: "__system", key: "9b0be0c7fc790021b25aeb4511d85848" }
Thu Mar 15 03:26:01 [conn112780] end connection 127.0.0.1:38598
Thu Mar 15 03:26:01 [initandlisten] connection accepted from 127.0.0.1:38616 #112785
Thu Mar 15 03:26:01 [conn112785] authenticate: { authenticate: 1, nonce: "420808aa9a12947", user: "__system", key: "90e8654a2eb3981219c370208989e97a" }
Thu Mar 15 03:26:04 [conn112782] end connection 127.0.0.1:38602
Thu Mar 15 03:26:04 [initandlisten] connection accepted from 127.0.0.1:38617 #112786
Thu Mar 15 03:26:04 [conn112786] authenticate: { authenticate: 1, nonce: "b46ac4868db60973", user: "__system", key: "43cda53cc503bce942040ba8d3c6c3b1" }
Thu Mar 15 03:26:09 [conn112783] end connection 127.0.0.1:38604
Thu Mar 15 03:26:09 [initandlisten] connection accepted from 127.0.0.1:38621 #112787
Thu Mar 15 03:26:10 [conn112787] authenticate: { authenticate: 1, nonce: "20fae7ed47cd1780", user: "__system", key: "f7b81c2d53ad48343e917e2db9125470" }
Thu Mar 15 03:26:30 [conn112784] end connection 127.0.0.1:38615
Thu Mar 15 03:26:30 [initandlisten] connection accepted from 127.0.0.1:38632 #112788
Thu Mar 15 03:26:31 [conn112788] authenticate: { authenticate: 1, nonce: "38ee5b7b665d26be", user: "__system", key: "49c1f9f4e3b5cf2bf05bfcbb939ee422" }
Thu Mar 15 03:26:33 [conn112785] end connection 127.0.0.1:38616
It seems like many connections are established and dropped. Is that the replica set heartbeat?
Additional information
Arbiter config
dbpath=/var/lib/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
port = 27000
bind_ip = 127.0.1.1
rest = true
journal = true
replSet = myreplname
keyFile = /etc/mongodb/set.key
oplogSize = 8
quiet = true
Replica set member config
dbpath=/root/local/var/mongodb
logpath=/root/local/var/log/mongodb.log
logappend=true
port = 27002
bind_ip = 127.0.1.1
rest = true
journal = true
replSet = myreplname
keyFile = /root/local/etc/set.key
quiet = true
The MongoDB instances are running on different machines and connect to each other over SSH tunnels set up in a fully connected mesh.
The arbiter doesn't do anything besides participate in elections, so it has no further operations after startup. "Skew" is clock skew in seconds between this member and the others in the set. Yes, the connect / disconnect messages are heartbeats.