Spark SQL group data by range and trigger alerts - pyspark

I am processing a data stream from Kafka using Structured Streaming with PySpark. I want to publish alerts to Kafka, in Avro format, when the readings are abnormal.
source temperature timestamp
1001 21 4/28/2019 10:25
1001 22 4/28/2019 10:26
1001 23 4/28/2019 10:27
1001 24 4/28/2019 10:28
1001 25 4/28/2019 10:29
1001 34 4/28/2019 10:30
1001 37 4/28/2019 10:31
1001 36 4/28/2019 10:32
1001 38 4/28/2019 10:33
1001 40 4/28/2019 10:34
1001 41 4/28/2019 10:35
1001 42 4/28/2019 10:36
1001 45 4/28/2019 10:37
1001 47 4/28/2019 10:38
1001 50 4/28/2019 10:39
1001 41 4/28/2019 10:40
1001 42 4/28/2019 10:41
1001 45 4/28/2019 10:42
1001 47 4/28/2019 10:43
1001 50 4/28/2019 10:44
Transform
source range count alert
1001 21-25 5 HIGH
1001 26-30 5 MEDIUM
1001 40-45 5 MEDIUM
1001 45-50 5 HIGH
I have defined a 20-second window with a 1-second slide. I am able to publish alerts with a simple where condition, but I am not able to transform the data frame as shown above and trigger an alert when the count is 20 for any alert priority (that is, when all records in a window match one priority, e.g. HIGH -> count(20)). Can anyone suggest how to do this?
Also, I am able to publish data in JSON format but not in Avro. Scala and Java have a to_avro() function, but PySpark does not have one.
Appreciate your response

I was able to solve this problem using the Bucketizer feature transformer from Spark's ML library.
How to bin in PySpark?
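The binning idea behind that answer can be sketched in plain Python (this is not the Spark API, only an illustration of the logic). The bin edges in SPLITS, the labels in LABELS, and the helper names priority and window_alert are assumptions made up for this example; bucket i covers [SPLITS[i], SPLITS[i+1]), which is how Bucketizer treats its splits.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges and labels, chosen only for illustration.
# Bucket i covers the half-open range [SPLITS[i], SPLITS[i + 1]).
SPLITS = [float("-inf"), 30, 45, float("inf")]
LABELS = ["NORMAL", "MEDIUM", "HIGH"]


def priority(temperature):
    """Map a temperature reading to the label of the bucket it falls in."""
    index = bisect_right(SPLITS, temperature) - 1
    # Clamp so that a value equal to the last edge lands in the last bucket.
    return LABELS[min(index, len(LABELS) - 1)]


def window_alert(temperatures):
    """Return the priority shared by every reading in the window, else None.

    An alert fires only when the count for one non-NORMAL priority
    equals the number of readings in the window.
    """
    if not temperatures:
        return None
    counts = Counter(priority(t) for t in temperatures)
    for label, count in counts.items():
        if label != "NORMAL" and count == len(temperatures):
            return label
    return None


# Readings taken from the sample data in the question.
print(window_alert([34, 37, 36, 38, 40]))  # every reading is in [30, 45)
print(window_alert([45, 47, 50, 41, 42]))  # mixed priorities, so no alert
```

In Spark the same shape would presumably be a Bucketizer stage followed by a windowed count per source and bucket, with a filter on the count.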

Related

What is the schema of Kerberos database?

I dumped a kerberos database with
$ kdb5_util dump /User/user/kerberos/dbdump
In the file, each line has information of principals as
princ 38 23 4 3 0 PRINCIPAL#REALM 4224 86400 604800 0 0 0 1618454069 0 3 40 XXX 2 25 XXX 8 2 0100 1 4 XXX 1 7 18 62 XXX 1 7 17 46 XXX 1 7 16 54 XXX -1;
However, I cannot figure out what each column means.
I want to find locked principals from this database.
How can I get the schema of a dumped kerberos database?
I emailed the Kerberos maintainers, and they replied that documenting the full dump file format is on their to-do list.
Instead, they suggested using tabdump, which supports various dump types: https://web.mit.edu/kerberos/krb5-devel/doc/admin/admin_commands/kdb5_util.html#tabdump

What does consumer LAG mean in Consumer Group

I'm observing that my Kafka consumer intermittently fails to receive messages that the producer sends. When I checked the consumer group, I saw non-zero LAG values:
docker run --net=host --rm <docker image> kafka-consumer-groups --zookeeper localhost:2181 --describe --group mgmt_testing
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
mgmt_testing mgmt_testing 0 44 44 0 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
mgmt_testing mgmt_testing 1 35 35 0 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
mgmt_testing mgmt_testing 2 39 39 0 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
mgmt_testing mgmt_testing 3 37 37 0 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
mgmt_testing mgmt_testing 4 25 38 13 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
mgmt_testing mgmt_testing 5 458 666 208 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
mgmt_testing mgmt_testing 6 808167 808181 14 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
mgmt_testing mgmt_testing 7 434028 434041 13 mgmt_testing_aws-us-east-1-mr3-10-10-8-218-1561090200381-21858516-0
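What does LAG mean here? And could this be the reason that the consumer is not able to receive the messages?
Lag is how far the consumer group is behind the end of the log: for each partition it is LOG-END-OFFSET minus CURRENT-OFFSET, i.e. the number of messages that have been produced but not yet consumed by the group. There will always be some delay between publishing a message to a Kafka broker and consuming it, so a small, transient lag is normal; a lag that keeps growing means the consumer is not keeping up. There's a good description on Sematext's website: https://sematext.com/blog/kafka-consumer-lag-offsets-monitoring/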
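In the kafka-consumer-groups output above, the LAG column is LOG-END-OFFSET minus CURRENT-OFFSET for each partition. A minimal Python sketch that recomputes it from the numbers in that output (the variable and function names are made up for this example):

```python
# (partition, current_offset, log_end_offset) copied from the output above.
rows = [
    (0, 44, 44),
    (1, 35, 35),
    (2, 39, 39),
    (3, 37, 37),
    (4, 25, 38),
    (5, 458, 666),
    (6, 808167, 808181),
    (7, 434028, 434041),
]


def lag(current_offset, log_end_offset):
    """Messages written to the partition but not yet consumed by the group."""
    return log_end_offset - current_offset


# Per-partition lag and the total across the topic.
lags = {partition: lag(current, end) for partition, current, end in rows}
total_lag = sum(lags.values())

for partition, value in lags.items():
    print(f"partition {partition}: lag {value}")
print(f"total lag: {total_lag}")
```

Partitions 0 to 3 are fully caught up, while partitions 4 to 7 each have messages the group has not read yet.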

Find the number of days in each month between 2 given dates

My input is :
catg_desc equipment_number present_date
STANDBY 123 24-06-2018
OTHERS 123 21-04-2019
READY 123 26-04-2019
JOB 256 26-04-2019
I have solved this scenario in PostgreSQL, but the solution multiplies the number of records. I don't want to increase the number of records in the final table, as it can grow to 35,000,000 rows, which is difficult to handle in Tableau.
Using generate_series, we insert rows for the missing months.
Expected Output:
catg_desc equipment_number present_date Mon-yy no of days
STANDBY 123 24-06-2018 Jun-18 7
STANDBY 124 24-06-2018 Jul-18 31
STANDBY 125 24-06-2018 Aug-18 31
STANDBY 126 24-06-2018 Sep-18 30
STANDBY 127 24-06-2018 Oct-18 31
STANDBY 128 24-06-2018 Nov-18 30
STANDBY 129 24-06-2018 Dec-18 31
STANDBY 130 24-06-2018 Jan-19 31
STANDBY 131 24-06-2018 Feb-19 28
STANDBY 132 24-06-2018 Mar-19 31
STANDBY 133 24-06-2018 Apr-19 20
OTHERS 123 24-06-2018 Apr-19 5
READY 123 26-04-2019 Apr-19 30
READY 124 26-04-2019 May-19 22 (till current date)
JOB 256 26-04-2019 Apr-19 5
JOB 256 26-04-2019 May-19 22 (till current date)
If you have two date fields, use DATEDIFF(). Create a calculated field as below:
DATEDIFF('day',[Startdate],[enddate])
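For the per-month breakdown the question asks about, the arithmetic can be sketched in Python. This is an illustration of the calculation only, not a Tableau or PostgreSQL solution; the function name days_per_month is made up, and the end date is treated as exclusive (a status ends on the day the next one starts), which is an assumption inferred from the expected output.

```python
from datetime import date


def days_per_month(start, end):
    """Split the half-open range [start, end) into (Mon-yy, days) pairs."""
    result = []
    current = start
    while current < end:
        # First day of the month after `current`.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Stop at the month boundary or at the end date, whichever is first.
        stop = min(next_month, end)
        result.append((current.strftime("%b-%y"), (stop - current).days))
        current = stop
    return result


# STANDBY starts on 24-06-2018 and ends when OTHERS starts on 21-04-2019.
for label, days in days_per_month(date(2018, 6, 24), date(2019, 4, 21)):
    print(label, days)
```

Because the result is one pair per month per status, it can be computed on demand from the start and end dates rather than stored as extra rows.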

Kafka broker shut down even though the log dirs are present

When I saw this error message:
ERROR Shutdown broker because all log dirs in /tmp/kafka-logs have failed (kafka.log.LogManager)
My first thought was: "well, the /tmp directory probably got cleared out by the O/S (Linux), so I should update the Kafka config to point to something permanent". However, the directory is present and has not been wiped:
ll /tmp/kafka-logs/
total 20
drwxrwxr-x 2 ec2-user ec2-user 38 Apr 7 16:56 __consumer_offsets-0
drwxrwxr-x 2 ec2-user ec2-user 38 Apr 7 16:56 __consumer_offsets-7
drwxrwxr-x 2 ec2-user ec2-user 38 Apr 7 16:56 __consumer_offsets-42
..
drwxrwxr-x 2 ec2-user ec2-user 38 Apr 7 16:56 __consumer_offsets-32
drwxrwxr-x 2 ec2-user ec2-user 141 Apr 12 02:49 flights_raw-0
drwxrwxr-x 2 ec2-user ec2-user 178 Apr 12 08:25 air2008-0
drwxrwxr-x 2 ec2-user ec2-user 141 Apr 12 13:38 testtopic-0
-rw-rw-r-- 1 ec2-user ec2-user 1244 Apr 17 22:29 recovery-point-offset-checkpoint
-rw-rw-r-- 1 ec2-user ec2-user 4 Apr 17 22:29 log-start-offset-checkpoint
-rw-rw-r-- 1 ec2-user ec2-user 1248 Apr 17 22:30 replication-offset-checkpoint
So then what does this actually mean, why is it happening and what should be done to correct/avoid the error?
The best answer to a related question suggests deleting the log directories for both Kafka (/tmp/kafka-logs) and ZooKeeper (/tmp/zookeeper).
This is probably caused by a Kafka issue that was resolved in August 2018.
Hope this helps.

Kafka pod fails to come up after pod deletion with NFS

We were trying to run a Kafka cluster on Kubernetes using an NFS provisioner. The cluster came up fine. However, when we killed one of the Kafka pods, the replacement pod failed to come up.
Persistent volume before pod deletion:
# mount
10.102.32.184:/export/pvc-ce1461b3-1b38-11e8-a88e-005056073f99 on /opt/kafka/data type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.133.40.245,local_lock=none,addr=10.102.32.184)
# ls -al /opt/kafka/data/logs
total 4
drwxr-sr-x 2 99 99 152 Feb 26 21:07 .
drwxrwsrwx 3 99 99 18 Feb 26 21:07 ..
-rw-r--r-- 1 99 99 0 Feb 26 21:07 .lock
-rw-r--r-- 1 99 99 0 Feb 26 21:07 cleaner-offset-checkpoint
-rw-r--r-- 1 99 99 57 Feb 26 21:07 meta.properties
-rw-r--r-- 1 99 99 0 Feb 26 21:07 recovery-point-offset-checkpoint
-rw-r--r-- 1 99 99 0 Feb 26 21:07 replication-offset-checkpoint
# cat /opt/kafka/data/logs/meta.properties
#
#Mon Feb 26 21:07:08 UTC 2018
version=0
broker.id=1003
Deleting the pod:
kubectl delete pod kafka-iced-unicorn-1
The reattached persistent volume in the newly created pod:
# ls -al /opt/kafka/data/logs
total 4
drwxr-sr-x 2 99 99 180 Feb 26 21:10 .
drwxrwsrwx 3 99 99 18 Feb 26 21:07 ..
-rw-r--r-- 1 99 99 0 Feb 26 21:10 .kafka_cleanshutdown
-rw-r--r-- 1 99 99 0 Feb 26 21:07 .lock
-rw-r--r-- 1 99 99 0 Feb 26 21:07 cleaner-offset-checkpoint
-rw-r--r-- 1 99 99 57 Feb 26 21:07 meta.properties
-rw-r--r-- 1 99 99 0 Feb 26 21:07 recovery-point-offset-checkpoint
-rw-r--r-- 1 99 99 0 Feb 26 21:07 replication-offset-checkpoint
# cat /opt/kafka/data/logs/meta.properties
#
#Mon Feb 26 21:07:08 UTC 2018
version=0
broker.id=1003
We see the following error in the Kafka logs:
[2018-02-26 21:26:40,606] INFO [ThrottledRequestReaper-Produce], Starting (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
[2018-02-26 21:26:40,711] FATAL [Kafka Server 1002], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.io.IOException: Invalid argument
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1012)
at kafka.utils.FileLock.<init>(FileLock.scala:28)
at kafka.log.LogManager$$anonfun$lockLogDirs$1.apply(LogManager.scala:104)
at kafka.log.LogManager$$anonfun$lockLogDirs$1.apply(LogManager.scala:103)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at kafka.log.LogManager.lockLogDirs(LogManager.scala:103)
at kafka.log.LogManager.<init>(LogManager.scala:65)
at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:648)
at kafka.server.KafkaServer.startup(KafkaServer.scala:208)
at io.confluent.support.metrics.SupportedServerStartable.startup(SupportedServerStartable.java:102)
at io.confluent.support.metrics.SupportedKafka.main(SupportedKafka.java:49)
[2018-02-26 21:26:40,713] INFO [Kafka Server 1002], shutting down (kafka.server.KafkaServer)
[2018-02-26 21:26:40,715] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
The only way around this seems to be to delete the persistent volume claim and force-delete the pod again. Alternatively, we can use a storage provider other than NFS (Rook works fine in this scenario).
Has anyone come across this issue with NFS provisioner?