Custom UDAF not working (KSQL: Confluent) - apache-kafka

I am facing issues while creating a custom UDAF in KSQL. The use case is to find the "first" and "last" value of a column in a tumbling window. There is no such built-in UDAF (https://docs.confluent.io/current/ksql/docs/syntax-reference.html#aggregate-functions), so I am trying to create a custom one.
I performed the following steps based on this document: https://www.confluent.io/blog/write-user-defined-function-udf-ksql/
i. Created the UDAF and an AggregateFunctionFactory, and registered it in FunctionRegistry as follows:
addAggregateFunctionFactory(new MyAggFunctionFactory());
ii. Built the ksql-engine jar and replaced the existing one in the Confluent package at $CONFLUENT_HOME/share/java/ksql.
iii. Restarted ksql-server.
However, it seems that the function is not registered. Any suggestions?
Confluent Version: 4.1.0
Note: I tried creating a simple UDF, and that works well. The issue is only with the UDAF.

The issue was that I had named the function 'First', which seems to be a reserved keyword. After changing the function name, it worked.
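For illustration, a minimal sketch of the registration step with a non-conflicting name; MyAggFunctionFactory is the custom factory from step i, and passing the function name through its constructor is an assumption about how that factory is written:
// Hypothetical registration in FunctionRegistry: the custom factory from step i,
// registered under a name such as FIRST_VALUE instead of the reserved word FIRST.
// Whether the name is passed through the constructor depends on how the factory
// is implemented.
addAggregateFunctionFactory(new MyAggFunctionFactory("FIRST_VALUE"));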

Related

Creating Custom Jupyter Widgets

I'm trying to create a custom Jupyter widget that takes a pandas.DataFrame as an input and simply renders a modified HTML version of the DataFrame as an output. I'm stuck at the start when it comes to defining a DataFrame as the input for the widget.
I have tried to follow the online examples, and I think I would be fine with most string inputs to a widget, but I'm lost when trying to use a DataFrame as an input.
I would just like to be able to pass a DataFrame into my custom widget and validate that it is indeed a DataFrame.
You can do this using jp_proxy_widget. In fact it is almost implemented in this notebook:
https://nbviewer.jupyter.org/github/AaronWatters/jp_doodle/blob/master/notebooks/misc/In%20place%20html%20table%20update%20demo.ipynb
The implementation is more complex than you requested because it supports in-place updates of the table.
Please see https://github.com/AaronWatters/jp_proxy_widget
The example notebook is from https://github.com/AaronWatters/jp_doodle

Adding Input Operator Dynamically to a running Apache Apex application

Is it possible to add an input operator for a different source to a running Apex application?
For example: in a production environment, I am running an Apex application that reads text files from an input source, and I want to add a Kafka source with its own input operator to the same DAG.
Priyanshu,
You can have multiple input operators. Just add a Kafka input operator to your DAG (see the sketch below).
http://docs.datatorrent.com/library_operators/
Amol
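For illustration, a rough sketch of what that looks like in an application's populateDAG; the Malhar operator class names and package locations are assumptions that may differ across Malhar versions, and the downstream wiring is only indicated in a comment:
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator;
import com.datatorrent.lib.io.fs.LineByLineFileInputOperator;
import org.apache.hadoop.conf.Configuration;

public class MixedSourceApplication implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    // existing text-file source
    LineByLineFileInputOperator fileIn =
        dag.addOperator("fileInput", new LineByLineFileInputOperator());

    // additional Kafka source added to the same DAG
    KafkaSinglePortStringInputOperator kafkaIn =
        dag.addOperator("kafkaInput", new KafkaSinglePortStringInputOperator());

    // wire each source to your downstream operator(s) with dag.addStream(...),
    // e.g. dag.addStream("kafkaLines", kafkaIn.outputPort, myProcessor.input);
  }
}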

Hive configuration for Spark integration tests

I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before HiveContext is created.
Just setting the latter, as mentioned in this answer, does not work on Spark 1.6.1:
val sqlc = new HiveContext(sparkContext)
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)
The table metadata goes in the right place but the written files go to /user/hive/warehouse.
If a dataframe is saved without an explicit path, e.g.,
df.write.saveAsTable("tbl")
the location to write files to is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database. That location appears to be cached during HiveContext construction, so setting fs.defaultFS after the HiveContext has been constructed has no effect.
As an aside, but very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata but leaves the table files, which wreaks havoc with expectations. This is a known problem--see here & here--and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.
In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?
In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession's builder, before creating a SparkSession. It should propagate correctly.
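For illustration, a minimal sketch of that approach using the Java API; the local master and the warehouse path are placeholders for whatever your test harness uses:
import org.apache.spark.sql.SparkSession;

public class WarehouseConfigExample {
  public static void main(String[] args) {
    // Set the warehouse location on the builder, before any session exists.
    // Requires spark-hive on the test classpath for enableHiveSupport().
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "/tmp/spark-test-warehouse")
        .enableHiveSupport()
        .getOrCreate();

    // Tables saved without an explicit path should now land under the
    // configured warehouse directory.
    spark.sql("CREATE TABLE tbl (id INT)");
  }
}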
For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
The spark-testing-base library has a TestHiveContext configured as part of the setup for DataFrameSuiteBaseLike. Even if you're unable to use spark-testing-base directly for some reason, you can see how they make the configuration work.

Predict Class Probabilities in Spark RandomForestClassifier

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predicted probabilities from the models, but I only see predicted classes instead of probabilities. According to this issue link, the issue is resolved, and it leads to this GitHub pull request and this. However, it seems it was resolved in version 1.5. I'm using AWS EMR, which provides Spark 1.4.1, and I still have no idea how to get the predicted probabilities. If anyone knows how to do it, please share your thoughts or solutions. Thanks!
I have already answered a similar question before.
Unfortunately, with MLlib you can't get the per-instance probabilities for classification models up to version 1.4.1.
There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which were IN PROGRESS as I was writing the answer. Nevertheless, the issue seems to have been on hold since November 2014:
There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label.
And here is a note from #sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
Reference: source.
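For what it's worth, here is a rough sketch of the kind of hack that note describes for the MLlib NaiveBayesModel: recompute the class posteriors from the model's public pi and theta members. This is only an illustration of the Naive Bayes workaround, not of anything for random forests, and it assumes the multinomial variant:
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vector;

public class NaiveBayesPosteriors {
  // Returns per-class probabilities, ordered as model.labels().
  public static double[] posteriors(NaiveBayesModel model, Vector features) {
    double[] pi = model.pi();          // log class priors
    double[][] theta = model.theta();  // log feature likelihoods per class
    double[] x = features.toArray();

    // log p(class | x) up to a constant: pi(i) + theta(i) . x
    double[] logProb = new double[pi.length];
    for (int i = 0; i < pi.length; i++) {
      double s = pi[i];
      for (int j = 0; j < x.length; j++) {
        s += theta[i][j] * x[j];
      }
      logProb[i] = s;
    }

    // normalize with log-sum-exp to obtain probabilities
    double max = Double.NEGATIVE_INFINITY;
    for (double v : logProb) max = Math.max(max, v);
    double sumExp = 0.0;
    for (double v : logProb) sumExp += Math.exp(v - max);

    double[] probs = new double[logProb.length];
    for (int i = 0; i < probs.length; i++) {
      probs[i] = Math.exp(logProb[i] - max) / sumExp;
    }
    return probs;
  }
}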
This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.
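For anyone who can move to 1.5+, the DataFrame-based ml API then exposes a probability column on the random forest model's output. A rough sketch against the 1.5/1.6 Java API, assuming your training and test DataFrames already carry the default label and features columns from your own feature pipeline:
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.sql.DataFrame;

public class RandomForestProbabilities {
  // train and test are assumed to contain "label" and "features" columns.
  public static DataFrame predictWithProbabilities(DataFrame train, DataFrame test) {
    RandomForestClassifier rf = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setNumTrees(100);

    RandomForestClassificationModel model = rf.fit(train);

    // The transformed output contains a "probability" vector of per-class
    // probabilities alongside the "prediction" column.
    return model.transform(test).select("label", "probability", "prediction");
  }
}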
Concerning AWS, there is not much you can do about that right now. A solution might be to fork the emr-bootstrap-actions for Spark and configure it for your needs; then you'll be able to install Spark on AWS using the bootstrap step.
Nevertheless, this might seem a little complicated.
There are some things you might need to consider:
Update the spark/config.file to install your Spark 1.5. Something like:
+3 1.5.0 python s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz
The file listed above must be a proper build of Spark located in an S3 bucket you own, for the time being.
To build your Spark, I advise reading the examples section about building-spark-for-emr and also the official documentation. That should be about it! (I hope I haven't forgotten anything.)
EDIT: Amazon EMR release 4.1.0 offers an upgraded version of Apache Spark (1.5.0). You can check here for more details.
Unfortunately this isn't possible with version 1.4.1. You could extend the random forest class and copy some of the code I added in that pull request if you can't upgrade, but be sure to switch back to the regular version once you are able to upgrade.
Spark 1.5.0 is now supported natively on EMR with the emr-4.1.0 release! There is no more need to use the emr-bootstrap-actions, which, by the way, only work on 3.x AMIs, not emr-4.x releases.

Sample code for Cassandra trigger

First of all, the basic question is: how do I implement a trigger in Cassandra?
How do I perform a delete operation on multiple tables in a Cassandra trigger? Is there any sample code for delete? Any detailed documentation on Cassandra triggers with sample code would be very helpful.
Thanks
Chaity
You can find documentation about using CQL here:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/trigger_r.html
Is this maybe what you were looking for?
I hope it is not too late for a response.
In order to implement a Cassandra trigger, you need to:
implement the ITrigger interface from the Cassandra Maven dependency (see the sketch after this list)
build a jar with dependencies and place it under the /etc/cassandra/triggers folder (the location may vary depending on the environment: Docker, local, etc.)
start Cassandra and execute a CREATE TRIGGER ... CQL query
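As a starting point, here is a minimal, hedged sketch of such a trigger class, assuming the Cassandra 3.x trigger API (older versions use a different augment signature); the package and class names are placeholders:
package com.example.triggers;

import java.util.Collection;
import java.util.Collections;

import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.db.partitions.Partition;
import org.apache.cassandra.triggers.ITrigger;

public class AuditTrigger implements ITrigger {
    @Override
    public Collection<Mutation> augment(Partition update) {
        // Inspect the incoming partition update here (insert/update/delete
        // detection, as in the sample project linked below) and build any extra
        // Mutations you want applied, e.g. deletes against other tables.
        // Returning an empty collection leaves the original write unchanged.
        return Collections.emptyList();
    }
}
After packaging this class with its dependencies into a jar under the triggers folder, it would be attached with something like CREATE TRIGGER my_trigger ON my_keyspace.my_table USING 'com.example.triggers.AuditTrigger'; the trigger, keyspace, and table names here are placeholders.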
You can check my sample project https://github.com/timurt/cassandra-trigger
Inside, I implemented detection of insert, update, and delete operations for partition, row, and cell entities.
Hope this will help you!