How to save data to InfluxDB using a Spark Streaming job - Scala

I have tried the library pygmalios/reactiveinflux-spark, but its dependencies conflict with our set of libraries. I have also tried com.paulgoldbaum/scala-influxdb-client, which gives a runtime error while reading a file on the cluster.
Is there any reliable library that addresses this need?
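One approach that avoids Spark-specific wrappers is to write from foreachRDD / foreachPartition using the plain Java client (influxdb-java). A minimal sketch, assuming an InfluxDB 1.x server at a hypothetical host, a hypothetical database mydb, and a DStream[(String, Double)] called stream:
import java.util.concurrent.TimeUnit
import org.influxdb.InfluxDBFactory
import org.influxdb.dto.Point
// `stream` is assumed to be a DStream[(String, Double)] of (measurement, value) pairs
stream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // create the client on the executor so nothing non-serializable is shipped from the driver
    val influx = InfluxDBFactory.connect("http://influx-host:8086", "user", "password") // hypothetical endpoint and credentials
    partition.foreach { case (metric, value) =>
      influx.write("mydb", "autogen", // hypothetical database and retention policy
        Point.measurement(metric)
          .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
          .addField("value", value)
          .build())
    }
    influx.close()
  }
}
For anything beyond trivial volumes, batching the points (influxdb-java also supports batched writes) is advisable.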

Related

Error on Spark MLflow Model Registry using Databricks after upgrade: cannot load trained XGBoost model

I had some code for training and then using XGBoost models in a Databricks environment. As my runtime version got deprecated, I upgraded it, but I quickly noticed that I could not load my trained models anymore. The reason seems to be a change in the naming of classes in sparkdl:
Error loading metadata: Expected class name sparkdl.xgboost.xgboost_core.XgboostClassifierModel but found class name sparkdl.xgboost.xgboost.XgboostClassifierModel
Would anyone have advice on how to fix this issue? Maybe by modifying the metadata?
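If modifying the metadata is an acceptable route, Spark ML-style persistence usually keeps the class name in a small JSON file under the model's metadata directory, so one experiment is to patch that file to contain the class name the new runtime expects. This is only a sketch under that assumption; the model path below is hypothetical:
// hypothetical location where the model was saved
val modelPath = "dbfs:/models/my_xgb_model"
// read the saved metadata JSON and swap the old class name for the one the new runtime expects
val meta = spark.sparkContext.textFile(modelPath + "/metadata").collect().mkString("\n")
val fixed = meta.replace(
  "sparkdl.xgboost.xgboost.XgboostClassifierModel",
  "sparkdl.xgboost.xgboost_core.XgboostClassifierModel")
// write the patched metadata to a separate directory, inspect it, then swap it in (back up the original first)
spark.sparkContext.parallelize(Seq(fixed), 1).saveAsTextFile(modelPath + "/metadata_fixed")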

How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow.
The code in question can be found here. However, I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirectRunner. That worked as expected.
Now, I am trying to read the files from GCS, write the output to BigQuery, and run the pipeline using the DataflowRunner. I already made all the necessary changes (or so I believe), but I am unable to make it run.
I already ran gcloud auth application-default login, and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the job is submitted to Dataflow. However, after one hour the job fails with the following error message.
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note, the job did nothing in all that time, and since this is an experiment the data is simply too small to take more than a couple of minutes.)
Checking Stackdriver, I can find the following error:
java.lang.ClassNotFoundException: scala.collection.Seq
Together with a related Jackson error:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor right at the start. I really do not understand why it cannot find the Scala standard library.
I also tried to first create a template and run it later with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But, after running the template, I get the same error.
Also, I noticed that there are a lot of jars in the staging folder, but scala-library.jar is not among them.
Am I missing something obvious?
It's a known issue with sbt 1.3.0, which introduced some breaking changes w.r.t. class loaders. Try 1.2.8?
Also, the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.
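If you want to try the downgrade, the sbt version used by the build is pinned in project/build.properties:
sbt.version=1.2.8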
Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt uses a new class loader for the application that is run with run. This allows classes already loaded by the JVM (Predef, for instance) to be reused, reducing startup time. See in-process class loaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
Hope that helps

Updating the Kafka dependency in Camus causes messages not to be read by EtlRecordReader

In my project, Camus has been used for a long time and has never been updated.
The Camus project uses Kafka version 0.8.2.2. I want to find a workaround to use Kafka 1.0.0.
So I cloned the repository and updated the dependency. When I do that, the Message here requires additional parameters, as shown here.
As given in the GitHub link above, the code compiles, but messages are not read from Kafka due to the condition here.
Is it possible to update the Kafka dependency along with the appropriate constructors of kafka.message.Message and make it work?

JAI can't execute in native Spark - only in sbt and as a separate Scala function

I want to use a library (JAI) with Spark to parse some spatial raster files. Unfortunately, there are some strange issues: JAI only works when running via the build tool, i.e. sbt run, but not when executed in Spark.
When executed via spark-submit, the error is:
java.lang.IllegalArgumentException: The input argument(s) may not be null.
at javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
at org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
This looks as if some native dependency is not being called correctly.
Assuming something was wrong with the classpath, I tried to run a plain Java/Scala function, but that one works just fine.
In fact, the exact same problem occurs when NiFi is calling the parse function.
Is Spark messing with the classpath? What is different between running the jar natively via java -jar and running it through Spark or NiFi? Both show the same problem, even when concurrency is disabled and they run on only a single thread.
JAI vendorname == null is somewhat similar, as it shows what can go wrong when running a jar with JAI. I could not confirm it is the exact same problem, though.
I created a minimal example here:
https://github.com/geoHeil/jai-packaging-problem
Due to the dependency on the build process and the packaging of native libraries, I do not think it will be possible to include snippets directly in this posting.
edit
I am pretty convinced this has to do with the assembly merge strategy; so far I could not find one that works.
Below you can see that the Vectorize operation is missing from Spark's classpath.
edit 2
I think Spark's / NiFi's class loader will not load some of the required registry files for JAI. A plain Java app works fine with these assembly / fat-jar settings.
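If the fat-jar packaging is indeed the culprit, one thing worth trying is to make sbt-assembly concatenate the JAI registry and service files instead of discarding or deduplicating them, so that operations like Vectorize stay registered. This is only a sketch; the exact file names shipped in your jars may differ:
// build.sbt (sketch): keep JAI's operation registry entries when building the fat jar
assemblyMergeStrategy in assembly := {
  // JAI and jai-ext discover their operations through these registry files
  case x if x.endsWith("registryFile.jai") || x.endsWith("registryFile.jaiext") =>
    MergeStrategy.concat
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}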

Error: Could not find or load main class config.zookeeper.properties

I am trying to run a sample producer-consumer application using Apache Kafka. I downloaded it from https://www.apache.org/dyn/closer.cgi?path=/kafka/0.10.0.0/kafka-0.10.0.0-src.tgz. Then I started following the steps given in http://www.javaworld.com/article/3060078/big-data/big-data-messaging-with-kafka-part-1.html.
When I try to run bin/zookeeper-server-start.sh config/zookeeper.properties, I get Error: Could not find or load main class config.zookeeper.properties. I googled the issue but didn't find any useful information. Can anyone help me to continue?
You've downloaded the source package. Download the binary package of Kafka and test with that.
You have to download the binary version from the official Kafka web site.
Assuming you have the correct binary version, check that you do not already have CLASSPATH defined in your environment. If you do, and the defined CLASSPATH has a space in it (e.g. C:\Program Files\<>), then neither ZooKeeper nor Kafka will start.
To solve this, either delete your existing CLASSPATH or modify the startup script that builds the ZooKeeper and Kafka CLASSPATH values, putting your CLASSPATH entry in double quotes before the path is built.
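For example (a hypothetical Windows session, assuming the standard layout of the binary distribution), you can clear the variable for the current shell and then start ZooKeeper:
set CLASSPATH=
bin\windows\zookeeper-server-start.bat config\zookeeper.properties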