MLlib examples not working - scala

I am trying the MLlib examples from this page (on Spark using Scala): MLlib Page
All the examples are throwing the same error. I have included the one I get for Linear Regression:
scala> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:123)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:136)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
Could someone please guide me on what is causing this error? Thank you.

Just figured out the answer... Apparently, some of the settings in my bashrc were conflicting with Spark. Removing the bashrc file fixed the issue.
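If anyone hits the same thing, a quick way to see what the shell is injecting (a rough sketch, assuming the conflict comes from Hadoop/Spark-related variables exported in .bashrc) is to dump the suspicious environment variables from inside spark-shell:
// List environment variables that commonly interfere with Spark's Hadoop setup
sys.env
  .filter { case (k, _) => k.contains("HADOOP") || k.contains("SPARK") || k.contains("CLASSPATH") }
  .foreach { case (k, v) => println(s"$k=$v") }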

Related

Databricks scala cannot find uber objects

First, allow me to say that I am trying to learn Databricks but have years of Data Factory and ETL experience. I have some code that uses AIS data to map logistics movements.
The code uses Uber libraries for H3Core functionality. I saw a demo of this code on a coworker's laptop, so I know it CAN work.
I am having trouble finding the uber objects. I assume this is a newbie thing. I imagine my problem is environmental.
I cannot post all the code, but here are the import lines that are throwing the error:
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
import com.uber.h3core.LengthUnit
Those lines produce the following errors on execution:
command-2206228078162026:1: error: object uber is not a member of package com
import com.uber.h3core.H3Core
^
command-2206228078162026:2: error: object uber is not a member of package com
import com.uber.h3core.util.GeoCoord
^
command-2206228078162026:3: error: object uber is not a member of package com
import com.uber.h3core.LengthUnit
^
Below, I think it is trying to reference objects created from the missing H3Core class library:
command-2206228078162026:15: error: not found: value H3Core
val h3 = H3Core.newInstance()
^
command-2206228078162026:26: error: not found: value H3Core
val h3 = H3Core.newInstance()
^
I have also had similar issues, usually when the jar is a little bigger and the notebook starts running before the jar has fully loaded. But I have not been able to reproduce it consistently. (So don't quote me on this.)
Here are a couple of options we have tried. Please validate and see if they work for you.
Have the jar available on the DBFS path and then install it from DBFS as a dependent library at the job level. Please refer to the link - Dependent-Libraries
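If a DBFS jar is awkward, the H3 bindings are also published to Maven Central under the coordinates com.uber:h3, so they can be attached as a Maven library on the cluster or job instead; and if you build your own jar, the sbt line would look roughly like this (the version number is an assumption, check Maven Central for the one the demo used):
// build.sbt sketch: pull in Uber's H3 Java bindings (version is illustrative)
libraryDependencies += "com.uber" % "h3" % "3.7.2"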
We have seen that sometimes even this fails. If it is possible, you could check whether the library has been loaded, sleep, and then check again. I am not sure how you would do that in Scala.
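A rough sketch of that wait-and-retry idea in Scala (the class name is taken from the question, the timings are arbitrary):
// Poll the driver classpath until the H3 class is visible, up to a timeout
def waitForClass(fqcn: String, retries: Int = 12, delayMs: Long = 5000): Boolean =
  (1 to retries).exists { _ =>
    try { Class.forName(fqcn); true }
    catch { case _: ClassNotFoundException => Thread.sleep(delayMs); false }
  }

if (!waitForClass("com.uber.h3core.H3Core"))
  sys.error("H3 library still not visible after waiting")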
Please validate what works....
Cheers...

Spark giving multiple datasource error on saving parquet file

I am trying to learn Spark and Scala. When I try to write my result DataFrame to a parquet file by calling the parquet method, I get the error below.
Code Base that fails:-
df2.write.mode(SaveMode.Overwrite).parquet(outputPath)
This fails too
df2.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").mode(SaveMode.Overwrite).parquet(outputPath)
Error Log:-
Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:707)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:967)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:304)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
However, if I call another method for the save, the code works properly.
This works fine:-
df2.write.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").mode(SaveMode.Overwrite).save(outputPath)
Although I have a workaround for the issue, I'd like to understand why the first approach is not working and how I can solve it.
The details of the setup I am using are:-
Scala 2.12.9
Java 1.8
Spark 2.4.4
P.S. This issue is only seen on spark-submit
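One plausible cause to rule out (an assumption, not something the error proves): under spark-submit the classpath can end up with two different Spark builds, for example when an assembled fat jar bundles its own spark-sql while the cluster provides another, so both a V1 and a V2 Parquet source get registered. A minimal build.sbt sketch, assuming sbt-assembly is used, that keeps Spark out of the fat jar:
// build.sbt sketch: scope Spark as "provided" so the assembled jar does not
// carry a second copy of the Parquet data source registrations
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided"
)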

Spark cannot find case class on classpath

I have an issue where Spark is failing to generate code for a case class. Here is the Spark error:
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 52, Column 43: Identifier expected instead of '.'
Here is the referenced line in the generated code
/* 052 */ private com.avro.message.video.public.MetricObservation MapObjects_loopValue34;
It should be noted that com.avro.message.video.public.MetricObservation is a nested case class in part of a larger hierarchy. It is also used successfully in other places in the code. It should also be noted that this pipeline works fine if I use the RDD API, but I want to use the Dataset API because I want to write the Dataset out as parquet. Has anyone seen this issue before?
I'm using Scala 2.11 and Spark 2.1.0. I was able to upgrade to Spark 2.2.1 and the issue is still there.
Do you think that SI-7555 or something like it has any bearing on this? I have noticed in the past that Scala reflection has had issues generating TypeTags for statically nested classes. Do you think something like that is going on, or is this strictly a Catalyst issue in Spark? You might want to file a Spark ticket too.
So it turns out that changing the package name of the affected class "fixes" the problem (i.e. makes it go away). I really have no idea why this is, or even how to reproduce it in a small test case. What worked for me was simply creating a higher-level package that works. Specifically, com.avro.message.video.public -> com.avro.message.publicVideo.
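A guess at why the rename helps (not confirmed anywhere above): Catalyst spells out the full class name in the Java source it generates, and public is a reserved word in Java, so a package segment named public cannot appear as an identifier in that generated file; moving the class into a package with no keyword segments sidesteps the parse error. A minimal sketch of the workaround, with illustrative fields:
// Before: com.avro.message.video.public.MetricObservation -- the "public"
// segment is a Java keyword, which the generated Java code cannot reference.
// After: a package path containing no Java keywords.
package com.avro.message.publicVideo

case class MetricObservation(videoId: String, metric: String, value: Double)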

how to run evolutionary algorithms using pyspark

I want to run evolutionary algorithms like GA and PSO using PySpark on Spark. How can I do this with MLlib and the DEAP Python library? Is there any other library available to perform the same task?
Have a look at my answer on how to use DEAP with Spark and see if it works for you.
Here is an example of how to configure the DEAP toolbox to replace the map function with a custom one that uses Spark.
from pyspark import SparkContext
from deap import base
sc = SparkContext(appName="DEAP")
toolbox = base.Toolbox()  # or your existing DEAP toolbox
def sparkMap(algorithm, population):
    # DEAP expects map() to return an iterable of results, so collect the RDD
    return sc.parallelize(population).map(algorithm).collect()
toolbox.register("map", sparkMap)
In https://github.com/DEAP/deap/issues/268 they show how to do this in the DEAP package. However, that is an open issue; they mention there is a pull request (https://github.com/DEAP/deap/pull/76), and it seems the fixed code/branch is from a forked repo.
It sounds like rebuilding the package with that code should resolve the issue.
Another resource I found (I haven't tried it) is https://apacheignite.readme.io/docs/genetic-algorithms.
I also came across https://github.com/paduraru2009/genetic-algorithm-with-Spark-for-test-generation.

PredictionIO evaluation in classifier

Has anyone managed to run an evaluation correctly using PredictionIO?
I am using the classification template on a server, but with more attributes. It trains on a dataset I obtained and makes predictions well. However, it fails during evaluation, even though all the data is labeled, the same data I use to train the algorithm...
The error:
Exception in thread "main" java.lang.IllegalArgumentException:
requirement failed: RDD[labeledPoints] in PreparedData cannot be
empty. Please check if DataSource generates TrainingData and
Preparator generates PreparedData correctly.
DataSource.scala and Preparator.scala should work as they are.
Thanks for any help
Evaluation (using the command shown in the docs) works with the latest version, provided that you set Spark to 1.4.1 in your build.sbt. See this GitHub issue:
https://github.com/PredictionIO/template-scala-parallel-textclassification/issues/2
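A minimal build.sbt sketch of that version pin (the exact artifact list depends on the template, so take the names as assumptions):
// build.sbt sketch: pin the Spark artifacts the template depends on to 1.4.1
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided"
)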
Finally I got it working by starting all over again. For classification, be sure to follow the guide steps and:
1. Add all the attributes you use from your dataset to the Engine, Evaluation, DataSource and NaiveBayesAlgorithms Scala files.
2. Change the app name to yours in engine.json and Evaluation.scala.
3. Rebuild the app: "pio build --verbose".
4. Now you can evaluate: "pio eval yourpackagename.AccuracyEvaluation yourpackagename.EngineParamsList"