I have already written code for logistic regression using Apache Spark in Scala. Now I am creating a jar file from it using IntelliJ IDEA, but I am getting some errors.
First I imported the data from a CSV file. Then I fitted a logistic regression model. After that I evaluated the model. Finally I need to save the model evaluation results to a text file. I am getting an error when I try to write the model evaluation results to a file.
Here is my code:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

object class1 {
  def main(args: Array[String]): Unit = {
    // Start a local SparkSession.
    val sc: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()
    // Read the CSV file passed as the first argument.
    val df2 = sc.read.options(Map("inferSchema" -> "true", "sep" -> ",", "header" -> "true")).csv(args(0))
    // Hash the raw feature columns into a single vector column.
    val hasher = new FeatureHasher().setInputCols(Array("x1", "x2")).setOutputCol("features")
    val transformed = hasher.transform(df2)
    val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.1)
      .setElasticNetParam(0.6).setFeaturesCol("features").setLabelCol("automatic")
    val Array(train, test) = transformed.randomSplit(Array(0.9, 0.1))
    val lrModel = lr.fit(train)
    val result = lrModel.transform(test)
    // Evaluate accuracy on the held-out split.
    val evaluator = new MulticlassClassificationEvaluator()
    evaluator.setLabelCol("automatic")
    evaluator.setMetricName("accuracy")
    val accuracy = evaluator.evaluate(result)
    accuracy.saveAsFiles(args(1))
  }
}
My error is as follows:
[error] C:\Spark\src\main\scala\WordCount.scala:39:14: value saveAsFiles is not a member of Double
[error] accuracy.saveAsFiles(args(1))
[error] ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
This error implies that I cannot use saveAsFiles with a Double.
Can someone help me understand how to fix this?
Thank you
accuracy is no longer a DataFrame; it's just a plain Double. You can use regular Scala/Java file I/O to save it, e.g.
Files.write(..., accuracy.toString)
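For example, a minimal sketch using java.nio (the output path args(1) is taken from the question; writing plain UTF-8 text is an assumption):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write the accuracy (a Double) as text to the path passed in args(1).
Files.write(Paths.get(args(1)), accuracy.toString.getBytes(StandardCharsets.UTF_8))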
I'm trying to get mass elevation data from a TIFF image. I have a CSV file that contains latitude, longitude, and some other attributes. I loop through the CSV file, get the latitude and longitude, and call the elevation method; the code is given below. Reference: RasterFrames extracting location information problem
package main.scala.sample

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
import org.apache.spark.sql.functions.col

object SparkSQLExample {
  def main(args: Array[String]) {
    implicit val spark = SparkSession.builder()
      .master("local[*]").appName("RasterFrames")
      .withKryoSerialization.getOrCreate().withRasterFrames
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
    val example = "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/LC08_B7_Memphis_COG.tiff"
    val rf = spark.read.raster.from(example).load()
    val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
      val extent = extentEnc.to[Extent]
      Raster(tile, extent).getDoubleValueAtPoint(point)
    })
    val spark_file: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples")
      .getOrCreate()
    spark_file.sparkContext.setLogLevel("ERROR")
    println("spark read csv files from a directory into RDD")
    val rddFromFile = spark_file.sparkContext.textFile("point.csv")
    println(rddFromFile.getClass)
    def customF(str: String): String = {
      val lat = str.split('|')(2).toDouble
      val long = str.split('|')(3).toDouble
      val point = st_makePoint(long, lat)
      val test = rf.where(st_intersects(rf_geometry(col("proj_raster")), point))
        .select(rf_value_at_point(rf_extent(col("proj_raster")), rf_tile(col("proj_raster")), point) as "value")
      return test.toString()
    }
    val rdd2 = rddFromFile.map(f => customF(f))
    rdd2.foreach(t => println(t))
    spark.stop()
  }
}
When I run it, I get a NullPointerException. Any help appreciated.
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:64)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3416)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1490)
at org.apache.spark.sql.Dataset.where(Dataset.scala:1518)
at main.scala.sample.SparkSQLExample$.main$scala$sample$SparkSQLExample$$customF$1(SparkSQLExample.scala:49)
The function which is being mapped over the RDD (customF) is not null safe. Try calling customF(null) and see what happens. If it throws an exception, then you will have to make sure that rddFromFile doesn't contain any null/missing values.
It is a little hard to tell if that is exactly where the issue is. I think the stack trace of the exception is less helpful than usual because the function is being run in Spark tasks on the workers.
If that is the issue, you could rewrite customF to handle the case where str is null, or change the parameter type to Option[String] (and tweak the logic accordingly).
By the way, the same thing applies to UDFs. They need to either
Accept Option types as input
Handle the case where each arg is null or
Only be applied to data with no missing values.
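For example, a minimal null-safe sketch of customF (the split indices come from the question's code; the raster lookup is omitted and the return strings are placeholders):

// Guard against null/blank lines and short rows before parsing coordinates.
def customF(str: String): String = {
  Option(str).map(_.trim).filter(_.nonEmpty) match {
    case Some(line) =>
      val parts = line.split('|')
      if (parts.length > 3) {
        val lat = parts(2).toDouble
        val long = parts(3).toDouble
        // ... the original raster lookup would go here ...
        s"parsed point: ($lat, $long)"
      } else {
        "malformed row"
      }
    case None =>
      "null or empty row"
  }
}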
I am new to Spark; can you please help with this?
The simple pipeline below, which does a logistic regression, produces an exception.
The code:
package pipeline.tutorial.com

import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.tuning.TrainValidationSplit

object PipelineDemo {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf()
    conf.set("spark.master", "local")
    conf.set("spark.app.name", "PipelineDemo")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().appName("PipelineDemo").getOrCreate()
    val df = spark.read.json("C:/Spark-The-Definitive-Guide-master/data/simple-ml")
    val rForm = new RFormula()
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
    val stages = Array(rForm, lr)
    val pipeline = new Pipeline().setStages(stages)
    val params = new ParamGridBuilder()
      .addGrid(rForm.formula, Array(
        "lab ~ . + color:value1",
        "lab ~ . + color:value1 + color:value2"))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .addGrid(lr.regParam, Array(0.1, 2.0))
      .build()
    val evaluator = new BinaryClassificationEvaluator()
      .setMetricName("areaUnderROC")
      .setRawPredictionCol("prediction")
      .setLabelCol("label")
    val tvs = new TrainValidationSplit()
      .setTrainRatio(0.75)
      .setEstimatorParamMaps(params)
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
    val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
    val model = tvs.fit(train)
    val rs = model.transform(test)
    rs.select("features", "label", "prediction").show()
  }
}
// end code.
The code runs fine from the spark-shell.
When writing it as a Spark application (using the Eclipse Scala IDE), it gives the error:
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Thanks.
I solved it by removing the Scala library from the build path. To do this, right-click on the Scala Library container > Build Path > Remove from Build Path.
I am not sure about the root cause, though.
This error can be resolved by changing the Scala version in your project to 2.12.8 or higher. Scala 2.12.8 works and is very stable. You can change this by going to your project structure (in IntelliJ, press Ctrl+Alt+Shift+S). Go to Global Libraries, remove the old Scala version using the - symbol, and add the new Scala version, i.e. 2.12.8 or higher, using the + symbol.
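If the project is built with sbt rather than through the IDE settings, the equivalent is pinning the Scala version in the build file. A minimal sketch (the Spark version here is an assumption; it must be an artifact published for the same Scala binary version):

// build.sbt (sketch) - Scala 2.12 with Spark artifacts built for Scala 2.12
scalaVersion := "2.12.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.4.5",
  "org.apache.spark" %% "spark-mllib" % "2.4.5"
)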
I am writing Test Cases for Spark using ScalaTest.
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FlatSpec}

class ClassNameSpec extends FlatSpec with BeforeAndAfterAll {
  var spark: SparkSession = _
  var className: ClassName = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder().master("local").appName("class-name-test").getOrCreate()
    className = new ClassName(spark)
  }

  it should "return data" in {
    import spark.implicits._
    val result = className.getData(input)
    assert(result.count() == 3)
  }

  override def afterAll(): Unit = {
    spark.stop()
  }
}
When I try to compile the test suite, it gives me the following error:
stable identifier required, but ClassNameSpec.this.spark.implicits found.
[error] import spark.implicits._
[error] ^
[error] one error found
[error] (test:compileIncremental) Compilation failed
I am not able to understand why I cannot import spark.implicits._ in a test suite.
Any help is appreciated !
To do an import you need a "stable identifier", as the error message says. This means that you need a val, not a var.
Since you defined spark as a var, Scala can't import from it correctly.
To solve this you can simply do something like:
val spark2 = spark
import spark2.implicits._
or instead change the original var to val, e.g.:
lazy val spark: SparkSession = SparkSession.builder().master("local").appName("class-name-test").getOrCreate()
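Putting it together, a minimal sketch of the spec using a lazy val (the ClassName/getData parts of the question are replaced with a self-contained assertion):

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FlatSpec}

class ClassNameSpec extends FlatSpec with BeforeAndAfterAll {
  // A lazy val is a stable identifier, so spark.implicits._ can be imported from it.
  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("class-name-test").getOrCreate()

  "ClassNameSpec" should "return data" in {
    import spark.implicits._
    val result = Seq("a", "b", "c").toDS()
    assert(result.count() == 3)
  }

  override def afterAll(): Unit = {
    spark.stop()
  }
}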
I am using Scala, Spark, IntelliJ, and Maven.
I have used the code below:
val joinCondition = when($"exp.fnal_expr_dt" >= $"exp.nonfnal_expr_dt",
  $"exp.manr_cd" === $"score.MANR_CD")
val score = exprDF.as("exp").join(scoreDF.as("score"), joinCondition, "inner")
and
val score = list.withColumn("scr", lit(0))
But when I try to build using Maven, I get the errors below:
error: not found: value when
and
error: not found: value lit
For $ and === I have used import sqlContext.implicits.StringToColumn and it works fine; no error occurred at Maven build time. But for lit(0) and when, what do I need to import, or is there another way to resolve the issue?
Let's consider the following context:
val spark: SparkSession = ??? // or val sqlContext: SQLContext = new SQLContext(sc) for 1.x
val list: DataFrame = ???
To use when and lit, you'll need to import the proper functions:
import org.apache.spark.sql.functions.{col, lit, when}
Now you can use them as follows:
list.select(when(col("column_name").isNotNull, lit(1)))
You can also use lit in your code:
val score = list.withColumn("scr", lit(0))
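A self-contained sketch tying the imports together (the data and column names are made up, standing in for the question's DataFrame):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

object WhenLitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("WhenLitExample").getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for the question's DataFrame.
    val list = Seq(("a", 1), ("b", 2)).toDF("manr_cd", "value")

    val score = list
      .withColumn("scr", lit(0))                                            // constant column via lit
      .withColumn("flag", when(col("value") > 1, lit(1)).otherwise(lit(0))) // conditional column via when
    score.show()

    spark.stop()
  }
}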
I have a basic Spark MLlib program, as follows.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors

class Sample {
  val conf = new SparkConf().setAppName("helloApp").setMaster("local")
  val sc = new SparkContext(conf)
  val data = sc.textFile("data/mllib/kmeans_data.txt")
  val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

  // Cluster the data into two classes using KMeans
  val numClusters = 2
  val numIterations = 20
  val clusters = KMeans.train(parsedData, numClusters, numIterations)

  // Export to PMML
  println("PMML Model:\n" + clusters.toPMML)
}
I have manually added spark-core, spark-mllib, and spark-sql to the project classpath through IntelliJ, all having version 1.5.0.
I am getting the error below when I run the program. Any idea what's wrong?
Error:scalac: error while loading Vector, Missing dependency 'bad
symbolic reference. A signature in Vector.class refers to term types
in package org.apache.spark.sql which is not available. It may be
completely missing from the current classpath, or the version on the
classpath might be incompatible with the version used when compiling
Vector.class.', required by
/home/fazlann/Downloads/spark-mllib_2.10-1.5.0.jar(org/apache/spark/mllib/linalg/Vector.class
DesirePRG, I have met the same problem as yours. The solution is to add a jar that assembles Spark and Hadoop, such as spark-assembly-1.4.1-hadoop2.4.0.jar; then it works properly.
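In a dependency-managed build, the same idea amounts to keeping every Spark module on one version so their classes resolve consistently. A sketch in sbt (the version number is taken from the question; adjust to your setup):

// build.sbt (sketch) - all Spark modules pinned to the same version
val sparkVersion = "1.5.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)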