Spark recommendation ALS NullPointerException - Scala

I am trying to build an ALS model using Spark's MLlib recommendation API and I am getting a NullPointerException.
I do not see any null values in the columns I am using. Help needed.
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.feature.StringIndexer

val path = "DataPath"
val data = spark.read.json(path)
data.printSchema()
data.createOrReplaceTempView("reviews")

// cast the product id to int and keep the reviewer id and rating
val raw_reviews = spark.sql("SELECT reviewerID, cast(asin as int) as ProductID, overall FROM reviews")
raw_reviews.printSchema()

// index the string reviewerID into a numeric userID
val stringindexer = new StringIndexer()
  .setInputCol("reviewerID")
  .setOutputCol("userID")
val modelc = stringindexer.fit(raw_reviews)
val df = modelc.transform(raw_reviews)

val Array(training, test) = df.randomSplit(Array(0.8, 0.2))

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userID")
  .setItemCol("ProductID")
  .setRatingCol("overall")
val model = als.fit(training)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 30.0 failed 1 times, most recent failure: Lost task 1.0 in stage 30.0 (TID 94, localhost): java.lang.NullPointerException: Value at index 1 is null
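One thing worth checking, offered as a guess rather than a confirmed diagnosis: asin values in Amazon review data are usually alphanumeric strings, so cast(asin as int) silently turns them into nulls, which would explain a NullPointerException inside ALS even though the source columns themselves contain no nulls. A minimal sketch that indexes the product column the same way as the reviewer column instead of casting it:

import org.apache.spark.ml.feature.StringIndexer

val raw = spark.sql("SELECT reviewerID, asin, overall FROM reviews")

val userIndexer = new StringIndexer().setInputCol("reviewerID").setOutputCol("userID")
val itemIndexer = new StringIndexer().setInputCol("asin").setOutputCol("ProductID")

val withUser = userIndexer.fit(raw).transform(raw)
val indexed  = itemIndexer.fit(withUser).transform(withUser)
// ALS can then use userID / ProductID, and neither column contains nulls

// quick check on the original approach: count how many nulls the cast produced
// raw_reviews.filter("ProductID is null").count()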

Related

Error writing file to S3: org.apache.spark.SparkException: Job aborted

I am trying to write the output of this code to an S3 bucket from a Databricks notebook using df.write, but I am getting this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 50, 10.239.78.116, executor 0): java.nio.file.AccessDeniedException: tfstest/0/outputFile/_started_7476690454845668203: regular upload on tfstest/0/outputFile/_started_7476690454845668203: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://tfs-databricks-sys-test.s3.amazonaws.com tfstest/0/outputFile/_started_7476690454845668203 {}
val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val df3 = spark.read.option("header", "true").csv("s3a://tfsdl-ghd-wb/raidnd/rawdata.csv")
val writer = df3.coalesce(1).write.csv("outputFile")

filtinp.foreach(x => {
  val (com1, avg1) = com1Average(filtermp, x)
  val (com2, avg2) = com2Average(filtermp, x)
})

def getFileSystem(path: String): FileSystem = {
  val hconf = new Configuration() // initialize new Hadoop configuration
  new Path(path).getFileSystem(hconf) // get the filesystem that handles this path
}
It seems to be a permissions issue, but I'm able to write to this S3 bucket when doing a union of two dataframes in the same location, for example, so I don't understand what's really happening here. Also, note that the path shown in the error, https://tfs-databricks-sys-test.s3.amazonaws.com, is not the path I'm trying to use.
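One detail that stands out in the snippet, offered as an assumption rather than a confirmed diagnosis: the write call targets the string literal "outputFile" instead of the outputFile variable, so the data goes to a relative path resolved against the cluster's default filesystem, which could explain why the error mentions a bucket other than the intended one. A minimal sketch of writing to the intended path:

// Sketch: pass the outputFile variable (the intended s3a path), not the literal string
df3.coalesce(1).write.csv(outputFile)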

ERROR SparkContext: Error initializing SparkContext. ERROR Utils: Uncaught exception in thread main

I thought this would work, but it actually fails.
import math._
import org.apache.spark.sql.SparkSession

object Position {
  def main(args: Array[String]): Unit = {
    // create a SparkSession
    val spark = SparkSession.builder().getOrCreate()

    // read the CSV files into DataFrames
    val file1 = spark.read.csv("file:///home/aaron/Downloads/taxi_gps.txt")
    val file2 = spark.read.csv("file:///home/aaron/Downloads/district.txt")

    // rename the columns
    val new_file1 = file1.withColumnRenamed("_c4", "lat")
      .withColumnRenamed("_c5", "lon")
    val new_file2 = file2.withColumnRenamed("_c0", "dis")
      .withColumnRenamed("_1", "lat")
      .withColumnRenamed("_2", "lon")
      .withColumnRenamed("_c3", "r")

    // haversine distance in km between two (lat, lon) points
    def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
      val R = 6372.8 // radius in km
      val dLat = (lat2 - lat1).toRadians
      val dLon = (lon2 - lon1).toRadians
      val a = pow(sin(dLat / 2), 2) + pow(sin(dLon / 2), 2) * cos(lat1.toRadians) * cos(lat2.toRadians)
      val c = 2 * asin(sqrt(a))
      R * c
    }

    // count taxis within each district's radius
    new_file2.foreach(row => {
      val district = row.getAs[Float]("dis")
      val lon = row.getAs[Float]("lon")
      val lat = row.getAs[Float]("lat")
      val distance = row.getAs[Float]("r")
      var temp = 0
      new_file1.foreach(taxi => {
        val taxiLon = taxi.getAs[Float]("lon")
        val taxiLat = taxi.getAs[Float]("lat")
        if (haversine(lat, lon, taxiLat, taxiLon) <= distance) {
          temp += 1
        }
      })
      println(s"district:$district temp=$temp")
    })
  }
}
Here are the results:
20/06/07 23:04:11 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
......
20/06/07 23:04:11 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
......
20/06/07 23:04:11 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
I am not sure whether using a DataFrame inside another DataFrame's foreach is the only mistake in this program, since this is Spark.
I am not familiar with Scala and Spark, so this is quite a tough question for me. I hope you can help me, thanks!
Your exception says org.apache.spark.SparkException: A master URL must be set in your configuration, so you need to set the master URL via the builder's master method.
I assume you are running the code in an IDE. If so, please replace val spark = SparkSession.builder().getOrCreate() with val spark = SparkSession.builder().master("local[*]").getOrCreate() in your code.
Or, if you are executing this code with spark-submit, try adding --master yarn.
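A minimal sketch of the two options (the appName and the jar name are illustrative additions, not taken from the post):

// In an IDE: set the master explicitly on the builder
val spark = SparkSession.builder()
  .appName("Position") // hypothetical name, added for illustration
  .master("local[*]")
  .getOrCreate()

// With spark-submit, keep the builder as it was and pass the master on the command line:
// spark-submit --master yarn --class Position position.jar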

Task not serializable Spark / Scala

I wrote this code and I always get this error on the line
val randomForestModel = randomForestClassifier.fit(trainingData)
The code:
import org.apache.spark.ml.classification.RandomForestClassifier

val seed = 5043
val Array(trainingData, testData) = labelDf.randomSplit(Array(0.7, 0.3), seed)
trainingData.cache()
testData.cache()

// train a Random Forest model on the training set
val randomForestClassifier = new RandomForestClassifier()
  .setImpurity("gini")
  .setMaxDepth(3)
  .setNumTrees(20)
  .setFeatureSubsetStrategy("auto")
  .setSeed(seed)
val randomForestModel = randomForestClassifier.fit(trainingData)
println(randomForestModel.toDebugString)
The error:
ERROR Instrumentation: org.apache.spark.SparkException: Task not serializable
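The snippet does not show what actually fails to serialize, but a common cause of this error, offered as an assumption rather than a confirmed diagnosis, is that the training code lives inside a non-serializable enclosing class (or notebook cell) whose fields get captured by the closures Spark ships to executors. One common mitigation is to keep the pipeline inside a standalone object so that only local values are captured; a sketch under that assumption:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.DataFrame

object TrainRandomForest {
  def train(labelDf: DataFrame): Unit = {
    val seed = 5043
    val Array(trainingData, testData) = labelDf.randomSplit(Array(0.7, 0.3), seed)
    trainingData.cache()
    testData.cache()
    val rf = new RandomForestClassifier()
      .setImpurity("gini")
      .setMaxDepth(3)
      .setNumTrees(20)
      .setFeatureSubsetStrategy("auto")
      .setSeed(seed)
    // keeping values local to the method avoids capturing a non-serializable outer class
    val model = rf.fit(trainingData)
    println(model.toDebugString)
  }
}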

Spark: create DataFrame in UDF

I have an example where I want to create a DataFrame inside a UDF, something like the one below.
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.feature.VectorAssembler
Data to DataFrame:
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
The UDF:
def predict(predictSet: Vector): Double = {
  val set = Seq((1,2,3,4)).toDF("a", "b", "c", "d")
  val predata = assemblerFe.transform(set)
  val result = loadmodel.transform(predata)
  result.rdd.take(1)(0)(3).toString.toDouble
}
spark.udf.register("predict", predict _)
testData.registerTempTable("datatable")
spark.sql("SELECT predict(features) FROM datatable").take(1)
I get an error like:
ERROR Executor: Exception in task 3.0 in stage 4.0 (TID 7) [Executor task launch worker for task 7]
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
and
WARN TaskSetManager: Lost task 3.0 in stage 4.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
Is a DataFrame not supported inside a UDF? I am using Spark 2.3.0 and Scala 2.11. Thanks.
As mentioned in the comments, you don't need a UDF here to apply the trained model to the test data. You can apply the model to the test DataFrame in the main program itself, as below:
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
val set=Seq((1,2,3,4)).toDF("a","b","c","d")
val predata = assemblerFe.transform(set)
val result=loadmodel.transform(predata) // Applying model on predata dataframe. You can apply model on any DataFrame.
result is now a DataFrame; you can register it as a table and query predictionLabel and features using SQL, or you can select the prediction label and other fields directly from the DataFrame.
Please note that a UDF is a feature of Spark SQL for defining new column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets. It doesn't return a DataFrame as its return type, and generally it's not advised to use UDFs unless necessary; refer to: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html
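A minimal sketch of the two options mentioned above (the prediction column name follows Spark ML's default and is an assumption, since the saved model's output columns are not shown in the post):

// Option 1: register the result DataFrame and query it with SQL
result.createOrReplaceTempView("predictions")
spark.sql("SELECT features, prediction FROM predictions").show()

// Option 2: select the fields directly from the DataFrame
result.select("features", "prediction").show()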

Spark sqlContext UDF acting on Sets

I've been trying to define a function that works on Spark DataFrames, takes Scala sets as input, and outputs an integer. I'm getting the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 25.0 failed 1 times, most recent failure: Lost task 20.0 in stage 25.0 (TID 473, localhost): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.Set
Here is a simple code that gives the crux of the issue:
// generate sample data
case class Dummy(x: Array[Integer])
val df = sqlContext.createDataFrame(Seq(
  Dummy(Array(1, 2, 3)),
  Dummy(Array(10, 20, 30, 40))
))

// define the UDF
import org.apache.spark.sql.functions._
def setSize(A: Set[Integer]): Integer = {
  A.size
}
// For some reason I couldn't get it to work without this valued function
val sizeWrap: (Set[Integer] => Integer) = setSize(_)
val sizeUDF = udf(sizeWrap)

// this produces the error
df.withColumn("colSize", sizeUDF('x)).show
What am I missing here? How can I get this to work? I know I can do this by converting to an RDD, but I don't want to go back and forth between RDDs and DataFrames.
Use Seq:
// Spark hands the array column to the UDF as a Seq (WrappedArray), so accept a Seq and convert it inside the UDF
val sizeUDF = udf((x: Seq[Integer]) => setSize(x.toSet))
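A short usage check with the corrected UDF, reusing the df from the question:

df.withColumn("colSize", sizeUDF('x)).show
// expected colSize values: 3 for Array(1, 2, 3) and 4 for Array(10, 20, 30, 40)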