I am running a random forest (RF) model on Spark, following https://spark.apache.org/docs/2.0.0/ml-classification-regression.html#random-forest-classifier
My issue is that if I load two different dataframes for train and test, e.g.:
val Array(trainingData, testData) = Array(convertedVecDF, convertedVecDF_test)
I get the above "java.util.NoSuchElementException: key not found: -1.0" as the cause of the error. However, when I do the following instead:
val data = convertedVecDF.union(convertedVecDF_test)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
and then run the code from the link, it works fine.
The class of all the variables (data, convertedVecDF, convertedVecDF_test, trainingData, testData) is Class[_ <: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = class org.apache.spark.sql.Dataset
When I separate out the variables as in the first case and use a very small test dataset (say 10 points), it works fine.
Why is that? It seems like a resource-access issue, but I can't understand how Spark is working here.
What can I do to run the first case, i.e. run with separate train/test data?
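For reference, the "first case" roughly follows the structure of the linked docs example, just with the indexers and the model fit on trainingData and the resulting model applied to the separately loaded testData. The following is only a sketch; the column names ("label", "features") and stages come from the linked example, not from my exact code:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}

// Index labels and categorical features using the training dataframe only
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(trainingData)
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(trainingData)
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")

val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf))
val model = pipeline.fit(trainingData)       // fit on the training dataframe
val predictions = model.transform(testData)  // apply to the separately loaded test dataframe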
EDIT
This problem was due to this issue: Error when passing data from a Dataframe into an existing ML VectorIndexerModel. Moving to Spark 2.3.1 solved the issue.
I have a very large file that contains individual JSONs which I would like to iterate through, turning each one into a Map using the Jackson library:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.ScalaObjectMapper
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val lines = sc.textFile(fileName)
On a single JSON string, I can perform the conversion without issues:
mapper.readValue[Map[String, Object]](JSONString)
to get my map.
However, if I try the following, iterating through an RDD[String] like so, I get an error:
lines.foreach(line => mapper.readValue[Map[String, Object]](line))
org.apache.spark.SparkException: Task not serializable
I can do lines.take(10000) or so and then work on that, but the file is so huge that I can't "take" or "collect" it all in one go, and I want to be able to use the same solution across files of all different sizes.
After each string becomes a Map, I need to perform functions on it and write the result back out to a string, so any solution that allows me to do that without going over my allocated memory will help. Thank you!
Managed to solve this with the below:
import scala.util.parsing.json._
val myMap = JSON.parseFull(jsonString).get.asInstanceOf[Map[String, Object]]
The above will work on an RDD[String].
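For example, applied across the RDD from above (a minimal sketch reusing lines and the cast from the solution):

val lines = sc.textFile(fileName)
val maps = lines.map(line => JSON.parseFull(line).get.asInstanceOf[Map[String, Object]])
// further per-record processing can be chained here before writing the results back out

This sidesteps the serialization error because nothing non-serializable (like the Jackson ObjectMapper above) needs to be captured in the closure.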
The following snippet works fine in Spark 2.2.1 but gives a rather cryptic runtime exception in Spark 2.3.0:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
case class X(xid: Long, yid: Int)
case class Y(yid: Int, zid: Long)
case class Z(zid: Long, b: Boolean)
val xs = Seq(X(1L, 10)).toDS()
val ys = Seq(Y(10, 100L)).toDS()
val zs = Seq.empty[Z].toDS()
val j = xs
.join(ys, "yid")
.join(zs, Seq("zid"), "left")
.withColumn("BAM", when('b, "B").otherwise("NB"))
j.show()
In Spark 2.2.1 it prints to the console
+---+---+---+----+---+
|zid|yid|xid| b|BAM|
+---+---+---+----+---+
|100| 10| 1|null| NB|
+---+---+---+----+---+
In Spark 2.3.0 it results in:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'BAM
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
...
The culprit really seems to be the Dataset being created from an empty Seq[Z]. When you change that to something that also results in an empty Dataset[Z], it works as in Spark 2.2.1, e.g.
val zs = Seq(Z(10L, true)).toDS().filter('zid === 999L)
In the migration guide from 2.2 to 2.3 the following is mentioned:
Since Spark 2.3, the Join/Filter’s deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters are not eligible for predicate pushdown.
Is this related, or a (known) bug?
@user9613318 there is a bug report created by the OP, but it was closed as "Cannot Reproduce" because the dev says that
I am not able to reproduce on current master. This must have been fixed.
but there is no reference to another underlying issue so it might remain a mystery.
I have worked around this on 2.3.0 by issuing someEmptyDataset.cache() right after the empty Dataset is created. With that change (zs.cache()), the OP's example no longer fails, and the actual problem at work also went away with this trick.
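Applied to the snippet from the question, the workaround looks like this (a sketch; only the zs.cache() line is added, everything else is unchanged):

val zs = Seq.empty[Z].toDS()
zs.cache()  // workaround: cache the empty Dataset right after creating it

val j = xs
  .join(ys, "yid")
  .join(zs, Seq("zid"), "left")
  .withColumn("BAM", when('b, "B").otherwise("NB"))

j.show()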
(As a side note, the OP's code doesn't fail for me on Spark 2.3.2 run locally. However, I don't see a related fix in the 2.3.2 change log, so maybe it's due to some other difference in the environment...)
I am learning Apache Spark and trying to execute a small program in the Scala shell.
I have started the DFS, YARN and history server using the following commands:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
and then in the Scala shell, I have written the following commands:
val lines = sc.textFile("/Users/****/Documents/backups/h/*****/input/ncdc/micro-tab/sample.txt");
val records = lines.map(_.split("\t"));
val filters = records.filter(rec => (rec(1) != "9999" && rec(2).matches("[01459]")));
val tuples = filters.map(rec => (rec(0).toInt, rec(1).toInt));
val maxTemps = tuples.reduceByKey((a,b) => Math.max(a,b));
All commands execute successfully, except the last one, which throws the following error:
error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]
I found some explanations like this one:
This comes from using a pair RDD function generically. The reduceByKey method is actually a method of the PairRDDFunctions class, which has an implicit conversion from RDD, so it requires several implicit typeclasses. Normally, when working with simple concrete types, those are already in scope. But you should be able to amend your method to also require those same implicits.
But I am not sure how to achieve this.
Any help on how to resolve this issue?
It seems you are missing an import. Try writing this in the console:
import org.apache.spark.SparkContext._
And then running the above commands. This import brings an implicit conversion which lets you use the reduceByKey method.
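For example, following that suggestion, the full shell session would look roughly like this (a sketch; the path is a placeholder for your own sample file):

import org.apache.spark.SparkContext._  // per the answer above, brings the pair-RDD implicit conversion into scope

val lines = sc.textFile("/path/to/sample.txt")
val records = lines.map(_.split("\t"))
val filters = records.filter(rec => rec(1) != "9999" && rec(2).matches("[01459]"))
val tuples = filters.map(rec => (rec(0).toInt, rec(1).toInt))
val maxTemps = tuples.reduceByKey((a, b) => Math.max(a, b))  // should now resolve via PairRDDFunctions
maxTemps.take(5).foreach(println)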
I've been trying to learn/use Scala for machine learning and to do that I need to convert string variables to an index of dummies.
The way I've done it is with the StringIndexer in Scala. Before running I've used df.na.fill("missing") to replace missing values. Even after I run that I still get a NullPointerException.
Is there something else I should be doing or something else I should be checking? I used printSchema to filter only on the string columns to get the list of columns I needed to run StringIndexer on.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning, you would train the model with one part of the data and then test it with another part. Hence, there are two different methods to reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a dataframe; it is a PipelineModel. A PipelineModel has a method transform() that takes data in the same format as the one used for training and gives a dataframe as output. In other words, you should add .transform(newDf1) after fit(newDf1).
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct dataframe is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
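For example, applied to the pipeline above (a minimal sketch reusing newDf1 and stages from the question):

val categorical = new Pipeline().setStages(stages)
val model = categorical.fit(newDf1)           // train the pipeline on the dataframe with nulls filled
val cat_reweight = model.transform(newDf1)    // apply it to get the *_index columns
cat_reweight.select("highest_tier_nm", "highest_tier_nm_index").show(5)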
It works for me when running locally; however, if you still get an error, you could try to cache() after replacing the nulls and then perform an action to make sure all transformations are done.
Hope it helps!
In order to load a file (delimited by |) into a DataFrame, I have developed the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
My case class used for the creation of my DataFrame is defined as followed:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I execute my program from Eclipse, it throws an error.
Is there something missing inside the Scala class that I'm using and running from Eclipse? Or what could be the reason that my functions work correctly in the spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I execute my program from Eclipse, it throws an error.
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable, and that holds your Spark computation back from being serialized and sent across the wire to the executors.
As #T. Gawęda has mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse, you have added some additional lines that you don't show, thinking that they are not necessary, when in fact quite the contrary is true. Find any places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the affecting method is LeerInsumosAValidar, which does the map transformation, and that's where you should seek the answer.
Your case class must have public scope. You can't have ArchivoProcesar nested inside a class.
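A minimal sketch of that fix, with the case class defined at the top level rather than inside the class that holds the SparkContext (the object name and SparkSession setup here are placeholders, not the OP's actual code):

// Top-level case class: no reference to an enclosing (non-serializable) outer class
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)

object CargaInsumos {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder().appName("insumos").getOrCreate()
    import spark.implicits._

    val file = spark.sparkContext.textFile("path/file/")
    val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
    val dfInsumos = rddFile.toDF()
    dfInsumos.show()
  }
}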