How to find statistics using stats() in Spark - Scala

I'm trying to display the stats() of a Double RDD:
val stat = x.map(s => s._2).stats()
where x.map(s => s._2) yields an RDD[Double].
In spark-shell, this runs fine.
But when I compile it with sbt and run it in Eclipse, it throws the following error:
value stats is not a member of org.apache.spark.rdd.RDD[Double]

Try importing the additional functions for RDDs of doubles.
Add `import org.apache.spark.SparkContext._` to your source.
This is a common error when an implicit conversion for the RDD class is not in scope.
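For a compiled project, a minimal sketch of the fix might look like the following (the object name and sample data are made up for illustration; on recent Spark versions the implicits are picked up automatically, but the import does no harm):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // implicit conversion to DoubleRDDFunctions, which provides stats()

object StatsExample {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StatsExample").setMaster("local[*]"))
    val x = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("c", 3.0)))  // illustrative (key, value) pairs
    val stat = x.map(s => s._2).stats()  // returns a StatCounter with count, mean, stdev, min, max
    println(stat)
    sc.stop()
  }
}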

Related

How to resolve error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]?

I am learning Apache Spark and trying to execute a small program in the Scala shell.
I have started the DFS, YARN, and history server using the following commands:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
Then, in the Scala shell, I ran the following commands:
val lines = sc.textFile("/Users/****/Documents/backups/h/*****/input/ncdc/micro-tab/sample.txt")
val records = lines.map(_.split("\t"))
val filters = records.filter(rec => rec(1) != "9999" && rec(2).matches("[01459]"))
val tuples = filters.map(rec => (rec(0).toInt, rec(1).toInt))
val maxTemps = tuples.reduceByKey((a, b) => Math.max(a, b))
All commands execute successfully except the last one, which throws the following error:
error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]
I found some explanations like this one:
This comes from using a pair RDD function generically. The reduceByKey method is actually a method of the PairRDDFunctions class, which has an implicit conversion from RDD, so it requires several implicit typeclasses. Normally, when working with simple concrete types, those are already in scope. But you should be able to amend your method to also require those same implicits.
But I am not sure how to achieve this.
Any help on how to resolve this issue?
It seems you are missing an import. Try writing this in the console:
import org.apache.spark.SparkContext._
Then run the commands above again. This import brings into scope the implicit conversion that lets you use the reduceByKey method.
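A minimal, self-contained illustration of the same fix, runnable in the spark-shell (the sample data is made up; in a compiled project the same import goes at the top of the source file):

import org.apache.spark.SparkContext._  // implicit conversion RDD[(K, V)] => PairRDDFunctions

val tuples = sc.parallelize(Seq((1950, 0), (1950, 22), (1949, 111)))  // illustrative (year, temperature) pairs
val maxTemps = tuples.reduceByKey((a, b) => Math.max(a, b))  // resolves now: reduceByKey comes from PairRDDFunctions
maxTemps.collect().foreach(println)  // prints (1950,22) and (1949,111), in some order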

Getting a not found error when using split with dataframes

I am working with Spark and Scala. I have the code below, which works fine in the spark-shell, but when I try to move it over to IntelliJ it throws an error stating that it is unable to find split.
What am I missing? What do I need to import for split to work?
var outputDF = inputDF.withColumn(srcColumn,
split(inputDF.col(srcColumn),splitBy).getItem(selectIndex))
You're probably missing this import:
import org.apache.spark.sql.functions._
(which is provided automatically in a Spark shell)
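A minimal sketch of the same pattern in a compiled project (assuming Spark 2.x with a SparkSession; the object name, column names, and sample data are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._  // split, col, and the other DataFrame functions

object SplitExample {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SplitExample").master("local[*]").getOrCreate()
    import spark.implicits._  // enables .toDF on local collections

    val inputDF = Seq("2017-01-15", "2018-06-30").toDF("date")  // illustrative data
    val outputDF = inputDF.withColumn("year",
      split(inputDF.col("date"), "-").getItem(0))  // same pattern as in the question
    outputDF.show()
    spark.stop()
  }
}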

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

With the purpose of loading a file (delimited by |) into a DataFrame, I have written the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
The case class used for the creation of my DataFrame is defined as follows:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I execute my program in Eclipse, it throws the "Task not serializable" error from the title.
Is there something missing inside the Scala class that I'm using and running with Eclipse? What could be the reason my functions work correctly in spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you, and then makes sure that references to the SparkContext do not end up in "sensitive places", i.e. inside closures that get serialized.
But when I execute my program in Eclipse, it throws the "Task not serializable" error from the title.
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext, which is not serializable, and that holds your Spark computation back from being serialized and sent across the wire to the executors.
As @T. Gawęda mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse, you probably added some additional lines that you don't show here, thinking they were not necessary, when in fact quite the contrary is true. Find the places where you create and reference the SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the affected method is LeerInsumosAValidar, which does the map transformation, and that's where you should look for the answer.
Your case class must have public scope: you can't have ArchivoProcesar nested inside another class.
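A minimal sketch of a layout that avoids the serialization trap (assuming Spark 2.x with a SparkSession; the object name is made up for illustration; the point is that the case class sits at the top level, outside any class that holds a SparkContext):

import org.apache.spark.sql.SparkSession

// The case class lives at the top level, not nested inside a class that holds a SparkContext,
// so closures referencing it do not drag a non-serializable outer instance along.
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)

object InsumosApp {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InsumosApp").master("local[*]").getOrCreate()
    import spark.implicits._  // enables rddFile.toDF() in compiled code

    val file = spark.sparkContext.textFile("path/file/")
    val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
    val dfInsumos = rddFile.toDF()
    dfInsumos.show()
    spark.stop()
  }
}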

pyspark FPGrowth doesn't work with RDD

I am trying to use the FPGrowth function on some data in Spark. I tested the example here with no problems:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
However, my dataset is coming from Hive:
data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)
This failed with a "method does not exist" error:
py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist
So, I converted it to an RDD:
data=data.rdd
Now I start getting some strange pickle serializer errors.
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
Then I started looking at the types. In the example, the data is run through a flatMap, which returns a different type than the RDD I get from the hiveContext:
RDD Type returned by flatmap: pyspark.rdd.PipelinedRDD
RDD Type returned by hiveContext: pyspark.rdd.RDD
FPGrowth only seems to work with the PipelinedRDD. Is there some way I can convert a regular RDD to a PipelinedRDD?
Thanks!
Well, my query was wrong, but I changed it to use collect_set, and then I managed to get around the type error by doing:
data=data.map(lambda row: row[0])
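For reference, the Scala side of the same MLlib API (shown on the docs page linked above) makes the expected input shape explicit: FPGrowth wants an RDD of item collections, one transaction per record, not an RDD of Rows. A small sketch with made-up data, using the shell's sc:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// one transaction per record, each transaction being a collection of items
val transactions: RDD[Array[String]] = sc.parallelize(Seq(
  Array("bread", "milk"),
  Array("bread", "butter"),
  Array("bread", "milk", "butter")))  // illustrative data

val model = new FPGrowth()
  .setMinSupport(0.1)
  .setNumPartitions(10)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}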

Scala code works on Spark Terminal but not in eclipse

The code below works fine when I run it in the Spark shell, but in Eclipse it throws an error. What might be the reason? Please let me know if you need more information.
val IABLabels= IAB.zip(labels)
val temp1 = IABLabels.groupBy(x => x._2).mapValues(_.map(_._1))
Error in Eclipse:
value mapValues is not a member of org.apache.spark.rdd.RDD[(Int, Iterable[(String, Int)])]
The code runs perfectly fine on Spark shell.
You should use this import to access the extra functions on RDDs of (key, value) pairs through an implicit conversion:
import org.apache.spark.SparkContext._
You can check the API docs for further details.
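A minimal sketch of the same pattern with made-up data (using the shell's sc; in a compiled project the import goes at the top of the source file):

import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions (mapValues, etc.)

val IAB = sc.parallelize(Seq("a", "b", "c", "d"))  // illustrative data
val labels = sc.parallelize(Seq(1, 2, 1, 2))
val IABLabels = IAB.zip(labels)

// group by the label, then keep only the original values for each label
val temp1 = IABLabels.groupBy(x => x._2).mapValues(_.map(_._1))
temp1.collect().foreach(println)  // e.g. (1,CompactBuffer(a, c)) and (2,CompactBuffer(b, d))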