Scala/Spark can't match function - scala

I'm trying to run the following command:
df = df.withColumn("DATATmp", to_date($"DATA", "yyyyMMdd"))
And getting this error:
<console>:34: error: too many arguments for method to_date: (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column
How can I specify exactly which function to import? Is there another way to avoid this error?
EDIT: Spark version 2.1

As can be seen in the detailed scaladoc, the to_date function with two parameters has been added in 2.2.0, whereas the one-argument version existed since 1.5.
If you are working with an older Spark version, either upgrade, or don't use this function.
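If you need to stay on Spark 2.1, one possible workaround is to parse the string with unix_timestamp and cast the result to a date; both functions have been available since 1.5. This is only a sketch, assuming DATA holds strings such as "20170131":
import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}

// Parse the yyyyMMdd string into epoch seconds, render it as a timestamp string,
// then cast to a DateType column
df = df.withColumn("DATATmp", from_unixtime(unix_timestamp($"DATA", "yyyyMMdd")).cast("date"))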

Related

type mismatch errors when upgrading from scala 2.9 to 2.13.2

I recently revived an old library that was written in scala 2.9, and I created a new scala project using scala 2.13.2
I am getting errors like the following:
type mismatch;
found : scala.collection.mutable.Buffer[Any]
[error] required: Seq[Any]
Was there a specific change between 2.9 and 2.13.2 involving implicit conversions of sequences, or something else that might explain many of these compile errors?
I had to add .toSeq to many of my function return statements that were vals of Buffer[Any] that needed to be passed as an argument to a function that expected a Seq.
Quite a lot of things have happened in the last 7+ years (including a rewrite of the collections library).
If adding .toSeq solves your problem - just go for it.
If you want to know what exactly has changed - try upgrading version by version: first upgrade to Scala 2.10.*, then to 2.11.*, then 2.12.*, and finally to 2.13.2.
At each upgrade you'll probably see deprecation warnings. Fix them before upgrading to the next version.
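For illustration, a minimal sketch of the .toSeq fix under Scala 2.13 (the names here are made up for the example):
import scala.collection.mutable

// In 2.13, scala.Seq is an alias for immutable.Seq, so a mutable.Buffer
// no longer conforms to it and has to be converted explicitly
def collectValues(): Seq[Any] = {
  val buf = mutable.Buffer[Any](1, "two", 3.0)
  buf.toSeq
}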
Brave, but perhaps bad form, to disturb the dead. Nevertheless, maybe pass the mutable.Buffer as mutable.Seq instead of Seq, which by default is immutable.Seq. Consider:
import scala.collection.mutable

val mb = mutable.Buffer(11, Some(42))
val ms: mutable.Seq[Any] = mb // OK
val is: Seq[Any] = mb // NOK: does not compile in 2.13

No implicits found for parameter evidence

I have a line of code in a scala app that takes a dataframe with one column and two rows, and assigns them to variables start and end:
val Array(start, end) = datesInt.map(_.getInt(0)).collect()
This code works fine when run in a REPL, but when I try to put the same line in a Scala object in IntelliJ, it inserts a grey (?: Encoder[Int]) before the .collect() statement and shows an inline error No implicits found for parameter evidence$6: Encoder[Int].
I'm pretty new to Scala and I'm not sure how to resolve this.
Spark needs to know how to serialize JVM types to send them from the workers to the master. In some cases the encoders can be generated automatically, and for some types there are explicit implementations written by the Spark developers. In either case they are passed implicitly. If your SparkSession is named spark, then you are missing the following line:
import spark.implicits._
As you are new to Scala: implicits are parameters that you don't have to pass explicitly. In your example, the map function requires an Encoder[Int]. By adding this import, it is brought into scope and thus passed automatically to the map function.
Check the Scala documentation to learn more.
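A minimal sketch of the fix (the table and column names are hypothetical; datesInt is assumed to be a DataFrame with a single integer column):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DateRange").getOrCreate()
// Brings Encoder[Int] (and the other common encoders) into implicit scope
import spark.implicits._

val datesInt = spark.table("dates_table").select("date_int") // hypothetical source
val Array(start, end) = datesInt.map(_.getInt(0)).collect()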

Cannot resolve symbol mapReduceTriplets

I am using Spark 2.2, Scala 2.11 and GraphX. When I try to compile the following code in IntelliJ, I get the error Cannot resolve symbol mapReduceTriplets:
val nodeWeightMapFunc = (e:EdgeTriplet[VD,Long]) => Iterator((e.srcId,e.attr), (e.dstId,e.attr))
val nodeWeightReduceFunc = (e1:Long,e2:Long) => e1+e2
val nodeWeights = graph.mapReduceTriplets(nodeWeightMapFunc,nodeWeightReduceFunc)
I was reading here that it's possible to substitute mapReduceTriplets with aggregateMessages, but it's unclear how exactly I can do it.
mapReduceTriplets belonged to the legacy API and has been removed from the public API. Specifically, if you check the current documentation:
In earlier versions of GraphX neighborhood aggregation was accomplished using the mapReduceTriplets operator
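As a rough sketch (not tested against your graph), the same map/reduce pair can be expressed with aggregateMessages, whose EdgeContext exposes sendToSrc and sendToDst instead of returning an Iterator:
// Equivalent of the mapReduceTriplets call above
val nodeWeights = graph.aggregateMessages[Long](
  ctx => { ctx.sendToSrc(ctx.attr); ctx.sendToDst(ctx.attr) }, // send the edge weight to both endpoints
  (w1, w2) => w1 + w2 // sum the weights arriving at each vertex
)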

No such element exception in machine learning pipeline using scala

I am trying to implement an ML pipeline in Spark using Scala, and I used the sample code available on the Spark website. I am converting my RDD[LabeledPoint] into a data frame using the functions available in the SQLContext package. It gives me a NoSuchElementException:
Code Snippet:
Error Message:
Error at the line Pipeline.fit(training_df)
The type Vector you have inside your for-loop (prob: Vector) takes a type parameter, such as Vector[Double] or Vector[String]. You just need to specify the type of data your vector will store.
As a side note: the single-argument overloaded version of createDataFrame() you use appears to be experimental; keep that in mind if you are planning to use it in a long-term project.
The pipeline in your code snippet is currently empty, so there is nothing to be fit. You need to specify the stages using .setStages(). See the example in the spark.ml documentation here.
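For reference, a sketch of a non-empty pipeline along the lines of the spark.ml example (the stages here are placeholders; substitute whatever transformers and estimator your pipeline actually needs):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Placeholder stages, mirroring the spark.ml Pipeline example
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The pipeline needs its stages set before fit() has anything to train
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training_df)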

Why KMeansModel.predict error has started to appear since Spark 1.0.1.?

I work with Scala (version 2.10.4) and Spark - I have moved to Spark version 1.0.1 and noticed that one of my scripts is not working correctly now. It uses the k-means method from the MLlib library in the following manner.
Assume I have a KMeansModel object named clusters:
scala> clusters.toString
res8: String = org.apache.spark.mllib.clustering.KMeansModel#689eab53
Here is my method in question and an error I receive while trying to compile it:
scala> def clustersSize(normData: RDD[Array[Double]]) = {
| normData.map(r => clusters.predict(r))
| }
<console>:28: error: overloaded method value predict with alternatives:
(points: org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.api.java.JavaRDD[Integer] <and>
(points: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.rdd.RDD[Int] <and>
(point: org.apache.spark.mllib.linalg.Vector)Int
cannot be applied to (Array[Double])
normData.map(r => clusters.predict(r))
The KMeansModel documentation clearly says that the predict function needs an argument of type Array[Double], and I think I do pass (don't I?) an argument of that type to it. Thank you in advance for any suggestions on what I am doing wrong.
You're using Spark 1.0.1 but the documentation page you cite is for 0.9.0. Check the current documentation and you'll see that the API has changed. See the migration guide for background.
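As a sketch of the adjustment for the newer API (assuming clusters is your KMeansModel): predict now takes an MLlib Vector rather than an Array[Double], so wrap each array with Vectors.dense.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

def clustersSize(normData: RDD[Array[Double]]) =
  // predict expects an org.apache.spark.mllib.linalg.Vector, not a raw Array[Double]
  normData.map(r => clusters.predict(Vectors.dense(r)))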