overloaded method value corr with alternatives - scala

I am trying to compute the correlation between two features that are read from two separate text files, as shown below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.stat.Statistics
import scala.io.Source

object Corr {
  def main() {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("Correlation")
      .getOrCreate()
    val sc = sparkSession.sparkContext

    val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray
    val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray

    val feature_1_dist = sc.parallelize(feature_1)
    val feature_2_dist = sc.parallelize(feature_2)

    val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
    println(s"Correlation is: $correlation")
  }
}

Corr.main()
However, I get the following error:
overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double],method: String)scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double],method: String)scala.Double
cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String], String)
val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
What I am trying to do looks very similar to the example here, but I can't figure it out.

As stated in the error message, you need to have an RDD[Double], but you have an RDD[String]. So you could do something like this (if you have one number per row):
val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray.map(_.toDouble)
val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray.map(_.toDouble)
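With both sides parsed to Double, the original call resolves against the (RDD[Double], RDD[Double], String) overload. A minimal sketch of the rest of the pipeline, reusing the question's variable names:

val feature_1_dist = sc.parallelize(feature_1) // now RDD[Double]
val feature_2_dist = sc.parallelize(feature_2)

val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
println(s"Correlation is: $correlation")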

Related

Convert RDD[RDD[T]] to RDD[T]

I think the title pretty much sums up what I am trying to do here. I have the following piece of code:

implicit val sc: SparkContext = spark.sparkContext

val result: RDD[RDD[GenericRecord]] = sc.parallelize(dates).map { date =>
  val foo: RDD[GenericRecord] = readSomething(...)
  foo
}
I want to convert result to an RDD of GenericRecord, but foo is not Traversable, so I can't use flatMap. Any ideas here?
As discussed here, Spark does not support nested RDDs. So even if I were able to flatMap this somehow, it would fail at runtime. What I ended up doing is the following:
implicit val sc: SparkContext = spark.sparkContext

val partials: IndexedSeq[RDD[GenericRecord]] = dates.map { date =>
  val foo: RDD[GenericRecord] = readSomething(...)
  foo
}

val result: RDD[GenericRecord] = sc.union(partials)
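A self-contained sketch of the same union pattern, with String standing in for GenericRecord and a hypothetical readSomething, just to show the shape:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = spark.sparkContext

// Hypothetical stand-in for the question's readSomething(...)
def readSomething(date: String): RDD[String] =
  sc.parallelize(Seq(s"record for $date"))

val dates = Seq("2019-01-01", "2019-01-02")

// Build the per-date RDDs on the driver, then merge them with a single union
val partials: Seq[RDD[String]] = dates.map(readSomething)
val result: RDD[String] = sc.union(partials)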

Scala/Spark - Create Dataset with one column from another Dataset

I am trying to create a Dataset with only one column from a case class. Below is the code:
case class vectorData(value: Array[String], vectors: Vector)

def main(args: Array[String]) {
  val spark = SparkSession.builder
    .appName("Hello world!")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  //blah blah and read data etc.
  val word2vec = new Word2Vec()
    .setInputCol("value").setOutputCol("vectors")
    .setVectorSize(5).setMinCount(0).setWindowSize(5)

  val dataset = spark.createDataset(data)
  val model = word2vec.fit(dataset)

  val encoder = org.apache.spark.sql.Encoders.product[vectorData]
  val result = model.transform(dataset).as(encoder)

  //val output: Dataset[Vector] = ???
}
As shown in the last line of the code, I want the output to be only the second column, which has Vector type with the vectors data.
I tried with:
val output = result.map(o => o.vectors)
But this line is highlighted with the error No implicit arguments of type: Encoder[Vector].
How to resolve this?
An implicit Encoder[Vector] in scope should make

val output = result.map(o => o.vectors)

compile. Note that Encoders.product[Vector] won't work here: product requires a case class (a Product), and Vector is a trait. Because the ML Vector type carries a registered UDT, you can derive an ExpressionEncoder for it instead, as sketched below.
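A minimal sketch, assuming the result Dataset from the question:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// ml Vector has a registered user-defined type (VectorUDT), so this encoder can be derived
implicit val vectorEncoder: Encoder[Vector] = ExpressionEncoder()

val output: Dataset[Vector] = result.map(o => o.vectors)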

Spark dataframe orderby using many columns in scala

In Spark 1.6, I would basically like to apply partition by and then order by using two columns, so that I can apply rank logic within each partition:
val str = "insertdatetime,a_load_dt"
val orderByList = str.split(",")
val ptr = "memberidnum"
val partitionsColumnsList = ptr.split(",").toList
val landingDF = hc.sql("""select memberidnum,insertdatetime,'2019-09-26' as a_load_dt from landing_omega.omegamaster""")
val stagingDF = hc.sql("""select memberidnum,insertdatetime,a_load_dt from staging_omega.omegamaster where recordstatus ='current'""")
val unionedDF = landingDF.unionAll(stagingDF)
unionedDF.registerTempTable("temp_table")
val windowFunction = Window.partitionBy(partitionsColumnsList.map(elem => col(elem)):_*).orderBy(unionedDF(orderByList(0),orderByList(1)).desc)
But it throws the below error:
scala> val windowFunction = Window.partitionBy(partitionsColumnsList.map(elem => col(elem)):_*).orderBy(unionedDF(orderByList(0),orderByList(1)).desc)
<console>:56: error: too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class DataFrame
       val windowFunction = Window.partitionBy(partitionsColumnsList.map(elem => col(elem)):_*).orderBy(unionedDF(orderByList(0),orderByList(1)).desc)
                                                                                                        ^
How do I fix this issue? I want to apply order by on two columns in descending order. Please help.
The error comes from unionedDF(orderByList(0), orderByList(1)): a DataFrame's apply method takes a single column name, so each column has to be looked up separately. You can simply do the below change:

val windowFunction = Window.partitionBy(partitionsColumnsList.head, partitionsColumnsList.tail: _*).orderBy(unionedDF(orderByList(0)).desc, unionedDF(orderByList(1)).desc)
Alternatively, you can use the below snippet, which scales to any number of partition and order-by columns:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.expressions.Window

Window.partitionBy(partitionsColumnsList.map(col): _*)
  .orderBy(orderByList.map(c => col(c).desc): _*)

If this does not work, please let me know.
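For the rank logic mentioned in the question, a minimal sketch on top of the fixed window (row_number is used here as one option; rank or dense_rank plug in the same way):

import org.apache.spark.sql.functions.{col, row_number}
import org.apache.spark.sql.expressions.Window

val windowFunction = Window
  .partitionBy(partitionsColumnsList.map(col): _*)
  .orderBy(orderByList.map(c => col(c).desc): _*)

// Keep only the newest record per member
val ranked = unionedDF.withColumn("rn", row_number().over(windowFunction))
val latest = ranked.filter(ranked("rn") === 1).drop("rn")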

Get Cluster_ID and the rest of table using Spark MLlib KMeans

I have this dataset (I'm showing just a few rows):
11.97,1355,401
3.49,25579,12908
9.29,129186,10882
28.73,10153,22356
3.69,22872,9798
13.49,160371,2911
24.36,106764,867
3.99,163670,16397
19.64,132547,401
And I'm trying to assign all these rows to 4 clusters using K-Means. For that I'm using the code from this post: Spark MLLib Kmeans from dataframe, and back again
val data = sc.textFile("/user/cloudera/TESTE1")
val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 4, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
val idCluster = idClusterRDD.toDF("purchase","id","product","cluster")
I'm getting these outputs:
scala> import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> val data = sc.textFile("/user/cloudera/TESTE")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/TESTE MapPartitionsRDD[7] at textFile at <console>:29
scala> val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
idPointRDD: org.apache.spark.rdd.RDD[(Char, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[8] at map at <console>:31
But when I run it I'm getting the following error:
java.lang.UnsupportedOperationException: Schema for type Char is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
How can I solve this problem?
Many thanks!
Here is the thing: you are actually reading the CSV values into an RDD of String and never converting them to proper numeric values. Since a string is a collection of characters, s(0), for example, gives you the first Char of each line; that compiles, because a Char can be widened to an Int or a Double, but it is not what you are actually looking for. You need to split your val data: RDD[String]:
val data: RDD[String] = ???

val idPointRDD = data.map { s =>
  s.split(",") match {
    case Array(x, y, z) => Vectors.dense(x.toDouble, y.toDouble, z.toDouble)
  }
}.cache()

This should work for you!
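If you also want to keep the first field as an id next to the predicted cluster, as in the question, a sketch of the rest of the pipeline (the column names are assumptions based on the question's toDF call):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Keep the first field as an id and use the other two as features,
// mirroring the question's (id, vector) pairs
val idPoint = data.map { s =>
  val Array(a, b, c) = s.split(",")
  (a, Vectors.dense(b.toDouble, c.toDouble))
}.cache()

val model = KMeans.train(idPoint.map(_._2), 4, 20)
val predictions = model.predict(idPoint.map(_._2))

// zip stays aligned because predict is a one-to-one map over the same RDD;
// toDF is available via the SQL implicits that spark-shell imports by default
val idCluster = idPoint.map(_._1).zip(predictions).toDF("id", "cluster")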

(Array/ML Vector/MLlib Vector) RDD to ML Vector Dataframe column

I need to convert an RDD to a single-column o.a.s.ml.linalg.Vector DataFrame in order to use the ML algorithms, K-Means specifically in this case. This is my RDD:
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.mllib.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
I tried doing what this answer suggests, with no luck; I suppose it's because you end up with an MLlib Vector, and it throws a mismatch error when running the algorithm. Now if I change this:
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
val schema = new StructType()
.add("features", new VectorUDT())
to this:
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.ml.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
val schema = new StructType()
.add("features", new VectorUDT())
I would get an error because ML VectorUDT is private.
I also tried converting the RDD to an array of doubles and a DataFrame, to then get the ML dense vector, like this:
var parsedData = sc.textFile("/home/pililo/Documents/Mi_Memoria/Codigo/Datasets/Digits/digits480x.csv").map(s => Row(s.split(',').slice(0,64).map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val schema2 = new StructType().add("features", ArrayType(DoubleType))
schema2: org.apache.spark.sql.types.StructType = StructType(StructField(features,ArrayType(DoubleType,true),true))
val df = spark.createDataFrame(parsedData, schema2)
df: org.apache.spark.sql.DataFrame = [features: array<double>]
val df2 = df.map{ case Row(features: Array[Double]) => Row(org.apache.spark.ml.linalg.Vectors.dense(features)) }
Which throws the following error, even though spark.implicits._ is imported:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Any help is greatly appreciated, thanks!
Off the top of my head:
Use csv source and VectorAssembler:
import scala.util.Try
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.feature.VectorAssembler

val path: String = ???
val n: Int = ???
val m: Int = ???

val raw = spark.read.csv(path)
val featureCols = raw.columns.slice(n, m)
val exprs = featureCols.map(c => col(c).cast("double"))

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

assembler.transform(raw.select(exprs: _*)).select($"features")
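The resulting single-column DataFrame can go straight into the ML K-Means the question mentions; a sketch, with k = 4 as an assumption:

import org.apache.spark.ml.clustering.KMeans

val vectorized = assembler.transform(raw.select(exprs: _*)).select($"features")
val kmeans = new KMeans().setK(4).setFeaturesCol("features")
val model = kmeans.fit(vectorized)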
Use text source and UDF:
def parse_(n: Int, m: Int)(s: String) = Try(
  Vectors.dense(s.split(',').slice(n, m).map(_.toDouble))
).toOption

def parse(n: Int, m: Int) = udf(parse_(n, m) _)

val raw = spark.read.text(path)
raw.select(parse(n, m)(col(raw.columns.head)).alias("features"))
Use text source and drop the wrapping Row:
spark.read.text(path).as[String].map(parse_(n, m)).toDF
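Note that parse_ returns an Option, so lines that fail to parse come back as null in the features column (or None in the last variant); if that can happen in your data, drop them before fitting, e.g.:

// Hypothetical cleanup step: remove rows where parsing failed
val features = raw.select(parse(n, m)(col(raw.columns.head)).alias("features"))
val clean = features.na.drop(Seq("features"))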