How to convert a Dataset[Seq[T]] to Dataset[T] in Spark - scala

How do I convert a Dataset[Seq[T]] to Dataset[T]?
For example, Dataset[Seq[Car]] to Dataset[Car].

You can use flatMap:
import spark.implicits._
val df = Seq(Seq(1, 2, 3), Seq(4, 5, 6, 7)).toDF("s").as[Seq[Int]]
df.flatMap(x => x.toList)
You can also try the explode function, which works when the elements are structs such as case classes:
df.select(explode('s)).select("col.*").as[Car]
Full example:
import org.apache.spark.sql.functions._
import spark.implicits._
case class Car(i: Int)
val df = Seq(List(Car(1), Car(2), Car(3))).toDF("s").as[List[Car]]
val df1 = df.flatMap(x => x.toList)
val df2 = df.select(explode('s)).select("col.*").as[Car]
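For the generic Dataset[Seq[T]] case asked about in the question, the flatMap approach can probably be wrapped in a small helper. The following is only a minimal sketch: the names FlattenDataset and flattenDs are made up for illustration, and it assumes a local SparkSession.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Hypothetical helper object (the names are illustrative, not from the answer above).
object FlattenDataset {
  // Flattens a Dataset[Seq[T]] into a Dataset[T].
  // The Encoder context bound lets Spark serialize the element type T.
  def flattenDs[T: Encoder](ds: Dataset[Seq[T]]): Dataset[T] =
    ds.flatMap[T]((xs: Seq[T]) => xs)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flatten-dataset").master("local[*]").getOrCreate()
    import spark.implicits._

    val nested: Dataset[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6, 7)).toDS()
    val flat: Dataset[Int] = flattenDs(nested)
    flat.show()

    spark.stop()
  }
}
```

To use this with a case class such as Car, define the case class at the top level (not inside a method) so that Spark can derive an encoder for it.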

Related

Error in types double and DenseVector[Double]

The following code is the answer to this question: Anomaly detection with PCA in Spark
import breeze.linalg.{DenseVector, inv}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, StandardScaler,VectorAssembler}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
object SparkApp extends App {
val session = SparkSession.builder()
.appName("spark-app").master("local[*]").getOrCreate()
session.sparkContext.setLogLevel("ERROR")
import session.implicits._
val df = Seq(
(1, 4, 0),
(3, 4, 0),
(1, 3, 0),
(3, 3, 0),
(67, 37, 0) //outlier
).toDF("x", "y", "z")
val vectorAssembler = new VectorAssembler().setInputCols(Array("x", "y", "z")).setOutputCol("vector")
val standardScalar = new StandardScaler().setInputCol("vector").setOutputCol("normalized-vector").setWithMean(true)
.setWithStd(true)
val pca = new PCA().setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)
val pipeline = new Pipeline().setStages(
Array(vectorAssembler, standardScalar, pca)
)
val pcaDF = pipeline.fit(df).transform(df)
def withMahalanobois(df: DataFrame, inputCol: String): DataFrame = {
val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head
val invCovariance = inv(new breeze.linalg.DenseMatrix(2, 2, coeff1.toArray))
val mahalanobois = udf[Double, Vector] { v =>
val vB = DenseVector(v.toArray)
vB.t * invCovariance * vB
}
df.withColumn("mahalanobois", mahalanobois(df(inputCol)))
}
val withMahalanobois: DataFrame = withMahalanobois(pcaDF, "pca-features")
session.close()
}
But when I try to run it, it fails on this line:
vB.t * invCovariance * vB
Error message:
type mismatch: found breeze.linalg.DenseVector[Double], required: Double
How can I solve this?

Spark Scala Dataframe convert a column of Array of Struct to a column of Map

I am new to Scala.
I have a Dataframe with fields
ID:string, Time:timestamp, Items:array(struct(name:string,ranking:long))
I want to convert each row of the Items field to a hashmap, with the name as the key.
I am not very sure how to do this.
This can be done using a UDF:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// Sample data:
val df = Seq(
("id1", "t1", Array(("n1", 4L), ("n2", 5L))),
("id2", "t2", Array(("n3", 6L), ("n4", 7L)))
).toDF("ID", "Time", "Items")
// Create UDF converting array of (String, Long) structs to Map[String, Long]
val arrayToMap = udf[Map[String, Long], Seq[Row]] {
array => array.map { case Row(key: String, value: Long) => (key, value) }.toMap
}
// apply UDF
val result = df.withColumn("Items", arrayToMap($"Items"))
result.show(false)
// +---+----+---------------------+
// |ID |Time|Items |
// +---+----+---------------------+
// |id1|t1 |Map(n1 -> 4, n2 -> 5)|
// |id2|t2 |Map(n3 -> 6, n4 -> 7)|
// +---+----+---------------------+
I can't see a way to do this without a UDF (using Spark's built-in functions only).
Since 2.4.0, one can use map_from_entries:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(Array(("n1", 4L), ("n2", 5L))),
(Array(("n3", 6L), ("n4", 7L)))
).toDF("Items")
df.select(map_from_entries($"Items")).show
/*
+-----------------------+
|map_from_entries(Items)|
+-----------------------+
| [n1 -> 4, n2 -> 5]|
| [n3 -> 6, n4 -> 7]|
+-----------------------+
*/
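To keep the ID and Time columns from the question while converting Items, map_from_entries can presumably be combined with withColumn. This is a sketch under assumptions: the object and method names (ItemsToMap, itemsToMap) are hypothetical, Time is a plain string as in the sample data above, and Spark 2.4.0 or later is available.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, map_from_entries}

// Hypothetical helper object (names are illustrative only).
object ItemsToMap {
  // Replaces the array-of-struct column "Items" with a map column of the same name.
  // map_from_entries is available from Spark 2.4.0 onward.
  def itemsToMap(df: DataFrame): DataFrame =
    df.withColumn("Items", map_from_entries(col("Items")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("items-to-map").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample data shaped like the question's schema (Time kept as a string here).
    val df = Seq(
      ("id1", "t1", Seq(("n1", 4L), ("n2", 5L))),
      ("id2", "t2", Seq(("n3", 6L), ("n4", 7L)))
    ).toDF("ID", "Time", "Items")

    val result = itemsToMap(df)
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```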

Spark - Make dataframe with multi column csv

origin.csv
no,key1,key2,key3,key4,key5,...
1,A1,B1,C1,D1,E1,..
2,A2,B2,C2,D2,E2,..
3,A3,B3,C3,D3,E3,..
WhatIwant.csv
1,A1,key1
1,B1,key2
1,C1,key3
...
3,A3,key1
3,B3,key2
...
I loaded origin.csv into a DataFrame with the read method, but I am unable to convert it.
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
Any ideas?
Try this.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1,"A1","B1","C1","D1"), (2,"A2","B2","C2","D2"), (3,"A3","B3","C3","D3")).toDF("no", "key1", "key2","key3","key4")
df.show
def myUDF(df: DataFrame, by: Seq[String]): DataFrame = {
val (columns, types) = df.dtypes.filter{ case (clm, _) => !by.contains(clm)}.unzip
require(types.distinct.size == 1)
val keys = explode(array(
columns.map(clm => struct(lit(clm).alias("key"),col(clm).alias("val"))): _*
))
val byValue = by.map(col(_))
df.select(byValue :+ keys.alias("_key"): _*).select(byValue ++ Seq($"_key.val", $"_key.key"): _*)
}
val df1 = myUDF(df, Seq("no"))
df1.show
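An alternative that may be worth trying is Spark SQL's stack generator, which produces one output row per (key, value) pair. The sketch below is not the method from the answer above. The names UnpivotWithStack and unpivot are hypothetical, and it assumes that all unpivoted columns share one data type and that column names contain no quotes or backticks.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object (names are illustrative only).
object UnpivotWithStack {
  // Turns every column not listed in `by` into rows of (by..., val, key).
  def unpivot(df: DataFrame, by: Seq[String]): DataFrame = {
    val valueCols = df.columns.filterNot(by.contains)
    require(valueCols.nonEmpty, "need at least one column to unpivot")
    // Builds pairs such as: 'key1', `key1`, 'key2', `key2`
    val pairs = valueCols.map(c => s"'$c', `$c`").mkString(", ")
    val stackExpr = s"stack(${valueCols.length}, $pairs) as (`key`, `val`)"
    df.selectExpr(by.map(c => s"`$c`") :+ stackExpr: _*)
      .select((by ++ Seq("val", "key")).map(col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unpivot-stack").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "A1", "B1", "C1"), (2, "A2", "B2", "C2"))
      .toDF("no", "key1", "key2", "key3")
    unpivot(df, Seq("no")).show()

    spark.stop()
  }
}
```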

Can't run LDA on Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)] in Spark 2.0

I am following this tutorial video on an LDA example and I'm getting the following error:
<console>:37: error: overloaded method value run with alternatives:
(documents: org.apache.spark.api.java.JavaPairRDD[java.lang.Long,org.apache.spark.mllib.linalg.Vector])org.apache.spark.mllib.clustering.LDAModel <and>
(documents: org.apache.spark.rdd.RDD[(scala.Long, org.apache.spark.mllib.linalg.Vector)])org.apache.spark.mllib.clustering.LDAModel
cannot be applied to (org.apache.spark.sql.Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)])
val model = run(lda_countVector)
^
So I want to convert this DataFrame to an RDD, but the result is always a Dataset. Can anyone please look into this issue?
// Convert DF to RDD
import org.apache.spark.mllib.linalg.Vector
val lda_countVector = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
// import org.apache.spark.mllib.linalg.Vector
// lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.mllib.linalg.Vector)] = [_1: bigint, _2: vector]
The Spark API changed between the 1.x and 2.x branches. In particular, DataFrame.map now returns a Dataset, not an RDD, so the result is not compatible with the old RDD-based MLlib API. You should convert the data to an RDD first, as follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
val a = Vectors.dense(Array(1.0, 2.0, 3.0))
val b = Vectors.dense(Array(3.0, 4.0, 5.0))
val df = Seq((1L ,a), (2L, b), (2L, a)).toDF
val ldaDF = df.rdd.map {
case Row(id: Long, countVector: Vector) => (id, countVector)
}
val model = new LDA().setK(3).run(ldaDF)
Or you can convert to a typed Dataset and then to an RDD:
val model = new LDA().setK(3).run(df.as[(Long, Vector)].rdd)
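The difference between the two map calls can be seen in isolation, without MLlib. The following is a small sketch with made-up names (MapVersusRddMap, toDataset, toRdd) and a string column standing in for the vector column. The explicit return types show which call yields a Dataset and which yields an RDD.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical demo object (names are illustrative only).
object MapVersusRddMap {
  // In Spark 2.x and later, map on a DataFrame yields a Dataset.
  def toDataset(spark: SparkSession, df: DataFrame): Dataset[(Long, String)] = {
    import spark.implicits._
    df.map { case Row(id: Long, label: String) => (id, label) }
  }

  // Going through .rdd first yields an RDD, which RDD-based APIs accept.
  def toRdd(df: DataFrame): RDD[(Long, String)] =
    df.rdd.map { case Row(id: Long, label: String) => (id, label) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("map-vs-rdd-map").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "a"), (2L, "b")).toDF("id", "label")
    println(toDataset(spark, df).getClass.getName)
    println(toRdd(df).getClass.getName)

    spark.stop()
  }
}
```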

Spark: How to transform LabeledPoint features values from int to 0/1?

I want to run Naive Bayes in Spark, but to do this I have to transform the feature values in my LabeledPoint to 0/1. My LabeledPoint looks like this:
scala> transformedData.collect()
res29: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0...
How can I transform those feature values into 1 (it's a sparse representation, so there will be no 0)?
I guess you're looking for something like this:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val transformedData = sc.parallelize(Seq(
LabeledPoint(1.0, Vectors.sparse(5, Array(1, 3), Array(9.0, 3.2))),
LabeledPoint(5.0, Vectors.sparse(5, Array(0, 2, 4), Array(1.0, 2.0, 3.0)))
))
def binarizeFeatures(rdd: RDD[LabeledPoint]) = rdd.map {
case LabeledPoint(label, features) =>
val v = features.toSparse
// Keep the label and the indices; set every nonzero value to 1.0
LabeledPoint(label,
Vectors.sparse(v.size, v.indices, Array.fill(v.numNonzeros)(1.0)))
}
binarizeFeatures(transformedData).collect
// Array[org.apache.spark.mllib.regression.LabeledPoint] = Array(
// (1.0,(5,[1,3],[1.0,1.0])),
// (5.0,(5,[0,2,4],[1.0,1.0,1.0])))