Convert RDD[RDD[T]] to RDD[T] - scala

I think the title pretty much sums up what I am trying to do here. I have the following piece of code
implicit val sc: SparkContext = spark.sparkContext
val result = RDD[RDD[GenericRecord]] = sc.parallelize(dates).map { date =>
val foo: RDD[GenericRecord] = readSomething(...)
foo
}
I want to convert result to an RDD of GenericRecord but foo is not Traversable so that I can use flatMap. Any ideas here?

As discussed here, Spark does not support nested RDDs. So even if I was able to flat map this somehow it would fail on runtime. What I ended up doing is the following:
implicit val sc: SparkContext = spark.sparkContext
val partials = IndexedSeq[RDD[GenericRecord]] = dates.map { date =>
val foo: RDD[GenericRecord] = readSomething(...)
foo
}
val result:RDD[GenericRecord] = sc.union(partials)

Related

Scala/Spark - Create Dataset with one column from another Dataset

I am trying to create a Dataset with only one column from Case Class.
Below is the code:
case class vectorData(value: Array[String], vectors: Vector)
def main(args: Array[String]) {
val spark = SparkSession.builder
.appName("Hello world!")
.master("local[*]")
.getOrCreate()
import spark.implicits._
//blah blah and read data etc.
val word2vec = new Word2Vec()
.setInputCol("value").setOutputCol("vectors")
.setVectorSize(5).setMinCount(0).setWindowSize(5)
val dataset = spark.createDataset(data)
val model = word2vec.fit(dataset)
val encoder = org.apache.spark.sql.Encoders.product[vectorData]
val result = model.transform(dataset).as(encoder)
//val output: Dataset[Vector] = ???
}
As shown in last line of the code, I want the output to be only the 2nd column which has Vector type with vectors data.
I tried with:
val output = result.map(o => o.vectors)
But this line highlighted error No implicit arguments of type: Encoder[Vector]
How to resolve this?
I think line:
implicit val vectorEncoder: Encoder[Vector] = org.apache.spark.sql.Encoders.product[Vector]
should make
val output = result.map(o => o.vectors)
correct

How to add a new method to DataFrame type?

Imagine I have this Scala function that operates upon a Spark dataframe:
class MyClass {
def makeColumnNull(df: DataFrame, columnToMakeNull: String): DataFrame = {
val colType = df.select(columnToMakeNull).schema.head.dataType
df.withColumn(columnToMakeNull, lit(null).cast(colType))
}
}
I call it like so:
val df = spark.range(0,10).toDF()
val df2 = MyClass.makeColumnNull(df, "id")
That works fine however it doesn't work in the same fluent manner as Spark's API. What I'd like to is rewrite my function in a way that enables me to do this:
val df2 = df.makeColumnNull("id")
Can anyone help?
Implicit classes is the way to go, I've used them to extend several spark classes. So you need this:
package com.mycompany.utils.spark
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
object DataFrameExtensions {
implicit class DataFrameWrapper(df: DataFrame) {
def makeColumnNull(columnToMakeNull: String): DataFrame = {
val colType = df.select(columnToMakeNull).schema.head.dataType
df.withColumn(columnToMakeNull, lit(null).cast(colType))
}
}
}
then you have to import com.mycompany.utils.spark.DataFrameExtensions._ and you will able to invoke makeColumnNull() against any DataFrame object

overloaded method value corr with alternatives

I am trying compute the correlation between 2 features, that are being read from two separate text files as shown below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.stat.Statistics
import scala.io.Source
object Corr {
def main() {
val sparkSession = SparkSession.builder
.master("local")
.appName("Correlation")
.getOrCreate()
val sc = sparkSession.sparkContext
val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray
val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray
val feature_1_dist = sc.parallelize(feature_1)
val feature_2_dist = sc.parallelize(feature_2)
val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
println(s"Correlation is: $correlation")
}
}
Corr.main()
However, I get the following error:
overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double],method: String)scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double],method: String)scala.Double
cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String], String)
val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
What I am trying to do, looks very similar to the example here but I can't figure it out.
As it is stated in error message, you need to have a RDD[Double], but you have RDD[String]. So, you could do something like this (if you have one number per row):
val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray.map(_.toDouble)
val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray.map(_.toDouble)

How to use feature extraction with DStream in Apache Spark

I have data that arrive from Kafka through DStream. I want to perform feature extraction in order to obtain some keywords.
I do not want to wait for arrival of all data (as it is intended to be continuous stream that potentially never ends), so I hope to perform extraction in chunks - it doesn't matter to me if the accuracy will suffer a bit.
So far I put together something like that:
def extractKeywords(stream: DStream[Data]): Unit = {
val spark: SparkSession = SparkSession.builder.getOrCreate
val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
streamWithFeatures.print()
}
def extractFeatures(spark: SparkSession)
(rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
val df = spark.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
val rawFeatures = hashingTF.transform(df)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(rawFeatures)
val rescaledData = idfModel.transform(rawFeature)
import spark.implicits._
rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. - I am not surprised as I just try out to scrap things together, and I understand that since I am not waiting for an arrival of some data, the generated model might be empty when I try to use it on data.
What would be the right approach for this problem?
I used advises from comments and split the procedure into 2 runs:
one that calculated IDF model and saves it to file
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
val session: SparkSession = SparkSession.builder.getOrCreate
val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedDf)
idfModel.write.save(idfModelFile.getAbsolutePath)
}
one that reads IDF model from file and simply runs it on all incoming information
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")

(Array/ML Vector/MLlib Vector) RDD to ML Vector Dataframe coulmn

I need to convert an RDD to a single column o.a.s.ml.linalg.Vector DataFrame, in order to use the ML algorithms, specifically K-Means for this case. This is my RDD:
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.mllib.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
I tried doing what this answer suggests with no luck, I suppose because you end up with a MLlib Vector, it throws a mismatch error when running the algorithm. Now if I change this:
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
val schema = new StructType()
.add("features", new VectorUDT())
to this:
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.ml.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
val schema = new StructType()
.add("features", new VectorUDT())
I would get an error because ML VectorUDT is private.
I also tried converting the RDD as an array of doubles to Dataframe, and get the ML Dense Vector like this:
var parsedData = sc.textFile("/home/pililo/Documents/Mi_Memoria/Codigo/Datasets/Digits/digits480x.csv").map(s => Row(s.split(',').slice(0,64).map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val schema2 = new StructType().add("features", ArrayType(DoubleType))
schema2: org.apache.spark.sql.types.StructType = StructType(StructField(features,ArrayType(DoubleType,true),true))
val df = spark.createDataFrame(parsedData, schema2)
df: org.apache.spark.sql.DataFrame = [features: array<double>]
val df2 = df.map{ case Row(features: Array[Double]) => Row(org.apache.spark.ml.linalg.Vectors.dense(features)) }
Which throws the following error, even though spark.implicits._ is imported:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Any help is greatly appreciated, thanks!
Out of the top of my head:
Use csv source and VectorAssembler:
import scala.util.Try
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.feature.VectorAssembler
val path: String = ???
val n: Int = ???
val m:Int = ???
val raw = spark.read.csv(path)
val featureCols = raw.columns.slice(n, m)
val exprs = featureCols.map(c => col(c).cast("double"))
val assembler = new VectorAssembler()
.setInputCols(featureCols)
.setOutputCol("features")
assembler.transform(raw.select(exprs: _*)).select($"features")
Use text source and UDF:
def parse_(n: Int, m: Int)(s: String) = Try(
Vectors.dense(s.split(',').slice(n, m).map(_.toDouble))
).toOption
def parse(n: Int, m: Int) = udf(parse_(n, m) _)
val raw = spark.read.text(path)
raw.select(parse(n, m)(col(raw.columns.head)).alias("features"))
Use text source and drop wrapping Row
spark.read.text(path).as[String].map(parse_(n, m)).toDF