Scala/Spark: Converting zero inflated data in dataframe to libsvm - scala

I am very new to scala (typically I do this in R)
I have imported a large dataframe (2000+ columns, 100000+ rows) that is zero-inflated.
Task
To convert the data to libsvm format
Steps
As I understand the steps are as follows
Ensure feature columns are set to DoubleType and Target is an Int
Iterate through each row, retaining each value >0 in one array and index of its column in another array
Convert to RDD[LabeledPoint]
Save RDD in libsvm format
I am stuck on 3 (but maybe) because I am doing step 2 wrong.
Here is my code:
Main Function:
#Test
def testSpark(): Unit =
{
try
{
var mDF: DataFrame = spark.read.option("header", "true").option("inferSchema", "true").csv("src/test/resources/knimeMergedTRimmedVariables.csv")
val mDFTyped = castAllTypedColumnsTo(mDF, IntegerType, DoubleType)
val indexer = new StringIndexer()
.setInputCol("Majors_Final")
.setOutputCol("Majors_Final_Indexed")
val mDFTypedIndexed = indexer.fit(mDFTyped).transform(mDFTyped)
val mDFFinal = castColumnTo(mDFTypedIndexed,"Majors_Final_Indexed", IntegerType)
//only doubles accepted by sparse vector, so that's what we filter for
val fieldSeq: scala.collection.Seq[StructField] = schema.fields.toSeq.filter(f => f.dataType == DoubleType)
val fieldNameSeq: Seq[String] = fieldSeq.map(f => f.name)
val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()
assertTrue(true)
}
catch
{
case ex: Exception =>
{
println(s"There has been an Exception. Message is ${ex.getMessage} and ${ex}")
fail()
}
}
}
Convert each row to LabeledPoint:
#throws(classOf[Exception])
private def convertRowToLabeledPoint(rowIn: Row, fieldNameSeq: Seq[String], label:Int): LabeledPoint =
{
try
{
val values: Map[String, Double] = rowIn.getValuesMap(fieldNameSeq)
val sortedValuesMap = ListMap(values.toSeq.sortBy(_._1): _*)
val rowValuesItr: Iterable[Double] = sortedValuesMap.values
var positionsArray: ArrayBuffer[Int] = ArrayBuffer[Int]()
var valuesArray: ArrayBuffer[Double] = ArrayBuffer[Double]()
var currentPosition: Int = 0
rowValuesItr.foreach
{
kv =>
if (kv > 0)
{
valuesArray += kv;
positionsArray += currentPosition;
}
currentPosition = currentPosition + 1;
}
val lp:LabeledPoint = new LabeledPoint(label, org.apache.spark.mllib.linalg.Vectors.sparse(positionsArray.size,positionsArray.toArray, valuesArray.toArray))
return lp
}
catch
{
case ex: Exception =>
{
throw new Exception(ex)
}
}
}
Problem
So then I try to create a dataframe of labeledpoints which can easily be converted to an RDD.
val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()
But I get the following error:
SparkTest.scala:285: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for seri
alizing other types will be added in future releases.
[INFO] val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()

OK, so I skipped the DataFrame and created an Array of LabeledPoints whish is easily converted to an RDD. The rest is easy.
I stress, that while this works, I am new to scala and there may be more efficient ways to do this.
Main Function is now as follows:
val mDF: DataFrame = spark.read.option("header", "true").option("inferSchema", "true").csv("src/test/resources/knimeMergedTRimmedVariables.csv")
val mDFTyped = castAllTypedColumnsTo(mDF, IntegerType, DoubleType)
val indexer = new StringIndexer()
.setInputCol("Majors_Final")
.setOutputCol("Majors_Final_Indexed")
val mDFTypedIndexed = indexer.fit(mDFTyped).transform(mDFTyped)
val mDFFinal = castColumnTo(mDFTypedIndexed,"Majors_Final_Indexed", IntegerType)
mDFFinal.show()
//only doubles accepted by sparse vector, so that's what we filter for
val fieldSeq: scala.collection.Seq[StructField] = mDFFinal.schema.fields.toSeq.filter(f => f.dataType == DoubleType)
val fieldNameSeq: Seq[String] = fieldSeq.map(f => f.name)
var positionsArray: ArrayBuffer[LabeledPoint] = ArrayBuffer[LabeledPoint]()
mDFFinal.collect().foreach
{
row => positionsArray+=convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"));
}
val mRdd:RDD[LabeledPoint]= spark.sparkContext.parallelize(positionsArray.toSeq)
MLUtils.saveAsLibSVMFile(mRdd, "./output/libsvm")

Related

Spark: FlatMap and CountVectorizer pipeline

I working on the pipeline and try to split the column value before passing it to CountVectorizer.
For this purpose I made a custom Transformer.
class FlatMapTransformer(override val uid: String)
extends Transformer {
/**
* Param for input column name.
* #group param
*/
final val inputCol = new Param[String](this, "inputCol", "The input column")
final def getInputCol: String = $(inputCol)
/**
* Param for output column name.
* #group param
*/
final val outputCol = new Param[String](this, "outputCol", "The output column")
final def getOutputCol: String = $(outputCol)
def setInputCol(value: String): this.type = set(inputCol, value)
def setOutputCol(value: String): this.type = set(outputCol, value)
def this() = this(Identifiable.randomUID("FlatMapTransformer"))
private val flatMap: String => Seq[String] = { input: String =>
input.split(",")
}
override def copy(extra: ParamMap): SplitString = defaultCopy(extra)
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), explode(flatMapUdf(col($(inputCol)))))
}
override def transformSchema(schema: StructType): StructType = {
val dataType = schema($(inputCol)).dataType
require(
dataType.isInstanceOf[StringType],
s"Input column must be of type StringType but got ${dataType}")
val inputFields = schema.fields
require(
!inputFields.exists(_.name == $(outputCol)),
s"Output column ${$(outputCol)} already exists.")
DataTypes.createStructType(
Array(
DataTypes.createStructField($(outputCol), DataTypes.StringType, false)))
}
}
The code seems legit, but when I try to chain it with other operation the problem occurs. Here is my pipeline:
val train = reader.readTrainingData()
val cat_features = getFeaturesByType(taskConfig, "categorical")
val num_features = getFeaturesByType(taskConfig, "numeric")
val cat_ohe_features = getFeaturesByType(taskConfig, "categorical", Some("ohe"))
val cat_features_string_index = cat_features.
filter { feature: String => !cat_ohe_features.contains(feature) }
val catIndexer = cat_features_string_index.map {
feature =>
new StringIndexer()
.setInputCol(feature)
.setOutputCol(feature + "_index")
.setHandleInvalid("keep")
}
val flatMapper = cat_ohe_features.map {
feature =>
new FlatMapTransformer()
.setInputCol(feature)
.setOutputCol(feature + "_transformed")
}
val countVectorizer = cat_ohe_features.map {
feature =>
new CountVectorizer()
.setInputCol(feature + "_transformed")
.setOutputCol(feature + "_vectorized")
.setVocabSize(10)
}
// val countVectorizer = cat_ohe_features.map {
// feature =>
//
// val flatMapper = new FlatMapTransformer()
// .setInputCol(feature)
// .setOutputCol(feature + "_transformed")
//
// new CountVectorizer()
// .setInputCol(flatMapper.getOutputCol)
// .setOutputCol(feature + "_vectorized")
// .setVocabSize(10)
// }
val cat_features_index = cat_features_string_index.map {
(feature: String) => feature + "_index"
}
val count_vectorized_index = cat_ohe_features.map {
(feature: String) => feature + "_vectorized"
}
val catFeatureAssembler = new VectorAssembler()
.setInputCols(cat_features_index)
.setOutputCol("cat_features")
val oheFeatureAssembler = new VectorAssembler()
.setInputCols(count_vectorized_index)
.setOutputCol("cat_ohe_features")
val numFeatureAssembler = new VectorAssembler()
.setInputCols(num_features)
.setOutputCol("num_features")
val featureAssembler = new VectorAssembler()
.setInputCols(Array("cat_features", "num_features", "cat_ohe_features_vectorized"))
.setOutputCol("features")
val pipelineStages = catIndexer ++ flatMapper ++ countVectorizer ++
Array(
catFeatureAssembler,
oheFeatureAssembler,
numFeatureAssembler,
featureAssembler)
val pipeline = new Pipeline().setStages(pipelineStages)
pipeline.fit(dataset = train)
Running this code, I receive an error:
java.lang.IllegalArgumentException: Field "my_ohe_field_trasformed" does not exist.
[info] java.lang.IllegalArgumentException: Field "from_expdelv_areas_transformed" does not exist.
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
[info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info] at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
[info] at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:56)
[info] at org.apache.spark.ml.feature.CountVectorizerParams$class.validateAndTransformSchema(CountVectorizer.scala:75)
[info] at org.apache.spark.ml.feature.CountVectorizer.validateAndTransformSchema(CountVectorizer.scala:123)
[info] at org.apache.spark.ml.feature.CountVectorizer.transformSchema(CountVectorizer.scala:188)
When I uncomment the stringSplitter and countVectorizer the error is raised in my Transformer
java.lang.IllegalArgumentException: Field "my_ohe_field" does not exist. at
val dataType = schema($(inputCol)).dataType
Result of calling pipeline.getStages:
strIdx_3c2630a738f0
strIdx_0d76d55d4200
FlatMapTransformer_fd8595c2969c
FlatMapTransformer_2e9a7af0b0fa
cntVec_c2ef31f00181
cntVec_68a78eca06c9
vecAssembler_a81dd9f43d56
vecAssembler_b647d348f0a0
vecAssembler_b5065a22d5c8
vecAssembler_d9176b8bb593
I might follow the wrong way. Any comments are appreciated.
Your FlatMapTransformer #transform is incorrect, your kind of dropping/ignoring all other columns when you select only on outputCol
please modify your method to -
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), explode(flatMapUdf(col($(inputCol)))))
}
Also, Modify your transformSchema to check input column first before checking its datatype-
override def transformSchema(schema: StructType): StructType = {
require(schema.names.contains($(inputCol)), "inputCOl is not there in the input dataframe")
//... rest as it is
}
Update-1 based on comments
PLease modify the copy method (Though it's not the cause for exception you facing)-
override def copy(extra: ParamMap): FlatMapTransformer = defaultCopy(extra)
please note that the CountVectorizer takes the column having columns of type ArrayType(StringType, true/false) and since the FlatMapTransformer output columns becomes the input of CountVectorizer, you need to make sure output column of FlatMapTransformer must be of ArrayType(StringType, true/false). I think, this is not the case, your code today is as following-
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), explode(flatMapUdf(col($(inputCol)))))
}
The explode functions converts the array<string> to string, so the output of the transformer becomes StringType. you may wanted to change this code to-
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), flatMapUdf(col($(inputCol))))
}
modify transformSchema method to output ArrayType(StringType)
override def transformSchema(schema: StructType): StructType = {
val dataType = schema($(inputCol)).dataType
require(
dataType.isInstanceOf[StringType],
s"Input column must be of type StringType but got ${dataType}")
val inputFields = schema.fields
require(
!inputFields.exists(_.name == $(outputCol)),
s"Output column ${$(outputCol)} already exists.")
schema.add($(outputCol), ArrayType(StringType))
}
change vector assembler to this-
val featureAssembler = new VectorAssembler()
.setInputCols(Array("cat_features", "num_features", "cat_ohe_features"))
.setOutputCol("features")
I tried to execute your pipeline on dummy dataframe, it worked well. Please refer this gist for full code.

Spark doesn't conform to expected type TraversableOnce

val num_idf_pairs = rescaledData.select("item", "features")
.rdd.map(x => {(x(0), x(1))})
val itemRdd = rescaledData.select("item", "features").where("item = 1")
.rdd.map(x => {(x(0), x(1))})
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())
val sims = num_idf_pairs.flatMap {
case (key, value) =>
val sv1 = value.asInstanceOf[SV]
import breeze.linalg._
val valuesVector = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
itemRdd.map {
case (id2, idf2) =>
val sv2 = idf2.asInstanceOf[SV]
val xVector = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
val sim = valuesVector.dot(xVector) / (norm(valuesVector) * norm(xVector))
(id2.toString, key.toString, sim)
}
}
The error is doesn't conform to expected type TraversableOnce.
When i modify as follows:
val b_num_idf_pairs = sparkSession.sparkContext.broadcast(num_idf_pairs.collect())
val docSims = num_idf_pairs.flatMap {
case (id1, idf1) =>
val idfs = b_num_idf_pairs.value.filter(_._1 != id1)
val sv1 = idf1.asInstanceOf[SV]
import breeze.linalg._
val bsv1 = new SparseVector[Double](sv1.indices, sv1.values, sv1.size)
idfs.map {
case (id2, idf2) =>
val sv2 = idf2.asInstanceOf[SV]
val bsv2 = new SparseVector[Double](sv2.indices, sv2.values, sv2.size)
val cosSim = bsv1.dot(bsv2).asInstanceOf[Double] / (norm(bsv1) * norm(bsv2))
(id1.toString(), id2.toString(), cosSim)
}
}
it compiles but this will cause an OutOfMemoryException. I set --executor-memory 4G.
The first snippet:
num_idf_pairs.flatMap {
...
itemRdd.map { ...}
}
is not only not valid Spark code (no nested transformations are allowed), but also, as you already know, won't type check, because RDD is not TraversableOnce.
The second snippet likely fails, because data you are trying to collect and broadcast is to large.
It looks like you are trying to find all items similarity so you'll need Cartesian product, and structure your code roughly like this:
num_idf_pairs
.cartesian(itemRdd)
.filter { case ((id1, idf1), (id2, idf2)) => id1 != id2 }
.map { case ((id1, idf1), (id2, idf2)) => {
val cosSim = ??? // Compute similarity
(id1.toString(), id2.toString(), cosSim)
}}

Spark DF: Schema for type Unit is not supported

I am new to Scala and Spark and trying to build on some samples I found. Essentially I am trying to call a function from within a data frame to get State from zip code using Google API..
I have the code working separately but not together ;(
Here is the piece of code not working...
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
...
code:
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols:_*).show()
}
def getstate = udf {(zipcode:String) => {
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val shortnames = for {
JObject(address_components) <- address
JField("short_name", short_name) <- address_components
} yield short_name
val state = shortnames(3)
//return state.toString()
val stater = state.toString()
}
}
Thanks for the responses.. I think I figured it out. Here is the code that works. One thing to note is Google API has restriction so some valid zip codes don't have state info.. not an issue for me though.
private def loaduserdata(spark: SparkSession): Unit = {
import spark.implicits._
// Create an RDD of User objects from a text file, convert it to a Dataframe
val userDF = spark.sparkContext
.textFile("examples/src/main/resources/users.csv")
.map(_.split("::"))
.map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
.toDF()
// Register the DataFrame as a temporary view
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc") // zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
val geoDF = zipcodesDF.select(mappedCols:_*)//.show()
geoDF.createOrReplaceTempView("Geo")
}
val getstate = udf {(zipcode: String) =>
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val statenm = for {
JObject(statename) <- address
JField("types", JArray(types)) <- statename
JField("short_name", JString(short_name)) <- statename
if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
// if types.head.equals("JString(administrative_area_level_1)")
} yield short_name
val str = if (statenm.isEmpty.toString().equals("true")) "N/A" else statenm.head
}

Make RDD from LIST[Row] In scala(in spark)

I'm making some code with scala & spark and want to make CSV file from RDD or LIST[Row].
I wanted to process 'ListRDD' data parellel so I thouth output data would be more than one file.
val conf = new SparkConf().setAppName("Csv Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val logFile = "data.csv "
val rawdf = sqlContext.read.format("com.databricks.spark.csv")....
val rowRDD = rawdf.map { row =>
Row(
row.getAs( myMap.ID).toString,
row.getAs( myMap.Dept)
.....
)
}
val df = sqlContext.createDataFrame(rowRDD, mySchema)
val MapRDD = df.map { x => (x.getAs[String](myMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
myClass.myFunction( ListRDD)
in myClass..
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
var rows: RDD[Row]
ListRDD.foreach( row => {
rows.add? gather? ( make(row._2)) // make( row._2) will return List[Row]
})
rows.saveAsFile(" path") // it's my final goal
}
def make( list: List[Row]) : List[Row] = {
data processing from List[Row]
}
I tried to make RDD data from List by sc.parallelize( list) BUT somehow nothing works. anyidea to make RDD type data from make function.
If you want to make an RDD from a List[Row], here is a way to do so
//Assuming list is your List[Row]
val newRDD: RDD[Object] = sc.makeRDD(list.toArray());

obtain a specific value from a RDD according to another RDD

I want to map a RDD by lookup another RDD by this code:
val product = numOfT.map{case((a,b),c)=>
val h = keyValueRecords.lookup(b).take(1).mkString.toInt
(a,(h*c))
}
a,b are Strings and c is a Integer. keyValueRecords is like this: RDD[(string,string)]-
i got type missmatch error: how can I fix it ?
what is my mistake ?
This is a sample of data:
userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
2,3,2.0,859046959
3,7,3.0,8414840873
I'm triying by this code:
val lines = sc.textFile("ratings.txt").map(s => {
val substrings = s.split(",")
(substrings(0), (substrings(1),substrings(1)))
})
val shoppingList = lines.groupByKey()
val coOccurence = shoppingList.flatMap{case(k,v) =>
val arry1 = v.toArray
val arry2 = v.toArray
val pairs = for (pair1 <- arry1; pair2 <- arry2 ) yield ((pair1,pair2),1)
pairs.iterator
}
val numOfT = coOccurence.reduceByKey((a,b)=>(a+b)) // (((item,rate),(item,rate)),coccurence)
// produce recommend for an especial user
val keyValueRecords = sc.textFile("ratings.txt").map(s => {
val substrings = s.split(",")
(substrings(0), (substrings(1),substrings(2)))
}).filter{case(k,v)=> k=="1"}.groupByKey().flatMap{case(k,v) =>
val arry1 = v.toArray
val arry2 = v.toArray
val pairs = for (pair1 <- arry1; pair2 <- arry2 ) yield ((pair1,pair2),1)
pairs.iterator
}
val numOfTForaUser = keyValueRecords.reduceByKey((a,b)=>(a+b))
val joined = numOfT.join(numOfTForaUser).map{case(k,v)=>(k._1._1,(k._2._2.toFloat*v._1.toFloat))}.collect.foreach(println)
The Last RDD won't produced. Is it wrong ?