How to map a SchemaRDD to a PairRDD - scala

I am trying to figure out how to map a SchemaRDD object that I retrieved from a SQL HiveContext to a PairRDDFunctions[String, Vector] object, where the String is the name column in the SchemaRDD and the remaining columns (BytesIn, BytesOut, etc.) form the Vector.

Assuming you have columns: "name", "bytesIn", "bytesOut"
val schemaRDD: SchemaRDD = ...
val pairs: RDD[(String, (Long, Long))] =
  schemaRDD.select("name", "bytesIn", "bytesOut").rdd.map {
    case Row(name: String, bytesIn: Long, bytesOut: Long) =>
      name -> (bytesIn, bytesOut)
  }
// To import PairRDDFunctions via implicits
import org.apache.spark.SparkContext._
pairs.groupByKey // ... etc.
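If you want the MLlib Vector the question asks for rather than a tuple, here is a minimal sketch along the same lines as the answer above (it assumes the numeric columns come back as Longs and the same imports as above; adjust the pattern if they are Ints or Doubles):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val vectorPairs: RDD[(String, Vector)] =
  schemaRDD.select("name", "bytesIn", "bytesOut").rdd.map {
    case Row(name: String, bytesIn: Long, bytesOut: Long) =>
      // widen the numeric columns to Double and pack them into a dense vector
      name -> Vectors.dense(bytesIn.toDouble, bytesOut.toDouble)
  }
// The PairRDDFunctions implicits imported above apply here too
vectorPairs.groupByKey()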

Related

Scala: how to pass a variable into a UDF and use it in withColumn

I have a variable of type Map[String, Set[String]]:
val metadata = Map(a -> Set(b, c))
val colToUse = "existingcol" // Option[String]
I am trying to add a new column to my DataFrame using metadata and colToUse, which is an existing column in my DataFrame.
The value in metadata is a Set of Strings, and
the key is a String that appears as a value of a column in the df.
e.g.:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name" // student_name is a column name in df
"mike" will be a value of the "student_name" column.
I am trying to add a new column to the existing DF that holds the subjects of each student, based on student_name and metadata:
myDF.withColumn("subjects", metadata.getOrElse(col(colToUse), Set.empty))
The above will not work in Scala, since withColumn only accepts Column expressions.
I tried using a UDF:
def logic: (Map[String, Set[String]], String) => Set[String] =
  (metadata: Map[String, Set[String]], colToUse: String) => {
    metadata.getOrElse(colToUse, Set("a"))
  }
def myUDF = udf(logic)
def getVal: Column = { myUDF(metadata, col(colToUse.get)) }
and used it in withColumn:
myDF.withColumn("newCol", getVal(metadata, colToUse))
I am getting the error: Unsupported literal type class scala.Tuple2
I am looking for the simplest way to approach this.
Issue 2: in getVal, a Column is expected where metadata is passed, but I am passing a Map.
Is something like this what you need?
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("mike"))),
  StructType(List(StructField("student_name", StringType)))
)
df.show()
The first test DataFrame:
+------------+
|student_name|
+------------+
| mike|
+------------+
And now, create the UDF that uses the map:
import org.apache.spark.sql.functions.{col, udf}

val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name"
def createUdf =
  udf((key: String) => metadata.getOrElse(key, Set.empty[String]))
and use it in the withColumn function:
df.withColumn("subjects", createUdf(col(colToUse))).show()
It gives:
+------------+--------------------+
|student_name| subjects|
+------------+--------------------+
| mike|[physics, chemistry]|
+------------+--------------------+
Am I missing something?
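If the metadata map is large, a possible refinement (my suggestion, not something the question requires) is to broadcast it so executors share one copy instead of serializing the map into every task closure:
// Sketch: broadcast the lookup map (assumes the same spark, df, metadata and colToUse as above)
val metadataBc = spark.sparkContext.broadcast(metadata)
val subjectsUdf = udf((key: String) => metadataBc.value.getOrElse(key, Set.empty[String]).toSeq)
df.withColumn("subjects", subjectsUdf(col(colToUse))).show()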

Aggregating a JSON array into Map<key, list> in Spark 2.x

I am quite new to Spark. I have an input JSON file which I am reading as:
val df = spark.read.json("/Users/user/Desktop/resource.json");
Contents of resource.json looks like this:
{"path":"path1","key":"key1","region":"region1"}
{"path":"path112","key":"key1","region":"region1"}
{"path":"path22","key":"key2","region":"region1"}
Is there any way to process this DataFrame and aggregate the result as
Map<key, List<data>>
where data is each JSON object in which the key is present?
For example, the expected result is:
Map<key1 = [{"path":"path1","key":"key1","region":"region1"}, {"path":"path112","key":"key1","region":"region1"}],
key2 = [{"path":"path22","key":"key2","region":"region1"}]>
Any references/documentation/links to proceed further would be a great help.
Thank you.
Here is what you can do:
import org.json4s._
import org.json4s.jackson.Serialization.read
case class cC(path: String, key: String, region: String)
val df = spark.read.json("/Users/user/Desktop/resource.json");
scala> df.show
+----+-------+-------+
| key| path| region|
+----+-------+-------+
|key1| path1|region1|
|key1|path112|region1|
|key2| path22|region1|
+----+-------+-------+
//Please note that original json structure is gone. Use .toJSON to get json back and extract key from json and create RDD[(String, String)] RDD[(key, json)]
val rdd = df.toJSON.rdd.map { m =>
  implicit val formats = DefaultFormats
  val parsedObj = read[cC](m)
  (parsedObj.key, m)
}
scala> rdd.collect.groupBy(_._1).map(m => (m._1,m._2.map(_._2).toList))
res39: scala.collection.immutable.Map[String,List[String]] = Map(key2 -> List({"key":"key2","path":"path22","region":"region1"}), key1 -> List({"key":"key1","path":"path1","region":"region1"}, {"key":"key1","path":"path112","region":"region1"}))
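Note that collect.groupBy does the grouping on the driver. As a sketch of an alternative (same rdd as above), you can let Spark do the grouping with a shuffle and only collect the final result:
// Group on the executors, then bring the grouped result back to the driver
val grouped: Map[String, List[String]] =
  rdd.groupByKey()
     .mapValues(_.toList)
     .collect()
     .toMap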
You can use groupBy with collect_list, which is an aggregation function that collects all matching values into a list per key.
Notice that the original JSON strings are already "gone" (Spark parses them into individual columns), so if you really want a list of all records (with all their columns, including the key), you can use the struct function to combine columns into one column:
import org.apache.spark.sql.functions._
import spark.implicits._
df.groupBy($"key")
.agg(collect_list(struct($"path", $"key", $"region")) as "value")
The result would be:
+----+--------------------------------------------------+
|key |value |
+----+--------------------------------------------------+
|key1|[[path1, key1, region1], [path112, key1, region1]]|
|key2|[[path22, key2, region1]] |
+----+--------------------------------------------------+
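If you want the values back as JSON strings, as in the expected Map above, here is a sketch assuming Spark 2.1+ (for the to_json function) and the same df as before:
// Re-serialize each row to JSON, collect the lists per key, and build a driver-side Map
val resultMap: Map[String, Seq[String]] =
  df.groupBy($"key")
    .agg(collect_list(to_json(struct($"path", $"key", $"region"))) as "value")
    .collect()
    .map(r => r.getString(0) -> r.getSeq[String](1))
    .toMap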

DoubleType to VectorUDT

I have this code written using Spark 2.1:
val mycolumns = originalFile.schema.fieldNames
mycolumns.map(cname => stddevPerColumnName(df.select(cname), cname))

def stddevPerColumnName(df: DataFrame, cname: String): DataFrame =
  new StandardScaler()
    .setInputCol(cname)
    .setOutputCol("stddev")
    .setWithStd(true)
    .fit(df)
    .transform(df)
Every single column has type DoubleType originally inferred from a CSV file.
When I run the code I get the Exception:
Column FirstColumn must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually DoubleType.
How can I convert the column type Double to VectorUDT?
You need to pass a Vector into the ML model: use an assembler to put the double values into a vector, then do your ML, then take the values back out of the vector (as doubles) if required.
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions._

val assembler = new VectorAssembler().setInputCols(Array("yourDoubleValue")).setOutputCol("features")
def assemble(ds: Dataset[_]): DataFrame = assembler.transform(ds)
// UDF to pull a single value back out of the vector column
val vectorToColumn = udf { (x: DenseVector, index: Int) => x(index) }
val scaler = new StandardScaler().setInputCol("features").setOutputCol("featuresScaled")
Use DenseVector or SparseVector depending on your data.
Full example:
val data = spark.read....
val assembled = assembler.transform(data)
val scaled = scaler.fit(assembled).transform(assembled)
  .withColumn("backToMyDouble", round(vectorToColumn(col("featuresScaled"), lit(0)), 2))

Spark, Scala - determine column type

I can load data from a database, and I do some processing with this data.
The problem is that some tables have the date column as 'String', but others treat it as 'timestamp'.
I cannot know what type the date column is until the data is loaded.
> x.getAs[String]("date") // could be an error when the date column is timestamp type
> x.getAs[Timestamp]("date") // could be an error when the date column is string type
This is how I load data in Spark:
spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .load()
Is there any way to treat them uniformly, or to always convert the column to a string?
You can pattern-match on the type of the column (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as is - and use the unix_timestamp function to do the actual conversion:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import spark.implicits._

// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
  ("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
  ("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")

// If the column is a String, convert it to a Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
  df.schema("date").dataType match {
    case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
    case _ => df
  }
}

// after "normalizing", you can assume date has the Timestamp type -
// both of these would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
Here are a few things you can try:
(1) Start utilizing the inferSchema option during load if you have a version that supports it. This makes Spark figure out the data types of the columns, although it doesn't work in all scenarios. Also look at the input data: if you have quotes, I advise adding an extra option to account for them during the load.
val inputDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load(fileLocation)
(2) To identify the data type of a column you can use the code below; it will place all of the column names and data types into their own Arrays of Strings.
val columnNames : Array[String] = inputDF.columns
val columnDataTypes : Array[String] = inputDF.schema.fields.map(x=>x.dataType).map(x=>x.toString)
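As a small usage sketch (my addition), you can zip the two arrays to look up a specific column's type, e.g. to check whether "date" came back as a timestamp:
// Pair each column name with its data type and look up the "date" column
val typesByName: Map[String, String] = columnNames.zip(columnDataTypes).toMap
val dateIsTimestamp = typesByName.get("date").contains("TimestampType")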
Row has an easy way to address this: get(i: Int): Any. It maps between Spark SQL types and return types automatically, e.g.:
val fieldIndex = row.fieldIndex("date")
val date = row.get(fieldIndex)
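Since get returns Any, you typically pattern match on the runtime type afterwards; a minimal sketch of my own:
val dateAsString = row.get(fieldIndex) match {
  case ts: java.sql.Timestamp => ts.toString  // timestamp column
  case s: String => s                         // string column
  case other => String.valueOf(other)
}
The same schema-matching idea also works for struct columns, as in this snippet for a "location" column: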
def parseLocationColumn(df: DataFrame): DataFrame = {
  df.schema("location").dataType match {
    case StringType =>
      df.withColumn("locationTemp", $"location")
        .withColumn("countryTemp", lit("Unknown"))
        .withColumn("regionTemp", lit("Unknown"))
        .withColumn("zoneTemp", lit("Unknown"))
    case _ =>
      df.withColumn("locationTemp", $"location.location")
        .withColumn("countryTemp", $"location.country")
        .withColumn("regionTemp", $"location.region")
        .withColumn("zoneTemp", $"location.zone")
  }
}
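For completeness, a hypothetical usage of the helper above (the temp column names are the ones assumed in the snippet):
val normalized = parseLocationColumn(df)
normalized.select("locationTemp", "countryTemp", "regionTemp", "zoneTemp").show()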

Scala java.lang.String cannot be cast to java.lang.Double error when converting double type dataframe to LabeledPoint in Spark

I have a dataset of 2002 variables. All variables are numeric. I first read the dataset into Spark 1.5.0 and created a DoubleType DataFrame following the instructions here. Then I converted the DataFrame to LabeledPoint following the instructions here and here. However, when I tried to print out sample rows of the generated LabeledPoint, I got the "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double" error. Sorry for the long code below, but I hope it will help with debugging.
Could anyone please tell me where the error is coming from and how to resolve the problem? Thank you very much for your help!
Below is the Scala code I used:
// Read in the dataset but drop the header row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainRDD = sc.textFile("train.txt").filter(line => !line.contains("target"))
// Read in the header file to get column names. Store them in an Array.
import scala.io.Source
val dictFile = "header.txt"
var arrName = new Array[String](2002)
for (line <- Source.fromFile(dictFile).getLines) {
  arrName = line.split('\t').map(_.trim).toArray
}
// Create dataframe using programmatically specifying the schema method
// Encode schema in a string
var schemaString = arrName.mkString(" ")
// Import Row
import org.apache.spark.sql.Row
// Import RDD
import org.apache.spark.rdd.RDD
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType,LongType,FloatType,DoubleType}
// Generate the Double Type schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, DoubleType, true)))
// Create rowRDD and convert String type to Double type
val arrVar = sc.broadcast((0 to 2001).toArray)
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
  val rowRDD = rdd.map(_.split("\t")).map(_.map({ y => y.toDouble })).map(p => Row.fromSeq(anArray.value map p))
  return rowRDD
}
val rowRDDTrain = createRowRDD(trainRDD, arrVar)
// Apply the schema to the RDD.
val trainDF = sqlContext.createDataFrame(rowRDDTrain, schema)
trainDF.printSchema
// Verified all 2002 variables are in "double (nullable = true)" format
// Define toLabeledPoint( ) to convert dataframe to LabeledPoint format
// Reference: https://stackoverflow.com/questions/31638770/rdd-to-labeledpoint-conversion
def toLabeledPoint(dataDF: org.apache.spark.sql.DataFrame): org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = {
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint
  val targetInd = dataDF.columns.indexOf("target")
  val ignored = List("target")
  val featInd = dataDF.columns.diff(ignored).map(dataDF.columns.indexOf(_))
  val dataLP = dataDF.rdd.map(r => LabeledPoint(r.getDouble(targetInd),
    Vectors.dense(featInd.map(r.getDouble(_)).toArray)))
  return dataLP
}
// Create LabeledPoint from dataframe
val trainLP = toLabeledPoint(trainDF)
// Print out sample rows of the generated LabeledPoint
trainLP.take(5).foreach(println)
// Failed: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
Update:
Thanks a lot for David Griffin's and zero323's comments below. David is correct. I found that the exception is indeed caused by null values in the data. I replaced the following original code:
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
  val rowRDD = rdd.map(_.split("\t")).map(_.map({ y => y.toDouble })).map(p => Row.fromSeq(anArray.value map p))
  return rowRDD
}
with this one, which imputes null values to 0.0, and the problem is gone:
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
  val rowRDD = rdd.map(_.split("\t")).map(_.map({ y => try { y.toDouble } catch { case _: Throwable => 0.0 } })).map(p => Row.fromSeq(anArray.value map p))
  return rowRDD
}
}
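A slightly more idiomatic variant of the same fix (my sketch; behavior is the same, unparsable fields become 0.0) uses scala.util.Try instead of a try/catch block:
import scala.util.Try
def createRowRDD(rdd: RDD[String], anArray: org.apache.spark.broadcast.Broadcast[Array[Int]]): org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = {
  rdd.map(_.split("\t"))
     .map(_.map(y => Try(y.toDouble).getOrElse(0.0)))  // impute unparsable/null values with 0.0
     .map(p => Row.fromSeq(anArray.value map p))
}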