How to iterate over a Scala WrappedArray? (Spark)

I perform the following operations:
val tempDict = sqlContext.sql("""select words.pName_token, collect_set(words.pID) as docids
from words
group by words.pName_token""").toDF()
val wordDocs = tempDict.filter(tempDict("pName_token") === word)
val listDocs = wordDocs.map(t => t(1)).collect()
listDocs: Array[Any] = Array(WrappedArray(123, 234, 205876618, 456))
My question is how do I iterate over this wrapped array or convert this into a list?
The options I get for the listDocs are apply, asInstanceOf, clone, isInstanceOf, length, toString, and update.
How do I proceed?

Here is one way to solve this.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val data = Seq((Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)))
val df = sqlContext.createDataFrame(data)
val first = df.first
// use getAs to state the element type of the array column
val mapped = first.getAs[WrappedArray[Int]](0)
// now we can use it like a normal collection
mapped.mkString("\n")
// pattern match on each Row to pull out its array columns
val rows = df.collect.map {
case Row(a: Seq[Any], b: Seq[Any], c: Seq[Any]) =>
(a, b, c)
}
rows.mkString("\n")
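The same idea can be tried without a Spark session. The following is a minimal sketch in plain Scala, assuming that the collected value behaves like an Array[Any] holding one sequence (a WrappedArray is a Seq); the name docIds is made up for illustration.

```scala
// Stand-in for the collected result in the question: an Array[Any] whose
// single element is a sequence (Spark returns a WrappedArray, which is a Seq).
val listDocs: Array[Any] = Array(Seq(123, 234, 205876618, 456))

// The static type is Any, so match on Seq[_] (the element type is erased at
// runtime) and flatten everything into one List.
val docIds: List[Any] = listDocs.toList.flatMap {
  case xs: Seq[_] => xs.toList
  case other      => List(other) // keep non-sequence values as they are
}

// Once it is a List, ordinary iteration works.
docIds.foreach(println)
```

Matching on Seq[_] rather than on a concrete element type avoids an unchecked-type warning; the elements can be cast afterwards if a List[Int] is needed.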

Related

Spark ML insert/fit custom OneHotEncoder into a Pipeline

Say I have a few features/columns in a dataframe to which I apply the regular OneHotEncoder, and one column (say, the n-th) to which I need to apply my custom OneHotEncoder. I then need to use VectorAssembler to assemble those features and put everything into a Pipeline, finally fitting on my trainData and getting predictions from my testData, such as:
val sIndexer1 = new StringIndexer().setInputCol("my_feature1").setOutputCol("indexed_feature1")
// ... say, n-1 such sIndexers for the n-1 features
val featureEncoder = new OneHotEncoderEstimator().setInputCols(Array(sIndexer1.getOutputCol), ...).
setOutputCols(Array("encoded_feature1", ... ))
// **need to insert output from my custom OneHotEncoder function (please see below)**
// (which takes the n-th feature as input) in a way that matches the VectorAssembler below
val vectorAssembler = new VectorAssembler().setInputCols(featureEncoder.getOutputCols + ???).
setOutputCol("assembled_features")
...
val pipeline = new Pipeline().setStages(Array(sIndexer1, ...,featureEncoder, vectorAssembler, myClassifier))
val model = pipeline.fit(trainData)
val predictions = model.transform(testData)
How can I modify the building of the vectorAssembler so that it can ingest the output from the custom OneHotEncoder?
The problem is my desired oheEncodingTopN() cannot/should not refer to the "actual" dataframe, since it would be a part of the pipeline (to apply on trainData/testData).
Note:
I tested that the custom OneHotEncoder (see link) works as expected when applied separately to e.g. trainData. Basically, oheEncodingTopN applies one-hot encoding to the input column, but only for the top N most frequent values (e.g. N = 50), and puts all the remaining infrequent values into a dummy column (say, "default"), e.g.:
val oheEncoded = oheEncodingTopN(df, "my_featureN", 50)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.Column
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def oheEncodingTopN(df: DataFrame, colName: String, n: Int): DataFrame = {
df.createOrReplaceTempView("data")
val topNDF = spark.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
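To make the intended encoding concrete, here is a small sketch of the same top-N idea on plain Scala collections rather than DataFrames. The function topNOneHot and its tie-breaking rule (alphabetical order among equally frequent values) are assumptions made for illustration; they are not part of the code above or of Spark.

```scala
// Hypothetical helper: one-hot encode `values`, keeping a column for each of
// the n most frequent values and a trailing "default" column for the rest.
def topNOneHot(values: Seq[String], n: Int): (Seq[String], Seq[Seq[Int]]) = {
  // Rank distinct values by frequency; ties are broken alphabetically so the
  // result is deterministic.
  val topN = values.groupBy(identity).toSeq
    .map { case (v, occurrences) => (v, occurrences.size) }
    .sortBy { case (v, count) => (-count, v) }
    .take(n)
    .map(_._1)

  val columns = topN :+ "default"
  val encoded = values.map { v =>
    val idx = topN.indexOf(v)
    val hot = if (idx >= 0) idx else topN.size // infrequent values go to "default"
    columns.indices.map(i => if (i == hot) 1 else 0)
  }
  (columns, encoded)
}

val (columns, encoded) = topNOneHot(Seq("a", "b", "a", "c", "a", "b"), 2)
```

Here "a" and "b" are the two most frequent values, so "c" is the only value that lands in the "default" column.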
I think the cleanest way would be to create your own class that extends the Spark ML Transformer, so that you can use it as you would any other transformer (like OneHotEncoder). Your class would look like this:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.Param
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Dataset, Column}
class OHEncodingTopN(n :Int, override val uid: String) extends Transformer {
final val inputCol= new Param[String](this, "inputCol", "The input column")
final val outputCol = new Param[String](this, "outputCol", "The output column")
def setInputCol(value: String): this.type = set(inputCol, value)
def setOutputCol(value: String): this.type = set(outputCol, value)
def this(n :Int) = this(n, Identifiable.randomUID("OHEncodingTopN"))
def copy(extra: ParamMap): OHEncodingTopN = {
defaultCopy(extra)
}
override def transformSchema(schema: StructType): StructType = {
// Check that the input type is what you want if needed
// val idx = schema.fieldIndex($(inputCol))
// val field = schema.fields(idx)
// if (field.dataType != StringType) {
// throw new Exception(s"Input type ${field.dataType} did not match input type StringType")
// }
// Add the return field
schema.add(StructField($(outputCol), IntegerType, false))
}
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
def transform(df: Dataset[_]): DataFrame = {
df.createOrReplaceTempView("data")
val colName = $(inputCol)
val topNDF = df.sparkSession.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")
val pivotTopNDF = topNDF.
groupBy(colName).
pivot(colName).
count().
withColumn("default", lit(1))
val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)
val oheEncodedDF = joinedTopNDF.
na.fill(0, joinedTopNDF.columns).
withColumn("default", flip(col("default")))
oheEncodedDF
}
}
Now, on an OHEncodingTopN object, you should be able to call .getOutputCol to do what you want (the class above does not define that getter, so you would need to add one, e.g. def getOutputCol: String = $(outputCol)). Good luck.
EDIT: the body of your method, which I just copy-pasted into transform, needs to be slightly modified so that it outputs a column of type Vector with the name given to setOutputCol.

How can I split a DataFrame schema string into its individual structs

I want to split the schema of a dataframe into a collection. I am trying the code below, but the schema is printed out as a single string. Is there any way I can split it into a collection with one element per StructType, so that I can manipulate it (e.g. take only the array columns from the output)? I am trying to flatten a complex multi-level struct + array dataframe.
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql._
val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3],"d":[2,3]}""")))
test.printSchema
val flattened = test.withColumn("b", explode($"d"))
flattened.printSchema
def identifyArrayColumns(dataFrame : DataFrame) = {
val output = for ( d <- dataFrame.collect()) yield
{
d.schema
}
output.toList
}
identifyArrayColumns(test)
Output currently is
identifyArrayColumns: (dataFrame: org.apache.spark.sql.DataFrame)List[org.apache.spark.sql.types.StructType]
res58: List[org.apache.spark.sql.types.StructType] = List(StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true)))
It is one full string, so I cannot filter out only the array columns. If I do a foreach(println), I get only one line:
scala> output.foreach(println)
StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true))
What I want is each struct as a separate element in a collection.
You can simply filter the fields of the DataFrame's schema for fields with type array - no need to inspect the DataFrame's data for this:
import org.apache.spark.sql.types.{StructField, StructType}
// keep only the top-level fields whose data type is an array
def identifyArrayColumns(schema: StructType): List[StructField] = {
schema.fields.filter(_.dataType.typeName == "array").toList
}
NOTE that this is a "shallow" solution that only returns the array fields directly under the root. If you also want to find arrays nested within arrays / maps / structs, you'd need to traverse the schema recursively and produce this filtered result, something like:
// can be converted into a tail-recursive method by adding another argument to accumulate results
def identifyArrayColumns(schema: StructType): List[StructField] = {
val arrays = schema.fields.filter(_.dataType.typeName == "array").toList
val deeperArrays = schema.fields.flatMap {
case f @ StructField(_, s: StructType, _, _) => identifyArrayColumns(s)
case _ => List()
}
arrays ++ deeperArrays
}

Filtering in Scala

So suppose I have the following data (only the first few rows, this data covers an entire year) -
(2014-08-31T00:05:00.000+01:00, John)
(2014-08-31T00:11:00.000+01:00, Sarah)
(2014-08-31T00:12:00.000+01:00, George)
(2014-08-31T00:05:00.000+01:00, John)
(2014-09-01T00:05:00.000+01:00, Sarah)
(2014-09-01T00:05:00.000+01:00, George)
(2014-09-01T00:05:00.000+01:00, Jason)
I would like to filter the data so that I only see what the names are for a specific date (say, 2014-09-05). I've tried doing this using the filter function in Scala but I keep receiving the following error -
error: value xxxx is not a member of (org.joda.time.DateTime, String)
Is there another way of doing this?
The filter method takes a function, called a predicate, that takes as parameter an element of your (I'm assuming) RDD, and returns a Boolean.
The returned RDD will keep only the rows for which the predicate evaluates to true.
In your case, it seems that what you want is something like
rdd.filter{
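// compare calendar dates only (org.joda.time.LocalDate), so the time of day and zone do not affect the match
case (date, _) => date.toLocalDate == new LocalDate("2017-03-31")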
}
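The predicate idea can be checked on a plain Scala List as well. This is a sketch only: it assumes java.time instead of Joda-Time, and the names events, target and names are made up for illustration. The sample rows are taken from the question.

```scala
import java.time.{LocalDate, OffsetDateTime}

// A few (timestamp, name) pairs from the question, parsed with java.time.
val events: List[(OffsetDateTime, String)] = List(
  (OffsetDateTime.parse("2014-08-31T00:05:00.000+01:00"), "John"),
  (OffsetDateTime.parse("2014-09-01T00:05:00.000+01:00"), "Sarah"),
  (OffsetDateTime.parse("2014-09-01T00:05:00.000+01:00"), "George")
)

val target = LocalDate.parse("2014-09-01")

// The predicate destructures each pair and compares only the calendar date;
// map then keeps just the names of the rows that passed.
val names: List[String] =
  events.filter { case (ts, _) => ts.toLocalDate == target }.map(_._2)
```

Destructuring the pair with case (ts, _) is what avoids the "value xxxx is not a member of (DateTime, String)" kind of error: the element is a tuple, so its fields have to be taken apart before calling date methods on them.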
I presume from the tag your question is in the context of Spark and not pure Scala. Given that, you could filter a dataframe on a date and get the associated name(s) like this:
import java.sql.Date
import org.apache.spark.sql.functions._
import sparkSession.implicits._
Seq(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah")
...
)
.toDF("date", "name")
.filter(to_date('date).equalTo(Date.valueOf("2014-09-05")))
.select("name")
Note that the Date above is java.sql.Date.
Here's a function that takes a date, a list of datetime-name pairs, and returns a list of names for the date:
def getNames(d: String, l: List[(String, String)]): List[String] = {
val date = """^([^T]*).*""".r
val dateMap = l.map { // use the parameter l, not the outer val list
case (x, y) => ( x match { case date(z) => z }, y )
}.
groupBy(_._1) mapValues( _.map(_._2) )
dateMap.getOrElse(d, List[String]())
}
val list = List(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah"),
("2014-08-31T00:12:00.000+01:00", "George"),
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-09-01T00:05:00.000+01:00", "Sarah"),
("2014-09-01T00:05:00.000+01:00", "George"),
("2014-09-01T00:05:00.000+01:00", "Jason")
)
getNames("2014-09-01", list)
res1: List[String] = List(Sarah, George, Jason)
import java.text.SimpleDateFormat
import org.joda.time.DateTime
val dateTimeStringZero = "2014-08-12T00:05:00.000+01:00"
val dateTimeOne: DateTime = org.joda.time.format.ISODateTimeFormat.dateTime.withZoneUTC.parseDateTime(dateTimeStringZero)
val df = new DateTime(new SimpleDateFormat("yyyy-MM-dd").parse("2014-08-12"))
println(dateTimeOne.getYear == df.getYear)
// compare month with month (the original compared the month with the year)
println(dateTimeOne.getMonthOfYear == df.getMonthOfYear)
...

How to join two datasets by key in Scala Spark

I have two datasets, and each record in them has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The code above does not work!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse:String => (String, String) = s => {
val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
s match {
case regex(k,v) => (k,v)
case _ => ("","")
}
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
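For readers without a Spark shell at hand, the effect of an inner join on key-value pairs can be imitated with ordinary Scala collections. This is only a sketch of the semantics, not of how Spark executes a join, and the names left, right and rightByKey are made up for illustration.

```scala
// Already-parsed (key, value) pairs, as produced by the parse step above.
val left  = Seq(("abc,def", "monkey(1)"), ("df,gh", "zebra"))
val right = Seq(("a,efg", "apple"), ("abc,def", "banana(1)"))

// Index the right side by key so each left record can look up its matches.
val rightByKey = right.groupBy(_._1)

// Inner join: emit (key, (leftValue, rightValue)) for every matching pair;
// keys present on only one side produce nothing.
val joined = for {
  (k, v) <- left
  (_, w) <- rightByKey.getOrElse(k, Seq.empty)
} yield (k, (v, w))
```

Only 'abc,def' appears on both sides, so it is the only key in the result, which matches the shape of the Array[(String, (String, String))] shown above.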
You have to create pair RDDs from your data sets first, and then apply the join transformation. Your sample data sets do not look well formed.
Please consider the example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your code should look like this in Scala:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from scala shell
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))

Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
val rawData = sc.textFile(dat)
val head = rawData.take(10)
val splithead = head.slice(1,11).foreach(splitData)
def splitData(dat: String): Array[String] = {
val splits = dat.split("\",\"")
val split0 = splits(0).substring(1, splits(0).length)
val split8 = splits(8).substring(0, splits(8).length - 1)
Array(split0).union(splits.slice(1, 8)).union(Array(split8))
}
}
foreach only evaluates the expression for each element and returns Unit; it does not collect any results while iterating. You probably need map or flatMap (see docs here):
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
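The difference between the three calls is easy to see on a tiny array. A minimal sketch in plain Scala, where splitFields is a made-up stand-in for splitData:

```scala
// Hypothetical stand-in for splitData: split a line into its fields.
def splitFields(line: String): Array[String] = line.split(",")

val lines = Array("a,b", "c,d")

// foreach runs the function for its side effects and returns Unit,
// so nothing useful is stored here.
val viaForeach: Unit = lines.foreach(s => splitFields(s))

// map keeps one result per input element: Array[Array[String]].
val viaMap: Array[Array[String]] = lines.map(s => splitFields(s))

// flatMap concatenates the per-element results: Array[String].
val viaFlatMap: Array[String] = lines.flatMap(s => splitFields(s))
```

This is why splithead in the question ends up as Unit: the value of a val is whatever its right-hand side returns, and foreach returns nothing.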
Consider also a for comprehension (which in this case desugars into map, so it yields an Array[Array[String]]),
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings are equipped with the methods of ordered collections, thus
splits(0).substring(1, splits(0).length)
is, for a non-empty string, equivalent to either of the following
splits(0).drop(1)
splits(0).tail