How to convert Spark's TableRDD to RDD[Array[Double]] in Scala? - scala

I am trying to perform Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need it to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to Array[String] and then converting using map as follows:
val tmp_2 = tmp.map(row => row.getString(0))
val tmp_3 = tmp_2.map { row =>
val features = Array[Double] (row(0))
}
But this gives me a Spark's RDD[Unit] which cannot be used in the function. Is there any other way to proceed with this type conversion?
Edit I also tried using toDouble, but this gives me an RDD[Double] type, not RDD[Array[Double]]
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do this as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Spark Table RDD was created first.
val tmp = sc.sql2rdd("select duration from test")
I made use of getString to translate it to a RDD[String] and then converted it to an RDD[Array[Double]].
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery Read. The "wherePart" has more number of records and hence BQ call is invoked again and again. Keeping the filter outside of BQ Read would help. The idea is, first read the "mainTable" from BQ, store it in a spark view, then apply the "wherePart" filter to this view in spark.
["subDate" is a function to subtract one date from another and return the number of days in between]
val Df = getFb(config, mainTable, ds)
def getFb(config: DataFrame, mainTable: String, ds: String) : DataFrame = {
val fb = config.map(row => Target.Pfb(
row.getAs[String]("m1"),
row.getAs[String]("m2"),
row.getAs[Seq[Int]]("days")))
.collect
val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1))).
map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')").
mkString(" OR ")
val q = new Q()
val tempView = "tempView"
spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
val Df = q.mainTableLogs(tempView)
Df
}
Could someone please help me here.
Are you using the spark-bigquery-connector? If so the right syntax is
spark.read.format("bigquery")
.load(mainTable)
.where(wherePart)
.createOrReplaceTempView(tempView)

Unable to flatten array of DataFrames

I have an array of DataFrames that I obtain by using randomSplit() in this manner:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
I'll be iterating over folds using a for loop, where I will be dropping the ith entry inside folds and store it separately. Then I will be using all the others as another DataFrame as in my code below:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
var ts = folds
val testSet = ts(i)
ts = ts.drop(i)
var trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
for (j <- ts.indices) {
trainSet = trainSet.union(ts(j))
}
}
While this does serve my purpose, I was also trying another approach where I would still separate folds into ts and testSet, and then use the flatten function for the remaining inside ts to create another DataFrame using something like this:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
var ts = folds
val testSet = ts(i)
ts = ts.drop(i)
var trainSet = ts.flatten
}
But at the initialization of the trainSet line, I get an error that: No Implicits Found for parameter asTrav: Dataset[Row] => Traversable[U_]. I have also done import spark.implicits._ after initializing the SparkSession.
My end goal with the creation of trainSet after flatten is to retrieve a DataFrame created after joining (union) the other Dataset[Row]s inside ts. I'm not sure where I'm going wrong.
I'm using Spark 2.4.5 with Scala 2.11.12
EDIT 1: Added how I read the Dataframe
I'm not sure what's your intention here but instead of using mutable variables and flattening you can do recursive iteration like this:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
val testSet = spark.createDataFrame(Seq.empty)
val trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
go(folds, Array.empty)
def go(items: Array[Dataset[Row]], result: Array[Dataset[Row]]): Array[Dataset[Row]] = items match {
case arr # Array(_, _*) =>
val res = arr.map { t =>
trainSet.union(t)
}
go(arr.tail, result ++ res)
case Array() => result
}
As I have seen the use case of testSet, there is no usage of it in the method body
I have replaced that for loop with a simple reduce:
val trainSet = ts.reduce((a,b) => a.union(b))

Cannot splat an Array into function's arguments that accepts varargs

I have tried to make a function that can enrich a given DataFrame with a "session" column using a window function. So I need to use partitionBy and orderBy.
val by_uuid_per_date = Window.partitionBy("uuid").orderBy("year","month","day")
// A Session = A day of events for a certain user. uuid x (year+month+day)
val enriched_df = df
.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy("uuid","timestamp")
.select("uuid","year","month","day","session")
This works perfectly, but when I try to make a function that encapsulates this behavior :
PS: I used the _* splat operator.
def enrich_with_session(df:DataFrame,
window_partition_cols:Array[String],
window_order_by_cols:Array[String],
presentation_order_by_cols:Array[String]):DataFrame={
val by_uuid_per_date = Window.partitionBy(window_partition_cols: _*).orderBy(window_order_by_cols: _*)
df.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy(presentation_order_by_cols:_*)
.select("uuid","year","month","mday","session")
}
I get the following error:
notebook:6: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to -parameters)
val by_uuid_per_date = Window.partitionBy(window_partition_cols: _).orderBy(window_order_by_cols: _*)
partitionBy and orderBy are expecting Seq[Column] or
Array[Column] as arguments, see below:
val data = Seq(
(1,99),
(1,99),
(1,70),
(1,20)
).toDF("id","value")
data.select('id,'value, rank().over(Window.partitionBy('id).orderBy('value))).show()
val partitionBy: Seq[Column] = Seq(data("id"))
val orderBy: Seq[Column] = Seq(data("value"))
data.select('id,'value, rank().over(Window.partitionBy(partitionBy:_*).orderBy(orderBy:_*))).show()
So in this case, your code should looks like this:
def enrich_with_session(df:DataFrame,
window_partition_cols:Array[String],
window_order_by_cols:Array[String],
presentation_order_by_cols:Array[String]):DataFrame={
val window_partition_cols_2: Array[Column] = window_partition_cols.map(df(_))
val window_order_by_cols_2: Array[Column] = window_order_by_cols.map(df(_))
val presentation_order_by_cols_2: Array[Column] = presentation_order_by_cols.map(df(_))
val by_uuid_per_date = Window.partitionBy(window_partition_cols_2: _*).orderBy(window_order_by_cols_2: _*)
df.withColumn("session", dense_rank().over(by_uuid_per_date))
.orderBy(presentation_order_by_cols_2:_*)
.select("uuid","year","month","mday","session")
}

How to correctly handle Option in Spark/Scala?

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
case Some(df) => { //proceed with pipeline
df.filter($"activityLabel" > 0)
case None => println("could not create dataframe")
}
val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
I need df2 to be of type: DataFrame otherwise later code won't recognise df2 as a DataFrame e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
However, the case None statement is not of type DataFrame, it returns Unit, so won't compile. But if I don't declare the type of df2 the later code won't compile as it is not recognised as a DataFrame. If someone can suggest a fix that would be helpful - been going round in circles with this for some time. Thanks
What you need is a map. If you map over an Option[T] you are doing something like: "if it's None I'm doing nothing, otherwise I transform the content of the Option in something else. In your case this content is the dataframe itself. So inside this myDFOpt.map() function you can put all your dataframe transformation and just in the end do the pattern matching you did, where you may print something if you have a None.
edit:
val df2: DataFrame = createDataFrame("filename.txt").map(df=>{
val filteredDF=df.filter($"activityLabel" > 0)
val Array(trainData, testData) = filteredDF.randomSplit(Array(0.5,0.5),seed = 12345)})

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.