How to correctly handle Option in Spark/Scala? - scala

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
case Some(df) => { //proceed with pipeline
df.filter($"activityLabel" > 0)
case None => println("could not create dataframe")
}
val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
I need df2 to be of type: DataFrame otherwise later code won't recognise df2 as a DataFrame e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
However, the case None statement is not of type DataFrame, it returns Unit, so won't compile. But if I don't declare the type of df2 the later code won't compile as it is not recognised as a DataFrame. If someone can suggest a fix that would be helpful - been going round in circles with this for some time. Thanks

What you need is a map. If you map over an Option[T] you are doing something like: "if it's None I'm doing nothing, otherwise I transform the content of the Option in something else. In your case this content is the dataframe itself. So inside this myDFOpt.map() function you can put all your dataframe transformation and just in the end do the pattern matching you did, where you may print something if you have a None.
edit:
val df2: DataFrame = createDataFrame("filename.txt").map(df=>{
val filteredDF=df.filter($"activityLabel" > 0)
val Array(trainData, testData) = filteredDF.randomSplit(Array(0.5,0.5),seed = 12345)})

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery Read. The "wherePart" has more number of records and hence BQ call is invoked again and again. Keeping the filter outside of BQ Read would help. The idea is, first read the "mainTable" from BQ, store it in a spark view, then apply the "wherePart" filter to this view in spark.
["subDate" is a function to subtract one date from another and return the number of days in between]
val Df = getFb(config, mainTable, ds)
def getFb(config: DataFrame, mainTable: String, ds: String) : DataFrame = {
val fb = config.map(row => Target.Pfb(
row.getAs[String]("m1"),
row.getAs[String]("m2"),
row.getAs[Seq[Int]]("days")))
.collect
val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1))).
map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')").
mkString(" OR ")
val q = new Q()
val tempView = "tempView"
spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
val Df = q.mainTableLogs(tempView)
Df
}
Could someone please help me here.
Are you using the spark-bigquery-connector? If so the right syntax is
spark.read.format("bigquery")
.load(mainTable)
.where(wherePart)
.createOrReplaceTempView(tempView)

scala DataFrame cast log failure warning

I have a DataFrame in scala with a column of type String.
I want to cast it to type Long.
I found that the easy way to do that is by using the cast function:
val df: DataFrame
df.withColumn("long_col", df("str_col").cast(LongType))
This will successfully cast "1" to 1.
But if there is a string value that can't be cast to Long, e.g "some string" the result value will be null.
This is great, except I would like to know when this happens. I want to output a warning log whenever the casting failed and resulted in a null value.
And I can't just look at the output DF and check how many null values it has in the "long_col" column, because the original "string_col" column sometimes contains nulls too.
I want the following behavior:
if the value was cast correctly - all good
if there was a non-null string value that failed to cast - warning log
if there was a null value (and the result is also null) - all good
Is there any way to tell the cast function to log these warnings? I tried to read through the implementation and I didn't find any way to do it.
I found a way to do it like this:
def getNullsCount(df: DataFrame, column: String): Long = {
val c: Column = df(column)
df.select(count(when(c.isNull, true)) as "count").limit(1).collect()(0).getLong(0)
}
val countNulls: Long = getNullsCount(df, "str_col")
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val countNewNulls: Long = getNullsCount(newDF, "long_col")
if (countNulls != countNewNulls) {
log.warn(s"failed to cast ${countNewNulls - countNulls} values")
}
newDF
I'm not sure if this is an efficient implementation. If anyone has any feedback on how to improve it I would appreciate it.
EDIT
I think this is more efficient because it can calculate both counts in parallel:
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val nullsCount1 = df.select(count(when(df("str_col").isNull, true)) as "str_col_count")
val nullsCount2 = newDF.select(count(when(newDF("long_col").isNull, true)) as "long_col_count")
val joined = nullsCount1.join(nullsCount2)
val nullsDiff = joined.select(col("long_col_count") - col("str_col_count") as "diff")
val diffs: Map[String, Long] = nullsDiff.limit(1).collect()(0).getValuesMap[Long](Seq("diff"))
val diff: Long = diffs("diff")
if (diff != 0) {
log.warn(s"failed to cast $diff values")
}

How to deal with None output of a function in Scala?

I have the following function:
def getData(spark: SparkSession,
indices: Option[String]): Option[DataFrame] = {
indices.map{
ind =>
spark
.read
.format("org.elasticsearch.spark.sql")
.load(ind)
}
}
This function returns Option[DataFrame].
Then I want to use this function as follows:
val df = getData(spark, indices)
df.persist(StorageLevel.MEMORY_AND_DISK)
Of course the last two lines of code will not compile because df might be None. What is the idiomatic way deal with None output in Scala?
I would like to throw an exception and stop the program if df is None. Otherwise I want to persist it.
If you do care about the None I'd use simple pattern match here:
df match {
case None => throw new RuntimeException()
case Some(dataFrame) => dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
}
But if you don't care, just use foreach like:
df.foreach { dataFrame =>
dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
}
val df = dfOption.getOrElse(throw new Exception("Disaster Strikes"))
df.persist(...)

Some(null) to Stringtype nullable scala.matcherror

I have an RDD[(Seq[String], Seq[String])] with some null values in the data.
The RDD converted to dataframe looks like this
+----------+----------+
| col1| col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+
Following is the sample code:
val rdd = sc.parallelize(Seq((Seq("111","aaa"),Seq("xx",null))))
val df = rdd.toDF("col1","col2")
val keys = Array("col1","col2")
val values = df.flatMap {
case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))
val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))
val transposedDf = sc.createDataFrame(transposed, schema)
transposed.show()
It runs fine until the point I create a transposedDF, however as soon as I hit show it throws the following error:
scala.MatchError: null
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
IF there are no null values in the rdd the code works fine. I do not understand why does it fail when I have any null values, becauase I am specifying the schema of StringType with nullable as true. Am i doing something wrong? I am using spark 1.6.1 and scala 2.10
Pattern match is performed linearly as it appears in the sources, so, this line:
case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
Which doesn't have any restrictions on the values of t1 and t2 never matter match with the null value.
Effectively, put the null check before and it should work.
The issue is that whether you find null or not the first pattern matches. After all, t2: Seq[String] could theoretically be null. While it's true that you can solve this immediately by simply making the null pattern appear first, I feel it is imperative to use the facilities in the Scala language to get rid of null altogether and avoid more bad runtime surprises.
So you could do something like this:
def foo(s: Seq[String]) = if (s.contains(null)) None else Some(s)
//or you could do fancy things with filter/filterNot
df.map {
case (first, second) => (foo(first), foo(second))
}
This will provide you the Some/None tuples you seem to want, but I would see about flattening out those Nones as well.
I think you will need to encode null values to blank or special String before performing assert operations. Also keep in mind that Spark executes lazily. So from the like "val values = df.flatMap" onward everything is executed only when show() is executed.

Filtering in Scala

So suppose I have the following data (only the first few rows, this data covers an entire year) -
(2014-08-31T00:05:00.000+01:00, John)
(2014-08-31T00:11:00.000+01:00, Sarah)
(2014-08-31T00:12:00.000+01:00, George)
(2014-08-31T00:05:00.000+01:00, John)
(2014-09-01T00:05:00.000+01:00, Sarah)
(2014-09-01T00:05:00.000+01:00, George)
(2014-09-01T00:05:00.000+01:00, Jason)
I would like to filter the data so that I only see what the names are for a specific date (say, 2014-09-05). I've tried doing this using the filter function in Scala but I keep receiving the following error -
error: value xxxx is not a member of (org.joda.time.DateTime, String)
Is there another way of doing this?
The filter method takes a function, called a predicate, that takes as parameter an element of your (I'm assuming) RDD, and returns a Boolean.
The returned RDD will keep only the rows for which the predicate evaluates to true.
In your case, it seems that what you want is something like
rdd.filter{
case (date, _) => date.withTimeAtStartOfDay() == new DateTime("2017-03-31")
}
I presume from the tag your question is in the context of Spark and not pure Scala. Given that, you could filter a dataframe on a date and get the associated name(s) like this:
import org.apache.spark.sql.functions._
import sparkSession.implicits._
Seq(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah")
...
)
.toDF("date", "name")
.filter(to_date('date).equalTo(Date.valueOf("2014-09-05")))
.select("name")
Note that the Date above is java.sql.Date.
Here's a function that takes a date, a list of datetime-name pairs, and returns a list of names for the date:
def getNames(d: String, l: List[(String, String)]): List[String] = {
val date = """^([^T]*).*""".r
val dateMap = list.map {
case (x, y) => ( x match { case date(z) => z }, y )
}.
groupBy(_._1) mapValues( _.map(_._2) )
dateMap.getOrElse(d, List[String]())
}
val list = List(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah"),
("2014-08-31T00:12:00.000+01:00", "George"),
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-09-01T00:05:00.000+01:00", "Sarah"),
("2014-09-01T00:05:00.000+01:00", "George"),
("2014-09-01T00:05:00.000+01:00", "Jason")
)
getNames("2014-09-01", list)
res1: List[String] = List(Sarah, George, Jason)
val dateTimeStringZero = "2014-08-12T00:05:00.000+01:00"
val dateTimeOne:DateTime = org.joda.time.format.ISODateTimeFormat.dateTime.withZoneUTC.parseDateTime(dateTimeStringZero)
import java.text.SimpleDateFormat
val df = new DateTime(new SimpleDateFormat("yyyy-MM-dd").parse("2014-08-12"))
println(dateTimeOne.getYear==df.getYear)
println(dateTimeOne.getMonthOfYear==df.getYear)
...