Scala DataFrame cast: log a warning when the cast fails

I have a DataFrame in scala with a column of type String.
I want to cast it to type Long.
I found that the easy way to do that is by using the cast function:
import org.apache.spark.sql.types.LongType

val df: DataFrame
df.withColumn("long_col", df("str_col").cast(LongType))
This will successfully cast "1" to 1.
But if there is a string value that can't be cast to Long, e.g. "some string", the resulting value will be null.
This is great, except I would like to know when this happens. I want to log a warning whenever a cast fails and results in a null value.
And I can't just look at the output DF and check how many null values it has in the "long_col" column, because the original "str_col" column sometimes contains nulls too.
I want the following behavior:
if the value was cast correctly - all good
if there was a non-null string value that failed to cast - warning log
if there was a null value (and the result is also null) - all good
Is there any way to tell the cast function to log these warnings? I tried to read through the implementation and I didn't find any way to do it.

I found a way to do it like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, when}

def getNullsCount(df: DataFrame, column: String): Long = {
  val c: Column = df(column)
  df.select(count(when(c.isNull, true)) as "count").limit(1).collect()(0).getLong(0)
}
val countNulls: Long = getNullsCount(df, "str_col")
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val countNewNulls: Long = getNullsCount(newDF, "long_col")
if (countNulls != countNewNulls) {
log.warn(s"failed to cast ${countNewNulls - countNulls} values")
}
newDF
I'm not sure whether this is an efficient implementation. If anyone has feedback on how to improve it, I would appreciate it.
EDIT
I think this is more efficient because it can calculate both counts in parallel:
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val nullsCount1 = df.select(count(when(df("str_col").isNull, true)) as "str_col_count")
val nullsCount2 = newDF.select(count(when(newDF("long_col").isNull, true)) as "long_col_count")
val joined = nullsCount1.join(nullsCount2)
val nullsDiff = joined.select(col("long_col_count") - col("str_col_count") as "diff")
val diffs: Map[String, Long] = nullsDiff.limit(1).collect()(0).getValuesMap[Long](Seq("diff"))
val diff: Long = diffs("diff")
if (diff != 0) {
log.warn(s"failed to cast $diff values")
}
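Another option (a sketch using the same column names; not benchmarked) would be to count the failures directly in a single pass: a cast fails exactly when the source value is non-null but the cast result is null, so one filter-and-count is enough:
// count rows where str_col is non-null but the cast produced null, i.e. the "failed to cast" case
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val failedCasts: Long = newDF.filter(newDF("str_col").isNotNull && newDF("long_col").isNull).count()
if (failedCasts != 0) {
  log.warn(s"failed to cast $failedCasts values")
}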

Related

required: org.apache.spark.sql.Row

I am running into a problem trying to convert one of the columns of a spark dataframe from a hexadecimal string to a double. I have the following code:
import spark.implicits._
case class MsgRow(block_number: Long, to: String, from: String, value: Double )
def hex2int (hex: String): Double = (new BigInteger(hex.substring(2),16)).doubleValue
txs = txs.map(row=>
MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)
I can't share the content of my txs dataframe but here is the metadata:
>txs
org.apache.spark.sql.DataFrame = [blockNumber: bigint, to: string ... 4 more fields]
but when I run this I get the error:
error: type mismatch;
found : MsgRow
required: org.apache.spark.sql.Row
MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
^
I don't understand -- why is spark/scala expecting a row object? None of the examples I have seen involve an explicit conversion to a row, and in fact most of them involve an anonymous function returning a case class object, as I have above. And for some reason, googling "required: org.apache.spark.sql.Row" returns only five results, none of which pertains to my situation. Which is why I made the title so non-specific since there is little chance of a false positive. Thanks in advance!
Your error is because you are storing the output in the same variable: txs expects Row while you are returning MsgRow. So changing
txs = txs.map(row=>
MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)
to
val newTxs = txs.map(row=>
MsgRow(row.getLong(0), row.getString(1), row.getString(2), (new BigInteger(row.getString(3).substring(2),16)).doubleValue)
)
should solve your issue.
I have excluded the hex2int function as it was giving a serialization error.
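One common cause of that serialization error is the function being a method on a non-serializable enclosing class, which then gets captured in the task closure. A sketch of one way to keep the conversion while avoiding that (the object name is hypothetical, and the cause is an assumption, not verified against the original code):
import java.math.BigInteger

// keeping the conversion in a standalone serializable object avoids
// capturing a non-serializable enclosing class in the task closure
object HexConversions extends Serializable {
  def hex2int(hex: String): Double = new BigInteger(hex.substring(2), 16).doubleValue
}

val newTxs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), HexConversions.hex2int(row.getString(3)))
)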
Thank you @Ramesh for pointing out the bug in my code. His solution works, though it does not mention the problem that pertains more directly to my original post, which is that the result returned from map is not a DataFrame but rather a Dataset. Rather than creating a new variable, all I needed to do was change
txs = txs.map(row=>
MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)
to
txs = txs.map(row=>
MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
).toDF
This is probably the easy answer for most errors containing my title. While @Ramesh's answer got rid of that error, I later ran into another error related to the same fundamental issue when I tried to join this result to another DataFrame.

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in first should be replaced with StudentName and Course in the second file.
Once replaced I need to generate a new CSV with complete details like
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
csvWriter.writeAll(data.toList)
Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
File gets generated but the values are object representations instead of String value.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing useful, because their return type is Unit (they are declared with procedure syntax, without an =), and using an if without an else means that for some inputs there is no meaningful return value anyway.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array; it makes lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
  val details = studentB.value(student.StudentId)
  Array(details.StudentName, details.Course, student.City)
}.map(_.mkString(",")) // naive CSV writing with no escaping etc.; you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
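For completeness, the snippet above assumes the raw text lines have already been parsed into case classes exposing StudentId, StudentName, etc.; here is a minimal sketch of that parsing step (the case class names and field order are assumptions based on the CSV headers in the question):
// hypothetical case classes matching the CSV headers in the question
case class Student(StudentId: String, City: String)
case class StudentDetails(StudentId: String, StudentName: String, Course: String)

// parse the raw lines into case classes, skipping the header row
val studentRDD = sc.textFile("Student.csv")
  .filter(line => !line.startsWith("StudentId"))
  .map { line => val f = line.split(","); Student(f(0), f(1)) }
val studentdetailsRDD = sc.textFile("StudentDetails.csv")
  .filter(line => !line.startsWith("StudentId"))
  .map { line => val f = line.split(","); StudentDetails(f(0), f(1), f(2)) }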
Spark has great support for joins and for writing to files: the join takes one line of code and the write takes another.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq((101,"NDLS"),
(102,"Mumbai")
).toDF("id", "city")
val df2 = Seq((101,"ABC","C001"),
(102,"XYZ","C002")
).toDF("id", "name", "course")
val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created containing a single file, which is the final result.
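If the data really lives in the two CSV files, the DataFrames can also be read directly rather than built from literals (a sketch assuming Spark 2.x, a SparkSession named spark, and header rows as shown in the question):
// read the CSVs with headers, join on StudentId, and write a single output file
val students = spark.read.option("header", "true").csv("Student.csv")
val details = spark.read.option("header", "true").csv("StudentDetails.csv")
students.join(details, "StudentId")
  .select("StudentName", "Course", "City")
  .repartition(1)
  .write.csv("output")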

How to correctly handle Option in Spark/Scala?

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
  case Some(df) => { //proceed with pipeline
    df.filter($"activityLabel" > 0)
  }
  case None => println("could not create dataframe")
}
val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
I need df2 to be of type DataFrame, otherwise later code won't recognise df2 as a DataFrame, e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345)
However, the case None branch is not of type DataFrame, it returns Unit, so the match won't compile. But if I don't declare the type of df2, the later code won't compile because df2 isn't recognised as a DataFrame. If someone can suggest a fix, that would be helpful; I've been going round in circles with this for some time. Thanks
What you need is map. If you map over an Option[T], you are saying: "if it's None, do nothing; otherwise transform the content of the Option into something else." In your case this content is the DataFrame itself. So inside this myDFOpt.map() call you can put all your DataFrame transformations, and only at the end do the pattern matching you did, where you may print something if you have a None.
edit:
val df2: DataFrame = createDataFrame("filename.txt").map(df=>{
val filteredDF=df.filter($"activityLabel" > 0)
val Array(trainData, testData) = filteredDF.randomSplit(Array(0.5,0.5),seed = 12345)})
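To surface the None case afterwards, you can still pattern match on the Option returned above (a sketch; the body is just illustrative):
df2 match {
  case Some(Array(trainData, testData)) =>
    // proceed with the pipeline using trainData and testData
    println(s"train rows: ${trainData.count()}, test rows: ${testData.count()}")
  case None =>
    println("could not create dataframe")
}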

Some(null) to StringType nullable: scala.MatchError

I have an RDD[(Seq[String], Seq[String])] with some null values in the data.
The RDD converted to dataframe looks like this
+----------+----------+
| col1| col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+
Following is the sample code:
val rdd = sc.parallelize(Seq((Seq("111","aaa"),Seq("xx",null))))
val df = rdd.toDF("col1","col2")
val keys = Array("col1","col2")
val values = df.flatMap {
case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))
val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))
val transposedDf = sc.createDataFrame(transposed, schema)
transposed.show()
It runs fine until the point I create a transposedDF, however as soon as I hit show it throws the following error:
scala.MatchError: null
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
If there are no null values in the RDD the code works fine. I do not understand why it fails when I have any null values, because I am specifying a schema of StringType with nullable set to true. Am I doing something wrong? I am using Spark 1.6.1 and Scala 2.10.
Pattern matching is performed in the order the cases appear in the source, so this line:
case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
which doesn't place any restriction on the values of t1 and t2, matches even when there is a null value.
Effectively, put the null check first and it should work.
The issue is that whether you find null or not, the first pattern matches. After all, t2: Seq[String] could theoretically be null. While it's true that you can solve this immediately by simply making the null pattern appear first, I feel it is imperative to use the facilities of the Scala language to get rid of null altogether and avoid more bad runtime surprises.
So you could do something like this:
def foo(s: Seq[String]) = if (s.contains(null)) None else Some(s)
// or you could do fancy things with filter/filterNot

// apply this to the original RDD[(Seq[String], Seq[String])] rather than the DataFrame
rdd.map {
  case (first, second) => (foo(first), foo(second))
}
This will provide you the Some/None tuples you seem to want, but I would see about flattening out those Nones as well.
I think you will need to encode null values as blanks or a special String before performing these operations. Also keep in mind that Spark executes lazily, so from the line val values = df.flatMap onward, everything is executed only when show() is called.
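A minimal sketch of that encoding step, applied to the RDD from the question before building Rows (the empty-string sentinel is an assumption; any placeholder value works):
// replace nulls inside the sequences with an empty-string sentinel before converting to a DataFrame
val cleaned = rdd.map { case (c1, c2) =>
  (c1.map(s => Option(s).getOrElse("")), c2.map(s => Option(s).getOrElse("")))
}
val safeDF = cleaned.toDF("col1", "col2")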

How to convert Spark's TableRDD to RDD[Array[Double]] in Scala?

I am trying to perform Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to Array[String] and then converting using map as follows:
val tmp_2 = tmp.map(row => row.getString(0))
val tmp_3 = tmp_2.map { row =>
  val features = Array[Double](row(0)) // the block's last statement is a val definition, so it returns Unit
}
But this gives me an RDD[Unit], which cannot be used in the function. Is there any other way to proceed with this type conversion?
Edit: I also tried using toDouble, but this gives me an RDD[Double], not an RDD[Array[Double]]:
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do this as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Spark Table RDD was created first.
val tmp = sc.sql2rdd("select duration from test")
I made use of getString to translate it to an RDD[String] and then converted it to an RDD[Array[Double]].
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))
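For reference, the RDD[Double] from the earlier edit could also have been wrapped element by element to reach the same RDD[Array[Double]] type (a small alternative sketch):
// equivalent alternative: wrap each Double from tmp_5 in a one-element array
val duration2 = tmp_5.map(d => Array(d)) // RDD[Array[Double]]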