required: org.apache.spark.sql.Row - scala

I am running into a problem trying to convert one of the columns of a Spark DataFrame from a hexadecimal string to a double. I have the following code:
import java.math.BigInteger
import spark.implicits._

case class MsgRow(block_number: Long, to: String, from: String, value: Double)

def hex2int(hex: String): Double = new BigInteger(hex.substring(2), 16).doubleValue

txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)
I can't share the content of my txs dataframe but here is the metadata:
>txs
org.apache.spark.sql.DataFrame = [blockNumber: bigint, to: string ... 4 more fields]
but when I run this I get the error:
error: type mismatch;
found : MsgRow
required: org.apache.spark.sql.Row
MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
^
I don't understand -- why is Spark/Scala expecting a Row object? None of the examples I have seen involve an explicit conversion to a Row, and in fact most of them use an anonymous function returning a case class object, as I have above. For some reason, googling "required: org.apache.spark.sql.Row" returns only five results, none of which pertains to my situation, which is why I made the title so non-specific: there is little chance of a false positive. Thanks in advance!

Your error is because you are storing the output in the same variable: txs is a DataFrame (that is, a Dataset[Row]), so it expects Row, while you are returning MsgRow. So changing
txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)
to
val newTxs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), new BigInteger(row.getString(3).substring(2), 16).doubleValue)
)
should solve your issue.
I have inlined the conversion and excluded the hex2int function, as it was giving a serialization error.
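If you do want to keep a named helper rather than inlining the conversion, one common workaround for that serialization error (just a sketch, not verified against the asker's environment; HexUtil and hex2double are names made up here) is to define the function inside a serializable object, so the closure shipped to the executors does not capture an unserializable outer class:
import java.math.BigInteger

object HexUtil extends Serializable {
  // Parse a 0x-prefixed hex string into a Double
  def hex2double(hex: String): Double =
    new BigInteger(hex.substring(2), 16).doubleValue
}

val newTxs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), HexUtil.hex2double(row.getString(3)))
)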

Thank you @Ramesh for pointing out the bug in my code. His solution works, though it does not mention the problem that pertains more directly to my OP, which is that the result returned from map is not a DataFrame but rather a Dataset. Rather than creating a new variable, all I needed to do was change
txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)
to
txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
).toDF
This is probably the easy answer for most errors matching my title. While @Ramesh's answer got rid of that error, I later ran into another error related to the same fundamental issue when I tried to join this result to another DataFrame.

Related

scala DataFrame cast log failure warning

I have a DataFrame in scala with a column of type String.
I want to cast it to type Long.
I found that the easy way to do that is by using the cast function:
import org.apache.spark.sql.types.LongType

val df: DataFrame
df.withColumn("long_col", df("str_col").cast(LongType))
This will successfully cast "1" to 1.
But if there is a string value that can't be cast to Long, e.g. "some string", the resulting value will be null.
This is great, except I would like to know when this happens. I want to output a warning log whenever the casting failed and resulted in a null value.
And I can't just look at the output DF and check how many null values it has in the "long_col" column, because the original "string_col" column sometimes contains nulls too.
I want the following behavior:
if the value was cast correctly - all good
if there was a non-null string value that failed to cast - warning log
if there was a null value (and the result is also null) - all good
Is there any way to tell the cast function to log these warnings? I tried to read through the implementation and I didn't find any way to do it.
I found a way to do it like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, when}

def getNullsCount(df: DataFrame, column: String): Long = {
  val c: Column = df(column)
  df.select(count(when(c.isNull, true)) as "count").limit(1).collect()(0).getLong(0)
}
val countNulls: Long = getNullsCount(df, "str_col")
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val countNewNulls: Long = getNullsCount(newDF, "long_col")

if (countNulls != countNewNulls) {
  log.warn(s"failed to cast ${countNewNulls - countNulls} values")
}

newDF
I'm not sure if this is an efficient implementation. If anyone has any feedback on how to improve it I would appreciate it.
EDIT
I think this is more efficient because it can calculate both counts in parallel:
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))

val nullsCount1 = df.select(count(when(df("str_col").isNull, true)) as "str_col_count")
val nullsCount2 = newDF.select(count(when(newDF("long_col").isNull, true)) as "long_col_count")

val joined = nullsCount1.join(nullsCount2)
val nullsDiff = joined.select(col("long_col_count") - col("str_col_count") as "diff")
val diffs: Map[String, Long] = nullsDiff.limit(1).collect()(0).getValuesMap[Long](Seq("diff"))
val diff: Long = diffs("diff")

if (diff != 0) {
  log.warn(s"failed to cast $diff values")
}
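Another option, as a rough sketch assuming the same column names and the same log object used above, is to compute both null counts in a single aggregation over newDF, which avoids the join and the second scan of df:
import org.apache.spark.sql.functions.{col, count, when}

val counts = newDF.agg(
  count(when(col("str_col").isNull, true)) as "str_col_count",
  count(when(col("long_col").isNull, true)) as "long_col_count"
).collect()(0)

// long_col is null wherever str_col was null or the cast failed, so the difference is the number of failures
val failed = counts.getLong(1) - counts.getLong(0)
if (failed != 0) {
  log.warn(s"failed to cast $failed values")
}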

Scala withColumn only if both columns exist

I have seen some variations of this question asked but haven't found exactly what I'm looking for. Here is the question:
I have some report names that I have collected in a dataframe and pivoted. The trouble I am having is with the resilience of report_name. I can't be sure that in every 90-day window data will be present and that Rpt1, Rpt2, and Rpt3 will all be there. So how do I go about adding a calculation ONLY if the column is present? I have outlined how my code looks right now. It works if all the columns are there, but I'd like to future-proof it so that if a report is not present in the 90-day window the pipeline will not error out, but instead just skip the .withColumn addition.
df1 = (reports.alias("r")
  .groupBy(uniqueid)
  .filter("current_date<=90")
  .pivot(report_name))

Result would be the following columns: uniqueid, Rpt1, Rpt2, Rpt3

+---+----+----+----+
|id |Rpt1|Rpt2|Rpt3|
+---+----+----+----+
|205|72  |36  |12  |
+---+----+----+----+

df2 = (df1.alias("d1")
  .withColumn("new_calc", expr("Rpt2/Rpt3")))
You can catch the error with a Try monad and return the original dataframe if withColumn fails.
import scala.util.Try
import org.apache.spark.sql.functions.expr

val df2 = Try(df1.withColumn("new_calc", expr("Rpt2/Rpt3")))
  .getOrElse(df1)
  .alias("d1")
You can also define it as a method if you want to reuse it:
import org.apache.spark.sql.Column

def withColumnIfExist(df: DataFrame, colName: String, col: Column): DataFrame =
  Try(df.withColumn(colName, col)).getOrElse(df)
val df3 = withColumnIfExist(df1, "new_calc", expr("Rpt2/Rpt3"))
.alias("d1")
And if you need to chain multiple transformation you can use it with transform:
val df4 = df1.alias("d1")
.transform(withColumnIfExist(_, "new_calc", expr("Rpt2/Rpt3")))
.transform(withColumnIfExist(_, "new_calc_2", expr("Rpt1/Rpt2")))
Or you can implement it as an extension method with implicit class:
implicit class RichDataFrame(df: DataFrame) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(df.withColumn(colName, col)).getOrElse(df)
}
val df5 = df1.alias("d1")
.withColumnIfExist("new_calc", expr("Rpt2/Rpt3"))
.withColumnIfExist("new_calc_2", expr("Rpt1/Rpt2"))
Since withColumn is available on all Datasets, you can also make withColumnIfExist work generically for any Dataset, including DataFrame:
implicit class RichDataset[A](ds: Dataset[A]) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(ds.withColumn(colName, col)).getOrElse(ds.toDF)
}
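An alternative that avoids catching analysis exceptions altogether (again just a sketch, not part of the original answer; withColumnIfColumnsExist is a made-up name) is to check the DataFrame's schema up front and only add the column when every input column it needs is present:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.expr

// Adds the column only if all required input columns exist; otherwise returns df unchanged
def withColumnIfColumnsExist(df: DataFrame, colName: String, required: Seq[String], col: Column): DataFrame =
  if (required.forall(df.columns.contains)) df.withColumn(colName, col) else df

val df6 = withColumnIfColumnsExist(df1, "new_calc", Seq("Rpt2", "Rpt3"), expr("Rpt2/Rpt3"))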

error: value write is not a member of Unit

DataFrame output:
+--------------+---------------+--------------------+
|Occurence_Date|Duplicate_Count|             Message|
+--------------+---------------+--------------------+
|     13/4/2020|              0|No Duplicate reco...|
+--------------+---------------+--------------------+

Final_df2: Unit = ()
Code:
Final_df2.write.csv("/tmp/first_par_to_csv.csv")
But erroing out:
error: value write is not a member of Unit
Final_df2.write.csv("/tmp/first_par_to_csv.csv")
I assume this is a further extension of a previous question posted by the same user.
I am assuming you got Final_df2 by calling show on Final_df1, as in that previous question and as Goutam has already pointed out; show returns Unit, which is why Final_df2 is of type Unit and has no write member.
To resolve this, and in continuation of your previous post, here is what you need to do:
import spark.implicits._

val originalString = "Data_time_Occured1,4,Message1"
val Final_df = Seq(originalString)
val Final_df1 = Final_df.map(_.split(",")).map(x => (x(0).trim, x(1).trim.toInt, x(2).trim)).toDF("Data_time_Occured", "Duplicate_Count", "Message")
Final_df1.write.csv("//path//to//your//destination//folder")
Usually you get this issue when your DataFrame variable does not actually hold a DataFrame, e.g.:
var df = spark.read.csv("file:///home/praveen/emp.csv").show
df.show
When you then call df.show(), you obviously get an error, because the .show at the end of the first line returns Unit, so df holds Unit rather than a DataFrame, and you can't call show (or write) on it again.
So what I'm saying is that your Final_df2 is not a DataFrame. To debug this I would need to know how you created your Final_df2 object.
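In other words, keep the DataFrame and the action separate, roughly like this (a sketch based on the example above; the output path is made up):
val df = spark.read.csv("file:///home/praveen/emp.csv")  // df is a DataFrame
df.show()                                                // show() is an action and returns Unit, so don't assign its result
df.write.csv("/tmp/emp_out.csv")                         // write works because df is still a DataFrame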

How to correctly handle Option in Spark/Scala?

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
  case Some(df) => // proceed with pipeline
    df.filter($"activityLabel" > 0)
  case None => println("could not create dataframe")
}

val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345)
I need df2 to be of type DataFrame, otherwise later code won't recognise df2 as a DataFrame, e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345).
However, the case None branch is not of type DataFrame (it returns Unit), so this won't compile. But if I don't declare the type of df2, the later code won't compile either, because df2 is not recognised as a DataFrame. If someone can suggest a fix that would be helpful - I've been going round in circles with this for some time. Thanks
What you need is a map. If you map over an Option[T], you are saying: "if it's None, do nothing; otherwise, transform the content of the Option into something else." In your case that content is the DataFrame itself. So inside the map() you can put all your DataFrame transformations, and only at the end do the pattern matching you did, where you may print something if you get a None.
edit:
val result: Option[Array[DataFrame]] = createDataFrame("filename.txt").map { df =>
  val filteredDF = df.filter($"activityLabel" > 0)
  filteredDF.randomSplit(Array(0.5, 0.5), seed = 12345)
}
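If you still want to report the None case, you can pattern match once at the end on the Option (a sketch continuing from the code above, where result is the Option[Array[DataFrame]] produced by the map):
result match {
  case Some(Array(trainData, testData)) =>
    // continue the pipeline with trainData and testData
    println(s"train rows: ${trainData.count()}, test rows: ${testData.count()}")
  case _ =>
    println("could not create dataframe")
}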

Share HDInsight SPARK SQL Table saveAsTable does not work

I want to show the data from HDInsight Spark using Tableau. I was following this video, which describes how to connect the two systems and expose the data.
Currently my script itself is very simple, as shown below:
// csvLines is an RDD of strings, one per line in the CSV file
val csvLines = sc.textFile("wasb://mycontainer@mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")

// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)

// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
).toDF()

// Register as a temporary table called "test_table" and save it
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still errors out:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?
I know it was my mistake, but I'll leave this as an answer since it was not readily available in any of the blogs or forum answers I found. Hopefully it will help someone like me who is starting with Spark.
I figured out that .toDF() actually creates a DataFrame based on the sqlContext rather than the hiveContext, so I have now updated my code as below:
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
)

// Create the DataFrame via the hiveContext (an org.apache.spark.sql.hive.HiveContext)
// and register it as a temporary table called "mydata_stored"
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
Also make sure that s(4) holds a proper double value, or add a try/catch to handle it. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _: NumberFormatException => 0.00 }
parseDouble(s(4))
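Putting that together, the mapping step would then call parseDouble instead of toDouble directly (a sketch based on the code above):
def parseDouble(s: String): Double =
  try { s.toDouble } catch { case _: NumberFormatException => 0.00 }

val myData = csvLines.map(_.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0), s(1), s(2), s(3), parseDouble(s(4)), s(5))
)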
Regards
Kiran