error: value write is not a member of Unit - scala

DataFrame output:
+--------------+---------------+--------------------+
|Occurence_Date|Duplicate_Count|             Message|
+--------------+---------------+--------------------+
|     13/4/2020|              0|No Duplicate reco...|
+--------------+---------------+--------------------+
Final_df2: Unit = ()
Code:
Final_df2.write.csv("/tmp/first_par_to_csv.csv")
But it errors out:
error: value write is not a member of Unit
Final_df2.write.csv("/tmp/first_par_to_csv.csv")

I assume this is a further extension of a previous question posted by the same user.
I am assuming you got Final_df2 by calling show on Final_df1, as shown in the previous question, which is what Goutam is pointing out.
To resolve this, and in continuation of your previous post, here is what you need to do:
import spark.implicits._ // needed for .toDF on a local Seq

val originalString = "Data_time_Occured1,4,Message1"
val Final_df = Seq(originalString)
val Final_df1 = Final_df
  .map(_.split(","))
  .map(x => (x(0).trim, x(1).trim.toInt, x(2).trim))
  .toDF("Data_time_Occured", "Duplicate_Count", "Message")
Final_df1.write.csv("//path//to//your//destination//folder")

Usually you get this issue when your DF object is incorrect, e.g.:
var df = spark.read.csv("file:///home/praveen/emp.csv").show
df.show
When you then call df.show you obviously get an error, because the chain already ends in show, which returns Unit, so df is not a DataFrame at all; you can't call show (or write) on it again.
So what I'm saying is that your Final_df2 is incorrect. To debug this I need to know how you created your Final_df2 object.
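For example, here is a minimal sketch of the correct pattern (the read is only a placeholder for however Final_df2 was really built): keep the DataFrame in the val and call show separately.
val Final_df2 = spark.read
  .option("header", "true")
  .csv("/path/to/input.csv")   // placeholder source

Final_df2.show()                                  // show returns Unit, so never assign its result
Final_df2.write.csv("/tmp/first_par_to_csv.csv")  // works, because Final_df2 is a DataFrame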

scala DataFrame cast log failure warning

I have a DataFrame in scala with a column of type String.
I want to cast it to type Long.
I found that the easy way to do that is by using the cast function:
val df: DataFrame
df.withColumn("long_col", df("str_col").cast(LongType))
This will successfully cast "1" to 1.
But if there is a string value that can't be cast to Long, e.g. "some string", the result value will be null.
This is great, except I would like to know when this happens. I want to output a warning log whenever the casting failed and resulted in a null value.
And I can't just look at the output DF and check how many null values it has in the "long_col" column, because the original "str_col" column sometimes contains nulls too.
I want the following behavior:
if the value was cast correctly - all good
if there was a non-null string value that failed to cast - warning log
if there was a null value (and the result is also null) - all good
Is there any way to tell the cast function to log these warnings? I tried to read through the implementation and I didn't find any way to do it.
I found a way to do it like this:
def getNullsCount(df: DataFrame, column: String): Long = {
  val c: Column = df(column)
  df.select(count(when(c.isNull, true)) as "count").limit(1).collect()(0).getLong(0)
}
val countNulls: Long = getNullsCount(df, "str_col")
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val countNewNulls: Long = getNullsCount(newDF, "long_col")
if (countNulls != countNewNulls) {
  log.warn(s"failed to cast ${countNewNulls - countNulls} values")
}
newDF
I'm not sure if this is an efficient implementation. If anyone has any feedback on how to improve it I would appreciate it.
EDIT
I think this is more efficient because it can calculate both counts in parallel:
val newDF = df.withColumn("long_col", df("str_col").cast(LongType))
val nullsCount1 = df.select(count(when(df("str_col").isNull, true)) as "str_col_count")
val nullsCount2 = newDF.select(count(when(newDF("long_col").isNull, true)) as "long_col_count")
val joined = nullsCount1.join(nullsCount2)
val nullsDiff = joined.select(col("long_col_count") - col("str_col_count") as "diff")
val diffs: Map[String, Long] = nullsDiff.limit(1).collect()(0).getValuesMap[Long](Seq("diff"))
val diff: Long = diffs("diff")
if (diff != 0) {
  log.warn(s"failed to cast $diff values")
}
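An even simpler option (a sketch along the same lines, reusing df, "str_col" and log from above) is to count the failures directly with a single filter, which avoids the join:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val newDF = df.withColumn("long_col", col("str_col").cast(LongType))

// rows where the source string was non-null but the cast still produced null
val failedCasts: Long = newDF
  .filter(col("str_col").isNotNull && col("long_col").isNull)
  .count()

if (failedCasts != 0) {
  log.warn(s"failed to cast $failedCasts values")
}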

value defined in "if-else structure" is not found outside the "if-else structure"

In the following code, I expected the compiler to identify that the output gets defined either in the if section or in the else section.
val df1 = spark.createDataFrame(Seq(
  (1, 10),
  (2, 20)
)).toDF("A", "B")
val df2 = spark.emptyDataFrame

if (df2.isEmpty) {
  val output = df1
}
else {
  val output = df2
}
println(output.show)
However, it gives me an error saying error: not found: value output. If I do the exact same implementation in Python it works fine and I get the expected output. In order to make this work in Spark using Scala I have defined output as a mutable variable and update it inside the if-else:
var output = spark.emptyDataFrame
if (df2.isEmpty) {
  output = df1
}
else {
  output = df2
}
println(output.show)
Why doesn't the first implementation work and is there a way to get the expected outcome without using a mutable variable?
I suspect you come from a Python background where this kind of behavior is allowed.
In Scala this is not possible to achieve as is, because the if / else structure creates a new block, and what is defined in a block only resides in such block.
You may fix this by using a mutable variable...
var output: DataFrame = _
if (df2.isEmpty) {
  output = df1
}
else {
  output = df2
}
However, this is very Java-like and goes against the principle of immutability.
In Scala, a block is an expression, and as such, they can return values.
Thus, this is the more idiomatic way to solve the problem in Scala.
val output = if(df2.isEmpty) df1 else df2
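Because a block is an expression whose value is its last expression, the same pattern also covers branches that need several statements; a minimal sketch:
val output = if (df2.isEmpty) {
  // any intermediate statements could go here
  df1
} else {
  df2
}
output.show()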

java.text.ParseException: Unparseable date: "Some(2014-05-14T14:40:25.950)"

I need to fetch the date from a file.
Below is my spark program:
import org.apache.spark.sql.SparkSession
import scala.xml.XML
import java.text.SimpleDateFormat
object Active6Month {
  def main(args: Array[String]) {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss.SSS")
    val format1 = new SimpleDateFormat("yyyy-MM")
    val spark = SparkSession.builder.appName("Active6Months").master("local").getOrCreate()
    val data = spark.read.textFile("D:\\BGH\\StackOverFlow\\Posts.xml").rdd

    val date = data.filter { line =>
      line.toString().trim().startsWith("<row")
    }.filter { line =>
      line.contains("PostTypeId=\"1\"")
    }.map { line =>
      val xml = XML.loadString(line)
      var closedDate = format1.format(format.parse(xml.attribute("ClosedDate").toString())).toString()
      (closedDate, 1)
    }.reduceByKey(_ + _)

    date.foreach(println)
    spark.stop
  }
}
And I am getting this error:
java.text.ParseException: Unparseable date: "Some(2014-05-14T14:40:25.950)"
The format of the date in the file is fine, i.e.:
CreationDate="2014-05-13T23:58:30.457"
But in the error it shows the string "Some" attached to it.
And my other question is: why does the same thing work in the code below?
val date = data.filter { line =>
  line.toString().trim().startsWith("<row")
}.filter { line =>
  line.contains("PostTypeId=\"1\"")
}.flatMap { line =>
  val xml = XML.loadString(line)
  xml.attribute("ClosedDate")
}.map { line =>
  (format1.format(format.parse(line.toString())).toString(), 1)
}.reduceByKey(_ + _)
My guess is that xml.attribute("ClosedDate").toString() is actually returning a string containing Some attached to it. Have you debugged that to make sure?
Maybe you shouldn't use toString(), but instead get the attribute value using the proper method.
Or you can do it the "ugly" way and include "Some" in the pattern:
val format = new SimpleDateFormat("'Some('yyyy-MM-dd'T'hh:mm:ss.SSS')'")
Your second approach works because (and that's a guess, because I don't code in Scala) the xml.attribute("ClosedDate") method probably returns an Option-like wrapper object, and calling toString() on that wrapper returns the string with "Some" attached to it (why? ask the API authors). But when you flatMap over it, the wrapper is flattened away, so the line variable holds the actual attribute value (without the "Some" part).
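A minimal sketch of that non-toString route, reusing line, format and format1 from the question and using scala-xml's attribute selector so that no Option wrapper is involved:
import scala.xml.XML

val xml = XML.loadString(line)

// (xml \ "@ClosedDate") returns a NodeSeq; .text is the raw attribute value,
// e.g. "2014-05-14T14:40:25.950", or "" if the attribute is missing.
val closedDateRaw = (xml \ "@ClosedDate").text

if (closedDateRaw.nonEmpty) {
  val month = format1.format(format.parse(closedDateRaw))   // e.g. "2014-05"
  // ... emit (month, 1) as before
}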

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
The StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced I need to generate a new CSV with complete details like
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter();
  val csvWriter = new CSVWriter(stringWriter);
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

//Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}
def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
File gets generated but the values are object representations instead of String value.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit (they are declared with procedure syntax, without an =); also, using an if without an else means that for some inputs there is no value to return.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joins and for writing to files; the join only takes one line of code and so does the write.
Hand-writing that code is error-prone, hard to read and most likely much slower.
val df1 = Seq((101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")
val df2 = Seq((101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")

val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created; because of the repartition(1), it contains only one part file, which is the final result.
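The same approach applies directly to the files from the question; a minimal sketch with placeholder paths, reading the CSVs with their headers:
// read both CSVs with their headers, join on StudentId, and write the result
val students = spark.read.option("header", "true").csv("/path/to/Student.csv")
val details  = spark.read.option("header", "true").csv("/path/to/StudentDetails.csv")

val result = students
  .join(details, "StudentId")
  .select("StudentName", "Course", "City")   // -> ABC,C001,NDLS and XYZ,C002,Mumbai

result.repartition(1).write.csv("/path/to/output")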

Share HDInsight SPARK SQL Table saveAsTable does not work

I want to show the data from HDInsight SPARK using tableau. I was following this video where they have described how to connect the two systems and expose the data.
Currently my script itself is very simple, as shown below:
/* csvLines is an RDD of strings, each string representing a line in the CSV file */
val csvLines = sc.textFile("wasb://mycontainer#mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")

// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)

// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
).toDF()

// Register as a temporary table called "test_table" and try to persist it
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still gives me an error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?
I know it was my mistake, but I'll leave it as an answer as it was not readily available in any of the blogs or forum answers; hopefully it will help someone like me who is starting with Spark.
I figured out that .toDF() actually creates a sqlContext-based DataFrame and not a hiveContext-based one, so I have now updated my code as below:
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
)

// Create the DataFrame through the hiveContext and register it as "mydata_stored"
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
Also make sure that s(4) holds a proper double value, or add a try/catch to handle it. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _ => 0.00 }
parseDouble(s(4))
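For what it's worth, on Spark 2.x the same flow no longer needs an explicit HiveContext; a rough sketch, assuming Hive support is available (the session and view names are illustrative, not from the original post):
import org.apache.spark.sql.{SaveMode, SparkSession}

// a SparkSession with Hive support plays the role of the old HiveContext
val spark = SparkSession.builder()
  .appName("SaveAsTableExample")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val myDataFrame = myData.toDF()   // myData built as above
myDataFrame.createOrReplaceTempView("mydata_stored_view")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")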