Changing columns that are string in Spark GraphFrame - scala

I'm using GraphFrames with Spark 2.0 and Scala.
I need to remove double quotes from the columns that are of string type (out of many columns).
I'm trying to do so using a UDF as follows:
import org.apache.spark.sql.functions.udf
val removeDoubleQuotes = udf( (x: Any) =>
  x match {
    case s: String => s.replace("\"", "")
    case other     => other
  }
)
And I get the following error, since type Any is not supported:
java.lang.UnsupportedOperationException: Schema for type Any is not
supported
What is a workaround for that?

You don't have a column of type Any, and a UDF cannot return different data types depending on the input; it needs a single, concrete return type.
If your column is of type String, then you can create the UDF as
import org.apache.spark.sql.functions.udf
val removeDoubleQuotes = udf( (x: String) => x.replace("\"", "") )
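Since the goal is to clean every string column out of many, here is a minimal sketch (assuming df stands for the vertices or edges DataFrame backing your GraphFrame) that walks the schema and applies the UDF only to StringType columns:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.StringType

val removeDoubleQuotes = udf((x: String) => if (x == null) null else x.replace("\"", ""))

// Apply the UDF to every StringType column, leaving other columns untouched.
// `df` is assumed to be the vertices (or edges) DataFrame of your GraphFrame.
def stripQuotesFromStringColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    if (field.dataType == StringType)
      acc.withColumn(field.name, removeDoubleQuotes(acc(field.name)))
    else acc
  }
You could then rebuild the graph from the cleaned DataFrames, e.g. with something like GraphFrame(stripQuotesFromStringColumns(g.vertices), stripQuotesFromStringColumns(g.edges)).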

Related

Update date format in spark dataframe for multiple spark columns

I have a Spark dataframe where a few columns have different date formats.
To handle this I have written the code below to keep a consistent format for all the date columns.
Since the date format of a column may change over time, I have defined a set of candidate date formats in dt_formats.
def to_timestamp_multiple(s: Column, formats: Seq[String]): Column = {
  coalesce(formats.map(fmt => to_timestamp(s, fmt)): _*)
}

val dt_formats = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")

val newDF = df
  .withColumn("ETD1", date_format(to_timestamp_multiple($"ETD", dt_formats).cast("date"), "yyyy-MM-dd"))
  .drop("ETD")
  .withColumnRenamed("ETD1", "ETD")
But this forces me to create a new column, drop the old column, and then rename the new one, which makes the code unnecessarily clumsy, so I would like to avoid this pattern.
I am trying to implement similar functionality with the Scala code below, but it throws org.apache.spark.sql.catalyst.parser.ParseException, and I am unable to identify what change I should make to get it working.
val CleansedData = rawDF.selectExpr(rawDF.columns.map { x =>
  x match {
    case "ETA" => s"""date_format(to_timestamp_multiple($x, dt_formats).cast("date"), "yyyy-MM-dd") as ETA"""
    case _     => x
  }
}: _*)
Hence seeking help.
Thanks in advance.
Create a UDF to use with select. The select method takes columns and produces another DataFrame.
Also, instead of using coalesce, it might be more straightforward simply to build a parser that handles all of the formats. You can use DateTimeFormatterBuilder for this.
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeFormatterBuilder
import java.time.LocalDate
import java.sql.Date
import scala.util.Try
import org.apache.spark.sql.functions.{col, udf}

val dtFormatStrings: Seq[String] = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")

// use foldLeft with the appendOptional method, which for each format
// returns a new builder with that additional possible format
val initBuilder = new DateTimeFormatterBuilder()
val builder: DateTimeFormatterBuilder = dtFormatStrings.foldLeft(initBuilder)(
  (b: DateTimeFormatterBuilder, s: String) => b.appendOptional(DateTimeFormatter.ofPattern(s)))
val formatter = builder.toFormatter()

// Create the UDF, which just wraps
// any function returning a sql-compatible type (java.sql.Date, here)
def toTimeStamp2(dateString: String): Date = {
  val dateTry: Try[Date] = Try(java.sql.Date.valueOf(LocalDate.parse(dateString, formatter)))
  dateTry.toOption.getOrElse(null)
}

val timeConversionUdf = udf(toTimeStamp2 _)

// example DF and new DF (toDF on a local Seq assumes spark.implicits._ is in scope)
val df = Seq(("05/08/20"), ("2020-04-03"), ("unparseable")).toDF("ETD")
df.select(timeConversionUdf(col("ETD"))).toDF("ETD2").show
Output:
+----------+
| ETD2|
+----------+
|2020-05-08|
|2020-04-03|
| null|
+----------+
Note that unparseable values end up null, as shown.
Try withColumn(...) with the same column name and coalesce, as below:
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("ETD", coalesce(dt_formats.map(fmt => to_date($"ETD", fmt)):_*))

Infer Schema from rdd to Dataframe in Spark Scala

This question is a reference from (Spark - creating schema programmatically with different data types)
I am trying to infer a schema for an RDD and convert it to a DataFrame. Below is my code:
def inferType(field: String) = field.split(":")(1) match {
  case "Integer"   => IntegerType
  case "Double"    => DoubleType
  case "String"    => StringType
  case "Timestamp" => TimestampType
  case "Date"      => DateType
  case "Long"      => LongType
  case _           => StringType
}
val header = "c1:String|c2:String|c3:Double|c4:Integer|c5:String|c6:Timestamp|c7:Long|c8:Date"
val df1 = Seq(("a|b|44.44|5|c|2018-01-01 01:00:00|456|2018-01-01")).toDF("data")
val rdd1 = df1.rdd.map(x => Row(x.getString(0).split("\\|"): _*))
val schema = StructType(header.split("\\|").map(column => StructField(column.split(":")(0), inferType(column), true)))
val df = spark.createDataFrame(rdd1, schema)
df.show()
When I call show, it throws the error below. I have to perform this operation on larger-scale data and am having trouble finding the right solution. Can anybody please help me find a solution for this, or any other way to achieve it?
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
Thanks in advance
Short answer: declaring a type in the schema does not convert string values; you have to parse each field yourself before building the Row.
What you are trying to do is parse string data into typed SQL columns. The difference from the referenced example is that it loads from CSV, where Spark parses the values according to the schema, whereas here you split a raw string yourself, so every field is still a String when the Row is created.
A working version can be achieved like this:
// skipped other details such as imports, inferType, spark session...
val header = "c1:String|c2:String|c3:Double|c4:Integer"

// Create a `Row` from a `Seq`
val row = Row.fromSeq(Seq("a|b|44.44|12|"))

// Create an `RDD[Row]`, converting each field to its target Scala type
val rdd: RDD[Row] = spark.sparkContext
  .makeRDD(List(row))
  .map { row =>
    row.getString(0).split("\\|") match {
      case Array(col1, col2, col3, col4) =>
        Row(col1, col2, col3.toDouble, col4.toInt)
    }
  }

val stt: StructType = StructType(
  header
    .split("\\|")
    .map(column => StructField(column.split(":")(0), inferType(column), true))
)

val dataFrame = spark.createDataFrame(rdd, stt)
dataFrame.show()
The reason for building the Row from converted Scala values is that the external types have to be compatible with the schema (Int for IntegerType, Double for DoubleType, and so on).
Note that I skipped the date and time related fields; date conversions are tricky. You can check my other answer on how to use formatted dates and timestamps here.
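For the full header from the question (including the numeric, timestamp and date fields), a hedged sketch is to drive the per-field conversion from the same type mapping instead of hard-coding the pattern match. It reuses header, inferType, df1 and schema from the question, and assumes the timestamp/date strings use the JDBC formats shown in the sample data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Convert one raw string to the external type matching its declared SQL type.
def convertField(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType   => raw.toInt
  case DoubleType    => raw.toDouble
  case LongType      => raw.toLong
  case TimestampType => java.sql.Timestamp.valueOf(raw) // expects "yyyy-MM-dd HH:mm:ss"
  case DateType      => java.sql.Date.valueOf(raw)      // expects "yyyy-MM-dd"
  case _             => raw                             // StringType and anything else
}

val fieldTypes: Array[DataType] = header.split("\\|").map(inferType)

// Split each raw line, pair every field with its declared type, and convert it.
val typedRdd = df1.rdd.map { row =>
  val raw = row.getString(0).split("\\|")
  Row.fromSeq(raw.zip(fieldTypes).map { case (v, t) => convertField(v, t) })
}

val typedDF = spark.createDataFrame(typedRdd, schema)
typedDF.show()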

calling a scala method passing each row of a dataframe as input

I have a dataframe with two columns in it, which has been created by importing a .txt file.
sample file content::
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
I created a dataframe by importing the above file ::
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined, let's say like this ::
def getFormattedName(contentName: String, VersionDetails: String): Option[String] = {
  Option(contentName + VersionDetails)
}
Now what I need to do is take each row of the dataframe and call getFormattedName, passing the two column values of that row as arguments.
I tried this (and many other variations) but it did not work out ::
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
I think you have structured data with a known schema, so it can be represented by a dataframe, and dataframes have built-in support for reading CSV input.
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("contentName", StringType, true),
  StructField("titleVersionDesc", StringType, true)))

val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on the dataframe, you can create a UDF (User Defined Function).
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"colName" syntax

def getFormattedName(contentName: String, titleVersionDesc: String): Option[String] = {
  Option(contentName + titleVersionDesc)
}

val get_formatted_name = udf(getFormattedName _)

df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
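As a small follow-up (a hedged variation, not part of the original answer), you could also attach the result as a new column next to the originals instead of selecting it on its own:
// Keep the original columns and add the formatted name alongside them.
val withFormatted = df.withColumn(
  "formattedName",
  get_formatted_name($"contentName", $"titleVersionDesc"))

withFormatted.show(false)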
Try
val a = b.map(row => getFormattedName(row.getString(0), row.getString(1)))
Remember that the rows of a dataframe have their own type, Row, not a tuple, so you need to use Row's accessor methods (such as getString) to refer to their elements.

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
  Dog("Rex"),
  Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
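For completeness, here is a rough sketch (the names are made up) of what a Java-bean-style class would look like if you really wanted to keep the classOf[...] overload:
// A bean-style class: no-arg constructor plus getter/setter, so the
// Java-bean introspection used by createDataFrame(rdd, beanClass) can find the property.
class DogBean extends Serializable {
  private var name: String = _
  def getName: String = name
  def setName(value: String): Unit = { name = value }
}

val beanRDD = sc.parallelize(Seq("Rex", "Fido")).map { n =>
  val bean = new DogBean
  bean.setName(n)
  bean
}
val dogBeanDF = sqlContext.createDataFrame(beanRDD, classOf[DogBean])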
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode; it'll give a ClassNotFoundException for the case class you defined.
Convert it to an RDD[Row], define the schema of your RDD with StructField, and then call createDataFrame, like:
val rdd = data.map { attrs => Row(attrs(0),attrs(1)) }
val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true),StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd,rddStruct)
toDF() won't work either.
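Adapting that schema-based approach to the Dog example from the question would look roughly like this (a sketch, using the same sc and sqlContext):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build Rows and a matching schema instead of relying on the case class.
val dogRows = sc.parallelize(Seq(Row("Rex"), Row("Fido")))
val dogSchema = StructType(Array(StructField("name", StringType, nullable = true)))

val dogDF2 = sqlContext.createDataFrame(dogRows, dogSchema)
dogDF2.show()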

addition of two dataframe integer values in Scala/Spark

So I'm new to both Scala and Spark so it may be kind of a dumb question...
I have the following code :
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(List(1,2,3)).toDF();
df.foreach( value => println( value(0) + value(0) ) );
error: type mismatch;
found : Any
required: String
What is wrong with it? How do I tell it "this is an integer, not an Any"?
I tried value(0).toInt but got "value toInt is not a member of Any".
I tried List(1: Integer, 2: Integer, 3: Integer) but then I cannot convert it into a dataframe afterward...
Spark Row is an untyped container. If you want to extract anything other than Any, you have to use a typed extractor method or pattern matching over the Row (see Spark extracting values from a Row):
df.rdd.map(value => value.getInt(0) + value.getInt(0)).collect.foreach(println)
In practice there should rarely be a reason to extract these values at all. Instead, you can operate directly on the DataFrame:
df.select($"_1" + $"_1")