calling a scala method passing each row of a dataframe as input - scala

I have a dataframe which has two columns in it, has been created importing a .txt file.
sample file content::
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
Created a dataframe importing the above file ::
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined lets say like this ::
def getFormattedName(contentName : String, VersionDetails:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
Now what I need to do is I have to take each row of the dataframe and call the method getFormattedName passing the 2 arguments of the dataframe's each row.
I tried like this and many others but did not work out ::
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.

I think you have a structured schema and it can be represented by a dataframe.
Dataframe has support for reading the csv input.
import org.apache.spark.sql.types._
val customSchema = StructType(Array(StructField("contentName", StringType, true),StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on dataset, you can create a UDF(User Defined Function).
def getFormattedName(contentName : String, titleVersionDesc:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))

Try
val a = b.map(row => getFormattedContentName(row(0),row(1)))
Remember that the rows of a dataframe are their own type, not a tuple or something, and you need to use the correct methodology for referring to their elements.

Related

How to pass DataSet(s) to a function that accepts DataFrame(s) as arguments in Apache Spark using Scala?

I have a library in Scala for Spark which contains many functions.
One example is the following function to unite two dataframes that have different columns:
def appendDF(df2: DataFrame): DataFrame = {
val cols1 = df.columns.toSeq
val cols2 = df2.columns.toSeq
def expr(sourceCols: Seq[String], targetCols: Seq[String]): Seq[Column] = {
targetCols.map({
case x if sourceCols.contains(x) => col(x)
case y => lit(null).as(y)
})
}
// both df's need to pass through `expr` to guarantee the same order, as needed for correct unions.
df.select(expr(cols1, cols1): _*).union(df2.select(expr(cols2, cols1): _*))
}
I would like to use this function (and many more) to Dataset[CleanRow] and not DataFrames. CleanRow is a simple class here that defines the names and types of the columns.
My educated guess is to convert the Dataset into Dataframe using .toDF() method. However, I would like to know whether there are better ways to do it.
From my understanding, there shouldn't be many differences between Dataset and Dataframe since Dataset are just Dataframe[Row]. Plus, I think that from Spark 2.x the APIs for DF and DS have been unified, so I was thinking that I could pass either of them interchangeably, but that's not the case.
If changing signature is possible:
import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
def f[T](d: Dataset[T]): Dataset[T] = {d}
// You are able to pass a dataframe:
f(Seq(0,1).toDF()).show
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
// You are also able to pass a dataset:
f(spark.createDataset(Seq(0,1)))
// res2: org.apache.spark.sql.Dataset[Int] = [value: int]

unable to store row elements of a dataset, via mapPartitions(), in variables

I am trying to create a Spark Dataset, and then using mapPartitions, trying to access each of its elements and store those in variables. Using below piece of code for the same:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sql("select col1,col2,col3 from table limit 10")
val schema = StructType(Seq(
StructField("col1", StringType),
StructField("col2", StringType),
StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions{iterator => { val myList = iterator.toList
myList.map(x=> { val value1 = x.getString(0)
val value2 = x.getString(1)
val value3 = x.getString(2)}).iterator}} (encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Eventually, I am targeting to store the row elements in variables, and perform some operation with these. Not sure what am I missing here. Any help towards this would be highly appreciated!
Actually, there are several problems with your code:
Your map-statement has no return value, therefore Unit
If you return a tuple of String from mapPartitions, you don't need a RowEncoder (because you don't return a Row, but a Tuple3 which does not need a encoder because its a Product)
You can write your code like this:
df
.mapPartitions{itr => itr.map(x=> (x.getString(0),x.getString(1),x.getString(2)))}
.toDF("col1","col2","col3") // Convert Dataset to Dataframe, get desired field names
But you could just use a simple select statement in DataFrame API, no need for mapPartitions here
df
.select($"col1",$"col2",$"col3")

How to handle missing columns in spark sql

We are dealing with schema free JSON data and sometimes the spark jobs are failing as some of the columns we refer in spark SQL are not available for certain hours in the day. During these hours the spark job fails as the column being referred is not available in the data frame. How to handle this scenario? I have tried UDF but we have too many columns missing so can't really check each and every column for availability. I have also tried inferring a schema on a larger data set and applied it on the data frame expecting that missing columns will be filled with null but the schema application fails with weird errors.
Please suggest
This worked for me. Created a function to check all expected columns and add columns to dataframe if it is missing
def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]) : DataFrame = {
expectedColumnsInput.foldLeft(df) {
(df,column) => {
if(df.columns.contains(column) == false) {
df.withColumn(column,lit(null).cast(StringType))
}
else (df)
}
}
}
val expectedColumns = List("newcol1","newcol2","newcol3")
val finalDf = checkAvailableColumns(castedDateSessions,expectedColumns)
Here is an improved version of the answer #rads provided
#tailrec
def addMissingFields(fields: List[String])(df: DataFrame): DataFrame = {
def addMissingField(field: String)(df: DataFrame): DataFrame =
df.withColumn(field, lit(null).cast(StringType))
fields match {
case Nil =>
df
case c :: cs if c.contains(".") && !df.columns.contains(c.split('.')(0)) =>
val fields = c.split('.')
// it just supports one level of nested, but it can extend
val schema = StructType(Array(StructField(fields(1), StringType)))
addMissingFields(cs)(addMissingField(fields(0), schema)(df))
case ::(c, cs) if !df.columns.contains(c.split('.')(0)) =>
addMissingFields(cs)(addMissingField(c)(df))
case ::(_, cs) =>
addMissingFields(cs)(df)
}
}
Now you can use it as a transformation:
val df = ...
val expectedColumns = List("newcol1","newcol2","newcol3")
df.transform(addMissingFields(expectedColumns))
I haven't tested it in production yet to see if there is any performance issue. I doubt it. But if there was any, I'll update my post.
Here are the steps to add missing columns:
val spark = SparkSession
.builder()
.appName("Spark SQL json example")
.master("local[1]")
.getOrCreate()
import spark.implicits._
val df = spark.read.json
val schema = df.schema
val columns = df.columns // enough for flat tables
You can traverse the auto generated schema. If it is flat table just do
df.columns.
Compare the found columns to the expected columns and add the missing fields like this:
val dataframe2 = df.withColumn("MissingString1", lit(null).cast(StringType) )
.withColumn("MissingString2", lit(null).cast(StringType) )
.withColumn("MissingDouble1", lit(0.0).cast(DoubleType) )
Maybe there is a faster way to add the missing columns in one operation, instead of one by one, but the with withColumns() method which does that is private.
Here's a pyspark solution based on this answer which checks for a list of names (from a configDf - transformed into a list of columns it should have - parameterColumnsToKeepList) - this assumes all missing columns are ints but you could look this up in configdDf dynamically too. My default is null but you could also use 0.
from pyspark.sql.types import IntegerType
for column in parameterColumnsToKeepList:
if column not in processedAllParametersDf.columns:
print('Json missing column: {0}' .format(column))
processedAllParametersDf = processedAllParametersDf.withColumn(column, lit(None).cast(IntegerType()))

udf spark column names

I need to specify a sequence of columns. If I pass two strings, it works fine
val cols = array("predicted1", "predicted2")
but if I pass a sequence or an array, I get an error:
val cols = array(Seq("predicted1", "predicted2"))
Could you please help me? Many thanks!
You have at least two options here:
Using a Seq[String]:
val columns: Seq[String] = Seq("predicted1", "predicted2")
array(columns.head, columns.tail: _*)
Using a Seq[ColumnName]:
val columns: Seq[ColumnName] = Seq($"predicted1", $"predicted2")
array(columns: _*)
Function signature is def array(colName: String, colNames: String*): Column which means that it takes one string and then one or more strings. If you want to use a sequence, do it like this:
array("predicted1", Seq("predicted2"):_*)
From what I can see in the code, there are a couple of overloaded versions of this function, but neither one takes a Seq directly. So converting it into varargs as described should be the way to go.
You can use Spark's array form def array(cols: Column*): Column where the cols val is defined without using the $ column name notation -- i.e. when you want to have a Seq[ColumnName] type specifically, but created using strings. Here is how to solve that...
import org.apache.spark.sql.ColumnName
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val some_states: Seq[String] = Seq("state_AK","state_AL","state_AR","state_AZ")
val some_state_cols: Seq[ColumnName] = some_states.map(s => symbolToColumn(scala.Symbol(s)))
val some_array = array(some_state_cols: _*)
...using Spark's symbolToColumn method.
or with the ColumnName(s) constructor directly.
val some_array: Seq[ColumnName] = some_states.map(s => new ColumnName(s))

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
Dog("Rex"),
Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
Second parameter is part of Java API and expects you class follows java beans convention (getters/setters). Your case class doesn't follow this convention, so no property is detected, that leads to empty DataFrame with no columns.
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
Case Class Approach won't Work in cluster mode. It'll give ClassNotFoundException to the case class you defined.
Convert it a RDD[Row] and define the schema of your RDD with StructField and then createDataFrame like
val rdd = data.map { attrs => Row(attrs(0),attrs(1)) }
val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true),StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd,rddStruct)
toDF() wont work either