How do you change schema in a Spark `DataFrame.map()` operation without joins? - scala

In Spark v3.0.1 I have a DataFrame of arbitrary schema.
I want to turn that DataFrame of arbitrary schema into a new DataFrame with the same schema and a new column that is the result of a calculation over the data discretely present in each row.
I can safely assume that certain columns of certain types are available for the logical calculation despite the DataFrame being of arbitrary schema.
I have solved this previously by creating a new Dataset[outcome] of two columns:
the KEY from the input DataFrame
the OUTCOME of the calculation
... and then joining that DF back on the initial input to add the new column:
val inputDf = Seq(
  ("1", "input1", "input2"),
  ("2", "anotherInput1", "anotherInput2")
).toDF("key", "logicalInput1", "logicalInput2")
case class outcome(key: String, outcome: String)
val outcomes = inputDf.map(row => {
  val input1 = row.getAs[String]("logicalInput1")
  val input2 = row.getAs[String]("logicalInput2")
  val key = row.getAs[String]("key")
  val result = if (input1 != "") input1 + input2 else input2
  outcome(key, result)
})
val finalDf = inputDf.join(outcomes, Seq("key"))
Is there a more efficient way to map a DataFrame to a new DataFrame with an extra column given arbitrary columns on the input DF upon which we can assume some columns exist to make the calculation?
I'd like to take the inputDF and map over each row, generating a copy of the row and adding a new column to it with the outcome result without having to join afterwards...
NOTE that in the example above, a simple solution exists using the built-in Spark API... My calculation is not as simple as concatenating strings together, so .map or a UDF is required for the solution. I'd like to avoid a UDF if possible, though that could work too.

Before answering the exact question about using .map, I think it is worth a brief discussion of using UDFs for this purpose. UDFs were mentioned in the "note" of the question but not in detail.
When we use .map (or .filter, .flatMap, or any other higher-order function) on any Dataset [1], we force Spark to fully deserialize the entire row into an object, transform the object with a function, and then serialize the entire object again. This is very expensive.
A UDF is effectively a wrapper around a Scala function that routes values from certain columns to the arguments of the UDF. Therefore, Spark is aware of which columns are required by the UDF and which are not, and so we save a lot of serialization (and possibly IO) cost by ignoring columns the UDF does not use.
In addition, the query optimizer can't really help with .map, but a UDF can be part of a larger plan whose execution cost the optimizer will (in theory) minimize.
I believe that a UDF will usually be better in the kind of scenario put forth in the question. Another smell that indicates UDFs are a good solution is how little code is required compared to other solutions.
import org.apache.spark.sql.functions.{col, udf}

val outcome = udf { (input1: String, input2: String) =>
  if (input1 != "") input1 + input2 else input2
}
inputDf.withColumn("outcome", outcome(col("logicalInput1"), col("logicalInput2")))
Now to answer the question about using .map! To avoid the join, we need to have the result of the .map be a Row that has all the contents of the input row with the output added. Row is effectively a sequence of values with type Any. Spark manipulates these values in a type-safe way by using the schema information from the dataset. If we create a new Row with a new schema, and provide .map with an Encoder for the new schema, Spark will know how to create a new DataFrame for us.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StringType

val newSchema = inputDf.schema.add("outcome", StringType)
val newEncoder = RowEncoder(newSchema)
inputDf
  .map { row =>
    val rowWithSchema = row.asInstanceOf[GenericRowWithSchema] // This cast might not always be possible!
    val input1 = row.getAs[String]("logicalInput1")
    val input2 = row.getAs[String]("logicalInput2")
    val key = row.getAs[String]("key")
    val result = if (input1 != "") input1 + input2 else input2
    // Use the extended schema so it matches the extra value; the Encoder is invariant so we have to cast back to Row.
    new GenericRowWithSchema(rowWithSchema.toSeq.toArray :+ result, newSchema).asInstanceOf[Row]
  }(newEncoder)
  .show()
Not as elegant as the UDFs, but it works in this case. However, I'm not sure that this solution is universal.
[1] DataFrame is just an alias for Dataset[Row]

You should use withColumn with a UDF. I don't see why map should be preferred, and I think it's very difficult to append a column that way with the DataFrame API.
Or switch to the Dataset API, as sketched below.
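For the Dataset route, a minimal sketch, assuming the fixed column names from the question (key, logicalInput1, logicalInput2); note that going through a case class drops any columns not listed in it, so it doesn't fit a truly arbitrary schema:
case class Input(key: String, logicalInput1: String, logicalInput2: String)
case class Enriched(key: String, logicalInput1: String, logicalInput2: String, outcome: String)

import spark.implicits._ // assumes a SparkSession named spark is in scope

val enriched = inputDf.as[Input].map { in =>
  // same calculation as in the question
  val result = if (in.logicalInput1 != "") in.logicalInput1 + in.logicalInput2 else in.logicalInput2
  Enriched(in.key, in.logicalInput1, in.logicalInput2, result)
}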

Related

How does foldLeft in Scala work on DataFrame?

I had a requirement to ingest an RDBMS table into Hive, and I had to clean the data in its String columns with a regexp_replace pattern before inserting it into the Hive table. After failing to understand how to apply it to my DataFrame, I finally came across a Scala method, foldLeft, which helped in fulfilling the requirement.
I understand how foldLeft works on a collection, for example:
List(1,3,9).foldLeft(100)((x,y) => x+y)
foldLeft takes an initial value and a function. At each step it applies the function to the accumulator and the next element, and the result becomes the new accumulator, i.e. ((100 + 1) + 3) + 9. In the above case, the result is 113.
But when it comes to dataframe, I am unable to understand how it works.
val stringColumns = yearDF.schema.fields.filter(_.dataType == StringType).map(_.name)
val finalDF = stringColumns.foldLeft(yearDF) { (tempdf, colName) =>
  tempdf.withColumn(colName, regexp_replace(col(colName), "\n", ""))
}
In the above code, I got the String columns from the DataFrame yearDF, which is kept as the accumulator of the foldLeft. I have the following doubts regarding the function used in the foldLeft:
What value does tempdf hold? If it is the same as yearDF, how is it mapped to yearDF?
If withColumn is used in the function and the result is added to yearDF, how come it is not creating duplicate columns?
Could anyone explain it, so that I can have a better understanding of foldLeft?
Consider a trivialized foldLeft example more similar to your DataFrame version:
List(3, 2, 1).foldLeft("abcde")((acc, x) => acc.take(x))
If you look closely at what the (acc, x) => acc.take(x) function does in each iteration, the foldLeft is no different from the following:
"abcde".take(3).take(2).take(1)
// Result: "a"
Going back to the foldLeft for your DataFrame:
stringColumns.foldLeft(yearDF){ (tempdf, colName) =>
tempdf.withColumn(colName, regexp_replace(col(colName), "\n", ""))
}
Similarly, it's no different from:
val sz = stringColumns.size
yearDF.
withColumn(stringColumns(0), regexp_replace(col(stringColumns(0)), "\n", "")).
withColumn(stringColumns(1), regexp_replace(col(stringColumns(1)), "\n", "")).
...
withColumn(stringColumns(sz - 1), regexp_replace(col(stringColumns(sz - 1)), "\n", ""))
What value does tempdf hold? If it is the same as yearDF, how is it mapped to yearDF?
In each iteration (i = 0, 1, 2, ...), tempdf holds a new DataFrame transformed by applying withColumn(stringColumns(i), ...), starting from yearDF.
If withColumn is used in the function and the result is added to yearDF, how come it is not creating duplicate columns?
In withColumn(stringColumns(i), regexp_replace(col(stringColumns(i)), "\n", "")), withColumn creates a new DataFrame, "adding" a column with the same name as the column stringColumns(i) it derives from, which therefore replaces it, so the result is a new DataFrame with the same column list as the original yearDF.
Just to make one point clearer about your second question.
When you call dataframe.withColumn() with an existing column name, it returns a new dataframe with the original column replaced by the new column.
This happens regardless of whether you're in the context of a foldLeft operation, as illustrated below.
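A quick illustration with a made-up two-column DataFrame (not from the question):
import spark.implicits._ // assumes a SparkSession named spark is in scope
import org.apache.spark.sql.functions.{col, regexp_replace}

val df = Seq(("a\nb", 1), ("c", 2)).toDF("text", "n")

// "text" already exists, so withColumn replaces it instead of adding a duplicate
val cleaned = df.withColumn("text", regexp_replace(col("text"), "\n", ""))

cleaned.printSchema() // still just the two columns: text, n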

How to use a string as an expression/argument in Scala/Spark?

I am trying to add many more columns to a dataframe using its existing columns. However, Spark DataFrames are immutable, which makes it difficult to do this iteratively. So I came up with a for loop which outputs a string (see the sample code below), which stores the entire statement I could use on the Spark dataframe.
val train_df = sqlContext.sql("select * from someTable")
/*for loop output is similar to the Str variable as below*/
var Str = ".withColumn(\"newCol1\",$\"col1\").withColumn(\"newCol2\",$\"col2\").withColumn(\"newCol3\",$\"col3\")"
/* Below is what I am trying to do" */
val train_df_new = train_df.Str
So, how can I save the expression/argument in a string and reuse it in scala/spark to add all those new columns at once to a new dataframe?
Use a foldLeft instead. Here, a Map with the old and new column names is used:
val m = Map(("col1", "newCol1"), ("col2", "newCol2"), ("col3", "newCol3"))
val train_df_new = m.keys.foldLeft(train_df)((df, c) => df.withColumnRenamed(c, m(c)))
Instead of withColumnRenamed, any other DataFrame transformation can be used inside the fold.
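If the goal is to add new columns derived from existing ones (as in your Str example) rather than rename them, the same foldLeft pattern works with withColumn; a sketch using the same hypothetical column pairs:
import org.apache.spark.sql.functions.col

val m = Map("col1" -> "newCol1", "col2" -> "newCol2", "col3" -> "newCol3")
val train_df_new = m.foldLeft(train_df) { case (df, (oldCol, newCol)) =>
  // add a new column copied/derived from the existing one
  df.withColumn(newCol, col(oldCol))
}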

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to make the next operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type and I want to use the value in each row as input to the agg, but I am not able to access those values.
Any idea of how to give it a try?
If you can change the strings in the sequences column to be valid SQL expressions, then it is possible to solve this. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
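Putting it together, a small self-contained sketch; the data, key column, and aggregate strings below are made up for illustration:
import spark.implicits._ // assumes a SparkSession named spark is in scope
import org.apache.spark.sql.functions.expr

val data = Seq(("k1", 2, "x"), ("k1", 3, "y"), ("k2", 5, "z")).toDF("key", "A", "B")
val exprDf = Seq("sum(A) as A_sum", "count(B) as B_count").toDF("sequences")

// collect the SQL strings and turn each one into a Column
val aggs = exprDf.as[String].collect().map(expr(_))

data.groupBy("key").agg(aggs.head, aggs.tail: _*).show()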

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a dataframe. My data has 3 columns and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a dataframe, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy("a")
  .count()
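Another option, if you prefer to declare the schema explicitly rather than infer it from a tuple, is createDataFrame with a StructType; a sketch assuming file is an RDD[String] and using placeholder column names:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", IntegerType)
))

val rowRdd = file
  .map(line => line.split("\t"))
  .map(l => Row(l(0), l(1).toInt))

val df2 = spark.createDataFrame(rowRdd, schema)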

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe, which is in the format Array[Row], to RDD[Row]. I have tried using parallelize, but I don't want to use it, as it requires the entire data to be held on a single system, which is not feasible on a production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives an error value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame, it is an Array[Row], that's why you can't use .rdd on it. If you want to get an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
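From there you can map to the tuples you were building, still without collecting anything to the driver (assuming both columns are strings, as in your map):
// each element is a Row; pull the two columns out by position
val bidrdd = rdd.map(r => (r.getString(0), r.getString(1)))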
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it, as it requires the entire data to be held on a single system ...
Note that as soon as you use collect on the DataFrame, you've already collected the entire data into a single JVM's memory. So using parallelize is not the issue.