Renaming columns in a PySpark DataFrame with a performant select operation - pyspark

There are other threads on how to rename columns in a PySpark DataFrame, see here, here and here. I don't think the existing solutions are sufficiently performant or generic (I have a solution that should be better, but I'm stuck on an edge-case bug). Let's start by reviewing the issues with the current solutions:
Calling withColumnRenamed repeatedly will probably have the same performance problems as calling withColumn a lot, as outlined in this blog post. See Option 2 in this answer.
The toDF approach relies on schema inference and does not necessarily retain the nullable property of columns (toDF should be avoided in production code). I'm guessing this approach is slow as well.
This approach is close, but it's not generic enough and would be way too much manual work for a lot of columns (e.g. if you're trying to convert 2,000 column names to snake_case). The first two patterns are sketched below for reference.
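Roughly what those first two patterns look like (a hedged sketch for some DataFrame df; the snake_case rename is just an example):
# 1. One withColumnRenamed call per column -- builds a long chain of renames
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", "_"))
# 2. toDF with a full list of new names
df = df.toDF(*[c.replace(" ", "_") for c in df.columns])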
I created a function that's generic and works for all column types, except for column names that include dots:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col(col_name).alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
Suppose you have the following DataFrame:
+-------------+-----------+
|i like cheese|yummy stuff|
+-------------+-----------+
| jose| a|
| li| b|
| sam| c|
+-------------+-----------+
Here's how to replace all the whitespaces in the column names with underscores:
def spaces_to_underscores(s):
    return s.replace(" ", "_")

df.transform(with_columns_renamed(spaces_to_underscores)).show()
+-------------+-----------+
|i_like_cheese|yummy_stuff|
+-------------+-----------+
| jose| a|
| li| b|
| sam| c|
+-------------+-----------+
The solution works perfectly, except when a column name contains dots.
Suppose you have this DataFrame:
+-------------+-----------+
|i.like.cheese|yummy.stuff|
+-------------+-----------+
| jose| a|
| li| b|
| sam| c|
+-------------+-----------+
This code will error out:
def dots_to_underscores(s):
    return s.replace(".", "_")

df.transform(with_columns_renamed(dots_to_underscores))
Here's the error message: pyspark.sql.utils.AnalysisException: "cannot resolve 'i.like.cheese' given input columns: [i.like.cheese, yummy.stuff];;\n'Project ['i.like.cheese AS i_like_cheese#242, 'yummy.stuff AS yummy_stuff#243]\n+- LogicalRDD [i.like.cheese#231, yummy.stuff#232], false\n"
How can I modify this solution to work for column names that have dots? I'm also assuming that the Catalyst optimizer will have the same optimization problems for multiple withColumnRenamed calls as it does for multiple withColumn calls. Let me know if Catalyst handles multiple withColumnRenamed calls better for some reason.
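For what it's worth, here's how I've been comparing the two approaches myself, by inspecting the plans Catalyst produces (a quick sketch reusing the functions defined above):
renamed_via_select = df.transform(with_columns_renamed(spaces_to_underscores))

renamed_via_chain = df
for c in df.columns:
    renamed_via_chain = renamed_via_chain.withColumnRenamed(c, spaces_to_underscores(c))

# The select version shows a single Project over the source, while the chained
# version stacks one Project per rename in the analyzed plan (before the
# optimizer collapses them).
renamed_via_select.explain(True)
renamed_via_chain.explain(True)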

You could do something simple like this:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col('`' + col_name + '`').alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
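For example, with the dotted-column DataFrame from the question, this version should rename cleanly (a quick check reusing the question's dots_to_underscores):
df.transform(with_columns_renamed(dots_to_underscores)).show()
+-------------+-----------+
|i_like_cheese|yummy_stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+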

I've read the other answers and can't understand why this isn't one of them; feel free to point out if I'm missing something! It's nothing new, but it's concise and performs well:
def with_columns_renamed(func):
    def _(df):
        return df.selectExpr(*['`{}` AS `{}`'.format(c, func(c)) for c in df.columns])
    return _
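For the dotted columns in the question, the strings handed to selectExpr would look like `i.like.cheese` AS `i_like_cheese` and `yummy.stuff` AS `yummy_stuff`, so the dots are never parsed as struct accessors. A quick check, reusing the question's dots_to_underscores:
df.transform(with_columns_renamed(dots_to_underscores)).printSchema()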

Try escaping using backticks:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col("`{0}`".format(col_name)).alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
Or use withColumnRenamed with reduce.
from functools import reduce

reduce(lambda new_df, col: new_df.withColumnRenamed(col, col.replace('.', '_')), df.columns, df)
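For readability, the reduce above is equivalent to a plain loop (a sketch; note it still creates one withColumnRenamed projection per column, unlike the single select in the other answers):
new_df = df
for c in df.columns:
    new_df = new_df.withColumnRenamed(c, c.replace('.', '_'))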

Related

How to apply udf on a dataframe and on a column in scala?

I am a beginner to Scala. I tried the Scala REPL window in IntelliJ.
I have a sample df and am trying to write and test a UDF (not a built-in function) to understand how UDFs work.
df:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.appName("elephant").config("spark.master", "local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(("A", 1), ("B", 2), ("C", 3))).toDF("Letter", "Number")
df.show()
output:
+------+------+
|Letter|Number|
+------+------+
|     A|     1|
|     B|     2|
|     C|     3|
+------+------+
udf for dataframe filter:
def kill_4(n: String): Boolean = {
  if (n == "A") { true } else { false }
} // please validate if it's correct ???
I tried
df.withColumn("new_col", kill_4(col("Letter"))).show() // please tell me the correct way ???
error
error: type mismatch
Second:
I tried a direct filter:
df.filter(kill_4(col("Letter"))).show()
Desired output:
+------+------+
|Letter|Number|
+------+------+
|     B|     2|
|     C|     3|
+------+------+
You can register the UDF and use it in code as follows:
import org.apache.spark.sql.functions.{col, udf}

def kill_4(n: String): Boolean = {
  if (n == "A") { true } else { false }
}

val kill_udf = udf((x: String) => kill_4(x))

df.select(col("Letter"), col("Number"),
  kill_udf(col("Letter")).as("Kill_4")).show(false)
Please look at the Databricks documentation on Scala user-defined functions.
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
You do not need the Spark session to create a dataframe; I removed that code. Your function had a couple of bugs, and since it is very small, I created an inline one. The udf() call allows the function to be used with dataframes, and the call to register allows it to be used with Spark SQL. A quick SQL statement shows the function works. Last but not least, we need the udf() and col() functions for the last statement to work. In short, these snippets solve your problem.

pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using @udf or @pandas_udf

I'm trying to compute the .dot product between 2 columns of a given dataframe. SparseVectors already have this ability in Spark, so I'm trying to do it in an easy and scalable way, without converting to RDDs or to DenseVectors, but I'm stuck. I've spent the past 3 days trying to find an approach and it keeps failing: it doesn't return the computation for the 2 vector columns passed from the dataframe. I'm looking for guidance on this, because I'm missing something here and I'm not sure what the root cause is.
This approach works for separate vectors and RDD vectors, but fails when passing dataframe column vectors. To replicate the flow and the issue, see below. Ideally this computation would happen in parallel, since the real data has billions of rows or more (dataframe observations):
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.sql import Row

df = spark.createDataFrame(
    [
        [["a","b","c"], SparseVector(4527, {0:0.6363067860791387, 1:1.0888040725098247, 31:4.371858972705023}), SparseVector(4527, {0:0.6363067860791387, 1:2.0888040725098247, 31:4.371858972705023})],
        [["d"], SparseVector(4527, {8: 2.729945780576634}), SparseVector(4527, {8: 4.729945780576634})],
    ], ["word", "i", "j"])

# dataframe content
df.show()
+---------+--------------------+--------------------+
| word| i| j|
+---------+--------------------+--------------------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...|
+---------+--------------------+--------------------+
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

@udf(returnType=ArrayType(FloatType()))
def sim_cos(v1, v2):
    if v1 is not None and v2 is not None:
        return float(v1.dot(v2))

# calling udf
df = df.withColumn("dotP", sim_cos(df.i, df.j))

# output after udf
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j| dotP|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| null|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| null|
+---------+--------------------+--------------------+----------+
Rewriting the udf as a lambda function does work on Spark 2.4.5. Posting in case anyone is interested in this approach for PySpark dataframes:
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# rewrite udf as a lambda function:
sim_cos = F.udf(lambda x, y: float(x.dot(y)), FloatType())
# executing udf on the dataframe
df = df.withColumn("similarity", sim_cos(col("i"), col("j")))
# end result
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j|similarity|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| 21.792336|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| 12.912496|
+---------+--------------------+--------------------+----------+
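For completeness, the decorator form should also work once the declared return type matches what the function actually returns; the nulls in the first attempt most likely come from declaring ArrayType(FloatType()) while returning a plain float. A hedged sketch:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

@udf(returnType=FloatType())  # scalar FloatType, not ArrayType(FloatType())
def sim_cos(v1, v2):
    if v1 is not None and v2 is not None:
        return float(v1.dot(v2))
    return None  # explicit null when either vector is missing

df = df.withColumn("dotP", sim_cos(df.i, df.j))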

How to evaluate expressions that are the column values?

I have a big dataframe with millions of rows as follows:
A B C Eqn
12 3 4 A+B
32 8 9 B*C
56 12 2 A+B*C
How to evaluate the expressions in the Eqn column?
You could create a custom UDF that evaluates these arithmetic expressions:
def evalUDF = udf((a: Int, b: Int, c: Int, eqn: String) => {
  val eqnParts = eqn
    .replace("A", a.toString)
    .replace("B", b.toString)
    .replace("C", c.toString)
    .split("""\b""")
    .toList
  val (sum, _) = eqnParts.tail.foldLeft((eqnParts.head.toInt, "")) {
    case ((runningTotal, "+"), num) => (runningTotal + num.toInt, "")
    case ((runningTotal, "-"), num) => (runningTotal - num.toInt, "")
    case ((runningTotal, "*"), num) => (runningTotal * num.toInt, "")
    case ((runningTotal, _), op) => (runningTotal, op)
  }
  sum
})

evalDf
  .withColumn("eval", evalUDF('A, 'B, 'C, 'Eqn))
  .show()
Output:
+---+---+---+-----+----+
| A| B| C| Eqn|eval|
+---+---+---+-----+----+
| 12| 3| 4| A+B| 15|
| 32| 8| 9| B*C| 72|
| 56| 12| 2|A+B*C| 136|
+---+---+---+-----+----+
As you can see this works, but it is very fragile (spaces, unknown operators, etc. will break the code) and doesn't adhere to order of operations (otherwise the last result should have been 80).
So you could write all that yourself or find some library that already does that perhaps (like https://gist.github.com/daixque/1610753)?
Maybe the performance overhead will be very large (especially if you start using recursive parsers), but at least you can perform it on a dataframe instead of collecting it first.
I think the only way to execute SQLs that are inside a DataFrame is to select("Eqn").collect first followed by executing the SQLs iteratively on the source Dataset.
Since the SQLs are in a DataFrame that is nothing else but a description of a distributed computation that will be executed on Spark executors there is no way you could submit Spark jobs while processing the SQLs on executors. It is simply too late in the execution pipeline. You should be back on the driver to be able to submit new Spark jobs, say to execute SQLs.
With SQLs on the driver you'd then take the corresponding row per SQL and simply withColumn to execute SQLs (with their rows).
I think it's easier to write it than develop a working Spark application, but that's how I'd go about it.
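A minimal PySpark sketch of that collect-then-evaluate idea, assuming a DataFrame df with columns A, B, C and Eqn, and that the Eqn strings are valid Spark SQL expressions:
from functools import reduce
import pyspark.sql.functions as F

# Collect the distinct equation strings on the driver, then build one
# CASE WHEN chain that evaluates each expression for the rows that use it.
eqns = [r["Eqn"] for r in df.select("Eqn").distinct().collect()]

eval_col = reduce(
    lambda acc, e: acc.when(F.col("Eqn") == e, F.expr(e)),
    eqns,
    F.when(F.lit(False), F.lit(None)),
)
df.withColumn("eval", eval_col).show()
Because F.expr goes through the SQL parser, operator precedence is respected (A+B*C evaluates to 80 for the last row).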
I am late, but in case someone is looking for
a generic math expression interpreter using variables, or
a complex/unknown expression that cannot be hardcoded into the UDF (as in the accepted answer),
then you can use javax.script.ScriptEngineManager:
import javax.script.SimpleBindings
import javax.script.ScriptEngineManager
import java.util.Map
import java.util.HashMap

def calculateFunction = (mathExpression: String, A: Double, B: Double, C: Double) => {
  val vars: Map[String, Object] = new HashMap[String, Object]()
  vars.put("A", A.asInstanceOf[Object])
  vars.put("B", B.asInstanceOf[Object])
  vars.put("C", C.asInstanceOf[Object])
  val engine = new ScriptEngineManager().getEngineByExtension("js")
  val result = engine.eval(mathExpression, new SimpleBindings(vars))
  result.asInstanceOf[Double]
}

val calculateUDF = spark.udf.register("calculateFunction", calculateFunction)
NOTE: This will handle generic expressions and is robust, but it performs much worse than the accepted answer and is heavier on memory.

Spark (scala) dataframes - Check whether strings in column contain any items from a set

I'm pretty new to Scala and Spark and I've been trying to find a solution for this issue all day - it's doing my head in. I've tried 20 different variations of the following code and keep getting type mismatch errors when I try to perform calculations on a column.
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined List (or Set) of words.
Here is some example data for replication:
// sample data frame
val df = Seq(
  (1, "foo"),
  (2, "barrio"),
  (3, "gitten"),
  (4, "baa")).toDF("id", "words")

// dictionary Set of words to check
val dict = Set("foo", "bar", "baaad")
Now, I am trying to create a third column with the results of a comparison to see if the strings in the $"words" column contain any of the words in the dict Set. So the result should be:
+---+------+----------+
| id| words|word_check|
+---+------+----------+
|  1|   foo|      true|
|  2|barrio|      true|
|  3|gitten|     false|
|  4|   baa|     false|
+---+------+----------+
First, I tried to see if I could do it natively without using UDFs, since the dict Set will actually be a large dictionary of > 40K words, and as I understand it this would be more efficient than a UDF:
df.withColumn("word_check", dict.exists(d => $"words".contains(d)))
But I get the error:
type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
I have also tried to create a UDF to do this (using both mutable.Set and mutable.WrappedArray to describe the Set - not sure which is correct, but neither works):
val checker: ((String, scala.collection.mutable.Set[String]) => Boolean) = (col: String, array: scala.collection.mutable.Set[String] ) => array.exists(d => col.contains(d))
val udf1 = udf(checker)
df.withColumn("word_check", udf1($"words", dict )).show()
But get another type mismatch:
found : scala.collection.immutable.Set[String]
required: org.apache.spark.sql.Column
If the set were a fixed number, I should be able to use lit(Int) in the expression? But I don't really understand how performing more complex functions on a column by mixing different data types works in Scala.
Any help greatly appreciated, especially if it can be done efficiently (it is a large df of > 5m rows).
Regardless of efficiency, this seems to work:
df.withColumn("word_check", dict.foldLeft(lit(false))((a, b) => a || locate(b, $"words") > 0)).show
+---+------+----------+
| id| words|word_check|
+---+------+----------+
| 1| foo| true|
| 2|barrio| true|
| 3|gitten| false|
| 4| baa| false|
+---+------+----------+
Here's how you'd do it with a UDF:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The mistake in your implementation is that you've created a UDF expecting two arguments, which means you'd have to pass two Columns when applying it - but dict isn't a Column in your DataFrame but rather a local variable.
If your dict is large, you should not just reference it in your udf, because the entire dict is sent over the network for every task. I would broadcast your dict in combination with a udf:
import org.apache.spark.broadcast.Broadcast

def udf_check(words: Broadcast[scala.collection.immutable.Set[String]]) = {
  udf { (s: String) => words.value.exists(s.contains(_)) }
}

df.withColumn("word_check", udf_check(sparkContext.broadcast(dict))($"words"))
Alternatively, you could also use a join:
val dict_df = dict.toList.toDF("word")

df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")

Apache Spark SQL identifier expected exception

My question is quite similar to this one: Apache Spark SQL issue : java.lang.RuntimeException: [1.517] failure: identifier expected. But I just can't figure out where my problem lies. I am using SQLite as the database backend. Connecting and simple select statements work fine.
The offending line:
val df = tableData.selectExpr(tablesMap(t).toSeq:_*).map(r => myMapFunc(r))
tablesMap contains the table name as key and an array of strings as expressions. Printed, the array looks like this:
WrappedArray([My Col A], [ColB] || [Col C] AS ColB)
The table name is also included in square brackets since it contains spaces. The exception I get:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: identifier expected
I already made sure not to use any Spark Sql keywords. In my opinion there are 2 possible reasons why this code fails: 1) I somehow handle spaces in column names wrong. 2) I handle concatenation wrong.
I am using a resource file, CSV-like, which contains the expressions I want to be evaluated on my tables. Apart from this file, I want to allow the user to specify additional tables and their respective column expressions at runtime. The file looks like this:
TableName,`Col A`,`ColB`,CONCAT(`ColB`, ' ', `Col C`)
Apparently this does not work. Nevertheless I would like to reuse this file, modified of course. My idea was to map the columns with the expressions from an array of strings, like now, to a sequence of Spark columns. (This is the only solution I could think of, since I want to avoid pulling in all the Hive dependencies just for this one feature.) I would introduce a small syntax for my expressions to mark raw column names with a $ and some keywords for functions like concat and as. But how could I do this? I tried something like this, but it's far, far away from even compiling.
def columnsMapFunc(expr: String): Column = {
  if (expr(0) == '$')
    return expr.drop(1)
  else
    return concat(extractedColumnNames).as(newName)
}
Generally speaking, using names containing whitespace is asking for problems, but replacing square brackets with backticks should solve the problem:
val df = sc.parallelize(Seq((1,"A"), (2, "B"))).toDF("f o o", "b a r")
df.registerTempTable("foo bar")
df.selectExpr("`f o o`").show
// +-----+
// |f o o|
// +-----+
// | 1|
// | 2|
// +-----+
sqlContext.sql("SELECT `b a r` FROM `foo bar`").show
// +-----+
// |b a r|
// +-----+
// | A|
// | B|
// +-----+
For concatenation you have to use concat function:
df.selectExpr("""concat(`f o o`, " ", `b a r`)""").show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
but it requires HiveContext in Spark 1.4.0.
In practice I would simply rename columns after loading data
df.toDF("foo", "bar")
// org.apache.spark.sql.DataFrame = [foo: int, bar: string]
and use functions instead of expression strings (the concat function is available only in Spark >= 1.5.0; for 1.4 and earlier you'll need a UDF):
import org.apache.spark.sql.functions.concat
df.select($"f o o", concat($"f o o", lit(" "), $"b a r")).show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// | 1 A|
// | 2 B|
// +----------------------+
There is also concat_ws function which takes separator as the first argument:
df.selectExpr("""concat_ws(" ", `f o o`, `b a r`)""")
df.select($"f o o", concat_ws(" ", $"f o o", $"b a r"))