I have a big dataframe with millions of rows as follows:
A   B   C   Eqn
12  3   4   A+B
32  8   9   B*C
56  12  2   A+B*C
How to evaluate the expressions in the Eqn column?
You could create a custom UDF that evaluates these arithmetic expressions:
def evalUDF = udf((a: Int, b: Int, c: Int, eqn: String) => {
  val eqnParts = eqn
    .replace("A", a.toString)
    .replace("B", b.toString)
    .replace("C", c.toString)
    .split("""\b""")
    .toList
  val (sum, _) = eqnParts.tail.foldLeft((eqnParts.head.toInt, "")) {
    case ((runningTotal, "+"), num) => (runningTotal + num.toInt, "")
    case ((runningTotal, "-"), num) => (runningTotal - num.toInt, "")
    case ((runningTotal, "*"), num) => (runningTotal * num.toInt, "")
    case ((runningTotal, _), op)    => (runningTotal, op)
  }
  sum
})

evalDf
  .withColumn("eval", evalUDF('A, 'B, 'C, 'Eqn))
  .show()
Output:
+---+---+---+-----+----+
| A| B| C| Eqn|eval|
+---+---+---+-----+----+
| 12| 3| 4| A+B| 15|
| 32| 8| 9| B*C| 72|
| 56| 12| 2|A+B*C| 136|
+---+---+---+-----+----+
As you can see this works, but it is very fragile (spaces, unknown operators, etc. will break the code) and it doesn't respect operator precedence (otherwise the last result should have been 80).
So you could write all that yourself, or find a library that already does it (like https://gist.github.com/daixque/1610753, for example).
Maybe the performance overhead will be large (especially if you start using recursive parsers), but at least you can perform it on a DataFrame instead of collecting it first.
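For illustration, here is a minimal sketch (not part of the original answer) of a precedence-aware variant of the same string evaluation; it assumes the substituted expression contains only +, -, * and non-negative integer operands, with no spaces:
// Evaluate "*" within each term first, then add/subtract the terms,
// e.g. "56+12*2" => 56 + (12 * 2) = 80.
def evalWithPrecedence(expr: String): Int = {
  val terms = expr.split("""(?=[+-])""")   // "56+12*2" -> Array("56", "+12*2")
  terms.map { term =>
    val sign    = if (term.startsWith("-")) -1 else 1
    val factors = term.stripPrefix("+").stripPrefix("-").split("""\*""").map(_.toInt)
    sign * factors.product
  }.sum
}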
I think the only way to execute the SQLs that are inside a DataFrame is to select("Eqn").collect first, followed by executing the SQLs iteratively against the source Dataset.
Since the SQLs live in a DataFrame, which is nothing more than a description of a distributed computation that will be executed on Spark executors, there is no way you could submit Spark jobs while processing the SQLs on the executors. It is simply too late in the execution pipeline. You have to be back on the driver to be able to submit new Spark jobs, say to execute the SQLs.
With the SQLs on the driver, you'd then take the corresponding rows per SQL and simply use withColumn to execute the SQLs (against their rows).
I think it's easier to describe than to develop a working Spark application, but that's how I'd go about it.
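A rough sketch (not part of the original answer) of that driver-side approach, assuming the DataFrame is called evalDf, its expressions only reference the columns A, B and C, and spark.implicits._ is in scope:
import org.apache.spark.sql.functions.expr

// Collect the distinct expressions to the driver, then evaluate each one
// with expr() against its matching rows and union the results back together.
val eqns = evalDf.select("Eqn").distinct.as[String].collect
val evaluated = eqns
  .map(eqn => evalDf.filter($"Eqn" === eqn).withColumn("eval", expr(eqn)))
  .reduce(_ union _)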
I am late, but in case someone is looking for:
- a generic math expression interpreter using variables
- complex/unknown expressions that cannot be hardcoded into the UDF (of the accepted answer)
then you can use javax.script.ScriptEngineManager:
import javax.script.SimpleBindings
import javax.script.ScriptEngineManager
import java.util.Map
import java.util.HashMap

def calculateFunction = (mathExpression: String, A: Double, B: Double, C: Double) => {
  val vars: Map[String, Object] = new HashMap[String, Object]()
  vars.put("A", A.asInstanceOf[Object])
  vars.put("B", B.asInstanceOf[Object])
  vars.put("C", C.asInstanceOf[Object])
  val engine = new ScriptEngineManager().getEngineByExtension("js")
  val result = engine.eval(mathExpression, new SimpleBindings(vars))
  result.asInstanceOf[Double]
}

val calculateUDF = spark.udf.register("calculateFunction", calculateFunction)
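A hedged usage example (not shown in the original answer), assuming the same DataFrame as above with numeric columns A, B, C and the expression string in Eqn:
evalDf
  .withColumn("eval", calculateUDF($"Eqn", $"A".cast("double"), $"B".cast("double"), $"C".cast("double")))
  .show()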
NOTE: This will handle generic expressions and is robust, but it performs considerably worse than the accepted answer and is heavier on memory.
Related
There are other threads on how to rename columns in a PySpark DataFrame, see here, here and here. I don't think the existing solutions are sufficiently performant or generic (I have a solution that should be better, and I'm stuck on an edge-case bug). Let's start by reviewing the issues with the current solutions:
Calling withColumnRenamed repeatedly will probably have the same performance problems as calling withColumn a lot, as outlined in this blog post. See Option 2 in this answer.
The toDF approach relies on schema inference and does not necessarily retain the nullable property of columns (toDF should be avoided in production code). I'm guessing this approach is slow as well.
This approach is close, but it's not generic enough and would be way too much manual work for a lot of columns (e.g. if you're trying to convert 2,000 column names to snake_case)
I created a function that's generic and works for all column types, except for column names that include dots:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col(col_name).alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
Suppose you have the following DataFrame:
+-------------+-----------+
|i like cheese|yummy stuff|
+-------------+-----------+
| jose| a|
| li| b|
| sam| c|
+-------------+-----------+
Here's how to replace all the whitespaces in the column names with underscores:
def spaces_to_underscores(s):
    return s.replace(" ", "_")

df.transform(with_columns_renamed(spaces_to_underscores)).show()
+-------------+-----------+
|i_like_cheese|yummy_stuff|
+-------------+-----------+
| jose| a|
| li| b|
| sam| c|
+-------------+-----------+
The solution works perfectly, except for when the column name contains dots.
Suppose you have this DataFrame:
+-------------+-----------+
|i.like.cheese|yummy.stuff|
+-------------+-----------+
| jose| a|
| li| b|
| sam| c|
+-------------+-----------+
This code will error out:
def dots_to_underscores(s):
    return s.replace(".", "_")

df.transform(quinn.with_columns_renamed(dots_to_underscores))
Here's the error message: pyspark.sql.utils.AnalysisException: "cannot resolve 'i.like.cheese' given input columns: [i.like.cheese, yummy.stuff];;\n'Project ['i.like.cheese AS i_like_cheese#242, 'yummy.stuff AS yummy_stuff#243]\n+- LogicalRDD [i.like.cheese#231, yummy.stuff#232], false\n"
How can I modify this solution to work for column names that have dots? I'm also assuming that the Catalyst optimizer will have the same optimization problems for multiple withColumnRenamed calls as it does for multiple withColumn calls. Let me know if Catalyst handles multiple withColumnRenamed calls better for some reason.
You could do something simple like this:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col('`' + col_name + '`').alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
I've read the other answers and can't understand why this isn't one of them; feel free to point out if I'm missing something! It's nothing new, but it's concise and performs well:
def with_columns_renamed(func):
    def _(df):
        return df.selectExpr(*['`{}` AS `{}`'.format(c, func(c)) for c in df.columns])
    return _
Try escaping the column names with backticks (`):
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col("`{0}`".format(col_name)).alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
Or use withColumnRenamed with reduce:
from functools import reduce

reduce(lambda new_df, col: new_df.withColumnRenamed(col, col.replace('.', '_')), df.columns, df)
I've got some data in Spark, result: DataFrame = ..., where two integer columns are of interest: week and year. The values of these columns are identical for all rows.
I want to extract these two integer values, and pass them as parameters to create a WeekYear:
case class WeekYear(week: Int, year: Int)
Below is my current solution, but I'm thinking there must be a more elegant way to do this. How can this be done without the intermediate step of creating temp?
val temp = result
  .select("week", "year")
  .first
  .toSeq
  .map(_.toString.toInt)

val resultWeekYear = WeekYear(temp(0), temp(1))
The best way to use a case class with DataFrames is to let Spark convert the DataFrame to a Dataset with the .as[] method. As long as your case class has attributes that match all of the column names, it works very easily.
case class WeekYear(week: Int, year: Int)
val df = spark.createDataset(Seq((1, 1), (2, 2), (3, 3))).toDF("week", "year")
val ds = df.as[WeekYear]
ds.show()
Which provides a Dataset[WeekYear] that looks like this:
+----+----+
|week|year|
+----+----+
| 1| 1|
| 2| 2|
| 3| 3|
+----+----+
You can use more complicated nested classes, but then you have to start working with Encoders so that Spark knows how to convert back and forth.
Spark does some implicit conversions, so ds may still look like a DataFrame, but it is actually a strongly typed Dataset[WeekYear] instead of a Dataset[Row] with arbitrary columns. You operate on it similarly to an RDD. Then just grab .first and you'll already have the type you need.
val resultWeekYear = ds.first
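Applied to the asker's DataFrame, a minimal sketch (assuming result has columns named week and year, and that spark.implicits._ is in scope for the encoder):
import spark.implicits._

val resultWeekYear: WeekYear = result.select("week", "year").as[WeekYear].first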
I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following
scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]
scala> empDF.show
+---+----------------+---+----------+
| id| name|age|department|
+---+----------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3| Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+
scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]
scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
| Homicide| 2|
| Records| 1|
| Forensics| 2|
+----------+-----+
When I called count on GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count; instead, they started when I ran res1.show.
I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?
The .count() you have used in your code is on a RelationalGroupedDataset, which creates a new column with the count of elements in each group. This is a transformation. Refer to:
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset
The .count() that you normally use on an RDD/DataFrame/Dataset is completely different from the above; that .count() is an action. Refer to: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD
EDIT:
To avoid confusion in the future, always use .count() with .agg() when operating on a grouped Dataset:
empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
Case 1:
You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, it's an action on an RDD.
for ex: rdd.count // it returns a Long value
Case 2:
If you call count on a DataFrame, it initiates the DAG execution and returns the data to the driver, so it's an action on a DataFrame.
for ex: df.count // it returns a Long value
Case 3:
In your case you are calling groupBy on a DataFrame, which returns a RelationalGroupedDataset object, and you are calling count on the grouped Dataset, which returns a DataFrame. So it's a transformation, since it doesn't bring the data to the driver or initiate the DAG execution.
for ex:
df.groupBy("department") // returns RelationalGroupedDataset
  .count                 // returns a DataFrame, so a transformation
  .count                 // returns a Long value since called on a DataFrame, so an action
As you've already figured out, if a method returns a distributed object (Dataset or RDD), it can be qualified as a transformation.
However, these distinctions are much better suited to RDDs than to Datasets. The latter feature an optimizer, including the recently added cost-based optimizer, and might be much less lazy than the old API, blurring the difference between transformation and action in some cases.
Here, however, it is safe to say count is a transformation.
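To make the distinction concrete with the DataFrame from the question (a brief illustrative sketch, not from the original answer):
val n: Long = empDF.count                                   // action: runs a job and returns a value to the driver
val counts: DataFrame = empDF.groupBy("department").count   // transformation: returns a new DataFrame, nothing runs yet
counts.show()                                               // action: this is what triggers the job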
I am new to Scala/Spark. I am working on a Scala/Java application on Spark, trying to read some data from a Hive table and then sum up all the column values for each row. For example, consider the following DataFrame:
+--------+-+-+-+-+-+-+
| address|a|b|c|d|e|f|
+--------+-+-+-+-+-+-+
|Newyork |1|0|1|0|1|1|
| LA |0|1|1|1|0|1|
|Chicago |1|1|0|0|1|1|
+--------+-+-+-+-+-+-+
I want to sum up all the 1's in all the rows and get the total, i.e. the sum of all the columns of the above DataFrame should be 12 (since there are 12 ones across all the rows combined).
I tried doing this:
var count = 0
DF.foreach(x => {
  count = count +
    Integer.parseInt(x.getAs[String]("a")) +
    Integer.parseInt(x.getAs[String]("b")) +
    Integer.parseInt(x.getAs[String]("c")) +
    Integer.parseInt(x.getAs[String]("d")) +
    Integer.parseInt(x.getAs[String]("e")) +
    Integer.parseInt(x.getAs[String]("f"))
})
When I run the above code, the count value is still zero. I think this has something to do with running the application on a cluster, so declaring a variable and adding to it doesn't work for me, since I have to run my application on a cluster. I also tried declaring a static variable in a separate Java class and adding to it; that gives me the same result.
As far as I know, there should be an easy way of achieving this using the inline functions available in the Spark/Scala libraries.
What would be an efficient way of achieving this? Any help would be appreciated.
Thank you.
P.S: I am using Spark 1.6.
You can first sum the column values, which gives back a single-Row DataFrame of sums, then convert this Row to a Seq and sum the values up:
import org.apache.spark.sql.functions.{col, sum}

val sum_cols = df.columns.tail.map(x => sum(col(x)))
df.agg(sum_cols.head, sum_cols.tail: _*).first.toSeq.asInstanceOf[Seq[Long]].sum
// res9: Long = 12

df.agg(sum_cols.head, sum_cols.tail: _*).show
+------+------+------+------+------+------+
|sum(a)|sum(b)|sum(c)|sum(d)|sum(e)|sum(f)|
+------+------+------+------+------+------+
| 2| 2| 2| 1| 2| 3|
+------+------+------+------+------+------+
Here is an alternative approach:
first let's prepare an aggregating function:
scala> val f = df.drop("address").columns.map(col).reduce((c1, c2) => c1 + c2)
f: org.apache.spark.sql.Column = (((((a + b) + c) + d) + e) + f)
get sum as a DataFrame:
scala> df.agg(sum(f).alias("total")).show
+-----+
|total|
+-----+
| 12|
+-----+
get sum as a Long number:
scala> df.agg(sum(f)).first.getLong(0)
res39: Long = 12
I'm pretty new to Scala and Spark and I've been trying to find a solution to this issue all day - it's doing my head in. I've tried 20 different variations of the following code and keep getting type mismatch errors when I try to perform calculations on a column.
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined List (or Set) of words.
Here is some example data for replication:
// sample data frame
val df = Seq(
  (1, "foo"),
  (2, "barrio"),
  (3, "gitten"),
  (4, "baa")
).toDF("id", "words")

// dictionary Set of words to check
val dict = Set("foo", "bar", "baaad")
Now, I am trying to create a third column with the result of a comparison to see if the strings in the $"words" column contain any of the words in the dict Set. So the result should be:
+---+------+----------+
| id| words|word_check|
+---+------+----------+
|  1|   foo|      true|
|  2|barrio|      true|
|  3|gitten|     false|
|  4|   baa|     false|
+---+------+----------+
First, I tried to see if I could do it natively without using a UDF, since the dict Set will actually be a large dictionary of > 40K words, and as I understand it this would be more efficient than a UDF:
df.withColumn("word_check", dict.exists(d => $"words".contains(d)))
But i get the error:
type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
I have also tried to create a UDF to do this (using both mutable.Set and mutable.WrappedArray to describe the Set - not sure which is correct, but neither works):
val checker: ((String, scala.collection.mutable.Set[String]) => Boolean) =
  (col: String, array: scala.collection.mutable.Set[String]) => array.exists(d => col.contains(d))

val udf1 = udf(checker)
df.withColumn("word_check", udf1($"words", dict)).show()
But get another type mismatch:
found : scala.collection.immutable.Set[String]
required: org.apache.spark.sql.Column
If the Set were a fixed number, I should be able to use lit(Int) in the expression? But I don't really understand how performing more complex functions on a column, mixing different data types, works in Scala.
Any help greatly appreciated, especially if it can be done efficiently (it is a large df of > 5m rows).
Regardless of efficiency, this seems to work:
df.withColumn("word_check", dict.foldLeft(lit(false))((a, b) => a || locate(b, $"words") > 0)).show
+---+------+----------+
| id| words|word_check|
+---+------+----------+
| 1| foo| true|
| 2|barrio| true|
| 3|gitten| false|
| 4| baa| false|
+---+------+----------+
Here's how you'd do it with a UDF:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The mistake in your implementation is that you've created a UDF expecting two arguments, which means you'd have to pass two Columns when applying it - but dict isn't a Column in your DataFrame but rather a local variable.
If your dict is large, you should not just reference it in your UDF, because the entire dict is sent over the network for every task. Instead, I would broadcast your dict and combine it with a UDF:
import org.apache.spark.broadcast.Broadcast

def udf_check(words: Broadcast[scala.collection.immutable.Set[String]]) = {
  udf { (s: String) => words.value.exists(s.contains(_)) }
}

df.withColumn("word_check", udf_check(sparkContext.broadcast(dict))($"words"))
Alternatively, you could also use a join:
val dict_df = dict.toList.toDF("word")

df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")