I am trying to pass a Spark SQL function name to a function I have defined in Scala.
I am trying to get the same functionality as:
myDf.agg(max($"myColumn"))
My attempt doesn't work:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
myDf.agg(myParameter($"myColumn"))
}
Obviously it shouldn't work, since I'm providing a string, but I am unable to find a way to make it work.
Is it even possible?
Edit:
I have to provide the SQL function name (and it can be any other aggregate function) as a parameter when calling my function.
myFunc(anyDf, max) or myFunc(anyDf, "max")
agg also takes a Map[String, String], which allows you to do what you want:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
myDf.agg(Map("myColumn"->myParameter))
}
example:
val df = Seq(1.0,2.0,3.0).toDF("myColumn")
myFunc(df,"avg")
.show()
gives:
+-------------+
|avg(myColumn)|
+-------------+
| 2.0|
+-------------+
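As a side note, the Map form also accepts several column -> function pairs at once; with a hypothetical second column it would look like this:
df.agg(Map("myColumn" -> "avg", "otherColumn" -> "max"))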
Try this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, max}
val df = Seq((1, 2, 12),(2, 1, 21),(1, 5, 10),(5, 3, 9),(2, 5, 4)).toDF("a","b","c")
def myFunc(df: DataFrame, f: Column): DataFrame = {
df.agg(f)
}
myFunc(df, max(col("a"))).show
+------+
|max(a)|
+------+
| 5|
+------+
Hope it helps!
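A closely related variant, matching the myFunc(anyDf, max) idea from the question: the aggregate function itself can be passed as a Column => Column parameter. A sketch (the function and parameter names here are mine):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, max}

// Sketch: accept the aggregate function and the column name as parameters
def myAggFunc(myDf: DataFrame, aggFn: Column => Column, colName: String): DataFrame =
  myDf.agg(aggFn(col(colName)))

myAggFunc(df, (c: Column) => max(c), "a").show
// any other aggregate (avg, min, countDistinct, ...) can be passed the same way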
Related
I've recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. However, I'm getting the following error.
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
The error comes from the following function:
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
And I'm applying it the following way:
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
initialDF
.groupBy("field1", "field2")
.agg(
defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
)
Am I trying to do black magic here, or is there something that I may be missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
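For illustration, here is a minimal sketch (not the asker's code; the names and the Long/Option types are assumptions) of a UDF signature that Spark can map to DataTypes, operating on plain values instead of Columns:
// Sketch: a UDF over plain Scala values; returning Option makes the "default" case null
val defaultLongUDF = udf((x: Long, default: Long) => if (x == default) None else Some(x))
// e.g. defaultLongUDF(count("field3"), lit(0L))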
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
df
.groupBy("field1", "field2")
.agg(defaultFunction(count("field3"), lit(0)).as("counter"))
.show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+
I have the following case class:
case class Person(name: String, lastname: Option[String] = None, age: BigInt) {}
And the following json:
{ "name": "bemjamin", "age" : 1 }
When I try to transform my dataframe into a dataset:
spark.read.json("example.json")
.as[Person].show()
It shows me the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'lastname' given input columns: [age, name];
My question is: If my schema is my case class and it defines that the lastname is optional, shouldn't the as() do the conversion?
I can easily fix this using a .map but I would like to know if there is another cleaner alternative to this.
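For reference, the .map workaround mentioned above could look roughly like this sketch (the missing-column check and the field handling are my assumptions, not the asker's actual code):
// Sketch: build Person values by hand, treating lastname as optional
import spark.implicits._

val people = spark.read.json("example.json").map { row =>
  val hasLastname = row.schema.fieldNames.contains("lastname")
  Person(
    name = row.getAs[String]("name"),
    lastname = if (hasLastname) Option(row.getAs[String]("lastname")) else None,
    age = BigInt(row.getAs[Long]("age"))
  )
}
people.show()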
There is one more option to solve the above issue. Two steps are required:
1. Make sure that fields that can be missing are declared as nullable Scala types (like Option[_]).
2. Provide a schema argument instead of depending on schema inference. You can, for example, use the Spark SQL Encoder:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
You can update the code as below:
val schema = Encoders.product[Person].schema
val df = spark.read
  .schema(schema)
  .json("/Users/../Desktop/example.json")
  .as[Person]
df.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+
When you are performing spark.read.json("example.json").as[Person].show(), it is basically reading the dataframe as
FileScan json [age#6L,name#7]
and then trying to apply the Encoder for the Person object, hence the AnalysisException, since it is not able to find lastname in your json file.
Either you could hint Spark that lastname is optional by supplying some data that has lastname, or try this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema: StructType = ScalaReflection.schemaFor[Person].dataType.asInstanceOf[StructType]
val x = spark.read
  .schema(schema)
  .json("src/main/resources/json/x.json")
  .as[Person]
x.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+
Hope it helps.
I have action data which consists of a user id and an action type:
+-------+-------+
|user_id|   type|
+-------+-------+
|     11| SEARCH|
|     11| DETAIL|
|     12| SEARCH|
+-------+-------+
I want to keep only the actions that belong to users who have at least one SEARCH action.
So I created a Bloom filter with the user ids that have a SEARCH action.
Then I tried to filter all actions depending on whether the user id is in the Bloom filter:
import org.apache.spark.util.sketch.BloomFilter

val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
df.filter(bloomFilter.mightContainString($"user_id"))
But the code gives an exception
type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how I can pass a column value to the BloomFilter.mightContainString method.
Create filter:
val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.stat.bloomFilter("user_id", expectedNumItems, fpp)
Use udf for filtering:
import org.apache.spark.sql.functions.udf
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))
If your current Bloom filter implementation is serializable you should be able to use it the same way, but if the data is large enough to justify a Bloom filter, you should avoid collecting.
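As a side note, to match the question's intent (only users with at least one SEARCH action), the same df.stat API can be applied to the filtered DataFrame first. Here is a sketch reusing the values defined above:
// Sketch: build the Bloom filter only from SEARCH rows, then filter all actions with it
val searchersFilter = df
  .filter($"type" === "SEARCH")
  .stat.bloomFilter("user_id", expectedNumItems, fpp)
val mightContainSearcher = udf((s: String) => searchersFilter.mightContain(s))
df.filter(mightContainSearcher($"user_id"))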
You can do something like this:
val sparkSession = ???
val sc = sparkSession.sparkContext
val bloomFilter = BloomFilter.create(100)
val df = ???
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
At this point, I'll mention that collect is not a good idea. Next, you can do something like this:
import org.apache.spark.sql.functions.udf
val bbFilter = sc.broadcast(bloomFilter)
val filterUDF = udf((s: String) => bbFilter.value.mightContainString(s))
df.filter(filterUDF($"user_id"))
You can remove the broadcasting if the bloomFilter instance is serializable.
Hope this helps, Cheers.
How do I write a SQL statement to achieve the same result as the following statement:
SELECT * FROM table t WHERE t.a LIKE '%'||t.b||'%';
Thanks.
spark.sql.Column provides a like method, but as of now (Spark 1.6.0 / 2.0.0) it works only with string literals. Still, you can use raw SQL:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // Make sure you use HiveContext
import sqlContext.implicits._ // Optional, just to be able to use toDF
val df = Seq(("foo", "bar"), ("foobar", "foo"), ("foobar", "bar")).toDF("a", "b")
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE a LIKE CONCAT('%', b, '%')")
// +------+---+
// | a| b|
// +------+---+
// |foobar|foo|
// |foobar|bar|
// +------+---+
or expr / selectExpr:
df.selectExpr("a like CONCAT('%', b, '%')")
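To filter with it rather than just evaluate the expression, expr can be used in a where clause (a sketch; behaviour may vary slightly by Spark version):
import org.apache.spark.sql.functions.expr
df.where(expr("a like CONCAT('%', b, '%')"))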
In Spark 1.5 it will require HiveContext. If for some reason HiveContext is not an option, you can use a custom udf:
import org.apache.spark.sql.functions.udf
val simple_like = udf((s: String, p: String) => s.contains(p))
df.where(simple_like($"a", $"b"))
val regex_like = udf((s: String, p: String) =>
new scala.util.matching.Regex(p).findFirstIn(s).nonEmpty)
df.where(regex_like($"a", $"b"))
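As a side note, in more recent Spark versions Column.contains also accepts another Column, which may cover the simple (non-regex) case without a UDF; a sketch:
df.where($"a".contains($"b"))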
I have data in a DataFrame with the below columns.
The file format is CSV, and all of the below columns' datatypes are String:
employeeid,pexpense,cexpense
Now I need to create a new DataFrame which has a new column called expense, calculated from the columns pexpense and cexpense.
The tricky part is that the calculation algorithm is not a UDF that I created, but an external function that needs to be imported from a Java library, and which takes primitive types as arguments - in this case pexpense and cexpense - to calculate the value required for the new column.
The function signature, which comes from an external Java jar:
public class MyJava
{
public Double calculateExpense(Double pexpense, Double cexpense) {
// calculation
}
}
So how can I invoke that external function to create a new calculated column? Can I register that external function as a UDF in my Spark application?
You can create a UDF from the external method similar to the following (illustrated using the Scala REPL via spark-shell):
// From a Linux shell prompt:
vi MyJava.java
public class MyJava {
public Double calculateExpense(Double pexpense, Double cexpense) {
return pexpense + cexpense;
}
}
:wq
javac MyJava.java
jar -cvf MyJava.jar MyJava.class
spark-shell --jars /path/to/jar/MyJava.jar
// From within the Spark shell
val df = Seq(
("1", "1.0", "2.0"), ("2", "3.0", "4.0")
).toDF("employeeid", "pexpense", "cexpense")
val myJava = new MyJava
val myJavaUdf = udf(
myJava.calculateExpense _
)
val df2 = df.withColumn("totalexpense", myJavaUdf($"pexpense", $"cexpense") )
df2.show
+----------+--------+--------+------------+
|employeeid|pexpense|cexpense|totalexpense|
+----------+--------+--------+------------+
| 1| 1.0| 2.0| 3.0|
| 2| 3.0| 4.0| 7.0|
+----------+--------+--------+------------+
You can simply "wrap" the given method in a UDF by passing it as an argument to the udf function in org.apache.spark.sql.functions:
import org.apache.spark.sql.functions._
import spark.implicits._
val myJava = new MyJava // an instance is needed since calculateExpense is an instance method
val myUdf = udf(myJava.calculateExpense _)
val newDF = df.withColumn("expense", myUdf($"pexpense", $"cexpense"))
This assumes pexpense and cexpense columns are both Doubles.
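If, as stated in the question, the columns are actually Strings, one option (a sketch, using a hypothetical result name) is to cast them before applying the UDF:
val newDFFromStrings = df.withColumn(
  "expense",
  myUdf($"pexpense".cast("double"), $"cexpense".cast("double"))
)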
Below is an example of summing two columns:
val somme = udf((a: Int, b: Int) => a + b)

val df_new = df.select(
  col("employeeid"),
  col("pexpense"),
  col("cexpense"),
  somme(col("pexpense"), col("cexpense")) as "expense"
)