How would I express the following code in Scala via the DataFrame API?
sqlContext.read.parquet("/input").registerTempTable("data")
sqlContext.udf.register("median", new Median)
sqlContext.sql(
"""
|SELECT
| param,
| median(value) as median
|FROM data
|GROUP BY param
""".stripMargin).registerTempTable("medians")
I've started via
val data = sqlContext.read.parquet("/input")
sqlContext.udf.register("median", new Median)
data.groupBy("param")
But then I'm not sure how to call the median function.
You can either use callUDF
data.groupBy("param").agg(callUDF("median", $"value"))
or call it directly:
val median = new Median
data.groupBy("param").agg(median($"value"))
// Equivalent to
data.groupBy("param").agg(new Median()($"value"))
Still, I think it would make more sense to use an object rather than a class.
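For reference, here is a rough sketch of what Median could look like when defined as an object. The original Median class isn't shown in the question, so the input type, buffer layout, and null handling below are assumptions, not the author's code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical object-based version of the Median UDAF (details assumed).
object Median extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Double]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getSeq[Double](0) :+ input.getDouble(0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0)

  def evaluate(buffer: Row): Double = {
    val sorted = buffer.getSeq[Double](0).sorted
    val n = sorted.size
    if (n == 0) Double.NaN
    else if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }
}

// With an object there is no `new`, and it can still be registered for SQL:
// sqlContext.udf.register("median", Median)
// data.groupBy("param").agg(Median($"value").as("median"))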
I have the following expression,
val pageViews = spark.sql(
s"""
|SELECT
| proposal,
| MIN(timestamp) AS timestamp,
| MAX(page_view_after) AS page_view_after
|FROM page_views
|GROUP BY proposalId
|""".stripMargin
).createOrReplaceTempView("page_views")
I want to convert it into one that uses the Dataset API:
val pageViews = pageViews.selectExpr("proposal", "MIN(timestamp) AS timestamp", "MAX(page_view_after) AS page_view_after").groupBy("proposal")
The problem is that I can't call createOrReplaceTempView on this one; the build fails.
My question is how do I convert the first one into the second one and create a TempView out of that?
You can get rid of the SQL expression altogether by using Spark SQL's functions
import org.apache.spark.sql.functions._
as below:
pageViews
  .groupBy("proposal")
  .agg(min("timestamp").as("timestamp"), max("page_view_after").as("page_view_after"))
Considering you have a DataFrame available with the name pageViews, use:
pageViews
.groupBy("proposal")
.agg(expr("min(timestamp) AS timestamp"), expr("max(page_view_after) AS page_view_after"))
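To answer the temp view part of the question: the build fails because groupBy returns a RelationalGroupedDataset, not a DataFrame; only after agg do you get a DataFrame back, and that is what you can register. A minimal sketch, assuming pageViews is the input DataFrame:

import org.apache.spark.sql.functions.{min, max}

val aggregated = pageViews
  .groupBy("proposal")
  .agg(min("timestamp").as("timestamp"), max("page_view_after").as("page_view_after"))

// agg returns a DataFrame, so createOrReplaceTempView is available again
aggregated.createOrReplaceTempView("page_views")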
Is there a way to create an XML SOAP request by extracting a few columns from each row of a DataFrame? 10 records in a DataFrame would mean 10 separate SOAP XML requests.
How would you make the function call using map now?
You can do that by applying a map function to the DataFrame.
val df = ???  // your DataFrame
df.map(x => convertToSOAP(x))
// convertToSOAP is your function.
Here is an example based on your comment; hope you find this useful.
case class emp(id: String, name: String, city: String)

val list = List(emp("1", "user1", "NY"), emp("2", "user2", "SFO"))
val rdd = sc.parallelize(list)
val df = rdd.toDF  // needs import spark.implicits._ outside of the spark-shell

df.map(x => "<root><name>" + x.getString(1) + "</name><city>" + x.getString(2) + "</city></root>").show(false)
// Note: x is of type org.apache.spark.sql.Row
The output will be as follows:
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|<root><name>user1</name><city>NY</city></root> |
|<root><name>user2</name><city>SFO</city></root> |
+--------------------------------------------------+
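If you need an actual SOAP envelope rather than the bare XML shown above, convertToSOAP could wrap the row fields as sketched below. This is only an illustration: the body element names (GetUserRequest, id, name, city) are made-up placeholders, not a real service contract.

import org.apache.spark.sql.Row

def convertToSOAP(row: Row): String =
  s"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
     |  <soapenv:Body>
     |    <GetUserRequest>
     |      <id>${row.getString(0)}</id>
     |      <name>${row.getString(1)}</name>
     |      <city>${row.getString(2)}</city>
     |    </GetUserRequest>
     |  </soapenv:Body>
     |</soapenv:Envelope>""".stripMargin

// One SOAP XML request string per row, exactly like the simpler example above
df.map(row => convertToSOAP(row)).show(false)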
I am using the following function to parse a URL, but it throws an error:
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
.withColumn("host",parse_url($"url_col","HOST"))
.withColumn("query",parse_url($"url_col","QUERY"))
.show(false)
Error:
<console>:285: error: not found: value parse_url
.withColumn("host",parse_url($"url_col","HOST"))
^
<console>:286: error: not found: value parse_url
.withColumn("query",parse_url($"url_col","QUERY"))
^
Kindly guide me on how to parse a URL into its different parts.
The answer by #Ramesh is correct, but you might also want a hacky way to use this function without SQL queries :)
The hack lies in the fact that the callUDF function calls not only UDFs but any available function.
So you can write:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
b.withColumn("host", callUDF("parse_url", $"url_col", lit("HOST"))).
withColumn("query", callUDF("parse_url", $"url_col", lit("QUERY"))).
show(false)
Edit: after this Pull Request is merged, you can just use parse_url like a normal function. The PR was made after this question :)
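If you prefer to avoid callUDF, the same thing can also be written with expr, which parses a SQL expression into a Column; a small sketch using the same b DataFrame as above:

import org.apache.spark.sql.functions.expr

b.withColumn("host", expr("parse_url(url_col, 'HOST')"))
 .withColumn("query", expr("parse_url(url_col, 'QUERY')"))
 .show(false)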
parse_url is available only as SQL and not as an API function; refer to parse_url.
So you should be using it in a SQL query and not as a function call through the API.
You should register the dataframe and use the query as below:
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT url_col, parse_url(`url_col`, 'HOST') as HOST, parse_url(`url_col`,'QUERY') as QUERY from temp").show(false)
which should give you the following output:
+--------------------------------------------------------------------------------------------+-----------------+-------+
|url_col |HOST |QUERY |
+--------------------------------------------------------------------------------------------+-----------------+-------+
|http://spark.apache.org/path?query=1 |spark.apache.org |query=1|
|https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative|people.apache.org|null |
+--------------------------------------------------------------------------------------------+-----------------+-------+
I hope the answer is helpful.
As mentioned before, when you register a UDF you don't get a Java function; rather, you introduce it to Spark, so you must call it the "Spark way".
I want to suggest another method I find convenient, especially when there are several columns you want to add: using selectExpr.
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
val c = b.selectExpr("*", "parse_url(url_col, 'HOST') as host", "parse_url(url_col, 'QUERY') as query")
c.show(false)
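parse_url understands more parts than HOST and QUERY (PROTOCOL, PATH, REF, and so on), and with selectExpr they can all be added in one go; for example:

val d = b.selectExpr(
  "*",
  "parse_url(url_col, 'PROTOCOL') as protocol",
  "parse_url(url_col, 'HOST') as host",
  "parse_url(url_col, 'PATH') as path",
  "parse_url(url_col, 'QUERY') as query",
  "parse_url(url_col, 'REF') as ref")
d.show(false)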
I created a library called bebe that exposes the parse_url functionality via the Scala API.
Suppose you have the following DataFrame:
+------------------------------------+---------------+
|some_string |part_to_extract|
+------------------------------------+---------------+
|http://spark.apache.org/path?query=1|HOST |
|http://spark.apache.org/path?query=1|QUERY |
|null |null |
+------------------------------------+---------------+
Calculate the different parts of the URL:
df.withColumn("actual", bebe_parse_url(col("some_string"), col("part_to_extract")))
+------------------------------------+---------------+----------------+
|some_string |part_to_extract|actual |
+------------------------------------+---------------+----------------+
|http://spark.apache.org/path?query=1|HOST |spark.apache.org|
|http://spark.apache.org/path?query=1|QUERY |query=1 |
|null |null |null |
+------------------------------------+---------------+----------------+
In Spark SQL version 1.6, using DataFrames, is there a way to calculate, for a specific column, the result of dividing the current row by the next one, for every row?
For example, if I have a table with one column, like so
Age
100
50
20
4
I'd like the following output
Fraction
2
2.5
5
The last row is dropped because it has no "next row" to divide by.
Right now I am doing it by ranking the table and joining it with itself, where the rank equals rank + 1.
Is there a better way to do this?
Can this be done with a Window function?
A Window function alone does only part of the trick. The other part can be done by defining a udf function
def div = udf((age: Double, lag: Double) => lag/age)
First we need to find the lag using the Window function and then pass that lag and age to the udf function to find the div.
import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val dataframe = Seq(
  ("A", 100),
  ("A", 50),
  ("A", 20),
  ("A", 4)
).toDF("person", "Age")
val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
val newDF = dataframe.withColumn("lag", lag(dataframe("Age"), 1) over(windowSpec))
And finally call the udf function:
newDF.filter(newDF("lag").isNotNull).withColumn("div", div(newDF("Age"), newDF("lag"))).drop("Age", "lag").show
The final output would be:
+------+---+
|person|div|
+------+---+
| A|2.0|
| A|2.5|
| A|5.0|
+------+---+
Edited
As #Jacek has suggested, a better solution is to use .na.drop instead of .filter(newDF("lag").isNotNull) and to use the / operator, so we don't even need the udf function:
newDF.na.drop.withColumn("div", newDF("lag")/newDF("Age")).drop("Age", "lag").show
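An equivalent sketch using lead instead of lag, so each row keeps its own Age and is divided directly by the next row's Age (same dataframe as above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byAgeDesc = Window.partitionBy("person").orderBy(col("Age").desc)

dataframe
  .withColumn("next_age", lead(col("Age"), 1).over(byAgeDesc))  // Age of the following row
  .na.drop(Seq("next_age"))                                     // drop the last row, which has no next row
  .withColumn("div", col("Age") / col("next_age"))
  .drop("Age", "next_age")
  .show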
I am trying to use the findSynonyms operation without collecting (an action). Here is an example. I have a DataFrame which holds vectors.
df.show()
+--------------------+
| result|
+--------------------+
|[-0.0081423431634...|
|[0.04309031420520...|
|[0.03857229948043...|
+--------------------+
I want to use findSynonyms on this DataFrame. I tried
df.map{case Row(vector:Vector) => model.findSynonyms(vector)}
but it throws a null pointer exception. Then I learned that Spark does not support nested transformations or actions. One possible way to do it is to collect this DataFrame and then run findSynonyms. How can I do this operation at the DataFrame level?
If I have understood correctly, you want to perform a function on each row in the DataFrame. To do that, you can declare a User Defined Function (UDF). In your case the UDF will take a vector as input.
import org.apache.spark.sql.functions._
// Also import the matching Vector type, e.g. org.apache.spark.mllib.linalg.Vector for an MLlib Word2VecModel.
// findSynonyms also takes the number of synonyms to return (5 here is just an example).
val func = udf((vector: Vector) => model.findSynonyms(vector, 5))
df.withColumn("synonymes", func($"result"))
A new column "synonymes" will be created using the results from the func function.