Run SparkSQL query for each line in PySpark dataframe

I have a dataframe that contains parameters of a SQL query I need to run. I ultimately need the results of all of these SQL queries to be stored in a separate dataframe. Currently, I am mapping over each row of my parameter dataframe, then using a custom function to create the SQL query that needs to run, like so:
# Example df
df = spark.createDataFrame(
    [
        ("contract", 123),
        ("customer", 223),
    ],
    ["id_type", "ids"]
)
df.show()
+--------+----+
| id_type| ids|
+--------+----+
|contract| 123|
|customer| 223|
+--------+----+
# Create custom function that will write a sql query
def query_writer(id_type, ids):
    qry = f'''
    SELECT * FROM table
    WHERE {id_type}_id = '{ids}'
    '''
    return sqlContext.sql(qry)  # I also tried saving the results as a dictionary and outputting that

# Apply this function to each row of the dataframe
rdd1 = df.rdd.map(lambda x: (x[0], x[1], query_writer(x[0], x[1])))
qry = rdd1.take(1)
But, I get this error:
Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run on workers.
For more information, see SPARK-5063.
Is there any way to run a SQL query for each row of a dataframe in PySpark?
I happen to know that the query will always return 4 rows, if that is helpful.
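The error comes from calling sqlContext.sql inside rdd.map, which runs on the workers. One possible workaround, sketched here under the assumption that the parameter dataframe is small enough to collect, is to bring the parameters to the driver, run one query per row there, and union the results. The names build_query, run_all and ALLOWED_ID_TYPES are hypothetical helpers for illustration; the table name comes from the question.

```python
from functools import reduce

# Hypothetical whitelist: id_type ends up in a column name, so it is
# checked against known values before it is interpolated into SQL.
ALLOWED_ID_TYPES = {"contract", "customer"}


def build_query(id_type, ids):
    """Return the SQL text for one parameter row (pure Python, driver side)."""
    if id_type not in ALLOWED_ID_TYPES:
        raise ValueError(f"unexpected id_type: {id_type!r}")
    return f"SELECT * FROM table WHERE {id_type}_id = '{ids}'"


def run_all(spark, params_df):
    """Run one query per parameter row on the driver and union the results.

    Assumes params_df is small, has columns id_type and ids, contains at
    least one row, and that every query returns the same schema.
    """
    from pyspark.sql import functions as F  # imported lazily; needs Spark

    frames = []
    for row in params_df.collect():  # rows are plain driver-side objects now
        result = spark.sql(build_query(row["id_type"], row["ids"]))
        # Tag each result with the parameters that produced it
        # (this overwrites columns of the same name, if any exist).
        frames.append(
            result.withColumn("id_type", F.lit(row["id_type"]))
                  .withColumn("ids", F.lit(row["ids"]))
        )
    return reduce(lambda a, b: a.unionByName(b), frames)
```

With the example dataframe from the question this would be called as run_all(spark, df). If the parameter set is large, a join between the parameter dataframe and the queried table is likely a better fit than a driver-side loop.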

Related

String aggregation and group by in PySpark

I have a dataset that has Id, Value and Timestamp columns. Id and Value columns are strings. Sample:
Id   Value  Timestamp
Id1  100    1658919600
Id1  200    1658919602
Id1  300    1658919601
Id2  433    1658919677
I want to concatenate Values that belong to the same Id, and order them by Timestamp. E.g. for rows with Id1 the result would look like:
Id   Values
Id1  100;300;200
Some pseudo code would be:
res = SELECT Id,
             STRING_AGG(Value, ";") WITHIN GROUP (ORDER BY Timestamp) AS Values
      FROM table
      GROUP BY Id
Can someone help me write this in Databricks? PySpark and SQL are both fine.
You can collect a list of structs of Timestamp and Value (in that order) for each Id, sort it (sort_array sorts structs by their first field, i.e. Timestamp), and then join the Value fields into a single string with concat_ws.
PySpark (Spark 3.1.2)
import pyspark.sql.functions as F
(df
    .groupBy("Id")
    .agg(F.expr("concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values"))
).show(truncate=False)
# +---+-----------+
# |Id |Values |
# +---+-----------+
# |Id1|100;300;200|
# |Id2|433 |
# +---+-----------+
In Spark SQL:
SELECT Id, concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values
FROM table
GROUP BY Id
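The reason this works is that structs compare field by field, so putting Timestamp first makes it the sort key. The same idea can be checked in plain Python, where tuples compare the same way. This is only an analogy to illustrate the ordering, not Spark code; the aggregate function and the rows list are made up for the illustration.

```python
from collections import defaultdict

# Sample rows from the question: (Id, Value, Timestamp)
rows = [
    ("Id1", "100", 1658919600),
    ("Id1", "200", 1658919602),
    ("Id1", "300", 1658919601),
    ("Id2", "433", 1658919677),
]


def aggregate(rows):
    """Group by Id, sort each group by Timestamp, join the Values with ';'."""
    groups = defaultdict(list)
    for id_, value, ts in rows:
        # Analogue of collect_list(struct(Timestamp, Value)):
        # Timestamp comes first so that it drives the sort.
        groups[id_].append((ts, value))
    return {
        # Analogue of concat_ws(';', sort_array(...).Value)
        id_: ";".join(value for _, value in sorted(pairs))
        for id_, pairs in groups.items()
    }


result = aggregate(rows)
print(result)  # {'Id1': '100;300;200', 'Id2': '433'}
```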
This is a beautiful question!! This is a perfect use case for Fugue which can port Python and Pandas code to PySpark. I think this is something that is hard to express in Spark but easy to express in native Python or Pandas.
Let's just concern ourselves with 1 ID first. For one ID, using pure native Python, it would look like below. Assume the Timestamps are already sorted when this is applied.
import pandas as pd
df = pd.DataFrame({"Id": ["Id1", "Id1", "Id1", "Id2", "Id2", "Id2"],
                   "Value": [100, 200, 300, 433, 500, 600],
                   "Timestamp": [1658919600, 1658919602, 1658919601, 1658919677, 1658919670, 1658919672]})

from typing import Iterable, List, Dict, Any

def logic(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    _id = df[0]['Id']
    items = []
    for row in df:
        items.append(row['Value'])
    yield {"Id": _id, "Values": items}
Now we can call Fugue with one line of code to run this on Pandas. Fugue uses the type annotation from the logic function to handle conversions for you as it enters the function. We can run this for 1 ID (not sorted yet).
from fugue import transform
transform(df.loc[df["Id"] == "Id1"], logic, schema="Id:str,Values:[int]")
and that generates this:
Id Values
0 Id1 [100, 200, 300]
Now we are ready to bring it to Spark. All we need to do is add the engine and partitioning strategy to the transform call.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = transform(df,
                logic,
                schema="Id:str,Values:[int]",
                partition={"by": "Id", "presort": "Timestamp asc"},
                engine=spark)
sdf.show()
Because we passed in the SparkSession, this code will run on Spark. sdf is a Spark DataFrame, so we need .show() because it evaluates lazily. A schema is a requirement for Spark, so Fugue needs it too, but it is significantly simplified. The partitioning strategy will run logic on each Id and will sort the items by Timestamp within each partition.
For the FugueSQL version, you can do:
from fugue_sql import fsql
fsql(
"""
SELECT *
FROM df
TRANSFORM PREPARTITION BY Id PRESORT Timestamp ASC USING logic SCHEMA Id:str,Values:[int]
PRINT
"""
).run(spark)
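Easiest solution:
from pyspark.sql.functions import asc, col, collect_list, concat_ws
# Note: collect_list after a groupBy is not guaranteed to keep the order of the earlier sort.
df1 = df.sort(asc('Timestamp')).groupBy("id").agg(collect_list('Value').alias('newcol'))
df1.show()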
+---+---------------+
| id| newcol|
+---+---------------+
|Id1|[100, 300, 200]|
|Id2| [433]|
+---+---------------+
df1.withColumn('newcol',concat_ws(";",col("newcol"))).show()
+---+-----------+
| id| newcol|
+---+-----------+
|Id1|100;300;200|
|Id2| 433|
+---+-----------+

Filter Pyspark Dataframe with udf on entire row

Is there a way to select the entire row as a column to input into a Pyspark filter udf?
I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:
my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))
But col("*") throws an error because that's not a valid operation.
I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back into a dataframe. My DataFrame has complex nested types, so the schema inference fails when I try to convert the RDD into a dataframe again.
You should list all the columns explicitly. For example:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
# create sample df
df = sc.parallelize([
    (1, 'b'),
    (1, 'c'),
]).toDF(["id", "category"])
# simple filter function, registered as a UDF so it can be called with column names
@F.udf(returnType=BooleanType())
def my_filter(col1, col2):
    return (col1>0) & (col2=="b")
df.filter(my_filter('id', 'category')).show()
Results:
+---+--------+
| id|category|
+---+--------+
| 1| b|
+---+--------+
If you have many columns and you are sure of their order:
cols = df.columns
df.filter(my_filter(*cols)).show()
Yields the same output.
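For the original wish of handing the entire row to the filter, one option is to wrap all columns in a struct, which arrives in the UDF as a single Row. The sketch below is a minimal illustration, assuming the columns id and category from the sample dataframe above. Here my_filter is redefined as a plain one-argument predicate, and filter_whole_row is a hypothetical helper name.

```python
def my_filter(row):
    """Plain-Python predicate over one row.

    Only uses row["name"] access, so it works with a pyspark Row
    and also with a dict, which keeps it easy to test locally.
    """
    return row["id"] > 0 and row["category"] == "b"


def filter_whole_row(df):
    """Apply my_filter to every row of df without going through an RDD."""
    from pyspark.sql import functions as F  # imported lazily; needs Spark
    from pyspark.sql.types import BooleanType

    my_filter_udf = F.udf(my_filter, BooleanType())
    # struct(...) packs all columns into one column; the UDF receives it as a Row.
    whole_row = F.struct(*[F.col(c) for c in df.columns])
    return df.filter(my_filter_udf(whole_row))
```

Because the dataframe is never converted to an RDD and back, no schema inference is involved, which should sidestep the nested-type problem described in the question.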

Process all columns / the entire row in a Spark UDF

For a dataframe containing a mix of string and numeric datatypes, the goal is to create a new features column that is a minhash of all of them.
While this could be done by performing a dataframe.toRDD it is expensive to do that when the next step will be to simply convert the RDD back to a dataframe.
So is there a way to do a udf along the following lines:
val wholeRowUdf = udf( (row: Row) => computeHash(row))
Row is not a spark sql datatype of course - so this would not work as shown.
Update/clarification: I realize it is easy to create a full-row UDF that runs inside withColumn. What is not so clear is what can be used inside a Spark SQL statement:
val featurizedDf = spark.sql("select wholeRowUdf( what goes here? ) as features from mytable")
You can pass all columns, or a selected subset of them, to a udf as a Row by wrapping them with the built-in struct function.
First, I define a dataframe:
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")
).toDF("col1", "col2", "col3")
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |a |b |c |
// |a1 |b1 |c1 |
// +----+----+----+
Then I define a function that joins all the elements of a row into one string separated by ", " (this stands in for your computeHash function):
import org.apache.spark.sql.Row
def concatFunc(row: Row) = row.mkString(", ")
Then I wrap it in a udf:
import org.apache.spark.sql.functions._
def combineUdf = udf((row: Row) => concatFunc(row))
Finally, I call the udf through withColumn, using the built-in struct function to combine the selected columns into one column that is passed to the udf:
df.withColumn("contcatenated", combineUdf(struct(col("col1"), col("col2"), col("col3")))).show(false)
// +----+----+----+-------------+
// |col1|col2|col3|contcatenated|
// +----+----+----+-------------+
// |a |b |c |a, b, c |
// |a1 |b1 |c1 |a1, b1, c1 |
// +----+----+----+-------------+
So you can see that a Row can be used to pass a whole row as an argument.
You can even pass all the columns of a row at once:
val columns = df.columns
df.withColumn("contcatenated", combineUdf(struct(columns.map(col): _*)))
Updated
You can achieve the same with SQL queries too; you just need to register the udf first:
df.createOrReplaceTempView("tempview")
sqlContext.udf.register("combineUdf", combineUdf)
sqlContext.sql("select *, combineUdf(struct(`col1`, `col2`, `col3`)) as concatenated from tempview")
It will give you the same result as above.
If you don't want to hardcode the column names, you can build the column list as a string from whichever columns you need:
val columns = df.columns.map(x => "`"+x+"`").mkString(",")
sqlContext.sql(s"select *, combineUdf(struct(${columns})) as concatenated from tempview")
I hope the answer is helpful
I came up with a workaround: drop the column names into any existing spark sql function to generate a new output column:
concat(${df.columns.tail.mkString(",'-',")}) as Features
In this case the first column in the dataframe is a target and was excluded. That is another advantage of this approach: the actual list of columns may be manipulated.
This approach avoids unnecessary restructuring of the RDD/dataframes.
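Both answers boil down to building a SQL expression string from the column list. For PySpark users, the same string building might look like the sketch below. The function names are hypothetical; combineUdf and tempview are the names used in the answer above and are assumed to be registered already (for example with spark.udf.register and createOrReplaceTempView).

```python
def quoted(columns):
    """Backtick-quote each column name and join with commas."""
    return ",".join(f"`{c}`" for c in columns)


def whole_row_sql(columns, udf_name="combineUdf", view="tempview"):
    """SQL text that passes all given columns to a registered UDF as one struct."""
    return (
        f"select *, {udf_name}(struct({quoted(columns)})) "
        f"as concatenated from {view}"
    )


def concat_features_sql(columns, sep="-"):
    """SQL fragment that concatenates every column except the first (the target)."""
    joined = f",'{sep}',".join(columns[1:])
    return f"concat({joined}) as Features"


cols = ["col1", "col2", "col3"]
print(whole_row_sql(cols))
print(concat_features_sql(cols))
```

The resulting strings would then be passed to spark.sql, or used inside a larger select statement.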

Spark word2vec findSynonyms on Dataframes

I am trying to use findSynonyms operation without collecting (action). Here is an example. I have a DataFrame which holds vectors.
df.show()
+--------------------+
| result|
+--------------------+
|[-0.0081423431634...|
|[0.04309031420520...|
|[0.03857229948043...|
+--------------------+
I want to use findSynonyms on this DataFrame. I tried
df.map{case Row(vector:Vector) => model.findSynonyms(vector)}
but it throws a NullPointerException. I then learned that Spark does not support nested transformations or actions. One possible way is to collect this DataFrame and then run findSynonyms. How can I do this operation at the DataFrame level?
If I have understood correctly, you want to perform a function on each row in the DataFrame. To do that, you can declare a User Defined Function (UDF). In your case the UDF will take a vector as input.
import org.apache.spark.sql.functions._
val func = udf((vector: Vector) => {model.findSynonyms(vector)})
df.withColumn("synonymes", func($"result"))
A new column "synonymes" will be created using the results from the func function.

Extract a column value from a spark dataframe and add it to another dataframe

I have a Spark dataframe called "df_array" that always returns a single array as output, like below.
arr_value
[M,J,K]
I want to extract its value and add it to another dataframe.
Below is the code I was executing:
val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))
but my code always fails with "org.apache.spark.sql.AnalysisException: resolved attribute(s)".
Can someone help me with this?
The operation needed here is a join.
You'll need a common column in both dataframes, which will be used as the "key".
After the join you can select which columns to include in the new dataframe.
More details can be found here:
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.
The following performs a full outer join between df1 and df2.
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
If you know that df_array has only one record, you can collect it to the driver using first() and then use it as an array of literal values to create a column in any DataFrame:
import org.apache.spark.sql.functions._
import scala.collection.mutable
// first - collect that single array to driver (assuming array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)
// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*))
new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// | 1| a| [M, J, K]|
// | 2| b| [M, J, K]|
// +--------+--------+---------------+
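A different route that stays entirely within DataFrame operations is a cross join: since df_array holds a single row, cross joining it attaches that row's array to every row of the other dataframe without needing a key column. The PySpark sketch below assumes the array column is called arr_value, as in the sample output of the question; attach_single_row is a hypothetical helper name.

```python
def attach_single_row(old_df, df_array,
                      src_col="arr_value", new_col="new_array_value"):
    """Attach the single row of df_array to every row of old_df.

    Assumes df_array has exactly one row; limit(1) guards against the
    row count of the result multiplying if it ever has more.
    """
    one_row = (
        df_array
        .select(src_col)
        .withColumnRenamed(src_col, new_col)
        .limit(1)
    )
    # No join key is needed: a cross join with one row only adds the column.
    return old_df.crossJoin(one_row)
```

Compared with collecting the value through first(), this keeps the value inside Spark, which may be preferable when the array is large or the code has to stay lazy.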