I want to parallelize a Python list, map over it, and also pass a DataFrame to the mapper function.
def output_age_split(df):
    ages = [18, 19, 20, 21, 22]
    age_dfs = spark.sparkContext.parallelize(ages).map(lambda x: test(x, df))
    # Unsure of type of age_dfs, but should be able to split into the smaller dfs like this somehow
    return age_dfs[0], age_dfs[1] ...

def test(age, df):
    return df.where(col("age") == age)
This results in a pickling error
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
How should I parallelize this operation so that I get back a collection of DataFrames?
EDIT: Sample of df
|age|name|salary|
|---|----|------|
|18 |John|40000 |
|22 |Joseph|60000 |
The issue is that age_dfs is not a DataFrame; it is an RDD. When you apply map with the test function (which returns a DataFrame), you end up with an RDD of type PipelinedRDD, which is neither a DataFrame nor directly iterable, hence:
TypeError: 'PipelinedRDD' object is not iterable
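To see the type in isolation (a minimal sketch that leaves the DataFrame out, so it runs without the pickling error):

rdd = spark.sparkContext.parallelize([18, 19, 20]).map(lambda x: x)
# The map result is a PipelinedRDD, not a DataFrame, so you can neither index it
# nor call DataFrame methods on it.
print(type(rdd))  # <class 'pyspark.rdd.PipelinedRDD'>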
You can try the workaround below: iterate over the list on the driver instead, build a collection of DataFrames, and iterate over them at will.
from pyspark.sql.functions import *

def output_age_split(df):
    ages = [18, 19, 20, 21, 22]
    result = []
    for age in ages:
        temp_df = test(age, df)
        if not len(temp_df.head(1)) == 0:
            result.append(temp_df)
    return result

def test(age, df):
    return df.where(col("age") == age)
# +---+------+------+
# |age|name |salary|
# +---+------+------+
# |18 |John |40000 |
# |22 |Joseph|60000 |
# +---+------+------+
df = spark.sparkContext.parallelize(
    [
        (18, "John", 40000),
        (22, "Joseph", 60000)
    ]
).toDF(["age", "name", "salary"])
df.show()
result = output_age_split(df)
# Output type is: <class 'list'>
print(f"Output type is: {type(result)}")
for r in result:
    r.show()
# +---+----+------+
# |age|name|salary|
# +---+----+------+
# | 18|John| 40000|
# +---+----+------+
# +---+------+------+
# |age|  name|salary|
# +---+------+------+
# | 22|Joseph| 60000|
# +---+------+------+
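If you would rather keep the splits addressable by age, a dict comprehension on the driver gives the same thing (a minimal sketch, assuming df and ages are defined as above; the name age_dfs is just illustrative):

ages = [18, 19, 20, 21, 22]
age_dfs = {age: df.where(col("age") == age) for age in ages}
# The split for age 18; unlike `result`, this also keeps the empty splits.
age_dfs[18].show()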
I'm currently working in Databricks and have a Delta table with 20+ columns. I basically need to take a value from one column in each row, send it to an API which returns two values/columns, and then recreate the other 26 columns to merge the values back into the original Delta table. So the input is 28 columns and the output is 28 columns. Currently my code looks like:
from pyspark.sql.types import *
from pyspark.sql import functions as F
import requests, uuid, json
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col,lit
from functools import reduce
spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
spark.sql('set spark.sql.execution.arrow.pyspark.enabled = true')
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true")
spark.conf.set("spark.sql.parquet.compression.codec","gzip")
spark.conf.set("spark.sql.inMemorycolumnarStorage.compressed","true")
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning","true");
output = spark.sql("select * from delta.`table`").cache()

SeriesAppend = []

for i in output.collect():
    # small mapping fix
    if i['col1'] == 'val1':
        var0 = 'a'
    elif i['col1'] == 'val2':
        var0 = 'b'
    elif i['col1'] == 'val3':
        var0 = 'c'
    elif i['col1'] == 'val4':
        var0 = 'd'

    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)

    # subscription info
    headers = {header}
    body = [{
        'text': i['col2']
    }]

    if len(i['col2']) < 500:
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        dumps = json.dumps(response[0])
        loads = json.loads(dumps)
        json_rdd = sc.parallelize(loads)
        json_df = spark.read.json(json_rdd)
        json_df = json_df.withColumn('col1', lit(i['col1']))
        json_df = json_df.withColumn('col2', lit(i['col2']))
        json_df = json_df.withColumn('col3', lit(i['col3']))
        ...
        SeriesAppend.append(json_df)
    else:
        pass

Series_output = reduce(DataFrame.unionAll, SeriesAppend)
SAMPLE DF with only 3 columns:
df = spark.createDataFrame(
    [
        ("a", "cat", "owner1"),  # create your data here, be consistent in the types.
        ("b", "dog", "owner2"),
        ("c", "fish", "owner3"),
        ("d", "fox", "owner4"),
        ("e", "rat", "owner5"),
    ],
    ["col1", "col2", "col3"])  # add your column names here
I really just need to write the response plus the other column values to a Delta table, so DataFrames are not strictly required, but I haven't found a faster way than the above. Right now I can run 5 inputs, which return 15, in 25.3 seconds without the unionAll. With the union included, it turns into 3 minutes.
The final output would look like:
df = spark.createDataFrame(
    [
        ("a", "cat", "owner1", "MI", 48003),  # create your data here, be consistent in the types.
        ("b", "dog", "owner2", "MI", 48003),
        ("c", "fish", "owner3", "MI", 48003),
        ("d", "fox", "owner4", "MI", 48003),
        ("e", "rat", "owner5", "MI", 48003),
    ],
    ["col1", "col2", "col3", "col4", "col5"])  # add your column names here
How can I make this faster in Spark?
As mentioned in my comments, you should use a UDF to distribute the workload to the workers instead of collecting everything and letting a single machine (the driver) run it all. That approach is simply wrong and not scalable.
# This is your main function, pure Python, and you can unit test it in any way you want.
# The most important thing about this function is:
# - everything must be encapsulated inside the function, no global variable works here
def req(col1, col2):
    if col1 == 'val1':
        var0 = 'a'
    elif col1 == 'val2':
        var0 = 'b'
    elif col1 == 'val3':
        var0 = 'c'
    elif col1 == 'val4':
        var0 = 'd'

    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)

    # subscription info
    headers = {header}  # !!! `header` must be available **inside** this function, global won't work
    body = [{
        'text': col2
    }]

    if len(col2) < 500:
        # !!! same as `header`, `constructed_url` must be available **inside** this function, global won't work
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        # placeholder field access: adjust to the shape your API actually returns
        # (in the question the payload is a list, i.e. response[0][...])
        return (response['col4'], response['col5'])
    else:
        return None
# Now you wrap the function above into a Spark UDF.
# I'm using only 2 columns here as input, but you can use as many columns as you wish.
# Same as output, I'm using only a tuple with 2 elements, you can make it as many items as you wish.
from pyspark.sql import functions as F, types as T

df = df.withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2'))
df.show()
# Output
# +----+----+------+------------------+
# |col1|col2|  col3|              temp|
# +----+----+------+------------------+
# |   a| cat|owner1|[foo_cat, bar_cat]|
# |   b| dog|owner2|[foo_dog, bar_dog]|
# |   c|fish|owner3|              null|
# |   d| fox|owner4|              null|
# |   e| rat|owner5|              null|
# +----+----+------+------------------+
# Now all you have to do is extract the tuple and assign to separate columns
# (and delete temp column to cleanup)
(df
 .withColumn('col4', F.col('temp')[0])
 .withColumn('col5', F.col('temp')[1])
 .drop('temp')
 .show()
)
# Output
# +----+----+------+-------+-------+
# |col1|col2|  col3|   col4|   col5|
# +----+----+------+-------+-------+
# |   a| cat|owner1|foo_cat|bar_cat|
# |   b| dog|owner2|foo_dog|bar_dog|
# |   c|fish|owner3|   null|   null|
# |   d| fox|owner4|   null|   null|
# |   e| rat|owner5|   null|   null|
# +----+----+------+-------+-------+
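A small variant you could also consider (a sketch under the same assumptions; col4/col5 are only placeholder names): have the UDF return a struct instead of an array, so each output field keeps its own name and can carry its own type.

from pyspark.sql import functions as F, types as T

resp_schema = T.StructType([
    T.StructField("col4", T.StringType()),
    T.StructField("col5", T.StringType()),
])

# req is the same pure-Python function as above; a 2-tuple maps onto the struct fields,
# and returning None yields a null struct.
req_struct_udf = F.udf(req, resp_schema)

(df
 .withColumn('temp', req_struct_udf('col1', 'col2'))
 .select('*', 'temp.col4', 'temp.col5')
 .drop('temp')
 .show())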
I'm reading data from JSON (dynamic schema) and loading it into a DataFrame.
Example DataFrame:
scala> import spark.implicits._
import spark.implicits._
scala> val DF = Seq(
(1, "ABC"),
(2, "DEF"),
(3, "GHIJ")
).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> DF.show
+---+----+
| id|word|
+---+----+
|  1| ABC|
|  2| DEF|
|  3|GHIJ|
+---+----+
Requirement:
Column count and names can be anything. I want to read the rows in a loop and fetch each column one by one. I need to process those values in subsequent flows, and I need both the column name and the value. I'm using Scala.
Python:
for i, j in df.iterrows():
    print(i, j)
I need the same functionality in Scala, and the column name and value should be fetched separately.
Kindly help.
df.iterrows is not from PySpark but from pandas. In Spark, you can use foreach:
import org.apache.spark.sql.Row

DF.foreach { _ match { case Row(id: Int, word: String) => println(id, word) } }
Result :
(2,DEF)
(3,GHIJ)
(1,ABC)
If you don't know the number of columns, you cannot use unapply on Row; in that case just do:
DF
.foreach(row => println(row))
Result :
[1,ABC]
[2,DEF]
[3,GHIJ]
And operate on each row using its methods, e.g. row.getAs[Int]("id") or row.getAs[String]("word").
I have a pyspark.sql.dataframe.DataFrame which is something like this:
+---------------------------+--------------------+--------------------+
|collect_list(results) | userid | page |
+---------------------------+--------------------+--------------------+
| [[[roundtrip, fal...|13482f06-9185-47f...|1429d15b-91d0-44b...|
+---------------------------+--------------------+--------------------+
Inside the collect_list(results) column there is an array with len = 2, and the elements are also arrays (the first one has a len = 1, and the second one a len = 9).
Is there a way to flatten this array of arrays into a unique array with len = 10 using pyspark?
Thanks!
You can flatten an array of arrays using pyspark.sql.functions.flatten (available since Spark 2.4; see its documentation). For example, this will create a new column called results with the flattened result, assuming your DataFrame variable is called df.
import pyspark.sql.functions as F
...
df.withColumn('results', F.flatten('collect_list(results)'))
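As a quick illustration of what flatten does (a minimal sketch with made-up data and column names, assuming Spark 2.4+):

from pyspark.sql import functions as F

nested = spark.createDataFrame([(1, [[1], [2, 3]])], ["id", "nested"])
# flatten removes exactly one level of nesting from an array of arrays
nested.withColumn("flat", F.flatten("nested")).show(truncate=False)
# +--+-------------+---------+
# |id|nested       |flat     |
# +--+-------------+---------+
# |1 |[[1], [2, 3]]|[1, 2, 3]|
# +--+-------------+---------+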
For a version that works before Spark 2.4 (but not before 1.3), you could try to explode the dataset you obtained before grouping, thereby unnesting one level of the array, then call groupBy and collect_list. Like this:
from pyspark.sql.functions import collect_list, explode
df = spark.createDataFrame([("foo", [1,]), ("foo", [2, 3])], schema=("foo", "bar"))
df.show()
# +---+------+
# |foo|   bar|
# +---+------+
# |foo|   [1]|
# |foo|[2, 3]|
# +---+------+
(df.select(
    df.foo,
    explode(df.bar))
 .groupBy("foo")
 .agg(collect_list("col"))
 .show())
# +---+-----------------+
# |foo|collect_list(col)|
# +---+-----------------+
# |foo|        [1, 2, 3]|
# +---+-----------------+
I have a DataFrame with Arrays.
val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123| [, 1, 2]|[3, 3, 4]|
|124| [, 3, 2]| [, 3, 4]|
+---+---------+---------+
How do I extract the minimum of each arrays?
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|        1|        3|
|124|        2|        3|
+---+---------+---------+
I have tried defining a UDF to do this but I am getting an error.
def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))
val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
Since Spark 2.4, you can use array_min to find the minimum value in an array. To use this function you will first have to cast your arrays of strings to arrays of integers. Casting will also take care of the empty strings by converting them into null values.
DF.select($"id",
  array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
  array_min(expr("cast(complete2 as array<int>)")).as("complete2"))
You can define your udf function as below
def minUdf = udf((arr: Seq[String])=> arr.filterNot(_ == "").map(_.toInt).min)
and call it as
DF.select(col("id"), minUdf(col("complete1")).as("complete1"), minUdf(col("complete2")).as("complete2")).show(false)
which should give you
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|1        |3        |
|124|2        |3        |
+---+---------+---------+
Updated
If the array passed to the udf function is empty or contains only empty strings, you will encounter
java.lang.UnsupportedOperationException: empty.min
You should handle that with an if/else condition in the udf function:
def minUdf = udf((arr: Seq[String]) => {
  val filtered = arr.filterNot(_ == "")
  if (filtered.isEmpty) 0
  else filtered.map(_.toInt).min
})
I hope the answer is helpful
Here is how you can do it without using a udf. First explode the arrays you got from split(), then group by the same id and take the min:
import org.apache.spark.sql.types.IntegerType

val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
  .withColumn("complete1", explode($"complete1"))
  .withColumn("complete2", explode($"complete2"))
  .groupBy($"id").agg(min($"complete1".cast(IntegerType)).as("complete1"), min($"complete2".cast(IntegerType)).as("complete2"))
Output:
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|124|2        |3        |
|123|1        |3        |
+---+---------+---------+
You don't need a UDF for this; you can use sort_array:
val DF = Seq(
  ("123", "|1|2", "3|3|4"),
  ("124", "|3|2", "|3|4")
).toDF("id", "complete1", "complete2")
  .select(
    $"id",
    split(regexp_replace($"complete1", "^\\|", ""), "\\|").as("complete1"),
    split(regexp_replace($"complete2", "^\\|", ""), "\\|").as("complete2")
  )

// now select minimum
DF.select(
    $"id",
    sort_array($"complete1")(0).as("complete1"),
    sort_array($"complete2")(0).as("complete2")
  ).show()
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123|        1|        3|
|124|        2|        3|
+---+---------+---------+
Note that I removed the leading | before splitting to avoid empty strings in the array
I have a Spark DataFrame where one column consists of indices into a list. I would like to write a udf that allows me to create a new column with the values associated with those indices.
E.g.
Suppose I have the following dataframe and array:
val df = spark.createDataFrame(Seq((0, Array(1, 1, 2)), (1, Array(1, 2, 0))))
df.show()
+---+---------+
| _1|       _2|
+---+---------+
|  0|[1, 1, 2]|
|  1|[1, 2, 0]|
+---+---------+
val sArray = Array("a", "b", "c")
I would like to be able to map the indices in _2 to their values in sArray, leading to this:
+---+---------+---------+
| _1|       _2|       _3|
+---+---------+---------+
|  0|[1, 1, 2]|[b, b, c]|
|  1|[1, 2, 0]|[b, c, a]|
+---+---------+---------+
I have been trying to do this with a udf:
def indexer(values: Array[String]) =
  udf((indices: Array[Int]) => indices.map(values(_)))
df.withColumn("_3", indexer(sArray)($"_2"))
However when I do this, I get the following error:
Failed to execute user defined function
... Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
What is going wrong here? How can I fix this?
When operating on an ArrayType column in a DataFrame, the actual type passed into a UDF is mutable.WrappedArray. The failure you see is the result of trying to cast this WrappedArray into the Array[Int] your function expects.
The fix is rather simple: define the function to expect a mutable.WrappedArray[Int]:
import scala.collection.mutable
import org.apache.spark.sql.expressions.UserDefinedFunction

def indexer(values: Array[String]): UserDefinedFunction = {
  udf((indices: mutable.WrappedArray[Int]) => indices.map(values(_)))
}