PySpark: Working with Delta tables - For Loop Optimization with Union

I'm currently working in Databricks and have a Delta table with 20+ columns. I basically need to take a value from one column in each row, send it to an API which returns two values/columns, and then carry the other 26 columns along so I can merge the values back into the original Delta table. So the input and the output are both 28 columns. Currently my code looks like:
from pyspark.sql.types import *
from pyspark.sql import functions as F
import requests, uuid, json
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col,lit
from functools import reduce
spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
spark.sql('set spark.sql.execution.arrow.pyspark.enabled = true')
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true")
spark.conf.set("spark.sql.parquet.compression.codec","gzip")
spark.conf.set("spark.sql.inMemorycolumnarStorage.compressed","true")
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning","true");
output = spark.sql("select * from delta.`table`").cache()

SeriesAppend = []

for i in output.collect():
    # small mapping fix
    if i['col1'] == 'val1':
        var0 = 'a'
    elif i['col1'] == 'val2':
        var0 = 'b'
    elif i['col1'] == 'val3':
        var0 = 'c'
    elif i['col1'] == 'val4':
        var0 = 'd'

    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)

    # subscription info
    headers = {header}
    body = [{
        'text': i['col2']
    }]

    if len(i['col2']) < 500:
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        dumps = json.dumps(response[0])
        loads = json.loads(dumps)
        json_rdd = sc.parallelize(loads)
        json_df = spark.read.json(json_rdd)
        json_df = json_df.withColumn('col1', lit(i['col1']))
        json_df = json_df.withColumn('col2', lit(i['col2']))
        json_df = json_df.withColumn('col3', lit(i['col3']))
        ...
        SeriesAppend.append(json_df)
    else:
        pass

Series_output = reduce(DataFrame.unionAll, SeriesAppend)
SAMPLE DF with only 3 columns:
df = spark.createDataFrame(
    [
        ("a", "cat", "owner1"),  # create your data here, be consistent in the types.
        ("b", "dog", "owner2"),
        ("c", "fish", "owner3"),
        ("d", "fox", "owner4"),
        ("e", "rat", "owner5"),
    ],
    ["col1", "col2", "col3"])  # add your column names here
I really just need to write the response plus the other column values to a Delta table, so dataframes are not necessarily required, but I haven't found a faster way than the above. Right now I can run 5 inputs, which return 15 results, in 25.3 seconds without the unionAll. With the union included, it turns into 3 minutes.
The final output would look like:
df = spark.createDataFrame(
    [
        ("a", "cat", "owner1", "MI", 48003),  # create your data here, be consistent in the types.
        ("b", "dog", "owner2", "MI", 48003),
        ("c", "fish", "owner3", "MI", 48003),
        ("d", "fox", "owner4", "MI", 48003),
        ("e", "rat", "owner5", "MI", 48003),
    ],
    ["col1", "col2", "col3", "col4", "col5"])  # add your column names here
How can I make this faster in spark?

As mentioned in my comments, you should use a UDF to distribute the workload to the workers, instead of using collect and letting a single machine (the driver) do all the work. That approach is simply wrong and doesn't scale.
# This is your main function, pure Python, and you can unit test it any way you want.
# The most important thing about this function is:
# - everything must be encapsulated inside the function, no global variable works here
def req(col1, col2):
    if col1 == 'val1':
        var0 = 'a'
    elif col1 == 'val2':
        var0 = 'b'
    elif col1 == 'val3':
        var0 = 'c'
    elif col1 == 'val4':
        var0 = 'd'

    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)

    # subscription info
    headers = {header}  # !!! `header` must be available **inside** this function, a global won't work
    body = [{
        'text': col2
    }]

    if len(col2) < 500:
        # !!! same as `header`, `constructed_url` (and `params`) must be available **inside** this function, globals won't work
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        return (response.col4, response.col5)  # adjust to however your API response actually exposes these two values
    else:
        return None
# Now you wrap the function above into a Spark UDF
# (this assumes `from pyspark.sql import functions as F, types as T`).
# I'm using only 2 columns here as input, but you can use as many columns as you wish.
# Same for the output: I'm returning a tuple with 2 elements, but you can make it as many items as you wish.
df.withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2')).show()
# Output
# +----+----+------+------------------+
# |col1|col2| col3| temp|
# +----+----+------+------------------+
# | a| cat|owner1|[foo_cat, bar_cat]|
# | b| dog|owner2|[foo_dog, bar_dog]|
# | c|fish|owner3| null|
# | d| fox|owner4| null|
# | e| rat|owner5| null|
# +----+----+------+------------------+
# Now all you have to do is extract the tuple and assign to separate columns
# (and delete temp column to cleanup)
(df
    .withColumn('col4', F.col('temp')[0])
    .withColumn('col5', F.col('temp')[1])
    .drop('temp')
    .show()
)
# Output
# +----+----+------+-------+-------+
# |col1|col2| col3| col4| col5|
# +----+----+------+-------+-------+
# | a| cat|owner1|foo_cat|bar_cat|
# | b| dog|owner2|foo_dog|bar_dog|
# | c|fish|owner3| null| null|
# | d| fox|owner4| null| null|
# | e| rat|owner5| null| null|
# +----+----+------+-------+-------+
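To round this off for the original question: once the two new columns are in place, the enriched rows can be written straight to Delta. A minimal sketch, assuming the same source table as in the question and a placeholder target_path for the output (swap in a MERGE if you need to update the original table in place):
from pyspark.sql import functions as F, types as T

result = (
    spark.sql("select * from delta.`table`")
    .withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2'))
    .withColumn('col4', F.col('temp')[0])
    .withColumn('col5', F.col('temp')[1])
    .drop('temp')
)

# target_path is a placeholder; the write mode depends on whether you append or replace
result.write.format("delta").mode("overwrite").save(target_path)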

Related

pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using @udf or @pandas_udf

I'm trying to compute the dot product between two columns of a given dataframe. SparseVector already supports this in Spark, so I'm trying to do it in an easy and scalable way without converting to RDDs or to DenseVectors, but I'm stuck. I've spent the past 3 days trying to find an approach and it keeps failing: the computation is never returned for the two vector columns passed from the dataframe, and I'm looking for guidance on this, because I'm missing something here and I'm not sure what the root cause is.
For separate vectors and RDD vectors this approach works, but it fails when passing dataframe column vectors. To replicate the flow and the issue, please see below. Ideally this computation should happen in parallel, since the real data has billions or more rows (dataframe observations):
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.sql import Row

df = spark.createDataFrame(
    [
        [["a","b","c"], SparseVector(4527, {0: 0.6363067860791387, 1: 1.0888040725098247, 31: 4.371858972705023}), SparseVector(4527, {0: 0.6363067860791387, 1: 2.0888040725098247, 31: 4.371858972705023})],
        [["d"], SparseVector(4527, {8: 2.729945780576634}), SparseVector(4527, {8: 4.729945780576634})],
    ], ["word", "i", "j"])

# dataframe content
df.show()
+---------+--------------------+--------------------+
| word| i| j|
+---------+--------------------+--------------------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...|
+---------+--------------------+--------------------+
@udf(returnType=ArrayType(FloatType()))
def sim_cos(v1, v2):
    if v1 is not None and v2 is not None:
        return float(v1.dot(v2))

# calling udf
df = df.withColumn("dotP", sim_cos(df.i, df.j))

# output after udf
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j| dotP|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| null|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| null|
+---------+--------------------+--------------------+----------+
Rewriting the udf as a lambda does work on Spark 2.4.5. Posting in case anyone is interested in this approach for PySpark dataframes:
# rewrite udf as lambda function:
sim_cos = F.udf(lambda x, y: float(x.dot(y)), FloatType())

# executing udf on dataframe
df = df.withColumn("similarity", sim_cos(col("i"), col("j")))

# end result
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j|similarity|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| 21.792336|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| 12.912496|
+---------+--------------------+--------------------+----------+
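For what it's worth, the decorator version above most likely returned null because the declared return type (ArrayType(FloatType())) doesn't match the plain float the function returns; PySpark silently produces null on such a mismatch. A sketch of the decorator form with the corrected return type (same logic as the lambda):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType

@udf(returnType=FloatType())
def sim_cos(v1, v2):
    if v1 is not None and v2 is not None:
        return float(v1.dot(v2))

df = df.withColumn("dotP", sim_cos(col("i"), col("j")))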

PySpark passing Dataframe as extra parameter to map

I want to parallelize a Python list, use a map on that list, and also pass a DataFrame to the mapper function:
def output_age_split(df):
    ages = [18, 19, 20, 21, 22]
    age_dfs = spark.sparkContext.parallelize(ages).map(lambda x: test(x, df))
    # Unsure of the type of age_dfs, but I should be able to split it into the smaller dfs like this somehow
    return age_dfs[0], age_dfs[1] ...

def test(age, df):
    return df.where(col("age") == age)
This results in a pickling error
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
How should I parallelize this operation so that I get back a collection of DataFrames?
EDIT: Sample of df
|age|name|salary|
|---|----|------|
|18 |John|40000 |
|22 |Joseph|60000 |
The issue is that ages_dfs is not a dataframe, it's an RDD. Now, when you apply a map with the test function in it (which returns a dataframe), you end up in a weird situation where ages_dfs is actually an RDD of type PipelinedRDD, which is neither a dataframe nor iterable.
TypeError: 'PipelinedRDD' object is not iterable
You can try the workaround below, where you simply iterate over the list instead, build a collection of dataframes, and iterate over them at will.
from pyspark.sql.functions import *

def output_age_split(df):
    ages = [18, 19, 20, 21, 22]
    result = []
    for age in ages:
        temp_df = test(age, df)
        if not len(temp_df.head(1)) == 0:
            result.append(temp_df)
    return result

def test(age, df):
    return df.where(col("age") == age)
# +---+------+------+
# |age|name |salary|
# +---+------+------+
# |18 |John |40000 |
# |22 |Joseph|60000 |
# +---+------+------+
df = spark.sparkContext.parallelize(
    [
        (18, "John", 40000),
        (22, "Joseph", 60000)
    ]
).toDF(["age", "name", "salary"])
df.show()

result = output_age_split(df)

# Output type is: <class 'list'>
print(f"Output type is: {type(result)}")

for r in result:
    r.show()

# +---+----+------+
# |age|name|salary|
# +---+----+------+
# | 18|John| 40000|
# +---+----+------+

# +---+------+------+
# |age|  name|salary|
# +---+------+------+
# | 22|Joseph| 60000|
# +---+------+------+

PySpark: remove rows which derive from others

I have the following dataframe, which contains all the paths within a tree after going through all nodes. For each jump between nodes, a row is created where "dist" is the number of nodes so far, "node" is the current node and "path" is the path so far.
dist | node | path
0 | 1 | [1]
1 | 2 | [1,2]
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
At the end I just want to have a dataframe containing the complete paths without the intermediate steps:
dist | node | path
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
I also tried keeping the path column as a string ("1;2;3") and comparing which rows are substrings of each other, however I could not find a way to do that.
I found my old code and created an adapted example for your problem. I used the Spark graph library GraphFrames for this. The path can be determined by a Pregel-like message aggregation loop.
Here is the code.
First import all modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as f
from graphframes import GraphFrame
from pyspark.sql.types import *
from graphframes.lib import *

# shortcut for the aggregate message object from graphframes.lib
AM = AggregateMessages

# to plot the graph
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

spark = (SparkSession
         .builder
         .appName("PathReduction")
         .getOrCreate()
         )
sc = spark.sparkContext
Then create a sample dataset
# create dataframe
raw_data = [
    ("0", "1"),
    ("1", "2"),
    ("1", "5"),
    ("2", "3"),
    ("2", "4"),
    ("a", "b"),
    ("b", "c"),
    ("c", "d")]

schema = ["src", "dst"]
data = spark.createDataFrame(data=raw_data, schema=schema)
data.show()
+---+---+
|src|dst|
+---+---+
| 0| 1|
| 1| 2|
| 1| 5|
| 2| 3|
| 2| 4|
| a| b|
| b| c|
| c| d|
+---+---+
For visualisation run
plotData_1 = data.select("src", "dst").rdd.collect()
plotData_2 = np.array(plotData_1)
plotData_3 = []
for row in plotData_2:
    plotData_3.append((row[0], row[1]))

G = nx.DiGraph(directed=True)
G.add_edges_from(plotData_3)

options = {
    'node_color': 'orange',
    'node_size': 500,
    'width': 2,
    'arrowstyle': '-|>',
    'arrowsize': 20,
}
nx.draw(G, arrows=True, **options, with_labels=True)
With this message aggregation algorithm you find the paths you are looking for. If you set the flag show_steps to True, the result of each step is shown, which helps with understanding.
# if flag is true, print results within the loop for debugging
show_steps = False
# max iterations of the loop, should be larger than the longest expected path
max_iter = 10

# create vertices from the edge data set
vertices = (data.select("src").union(data.select("dst")).distinct().withColumnRenamed('src', 'id'))
edges = data

# create graph to get in and out degrees
gx = GraphFrame(vertices, edges)

# calculate in and out degrees of each node
inDegrees = gx.inDegrees
outDegrees = gx.outDegrees

if(show_steps == True):
    print("in and out degrees")
    inDegrees.show()
    outDegrees.show()

# create initial vertices
init_vertices = (vertices
    # join out degrees on vertices
    .join(outDegrees, on="id", how="left")
    # join in degrees on vertices
    .join(inDegrees, on="id", how="left")
    # define root, children in the middle and leaves of the path in order to distinguish full paths later on
    .withColumn("nodeType", f.when(f.col("inDegree").isNull(), "root").otherwise(f.when(f.col("outDegree").isNull(), "leaf").otherwise("child")))
    # define message with all information [array(id) and array(nodeType)] to be sent to the next node
    .withColumn("message", f.array_union(f.array(f.array(f.col("id"))), f.array(f.array(f.col("nodeType")))))
    # remove columns that are not used anymore
    .drop("inDegree", "outDegree")
)

if(show_steps == True):
    print("init vertices")
    init_vertices.show()

# update graph object with init vertices
gx = GraphFrame(init_vertices, edges)

# define empty dataframe to append found paths on
results = spark.createDataFrame(
    sc.emptyRDD(),
    StructType([StructField("paths", ArrayType(StringType()), True)])
)

# start loop for message aggregation. Set a max_iter value which has to be larger than the longest path expected
for iter_ in range(max_iter):
    if(show_steps == True):
        print("iteration step=" + str(iter_))
        print("##################################################")

    # define the message that should be sent. Here we send a message to the source node:
    # we take the message column from the destination node and send it backward
    msgToSrc = AM.dst["message"]
    agg = gx.aggregateMessages(
        f.collect_set(AM.msg).alias("aggMess"),  # aggregation function is a collect into an array (attention!! this can be an expensive operation in terms of shuffle)
        sendToSrc=msgToSrc,
        sendToDst=None
    )

    if(show_steps == True):
        print("aggregated message")
        agg.show(truncate=False)

    # stop loop if no more agg messages collected
    if(len(agg.take(1)) == 0):
        print("All paths found in " + str(iter_) + " iterations")
        break

    # get new vertices to send into the next round. Here we have to prepare the next message columns;
    # all _column_names are temporary columns for calculation purposes only
    vertices_update = (agg
        # join initial data to aggregation in order to have the nodeType of the vertex
        .join(init_vertices, on="id", how="left")
        # explode the nested array with the path and the nodeType
        .withColumn("_explode_to_flatten_array", f.explode(f.col("aggMess")))
        # put the path array into a separate column
        .withColumn("_dataMsg", f.col("_explode_to_flatten_array")[0])
        # put the node type into a separate column
        .withColumn("_typeMsg", f.col("_explode_to_flatten_array")[1][0])
        # decide if a path is complete. A path is complete if the vertex type is a root and the message type is a leaf
        .withColumn("pathComplete", f.when(((f.col("nodeType") == "root") & (f.col("_typeMsg") == "leaf")), True).otherwise(False))
        # append the current vertex id to the path array that is sent forward
        .withColumn("_message", f.array_union(f.array(f.col("id")), f.col("_dataMsg")))
        # merge together the path array and the nodeType array for the new message object
        .withColumn("message", f.array_union(f.array(f.col("_message")), f.array(f.array(f.col("_typeMsg")))))
    )

    if(show_steps == True):
        print("new vertices with all temp columns")
        vertices_update.show()

    # add complete paths to the result dataframe
    results = (
        results
        .union(
            vertices_update
            .where(f.col("pathComplete") == True)
            .select(f.col("_message"))
        )
    )

    # cache the vertices for the next iteration and only push forward the two relevant columns
    # in order to reduce data shuffling between Spark executors
    cachedNewVertices = AM.getCachedDataFrame(vertices_update.select("id", "message"))
    # create new updated graph object for the next iteration
    gx = GraphFrame(cachedNewVertices, gx.edges)

print("##################################################")
print("Collecting result set")
results.show()
It then shows the correct results:
All paths found in 3 iterations
##################################################
Collecting result set
+------------+
| paths|
+------------+
| [0, 1, 5]|
|[0, 1, 2, 3]|
|[0, 1, 2, 4]|
|[a, b, c, d]|
+------------+
To get your final dataframe you can join it back, or take the first and last elements of the array into separate columns:
result2 = (results
    .withColumn("dist", f.element_at(f.col("paths"), 1))
    .withColumn("node", f.element_at(f.col("paths"), -1))
)
result2.show()
+------------+----+----+
| paths|dist|node|
+------------+----+----+
| [0, 1, 5]| 0| 5|
|[0, 1, 2, 3]| 0| 3|
|[0, 1, 2, 4]| 0| 4|
|[a, b, c, d]| a| d|
+------------+----+----+
You can write the same algorithm with the Graphframes Pregel API I suppose.
P.S.: The algorithm in this form might cause problems if the graph has loops or backward-directed edges. I had another algorithm to first clean up loops and cycles.

Identify count of datatypes in a column which has multiple datatypes

When I read an Excel file, it has a column like this:
Col1
----
aaa
123
true
235
321
23.23
xxx
I need to identify how many datatypes we have in this column. When the data is big, the processing time is also long. Any options in pyspark?
Regards,
Ash
Spark doesn't have a built-in function that returns a value's data type, so implement a UDF that returns the data type. You can extend the function defined here for other data types such as long; using a regexp is also an option.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

def get_data_type(val):
    data_type = None
    try:
        float(val)
        data_type = "float"
    except ValueError:
        if (data_type != None and val.isnumeric()):
            data_type = 'int'
        else:
            if (val.lower() in ("yes", "no", "true", "false")):
                data_type = 'boolean'
            else:
                data_type = "string"
    else:
        if float(val).is_integer():
            data_type = "int"
    return data_type

get_data_type_udf = F.udf(get_data_type, T.StringType())

df = spark.createDataFrame(['aaa', '123', 'true', '235', '321', '23.23'], T.StringType()).toDF("col1")
df = df.select(get_data_type_udf(F.col("col1")).alias("data_type")).groupBy("data_type").count()
df.show()
which results in:
+---------+-----+
|data_type|count|
+---------+-----+
| int| 3|
| boolean| 1|
| string| 1|
| float| 1|
+---------+-----+
You'll have to load the data first into a string column, and then you can play a bit by applying some logic to identify the types. Here's an example of how you could start by differentiating text from numeric values. I guess it should be easy to tell whether a number is a float or an int, etc.
df = spark.createDataFrame([
    (1, "a"),
    (2, "123"),
    (3, "22.12"),
    (4, "c"),
    (5, "True")
], ("ID", "mixed_data"))

from pyspark.sql import functions as F

casted = df.select("ID", "mixed_data", F.when(F.col("mixed_data").cast('float').isNull(), "text").otherwise("some kind of number").alias("guessed_type"))
casted.show()
+---+----------+-------------------+
| ID|mixed_data| guessed_type|
+---+----------+-------------------+
| 1| a| text|
| 2| 123|some kind of number|
| 3| 22.12|some kind of number|
| 4| c| text|
| 5| True| text|
+---+----------+-------------------+
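Picking up that hint, here is a sketch (my own extension, not part of the answer above) of how the casting idea could be pushed further to tell boolean, int and float values apart:
from pyspark.sql import functions as F

num = F.col("mixed_data").cast("float")

guessed = df.select(
    "ID", "mixed_data",
    F.when(F.lower("mixed_data").isin("true", "false"), "boolean")
     .when(num.isNull(), "text")            # not castable to a number at all
     .when(num == F.floor(num), "int")      # numeric with no fractional part
     .otherwise("float")
     .alias("guessed_type")
)
guessed.groupBy("guessed_type").count().show()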

UnionAll for dataframes with different columns from list in spark scala [duplicate]

I have 2 DataFrames:
I need a union like this:
The unionAll function doesn't work because the number and the names of the columns are different.
How can I do this?
In Scala you just have to append all missing columns as nulls.
import org.apache.spark.sql.functions._

// let df1 and df2 be the DataFrames to merge
val df1 = sc.parallelize(List(
  (50, 2),
  (34, 4)
)).toDF("age", "children")

val df2 = sc.parallelize(List(
  (26, true, 60000.00),
  (32, false, 35000.00)
)).toDF("age", "education", "income")

val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}

df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*)).show()
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 50| 2| null| null|
| 34| 4| null| null|
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
Update
Both temporary DataFrames will have the same order of columns, because we are mapping through total in both cases.
df1.select(expr(cols1, total):_*).show()
df2.select(expr(cols2, total):_*).show()
+---+--------+---------+------+
|age|children|education|income|
+---+--------+---------+------+
| 50| 2| null| null|
| 34| 4| null| null|
+---+--------+---------+------+
+---+--------+---------+-------+
|age|children|education| income|
+---+--------+---------+-------+
| 26| null| true|60000.0|
| 32| null| false|35000.0|
+---+--------+---------+-------+
Spark 3.1+
df = df1.unionByName(df2, allowMissingColumns=True)
Test results:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1 = [
    (1, '2016-08-29', 1, 2, 3),
    (2, '2016-08-29', 1, 2, 3),
    (3, '2016-08-29', 1, 2, 3)]
df1 = spark.createDataFrame(data1, ['code', 'date', 'A', 'B', 'C'])

data2 = [
    (5, '2016-08-29', 1, 2, 3, 4),
    (6, '2016-08-29', 1, 2, 3, 4),
    (7, '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code', 'date', 'B', 'C', 'D', 'E'])
df = df1.unionByName(df2, allowMissingColumns=True)
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Spark 2.3+
diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
        .unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
Test results:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
data1 = [
    (1, '2016-08-29', 1, 2, 3),
    (2, '2016-08-29', 1, 2, 3),
    (3, '2016-08-29', 1, 2, 3)]
df1 = spark.createDataFrame(data1, ['code', 'date', 'A', 'B', 'C'])

data2 = [
    (5, '2016-08-29', 1, 2, 3, 4),
    (6, '2016-08-29', 1, 2, 3, 4),
    (7, '2016-08-29', 1, 2, 3, 4)]
df2 = spark.createDataFrame(data2, ['code', 'date', 'B', 'C', 'D', 'E'])

diff1 = [c for c in df2.columns if c not in df1.columns]
diff2 = [c for c in df1.columns if c not in df2.columns]
df = df1.select('*', *[F.lit(None).alias(c) for c in diff1]) \
        .unionByName(df2.select('*', *[F.lit(None).alias(c) for c in diff2]))
df.show()
# +----+----------+----+---+---+----+----+
# |code| date| A| B| C| D| E|
# +----+----------+----+---+---+----+----+
# | 1|2016-08-29| 1| 2| 3|null|null|
# | 2|2016-08-29| 1| 2| 3|null|null|
# | 3|2016-08-29| 1| 2| 3|null|null|
# | 5|2016-08-29|null| 1| 2| 3| 4|
# | 6|2016-08-29|null| 1| 2| 3| 4|
# | 7|2016-08-29|null| 1| 2| 3| 4|
# +----+----------+----+---+---+----+----+
Here is my Python version:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
Here is sample usage:
data = [
    Row(zip_code=58542, dma='MIN'),
    Row(zip_code=58701, dma='MIN'),
    Row(zip_code=57632, dma='MIN'),
    Row(zip_code=58734, dma='MIN')
]
firstDF = spark.createDataFrame(data)

data = [
    Row(zip_code='534', name='MIN'),
    Row(zip_code='353', name='MIN'),
    Row(zip_code='134', name='MIN'),
    Row(zip_code='245', name='MIN')
]
secondDF = spark.createDataFrame(data)

customUnion(firstDF, secondDF).show()
Here is the code for Python 3.0 using pyspark:
from pyspark.sql.functions import lit


def __order_df_and_add_missing_cols(df, columns_order_list, df_missing_fields):
    """ return ordered dataFrame by the columns order list with null in missing columns """
    if not df_missing_fields:  # no missing fields for the df
        return df.select(columns_order_list)
    else:
        columns = []
        for colName in columns_order_list:
            if colName not in df_missing_fields:
                columns.append(colName)
            else:
                columns.append(lit(None).alias(colName))
        return df.select(columns)


def __add_missing_columns(df, missing_column_names):
    """ Add missing columns as null in the end of the columns list """
    list_missing_columns = []
    for col in missing_column_names:
        list_missing_columns.append(lit(None).alias(col))
    return df.select(df.schema.names + list_missing_columns)


def __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols):
    """ return union of data frames with ordered columns by left_df. """
    left_df_all_cols = __add_missing_columns(left_df, left_list_miss_cols)
    right_df_all_cols = __order_df_and_add_missing_cols(right_df, left_df_all_cols.schema.names,
                                                        right_list_miss_cols)
    return left_df_all_cols.union(right_df_all_cols)


def union_d_fs(left_df, right_df):
    """ Union between two dataFrames, if there is a gap of column fields,
     it will append all missing columns as nulls """
    # Check for None input
    if left_df is None:
        raise ValueError('left_df parameter should not be None')
    if right_df is None:
        raise ValueError('right_df parameter should not be None')
    # For data frames with equal columns and order - regular union
    if left_df.schema.names == right_df.schema.names:
        return left_df.union(right_df)
    else:  # Different columns
        # Save dataFrame columns name list as set
        left_df_col_list = set(left_df.schema.names)
        right_df_col_list = set(right_df.schema.names)
        # Diff columns between left_df and right_df
        right_list_miss_cols = list(left_df_col_list - right_df_col_list)
        left_list_miss_cols = list(right_df_col_list - left_df_col_list)
        return __order_and_union_d_fs(left_df, right_df, left_list_miss_cols, right_list_miss_cols)
A very simple way to do this - select the columns in the same order from both the dataframes and use unionAll
df1.select('code', 'date', 'A', 'B', 'C', lit(None).alias('D'), lit(None).alias('E'))\
.unionAll(df2.select('code', 'date', lit(None).alias('A'), 'B', 'C', 'D', 'E'))
Here's a pyspark solution.
It assumes that if a field in df1 is missing from df2, then you add that missing field to df2 with null values. However it also assumes that if the field exists in both dataframes, but the type or nullability of the field is different, then the two dataframes conflict and cannot be combined. In that case I raise a TypeError.
from pyspark.sql.functions import lit

def harmonize_schemas_and_combine(df_left, df_right):
    left_types = {f.name: f.dataType for f in df_left.schema}
    right_types = {f.name: f.dataType for f in df_right.schema}
    left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
    right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)

    # First go over left-unique fields
    for l_name, l_type, l_nullable in left_fields.difference(right_fields):
        if l_name in right_types:
            r_type = right_types[l_name]
            if l_type != r_type:
                raise TypeError("Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not(l_nullable)))
        df_right = df_right.withColumn(l_name, lit(None).cast(l_type))

    # Now go over right-unique fields
    for r_name, r_type, r_nullable in right_fields.difference(left_fields):
        if r_name in left_types:
            l_type = left_types[r_name]
            if r_type != l_type:
                raise TypeError("Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type))
            else:
                raise TypeError("Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not(r_nullable)))
        df_left = df_left.withColumn(r_name, lit(None).cast(r_type))

    # Make sure columns are in the same order
    df_left = df_left.select(df_right.columns)
    return df_left.union(df_right)
I somehow find most of the Python answers here a bit too clunky in their writing if you're just going with the simple lit(None) workaround (which is also the only way I know). As an alternative, this might be useful:
# df1 and df2 are assumed to be the given dataFrames from the question

# Get the lacking columns for each dataframe and set them to null in the respective dataFrame.
# First do so for df1...
for column in [column for column in df1.columns if column not in df2.columns]:
    df1 = df1.withColumn(column, lit(None))

# ... and then for df2
for column in [column for column in df2.columns if column not in df1.columns]:
    df2 = df2.withColumn(column, lit(None))
Afterwards just do the union() you wanted to do.
Caution: If your column-order differs between df1 and df2 use unionByName()!
result = df1.unionByName(df2)
Modified Alberto Bonsanto's version to preserve the original column order (the OP implied the order should match the original tables). Also, the match part caused an IntelliJ warning.
Here's my version:
def unionDifferentTables(df1: DataFrame, df2: DataFrame): DataFrame = {
  val cols1 = df1.columns.toSet
  val cols2 = df2.columns.toSet
  val total = cols1 ++ cols2 // union
  val order = df1.columns ++ df2.columns
  val sorted = total.toList.sortWith((a, b) => order.indexOf(a) < order.indexOf(b))

  def expr(myCols: Set[String], allCols: List[String]) = {
    allCols.map({
      case x if myCols.contains(x) => col(x)
      case y => lit(null).as(y)
    })
  }

  df1.select(expr(cols1, sorted): _*).unionAll(df2.select(expr(cols2, sorted): _*))
}
in pyspark:
df = df1.join(df2, ['each', 'shared', 'col'], how='full')
I had the same issue, and using a join instead of a union solved my problem.
So, for example with Python, instead of this line of code:
result = left.union(right), which will fail to execute for a different number of columns,
you should use this one:
result = left.join(right, left.columns if (len(left.columns) < len(right.columns)) else right.columns, "outer")
Note that the second argument contains the common columns between the two DataFrames. If you don't use it, the result will have duplicate columns, with one of them being null and the other not.
Hope it helps.
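One caveat worth keeping in mind (my addition, not part of the answer above): a full outer join only behaves like a union when the join keys of the two DataFrames don't overlap; rows that share the same key values are merged into a single row instead of being stacked. A small sketch with made-up left/right frames:
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
right = spark.createDataFrame([(2, "b", 10.0), (3, "c", 20.0)], ["id", "x", "y"])
left.join(right, left.columns, "outer").show()
# the (2, "b") rows from both sides collapse into a single row (2, "b", 10.0),
# whereas a union-with-nulls would keep two separate rows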
There is a more concise way to handle this issue, with a moderate sacrifice of performance.
def unionWithDifferentSchema(a: DataFrame, b: DataFrame): DataFrame = {
  sparkSession.read.json(a.toJSON.union(b.toJSON).rdd)
}
This is the function that does the trick. Calling toJSON on each dataframe makes a JSON union. This preserves the ordering and the datatypes.
The only catch is that toJSON is relatively expensive (though not by much, you'll probably see a 10-15% slowdown). However, this keeps the code clean.
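For completeness, a rough PySpark translation of the same trick (my sketch, not part of the original answer): toJSON() returns an RDD of JSON strings, which spark.read.json can consume directly.
def union_with_different_schema(a, b):
    # serialize both DataFrames to JSON and let the JSON reader infer a merged schema
    return spark.read.json(a.toJSON().union(b.toJSON()))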
My version for Java:
private static Dataset<Row> unionDatasets(Dataset<Row> one, Dataset<Row> another) {
    StructType firstSchema = one.schema();
    List<String> anotherFields = Arrays.asList(another.schema().fieldNames());
    another = balanceDataset(another, firstSchema, anotherFields);
    StructType secondSchema = another.schema();
    List<String> oneFields = Arrays.asList(one.schema().fieldNames());
    one = balanceDataset(one, secondSchema, oneFields);
    return another.unionByName(one);
}

private static Dataset<Row> balanceDataset(Dataset<Row> dataset, StructType schema, List<String> fields) {
    for (StructField e : schema.fields()) {
        if (!fields.contains(e.name())) {
            dataset = dataset
                    .withColumn(e.name(), lit(null));
            dataset = dataset.withColumn(e.name(),
                    dataset.col(e.name()).cast(Optional.ofNullable(e.dataType()).orElse(StringType)));
        }
    }
    return dataset;
}
Here's the version in Scala (also answered here), along with a PySpark version
( Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema ) -
It takes a list of dataframes to be unioned. Columns with the same name across the dataframes should have the same datatype.
def unionPro(DFList: List[DataFrame], spark: org.apache.spark.sql.SparkSession): DataFrame = {

  /**
   * This Function Accepts DataFrame with same or Different Schema/Column Order.With some or none common columns
   * Creates a Unioned DataFrame
   */

  import spark.implicits._

  val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct

  def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
    allCols.toList.map(x => x match {
      case x if myCols.contains(x) => col(x)
      case _ => lit(null).as(x)
    })
  }

  // Create EmptyDF , ignoring different Datatype in StructField and treating them same based on Name ignoring cases
  val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)

  val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)

  DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is the sample test for it -
val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
unionPro(List(aDF, bDF), spark).show
Which gives output as -
+----+----+----+-----+
|Name| ID| Sal|Deptt|
+----+----+----+-----+
| A| 1|null| null|
| B| 2|null| null|
| C|null| 1| D1|
| D|null| 2| D2|
+----+----+----+-----+
This function takes in two dataframes (df1 and df2) with different schemas and unions them.
First we need to bring them to the same schema by adding all (missing) columns from df1 to df2 and vice versa. To add a new empty column to a df we need to specify the datatype.
import pyspark.sql.functions as F

def union_different_schemas(df1, df2):
    # Get a list of all column names in both dfs
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    # Get a list of datatypes of the columns
    data_types_df1 = [i.dataType for i in df1.schema.fields]
    data_types_df2 = [i.dataType for i in df2.schema.fields]
    # We go through all columns in df1 and if they are not in df2, we add
    # them (and specify the correct datatype too)
    for col, typ in zip(columns_df1, data_types_df1):
        if col not in df2.columns:
            df2 = df2\
                .withColumn(col, F.lit(None).cast(typ))
    # Now df2 has all missing columns from df1, let's do the same for df1
    for col, typ in zip(columns_df2, data_types_df2):
        if col not in df1.columns:
            df1 = df1\
                .withColumn(col, F.lit(None).cast(typ))
    # Now df1 and df2 have the same columns, not necessarily in the same
    # order, therefore we use unionByName
    combined_df = df1\
        .unionByName(df2)
    return combined_df
PYSPARK
The Scala version from Alberto works great. However, if you want to use a for-loop or some dynamic assignment of variables you can face some problems. The solution comes with Pyspark - clean code:
from pyspark.sql.functions import *

# defining dataframes
df1 = spark.createDataFrame(
    [
        (1, 'foo', 'ok'),
        (2, 'pro', 'ok')
    ],
    ['id', 'txt', 'check']
)

df2 = spark.createDataFrame(
    [
        (3, 'yep', 13, 'mo'),
        (4, 'bro', 11, 're')
    ],
    ['id', 'txt', 'value', 'more']
)

# retrieving columns
cols1 = df1.columns
cols2 = df2.columns

# getting columns from df1 and df2
total = list(set(cols2) | set(cols1))

# defining function for adding nulls (None in case of pyspark)
def addnulls(yourDF):
    for x in total:
        if not x in yourDF.columns:
            yourDF = yourDF.withColumn(x, lit(None))
    return yourDF

df1 = addnulls(df1)
df2 = addnulls(df2)

# additional sorting for correct unionAll (it concatenates DFs by column number)
df1.select(sorted(df1.columns)).unionAll(df2.select(sorted(df2.columns))).show()
+-----+---+----+---+-----+
|check| id|more|txt|value|
+-----+---+----+---+-----+
| ok| 1|null|foo| null|
| ok| 2|null|pro| null|
| null| 3| mo|yep| 13|
| null| 4| re|bro| 11|
+-----+---+----+---+-----+
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def unionAll(*dfs, fill_by=None):
    clmns = {clm.name.lower(): (clm.dataType, clm.name) for df in dfs for clm in df.schema.fields}

    dfs = list(dfs)
    for i, df in enumerate(dfs):
        df_clmns = [clm.lower() for clm in df.columns]
        for clm, (dataType, name) in clmns.items():
            if clm not in df_clmns:
                # Add the missing column
                dfs[i] = dfs[i].withColumn(name, F.lit(fill_by).cast(dataType))
    return reduce(DataFrame.unionByName, dfs)

unionAll(df1, df2).show()
Case insensitive columns
Returns the actual column case
Supports the existing datatypes
The default fill value can be customized (see the example below)
Pass multiple dataframes at once (e.g. unionAll(df1, df2, df3, ..., df10))
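For instance (a usage sketch of the function above, using the df1/df2 defined earlier), the missing columns can be filled with a marker value instead of null; note that the value is cast to each missing column's type, so an incompatible cast still ends up as null:
unionAll(df1, df2, fill_by="missing").show()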
here's another one:
def unite(df1: DataFrame, df2: DataFrame): DataFrame = {
  val cols1 = df1.columns.toSet
  val cols2 = df2.columns.toSet
  val total = (cols1 ++ cols2).toSeq.sorted
  val expr1 = total.map(c => {
    if (cols1.contains(c)) c else "NULL as " + c
  })
  val expr2 = total.map(c => {
    if (cols2.contains(c)) c else "NULL as " + c
  })
  df1.selectExpr(expr1: _*).union(
    df2.selectExpr(expr2: _*)
  )
}
Union and outer union for Pyspark DataFrame concatenation. This works for multiple data frames with different columns.
def union_all(*dfs):
    return reduce(ps.sql.DataFrame.unionAll, dfs)

def outer_union_all(*dfs):
    all_cols = set([])
    for df in dfs:
        all_cols |= set(df.columns)
    all_cols = list(all_cols)
    print(all_cols)

    def expr(cols, all_cols):
        def append_cols(col):
            if col in cols:
                return col
            else:
                return sqlfunc.lit(None).alias(col)
        cols_ = map(append_cols, all_cols)
        return list(cols_)

    union_df = union_all(*[df.select(expr(df.columns, all_cols)) for df in dfs])
    return union_df
One more generic method to union a list of DataFrames.
def unionFrames(dfs: Seq[DataFrame]): DataFrame = {
  dfs match {
    case Nil => session.emptyDataFrame // or throw an exception?
    case x :: Nil => x
    case _ =>
      // Preserving column order from the left to the right DF's column order
      val allColumns = dfs.foldLeft(collection.mutable.ArrayBuffer.empty[String])((a, b) => a ++ b.columns).distinct

      val appendMissingColumns = (df: DataFrame) => {
        val columns = df.columns.toSet
        df.select(allColumns.map(c => if (columns.contains(c)) col(c) else lit(null).as(c)): _*)
      }

      dfs.tail.foldLeft(appendMissingColumns(dfs.head))((a, b) => a.union(appendMissingColumns(b)))
  }
}
This is my pyspark version:
from functools import reduce
from pyspark.sql.functions import lit

def concat(dfs):
    # when the dataframes to combine do not have the same order of columns
    # https://datascience.stackexchange.com/a/27231/15325
    return reduce(lambda df1, df2: df1.union(df2.select(df1.columns)), dfs)

def union_all(dfs):
    columns = reduce(lambda x, y: set(x).union(set(y)), [i.columns for i in dfs])
    for i in range(len(dfs)):
        d = dfs[i]
        for c in columns:
            if c not in d.columns:
                d = d.withColumn(c, lit(None))
        dfs[i] = d
    return concat(dfs)
Alternatively, you could use a full join.
list_of_files = ['test1.parquet', 'test2.parquet']

def merged_frames():
    if list_of_files:
        frames = [spark.read.parquet(file) for file in list_of_files]
        if frames:
            df = frames[0]
            if frames[1]:
                var = 1
                for element in range(len(frames) - 1):
                    result_df = df.join(frames[var], 'primary_key', how='full')
                    var += 1
        display(result_df)
If you are loading from files, I guess you could just use the read function with a list of files.
# file_paths is list of files with different schema
df = spark.read.option("mergeSchema", "true").json(file_paths)
The resulting dataframe will have merged columns.
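A small aside (my addition, not part of the answer above): the same pattern applies to Parquet inputs, where the mergeSchema option is what actually triggers schema merging across the files:
# file_paths is a list of parquet files with different schemas
df = spark.read.option("mergeSchema", "true").parquet(*file_paths)
df.printSchema()  # shows the union of all columns found across the files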