pyspark create an array with literal values and then explode

Let's say I have a dataframe like below:
df = spark.createDataFrame([(100, 'AB', 304), (200, 'BC', 305), (300, 'CD', 306)], ['number', 'letter', 'id'])
df.show()
I want to create an array column with the literal values ["source1", "source2", "source3"], which I later want to explode:
df_arr = df.withColumn("source", array(lit("source1"), lit("source2"), lit("source3")))
This did not work. I also tried creating a numpy array to explode, which did not work either. How can I explode the dataframe?

This worked, as per werner's suggestion:
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
df2 = df.withColumn("source", F.array(lit("source1"), lit("source2"), lit("source3")))
df2.withColumn("source", F.explode("source")).show()

Related

Union list of pyspark dataframes

Let's say I have a list of pyspark dataframes: [df1, df2, ...], and I want to union them (so effectively df1.union(df2).union(df3)...). What's the best practice to achieve that?
You could use reduce and pass the union function along with the list of dataframes.
import pyspark
from functools import reduce
list_of_sdf = [df1, df2, ...]
final_sdf = reduce(pyspark.sql.dataframe.DataFrame.unionByName, list_of_sdf)
final_sdf will contain the appended data.
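Note that `unionByName` matches columns by name, while `union` matches by position. `reduce` simply folds the chosen function pairwise over the list, as this plain-Python analogue shows:

```python
from functools import reduce

# plain-Python analogue of folding union over a list of dataframes:
# reduce(f, [a, b, c]) computes f(f(a, b), c)
parts = [[1, 2], [3], [4, 5]]
combined = reduce(lambda a, b: a + b, parts)
print(combined)  # [1, 2, 3, 4, 5]
```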

How to make a new column by pairing elements of another column?

I have a big dataframe and I want to make pairs from the elements of another column.
col
['summer','book','hot']
['g','o','p']
output:
the pairs of the above rows:
new_col
['summer','book'],['summer','hot'],['hot','book']
['g','o'],['g','p'],['p','o']
Note that tuples will work instead of lists, e.g. ('summer','book').
I know in pandas I can do this:
df['col'].apply(lambda x: list(itertools.combinations(x, 2)))
but I'm not sure how to do it in pyspark.
You can use a UDF to do the same as you would in Python, then cast the output to an array of arrays of strings.
import itertools
from pyspark.sql import functions as F

combinations_udf = F.udf(
    lambda x: list(itertools.combinations(x, 2)), "array<array<string>>"
)

df = spark.createDataFrame([(['hot', 'summer', 'book'],),
                            (['g', 'o', 'p'],),
                            ], ['col1'])

df1 = df.withColumn("new_col", combinations_udf(F.col("col1")))
display(df1)

Applying a function in each row of a big PySpark dataframe?

I have a big dataframe (~30M rows) and a function f. f runs through each row, checks some logic, and feeds the outputs into a dictionary. The function needs to be performed row by row.
I tried:
dic = dict()
for row in df.rdd.collect():
    f(row, dic)
But I always hit an OOM error. I set the memory of Docker to 8GB.
How can I perform this effectively?
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType, MapType

# sample data
df = sc.parallelize([
    ['a', 'b'],
    ['c', 'd'],
    ['e', 'f']
]).toDF(('col1', 'col2'))

# add logic to create a dictionary element using rows of the dataframe
def add_to_dict(l):
    d = {}
    d[l[0]] = l[1]
    return d

add_to_dict_udf = udf(add_to_dict, MapType(StringType(), StringType()))

# struct is used to pass rows of the dataframe
df = df.withColumn("dictionary_item", add_to_dict_udf(struct([df[x] for x in df.columns])))
df.show()

# list of dictionary elements
dictionary_list = [i[0] for i in df.select('dictionary_item').collect()]
print(dictionary_list)
Output is:
[{'a': 'b'}, {'c': 'd'}, {'e': 'f'}]
By using collect you pull all the data out of the Spark executors into your driver. You really should avoid this, as it makes using Spark pointless (you could just use plain Python in that case).
What you could do:
reimplement your logic using functions already available: pyspark.sql.functions doc
if you cannot do the first because functionality is missing, you can define a User Defined Function

Add a column to an existing dataframe with random fixed values in Pyspark

I'm new to Pyspark and I'm trying to add a new column to my existing dataframe. The new column should contain only 4 fixed values (e.g. 1,2,3,4) and I'd like to randomly pick one of the values for each row.
How can I do that?
Pyspark dataframes are immutable, so you have to return a new one (e.g. you can't just assign to a column the way you can with pandas dataframes). To do what you want, use a udf:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import numpy as np

df = <original df>
udf_randint = udf(lambda: int(np.random.randint(1, 5)), IntegerType())
df_new = df.withColumn("random_num", udf_randint())
(Note that np.random.randint's upper bound is exclusive, so randint(1, 5) yields the four values 1 through 4, and the udf must wrap a callable so a fresh value is drawn for each row.)

get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of Array of String:
id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3"]
what is the best way to do this in Scala?
You can use collect_set to find the distinct values of the corresponding column after applying explode on each column to unnest the array elements in each cell. Suppose your dataframe is called df:
import org.apache.spark.sql.functions._

val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
  withColumn("feat2", explode(col("feat2"))).
  agg(collect_set("feat1").alias("distinct_feat1"),
      collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
One more solution, for Spark 2.4+:
.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
Beware: if one of the columns is null, the result will be null.
The method provided by Psidom works great; here is a function that does the same, given a DataFrame and a list of fields:
def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)