How to get the length of the lists in a dataframe column in Spark? - pyspark

I have a df whose 'products' column contains lists, like below:
+----------+---------+--------------------+
|member_srl|click_day|            products|
+----------+---------+--------------------+
|        12| 20161223|  [2407, 5400021771]|
|        12| 20161226|        [7320, 2407]|
|        12| 20170104|              [2407]|
|        12| 20170106|              [2407]|
|        27| 20170104|        [2405, 2407]|
|        28| 20161212|              [2407]|
|        28| 20161213|      [2407, 100093]|
|        28| 20161215|           [1956119]|
|        28| 20161219|      [2407, 100093]|
|        28| 20161229|           [7905970]|
|       124| 20161011|        [5400021771]|
|      6963| 20160101|         [103825645]|
|      6963| 20160104|[3000014912, 6626...|
|      6963| 20160111|[99643224, 106032...|
How can I add a new column product_cnt containing the length of each products list? And how can I filter df to keep only the rows whose products list has a given length?
Thanks.

PySpark has a built-in function, size, that does exactly what you want: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size
To add it as a column, you can simply call it in your select statement.
from pyspark.sql.functions import size
countdf = df.select('*',size('products').alias('product_cnt'))
Filtering works exactly as @titiro89 described. Furthermore, you can use the size function directly in the filter. This lets you bypass adding the extra column (if you wish to do so) in the following way.
filterdf = df.filter(size('products')==given_products_length)
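Putting both together, a minimal self-contained sketch (assuming an active SparkSession named spark; the sample data and the length 2 below are just placeholders):
from pyspark.sql.functions import size

df = spark.createDataFrame(
    [(12, 20161223, [2407, 5400021771]), (12, 20161226, [7320, 2407])],
    ["member_srl", "click_day", "products"])

countdf = df.select('*', size('products').alias('product_cnt'))   # add the length column
filterdf = df.filter(size('products') == 2)                       # keep rows with 2 products
countdf.show()
filterdf.show()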

First question:
How can I add a new column product_cnt containing the length of each products list?
>>> a = [(12, 20161223, [2407, 5400021771]), (12, 20161226, [7320, 2407, 4344])]
>>> df = spark.createDataFrame(a, ["member_srl", "click_day", "products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day|          products|
+----------+---------+------------------+
|        12| 20161223|[2407, 5400021771]|
|        12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+
You can find a similar example here
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf
>>> slen = udf(lambda s: len(s), IntegerType())
>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+
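Note that the built-in size function from the previous answer gives the same result without a Python UDF; a minimal equivalent, reusing the df defined above:
>>> from pyspark.sql.functions import size
>>> df.withColumn("product_cnt", size("products")).show()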
Second question:
And how can I filter df to keep only the rows whose products list has a given length?
You can use the filter function (docs here):
>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt==givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
+----------+---------+------------------+-----------+

Related

PySpark UDF: a fit transform example

I am really new to PySpark and am trying to translate some Python code into PySpark.
I start with a pandas dataframe, convert it to a document-term matrix and then apply PCA.
The UDF:
class MultiLabelCounter():
    def __init__(self, classes=None):
        self.classes_ = classes

    def fit(self, y):
        self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        self.mapping = dict(zip(self.classes_, range(len(self.classes_))))
        return self

    def transform(self, y):
        yt = []
        for labels in y:
            data = [0] * len(self.classes_)
            for label in labels:
                data[self.mapping[label]] += 1
            yt.append(data)
        return yt

    def fit_transform(self, y):
        return self.fit(y).transform(y)

mlb = MultiLabelCounter()
df_grouped = df_grouped.withColumnRenamed("collect_list(full)", "full")
udf_mlb = udf(lambda x: mlb.fit_transform(x), IntegerType())
mlb_fitted = df_grouped.withColumn('full', udf_mlb(col("full")))
I am of course getting NULL results.
I am using Spark version 2.4.4.
EDIT
Adding sample input and output as per request
Input:
|id|val|
|--|---|
|1|[hello,world]|
|2|[goodbye, world]|
|3|[hello,hello]|
Output:
|id|hello|goodbye|world|
|--|-----|-------|-----|
|1|1|0|1|
|2|0|1|1|
|3|2|0|0|
Based upon the input data shared, I tried replicating your output and it works. Please see below.
Input Data
df = spark.createDataFrame(data=[(1, ['hello', 'world']), (2, ['goodbye', 'world']), (3, ['hello', 'hello'])], schema=['id', 'vals'])
df.show()
+---+----------------+
| id|            vals|
+---+----------------+
|  1|  [hello, world]|
|  2|[goodbye, world]|
|  3|  [hello, hello]|
+---+----------------+
Now, explode creates separate rows out of the vals list items. Thereafter, pivot and count calculate the frequency of each value per id. Finally, fillna(0) replaces the resulting null values with 0. See below:
from pyspark.sql.functions import *
df1 = df.select(['id', explode(col('vals'))]).groupBy("id").pivot("col").agg(count(col("col")))
df1.fillna(0).orderBy("id").show()
Output
+---+-------+-----+-----+
| id|goodbye|hello|world|
+---+-------+-----+-----+
|  1|      0|    1|    1|
|  2|      1|    0|    1|
|  3|      0|    2|    0|
+---+-------+-----+-----+
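If you ultimately need the counts as a single array column per id (which is what the MultiLabelCounter UDF returned), one possible follow-up sketch of my own, reusing df1 from above:
from pyspark.sql.functions import array, col

# Fix a label order, then pack the pivoted count columns into one array column.
count_cols = sorted(c for c in df1.columns if c != 'id')   # ['goodbye', 'hello', 'world']
df2 = df1.fillna(0).withColumn('counts', array(*[col(c) for c in count_cols]))
df2.orderBy('id').show()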

Fill null values in pyspark dataframe based on data type of column

Suppose, I am having a sample dataframe as below:
+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
| null|  30|null|
|mouse|null|15.3|
+-----+----+----+
I want to fill the nulls based on the data type of each column. For example, for string columns I want to fill with 'N/A', for integer columns with 0, and for float/double columns with 0.0.
I tried using df.fillna(), but then I realized there could be any number of columns, so I would like a dynamic solution.
df.dtypes gives you a list of (column_name, data_type) tuples. It can be used to get the string / int / float column names in df. Subset those columns and fillna() each group accordingly.
df = sc.parallelize([['cat', 10, 1.5], ['dog', 20, 9.0],
                     [None, 30, None], ['mouse', None, 15.3]])\
       .toDF(['col1', 'col2', 'col3'])
string_col = [item[0] for item in df.dtypes if item[1].startswith('string')]
big_int_col = [item[0] for item in df.dtypes if item[1].startswith('bigint')]
double_col = [item[0] for item in df.dtypes if item[1].startswith('double')]
df.fillna('N/A', subset=string_col)\
  .fillna(0, subset=big_int_col)\
  .fillna(0.0, subset=double_col)\
  .show()
Output:
+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
|  N/A|  30| 0.0|
|mouse|   0|15.3|
+-----+----+----+
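For a fully dynamic variant, a minimal sketch of my own (assuming the per-type defaults above are the ones you want): drive the fills from a dtype-to-default mapping, so no column names or type groups are hard-coded.
# Map Spark column types to the desired fill value; extend as needed.
defaults = {'string': 'N/A', 'int': 0, 'bigint': 0, 'float': 0.0, 'double': 0.0}

filled = df
for col_name, dtype in df.dtypes:
    if dtype in defaults:
        # fill each column with the default that matches its data type
        filled = filled.fillna(defaults[dtype], subset=[col_name])
filled.show()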

Spark select column based on row values

I have an all-string Spark dataframe and I need to return the columns in which all rows meet a certain criterion.
scala> val df = spark.read.format("csv").option("delimiter",",").option("header", "true").option("inferSchema", "true").load("file:///home/animals.csv")
df.show()
+--------+---------+--------+
|Column 1| Column 2|Column 3|
+--------+---------+--------+
|(ani)mal|   donkey|    wolf|
|  mammal|(mam)-mal|  animal|
| chi-mps|   chimps|    goat|
+--------+---------+--------+
Here the criterion is: return the columns where all row values have length == 6, ignoring special characters. The result should be the dataframe below, since all rows in Column 1 and Column 2 have length == 6:
+--------+---------+
|Column 1| Column 2|
+--------+---------+
|(ani)mal|   donkey|
|  mammal|(mam)-mal|
| chi-mps|   chimps|
+--------+---------+
You can use regexp_replace to delete the special characters (if you know which ones they are), then compute the length and filter down to the columns you want.
val cols = df.columns
val df2 = cols.foldLeft(df) {
  (df, c) => df.withColumn(c + "_len", length(regexp_replace(col(c), "[()-]", "")))
}
df2.show()
+--------+---------+-------+-----------+-----------+-----------+
| Column1|  Column2|Column3|Column1_len|Column2_len|Column3_len|
+--------+---------+-------+-----------+-----------+-----------+
|(ani)mal|   donkey|   wolf|          6|          6|          4|
|  mammal|(mam)-mal| animal|          6|          6|          6|
| chi-mps|   chimps|   goat|          6|          6|          4|
+--------+---------+-------+-----------+-----------+-----------+
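Since the rest of this page is PySpark, here is a hedged sketch of my own for the remaining selection step in Python, assuming an equivalent all-string PySpark dataframe df and that the only special characters are the parentheses and hyphens handled by the regex above:
from pyspark.sql.functions import col, length, regexp_replace, sum as sum_, when

# Count, per column, how many rows have a cleaned length different from 6.
bad_counts = df.select([
    sum_(when(length(regexp_replace(col(c), "[()-]", "")) != 6, 1).otherwise(0)).alias(c)
    for c in df.columns
]).first()

# Keep only the columns where every row passed the length check.
keep = [c for c in df.columns if bad_counts[c] == 0]
df.select(keep).show()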

transform distinct row values to different columns with corresponding rows using Pyspark

I'm new to PySpark and I'm trying to transform some data.
Given dataframe
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that gives only the first value.
There might be a better way, but one approach is: split the column on spaces to create an array of entries, then use higher-order functions (Spark 2.4+) to split each entry of that array on '='. Then explode and create two columns, one with the key and one with the value. Finally, assign a row number within each partition, group by it and pivot:
import pyspark.sql.functions as F

df1 = (df.withColumn("Col1", F.split(F.col("Col1"), "\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))

from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum", F.row_number().over(w))
            .groupBy("Rnum").pivot("cols").agg(F.first("vals"))
            .orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum|   A|   B|   C|   D|
+----+----+----+----+----+
|   1|id1a|id1b|id1c|id1d|
|   2|id2a|id2b|id2c|null|
|   3|id3a|id3b|id3c|null|
|   4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
|   A|id1a|
|   A|id2a|
|   B|id1b|
|   C|id1c|
|   B|id2b|
|   D|id1d|
|   A|id3a|
|   B|id3b|
|   C|id2c|
|   A|id4a|
|   C|id3c|
+----+----+
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this:
import pyspark.sql.functions as F
from functools import reduce  # needed for the chained join at the end
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos|   A|   B|   C|   D|
+---+----+----+----+----+
|  0|id3a|id3b|id1c|id1d|
|  1|id4a|id1b|id2c|null|
|  2|id1a|id2b|id3c|null|
|  3|id2a|null|null|null|
+---+----+----+----+----+

Pyspark forward and backward fill within column level

I am trying to fill missing data in a PySpark dataframe. The dataframe looks like this:
+---------+---------+-------------------+----+
| latitude|longitude|      timestamplast|name|
+---------+---------+-------------------+----+
|         | 4.905615|2019-08-01 00:00:00|   1|
|51.819645|         |2019-08-01 00:00:00|   1|
| 51.81964| 4.961713|2019-08-01 00:00:00|   2|
|         |         |2019-08-01 00:00:00|   3|
| 51.82918| 4.911187|                   |   3|
| 51.82385| 4.901488|2019-08-01 00:00:03|   5|
+---------+---------+-------------------+----+
Within each "name" group I want to either forward fill or backward fill (whichever is necessary) only "latitude" and "longitude" ("timestamplast" should not be filled). How do I do this?
The expected output is:
+---------+---------+-------------------+----+
| latitude|longitude|      timestamplast|name|
+---------+---------+-------------------+----+
|51.819645| 4.905615|2019-08-01 00:00:00|   1|
|51.819645| 4.905615|2019-08-01 00:00:00|   1|
| 51.81964| 4.961713|2019-08-01 00:00:00|   2|
| 51.82918| 4.911187|2019-08-01 00:00:00|   3|
| 51.82918| 4.911187|                   |   3|
| 51.82385| 4.901488|2019-08-01 00:00:03|   5|
+---------+---------+-------------------+----+
In Pandas this would be done as such:
df = df.groupby("name")['longitude','latitude'].apply(lambda x : x.ffill().bfill())
How would this be done in Pyspark?
I suggest you use the following two Window Specs:
from pyspark.sql import Window
w1 = Window.partitionBy('name').orderBy('timestamplast')
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Where:
w1 is the regular WindowSpec we use to calculate the forward fill, which is the same as the following:
w1 = Window.partitionBy('name').orderBy('timestamplast').rowsBetween(Window.unboundedPreceding,0)
see the following note from the documentation for default window frames:
Note: When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
After the ffill, we only need to fix the null values at the very front, if any exist, so we can use a fixed Window frame (between Window.unboundedPreceding and Window.unboundedFollowing). This is more efficient than using a running Window frame, since it requires only one aggregate; see SPARK-8638.
Then the x.ffill().bfill() can be handled by using coalesce + last + first based on the above two WindowSpecs:
from pyspark.sql.functions import coalesce, last, first
df.withColumn('latitude_new', coalesce(last('latitude', True).over(w1), first('latitude', True).over(w2))) \
  .select('name', 'timestamplast', 'latitude', 'latitude_new') \
  .show()
+----+-------------------+---------+------------+
|name|      timestamplast| latitude|latitude_new|
+----+-------------------+---------+------------+
|   1|2019-08-01 00:00:00|     null|   51.819645|
|   1|2019-08-01 00:00:01|     null|   51.819645|
|   1|2019-08-01 00:00:02|51.819645|   51.819645|
|   1|2019-08-01 00:00:03| 51.81964|    51.81964|
|   1|2019-08-01 00:00:04|     null|    51.81964|
|   1|2019-08-01 00:00:05|     null|    51.81964|
|   1|2019-08-01 00:00:06|     null|    51.81964|
|   1|2019-08-01 00:00:07| 51.82385|    51.82385|
+----+-------------------+---------+------------+
Edit: to process (ffill+bfill) on multiple columns, use a list comprehension:
cols = ['latitude', 'longitude']
df_new = df.select([c for c in df.columns if c not in cols] +
                   [coalesce(last(c, True).over(w1), first(c, True).over(w2)).alias(c) for c in cols])
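For reference, a minimal end-to-end sketch of the above applied to the question's sample (my own reconstruction, assuming the blank cells are real nulls and an active SparkSession named spark):
from pyspark.sql import Window
from pyspark.sql.functions import coalesce, first, last

data = [(None,      4.905615, '2019-08-01 00:00:00', 1),
        (51.819645, None,     '2019-08-01 00:00:00', 1),
        (51.81964,  4.961713, '2019-08-01 00:00:00', 2),
        (None,      None,     '2019-08-01 00:00:00', 3),
        (51.82918,  4.911187, None,                  3),
        (51.82385,  4.901488, '2019-08-01 00:00:03', 5)]
df = spark.createDataFrame(data, ['latitude', 'longitude', 'timestamplast', 'name'])

w1 = Window.partitionBy('name').orderBy('timestamplast')
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

cols = ['latitude', 'longitude']
df_new = df.select([c for c in df.columns if c not in cols] +
                   [coalesce(last(c, True).over(w1), first(c, True).over(w2)).alias(c)
                    for c in cols])
df_new.orderBy('name', 'timestamplast').show()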
I got a working solution for either forward or backward fill of a single target column, "longitude". I guess I could repeat the procedure for "latitude" and then again for the backward fill. Is there a more efficient way?
import sys
from pyspark.sql import Window
from pyspark.sql.functions import first, last

window = Window.partitionBy('name')\
               .orderBy('timestamplast')\
               .rowsBetween(-sys.maxsize, 0)  # this is for forward fill
               # .rowsBetween(0, sys.maxsize) # this is for backward fill

# define the forward-filled column
filled_column = last(df['longitude'], ignorenulls=True).over(window)   # this is for forward fill
# filled_column = first(df['longitude'], ignorenulls=True).over(window) # this is for backward fill
df = df.withColumn('mmsi_filled', filled_column)  # do the fill