How to replace "None" with null in a Spark dataframe in Jupyter Notebook - pyspark

I am having difficulty replacing every instance of the string "None" in a Spark DataFrame with nulls.
My assigned task requires me to replace "None" with a Spark null.
When I tried:
data_sdf = data_sdf.na.fill("None", Seq("blank"))
it failed. Any suggestions on how I should handle this?
This is the sample Spark DataFrame I am required to work on:
+--------------------+---------+---------+---------+---------+---------+---------+---------+
| business_id| monday| tuesday|wednesday| thursday| friday| saturday| sunday|
+--------------------+---------+---------+---------+---------+---------+---------+---------+
|FYWN1wneV18bWNgQj...|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| None| None|
|He-G7vWjzVUysIKrf...| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0| 8:0-16:0| None|
|KQPW8lFf1y5BT2Mxi...| None| None| None| None| None| None| None|
+--------------------+---------+---------+---------+---------+---------+---------+---------+

I think the None values are stored as strings in your df. You can easily replace them with nulls, and, if you want, you can fill them with empty values as well:
>>> data = sc.parallelize([
... ('FYWN1wneV18bWNgQj','7:30-17:0','7:30-17:0','7:30-17:0','7:30-17:0','7:30-17:0','None','None'),
... ('He-G7vWjzVUysIKrf','9:0-20:0','9:0-20:0','9:0-20:0','9:0-20:0','9:0-16:0','8:0-16:0','None'),
... ('KQPW8lFf1y5BT2Mxi','None','None','None','None','None','None','None')
... ])
>>>
>>> cols = ['business_id','monday','tuesday','wednesday','thursday','friday','saturday','sunday']
>>>
>>> df = spark.createDataFrame(data, cols)
>>>
>>> df.show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
| business_id| monday| tuesday|wednesday| thursday| friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| None| None|
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0| None|
|KQPW8lFf1y5BT2Mxi| None| None| None| None| None| None| None|
+-----------------+---------+---------+---------+---------+---------+--------+------+
>>> df.replace('None',None).show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
| business_id| monday| tuesday|wednesday| thursday| friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| null| null|
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0| null|
|KQPW8lFf1y5BT2Mxi| null| null| null| null| null| null| null|
+-----------------+---------+---------+---------+---------+---------+--------+------+
>>> df.replace('None',None).na.fill('').show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
| business_id| monday| tuesday|wednesday| thursday| friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| | |
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0| |
|KQPW8lFf1y5BT2Mxi| | | | | | | |
+-----------------+---------+---------+---------+---------+---------+--------+------+
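If you only want to touch the weekday columns, replace also takes a subset argument; a minimal sketch, with the column names assumed from the sample above:
day_cols = ['monday','tuesday','wednesday','thursday','friday','saturday','sunday']

# replace the string 'None' with null only in the listed columns
df.replace('None', None, subset=day_cols).show()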

I don't know if there is a direct API like fillna for this.
But we can achieve it by going through the RDD:
from pyspark.sql import Row

def replace_none_with_null(r):
    # rebuild each Row, turning the string "None" into an actual null
    return Row(**{k: None if v == "None" else v for k, v in r.asDict().items()})

# data_sdf is your dataframe
new_df = data_sdf.rdd.map(lambda x: replace_none_with_null(x)).toDF()
new_df.show()
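If you prefer to stay in the DataFrame API rather than dropping to the RDD, the same idea can be written with when/otherwise per column; a sketch, assuming all columns are string-typed:
from pyspark.sql import functions as F

# map the string 'None' to null in every column, keeping the original names
new_df = data_sdf.select([
    F.when(F.col(c) == 'None', None).otherwise(F.col(c)).alias(c)
    for c in data_sdf.columns
])
new_df.show()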

Related

PySpark remove duplicates based on 2 columns

I have the following df in pyspark:
+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname| ncf| date|salary|
+---------+----------+--------+-----+----------+------+
| James| | V|36636|2021-09-03| 3000| remove
| Michael| Rose| |40288|2021-09-10| 4000|
| Robert| |Williams|42114|2021-08-03| 4000|
| Maria| Anne| Jones|39192|2021-05-13| 4000|
| Jen| Mary| Brown| |2020-09-03| -1|
| James| | Smith|36636|2021-09-03| 3000| remove
| James| | Smith|36636|2021-09-04| 3000|
+---------+----------+--------+-----+----------+------+
I need to remove the rows where ncf and date are duplicated. The resulting df should be:
+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname| ncf| date|salary|
+---------+----------+--------+-----+----------+------+
| Michael| Rose| |40288|2021-09-10| 4000|
| Robert| |Williams|42114|2021-08-03| 4000|
| Maria| Anne| Jones|39192|2021-05-13| 4000|
| Jen| Mary| Brown| |2020-09-03| -1|
| James| | Smith|36636|2021-09-04| 3000|
+---------+----------+--------+-----+----------+------+
The dropDuplicates method helps with removing duplicates within a subset of columns:
df.dropDuplicates(['ncf', 'date'])
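Note that dropDuplicates keeps one row from each duplicate group. If, as the expected output suggests, every row whose (ncf, date) pair appears more than once should be dropped, one option is an anti-join against the duplicated keys; a sketch:
from pyspark.sql import functions as F

# (ncf, date) pairs that occur more than once
dup_keys = df.groupBy('ncf', 'date').count().filter(F.col('count') > 1).select('ncf', 'date')

# keep only rows whose (ncf, date) pair is not in the duplicated set
result = df.join(dup_keys, on=['ncf', 'date'], how='left_anti')
result.show()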
You can use window functions to count whether there are two or more rows matching your conditions:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
df.withColumn('duplicated', F.count('*').over(W.partitionBy('ncf', 'date').orderBy(F.lit(1))) > 1)
# +---------+----------+--------+-----+----------+------+----------+
# |firstname|middlename|lastname| ncf| date|salary|duplicated|
# +---------+----------+--------+-----+----------+------+----------+
# | Jen| Mary| Brown| |2020-09-03| -1| false|
# | James| | V|36636|2021-09-03| 3000| true|
# | James| | Smith|36636|2021-09-03| 3000| true|
# | Michael| Rose| |40288|2021-09-10| 4000| false|
# | Robert| |Williams|42114|2021-08-03| 4000| false|
# | James| | Smith|36636|2021-09-04| 3000| false|
# | Maria| Anne| Jones|39192|2021-05-13| 4000| false|
# +---------+----------+--------+-----+----------+------+----------+
You can now use the duplicated column to filter rows as desired.
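For instance, a sketch of the final filter step that drops every row belonging to a duplicated (ncf, date) group:
deduped = (
    df.withColumn('duplicated',
                  F.count('*').over(W.partitionBy('ncf', 'date')) > 1)
      .filter(~F.col('duplicated'))   # keep only non-duplicated rows
      .drop('duplicated')
)
deduped.show()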

PySpark generating consecutive increasing index for each window

I would like to generate consecutively increasing index ids for each dataframe window, and the index start point should be customizable, say 212 in the following example.
INPUT:
+---+-------------+
| id| component|
+---+-------------+
| a|1047972020224|
| b|1047972020224|
| c|1047972020224|
| d| 670014898176|
| e| 670014898176|
| f| 146028888064|
| g| 146028888064|
+---+-------------+
EXPECTED OUTPUT:
+---+-------------+-----------------------------+
| id| component| partition_index|
+---+-------------+-----------------------------+
| a|1047972020224| 212|
| b|1047972020224| 212|
| c|1047972020224| 212|
| d| 670014898176| 213|
| e| 670014898176| 213|
| f| 146028888064| 214|
| g| 146028888064| 214|
+---+-------------+-----------------------------+
Not sure if Window.partitionBy('component').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) can be helpful in this problem. Any ideas?
You don't have any obvious partitioning here, so you can use dense_rank with an unpartitioned window and add 211 to the result. e.g.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'index',
    F.dense_rank().over(Window.orderBy(F.desc('component'))) + 211
)
df2.show()
+---+-------------+-----+
| id| component|index|
+---+-------------+-----+
| a|1047972020224| 212|
| b|1047972020224| 212|
| c|1047972020224| 212|
| d| 670014898176| 213|
| e| 670014898176| 213|
| f| 146028888064| 214|
| g| 146028888064| 214|
+---+-------------+-----+
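As a side note, the start point can be kept as a parameter and the column named partition_index to match the expected output; a sketch, with start_index as an assumed variable:
start_index = 212  # customizable starting point

df2 = df.withColumn(
    'partition_index',
    F.dense_rank().over(Window.orderBy(F.desc('component'))) + (start_index - 1)
)
Keep in mind that an unpartitioned window moves all rows to a single partition, so this approach is best suited to small or medium-sized data.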

Reading a tsv file in pyspark

I want to read a tsv file, but it has no header, so I am creating my own schema and then trying to read the TSV file. After applying the schema it shows all column values as null. Below is my code and the result.
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
schema = StructType([StructField("id_code", IntegerType()),StructField("description", StringType())])
df=spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv",schema=schema)
df.show();
+-------+-----------+
|id_code|description|
+-------+-----------+
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
+-------+-----------+
If I read it without applying any schema:
df=spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv",sep="/t")
df.show()
+-----------------+
| _c0|
+-----------------+
| 0 Not Specified |
| 1 Modem |
| 2 LAN/Wifi |
| 3 Unknown |
| 4 Mobile Carrier|
+-----------------+
It is not coming out properly. Can anyone please help me with this? My sample .tsv file has the records below:
0 Specified
1 Modemwifi
2 LAN/Wifi
3 Unknown
4 Mobile user
Add the sep option; if the file is really tab-separated, this will work:
df = spark.read.option("inferSchema","true").option("sep","\t").csv("test.tsv")
df.show()
+---+-----------+
|_c0| _c1|
+---+-----------+
| 0| Specified|
| 1| Modemwifi|
| 2| LAN/Wifi|
| 3| Unknown|
| 4|Mobile user|
+---+-----------+
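If you also want to keep the schema from the question instead of relying on inferSchema, the two can be combined; a sketch, assuming the original file path:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id_code", IntegerType()),
    StructField("description", StringType())
])

# read the tab-separated file with the explicit schema
df = spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv", schema=schema, sep="\t")
df.show()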

pyspark - Convert sparse vector obtained after one hot encoding into columns

I am using the Apache Spark ML library to handle categorical features with one hot encoding. After writing the code below I get a vector c_idx_vec as the output of the one hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert it into columns so that I get a new transformed dataframe. Take this dataset for example:
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
By default, the OneHotEncoder will drop the last category:
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Of course, this behavior can be changed:
>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
So, I wanted to know how to convert my c_idx_vec vector into a new dataframe as below:
Here is what you can do:
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>>
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
>>>
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> oe.setDropLast(False)
OneHotEncoder_49e58b281387d8dc0c6b
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
# Get c and its respective index. The one hot encoder will put those at the same index in the vector
>>> colIdx = fl.select("c","c_idx").distinct().rdd.collectAsMap()
>>> colIdx
{'c': 2.0, 'b': 1.0, 'a': 0.0}
>>>
>>> colIdx = sorted((value, "ls_" + key) for (key, value) in colIdx.items())
>>> colIdx
[(0.0, 'ls_a'), (1.0, 'ls_b'), (2.0, 'ls_c')]
>>>
>>> newCols = list(map(lambda x: x[1], colIdx))
>>> actualCol = fl.columns
>>> actualCol
['x', 'c', 'c_idx', 'c_idx_vec']
>>> allColNames = actualCol + newCols
>>> allColNames
['x', 'c', 'c_idx', 'c_idx_vec', 'ls_a', 'ls_b', 'ls_c']
>>>
>>> def extract(row):
... return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
...
>>> result = fl.rdd.map(extract).toDF(allColNames)
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|10.0|b |1.0 |(3,[1],[1.0])|0.0 |1.0 |0.0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0.0 |0.0 |1.0 |
+----+---+-----+-------------+----+----+----+
# Typecast the new columns to int
>>> for col in newCols:
... result = result.withColumn(col, result[col].cast("int"))
...
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|10.0|b |1.0 |(3,[1],[1.0])|0 |1 |0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0 |0 |1 |
+----+---+-----+-------------+----+----+----+
Hope this helps!!
Not sure it is the most efficient or simple way, but you can do it with a udf; starting from your fl dataframe:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None

ith = udf(ith_, DoubleType())
(fl.withColumn('is_a', ith("c_idx_vec", lit(0)))
.withColumn('is_b', ith("c_idx_vec", lit(1)))
.withColumn('is_c', ith("c_idx_vec", lit(2))).show())
The result is:
+----+---+-----+-------------+----+----+----+
| x| c|c_idx| c_idx_vec|is_a|is_b|is_c|
+----+---+-----+-------------+----+----+----+
| 1.0| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
| 1.5| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
|10.0| b| 1.0|(3,[1],[1.0])| 0.0| 1.0| 0.0|
| 3.2| c| 2.0|(3,[2],[1.0])| 0.0| 0.0| 1.0|
+----+---+-----+-------------+----+----+----+
i.e. exactly as requested.
HT (and +1) to this answer that provided the udf.
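As a side note, on Spark 3.0+ the udf can be avoided with pyspark.ml.functions.vector_to_array; a minimal sketch against the same fl dataframe:
from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array

# convert the ML vector to an array column, then pull out each position
(fl.withColumn('v', vector_to_array('c_idx_vec'))
   .select('x', 'c', 'c_idx',
           F.col('v')[0].alias('is_a'),
           F.col('v')[1].alias('is_b'),
           F.col('v')[2].alias('is_c'))
   .show())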
This covers the case where StringIndexer was used to generate the index number and the one-hot encoding is then generated with OneHotEncoderEstimator. The entire code, end to end, looks like this:
Generate the data and index the string values, keeping the fitted StringIndexerModel object so its labels can be used later:
>>> from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>>
>>> # need to save the indexer model object for indexing label info to be used later
>>> ss_fit = ss.fit(fd)
>>> ss_fit.labels # to be used later
['a', 'b', 'c']
>>> ff = ss_fit.transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
Do the one-hot encoding using the OneHotEncoderEstimator class, since OneHotEncoder is deprecated:
>>> oe = OneHotEncoderEstimator(inputCols=["c_idx"],outputCols=["c_idx_vec"])
>>> oe_fit = oe.fit(ff)
>>> fe = oe_fit.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Perform one-hot binary value reshaping. The one-hot values will always be 0.0 or 1.0.
>>> from pyspark.sql.types import FloatType, IntegerType
>>> from pyspark.sql.functions import lit, udf
>>> ith = udf(lambda v, i: float(v[i]), FloatType())
>>> fx = fe
>>> for sidx, oe_col in zip([ss_fit], oe.getOutputCols()):
...     # iterate over the string values and ignore the last one
...     for ii, val in list(enumerate(sidx.labels))[:-1]:
...         fx = fx.withColumn(
...             sidx.getInputCol() + '_' + val,
...             ith(oe_col, lit(ii)).astype(IntegerType())
...         )
...
>>> fx.show()
+----+---+-----+-------------+---+---+
| x| c|c_idx| c_idx_vec|c_a|c_b|
+----+---+-----+-------------+---+---+
| 1.0| a| 0.0|(2,[0],[1.0])| 1| 0|
| 1.5| a| 0.0|(2,[0],[1.0])| 1| 0|
|10.0| b| 1.0|(2,[1],[1.0])| 0| 1|
| 3.2| c| 2.0| (2,[],[])| 0| 0|
+----+---+-----+-------------+---+---+
Note that Spark, by default, removes the last category, so following that behavior the c_c column is not needed here.
I couldn't find a way to access the sparse vector through the DataFrame API, so I converted it to an RDD:
from pyspark.sql import Row
# column names
labels = ['a', 'b', 'c']
extract_f = lambda row: Row(**row.asDict(), **dict(zip(labels, row.c_idx_vec.toArray())))
fe.rdd.map(extract_f).collect()

Forward-fill missing data in PySpark not working

I have a simple dataset as shown below.
| id| name| country| languages|
|1 | Bob| USA| Spanish|
|2 | Angelina| France| null|
|3 | Carl| Brazil| null|
|4 | John| Australia| English|
|5 | Anne| Nepal| null|
I am trying to impute the null values in languages with the last non-null value, using pyspark.sql.window to create a window over certain rows, but nothing happens. The column that is supposed to have its null values filled, temp_filled_spark, remains unchanged, i.e. it is a copy of the original languages column.
import sys
from pyspark.sql import Window
from pyspark.sql.functions import last
window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
Could anybody help point out the mistake?
We can create a window that treats the entire dataframe as one partition:
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
>>> w = Window.partitionBy(F.lit(1)).orderBy(F.lit(1)).rowsBetween(-sys.maxsize, 0)
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
Hope this helps!
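As a closing note, the original attempt fails because partitionBy('name') puts every row in its own partition, so last() never sees an earlier non-null value. On recent Spark versions the single-partition fill can also be written with the built-in window boundaries instead of sys.maxsize; a minimal sketch:
from pyspark.sql import Window
from pyspark.sql import functions as F

# one unpartitioned window, ordered by id, from the first row up to the current row
w = Window.orderBy('id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1.withColumn('languages_filled', F.last('languages', ignorenulls=True).over(w)).show()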