Running a simple example -
from pyspark.sql import functions as F

dept = [("Finance",10),("Marketing",None),("Sales",30),("IT",40)]
deptColumns = ["dept_name","dept_id"]
rdd = sc.parallelize(dept)
df = rdd.toDF(deptColumns)
df.show(truncate=False)
print('count the dept_id, should be 3')
print('count: ' + str(df.select(F.col("dept_id")).count()))
We get the following output -
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|null |
|Sales |30 |
|IT |40 |
+---------+-------+
count the dept_id, should be 3
count: 4
I'm running on Databricks and this is my stack:
Spark 3.0.1, Scala 2.12, DBR 7.3 LTS
Thanks for any help!!
There is a subtle difference between the count function of the DataFrame API and the count function of Spark SQL. The first one simply counts the rows, while the second one can ignore null values.
You are using DataFrame.count(). According to the documentation, this function
returns the number of rows in this DataFrame
So the result 4 is correct as there are 4 rows in the dataframe.
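For instance, on the example dataframe from the question, a quick check confirms this:
# counts every row, regardless of the null in dept_id
print(df.count())  # 4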
If null values should be ignored, you can use the Spark SQL function count which can ignore null values:
count(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.
For example
df.selectExpr("count(dept_id)").show()
returns 3.
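The same Spark SQL count can also be used through a temporary view; this is just a small sketch over the example dataframe:
df.createOrReplaceTempView("dept")
# count(dept_id) skips the row where dept_id is null
spark.sql("SELECT count(dept_id) AS cnt FROM dept").show()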
Another alternative to werner's solution is using pyspark.sql.functions:
from pyspark.sql import functions as F
print('count: ' + str(df.select(F.count(F.col("dept_id"))).collect()))
Your code is not complete. Maybe you can just add an isNotNull() filter before count().
My code looks like this:
from pyspark.sql.functions import col, count
print('count the dept_id, should be 3')
print('count: ' + str(df.filter(col("dept_id").isNotNull()).count()))
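An equivalent variant (just a sketch) drops the null rows first and then counts:
# na.drop removes rows where dept_id is null before counting
print('count: ' + str(df.na.drop(subset=["dept_id"]).count()))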
I am new to PySpark and I have the below PySpark dataframe:
TIMESTAMP,TYPE,CLASS,OBJECT,INSTANCE
2022-11-22T10:47:45.8060+01:00,typeA,classA,objectA,instanceA
2022-10-22T08:39:49.1900+01:00,typeB,classB,objectB,instanceB
2022-10-18T08:37:59.3850+01:00,typeC,classC,objectC,instanceC
2021-10-11T08:37:59.3850+01:00,typeD,classD,objectD,instanceD
2022-12-01T06:40:44.3850+01:00,typeD,classD,objectD,instanceD
I want to filter out the rows whose timestamp is more than 1 month old, so the expected dataframe should be:
TIMESTAMP,TYPE,CLASS,OBJECT,INSTANCE
2022-11-22T10:47:45.8060+01:00,typeA,classA,objectA,instanceA
2022-12-01T06:40:44.3850+01:00,typeD,classD,objectD,instanceD
I tried the PySpark code below, but it doesn't work and doesn't return the expected result.
df.filter(date_format(col("TIMESTAMP"), "yyyyMM") - date_format(current_date(), "yyyyMM") > abs(1))
Can any expert advise? Appreciated in advance!
You can use months_between to generate an additional column, which can then be used to filter out the unwanted rows and achieve the desired result.
Data Preparation
from io import StringIO
import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
TIMESTAMP,TYPE,CLASS,OBJECT,INSTANCE
2022-11-22T10:47:45.8060+01:00,typeA,classA,objectA,instanceA
2022-10-22T08:39:49.1900+01:00,typeB,classB,objectB,instanceB
2022-10-18T08:37:59.3850+01:00,typeC,classC,objectC,instanceC
2021-10-11T08:37:59.3850+01:00,typeD,classD,objectD,instanceD
2022-12-01T06:40:44.3850+01:00,typeD,classD,objectD,instanceD
""")
df = pd.read_csv(s,delimiter=',')
sparkDF = spark.createDataFrame(df)
sparkDF.show(truncate=False)
+------------------------------+-----+------+-------+---------+
|TIMESTAMP |TYPE |CLASS |OBJECT |INSTANCE |
+------------------------------+-----+------+-------+---------+
|2022-11-22T10:47:45.8060+01:00|typeA|classA|objectA|instanceA|
|2022-10-22T08:39:49.1900+01:00|typeB|classB|objectB|instanceB|
|2022-10-18T08:37:59.3850+01:00|typeC|classC|objectC|instanceC|
|2021-10-11T08:37:59.3850+01:00|typeD|classD|objectD|instanceD|
|2022-12-01T06:40:44.3850+01:00|typeD|classD|objectD|instanceD|
+------------------------------+-----+------+-------+---------+
Months Between
sparkDF = sparkDF.withColumn('TIMESTAMP',F.to_timestamp(F.col('TIMESTAMP'),"yyyy-MM-dd'T'HH:mm:ss.SSSS'+01:00'"))\
                 .withColumn('months_diff',F.floor(F.months_between(F.current_date(),F.col('TIMESTAMP'))))
sparkDF.filter(F.col('months_diff') < 1)\
.show(truncate=False)
+-----------------------+-----+------+-------+---------+-----------+
|TIMESTAMP |TYPE |CLASS |OBJECT |INSTANCE |months_diff|
+-----------------------+-----+------+-------+---------+-----------+
|2022-11-22 10:47:45.806|typeA|classA|objectA|instanceA|0 |
|2022-12-01 06:40:44.385|typeD|classD|objectD|instanceD|0 |
+-----------------------+-----+------+-------+---------+-----------+
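If the helper column is not needed, a more direct filter works as well; this is only a sketch, assuming TIMESTAMP has already been parsed to a timestamp as above:
# keep only rows whose timestamp falls within the last month
sparkDF.filter(F.col('TIMESTAMP') >= F.add_months(F.current_date(), -1)).show(truncate=False)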
I have a pyspark dataframe like this:
+-------+-------+--------------------+
|s_field|s_check|t_filter            |
+-------+-------+--------------------+
|MANDT  |true   |!=E                 |
|WERKS  |true   |0010_0020_0021_00...|
+-------+-------+--------------------+
And as a first step, I split t_filter based on _ with f.split(f.col("t_filter"), "_")
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.show(truncate=False)
+-------+-------+--------------------+-------------------------+
|s_field|s_check|t_filter            |t_filter_1               |
+-------+-------+--------------------+-------------------------+
|MANDT  |true   |!=E                 |[!=E]                    |
|WERKS  |true   |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|
+-------+-------+--------------------+-------------------------+
What I want to achieve is to create a new column, using s_field and t_filter as the input while doing a logic check for !=.
ultimate aim
+------------------------------+
|t_filter_2                    |
+------------------------------+
|MANDT != 'E'                  |
|WERKS in ('0010', '0020', ...)|
+------------------------------+
I have tried using withColumn but I keep getting an error saying col must be Column.
I am also not sure what the proper approach should be in order to achieve this.
Note: there is a large number of rows, around 10k. I understand that using a UDF would be quite expensive, so I'm interested to know if there are other ways this can be done.
You can achieve this using withColumn with conditional evaluation via the when and otherwise functions. Following your example, the logic is: if t_filter contains !=, concatenate s_field and t_filter; otherwise, first convert the t_filter_1 array to a string with , as the separator, then concatenate it with s_field along with literals for in and ().
from pyspark.sql import functions as f
filters.withColumn(
"t_filter_2",
f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
),
)
Output
+-------+-------+--------------------+-------------------------+---------------------------------------+
|s_check|s_field|t_filter |t_filter_1 |t_filter_2 |
+-------+-------+--------------------+-------------------------+---------------------------------------+
|true |MANDT |!='E' |[!='E'] |MANDT!='E' |
|true |WERKS |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|WERKS in ('0010','0020','0021','00...')|
+-------+-------+--------------------+-------------------------+---------------------------------------+
Complete Working Example
from pyspark.sql import functions as f
filters_data = [
{"s_field": "MANDT", "s_check": True, "t_filter": "!='E'"},
{"s_field": "WERKS", "s_check": True, "t_filter": "0010_0020_0021_00..."},
]
filters = spark.createDataFrame(filters_data)
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.withColumn(
"t_filter_2",
f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
),
).show(200, False)
I'm running the following code on databricks:
from pyspark.sql.functions import col, monotonically_increasing_id

dataToShow = jDataJoined.\
withColumn('id', monotonically_increasing_id()).\
filter(
(jDataJoined.containerNumber == 'SUDU8108536')).\
select(col('id'), col('returnTemperature'), col('supplyTemperature'))
This will give me tabular data with the id, returnTemperature and supplyTemperature columns.
Now I would like to display a line graph with this returnTemperature and supplyTemperature as categories.
As far as I understand, the display method in Databricks expects the category as the second argument, so basically what I should have is something like:
id - temperatureCategory - value
1 - returnTemperature - 25.0
1 - supplyTemperature - 27.0
2 - returnTemperature - 24.0
2 - supplyTemperature - 28.0
How can I transform the dataframe in this way?
I don't know if your format is what the display method is expecting, but you can do this transformation with the sql functions create_map and explode:
#creates a example df
from pyspark.sql import functions as F
l1 = [(1,25.0,27.0),(2,24.0,28.0)]
df = spark.createDataFrame(l1,['id','returnTemperature','supplyTemperature'])
#creates a map column which contains the values of the returnTemperature and supplyTemperature
df = df.withColumn('mapCol', F.create_map(
F.lit('returnTemperature'),df.returnTemperature
,F.lit('supplyTemperature'),df.supplyTemperature
)
)
#The explode function creates a new row for each element of the map
df = df.select('id',F.explode(df.mapCol).alias('temperatureCategory','value'))
df.show()
Output:
+---+-------------------+-----+
| id|temperatureCategory|value|
+---+-------------------+-----+
| 1 | returnTemperature| 25.0|
| 1 | supplyTemperature| 27.0|
| 2 | returnTemperature| 24.0|
| 2 | supplyTemperature| 28.0|
+---+-------------------+-----+
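If you prefer to avoid building the map column, Spark SQL's stack generator produces the same long format; this is a sketch, assuming both temperature columns share a compatible type:
# rebuild the wide example df (the df above was already exploded) and unpivot it with stack
df2 = spark.createDataFrame(l1, ['id', 'returnTemperature', 'supplyTemperature'])
df2.selectExpr(
    "id",
    "stack(2, 'returnTemperature', returnTemperature, 'supplyTemperature', supplyTemperature) as (temperatureCategory, value)"
).show()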
I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
F.regexp_extract($"value", r, 1).as("id"),
F.regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you the required result.
df.select(
regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful
I have a DataFrame like below with three columns:
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|     Non Hf|24-SEP-2017|
|  1|     Non Hf|23-SEP-2017|
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|     Non Hf|24-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
I want to group this data frame by id, sort each group by the in_date column, and keep only the rows from the first occurrence of Hf onwards. The output will be like below; this means the first 2 rows will be dropped for id = 1 and the first row will be dropped for id = 2.
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
How can I achieve this in Spark with Scala?
Steps:
1) Create the WindowSpec, partitioned by id and ordered by date.
2) Create a cumulative sum as an indicator of whether Hf has appeared, and then filter based on that condition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("id").orderBy(to_date($"in_date", "dd-MMM-yyyy"))
(df.withColumn("rn", sum(when($"visit_class" === "Hf", 1).otherwise(0)).over(w))
.filter($"rn" >= 1).drop("rn").show)
+---+-----------+-----------+
| id|visit_class| in_date|
+---+-----------+-----------+
| 1| Hf|27-SEP-2017|
| 1| Non Hf|28-SEP-2017|
| 2| Hf|25-SEP-2017|
+---+-----------+-----------+
This uses Spark 2.2.0; to_date with a format argument is a new function in 2.2.0.
If you are using spark < 2.2.0, you can use unix_timestamp in place of to_date:
val w = Window.partitionBy("id").orderBy(unix_timestamp($"in_date", "dd-MMM-yyyy"))