Removing Null records in pyspark

I have a Spark dataframe like the one below:
Id value
1 \N
2 \N
3 a
4 b
5 \N
I want to remove the \N records, which are null, from the df. How can I do this?

A simple filter should work:
data_sdf.filter(data_sdf.value != r'\N').show()
# +---+-----+
# | id|value|
# +---+-----+
# | 3| a|
# | 4| b|
# +---+-----+
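If some rows hold genuine SQL NULLs instead of the literal \N string, the comparison above already drops them too (it evaluates to NULL, which filter treats as false), but you can be explicit about both cases. A minimal sketch, assuming the same data_sdf:
data_sdf.filter(data_sdf.value.isNotNull() & (data_sdf.value != r'\N')).show()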

How to add a list of data into a dataframe in pyspark?

There is a dataframe with two columns, one is key and the other is value, like below:
+---+-----+
|key|value|
+---+-----+
|  a|abcde|
+---+-----+
I want to slice the value into multiple values with their positions and generate a new dataframe keyed by the same key, like below:
+---+------+
|key| value|
+---+------+
|  a|[a, 0]|
|  a|[b, 1]|
|  a|[c, 2]|
|  a|[d, 3]|
|  a|[e, 4]|
+---+------+
I have tried to use join() and StructType() but failed. Is there any possible method to do that? Thanks!
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("a", "abcd", 3000)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"])
df.select(f.expr("posexplode(filter(split(Department, ''), x -> x != '')) as (value, index)"), f.col("Name")).show()
+-----+-----+----+
|value|index|Name|
+-----+-----+----+
| 0| a| a|
| 1| b| a|
| 2| c| a|
| 3| d| a|
+-----+-----+----+
# To explain the above functions more clearly:
f.expr( "..." )                      # --> write a SQL expression
"posexplode( [..] )"                 # turn an array into rows, each paired with its index
"filter( [..] , x -> x != '' )"      # drop the empty-string elements from an array
"split( Department, '' )"            # split the column on the empty string (i.e. into single characters), which leaves an empty element in the array that we then need to filter out
Here's an update to fit your exact request; a little more manipulation puts it into the required format:
df.select(
    f.expr("posexplode(filter(split(Department, ''), x -> x != '')) as (myvalue, index)"),
    f.col("Name"),
    f.expr("array(myvalue, index) as value"),
).drop("index", "myvalue").show()
+----+------+
|Name| value|
+----+------+
| a|[0, a]|
| a|[1, b]|
| a|[2, c]|
| a|[3, d]|
+----+------+
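Note that posexplode yields the position first, so here myvalue holds the position and index holds the character, which is why the arrays come out as [0, a] rather than the [a, 0] shown in the question. A small tweak of the same expression swaps the order (a sketch, assuming the df above):
df.select(
    f.expr("posexplode(filter(split(Department, ''), x -> x != '')) as (pos, chr)"),
    f.col("Name"),
).select("Name", f.expr("array(chr, pos) as value")).show()
# should give [a, 0], [b, 1], ... per Name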
The following snippet transforms your data into your specified format:
import pyspark.sql.functions as F
df = spark.createDataFrame([("a", "abcde",)], ["key", "value"])
df_split = df.withColumn("split", F.array_remove(F.split("value", ""), ""))
df_split.show()
df_exploded = df_split.select("key", F.posexplode("split"))
df_exploded.show()
df_array = df_exploded.select("key", F.array("col", "pos").alias("value"))
df_array.show()
Output:
+---+-----+---------------+
|key|value| split|
+---+-----+---------------+
| a|abcde|[a, b, c, d, e]|
+---+-----+---------------+
+---+---+---+
|key|pos|col|
+---+---+---+
| a| 0| a|
| a| 1| b|
| a| 2| c|
| a| 3| d|
| a| 4| e|
+---+---+---+
+---+------+
|key| value|
+---+------+
| a|[a, 0]|
| a|[b, 1]|
| a|[c, 2]|
| a|[d, 3]|
| a|[e, 4]|
+---+------+
First, the string is split into an array, using the empty string as the split pattern; this leaves a trailing empty element, which array_remove drops.
Then posexplode turns each element of the split array into its own row, with its position in the array in the column pos.
Lastly, the two columns are combined into an array.
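The three steps can also be chained into a single expression (a sketch, reusing the F and df defined above, equivalent to the snippet shown):
result = (
    df.select("key", F.posexplode(F.array_remove(F.split("value", ""), "")))
      .select("key", F.array("col", "pos").alias("value"))
)
result.show()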

How to truncate the values of a column of a spark dataframe? [duplicate]

This question already has answers here:
remove last few characters in PySpark dataframe column
(5 answers)
Closed 3 years ago.
I would like to remove the last two characters of each string in a single column of a Spark dataframe. I would like to do this within the Spark dataframe, not by moving it to pandas and back.
An example dataframe would be below,
# +----+-------+
# | age| name|
# +----+-------+
# | 350|Michael|
# | 290| Andy|
# | 123| Justin|
# +----+-------+
where the dtype of the age column is a string.
# +----+-------+
# | age| name|
# +----+-------+
# | 3|Michael|
# | 2| Andy|
# | 1| Justin|
# +----+-------+
This is the expected output. The last two characters of the string have been removed.
The Scala/Spark SQL way of doing this is very simple:
val result = originalDF.withColumn("age", substring(col("age"), 0, 1))
result.show
You can probably derive the PySpark syntax from this.
substring, length, col, and expr from pyspark.sql.functions can be used for this purpose.
from pyspark.sql.functions import substring, length, col, expr
df = ...  # your dataframe here
A start of 1 and a length of length(age)-2 are used here since the ages are 3 digits and, logically, a person won't live more than 100 years :-) You can adapt the substring call to your own requirement.
df = df.withColumn("age", expr("substring(age, 1, length(age)-2)"))
df.show()
Result :
+----+-------+
| age| name|
+----+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+----+-------+
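As an alternative to the SQL expression, the same truncation can be written with the Column API (a minimal sketch; lit is an extra import not used in the snippet above):
from pyspark.sql.functions import lit
df = df.withColumn("age", col("age").substr(lit(1), length(col("age")) - 2))
df.show()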
Scala answer :
val originalDF = Seq(
(350, "Michael"),
(290, "Andy"),
(123, "Justin")
).toDF("age", "name")
println(" originalDF " )
originalDF.show
println("modified")
originalDF.selectExpr("substring(age,0,1) as age", "name " ).show
Result :
originalDF
+---+-------+
|age| name|
+---+-------+
|350|Michael|
|290| Andy|
|123| Justin|
+---+-------+
modified
+---+-------+
|age| name|
+---+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+---+-------+

pyspark: counting number of occurrences of each distinct values

I think the question is related to: Spark DataFrame: count distinct values of every column
So basically I have a Spark dataframe with column A having the values 1, 1, 2, 2, 1.
I want to count how many times each distinct value (in this case, 1 and 2) appears in column A, and print something like
distinct_values | number_of_appearance
1               | 3
2               | 2
I am just posting this as I think the other answer with the alias could be confusing. What you need are the groupBy and count methods:
from pyspark.sql.types import IntegerType

l = [1, 1, 2, 2, 1]
df = spark.createDataFrame(l, IntegerType())
df.groupBy('value').count().show()
+-----+-----+
|value|count|
+-----+-----+
| 1| 3|
| 2| 2|
+-----+-----+
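If you also want the most frequent values first, sort by the count column (a small addition to the snippet above):
df.groupBy('value').count().orderBy('count', ascending=False).show()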
I am not sure if you are looking for the solution below; here are my thoughts on this. Suppose you have a dataframe like this:
>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])
>>> df.show();
+---+----+-------+
| id|name|country|
+---+----+-------+
| 1| AAA| USA|
| 2| XXX| CHN|
| 3| KKK| USA|
| 4| PPP| USA|
| 5| EEE| USA|
| 5| HHH| THA|
+---+----+-------+
I want to know how many times each distinct country code appears in this particular dataframe, with the result printed under alias column names.
import pyspark.sql.functions as func
df.groupBy('country').count().select(func.col("country").alias("distinct_country"),func.col("count").alias("country_count")).show()
+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
| THA| 1|
| USA| 4|
| CHN| 1|
+----------------+-------------+
Were you looking for something similar to this?
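If you then need the counts back on the driver as a plain Python dict, a small usage sketch (assuming the same df and grouping as above):
counts = {row["country"]: row["count"]
          for row in df.groupBy("country").count().collect()}
print(counts)  # e.g. {'THA': 1, 'USA': 4, 'CHN': 1}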

Parse Dataframe and store output in a single file [duplicate]

This question already has an answer here:
Spark split a column value into multiple rows
(1 answer)
Closed 4 years ago.
I have a data frame using Spark SQL in Scala with columns A and B with values:
A | B
1 a|b|c
2 b|d
3 d|e|f
I need to store the output to a single text file in the following format:
1 a
1 b
1 c
2 b
2 d
3 d
3 e
3 f
How can I do that?
You can get the desired DataFrame with an explode and a split:
val resultDF = df.withColumn("B", explode(split($"B", "\\|")))
Result
+---+---+
| A| B|
+---+---+
| 1| a|
| 1| b|
| 1| c|
| 2| b|
| 2| d|
| 3| d|
| 3| e|
| 3| f|
+---+---+
Then you can save it to a single file with coalesce(1). Writing the RDD of Rows directly would produce lines like [1,a], so map each row to a space-separated string first:
resultDF.coalesce(1).rdd.map(_.mkString(" ")).saveAsTextFile("desiredPath")
You can do something like,
val df = ???
val resDF = df.withColumn("B", explode(split(col("B"), "\\|")))
resDF.coalesce(1).write.option("delimiter", " ").csv("path/to/file")

Exploding pipe separated data in spark

I have a Spark dataframe (input_dataframe); the data in this dataframe looks like below:
id value
1 a
2 x|y|z
3 t|u
I want an output_dataframe with the pipe-separated fields exploded, and it should look like below:
id value
1 a
2 x
2 y
2 z
3 t
3 u
Please help me achieve this using PySpark. Any help will be appreciated.
We can first split and then explode the value column using functions, as below:
>>> l=[(1,'a'),(2,'x|y|z'),(3,'t|u')]
>>> df = spark.createDataFrame(l,['id','val'])
>>> df.show()
+---+-----+
| id| val|
+---+-----+
| 1| a|
| 2|x|y|z|
| 3| t|u|
+---+-----+
>>> from pyspark.sql import functions as F
>>> df.select('id',F.explode(F.split(df.val,'[|]')).alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| x|
| 2| y|
| 2| z|
| 3| t|
| 3| u|
+---+-----+