substring function returns a Column type instead of a value. Is there a way to fetch a value out of a Column type in PySpark? - pyspark

I am building a join condition in my PySpark application using the substring function. This function returns a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried expr as well, but it still returns the same Column type:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the value coming from substring with the value of another DataFrame's column. Both are of string type.
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
The expected output of the substring function should be mno based on the above definition.

Creating sample DataFrames to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Create a substring condition that reads the last three characters (a negative start position counts from the end of the string):
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
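A Column object like the one printed in the question is a lazy expression, not a value; Spark only evaluates it when the plan runs, so the expression is used inside a DataFrame operation (the join above) rather than being extracted into a Python value. As a sketch, the length-based expression from the question can also be kept by using the Column.substr method, which accepts Column arguments for the start position and length:
# Sketch of the length-based variant from the question, written with Column.substr
# so the start position can itself be a column expression.
cond2 = f.trim(df1['col2']).substr(
    f.length(f.trim(df1['col2'])) - 2,  # start 2 characters before the end
    f.lit(3)                            # take the last 3 characters
) == df2['col2']
df1.join(df2, cond2).show()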

Related

Pyspark Rename column based on column position

How do I rename the 3rd column of a DataFrame in PySpark? I want to refer to the column by its index rather than its actual name.
Here is my attempt:
df
Col1 Col2 jfdklajfklfj
A B 2
df.withColumnRenamed([3], 'Row_Count')
Since Python indexing starts at 0, you can index the df.columns list after subtracting 1:
index_of_col = 3
df.withColumnRenamed(df.columns[index_of_col-1],'Row_Count').show()
+----+----+---------+
|Col1|Col2|Row_Count|
+----+----+---------+
| A| B| 2|
+----+----+---------+
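If you'd rather not call withColumnRenamed, a hedged alternative with the same positional idea is to rebuild the full list of column names and pass it to toDF():
# Sketch: rename the column at a given (1-based) position by rebuilding the name list
index_of_col = 3
new_names = [('Row_Count' if i == index_of_col - 1 else c)
             for i, c in enumerate(df.columns)]
df.toDF(*new_names).show()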

How to select a column based on value of another in Pyspark?

I have a DataFrame where some column special_column contains values like one, two. My DataFrame also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my DataFrame, based on the processed values from special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column",
               when(col("special_column") == "one", col("one_processed"))
               .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
The easiest way in your case would be just a simple when/otherwise, like:
>>> df = spark.createDataFrame([(1, 2, "one"), (1,2,"two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to pick a column by a name stored in the data, since the execution plan would then depend on the data itself.
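If there are more than two *_processed columns, the same when/otherwise idea can be generalized by building the chained expression in a loop; the candidates list below is an assumption standing in for the real set of prefixes:
import pyspark.sql.functions as F

# Assumed list of prefixes; adjust to the actual *_processed columns.
candidates = ["one", "two"]

expr = F.lit(None)  # fallback when special_column matches none of the candidates
for name in candidates:
    expr = F.when(F.col("special_column") == name,
                  F.col(name + "_processed")).otherwise(expr)
df = df.withColumn("my_new_column", expr)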

How to convert numerical values to a categorical variable using pyspark

I have a PySpark DataFrame with a range of numerical variables.
For example, my DataFrame has a column with values from 1 to 100.
1-10 - group1 <== rows with values 1 to 10 should get group1 as the value
11-20 - group2
.
.
.
91-100 group10
How can I achieve this using a PySpark DataFrame?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. We take the integral part of Var/10 and add 1 to it, because the group numbering starts at 1 rather than 0. Finally, we need to prepend group to the value. The concatenation is done with the concat() function, but since the prepended word group is not a column, it has to be wrapped in lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
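One boundary caveat not covered by the sample data: with 1 + floor(Var/10), exact multiples of 10 move up a group (10 becomes group2 and 100 becomes group11). If the ranges in the question (1-10 -> group1, ..., 91-100 -> group10) must hold at those boundaries, a small variant is to subtract 1 before dividing:
# Sketch: shift by 1 so multiples of 10 stay in the lower group
# (10 -> group1, 100 -> group10), matching the ranges in the question.
df = df.withColumn('Var', concat(lit('group'), (1 + floor((col('Var') - 1) / 10))))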

pyspark: filtering rows by the length of the values inside a column

I have a PySpark DataFrame with a column that contains a list:
id value
1 [1,2,3]
2 [1,2]
I want to remove all rows where the length of the list in the value column is less than 3.
So I tried:
df.filter(len(df.value) >= 3)
and indeed it does not work.
How can I filter the dataframe by the length of the inside data?
size() returns the length of the array or map stored in the column.
from pyspark.sql.functions import size
myValues = [(1,[1,2,3]),(2,[1,2])]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.show()
+----+---------+
|  id|    value|
+----+---------+
|   1|  [1,2,3]|
|   2|    [1,2]|
+----+---------+
df.filter(size(df.value) >= 3).show()
+----+---------+
|  id|    value|
+----+---------+
|   1|  [1,2,3]|
+----+---------+
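The same filter can also be written as a SQL expression string, which is just an equivalent sketch of the size() call above:
# Equivalent sketch using a SQL expression string inside filter()
df.filter("size(value) >= 3").show()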

Convert distinct values in a Dataframe in Pyspark to a list

I'm trying to get the distinct values of a column in a DataFrame in PySpark and save them in a list. At the moment the list contains Row(no_children=0), but I need only the value, as I will use it in another part of my code.
So, ideally only all_values=[0,1,2,3,4]
all_values=sorted(list(df1.select('no_children').distinct().collect()))
all_values
[Row(no_children=0),
Row(no_children=1),
Row(no_children=2),
Row(no_children=3),
Row(no_children=4)]
This takes around 15 seconds to run, is that normal?
Thank you very much!
You can use collect_set from the functions module to get a column's distinct values. Here:
from pyspark.sql import functions as F
>>> df1.show()
+-----------+
|no_children|
+-----------+
| 0|
| 3|
| 2|
| 4|
| 1|
| 4|
+-----------+
>>> df1.select(F.collect_set('no_children').alias('no_children')).first()['no_children']
[0, 1, 2, 3, 4]
You could do something like this to get only the values (avoid naming the result list, which would shadow the built-in):
values = [r.no_children for r in all_values]
values
[0, 1, 2, 3, 4]
Try this:
all_values = df1.select('no_children').distinct().rdd.flatMap(list).collect()
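A caveat that applies to both answers: neither collect_set nor distinct() guarantees any ordering, so if a sorted list like [0, 1, 2, 3, 4] is actually required, sort on the driver after collecting:
# Sketch: sort on the driver, since Spark does not guarantee the order
# of distinct()/collect_set() results.
all_values = sorted(
    df1.select('no_children').distinct().rdd.flatMap(list).collect()
)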