How do I pass a column to the substr function in PySpark

I have two columns in a dataframe, ValueText and GLength. I need to add a new column VX based on those two columns (ValueText and GLength); basically, the new column VX is a substring of ValueText, computed using GLength. Below is what I tried:
df_stage1.withColumn("VX", df_stage1.ValueText.substr(6,df_stage1.GLength))
However, with the above code I get the error: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.
I have also tried
func.expr("substring(ValueText,5, 5 + GLength)")
When I execute the above code, I get the error: Pyspark job aborted due to stage failure

expr will work in this case, since we are referencing the GLength column inside the substring function.
Example:
df=spark.createDataFrame([("abcdff",4),("dlaldajfa",3)],["valuetext","Glength"])
df.show()
#+---------+-------+
#|valuetext|Glength|
#+---------+-------+
#|   abcdff|      4|
#|dlaldajfa|      3|
#+---------+-------+
from pyspark.sql.functions import *
df.withColumn("vx",expr("substring(valuetext,0,Glength)")).show()
#+---------+-------+----+
#|valuetext|Glength|  vx|
#+---------+-------+----+
#|   abcdff|      4|abcd|
#|dlaldajfa|      3| dla|
#+---------+-------+----+
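Applied back to the columns in the question, the same expr approach would look roughly like this (a sketch, assuming the intended start position is 6 as in the first attempt):
from pyspark.sql.functions import expr
# substring(str, pos, len): pos is 1-based, GLength is evaluated per row
df_stage1 = df_stage1.withColumn("VX", expr("substring(ValueText, 6, GLength)"))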

Related

How to write substring to get the string from the starting position to the end

I want to extract the code starting from the 25th position to the end.
I tried:
df_1.withColumn("code", f.col('index_key').substr(25, f.length(df_1.index_key))).show()
But I got the below error message:
TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.
Any suggestion will be very appreciated.
Using .substr:
Instead of passing an integer value, wrap it in lit(<int>) (which makes it a Column type) so that both arguments passed to .substr are of the same type.
Example:
df.show()
#+---------+
#|index_key|
#+---------+
#|   abcdef|
#+---------+
from pyspark.sql.functions import *
df.withColumn("code",col("index_key").substr(lit(1),length(col("index_key")))).\
show()
#+---------+------+
#|index_key|  code|
#+---------+------+
#|   abcdef|abcdef|
#+---------+------+
Another option is to use expr with the substring function.
Example:
df.withColumn("code",expr('substring(index_key, 1,length(index_key))')).show()
#+---------+------+
#|index_key|  code|
#+---------+------+
#|   abcdef|abcdef|
#+---------+------+
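Applied to the original question (extracting from the 25th position to the end of index_key), the same pattern would be a sketch like the following; substr tolerates a length that runs past the end of the string:
import pyspark.sql.functions as f
# both arguments are Column type: startPos wrapped in lit(), length computed per row
df_1.withColumn("code", f.col("index_key").substr(f.lit(25), f.length(f.col("index_key")))).show()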

Pyspark Rename column based on column position

How do I rename the 3rd column of a dataframe in PySpark? I want to refer to the column by its index rather than its actual name.
Here is my attempt:
df
Col1 Col2 jfdklajfklfj
A B 2
df.withColumnRenamed([3], 'Row_Count')
Since Python indexing starts at 0, you can index the df.columns list by subtracting 1 from the column position:
index_of_col = 3
df.withColumnRenamed(df.columns[index_of_col-1],'Row_Count').show()
+----+----+---------+
|Col1|Col2|Row_Count|
+----+----+---------+
|   A|   B|        2|
+----+----+---------+
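If you ever need to rename several columns by position at once, another option (just a sketch, not the approach above) is to rebuild the header list and pass it to toDF:
# replace the name at position 3 (0-based index 2), keep the rest unchanged
new_names = df.columns
new_names[2] = 'Row_Count'
df = df.toDF(*new_names)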

How to select a column based on value of another in Pyspark?

I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my dataframe, based on the value of special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column", when(col("special_column") == "one", col("one_processed"))
               .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
The easiest way in your case would be just a simple when/otherwise like:
>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1, 2, "two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
|            1|            2|           one|            1|
|            1|            2|           two|            2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to select a column by a name computed from the data, since the execution plan would then depend on the data itself.
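If there are more than two *_processed columns, one way to generalise the when/otherwise chain (a sketch, assuming every value appearing in special_column has a matching <value>_processed column) is to build it with functools.reduce:
from functools import reduce
import pyspark.sql.functions as F

values = ["one", "two"]  # hypothetical list of values found in special_column
# chain one when(...) branch per value, each mapping to its <value>_processed column
chained = reduce(
    lambda acc, v: acc.when(F.col("special_column") == v, F.col(v + "_processed")),
    values[1:],
    F.when(F.col("special_column") == values[0], F.col(values[0] + "_processed")),
)
df.withColumn("my_new_column", chained).show()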

substring function returns a Column type instead of a value. Is there a way to fetch a value out of a Column type in PySpark?

I am comparing a condition in a PySpark join in my application by using the substring function. This function returns a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried expr, but I still get the same Column type result:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the value coming from substring to the value of another dataframe column; both are of string type.
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
The expected output of the substring function should be 'mno' based on the above definition.
Creating sample dataframes to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Creating a substring condition to read the last three characters:
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1|           col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno|   1| mno|
| ABC|abcdefghijklmno|   2| mno|
| XYZ|abcdefghijklmno|   1| mno|
| XYZ|abcdefghijklmno|   2| mno|
| DEF|abcdefghijklabc|   3| abc|
+----+---------------+----+----+
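If the trim/coalesce logic from the original expression needs to be kept (for values that may have surrounding spaces or be null), the join condition can be built the same way; this is only a sketch under that assumption:
import pyspark.sql.functions as f
# last three characters of the cleaned-up value, compared against df2.col2
cleaned = f.trim(f.coalesce(df1['col2'], f.lit('')))
cond = f.substring(cleaned, -3, 3) == df2['col2']
newdf = df1.join(df2, cond)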

Select row by value in set after collect_set with pyspark

Using
from pyspark.sql import functions as f
and the methods f.agg and f.collect_set, I have created a column colSet within a dataframe as follows:
+-------+--------+
| index | colSet |
+-------+--------+
|      1|[11, 13]|
|      2|  [3, 6]|
|      3|  [3, 7]|
|      4|  [2, 7]|
|      5|  [2, 6]|
+-------+--------+
Now, how is it possible, using Python and PySpark, to select only those rows where, for instance, 3 is an element of the array in the colSet entry (where in general there can be far more than only two entries)?
I have tried using a udf function like this:
from pyspark.sql.types import BooleanType
isInSet = f.udf(lambda vcol, val: val in vcol, BooleanType())
being called via
dataFrame.where(isInSet(f.col('colSet'), 3))
I also tried removing f.col from the caller and using it in the definition of isInSet instead, but neither worked; I am getting an exception:
AnalysisException: cannot resolve '3' given input columns: [index, colSet]
Any help is appreciated on how to select rows with a certain entry (or even better subset!!!) given a row with a collect_set result.
Your original UDF is fine, but to use it you need to pass the value 3 as a literal:
dataFrame.where(isInSet(f.col('colSet'), f.lit(3)))
But as jxc points out in a comment, using array_contains is probably a better choice:
dataFrame.where(f.array_contains(f.col('colSet'), 3))
I have not done any benchmarking, but in general using UDFs in PySpark is slower than using built-in functions because of the back-and-forth communication between the JVM and the Python interpreter.
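For the subset case mentioned at the end of the question (all values of a small set must appear in colSet), a hedged option on Spark 2.4+ is array_except: the difference between the wanted values and colSet must be empty.
import pyspark.sql.functions as f
# rows where both 3 and 6 occur in colSet
wanted = f.array(f.lit(3), f.lit(6))
dataFrame.where(f.size(f.array_except(wanted, f.col('colSet'))) == 0).show()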
I found a solution today (after failing Friday evening) without using a UDF:
# note: this collects colSet to the driver and filters in plain Python
[3 in x[0] for x in list(dataFrame.select(['colSet']).collect())]
Hope this helps someone else in the future.