PySpark: Rename a column based on column position

How do I rename the 3rd column of a DataFrame in PySpark? I want to refer to the column by its index rather than its actual name.
Here is my attempt:
df
Col1 Col2 jfdklajfklfj
A B 2
df.withColumnRenamed([3], 'Row_Count')

Since Python indexing starts at 0, you can index the df.columns list after subtracting 1 from the 1-based column position:
index_of_col = 3
df.withColumnRenamed(df.columns[index_of_col-1],'Row_Count').show()
+----+----+---------+
|Col1|Col2|Row_Count|
+----+----+---------+
| A| B| 2|
+----+----+---------+
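An alternative sketch (my addition, not from the answer above): rebuild all the column names at once with toDF, which assigns names purely by position.
# A sketch using toDF; assumes the "3rd column" means list index 2 (0-based)
new_names = list(df.columns)     # copy the existing column names
new_names[2] = "Row_Count"       # replace the name at position 2, i.e. the 3rd column
df = df.toDF(*new_names)         # toDF assigns the new names positionally
df.show()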

Related

How to select a column based on value of another in Pyspark?

I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my DataFrame, based on the processed value of special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column", when(col("special_column") == "one", col("one_processed"))
               .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
The easiest way in your case would be just a simple when/otherwise like:
>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1,2,"two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to select a column whose name is computed from a row value, because the execution plan would then depend on the data.
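If there are many such columns, one generalization (a sketch on my part, assuming every candidate column follows the <value>_processed naming convention) is to build the when chain programmatically from df.columns instead of writing each branch by hand:
# A sketch: chain one when() branch per "<value>_processed" column found in df.columns
import functools
from pyspark.sql import functions as F

processed = [c for c in df.columns if c.endswith("_processed")]
first, rest = processed[0], processed[1:]
expr = F.when(F.col("special_column") == first.replace("_processed", ""), F.col(first))
expr = functools.reduce(
    lambda acc, c: acc.when(F.col("special_column") == c.replace("_processed", ""), F.col(c)),
    rest,
    expr,
)
df = df.withColumn("my_new_column", expr)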

substring function returns a Column type instead of a value. Is there a way to fetch a value out of a Column type in PySpark?

I am building a join condition in my PySpark application using the substring function, but this function returns a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried expr but I still get the same Column type result:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the values coming from substring to the value of another DataFrame column. Both are of string type.
PySpark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) - 2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
Based on the above definition, the expected output of the substring function should be 'mno'.
Creating sample DataFrames to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Creating a join condition that takes the last three characters of col2 (a negative start position counts from the end of the string):
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
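A hedged alternative, closer to the question's original expr attempt: since f.substring takes plain integer positions in many PySpark versions, a length()-based offset is easier to express with expr(). The sketch below (my own, with an assumed helper column name join_key) materializes the key on df1 first and then joins on it.
# An alternative sketch: compute the key with a SQL expression, then join on it
import pyspark.sql.functions as f

df1_keyed = df1.withColumn(
    "join_key", f.expr("substring(trim(col2), length(trim(col2)) - 2, 3)")
)
newdf2 = df1_keyed.join(df2, df1_keyed["join_key"] == df2["col2"]).drop("join_key")
newdf2.show()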

How to convert numerical values to a categorical variable using pyspark

I have a PySpark DataFrame with a range of numerical variables.
For example, my DataFrame has a column with values from 1 to 100.
1-10   - group1   <== column values from 1 to 10 should get group1 as the value
11-20  - group2
...
91-100 - group10
How can I achieve this with a PySpark DataFrame?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. We subtract 1 from Var, divide by 10, take the floor, and add 1, so that the groups start at 1 and boundary values such as 10 or 100 land in the correct group (10 in group1, 100 in group10). Finally, we need to prepend group to the value. Concatenation is achieved with the concat() function, but keep in mind that since the prepended word group is not a column, we need to put it inside lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var', concat(lit('group'), (floor((col('Var') - 1) / 10) + 1)))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
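As an alternative sketch (my own suggestion, not part of the original answer), pyspark.ml.feature.Bucketizer can produce the same 10-wide bins from the original numeric column; its splits are half-open intervals and the last bucket also includes its upper bound, so 100 still lands in group10.
# A sketch with Bucketizer, starting again from the numeric DataFrame
from pyspark.sql.functions import col, concat, lit
from pyspark.ml.feature import Bucketizer

df_num = spark.createDataFrame([(1, 54), (2, 7), (3, 72), (4, 99)], ['ID', 'Var'])
splits = [float(b) for b in range(1, 102, 10)]          # 1.0, 11.0, ..., 101.0
bucketizer = Bucketizer(splits=splits, inputCol='Var_d', outputCol='bucket')
binned = bucketizer.transform(df_num.withColumn('Var_d', col('Var').cast('double')))
binned = binned.withColumn('Var', concat(lit('group'), (col('bucket') + 1).cast('int')))
binned.select('ID', 'Var').show()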

Spark: Applying UDF to Dataframe Generating new Columns based on Values in DF

I am having problems transposing values in a DataFrame in Scala. My initial DataFrame looks like this:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| X| 6|null|
| B| Z|null| 5|
| C| Y| 4|null|
+----+----+----+----+
col1 and col2 are type String and col3 and col4 are Int.
And the result should look like this:
+----+----+----+----+------+------+------+
|col1|col2|col3|col4|AXcol3|BZcol4|CYcol3|
+----+----+----+----+------+------+------+
|   A|   X|   6|null|     6|  null|  null|
|   B|   Z|null|   5|  null|     5|  null|
|   C|   Y|   4|null|  null|  null|     4|
+----+----+----+----+------+------+------+
That means that the three new columns should be named after col1, col2 and the column the value is extracted from. The extracted value comes from col3 or col4, depending on which one is not null.
So how to achieve that? I first thought of a UDF like this:
def myFunc(col1: String, col2: String, col3: Long, col4: Long): (String, Long) = {
  if (col3 == null) {   // comparing a primitive Long to null does not work, see below
    (col1 + col2 + "col4", col4)
  } else {
    (col1 + col2 + "col3", col3)
  }
}
val udfMyFunc = udf(myFunc _) // needed to treat it as a partially applied function
But how can I call it from the dataframe in the right way?
Of course, all the code above is rubbish and there could be a much better way. Since I am just juggling with first code snippets, let me know... Comparing the Int value to null is already not working.
Any help is appreciated! Thanks!
There is a simpler way:
val df3 = df2.withColumn("newCol", concat($"col1", $"col2")) //Step 1
.withColumn("value",when($"col3".isNotNull, $"col3").otherwise($"col4")) //Step 2
.groupBy($"col1",$"col2",$"col3",$"col4",$"newCol") //Step 3
.pivot("newCol") // Step 4
.agg(max($"value")) // Step 5
.orderBy($"newCol") // Step 6
.drop($"newCol") // Step 7
df3.show()
The steps work as follows:
Step 1: Add a new column, newCol, which contains the contents of col1 concatenated with col2.
Step 2: Add a new column, value, which contains the non-null contents of either col3 or col4.
Step 3: GroupBy the columns you want to keep.
Step 4: Pivot on newCol, which contains the values that are now to become column headings.
Step 5: Aggregate by the max of value, which will be the value itself if the groupBy is single-valued per group; alternatively use .agg(first($"value")) if value happens to be a string rather than a numeric type, since the max function can only be applied to a numeric type.
Step 6: Order by newCol so the DataFrame is in ascending order.
Step 7: Drop newCol as you no longer need it, or skip this step if you want a column of values without nulls.
Credit due to #user8371915 who helped me answer my own pivot question in the first place.
Result is as follows:
+----+----+----+----+----+----+----+
|col1|col2|col3|col4|  AX|  BZ|  CY|
+----+----+----+----+----+----+----+
|   A|   X|   6|null|   6|null|null|
|   B|   Z|null|   5|null|   5|null|
|   C|   Y|   4|null|null|null|   4|
+----+----+----+----+----+----+----+
You might have to play around with the column header strings concatenation to get the right result.
Okay, I have a workaround to achieve what I want. I do the following:
(1) I generate a new column containing a tuple with [newColumnName,rowValue] following this advice Derive multiple columns from a single column in a Spark DataFrame
case class toTuple(newColumnName: String, rowValue: String)

def createTuple(input1: String, input2: String): toTuple = {
  // do something fancy here
  val column: String = input1 + input2
  val value: String = input1
  toTuple(column, value)
}

val UdfCreateTuple = udf(createTuple _)
(2) Apply function to DataFrame
val dfNew = df.select($"*", UdfCreateTuple($"col1", $"col2").alias("tmpCol"))
(3) Create a DataFrame with the distinct values of newColumnName
val dfDistinct = dfNew.select($"tmpCol.newColumnName").distinct
(4) Collect those distinct values into an array
var a = dfDistinct.select($"newColumnName").rdd.map(r => r(0).asInstanceOf[String])
var arrDistinct = a.collect()
(5) Create a key value mapping
var seqMapping:Seq[(String,String)]=Seq()
for (i <- arrDistinct){
seqMapping :+= (i,i)
}
(6) Apply mapping to original dataframe, cf. Mapping a value into a specific column based on annother column
val exprsDistinct = seqMapping.map { case (key, target) =>
when($"tmpCol.newColumnName" === key, $"tmpCol.rowValue").alias(target) }
val dfFinal = dfNew.select($"*" +: exprsDistinct: _*)
Well, that is a bit cumbersome but I can derive a set of new columns without knowing how many there are and at the same time transfer the value into that new column.

Unexpected column values after the IN condition in where() method of dataframe in spark

Task: I want the value of the child_id column [which is generated using the withColumn() and monotonicallyIncreasingId() methods] corresponding to the family_id and id columns.
Let me explain steps to complete my task:
Step 1: Add 2 columns to the DataFrame: one with a unique id, named child_id, and another with the value 0, named parent_id.
Step 2: Collect all family_ids from the DataFrame.
Step 3: Get the DataFrame of child_id and id, where id == family_id.
[Problem is here.]
def processFoHierarchical(param_df: DataFrame) {
  var dff = param_df.withColumn("child_id", monotonicallyIncreasingId() + 1)
  println("Something is not gud...")
  dff = dff.withColumn("parent_id", lit(0.toLong))
  dff.select("id", "family_id", "child_id").show() // Original dataframe.
  var family_ids = ""
  param_df.select("family_id").distinct().coalesce(1).collect().map(x => family_ids = family_ids + "'" + x.getAs[String]("family_id") + "',")
  println(family_ids)
  var x: DataFrame = null
  if (family_ids.length() > 0) {
    family_ids = family_ids.substring(0, family_ids.length() - 1)
    val y = dff.where(" id IN (" + family_ids + ")").select("id", "family_id", "child_id")
    y.show() // here I am getting unexpected values.
  }
}
This is the output of my code. I am trying to get the child_id values as they appear in the original DataFrame, but I am not getting them.
Note: Using Spark with Scala.
+--------------------+--------------------+----------+
| id| family_id| child_id|
+--------------------+--------------------+----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...| 4|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...| 9|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...| 5|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...| 10|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...| 8|
|dca16200-e462-11e...|dca16200-e462-11e...|8589934595|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...| 1|
|4cc19690-e819-11e...|ff81457a-e9cf-11e...| 3|
|dca16200-e462-11e...|ff81457a-e9cf-11e...|8589934596|
|72dd0250-eff4-11e...|78be8950-ecca-11e...| 2|
|84ed0df0-e81a-11e...|78be8950-ecca-11e...| 6|
|78be8951-ecca-11e...|78be8950-ecca-11e...| 7|
|d1515310-e9ad-11e...|78be8951-ecca-11e...|8589934593|
|d1515310-e9ad-11e...|72dd0250-eff4-11e...|8589934594|
+--------------------+--------------------+----------+
'72dd0250-eff4-11e5-9ce9-5e5517507c66','dca16200-e462-11e5-90ec-c1cf090b354c','78be8951-ecca-11e5-a5f5-c1cf090b354c','4261cca0-f0e9-11e5-bbba-c1cf090b354c','98c7dc00-f0e5-11e5-bc76-c1cf090b354c','fe60c680-eb59-11e5-9582-c1cf090b354c','ff81457a-e9cf-11e5-9ce9-5e5517507c66','8d9680a0-ec14-11e5-a94f-c1cf090b354c','78be8950-ecca-11e5-a5f5-c1cf090b354c',
+--------------------+--------------------+-----------+
| id| family_id| child_id|
+--------------------+--------------------+-----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...| 1|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...| 2|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...| 3|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...| 4|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...| 5|
|dca16200-e462-11e...|dca16200-e462-11e...| 6|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...| 8589934593|
|dca16200-e462-11e...|ff81457a-e9cf-11e...| 8589934594|
|72dd0250-eff4-11e...|78be8950-ecca-11e...|17179869185|
|78be8951-ecca-11e...|78be8950-ecca-11e...|17179869186|
+--------------------+--------------------+-----------+
I know that it doesn't produce consecutive values; those values depend on partitions. By unexpected values I mean (see the 2nd DataFrame) that those child_ids are supposed to come from the previous DataFrame where family_id = id, and to match multiple ids I am using IN. The child_id column contains none of the values from the DataFrame above; instead it looks like a new child_id column was created with monotonicallyIncreasingId().
See how the last 2 values in the 2nd DataFrame don't belong to the DataFrame above. So where are they coming from? I am not applying monotonicallyIncreasingId() to the DataFrame again, so why does it look like the child_id column has values as if monotonicallyIncreasingId() had been applied again?
The problem is not with the Spark DataFrame itself. Because DataFrames are evaluated lazily, monotonicallyIncreasingId() generates new ids every time an action such as DataFrame.show() re-executes the plan.
If we need to generate the ids once and refer to the same ids elsewhere in the code, we need to DataFrame.cache().
In your case, you need to cache the DataFrame after Step 1 so it will not create different child_id values every time show() is called.
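For illustration, a minimal sketch of that fix in PySpark syntax (assumed variable names; the Scala equivalent is analogous): cache and materialize the DataFrame right after the ids are generated, so later actions reuse the same child_id values.
# A sketch: cache after generating child_id so show()/where() do not recompute the ids
from pyspark.sql.functions import monotonically_increasing_id, lit

dff = param_df.withColumn("child_id", monotonically_increasing_id() + 1) \
              .withColumn("parent_id", lit(0))
dff.cache()   # mark for caching; the ids are materialized on the first action
dff.count()   # force that first action
dff.select("id", "family_id", "child_id").show()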