Mean across different columns ignoring null values, Spark Scala

I have a dataframe with several columns, and I am trying to compute the row-wise mean of these columns while ignoring null values. For example:
| Baller | Power | Vision | KXD |
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
The output should be:
| Baller | Power | Vision | KXD | MEAN |
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But rows that contain a null value get a null mean:
| Baller | Power | Vision | KXD | MEAN |
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |

You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
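val result = df.join(
  // one row per (Baller, value); mean skips the null values within each group
  df.select(col("Baller"), explode(array(col("Power"), col("Vision"), col("KXD"))).as("value"))
    .groupBy("Baller").agg(mean("value").as("MEAN")),
  Seq("Baller"))
result.show()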
|Baller|Power|Vision|KXD| MEAN|
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
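The attempt in the question returns null because, in Spark, adding null to a number yields null, and it divides by the fixed column count rather than by the number of non-null values. The rule the answer relies on can be sketched in plain Python (this is not Spark code; the helper name `mean_ignoring_nulls` is made up for illustration, and `None` stands in for null):

```python
from typing import Optional, Sequence


def mean_ignoring_nulls(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average the non-null entries; return None if every entry is null."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    # Divide by the number of values that are present, not by the column count.
    return sum(present) / len(present)


rows = {"John": [5, None, 10], "Bilbo": [5, 3, 2]}
means = {name: mean_ignoring_nulls(vals) for name, vals in rows.items()}
print(means["John"])   # 7.5
print(means["Bilbo"])
```

John's mean is (5 + 10) / 2 because only two of his three values are present.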


pyspark dataframe check if string contains substring

I need help implementing the Python (pandas) logic below on PySpark dataframes.
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
|id | main_string |
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
|id | sub_string |
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
Final Output:
|id | main_string | isRT |
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
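import pyspark.sql.functions as F
# collect all lower-cased substrings into a single array on the driver
df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
# join them into one regex alternation: 'happy|xxxx|...'
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))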
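One caveat with joining substrings by `|`: the result is treated as a regular expression, so a substring containing characters such as `.`, `+` or `(` would not match literally. A minimal plain-Python sketch of the same matching logic with escaping added (this uses the standard `re` module rather than Spark's `rlike`; the escaping step is my addition, not part of the answer above):

```python
import re

main_strings = ["i am a boy", "i am from london", "big data hadoop",
                "always be happy", "software and hardware"]
sub_strings = ["happy", "xxxx", "i am a boy", "yyyy", "from london"]

# re.escape makes each substring match literally, so characters such as
# '.', '+' or '(' are not interpreted as regex operators.
pattern = re.compile("|".join(re.escape(s.lower()) for s in sub_strings))

# True when at least one substring occurs anywhere in the main string.
is_rt = [bool(pattern.search(s.lower())) for s in main_strings]
print(is_rt)  # [True, True, False, True, False]
```

With the sample data no escaping is needed, so the result matches the expected isRT column either way.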
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
|main_string |
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
|sub_string |
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
You can get the following result with a simple join expression.
from pyspark.sql import functions as f
# a left join keeps every main_string; sub_string is null when nothing matched
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
    .withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
    .drop('sub_string') \
    .show()
| main_string| isRT|
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|

Create a Dataframe based on ranges of other Dataframe

I have a Spark Dataframe containing ranges of numbers (columns start and end), and a column containing the type of each range.
I want to create a new Dataframe with two columns: the first one lists every number in each range (in increments of one), and the second one gives the type of the range that number belongs to.
To explain more, this is the input Dataframe:
| start | end | type |
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
And this is the desired result:
| nbr | type |
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
Any ideas?
Try this.
val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
import spark.implicits._
val df = data.toDF("start", "end", "type")
df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start","end").show(false)
|type |nbr|
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
only showing top 20 rows
The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
// builds the inclusive sequence lower, lower + 1, ..., upper
val sequence = udf{ (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}
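To make the expansion itself concrete, here is a plain-Python sketch of what explode plus sequence does to each row (not Spark code; `expand_ranges` is a hypothetical helper name). Both ends of the range are included, which is why the UDF above iterates `upper - lower + 1` times:

```python
from typing import Iterable, List, Tuple


def expand_ranges(ranges: Iterable[Tuple[int, int, str]]) -> List[Tuple[int, str]]:
    """Turn each (start, end, type) into one (nbr, type) row per integer, end inclusive."""
    return [(nbr, kind)
            for start, end, kind in ranges
            for nbr in range(start, end + 1)]  # +1 because range() excludes the stop value


expanded = expand_ranges([(10, 20, "LOW"), (21, 30, "MEDIUM"), (31, 40, "HIGH")])
print(expanded[:3])   # [(10, 'LOW'), (11, 'LOW'), (12, 'LOW')]
print(len(expanded))  # 31
```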

Check if a value is between two columns, spark scala

I have two dataframes, one with my data and another one to compare against. What I want to do is check whether a value falls in the range defined by two columns of the second dataframe, for example:
| Baller | Power |
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
| First | Second | Value |
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
The result would be:
| Baller | Power | Value |
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
You can do a join on a range condition. Note that .between is not appropriate here, because it includes both endpoints and you need a strict inequality on the lower bound:
val result = df_player.join(
  df_check,
  // lower bound exclusive, upper bound inclusive
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second")
).select("Baller", "Power", "Value")
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
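The difference between the two conditions shows up exactly at the boundaries. A small plain-Python sketch (not Spark code; the function names are made up for illustration) compares the half-open test used in the join with an inclusive test like .between:

```python
# (First, Second, Value) rows from the lookup dataframe in the question
bands = [(1.0, 1.5, "Bad-"), (1.5, 3.0, "Bad"), (3.0, 4.2, "Good"), (4.2, 6.0, "Good+")]


def matches_half_open(power, bands):
    """First < power <= Second: each boundary value belongs to exactly one band."""
    return [label for first, second, label in bands if first < power <= second]


def matches_closed(power, bands):
    """First <= power <= Second: a boundary value matches two adjacent bands."""
    return [label for first, second, label in bands if first <= power <= second]


print(matches_half_open(1.5, bands))  # ['Bad-']
print(matches_closed(1.5, bands))     # ['Bad-', 'Bad']
print(matches_half_open(6.0, bands))  # ['Good+']
```

With the inclusive test, John (Power 1.5) would be joined to two rows. Note also that a Power of exactly 1 matches no band under the half-open condition.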

How do I transform a Spark dataframe so that my values become column names?

I'm not sure of a good way to phrase the question, but an example will help. Here is the dataframe that I have with the columns: name, type, and count:
| Name | Type | Count |
| a | 0 | 5 |
| a | 1 | 4 |
| a | 5 | 5 |
| a | 4 | 5 |
| a | 2 | 1 |
| b | 0 | 2 |
| b | 1 | 4 |
| b | 3 | 5 |
| b | 4 | 5 |
| b | 2 | 1 |
| c | 0 | 5 |
| c | ... | ... |
I want to get a new dataframe structured like this where the Type column values have become new columns:
| Name | 0 | 1 | 2 | 3 | 4 | 5 | <- Number columns are types from input
| a | 5 | 4 | 1 | 0 | 5 | 5 |
| b | 2 | 4 | 1 | 5 | 5 | 0 |
| c | 5 | ... | | | | |
The columns here are [Name,0,1,2,3,4,5].
Do this by using the pivot function in Spark.
val df2 = df.groupBy("Name").pivot("Type").sum("Count")
Here, if the name and the type are the same for two rows, their count values are added together; other aggregations are possible as well.
Resulting dataframe when using the example data in the question:
|Name| 0| 1| 2| 3| 4| 5|
| c| 5|null|null|null|null|null|
| b| 2| 4| 1| 5| 5|null|
| a| 5| 4| 1|null| 5| 5|
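The pivoted result has null where a (Name, Type) pair never occurs, while the question asks for 0. In Spark that gap is usually closed by filling nulls afterwards (for example with `na.fill(0)`). The grouping logic itself can be sketched in plain Python (not Spark code; only the rows actually visible in the question are used, so c has a single row here):

```python
from collections import defaultdict

rows = [("a", 0, 5), ("a", 1, 4), ("a", 5, 5), ("a", 4, 5), ("a", 2, 1),
        ("b", 0, 2), ("b", 1, 4), ("b", 3, 5), ("b", 4, 5), ("b", 2, 1),
        ("c", 0, 5)]

# Sum Count per (Name, Type); repeated pairs are added together.
totals = defaultdict(int)
for name, type_, count in rows:
    totals[(name, type_)] += count

# The distinct Type values become the new columns.
types = sorted({type_ for _, type_, _ in rows})
names = sorted({name for name, _, _ in rows})

# Missing (Name, Type) combinations become 0 here instead of null.
pivoted = {name: [totals.get((name, t), 0) for t in types] for name in names}
print(types)         # [0, 1, 2, 3, 4, 5]
print(pivoted["a"])  # [5, 4, 1, 0, 5, 5]
print(pivoted["b"])  # [2, 4, 1, 5, 5, 0]
```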

reorder column values pyspark

When I perform a Select operation on a DataFrame in PySpark it reduces to the following:
| val | Feat1 | Feat2 |
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 8 | f1b | f2f |
| 9 | f1a | f2d |
| 4 | f1b | f2c |
| 6 | f1b | f2a |
| 1 | f1c | f2c |
| 3 | f1c | f2g |
| 9 | f1c | f2e |
I require the val column to be ordered group wise based on another field Feat1 like the following:
| val | Feat1 | Feat2 |
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 3 | f1a | f2d |
| 1 | f1b | f2c |
| 2 | f1b | f2a |
| 3 | f1b | f2f |
| 1 | f1c | f2c |
| 2 | f1c | f2g |
| 3 | f1c | f2e |
NOTE that the val values don't depend on the order of Feat2 but are instead ordered based on their original val values.
Is there a command to reorder the column values in PySpark as required?
NOTE: A similar question exists, but it is specific to SQLite.
data = [(1, 'f1a', 'f2a'),
(2, 'f1a', 'f2b'),
(8, 'f1b', 'f2f'),
(9, 'f1a', 'f2d'),
(4, 'f1b', 'f2c'),
(6, 'f1b', 'f2a'),
(1, 'f1c', 'f2c'),
(3, 'f1c', 'f2g'),
(9, 'f1c', 'f2e')]
table = sqlContext.createDataFrame(data, ['val', 'Feat1', 'Feat2'])
For this purpose, you can use a window with the rank function:
from pyspark.sql import Window
from pyspark.sql.functions import rank
w = Window.partitionBy('Feat1').orderBy('val')
table.withColumn('val', rank().over(w)).orderBy('Feat1').show()
| 1| f1a| f2a|
| 2| f1a| f2b|
| 3| f1a| f2d|
| 1| f1b| f2c|
| 2| f1b| f2a|
| 3| f1b| f2f|
| 1| f1c| f2c|
| 2| f1c| f2g|
| 3| f1c| f2e|
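One thing to keep in mind: rank gives tied values the same number and then skips ahead, so if two rows in a group share the same val, the numbering will not be 1, 2, 3. Spark's row_number window function always numbers rows consecutively instead. A plain-Python sketch of the two numbering schemes within one partition (not Spark code; `rank_and_row_number` is a hypothetical helper):

```python
def rank_and_row_number(values):
    """Return (value, rank, row_number) triples for one partition, ordered by value."""
    ordered = sorted(values)
    result = []
    for position, value in enumerate(ordered, start=1):
        # rank: 1-based position of the first row holding this value,
        # so tied values share a rank and the next rank is skipped
        rank = ordered.index(value) + 1
        # row_number: simply the 1-based position in the ordering
        result.append((value, rank, position))
    return result


print(rank_and_row_number([9, 1, 2]))  # [(1, 1, 1), (2, 2, 2), (9, 3, 3)]
print(rank_and_row_number([4, 4, 8]))  # [(4, 1, 1), (4, 1, 2), (8, 3, 3)]
```

The sample data has no ties within a Feat1 group, so both schemes give the same result there.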