How to transpose a sub-matrix of spark rdd iteratively? - pyspark

For example,
From:
+-----+-----+
|Date |val_1|
+-----+-----+
| 1-1 | 1.1|
| 1-2 | 1.2|
| 1-3 | 1.3|
| 1-4 | 1.4|
| 1-5 | 1.5|
| 1-6 | 1.6|
| 1-7 | 1.7|
| 1-8 | 1.8|
| 1-9 | 1.9|
| ...| ...|
To:
+-----+-----+-----+-----+
|Date | D-3 | D-2 | D-1 |
+-----+-----+-----+-----+
| 1-4 | 1.1 | 1.2 | 1.3 |
| 1-5 | 1.2 | 1.3 | 1.4 |
| 1-6 | 1.3 | 1.4 | 1.5 |
| 1-7 | 1.4 | 1.5 | 1.6 |
| 1-8 | 1.5 | 1.6 | 1.7 |
| 1-9 | 1.6 | 1.7 | 1.8 |
| ... | ... | ... | ... |
Thanks a lot in advance.

Your question is not entirely clear, in particular with respect to the iterative solution you are after. However, for the example data provided:
df = sc.parallelize([('1-1', 1.1), ('1-2', 1.2), ('1-3', 1.3), ('1-4', 1.4), ('1-5', 1.5), ('1-6', 1.6),('1-7', 1.7),('1-8', 1.8),('1-9', 1.9)]).toDF(["Date", "val_1"])
You can use lag in combination with a Window to retrieve D-3, D-2, and D-1:
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window
w = Window().partitionBy().orderBy(col("Date"))
dfl = df.select(
    "Date",
    lag("val_1", count=3).over(w).alias("D-3"),
    lag("val_1", count=2).over(w).alias("D-2"),
    lag("val_1", count=1).over(w).alias("D-1"),
).na.drop()
dfl.show()
This results in the following output:
+----+---+---+---+
|Date|D-3|D-2|D-1|
+----+---+---+---+
| 1-4|1.1|1.2|1.3|
| 1-5|1.2|1.3|1.4|
| 1-6|1.3|1.4|1.5|
| 1-7|1.4|1.5|1.6|
| 1-8|1.5|1.6|1.7|
| 1-9|1.6|1.7|1.8|
+----+---+---+---+
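If by "iteratively" you mean generalizing this to an arbitrary number of lagged columns, a minimal sketch that builds the lag columns in a loop could look like the following (N = 3 is just an assumption for illustration; adjust as needed):
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

N = 3  # how many lagged columns (D-N ... D-1) to build
w = Window().partitionBy().orderBy(col("Date"))

dfl = df
for i in range(N, 0, -1):
    # each column D-i holds val_1 shifted i rows back
    dfl = dfl.withColumn(f"D-{i}", lag("val_1", i).over(w))
dfl = dfl.select("Date", *[f"D-{i}" for i in range(N, 0, -1)]).na.drop()
dfl.show()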

Thanks to Jaco for the inspiration.
Here is the Scala version:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5), ("1-6", 1.6),("1-7", 1.7),("1-8", 1.8),("1-9", 1.9))).toDF("Date", "val_1")
val w = Window.partitionBy().orderBy("Date")
val res = df
  .withColumn("D-3", lag("val_1", 3).over(w))
  .withColumn("D-2", lag("val_1", 2).over(w))
  .withColumn("D-1", lag("val_1", 1).over(w))
  .na.drop()
Result:
+----+-----+---+---+---+
|Date|val_1|D-3|D-2|D-1|
+----+-----+---+---+---+
| 1-4| 1.4|1.1|1.2|1.3|
| 1-5| 1.5|1.2|1.3|1.4|
| 1-6| 1.6|1.3|1.4|1.5|
| 1-7| 1.7|1.4|1.5|1.6|
| 1-8| 1.8|1.5|1.6|1.7|
| 1-9| 1.9|1.6|1.7|1.8|
+----+-----+---+---+---+

Related

pyspark dataframe check if string contains substring

I need help implementing the below Python logic on a PySpark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
from pyspark.sql import functions as F

df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
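One caveat with the rlike approach: if any sub_string contains regex metacharacters, you may want to escape each value so it is matched literally. A sketch of that, using Python's re.escape and the Column rlike method instead of a SQL expression:
import re
from pyspark.sql import functions as F

# escape each substring so regex metacharacters are treated as literal text
escaped = [re.escape(s) for s in df3.first()[0]]
pattern = '|'.join(escaped)
df = df1.withColumn('isRT', F.lower(F.col('main_string')).rlike(pattern))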
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with a simple join expression:
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
    .withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
    .drop('sub_string') \
    .show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+
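Note that if more than one sub_string matches the same main_string, the left join above produces duplicate rows. If you want exactly one row per main_string, one option (a sketch, assuming that is the desired behaviour) is to aggregate instead of dropping the column:
from pyspark.sql import functions as f

df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
    .groupBy('main_string') \
    .agg(f.max(f.col('sub_string').isNotNull()).alias('isRT')) \
    .show()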

Create a Dataframe based on ranges of another Dataframe

I have a Spark Dataframe containing ranges of numbers (a start column and an end column), and a column containing the type of each range.
I want to create a new Dataframe with two columns: the first lists every number in each range (stepping by one), and the second lists the range's type.
To explain more, this is the input Dataframe:
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result :
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?
Try this:
import spark.implicits._
import org.apache.spark.sql.functions._

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")

df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
import org.apache.spark.sql.functions.{explode, udf}

val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}

df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)

Check if a value is between two columns, spark scala

I have two dataframes, one with my data and another one to compare against. What I want to do is check whether a value falls within the range defined by two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a between condition, but note that .between is not appropriate here because you want inequality in one of the comparisons:
val result = df_player.join(
  df_check,
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")
result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
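For anyone needing the same thing in PySpark, the equivalent join condition might look like this (a sketch, assuming df_player and df_check are defined as in the Scala example):
from pyspark.sql import functions as F

result = df_player.join(
    df_check,
    (F.col("Power") > F.col("First")) & (F.col("Power") <= F.col("Second")),
    "left"
).select("Baller", "Power", "Value")
result.show()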

Mean with different columns ignoring null values, Spark Scala

I have a dataframe with several columns, and what I am trying to do is compute the mean of these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
import org.apache.spark.sql.functions.{array, col, explode, mean}

val result = df.join(
  df.select(
    col("Baller"),
    explode(array(col("Power"), col("Vision"), col("KXD")))
  ).groupBy("Baller").agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
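An alternative that avoids the explode-and-join round trip is to compute the mean row-wise from a null-safe sum and a count of the non-null values. A minimal sketch in PySpark (column names taken from the example above):
from pyspark.sql import functions as F

cols = ["Power", "Vision", "KXD"]
total = sum(F.coalesce(F.col(c), F.lit(0)) for c in cols)                    # nulls contribute 0 to the sum
non_null = sum(F.when(F.col(c).isNotNull(), 1).otherwise(0) for c in cols)   # number of non-null values
df.withColumn("MEAN", total / non_null).show()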

Spark dataframe aggregate on multiple columns

I am actually working on PySpark code. My dataframe is:
+-------+--------+--------+--------+--------+
|element|collect1|collect2|collect3|collect4|
+-------+--------+--------+--------+--------+
|A1 | 1.02 | 2.6 | 5.21 | 3.6 |
|A2 | 1.61 | 2.42 | 4.88 | 6.08 |
|B1 | 1.66 | 2.01 | 5.0 | 4.3 |
|C2 | 2.01 | 1.85 | 3.42 | 4.44 |
+-------+--------+--------+--------+--------+
I need to find the mean and stddev for each element by aggregating all the collectX columns. The final result should be as below.
+-------+--------+--------+
|element|mean |stddev |
+-------+--------+--------+
|A1 | 3.11 | 1.76 |
|A2 | 3.75 | 2.09 |
|B1 | 3.24 | 1.66 |
|C2 | 2.93 | 1.23 |
+-------+--------+--------+
The code below breaks the mean down by individual column:
df.groupBy("element").mean().show()
Instead of doing this for each column, is it possible to roll them all up into one?
+-------+-------------+-------------+-------------+-------------+
|element|avg(collect1)|avg(collect2)|avg(collect3)|avg(collect4)|
+-------+-------------+-------------+-------------+-------------+
|A1 | 1.02 | 2.6 | 5.21 | 3.6 |
|A2 | 1.61 | 2.42 | 4.88 | 6.08 |
|B1 | 1.66 | 2.01 | 5.0 | 4.3 |
|C2 | 2.01 | 1.85 | 3.42 | 4.44 |
+-------+-------------+-------------+-------------+-------------+
I tried to make use of the describe function, since it has the complete set of aggregation functions, but the result is still shown per individual column:
df.groupBy("element").mean().describe().show()
Thanks.
Spark allows you to gather all sorts of stats per column. You are trying to calculate stats per row. In this case you can hack something together with a udf. Here is an example :D
$ pyspark
>>> from pyspark.sql.types import DoubleType
>>> from pyspark.sql.functions import array, udf
>>>
>>> mean = udf(lambda v: sum(v) / len(v), DoubleType())
>>> df = sc.parallelize([['A1', 1.02, 2.6, 5.21, 3.6], ['A2', 1.61, 2.42, 4.88, 6.08]]).toDF(['element', 'collect1', 'collect2', 'collect3', 'collect4'])
>>> df.show()
+-------+--------+--------+--------+--------+
|element|collect1|collect2|collect3|collect4|
+-------+--------+--------+--------+--------+
| A1| 1.02| 2.6| 5.21| 3.6|
| A2| 1.61| 2.42| 4.88| 6.08|
+-------+--------+--------+--------+--------+
>>> df.select('element', mean(array(df.columns[1:])).alias('mean')).show()
+-------+------+
|element| mean|
+-------+------+
| A1|3.1075|
| A2|3.7475|
+-------+------+
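If you'd rather avoid the udf, a minimal sketch using only built-in functions (explode + groupBy) also gives you the stddev; this assumes each element value appears in exactly one row, as in the example:
from pyspark.sql import functions as F

collect_cols = [c for c in df.columns if c.startswith('collect')]

# one row per (element, value), then aggregate back per element
result = (df
    .select('element', F.explode(F.array(*collect_cols)).alias('value'))
    .groupBy('element')
    .agg(F.round(F.mean('value'), 2).alias('mean'),
         F.round(F.stddev('value'), 2).alias('stddev')))
result.show()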
Did you try just adding the columns together and possibly dividing by 4?
SELECT avg((collect1 + collect2 + collect3 + collect4) / 4),
stddev((collect1 + collect2 + collect3 + collect4) / 4)
That's not going to do exactly what you want, but you get the idea.
Not sure which language you're using, but you can always build the query on the fly if you aren't happy with hardcoding it:
val collectColumns = df.columns.filter(_.startsWith("collect"))
val stmnt = "SELECT avg((" + collectColumns.mkString(" + ") + ") / " + collectColumns.length + ")"
You get the idea.