When I perform a Select operation on a DataFrame in PySpark it reduces to the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 8 | f1b | f2f |
| 9 | f1a | f2d |
| 4 | f1b | f2c |
| 6 | f1b | f2a |
| 1 | f1c | f2c |
| 3 | f1c | f2g |
| 9 | f1c | f2e |
+-----+--------+-------+
I need the val column renumbered within each group of Feat1, in order of the original val values, like the following:
+-----+--------+-------+
| val | Feat1 | Feat2 |
+-----+--------+-------+
| 1 | f1a | f2a |
| 2 | f1a | f2b |
| 3 | f1a | f2d |
| 1 | f1b | f2c |
| 2 | f1b | f2a |
| 3 | f1b | f2f |
| 1 | f1c | f2c |
| 2 | f1c | f2g |
| 3 | f1c | f2e |
+-----+--------+-------+
NOTE that the val values don't depend on the order of Feat2 but are instead ordered based on their original val values.
Is there a command to renumber the column values in PySpark as required?
NOTE: A question exists for the same problem, but it is specific to SQLite.
data = [(1, 'f1a', 'f2a'),
(2, 'f1a', 'f2b'),
(8, 'f1b', 'f2f'),
(9, 'f1a', 'f2d'),
(4, 'f1b', 'f2c'),
(6, 'f1b', 'f2a'),
(1, 'f1c', 'f2c'),
(3, 'f1c', 'f2g'),
(9, 'f1c', 'f2e')]
table = sqlContext.createDataFrame(data, ['val', 'Feat1', 'Feat2'])
Edit: For this purpose, you can use a window with the rank function:
from pyspark.sql import Window
from pyspark.sql.functions import rank
w = Window.partitionBy('Feat1').orderBy('val')
table.withColumn('val', rank().over(w)).orderBy('Feat1').show()
+---+-----+-----+
|val|Feat1|Feat2|
+---+-----+-----+
| 1| f1a| f2a|
| 2| f1a| f2b|
| 3| f1a| f2d|
| 1| f1b| f2c|
| 2| f1b| f2a|
| 3| f1b| f2f|
| 1| f1c| f2c|
| 2| f1c| f2g|
| 3| f1c| f2e|
+---+-----+-----+
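To make the window logic concrete, here is a plain-Python model (not Spark code) of what rank over partitionBy('Feat1').orderBy('val') computes. The helper name rank_within_groups is made up for this illustration. It also shows that rank leaves gaps when values tie; if ties are possible and consecutive numbers are required, row_number or dense_rank from pyspark.sql.functions may be a better fit.

```python
from collections import defaultdict

data = [(1, 'f1a', 'f2a'), (2, 'f1a', 'f2b'), (8, 'f1b', 'f2f'),
        (9, 'f1a', 'f2d'), (4, 'f1b', 'f2c'), (6, 'f1b', 'f2a'),
        (1, 'f1c', 'f2c'), (3, 'f1c', 'f2g'), (9, 'f1c', 'f2e')]


def rank_within_groups(rows):
    """Replace val with its SQL-style rank inside each Feat1 group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)          # partitionBy('Feat1')
    out = []
    for feat1 in sorted(groups):            # final orderBy('Feat1')
        ordered = sorted(groups[feat1], key=lambda r: r[0])  # orderBy('val')
        for val, f1, f2 in ordered:
            # rank = 1 + number of rows in the group with a strictly smaller val
            rank = 1 + sum(1 for r in ordered if r[0] < val)
            out.append((rank, f1, f2))
    return out


for row in rank_within_groups(data):
    print(row)

# Hypothetical tie: two rows share val 5, so the ranks are 1, 1, 3 (gap at 2).
print(rank_within_groups([(5, 'g', 'x'), (5, 'g', 'y'), (7, 'g', 'z')]))
```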
Related
I need help implementing the Python logic below on a PySpark DataFrame.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First build the alternation pattern substr_list from the substrings, then use rlike to generate the isRT column.
from pyspark.sql import functions as F
# Collect all lowercased substrings into a single row holding a list
df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
# Join them into one regex of the form "a|b|c"
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
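One caveat with the pattern approach: the joined string is interpreted as a regular expression, so a substring containing characters such as + or . would not be matched literally. The sketch below models the matching in plain Python with the re module. The extra substring 'c++' and the helper is_rt are added here for illustration and are not part of the question. Spark's rlike uses Java regular expressions and the answer embeds the pattern in a SQL string, so the escaping needed there may differ.

```python
import re

# Substrings from the question, plus 'c++' as a hypothetical tricky entry.
sub_strings = ['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london', 'c++']

# Lowercase and escape each substring so it is matched literally.
pattern = '|'.join(re.escape(s.lower()) for s in sub_strings)


def is_rt(main_string):
    """True if any substring occurs in main_string, ignoring case."""
    return re.search(pattern, main_string.lower()) is not None


for text in ['i am a boy', 'i am from london', 'big data hadoop',
             'always be happy', 'software and hardware', 'I like C++']:
    print(text, is_rt(text))
```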
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with a simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
.withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
.drop('sub_string') \
.show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+
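A point worth checking with the join approach: a left join emits one row per matching pair, so a main_string that contains several substrings would appear several times. The plain-Python model below (not Spark code) illustrates this. The extra substring 'boy' and both helper names are made up for the example. In Spark, calling dropDuplicates() after dropping sub_string is one possible way to collapse such repeats.

```python
main_strings = ['i am a boy', 'i am from london', 'big data hadoop',
                'always be happy', 'software and hardware']
sub_strings = ['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london']


def left_join_contains(mains, subs):
    """One output row per (main, matching sub) pair; unmatched mains get None."""
    rows = []
    for m in mains:
        matches = [s for s in subs if s in m]
        if matches:
            rows.extend((m, s) for s in matches)
        else:
            rows.append((m, None))
    return rows


def is_rt_flags(mains, subs):
    """Collapse the join output to a single flag per main string."""
    flags = {}
    for m, s in left_join_contains(mains, subs):
        flags[m] = flags.get(m, False) or s is not None
    return flags


print(len(left_join_contains(main_strings, sub_strings)))            # one row per main string
print(len(left_join_contains(main_strings, sub_strings + ['boy'])))  # 'i am a boy' now appears twice
print(is_rt_flags(main_strings, sub_strings + ['boy']))
```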
I have two dataframes, one with my data and another one to compare. What I want to do is check if a value is in a range of two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join on a range condition. Note that .between is not appropriate here: it is inclusive at both ends, while you need a strict inequality on the lower bound so that a boundary value such as 1.5 matches only one row:
val result = df_player.join(
df_check,
df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
"left"
).select("Baller", "Power", "Value")
result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
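The effect of the strict lower bound can be checked with a small plain-Python model of the two conditions (not Spark code; the function names half_open and closed are invented here). With both ends inclusive, a Power of 1.5 falls into two ranges and the join would return two rows for John.

```python
checks = [(1, 1.5, 'Bad-'), (1.5, 3, 'Bad'), (3, 4.2, 'Good'), (4.2, 6, 'Good+')]
players = [('John', 1.5), ('Bilbo', 3.7), ('Frodo', 6)]


def half_open(power):
    """Matches for First < power <= Second (the condition used in the join)."""
    return [value for first, second, value in checks if first < power <= second]


def closed(power):
    """Matches for First <= power <= Second (what between would test)."""
    return [value for first, second, value in checks if first <= power <= second]


for name, power in players:
    print(name, power, half_open(power), closed(power))

# A Power of exactly 1 matches no half-open range, so the left join would
# leave Value as null for such a player.
print(half_open(1))
```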
I have a dataframe with several columns, and I am trying to compute the row-wise mean of these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null for any row that contains a null value:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean (which skips nulls), then join back to the original dataframe using the Baller column:
import org.apache.spark.sql.functions.{array, col, explode, mean}
val result = df.join(
df.select(
col("Baller"),
explode(array(col("Power"), col("Vision"), col("KXD")))
).groupBy("Baller").agg(mean("col").as("MEAN")),
Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
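The original attempt returns null because in SQL arithmetic null plus anything is null, and it also divides by the total number of columns rather than by the number of non-null values. The plain-Python model below (not Spark code; mean_ignoring_nulls is a name chosen for this sketch) shows the intended row-wise calculation: sum the values that are present and divide by how many there are.

```python
rows = [
    {'Baller': 'John', 'Power': 5, 'Vision': None, 'KXD': 10},
    {'Baller': 'Bilbo', 'Power': 5, 'Vision': 3, 'KXD': 2},
]
cols = ['Power', 'Vision', 'KXD']


def mean_ignoring_nulls(row):
    """Mean of the non-null values in cols, or None if all of them are null."""
    present = [row[c] for c in cols if row[c] is not None]
    return sum(present) / len(present) if present else None


for row in rows:
    print(row['Baller'], mean_ignoring_nulls(row))
```

The same idea could be expressed row-wise in Spark by replacing nulls with 0 in the numerator and counting the non-null columns for the denominator, which would avoid the explode and join, though that variant is not shown in the answer above.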
I'm not sure of a good way to phrase the question, but an example will help. Here is the dataframe that I have with the columns: name, type, and count:
+------+------+-------+
| Name | Type | Count |
+------+------+-------+
| a | 0 | 5 |
| a | 1 | 4 |
| a | 5 | 5 |
| a | 4 | 5 |
| a | 2 | 1 |
| b | 0 | 2 |
| b | 1 | 4 |
| b | 3 | 5 |
| b | 4 | 5 |
| b | 2 | 1 |
| c | 0 | 5 |
| c | ... | ... |
+------+------+-------+
I want to get a new dataframe structured like this where the Type column values have become new columns:
+------+---+-----+---+---+---+---+
| Name | 0 | 1 | 2 | 3 | 4 | 5 | <- Number columns are types from input
+------+---+-----+---+---+---+---+
| a | 5 | 4 | 1 | 0 | 5 | 5 |
| b | 2 | 4 | 1 | 5 | 5 | 0 |
| c | 5 | ... | | | | |
+------+---+-----+---+---+---+---+
The columns here are [Name,0,1,2,3,4,5].
Do this by using the pivot function in Spark.
val df2 = df.groupBy("Name").pivot("Type").sum("Count")
Here, if the name and the type are the same for two rows, the count values are simply added together, but other aggregations are possible as well.
Resulting dataframe when using the example data in the question:
+----+---+----+----+----+----+----+
|Name| 0| 1| 2| 3| 4| 5|
+----+---+----+----+----+----+----+
| c| 5|null|null|null|null|null|
| b| 2| 4| 1| 5| 5|null|
| a| 5| 4| 1|null| 5| 5|
+----+---+----+----+----+----+----+
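Note that the pivot leaves null where a name has no row for a type, whereas the desired output shows 0. In Spark, calling na.fill(0) on the pivoted dataframe is one way to replace those nulls. The plain-Python model below (not Spark code; pivot_sum is a made-up helper) shows the same group, pivot, sum and fill logic. It uses only the rows visible in the question, so the elided rows for c are left out.

```python
from collections import defaultdict

rows = [('a', 0, 5), ('a', 1, 4), ('a', 5, 5), ('a', 4, 5), ('a', 2, 1),
        ('b', 0, 2), ('b', 1, 4), ('b', 3, 5), ('b', 4, 5), ('b', 2, 1),
        ('c', 0, 5)]


def pivot_sum(rows, fill=0):
    """Return (sorted types, {name: summed counts per type}), filling gaps."""
    types = sorted({t for _, t, _ in rows})
    names = sorted({n for n, _, _ in rows})
    totals = defaultdict(int)
    for name, t, count in rows:
        totals[(name, t)] += count       # sum("Count") per (Name, Type)
    # .get does not create keys, so missing combinations receive the fill value
    return types, {n: [totals.get((n, t), fill) for t in types] for n in names}


types, table = pivot_sum(rows)
print(['Name'] + types)
for name, counts in table.items():
    print([name] + counts)
```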
I have table data like the following:
+-----------+--------+-------------+
| City Name | URL | Read Count |
+-----------+--------+-------------+
| Gurgaon | URL1 | 3 |
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL4 | 1 |
| Gurgaon | URL5 | 5 |
| Delhi | URL3 | 4 |
| Delhi | URL7 | 2 |
| Delhi | URL5 | 1 |
| Delhi | URL6 | 6 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+-------------+
I would like to see something like this: the top 3 most-read articles (if they exist) for each city.
+-----------+--------+--------+
| City Name | URL | Count |
+-----------+--------+--------+
| Gurgaon | URL3 | 6 |
| Gurgaon | URL6 | 5 |
| Gurgaon | URL5 | 5 |
| Delhi | URL6 | 6 |
| Delhi | URL3 | 4 |
| Delhi | URL7 | 2 |
| Punjab | URL6 | 5 |
| Punjab | URL4 | 1 |
| Mumbai | URL5 | 5 |
+-----------+--------+--------+
I am working on Spark 2.0.2, Scala 2.11.8
You can use a window function to get the output.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val df = sc.parallelize(Seq(
("Gurgaon","URL1",3), ("Gurgaon","URL3",6), ("Gurgaon","URL6",5), ("Gurgaon","URL4",1),("Gurgaon","URL5",5),
("DELHI","URL3",4), ("DELHI","URL7",2), ("DELHI","URL5",1), ("DELHI","URL6",6),("Mumbai","URL5",5),
("Punjab","URL6",6), ("Punjab","URL4",1))).toDF("City", "URL", "Count")
df.show()
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL1| 3|
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL4| 1|
|Gurgaon|URL5| 5|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| DELHI|URL5| 1|
| DELHI|URL6| 6|
| Mumbai|URL5| 5|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
val w = Window.partitionBy($"City").orderBy($"Count".desc)
val dfTop = df.withColumn("row", row_number().over(w)).where($"row" <= 3).drop("row")
dfTop.show
+-------+----+-----+
| City| URL|Count|
+-------+----+-----+
|Gurgaon|URL3| 6|
|Gurgaon|URL6| 5|
|Gurgaon|URL5| 5|
| Mumbai|URL5| 5|
| DELHI|URL6| 6|
| DELHI|URL3| 4|
| DELHI|URL7| 2|
| Punjab|URL6| 6|
| Punjab|URL4| 1|
+-------+----+-----+
Output tested on Spark 1.6.2
Window functions are probably the way to go, and there is a built-in function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}
val window = Window.partitionBy($"City").orderBy(desc("Count"))
val dfTop = df.withColumn("rank", rank.over(window)).where($"rank" <= 3)
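The two answers differ when counts tie: row_number assigns distinct numbers, so at most N rows per city survive, while rank gives tied rows the same number and may keep more than N. The plain-Python model below (not Spark code; top_n is a helper invented for this sketch) shows the difference on the Gurgaon rows with N = 2. In this model ties keep their input order, whereas in Spark the order among tied rows under row_number is not guaranteed.

```python
gurgaon = [('Gurgaon', 'URL1', 3), ('Gurgaon', 'URL3', 6), ('Gurgaon', 'URL6', 5),
           ('Gurgaon', 'URL4', 1), ('Gurgaon', 'URL5', 5)]


def top_n(rows, n, method):
    """Keep rows whose per-city number is <= n; method is 'row_number' or 'rank'."""
    by_city = {}
    for row in rows:
        by_city.setdefault(row[0], []).append(row)    # partitionBy City
    out = []
    for group in by_city.values():
        ordered = sorted(group, key=lambda r: -r[2])  # orderBy Count desc (stable)
        for position, row in enumerate(ordered, start=1):
            # rank = 1 + number of rows with a strictly larger count
            rank = 1 + sum(1 for r in ordered if r[2] > row[2])
            number = position if method == 'row_number' else rank
            if number <= n:
                out.append(row)
    return out


print(top_n(gurgaon, 2, 'row_number'))  # two rows: URL3, then one of the tied URLs
print(top_n(gurgaon, 2, 'rank'))        # three rows: URL3 plus both rows tied at 5
```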