Convert PySpark Dense Column Vectors into Rows [duplicate] - pyspark

This question already has answers here:
Pyspark: Split multiple array columns into rows
(3 answers)
Closed 4 years ago.
I have a data-frame with 3 columns, and every entry is a dense vector of the same length.
How can I melt the vector entries into rows?
Current data-frame:
column1       | column2
[1.0,2.0,3.0] | [10.0,4.0,3.0]
[5.0,4.0,3.0] | [11.0,26.0,3.0]
[9.0,8.0,7.0] | [13.0,7.0,3.0]
Expected:
column1 | column2
1.0     | 10.0
2.0     | 4.0
3.0     | 3.0
5.0     | 11.0
4.0     | 26.0
3.0     | 3.0
9.0     | 13.0
...

Step 1: Let's create the initial DataFrame:
myValues = [([1.0,2.0,3.0],[10.0,4.0,3.0]),([5.0,4.0,3.0],[11.0,26.0,3.0]),([9.0,8.0,7.0],[13.0,7.0,3.0])]
df = sqlContext.createDataFrame(myValues,['column1','column2'])
df.show()
+---------------+-----------------+
| column1| column2|
+---------------+-----------------+
|[1.0, 2.0, 3.0]| [10.0, 4.0, 3.0]|
|[5.0, 4.0, 3.0]|[11.0, 26.0, 3.0]|
|[9.0, 8.0, 7.0]| [13.0, 7.0, 3.0]|
+---------------+-----------------+
Step 2: Now explode both columns, but only after zipping the arrays element-wise. Here we know beforehand that the length of each list/array is 3.
from pyspark.sql.functions import array, col, explode, struct

# Zip the two arrays element-wise into an array of structs, then explode it
tmp = explode(array(*[
    struct(col("column1").getItem(i).alias("column1"),
           col("column2").getItem(i).alias("column2"))
    for i in range(3)
]))
df = (df.withColumn("tmp", tmp)
        .select(col("tmp").getItem("column1").alias("column1"),
                col("tmp").getItem("column2").alias("column2")))
df.show()
+-------+-------+
|column1|column2|
+-------+-------+
| 1.0| 10.0|
| 2.0| 4.0|
| 3.0| 3.0|
| 5.0| 11.0|
| 4.0| 26.0|
| 3.0| 3.0|
| 9.0| 13.0|
| 8.0| 7.0|
| 7.0| 3.0|
+-------+-------+
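The arrays above are plain Python lists, so getItem works directly. If the columns really hold ML DenseVector values (as the title suggests), they first need to be converted to arrays. A minimal sketch, assuming Spark 3.0+ (for pyspark.ml.functions.vector_to_array) and a hypothetical DataFrame df_vec whose two columns hold DenseVectors:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import array, col, explode, struct

# df_vec: hypothetical DataFrame with DenseVector columns "column1" and "column2".
# Convert each vector to array<double> so that getItem works as in Step 2.
df_arr = (df_vec.withColumn("column1", vector_to_array(col("column1")))
                .withColumn("column2", vector_to_array(col("column2"))))

# Reuse the same zip-and-explode pattern as above
tmp = explode(array(*[
    struct(col("column1").getItem(i).alias("column1"),
           col("column2").getItem(i).alias("column2"))
    for i in range(3)
]))
df_arr.withColumn("tmp", tmp).select("tmp.column1", "tmp.column2").show()
On older versions the same conversion can be done with a small UDF that returns the vector's values as a list.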

Related

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    7.5|    0.1|    2.0|
|    2|    0.3|    3.5|   10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    1.0|  0.013|   0.26|
|    2|  0.028|   0.33|    1.0|
+-----+-------+-------+-------+
Any help would be appreciated.
You can compute the row-wise maximum across an array of the columns and then loop over the columns, dividing each by that maximum.
from pyspark.sql.functions import array, array_max, col

cols = df.columns[1:]  # every column except id

# Row-wise maximum across all value columns
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols])))
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+
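With 2905 columns, a long chain of withColumn calls can make the query plan very large; the same normalization can be expressed as a single select, which also drops the helper max column at the end. A minimal sketch, assuming Spark 2.4+ for array_max:
from pyspark.sql.functions import array, array_max, col

cols = df.columns[1:]  # every column except id

# Compute the row-wise maximum once, then divide each column by it
df2 = (df.withColumn('max', array_max(array(*[col(c) for c in cols])))
         .select('id', *[(col(c) / col('max')).alias(c) for c in cols]))
df2.show()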

PySpark - Get aggregated values for dynamic columns in a dataframe

I have a dataframe with the below rows:
+------+--------+-------+-------+
| label| machine| value1| value2|
+------+--------+-------+-------+
|label1|machine1|     13|    7.5|
|label1|machine1|      9|    7.5|
|label1|machine1|    8.5|    7.5|
|label1|machine1|   10.5|    7.5|
|label1|machine1|     12|      8|
|label1|machine2|      8|   13.5|
|label1|machine2|     18|     10|
|label1|machine2|     10|     14|
|label1|machine2|      9|   10.5|
|label1|machine2|    8.5|     10|
|label2|machine3|      8|    7.5|
|label2|machine3|     18|    7.5|
|label2|machine3|     10|    7.5|
|label2|machine3|      9|    7.5|
|label2|machine3|    8.5|      8|
|label2|machine4|   13.5|     13|
|label2|machine4|     10|      9|
|label2|machine4|     14|    8.5|
|label2|machine4|   10.5|   10.5|
|label2|machine4|     10|     12|
+------+--------+-------+-------+
Here, I can have multiple value columns other than value1, value2 in the data frame. For every column, I want to aggregate the values with collect_list and create a new column in the data frame, so that I can perform some functions later.
For this, I tried the following:
from pyspark.sql.functions import collect_list

my_df = my_df.groupBy(['label', 'machine']) \
    .agg(collect_list("value1").alias("col_value1"),
         collect_list("value2").alias("col_value2"))
It gives me the below 4 rows, as I'm grouping by the label and machine columns.
+------+--------+--------------------+--------------------+
| label| machine|          col_value1|          col_value2|
+------+--------+--------------------+--------------------+
|label1|machine1|[13.0, 9.0, 8.5, ...|[7.5, 7.5, 7.5, 7...|
|label1|machine2|[8.0, 18.0, 10.0,...|[13.5, 10.0, 14, ...|
|label2|machine3|[8.0, 18.0, 10.0,...|[7.5, 7.5, 7.5, 7...|
|label2|machine4|[13.5, 10.0, 14, ...|[13.0, 9.0, 8.5, ...|
+------+--------+--------------------+--------------------+
Now, my problem here is how to pass columns dynamically to this group by. The columns might differ for every run, so I want to use something like this:
df_cols = ['value1', 'value2']
my_df = my_df.groupBy(['label', 'machine']). \
agg(collect_list(col_name).alias(str(col_name+"_collected")) for col_name in df_cols)
It fails with AssertionError: all exprs should be Column.
How can I achieve this? Can someone please help me on this?
Thanks in advance.
The below code has worked. Thank you.
exprs = [collect_list(x).alias(str(x+"_collected")) for x in df_cols]
my_df = my_df.groupBy(['label', 'machine']).agg(*exprs)
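If the value columns are not known up front, they can also be derived from the schema by excluding the grouping keys. A small sketch, assuming every non-key column should be collected:
from pyspark.sql.functions import collect_list

group_cols = ['label', 'machine']
# Treat every remaining column as a value column to collect
df_cols = [c for c in my_df.columns if c not in group_cols]

exprs = [collect_list(c).alias(c + "_collected") for c in df_cols]
my_df = my_df.groupBy(group_cols).agg(*exprs)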

how could I merge the column that was duplicated in pyspark? [duplicate]

This question already has answers here:
combine text from multiple rows in pyspark
(3 answers)
Closed 2 years ago.
I have a dataframe as below:
+--------------------+--------------------+
| _id| statement|
+--------------------+--------------------+
| 1| ssssssss|
| 2| ssssssss|
| 3| aaaaaaaa|
| 4| aaaaaaaa|
+--------------------+--------------------+
After using df.dropDuplicates(['statement']), I got this:
+--------------------+--------------------+
| _id| statement|
+--------------------+--------------------+
| 1| ssssssss|
| 3| aaaaaaaa|
+--------------------+--------------------+
But actually, I want to keep the _id value as below:
+--------------------+--------------------+
| _id| statement|
+--------------------+--------------------+
| 1, 2| ssssssss|
| 3, 4| aaaaaaaa|
+--------------------+--------------------+
How could I do this?
Finally found my answer in combine text from multiple rows in pyspark:
import pyspark.sql.functions as F

df.groupBy('statement').agg(F.collect_list('_id').alias('_id')).show()
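If the merged _id should be a single string like 1, 2 (as in the expected output) rather than an array, concat_ws can be layered on top of collect_list. A minimal sketch:
import pyspark.sql.functions as F

# Collect the ids per statement and join them into one comma-separated string
(df.groupBy('statement')
   .agg(F.concat_ws(', ', F.collect_list(F.col('_id').cast('string'))).alias('_id'))
   .select('_id', 'statement')
   .show())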

Extracting array index in Spark Dataframe

I have a Dataframe with a Column of Array Type
For example :
val df = List(("a", Array(1d,2d,3d)), ("b", Array(4d,5d,6d))).toDF("ID", "DATA")
df: org.apache.spark.sql.DataFrame = [ID: string, DATA: array<double>]
scala> df.show
+---+---------------+
| ID| DATA|
+---+---------------+
| a|[1.0, 2.0, 3.0]|
| b|[4.0, 5.0, 6.0]|
+---+---------------+
I wish to explode the array and have an index column, like:
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
|  a|         1| 1.0|
|  a|         2| 2.0|
|  a|         3| 3.0|
|  b|         1| 4.0|
|  b|         2| 5.0|
|  b|         3| 6.0|
+---+----------+----+
I wish to be able to do that with Scala, sparklyr or SparkR.
I'm using Spark 1.6.
There is a posexplode function available in Spark SQL functions:
import org.apache.spark.sql.functions._

df.select($"ID", posexplode($"DATA")).show
PS: This is only available from version 2.1.0 onwards.
With Spark 1.6, you can register your dataframe as a temporary table and then run Hive QL over it to get the desired result.
df.registerTempTable("tab")
sqlContext.sql("""
select
ID, exploded.DATA_INDEX + 1 as DATA_INDEX, exploded.DATA
from
tab
lateral view posexplode(tab.DATA) exploded as DATA_INDEX, DATA
""").show
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
| a| 1| 1.0|
| a| 2| 2.0|
| a| 3| 3.0|
| b| 1| 4.0|
| b| 2| 5.0|
| b| 3| 6.0|
+---+----------+----+
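For completeness on this pyspark-tagged page: from Spark 2.1 the same result is available through the DataFrame API without SQL. A PySpark sketch of the equivalent pattern (the question itself asks for Scala/sparklyr/SparkR, so this is only for reference):
from pyspark.sql.functions import col, posexplode

# posexplode yields (pos, col); pos is 0-based, so add 1 to match the output above
(df.select("ID", posexplode("DATA"))
   .select("ID", (col("pos") + 1).alias("DATA_INDEX"), col("col").alias("DATA"))
   .show())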

Equivalent of R's reshape2::melt() in Scala? [duplicate]

This question already has answers here:
Unpivot in Spark SQL / PySpark
(2 answers)
How to melt Spark DataFrame?
(6 answers)
Transpose column to row with Spark
(9 answers)
Closed 5 years ago.
I have a data frame and I would like to use Scala to explode rows into multiple rows using the values in multiple columns. Ideally I am looking to replicate the behavior of the R function melt().
All the columns contain Strings.
Example: I want to transform this data frame..
df.show
+--------+-----------+-------------+-----+----+
|    col1|       col2|         col3| res1|res2|
+--------+-----------+-------------+-----+----+
|       a|   baseline|  equivalence| TRUE| 0.1|
|       a|experiment1|  equivalence|FALSE|0.01|
|       b|   baseline|  equivalence| TRUE| 0.2|
|       b|experiment1|  equivalence|FALSE|0.02|
+--------+-----------+-------------+-----+----+
...Into this data frame:
+--------+-----------+-------------+----+-----+
|    col1|       col2|         col3| key|value|
+--------+-----------+-------------+----+-----+
|       a|   baseline|  equivalence|res1| TRUE|
|       a|experiment1|  equivalence|res1|FALSE|
|       b|   baseline|  equivalence|res1| TRUE|
|       b|experiment1|  equivalence|res1|FALSE|
|       a|   baseline|  equivalence|res2|  0.1|
|       a|experiment1|  equivalence|res2| 0.01|
|       b|   baseline|  equivalence|res2|  0.2|
|       b|experiment1|  equivalence|res2| 0.02|
+--------+-----------+-------------+----+-----+
Is there a built-in function in Scala which applies to datasets or data frames to do this?
If not, would it be relatively simple to implement this? How would it be done at a high level?
Note: I have found the class UnpivotOp from SMV which would do exactly what I want: (https://github.com/TresAmigosSD/SMV/blob/master/src/main/scala/org/tresamigos/smv/UnpivotOp.scala).
Unfortunately, the class is private, so I cannot do something like this:
import org.tresamigos.smv.UnpivotOp
val melter = new UnpivotOp(df, Seq("res1","res2"))
val melted_df = melter.unpivot()
Does anyone know if there is a way to access the class org.tresamigos.smv.UnpivotOp via some other class or static method of SMV?
Thanks!
Thanks to Andrew Ray's answer to unpivot in spark-sql/pyspark, this did the trick:
df.select($"col1",
$"col2",
$"col3",
expr("stack(2, 'res1', res1, 'res2', res2) as (key, value)"))
Or if the expression for the select should be passed as strings (handy for df %>% sparklyr::invoke("")):
df.selectExpr("col1",
"col2",
"col3",
"stack(2, 'res1', res1, 'res2', res2) as (key, value)")