I have a DataFrame with a column of array type.
For example:
val df = List(("a", Array(1d,2d,3d)), ("b", Array(4d,5d,6d))).toDF("ID", "DATA")
df: org.apache.spark.sql.DataFrame = [ID: string, DATA: array<double>]
scala> df.show
+---+---------------+
| ID| DATA|
+---+---------------+
| a|[1.0, 2.0, 3.0]|
| b|[4.0, 5.0, 6.0]|
+---+---------------+
I would like to explode the array and add an index column, like this:
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
| a| 1| 1.0|
| a| 2| 2.0|
| a| 3| 3.0|
| b| 1| 4.0|
| b| 2| 5.0|
| b| 3| 6.0|
+---+----------+----+
I would like to be able to do that with Scala, and with sparklyr or SparkR.
I'm using Spark 1.6.
There is a posexplode function available in Spark's SQL functions. Note that the position column it generates is zero-based.
import org.apache.spark.sql.functions._
df.select($"ID", posexplode($"DATA"))
PS: This is only available from Spark 2.1.0 onwards.
With Spark 1.6, you can register your DataFrame as a temporary table and then run HiveQL over it to get the desired result.
df.registerTempTable("tab")
sqlContext.sql("""
select
ID, exploded.DATA_INDEX + 1 as DATA_INDEX, exploded.DATA
from
tab
lateral view posexplode(tab.DATA) exploded as DATA_INDEX, DATA
""").show
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
| a| 1| 1.0|
| a| 2| 2.0|
| a| 3| 3.0|
| b| 1| 4.0|
| b| 2| 5.0|
| b| 3| 6.0|
+---+----------+----+
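To make the semantics concrete, here is a minimal plain-Python sketch of what posexplode followed by the "+ 1" shift computes. It does not use Spark; the helper name posexplode_rows and its start parameter are hypothetical, chosen only for illustration.

```python
# Plain-Python model of "explode with index"; not Spark code.
# posexplode_rows and its `start` parameter are hypothetical names.

def posexplode_rows(rows, start=1):
    """Yield (id, index, value) for every element of each row's array.

    start=1 mirrors the "+ 1" in the HiveQL query; start=0 mirrors the
    zero-based position that posexplode itself produces.
    """
    for row_id, data in rows:
        for index, value in enumerate(data, start=start):
            yield (row_id, index, value)


rows = [("a", [1.0, 2.0, 3.0]), ("b", [4.0, 5.0, 6.0])]
exploded = list(posexplode_rows(rows))

for row in exploded:
    print(row)
```

Each input row produces one output row per array element, and the index restarts for every ID.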
Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|   id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 7.5| 0.1| 2.0|
| 2| 0.3| 3.5| 10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|   id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 1.0| 0.013| 0.26|
| 2| 0.028| 0.33| 1.0|
+-----+-------+-------+-------+
Any help would be appreciated.
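You can compute the row-wise maximum from an array of the value columns, then loop over those columns and replace each one with its value divided by that maximum.
from pyspark.sql.functions import array, array_max, col
# every column except the leading id column
cols = df.columns[1:]
# row-wise maximum across all value columns
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols])))
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()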
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+
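As a sanity check on the arithmetic, the same row-wise normalisation can be sketched in plain Python on the two sample rows. This is only a model of the computation, not PySpark; normalise_row is a hypothetical helper, and it assumes every row has a non-zero maximum.

```python
# Plain-Python model of dividing each value by its row maximum.
# normalise_row is a hypothetical helper; it assumes the row maximum
# is non-zero (a zero maximum would raise ZeroDivisionError here).

def normalise_row(row, id_key="id"):
    values = {k: v for k, v in row.items() if k != id_key}
    row_max = max(values.values())
    out = {id_key: row[id_key]}
    out.update({k: v / row_max for k, v in values.items()})
    return out


rows = [
    {"id": 1, "col_1": 7.5, "col_2": 0.1, "col_n": 2.0},
    {"id": 2, "col_1": 0.3, "col_2": 3.5, "col_n": 10.5},
]
normalised = [normalise_row(r) for r in rows]

for r in normalised:
    print(r)
```

The column holding the row maximum always becomes 1.0, so with non-negative inputs every value lands in the range 0.0 to 1.0.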
I have 2 dataframes like this.
scala> df1.show
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df2.show
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
How can I take the max of the Count column after a join?
Expected output.
+---+---------+
| id| Count|
+---+---------+
| 1|20.565656|
| 2|40.7767 |
+---+---------+
You can do this:
df1.union(df2).groupBy("ID").max("Count").show()
+---+----------+
| ID|max(Count)|
+---+----------+
| 1| 20.565656|
| 2| 40.7767|
+---+----------+
After joining both DataFrames, derive a single Count column that holds the greater of the two count values for each ID.
A UDF that takes both count columns as input is one way to do this; the example below uses the built-in when/otherwise expression instead, which gives the same result.
scala> df.show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df1.show()
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
scala> df.alias("x").join(df1.alias("y"), List("ID"))
  .select(col("ID"), col("x.Count").alias("Xcount"), col("y.Count").alias("Ycount"))
  .withColumn("Count", when(col("Xcount") >= col("Ycount"), col("Xcount")).otherwise(col("Ycount")))
  .drop("Xcount", "Ycount")
  .show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2| 40.7767|
+---+---------+
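Both answers compute the same thing on the sample data: the largest Count seen for each ID. A minimal plain-Python sketch of that reduction follows. It is not Spark code, and max_per_id is a hypothetical helper that models the union-then-groupBy approach.

```python
# Plain-Python model of "union the tables, then take max(Count) per ID".
# max_per_id is a hypothetical helper, not a Spark API.

def max_per_id(*tables):
    best = {}
    for table in tables:
        for row_id, count in table:
            # keep the largest count seen so far for this ID
            if row_id not in best or count > best[row_id]:
                best[row_id] = count
    return best


df1 = [(1, 20.565656), (2, 30.676776)]
df2 = [(1, 10.00998787), (2, 40.7767)]
result = max_per_id(df1, df2)

print(result)
```

One difference worth keeping in mind: the union approach keeps IDs that appear in only one of the two tables, whereas an inner join on ID drops them.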
I would like to replicate rows according to their value for a given column. For example, I got this DataFrame:
+-----+
|count|
+-----+
| 3|
| 1|
| 4|
+-----+
I would like to get:
+-----+
|count|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
I tried to use withColumn method, according to this answer.
val replicateDf = originalDf
.withColumn("replicating", explode(array((1 until $"count").map(lit): _*)))
.select("count")
But $"count" is a ColumnName and cannot be used to represent its values in the above expression.
(I also tried with explode(Array.fill($"count"){1}) but same problem here.)
What do I need to change? Is there a cleaner way?
array_repeat is available from Spark 2.4 onwards. On lower versions, you can use a UDF or the RDD API. With an RDD:
import org.apache.spark.sql.Row
val df = Seq(3,1,4).toDF("count")
val rdd1 = df.rdd.flatMap( x=> { val y = x.getAs[Int]("count"); for ( p <- 0 until y ) yield Row(y) } )
spark.createDataFrame(rdd1,df.schema).show(false)
Results:
+-----+
|count|
+-----+
|3 |
|3 |
|3 |
|1 |
|4 |
|4 |
|4 |
|4 |
+-----+
With flatMap on the DataFrame itself, without going through the RDD:
scala> df.flatMap( r=> { (0 until r.getInt(0)).map( i => r.getInt(0)) } ).show
+-----+
|value|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
With a UDF, the following works:
val df = Seq(3,1,4).toDF("count")
def array_repeat(x:Int):Array[Int]={
  val y = for ( p <- 0 until x ) yield x
  y.toArray
}
val udf_array_repeat = udf (array_repeat(_:Int):Array[Int] )
df.withColumn("count2", explode(udf_array_repeat('count))).select("count2").show(false)
EDIT:
Check @user10465355's answer below for more information about array_repeat.
You can use the array_repeat function:
import org.apache.spark.sql.functions.{array_repeat, explode}
val df = Seq(1, 2, 3).toDF
df.select(explode(array_repeat($"value", $"value"))).show()
+---+
|col|
+---+
| 1|
| 2|
| 2|
| 3|
| 3|
| 3|
+---+
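The effect of explode(array_repeat(value, value)) can be modelled in a few lines of plain Python. This sketch is only an illustration of the semantics, not Spark code, and replicate is a hypothetical helper name.

```python
# Plain-Python model of explode(array_repeat(value, value)).
# replicate is a hypothetical helper, not a Spark API.

def replicate(counts):
    """Repeat each count value as many times as the value itself."""
    return [c for c in counts for _ in range(c)]


replicated = replicate([3, 1, 4])
print(replicated)
```

A count of 0 yields an empty repetition and therefore contributes no output rows in this model.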
I am trying to sort a value val using another column ts for each id.
# imports
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
import pandas as pd
# create dummy data
pdf = pd.DataFrame( [['2',2,'cat'],['1',1,'dog'],['1',2,'cat'],['2',3,'cat'],['2',4,'dog']] ,columns=['id','ts','val'])
# createDataFrame is an instance method, so a session object is needed
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame( pdf )
sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 2| 2|cat|
| 1| 1|dog|
| 1| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
You can aggregate by id and sort by ts:
sorted_sdf = ( sdf.groupBy('id')
                  .agg( F.sort_array( F.collect_list( F.struct( F.col('ts'), F.col('val') ) ), asc = True)
                  .alias('sorted_col') )
             )
sorted_sdf.show()
+---+--------------------+
| id| sorted_col|
+---+--------------------+
| 1| [[1,dog], [2,cat]]|
| 2|[[2,cat], [3,cat]...|
+---+--------------------+
Then, we can explode this list:
explode_sdf = sorted_sdf.select( 'id' , F.explode( F.col('sorted_col') ).alias('sorted_explode') )
explode_sdf.show()
+---+--------------+
| id|sorted_explode|
+---+--------------+
| 1| [1,dog]|
| 1| [2,cat]|
| 2| [2,cat]|
| 2| [3,cat]|
| 2| [4,dog]|
+---+--------------+
Break the tuples of sorted_explode into two:
detupled_sdf = explode_sdf.select( 'id', 'sorted_explode.*' )
detupled_sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 1| 1|dog|
| 1| 2|cat|
| 2| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
Now our original dataframe is sorted by ts for each id!
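The three steps above (collect per id, sort the (ts, val) pairs, explode back into rows) can be sketched in plain Python. This is a model of the logic rather than PySpark; sort_within_id is a hypothetical helper, and the ids are sorted here only to make the output deterministic.

```python
# Plain-Python model of: groupBy(id) -> sort collected (ts, val) -> explode.
# sort_within_id is a hypothetical helper, not a Spark API.

def sort_within_id(rows):
    # step 1: collect (ts, val) pairs per id
    grouped = {}
    for row_id, ts, val in rows:
        grouped.setdefault(row_id, []).append((ts, val))

    # steps 2 and 3: sort each list by ts (ties broken by val),
    # then flatten back into one row per pair
    result = []
    for row_id in sorted(grouped):  # assumption: id order fixed for determinism
        for ts, val in sorted(grouped[row_id]):
            result.append((row_id, ts, val))
    return result


rows = [("2", 2, "cat"), ("1", 1, "dog"), ("1", 2, "cat"), ("2", 3, "cat"), ("2", 4, "dog")]
sorted_rows = sort_within_id(rows)

for row in sorted_rows:
    print(row)
```

Putting ts first in the pair is what makes the sort order follow ts, which is the same reason the struct above lists ts before val.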
val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out records where the first 3 characters of column 'c2' are either 'MSL' or 'HCP'.
The output should look like this:
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can anyone please help with this?
I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them instead?
Version: Spark 1.6.2
Scala : 2.10
This works too. Concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
// substring positions are 1-based, so (1, 3) takes the first three characters
df.filter(not(
  substring(col("c2"), 1, 3).isin("MSL", "HCP"))
)
I used the code below to filter rows from a DataFrame, and it worked for me on Spark 2.2.
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
option("header", "true").
option("delimiter", "|").
option("inferSchema", "true").
load("D:\\test.csv")
import spark.implicits._
val filter=data.filter($"dept" === "IT" )
OR
val filter=data.filter($"dept" =!= "IT" )
val df1 = df.filter(not(df("c2").rlike("MSL")) && not(df("c2").rlike("HCP")))
This worked.
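One detail separates the answers above: the like 'MSL%' and substring versions test the prefix only, while an unanchored regular expression such as rlike("MSL") matches anywhere in the string. The plain-Python sketch below illustrates the difference. It is not Spark code, and exclude_prefixes and exclude_matching are hypothetical helpers.

```python
# Plain-Python model of prefix exclusion versus unanchored regex exclusion.
# exclude_prefixes and exclude_matching are hypothetical helpers.
import re


def exclude_prefixes(values, prefixes=("MSL", "HCP")):
    """Drop values that start with any of the given prefixes."""
    return [v for v in values if not v.startswith(prefixes)]


def exclude_matching(values, pattern):
    """Drop values in which the regex matches anywhere."""
    return [v for v in values if not re.search(pattern, v)]


values = ["Emailab", "Phoneab", "Faxab", "Mail", "Other", "MSL12", "MSL", "HCP", "HCP12"]
kept = exclude_prefixes(values)
print(kept)

# An unanchored pattern also drops values that merely contain "MSL";
# anchoring it with ^ restricts the match to the start of the string.
print(exclude_matching(["xMSLx", "MSL1"], "MSL"))
print(exclude_matching(["xMSLx", "MSL1"], "^MSL"))
```

On the sample data both approaches give the same five rows, because no value contains MSL or HCP other than at the start. With other data, anchoring the pattern (for example "^MSL") keeps the regex version equivalent to the prefix test.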