How to pivot a table into a timeseries table in Scala - scala

I have the following table:
+-----+-----+----+----+---+
|index|0    |1   |2   |id |
+-----+-----+----+----+---+
|1    |9.69 |1.18|0.59|62 |
|2    |7.38 |2.18|0.87|62 |
|3    |10.02|1.16|0.29|62 |
+-----+-----+----+----+---+
That I'm trying to pivot into a time series like table.
Expected Output:
+-------------------+---+
|data               |id |
+-------------------+---+
|[9.69, 7.38, 10.02]|62 |
|[1.18, 2.18, 1.16] |62 |
|[0.59, 0.87, 0.29] |62 |
+-------------------+---+
I tried the following code
val table = df.groupBy(df.col("id")).pivot("index").sum("0").cache()
val tablets = table.map(x => new transform(1.until(x.length).map(x.getDouble(_)).toList, x.getString(0)))
case class transform(data:List[Double], start:String)
But it gives only this output:
[9.69, 7.38, 10.02] 62
How can I iterate through all columns and get the desired output table as above?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

class pivot(df: DataFrame) {
  val col1Names = df.drop("id").columns.tail
  val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
    c => struct(lit(c).alias("k"), col(c).alias("v"))
  }: _*))
  val tempdf = df.withColumn("kv", kv)
    .select("index", "kv.k", "kv.v", "id")
    .groupBy("id", "k")
    .pivot("index")
    .agg(first("v"))
    .drop("k")
  val col2Names = tempdf.columns.tail
  val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*)
}
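For reference, here is a minimal usage sketch of my own for the class above (it assumes a DataFrame df shaped like the one in the question):
val p = new pivot(df)
p.finaldf.show(false)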

In your solution you have used groupBy and sum, which generate one aggregated row for each group. That's why you were getting only one row in your result.
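For illustration, here is a quick sketch of my own showing what that pivot produces (using the df constructed below); everything collapses into a single row per id:
df.groupBy("id").pivot("index").sum("0").show(false)
// +---+----+----+-----+
// |id |1   |2   |3    |
// +---+----+----+-----+
// |62 |9.69|7.38|10.02|
// +---+----+----+-----+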
The solution to your problem is a bit complex. I have used a combination of withColumn, explode, array, struct, pivot, groupBy, agg, drop, col, select and alias. The solution follows.
val df = Seq((1, 9.69, 1.18, 0.59, 62),
(2, 7.38, 2.18, 0.87, 62),
(3, 10.02, 1.16, 0.29, 62)).toDF("index", "0", "1", "2", "id")
As defined in your question, after reading the input as above you should already have a dataframe like this:
+-----+-----+----+----+---+
|index|0 |1 |2 |id |
+-----+-----+----+----+---+
|1 |9.69 |1.18|0.59|62 |
|2 |7.38 |2.18|0.87|62 |
|3 |10.02|1.16|0.29|62 |
+-----+-----+----+----+---+
If so, then the following solution should work.
val col1Names = df.drop("id").columns.tail
val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
val tempdf = df.withColumn("kv", kv)
  .select("index", "kv.k", "kv.v", "id")
  .groupBy("id", "k")
  .pivot("index")
  .agg(first("v"))
  .orderBy("k")
  .drop("k")
val col2Names = tempdf.columns.tail
val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*).sort($"data".desc)
You should get the following output:
+---+-------------------+
|id |data |
+---+-------------------+
|62 |[9.69, 7.38, 10.02]|
|62 |[1.18, 2.18, 1.16] |
|62 |[0.59, 0.87, 0.29] |
+---+-------------------+

Related

How to create a dataframe from Array[Strings]?

I used rdd.collect() to create an Array, and now I want to use this Array[Strings] to create a DataFrame. My test file is in the following format (separated by a pipe |).
TimeStamp
IdC
Name
FileName
Start-0f-fields
column01
column02
column03
column04
column05
column06
column07
column08
column010
column11
End-of-fields
Start-of-data
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
document
The column names are listed between Start-0f-fields and End-of-fields.
I want to store the pipe-separated ("|") values in separate columns of a DataFrame, like the example below:
column01 column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C 0 13 IS LS Xys Xyz 12 23 48
G0002x 0 13 LS MS Xys Xyz 14 300 400
My code:
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,16).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(5,16)
val rdd1 = sc.parallelize(rdd.collect())
val df = rdd1.toDf(columns)
But this is not giving me the desired dataframe shown above.
Could you try this?
import spark.implicits._ // Add to use `toDS()` and `toDF()`
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,15) // the column names (indices 5 to 14); `.mkString(",")` is not needed
val dataDS = rdd.collect.slice(17,21) // the data rows (indices 17 to 20), not the column-name lines
.map(_.trim()) // to remove whitespaces
.map(s => s.substring(0, s.length - 1)) // to remove last pipe '|'
.toSeq
.toDS
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.csv(dataDS)
.toDF(columns: _*)
df.show(false)
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002A |0 |13 |IS |LS |Xys |Xyz |12 |23 |45 |
|G0002x |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002C |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
Calling the spark.read...csv() method without a schema can take a long time on huge data because of schema inference.
In that case, you can specify the schema like below.
/*
column01 STRING,
column02 STRING,
column03 STRING,
...
*/
val schema = columns
.map(c => s"$c STRING")
.mkString(",\n")
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.schema(schema) // schema inference is skipped
.csv(dataDS)
// .toDF(columns: _*) => unnecessary when schema is specified
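As an aside, and this is just a sketch on my part rather than part of the answer above, the schema can also be built programmatically with StructType instead of a DDL string:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val structSchema = StructType(columns.map(c => StructField(c, StringType, nullable = true)))

val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(structSchema)   // same effect: schema inference is skipped
  .csv(dataDS)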
If the number of columns and the column names are fixed, then you can do it as below:
val columns = rdd.collect.slice(5,15).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(17,21)
val d = data.mkString("\n").split('\n').toSeq.toDF()
import org.apache.spark.sql.functions._
val dd = d.withColumn("columnX", split($"value", "\\|"))
  .withColumn("column1", $"columnX".getItem(0))
  .withColumn("column2", $"columnX".getItem(1))
  .withColumn("column3", $"columnX".getItem(2))
  .withColumn("column4", $"columnX".getItem(3))
  .withColumn("column5", $"columnX".getItem(4))
  .withColumn("column6", $"columnX".getItem(5))
  .withColumn("column8", $"columnX".getItem(7))
  .withColumn("column10", $"columnX".getItem(8))
  .withColumn("column11", $"columnX".getItem(9))
  .drop("columnX", "value")
display(dd)
The output will show one column per extracted field.
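The long chain of withColumn/getItem calls above can also be generated in a loop; here is a minimal sketch of my own (it emits all ten columns from the question, whereas the chain above happens to skip column07):
val names = Seq("column01", "column02", "column03", "column04", "column05",
                "column06", "column07", "column08", "column010", "column11")
val dd2 = names.zipWithIndex.foldLeft(d.withColumn("columnX", split($"value", "\\|"))) {
  case (acc, (name, i)) => acc.withColumn(name, $"columnX".getItem(i))   // one column per position
}.drop("columnX", "value")
dd2.show(false)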

How to null a struct in Scala spark when all values in the struct are null?

I have a Spark Scala dataframe with a column that is a struct, and I want null instead of an object when all values in the struct are null.
val someDF = Seq(
(8, null,null),
(64, "mouse", "s"),
(-27, "horse", "e")
).toDF("a", "b", "c")
def make_week_struct (week:String) : Column = {
val summary = struct($"b", $"c").alias(s"wks_${week}_jrny")
return summary
}
val week1_summary = make_week_struct("1")
var dd = someDF.select($"a",week1_summary)
display(dd)
Sample Data
a b c
8 null null
64 mouse s
-27 horse e
Current Output
a wks_1_jrny
8 object:{a:null, b:null}
64 object:{a:"mouse", b:"s"}
-27 object:{a:"horse", b:"e"}
Expected Output
a wks_1_jrny
8 null
64 object:{a:"mouse", b:"s"}
-27 object:{a:"horse", b:"e"}
You can also use the to_json function and filter out empty JSON ({}).
scala>
dd
.withColumn("wks_1_jrny",
when(
to_json($"wks_1_jrny") =!= "{}", // Filter Empty Json values.
$"wks_1_jrny"
)
)
.show(false)
+---+----------+
|a |wks_1_jrny|
+---+----------+
|8 |null |
|64 |[mouse,s] |
|-27|[horse,e] |
+---+----------+
This also should work:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import spark.implicits._
val df = List(
(None, None),
(None, Some("abc")),
(Some(1), Some("xyz"))
).toDF("id", "name")
val structCols = Seq("id", "name")
val dataStruct = struct(structCols.map(col): _*)
val emptyStruct = struct(df.schema.fields.filter(f => structCols.contains(f.name)).map(f => lit(null).cast(f.dataType).as(f.name)):_*)
df
.select(when(dataStruct.equalTo(emptyStruct), lit(null: StructType)).otherwise(dataStruct).as("col"))
.show(false)
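A variant of the same idea, as a sketch of my own rather than part of the answer above, checks the individual fields for null instead of building an empty struct to compare against (it reuses structCols and dataStruct from the snippet above):
// true only when every column that feeds the struct is null
val allNull = structCols.map(c => col(c).isNull).reduce(_ && _)

df.select(when(allNull, lit(null)).otherwise(dataStruct).as("col"))
  .show(false)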

Spark SQL Split or Extract words from String of Words

I have a Spark dataframe like below. I'm trying to split the column into 2 more columns:
date time content
28may 11am [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp", split(col("content"), "\\[")).select(
  col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*)
date time content col1 col2 col3
28may 11 [ssid][customerid,shopid] ssid customerid shopid
Assuming a String represents an Array of Words. You can also reduce the number of intermediate dataframes to lower the load on the system. If there are more than 9 columns you may need to use c00, c01, etc. so that c10 and above sort correctly, or just use an integer as the column name; I leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp")
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2")
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3")
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |c6 |
+---+---+----------+------+------+-----+----+------+
|B |foo|null |null |null |null |null|null |
|A |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
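As a side note, and purely as a sketch of my own rather than part of the original answer, the RDD round-trip that pairs each word with its position can also be expressed with posexplode on Spark 2.x; the pivoted columns then come out named 0, 1, 2, ... instead of c0, c1, c2, ...:
val byPosition = df5.select($"k", posexplode($"words").as(Seq("pos", "v")))
val pivoted = byPosition
  .groupBy("k")
  .pivot("pos")
  .agg(first("v"))   // one column per word position
pivoted.show(false)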
Update:
Based on the original title stating an array of words; see the other answer.
If you are new to this, a few notes. It can presumably also be done with a Dataset and map. Here is a solution using DataFrames and RDDs; I may well investigate a complete Dataset version in the future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
("B", Array(Array("foo2", "bar2"), Array("single2"))),
("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// flatten via 2x explode; can be done more elegantly with a def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns, in the correct column order:
+---+----+----+-------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |
+---+----+----+-------+-----+----+------+
|B |foo2|bar2|single2|null |null|null |
|C |foo3|bar3|x |y |z |null |
|A |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+

conditional operator with groupby in spark rdd level - scala

I am using Spark 1.6.0 and Scala 2.10.5.
I have a dataframe like this,
+------------------+
|id | needed |
+------------------+
|1 | 2 |
|1 | 0 |
|1 | 3 |
|2 | 0 |
|2 | 0 |
|3 | 1 |
|3 | 2 |
+------------------+
From this df I created an rdd like this,
val dfRDD = df.rdd
From my rdd, I want to group by id and count the rows where needed is > 0:
((1, 2), (2,0), (3,2))
So, I tried like this,
val groupedDF = dfRDD.map(x =>(x(0), x(1) > 0)).count.redueByKey(_+_)
In this case, I am getting an error:
error: value > is not a member of any
I need that in rdd level. Any help to get my desired output would be great.
The problem is that in your map you're calling the apply method of Row, and as you can see in its scaladoc, that method returns Any; as the error and the scaladoc tell you, there is no > method on Any.
You can fix it using the getAs[T] method.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
val spark =
SparkSession
.builder
.master("local[*]")
.getOrCreate()
import spark.implicits._
val df =
List(
(1, 2),
(1, 0),
(1, 3),
(2, 0),
(2, 0),
(3, 1),
(3, 2)
).toDF("id", "needed")
val rdd: RDD[(Int, Int)] = df.rdd.map(row => (row.getAs[Int](fieldName = "id"), row.getAs[Int](fieldName = "needed")))
From there you can continue with the aggregation; you have a few mistakes in your logic.
First, you don't need the count call.
And second, if you need to count the number of times "needed" was greater than zero, you can't just do _ + _ on the raw values, because that would sum the needed values instead of counting them. Map each value to 0 or 1 first and then add those up:
val grouped: RDD[(Int, Int)] = rdd.mapValues(v => if (v > 0) 1 else 0).reduceByKey(_ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2))
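A compact variant, a sketch of mine rather than part of the original answer, counts the positives directly with aggregateByKey, so no pre-mapping to 0/1 is needed:
val counted: RDD[(Int, Int)] = rdd.aggregateByKey(0)(
  (acc, needed) => if (needed > 0) acc + 1 else acc, // within a partition: bump the count for each positive value
  _ + _                                              // across partitions: add the partial counts
)
counted.collect() // Array((1,2), (2,0), (3,2))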
PS: You should tell your professor to upgrade to Spark 2 and Scala 2.11 ;)
Edit
Using case classes in the above example.
final case class Data(id: Int, needed: Int)
val rdd: RDD[Data] = df.as[Data].rdd
val grouped: RDD[(Int, Int)] = rdd.map(d => d.id -> (if (d.needed > 0) 1 else 0)).reduceByKey(_ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2))
There's no need to do the calculation at the rdd level. Aggregation with the data frame should work:
df.groupBy("id").agg(sum(($"needed" > 0).cast("int")).as("positiveCount")).show
+---+-------------+
| id|positiveCount|
+---+-------------+
| 1| 2|
| 3| 2|
| 2| 0|
+---+-------------+
If you have to work with an RDD, use row.getInt or, as in @Luis' answer, row.getAs[Int] to get the value with an explicit type, and then do the comparison and reduceByKey:
df.rdd.map(r => (r.getInt(0), if (r.getInt(1) > 0) 1 else 0)).reduceByKey(_ + _).collect
// res18: Array[(Int, Int)] = Array((1,2), (2,0), (3,2))

How to replace a dataframe column with another dataframe column

I have two dataframes:
dataframe1
+----------+
|     DATE1|
+----------+
|2017-01-08|
|2017-10-10|
|2017-05-01|
+----------+
dataframe2
|NAME | SID| DATE1| DATE2|ROLL| SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0| 8/1/2017| 7 1 2017|3223| BHABHA|
|ADARSH| 2.0|10-10-2017|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0| 1.5.2017| 1/2/2017| 111| DAV|
Expected output
| NAME| SID| DATE1| DATE2|ROLL| SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|2017-01-08| 7 1 2017|3223| BHABHA|
|ADARSH| 2.0|2017-10-10|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|2017-05-01| 1/2/2017| 111| DAV|
I want to replace the DATE1 column in the dataframe2 with the DATE1 column of the dataframe1. I need a generic solution.
Any help will be appreciated.
I have tried the withColumn method as follows:
dataframe2.withColumn(newColumnTransformInfo._1, dataframe1.col("DATE1").cast(DateType))
But, I'm getting an error:
org.apache.spark.sql.AnalysisException: resolved attribute(s)
You cannot add a column from another dataframe directly.
What you can do is join the two dataframes and keep the column you want; both dataframes must have a common join column. If you do not have a common column and the data is in order, you can assign an increasing id to both dataframes and then join.
Here is a simple example for your case.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

//Dummy data
val df1 = Seq(
("2017-01-08"),
("2017-10-10"),
("2017-05-01")
).toDF("DATE1")
val df2 = Seq(
("Sayam", 22.0, "2017-01-08", "7 1 2017", 3223, "BHABHA"),
("ADARSH", 2.0, "2017-10-10", "10.03.2017", 222, "SUNSHINE"),
("SADIM", 1.0, "2017-05-01", "1/2/2017", 111, "DAV")
).toDF("NAME", "SID", "DATE1", "DATE2", "ROLL", "SCHOOL")
//create new Dataframe1 with new column id
val rows1 = df1.rdd.zipWithIndex().map{
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
val dataframe1 = spark.createDataFrame(rows1, StructType(StructField("id", LongType, false) +: df1.schema.fields))
//create new Dataframe2 with new column id
val rows2= df2.rdd.zipWithIndex().map{
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
val dataframe2 = spark.createDataFrame(rows2, StructType(StructField("id", LongType, false) +: df2.schema.fields))
dataframe2.drop("DATE1")
.join(dataframe1, "id")
.drop("id").show()
Output:
+------+----+----------+----+--------+----------+
| NAME| SID| DATE2|ROLL| SCHOOL| DATE1|
+------+----+----------+----+--------+----------+
| Sayam|22.0| 7 1 2017|3223| BHABHA|2017-01-08|
|ADARSH| 2.0|10.03.2017| 222|SUNSHINE|2017-10-10|
| SADIM| 1.0| 1/2/2017| 111| DAV|2017-05-01|
+------+----+----------+----+--------+----------+
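If you need this position-based join in more than one place, the zipWithIndex step can be factored into a small helper; this is only a sketch of mine under the same assumptions as the code above:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Adds a consecutive "id" column that reflects the current row order of the dataframe.
def withRowIndex(df: DataFrame): DataFrame = {
  val rows = df.rdd.zipWithIndex().map { case (r, id) => Row.fromSeq(id +: r.toSeq) }
  spark.createDataFrame(rows, StructType(StructField("id", LongType, nullable = false) +: df.schema.fields))
}

withRowIndex(df2.drop("DATE1"))
  .join(withRowIndex(df1), "id")
  .drop("id")
  .show()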
Hope this helps!