forming list of columns after groupByKey or groupBy - scala

I have this input DataFrame:
input_df:
|C1 | C2 | C3         |
|---|----|------------|
|A  | 1  | 12/06/2012 |
|A  | 2  | 13/06/2012 |
|B  | 3  | 12/06/2012 |
|B  | 4  | 17/06/2012 |
|C  | 5  | 14/06/2012 |
After transformations, I want to get this kind of DataFrame: grouped by C1, with a new column C4 formed by the list of (C2, C3) pairs:
output_df:
|C1 | C4                               |
|---|----------------------------------|
|A  | (1, 12/06/2012), (2, 13/06/2012) |
|B  | (3, 12/06/2012), (4, 17/06/2012) |
|C  | (5, 14/06/2012)                  |
I get close to the result when I try this:
val output_df = input_df.map(x => (x(0), (x(1), x(2))) ).groupByKey()
I obtain this result
(A,CompactBuffer((1, 12/06/2012), (2, 13/06/2012)))
(B,CompactBuffer((3, 12/06/2012), (4, 17/06/2012)))
(C,CompactBuffer((5, 14/06/2012)))
But I don't know how to convert this into a DataFrame, or whether this is even a good way to do it.
Any advice is welcome, even another approach.

// Please try this:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row

val conf = new SparkConf().setAppName("groupBy").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Build the sample data as an RDD of (C1, C2, C3) tuples
val rdd = sc.parallelize(
  Seq(("A",1,"12/06/2012"),("A",2,"13/06/2012"),("B",3,"12/06/2012"),("B",4,"17/06/2012"),("C",5,"14/06/2012")) )

// Key by the first field, group, and materialize each group as an array
val v1 = rdd.map(x => (x._1, x))
val v2 = v1.groupByKey()
val v3 = v2.mapValues(v => v.toArray)

// The array of tuples becomes an array<struct> column in the DataFrame
val df2 = v3.toDF("aKey", "theValues")
df2.printSchema()

// Each struct element comes back as a Row, so read its fields by position
val first = df2.first
println(first)
println(first.getString(0))
val values = first.getSeq[Row](1)
val firstArray = values(0)
println(firstArray.getString(0)) // B
println(firstArray.getInt(1))    // 3
println(firstArray.getString(2)) // 12/06/2012
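If you would rather stay in the DataFrame API than drop to an RDD of tuples, here is a minimal sketch that produces output_df directly, assuming Spark 2.x (where collect_list over a struct works without Hive) and a SparkSession named spark:
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

val input_df = Seq(
  ("A", 1, "12/06/2012"), ("A", 2, "13/06/2012"),
  ("B", 3, "12/06/2012"), ("B", 4, "17/06/2012"),
  ("C", 5, "14/06/2012")
).toDF("C1", "C2", "C3")

// Group by C1 and gather each group's (C2, C3) pairs into an array column C4
val output_df = input_df
  .groupBy("C1")
  .agg(collect_list(struct("C2", "C3")).as("C4"))

output_df.show(false)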

Spark SQL Split or Extract words from String of Words

I have a Spark DataFrame like below. I'm trying to split the content column into additional columns:
date  time content
28may 11am [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp", split(col("content"), "\\[")).select(
  col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*)
to get an output like:
date  time content                    col1 col2       col3
28may 11   [ssid][customerid,shopid]  ssid customerid shopid
This assumes a String representing an Array of Words. You can reduce the number of intermediate DataFrames to lessen the load on the system. If there can be more than 9 columns, you may want to name them c00, c01, etc. so that c10 and beyond still sort correctly, or simply use integers as column names; I leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
  ("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
  ("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp")
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2")
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3")
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to an RDD of tuples and zip each word with its position to generate column names; this is what makes the pivot possible
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k  |c0 |c1        |c2    |c3    |c4   |c5  |c6    |
+---+---+----------+------+------+-----+----+------+
|B  |foo|null      |null  |null  |null |null|null  |
|A  |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
Update:
This is based on the original title, which stated an array of words; see the other answer for the string case.
A few notes if this is new to you: it could presumably also be done with a typed Dataset and map (a brief Dataset sketch of the flattening step appears after the output table below), but here is a solution using DataFrames and RDDs. I may well investigate a complete Dataset version in future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
  ("A", Array(Array("foo", "bar"), Array("Donald", "Trump", "Esq"), Array("single"))),
  ("B", Array(Array("foo2", "bar2"), Array("single2"))),
  ("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// Flatten via 2x explode; this can be done more elegantly with a def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to an RDD of tuples and zip each word with its position to generate column names; this is what makes the pivot possible
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in correct col order:
+---+----+----+-------+-----+----+------+
|k  |c0  |c1  |c2     |c3   |c4  |c5    |
+---+----+----+-------+-----+----+------+
|B  |foo2|bar2|single2|null |null|null  |
|C  |foo3|bar3|x      |y    |z   |null  |
|A  |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+
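As mentioned above, here is a minimal, hedged sketch of how the flattening step could be written with a typed Dataset instead of the two explodes. It assumes the same df (columns k and v, with v an array of arrays of strings), spark.implicits._ in scope, and the functions import above; KV is a helper case class introduced only for this illustration.
final case class KV(k: String, v: Seq[Seq[String]])

// Flatten each nested array and tag every word with a positional column name
val kvc = df.as[KV]
  .flatMap(r => r.v.flatten.zipWithIndex.map { case (word, i) => (r.k, word, s"c$i") })
  .toDF("k", "v", "c")

// From here the pivot is the same as in the DataFrame version above
val resultDs = kvc.groupBy("k").pivot("c").agg(expr("coalesce(first(v),null)"))
resultDs.show(100, false)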

conditional operator with groupby in spark rdd level - scala

I am using Spark 1.60 and Scala 2.10.5
I have a dataframe like this,
+----+--------+
| id | needed |
+----+--------+
| 1  | 2      |
| 1  | 0      |
| 1  | 3      |
| 2  | 0      |
| 2  | 0      |
| 3  | 1      |
| 3  | 2      |
+----+--------+
From this df I created an rdd like this,
val dfRDD = df.rdd
From my RDD, I want to group by id and count how many times needed is > 0, to get:
((1, 2), (2,0), (3,2))
So I tried this:
val groupedDF = dfRDD.map(x => (x(0), x(1) > 0)).count.reduceByKey(_+_)
In this case, I am getting an error:
error: value > is not a member of any
I need that in rdd level. Any help to get my desired output would be great.
The problem is that in your map you're calling the apply method of Row, and as you can see in its Scaladoc, that method returns Any; as the error says, there is no such method > on Any.
You can fix it using the getAs[T] method.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
val spark =
  SparkSession
    .builder
    .master("local[*]")
    .getOrCreate()
import spark.implicits._
val df =
  List(
    (1, 2),
    (1, 0),
    (1, 3),
    (2, 0),
    (2, 0),
    (3, 1),
    (3, 2)
  ).toDF("id", "needed")
val rdd: RDD[(Int, Int)] = df.rdd.map(row => (row.getAs[Int](fieldName = "id"), row.getAs[Int](fieldName = "needed")))
From there you can continue with the aggregation, but you have a few mistakes in your logic.
First, you don't need the count call.
Second, to count the number of times "needed" was greater than zero you can't reduce the raw values with _ + _, because that would just sum the needed values. Map each value to 1 or 0 first and then reduce:
val grouped: RDD[(Int, Int)] = rdd.mapValues(v => if (v > 0) 1 else 0).reduceByKey(_ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2))
PS: You should tell your professor to upgrade to Spark 2 and Scala 2.11 ;)
Edit
Using case classes in the above example:
final case class Data(id: Int, needed: Int)
val rdd: RDD[Data] = df.as[Data].rdd
val grouped: RDD[(Int, Int)] = rdd.map(d => d.id -> (if (d.needed > 0) 1 else 0)).reduceByKey(_ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2))
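If you prefer an accumulator style like the original attempt, aggregateByKey is the RDD operation designed for it; this is a sketch assuming the same rdd: RDD[Data] as above:
val countsViaAggregate: RDD[(Int, Int)] = rdd
  .map(d => d.id -> d.needed)
  .aggregateByKey(0)(
    (acc, v) => if (v > 0) acc + 1 else acc, // fold the values within each partition
    _ + _                                    // merge the partial counts across partitions
  )
// countsViaAggregate.collect() => Array((1,2), (2,0), (3,2))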
There's no need to do the calculation at the rdd level. Aggregation with the data frame should work:
df.groupBy("id").agg(sum(($"needed" > 0).cast("int")).as("positiveCount")).show
+---+-------------+
| id|positiveCount|
+---+-------------+
|  1|            2|
|  3|            2|
|  2|            0|
+---+-------------+
If you have to work at the RDD level, use row.getInt or, as in @Luis' answer, row.getAs[Int] to get the values with an explicit type, then do the comparison and reduceByKey:
df.rdd.map(r => (r.getInt(0), if (r.getInt(1) > 0) 1 else 0)).reduceByKey(_ + _).collect
// res18: Array[(Int, Int)] = Array((1,2), (2,0), (3,2))

Removing the Option type from a joined RDD

There are two rdds.
val pairRDD1 = sc.parallelize(List( ("cat",2), ("girl", 5), ("book", 4),("Tom", 12)))
val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("girl", 12)))
And then I will do this join operation.
val kk = pairRDD1.fullOuterJoin(pairRDD2).collect
It shows this:
kk: Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)), (Tom,(Some(12),None)), (girl,(Some(5),Some(12))), (mouse,(None,Some(4))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))))
If I would like to fill the None with 0 and transform the Option[Int] into Int, what should I code? Thanks!
You can use mapValues on the joined RDD before the collect, as follows:
pairRDD1.fullOuterJoin(pairRDD2).mapValues(pair => (pair._1.getOrElse(0), pair._2.getOrElse(0)))
If you already have the collected array kk, you can map over it instead:
kk.map { case (k, pair) => (k, (pair._1.getOrElse(0), pair._2.getOrElse(0))) }
Based on the comments on the first answer: if you are fine using DataFrames, you can do this with DataFrames having any number of columns.
import org.apache.spark.sql.SparkSession

val ss = SparkSession.builder().master("local[*]").getOrCreate()
val sc = ss.sparkContext
import ss.implicits._
val pairRDD1 = sc.parallelize(List(("cat", 2,9999), ("girl", 5,8888), ("book", 4,9999), ("Tom", 12,6666)))
val pairRDD2 = sc.parallelize(List(("cat", 2,9999), ("cup", 5,7777), ("mouse", 4,3333), ("girl", 12,1111)))
val df1 = pairRDD1.toDF
val df2 = pairRDD2.toDF
val joined = df1.join(df2, df1.col("_1") === df2.col("_1"),"fullouter")
joined.show()
Here _1, _2, etc. are the default column names provided by Spark, but you can rename them if you wish to have proper names (a sketch follows the result below).
Result:
+----+----+----+-----+----+----+
|  _1|  _2|  _3|   _1|  _2|  _3|
+----+----+----+-----+----+----+
|girl|   5|8888| girl|  12|1111|
| Tom|  12|6666| null|null|null|
| cat|   2|9999|  cat|   2|9999|
|null|null|null|  cup|   5|7777|
|null|null|null|mouse|   4|3333|
|book|   4|9999| null|null|null|
+----+----+----+-----+----+----+
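A minimal sketch of that renaming, which also fills the nulls left by the full outer join with 0 as the question asked; the column names key, count1, code1, etc. are just illustrative:
val left  = pairRDD1.toDF("key", "count1", "code1")
val right = pairRDD2.toDF("key", "count2", "code2")

// Joining on the column name collapses the two key columns into one,
// and na.fill(0) replaces the nulls from the full outer join in the numeric columns
val filled = left.join(right, Seq("key"), "fullouter").na.fill(0)
filled.show()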

Convert arbitrary number of columns to Vector

How to convert a group of arbitrary columns to a Mllib Vector?
Basically, the first column of my DataFrame has a fixed name, and then there are a number of arbitrarily named columns, each holding Double values.
Like so:
name | a   | b   | c   |
val1 | 0.0 | 1.0 | 1.0 |
val2 | 2.0 | 1.0 | 5.0 |
There could be any number of columns. I need to get a Dataset of the following:
final case class ValuesRow(name: String, values: Vector)
This can be done in a simple way using VectorAssembler. The columns that are to be merged into a Vector are used as input, in this case all columns except the first.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector // the Vector in ValuesRow is assumed to be this ml type

val df = spark.createDataFrame(Seq(("val1", 0, 1, 1), ("val2", 2, 1, 5)))
  .toDF("name", "a", "b", "c")
val columnNames = df.columns.drop(1) // drop the name column
val assembler = new VectorAssembler()
  .setInputCols(columnNames)
  .setOutputCol("values")
val df2 = assembler.transform(df).select("name", "values").as[ValuesRow]
The result will be a dataset containing the name and values columns:
+----+-------------+
|name|       values|
+----+-------------+
|val1|[0.0,1.0,1.0]|
|val2|[2.0,1.0,5.0]|
+----+-------------+
Here's one way to do it:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
val ds = Seq(
  ("val1", 0.0, 1.0, 1.0),
  ("val2", 2.0, 1.0, 5.0)
).toDF("name", "a", "b", "c").
  as[(String, Double, Double, Double)]
val colList = ds.columns
val keyCol = colList(0)
val valCols = colList.drop(1)
def arrToVec = udf(
  (s: Seq[Double]) => new DenseVector(s.toArray)
)
ds.select(
  col(keyCol), arrToVec(array(valCols.map(x => col(x)): _*)).as("values")
).show
// +----+-------------+
// |name|       values|
// +----+-------------+
// |val1|[0.0,1.0,1.0]|
// |val2|[2.0,1.0,5.0]|
// +----+-------------+

How to pivot a table into a timeseries table in Scala

I have the following table:
index 0 1 2 id
1 9.69 1.18 0.59 62
2 7.38 2.18 0.87 62
3 10.02 1.16 0.29 62
I'm trying to pivot it into a time-series-like table.
Expected Output:
data id
[9.69, 7.38, 10.02] 62
[1.18, 2.18, 1.16] 62
[0.59, 0.87, 0.29] 62
I tried the following code
val table = df.groupBy(df.col("id")).pivot("index").sum("0").cache()
val tablets = table.map(x => new transform(1.until(x.length).map(x.getDouble(_)).toList, x.getString(0)))
case class transform(data:List[Double], start:String)
But it gives only this output:
[9.69, 7.38, 10.02] 62
How can I iterate through all columns and get the desired output table as above?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

class pivot(df: DataFrame) {
  val col1Names = df.drop("id").columns.tail
  val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
    c => struct(lit(c).alias("k"), col(c).alias("v"))
  }: _*))
  val tempdf = df.withColumn("kv", kv)
    .select("index", "kv.k", "kv.v", "id")
    .groupBy("id", "k")
    .pivot("index")
    .agg(first("v"))
    .drop("k")
  val col2Names = tempdf.columns.tail
  val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*)
}
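A hedged usage sketch for this class (using the df constructed just below):
val pivoted = new pivot(df).finaldf
pivoted.show(false)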
In your solution you used groupBy and sum, which aggregate to one row per group. That's why you were getting only one row in your result.
The solution to your problem is a bit complex. I have used a combination of withColumn, explode, array, struct, pivot, groupBy, agg, drop, col, select and alias. The solution follows.
val df = Seq((1, 9.69, 1.18, 0.59, 62),
(2, 7.38, 2.18, 0.87, 62),
(3, 10.02, 1.16, 0.29, 62)).toDF("index", "0", "1", "2", "id")
As described in your question, reading the input above should give you a DataFrame like this:
+-----+-----+----+----+---+
|index|0 |1 |2 |id |
+-----+-----+----+----+---+
|1 |9.69 |1.18|0.59|62 |
|2 |7.38 |2.18|0.87|62 |
|3 |10.02|1.16|0.29|62 |
+-----+-----+----+----+---+
If so, the following solution should work:
val col1Names = df.drop("id").columns.tail
val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
val tempdf = df.withColumn("kv", kv)
  .select("index", "kv.k", "kv.v", "id")
  .groupBy("id", "k")
  .pivot("index")
  .agg(first("v"))
  .orderBy("k")
  .drop("k")
val col2Names = tempdf.columns.tail
val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*).sort($"data".desc)
You should get the following output:
+---+-------------------+
|id |data               |
+---+-------------------+
|62 |[9.69, 7.38, 10.02]|
|62 |[1.18, 2.18, 1.16] |
|62 |[0.59, 0.87, 0.29] |
+---+-------------------+
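For completeness, a hedged alternative sketch: the explode over an array of structs can also be replaced by the built-in stack function to unpivot the three value columns, with the rest of the logic unchanged. This assumes the same df as above and import org.apache.spark.sql.functions._; the column names are the ones from the question.
// Unpivot the value columns "0", "1", "2" into (k, v) rows, keeping id and index
val unpivoted = df.selectExpr("id", "index", "stack(3, '0', `0`, '1', `1`, '2', `2`)")
  .withColumnRenamed("col0", "k")
  .withColumnRenamed("col1", "v")
// Pivot back on index so each original column becomes one row, then collect the values into an array
val regrouped = unpivoted.groupBy("id", "k").pivot("index").agg(first("v")).orderBy("k").drop("k")
val dataCols = regrouped.columns.tail // the pivoted index columns
val alt = regrouped.withColumn("data", array(dataCols.map(col): _*)).select("data", "id")
alt.show(false)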