Related
I have a spark Dataframe like Below.I'm trying to split the column into 2 more columns:
date time content
28may 11am [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp",split(col("content"),"\\[")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s/col$i)): _*)
date time content col1 col2 col3
28may 11 [ssid][customerid,shopid] ssid customerid shopid
Assuming a String to represent an Array of Words. Got your request. You can optimize the number of dataframes as well to reduce load on system. If there are more than 9 cols etc. you may need to use c00, c01, etc. for c10 etc. Or just use integer as name for columns. leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp")
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2")
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3")
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |c6 |
+---+---+----------+------+------+-----+----+------+
|B |foo|null |null |null |null |null|null |
|A |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
Update:
Based on original title stating array of words. See other answer.
If new, then a few things here. Can also be done with dataset and map I assume. Here is a solution using DFs and rdd's. I might well investigate a complete DS in future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
("B", Array(Array("foo2", "bar2"), Array("single2"))),
("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// flatten via 2x explode, can be done more elegeantly with def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in correct col order:
+---+----+----+-------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |
+---+----+----+-------+-----+----+------+
|B |foo2|bar2|single2|null |null|null |
|C |foo3|bar3|x |y |z |null |
|A |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+
I have a file with 20+ columns of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smart way to do it, but not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))```
The next solution provides an easy and scalable way to manage your column names and indices. It is based on a map which determines the column name/index relation. The map will also help us to handle both the index of the extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema).show
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums in this example we will extract the cols "c0" -> 0 and "c2" -> 2
Next we create the schema from the keys of the map
The one map (which you already have) will split the line by |, the second one will create a Row containing the values that correspond to each item of dictColums.values
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.sql.DataFrame
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
As below,if you don't want to write repeated x(i),you can process it in a loop. Example 1:
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
for (i <- Array(0,1,5,6...)){
xbuffer.append(x(i))
}
xbuffer
})
If you only want to define the index list with start&end and the numbers to be excluded, see Example 2 of below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})
I am new to Spark and Scala, so I have no idea how this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
My data, but here goes, step-wise:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map{
case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
Better approach avoiding groupByKey is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), (x._2._2))) // Need to add quotes around V part for reduceByKey
val rdd3 = rdd2.reduceByKey(_+_)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // Necessary Shuffle
rdd4.collect
As I stated in the columns it can be done with DataFrames for structured data, so run this below:
// This above should be enough.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
.groupBy("c1", "c2")
.agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do this to approximate the above of the CompactBuffer above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
.withColumn("xy", xyTuple(col("c2"), col("sum")))
.drop("c2")
.drop("sum")
dfC.printSchema
dfC.show
// Then ... this gives you the CompactBuffer answer but from a DF-perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns - some renaming req'd and possible sorting:
---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+
I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried doing collect_set, however, it gives me an error saying grouping expressions sequence is empty Is it possible to find the number of distinct values in each row's array? So that way in our same example, there could be a column like so:
Carrier Count
1: 2
2: 3
3: 2
collect_set is for aggregation hence should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
Without UDF and using RDD conversion and back to DF for posterity:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
Solution above better, as stated more so for posterity.
You can take help for udf and you can do like this.
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
val countUDF=udf{(str:String)=>val strArr=str.split(":"); strArr(0)+":"+strArr(1).split(",").distinct.length.toString}
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:3|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:3|
+-----------+-------------+
There are two rdds.
val pairRDD1 = sc.parallelize(List( ("cat",2), ("girl", 5), ("book", 4),("Tom", 12)))
val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("girl", 12)))
And then I will do this join operation.
val kk = pairRDD1.fullOuterJoin(pairRDD2).collect
it shows like that:
kk: Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)), (Tom,(Some(12),None)), (girl,(Some(5),Some(12))), (mouse,(None,Some(4))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))))
if i would like to fill the NONE by 0 and transform Option[int] to Int.what should I code?Thanks!
You can use mapValues on kk as follows (note this is before the collect):
pairRDD1.fullOuterJoin(pairRDD2).mapValues(pair => (pair._1.getOrElse(0), pair._2.getOrElse(0)))
You might have to do this before collect in an RDD, otherwise you could do:
kk.map { case (k, pair) => (k, (pair._1.getOrElse(0), pair._2.getOrElse(0))) }
Based on commnets in first answer, if you are fine using DataFrames, you can do with dataframes with any number of columns.
val ss = SparkSession.builder().master("local[*]").getOrCreate()
val sc = ss.sparkContext
import ss.implicits._
val pairRDD1 = sc.parallelize(List(("cat", 2,9999), ("girl", 5,8888), ("book", 4,9999), ("Tom", 12,6666)))
val pairRDD2 = sc.parallelize(List(("cat", 2,9999), ("cup", 5,7777), ("mouse", 4,3333), ("girl", 12,1111)))
val df1 = pairRDD1.toDF
val df2 = pairRDD2.toDF
val joined = df1.join(df2, df1.col("_1") === df2.col("_1"),"fullouter")
joined.show()
Here _1,_2 e.t.c are default column names provided by Spark. But, if you wish to have proper names you can change it as you wish.
Result:
+----+----+----+-----+----+----+
| _1| _2| _3| _1| _2| _3|
+----+----+----+-----+----+----+
|girl| 5|8888| girl| 12|1111|
| Tom| 12|6666| null|null|null|
| cat| 2|9999| cat| 2|9999|
|null|null|null| cup| 5|7777|
|null|null|null|mouse| 4|3333|
|book| 4|9999| null|null|null|
+----+----+----+-----+----+----+