reduceByKey Spark maintain order - Scala

My input dataset looks like
id1, 10, v1
id2, 9, v2
id2, 34, v3
id1, 6, v4
id1, 12, v5
id2, 2, v6
and I want output
id1; 6,v4 | 10,v1 | 12,v5
id2; 2,v6 | 9,v2 | 34,v3
That is, for each id I want an array of (num(i), value(i)) pairs where the num(i) are sorted.
What I have tried:
Use the id and the 2nd column together as the key and sortByKey, but since the key is a string, the sorting is lexicographic rather than numeric.
Use the 2nd column as the key and sortByKey, then put the id in the key and the 2nd column in the value and reduceByKey. But in this case the order is not preserved by reduceByKey. Even groupByKey does not preserve the order, which is actually expected.
Any help will be appreciated.

Since you didn't provide any information about the input type, I assume it is RDD[(String, Int, String)]:
val rdd = sc.parallelize(
  ("id1", 10, "v1") :: ("id2", 9, "v2") ::
  ("id2", 34, "v3") :: ("id1", 6, "v4") ::
  ("id1", 12, "v5") :: ("id2", 2, "v6") :: Nil)

rdd
  .map { case (id, x, y) => (id, (x, y)) }
  .groupByKey
  .mapValues(iter => iter.toList.sortBy(_._1))
  .sortByKey() // Optional if you want id1 before id2
Edit:
To get the output you've described in the comments, you can replace the function passed to mapValues with something like this:
def process(iter: Iterable[(Int, String)]): String = {
  iter.toList
    .sortBy(_._1)
    .map { case (x, y) => s"$x,$y" }
    .mkString("|")
}
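
For example, plugging process into the pipeline above (a quick sketch reusing the same rdd; collect is only for demonstration):

rdd
  .map { case (id, x, y) => (id, (x, y)) }
  .groupByKey
  .mapValues(process)
  .sortByKey()
  .collect
  .foreach { case (id, s) => println(s"$id; $s") }
// id1; 6,v4|10,v1|12,v5
// id2; 2,v6|9,v2|34,v3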

Related

Spark scala - Dataframes comparison

How to compare 2 Dataframes based on PK.
Basically I want to create Scala Spark code to compare 2 big DataFrames (10M records each, 100 columns each) and show output as:
ID Diff
1 [ {Col1: [1,2]}, {col3: [5,10]} ...]
2 [ {Col3: [4,2]}, {col7: [2,6]} ...]
ID is the PK.
Diff column: for each column where there is a difference, show the column name followed by the two differing values from that column.
Each differing column can be converted to a string, and then all such strings are concatenated:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

// ---- data ----
val leftDF = Seq(
  (1, 1, 5, 0),
  (2, 0, 4, 2)
).toDF("ID", "Col1", "col3", "col7")

val rightDF = Seq(
  (1, 2, 10, 0),
  (2, 0, 2, 6)
).toDF("ID", "Col1", "col3", "col7")

// Builds "{name: [leftValue,rightValue]}" when the values differ, "" otherwise
def getDifferenceForColumn(name: String): Column =
  when(
    col("l." + name) =!= col("r." + name),
    concat(lit("{" + name + ": ["), col("l." + name), lit(","), col("r." + name), lit("]}")))
    .otherwise(lit(""))

val diffColumn = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))
  .reduce((l, r) => concat(l,
    when(length(r) =!= 0 && length(l) =!= 0, lit(",")).otherwise(lit("")),
    r))

val diffColumnWithBraces = concat(lit("["), diffColumn, lit("]"))

leftDF
  .alias("l")
  .join(rightDF.alias("r"), Seq("ID"))
  .select(col("ID"), diffColumnWithBraces.alias("DIFF"))
Output:
+---+------------------------------+
|ID |DIFF |
+---+------------------------------+
|1 |[{Col1: [1,2]},{col3: [5,10]}]|
|2 |[{col3: [4,2]},{col7: [2,6]}] |
+---+------------------------------+
If the column values can never contain the string "}{", the two variables in the solution above can be changed as follows; this may perform better:
val diffColumns = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))

val diffColumnWithBraces =
  concat(lit("["), regexp_replace(concat(diffColumns: _*), "\\}\\{", "},{"), lit("]"))
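
The final select stays the same (just reusing the join and aliases from the first snippet):

leftDF
  .alias("l")
  .join(rightDF.alias("r"), Seq("ID"))
  .select(col("ID"), diffColumnWithBraces.alias("DIFF"))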
A UDF can also be used; the incoming data and the output are the same as in my first answer:
val colNames = leftDF
  .columns
  .filter(_ != "ID")

val generateSeqDiff = (colNames: Seq[String], leftValues: Seq[Any], rightValues: Seq[Any]) => {
  val nameValues = colNames
    .zip(leftValues)
    .zip(rightValues)
    .filterNot { case ((_, l), r) => l == r }
    .map { case ((name, l), r) => s"{$name: [$l,$r]}" }
    .mkString(",")
  s"[$nameValues]"
}

val generateSeqDiffUDF = udf(generateSeqDiff)

leftDF
  .select($"ID", array(colNames.head, colNames.tail: _*).alias("leftValues"))
  .alias("l")
  .join(
    rightDF
      .select($"ID", array(colNames.head, colNames.tail: _*).alias("rightValues"))
      .alias("r"), Seq("ID"))
  .select($"ID", generateSeqDiffUDF(lit(colNames), $"leftValues", $"rightValues").alias("DIFF"))

how to filter few rows in a table using Scala

Using Scala:
I have an emp table as below
id, name, dept, address
1, a, 10, hyd
2, b, 10, blr
3, a, 5, chn
4, d, 2, hyd
5, a, 3, blr
6, b, 2, hyd
Code:
val inputFile = sc.textFile("hdfs:/user/edu/emp.txt")
val inputRdd = inputFile.map(iLine => (
  iLine.split(",")(0),
  iLine.split(",")(1),
  iLine.split(",")(3)
))
// selecting only a few columns; now I want to pull the complete data for the employees whose address is hyd
Problem: I don't want to print all employee details, only the details of the employees who are from hyd.
I have loaded this emp dataset into an RDD and split it on ','; now I want to print only the employees whose address is hyd.
I think the solution below will help solve your problem.
val fileName = "/path/stact_test.txt"

val strRdd = sc.textFile(fileName).map { line =>
  val data = line.split(",")
  (data(0), data(1), data(3))
}.filter(rec => rec._3.toLowerCase.trim.equals("hyd"))
After splitting the data, filter on the location using the 3rd item of the tuple RDD.
Output:
(1, a, hyd)
(4, d, hyd)
(6, b, hyd)
You may also try using a DataFrame:

import org.apache.spark.sql.functions._
import spark.implicits._

val viewsDF = spark.read.text("hdfs:/user/edu/emp.txt")

val splitedViewsDF = viewsDF
  .withColumn("id", split($"value", ",").getItem(0))
  .withColumn("name", split($"value", ",").getItem(1))
  .withColumn("address", split($"value", ",").getItem(3))
  .drop($"value")
  .filter(trim($"address") === "hyd")
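
With the sample data from the question this should keep only the Hyderabad employees (ids 1, 4 and 6):

// Expected to show ids 1, 4 and 6 with their name and address columns.
splitedViewsDF.show()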

Sample a different number of random rows for every group in a dataframe in spark scala

The goal is to sample (without replacement) a different number of rows in a dataframe for every group. The number of rows to sample for a specific group is in another dataframe.
Example: idDF is the dataframe to sample from. The groups are denoted by the ID column. The dataframe, planDF specifies the number of rows to sample for each group where "datesToUse" denotes the number of rows, and "ID" denotes the group. "totalDates" is the total number of rows for that group and may or may not be useful.
The final result should have 3 rows sampled from the first group (ID 1), 2 rows sampled from the second group (ID 2) and 1 row sampled from the third group (ID 3).
val idDF = Seq(
(1, "2017-10-03"),
(1, "2017-10-22"),
(1, "2017-11-01"),
(1, "2017-10-02"),
(1, "2017-10-09"),
(1, "2017-12-24"),
(1, "2017-10-20"),
(2, "2017-11-17"),
(2, "2017-11-12"),
(2, "2017-12-02"),
(2, "2017-10-03"),
(3, "2017-12-18"),
(3, "2017-11-21"),
(3, "2017-12-13"),
(3, "2017-10-08"),
(3, "2017-10-16"),
(3, "2017-12-04")
).toDF("ID", "date")
val planDF = Seq(
(1, 3, 7),
(2, 2, 4),
(3, 1, 6)
).toDF("ID", "datesToUse", "totalDates")
This is an example of what a resulting dataframe could look like:
+---+----------+
| ID| date|
+---+----------+
| 1|2017-10-22|
| 1|2017-11-01|
| 1|2017-10-20|
| 2|2017-11-12|
| 2|2017-10-03|
| 3|2017-10-16|
+---+----------+
So far, I tried to use the sample method for DataFrame: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html
Here is an example that would work for an entire data frame.
def sampleDF(DF: DataFrame, datesToUse: Int, totalDates: Int): DataFrame = {
  val fraction = datesToUse / totalDates.toFloat.toDouble
  DF.sample(false, fraction)
}
I can't figure out how to use something like this for each group. I tried joining the planDF table to the idDF table and using a window partition.
Another idea I had was to somehow make a new column with randomly labeled True/False values and then filter on that column.
Another option, staying entirely in DataFrames, would be to compute probabilities using your planDF, join with idDF, append a column of random numbers, and then filter. Helpfully, sql.functions has a rand function.
import org.apache.spark.sql.functions._
import spark.implicits._

val probabilities = planDF.withColumn("prob", $"datesToUse" / $"totalDates")

val dfWithProbs = idDF.join(probabilities, Seq("ID"))
  .withColumn("rand", rand())
  .where($"rand" < $"prob")
(You'll want to double check that that isn't integer division.)
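If you want to be explicit about avoiding integer division, you can cast one side first, e.g.:

val probabilities = planDF.withColumn("prob", $"datesToUse".cast("double") / $"totalDates")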
With the assumption that your planDF is small enough to be collected, you can use Scala's foldLeft to traverse the id list and accumulate the sampled DataFrames per id:
import org.apache.spark.sql.{Row, DataFrame}
def sampleByIdDF(DF: DataFrame, id: Int, datesToUse: Int, totalDates: Int): DataFrame = {
val fraction = datesToUse.toDouble / totalDates
DF.where($"id" === id ).sample(false, fraction)
}
val emptyDF = Seq.empty[(Int, String)].toDF("ID", "date")
val planList = planDF.rdd.collect.map{ case Row(x: Int, y: Int, z: Int) => (x, y, z) }
// planList: Array[(Int, Int, Int)] = Array((1,3,7), (2,2,4), (3,1,6))
planList.foldLeft( emptyDF ){
case (accDF: DataFrame, (id: Int, num: Int, total: Int)) =>
accDF union sampleByIdDF(idDF, id, num, total)
}
// res1: org.apache.spark.sql.DataFrame = [ID: int, date: string]
// res1.show
// +---+----------+
// | ID| date|
// +---+----------+
// | 1|2017-10-03|
// | 1|2017-11-01|
// | 1|2017-10-02|
// | 1|2017-12-24|
// | 1|2017-10-20|
// | 2|2017-11-17|
// | 2|2017-11-12|
// | 2|2017-12-02|
// | 3|2017-11-21|
// | 3|2017-12-13|
// +---+----------+
Note that method sample() does not necessarily generate the exact number of samples specified in the method arguments. Here's a relevant SO Q&A.
If your planDF is large, you might have to consider using RDD's aggregate, which has the following signature (skipping the implicit argument):
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
It works somewhat like foldLeft, except that it has one accumulation operator within a partition and an additional one to combine results from different partitions.
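
As a rough illustration of how the two operators fit together (a generic example on an RDD[Int], not tied to the sampling problem):

val nums = sc.parallelize(1 to 100)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1), // seqOp: folds each element into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp: merges accumulators from different partitions
)
// sum: Int = 5050, count: Int = 100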

Joining 2 RDDs when one having a Option type as key

I have 2 RDDs I would like to join which look like this:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
Is there any way I can do a left outer join on them?
I have tried this but it does not work because the key types are different, i.e. Int vs Option[Int]:
q.leftOuterJoin(a)
The natural solution is to convert the Int to Option[Int] so they have the same type.
Following your example:

val a: RDD[(Option[Int], V)]
val q: RDD[(Int, V)]

q.map { case (k, v) => (Option(k), v) }.leftOuterJoin(a)

If you want to recover the Int type at the output, you can do this:

q.map { case (k, v) => (Option(k), v) }.leftOuterJoin(a).map { case (k, v) => (k.get, v) }
Note that you can call .get without any problem, since it is not possible to get a None key there.
Another way is to convert the RDDs into DataFrames and join them.
Here is a simple example
import spark.implicits._
val a = spark.sparkContext.parallelize(Seq(
(Some(3), 33),
(Some(1), 11),
(Some(2), 22)
)).toDF("id", "value1")
val q = spark.sparkContext.parallelize(Seq(
(Some(3), 33)
)).toDF("id", "value2")
q.join(a, a("id") === q("id"), "leftouter").show

read individual elements of a tuple from a map((tuple),(tuple)) in scala

The generated output of reduceByKey is a ShuffledRDD with both key and value being tuples of multiple fields. I need to extract all the fields and write them to a Hive table.
Below is the code I was trying:
val USAGE_DATA = sqlContext.sql(
  "select SUBS_CIRCLE_ID, SUBS_MSISDN, EVENT_START_DT, RMNG_NW_OP_KEY, ACCESS_TYPE FROM FACT.FCT_MEDIATED_USAGE_DATA")

val USAGE_DATA_Reduce = USAGE_DATA.map { USAGE_DATA =>
  ((USAGE_DATA.getShort(0), USAGE_DATA.getString(1), USAGE_DATA.getString(2)),
   (USAGE_DATA.getInt(3), USAGE_DATA.getInt(4)))
}.reduceByKey((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2)))
The final output I am expecting is all five fields:
SUBS_CIRCLE_ID, SUBS_MSISDN, EVENT_START_DT, MINVAL, MAXVAL
so that it can be inserted directly into the Hive table.
If you mean:
Given a RDD[(TupleN, TupleM)], how do I map each record's elements of both key and value tuples into a single concatenated string?
Here's a simplified version; you should be able to extrapolate this to solve your problem:
import org.apache.spark.rdd.RDD

val keyValueRdd = sc.parallelize(Seq(
  (1, "key1") -> (10, "value1", "A"),
  (2, "key2") -> (20, "value2", "B"),
  (3, "key3") -> (30, "value3", "C")
))

val asStrings: RDD[String] = keyValueRdd.map {
  case ((k1, k2), (v1, v2, v3)) => List(k1, k2, v1, v2, v3).mkString(",")
}
asStrings.foreach(println)
// prints:
// 3,key3,30,value3,C
// 2,key2,20,value2,B
// 1,key1,10,value1,A
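
Applied to the RDD from the question, the same pattern would look something like this (a sketch; it assumes USAGE_DATA_Reduce is the reduced RDD built above, and the target table name is only a placeholder):

// Flatten the key and value tuples into the five expected fields:
// SUBS_CIRCLE_ID, SUBS_MSISDN, EVENT_START_DT, MINVAL, MAXVAL
val flattened = USAGE_DATA_Reduce.map {
  case ((circleId, msisdn, eventStartDt), (minVal, maxVal)) =>
    (circleId, msisdn, eventStartDt, minVal, maxVal)
}

// From here it can be converted to a DataFrame and written to Hive
// (requires Hive support; "FACT.FCT_TARGET_TABLE" is a placeholder for your actual table):
import sqlContext.implicits._
flattened.toDF("SUBS_CIRCLE_ID", "SUBS_MSISDN", "EVENT_START_DT", "MINVAL", "MAXVAL")
  .write.mode("append").insertInto("FACT.FCT_TARGET_TABLE")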