Spark how to transform RDD[Seq[(String, String)]] to RDD[(String, String)] - scala

I have a Spark RDD[Seq[(String,String)]] that contains several groups of two words. I need to save them to a file in HDFS like this (regardless of which Seq they are in):
dog cat
cat mouse
mouse milk
Could someone help me with this? Thanks a lot <3
EDIT:
Thanks for your help. Here is the solution
Code
val seqTermTermRDD: RDD[Seq[(String, String)]] = ...
val termTermRDD: RDD[(String, String)] = seqTermTermRDD.flatMap(identity)
val combinedTermsRDD: RDD[String] = termTermRDD.map{ case(term1, term2) => term1 + " " + term2 }
combinedTermsRDD.saveAsTextFile(outputFile)

RDDs have a neat function called "flatMap" that will do exactly what you want. Think of it as a map followed by a flatten (except implemented a little more intelligently): if the function produces multiple elements, each one is added to the resulting RDD separately. (flatMap is also available on many other Scala collections.)
val seqRDD = sc.parallelize(Seq(Seq(("dog","cat"),("cat","mouse"),("mouse","milk"))),1)
val tupleRDD = seqRDD.flatMap(identity)
tupleRDD.collect //Array((dog,cat), (cat,mouse), (mouse,milk))
Note that I also use Scala's identity function, because flatMap expects a function that turns an element of the RDD into a TraversableOnce, and a Seq qualifies.
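For anyone who wants to try the idea without a cluster, here is a minimal sketch on plain Scala collections. It assumes a local Seq can stand in for the RDD (the object name FlattenPairs is made up for illustration); flatMap(identity) behaves the same way on both.

```scala
object FlattenPairs {
  // Local stand-in for the RDD[Seq[(String, String)]] from the question.
  val nested: Seq[Seq[(String, String)]] = Seq(
    Seq(("dog", "cat"), ("cat", "mouse")),
    Seq(("mouse", "milk"))
  )

  // flatMap(identity) removes one level of nesting, just like flatten.
  val pairs: Seq[(String, String)] = nested.flatMap(identity)

  // One output line per pair, formatted as in the question's EDIT.
  val lines: Seq[String] = pairs.map { case (a, b) => s"$a $b" }

  def main(args: Array[String]): Unit = lines.foreach(println)
}
```

On a real RDD the last step would be saveAsTextFile instead of println.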

You can also use the mkString(sep) function (where sep is the separator) on Scala collections. Here are some examples. (Note that in your code you would replace the final .collect().mkString("\n") with saveAsTextFile(filepath) to save to Hadoop.)
scala> val rdd = sc.parallelize(Seq( Seq(("a", "b"), ("c", "d")), Seq( ("1", "2"), ("3", "4") ) ))
rdd: org.apache.spark.rdd.RDD[Seq[(String, String)]] = ParallelCollectionRDD[6102] at parallelize at <console>:71
scala> rdd.map( _.mkString("\n")) .collect().mkString("\n")
res307: String =
(a,b)
(c,d)
(1,2)
(3,4)
scala> rdd.map( _.mkString("|")) .collect().mkString("\n")
res308: String =
(a,b)|(c,d)
(1,2)|(3,4)
scala> rdd.map( _.mkString("\n")).map(_.replace("(", "").replace(")", "").replace(",", " ")) .collect().mkString("\n")
res309: String =
a b
c d
1 2
3 4
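One caveat with the replace-based cleanup, shown as a small sketch on plain Scala collections: it also rewrites commas and parentheses that belong to the words themselves. The sample data and the object name FormatPairs are invented for illustration, and the replace chain is applied per tuple rather than per Seq, which should not change the effect.

```scala
object FormatPairs {
  // The second pair deliberately contains a comma and parentheses.
  val pairs: Seq[(String, String)] = Seq(("a", "b"), ("rock,roll", "(live)"))

  // String-replace approach: works on the tuple's toString output.
  val viaReplace: Seq[String] =
    pairs.map(_.toString.replace("(", "").replace(")", "").replace(",", " "))

  // Pattern-match approach: only the tuple structure is touched.
  val viaMatch: Seq[String] = pairs.map { case (a, b) => s"$a $b" }

  def main(args: Array[String]): Unit = {
    viaReplace.foreach(println)
    viaMatch.foreach(println)
  }
}
```

If the words can never contain those characters, both approaches give the same lines.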

Related

scala: Create a Sequence of tuples with a constant key

Given a constant value and a potentially long Sequence:
a:String = "A"
bs = List(1, 2, 3)
How can you most efficiently construct a Sequence of tuples with the first element equalling a?
Seq(
( "A", 1 ),
( "A", 2 ),
( "A", 3 )
)
Just use a map:
val list = List(1,2,3)
list.map(("A",_))
Output:
res0: List[(String, Int)] = List((A,1), (A,2), (A,3))
The most efficient option is probably to pass just the sequence on to the receiver and let the tuples be built there on demand, so I'd do it with views.
val first = "A"
val bs = (1 to 1000000).view
further( bs.map((first, _)) )
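To illustrate what the view buys you, here is a rough sketch. The counter built and the object name LazyTuples are additions for demonstration only; the point is that the mapping function runs only for the elements that are actually consumed.

```scala
object LazyTuples {
  val first = "A"

  // Counts how many tuples were actually constructed (demo only).
  var built = 0

  // The view defers the map until elements are requested.
  val tuples = (1 to 1000000).view.map { b => built += 1; (first, b) }

  // Only the requested prefix is materialized here.
  val head3: List[(String, Int)] = tuples.take(3).toList

  def main(args: Array[String]): Unit = println(head3)
}
```

Without .view, the map would build all one million tuples before take(3) ran.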
You can do it using map just like in the answer provided by @Pedro or you can use for and yield as below:
val list = List(1,2,3)
val tuple = for {
i <- list
} yield ("A",i)
println(tuple)
Output:
List((A,1), (A,2), (A,3))
You also ask about efficiency. Developers disagree about the relative efficiency of for and map, so the links below may help:
for vs map in functional programming
Scala style: for vs foreach, filter, map and others
Getting the desugared part of a Scala for/comprehension expression?
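As a quick check, a for comprehension with a single generator and a yield is rewritten by the compiler into a map call, so the two forms should give identical results. A minimal sketch (the object name ForVsMap is made up):

```scala
object ForVsMap {
  val list = List(1, 2, 3)

  // Explicit map.
  val viaMap: List[(String, Int)] = list.map(i => ("A", i))

  // Single-generator comprehension; desugars to list.map(i => ("A", i)).
  val viaFor: List[(String, Int)] = for (i <- list) yield ("A", i)

  def main(args: Array[String]): Unit = println(viaMap == viaFor)
}
```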

Flatmap on dataframe

What is the best way to perform a flatMap on a DataFrame in Spark?
From searching around and doing some testing, I have come up with two different approaches. Both of these have some drawbacks so I'm thinking that there should be some better/easier way to do it.
The first way I have found is to first convert the DataFrame into an RDD and then back again:
val map = Map("a" -> List("c","d","e"), "b" -> List("f","g","h"))
val df = List(("a", 1.0), ("b", 2.0)).toDF("x", "y")
val rdd = df.rdd.flatMap{ row =>
val x = row.getAs[String]("x")
val y = row.getAs[Double]("y")
for(v <- map(x)) yield Row(v,y)
}
val df2 = spark.createDataFrame(rdd, df.schema)
The second approach is to create a DataSet before using the flatMap (using the same variables as above) and then convert back:
val ds = df.as[(String, Double)].flatMap{
case (x, y) => for(v <- map(x)) yield (v,y)
}.toDF("x", "y")
Both of these approaches work quite well when the number of columns is small; however, I have many more than 2 columns. Is there any better way to solve this problem? Preferably in a way where no conversion is necessary.
You can create a second DataFrame from your map:
val mapDF = Map("a" -> List("c","d","e"), "b" -> List("f","g","h")).toList.toDF("key", "value")
Then do the join and apply the explode function:
val joinedDF = df.join(mapDF, df("x") === mapDF("key"), "inner")
.select("value", "y")
.withColumn("value", explode($"value"))
And you get the solution.
joinedDF.show()
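If it helps to see what the join plus explode computes, here is a rough model on plain Scala collections, reusing the data from the question. This is only an analogue under the assumption that an inner join drops rows whose key is missing from the map; it is not Spark code, and the object name ExplodeModel is invented.

```scala
object ExplodeModel {
  val lookup = Map("a" -> List("c", "d", "e"), "b" -> List("f", "g", "h"))
  val rows = List(("a", 1.0), ("b", 2.0))

  // Inner join on the key, then one output row per list element (the "explode").
  val exploded: List[(String, Double)] = for {
    (x, y) <- rows
    values <- lookup.get(x).toList // a missing key drops the row
    v <- values
  } yield (v, y)

  def main(args: Array[String]): Unit = exploded.foreach(println)
}
```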

Spark merge/combine arrays in groupBy/aggregate

The following Spark code correctly demonstrates what I want to do and generates the correct output with a tiny demo data set.
When I run this same general type of code on a large volume of production data, I am having runtime problems. The Spark job runs on my cluster for ~12 hours and fails out.
Just glancing at the code below, it seems inefficient to explode every row, just to merge it back down. In the given test data set, the fourth row, with three values in array_value_1 and three values in array_value_2, will explode to 3*3 = 9 rows.
So, in a larger data set, a row with five such array columns, and ten values in each column, would explode out to 10^5 exploded rows?
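The arithmetic can be sketched in a few lines of plain Scala: with one explode chained per array column, each input row contributes the product of its array lengths. The helper explodedRows and the object name are made up for this illustration.

```scala
object ExplodeCount {
  // Rows produced by chaining one explode per array column:
  // each input row contributes the product of its array lengths.
  def explodedRows(arrayLengths: Seq[Seq[Int]]): Long =
    arrayLengths.map(_.map(_.toLong).product).sum

  // Array lengths (array_value_1, array_value_2) of the four demo rows.
  val demo: Long = explodedRows(Seq(Seq(2, 2), Seq(2, 2), Seq(2, 1), Seq(3, 3)))

  // A single row with five array columns of ten values each.
  val wide: Long = explodedRows(Seq(Seq.fill(5)(10)))

  def main(args: Array[String]): Unit = println(s"demo=$demo wide=$wide")
}
```

So yes, under that reading a single such row would become 10^5 exploded rows.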
Looking at the provided Spark functions, there are no out of the box functions that would do what I want. I could supply a user-defined-function. Are there any speed drawbacks to that?
val sparkSession = SparkSession.builder
.master("local")
.appName("merge list test")
.getOrCreate()
val schema = StructType(
StructField("category", IntegerType) ::
StructField("array_value_1", ArrayType(StringType)) ::
StructField("array_value_2", ArrayType(StringType)) ::
Nil)
val rows = List(
Row(1, List("a", "b"), List("u", "v")),
Row(1, List("b", "c"), List("v", "w")),
Row(2, List("c", "d"), List("w")),
Row(2, List("c", "d", "e"), List("x", "y", "z"))
)
val df = sparkSession.createDataFrame(rows.asJava, schema)
val dfExploded = df.
withColumn("scalar_1", explode(col("array_value_1"))).
withColumn("scalar_2", explode(col("array_value_2")))
// This will output 19. 2*2 + 2*2 + 2*1 + 3*3 = 19
logger.info(s"dfExploded.count()=${dfExploded.count()}")
val dfOutput = dfExploded.groupBy("category").agg(
collect_set("scalar_1").alias("combined_values_1"),
collect_set("scalar_2").alias("combined_values_2"))
dfOutput.show()
Exploding may well be inefficient, but fundamentally the operation you are trying to implement is simply expensive. Effectively it is just another groupByKey, and there is not much you can do here to make it better. Since you use Spark > 2.0, you could collect_list directly and flatten:
import org.apache.spark.sql.functions.{collect_list, udf}
val flatten_distinct = udf(
(xs: Seq[Seq[String]]) => xs.flatten.distinct)
df
.groupBy("category")
.agg(
flatten_distinct(collect_list("array_value_1")),
flatten_distinct(collect_list("array_value_2"))
)
In Spark >= 2.4 you can replace udf with composition of built-in functions:
import org.apache.spark.sql.functions.{array_distinct, flatten}
val flatten_distinct = (array_distinct _) compose (flatten _)
It is also possible to use a custom Aggregator, but I doubt any of these will make a huge difference.
If the sets are relatively large and you expect a significant number of duplicates, you could try to use aggregateByKey with mutable sets:
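To see what the flatten-then-distinct step yields on the demo rows, here is a plain Scala model. It assumes groupBy on a local List can stand in for groupBy plus collect_list; the names are illustrative only.

```scala
object FlattenDistinctModel {
  // (category, array_value_1, array_value_2), mirroring the demo rows.
  val rows = List(
    (1, List("a", "b"), List("u", "v")),
    (1, List("b", "c"), List("v", "w")),
    (2, List("c", "d"), List("w")),
    (2, List("c", "d", "e"), List("x", "y", "z"))
  )

  // Same body as the UDF: concatenate the collected arrays, then drop duplicates.
  def flattenDistinct(xs: Seq[Seq[String]]): Seq[String] = xs.flatten.distinct

  val merged: Map[Int, (Seq[String], Seq[String])] =
    rows.groupBy(_._1).map { case (category, group) =>
      (category, (flattenDistinct(group.map(_._2)), flattenDistinct(group.map(_._3))))
    }

  def main(args: Array[String]): Unit = merged.foreach(println)
}
```

No row is ever multiplied out here, which is the main difference from the double explode.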
import scala.collection.mutable.{Set => MSet}
val rdd = df
.select($"category", struct($"array_value_1", $"array_value_2"))
.as[(Int, (Seq[String], Seq[String]))]
.rdd
val agg = rdd
.aggregateByKey((MSet[String](), MSet[String]()))(
{case ((accX, accY), (xs, ys)) => (accX ++= xs, accY ++= ys)},
{case ((accX1, accY1), (accX2, accY2)) => (accX1 ++= accX2, accY1 ++= accY2)}
)
.mapValues { case (xs, ys) => (xs.toArray, ys.toArray) }
.toDF
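For intuition about the two functions passed to aggregateByKey, here is a local simulation in plain Scala. The two "partitions" are hypothetical and hold the category 1 rows from the demo data; seqOp folds rows into an accumulator within a partition, and combOp merges the partial accumulators.

```scala
import scala.collection.mutable.{Set => MSet}

object AggregateModel {
  type Acc = (MSet[String], MSet[String])

  // seqOp: fold one row into the accumulator, mutating the sets in place.
  def seqOp(acc: Acc, row: (Seq[String], Seq[String])): Acc = {
    acc._1 ++= row._1
    acc._2 ++= row._2
    acc
  }

  // combOp: merge two partial accumulators, e.g. from different partitions.
  def combOp(left: Acc, right: Acc): Acc = {
    left._1 ++= right._1
    left._2 ++= right._2
    left
  }

  // A fresh, empty accumulator for each partition.
  def zero: Acc = (MSet[String](), MSet[String]())

  // Two hypothetical partitions holding the category 1 rows.
  val partition1 = Seq((Seq("a", "b"), Seq("u", "v")))
  val partition2 = Seq((Seq("b", "c"), Seq("v", "w")))

  val result: Acc = combOp(
    partition1.foldLeft(zero)(seqOp),
    partition2.foldLeft(zero)(seqOp)
  )

  def main(args: Array[String]): Unit = println(result)
}
```

Because the sets deduplicate as they go, each accumulator never grows beyond the number of distinct values per key.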

Spark RDD map internal object to Row

My initial data from a CSV file is:
1 ,21658392713 ,21626890421
1 ,21623461747 ,21626890421
1 ,21623461747 ,21626890421
After a few transformations and a grouping based on business logic, the data looks like this:
scala> val sGrouped = grouped
sGrouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])] = ShuffledRDD[85] at groupBy at <console>:51
scala> sGrouped.foreach(f=>println(f))
(21626890421,CompactBuffer((21626890421,
([Ljava.lang.String;@62ac8444,21626890421)),
(21626890421,([Ljava.lang.String;@59d80fe,21626890421)),
(21626890421,([Ljava.lang.String;@270042e8,21626890421)),
From this I want to get a map in something like the following format:
[String, Row[String]]
so the data may look like:
[ 21626890421 , Row[(1 ,21658392713 ,21626890421)
, (1 ,21623461747 ,21626890421)
, (1 ,21623461747,21626890421)]]
I really appreciate any guidance on moving forward on this.
I found the answer, but I am not sure if this is an efficient way, any better approaches are appreciated, as this feels more like a hack.
scala> import org.apache.spark.sql.Row
scala> val grouped = cToP.groupBy(_._1)
grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])]
scala> val sGrouped = grouped.map(f => f._2.toList)
sGrouped: org.apache.spark.rdd.RDD[List[(String, (Array[String],
String))]]
scala> val tGrouped = sGrouped.map(f =>f.map(_._2).map(c =>
Row(c._1(0), c._1(12), c._1(18))))
tGrouped: org.apache.spark.rdd.RDD[List[org.apache.spark.sql.Row]] =
MapPartitionsRDD[42] a
scala> tGrouped.foreach(f => println(f))
yields
List([1,21658392713,21626890421], [1,21623461747,21626890421],
[1,21623461747,21626890421])
scala> tGrouped.count()
res6: Long = 1
The answer I am getting is correct, and even the count is correct. However, I do not understand why the count is 1.
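A possible explanation for the count, sketched on plain Scala collections: after groupBy there is one element per distinct key, and all three sample records share the same key, so mapping over the groups gives a single List. The record strings and object name below are simplified stand-ins, not the real data.

```scala
object GroupCount {
  // Three records that all share the same grouping key, as in the sample data.
  val records = List(
    ("21626890421", "1,21658392713,21626890421"),
    ("21626890421", "1,21623461747,21626890421"),
    ("21626890421", "1,21623461747,21626890421")
  )

  // One element per distinct key, each holding that key's records.
  val perKey: List[List[String]] =
    records.groupBy(_._1).values.map(_.map(_._2)).toList

  // Flattening yields one element per record instead.
  val perRecord: List[String] = perKey.flatten

  def main(args: Array[String]): Unit =
    println(s"groups=${perKey.size} records=${perRecord.size}")
}
```

If one element per record is what you need, a flatMap over the groups instead of map should give that.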

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in spark. For that I need a tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note that flatMap(identity) replicates the functionality of flatten on an RDD, and that reduceByKey accepts the concise underscore notation (_ + _).
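The same shape can be tried on plain Scala collections. In this sketch a List stands in for the RDD, and groupBy followed by a per-key reduce is used as a local analogue of reduceByKey (that substitution is an assumption for illustration, not how Spark implements it).

```scala
object WordCountModel {
  val lines = List("a", "b", "a")

  // Mirrors lines.map(s => Map(s -> 1)) from the question.
  val maps: List[Map[String, Int]] = lines.map(s => Map(s -> 1))

  // A Map is a collection of (key, value) tuples, so flatMap(identity) unpacks it.
  val pairs: List[(String, Int)] = maps.flatMap(identity)

  // Local analogue of reduceByKey(_ + _).
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).reduce(_ + _)) }

  def main(args: Array[String]): Unit = println(counts)
}
```

Mapping each line straight to a tuple, s => (s, 1), would avoid the Map and the flatMap altogether.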