Nested flatMap in Spark - Scala

In the code snippet below, I declared an RDD by parallelizing List(1,2,3,4). What I wanted to do was append List(1,2,3,4) to each element of that RDD. I did so by using a nested flatMap, since it can return multiple values for each element of an RDD. The code is as follows:
val rand6 = sc.parallelize(List(1, 2, 3, 4))
val bv = sc.broadcast(List(5, 6, 7, 8))
rand6.flatMap(s => {
  val c = List(1, 2, 3, 4)
  val a = List(s, c)
  val b = a.flatMap(r => r)
  b
})
But I am getting the following error:
command-1095314872161512:74: error: type mismatch;
found : Any
required: scala.collection.GenTraversableOnce[?]
val b=a.flatMap(r=>r)
^
Is this a problem with the syntax, or are we not supposed to use flatMap in this fashion? It would be very helpful if someone could help me understand this.

Try to add types wherever possible in your Scala code. The root cause here is that List(s, c) mixes an Int and a List[Int], so its element type is inferred as Any, and flatMap cannot prove that an Any is a collection (hence the GenTraversableOnce error). Prepending with s :: c keeps the element type as Int. Based on your question description, I came up with the solution below:
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object RandomDF {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    val sc = spark.sparkContext

    val rand6: RDD[Int] = sc.parallelize(List(1, 2, 3, 4))
    val bv: Broadcast[List[Int]] = sc.broadcast(List(5, 6, 7, 8))

    val output = rand6.map((s: Int) => {
      val c: List[Int] = List(1, 2, 3, 4)
      val a = s :: c // prepending keeps the type List[Int]
      // val b = a.flatMap(r => r) // not needed: a is already flat
      // b
      a
    }).collect().toList

    println(output)
  }
}
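As a side note, the broadcast variable bv from the question is never used. If the intent was to append the broadcast list to each element (an assumption on my part), a minimal sketch could be:

// Sketch: prepend each element to the broadcast list, yielding one flat RDD[Int].
val appended: RDD[Int] = rand6.flatMap((s: Int) => s :: bv.value)
// appended.collect() => Array(1, 5, 6, 7, 8, 2, 5, 6, 7, 8, ...)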

Related

Circe encoder for a SortedMultiDict

I'm trying to write a Circe encoder for an object which has a field of scala.collection.immutable.SortedMultiDict. Circe can't find an encoder instance for that, so I need to write one.
import io.circe.{Decoder, Encoder, HCursor}
import io.circe.generic.semiauto._
import io.circe.parser.decode
import io.circe.syntax._
import scala.collection.immutable.SortedMultiDict

implicit val mapEncoder: Encoder[List[(Long, String)]] = deriveEncoder[List[(Long, String)]]
implicit val mapDecoder: Decoder[List[(Long, String)]] = deriveDecoder[List[(Long, String)]]

implicit val oneEncoder: Encoder[SortedMultiDict[Long, String]] = (a: SortedMultiDict[Long, String]) =>
  mapEncoder(a.toList)

implicit val oneDecoder: Decoder[SortedMultiDict[Long, String]] = (c: HCursor) =>
  mapDecoder.map(SortedMultiDict.from[Long, String])(c)
Sadly, this isn't correct...
val test = SortedMultiDict.from[Long, String](Seq(
  1666268475626L -> "a5d9f51d-35c7-4fef-b4a3-3d28944eeb2b",
  1666268475626L -> "df359396-043c-4b65-bc3 -bf309d433ff5"))
val encodedData = test.asJson.noSpaces
val roundTrip = decode[SortedMultiDict[Long, String]](encodedData)
results in
scala> roundTrip
val res2: Either[io.circe.Error,scala.collection.immutable.SortedMultiDict[Long,String]] = Left(DecodingFailure(Attempt to decode value on failed cursor, List(DownField(head), DownField(::))))
In fact, the derived list encoder doesn't appear to work...
scala> val myList = List((1666268475626L, "a5d9f51d-35c7-4fef-b4a3-3d28944eeb2b"), (1666268475626L, "df359396-043c-4b65-bc3 -bf309d433ff5"))
val myList: List[(Long, String)] = List((1666268475626,a5d9f51d-35c7-4fef-b4a3-3d28944eeb2b), (1666268475626,df359396-043c-4b65-bc3 -bf309d433ff5))
scala> decode[List[(Long, String)]](myList.asJson.noSpaces)
val res0: Either[io.circe.Error,List[(Long, String)]] = Left(DecodingFailure(Attempt to decode value on failed cursor, List(DownField(head), DownField(::))))
Are my expectations of how to do the round trip of encoding/decoding wrong? It's what I'd understood from Circe's codec docs.
EDIT: Well, it works if I change the map codecs to:
implicit val mapEncoder: Encoder[List[(Long, String)]] = Encoder.encodeList[(Long, String)]
implicit val mapDecoder: Decoder[List[(Long, String)]] = Decoder.decodeList[(Long, String)]
I still don't really understand why the earlier ones don't work, though; explanations welcome...
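A likely explanation (my reading, not confirmed in the thread): semiauto's deriveEncoder/deriveDecoder target case classes and sealed trait hierarchies. List is itself a sealed ADT (Nil and ::), so the derived codecs use its cons-cell object representation, with fields like "::" and "head", rather than a plain JSON array; that is exactly the shape the failure's cursor history (DownField(head), DownField(::)) points at. The built-in collection instances are the right tool, and the SortedMultiDict codecs can be built directly on them:

import io.circe.{Decoder, Encoder}
import scala.collection.immutable.SortedMultiDict

// Sketch: reuse circe's built-in List instances via contramap/map,
// skipping the intermediate implicit list codecs entirely.
implicit val multiDictEncoder: Encoder[SortedMultiDict[Long, String]] =
  Encoder.encodeList[(Long, String)].contramap(_.toList)

implicit val multiDictDecoder: Decoder[SortedMultiDict[Long, String]] =
  Decoder.decodeList[(Long, String)].map(SortedMultiDict.from[Long, String])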

Implicits in a Spark Scala program not working

I am not able to perform an implicit conversion from an RDD to a DataFrame in a Scala program, even though I am importing spark.implicits._.
Any help would be appreciated.
Main Program with the implicits:
object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = spark.sparkContext
    val data = sc.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    case class CC1(LAT: Double, LONG: Double)
    val allData = allSplit.map(p => CC1(p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}
Error is as follows:
Error:(40, 25) value toDF is not a member of org.apache.spark.rdd.RDD[CC1]
val allDF = allData.toDF()
When you define the case class CC1 inside the main method, you hit https://issues.scala-lang.org/browse/SI-6649; toDF() then fails to locate the appropriate implicit TypeTag for that class at compile time.
You can see this in this simple example:
import scala.reflect.runtime.universe.TypeTag

case class Out()

object TestImplicits {
  def main(args: Array[String]) {
    case class In()
    val typeTagOut = implicitly[TypeTag[Out]] // compiles
    val typeTagIn = implicitly[TypeTag[In]]   // does not compile: Error:(23, 31) No TypeTag available for In
  }
}
Spark's relevant implicit conversion has this type parameter: [T <: Product : TypeTag] (see newProductEncoder here), which means an implicit TypeTag[CC1] is required.
To fix this, simply move the definition of CC1 out of the method, or out of the object entirely:
case class CC1(LAT: Double, LONG: Double)

object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val data = spark.sparkContext.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    val allData = allSplit.map(p => CC1(p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}
I thought toDF was in sqlContext.implicits._, so you need to import that rather than spark.implicits._. At least that's the case in Spark 1.6.
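For reference, a sketch of that pre-2.0 import (assuming the Spark 1.x API, where the implicits live on SQLContext rather than SparkSession):

// Spark 1.x style: build a SQLContext around the SparkContext and import its implicits.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._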

Class import error in Scala/Spark

I am new to Spark and I'm using it with Scala. I wrote a simple object that loads fine in spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer

object Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
Now I want to put it in a class so I can pass parameters. I use the same code, with class instead of object.
import org.apache.spark.ml.feature.StringIndexer

class Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
This returns the following errors.
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this; it seems to work fine. It creates its own SparkSession inside the method and imports spark.implicits._, rather than relying on the ones the shell provides:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

def trainModel() = {
  val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
  import spark.implicits._
  val data = spark.read.textFile("/user/PT/data/newfav.csv")
  val df = data.map(_.split(",") match {
    case Array(user, food, fav) => (user, food, fav.toDouble)
  }).toDF("userID", "foodID", "favorite")
  val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
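Since the goal was a class that takes parameters, a sketch of the same code as a class (the constructor parameter, the path parameter, and the fit call are my additions) could look like:

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Pass the SparkSession in as a constructor parameter so the class does not
// depend on shell-provided implicits.
class Collaborative(spark: SparkSession) {
  import spark.implicits._

  def trainModel(path: String) = {
    val df = spark.read.textFile(path)
      .map(_.split(",") match {
        case Array(user, food, fav) => (user, food, fav.toDouble)
      })
      .toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
    userIndexer.fit(df) // returns a StringIndexerModel
  }
}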

Streaming CSV with akka-http in scala

I am very new to akka-http, and I would like to stream a CSV with an arbitrary number of lines.
For instance, I would like to return:
a,1
b,2
c,3
with the following code
implicit val actorSystem = ActorSystem("system")
implicit val actorMaterializer = ActorMaterializer()

val map = new mutable.HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)

val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)

val route =
  path("test") {
    complete {
      HttpEntity(`text/csv`, ??? using map)
    }
  }

Http().bindAndHandle(route, "localhost", 8080)
Thanks for your help
EDIT: Thanks to Ramon J Romero y Vigil
package test

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpCharsets.`UTF-8`
import akka.http.scaladsl.model._
import akka.http.scaladsl.server.Directives._
import akka.stream._
import akka.util.ByteString

import scala.collection.mutable

object Test {
  def main(args: Array[String]) {
    implicit val actorSystem = ActorSystem("system")
    implicit val actorMaterializer = ActorMaterializer()

    val map = new mutable.HashMap[String, Int]()
    map.put("a", 1)
    map.put("b", 2)
    map.put("c", 3)

    val mapStream = Stream.fromIterator(() => map.toIterator)
      .map((k: String, v: Int) => s"$k,$v")
      .map(ByteString.apply)

    val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)

    val route =
      path("test") {
        complete {
          HttpEntity(`text/csv`, mapStream)
        }
      }

    Http().bindAndHandle(route, "localhost", 8080)
  }
}
With this code I get two compile errors:
Error:(29, 28) value fromIterator is not a member of object scala.collection.immutable.Stream
val mapStream = Stream.fromIterator(() => map.toIterator)
Error:(38, 11) overloaded method value apply with alternatives:
(contentType: akka.http.scaladsl.model.ContentType,file: java.io.File,chunkSize: Int)akka.http.scaladsl.model.UniversalEntity <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.stream.scaladsl.Source[akka.util.ByteString,Any])akka.http.scaladsl.model.HttpEntity.Chunked <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.util.ByteString)akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType,bytes: Array[Byte])akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType.NonBinary,string: String)akka.http.scaladsl.model.HttpEntity.Strict
cannot be applied to (akka.http.scaladsl.model.ContentType.WithCharset, List[akka.util.ByteString])
HttpEntity(`text/csv`, mapStream)
I used a List of tuples to get around the first issue (however, I do not know how to stream a Map in Scala). No idea for the second. Thanks for your help.
(I am using Scala 2.11.8.)
Use the HttpEntity.apply overload that takes a Source[ByteString, Any]; that apply creates a Chunked entity. You can read your file using code based on the documentation for streaming file IO with an akka-stream Source:
import java.nio.file.Paths

import akka.stream.scaladsl._

val file = Paths.get("yourFile.csv")
val entity = HttpEntity(`text/csv`, FileIO.fromPath(file))
The stream will break your file up into chunks; the chunk size currently defaults to 8192 bytes.
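If you need a different chunk size, FileIO.fromPath takes it as a second parameter, for example:

// Stream the file in 4 KiB chunks instead of the 8192-byte default.
val entity = HttpEntity(`text/csv`, FileIO.fromPath(file, chunkSize = 4096))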
To stream the map that you've created you can use a similar trick:
val mapStream = Source.fromIterator(() => map.toIterator)
  .map { case (k, v) => s"$k,$v\n" } // pattern-match the tuple; emit a line terminator per row
  .map(ByteString.apply)

val mapEntity = HttpEntity(`text/csv`, mapStream)
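Putting the pieces together, a minimal end-to-end sketch of the map-streaming version (assuming the akka-http and akka-stream APIs of that era, Scala 2.11):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpCharsets.`UTF-8`
import akka.http.scaladsl.model._
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source
import akka.util.ByteString

object StreamingCsv {
  def main(args: Array[String]): Unit = {
    implicit val actorSystem = ActorSystem("system")
    implicit val materializer = ActorMaterializer()

    val map = Map("a" -> 1, "b" -> 2, "c" -> 3)
    val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)

    // Each ByteString element becomes one chunk of the chunked response.
    val rows = Source.fromIterator(() => map.iterator)
      .map { case (k, v) => ByteString(s"$k,$v\n") }

    val route =
      path("test") {
        complete(HttpEntity(`text/csv`, rows))
      }

    Http().bindAndHandle(route, "localhost", 8080)
  }
}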

Transform data using scala in spark

I am trying to transform an input text file into a key/value RDD, but the code below doesn't work. (The text file is tab-separated.) I am really new to Scala and Spark, so I would really appreciate your help.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object shortTwitter {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile(args(1).txt).getLines()) {
      val newLine = line.map(line =>
        val p = line.split("\t")
        (p(0).toString, p(1).toInt)
      )
    }
    val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text = sc.textFile(args(0))
    val counts = text.flatMap(line => line.split("\t"))
  }
}
I'm assuming you want the resulting RDD to have the type RDD[(String, Int)], so:
You should use map (which transforms each record into a single new record), not flatMap (which transforms each record into multiple records).
You should map the result of the split into a tuple.
Altogether:
val counts = text
  .map(line => line.split("\t"))
  .map(arr => (arr(0), arr(1).toInt))
EDIT, per clarification in a comment: if you're also interested in fixing the non-Spark part (which reads the file sequentially), you have some errors in the for-comprehension syntax; here's the entire thing:
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  // read the file without Spark (not necessary when using Spark):
  val countsWithoutSpark: Iterator[(String, Int)] = for {
    line <- Source.fromFile(args(1)).getLines()
  } yield {
    val p = line.split("\t")
    (p(0), p(1).toInt)
  }

  // equivalent code using Spark:
  val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val counts: RDD[(String, Int)] = sc.textFile(args(0))
    .map(line => line.split("\t"))
    .map(arr => (arr(0), arr(1).toInt))
}