Transform data using Scala in Spark

I am trying to transform an input text file into a key/value RDD, but the code below doesn't work. (The text file is tab-separated.) I am really new to Scala and Spark, so I would really appreciate your help.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object shortTwitter {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile(args(1).txt).getLines()) {
      val newLine = line.map(line =>
        val p = line.split("\t")
        (p(0).toString, p(1).toInt)
      )
    }
    val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text = sc.textFile(args(0))
    val counts = text.flatMap(line => line.split("\t"))
  }
}

I'm assuming you want the resulting RDD to have the type RDD[(String, Int)], so -
You should use map (which transforms each record into a single new record) and not flatMap (which transforms each record into multiple records)
You should map the result of the split into a tuple
Altogether:
val counts = text
  .map(line => line.split("\t"))
  .map(arr => (arr(0), arr(1).toInt))
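Since the result is a key/value RDD of (String, Int) pairs, it can also be aggregated directly afterwards; a minimal, hypothetical follow-up (not part of the original question), assuming the Int values should be summed per key:

val totals = counts.reduceByKey(_ + _)        // sum values per key
totals.collect().foreach { case (key, total) => println(s"$key -> $total") }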
EDIT, per the clarification in the comments: if you're also interested in fixing the non-Spark part (which reads the file sequentially), you have some errors in the for-comprehension syntax; here's the entire thing:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.io.Source

def main(args: Array[String]): Unit = {
  // read the file without Spark (not necessary when using Spark):
  val countsWithoutSpark: Iterator[(String, Int)] = for {
    line <- Source.fromFile(args(1)).getLines()
  } yield {
    val p = line.split("\t")
    (p(0), p(1).toInt)
  }

  // equivalent code using Spark:
  val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val counts: RDD[(String, Int)] = sc.textFile(args(0))
    .map(line => line.split("\t"))
    .map(arr => (arr(0), arr(1).toInt))
}

Related

Class import error in Scala/Spark

I am new to Spark and I'm using it with Scala. I wrote a simple object that loads fine in spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer

object Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
Now I want to put it in a class so I can pass parameters. I use the same code, with a class instead of an object.
import org.apache.spark.ml.feature.StringIndexer

class Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
This returns import errors.
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this one; it seems to work fine.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

def trainModel() = {
  val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
  import spark.implicits._
  val data = spark.read.textFile("/user/PT/data/newfav.csv")
  val df = data.map(_.split(",") match {
    case Array(user, food, fav) => (user, food, fav.toDouble)
  }).toDF("userID", "foodID", "favorite")
  val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
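Since the original goal was to move the code into a class that takes parameters, one option is to pass the SparkSession (and any other parameters) through the constructor. A minimal sketch under that assumption; the constructor parameters and the final fit/transform line are hypothetical additions, not part of the original answer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

class Collaborative(spark: SparkSession, path: String) {
  import spark.implicits._

  def trainModel() = {
    val data = spark.read.textFile(path)            // path passed in as a parameter
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
    userIndexer.fit(df).transform(df)               // hypothetical: return the indexed DataFrame
  }
}

// usage (hypothetical):
// val indexed = new Collaborative(spark, "/user/PT/data/newfav.csv").trainModel()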

Lift-Json Extracting from JSON object

I have this code below:
object Test {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(3))
    val kafkaBrokers = Map("metadata.broker.list" -> "HostName:9092")
    val offsetMap = Map(TopicAndPartition("topic_test", 0), 8)
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaBrokers, offsetMap)
    var offsetArray = Array[OffsetRange]()
    lines.transform { rdd =>
      offsetArray = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map {
      _.message()
    }.foreachRDD { rdd =>
      /* NEW CODE */
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
I have added the new code under the comment /* NEW CODE */. My understanding is that the lines val will contain a sequence of RDDs, which are basically produced from the Kafka server every 3 seconds. Then I am grabbing the message using the map function.
But I am a little confused about what the foreachRDD function does. Does it iterate over all of the RDDs in the lines DStream (which is what I am trying to do)? The parse function from the lift-json library only accepts a String, so I need to iterate over all of the RDDs and pass that String value to the parse function, which is what I attempted to do. But nothing is being printed out for some reason.
If you want to read data from a specific offset, you're using the wrong overload.
The one you need is this:
createDirectStream[K, V, KD <: Decoder[K], VD <: Decoder[V], R](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: (MessageAndMetadata[K, V]) ⇒ R): InputDStream[R]
You need a Map[TopicAndPartition, Long]:
val offsetMap = Map(TopicAndPartition("topic_test", 0) -> 8L)
And you need to pass a function which receives a MessageAndMetadata[K, V] and returns your desired type, for example:
val extractKeyValue: MessageAndMetadata[String, String] => (String, String) =
  msgAndMeta => (msgAndMeta.key(), msgAndMeta.message())
And use it:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaBrokers, offsetMap, extractKeyValue)
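As for the foreachRDD part of the question: foreachRDD runs the supplied function once per batch interval on that batch's RDD, so parsing each message with lift-json happens inside it. A rough sketch, assuming messages is a DStream[String] holding the message bodies (e.g. obtained with .map(_._2) from the key/value stream above); note that println inside rdd.foreach runs on the executors, so the output shows up in the executor logs rather than on the driver console:

import net.liftweb.json._

messages.foreachRDD { rdd =>
  // one RDD per 3-second batch
  rdd.foreach { jsonString =>
    val parsed = parse(jsonString)   // lift-json's parse accepts a String
    println(parsed)                  // executor-side output
  }
}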

How to pass a HiveContext as an argument to functions in Spark/Scala

I have created a HiveContext in the main() function in Scala, and I need to pass this hiveContext to other functions through parameters; this is the structure:
object Project {
  def main(name: String): Int = {
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    ...
  }
  def read(streamId: Int, hc: hiveContext): Array[Byte] = {
    ...
  }
  def close(): Unit = {
    ...
  }
}
but it doesn't work. The function read() is called inside main().
Any idea?
I'm declaring the hiveContext as implicit; this is working for me:
implicit val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)
Defined in MyJob:
override def run(config: Config)(implicit sqlContext: SQLContext): Unit = ...
But if you don't want it implicit, this should be the same
val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)(sqlContext)
override def run(config: Config)(sqlContext: SQLContext): Unit = ...
Also, your function read should receive HiveContext as the type for the parameter hc, and not hiveContext
def read (streamId: Int, hc:HiveContext): Array[Byte] =
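Applied to the structure from the question, explicit passing could look roughly like this; a minimal sketch (the body of read and the argument values are placeholders, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Project {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Project"))
    val hiveContext = new HiveContext(sc)
    val bytes = read(1, hiveContext)      // pass the context explicitly
  }

  def read(streamId: Int, hc: HiveContext): Array[Byte] = {
    // use hc here, e.g. hc.sql("SELECT ...")
    Array.empty[Byte]
  }
}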
I tried several options; this is what eventually worked for me.
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)
  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()
    // iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    // val sqlcc = sqlCont // Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    .
    .
    .
  }
}
Note: In the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other options for passing the sqlContext from one method to the other; it always gave me a NullPointerException or a Task not serializable exception.
The good thing is that it eventually worked this way, and I could retrieve parameters from a row of DataFrame 1 and use those values when loading DataFrame 2.
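The exceptions mentioned above are typical of using a SQLContext inside foreach, which runs on the executors rather than the driver. A different approach (plainly not what the answer above did, just a hedged alternative) is to collect the needed values of the first DataFrame to the driver and call getDF2 there, so the SQLContext is only ever used on the driver:

// runs on the driver, so the SQLContext can be used safely inside the loop
val rows = df1.collect()
rows.foreach { x =>
  getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlC)
}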

Type mismatch error in Scala

I am trying to return RDD[(String, String, String)] and I am not able to do that using flatMap. I tried returning (tweetId, tweetBody, gender), but it gives me a type mismatch error. Can you guide me on how I can return RDD[(String, String, String)] from flatMap?
override def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
  val idColumnName = config.getConfigString("column_name").getOrElse("id")
  val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
  val genderColumnName = config.getConfigString("column_name").getOrElse("gender")

  // convert each input element to a JsonValue
  val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))

  val hashtagsRDD: RDD[(String, String, String)] = jsonRDD.mapPartitions(r => {
    // register jackson mapper (this needs to be instantiated per partition
    // since it is not serializable)
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    r.flatMap(tweet => tweet match {
      case _ :: tweet =>
        val rootNode = mapper.readTree(tweet)
        val tweetId = rootNode.path("id").asText.split(":")(2)
        val tweetBody = rootNode.path("body").asText
        val tweetVector = new HashingTF().transform(tweetBody.split(" "))
        val result = genderModel.predict(tweetVector)
        val gender = if (result == 1.0) {"Male"} else {"Female"}
        (tweetId, tweetBody, gender)
        // Array(1).map(x => (tweetId, tweetBody, gender))
    })
  })

  val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1, x._2, x._3))
  val schema = StructType(Array(StructField(idColumnName, StringType, true), StructField(bodyColumnName, StringType, true), StructField(genderColumnName, StringType, true)))
  sqlContext.createDataFrame(rowRDD, schema)
}
}
Try to use map instead of flatMap.
flatMap is used when the result type of the parameter function is a collection or an RDD, i.e. when every element of the current collection is mapped to zero or more elements, while map is used when every element of the current collection is mapped to exactly one element.
map with A => B exchanges symbol A with symbol B in functorial types, i.e. it transforms RDD[A] into RDD[B].
flatMap can be read as map then flatten in monadic types. E.g. if you have an RDD[A] and the parameter function is of type A => RDD[B], the result of a simple map will be RDD[RDD[B]], and that nesting can be simplified to just RDD[B] via flatten.
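A tiny illustration of the difference (hypothetical values; sc is a SparkContext):

val rdd = sc.parallelize(Seq("a b", "c d"))
rdd.map(_.split(" "))      // RDD[Array[String]]: one array per input element
rdd.flatMap(_.split(" "))  // RDD[String]: "a", "b", "c", "d"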
Here is an example of successfully compiled code.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class UserTransformConfig {
  def getConfigString(name: String): Option[String] = ???
}
class PhaseLogger

object byteUtils {
  def bytesToUTF8String(r: Array[Byte]): String = ???
}
class HashingTF {
  def transform(strs: Array[String]): Array[Double] = ???
}
object genderModel {
  def predict(v: Array[Double]): Double = ???
}

def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
  val idColumnName = config.getConfigString("column_name").getOrElse("id")
  val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
  val genderColumnName = config.getConfigString("column_name").getOrElse("gender")

  // convert each input element to a JsonValue
  val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))

  val hashtagsRDD: RDD[(String, String, String)] = jsonRDD.mapPartitions(r => {
    // register jackson mapper (this needs to be instantiated per partition
    // since it is not serializable)
    val mapper = new ObjectMapper
    mapper.registerModule(DefaultScalaModule)
    r.map { tweet =>
      val rootNode = mapper.readTree(tweet)
      val tweetId = rootNode.path("id").asText.split(":")(2)
      val tweetBody = rootNode.path("body").asText
      val tweetVector = new HashingTF().transform(tweetBody.split(" "))
      val result = genderModel.predict(tweetVector)
      val gender = if (result == 1.0) {"Male"} else {"Female"}
      (tweetId, tweetBody, gender)
    }
  })

  val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1, x._2, x._3))
  val schema = StructType(Array(StructField(idColumnName, StringType, true), StructField(bodyColumnName, StringType, true), StructField(genderColumnName, StringType, true)))
  sqlContext.createDataFrame(rowRDD, schema)
}
Please note how much I had to fill in from my imagination because you did not supply a minimal example. In general, questions like this are not worth answering.

How to match Dataframe column names to Scala case class attributes?

The column names in this example from spark-sql come from the case class Person.
case class Person(name: String, age: Int)
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html
However, in many cases the parameter names may be changed. This would cause columns not to be found if the file has not been updated to reflect the change.
How can I specify an appropriate mapping?
I am thinking something like:
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))

val ps: Seq[Person] = ???
val personRDD = sc.parallelize(ps)

// Apply the schema to the RDD.
val personDF: DataFrame = sqlContext.createDataFrame(personRDD, schema)
Basically, all the mapping you need to do can be achieved with DataFrame.select(...). (Here, I assume, that no type conversions need to be done.)
Given the forward- and backward-mapping as maps, the essential part is
val mapping = from.map { (x: (String, String)) => personsDF(x._1).as(x._2) }.toArray
// personsDF is your original dataframe
val mappedDF = personsDF.select(mapping: _*)
where mapping is an array of Columns with alias.
Example code
object Example {
  import org.apache.spark.rdd.RDD
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.sql.{DataFrame, SQLContext}

  case class Person(name: String, age: Int)

  object Mapping {
    val from = Map("name" -> "a", "age" -> "b")
    val to = Map("a" -> "name", "b" -> "age")
  }

  def main(args: Array[String]): Unit = {
    // init
    val conf = new SparkConf()
      .setAppName("Example.")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // create persons
    val persons = Seq(Person("bob", 35), Person("alice", 27))
    val personsRDD = sc.parallelize(persons, 4)
    val personsDF = personsRDD.toDF

    writeParquet(personsDF, "persons.parquet", sc, sqlContext)
    val otherPersonDF = readParquet("persons.parquet", sc, sqlContext)
  }

  def writeParquet(personsDF: DataFrame, path: String, sc: SparkContext, sqlContext: SQLContext): Unit = {
    import Mapping.from
    val mapping = from.map { (x: (String, String)) => personsDF(x._1).as(x._2) }.toArray
    val mappedDF = personsDF.select(mapping: _*)
    mappedDF.write.parquet(path) // parquet with columns "a" and "b"
  }

  def readParquet(path: String, sc: SparkContext, sqlContext: SQLContext): DataFrame = {
    import Mapping.to
    val df = sqlContext.read.parquet(path) // this df has columns "a" and "b"
    val mapping = to.map { (x: (String, String)) => df(x._1).as(x._2) }.toArray
    df.select(mapping: _*)
  }
}
Remark
If you need to convert a dataframe back to an RDD[Person], then
val rdd: RDD[Row] = personsDF.rdd
val personsRDD: RDD[Person] = rdd.map { r: Row =>
  Person(r.getAs[String]("name"), r.getAs[Int]("age"))
}
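If the parquet file was written with the renamed columns, the reverse mapping can be applied first and the rows converted afterwards; a minimal sketch reusing Mapping.to from the example above:

val renamedDF = sqlContext.read.parquet("persons.parquet")   // columns "a" and "b"
val restoredDF = renamedDF.select(
  Mapping.to.map { case (col, alias) => renamedDF(col).as(alias) }.toSeq: _*
)
val restoredRDD: RDD[Person] = restoredDF.rdd.map { r =>
  Person(r.getAs[String]("name"), r.getAs[Int]("age"))
}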
Alternatives
Also have a look at How to convert spark SchemaRDD into RDD of my case class?