How to improve Kudu reads with Spark? - scala

I have a process that, given a new input, retrieves related information from our Kudu database and then does some computation.
The problem lies in the data retrieval: we have 1,201,524,092 rows, and for any computation it takes forever to start processing the rows we need, because the reader has to hand the whole table to Spark.
To read from Kudu we do:
def read(tableName: String): Try[DataFrame] = {
  val kuduOptions: Map[String, String] = Map(
    "kudu.table" -> tableName,
    "kudu.master" -> kuduContext.kuduMaster)
  SQLContext.read.options(kuduOptions).format("kudu").load
}
And then:
val newInputs = ??? // Dataframe with the new inputs
val currentInputs = read("inputsTable") // This takes too much time!!!!
val relatedCurrent = currentInputs.join(newInputs.select("commonId"), Seq("commonId"), "inner")
doThings(newInputs, relatedCurrent)
For example, even when we only want to introduce a single new input, it has to scan the full table to find the currentInputs, which results in a Shuffle Write of 81.6 GB / 1201524092 rows.
How can I improve this?
Thanks,

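You can collect the ids of the new inputs on the driver and then use them in a where clause.
With this approach you can easily hit an OOM if newInputs is large, because collect brings every id to the driver, but it can make your query much faster because the filter benefits from predicate pushdown.
// collect returns Array[Row], so extract the value of the single selected column
val collectedIds = newInputs.select("commonId").collect().map(_.get(0))
// isin takes varargs, so the array has to be expanded with : _*
val filteredCurrentInputs = currentInputs.where($"commonId".isin(collectedIds: _*))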


Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced I need to generate a new CSV with complete details like
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
csvWriter.writeAll(data.toList)
Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
File gets generated but the values are object representations instead of String value.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, the main issue with your code looks to be the lookup functions: getName and getCourse effectively return nothing, because their return type is Unit. They are declared with procedure syntax (braces with no = before the body), so whatever the body computes is discarded. On top of that, an if without an else yields () for every non-matching input, so even with an = the map would not produce plain String values.
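To make the Unit issue concrete, here is a minimal plain-Scala sketch with no Spark involved. The Detail case class and the sample values are assumptions made up for illustration; getNameUnit mirrors the shape of the functions in the question, and getName is one possible way to return a usable value.

```scala
object UnitReturnDemo {
  // Hypothetical record type standing in for a parsed StudentDetails.csv line.
  final case class Detail(studentId: String, studentName: String)

  val details: Seq[Detail] = Seq(Detail("101", "ABC"), Detail("102", "XYZ"))

  // Result type is Unit, as with the brace-only definitions in the question:
  // the value computed by the body is thrown away.
  def getNameUnit(studentId: String): Unit = {
    details.map(d => if (d.studentId == studentId) d.studentName)
  }

  // Explicit result type and a real lookup: returns the name when it exists.
  def getName(studentId: String): Option[String] =
    details.find(_.studentId == studentId).map(_.studentName)

  def main(args: Array[String]): Unit = {
    println(getNameUnit("101")) // prints ()
    println(getName("101"))     // prints Some(ABC)
    println(getName("999"))     // prints None
  }
}
```

Returning Option[String] also makes the "no such student" case explicit instead of hiding it in a missing else branch.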
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())
// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did
// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joining and for writing to files. The join takes one line of code and the write takes one more.
Hand-writing that logic is error-prone, hard to read, and most likely much slower.
val df1 = Seq((101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")
val df2 = Seq((101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")
// select the columns in the order the question asks for: name, course, city
val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
This creates a directory named hello.csv. Because of repartition(1), it contains a single data file, which is the final result.

Not able to persist the DStream for use in next batch

JavaRDD<String> history_ = sc.emptyRDD();
java.util.Queue<JavaRDD<String> > queue = new LinkedList<JavaRDD<String>>();
queue.add(history_);
JavaDStream<String> history_dstream = ssc.queueStream(queue);
JavaPairDStream<String,ArrayList<String>> history = history_dstream.mapToPair(r -> {
return new Tuple2< String,ArrayList<String> >(null,null);
});
JavaPairInputDStream<String, GenericData.Record> stream_1 =
KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
GenericDataRecordDecoder.class, props, topicsSet_1);
JavaPairInputDStream<String, GenericData.Record> stream_2 =
KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
GenericDataRecordDecoder.class, props, topicsSet_2);
I then apply some transformations to create two DStreams, Data_1 and Data_2, of type
JavaPairDStream<String, ArrayList<String>>
and join them as shown below. After that I filter out the records that had no matching join key and save them in history, so that they can be used in the next batch through a union with Data_1:
Data_1 = Data_1.union(history);
JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> joined =
Data_1.leftOuterJoin(Data_2).cache();
JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> notNULL_join = joined.filter(r -> r._2._2().isPresent());
JavaPairDStream<String, Tuple2<ArrayList<String>, Optional<ArrayList<String>>>> dstream_filtered = joined.filter(r -> !r._2._2().isPresent());
history = dstream_filtered.mapToPair(r -> {
return new Tuple2<>(r._1,r._2._1);
}).persist();
I can see that history is populated after the previous step (checked by saving it to HDFS), but it is still empty in the next batch when the union is performed.
It's conceptually not possible to "remember" a DStream. DStreams are time-bound and on each clock-tick (called "batch interval") the DStream represents the observed data in the stream during that period of time.
Hence, we cannot have an "old" DStream saved to join with a "new" DStream. All DStreams live in the "now".
The underlying data structure of DStreams is the RDD: Each batch interval, our DStream will have 1 RDD of the data for that interval.
RDDs represent a distributed collection of data. RDDs are immutable and permanent, for as long as we have a reference to them.
We can combine RDDs and DStreams to create the "history roll over" that's required here.
It looks pretty similar to the approach in the question, but it keeps the history as an RDD rather than as a DStream.
Here's a high-level view of the suggested changes:
var history: RDD[(String, List[String])] = sc.emptyRDD
val dstream1 = ...
val dstream2 = ...
val historyDStream = dstream1.transform(rdd => rdd.union(history))
val joined = historyDStream.leftOuterJoin(dstream2) // left outer join, as in the question, so unmatched keys are kept
... do stuff with joined as above, obtain dstreamFiltered ...
dstreamFiltered.foreachRDD{rdd =>
val formatted = rdd.map{case (k,(v1,v2)) => (k,v1)} // get rid of the join info
history.unpersist(false) // unpersist the 'old' history RDD
history = formatted // assign the new history
history.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
history.count() //action to materialize this transformation
}
This is only a starting point. There are additional considerations with regard to checkpointing; without it, the lineage of the history RDD will grow unbounded until a StackOverflowError occurs. This blog post covers this particular technique in detail: http://www.spark.tc/stateful-spark-streaming-using-transform/
I also recommend you using Scala instead of Java. The Java syntax is too verbose to use with Spark Streaming.
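The roll-over logic itself can be sketched without Spark. In the following minimal example, plain Scala Maps stand in for the per-interval RDDs; the Batch and Joined type aliases, the step function, and the sample data are all assumptions for illustration, and a Map keeps only one value per key, which a real union of RDDs would not do.

```scala
object HistoryRollover {
  type Batch = Map[String, List[String]]
  type Joined = Map[String, (List[String], List[String])]

  // One batch interval: add the carried-over history to the left side,
  // match it against the right side, and split matched from unmatched keys.
  def step(history: Batch, left: Batch, right: Batch): (Joined, Batch) = {
    val candidates = history ++ left
    val matched = candidates.collect {
      case (k, v) if right.contains(k) => (k, (v, right(k)))
    }
    val unmatched = candidates.filter { case (k, _) => !right.contains(k) }
    (matched, unmatched) // unmatched becomes the history of the next interval
  }

  def main(args: Array[String]): Unit = {
    var history: Batch = Map.empty
    val batches = Seq(
      (Map("a" -> List("a1"), "b" -> List("b1")), Map("a" -> List("x1"))),
      (Map("c" -> List("c1")), Map("b" -> List("y1")))
    )
    for ((left, right) <- batches) {
      val (matched, unmatched) = step(history, left, right)
      history = unmatched // roll the unmatched records over
      println(s"matched=${matched.keySet} history=${history.keySet}")
    }
  }
}
```

Key b finds no partner in the first interval, is carried over in history, and is matched in the second interval, which is the behaviour the question is after.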

How to efficiently extract a value from HiveContext Query

I am running a query through my HiveContext
Query:
val hiveQuery = s"""SELECT post_domain, post_country, post_geo_city, post_geo_region
  FROM $database.$table
  WHERE year=$year and month=$month and day=$day and hour=$hour and event_event_id='$uniqueIdentifier'"""
val hiveQueryObj:DataFrame = hiveContext.sql(hiveQuery)
Originally, I was extracting each value from the column with:
hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
However, I was told to avoid this because it makes too many connections to Hive. I am pretty new to this area so I'm not sure how to extract the column values efficiently. How can I perform the same logic in a more efficient way?
I plan to implement this in my code
val arr = Array("post_domain", "post_country", "post_geo_city", "post_geo_region")
arr.foreach(column => {
  // expected Map
  val ex = expected.get(column).get
  val actual = hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
  assert(actual.equals(ex))
})

How to extract records from Dstream and write into Cassandra (Spark Streaming)

I am fetching data from Kafka, processing it in Spark Streaming, and writing the data into Cassandra.
I am trying to filter the DStream records, but the filter has no effect and the complete set of records is written to Cassandra.
Any suggestion with sample code for filtering records on multiple columns would be highly appreciated. I have researched this but have not found a solution.
class SparkKafkaConsumer1(val recordStream : org.apache.spark.streaming.dstream.DStream[String], val streaming : StreamingContext) {
val internationalAddress = recordStream.map(line => line.split("\\|")(10).toUpperCase)
def timeToStr(epochMillis: Long): String =
DateTimeFormat.forPattern("YYYYMMddHHmmss").print(epochMillis)
if(internationalAddress =="INDIA")
{
print("-----------------------------------------------")
recordStream.print()
val riskScore = "1"
val timestamp: Long = System.currentTimeMillis
val formatedTimeStamp = timeToStr(timestamp)
var wc1 = recordStream.map(_.split("\\|")).map(r=>Row(r(0),r(1),r(2),r(3),r(4).toInt,r(5).toInt,r(6).toInt,r(7),r(8),r(9),r(10),r(11),r(12),r(13),r(14),r(15),r(16),riskScore.toInt,0,0,0,formatedTimeStamp))
implicit val rowWriter = SqlRowWriter.Factory
wc1.saveToCassandra("fraud", "fraudrating", SomeColumns("purchasetimestamp","sessionid","productdetails","emailid","productprice","itemcount","totalprice","itemtype","luxaryitem","shippingaddress","country","bank","typeofcard","creditordebitcardnumber","contactdetails","multipleitem","ipaddress","consumer1score","consumer2score","consumer3score","consumer4score","recordedtimestamp"))
}
(Note: I do have records with internationalAddress = INDIA in Kafka, and I am very new to Scala.)
I'm not really sure what you're trying to do, but if you are simply trying to filter on records pertaining to India, you could do this:
implicit val rowWriter = SqlRowWriter.Factory
recordStream
.filter(_.split("\\|")(10).toUpperCase == "INDIA")
.map(_.split("\\|"))
.map(r => Row(...))
.saveToCassandra(...)
As a side note, I think case classes would be really helpful for you.
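As a rough illustration of the case class suggestion, the sketch below parses a pipe-delimited line into a typed record and filters on the country field. The Purchase class, its three fields, the parse helper, and the sample lines are assumptions; the field positions (1, 4 and 10) follow the column order used in the question, and a real case class would name all of the columns.

```scala
object PurchaseFilter {
  // Hypothetical subset of the record; a real case class would name every field.
  final case class Purchase(sessionId: String, productPrice: Int, country: String)

  // Parse one pipe-delimited line; None when the line has fewer than 11
  // fields or the price is not a number.
  def parse(line: String): Option[Purchase] = {
    val f = line.split("\\|", -1)
    if (f.length <= 10) None
    else scala.util.Try(Purchase(f(1), f(4).toInt, f(10).toUpperCase)).toOption
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "t1|s1|p|e|100|1|100|type|no|addr|India",
      "t2|s2|p|e|250|1|250|type|no|addr|France"
    )
    // Named fields replace positional access such as r(10).
    val indian = lines.flatMap(parse).filter(_.country == "INDIA")
    indian.foreach(println)
  }
}
```

The same parse function should be usable on the stream, for example with recordStream.flatMap(line => parse(line).toList), so that the filter reads _.country == "INDIA" instead of indexing into a split array.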

Large task size for simplest program

I am trying to run the simplest program with Spark
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (1 to 10000000).toList
val data = sc.parallelize(dat).cache()
for(i <- 1 to 100){
println(data.reduce(_ + _))
}
}
}
I get the following error message, after each iteration :
WARN TaskSetManager: Stage 0 contains a task of very large size (9767
KB). The maximum recommended task size is 100 KB.
Increasing the data size increases said task size. This suggests to me that the driver is shipping the "dat" object to all executors, but I can't for the life of me see why, as the only operation on my RDD is reduce, which basically has no closure. Any ideas ?
Because you create the very large list locally first, the Spark parallelize method has to ship this list to the Spark workers as part of the tasks, hence the warning message you receive. As an alternative, you could parallelize a much smaller list, then use flatMap to explode it into the larger list. This also has the benefit of creating the larger set of numbers in parallel. For example:
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest extends App {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (0 to 99).toList
val data = sc.parallelize(dat).cache().flatMap(i => (1 to 1000000).map(j => j * 100 + i))
println(data.count()) //100000000
println(data.reduce(_ + _))
sc.stop()
}
EDIT:
Ultimately the local collection being parallelized has to be pushed to the executors. The parallelize method creates an instance of ParallelCollectionRDD:
def parallelize[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
assertNotStopped()
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L730
ParallelCollectionRDD creates a number of partitions equal to numSlices:
override def getPartitions: Array[Partition] = {
val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L96
numSlices defaults to sc.defaultParallelism which on my machine is 4. So even when split, each partition contains a very large list which needs to be pushed to an executor.
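To put numbers on this, here is a small sketch of an even split of n elements into numSlices ranges. The sliceBounds helper is an assumption written for illustration and is only a simplified stand-in for what ParallelCollectionRDD.slice does.

```scala
object SliceSizes {
  // Simplified even split: slice i covers [i * n / numSlices, (i + 1) * n / numSlices).
  def sliceBounds(n: Long, numSlices: Int): Seq[(Long, Long)] =
    (0 until numSlices).map(i => (i * n / numSlices, (i + 1) * n / numSlices))

  def main(args: Array[String]): Unit = {
    // 10,000,000 elements over 4 slices, as in the question on a 4-core machine.
    val sizes = sliceBounds(10000000L, 4).map { case (start, end) => end - start }
    println(sizes) // four slices of 2500000 elements each
  }
}
```

With 4 slices, each task still carries 2.5 million elements, which is consistent with the task size growing as the input list grows.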
SparkContext.parallelize contains the note "@note Parallelize acts lazily", and ParallelCollectionRDD contains the comment:
// TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
// cached. It might be worthwhile to write the data to a file in the DFS and read it in the split
// instead.
So it appears that the problem happens when you call reduce because this is the point that the partitions are sent to the executors, but the root cause is that you are calling parallelize on a very big list. Generating the large list within the executors is a better approach, IMHO.
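The reduce action combines the elements within each partition on the executors and then merges the per-partition results on the driver. When you run sc.parallelize without numSlices, the data is split into sc.defaultParallelism partitions. To aggregate the already distributed data explicitly, by key or by partition, use something like this: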
data.map(el=> el%100 -> el).reduceByKey(_+_)
or you can do the reduce at partition level.
data.mapPartitions(p => Iterator(p.reduce(_ + _))).reduce(_ + _)
or just use sum :)
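The partition-level idea can be tried out with plain Scala collections. In this sketch the chunks produced by grouped stand in for partitions, and chunkedSum is a hypothetical helper. It also widens to Long, because the sum of 1 to 10000000 is 50000005000000, which does not fit in an Int, so the Int-based reduce(_ + _) from the question wraps around.

```scala
object PartitionedSum {
  // Reduce each chunk independently, then combine the partial results.
  def chunkedSum(data: Seq[Int], numChunks: Int): Long = {
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numChunks).toInt)
    data.grouped(chunkSize)        // stand-in for partitions
      .map(_.foldLeft(0L)(_ + _))  // per-chunk partial sum, widened to Long
      .sum                         // merge of the partial sums
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 10000000
    println(chunkedSum(data, 4)) // 50000005000000
    println(data.reduce(_ + _))  // Int arithmetic silently wraps around
  }
}
```

On an RDD, the equivalent widening could be done with data.map(_.toLong).reduce(_ + _).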