Spark: convert large input to RDD - Scala

I read a lot of lines from an iterator and I need to convert them to an RDD.
I have seen some answers suggesting sc.parallelize(YourIterable.toList), but the toList will raise an out-of-memory exception.
I have also read posts saying that this goes against the Spark model, but I think there should be another solution.
I have two ideas, please tell me which is the best or if you have any other ideas to solve this.
Solution 1: I buffer the lines 100 000 at a time in an ArrayBuffer, convert each full batch to an RDD with parallelize and union it into the result, and do the same for whatever is left when the iterator is exhausted.
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

val result = ArrayBuffer[String]()
var counter = 0
var resultRDD: RDD[String] = sc.sparkContext.emptyRDD[String]

while (resultSet.next()) {
  // Do some stuff on resultSet and build the line to keep
  result.append(stuff)
  counter += 1
  if (counter % 100000 == 0) {
    val tmp = sc.sparkContext.parallelize(result)
    tmp.count // Need to run an action because result will be cleared
    resultRDD = resultRDD union tmp
    result.clear()
  }
}
// Do the same for the remaining lines, then use resultRDD
With this method, having to call count to force an action before result.clear (because the parallelized RDD is lazy) is a bit annoying.
Solution 2: read in the same batches, but write each batch to files in HDFS and then build the RDD with sc.textFile.
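A rough sketch of Solution 2 could look like the following. It reuses resultSet and the line-building logic (stuff) from Solution 1 and assumes a hypothetical HDFS staging directory; adjust paths and batch size as needed.
import java.io.{BufferedWriter, OutputStreamWriter}
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.sparkContext.hadoopConfiguration)
val batchDir = "hdfs:///tmp/batches" // hypothetical staging directory
var batchId = 0
var writer = new BufferedWriter(new OutputStreamWriter(fs.create(new Path(s"$batchDir/part-$batchId"))))
var counter = 0

while (resultSet.next()) {
  writer.write(stuff) // stuff = the line built from resultSet, as in Solution 1
  writer.newLine()
  counter += 1
  if (counter % 100000 == 0) { // roll over to a new file every 100 000 lines
    writer.close()
    batchId += 1
    writer = new BufferedWriter(new OutputStreamWriter(fs.create(new Path(s"$batchDir/part-$batchId"))))
  }
}
writer.close()

// One RDD over all staged files; nothing is accumulated in driver memory.
val resultRDD = sc.sparkContext.textFile(s"$batchDir/part-*")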

Related

spark program to check if a given keyword exists in a huge text file or not

To find out whether a given keyword exists in a huge text file or not, I came up with the two approaches below.
Approach1:
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Approach2:
val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count
println(if (count > 0) "Found" else "Not Found")
The main difference is that the first one uses map and then reduce, whereas the second one filters and then counts.
Could anyone suggest which is more efficient?
I would suggest:
val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()
Benefit: the search can be stopped once one occurrence of the keyword has been found.
see also Spark: Efficient way to test if an RDD is empty
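For reference, a minimal self-contained sketch of that suggestion (assuming a local SparkContext and the test_file.txt from the question) could be:
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("keyword-search").setMaster("local[*]"))
val keyword = "my_keyword"
val rdd = sc.textFile("test_file.txt")

// isEmpty() only needs a single element of the filtered RDD, so Spark can stop
// scanning partitions as soon as one matching line has been found.
val wordFound = !rdd.filter(_.contains(keyword)).isEmpty()
println(if (wordFound) "Found" else "Not Found")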

Spark flushing Dataframe on show / count

I am trying to print the count of a dataframe, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is this expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}

val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and optimally pure; otherwise the results are simply non-deterministic. While an escape hatch is provided in the latest versions (it is possible to hint Spark that the function shouldn't be re-executed), it won't help you here.
Moreover, relying on mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the UDF) is simply a no-go: Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decisions based on that object.
Unfortunately both components are simply not salvageable. You'll have to go back to the planning phase and rethink your design.
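For illustration only, a pure reformulation of funcA without the shared counter could look like the sketch below. It assumes the counter-based behaviour (scaling only the first 20 values seen) is not actually required, since that logic cannot be expressed through driver-side mutable state; if it is required, the condition has to be derived from the data itself.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Pure: no shared counter, the same input always gives the same output.
def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] =
  Option(sval).map(_.multiply(new java.math.BigDecimal("1000")))

val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def funcA(inputDataframe: DataFrame): DataFrame = {
  val colName = "colA"
  inputDataframe
    .withColumn(colName, customUDF(col(colName)))
    .limit(10)
    .withColumn(colName, col(colName).cast("decimal(38,10)"))
}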
Your DataFrame is a distributed dataset, and trying to do a count() can return unpredictable results since the count can differ between nodes. Read the documentation about RDDs below; it is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd
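For illustration, the closure pitfall those pages describe boils down to something like this (a standard sketch, not taken from the original answer):
// A local variable captured in a closure is copied to each executor, so the
// driver-side value is not updated when this runs on a cluster.
var counter = 0
val data = sc.parallelize(1 to 100)
data.foreach(x => counter += x)
println(counter) // typically still 0 on the driver in cluster mode

// Accumulators are the supported way to aggregate such side effects.
val acc = sc.longAccumulator("sum")
data.foreach(x => acc.add(x))
println(acc.value) // 5050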

Scala - Alternative to nested for loops while writing out

To introduce my situation, I'm trying to generate multiple log files, one for every hour, each with a random number of timestamps (lines). While writing the code, I decided to write out and name the files based on the hour (log0 ... log23), so that I can test a Spark Streaming job that relies on logs being separated by hour. However, I couldn't figure out any way to do this other than with nested for loops.
In the Scala spirit of avoiding nested for loops and making code easier to read, I'm looking to see if there is a way to rewrite the following sample code with identical functionality:
import scala.reflect.io.File

val hours = 24
val max_iterations = 100
val string_builder = scala.collection.mutable.StringBuilder.newBuilder
val rand = scala.util.Random

for (hour <- 0 until hours) {
  for (iter <- 1 to rand.nextInt(max_iterations)) {
    val minute = rand.nextInt(60) // random minute/second (these were undefined in the original snippet)
    val second = rand.nextInt(60)
    string_builder.append(s"{datetime=$hour:$minute:$second}\n")
  }
  File(s"log$hour.txt").createFile(false).writeAll(string_builder.toString)
  string_builder.clear
}
Edit: Just for clarification, this differs from a standard multiple file write out, as the hours need to match the file name.
A simple solution would be to use a for-comprehension:
for {
  hour <- 0 until hours
  iter <- 1 to rand.nextInt(max_iterations)
} yield {
  File(s"log$hour.txt").appendAll(s"{datetime=$hour:${iter % 60}:00}\n")
}
It has a downside of recreating the File handler over and over, so performance might be an issue, but if this code is just used to create some test data once, this shouldn't be a concern.
An alternative would be to call foreach on the sequence of hours (and then of iters) directly:
(0 until hours).foreach { hour =>
  val f = File(s"log$hour.txt")
  val lines = (1 to rand.nextInt(max_iterations)).map(iter => s"{datetime=$hour:${iter % 60}:00}\n")
  f.writeAll(lines: _*)
}

Split rdd and Select elements

I am trying to capture a stream, transform the data, and then save it locally.
So far, streaming, and writing works fine. However, the transformation only works halfway.
The stream I receive consists of 9 columns separated by "|". I want to split each line and, say, select columns 1, 3, and 5. What I have tried looks like this, but nothing has really led to a result:
val indices = List(1, 3, 5)

linesFilter.window(Seconds(EVENT_PERIOD_SECONDS * WRITE_EVERY_N_SECONDS), Seconds(EVENT_PERIOD_SECONDS * WRITE_EVERY_N_SECONDS)).foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    rdd
      .map(_.split("\\|").slice(1, 2))
      //.map(arr => (arr(0), arr(2)))
      //.filter(x => indices.contains(_(x))) // select(indices)
      //.zipWithIndex
      .coalesce(1, true)
      // the replacement is used so that I get a csv file at the end
      //.map(_.replace(DELIMITER_STREAM, DELIMITER_OUTPUT))
      //.map(_.mkString(DELIMITER_OUTPUT))
      .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
  }
}
Does anyone have a tip on how to split an RDD and then grab only specific elements out of it?
Edit: Input:
val lines = streamingContext.socketTextStream(HOST, PORT)
val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
The input stream looks like this:
536365|71053|white metal lantern|6|01-12-10 8:26|3,39|17850|united kingdom|2017-11-17 14:52:22
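One way to grab just those positions, sketched under the assumption that DELIMITER_OUTPUT is the desired output separator, is to index the split array with the indices list inside foreachRDD:
val indices = List(1, 3, 5) // the columns to keep, as in the question

val selected = rdd
  .map(_.split("\\|"))
  .filter(_.length == 9) // keep only complete records
  .map(fields => indices.map(i => fields(i)).mkString(DELIMITER_OUTPUT))

// selected is an RDD[String] that can be coalesced and saved as before:
// selected.coalesce(1, true).saveAsTextFile(...)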
Thank you very much everyone.
As you recommended, I modified my code like this:
private val DELIMITER_STREAM = "\\|"

val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
  .map { x =>
    val y = x.split(DELIMITER_STREAM)
    (y(0), y(1), y(3), y(4), y(5), y(6), y(7))
  }
and then, inside the foreachRDD block:
if (rdd.count() > 0) {
  rdd
    .map(_.productIterator.mkString(DELIMITER_OUTPUT))
    .coalesce(1, true)
    .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
}

How to use existing trained model using LinearRegressionModel to work with SparkStreaming and predict data with it [duplicate]

I have the following code:
val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      println("Enum = BLUE : " + value.toString())
    }
  }
  data
}.persist(StorageLevel.MEMORY_ONLY_SER)
output.saveAsTextFile("myOutput")
Then the blueCount is not zero, but I got no println() output! Am I missing anything here? Thanks!
This is a conceptual question...
Imagine you have a big cluster composed of many workers, say n workers, each storing a partition of an RDD or DataFrame. Now imagine you start a map task across that data, and inside that map you have a print statement. First of all:
Where would that data be printed out?
Which node has priority, and which partition?
If all nodes are running in parallel, whose output will be printed first?
How would this print queue be created?
Those are too many questions, so the designers/maintainers of apache-spark logically decided to drop any support for print statements inside any map-reduce operation (this includes accumulators and even broadcast variables).
This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD, because they are built to hold millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?
In order to prove this you can run this scala code for example:
// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)

def printStuff(x: Int): Int = {
  println(x)
  x + 1
}

// Nothing shows up on the driver: map is lazy, and any println that does run
// happens on the executors, not in the driver's console.
rdd.map(printStuff)

// But you can print the RDD by doing the following:
rdd.take(10).foreach(println)
I was able to work around it by making a utility function:
object PrintUtility {
  def print(data: String): Unit = {
    println(data)
  }
}