How to print a String or Array[String] in Scala (Spark)?

I'm trying to unit test the values returned in a String, but when I try to print them the console gives:
MapPartitionsRDD[32]
My code is as follows:
UPDATED:
val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
val dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray()

for (print1 <- src) {
  var n1: String = src.toString()
  var sourceArr: Array[String] = n1.split(",")
  for (print2 <- dest) {
    var n2: String = dest.toString()
    for (i <- 0 until sourceArr.length) {
      if (n1.split(",")(i).equals(n2.split(",")(i))) {
      }
    }
  }
}
I also tried println(n1.mkString()).
I'm trying to compare the src and dest RDDs to find the differences between the rows.

If you want to see each record in the RDD printed as a separate line, you can use:
src.foreach(println)
This will run the println function on each record, within the executor that holds it (which might be several different executors). If this runs in some test using Spark's "local" mode, there's only one "executor" and it's the same process as the driver, so that's not a problem.
Alternatively, if you do have more than one executor (non-local mode) and you want to make sure the RDD's elements are printed to the driver console, you can first collect the RDD's elements into a local collection and then print them:
src.collect().foreach(println)
NOTE that this assumes the RDD is small enough to be collected into a single machine's memory.
Calling toString on an RDD does not access the RDD's data (as it might be too large to fit as a String in the driver machine's memory), as you observed it just prints the type of the RDD and its ID.
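For instance, a minimal sketch of the difference (assuming a SparkContext sc, e.g. Spark's local mode in a test; the sample data is made up):
val src = sc.parallelize(Seq("1,foo,bar", "2,baz,qux")).filter(_.nonEmpty)

println(src)                   // prints only the RDD's type and id, e.g. "MapPartitionsRDD[1] at filter at ..."
src.collect().foreach(println) // prints each record on its own line on the driver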

You don't have a list or array. You'd need to collect() an RDD in order to get one, or you need to iterate it via foreach.
Calling println on any object already calls its toString method for you, by the way. And RDD doesn't have a mkString method.

Calling toString on src just means you are getting a string representation, which can be anything. For an RDD this is not the content of the RDD (as that would require bringing all of the RDD's content to the driver and printing it at once).
As others have mentioned, in order to print the content of the RDD you first need to get all the data to the driver.
Let's consider the simple solution already proposed:
src.collect().foreach(println)
The first part, collect(), tells Spark to get all the content of the RDD and bring it to the driver as a sequence of records. The foreach tells Scala to go over each record in the sequence and pass it as an argument to the println function, which prints it. You could of course use mkString instead of foreach to get a single string.
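For example, a small sketch of the mkString variant (assuming the same src RDD of comma-separated lines):
// Collect to the driver first, then join the records into one string:
println(src.collect().mkString("\n"))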

Related

Spark - Convert Tuples to Tab Separated String

I want to create a function that takes an RDD of tuples and converts each tuple to a tab separated string. I want the function to be able to handle Tuples of any size.
If I already have this RDD created, I can get the desired output using:
rdd.map(line => (0 to (line.productArity-1)).map(line.productElement(_)).toList.mkString("\t"))
How can I convert this piece of code to work as a function that takes an RDD of tuples, or is there a good library that already does this?
Something like this should work:
def toTab[T <: Product](rdd: RDD[T]): RDD[String] = rdd.map(_.productIterator.mkString("\t"))
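As a hypothetical usage example (assuming a SparkContext sc and the toTab function above):
val tuples = sc.parallelize(Seq((1, "a", true), (2, "b", false)))
toTab(tuples).collect().foreach(println)
// prints each tuple as one tab-separated line, e.g. "1<TAB>a<TAB>true"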

modifying RDD of object in spark (scala)

I have:
val rdd1: RDD[myClass]
It has been initialized; I checked while debugging and all the members have their default values.
If I do
rdd1.foreach(x=>x.modifier())
where modifier is a member function of myClass that modifies some of the member variables.
After executing this, if I check the values inside the RDD, they have not been modified.
Can someone explain what's going on here?
And is it possible to make sure the values are modified inside the RDD?
EDIT:
import scala.collection.mutable.Buffer

class myClass(var id: String, var sessions: Buffer[Long], var avgsession: Long) {
  def calcAvg() {
    // Calculate the average by summing over sessions and dividing by the length,
    // then store this average in avgsession.
    avgsession = if (sessions.isEmpty) 0L else sessions.sum / sessions.length
  }
}
The avgsession attribute is not updated if I do
myrdd.foreach(x=>x.calcAvg())
RDDs are immutable; calling a mutating method on the objects they contain will not have any effect.
The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instances in place:
case class MyClass(id: String, avgsession: Long) {
  def modifier(a: Int): MyClass =
    this.copy(avgsession = this.avgsession + a)
}
Now you still cannot update rdd1, but you can obtain rdd2 that will contain the updated instances:
val rdd2 = rdd1.map(_.modifier(18))
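A minimal end-to-end sketch of this copy-based approach (assuming a SparkContext sc; the sample values are made up):
val rdd1 = sc.parallelize(Seq(MyClass("a", 10L), MyClass("b", 20L)))
val rdd2 = rdd1.map(_.modifier(18))
rdd2.collect().foreach(println) // prints MyClass(a,28) and MyClass(b,38)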
The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_+=1)
rdd.collect.foreach(println)
If you run that example you will get Set() as the result just like the original answer states.
However if you were to run the exact same thing with a cache call:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_+=1)
rdd.collect.foreach(println)
Now the result will print as Set(1). So it depends on whether the data is cached in memory. If Spark is recomputing from the source or reading from a serialized copy on disk, then it will always reset back to the original object and appear to be immutable, but if it's not loading from a serialized form then the mutations will in fact stick.
RDDs are immutable. By using map, you can iterate over the RDD and return a new one.
val rdd2 = rdd1.map(x => x.modifier())
I have observed that code like yours will work after calling RDD.persist when running on Spark with YARN. It is probably unsupported/accidental behavior and you should avoid it - but it is a workaround that may help in a pinch. I'm running version 1.5.0.

Spark Closures with Array [duplicate]

I have an array that works when it is inside the closure (it has some values), but outside the loop the array size is 0. I want to know what causes this behavior.
I need the hArr to be accessible outside for batch HBase put.
val hArr = new ArrayBuffer[Put]()

rdd.foreach(row => {
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)
  val hRow = new Put(Bytes.toBytes(row._1.toString))
  hRow.add(...)
  hArr += hRow
  println("hArr: " + hArr.toArray.mkString(","))
})
println("hArr.size: " + hArr.size)
The problem is that anything captured in an RDD closure is copied to the executors, which then use their own local versions. foreach should only be used for side effects such as saving to disk or something along those lines.
If you want this in an array, then you can map and then collect:
rdd.map(row => {
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)
  val hRow = new Put(Bytes.toBytes(row._1.toString))
  hRow.add(...)
  hRow
}).collect()
I have found that quite a few new Spark users are confused about how the mapper and reducer functions get run and how they relate to things defined in the driver's program. In general, the mapper/reducer functions that you define and register via map, foreach, reduceByKey or their many other variants are not executed in your driver's program. In your driver's program you just register them for Spark to run remotely and in a distributed fashion. When those functions reference objects you instantiated in your driver's program, you have literally created a "closure", which will compile fine most of the time. But usually that's not what you intended, and you will run into problems at runtime, by seeing either NotSerializable or ClassNotFound exceptions.
You can either do all the output work remotely via the foreach() variants, or try collecting all the data back to your driver's program for output by calling collect(). But be careful with collect(), as it collects all data from the distributed nodes into your driver's program. Only do it when you are absolutely sure your final aggregated data is small.
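A minimal sketch of the closure-copy behaviour described above (assuming a SparkContext sc; the buffer mutated inside foreach is a deserialized copy living in the task, so the driver-side one stays empty):
import scala.collection.mutable.ArrayBuffer

val captured = ArrayBuffer[Int]()
sc.parallelize(1 to 5).foreach(n => captured += n) // mutates task-local copies only
println(captured.size)                             // 0 on the driver

// Bring the data back explicitly instead:
val results = sc.parallelize(1 to 5).map(_ * 2).collect()
println(results.mkString(","))                     // 2,4,6,8,10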

how to make saveAsTextFile NOT split output into multiple files?

When using Scala in Spark, whenever I dump the results using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (the path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of outputs correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
The reason it saves as multiple files is that the computation is distributed. If the output is small enough that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
and then save the resulting array as a file. Another way would be to use a custom partitioner via partitionBy and make everything go to one partition, though that isn't advisable because you won't get any parallelization.
If you require the file to be saved with saveAsTextFile you can use coalesce(1, true).saveAsTextFile(). This basically means do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this out; you should take a look.
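Applied to the year RDD from the question, that would look roughly like this (the output directory name is just an example):
// Do the distributed computation first, then shuffle everything into one partition:
year.coalesce(1, shuffle = true).saveAsTextFile("year_single")
// The result is a directory "year_single" containing a single part-00000 file.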
For those working with a larger dataset:
rdd.collect() should not be used in this case as it will collect all the data as an Array in the driver, which is the easiest way to run out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used, as the parallelism of upstream stages will be lost: the whole computation will be performed on the single node where the data is to be stored.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option as it will keep the processing of upstream tasks parallel and then only perform the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile(), as provided below, additionally allows one to store the rdd in a single file with a specific name while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package

import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec

object SparkHelper {

  // This is an implicit class so that saveAsSingleTextFile can be attached to
  // RDD[String] and be called like this: rdd.saveAsSingleTextFile("path/to/file.txt")
  implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {

    def saveAsSingleTextFile(path: String): Unit =
      saveAsSingleTextFileInternal(path, None)

    def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
      saveAsSingleTextFileInternal(path, Some(codec))

    private def saveAsSingleTextFileInternal(
      path: String, codec: Option[Class[_ <: CompressionCodec]]
    ): Unit = {

      // The interface with hdfs:
      val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)

      // Classic saveAsTextFile in a temporary folder:
      hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
      codec match {
        case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
        case None        => rdd.saveAsTextFile(s"$path.tmp")
      }

      // Merge the folder of resulting part-xxxxx files into one file:
      hdfs.delete(new Path(path), true) // to make sure it's not there already
      FileUtil.copyMerge(
        hdfs, new Path(s"$path.tmp"),
        hdfs, new Path(path),
        true, rdd.sparkContext.hadoopConfiguration, null
      )
      // Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144

      hdfs.delete(new Path(s"$path.tmp"), true)
    }
  }
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
First stores the rdd with rdd.saveAsTextFile in a temporary folder path/to/file.txt.tmp, as if we didn't want to store the data in one file (which keeps the processing of upstream tasks parallel).
Only then, using the Hadoop FileSystem API, does it proceed with the merge (FileUtil.copyMerge()) of the different output files to create the final single output file path/to/file.txt.
You could call coalesce(1) and then saveAsTextFile() - but it might be a bad idea if you have a lot of data. Separate files per split are generated just like in Hadoop in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as #aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
In Spark 1.6.1 the format is as shown below. It creates a single output file. It is best practice to use it only if the output is small enough to handle. Basically, coalesce returns a new RDD that is reduced into numPartitions partitions. If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you'd like (e.g. one node in the case of numPartitions = 1).
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() and follow this way:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
var repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark; in the current version, 1.0.0, it's not possible unless you do it manually somehow, for example, as you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a really small number of partitions, as this can cause upstream partitions to inherit this number of partitions.
I would not recommend using coalesce(1) unless it is really required.
Here's my answer for outputting a single file. I just added coalesce(1). The original code:
val year = sc.textFile("apat63_99.txt")
.map(_.split(",")(1))
.flatMap(_.split(","))
.map((_,1))
.reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
The code with coalesce(1) added:
year.coalesce(1).saveAsTextFile("year")

Best way to represent a readline loop in Scala?

Coming from a C/C++ background, I'm not very familiar with the functional style of programming, so all my code tends to be very imperative, as in most cases I just can't see a better way of doing it.
I'm just wondering if there is a way of making this block of Scala code more "functional"?
var line: String = "";
var lines: String = "";
do {
  line = reader.readLine();
  lines += line;
} while (line != null)
How about this?
val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).mkString
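As a fuller sketch (with a hypothetical input file), including creating and closing the reader:
import java.io.{BufferedReader, FileReader}

val reader = new BufferedReader(new FileReader("input.txt"))
try {
  // Read until readLine returns null, joining the lines with newlines:
  val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).mkString("\n")
  println(lines)
} finally {
  reader.close()
}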
Well, in Scala you can actually say:
val lines = scala.io.Source.fromFile("file.txt").mkString
But this is just library sugar. See Read entire file in Scala? for other possibilities. What you are actually asking is how to apply the functional paradigm to this problem. Here is a hint:
Source.fromFile("file.txt").getLines().foreach {println}
Do you get the idea behind this? foreach line in the file, execute the println function. BTW, don't worry: getLines() returns an iterator, not the whole file. Now something more serious:
lines filter {_.startsWith("ab")} map {_.toUpperCase} foreach {println}
See the idea? Take lines (it can be an array, a list, a set, an iterator, anything that can be filtered and whose items have a startsWith method) and filter it, taking only the items starting with "ab". Now take every item and map it by applying the toUpperCase method. Finally, foreach resulting item, print it.
One last thought: you are not limited to a single type. For instance, say you have a file containing integer numbers, one per line. If you want to read that file, parse the numbers and sum them up, simply say:
lines.map(_.toInt).sum
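For example (assuming a hypothetical numbers.txt with one integer per line):
// Read, parse and sum the numbers in one pass over the iterator:
val total = scala.io.Source.fromFile("numbers.txt").getLines().map(_.trim.toInt).sum
println(total)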
To add how the same can be achieved using the (formerly new) java.nio.file API, which I'd vote for because it has several advantages:
import java.nio.file.{Files, Path, Paths}
import scala.io.Source

val path: Path = Paths.get("foo.txt")
val lines = Source.fromInputStream(Files.newInputStream(path)).getLines()

// Now we can iterate the file or do anything we want,
// e.g. using functional operations such as map. Or simply concatenate.
val result = lines.mkString
Don't forget to close the stream afterwards.
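One way to make sure the stream gets closed is a plain try/finally, sketched here with the same hypothetical foo.txt:
import java.nio.file.{Files, Paths}
import scala.io.Source

val in = Files.newInputStream(Paths.get("foo.txt"))
try {
  // Read the whole file while the stream is open:
  println(Source.fromInputStream(in).getLines().mkString("\n"))
} finally {
  in.close()
}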
I find that Stream is a pretty nice approach: it creates a re-traversable (if needed) sequence:
def loadLines(in: java.io.BufferedReader): Stream[String] = {
  val line = in.readLine
  if (line == null) Stream.Empty
  else Stream.cons(line, loadLines(in))
}
Each Stream element has a value (a String, line, in this case), and calls a function (loadLines(in), in this example) which will yield the next element, lazily, on demand. This makes for a good memory usage profile, especially with large data sets -- lines aren't read until they're needed, and aren't retained unless something is actually still holding onto them. Yet you can also go back to a previous Stream element and traverse forward again, yielding the exact same result.