How do I print the contents of an Apache Spark RDD in my terminal? - scala

This is my first time using Scala and Apache Spark for a project. I'm trying to print the contents of a matrix when I run my code in the terminal, but nothing I've tried has worked so far.
Instead I only get this printed:
I'm just using println(), and when I use collect(), that doesn't give a good result either.

The default toString prints the name of the class followed by its hash code in hexadecimal, not the contents.
You'll want to iterate through your matrix and print each element.

Building on @zero323's comment (as an aside, would you like to post it as an answer?): given an RDD[SomeType] you can call collect() to bring its elements back to the driver as a local array.
Then you can print out the results using the normal toString() methods, which depend on the type of the RDD's contents. So if SomeType were a List[Double], printing each element this way
would give you a single-line, comma-separated output of the results.
As @zero323 points out, another consideration is: "do you really want to print out the contents of your RDD?" More likely you only want a summary, such as
println(s"Number of entries in RDD is ${rdd.count()}")
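To make the difference between the default toString and an explicit rendering concrete, here is a minimal sketch in plain Scala. It assumes nothing about the asker's actual data: a local Array stands in for what collect() returns, and the names collected and rendered are made up for illustration.

```scala
// Stand-in for the local array that rdd.collect() would return
// for an RDD[List[Double]] (hypothetical sample data).
val collected: Array[List[Double]] = Array(List(1.0, 2.0), List(3.5, 4.25))

// Printing the array itself uses the JVM's default toString for arrays:
// a type tag followed by a hash code, not the contents.
println(collected)

// Rendering each row explicitly gives one comma-separated line per row.
val rendered: Array[String] = collected.map(_.mkString(","))
rendered.foreach(println)
```

With a real RDD the same mkString call would be applied to the result of collect(), or to take(n) if the data set is too large to bring back to the driver in full.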

Iterate over the RDD like this:

scala> val rdd1 = sc.parallelize(List(1,2,3,4)).map(_*2)
To print the data within the RDD:
scala> rdd1.collect().foreach(println)


Removing Data Type From Tuple When Printing In Scala

I currently have two maps: -
mapBuffer = Map[String, ListBuffer[(Int, String, Float)]]
personalMapBuffer = Map[mapBuffer, String]
The idea of what I'm trying to do is create a list of something, and then allow a user to create a personalised list which includes a comment, so they'd have their own list of maps.
Everything above works as expected; I am simply trying to print the information.
To print the Key from mapBuffer, I use: -
mapBuffer.foreach(line => println(line._1))
This returns: -
Sample String 1
Sample String 2
To print the same thing from personalMapBuffer, I am using: -
personalMapBuffer.foreach(line => println(
However, this returns: -
List(Sample String 1)
List(Sample String 2)
I obviously would like it to just return "Sample String" and remove the List() aspect. I'm assuming this has something to do with the .map function, although this was the only way I could find to access a tuple within a tuple. Is there a simple way to remove the data type? I was hoping for something simple like: -
But obviously no such pre-function exists. I'm very new to Scala so this might be something extremely simple (which I hope it is haha) or it could be a bit more complex. Any help would be great.
What you see is the default List.toString behaviour. You can build your own string with the mkString operation:
val separator = ","
personalMapBuffer.foreach(line => println(
which will produce the desired result of Sample String 1, or Sample String 1, Sample String 2 if there are two strings.
Hope this helps!
I have found a way to get the result I was looking for, however I'm not sure if it's the best way.
The .map() method just returns a collection. You can see more info on that here:-
By using any sort of specific element finder at the end, I'm able to return only the element and not the data type. For example: -
As I was writing this, Ivan Kurchenko replied above suggesting I use .mkString. This also works and, to my mind, looks a little better than .head.
Again, I'm not 100% sure this is the most efficient way, but if you need this for something, it has worked for me so far.
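For reference, the mkString approach can be sketched end to end as follows. This is only an illustration under assumptions: the sample entries, the comment string, the Entries type alias and the labels value are invented here, and the keys are sorted purely to make the output order predictable.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical alias for the inner map type described in the question.
type Entries = Map[String, ListBuffer[(Int, String, Float)]]

// Invented sample data in the shape of mapBuffer and personalMapBuffer.
val mapBuffer: Entries = Map(
  "Sample String 1" -> ListBuffer((1, "first", 1.5f)),
  "Sample String 2" -> ListBuffer((2, "second", 2.5f))
)
val personalMapBuffer: Map[Entries, String] = Map(mapBuffer -> "my comment")

// mkString joins the keys into one plain String, so no List(...) wrapper
// from the collection's toString ends up in the output.
val labels: Seq[String] = personalMapBuffer.keys.toSeq.map { inner =>
  inner.keys.toSeq.sorted.mkString(", ")
}
labels.foreach(println)
```

Compared with .head, mkString keeps working when the inner map holds more than one key, and it does not throw on an empty collection.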

Using distinct on a slice stringbuilder

Now when I perform this it seems to apply .distinct to the whole string rather than to the selection I make with slice. (mouse and highlight are just index positions and buffer is a StringBuilder.) I'm just wondering what the reason for this is.
Your approach is correct. Please see the code below for clarification.
slice() gives you the sub-string, so your approach first extracts the sub-string and then applies distinct to that result only.
Please follow the step-by-step approach below:
val buffer=new StringBuilder
val sl=buffer.slice(2,10)
The variable sl contains
sl= baabbbcc
Now you can apply distinct on sl variable
val result=sl.distinct
Finally your output
result= bac
This is how your single line of code works.
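The steps above can be reproduced with a small self-contained sketch. The buffer contents here are an assumption chosen so that slice(2, 10) yields the same baabbbcc shown above; the original buffer contents are not given.

```scala
// Hypothetical buffer contents, picked so that indices 2 to 9 hold "baabbbcc".
val buffer = new StringBuilder("xxbaabbbccyy")

// slice(from, until) copies out the selected range first...
val sl = buffer.slice(2, 10)

// ...and distinct then removes duplicates within that range only.
val result = sl.distinct

// For comparison: distinct over the whole buffer also sees the x and y characters.
val wholeBuffer = buffer.distinct

println(s"slice=$sl distinctOfSlice=$result distinctOfWhole=$wholeBuffer")
```

If distinct appears to act on the whole string, a likely place to look is the index values passed to slice, since a range that happens to cover the full buffer would give the same result as calling distinct on the buffer directly.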

Copy all elements in RDD to Array

So, I'm reading data from a JSON file and creating a DataFrame. Usually, I would read the file directly from a path such as "//line//to//some-file.json".
Problem is that my JSON file isn't consistent. So, for each line in the file, there are 2 JSONs. Each line looks like this
{ I don't need....}, { I need....}
I only need my DataFrame to be formed from the data I need, i.e. the second JSON of each line. So I read each line as a string and substring the part that I need, like so
val lines = sc.textFile(link, 2)
val part = lines.map(x => x.substring(x.lastIndexOf('{')).trim)
I want to get all the elements in 'part' as an Array[String] then turn the Array[String] into one string and make the DataFrame. Like so
val strings = part.collect() //doesn't work
val strings = part.take(1000) //works
val jsonStr = "[".concat(strings.mkString(", ")).concat("]")
The problem is, if I call part.collect(), it doesn't work but if I call part.take(N) it works. However, I'd like to get all my data and not just the first N.
Also, if I try part.take(part.count().toInt) it still doesn't work.
Any Ideas??
I realized my problem after a good sleep. It was a silly mistake on my part. The very last line of my input file had a different format from the rest of the file.
So part.take(N) would work for all N less than part.count(). That's why part.collect() wasn't working.
Thanks for the help though!
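A single malformed line is enough to break this kind of extraction, because lastIndexOf returns -1 when no '{' is present and substring(-1) then throws. The following plain-Scala sketch shows one way to guard against that; a local Seq stands in for the RDD, and lastJsonObject and the sample lines are hypothetical.

```scala
// Returns the last "{...}" chunk of a line, or None if the line has no '{'.
// Note: like the original, this assumes the wanted JSON object has no nested braces.
def lastJsonObject(line: String): Option[String] = {
  val start = line.lastIndexOf('{')
  if (start >= 0) Some(line.substring(start).trim) else None
}

// Invented input: two well-formed lines and a trailing line in another format.
val lines = Seq(
  """{"skip": 1}, {"keep": "a"}""",
  """{"skip": 2}, {"keep": "b"}""",
  "trailer line in a different format"
)

// flatMap silently drops the lines for which no JSON object was found.
val strings = lines.flatMap(lastJsonObject)
val jsonStr = strings.mkString("[", ", ", "]")
println(jsonStr)
```

On an RDD the same function could be passed to flatMap in place of the map call. It also explains the symptom: take(N) only evaluates as many lines as it needs, so the bad last line is reached only when every line is requested.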

I need help parsing a file in scala for running a spark job

I'm running a Spark job in Scala and I'm stuck on parsing the input file.
The Input file(TAB separated) is something like,
date=20160701 name=mike age=26
date=20160402 name=john age=33
I want to parse it and extract only values and not the keys, such as,
20160701 mike 26
20160402 john 33
How can this be achieved in Scala?
I'm using,
You can use CSVParser(); since you know the position of each key, it will be easy and clean.
Test data
val data = "date=20160701\tname=mike\tage=26\ndate=20160402\tname=john\tage=33\n"
One statement to do what you asked
val rdd = sc.parallelize(data.split('\n'))
  .map(_.split('\t') // split into key=value
    .map(_.split('=')(1))) // split those at "=" and select only the value
Display what we got
// 20160701,mike,26
// 20160402,john,33
But don't do this for real code. It's very fragile in the face of data format errors, etc. Use CSVParser or something instead as Narendra Parmar suggests.
sc.textFile("").map(x => x.split("\t").map(_.split("=")(1)).mkString("\t")).saveAsTextFile("")
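The fragility mentioned above comes mainly from split('=')(1), which throws on any field without an "=". As a rough sketch of a more tolerant variant in plain Scala (no Spark and no CSV library; valuesOf and parsed are names invented for this example), malformed fields can be skipped instead:

```scala
// Extracts the value part of every key=value field on a tab-separated line.
// Fields without an '=' are skipped rather than causing an exception.
def valuesOf(line: String): Seq[String] =
  line.split("\t").toSeq.flatMap { field =>
    field.split("=", 2) match {
      case Array(_, value) => Some(value) // limit 2 keeps any '=' inside the value
      case _               => None
    }
  }

val data = "date=20160701\tname=mike\tage=26\ndate=20160402\tname=john\tage=33\n"

// Split into lines, drop empty ones, then parse each line.
val parsed: Seq[Seq[String]] =
  data.split("\n").toSeq.filter(_.nonEmpty).map(valuesOf)

parsed.foreach(row => println(row.mkString("\t")))
```

The same valuesOf function could be passed to map on an RDD of lines. Whether silently skipping bad fields is acceptable depends on the job; logging or counting them may be preferable.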

Parsing options that take more than one value with scopt in scala

I am using scopt to parse command line arguments in scala. I want it to be able to parse options with more than one value. For instance, the range option, if specified, should take exactly two values.
--range 25 45
Coming, from python background, I am basically looking for a way to do the following with scopt instead of python's argparse:
parser.add_argument("--range", default=None, nargs=2, type=float,
metavar=('start', 'end'),
help=(" Foo bar start and stop "))
I don't think minOccurs and maxOccurs solve my problem exactly, nor does the key:value example in its help.
Looking at the source code, this is not possible. The Read type class used has a member tuplesToRead, but it doesn't seem to be working when you force it to 2 instead of 1. You will have to make a feature request, I guess, or work around this by using --min 25 --max 45, or --range '25 45' with a custom Read instance that splits this string into two parts. As @roterl noted, this is not a standard way of parsing.
It should be OK as long as your values are delimited by something other than a space...
--range 25-45
... although you need to split them manually. Parse it with something like:
opt[String]('r', "range").action { (x, c) =>
  val rx = "([0-9]+)\\-([0-9]+)".r
  val rx(from, to) = x // throws a MatchError if x is not of the form N-M
  c.copy(from = from.toInt, to = to.toInt)
}
// ...
println(s" Got range ${parsedArgs.from}..${parsedArgs.to}")
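Because the pattern-binding form throws on input that does not match, it may be worth isolating the split in a small helper that reports failure as a value. This is a sketch in plain Scala with no scopt dependency; RangePattern and parseRange are hypothetical names, and how the result is wired into the option's action is left open.

```scala
import scala.util.Try

// Matches strings of the form "<digits>-<digits>", e.g. "25-45".
val RangePattern = "([0-9]+)-([0-9]+)".r

// Returns Some((from, to)) for well-formed input and None otherwise,
// including when a number is too large to fit in an Int.
def parseRange(raw: String): Option[(Int, Int)] = raw match {
  case RangePattern(from, to) => Try((from.toInt, to.toInt)).toOption
  case _                      => None
}

println(parseRange("25-45"))
println(parseRange("25 45"))
```

Returning an Option keeps the decision about what to do with a bad --range value (error message, default, exit) in one place rather than surfacing as an exception from inside the parser callback.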