How to print the contents of RDD? - scala

I'm attempting to print the contents of a collection to the Spark console.
I have a type:
linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]
And I use the command:
scala> linesWithSessionId.map(line => println(line))
But this is printed:
res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at <console>:19
How can I write the RDD to console or save it to disk so I can view its contents?

If you want to view the contents of an RDD, one way is to use collect():
myRDD.collect().foreach(println)
That's not a good idea, though, when the RDD has billions of lines. Use take() to take just a few to print out:
myRDD.take(n).foreach(println)

The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.
To print it, you can use foreach (which is an action):
linesWithSessionId.foreach(println)
To write it to disk, you can use one of the saveAs... functions (which are also actions) from the RDD API, such as saveAsTextFile.
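The laziness described above can be seen without a cluster. The following is a plain-Scala analogy rather than Spark code: Iterator.map stands in for an RDD transformation, and consuming the iterator stands in for an action. The names LazyMapDemo and run are made up for this illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyMapDemo {
  // Returns (side effects recorded before consuming, side effects recorded after consuming).
  def run(lines: Seq[String]): (Seq[String], Seq[String]) = {
    val seen = ArrayBuffer.empty[String]
    // Like rdd.map(...): this only describes the work; nothing is evaluated yet.
    val mapped = lines.iterator.map { line => seen += line; line }
    val before = seen.toList
    // Like rdd.foreach(...): consuming the iterator forces the evaluation.
    mapped.foreach(_ => ())
    (before, seen.toList)
  }
}
```

Calling LazyMapDemo.run(Seq("a", "b")) gives an empty "before" list and a full "after" list, which is the same reason the map(println) call in the question printed nothing.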

You can convert your RDD to a DataFrame then show() it.
// For implicit conversion from RDD to DataFrame
import spark.implicits._
val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))
// convert to DF then show it
fruits.toDF().show()
This will show the top 20 lines of your data, so the size of your data should not be an issue.
+------+---+
|    _1| _2|
+------+---+
| apple|  1|
|banana|  2|
|orange| 17|
+------+---+

If you're running this on a cluster, println runs on the executors, so its output goes to the executors' stdout rather than to your driver session. You need to bring the RDD data to your session first. To do this, collect it into a local array and then print it (toArray() is the older, deprecated name for collect()):
linesWithSessionId.collect().foreach(line => println(line))

There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (not only with collect, but also with other actions). One of the differences I saw is that with myRDD.foreach(println) the output order is not guaranteed, because the partitions are processed in parallel. For example, if my RDD comes from a text file where each line has a number, the output can appear in a different order. But with myRDD.collect().foreach(println), the order stays the same as in the text file.

In python
linesWithSessionIdCollect = linesWithSessionId.collect()
linesWithSessionIdCollect
This will print out all the contents of the RDD.

c.take(10)
and newer versions of Spark will show the table nicely.

Instead of typing it each time, you can:
[1] Create a generic print method inside Spark Shell.
def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)
[2] Or even better, using implicits, you can add the function to RDD class to print its contents.
implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
def print = rdd.foreach(println)
}
Example usage:
val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)
p(rdd) // 1
rdd.print // 2
Output:
2
6
4
8
Important
This only makes sense if you are working in local mode and with a small data set. Otherwise, you either will not be able to see the results on the client, or you will run out of memory because the result is too big.

You can also save as a file: rdd.saveAsTextFile("alicia.txt")

In Java syntax:
rdd.collect().forEach(line -> System.out.println(line));

Related

To split data into good and bad rows and write to output file using Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write to separate files in HDFS
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I checked the first record the value could be seen in the spark shell
good.first()
But when I try to write to an output file I am seeing the below records,
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks.!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], so the write operation saves each array's default toString (its type and hash code) instead of the string values.
You should convert each array of strings back to a tab-separated string before saving. Try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")
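The difference between an array's default toString and mkString can be checked in plain Scala, without Spark. This is a small sketch; the object name ArrayToLine and its two helpers are hypothetical. It also shows one related pitfall that may matter when counting columns: String.split drops trailing empty fields unless a negative limit is passed.

```scala
object ArrayToLine {
  // Joins the fields of one split row back into a single tab-separated line.
  def toTsvLine(fields: Array[String]): String = fields.mkString("\t")

  // Splits on tabs and keeps trailing empty fields (limit = -1),
  // so a row that ends in empty columns still reports its full column count.
  def splitKeepingEmpty(line: String): Array[String] = line.split("\t", -1)
}
```

With these helpers, the map step before saveAsTextFile would be good.map(ArrayToLine.toTsvLine), assuming the object is defined in the same shell session.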

How to read with Spark constantly updating HDFS directory and split output to multiple HDFS files based on String (row)?

Elaborated scenario -> HDFS directory which is "fed" with new log data of multiple types of bank accounts activity.
Each row represents a random activity type, and each row (String) contains the text "ActivityType=<TheTypeHere>".
In Spark-Scala, what's the best approach to read the input file/s in the HDFS directory and output multiple HDFS files, where each ActivityType is written to its own new file?
Adapted first answer to the statement:
The location of the "key" string is random within the parent String,
the only thing that is guaranteed is that it contains that sub-string,
in this case "ActivityType" followed by some val.
The question is really about this. Here goes:
// SO Question
val rdd = sc.textFile("/FileStore/tables/activitySO.txt")
val rdd2 = rdd.map(x => (x.slice (x.indexOfSlice("ActivityType=<")+14, x.indexOfSlice(">", (x.indexOfSlice("ActivityType=<")+14))), x))
val df = rdd2.toDF("K", "V")
df.write.partitionBy("K").text("SO_QUESTION2")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
3,4,4,ActivityType=<ACT_002>,A,1,2
ABC,ActivityType=<ACT_0033>
DEF,ActivityType=<ACT_0033>
The output is 3 files, whereby the key is not ActivityType= but rather ACT_001, etc. The key data is not stripped; it is still there in the String. You can modify that if you want, as well as the output location and format.
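The key extraction in the slice/indexOfSlice expression above can also be written with a regular expression, which avoids the hard-coded offset of 14 and handles rows that lack the marker. This is only a sketch in plain Scala; the object name ActivityKey and the method extract are hypothetical.

```scala
object ActivityKey {
  // Matches "ActivityType=<...>" anywhere in the row and captures the text between the angle brackets.
  private val Pattern = "ActivityType=<([^>]*)>".r

  // Returns Some(key) when the marker is present, None otherwise.
  def extract(row: String): Option[String] =
    Pattern.findFirstMatchIn(row).map(_.group(1))
}
```

In the RDD version, rows without a key could then be dropped with something like rdd.flatMap(x => ActivityKey.extract(x).map(k => (k, x))) before calling toDF.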
You can use MultipleOutputFormat for this. Convert the RDD into key-value pairs such that ActivityType is the key. Spark will create different files for different keys. You can decide, based on the key, where to place the files and what their names will be.
You can do something like this using RDDs whereby I assume you have variable length files and then converting to DFs:
val rdd = sc.textFile("/FileStore/tables/activity.txt")
val rdd2 = rdd.map(_.split(","))
.keyBy(_(0))
val rdd3 = rdd2.map(x => (x._1, x._2.mkString(",")))
val df = rdd3.toDF("K", "V")
//df.show(false)
df.write.partitionBy("K").text("SO_QUESTION")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
ActivityType=<ACT_002>,A,1,2
ActivityType=<ACT_003>,ABC
I then get 3 files as output, in this case 1 for each record. It is a bit hard to show, as I did it in Databricks.
You can adjust your output format and location, etc. partitionBy is the key here.

Map Triples to IDs/numbers using ZipWithIndex/ZipWithUniqueID

I asked this question before, but it was unclear, so I have added more explanation to make it clearer and to get help.
Replace strings with zipWithIndex/zipWithUniqueId
I am trying to map strings to numbers using zipWithIndex or zipWithUniqueId.
Let's say I have this format:
("u1",("name", "John Sam"))
("u2",("age", "twinty Four"))
("u3",("name", "sam Blake"))
I want this result
(0,(3,4))
(1,(5,6))
(2,(3,8))
I tried to apply zipWithIndex directly to the triples, but each letter got mapped to a number. I want to map the whole string without dividing it.
I also tried to extract the first element of the key-value pair,
so I did
val first = file.map(line=> line._1).distinct()
and then applied zipWithIndex:
val z1= first.zipWithIndex()
I got result like this
("u1",0)
("u2",1)
("u3",2)
Now I need to take the ids/numbers and substitute them into my original file. I also need to keep all the distinct ids/numbers in a hash table so that I can look them up later on.
Is there any way to do that? Any suggestions?
I hope my question is clear.
You mean something like this?
val file = List(("u1",("name", "John Sam")),
("u2",("age", "twinty Four")),
("u3",("name", "sam Blake")))
// Collect every string (first elements and both value strings), then de-duplicate the whole list
val first = (file.map(line => line._1) ++
file.flatMap(line => List(line._2._1, line._2._2))).distinct
val z1: Map[String,Int] = first.zipWithIndex.toMap
file.map{ l =>
(z1(l._1),
(z1(l._2._1), z1(l._2._2)))
}
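For the "hash table to look them up later" part of the question, one possible sketch in plain Scala keeps both a forward table (string to id) and a reverse table (id to string). It works on a local Seq, as in the answer above; the object name IdDictionary and its methods are hypothetical, and a large RDD would need a distributed join instead of an in-memory Map.

```scala
object IdDictionary {
  type Triple = (String, (String, String))

  // Assigns a dense index to every distinct string, in order of first appearance.
  def build(triples: Seq[Triple]): Map[String, Int] = {
    val subjects = triples.map(_._1)
    val others = triples.flatMap { case (_, (p, o)) => Seq(p, o) }
    (subjects ++ others).distinct.zipWithIndex.toMap
  }

  // Reverse lookup table: id -> original string.
  def invert(ids: Map[String, Int]): Map[Int, String] = ids.map(_.swap)

  // Replaces every string in the triples by its id.
  def encode(triples: Seq[Triple], ids: Map[String, Int]): Seq[(Int, (Int, Int))] =
    triples.map { case (s, (p, o)) => (ids(s), (ids(p), ids(o))) }
}
```

With the three sample triples, the ids are assigned in order of first appearance, so "name" gets the same id in the first and third triple.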

How can I pretty print a data frame in Zeppelin/Spark/Scala?

I am using Spark 2 and Scala 2.11 in a Zeppelin 0.7 notebook. I have a dataframe that I can print like this:
dfLemma.select("text", "lemma").show(20,false)
and the output looks like:
+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text |lemma |
+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|RT #Dope_Promo: When you and your crew beat your high scores on FUGLY FROG 😍🔥 https://time.com/Sxp3Onz1w8 |[rt, #dope_promo, :, when, you, and, you, crew, beat, you, high, score, on, FUGLY, FROG, https://time.com/sxp3onz1w8] |
|RT #axolROSE: Did yall just call Kermit the frog a lizard? https://time.com/wDAEAEr1Ay |[rt, #axolrose, :, do, yall, just, call, Kermit, the, frog, a, lizard, ?, https://time.com/wdaeaer1ay] |
I am trying to make the output nicer in Zeppelin, by:
val printcols= dfLemma.select("text", "lemma")
println("%table " + printcols)
which gives this output:
printcols: org.apache.spark.sql.DataFrame = [text: string, lemma: array<string>]
and a new blank Zeppelin paragraph headed
[text: string, lemma: array]
Is there a way of getting the dataframe to show as a nicely formatted table?
TIA!
In Zeppelin you can use z.show(df) to show a pretty table. Here's an example:
val df = Seq(
(1,1,1), (2,2,2), (3,3,3)
).toDF("first_column", "second_column", "third_column")
z.show(df)
I know this is an old thread, but just in case it helps...
The code below was the only way I could show just a portion of the df. Trying to add a second parameter to .show(), as suggested in the comments, throws an error.
z.show(df.limit(10))
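The println("%table " + printcols) attempt in the question prints only the schema, because concatenating a DataFrame to a string calls its toString. Zeppelin's %table display expects tab-separated cells with one row per line. A minimal sketch of building such a string in plain Scala follows; the object name ZeppelinTable and the method render are hypothetical, and tabs or newlines inside cells are replaced by spaces as a simplifying assumption.

```scala
object ZeppelinTable {
  // Builds text for Zeppelin's %table display: tab-separated cells, one row per line.
  def render(header: Seq[String], rows: Seq[Seq[Any]]): String = {
    // Tabs and newlines inside a cell would break the layout, so replace them with spaces.
    def clean(cell: Any): String = cell.toString.replace('\t', ' ').replace('\n', ' ')
    val lines = (header +: rows).map(_.map(clean).mkString("\t"))
    "%table " + lines.mkString("\n")
  }
}
```

In a notebook this could be used as println(ZeppelinTable.render(df.columns.toSeq, df.take(20).map(_.toSeq).toSeq)), assuming df is small enough to bring 20 rows to the driver. z.show(df) remains the simpler option.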

How to Remove first few lines/header from multiple files using scala in spark

I was able to remove the first few lines of a single file using the code below:
scala> val file = sc.textFile("file:///root/path/file.csv")
Removing first 5 lines:
scala> val Data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(5) else iter }
The problem is: Suppose that I have multiple files with the same columns, and I want to load all of them into rdd, removing the first few lines of each file.
Is this actually possible?
I'd appreciate any help. Thanks in advance!
Let's assume there are 2 files.
ravis-MacBook-Pro:files raviramadoss$ cat file.csv
first_file_first_record
first_file_second_record
first_file_third_record
first_file_fourth_record
first_file_fifth_record
first_file_sixth_record
ravis-MacBook-Pro:files raviramadoss$ cat file_2.csv
second_file_first_record
second_file_second_record
second_file_third_record
second_file_fourth_record
second_file_fifth_record
second_file_sixth_record
second_file_seventh_record
second_file_eight_record
Scala code:
sc.wholeTextFiles("/Users/raviramadoss/files").flatMap( _._2.lines.drop(5) ).collect()
Output:
res41: Array[String] = Array(first_file_sixth_record, second_file_sixth_record, second_file_seventh_record, second_file_eight_record)
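The per-file logic of the wholeTextFiles approach can be tried in plain Scala on (path, content) pairs, which is the shape wholeTextFiles produces. This is a sketch with hypothetical names (DropHeaders, dropFirstLines). It uses linesIterator rather than lines, since on JDK 11 and later lines may resolve to Java's String.lines, which returns a Java stream and has no drop method.

```scala
object DropHeaders {
  // files: (path, full content) pairs, the same shape sc.wholeTextFiles produces.
  // Drops the first n lines of every file and concatenates what remains.
  def dropFirstLines(files: Seq[(String, String)], n: Int): Seq[String] =
    files.flatMap { case (_, content) => content.linesIterator.drop(n) }
}
```

On an RDD the same function body would go inside flatMap, for example sc.wholeTextFiles(dir).flatMap { case (_, content) => content.linesIterator.drop(5) }. Keep in mind that wholeTextFiles reads each file fully into memory, so it suits many small files rather than a few very large ones.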
In Spark/Hadoop, if you give the input path as the directory containing all the files, sc.textFile will load every file in that directory into one RDD.
Note, however, that the idx == 0 check in your code matches only the first partition, which typically holds the start of the first file, so the first few lines of the other files will not be removed. To drop the first lines of every file, each file has to be handled separately, for example with wholeTextFiles.