Can't print RDD content with take() action - scala

When I print the RDD content with the first() action and a foreach loop, I can see the fields. But with the take() action, the content itself is not printed.
using first()
myRDD.first().foreach(println)
1
2013-07-25 00:00:00.0
11599
CLOSED
using take():
myRDD.take(5).foreach(println)
[Ljava.lang.String;@23a5818e
[Ljava.lang.String;@4715ae33
[Ljava.lang.String;@9fc9f91
[Ljava.lang.String;@1fac1d5c
[Ljava.lang.String;@108a46d6
I expected the same output as with first(), but I get a different result instead.

I assume your RDD is of type org.apache.spark.rdd.RDD[Array[String]]. In that case the return type of the first method is Array[String] and the foreach(println) prints the elements of the first string array in the RDD.
But the return type of take(5) is Array[Array[String]], and foreach(println) prints the default toString of each of those 5 inner arrays, which is why you see lines like [Ljava.lang.String;@23a5818e.
To get the same output for first and take(5) either use
println(myRDD.first())
myRDD.take(5).foreach(println)
or
myRDD.first().foreach(println)
myRDD.take(5).foreach(_.foreach(println))
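As a minimal sketch (assuming a small hand-built RDD of Array[String], since the original data set isn't shown), the difference can be reproduced like this:
// hypothetical sample data standing in for the original file
val myRDD: org.apache.spark.rdd.RDD[Array[String]] =
  sc.parallelize(Seq(
    Array("1", "2013-07-25 00:00:00.0", "11599", "CLOSED"),
    Array("2", "2013-07-25 00:00:00.0", "256", "PENDING_PAYMENT")
  ))
myRDD.first().foreach(println)             // prints the fields of the first record
myRDD.take(5).foreach(println)             // prints each array's default toString, e.g. [Ljava.lang.String;@...
myRDD.take(5).foreach(_.foreach(println))  // prints the fields of every record taken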

Related

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named 'Spclty1'...'Spclty6' and another 6 named 'StartDt1'...'StartDt6'. I want to zip them and collapse them into a column that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6')))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
So, it would be good to understand why it only worked the first time, but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?
.withColumn() expects a Column object as its second parameter, and you are supplying a Python list.
Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to create an output column that holds a list of values:
DF_tmp = DF_tmp.withColumn('specialties', array([
    struct(
        *(col("Spclty{}".format(i)).alias("spclty_code"),
          col("StartDt{}".format(i)).alias("start_date"))
    )
    for i in range(1, 7)
]))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.

Invoking print() with list.foreach in Scala is printing Nil

I am new to Scala and learning the language constructs. Using print() with list.foreach() also prints Nil or "()" in the console. Is this expected, or am I missing some trick here?
Code Snippet:
val oneTwo = "one"::"two"::Nil
println(oneTwo.foreach(s=> print(s+" ")))
o/p: one two ()
You have an extra println.
oneTwo.foreach(s=> print(s+" "))
Prints the contents of the list - "one two".
The println you have outside prints out the return value of the foreach statement, which is Unit (not Nil - that's a completely different beast), represented in Scala as ().
To just output the list elements,
oneTwo.foreach(s=> print(s+" "))
would suffice. But you wrap another println around that, so you're saying "and then print whatever oneTwo.foreach(s=> print(s+" ")) evaluates to".
The return type of foreach is Unit, so it returns the only value of that type, the unit value ().
So what you see is the list elements printed by the print in the foreach, and then the outer print prints the result of the foreach. Does that make sense?
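A minimal sketch of the two ways to avoid the trailing (), using the same oneTwo list:
oneTwo.foreach(s => print(s + " "))   // prints: one two
println(oneTwo.mkString(" "))         // builds the string first, then prints: one two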

Count filtered records in scala

As I am new to Scala, this problem might look very basic to everyone.
I have a file called data.txt which contains like below:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split each line and find the records depending on the numbers in the host name, e.g. xxx.lss.yyy23.com.
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is what I tried in order to count the matching records:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?
The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
The reason the code above isn't working is that you need to split data back into an array of strings. Your previous statement concatenated the results into a single string, using a newline as the separator via mkString, so you can't really run collection operations like count on it.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.
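As an alternative sketch (not from the original answer): skip the mkString step and keep the host names in a collection, then you can count without splitting again:
val hosts = io.Source.fromFile("data.txt").getLines()
  .map(_.split("-->")(0))   // keep only the part before "-->"
  .toList
val with33    = hosts.count(_.contains("33"))
val without33 = hosts.count(!_.contains("33"))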

How to print the contents of RDD?

I'm attempting to print the contents of a collection to the Spark console.
I have a type:
linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]
And I use the command:
scala> linesWithSessionId.map(line => println(line))
But this is printed:
res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at <console>:19
How can I write the RDD to console or save it to disk so I can view its contents?
If you want to view the content of a RDD, one way is to use collect():
myRDD.collect().foreach(println)
That's not a good idea, though, when the RDD has billions of lines. Use take() to take just a few to print out:
myRDD.take(n).foreach(println)
The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.
To print it, you can use foreach (which is an action):
linesWithSessionId.foreach(println)
To write it to disk you can use one of the saveAs... functions (still actions) from the RDD API.
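For example (the output path below is just an illustrative placeholder):
linesWithSessionId.saveAsTextFile("output/linesWithSessionId")  // writes one part-NNNNN file per partition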
You can convert your RDD to a DataFrame then show() it.
// For implicit conversion from RDD to DataFrame
import spark.implicits._
val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))
// convert to DF then show it
fruits.toDF().show()
This will show the top 20 lines of your data, so the size of your data should not be an issue.
+------+---+
| _1| _2|
+------+---+
| apple| 1|
|banana| 2|
|orange| 17|
+------+---+
If you're running this on a cluster then println won't print back to your context. You need to bring the RDD data to your session. To do this you can force it to a local array and then print it out:
linesWithSessionId.toArray().foreach(line => println(line))
There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (not only collect, but also other actions). One of the differences I saw is that with myRDD.foreach(println), the output comes back in a random order. For example, if my RDD comes from a text file where each line has a number, the output will be in a different order. But when I did myRDD.collect().foreach(println), the order remained the same as in the text file.
In Python:
linesWithSessionIdCollect = linesWithSessionId.collect()
linesWithSessionIdCollect
This will print out all the contents of the RDD.
c.take(10)
and newer Spark versions will show the table nicely.
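A small sketch, assuming c is a DataFrame (the answer doesn't say what c is):
c.take(10)   // returns the first 10 rows as an Array[Row]
c.show(10)   // prints the first 10 rows as a formatted table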
Instead of typing this each time, you can:
[1] Create a generic print method inside Spark Shell.
def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)
[2] Or even better, using an implicit class, you can add the function to the RDD class to print its contents.
implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
def print = rdd.foreach(println)
}
Example usage:
val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)
p(rdd) // 1
rdd.print // 2
Output:
2
6
4
8
Important
This only makes sense if you are working in local mode with a small data set. Otherwise, you will either not be able to see the results on the client or run out of memory because the result is too big.
You can also save as a file: rdd.saveAsTextFile("alicia.txt")
In Java syntax:
rdd.collect().forEach(line -> System.out.println(line));

The expression prints itself in unexpected order

When I print a log information like this:
val idList = getIdList
log info s"\n\n-------idList: ${idList foreach println}"
It shows me:
1
2
3
4
5
-------idList: ()
That makes sense because foreach returns Unit. But why does it print the list of id first? idList is already evaluated in the previous line (if that's the cause)!
And how do I make it print in the expected order, i.e. after idList:?
This is because the log string doesn't evaluate to what you want to read; it evaluates to:
\n\n -------idList: ()
However, the members of the list appear in the output stream as a side effect, due to the println call in the string interpolation.
EDIT: since clarification was requested by the OP, what happens is that the output comes from two sources:
${idList foreach println} evaluates to (), since println itself doesn't return anything.
However, you can see the elements printed out, because when the string interpolation is evaluated, println is being called. And println prints all the elements into the output stream.
In other words:
//line with log.info() reached, starts evaluating string before method call
1 //println from foreach
2 //println from foreach
3 //println from foreach
4 //println from foreach
5 //println from foreach
//string argument log.info() evaluated from interpolation
-------idList: () //log info prints the resultant string
To solve your problem, modify the expression in the interpolated string to actually return the correct string, e.g.:
log info s"\n\n-------idList: ${idList.mkString("\n")}"
Interpolation works in the following way:
evaluate all arguments
substitute their results into the resulting string
println is a Unit-returning function that prints to standard output; you should use mkString instead, which returns a string:
log info s"\n\n-------idList: ${idList.mkString("(", ", ", ")")}"
As pointed out by @TheTerribleSwiftTomato, you need to give an expression that returns a value and has no other side-effect. So simply do it like this:
val idList = getIdList
log info s"\n\n-------idList: ${idList mkString " "}"
For example, this works for me:
val idList = List(1, 2, 3, 4, 5)
println(s"\n\n-------idList: ${idList mkString " "}")
Output:
-------idList: 1 2 3 4 5