How can I pretty print a data frame in Zeppelin/Spark/Scala?

I am using Spark 2 and Scala 2.11 in a Zeppelin 0.7 notebook. I have a dataframe that I can print like this:
dfLemma.select("text", "lemma").show(20,false)
and the output looks like:
+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|text                                                                                                       |lemma                                                                                                                  |
+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|RT #Dope_Promo: When you and your crew beat your high scores on FUGLY FROG 😍🔥 https://time.com/Sxp3Onz1w8 |[rt, #dope_promo, :, when, you, and, you, crew, beat, you, high, score, on, FUGLY, FROG, https://time.com/sxp3onz1w8] |
|RT #axolROSE: Did yall just call Kermit the frog a lizard? https://time.com/wDAEAEr1Ay                     |[rt, #axolrose, :, do, yall, just, call, Kermit, the, frog, a, lizard, ?, https://time.com/wdaeaer1ay]                |
+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
I am trying to make the output nicer in Zeppelin, by:
val printcols= dfLemma.select("text", "lemma")
println("%table " + printcols)
which gives this output:
printcols: org.apache.spark.sql.DataFrame = [text: string, lemma: array<string>]
and a new blank Zeppelin paragraph headed
[text: string, lemma: array]
Is there a way of getting the dataframe to show as a nicely formatted table?
TIA!

In Zeppelin you can use z.show(df) to show a pretty table. Here's an example:
val df = Seq(
  (1, 1, 1), (2, 2, 2), (3, 3, 3)
).toDF("first_column", "second_column", "third_column")
z.show(df)
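If you specifically want the %table output the question attempts, one option is to build the tab/newline-separated string that Zeppelin's %table display expects yourself. This is a minimal sketch, assuming the dfLemma frame from the question; collect() pulls the rows to the driver, so limit first:
// Sketch: Zeppelin renders output that starts with "%table" as a table,
// with column names and values separated by tabs and rows by newlines.
val rows = dfLemma.select("text", "lemma")
  .limit(20)
  .collect()
  .map(r => s"${r.getString(0)}\t${r.getSeq[String](1).mkString(" ")}")
println("%table text\tlemma\n" + rows.mkString("\n"))
In practice z.show(df) does the same thing with less code, so the manual route is mainly useful when you want to control the formatting of individual cells.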

I know this is an old thread, but just in case it helps...
The below was the only way I could show a portion of the df. Trying to add a second parameter to .show() as suggested in the comments throws an error.
z.show(df.limit(10))

Related

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named like 'Spclty1'...'Spclty6' and another 6 named like 'StartDt1'...'StartDt6'. I want to zip them and collapse them into a column that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6')))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
So, it would be good to understand why it only worked the first time, but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?
.withColumn() expects a Column object as its second parameter, but you are supplying a Python list.
Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to create an output column that holds a list of values:
from pyspark.sql.functions import array, col, struct

DF_tmp = DF_tmp.withColumn('specialties', array([
    struct(
        *(col("Spclty{}".format(i)).alias("spclty_code"),
          col("StartDt{}".format(i)).alias("start_date"))
    )
    for i in range(1, 7)
]))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.
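Since the rest of this thread is Scala, here is a rough Scala sketch of the same idea, assuming the same Spclty1..Spclty6 / StartDt1..StartDt6 layout (DF_tmp stands in for whatever DataFrame holds those columns):
import org.apache.spark.sql.functions.{array, col, struct}

// Sketch: build an array of (spclty_code, start_date) structs from the six
// column pairs, mirroring the PySpark answer above.
val specialties = array(
  (1 to 6).map { i =>
    struct(
      col(s"Spclty$i").alias("spclty_code"),
      col(s"StartDt$i").alias("start_date")
    )
  }: _*
)
val result = DF_tmp.withColumn("specialties", specialties)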

How to replace white space with comma in Spark ( with Scala)?

I have a log file like this. I want to create a DataFrame in Scala.
2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2
I want to replace all the spaces with commas so that I can use spark.sql but I am unable to do so.
Here is everything I tried:
Tried importing it as a text file first to see if there is a replaceAll method.
Tried splitting on the basis of space.
Any suggestions? I went through the documentation and there is no mention of a replace function like in Pandas.
You can simply tell Spark that your delimiter is a whitespace character, like this:
val df = spark.read.option("delimiter", " ").csv("path/to/file")
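Since the goal was to query the data with spark.sql, a small follow-on sketch: give the columns names and register a temp view. The default quote character '"' keeps the quoted request and user-agent strings together even though they contain spaces. The column names below are my guesses at the ELB access-log layout, so adjust them to your file:
val logs = spark.read
  .option("delimiter", " ")
  .csv("path/to/file")
  .toDF("timestamp", "elb", "client", "backend",
        "request_time", "backend_time", "response_time",
        "elb_status", "backend_status", "received_bytes", "sent_bytes",
        "request", "user_agent", "ssl_cipher", "ssl_protocol")

logs.createOrReplaceTempView("logs")
spark.sql("SELECT elb_status, count(*) FROM logs GROUP BY elb_status").show()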
Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, then convert to a DataFrame with a schema.
Roughly:
val rdd = sc.textFile({logline path}).map(line=>line.split("\\s+"))
Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way would be to map your arrays to Row objects, as an RDD[Row] is what underlies a DataFrame.
A simpler way to get up and going though would be
spark.createDataFrame(rdd).toDF("datetime", "host", "ip", ...)
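If you do want the more robust Row-plus-schema route mentioned above, a minimal sketch (only the first three fields are named here, and the names and types are illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch: map each token array to a Row and attach an explicit schema.
val schema = StructType(Seq(
  StructField("datetime", StringType),
  StructField("host", StringType),
  StructField("ip", StringType)
))
val rowRdd = rdd.map(tokens => Row(tokens(0), tokens(1), tokens(2)))
val df = spark.createDataFrame(rowRdd, schema)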
If you just want to split on spaces while keeping the strings inside double quotes intact, you can use the Apache Commons CSV library.
import org.apache.commons.csv.{CSVFormat, CSVParser}
val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
val http = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0).get(11)
val curl = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0).get(12)
println(http)
println(curl)
Results:
GET https://www.example.com:443/ HTTP/1.1
curl/7.38.0

Map Triples to IDs/numbers using ZipWithIndex/ZipWithUniqueID

I asked this question before, but it was unclear, so I have added more explanation to make it clearer and to get help.
replace strings with ZipWithIndex/ZipWithUniqueID
I am trying to map strings to numbers using zipWithIndex or zipWithUniqueId.
Let's say I have this format:
("u1",("name", "John Sam"))
("u2",("age", "twinty Four"))
("u3",("name", "sam Blake"))
I want this result
(0,(3,4))
(1,(5,6))
(2,(3,8))
I tried applying zipWithIndex directly to the triples, but each letter got mapped to a number; I want to map the whole string without splitting it up.
I also tried extracting the first element of each key-value pair, so I did
val first = file.map(line => line._1).distinct()
and then applied zipWithIndex:
val z1 = first.zipWithIndex()
I got result like this
("u1",0)
("u2",1)
("u3",2)
Now I need to take the ids/numbers and substitute them back into my original file, and I need to keep all the distinct ids/numbers in a hash table so I can look them up later.
Is there any way to do that? Any suggestions?
I hope my question is clear.
You mean something like this?
val file = List(("u1",("name", "John Sam")),
("u2",("age", "twinty Four")),
("u3",("name", "sam Blake")))
val first = file.map(line=> line._1) ++
file.flatMap(line=> List(line._2._1, line._2._2)).distinct
val z1: Map[String,Int] = Map[String,Int](first.zipWithIndex:_*)
file.map{ l =>
(z1(l._1),
(z1(l._2._1), z1(l._2._2)))
}
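If file is actually an RDD[(String, (String, String))] as in the original question (rather than a local List), here is a hedged sketch of the same idea using Spark's zipWithIndex plus a broadcast lookup table, assuming the dictionary of distinct strings fits in driver memory:
// Sketch: collect every distinct string, index it, and broadcast the
// resulting map so it can be used to encode the RDD and looked up later.
val dict = file
  .flatMap { case (k, (a, b)) => Seq(k, a, b) }
  .distinct()
  .zipWithIndex()                       // RDD[(String, Long)]

val lookup = sc.broadcast(dict.collectAsMap())

val encoded = file.map { case (k, (a, b)) =>
  (lookup.value(k), (lookup.value(a), lookup.value(b)))
}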

Using apache spark to iterate over string

For example, we have the string "abcdabcd"
and we want to count all the pairs (e.g. "ab" or "da") that appear in the string.
So how do we do that in Apache Spark?
I asked this because it looks like RDDs do not support a sliding function:
rdd.sliding(2).toList
//Count number of pairs in list
//Returns syntax error on first line (sliding)
Apparently it supports sliding via mllib as shown by zero323 here
import org.apache.spark.mllib.rdd.RDDFunctions._
val str = "abcdabcd"
val rdd = sc.parallelize(str.toSeq)
rdd.sliding(2).map(_.mkString).toLocalIterator.foreach(println)
will show
ab
bc
cd
da
ab
bc
cd
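To go from printing the pairs to counting them (which is what the question asks for), a small sketch along the same lines:
import org.apache.spark.mllib.rdd.RDDFunctions._

// Sketch: count how often each 2-character window occurs.
val pairCounts = sc.parallelize("abcdabcd".toSeq)
  .sliding(2)
  .map(_.mkString)
  .map(pair => (pair, 1))
  .reduceByKey(_ + _)

pairCounts.collect().foreach(println)   // e.g. (ab,2), (bc,2), (cd,2), (da,1)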

How to print the contents of RDD?

I'm attempting to print the contents of a collection to the Spark console.
I have a type:
linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]
And I use the command:
scala> linesWithSessionId.map(line => println(line))
But this is printed instead:
res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at :19
How can I write the RDD to console or save it to disk so I can view its contents?
If you want to view the contents of an RDD, one way is to use collect():
myRDD.collect().foreach(println)
That's not a good idea, though, when the RDD has billions of lines. Use take() to take just a few to print out:
myRDD.take(n).foreach(println)
The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.
To print it, you can use foreach (which is an action):
linesWithSessionId.foreach(println)
To write it to disk you can use one of the saveAs... functions (still actions) from the RDD API
You can convert your RDD to a DataFrame then show() it.
// For implicit conversion from RDD to DataFrame
import spark.implicits._
val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))
// convert to DF then show it
fruits.toDF().show()
This will show the top 20 lines of your data, so the size of your data should not be an issue.
+------+---+
| _1| _2|
+------+---+
| apple| 1|
|banana| 2|
|orange| 17|
+------+---+
If you're running this on a cluster then println won't print back to your context. You need to bring the RDD data to your session. To do this you can force it into a local array and then print it out:
linesWithSessionId.collect().foreach(line => println(line))
(Older Spark versions also offered toArray(), but it is deprecated in favour of collect().)
There are probably many behavioural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (not only for collect, but also for other actions). One of the differences I saw is that with myRDD.foreach(println) the output comes back in a random order. For example, if my RDD comes from a text file where each line has a number, the output will be in a different order. But with myRDD.collect().foreach(println), the order remains the same as in the text file.
In Python:
linesWithSessionIdCollect = linesWithSessionId.collect()
linesWithSessionIdCollect
This will print out all the contents of the RDD.
c.take(10)
Newer Spark versions will show the table nicely.
Instead of typing this each time, you can:
[1] Create a generic print method inside Spark Shell.
def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)
[2] Or even better, using implicits, you can add the function to RDD class to print its contents.
implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
  def print = rdd.foreach(println)
}
Example usage:
val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)
p(rdd) // 1
rdd.print // 2
Output:
2
6
4
8
Important
This only makes sense if you are working in local mode and with a small dataset. Otherwise, you will either not be able to see the results on the client or will run out of memory because of the large result.
You can also save as a file: rdd.saveAsTextFile("alicia.txt")
In Java syntax:
rdd.collect().forEach(line -> System.out.println(line));