I am working with the Spark shell for Scala and found a strange behaviour in the Spark-shell REPL that is not there if I use an IDE.
I can declare the same immutable variable again and again in the REPL, but the same is not allowed in an IDE.
Here is the code in REPL:
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[5] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[7] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[9] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[11] at textFile at <console>:24
And here is the same thing I am trying in the Eclipse IDE, where it shows a compile-time error:
Is there any configuration I am missing for the Spark-shell REPL?
Or is this the expected behaviour?
In your REPL, your code is actually translated as follows:
object SomeName {
val rdd = sc.textFile("README.md")
}
object Some_Other_Name {
val rdd = sc.textFile("README.md")
}
Since both of your rdd vals are defined in separate singletons, there is no name collision between them. And since this happens behind the scenes in the REPL, it feels as if you are reassigning the same val.
In an IDE, all your code is written inside classes or Scala singletons (objects), so redefining a val in the same scope (inside the respective class/object) gives you a compile error.
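For contrast, here is a minimal sketch (with made-up names) of what the IDE sees: both definitions share a single scope, so the second one is rejected at compile time.
object ReplVsIde {
  val rdd = "first"
  val rdd = "second" // error: rdd is already defined as value rdd
}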
Related
I noticed something interesting when working with the spark-shell and I'm curious as to why it happens. I load a text file into Spark using the basic syntax, and then simply repeat the command. The output of the REPL is below:
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[7] at textFile at <console>:24
I know that the MapPartitionsRDD[X] portion features X as the RDD identifier. However, based upon this SO post on RDD identifiers, I'd expect that the identifier integer increments by one each time a new RDD is created. So why exactly is it incrementing by 2?
My guess is that loading a text file creates an intermediate RDD, because creating an RDD from parallelize() clearly only increments the RDD counter by 1 (before it was 7):
scala> val arrayrdd = sc.parallelize(Array(3,4,5))
arrayrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
Note: I don't believe the number has anything to do with partitions. If I check, I get that my RDD is partitioned into 9 partitions:
scala> myreviews.partitions.size
res2: Int = 9
Because a single method call can create more than one intermediate RDD. This is obvious if you check the debug string:
sc.textFile("README.md").toDebugString
String =
(2) README.md MapPartitionsRDD[1] at textFile at <console>:25 []
| README.md HadoopRDD[0] at textFile at <console>:25 []
As you can see, the lineage consists of two RDDs.
The first one is a HadoopRDD, which corresponds to the data import:
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions)
The second one is a MapPartitionsRDD, which corresponds to the subsequent map that drops the keys (offsets) and converts each Text to a String:
.map(pair => pair._2.toString).setName(path)
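As a quick sanity check (a minimal sketch, assuming the usual sc available in spark-shell), the RDD ids show the same thing: the parent of the MapPartitionsRDD returned by textFile is the HadoopRDD, so every textFile call consumes two ids.
val rdd = sc.textFile("README.md")
println(rdd.id)                       // id of the MapPartitionsRDD
println(rdd.dependencies.head.rdd.id) // id of its parent HadoopRDD, one lower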
I am new to Spark/Scala and need to load a file from HDFS into Spark. I have a file in HDFS (/newhdfs/abc.txt), and I can see its contents with hdfs dfs -cat /newhdfs/abc.txt.
I did the following, in this order, to load the file into the Spark context:
spark-shell #It entered into scala console window
scala> import org.apache.spark._; //Line 1
scala> val conf=new SparkConf().setMaster("local[*]");
scala> val sc = new SparkContext(conf);
scala> val input=sc.textFile("hdfs:///newhdfs/abc.txt"); //Line 4
Once I hit enter on line 4, I get the message below:
input: org.apache.spark.rdd.RDD[String] = hdfs:///newhdfs/abc.txt MapPartitionsRDD[19] at textFile at <console>:27
Is this a fatal error? What do I need to do to solve this?
(Using Spark-2.0.0 and Hadoop 2.7.0)
This is not an error; it just shows the name of the file backing your RDD.
In the Basic docs, there is this example:
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
which demonstrates the very same behavior.
How would you expect an error to surface without an action triggering the actual work?
If you want to check that everything is OK, call count on your input RDD: count is an action, so it triggers the actual read of the file and then counts the elements of your RDD.
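For example (a minimal sketch using the path from the question):
val input = sc.textFile("hdfs:///newhdfs/abc.txt")
input.count() // count is an action: it forces the read and fails here if the path is wrong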
I have the following Scala value:
val values: List[Iterable[Any]] = Traces().evaluate(features).toList
and I want to convert it to a DataFrame.
When I try the following:
sqlContext.createDataFrame(values)
I got this error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (List[Iterable[Any]])
sqlContext.createDataFrame(values)
Why?
That's what the spark implicits object is for. It allows you to convert your common Scala collection types into a DataFrame / Dataset / RDD.
Here is an example with Spark 2.0, but it exists in older versions too:
import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
Edit: Just realised you were after a 2D list. Here is something I tried in spark-shell. I converted the 2D List to a List of tuples and used the implicit conversion to DataFrame:
val values = List(List("1", "One") ,List("2", "Two") ,List("3", "Three"),List("4","4")).map(x =>(x(0), x(1)))
import spark.implicits._
val df = values.toDF
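As a quick follow-up (a small sketch): by default the tuple columns come out as _1 and _2; you can name them by passing column names to toDF (the names below are illustrative).
val named = values.toDF("id", "label")
named.show()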
Edit 2: The original question by MTT was "How to create a Spark dataframe from a Scala list" for a 2D list, for which this is a correct answer. The original question is https://stackoverflow.com/revisions/38063195/1
The question was later changed to match an accepted answer. I am adding this edit so that anyone looking for something similar to the original question can find it.
As zero323 mentioned, we need to first convert List[Iterable[Any]] to List[Row], then put the rows in an RDD and prepare a schema for the Spark DataFrame.
To convert List[Iterable[Any]] to List[Row], we can say
val rows = values.map{x => Row(x.toSeq: _*)}
and then, given a schema (here called schema), we can make the RDD
val rdd = sparkContext.makeRDD[Row](rows)
and finally create a spark data frame
val df = sqlContext.createDataFrame(rdd, schema)
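Putting it all together, here is a minimal end-to-end sketch for spark-shell (where sc and sqlContext already exist); the two string columns "id" and "name" are illustrative assumptions, since the original schema isn't shown.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val values: List[Iterable[Any]] = List(List("1", "One"), List("2", "Two"))
val rows = values.map(x => Row(x.toSeq: _*))
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),   // assumed column
  StructField("name", StringType, nullable = true)  // assumed column
))
val rdd = sc.makeRDD(rows)
val df = sqlContext.createDataFrame(rdd, schema)
df.show()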
Simplest approach:
val newList = yourList.map(Tuple1(_))
val df = spark.createDataFrame(newList).toDF("stuff")
In Spark 2 we can use a Dataset, by converting the list to a DS with the toDS API:
val ds = list.flatMap(_.split(",")).toDS() // Records split by comma
or
val ds = list.toDS()
This is more convenient than an RDD or a DataFrame.
The most concise way I've found:
val df = spark.createDataFrame(List("A", "B", "C").map(Tuple1(_)))
I am studying Spark on VirtualBox. I use ./bin/spark-shell to open Spark and use Scala. Now I am confused about the key-value format in Scala.
I have a txt file in /home/feng/spark/data, which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to get this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
Then I can use rdd.collect() to show rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
which adds ".txt" (the actual file name has no extension). Then when I use rdd.collect(), I get an error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples. All of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is when I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But the rddd.collect() output does not look that way, since the words occur twice, which is not right. I cannot continue with the remaining operations like mapByKey, reduceByKey or others. Where did I go wrong?
Just as an example, I create a String with your dataset; after this I split the records by line and use SparkContext's parallelize method to create an RDD. Notice that after I create the RDD, I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
About your first question: there is no need for files to have any extension, because in this case the files are read as plain text.
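About your second question, here is a minimal sketch (not part of the original answer) of what you probably intended: keep the first word as the key, parse the number as the value, and then aggregate with reduceByKey.
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
val pairs = rdd.map { line =>
  val parts = line.split(" ")
  (parts(0), parts(1).toInt)          // e.g. ("panda", 0)
}
val totals = pairs.reduceByKey(_ + _) // e.g. (panda,1), (pink,7), (pirate,3)
totals.collect().foreach(println)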
I have been playing with Spark and I found that my join operation doesn't work. Below is part of my code and the result from the Scala console:
scala> val conf = new SparkConf().setMaster("local[*]").setAppName("Part4")
scala> val sc = new SparkContext(conf)
scala> val k1 = sc.parallelize(List((1,3),(1,5),(2,4)))
k1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:29
scala> val k2 = sc.parallelize(List((1,'A'),(2,'B')))
k2: org.apache.spark.rdd.RDD[(Int, Char)] = ParallelCollectionRDD[25] at parallelize at <console>:29
scala> val k3 = k1.join(k2)
k3: org.apache.spark.rdd.RDD[(Int, (Int, Char))] = MapPartitionsRDD[28] at join at <console>:33
scala> k3.foreach(println)
scala> k3.collect
res33: Array[(Int, (Int, Char))] = Array()
So I just created a toy example with two RDDs k1 and k2 of (k, v) pairs and tried to join them. However, the result k3 is always empty. We can see that k1 and k2 are correctly specified, but k3 is empty nevertheless.
What is wrong?
------- Update to my question:
I think I know where the problem is, but I'm still confused:
At first I wrote
val conf = new SparkConf().setMaster("local[*]").setAppName("Part4")
val sc = new SparkContext(conf)
When I didn't have those two lines of code, my join worked, but when I added them it didn't.
Why was that?
spark-shell starts up its own SparkContext. Alas, Spark does not like multiple contexts running in the same application. When I execute the second line (val sc = new SparkContext(conf)) in spark-shell, I get:
SNIP LOTS OF ERROR LINES
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
SNIP LOTS OF ERROR LINES
Spark has a lot of static context and other state, which means it does not work well when you have two contexts. I'd chalk it down to that, although, alas, I cannot prove it.
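The practical fix, as a minimal sketch: inside spark-shell, reuse the sc that the shell already created instead of constructing a new SparkContext.
// In spark-shell, skip new SparkConf / new SparkContext; sc already exists.
val k1 = sc.parallelize(List((1, 3), (1, 5), (2, 4)))
val k2 = sc.parallelize(List((1, 'A'), (2, 'B')))
val k3 = k1.join(k2)
k3.collect().foreach(println) // expected: (1,(3,A)), (1,(5,A)), (2,(4,B))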