saveasTextFile("path") in scala - scala

I used Scala in Spark and tried to save my file to HDFS, but I got an error.
I have tried rdd.saveAsTextFile("path"), sc.saveAsTextFile("path"), and plain saveAsTextFile("path").
scala> inputJPG.map(x=>x.split("")).map(array=>array(0)).sc.saveAsTextFile("/loudacre/iplist")
<console>:28: error: value sc is not a member of Array[String]
inputJPG.map(x=>x.split("")).map(array=>array(0)).sc.saveAsTextFile("/loudacre/iplist")

I'm a new student and still learning Scala, so I'm not familiar with RDDs and Scala functions yet.
Back to my problem: I found the cause was that I had val xx = data.collect() earlier, which returns a local Array rather than an RDD, so I couldn't call saveAsTextFile on it in Spark.
Once I removed the .collect() call (val xx = data), data.saveAsTextFile("path") worked.
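For anyone hitting the same error, here is a minimal sketch of the working pattern (the input path and field layout are illustrative, not taken from the original job):
// Keep the data as an RDD all the way through; do not call collect() before saving.
val inputJPG = sc.textFile("/loudacre/weblogs")                 // illustrative input path
val ips = inputJPG.map(line => line.split(" ")).map(fields => fields(0))
ips.saveAsTextFile("/loudacre/iplist")                          // works: saveAsTextFile is defined on RDDs
// This would NOT work, because collect() returns a local Array, not an RDD:
// val xx = inputJPG.collect()
// xx.saveAsTextFile("/loudacre/iplist")   // error: value saveAsTextFile is not a member of Array[String]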

Related

How to resolve error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]?

I am learning Apache Spark and trying to execute a small program in the Scala terminal.
I have started the dfs, yarn and history server using the following commands:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
Then, in the Scala terminal, I have written the following commands:
val lines = sc.textFile("/Users/****/Documents/backups/h/*****/input/ncdc/micro-tab/sample.txt");
val records = lines.map(_.split("\t"));
val filters = records.filter(rec => (rec(1) != "9999" && rec(2).matches("[01459]")));
val tuples = filters.map(rec => (rec(0).toInt, rec(1).toInt));
val maxTemps = tuples.reduceByKey((a,b) => Math.max(a,b));
All commands execute successfully, except the last one, which throws the following error:
error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)]
I found some explanations like this one:
This comes from using a pair RDD function generically. The reduceByKey method is actually a method of the PairRDDFunctions class, which has an implicit conversion from RDD, so it requires several implicit type classes. Normally, when working with simple concrete types, those are already in scope, but you should be able to amend your method to also require those same implicits.
But I am not sure how to achieve this.
Any help on how to resolve this issue?
It seems you are missing an import. Try writing this in the console:
import org.apache.spark.SparkContext._
Then run the above commands again. This import brings in an implicit conversion that lets you use the reduceByKey method.
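Following that advice, a rough sketch of the same session with the import added (the path and commands are the asker's; only the import line is new):
import org.apache.spark.SparkContext._   // brings the RDD-to-PairRDDFunctions implicit into scope

val lines = sc.textFile("/Users/****/Documents/backups/h/*****/input/ncdc/micro-tab/sample.txt")
val records = lines.map(_.split("\t"))
val filters = records.filter(rec => rec(1) != "9999" && rec(2).matches("[01459]"))
val tuples = filters.map(rec => (rec(0).toInt, rec(1).toInt))
val maxTemps = tuples.reduceByKey((a, b) => Math.max(a, b))   // now resolves on RDD[(Int, Int)]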

Accessing Spark.SQL

I am new to Spark. Following an example in a book, I found that the command below gives an error. What would be the best way to run a Spark SQL command while coding in Spark in general?
scala> // Use SQL to create another DataFrame containing the account summary records
scala> val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
<console>:37: error: not found: value spark
I tried importing org.apache.spark.SparkContext and using the sc object, but no luck.
Assuming you're in the spark-shell, first get a SQL context like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Then you can do:
val acSummary = sqlContext.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
So the value spark that is available in the spark-shell is actually an instance of SparkSession (https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.sql.SparkSession), and
val spark = SparkSession.builder().getOrCreate()
will give you one.
What version are you using? It appears you're in the shell, and this should work, but only in Spark 2+; otherwise you have to use sqlContext.sql.
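As a rough sketch of both variants (assuming the trans data has already been registered as a temporary table, as in the book's example):
// Spark 2.x: the spark-shell already provides `spark`; outside the shell, build a SparkSession.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("account-summary").getOrCreate()   // app name is illustrative
val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")

// Spark 1.x: fall back to an SQLContext built from the SparkContext.
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// val acSummary = sqlContext.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")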

Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD, but it is under Dataset (link 1). However, I have an RDD, not a Dataset.
Also, I can't see it under implicits (link 2).
So please help me understand why toDF() can be called on my RDD. Where is this method being inherited from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder); you then call toDF on the DatasetHolder.
Yes, you should import the sqlContext implicits like this:
val sqlContext = ... // create the sqlContext
import sqlContext.implicits._
val df = rdd.toDF()
The import has to be in scope before you call toDF on your RDDs.
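Putting it together, a minimal Spark 2 sketch of where toDF() comes from (the log file path is the asker's; the rest is one way to set it up, not the only one):
// In the spark-shell the `spark` SparkSession already exists; otherwise build one.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()

// The implicits on the session include rddToDatasetHolder, which wraps the RDD
// in a DatasetHolder; toDF() is defined on that holder, not on RDD itself.
import spark.implicits._

val rdd = spark.sparkContext.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
df.show()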
I finally have peace of mind on this issue; it had been troubling me badly, and this post is a lifesaver. I was loading data from log files into case class objects held in a mutable List, with the idea of finally converting the list into a DataFrame. However, because it was mutable (and Spark 2.1.1 has changed the toDF implementation), the list was not getting converted. I had even considered saving the data to a file and loading it back with .read, but five minutes ago this post saved my day.
I did it exactly as described. After loading the data into the mutable list, I immediately used:
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2, and it worked:
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()

Why does the Scala compiler give "value registerKryoClasses is not a member of org.apache.spark.SparkConf" for Spark 1.4?

I tried to register a class for Kryo as follows:
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Seq(classOf[MyClass]))
val sc = new SparkContext(conf)
However, I get the following error
value registerKryoClasses is not a member of org.apache.spark.SparkConf
I also tried conf.registerKryoClasses(classOf[MyClass]), but it still complains with the same error.
What mistake am I making? I am using Spark 1.4.
The method SparkConf.registerKryoClasses is defined in Spark 1.4 (since 1.2). However, it expects an Array[Class[_]] as an argument. This might be the problem. Try calling conf.registerKryoClasses(Array(classOf[MyClass])) instead.
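A minimal sketch of the corrected setup (the class, master, and app name are illustrative placeholders):
import org.apache.spark.{SparkConf, SparkContext}

case class MyClass(id: Int, name: String)          // illustrative class to register

val conf = new SparkConf()
  .setMaster("local[*]")                           // illustrative master
  .setAppName("kryo-example")                      // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// registerKryoClasses expects an Array[Class[_]], not a Seq or a bare Class.
conf.registerKryoClasses(Array(classOf[MyClass]))

val sc = new SparkContext(conf)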

Scala code works on Spark Terminal but not in eclipse

The code below works fine when I run it in the Spark terminal, but in Eclipse it throws an error. What might be the reason? Please let me know if you need more information.
val IABLabels= IAB.zip(labels)
val temp1 = IABLabels.groupBy(x=>x._2).mapValues( _.map( _._1 ))
Error in Eclipse:
value mapValues is not a member of org.apache.spark.rdd.RDD[(Int, Iterable[(String, Int)])]
The code runs perfectly fine on Spark shell.
You should use this import to access the extra functions on RDDs of (key, value) pairs through an implicit conversion:
import org.apache.spark.SparkContext._
You can check the API docs for further details.
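A rough sketch of how the Eclipse version might look with the import in place (the sample IAB and labels RDDs are illustrative stand-ins for the question's data):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._        // implicit conversion to PairRDDFunctions (provides mapValues)

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("iab-labels"))

// Illustrative stand-ins for the question's IAB and labels RDDs.
val IAB = sc.parallelize(Seq("news", "sports", "news"), 2)
val labels = sc.parallelize(Seq(1, 2, 1), 2)

val IABLabels = IAB.zip(labels)                                   // RDD[(String, Int)]
val temp1 = IABLabels.groupBy(x => x._2).mapValues(_.map(_._1))   // group the strings by their label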