Running WordCount fails in Scala

I am trying to run a word count program in Scala. Here is what my code looks like.
package myspark;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.implicits._
object WordCount {
def main(args: Array[String]) {
val sc = new SparkContext( "local", "Word Count", "/home/hadoop/spark-2.2.0-bin-hadoop2.7/bin", Nil, Map(), Map())
val input = sc.textFile("/myspark/input.txt")
Val count = input.flatMap(line ⇒ line.split(" "))
.map(word ⇒ (word, 1))
.reduceByKey(_ + _)
count.saveAsTextFile("outfile")
System.out.println("OK");
}
}
Then I tried to execute it in spark.
spark-shell -i /myspark/WordCount.scala
And I get this error.
... 149 more
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
That file does not exist
Can someone please explain the error in this code? I am very new to both Spark and Scala. I have verified that the input.txt file is in the location mentioned.

You can take a look here to get started: Learning Spark-WordCount
Other than that, there are several errors that I can see:
import org.apache.spark..implicits._: the two dots won't work.
Also, have you added the Spark dependency to your project, perhaps with provided scope? You must do at least that to run Spark code.

First of all, check whether you have added the right dependencies. I can also see a few mistakes in your code.
Create a SparkSession rather than a SparkContext (see the SparkSession API):
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
Then use this spark variable for the implicits import:
import spark.implicits._
I am not sure why you wrote import org.apache.spark..implicits._ with two dots between spark and implicits.
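Putting the suggestions above together, a corrected version of the program might look like the sketch below. It assumes Spark 2.x is on the classpath and that the program is compiled and launched with spark-submit rather than loaded into spark-shell. The tokenize helper, the optional command-line arguments, and the local[*] master are illustrative choices, not part of the original code.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  // Hypothetical helper: split a line on whitespace and drop empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Paths default to the ones used in the question; overriding them via args is an assumption.
    val inputPath  = if (args.length > 0) args(0) else "/myspark/input.txt"
    val outputPath = if (args.length > 1) args(1) else "outfile"

    // SparkSession is the Spark 2.x entry point; it wraps a SparkContext.
    val spark = SparkSession.builder()
      .appName("Word Count")
      .master("local[*]")
      .getOrCreate()

    try {
      val counts = spark.sparkContext.textFile(inputPath)
        .flatMap(line => tokenize(line))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.saveAsTextFile(outputPath)
      println("OK")
    } finally {
      spark.stop()
    }
  }
}
```

Note that saveAsTextFile fails if the output directory already exists, so remove outfile between runs.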

Related

toDF is not working in Spark Scala IDE, but works perfectly in spark-shell [duplicate]

I am new to Spark, and I am trying to run the code below both from spark-shell and from the Scala IDE for Eclipse.
When I run it from the shell, it works perfectly.
But in the IDE, it gives a compilation error.
Please help.
package sparkWCExample.spWCExample
import org.apache.log4j.Level
import org.apache.spark.sql.{ Dataset, SparkSession, DataFrame, Row }
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._
object TwitterDatawithDataset {
def main(args: Array[String]) {
val conf = new SparkConf()
.setAppName("Spark Scala WordCount Example")
.setMaster("local[1]")
val spark = SparkSession.builder()
.config(conf)
.appName("CsvExample")
.master("local")
.getOrCreate()
val csvData = spark.sparkContext
.textFile("C:\\Sankha\\Study\\data\\bank_data.csv", 3)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Bank(age: Int, job: String)
val bankDF = dfData.map(x => Bank(x(0).toInt, x(1)))
val df = bankDF.toDF()
}
}
The error below appears at compile time:
Description: value toDF is not a member of org.apache.spark.rdd.RDD[Bank]
Resource: TwitterDatawithDataset.scala
Path: /spWCExample/src/main/java/sparkWCExample/spWCExample
Location: line 35
Type: Scala Problem
To use toDF(), you must import the implicit conversions:
import spark.implicits._
In spark-shell, this import is done by default, which is why the code works there. The :imports command shows which imports are already present in your shell:
scala> :imports
1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
4) import org.apache.spark.sql.functions._ (385 terms)
This works fine for me in Eclipse Scala IDE:
case class Bank(age: Int, job: String)
val u = Array((1, "manager"), (2, "clerk"))
import spark.implicits._
spark.sparkContext.makeRDD(u).map(r => Bank(r._1, r._2)).toDF().show()
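For a standalone program, the same idea could be sketched as follows. The question's snippet refers to sc and dfData, which are never defined, so this version uses spark.sparkContext and a couple of in-memory sample rows instead of the CSV file. The parse helper and the sample data are hypothetical. The case class is placed at the top level, which is generally recommended so that Spark can derive an encoder for it.

```scala
import org.apache.spark.sql.SparkSession

// Declared outside main so an implicit encoder can be derived for it.
final case class Bank(age: Int, job: String)

object BankToDF {
  // Hypothetical helper: parse one "age,job" line; assumes well-formed input.
  def parse(line: String): Bank = {
    val cols = line.split(",")
    Bank(cols(0).trim.toInt, cols(1).trim)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvExample")
      .master("local[1]")
      .getOrCreate()

    // Brings toDF() into scope for RDDs and Seqs of case classes and tuples.
    import spark.implicits._

    // Sample rows stand in for the lines of bank_data.csv.
    val lines = spark.sparkContext.parallelize(Seq("30,manager", "41,clerk"))
    val bankDF = lines.map(line => parse(line)).toDF()

    bankDF.printSchema()
    bankDF.show()
    spark.stop()
  }
}
```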

value toDF is not a member of Seq[(Int,String)]

I am trying to execute the following code but I get this error:
value toDF is not a member of Seq[(Int,String)].
I have the case class outside main and I have imported the implicits too, but I still get this error. Can someone help me resolve this? I am using Spark 2.11-2.1.0 and Scala 2.11.8.
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark._
final case class Email(id: Int, text: String)
object SampleKMeans {
def main(args: Array[String]) = {
val spark = SparkSession.builder.appName("SampleKMeans")
.master("yarn")
.getOrCreate()
import spark.implicits._
val emails = Seq(
"This is an email from...",
"SPAM SPAM spam",
"Hello, We'd like to offer you")
.zipWithIndex.map(_.swap).toDF("id", "text").as[Email]
}
}
You already have a SparkSession, so importing spark.implicits._ from it will work in your case:
val spark = SparkSession.builder.appName("SampleKMeans")
.master("local[*]")
.getOrCreate()
import spark.implicits._
Now the toDF method works as expected.
If the error persists, check the versions of the Spark and Scala libraries that you are using.
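As a rough illustration of what checking the versions can mean in an sbt build, the sketch below keeps the Scala version and the Spark artifacts aligned. It assumes that "Spark 2.11-2.1.0" in the question means Spark 2.1.0 built for Scala 2.11, and the project name is made up. toDF comes from the spark-sql module, so that module has to be on the compile classpath.

```scala
// build.sbt (sketch): keep the Scala version and the Spark artifacts in step.
name := "sample-kmeans" // hypothetical project name

scalaVersion := "2.11.8"

// %% appends the Scala binary version (_2.11) to the artifact name,
// so these resolve to spark-sql_2.11 and spark-mllib_2.11, version 2.1.0.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.1.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.0"
)
```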
Hope this helps!

Spark-Scala writing the output in a textfile

I am running a word count program in Spark and trying to store the result in a text file.
I have a Scala script, SparkWordCount.scala, that counts the words. I am trying to execute the script from the Spark console as shown below.
scala> :load /opt/spark-2.0.2-bin-hadoop2.7/bin/SparkWordCount.scala
Loading /opt/spark-2.0.2-bin-hadoop2.7/bin/SparkWordCount.scala...
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._
defined object SparkWordCount
scala>
After the script is loaded I get the message "defined object SparkWordCount", but I cannot see the output in the text file.
My word count program is below.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._
object SparkWordCount {
def main(args: Array[String]) {
val sc = new SparkContext( "local", "Word Count", "/opt/spark-2.0.2-bin-hadoop2.7",/opt/spark-2.0.2-bin-hadoop2.7/jars,map())
val input = sc.textFile("demo.txt")
val count = input.flatMap(line ⇒ line.split(" ")).map(word ⇒ (word, 1)).reduceByKey(_ + _)
count.saveAsTextFile("outfile")
}
}
Can anyone please suggest what is wrong? Thanks.
Once the object is defined, you have to call its method to execute your code; spark-shell won't run the main method automatically. In your case, you can use SparkWordCount.main(Array()) to execute your word count program.
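One way to make the script run the job as soon as it is loaded is to end it with that call. The sketch below is meant for :load inside spark-shell, not for compilation as a source file, because of the top-level statement on the last line. It uses SparkContext.getOrCreate() to reuse the shell's context instead of constructing a second one; that substitution is an assumption and not part of the original script.

```scala
import org.apache.spark.SparkContext

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Inside spark-shell this returns the context the shell already created.
    val sc = SparkContext.getOrCreate()

    val input = sc.textFile("demo.txt")
    val count = input
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Writes part files into the directory "outfile"; fails if it already exists.
    count.saveAsTextFile("outfile")
  }
}

// Defining the object runs nothing; this call starts the job when the script is loaded.
SparkWordCount.main(Array.empty[String])
```

The result is a directory named outfile containing part-* files, relative to the directory spark-shell was started from, rather than a single text file.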

spark scala datastax csv load file and print schema

Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
import org.apache.log4j.{Level, Logger}
import java.util.Calendar
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._
object csvtocassandra {
def main(args: Array[String]): Unit = {
val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")
// Cassandra Part
val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println(Calendar.getInstance.getTime)
// Scala Read CSV Part
val spark1 = org.apache.spark.sql.SparkSession.builder().master("local").config("spark.cassandra.connection.host", "127.0.0.1")
.appName("Spark SQL basic example").getOrCreate()
val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
val df_csv = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load(csv_input)
df_csv.printSchema()
}
}
Why am I not able to run this program as a job when I submit it to Spark? When I run it from IntelliJ, it works.
But when I create a JAR and run it, it fails.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
I get the following error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it prepending dsefs://127.0.0.1 even though I enter just the path /Users/Desktop/csv/example.csv when asked?
I tried passing the --master option as well; however, I get the same error. I am running DataStax Spark on my local machine, with no cluster.
Please tell me what I am doing wrong.
Got it, never mind. Sorry about that.
The input should be given as file:///file_name
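The error message suggests that a path without a scheme is resolved against the default filesystem (DSEFS here), which is why an explicit file:// prefix helps. A minimal sketch of that fix follows; the toLocalUri helper is hypothetical and assumes the user enters an absolute Unix-style path.

```scala
import org.apache.spark.sql.SparkSession

object CsvSchema {
  // Hypothetical helper: leave paths that already carry a scheme untouched,
  // otherwise prefix an absolute local path with file://.
  def toLocalUri(path: String): String =
    if (path.contains("://")) path else "file://" + path

  def main(args: Array[String]): Unit = {
    // No master is set here; it is expected to come from spark-submit.
    val spark = SparkSession.builder()
      .appName("Spark SQL basic example")
      .getOrCreate()

    val csvInput = toLocalUri(scala.io.StdIn.readLine("Please enter csv file location: "))

    val dfCsv = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(csvInput)

    dfCsv.printSchema()
    spark.stop()
  }
}
```

With this helper, entering /Users/Desktop/csv/example.csv produces file:///Users/Desktop/csv/example.csv, which matches the form given in the answer above.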

Error found when importing spark.implicits

I am using Spark 1.4.0.
When I tried to import spark.implicits with the command
import spark.implicits._, this error appeared:
<console>:19: error: not found: value spark
import spark.implicits._
^
Can anyone help me resolve this problem?
This is because SparkSession is only available from Spark 2.0 onward, and the spark value in the Spark REPL is an object of type SparkSession.
In Spark 1.4 use
import sqlContext.implicits._
The value sqlContext is created automatically in the Spark REPL for Spark 1.x.
To make it complete: if no sqlContext exists yet, you first have to create one:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setMaster("local").setAppName("my app")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._