How to fix spark.read.format("parquet") error - eclipse

My Scala code runs fine on Azure Databricks. Now I want to move it from the Azure notebook to Eclipse.
I installed Databricks Connect following the Microsoft documentation, and the Databricks connection test passes.
I also installed SBT and imported it into my project in Eclipse.
I created a Scala object in Eclipse and also imported all the jar files as external files in pyspark:
package Student
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import java.util.Properties
//import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
object Test {
  def isTypeSame(df: DataFrame, name: String, coltype: String) = (df.schema(name).dataType.toString == coltype)
  def main(args: Array[String]){
    var Result = true
    val Borrowers = List(("col1", "StringType"),("col2", "StringType"),("col3", "DecimalType(38,18)"))
    val dfPcllcus22 = spark.read.format("parquet").load("/mnt/slraw/ServiceCenter=*******.parquet")
    if (Result == false) println("Test Fail, Please check") else println("Test Pass")
  }
}
When I run this code in Eclipse, it reports that it cannot find the main class. But if I comment out the line val dfPcllcus22 = spark.read.format("parquet").load("/mnt/slraw/ServiceCenter=*******.parquet"), the test passes.
So it seems spark.read.format is not recognized.
I'm new to Scala and Databricks.
I have been researching this for several days but still cannot solve it.
If anyone can help, I would really appreciate it.
The environment is a bit complicated for me; if more information is required, please let me know.

Your code needs a SparkSession to run in Eclipse. The code you provided never creates one, so the name spark is undefined and the object does not compile, which leads to the error. Add this line at the start of main:
val spark = SparkSession.builder.appName("SparkDBFSParquet").master("local[*]").getOrCreate()
Then run the code again and it should work.
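Put together, the question's object with the session in place might look like the sketch below. It assumes the intent was to check every (column, type) pair in Borrowers with isTypeSame, which the original code defines but never calls. The names borrowers and allMatch are introduced here for illustration, the package line is left out for brevity, and the masked path is kept exactly as posted.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Test {
  // Compares the type of column `name` (rendered with toString) against `coltype`.
  // Note: df.schema(name) throws IllegalArgumentException if the column does not exist.
  def isTypeSame(df: DataFrame, name: String, coltype: String): Boolean =
    df.schema(name).dataType.toString == coltype

  def main(args: Array[String]): Unit = {
    // This is the value the original code used as `spark` without defining it.
    val spark = SparkSession.builder
      .appName("SparkDBFSParquet")
      .master("local[*]")
      .getOrCreate()

    val borrowers = List(("col1", "StringType"), ("col2", "StringType"), ("col3", "DecimalType(38,18)"))
    val dfPcllcus22 = spark.read.format("parquet").load("/mnt/slraw/ServiceCenter=*******.parquet")

    // Assumption: the test passes only if every listed column has the expected type.
    val allMatch = borrowers.forall { case (name, coltype) => isTypeSame(dfPcllcus22, name, coltype) }
    if (allMatch) println("Test Pass") else println("Test Fail, Please check")

    spark.stop()
  }
}
```

Replacing the mutable Result flag with a single forall keeps the pass/fail decision in one expression, but any other way of combining the per-column checks would work just as well.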

Related

Loop and process multiple HDFS files in spark/scala

I have multiple files in my HDFS folder and I want to loop over them and run my Scala transformation logic on each one.
I am using the script below, which works fine in my development environment with local files but fails when I run it against HDFS. Any idea what I am doing wrong?
val files = new File("hdfs://172.X.X.X:8020/landing/").listFiles.map(_.getName).toList
files.foreach { file =>
  print(file)
  val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + file)
  event.show(false)
}
Can someone correct it or suggest an alternative solution, please?
java.io.File only works with the local filesystem, so it cannot list an hdfs:// path. Use the Hadoop FileSystem API to handle files on HDFS.
Code:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
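val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Obtain a FileSystem handle for the HDFS namenode, reusing Spark's Hadoop configuration.
val fs = FileSystem.get(new URI("hdfs://172.X.X.X:8020/"), spark.sparkContext.hadoopConfiguration)
// globStatus expands the pattern on HDFS and returns one FileStatus per match.
fs.globStatus(new Path("/landing/*")).toList.foreach { f =>
  val event = spark.read.option("multiline", "true").json("hdfs://172.X.X.X:8020/landing/" + f.getPath.getName)
  event.show(false)
}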
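The listing step can also be factored into a small helper that returns the paths as the FileSystem reports them, so the hdfs:// prefix does not have to be concatenated by hand. This is a sketch rather than part of the original answer: the object name HdfsListing and the method listFilePaths are made up for illustration, and it uses listStatus on a directory instead of globStatus on a pattern.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsListing {
  // Returns the paths of the regular files directly under `dir`, as reported
  // by the given FileSystem (for HDFS these carry the hdfs:// scheme and authority).
  // Subdirectories are skipped.
  def listFilePaths(fs: FileSystem, dir: String): List[String] =
    fs.listStatus(new Path(dir)).toList
      .filter(_.isFile)
      .map(_.getPath.toString)
}
```

With the fs and spark values from the answer, usage would be along the lines of HdfsListing.listFilePaths(fs, "/landing").foreach(p => spark.read.option("multiline", "true").json(p).show(false)). If the files do not have to be handled one at a time, passing the directory itself to json(...) reads all of them into a single DataFrame.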

Execute code scala from spark in Zeppelin

I would like to run Scala code in Zeppelin on a Spark cluster.
For example:
This is the code, stored in HDFS as "HelloWorldScala.scala":
object HelloWorldScala{
  def main (arg: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("myApp_Enrico")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    val aList = List(1,2,3,4,5,6,7,8,9,10)
    val aRdd = spark.sparkContext.parallelize(aList)
    println("********* HELLO WORLD AND HELLO SPARK!! ******")
    println("Print even numbers")
    aRdd.filter(x=>x%2==0).map(x=>x*2).collect().foreach(println)
  }
}
I would like to import the HelloWorldScala file in Zeppelin and run main, but I see the error:
[Image: Error code Zeppelin]
Unfortunately, you can't import a single source file in Zeppelin. You can package your Scala files into a .jar library and put it in the spark.jars directory (set as a property in Spark). After that you can import your library with a line such as import your.library.packages.YourClass and use its non-private functions. If you are not familiar with jar packages and the spark.jars property, it is worth reading a bit more about them.
UPDATE:
%dep
z.load("your_package_group:artifact:version")
%spark
import com.yourpackage.HelloWorldScala
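For that import to resolve, the object has to be compiled into the jar under a matching package. A minimal sketch of the source file, assuming the placeholder package com.yourpackage from the snippet above and adding the two imports that the file in the question leaves out:

```scala
package com.yourpackage // placeholder package name, matching the import above

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HelloWorldScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("myApp_Enrico")
    // getOrCreate() returns the already running session if there is one,
    // which is normally the case inside a Zeppelin %spark paragraph.
    val spark = SparkSession.builder.config(conf).getOrCreate()
    val aList = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val aRdd = spark.sparkContext.parallelize(aList)
    println("********* HELLO WORLD AND HELLO SPARK!! ******")
    println("Print even numbers")
    // Keeps the even numbers and doubles each of them before printing.
    aRdd.filter(x => x % 2 == 0).map(x => x * 2).collect().foreach(println)
  }
}
```

After the import, the entry point can be called from the %spark paragraph with HelloWorldScala.main(Array.empty[String]).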

Error in running Scala in terminal: "object apache is not a member of package org"

I'm using Sublime to write my first Scala program, and I'm using the terminal to run it.
First I compile it with the command scalac assignment2.scala, but it shows the error message: "error: object apache is not a member of package org"
How can I fix it?
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object assignment2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("assignment2")
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    println(result.collect().mkString(","))
  }
}
Where are you trying to submit the job? To run any Spark application you need to submit it with bin/spark-submit from your Spark installation directory, or you need to have SPARK_HOME set in your environment so that you can refer to it when submitting.
Actually, you can't compile and run a Spark Scala file directly with scalac, because compiling your Scala class requires the Spark library on the classpath. One way to execute the file is through spark-shell. To run your Spark Scala file inside spark-shell, follow these steps:
Open spark-shell with the following command:
'spark-shell --master yarn-client'
Load your file using its exact location:
':load File_Name_With_Absolute_path'
Run your main method using the class name: 'ClassName.main(null)'
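The compiler error itself means that the Spark classes are not on scalac's classpath. Besides spark-shell, a common way to get them there is a build tool. The sbt build definition below is a sketch of that alternative and does not come from the answers above; the Scala and Spark version numbers are assumptions and should be replaced with the versions of the actual installation.

```scala
// build.sbt (sbt build definitions are written in a Scala DSL)
name := "assignment2"
version := "0.1.0"

// Assumed version: use the Scala version your Spark distribution was built with.
scalaVersion := "2.12.18"

// Assumed Spark version. "provided" keeps Spark out of the packaged jar,
// since spark-submit supplies the Spark classes at run time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1" % "provided"
```

With assignment2.scala placed under src/main/scala, running sbt package should produce a jar under target/ that can then be passed to bin/spark-submit together with --class assignment2 and a --master setting.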

Scala Word count jar is not running in spark

I'm very new to both Scala and Spark. I added the Scala IDE to Eclipse Luna and created a Maven project in Eclipse. I was able to run the program within Eclipse using the Run As configuration option and got the output successfully. But when I created the jar for the following program and tried to run it from the Spark shell, I got the following error.
error: ';' expected but 'class' found.
Command used to run the jar:
spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco
The word count program I tried:
package com.kirthi.spark.proj.sparkexamples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordsCount {
  def main(args: Array[String]){
    val conf = new SparkConf()
      .setAppName("Word Count")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val words = textFile.flatMap(line => line.split(","))
    val counts = words.map(word => (word,1))
    val wordcount = counts.reduceByKey(_+_)
    val wordcount_sorted = wordcount.sortByKey()
    wordcount_sorted.foreach(println)
    wordcount_sorted.saveAsTextFile(args(1))
  }
}
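Kindly help me out with this, as I am stuck on my first Spark program.
I am using Cloudera QuickStart CDH 5.5.
As shown in the comments, you ran the above command inside the Scala REPL, whereas you should run it from a regular Linux shell. spark-submit is a shell command, not Scala code, so the REPL tries to parse --class as Scala and stops at the keyword class.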

Loaded JARs in spark-shell, but can't reference the variables in the code

I'm studying Advanced Analytics with Spark.
Here's what happens: I follow the tutorial in spark-shell and type fairly long blocks of code into it. When I close the lid of my laptop, it goes into sleep mode, and when I turn it back on, the code is gone.
As a solution, as suggested in the book, I am trying to put my code in a .scala file, compile it, and load it as a JAR whenever I restart spark-shell. The book even provides a simple example of how to do that: https://github.com/sryza/aas/tree/master/simplesparkproject
So I git cloned the project, ran mvn package, and started spark-shell with spark-shell --jars target/simplesparkproject-0.0.1.jar --master local, just as directed.
If you look at the git repo for this example, the code contains an object MyApp with two functions in it.
object MyApp {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("My App"))
    println("num lines: " + countLines(sc, args(0)))
  }
  def countLines(sc: SparkContext, path: String): Long = {
    sc.textFile(path).count()
  }
}
From what I understood, this object and its functions should be accessible in spark-shell because the jar was specified with the --jars option.
However, when I type MyApp in spark-shell, I get:
scala> MyApp
<console>:23: error: not found: value MyApp
MyApp
^
What am I doing wrong, and how can I make this work?
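Just import the object and call the required methods. Note that main expects an Array[String] and creates its own SparkContext, so inside spark-shell it is simpler to call countLines with the shell's existing sc:
import com.cloudera.datascience.MyApp
MyApp.countLines(sc, "path/to/file.txt") // placeholder path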