Read local .csv into Spark with Scala 2.12.12 using IntelliJ

I am trying to import a comma delimited .csv file into a Scala project with version 2.12.12 using IntelliJ. I have been unsuccessful at importing this file. The file has 1 million rows and 5 columns. In addition to importing and reading the file, I also have to find the total by month in the data.
I have tried:
val df = spark.read.option("header", true).csv("C:\Users\trialrun\Desktop\DataExtract.csv")
Spark throws an error: not found: value df
I have tried this as well and get the same error:
df = spark.read.csv("file:///C:\\Users\trialrun\Desktop\DataExtract.csv").show()
My build.sbt loads successfully, and I have created an object in IntelliJ to try to read my csv file from my desktop, but I need help with the correct dependencies to import and the correct logic to get IntelliJ to read the .csv file with headers from my desktop.

I was able to figure out my issue. I created another Spark Scala project using Scala 2.11.11 with version 2.1.0 of the spark-core and spark-sql dependencies. After refreshing the .sbt, all of the right dependencies were added and all of my errors were gone. I was able to load the csv file.
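For reference, here is a minimal sketch of what such a working setup might look like, assuming Scala 2.11.11 with spark-core and spark-sql 2.1.0 as described above. The column names `date` and `amount` are hypothetical (the post does not name the columns), and the date column is assumed to be in a format Spark can read as a date or timestamp, such as yyyy-MM-dd. The object name `ReadCsv` is also just illustrative.

```scala
// build.sbt (assumed, matching the versions mentioned above):
//   scalaVersion := "2.11.11"
//   libraryDependencies ++= Seq(
//     "org.apache.spark" %% "spark-core" % "2.1.0",
//     "org.apache.spark" %% "spark-sql"  % "2.1.0"
//   )

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format, sum}

object ReadCsv {
  def main(args: Array[String]): Unit = {
    // Local session for running inside the IDE
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReadCsv")
      .getOrCreate()

    // Forward slashes avoid invalid escape sequences such as "\U" in the string literal
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("C:/Users/trialrun/Desktop/DataExtract.csv")

    // Hypothetical columns: "date" and "amount"
    val totalsByMonth = df
      .groupBy(date_format(col("date"), "yyyy-MM").as("month"))
      .agg(sum(col("amount")).as("total"))
      .orderBy("month")

    totalsByMonth.show()
    spark.stop()
  }
}
```

Note that `show()` returns `Unit`, so its result should not be assigned to `df`; keep the DataFrame in a `val` and call `show()` on it separately.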

Related

java.lang.NoSuchMethodError: com.mongodb.internal.connection.Cluster.selectServer

I am new to Apache Spark and I am using Scala and Mongodb to learn it.
https://docs.mongodb.com/spark-connector/current/scala-api/
I am trying to read the RDD from my MongoDB database, my notebook script as below:
import com.mongodb.spark.config._
import com.mongodb.spark._
val readConfig = ReadConfig(Map("uri" -> "mongodb+srv://root:root@mongodbcluster.td5gp.mongodb.net/test_database.test_collection?retryWrites=true&w=majority"))
val testRDD = MongoSpark.load(sc, readConfig)
print(testRDD.collect)
At the print(testRDD.collect) line, I got this error:
java.lang.NoSuchMethodError:
com.mongodb.internal.connection.Cluster.selectServer(Lcom/mongodb/selector/ServerSelector;)Lcom/mongodb/internal/connection/Server;
And more than 10 lines "at..."
Used libraries:
org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
org.mongodb.scala:mongo-scala-driver_2.12:4.2.3
Is this the problem from Mongodb internal libraries or how could I fix it?
Many thanks
I suspect that there is a conflict between mongo-spark-connector and mongo-scala-driver. The former uses Mongo driver 4.0.5, but the latter is based on version 4.2.3. I would recommend trying with mongo-spark-connector only.
I was also facing the same problem and solved it by using the mongo-spark-connector_2.12:3.0.1 jar together with the Scalaj HTTP 2.4.2 jar. It's working fine now.
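If the libraries are managed through sbt (an assumption; the question uses a notebook, where libraries are attached differently), the suggestion above could be sketched as follows. The Scala patch version is a guess; only the connector coordinates come from the question.

```scala
// build.sbt (sketch; the Scala patch version is an assumption)
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  // The connector pulls in a compatible MongoDB Java driver transitively
  "org.mongodb.spark" %% "mongo-spark-connector" % "3.0.1"
  // "org.mongodb.scala" %% "mongo-scala-driver" % "4.2.3"  // left out to avoid the driver version clash
)
```

In a notebook environment the equivalent step would be to detach the mongo-scala-driver library and keep only the connector.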

NoClassDefFoundError: Could Not Initialise class org.apache.spark.package

I am attempting to make some changes to Apache Spark's MLlib. I cloned the latest Spark repo from GitHub, opened up MLlib as a project in IntelliJ with JDK 1.8.0 and scala-sdk-2.12.6, and created a scratch file to make sure I could run things.
Here's all the code presently being tested:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("IncrementalCB").getOrCreate()
It returns the error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.package$
at org.apache.spark.SparkContext.$anonfun$new$1(scratch_1.scala:179)
at org.apache.spark.internal.Logging.logInfo(scratch_1.scala:53)
at org.apache.spark.internal.Logging.logInfo$(scratch_1.scala:52)
at org.apache.spark.SparkContext.logInfo(scratch_1.scala:73)
at org.apache.spark.SparkContext.<init>(scratch_1.scala:179)
at org.apache.spark.SparkContext$.getOrCreate(scratch_1.scala:2508)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(scratch_1.scala:942)
at scala.Option.getOrElse(scratch_1.scala:134)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(scratch_1.scala:933)
at #worksheet#.spark$lzycompute(scratch_1.scala:2)
at #worksheet#.spark(scratch_1.scala:2)
at #worksheet#.get$$instance$$spark(scratch_1.scala:2)
at #worksheet#.#worksheet#(scratch_1.scala:10)
While I'm not quite sure what the situation is, I suspect it may be something JAR or version related. Anyone care to fill in the blanks? Thanks!
First: you don't need to clone the Spark repository from GitHub to work with Spark.
Second: instead of using a scratch file, it's better to set up a project with either Maven or sbt.
They will save you time by downloading all the dependencies.
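As a rough illustration of the sbt route, a project definition might look like the sketch below. The project name and all version numbers are assumptions chosen for illustration (Scala 2.12 builds of Spark are published from Spark 2.4 onward); pick the versions that match your environment.

```scala
// build.sbt (sketch; name and versions are hypothetical)
name := "incremental-cb"
scalaVersion := "2.12.12"

val sparkVersion = "3.0.1"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (here _2.12) to the artifact name
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
```

After IntelliJ imports this file, the `SparkSession` code from the question can live in a regular object with a `main` method under src/main/scala instead of a scratch file.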

spark, kafka integration issue: object kafka is not a member of org.apache.spark.streaming

I am receiving an error while building my Spark application (Scala) in the IntelliJ IDE.
It is a simple application which uses a Kafka stream for further processing. I have added all the jars, and the IDE does not show any unresolved imports or code statements.
However, when I try to build the artifact, I get two errors stating that
Error:(13, 35) object kafka is not a member of package
org.apache.spark.streaming
import org.apache.spark.streaming.kafka.KafkaUtils
Error:(35, 60) not found: value KafkaUtils
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
I have seen similar questions, but most people report this issue while submitting to Spark. However, I am one step behind that and am merely building the jar file which would ultimately be submitted to Spark. On top of that, I am using the IntelliJ IDE and am a bit new to Spark and Scala, so I am lost here.
Below is a snapshot of the IntelliJ error:
[image: IntelliJ Error]
Thanks
Omer
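The reason is that you need to add spark-streaming-kafka-K.version-Sc.version.jar (where K.version is the Kafka version and Sc.version is the Scala version) as a dependency in your pom.xml, and also place the jar in your Spark lib directory.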
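The answer refers to Maven; for an sbt build, the same idea could be sketched as below. All version numbers here are assumptions. The `org.apache.spark.streaming.kafka` package used in the question ships in the Kafka 0.8 integration artifact, which for Spark 2.x is named spark-streaming-kafka-0-8.

```scala
// build.sbt (sketch; all version numbers are assumptions)
scalaVersion := "2.11.12"

val sparkVersion = "2.4.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  // Provides org.apache.spark.streaming.kafka.KafkaUtils (receiver-based createStream)
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion
)
```

The Spark version of the Kafka artifact should match the Spark version of the other Spark modules in the build.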

Eclipse (set up with Scala environment): object apache is not a member of package org

As shown in the image, it gives an error when I import the Spark packages. Please help. When I hover there, it shows "object apache is not a member of package org".
I searched for this error, and it indicates that the Spark jars have not been imported. So I imported "spark-assembly-1.4.1-hadoop2.2.0.jar" too, but I still get the same error. Below is what I actually want to run:
import org.apache.spark.{SparkConf, SparkContext}
object ABC {
  def main(args: Array[String]){
    //Scala Main Method
    println("Spark Configuration")
    val conf = new SparkConf()
    conf.setAppName("My First Spark Scala Application")
    conf.setMaster("spark://ip-10-237-224-94:7077")
    println("Creating Spark Context")
  }
}
Adding the spark-core jar to your classpath should resolve your issue. Also, if you are using a build tool like Maven or Gradle (if not, you should, because spark-core has a lot of dependencies and you would keep hitting this problem for different jars), try to use the Eclipse task provided by these tools to set the classpath properly in your project.
I was also receiving the same error; in my case it was a compatibility issue: Spark 2.2.1 is not compatible with Scala 2.12 (it is compatible with 2.11.8), and my IDE was using Scala 2.12.3.
I resolved my error by
1) Importing the jar files from the base folder of Spark. After installing Spark on the C drive, there is a folder named Spark which contains a jars folder. This folder holds all the basic jar files.
In Eclipse, right-click on the project -> Properties -> Java Build Path. Under the 'Libraries' category there is an option 'Add External JARs...'. Select this option and import all the jar files from the jars folder. Click Apply.
2) Again go to Properties -> Scala Compiler -> Scala Installation -> Latest 2.11 bundle (dynamic)*
*Before selecting this option, one should check the compatibility of Spark and Scala.
The problem is that Scala is not binary compatible across versions such as 2.11 and 2.12. Hence each Spark module is compiled against a specific Scala library. But when we run from Eclipse, there is one Scala version that was used to compile the Spark dependency jar we add to the build path, and a second Scala version in the Eclipse run-time environment. The two may conflict.
This is a hard reality, although we wish Scala were backward compatible, or at least that a compiled jar file could be backward compatible.
Hence, the recommendation is to use Maven or a similar tool where dependency versions can be managed.
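One quick way to see which Scala version the run-time environment actually uses is to print it and compare its binary version with the suffix of the Spark jars (for example `_2.11` in `spark-core_2.11`). The object and helper names below are hypothetical; this is only a small sketch of that check.

```scala
object ScalaVersionCheck {
  // Reduce a full version such as "2.11.8" to its binary version "2.11"
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the Scala library found on the run-time classpath
    val running = scala.util.Properties.versionNumberString
    println(s"Running Scala $running (binary version ${binaryVersion(running)})")
  }
}
```

If the printed binary version differs from the suffix of the Spark jars on the build path, the two Scala versions described above are in conflict.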
If you are doing this in the context of Scala within a Jupyter Notebook, you'll get this error. You have to install the Apache Toree kernel:
https://github.com/apache/incubator-toree
and create your notebooks with that kernel.
You also have to start the Jupyter Notebook with:
pyspark

error: not found: value sc

I am new to Scala and am trying to read a file using the following code
scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
But I keep getting the following error
error: not found: value sc
I have tried everything, but nothing seems to work. I am using Scala version 2.10.4 and Spark 1.1.0 (I have even tried Spark 1.2.0, but it doesn't work either). I have sbt installed and compiled, yet I am not able to run sbt/sbt assembly. Is the error because of this?
You should run this code using ./spark-shell. It is a Scala REPL with a SparkContext already provided as sc. You can find it in the bin folder of your Apache Spark distribution, e.g. spark-1.4.1/bin.
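Outside of spark-shell (for example in a plain Scala program), `sc` is not defined automatically and has to be created by hand. A minimal sketch, assuming spark-core is on the classpath and a README.md exists in the working directory; the object name and the `local[*]` master are illustrative choices:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    // Build the context that spark-shell would otherwise provide as `sc`
    val conf = new SparkConf()
      .setAppName("LineCount")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    val textFile = sc.textFile("README.md")
    println(textFile.count())

    sc.stop()
  }
}
```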