I am trying to create a test for some Spark code. The following code fails when getting a SparkSession object. NOTE: the test runs fine when run from the CLI: gradle my_module:build
@Test
def myTest(): Unit = {
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  ...
}
Error:
java.lang.IllegalArgumentException: Can't get Kerberos realm
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: KrbException: Cannot locate default realm
My set-up: IntelliJ + Gradle + Mac OS
Questions:
How do I run a Spark Test from within IntelliJ?
Why is Spark looking for Kerberos at all when running in 'local' mode?
Judging by your code, you need to run Spark from JUnit, not specifically from IntelliJ. You can try something like https://github.com/sleberknight/sparkjava-testing
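For reference, here is a minimal, self-contained sketch of such a JUnit test with a local SparkSession. The spark.hadoop.hadoop.security.authentication setting is an assumption on my part, forcing simple (non-Kerberos) authentication in case a Kerberos-enabled Hadoop configuration is being picked up from your environment:

import org.apache.spark.sql.SparkSession
import org.junit.Test

class LocalSparkTest {

  @Test
  def myTest(): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("local-spark-test")
      // assumption: force non-Kerberos auth for the embedded Hadoop client
      .config("spark.hadoop.hadoop.security.authentication", "simple")
      .getOrCreate()

    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("n")
    assert(df.count() == 3)

    spark.stop()
  }
}

Running this class from IntelliJ's JUnit runner should then behave the same as running it through gradle my_module:build.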
I want to show a Spark DataFrame, and I used:
df.writeStream.outputMode("append").start().awaitTermination()
But I got the following error when running this line:
21/07/16 01:20:53 ERROR MicroBatchExecution: Query [id = f243e6e6-c02e-4e70-b5c3-6a821fd33232, runId = 312544cf-fea8-45b4-94a1-c052306538cf] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.useDeprecatedKafkaOffsetFetching()Z
Check the version of Spark and the versions of the dependencies you have added, and make sure they all match. This will resolve the issue.
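For example, if you build with sbt, you would pin the Kafka connector to exactly the same Spark version as spark-sql; the version number below is only an illustration (the missing SQLConf.useDeprecatedKafkaOffsetFetching method suggests the connector and the Spark runtime come from different release lines):

// build.sbt -- a sketch; use the Spark version your cluster actually runs
val sparkVersion = "3.1.2"

libraryDependencies ++= Seq(
  // compile against the same spark-sql the runtime provides
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // the structured-streaming Kafka source must match spark-sql exactly
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)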
The connection to Databricks works fine, and working with DataFrames goes smoothly (operations like join, filter, etc.).
The problem appears when I call cache on a DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache.
: java.io.InvalidClassException: failed to read class descriptor
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$client53442a94a3$$anonfun$mapPartitions$1$$anonfun$apply$23
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:257)
at org.apache.spark.sql.util.ProtoSerializer.org$apache$spark$sql$util$ProtoSerializer$$readResolveClassDescriptor(ProtoSerializer.scala:4316)
at org.apache.spark.sql.util.ProtoSerializer$$anon$4.readClassDescriptor(ProtoSerializer.scala:4304)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1857)
... 71 more
I work with Java 8 as required; clearing pycache doesn't help.
The same code submitted as a job to databricks works fine.
It looks like a local problem at the Python-JVM bridge level, but the Java version (8) and Python version (3.7) are as required. Switching to Java 13 produces much the same message.
Versions databricks-connect==6.2.0, openjdk version "1.8.0_242", Python 3.7.6
EDIT:
Behavior depends on how the DataFrame is created: if the source of the DataFrame is external, it works fine; if the DataFrame is created locally, the error appears.
# works fine
df = spark.read.csv("dbfs:/some.csv")
df.cache()
# ERROR in 'cache' line
df = spark.createDataFrame([("a",), ("b",)])
df.cache()
This is a known issue, and I think a recent patch fixed it. It was seen on Azure; I am not sure whether you are using Azure or AWS, but it has been resolved. Please check the issue: https://github.com/MicrosoftDocs/azure-docs/issues/52431
spark version - 2.4.0
hbase version - 1.2.5
scala version - 2.11.8
Everything is setup in local VM.
Packages imported:
spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11,org.apache.hadoop:hadoop-common:2.7.3,org.apache.hbase:hbase-common:1.2.5,org.apache.hbase:hbase-client:1.2.5,org.apache.hbase:hbase-protocol:1.2.5,org.apache.hbase:hbase-hadoop2-compat:1.2.5,org.apache.hbase:hbase-server:1.2.5 --repositories http://repo.hortonworks.com/content/groups/public/
In the HBase shell:
Creating the table:
create 'cardata','software','hardware','other'
Inserting data into the table:
put 'cardata','v001_H','hardware:alloy_wheels','yes'
put 'cardata','v001_H','hardware:anti_Lock_break','yes'
put 'cardata','v001_H','software:electronic_breakforce_distribution','yes'
put 'cardata','v001_H','software:terrain_mode','yes'
put 'cardata','v001_H','software:traction_control','yes'
put 'cardata','v001_H','software:stability_control','yes'
put 'cardata','v001_H','software:cruize_control','yes'
put 'cardata','v001_H','other:make','hyundai'
put 'cardata','v001_H','other:model','i10'
put 'cardata','v001_H','other:variant','sportz'
In the REPL:
import org.apache.spark.sql.execution.datasources.hbase._
import spark.implicits._
def carCatalog = s"""{
"table":{"namespace":"default", "name":"cardata"},
"rowkey":"key",
"columns":{
"alloy_wheels":{"cf":"hardware", "col":"alloy_wheels", "type":"string"},
"anti_Lock_break":{"cf":"hardware", "col":"anti_Lock_break", "type":"string"},
"electronic_breakforce_distribution":{"cf":"software", "col":"electronic_breakforce_distribution", "type":"string"},
"terrain_mode":{"cf":"software", "col":"terrain_mode", "type":"string"},
"traction_control":{"cf":"software", "col":"traction_control", "type":"string"}
}
}""".stripMargin
val hbaseDF = spark.read.options(Map(HBaseTableCatalog.tableCatalog -> carCatalog)).format("org.apache.spark.sql.execution.datasources.hbase").load()
Error:
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:257)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
... 53 elided
NoSuchMethodError usually means the library version found at runtime is not the expected one (the one used at compile time).
It might be introduced by your platform adding default libraries on your behalf (with the spark-submit command) that conflict with the libraries you're using while developing.
A common solution is called shading, and it can be defined when creating your assembly jar (in build.sbt, for instance).
You can refer to this link: https://github.com/sbt/sbt-assembly#shading
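As an illustration, assuming an sbt-assembly build, a shading rule that relocates json4s (the library whose method is missing in your stack trace) could look like the following sketch:

// build.sbt -- requires the sbt-assembly plugin; a sketch, adjust to your own build
assembly / assemblyShadeRules := Seq(
  // relocate the json4s classes bundled in your jar so they cannot clash
  // with the json4s version that Spark ships at runtime
  ShadeRule.rename("org.json4s.**" -> "shaded.org.json4s.@1").inAll
)

Alternatively, marking the Spark artifacts as "provided" so that only one json4s version ends up on the classpath is often enough.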
I am trying to import a table from my Oracle database using Spark, and I am using Scala to import the table.
My JDBC driver is ojdbc7.jar, and it is added to both spark.driver.extraClassPath and spark.executor.extraClassPath in the configuration file:
spark.driver.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
I can successfully import the table, and I can print the schema of the table. But while performing any operation like count() or show(), it throws the error below:
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver
  at java.lang.ClassLoader.findClass(ClassLoader.java:530)
  at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
  at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
  ... 21 more
This error occurs because Spark is not able to locate ojdbc7.jar on every core node. Placing the jar in a shared location like /usr/lib/spark/jars will resolve the issue.
You can also do a few other things, including adding the jar file's full path as a dependency (artifact) under the spark section of the interpreter.
If you just want %jdbc to work, update the jdbc section under the interpreter: add the jar file's full path as an artifact under the dependencies, and also update default.driver, default.url, default.user, and default.password accordingly.
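As a complementary sketch (the connection details below are placeholders), the driver class can also be named explicitly on the JDBC read, and the jar can be shipped at submit time with spark-submit --jars /home/hadoop/ojdbc7.jar instead of editing extraClassPath:

// Scala sketch -- url, dbtable, user and password are placeholders
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("user", "my_user")
  .option("password", "my_password")
  .option("driver", "oracle.jdbc.OracleDriver") // make Spark register the driver class
  .load()

jdbcDF.printSchema() // runs on the driver only
jdbcDF.count()       // runs on executors, which must also be able to load ojdbc7.jar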
I am using Cloudera 4.2.0 and Spark.
I just want to try out some examples given by Spark.
// HdfsTest.scala
package spark.examples

import spark._

object HdfsTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "HdfsTest",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
    val file = sc.textFile("hdfs://n1.example.com/user/cloudera/data/navi_test.csv")
    val mapped = file.map(s => s.length).cache()
    for (iter <- 1 to 10) {
      val start = System.currentTimeMillis()
      for (x <- mapped) { x + 2 }
      // println("Processing: " + x)
      val end = System.currentTimeMillis()
      println("Iteration " + iter + " took " + (end - start) + " ms")
    }
    System.exit(0)
  }
}
It compiles fine, but there are always runtime problems:
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.fs.DelegationTokenRenewer.<init>(Ljava/lang/Class;)V from class org.apache.hadoop.hdfs.HftpFileSystem
at java.util.ServiceLoader.fail(ServiceLoader.java:224)
at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2229)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2240)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2257)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:86)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2296)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
at spark.SparkContext.hadoopFile(SparkContext.scala:263)
at spark.SparkContext.textFile(SparkContext.scala:235)
at spark.examples.HdfsTest$.main(HdfsTest.scala:9)
at spark.examples.HdfsTest.main(HdfsTest.scala)
Caused by: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.fs.DelegationTokenRenewer.<init>(Ljava/lang/Class;)V from class org.apache.hadoop.hdfs.HftpFileSystem
at org.apache.hadoop.hdfs.HftpFileSystem.<clinit>(HftpFileSystem.java:84)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at java.lang.Class.newInstance0(Class.java:374)
at java.lang.Class.newInstance(Class.java:327)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
... 16 more
I have searched on Google but found nothing about this kind of exception for Spark and HDFS.
val file = sc.textFile("hdfs://n1.example.com/user/cloudera/data/navi_test.csv") is where the problem occurs.
13/04/04 12:20:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
And I got this warning. Maybe I should add some Hadoop paths to the CLASS_PATH.
Feel free to give any clue. =)
Thank you all.
REN Hao
(This question was also asked / answered on the spark-users mailing list).
You need to compile Spark against the particular version of Hadoop/HDFS running on your cluster. From the Spark documentation:
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the HDFS protocol has changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs. You can change the version by setting the HADOOP_VERSION variable at the top of project/SparkBuild.scala, then rebuilding Spark (sbt/sbt clean compile).
The spark-users mailing list archives contain several questions about compiling against specific Hadoop versions, so I would search there if you run into any problems when building Spark.
You can set Cloudera's Hadoop version with an environment variable when building Spark. Look up your exact artifact version on Cloudera's Maven repo; it should be this:
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 sbt/sbt assembly publish-local
Make sure you run whatever you run with the same Java engine you used to build Spark. Also, there are pre-built Spark packages for different Cloudera Hadoop distributions, such as http://spark-project.org/download/spark-0.8.0-incubating-bin-cdh4.tgz
This might be a problem related to the Java installed on your system. Hadoop requires (Sun) Java 1.6+.
Make sure you have:
JAVA_HOME="/usr/lib/jvm/java-6-sun"