Unable to hit breakpoints while debugging Spark with IntelliJ CE - scala

I am running a Spark JAR file with the command spark-submit testmysparkfile.jar, after having set export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005.
The code is written in Scala.
Below is the SparkSession I am creating:
val spark = SparkSession
  .builder()
  .appName("testmysparkfile")
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .master("local[*]")
  .getOrCreate()
When I run the JAR, the application starts and listens on port 5005. When I then go back to IntelliJ and run 'Debug', the debugger attaches fine and the sample output appears in the terminal window that was listening on port 5005; however, the breakpoint I have set is never hit.
Debugger settings:
Debugger mode: Attach to remote JVM
Host: localhost
Port: 5005
Command line args for JVM: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
Use module classpath:
Output from the IntelliJ terminal:
Connected to the target VM, address: 'localhost:5005', transport: 'socket'
Disconnected from the target VM, address: 'localhost:5005', transport: 'socket'
I have followed examples that show how to debug Spark using a remote debugger, like this one: http://www.bigendiandata.com/2016-08-26-How-to-debug-remote-spark-jobs-with-IntelliJ/
However, this does not seem to work, as I am unable to hit any breakpoints.
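For reference, a stripped-down sketch of the kind of driver involved (the object name, the sample DataFrame, and the action are illustrative placeholders, not the actual jar contents). The breakpoint is set on driver-side code, which runs in the same JVM that spark-submit launched and exposed on port 5005:

import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession

object TestMySparkFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("testmysparkfile")
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .master("local[*]")
      .getOrCreate()

    // Driver-side code: a breakpoint on the next line should be hit once the
    // remote debugger attaches, because it runs in the JVM started by spark-submit.
    val df = spark.range(10).toDF("id")
    df.show()

    spark.stop()
  }
}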

Related

How to ignore spark_home when unit testing with pyspark

I am trying to write unit tests with pyspark. The tests pass with the following configuration when SPARK_HOME is NOT set. There are multiple installations of Spark on our machines, and if SPARK_HOME is set to one of them, the tests fail on that machine.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session(request):
    session = SparkSession\
        .builder\
        .master("local[2]")\
        .appName("pytest-pyspark-local-testing")\
        .getOrCreate()
    # Stop the session once the whole test session finishes.
    request.addfinalizer(lambda: session.stop())
    quiet_py4j()  # local helper (defined elsewhere) that quiets py4j logging
    return session
I have tried os.environ["SPARK_HOME"] = "", which gives FileNotFoundError: [Errno 2] No such file or directory: './bin/spark-submit'.
I have also tried os.unsetenv('SPARK_HOME'), which gives Exception: Java gateway process exited before sending its port number. When I don't try to unset the env var, I get this same error as well.
How can I make sure that my tests will work on any machine, simply ignoring any environment variables?

Scala Spark : (org.apache.spark.repl.ExecutorClassLoader) Failed to check existence of class org on REPL class server at path

Running a basic df.show() after a spark-notebook installation
I am getting the following error when running Scala Spark code on spark-notebook. Any idea when this occurs and how to avoid it?
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org.apache.spark.sql.catalyst.expressions.Object on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class
I installed Spark locally, and when I was using the following code it was giving me the same error.
spark.read.format("json").load("Downloads/test.json")
I think the issue was that it was trying to find some master node and taking some random or default IP. I specified the mode and provided the IP as 127.0.0.1, and that resolved my issue.
Solution
Run Spark using a local master:
/usr/local/bin/spark-shell --master "local[4]" --conf spark.driver.host=127.0.0.1
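If you are building the session in code rather than launching spark-shell, the same two settings can be put on the builder. A minimal sketch, assuming a fresh local session (the app name is an arbitrary placeholder):

import org.apache.spark.sql.SparkSession

// Equivalent of --master "local[4]" --conf spark.driver.host=127.0.0.1
val spark = SparkSession
  .builder()
  .appName("local-json-test")               // arbitrary placeholder name
  .master("local[4]")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()

spark.read.format("json").load("Downloads/test.json").show()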

Failed to open native connection to Cassandra from jar produced by intellij

I've developed a Spark-based application in IntelliJ that gets data from Kafka and saves it in a Cassandra DB.
connection code in scala:
val cluster = Cluster.builder().addContactPoint("192.168.0.253").withPort(9042).build();
val session = cluster.connect()
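For reference, the same contact point can also be supplied through the spark-cassandra-connector configuration that the stack trace below goes through, instead of building a Cluster by hand. This is only a sketch and assumes the connector dependency is on the classpath; the CQL query is just a connectivity check:

import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector

// Contact point and port mirror the values used above.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "192.168.0.253")
  .set("spark.cassandra.connection.port", "9042")

val connector = CassandraConnector(conf)
connector.withSessionDo { session =>
  session.execute("SELECT release_version FROM system.local")
}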
The original Cluster.builder() code works fine when I run it from IntelliJ, but I get this error when I try to run it from the jar on the command line:
Exception in thread "main" java.io.IOException:
Failed to open native connection to Cassandra at {192.168.0.253}:9042 at .... ....
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed
(tried: /192.168.0.253:9042 (com.datastax.driver.core.exceptions.TransportException:
[/192.168.0.253] Error writing)) at
com.datastax.driver.core.ControlConnection.reconnectInternal(
ControlConnection.java:233) at
com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:79)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1424)
at com.datastax.driver.core.Cluster.getMetadata(Cluster.java:403)
at com.datastax.spark.connector.cql.CassandraConnector$.
com$datastax$spark$connector$cql$CassandraConnector$$createSession(
CassandraConnector.scala:155) ... 13 more
I've produced a jar file from IntelliJ and built it with dependencies, using the [copy to the output directory and link via manifest] option.
cassandra.yaml file:
#Whether to start the native transport server.
start_native_transport: true
#port for the CQL native transport to listen for clients on
native_transport_port: 9042
Why is this error raised, and how can I fix it?
Thanks in advance

Error running hadoop application in Eclipse on Windows

I'm trying to set up an Eclipse environment for developing and debugging Hadoop. I'm following Tom White's Hadoop: The Definitive Guide, 3rd ed. What I would like to do is get the MaxTemperature app working locally on my Windows machine within Eclipse before moving it to my Hortonworks sandbox VM. The comment on page 158 about using the local job runner seems to be what I want. I don't want to set up a full Hadoop installation on Windows; I'm hoping that with the right config params I can convince it to run as a Java application inside Eclipse.
Windows: 7
Eclipse: Luna
Hadoop: 2.4.0
JDK: 7
When I set the Run configuration for MaxTemperatureDriver (source code on page 157) to
inputfile outputdir foo (a deliberately bogus third parameter)
I get the usage message, so I know I'm running my program with those params.
If I remove the bogus third param, I get:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at mark.MaxTemperatureDriver.run(MaxTemperatureDriver.java:52)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at mark.MaxTemperatureDriver.main(MaxTemperatureDriver.java:56)
I've tried inserting -conf but it seems to be ignored. There is no error message if I specify a nonexistent path.
I've tried inserting -fs file:/// -jt local, but it makes no difference
I've tried inserting -D mapreduce.framework.name=local
I've tried specifying the input and output with the file: format
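The properties behind these flags can also be set directly on the job Configuration. A minimal sketch of just the relevant settings (written in Scala for consistency with the rest of this page; in the Java driver the conf.set(...) calls look the same):

import org.apache.hadoop.conf.Configuration

// Programmatic equivalent of "-D mapreduce.framework.name=local" and "-fs file:///".
val conf = new Configuration()
conf.set("mapreduce.framework.name", "local") // selects the local job runner (LocalClientProtocolProvider)
conf.set("fs.defaultFS", "file:///")          // use the local filesystem instead of HDFS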
Note: I'm not asking how to configure Eclipse to connect to a remote Hadoop installation; I want the application to run within Eclipse.
Is this possible? Any ideas?
Additional info:
I turned on debugging. I saw:
582 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Trying ClientProtocolProvider : org.apache.hadoop.mapred.YarnClientProtocolProvider
583 [main] DEBUG org.apache.hadoop.mapreduce.Cluster - Cannot pick org.apache.hadoop.mapred.YarnClientProtocolProvider as the ClientProtocolProvider - returned null protocol
I'm wondering not why YarnClientProtocolProvider failed, but why it didn't try LocalClientProtocolProvider.
New info:
It seems that this is an issue with Hadoop 2.4.0. I recreated my environment with Hadoop 1.2.1, followed the instructions in
http://gerrymcnicol.com/index.php/2014/01/02/hadoop-and-cassandra-part-4-writing-your-first-mapreduce-job/
added the Windows hack from
http://bigdatanerd.wordpress.com/2013/11/14/mapreduce-running-mapreduce-in-windows-file-system-debug-mapreduce-in-eclipse
and it all started working.
The following blog will be useful:
Running MapReduce in the Windows filesystem

error in eclipse --- java.lang.RuntimeException: Could not start Selenium session: null

Used JDK 7,
selenium-server-standalone-2.25.0.jar,
Eclipse Java Indigo SR1 win32.
When I run my server from cmd with:
java -jar selenium-server-standalone-2.25.0.jar 4444
the server stops by itself after showing some messages and drops back to the prompt; the server is not working properly.
When I run my program, I get an error in Eclipse:
java.lang.RuntimeException: Could not start Selenium session: null
Don't specify the port number 4444; the Selenium server runs on port 4444 by default.
In CMD, write: java -jar selenium-server-standalone-2.25.0.jar
If you want to give a port number, specify it as:
java -jar selenium-server-standalone-2.25.0.jar -port 4445