Where do you configure Spark and Hive settings on a local install of Jupyter? - scala

I'm trying to configure Spark in my local IDE and my local Conda Jupyter environment to use our corporate Spark/Hive connection, which has specs similar to this:
host: mycompany.com
port: 10003
I tried to configure spark-defaults.conf:
spark.master spark://mycompany.com:10003
When I try to call the Spark context (sc) in Jupyter, I get the following error:
Exception: Java gateway process exited before sending the driver its port number
Does anyone know of any good documentation that I can use to configure my local instance of Jupyter and/or NetBeans to use Spark with Scala or Python?
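For reference, a minimal Scala sketch of the equivalent programmatic configuration, assuming the endpoint really is a standalone Spark master (if port 10003 is actually a HiveServer2/Thrift endpoint, a JDBC connection would be needed instead):
import org.apache.spark.sql.SparkSession

// Sketch only: mycompany.com:10003 is the endpoint from the question.
val spark = SparkSession.builder()
  .appName("local-jupyter-test")
  .master("spark://mycompany.com:10003")  // same value as spark.master in spark-defaults.conf
  .enableHiveSupport()                    // requires hive-site.xml on the classpath
  .getOrCreate()

val sc = spark.sparkContext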

Related

Connection to remote Hadoop Cluster (CDP) through Linux server

I'm new to PySpark and I want to connect to a remote Hadoop cluster (CDP) from a Linux server by using the spark-submit command.
I need a spark-submit command that connects to the remote CDP cluster. Any help would be appreciated.
You can use Apache Livy to submit remote jobs to a CDP cluster. Here is detailed info on how to install and use Livy to submit jobs:
After downloading and unzipping Livy, add the following lines to the livy.conf file, then start the Livy service:
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
You can find examples of how to create a Spark submit request at the following links:
https://community.cloudera.com/t5/Community-Articles/Submit-a-Spark-Job-to-CDP-Data-Hub-using-the-Livy-REST-API/ta-p/322481
https://livy.apache.org/examples/
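As a rough illustration, the following Scala sketch POSTs a batch job to Livy's /batches REST endpoint; the Livy host, jar path, and class name below are placeholders, and Livy's default port 8998 is assumed:
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object SubmitLivyBatch {
  def main(args: Array[String]): Unit = {
    // Placeholders: point these at your Livy server, application jar, and main class.
    val livyUrl = "http://livy-host:8998/batches"
    val payload = """{"file": "hdfs:///user/me/my-spark-app.jar", "className": "com.example.MySparkApp"}"""

    val conn = new URL(livyUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))

    // Livy replies with a JSON description of the newly created batch session.
    println(s"HTTP ${conn.getResponseCode}")
    println(scala.io.Source.fromInputStream(conn.getInputStream).mkString)
  }
}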

how to run first example of Apache Flink

I am trying to run the first example from the O'Reilly book "Stream Processing with Apache Flink" and from the Flink project. Each gives a different error.
The example from the book gives a NoClassDefFoundError.
The example from the Flink project gives java.net.ConnectException: Connection refused (Connection refused), but does create a Flink job.
Details below.
Book example
java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError:scala/runtime/java8/JFunction1$mcVI$sp
at io.github.streamingwithflink.chapter1.AverageSensorReadings$$anon$3.createSerializer(AverageSensorReadings.scala:50)
The instructions from the book are:
download flink-1.7.1-bin-scala_2.12.tgz
extract
start cluster ./bin/start-cluster.sh
open flink's web UI http://localhost:8081
this all works fine
Download the jar file that includes examples in this book
run example
./bin/flink run \
-c io.github.streamingwithflink.chapter1.AverageSensorReadings \
examples-scala.jar
From the error message at the top of this post, it seems that the class is not found.
I put the jar in the same directory that I am running the command from.
java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (Zulu 8.44.0.9-CA-macosx) (build 1.8.0_242-b20)
OpenJDK 64-Bit Server VM (Zulu 8.44.0.9-CA-macosx) (build 25.242-b20, mixed mode)
I also tried compiling the jar myself:
https://github.com/streaming-with-flink/examples-scala.git
and
mvn clean build
The error is the same.
Flink project tutorial
running the SocketWindowWordCount
./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
I get a job but it fails
gives java.net.ConnectException: Connection refused (Connection refused)
It is not clear to me what connection is refused. I tried different ports with no change.
How can I run flink code successfully?
I tried to reproduce the failing AverageSensorReadings example, but it was working on my setup. I'll try to look deeper into it tomorrow.
Regarding the SocketWindowWordCount example, the error message indicates that the Flink job failed to open a connection to the socket on port 9000. You need to open the socket before you start the job. You can do this for example with netcat:
nc -l 9000
After the job is running, you can send messages by typing into the netcat shell, and these messages will be ingested into the Flink job. You can see the stats in the Web UI evolving according to the number of words in your messages.
Note that netcat closes the socket when you stop the Flink job.
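For context, the example's source is (roughly) a socket text stream like the sketch below, which is why the job fails immediately when nothing is listening on the port (this is only the relevant call, not the full SocketWindowWordCount program):
import org.apache.flink.streaming.api.scala._

object SocketSourceSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // The job connects to this host/port as a client;
    // "Connection refused" means nothing (e.g. netcat) is listening there yet.
    val lines: DataStream[String] = env.socketTextStream("localhost", 9000)
    lines.print()
    env.execute("Socket source sketch")
  }
}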
I am able to run the "Stream Processing with Apache Flink" code from IntelliJ.
See this post
I am able to run the "Stream Processing with Apache Flink" AverageSensorReadings code on my flink cluster by using sbt. I have never used sbt before but thought I would try it. My project is here
Note that I moved AverageSensorReadings.scala to chapter5, since that is where the code is explained, and changed the package to com.mitzit.
Use sbt assembly to create the jar, then run it on the Flink cluster:
./bin/flink run \
-c com.mitzit.chapter5.AverageSensorReadings \
/path/to/project/sbt-flink172/target/scala-2.11/sbt-flink172-assembly-0.1.jar
It works fine. I have no idea why this works and the mvn-compiled jar does not.
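For reference, the build definition would have been roughly along these lines (a sketch; Flink 1.7.2 and Scala 2.11 are assumptions inferred from the assembly jar's name, and the sbt-assembly plugin version is a placeholder):
// build.sbt (sketch)
name := "sbt-flink172"
version := "0.1"
scalaVersion := "2.11.12"

val flinkVersion = "1.7.2"

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
)

// project/plugins.sbt would contain something like:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")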

Pyspark / pyspark kernels not working in jupyter notebook

Here are installed kernels:
$jupyter-kernelspec list
Available kernels:
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala
apache_toree_sql /usr/local/share/jupyter/kernels/apache_toree_sql
pyspark3kernel /usr/local/share/jupyter/kernels/pyspark3kernel
pysparkkernel /usr/local/share/jupyter/kernels/pysparkkernel
python3 /usr/local/share/jupyter/kernels/python3
sparkkernel /usr/local/share/jupyter/kernels/sparkkernel
sparkrkernel /usr/local/share/jupyter/kernels/sparkrkernel
A new notebook was created but fails with
The code failed because of a fatal error:
Error sending http request and maximum retry encountered..
There is no [error] message in the jupyter console
If you use sparkmagic to connect your Jupyter notebook, you should also start Livy, which is the API service sparkmagic uses to talk to your Spark cluster:
Download Livy from Apache Livy and unzip it.
Check that the SPARK_HOME environment variable is set; if not, set it to your Spark installation directory.
Run the Livy server with <livy_home>/bin/livy-server in the shell/command line.
Now go back to your notebook; you should be able to run Spark code in a cell.

Cassandra connection to Spark

I am connecting Spark with Cassandra and storing a CSV file in Cassandra. When I enter this command, I get an error.
dfprev.write.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "sensorkeyspace", "table" -> "sensortable")).save()
Then I got this error:
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:168)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$8.apply(CassandraConnector.scala:154)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$8.apply(CassandraConnector.scala:154)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
Is your Cassandra listening on localhost? You may need to configure the list of IP addresses of your Cassandra cluster by setting spark.cassandra.connection.host in the Spark configuration. See the documentation for details.
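For example, a sketch of setting it when building the session (the address below is a placeholder for one of your Cassandra nodes):
import org.apache.spark.sql.SparkSession

// Sketch: point the connector at a reachable Cassandra node instead of the
// default 127.0.0.1 (replace 10.0.0.5 with one of your cluster's addresses).
val spark = SparkSession.builder()
  .appName("cassandra-write")
  .config("spark.cassandra.connection.host", "10.0.0.5")
  .getOrCreate()

// The dfprev.write ... .save() call from the question can then be retried
// against this session.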
It might be any one of the following:
The Cassandra server might not be running at 127.0.0.1:9042.
Please check that Cassandra is listening on port 9042 using the netstat -an command.
There might be dependency issues when building the fat jar.
Please make sure that you have added the right version of the Cassandra connector to your library dependencies, e.g.:
"com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"
I am running this command: ./spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 --conf spark.cassandra.connection.host=127.0.0.
The package should be specified as:
spark-shell --packages "com.datastax.spark":"spark-cassandra-connector_2.11":"2.0.0-M3"
Check these things; they may solve your problem:
1. Find the cqlsh.py file on your system by entering the below command in the shell:
whereis cqlsh
2. Edit cqlsh.py and change the DEFAULT_HOST to your Cassandra node's IP.
3. Initiate the Spark context with the following SparkConf():
val conf = new SparkConf().set("spark.cassandra.connection.host", "<YOUR IP>")
val sc = new SparkContext(conf)

Debugging MapReduce Hadoop in local mode in eclipse. Failed to connect remote VM

I am new to Hadoop and I am trying to debug MapReduce Hadoop in local mode in Eclipse on VirtualBox Ubuntu, following these articles: Debug Custom Java hadoop code in local environment and Hadoop MapReduce Debugging in Local Setup.
In hadoop-env.sh I put the text
export HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8008"
I tried to run Eclipse from the command line:
eclipse -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008
I also changed fs.default.name from hdfs to file:/// in core-site.xml in the Hadoop configuration:
<name>fs.default.name</name>
<value>file:///localhost:8020</value>
I checked port 8080 and it seems to work okay:
netstat -atn | grep 8080
says tcp6 8080 LISTEN
http://localhost:8080 opens in the browser and says "Required param job, map and reduce".
Still, it is all useless, because when I try to set up a debug configuration with port 8080 in Eclipse it fails with "Failed to connect to remote VM".
Can anyone suggest a possible solution?
That isn't the way to run Eclipse as a debugger.
Run Eclipse without any command-line options and set up a debug configuration for a remote Java application that connects to port 8008.
[EDIT]
I also think your hadoop debug options are wrong. I use:
-agentlib:jdwp=transport=dt_socket,address=8008,server=y,suspend=n