Why does SparkSession with master('local') need an internet connection? - pyspark

I am new to pyspark.
I want to know why SparkSession with master('local') needs an internet connection.
I thought it should run the code only on my local computer.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[8]') \
    .appName('myAppName') \
    .getOrCreate()
And what happens if I don't close the session before I turn off the computer?
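For reference, local mode should not need internet access as such, but the driver still binds a socket and resolves the machine's hostname, which can fail offline. A minimal sketch that pins everything to the loopback address; spark.driver.bindAddress and spark.driver.host are standard Spark properties, and the loopback values are an assumption about a machine with no network:
from pyspark.sql import SparkSession

# Sketch of a fully offline local session: bind the driver sockets to
# loopback and advertise "localhost" instead of the machine's hostname.
spark = (SparkSession.builder
         .master('local[8]')
         .appName('myAppName')
         .config('spark.driver.bindAddress', '127.0.0.1')  # bind driver sockets to loopback
         .config('spark.driver.host', 'localhost')         # advertise localhost, skip hostname lookup
         .getOrCreate())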

Related

Prevent pyspark from using in-memory session/docker

We are looking into using Spark as a big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a local development environment/sandbox on my own computer, similar to that, interacting with Azure Data Lake Storage Gen 2.
For installing Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage on Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine by themselves, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local[1]") \
.appName("Python Spark SQL basic example") \
.getOrCreate()
When examining the spark object it outputs
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the UI for Spark in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how pyspark works. Any help to clarify would be much appreciated.
You are connecting to a local instance, which in this case is the native Windows environment running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port

jupyter notebook connecting to Apache Spark 3.0

I'm trying to connect my Scala kernel in a notebook environment to an existing Apache Spark 3.0 cluster.
I've tried the following methods of integrating Scala into a notebook environment:
Jupyter Scala (Almond)
Spylon Kernel
Apache Zeppelin
Polynote
In each of these Scala environments I've tried to connect to an existing cluster using the following script:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("spark://<ipaddress>:7077")
  .getOrCreate()
However, when I go to the web UI at localhost:8080, I don't see anything running on the cluster.
I am able to connect to the cluster using pyspark, but need help with connecting Scala to the cluster.
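For comparison, a pyspark sketch of the kind of connection that reportedly does work against the same standalone cluster; the master address is a placeholder to fill in:
from pyspark.sql import SparkSession

# Working pyspark equivalent of the Scala snippet above; replace the
# placeholder with the real master address before running.
spark = (SparkSession.builder
         .appName("Spark SQL basic example")
         .master("spark://<ipaddress>:7077")  # placeholder master URL
         .getOrCreate())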

Correct way to create SparkSession on a Docker container

Given this docker-compose setup https://github.com/big-data-europe/docker-spark , what would be the correct master setting to create a SparkSession from outside Docker?
// Scala
val sparkSession = SparkSession.builder
.appName("beg_You_for_help")
.master("spark://<host_?>:<port_?>")
.getOrCreate()
Many thanks in advance to anyone for help.
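A pyspark sketch, assuming the compose file publishes the master's port 7077 on the Docker host; both the host name and the port here are assumptions to check against the compose file's port mappings:
from pyspark.sql import SparkSession

# Assumed values: "localhost" is the Docker host and 7077 is the published
# standalone master port; verify both against the docker-compose port mappings.
spark = (SparkSession.builder
         .appName("beg_You_for_help")
         .master("spark://localhost:7077")
         .getOrCreate())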

Hive table created with Spark not visible from HUE / Hive GUI

I am creating a Hive table from Scala using the following code:
val spark = SparkSession
.builder()
.appName("self service")
.enableHiveSupport()
.master("local")
.getOrCreate()
spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must be created successfully, because if I run this code twice I receive an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any table in Hive, so it seems it's being saved in a different path than the one used by Hive in HUE to get this information.
Do you know what I should do to see the tables I create from my code in the HUE/Hive web GUI?
Any help will be very appreciated.
Thank you very much.
It seems to me that you have not added hive-site.xml to the proper path.
hive-site.xml has the properties that Spark needs to connect to Hive successfully, and you should add it to the directory
SPARK_HOME/conf/
You can also add this file by using spark.driver.extraClassPath and giving the directory where the file exists. For example, in a pyspark submit:
/usr/bin/spark2-submit \
  --conf spark.driver.extraClassPath=/<directory with hive-site.xml>/ \
  --master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
  --executor-cores n myScript.py
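An alternative sketch is to point the session at the metastore directly from code instead of shipping hive-site.xml; the thrift URI below is a hypothetical metastore address:
from pyspark.sql import SparkSession

# hive.metastore.uris is the standard property Spark reads to locate the
# Hive metastore; <metastore-host> is a hypothetical placeholder to replace.
spark = (SparkSession.builder
         .appName("self service")
         .enableHiveSupport()
         .config("hive.metastore.uris", "thrift://<metastore-host>:9083")
         .getOrCreate())

spark.sql("SHOW TABLES IN default").show()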

How to access remote HDFS cluster from my PC

I'm trying to access a remote Cloudera HDFS cluster from my local PC (Windows 7). As cricket_007 suggested in my last question, I did the following things:
(1) I created the following Spark session:
val spark = SparkSession
.builder()
.appName("API")
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.master("local")
.enableHiveSupport()
.getOrCreate()
(2) I copied the following files from the cluster:
core-site.xml
hdfs-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml
and configured the variable HADOOP_CONF_DIR to the directory that contains them
(3) I downloaded Spark and configured the variables SPARK_HOME and SPARK_CONF_DIR
(4) I downloaded winutils and added it to the PATH variable. I changed the permissions of /tmp/hive to 777.
When the master is set to local, I see only the default database, which means it doesn't pick up the XML files. When it is set to yarn, the screen is stuck: it looks like my PC is working, but it takes too much time and never finishes. When I use local and also add the line .config("hive.metastore.uris", "thrift://MyMaster:9083"), everything works well.
Why might this be happening? Why do I see only the default database locally? Why can't I connect when the master is set to yarn, and why does it hang? And why does adding the config line solve my problem only locally?
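For reference, a pyspark sketch of the combination the question says does work: local master plus an explicit metastore URI, where MyMaster is the cluster hostname used in the question:
from pyspark.sql import SparkSession

# The working combination described above: local master, Hive support, and
# an explicit metastore URI pointing at the cluster's metastore service.
spark = (SparkSession.builder
         .appName("API")
         .master("local")
         .enableHiveSupport()
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .config("hive.metastore.uris", "thrift://MyMaster:9083")
         .getOrCreate())

spark.sql("SHOW DATABASES").show()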