Correct way to create SparkSession on a Docker container - scala

Given this docker-compose setup, https://github.com/big-data-europe/docker-spark, what should the correct master setting be in order to create a SparkSession from outside Docker?
// Scala
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .appName("beg_You_for_help")
  .master("spark://<host_?>:<port_?>")
  .getOrCreate()
Many thanks in advance to anyone who can help.

Related

Prevent pyspark from using in-memory session/docker

We are looking into using Spark as a big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a local development environment/sandbox on my own computer, similar to that, interacting with Azure Data Lake Storage Gen 2.
For installing Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage in Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine by themselves, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local[1]") \
.appName("Python Spark SQL basic example") \
.getOrCreate()
When examining the spark object it outputs
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the UI for Spark in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how pyspark works. Any help clarifying this would be much appreciated.
You are connecting to a local instance, which in this case is the native Windows environment running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port

In pyspark, why does SparkSession with master('local') need an internet connection?

I am new to pyspark.
I want to know why SparkSession with master('local') needs an internet connection.
I thought that it should run the code only on my local computer.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[8]') \
    .appName('myAppName') \
    .getOrCreate()
And what happens if I don't close the session before I turn off the computer?

In AWS EMR, from Jupyter, pyspark's Hive-enabled Spark session only shows the default database and not all Hive databases

I have installed Jupyter on AWS EMR.
The following piece of code works fine in a non-AWS environment, but in AWS EMR, Jupyter only shows the default database in Hive.
From the Hive shell, show databases lists 6 databases, but from Jupyter it only shows default.
It shows 6 in a non-AWS cluster.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.enableHiveSupport() \
.getOrCreate()
display(spark.sql('show databases').show())
+---------+
|namespace|
+---------+
| default|
+---------+
None
spark
SparkSession - hive
SparkContext
Spark UI
Version: v3.0.1
Master: local
AppName: Python Spark SQL Hive integration example
from pyspark import SparkConf
from pyspark.sql import SparkSession

settings = [
    ("hive.metastore.uris", "thrift://xxxxx")  # from /etc/hive/conf/hive-site.xml
]
spark_conf = SparkConf().setAppName("Python Spark SQL Hive integration example").setAll(settings)
spark = SparkSession.builder \
    .config(conf=spark_conf) \
    .enableHiveSupport() \
    .getOrCreate()
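
For reference, the same setting can also go straight into the builder without a separate SparkConf; a minimal sketch, where "thrift://xxxxx" is a placeholder for the URI from /etc/hive/conf/hive-site.xml:
from pyspark.sql import SparkSession

# "thrift://xxxxx" is a placeholder; use the value from /etc/hive/conf/hive-site.xml.
# Note: getOrCreate() reuses an already-running session, in which case this
# config may not take effect; stop the old session or restart the kernel first.
spark = SparkSession.builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("hive.metastore.uris", "thrift://xxxxx") \
    .enableHiveSupport() \
    .getOrCreate()

# With the session pointed at the cluster metastore, this should list all databases.
spark.sql("show databases").show()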

Hive table created with Spark not visible from HUE / Hive GUI

I am creating a Hive table from Scala using the following code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("self service")
  .enableHiveSupport()
  .master("local")
  .getOrCreate()
spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must be successfully created, because if I run this code twice I receive an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any table in Hive, so it seems it's being saved in a different path than the one used by Hive in HUE to get this information.
Do you know what I should do to see the tables I create from my code in the HUE/Hive web GUI?
Any help will be very appreciated.
Thank you very much.
It seems to me you have not added hive-site.xml to the proper path.
hive-site.xml has the properties that Spark needs to connect successfully to Hive, and you should add this file to the directory
SPARK_HOME/conf/
You can also add this file by using spark.driver.extraClassPath and giving the directory where the file exists. For example, with spark-submit for pyspark:
/usr/bin/spark2-submit \
  --conf spark.driver.extraClassPath=<directory containing hive-site.xml> \
  --master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
  --executor-cores n myScript.py
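
To check whether a session is actually talking to the cluster's metastore rather than a local embedded one that HUE never sees, a quick diagnostic sketch in pyspark; the property names are standard Spark settings, but the expected values depend on your cluster:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# A local filesystem path here (rather than an HDFS/S3 warehouse location)
# suggests the session is writing to its own local warehouse and metastore.
print(spark.conf.get("spark.sql.warehouse.dir"))

# A table created in the shared metastore should also show up here.
spark.sql("SHOW TABLES IN default").show()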

How to change a mongo-spark connection configuration from a databricks python notebook

I succeeded in connecting to MongoDB from Spark, using the mongo-spark connector, from a Databricks notebook in Python.
Right now I am configuring the MongoDB URI in an environment variable, but that is not flexible, since I want to change the connection parameters right in my notebook.
I read in the connector documentation that it is possible to override any values set in the SparkConf.
How can I override the values from python?
You don't need to set anything in the SparkConf beforehand*.
You can pass any configuration options to the DataFrame Reader or Writer, e.g.:
df = sqlContext.read \
    .option("uri", "mongodb://example.com/db.coll") \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load()
* This was added in 0.2
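
Writing works the same way; a short sketch, where the uri is a placeholder and overwrite mode is just one choice:
# Pass the connection options directly to the DataFrame writer as well.
df.write \
    .option("uri", "mongodb://example.com/db.coll") \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("overwrite") \
    .save()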