Summary of steps executed:
Uploaded the python script to S3.
Created a virtualenv that installs graphframes and uploaded it to S3.
Added a VPC to my EMR application.
Added graphframes package to spark conf.
The error message was:
22/09/11 18:44:49 INFO Utils: /home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.16.jar has been previously copied to /tmp/spark-d0c75876-b210-4ce2-b2f8-cd65e59d00db/userFiles-1aef56b8-1c39-4158-b9a2-ee810a844314/org.slf4j_slf4j-api-1.7.16.jar
22/09/11 18:44:49 INFO Executor: Fetching file:/tmp/spark-ae6aefff-23d3-447e-a743-5ed41621fad0/pyspark_ge.tar.gz#environment with timestamp 1662921887645
22/09/11 18:44:49 INFO Utils: Copying /tmp/spark-ae6aefff-23d3-447e-a743-5ed41621fad0/pyspark_ge.tar.gz to /tmp/spark-7f9cf09a-dc6e-41e0-b187-61f1b5a80670/pyspark_ge.tar.gz
22/09/11 18:44:49 INFO Executor: Unpacking an archive file:/tmp/spark-ae6aefff-23d3-447e-a743-5ed41621fad0/pyspark_ge.tar.gz#environment from /tmp/spark-7f9cf09a-dc6e-41e0-b187-61f1b5a80670/pyspark_ge.tar.gz to /tmp/spark-d0c75876-b210-4ce2-b2f8-cd65e59d00db/userFiles-1aef56b8-1c39-4158-b9a2-ee810a844314/environment
22/09/11 18:44:49 INFO Executor: Fetching spark://[2600:1f18:61b4:c700:3016:ab0c:287a:ccaf]:34507/jars/org.slf4j_slf4j-api-1.7.16.jar with timestamp 1662921887645
22/09/11 18:44:50 ERROR Utils: Aborting task
java.io.IOException: Failed to connect to /2600:1f18:61b4:c700:3016:ab0c:287a:ccaf:34507
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
Could anybody point me to a solution/documentation that helps with this problem?
ANNEX
I also tried adding just the graphframes package, but got a "numpy not found" error.
More detailed steps
python script:
import sys
from operator import add
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from graphframes import *
print('job started')
conf = (SparkConf()
        .setMaster("local")
        .setAppName("GraphDataFrame")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

if __name__ == "__main__":
    # Create a Vertex DataFrame with unique ID column "id"
    v = sqlContext.createDataFrame([
        ("a", "Alice", 34),
        ("b", "Bob", 36),
        ("c", "Charlie", 30),
    ], ["id", "name", "age"])
    # Create an Edge DataFrame with "src" and "dst" columns
    e = sqlContext.createDataFrame([
        ("a", "b", "friend"),
        ("b", "c", "follow"),
        ("c", "b", "follow"),
    ], ["src", "dst", "relationship"])
    # Create a GraphFrame
    from graphframes import *
    g = GraphFrame(v, e)
    # Query: Get in-degree of each vertex.
    print(g.inDegrees.show())
    # Query: Count the number of "follow" connections in the graph.
    g.edges.filter("relationship = 'follow'").count()
    # Run PageRank algorithm, and show results.
    results = g.pageRank(resetProbability=0.01, maxIter=20)
    print(results.vertices.select("id", "pagerank").show())
    sc.stop()
Virtualenv Dockerfile (the packed venv built from it is copied to S3):
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install venv-pack==0.2.0
RUN python3 -m pip install graphframes
...
Just added a default generated VPC with NAT.
Graphframes conf added:
--conf spark.jars.packages=graphframes:graphframes:0.8.2-spark3.2-s_2.12
Conf for the virtualenv:
--conf spark.archives=s3://bucket_name/venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
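A quick, hedged sanity check (my addition): with the archive and PYSPARK_PYTHON settings above in effect, printing the interpreter path from the driver should point inside the unpacked environment.
import sys
print(sys.executable)  # expected to resolve under ./environment/bin/ when the venv is picked up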
Found the solution:
We can't create the context the way I did (hard-coding setMaster("local")):
conf = (SparkConf()
.setMaster("local")
.setAppName("GraphDataFrame")
.set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
Use this instead:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HelloGraphFrame").getOrCreate()
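For completeness, a minimal sketch of the script rewritten around SparkSession (my addition; it assumes graphframes is still supplied through the spark.jars.packages and archive configuration shown above):
from pyspark.sql import SparkSession
from graphframes import GraphFrame

if __name__ == "__main__":
    spark = SparkSession.builder.appName("GraphDataFrame").getOrCreate()

    # Same vertices and edges as in the original script.
    v = spark.createDataFrame([
        ("a", "Alice", 34),
        ("b", "Bob", 36),
        ("c", "Charlie", 30),
    ], ["id", "name", "age"])
    e = spark.createDataFrame([
        ("a", "b", "friend"),
        ("b", "c", "follow"),
        ("c", "b", "follow"),
    ], ["src", "dst", "relationship"])

    g = GraphFrame(v, e)
    g.inDegrees.show()
    print(g.edges.filter("relationship = 'follow'").count())

    results = g.pageRank(resetProbability=0.01, maxIter=20)
    results.vertices.select("id", "pagerank").show()

    spark.stop()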
Related
Directory: /home/hadoop/
module.py
def incr(value):
    return int(value + 1)
main.py
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
import sys
sys.path.append('/home/hadoop/')
import module
if __name__ == '__main__':
    df = spark.createDataFrame([['a', 1], ['b', 2]], schema=['id', 'value'])
    df.show()

    print(module.incr(5))  # this works

    # this throws module not found error
    incr_udf = F.udf(lambda val: module.incr(val), T.IntegerType())
    df = df.withColumn('new_value', incr_udf('value'))
    df.show()
Spark task nodes do not have access to /home/hadoop/
How do I import module.py from within spark task nodes?
If you are submitting the Spark job to YARN, the task is launched by the 'yarn' user on the worker node, which does not have permission to access /home/hadoop/.
You can add --py-files module.py to your spark-submit command; then you can call the functions in module.py directly (e.g. from module import *), since the file is shipped to every container.
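A minimal sketch of the programmatic counterpart (my addition), using SparkContext.addPyFile instead of the --py-files flag; the /home/hadoop path is the one from the question:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Ship module.py to every executor so the task nodes can import it.
spark.sparkContext.addPyFile('/home/hadoop/module.py')

def incr_value(val):
    # Import inside the function so the lookup happens on the executor,
    # where addPyFile (or --py-files) has already placed module.py.
    import module
    return module.incr(val)

incr_udf = F.udf(incr_value, T.IntegerType())

df = spark.createDataFrame([['a', 1], ['b', 2]], schema=['id', 'value'])
df.withColumn('new_value', incr_udf('value')).show()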
I have SSH-ed into the Amazon EMR server and I want to submit a Spark job written in Python from the terminal (a simple word-count script and a sample.txt are both on the EMR server). How do I do this and what's the syntax?
The word_count.py is as follows:
from pyspark import SparkConf, SparkContext
from operator import add
import sys
## Constants
APP_NAME = " HelloWorld of Big Data"

## OTHER FUNCTIONS/CLASSES
def main(sc, filename):
    textRDD = sc.textFile(filename)
    words = textRDD.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
    wordcount = words.reduceByKey(add).collect()
    for wc in wordcount:
        print(wc[0], wc[1])

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XXXX")
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YYYY")
    filename = "s3a://bucket_name/sample.txt"
    # filename = sys.argv[1]
    # Execute Main functionality
    main(sc, filename)
You can run this command:
spark-submit s3://your_bucket/your_program.py
If you need to run the script using Python 3, run this command before spark-submit:
export PYSPARK_PYTHON=python3.6
Remember to upload your program to the bucket before running spark-submit.
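If you'd rather not hardcode the S3 path, a hedged sketch of the commented-out sys.argv variant from the script above, with the matching submit command (bucket and file names are placeholders):
if __name__ == "__main__":
    conf = SparkConf().setAppName(APP_NAME).setMaster("local[*]")
    sc = SparkContext(conf=conf)
    filename = sys.argv[1]  # e.g. s3a://bucket_name/sample.txt
    main(sc, filename)
Then submit it as:
spark-submit s3://your_bucket/word_count.py s3a://bucket_name/sample.txt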
I am trying to connect to snowflake from Pyspark on my local machine.
My code looks as below.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext
sc = SparkContext("local", "sf_test")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('sf_test')
sfOptions = {
"sfURL" : "someaccount.some.address",
"sfAccount" : "someaccount",
"sfUser" : "someuser",
"sfPassword" : "somepassword",
"sfDatabase" : "somedb",
"sfSchema" : "someschema",
"sfWarehouse" : "somedw",
"sfRole" : "somerole",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
I get an error when I run this particular chunk of code.
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query","""select * from
"PRED_ORDER_DEV"."SALES"."V_PosAnalysis" pos
ORDER BY pos."SAPAccountNumber", pos."SAPMaterialNumber" """).load()
Py4JJavaError: An error occurred while calling o115.load. :
java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
I have loaded the connector and jdbc jar files and added them to CLASSPATH
pyspark --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4
CLASSPATH = C:\Program Files\Java\jre1.8.0_241\bin;C:\snowflake_jar
I want to be able to connect to snowflake and read data with Pyspark. Any help would be much appreciated!
To run a PySpark application you can use spark-submit and pass the JARs under the --packages option. I'm assuming you'd like to run in client mode, so you pass that to the --deploy-mode option, and finally you add the name of your PySpark program.
Something like below:
spark-submit --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4 --deploy-mode client spark-snowflake.py
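If you'd rather keep the dependency inside the script, a hedged sketch that pulls the same coordinates through spark.jars.packages (this must be configured before the session is created):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sf_test")
         .config("spark.jars.packages",
                 "net.snowflake:snowflake-jdbc:3.11.1,"
                 "net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4")
         .getOrCreate())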
Below is a working script.
You should create a directory named jar in the root of your project and add two jars:
snowflake-jdbc-3.13.4.jar (jdbc driver)
spark-snowflake_2.12-2.9.0-spark_3.1.jar (spark connector).
Next you need to find out your Scala compiler version. I'm using PyCharm, so I double-press Shift and search for 'scala'. You will see something like scala-compiler-2.12.10.jar. The first digits of the scala-compiler version (in our case 2.12) should match the first digits of the Spark connector (spark-snowflake_2.12-2.9.0-spark_3.1.jar).
Driver - https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/
Connector - https://docs.snowflake.com/en/user-guide/spark-connector-install.html#downloading-and-installing-the-connector
CHECK SCALA COMPILER VERSION BEFORE DOWNLOADING CONNECTOR
from pyspark.sql import SparkSession
sfOptions = {
"sfURL": "sfURL",
"sfUser": "sfUser",
"sfPassword": "sfPassword",
"sfDatabase": "sfDatabase",
"sfSchema": "sfSchema",
"sfWarehouse": "sfWarehouse",
"sfRole": "sfRole",
}
spark = SparkSession.builder \
.master("local") \
.appName("snowflake-test") \
.config('spark.jars', 'jar/snowflake-jdbc-3.13.4.jar,jar/spark-snowflake_2.12-2.9.0-spark_3.1.jar') \
.getOrCreate()
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from some_table") \
.load()
df.show()
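Writing back to Snowflake follows the same pattern; a hedged sketch (the target table name is a placeholder):
df.write.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "SOME_TARGET_TABLE") \
    .mode("append") \
    .save()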
I am trying to write a PySpark DataFrame to Redshift but it results in the following error:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
Spark Version: 2.4.1
Spark-submit command: spark-submit --master local[*] --jars ~/Downloads/spark-avro_2.12-2.4.0.jar,~/Downloads/aws-java-sdk-1.7.4.jar,~/Downloads/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar,~/Downloads/hadoop-aws-2.7.3.jar,~/Downloads/hadoop-common-2.7.3.jar --packages com.databricks:spark-redshift_2.11:2.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.hadoop:hadoop-common:2.7.3,org.apache.spark:spark-avro_2.12:2.4.0 script.py
from pyspark.sql import DataFrameReader
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import sys
import os
pe_dl_dbname = os.environ.get("REDSHIFT_DL_DBNAME")
pe_dl_host = os.environ.get("REDSHIFT_DL_HOST")
pe_dl_port = os.environ.get("REDSHIFT_DL_PORT")
pe_dl_user = os.environ.get("REDSHIFT_DL_USER")
pe_dl_password = os.environ.get("REDSHIFT_DL_PASSWORD")
s3_bucket_path = "s3-bucket-name/sub-folder/sub-sub-folder"
tempdir = "s3a://{}".format(s3_bucket_path)
driver = "com.databricks.spark.redshift"
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
sc._jsc.hadoopConfiguration().set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
datalake_jdbc_url = 'jdbc:redshift://{}:{}/{}?user={}&password={}'.format(pe_dl_host, pe_dl_port, pe_dl_dbname, pe_dl_user, pe_dl_password)
"""
The table is created in Redshift as follows:
create table adhoc_analytics.testing (name varchar(255), age integer);
"""
l = [('Alice', 1)]
df = spark.createDataFrame(l, ['name', 'age'])
df.show()
df.write \
.format("com.databricks.spark.redshift") \
.option("url", datalake_jdbc_url) \
.option("dbtable", "adhoc_analytics.testing") \
.option("tempdir", tempdir) \
.option("tempformat", "CSV") \
.save()
Databricks Spark-Redshift doesn't work with Spark version 2.4.1.
Here is the version that I maintain to make it work with Spark 2.4.1:
https://github.com/goibibo/spark-redshift
How to use it:
pyspark --packages "com.github.goibibo:spark-redshift:v4.1.0" --repositories "https://jitpack.io"
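If you need the same thing from a standalone script rather than the pyspark shell, a hedged sketch that sets the package and the JitPack repository through Spark configs (both must be set before the session starts):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("redshift-write")
         .config("spark.jars.packages", "com.github.goibibo:spark-redshift:v4.1.0")
         .config("spark.jars.repositories", "https://jitpack.io")
         .getOrCreate())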
I'm trying to connect spark (pyspark) to mongodb as follows:
conf = SparkConf()
conf.set('spark.mongodb.input.uri', default_mongo_uri)
conf.set('spark.mongodb.output.uri', default_mongo_uri)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession \
.builder \
.appName("my-app") \
.config("spark.mongodb.input.uri", default_mongo_uri) \
.config("spark.mongodb.output.uri", default_mongo_uri) \
.getOrCreate()
But when I do the following:
users = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
.option("uri", '{uri}.{col}'.format(uri=mongo_uri, col='users')).load()
I get this error:
java.lang.ClassNotFoundException: Failed to find data source:
com.mongodb.spark.sql.DefaultSource
I did the same thing from pyspark shell and I was able to retrieve data. This is the command I ran:
pyspark --conf "spark.mongodb.input.uri=mongodb_uri" --conf "spark.mongodb.output.uri=mongodburi" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.2
But here we have the option to specify the package we need. What about standalone apps and scripts? How can I configure the mongo-spark-connector there?
Any ideas?
Here is how I did it in a Jupyter notebook:
1. Download the jars from Maven Central or any other repository and put them in a directory called "jars":
mongo-spark-connector_2.11-2.4.0
mongo-java-driver-3.9.0
2. Create a session and write/read any data:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
working_directory = 'jars/*'
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection") \
.config("spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection") \
.config('spark.driver.extraClassPath', working_directory) \
.getOrCreate()
people = my_spark.createDataFrame([("JULIA", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 22)], ["name", "age"])
people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.select('*').where(col("name") == "JULIA").show()
As a result you will see the row for "JULIA" in the output.
If you are using SparkContext & SparkSession, mention the connector package in SparkConf, as in the following code:
from pyspark import SparkContext,SparkConf
conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
If you are using only SparkSession then use the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017/sample1.zips") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.3.2') \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
If you're using the newest version of mongo-spark-connector, i.e. v10.0.1 at the time of writing, you need to use a SparkConf object, as stated in the MongoDB documentation (https://www.mongodb.com/docs/spark-connector/current/configuration/).
Besides, you don't need to manually download anything; the connector will do it for you.
Below is the solution I came up with, for:
mongo-spark-connector: 10.0.1
mongo server : 5.0.8
spark : 3.2.0
import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean


def init_spark():
    password = os.environ["MONGODB_PASSWORD"]
    user = os.environ["MONGODB_USER"]
    host = os.environ["MONGODB_HOST"]
    db_auth = os.environ["MONGODB_DB_AUTH"]
    mongo_conn = f"mongodb://{user}:{password}@{host}:27017/{db_auth}"

    conf = SparkConf()
    # Download mongo-spark-connector and its dependencies.
    # This will download all the necessary jars and put them in $HOME/.ivy2/jars,
    # no need to manually download them:
    conf.set("spark.jars.packages",
             "org.mongodb.spark:mongo-spark-connector:10.0.1")

    # Set up read connection:
    conf.set("spark.mongodb.read.connection.uri", mongo_conn)
    conf.set("spark.mongodb.read.database", "<my-read-database>")
    conf.set("spark.mongodb.read.collection", "<my-read-collection>")

    # Set up write connection:
    conf.set("spark.mongodb.write.connection.uri", mongo_conn)
    conf.set("spark.mongodb.write.database", "<my-write-database>")
    conf.set("spark.mongodb.write.collection", "<my-write-collection>")
    # If you need to update instead of inserting:
    conf.set("spark.mongodb.write.operationType", "update")

    SparkContext(conf=conf)

    return SparkSession \
        .builder \
        .appName('<my-app-name>') \
        .getOrCreate()
spark = init_spark()
df = spark.read.format("mongodb").load()
df_grouped = df.groupBy("<some-column>").agg(mean("<some-other-column>"))
df_grouped.write.format("mongodb").mode("append").save()
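The database and collection can also be overridden per read instead of globally in SparkConf; a hedged sketch with placeholder names:
df = (spark.read.format("mongodb")
      .option("database", "<my-read-database>")
      .option("collection", "<my-read-collection>")
      .load())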
I was also facing the same error "java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource" while trying to connect to MongoDB from Spark (2.3).
I had to download and copy the mongo-spark-connector_2.11 JAR file(s) into the jars directory of the Spark installation.
That resolved my issue and I was able to successfully call my Spark code via spark-submit.
Hope it helps.
Here is how this error got resolved by downloading the jar files below. (Used the solution from this question.)
1. Downloaded the jar files below:
mongo-spark-connector_2.11-2.4.1
mongo-java-driver-3.9.0
2. Copied both of these jar files into the 'jars' directory of the Spark installation.
PySpark code in Jupyter notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mongo").\
config("spark.mongodb.input.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
config("spark.mongodb.output.uri","mongodb://127.0.0.1:27017/$database.$table_name").\
getOrCreate()
df=spark.read.format('com.mongodb.spark.sql.DefaultSource')\
.option( "uri", "mongodb://127.0.0.1:27017/$database.$table_name") \
.load()
df.printSchema()
#create Temp view of df to view the data
table = df.createOrReplaceTempView("df")
#to read table present in mongodb
query1 = spark.sql("SELECT * FROM df ")
query1.show(10)
You are not using sc to create the SparkSession. Maybe this code can help you:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.mongodb.input.uri', mongodb_input_uri)
conf.set('spark.mongodb.input.collection', 'collection_name')
conf.set('spark.mongodb.output.uri', mongodb_output_uri)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)  # Using the context (conf) to create the session