How to enable pySpark in Glue ETL? - pyspark

I have a very simple Glue ETL Job with the following code:
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()
conf = sc.getConf()
print(conf.toDebugString())
The Job is created with a Redshift connection enabled. When executing the Job I get:
No module named pyspark.context
The public documentations all seem to mention, point, and imply the availability of pyspark, but why is my environment complaining that it doesn't have pyspark? What steps am I missing?
Best Regards,
Lim

Python Shell jobs only support Python and libraries like pandas, Scikit-learn, etc. They don't have support for PySpark, so you should create one with job type = Spark and ETL language = Python in order to make it work.

I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

Related

NoSuchMethodError in google dataproc cluster for excel files

While consuming Excel file in dataproc cluster, getting errorjava.lang.NoSuchMethodError.
Note: schema is getting printed but not the actual data.
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling
o74.showString. : java.lang.NoSuchMethodError:
scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at
com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:74)
Code:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from google.cloud import storage
from google.cloud import bigquery
import pyspark
client = storage.Client()
bucket_name = "test_bucket"
path=f"gs://{bucket_name}/test_file.xlsx"
def make_spark_session(app_name, jars=[]):
configuration = (SparkConf()
.set("spark.jars", ','.join(jars)))
spark = SparkSession.builder.appName(app_name) \
.config(conf=configuration).getOrCreate()
return spark
app_name = 'test_app'
jars = ['gs://bucket/spark-excel_2.11_uber-0.12.0.jar']
spark = make_spark_session(app_name,jars)
df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader","true") \
.load(path)
df.show(1)
This appears to be Scala version mismatch between your job jars and the cluster. Both Dataproc 1.5 and 2.0 come with Scala 2.12. The gs://bucket/spark-excel_2.11_uber-0.12.0.jar in your code seems to be Scala 2.11 based, you might want to use spark-excel_2.12_... instead. In addition to that, make sure your Spark application is also built with Scala 2.12.

AWS Glue job run failed - no log4j-web module available

I wrote the script below to run a Glue job:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *
from awsglue.dynamicframe import DynamicFrame
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
source_data = glueContext.create_dynamic_frame.from_catalog(database = "source_db", table_name = "source_table")
source_data.toDF().createOrReplaceTempView("data")
query = "SELECT id, date_created FROM data"
data_df = spark.sql(query)
data_dynamicframe = DynamicFrame.fromDF(data_df.repartition(1), glueContext, "data_dynamicframe")
target_data = glueContext.write_dynamic_frame.from_catalog(frame = data_dynamicframe, database = "target", table_name = "target_table", transformation_ctx = "target_data")
job.commit()
And I got this message in the Log
Thread-4 INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support, please add the log4j-web JAR to your web archive or server lib directory.
Has anyone incurred in the same situation? Is there something wrong with the script?
Thanks!
Turns out there was a typo!
The script works fine, I still get the following message
Executor task launch worker for task 0 INFO Log4j appears to be running
in a Servlet environment, but there's no log4j-web module available. \ If
you want better web container support, please add the log4j-web JAR to
your web archive or server lib directory.
2021-11-19 16:16:27,020 Executor task launch worker for task 0 INFO Log4j
appears to be running in a Servlet environment, but there's no log4j-web
module available. If you want better web container support, please add
the log4j-web JAR to your web archive or server lib directory.
and I guess it will be worth investigating in the future.

Aws Glue Etl - no module named dynamicframe

I have a problem trying to execute aws example fro Aws Glue Etl - locally
after read all those steps:
https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-notebook.html
and create my endpoints into aws glue. When i try to execute this code:
%pyspark
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# sc = SparkContext()
#glueContext = GlueContext(sc)
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
persons = glueContext.create_dynamic_frame.from_catalog(
database="sampledb",
table_name="avro_avro_files"
)
print(persons.count())
persons.printSchema()
I have this error:
File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/__init__.py", line 13, in <module>
from dynamicframe import DynamicFrame
ImportError: No module named 'dynamicframe'
And i don't know how solve this problem
i'm have zeppeling0.7.3 config locally.
the idea with the code showed before is , get this result:
2019-04-01 11:37:22 INFO avro-test-bo: Test log message
Count: 5
root
|-- name: string
|-- favorite_number: int
|-- favorite_color: string
Hello finally i get the answer here
the problem is when i create my endpoint , i create it just on a private network.
After create a new endpoint with public network. this error was solved.
Thanks for the help for everybody
Regards
do you mean to say the code was working earlier, and have stopped working? sorry couldnt interpret it correctly.
With reference to local development using Zeppelin, can you please confirm if the configuration is correct, and have enabled ssh tunneling, etc? You may need to do some config. changes in the Zeppelin->Spark interpreters, etc.
Please make sure you are connected to AWS Glue DEP using SSH tunneling. Here are some references that may help you. Looks like your zeppelin is unable to get a GlueContext (I dont see a glueconext object being created?)
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
Please refer to this linke, setting up zeppelin on windows, for any help on configuring local zeppelin environment.

How to fix 'Exception: Java gateway process exited before sending its port number' in Eclipse IDE

I am trying to connect MySQL using pyspark in pydev environment of Eclipse IDE.
Getting below error:
Exception: Java gateway process exited before sending its port number
I have checked Java is properly installed and also set PYSPARK_SUBMIT_ARGS to value --master local[*] --jars path\mysql-connector-java-5.1.44-bin.jar pyspark-shell in windows-> preferences->Pydev->Python Interpreter->Environment.
Java Path is also set. Tried setting it via code also but no luck.
#import os
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql.context import SQLContext
#os.environ['JAVA_HOME']= 'C:/Program Files/Java/jdk1.8.0_141/'
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars D:/Softwares/mysql-connector-java-5.1.44.tar/mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar pyspark-shell'
conf = SparkConf().setMaster('local').setAppName('MySQLdataread')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "XXXXX").option("user", "root").option("password", "XXXX").load()
dataframe_mysql.show()
My problem was slightly different, I am running spark in spyder with windows.
when I am using
from pyspark.sql import SQLContext, SparkSession
I had the issue and followed google search links and not able to solve the problem.
Then I changed the import to:
from pyspark.sql import SparkSession
from pyspark import SQLContext
and the error message disappeared.
I am running on Windows, anaconda3, python3.7, spyder Hope it is helpful to someone.
Edit:
Later, i found the real problem is from the following. When any of the configuration was not working properly, the same exception shows up. Previously, I used 28gb and 4gb instead of 28g and 4g and that cause all the problems I had.
from pyspark.sql import SparkSession
from pyspark import SQLContext
spark = SparkSession.builder \
.master('local') \
.appName('muthootSample1') \
.config('spark.executor.memory', '28g') \
.config('spark.driver.memory','4g')\
.config("spark.cores.max", "6") \
.getOrCreate()

How to load a csv/txt file into AWS Glue job

I have below 2 clarifications on AWS Glue, could you please clarify. Because I need to use glue as part of my project.
I would like to load a csv/txt file into a Glue job to process it. (Like we do in Spark with dataframes). Is this possible in Glue? Or do we have to use only Crawlers to crawl the data into Glue tables and make use of them like below for further processing?
empdf = glueContext.create_dynamic_frame.from_catalog(
database="emp",
table_name="emp_json")
Below I used Spark code to load a file into Glue, but I'm getting lengthy error logs. Can we directly run Spark or PySpark code as it is without any changes in Glue?
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
dfnew = spark.read.option("header","true").option("delimiter", ",").csv("C:\inputs\TEST.txt")
dfnew.show(2)
It's possible to load data directly from s3 using Glue:
sourceDyf = glueContext.create_dynamic_frame_from_options(
connection_type="s3",
format="csv",
connection_options={
"paths": ["s3://bucket/folder"]
},
format_options={
"withHeader": True,
"separator": ","
})
You can also do that just with spark (as you already tried):
sourceDf = spark.read
.option("header","true")
.option("delimiter", ",")
.csv("C:\inputs\TEST.txt")
However, in this case Glue doesn't guarantee that they provide appropriate Spark readers. So if your error is related to missing data source for CSV then you should add spark-csv lib to the Glue job by providing s3 path to its locations via --extra-jars parameter.
Below 2 cases i tested working fine:
To load a file from S3 into Glue.
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://MyBucket/path/"] }, format="csv" )
dfnew.show(2)
To load data from Glue db and tables which are generated already through Glue Crawlers.
DynFr = glueContext.create_dynamic_frame.from_catalog(database="test_db", table_name="test_table")
DynFr is a DynamicFrame, so if we want to work with Spark code in Glue, then we need to convert it into a normal data frame like below.
df1 = DynFr.toDF()