AWS Glue job run failed - no log4j-web module available - pyspark

I wrote the script below to run a Glue job:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *
from awsglue.dynamicframe import DynamicFrame
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
source_data = glueContext.create_dynamic_frame.from_catalog(database = "source_db", table_name = "source_table")
source_data.toDF().createOrReplaceTempView("data")
query = "SELECT id, date_created FROM data"
data_df = spark.sql(query)
data_dynamicframe = DynamicFrame.fromDF(data_df.repartition(1), glueContext, "data_dynamicframe")
target_data = glueContext.write_dynamic_frame.from_catalog(frame = data_dynamicframe, database = "target", table_name = "target_table", transformation_ctx = "target_data")
job.commit()
And I got this message in the log:
Thread-4 INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support, please add the log4j-web JAR to your web archive or server lib directory.
Has anyone run into the same situation? Is there something wrong with the script?
Thanks!

Turns out there was a typo!
The script works fine, although I still get the following message:
Executor task launch worker for task 0 INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support, please add the log4j-web JAR to your web archive or server lib directory.
2021-11-19 16:16:27,020 Executor task launch worker for task 0 INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support, please add the log4j-web JAR to your web archive or server lib directory.
and I guess it will be worth investigating in the future.
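If the message is only noise, one thing worth trying (a sketch, not an official Glue fix; sc.setLogLevel controls the Spark root logger and may not silence messages emitted during Log4j's own initialization) is to raise the log level right after the SparkContext is created:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
# Hide INFO-level chatter from the job run; may not catch the log4j-web notice itself.
sc.setLogLevel("WARN")
glueContext = GlueContext(sc)
spark = glueContext.spark_session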

Related

NoSuchMethodError in google dataproc cluster for excel files

While consuming an Excel file in a Dataproc cluster, I am getting java.lang.NoSuchMethodError.
Note: the schema is getting printed, but not the actual data.
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling
o74.showString. : java.lang.NoSuchMethodError:
scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at
com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:74)
Code:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from google.cloud import storage
from google.cloud import bigquery
import pyspark

client = storage.Client()
bucket_name = "test_bucket"
path = f"gs://{bucket_name}/test_file.xlsx"

def make_spark_session(app_name, jars=[]):
    configuration = (SparkConf()
                     .set("spark.jars", ','.join(jars)))
    spark = SparkSession.builder.appName(app_name) \
        .config(conf=configuration).getOrCreate()
    return spark

app_name = 'test_app'
jars = ['gs://bucket/spark-excel_2.11_uber-0.12.0.jar']
spark = make_spark_session(app_name, jars)

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .load(path)
df.show(1)
This appears to be a Scala version mismatch between your job jars and the cluster. Both Dataproc 1.5 and 2.0 come with Scala 2.12. The gs://bucket/spark-excel_2.11_uber-0.12.0.jar in your code seems to be built for Scala 2.11; you might want to use spark-excel_2.12_... instead. In addition, make sure your Spark application is also built with Scala 2.12.
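As a sketch of what that swap might look like (the spark-excel coordinates and version below are illustrative, not a verified release, so check Maven Central for the build that matches your Spark version), you could let Spark resolve a Scala 2.12 artifact via spark.jars.packages instead of pointing at a 2.11 uber jar:
from pyspark.sql import SparkSession

# Illustrative coordinates: a Scala 2.12 build of spark-excel.
# Verify the exact version on Maven Central before using it.
spark = (SparkSession.builder
         .appName("test_app")
         .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.7")
         .getOrCreate())

df = (spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      .load("gs://test_bucket/test_file.xlsx"))
df.show(1)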

Close spark context started via Celery task in a django app

I am using PySpark along with Celery in a Django app. The flow of my code is as follows:
1. A POST request is made to upload a (large) file.
2. Django handles the request and loads the file into HDFS. This large file in HDFS is read by PySpark to load it into Cassandra.
3. The upload is handled by Celery (from reading the file through the Cassandra upload). Celery starts the process in the background and starts a Spark context to do the upload.
4. The data gets loaded into Cassandra, but the Spark context that was created via Celery does not stop, even after calling spark.stop() when the load is complete.
project -> celery.py
import os
from celery import Celery
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project.settings')
app = Celery('project')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
tasks.py
import celery
from django.conf import settings
from project.celery import app
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession

class uploadfile():
    def __init__(self):
        self.cluster = Cluster(getattr(settings, "CASSANDRA_IP", ""))
        self.session = self.cluster.connect()

    def start_spark(self):
        self.spark = SparkSession.builder.master(getattr(settings, 'SPARK_MASTER', settings.SPARK_MASTER)) \
            .appName('Load CSV to Cassandra') \
            .config('spark.jars', self.jar_files_path) \
            .config('spark.cassandra.connection.host', getattr(settings, 'SPARK_CASSANDRA_CONNECTION_HOST', '0.0.0.0')) \
            .getOrCreate()

    def spark_stop(self):
        self.spark.stop()

    def file_upload(self):
        self.start_spark()
        df = self.spark.read.csv(file_from_hdfs)
        # do some operation on the dataframe
        # self.session.create_cassandra_table_if_does_not_exist
        df.write.format('org.apache.spark.sql.cassandra') \
            .option('table', table_name) \
            .option('keyspace', keyspace) \
            .mode('append').save()
        self.spark_stop()  # <<<-------------------- This does not close the spark context

@app.task(name="api.tasks.uploadfile")
def csv_upload():
    # handle request.FILE and upload the file to hdfs
    spark_obj = uploadfile()
    spark_obj.file_upload()
calling_task_script.py
from rest_framework.views import APIView
from rest_framework.response import Response
from tasks import csv_upload

class post_it(APIView):
    def post(self, request):
        csv_upload.delay()
        return Response('success')
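For reference, here is a minimal, self-contained sketch of the task described above with spark.stop() moved into a try/finally block, so the session is stopped even when the write fails. The shared_task decorator and the argument names are assumptions for illustration, and this is not a confirmed fix for the context that keeps running under Celery:
from celery import shared_task
from pyspark.sql import SparkSession

@shared_task(name="api.tasks.uploadfile")
def csv_upload(hdfs_path, table_name, keyspace):
    # Build the session inside the task so each worker process owns its own driver.
    spark = (SparkSession.builder
             .appName("Load CSV to Cassandra")
             .getOrCreate())
    try:
        df = spark.read.csv(hdfs_path, header=True)
        (df.write.format("org.apache.spark.sql.cassandra")
            .option("table", table_name)
            .option("keyspace", keyspace)
            .mode("append")
            .save())
    finally:
        # Always stop the session; whether this also tears down the JVM inside a
        # Celery prefork worker is exactly the open question in this post.
        spark.stop()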

How to enable pySpark in Glue ETL?

I have a very simple Glue ETL Job with the following code:
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()
conf = sc.getConf()
print(conf.toDebugString())
The Job is created with a Redshift connection enabled. When executing the Job I get:
No module named pyspark.context
The public documentation all seems to mention, point to, and imply the availability of pyspark, so why is my environment complaining that it doesn't have pyspark? What steps am I missing?
Best Regards,
Lim
Python Shell jobs only support Python and libraries like pandas, scikit-learn, etc. They don't have support for PySpark, so you should create a job with job type = Spark and ETL language = Python in order to make it work.
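If you create jobs programmatically, the same distinction shows up in the Command.Name field: "glueetl" is the Spark ETL job type, while "pythonshell" is the Python Shell type. A minimal boto3 sketch, where the job name, IAM role, script location, and Glue version are placeholders to replace with your own values:
import boto3

glue = boto3.client("glue")

# "glueetl" = Spark ETL job (PySpark available); "pythonshell" = Python Shell job (no PySpark).
glue.create_job(
    Name="my-spark-etl-job",                         # placeholder
    Role="arn:aws:iam::123456789012:role/GlueRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",  # placeholder script path
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
)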
I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

Aws Glue Etl - no module named dynamicframe

I have a problem trying to execute the AWS example for AWS Glue ETL locally.
After reading all of these steps:
https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-notebook.html
and creating my endpoints in AWS Glue, when I try to execute this code:
%pyspark
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# sc = SparkContext()
#glueContext = GlueContext(sc)
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
persons = glueContext.create_dynamic_frame.from_catalog(
    database="sampledb",
    table_name="avro_avro_files"
)
print(persons.count())
persons.printSchema()
I have this error:
File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/__init__.py", line 13, in <module>
from dynamicframe import DynamicFrame
ImportError: No module named 'dynamicframe'
And I don't know how to solve this problem.
I have Zeppelin 0.7.3 configured locally.
The idea with the code shown above is to get this result:
2019-04-01 11:37:22 INFO avro-test-bo: Test log message
Count: 5
root
|-- name: string
|-- favorite_number: int
|-- favorite_color: string
Hello, I finally got the answer here.
The problem was that when I created my endpoint, I created it on a private network only.
After creating a new endpoint with a public network, the error was solved.
Thanks to everybody for the help.
Regards
Do you mean to say the code was working earlier and has stopped working? Sorry, I couldn't interpret it correctly.
With reference to local development using Zeppelin, can you please confirm the configuration is correct and that you have enabled SSH tunneling, etc.? You may need to make some configuration changes in the Zeppelin -> Spark interpreter settings.
Please make sure you are connected to the AWS Glue development endpoint using SSH tunneling. Here are some references that may help you. It looks like your Zeppelin is unable to get a GlueContext (I don't see a GlueContext object being created?):
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
Please refer to this link, setting up zeppelin on windows, for any help on configuring a local Zeppelin environment.

How to load a csv/txt file into AWS Glue job

I have the below 2 clarifications on AWS Glue; could you please clarify? I need to use Glue as part of my project.
I would like to load a csv/txt file into a Glue job to process it (like we do in Spark with dataframes). Is this possible in Glue? Or do we have to use only Crawlers to crawl the data into Glue tables and then make use of them like below for further processing?
empdf = glueContext.create_dynamic_frame.from_catalog(
    database="emp",
    table_name="emp_json")
Below, I used Spark code to load a file into Glue, but I'm getting lengthy error logs. Can we run Spark or PySpark code as-is, without any changes, in Glue?
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

dfnew = spark.read.option("header", "true").option("delimiter", ",").csv("C:\inputs\TEST.txt")
dfnew.show(2)
It's possible to load data directly from s3 using Glue:
sourceDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://bucket/folder"]
    },
    format_options={
        "withHeader": True,
        "separator": ","
    })
You can also do that just with Spark (as you already tried):
sourceDf = spark.read \
    .option("header", "true") \
    .option("delimiter", ",") \
    .csv("C:\inputs\TEST.txt")
However, in this case Glue doesn't guarantee that it provides the appropriate Spark readers. So if your error is related to a missing data source for CSV, you should add the spark-csv lib to the Glue job by providing the s3 path to its location via the --extra-jars parameter.
Below are 2 cases I tested that work fine.
To load a file from S3 into Glue:
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://MyBucket/path/"] }, format="csv" )
dfnew.show(2)
To load data from a Glue db and tables which have already been generated through Glue Crawlers:
DynFr = glueContext.create_dynamic_frame.from_catalog(database="test_db", table_name="test_table")
DynFr is a DynamicFrame, so if we want to work with Spark code in Glue, we need to convert it into a normal data frame like below.
df1 = DynFr.toDF()
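And if you need the opposite conversion after your Spark transformations (for example, to write back through the Glue catalog), DynamicFrame.fromDF does the reverse, as the first question above already uses. A small sketch, where the target database and table names are placeholders:
from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back into a DynamicFrame so the Glue writers can use it.
dyf_out = DynamicFrame.fromDF(df1, glueContext, "dyf_out")

# Placeholder catalog target; replace with your own database/table.
glueContext.write_dynamic_frame.from_catalog(
    frame=dyf_out,
    database="target_db",
    table_name="target_table")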