I have a problem running the AWS Glue ETL example locally.
I followed all the steps at:
https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-notebook.html
and created my endpoints in AWS Glue. When I try to execute this code:
%pyspark
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# sc = SparkContext()
#glueContext = GlueContext(sc)
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
persons = glueContext.create_dynamic_frame.from_catalog(
    database="sampledb",
    table_name="avro_avro_files"
)
print(persons.count())
persons.printSchema()
I have this error:
File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/__init__.py", line 13, in <module>
from dynamicframe import DynamicFrame
ImportError: No module named 'dynamicframe'
I don't know how to solve this problem.
I have Zeppelin 0.7.3 configured locally.
The idea of the code shown above is to get this result:
2019-04-01 11:37:22 INFO avro-test-bo: Test log message
Count: 5
root
|-- name: string
|-- favorite_number: int
|-- favorite_color: string
Hello, I finally found the answer.
The problem was that when I created my endpoint, I created it on a private network only.
After I created a new endpoint on a public network, the error was solved.
Thanks to everybody for the help.
Regards
Do you mean that the code was working earlier and has now stopped working? Sorry, I couldn't interpret it correctly.
With reference to local development using Zeppelin, can you please confirm that the configuration is correct and that you have enabled SSH tunneling? You may need to make some configuration changes in the Zeppelin Spark interpreter settings.
Please make sure you are connected to the AWS Glue dev endpoint using SSH tunneling. Here are some references that may help you. It looks like your Zeppelin is unable to get a GlueContext (I don't see a GlueContext object being created?)
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
Please refer to this link, setting up Zeppelin on Windows, for help with configuring a local Zeppelin environment.
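One quick way to confirm that the SSH tunnel is actually up is to check whether the forwarded local port accepts connections. The following is only a minimal sketch and not part of any AWS tooling: the helper name port_is_open is made up for illustration, and localhost port 9007 is an assumption based on the port-forwarding command in the linked AWS tutorial, so adjust it to whatever your tunnel forwards.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: the tunnel forwards local port 9007 to the dev endpoint.
    print("tunnel reachable:", port_is_open("localhost", 9007))
```

If this prints False, the interpreter has nothing to connect to, and the tunnel or the endpoint's network settings are the first things to check.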
I am not able to import the lines below:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils
It throws the error below:
ModuleNotFoundError Traceback (most recent call last)
Input In [3], in <module>
1 from pyspark import SparkContext
2 from pyspark.streaming import StreamingContext
----> 3 from pyspark.streaming.mqtt import MQTTUtils
ModuleNotFoundError: No module named 'pyspark.streaming.mqtt'
I have downloaded the RabbitMQ jar files and saved them in the jars directory too, but I am still not able to import the module.
I would also like to make clear that I am completely new to Spark. My task is to pull data from AMQP topics using Spark Structured Streaming. How can I do it? Please suggest an approach if you have any idea. It would be helpful if someone could explain it with sample code. There is very little information on the internet for PySpark. Please share your thoughts.
I wrote the script below to run a Glue job:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *
from awsglue.dynamicframe import DynamicFrame
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
source_data = glueContext.create_dynamic_frame.from_catalog(database = "source_db", table_name = "source_table")
source_data.toDF().createOrReplaceTempView("data")
query = "SELECT id, date_created FROM data"
data_df = spark.sql(query)
data_dynamicframe = DynamicFrame.fromDF(data_df.repartition(1), glueContext, "data_dynamicframe")
target_data = glueContext.write_dynamic_frame.from_catalog(frame = data_dynamicframe, database = "target", table_name = "target_table", transformation_ctx = "target_data")
job.commit()
And I got this message in the log:
Thread-4 INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support, please add the log4j-web JAR to your web archive or server lib directory.
Has anyone run into the same situation? Is there something wrong with the script?
Thanks!
Turns out there was a typo!
The script works fine, but I still get the following message:
Executor task launch worker for task 0 INFO Log4j appears to be running
in a Servlet environment, but there's no log4j-web module available. If
you want better web container support, please add the log4j-web JAR to
your web archive or server lib directory.
2021-11-19 16:16:27,020 Executor task launch worker for task 0 INFO Log4j
appears to be running in a Servlet environment, but there's no log4j-web
module available. If you want better web container support, please add
the log4j-web JAR to your web archive or server lib directory.
I guess it will be worth investigating in the future.
I have a 1.3 GB zip file that contains a 6 GB comma-separated txt file. The zip file is on Azure Data Lake Storage, which is mounted on DBFS (the Databricks file system) using a service principal.
When I use plain Python code to extract the 6 GB file, the extracted file is only 1.98 GB.
Please suggest a way to read the txt file directly and store it as a Spark DataFrame.
I tried reading it directly with Python, but that gives the error: Error tokenizing data. C error: Expected 2 fields in line 371, saw 3
I fixed that by using the UTF-16-LE encoding, but then I got the error ConnectException: Connection refused (Connection refused) on Databricks while trying to display df.head().
import pandas as pd
import zipfile
zfolder = zipfile.ZipFile('dbfszipath')
zdf = pd.read_csv(zfolder.open('6GBtextfile.txt'),error_bad_lines=False,encoding='UTF-16-LE')
zdf.head()
Extraction code:
import pandas as pd
import zipfile
zfolder = zipfile.ZipFile('/dbfszippath')
zfolder.extract(dbfsexrtactpath)
The DataFrame should contain all the data when the file is read directly from the zip file. It should also be possible to display some of the data without hanging the Databricks cluster. I need options in Scala or PySpark.
The connection refused error comes from the memory settings that Databricks and Spark have. You will have to increase the memory allowance to avoid this error.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
conf=SparkConf()
conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "4g")
In this case, the allotted memory is 4GB so change it as needed.
Another solution would be the following:
import zipfile
import io
def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
zips = sc.binaryFiles("somerandom.zip")
files_data = zips.map(zip_extract)
Let me know if this works or what the error is in this case.
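To see what zip_extract returns before running it on a cluster, the function can be exercised locally without Spark. The sketch below builds a small archive in memory and passes it in the (path, bytes) shape that sc.binaryFiles yields. The member name sample.txt and its contents are made up for illustration, and the UTF-16-LE encoding is assumed only because the question mentions it.

```python
import io
import zipfile


def zip_extract(x):
    """Map one (path, bytes) pair to {member name: member bytes}."""
    in_memory_data = io.BytesIO(x[1])
    with zipfile.ZipFile(in_memory_data, "r") as file_obj:
        return {name: file_obj.read(name) for name in file_obj.namelist()}


def make_sample_zip():
    """Build a tiny zip archive in memory (illustrative data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("sample.txt", "id,name\n1,alice\n2,bob\n".encode("utf-16-le"))
    return buffer.getvalue()


# Same shape as one element of sc.binaryFiles(...): (path, content bytes)
record = ("somerandom.zip", make_sample_zip())
extracted = zip_extract(record)
text = extracted["sample.txt"].decode("utf-16-le")
print(text.splitlines())
```

Note that this approach reads every member fully into memory on one executor, so a 6 GB member may still need the larger memory settings discussed above.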
I have a very simple Glue ETL Job with the following code:
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()
conf = sc.getConf()
print(conf.toDebugString())
The Job is created with a Redshift connection enabled. When executing the Job I get:
No module named pyspark.context
The public documentation all seems to mention and imply that pyspark is available, so why is my environment complaining that it doesn't have pyspark? What steps am I missing?
Best Regards,
Lim
Python Shell jobs only support Python and libraries like pandas, Scikit-learn, etc. They don't have support for PySpark, so you should create one with job type = Spark and ETL language = Python in order to make it work.
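If you are unsure which kind of environment a script is running in, a guarded check can make the failure explicit instead of crashing on the import. This is only a sketch: module_available is a hypothetical helper, and it relies on the standard library's importlib.util.find_spec rather than on anything Glue-specific.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


if module_available("pyspark"):
    print("pyspark is available: this looks like a Spark job environment")
else:
    print("pyspark is missing: this is probably a Python Shell job")
```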
I use:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
I have two questions about AWS Glue that I would like clarified, because I need to use Glue as part of my project.
I would like to load a csv/txt file into a Glue job to process it (like we do in Spark with DataFrames). Is this possible in Glue? Or do we have to use Crawlers to crawl the data into Glue tables and then use those tables, as below, for further processing?
empdf = glueContext.create_dynamic_frame.from_catalog(
database="emp",
table_name="emp_json")
Below I used Spark code to load a file into Glue, but I'm getting lengthy error logs. Can we run Spark or PySpark code directly in Glue, as it is, without any changes?
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
dfnew = spark.read.option("header","true").option("delimiter", ",").csv("C:\inputs\TEST.txt")
dfnew.show(2)
It's possible to load data directly from S3 using Glue:
sourceDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://bucket/folder"]
    },
    format_options={
        "withHeader": True,
        "separator": ","
    })
You can also do that with just Spark (as you already tried):
# Parentheses let the chained calls span several lines in Python.
sourceDf = (spark.read
    .option("header", "true")
    .option("delimiter", ",")
    .csv(r"C:\inputs\TEST.txt"))
However, in this case Glue doesn't guarantee that the appropriate Spark readers are available. So if your error is related to a missing data source for CSV, you should add the spark-csv library to the Glue job by providing the S3 path to its location via the --extra-jars parameter.
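As a rough illustration of the --extra-jars parameter, the job's default arguments can be assembled as a plain dictionary. This is a sketch under assumptions: build_default_arguments is a hypothetical helper, and the S3 path is a placeholder rather than a real jar location. Multiple jars are joined with commas.

```python
def build_default_arguments(jar_paths):
    """Build Glue job default arguments with a comma-separated --extra-jars value."""
    return {"--extra-jars": ",".join(jar_paths)}


# Hypothetical S3 location of the jar; replace it with your own.
default_arguments = build_default_arguments(["s3://my-bucket/jars/spark-csv.jar"])
print(default_arguments)

# With boto3, a dict like this would be passed as the DefaultArguments
# parameter of the Glue client's create_job call.
```

The same key and value can also be entered by hand as a job parameter when editing the job in the console.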
I tested the two cases below and both work fine:
To load a file from S3 into Glue:
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://MyBucket/path/"] }, format="csv" )
dfnew.show(2)
To load data from a Glue database and tables that were already generated by Glue Crawlers:
DynFr = glueContext.create_dynamic_frame.from_catalog(database="test_db", table_name="test_table")
DynFr is a DynamicFrame, so if we want to work with Spark code in Glue, we need to convert it into a normal DataFrame like below:
df1 = DynFr.toDF()