formatting AWS glue output to JSON OBJECT - pyspark

This is the result I get from my pyspark job in AWS GLUE
{a:1,b:7}
{a:1,b:9}
{a:1,b:3}
but I need to write this data on s3 and send it to an API in JSON array
format
[
{a:1,b:2},
{a:1,b:7},
{a:1,b:9},
{a:1,b:3}
]
I tried converting my output to DataFrame and then applied
toJSON()
results = mapped_dyF.toDF()
jsonResults = results.toJSON().collect()
but now unable to write back the result on s3 with 'write_dynamic_frame.from_options'
as it requires a DF but my'jsonResults' is no longer a DataFrame now.

In order to put it in JSON array format I usually do the following:
df --> DataFrame containing the original data.
if df.count() > 0:
# Build the json file
data = list()
for row in df.collect():
data.append({"a": row['a'],
"b" : row['b']
})
I haven't use the Glue write_dynamic_frame.from_options in this case but I use boto3 to save the file:
import boto3
import json
s3 = boto3.resource('s3')
# Dump the json file to s3 bucket
filename = '/{0}_batch_{1}.json'.format(str(uuid.uuid4()))
obj = s3.Object(bucket_name, filename)
obj.put(Body=json.dumps(data))

Related

Synapse - Notebook not working from Pipeline

I have a notebook in Azure Synapse that reads parquet files into a data frame using the synapsesql function and then pushes the data frame contents into a table in the SQL Pool.
Executing the notebook manually is successful and the table is created and populated in the Synapse SQL pool.
When I try to call the same notebook from an Azure Synapse pipeline it returns successful however does not create the table. I am using the Synapse Notebook activity in the pipeline.
What could be the issue here?
I am getting deprecated warnings around the synapsesql function but don't know what is actually deprecated.
The code is below.
%%spark
val pEnvironment = "t"
val pFolderName = "TestFolder"
val pSourceDatabaseName = "TestDatabase"
val pSourceSchemaName = "TestSchema"
val pRootFolderName = "RootFolder"
val pServerName = pEnvironment + "synas01"
val pDatabaseName = pEnvironment + "syndsqlp01"
val pTableName = pSourceDatabaseName + "" + pSourceSchemaName + "" + pFolderName
// Import functions and Synapse connector
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.functions.
import org.apache.spark.sql.SqlAnalyticsConnector.
// Get list of "FileLocation" from control.FileLoadStatus
val fls:DataFrame = spark.read.
synapsesql(s"${pDatabaseName}.control.FileLoadStatus").
select("FileLocation","ProcessedDate")
// Read all parquet files in folder into data frame
// Add file name as column
val df:DataFrame = spark.read.
parquet(s"/source/${pRootFolderName}/${pFolderName}/").
withColumn("FileLocation", input_file_name())
// Join parquet file data frame to FileLoadStatus data frame
// Exclude rows in parquet file data frame where ProcessedDate is not null
val df2 = df.
join(fls,Seq("FileLocation"), "left").
where(fls("ProcessedDate").isNull)
// Write data frame to sql table
df2.write.
option(Constants.SERVER,s"${pServerName}.sql.azuresynapse.net").
synapsesql(s"${pDatabaseName}.xtr.${pTableName}",Constants.INTERNAL)
This case happens often and to get the output after pipeline execution. Follow the steps mentioned.
Pick up the Apache Spark application name from the output of pipeline
Navigate to Apache Spark Application under Monitor tab and search for the same application name .
These 4 tabs would be available there: Diagnostics,Logs,Input data,Output data
Go to Logs ad check 'stdout' for getting the required output.
https://www.youtube.com/watch?v=ydEXCVVGAiY
Check the above video link for detailed live procedure.

How to return data from azure databricks notebook in Azure Data Factory

I have a requirement where I need to transform data in azure databricks and then return the transformed data. Below is notebook sample code where I am trying to return some json.
from pyspark.sql.functions import *
from pyspark.sql.types import *
import json
import pandas as pd
# Define a dictionary containing ICC rankings
rankings = {'test': ['India', 'South Africa', 'England',
'New Zealand', 'Australia'],
'odi': ['England', 'India', 'New Zealand',
'South Africa', 'Pakistan'],
't20': ['Pakistan', 'India', 'Australia',
'England', 'New Zealand']}
# Convert the dictionary into DataFrame
rankings_pd = pd.DataFrame(rankings)
# Before renaming the columns
rankings_pd.rename(columns = {'test':'TEST'}, inplace = True)
rankings_pd.rename(columns = {'odi':'ODI'}, inplace = True)
rankings_pd.rename(columns = {'t20':'twenty-20'}, inplace = True)
# After renaming the columns
#print(rankings_pd.to_json())
dbutils.notebook.exit(rankings_pd.to_json())
In order to achieve the same, I created a job under a cluster for this notebook and then I had to create a custom connector too following this article https://medium.com/#poojaanilshinde/create-azure-logic-apps-custom-connector-for-azure-databricks-e51f4524ab27. Using the connectors with API endpoint '/2.1/jobs/run-now' and then '/2.1/jobs/runs/get-output' in Azure Logic App, I am able to get the return value but after the job is executed successfully, sometimes I just get the status as running with no output. I need to get the output when job is executed successfully with transformation.
Please suggest a way better way for this if I am missing anything.
looks like dbutils.notebooks.exit() only accpet "string", you can return the value as json string and convert to json object in DataFactory or Logic App. https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#--notebook-utility-dbutilsnotebook

Save custom transformers in pyspark

When I implement this part of this python code in Azure Databricks:
class clustomTransformations(Transformer):
<code>
custom_transformer = customTransformations()
....
pipeline = Pipeline(stages=[custom_transformer, assembler, scaler, rf])
pipeline_model = pipeline.fit(sample_data)
pipeline_model.save(<your path>)
When I attempt to save the pipeline, I get this:
AttributeError: 'customTransformations' object has no attribute '_to_java'
Any work arounds?
It seems like there is no easy workaround but to try and implement the _to_java method, as is suggested here for StopWordsRemover:
Serialize a custom transformer using python to be used within a Pyspark ML pipeline
def _to_java(self):
"""
Convert this instance to a dill dump, then to a list of strings with the unicode integer values of each character.
Use this list as a set of dumby stopwords and store in a StopWordsRemover instance
:return: Java object equivalent to this instance.
"""
dmp = dill.dumps(self)
pylist = [str(ord(d)) for d in dmp] # convert byes to string integer list
pylist.append(PysparkObjId._getPyObjId()) # add our id so PysparkPipelineWrapper can id us.
sc = SparkContext._active_spark_context
java_class = sc._gateway.jvm.java.lang.String
java_array = sc._gateway.new_array(java_class, len(pylist))
for i in xrange(len(pylist)):
java_array[i] = pylist[i]
_java_obj = JavaParams._new_java_obj(PysparkObjId._getCarrierClass(javaName=True), self.uid)
_java_obj.setStopWords(java_array)
return _java_obj

AWS Glue PySpark replace NULLs

I am running an AWS Glue job to load a pipe delimited file on S3 into an RDS Postgres instance, using the auto-generated PySpark script from Glue.
Initially, it complained about NULL values in some columns:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
After some googling and reading on SO, I tried to replace the NULLs in my file by converting my AWS Glue Dynamic Dataframe to a Spark Dataframe, executing the function fillna() and reconverting back to a Dynamic Dataframe.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"xyz_catalog", table_name = "xyz_staging_files", transformation_ctx =
"datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()
applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id",
"string", "id", "int"),........more code
References:
https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out
How to replace all Null values of a dataframe in Pyspark
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna
Now, when I run my job, it throws the following error:
Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout
I am new to Python and Spark and have tried a lot, but can't make sense of this. Appreciate some expert help on this.
I tried changing my reconvert command to this:
custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)
But still got the error:
AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'
UPDATE:
I suspect this is not about NULL values. The message "Can't get JDBC type for null" seems not to refer to a NULL value, but some data/type that JDBC is unable to decipher.
I created a file with only 1 record, no NULL values, changed all Boolean types to INT (and replaced values with 0 and 1), but still get the same error:
pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
UPDATE:
Make sure DynamicFrame is imported (from awsglue.context import DynamicFrame), since fromDF / toDF are part of DynamicFrame.
Refer to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
You are calling .fromDF on the wrong class. It should look like this:
from awsglue.dynamicframe import DynamicFrame
DyamicFrame.fromDF(custom_df2, glueContext, 'label')
For this error, pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
you should use the drop Null columns.
I was getting similar errors while loading to Redshift DB Tables. After using the below command, the issue got resolved
Loading= DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "Loading")
In Pandas, and for Pandas DataFrame, pd.fillna() is used to fill null values with other specified values. However, DropNullFields drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null values in every record in the DynamicFrame data set.
In your specific situation, you need to make sure you are using the write class for the appropriate dataset.
Here is the edited version of your code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"xyz_catalog", table_name = "xyz_staging_files", transformation_ctx =
"datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = DyamicFrame.fromDF(custom_df2, glueContext, 'your_label')
applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id",
"string", "id", "int"),........more code
This is what you are doing: 1. Read the file in DynamicFrame, 2. Convert it to DataFrame, 3. Drop null values, 4. Convert back to DynamicFrame, and 5. ApplyMapping. You were getting the following error because your step 4 was wrong and you were were feeding a DataFrame to ApplyMapping which does not work. ApplyMapping is designed for DynamicFrames.
I would suggest read your data in DynamicFrame and stick to the same data type. It would look like this (one way to do it):
from awsglue.dynamicframe import DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"xyz_catalog", table_name = "xyz_staging_files", transformation_ctx =
"datasource0")
custom_df = DropNullFields.apply(frame=datasource0)
applymapping1 = ApplyMapping.apply(frame = custom_df, mappings = [("id",
"string", "id", "int"),........more code

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell for example, I saw that it was very easy to read the files, for example:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take for ever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use this
First You can get a Buffer/List of S3 Paths :
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket:String, base_prefix : String) = {
var files = new ArrayList[String]
//S3 Client and List Object Request
var s3Client = new AmazonS3Client();
var objectListing: ObjectListing = null;
var listObjectsRequest = new ListObjectsRequest();
//Your S3 Bucket
listObjectsRequest.setBucketName(s3_bucket)
//Your Folder path or Prefix
listObjectsRequest.setPrefix(base_prefix)
//Adding s3:// to the paths and adding to a list
do {
objectListing = s3Client.listObjects(listObjectsRequest);
for (objectSummary <- objectListing.getObjectSummaries().asScala) {
files.add("s3://" + s3_bucket + "/" + objectSummary.getKey());
}
listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
//Removing Base Directory Name
files.remove(0)
//Creating a Scala List for same
files.asScala
}
Now Pass this List object to the following piece of code, note : sc is an object of SQLContext
var df: DataFrame = null;
for (file <- files) {
val fileDf= sc.textFile(file)
if (df!= null) {
df= df.unionAll(fileDf)
} else {
df= fileDf
}
}
Now you got a final Unified RDD i.e. df
Optional, And You can also repartition it in a single BigRDD
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the byte array of the whole file. You can then map this so the key is the file date, and then groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html
At your scale, elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") as it takes forever. What you can do is use AWS DistCp or something similar to move files into HDFS. Once its in HDFS, spark is quite fast in ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list so you'll need to generate that somehow. for 20 mil files, this creation of file list will be a bottle neck. I'd recommend creating a file that get appended with the file path, every-time a file gets uploaded to s3.
Same for output, put into hdfs and then move to s3 (although direct copy might be equally efficient).