import boto3
import pandas as pd
import io

def lambda_handler(event, context):
    if event:
        s3_client = boto3.client('s3')
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = event['Records'][0]['s3']['object']['key']
        file_obj = s3_client.get_object(Bucket=bucket, Key=key)
        file_content = file_obj['Body'].read()
        b = io.BytesIO(file_content)
        df = pd.read_excel(b)
        print(df)
I am trying to upload Excel sheet data from S3 to Amazon RDS (Postgres). The code above is what I have to extract the data from S3. How can I upload the data from here to Postgres? Please help.
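One possible way to do the load, as a minimal sketch rather than a definitive answer: turn the DataFrame into plain tuples and run a parameterized INSERT through a DB-API connection. The helper name insert_rows, the table sheet_data and its columns are hypothetical. The demo uses the standard-library sqlite3 module as a stand-in so it runs without a database server; against RDS the connection would instead come from a Postgres driver such as psycopg2 (psycopg2.connect(...)), which uses the "%s" placeholder, and the rows would come from df.itertuples(index=False, name=None).

```python
import sqlite3  # stand-in for a Postgres driver so the sketch runs anywhere


def insert_rows(conn, table, columns, rows, placeholder="%s"):
    """Insert rows through any DB-API 2.0 connection with a parameterized INSERT.

    table and columns are interpolated into the SQL text, so they must come
    from trusted code, never from user input. Values are passed as parameters.
    """
    column_list = ", ".join(columns)
    value_list = ", ".join([placeholder] * len(columns))
    sql = f"INSERT INTO {table} ({column_list}) VALUES ({value_list})"
    cur = conn.cursor()
    cur.executemany(sql, rows)
    conn.commit()
    return sql


# Demo with an in-memory SQLite database (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sheet_data (name TEXT, qty INTEGER)")

# In the Lambda this list would be: list(df.itertuples(index=False, name=None))
rows = [("widget", 3), ("gadget", 5)]

# sqlite3 uses "?" placeholders; psycopg2 would use the default "%s".
sql = insert_rows(conn, "sheet_data", ["name", "qty"], rows, placeholder="?")
row_count = conn.execute("SELECT COUNT(*) FROM sheet_data").fetchone()[0]
print(sql)
print(row_count)
```

The target table has to exist beforehand, and the Lambda needs network access to the RDS instance plus the driver packaged with the deployment.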
I have a notebook in Azure Synapse that reads parquet files into a data frame using the synapsesql function and then pushes the data frame contents into a table in the SQL Pool.
Executing the notebook manually is successful and the table is created and populated in the Synapse SQL pool.
When I call the same notebook from an Azure Synapse pipeline, the run reports success but the table is not created. I am using the Synapse Notebook activity in the pipeline.
What could be the issue here?
I am getting deprecation warnings around the synapsesql function but don't know what is actually deprecated.
The code is below.
%%spark
val pEnvironment = "t"
val pFolderName = "TestFolder"
val pSourceDatabaseName = "TestDatabase"
val pSourceSchemaName = "TestSchema"
val pRootFolderName = "RootFolder"
val pServerName = pEnvironment + "synas01"
val pDatabaseName = pEnvironment + "syndsqlp01"
val pTableName = pSourceDatabaseName + "" + pSourceSchemaName + "" + pFolderName
// Import functions and Synapse connector
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SqlAnalyticsConnector._
// Get list of "FileLocation" from control.FileLoadStatus
val fls:DataFrame = spark.read.
synapsesql(s"${pDatabaseName}.control.FileLoadStatus").
select("FileLocation","ProcessedDate")
// Read all parquet files in folder into data frame
// Add file name as column
val df:DataFrame = spark.read.
parquet(s"/source/${pRootFolderName}/${pFolderName}/").
withColumn("FileLocation", input_file_name())
// Join parquet file data frame to FileLoadStatus data frame
// Exclude rows in parquet file data frame where ProcessedDate is not null
val df2 = df.
join(fls,Seq("FileLocation"), "left").
where(fls("ProcessedDate").isNull)
// Write data frame to sql table
df2.write.
option(Constants.SERVER,s"${pServerName}.sql.azuresynapse.net").
synapsesql(s"${pDatabaseName}.xtr.${pTableName}",Constants.INTERNAL)
This happens often. To see the notebook's output after a pipeline execution, follow these steps:
1. Pick up the Apache Spark application name from the output of the pipeline.
2. Navigate to Apache Spark applications under the Monitor tab and search for the same application name.
3. Four tabs are available there: Diagnostics, Logs, Input data, Output data.
4. Go to Logs and check 'stdout' for the required output.
https://www.youtube.com/watch?v=ydEXCVVGAiY
See the video linked above for a live walkthrough of the procedure.
I have a requirement where I need to transform data in Azure Databricks and then return the transformed data. Below is sample notebook code in which I am trying to return some JSON.
from pyspark.sql.functions import *
from pyspark.sql.types import *
import json
import pandas as pd
# Define a dictionary containing ICC rankings
rankings = {'test': ['India', 'South Africa', 'England',
'New Zealand', 'Australia'],
'odi': ['England', 'India', 'New Zealand',
'South Africa', 'Pakistan'],
't20': ['Pakistan', 'India', 'Australia',
'England', 'New Zealand']}
# Convert the dictionary into DataFrame
rankings_pd = pd.DataFrame(rankings)
# Before renaming the columns
rankings_pd.rename(columns = {'test':'TEST'}, inplace = True)
rankings_pd.rename(columns = {'odi':'ODI'}, inplace = True)
rankings_pd.rename(columns = {'t20':'twenty-20'}, inplace = True)
# After renaming the columns
#print(rankings_pd.to_json())
dbutils.notebook.exit(rankings_pd.to_json())
To achieve this, I created a job on a cluster for this notebook and then had to create a custom connector as well, following this article: https://medium.com/#poojaanilshinde/create-azure-logic-apps-custom-connector-for-azure-databricks-e51f4524ab27. Using the connector in an Azure Logic App to call the API endpoint '/2.1/jobs/run-now' and then '/2.1/jobs/runs/get-output', I am able to get the return value, but sometimes, even after the job has executed successfully, I only get the status as running with no output. I need to get the transformed output once the job has executed successfully.
Please suggest a better way to do this if I am missing anything.
It looks like dbutils.notebook.exit() only accepts a string. You can return the value as a JSON string and convert it to a JSON object in Data Factory or the Logic App. https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#--notebook-utility-dbutilsnotebook
I am writing an entire Pandas DataFrame as a bytestream into a flat file and into a Mongo database, for example:
import logging
from io import BytesIO
import pandas as pd
import numpy as np
from pymongo import MongoClient
from uuid import uuid4

logger = logging.getLogger(__name__)
logging.basicConfig(format='%(asctime)s %(levelname)s - %(message)s', level=logging.INFO)

MONGODB_SETTINGS = {"Local": {'host': 'host.docker.internal',
                              'port': 27017}}

if __name__ == '__main__':
    logger.info("Start")
    frame = pd.DataFrame(columns=list("ABCDE"), data=np.random.randn(300_000, 5))
    logger.info("Constructed Frame (object)")
    name = str(uuid4())
    bytestream = frame.to_parquet()
    post = {"frame": bytestream, "name": name}
    logger.info("Dictionary constructed")
    with open('data/tmp/output', 'wb') as file:
        file.write(bytestream)
    logger.info("Bytestream written to disk")
    for key, settings in MONGODB_SETTINGS.items():
        logger.info(50 * "-")
        logger.info(key)
        with MongoClient(**settings) as client:
            db = client.capture
            collection = db.frame
            logger.info("Start writing into Database")
            collection.insert_one(post)
            logger.info("Object written to Database")
            # read the frame back from database
            x = collection.find_one({"name": name})
            with BytesIO(x["frame"]) as buffer:
                frame_out = pd.read_parquet(buffer)
            logger.info("Object read from Database")
            pd.testing.assert_frame_equal(frame, frame_out)
        logger.info(50 * "-")
It takes about 0.1 s to write the file, whereas it takes 3 s to write into the Mongo database. The database is hosted on my local computer and runs within the standard Mongo image. Am I missing something? Is that loss of speed normal?
With code as inefficient as that, yes.
Has anyone been able to write to Kafka using this library using PySpark?
I've been able to successfully read using the code from the README documentation:
import logging, traceback
import requests
from pyspark.sql import Column
from pyspark.sql.column import *
jvm_gateway = spark_context._gateway.jvm
abris_avro = jvm_gateway.za.co.absa.abris.avro
naming_strategy = getattr(getattr(abris_avro.read.confluent.SchemaManager, "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
schema_registry_config_dict = {"schema.registry.url": schema_registry_url,
"schema.registry.topic": topic,
"value.schema.id": "latest",
"value.schema.naming.strategy": naming_strategy}
conf_map = getattr(getattr(jvm_gateway.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
for k, v in schema_registry_config_dict.items():
    conf_map = getattr(conf_map, "$plus")(jvm_gateway.scala.Tuple2(k, v))
deserialized_df = data_frame.select(Column(abris_avro.functions.from_confluent_avro(data_frame._jdf.col("value"), conf_map))
.alias("data")).select("data.*")
However, I am struggling to extend the behaviour by writing to topics via the to_confluent_avro function.
This is the result I get from my PySpark job in AWS Glue:
{a:1,b:7}
{a:1,b:9}
{a:1,b:3}
but I need to write this data to S3 and send it to an API in JSON array format:
[
{a:1,b:2},
{a:1,b:7},
{a:1,b:9},
{a:1,b:3}
]
I tried converting my output to a DataFrame and then applying toJSON():
results = mapped_dyF.toDF()
jsonResults = results.toJSON().collect()
but now I am unable to write the result back to S3 with 'write_dynamic_frame.from_options', as it requires a DF and my 'jsonResults' is no longer a DataFrame.
In order to put it in JSON array format I usually do the following:
# df is the DataFrame containing the original data.
if df.count() > 0:
    # Build the json file
    data = list()
    for row in df.collect():
        data.append({"a": row['a'],
                     "b": row['b']
                     })
I haven't used the Glue write_dynamic_frame.from_options in this case; instead I use boto3 to save the file:
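import boto3
import json
import uuid
s3 = boto3.resource('s3')
# Dump the json file to s3 bucket
# bucket_name is not defined in this snippet; set it to the target bucket.
# The original format string used {0} and {1} with a single argument, which
# raises IndexError, so only the UUID is used here.
filename = '/{0}_batch.json'.format(str(uuid.uuid4()))
obj = s3.Object(bucket_name, filename)
obj.put(Body=json.dumps(data))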
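As an alternative to listing the columns by hand, the per-row strings that toJSON().collect() returns in the question can be combined into a single JSON array. This is a minimal sketch, assuming each element is one valid JSON object string; the helper name json_strings_to_array is hypothetical, and the sample strings stand in for jsonResults. Like the loop above, it pulls everything onto the driver, so it only suits results that fit in memory.

```python
import json


def json_strings_to_array(json_strings):
    """Combine per-row JSON object strings into one JSON array string."""
    return json.dumps([json.loads(s) for s in json_strings])


# Stand-in for: jsonResults = results.toJSON().collect()
json_results = ['{"a": 1, "b": 7}', '{"a": 1, "b": 9}', '{"a": 1, "b": 3}']

body = json_strings_to_array(json_results)
print(body)
```

The resulting string can be passed as Body to obj.put(...) in place of json.dumps(data), or sent as the request body to the API.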