How do you write to Kafka using the ABRiS library in PySpark? - pyspark

Has anyone been able to write to Kafka using this library using PySpark?
I've been able to successfully read using the code from the README documentation:
import logging, traceback
import requests
from pyspark.sql import Column
from pyspark.sql.column import *
jvm_gateway = spark_context._gateway.jvm
abris_avro = jvm_gateway.za.co.absa.abris.avro
naming_strategy = getattr(getattr(abris_avro.read.confluent.SchemaManager, "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
schema_registry_config_dict = {"schema.registry.url": schema_registry_url,
"schema.registry.topic": topic,
"value.schema.id": "latest",
"value.schema.naming.strategy": naming_strategy}
conf_map = getattr(getattr(jvm_gateway.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
for k, v in schema_registry_config_dict.items():
conf_map = getattr(conf_map, "$plus")(jvm_gateway.scala.Tuple2(k, v))
deserialized_df = data_frame.select(Column(abris_avro.functions.from_confluent_avro(data_frame._jdf.col("value"), conf_map))
.alias("data")).select("data.*")
However, I am struggling to extend the behaviour by writing to topics via the to_confluent_avro function.

Related

Synapse - Notebook not working from Pipeline

I have a notebook in Azure Synapse that reads parquet files into a data frame using the synapsesql function and then pushes the data frame contents into a table in the SQL Pool.
Executing the notebook manually is successful and the table is created and populated in the Synapse SQL pool.
When I try to call the same notebook from an Azure Synapse pipeline it returns successful however does not create the table. I am using the Synapse Notebook activity in the pipeline.
What could be the issue here?
I am getting deprecated warnings around the synapsesql function but don't know what is actually deprecated.
The code is below.
%%spark
val pEnvironment = "t"
val pFolderName = "TestFolder"
val pSourceDatabaseName = "TestDatabase"
val pSourceSchemaName = "TestSchema"
val pRootFolderName = "RootFolder"
val pServerName = pEnvironment + "synas01"
val pDatabaseName = pEnvironment + "syndsqlp01"
val pTableName = pSourceDatabaseName + "" + pSourceSchemaName + "" + pFolderName
// Import functions and Synapse connector
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.functions.
import org.apache.spark.sql.SqlAnalyticsConnector.
// Get list of "FileLocation" from control.FileLoadStatus
val fls:DataFrame = spark.read.
synapsesql(s"${pDatabaseName}.control.FileLoadStatus").
select("FileLocation","ProcessedDate")
// Read all parquet files in folder into data frame
// Add file name as column
val df:DataFrame = spark.read.
parquet(s"/source/${pRootFolderName}/${pFolderName}/").
withColumn("FileLocation", input_file_name())
// Join parquet file data frame to FileLoadStatus data frame
// Exclude rows in parquet file data frame where ProcessedDate is not null
val df2 = df.
join(fls,Seq("FileLocation"), "left").
where(fls("ProcessedDate").isNull)
// Write data frame to sql table
df2.write.
option(Constants.SERVER,s"${pServerName}.sql.azuresynapse.net").
synapsesql(s"${pDatabaseName}.xtr.${pTableName}",Constants.INTERNAL)
This case happens often and to get the output after pipeline execution. Follow the steps mentioned.
Pick up the Apache Spark application name from the output of pipeline
Navigate to Apache Spark Application under Monitor tab and search for the same application name .
These 4 tabs would be available there: Diagnostics,Logs,Input data,Output data
Go to Logs ad check 'stdout' for getting the required output.
https://www.youtube.com/watch?v=ydEXCVVGAiY
Check the above video link for detailed live procedure.

Saving data to Postgres from AWS Lambda using Python

import boto3
import pandas as pd
import io
def lambda_handler(event, context):
if event:
s3_client = boto3.client('s3')
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
file_obj = s3_client.get_object(Bucket=bucket,Key=key)
file_content = file_obj['Body'].read()
b = io.BytesIO(file_content)
df = pd.read_excel(b)
print(df)
I am trying to upload excel sheet data from s3 to amazon rds (Postgres). The above code is what I have to extract data from s3. How can I upload the data from here to postgres, Please Help.

Receiving error when executitng code in Databricks / Spark DataFrame' object does not support item assignment

When running the following command I get the error
I am running the code on Databricks Platform, but the code is written using Pandas
TypeError: 'DataFrame' object does not support item assignment
Can someone let me know if the error is related to spark / databricks platform not supporting the code?
import numpy as np
import pandas as pd
def matchSchema(df):
df['active'] = df['active'].astype('boolean')
df['price'] = df['counts']/100
df.drop('counts', axis=1, inplace=True)
return df,df.head(3)
(dataset, sample) = matchSchema(df)
print(dataset)
print(sample)
The error is:
TypeError: 'DataFrame' object does not support item assignment
bool is used instead of boolean as a dtype...
df['active'] = df['active'].astype('bool')

CSV Feeders for gatling 3

I am using Gatling 3. I have a csv feeder with just one column titled accountIds. I need to pass this in the body of my POST request as JSON. I have tried a lot of different syntax but nothing seems to be working. I can also not print what is actually being sent in the body. It works if I remove the "$accountIds" and use an actual value instead. Below is my code:
val searchFeeder = csv("C://data/accountids.csv").random
val scn1 = scenario("Scenario 1")
.feed(searchFeeder)
.exec(http("Search")
.post("/v3/accounts/")
.body(StringBody("""{"accountIds": "${accountIds}"}""")).asJson)
setUp(scn1.inject(atOnceUsers(10)).protocols(httpConf))
Have you enabled trace level in logback.xml to see the details of post request?
Also, can you confirm if location you have mentioned "C://data/accountids.csv" is the right one. Generally, data folder resides in project location and within project you can access the data file as:
val searchFeeder = csv("data/stack.csv").random
I just created sample script and enabled logging.I am able to see that accountId is getting passed:
package basicpackage
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.core.scenario.Simulation
class StackFeeder extends Simulation {
val httpConf=http.baseUrl("http://example.com")
val searchFeeder = csv("data/stack.csv").random
val scn1 = scenario("Scenario 1")
.feed(searchFeeder)
.exec(http("Search")
.post("/v3/accounts/")
.body(StringBody("""{"accountIds": "${accountIds}"}""")).asJson)
setUp(scn1.inject(atOnceUsers(1)).protocols(httpConf))
csv file location

Gcloud ai-platform, can't create model with own prediction-class

I try following AI Platform tutorial to upload a model and a prediction routine but one part fail and I don't understand why.
My prediction class is the same as in their tutorial:
%%writefile predictor.py
import os
import pickle
import numpy as np
from sklearn.datasets import load_iris
from sklearn.externals import joblib
class MyPredictor(object):
def __init__(self, model, preprocessor):
self._model = model
self._preprocessor = preprocessor
self._class_names = load_iris().target_names
def predict(self, instances, **kwargs):
inputs = np.asarray(instances)
preprocessed_inputs = self._preprocessor.preprocess(inputs)
if kwargs.get('probabilities'):
probabilities = self._model.predict_proba(preprocessed_inputs)
return probabilities.tolist()
else:
outputs = self._model.predict(preprocessed_inputs)
return [self._class_names[class_num] for class_num in outputs]
#classmethod
def from_path(cls, model_dir):
model_path = os.path.join(model_dir, 'model.joblib')
model = joblib.load(model_path)
preprocessor_path = os.path.join(model_dir, 'preprocessor.pkl')
with open(preprocessor_path, 'rb') as f:
preprocessor = pickle.load(f)
return cls(model, preprocessor)
the code I use to create my model in cloud is:
! gcloud beta ai-platform versions create {VERSION_NAME} \
--model {MODEL_NAME} \
--runtime-version 1.13 \
--python-version 3.5 \
--origin gs://{BUCKET_NAME}/custom_prediction_routine_tutorial/model/ \
--package-uris gs://{BUCKET_NAME}/custom_prediction_routine_tutorial/my_custom_code-0.1.tar.gz \
--prediction-class predictor.MyPredictor
But I end up with such an odd error:
ERROR: (gcloud.beta.ai-platform.versions.create) Bad model detected with error: "Failed to load model: Unexpected error when loading the model: 'ascii' codec can't decode byte 0xf9 in position 2: ordinal not in range(128) (Error code: 0)"
The thing is that when I run the same command without the:
--prediction-class predictor.MyPredictor
it work fine.
Does someone know the reason of this ? I think model.joblib might have an encoding problem but when I load it myself there is nothing wrong
I've find the solution,
In the tutorial they use pickle to save the preprocessor object created, and Joblib to save the model.
You need to use Joblib to save both and then send it to google storage.