The result of sending an SQL statement to Snowflake is a SnowflakeCursor. How can I easily convert it into a PySpark DataFrame?
Thanks!
When using Databricks (https://docs.databricks.com/data/data-sources/snowflake.html), we can use spark.read to load the result of an SQL statement into a DataFrame. Note that specifying sfRole may be the key to gaining access to your database objects.
options = {
    "sfUrl": "https://yourinstance.snowflakecomputing.com/",
    "sfUser": user,
    "sfPassword": pw,
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "Accountadmin",
    "sfWarehouse": "wh"
}
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", strCheckingSQL) \
    .load()
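If you really do start from a SnowflakeCursor (Snowflake Python connector) rather than from Spark, a hedged alternative is to fetch the result into pandas and hand it to Spark. This is only a sketch: it assumes the connector's pandas extra is installed and that the result fits in driver memory; for anything large, prefer the Spark connector read above.
# Sketch: SnowflakeCursor -> pandas -> PySpark DataFrame.
# `cur` is assumed to be an existing SnowflakeCursor; fetch_pandas_all()
# requires snowflake-connector-python[pandas].
cur.execute(strCheckingSQL)
pdf = cur.fetch_pandas_all()       # collects the full result on the driver
df = spark.createDataFrame(pdf)    # convert pandas -> Spark DataFrame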
I am using the Spark versions below. Details:
mongo-spark-connector:10.0.5
Spark version 3.1.3
I configure the Mongo Spark connector as follows:
spark = SparkSession.builder \
.appName("hello") \
.master("yarn") \
.config("spark.executor.memory", "4g") \
.config('spark.driver.memory', '2g') \
.config('spark.driver.cores', '4') \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.5') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar') \
.enableHiveSupport() \
.getOrCreate()
I want to ask how to update and replace a value in a MongoDB database.
I read the question Updating mongoData with MongoSpark, but that approach works with mongo-spark v2.x; with mongo-spark v10 and above it fails.
Example:
I have the following attributes:
from bson.objectid import ObjectId
data = {
    '_id' : ObjectId("637367d5262dc89a8e318d09"),
    'database' : database_name,
    "table" : table,
    "latestSyncAt": lastestSyncAt,
    "lastest_id" : str(lastest_id)
}
df = spark.createDataFrame(data)
How do I update or replace the _id attribute value in the MongoDB database using the Mongo Spark connector?
Thank you very much for your support.
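A hedged sketch of one way to attempt this with the 10.x connector: its write configuration exposes operationType, idFieldList, and upsertDocument options (option names taken from my reading of the 10.x documentation; verify them for your exact connector version), so a replace keyed on _id could look like the following, where the connection URI, database, and collection names are placeholders. Note that Spark has no native BSON ObjectId type, so _id will typically travel as a string column.
# Sketch, not a verified solution: replace the matching document by _id
# using assumed mongo-spark-connector 10.x write options.
(df.write
    .format("mongodb")
    .mode("append")
    .option("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1:27017")  # placeholder URI
    .option("spark.mongodb.write.database", "mydb")                             # placeholder name
    .option("spark.mongodb.write.collection", "sync_state")                     # placeholder name
    .option("operationType", "replace")     # assumed: replace instead of insert
    .option("idFieldList", "_id")           # assumed: match documents on _id
    .option("upsertDocument", "true")       # assumed: insert when no match exists
    .save())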
I'm trying to read/write streaming data from the Cosmos DB API for MongoDB into Databricks PySpark and getting the error java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
Can anyone please help with how we can achieve data streaming in PySpark?
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.streaming import *
from pyspark.sql.types import StringType,BooleanType,DateType,StructType,LongType,IntegerType
spark = SparkSession.builder \
    .appName("streamingExampleRead") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.0') \
    .getOrCreate()
sourceConnectionString = <primary connection string of Cosmos DB API for MongoDB instance>
sourceDb = <your database name>
sourceCollection = <your collection name>
dataStreamRead=(
spark.readStream.format("mongodb")
.option('spark.mongodb.connection.uri', sourceConnectionString)
.option('spark.mongodb.database', sourceDb) \
.option('spark.mongodb.collection', sourceCollection) \
.option('spark.mongodb.change.stream.publish.full.document.only','true') \
.option("forceDeleteTempCheckpointLocation", "true") \
.load()
)
display(dataStreamRead)
query2=(dataStreamRead.writeStream \
.outputMode("append") \
.option("forceDeleteTempCheckpointLocation", "true") \
.format("console") \
.trigger(processingTime='1 seconds')
.start().awaitTermination());
Getting the following error:
java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
at org.apache.spark.sql.errors.QueryExecutionErrors$.microBatchUnsupportedByDataSourceError(QueryExecutionErrors.scala:1579)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:123)
Data source mongodb does not support microbatch processing.
=== Streaming Query ===
Identifier: [id = 78cfcef1-19de-40f4-86fc-847109263ee9, runId = d2212e1f-5247-4cd2-9c8c-3cc937e2c7c5]
Current Committed Offsets: {}
Current Available Offsets: {}
Current State: INITIALIZING
Thread State: RUNNABLE
Try using trigger(continuous="1 second") instead of trigger(processingTime='1 seconds').
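For reference, here is the writeStream from the question with only the trigger changed (a sketch reusing the names above; the console sink is one of the sinks Spark supports in continuous mode):
query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    .trigger(continuous="1 second")   # continuous processing instead of micro-batch
    .start())
query2.awaitTermination()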
My Spark:
spark = SparkSession\
.builder\
.appName("Demo")\
.master("local[3]")\
.config("spark.streaming.stopGracefullyonShutdown", "true")\
.config('spark.jars.packages','org.mongodb.spark:mongo-spark-connector_2.12:3.0.1')\
.getOrCreate()
Mongo URI:
input_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
output_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
Function for writing stream batches to Mongo:
def save_to_mongodb_collection(current_df, epoc_id, mongodb_collection_name):
current_df.write\
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.option("spark.mongodb.output.uri", output_uri_weld) \
.save()
Kafka Stream:
kafka_df = spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", kafka_broker)\
.option("subscribe", kafka_topic)\
.option("startingOffsets", "earliest")\
.load()
Write to Mongo:
mongo_writer = df_parsed.write\
.format('com.mongodb.spark.sql.DefaultSource')\
.mode('append')\
.option("spark.mongodb.output.uri", output_uri_weld)\
.save()
& my spark.conf file:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0
Error:
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
I found a solution.
Since I couldn't find the right Mongo driver for Structured Streaming, I worked out another approach.
Now I use a direct connection to MongoDB and use foreach(...) instead of foreachBatch(...). My code looks like this in the testSpark.py file:
....
import pymongo
from pymongo import MongoClient
local_url = "mongodb://localhost:27017"
def write_machine_df_mongo(target_df):
cluster = MongoClient(local_url)
db = cluster["test_db"]
collection = db.test1
post = {
"machine_id": target_df.machine_id,
"proc_type": target_df.proc_type,
"sensor1_id": target_df.sensor1_id,
"sensor2_id": target_df.sensor2_id,
"time": target_df.time,
"sensor1_val": target_df.sensor1_val,
"sensor2_val": target_df.sensor2_val,
}
collection.insert_one(post)
machine_df.writeStream\
.outputMode("append")\
.foreach(write_machine_df_mongo)\
.start()
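As a side note, the original ClassNotFoundException may simply be because the spark.conf packages line above does not list the Mongo connector; if org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 actually is on the classpath, a foreachBatch sink remains an option. A rough sketch, reusing the names from the question:
# Sketch only: write each micro-batch with the 3.x connector,
# assuming mongo-spark-connector_2.12:3.0.1 is on the classpath.
def write_batch_to_mongo(batch_df, batch_id):
    batch_df.write \
        .format("mongo") \
        .mode("append") \
        .option("spark.mongodb.output.uri", output_uri_weld) \
        .save()

machine_df.writeStream \
    .outputMode("append") \
    .foreachBatch(write_batch_to_mongo) \
    .start()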
I am trying to read data from HBase using PySpark and I am getting many weird errors. Below is a sample snippet of my code.
Please suggest a solution.
empdata = ''.join("""
{
'table': {
'namespace': 'default',
'name': 'emp'
},
'rowkey': 'key',
'columns': {
'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
}
}
""".split())
df = sqlContext \
.read \
.options(catalog=empdata) \
.format('org.apache.spark.sql.execution.datasources.hbase') \
.load()
df.show()
I have used the below versions:
HBase 2.1.6,
PySpark 2.3.2, Hadoop 3.1
I ran the code as follows:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml
The error is
An error occurred while calling o71.load. : java.lang.NoClassDefFoundError: org/apache/spark/Logging
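For what it is worth, org.apache.spark.Logging was removed in Spark 2.0, so the shc-core build requested above (1.1.1-1.6-s_2.10, compiled against Spark 1.6 / Scala 2.10) cannot work on PySpark 2.3.2. A sketch of the same launch command pointed at a Spark 2.x / Scala 2.11 build, assuming such an artifact (for example 1.1.1-2.1-s_2.11) is available in the Hortonworks repository:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml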
I am trying to connect to Snowflake from a Databricks notebook through the externalbrowser authenticator, but without any success.
CMD1
sfOptions = {
    "sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
    "sfAccount" : "xxxxx",
    "sfUser" : "ivan.lorencin#xxxxx",
    "authenticator" : "externalbrowser",
    "sfPassword" : "xxxxx",
    "sfDatabase" : "DWH_PROD",
    "sfSchema" : "APLSDB",
    "sfWarehouse" : "SNOWFLAKExxxxx",
    "tracing" : "ALL",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
CMD2
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select 1 as my_num union all select 2 as my_num") \
.load()
CMD2 never completes; it just keeps showing ".. Running command ..." forever.
Can anybody help with what is going wrong here? How can I establish a connection?
It looks like you're setting authenticator to externalbrowser, but according to the docs it should be sfAuthenticator - is this intentional? If you are trying to do an OAuth type of auth, why do you also have a password?
If your account/user requires OAuth to log in, I'd remove the password entry from sfOptions, rename that one entry to sfAuthenticator, and try again.
If that does not work, you should ensure that your Spark cluster can reach out to all the required Snowflake hosts (see SnowCD for assistance).
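Following that suggestion, a hedged sketch of how sfOptions might look with the key renamed and the password dropped (confirm the exact sfAuthenticator key name against the Snowflake Spark connector documentation for your version):
sfOptions = {
    "sfURL" : "xxxxx.west-europe.azure.snowflakecomputing.com",
    "sfAccount" : "xxxxx",
    "sfUser" : "ivan.lorencin#xxxxx",
    "sfAuthenticator" : "externalbrowser",   # renamed from "authenticator"; sfPassword removed
    "sfDatabase" : "DWH_PROD",
    "sfSchema" : "APLSDB",
    "sfWarehouse" : "SNOWFLAKExxxxx",
    "tracing" : "ALL",
}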