Read / Write data from HBase using Pyspark - scala

I am trying to read data from HBase using PySpark and I am getting some strange errors. Below is a sample snippet of my code.
Please suggest a solution.
empdata = ''.join("""
{
    'table': {
        'namespace': 'default',
        'name': 'emp'
    },
    'rowkey': 'key',
    'columns': {
        'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
        'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
    }
}
""".split())

df = sqlContext \
    .read \
    .options(catalog=empdata) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.show()
I have used the following versions:
HBase 2.1.6,
PySpark 2.3.2, Hadoop 3.1
I ran the code as follows:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml
The error is
An error occurred while calling o71.load. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

Related

Update/Replace value in Mongo Database using Mongo Spark Connector (Pyspark) v10x

I am using the following versions:
mongo-spark-connector:10.0.5
Spark version 3.1.3
And I configure the Mongo Spark connector as follows:
spark = SparkSession.builder \
    .appName("hello") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config('spark.driver.memory', '2g') \
    .config('spark.driver.cores', '4') \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.5') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar') \
    .enableHiveSupport() \
    .getOrCreate()
I want to ask how to update and replace values in a MongoDB database.
I read the question Updating mongoData with MongoSpark, but that approach only works for mongo-spark v2.x; with mongo-spark v10 and above it fails.
Example:
I have the following attributes:
from bson.objectid import ObjectId

data = {
    '_id': ObjectId("637367d5262dc89a8e318d09"),
    'database': database_name,
    "table": table,
    "latestSyncAt": lastestSyncAt,
    "lastest_id": str(lastest_id)
}
df = spark.createDataFrame([data])
How do I update or replace the _id attribute value in the MongoDB database using the Mongo Spark connector?
Thank you very much for your support.
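One possible direction (a sketch, not a verified solution): MongoDB never lets you change an existing document's _id, so "replacing the _id value" in practice means replacing the whole document that _id identifies. The 10.x connector's batch write options operationType and idFieldList are, to my knowledge, the way to express that; the option names, URI, database, and collection below are assumptions to check against the connector documentation for your version. A minimal sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sync-state document; _id is kept as a plain string because
# Spark cannot infer a schema for bson.ObjectId values.
data = [{
    "_id": "637367d5262dc89a8e318d09",
    "database": "my_database",
    "table": "my_table",
    "latestSyncAt": "2022-11-15T10:00:00",
    "lastest_id": "12345",
}]
df = spark.createDataFrame(data)

# operationType and idFieldList are assumed 10.x write options (check the docs);
# if the stored _id is an ObjectId rather than a string, the key must be written
# in a compatible form or the replace will not match.
df.write \
    .format("mongodb") \
    .mode("append") \
    .option("connection.uri", "mongodb://localhost:27017") \
    .option("database", "my_database") \
    .option("collection", "sync_state") \
    .option("operationType", "replace") \
    .option("idFieldList", "_id") \
    .save()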

java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing

I'm trying to perform read/write streaming of data from the Cosmos DB API for MongoDB into Databricks PySpark and I am getting the error java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
Could anyone please help with how to achieve this data streaming in PySpark?
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.streaming import *
from pyspark.sql.types import StringType,BooleanType,DateType,StructType,LongType,IntegerType
spark = SparkSession \
    .builder \
    .appName("streamingExampleRead") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.0') \
    .getOrCreate()
sourceConnectionString = <primary connection string of cosmosDB API for MongoDB isntance>
sourceDb = <your database name>
sourceCollection = <yourcollection name>
dataStreamRead = (
    spark.readStream.format("mongodb")
    .option('spark.mongodb.connection.uri', sourceConnectionString)
    .option('spark.mongodb.database', sourceDb)
    .option('spark.mongodb.collection', sourceCollection)
    .option('spark.mongodb.change.stream.publish.full.document.only', 'true')
    .option("forceDeleteTempCheckpointLocation", "true")
    .load()
)
display(dataStreamRead)
query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    .trigger(processingTime='1 seconds')
    .start()
    .awaitTermination())
Getting following error:
java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
at org.apache.spark.sql.errors.QueryExecutionErrors$.microBatchUnsupportedByDataSourceError(QueryExecutionErrors.scala:1579)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:123)
Data source mongodb does not support microbatch processing.
=== Streaming Query ===
Identifier: [id = 78cfcef1-19de-40f4-86fc-847109263ee9, runId = d2212e1f-5247-4cd2-9c8c-3cc937e2c7c5]
Current Committed Offsets: {}
Current Available Offsets: {}
Current State: INITIALIZING
Thread State: RUNNABLE
Try using trigger(continuous="1 second") instead of trigger(processingTime='1 seconds').
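For context, the 10.0.x streaming source reportedly supports only Spark's continuous processing mode (micro-batch support came in a later connector release), which is why changing the trigger helps; worth verifying against the connector's release notes. A sketch of the write from the question with a continuous trigger, reusing the dataStreamRead defined above:
# Same console sink as in the question, triggered in continuous mode.
query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    .trigger(continuous="1 second")   # instead of processingTime='1 seconds'
    .start())
query2.awaitTermination()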

Sink Kafka Stream to MongoDB using PySpark Structured Streaming

My Spark:
spark = SparkSession \
    .builder \
    .appName("Demo") \
    .master("local[3]") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
    .getOrCreate()
Mongo URI:
input_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
output_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
Function for writing stream batches to Mongo:
def save_to_mongodb_collection(current_df, epoc_id, mongodb_collection_name):
    current_df.write \
        .format("com.mongodb.spark.sql.DefaultSource") \
        .mode("append") \
        .option("spark.mongodb.output.uri", output_uri_weld) \
        .save()
Kafka Stream:
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()
Write to Mongo:
mongo_writer = df_parsed.write \
    .format('com.mongodb.spark.sql.DefaultSource') \
    .mode('append') \
    .option("spark.mongodb.output.uri", output_uri_weld) \
    .save()
And my spark.conf file:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0
Error:
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
I found a solution.
Since I couldn't find the right Mongo driver for Structured Streaming, I worked out another solution.
Now I use a direct connection to MongoDB and foreach(...) instead of foreachBatch(...). My code looks like this in the testSpark.py file:
....
import pymongo
from pymongo import MongoClient

local_url = "mongodb://localhost:27017"

def write_machine_df_mongo(target_df):
    cluster = MongoClient(local_url)
    db = cluster["test_db"]
    collection = db.test1
    post = {
        "machine_id": target_df.machine_id,
        "proc_type": target_df.proc_type,
        "sensor1_id": target_df.sensor1_id,
        "sensor2_id": target_df.sensor2_id,
        "time": target_df.time,
        "sensor1_val": target_df.sensor1_val,
        "sensor2_val": target_df.sensor2_val,
    }
    collection.insert_one(post)

machine_df.writeStream \
    .outputMode("append") \
    .foreach(write_machine_df_mongo) \
    .start()
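As a variation on the same idea, foreachBatch writes each micro-batch in one call instead of opening a MongoDB write per row. The sketch below reuses the database and collection names from the snippet above together with pymongo's insert_many; treat it as an untested alternative rather than a drop-in replacement:
from pymongo import MongoClient

local_url = "mongodb://localhost:27017"

def write_machine_batch_mongo(batch_df, epoch_id):
    # Collect the micro-batch on the driver and insert it in one call.
    # Fine for small batches; for large ones, write per partition instead.
    docs = [row.asDict() for row in batch_df.collect()]
    if docs:
        client = MongoClient(local_url)
        try:
            client["test_db"]["test1"].insert_many(docs)
        finally:
            client.close()

machine_df.writeStream \
    .outputMode("append") \
    .foreachBatch(write_machine_batch_mongo) \
    .start()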

How to convert a SnowflakeCursor in to a pySpark dataframe

The result of sending an SQL to Snowflake is a SnowflakeCursor. How would I easily convert it into a pySpark dataframe?
Thanks!
When using Databricks (https://docs.databricks.com/data/data-sources/snowflake.html), we can use spark.read to load the result of a SQL statement into a dataframe. Note that specifying the sfRole may be the key to gaining access to your database objects.
options = {
    "sfUrl": "https://yourinstance.snowflakecomputing.com/",
    "sfUser": user,
    "sfPassword": pw,
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "Accountadmin",
    "sfWarehouse": "wh"
}

df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", strCheckingSQL) \
    .load()
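If the starting point really is a SnowflakeCursor from the snowflake-connector-python package (rather than the Spark connector), another route is to pull the result into pandas with fetch_pandas_all() and hand it to Spark. A sketch, using placeholder credentials and the same strCheckingSQL query as above; it assumes the connector's pandas extras are installed and that the result fits in driver memory:
import snowflake.connector
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder credentials; fetch_pandas_all() needs the pandas extras
# (pip install "snowflake-connector-python[pandas]").
conn = snowflake.connector.connect(
    account="yourinstance",
    user="user",
    password="pw",
    database="db",
    schema="schema",
    role="Accountadmin",
    warehouse="wh",
)
cursor = conn.cursor()
cursor.execute(strCheckingSQL)       # returns a SnowflakeCursor
pdf = cursor.fetch_pandas_all()      # SnowflakeCursor -> pandas DataFrame
df = spark.createDataFrame(pdf)      # pandas DataFrame -> Spark DataFrame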

Does scala has "Options" to parse command line arguments in spark-submit just like in Java? [duplicate]

This question already has answers here:
Best way to parse command-line parameters? [closed]
(26 answers)
Closed 3 years ago.
In order to parse command line arguments while using spark-submit:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.Pickup --master=yarn --conf spark.ui.port=0000 --driver-class-path /home/hdpusr/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpusr/jars/postgresql-42.1.4.jar,/home/hdpusr/jars/postgresql-42.1.4.jar --executor-cores 4 --executor-memory 4G --keytab /home/hdpusr/hdpusr.keytab --principal hdpusr#DEVUSR.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Spark_APP --conf spark.executor.extraClassPath=/home/hdpusr/jars/greenplum.jar sparkload_2.11-0.1.jar ORACLE
I am passing a database name, ORACLE, which I parse in the code as:
def main(args: Array[String]): Unit = {
  val dbtype = args(0).toString
  .....
}
Is there a way I can give it a name like "--dbname" and then check for that option in the spark-submit command to get the option's value?
Ex:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.Pickup --master=yarn --conf spark.ui.port=0000 --driver-class-path /home/hdpusr/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpusr/jars/postgresql-42.1.4.jar,/home/hdpusr/jars/postgresql-42.1.4.jar --executor-cores 4 --executor-memory 4G --keytab /home/hdpusr/hdpusr.keytab --principal hdpusr#DEVUSR.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Spark_APP --conf spark.executor.extraClassPath=/home/hdpusr/jars/greenplum.jar sparkload_2.11-0.1.jar --dbname ORACLE
In Java, the following classes from the Apache Commons CLI package can be used to do the same:
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public static void main(String[] args) {
    Options options = new Options();
    Option input = new Option("s", "ssn", true, "source system names");
    input.setRequired(false);
    options.addOption(input);

    CommandLineParser parser = new DefaultParser();
    HelpFormatter formatter = new HelpFormatter();
    CommandLine cmd = null;
    try {
        cmd = parser.parse(options, args);
        if (cmd.hasOption("s")) { // Checks if the '-s'/'--ssn' option was passed; runs the Recon only for the received SSNs.
        }
    } catch (ParseException e) {
        formatter.printHelp("utility-name", options);
        e.printStackTrace();
        System.exit(1);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Could anyone let me know if it is possible to name the command line arguments and parse them accordingly?
If you use --dbname=ORACLE, for example:
import com.typesafe.config.ConfigException

val pattern = """--dbname=(.*)""".r

val params = args.map {
  case pattern(value) => value
  case arg => throw new ConfigException.Generic(s"""unable to parse command-line argument "$arg"""")
}
\s matches whitespace, so you could use it to support --dbname ORACLE (with a space), but it is easier to just stick with the --dbname=ORACLE string form.
If we are not particular about the key name, we can prefix it with spark., in this case spark.dbname, and pass it as a conf argument like spark-submit --conf spark.dbname=<> ...., or add it to spark-defaults.conf.
In the user code, we can access the key as sparkContext.getConf.get("spark.dbname")