Spark SQL vs Hive SQL - pyspark

select case when unix_timestamp(start_date,'YYYY-MM-DD') is TRUE then 'present' else 'fail' end as test from hive_schema.table_name;
Sample data:
start_date
2023-01-25
For the sample data above, the query returns the correct result when executed in Hive. The table is an external table, and the environment is a CDP platform.
However, the same query gives an incorrect result when executed from Python through a Spark session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CD6').enableHiveSupport().getOrCreate()

# Run the same query against the Hive table through the Spark session
res = spark.sql("""select case when unix_timestamp(start_date,'YYYY-MM-DD') is TRUE then 'present' else 'fail' end as test from hive_schema.table_name""")
res.show()
spark.stop()
Spark does not seem to run the query exactly as Hive does when it picks up the Hive schema connection; something in how it interprets the query leads to the wrong result.
Any help resolving this issue would be highly appreciated. Thanks!
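One thing worth checking (an assumption on my part, since the Spark version is not stated): Spark parses datetime patterns with Java formatter semantics, where the pattern for values like 2023-01-25 is 'yyyy-MM-dd'; upper-case 'YYYY' (week-based year) and 'DD' (day of year) mean something different there, and on Spark 3 they can make unix_timestamp return NULL or raise an error mentioning spark.sql.legacy.timeParserPolicy, while Hive's SimpleDateFormat parses them leniently. A minimal check from the same session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CD6').enableHiveSupport().getOrCreate()

# Same query, but with the Java/Spark-style lower-case date pattern.
# If this returns 'present', the pattern string is what Hive and Spark interpret differently.
spark.sql("""
    select case when unix_timestamp(start_date, 'yyyy-MM-dd') is not null
                then 'present' else 'fail' end as test
    from hive_schema.table_name
""").show()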

Related

Spark - Hive UDF is working with Spark-SQL but not with DataFrame

If I use the Hive UDF through Spark SQL, it works, as shown below.
val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.registerTempTable("test")
spark.sql("select default.encrypt(selling_price,'sales','','employee','id') from test").show
However, the following does not work.
// This does not work; not sure whether a function needs to be registered for it
val encDF = df.withColumn("encrypted", default.encrypt($"selling_price","sales","","employee","id"))
encDF.show
Error
error: not found: value default
The Hive UDF is only available when you go through Spark SQL; it is not available in the Scala environment because it was never defined there. You can still call the Hive UDF from the DataFrame API using expr:
df.withColumn("encrypted", expr("default.encrypt(selling_price,'sales','','employee','id')"))

PySpark Couchbase connection

I'm trying to use PySpark to connect to our Couchbase server and query it. Essentially, I want to query Couchbase the same way as the following Scala code, but from Python (PySpark).
import com.couchbase.spark._
val query = "SELECT name FROM travel-sample WHERE type = 'airline' ORDER BY name ASC LIMIT 10"
sc
  .couchbaseQuery(Query.simple(query))
  .collect()
  .foreach(println)
Does anyone have an example of doing this with Python code that they could post?
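The RDD-level couchbaseQuery API is Scala-only, so the usual route from Python is the connector's Spark SQL data source. Below is a minimal sketch of that approach; the data source name (com.couchbase.spark.sql.DefaultSource) and the config/option keys are taken from the Couchbase Spark Connector 2.x documentation as I recall it, so treat them as assumptions to verify against your connector version:

from pyspark.sql import SparkSession

# Couchbase connection details go into the Spark config (keys are assumptions).
spark = (SparkSession.builder
         .appName("couchbase-pyspark")
         .config("com.couchbase.nodes", "127.0.0.1")
         .config("com.couchbase.bucket.travel-sample", "")
         .getOrCreate())

# Read the bucket through the connector's DataFrame source and express
# the N1QL filter/ordering with DataFrame operations instead.
airlines = (spark.read
            .format("com.couchbase.spark.sql.DefaultSource")
            .option("bucket", "travel-sample")
            .option("schemaFilter", "type = 'airline'")
            .load())

airlines.select("name").orderBy("name").limit(10).show()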

Parse Kafka JSON stream using pyspark and save to mongodb

I have a stream of alerts coming from Kafka to Spark. These are alerts in JSON format from different IoT Sensors.
Kafka Streams:
{ "id":"2093021", alert:"Malfunction
detected","sensor_id":"14-23092-AS" }
{ "id":"2093021", alert:"Malfunction
detected","sensor_id":"14-23092-AS" , "alarm_code": "Severe" }
My code: spark-client.py
from __future__ import print_function
import sys
import json

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .appName("myApp") \
        .config("spark.mongodb.input.uri", "mongodb://spark:1234#172.31.9.44/at_cloudcentral.spark_test") \
        .config("spark.mongodb.output.uri", "mongodb://spark:1234#172.31.9.44/at_cloudcentral.spark_test") \
        .getOrCreate()
    sc = spark.sparkContext
    ssc = StreamingContext(sc, 10)

    zkQuorum, topic = sys.argv[1:]
    kafka_streams = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-sql-mongodb-test-consumer", {topic: 1})
    dstream = kafka_streams.map(lambda x: json.loads(x[1]))
    dstream.pprint()

    ssc.start()
    ssc.awaitTermination()
When I run this
ubuntu#ip-172-31-89-176:~/spark-connectors$ spark-submit spark-client.py localhost:2181 DetectionEntry
I get this output
-------------------------------------------
Time: 2019-12-04 14:26:40
-------------------------------------------
{u'sensor_id': u'16-23092-AS', u'id': u'2093021', u'alert': u'Malfunction detected'}
I need to be able to save this alert to a remote MongoDB. I have two specific challenges:
1. How do I correctly parse the output so that I can create a DataFrame that can be written to MongoDB? I have tried adding this to the end of the code:
d = [dstream]
df = spark.createDataFrame(d).collect()
and it gives me this error
dataType py4j.java_gateway.JavaMember object at 0x7f5912726750 should be an instance of class 'pyspark.sql.types.DataType'
2. My alerts can have different JSON structures, and I need to dump them into a MongoDB collection, so a fixed schema won't work for me. Most of the similar questions and code I have found on Stack Overflow assume a fixed schema, and I can't figure out how to push the alerts to MongoDB so that each record in the collection keeps its own structure (JSON shape). Any pointers in the right direction would be appreciated.
You can parse the Kafka JSON messages easily with the PySpark Structured Streaming API by invoking a simple UDF. You can check the complete code in the Stack Overflow link below for reference.
Pyspark Structured streaming processing
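A minimal sketch of that Structured Streaming approach is shown below. The broker address, topic name, and MongoDB URI are placeholders, and it assumes the spark-sql-kafka package and the MongoDB Spark connector (data source name "mongo") are on the classpath; the per-batch spark.read.json call infers a schema for each micro-batch, so alerts with differing fields can land in the same collection with their own structure:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Placeholder connection details; substitute your own.
spark = (SparkSession.builder
         .appName("kafka-to-mongo")
         .config("spark.mongodb.output.uri",
                 "mongodb://user:password@host/at_cloudcentral.spark_test")
         .getOrCreate())

# Structured Streaming source: each Kafka record's value is the raw JSON alert.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "DetectionEntry")
       .load()
       .select(col("value").cast("string").alias("json")))

def write_batch(batch_df, batch_id):
    # Infer the schema per micro-batch so alerts with extra fields
    # (e.g. alarm_code) keep their own structure in MongoDB.
    docs = spark.read.json(batch_df.rdd.map(lambda row: row.json))
    if docs.head(1):
        docs.write.format("mongo").mode("append").save()

(raw.writeStream
 .foreachBatch(write_batch)
 .start()
 .awaitTermination())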

Making RDD operations on sqlContext

I am working through an Apache Spark tutorial, using the Cassandra database, Spark 2.0, and Python.
I am trying to do an RDD operation on the result of a SQL query, following this guide:
https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html
It says, "The results of SQL queries are RDDs and support all the normal RDD operations."
I currently have this code:
sqlContext = SQLContext(sc)
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'").show(20, False)
df = sqlContext.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="wordcount", keyspace="demo") \
    .load()
df.select("word")
df.createOrReplaceTempView("tweets")
usernames = results.map(lambda p: "User: " + p.word)
for name in usernames.collect():
    print(name)
AttributeError: 'NoneType' object has no attribute 'map'
If the variable results is the result of a SQL query, why am I getting this error? Can anyone please explain this to me?
Everything works fine and the tables print; the only time I get an error is when I try to do an RDD operation.
Please bear in mind that sc is an existing SparkContext.
It's because show() only prints the content and returns None, so results ends up holding None instead of a DataFrame.
Use:
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'")
results.show(20, False)
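One more detail, not covered above: in PySpark 2.x a DataFrame has no .map method, so even with show() removed, results.map(...) would still fail; the mapping has to go through the underlying RDD. A minimal sketch combining both fixes:

results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'")
results.show(20, False)

# A DataFrame has no .map in PySpark; convert to an RDD first.
usernames = results.rdd.map(lambda p: "User: " + p.word)
for name in usernames.collect():
    print(name)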

N1QL Query to connect databricks spark 1.6 to couchbase server 4.5

I am trying to set up a connection from Databricks to Couchbase Server 4.5 and then run an N1QL query.
The Scala code below returns one record but fails once the N1QL query is introduced. Any help is appreciated.
import com.couchbase.client.java.CouchbaseCluster;
import scala.collection.JavaConversions._;
import com.couchbase.client.java.query.Select.select;
import com.couchbase.client.java.query.dsl.Expression;
import com.couchbase.client.java.query.Query
// Connect to a cluster on localhost
val cluster = CouchbaseCluster.create("http://**************")
// Open the default bucket
val bucket = cluster.openBucket("travel-sample", "password");
// Read it back out
//val streamsense = bucket.get("airline_1004546") - Works and returns one record
// Create a DataFrame with schema inference
val ev = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
//Show the inferred schema
ev.printSchema()
//query using the data frame
ev
.select("id", "type")
.show(10)
//issue sql query for the same data (N1ql)
val query = "SELECT type, meta().id FROM `travel-sample` LIMIT 10"
sc
.couchbaseQuery(N1qlQuery.simple(query))
.collect()
.foreach(println)
In Databricks (and in most managed Spark cloud environments) you do not define the cluster nodes, buckets, or the sc variable yourself; instead, you set the Couchbase connection settings in the Spark configuration when creating the Databricks cluster, under its advanced settings.
I've only used this approach with Spark 2.0, so your mileage may vary.
You can remove your cluster and bucket variable initialisation as well.
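As a rough illustration, the configuration lines below go into the cluster's Spark config box rather than into notebook code. The keys (com.couchbase.nodes and com.couchbase.bucket.<bucket>) are taken from the Couchbase Spark connector 1.x/2.x documentation as I recall it, so treat them as an assumption to verify for your connector version:

com.couchbase.nodes **************
com.couchbase.bucket.travel-sample password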
You have a syntax error in the N1QL query. You have:
val query = "SELECT type, id FROM `travel-sample` WHERE LIMIT 10"
You need to either remove the WHERE, or add a condition.
You also need to change id to META().id.