Trigger.AvailableNow for Delta source streaming queries in PySpark (Databricks)

All the examples in the Databricks documentation are in Scala. I can't find how to use this trigger type from PySpark. Is there an equivalent API or workaround?

The Python implementation missed the Spark 3.2 release, so it will only be included in Spark 3.3 (for the OSS version). On Databricks it was released as part of DBR 10.3 (or 10.2?), and can be used as follows:
.trigger(availableNow=True)

Here is the official documentation:
DataStreamWriter.trigger(*, processingTime: Optional[str] = None,
                         once: Optional[bool] = None,
                         continuous: Optional[str] = None,
                         availableNow: Optional[bool] = None) -> pyspark.sql.streaming.DataStreamWriter
availableNow: bool, optional
if set to True, set a trigger that processes all available data in multiple batches then terminates the query. Only one trigger can be set.
# trigger the query for reading all available data with multiple batches
writer = sdf.writeStream.trigger(availableNow=True)
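For context, here is a minimal end-to-end sketch in PySpark; the Delta paths and the checkpoint location are placeholders, and it assumes a runtime where the Python API is available (DBR 10.3+ or Spark 3.3+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read a Delta table as a streaming source (path is a placeholder)
sdf = spark.readStream.format("delta").load("/delta/events")

# process everything currently available in multiple batches, then stop
query = (sdf.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events_copy")
    .trigger(availableNow=True)
    .start("/delta/events_copy"))

query.awaitTermination()

Unlike trigger(once=True), this drains the backlog in several batches (respecting source options such as maxFilesPerTrigger) before the query terminates.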

Related

Is there a difference between PySpark and SparkSQL? If so, what's the difference?

Long story short, I'm tasked with converting files from SparkSQL to PySpark as my first task at my new job.
However, I'm unable to see many differences outside of syntax. Is SparkSQL an earlier version of PySpark or a component of it or something different altogether?
And yes, it's my first time using these tools. But I have experience with both Python & SQL, so it doesn't seem like that difficult a task. I just want a better understanding.
Example of the syntax difference I'm referring to:
spark.read.table("db.table1").alias("a")
.filter(F.col("a.field1") == 11)
.join(
other = spark.read.table("db.table2").alias("b"),
on = 'field2',
how = 'left'
Versus
df = spark.sql(
    """
    SELECT b.field1,
           CASE WHEN ...
                THEN ...
                ELSE ...
           END field2
    FROM db.table1 a
    LEFT JOIN db.table2 b
      ON a.field1 = b.field1
    WHERE a.field1 = {}
    """.format(field1)
)
From the documentation: PySpark is an interface within which you have the components of Spark, viz. Spark Core, Spark SQL, Spark Streaming and Spark MLlib.
Coming to the task you have been assigned, it looks like you've been tasked with translating SQL-heavy code into a more PySpark-friendly format.
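For illustration only, here is a rough sketch of how the SQL above could be expressed with the DataFrame API; the CASE condition and values are placeholders, since they are elided in the question, and field1 is the same Python variable that was interpolated into the SQL string:

from pyspark.sql import functions as F

a = spark.read.table("db.table1").alias("a")
b = spark.read.table("db.table2").alias("b")

df = (
    a.join(b, F.col("a.field1") == F.col("b.field1"), "left")  # LEFT JOIN ... ON a.field1 = b.field1
     .filter(F.col("a.field1") == field1)                      # WHERE a.field1 = {}
     .select(
         F.col("b.field1"),
         # CASE WHEN ... THEN ... ELSE ... END field2 maps to F.when(...).otherwise(...)
         F.when(F.col("a.field1") == field1, F.lit("then_value"))  # placeholder condition/values
          .otherwise(F.lit("else_value"))
          .alias("field2"),
     )
)

Both forms produce a DataFrame and go through the same Catalyst optimizer, which is why the differences you see are mostly syntactic.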

check if table exists in hive using spark 1.6 scala code

I am trying to check if a table exists in Hive using Spark 1.6 and Scala code.
I tried searching the internet but couldn't find anything more useful than this:
spark - scala - How can I check if a table exists in hive
There it is mentioned that if we use
sqlContext.tableNames.contains("mytable")
then it returns a boolean. But when I try this it checks the default database and gives me false.
How can I specify which database should be looked into for this check?
You could set the database first like this:
scala> sqlContext.sql("use dbName")
and then search for the table:
scala> sqlContext.tableNames.contains("tabName")
res3: Boolean = true

Reading ES from spark with elasticsearch-spark connector: all the fields are returned

I've done some experiments in the spark-shell with the elasticsearch-spark connector. Invoking spark:
] $SPARK_HOME/bin/spark-shell --master local[2] --jars ~/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar
In the scala shell:
scala> import org.elasticsearch.spark._
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery")
It works well, the result contains the correct records as specified in myquery. The only thing is that I get all the fields, even if I specify a subset of these fields in the query. Example:
myquery = """{"query":..., "fields":["a","b"], "size":10}"""
returns all the fields, not only a and b (BTW, I noticed that the size parameter is not taken into account either: the result contains more than 10 records). Maybe it's important to add that the fields are nested; a and b are actually doc.a and doc.b.
Is it a bug in the connector or do I have the wrong syntax?
The Spark Elasticsearch connector uses fields, so you cannot apply a projection.
If you wish to have fine-grained control over the mapping, you should use a DataFrame instead, which is basically an RDD plus a schema.
The pushdown predicate should also be enabled to translate (push down) Spark SQL into the Elasticsearch Query DSL.
Now a semi-full example:
val myQuery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myQuery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10) // instead of size
.select("a","b") // instead of fields
What about calling:
scala> val es_rdd = sc.esRDD("myindex/mytype", "myquery", Map("es.read.field.include" -> "a,b"))
You want to restrict the fields returned from the Elasticsearch _search HTTP API? (I guess to improve download speed.)
First of all, use an HTTP proxy to see what the elasticsearch-hadoop plugin is doing (on macOS I use Apache Zeppelin with the Charles proxy). This will help you understand how pushdown works.
There are several solutions to achieve this:
1. dataframe and pushdown
You specify the fields, and the plugin will "forward" them to ES (here via the _source parameter):
POST ../events/_search?search_type=scan&scroll=5m&size=50&_source=client&preference=_shards%3A3%3B_local
(-) Does not fully work for nested fields.
(+) Simple, straightforward, easy to read
2. RDD & query fields
With JavaEsSpark.esRDD, you can specify fields inside the JSON query, like you did. This only works with RDDs (with a DataFrame, the fields are not sent).
(-) no DataFrame -> not the Spark way
(+) more flexible, more control

N1QL Query to connect databricks spark 1.6 to couchbase server 4.5

I am trying to set up a connection from Databricks to Couchbase Server 4.5 and then run an N1QL query.
The Scala code below will return one record, but fails when introducing the N1QL query. Any help is appreciated.
import com.couchbase.client.java.CouchbaseCluster;
import scala.collection.JavaConversions._;
import com.couchbase.client.java.query.Select.select;
import com.couchbase.client.java.query.dsl.Expression;
import com.couchbase.client.java.query.Query
// Connect to a cluster on localhost
val cluster = CouchbaseCluster.create("http://**************")
// Open the default bucket
val bucket = cluster.openBucket("travel-sample", "password");
// Read it back out
//val streamsense = bucket.get("airline_1004546") - Works and returns one record
// Create a DataFrame with schema inference
val ev = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
//Show the inferred schema
ev.printSchema()
//query using the data frame
ev
.select("id", "type")
.show(10)
//issue sql query for the same data (N1ql)
val query = "SELECT type, meta().id FROM `travel-sample` LIMIT 10"
sc
.couchbaseQuery(N1qlQuery.simple(query))
.collect()
.foreach(println)
In Databricks (and usually any interactive Spark cloud environment) you do not define the cluster nodes, buckets, or the sc variable yourself. Instead, you need to set the configuration settings for Spark to use when setting up the Databricks cluster, via the cluster's advanced options.
I've only used this approach with Spark 2.0, so your mileage may vary.
You can remove your cluster and bucket variable initialisation as well.
You have a syntax error in the N1QL query. You have:
val query = "SELECT type, id FROM `travel-sample` WHERE LIMIT 10"
You need to either remove the WHERE, or add a condition.
You also need to change id to META().id.

Remove Temporary Tables from Apache SQL Spark

I have registered a temp table in Apache Spark using Zeppelin as below:
val hvacText = sc.textFile("...")
case class Hvac(date: String, time: String, targettemp: Integer, actualtemp: Integer, buildingID: String)
val hvac = hvacText.map(s => s.split(",")).filter(s => s(0) != "Date").map(
  s => Hvac(s(0),
            s(1),
            s(2).toInt,
            s(3).toInt,
            s(6))).toDF()
hvac.registerTempTable("hvac")
After I am done with my queries on this temp table, how do I remove it?
I checked all the docs and it seems I am getting nowhere.
Any guidance?
Spark 2.x
For temporary views you can use Catalog.dropTempView:
spark.catalog.dropTempView("df")
For global views you can use Catalog.dropGlobalTempView:
spark.catalog.dropGlobalTempView("df")
Both methods are safe to call if the view doesn't exist and, since Spark 2.1, return a boolean indicating whether the operation succeeded.
Spark 1.x
You can use SQLContext.dropTempTable:
scala.util.Try(sqlContext.dropTempTable("df"))
It can still be used in Spark 2.0, but it delegates processing to Catalog.dropTempView and is safe to use if the table doesn't exist.
If you want to remove your temp table in Zeppelin, try it like this:
sqlc.dropTempTable("hvac")
or
%sql DROP VIEW hvac
And you can get the information you need from the Spark API docs (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package).
In newer versions of Spark (2.0 and later), one should use createOrReplaceTempView in place of registerTempTable (deprecated).
The corresponding method to deallocate the view is dropTempView:
spark.catalog.dropTempView("temp_view_name") //drops the table
You can also use a SQL DROP TABLE/VIEW statement to remove it, like below:
spark.sql("drop view hvac");