I'm trying to use PySpark to connect to our Couchbase server and query it. Essentially, I'm trying to query Couchbase the way the following Scala code does, but using Python (PySpark).
import com.couchbase.spark._
val query = "SELECT name FROM `travel-sample` WHERE type = 'airline' ORDER BY name ASC LIMIT 10"
sc
.couchbaseQuery(Query.simple(query))
.collect()
.foreach(println)
Does anyone have an example of doing this with Python code that they could post?
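For anyone looking for the Python side, here is a minimal PySpark sketch of the same query. As far as I know the RDD-level couchbaseQuery helper is Scala-only, so this goes through the connector's DataFrame source instead; the format name (com.couchbase.spark.sql.DefaultSource) and the spark.couchbase.* config keys are assumptions that may differ between connector versions, and the node address is a placeholder.

from pyspark.sql import SparkSession

# Assumes the Couchbase Spark connector jar is on the classpath; the config key
# names below (spark.couchbase.nodes, spark.couchbase.bucket.<name>) may vary
# between connector versions.
spark = SparkSession.builder \
    .appName("couchbase-example") \
    .config("spark.couchbase.nodes", "127.0.0.1") \
    .config("spark.couchbase.bucket.travel-sample", "") \
    .getOrCreate()

# Read the configured bucket as a DataFrame and express the N1QL-style query
# with the DataFrame API instead of a raw query string.
airlines = spark.read.format("com.couchbase.spark.sql.DefaultSource").load()

airlines.filter("type = 'airline'") \
    .select("name") \
    .orderBy("name") \
    .limit(10) \
    .show()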
Related
Scenario: Cassandra is hosted on a server a.b.c.d and Spark runs on a server, say w.x.y.z.
Assume I want to transform the data from one Cassandra table (say table) and write it to another Cassandra table (say tableNew) using Spark. The code that I write looks something like this:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "a.b.c.d")
.set("spark.cassandra.auth.username", "<UserName>")
.set("spark.cassandra.auth.password", "<Password>")
val spark = SparkSession.builder().master("yarn")
.config(conf)
.getOrCreate()
val dfFromCassandra = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "<table>", "keyspace" -> "<Keyspace>"))
  .load()

val filteredDF = dfFromCassandra.filter(filterCriteria)
  .write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "<tableNew>", "keyspace" -> "<Keyspace>"))
  .save()
Here filterCriteria represents the transformation/filtering that I do. I am not sure how the Spark Cassandra connector works internally in this case.
This is the confusion that I have:
1: Does Spark load the data from the Cassandra source table into memory, then filter it and write the result back to the target table? Or
2: Does the Spark Cassandra connector convert the filter criteria into a WHERE clause, load only the relevant data to form the RDD, and write that back to the target table in Cassandra? Or
3: Does the entire operation happen as a CQL operation, where the query is converted to a SQL-like query and executed in Cassandra itself? (I am almost sure that this is not what happens.)
It is either 1 or 2, depending on your filterCriteria. Naturally, Spark itself can't do any CQL filtering, but custom data sources can implement it using predicate pushdown. In the case of the Cassandra connector it is implemented here, and the answer depends on whether that covers the filterCriteria you use.
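One way to check which case applies for a given filterCriteria is to look at the physical plan: predicates that the connector pushes down show up under PushedFilters in the scan node. A minimal PySpark sketch, with table, keyspace, and column names as placeholders:

df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="sometable", keyspace="somekeyspace") \
    .load()

# If the filter shows up under "PushedFilters: [...]" in the plan, only the
# matching rows are requested from Cassandra (case 2); otherwise the filter runs
# in Spark after the data has been loaded (case 1).
df.filter("some_column = 'some_value'").explain()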
I am working through a tutorial on Apache Spark, making use of the Cassandra database, Spark 2.0, and Python.
I am trying to do an RDD operation on the result of a SQL query, using this tutorial:
https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html
It says: "The results of SQL queries are RDDs and support all the normal RDD operations."
I currently have these lines of code:
sqlContext = SQLContext(sc)
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'").show(20, False)
df = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="wordcount", keyspace= "demo")\
.load()
df.select("word")
df.createOrReplaceTempView("tweets")
usernames = results.map(lambda p: "User: " + p.word)
for name in usernames.collect():
    print(name)
AttributeError: 'NoneType' object has no attribute 'map'
If the variable results is the result of a SQL query, why am I getting this error? Can anyone please explain this to me?
Everything works fine and the tables print; the only time I get an error is when I try to do an RDD operation.
Please bear in mind that sc is an existing Spark context.
It's because show() only prints the content; it returns None, so results ends up being None and has no map attribute.
Use:
results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'")
results.show(20, False)
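With that change there is one more tweak needed for the RDD operation in Spark 2.0: a PySpark DataFrame no longer exposes map() directly, so go through its underlying RDD. A sketch of the full flow, reusing the question's query:

results = sqlContext.sql("SELECT word FROM tweets where word like '%#%'")
results.show(20, False)

# In Spark 2.0's Python API, map() lives on the underlying RDD, not the DataFrame.
usernames = results.rdd.map(lambda p: "User: " + p.word)
for name in usernames.collect():
    print(name)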
I've done some experiments in the spark-shell with the elasticsearch-spark connector. Invoking spark:
] $SPARK_HOME/bin/spark-shell --master local[2] --jars ~/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar
In the scala shell:
scala> import org.elasticsearch.spark._
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery")
It works well; the result contains the right records as specified in myquery. The only thing is that I get all the fields, even if I specify a subset of those fields in the query. Example:
myquery = """{"query":..., "fields":["a","b"], "size":10}"""
returns all the fields, not only a and b (by the way, I noticed that the size parameter is not taken into account either: the result contains more than 10 records). Maybe it's important to add that the fields are nested; a and b are actually doc.a and doc.b.
Is it a bug in the connector or do I have the wrong syntax?
The Spark Elasticsearch connector handles fields itself, so you cannot apply a projection through the query body.
If you wish to have fine-grained control over the mapping, you should use DataFrames instead, which are basically RDDs plus a schema.
Predicate pushdown should also be enabled to translate (push down) Spark SQL into the Elasticsearch Query DSL.
Now a semi-full example:
val myQuery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myQuery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10) // instead of size
.select("a","b") // instead of fields
What about calling:
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery", Map[String, String] ("es.read.field.include"->"a,b"))
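For completeness, roughly the same projection can be requested from PySpark through the DataFrame reader, since es.read.field.include and es.query are plain elasticsearch-hadoop settings. A sketch, with the index/type and field names as placeholders:

# my_query holds the Query DSL body, as in the question; "myindex/mytype" and the
# field list "a,b" are placeholders.
df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("es.read.field.include", "a,b") \
    .option("es.query", my_query) \
    .load("myindex/mytype")

df.show(10)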
You want to restrict the fields returned from the Elasticsearch _search HTTP API? (I guess to improve download speed.)
First of all, use an HTTP proxy to see what the elastic4hadoop plugin is doing (on macOS I use Apache Zeppelin with the Charles proxy). This will help you understand how pushdown works.
There are several solutions to achieve this:
1. DataFrame and pushdown
You specify fields, and the plugin will "forward" them to ES (here via the _source parameter):
POST ../events/_search?search_type=scan&scroll=5m&size=50&_source=client&preference=_shards%3A3%3B_local
(-) Not fully working for nested fields.
(+) Simple, straightforward, easy to read.
2. RDD & query fields
With JavaEsSpark.esRDD, you can specify fields inside the JSON query, like you did. This only works with RDDs (with a DataFrame, the fields are not sent).
(-) No DataFrame, so no Spark SQL way.
(+) More flexible, more control.
I am trying to set up a connection from Databricks to a Couchbase Server 4.5 instance and then run an N1QL query.
The Scala code below returns one record, but fails when the N1QL query is introduced. Any help is appreciated.
import com.couchbase.client.java.CouchbaseCluster;
import scala.collection.JavaConversions._;
import com.couchbase.client.java.query.Select.select;
import com.couchbase.client.java.query.dsl.Expression;
import com.couchbase.client.java.query.Query
// Connect to a cluster on localhost
val cluster = CouchbaseCluster.create("http://**************")
// Open the default bucket
val bucket = cluster.openBucket("travel-sample", "password");
// Read it back out
//val streamsense = bucket.get("airline_1004546") - Works and returns one record
// Create a DataFrame with schema inference
val ev = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
//Show the inferred schema
ev.printSchema()
//query using the data frame
ev
.select("id", "type")
.show(10)
//issue sql query for the same data (N1ql)
val query = "SELECT type, meta().id FROM `travel-sample` LIMIT 10"
sc
.couchbaseQuery(N1qlQuery.simple(query))
.collect()
.foreach(println)
In Databricks (and usually in any interactive Spark cloud environment) you do not define the cluster nodes, buckets, or the sc variable yourself; instead, you set the configuration settings for Spark to use when setting up the Databricks cluster. Use the advanced settings (Spark config) option, as sketched below.
I've only used this approach with Spark 2.0, so your mileage may vary.
You can remove your cluster and bucket variable initialisation as well.
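As a rough sketch of what goes into that Spark config box, the connector is driven by spark.couchbase.* properties along these lines (the node address and password are placeholders, and the exact key names depend on the connector version):

spark.couchbase.nodes 10.0.0.1
spark.couchbase.bucket.travel-sample password

With those set on the cluster, the provided sc/spark pick up the Couchbase connection, which is why the CouchbaseCluster and bucket initialisation can go.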
You have a syntax error in the N1QL query. You have:
val query = "SELECT type, id FROM `travel-sample` WHERE LIMIT 10"
You need to either remove the WHERE, or add a condition.
You also need to change id to META().id.
We can count all rows using the HBase shell with this command: count 'table_name', INTERVAL => 1, or simply count 'table_name'.
But how do we do this using Scala programming?
Although I have done this with the Java client for HBase, I researched it and found the following.
Java code snippet:
You can use KeyOnlyFilter() to get only the keys of the rows, and then loop like below:
// scanner is a ResultScanner from table.getScanner(new Scan().setFilter(new KeyOnlyFilter()))
long number = 0;
for (Result rs = scanner.next(); rs != null; rs = scanner.next()) {
    number++;
}
Like the above, you can use the Scala HBase example below.
Please look at the Java API; adaptation to Scala should be relatively easy. The example below shows part of the sample Java code adapted to Scala:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
val conf = new HBaseConfiguration()
val admin = new HBaseAdmin(conf)
// list the tables
val listtables = admin.listTables()
listtables.foreach(println)
// let's insert some data in 'mytable' and get the row
val table = new HTable(conf, "mytable")
val theput = new Put(Bytes.toBytes("rowkey1"))
theput.add(Bytes.toBytes("ids"), Bytes.toBytes("id1"), Bytes.toBytes("one"))
table.put(theput)
val theget = new Get(Bytes.toBytes("rowkey1"))
val result = table.get(theget)
val value = result.value()
println(Bytes.toString(value))
However, as additional information (and a better approach than the Java or Scala route), please see below.
RowCounter is a mapreduce job to count all the rows of a table. This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>
Usage: RowCounter [options]
<tablename> [
--starttime=[start]
--endtime=[end]
[--range=[startKey],[endKey]]
[<column1> <column2>...]
]
With the Java client, scanning the whole table with a key-only filter (KeyOnlyFilter) is effective. This way you only transfer row keys to your client code, not the data, so it will be faster. This is what count 'tablename' does in the shell, too.