How to count all rows in an HBase table using Scala - scala

We can count all rows using the HBase shell with this command: count 'table_name', INTERVAL => 1, or just a simple count 'table_name'.
But how do we do this using Scala programming?

Although I have only done this with the Java client for HBase, I researched and found the following.
Java code snippet:
You can use KeyOnlyFilter() to get only the keys of the rows, and then loop like below:
for (Result rs = scanner.next(); rs != null; rs = scanner.next()) {
  number++;  // one Result per row, so just count them
}
Similarly, you can use the Scala HBase example below.
Please look at the Java API; adaptation to Scala should be relatively easy. The example below shows part of the sample Java code adapted to Scala:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
val conf = new HBaseConfiguration()
val admin = new HBaseAdmin(conf)
// list the tables
val listtables=admin.listTables()
listtables.foreach(println)
// let's insert some data in 'mytable' and get the row
val table = new HTable(conf, "mytable")
val theput= new Put(Bytes.toBytes("rowkey1"))
theput.add(Bytes.toBytes("ids"),Bytes.toBytes("id1"),Bytes.toBytes("one"))
table.put(theput)
val theget= new Get(Bytes.toBytes("rowkey1"))
val result=table.get(theget)
val value=result.value()
println(Bytes.toString(value))
However, as additional information (and a better way than plain Java or Scala), please see below.
RowCounter is a mapreduce job to count all the rows of a table. This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>
Usage: RowCounter [options]
<tablename> [
--starttime=[start]
--endtime=[end]
[--range=[startKey],[endKey]]
[<column1> <column2>...]
]
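For example, to count only a key range (using the options shown in the usage above; the table name and row keys are placeholders):
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable --range=row_0001,row_9999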

With the Java client, scanning the whole table with a KeyOnlyFilter is effective. This way you only transfer row keys to your client code, not the data, so it is faster. This is also what count 'tablename' does in the shell.
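For completeness, here is a minimal Scala sketch of that scan-with-KeyOnlyFilter approach, using the same older client API as the example above (the table name is a placeholder; this mirrors the idea, not the shell's exact implementation):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Scan}
import org.apache.hadoop.hbase.filter.KeyOnlyFilter

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "table_name")   // placeholder table name

// ask the region servers to strip the cell values and return only the keys
val scan = new Scan()
scan.setFilter(new KeyOnlyFilter())

val scanner = table.getScanner(scan)
var count = 0L
try {
  var rs = scanner.next()
  while (rs != null) {
    count += 1            // one Result per row
    rs = scanner.next()
  }
} finally {
  scanner.close()
  table.close()
}
println(s"row count: $count")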

Related

PySpark Couchbase connection

I'm trying to use PySpark to connect to our Couchbase server and query it. Essentially, what I'm trying to do is query Couchbase similarly to the following Scala code, but using Python (PySpark).
import com.couchbase.spark._
val query = "SELECT name FROM travel-sample WHERE type = 'airline' ORDER BY name ASC LIMIT 10"
sc
.couchbaseQuery(Query.simple(query))
.collect()
.foreach(println)
Does anyone have an example of doing this with Python code that they could post?

Recursively adding rows to a dataframe

I am new to Spark. I have some JSON data that comes as an HttpResponse. I need to store this data in Hive tables. Every HTTP GET request returns a JSON document that will be a single row in the table. Because of this, I am having to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can recursively add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this would also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track: what you want to do is obtain the single records as a Seq[DataFrame], and then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"

(0 until BatchSize)
  .map(_ => hiveContext.read.json("path"))
  .reduce(_ union _)
  .write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray
    .map(parameter => obtainRecord(parameter))
    .reduce(_ union _)

batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit in using Spark to modify a Hive database like this; it will only bring a severe performance penalty without benefiting from any of Spark's features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data first (for example with a script using curl or some other program) and store it in a file (or many files; Spark can load an entire directory at once), you can then load that file or those files into Spark all at once for processing. I would also check whether the web API has any endpoints to fetch all the data you need instead of just one record at a time.
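A minimal sketch of that directory-at-once approach (the path and table name here are placeholders, not from the question):
// assume the HTTP responses were saved beforehand as JSON files under one directory
val df = hiveContext.read.json("/data/responses")  // reads every JSON file in the directory
df.write.insertInto("table")                       // one write instead of many tiny files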

Reading ES from spark with elasticsearch-spark connector: all the fields are returned

I've done some experiments in the spark-shell with the elasticsearch-spark connector. Invoking spark:
] $SPARK_HOME/bin/spark-shell --master local[2] --jars ~/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar
In the scala shell:
scala> import org.elasticsearch.spark._
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery")
It works well; the result contains the right records, as specified in myquery. The only thing is that I get all the fields, even if I specify a subset of them in the query. Example:
myquery = """{"query":..., "fields":["a","b"], "size":10}"""
returns all the fields, not only a and b (by the way, I noticed that the size parameter is not taken into account either: the result contains more than 10 records). Maybe it is important to add that the fields are nested; a and b are actually doc.a and doc.b.
Is it a bug in the connector, or do I have the wrong syntax?
The Spark Elasticsearch connector handles fields itself, so you cannot apply a projection through the query this way.
If you want fine-grained control over the mapping, you should use DataFrames instead, which are basically RDDs plus a schema.
The pushdown predicate should also be enabled to translate (push down) Spark SQL into the Elasticsearch Query DSL.
Now, a semi-full example:
val myQuery = """{"query": ...}"""
val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("query", myQuery)
  .option("pushdown", "true")
  .load("myindex/mytype")
  .limit(10)        // instead of size
  .select("a", "b") // instead of fields
What about calling:
scala> val es_rdd = sc.esRDD("myindex/mytype", "myquery", Map("es.read.field.include" -> "a,b"))
Do you want to restrict the fields returned from the Elasticsearch _search HTTP API (I guess to improve download speed)?
First of all, use an HTTP proxy to see what the elasticsearch-hadoop plugin is doing (on macOS I use Apache Zeppelin with Charles Proxy). This will help you understand how pushdown works.
There are several solutions to achieve this:
1. dataframe and pushdown
You specify the fields, and the plugin will "forward" them to ES (here via the _source parameter):
POST ../events/_search?search_type=scan&scroll=5m&size=50&_source=client&preference=_shards%3A3%3B_local
(-) Does not fully work for nested fields.
(+) Simple, straightforward, easy to read
2. RDD & query fields
With JavaEsSpark.esRDD, you can specify fields inside the JSON query, as you did. This only works with RDDs (with a DataFrame, the fields are not sent).
(-) no DataFrame -> not the Spark way
(+) more flexible, more control

N1QL Query to connect databricks spark 1.6 to couchbase server 4.5

I am trying to set up a connection from Databricks to Couchbase Server 4.5 and then run an N1QL query.
The Scala code below returns one record but fails when the N1QL query is introduced. Any help is appreciated.
import com.couchbase.client.java.CouchbaseCluster;
import scala.collection.JavaConversions._;
import com.couchbase.client.java.query.Select.select;
import com.couchbase.client.java.query.dsl.Expression;
import com.couchbase.client.java.query.Query
// Connect to a cluster on localhost
val cluster = CouchbaseCluster.create("http://**************")
// Open the default bucket
val bucket = cluster.openBucket("travel-sample", "password");
// Read it back out
//val streamsense = bucket.get("airline_1004546") - Works and returns one record
// Create a DataFrame with schema inference
val ev = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))
//Show the inferred schema
ev.printSchema()
//query using the data frame
ev
.select("id", "type")
.show(10)
//issue sql query for the same data (N1ql)
val query = "SELECT type, meta().id FROM `travel-sample` LIMIT 10"
sc
.couchbaseQuery(N1qlQuery.simple(query))
.collect()
.foreach(println)
In Databricks (and in most interactive Spark cloud environments) you do not define the cluster nodes, buckets, or the sc variable in your code. Instead, you set the configuration for Spark to use when setting up the Databricks cluster, via the cluster's advanced settings (Spark config).
I've only used this approach with Spark 2.0, so your mileage may vary.
You can remove your cluster and bucket variable initialisation as well. A rough illustration of the kind of config entries follows below.
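As a sketch only: the key names below follow the Couchbase Spark connector's conventions, and the node address and bucket password are placeholders, so check the documentation for your connector version before relying on them:
com.couchbase.nodes 10.0.0.1
com.couchbase.bucket.travel-sample password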
You have a syntax error in the N1QL query. You have:
val query = "SELECT type, id FROM `travel-sample` WHERE LIMIT 10"
You need to either remove the WHERE, or add a condition.
You also need to change id to META().id.
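Putting both fixes together, a corrected version of the query could look like this (the condition value is only an example):
val query = "SELECT type, META().id FROM `travel-sample` LIMIT 10"
// or, keeping a WHERE clause with an actual condition (value is illustrative):
// val query = "SELECT type, META().id FROM `travel-sample` WHERE type = 'airline' LIMIT 10"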

Spark DataFrame filtering: retain element belonging to a list

I am using Spark 1.5.1 with Scala on Zeppelin notebook.
I have a DataFrame with a column called userID of type Long.
In total I have about 4 million rows and 200,000 unique userIDs.
I also have a list of 50,000 userIDs to exclude.
I can easily build the list of userIDs to retain.
What is the best way to delete all the rows that belong to the users to exclude?
Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?
I saw this post and applied its solution (see the code below), but the execution is slow, even though I am running Spark 1.5.1 on my local machine with a decent 16 GB of RAM and the initial DataFrame fits in memory.
Here is the code that I am applying:
import org.apache.spark.sql.functions.lit
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))
In the code above:
the initialDataFrame has 3,885,068 rows; each row has 5 columns, one of which is called userID and contains Long values.
The listOfUsersToKeep is an Array[Long] containing 150,000 Long userIDs.
I wonder if there is a more efficient solution than the one I am using.
Thanks
You can either use a join:
val usersToKeep = sc.parallelize(
    listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")

val finalDataFrame = usersToKeep
  .join(initialDataFrame, $"userID" === $"userID_")
  .drop("userID_")
or a broadcast variable and a UDF:
import org.apache.spark.sql.functions.udf
val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
It should also be possible to broadcast a DataFrame:
import org.apache.spark.sql.functions.broadcast
initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
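If you would rather filter with the exclusion list directly (the 50,000 userIDs from the question), the same broadcast-plus-UDF idea works with the condition negated; a minimal sketch, assuming a listOfUsersToExclude: Array[Long]:
import org.apache.spark.sql.functions.udf

// broadcast the exclusion set and keep only the rows whose userID is NOT in it
val usersToExcludeBD = sc.broadcast(listOfUsersToExclude.toSet)
val keepRow = udf((id: Long) => !usersToExcludeBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(keepRow($"userID"))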