How to use Spark Streaming to get data from an HBase table using Scala

I am trying to identify a solution to read data from an HBase table using Spark Streaming and write the data to another HBase table.
I found numerous samples on the internet that create a DStream from HDFS files, but I was unable to find any examples that get data from HBase tables.
For example, if I have an HBase table 'SAMPLE' with columns 'name' and 'activeStatus', how can I retrieve new data from the table SAMPLE based on the activeStatus column using Spark Streaming?
Any examples of retrieving data from an HBase table using Spark Streaming are welcome.
Regards,
Adarsh K S

You can connect to HBase from Spark in multiple ways:
Hortonworks Spark HBase connector (SHC):
https://github.com/hortonworks-spark/shc
Unicredit hbase-rdd: https://github.com/unicredit/hbase-rdd
Hortonworks SHC reads HBase directly into a DataFrame using a user-defined catalog, whereas hbase-rdd reads it as an RDD, which can then be converted to a DataFrame with the toDF method. hbase-rdd also has a bulk-write option (writing HFiles directly), which is preferred for massive writes.
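For illustration, a minimal sketch of reading the SAMPLE table from the question into a DataFrame with SHC and filtering on activeStatus. The column family name ("cf") and the value used in the filter are assumptions; the catalog must match how the table was actually created:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("shc-read").getOrCreate()

// Catalog mapping HBase row key and columns to DataFrame columns.
// The column family "cf" is an assumption; adjust it to the real schema.
val catalog =
  """{
    |  "table":{"namespace":"default", "name":"SAMPLE"},
    |  "rowkey":"key",
    |  "columns":{
    |    "key":{"cf":"rowkey", "col":"key", "type":"string"},
    |    "name":{"cf":"cf", "col":"name", "type":"string"},
    |    "activeStatus":{"cf":"cf", "col":"activeStatus", "type":"string"}
    |  }
    |}""".stripMargin

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// "true" is a placeholder for whatever value marks active rows in the table.
df.filter(df("activeStatus") === "true").show()
```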

What you need is a library that enables Spark to interact with HBase. Hortonworks' SHC is such an extension:
https://github.com/hortonworks-spark/shc

Related

Merge dataframe into Google BigQuery using Spark and Scala

I'm trying to achieve something similar to this, using Spark and Scala:
Updating BigQuery data using Java
https://cloud.google.com/bigquery/docs/updating-data
I want to update existing data and also insert new data into a BigQuery table. Any ideas if we can use some sort of DML within Spark to do an upsert operation against BigQuery?
I found that BigQuery supports MERGE, but I'm not sure if we can do something similar using Spark and Scala.
Google BQ - how to upsert existing data in tables?
The Spark API does not support upsert yet. The best workaround at the moment is to write the DataFrame to a temporary table, run a MERGE job, and then delete the temporary table.
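A minimal sketch of that workaround, assuming the spark-bigquery connector for the staging write and the BigQuery Java client for the MERGE; the dataset, table, bucket and column names are made up:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration, TableId}

// 1. Stage the DataFrame in a temporary BigQuery table
//    (temporaryGcsBucket is needed by the connector's indirect write path).
df.write
  .format("bigquery")
  .option("table", "mydataset.staging_upsert")
  .option("temporaryGcsBucket", "my-gcs-bucket")
  .mode("overwrite")
  .save()

// 2. Run a MERGE from the staging table into the target table.
val bigquery = BigQueryOptions.getDefaultInstance.getService
val mergeSql =
  """MERGE mydataset.target t
    |USING mydataset.staging_upsert s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET t.value = s.value
    |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)""".stripMargin
bigquery.query(QueryJobConfiguration.newBuilder(mergeSql).build())

// 3. Drop the staging table.
bigquery.delete(TableId.of("mydataset", "staging_upsert"))
```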

How to save output from a Cassandra table using Spark

I want to save the output/rows read from a Cassandra table to a file in either CSV or JSON format. Using Spark 1.6.3:
scala> val results = sqlContext.sql("select * from myks.mytable")
scala> results.write.option("header","true").save("/tmp/xx.csv") -- writes to the cfs:// filesystem
I am not able to find an option to write to the OS as a CSV or JSON format file.
Appreciate any help!
Use the Spark Cassandra connector to read data from a Cassandra table into Spark:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md
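A minimal sketch of reading the table through the connector's DataFrame API and writing CSV to the local filesystem instead of cfs://. This assumes Spark 1.6 with the spark-csv package; the explicit file:// scheme is what keeps the output off the cluster filesystem:

```scala
// Read from Cassandra via the spark-cassandra-connector datasource
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "mytable"))
  .load()

// Without a scheme the path defaults to the cluster filesystem (cfs:// on DSE);
// file:// forces the local OS filesystem. coalesce(1) makes a single node
// write one part file instead of one per partition.
df.coalesce(1)
  .write
  .format("com.databricks.spark.csv")   // spark-csv package for Spark 1.6
  .option("header", "true")
  .save("file:///tmp/mytable_csv")
```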

Bulk Load in HBase Using Spark

I am using HBase Version 1.2.0-cdh5.8.2 and Spark version 1.6.0.
I am using the toHBaseTable() method of the it.nerdammer.spark.hbase package to save a Spark RDD to HBase.
val experiencesDataset = sc.parallelize(Seq((001, null.asInstanceOf[String]), (001, "2016-10-25")))
experiencesDataset.toHBaseTable(experienceTableName).save()
But I want to save data to HBase using Spark via bulk load.
I am not able to understand how to use the bulk load option. Please assist me.

Data Analysis Scala on Spark

I am new to Scala, and I have to use Scala and Spark's SQL, MLlib and GraphX in order to perform some analysis on a huge data set. The analyses I want to do are:
Customer Lifetime Value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located in a Hadoop cluster.
My question is about the optimal approach to access the data and perform the above calculations:
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? or
Should I load the data from the CSV file, convert it into an RDD and then work on the RDD? or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
A DataFrame gives you SQL-like syntax to work with the data, whereas an RDD gives you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine (the Catalyst optimiser) will optimise your queries, just like a SQL query optimiser would. This is not available for RDDs.
As you are new to Scala, it is highly recommended to use the DataFrame API initially and pick up the RDD API later, based on requirements.
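For illustration, a minimal sketch of the same aggregation expressed both ways; the DataFrame transactionsDf and its column names are made up:

```scala
// DataFrame: declarative, SQL-like, optimised by Catalyst
val totalsDf = transactionsDf.groupBy("customerId").sum("amount")

// RDD: Scala-collection-style transformations, no automatic query optimisation
val totalsRdd = transactionsDf.rdd
  .map(row => (row.getAs[String]("customerId"), row.getAs[Double]("amount")))
  .reduceByKey(_ + _)
```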
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It can automatically infer data types, and if the file has a header it can use that for the column names; otherwise you can construct a schema using StructType.
https://github.com/databricks/spark-csv
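A minimal sketch of reading the 60 GB CSV with spark-csv on Spark 1.x (launched with --packages com.databricks:spark-csv_2.10:1.5.0); the HDFS path is hypothetical:

```scala
val transactionsDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // sample the data to infer column types
  .load("hdfs:///data/transactions/*.csv")
```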
Update:
If you are using Spark 2.0, it supports the CSV datasource by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use it:
https://github.com/databricks/spark-csv/issues/367
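With Spark 2.0+, the same read needs no external package; a minimal sketch with the same hypothetical path:

```scala
val transactionsDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/transactions/*.csv")
```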

Sending data from my Spark code to Redshift

I have Spark code written in Scala. My code reads an XML file and extracts all the info in it. The goal is to store the info from the XML in Redshift tables.
Is it possible to send the data directly from my Scala Spark code to Redshift without using S3?
Cheers!
If you're using Spark SQL, you can read your XML data into a DataFrame using spark-xml and then write it into Redshift tables using spark-redshift.
You can also take a look at this question.
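A minimal sketch of that read/write path, assuming the spark-xml and spark-redshift packages; the row tag, paths, JDBC URL and table name are placeholders:

```scala
// Read the XML into a DataFrame with spark-xml ("record" is a placeholder row tag)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("hdfs:///data/input.xml")

// Write to Redshift with spark-redshift; note that the connector stages the data
// in the S3 location given by tempdir before issuing a COPY into Redshift.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/dev?user=me&password=secret")
  .option("dbtable", "xml_data")
  .option("tempdir", "s3n://my-bucket/tmp")
  .mode("append")
  .save()
```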
You can do row-level inserts using prepared SQL statements from your Python/Java code, but it will be extremely inefficient if you are going to insert more than a few records.