I am using HBase version 1.2.0-cdh5.8.2 and Spark version 1.6.0.
I am using the toHBaseTable() method of the it.nerdammer.spark.hbase package to save a Spark RDD to HBase.
import it.nerdammer.spark.hbase._

// Row keys are strings here; a bare 001 literal is an octal int and no longer compiles on Scala 2.11+
val experiencesDataset = sc.parallelize(Seq(("001", null.asInstanceOf[String]), ("001", "2016-10-25")))
// The connector also needs the target columns and column family ("date" and "cf" are placeholders)
experiencesDataset.toHBaseTable(experienceTableName).toColumns("date").inColumnFamily("cf").save()
But I want to save the data in HBase using Spark with a bulk load.
I am not able to understand how to use the bulk-load option. Please assist me.
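For reference, the classic bulk-load path with the plain HBase 1.2 API looks roughly like this: write sorted KeyValues as HFiles with HFileOutputFormat2, then hand them to the region servers with LoadIncrementalHFiles. This is only a sketch; the column family "cf", the qualifier "date", and the output path are assumptions, not part of the original question.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
val tableName = TableName.valueOf(experienceTableName)
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(tableName)
val regionLocator = conn.getRegionLocator(tableName)

// HFiles must be written in row-key order, so sort before converting to KeyValues
val kvs = experiencesDataset
  .filter { case (_, date) => date != null }
  .sortByKey()
  .map { case (id, date) =>
    val row = Bytes.toBytes(id)
    val kv = new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("date"), Bytes.toBytes(date))
    (new ImmutableBytesWritable(row), kv)
  }

// Write the HFiles to a staging directory, then load them into the regions
val job = Job.getInstance(hbaseConf)
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)
kvs.saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

new LoadIncrementalHFiles(hbaseConf)
  .doBulkLoad(new Path("/tmp/hfiles"), conn.getAdmin, table, regionLocator)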
Related
I am trying to do a bulk insert or bulk load into HBase on EMR using Glue Scala (Spark 3.1). I got this working using
table.put(List<Put>);
without satisfactory performance. I tried to insert through a Spark DataFrame following some examples, but those libraries are only compatible with Spark 1.6. I also tried to reproduce some examples that write HFiles into an HDFS environment and process them through HFileOutputFormat and HFileOutputFormat2, but these classes were removed from newer versions. How can I perform a high-performance insert into HBase with current libraries, or even a bulk load? The examples I found were old, and the HBase Reference Guide wasn't clear on this point.
Thank you.
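One note on the removed classes: in HBase 2.x, HFileOutputFormat2 still lives in org.apache.hadoop.hbase.mapreduce, while the old LoadIncrementalHFiles tool was replaced by BulkLoadHFiles. A minimal sketch of the final load step, assuming the HFiles were already written as in the previous sketch and using made-up table and path names:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.tool.BulkLoadHFiles

// HBase 2.x replacement for LoadIncrementalHFiles; table and path are hypothetical
val conf = HBaseConfiguration.create()
BulkLoadHFiles.create(conf).bulkLoad(TableName.valueOf("mytable"), new Path("/tmp/hfiles"))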
I'm trying to achieve something similar using Spark and Scala.
Updating BigQuery data using Java
https://cloud.google.com/bigquery/docs/updating-data
I want to update existing data and also insert new data into a BigQuery table. Any ideas if we can use some sort of DML within Spark to do an upsert operation against BigQuery?
I found that BigQuery supports MERGE, but I'm not sure if we can do something similar using Spark and Scala.
Google BQ - how to upsert existing data in tables?
The Spark API does not support upsert yet. The best workaround at the moment is to write the DataFrame to a temporary table, run a MERGE job, and then delete the temporary table.
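A minimal sketch of that workaround in Scala, assuming the spark-bigquery connector and the google-cloud-bigquery Java client; the dataset, table, column, and bucket names are made up for illustration:

import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration, TableId}

// 1. Write the DataFrame to a temporary table (spark-bigquery connector)
df.write
  .format("bigquery")
  .option("temporaryGcsBucket", "my-staging-bucket")
  .mode("overwrite")
  .save("mydataset.target_tmp")

// 2. Run a MERGE job that upserts the temp table into the target table
val bq = BigQueryOptions.getDefaultInstance.getService
val merge =
  """MERGE mydataset.target T
    |USING mydataset.target_tmp S
    |ON T.id = S.id
    |WHEN MATCHED THEN UPDATE SET T.value = S.value
    |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (S.id, S.value)""".stripMargin
bq.query(QueryJobConfiguration.newBuilder(merge).build())

// 3. Delete the temporary table
bq.delete(TableId.of("mydataset", "target_tmp"))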
I am trying to identify a solution to read data from an HBase table using Spark Streaming and write the data to another HBase table.
I found numerous samples on the internet that ask you to create a DStream to get data from HDFS files and the like, but I was unable to find any examples that get data from HBase tables.
For example, if I have an HBase table 'SAMPLE' with columns 'name' and 'activeStatus', how can I retrieve the new data from the table SAMPLE based on the activeStatus column using Spark Streaming?
Any examples of retrieving data from an HBase table using Spark Streaming are welcome.
Regards,
Adarsh K S
You can connect to HBase from Spark in multiple ways:
Hortonworks Spark HBase connector (SHC): https://github.com/hortonworks-spark/shc
Unicredit hbase-rdd: https://github.com/unicredit/hbase-rdd
Hortonworks SHC reads HBase directly into a DataFrame using a user-defined catalog, whereas hbase-rdd reads it as an RDD that can be converted to a DataFrame with the toDF method. hbase-rdd also has a bulk-write option (writing HFiles directly), which is preferred for massive writes; see the sketch below.
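A sketch of the hbase-rdd bulk-write path, based on the project's README; treat the exact method name (toHBaseBulk) and signature, as well as the table and column family names, as assumptions:

import unicredit.spark.hbase._

// hbase-rdd needs an implicit HBaseConfig in scope
implicit val config: HBaseConfig = HBaseConfig()

// RDD[(rowKey, Map[qualifier -> value])], written directly as HFiles
val rows = sc.parallelize(Seq(
  ("001", Map("date" -> "2016-10-25"))
))
rows.toHBaseBulk("experiences", "cf")   // table and column family are hypothetical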
What you need is a library that enables Spark to interact with HBase. Hortonworks' shc is such an extension:
https://github.com/hortonworks-spark/shc
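For example, reading the 'SAMPLE' table from the question above into a DataFrame via a user-defined catalog would look roughly like this (a sketch; the column family "cf1" is an assumption):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Catalog mapping the HBase table 'SAMPLE' to DataFrame columns
val catalog =
  """{
    |  "table":{"namespace":"default", "name":"SAMPLE"},
    |  "rowkey":"key",
    |  "columns":{
    |    "name":{"cf":"rowkey", "col":"key", "type":"string"},
    |    "activeStatus":{"cf":"cf1", "col":"activeStatus", "type":"string"}
    |  }
    |}""".stripMargin

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.filter(df("activeStatus") === "active").show()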
How can I write a Spark DataFrame to DynamoDB using the emr-dynamodb-connector and Python?
I can't find how to create a new JobConf with PySpark.
I want to avoid writing the entire stream to a file and then loading it into a DataFrame. What's the right way?
You can check the Spark Streaming SqlNetworkWordCount example, which shows that your problem can be solved by creating a singleton instance of SparkSession using the SparkContext of the streaming job.
You should get a better idea by going through the links above, where DataFrames are created from streaming RDDs.
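Roughly, the pattern from SqlNetworkWordCount looks like this (a sketch; the stream name and column are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Lazily instantiated singleton SparkSession, as in the SqlNetworkWordCount example
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _
  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession.builder.config(sparkConf).getOrCreate()
    }
    instance
  }
}

// Inside the streaming job: turn each micro-batch RDD into a DataFrame
lines.foreachRDD { rdd =>
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._
  val df = rdd.toDF("line")
  df.createOrReplaceTempView("lines")
  spark.sql("SELECT count(*) FROM lines").show()
}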