Merge dataframe into Google BigQuery using Spark and Scala

I'm trying to achieve something similar to the following, but using Spark and Scala:
Updating BigQuery data using Java
https://cloud.google.com/bigquery/docs/updating-data
I want to update existing data and also insert new data into a BigQuery table. Any ideas if we can use some sort of DML within Spark to do an upsert operation against BigQuery?
I found that BigQuery supports MERGE, but I'm not sure if we can do something similar using Spark and Scala:
Google BQ - how to upsert existing data in tables?

The Spark API does not support upsert yet. The best workaround at the moment is to write the DataFrame to a temporary table, run a MERGE job, and then delete the temporary table.
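A minimal sketch of that workaround in Scala, assuming the spark-bigquery connector and the google-cloud-bigquery client library are on the classpath; the dataset, table and bucket names, the column list and the join key id are placeholders:

import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration}
import org.apache.spark.sql.DataFrame

object BigQueryUpsert {
  // Upsert df into my_dataset.target by staging it in a temporary table,
  // running a MERGE, and then dropping the staging table.
  def upsert(df: DataFrame): Unit = {
    // 1. Write the DataFrame to a temporary BigQuery table.
    df.write
      .format("bigquery")
      .option("temporaryGcsBucket", "my-staging-bucket") // placeholder bucket
      .mode("overwrite")
      .save("my_dataset.target_staging")

    // 2. Run a MERGE job from the staging table into the target table.
    val bq = BigQueryOptions.getDefaultInstance.getService
    val merge =
      """MERGE my_dataset.target T
        |USING my_dataset.target_staging S
        |ON T.id = S.id
        |WHEN MATCHED THEN UPDATE SET T.value = S.value
        |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (S.id, S.value)""".stripMargin
    bq.query(QueryJobConfiguration.newBuilder(merge).build())

    // 3. Delete the temporary table.
    bq.query(QueryJobConfiguration.newBuilder(
      "DROP TABLE my_dataset.target_staging").build())
  }
}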

Related

How to do a bulk insert/ bulkload into Hbase through Glue

I am trying to do a bulk insert or bulk load into HBase on EMR using Glue Scala (Spark 3.1). I got this working using
table.put(List<Put>);
but without satisfactory performance. I tried to insert through a Spark DataFrame following some examples, but those libraries are only compatible with Spark 1.6. I also tried to reproduce some examples that write HFiles into HDFS and process them through HFileOutputFormat and HFileOutputFormat2, but those classes were removed in newer versions. How can I perform a high-performance insert into HBase with current libraries, or even a bulk load? The examples I found were old, and the HBase Reference Guide wasn't clear on this point.
Thank you.
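For reference, a rough sketch of the HFile bulk-load route in Scala against HBase 2.x, where HFileOutputFormat2 and LoadIncrementalHFiles are still available; the table name SAMPLE, column family cf, staging path and sample data are placeholders, and this is only an outline of the approach, not a tested Glue job:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object HBaseBulkLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-bulkload").getOrCreate()
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val tableName = TableName.valueOf("SAMPLE")                 // placeholder table
    val table = connection.getTable(tableName)
    val regionLocator = connection.getRegionLocator(tableName)

    // Let HBase configure compression, block encoding and region splits for the job.
    val job = Job.getInstance(conf)
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

    val cf = Bytes.toBytes("cf")
    val col = Bytes.toBytes("value")
    // Cells must be written in row-key order, hence the sortByKey.
    val cells = spark.sparkContext
      .parallelize(Seq("row1" -> "a", "row2" -> "b"))           // placeholder data
      .sortByKey()
      .map { case (rowKey, value) =>
        val rk = Bytes.toBytes(rowKey)
        (new ImmutableBytesWritable(rk), new KeyValue(rk, cf, col, Bytes.toBytes(value)))
      }

    // Write HFiles to a staging directory, then hand them to HBase in one shot.
    val staging = "/tmp/hfiles-sample"
    cells.saveAsNewAPIHadoopFile(staging, classOf[ImmutableBytesWritable],
      classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)
    new LoadIncrementalHFiles(conf)
      .doBulkLoad(new Path(staging), connection.getAdmin, table, regionLocator)

    connection.close()
    spark.stop()
  }
}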

Upsert to Phoenix table in Apache Spark

Looking to find out whether anybody has found a way to perform upserts (append / update / partial insert/update) on Phoenix using Apache Spark. Per the Phoenix documentation, only SaveMode.Overwrite is supported for saving, which sounds like an overwrite with a full load. When I tried changing the mode, it threw an error.
Currently we have *.hql jobs that perform this operation; now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.
While the Phoenix connector indeed supports only SaveMode.Overwrite, the implementation doesn't conform to the Spark standard, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix just calls saveAsNewAPIHadoopFile with PhoenixOutputFormat, which internally builds the UPSERT query for you.
In other words, SaveMode.Overwrite with the Phoenix connector is in fact an UPSERT.
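A minimal sketch of what that looks like in practice, assuming the phoenix-spark module is on the classpath and a Phoenix table MY_TABLE(ID BIGINT PRIMARY KEY, NAME VARCHAR) already exists; the table name and ZooKeeper URL are placeholders:

import org.apache.spark.sql.{SaveMode, SparkSession}

object PhoenixUpsert {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phoenix-upsert").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "alice"), (2L, "bob")).toDF("ID", "NAME")

    // Despite the name, Overwrite here issues UPSERT statements, so existing rows
    // with the same primary key are updated and new rows are inserted; the table
    // is not truncated first.
    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .option("table", "MY_TABLE")
      .option("zkUrl", "zk-host:2181")   // placeholder ZooKeeper quorum
      .save()

    spark.stop()
  }
}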

GCP Dataproc spark consuming BigQuery

I'm very new to GCP (Google Cloud Platform), so I hope my question doesn't look too silly.
Background:
The main goal is to gather a few large tables from BigQuery and apply a few transformations. Because of the size of the tables, I'm planning to use Dataproc and deploy a PySpark script; ideally I would be able to use sqlContext to apply a few SQL queries to the DataFrames (tables pulled from BQ). Finally, I could easily dump this info into a file within a Cloud Storage bucket.
Questions:
Can I use import google.datalab.bigquery as bq within my PySpark script?
Is the proposed approach the most efficient, or should I consider another one? Keep in mind that I need to create many temporary queries, which is why I thought of Spark.
I expect to use pandas and bq to read the query results as a pandas DataFrame, following this example. Later, I might use Spark's sc.parallelize to transform the pandas DataFrame into a Spark DataFrame. Is this approach the right one?
my script
Update:
After a back and forth with #Tanvee, who kindly attended to this question, we concluded that GCP requires an intermediate staging step in Cloud Storage when you need to read BigQuery data into Dataproc. Briefly, your Spark or Hadoop job needs a temporary bucket where the table data is staged before it is brought into Spark.
References:
BigQuery Connector
Deployment
thanks so much
You will need to use the BigQuery connector for Spark. There are some examples in the GCP documentation here and here. It creates an RDD, which you can convert to a DataFrame, and then you will be able to perform all the typical transformations. Hope that helps.
You can directly use the following options to connect to a BigQuery table from Spark.
You can also use the spark-bigquery connector https://github.com/samelamin/spark-bigquery to run your queries directly on Dataproc using Spark.
https://github.com/GoogleCloudPlatform/spark-bigquery-connector is a new connector, currently in beta. It is a Spark data source API for BigQuery and is easy to use.
Please refer to the following link:
Dataproc + BigQuery examples - any available?
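As an illustration of the newer spark-bigquery-connector mentioned above, a minimal Scala sketch (the same options are available from PySpark); the project, dataset, table, bucket and column names are placeholders:

import org.apache.spark.sql.SparkSession

object BigQueryOnDataproc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bq-example").getOrCreate()

    // Read a BigQuery table straight into a DataFrame.
    val df = spark.read
      .format("bigquery")
      .load("my-project.my_dataset.my_table")

    // Run the usual Spark SQL transformations.
    df.createOrReplaceTempView("source")
    val result = spark.sql("SELECT col_a, COUNT(*) AS cnt FROM source GROUP BY col_a")

    // Writes are staged through a temporary Cloud Storage bucket,
    // which is the intermediate step mentioned in the update above.
    result.write
      .format("bigquery")
      .option("temporaryGcsBucket", "my-staging-bucket")
      .save("my_dataset.my_output_table")

    spark.stop()
  }
}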

How to use spark streaming to get data from HBASE table using scala

I am trying to identify a solution to read data from an HBase table using Spark Streaming and write the data to another HBase table.
I found numerous samples on the internet that show how to create a DStream to get data from HDFS files and the like, but I was unable to find any examples that get data from HBase tables.
For example, if I have an HBase table 'SAMPLE' with columns 'name' and 'activeStatus', how can I retrieve new data from the SAMPLE table based on the activeStatus column using Spark Streaming?
Any examples of retrieving data from an HBase table using Spark Streaming are welcome.
Regards,
Adarsh K S
You can connect to HBase from Spark in multiple ways:
Hortonworks Spark HBase connector (SHC): https://github.com/hortonworks-spark/shc
Unicredit hbase-rdd: https://github.com/unicredit/hbase-rdd
Hortonworks SHC reads HBase directly into a DataFrame using a user-defined catalog, whereas hbase-rdd reads it as an RDD that can be converted to a DataFrame using the toDF method. hbase-rdd has a bulk-write option (writing HFiles directly), which is preferred for massive data writes.
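A minimal sketch of the SHC route, assuming the shc-core package is on the classpath and an HBase table SAMPLE with a column family cf holding name and activeStatus (the catalog below is a guess at that layout):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object HBaseRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shc-read").getOrCreate()

    // Catalog mapping HBase row key and columns to DataFrame columns.
    val catalog =
      """{
        |  "table":{"namespace":"default", "name":"SAMPLE"},
        |  "rowkey":"key",
        |  "columns":{
        |    "id":{"cf":"rowkey", "col":"key", "type":"string"},
        |    "name":{"cf":"cf", "col":"name", "type":"string"},
        |    "activeStatus":{"cf":"cf", "col":"activeStatus", "type":"string"}
        |  }
        |}""".stripMargin

    val df = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    // Filter on activeStatus; simple filters are pushed down to HBase where possible.
    df.filter(df("activeStatus") === "Y").show()

    spark.stop()
  }
}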
What you need is a library that enables Spark to interact with HBase. Hortonworks' shc is such an extension:
https://github.com/hortonworks-spark/shc

Sending data from my spark code to redshift

I have a Spark job written in Scala. My code reads an XML file and extracts all the info in it. The goal is to store the info from the XML into Redshift tables.
Is it possible to send the data directly from my Scala Spark code to Redshift without using S3?
Cheers!
If you're using Spark SQL, you can read your XML data into a DataFrame using spark-xml and then write it into Redshift tables using spark-redshift.
You can also take a look at this question.
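A minimal sketch of that combination, assuming the spark-xml and spark-redshift packages are on the classpath; the row tag, JDBC URL, table name and S3 paths are placeholders:

import org.apache.spark.sql.{SaveMode, SparkSession}

object XmlToRedshift {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xml-to-redshift").getOrCreate()

    // Parse the XML into a DataFrame; "record" is a placeholder row tag.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("s3a://my-bucket/input/data.xml")

    // spark-redshift stages the data in S3 and then issues a COPY, so a tempdir
    // is still required even though you never manage the S3 files yourself.
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "my_table")
      .option("tempdir", "s3a://my-bucket/tmp/")
      .option("forward_spark_s3_credentials", "true")
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}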
You can do row-level inserts using pre-prepared SQL statements in your Python/Java code, but it will be extremely inefficient if you are going to insert more than a few records.