We are moving from HDInsight 3.6 to 4.0. The problem is that in 4.0 I am unable to read Hive tables using Spark. Can anyone help me with the Hive-Spark integration?
In HDInsight 4.0, Spark and Hive use separate metastore catalogs, so you need to use the Hive Warehouse Connector (HWC) to access Hive tables from Spark. Additional information is available in these articles:
Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector
Migrate Azure HDInsight 3.6 Hive workloads to HDInsight 4.0
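For reference, here is a minimal sketch of what the Spark side typically looks like with the Hive Warehouse Connector, assuming the connector JAR and the spark.sql.hive.hiveserver2.jdbc.url / spark.datasource.hive.warehouse.* settings described in the first article are already configured on the cluster. The database and table names are placeholders:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.hortonworks.hwc.HiveWarehouseSession;

SparkSession spark = SparkSession.builder()
        .appName("HWC example")
        .getOrCreate();

// Build an HWC session; queries go through HiveServer2 Interactive (LLAP)
// rather than Spark's own catalog, which is why plain spark.sql() no longer
// sees the Hive managed tables in HDInsight 4.0.
HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();

// Placeholder database and table names.
Dataset<Row> df = hive.executeQuery("SELECT * FROM mydb.mytable LIMIT 10");
df.show();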
I am trying to configure Kafka Connect in distributed mode, but I could not find any JARs for setting up Kafka Connect on Azure HDInsight. Can you please help me with the above query?
https://techcommunity.microsoft.com/t5/analytics-on-azure/kafka-connect-with-hdinsight-managed-kafka/ba-p/1547013
Download the relevant Kafka Connect plugins from Confluent Hub into the directory referenced by your plugin.path setting (see the sketch after the link below). For example, the plugin for streaming data from Kafka to Azure Blob Storage is the Azure Blob Storage Sink Connector:
https://www.confluent.io/hub/confluentinc/kafka-connect-azure-blob-storage
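As a rough sketch of the worker configuration (broker host names, topic names, and the plugin path below are placeholders or typical defaults, not HDInsight-specific values), the distributed worker properties would point plugin.path at the directory where you extracted the Confluent Hub download:
# connect-distributed.properties (illustrative values only)
# Your Kafka broker hosts (placeholders)
bootstrap.servers=wn0-kafka:9092,wn1-kafka:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
# Directory where the Confluent Hub plugin archives were extracted (placeholder path)
plugin.path=/usr/local/share/kafka/plugins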
I'm struggling to get Cosmos DB up and running after upgrading Spring Boot to 2.3.3.
Is there a compatibility chart, please?
Cosmos DB has support for Spring Boot 2.3.x. There is a helpful getting-started guide in the Spring Data Cosmos README on GitHub that may point to your issue:
https://github.com/Azure/azure-sdk-for-java/tree/master/sdk/cosmos/azure-spring-data-cosmos
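As a quick illustration of the pattern from that README (the class and field names here are hypothetical, and the annotation/interface names are from azure-spring-data-cosmos 3.x as I recall them, so verify against the guide), a minimal entity and repository look roughly like this:
// User.java - hypothetical entity mapped to a Cosmos container
import com.azure.spring.data.cosmos.core.mapping.Container;
import com.azure.spring.data.cosmos.core.mapping.PartitionKey;
import org.springframework.data.annotation.Id;

@Container(containerName = "users")
public class User {
    @Id
    private String id;
    @PartitionKey
    private String lastName;
    // getters and setters omitted for brevity
}

// UserRepository.java - Spring Data repository backed by Cosmos DB
import com.azure.spring.data.cosmos.repository.CosmosRepository;

public interface UserRepository extends CosmosRepository<User, String> {
}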
We have a Spark Streaming application running on an EMR cluster, and we need to store the streaming data to Google Cloud Storage in Parquet format. Can anyone help?
To connect to Google Cloud Storage (GCS) using Spark on EMR, you need to include the Google Cloud Storage connector in your application JAR. You can also add the JAR to the Hadoop classpath on the EMR cluster, but bundling the GCS connector into your application JAR is the quickest and easiest approach.
You can get the Google Cloud Storage connector here:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
It has connectors for Hadoop 1.x, 2.x, and 3.x.
Once you have the JAR, add the following properties in your Spark application:
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Enable service-account authentication for the GCS connector and point it at your key file.
SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.hadoop.google.cloud.auth.service.account.enable", "true");
sparkConf.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "<path to your google cloud key>");

SparkSession spark = SparkSession.builder()
        .appName("My spark application")
        .config(sparkConf)
        .getOrCreate();

// Register the GCS connector classes behind the gs:// filesystem scheme.
spark.sparkContext().hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
spark.sparkContext().hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
You can put the GCP key on your EMR cluster using a simple bootstrap action script that copies the key from an S3 location to a local path.
Bundle the Cloud Storage connector JAR with your application, and you can then read and write using the "gs" filesystem, as in the sketch below.
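For the streaming part of the question, a minimal Structured Streaming sketch that writes Parquet to GCS could look like this. The bucket name and paths are placeholders, and the rate source only stands in for your real input stream:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

// Placeholder input stream; replace with your actual source (Kafka, Kinesis, etc.).
Dataset<Row> stream = spark.readStream().format("rate").load();

// Write the stream to GCS as Parquet; bucket and paths are placeholders.
// In real code, handle the checked exceptions that start()/awaitTermination()
// may declare, depending on your Spark version.
StreamingQuery query = stream.writeStream()
        .format("parquet")
        .option("path", "gs://my-bucket/streaming-output/")
        .option("checkpointLocation", "gs://my-bucket/checkpoints/streaming-output/")
        .outputMode("append")
        .start();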
I got this working with Hadoop 3.x and Spark 3.x on EMR 6.3.0.
I am not sure how you process streaming data in EMR. In any case, you can always use a custom Python script with the Google Cloud Storage client library to connect to GCS and push your data there. You can also run that script as PySpark code to speed up the process:
https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/read-write-to-cloud-storage
This official Google Cloud guide on migrating from Amazon S3 to Cloud Storage may be helpful:
https://cloud.google.com/storage/docs/migrating
Is there a way to see all the queries that are in the Oracle Connector stages of my DataStage project? I am using DataStage 11.3.
No, not natively. You could export your project and parse the export for all of the SQL statements (this could of course be done by a DataStage job), or you might be able to query it if you have IGC (Information Governance Catalog) in place.
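As a very rough illustration of the parsing idea (not DataStage-specific: the property names and layout inside a project export vary by version and format, so the file name and the pattern below are placeholders you would adapt after inspecting your own export), a naive scan for SQL statements could look like this:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExportSqlScan {
    public static void main(String[] args) throws IOException {
        // Path to your exported project file (placeholder name).
        String export = Files.readString(Paths.get("myproject.dsx"));

        // Naive pattern: grab anything that looks like a SELECT/INSERT/UPDATE/DELETE statement,
        // up to the next quote or end of input. Adapt after looking at the real export layout.
        Pattern sql = Pattern.compile("(?is)\\b(SELECT|INSERT|UPDATE|DELETE)\\b.*?(?=\"|$)");
        Matcher m = sql.matcher(export);
        while (m.find()) {
            System.out.println(m.group().trim());
            System.out.println("----");
        }
    }
}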
I'm currently working on a recommender system using PySpark and IPython Notebook. I want to get recommendations from data stored in BigQuery. There are two options: the Spark BQ connector and the Python BQ library.
What are the pros and cons of these two tools?
The Python BQ library is the standard way to interact with BigQuery from Python, so it exposes the full API capabilities of BigQuery. The Spark BQ connector you mention is the Hadoop connector: a Java Hadoop library that lets you read from and write to BigQuery through abstracted Hadoop classes, which more closely resembles how you interact with native Hadoop inputs and outputs.
You can find example usage of the Hadoop Connector here.
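For a flavour of the Hadoop-connector style, here is a rough Java sketch. The project, bucket, and table IDs are placeholders, and the class names are from the com.google.cloud.hadoop.io.bigquery connector as I recall them, so double-check them against the linked examples:
import java.io.IOException;
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;
import com.google.gson.JsonObject;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class BigQueryHadoopExample {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder().appName("BQ Hadoop connector example").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        Configuration conf = jsc.hadoopConfiguration();

        // Placeholder project ID and temporary GCS bucket used by the connector.
        conf.set("mapred.bq.project.id", "my-project-id");
        conf.set("mapred.bq.gcs.bucket", "my-temp-bucket");

        // Fully qualified table ID in project:dataset.table form (placeholder values).
        BigQueryConfiguration.configureBigQueryInput(conf, "my-project-id:my_dataset.my_table");

        // Each record arrives as a (row id, JSON row) pair, mirroring a normal Hadoop InputFormat.
        JavaPairRDD<LongWritable, JsonObject> tableData = jsc.newAPIHadoopRDD(
                conf, GsonBigQueryInputFormat.class, LongWritable.class, JsonObject.class);

        System.out.println("Row count: " + tableData.count());
        spark.stop();
    }
}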