Dataproc: What is the primary use case of local Hive metastore? - google-cloud-dataproc

By default, Dataproc uses a local MySQL database (image versions 1.5+) on the master node as the Hive table metadata store.
I do not fully understand the primary use case of this local metadata store.
What are the benefits of using it and the drawbacks of not using it?

There are 3 deployment modes for Hive Metastore on Dataproc:
1. In-cluster MySQL and Hive Metastore. This is the default. The lifecycle of the Hive metadata (table schemas) is the same as the cluster's. A typical use case: your input data lives in GCS and you want the output data in GCS as well. In your Hive script you first create external tables for the input and output data, then query the input table with some transformations and insert the result into the output table (a sketch follows after this list). Once the query is done, the table metadata is no longer needed.
2. External MySQL, in-cluster Hive Metastore. In this deployment you store Hive metadata in an external MySQL instance, typically a Cloud SQL instance, and the in-cluster Hive Metastore uses that external MySQL instance as its underlying database. See this doc for more details.
3. External MySQL and Hive Metastore. This is the recommended mode. In this deployment there is no MySQL or Hive Metastore in the cluster; HiveServer2 talks to an external Hive Metastore, typically the Dataproc Metastore Service. See this doc for details.
Choose mode 1 when you don't need the Hive metadata to outlive the cluster; choose mode 2 or 3 when you do.
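For reference, here is a minimal sketch of the mode 1 pattern, using the google-cloud-dataproc Python client to submit a Hive job to an existing cluster. Project, region, cluster, bucket, and the table schema are placeholders, not values from the question.

from google.cloud import dataproc_v1

PROJECT, REGION, CLUSTER = "my-project", "us-central1", "my-cluster"  # placeholders

# Both tables are external and point at GCS, so only their metadata lives in the
# cluster-local metastore; the data itself outlives the cluster.
HIVE_QUERY = """
CREATE EXTERNAL TABLE IF NOT EXISTS events_in (id BIGINT, payload STRING)
  STORED AS PARQUET LOCATION 'gs://my-bucket/input/';
CREATE EXTERNAL TABLE IF NOT EXISTS events_out (id BIGINT, payload STRING)
  STORED AS PARQUET LOCATION 'gs://my-bucket/output/';
INSERT OVERWRITE TABLE events_out
  SELECT id, upper(payload) FROM events_in;
"""

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)
operation = job_client.submit_job_as_operation(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "job": {
            "placement": {"cluster_name": CLUSTER},
            "hive_job": {"query_list": {"queries": [HIVE_QUERY]}},
        },
    }
)
operation.result()  # after this, the cluster (and its local metastore) can be deleted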

Related

ETL using apache pyspark and airflow

We are developing an ETL tool using Apache PySpark and Apache Airflow.
Apache Airflow will be used for workflow management.
Can Apache PySpark handle huge volumes of data?
Can I get extract/transform counts from Apache Airflow?
Yes, Apache (Py)Spark is built for dealing with big data.
There is no magic out-of-the-box solution for getting metrics from PySpark into Airflow.
Some solutions for #2 are:
Writing metrics from PySpark to another system (e.g. a database, blob storage, ...) and reading those in a second task in Airflow
Returning the values from the PySpark jobs and pushing them into Airflow XCom (see the sketch below)
My 2c: don't process large data in Airflow itself as it's built for orchestration and not data processing. If the intermediate data becomes big, use a dedicated storage system for that (database, blob storage, etc...). XComs are stored in the Airflow metastore itself (although custom XCom backends to store data in other systems are supported since Airflow 2.0 https://www.astronomer.io/guides/custom-xcom-backends), so make sure the data isn't too big if you're storing it in the Airflow metastore.
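A minimal sketch of the second option, using the Airflow TaskFlow API (assumes Airflow 2.4+ and PySpark are installed; DAG, task, and path names are illustrative, not from the question). The task returns the counts, which Airflow pushes to XCom automatically, and a downstream task reads them. Spark runs in local mode inside the task only for brevity; in practice you would usually submit the job to a cluster.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def etl_with_metrics():

    @task
    def extract_transform_load() -> dict:
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("etl").getOrCreate()
        df = spark.read.json("/data/input/")                            # extract
        transformed = df.dropDuplicates()                               # transform
        transformed.write.mode("overwrite").parquet("/data/output/")    # load
        counts = {"extracted": df.count(), "loaded": transformed.count()}
        spark.stop()
        # The returned dict is pushed to XCom automatically - keep it small,
        # since XCom lives in the Airflow metastore by default.
        return counts

    @task
    def report(counts: dict):
        # Downstream task reads the metrics pulled from XCom.
        print(f"extracted={counts['extracted']}, loaded={counts['loaded']}")

    report(extract_transform_load())


etl_with_metrics()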

kafka-connect jdbc distributed mode

We are working on building a Kafka Connect application using the JDBC source connector in incrementing+timestamp mode. We tried standalone mode and it works as expected. Now we would like to switch to distributed mode.
When we have a single Hive table as a source, how will the tasks be distributed among the workers?
The problem we faced: when we run the application in multiple instances, every instance queries the table and fetches the same rows again. Will parallelism work in this case? If so,
how will the tasks coordinate with each other on the current status of the table?
The tasks.max parameter does not make any difference for the kafka-connect-jdbc source/sink connector; there is no occurrence of this property in the source code of the JDBC connector project.
Consult the JDBC source configuration options for the properties this connector does support.
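For reference, a hedged sketch of what registering such a connector once against a distributed Connect cluster's REST API could look like (connection URL, table, column, and topic names are placeholders, not values from the question):

import requests

connector = {
    "name": "jdbc-source-orders",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:hive2://hive-server:10000/default",   # placeholder
        "mode": "timestamp+incrementing",
        "incrementing.column.name": "id",
        "timestamp.column.name": "updated_at",
        "table.whitelist": "orders",
        "topic.prefix": "jdbc-",
        "poll.interval.ms": "10000",
        # A single table (or custom query) is always read by a single task,
        # so raising tasks.max adds no parallelism in this scenario.
        "tasks.max": "1",
    },
}

resp = requests.post("http://connect-worker:8083/connectors", json=connector)
resp.raise_for_status()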

Confluent Kafka Connect : Run multiple sink connectors in synchronous way

We are using the Kafka Connect S3 sink connector, which connects to Kafka and loads data into S3 buckets. Now I want to load data from the S3 buckets into AWS Redshift using the COPY command, and for that I'm creating my own custom connector. The use case is that I want to load the data created in S3 into Redshift synchronously; then, the next time around, the S3 connector should replace the existing file and our custom connector should again load the data from S3.
How can I do this using Confluent Kafka Connect, or is there a better approach for the same task?
Thanks in advance!
If you want data in Redshift, you should probably just use the JDBC Sink Connector and download the Redshift JDBC driver into the kafka-connect-jdbc directory.
Otherwise, rather than writing a connector, you could use an S3 event notification to trigger a Lambda function that does the Redshift upload.
Alternatively, if you are simply looking to query the S3 data, you could use Athena instead, without dealing with any databases.
But basically, sink connectors don't communicate with one another. They are independent tasks that are designed to consume from a topic and write to a destination, not necessarily to trigger external, downstream systems.
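A rough sketch of the JDBC-sink route suggested above, assuming the Redshift JDBC driver jar is already in the kafka-connect-jdbc plugin directory (endpoint, topic, and credentials are placeholders):

import requests

sink = {
    "name": "redshift-jdbc-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "connection.url": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
        "connection.user": "awsuser",
        "connection.password": "********",
        "topics": "orders",
        "insert.mode": "insert",
        "auto.create": "true",   # let the connector create the target table
        "pk.mode": "none",
    },
}

resp = requests.post("http://connect-worker:8083/connectors", json=sink)
resp.raise_for_status()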
If you want to achieve synchronous behaviour from Kafka to Redshift, then the S3 sink connector is not the right option.
With the S3 sink connector you first put the data into S3 and then externally run the COPY command to push it to Redshift (the COPY command is extra overhead).
No custom code or validation can run before the data is pushed to Redshift.
The Redshift sink connector ships with a native JDBC library that is comparably fast to the S3 COPY command.

How do I read a Table In Postgresql Using Flink

I want to do some analytics with Flink on data in PostgreSQL. How and where should I provide the host, port, username, and password? I was trying the table source mentioned in this link: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/common.html#register-tables-in-the-catalog.
final static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
final static TableSource csvSource = new CsvTableSource("localhost", port);
I am unable to even get started, actually. I went through all the documentation but could not find a detailed explanation of this.
The tables and catalog referred to in the link you've shared are part of Flink's SQL support, wherein you can use SQL to express computations (queries) to be performed on data ingested into Flink. This is not about connecting Flink to a database; rather, it's about having Flink behave somewhat like a database.
To the best of my knowledge, there is no Postgres source connector for Flink. There is a JDBC table sink, but it only supports append mode (via INSERTs).
The CSVTableSource is for reading data from CSV files, which can then be processed by Flink.
If you want to operate on your data in batches, one approach you could take would be to export the data from Postgres to CSV, and then use a CSVTableSource to load it into Flink. On the other hand, if you wish to establish a streaming connection, you could connect Postgres to Kafka and then use one of Flink's Kafka connectors.
Reading a Postgres instance directly isn't supported as far as I know. However, you can get realtime streaming of Postgres changes by using a Kafka server and a Debezium instance that replicates from Postgres to Kafka.
Debezium connects using the native Postgres replication mechanism on the DB side and emits all record inserts, updates or deletes as a message on the Kafka side. You can then use the Kafka topic(s) as your input in Flink.
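As an illustration of that last step, here is a minimal PyFlink Table API sketch for reading such a Debezium topic in a recent Flink version (it assumes the Kafka SQL connector jar is on the classpath; topic name, columns, and servers are made up):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders (
        id     INT,
        amount DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'dbserver1.public.orders',            -- Debezium topic naming
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-analytics',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'                      -- interpret CDC messages
    )
""")

# The Debezium changelog can now be queried like any other table.
t_env.sql_query("SELECT id, amount FROM orders").execute().print()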

Best way to stream/logically replicate RDS Postgres data to kinesis

Our primary datastore is an RDS Postgres database. It would be nice if we could stream all changes that happen in Postgres to some sink, whether that's Kinesis, Elasticsearch, or any other data store.
We use Postgres 9.5, which has support for 'logical replication'. However, all the extensions that tap into this stream are blocked on RDS. There's a tutorial for streaming the MySQL RDS flavor to Kinesis; the Postgres equivalent would be ideal. Is this possible currently?
Have a look at https://github.com/disneystreaming/pg2k4j. It takes all changes made to your database and streams them to Kinesis. See the README for an example of how to set this up with RDS. We've been using it in production and have found it very useful for solving this exact problem. Disclaimer: I wrote https://github.com/disneystreaming/pg2k4j
Integrate a central Amazon Relational Database Service (Amazon RDS) for PostgreSQL database with other systems by streaming its modifications into Amazon Kinesis Data Streams. An earlier post, Streaming Changes in a Database with Amazon Kinesis, described how to integrate a central RDS for MySQL database with other systems by streaming modifications through Kinesis. In this post, I take it a step further and explain how to use an AWS Lambda function to capture the changes in Amazon RDS for PostgreSQL and stream those changes to Kinesis Data Streams.
https://aws.amazon.com/blogs/database/stream-changes-from-amazon-rds-for-postgresql-using-amazon-kinesis-data-streams-and-aws-lambda/
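Not the blog's exact code, but a minimal illustration of that Lambda-style approach, assuming a logical replication slot already created with the wal2json plugin (slot, stream, and connection details are placeholders):

import boto3
import psycopg2

kinesis = boto3.client("kinesis")


def handler(event, context):
    conn = psycopg2.connect(
        host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
        dbname="mydb",
        user="replicator",
        password="********",
    )
    try:
        with conn.cursor() as cur:
            # Consume the pending changes from the logical replication slot.
            cur.execute(
                "SELECT data FROM pg_logical_slot_get_changes('my_slot', NULL, NULL)"
            )
            records = [
                {"Data": row[0].encode("utf-8"), "PartitionKey": "postgres-cdc"}
                for row in cur.fetchall()
            ]
        if records:
            # put_records accepts at most 500 records per call; batch accordingly.
            kinesis.put_records(StreamName="postgres-changes", Records=records)
    finally:
        conn.close()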