ETL using Apache PySpark and Airflow

We are developing an ETL tool using Apache PySpark and Apache Airflow.
Apache Airflow will be used for workflow management.
Can Apache PySpark handle huge volumes of data?
Can I get extract/transform counts from Apache Airflow?

Yes, Apache (Py)Spark is built for processing large volumes of data.
There is no magic out-of-the-box solution for getting metrics from PySpark into Airflow.
Some solutions for #2 are:
Writing metrics from PySpark to another system (e.g. database, blob storage, ...) and reading those in a 2nd task in Airflow
Returning the values from the PySpark jobs and pushing them into Airflow XCom (see the sketch below)
My 2c: don't process large data in Airflow itself as it's built for orchestration and not data processing. If the intermediate data becomes big, use a dedicated storage system for that (database, blob storage, etc...). XComs are stored in the Airflow metastore itself (although custom XCom backends to store data in other systems are supported since Airflow 2.0 https://www.astronomer.io/guides/custom-xcom-backends), so make sure the data isn't too big if you're storing it in the Airflow metastore.
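For illustration, here is a minimal sketch of the second option, assuming Airflow 2.x with the TaskFlow API and a worker that has PySpark available; the DAG name, paths and transformation are hypothetical:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
def etl_with_counts():

    @task
    def extract_and_transform():
        # Runs PySpark inside the task; for bigger jobs you would typically
        # use SparkSubmitOperator and write the counts to external storage.
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("etl").getOrCreate()
        raw = spark.read.parquet("/data/raw")          # hypothetical path
        extract_count = raw.count()
        cleaned = raw.dropna()                         # hypothetical transform
        transform_count = cleaned.count()
        cleaned.write.mode("overwrite").parquet("/data/clean")
        spark.stop()
        # The returned dict is pushed to XCom automatically by TaskFlow
        return {"extract_count": extract_count, "transform_count": transform_count}

    @task
    def report(counts: dict):
        print(f"extracted={counts['extract_count']}, transformed={counts['transform_count']}")

    report(extract_and_transform())

etl_with_counts()

Keep in mind that only the small count values go through XCom here; the actual data stays in storage, in line with the advice above.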

Related

Confluent Kafka Connect: Run multiple sink connectors in a synchronous way

We are using the Kafka Connect S3 sink connector to consume from Kafka and load data into S3 buckets. Now I want to load the data from the S3 buckets into AWS Redshift using the COPY command, and for that I'm creating my own custom connector. The use case is that I want to load the data created in S3 into Redshift synchronously; the next time around, the S3 connector should replace the existing file and our custom connector should load the data into Redshift again.
How can I do this using Confluent Kafka Connect, or is there a better approach for the same task?
Thanks in advance!
If you want data in Redshift, you should probably just use the JDBC Sink Connector and download the Redshift JDBC driver into the kafka-connect-jdbc directory.
Otherwise, rather than writing a connector, you could use an S3 event notification to trigger a Lambda function that performs the Redshift upload.
Alternatively, if you are simply looking to query the S3 data, you could use Athena instead, without dealing with any databases.
But basically, sink connectors don't communicate with one another. They are independent tasks designed to consume from a topic and write to a destination, not to trigger external, downstream systems.
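For reference, a rough sketch of registering the JDBC sink approach mentioned above via the Kafka Connect REST API, assuming the Confluent JDBC sink connector and the Redshift JDBC driver are installed; the topic, cluster address and credentials are placeholders:

import json
import requests

# Hypothetical connector config; connection details are placeholders
connector = {
    "name": "redshift-jdbc-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "my_topic",
        "connection.url": "jdbc:redshift://my-cluster.redshift.amazonaws.com:5439/dev",
        "connection.user": "awsuser",
        "connection.password": "********",
        "insert.mode": "insert",
        "auto.create": "true",
        "tasks.max": "1",
    },
}

# Kafka Connect exposes its REST API on port 8083 by default
resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()
print(resp.json())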
If you want synchronous behaviour from Kafka to Redshift, then the S3 sink connector is not the right option.
With the S3 sink connector, the data first lands in S3 and you then have to run the COPY command externally to push it into Redshift (the COPY command is extra overhead).
No custom code or validation can run before the data is pushed to Redshift.
The Redshift sink connector, on the other hand, comes with a native JDBC library that is comparable in speed to the S3 COPY command.

Is it possible to load a database directly from HDFS into spark as a DataFrame?

I have MongoDB and Spark running on Zeppelin, both sharing the same HDFS. MongoDB produces .wt (WiredTiger) database files stored in that same HDFS.
I want to load the database collection produced by MongoDB from HDFS into a Spark DataFrame.
Is it possible to load the database directly from HDFS into Spark as a DataFrame, or do I need to use the MongoDB Spark connector?
I would not recommend reading or modifying the internal WiredTiger storage engine's *.wt files. Firstly, these internal files are subject to change without notice (they are not a public-facing API), and any unintended modification may leave the database in an invalid/corrupt state.
You can use the MongoDB Spark Connector to load data from MongoDB into Spark. The connector is designed, developed and optimised for reading and writing data between MongoDB and Apache Spark. For example, by accessing the data through the database, the client can take advantage of database indexes to optimise read operations.
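As a rough sketch (assuming the MongoDB Spark Connector 3.x; the URI, database and collection names are placeholders), loading a collection into a DataFrame from PySpark looks like this:

from pyspark.sql import SparkSession

# Placeholder connection string: mongodb://<host>:27017/<db>.<collection>
spark = (SparkSession.builder
         .appName("mongo-to-spark")
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost:27017/mydb.mycollection")
         .getOrCreate())

# The connector reads through mongod itself, so indexes can be used
df = spark.read.format("mongo").load()
df.printSchema()
df.show(5)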
See also:
GitHub demo: Docker for MongoDB, Apache Spark and Apache Zeppelin
GitHub demo: Docker for MongoDB and Apache Spark

How do I read a table in PostgreSQL using Flink

I want to do some analytics using Flink on data in PostgreSQL. How and where should I provide the host/port, username and password? I was trying with the table source as mentioned in the link: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/common.html#register-tables-in-the-catalog.
final static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
final static TableSource csvSource = new CsvTableSource("localhost", port);
I am unable to even get started, actually. I went through all the documentation but could not find a detailed explanation of this.
The tables and catalog referred to in the link you've shared are part of Flink's SQL support, wherein you can use SQL to express computations (queries) to be performed on data ingested into Flink. This is not about connecting Flink to a database; rather, it's about having Flink behave somewhat like a database.
To the best of my knowledge, there is no Postgres source connector for Flink. There is a JDBC table sink, but it only supports append mode (via INSERTs).
The CSVTableSource is for reading data from CSV files, which can then be processed by Flink.
If you want to operate on your data in batches, one approach you could take would be to export the data from Postgres to CSV, and then use a CSVTableSource to load it into Flink. On the other hand, if you wish to establish a streaming connection, you could connect Postgres to Kafka and then use one of Flink's Kafka connectors.
Reading a Postgres instance directly isn't supported as far as I know. However, you can get realtime streaming of Postgres changes by using a Kafka server and a Debezium instance that replicates from Postgres to Kafka.
Debezium connects using the native Postgres replication mechanism on the DB side and emits all record inserts, updates or deletes as a message on the Kafka side. You can then use the Kafka topic(s) as your input in Flink.
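For illustration only, on a newer Flink release with PyFlink and the Kafka SQL connector installed, consuming such a Debezium change stream might look roughly like this (the topic name, schema and server addresses are hypothetical):

from pyflink.table import EnvironmentSettings, TableEnvironment

# Requires the flink-sql-connector-kafka jar on the classpath
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Each Postgres table replicated by Debezium shows up as a Kafka topic
t_env.execute_sql("""
    CREATE TABLE orders (
        id INT,
        amount DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'dbserver1.public.orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-postgres-cdc',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")

t_env.execute_sql("SELECT id, amount FROM orders").print()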

MongoDB with the Spark connector

If I have a MongoDB replica set, then the primary server receives all the write/read operations and applies them.
The secondary servers read the operations from the oplog and replicate them.
Now I would like to analyze the data in the MongoDB replica set with the spark-mongodb-connector. I can install a Spark cluster on all three nodes and run analytics on it in memory.
I understand that a Spark cluster has a master node to which I have to submit the Spark job for analytics, or a Spark Streaming job. Both are installed on an application server running Tomcat.
Now I need to choose a master node so I can submit the job from my Tomcat app server to the Spark cluster.
Should the primary server be the Spark master node, so that the driver of an application can connect to it and submit jobs?
What would be the Spark master in a sharded cluster?
It doesn't really matter which node is the Spark Master in your cluster.
The Spark master is responsible for assigning tasks to the Spark executors; it will not receive all read/write requests.
Each executor is then responsible for fetching the data it needs to process.
Be careful about data partitioning in Spark: it can happen that MongoDB only provides a single partition to start with, so you may want to repartition first.
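A small sketch of that last point, assuming the MongoDB Spark Connector 3.x is on the classpath (hosts, database and collection are placeholders): read from the secondaries so analytics doesn't load the primary, then repartition before heavy processing:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("replica-set-analytics")
         .config("spark.mongodb.input.uri",
                 "mongodb://host1:27017,host2:27017,host3:27017/mydb.mycollection?replicaSet=rs0")
         # Prefer reading from secondaries
         .config("spark.mongodb.input.readPreference.name", "secondaryPreferred")
         .getOrCreate())

df = spark.read.format("mongo").load()
print(df.rdd.getNumPartitions())   # may be 1 depending on the partitioner
df = df.repartition(64)            # spread the work across executors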

Akka Persistence: migrating from jdbc (postgres) to cassandra

I have a running project that uses the akka-persistence-jdbc plugin with PostgreSQL as the backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a manual migration program that reads from Postgres and writes the events, in the right format, to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and Spark Cassandra Connector: The Spark JDBC API (via Spark DataFrames / Spark SQL) allows you to read from any JDBC source. Read it in chunks by segmenting the table, otherwise you will run out of memory; segmentation also makes the migration parallel. Then write the data into Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way I have used for such tasks (see the sketch at the end of this answer).
Java Agents: A Java agent can be written on top of plain JDBC or another library and then write to Cassandra with the DataStax driver. A Spark program runs across multiple machines in a multi-threaded way and recovers automatically if something goes wrong, whereas an agent you write yourself runs on a single machine and the multi-threading also has to be coded by hand.
Kafka Connectors: Kafka is a message broker. It can be used indirectly for the migration: Kafka Connect has connectors that can read from and write to different databases. You can use the JDBC source connector to read from Postgres and the Cassandra connector to write to Cassandra. It's not that easy to set up, but it has the advantage of "no coding involved".
ETL Systems: Some ETL systems have support for Cassandra, but I haven't personally tried any of them.
I saw some advantages in using Spark SQL and the Spark Cassandra Connector for the migration, some of them being:
The code was concise; it was hardly 40 lines.
Multi-machine (and multi-threaded on each machine).
Job progress and statistics in the Spark master UI.
Fault tolerance: if a Spark node goes down or a thread/worker fails, the job is automatically restarted on another node, which is good for very long-running jobs.
If you don't know Spark, then writing an agent is fine for 4 GB of data.
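A minimal PySpark sketch of the first option, assuming the PostgreSQL JDBC driver and the Spark Cassandra Connector are on the classpath; the table names, the "ordering" partition column, the bounds and the credentials are placeholders, and the event payload would still need to be transformed into the akka-persistence-cassandra schema:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-to-cassandra-migration")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Segment the JDBC read into parallel chunks so it doesn't run out of memory
journal = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://pg-host:5432/akka")
           .option("dbtable", "journal")
           .option("user", "akka")
           .option("password", "secret")
           .option("partitionColumn", "ordering")   # a numeric, indexed column
           .option("lowerBound", "1")
           .option("upperBound", "10000000")
           .option("numPartitions", "32")
           .load())

# ... transform rows into the target table layout here ...

(journal.write.format("org.apache.spark.sql.cassandra")
 .options(table="messages", keyspace="akka")
 .mode("append")
 .save())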