Can Spark jobs be scheduled through Airflow - pyspark

I am new to Spark and need to clarify some doubts I have.
Can I schedule Spark jobs through Airflow?
My Airflow (Spark) jobs process raw CSV files present in an S3 bucket, transform them into Parquet format, store the result back into the S3 bucket, and finally load it into Hive (queried through Presto) once processing is complete. The end user connects to Presto and queries the data to create visualisations.
Can this processed data be stored in Hive only, or in Presto only, so that the user can connect to Presto or Hive accordingly and query the database?
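For reference, a minimal sketch of such a job in PySpark might look like the following (the bucket names, table name, and transformation are illustrative, not from the original question):

# Sketch of the CSV -> Parquet -> Hive pipeline described above.
# All paths and names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("csv_to_parquet") \
    .enableHiveSupport() \
    .getOrCreate()

# Read raw CSVs from S3, apply a (placeholder) transformation, write Parquet back to S3.
raw_df = spark.read.option("header", "true").csv("s3://my-bucket/raw/")
clean_df = raw_df.dropDuplicates()
clean_df.write.mode("overwrite").parquet("s3://my-bucket/parquet/")

# Register the processed data as a Hive table so Presto can query it.
clean_df.write.mode("overwrite").saveAsTable("analytics.events_parquet")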

Well, you can always use the SparkSubmitOperator to schedule and submit your Spark jobs, or you can use the BashOperator and run the spark-submit command from it to schedule and submit Spark jobs.
As to your second question: after Spark has created the Parquet files, you can use the same Spark instance to write them to Hive or Presto.
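For example, a minimal DAG using either operator could look like the sketch below (it assumes the apache-spark provider package is installed and a spark_default connection exists; all paths and IDs are illustrative):

# Minimal Airflow DAG sketch showing both approaches mentioned above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_csv_to_parquet",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Option 1: dedicated operator, configured via an Airflow connection.
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/opt/jobs/csv_to_parquet.py",
        conn_id="spark_default",
        application_args=["s3://my-bucket/raw/", "s3://my-bucket/parquet/"],
    )

    # Option 2: plain spark-submit through the BashOperator.
    submit_job_bash = BashOperator(
        task_id="submit_spark_job_bash",
        bash_command=(
            "spark-submit --master yarn /opt/jobs/csv_to_parquet.py "
            "s3://my-bucket/raw/ s3://my-bucket/parquet/"
        ),
    )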

Related

Spark Structured Stream - Kinesis as Data Source

I am trying to consume Kinesis data stream records using PySpark Structured Streaming.
I am trying to run this code in an AWS Glue batch job. My goal is to use checkpointing and save the checkpoints and data to S3. I am able to consume the data, but each trigger returns only a few records, whereas the Kinesis data stream has a lot of records. I am using TRIM_HORIZON (which is an alias for earliest) and triggering writeStream with once=True so that it executes once and shuts down the cluster. When I run the job again, it picks up the latest offset from the checkpoint and runs.
kinesis = spark.readStream.format('kinesis') \
    .option('streamName', kinesis_stream_name) \
    .option('endpointUrl', 'blaablaa') \
    .option('region', region) \
    .option('startingPosition', 'TRIM_HORIZON') \
    .option('maxOffsetsPerTrigger', 100000) \
    .load()

# do some transformation here (stream_data is derived from kinesis)

TargetKinesisData = stream_data.writeStream \
    .format('parquet') \
    .outputMode('append') \
    .option('path', s3_target) \
    .option('checkpointLocation', checkpoint_location) \
    .trigger(once=True) \
    .start() \
    .awaitTermination()

ETL using apache pyspark and airflow

We are developing an ETL tool using Apache PySpark and Apache Airflow.
Apache Airflow will be used for workflow management.
Can Apache PySpark handle huge volumes of data?
Can I get the extract and transform counts from Apache Airflow?
Yes, Apache (Py)Spark is built for dealing with big data.
There is no magic out-of-the-box solution for getting metrics from PySpark into Airflow.
Some solutions for #2 are:
Writing metrics from PySpark to another system (e.g. database, blob storage, ...) and reading them in a second task in Airflow
Returning the values from the PySpark jobs and pushing them into Airflow XCom (see the sketch below)
My 2c: don't process large data in Airflow itself, as it's built for orchestration, not data processing. If the intermediate data becomes big, use a dedicated storage system for it (database, blob storage, etc.). XComs are stored in the Airflow metastore itself (although custom XCom backends for storing data in other systems are supported since Airflow 2.0, see https://www.astronomer.io/guides/custom-xcom-backends), so make sure the data isn't too big if you're storing it in the Airflow metastore.
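A minimal sketch of the second option, using the Airflow 2.x TaskFlow API (the task names and counts are illustrative; values returned from a task are pushed to XCom automatically):

# Sketch: return small metrics from a task and read them downstream via XCom.
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def etl_with_metrics():

    @task
    def extract_and_transform():
        # ... run the PySpark job here (e.g. via spark-submit) ...
        # Return small metrics only; they are stored in the Airflow metastore.
        return {"extract_count": 1000, "transform_count": 950}

    @task
    def report(metrics: dict):
        print(f"extracted={metrics['extract_count']}, transformed={metrics['transform_count']}")

    report(extract_and_transform())

etl_with_metrics_dag = etl_with_metrics()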

Confluent Kafka Connect : Run multiple sink connectors in synchronous way

We are using the Kafka Connect S3 sink connector to consume from Kafka and load data into S3 buckets. Now I want to load the data from the S3 buckets into AWS Redshift using the COPY command, and for that I'm creating my own custom connector. The use case is: I want to load the data created in S3 into Redshift synchronously, then the S3 connector should replace the existing file, and our custom connector should again load the data from S3 into Redshift.
How can I do this using Confluent Kafka Connect, or is there a better approach for the same task?
Thanks in advance!
If you want data in Redshift, you should probably just use the JDBC Sink Connector and download the Redshift JDBC driver into the kafka-connect-jdbc directory.
Otherwise, rather than writing a connector, you could use an S3 event notification to trigger a Lambda function that performs the Redshift upload.
Alternatively, if you are simply looking to query the S3 data, you could use Athena instead, without dealing with any databases.
But basically, sink connectors don't communicate with one another. They are independent tasks that are designed to consume from a topic and write to a destination, not necessarily to trigger external, downstream systems.
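A minimal sketch of the first suggestion, registering the JDBC Sink Connector through the Kafka Connect REST API (the connector name, topic, endpoint, and connection details are illustrative; the Redshift JDBC driver must already be on the Connect worker's plugin path):

# Sketch: register a JDBC sink connector that writes a topic into Redshift.
# All names, hosts, and credentials below are placeholders.
import json

import requests

connector = {
    "name": "redshift-jdbc-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "my_topic",
        "connection.url": "jdbc:redshift://my-cluster.example.com:5439/dev",
        "connection.user": "etl_user",
        "connection.password": "********",
        "insert.mode": "insert",
        "auto.create": "true",  # let the connector create the target table
        "tasks.max": "1",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",  # Connect worker REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()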
If you want to achieve synchronous behaviour from Kafka to Redshift, then the S3 sink connector is not the right option:
If you use the S3 sink connector, it first puts the data into S3 and you then have to run the COPY command externally to push it to Redshift (the COPY command is extra overhead).
No custom code or validation can run before the data is pushed to Redshift.
The Redshift sink connector ships with a native JDBC library and is roughly as fast as the S3 COPY command.

Is it possible to load a database directly from HDFS into spark as a DataFrame?

I have MongoDB and Spark running on Zeppelin, both sharing the same HDFS. MongoDB produces a .wt database stored in that same HDFS.
I want to load the database collection produced by MongoDB from HDFS into a Spark DataFrame.
Is it possible to load the database directly from HDFS into Spark as a DataFrame, or do I need to use the MongoDB Spark Connector?
I would not recommend reading or modifying the internal WiredTiger storage engine's *.wt files. Firstly, these internal files are subject to change without notice (they are not a public-facing API); also, any unintended modification to these files may leave the database in an invalid/corrupt state.
You can use the MongoDB Spark Connector to load data from MongoDB into Spark. This connector is designed, developed, and optimised for reading and writing data between MongoDB and Apache Spark. For example, by accessing the data via the database, the client can utilise the database's indexes to optimise read operations.
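A minimal sketch of reading a collection through the connector (the database, collection, and connector version are illustrative; in Zeppelin the connector package is usually configured in the interpreter settings rather than in code):

# Sketch: load a MongoDB collection into a Spark DataFrame via the connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("mongo-read") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.mycollection") \
    .getOrCreate()

df = spark.read.format("mongo").load()
df.printSchema()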
See also:
GitHub demo: Docker for MongoDB, Apache Spark and Apache Zeppelin
GitHub demo: Docker for MongoDB and Apache Spark

Parallelism in Spark Job server

We are working on Qubole with Spark version 2.0.2.
We have a multi-step process in which all the intermediate steps write their output to HDFS and later this output is used in the reporting layer.
As per our use case, we want to avoid writing to HDFS, keep all the intermediate output as temporary tables in Spark, and directly write only the final reporting-layer output.
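A minimal sketch of that pattern, keeping an intermediate result as a temporary view instead of writing it to HDFS (the table, column, and path names are illustrative; spark is the existing SparkSession):

# Intermediate step: keep the result as a temporary view instead of writing it to HDFS.
intermediate_df = spark.table("source_table").filter("event_date = '2017-01-01'")
intermediate_df.createOrReplaceTempView("intermediate_step_1")

# Later steps query the temporary view directly within the same SparkSession.
report_df = spark.sql(
    "SELECT event_type, count(*) AS cnt FROM intermediate_step_1 GROUP BY event_type"
)

# Only the final reporting-layer output is persisted.
report_df.write.mode("overwrite").parquet("s3://reporting-bucket/final/")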
For this implementation, we wanted to use the Job Server provided by Qubole, but when we try to trigger multiple queries on the Job Server, it runs the jobs sequentially.
I have observed the same behaviour in a Databricks cluster as well.
The cluster we are using is a 30-node r4.2xlarge cluster.
Does anyone have experience in running multiple jobs using a job server?
The community's help will be greatly appreciated!