I have searched extensively but couldn't find any help or resources on how to test my PySpark Structured Streaming pipeline job (ingesting from Kafka topics to S3) or how to set up Continuous Integration (CI) / Continuous Deployment (CD) for it.
Is it possible to test (unit test, integration test) a PySpark Structured Streaming job?
How do I build Continuous Integration (CI) / Continuous Deployment (CD) for it?
Refer to https://bartoszgajda.com/2020/04/13/testing-spark-structured-streaming-using-memorystream/ for the memory-stream testing approach. The code there is in Scala, but you should be able to convert it to PySpark.
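For example, here is a minimal PySpark/pytest sketch of that idea, using a file source in place of Kafka and the memory sink to collect results (the `transform` function and all names are hypothetical):

```python
# A minimal test sketch: feed a controlled input through readStream,
# run the transformation under test, and assert on the in-memory output.
import json
import pytest
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def transform(df: DataFrame) -> DataFrame:
    # Hypothetical business logic under test.
    return df.withColumn("value_upper", F.upper(F.col("value")))


@pytest.fixture(scope="session")
def spark():
    return (SparkSession.builder
            .master("local[2]")
            .appName("structured-streaming-test")
            .getOrCreate())


def test_transform(spark, tmp_path):
    # Write one input record; the file source stands in for Kafka in the test.
    src = tmp_path / "input"
    src.mkdir()
    (src / "part-0.json").write_text(json.dumps({"value": "hello"}))

    stream = (spark.readStream
              .schema("value STRING")
              .json(str(src)))

    query = (transform(stream)
             .writeStream
             .format("memory")           # results land in an in-memory table
             .queryName("test_output")
             .outputMode("append")
             .start())
    query.processAllAvailable()
    query.stop()

    rows = spark.sql("SELECT value_upper FROM test_output").collect()
    assert rows[0].value_upper == "HELLO"
```

The pattern is the same as in the Scala post: feed a small, known input into readStream, apply the transformation you want to test, and assert on the rows that arrive in the in-memory table.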
For CI/CD, start with Jenkins (https://www.jenkins.io/).
I've started developing my first Dataflow job using Scio, the Scala SDK for Apache Beam. The job will run in streaming mode.
Can anyone advise on the best way to deploy this? I have read in the Scio docs that they use sbt-pack and then deploy it within a Docker container. I have also read about using Dataflow templates (but not in great detail).
What's best?
As with the Java and Python SDKs, you can run your code directly on Dataflow by using the Dataflow runner and launching it from your computer (or a VM / Cloud Function); a minimal example is sketched below.
If you want to package it for reuse, you can create a Dataflow template.
You can't run a custom container on Dataflow.
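For context, this is roughly what launching a streaming pipeline on Dataflow directly from your machine looks like with the Python SDK (Scio does the equivalent via sbt); the project, region, bucket and topic names are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All resource names below are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-gcp-project/topics/events")
     | "Uppercase" >> beam.Map(lambda msg: msg.decode("utf-8").upper().encode("utf-8"))
     | "Write" >> beam.io.WriteToPubSub(topic="projects/my-gcp-project/topics/events-out"))
```

Running the script submits the job to the Dataflow service; packaging the same pipeline as a template only matters when you want to relaunch it later without the build toolchain.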
I am new to the Talend ETL tool. I want to use Talend to generate Spark batch jobs.
Can Talend generate Scala code instead of Java, or is there a way to plug a Scala-based batch job into Talend?
No, it generates Java only.
That should not matter, though, if you are using Talend as a graphical ETL tool at a higher level of abstraction.
Does Apache Beam for Python support the Flink runner right now, or even the portable runner? And is Beam for Java supported by the Flink runner commercially?
Yes, both Python and Java are supported by the Apache Flink runner.
It is important to understand that the Flink Runner comes in two flavors:
A legacy Runner which supports only Java (and other JVM-based languages)
A portable Runner which supports Java/Python/Go
Ref: https://beam.apache.org/documentation/runners/flink/
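As a rough illustration, this is how a Python pipeline can target the portable Flink Runner; the Flink master address and option values here are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="FlinkRunner",
    flink_master="localhost:8081",   # address of an existing Flink cluster (placeholder)
    environment_type="LOOPBACK",     # run the Python SDK harness in the launching process
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["hello", "beam on flink"])
     | beam.Map(str.upper)
     | beam.Map(print))
```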
I think you would have to define what "commercially supported" means:
Do you want to run the Flink Runner as a service, similar to what Google Cloud Dataflow provides? If so, the closest option is Amazon's Kinesis Data Analytics, but it is really just a managed Flink cluster.
Many companies use the Flink Runner and contribute back to the Beam project, e.g. Lyft, Yelp, Alibaba, Ververica. This could be seen as a form of commercial support. There are also various consultancies, e.g. BigDataInstitute, Ververica, which could help you manage your Flink Runner applications.
I have code written entirely in Scala that uses Spark Streaming to read JSON data from a Kafka topic and, after some processing, write it to Cassandra and to another Kafka topic. Now I need to write a unit test for this code. I need help on how to write such a test and how to mock data when using the Spark Cassandra Connector.
You can use spark-cassandra-connector-embedded, which is developed together with the connector itself. Just add the Maven or SBT dependency to your project, for example for SBT:
"com.datastax.spark" %% "spark-cassandra-connector-embedded" % {latest.version}
I'm currently working on a recommender system using PySpark and an IPython notebook. I want to generate recommendations from data stored in BigQuery. There are two options: the Spark BQ connector and the Python BQ library.
What are the pros and cons of these two tools?
The Python BQ library is the standard way to interact with BQ from Python, so it exposes the full BigQuery API surface. The Spark BQ connector you mention is the Hadoop Connector, a Java Hadoop library that lets you read from and write to BigQuery through abstracted Hadoop classes. This more closely resembles how you interact with native Hadoop inputs and outputs.
You can find example usage of the Hadoop Connector here.
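To give a feel for the Hadoop Connector side, here is a rough PySpark sketch of reading a BigQuery table through it; the mapred.bq.* keys follow the connector's configuration convention, and the project and bucket values are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder project/bucket; the input table here is the public Shakespeare sample.
conf = {
    "mapred.bq.project.id": "my-project",
    "mapred.bq.gcs.bucket": "my-bucket",
    "mapred.bq.temp.gcs.path": "gs://my-bucket/tmp/bq",
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}

# Each record comes back as a (row id, JSON string) pair via the Hadoop input format.
table_data = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)
print(table_data.take(5))
```

The Python BQ library, by contrast, runs the query client-side and hands you the rows directly, which is convenient for small lookups but does not parallelize the read across the Spark cluster.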