My plan:
Move all data processing to Spark (preferably PySpark), with the final consumer-facing data going to Redshift only. Spark seems to connect well to all the various sources (DynamoDB, S3, Redshift), and output can go to Redshift, S3, etc. depending on customer need. This avoids the current setup of multiple Redshift clusters, overused and unsupported internal ETL tools, and copies of the same data across clusters, views, and tables.
Use Luigi to provide a web UI for monitoring pipelines daily, visualising the dependency tree, and scheduling ETLs, with email notifications on failures as an option. An alternative is AWS Data Pipeline, but Luigi seems to have a better UI for showing what is happening when many dependencies are involved (some trees are 5 levels deep, though perhaps that could also be avoided with better Spark code).
Questions:
Does Luigi integrate with Spark? (I have only used PySpark before, not Luigi, so this is a learning curve for me.) The plan was to schedule 'applications', and Spark itself has ETL capabilities too I believe, so I'm unsure how Luigi fits in here.
How do I account for the fact that some pipelines may be 'real time'? Would I need to spin up the Spark/EMR job hourly, for example?
I'm open to thoughts / suggestions / better ways of doing this too!
To answer your questions directly,
1) Yes, Luigi does play nicely with PySpark, just like any other library. We certainly have it running without issue -- the only caveat is that you have to be a little careful with imports and keep them inside the methods of the Luigi class because, in the background, Luigi spins up new Python processes (a minimal sketch follows after point 2).
2) There are ways of getting Luigi to slurp in streams of data, but it is tricky to do. Realistically, you'd fall back to running an hourly cron cycle to call the pipeline and process any new data. This sort of reflects Spotify's use case for Luigi, where they run daily jobs for calculating top artists, etc.
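Here's the minimal sketch from point 1, using a plain luigi.Task that wraps a PySpark job (the paths, date parameter and grouping column are made up; luigi.contrib.spark also ships SparkSubmitTask/PySparkTask wrappers if you'd rather have Luigi drive spark-submit for you):

```python
import luigi


class DailyAggregation(luigi.Task):
    # hypothetical date parameter for a daily batch
    date = luigi.DateParameter()

    def output(self):
        # hypothetical output location; Spark writes a _SUCCESS marker on completion
        return luigi.LocalTarget("output/aggregates_%s.parquet/_SUCCESS" % self.date)

    def run(self):
        # keep PySpark imports inside the method: Luigi may spin up fresh
        # Python processes, so module-level imports can misbehave
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()
        events = spark.read.json("s3://my-bucket/events/%s/" % self.date)  # placeholder input
        (events.groupBy("customer_id").count()
               .write.mode("overwrite")
               .parquet("output/aggregates_%s.parquet" % self.date))
        spark.stop()
```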
As #RonD suggests, if I were building a new pipeline now, I'd skip Luigi and go straight to Airflow. If nothing else, look at the release history: Luigi hasn't really been significantly worked on for a long time (because it works for the main dev), whereas Airflow is actively being incubated by Apache.
Instead of Luigi, use Apache Airflow for workflow orchestration (code is written in Python). It has a lot of operators and hooks built in which you can call in DAGs (workflows). For example, create one task that calls an operator to start up an EMR cluster, another to run a PySpark script located in S3 on that cluster, and another to watch the run for status. You can use tasks to set up dependencies, etc., too. A rough sketch of that DAG shape is below.
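For illustration only, something like the following, using the Amazon provider's EMR operators (module paths differ in older Airflow versions, and the cluster config, S3 script path and schedule are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Hypothetical spark-submit step pointing at a PySpark script kept in S3
SPARK_STEPS = [
    {
        "Name": "pyspark_etl",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/scripts/etl_job.py"],
        },
    }
]

with DAG(
    dag_id="emr_pyspark_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={"Name": "etl-cluster"},  # real cluster config goes here
    )

    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        steps=SPARK_STEPS,
    )

    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
    )

    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_cluster",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        trigger_rule="all_done",  # tear the cluster down even if the step fails
    )

    create_cluster >> add_step >> watch_step >> terminate_cluster
```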
Related
The fundamental problem is attempting to use Spark to generate data and then work with that data internally. I.e., I have a program that does some work and generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes and have each of them contribute back to the underlying store?
The reason I want to use Spark is that seems to be a very popular framework, and I know this request is a little outside of the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
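A minimal PySpark sketch of that seed-and-flatMap idea (the row generator, partition count and output path are all placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-generator").getOrCreate()


# Hypothetical generator: whatever your program does to produce "rows"
def generate_rows(seed):
    return [(seed, i, seed * 1000 + i) for i in range(100)]


# Distribute one seed per unit of work, let each executor expand its seeds,
# then write the results out in parallel
seeds = spark.sparkContext.parallelize(range(10000), numSlices=200)
rows = seeds.flatMap(generate_rows)

df = rows.toDF(["seed", "index", "value"])
df.write.mode("overwrite").parquet("s3://my-bucket/generated/")  # placeholder sink
```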
During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a parallel do (ParDo) function, where it takes each element from a PCollection, sends it to a worker node, and then applies the side-input data for the transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
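To make the side-input pattern concrete, here's a toy Apache Beam (Python) example of the kind of thing I suspect DataPrep generates under the hood -- this is my own illustration, not DataPrep's actual code. Note that the whole reference set gets materialised and shipped to every worker, which is exactly what blows up when it's too big:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Main input: the "big" dataset, processed element by element
    main = p | "ReadMain" >> beam.Create([
        {"id": 1, "country": "NL"},
        {"id": 2, "country": "US"},
    ])

    # Reference dataset: becomes a side input, so the *entire* thing
    # must fit in each worker's memory
    reference = p | "ReadRef" >> beam.Create([
        ("NL", "Netherlands"),
        ("US", "United States"),
    ])

    def enrich(row, country_names):
        # country_names is the whole reference set, available on every worker
        return dict(row, country_name=country_names.get(row["country"], "unknown"))

    enriched = main | "Enrich" >> beam.Map(
        enrich, country_names=beam.pvalue.AsDict(reference)
    )
    enriched | "Print" >> beam.Map(print)
```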
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of join, it will very quickly outgrow the worker memory and fall over. If you can, pre-process things so that one of the files is down in the MB range and let only the other file stay large.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the machine type for the workers. You can step up to progressively larger machine types to see if the job finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.
We have a system where our primary data store (and "Universal Source of Truth") is Postgres, but we replicate that data both in real-time as well as nightly in aggregate. We currently replicate to Elasticsearch, Redis, Redshift (nightly only), and are adding Neo4j as well.
Our ETL pipeline has gotten expansive enough that we're starting to look at tools like Airflow and Luigi, but from what I can tell from my initial research, these tools are meant almost entirely for batch loads in aggregate.
Is there any tool that can handle both large batch ETL processes and on-the-fly, high-volume, individual-record replication? Do Airflow or Luigi handle this and I've just missed it?
Thanks!
As far as Luigi goes, you would likely end up with a micro batch approach, running the jobs on a short interval. For example, you could trigger a cron job every minute to check for new records in Postgres tables and process that batch. You can create a task for each item, so that your processing flow itself is around a single item. At high volumes, say more than a few hundred updates per second, this is a real challenge.
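A rough sketch of that micro-batch shape in Luigi, triggered from cron once a minute (the table, connection string and downstream sink are placeholders, and it assumes psycopg2 for the Postgres side):

```python
import datetime

import luigi


class ReplicateRecentUpdates(luigi.Task):
    """One micro-batch covering a single minute, so re-runs stay idempotent."""
    window_start = luigi.DateMinuteParameter()

    def output(self):
        # marker file so Luigi knows this minute has already been processed
        return luigi.LocalTarget(
            "markers/replicated_%s" % self.window_start.strftime("%Y%m%dT%H%M")
        )

    def run(self):
        import psycopg2  # keep imports inside the method, as with any Luigi task

        conn = psycopg2.connect("dbname=app")  # placeholder DSN
        cur = conn.cursor()
        cur.execute(
            "SELECT id, payload FROM events WHERE updated_at >= %s AND updated_at < %s",
            (self.window_start, self.window_start + datetime.timedelta(minutes=1)),
        )
        for record_id, payload in cur.fetchall():
            self.replicate(record_id, payload)

        with self.output().open("w") as marker:
            marker.write("done")

    def replicate(self, record_id, payload):
        # placeholder for the per-record fan-out to Elasticsearch/Redis/etc.
        pass
```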
Apache Spark has a scalable batch mode and a micro-batch mode of processing, and some basic pipelining operators that can be adapted to ETL. However, the complexity of the solution in terms of supporting infrastructure goes up quite a bit.
I'm no crazy expert on different ETL engines, but I've done a lot with Pentaho Kettle and am pretty happy with its performance, especially if you tune your transformations to take advantage of parallel processing.
I've mostly used it for handling integrations (real time) and nightly jobs that perform ETL to drive our reporting DB but I'm pretty sure you could set it up to perform many real time tasks.
I did once set up web services that called all sorts of things on our back end in real time, but it was not under any kind of load, and it sounds like you're doing heftier things than we are. Then again, it has functionality to cluster the ETL servers and scale things out that I've never really played with.
It feels like Kettle could do these things if you spent the time to set it up right. Overall I love the tool; it's a joy to work in the GUI, to be honest. If you're not familiar with it, or doubt the power of doing ETL from a GUI, you should check it out. You might be surprised.
I have built a Scala application in Spark v1.6.0 that combines various functionalities: I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, I take the resulting data and call a method of component Y that takes the data as input.
I would like to get this more flexible, having a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package; however, it does not support parallel execution, if I understand correctly (in the example, the two ParquetReaders could read and process the data at the same time).
There is apparently the Luigi project, which might do exactly this (however, its page says Luigi is for long-running workflows, whereas I just need short-running workflows; Luigi might be overkill?).
What would you suggest for building work/dataflows in Spark?
I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit that case well. One nice thing about it is that it allows Spark to optimize the flow for you, probably more cleverly than you can.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.
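To illustrate that laziness, here's a small sketch (in PySpark for brevity; paths and columns are made up) where both Parquet reads, the per-branch transformations and the join only execute when the final write is called, giving Spark the whole plan to optimise at once:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-flow").getOrCreate()

# Two "ParquetReader" steps: nothing is read yet, these only build the plan
a = spark.read.parquet("s3://my-bucket/input_a/")  # placeholder paths
b = spark.read.parquet("s3://my-bucket/input_b/")

# Per-branch transformations, still lazy
a_clean = a.filter(F.col("status") == "ok").select("id", "value")
b_clean = b.withColumn("value_scaled", F.col("value") * 2).select("id", "value_scaled")

# Combine the branches; Spark optimises the whole plan (predicate pushdown,
# column pruning, partition discovery) and only executes it on the action below
result = a_clean.join(b_clean, "id")
result.write.mode("overwrite").parquet("s3://my-bucket/output/")  # the action
```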
I was working on a proof of concept of exposing Spark MLlib training and prediction serving for multiple tenants behind some form of REST interface. I did get a POC up and running, but it seems a bit wasteful, as it has to create numerous Spark contexts (and JVMs for them to execute in), so I was wondering if there is a way around that, or a cleaner solution, keeping in mind Spark's one-context-per-JVM restriction.
There are 2 parts to it:
Trigger training of a specified jar per tenant, with tenant-specific restrictions like executor size, etc. This is pretty much out of the box with Spark JobServer (sadly it doesn't yet seem to support OAuth, but there is a way around that). For this part I don't think it's possible to share a context between tenants, because they should be able to train in parallel, and as far as I know an MLlib context will run two training requests sequentially.
This is trickier, and I can't seem to find a good way to do it: once the model has been trained, we need to load it into some kind of REST service and expose it. This also means allocating a Spark context per tenant, hence a full JVM per tenant serving predictions, which is quite wasteful.
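To give a rough idea of what I mean by the serving part, something like the sketch below (Flask, the model paths and the tenant naming are placeholders, not my actual code): one SparkSession, and therefore one JVM, per tenant, just to call transform() on a loaded model.

```python
from flask import Flask, jsonify, request
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

TENANT_ID = "tenant-a"  # each tenant currently gets its own copy of this service

app = Flask(__name__)

# One SparkSession per service process means one JVM per tenant
# just to run model.transform() -- this is the wasteful part
spark = (SparkSession.builder
         .appName("predictions-%s" % TENANT_ID)
         .master("local[2]")
         .getOrCreate())

# Hypothetical convention: the training job saves each tenant's fitted
# pipeline under its own path
model = PipelineModel.load("s3://models/%s/latest" % TENANT_ID)


@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()  # e.g. {"feature_a": 1.0, "feature_b": 2.0}
    df = spark.createDataFrame([features])
    prediction = model.transform(df).select("prediction").first()[0]
    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(port=8080)
```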
Any feedback on how this could be improved or re-architected so it's a bit less resource-hungry would be appreciated; maybe there are certain Spark features I'm not aware of that would facilitate this. Thanks!