Is defining a window mandatory for an unbounded PCollection? - apache-beam

New to beam/flink and would appreciate assistance in this issue.
I have a pipeline that reads Avro messages from Kafka, does some object transformation, and writes back to Kafka. I did not define any window, since currently we would like to handle each event separately, with no aggregation.
I wonder if this is correct. From what I understand from the docs, it seems like we cannot rely on the default behaviour and must define some kind of window and the relevant triggers.
Is my understanding correct?
Thanks
S

If you do not specify a windowing strategy for a pipeline, it automatically runs on the global window. Since you are not doing any aggregations or the like, this is fine.
https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/windowing.ipynb says:
All pipelines use the GlobalWindow by default. This is a single window that covers the entire PCollection.
In many cases, especially for batch pipelines, this is what we want since we want to analyze all the data that we have.
ℹ️ GlobalWindow is not very useful in a streaming pipeline unless you only need element-wise transforms. Aggregations, like GroupByKey and Combine, need to process the entire window, but a streaming pipeline has no end, so they would never finish.
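For illustration, here is a minimal sketch of such an element-wise Kafka-to-Kafka pipeline using the Beam Python SDK's cross-language KafkaIO, relying entirely on the default GlobalWindow. The broker address, topic names, and the transform_record function are made up for the example, and the Avro decoding mentioned in the question would live inside the per-element transform.

```python
# Sketch only: an element-wise Kafka -> transform -> Kafka pipeline with no
# explicit windowing, so the default GlobalWindow applies. Broker, topics and
# transform_record are placeholders.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions

def transform_record(kv):
    key, value = kv
    # ... per-element object transformation (e.g. Avro decode/re-encode) here ...
    return key, value

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromKafka' >> ReadFromKafka(
         consumer_config={'bootstrap.servers': 'broker:9092'},
         topics=['input-topic'])
     | 'Transform' >> beam.Map(transform_record)   # element-wise, no window or trigger needed
     | 'WriteToKafka' >> WriteToKafka(
         producer_config={'bootstrap.servers': 'broker:9092'},
         topic='output-topic'))
```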

Related

What is the correct way to organize the PTransforms in a Beam pipeline?

I'm developing a pipeline that reads data from Kafka.
The source Kafka topic is quite big in terms of traffic: there are 10k messages inserted per second, and each message is around 200 kB.
I need to filter the data in order to apply the transformations that I need, but something I'm not sure about is whether there is an order in which I need to apply the filter and window functions.
Would
read->window->filter->transform->write
be more efficient than
read->filter->window->transform->write
or would both options have the same performance?
I know that Beam is just a model that only describes the what and not the how, and that the runner optimizes the pipeline, but I just want to be sure I got it right.
Thanks
If there is substantial filtering, windowing after the filter will technically reduce the amount of work performed, though that saved work is cheap enough that I doubt it'd make a measurable difference. (Presumably the runner could notice that the filter does not observe the assigned window and lift it in that case, but as mentioned it's unclear if there are really savings to be gained here...)
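As a rough sketch of the "filter before window" ordering being discussed (Beam Python SDK; the in-memory source, the keep() predicate, and the window size are all placeholders, not code from the question):

```python
# Hedged sketch of read -> filter -> window -> transform; the filter is
# element-wise and never looks at the assigned window, so running it first
# avoids windowing elements that would be dropped anyway.
import apache_beam as beam
from apache_beam import window

def keep(msg):
    # placeholder element-wise predicate
    return msg.get('type') == 'interesting'

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.Create([{'type': 'interesting', 'value': 1}])  # stand-in for the Kafka read
     | 'Filter' >> beam.Filter(keep)
     | 'Window' >> beam.WindowInto(window.FixedWindows(60))          # 1-minute fixed windows
     | 'Transform' >> beam.Map(lambda msg: msg)                      # placeholder transform
     # | 'Write' >> ...                                              # sink omitted
    )
```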

Is Apache Beam the right tool for feature pre-processing?

So this is a bit of a weird question as it isn't related to how to use the tool but more about why to use it.
I'm deploying a model and thinking of using Apache Beam to run the feature processing tasks using its Python API. The documentation is pretty big and complex, but I went through most of it, even built a small working pipeline, and it is still not clear whether this would be the right tool for me.
An example of what I need is the following:
Input data structure:
ID | Timestamp | category
Output needed:
category | category count for last 30 minutes (feature example)
This process needs to run every 5 minutes and update the counts.
===> What I fail to understand is whether Apache Beam can run this pipeline every 5 minutes, read whichever new input data was generated, and update the counts from the previous time it ran. And if so, can someone point me in the right direction?
Thank you!
When you run a Beam pipeline manually, it's expected to be started only once. It can be either a bounded (batch) or unbounded (streaming) pipeline. In the first case, it will stop after all of your bounded data has been processed; in the second case, it will run continuously and expect new data to arrive (until it is stopped manually).
Usually, the type of pipeline depends on the data source that you have (Beam IO connectors). For example, if you read from files, then by default it's assumed to be a bounded source (a limited number of files), but it could be an unbounded source as well if you expect more new files to arrive and want to read them in the same pipeline.
Also, you can run your batch pipeline periodically with automated tools, like Apache Airflow (or just a Unix crontab). So, it all depends on your needs and your type of data source. I could probably give more specific advice if you could share more details of your data pipeline: the type of your data source and environment, an example of input and output results, how often your input data can be updated, and so on.
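As an aside that goes slightly beyond the answer above: if the streaming route is chosen, one common way to express "count per category over the last 30 minutes, refreshed every 5 minutes" in Beam is a sliding window. The sketch below (Python SDK) uses a placeholder in-memory source instead of a real unbounded one.

```python
# Sketch: 30-minute sliding windows that advance every 5 minutes, counting
# elements per category key. The source is a stand-in for a real unbounded read.
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.Create([('cat_a', 1), ('cat_b', 1)])      # placeholder source of (category, 1)
     | 'Window' >> beam.WindowInto(
         window.SlidingWindows(size=30 * 60, period=5 * 60))    # 30-min windows every 5 min
     | 'CountPerCategory' >> beam.combiners.Count.PerKey()
     | 'Print' >> beam.Map(print))
```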

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my Cloud Dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a parallel do (ParDo) where it takes each element from a PCollection, hands it to a worker node, and applies the side-input data for the transformation.
So I am pretty sure that if the reference sets get too big (which can happen with joins), the underlying code will take an element from dataset A and pass it to a function with side input B... but if side input B is very big, it won't fit into the worker's memory. Take a look at the Stackdriver logs for your job to investigate whether this is the case. If you see 'GC (Allocation Failure)' in your logs, this is a sign of not enough memory.
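For reference, this is roughly what the side-input pattern described above looks like in hand-written Beam (Python SDK); the datasets and names are made up. The point is that the reference collection is materialized for the ParDo on each worker, which is exactly what hurts when it is large:

```python
# Sketch of enriching a main PCollection with a reference dataset passed as a
# side input. ref_map is made available in full to the workers; fine when it
# is small, a memory problem when it is huge.
import apache_beam as beam

with beam.Pipeline() as p:
    main = p | 'Main' >> beam.Create([('k1', 10), ('k2', 20)])
    reference = p | 'Reference' >> beam.Create([('k1', 'foo'), ('k2', 'bar')])

    enriched = main | 'Enrich' >> beam.Map(
        lambda kv, ref_map: (kv[0], kv[1], ref_map.get(kv[0])),
        ref_map=beam.pvalue.AsDict(reference))   # whole reference set as a side input

    enriched | 'Print' >> beam.Map(print)
```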
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Axon Framework - is it possible to have an aggregate handle commands from multiple sagas?

I'd like to use one aggregate to handle commands from multiple sagas. Unfortunately, if a saga sends a command while the aggregate is busy handling another command, the command is lost with an AggregateNotFoundException written to the log.
I can use one aggregate per saga, but I'd like to know if it is possible with one aggregate for all sagas.
In Axon, a Command Handler isn't interested in the source of a command. Therefore, it doesn't matter whether multiple Sagas send commands, or whether there is just one source.
I think the problem here is more related to a race condition. If a Command results in an AggregateNotFoundException, it means that the Command creating the Aggregate hasn't been processed, yet.
Most likely, there is an issue in the model/design, that causes these race conditions to appear. However, to be able to judge that, I would need much more information about your design and what you're trying to achieve with it.

How to do parallel pipeline?

I have built a Scala application in Spark v1.6.0 that actually combines various functionalities: I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, I take the resulting data and call a method of component Y that takes the data as input.
I would like to get this more flexible, having a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package; however, it does not support parallel execution, if I understand correctly (in the example, the two ParquetReaders could read and process the data at the same time).
There is apparently the Luigi project that might do exactly this (however, it says on the page that Luigi is for long-running workflows, whereas I just need short-running workflows; maybe Luigi is overkill?).
What would you suggest for building work/dataflows in Spark?
I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit the case well. One nice thing about it is that it allows Spark to optimize the flow for you, probably in a smarter way than you could.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually, things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c, it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and asked it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.
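To make the laziness point concrete, here is a small PySpark sketch (the paths and column names are invented, and it uses the newer SparkSession API rather than Spark 1.6's SQLContext): the reads, the filter and the join only build a logical plan, and nothing runs until the final action.

```python
# Sketch: transformations are lazy; only the action at the end triggers
# execution, at which point Spark optimizes the whole flow (e.g. pushing the
# filter down into the Parquet scan).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-flow-sketch").getOrCreate()

a = spark.read.parquet("/data/input_a")        # no real work happens yet
b = spark.read.parquet("/data/input_b")        # still just building a plan

filtered = a.filter(a["category"] == "x")      # transformation, still lazy
joined = filtered.join(b, on="id")             # still lazy

print(joined.count())                          # action: the optimized plan runs now
# joined.explain() would show the physical plan Spark actually executes
```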