Is there a way to detect how many columns have come in via RCP? I have a Sequential File stage using RCP. The next stage is a Transformer stage. In the Transformer stage I want to detect the total number of columns coming from the Sequential File stage. Is this possible? Thanks in advance for any help.
What you are seeking to do is not possible, unless you load the count from the schema file into a job parameter to which the Transformer stage has access.
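As a rough illustration of "load the count from the schema file", here is a minimal Python sketch that counts the column definitions in a schema file. It assumes a flat schema of the form record {...} ( name:type; ... ) with no nested subrecords; the function name and the sample schema are made up for the example. The resulting number could then be supplied as a job parameter when the job is started.

```python
import re

# Hypothetical sample, assuming the usual "record ( name:type; ... )" layout.
SAMPLE_SCHEMA = """
record
{final_delim=end, delim=',', quote=double}
(
  id:int32;
  name:string[max=50] {prefix=2};
  created:nullable timestamp;
)
"""


def count_schema_columns(schema_text: str) -> int:
    """Count column definitions in a flat schema (no nested subrecords)."""
    # Remove {...} property blocks so their contents cannot confuse the count.
    text = re.sub(r"\{[^{}]*\}", "", schema_text)
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end <= start:
        raise ValueError("no '( ... )' column list found in schema")
    body = text[start + 1:end]
    # Each column definition ends with ';', so count the non-empty pieces.
    return sum(1 for part in body.split(";") if part.strip())


if __name__ == "__main__":
    print(count_schema_columns(SAMPLE_SCHEMA))
```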
New to beam/flink and would appreciate assistance in this issue.
I have a pipeline that reads Avro messages from Kafka, does some object transformation, and writes back to Kafka. I did not define any window, since for now we would like to handle each event separately with no aggregation.
I wonder if this is correct. From what I understand of the docs, it seems we cannot rely on the default behaviour and must define some kind of window and relevant triggers.
Is my understanding correct?
Thanks
S
If you do not specify a windowing strategy for a pipeline, it will automatically run on a global window. Since no aggregations or such are to be made, this is ok.
https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/windowing.ipynb says:
All pipelines use the GlobalWindow by default. This is a single window that covers the entire PCollection.
In many cases, especially for batch pipelines, this is what we want since we want to analyze all the data that we have.
ℹ️ GlobalWindow is not very useful in a streaming pipeline unless you only need element-wise transforms. Aggregations, like GroupByKey and Combine, need to process the entire window, but a streaming pipeline has no end, so they would never finish.
I have some PySpark code with a very large number of joins and aggregations. I've enabled the Spark UI and I've been digging into the event timeline, job stages, and DAG visualization. I can find the task id and executor id for the expensive parts. Does anyone have a tip on how I can tie the expensive parts from the Spark UI output (task id, executor id) back to parts of my PySpark code? I can tell from the output that the expensive parts are caused by a large amount of shuffle operations from all my joins, but it would be really handy to identify which join is the main culprit.
Your best approach is to start applying actions to your dataframes in various parts of the code. Pick a place, write it to a file, read it back, and continue.
This will allow you to identify your bottlenecks, because you can then observe small portions of the execution in the UI.
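One way to make those intermediate actions easier to find in the UI is to label them. The sketch below is an assumption on my part rather than something from the answer above: it wraps an action in SparkContext.setJobDescription, so the jobs it triggers show up with that text in the description column of the Jobs tab. The helper names are hypothetical, and spark is expected to be an existing SparkSession.

```python
from contextlib import contextmanager


@contextmanager
def labelled_stage(spark, description):
    """Label every Spark job triggered inside the block with `description`."""
    sc = spark.sparkContext
    sc.setJobDescription(description)
    try:
        yield
    finally:
        # Clear the label so later jobs are not attributed to this step.
        sc.setJobDescription(None)


def materialize(spark, df, description):
    """Force evaluation of `df` under a label and return its row count."""
    with labelled_stage(spark, description):
        return df.count()


# Hypothetical usage, with `orders` and `customers` as existing DataFrames:
#   joined = orders.join(customers, "customer_id")
#   materialize(spark, joined, "join: orders x customers")
```

Note that count() still evaluates everything upstream of the DataFrame, so the label marks where a job was triggered, not an isolated cost. Writing the intermediate result out and reading it back, as suggested above, is what actually separates one step from the next.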
So this is a bit of a weird question as it isn't related to how to use the tool but more about why to use it.
I'm deploying a model and thinking of using Apache Beam to run the feature processing tasks using its Python API. The documentation is pretty big and complex, but I went through most of it and even built a small working pipeline, and it is still not clear to me whether this is the right tool.
An example of what I need is the following:
Input data structure:
ID | Timestamp | category
output needed:
category | category count for last 30 minutes (feature example)
This process needs to run every 5 minutes and update the counts.
===> What I fail to understand is whether Apache Beam can run this pipeline every 5 minutes, read whatever new input data was generated, and update the counts from the previous run. And if so, can someone point me in the right direction?
Thank you!
When you run a Beam pipeline manually, it's expected to be started only once. It can then be either a bounded (batch) or an unbounded (streaming) pipeline. In the first case, it stops after all of your bounded data has been processed; in the second case, it runs continuously and waits for new data to arrive (until it is stopped manually).
Usually, the type of pipeline depends on the data source that you have (Beam IO connectors). For example, if you read from files, then by default the source is assumed to be bounded (a limited number of files), but it can also be unbounded if you expect new files to arrive and want to read them in the same pipeline.
Also, you can run your batch pipeline periodically with automated tools, like Apache Airflow (or just a unix crontab). So, it all depends on your needs and the type of data source. I could probably give more specific advice if you could share more details of your data pipeline: the type of your data source and environment, an example of input and output results, how often your input data can be updated, and so on.
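To make the requested feature concrete, here is a plain-Python sketch (not Beam code) of what "category count for the last 30 minutes, refreshed every 5 minutes" computes. The function names, the tuple layout (id, timestamp, category) and the choice of a window that excludes its start and includes its end are assumptions for illustration. In a streaming Beam pipeline the same grouping would typically be expressed with sliding windows of 30 minutes and a 5 minute period, while a periodically scheduled batch job would recompute it on each run much like this:

```python
from collections import Counter
from datetime import datetime, timedelta


def counts_last_window(events, now, window=timedelta(minutes=30)):
    """Count events per category with now - window < timestamp <= now."""
    start = now - window
    return Counter(cat for _id, ts, cat in events if start < ts <= now)


def scheduled_counts(events, first_run, runs, every=timedelta(minutes=5)):
    """Simulate `runs` evaluations spaced `every` apart, starting at `first_run`."""
    results = []
    for i in range(runs):
        now = first_run + i * every
        results.append((now, counts_last_window(events, now)))
    return results


if __name__ == "__main__":
    sample = [
        (1, datetime(2024, 1, 1, 12, 0), "a"),
        (2, datetime(2024, 1, 1, 12, 10), "a"),
        (3, datetime(2024, 1, 1, 12, 20), "b"),
    ]
    for run_time, counts in scheduled_counts(sample, datetime(2024, 1, 1, 12, 25), 3):
        print(run_time, dict(counts))
```

Each run recomputes the window from the raw events instead of updating the previous result, which keeps the runs independent at the cost of re-reading up to 30 minutes of data.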
I have a CSV file with a number of fields. What is an idiomatic way to read the file, sort it using a subset of fields, and then write another CSV as output?
Should I even attempt to do this in spring-batch? I understand that *nix-based OSes have the sort utility to do this, but I'd like to contain all my work within spring batch if possible.
The Batch Processing Strategies section of the documentation seems to suggest that there might be standard utility steps to accomplish this:
In addition to the main building blocks, each application may use one or more of standard utility steps, such as:
Sort: A program that reads an input file and produces an output file where records have been re-sequenced according to a sort key field in the records. Sorts are usually performed by standard system utilities.
But I am not able to locate this. Any pointers most welcome!
Thanks very much!
Unless you really need to do it inside Spring Batch, I would suggest you do it with OS-based commands.
But your point is correct: adding intermediary steps to your jobs to sort, filter, or even clean data is a mainstream pattern in batch processing and ETL jobs.
Hope this helps.
I found out that there is a SystemCommandTasklet that is meant to run OS commands. This can be used to do things like sorting, finding unique items, etc.
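For reference, the sort step itself is small. Below is a minimal sketch of an external sort script in Python (not Spring Batch code) that such a tasklet could invoke as an OS command. It assumes the CSV has a header row and fits in memory; the function name and the column names in the usage comment are hypothetical. Unlike a plain sort -t, call, the csv module copes with quoted fields that contain commas.

```python
import csv
from operator import itemgetter


def sort_csv(src_path, dst_path, key_fields):
    """Sort a CSV with a header row by the named columns and write the result."""
    if not key_fields:
        raise ValueError("at least one key field is required")
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        if fieldnames is None:
            raise ValueError("input file is empty")
        # Values are compared as strings, so numeric columns sort lexically.
        rows = sorted(reader, key=itemgetter(*key_fields))
    with open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical usage:
#   sort_csv("input.csv", "sorted.csv", ["city", "name"])
```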
I've got log files from various devices showing users, and I want to create a kind of stateful count of users visiting specific websites for every minute. I can transform the data to the format ts,websitename,userID,(-)1 (1 for joiners, -1 for leavers).
I'd like to end up with a time series with count per website per ts:
ts1,siteA,34
ts2,siteA,30 <- 4 users left
ts3,siteA,32 <- 2 users joined
The way to do this in Spark Streaming is well described. The most straightforward way IMHO would be to have a time window in Spark Streaming of the desired aggregation time and use updateStateByKey to keep a count per website (not even taking the log ts into account, to keep it simple).
Now the question is how to achieve this in a batch process. More specifically, it's not too hard to use aggregateByKey() and end up with something like:
ts0,siteA,30
ts1,siteA,4
ts2,siteA,-4
ts3,siteA,2
But then how to iterate over that? It may not sound very logical, but the only thing I can think of would be to sort the data using sortByKey(), partition it to be sure that all data for a specific site is on one node, and then iterate over every element of the RDD, creating a new RDD with (ts,count).
But e.g. using foreach doesn't iterate over the elements sequentially as far as I understand. Actually this might not even suit Spark well as it's not really "batch-type" work going down to the level of individual records.
Any help or pointers to specific functions greatly appreciated!
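To pin down the step being asked about, here is a plain-Python sketch (not Spark code) of turning per-timestamp deltas into a running count per site. The function name and the tuple layout (ts, site, delta) are assumptions. In Spark the same logic could presumably be applied per key once all rows for a site are grouped together, or expressed as a cumulative sum over a window partitioned by site and ordered by ts; both are suggestions, not something confirmed in this thread.

```python
from itertools import accumulate, groupby
from operator import itemgetter


def running_counts(deltas):
    """Turn (ts, site, delta) rows into (ts, site, running_count) rows.

    The output is ordered by site, then by ts within each site.
    """
    ordered = sorted(deltas, key=itemgetter(1, 0))
    result = []
    for site, group in groupby(ordered, key=itemgetter(1)):
        rows = list(group)
        # Cumulative sum of the deltas for this site, in timestamp order.
        totals = accumulate(delta for _ts, _site, delta in rows)
        result.extend((ts, site, total) for (ts, _s, _d), total in zip(rows, totals))
    return result


if __name__ == "__main__":
    sample = [
        ("ts0", "siteA", 30),
        ("ts1", "siteA", 4),
        ("ts2", "siteA", -4),
        ("ts3", "siteA", 2),
    ]
    for row in running_counts(sample):
        print(row)
```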