AWS Lambda throttled queue processing - queue

I plan to fetch a list of records from a web service which limits the number of requests that can be made within a certain time frame.
My idea was to set up a simple pipeline like this:
List of URLs -> Lambda Function to fetch JSON -> S3
The part I'm not sure about is how to feed the list of URLs in rate/time-limited blocks, e.g. take 5 URLs and spawn 5 Lambda functions every second.
Ideally I'd like to start this by uploading/sending/queueing the list once and then just let it do its thing on its own until it has processed the queue completely.

I'd split the problem into two parts.
Trigger: Lambda supports a wide variety of event sources. Look for "Using AWS Lambda to process AWS events" in the Lambda FAQs.
I personally would go with DynamoDB, but S3 comes in a close second.
There might be other options using other streams like Kinesis, but these seem simpler by far.
Throttling: You can set a limit on the number of concurrent Lambda instances, for example through the function's reserved concurrency setting.
So, e.g., if you go with DDB:
You'll dump all your URLs into a table, one row per URL.
This will create events, one per row.
Each event triggers one Lambda call.
The number of parallel Lambda executions/instances is limited by config.
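If you would rather pace the dispatch yourself (the "5 URLs every second" idea from the question), here is a minimal Python sketch of that batching logic. It is not part of the DDB approach above, and everything in it is an assumption for illustration: `invoke` is a hypothetical callback that would start one Lambda invocation per URL (for example an asynchronous call through the AWS SDK), and `sleep` is injectable so the pacing can be tested without waiting.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def dispatch_throttled(
    urls: Sequence[str],
    per_interval: int,
    interval_seconds: float,
    invoke: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Hand URLs to `invoke` in blocks, pausing between blocks.

    `invoke` is a hypothetical hook: in a real pipeline it might trigger
    one Lambda invocation per URL. Returns the number of URLs dispatched.
    """
    dispatched = 0
    for index, batch in enumerate(batches(urls, per_interval)):
        if index > 0:
            # Wait before every block except the first one.
            sleep(interval_seconds)
        for url in batch:
            invoke(url)
            dispatched += 1
    return dispatched
```

Calling `dispatch_throttled(urls, 5, 1.0, invoke)` would hand over at most 5 URLs per second. Note that this only paces how fast work is started; it does not cap how many invocations run at the same time, which is what the concurrency limit is for.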

Related

Best practices for Informatica Webservice workflow

I have created an Informatica webservice workflow which takes 1 parameter as input. A Webservice provider source definition is used for this, and the mapping is of the one-way type.
The workflow works fine when the parameter is passed. But when the same workflow is triggered from Informatica PowerCenter directly (in which case no parameters are passed), the mapping that contains the webservice provider source definition takes 3 minutes to complete (the log shows "Timeout based commit point").
Is it good practice to run the webservice workflow from PowerCenter directly? And is there a way to improve its performance when it is triggered from PowerCenter directly?
Note: I am trying to use 1 workflow for both: 1) passing the parameter from the web and 2) scheduling the workflow in Informatica.
Answers to your questions below.
Is it a good practice to run the webservice workflow from power center directly?
Of course, it depends on the requirement: whether you need to extract data automatically from the WS or not. If you pass the parameter using some session, then I don't see much of an issue here, and your session completes in time.
So you can create a new session, command task, or shell script that creates a param file, and then use that file in the original session so the value is passed on to the WS.
In a complex scenario you may have to pass multiple values. In that case, I would recommend using a parent workflow that calls the original workflow multiple times and changes the param before every call.
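As a rough illustration of the "script that creates a param file" step, here is a small Python sketch. The section-header layout (`[folder.WF:workflow.ST:session]` followed by `$$Name=value` lines) is how I understand PowerCenter parameter files to look, so please verify it against your version. The folder, workflow, session, and parameter names are all hypothetical.

```python
from typing import Mapping


def render_param_file(
    folder: str,
    workflow: str,
    session: str,
    params: Mapping[str, str],
) -> str:
    """Return parameter-file text with one section for a single session.

    Assumed layout: a [folder.WF:workflow.ST:session] header, then one
    $$Name=value line per mapping parameter.
    """
    lines = [f"[{folder}.WF:{workflow}.ST:{session}]"]
    for name, value in params.items():
        lines.append(f"$${name}={value}")
    return "\n".join(lines) + "\n"


def write_param_file(path: str, text: str) -> None:
    """Write the rendered text where the session expects its param file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

A parent workflow (or scheduler) could call something like this before each run of the original workflow, changing the value every time.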
Is there a way to improve its performance when triggered from power center directly?
It really depends on a few factors.
The web service: make sure you are using the correct input and output columns. Most of the time a WS is sensitive to outside calls, and you need to choose optimized columns to extract data for better performance. You can work with the WS admin to find out the correct columns.
If the Informatica flow is complex, then depending on the bottleneck transformation(s) (source, target, expression, lookup, aggregator, sorter), we can check and take action.
For a lookup, you can add a new filter to exclude unwanted data, remove unwanted columns, etc.
For an aggregator, you can use a sorter before it to improve performance.
... like this

Azure Data Factory ForEach is seemingly not running data flow in parallel

In Azure Data Factory I am using a Lookup activity to get a list of files to download, then passing it to a ForEach where a data flow processes each file.
I do not have 'Sequential' mode turned on, so I would assume that the data flows should be running in parallel. However, their runtimes are not the same; they grow by an almost constant amount (the first data flow ran for 4 mins, the second 6, the third 8, and so on). It seems as if the second data flow waits for the first one to finish and then uses its cluster to process the file.
Is that the intended behavior? I have TTL set on the cluster, but that did not help much. If it is intended, then what is a workaround? I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure whether I am going to see an increase in efficiency.
I have not been able to solve the issue of the data flows not executing in parallel. However, I have managed to change the solution in a way that increases performance.
What was there before: a Lookup activity that would get a list of files to process, passed on to a ForEach loop with a data flow activity.
What I am testing now: a data flow activity that gets a list of files and saves them in a text file in ADLS, then another data flow activity (the one that was previously in the ForEach loop) with its source changed to use "List of Files" and pointed at that list.
The result was an increase in efficiency (using the same cluster, 40 files would take around 80 mins using ForEach and only 2-3 mins using List of Files). However, debugging is not easy now that everything is in 1 data flow.
You can overwrite the list-of-files file on each run, or use dynamic expressions to name the file after the pipelineId or something else.

Streaming data Complex Event Processing over files and a rather long period

My challenge:
We receive files every day with about 200,000 records. We keep the files for approx. 1 year to support re-processing, etc.
For the sake of the discussion, assume it is some sort of long-lasting fulfilment process, with a provisioning-ID that correlates records.
We need to identify flexible patterns in these files and trigger events.
Typical questions are:
if record A is followed by record B, which is followed by record C, and all records occurred within 60 days, then trigger an event
if record D or record E was found, but record F did NOT follow within 30 days, then trigger an event
if both record D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event
Some patterns require lookups in a DB/NoSQL store or joins for additional information, either to select the record or to put into the event.
"Selecting a record" can be a simple "field-A equals", but can also be "field-A in []" or "field-A match " or "func identify(field-A, field-B)"
"days" might also be "hours" or "in previous month", hence more flexible than just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancel within setup phase).
The created events (preferably JSON) need to contain data from all records which were part of the selection process.
We need an approach that allows us to flexibly change (add, modify, delete) the patterns, optionally re-processing the input files.
Any thoughts on how to tackle the problem elegantly? Maybe some Python or Java framework, or does any of the public cloud solutions (AWS, GCP, Azure) address the problem space especially well?
Thanks a lot for your help.
After some discussions and reading, we'll first try Apache Flink with the FlinkCEP library. From the docs and blog entries it seems to be able to do the job. It also seems to be AWS's choice, running on their EMR clusters. We didn't find any managed service on GCP or Azure providing this functionality. Of course, we can always deploy and manage it ourselves. Unfortunately, we didn't find a Python framework.
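To make the first pattern from the question concrete ("A followed by B followed by C, all within 60 days"), here is a plain Python sketch. It is not FlinkCEP and not a framework, just the matching logic on in-memory records. The `Record` type and its field names are assumptions for illustration. "Followed by" is treated loosely here: other records may appear in between.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class Record:
    provisioning_id: str  # correlates records of one fulfilment process
    kind: str             # e.g. "A", "B", "C" (hypothetical record types)
    timestamp: datetime


def find_sequences(
    records: Iterable[Record],
    kinds: Sequence[str],
    within: timedelta,
) -> Dict[str, List[Record]]:
    """Return, per provisioning ID, the first chain of records whose kinds
    appear in the given order and all lie within `within` of the first."""
    if not kinds:
        raise ValueError("kinds must not be empty")
    by_id: Dict[str, List[Record]] = {}
    for rec in sorted(records, key=lambda r: r.timestamp):
        by_id.setdefault(rec.provisioning_id, []).append(rec)

    matches: Dict[str, List[Record]] = {}
    for pid, recs in by_id.items():
        for i, first in enumerate(recs):
            if first.kind != kinds[0]:
                continue
            chain = [first]
            for rec in recs[i + 1:]:
                if len(chain) == len(kinds):
                    break
                if rec.timestamp - first.timestamp > within:
                    break  # everything later is outside the window too
                if rec.kind == kinds[len(chain)]:
                    chain.append(rec)
            if len(chain) == len(kinds):
                matches[pid] = chain  # all records needed for the event
                break
    return matches
```

`find_sequences(records, ["A", "B", "C"], timedelta(days=60))` returns the matched records per provisioning-ID, which could then be serialized into the JSON event. The "did NOT follow within 30 days" patterns need a notion of current time and are not covered by this sketch.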

Real-time data streaming using Wikipedia's RecentChanges API

I've lately been trying to create a demo of real-time streaming using NiFi -> Kafka -> Druid -> Superset. For the purposes of this demo I chose to use Wikipedia's RecentChanges API in order to get asynchronous data on the most recent changes.
I use this URL in order to get a response of changes. I'm calling the API constantly in order not to miss any changes. This way I get a lot of duplicates that I do not want. Is there any way to parameterize this API to fix that, for example by getting all the changes from the previous second and doing that every second, or some other way to tackle this issue? I'm trying to build a configuration for this using NiFi; if someone has something to add on that part, then visit this discussion on Cloudera.
Yes. See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brecentchanges Use rcstart and rcend to define your start and end times. You can use "now" for rcend.
I want to expand smartse's answer and come up with a solution. You want to put your API request in certain time windows, by shifting the start and end parameters. Windowing might work like this:
Initialize start, end timestamp parameters
Put those parameters as attributes on the flow
Downstream processors can call the API using those parameters
After doing that, you have to set start = previous_end + 1 second and end = now
When you determine the new window for the next run, you need the parameters from the previous run. This is why you have to remember those values. You can achieve this using NiFi's distributed map cache.
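Outside of NiFi, the windowing steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: a plain dict stands in for the distributed map cache, `INITIAL_START` and the cache key are hypothetical, and the request uses `rcdir=newer` so that `rcstart` is the older bound and `rcend` the newer one (with the API's default direction the two are the other way round).

```python
from datetime import datetime, timedelta, timezone
from typing import MutableMapping, Tuple
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"
# Hypothetical starting point used only for the very first run.
INITIAL_START = datetime(2024, 1, 1, tzinfo=timezone.utc)


def next_window(
    cache: MutableMapping[str, datetime],
    now: datetime,
    key: str = "rc.start",
) -> Tuple[datetime, datetime]:
    """Return (start, end) for this run and remember the next start.

    The end is always `now`; the start is the stored value, or the
    initial date when nothing has been stored yet.
    """
    start = cache.get(key, INITIAL_START)
    end = now
    cache[key] = end + timedelta(seconds=1)  # start = previous_end + 1 second
    return start, end


def recentchanges_url(start: datetime, end: datetime) -> str:
    """Build a RecentChanges query URL for the window [start, end]."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    params = {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        "rcdir": "newer",  # list oldest first, so rcstart < rcend
        "rcstart": start.strftime(fmt),
        "rcend": end.strftime(fmt),
    }
    return f"{API}?{urlencode(params)}"
```

Each run calls `next_window(cache, datetime.now(timezone.utc))` and requests `recentchanges_url(start, end)`. Because consecutive windows do not overlap, the same change should not be fetched twice.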
I've assembled a flow for you:
Zoom into Get next date range:
The end parameter is always now, so you just have to store the start parameter. FetchDistributedMapCache will fetch that for you and put it into stored.state attribute:
Set time range processor will initialize the parameters:
Notice that end is always now and start is either an initial date (for the first run) or the last end parameter plus 1 second. At this point the flow is directed into the Time range output, where you can call your API downstream. Additionally you have to update the stored.value. This happens in the ReplaceText processor:
Finally you update the state:
The lifecycle of the parameters is bound to the cache identifier. When you change the identifier, you start from scratch.

Access to stage timing through API

In the Stage View plugin, we can see on each of the stages a timestamp that displays the number of seconds for the stage, and the number of seconds spent waiting within that stage. This is kind of interesting data, but I haven't figured out where we may be able to access it outside of the display on the single pipeline. If we want to use these times in our own metrics program, say to measure trends over multiple pipelines and/or projects, is it accessible externally somehow?
You can use the existing JSON API: http://[JENKINS_HOST]/job/test/[BUILD_NUMBER]/wfapi/describe
There is timing information in the response.
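As a rough sketch of pulling those timings into your own metrics, something like the following Python could work. The field names (`stages`, `name`, `durationMillis`, `pauseDurationMillis`) are what I understand the describe response to contain, so check them against the JSON your own instance returns. The host and job name are placeholders, and an instance that requires login would also need authentication, which is left out here.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def describe_url(jenkins_host: str, job: str, build_number: int) -> str:
    """Build the wfapi/describe URL in the shape quoted above."""
    return f"http://{jenkins_host}/job/{job}/{build_number}/wfapi/describe"


def fetch_run(url: str) -> Dict[str, Any]:
    """Download and parse the JSON description of one pipeline run."""
    with urlopen(url) as response:
        return json.load(response)


def stage_timings(run: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract per-stage duration and pause time (milliseconds).

    Assumes each entry in run["stages"] carries "name",
    "durationMillis" and "pauseDurationMillis"; missing values
    default to 0 rather than raising.
    """
    return [
        {
            "stage": stage.get("name"),
            "duration_ms": stage.get("durationMillis", 0),
            "pause_ms": stage.get("pauseDurationMillis", 0),
        }
        for stage in run.get("stages", [])
    ]
```

Looping `stage_timings(fetch_run(describe_url(host, job, n)))` over several jobs and build numbers would give rows you can feed into a trend report.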