Real-time data streaming using Wikipedia's RecentChanges API

Real-time data streaming using Wikipedia's RecentChanges API - rest

I'm lately trying to create a demo on real time streaming using NiFi -> Kafka -> Druid -> Superset. For the purposes of this demo I chose to use Wikipedia's RecentChanges API in order to get asynchronous data of the most recent changes.
I use this URL in order to get a response of changes. I'm calling the API constanlty in order to not miss any changes. This way I get a lot of duplicates that I do not want. Is there anyway to parameterize this API to fix it for example getting all the changes from the previous second and doing that everysecond or something else to tackle this issue. I'm trying to make a configuration for this uing NiFi, if someone has to add something on that part then visit this discussion on Cloudera.

Yes. See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brecentchanges Use rcstart and rcend to define your start and end times. You can use "now" for rcend.

I want to expand smartse's answer and come up with a solution. You want to put your API request in certain time windows, by shifting the start and end parameters. Windowing might work like this:
Initialize start, end timestamp parameters
Put those parameters as attributes on the flow
Downstream processors can call the API using those parameters
After doing that, you have to set start = previous_end + 1 second and end = now
When you determine the new window for the next run, you need the parameters from the previous run. This is why you have to remember those values. You can achieve this using NiFi's distributed map cache.
I've assembled a flow for you:
Zoom into Get next date range:
The end parameter is always now, so you just have to store the start parameter. FetchDistributedMapCache will fetch that for you and put it into stored.state attribute:
Set time range processor will initialize the parameters:
Notice that end is always now and start is either an initial date (for the first run) or the last end parameter plus 1 second. At this point the flow is directed into the Time range output, where you can call your API downstream. Additionally you have to update the stored.value. This happens in the ReplaceText processor:
Finally you update the state:
The lifecycle of the parameters are bound to the cache identifier. When you change the identifier, you start from scratch.

Related

Best practices for Informatica Webservice workflow

I have created a Informatica webservice workflow which takes 1 parameter as input. A Webservice provider source definition is used for this and mapping is a one-way type.
Workflow works fine when parameter is being passed. But when the same workflow is triggered from Informatica Power center directly (in which case no parameters are passed), mapping that contains webservice provider source definition takes 3 minutes to complete (Gives Timeout based commit point in the log).
Is it a good practice to run the webservice workflow from power center directly? And is there a way to improve its performance when triggered from power center directly?
Note: I am trying to use 1 workflow for both - 1) Pass the parameter from web 2) Schedule the workflow in Informatica

Answers to your questions below.
Is it a good practice to run the webservice workflow from power center directly?
Of course it depends on requirement - whether you need to extract data automatically from WS or not. If you pass parameter using some session then i dont see much issue here and your session is completing within time.
So, you can create a new session/command task/shell script to create a param file and then use it in original session so it is passed on to WS.
In a complex scenario, you may have to pass multiple values, in such case, i would recommend to use a parent workflow to call original workflow multiple times and change param every time before call.
Is there a way to improve its performance when triggered from power center directly?
It is really depends on few factors.
The web service - Make sure you are using correct input and output columns. Most of the time WS are sensitive to outside call and you need to choose optimized column to extract data for better performance. You can work with WS admin to know correct column.
If informatica flow is complex then depending on bottle neck transformation/s (source, target, expression, lookup, aggregator, sorter), we can check and take actions.
For lookup, you can add new filter to exclude unwanted data, remove unwanted columns etc.
For aggregator, you can use sorter before to improve perf.
... like this

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

I'm working with code that uses a tumbling window of one day, and would like to send early results to a different DataStream on an hourly basis.
I understand that triggers are a way to go here, but don't really see how it would work.
The current code is as follows:
myStream
.keyBy(..)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
In my understanding, I should register a trigger, and then on its onEventTime method get a hold of a TriggerContext and I can send data to the labeled output from there. But how do I get the current state of MyAggregateFunction from there? Or would I need to my own computation here inside of onEventTime()?
Also, the documentation states that "By specifying a trigger using trigger() you are overwriting the default trigger of a WindowAssigner.". Would my one day window then still fire correctly, or do I need to trigger it somehow differently?
Another way of doing this is creating two different operators - one that windows by 1 hour, and another that windows by 1 day. Would triggers be a preferred approach to that?

Rather than using a custom Trigger, it would be simpler to have two layers of windowing, where the hourly results are further aggregated into daily results. Something like this:
hourlyResults = myStream
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.hours(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
dailyResults = hourlyResults
.keyBy(...)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
hourlyResults.addSink(...)
dailyResults.addSink(...)
Note that the result of a window is not a KeyedStream, so you will need to use keyBy again, unless you can arrange to leverage reinterpretAsKeyedStream (docs).

Normally when I get to more complex behavior like this, I use a KeyedProcessFunction. You can aggregate (and save in state) hourly and daily results, set timers as needed, and use a side output for the hourly results versus the regular output for the daily results.

There are quite a few questions here. I will try to ask all of them. First of all, if You specify Your own trigger using trigger() this means You are going to effectively override the default trigger and thus the window may not work the way it would by default. So, if You for example if You create the 1 day event time tumbling window, but override a trigger so that it fires for every 20th element, it will never fire based on event time.
Now, after Your custom trigger fires, the output from MyAggregateFunction will be passed to MyProcessWindowFunction, so It will work the same as for the default trigger, you don't need to access the MyAggregateFunction from inside the trigger.
Finally, while it may be technically possible to implement trigger to trigger partial results every hour, my personal opinion is that You should probably go with the two separate windows. While this solution may create a slightly larger overhead and may result in a larger state, it should be much clearer, easier to implement, and finally much more error resistant.

JMeter to record results on hourly basis

I have a JMeter project with multiple GET and POST requests and assertions for these. I use Aggregate results and View results tree listeners, but none of these can store results on hourly basis. I tried JMeterPlugins-Standard and JMeterPlugins-Extras packages and jp#gc - Graphs Generator listener, but all of them use aggregated data instead of hourly data. So I would like to get number of successful and failed requests/assertions per hour, maybe a bar chart would be most suitable for this purpose.

I'm going to suggest a non-conventional design-level solution: name your samplers dynamically with hour (or date and hour), so that each hour the name will change, and thus they will appear in different category, i.e.:
The code for such name is:
${__time(dd:hh,)} the rest of sampler name
Such sampler will appear in the following way in Aggregate Report (here I simulated it with minutes/seconds, but same will happen with days/hours, just on larger scale):
Pros and cons of such approach:
Simple, you can aggregate anything by hour, minute, or any other time slice while test is running, and not by analysis after execution.
Not listener-dependant, can be used with pretty much any listener or visualizer
If you want to also have overall stats, it will require to sum up every sub-category. So it alters data, but in the way that it can still can be added back to original relatively easy.
Calculating __time before every sampler will not be unnoticed completely from performance perspective, but I don't think it will add visible overhead to a script.
You could get the same data by properly aggregating JTL or CSV (whichever you use) after execution, so it doesn't provide you with anything that is not possible to achieve using standard methods
Script needs altering to make this happen. if you have 100s of samplers, it's going to take a while. And if you want to change back...

You might want to use Filter Results Tool which has --start-offset and --end-offset parameters, you can "cut" your results file into "interesting" pieces and plot them according to your requirements.
You can install Filter Results Tool using JMeter Plugins Manager
Also be aware that according to JMeter Best Practices you should
Use as few Listeners as possible; if using the -l flag as above they can all be deleted or disabled.
Don't use "View Results Tree" or "View Results in Table" listeners during the load test, use them only during scripting phase to debug your scripts.
You can get whatever information you need from the .jtl results file, you can specify test results location via -l command-line argument

To get summarized results per hour add to your test plan Generate Summary Results:
Generates a summary of the test run so far to the log file and/or standard output
Update interval in jmeter.properties to your needs ,1 hour, 3600 seconds:
summariser.interval=3600
You will get summary per hour of your requests.

You can try with Jmeter backend Listener. It has integration with graphite and Influxdb. After storing the results in these time series database you can display the result in Grafana dashboard. Grafana has its own filtering of showing the results in hourly, monthly, daily basis and so on.

Akka Persistence: Where do the execution of the Command Goes when it is not simply a state update

Just for clarification: Where do the execution of a command goes, when the execution is not simply a state update (like in most examples found online)
For instance, in my case,
The Command is FetchLastHistoryChangeSet which consist in fetching the last history changeset from an external service based on where we left off last time. In other words the time of the newest change of the previous history ChangeSet Fetched.
The Event would be HistoryChangeSetFetched(changeSet, time). In correlation to what has been said above, the time should be that of the newest change of the newly history ChangeSet Fetched (as per the command event currently being handled)
Now in all example that i see, it is always: (i) validating the command, then, (ii) persisting the event, and finally (iii) handling the event.
It is in handling the event that i have seen custom code added in addition to the updatestate logic. Where, the custom code is usually added after the update state function. But this custom is most of the time about sending message back to the sender, or broadcasting it to the event bus.
As per my example, it is clear that i need to do quite few operation to actually call Persist(HistoryChangeSetFetched(changeSet, time)). Indeed i need the new changeset, and the time of the newest change of it.
The only way i see it possible is to do the fetch in the validating the command
That is:
case FetchLastHistoryChangeSet => val changetuple = if ValidateCommand(FetchLastHistoryChangeSet) persit(HistoryChangeSetFetched(changetuple._1, changetuple._2)) { historyChangeSetFetched =>
updateState(historyChangeSetFetched)
}
Where the ValidateCommand(FetchLastHistoryChangeSet)
would have as logic, to read last changeSet time (newest change of the changeSet), fetch a new changeset based on it, if it exist, get the time of its newest change, and return the tuple.
My question is, is that how it is supposed to work. Validating command
can be something as complex as that ? i.e. actually executing the
command ?

As it says in the documentation: "validation can mean anything, from simple inspection of a command message's fields up to a conversation with several external services"
So I think what you're trying to do is exactly right. Any interaction with an external service must be done at the command validation stage.

how to create a elastic watch which can identify the changes of data in a given index of elasticsearch

In the offical site of Elastic Watcher, they said
Watcher is a plugin for Elasticsearch that provides alerting and notification based on changes in your data
The relevant data or changes in data can be identified with a periodic Elasticsearch query
What I want is a function like Trigger of MySQL, that is when a record is updated, a action is triggered.
But I didn't find a example or document to address this use case, can anybody tell me how to do this?

You define an input of type search and using body, indices (mainly) you define which indices to look at (indices) and what is the actual query (body). If you need other settings, there are many more things to configure. After this, you define a condition and an action to complete the flow.
Make an attempt and create a watch. If you have difficulties, provide details of what you tried in a different SO post (you realize your current post is not appropriate for SO since you ask for complete code without you trying anything).

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Real-time data streaming using Wikipedia's RecentChanges API - rest

Yes. See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brecentchanges Use rcstart and rcend to define your start and end times. You can use "now" for rcend.

Related

Best practices for Informatica Webservice workflow

Early firing in Flink - how to emit early window results to a different DataStream with a trigger

JMeter to record results on hourly basis

Akka Persistence: Where do the execution of the Command Goes when it is not simply a state update

how to create a elastic watch which can identify the changes of data in a given index of elasticsearch

Categories

Resources