How to use Counters in Apache Crunch - apache-crunch

In Apache Crunch, there is a method named increment("any enum").
I used increment(TOTAL_IDS);, but where can I see the resulting counter values? The counters do not appear in the logs after the job completes.
What am I missing?

You should be able to see your counters in the tracking URL of the MapReduce job (if you are running on MapReduce) or extract the counters from the pipeline. It would be useful if you could share the code showing how you increment the counters: is it in your DoFn, in its cleanup method?
Regards
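If it helps, here is a minimal sketch of the usual pattern, assuming an enum named MyCounters (TOTAL_IDS mirrors the one in the question) and a MapReduce-backed pipeline; the exact accessor names may vary slightly between Crunch versions:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PipelineResult;

// Hypothetical counter enum; TOTAL_IDS mirrors the one in the question.
enum MyCounters { TOTAL_IDS }

// A DoFn that bumps the counter for every record it processes.
class CountIds extends DoFn<String, String> {
    @Override
    public void process(String input, Emitter<String> emitter) {
        increment(MyCounters.TOTAL_IDS);
        emitter.emit(input);
    }
}

// After the job finishes, read the counters back from the pipeline result
// instead of looking for them in the logs:
// PipelineResult result = pipeline.done();
// for (PipelineResult.StageResult stage : result.getStageResults()) {
//     System.out.println("TOTAL_IDS = " + stage.getCounterValue(MyCounters.TOTAL_IDS));
// }

The same counters should also appear on the counters page of the job in the MapReduce tracking UI, grouped under the enum's class name.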

Related

Real-time data streaming using Wikipedia's RecentChanges API

Lately I have been trying to create a demo on real-time streaming using NiFi -> Kafka -> Druid -> Superset. For the purposes of this demo I chose to use Wikipedia's RecentChanges API in order to get asynchronous data of the most recent changes.
I use this URL in order to get a response of changes. I am calling the API constantly so as not to miss any changes, and this way I get a lot of duplicates that I do not want. Is there any way to parameterize this API to fix that, for example getting all the changes from the previous second and doing that every second, or something else to tackle this issue? I am trying to build a configuration for this using NiFi; if someone has something to add on that part, please visit this discussion on Cloudera.
Yes. See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brecentchanges Use rcstart and rcend to define your start and end times. You can use "now" for rcend.
I want to expand smartse's answer and come up with a solution. You want to put your API request in certain time windows, by shifting the start and end parameters. Windowing might work like this:
Initialize start, end timestamp parameters
Put those parameters as attributes on the flow
Downstream processors can call the API using those parameters
After doing that, you have to set start = previous_end + 1 second and end = now
When you determine the new window for the next run, you need the parameters from the previous run. This is why you have to remember those values. You can achieve this using NiFi's distributed map cache.
I've assembled a flow for you:
Zoom into Get next date range:
The end parameter is always now, so you just have to store the start parameter. FetchDistributedMapCache will fetch that for you and put it into the stored.state attribute:
The Set time range processor will initialize the parameters:
Notice that end is always now and start is either an initial date (for the first run) or the last end parameter plus 1 second. At this point the flow is directed into the Time range output, where you can call your API downstream. Additionally you have to update the stored.value. This happens in the ReplaceText processor:
Finally you update the state:
The lifecycle of the parameters is bound to the cache identifier; when you change the identifier, you start from scratch.
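Outside of NiFi, the same shifting-window idea can be sketched in plain Java. The query parameters below (rcdir, rcstart, rcend, rclimit) come from the API help page linked above, but the exact parameter values and the hand-off to Kafka are assumptions, so treat this as a sketch of the windowing logic rather than the NiFi flow itself:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class RecentChangesWindow {
    // With rcdir=newer, rcstart is the older bound and rcend is the newer bound.
    private static final String BASE =
        "https://en.wikipedia.org/w/api.php?action=query&list=recentchanges"
        + "&format=json&rclimit=500&rcdir=newer";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Initial window start, truncated to whole seconds (the API uses 1 second granularity).
        Instant start = Instant.now().truncatedTo(ChronoUnit.SECONDS).minusSeconds(1);

        while (true) {
            Instant end = Instant.now().truncatedTo(ChronoUnit.SECONDS); // end is always "now"
            String url = BASE + "&rcstart=" + start + "&rcend=" + end;

            HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());   // hand the batch off to Kafka, etc.

            start = end.plusSeconds(1);             // next window starts 1 second after this one
            Thread.sleep(1000);                     // poll roughly once per second
        }
    }
}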

Can I test kafka-streams suppress logic?

My application uses Kafka Streams suppress logic.
I want to test a Kafka Streams topology that uses suppress.
When I run a unit test, my topology does not emit a result.
Kafka Streams logic:
...
.suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(5), Suppressed.BufferConfig.maxBytes(1_000_000_000L).emitEarlyWhenFull()))
...
My test case code: after creating the input data, the test case cannot read the output record of the suppress logic; readOutput just returns null.
testDriver.pipeInput(recordFactory.create("input", key, dummy, 0L));
System.out.println(testDriver.readOutput("streams-result", Serdes.String().deserializer(), serde.deserializer()));
Can I test my suppress logic?
The simple answer is yes.
Some good references are the Confluent example tests; this example in particular tests the suppression feature, and the many other examples there are always a good place to check first. Here is another example of mine written in Kotlin.
An explanation of the feature and of testing it can be found in post 3 of this blog.
Some key points:
The window will only emit the final result, as described in the documentation.
To flush the final results you will need to send an extra dummy event, as seen in the examples such as Confluent's (and in the sketch at the end of this answer).
You will need to manipulate the event time to test it, as suppression works off the event time; this can be provided via the test input topic API or a custom TimestampExtractor.
For testing I recommend setting the following to disable caching and reduce the commit interval:
props[StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG] = 0
props[StreamsConfig.COMMIT_INTERVAL_MS_CONFIG] = 5
Hope this helps.
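A minimal sketch of such a test, assuming kafka-streams-test-utils 2.4+ (the test input topic API mentioned above), with String serdes standing in for whatever serdes the real topology uses; build the topology containing the suppress() call where indicated:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;

public class SuppressTopologyTest {
    public void shouldEmitAfterTimeLimit() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "suppress-test");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5);

        StreamsBuilder builder = new StreamsBuilder();
        // ... build the topology containing the suppress() call from the question here ...
        Topology topology = builder.build();

        try (TopologyTestDriver testDriver = new TopologyTestDriver(topology, props)) {
            TestInputTopic<String, String> input = testDriver.createInputTopic(
                "input", Serdes.String().serializer(), Serdes.String().serializer());
            TestOutputTopic<String, String> output = testDriver.createOutputTopic(
                "streams-result", Serdes.String().deserializer(), Serdes.String().deserializer());

            // Record at event time 0; suppress holds the result back.
            input.pipeInput("key", "value", 0L);

            // Dummy record whose timestamp advances stream time past the
            // 5 second limit, forcing suppress to flush the buffered result.
            input.pipeInput("dummy-key", "dummy-value", 6_000L);

            System.out.println(output.readKeyValuesToList());
        }
    }
}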

Jmeter - Can I change a variable halfway through runtime?

I am fairly new to JMeter so please bear with me.
I need to understand whether, while a JMeter script is running, I can change the variable holding the details of "DB1" so that it then points at "DB2".
The reason for this is that I want to throw load at one MongoDB and then switch to another database at a certain time (hotdb/colddb).
The easiest way is just defining 2 MongoDB Source Config elements pointing to separate database instances and give them 2 different MongoDB Source names.
Then in your script you will be able to manipulate the MongoDB Source parameter value in the MongoDB Script test element or in JSR223 Samplers, so your queries will hit either hotdb or colddb.
See How to Load Test MongoDB with JMeter article for detailed information
How about reading the value from a file in a Beanshell/JSR223 sampler each iteration and storing it in a variable, then editing/saving the file when you want to switch? It's ugly but it would work.
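A minimal sketch of that file-based approach in a JSR223 Sampler; the file path /tmp/current_db.txt and the variable name targetDb are hypothetical, and you would write "hotdb" or "colddb" into the file whenever you want to switch:

// JSR223 Sampler, executed once per iteration
String dbName = new String(
    java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("/tmp/current_db.txt")),
    java.nio.charset.StandardCharsets.UTF_8).trim();
vars.put("targetDb", dbName);   // downstream samplers can reference it as ${targetDb}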

Talend job batch processing

I am exploring Talend at work, and I was asked if Talend supports batch processing, as in running a job in multiple threads. After going through the user guide I understood that threading is possible with subjobs. I would like to know if it is possible to run a job with a single action in parallel.
Talend has excellent multi-threading support. There are two basic methods for this. One method gives you more control and is implemented using components. The other method is implemented as a job setting.
For the first method see my screenshot. I use tParallelize to load three files into three tables at the same time. Then when all three files are successfully loaded I use the same tParallelize to set the values of a control table. tParallelize can also be connected to tRunJob as easily as a subjob.
The other method is described very well here in Talend Help: Talend Help- Run Jobs in Parallel
Generally I recommend the first method because of the control it gives you, but if your job follows the simple pattern described in the help link, that method works as well.

Matlab - load results from parallel job that has not finished yet

Is there a way to retrieve the variables of a batch job that has not finished yet?
If not, how do I perform some kind of checkpointing, so I could retrieve intermediate results from a parallel job?
Well, there are a few things you can do. First, something like this to poll whether your job is done:
while ~strcmp(jobHandle.State, 'finished')   % job states are lowercase, e.g. 'running', 'finished'
    jobHandle.Tasks                          % all task objects belonging to the job
    jobHandle.Tasks(1)                       % the first task
    jobHandle.Tasks(1).State                 % its current state
    jobHandle.Tasks(1).OutputArguments       % outputs captured so far (populated once the task finishes)
    pause(10)                                % avoid a tight busy-wait
end
Inside that loop, you'll have access to the job object, and all the task objects for that job. I tried to demo some of the data you have access to in the impractical example above. You can use that data-access to set up any checkpoint scheme you want. See the documentation, here, for more info. Good Luck!
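One possible checkpoint scheme, sketched here under stated assumptions (the function, file name, and checkpoint interval are made up, and it assumes the workers and the client can see the same folder), is to have the task function save its intermediate state periodically and then load that file from the client while the job is still running:

% Function submitted via batch (hypothetical names):
function result = longTask(n)
    partial = zeros(1, n);
    for k = 1:n
        partial(k) = k^2;                          % stand-in for the real work
        if mod(k, 100) == 0
            save('checkpoint.mat', 'partial', 'k');   % persist intermediate state
        end
    end
    result = partial;
end

% On the client, while jobHandle.State is still 'running':
% snapshot = load('checkpoint.mat');   % inspect the partial results so far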