How to send message by message to Kafka - apache-kafka

I'm new to reactive programming and I'm trying to implement a very basic scenario.
I want to send a message to Kafka each time a file is dropped into a specific folder.
I think I don't understand the basic things well... so could you please help me?
So I have a few questions:
What is the difference between smallrye-reactive-messaging and smallrye-reactive-streams-operators?
I have this simple code:
@Outgoing("my-topic")
public PublisherBuilder<Message<MessageWrapper>> generate() {
    if (Objects.isNull(currentMessage)) {
        // currentMessage is an instance variable which is null when I start the application
        return ReactiveStreams.of(new MessageWrapper()).map(Message::of);
    } else {
        // currentMessage has been correctly set with the file information
        LOGGER.info(currentMessage);
        return ReactiveStreams.of(currentMessage).map(Message::of);
    }
}
When the code goes into the if statement, everything is OK and I get a JSON serialization of my object with null values. However, I don't understand why, when my code goes into the else statement, nothing reaches the topic. It seems that the .of instruction in the if statement has broken the stream, or something like that...
How do I keep a continuous stream that 'reacts' to newly dropped files (or to other events, like an HTTP GET request)?
If I don't return an instance of PublisherBuilder but an Integer, for example, then my Kafka topic is flooded with a very large stream of Integer values. This is why the examples use intervals when sending messages...
Should I use CompletionStage or CompletableFuture? RxJava 2? It's a bit confusing which lib to use (Vert.x, SmallRye, RxJava 2, MicroProfile, ...).
What are the differences between:
ReactiveStreams.fromCompletionStage
ReactiveStreams.fromProcessor
ReactiveStreams.fromPublisher
ReactiveStreams.fromSubscriber
Which one should be used in which scenario?
Thank you very much!

Let's start with the difference between smallrye-reactive-messaging and smallrye-reactive-streams-operators: smallrye-reactive-streams-operators is the same as smallrye-reactive-messaging, but in addition it supports MicroProfile Context Propagation. Since most reactive-messaging providers use Vert.x behind the scenes, your message is processed in an event-loop style, which means it runs on a separate thread. Sometimes you need to propagate some context from your base thread into that new thread (e.g., propagating the CDI and transaction contexts to execute some JPA EntityManager logic). That's where context propagation helps.
As for method signatures, you can take a look at the official documentation of SmallRye Reactive Streams, sections 3, 4, and 5. Each one has a different use case; it is up to you which flavor you want to use.
When to use what? If you are not running within a reactive context, you can use the snippet below to send messages.
@Inject
@Channel("my-channel")
Emitter emitter;
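For instance, a file-watcher callback could push each new file onto the channel. Here is a minimal hedged sketch; the typed Emitter<MessageWrapper>, the onFileDropped hook, and the folder-watching wiring are assumptions, not part of the original answer:

@Inject
@Channel("my-channel")
Emitter<MessageWrapper> emitter;

// called by whatever watches the folder (e.g. a java.nio.file.WatchService loop)
public void onFileDropped(java.nio.file.Path file) {
    MessageWrapper payload = new MessageWrapper();
    // ... populate payload from the file ...
    emitter.send(payload); // each call pushes one message to the Kafka-bound channel
}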
For message consumption, you can use a method signature like this:
@Incoming("channel-2")
public CompletionStage doSomething(Message anEvent)
Or:
@Incoming("channel-2")
public void doSomething(String anEvent)
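As a hedged sketch of the first variant, you might acknowledge explicitly once processing succeeds (the String payload type and the LOGGER are assumptions):

@Incoming("channel-2")
public CompletionStage<Void> doSomething(Message<String> anEvent) {
    LOGGER.info(anEvent.getPayload());
    return anEvent.ack(); // signal that the message was processed successfully
}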
Hope that helps.


Is it possible to create a batch flink job in streaming flink job?

I have a streaming job using Apache Flink (version 1.8.1) with Scala. The required job flow is as follows:
Kafka -> write to HBase -> send to Kafka again with a different topic
During the write to HBase, there is a need to retrieve data from another table. To ensure that the data is not empty (NULL), the job must check repeatedly (within a certain time) whether the data is empty.
Is this possible with Flink? If yes, can you help by providing examples for conditions similar to mine?
Edit:
I mean that, given the problem I described above, I thought about creating some kind of batch job inside the streaming job, but I couldn't find the right example for my case. So, is it possible to create a batch Flink job inside a streaming Flink job? If yes, can you help by providing examples for conditions similar to mine?
With more recent versions of Flink you can do lookup queries (with a configurable cache) against HBase from the SQL/Table APIs. Your use case sounds like it might be easily implemented in this fashion. See the docs for more info.
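As a rough, hedged sketch of that approach (the table names, columns, and the DDL that registers kafka_source, hbase_dim, and kafka_sink are assumptions, and the FOR SYSTEM_TIME AS OF lookup-join syntax needs a Flink newer than the 1.8.1 in the question):

// enrich each Kafka record with a cached lookup against an HBase dimension table
TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
// ... register kafka_source, hbase_dim and kafka_sink via tEnv.executeSql("CREATE TABLE ...") ...
tEnv.executeSql(
    "INSERT INTO kafka_sink " +
    "SELECT s.id, h.info " +
    "FROM kafka_source AS s " +
    "JOIN hbase_dim FOR SYSTEM_TIME AS OF s.proc_time AS h " +
    "ON s.id = h.rowkey");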
Just to clarify my comment, I will post a sketch of what I was trying to suggest, based on the Broadcast State pattern. The link provides an example in Java, so I will follow it; in Scala it should not be too different. You will likely have to implement the code below as explained in the link I mentioned:
DataStream<String> output = colorPartitionedStream
    .connect(ruleBroadcastStream)
    .process(
        // type arguments in our KeyedBroadcastProcessFunction represent:
        //   1. the key of the keyed stream
        //   2. the type of elements in the non-broadcast side
        //   3. the type of elements in the broadcast side
        //   4. the type of the result, here a string
        new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
            // my matching logic
        }
    );
I was suggesting that you can collect the stream ruleBroadcastStream at fixed intervals from the database or whatever your store is. Instead of getting:

// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
    .broadcast(ruleStateDescriptor);

like the web page says, you will need to add a source that you can schedule to run every X minutes:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

BroadcastStream<Rule> ruleBroadcastStream = env
    .addSource(new YourStreamSource())
    .broadcast(ruleStateDescriptor);

public class YourStreamSource extends RichSourceFunction<YourType> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<YourType> ctx) throws Exception {
        while (running) {
            // TODO: yourData = FETCH DATA;
            ctx.collect(yourData);
            Thread.sleep(TimeUnit.MINUTES.toMillis(X)); // sleep for X minutes
        }
    }

    @Override
    public void cancel() {
        this.running = false;
    }
}
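For completeness, here is a hedged sketch of the elided "matching logic" above, following the shape of the Flink documentation example linked earlier (the Rule.name field and the matching itself are assumptions, not from the original answer):

new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {

    private final MapStateDescriptor<String, Rule> ruleStateDescriptor =
        new MapStateDescriptor<>(
            "RulesBroadcastState",
            BasicTypeInfo.STRING_TYPE_INFO,
            TypeInformation.of(new TypeHint<Rule>() {}));

    @Override
    public void processElement(Item value, ReadOnlyContext ctx, Collector<String> out)
            throws Exception {
        // keyed side: read-only access to the periodically refreshed rules
        for (Map.Entry<String, Rule> entry :
                ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
            // ... match value against entry.getValue() and emit a result ...
        }
    }

    @Override
    public void processBroadcastElement(Rule rule, Context ctx, Collector<String> out)
            throws Exception {
        // broadcast side: overwrite the stored rules each time the source re-fetches
        ctx.getBroadcastState(ruleStateDescriptor).put(rule.name, rule);
    }
};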

How to Register Interests using 'ALL_KEYS' in Spring Data GemFire with ClientRegionFactoryBean

I want to register interest in ALL_KEYS for my Pivotal GemFire client via Spring Data GemFire, but I find that ClientRegionFactoryBean has only this method:
org.springframework.data.gemfire.client.ClientRegionFactoryBean.setInterests(Interest<MyRegionPojo>[] interests)
In this case I can only set exact keys, but I want to register interest in all keys. My key is not a simple class like String or Long, but a complex object, MyRegionPojo.
Please help if there is any method to do this, like the GemFire API's region.registerInterest("ALL_KEYS");
Your problem statement is a bit vague, but I assume/suspect you are configuring your Spring Data GemFire (SDG) application using Spring JavaConfig?
However, I will quickly add that this is not unlike how you would register interest in all keys using SDG's XML namespace, as shown here.
The JavaConfig approach is similar, but clearly based on "strongly-typed arguments", namely one or more sub-type instances of the o.s.d.g.client.Interest class passed to the o.s.d.g.client.ClientRegionFactoryBean.setInterests(:Interest<K>[]) method.
By way of example, you might do the following...
@Bean("Example")
public ClientRegionFactoryBean<?, ?> exampleRegion(GemFireCache gemfireCache) {

    ClientRegionFactoryBean<MyRegionKey, MyRegionValue> exampleRegion =
        new ClientRegionFactoryBean<>();

    RegexInterest regexInterest = new RegexInterest();
    regexInterest.setKey(".*");

    exampleRegion.setCache(gemfireCache);
    exampleRegion.setShortcut(ClientRegionShortcut.PROXY);
    exampleRegion.setInterests(new Interest[] { regexInterest });
    exampleRegion.setKeyConstraint(MyRegionKey.class);
    exampleRegion.setValueConstraint(MyRegionValue.class);

    return exampleRegion;
}
NOTE: I updated the example above to reflect the proper way to register (regex) interest based on SDG 1.9 or earlier. Keep in mind that o.s.d.g.client.RegexInterest.getRegex() delegates to getKey(); therefore you can set the regular expression using setKey(:String), as I have shown above.
Notice the o.s.d.g.client.RegexInterest sub-type registration, which is effectively the same as registering interest in "ALL_KEYS", as described here as well.
Hope this helps!
-John

How to use delta trigger in flink?

I want to use the DeltaTrigger in Apache Flink (Flink 1.3), but I have some trouble with this code:
.trigger(DeltaTrigger.of(100, new DeltaFunction[uniqStruct] {
  override def getDelta(oldFp: uniqStruct, newFp: uniqStruct): Double = newFp.time - oldFp.time
}, TypeInformation[uniqStruct]))
And I get this error:
error: object org.apache.flink.api.common.typeinfo.TypeInformation is not a value [ERROR] }, TypeInformation[uniqStruct]))
I don't understand why DeltaTrigger needs a TypeSerializer[T], and I don't know what to do to resolve this error.
Thanks a lot everyone.
I would read into this a bit: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html. It sounds like you can create a serializer using typeInfo.createSerializer(config) on your type info. Note that what you're passing in currently is a type itself and NOT the type info, which is why you're getting this error.
You would need to do something more like:
val uniqStructTypeInfo: TypeInformation[uniqStruct] = createTypeInformation[uniqStruct]
val uniqStructTypeSerializer = uniqStructTypeInfo.createSerializer(config)
To quote the page above regarding the config parameter you need to pass to createSerializer:
The config parameter is of type ExecutionConfig and holds the information about the program's registered custom serializers. Wherever possible, try to pass the program's proper ExecutionConfig. You can usually obtain it from DataStream or DataSet via calling getExecutionConfig(). Inside functions (like MapFunction), you can get it by making the function a Rich Function and calling getRuntimeContext().getExecutionConfig().
DeltaTrigger needs a TypeSerializer because it uses Flink's managed state mechanism to store each element for later comparison with the next one (it just keeps one element, the last one, which is updated as new elements arrive).
You will find an example (in Java) here.
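Tying the two answers together, a hedged Java sketch of the full wiring might look like this (the UniqStruct type with its time field, the input stream, and the angle-bracket placeholders are assumptions carried over from the question):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

TypeInformation<UniqStruct> typeInfo = TypeInformation.of(UniqStruct.class);
TypeSerializer<UniqStruct> serializer = typeInfo.createSerializer(env.getConfig());

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .trigger(DeltaTrigger.of(
        100, // threshold
        new DeltaFunction<UniqStruct>() {
            @Override
            public double getDelta(UniqStruct oldFp, UniqStruct newFp) {
                return newFp.time - oldFp.time;
            }
        },
        serializer))
    .apply(<window function>);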
But if all you need is a window that triggers every 100 msec, then it'll be easier to just use a TimeWindow, such as:
input
    .keyBy(<key selector>)
    .timeWindow(Time.milliseconds(100))
    .apply(<window function>)
Updated:
To have hour-long windows that trigger every 100msec, you could use sliding windows. However, you would have 10 * 60 * 60 windows, and every event would be placed into each of these 36000 windows. So that's not a great idea.
If you use a GlobalWindow with a DeltaTrigger, then the window will be triggered only when events are more than 100msec apart, which isn't what you've said you want.
I suggest you look at ProcessFunction. It should be straightforward to get what you want that way.
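For instance, a hedged sketch along those lines (the Event type, the keying, and the simple count are assumptions; a real implementation would also expire state older than an hour):

// apply on a keyed stream: input.keyBy(<key selector>).process(new PeriodicEmitter())
public class PeriodicEmitter extends ProcessFunction<Event, String> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Event value, Context ctx, Collector<String> out)
            throws Exception {
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);
        // arm a processing-time timer 100 msec ahead; results are emitted in onTimer
        ctx.timerService().registerProcessingTimeTimer(
            ctx.timerService().currentProcessingTime() + 100);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        out.collect("events so far: " + count.value());
    }
}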

How to make SuggestBox to reliably RPC-call to Server DB in GWTP (Gwt platform) Framework?

I spent many hours searching for info on "How to make a SuggestBox reliably RPC-call the server DB in the GWTP (GWT-Platform) framework" but couldn't find any answer.
In fact, there were some answers, but they were for people who do not use GWTP. For example, I found a website (http://jagadesh4java.blogspot.com.au/2009/03/hi-every-one-this-is-my-first-blog.html) that shows how to code a SuggestBox with RPC, and it suggests these classes:
-Client side:
+ interface SuggestService extends RemoteService
+ interface SuggestServiceAsync
+ class Suggestions implements IsSerializable, Suggestion
+ class SuggestionOracle extends SuggestOracle
-Server side:
+ class SuggestServiceImpl extends RemoteServiceServlet implements SuggestService
I tried to follow that website but I got this error:
[WARN] failed SelectChannelConnector#127.0.0.1:8888
java.net.BindException: Address already in use: bind..........
The above guide clearly was not for people who use GWTP.
My task: I have a dictionary that contains 200k English words, and I want a suggest box such that when the user types any character or word, it looks it up in the DB and suggests accordingly. For example, when the user types "c" it suggests "cat", "car", "cut", etc.; when typing "car" it suggests "car service", "carbon", etc.
So I came up with my own solution; even though it works, I feel I am not doing the right thing. My solution is quite simple: I just bring the data down from the DB and add it to a MultiWordSuggestOracle. Whenever it finds a list of words in the DB, it doesn't clear the old data but just keeps adding the new list to the MultiWordSuggestOracle. However, my program will not constantly call the DB every time the user types a character; it calls the DB only if wordInTheMultiWordSuggestOracleList.indexOf(suggestBox.getText(), 0) > 0. However, there is no way to loop over each string in a MultiWordSuggestOracle, so I used List<String> accumulatedSuggestedWordsList = new ArrayList<String>() to store the data. Please see the example:
private final MultiWordSuggestOracle mySuggestions = new MultiWordSuggestOracle();
private List<String> accumulatedSuggestedWordsList = new ArrayList<String>();

private void updateSuggestions(List<String> suggestedWordsList) {
    // call some service to load the suggestions
    for (int i = 0; i < suggestedWordsList.size(); i++) {
        mySuggestions.add(suggestedWordsList.get(i));
        accumulatedSuggestedWordsList.add(suggestedWordsList.get(i));
    }
}

@Override
protected void onBind() {
    super.onBind();
    final SuggestBox suggestBox = new SuggestBox(mySuggestions);
    getView().getShowingTriplePanel().add(suggestBox);
    suggestBox.addKeyDownHandler(new KeyDownHandler() {
        @Override
        public void onKeyDown(KeyDownEvent event) {
            String word = suggestBox.getText();
            int index = -1;
            for (int i = 0; i < accumulatedSuggestedWordsList.size(); i++) {
                String w = accumulatedSuggestedWordsList.get(i);
                index = w.indexOf(word, 0);
                if (index > 0)
                    break;
            }
            if (index == 0 || index == -1) {
                GetWordFromDictionary action = new GetWordFromDictionary(suggestBox.getText());
                action.setActionType("getSuggestedWords");
                dispatchAsync.execute(action, getWordFromDictionaryCallback);
            }
        }
    });
}

private AsyncCallback<GetWordFromDictionaryResult> getWordFromDictionaryCallback =
        new AsyncCallback<GetWordFromDictionaryResult>() {

    @Override
    public void onFailure(Throwable caught) {
    }

    @Override
    public void onSuccess(GetWordFromDictionaryResult result) {
        List<String> suggestedWordsFromDictionaryList = result.getSuggestedWordsFromDictionaryList();
        updateSuggestions(suggestedWordsFromDictionaryList);
    }
};
The result: it works, but the suggestions only show up if I press the Backspace key. For example, when I type the word "car", no suggestion list pops up; it only pops up with "car service", "car sale", etc. when I hit Backspace.
So, can you evaluate my solution? I feel I am not doing the right thing. If so, can you provide a SuggestBox RPC example for the GWTP framework?
Very important note:
How do you build a reliable SuggestBox RPC that prevents a denial-of-service attack on our own servers?
What if there are too many calls generated by many people rapidly typing into a suggest box?
Actually, I just found an error:
SQL Exception: Data source rejected establishment of connection, message from server: "Too many connections" --> so there must be something wrong with my solution.
I know why I got the "Too many connections" error. For example, when I typed "ambassador" into the suggest box, I saw my server call the DB 9 times in a row:
- 1st call: it searches for any word like 'a%'
- 2nd call: it searches for any word like 'am%'
- 3rd call: it searches for any word like 'amb%'
The first problem is that it creates too many calls at one time; second, it is not efficient because the first call, like 'a%', may already contain words that would be returned by the second call, like 'am%', so it duplicates the data. The question is how to code this to avoid that inefficiency.
Someone suggested using RPCSuggestOracle.java (https://code.google.com/p/google-web-toolkit-incubator/source/browse/trunk/src/com/google/gwt/widgetideas/client/RPCSuggestOracle.java?spec=svn1310&r=1310).
If you can provide an example of using RPCSuggestOracle.java, that would be great.
I hope your answer will help a lot of other people.
There was an old, inspiring blog post from Lombardi Development that, as I remember, addresses almost all the questions you ask. It took me a while to find it but, fortunately, it has simply been moved! And the sources are available. Have a look.
Although that post is old, its advice still applies. In particular:
use a single connection to avoid an explosion of requests, and leave the other ones free for other tasks (i.e., avoid using all of the 2-to-8 max parallel browser HTTP connections);
reuse data from previous requests (i.e., if the previous request is a prefix of the new one, you may already have the suggestions, hence just filter them client-side); see the sketch right after this list.
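A minimal hedged sketch of that reuse idea (lastQuery, cachedSuggestions, showSuggestions, and fetchFromServer are hypothetical helpers, not from the post):

// if the new query merely extends the previous one, reuse the cached result
if (lastQuery != null && query.startsWith(lastQuery)) {
    List<String> filtered = new ArrayList<String>();
    for (String suggestion : cachedSuggestions) {
        if (suggestion.startsWith(query)) {
            filtered.add(suggestion);
        }
    }
    showSuggestions(filtered);  // hypothetical helper feeding the oracle/display
} else {
    fetchFromServer(query);     // hypothetical helper dispatching the GWTP action
}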
Other things that come to my mind are:
use a Timer to introduce a small delay for fast typists, so you call the server only after a short pause (probably an over-optimization, but still an idea); see the sketch after this list;
allow fetching suggestions only above a minimum input length (say, min 3 characters). If you have a lot of possible suggestions, the returned data might be expensive even to parse, especially if, for the search, you decide to adopt a contains instead of a startswith strategy;
in case you still have tons of suggestions, you could try to implement a lazy-loading SuggestionDisplay that shows only, say, the first 50 suggestions and then, on scroll, all the others incrementally, using the same input string.
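And a hedged sketch of the Timer idea (the 300 msec delay is arbitrary; fetchFromServer is the same hypothetical helper as above):

private final Timer debounce = new Timer() {
    @Override
    public void run() {
        fetchFromServer(suggestBox.getText());
    }
};

// restart the countdown on every keystroke; the server is only called
// after the user has paused for 300 msec
suggestBox.addKeyUpHandler(new KeyUpHandler() {
    @Override
    public void onKeyUp(KeyUpEvent event) {
        debounce.cancel();
        debounce.schedule(300);
    }
});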
I can't say anything about the GWTP part; I've never used it. But AFAICS it seems to be just GWT-RPC plus a dispatch mechanism (command pattern) like the old gwt-dispatch, so it shouldn't be hard to use instead of vanilla GWT-RPC.
Also have a look at the two previous articles linked in the one above. They might contain some other useful tips.
Use a KeyUpHandler instead of a KeyDownHandler; maybe that will solve your problem.
This is because the keyDown event is fired before the character is rendered.

Celery - error handling and data storage

I'm trying to better understand common strategies regarding results and errors in Celery.
I see that results have statuses/states and that results are stored if requested -- when would I use this data? Should error handling and data storage be contained within the task?
Here is a sample scenario, in case it helps you better understand my objective:
I have a geocoding task that geocodes user addresses. If the task fails or succeeds, I'd like to update a field in the database letting the user know (error handling). On success, I'd like the geocoded data to be inserted into the database (data storage).
What approach should I take?
Let me preface this by saying that I'm still getting a feel for Celery myself. That being said, I have some general inclinations about how I'd go about tackling this, and since no one else has responded, I'll give it a shot.
Based on what you've written, a relatively simple (though I suspect non-optimized) solution is to follow the broad contours of the blog comment spam task example from the documentation.
app.models.py
class Address(models.Model):
    GEOCODE_STATUS_CHOICES = (
        ('pr', 'pre-check'),
        ('su', 'success'),
        ('fl', 'failed'),
    )
    address = models.TextField()
    ...
    geocode = models.TextField()
    geocode_status = models.CharField(max_length=2,
                                      choices=GEOCODE_STATUS_CHOICES,
                                      default='pr')

class AppUser(models.Model):
    name = models.CharField(max_length=100)
    ...
    address = models.ForeignKey(Address)
app.tasks.py
from celery import task
from app.models import Address, AppUser
from some_module import geocode_function  # assuming this returns a string

@task()
def get_geocode(appuser_pk):
    user = AppUser.objects.get(pk=appuser_pk)
    address = user.address
    try:
        result = geocode_function(address.address)
        address.geocode = result
        address.geocode_status = 'su'  # set address object as successful
        address.save()
        return address.geocode  # optional -- your task doesn't have to return anything
        # On the other hand, you could also choose to decouple the geocode
        # function from the database update for the object instance.
        # Also, if you're thinking about chaining tasks together, you might
        # consider whether it's advantageous to pass a parameter as an input
        # or partial input into the child task.
    except Exception as e:
        address.geocode_status = 'fl'  # address object fails
        address.save()
        # do_something_else()
        raise  # re-raise the error, in case you want to trigger retries, etc.
app.views.py
from app.tasks import *
from app.models import *
from django.shortcuts import get_object_or_404

def geocode_for_address(request, app_user_pk):
    app_user = get_object_or_404(AppUser, pk=app_user_pk)
    # ... etc., etc. -- somewhere here, call your tasks with appropriate args/kwargs
I believe this meets the minimal requirements you've outlined above. I've intentionally left the view undeveloped since I don't have a sense of how exactly you want to trigger it. It sounds like you may also want some sort of user notification when their address can't be geocoded ("I'd like to update a field in the database letting the user know"). Without knowing more about the specifics of this requirement, it sounds like something that might be best accomplished in your HTML templates (if the instance's attribute value is X, display q in the template) or by using django.signals (set up a signal for when a user.address.geocode_status switches to failure -- say, by emailing the user to let them know, etc.).
In the comments in the code above, I mentioned the possibility of decoupling and chaining the component parts of the get_geocode task. You could also decouple the exception handling from the get_geocode task by writing a custom error-handler task and using the link_error parameter (for instance, add.apply_async((2, 2), link_error=error_handler.s()), where error_handler has been defined as a task in app.tasks.py). Also, whether you choose to handle errors via the main task (get_geocode) or via a linked error handler, you will probably want to get much more specific about how to handle different sorts of errors (e.g., handle connection errors differently from incorrectly formatted address data).
I suspect there are better approaches, and I'm just beginning to understand how inventive you can get by chaining tasks, using groups and chords, etc. Hope this helps at least get you thinking about some of the possibilities. I'll leave it to others to recommend best practices.