Our product requires real-time Lucene index merging on an embedded device. The device may be shut down at any time. My team is exploring the possibility of resuming a merge after a system reboot. In our POC, all consumers from SegmentMerger are overridden by a customized codec, and the whole merge process is divided into many fine-grained steps. When each step is done, its state is saved to disk to avoid redoing it after resume. Our tests show this can work. However, I am not able to determine how robust this solution is, or whether it is built on a flawed foundation.
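To make the scheme concrete, here is a simplified sketch of the step-checkpointing idea (the class, method, and step names are made-up illustrations, not our actual codec hooks):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Sketch: each fine-grained merge step gets an on-disk marker once it
    // completes, so after a reboot, already-finished steps are skipped.
    class ResumableSteps {
        private final Path checkpointDir;

        ResumableSteps(Path checkpointDir) throws IOException {
            this.checkpointDir = checkpointDir;
            Files.createDirectories(checkpointDir);
        }

        void run(List<String> names, List<Runnable> steps) throws IOException {
            for (int i = 0; i < steps.size(); i++) {
                Path marker = checkpointDir.resolve(names.get(i) + ".done");
                if (Files.exists(marker)) {
                    continue; // completed before the shutdown, do not redo
                }
                steps.get(i).run(); // e.g. merge postings, then norms, then stored fields
                Files.createFile(marker); // written only after the step's output is durable
            }
        }
    }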
Thanks in advance for your response
I've built a Kafka Streams application. It's my first one, so I'm moving out of a proof-of-concept mindset into a "how can I productionalize this?" mindset.
The tl;dr version: I'm looking for Kafka Streams deployment recommendations and tips, specifically related to updating your application code.
I've been able to find lots of documentation about how Kafka and the Streams API work, but I couldn't find anything on actually deploying a Streams app.
The initial deployment seems fairly easy: there is good documentation for configuring your Kafka cluster, you create the topics your application needs, and then you can start it up and publish data for it to process.
But what if you want to upgrade your application later, specifically if the update contains a change to the topology? My application does a decent amount of data enrichment and aggregation into windows, so it's likely that the processing will need to be tweaked in the future.
My understanding is that changing the order of processing or inserting additional steps into the topology will cause the internal ids for each processing step to shift. At best, new state stores will be created and the previous state lost; at worst, processing steps will read from an incorrect state store topic when starting up. This implies that you either have to reset the application or give the new version a new application id. But there are some problems with that:
If you reset the application or give a new id, processing will start from the beginning of source and intermediate topics. I really don't want to publish the output to the output topics twice.
Currently "in-flight" data would be lost when you stop your application for an upgrade (since that application would never start again to resume processing).
The only way I can think to mitigate this is to:
1) Stop data from being published to source topics. Let the application process all messages, then shut it off.
2) Truncate all source and intermediate topics.
3) Start the new version of the application with a new app id.
4) Start the publishers.
This is "okay" for now since my application is the only one reading from the source topics, and intermediate topics are not currently used beyond feeding to the next processor in the same application. But, I can see this getting pretty messy.
Is there a better way to handle application updates? Or are my steps generally along the lines of what most developers do?
I think you have a full picture of the problem here and your solution seems to be what most people do in this case.
This question was asked at the latest Kafka Summit, after the talk by Gwen Shapira and Matthias J. Sax about Kubernetes deployment. The response was the same: if your upgrade contains topology modifications, rolling upgrades can't be done.
It looks like there is no KIP about this for now.
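One partial mitigation worth mentioning (a general Streams DSL feature, not something from the talk): explicitly naming operators and state stores pins their names, so they no longer depend on the auto-generated KSTREAM-...-0000000003 ids, and unrelated topology edits are less likely to shift them. A minimal sketch, with made-up topic and store names:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Materialized;

    public class NamedTopology {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            KStream<String, String> events =
                    builder.stream("events-topic", Consumed.as("events-source"));

            // Explicit names pin the store, changelog topic, and repartition
            // topic names, instead of deriving them from generated ids.
            events.groupByKey(Grouped.as("events-group"))
                  .count(Materialized.as("event-counts-store"));

            // Print the topology so you can diff the names between versions.
            System.out.println(builder.build().describe());
        }
    }

This does not make topology changes safe in general, but it reduces how much an unrelated edit perturbs the state store and internal topic names.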
I was trying to publish assets from one environment to another. It was very slow and not progressing. Could anybody suggest what the issue might be?
Try segmenting your assets and publishing in smaller groups.
Sometimes it boils down to finding the culprit asset which causes the entire batch to stall. This is why segmenting slow publishes can help narrow the issue. Also check if there are any assets checked out in your target destination.
There are a few things to check.
You can set VERBOSE=TRUE on the publish destination config to make the UI write a more detailed log. It's important to know exactly what is slow: whether it's the movement of assets to the target, or the cache flush (and potential rebuild) on the target.
Check futuretense.txt on both source and target for any telltale errors or curious messages. If nothing appears there, the logging may be suppressed; you should have INFO level for most loggers by default, and if still nothing appears, set com.fatwire.logging.cs=DEBUG and retry.
Generally speaking, if this is a production system and it's not a huge number of assets being published, then the cache flush is where most of the time is spent, plus cache regeneration if the destination is configured to do that. The verbose publish log will tell you how much is being flushed.
If the cause of slowness can't be determined from inspecting the logs, then consider taking periodic thread dumps (source and target) during the publish, to see what is happening under the hood. Perhaps the system is slow waiting on a resource like shared disk (a common problem).
Phil
To understand this better, you will need to find out at which step the publishing process is stuck. The publishing process is composed of 5 steps: the first two (data gathering and serialization) happen at the source, the third (data transfer) happens between source and destination, and the last two (deserialization and cache clear) happen at the delivery side.
One weird situation I have come across involved the deserialization step, which was trying to update the Locale tree on every real-time publish. Fatwire support (as it was then) suggested we add &PUBLISHLOCALETREE=false, which significantly improved publish performance. Once again, this only applies if you're using locales/translations in your Site.
The other day a friend suggested playing a web browser game called OGame. If you don't know it, I'll tell you what it is: an RTS game where you build things like mining facilities, barracks, and so on. The interesting thing is that every building has a build time, and you can log off while it's building because construction keeps going.
Something like this, I believe, is managed via a DBMS. I have records storing the end time of each construction. How do I check when to update a building? Do I need an external application that checks every second which records need updating? Is it possible with MySQL 5 to have an internal scheduler that launches a procedure on this table? And if so, is it a best practice?
I have built a similar game and I stored the construction end times (and other events to be fired) in an events table. I wrote a PHP daemon which regularly checks the events table for expired records and acts on them accordingly.
I couldn't find a way to do it in the database itself (and if I later wanted to migrate to another DB, it would need rewriting). Runs of a cron'd script may overlap, whereas a daemon can keep track of everything all the time and output debug information if events are queuing faster than they're being processed. I also added a cron job to periodically check that my daemon is still running, and start it if not.
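To illustrate the polling loop in code (Java/JDBC here rather than PHP, for brevity; the events/buildings schema is a made-up example):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Sketch of the daemon loop: poll the events table for expired rows,
    // apply each finished construction, then delete the event.
    public class EventDaemon {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:mysql://localhost/game", "user", "pass")) {
                while (true) {
                    try (PreparedStatement ps = db.prepareStatement(
                            "SELECT id, building_id FROM events WHERE finish_at <= NOW()");
                         ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            completeBuilding(db, rs.getLong("building_id"));
                            deleteEvent(db, rs.getLong("id"));
                        }
                    }
                    Thread.sleep(1000); // poll roughly once per second
                }
            }
        }

        static void completeBuilding(Connection db, long buildingId) throws SQLException {
            try (PreparedStatement ps = db.prepareStatement(
                    "UPDATE buildings SET level = level + 1 WHERE id = ?")) {
                ps.setLong(1, buildingId);
                ps.executeUpdate();
            }
        }

        static void deleteEvent(Connection db, long eventId) throws SQLException {
            try (PreparedStatement ps = db.prepareStatement(
                    "DELETE FROM events WHERE id = ?")) {
                ps.setLong(1, eventId);
                ps.executeUpdate();
            }
        }
    }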
Creating a daemon in PHP (if you're using PHP)
Hope that helps.
I am working on a regular iPhone app which pulls data from a server (XML, JSON, etc.), and I'm wondering what the best way is to implement data syncing. The criteria are speed (less network data exchange), robustness (data recovery in case an update fails), offline access, and flexibility (adaptable when the structure of the database changes slightly, like a new column). I know it varies from app to app, but can you share some of your strategies/experience?
For me, I'm thinking of something like this:
1) Store Last Modified Date in iPhone
2) Upon launching, send a message like getNewData.php?lastModifiedDate=...
3) Server will process and send back only modified data from last time.
4) This data is formatted as so:
<+><data id="..."></data></+> // add this to SQLite/CoreData
<-><data id="..."></data></-> // remove this
<%><data id="..."><attribute>newValue</attribute></data></%> // new modified value
I don't want to make <+>, <->, <%>, etc. for each attribute as well, because that would be too complicated, so probably when I receive a <%> field, I will just remove the data with the specified id and then add it again (assuming the id here is not an auto-incremented field).
5) Once everything is downloaded and updated, I will update the Last Modified Date field.
The main problem with this strategy: if the network goes down while I am updating, the Last Modified Date is not yet updated, so the next time I relaunch the app I will have to go through the same process again, not to mention the potentially inconsistent data. If I use a temporary table for the update and make the whole thing atomic, it would work, but then again, if the update is too long (lots of data changes), the user has to wait a long time until the new data is available. Should I keep a Last-Modified-Date for each piece of data and update the data gradually?
I would start by making the update routine atomic, since you'll have enough on your hands figuring out how to get the client-server communication working properly.
After that is a good time to consider tweaking it to be incremental, but only after you do some testing to figure out if it's really necessary. If you're tuning your update protocol to be as low bandwidth as possible, you might discover that even a "big" update is downloaded fast enough.
Another way to look at it is to ask yourself, how often is there going to be network trouble when an average user is doing a sync? You probably don't want to tune for unlikely scenarios.
If you are trying to optimize (minimize) the data transfer you may want to consider a different format than XML, since XML is fairly verbose. Or at least you may want to trade in XML readability for space by making each element name and attribute as small as possible, and eliminate all unnecessary whitespace.
Your basic scheme is good. What you need to do is somehow make your updates idempotent, so that you can restart a partially completed transfer without risk. This is a better way to go than trying to implement some sort of true atomic commit (though you could do that too, using, e.g., the SQLite database).
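As a minimal sketch of what idempotent means here, assuming a hypothetical SQLite table items(id, payload) and using Java for brevity: re-running the same batch of changes is harmless, because the DELETE is a no-op the second time and INSERT OR REPLACE simply overwrites.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Applies one change record from the server. Safe to re-run after a
    // dropped connection: each operation leaves the row in the same state.
    class ChangeApplier {
        static void apply(Connection db, String op, long id, String payload)
                throws SQLException {
            switch (op) {
                case "+": // added on the server
                case "%": // modified on the server
                    try (PreparedStatement ps = db.prepareStatement(
                            "INSERT OR REPLACE INTO items (id, payload) VALUES (?, ?)")) {
                        ps.setLong(1, id);
                        ps.setString(2, payload);
                        ps.executeUpdate();
                    }
                    break;
                case "-": // removed on the server
                    try (PreparedStatement ps = db.prepareStatement(
                            "DELETE FROM items WHERE id = ?")) {
                        ps.setLong(1, id);
                        ps.executeUpdate();
                    }
                    break;
                default:
                    throw new IllegalArgumentException("unknown op: " + op);
            }
        }
    }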
In our experience fairly large updates (10s of KB) can be downloaded quite rapidly, if the server is fast enough. No great need to break updates up into tiny bits. But certainly it won't hurt to try to minimize the amount of data transferred by keeping more granular info on "last update".
(And definitely you should use JSON rather than XML as your transmitted data representation.)
I wonder if you have considered using a sync framework to manage the synchronization. If that interests you, take a look at the open source project OpenMobster's Sync service. You can do the following sync operations:
two-way
one-way client
one-way device
bootup
Besides that, all modifications are automatically tracked and synced with the cloud. You can have your app offline when the network connection is down; it will track any changes and automatically synchronize them with the cloud in the background when the connection returns. It also provides iCloud-like synchronization across multiple devices.
Also, modifications in the cloud are synced using push notifications, so the data is always current even though it is stored locally.
In your case,
Criteria are speed (less network data exchange), robustness (data recovery in case update fails), offline access
Speed: Only the changes are sent across the network in both directions
Robustness: it stores data in a transactional store like SQLite, and any failed updates are communicated in the SyncML payload. Only the successful operations are processed, while the failed operations are retried during the next sync.
Here is a link to the open source project: http://openmobster.googlecode.com
Here is a link to iPhone App Sync: http://code.google.com/p/openmobster/wiki/iPhoneSyncApp
I'm working on an iPhone application that should work in offline and online modes.
In its online mode it's supposed to feed all the information the user enters to a web service backed by GWT/GAE.
In its offline mode it's supposed to store the information locally and, when a connection is available, sync it up to the web service.
Currently my plan is as follows:
Provide a connection between the app and the web service, using Protocol Buffers for efficient over-the-wire communication
Work with local DB using Core Data
Poll the network status and, when it is available, sync the database, keeping some sort of local-db-to-remote-db key synchronization (sketched below).
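To illustrate what I mean by the key synchronization, a sketch (names made up):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: maps locally generated record ids to the ids the remote
    // service assigns once a record has been uploaded.
    class KeyMap {
        private final Map<Long, Long> localToRemote = new HashMap<>();

        void bind(long localId, long remoteId) {
            localToRemote.put(localId, remoteId);
        }

        // The server-side id for a local record, or null if not yet uploaded.
        Long remoteIdFor(long localId) {
            return localToRemote.get(localId);
        }
    }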
The question is: am I going in the right direction? Are there standard patterns for implementing this? Maybe someone can point me to an open-source application that works in a similar fashion?
I am really new to iPhone coding, and would be very glad to hear any suggestions.
Thanks
I think you're blurring the questions together.
If you've got a question about making a GWT web interface, that's one question.
Questions about how to sync an iPhone to a web service are a different question. For that, you don't want to use GWT's RPCs for syncing, as you'd have to fake out the 'browser-side' of the serialization system in your iPhone code, which GWT normally provides for you.
About the system design direction:
First, if there is no REAL need, do not create 2 different apps (one GWT and one iPhone);
create one well-written GWT app instead. It will work offline with no problem and can manage your data using an HTML feature, the offline application cache.
If you must create 2 separate apps,
then at least save yourself the effort of writing the server twice. If you go with the standard GWT approach, you will almost certainly fail to talk to the server from a standalone app (it is zipped JSON over HTTP with some tricky headers...), or you will end up writing things twice. So look into the Restlet library; it is well supported on GAE.
About how to keep things in sync with offline/online switching:
There are several approaches to consider, and none of them is perfect. So when you consider yours, think about what the user expects... Do not be Microsoft Word; do not try to outsmart the user.
If there is at least one scenario in the use cases that demands user intervention to merge changes (and there will be, take it to the bank), then you will have to implement a UI for it, and then there is a good reason to use it often, so the user gets used to it. That is better than the user first seeing it a long while after starting to use the app, because the need for it is rare thanks to some super-duper merging logic that asks the user only in very special cases... Don't do that.
Balance the effort: the mess that a bug in such code introduces for the user is much more painful than all the benefit put together.
so the HOW:
One way is the Do-Undo way.
While offline, keep a log of the actions the user performed on the data, in the order they were performed;
as soon as you are connected, send them to the server and execute them. The same goes from server to client.
This will work fine in most cases, as long as you are not writing Photoshop-like software with huge amounts of data per operation. This is also referred to as the Command (Action) pattern by the Gang of Four.
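A minimal sketch of that log-and-replay idea (the class and method names are made up):

    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    // One logged user action, replayed against the server when back online.
    record Action(Instant at, String type, String targetId, String payload) {}

    interface RemoteService {
        void execute(Action a); // hypothetical server endpoint
    }

    class ActionLog {
        private final List<Action> pending = new ArrayList<>();

        // Called for every local mutation while offline.
        void log(Action a) {
            pending.add(a);
        }

        // Called when connectivity returns: replay in the original order.
        void replayTo(RemoteService server) {
            for (Action a : pending) {
                server.execute(a);
            }
            pending.clear();
        }
    }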
Another way is the source-control way: versions, and maybe even locks. It is very application-dependent; DBMSs sometimes use it internally to implement transactions.
And there is always the option to be read-only when offline :-)