We are using Kafka Streams in our application. The topology uses groupByKey, followed by windowing, followed by aggregation.
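For reference, the topology is along these lines (a minimal sketch; the topic, key extraction, and store names are illustrative, not our real ones):

    import java.time.Duration;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class WindowedCountTopology {
        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("events")               // source topic (illustrative)
                   .selectKey((k, v) -> v.split(",")[0])           // the key change is what forces the -repartition topic on groupByKey
                   .groupByKey()
                   .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))  // Kafka 3.0+; older versions use TimeWindows.of(...)
                   .count(Materialized.as("event-counts"));        // windowed aggregation backed by a local state store
            return builder.build();
        }
    }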
Sometimes, after a restart, the application fails to read from the intermediate .repartition topic, i.e., the lag keeps growing. Deleting the .repartition topic solves the problem until the next restart, but that is not a good solution. The application runs in Docker with local storage mounted as the state directory.
Without Docker, everything seems to be OK. Please advise!
Thanks, Mark.
Someone experiencing a similar issue was able to resolve it by setting metadata.max.age.ms to a lower value than the current default (300000 ms). Try setting it quite low (e.g., a few hundred ms) to see if that helps, then work out a reasonable value to run with.
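For reference, here is roughly how that looks when building the Streams configuration (a sketch; the app id and broker address are illustrative, and 500 ms is just an aggressive starting point for testing, not a recommendation):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class LowMetadataAgeConfig {
        public static Properties build() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // illustrative id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            // Refresh cluster metadata much more often than the 300000 ms (5 minute) default,
            // then tune this upward once you know whether it helps.
            props.put("metadata.max.age.ms", "500");
            return props;
        }
    }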
We’re using a standard three-node Atlas replica set in a dedicated cluster (M10, MongoDB 6.0.3, AWS) and have configured an alert to fire if the ‘Restarts in last hour is’ rule exceeds 0 for any node.
https://www.mongodb.com/docs/atlas/reference/alert-conditions/#mongodb-alert-Restarts-in-Last-Hour-is
We’re seeing this alert fire every now and then, and we’re wondering what this means for a node in a dedicated cluster and whether it is something to be concerned about, since I don’t think we have any control over it. Should we disable this rule or increase the restart threshold?
Thanks in advance for any advice.
(Note: I've also asked this at the MongoDB community support site, but it hasn't received any traction yet, so I'm asking here too.)
I got an excellent response to my question at the MongoDB community support site:
A node restarting is not necessarily a cause for concern. However, you should investigate the cause of the restart itself to better determine whether this is an issue or not. Take a look at your Project Activity Feed to see if you can determine why the nodes are restarting. Since you have noted this is an M10 cluster, you should have access to the MongoDB logs; you can also check those to try to determine the cause of the node restarts. If you do not have access to the logs, you can consider working with Atlas in-app chat support to diagnose the issue.
It’s always good to keep the alerts active, as they can indicate a potential problem as soon as they occur. You can consider increasing the restart threshold to reduce alert noise after concluding whether the restarts are expected or not.
In my case, having checked the activity feed, I was able to match up all the alerts we were seeing with MongoDB version auto-updates on the nodes. We still wanted to keep the alert, so we've increased its threshold to fire on >1 restart per hour rather than >0, assuming that auto-updates won't be applied multiple times in the same hour.
I was doing some late-night work on a hard disk, blowing away partitions on a drive, and I accidentally selected my main drive, which had a Windows 10 install on it. I blew away the partitions with GParted from a live disk.
I managed to recover most of the partitions using TestDisk, but not the MSR (the hidden/secure one), so the PC doesn't boot.
At this stage I'd like to copy my recovered partition, but nothing works; Clonezilla is confused because somehow there is both an MBR and a GPT on the disk.
I do not know which one I was actually using (MBR or GPT); I see a whole load of overlapping partitions, so it looks like the disk is a mess. However, the main partition is intact and the data is all there, checked with TestDisk.
At this stage I've deleted all the partitions except the one I wish to keep, but Clonezilla/GParted still can't clone my partition.
I have no idea how to move forward from this.
Ideally, what I would have done is: clone the last partition, do a fresh install of Windows 10 to recreate the correct partition table and partitions etc., and then clone the original partition, the one with the data on it, back on top of the freshly installed Windows 10 one. This process worked for me during my triple-boot setups.
However, at this stage I cannot clone the partition, and I'm too scared to try deleting either the GPT or the MBR in case of data loss.
Advice from anybody would be greatly appreciated. This is only my gaming PC, so we're not talking mountains of valuable data, but there were some old files I forgot to copy to the NAS. Not to mention I realised I never set up backups of this computer to the NAS; I know I stuffed that part up. You'd save me re-downloading the 600 GB of games I had on there, and my setup, which I loved!
Thanks in advance for your time in reading this mess.
Anyway, I figured it out, and this is how I did it.
Basically, it was Windows 10 that had been installed with GPT, and TestDisk recovers any way it can, which in this case was by installing an MBR table and finding the data.
So I launched GParted to get info about my current disk, and the partition was visible under MBR, which meant the GPT was safe to delete.
So I did that, after which I could Clonezilla the partition to a spare disk, saving all the data!
Then a fresh install of Windows 10 to re-create the partitions: Recovery, MSR, and data.
Then I Clonezilla'd from the spare disk on top of the new Windows 10 install. All working perfectly, as I'm typing this from that PC.
Thanks all for reading!
I've built a Kafka Streams application. It's my first one, so I'm moving out of a proof-of-concept mindset into a "how can I productionalize this?" mindset.
The tl;dr version: I'm looking for Kafka Streams deployment recommendations and tips, specifically related to updating your application code.
I've been able to find lots of documentation about how Kafka and the Streams API work, but I couldn't find anything on actually deploying a Streams app.
The initial deployment seems to be fairly easy - there is good documentation for configuring your Kafka cluster, then you must create topics for your application, and then you're pretty much fine to start it up and publish data for it to process.
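For the topic-creation step, if you'd rather script it than use the CLI, the Java AdminClient works; here is a sketch (topic name and partition/replication counts are illustrative):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TopicSetup {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            try (Admin admin = Admin.create(props)) {
                // Partition and replication counts here are placeholders, not recommendations.
                NewTopic input = new NewTopic("events", 6, (short) 3);
                admin.createTopics(Collections.singleton(input)).all().get();  // block until created
            }
        }
    }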
But what if you want to upgrade your application later? Specifically, if the update contains a change to the topology. My application does a decent amount of data enrichment and aggregation into windows, so it's likely that the processing will need to be tweaked in the future.
My understanding is that changing the order of processing or inserting additional steps into the topology will cause the internal ids for each processing step to shift (see the naming sketch after this list), meaning at best new state stores will be created with the previous state being lost, and at worst, processing steps will read from an incorrect state store topic when starting up. This implies that you either have to reset the application or give the new version a new application id. But there are some problems with that:
If you reset the application or give it a new id, processing will start from the beginning of the source and intermediate topics. I really don't want to publish the output to the output topics twice.
Currently "in-flight" data would be lost when you stop your application for an upgrade (since that application would never start again to resume processing).
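To make the id shifting concrete, here is what explicitly naming the repartition topic and the state store looks like (a sketch; the names and key-extraction lambda are illustrative). Explicit names pin the generated internal topic names across edits, though they don't by themselves make arbitrary topology changes compatible:

    import java.time.Duration;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class NamedTopologySketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("events")
                   // Without explicit names, Streams derives internal names like
                   // KSTREAM-AGGREGATE-STATE-STORE-0000000003 from operator order,
                   // which is exactly what shifts when the topology changes.
                   .selectKey((k, v) -> v.split(",")[0])
                   .groupByKey(Grouped.as("by-user"))            // names the -repartition topic
                   .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                   .count(Materialized.as("hourly-counts"));     // names the store and its -changelog topic
            System.out.println(builder.build().describe());      // print the topology to inspect generated names
        }
    }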
The only way I can think of to mitigate these problems is to:
Stop data from being published to source topics. Let the application process all messages, then shut it off.
Truncate all source and intermediate topics.
Start the new version of the application with a new app id (see the sketch after these steps).
Start publishers.
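For the new-app-id step, the intent is something like the following sketch (the versioned id convention is hypothetical, and the Topology is assumed to be built elsewhere):

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;

    public class VersionedAppLauncher {
        public static void start(Topology topology) {
            Properties props = new Properties();
            // Hypothetical convention: bump the suffix on every topology-changing release so
            // the new version gets fresh internal topics and state stores.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app-v2");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

            KafkaStreams streams = new KafkaStreams(topology, props);
            streams.cleanUp();   // drop any local state left under state.dir for this application.id
            streams.start();
        }
    }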
This is "okay" for now, since my application is the only one reading from the source topics, and the intermediate topics are not currently used beyond feeding the next processor in the same application. But I can see this getting pretty messy.
Is there a better way to handle application updates? Or are my steps generally along the lines of what most developers do?
I think you have a full picture of the problem here and your solution seems to be what most people do in this case.
At the latest Kafka Summit, this question was asked after the talk by Gwen Shapira and Matthias J. Sax about Kubernetes deployment. The response was the same: if your upgrade contains topology modifications, rolling upgrades can't be done.
It looks like there is no KIP about this for now.
As mentioned on this page: Memcached for PHP and failover,
I am trying to test the failover of Memcached.
Basically, I want to ensure that if one of the servers is marked dead, subsequent sets and gets get redistributed to the remaining servers.
Someone mentioned on this page that OPT_AUTO_EJECT_HOSTS is one option to achieve this.
However, it seems that Memcached::OPT_AUTO_EJECT_HOSTS is deprecated, as described on this page: http://hoborglabs.com/en/blog/2013/memcached-php
I also tried the OPT_REMOVE_FAILED_SERVERS option, but it makes no difference.
I also tried OPT_SERVER_FAILURE_LIMIT, setting it to 1.
The benchmark/request generator in my case is BRUTIS:
https://code.google.com/p/brutis/
I'm using libmemcached 1.0.16 and memcached 1.4.15, and the PHP memcached extension is version 2.1.0.
What should I do to make failover and automatic rebalancing work?
I have also tried different combinations of these options, but it does not work.
There is a related question:
Brutis and memcached FailOver
But no answer yet :(
If anyone has any idea about this, please share your views.
Thanks in advance,
Amit
I was trying to publish assets from one environment to another. It was very slow and not progressing. Could anybody suggest what the issue might be?
Try segmenting your assets and publishing in smaller groups.
Sometimes it boils down to finding the culprit asset that causes the entire batch to stall, which is why segmenting slow publishes can help narrow down the issue. Also check whether there are any assets checked out in your target destination.
There are a few things to check:
You can set VERBOSE=TRUE on the publish destination config to make the UI write a more detailed log. It's important to know exactly what is slow, whether it's the movement of assets to the target or the cache flush/potential rebuild on the target.
Check the futuretense.txt log on both source and target for any telltale errors or curious messages; if nothing is appearing there, the logging may be suppressed. You should have INFO level for most loggers by default; if still nothing appears, set com.fatwire.logging.cs=DEBUG and retry.
Generally speaking, if this is a production system and it's not a huge number of assets being published, then the cache flush is where most of the time is spent (and, if it is configured to do so, cache regeneration). The verbose publish log will tell you how much is being flushed.
If the cause of the slowness can't be determined from inspecting the logs, then consider taking periodic thread dumps (on both source and target) during the publish to see what is happening under the hood. Perhaps the system is slow because it's waiting on a resource such as a shared disk (a common problem).
Phil
To understand this better, you will need to find out at which step the publishing process is stuck. As you know, the publishing process is composed of five steps: the first two (data gathering and serialization) happen at the source, the third (data transfer) happens between source and destination, and the last two (deserialization and cache clear) happen at the delivery side.
One weird situation that I have come across is in the deserialization step, where it was trying to update the Locale tree on each real-time publish. The then-FatWire support team suggested we add &PUBLISHLOCALETREE=false, which significantly improved publish performance. Again, this only applies if you're using locales/translations in your site.