Apache Druid datasource creation

I have a Druid platform deployed without a router (and therefore no UI) on Kubernetes. I noticed that some datasources have disappeared (most probably erased manually). Is there a way to re-create them manually without re-deploying the full platform (for example, by restarting the ingestion server, through an API call, or by some other means)?
Thanks - Christian

It really depends on how the data was deleted and whether the segment files are still being pointed to by the Metadata Database.
Your best starting point is the full Druid API documentation.
For example, they may be in Deep Storage and still in Metadata DB but just marked as "unused", in which case you can use an API:
https://druid.apache.org/docs/latest/operations/api-reference.html#post-1
Or they could be marked "used" in the metadata DB but just not being loaded, in which case check the load (retention) rules:
https://druid.apache.org/docs/latest/operations/api-reference.html#retention-rules
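For the "unused segments" case, something along these lines should work without redeploying anything. This is a minimal Java sketch; the Coordinator address and datasource name are placeholders, and the exact endpoint should be double-checked against the API reference linked above:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class MarkSegmentsUsed {
        public static void main(String[] args) throws Exception {
            // Placeholder Coordinator address and datasource name - adjust for your cluster.
            String coordinator = "http://druid-coordinator:8081";
            String datasource = "my_datasource";

            // POST /druid/coordinator/v1/datasources/{name} asks the Coordinator to
            // mark the datasource's unused segments as used again ("enable datasource").
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(coordinator + "/druid/coordinator/v1/datasources/" + datasource))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

If that succeeds, the Coordinator should load the segments again according to your load rules.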

Related

How To Design a Distributed Logging System in Kubernetes?

I'm designing a distributed application, comprised of several Spring microservices that will be deployed with Kubernetes. It is a batch processing app, and a typical request could take several minutes of processing, with the processing getting distributed across the services, using Kafka as a message broker.
A requirement of the project is that each request will generate a log file, which will need to be stored on the application file store for retrieval. The current design is, all the processing services write log messages (with the associated unique request ID) to Kafka, and there is a dedicated logging microservice that reads these messages down, does some formatting and should persist them to the log file associated with the given request ID.
I'm very unfamiliar with how files should be stored in web applications. Should I be storing these log files on the local file system? If so, wouldn't that mean this "logging service" couldn't be scaled? For example, if I scaled the log service to 2 instances, then each instance would in theory only have access to half of the log files. And if a user makes a request to retrieve a log file, there is no guarantee that the requested log file will be at whichever log service instance the Kubernetes load balancer routed them to.
What is the currently accepted "best practice" for having a file system in a distributed application? Or should I just accept that the logging service can never be scaled up?
A possible solution I can think of would be to store the text log files in our MySQL database as TEXT rows, making the logging service effectively stateless. If someone could point out any potential issues with this approach, that would be much appreciated.
deployed with Kubernetes
each request will generate a log file, which will need to be stored on the application file store
Don't do this. Use a Fluentd / Filebeat / Promtail / Splunk forwarder sidecar that gathers stdout from the container processes.
Or have your services write to a Kafka logs topic rather than create files.
With either option, use a collector such as Elasticsearch, Grafana Loki, or Splunk.
https://kubernetes.io/docs/concepts/cluster-administration/logging/#sidecar-container-with-a-logging-agent
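As a rough illustration of what "don't write files" looks like on the application side (a hedged sketch; the RequestLogger class and requestId key are made up for this example, not part of the question's code), each processing service just logs to stdout with the request ID attached and lets the node agent or sidecar ship the lines:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    // Illustrative only: the service logs to stdout; a sidecar or node-level
    // agent (Fluentd, Filebeat, Promtail, ...) forwards the lines to the collector.
    public class RequestLogger {
        private static final Logger log = LoggerFactory.getLogger(RequestLogger.class);

        public void process(String requestId, String payload) {
            MDC.put("requestId", requestId);   // tag every log line with the request ID
            try {
                log.info("started processing");
                // ... actual batch-processing work would go here ...
                log.info("finished processing payload of {} bytes", payload.length());
            } finally {
                MDC.remove("requestId");
            }
        }
    }

The collector (Elasticsearch, Loki, or Splunk) can then be queried by requestId, so no instance ever owns log files on local disk and the logging tier scales horizontally.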
wouldn't that mean this "logging service" couldn't be scaled?
No; each of these services is designed to be scaled.
possible solution I can think of would just store the text log files in our MySQL database as TEXT rows,
Sure, but Elasticsearch and Solr are purpose-built for gathering and searching plaintext; MySQL is not.
Don't treat logs as something application-specific. In other words, your solution shouldn't be unique to Spring.

How to read in file paths into a queue that is in a Kubernetes cluster?

I want to read file paths from a persistent volume and store these file paths in a persistent queue of sorts. This would probably be done by an application contained within a pod. The persistent volume will be updated constantly with new files, which means I will need to constantly update the queue with new file paths. What if the application that is adding items to the queue crashes? Kubernetes would be able to reboot the application, but I do not want to add file paths that are already in the queue; the app would need to know what already exists in the queue before adding files, or so I would think. I was leaning toward RabbitMQ, but apparently you cannot search a queue for specific items with that tool. What can I do to account for this issue? I am running this cluster on Google Kubernetes Engine, so this would be on the Google Cloud Platform.
What if this application that is adding items to the queue crashes?
Kubernetes would be able to reboot the application, but I do not want
to add in file paths that are already in the queue. The app would need
to know what exists in the queue before adding in files
If you also need the ability to search for items, I would suggest using Redis instead of a plain queue. Running RabbitMQ on Kubernetes, I have had pretty good experience when it comes to scaling and elasticity, and there is an HA Helm chart for RabbitMQ you can use.
Still, I would recommend checking out Redis and using it as the backend to store the data. If you want a queue on top of it, you can use Bull: https://github.com/OptimalBits/bull
Bull uses Redis as its backing store and lets you create queues with that library.
With Redis you can take a dump of the data every second or so, so there is little chance of losing it; RabbitMQ, on the other hand, gives you persistent messaging plus acknowledgements and so on.
In the end it comes down to the actual requirement you want to implement. If your application needs strict ordering of the items, Redis used this way is not a good fit, and RabbitMQ would be best.
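To make the dedup idea concrete, here is a minimal Java/Jedis sketch (the key names and the PathEnqueuer class are made up for illustration): a Redis set remembers every path that has ever been enqueued, and a Redis list acts as the queue, so a restarted producer cannot insert duplicates.

    import redis.clients.jedis.Jedis;

    public class PathEnqueuer {
        private static final String SEEN_SET = "file-paths:seen";    // hypothetical key names
        private static final String QUEUE_LIST = "file-paths:queue";

        private final Jedis jedis;

        public PathEnqueuer(Jedis jedis) {
            this.jedis = jedis;
        }

        /** Adds the path to the queue only if it has never been enqueued before. */
        public boolean enqueueIfNew(String path) {
            // SADD returns 1 when the member is new and 0 when it already existed,
            // so a crashed-and-restarted producer will not enqueue duplicates.
            if (jedis.sadd(SEEN_SET, path) == 1) {
                jedis.rpush(QUEUE_LIST, path);
                return true;
            }
            return false;
        }
    }
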
Have you heard of KubeMQ? There is a KubeMQ community you can refer to for guides and help.
As an alternative solution, the official Kubernetes documentation has a useful guide on creating a work queue with Redis.

Backup of ignite stateful set in Kubernetes

I'm trying to come up with a strategy to back up the data in my Apache Ignite cache, which is hosted as a stateful set in Google Cloud Kubernetes.
My Ignite deployment uses Ignite native persistence and runs a 3-node Ignite cluster backed by persistent volumes in Kubernetes.
I’m using a binaryConfiguration to store binary objects in cache.
I’m looking for a reliable way to back up my ignite data and be able to restore it.
So far I've tried backing up just the persistence files and then restoring them.
It hasn't worked reliably yet.
The issue I'm facing is that after a restore, the cache data that isn't binary objects (e.g. strings or numbers) is restored properly and I'm able to access it just fine, but binary objects are not accessible. The binary objects seem to be restored, yet I'm unable to fetch them.
The weird part is that after the restore, once I add a new binary object to the cache, all the restored data becomes accessible normally.
Can anyone please suggest a reliable way to back up and restore ignite native persistence data?
You should either back up the ${ignite.work.dir}/marshaller directory, or call ignite.binary().type(KeyOrValue.class) for every type you have in the cache, to prime the binary marshaller.
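A minimal Java sketch of the second option (assuming you can list your cache key/value classes; the helper below is illustrative, not an Ignite API):

    import org.apache.ignite.Ignite;

    public class BinaryPrimer {
        /** Call once after the node comes up from restored persistence files. */
        public static void prime(Ignite ignite, Class<?>... cacheClasses) {
            for (Class<?> cls : cacheClasses) {
                // Registering the class rebuilds its binary metadata, so restored
                // BinaryObjects become readable without the old marshaller directory.
                ignite.binary().type(cls);
            }
        }
    }
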
Apache Ignite provides ACID transactions, which are pretty reliable. The cache also has its own mechanism for primary and backup copies, and assuming you have its WAL enabled, some of the data is kept in memory.
What is most likely happening is that after your restore, the moment you make an initial write, memory starts populating, which lets you see what is already on disk in the cache. This is not really a supported restore mechanism (the docs don't describe one), but it could work that way: after the restore, issue a small, irrelevant sample write. I advise testing this thoroughly, though.

What is upconfig and downconfig in zookeeper?

I am a noob with Solr and ZooKeeper and am trying to learn by myself. I understood that ZooKeeper is a file structure that manages the Solr cluster and prevents race conditions using locks. I didn't understand what upconfig and downconfig are and when we use them. It would be of great help if someone could give me a clear picture of them. Thanks in advance!
A better and more general description of ZooKeeper is an application that provides centralised configuration for distributed systems. So in SolrCloud, you can have multiple Solr instances across multiple servers acting together as a single cloud. However, if you want to update a collection's configuration, you don't want to have to go to each server and update them all individually. You want only one version of the config, which is then used by any collection that needs it. Hence the config commands.
upconfig uploads a configuration to ZooKeeper, which then ensures that all collections using that configuration (throughout the Cloud, on all the servers) have that specific config. So you only need to upload it once, on one server.
downconfig lets you fetch a configuration from ZooKeeper.
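These commands are normally run through the Solr CLI / zkcli scripts, but the same operations are also available programmatically. Here is a hedged SolrJ sketch; the ZooKeeper address, local paths, and config name are placeholders, and ZkConfigManager has moved or changed form in newer Solr releases, so check your version:

    import java.nio.file.Paths;
    import org.apache.solr.common.cloud.SolrZkClient;
    import org.apache.solr.common.cloud.ZkConfigManager;

    public class ConfigSync {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble address and client timeout.
            try (SolrZkClient zkClient = new SolrZkClient("zk1:2181,zk2:2181,zk3:2181", 30000)) {
                ZkConfigManager configManager = new ZkConfigManager(zkClient);

                // "upconfig": push a local config directory into ZooKeeper under a name.
                configManager.uploadConfigDir(Paths.get("/tmp/myconf"), "my_config");

                // "downconfig": pull the named config back out of ZooKeeper onto disk.
                configManager.downloadConfigDir("my_config", Paths.get("/tmp/myconf-copy"));
            }
        }
    }
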

Apache Ignite Failover functionality

I have set up Apache Ignite on a cluster of nodes and sent a job to a server node to run. When the connection to that server node is lost, I need to somehow store that node's result locally (either in a binary file or in some other way) and then, when the connection with that node is established again, push the stored results back to a database server.
I'm working under .Net platform.
I can use
EventType.EVT_CLIENT_NODE_DISCONNECTED
EventType.EVT_CLIENT_NODE_RECONNECTED
these events and implement the "store locally" and "push to the DB server" functionality inside their handlers, but I wanted to find a ready-made solution.
Is there any ready-made tool with the functionality I mentioned that I could just take and use?
You can take a look at Checkpointing. I'm not sure it is exactly what you described (mainly because it saves the intermediate state on the server side), but I think it can be quite helpful.
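Roughly, checkpointing lets a compute job persist intermediate state and reload it after a failover. Below is a hedged Java sketch of the idea; the question targets .NET, so verify whether Ignite.NET exposes an equivalent, a checkpoint SPI must be configured on the cluster, and the "job-state" key and doWork method are made up for the example:

    import org.apache.ignite.compute.ComputeJobAdapter;
    import org.apache.ignite.compute.ComputeTaskSession;
    import org.apache.ignite.resources.TaskSessionResource;

    public class CheckpointedJob extends ComputeJobAdapter {
        @TaskSessionResource
        private ComputeTaskSession session;

        @Override
        public Object execute() {
            // Resume from a previously saved checkpoint if the job was restarted.
            Object partial = session.loadCheckpoint("job-state");   // hypothetical key
            Object result = doWork(partial);

            // Persist intermediate state so a failed-over node can pick it up later.
            session.saveCheckpoint("job-state", result);
            return result;
        }

        private Object doWork(Object resumeFrom) {
            // ... actual computation, continuing from resumeFrom if it is non-null ...
            return resumeFrom == null ? "first-pass-result" : resumeFrom;
        }
    }
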