Access local storage for Druid

How do I access or view the local storage for Druid? I would like to view the segments or copy them to a file. I am running the Druid operator on Kubernetes. I have tried exec commands on the Historical and MiddleManager pods, but I am unable to get into any of the Druid pods.

Have you tried looking at the location your deep storage configuration points to? From the documentation:
Deep storage is where segments are stored. It is a storage mechanism
that Apache Druid does not provide. This deep storage infrastructure
defines the level of durability of your data, as long as Druid
processes can see this storage infrastructure and get at the segments
stored on it, you will not lose data no matter how many Druid nodes
you lose. If segments disappear from this storage layer, then you will
lose whatever data those segments represented.
Source: Deep Storage on Druid documentation
For example, you need to know which directory druid.storage.storageDirectory points to.
Remember that the data is saved as segments, as described here: Segments on Apache Druid documentation
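As a rough illustration of the lookup described above, here is a minimal Python sketch that reads Java-style properties text and reports the local deep storage directory. The property names druid.storage.type and druid.storage.storageDirectory come from the Druid configuration; the sample file content, the path in it, and the helper names are hypothetical. The sketch only reports a directory when the type is explicitly set to local.

```python
def parse_properties(text):
    """Parse Java-style key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def local_deep_storage_dir(props):
    """Return the configured directory only if deep storage is explicitly 'local'."""
    if props.get("druid.storage.type") != "local":
        return None
    return props.get("druid.storage.storageDirectory")


# Hypothetical runtime.properties content, for illustration only.
SAMPLE = """
# Deep storage
druid.storage.type=local
druid.storage.storageDirectory=/druid/data/segments
"""

print(local_deep_storage_dir(parse_properties(SAMPLE)))
```

Once you know the directory, you can look for the segment files there on whichever pod or volume actually mounts it.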
Useful Documentation:
Ingestion troubleshooting FAQ
HDFS as Deep-Storage: Druid is not storing the historical data on hdfs
Druid Setup with HDFS
Change Local Storage to S3 as deepstorage

Related

Clean up an apache druid cluster

Is there any way to clean up all the druid data (tasks, storage, etc.) for testing purposes?
Found the tutorial which demonstrates the segment deletion:
https://druid.apache.org/docs/latest/ingestion/data-management.html#delete
And reset-cluster tool:
https://druid.apache.org/docs/latest/operations/reset-cluster.html
My goal is to have a fresh druid cluster, every time I run testing.
If you are asking which of the two options to use, the reset-cluster tool will address your use case as it has options to remove metadata, task logs and segment data in deep storage. The --all option will remove all of them.
The segment deletion process, on the other hand, is used to remove unwanted segments from the cluster and deep storage, but does not address metadata in general or task logs.

Druid segments not available

Hi guys, there are ingestion tasks running in my Druid setup on Kubernetes. A lot of segments in multiple datasources are not available, even though ingestion was successful. As a result, I am not able to show the ingested data in my app. Why are the segments unavailable, and how can I fix it? Also, what are the steps to restart all Druid components on a multi-node Kubernetes cluster?
It is difficult to say why segments are unavailable without looking at some logs. The coordinator log and the historical logs will be useful to determine why historical processes are unable to make the segments available (download them from deep storage).
A quick thought: could the historicals be out of space for their segment cache?
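One way to check the disk-space hunch is a small Python sketch like the one below, run on the node or volume that holds the segment cache. It uses only the standard library; the 10% threshold and the helper names are assumptions, and you would replace "." with your actual segment cache path. It only looks at free space on the filesystem and does not account for any size limit configured in Druid itself.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path, threshold=0.10):
    """True when less than `threshold` of the filesystem is free (threshold is an assumption)."""
    return free_fraction(path) < threshold


# "." stands in for the segment cache directory; substitute your own path.
print(f"{free_fraction('.'):.1%} free, low on space: {is_low_on_space('.')}")
```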

Dataproc: Hot data on HDFS, cold data on Cloud Storage?

I am studying for the Professional Data Engineer and I wonder what is the "Google recommended best practice" for hot data on Dataproc (given that costs are no concern)?
If cost is a concern then I found a recommendation to have all data in Cloud Storage because it is cheaper.
Can a mechanism be set up, such that all data is on Cloud Storage and recent data is cached on HDFS automatically? Something like AWS does with FSx/Lustre and S3.
What to store in HDFS and what to store in GCS is a case-dependent question. Dataproc supports running Hadoop or Spark jobs on GCS with the GCS connector, which makes Cloud Storage HDFS compatible without performance losses.
Cloud Storage connector is installed by default on all Dataproc cluster nodes and it's available on both Spark and PySpark environments.
After researching a bit: the performance of HDFS and Cloud Storage (or any other blob store) is not completely equivalent. For instance, a "mv" operation in a blob store is emulated as copy + delete.
What the ASF can do is warn that our own BlobStore filesystems (currently s3:, s3n: and swift:) are not complete replacements for hdfs:, as operations such as rename() are only emulated through copying then deleting all operations, and so a directory rename is not atomic -a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
Source: https://cwiki.apache.org/confluence/display/HADOOP2/HCFS
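To make the non-atomic rename concrete, here is a toy Python sketch of a blob store whose directory rename is emulated as a per-object copy followed by a delete. ToyBlobStore, its method names, the object keys, and the fail_after parameter are all hypothetical and do not correspond to any real connector API. The simulated failure shows how an interrupted rename can leave objects under both the old and the new prefix.

```python
class ToyBlobStore:
    """In-memory stand-in for a blob store: a flat mapping of key -> bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst, fail_after=None):
        """Emulate a directory rename as copy + delete, one object at a time.

        fail_after simulates a crash once that many objects have been moved.
        """
        moved = 0
        for key in sorted(k for k in self.objects if k.startswith(src)):
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError("simulated failure mid-rename")
            new_key = dst + key[len(src):]
            self.objects[new_key] = self.objects[key]  # copy
            del self.objects[key]                      # delete
            moved += 1


store = ToyBlobStore()
store.put("out/_tmp/part-0", b"a")
store.put("out/_tmp/part-1", b"b")

try:
    store.rename_prefix("out/_tmp/", "out/final/", fail_after=1)
except RuntimeError:
    pass

# The "directory" is now split across both prefixes.
print(sorted(store.objects))
```

On HDFS, a directory rename either happens or it does not, which is why jobs that commit output by renaming a temporary directory behave differently on a blob store.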

Should I use EBS or EFS for database?

For the database directories of highly available MongoDB, Cassandra or Elasticsearch clusters, should I use EBS or EFS? MongoDB, Cassandra and Elasticsearch clusters take care of replicating data across nodes if they are configured with a replication factor > 1, so the EFS replication feature may not be needed, I guess.
EBS - for databases
EFS - for file sharing across applications, VMs etc
Here is a good article that differentiates between the storage types
https://dzone.com/articles/confused-by-aws-storage-options-s3-ebs-amp-efs-explained
EFS is for multiple servers having access to the same set of files. Cassandra has replication built in, so it has no use for that feature. You would not want multiple Cassandra nodes accessing the same files anyway as each node manages its own sstables.
Not to mention Cassandra is disk intensive and gets angry if there is latency. Cassandra connections time out really easily. So, using an NFS mount (EFS) instead of a “local” disk is just a bad idea.
Read this if you haven’t already: https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-cassandra-on-amazon-ec2/
(Can’t speak for other databases like MongoDB.)

Backup Zookeeper data on active server

I am trying to configure production-ready Zookeeper data backup.
As I learned from different sources, Zookeeper snapshot file is not enough to guarantee a return to a previous state. In fact, the snapshot file may not even represent the state of the tree at any point in time (see corresponding stackoverflow ticket answer).
So to make the consistent zk data storage backup (to store it on a cloud or elsewhere), I need to copy snapshots with transaction logs.
The question is: how can I copy transaction log files while zookeeper is active and makes hundreds of transactions a second? Won't the files be corrupted?
What other practices can be used in this case?