How can I reduce the data replication factor on HDFS using IAE?
The idea is to use all the available HDFS disk space for testing purposes.
I have seen quite a few questions asking how to do this on other vendors' Hadoop clusters, but not on IBM Analytics Engine.
Using Ambari, you would do something like the following:
1. Connect to the Ambari web URL.
2. Click HDFS in the service list on the left.
3. Open the "Configs" tab.
4. Expand the "Advanced" settings.
5. Under "General", change the value of "Block Replication".
6. Restart the HDFS services.
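Outside the Ambari UI, the "Block Replication" setting corresponds to the dfs.replication property in hdfs-site.xml, and files that already exist keep their old replication factor until you change it explicitly. A minimal sketch of both steps from a shell on the cluster (the target factor of 2 and the path / are just example values):

    # hdfs-site.xml: default replication for newly written files
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>2</value>
    #   </property>

    # Lower the replication factor of files that already exist;
    # -w waits until the re-replication has finished
    hdfs dfs -setrep -w 2 /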
Related
How do I access or view the local storage for Druid? I would like to view the segments or copy them to a file. I am running the Druid operator on Kubernetes. I have tried exec commands against the historical and MiddleManager pods, but I am unable to get into any of the Druid pods.
Have you tried looking where the deep storage configuration points?
Deep storage is where segments are stored. It is a storage mechanism
that Apache Druid does not provide. This deep storage infrastructure
defines the level of durability of your data, as long as Druid
processes can see this storage infrastructure and get at the segments
stored on it, you will not lose data no matter how many Druid nodes
you lose. If segments disappear from this storage layer, then you will
lose whatever data those segments represented.
Source: Deep Storage on Druid documentation
For example, you need to know which directory is pointed to by druid.storage.storageDirectory.
Remember that the data is saved in segments, as described here: Segments on Apache Druid documentation
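As a sketch, for local deep storage the relevant entries in common.runtime.properties look roughly like this (the directory path and pod name are assumptions, not values from your cluster):

    # common.runtime.properties -- local deep storage
    druid.storage.type=local
    druid.storage.storageDirectory=/druid/deepstorage   # assumption: check your own config

    # Then list the segments in that directory from one of the pods, e.g.:
    # kubectl exec -it <historical-pod> -- ls -R /druid/deepstorage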
Useful Documentation:
Ingestion troubleshooting FAQ
HDFS as Deep-Storage: Druid is not storing the historical data on hdfs
Druid Setup with HDFS
Change Local Storage to S3 as deepstorage
Hi guys, there are ingestion tasks running in my Druid setup on Kubernetes. Many segments in multiple datasources are unavailable, even though ingestion was successful, so I am not able to show the ingested data in my app. Why are segments unavailable, and how do I fix it? Also, what are the steps to restart all Druid components in a multi-node Kubernetes cluster?
It is difficult to say why segments are unavailable without looking at some logs. The coordinator log and the historical logs will be useful to determine why historical processes are unable to make the segments available (download them from deep storage).
A quick thought: could you be out of space for the historicals' segment cache?
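One quick way to check that from outside the pod, assuming a historical pod name and the cache path configured in druid.segmentCache.locations (both names below are assumptions):

    # Check free space where the historical keeps its segment cache
    kubectl exec -it druid-historical-0 -- df -h /druid/segment-cache

    # The coordinator log usually says why a segment could not be assigned or loaded
    kubectl logs druid-coordinator-0 | grep -i segment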
I am studying for the Professional Data Engineer and I wonder what is the "Google recommended best practice" for hot data on Dataproc (given that costs are no concern)?
If cost is a concern then I found a recommendation to have all data in Cloud Storage because it is cheaper.
Can a mechanism be set up, such that all data is on Cloud Storage and recent data is cached on HDFS automatically? Something like AWS does with FSx/Lustre and S3.
What to store in HDFS and what to store in GCS is a case-dependent question. Dataproc supports running Hadoop or Spark jobs on GCS with the GCS connector, which makes Cloud Storage HDFS-compatible without significant performance loss.
The Cloud Storage connector is installed by default on all Dataproc cluster nodes and is available in both Spark and PySpark environments.
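In practice that means jobs can point at gs:// paths exactly as they would at hdfs:// paths. A sketch (the cluster, region, bucket and class names are assumptions):

    # The GCS connector lets the Hadoop CLI read Cloud Storage directly
    hadoop fs -ls gs://my-bucket/data/

    # Submit a Spark job on Dataproc that reads from and writes to Cloud Storage
    gcloud dataproc jobs submit spark \
        --cluster=my-cluster --region=us-central1 \
        --class=com.example.MyJob \
        --jars=gs://my-bucket/jars/my-job.jar \
        -- gs://my-bucket/data/input/ gs://my-bucket/data/output/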
After researching a bit: the performance of HDFS and Cloud Storage (or any other blob store) is not completely equivalent. For instance, a "mv" operation in a blob store is emulated as copy + delete.
What the ASF can do is warn that our own BlobStore filesystems (currently s3:, s3n: and swift:) are not complete replacements for hdfs:, as operations such as rename() are only emulated through copying then deleting all operations, and so a directory rename is not atomic -a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
Source: https://cwiki.apache.org/confluence/display/HADOOP2/HCFS
I'm working on trying to setup some monitoring on a Google Cloud SQL node and am not seeing how to do it. I was able to install the monitoring agent on my Google Compute Engine instances to monitor CPU, Network, etc. I have not been able to figure out how to do so on the Cloud SQL instance. I have access to these types of monitoring:
Storage Usage (GB)
Number of Read/Write operations
Egress Bytes
Active Connections
MySQL Queries
MySQL Questions
InnoDB Pages Read/Written (pages/sec)
InnoDB Data fsyncs (operations/sec)
InnoDB Log fsyncs (operations/sec)
I'm sure these are great options, but at this point all I want to pay attention to is whether my node is performing well from a CPU/RAM standpoint, as those seem to be the first and foremost measures of performance.
If I'm missing something, or misunderstanding what I'm trying to do, any advice is appreciated.
Thanks!
Google has Stackdriver, which is for logging and monitoring Google Cloud and AWS infrastructure. It can monitor pretty much everything on GCP, and you can create visualizations to watch your Cloud SQL instance in one dashboard. You just have to:
1. Log in to Stackdriver and go to any existing dashboard; if you don't have one, create one.
2. Add a chart and select Cloud SQL as the resource name.
3. Select CPU Utilization as the metric and save. You can also monitor memory, disk I/O, delta count of queries, server uptime and much more.
If you want to monitor any other GCP resource, such as Compute Engine, App Engine, Kubernetes Engine, a storage bucket, Bigtable or Pub/Sub, you just have to select the appropriate resource name from the list. Hope you got your answer.
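If you prefer pulling the numbers programmatically rather than through the dashboard, the same CPU metric is exposed by the Cloud Monitoring API as cloudsql.googleapis.com/database/cpu/utilization. A rough sketch with curl (the project ID and time window are assumptions):

    # Query the Cloud SQL CPU utilization time series for a one-hour window
    curl -s -G \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      --data-urlencode 'filter=metric.type="cloudsql.googleapis.com/database/cpu/utilization"' \
      --data-urlencode 'interval.startTime=2024-01-01T00:00:00Z' \
      --data-urlencode 'interval.endTime=2024-01-01T01:00:00Z' \
      "https://monitoring.googleapis.com/v3/projects/my-project/timeSeries"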
You can view all of them directly from the "Overview" tab of the Cloud SQL console.
I have added this as a feature request, issue 110:
https://code.google.com/p/googlecloudsql/issues/detail?id=110
I have read that you can replicate a Cloud SQL database to MySQL. Instead, I want to replicate from a MySQL database (that the business uses to keep inventory) to Cloud SQL so it can have up-to-date inventory levels for use on a web site.
Is it possible to replicate from MySQL to Cloud SQL? If so, how do I configure that?
This is something that is not yet possible in CloudSQL.
I'm using DBSync to do it, and it is working fine.
http://dbconvert.com/mysql.php
The Sync version does what you want.
It works well with App Engine and Cloud SQL. You must authorize external connections first.
This is a rather old question, but it might be worth noting that this now seems possible by Configuring External Masters.
The high level steps are:
1. Create a dump of the data from the master and upload the file to a storage bucket.
2. Create a master instance in Cloud SQL.
3. Set up a replica of that instance using the external master IP, username and password, and provide the dump file location.
4. Set up additional replicas if needed.
Voilà!
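For the dump in step 1, a sketch of what that can look like (the database name and bucket are assumptions; check the Cloud SQL external-master docs for the exact mysqldump flags they expect):

    # Dump the source database with binlog coordinates so the replica can catch up
    mysqldump --databases inventory \
        --single-transaction --master-data=1 --hex-blob \
        -u repl_user -p > inventory-dump.sql

    # Upload the dump to a Cloud Storage bucket for Cloud SQL to import
    gsutil cp inventory-dump.sql gs://my-bucket/inventory-dump.sql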