Presto on Preemptible GCE instances - google-cloud-storage

I am running an instance group of 20 preemptible GCE instances to read ORC files on Google Cloud Storage. The data is partitioned by hour, with about 2GB per hour.
What type of instances should I use?
How much of the RAM should be given to the JVM?
I am using an autoscaling configuration of 80% CPU and a 10-minute cooldown. Is there a more subtle configuration for Presto?
Is there a solution for server shutdowns due to lack of resources?
Partial responses will be appreciated as well.

As of PrestoDB version 0.199, there is no Google Cloud Storage connector for Presto, which makes it impossible to query GCS data.
Regarding hardware requirements, I'll cite the Teradata documentation here.
Memory
You should allocate a minimum of 16GB of RAM per node for Presto. But
recommend 64GB for most production workloads.
Network Bandwidth
It is recommended to have 10 Gigabit Ethernet between all the nodes in
the cluster.
Other Recommendations
Presto can be installed on any normally configured Hadoop cluster.
YARN should be configured to account for resources dedicated to
Presto. For example, if a node has 64GB of RAM, perhaps you would
normally allocate 60GB to YARN. If you install Presto on that node and
give Presto 32GB of RAM, then you should subtract 32GB from the 60GB
and let YARN only allocate 28GB per node. An optimized configuration
might choose to have separate Presto and Hadoop nodes. The optimized
configuration allows you to give more memory to Presto, and thus
perform larger join queries, for example.
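
The YARN arithmetic in the quoted recommendation can be sketched in a few lines of Python. This is only an illustration of the subtraction described above; the function name `yarn_memory_after_presto` is hypothetical and not part of Presto or YARN.

```python
def yarn_memory_after_presto(yarn_gb: int, presto_gb: int) -> int:
    """Return the memory (in GB) YARN may still allocate on a node
    once Presto's share has been reserved.

    Hypothetical helper for illustration only; the numbers come from
    the quoted documentation, not from any Presto or YARN API.
    """
    if presto_gb > yarn_gb:
        raise ValueError("Presto's share exceeds the memory normally given to YARN")
    return yarn_gb - presto_gb


# Example from the quote: a 64GB node where YARN normally gets 60GB
# and Presto is given 32GB.
print(yarn_memory_after_presto(60, 32))  # 28
```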

Related

Cockroach DB is slower in 64GB RAM cluster as compared to 32GB RAM cluster

I have installed CockroachDB on the following clusters:
a. 3 nodes cluster in which every node has 32GB RAM
b. 3 nodes cluster in which every node has 64GB RAM
I am testing performance by running the same queries (select, join, insert, delete, aggregate functions, nested queries, concurrent queries) on both clusters.
After testing three times, I found that the 64GB cluster is slower than the 32GB cluster.
I was expecting the 64GB RAM cluster to be faster than the 32GB RAM cluster.
I have not been able to find an explanation for this.
Any answers or insights would be greatly appreciated.
Thanks in Advance!
Thanks for the post! Without knowing the exact machine specs, configuration settings, and workload it would be hard to figure out what could be happening here :)
We have a Slack channel that might be a bit easier for back and forth on performance optimization, or a support ticket could be opened with a best effort SLA. If you have any more information on the run configuration that would be great too!
https://www.cockroachlabs.com/join-community/
https://support.cockroachlabs.com/hc/en-us

Apache Druid: My VM hangs when I try to load quickstart data

I'm new to Apache Druid. I used an Azure VM (Standard B2s, 2 vCPUs, 4 GiB memory) to install Apache Druid and then tried to load the quickstart tutorial JSON data (wikiticker-2015-09-12-sampled.json.gz) using the console.
I followed all the instructions in the Druid tutorial on the official site. I tried multiple times, but each time the VM hangs and becomes unresponsive. Am I missing anything, or do I need to make any configuration changes for the task to execute before loading the data?
Thanks.
Druid comes with several startup configuration profiles for a range of machine sizes.
Single server reference configurations:
Nano-Quickstart: 1 CPU, 4GB RAM
Micro-Quickstart: 4 CPU, 16GB RAM
Small: 8 CPU, 64GB RAM (~i3.2xlarge)
Medium: 16 CPU, 128GB RAM (~i3.4xlarge)
Large: 32 CPU, 256GB RAM (~i3.8xlarge)
X-Large: 64 CPU, 512GB RAM (~i3.16xlarge)
To start the Druid services I was using the micro configuration profile:
./bin/start-micro-quickstart
However, my machine, as mentioned above, is closer to the Nano configuration, so I should have been using the command below to start the Druid services:
./bin/start-nano-quickstart
I was now able to successfully load and query the data file.
Please check your machine configuration before running the service start command.
Regards,
Udayan
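
As a rough sketch of the check described in this answer, the snippet below picks the largest reference profile that a machine can satisfy. It assumes the listed CPU and RAM figures can be treated as minimum requirements, which is a simplification; `PROFILES` and `pick_profile` are hypothetical names, not part of Druid.

```python
from typing import Optional

# (profile name, CPUs, RAM in GB), copied from the reference list above.
PROFILES = [
    ("nano-quickstart", 1, 4),
    ("micro-quickstart", 4, 16),
    ("small", 8, 64),
    ("medium", 16, 128),
    ("large", 32, 256),
    ("xlarge", 64, 512),
]


def pick_profile(cpus: int, ram_gb: float) -> Optional[str]:
    """Return the largest profile whose CPU and RAM figures both fit.

    Assumption: the reference sizes are treated as minimums.
    Returns None when even the smallest profile does not fit.
    """
    best = None
    for name, need_cpus, need_ram in PROFILES:
        if cpus >= need_cpus and ram_gb >= need_ram:
            best = name
    return best


# The Azure Standard B2s VM from the question: 2 vCPUs, 4 GiB memory.
print(pick_profile(2, 4))  # nano-quickstart
```

Under this assumption, the B2s VM only fits the Nano profile, which matches the fix described above.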

Cassandra and MongoDB minimum system requirements for Windows 10 Pro

RAM: 4GB
Processor: i3-5010U CPU @ 2.10 GHz
64-bit OS
Can Cassandra and MongoDB be installed on such a laptop? Will they run successfully?
The hardware configuration proposed does not meet the minimum requirements. For Cassandra, the documentation requests a minimum of 8GB of RAM and at least 2 cores.
MongoDB's documentation also states that it needs at least 2 real cores or one multi-core physical CPU. With 4GB of RAM, WiredTiger will allocate 1.5GB for its cache. Please also note that MongoDB will require BIOS changes to allow memory interleaving for Non-Uniform Memory Access (NUMA); such changes will affect the laptop's performance for other processes.
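
The 1.5GB figure follows from WiredTiger's default cache sizing, which MongoDB documents as the larger of 50% of (RAM minus 1GB) and 256MB. A minimal sketch of that calculation, with a hypothetical function name:

```python
def wiredtiger_default_cache_gb(ram_gb: float) -> float:
    """Approximate WiredTiger's default internal cache size in GB.

    Uses the documented default: the larger of 50% of (RAM - 1GB)
    and 256MB. Hypothetical helper, not a MongoDB API.
    """
    return max(0.5 * (ram_gb - 1.0), 0.25)


# The 4GB laptop from the question.
print(wiredtiger_default_cache_gb(4))  # 1.5
```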
Will it run successfully?
This will depend on the expected workload. There are documented examples where Cassandra was installed on a Raspberry Pi array, which by design was expected to perform slowly and to hold only a limited amount of data in the cluster.
If you are looking for a small sandbox to start using these databases, there are other options. MongoDB has a database-as-a-service offering named Atlas, with a free tier for a 3-node replica set and up to 512MB of storage. For Cassandra there are similar options: AWS offers a small cluster of its Managed Cassandra Service (MCS) in the free tier, and DataStax is also planning to offer similar services with Constellation.

Should I use SSD or HDD as local disks for kubernetes cluster?

Is it worth using SSD as boot disk? I'm not planning to access local disks within pods.
Also, GCP creates a 100GB disk by default. If I use a 20GB disk, will it cripple the cluster, or is it OK to use smaller disks?
Why one or the other? Kubernetes (Google Container Engine) is mainly memory- and CPU-intensive unless your applications need high throughput on the disks. If you want to save money, you can label the HDD nodes and use node affinity to control which pods go where, so you can keep just a few SSD nodes and target them with affinity rules.
I would always recommend SSD considering the small difference in price and large difference in performance. Even if it just speeds up the deployment/upgrade of containers.
Reducing the disk size to what is required for running your pods should save you more. I cannot give a general recommendation for disk size, since it depends on the OS you are using, how many pods will end up on each node, and how big each pod is going to be. To give an example: when I run CoreOS-based images with staging deployments for nginx, PHP, and some application servers, I can reduce the disk size to 10GB with ample free room (both for master and worker nodes). At the other extreme, if I run self-contained Go application containers with no storage needs, each pod will only require a few MB of space.

Spark Configuration: SPARK_MEM vs. SPARK_WORKER_MEMORY

In spark-env.sh, it's possible to configure the following environment variables:
# - SPARK_WORKER_MEMORY, to set how much memory to use (e.g. 1000m, 2g)
export SPARK_WORKER_MEMORY=22g
[...]
# - SPARK_MEM, to change the amount of memory used per node (this should
# be in the same format as the JVM's -Xmx option, e.g. 300m or 1g)
export SPARK_MEM=3g
If I start a standalone cluster with this:
$SPARK_HOME/bin/start-all.sh
I can see at the Spark Master UI webpage that all the workers start with only 3GB RAM:
-- Workers Memory Column --
22.0 GB (3.0 GB Used)
22.0 GB (3.0 GB Used)
22.0 GB (3.0 GB Used)
[...]
However, I specified 22g as SPARK_WORKER_MEMORY in spark-env.sh
I'm somewhat confused by this. Probably I don't understand the difference between "node" and "worker".
Can someone explain the difference between the two memory settings and what I might have done wrong?
I'm using spark-0.7.0. See also here for more configuration info.
A standalone cluster can host multiple Spark clusters (each "cluster" is tied to a particular SparkContext). For example, you can have one cluster running kmeans, one cluster running Shark, and another one running some interactive data mining.
In this case, 22GB is the total amount of memory each worker offers to the Spark standalone cluster, and your particular SparkContext is using 3GB per node. So you can create 6 more SparkContexts at 3GB each, for a combined total of 21GB per node.
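
The arithmetic behind that last sentence can be written out as a short sketch. The function name is hypothetical, and it assumes every SparkContext requests the same per-node memory (the `SPARK_MEM` value).

```python
def additional_contexts(worker_gb: int, per_context_gb: int, running: int = 1) -> int:
    """Return how many more equally sized SparkContexts fit on one worker.

    worker_gb      -- memory the worker offers (SPARK_WORKER_MEMORY)
    per_context_gb -- memory each context uses per node (SPARK_MEM)
    running        -- contexts already running on the worker

    Hypothetical helper; assumes all contexts use the same amount.
    """
    free_gb = worker_gb - running * per_context_gb
    return max(free_gb // per_context_gb, 0)


# Values from the question: 22g per worker, 3g per context, one context running.
extra = additional_contexts(22, 3)
print(extra)            # 6
print((extra + 1) * 3)  # 21
```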