Since Spark 3.0 introduces the Horovod estimator, I'm wondering how it compares to using Horovod directly via its original mpirun, and also whether it can be run on top of the k8s scheduler? Any suggestion is welcome.
Can I use different versions of Cassandra in a single cluster? My goal is to transfer data from one DC (A) to a new DC (B) and then decommission DC (A), but DC (A) is on version 3.11.3 and DC (B) is going to be 3.11.7+.
* I want to use a K8ssandra deployment with metrics and other features. The K8ssandra project cannot deploy Cassandra versions older than 3.11.7.
Thank you!
K8ssandra itself is purposely an "opinionated" stack, which is why you can only use certain more recent Cassandra versions that are not known to include major issues.
But if you already have an existing cluster, that doesn't mean you can't migrate between them. Check out this blog post for an example of doing that: https://k8ssandra.io/blog/tutorials/cassandra-database-migration-to-kubernetes-zero-downtime/
I run multiple Spark jobs on a cluster via $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster.
When a new version of Spark goes live I'd like to somehow roll out a new distribution onto the cluster alongside the old one and then gradually migrate all my jobs one by one.
Unfortunately, Spark relies on the $SPARK_HOME environment variable, so I can't figure out how to achieve this.
It would be especially useful when Spark for Scala 2.12 is out.
It is possible to run any number of Spark distributions on a YARN cluster. I've done it many times on my MapR cluster, mixing 1-3 different versions, as well as setting up the official Apache Spark build there.
All you need to do is tweak conf/spark-env.sh in the new distribution (rename spark-env.sh.template) and add a line:
export SPARK_HOME=/your/location/of/spark/spark-2.1.0
Each distribution then has its own bin/spark-submit, so you can launch jobs through whichever version you want and migrate them one by one while both remain installed on the cluster.
I am currently using Spark to build my dimensional data model, and we are currently uploading the jar to an AWS EMR cluster to test it. However, this is tedious and time-consuming for testing and building tables.
I would like to know what others are doing to speed up their development. One possibility I came across in my research is running Spark jobs directly from the IDE with IntelliJ IDEA, and I would like to know about other development processes that make development faster.
The approaches I have tried so far are:
Installing Spark and HDFS on two or three commodity PCs and testing the code there before submitting it to the cluster.
Running the code on a single node (local mode) to catch trivial mistakes (a sketch of this follows below).
Submitting the jar file to the cluster.
What the first and third approaches have in common is building the jar file, which can take a lot of time. The second one is not well suited to finding and fixing the bugs and problems that only arise in distributed environments.
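For the single-node/local route, here is a minimal sketch in Scala (the object name, sample rows and aggregation are invented for illustration; the point is that master("local[*]") runs everything inside the IDE's JVM, so no jar build or EMR upload is needed):

import org.apache.spark.sql.SparkSession

object LocalSmokeTest {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside the current JVM using all available cores.
    val spark = SparkSession.builder()
      .appName("dim-model-smoke-test")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Invented stand-in for real source data; on the cluster this would be a read from S3/HDFS.
    val orders = Seq((1, "2024-01-01", 10.0), (2, "2024-01-01", 25.5), (3, "2024-01-02", 7.0))
      .toDF("order_id", "order_date", "amount")

    // The transformation logic under test stays identical to what later runs on EMR.
    orders.groupBy("order_date").sum("amount").show()

    spark.stop()
  }
}

The same job logic runs unchanged on EMR once the local master is dropped (or overridden via spark-submit), so only the final run needs a rebuild and upload, not every experiment.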
I am taking a deep look inside Flink to see how I can use it on a project and had a question for the creators / high-level thinkers... why does Flink use YARN as the default resource manager?
Was Kubernetes considered? Or is it one of those things where it started on YARN and it works pretty well...
I have come across many projects and articles that allow Kubernetes and YARN to work together, including the Myriad project that allows YARN to be deployed on Mesos (but I am on Kubernetes...).
I have a very large compute cluster (2000 or so nodes) that I use, and I want to use the super cool CEP features of Flink feeding off a Kafka infrastructure (also deployed onto this Kubernetes environment).
I am looking to understand the reasons behind using YARN as the resource manager underneath Flink and whether it would be possible (with some effort and contribution to the project) to make Kubernetes an option alongside YARN.
Please note: I am new to YARN, just reading up on it. I'm also new to Flink and learning about its deployment and scale-out architecture.
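To make the intent concrete, the kind of CEP job I have in mind is roughly the following sketch (assuming the flink-cep-scala dependency; the Reading type, thresholds and pattern are invented, and in the real job the stream would come from a Kafka consumer rather than fromElements):

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._

// Invented event type for illustration.
case class Reading(sensorId: String, temperature: Double)

object CepSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Static source to keep the sketch self-contained; in production this would be a Kafka source.
    val readings: DataStream[Reading] = env.fromElements(
      Reading("s1", 85.0), Reading("s1", 102.0), Reading("s2", 40.0))

    // A warning reading immediately followed by a critical one.
    val pattern = Pattern.begin[Reading]("warning").where(_.temperature > 80.0)
      .next("critical").where(_.temperature > 100.0)

    // select emits one alert string per matched pattern.
    val alerts: DataStream[String] = CEP.pattern(readings, pattern)
      .select(m => s"critical temperature on sensor ${m("critical").head.sensorId}")

    alerts.print()
    env.execute("cep-sketch")
  }
}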
Flink is not tied to YARN. It can also run on Apache Mesos and there are also users running it on Kubernetes. In the current version (Flink 1.4.1), there are a few things to consider when running Flink in Kubernetes (see this talk by Patrick Lucas).
The Flink community is also currently working on improving Flink's support for container setups. The effort is called FLIP-6 and will be included in the next release (Flink 1.5.0).
Production system: HDP 2.5.0.0 managed via Ambari 2.4.0.1
Plenty of demands are coming in for executing a range of code (Java MR, Scala, Spark, R, etc.) on top of HDP, but from an IDE on a desktop Windows machine.
For Spark and R, we have RStudio set up.
The challenge lies with Java, Scala and so on; also, people use a range of IDEs, from Eclipse to IntelliJ IDEA.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and also has plenty of bugs when working with the latest versions of Hadoop; for IntelliJ IDEA I couldn't find reliable information on the official website.
I believe the Hive and HBase client APIs are a reliable way to connect from Eclipse etc., but I am skeptical about executing MR or other custom Java/Scala code.
I have referred to several threads like this and this; however, I still have the question: does any IDE like Eclipse/IntelliJ IDEA have official support for Hadoop? Even Spring Data for Hadoop seems to have lost traction, and it didn't work as expected two years ago anyway ;)
As a realistic alternative, which tool/plugin/library should be used to test MR and other Java/Scala code 'locally', i.e. on the desktop machine, using a standalone version of the cluster?
Note: I do not wish to work against/in the sandbox; this is about connecting to the prod cluster directly.
I don't think there is a general solution that would work equally well for all Hadoop services. Each one has its own development, testing and deployment scenarios, as they are different standalone products. For the MR case you can use MRUnit to simulate your work locally from the IDE. Another option is LocalJobRunner. Both allow you to check your MR logic directly from the IDE. For Storm you can use the backtype.storm.Testing library to simulate a topology's workflow. But all of these are used from the IDE without direct cluster communication, unlike the Spark and RStudio integration.
As for the MR recommendation, your job should ideally go through the following lifecycle: write the job and test it locally using MRUnit, then run it on a development cluster with some test data (see MiniCluster as an option), and then run it on the real cluster with custom counters that help you locate malformed data and properly maintain the job.
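To make the MRUnit part concrete, here is a rough sketch of such a test (written in Scala, but the Java version is structurally the same; TokenMapper is an invented word-count mapper, MRUnit is assumed to be on the test classpath, and normally this would live in a JUnit/ScalaTest test method):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mrunit.mapreduce.MapDriver

// Invented mapper under test: emits (word, 1) for every token in a line.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach(w => ctx.write(new Text(w), one))
}

object TokenMapperSpec extends App {
  // MRUnit drives the mapper entirely in memory: no cluster, HDFS or jar upload involved.
  MapDriver.newMapDriver(new TokenMapper)
    .withInput(new LongWritable(0), new Text("spark on yarn"))
    .withOutput(new Text("spark"), new IntWritable(1))
    .withOutput(new Text("on"), new IntWritable(1))
    .withOutput(new Text("yarn"), new IntWritable(1))
    .runTest()
}

Once a test like this passes, the same jar can move on to a MiniCluster or a small dev cluster before it reaches production, as described above.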