Apache Atlas installation on MS Azure HDInsight Spark cluster

I have installed the Apache Atlas (embedded/non-prod) version on an MS Azure HDInsight Spark cluster and it is functional. However, I am not able to start Apache Atlas on a production-ready/multi-node cluster. Has anybody done that? Is it a good idea to install Apache Atlas on a Spark cluster after installing all the prerequisite services (HBase, Kafka, Solr) manually on the cluster?

Related

How to add Postgres to Alibaba Cloud

I have an ECS instance at Alibaba Cloud and I want to add PostgreSQL, but I can't find any tutorials on the internet.
How do I add PostgreSQL to ECS on Alibaba Cloud?
There are several ways to use PostgreSQL on Alibaba Cloud:
ApsaraDB RDS for PostgreSQL - a PaaS solution for PostgreSQL on Alibaba Cloud, so you don't have to worry about installing and configuring PostgreSQL from scratch. It comes with additional features such as high availability, disaster recovery, and backups. See their documentation on creating a PostgreSQL instance.
ApsaraDB for PolarDB - also a PaaS offering; it is Alibaba Cloud's homegrown relational database, fully compatible with MySQL and PostgreSQL. It supports higher storage capacity and node clustering, and it is designed for high performance. Check out their documentation on how to create a PostgreSQL cluster.
Self-managed PostgreSQL on ECS - of course, you can still run PostgreSQL on your own ECS instance. There are plenty of resources on how to install and configure your own PostgreSQL; DigitalOcean's tutorial on installing PostgreSQL on Ubuntu 20.04 is a good starting point. Whichever option you pick, a quick JDBC connectivity check is sketched below.
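A minimal sketch of such a check, assuming the PostgreSQL JDBC driver is on the classpath; the endpoint, database name, and credentials are placeholders for your own instance:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PgConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ApsaraDB RDS / PolarDB / self-managed endpoint.
        String url = "jdbc:postgresql://pgm-example.pg.rds.aliyuncs.com:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "myuser", "mypassword");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            if (rs.next()) {
                System.out.println(rs.getString(1));  // prints the PostgreSQL server version
            }
        }
    }
}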
You have two ways to do it:
Treat the ECS instance as a plain Linux server and build PostgreSQL yourself. This requires more skill and effort.
Use the PaaS service PolarDB (PostgreSQL-compatible); you don't need to build it step by step, you can have it running in 2-3 minutes.
The PolarDB product page is here:
https://www.alibabacloud.com/product/polardb?spm=a3c0i.20899616.6791778070.dbannerarelationaldb1.53fd2accf4slGC

How to run two Spark jobs in an EMR cluster?

I have a realtime Spark job that runs in an EMR cluster, and another batch job that runs in a separate EMR cluster and is triggered at a specific time.
How can I run both of these jobs in one EMR cluster?
Any suggestions?
If the steps in the two EMR clusters are not dependent on each other, you can use the EMR feature called Concurrency to solve your use case. This feature simply means that you can run more than one step in parallel at a time.
The feature is available from EMR version 5.28.0 onwards; if you are using an older version, you cannot use it.
When launching EMR from the AWS console, the feature is labelled 'Concurrency' in the UI; you can choose any number between 1 and 256.
When launching EMR from the AWS CLI, the feature is called 'StepConcurrencyLevel'.
You can read more about it in the AWS announcement of support for multiple concurrent steps on EMR and in the AWS CLI documentation.
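If you manage the cluster programmatically, here is a minimal sketch of raising the concurrency level on an existing cluster, assuming the AWS SDK for Java v1; the cluster id is a placeholder, and the cluster must be on EMR 5.28.0 or later:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ModifyClusterRequest;

AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

// Allow up to 2 steps to run in parallel on an existing cluster.
// "j-XXXXXXXXXXXX" is a placeholder cluster id; valid values are 1-256.
emr.modifyCluster(new ModifyClusterRequest()
        .withClusterId("j-XXXXXXXXXXXX")
        .withStepConcurrencyLevel(2));

The same setting can also be supplied when the cluster is first launched, via the StepConcurrencyLevel parameter of the RunJobFlow request.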

How do I make Scala code run on an EMR cluster by using the SDK?

I wrote Scala code to launch a cluster in EMR. I also have a Spark application written in Scala, and I want to run this Spark application on the EMR cluster. Is it possible to do this in the first script (the one that launches the EMR cluster)? I want to do all of it with the SDK, not through the console or CLI. It has to be a kind of automation, not a manual job (or with the manual work minimized).
Basically:
Launch EMR cluster -> Run Spark job on EMR -> Terminate after the job finishes
How do I do this, if it is possible?
Thanks.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
        .withJar("command-runner.jar")   // command-runner.jar lets the step invoke spark-submit
        .withArgs(params);               // e.g. "spark-submit", "--class", "<main class>", "s3://<bucket>/<app>.jar"
final StepConfig sparkStep = new StepConfig()
        .withName("Spark Step")
        .withActionOnFailure("CONTINUE")
        .withHadoopJarStep(sparkStepConf);
// clusterId is the job flow id of the launched cluster; emr is an AmazonElasticMapReduce client.
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
        .withSteps(new ArrayList<StepConfig>() {{ add(sparkStep); }});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
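The snippet above adds a step to a cluster that is already running. For the full launch -> run -> terminate flow asked about, here is a minimal sketch, again assuming the AWS SDK for Java v1 and the same emr client; the release label, instance types, roles, main class, and S3 paths are placeholders to replace with your own:

import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// One Spark step; terminate the whole cluster if it fails.
StepConfig sparkStep = new StepConfig()
        .withName("Spark Step")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("spark-submit", "--class", "com.example.Main",
                          "s3://my-bucket/my-spark-app.jar"));

// Launch a transient cluster that runs the step and then shuts itself down.
RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("transient-spark-cluster")
        .withReleaseLabel("emr-5.33.0")
        .withApplications(new Application().withName("Spark"))
        .withServiceRole("EMR_DefaultRole")
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withLogUri("s3://my-bucket/emr-logs/")
        .withInstances(new JobFlowInstancesConfig()
                .withInstanceCount(3)
                .withMasterInstanceType("m5.xlarge")
                .withSlaveInstanceType("m5.xlarge")
                .withKeepJobFlowAliveWhenNoSteps(false))  // auto-terminate when no steps remain
        .withSteps(sparkStep);

RunJobFlowResult result = emr.runJobFlow(request);
System.out.println("Launched cluster " + result.getJobFlowId());

With KeepJobFlowAliveWhenNoSteps set to false, EMR tears the cluster down as soon as the step completes, so no separate terminate call is needed.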
If you are looking just for automation, you should read about pipeline orchestration:
EMR is the AWS service that allows you to run distributed applications.
AWS Data Pipeline is an orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2).
If you'd just like to run a Spark job consistently, I would suggest creating a data pipeline and configuring it with one step: running the Scala Spark jar on the master node using a "ShellCommandActivity". Another benefit is that the jar you are running can be stored in AWS S3 (the object storage service); you just provide the S3 path to your Data Pipeline. It will pick up that jar, log onto the EMR cluster it has brought up (with the configuration you've provided), copy the jar onto the master node, and run it with the configuration given in the "ShellCommandActivity". Once the job exits (successfully or with an error), it will terminate the EMR cluster so you aren't paying for it, and it will log the output.
Please read more into it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
If you'd like, you can trigger this pipeline via the AWS SDK or even set the pipeline to run on a schedule.

Can a pure Python script (not PySpark) run in parallel on a cluster in Azure Databricks?

I want to migrate my Python scripts from running locally to running in the cloud, specifically on a cluster created in Azure Databricks.
Can a pure Python script run in parallel (using multiple nodes of a cluster at the same time) without having to be converted to PySpark?
Is it possible to check whether the job is running in parallel?
No. A pure Python script runs only on the driver node, so to use multiple worker nodes the work has to be expressed through Spark (e.g. PySpark).
If you have submitted a Spark job, you can check the logs, which show the number of executors being used; that's how you'll know the job is running in parallel.

How to set up dependent components of a Python Spark job on an AWS EMR cluster

I've written a Spark program that needs to be executed on an EMR cluster, but the Python program uses some dependent files and modules. Is there any way to set up these dependent components on a running cluster?
Can we mount an S3 bucket on the cluster nodes and put all the dependent components in S3? Is this a good idea, and how can we mount S3 buckets on EMR using Python?
(During cluster creation): You can use Amazon EMR bootstrap custom actions, which can execute a bash script at the time the cluster is created. You can install all the dependent components using this script. The bootstrap action runs on all nodes of the cluster.
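A minimal sketch of attaching such a bootstrap action at launch time, assuming the AWS SDK for Java v1; install-deps.sh is a hypothetical script in your own S3 bucket (for example a series of pip install commands):

import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

// Runs the script on every node while the cluster is being provisioned.
BootstrapActionConfig installDeps = new BootstrapActionConfig()
        .withName("install-python-deps")
        .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                .withPath("s3://my-bucket/bootstrap/install-deps.sh"));

RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("cluster-with-deps")
        // ... release label, applications, instances, roles as usual ...
        .withBootstrapActions(installDeps);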
(On a running cluster): You can use the Amazon EMR step option to create an s3-dist-cp command-runner step that copies files from S3.
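And a minimal sketch of that second option, again assuming the AWS SDK for Java v1 with an existing emr client and clusterId as in the earlier answer; the bucket and destination path are placeholders:

import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// Copies the dependency files from S3 into HDFS on the running cluster.
StepConfig copyDeps = new StepConfig()
        .withName("copy-deps-from-s3")
        .withActionOnFailure("CONTINUE")
        .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("s3-dist-cp",
                          "--src=s3://my-bucket/deps/",
                          "--dest=hdfs:///user/hadoop/deps/"));

emr.addJobFlowSteps(new AddJobFlowStepsRequest(clusterId).withSteps(copyDeps));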