How to configure a Python step for Hadoop to run in a virtualenv in AMI 4.x

In AMI 3.x the file /home/hadoop/conf/hadoop-user-env.sh existed, and the legacy code I'm looking at ran this command during bootstrapping:
echo ". /home/hadoop/resources/pips/bin/activate" >> /home/hadoop/conf/hadoop-user-env.sh
This activated the virtualenv for Python.
In AMI 4.x this file is gone. How am I supposed to get a Python step in Hadoop to run in a virtualenv under AMI 4.x?

Going to give this a shot and hope it helps you.
In Amazon EMR AMI versions 2.x and 3.x, there was a hadoop-user-env.sh script which was not part of standard Hadoop and was used along with the configure-daemons bootstrap action to configure the Hadoop environment. The script included the following actions:
#!/bin/bash
export HADOOP_USER_CLASSPATH_FIRST=true;
echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/hadoop-user-env.sh
In Amazon EMR release 4.x, you can do the same now with the hadoop-env configurations:
[
  {
    "Classification": "hadoop-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_USER_CLASSPATH_FIRST": "true",
          "HADOOP_CLASSPATH": "/path/to/my.jar"
        }
      }
    ]
  }
]
There is more info about the differences and the replacement configurations on Amazon's documentation site.
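If the goal is specifically the virtualenv from the original question, the same export classification can be used to put the virtualenv's bin directory first on the PATH, which is most of what sourcing activate does. This is only a sketch under the assumption that the virtualenv still lives at /home/hadoop/resources/pips, as in the question; verify the generated hadoop-env.sh on a test cluster before relying on it:
[
  {
    "Classification": "hadoop-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PATH": "/home/hadoop/resources/pips/bin:$PATH"
        }
      }
    ]
  }
]
Because JSON has no comments, the caveat goes here instead: the PATH value above is an assumption rather than something from the docs. EMR writes these entries as export lines into hadoop-env.sh, so $PATH should expand at shell time, but the exact variable your step needs (PATH, PYSPARK_PYTHON, etc.) depends on how the step invokes Python.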

Related

MongoDB cluster Debian 11 compatibility

I am installing a new cluster using Mongo MMS on Debian 11.
I also have 2 CentOS 7.9 hosts for the shards.
When I deploy the cluster config for the first time it doesn't work; I get the following message:
Cluster config did not pass validation for pre-expansion semantics : MongoDB Tools download URL for this host was not found in the list of available URLS : [ {100.5.3 map[linux:map[amazon2:https://mongodb.com/tools/db/mongodb-database-tools-amazon-x86_64-100.5.3.tgz amzn64:https://mongodb.com/tools/db/mongodb-database-tools-amazon-x86_64-100.5.3.tgz arm64_amazon2:https://mongodb.com/tools/db/mongodb-database-tools.......
The parameter Installer Download Source is set to remote, and I can find the MongoDB Database Tools for Debian 11 on the MongoDB website, so I don't get why it can't fetch them automatically here.
Thanks for your help.

dbt to Snowflake connection fails via profiles.yml

I'm trying to connect to Snowflake via dbt, but the connection fails with the error below:
Using profiles.yml file at /home/myname/.dbt/profiles.yml
Using dbt_project.yml file at /mnt/c/Users/Public/learn_dbt/rks-learn-dbt/learn_dbt/dbt_project.yml
Configuration:
profiles.yml file [ERROR invalid]
dbt_project.yml file [OK found and valid]
Profile loading failed for the following reason:
Runtime Error
Could not find profile named 'learn_dbt'
Required dependencies:
- git [OK found]
Any advice, please?
Note: I am learning to set up dbt connections by following Udemy videos.
Below is my profiles.yml file:
learn_dbt:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: XXXXXX
      user: XXXX
      password: XXXX
      role: transform_role
      database: analytics
      warehouse: transform_wh
      schema: dbt
      threads: 1
      client_session_keep_alive: False
My first guess is that you have a profiles.yml file in your dbt project folder and dbt is not actually using the one in /home/myname/.dbt/.
Could you try running the following?
dbt debug --profiles-dir /home/myname/.dbt
The flag --profiles-dir works on most dbt CLI commands and lets you use a custom profiles.yml that's outside your project. I use this flag all the time.
I had to run pip install dbt-snowflake and then it worked.
It seems dbt has separated its modules into dbt-core and its adapters (dbt-snowflake, dbt-postgres, etc.).
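For completeness, a minimal sequence that combines both suggestions above; this assumes the profile is really named learn_dbt and that profiles.yml lives under /home/myname/.dbt, as in the question:
pip install dbt-snowflake                     # installs dbt-core plus the Snowflake adapter
dbt debug --profiles-dir /home/myname/.dbt    # check that the profile is found and the connection works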
I think this is a similar issue to what I had when using the cloud environment.
If you are using a Snowflake instance on the West coast, the account name looks like <xxx12345>.
If you are using a Snowflake instance on the East coast, the account name looks like <xxx12345.us-east-1>.
Overall this error means it is unable to read your environment template to get your Snowflake account details.
'env.pd.template.bat' or 'env.pd.template.sh' is the base file which has your Snowflake account settings, so you have to run this command to connect to Snowflake from your editor.
You can use the '.bat' or '.sh' version depending on whether you are in CMD/PowerShell or a Unix-style shell.
In my scenario I ran 'env.pd..private.bat'; you need to run this command every time to connect to the Snowflake account with your credentials. I ran it in a CMD window, and it fixed my error.

Choosing the correct Amazon Machine Image (AMI) for bash script upload to Postgres

I have a bash script written for OSX that downloads many large .zip files, unpacks them, and writes the contents to a Postgres database.
I want to do this from an EC2 instance because the operation takes a long time.
I don't know which AMI to choose, given that OS X is not an option.
Should I be doing this on Ubuntu?
You don't need a special AMI; you can use cloud-init (EC2 user data) to launch your bash script after your instance starts. As far as the operating system is concerned, I think any Linux distribution is capable of doing the things you mentioned in the question; however, I would recommend Amazon Linux because it is optimized for EC2.
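A rough user-data sketch along those lines, assuming Amazon Linux and an already-reachable Postgres endpoint; the S3 bucket, script name, and log path are placeholders rather than anything from the question, and the instance needs an IAM role allowed to read the bucket:
#!/bin/bash
# Runs once at first boot via cloud-init (EC2 user data).
yum install -y postgresql unzip                                      # psql client plus unzip for the archives
aws s3 cp s3://my-bucket/load_data.sh /home/ec2-user/load_data.sh    # hypothetical location of your existing script
chmod +x /home/ec2-user/load_data.sh
/home/ec2-user/load_data.sh >> /var/log/load_data.log 2>&1           # downloads the .zip files, unpacks them, writes to Postgres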

Running shp2pgsql in Azure cloud shell

I'm working with an Azure PostgreSQL database and am using the Cloud Shell to run psql scripts without problems. I'm now trying to load some shp files via the shp2pgsql command. The Cloud Shell responds with:
bash: shp2pgsql: command not found
Is it possible at all to use shp2pgsql with the Cloud Shell, or am I missing something? I've already successfully created the postgis extension on the PostgreSQL server.
Unfortunately, it seems that you cannot run the shp2pgsql command in the Azure Cloud Shell. It is just an interactive, browser-accessible shell for managing Azure resources, and only a limited set of tools is preinstalled. You can get more details from the 'Features & tools for Azure Cloud Shell' documentation.
If you want to do something more complicated, I suggest you run it on your own Azure VM. Hope this is helpful to you.
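For reference, a minimal sketch of the VM route, assuming an Ubuntu VM; the server name, database, user, table, and SRID below are placeholders:
sudo apt-get install -y postgis postgresql-client    # the postgis package ships shp2pgsql
shp2pgsql -s 4326 roads.shp public.roads | psql "host=myserver.postgres.database.azure.com dbname=mydb user=myuser sslmode=require"    # psql prompts for the password (or set PGPASSWORD)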

What is a Spark kernel for Apache Toree?

I have a Spark cluster whose master is on 192.168.0.60:7077.
I used to use Jupyter Notebook to write some PySpark scripts.
I now want to move on to Scala.
I don't know the Scala world.
I am trying to use Apache Toree.
I installed it, downloaded the Scala kernels, and ran it to the point of opening a Scala notebook. Up to there everything seems OK :-/
But I can't find the Spark context, and there are errors in Jupyter's server logs:
[I 16:20:35.953 NotebookApp] Kernel started: afb8cb27-c0a2-425c-b8b1-3874329eb6a6
Starting Spark Kernel with SPARK_HOME=/Users/romain/spark
Error: Master must start with yarn, spark, mesos, or local
Run with --help for usage help or --verbose for debug output
[I 16:20:38.956 NotebookApp] KernelRestarter: restarting kernel (1/5)
As I don't know Scala, I am not sure what the issue is here.
It could be:
I need a spark kernel (according to https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel )
I need to add an option on the server (the error message says 'Master must start with yarn, spark, mesos, or local' )
or something else :-/
I just wanted to migrate from Python to Scala, and I've spent a few hours lost just on starting up the Jupyter IDE :-/
It looks like you are using Spark in standalone deploy mode. As Tzach suggested in his comment, the following should work:
SPARK_OPTS='--master=spark://192.168.0.60:7077' jupyter notebook
SPARK_OPTS expects the usual spark-submit parameter list.
If that does not help, you would need to check the SPARK_MASTER_PORT value in conf/spark-env.sh (7077 is the default).
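If exporting SPARK_OPTS every time is annoying, another option (a sketch, assuming the kernel was installed via the jupyter toree installer) is to bake the master URL into the kernel at install time:
jupyter toree install --spark_home=/Users/romain/spark --spark_opts='--master=spark://192.168.0.60:7077'
This re-registers the Scala kernel so new notebooks pick up the standalone master without any extra environment variables.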