Writing a file to a directory created with java.io.File.mkdirs() and then accessing it does not work in Spark cluster mode - scala

When I run this code in client mode, it runs successfully. In cluster mode, however, it fails to create the file and reports the error "No such file or directory".
Below is the code sample:
new File("UnexistingLocation").mkdirs()
---> Directories are created in client mode
---> In cluster mode the code does not raise an error, but I cannot see the directory. Creating a file inside the directory then fails with "No such file or directory".
Is there a workaround that lets me create files on the driver node's local filesystem?

Judging from your error analysis, I assume this runs in driver-scoped code. If you submit with --deploy-mode cluster, your driver is started on an arbitrary node of the cluster, which means that is where your directory will be created. It won't be on the node where you run spark-submit.
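To confirm which machine the driver-side code actually ran on, a minimal sketch like the one below can report the host name and absolute path next to the directory creation. It is plain Python, with os.makedirs standing in for java.io.File.mkdirs(); the function name create_dir_and_report and the temp-directory location are assumptions made for illustration, not part of the original code.

```python
import os
import socket
import tempfile


def create_dir_and_report(path):
    """Create `path` with any missing parents and report where it landed."""
    os.makedirs(path, exist_ok=True)
    return {
        "host": socket.gethostname(),      # machine this code ran on
        "path": os.path.abspath(path),     # absolute location on that machine
        "exists": os.path.isdir(path),
    }


if __name__ == "__main__":
    # Hypothetical target directory under the local temp folder.
    target = os.path.join(tempfile.gettempdir(), "UnexistingLocation")
    print(create_dir_and_report(target))
```

In cluster mode the reported host should be the node where the driver was launched, so the output belongs in the driver logs rather than in the terminal that ran spark-submit.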

Related

How to set Spark structured streaming check point dir to windows local directory?

My OS is Windows 11 and my Apache Spark version is spark-3.1.3-bin-hadoop3.2.
I am trying to use Spark Structured Streaming with PySpark. Below is my simple Structured Streaming code.
spark = SparkSession.builder.master("local[*]").appName(appName).getOrCreate()
spark.sparkContext.setCheckpointDir("/C:/tmp")
The same Spark code without the spark.sparkContext.setCheckpointDir line throws no errors on Ubuntu 22.04. However, the code above does not work on Windows 11. The exception is:
pyspark.sql.utils.IllegalArgumentException: Pathname /C:/tmp/67b1f386-1e71-4407-9713-fa749059191f from C:/tmp/67b1f386-1e71-4407-9713-fa749059191f is not a valid DFS filename.
I think the error means the checkpoint directory is being created on the Hadoop file system as on Linux, not on Windows 11. My operating system is Windows, so the checkpoint directory should be a local Windows 11 directory. How can I configure the Apache Spark checkpoint to use a local Windows 11 directory? I tested the URLs file:///C:/temp and hdfs://C:/temp, but the errors are still thrown.
Update
I commented out the line below.
#spark.sparkContext.setCheckpointDir("/C:/tmp")
Then the following warning and exception are thrown.
WARN streaming.StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\joseph\AppData\Local\Temp\temporary-be4f3586-d56a-4830-986a-78124ab5ee74. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
pyspark.sql.utils.IllegalArgumentException: Pathname /C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 from hdfs://localhost:9000/C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 is not a valid DFS filename.
I wonder why the hdfs URL contains the C:/ drive letter, and I want to know how to set spark.sql.streaming.forceDeleteTempCheckpointLocation to true.
step 1)
Since you are running Spark on a Windows machine, make sure the winutils.exe file has been added to the Hadoop bin folder. See step 6 of this guide: https://phoenixnap.com/kb/install-spark-on-windows-10.
step 2)
Then try setting the checkpoint directory like this:
spark.sparkContext.setCheckpointDir(r"D:\Learn\Checkpoint")
Make sure the user running Spark has permission to write to that checkpoint directory.
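If you would rather pass an explicit URI than a bare Windows path, the sketch below builds one with the standard library and keeps the config key from the warning message in one place. The helper name windows_checkpoint_uri and the STREAMING_CONF dict are illustrative assumptions. The question reports that a file:/// URL on its own did not help, so treat this as one piece of the setup and not a guaranteed fix.

```python
from pathlib import PureWindowsPath


def windows_checkpoint_uri(path):
    """Turn an absolute Windows path into a file:// URI with an explicit scheme."""
    return PureWindowsPath(path).as_uri()


# Option name taken from the warning message quoted in the question.
STREAMING_CONF = {"spark.sql.streaming.forceDeleteTempCheckpointLocation": "true"}

if __name__ == "__main__":
    # Raw string so the backslashes are not read as escape sequences.
    uri = windows_checkpoint_uri(r"D:\Learn\Checkpoint")
    print(uri)
    # With PySpark installed, the values could be applied roughly like this:
    #   builder = SparkSession.builder.master("local[*]")
    #   for key, value in STREAMING_CONF.items():
    #       builder = builder.config(key, value)
    #   spark = builder.getOrCreate()
    #   spark.sparkContext.setCheckpointDir(uri)
```

PureWindowsPath is used instead of Path so that the conversion behaves the same regardless of the operating system the script runs on.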

Using Postgresql JDBC source with Apache Spark on EMR

I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.
To do this, it seems you need to modify spark-defaults.conf so that spark.driver.extraClassPath points to the relevant PostgreSQL JAR that has already been downloaded on the master and slave nodes, or you can pass these as arguments to a spark-submit job.
Since I want to use an existing Jupyter notebook to wrangle the data and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR (postgresql-9.41207.jre6.jar) to it.
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create dataframe in Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I don't think you need to copy the Postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I created a DataFrame from an external Postgres source in the following way:
Download the Postgres driver JAR:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
.format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
'database' : <db>,
'dbtable' : <select * from table>}
df=spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded JAR to the driver class path when submitting the Spark job.
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The driver class name appears to be org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that instead.
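Putting the two answers together, a minimal sketch of the reader options might look like the following. The helper name postgres_jdbc_options and all connection values in the demo are hypothetical placeholders; the option keys (url, dbtable, user, password, driver) are the standard Spark JDBC data source options.

```python
def postgres_jdbc_options(host, port, db, table, user, password):
    """Build the option dict for spark.read.format('jdbc')."""
    return {
        "url": "jdbc:postgresql://{host}:{port}/{db}".format(host=host, port=port, db=db),
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class name suggested in the answer above.
        "driver": "org.postgresql.Driver",
    }


if __name__ == "__main__":
    # Placeholder connection details, for illustration only.
    options = postgres_jdbc_options("some_postgresql_db", 5432, "dbname", "someTable", "user", "password")
    print(options["url"])
    # In a notebook with an active SparkSession named `spark`:
    #   df = spark.read.format("jdbc").options(**options).load()
```

The driver JAR still has to be on the driver class path for the class to load; this sketch only covers the option values.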

Error While Running Kafka Server on Windows

I am running Kafka on Windows:
C:\Program Files\kafka_2.12-2.1.0>.\bin\windows\kafka-server-start.bat .\config\server.properties
and I get this error:
The system cannot find the path specified.
The syntax of the command is incorrect.
Error: Could not find or load main class Files\kafka_2.12-2.1.0.logs
You cannot have spaces in the file path, e.g. Program Files.
There's no specific reason Kafka needs to be in your Program Files folder. You could move it to C:\kafka, for example. I've been able to run it on Windows 10 (out of my user folder), so it does work.

How to run a mongo script from Heroku scheduler?

I have written a JavaScript script for my Mongo database. The script is called getMetrics.js, and I can execute it by running mongo getMetrics.js from my computer.
Now I want to execute that script automatically once per day. To do so, I created a Heroku app and added the Scheduler add-on to it (https://devcenter.heroku.com/articles/scheduler).
My main problem is that the task has to execute the command "mongo getMetrics.js", and it will fail because the mongo command is not installed in my Heroku app.
How can I run this script from Heroku?
Thanks a lot for your help.
I did the following in a similar case:
Download MongoDB for Linux: https://www.mongodb.com/download-center#community
The bin folder contains the mongo binary.
Make this binary available in your Heroku instance (e.g. if your Heroku app is deployed from your git repo, check in this binary alongside your script).
[Make sure the folder containing this binary is on the PATH; a safe location is inside /bin.]

How to migrate/shift/copy/move data in Neo4j

Does anyone know how to migrate data from one instance of Neo4j to another? To be more precise, I want to know how to move the data from a Neo4j instance on my local machine to another instance on a remote machine.
I'm working on my Windows machine with Eclipse and embedded Neo4j. I need to transfer this data to a remote Neo4j instance on a CentOS machine. Please help me with this.
Not sure how to do it for an "embedded neo4j db".
But for a standalone instance, and if you have a command-line tool such as "Putty" on your Windows machine, this should work. Instead of $NEO4J_HOME you can also use the normal path without the env variable.
$NEO4J_HOME/bin/neo4j stop
cd $NEO4J_HOME/data
tar -cvf graph.db.tar graph.db
gzip graph.db.tar
scp -i ~/some_path/key_for_remote_server.pem ./graph.db.tar.gz username@your_remote_domain.com:~/
ssh -i ~/some_path/key_for_remote_server.pem username@your_remote_domain.com
On your remote server (at least this works on Ubuntu):
You may need to prefix the commands with sudo.
mv ./graph.db.tar.gz /some_path/
cd /some_path/
gunzip graph.db.tar.gz
tar -xvf graph.db.tar
$NEO4J_HOME/bin/neo4j start
$NEO4J_HOME/bin/neo4j status
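On a Windows machine without tar and gzip, the pack and unpack steps above could be sketched with Python's standard library instead. The function names and the default folder name graph.db are assumptions for illustration. The database should still be stopped first, and the transfer to the remote server is not covered here.

```python
import os
import tarfile
import tempfile


def archive_graph_db(data_dir, archive_path, db_name="graph.db"):
    """Pack data_dir/db_name into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the archive rooted at the db folder, not the full path.
        tar.add(os.path.join(data_dir, db_name), arcname=db_name)
    return archive_path


def extract_graph_db(archive_path, target_dir):
    """Unpack the archive into target_dir (the remote data folder)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)


if __name__ == "__main__":
    # Self-contained demo using throwaway folders in place of $NEO4J_HOME/data.
    source = tempfile.mkdtemp()
    os.makedirs(os.path.join(source, "graph.db"))
    with open(os.path.join(source, "graph.db", "neostore"), "w") as handle:
        handle.write("placeholder")
    archive = archive_graph_db(source, os.path.join(tempfile.mkdtemp(), "graph.db.tar.gz"))
    destination = tempfile.mkdtemp()
    extract_graph_db(archive, destination)
    print(os.listdir(os.path.join(destination, "graph.db")))
```

Only extract archives you created yourself, since extractall trusts the paths stored in the archive.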
You can migrate the data with the APOC procedures. Run the query below in the Cypher shell of the instance the data is exported from:
CALL apoc.export.cypher.all('myfilename.cypher');
This writes a file of Cypher statements to the import folder.
Go to the database instance where the data needs to be imported and copy the file into its import folder. Then run the command below in the Cypher shell:
CALL apoc.cypher.runFile("myfilename.cypher",{}) yield row, result;
For more advanced options, see the links below:
https://neo4j.com/docs/labs/apoc/current/export/cypher/
http://neo4j-contrib.github.io/neo4j-apoc-procedures/3.4/cypher-execution/run-cypher-scripts/
I found the following workaround for copying the data from one server in the cluster to all the others after using the neo4j-import tool:
Stop all nodes.
On the new node/server, where the data needs to be copied, create the database folder for that graph (in my case loadTest):
/neo4j-enterprise-3.1.0/data/databases/loadTest.db
Then, from the source node/server that holds the data, copy the neostore.id file to the destination server's database folder (loadTest.db from the previous step).
Start all nodes. In the background, Neo4j will copy data from the other cluster servers to the new node.
For embedded mode, you just need to locate the neo4j-db graph folder, then zip it and send it to the remote system.
In your code, where you create the GraphDatabaseService, you will have given the target location.
If it is a relative path, the database might be in your project folder.
To run the db instance in the browser, you will need to use the Neo4j Community server and point it to the folder containing the index folder. So if your neo4j-db is located at $project/tmp/neo4j-db, give the file path up to this folder (the index folder will be inside it).
Edit
The folder that contains the schema and index folders needs to be zipped. You can upload and unzip the folder at a location of your choice on your standalone server using Putty. Then just change org.neo4j.server.database.location in the conf/neo4j-server.properties file.