How to set checkpoint dir PySpark Data Science Experience - pyspark

Could you help me with instructions on how to set the checkpoint dir for a PySpark session on IBM's Data Science Experience?
The need arose because I have to run connectedComponents() from GraphFrames, and it raises the following error:
Py4JJavaError: An error occurred while calling o221.run.
: java.io.IOException: Checkpoint directory is not set. Please set it first using sc.setCheckpointDir().

The main issue is finding the notebook's working directory so that a path under it can be passed to sc.setCheckpointDir(). This can be done easily with
!pwd
Then, create a directory for checkpoints under that path
!mkdir <pwd_output>/checkpoints
Finally, set the checkpoint directory
spark.sparkContext.setCheckpointDir('<pwd_output>/checkpoints')
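The same three steps can also be done from Python without shell escapes. This is a minimal sketch, assuming the notebook already provides a SparkSession named spark; prepare_checkpoint_dir and set_checkpoint_dir are hypothetical helper names, not part of PySpark. Note that on a multi-node cluster the checkpoint directory generally has to be on storage that every executor can reach, so a local working directory may not be enough there.

```python
import os


def prepare_checkpoint_dir(base_dir=None, name="checkpoints"):
    """Create <base_dir>/<name> if needed and return its absolute path."""
    # os.getcwd() plays the role of `!pwd` in the notebook.
    base_dir = base_dir or os.getcwd()
    path = os.path.join(os.path.abspath(base_dir), name)
    # exist_ok=True makes the call safe to repeat when the cell is re-run.
    os.makedirs(path, exist_ok=True)
    return path


def set_checkpoint_dir(spark, path):
    """Point the session's SparkContext at the checkpoint directory."""
    spark.sparkContext.setCheckpointDir(path)


# In the notebook, where `spark` is the existing SparkSession:
# set_checkpoint_dir(spark, prepare_checkpoint_dir())
```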

Related

Failed to read postmaster.pid file while running embedded-postgres

My Spring application uses yandex-qatools/postgresql-embedded for executing unit tests.
While executing them, I constantly get the error below:
ERROR 75847 --- [ Test worker] r.y.q.embed.postgresql.PostgresProcess : Failed to read PID file (File '/var/folders/sh/xr6l_7bs1_z9v1jfsyctc45w0000gp/T/postgresql-embed-b05c213f-7416-4200-a586-a3afb3263478/db-content-4f285249-22ea-4625-b771-156adbf5851f/postmaster.pid' does not exist)
java.io.FileNotFoundException: File '/var/folders/sh/xr6l_7bs1_z9v1jfsyctc45w0000gp/T/postgresql-embed-b05c213f-7416-4200-a586-a3afb3263478/db-content-4f285249-22ea-4625-b771-156adbf5851f/postmaster.pid' does not exist
A warning also appears before the exception, but for now, let's ignore it.
WARN 75847 --- [ Test worker] r.y.q.embed.postgresql.PostgresProcess : Possibly failed to run initdb:
no data was returned by command ""/private/var/folders/sh/xr6l_7bs1_z9v1jfsyctc45w0000gp/T/postgresql-embed-b05c213f-7416-4200-a586-a3afb3263478/pgsql-10.3-1/pgsql/bin/postgres" -V"
The program "postgres" is needed by initdb but was not found in the
same directory as "/private/var/folders/sh/xr6l_7bs1_z9v1jfsyctc45w0000gp/T/postgresql-embed-b05c213f-7416-4200-a586-a3afb3263478/pgsql-10.3-1/pgsql/bin/initdb".
Check your installation.
I verified that no other instance of Postgres is running on my local machine using
ps -ef|grep postgres
I followed this thread as well, but it didn't help.
I have run out of options to fix this. Can anyone please suggest how to resolve it?
OSX version: 12.1
Thanks in advance
In my case, besides your error, I could also see the following error:
r.y.q.embed.postgresql.PostgresProcess : Possibly failed to run initdb:
initdb: invalid locale settings; check LANG and LC_* environment variables
This message led me to the solution. I just added the below environment properties to my .zshrc file:
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"
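To confirm that these variables are actually visible to a process (a new shell has to be started after editing .zshrc), a small check can help. This is only a sketch: missing_locale_vars is a hypothetical helper, and it only reports whether the three variables from the answer are set, not whether initdb accepts their values.

```python
import os

# Variables named in the answer above.
LOCALE_VARS = ("LANG", "LC_ALL", "LC_CTYPE")


def missing_locale_vars(env=None):
    """Return the locale variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in LOCALE_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_locale_vars()
    if missing:
        print("Unset locale variables:", ", ".join(missing))
    else:
        print("LANG, LC_ALL and LC_CTYPE are all set")
```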

Unable to start zookeeper getting exception saying that unable to load

I am unable to start ZooKeeper on my local machine.
I get this exception:
Error: Could not find or load main class software\kafka_2.12-2.3.0\libs\activation-1.1.1.jar;
I got the same error.
Finally I found the reason.
I had put Kafka in a folder whose name contains a space, like this:
C:\Open source\kafka\kafka_2.12-2.3.0
Then I copied the whole kafka_2.12-2.3.0 directory to a new folder without a space in its name, such as C:\kafka_2.12-2.3.0
and Kafka can run.
Note that the command line is: zookeeper-server-start.bat ../../config/zookeeper.properties
NOT: zookeeper-server-start.bat config/zookeeper.properties
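One plausible reading of the error is that an unquoted path gets split at the space, so Java treats the remainder as a class name. A quick way to check an install path before launching is sketched below; path_has_whitespace is a hypothetical helper, and the two paths are the ones from this answer.

```python
import re


def path_has_whitespace(path):
    """Return True if any character in the path is whitespace."""
    return re.search(r"\s", path) is not None


# The original location contains a space, the new one does not.
print(path_has_whitespace(r"C:\Open source\kafka\kafka_2.12-2.3.0"))
print(path_has_whitespace(r"C:\kafka_2.12-2.3.0"))
```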

How to log useful error if scala play config files missing?

Is it possible to give a better error message if the config filename is mistyped when using Scala play.api.Configuration?
E.g. if I run my application with sbt run -J-Dconfig.file=conf/my-config.conf but the file is actually called my_config.conf, no file-not-found error is raised. Instead, the first error appears when applicationConfig.has(configPath) is called, at which point it is not clear how to determine programmatically whether a config value is missing from the file or the config file itself is missing.
Here is what I do:
Wrap the configuration in a Config class.
Initialize that class on startup.
Log all property values.
This logs exceptions at startup. Here is an example: AdaptersContext.scala
As a remark:
If you have your config-file in the conf directory (on classpath), use:
config.resource=demo.conf
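The wrap, initialize and log pattern described above is language-agnostic. The sketch below shows it in Python purely for illustration; it does not use Play's API, and AppConfig, REQUIRED_KEYS and the key names are hypothetical. The point is that a missing file and a missing key fail at startup with different exceptions, so the two cases can be told apart.

```python
import logging
import os

log = logging.getLogger("startup")


class AppConfig:
    """Hypothetical wrapper that validates settings once, at startup."""

    REQUIRED_KEYS = ("db.url", "db.user")  # illustrative keys only

    def __init__(self, path, values):
        # A mistyped file name fails here, with the path in the message.
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Config file not found: {path}")
        # A missing key fails separately, so the two cases are distinguishable.
        missing = [key for key in self.REQUIRED_KEYS if key not in values]
        if missing:
            raise KeyError(f"Missing config keys in {path}: {missing}")
        self.values = dict(values)
        # Log every property so the effective configuration is visible.
        for key, value in sorted(self.values.items()):
            log.info("config %s = %s", key, value)
```

Here values stands for the already parsed settings; how they are parsed is left out on purpose.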

Cloudera Kafka can not run

I installed Zookeeper and tried to install Kafka 0.8.0 on Cloudera 5.4.4. It deployed successfully, but when I ran it, it failed. The error log is as follows:
[Errno 2] No such file or directory: '/var/log/kafka/server.log'
I really have no idea what is wrong.
Thanks in advance!
I ran into this problem before. The default maximum memory of the brokers was set to 0 MB by Cloudera (it seems to be a Cloudera bug), which prevented Kafka from running, and the parameter fetch.message.max.bytes was also set too low by default. Check the stderr installation log and search for the keyword ERROR, otherwise the log is too messy to read. You will find the root error message there. The message above, [Errno 2] No such file or directory: '/var/log/kafka/server.log', is not the root exception.
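Searching the log for ERROR can be scripted. The following is a minimal sketch; error_lines is a hypothetical helper, and the path to the stderr log is not given in this thread, so it is left to the caller.

```python
def error_lines(log_text, keyword="ERROR"):
    """Return (line_number, line) pairs for lines containing the keyword."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if keyword in line
    ]


# Usage, with the path to the stderr log supplied by the caller:
# with open(stderr_log_path, errors="replace") as handle:
#     for number, line in error_lines(handle.read()):
#         print(number, line)
```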

FATAL org.apache.hadoop.conf.Configuration - error parsing conf file: org.xml.sax.SAXParseException

I'm trying to run pig locally, installed using homebrew, to test a script. However, I get the following error when I attempt to run a simple dump from the interactive prompt pig -x local:
2012-07-16 23:20:40,447 [Thread-7] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
[Fatal Error] :63:85: Character reference "&#2" is an invalid XML character.
2012-07-16 23:20:40,688 [Thread-7] FATAL org.apache.hadoop.conf.Configuration - error parsing conf file: org.xml.sax.SAXParseException: Character reference "&#2" is an invalid XML character.
The same load/dump works fine on Elastic MapReduce.
I can't find any XML config files, and I've tried with both version 0.9.2 and 0.10.0
What am I missing?
Edit: Just checked a direct download (vs. homebrew) and it doesn't seem to work either
You should check that your Hadoop configuration files have correct configuration data.
Have a look in your hadoop/conf directory.
Have a look inside:
hdfs-site.xml
mapred-site.xml
core-site.xml
Finally worked out what the problem was. I ended up having to use dtruss -p on the pig/java process. This revealed a temporary directory and dynamically generated xml files. Once the temporary directory was discovered, it all fell quickly into place.
It was picking up the proxy excludes from my network connections, which had, as far as I can tell, &#2 (http://www.fileformat.info/info/unicode/char/02/index.htm) embedded in it. How this invalid value came to be in my network preferences in the first place, I haven't the faintest clue.
The value was then being pulled into dynamically generated files, for example /tmp/hadoop-vertis/mapred/staging/vertis-1005847898/.staging/job_local_0001/job.xml.
The offending lines:
<property><name>ftp.nonProxyHosts</name><value>localhost|*.localhost|127.0.0.1|h|*.h</value></property>
<property><name>socksNonProxyHosts</name><value>localhost|*.localhost|127.0.0.1|h|*.h</value></property>
<property><name>http.nonProxyHosts</name><value>localhost|*.localhost|127.0.0.1|h|*.h</value></property>
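Once a generated file such as job.xml has been found, a scan for characters that XML 1.0 does not allow can point at the offending line. This is a rough sketch, not what Hadoop or Pig do internally: invalid_xml_chars and is_valid_xml_char are hypothetical helpers, and the check covers numeric character references such as &#2; as well as raw control characters.

```python
import re

# Numeric character references, decimal (&#2;) or hexadecimal (&#x2;).
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def is_valid_xml_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def invalid_xml_chars(text):
    """Return (line_number, description) for each invalid character found."""
    problems = []
    # Split on "\n" only, so control characters are not consumed as line breaks.
    for number, line in enumerate(text.split("\n"), start=1):
        for match in CHAR_REF.finditer(line):
            ref = match.group(1)
            code = int(ref[1:], 16) if ref.startswith("x") else int(ref)
            if not is_valid_xml_char(code):
                problems.append((number, match.group(0)))
        for char in line:
            if not is_valid_xml_char(ord(char)):
                problems.append((number, f"raw U+{ord(char):04X}"))
    return problems
```

Reading the file with open(path, encoding="utf-8", errors="replace") and passing its contents to invalid_xml_chars returns the line numbers to inspect.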