Write Spark DF to a flat file on local PC from Eclipse

I need to write a Spark DF to a flat file on my local PC.
I'm executing my program on Scala IDE on Eclipse (again on my local PC)
This is the command I use:
df.coalesce(1).rdd.saveAsTextFile(s"file:///C:/myfile.csv")
It creates the C:\myfile.csv_temporary\0_temporary\attempt_20180208105406_0016_m_000000_819 folder and even a part-00000 file in it, but the file is empty
This is the error message I'm getting on the console:
Exception in task 0.0 in stage 16.0 (TID 819)
java.io.IOException: (null) entry in command string: null chmod 0644 C:\myfile.csv_temporary\0_temporary\attempt_20180208105406_0016_m_000000_819\part-00000*

Try setting HADOOP_HOME to the directory that contains bin\winutils.exe
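The "(null) entry in command string: null chmod" error is the classic symptom of Spark on Windows not finding winutils.exe. A minimal sketch in Scala, assuming winutils.exe has been downloaded into C:\hadoop\bin (a hypothetical location) and that writing through the DataFrame writer instead of converting to an RDD is acceptable:

import org.apache.spark.sql.SparkSession

object WriteCsvLocally {
  def main(args: Array[String]): Unit = {
    // Point Spark's Hadoop layer at the folder that contains bin\winutils.exe.
    // Setting the HADOOP_HOME environment variable works as well.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WriteCsvLocally")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // coalesce(1) still produces a directory containing a single part file.
    df.coalesce(1)
      .write
      .option("header", "true")
      .csv("file:///C:/output/myfile")

    spark.stop()
  }
}

Note that Spark always writes a directory of part files; the single CSV ends up inside the output folder rather than as a lone C:\myfile.csv.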

Related

How Can I Solve "Failed To Source Bitbake" With Xilinx Petalinux SDK

I am running Ubuntu 16.04 with Xilinx Petalinux 2018.03 SDK. After a number of successful compilations I am now facing this error:
$ petalinux-build
[INFO] building project
[INFO] sourcing bitbake
ERROR: Failed to source bitbake
ERROR: Failed to build project
How can I solve this issue?
Another reason to get the errors "ERROR: Failed to source bitbake" as well as "ERROR: Failed to build project" is a possible upgrade of Python on the build machine. The Petalinux SDK requires python v2 (>= 2.7.3) for the 2018.3 edition.
You can check under [YOUR_PROJECT]/build/build.log and you might see a log similar to this one below:
[INFO] building project
[INFO] sourcing bitbake
SDK environment now set up; additionally you may now run devtool to perform development tasks.
Run devtool --help for further details.
OpenEmbedded requires 'python' to be python v2 (>= 2.7.3), not python v3.
Please set up python v2 as your default 'python' interpreter.
ERROR: Failed to source bitbake
ERROR: Failed to build project
To remedy this issue, remove the /usr/bin/python symbolic link and create a new one pointing to Python 2.7:
sudo rm /usr/bin/python
sudo ln -s /usr/bin/python2.7 /usr/bin/python
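A quick check before re-running the build, to confirm the default interpreter now resolves to Python 2:

which python        # should print /usr/bin/python
python --version    # should report Python 2.7.x
petalinux-build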
First, investigate the error a little further by sourcing the PetaLinux settings script:
source /opt/pkg/petalinux/2018.3/settings.sh
It will return something similar to this below:
PetaLinux environment set to '/opt/pkg/petalinux/2018.3'
INFO: Checking free disk space
INFO: Checking installed tools
INFO: Checking installed development libraries
INFO: Checking network and other services
Source the environment setup:
source /opt/pkg/petalinux/2018.3/components/yocto/source/aarch64/environment-setup-aarch64-xilinx-linux
followed by:
devtool --help
In my case I can see more about the actual error:
NOTE: Starting bitbake server...
ERROR: Unable to start bitbake server
ERROR: Last 10 lines of server log for this session (/opt/pkg/petalinux/2018.3/components/yocto/source/aarch64/bitbake-cookerdaemon.log):
Traceback (most recent call last):
File "/opt/pkg/petalinux/2018.3/components/yocto/source/aarch64/layers/core/bitbake/lib/bb/daemonize.py", line 77, in createDaemon
function()
File "/opt/pkg/petalinux/2018.3/components/yocto/source/aarch64/layers/core/bitbake/lib/bb/server/process.py", line 433, in _startServer
self.cooker = bb.cooker.BBCooker(self.configuration, self.featureset)
File "/opt/pkg/petalinux/2018.3/components/yocto/source/aarch64/layers/core/bitbake/lib/bb/cooker.py", line 178, in __init__
self.configwatcher = pyinotify.WatchManager()
File "/opt/pkg/petalinux/2018.3/components/yocto/source/aarch64/layers/core/bitbake/lib/pyinotify.py", line 1764, in __init__
raise OSError(err % self._inotify_wrapper.str_errno())
OSError: Cannot initialize new instance of inotify, Errno=Too many open files (EMFILE)
This points to /proc/sys/fs/inotify/max_user_instances, which needs to be increased. In my case I went from 128 to 256 by doing this:
sudo su
echo 256 > /proc/sys/fs/inotify/max_user_instances
You need to become root (for example with sudo su, as above) to change max_user_instances.
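If you want the higher limit to survive a reboot, the usual approach is a sysctl setting (a sketch, assuming you are fine with appending to /etc/sysctl.conf):

# Apply the new limit immediately
sudo sysctl -w fs.inotify.max_user_instances=256

# Persist it across reboots
echo "fs.inotify.max_user_instances=256" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p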

How to solve error on docker:layers_calculator to compute the Merkle tree on private tangle?

I want to set up a private tangle on my own virtual machine with Ubuntu 18.04, 4 GB of RAM and a 20 GB disk.
I have followed these instructions: https://docs.iota.org/docs/compass/0.1/how-to-guides/set-up-a-private-tangle. Every command works fine until I reach this one: bazel run //docker:layers_calculator.
It shows an error as follows:
Starting local Bazel server and connecting to it...
ERROR: /home/istabraq/compass/third-party/maven_deps.bzl:3:5: Traceback (most recent call last):
File "/home/istabraq/compass/WORKSPACE", line 42
maven_jars()
File "/home/istabraq/compass/third-party/maven_deps.bzl", line 3, in maven_jars
native.maven_jar(<4 more arguments>)
type 'struct' has no method maven_jar()
ERROR: error loading package '': Encountered error while reading extension file 'protobuf_deps.bzl': no such package '@com_google_protobuf_deps//': error loading package 'external': Could not load //external package
ERROR: error loading package '': Encountered error while reading extension file 'protobuf_deps.bzl': no such package '@com_google_protobuf_deps//': error loading package 'external': Could not load //external package
INFO: Elapsed time: 4.743s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
FAILED: Build did NOT complete successfully (0 packages loaded)
How can I solve this problem? What have I missed?
Read carefully the message given after running the Bazel installer:
Make sure you have "/home/yourusername/bin" in your path. You can also activate bash completion by adding the following line to your ~/.bashrc:
source /home/yourusername/.bazel/bin/bazel-complete.bash
You can check with: "bazel info" or "bazel version"
Unfortunately, there are further errors:
https://github.com/iotaledger/compass/issues/142
I have solved this issue by using these commands:
Step 3: Set up your environment
If you ran the Bazel installer with the --user flag as above, the Bazel executable is installed in your $HOME/bin directory. It’s a good idea to add this directory to your default paths, as follows:
export PATH="$PATH:$HOME/bin"
You can also add this command to your ~/.bashrc or ~/.zshrc file to make it permanent.
reference:
https://docs.bazel.build/versions/master/install-ubuntu.html
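For example, to make the PATH change permanent and verify that the shell now finds Bazel (assuming a bash shell and the default --user install location):

echo 'export PATH="$PATH:$HOME/bin"' >> ~/.bashrc
source ~/.bashrc
bazel version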

Creating virtualenv inside VeraCrypt error

I'm setting up a project inside VeraCrypt and it's throwing this error when I try to set up the environment.
admin@kali:/media/veracrypt1$ virtualenv --python=python3 venv
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /media/veracrypt1/venv/bin/python3
Also creating executable in /media/veracrypt1/venv/bin/python
Traceback (most recent call last):
File "/usr/local/bin/virtualenv", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/virtualenv.py", line 870, in main
symlink=options.symlink,
File "/usr/local/lib/python3.7/dist-packages/virtualenv.py", line 1162, in create_environment
install_python(home_dir, lib_dir, inc_dir, bin_dir, site_packages=site_packages, clear=clear, symlink=symlink)
File "/usr/local/lib/python3.7/dist-packages/virtualenv.py", line 1672, in install_python
os.symlink(py_executable_base, full_pth)
PermissionError: [Errno 1] Operation not permitted: 'python3' -> '/media/veracrypt1/venv/bin/python'
I've tried to look for the source of the issue, and it seems to be related to the fact that this is a virtual drive mounted with limited rights:
admin@kali:/media/veracrypt1$ ln -s testfile
ln: failed to create symbolic link './testfile': Operation not permitted
Looks like you are running this in an environment with limited permissions. Some report this behavior when running on Linux, but in a folder that is mounted to a FAT32 partition - see Chris Lope's blog-post: permissionerror: [errno 1] operation not permitted
I have experienced this behavior while running in an Ubuntu VM, in a folder that was mounted to the host OS (Windows NTFS) as type 'vboxsf'. Solved it by moving to work in a partition that is native Unix.
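A quick way to confirm this kind of diagnosis is to check which filesystem the mount point actually uses, since FAT32/exFAT (and some shared-folder mounts) do not support symlinks. A sketch, assuming the volume is mounted at /media/veracrypt1:

# Show the filesystem type of the mounted volume
df -T /media/veracrypt1
# Or inspect the mount entry directly
mount | grep veracrypt1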

Spark Pipe example

I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following code works from the command line
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the above Spark code I get the following errors
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
In order to solve this for Hadoop streaming I would just use the --files attribute, so I tried the same thing for Spark. I start Spark with the following command
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't finalize this, but it appears that the file paths are different when running Spark in local and cluster mode. When running Spark without --master, the path to the pipe command is resolved relative to the local machine. When running Spark with --master, the path to the pipe command is relative to ./ (where each worker runs from).
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when calling .pipe() on an RDD, the command string is evaluated on the driver and then passed to the worker. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be ./ because SparkContext.addFile() should put that file at ./ relative to where each worker is run from. But I'm so sour on .pipe now that I've taken it out of my code entirely in favor of .mapPartitions in combination with a PipeUtils object that I wrote here. This is actually more efficient because I only have to incur the script startup cost once per partition instead of once per example.
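For reference, a minimal sketch of the addFile/pipe combination described above; the script name and HDFS paths are placeholders, and the "./" prefix follows the assumption that the shipped file ends up in each worker's working directory (which depends on the deploy mode):

import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipeExample"))

    // Ship the script to every executor.
    sc.addFile("preprocess.py")

    // Each partition's records are fed to the script on stdin,
    // and its stdout becomes the resulting RDD.
    val processed = sc.textFile("hdfs:///input/data")
      .pipe("./preprocess.py")

    processed.saveAsTextFile("hdfs:///output/preprocessed")
    sc.stop()
  }
}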

Hadoop on Windows with Eclipse

I read a post here on Stack Overflow recommending this link as very good for Hadoop deployment on Windows - http://v-lad.org/Tutorials/Hadoop/12%20-%20format%20the%20namendoe.html
The problem is that when I format the namenode as described on that page, after running
bin/hadoop namenode -format
I get the following error:
Maybe it's a problem with my environment variables, but I'm not sure.
bin/hadoop: line 330: C:\Program: command not found
bin/hadoop: line 395: C:\Program Files\java\jdk1.7.0\bin/bin/java: No such file/directory
bin/hadoop: line 395: exec: C:\Program Files\java\jdk1.7.0\bin/bin/java: cannot execute: No such file or directory
Just set your JAVA_HOME properly; that resolved my issue:
anwar@dell-pc ~/hadoop-0.20.203.0
$ export JAVA_HOME="C:\Program Files\Java\jdk1.6.0_02"
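If a quoted path containing a space still breaks the script (the "C:\Program: command not found" message above is the space in "Program Files" being split), a common workaround is the 8.3 short name for that directory; a sketch, assuming the same JDK location as above:

# "Progra~1" is the usual 8.3 short name for "Program Files"
export JAVA_HOME="C:\Progra~1\Java\jdk1.6.0_02"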