I am trying to use Apache Zeppelin (0.7.2, net install running locally on a Mac) to explore data loaded from an s3 bucket. The data seems to load just fine, as the command:
val p = spark.read.textFile("s3a://sparkcookbook/person")
gives the result:
p: org.apache.spark.sql.Dataset[String] = [value: string]
However, when I try to call methods on the object p, I get an error. For example:
p.take(1)
results in:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
My conf/zeppelin-env.sh is the same as the default, except that I have the Amazon access key and secret key environment variables defined there. In the Spark interpreter in the Zeppelin notebook, I have added the following artifacts:
org.apache.hadoop:hadoop-aws:2.7.3
com.amazonaws:aws-java-sdk:1.7.9
com.fasterxml.jackson.core:jackson-core:2.9.0
com.fasterxml.jackson.core:jackson-databind:2.9.0
com.fasterxml.jackson.core:jackson-annotations:2.9.0
(I think only the first two are necessary). The two commands above work fine in the Spark shell, just not in the Zeppelin notebook (see "How to use s3 with Apache spark 2.2 in the Spark shell" for how that was set up).
So it seems that there is a problem with one of the Jackson libraries. Maybe I'm using the wrong artifacts above for the Zeppelin interpreter?
UPDATE: Following the advice in the proposed answer below, I removed the Jackson jars that came with Zeppelin and replaced them with the following:
jackson-annotations-2.6.0.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
I also replaced the artifacts, so my artifacts are now:
org.apache.hadoop:hadoop-aws:2.7.3
com.amazonaws:aws-java-sdk:1.7.9
com.fasterxml.jackson.core:jackson-core:2.6.7
com.fasterxml.jackson.core:jackson-databind:2.6.7
com.fasterxml.jackson.core:jackson-annotations:2.6.0
The error I get, however, from running the above commands is the same.
UPDATE 2: As per the suggestion, I removed the Jackson libraries from the list of artifacts, since they are now already in the jars/ folder; the only added artifacts are now the AWS artifacts above. I then cleaned the classpath by entering the following in the notebook (as per the instructions):
%spark.dep
z.reset()
I get a different error now:
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.take(1)
p: org.apache.spark.sql.Dataset[String] = [value: string]
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:20)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:37)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
UPDATE 3: As per the suggestion in a comment on the proposed answer below, I cleaned the classpath by deleting all the files in the local repo:
rm -rf local-repo/*
I then restarted the Zeppelin server. To check the classpath, I executed the following in the notebook:
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
This gave the following output (I include only the Jackson libraries here; otherwise the output is too long to paste):
...
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-annotations-2.6.0.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-core-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-databind-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-paranamer-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-scala_2.11-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/json4s-jackson_2.11-3.2.11.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/parquet-jackson-1.8.1.jar
...
It seems that multiple versions are fetched from the repo. Should I exclude the older versions? If so, how do I do that?
Use these jar versions:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.6.0.jar
as in this script: https://github.com/2dmitrypavlov/sparkDocker/blob/master/zeppelin.sh
Do not use packages; instead, download the jars and put them in a path, say /root/jars/, then edit your zeppelin-env.sh.
Then run this command from the zeppelin/conf directory:
echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh
After that, restart Zeppelin.
The code at the link above is pasted below (just in case the link becomes stale):
#!/bin/bash
# Download jars
cd /root/jars
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar
cd /usr/share/
wget http://archive.apache.org/dist/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz
tar -zxvf zeppelin-0.7.1-bin-all.tgz
cd zeppelin-0.7.1-bin-all/conf
cp zeppelin-env.sh.template zeppelin-env.sh
echo 'export MASTER=spark://'$MASTERZ':7077'>>zeppelin-env.sh
echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh
echo 'export ZEPPELIN_NOTEBOOK_STORAGE="org.apache.zeppelin.notebook.repo.VFSNotebookRepo, org.apache.zeppelin.notebook.repo.zeppelinhub.ZeppelinHubRepo"'>>zeppelin-env.sh
echo 'export ZEPPELINHUB_API_ADDRESS="https://www.zeppelinhub.com"'>>zeppelin-env.sh
echo 'export ZEPPELIN_PORT=9999'>>zeppelin-env.sh
echo 'export SPARK_HOME=/usr/share/spark'>>zeppelin-env.sh
cd ../bin/
./zeppelin.sh
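If you fetch the two AWS jars by hand as suggested above, a minimal sketch might look like this; the download URLs are an assumption based on the standard Maven Central directory layout, so adjust them if you use a mirror:
# Sketch: download the suggested jar versions into /root/jars/
# (URLs assumed from the usual Maven Central layout)
mkdir -p /root/jars && cd /root/jars
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar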
You are probably using too recent a Jackson version. Even Spark 2.3 is still on 2.6.7. Downgrade, and make sure that all your Jackson JARs are consistent.
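To see whether the Jackson jars on the classpath are consistent, a rough check along these lines can help; the directory layout is the one from the question, so adjust SPARK_HOME and ZEPPELIN_HOME to your installation:
# List every Jackson jar that Spark and Zeppelin put on the classpath
ls "$SPARK_HOME"/jars | grep -i jackson
ls "$ZEPPELIN_HOME"/lib | grep -i jackson
ls "$ZEPPELIN_HOME"/local-repo/*/ | grep -i jackson
# The jackson-core, jackson-databind, jackson-annotations and jackson-module-scala
# versions should all belong to the same 2.6.x line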
My goal is to use bcftools to check that the reference alleles in my dataset (vcf file) match with a reference genome (fasta file) using the fixref plugin.
Working on command line, I first set the following environment:
export BCFTOOLS_PLUGINS=/path/to/bcftools/plugins
The following code is recommended for test datasets with mismatches:
bcftools +fixref test.bcf -Ob -o output.bcf -- -f ref.fa -m top
When I run this code using my own files (please note that my data is .vcf, not .bcf) I get the following error:
[main] Unrecognized command
If I simply enter:
bcftools
I get a list of the only 5 commands (view, index, cat, ld, ldpair) that I can use. So although I've set the environment, does it somehow need to be activated? Do I need to run my command through a bash script?
bcftools
was pointing to a deprecated version of bcftools (0.1.19) in ../bin/, while
BCFTOOLS_PLUGINS=/path/to/bcftools/plugins
was pointing to the plugins for bcftools version 1.10.2 outside of /bin/.
Replacing the deprecated ../bin/bcftools (0.1.19) with the 1.10.2 binary was the fix.
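A couple of quick sanity checks along those lines can confirm which binary is being picked up (the version numbers are just the ones from this question):
# Which bcftools is first on PATH, and which version is it?
which bcftools
bcftools --version      # should report 1.10.2 here, not the deprecated 0.1.19
# Confirm the plugin directory the newer bcftools will search
echo "$BCFTOOLS_PLUGINS"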
I have installed Qt 5.12.5 and its sources through the Maintenance Tool. I have the following directories:
C:\Qt\5.12.5\Src
C:\Qt\Tools\mingw730_32\
C:\Qt\Tools\mingw730_64\
On the other hand, I have read that the downloadable Postgres version is compiled with MSVC and that I must compile my own version. I have done that following this link, and now I have a PostgreSQL build in c:\pgsql.
Finally, I have added c:\pgsql to the user Path.
As the next step, I opened PowerShell in admin mode and went to C:\Qt\5.12.5\Src\.
Next, set the env path for this PowerShell session:
$env:Path += ";C:\Qt\Tools\mingw730_64\bin\;C:\Qt\5.12.5\Src;C:\pgsql\include\;C:\pgsql\lib\;C:\pgsql\bin\" (setting the pgsql path again....)
After that, I executed configure.bat like this:
configure -v -static -release -static-runtime -platform win32-g++ -prefix C:\Qt\5.12.5\Estatico\ -opensource -confirm-license -qt-zlib -qt-pcre -qt-libpng -qt-libjpeg -qt-freetype -opengl desktop -no-openssl -opensource -confirm-license -skip webengine -make libs -nomake tools -nomake examples -nomake tests -sql-psql
But I get this error:
ERROR: Feature 'sql-psql' was enabled, but the pre-condition 'libs.psql' failed.
Searching in config.log, I can see these lines:
loaded result for library config.qtbase_sqldrivers.libraries.psql
Trying source 0 (type pkgConfig) of library psql ...
pkg-config use disabled globally.
=> source produced no result.
Trying source 1 (type psqlConfig) of library psql ...
pg_config not found.
=> source produced no result.
Trying source 2 (type psqlEnv) of library psql ...
None of [liblibpq.dll.a liblibpq.a libpq.dll.a libpq.a libpq.lib] found in [] and global paths.
=> source produced no result.
Trying source 3 (type psqlEnv) of library psql ...
=> source failed condition '!config.win32'.
test config.qtbase_sqldrivers.libraries.psql FAILED
What can I do, or what is the proper way to do this?
Thank you in advance.
UPDATE
There is a similar question here, but it hasn't been solved, and that question asks about Visual Studio.
I want to compile it under MinGW.
The solution suggested by @Soheil Armin didn't work for me at first. It turns out it works fine, but I needed to delete the entire source tree and reinstall it as he suggested; otherwise a new configure won't work.
Also, the ^ line-continuation characters can be omitted:
configure <your parameters>
PSQL_LIBS="C:\pgsql\lib\libpq.a"
-I "C:\pgsql\include"
-L "C:\pgsql\lib"
You need to explicitly define the library paths of Postgres:
configure <your parameters> ^
PSQL_LIBS="C:\pgsql\lib\libpq.a" ^
-I "C:\pgsql\include" ^
-L "C:\pgsql\lib"
In the book "Embedded Linux Systems with the Yocto Project", Chapter 4 contains a sample called "HelloWorld - BitBake style". I encountered a bunch of problems trying to get the old example working against the "Sumo" release 2.5.
If you're like me, the first error you encountered following the book's instructions was that you copied across bitbake.conf and got:
ERROR: ParseError at /tmp/bbhello/conf/bitbake.conf:749: Could not include required file conf/abi_version.conf
And after copying over abi_version.conf as well, you kept finding more and more cross-connected files that needed to be moved, and then some relative-path errors after that... Is there a better way?
Here's a series of steps which can allow you to bitbake nano based on the book's instructions.
Unless otherwise specified, these samples and instructions are all based on the online copy of the book's code-samples. While convenient for copy-pasting, the online resource is not totally consistent with the printed copy, and contains at least one extra bug.
Initial workspace setup
This guide assumes that you're working with Yocto release 2.5 ("sumo"), installed into /tmp/poky, and that the build environment will go into /tmp/bbhello. If you don't have the Poky tools+libraries already, the easiest way is to clone them with:
$ git clone -b sumo git://git.yoctoproject.org/poky.git /tmp/poky
Then you can initialize the workspace with:
$ source /tmp/poky/oe-init-build-env /tmp/bbhello/
If you start a new terminal window, you'll need to repeat the previous command, which will get your shell environment set up again, but it should not replace any of the files created inside the workspace the first time.
Wiring up the defaults
The oe-init-build-env script should have just created these files for you:
bbhello/conf/local.conf
bbhello/conf/templateconf.cfg
bbhello/conf/bblayers.conf
Keep these; they supersede some of the book's instructions, meaning that you should not create or keep the files:
bbhello/classes/base.bbclass
bbhello/conf/bitbake.conf
Similarly, do not overwrite bbhello/conf/bblayers.conf with the book's sample. Instead, edit it to add a single line pointing to your own meta-hello folder, ex:
BBLAYERS ?= " \
${TOPDIR}/meta-hello \
/tmp/poky/meta \
/tmp/poky/meta-poky \
/tmp/poky/meta-yocto-bsp \
"
Creating the layer and recipe
Go ahead and create the following files from the book-samples:
meta-hello/conf/layer.conf
meta-hello/recipes-editor/nano/nano.bb
We'll edit these files gradually as we hit errors.
Can't find recipe error
The error:
ERROR: BBFILE_PATTERN_hello not defined
It is caused by the book-website's bbhello/meta-hello/conf/layer.conf being internally inconsistent. It uses the collection-name "hello" but on the next two lines uses _test suffixes. Just change them to _hello to match:
# Set layer search pattern and priority
BBFILE_COLLECTIONS += "hello"
BBFILE_PATTERN_hello := "^${LAYERDIR}/"
BBFILE_PRIORITY_hello = "5"
Interestingly, this error is not present in the printed copy of the book.
No license error
The error:
ERROR: /tmp/bbhello/meta-hello/recipes-editor/nano/nano.bb: This recipe does not have the LICENSE field set (nano)
ERROR: Failed to parse recipe: /tmp/bbhello/meta-hello/recipes-editor/nano/nano.bb
This can be fixed by adding a license setting with one of the values that bitbake recognizes. In this case, add this line to nano.bb:
LICENSE="GPLv3"
Recipe parse error
ERROR: ExpansionError during parsing /tmp/bbhello/meta-hello/recipes-editor/nano/nano.bb
[...]
bb.data_smart.ExpansionError: Failure expanding variable PV_MAJOR, expression was ${#bb.data.getVar('PV',d,1).split('.')[0]} which triggered exception AttributeError: module 'bb.data' has no attribute 'getVar'
This is fixed by updating the special python commands being used in the recipe, because #bb.data was deprecated and is now removed. Instead, replace it with #d, ex:
PV_MAJOR = "${#d.getVar('PV',d,1).split('.')[0]}"
PV_MINOR = "${#d.getVar('PV',d,1).split('.')[1]}"
License checksum failure
ERROR: nano-2.2.6-r0 do_populate_lic: QA Issue: nano: Recipe file fetches files and does not have license file information (LIC_FILES_CHKSUM) [license-checksum]
This can be fixed by adding a directive to the recipe telling it what license-info-containing file to grab, and what checksum we expect it to have.
We can follow the way the recipe generates the SRC_URI, and modify it slightly to point at the COPYING file in the same web-directory. Add this line to nano.bb:
LIC_FILES_CHKSUM = "${SITE}/v${PV_MAJOR}.${PV_MINOR}/COPYING;md5=f27defe1e96c2e1ecd4e0c9be8967949"
The MD5 checksum in this case came from manually downloading and inspecting the matching file.
Done!
Now bitbake nano ought to work, and when it is complete you should see that it built nano:
/tmp/bbhello $ find ./tmp/deploy/ -name "*nano*.rpm*"
./tmp/deploy/rpm/i586/nano-dbg-2.2.6-r0.i586.rpm
./tmp/deploy/rpm/i586/nano-dev-2.2.6-r0.i586.rpm
I have recently worked on that hands-on hello world project. In my opinion, the source code in the book contains some bugs. Below is a list of suggested fixes:
Inheriting native class
In fact, when you build with the bitbake that you got from Poky, it builds only for the target, unless you state in your recipe that you are building for the host machine (native). You can do the latter by adding this line at the end of your recipe:
inherit native
Adding license information
It is worth mentioning that the LICENSE variable must be set in any recipe, otherwise bitbake raises an error. In our case, we are trying to build version 2.2.6 of the nano editor; its current license is GPLv3, hence it should be declared as follows:
LICENSE = "GPLv3"
Using os.system calls
As the book states, you cannot dereference metadata directly from a Python function, which means it is mandatory to access metadata through the d dictionary. Below is a suggestion for the do_unpack Python function; you can apply the same idea to the next tasks (do_configure, do_compile):
python do_unpack() {
    # Pull the needed variables out of the BitBake datastore d
    workdir = d.getVar("WORKDIR", True)
    dl_dir = d.getVar("DL_DIR", True)
    p = d.getVar("P", True)
    # The downloaded tarball sits in DL_DIR as ${P}.tar.gz
    tarball_name = os.path.join(dl_dir, p+".tar.gz")
    bb.plain("Unpacking tarball")
    # Extract it into the recipe's work directory
    os.system("tar -x -C " + workdir + " -f " + tarball_name)
    bb.plain("tarball unpacked successfully")
}
Launching the nano editor
After successfully building your nano editor package, you can find the nano executable in the following directory (on Ubuntu, arch x86_64):
./tmp/work/x86_64-linux/nano/2.2.6-r0/src/nano
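Since the recipe inherits native, the binary targets the build host, so it can be run directly; for example (path as reported above):
# Run the freshly built editor straight out of the work directory
./tmp/work/x86_64-linux/nano/2.2.6-r0/src/nano --version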
Should you have any comments or questions, don't hesitate to ask!
I have downloaded the graphframes package (from here) and saved it on my local disk. Now, I would like to use it. So, I use the following command:
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --py-files ~/temp/graphframes-0.1.0-spark1.5.jar --jars ~/temp/graphframes-0.1.0-spark1.5.jar --packages graphframes:graphframes:0.1.0-spark1.5
All the pyspark functionality works as expected, except for the new graphframes package: whenever I try to import graphframes, I get an ImportError. When I examine sys.path, I can see the following two paths:
/tmp/spark-1eXXX/userFiles-9XXX/graphframes_graphframes-0.1.0-spark1.5.jar and /tmp/spark-1eXXX/userFiles-9XXX/graphframes-0.1.0-spark1.5.jar, however these files don't exist. Moreover, the /tmp/spark-1eXXX/userFiles-9XXX/ directory is empty.
What am I missing?
In my case:
1. cd /home/zh/.ivy2/jars
2. jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
3. Add /home/zh/.ivy2/jars to PYTHONPATH in spark-env.sh, as in the line below:
export PYTHONPATH=$PYTHONPATH:/home/zh/.ivy2/jars:.
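To verify that the extracted package is now importable, a quick check like this can be used (it assumes pyspark itself is on PYTHONPATH, since the graphframes Python code imports it):
# Confirm Python can find the extracted graphframes package
python -c "import graphframes; print(graphframes.__file__)"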
This might be an issue with Spark packages and Python in general. Someone else was asking about it too earlier on the Spark user discussion alias.
My workaround is to unpack the jar to find the embedded Python code, and then move the Python code into a subdirectory called graphframes.
For instance, I run pyspark from my home directory:
~$ ls -lart
drwxr-xr-x 2 user user 4096 Feb 24 19:55 graphframes
~$ ls graphframes/
__init__.pyc examples.pyc graphframe.pyc tests.pyc
You would not need the py-files or jars parameters then; something like
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --packages graphframes:graphframes:0.1.0-spark1.5
and having the python code in the graphframes directory should work.
Add these lines to your $SPARK_HOME/conf/spark-defaults.conf:
spark.executor.extraClassPath file_path/jar1:file_path/jar2
spark.driver.extraClassPath file_path/jar1:file_path/jar2
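For example, appended from the shell (the jar path is a placeholder; point it at wherever you keep the graphframes jar from the question):
# Add the jar to both driver and executor classpaths (adjust the path)
cat >> "$SPARK_HOME"/conf/spark-defaults.conf <<'EOF'
spark.executor.extraClassPath /path/to/graphframes-0.1.0-spark1.5.jar
spark.driver.extraClassPath   /path/to/graphframes-0.1.0-spark1.5.jar
EOF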
In the more general case of importing an 'orphan' Python file (outside of the current folder, not part of a properly installed package), use addPyFile, e.g.:
sc.addPyFile('somefolder/graphframe.zip')
addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
I followed this quickstart:
https://docs.prediction.io/templates/classification/quickstart/
and this document for evaluation metrics
https://docs.prediction.io/evaluation/paramtuning/
Everything seems OK until the "build and run evaluation metrics" step:
pio eval org.template.classification.AccuracyEvaluation \
org.template.classification.EngineParamsList
I am getting the exception:
Exception in thread "main" scala.reflect.internal.MissingRequirementError: object org.template.classification.AccuracyEvaluation not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at io.prediction.workflow.WorkflowUtils$.getEvaluation(WorkflowUtils.scala:103)
at io.prediction.workflow.CreateWorkflow$$anonfun$19.apply(CreateWorkflow.scala:146)
at io.prediction.workflow.CreateWorkflow$$anonfun$19.apply(CreateWorkflow.scala:144)
Could anyone help me with this?
Thank you very much.
Had the exact same problem. Fixed it by doing the following:
For each .scala file in engine_dir/src/main/scala/org/template/engine_name/ you need to change the first line from...
package <SomeTemplateName>
To the following (replacing engine_name with the name of the folder in the path mentioned above):
package org.template.<engine_name>
Then, in engine.json you need to change the following line...
"engineFactory": "<template name>.<template engine>",
To the following (once again replacing engine_name with the name of the folder in the path mentioned above):
"engineFactory": "org.template.<engine name>.<template engine>",
Now re-run...
pio build
pio train
pio deploy
Then you should be able to run the model evaluation without errors.
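For example, the full sequence with the evaluation classes from the question would be:
# Rebuild the engine, then re-run the evaluation with the renamed packages
pio build
pio eval org.template.classification.AccuracyEvaluation \
    org.template.classification.EngineParamsList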
Simply run it like this:
$ pio eval org.example.classification.AccuracyEvaluation \
org.example.classification.EngineParamsList
You don't have to change anything. The class package from the sample was org.example.classification, not org.template.classification.