In what order will multiple PySpark programs get executed on a Spark cluster?

If I submit multiple Python (PySpark) files to a spark-submit command, in which order will they get executed?
For Java, there is a main method which gets executed first, and the rest of the classes get executed in the order their objects/methods are created/invoked.
But Python (and also Scala) allows REPL-style syntax, where one is allowed to write statements in an 'open code' fashion, i.e. outside method blocks.
So when a whole bunch of these REPL-style statements get submitted to the Spark cluster, in what order will they execute?

According to http://spark.apache.org/docs/3.0.1/configuration.html
spark.submit.pyFiles (which is --py-files): Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. Globs are allowed.
So the Python files added by --py-files are meant to be libraries, modules, or packages, not runnable scripts. You will need to create a main.py (or something similar), import the other 5 files from it, and trigger them in whatever order you want:
spark-submit --py-files five-files.zip main.py
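A minimal sketch of such a main.py (the module names job_a and job_b and their run functions are hypothetical; since only main.py is submitted as the application, it alone decides the execution order):

# main.py (hypothetical entry point)
from pyspark.sql import SparkSession

import job_a  # shipped inside five-files.zip via --py-files
import job_b

spark = SparkSession.builder.appName("multi-job-app").getOrCreate()

# Top-level statements in job_a and job_b run once, at import time;
# everything else runs in exactly the order main.py invokes it.
job_a.run(spark)
job_b.run(spark)

spark.stop()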


How to add to classpath of running PySpark session

I have a PySpark notebook running in AWS EMR. In my specific case, I want to use pyspark2pmml to create PMML for a model I just trained. However, I get the following error (when running pyspark2pmml.PMMLBuilder, but I don't think that matters):
JPMML-SparkML not found on classpath
Traceback (most recent call last):
File "/tmp/1623111492721-0/lib/python3.7/site-packages/pyspark2pmml/__init__.py", line 14, in __init__
raise RuntimeError("JPMML-SparkML not found on classpath")
RuntimeError: JPMML-SparkML not found on classpath
I know that this is caused by my Spark session not having a reference to the needed class. What I don't know is how to start a Spark session with that class available. I found one other answer using %%configure -f, but that changed other settings, which in turn kept me from using sc.install_pypi_package, which I also needed.
Is there a way that I could have started the Spark session with that JPMML class available, but without changing any other settings?
So, here's an answer, but not the one I want.
To add that class to the classpath I can start my work with this:
%%configure -f
{
    "jars": [
        "{some_path_to_s3}/jpmml-sparkml-executable-1.5.13.jar"
    ]
}
That creates the issue I referenced above, where I lose the ability to use sc.install_pypi_package. However, I can add that package in a more manual way. The first step was to create a zip file of just the needed modules, using the zip from the project's GitHub (in this case, just the pyspark2pmml directory instead of the whole zip). Then that module can be added using sc.addPyFile:
sc.addPyFile('{some_path_to_s3}/pyspark2pmml.zip')
After this, I can run the original commands exactly as I expected.
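For reference, the original commands are roughly the following (a sketch only: df and pipeline_model stand for whatever training DataFrame and fitted PipelineModel you have, and the output path is made up):

from pyspark2pmml import PMMLBuilder

# Works once the JPMML-SparkML jar is on the classpath (via %%configure)
# and the pyspark2pmml zip has been added with sc.addPyFile.
builder = PMMLBuilder(sc, df, pipeline_model)
builder.buildFile("my_model.pmml")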

How to use a class/function from another Swift file in the same folder without Xcode

I wrote a protocol and classes in G.swift, and functions and tests in L.swift. These two files are in the same folder, and I need to "import" a class from G in L.swift.
I searched on the internet and it said I do not need an import statement if the two files are in the same directory. But I am not using Xcode to write these files. I just want to write small functions and run them the way I would in Python or Go.
Is that possible, or do I have to use Xcode to make the "import" work?
If you're not using Xcode, I assume you're running them via either swift or swiftc. In either case, just list both files on the command line (for example, swiftc G.swift L.swift) and they will be treated as part of the same module, so no import is needed. Your L.swift file should have a main() function that runs the tests.

Upload zip file using --archives option of spark-submit on YARN

I have a directory with some model files, and my application has to access these model files from the local file system for certain reasons.
Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I came up with the --archives option, which is described as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually used it to upload models.zip, I found that YARN just put it there without extraction, exactly as it does with --files. Have I misunderstood "to be extracted", or have I misused this option?
Found the answer myself.
YARN does extract the archive, but it adds an extra folder with the same name as the archive. To make it clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models as models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make this nicer using the # syntax:
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
Edit:
This answer was tested on Spark 2.0.0 and I'm not sure about the behavior in other versions.
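To illustrate the two points above (a sketch only: the alias mymodels and the script my_app.py are made up, and the zip still contains the top-level models/ directory described earlier):

spark-submit --archives models.zip#mymodels my_app.py

# inside my_app.py, in code that runs on the executors:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("archive-demo").getOrCreate()

def model_sizes(_):
    # The link name comes from the # alias; the zip's own top-level
    # models/ directory is still part of the path after extraction.
    with open("mymodels/models/model1", "rb") as f:
        yield len(f.read())

print(spark.sparkContext.parallelize(range(2), 2).mapPartitions(model_sizes).collect())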

Editing Spark Module in Spark-kernel

We are currently editing a specific module in Spark. We are using spark-kernel (https://github.com/ibm-et/spark-kernel) to run all our Spark jobs. What we did was recompile the code we edited, which produces a jar file. However, we do not know how to point spark-kernel to that jar file.
It looks like it is still referencing the old code and not the newly edited, newly compiled version. Do you have any idea how to modify some Spark packages/modules and have the changes reflected in spark-kernel? If we are not going to use spark-kernel, is there a way to edit a particular module in Spark, for example the ALS module: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala? Thanks!
You likely edited a Scala or Java file and recompiled (even though you call them scripts, they are not scripts in the strict sense because they are not interpreted). Assuming that's what you did...
You probably then don't have a clean replacement of the resulting JAR file in the deployment you are testing. Odds are your newly compiled JAR file is somewhere, just not in the place you are observing. To get it there properly, you will have to build more than the JAR file: you will have to repackage your installable and reinstall it.
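As a rough sketch of what "repackage and reinstall" can look like when the edited module lives in Spark itself (the module name, Scala version suffix, and script locations vary by Spark version, so treat these commands as placeholders rather than exact invocations):

# rebuild only the MLlib module after editing ALS.scala
./build/mvn -pl :spark-mllib_2.11 -DskipTests clean install

# then repackage a full Spark distribution and redeploy it to wherever
# spark-kernel and the worker nodes pick Spark up from
./dev/make-distribution.sh --name custom-als --tgz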
Other techniques exist: if you can identify the unpacked item in an installation, you can sometimes copy the new JAR into place; however, such a technique is inherently unmaintainable, so I recommend it only for throwaway verification of a change, not for any system that will actually be used.
Keep in mind that with Spark, the worker nodes are sometimes deployed dynamically. If that is the case, you might have to locate the installable used by the dynamic deployment system and make sure the right packaging is there too.

Can I get MATLAB to not call functions in a specific directory?

A little background
I'm working on a project that requires me to use an old (from 2006) and large system of MATLAB scripts. The scripts exist in an archive folder on a cluster, but I need everything to run fully from my own cluster folder. I've got it mostly running from my personal folder, but not entirely: it runs perfectly, but there is a Python script being called somewhere that doesn't exist in my personal directory.
What I want to do
Since the MATLAB code I'm running includes many different script files, which themselves call even more script files, poring through them to find information about the Python script would be very time consuming.
Therefore, I would like to be able to tell MATLAB not to go to specific folders when calling a script, and to return an error instead. For example, if a script is called from the directory /notmyfolder, I want it to return an error.
Is this possible?