Sparkling Water on Windows - scala

I'm using spark-2.3.0-bin-hadoop2.7 and sparkling-water-2.3.5 on Windows 10 64 bit.
I've taken the following steps and am looking for help with Steps 4 and 5.
Step 1: Run the Spark shell by executing bin/sparkling-shell. (Fine)
Step 2: At the scala prompt (Fine)
scala> import org.apache.spark.h2o._
scala> val h2oContext = H2OContext.getOrCreate(spark)
scala> import h2oContext._
Step 3: Run the openFlow command at the scala prompt to open the Flow UI in the browser. (Fine)
Step 4: At the scala prompt, run the openSparkUI command to open the Spark UI in the browser. (Not working)
scala> openSparkUI
- error: not found: value openSparkUI
Step 5: Looking for an editor to write Scala code in, and for a way to submit that code at the scala prompt.

openSparkUI used to exist around 2015, but has since been removed. As noted in the question, h2oContext.openFlow is still functional and an available option (type q to convert a cell to a Scala cell in Flow; type h to see the full keyboard shortcut list. Note: keyboard shortcuts only work if you are not in editor mode typing within a cell).
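Although the helper is gone, the Spark UI address is still available at the prompt via plain Spark (uiWebUrl is a standard Spark 2.x SparkContext method, not part of Sparkling Water):
scala> println(spark.sparkContext.uiWebUrl.getOrElse("Spark UI disabled"))
Paste the printed URL (typically http://localhost:4040) into a browser for the same effect openSparkUI used to provide.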
Other possible interfaces for writing Scala code include Jupyter notebooks and Zeppelin.
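For Step 5 specifically, any editor will do; code saved in a .scala file can be fed to the running shell with the REPL's :load command (a standard Scala REPL feature; the path below is a made-up example):
scala> :load C:/work/MyAnalysis.scala
For standalone applications, the usual route is to package the code with sbt and run it through bin/spark-submit instead of pasting it at the prompt.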

Related

How to debug unit tests while developing a package in Julia

Say I develop a package with a limited set of dependencies (for example, LinearAlgebra).
For the unit tests, I might need additional dependencies (for instance, CSV to load a file). I can configure that in the Project.toml; all good.
Now from there and in VS Code, how can I debug the Unit tests? I tried running the "runtests.jl" in the debugger; however, it unsurprisingly complains that the CSV package is unavailable.
I could add the CSV package (as a temporary solution), but I would prefer that the debugger run with the configuration for the unit testing; how can I achieve that?
As requested, here is how it can be reproduced (it is not quite minimal, but I used a widely used package because that gives confidence the package itself is not the problem). We will use DataFrames and try to execute the debugger for its unit tests.
Make a local version of DataFrames for the purpose of developing a feature in it: I execute dev DataFrames at the pkg prompt in a new REPL.
Select the correct environment (in .julia/dev/DataFrames) through the VS Code user interface.
Execute the "proper" unit testing by executing test DataFrames at the pkg prompt. Everything should go smoothly.
Try to execute the tests directly (open runtests.jl and use the "Run" button in VS Code). I see some errors of the type:
LoadError: ArgumentError: Package CategoricalArrays not found in current path:
- Run `import Pkg; Pkg.add("CategoricalArrays")` to install the CategoricalArrays package.
which is consistent with CategoricalArrays being present in the [extras] section of the Project.toml but not present in the [deps].
Finally, instead of the "Run" command, execute "Run and Debug". I encounter similar errors; here is the first one:
Test Summary: | Pass Total
merge | 19 19
PASSED: index.jl
FAILED: dataframe.jl
LoadError: ArgumentError: Package DataStructures not found in current path:
- Run `import Pkg; Pkg.add("DataStructures")` to install the DataStructures package.
So I can't debug the code after the part requiring the extras packages.
After all that, I release the development copy with the command free DataFrames at the pkg prompt.
I see the same behavior in my package.
I'm not certain I understand your question, but I think you might be looking for the TestEnv package. It allows you to activate a temporary environment containing the [extras] dependencies. The Discourse announcement contains a good description of the use cases.
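A minimal sketch of the workflow (this assumes TestEnv is installed in your default environment; TestEnv.activate is its documented entry point):
julia> using TestEnv
julia> TestEnv.activate()          # temporary env: the package's deps plus its [extras]
julia> include("test/runtests.jl") # CSV, CategoricalArrays, etc. now resolve
Starting this in the REPL owned by the VS Code Julia extension means subsequent debugging runs against the same temporary environment.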
Your runtests.jl file should contain all the imports necessary to run the tests.
Hence you are expected to have in your runtests.jl file lines such as:
using YourPackageName
using CSV
# the lines with tests now go here.
This is a standard in Julia package layout. For an example, have a look at any mature Julia package such as DataFrames.jl (https://github.com/JuliaData/DataFrames.jl/blob/main/test/runtests.jl).

Visual Studio Code using pytest for PySpark getting stuck at SparkSession creation

I am trying to run a PySpark unit test in Visual Studio Code on my local Windows machine. When I debug the test, it gets stuck at the line where I create a SparkSession. It doesn't show any error or failure, but the status bar just shows "Running Tests". Once it works, I can refactor my test to create the SparkSession as part of a test fixture, but presently the test is stuck at SparkSession creation.
Do I have to install or configure something on my local machine for the SparkSession to work?
I tried a simple test with assert 'a' == 'b' and I can debug it and the test runs successfully, so I assume my pytest configuration is correct. The issue I am facing is with creating the SparkSession.
# test code
from pyspark.sql import SparkSession, Row, DataFrame
import pytest

def test_poc():
    # This line never returns when debugging the test.
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()
    # Placeholder rows; the original data and schema were not shown.
    data = [Row(x=1), Row(x=2)]
    spark_session.createDataFrame(data)
Thanks
What I have done to make it work was:
Create a .env file in the root of the project
Add the following content to the created file:
SPARK_LOCAL_IP=127.0.0.1
JAVA_HOME=<java_path>/jdk/zulu#1.8.192/Contents/Home
SPARK_HOME=<spark_path>/spark-3.0.1-bin-hadoop2.7
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Go to the .vscode folder in the root, expand it and open settings.json. Add the following line (replace <workspace_path> with your actual workspace path):
"python.envFile": "<workspace_path>/.env"
After refreshing the Testing section in Visual Studio Code, the setup should succeed.
Note: I use pyenv to set up my Python version, so I had to make sure that VS Code was using the correct Python version with all the expected dependencies installed.
Solution inspired by py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM and https://github.com/microsoft/vscode-python/issues/6594
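Once the session comes up at all, the fixture refactor mentioned in the question is straightforward; a minimal sketch of a shared pytest fixture (the names and scope here are my own choice, not from the question):
# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    # One local SparkSession for the whole test run; rebuilding it per test is slow.
    spark = SparkSession.builder.master("local[2]").appName("pytest-spark").getOrCreate()
    yield spark
    spark.stop()  # shut the JVM down after the last test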

Error: Could not find or load main class Main Scala

I've recently installed Scala as a part of my functional programming course, and I've encountered a problem: IntelliJ IDEA 2017.2.1 (Java version 9, build 9+181) doesn't run any of my Scala code, exiting with
Error: Could not find or load main class Main
This code is an example.
object Main {
  def length[A](list: List[A]): Int = {
    if (list == Nil) 0
    else 1 + length(list.tail)
  }
  def main(args: Array[String]): Unit = {
    length(List[Int](1, 4, 5, 12, -1))
  }
}
It's really simple, yet IntelliJ refuses to run it. Windows CMD doesn't even react to a scala command, resulting in
'scala' is not recognized as an internal or external command,
operable program or batch file.
even though I have it installed on my computer. If I call the Scala Console inside IntelliJ, everything works fine and compiles as expected. I've tried switching to JDK 1.8 inside IntelliJ, yet it led to no result.
What could be the problem?
For me, it turned out that src/main was not marked as Sources Root,
which caused the following error:
...
One of the two will be used. Which one is undefined.
Error: Could not find or load main class Main
Process finished with exit code 1
So of course after I mark src/main as Sources Root, the Scala Hello World example runs happily again.
Notice the blue color of the src/main directory when it's marked as Sources Root.
Are you using the little green arrow to run the program from inside of your Main object?
How did you create the program? It could be that the SBT configuration of your project specifies a different Scala version than the one installed on your computer.
It's really simple, yet IntelliJ refuses to run it. Windows CMD
doesn't even react to a scala command, resulting into
'scala' is not recognized as an internal or external command, operable program or batch file.
This means that Scala is not on your PATH in your terminal. Look up how to add it and see if that doesn't help out your IntelliJ problem too.
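For example, in a Windows command prompt (the install directory below is an assumption; substitute wherever your Scala distribution actually lives):
set PATH=%PATH%;C:\Program Files (x86)\scala\bin
scala -version
If the second command now prints a version instead of "not recognized", the shell can see Scala; use setx or the System Environment Variables dialog to make the change permanent.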

Running ant script from within scala program

I tried
val cmd = sys.process.Process(Seq("C:\apache-ant-1.9.3\bin\ant", "everythingNoJunit"), new java.io.File(scriptDir))
cmd.lines
and got this error:
CreateProcess error=193, %1 is not a valid Win32 application
How do I run the ant script from within a Scala app?
The basic answer is that you should be using "ant.bat" instead of "ant" on a Windows machine, as in this answer.
In addition to that, I would suggest using a non-windows styled path so you don't have to escape the backslashes:
val cmd = sys.process.Process(Seq("/apache-ant-1.9.3/bin/ant.bat", "everythingNoJunit"), new java.io.File(scriptDir))
Using this approach, I'm able to run an ant target successfully when my Scala application is also on "c:".
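If you only need the exit status rather than the captured output lines, the same ProcessBuilder also exposes the ! method (standard scala.sys.process API, nothing ant-specific):
val antBuild = sys.process.Process(Seq("/apache-ant-1.9.3/bin/ant.bat", "everythingNoJunit"), new java.io.File(scriptDir))
val exitCode = antBuild.! // blocks until ant finishes, then returns its exit code
A non-zero exitCode then signals a failed build without you having to parse the output.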

Scala readLine with prompt displays prompt after line input in SBT

I'm running a Scala 2.10 program through sbt run from the Windows 7 command line, and I see unexpected behavior when calling the readLine overload that takes a prompt. The prompt is shown after the actual line input.
Source
object MyExample extends App {
  readLine("This prompt is shown after the readline!")
}
build.sbt
name := "hello"

version := "1.0"
Output
asdf
This prompt is shown after the readline!
Is there something I don't understand or is it a bug? It seems to be working as expected from IDEA.
sbt version: 0.13.1
I've run into this before with giter8. The workaround is to do your own print, flush the output stream, and then read. See this pull request for an example of the workaround.
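A minimal sketch of that workaround (plain Predef and Console calls, nothing sbt-specific):
object MyExample extends App {
  print("This prompt now appears before the input: ")
  Console.out.flush()   // push the prompt out before blocking on stdin
  val line = readLine() // prompt-less readLine, so nothing gets reordered
  println(s"You typed: $line")
}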
Someone fixed it in the Scala source about a month ago. I don't have any idea when we will see that fix, though.