Unable to call Notebook when using scala code in Databricks - scala

I am in a situation where I am able to successfully run the snippet below in Azure Databricks from a separate command cell (CMD).
%run ./HSCModule
But I run into issues when I include that line together with other Scala code that imports the packages below, and I get the following error.
import java.io.{File, FileInputStream}
import java.text.SimpleDateFormat
import java.util{Calendar, Properties}
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import scala.util._
ERROR = :168: error: ';' expected but '.' found.
%run ./HSCModule
FYI - I have also used dbutils.notebook.run and am still facing the same issues.

You can't mix magic commands such as %run, %pip, etc. with Scala/Python code in the same cell. The documentation says:
%run must be in a cell by itself, because it runs the entire notebook inline.
So you need to put this magic command into a separate cell.
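For example, a minimal sketch of the split (the Scala lines in the second cell are placeholders; HSCModule stands in for whatever your notebook defines):

Cell 1 (nothing else in this cell):
%run ./HSCModule

Cell 2 (regular Scala code, which can now use whatever HSCModule defined):
import java.text.SimpleDateFormat
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
// call the functions or values defined in HSCModule from here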

Related

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks and I'm using imports to import functions from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
"spark.driver.memory", "32g").getOrCreate()
The imported function "x" will take as an argument a string that was read as a PySpark DataFrame, as such:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like:
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install old_file on the cluster for this to work? If so, how would that work, and will the package update if I change old_file again?
Thanks

Scala with filter column

I get the error:
error: not found: value col
when I issue the following command in a Databricks notebook; I don't get the error when running it from a spark-shell:
altitiduDF.select("height", "terrain").filter(col("height") >= 11000...
^
I tried importing the following before my query, but it did not help:
import org.apache.spark.sql.Column
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
Where can I find what I need to import to use the col function?
I found that I need to import:
import org.apache.spark.sql.functions.{col}
to use the col function in Databricks.
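Putting it together, a minimal sketch reusing the question's altitiduDF DataFrame (the result name is just for illustration):

import org.apache.spark.sql.functions.col

// col now resolves, so the filter from the question compiles
val result = altitiduDF
  .select("height", "terrain")
  .filter(col("height") >= 11000)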

FreeCAD CMD color lost after import

Whenever I try to import a model via a FreeCADCmd script, the objects lose all color information.
This can be checked within the GUI by running macros:
import FreeCAD
import ImportGui
doc = FreeCAD.newDocument()
FreeCAD.setActiveDocument(doc.Name)
ImportGui.insert("file1.stp", doc.Name)
will preserve the colors, but it cannot be run on the command line because of ImportGui.
import FreeCAD
import Import
doc = FreeCAD.newDocument()
FreeCAD.setActiveDocument(doc.Name)
Import.insert("file1.stp", doc.Name)
will import the model without any color information.
Is there any way to import a STEP file into FreeCADCmd (command line, so no GUI) without the color information being dropped?
Or does anyone know a way to run FreeCAD (GUI version) without running an X server?

How to import .py in google Colaboratory?

I want to simplify my code, so I made a utils.py, but the Google Colaboratory working directory is "/content". I read other questions, but they did not solve my problem.
In Google's Colab notebook, How do I call a function from a Python file?
%%writefile example.py
def f():
    print 'This is a function defined in a Python source file.'

# Bring the file into the local Python environment.
execfile('example.py')
f()
This is a function defined in a Python source file.
It looks like this just uses def().
Using this, I always have to write the code in a cell.
But I want to write this instead:
import example.py
example.f()
A sample of what you may want:
!wget https://raw.githubusercontent.com/tensorflow/models/master/samples/core/get_started/iris_data.py -P local_modules -nc
import sys
sys.path.append('local_modules')
import iris_data
iris_data.load_data()
I have also had this problem recently.
I addressed the issue by the following steps, though it's not a perfect solution.
from google.colab import files

src = list(files.upload().values())[0]
open('util.py', 'wb').write(src)
import util
This code should work with Python 3:
from google.colab import drive
import importlib.util
# Mount your drive. It will be at this path: "/content/gdrive/My Drive/"
drive.mount('/content/gdrive')
# Load your module
spec = importlib.util.spec_from_file_location("YOUR_MODULE_NAME", "/content/gdrive/My Drive/utils.py")
your_module_name = importlib.util.module_from_spec(spec)
spec.loader.exec_module(your_module_name)
import importlib.util
import sys
from google.colab import drive
drive.mount('/content/gdrive')
# To add a directory with your code into a list of directories
# which will be searched for packages
sys.path.append('/content/gdrive/My Drive/Colab Notebooks')
import example
This works for me.
Use this if you are outside the content folder! Hope this helps!
import sys
sys.path.insert(0,'/content/my_project')
from example import *
STEP 1. I have created a folder 'common_module' under /content.
STEP 2. I called the required class from my Colab code cell:
import sys

sys.path.append('/content/common_module/')
from DataPreProcessHelper import DataPreProcessHelper as DPPHelper
My class file 'DataPreProcessHelper.py' defines the DataPreProcessHelper class.
Add the path of the 'sample.py' file to the system path as:
import sys
sys.path.append('drive/codes/')
import sample

Spark Shell Import Fine, But Throws Error When Referencing Classes

I am a beginner in Apache Spark, so please excuse me if this is quite trivial.
Basically, I was running the following imports in spark-shell:
import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat,OrcStruct};
import org.apa‌​che.hadoop.io.NullWritable;
...
val rdd = sc.hadoopFile(path,
classOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFor‌​mat],
classOf[NullWritable],
classOf[OrcStruct],
1)
The import statements up to OrcInputFormat work fine, with the exception that:
error: object apa‌​che is not a member of package org
import org.apa‌​che.hadoop.io.NullWritable;
It does not make sense, given that the import statement before it goes through without any issue.
In addition, when referencing OrcInputFormat, I was told:
error: type OrcInputFor‌​mat is not a member of package org.apache.hadoop.hive.ql.io.orc
It seems strange for the import of OrcInputFormat to work (I assume it works, since no error is thrown), but then the above error message turns up. Basically, I am trying to read ORC files from S3.
I am also trying to figure out what I have done wrong, and why this happens.
What I have done:
I have tried running spark-shell with the --jars option, and tried importing hadoop-common-2.6.0.jar (My current version of Spark is 1.6.1, compiled with Hadoop 2.6)
val df = sqlContext.read.format("orc").load(PathToS3), as referred to in (Read ORC files directly from Spark shell). I have tried variations of S3, S3n, and S3a, without any success.
You have 2 non-printing characters between org.apa and che in the last import, most certainly due to a copy-paste:
import org.apa‌​che.hadoop.io.NullWritable;
Just rewrite the last import statement and it will work. Also you don't need these semi-colons.
You have the same problem with OrcInputFormat:
error: type OrcInputFor‌​mat is not member of package org.apache.hadoop.hive.ql.io.orc
That's funny: in the mobile version of Stack Overflow we can clearly see those non-printing characters.
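For reference, a minimal sketch of the affected statements retyped by hand, without the hidden characters or the trailing semicolons (assuming the same sc and path as in the question):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}

// the fully-qualified classOf reference also has to be retyped cleanly
val rdd = sc.hadoopFile(path,
  classOf[OrcInputFormat],
  classOf[NullWritable],
  classOf[OrcStruct],
  1)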