Why does to_date work in Databricks but not in IntelliJ? - scala

My code is:
import org.apache.spark.sql.functions.to_date
val df2 = df1.withColumn("Order Date", to_date($"Order Date", "dd-MMM-yy"))
Error - Cannot resolve overloaded method 'to_date'
The same code is working fine in Databricks notebook.
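In an IDE this is usually a build-setup issue rather than a code issue: the to_date(Column, String) overload only exists from Spark 2.2 onwards, so an older spark-sql dependency in the IntelliJ project cannot resolve it, and the $"..." column syntax needs spark.implicits._ in scope. A minimal self-contained sketch, assuming a spark-sql 2.2+ dependency (the SparkSession and the sample data are assumptions for illustration):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date

val spark = SparkSession.builder().appName("to_date example").master("local[*]").getOrCreate()
import spark.implicits._   // brings the $"colName" syntax into scope

// sample data standing in for df1 (assumption for illustration)
val df1 = Seq("28-Jul-19", "01-Jan-20").toDF("Order Date")

// the (Column, String) overload of to_date requires a Spark 2.2+ dependency
val df2 = df1.withColumn("Order Date", to_date($"Order Date", "dd-MMM-yy"))
df2.show()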

Related

how to rewrite this Scala withColumn when condition

I have some Scala code that works when I run it manually in Spark EMR, but I get errors when I try to compile it in Eclipse.
val tmp_df2 = tmp_df1.withColumn("col_one", when($"col_two" === "good", "bad").otherwise($"col_one"))
When I run "Maven install" it says "error: not found: value when". But I know the code works in EMR.
So, is there another way to specify that condition without using "when"?
You may need to import the Spark function as follows:
import org.apache.spark.sql.functions.when
or
import org.apache.spark.sql.functions._
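For completeness, a minimal self-contained sketch of a file that compiles with the import in place (the SparkSession and sample data are assumptions for illustration; the column names come from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder().appName("when example").master("local[*]").getOrCreate()
import spark.implicits._   // needed for the $"colName" syntax

// sample data standing in for tmp_df1 (assumption for illustration)
val tmp_df1 = Seq(("keep", "good"), ("keep", "other")).toDF("col_one", "col_two")

// 'when' resolves once org.apache.spark.sql.functions.when is imported
val tmp_df2 = tmp_df1.withColumn("col_one", when($"col_two" === "good", "bad").otherwise($"col_one"))
tmp_df2.show()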

How to import Delta Lake module in Zeppelin notebook and pyspark?

I am trying to use Delta Lake in a Zeppelin notebook with pyspark, and it seems the module cannot be imported successfully, e.g.
%pyspark
from delta.tables import *
It fails with the following error:
ModuleNotFoundError: No module named 'delta'
However, there is no problem saving/reading the data frame using the delta format, and the module can be loaded successfully when using Scala Spark (%spark).
Is there any way to use Delta Lake in Zeppelin and pyspark?
Finally managed to load it in Zeppelin pyspark. You have to explicitly include the jar file:
%pyspark
sc.addPyFile("**LOCATION_OF_DELTA_LAKE_JAR_FILE**")
from delta.tables import *

error not found value spark import spark.implicits._ import spark.sql

I am using Hadoop 2.7.2, HBase 1.4.9, Spark 2.2.0, Scala 2.11.8 and Java 1.8 on a Hadoop cluster composed of one master and two slaves.
When I run spark-shell after starting the cluster, it works fine.
I am trying to connect to HBase using Scala by following this tutorial: https://www.youtube.com/watch?v=gGwB0kCcdu0
But when I try, as he does, to run spark-shell with those jars passed as arguments, I get this error:
spark-shell --jars
"hbase-annotations-1.4.9.jar,hbase-common-1.4.9.jar,hbase-protocol-1.4.9.jar,htrace-core-3.1.0-incubating.jar,zookeeper-3.4.6.jar,hbase-client-1.4.9.jar,hbase-hadoop2-compat-1.4.9.jar,metrics-json-3.1.2.jar,hbase-server-1.4.9.jar"
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
And after that, even if I log out and run spark-shell another time, I have the same issue.
Can anyone please tell me what the cause is and how to fix it?
In your import statement, spark should be an object of type SparkSession. That object should have been created for you already, or you need to create it yourself (read the Spark docs). I didn't watch your tutorial video.
The point is that it doesn't have to be called spark. It could, for instance, be called sparkSession, and then you can do import sparkSession.implicits._
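A minimal sketch of creating the session yourself and importing from it (the name sparkSession is just an example; in spark-shell the prebuilt session is normally called spark):
import org.apache.spark.sql.SparkSession

// build (or reuse) a session explicitly; spark-shell normally creates one named spark for you
val sparkSession = SparkSession.builder()
  .appName("hbase example")
  .master("local[*]")
  .getOrCreate()

// the implicits and sql come from whatever the session value is called
import sparkSession.implicits._
import sparkSession.sql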

EMR Notebook Scala kernel import graphframes library

Running spark-shell --packages "graphframes:graphframes:0.7.0-spark2.4-s_2.11" in the bash shell works and I can successfully import graphframes 0.7, but when I try to use it in a Scala Jupyter notebook like this:
import scala.sys.process._
"spark-shell --packages \"graphframes:graphframes:0.7.0-spark2.4-s_2.11\""!
import org.graphframes._
gives error message:
<console>:53: error: object graphframes is not a member of package org
import org.graphframes._
Which, from what I can tell, means that it runs the bash command but still cannot find the retrieved package.
I am doing this in an EMR Notebook running a Spark Scala kernel.
Do I have to set some sort of Spark library path in the Jupyter environment?
That simply shouldn't work. What your code does is simply attempt to start a new, independent Spark shell. Furthermore, Spark packages have to be loaded when the SparkContext is initialized for the first time.
You should either add (assuming these are correct versions)
spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
to your Spark configuration files, or use the equivalent in your SparkConf / SparkSession.Builder.config before the SparkSession is initialized.
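A minimal sketch of the builder variant, with the package coordinates taken from the question (the application name is an assumption; the configuration has to happen before the first SparkSession/SparkContext exists):
import org.apache.spark.sql.SparkSession

// the package must be configured before the first SparkSession/SparkContext is created
val spark = SparkSession.builder()
  .appName("graphframes example")
  .config("spark.jars.packages", "graphframes:graphframes:0.7.0-spark2.4-s_2.11")
  .getOrCreate()

// only after the session is built with that config can the package be imported
import org.graphframes._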

Why does from_json fail with "not found: value from_json"? (2)

I have already read the answers to this question that are on SO. None of those fixes are my problem.
I am unable to call the function "from_json".
I already had the following in my code:
import org.apache.spark.sql.functions._
I also tried adding:
import org.apache.spark.sql.Column
I am running Scala/Spark through Eclipse. Scala Version 2.11.11, Spark Version 2.0.0.
Any ideas?
The from_json function isn't available in Spark 2.0.
It is available from Spark 2.1.
The release notes of Spark 2.1 mention the addition of from_json.
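For reference, a minimal sketch of calling from_json on Spark 2.1 or later (the column name, sample data and schema are made up for illustration):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("from_json example").master("local[*]").getOrCreate()
import spark.implicits._

// a JSON string column and a matching schema (illustrative names and data)
val df = Seq("""{"name":"a"}""", """{"name":"b"}""").toDF("payload")
val schema = StructType(Seq(StructField("name", StringType)))

// from_json(Column, StructType) exists from Spark 2.1 onwards
val parsed = df.withColumn("parsed", from_json($"payload", schema))
parsed.show(false)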