How to rewrite this Scala withColumn when condition - Eclipse

I have some Scala code that works when I run manually in Spark EMR, but I get errors when I try to compile in Eclipse.
val tmp_df2 = tmp_df1.withColumn("col_one", when($"col_two" === "good", "bad").otherwise($"col_one"))
When I run "Maven install" it says "error: not found: value when". But I know the code works in EMR.
So, is there another way to specify that condition without using "when"?

You may need to import the Spark function as follows:
import org.apache.spark.sql.functions.when
or
import org.apache.spark.sql.functions._
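For reference, here is a minimal sketch of the fixed snippet that compiles on its own. The session setup and sample data are assumptions added for illustration; only the withColumn line comes from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder().appName("when-example").master("local[*]").getOrCreate()
import spark.implicits._  // needed for the $"..." column syntax

// hypothetical sample data standing in for the real tmp_df1
val tmp_df1 = Seq(("keep", "good"), ("keep", "other")).toDF("col_one", "col_two")

val tmp_df2 = tmp_df1.withColumn("col_one", when($"col_two" === "good", "bad").otherwise($"col_one"))
tmp_df2.show()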

Related

error: not found: value spark (import spark.implicits._ / import spark.sql)

I am using Hadoop 2.7.2, HBase 1.4.9, Spark 2.2.0, Scala 2.11.8 and Java 1.8 on a Hadoop cluster composed of one master and two slaves.
When I run spark-shell after starting the cluster, it works fine.
I am trying to connect to HBase from Scala by following this tutorial: https://www.youtube.com/watch?v=gGwB0kCcdu0
But when I try, like he does, to run spark-shell with those jars passed as an argument, I get this error:
spark-shell --jars
"hbase-annotations-1.4.9.jar,hbase-common-1.4.9.jar,hbase-protocol-1.4.9.jar,htrace-core-3.1.0-incubating.jar,zookeeper-3.4.6.jar,hbase-client-1.4.9.jar,hbase-hadoop2-compat-1.4.9.jar,metrics-json-3.1.2.jar,hbase-server-1.4.9.jar"
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
And after that, even if I log out and run spark-shell again, I get the same issue.
Can anyone please tell me what the cause is and how to fix it?
In your import statement, spark should be an object of type SparkSession. That object should have been created for you previously, or you need to create it yourself (see the Spark docs). I didn't watch your tutorial video.
The point is that it doesn't have to be called spark. It could, for instance, be called sparkSession, and then you can do import sparkSession.implicits._
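A minimal sketch, assuming you create the session yourself in an application (the session name and app name here are arbitrary placeholders):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("hbase-example")  // hypothetical app name
  .getOrCreate()

// the implicits and sql come from whatever name your session object has
import sparkSession.implicits._
import sparkSession.sql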

EMR Notebook Scala kernel import graphframes library

Running spark-shell --packages "graphframes:graphframes:0.7.0-spark2.4-s_2.11" in the bash shell works and I can successfully import graphframes 0.7, but when I try to use it in a Scala Jupyter notebook like this:
import scala.sys.process._
"spark-shell --packages \"graphframes:graphframes:0.7.0-spark2.4-s_2.11\""!
import org.graphframes._
it gives this error message:
<console>:53: error: object graphframes is not a member of package org
import org.graphframes._
From what I can tell, this means it runs the bash command, but then still cannot find the retrieved package.
I am doing this on an EMR Notebook running a spark scala kernel.
Do I have to set some sort of spark library path in the jupyter environment?
That simply shouldn't work. What your code does is merely attempt to start a new, independent Spark shell. Furthermore, Spark packages have to be loaded when the SparkContext is initialized for the first time.
You should either add (assuming these are correct versions)
spark.jars.packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
to your Spark configuration files, or use the equivalent in your SparkConf / SparkSession builder's .config before the SparkSession is initialized, as sketched below.
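A minimal sketch of the builder approach, assuming the package coordinates above are correct and that no SparkSession has been created yet in the notebook session:

import org.apache.spark.sql.SparkSession

// the packages setting must be in place before the first SparkSession/SparkContext is created
val spark = SparkSession.builder()
  .appName("graphframes-example")  // hypothetical app name
  .config("spark.jars.packages", "graphframes:graphframes:0.7.0-spark2.4-s_2.11")
  .getOrCreate()

import org.graphframes._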

How to import the Play framework jar in spark-shell on Windows?

I am using a Windows machine and have installed Spark and Scala for learning. For Spark SQL I need to process JSON input data.
scala> sc
res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7431f4b8
scala> import play.api.libs.json._
<console>:23: error: not found: value play
import play.api.libs.json._
^
scala>
How can I import the Play API in my spark-shell session?
If you want to use other libraries in spark-shell, you need to run the spark-shell command with --jars and/or --packages. For example, to use Play in your Spark shell, run the following command:
spark-shell --packages "com.typesafe.play":"play_2.11":"2.6.19"
For more information, run spark-shell -h. I hope it helps!
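After the shell starts with the package resolved, a quick check that the import works (the JSON string here is just a made-up example):

scala> import play.api.libs.json._
scala> val parsed = Json.parse("""{"name": "test", "value": 1}""")   // returns a JsValue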

error: not found: value sc

I am new to Scala and am trying to read a file using the following code
scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
But I keep getting the following error
error: not found: value sc
I have tried everything, but nothing seems to work. I am using Scala version 2.10.4 and Spark 1.1.0 (I have even tried Spark 1.2.0, but it doesn't work either). I have sbt installed and compiled, yet I am not able to run sbt/sbt assembly. Is the error because of this?
You should run this code using ./spark-shell. It's the Scala REPL with a provided SparkContext (sc). You can find it in your Apache Spark distribution in the folder spark-1.4.1/bin.
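For example, once ./spark-shell is up, sc is already defined and the snippet from the question runs as-is (assuming README.md sits in the directory you launched the shell from):

scala> val textFile = sc.textFile("README.md")   // sc is provided by spark-shell
scala> textFile.count()                          // returns the number of lines as a Long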

Spark - "sbt package" - "value $ is not a member of StringContext" - Missing Scala plugin?

When running "sbt package" from the command line for a small Spark Scala application, I'm getting the "value $ is not a member of StringContext" compilation error on the following line of code:
val joined = ordered.join(empLogins, $"login" === $"username", "inner")
.orderBy($"count".desc)
.select("login", "count")
IntelliJ 13.1 gives me the same error message. The same .scala source code compiles without any issue in Eclipse 4.4.2, and it also works fine with Maven from the command line in a separate Maven project.
It looks like sbt doesn't recognize the $ sign because I'm missing some plugin in my project/plugins.sbt file or some setting in my build.sbt file.
Are you familiar with this issue? Any pointers will be appreciated. I can provide build.sbt and/or project/plugins.sbt if need be.
You need to make sure you import sqlContext.implicits._
This gets you the implicit class StringToColumn extends AnyRef,
which is documented as:
Converts $"col name" into a Column.
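A minimal Spark 1.x sketch of that import, assuming an existing SparkContext named sc (which is the version range the question's build appears to target):

// Spark 1.x: the $ syntax comes from sqlContext.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// $"login" now resolves through the StringToColumn implicit
val loginCol = $"login"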
In Spark 2.0+
the $-notation for columns can be used by importing the implicits on the SparkSession object (spark):
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("App name")
.getOrCreate()
import spark.implicits._
Then your code with the $ notation works:
val joined = ordered.join(empLogins, $"login" === $"username", "inner")
.orderBy($"count".desc)
.select("login", "count")
Great answers. If resolving the import is a concern, then this will also work (note that not() lives in org.apache.spark.sql.functions, so it needs its own import):
import org.apache.spark.sql.{SparkSession, SQLContext}
import org.apache.spark.sql.functions.not
val ss = SparkSession.builder().appName("test").getOrCreate()
val dataDf = ...
import ss.sqlContext.implicits._
dataDf.filter(not($"column_name1" === "condition"))