I'm trying to understand why I can filter on a column that I have previously dropped.
This simple script:
package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Name")
      .master("local[*]")
      .config("spark.driver.host", "localhost")
      .config("spark.ui.enabled", "false")
      .getOrCreate()

    import spark.implicits._

    List(("a0", "a1"), ("b0", "b1"))
      .toDF("column1", "column2")
      .drop("column2")
      .where(col("column2").startsWith("b"))
      .show()
  }
}
shows the following output:
+-------+
|column1|
+-------+
| b0|
+-------+
I expected an error saying that "column2" is not available when I try to use it in .where(<condition>).
Snippet from my build.sbt:
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4" excludeAll ExclusionRule("org.apache.hadoop")
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.2.1"
Is there some documentation on this behaviour? And why is it even possible?
This is because Spark pushes down the filter/predicate: the optimizer rewrites the query so that the filter is applied before the "projection". The same occurs with select instead of drop.
This can be beneficial because the filter can be pushed all the way down to the data source.
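You can see the optimizer at work with explain(): in the analyzed plan the Filter still references the full relation (including column2), while the Project that drops the column is applied later. A minimal sketch, assuming the same SparkSession setup as in the question:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

// Same pipeline as above, but asking Spark to print all query plans.
// The parsed/analyzed plans show the Filter resolved against the full
// relation; the optimized plan shows where the predicate ends up.
List(("a0", "a1"), ("b0", "b1"))
  .toDF("column1", "column2")
  .drop("column2")
  .where(col("column2").startsWith("b"))
  .explain(true)
```

For file-based sources (Parquet, ORC, CSV) the same mechanism can surface as PushedFilters in the physical plan, which is where the performance benefit comes from.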
Related
How can I start Spark from an sbt shell? I don't want to use the spark-shell command; I would like to use Spark with the objects in my sbt project.
Add the Spark dependencies to build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1"
Run sbt console:
sbt console
Load spark session/context:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local").appName("spark-shell").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
Or automate these commands by adding them to initialCommands in build.sbt:
initialCommands in console := s"""
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local").appName("spark-shell").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
"""
I use Spark 2.3.0.
The following code fragment works fine in spark-shell:
def transform(df: DataFrame): DataFrame = {
  df.select(
    explode($"person").alias("p"),
    $"history".alias("h"),
    $"company_id".alias("id")
  )
}
Yet when editing within IntelliJ, it does not recognize the select, explode and $ functions. These are my dependencies in sbt:
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= {
  val sparkVer = "2.1.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVer % "provided" withSources(),
    "org.apache.spark" %% "spark-sql" % sparkVer % "provided" withSources()
  )
}
Is there anything missing? An import statement, or an additional library?
You should use the following import in the transform method (to have explode available):
import org.apache.spark.sql.functions._
You could also do the following to be precise on what you import.
import org.apache.spark.sql.functions.explode
It works in spark-shell since it does the import by default (so you don't have to worry about such simple things :)).
scala> spark.version
res0: String = 2.3.0
scala> :imports
1) import org.apache.spark.SparkContext._ (69 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
4) import org.apache.spark.sql.functions._ (354 terms)
As to $ it is also imported by default in spark-shell for your convenience. Add the following to have it in your method.
import spark.implicits._
Depending on where you have transform method defined you may add an implicit parameter to the transform method as follows (and skip adding the import above):
def transform(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
...
}
I'd however prefer using the SparkSession bound to the input DataFrame (which seems cleaner and...geeker :)).
def transform(df: DataFrame): DataFrame = {
import df.sparkSession.implicits._
...
}
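Putting the two pieces together, the complete method would look like the sketch below (a compile-level sketch assuming only spark-sql on the classpath; the field names person, history and company_id are the ones from the question):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.explode

def transform(df: DataFrame): DataFrame = {
  // Brings $ (the string-to-Column interpolator) into scope from the
  // SparkSession already attached to the input DataFrame, so the method
  // needs no implicit SparkSession parameter.
  import df.sparkSession.implicits._
  df.select(
    explode($"person").alias("p"),
    $"history".alias("h"),
    $"company_id".alias("id")
  )
}
```

This way the method is self-contained: callers don't have to thread a SparkSession through, and it works the same in spark-shell and in a compiled project.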
As a bonus, I'd also cleanup your build.sbt so it would look as follows:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided" withSources()
You won't be using artifacts from spark-core in your Spark SQL applications (and it's a transitive dependency of spark-sql).
IntelliJ does not have the spark.implicits._ import in scope by default, which is why explode is flagged as an error. Remember to create the SparkSession with SparkSession.builder() before importing spark.implicits._.
Apply the following code, this works:
val spark = SparkSession.builder()
  .master("local")
  .appName("ReadDataFromTextFile")
  .getOrCreate()

import spark.implicits._

val jsonFile = spark.read.option("multiLine", true).json("d:/jsons/rules_dimensions_v1.json")
jsonFile.printSchema()
//jsonFile.select("tag").select("name").show()
jsonFile.show()

val flattened = jsonFile.withColumn("tag", explode($"tag"))
flattened.show()
I'm trying to build and run a Scala/Spark project in IntelliJ IDEA.
I have added org.apache.spark:spark-sql_2.11:2.0.0 in global libraries and my build.sbt looks like below.
name := "test"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"
I still get an error that says
unknown artifact. unable to resolve or indexed
under spark-sql.
When tried to build the project the error was
Error:(19, 26) not found: type sqlContext
val sqlContext = new sqlContext(sc)
I have no idea what the problem could be. How to create a Spark/Scala project in IntelliJ IDEA?
Update:
Following the suggestions, I updated the code to use SparkSession, but it is still unable to read a CSV file. What am I doing wrong here? Thank you!
val spark = SparkSession
  .builder()
  .appName("Spark example")
  .config("spark.some.config.option", "some value")
  .getOrCreate()

import spark.implicits._

val testdf = spark.read.csv("/Users/H/Desktop/S_CR_IP_H.dat")
testdf.show() //it doesn't show anything
//pdf.select("DATE_KEY").show()
SQLContext should be written with uppercase letters, as below:
val sqlContext = new SQLContext(sc)
SQLContext is deprecated in newer versions of Spark, so I would suggest you use SparkSession:
val spark = SparkSession.builder().appName("testings").getOrCreate
val sqlContext = spark.sqlContext
If you want to set the master through your code instead of from the spark-submit command, you can set .master as well (and configs too):
val spark = SparkSession.builder().appName("testings").master("local").config("configuration key", "configuration value").getOrCreate
val sqlContext = spark.sqlContext
Update
Looking at your sample data
DATE|PID|TYPE
8/03/2017|10199786|O
and testing your code
val testdf = spark.read.csv("/Users/H/Desktop/S_CR_IP_H.dat")
testdf.show()
I had output as
+--------------------+
| _c0|
+--------------------+
| DATE|PID|TYPE|
|8/03/2017|10199786|O|
+--------------------+
Now, adding .option for the delimiter and header:
val testdf2 = spark.read.option("delimiter", "|").option("header", true).csv("/Users/H/Desktop/S_CR_IP_H.dat")
testdf2.show()
Output was
+---------+--------+----+
| DATE| PID|TYPE|
+---------+--------+----+
|8/03/2017|10199786| O|
+---------+--------+----+
Note: I have used .master("local") for SparkSession object
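If you also want typed columns rather than everything being read as strings, you can additionally ask Spark to infer the schema. A small sketch reusing the same (hypothetical) path as above; note that inferSchema makes an extra pass over the data:

```scala
// delimiter + header as before, plus schema inference so DATE stays a
// string but PID becomes a numeric column instead of a string.
val testdf3 = spark.read
  .option("delimiter", "|")
  .option("header", true)
  .option("inferSchema", true)
  .csv("/Users/H/Desktop/S_CR_IP_H.dat")
testdf3.printSchema()
```

Alternatively, you can pass an explicit schema with .schema(...) to avoid the extra pass entirely.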
(That should really be part of the Spark official documentation)
Replace the following from your configuration in build.sbt:
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"
with the following:
// the latest Scala version that is compatible with Spark
scalaVersion := "2.11.11"
// Few changes here
// 1. Use double %% so you don't have to worry about Scala version
// 2. I doubt you need spark-core dependency
// 3. Use the latest Spark version
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
Don't worry about IntelliJ IDEA telling you the following:
unknown artifact. unable to resolve or indexed
It's just something you have to live with and the only solution I could find is to...accept the annoyance.
val sqlContext = new sqlContext(sc)
The real type is SQLContext, but as the scaladoc says:
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
Please use SparkSession instead.
The entry point to programming Spark with the Dataset and DataFrame API.
See the Spark official documentation to read on SparkSession and other goodies. Start from Getting Started. Have fun!
I'm trying to join two DataFrames in Spark, and I'm getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'Adapazari' is ambiguous,
could be: Adapazari#100064, Adapazari#100065.;
According to several sources, this can occur when you try to join two different dataframes together that both have a column with the same name (1, 2, 3). However, in my case, that is not the source of the error. I can tell because (1) my columns all have different names, and (2) the reference indicated in the error is a value contained within the join column.
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession
  .builder().master("local")
  .appName("Spark SQL basic example")
  .config("master", "spark://myhost:7077")
  .getOrCreate()
val sqlContext = spark.sqlContext
import sqlContext.implicits._

val people = spark.read.json("/path/to/people.jsonl")
  .select($"city", $"gender")
  .groupBy($"city")
  .pivot("gender")
  .agg(count("*").alias("total"))
  .drop("0")
  .withColumnRenamed("1", "female")
  .withColumnRenamed("2", "male")
  .na.fill(0)

val cities = spark.read.json("/path/to/cities.jsonl")
  .select($"name", $"longitude", $"latitude")

cities.join(people, $"name" === $"city", "inner")
  .count()
Everything works great until I hit the join line, and then I get the aforementioned error.
The relevant lines in build.sbt are:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
  "com.databricks" % "spark-csv_2.10" % "1.5.0",
  "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
It turned out that this error was due to malformed JSONL. Fixing the JSONL formatting solved the problem.
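For readers who do hit this error for the usual reason (both inputs sharing a column name), aliasing the DataFrames before the join makes the references unambiguous. A hedged sketch with hypothetical frames df1 and df2 that both carry a "name" column:

```scala
import org.apache.spark.sql.functions.col

// Alias each side, then qualify every column reference with its alias
// so the analyzer knows which input "name" belongs to.
val joined = df1.alias("a")
  .join(df2.alias("b"), col("a.name") === col("b.name"), "inner")
  .select(col("a.name"), col("b.city"))
```

The column names here are illustrative; the point is the "alias.column" qualification, which removes the Reference ... is ambiguous error when the same name genuinely exists on both sides.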
I am following a Spark example from here http://spark.apache.org/docs/latest/sql-programming-guide.html.
val people = sc.textFile("../spark-training/simple-app/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
I get the error that registerTempTable is not recognized.
After looking at some Github projects, it seems to me that I have the necessary imports:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("Select people")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
And build.sbt:
name := "exercises"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.0.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.1"
What am I missing?
In your code, people is an RDD. registerTempTable is a DataFrame API, not an RDD API. Your code drops the toDF() call from the end of the example. Your first line should be as below:
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
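Note also that in Spark 2.x, registerTempTable itself was deprecated in favour of createOrReplaceTempView. A sketch of the modern equivalent, assuming a SparkSession named spark and the same Person case class as in the official example:

```scala
case class Person(name: String, age: Int)
import spark.implicits._

// Build the DataFrame exactly as before, ending with toDF()
val people = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// registerTempTable("people") is deprecated; use:
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 13").show()
```

The view name works the same way in spark.sql queries as the old temp table name did.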