Except function in DataFrame (Spark Scala): parameters and implementation [duplicate] - scala

This question already has an answer here:
Computing difference between Spark DataFrames
(1 answer)
Closed 5 years ago.
I see that except and NOT IN are the same in SQL, but in Spark we have an except function.
The documentation is there, but can anyone give an example of how to use this in Scala?

The path to the DataFrame class does contain the word 'sql', but it's still a class that you can create and use directly in Scala. You can call the except function:
val dfFinal = df1.except(df2)
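For a fuller picture, here is a minimal sketch, assuming a local SparkSession and two small single-column DataFrames made up for illustration:
import org.apache.spark.sql.SparkSession

// A minimal sketch: local SparkSession and made-up data, only to illustrate except.
val spark = SparkSession.builder().master("local[*]").appName("except-example").getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3, 4).toDF("id")
val df2 = Seq(3, 4).toDF("id")

// except returns the rows of df1 that are not present in df2 (a set difference, duplicates removed)
val dfFinal = df1.except(df2)
dfFinal.show()   // leaves the rows with id 1 and 2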

Difference between type DataSet[Row] and sql.DataFrame in Spark Scala [duplicate]

This question already has answers here:
Spark: How can DataFrame be Dataset[Row] if DataFrame's have a schema
(2 answers)
Closed 2 years ago.
I am confused about two data types, Dataset[Row] and sql.DataFrame. Various documents mention that a DataFrame is nothing but a Dataset[Row]. Then what is sql.DataFrame?
Below is the code where I see different types returned.
Can you please explain the difference between these?
The code below returns type Dataset[Row] (as per the return type of the method shown in IntelliJ):
serverDf.select(from_json(col("value"), schema) as "event")
  .select("*")
  .filter(col("event.type").isin(eventTypes: _*))
The code snippet below returns type sql.DataFrame:
serverDf.select(from_json(col("value"), schema) as "event")
  .select("*")
Thanks in advance
They are the same thing, as stated in the documentation:
Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
It's just a type alias:
type DataFrame = Dataset[Row]
They might show different result types in IntelliJ because the two methods have different declared return types in their signatures.
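As a quick illustration, here is a minimal sketch (assuming a local SparkSession; the data is made up) showing that values of the two "types" can be assigned to each other without any conversion, because one is an alias for the other:
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// A minimal sketch: DataFrame is just a type alias for Dataset[Row].
val spark = SparkSession.builder().master("local[*]").appName("alias-demo").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "value")

val ds: Dataset[Row] = df.filter($"value" > 1)   // filter is declared to return Dataset[Row] here
val back: DataFrame = ds.select("*")             // select is declared to return DataFrame
val same: Dataset[Row] = back                    // still the same type, only the name differs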

Filtering rows on Spark for multiple columns sharing the same value [duplicate]

This question already has an answer here:
Spark SQL filter multiple fields
(1 answer)
Closed 3 years ago.
I was trying to search for this on Stack Overflow, but I couldn't find anything. Is there Spark syntax that filters rows where two or more columns share the same value? For instance, something like
dataFrame.filter($"col01" == $"col02"== $"col03")
Yes, there is. You almost got it right: use three '=' signs (===), and combine the pairwise comparisons with && (chaining === directly would compare a Boolean result with the third column):
dataFrame.filter($"col01" === $"col02" && $"col02" === $"col03")
Example:
val df = spark.sparkContext.parallelize(Array((1,1,1),(1,2,3))).toDF("col01","col02","col03")
df.filter($"col01" === $"col02"=== $"col03").show(false)
Result:
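If you need this for more than a handful of columns, a hedged sketch (assuming the df from the example above and spark.implicits._ in scope) is to build the pairwise equalities and reduce them with &&:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// A sketch for N columns: compare each adjacent pair and AND the comparisons together.
val cols = Seq("col01", "col02", "col03")
val allEqual: Column = cols.sliding(2).map { case Seq(a, b) => col(a) === col(b) }.reduce(_ && _)

df.filter(allEqual).show(false)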

What is the advantage of using $"col" over "col" in spark data frames [duplicate]

This question already has an answer here:
Spark Implicit $ for DataFrame
(1 answer)
Closed 3 years ago.
Let us say I have a DF created as follows:
val posts = spark.read
  .option("rowTag", "row")
  .option("attributePrefix", "")
  .schema(Schemas.postSchema)
  .xml("src/main/resources/Posts.xml")
What is the advantage of using posts.select($"Id"), which converts the name to a Column, over posts.select("Id")?
df.select("col") operates on the column name directly, while $"col" creates a Column instance. You can also create Column instances with the col function. Columns can be composed into complex expressions, which can then be passed to any of the DataFrame functions.
You can also find examples and more usages in the Scaladoc of the Column class.
Ref - https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Column
There is no particular advantage; it's an automatic conversion anyway. But not all methods in Spark SQL perform this conversion, so sometimes you have to pass the Column object explicitly with $.
There is not much difference, but some functionality can only be used with $ before the column name.
Example: when we want to sort descending on this column, it will not compile without $ before the column name:
Window.orderBy("Id".desc)
But if you use $ before the column name, it works:
Window.orderBy($"Id".desc)

Filtering with Scala and Apache Spark [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I have created an unlabeled Dataset which has some columns. The values in one of the columns are France, Germany, France, and UK.
I know how to filter and count using the code below:
val b =data.filter(_.contains("France")).count
However, I am not sure how to count values other than France.
I tried the code below, but it is giving me the wrong result:
val a =data.filter(x=>x!="France").count
PS: My question is a bit similar to Is there a way to filter a field not containing something in a spark dataframe using scala? but I am looking for a simpler answer.
With x => x != "France" you are comparing the whole element to "France", while your first snippet checks whether the element contains "France". Negate the contains check instead.
Try this:
val a=data.filter(!_.contains("France")).count
To cricket_007's point, it should be something like this:
val myDSCount = data.filter(row => row._1 != "France").count()
I am not sure which column your data is in, so row._1 may need to change to the correct position. You can run the following to see all of your columns:
data.printSchema
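As a minimal sketch (assuming, on my part, that data is a Dataset[String] of raw lines as in the first answer, and a local SparkSession):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("filter-count").getOrCreate()
import spark.implicits._

// Made-up sample lines standing in for the unlabeled dataset.
val data = Seq("France", "Germany", "France", "UK").toDS()

val franceCount = data.filter(_.contains("France")).count()      // 2
val notFranceCount = data.filter(!_.contains("France")).count()  // 2

// Alternatively, count every distinct value in one pass instead of filtering twice:
data.groupBy($"value").count().show()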

Grouping by key with Apache Spark but want to apply concat between values instead of using an aggregate function

I'm learning Spark and want to perform the following task: I want to use group by, but the grouping shown below is different and not the usual aggregation in Spark. Any help will be appreciated.
I have an RDD[(String, String)] with data ->
8 kshitij
8 vini
8 mohan
8 guru
5 aashish
5 aakash
5 ram
I want to convert it to an RDD[(String, Set[String])] ->
8 Set[kshitij, vini, mohan, guru]
5 Set[aashish, aakash, ram]
As user52045 said in the comments, you can just use groupByKey, which results in an RDD[(String, Iterable[String])]. This is part of PairRDDFunctions, available through implicit conversions for any RDD of Tuple2.
The only open question is whether you're OK with an Iterable, or if it has to be a Set, which would require an additional step of calling mapValues, or some customization through aggregateByKey if you want it in one go; both options are sketched below.
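A minimal sketch, assuming a local SparkContext; the sample pairs mirror the data in the question:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("group-to-set").getOrCreate()
val sc = spark.sparkContext

val rdd: RDD[(String, String)] = sc.parallelize(Seq(
  ("8", "kshitij"), ("8", "vini"), ("8", "mohan"), ("8", "guru"),
  ("5", "aashish"), ("5", "aakash"), ("5", "ram")
))

// Option 1: groupByKey, then turn each Iterable into a Set.
val grouped: RDD[(String, Set[String])] = rdd.groupByKey().mapValues(_.toSet)

// Option 2: aggregateByKey builds the Set during the shuffle, in one go.
val aggregated: RDD[(String, Set[String])] = rdd.aggregateByKey(Set.empty[String])(_ + _, _ ++ _)

grouped.collect().foreach(println)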