Filter a DataFrame by Array Column - scala

I want to filter a DataFrame that has a column with categories (List[String]). I want to ignore all the rows that have an invalid category; a category is invalid when it is not in model.getCategories.
def checkIncomingData(model: Model, incomingData: DataFrame): DataFrame = {
  val list = model.getCategories.toList
  sc.broadcast(list)
  incomingData.filter(incomingData("categories").isin(list))
}
Unfortunately my approach does not work because categories is a list, not a single element. Any idea how to make it work?

The first problem I see is that you didn't assign the broadcast to a variable.
val broadcastList = sc.broadcast(list)
Besides, you have to reference it using broadcastList.value. For instance:
incomingData.filter($"categories".isin(broadcastList.value: _*))
NOTE
#LostInOverflow made an important contribution: he clarified my answer and pointed out that the method isin is actually evaluated on the driver, so broadcasting the list doesn't help at all, and more importantly the list must be expanded in order to be evaluated.

Just expand list:
incomingData.filter(incomingData("categories").isin(list: _*))
Note: broadcasting won't help you here. This is evaluated on driver.
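Putting both answers together, a minimal sketch of the method from the question could look like this (Model and getCategories are the question's own names; this is illustrative rather than tested):
// Expand the valid categories to varargs; no broadcast is needed because
// isin is resolved on the driver when the query plan is built.
def checkIncomingData(model: Model, incomingData: DataFrame): DataFrame = {
  val validCategories = model.getCategories.toList
  incomingData.filter(incomingData("categories").isin(validCategories: _*))
}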

Related

Why does nested flatMap - map in Scala give an RDD of type Object instead of a list of tuples?

I have an RDD that I want to group according to some key, but it just doesn't work. I am a Scala and Spark beginner. So I have the following RDD:
rdd: RDD[WikipediaArticle]
val meinVal = rdd.flatMap(article => langs.map(lang => { if (article.mentionsLanguage(lang)) { Tuple2(lang, article) } else { None } })).filter(_ != None)
meinVal.collect.foreach(println) gives:
(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))
I have two questions:
Why can I not apply the groupByKey function? It is an RDD that contains a list of tuples, and the first tuple entry is the key.
I don't see how to apply groupBy either. I thought I could do meinVal.groupBy(x => x._1), but that throws an error.
I notice that when I use an IDE and hover over "meinVal" it shows that it is RDD[Object], whereas it should be RDD[(String, WikipediaArticle)]. I do not know how to get this information without the IDE. So it seems that the rdd contains just one big object. I just don't see why that is.
Anyone? Please?
Irene
Ok, so thanks to this post https://stackoverflow.com/a/29426336/909909 I figured it out. The problem was not the nested flatMap-map construct, but the condition in the map instruction. In my code I returned None if the condition was not met. Since None is not of type Tuple2, I get an RDD[Object] and therefore I cannot use groupByKey.
To solve this I use Option and then flatten the rdd to get rid of the Option and its Nones again.
val meinVal = rdd.flatMap( article=> langs.map(lang=> { if(article.mentionsLanguage(lang)){Some(Tuple2(lang,article))}else{None}}).flatten)
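For completeness, here is a hedged sketch of the same fix with the grouping applied afterwards; WikipediaArticle, langs and mentionsLanguage are the names from the question:
import org.apache.spark.rdd.RDD
// Equivalent to the map + flatten above, written as a flatMap over an Option.
// Every element is now a (String, WikipediaArticle) pair, so the RDD is typed
// as RDD[(String, WikipediaArticle)] and groupByKey becomes available.
val meinVal: RDD[(String, WikipediaArticle)] =
  rdd.flatMap(article =>
    langs.flatMap(lang =>
      if (article.mentionsLanguage(lang)) Some((lang, article)) else None))
val grouped: RDD[(String, Iterable[WikipediaArticle])] = meinVal.groupByKey()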

What is the use of def first(columnName : String) : Column in Spark SQL?

def first is part of org.apache.spark.sql.functions.
I see it getting used in many places. Sample code:
override def extractPersonDataset: DataFrame = {
  super.extractPersonDataset
    .withColumn("marketplace_id", lit(marketplaceId))
    .groupBy(joinColumns.head, joinColumns.tail: _*)
    .agg(
      collect_list("college_id") as "college_ids",
      collect_list("nickname") as "nickname",
      collect_list("type") as "types",
      first("warehouse_id") as "warehouse_id",
      first("operating_model") as "operating_model",
      first("type") as "type",
      first("has_membership") as "has_membership"
    )
}
I searched many blogs and the documentation, but none gave me a clear understanding of how it works. Can someone please explain what it does?
For me it has these usages:
in agg-clause after groupBy:
I use first as an alternative to putting a column in the groupBy clause. This makes sense if the value is constant per group, or if it just does not matter. Since the ordering before the groupBy is not guaranteed to be preserved by the groupBy (as far as I know), you cannot use it to get the first value; it just takes any value.
With window-functions:
Here you can use it to get the real first value given the ordering of your frame. E.g. if you have the attributes time and measured temperature, you can get the first measured temperature with first($"temperature").over(Window.orderBy($"time")). Note that you can achieve the same with min/max in combination with structs (min(struct($"time", $"temperature")).over(Window...), then extract temperature after minimizing time), but this is kind of a hassle. first is also more powerful than min/max because you have the option to skip nulls.
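To make the two usages above concrete, here is a small sketch on made-up data (a SparkSession named spark with spark.implicits._ in scope is assumed; column names and values are purely illustrative):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}
// Toy readings with one missing temperature, to show the ignoreNulls option.
val readings = Seq(
  ("sensorA", 1L, Option.empty[Double]),
  ("sensorA", 2L, Option(21.5)),
  ("sensorA", 3L, Option(22.0))
).toDF("sensor", "time", "temperature")
// 1) In agg after groupBy: picks some value per group, ordering not guaranteed.
readings.groupBy("sensor").agg(first("temperature", ignoreNulls = true) as "a_temperature")
// 2) With a window: the first value according to an explicit ordering.
val byTime = Window.partitionBy("sensor").orderBy(col("time"))
readings.withColumn("first_temperature", first(col("temperature"), ignoreNulls = true).over(byTime))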

Dataframe: Adding prefix to all columns in Scala

val prefix = "ABC"
val renamedColumns = df.columns.map(c=> df(c).as(s"$prefix$c"))
val dfNew = df.select(renamedColumns: _*)
Hi,
I am fairly new to Scala and the code above works perfectly to add a prefix to all columns. Can someone please explain the breakdown of how it works?
The second line above will return an array of columns: col1 as ABCcol1, col2 as ABCcol2, ... etc.
I have trouble understanding what the third line is doing, especially the : _* at the end.
thanks for your help in advance.
The third line is an example of Scala's syntactic sugar. Essentially, Scala has ways to shorten exactly what you have to type, and you have discovered the dreaded : _*.
There are two portions to this small bit: the : and the _* serve two different purposes. The : is typically for type ascription, which tells the compiler "this is the type that I need to use for this method". The _*, however, is your type: in Scala it denotes varargs. Varargs is a parameter type that accepts an arbitrary number of values. It allows you to pass a method a sequence whose number of elements you do not know.
In your example, you are creating a variable called renamedColumns from the columns of your original dataframe, with the new string appendage. Although you may know just how many columns are in your df, Scala does not. When you create dfNew, you are running a select statement on that and passing in your new column names, of which there could be an arbitrary number.
Essentially, you do not know how many columns you may have, so you pass your columns as varargs, allowing the number of arguments to be arbitrary.
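A tiny non-Spark sketch of the same mechanism (the method name joinAll is made up purely for illustration):
// A varargs method: parts accepts any number of String arguments.
def joinAll(parts: String*): String = parts.mkString(", ")
val renamed = Seq("ABCcol1", "ABCcol2", "ABCcol3")
joinAll("a", "b", "c") // passing individual arguments
joinAll(renamed: _*)   // expanding a Seq into varargs, just like select(renamedColumns: _*)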

Scala's collect inefficient in Spark?

I am currently starting to learn to use Spark with Scala. The problem I am working on needs me to read a file, split each line on a certain character, then filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This meant going through the collection 3 times, and that seemed quite unreasonable to me. So I tried replacing them with one collect (the collect that takes a partial function as an argument). Much to my surprise, this made it run much slower. I tried locally on regular Scala collections; as expected, the latter way of doing it is much faster.
So why is that? My idea is that the map and filter and map are not applied sequentially, but rather mixed into one operation; in other words, when an action forces evaluation, every element of the list will be checked and the pending operations will be executed. Is that right? But even so, why does the collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case s if (s.split(" ")(0).contains("hello")) => s.split(" ")(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter -> map sequence, but less efficient in your case.
In Scala, both the isDefinedAt and apply methods of a PartialFunction evaluate the if (guard) part.
So, in your "collect" example, split will be performed twice for each input element.
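If the goal is to split each line only once, one possible rewrite (not taken from the answer above, just a sketch mirroring the naive version's logic; the input path is a placeholder) is a single flatMap that returns an Option:
// Split once per line; keep the first token when the second one matches.
// None results are dropped by flatMap, so no separate filter pass is needed.
sc.textFile("input.txt").flatMap { line =>
  val s = line.split(" ")
  if (s.length > 1 && s(1).contains("hello")) Some(s(0)) else None
}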

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find in the DataFrame API is RDD, so I tried converting it back to an RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there is an appropriate way to convert a column to a list, or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return a collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of Any type. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the r => r(0).asInstanceOf[YOUR_TYPE] mapping.
P.S. due to automatic conversion you can skip the .rdd part.
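For example, a minimal typed variant of the snippet above, assuming the column holds strings:
// Same pattern, but cast each value so the result is a List[String].
val values: List[String] =
  dataFrame.select("YOUR_COLUMN_NAME").rdd
    .map(r => r(0).asInstanceOf[String])
    .collect()
    .toList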
With Spark 2.x and Scala 2.11
I can think of 3 possible ways to convert the values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(one, two, three)
What happens now? We are collecting data to Driver with collect() and picking element zero from each record.
This could not be an excellent way of doing it, Let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
//res10: List[Any] = List(one, two, three)
How is it better? We have distributed map transformation load among the workers rather than a single Driver.
I know rdd.map(r => r(0)) does not seems elegant you. So, let's address it in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
//res11: List[String] = List(one, two, three)
Here we are not converting DataFrame to RDD. Look at map it won't accept r => r(0)(or _(0)) as the previous approach due to encoder issues in DataFrame. So end up using r => r.getString(0) and it would be addressed in the next versions of Spark.
Conclusion
All the options give the same output, but 2 and 3 are more effective; finally, the 3rd one is both effective and elegant (I'd think).
I know the question and the given answer assume Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function, and I do not need the select statement.
i.e. A DataFrame, containing a column named "Raw"
To get each row value in "Raw" combined as a list where each entry is a row value from "Raw" I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select('s).as[String].collect
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
it works perfectly
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();
logger.info(String.format("list is %s", whatever_list)); // verification
Since no one has given any solution in Java (a real programming language), here is one.
You can thank me later.
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect is the function which in turn converts the result to a list.
Beware of using a list on a huge data set; it will decrease performance.
It is good to check the data.
Below is for Python-
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is a Java answer.
df.select("id").collectAsList();