Filter Spark Dataframe with list of values in Scala - scala

I am trying to create a dataframe from hive table using SparkSession like below. Once created I am filtering the rows by a list of Ids.
val myDF = spark.sql("select * from myhivetable")
val someDF = myDF.where(myDF("id").isin(myList:_*))
Instead of this approach is there a way I can query the hive table as below:
val myDF = spark.sql("select * from myhivetable").where (("id").isin(myList:_*))
When I try like this I am getting a compilation error.
Could someone suggest the best approach for this? Thanks.

You could also do an inner join to remove the unwanted ids; something like the following may work.
val ids = sc.parallelize(myList).toDF("id")
someDF.join(ids, ids("id") === someDF("id"))
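As for the compilation error in the question: `("id")` is a plain String, and String has no `isin` method, so the chained form needs a Column instead. A minimal sketch, assuming a local SparkSession and a small hypothetical dataframe standing in for the hive table:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("isin-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for spark.sql("select * from myhivetable")
val source = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val myList = List(1, 3)

// col("id") builds a Column, which has isin; the bare String "id" does not
val filtered = source.where(col("id").isin(myList: _*))
filtered.show()
```

With `import spark.implicits._` in scope, `$"id"` can be used in place of `col("id")`.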

Related

Subquery vs Dataframe filter function in spark

I am running the below spark SQL with the subquery.
val df = spark.sql("""select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)""")
df.count()
I also ran the same logic with the dataframe API, as shown below. Let's assume the employee and department tables have been read as dataframes named empDF and DepDF respectively:
val depidList = DepDF.map(x => x.getString(0)).collect().toList
val empdf2 = empDF.filter(col("dep_id").isin(depidList:_*))
empdf2.count
Of these two scenarios, which one gives better performance, and why? Please help me understand these scenarios in Spark Scala.
I can give you the classic answer: it depends :D
Let's take a look at the first case. I prepared a similar example:
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86"))
val data2 = Seq(("test", "3"),("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76"))
val df1 = data.toDF("name", "dep_id")
val df2 = data2.toDF("name", "dep_id")
df1.createOrReplaceTempView("employeesTableTempview")
df2.createOrReplaceTempView("departmentTableTempview")
val result = spark.sql("select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")
result.count
I am setting autoBroadcastJoinThreshold to -1 because I assume that your datasets are going to be bigger than the default 10 MB for this parameter.
This Sql query generates this plan:
As you can see, Spark is performing a sort-merge join (SMJ), which will be the case most of the time for datasets bigger than 10 MB. This requires the data to be shuffled and then sorted, so it is quite a heavy operation.
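The plan itself is not reproduced here, but it can be printed locally with `explain()`. A minimal sketch, assuming a local SparkSession and the same sample data; with broadcast joins disabled, the physical plan would be expected to contain a sort-merge join node, though the exact output depends on the Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("plan-sketch").getOrCreate()
import spark.implicits._

// Disable broadcast joins so the planner has to pick a shuffle-based join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val employees = Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86")).toDF("name", "dep_id")
val departments = Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76")).toDF("name", "dep_id")
employees.createOrReplaceTempView("employeesTableTempview")
departments.createOrReplaceTempView("departmentTableTempview")

val result = spark.sql(
  "select * from employeesTableTempview where dep_id in (select dep_id from departmentTableTempview)")

// Prints the physical plan to stdout
result.explain()
```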
Now let's check option 2 (the first lines of code are the same as before):
val depidList = df2.map(x => x.getString(1)).collect().toList
val empdf2 = df1.filter(col("dep_id").isin(depidList:_*))
empdf2.count
For this option the plan is different. There is obviously no join, but there are two separate queries. The first reads the DepDF dataset and collects one column as a list on the driver. The second uses this list to filter the empDF dataset.
When DepDF is relatively small this should be fine, but if you need a more generic solution you may want to stick with the sub-query, which resolves to a join. You can also join your dataframes directly with the Spark DataFrame API.
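One way to express the direct join is a left semi join, which keeps each employee row at most once when its dep_id appears in the department dataframe, much like the `IN` sub-query. A minimal sketch, assuming a local SparkSession and the sample data from this answer (the names `empDF` and `depDF` are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val empDF = Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "7"), ("test55", "86")).toDF("name", "dep_id")
val depDF = Seq(("test", "3"), ("test", "3"), ("test2", "5"), ("test3", "6"), ("test33", "76")).toDF("name", "dep_id")

// left_semi returns only columns from empDF and never duplicates its rows,
// even though dep_id "3" occurs twice in depDF
val matched = empDF.join(depDF.select("dep_id"), Seq("dep_id"), "left_semi")
matched.show()
```

Unlike the collect-and-isin approach, nothing is pulled to the driver, so this does not depend on the department list being small.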

null pointer exception while converting dataframe to list inside udf

I am reading 2 different .csv files, each of which has only one column, as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
I am trying to check whether dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me a null pointer exception.
But if I convert dF1 outside and use the list in the udf, it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use a join to get this done, but I want to know the reason for the null pointer exception here.
Thanks.
Please check this question about accessing a dataframe inside the transformation of another dataframe. This is exactly what you are doing with your UDF, and it is not possible in Spark: the UDF body runs on the executors, while a dataframe can only be used from the driver. The solution is either to use a join, or to collect outside of the transformation and broadcast the result.
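The collect-and-broadcast variant could look roughly like this. It is a sketch under assumptions: a local SparkSession, and small in-memory dataframes (`idDf`, `pidDf`, hypothetical names) standing in for the two .csv files.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("broadcast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for some.csv and other.csv
val idDf = Seq("a", "b", "c").toDF("ID")
val pidDf = Seq("a", "x", "c").toDF("PID")

// Collect on the driver, outside any transformation, then broadcast the plain Set
val idSet = idDf.as[String].collect().toSet
val idsBc = spark.sparkContext.broadcast(idSet)

// The UDF only touches the broadcast value, never a dataframe
val hasIdUdf = udf((x: String) => x != null && idsBc.value.contains(x))
val dfFinal = pidDf.withColumn("hasId", hasIdUdf($"PID"))
dfFinal.show()
```

A Set is used instead of a List so that each lookup inside the UDF is a hash lookup rather than a linear scan.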

how to create multiple list from dataframe in spark?

How can I create multiple lists from a dataframe in Spark?
In my case, I want to order MongoDB documents by grouping on a specific key, and create multiple lists, each grouped on the basis of one key of the schema.
Please help me.
val sparkSession = SparkSession.builder().getOrCreate()
MongoSpark.load[SparkSQL.Character](sparkSession).printSchema()
val characters = MongoSpark.load[SparkSQL.Character](sparkSession)
characters.createOrReplaceTempView("characters")
val sqlstmt = sparkSession.sql("SELECT * FROM characters WHERE site = 'website'")
...
You can do something like this:
val columns = sqlstmt.columns.map(col)
// `key` stands for the name of the column to group by
sqlstmt
.groupBy(key)
.agg(collect_list(struct(columns: _*)).as("data"))
Don't forget to import
import org.apache.spark.sql.functions._
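To make the shape of the result concrete, here is a minimal sketch of the same groupBy/collect_list pattern. It assumes a local SparkSession and a small hypothetical dataframe in place of the MongoDB collection; the column names `site`, `name` and `level` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("collect-list-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the documents loaded from MongoDB
val characters = Seq(
  ("website", "Ann", 1),
  ("website", "Bob", 2),
  ("forum", "Cy", 3)
).toDF("site", "name", "level")

val columns = characters.columns.map(col)

// One output row per site; "data" holds the list of full rows for that site
val grouped = characters
  .groupBy("site")
  .agg(collect_list(struct(columns: _*)).as("data"))
grouped.show(false)
```

The order of elements inside each collected list is not guaranteed.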

remove a column from a dataframe spark

I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?
Use the drop method (and the withColumnRenamed method if you also need to rename columns).
Example:
val initialDf= ....
val dfAfterDrop = initialDf.drop("column1").drop("column2")
val dfAfterColRename = dfAfterDrop.withColumnRenamed("oldColumnName", "newColumnName")
Try this:
val initialDf = ...
val dfAfterDropCols = initialDf.drop("column1", "column2")
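Since `drop` accepts a variable number of column names, the names can also come from a collection. A minimal sketch, assuming a local SparkSession and a hypothetical four-column dataframe:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical dataframe; only the column names matter here
val initialDf = Seq((1, "hello", true, 2.0)).toDF("post_id", "comment_message", "flag", "score")

// Column names to remove, kept in a collection and expanded as varargs
val colsToDrop = Seq("flag", "score")
val trimmed = initialDf.drop(colsToDrop: _*)
trimmed.printSchema()
```

Names in the list that do not exist in the dataframe are ignored by `drop` rather than raising an error.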

How to create a dataframe using the value of another dataframe

I am getting the suppId DataFrame using the code below.
val suppId = sqlContext.sql("SELECT supp_id FROM supplier")
The DataFrame returns a single value or multiple values.
Now I want to create a DataFrame using the values of supp_id from the suppId DataFrame, but I do not understand how to write this.
I have written the code below, but it is not working.
val nonFinalPE = sqlContext.sql("select * from pmt_expr")
nonFinalPE.where("supp_id in suppId(supp_id)")
It took me a second to figure out what you're trying to do, but it looks like you want the rows from nonFinalPE that are also in suppId. You'd get this by doing an inner join of the two data frames, which would look like this:
val suppId = sqlContext.sql("SELECT supp_id FROM supplier")
val nonFinalPE = sqlContext.sql("select * from pmt_expr")
val joinedDF = nonFinalPE.join(suppId, nonFinalPE("???") === suppId("supp_id"), "inner")
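One thing to watch with an inner join: if the same supp_id appears more than once in suppId, the matching rows of nonFinalPE are repeated. A minimal sketch of deduplicating the ids first, assuming a local SparkSession, hypothetical sample data, and (as an assumption, since the question does not confirm it) that pmt_expr also has a column named supp_id:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("distinct-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for pmt_expr and supplier; the supp_id column in
// pmt_expr is an assumption made for this example
val nonFinalPE = Seq((1, "a"), (2, "b"), (3, "c")).toDF("supp_id", "expr")
val suppId = Seq(1, 1, 2).toDF("supp_id")

// A plain inner join repeats the row for supp_id 1, once per matching id
val withDuplicates = nonFinalPE.join(suppId, Seq("supp_id"), "inner")

// Deduplicating the ids first keeps each nonFinalPE row at most once
val joinedDF = nonFinalPE.join(suppId.distinct(), Seq("supp_id"), "inner")
joinedDF.show()
```

Joining on `Seq("supp_id")` also keeps a single supp_id column in the result instead of one from each side.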