spark dataframe filter and select - scala

I have a spark scala dataframe and need to filter the elements based on condition and select the count.
val filter = df.groupBy("user").count().alias("cnt")
val **count** = filter.filter(col("user") === ("subscriber").select("cnt")
The error i am facing is value select is not a member of org.apache.spark.sql.Column
Also for some reasons count is Dataset[Row]
Any thoughts to get the count in a single line?

DataSet[Row] is DataFrame
RDD[Row] is DataFrame so no need to worry.. its dataframe
see this for better understanding... Difference between DataFrame, Dataset, and RDD in Spark
Regarding select is not a member of org.apache.spark.sql.Column its purely compile error.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter (col("user") === ("subscriber"))
.select("cnt")
will work since you are missing ) braces which is closing brace for filter.

You are missing ")" before .select, Please check below code.
Column class don't have .select method, you have to invoke select on Dataframe.
val filter = df.groupBy("user").count().alias("cnt")
val **count** = filter.filter(col("user") === "subscriber").select("cnt")

Related

How to change column type for a list of dataframe columns

I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
All the examples found so far however only allow casting for a single column (df.withColumn) or for all the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf)((filteredDf, c) => filteredDf.withColumn(c, col(c).cast("String")))
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions col
val columnsToCast: Set[String]
val outputType: String = "string"
df.select(df.columns map (
c => if(columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
* Execution plan will be the same for a single select as with chained withColumn.

spark scala reducekey dataframe operation

I'm trying to do a count in scala with dataframe. My data has 3 columns and I've already loaded the data and split by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in dataframe, and having some trouble on the syntax
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
spark needs to know the schema of the df
there are many ways to specify the schema, here is one option:
val df = file
.map(line=>line.split("\t"))
.map(l => (l(0), l(1).toInt)) //at this point spark knows the number of columns and their types
.toDF("a", "b") //give the columns names for ease of use
df
.groupby('a)
.count()

null pointer exception while converting dataframe to list inside udf

I am reading 2 different .csv files which has only column as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
trying to search if dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me null pointer exception.
but if I convert dF1 outside and use list in udf it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use join to get this done but want to know what is the reason of null pointer exception here.
Thanks.
Please check this question about accessing dataframe inside the transformation of another dataframe. This is exactly what you are doing with your UDF, and this is not possible in spark. Solution is either to use join, or collect outside of transformation and broadcast.

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's column by their numbered index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
val df2cols = df2.columns
val desiredDf2Col = df2cols(1) // the second column
val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
.select($"df1.*",$"df2.$desiredDf2Col")
df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select methods that you want to define in reduce function is similar to Joining two DataFrames in Spark SQL and selecting columns of only one Then you should do the following :
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the name of the second column i.e. Seq(1) should not be same as any of the dataframes column names.
You can select multiple columns as well but remember the bold note above
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

Spark Select with a List of Columns Scala

I am trying to find a good way of doing a spark select with a List[Column, I am exploding a column than passing back all the columns I am interested in with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) //trying to get
Trying to find a good way of doing this I know, if it were a string I could do something like
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For spark 2.0 seems that you have two options. Both depends on how you manage your columns (Strings or Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
Project(cols.map(_.named), logicalPlan)
}
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see how internally spark is converting your head & tail to a list of Columns to call again Select.
So, in that case if you want a clear code I will recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Columns]:
df.select(columns: _*)