Spark: filter out columns and create a dataframe with the remaining columns and a dataframe with the filtered columns - Scala

I am new to Spark.
I have loaded a CSV file into a Spark DataFrame, say OriginalDF.
Now I want to
1. filter out some columns from it and create a new dataframe of OriginalDF with the remaining columns
2. create a dataframe out of the extracted columns
How can these 2 dataframes be created in Spark Scala?

Using select, you can pick which columns you want:
val df2 = OriginalDF.select($"col1", $"col2", $"col3")
Using where, you should be able to filter the rows:
val df3 = OriginalDF.where($"col1" < 10)
Another way to filter rows is filter. Both filter and where are synonyms, so you can use them interchangeably:
val df3 = OriginalDF.filter($"col1" < 10)
Note that select and filter each return a new dataframe as a result.
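To get both dataframes the question asks for, you can partition the column list and feed each half into select. Splitting the names is plain Scala; the column names below ("col1" to "col4") are hypothetical stand-ins for OriginalDF.columns:

```scala
// Split a column list into the columns to extract and the rest.
val allColumns = Seq("col1", "col2", "col3", "col4")
val toExtract  = Set("col3", "col4")

// partition keeps order: first the matching names, then the others.
val (extracted, remaining) = allColumns.partition(toExtract.contains)

println(remaining)  // List(col1, col2)
println(extracted)  // List(col3, col4)

// With a real DataFrame, these lists feed straight into select
// (sketch, assuming import org.apache.spark.sql.functions.col):
//   val remainingDF = OriginalDF.select(remaining.map(col): _*)
//   val extractedDF = OriginalDF.select(extracted.map(col): _*)
```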

Related

How to drop specific column and then select all columns from spark dataframe

I have a scenario here - I have 30 columns in one dataframe, need to drop a specific column, select the remaining columns, and put them into another dataframe. How can I achieve this? I tried the below.
val df1: DataFrame = df2.as("a").join(df3.as("b"), col("a.key") === col("b.key"), "inner").drop("a.col1")
  .select("a.*")
When I do a show of df1, it still shows col1. Any advice on resolving this?
drop requires a column name without the table alias, so you can try:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.drop("col1")
.select("a.*")
Or instead of dropping the column, you can filter the columns to be selected:
val df1 = df2.as("a")
.join(df3.as("b"), col("a.key") === col("b.key"), "inner")
.select(df2.columns.filterNot(_ == "col1").map("a." + _): _*)
This really just seems like a case where you need a "left_semi" join.
val df1 = df2.drop("col1").join(df3, df2("key") === df3("key"), "left_semi")
If key is an actual column present in both dataframes, you can simplify the syntax even further:
val df1 = df2.drop("col1").join(df3, Seq("key"), "left_semi")
The best syntax depends on the details of what your real data looks like. If you need to refer to col1 in df2 specifically because there's ambiguity, then use df2("col1").
left_semi joins take all the columns from the left table for rows that find a match in the right table.
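The left_semi behaviour can be pictured with plain Scala collections - a sketch with made-up keys and payloads, not Spark API:

```scala
// A left_semi join keeps every left row whose key appears on the right,
// and keeps only the left side's columns.
val left  = Seq(("k1", "a"), ("k2", "b"), ("k3", "c"))  // (key, payload)
val right = Seq("k1", "k3")                             // keys present on the right

val rightKeys  = right.toSet
val semiJoined = left.filter { case (key, _) => rightKeys.contains(key) }

println(semiJoined)  // List((k1,a), (k3,c)) - k2 had no match, so it is dropped
```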

Convert header (column names) to new dataframe

I have a dataframe with headers, for example outputDF. I now want to take outputDF.columns and create a new dataframe with just one row which contains the column names.
I then want to union both these dataframes with option("header", "false"), which Spark can then write to HDFS.
How do I do that?
Below is an example:
val df = spark.read.csv("path")
val newDf = df.columns.toSeq.toDF
val unionDf = df.union(newDf)
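The attempt above fails because df.columns.toSeq.toDF builds one ROW per column name, while the question needs one row whose CELLS are the names. The shape difference is visible in plain Scala (the column names below are hypothetical):

```scala
val names = Seq("ID", "DATE", "REPORTID")

val oneRowPerName = names.map(Seq(_))  // 3 rows x 1 column - what toDF gives
val oneHeaderRow  = Seq(names)         // 1 row  x 3 columns - what the union needs

println(oneRowPerName.length)  // 3
println(oneHeaderRow.length)   // 1

// A Spark sketch, assuming every column can be cast to string so the
// schemas of the header row and the data rows line up:
//   import org.apache.spark.sql.Row
//   import org.apache.spark.sql.functions.col
//   import org.apache.spark.sql.types.{StructType, StructField, StringType}
//   val schema   = StructType(df.columns.map(StructField(_, StringType)))
//   val headerDF = spark.createDataFrame(
//     spark.sparkContext.parallelize(Seq(Row.fromSeq(df.columns.toSeq))),
//     schema)
//   val combined = headerDF.union(
//     df.select(df.columns.map(c => col(c).cast("string")): _*))
```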

Spark scala copying dataframe column to new dataframe

I have an empty dataframe with schema already created.
I'm trying to add the columns to this dataframe from a new dataframe to the existing columns in a for loop.
k schema - |ID|DATE|REPORTID|SUBMITTEDDATE|
for (data <- 0 to range - 1) {
  val c = df2.select(substring(col("value"), str(data)._2, str(data)._3).alias(str(data)._1)).toDF()
  // c.show()
  k = c.withColumn(str(data)._1, c(str(data)._1))
}
k.show()
But the k dataframe has just one column, though it should have all 4 columns populated with values.
I think the last line in the for loop is replacing existing columns in the dataframe.
Can somebody help me with this?
Thanks!!
Add your logic and conditions and create a new dataframe:
val dataframe2 = dataframe1.select("A", "B", "C")
Copying a few columns of one dataframe to another is not possible in Spark.
Although there are a few alternatives to attain the same result:
1. You need to join both dataframes based on some join condition.
2. Convert both dataframes to JSON and do an RDD union:
val rdd = df1.toJSON.union(df2.toJSON)
val dfFinal = spark.read.json(rdd)

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's columns by their numeric index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1) // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
    .select($"df1.*", $"df2.$desiredDf2Col")
  df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select methods that you want to define in the reduce function are similar to Joining two DataFrames in Spark SQL and selecting columns of only one, then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the name of the second column, i.e. the one selected by Seq(1), should not be the same as any of d1's column names.
You can select multiple columns as well but remember the bold note above
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)
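The indexing trick in the snippets above works because a Scala Seq (and an Array wrapped as one) is itself a function from index to element, so mapping index positions over it picks out the columns at those positions. A minimal sketch with hypothetical column names:

```scala
// d2.columns is an Array[String]; as a Seq it acts as Int => String.
val columns = Seq("id", "name", "score")

// Mapping indices over the sequence selects the elements at those positions.
val picked = Seq(1, 2).map(columns)

println(picked)  // List(name, score)

// In the answer this is followed by `map col: _*` to turn each name
// into a Column for select.
```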

Spark: Computing correlations of a DataFrame with missing values

I currently have a DataFrame of doubles with approximately 20% of the data being null values. I want to calculate the Pearson correlation of one column with every other column and return the columnId's of the top 10 columns in the DataFrame.
I want to filter out nulls using pairwise deletion, similar to R's pairwise.complete.obs option in its Pearson correlation function. That is, if one of the two vectors in any correlation calculation has a null at an index, I want to remove that row from both vectors.
I currently do the following:
val df = ... // my DataFrame
val cols = df.columns
df.registerTempTable("dataset")
val target = "Row1"
val mapped = cols.map { colId =>
  val results = sqlContext.sql(s"SELECT ${target}, ${colId} FROM dataset WHERE (${colId} IS NOT NULL AND ${target} IS NOT NULL)")
  (results.stat.corr(colId, target), colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
This runs very slowly, as every single map iteration is its own job. Is there a way to do this efficiently, perhaps using Statistics.corr in MLlib, as per this SO question (Spark 1.6 Pearson Correlation)?
There are "na" functions on DataFrame: the DataFrameNaFunctions API.
They work in the same way the DataFrameStatFunctions do.
You can drop the rows containing a null in either of your two dataframe columns with the following syntax:
myDataFrame.na.drop("any", Seq(target, colId))
If you want to drop rows containing a null in any of the columns, then it is:
myDataFrame.na.drop("any")
By limiting the dataframe to the two columns you care about first, you can use the second method and avoid verbosity!
As such your code would become:
val df = ??? // my DataFrame
val cols = df.columns
val target = "Row1"
val mapped = cols.map { colId =>
  val resultDF = df.select(target, colId).na.drop("any")
  (resultDF.stat.corr(target, colId), colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
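The final sort-and-take step is plain Scala: sort the (correlation, columnId) pairs descending and keep the ids. Note the answer uses take(11) rather than take(10) because the target's self-correlation of 1.0 always tops the list. A sketch with made-up correlation values:

```scala
// (correlation, columnId) pairs as produced by the map above; the
// target column "Row1" correlates perfectly with itself.
val scored = Seq((0.91, "colA"), (1.0, "Row1"), (0.42, "colB"), (0.77, "colC"))

val topIds = scored.sortWith(_._1 > _._1).take(3).map(_._2)

println(topIds)  // List(Row1, colA, colC)

// Dropping the head removes the target's own id from the result:
//   topIds.tail  ->  List(colA, colC)
```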
Hope this helps you.