Updating DataFrame column names in Spark Scala while performing joins

I have two dataframes aaa_01 and aaa_02 in Apache Spark 2.1.0.
I perform an inner join on these two dataframes, selecting a few columns from both to appear in the output.
The join works fine, but the output dataframe keeps the column names exactly as they were in the input dataframes. This is where I am stuck: I need new column names in the output dataframe instead of the original ones.
Sample code is given below for reference:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4")
I get the output dataframe with the column names "col1, col2, col4". I tried to modify the code as below, but in vain:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4" as "New_Col")
Any help is appreciated. Thanks in advance.
Edited
I browsed around and found similar posts, which are given below, but I do not see an answer to my question.
Updating a dataframe column in spark
Renaming Column names of a Data frame in spark scala
The answers in this post: Spark Dataframe distinguish columns with duplicated name are not relevant to me, as they relate more to PySpark than Scala and explain how to rename all the columns of a dataframe, whereas my requirement is to rename only one or a few columns.

You want to rename columns of the dataset; the fact that your dataset comes from a join does not change anything. You can try any example from this answer, for instance:
DF1.alias("a").join(DF2.alias("b"), DF1("primary_col") === DF2("primary_col"), "inner")
  .select("a.col1", "a.col2", "b.col4")
  .withColumnRenamed("col4", "New_col")

You can use .as to alias a column:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".as("first"),$"a.col2".as("second"),$"b.col4".as("third"))
Or you can use .alias:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".alias("first"),$"a.col2".alias("second"),$"b.col4".alias("third"))
If you are looking to update only one column name, then you can do:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1", $"a.col2", $"b.col4".alias("third"))

Related

Create an empty DF using schema from another DF (Scala Spark)

I have to compare a DF with another one that has the same schema, read from a specific path. However, there may be no files in that path, so I thought I should compare it against an empty DF with the same columns as the original.
So I am trying to create a DF with the schema of another DF that contains a lot of columns, but I can't find a solution for this. I have been reading the following posts, but none of them helps me:
How to create an empty DataFrame with a specified schema?
How to create an empty DataFrame? Why "ValueError: RDD is empty"?
How to create an empty dataFrame in Spark
How can I do it in Scala? Or is it better to take another option?
originalDF.limit(0) will return an empty dataframe with the same schema.
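A minimal sketch of this, plus an alternative based on createDataFrame, assuming an existing originalDF (the sample data here is only illustrative):
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("empty-df").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the dataframe whose schema we want to reuse
val originalDF = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Option 1: limit(0) keeps the schema and drops all rows
val emptyDF1 = originalDF.limit(0)

// Option 2: build an empty dataframe explicitly from the schema
val emptyDF2 = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], originalDF.schema)

emptyDF1.printSchema()
emptyDF2.show() // empty result, same columns as originalDF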

How to replace missing values from another column in PySpark?

I want to use the values in t5 to replace some missing values in t4. I searched for code, but it does not work for me.
Current: [example of current data]
Goal: [example of target data]
df is a dataframe. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
 Error: 'DataFrame' object has no attribute 'withColumn'
I also tried the following code previously; it did not work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that when you use .toPandas(), your pdf becomes a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4']=pdf['t4'].fillna(pdf['t5'])
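If you prefer to stay in the Spark DataFrame API rather than converting to pandas, the same replacement can be done with coalesce inside withColumn. A minimal sketch in the Scala API used elsewhere in this thread, with hypothetical t4/t5 data (the PySpark call is analogous):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

val spark = SparkSession.builder().appName("fill-from-column").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: t4 has missing values that t5 should fill
val df = Seq[(Option[Double], Double)](
  (Some(1.0), 10.0),
  (None, 20.0),
  (Some(3.0), 30.0)
).toDF("t4", "t5")

// coalesce returns the first non-null value among its arguments
val filled = df.withColumn("t4", coalesce($"t4", $"t5"))
filled.show()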

Show a single row for multiple records, with the total record count in a new column, in a Spark Scala dataframe

I have data as follows.
I want to summarize this as follows:
I want to take the first timestamp for each name and add the total count for that name as a new column.
I am not getting any idea of how to do this in Spark Scala code.
Could you please let me know how to handle this situation with a Spark Scala dataframe?
Thanks, Bab
Spark SQL has functions that you can use to achieve this.
import org.apache.spark.sql.functions.{col, count, first}
In Scala you can do something like this:
df.groupBy(col("Name"))
.agg(first("ID").alias("ID"),
first(col("Timestamp")).alias("Timestamp"),
count(col("Name")).alias("Count")
)
If you want to group on both ID and Name, you can also write it as:
df.groupBy(col("ID"), col("Name"))
.agg(first(col("Timestamp")).alias("Timestamp"),
count(col("Name")).alias("Count")
)
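Here is a minimal runnable sketch of the same aggregation with hypothetical data (the original sample data is not reproduced in this thread):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, first}

val spark = SparkSession.builder().appName("first-and-count").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical rows: several records per Name, each with an ID and a Timestamp
val df = Seq(
  (1, "alice", "2021-01-01 10:00:00"),
  (1, "alice", "2021-01-01 11:00:00"),
  (2, "bob",   "2021-01-02 09:00:00")
).toDF("ID", "Name", "Timestamp")

// One row per Name: first ID, first Timestamp, and the total record count
val summary = df.groupBy(col("Name"))
  .agg(
    first(col("ID")).alias("ID"),
    first(col("Timestamp")).alias("Timestamp"),
    count(col("Name")).alias("Count")
  )

summary.show()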

Spark DataFrame groupBy

I have Spark Java code that looks like this. The code pulls data from an Oracle table using JDBC and displays the groupBy output.
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.show();
jdbcDF.groupBy("VA_HOSTNAME").count().show();
Long ll = jdbcDF.count();
System.out.println("ll="+ll);
When I run the code, jdbcDF.show() works, whereas the groupBy and count do not print anything, and no errors are thrown.
My column name is correct. I tried printing that column and it worked, but with groupBy it does not work.
Can someone help me with the DataFrame output? I am using Spark 1.6.3.
You can try:
import static org.apache.spark.sql.functions.count;
jdbcDF.groupBy("VA_HOSTNAME").agg(count("*")).show();

How to sort by column in descending order in Spark SQL?

I tried df.orderBy("col1").show(10), but it sorts in ascending order. df.sort("col1").show(10) also sorts in ascending order. I looked on Stack Overflow and the answers I found were all outdated or referred to RDDs. I'd like to use the native dataframe API in Spark.
You can also sort the column by importing the Spark SQL functions:
import org.apache.spark.sql.functions._
df.orderBy(asc("col1"))
Or
import org.apache.spark.sql.functions._
df.sort(desc("col1"))
Or by importing sqlContext.implicits._:
import sqlContext.implicits._
df.orderBy($"col1".desc)
Or
import sqlContext.implicits._
df.sort($"col1".desc)
The sort method is available on org.apache.spark.sql.DataFrame:
df.sort($"col1", $"col2".desc)
Note the $ and .desc inside sort, applied to the column to sort the results by.
PySpark only
I came across this post when looking to do the same in PySpark. The easiest way is to just add the parameter ascending=False:
df.orderBy("col1", ascending=False).show(10)
Reference: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy
import org.apache.spark.sql.functions.{asc, desc}
df.orderBy(desc("columnname1"), desc("columnname2"), asc("columnname3"))
df.sort($"ColumnName".desc).show()
In the case of Java:
If we use DataFrames, while applying joins (here an inner join), we can sort (in ascending order) after selecting distinct elements in each DF, as:
Dataset<Row> d1 = e_data.distinct().join(s_data.distinct(), "e_id").orderBy("salary");
where e_id is the column on which the join is applied, and the result is sorted by salary in ascending order.
Also, we can use Spark SQL as:
SQLContext sqlCtx = spark.sqlContext();
sqlCtx.sql("select * from global_temp.salary order by salary desc").show();
where spark is the SparkSession and salary is the global temporary view.
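For reference, a minimal Scala sketch (with hypothetical salary data) showing how the global_temp view used above could be registered and then queried:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sort-desc").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical employee data
val df = Seq(("e1", 1000), ("e2", 3000), ("e3", 2000)).toDF("e_id", "salary")

// Register a global temporary view; it is resolved through the global_temp database
df.createGlobalTempView("salary")

spark.sql("select * from global_temp.salary order by salary desc").show()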