I have two dataframes, df1 and df2 (say df2 is a subset of df1). I have to get the rows from df1 which are not in df2. I have to do it in Scala; can anyone help me out on this?
You can use except:
df1.except(df2)
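For context, a minimal sketch of how except behaves (the example dataframes below are made up; note that except also removes duplicate rows, while exceptAll, available since Spark 2.4, keeps them):

import spark.implicits._

val df1 = Seq(1, 2, 3, 4).toDF("id")
val df2 = Seq(1, 2).toDF("id")

// Rows present in df1 but not in df2: 3 and 4
val diff = df1.except(df2)
diff.show()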
I want to merge two dataframes' columns into one new dataframe.
df1 has columns x1,x2,x3
df2 has column x4
new_df should be x1,x2,x3,x4
There are no joining conditions; I just need to merge all the columns together.
I have tried df1.merge(df2) but had no luck with this.
It throws an error: AttributeError: 'DataFrame' object has no attribute 'merge'.
This is a PySpark dataframe.
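One common way to do this when there is no join key is to give each dataframe an explicit positional index and join on it. The question is about PySpark, but the sketch below is in Scala for consistency with the other examples here (PySpark has the same rdd.zipWithIndex and createDataFrame calls); the withRowIndex helper is my own name for it:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a positional index column to a dataframe so rows can be lined up.
def withRowIndex(df: DataFrame, idxCol: String): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField(idxCol, LongType, nullable = false))
  df.sparkSession.createDataFrame(indexed, schema)
}

// Pair the rows of df1 and df2 by position, then drop the helper index.
val new_df = withRowIndex(df1, "_idx")
  .join(withRowIndex(df2, "_idx"), Seq("_idx"))
  .drop("_idx")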
I want to take the union of two PySpark dataframes. They have the same columns, but the order of the columns is different.
I tried this:
joined_df = A_df.unionAll(B_DF)
But the result is based on column position, so the values get intermixed. Is there a way to do the union based on column names and not based on the order of the columns? Thanks in advance.
Just reorder the columns in B so that it has the same column order as A before the union:
A_df.unionAll(B_df.select(*A_df.columns))
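If you are on Spark 2.3 or later, unionByName does this directly by resolving columns by name rather than by position (the method exists in both PySpark and Scala):

A_df.unionByName(B_df)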
Let's say we have a dataframe with columns col1, col2, col3, col4. While saving the df I want to partition by col2, and the final df that gets saved should not have col2. So the final df should have col1, col3, col4. Any advice on how I can achieve this? I tried:
newdf.drop("Status").write.mode("overwrite").partitionBy("Status").csv("C:/Users/Documents/Test")
drop will drop the Status column, and your code will then fail at partitionBy with the below error, because the Status column was already dropped:
org.apache.spark.sql.AnalysisException: Partition column `status` not found in schema [...]
Check the below code; it will not include the Status values inside your data files.
newdf
.write
.mode("overwrite")
.partitionBy("Status")
.csv("C:/Users/Documents/Test")
I am reading a file into a Spark dataframe.
In the first column, I will get two values concatenated with "_".
I need to split the first column into two columns and keep the remaining columns as they are. I am using Scala with Spark.
For example:
col1 col2 col3
a_1 xyz abc
b_1 lmn opq
I need to have new DF as:
col1_1 col1_2 col2 col3
a 1 xyz abc
b 1 lmn opq
Only one column needs to be split into two columns.
I tried the split function with df.select, but then I have to write out the select for all the remaining columns, and considering different files with hundreds of columns, I want reusable code that works for all files.
You can do something like:
import spark.implicits._
df.withColumn("_tmp", split($"col1", "_"))
.withColumn("col1_1", $"_tmp".getItem(0))
.withColumn("col1_2", $"_tmp".getItem(1))
.drop("_tmp")
In our Spark-Scala application, we want to use typed Datasets. There is a JOIN operation between DF1 & DF2 (DF = DataFrame).
My question is: should we convert both DF1 & DF2 to Dataset[T] and then perform the JOIN, or should we do the JOIN first and then convert the resulting DataFrame to a Dataset?
As I understand it, since Dataset[T] is being used here for type safety, we should convert DF1 & DF2 to Dataset[T] before the join. Can someone please confirm, and advise if something is not correct?
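Converting first means the inputs and the result of the join stay typed: with Dataset.joinWith the result is a Dataset of pairs rather than an untyped DataFrame. A minimal sketch, where the case classes Order and Customer and the customerId join key are made-up placeholders for whatever T is in your application:

import org.apache.spark.sql.Dataset
import spark.implicits._

// Placeholder element types standing in for your real Dataset[T] types.
case class Order(orderId: Long, customerId: Long, amount: Double)
case class Customer(customerId: Long, name: String)

// Convert before the join so the join is expressed on typed Datasets.
val orders: Dataset[Order] = DF1.as[Order]
val customers: Dataset[Customer] = DF2.as[Customer]

// joinWith keeps both sides typed: the result is a Dataset[(Order, Customer)].
val joined: Dataset[(Order, Customer)] =
  orders.joinWith(customers, orders("customerId") === customers("customerId"))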