How to merge a column from df1 to df2 in pyspark

I want to merge two dataframes' columns into one new dataframe.
df1 has columns x1, x2, x3
df2 has column x4
new_df should have x1, x2, x3, x4
There are no joining conditions, I just need to merge all the columns together.
I have tried df1.merge(df2) but no luck with this.
It throws an error: AttributeError: 'DataFrame' object has no attribute 'merge'
This is a pyspark dataframe.
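Since Spark has no direct column-wise concatenation of two DataFrames (and merge is a pandas method, not a PySpark one), a common workaround is to give each dataframe a positional index and join on it. A minimal sketch, assuming both dataframes have the same number of rows and that pairing rows by their current order is what you want; the row_idx column is a temporary helper introduced here:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assign each row a positional index in both dataframes.
# Note: a window with no partition pulls all data to one partition,
# so this is only practical for reasonably small dataframes.
w = Window.orderBy(F.monotonically_increasing_id())
df1_idx = df1.withColumn("row_idx", F.row_number().over(w))
df2_idx = df2.withColumn("row_idx", F.row_number().over(w))

# Join on the helper index and drop it to get x1, x2, x3, x4 side by side.
new_df = df1_idx.join(df2_idx, on="row_idx").drop("row_idx")
new_df.show()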

Related

Pyspark union of two dataframes

I want to do the union of two pyspark dataframes. They have the same columns but the sequence of columns is different.
I tried this:
joined_df = A_df.unionAll(B_DF)
But the result is based on column order and the results get intermixed. Is there a way to do the union based on column names and not based on the order of the columns? Thanks in advance.
Just reorder the columns in B so that it has the same column order as A before the union:
A_df.unionAll(B_df.select(*A_df.columns))
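A minimal runnable sketch of the same idea, with made-up data and column names; the unionByName call at the end is an alternative that resolves columns by name and is available from Spark 2.3 onwards:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframes with the same columns in a different order.
A_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
B_df = spark.createDataFrame([("c", 3), ("d", 4)], ["val", "id"])

# Reorder B's columns to match A, then union positionally.
joined_df = A_df.unionAll(B_df.select(*A_df.columns))

# Alternative on Spark 2.3+: union by column name directly.
joined_df = A_df.unionByName(B_df)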

pandas group by on multiple columns and select all those columns used in group by in a new dataframe

select count(ErrorCode) as "counterr", DateOnly, System_id, ErrorType, ErrorCode
from dbo.error
group by DateOnly, System_id, ErrorType, ErrorCode
I have to convert this SQL code into pandas and want the output as a pandas dataframe with the columns counterr, DateOnly, System_id, ErrorType, ErrorCode.
error is my pandas dataframe.
errorMaster = (
    error.groupby(["System_id", "DateOnly", "ErrorType", "ErrorCode", "Acknowledged"])["ErrorCode"]
    .agg(CountErr="count")
    .reset_index()
)
errorMaster
You need to call reset_index(), otherwise you will only get the "CountErr" column rather than all the selected columns in your output.
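A self-contained sketch of that pattern with made-up sample data, keeping only the grouping columns that appear in the SQL from the question:

import pandas as pd

# Hypothetical sample data using the column names from the question.
error = pd.DataFrame({
    "DateOnly": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "System_id": [1, 1, 2],
    "ErrorType": ["fatal", "fatal", "warning"],
    "ErrorCode": [500, 500, 301],
})

# Group, count, and bring the group keys back as columns with reset_index().
errorMaster = (
    error.groupby(["DateOnly", "System_id", "ErrorType", "ErrorCode"])["ErrorCode"]
    .agg(CountErr="count")
    .reset_index()
)
print(errorMaster)  # columns: DateOnly, System_id, ErrorType, ErrorCode, CountErr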

spark: split only one column in a dataframe and keep remaining columns as they are

I am reading a file into a Spark dataframe.
In the first column, I get two values concatenated with "_".
I need to split the first column into two columns and keep the remaining columns as they are. I am using Scala with Spark.
For example:
col1 col2 col3
a_1 xyz abc
b_1 lmn opq
I need to have new DF as:
col1_1 col1_2 col2 col3
a 1 xyz abc
b 1 lmn opq
Only one column needs to be split into two columns.
I tried the split function with df.select, but then I have to write out the select for all the remaining columns, and since different files can have hundreds of columns, I want reusable code that works for all files.
you can do something like:
import spark.implicits._
df.withColumn("_tmp", split($"col1", "_"))
.withColumn("col1_1", $"_tmp".getItem(0))
.withColumn("col1_2", $"_tmp".getItem(1))
.drop("_tmp")

In a Spark-Scala application involving a join, at what point should we convert a DataFrame to a Dataset?

In our Spark-Scala application, we want to use typed Datasets. There is a JOIN operation between DF1 and DF2 (DF = DataFrame).
My question is: should we convert both DF1 and DF2 to Dataset[T] and then perform the JOIN, or should we do the JOIN first and then convert the resulting DataFrame to a Dataset?
As I understand it, since Dataset[T] is being used here for type safety, we should convert DF1 and DF2 to Dataset[T]. Can someone please confirm and advise if something is not correct?

Extracting rows which are not in the other dataframe

I have two dataframes, df1 and df2 (say df2 is a subset of df1). I have to get the rows from df1 which are not in df2. I have to do it in Scala, can anyone help me out on this?
You can use except:
df1.except(df2)
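In PySpark the counterpart is subtract (or exceptAll on Spark 2.4+ if duplicate rows need to be preserved); a minimal sketch with made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data where df2 is a subset of df1.
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
df2 = spark.createDataFrame([(2,), (3,)], ["id"])

# Distinct rows of df1 that do not appear in df2 (like SQL EXCEPT).
df1.subtract(df2).show()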