How to combine multi columns into one in pyspark

How to combine multi columns into one in pyspark - pyspark

I have a dataframe with 2 columns (df1). Now I want to merge columns values into one (df2). How?

Let's say you have DataFrame like this:
d = [
("Value 1", 1),
("Value 2", 2),
("Value 3", 3),
("Value 4", 4),
("Value 5", 5),
]
df = spark.createDataFrame(d,['col1','col2'])
df.show()
# output
+-------+----+
| col1|col2|
+-------+----+
|Value 1| 1|
|Value 2| 2|
|Value 3| 3|
|Value 4| 4|
|Value 5| 5|
+-------+----+
You can join columns and format them as you want using following syntax:
(
df.withColumn("newCol",
F.format_string("Col 1: %s Col 2: %s", df.col1, df.col2))
.show(truncate=False)
)
# output
+-------+----+-----------------------+
|col1 |col2|newCol |
+-------+----+-----------------------+
|Value 1|1 |Col 1: Value 1 Col 2: 1|
|Value 2|2 |Col 1: Value 2 Col 2: 2|
|Value 3|3 |Col 1: Value 3 Col 2: 3|
|Value 4|4 |Col 1: Value 4 Col 2: 4|
|Value 5|5 |Col 1: Value 5 Col 2: 5|
+-------+----+-----------------------+

from pyspark.sql.functions import concat
df1.withColumn("Merge", concat(df1.Column_1, df1.Column_2)).show()

You can use a struct or a map.
struct:
df.withColumn(
"price_struct",
F.struct(
(F.col("total_price")*100).alias("amount"),
"total_price_currency",
F.lit("CENTI").alias("unit")
)
)
results in
+-----------+--------------------+--------------------+
|total_price|total_price_currency| price_struct|
+-----------+--------------------+--------------------+
| 79.0| USD|[7900.0, USD, CENTI]|
+-----------+--------------------+--------------------+
or as a map
df
.withColumn("price_map",
F.create_map(
F.lit("currency"), F.col("total_price_currency"),
F.lit("amount"), F.col("total_price")*100,
F.lit("unit"), F.lit("CENTI")
).alias("price_struct")
)
results in
+-----------+--------------------+--------------------+
|total_price|total_price_currency| price_map|
+-----------+--------------------+--------------------+
| 79.0| USD|[currency -> USD,...|
+-----------+--------------------+--------------------+

Related

Create summary of Spark Dataframe

I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (>=3), ie 1 is ok row,, 2 is bad on column 3, 3 is bad on column 2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).
toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
// Look for long columns (>=3), ie 1 is ok row,, 2 is bad on column 3, 3 is bad on column 2
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try and add the existing filename column to the map I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary?
columnName
maxLength
filename
col2
4
file3
col3
3
file2
The real dataframes have 300+ columns and millions of rows so I cannot hard-type column names.

#wBob does the following achieve your goal?
group by file name and get the maximum per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
.sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()`
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.

Pushed a little further for better result.
df.select(
col("*"),
array( // make array of columns name/value/length
(for{ col_name <- df.columns } yield
struct(
length(col(col_name)).as("length"),
lit(col_name).as("col"),
col(col_name).cast("String").as("col_value")
)
).toSeq:_* ).alias("rowInfo")
)
.select(
col("rowId"),
explode( // explode array into rows
expr("filter(rowInfo, x -> x.length >= 3)") //filter the array for the length your interested in
).as("rowInfo")
)
.select(
col("rowId"),
col("rowInfo.*") // turn struct fields into columns
)
.sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+

It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
col("*"),
length( // take the length
concat( //slap all the columns together
(for( col_name <- df.columns ) yield col(col_name)).toSeq:_*
)
)
.as("length")
)
.sort( //order by total length
col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+

Sorting an array[struct] it will sort on the first field first and second field next. This works as we put the size of the sting up front. If you re-order the fields you'll get different results. You can easily accept more than 1 result if you so desired but I think dsicovering a row is challenging is likely enough.
df.select(
col("*"),
reverse( //sort ascending
sort_array( //sort descending
array( // add all columns lengths to an array
(for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
)
)(0) // grab the row max
.alias("rowMax") )
.sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+

new column in dataframe derived from second dataframe

I've two dataframes df1 and df2.I've to add a new columns in df1 from df2 :
df1
X Y Z
1 2 3
4 5 6
7 8 9
3 6 9
df2
col1 col2
XX aa
YY bb
XX cc
ZZ vv
The values of col1 in df2 should be added as new column(if it does'nt exists) in df1 and col2 as value of new column.For example :
df1
X Y Z XX YY ZZ
1 2 3 aa bb vv
4 5 6 cc
7 8 9
3 6 9
df2
col1 col2
XX aa
YY bb
XX cc
ZZ vv

First, spark dataset are made to be distributed. But column name are part of the schema, so they are in memory of the master. Thus, to add columns for each distinct values of df2.col1, you first need to get those values in the master (i.e. collect)
// inputs
val df1 = List((1,2,3), (4,5,6), (7,8,9), (3,6,9)).toDF("X", "Y", "Z")
val df2 = List(("XX", "aa"), ("YY", "bb"), ("XX", "cc"), ("ZZ", "vv")).toDF("col1", "col2")
val newColumns = df2.select("col1").as[String].distinct.collect
val newDF = newColumns.foldLeft(df1)( (df, col) => df.withColumn(col, lit("?")))
newDF.show
+---+---+---+---+---+---+
| X| Y| Z| ZZ| YY| XX|
+---+---+---+---+---+---+
| 1| 2| 3| ?| ?| ?|
| 4| 5| 6| ?| ?| ?|
| 7| 8| 9| ?| ?| ?|
| 3| 6| 9| ?| ?| ?|
+---+---+---+---+---+---+
But
I don't know what values you want to put in those column (above, I put "?" everywhere)
if there are a lot of rows in df2, like 10's of thousand, it can kill the master to collect and add them all to df1
Now, to give a little more, here is how you can add columns from df2.col1 and put as values the concatenated values of df2.col2
val toAdd = df2.groupBy("col1").agg(concat_ws(",", collect_set("col2")).as("col2All"))
toAdd.show
+----+-------+
|col1|col2All|
+----+-------+
| ZZ| vv|
| YY| bb|
| XX| cc,aa|
+----+-------+
val newColumns = toAdd.rdd.map(r => (r.getAs[String]("col1"), r.getAs[String]("col2All"))).collectAsMap()
val newDF = newColumns.foldLeft(df1){ case (df, (name, value)) => df.withColumn(name, lit(value))}
newDF.show
+---+---+---+-----+---+---+
| X| Y| Z| XX| YY| ZZ|
+---+---+---+-----+---+---+
| 1| 2| 3|cc,aa| bb| vv|
| 4| 5| 6|cc,aa| bb| vv|
| 7| 8| 9|cc,aa| bb| vv|
| 3| 6| 9|cc,aa| bb| vv|
+---+---+---+-----+---+---+

Spark scala column level mismatches from 2 dataframes

I have 2 dataframes
val df1 = Seq((1, "1","6"), (2, "10","8"), (3, "6","4")).toDF("id", "value1","value2")
val df2 = Seq((1, "1","6"), (2, "5","4"), (4, "3","1")).toDF("id", "value1","value2")
and i want to find the difference of column level
output should look like
id,value1_df1,value1_df2,diff_value1,value2_df1,value_df2,diff_value2
1, 1 ,1 , 0 , 6 ,6 ,0
2, 10 ,5 , 5 , 8 ,4 ,4
3, 6 ,3 , 1 , 4 ,1 ,3
like wise i have 100's of column and want to compute difference between same column in 2 dataframes columns are dynamic

Maybe this will help:
val spark = SparkSession.builder.appName("Test").master("local[*]").getOrCreate();
import spark.implicits._
var df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
var df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")
df1.columns.foreach(column => {
df1 = df1.withColumn(column, df1.col(column).cast(IntegerType))
})
df2.columns.foreach(column => {
df2 = df2.withColumn(column, df2.col(column).cast(IntegerType))
})
df1 = df1.withColumnRenamed("id", "df1_id")
df2 = df2.withColumnRenamed("id", "df2_id")
df1.show()
df2.show()
so till now you have two dataframes with value_x,value_y,value_z and going on ...
df1:
+------+------+------+
|df1_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 10| 8|
| 3| 6| 4|
+------+------+------+
df2:
+------+------+------+
|df2_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 5| 4|
| 3| 3| 1|
+------+------+------+
Now we are gonna join them base on id:
var df3 = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
and last, we gonna take all columns on df1/df2 (* Its important that they will have the same columns) - without the id, and create a new column of the diff:
df1.columns.tail.foreach(col => {
val new_col_name = s"${col}-diff"
val df_a_col = s"df1.${col}"
val df_b_col = s"df2.${col}"
df3 = df3.withColumn(new_col_name, df3.col(df_a_col) - df3.col(df_b_col))
})
df3.show()
Result:
+------+------+------+------+------+------+-----------+-----------+
|df1_id|value1|value2|df2_id|value1|value2|value1-diff|value2-diff|
+------+------+------+------+------+------+-----------+-----------+
| 1| 1| 6| 1| 1| 6| 0| 0|
| 2| 10| 8| 2| 5| 4| 5| 4|
| 3| 6| 4| 3| 3| 1| 3| 3|
+------+------+------+------+------+------+-----------+-----------+
This is the result, and it`s dynamic so you can add valueX you want.

Group by and find count before doing pivot spark

I have a dataframe like below
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to groupBy based on A and B pivot on column C, and sum column D
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However I need also to find count after groupBy ,if I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get an output like
A B large small large_count small_count
Is there a way to get only one count after groupBy before doing pivot

On output try
output.withColumn("count", $"large_count"+$"small_count").show
You can drop the two count columns if you want to
To do it before pivot try
df.groupBy("A", "B").agg(count("C"))

Is this what you are expecting?.
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
scala>

This wont require a join. Is this what you are looking for ?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.registerTempTable("dummy")
spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

Turn a Spark dataframe Groupby into a sequence of dataframes [duplicate]

This question already has answers here:
How to split a dataframe into dataframes with same column values?
(3 answers)
Closed 5 years ago.
I'm working with a spark dataframe (in scala), and what I'd like to do is group by a column and turn the different groups as a sequence of dataframes.
So it would look something like
df.groupyby("col").toSeq -> Seq[DataFrame]
Even better would be to turn it into something with a key pair
df.groupyby("col").toSeq -> Dict[key, DataFrame]
This seems like an obvious thing to do, but I can't seem to figure out how it might work

This is what you could do, Here is a simple example
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
//create a dataframe with dummy data
//get list of cities
val city = data.select("city").distinct.collect().flatMap(_.toSeq)
// get all the columns for each city
//this returns Seq[(Any, Dataframe)] as (cityId, Dataframe)
val result = city.map(c => (c -> data.where(($"city" === c))))
//print all the dataframes
result.foreach(a=>
println(s"Dataframe with ${a._1}")
a._2.show()
})
Output looks Like this
Dataframe with City 1
+----+------+----+
|week| city|sale|
+----+------+----+
| 28|City 1| 84|
| 27|City 1| 89|
+----+------+----+
Dataframe with City 3
+----+------+----+
|week| city|sale|
+----+------+----+
| 28|City 3| 48|
| 27|City 3| 42|
| 26|City 3| 68|
| 29|City 3| 27|
+----+------+----+
Dataframe with City 4
+----+------+----+
|week| city|sale|
+----+------+----+
| 28|City 4| 72|
| 29|City 4| 39|
| 27|City 4| 104|
+----+------+----+
Dataframe with City 2
+----+------+----+
|week| city|sale|
+----+------+----+
| 29|City 2| 72|
| 28|City 2| 19|
| 27|City 2| 16|
| 26|City 2| 19|
+----+------+----+
You can also use partitionby to group the data and write to the output as
dataframe.write.partitionBy("col").saveAsTable("outputpath")
this creates a output file for each grouped of "col"
Hope this helps!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to combine multi columns into one in pyspark - pyspark

I have a dataframe with 2 columns (df1). Now I want to merge columns values into one (df2). How?

from pyspark.sql.functions import concat df1.withColumn("Merge", concat(df1.Column_1, df1.Column_2)).show()

Related

Create summary of Spark Dataframe

new column in dataframe derived from second dataframe

Spark scala column level mismatches from 2 dataframes

Group by and find count before doing pivot spark

Turn a Spark dataframe Groupby into a sequence of dataframes [duplicate]

Categories

Resources