Better way to concatenate many columns? - scala

I have 30 columns. 26 of the column names are the names of the alphabet letters. I'd like to take those 26 columns and make them into one column as one string.
price dateCreate volume country A B C D E  ... Z
19    20190501   25     US      1 2 5 6 19 ... 30
49    20190502   30     US      5 4 5 0 34 ... 50
I want this:
price dateCreate volume country new_col
19    20190501   25     US      "1,2,5,6,19,...,30"
49    20190502   30     US      "5,4,5,0,34,...,50"
I know I can do something like this:
df.withColumn("new_col", concat($"A", $"B", ...$"Z"))
However, in the future when faced with this problem I'd like to know how I can more easily concatenate many columns. Is there a way?

Just apply the following to any number of columns you want to concatenate:

import org.apache.spark.sql.functions.{array, col, concat_ws}
import spark.implicits._

val df = Seq(
  (19, 20190501, 24, "US", 1, 2, 5, 6, 19),
  (49, 20190502, 30, "US", 5, 4, 5, 0, 34)
).toDF("price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E")

val exprs = df.columns.drop(4).map(col _)

df.select($"price", $"dataCreate", $"volume", $"country",
  concat_ws(",", array(exprs: _*)).as("new_col")).show()
+-----+----------+------+-------+----------+
|price|dataCreate|volume|country| new_col|
+-----+----------+------+-------+----------+
| 19| 20190501| 24| US|1,2,5,6,19|
| 49| 20190502| 30| US|5,4,5,0,34|
+-----+----------+------+-------+----------+
For completeness, here is the PySpark equivalent:

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [[19, 20190501, 24, "US", 1, 2, 5, 6, 19], [49, 20190502, 30, "US", 5, 4, 5, 0, 34]],
    ["price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E"])

exprs = df.columns[4:]
df.select("price", "dataCreate", "volume", "country",
          F.concat_ws(",", F.array(*exprs)).alias("new_col"))

Maybe you had something similar to the next one in mind:
Scala

import org.apache.spark.sql.functions.{col, concat_ws}

val cols = ('A' to 'Z').map(c => col(c.toString))
df.withColumn("new_col", concat_ws(",", cols: _*))
Python
from pyspark.sql.functions import col, concat_ws
import string
cols = [col(x) for x in string.ascii_uppercase]
df.withColumn("new_col", concat_ws(",", *cols))

Starting with Spark 2.3.0, you can use the concatenation operator || directly in Spark SQL itself:
spark.sql("select A||B||C from table")
https://issues.apache.org/jira/browse/SPARK-19951
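If you want the comma-separated result from the question, concat_ws is also available as a SQL function. A minimal sketch, assuming the example DataFrame from above has been registered under the hypothetical view name letters:

// Sketch only: "letters" is a made-up temp view name for illustration.
df.createOrReplaceTempView("letters")
spark.sql("select price, dataCreate, volume, country, concat_ws(',', A, B, C, D, E) as new_col from letters").show()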

Related

Spark scala aggregate to an array and concat it

I have a Dataset with a number of columns (name, timestamp, platform, clickcount, id) that looks like this:

name timestamp           platform  clickcount id
Joy  2021-10-10T10:27:16 apple     5          1
May  2020-12-12T22:28:08 android   6          2
June 2021-09-15T20:20:06 Microsoft 9          3
Joy  2021-09-09T09:30:09 android   10         1
May  2021-08-08T05:05:05 apple     8          2
I want to group by id, after which it should look like:

Joy  2021-10-10T10:27:16,2021-09-09T09:30:09 apple,android 5,10 1
May  2020-12-12T22:28:08,2021-08-08T05:05:05 android,apple 6,8  2
June 2021-09-15T20:20:06                     Microsoft     9    3

After calling another API that converts the id to a pseudo id, I want to map that id so the result looks like:

Joy  2021-10-10T10:27:16,2021-09-09T09:30:09 apple,android 5,10 1 A12
May  2020-12-12T22:28:08,2021-08-08T05:05:05 android,apple 6,8  2 B23
June 2021-09-15T20:20:06                     Microsoft     9    3 C34

I have tried using groupBy and forEach, but I am stuck and unable to proceed further.
In order to apply the aggregation you want, you should use collect_set as the aggregation function and concat_ws to join the resulting arrays with commas:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._

val df: DataFrame = Seq(
  ("joy", "2021-10-10T10:27:16", "apple", 5, 1),
  ("may", "2020-12-12T22:28:08", "android", 6, 2),
  ("june", "2021-09-15T20:20:06", "microsoft", 9, 3),
  ("joy", "2021-09-09T09:30:09", "android", 10, 1),
  ("may", "2021-08-08T05:05:05", "apple", 8, 2)
).toDF("name", "timestamp", "platform", "clickcount", "id")

df
  .groupBy("id")
  .agg(
    concat_ws(",", collect_set("timestamp")).as("timestamp"),
    concat_ws(",", collect_set("name")).as("name"),
    concat_ws(",", collect_set("platform")).as("platform"),
    concat_ws(",", collect_set("clickcount")).as("clickcount")
  ).show()
The output should be:
+---+--------------------+----+-------------+----------+
| id| timestamp|name| platform|clickcount|
+---+--------------------+----+-------------+----------+
| 1|2021-10-10T10:27:...| joy|apple,android| 5,10|
| 3| 2021-09-15T20:20:06|june| microsoft| 9|
| 2|2021-08-08T05:05:...| may|apple,android| 6,8|
+---+--------------------+----+-------------+----------+
In order to add a pseudo id column, you should join the created df with another dataframe that contains the conversion values, or write a UDF that receives an id value and converts it into a pseudo id.
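For instance, a minimal sketch of the join approach, assuming a hypothetical mapping DataFrame idToPseudo with columns id and pseudo_id (the values A12, B23, C34 come from the desired output above):

// Sketch only: idToPseudo stands in for whatever the external API returns.
val idToPseudo = Seq((1, "A12"), (2, "B23"), (3, "C34")).toDF("id", "pseudo_id")

val aggregated = df
  .groupBy("id")
  .agg(
    concat_ws(",", collect_set("timestamp")).as("timestamp"),
    concat_ws(",", collect_set("platform")).as("platform"),
    concat_ws(",", collect_set("clickcount")).as("clickcount")
  )

aggregated.join(idToPseudo, Seq("id"), "left").show()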

Validate data from the same column in different rows with pyspark

How can I change the value of a column depending on some validation between cells? What I need is to compare the kilometraje values of each customer's (id) records to check whether each following record has a higher kilometraje.
fecha    id estado id_cliente error_code kilometraje error_km
1/1/2019 1  A      1                     10
2/1/2019 2  A                 ERROR      20
3/1/2019 1  D      1          ERROR      30
4/1/2019 2  O                 ERROR
The error in the error_km column is there because, for customer (id) 2, the kilometraje value is lower than the same customer's record from 2/1/2019. (As time passes the car gets used, so the kilometraje should increase; for there to be no error the mileage has to be higher than or equal to the previous one.)
I know that with withColumn I can overwrite or create a column that doesn't exist, and that using when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and overwrite the error_code column with ERROR where applicable, but I don't understand how to validate between different rows for the same client.
from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit

file_path = 'archive.txt'
error = 'ERROR'

df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))

df = df.withColumn('error_code',
                   F.when((F.col('status') == 'O') &
                          (F.col('client_id') != '') |
                          (F.col('status') == 'D') &
                          (F.col('client_id') != '') |
                          (F.col('status') == 'A') &
                          (F.col('client_id') == ''),
                          F.concat(F.col("error_code"), F.lit(":[{}]".format(error))))
                   .otherwise(F.col('error_code')))
You can achieve that with the lag window function. lag returns the value of the previous row within the window, so you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window

l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 30),
     ('7/1/2019', 3, 30),
     ('4/1/2019', 2, 5)]
columns = ['fecha', 'id', 'kilometraje']

df = spark.createDataFrame(l, columns)
df = df.withColumn('fecha', F.to_date(df.fecha, 'dd/MM/yyyy'))

w = Window.partitionBy('id').orderBy('fecha')
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
| fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01| 1| 10| |
|2019-01-03| 1| 30| |
|2019-01-04| 1| 10| ERROR|
|2019-01-05| 1| 30| |
|2019-01-07| 3| 30| |
|2019-01-02| 2| 20| |
|2019-01-04| 2| 5| ERROR|
+----------+---+-----------+--------+
The fourth row doesn't get labeled with 'ERROR' because the previous value has a smaller kilometraje value (10 < 30). If you want to label every id that contains at least one corrupted row with 'ERROR', perform a left join:
df.drop('error_km').join(df.filter(df.error_km == 'ERROR').groupby('id').agg(F.first(df.error_km).alias('error_km')), 'id', 'left').show()
I use .rangeBetween(Window.unboundedPreceding, 0). This makes the window span every row from the start of the partition up to the current row, so for each row the running maximum of the mileage seen so far can be compared against the current value.
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

error = 'This is error'

l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 22),
     ('7/1/2019', 1, 23),
     ('22/1/2019', 2, 5),
     ('11/1/2019', 2, 24),
     ('13/2/2019', 1, 16),
     ('14/2/2019', 2, 18),
     ('5/2/2019', 1, 19),
     ('6/2/2019', 2, 23),
     ('7/2/2019', 1, 14),
     ('8/3/2019', 1, 50),
     ('8/3/2019', 2, 50)]
columns = ['date', 'vin', 'mileage']

df = spark.createDataFrame(l, columns)
df = df.withColumn('date', F.to_date(df.date, 'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))

w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('max', F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to remove the column that has the maximum
df = df.drop('max')
df.show()

scala spark, how do I merge a set of columns to a single one on a dataframe?

I'm looking for a way to do this without a UDF, and I am wondering if it's possible. Let's say I have a DF as follows:

Buyer_name Buyer_state CoBuyer_name CoBuyer_state Price Date
Bob        CA          Joe          CA            20    010119
Stacy      IL          Jamie        IL            50    020419
... about 3 million more rows ...
And I want to turn it into:

Buyer_name Buyer_state Price Date
Bob        CA          20    010119
Joe        CA          20    010119
Stacy      IL          50    020419
Jamie      IL          50    020419
...
Edit: I could also:
1. Create two dataframes, removing the "Buyer" columns from one and the "CoBuyer" columns from the other.
2. Rename the dataframe with "CoBuyer" columns so that they use the "Buyer" column names.
3. Concatenate (union) both dataframes, as in the sketch below.
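A minimal sketch of that rename-and-union idea, assuming a DataFrame df with the columns shown above:

// Sketch only: split out buyer and co-buyer columns, align the names, then union.
val buyers = df.select("Buyer_name", "Buyer_state", "Price", "Date")
val coBuyers = df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date")
  .withColumnRenamed("CoBuyer_name", "Buyer_name")
  .withColumnRenamed("CoBuyer_state", "Buyer_state")

buyers.union(coBuyers).show()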
You can group struct(Buyer_name, Buyer_state) and struct(CoBuyer_name, CoBuyer_state) into an Array which is then expanded using explode, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Bob", "CA", "Joe", "CA", 20, "010119"),
  ("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")

df.
  withColumn("Buyers", array(
    struct($"Buyer_name".as("_1"), $"Buyer_state".as("_2")),
    struct($"CoBuyer_name".as("_1"), $"CoBuyer_state".as("_2"))
  )).
  withColumn("Buyer", explode($"Buyers")).
  select(
    $"Buyer._1".as("Buyer_name"), $"Buyer._2".as("Buyer_state"), $"Price", $"Date"
  ).show
// +----------+-----------+-----+------+
// |Buyer_name|Buyer_state|Price| Date|
// +----------+-----------+-----+------+
// | Bob| CA| 20|010119|
// | Joe| CA| 20|010119|
// | Stacy| IL| 50|020419|
// | Jamie| IL| 50|020419|
// +----------+-----------+-----+------+
This sounds like an unpivot operation to me which can be accomplished with the union function in Scala:
val df = Seq(
  ("Bob", "CA", "Joe", "CA", 20, "010119"),
  ("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")

val df_new = df.select("Buyer_name", "Buyer_state", "Price", "Date").
  union(df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date"))
df_new.show
Thanks to Leo for providing the dataframe definition which I've re-used.

How do you convert a relational dataset with 3 columns to a 2d sparse matrix?

I'm using spark 2.0.0 with scala 2.11.
I have a dataframe that has 3 columns:
object_id category_id count
1         653         5
1         78          1
1         28          6
2         63          2
3         59          7
How do I convert it to this format?
1 653:5 78:1 28:6
2 63:2
3 59:7
Cheers
Using RDDs:

import spark.implicits._

yourDS.rdd
  .map(row => (row.getInt(0), row.getInt(1), row.getInt(2)))
  .groupBy { case (oid, _, _) => oid }
  .map { case (oid, iter) =>
    // build "category_id:count" pairs and join them with spaces
    (oid, iter.map(tup => tup._2 + ":" + tup._3).mkString(" "))
  }
  .toDF("id", "hash")
Staying in the Dataset world is a bit more difficult, as you have to combine the columns yourself.
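For illustration, a minimal typed-Dataset sketch using groupByKey and mapGroups, assuming the rows are (object_id, category_id, count) tuples:

// Sketch only: typed Dataset API over (object_id, category_id, count) tuples.
import spark.implicits._

val ds = Seq((1, 653, 5), (1, 78, 1), (1, 28, 6), (2, 63, 2), (3, 59, 7)).toDS()

val result = ds
  .groupByKey { case (oid, _, _) => oid }
  .mapGroups { (oid, rows) =>
    (oid, rows.map { case (_, cid, cnt) => s"$cid:$cnt" }.mkString(" "))
  }
  .toDF("id", "hash")

result.show(false)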
Well my approach uses DataFrames instead of RDDs, so it differs from the other answer.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import scala.collection.mutable.WrappedArray

val a = sc.parallelize(Array(
  (1, 653, 5),
  (1, 78, 1),
  (1, 28, 6),
  (2, 63, 2),
  (3, 59, 7)
)).toDF("object_id", "category_id", "count")

val x = a.select(col("object_id"), concat(col("category_id"), lit(":"), col("count")).as("res"))

def concat_things(a: WrappedArray[String]) = a.reduce(_ + " " + _)
val conUDF = udf(concat_things _, StringType)

x.groupBy("object_id").agg(collect_list(col("res")).as("res")).select(col("object_id"), conUDF(col("res"))).show()
//+---------+---------------+
//|object_id| UDF(res)|
//+---------+---------------+
//| 1|653:5 78:1 28:6|
//| 3| 59:7|
//| 2| 63:2|
//+---------+---------------+
You can check this answer on this published notebook
Don't want to leave this one unanswered: it turns out the pivot function after a groupBy does exactly what I want.

dataset
  .groupBy("object_id")
  .pivot("category_id", listOfAllCategoryIds)
  .agg(first("count")) // pivot needs a trailing aggregation; first("count") is one reasonable choice here

Adding a column of rowsums across a list of columns in Spark Dataframe

I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a  5    7    9    12   13
b  6    4    3    20   17
c  4    9    4    6    9
d  1    2    6    8    1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a  5    7    9    12   13   46
b  6    4    3    20   17   50
c  4    9    4    6    9    32
d  1    2    6    8    10   27
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? Based on this answer, which is basically what I want but uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe), I think something like this would work:
// Select columns to sum
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5")
// Create new column called sumofcolumns which is the sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columnstosum.head, columnstosum.tail: _*).sum)
This throws the error value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?
Thanks in advance for your help.
You should try the following:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val input = sc.parallelize(Seq(
  ("a", 5, 7, 9, 12, 13),
  ("b", 6, 4, 3, 20, 17),
  ("c", 4, 9, 4, 6, 9),
  ("d", 1, 2, 6, 8, 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")

val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+
Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with the Python equivalent:

from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col

def sum_(*cols):
    return reduce(add, cols, lit(0))

columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
df.select("*", sum_(*columnstosum))
Both will evaluate to null if there is a missing value in the row. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that.
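For example, a minimal sketch of a null-safe variant using coalesce, reusing the column names from above:

// Sketch: treat missing values as 0 so a single null doesn't null out the whole sum.
import org.apache.spark.sql.functions.{coalesce, col, lit}

val nullSafeSum = Seq("var1", "var2", "var3", "var4", "var5")
  .map(c => coalesce(col(c), lit(0)))
  .reduce(_ + _)

df.withColumn("sums", nullSafeSum)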
I assume you have a dataframe df. Then you can sum up all columns except your ID column. This is helpful when you have many columns and don't want to mention all of their names manually, as in the answers above. This post has the same answer.
val sumAll = df.columns.collect{ case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)
Here's an elegant solution using python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will inspire something similarly concise in Spark Scala... anyone?