Spark Scala how to process multiple columns in single loop [duplicate] - scala

This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 5 years ago.
Below are the rows of a dataframe with Mango, Apple, and Orange columns respectively:
[10,20,30]
[100,2000,300]
[1000,200,3000]
For the above dataframe, I need to get a summary like
{Mango: 1110; Apple: 2220; Orange: 3330}
How do I do this in a single iteration?

If you have a simple dataframe as below
+-----+-----+------+
|Mango|Apple|Orange|
+-----+-----+------+
|10   |20   |30    |
|100  |200  |300   |
|1000 |2000 |3000  |
+-----+-----+------+
you can do something like below
df.select(sum("Mango").as("Mango"), sum("Apple").as("Apple"), sum("Orange").as("Orange")).toJSON.rdd.foreach(println)
which would give you output as
{"Mango":1110,"Apple":2220,"Orange":3330}

How to take an array inside a data frame and turn it into a data frame [duplicate]

This question already has an answer here:
Difference between explode and explode_outer
(1 answer)
Closed 2 years ago.
I have the following data frame:
|tokenCnt|filtered                             |
|5       |[java, scala, list, java, linkedlist]|
|3       |[also, genseq, parseq]               |
I want to take out the arrays in the column 'filtered' one by one and turn them into one data frame.
|filtered  |
|java      |
|scala     |
|list      |
|java      |
|linkedlist|
|also      |
|genseq    |
|parseq    |
like this.
Could someone help me?
You can use explode (it takes a Column, so wrap the column name with col or $):
import org.apache.spark.sql.functions.{col, explode}
val result = df.select(explode(col("filtered")).alias("filtered"))
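For completeness, a minimal self-contained sketch of the same idea, assuming a SparkSession in scope as spark and the column names from the question:
import spark.implicits._
import org.apache.spark.sql.functions.{col, explode}

// recreate the example data from the question
val df = Seq(
  (5, Seq("java", "scala", "list", "java", "linkedlist")),
  (3, Seq("also", "genseq", "parseq"))
).toDF("tokenCnt", "filtered")

// explode emits one output row per element of the array column
val result = df.select(explode(col("filtered")).alias("filtered"))
result.show(false)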

Spark: Conditionally replace col1 value with col2 [duplicate]

This question already has an answer here:
Coalesce duplicate columns in spark dataframe
(1 answer)
Closed 1 year ago.
I have a data frame I've built by joining legacy data with updated data.
I would like to collapse this data so that whenever a non-null value is available in the model_update column, it replaces the model column value in the same row. How can this be achieved?
Data frame:
+----+-----+------+-----------+------------+
|id  |make |model |make_update|model_update|
+----+-----+------+-----------+------------+
|1234|Apple|iphone|null       |iphone x    |
|4567|Apple|iphone|null       |iphone 8    |
|7890|Apple|iphone|null       |null        |
+----+-----+------+-----------+------------+
Ideal Result:
+----+-----+--------+
|id  |make |model   |
+----+-----+--------+
|1234|Apple|iphone x|
|4567|Apple|iphone 8|
|7890|Apple|iphone  |
+----+-----+--------+
Use coalesce:
val df2 = df.withColumn("model", coalesce(col("model_update"), col("model")))
Here is a quick solution:
val df2 = df1.withColumn("New_Model", when($"model_update".isNull, $"model")
  .otherwise($"model_update"))
Where df1 is your original data frame.
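To get exactly the three-column shape shown in the ideal result, a sketch along the same lines (column names taken from the question, df assumed to be the joined data frame shown above) applies coalesce to both pairs and then drops the update columns:
import org.apache.spark.sql.functions.{coalesce, col}

// prefer the *_update value whenever it is non-null, then drop the helper columns
val collapsed = df
  .withColumn("make", coalesce(col("make_update"), col("make")))
  .withColumn("model", coalesce(col("model_update"), col("model")))
  .drop("make_update", "model_update")

collapsed.show(false)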

Row manipulation for Dataframe in spark [duplicate]

This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have a dataframe in spark which is like :
column_A | column_B
-------- | --------
1        | 1,12,21
2        | 6,9
Both column_A and column_B are of String type.
How can I convert the above dataframe to a new dataframe like this:
column_new_A | column_new_B
------------ | ------------
1            | 1
1            | 12
1            | 21
2            | 6
2            | 9
Both column_new_A and column_new_B should be of String type.
You need to split column_B on the comma and use the explode function. First, recreate the example dataframe:
val df = Seq(
  ("1", "1,12,21"),
  ("2", "6,9")
).toDF("column_A", "column_B")
You can use withColumn or select to create the new columns; note that only the select version renames them to column_new_A and column_new_B, as in the output below.
df.withColumn("column_B", explode(split($"column_B", ","))).show(false)
df.select($"column_A".as("column_new_A"), explode(split($"column_B", ",")).as("column_new_B")).show(false)
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+

Add leading zeros to Columns in a Spark Data Frame [duplicate]

This question already has answers here:
Prepend zeros to a value in PySpark
(2 answers)
Closed 4 years ago.
In short, I'm using spark-xml to parse XML files, but it strips the leading zeros from the values I'm interested in. I need the final output, which is a DataFrame, to include the leading zeros, and I can't figure out a way to add them back to the columns I care about.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "output")
  .option("excludeAttribute", true)
  .option("allowNumericLeadingZeros", true) // including this does not solve the problem
  .load("pathToXmlFile")
Example output that I'm getting
+------+---+--------------------+
|iD |val|Code |
+------+---+--------------------+
|1 |44 |9022070536692784476 |
|2 |66 |-5138930048185086175|
|3 |25 |805582856291361761 |
|4 |17 |-9107885086776983000|
|5 |18 |1993794295881733178 |
|6 |31 |-2867434050463300064|
|7 |88 |-4692317993930338046|
|8 |44 |-4039776869915039812|
|9 |20 |-5786627276152563542|
|10 |12 |7614363703260494022 |
+------+---+--------------------+
Desired output
+--------+----+--------------------+
|iD |val |Code |
+--------+----+--------------------+
|001 |044 |9022070536692784476 |
|002 |066 |-5138930048185086175|
|003 |025 |805582856291361761 |
|004 |017 |-9107885086776983000|
|005 |018 |1993794295881733178 |
|006 |031 |-2867434050463300064|
|007 |088 |-4692317993930338046|
|008 |044 |-4039776869915039812|
|009 |020 |-5786627276152563542|
|0010 |012 |7614363703260494022 |
+--------+----+--------------------+
This solved it for me, thank you all for the help:
val df2 = df
  .withColumn("idLong", format_string("%03d", $"iD"))
You can simply do that by using the concat built-in function:
df.withColumn("iD", concat(lit("00"), col("iD")))
  .withColumn("val", concat(lit("0"), col("val")))

Expand Column to many rows in Scala [duplicate]

This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have the following dataframe (df2)
+---------------+----------+----+-----+-----+
|Colours        |Model     |year|type |count|
+---------------+----------+----+-----+-----+
|red,green,white|Mitsubishi|2006|sedan|3    |
|gray,silver    |Mazda     |2010|SUV  |2    |
+---------------+----------+----+-----+-----+
I need to explode the column "Colours" so that it expands into one row per colour, like this:
+-------+----------+----+-----+
|Colours|Model     |year|type |
+-------+----------+----+-----+
|red    |Mitsubishi|2006|sedan|
|green  |Mitsubishi|2006|sedan|
|white  |Mitsubishi|2006|sedan|
|gray   |Mazda     |2010|SUV  |
|silver |Mazda     |2010|SUV  |
+-------+----------+----+-----+
I have created an array:
val colrs=df2.select("Colours").collect.map(_.getString(0))
and added the array to the dataframe:
val cars=df2.withColumn("c",explode($"colrs")).select("Colours","Model","year","type")
but it didn't work. Any help, please?
You can use the split and explode functions as below on your dataframe (df2):
import org.apache.spark.sql.functions._
val cars = df2.withColumn("Colours", explode(split(col("Colours"), ","))).select("Colours","Model","year","type")
You will have the following output:
cars.show(false)
+-------+----------+----+-----+
|Colours|Model |year|type |
+-------+----------+----+-----+
|red |Mitsubishi|2006|sedan|
|green |Mitsubishi|2006|sedan|
|white |Mitsubishi|2006|sedan|
|gray |Mazda |2010|SUV |
|silver |Mazda |2010|SUV |
+-------+----------+----+-----+