This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have the following dataframe (df2)
+---------------+----------+----+-----+-----+
|Colours        |Model     |year|type |count|
+---------------+----------+----+-----+-----+
|red,green,white|Mitsubishi|2006|sedan|3    |
|gray,silver    |Mazda     |2010|SUV  |2    |
+---------------+----------+----+-----+-----+
I need to explode the column "Colours" so that it is expanded into one row per colour, like this:
+-------+----------+----+-----+
|Colours|Model     |year|type |
+-------+----------+----+-----+
|red    |Mitsubishi|2006|sedan|
|green  |Mitsubishi|2006|sedan|
|white  |Mitsubishi|2006|sedan|
|gray   |Mazda     |2010|SUV  |
|silver |Mazda     |2010|SUV  |
+-------+----------+----+-----+
I have created an array
val colrs=df2.select("Colours").collect.map(_.getString(0))
and added the array to the dataframe
val cars=df2.withColumn("c",explode($"colrs")).select("Colours","Model","year","type")
but it didn't work. Any help please?
You can use the split and explode functions on your dataframe (df2) as below:
import org.apache.spark.sql.functions._
val cars = df2.withColumn("Colours", explode(split(col("Colours"), ","))).select("Colours","Model","year","type")
You will get this output:
cars.show(false)
+-------+----------+----+-----+
|Colours|Model |year|type |
+-------+----------+----+-----+
|red |Mitsubishi|2006|sedan|
|green |Mitsubishi|2006|sedan|
|white |Mitsubishi|2006|sedan|
|gray |Mazda |2010|SUV |
|silver |Mazda |2010|SUV |
+-------+----------+----+-----+
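The attempt in the question fails because $"colrs" refers to a (non-existent) column named colrs inside df2, not to the driver-side Scala array built with collect; explode needs an array column in the dataframe itself, which is what split provides. For completeness, a minimal self-contained sketch of the whole flow (the sample data is reconstructed from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

val spark = SparkSession.builder().appName("explode-colours").master("local[*]").getOrCreate()
import spark.implicits._

// Recreate the sample dataframe from the question
val df2 = Seq(
  ("red,green,white", "Mitsubishi", 2006, "sedan", 3),
  ("gray,silver", "Mazda", 2010, "SUV", 2)
).toDF("Colours", "Model", "year", "type", "count")

// split turns the comma-separated string into an array,
// explode then produces one output row per array element
val cars = df2
  .withColumn("Colours", explode(split(col("Colours"), ",")))
  .select("Colours", "Model", "year", "type")

cars.show(false)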
This question already has an answer here:
Coalesce duplicate columns in spark dataframe
(1 answer)
Closed 1 year ago.
I have a data frame that I've built by joining legacy data with updated data.
I would like to collapse this data so that whenever a non-null value is available in the model_update column, it replaces the value in the model column in the same row. How can this be achieved?
Data frame:
+----+-----+------+-----------+------------+
|id  |make |model |make_update|model_update|
+----+-----+------+-----------+------------+
|1234|Apple|iphone|null       |iphone x    |
|4567|Apple|iphone|null       |iphone 8    |
|7890|Apple|iphone|null       |null        |
+----+-----+------+-----------+------------+
Ideal Result:
+----+-----+--------+
|id  |make |model   |
+----+-----+--------+
|1234|Apple|iphone x|
|4567|Apple|iphone 8|
|7890|Apple|iphone  |
+----+-----+--------+
Using coalesce:
df = df.withColumn("model", coalesce(col("model_update"), col("model")))
Here is a quick solution:
import org.apache.spark.sql.functions.when
val df2 = df1.withColumn("New_Model", when($"model_update".isNull, $"model")
  .otherwise($"model_update"))
where df1 is your original data frame.
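If you want to apply the same rule to both the make and model columns and end up with the three-column result shown above, here is a minimal sketch (assuming your joined dataframe is called df and has the columns from the question):

import org.apache.spark.sql.functions.{coalesce, col}

val collapsed = df
  .withColumn("model", coalesce(col("model_update"), col("model")))
  .withColumn("make", coalesce(col("make_update"), col("make")))
  .select("id", "make", "model")

collapsed.show(false)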
This question already has answers here:
Padding in a Pyspark Dataframe
(2 answers)
Closed 4 years ago.
I have a DataFrame that looks like below:
|string_code|prefix_string_code|
|1234       |001234            |
|123        |000123            |
|56789      |056789            |
Basically, what I want is to prepend as many '0' characters as necessary so that the length of the prefix_string_code column is 6.
What I have tried:
df.withColumn('prefix_string_code', when(length(col('string_code')) < 6, concat(lit('0' * (6 - length(col('string_code')))), col('string_code'))).otherwise(col('string_code')))
It did not work and instead produced the following:
|string_code|prefix_string_code|
|1234       |0.001234          |
|123        |0.000123          |
|56789      |0.056789          |
As you can see, the padding is almost right, but the result comes out as a decimal number instead of a zero-padded string. How do I do this properly?
Thanks!
You can use the lpad function for this case:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([1234, 123, 56789, 1234567])
>>> data = rdd.map(lambda x: Row(x))
>>> df = spark.createDataFrame(data, ['string_code'])
>>> df.show()
+-----------+
|string_code|
+-----------+
| 1234|
| 123|
| 56789|
| 1234567|
+-----------+
>>> df.withColumn('prefix_string_code', F.when(F.length(df['string_code']) < 6 ,F.lpad(df['string_code'],6,'0')).otherwise(df['string_code'])).show()
+-----------+------------------+
|string_code|prefix_string_code|
+-----------+------------------+
| 1234| 001234|
| 123| 000123|
| 56789| 056789|
| 1234567| 1234567|
+-----------+------------------+
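The Scala API has the same lpad function; here is a sketch of the equivalent, assuming a dataframe df with the string_code column from the question:

import org.apache.spark.sql.functions.{col, length, lpad, when}

// Pad to width 6 only when the value is shorter than 6 characters;
// without the guard, lpad would truncate longer values such as 1234567.
val padded = df.withColumn(
  "prefix_string_code",
  when(length(col("string_code")) < 6, lpad(col("string_code"), 6, "0"))
    .otherwise(col("string_code"))
)

padded.show(false)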
This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have a dataframe in Spark which looks like this:
column_A | column_B
-------- | --------
1        | 1,12,21
2        | 6,9
Both column_A and column_B are of String type.
How can I convert the above dataframe to a new dataframe which looks like this:
column_new_A | column_new_B
------------ | ------------
1            | 1
1            | 12
1            | 21
2            | 6
2            | 9
Both column_new_A and column_new_B should be of String type.
You need to split column_B on the comma and use the explode function. Given the dataframe:
val df = Seq(
  ("1", "1,12,21"),
  ("2", "6,9")
).toDF("column_A", "column_B")
You can use either withColumn or select to create the new column:
df.withColumn("column_B", explode(split( $"column_B", ","))).show(false)
df.select($"column_A".as("column_new_A"), explode(split( $"column_B", ",")).as("column_new_B"))
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+
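If you prefer the withColumn variant but also want the renamed columns from the desired output, a small sketch (column names taken from the question):

val renamed = df
  .withColumn("column_B", explode(split($"column_B", ",")))
  .withColumnRenamed("column_A", "column_new_A")
  .withColumnRenamed("column_B", "column_new_B")

renamed.show(false)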
This question already has answers here:
Prepend zeros to a value in PySpark
(2 answers)
Closed 4 years ago.
In short, I'm using spark-xml to parse some XML files, but doing so strips the leading zeros from all the values I'm interested in. I need the final output, which is a DataFrame, to keep those leading zeros, and I can't figure out a way to add them back to the relevant columns.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "output")
.option("excludeAttribute", true)
.option("allowNumericLeadingZeros", true) //including this does not solve the problem
.load("pathToXmlFile")
Example output that I'm getting
+------+---+--------------------+
|iD |val|Code |
+------+---+--------------------+
|1 |44 |9022070536692784476 |
|2 |66 |-5138930048185086175|
|3 |25 |805582856291361761 |
|4 |17 |-9107885086776983000|
|5 |18 |1993794295881733178 |
|6 |31 |-2867434050463300064|
|7 |88 |-4692317993930338046|
|8 |44 |-4039776869915039812|
|9 |20 |-5786627276152563542|
|10 |12 |7614363703260494022 |
+------+---+--------------------+
Desired output
+--------+----+--------------------+
|iD |val |Code |
+--------+----+--------------------+
|001 |044 |9022070536692784476 |
|002 |066 |-5138930048185086175|
|003 |025 |805582856291361761 |
|004 |017 |-9107885086776983000|
|005 |018 |1993794295881733178 |
|006 |031 |-2867434050463300064|
|007 |088 |-4692317993930338046|
|008 |044 |-4039776869915039812|
|009 |020 |-5786627276152563542|
|0010 |012 |7614363703260494022 |
+--------+----+--------------------+
This solved it for me; thank you all for the help:
val df2 = df
  .withColumn("idLong", format_string("%03d", $"iD"))
You can simply do that by using the inbuilt concat function:
import org.apache.spark.sql.functions.{col, concat, lit}

df.withColumn("iD", concat(lit("00"), col("iD")))
  .withColumn("val", concat(lit("0"), col("val")))
This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 5 years ago.
The rows below form a dataframe with the columns Mango, Apple and Orange respectively:
[10,20,30]
[100,2000,300]
[1000,200,3000]
For the above dataframe I need to get a summary like
{Mango: 1110; Apple: 2220; Orange: 3330}
How do I do this in a single pass?
If you have a simple dataframe as below
+-----+-----+------+
|Mango|Apple|Orange|
+-----+-----+------+
|10 |20 |30 |
|100 |200 |300 |
|1000 |2000 |3000 |
+-----+-----+------+
you can do something like below:
import org.apache.spark.sql.functions.sum

df.select(sum("Mango").as("Mango"), sum("Apple").as("Apple"), sum("Orange").as("Orange")).toJSON.rdd.foreach(println)
which would give you output as
{"Mango":1110,"Apple":2220,"Orange":3330}