This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have a dataframe in Spark which looks like this:
column_A | column_B
-------- | --------
1        | 1,12,21
2        | 6,9
Both column_A and column_B are of String type.
How can I convert the above dataframe into a new dataframe like this:
column_new_A | column_new_B
------------ | ------------
1            | 1
1            | 12
1            | 21
2            | 6
2            | 9
Both column_new_A and column_new_B should be of String type.
You need to split column_B on the comma and use the explode function, as below:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("1", "1,12,21"),
  ("2", "6,9")
).toDF("column_A", "column_B")
You can use either withColumn or select to create the new column:
df.withColumn("column_B", explode(split($"column_B", ","))).show(false)

df.select($"column_A".as("column_new_A"), explode(split($"column_B", ",")).as("column_new_B")).show(false)
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+
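Alternatively, in the spirit of the linked flatMap question, a minimal sketch using the typed flatMap API instead of explode would look like the following (this assumes spark.implicits._ is in scope so the tuple encoder is available):

// Sketch only: the same result via flatMap instead of explode.
val exploded = df.flatMap { row =>
  val a = row.getString(0)                            // column_A
  row.getString(1).split(",").map(v => (a, v)).toSeq  // one output row per split value of column_B
}.toDF("column_new_A", "column_new_B")

exploded.show(false)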
I'm trying to create a comparison matrix using a Spark dataframe, and am starting by creating a single column dataframe with one row per value:
val df = List(1, 2, 3, 4, 5).toDF
From here, what I need to do is create a new column for each row and insert (for now) a random number in each space, like this:
Item  1  2  3  4  5
----  -  -  -  -  -
1     0  7  3  6  2
2     1  0  4  3  1
3     8  6  0  4  4
4     8  8  1  0  9
5     9  5  3  6  0
Any assistance would be appreciated!
Consider transposing the input DataFrame df using the .pivot() function, like the following:
// assumes the single column is named "item", e.g. List(1, 2, 3, 4, 5).toDF("item")
val output = df.groupBy("item").pivot("item").agg((rand() * 100).cast(DataTypes.IntegerType))
This will generate a new DataFrame with a random Integer value corresponding to the row value (and null otherwise).
+----+----+----+----+----+----+
|item|1 |2 |3 |4 |5 |
+----+----+----+----+----+----+
|1 |9 |null|null|null|null|
|3 |null|null|2 |null|null|
|5 |null|null|null|null|6 |
|4 |null|null|null|26 |null|
|2 |null|33 |null|null|null|
+----+----+----+----+----+----+
If you don't want the null values, you can fill or replace them afterwards, as sketched below.
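For example, a minimal sketch that replaces the nulls with 0 (assuming 0 is an acceptable placeholder in your comparison matrix):

// Sketch only: fill the nulls produced by the pivot with 0.
// na.fill(0) applies to all numeric columns of the DataFrame.
val filled = output.na.fill(0)
filled.show(false)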
This question already has an answer here:
Coalesce duplicate columns in spark dataframe
(1 answer)
Closed 1 year ago.
I have a data frame I've built by joining legacy data with updated data.
I would like to collapse this data so that whenever a non-null value is available in the model_update column, it replaces the model column value in the same row. How can this be achieved?
Data frame:
+----------------------------------------+-------+--------+-----------+------------+
|id |make |model |make_update|model_update|
+----------------------------------------+-------+--------+-----------+------------+
|1234 |Apple |iphone |null |iphone x |
|4567 |Apple |iphone |null |iphone 8 |
|7890 |Apple |iphone |null |null |
+----------------------------------------+-------+--------+-----------+------------+
Ideal Result:
+----------------------------------------+-------+---------+
|id |make |model |
+----------------------------------------+-------+---------+
|1234 |Apple |iphone x |
|4567 |Apple |iphone 8 |
|7890 |Apple |iphone |
+----------------------------------------+-------+---------+
Use coalesce:

import org.apache.spark.sql.functions.{coalesce, col}

val dfCollapsed = df.withColumn("model", coalesce(col("model_update"), col("model")))
Here is a quick solution:
import org.apache.spark.sql.functions.when

val df2 = df1.withColumn("New_Model", when($"model_update".isNull, $"model")
  .otherwise($"model_update"))
Where df1 is your original data frame.
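To get exactly the "Ideal Result" shown above (only id, make and model), a minimal sketch, assuming the same column names, coalesces both update columns and then drops them:

import org.apache.spark.sql.functions.{coalesce, col}

// Sketch only: fold the *_update columns into their base columns,
// then drop them so only id, make and model remain.
val result = df
  .withColumn("make", coalesce(col("make_update"), col("make")))
  .withColumn("model", coalesce(col("model_update"), col("model")))
  .drop("make_update", "model_update")

result.show(false)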
This question already has answers here:
Prepend zeros to a value in PySpark
(2 answers)
Closed 4 years ago.
In short, I'm leveraging spark-xml to parse some XML files. However, parsing strips the leading zeros from all the values I'm interested in, and I need the final output, which is a DataFrame, to include those leading zeros. I can't figure out a way to add the leading zeros back to the columns I'm interested in.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "output")
  .option("excludeAttribute", true)
  .option("allowNumericLeadingZeros", true) // including this does not solve the problem
  .load("pathToXmlFile")
Example output that I'm getting
+------+---+--------------------+
|iD |val|Code |
+------+---+--------------------+
|1 |44 |9022070536692784476 |
|2 |66 |-5138930048185086175|
|3 |25 |805582856291361761 |
|4 |17 |-9107885086776983000|
|5 |18 |1993794295881733178 |
|6 |31 |-2867434050463300064|
|7 |88 |-4692317993930338046|
|8 |44 |-4039776869915039812|
|9 |20 |-5786627276152563542|
|10 |12 |7614363703260494022 |
+------+---+--------------------+
Desired output
+--------+----+--------------------+
|iD |val |Code |
+--------+----+--------------------+
|001 |044 |9022070536692784476 |
|002 |066 |-5138930048185086175|
|003 |025 |805582856291361761 |
|004 |017 |-9107885086776983000|
|005 |018 |1993794295881733178 |
|006 |031 |-2867434050463300064|
|007 |088 |-4692317993930338046|
|008 |044 |-4039776869915039812|
|009 |020 |-5786627276152563542|
|0010 |012 |7614363703260494022 |
+--------+----+--------------------+
This solved it for me; thank you all for the help:
import org.apache.spark.sql.functions.format_string

val df2 = df
  .withColumn("idLong", format_string("%03d", $"iD"))
You can simply do that by using the concat built-in function:
df.withColumn("iD", concat(lit("00"), col("iD")))
.withColumn("val", concat(lit("0"), col("val")))
Suppose I have two datasets as follows:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I want to join the two datasets above and count, for each name in dataset 1, how many times that name appears in the messages of dataset 2. The result should be:
Bill, 4
Ramond, 2
..
..
I managed to join both of them together, but I'm not sure how to count how many times each name appears.
Any suggestion would be appreciated.
Edit: my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
println(f._1+" "+f._2._1+" "+f._2._2)
})
Achieving the solution you desire with RDDs would be complex; it is much simpler with DataFrames.
The first step would be to read your two files into DataFrames, as below:
val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset2")
so that you have:
df1
+---+------+-----+
|id |name |score|
+---+------+-----+
|1 |Bill |200 |
|2 |Bew |23 |
|3 |Amy |44 |
|4 |Ramond|68 |
+---+------+-----+
df2
+---+-------------------+
|id |message |
+---+-------------------+
|1 |i love Bill |
|2 |i hate Bill |
|3 |Bew go go ! |
|4 |Amy is the best |
|5 |Ramond is the wrost|
|6 |Bill go go |
|7 |Bill i love ya |
|8 |Ramond is Bad |
|9 |Amy is great |
+---+-------------------+
A join, groupBy and count should give your desired output:
df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().show(false)
The final output would be:
+------+-----+
|name |count|
+------+-----+
|Ramond|2 |
|Bill |4 |
|Amy |2 |
|Bew |1 |
+------+-----+
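Since you asked for the top names, a small sketch (reusing the join above) that orders the counts from most to least frequent would be:

import org.apache.spark.sql.functions.desc

// Sketch only: sort the per-name counts in descending order.
val topNames = df1.join(df2, df2("message").contains(df1("name")), "left")
  .groupBy("name")
  .count()
  .orderBy(desc("count"))

topNames.show(false)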
This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have the following dataframe (df2)
+---------------+----------+----+-----+-----+
|Colours        |Model     |year|type |count|
+---------------+----------+----+-----+-----+
|red,green,white|Mitsubishi|2006|sedan|3    |
|gray,silver    |Mazda     |2010|SUV  |2    |
+---------------+----------+----+-----+-----+
I need to explode the column "Colours", so that it looks like this expanded table:
+-------+----------+----+-----+
|Colours|Model     |year|type |
+-------+----------+----+-----+
|red    |Mitsubishi|2006|sedan|
|green  |Mitsubishi|2006|sedan|
|white  |Mitsubishi|2006|sedan|
|gray   |Mazda     |2010|SUV  |
|silver |Mazda     |2010|SUV  |
+-------+----------+----+-----+
I have created an array
val colrs=df2.select("Colours").collect.map(_.getString(0))
and tried to add the array to the dataframe:
val cars=df2.withColumn("c",explode($"colrs")).select("Colours","Model","year","type")
but it didn't work. Any help would be appreciated.
You can use the split and explode functions on your dataframe (df2) as below:
import org.apache.spark.sql.functions._
val cars = df2.withColumn("Colours", explode(split(col("Colours"), ","))).select("Colours","Model","year","type")
You will have the output as:
cars.show(false)
+-------+----------+----+-----+
|Colours|Model |year|type |
+-------+----------+----+-----+
|red |Mitsubishi|2006|sedan|
|green |Mitsubishi|2006|sedan|
|white |Mitsubishi|2006|sedan|
|gray |Mazda |2010|SUV |
|silver |Mazda |2010|SUV |
+-------+----------+----+-----+