Pattern match string from column in spark dataframe [closed] - scala

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have a column in a Spark DataFrame in which I need to find the entries containing the string "xyz" and store them in a new column.
Input (I need only the fields from colB that contain xyz):
colA colB
A bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656
B xyz:4462915,xyz:4462917,xyz:4462918
Required Output
colA colB colC
A bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656 xyz:3089656
B xyz:4462915,xyz:4462917,xyz:4462918 xyz:4462915,xyz:4462917,xyz:4462918
I have 100k rows and cannot use groupBy on colA with collect_list. Can you please help me get the required output?

If you are using Spark 2.4+, you can split colB on the comma and use the built-in higher-order function filter as an SQL expression:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("A", "bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656"),
  ("B", "xyz:4462915,xyz:4462917,xyz:4462918")
).toDF("colA", "colB")

val newDF = df
  .withColumn("split", split($"colB", ","))                        // split colB into an array
  .selectExpr("*", "filter(split, x -> x LIKE 'xyz%') filteredB")  // keep only the xyz entries
  .withColumn("colC", concat_ws(",", $"filteredB"))                // join them back into a string
  .drop("split", "filteredB")

newDF.show(false)
Output:
+----+-----------------------------------------------------+-----------------------------------+
|colA|colB |colC |
+----+-----------------------------------------------------+-----------------------------------+
|A |bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656|xyz:3089656 |
|B |xyz:4462915,xyz:4462917,xyz:4462918 |xyz:4462915,xyz:4462917,xyz:4462918|
+----+-----------------------------------------------------+-----------------------------------+
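For Spark versions before 2.4, where the filter higher-order function is not available, a minimal sketch using a plain Scala UDF instead (keepXyz is a name I made up; it assumes the same df as above):
// filter the comma-separated entries in colB with a UDF and rejoin them
import org.apache.spark.sql.functions.udf

val keepXyz = udf { s: String =>
  s.split(",").filter(_.startsWith("xyz")).mkString(",")
}

df.withColumn("colC", keepXyz($"colB")).show(false)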

Related

Spark Scala Data Frame to have multiple aggregation of single Group By [duplicate]

This question already has answers here:
Multiple Aggregate operations on the same column of a spark dataframe
(6 answers)
Closed 3 years ago.
I want to apply multiple aggregations to a single groupBy on a Spark Scala DataFrame.
e.g.
val groupped = df.groupBy("firstName", "lastName").sum("Amount").toDF()
But what if I need Count, Sum, Max, etc.?
/* Below Does Not Work , but this is what the intention is
val groupped = df.groupBy("firstName", "lastName").sum("Amount").count().toDF()
*/
Desired output:
groupped.show()
--------------------------------------------------
| firstName | lastName| Amount|count | Max | Min |
--------------------------------------------------
You can pass several aggregate expressions to agg on the grouped DataFrame:
import org.apache.spark.sql.functions._
import spark.implicits._

case class soExample(firstName: String, lastName: String, Amount: Int)
val df = Seq(soExample("me", "zack", 100)).toDF

val groupped = df.groupBy("firstName", "lastName").agg(
  sum("Amount"),
  mean("Amount"),
  stddev("Amount"),
  count(lit(1)).alias("numOfRecords")
)

// display() is Databricks-specific; use groupped.show() elsewhere
display(groupped)
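Since the question asks for Count, Sum, Max and Min specifically, here is a sketch of the same pattern with those aggregations (the aliases are mine):
val grouped = df.groupBy("firstName", "lastName").agg(
  sum("Amount").alias("Amount"),
  count(lit(1)).alias("count"),
  max("Amount").alias("Max"),
  min("Amount").alias("Min")
)
grouped.show()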

How could I unpivot a dataframe in Spark? [duplicate]

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Closed 3 years ago.
I have a dataframe with the following schema:
subjectID, feature001, feature002, feature003, ..., feature299
Let's say my dataframe looks like:
123,0.23,0.54,0.35,...,0.26
234,0.17,0.49,0.47,...,0.69
Now, what I want is:
subjectID, featureID, featureValue
The above dataframe would look like:
123,001,0.23
123,002,0.54
123,003,0.35
......
123,299,0.26
234,001,0.17
234,002,0.49
234,003,0.47
......
234,299,0.69
I know how to achieve it if I have only a few columns:
newDF = df.select($"subjectID", expr("stack(3, 'feature001', 001, 'feature002', 002, 'feature003', 003) as (featureID, featureValue)"))
However, I am looking for a way to deal with 300 columns.
You can build an array of structs from your columns and then use explode to turn them into rows:
import org.apache.spark.sql.functions.{explode, struct, lit, array, col}

// build an array of struct expressions from the feature columns
val columnExprs = df.columns
  .filter(_.startsWith("feature"))
  .map(name => struct(lit(name.replace("feature", "")) as "id", col(name) as "value"))

// unpivot the DataFrame
val newDF = df
  .select($"subjectID", explode(array(columnExprs: _*)) as "feature")
  .select(
    $"subjectID",
    $"feature.id" as "featureID",
    $"feature.value" as "featureValue")

Scala - Drop records from DF1 if it has matching data with column from DF2 [duplicate]

This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I have two DFs (railroadGreaterFile, railroadInputFile).
I want to drop records from railroadGreaterFile if the data in its MEMBER_NUM column matches the data in the MEMBER_NUM column of railroadInputFile.
Below is what I used:
import org.apache.spark.sql.functions.lit

val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile
  .join(columnrailroadInputFile, Seq("MEMBER_NUM"), "left")
  .filter($"check".isNull)
  .drop($"check")
Doing the above, the records are dropped; however, railroadGreaterNotInput's schema is a combination of DF1 and DF2, so when I try to write railroadGreaterNotInput's data to a file, I get the error below:
org.apache.spark.sql.AnalysisException: Reference 'GROUP_NUM' is ambiguous, could be: GROUP_NUM#508, GROUP_NUM#72
What should I do so that railroadGreaterNotInput contains only the fields from the railroadGreaterFile DF?
You can select only MEMBER_NUM (plus the check flag) from the right-hand side while joining:
val columnrailroadInputFile = railroadInputFile.withColumn("check", lit("check"))
val railroadGreaterNotInput = railroadGreaterFile
  .join(columnrailroadInputFile.select("MEMBER_NUM", "check"), Seq("MEMBER_NUM"), "left")
  .filter($"check".isNull)
  .drop($"check")
Or, after the join, drop all of columnrailroadInputFile's columns from the joined DataFrame:
railroadGreaterNotInput.drop(columnrailroadInputFile.columns: _*)
but for this, use the join condition
railroadGreaterFile("MEMBER_NUM") === columnrailroadInputFile("MEMBER_NUM")
instead of Seq("MEMBER_NUM"), so that both sides' columns are still present to be dropped.
Hope this helps!
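A simpler approach, assuming Spark 2.0 or later, is a left_anti join: it keeps only the rows of railroadGreaterFile whose MEMBER_NUM has no match in railroadInputFile, and the result carries railroadGreaterFile's schema only, so the ambiguity cannot arise.
// left_anti returns left-side rows that have no match on the right; no right-side columns are kept
val railroadGreaterNotInput = railroadGreaterFile.join(
  railroadInputFile.select("MEMBER_NUM"),
  Seq("MEMBER_NUM"),
  "left_anti"
)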

how to apply partition in spark scala dataframe with multiple columns? [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have the following dataframe df in Spark Scala:
id project start_date Change_date designation
1 P1 08/10/2018 01/09/2017 2
1 P1 08/10/2018 02/11/2018 3
1 P1 08/10/2018 01/08/2016 1
I then want to get the designation whose Change_date is closest to start_date while still being earlier than it.
Expected output:
id project start_date designation
1 P1 08/10/2018 2
This is because change date 01/09/2017 is the closest date before start_date.
Can somebody advise how to achieve this?
This is not about selecting the first row of each group, but about selecting the designation corresponding to the change date closest to the start date.
Parse the dates:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = ???
import spark.implicits._

val df = Seq(
  (1, "P1", "08/10/2018", "01/09/2017", 2),
  (1, "P1", "08/10/2018", "02/11/2018", 3),
  (1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")

val parsed = df
  .withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))
  .withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))
Find the difference, keeping only change dates strictly before the start date:
val diff = parsed
  .withColumn("diff", datediff($"start_date", $"changed_date"))
  .where($"diff" > 0)
Apply a solution of your choice from How to select the first row of each group?, for example window functions. If you group by id:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"diff")
diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// | 1| P1|2018-10-08| 2017-09-01| 2| 402|
// +---+----------+----------+------------+-----------+----+
Reference:
How to select the first row of each group?
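An alternative sketch without a window function, using the common min-on-a-struct trick to pick the row with the smallest diff per id (the closest name is mine):
import org.apache.spark.sql.functions.{min, struct}

// structs compare field by field, so placing diff first makes min() select
// the row with the smallest diff in each group
val closest = diff
  .groupBy($"id")
  .agg(min(struct($"diff", $"project_id", $"start_date", $"changed_date", $"designation")).as("best"))
  .select($"id", $"best.project_id", $"best.start_date", $"best.changed_date", $"best.designation")
closest.show()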

Convert RDD[(String,List[String])] to Dataframe [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
My RDD is in the format below, i.e. RDD[(String, List[String])]:
(abc,List(a,b))
(bcb,List(a,b))
I want to convert it to Dataframe Like below
col1 col2 col3
abc a b
bcb a b
What is the best approach to do it in Scala?
You first need to extract the elements of your List into a tuple; then you can use toDF on your RDD (Spark implicit conversions need to be imported for this):
import org.apache.spark.rdd.RDD
import spark.implicits._

val rdd: RDD[(String, List[String])] = sc.parallelize(Seq(
  ("abc", List("a", "b")),
  ("bcb", List("a", "b"))
))

val df = rdd
  .map { case (str, list) => (str, list(0), list(1)) }
  .toDF("col1", "col2", "col3")

df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| a| b|
| bcb| a| b|
+----+----+----+
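If the inner lists can vary in length, a hedged sketch that keeps the List as an array column and expands it by index (maxLen and the generated column names are assumptions on my part):
// expand the array column into one column per position; missing positions become null
val maxLen = 2 // assumed maximum list length; compute it from the data if unknown
val base = rdd.toDF("col1", "items")
val cols = $"col1" +: (0 until maxLen).map(i => $"items".getItem(i).as(s"col${i + 2}"))
val expanded = base.select(cols: _*)
expanded.show()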