This question already has answers here:
Unpivot in Spark SQL / PySpark
(2 answers)
How to melt Spark DataFrame?
(6 answers)
Transpose column to row with Spark
(9 answers)
Closed 5 years ago.
I have a data frame and I would like to use Scala to explode rows into multiple rows using the values in multiple columns. Ideally I am looking to replicate the behavior of the R function melt().
All the columns contain Strings.
Example: I want to transform this data frame..
df.show
+--------+-----------+-------------+-----+----+
|col1 | col2 | col3 | res1|res2|
+--------+-----------+-------------+-----+----+
| a| baseline| equivalence| TRUE| 0.1|
| a| experiment1| equivalence|FALSE|0.01|
| b| baseline| equivalence| TRUE| 0.2|
| b| experiment1| equivalence|FALSE|0.02|
+--------+-----------+-------------+-----+----+
...Into this data frame:
+--------+-----------+-------------+-----+-------+
|col1 | col2 | col3 | key |value|
+--------+-----------+-------------+-----+-------+
| a| baseline| equivalence| res1 | TRUE |
| a|experiment1| equivalence| res1 | FALSE|
| b| baseline| equivalence| res1 | TRUE |
| b|experiment1| equivalence| res1 | FALSE|
| a| baseline| equivalence| res2 | 0.1 |
| a|experiment1| equivalence| res2 | 0.01 |
| b| baseline| equivalence| res2 | 0.2 |
| b|experiment1| equivalence| res2 | 0.02 |
+--------+-----------+-------------+-----+-------+
Is there a built-in function in Scala which applies to datasets or
data frames to do this?
If not, would it be relatively simple to
implement this? How would it be done at a high level?
Note: I have found the class UnpivotOp from SMV which would do exactly what I want: (https://github.com/TresAmigosSD/SMV/blob/master/src/main/scala/org/tresamigos/smv/UnpivotOp.scala).
Unfortunately, the class is private, so I cannot do something like this:
import org.tresamigos.smv.UnpivotOp
val melter = new UnpivotOp(df, Seq("res1","res2"))
val melted_df = melter.unpivot()
Does anyone know if there a way to access the class org.tresamigos.smv.UnpivotOp via some some other class of static method of SMV?
Thanks!
Thanks to Andrew's Ray answer to unpivot in spark-sql/pyspark
This did the trick:
df.select($"col1",
$"col2",
$"col3",
expr("stack(2, 'res1', res1, 'res2', res2) as (key, value)"))
Or if the expression for the select should be passed as strings (handy for df %>% sparklyr::invoke("")):
df.selectExpr("col1",
"col2",
"col3",
"stack(2, 'res1', res1, 'res2', res2) as (key, value)")
Related
I exploded a nested schema but I am not getting what I want,
before exploded it looks like this:
df.show()
+----------+----------------------------------------------------------+
|CaseNumber| SourceId |
+----------+----------------------------------------------------------+
| 0 |[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}] |
+----------|----------------------------------------------------------|
| 1 |[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}] |
+---------------------------------------------------------------------+
I want it to be like this
+----------+-------------------+
| CaseNumber| Sku | ContractId |
+----------+-------------------+
| 0 | 1 | 22 |
+----------|------|------------|
| 1 | 3 | 24 |
+------------------------------|
Here is one way using the build-in get_json_object function:
import org.apache.spark.sql.functions.get_json_object
val df = Seq(
(0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
(1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]"""))
.toDF("CaseNumber", "SourceId")
df.withColumn("sku", get_json_object($"SourceId", "$[0].id").cast("int"))
.withColumn("ContractId", get_json_object($"SourceId", "$[1].id").cast("int"))
.drop("SourceId")
.show
// +----------+---+----------+
// |CaseNumber|sku|ContractId|
// +----------+---+----------+
// | 0| 1| 22|
// | 1| 3| 24|
// +----------+---+----------+
UPDATE
After our discussion we realised that the mentioned data is of array<struct<id:string,type:string>> type and not a simple string. Next is the solution for the new schema:
df.withColumn("sku", $"SourceIds".getItem(0).getField("id"))
.withColumn("ContractId", $"SourceIds".getItem(1).getField("id"))
This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 1 year ago.
I have the following DataFrame df in Spark:
+------------+---------+-----------+
|OrderID | Type| Qty|
+------------+---------+-----------+
| 571936| 62800| 1|
| 571936| 62800| 1|
| 571936| 62802| 3|
| 661455| 72800| 1|
| 661455| 72801| 1|
I need to select the row that has a largest value of Qty per each unique OrderID or the last rows per OrderID if all Qty are the same (e.g. as for 661455). The expected result:
+------------+---------+-----------+
|OrderID | Type| Qty|
+------------+---------+-----------+
| 571936| 62802| 3|
| 661455| 72801| 1|
Any ides how to get it?
This is what I tried:
val partitionWindow = Window.partitionBy(col("OrderID")).orderBy(col("Qty").asc)
val result = df.over(partitionWindow)
scala> val w = Window.partitionBy("OrderID").orderBy("Qty")
scala> val w1 = Window.partitionBy("OrderID")
scala> df.show()
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 571936|62800| 1|
| 571936|62800| 1|
| 571936|62802| 3|
| 661455|72800| 1|
| 661455|72801| 1|
+-------+-----+---+
scala> df.withColumn("rn", row_number.over(w)).withColumn("mxrn", max("rn").over(w1)).filter($"mxrn" === $"rn").drop("mxrn","rn").show
+-------+-----+---+
|OrderID| Type|Qty|
+-------+-----+---+
| 661455|72801| 1|
| 571936|62802| 3|
+-------+-----+---+
I have a spark dataframe, and I wish to check whether each string in a particular column exists in a pre-defined a column of another dataframe.
I have found a same problem in Spark (scala) dataframes - Check whether strings in column contain any items from a set
but I want to Check whether strings in column exists in a column of another dataframe not a List or a set follow that question. Who can help me! I don't know convert a column to a set or a list and i don't know "exists" method in dataframe.
My data is similar to this
df1:
+---+-----------------+
| id| url |
+---+-----------------+
| 1|google.com |
| 2|facebook.com |
| 3|github.com |
| 4|stackoverflow.com|
+---+-----------------+
df2:
+-----+------------+
| id | urldetail |
+-----+------------+
| 11 |google.com |
| 12 |yahoo.com |
| 13 |facebook.com|
| 14 |twitter.com |
| 15 |youtube.com |
+-----+------------+
Now, i am trying to create a third column with the results of a comparison to see if the strings in the $"urldetail" column if exists in $"url"
+---+------------+-------------+
| id| urldetail | check |
+---+------------+-------------+
| 11|google.com | 1 |
| 12|yahoo.com | 0 |
| 13|facebook.com| 1 |
| 14|twitter.com | 0 |
| 15|youtube.com | 0 |
+---+------------+-------------+
I want to use UDF but i don't know how to check whether string exists in a column of a dataframe! Please help me!
I have a spark dataframe, and I wish to check whether each string in a
particular column contains any number of words from a pre-defined a
column of another dataframe.
Here is the way. using = or like
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, _}
object CompareColumns extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder()
.appName(this.getClass.getName)
.config("spark.master", "local").getOrCreate()
import spark.implicits._
val df1 = Seq(
(1, "google.com"),
(2, "facebook.com"),
(3, "github.com"),
(4, "stackoverflow.com")).toDF("id", "url").as("first")
df1.show
val df2 = Seq(
(11, "google.com"),
(12, "yahoo.com"),
(13, "facebook.com"),
(14, "twitter.com")).toDF("id", "url").as("second")
df2.show
val df3 = df2.join(df1, expr("first.url like second.url"), "full_outer").select(
col("first.url")
, col("first.url").contains(col("second.url")).as("check")).filter("url is not null")
df3.na.fill(Map("check" -> false))
.show
}
Result :
+---+-----------------+
| id| url|
+---+-----------------+
| 1| google.com|
| 2| facebook.com|
| 3| github.com|
| 4|stackoverflow.com|
+---+-----------------+
+---+------------+
| id| url|
+---+------------+
| 11| google.com|
| 12| yahoo.com|
| 13|facebook.com|
| 14| twitter.com|
+---+------------+
+-----------------+-----+
| url|check|
+-----------------+-----+
| google.com| true|
| facebook.com| true|
| github.com|false|
|stackoverflow.com|false|
+-----------------+-----+
with full outer join we can achive this...
For more details see my article with all joins here in my linked in post
Note : Instead of 0 for false 1 for true i have used boolean
conditions here.. you can translate them in to what ever you wanted...
UPDATE : If rows are increasing in second dataframe
you can use this, it wont miss any rows from second
val df3 = df2.join(df1, expr("first.url like second.url"), "full").select(
col("second.*")
, col("first.url").contains(col("second.url")).as("check"))
.filter("url is not null")
df3.na.fill(Map("check" -> false))
.show
Also, one more thing is you can try regexp_extract as shown in below post
https://stackoverflow.com/a/53880542/647053
read in your data and use the trim operation just to be conservative when joining on the strings to remove the whitesapace
val df= Seq((1,"google.com"), (2,"facebook.com"), ( 3,"github.com "), (4,"stackoverflow.com")).toDF("id", "url").select($"id", trim($"url").as("url"))
val df2 =Seq(( 11 ,"google.com"), (12 ,"yahoo.com"), (13 ,"facebook.com"),(14 ,"twitter.com"),(15,"youtube.com")).toDF( "id" ,"urldetail").select($"id", trim($"urldetail").as("urldetail"))
df.join(df2.withColumn("flag", lit(1)).drop("id"), (df("url")===df2("urldetail")), "left_outer").withColumn("contains_bool",
when($"flag"===1, true) otherwise(false)).drop("flag","urldetail").show
+---+-----------------+-------------+
| id| url|contains_bool|
+---+-----------------+-------------+
| 1| google.com| true|
| 2| facebook.com| true|
| 3| github.com| false|
| 4|stackoverflow.com| false|
+---+-----------------+-------------+
This question already has answers here:
remove last few characters in PySpark dataframe column
(5 answers)
Closed 3 years ago.
I would like to remove the last two values of a string for each string in a single column of a spark dataframe. I would like to do this in the spark dataframe not by moving it to pandas and then back.
An example dataframe would be below,
# +----+-------+
# | age| name|
# +----+-------+
# | 350|Michael|
# | 290| Andy|
# | 123| Justin|
# +----+-------+
where the dtype of the age column is a string.
# +----+-------+
# | age| name|
# +----+-------+
# | 3|Michael|
# | 2| Andy|
# | 1| Justin|
# +----+-------+
This is the expected output. The last two characters of the string have been removed.
Hi Scala/sparkSql way of doing this is very Simple.
val result = originalDF.withColumn("age", substring(col("age"),0,1))
reult.show
you can probably get your syntax for pyspark
substring, length, col, expr from functions can be used for this purpose.
from pyspark.sql.functions import substring, length, col, expr
df = your df here
substring index 1, -2 were used since its 3 digits and .... its age field logically a person
wont live more than 100 years :-) OP can change substring function suiting to his requirement.
df.withColumn("age",expr("substring(age, 1, length(age)-2)"))
df.show
Result :
+----+-------+
| age| name|
+----+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+----+-------+
Scala answer :
val originalDF = Seq(
(350, "Michael"),
(290, "Andy"),
(123, "Justin")
).toDF("age", "name")
println(" originalDF " )
originalDF.show
println("modified")
originalDF.selectExpr("substring(age,0,1) as age", "name " ).show
Result :
originalDF
+---+-------+
|age| name|
+---+-------+
|350|Michael|
|290| Andy|
|123| Justin|
+---+-------+
modified
+---+-------+
|age| name|
+---+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+---+-------+
I have data in two text files as
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
| 3| u,t,k|
| 4| f,o,k|
| 5| e,o,u|
+----------+-------+
file2(diagnosis code,diagnosis description) Time T1
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| yen|
| t| ten|
| k| ken|
| u| uen|
| p| pen|
| f| fen|
| o| oen|
| e| een|
+-------+---------+
data in file 2 is not fixed and keeps on changing, means at any given point of time diagnosis code y can have diagnosis description as yen and at other point of time it can have diagnosis description as ten. For example below
file2 at Time T2
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| ten|
| t| yen|
| k| uen|
| u| oen|
| p| ken|
| f| pen|
| o| een|
| e| fen|
+-------+---------+
I have to read these two files data in spark and want only those patients id who are diagnosed with uen.
it can be done using spark sql or scala both.
I tried to read the file1 in spark-shell. The two columns in file1 are pipe delimited.
scala> val tes1 = sc.textFile("file1.txt").map(x => x.split('|')).filter(y => y(1).contains("u")).collect
tes1: Array[Array[String]] = Array(Array(2, u,t,p), Array(3, u,t,k), Array(5, e,o,u))
But as the diagnosis code related to a diagnosis description is not constant in file2 so will have to use the join condition. But I dont know how to apply joins when the diag_cd column in file1 has multiple values.
any help would be appreciated.
Please find the answer below
//Read the file1 into a dataframe
val file1DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file1PATH")
//Read the file2 into a dataframe
val file2DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file2path")
//get the patient id dataframe for the diag_desc as uen
file1DF.join(file2DF,file1DF.col("diag_cd").contains(file2DF.col("diag_cd")),"inner")
.filter(file2DF.col("diag_desc") === "uen")
.select("patient_id").show
Convert the table t1 from format1 to format2 using explode method.
Format1:
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
+----------+-------+
to
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y |
| 1| t |
| 1| k |
| 2| u |
| 2| t |
| 2| p |
+----------+-------+
Code:
scala> val data = Seq("1|y,t,k", "2|u,t,p")
data: Seq[String] = List(1|y,t,k, 2|u,t,p)
scala> val df1 = sc.parallelize(data).toDF("c1").withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).withColumn("col2", split(col("c1"), "\\|").getItem(1)).select("patient_id", "col2").withColumn("diag_cd", explode(split($"col2", "\\,"))).select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
scala> df1.collect()
res4: Array[org.apache.spark.sql.Row] = Array([1,y], [1,t], [1,k], [2,u], [2,t], [2,p])
I have created dummy data here for illustration. Note how we are exploding the particular column above using
scala> val df1 = sc.parallelize(data).toDF("c1").
| withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).
| withColumn("col2", split(col("c1"), "\\|").getItem(1)).
| select("patient_id", "col2").
| withColumn("diag_cd", explode(split($"col2", "\\,"))).
| select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
Now you can create df2 for file 2 using -
scala> val df2 = sc.textFile("file2.txt").map(x => (x.split(",")(0),x.split(",")(1))).toDF("diag_cd", "diag_desc")
df2: org.apache.spark.sql.DataFrame = [diag_cd: string, diag_desc: string]
Join df1 with df2 and filter as per the requirement.
df1.join(df2, df1.col("diag_cd") === df2.col("diag_cd")).filter(df2.col("diag_desc") === "ten").select(df1.col("patient_id")).collect()