Joining two datasets in Spark (Scala)

I have two CSV files (datasets), file1 and file2.
File1 consists of the following columns:
Orders  | Requests | Book1   | Book2
Varchar | Integer  | Integer | Integer
File2 consists of the following columns:
Book3  | Book4  | Book5   | Orders
String | String | Varchar | Varchar
How can I combine the data in the two CSV files in Scala to check how many Orders, Book1 (ignoring rows where Book1 = 0), Book3 and Book4 values are present in both files for each order?
Note: the Orders column is common to both files.

You can join two CSV files by building pair RDDs keyed on the join column.
// LineParser and job come from the answerer's own codebase: each file is keyed
// by the join column and the two pair RDDs are then joined on that key.
val rightFile = job.patch.get.file
val rightFileByKeys = sc.textFile(rightFile).map { line =>
  new LineParser(line, job.patch.get.patchKeyIndex, job.delimRegex, Some(job.patch.get.patchValueIndex))
}.keyBy(_.getKey())

val leftFileByKeys = sc.textFile(leftFile).map { line =>
  new LineParser(line, job.patch.get.fileKeyIndex, job.delimRegex)
}.keyBy(_.getKey())

leftFileByKeys.join(rightFileByKeys).map { case (key, (left, right)) =>
  (job, left.line + job.delim + right.getValue())
}
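If you can use the DataFrame API instead, a minimal sketch could look like the following. It assumes file1.csv and file2.csv have headers matching the columns above; the paths, read options and the exact aggregation are placeholders to adapt to your data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("csv-join").getOrCreate()

// Assumed paths and headers; adjust to your files.
val file1 = spark.read.option("header", "true").option("inferSchema", "true").csv("file1.csv")
val file2 = spark.read.option("header", "true").option("inferSchema", "true").csv("file2.csv")

// Ignore rows with Book1 = 0 as required, then join on the common Orders column.
val joined = file1.filter(col("Book1") =!= 0).join(file2, Seq("Orders"))

// Count, per order, how many joined rows carry Book1, Book3 and Book4 values.
joined.groupBy("Orders")
  .agg(
    count("Book1").as("book1Count"),
    count("Book3").as("book3Count"),
    count("Book4").as("book4Count"))
  .show()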


Spark dataframe duplicate row based on splitting column value in scala

I have the following code in Scala:
val fullCertificateSourceDf = certificateSourceDf
.withColumn("Stage", when(col("Data.WorkBreakdownUp1Summary").isNotNull && col("Data.WorkBreakdownUp1Summary")=!="", rtrim(regexp_extract($"Data.WorkBreakdownUp1Summary","^.*?(?= - *[a-zA-Z])",0))).otherwise(""))
.withColumn("SubSystem", when(col("Data.ProcessBreakdownSummaryList").isNotNull && col("Data.ProcessBreakdownSummaryList")=!="", regexp_extract($"Data.ProcessBreakdownSummaryList","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.withColumn("System", when(col("Data.ProcessBreakdownUp1SummaryList").isNotNull && col("Data.ProcessBreakdownUp1SummaryList")=!="", regexp_extract($"Data.ProcessBreakdownUp1SummaryList","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.withColumn("Facility", when(col("Data.ProcessBreakdownUp2Summary").isNotNull && col("Data.ProcessBreakdownUp2Summary")=!="", regexp_extract($"Data.ProcessBreakdownUp2Summary","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.withColumn("Area", when(col("Data.ProcessBreakdownUp3Summary").isNotNull && col("Data.ProcessBreakdownUp3Summary")=!="", regexp_extract($"Data.ProcessBreakdownUp3Summary","^.*?(?= - *[a-zA-Z])",0)).otherwise(""))
.select("Data.ID",
"Data.CertificateID",
"Data.CertificateTag",
"Data.CertificateDescription",
"Data.WorkBreakdownUp1Summary",
"Data.ProcessBreakdownSummaryList",
"Data.ProcessBreakdownUp1SummaryList",
"Data.ProcessBreakdownUp2Summary",
"Data.ProcessBreakdownUp3Summary",
"Data.ActualStartDate",
"Data.ActualEndDate",
"Data.ApprovedDate",
"Data.CurrentState",
"DataType",
"PullDate",
"PullTime",
"Stage",
"System",
"SubSystem",
"Facility",
"Area"
)
.filter((col("Stage").isNotNull) && (length(col("Stage"))>0))
.filter(((col("SubSystem").isNotNull) && (length(col("SubSystem"))>0)) || ((col("System").isNotNull) && (length(col("System"))>0)) || ((col("Facility").isNotNull) && (length(col("Facility"))>0)) || ((col("Area").isNotNull) && (length(col("Area"))>0))
)
.select("*")
This dataframe, fullCertificateSourceDf, contains the following data:
I have hidden some columns for brevity.
I want the data to look like this:
We are splitting on two columns: ProcessBreakdownSummaryList and ProcessBreakdownUp1SummaryList. Both are comma-separated lists.
Please note that if the values in ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Air Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-81 - Service Shaft Electrical) are the same, we should only split once.
However, if they are different, as in ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Air Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-34 - Service Shaft Electrical), it should split again into a third row.
Thank you in advance for your help with this.
You can solve this in many ways; I think the easiest approach for complicated processing is to use plain Scala code. You can read all columns, including "ProcessBreakdownSummaryList" and "ProcessBreakdownUp1SummaryList", compare their values for being the same or different, and emit multiple rows for a single input row. Then flatMap the output to get a dataframe with all the rows you need.
val fullCertificateSourceDf = // your code
fullCertificateSourceDf.map { row =>
  val id = row.getAs[String]("Data.ID")
  // ... read all the other columns the same way
  val processBreakdownSummaryList = row.getAs[String]("Data.ProcessBreakdownSummaryList")
  val processBreakdownUp1SummaryList = row.getAs[String]("Data.ProcessBreakdownUp1SummaryList")
  // split processBreakdownSummaryList on ","
  // split processBreakdownUp1SummaryList on ","
  // compare them for equality
  // let's say you end up with 4 rows: return a Seq of those 4 rows,
  // i.e. a List of tuples of strings like
  //   List((id, certificateId, certificateTag, ...distinct values of processBreakdownUp1SummaryList...), (...), ...)
  // all columns (id, certificateId, certificateTag, etc.) are repeated for each distinct value
  // of processBreakdownUp1SummaryList and processBreakdownSummaryList
}.flatMap(identity(_)).toDF("column1", "column2", ...)
Here is an example of splitting one row into multiple rows:
import spark.implicits._

val employees = spark.createDataFrame(Seq(
  ("E1", 100.0, "a,b"), ("E2", 200.0, "e,f"), ("E3", 300.0, "c,d")
)).toDF("employee", "salary", "clubs")

employees.map { r =>
  val clubs = r.getAs[String]("clubs").split(",")
  for {
    c <- clubs
  } yield (r.getAs[String]("employee"), r.getAs[Double]("salary"), c)
}.flatMap(identity(_)).toDF("employee", "salary", "clubs").show(false)
The result looks like
+--------+------+-----+
|employee|salary|clubs|
+--------+------+-----+
|E1 |100.0 |a |
|E1 |100.0 |b |
|E2 |200.0 |e |
|E2 |200.0 |f |
|E3 |300.0 |c |
|E3 |300.0 |d |
+--------+------+-----+
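For the specific requirement in the question (split only once when the entries in ProcessBreakdownSummaryList and ProcessBreakdownUp1SummaryList correspond, and produce extra rows when they differ), one possible reading is to pair the two lists positionally and emit one row per distinct pair. A minimal sketch with hypothetical, simplified columns and sample data:
import spark.implicits._

// Hypothetical, simplified rows: an ID plus the two comma-separated list columns.
val sample = Seq(
  ("CERT1",
   "CS10-100-22-10 - Mine Intake Air Fan Heater System,CS10-100-81-10 - Mine Services Switchgear",
   "CS10-100-22 - Service Shaft Ventilation,CS10-100-81 - Service Shaft Electrical")
).toDF("ID", "ProcessBreakdownSummaryList", "ProcessBreakdownUp1SummaryList")

val splitRows = sample
  .as[(String, String, String)]
  .flatMap { case (id, summaryList, up1List) =>
    val subs = summaryList.split(",").map(_.trim)
    val ups  = up1List.split(",").map(_.trim)
    // Pair the entries positionally; distinct keeps a single row per pair, and
    // zipAll pads with empty strings when one list has extra entries.
    subs.zipAll(ups, "", "").distinct.map { case (sub, up) => (id, sub, up) }
  }
  .toDF("ID", "SubSystemEntry", "SystemEntry")

splitRows.show(false)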

add new column in a dataframe depending on another dataframe's row values

I need to add a new column to dataframe DF1, but the new column's value should be calculated using the values of other columns present in that DF. Which of the other columns to use will be given in another dataframe, DF2.
e.g. DF1 -
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+-----------+------------+
|Product1  | AB      |testMethod1| TP1        |
|Product2  | CD      |testMethod2| TP2        |
DF2 -
|action|type |value                    |exploded    |
+------+-----+-------------------------+------------+
|append|hash |[protocolNo]             |protocolNo  |
|append|text |_                        |_           |
|append|hash |[serialNum,testProperty] |serialNum   |
|append|hash |[serialNum,testProperty] |testProperty|
Now the value of the exploded column in DF2 will be a DF1 column name whenever the value of the type column is hash.
Required:
A new column should be created in DF1, with its value calculated like below:
hash[protocolNo]_hash[serialNumTestProperty] ~~~ where in place of each column name its corresponding row value should be used.
E.g. for row 1 of DF1 the column value should be
hash[Product1]_hash[ABTP1]
which after hashing will result in something like abc-df_egh-45e.
The above procedure should be followed for each and every row of DF1.
I've tried using map and the withColumn function with a UDF on DF1. But inside the UDF, the outer dataframe's values are not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
|protocolNo|serialNum|testMethod |testProperty| newColumn |
+----------+---------+------------+------------+----------------+
|Product1 | AB |testMethod1 | TP1 | abc-df_egh-4je |
|Product2 | CD |testMethod2 | TP2 | dfg-df_ijk-r56 |
The newColumn values shown are after hashing.
Instead of using DF2 directly, you can translate DF2 into a specification case class, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can process the dataframe with these specifications:
val transformed = specifications
  .foldLeft(dtFrm)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.types.foldLeft(df)((df: DataFrame, tpe: String) => {
    tpe match {
      // have a case match on the action here and append the result with df.withColumn
      case "append" => df
    }
  })
}
Syntax may not be correct
Since DF2 holds the column names that will be used to calculate the new column from DF1, I have made the assumption that DF2 will not be a huge dataframe.
The first step is to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
Now hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array[Row]; we need it to become a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))
The above line converts each Row to a Column with the hash function applied to it, and the reduce concatenates the results with "_". Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
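Putting the pieces together, here is a minimal self-contained sketch with hypothetical sample data shaped like the question's DF1 and DF2. Note that with this approach each listed column is hashed separately, i.e. hash(protocolNo)_hash(serialNum)_hash(testProperty), rather than hashing the concatenation of serialNum and testProperty as one value.
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical sample data matching the question's DF1 and DF2.
val DF1 = Seq(
  ("Product1", "AB", "testMethod1", "TP1"),
  ("Product2", "CD", "testMethod2", "TP2")
).toDF("protocolNo", "serialNum", "testMethod", "testProperty")

val DF2 = Seq(
  ("append", "hash", "[protocolNo]", "protocolNo"),
  ("append", "text", "_", "_"),
  ("append", "hash", "[serialNum,testProperty]", "serialNum"),
  ("append", "hash", "[serialNum,testProperty]", "testProperty")
).toDF("action", "type", "value", "exploded")

// Collect the DF1 column names flagged as "hash" in DF2, hash each one,
// and concatenate the hashes with "_".
val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))

DF1.withColumn("newColumn", newColumnHash).show(false)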
Hope this helps!

How to merge two or more columns into one?

I have a streaming dataframe over which I want to calculate the min and avg of some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The dataframe looks like this:
+-----+-----+
|  1  |  2  |
+-----+-----+
| 24  | 55  |
| 20  | 51  |
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg mathematical operations to be:
+----------+----------+
| result(1)| result(2)|
+----------+----------+
| 20, 22   | 51, 53   |
+----------+----------+
How should I write the expression?
Use the struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
  struct(min(col(name)), avg(col(name))) as s"result($name)")
  ^^^^^^ HERE
The power of struct can be seen when you want to reference a field in the struct by name (not by index).
q.select("structCol.name")
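If you also want to refer to the min and avg parts individually later, you could alias the aggregates inside the struct. A small sketch of that idea, using result_$name instead of result($name) so the later field reference does not need backticks (processedDf and the grouping come from the question):
val res = List("1", "2").map(name =>
  struct(min(col(name)) as "min", avg(col(name)) as "avg") as s"result_$name")

val grouped = processedDf
  .groupBy($"xyz", window($"timestamp", "60 seconds"))
  .agg(res.head, res.tail: _*)

// Each field can then be selected by name:
grouped.select($"result_1.min", $"result_1.avg")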
What you want to do is merge the values of multiple columns into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you:
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+

Read FASTQ file into a Spark dataframe

I'm trying to read FASTQ files into Spark dataframes. I'm having some difficulties because FASTQ is a multi-line format.
Example:
#seq1
AGTCAGTCGAC
+
?##FFBFFDDH
#seq2
CCAGCGTCTCG
+
?88ADA?BDF8
Is there a way to get this data into a Spark dataframe like this:
+-------------+-------------+------------+
| identifier | sequence | quality |
+-------------+-------------+------------+
|seq1 |AGTCAGTCGAC |?##FFBFFDDH |
|seq2 |CCAGCGTCTCG |?88ADA?BDF8 |
+-------------+-------------+------------+
Thanks for your time
I'd use sliding:
import org.apache.spark.mllib.rdd.RDDFunctions._
import spark.implicits._

spark.createDataset(sc.textFile(path).sliding(4, 4).map {
  case Array(id, seq, _, qual) => (id, seq, qual)
}).toDF("identifier", "sequence", "quality")
// +----------+-----------+-----------+
// |identifier| sequence| quality|
// +----------+-----------+-----------+
// | #seq1|AGTCAGTCGAC|?##FFBFFDDH|
// | #seq2|CCAGCGTCTCG|?88ADA?BDF8|
// +----------+-----------+-----------+
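If you want the identifiers without the leading marker character (as in the table in the question), you could strip it in the same map. A small sketch, assuming the records use the marker shown above:
import org.apache.spark.mllib.rdd.RDDFunctions._
import spark.implicits._

spark.createDataset(sc.textFile(path).sliding(4, 4).map {
  // drop the leading "@" (or "#" as in the example above) from the identifier line
  case Array(id, seq, _, qual) => (id.stripPrefix("@").stripPrefix("#"), seq, qual)
}).toDF("identifier", "sequence", "quality")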

Copy schema from one dataframe to another dataframe

I'm trying to change the schema of an existing dataframe to the schema of another dataframe.
DataFrame 1:
Column A | Column B | Column C | Column D
"a" | 1 | 2.0 | 300
"b" | 2 | 3.0 | 400
"c" | 3 | 4.0 | 500
DataFrame 2:
Column K | Column B | Column F
"c" | 4 | 5.0
"b" | 5 | 6.0
"f" | 6 | 7.0
So I want to apply the schema of the first dataframe to the second: all the columns that exist in both remain, the columns in dataframe 2 that are not in dataframe 1 get dropped, and the columns that exist only in dataframe 1 become NULL.
Output
Column A | Column B | Column C | Column D
"NULL" | 4 | "NULL" | "NULL"
"NULL" | 5 | "NULL" | "NULL"
"NULL" | 6 | "NULL" | "NULL"
So I came up with a possible solution:
val schema = df1.schema
val newRows: RDD[Row] = df2.map(row => {
  val values = row.schema.fields.map(s => {
    if (schema.fields.contains(s)) {
      row.getAs(s.name).toString
    } else {
      "NULL"
    }
  })
  Row.fromSeq(values)
})
sqlContext.createDataFrame(newRows, schema)
Now, as you can see, this will not work because the schema contains String, Int and Double, while all my rows have String values.
This is where I'm stuck: is there a way to automatically convert the type of my values to match the schema?
If the schema is flat, I would simply map over the pre-existing schema and select the required columns:
val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}
df2.select(exprs: _*).printSchema
// root
// |-- A: string (nullable = true)
// |-- B: integer (nullable = false)
// |-- C: double (nullable = true)
// |-- D: integer (nullable = true)
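Applying the same expressions to the question's DF2 and showing the data rather than the schema should then keep Column B and fill the remaining columns with nulls, matching the desired output in the question (assuming the shared column's StructField is identical in both schemas):
df2.select(exprs: _*).show()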
Working in 2018 (Spark 2.3), reading a .sas7bdat in Scala:
val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema //getting the schema from another dataframe
val df = spark.read.format("csv").option("header","true").schema(myManualSchema).load(csvFile)
PS: spark.sqlContext.sasFile uses the saurfang library; you could skip that part of the code and get the schema from any other dataframe.
Below are simple PySpark steps to achieve the same:
import pyspark.sql.functions as f

df = <dataframe whose schema needs to be copied>
df_tmp = <dataframe with result with fewer fields>
# Note: field names from df_tmp must match the field names from df
df_tmp_cols = [colmn.lower() for colmn in df_tmp.columns]
for col_dtls in df.dtypes:
    col_name, dtype = col_dtls
    if col_name.lower() in df_tmp_cols:
        df_tmp = df_tmp.withColumn(col_name, f.col(col_name).cast(dtype))
    else:
        df_tmp = df_tmp.withColumn(col_name, f.lit(None).cast(dtype))
df_fin = df_tmp.select(df.columns)  # Final dataframe
You could simply do a left join on your dataframes with a query like this:
SELECT Column A, Column B, Column C, Column D FROM foo LEFT JOIN bar ON Column C = Column C
Please check out the answer by @zero323 in this post:
Spark specify multiple column conditions for dataframe join
Thanks,
Charles.