Read csv file in spark of varying columns

Read csv file in spark of varying columns - scala

I would like to read csv file into dataframe in spark using Scala.
My csv file has first record which has three columns and remaining records have 5 columns. My csv file does not come with column names. I have mentioned here's for understanding
Ex:
I'dtype date recordsCount
0 13-02-2015 300
I'dtype date type location. locationCode
1 13-02-2015. R. USA. Us
1. 13-02-2015. T. London. Lon
My question is how I will read this file into dataframe,as first and remaining rows have different columns.
The solution what I tried is read file as rdd and filter out header record and then convert remaining records into dataframe.
Is there any better solution for ? Please help me

You can load the files as raw text, and then use case classes, Either instances, and pattern matching to sort out what goes where. Example of that below.
case class Col3(c1: Int, c2: String, c3: Int)
case class Col5(c1: Int, c2: String, c5_col3: String, c4:String, c5: String)
case class Header(value: String)
type C3 = Either[Header, Col3]
type C5 = Either[Header, Col5]
// assume sqlC & sc created
val path = "tmp.tsv"
val rdd = sc.textFile(path)
val eitherRdd: RDD[Either[C3, C5]] = rdd.map{s =>
val spl = s.split("\t")
spl.length match{
case 3 =>
val res = Try{
Col3(spl(0).toInt, spl(1), spl(2).toInt)
}
res match{
case Success(c3) => Left(Right(c3))
case Failure(_) => Left(Left(Header(s)))
}
case 5 =>
val res = Try{
Col5(spl(0).toInt, spl(1), spl(2), spl(3), spl(4))
}
res match{
case Success(c5) => Right(Right(c5))
case Failure(_) => Right(Left(Header(s)))
}
case _ => throw new Exception("fail")
}
}
val rdd3 = eitherRdd.flatMap(_.left.toOption)
val rdd3Header = rdd3.flatMap(_.left.toOption).collect().head
val df3 = sqlC.createDataFrame(rdd3.flatMap(_.right.toOption))
val rdd5 = eitherRdd.flatMap(_.right.toOption)
val rdd5Header = rdd5.flatMap(_.left.toOption).collect().head
val df5 = sqlC.createDataFrame(rdd5.flatMap(_.right.toOption))
df3.show()
df5.show()
Tested with simple tsv below:
col1 col2 col3
0 sfd 300
1 asfd 400
col1 col2 col4 col5 col6
2 pljdsfn R USA Us
3 sad T London Lon
which gives output
+---+----+---+
| c1| c2| c3|
+---+----+---+
| 0| sfd|300|
| 1|asfd|400|
+---+----+---+
+---+-------+-------+------+---+
| c1| c2|c5_col3| c4| c5|
+---+-------+-------+------+---+
| 2|pljdsfn| R| USA| Us|
| 3| sad| T|London|Lon|
+---+-------+-------+------+---+
For simplicity sake, I have ignored the date formatting, simply storing those fields as Strings. however it would not be much more complicated to add a date parser to get you a proper column type.
Likewise, I have relied on parsing failure to indicate a header row. You may substitute different logic if either the parsing would not fail, or if a more complicated determination must be made. Similarly, more complicated logic would be needed to differentiate between different record types of the same length, or which may contain (escaped) split character

It's a bit of a hack but here is a solution to ignore the first line of the file.
val cols = Array("dtype", "date", "type", "location", "locationCode")
val schema = new StructType(cols.map(n => StructField(n ,StringType, true)))
spark.read
.schema(schema) // we specify the schema
.option("header", true) // and tell spark that there is a header
.csv("path/file.csv")
The first line is the header, but the schema is specified. The first line is thus ignored.

Related

Spark generate a list of column names that contains(SQL LIKE) a string

This one below is a simple syntax to search for a string in a particular column uisng SQL Like functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The questions is How do I grab each and every column NAME that contained the particular string in its VALUES and generate a new column with a list of those "column names" for every row.
So far this is the approach I took but stuck as I cant use spark-sql "Like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample output. Note that here there are only 3 columns but in the real job I'll be reading multiple tables which can contain dynamic number of columns.
+--------------------------------------------+
|id | col1| col2| col3| merged_cols
+--------------------------------------------+
0 | mango| man | dit | col1, col2
1 | i-man| man2 | mane | col1, col2, col3
2 | iman | mango| ho | col1, col2
3 | dim | kim | sim|
+--------------------------------------------+

This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfies the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work so it is added (containing an empty string) to the dataframe when sent into the foldLeft.
The last row in the code simply removes the extra , that is added in the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative appraoch is to instead use an UDF. This could be preferred when dealing with many columns since foldLeft will create multiple dataframe copies. Here regex is used (not the SQL like since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+, when using older versions use array(df.columns.map(lit(_)): _*) instead.

Combine multiple ArrayType Columns in Spark into one ArrayType Column

I want to merge multiple ArrayType[StringType] columns in spark to create one ArrayType[StringType]. For combining two columns I found the soluton here:
Merge two spark sql columns of type Array[string] into a new Array[string] column
But how do I go about combining, if I don't know the number of columns at compile time. At run time, I will know the names of all the columns to be combined.
One option is to use the UDF defined in the above stackoverflow question, to add two columns, multiple times in a loop. But this involves multiple reads on the entire dataframe. Is there a way to do this in just one go?
+------+------+---------+
| col1 | col2 | combined|
+------+------+---------+
| [a,b]| [i,j]|[a,b,i,j]|
| [c,d]| [k,l]|[c,d,k,l]|
| [e,f]| [m,n]|[e,f,m,n]|
| [g,h]| [o,p]|[g,h,o,p]|
+------+----+-----------+

val arrStr: Array[String] = Array("col1", "col2")
val arrCol: Array[Column] = arrString.map(c => df(c))
val assembleFunc = udf { r: Row => assemble(r.toSeq: _*)}
val outputDf = df.select(col("*"), assembleFunc(struct(arrCol:
_*)).as("combined"))
def assemble(rowEntity: Any*):
collection.mutable.WrappedArray[String] = {
var outputArray =
rowEntity(0).asInstanceOf[collection.mutable.WrappedArray[String]]
rowEntity.drop(1).foreach {
case v: collection.mutable.WrappedArray[String] =>
outputArray ++= v
case null =>
throw new SparkException("Values to assemble cannot be
null.")
case o =>
throw new SparkException(s"$o of type ${o.getClass.getName}
is not supported.")
}
outputArray
}
outputDf.show(false)

Process the dataframe schema and get all the columns of the type ArrayType[StringType].
create a new dataframe with functions.array_union of the first two columns
iterate through the rest of the columns and adding each of them to the combined column
>>>from pyspark import Row
>>>from pyspark.sql.functions import array_union
>>>df = spark.createDataFrame([Row(col1=['aa1', 'bb1'],
col2=['aa2', 'bb2'],
col3=['aa3', 'bb3'],
col4= ['a', 'ee'], foo="bar"
)])
>>>df.show()
+----------+----------+----------+-------+---+
| col1| col2| col3| col4|foo|
+----------+----------+----------+-------+---+
|[aa1, bb1]|[aa2, bb2]|[aa3, bb3]|[a, ee]|bar|
+----------+----------+----------+-------+---+
>>>cols = [col_.name for col_ in df.schema
... if col_.dataType == ArrayType(StringType())
... or col_.dataType == ArrayType(StringType(), False)
... ]
>>>print(cols)
['col1', 'col2', 'col3', 'col4']
>>>
>>>final_df = df.withColumn("combined", array_union(cols[:2][0], cols[:2][1]))
>>>
>>>for col_ in cols[2:]:
... final_df = final_df.withColumn("combined", array_union(col('combined'), col(col_)))
>>>
>>>final_df.select("combined").show(truncate=False)
+-------------------------------------+
|combined |
+-------------------------------------+
|[aa1, bb1, aa2, bb2, aa3, bb3, a, ee]|
+-------------------------------------+

dynamically pass arguments to function in scala

i have record as string with 1000 fields with delimiter as comma in dataframe like
"a,b,c,d,e.......upto 1000" -1st record
"p,q,r,s,t ......upto 1000" - 2nd record
I am using below suggested solution from stackoverflow
Split 1 column into 3 columns in spark scala
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),$"_tmp".getItem(2).as("col3")).drop("_tmp")
however in my case i am having 1000 columns which i have in JSON schema which i can retrive like
column_seq:Seq[Array]=Schema_func.map(_.name)
for(i <-o to column_seq.length-1){println(i+" " + column_seq(i))}
which returns like
0 col1
1 col2
2 col3
3 col4
Now I need to pass all this indexes and column names to below function of DataFrame
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),$"_tmp".getItem(2).as("col3")).drop("_tmp")
in
$"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),
as i cant create the long statement with all 1000 columns , is there any effective way to pass all this arguments from above mentioned json schema to select function , so that i can split the columns , add the header and then covert the DF to parquet.

You can build a series of org.apache.spark.sql.Column, where each one is the result of selecting the right item and has the right name, and then select these columns:
val columns: Seq[Column] = Schema_func.map(_.name)
.zipWithIndex // attach index to names
.map { case (name, index) => $"_tmp".getItem(index) as name }
val result = df
.withColumn("_tmp", split($"columnToSplit", "\\."))
.select(columns: _*)
For example, for this input:
case class A(name: String)
val Schema_func = Seq(A("c1"), A("c2"), A("c3"), A("c4"), A("c5"))
val df = Seq("a.b.c.d.e").toDF("columnToSplit")
The result would be:
// +---+---+---+---+---+
// | c1| c2| c3| c4| c5|
// +---+---+---+---+---+
// | a| b| c| d| e|
// +---+---+---+---+---+

Spark merge dataframe with mismatching schemas without extra disk IO

I would like to merge 2 dataframes with (potentially) mismatching schemas
org.apache.spark.sql.DataFrame = [name: string, age: int, height: int]
org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> A.unionAll(B)
would result in :
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the left table has 2 columns and the right has 3;
I would like to do this from within Spark.
However, the Spark docs only propose to write the whole 2 dataframes out to a directory and read them back in using spark.read.option("mergeSchema", "true").
link to docs
So a union doesn't help me out, and neither does the documentation. I would like to keep this extra I/O out of my job if at all possible. Am I missing some undocumented info, or is it not possible (yet)?

You can append a null column to frame B and after union 2 frames:
import org.apache.spark.sql.functions._
val missingFields = A.schema.toSet.diff(B.schema.toSet)
var C: DataFrame = null
for (field <- missingFields){
C = A.withColumn(field.name, expr("null"));
}
A.unionAll(C)

parquet schema merging is disabled by default, turn on this option by:
(1) set global option: spark.sql.parquet.mergeSchema=true
(2) write code: sqlContext.read.option("mergeSchema", "true").parquet("my.parquet")

Here's a pyspark solution.
It assumes that if the merge can't take place because one dataframe is missing a column contained in the other, then the right thing is to add the missing column with null values.
On the other hand, if the merge can't take place because the two dataframes share a column with conflicting type or nullability, then the right thing is to raise a TypeError (because that's a conflict you probably want to know about).
def harmonize_schemas_and_combine(df_left, df_right):
left_types = {f.name: f.dataType for f in df_left.schema}
right_types = {f.name: f.dataType for f in df_right.schema}
left_fields = set((f.name, f.dataType, f.nullable) for f in df_left.schema)
right_fields = set((f.name, f.dataType, f.nullable) for f in df_right.schema)
# First go over left-unique fields
for l_name, l_type, l_nullable in left_fields.difference(right_fields):
if l_name in right_types:
r_type = right_types[l_name]
if l_type != r_type:
raise TypeError, "Union failed. Type conflict on field %s. left type %s, right type %s" % (l_name, l_type, r_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. left nullable %s, right nullable %s" % (l_name, l_nullable, not(l_nullable))
df_right = df_right.withColumn(l_name, lit(None).cast(l_type))
# Now go over right-unique fields
for r_name, r_type, r_nullable in right_fields.difference(left_fields):
if r_name in left_types:
l_type = right_types[r_name]
if r_type != l_type:
raise TypeError, "Union failed. Type conflict on field %s. right type %s, left type %s" % (r_name, r_type, l_type)
else:
raise TypeError, "Union failed. Nullability conflict on field %s. right nullable %s, left nullable %s" % (r_name, r_nullable, not(r_nullable))
df_left = df_left.withColumn(r_name, lit(None).cast(r_type))
return df_left.union(df_right)

Thanks #conradlee! I modified your solution to allow union by adding casting and removing nullability check. It worked for me.
def harmonize_schemas_and_combine(df_left, df_right):
'''
df_left is the main df; we try to append the new df_right to it.
Need to do three things here:
1. Set other claim/clinical features to NULL
2. Align schemas (data types)
3. Align column orders
'''
left_types = {f.name: f.dataType for f in df_left.schema}
right_types = {f.name: f.dataType for f in df_right.schema}
left_fields = set((f.name, f.dataType) for f in df_left.schema)
right_fields = set((f.name, f.dataType) for f in df_right.schema)
# import pdb; pdb.set_trace() #pdb debugger
# I. First go over left-unique fields:
# For columns in the main df, but not in the new df: add it as Null
# For columns in both df but w/ different datatypes, use casting to keep them consistent w/ main df (Left)
for l_name, l_type in left_fields.difference(right_fields): #1. find what Left has, Right doesn't
if l_name in right_types: #2A. if column is in both, then something's off w/ the schema
r_type = right_types[l_name] #3. tell me what's this column's type in Right
df_right = df_right.withColumn(l_name,df_right[l_name].cast(l_type)) #4. keep them consistent w/ main df (Left)
print("Casting magic happened on column %s: Left type: %s, Right type: %s. Both are now: %s." % (l_name, l_type, r_type, l_type))
else: #2B. if Left column is not in Right, add a NULL column to Right df
df_right = df_right.withColumn(l_name, F.lit(None).cast(l_type))
# Make sure Right columns are in the same order of Left
df_right = df_right.select(df_left.columns)
return df_left.union(df_right)

Here is another solution for this. I used rdd union because dataFrame union operation doesnt support multiple dataFrames.
Note - This should not be used to merge lot of dataFrames with different schema. The cost of adding null columns to dataFrames will result quickly in out of memory errors. (i.e: trying to merge 1000 dataFrames with 10 columns missing will result in 10,000 transformations)
If your use case it to read a dataFrame from storage with different schema that is composed from multiple paths with different schemas, a much better option would be to have your data saved as parquet in the first place and then use the 'mergeSchema' option when reading the dataFrame.
def unionDataFramesAndMergeSchema(spark, dfsList):
'''
This function can perform a union between x dataFrames with different schemas.
Non-existing columns will be filled with null.
Note: If a column exist in 2 dataFrames with different types, an exception will be thrown.
:example:
>>> df1 = spark.createDataFrame([
>>> {
>>> 'A': 1,
>>> 'B': 1,
>>> 'C': 1
>>> }])
>>> df2 = spark.createDataFrame([
>>> {
>>> 'A': 2,
>>> 'C': 2,
>>> 'DNew' : 2
>>> }])
>>> unionDataFramesAndMergeSchema(spark,[df1,df2]).show()
>>> +---+----+---+----+
>>> | A| B| C|DNew|
>>> +---+----+---+----+
>>> | 2|null| 2| 2|
>>> | 1| 1| 1|null|
>>> +---+----+---+----+
:param spark: The Spark session.
:param dfsList: A list of dataFrames.
:return: A union of all dataFrames, with schema merged.
'''
if len(dfsList) == 0:
raise ValueError("DataFrame list is empty.")
if len(dfsList) == 1:
logging.info("The list contains only one dataFrame, no need to perform union.")
return dfsList[0]
logging.info("Will perform union between {0} dataFrames...".format(len(dfsList)))
columnNamesAndTypes = {}
logging.info("Calculating unified column names and types...")
for df in dfsList:
for columnName, columnType in dict(df.dtypes).iteritems():
if columnNamesAndTypes.has_key(columnName) and columnNamesAndTypes[columnName] != columnType:
raise ValueError(
"column '{0}' exist in at least 2 dataFrames with different types ('{1}' and '{2}'"
.format(columnName, columnType, columnNamesAndTypes[columnName]))
columnNamesAndTypes[columnName] = columnType
logging.info("Unified column names and types: {0}".format(columnNamesAndTypes))
logging.info("Adding null columns in dataFrames if needed...")
newDfsList = []
for df in dfsList:
newDf = df
dfTypes = dict(df.dtypes)
for columnName, columnType in dict(columnNamesAndTypes).iteritems():
if not dfTypes.has_key(columnName):
# logging.info("Adding null column for '{0}'.".format(columnName))
newDf = newDf.withColumn(columnName, func.lit(None).cast(columnType))
newDfsList.append(newDf)
dfsWithOrderedColumnsList = [df.select(columnNamesAndTypes.keys()) for df in newDfsList]
logging.info("Performing a flat union between all dataFrames (as rdds)...")
allRdds = spark.sparkContext.union([df.rdd for df in dfsWithOrderedColumnsList])
return allRdds.toDF()

If you read both data frames from storage files you can just use predefined schema:
val schemaForRead =
StructType(List(
StructField("userId", LongType,true),
StructField("dtEvent", LongType,true),
StructField("goodsId", LongType,true)
))
val dfA = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file1.parquet")
val dfB = spark.read.format("parquet").schema(schemaForRead).load("/tmp/file2.parquet")
val dfC = dfA.union(dfB)
Note that schema in files file1 and file2 can be different and can differ form schemaForRead. If file1 doesn't contain field from schemaForRead dataframe A will have empty field with null's. If file contains additional field not presented in schemaForRead dataframe just wouldn't have it.

Here's the version in Scala also answered here -
( Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema ) -
It takes List of dataframe to be unioned .. Provided same named columns in all the dataframe should have same datatype..
def unionPro(DFList: List[DataFrame], spark: org.apache.spark.sql.SparkSession): DataFrame = {
/**
* This Function Accepts DataFrame with same or Different Schema/Column Order.With some or none common columns
* Creates a Unioned DataFrame
*/
import spark.implicits._
val MasterColList: Array[String] = DFList.map(_.columns).reduce((x, y) => (x.union(y))).distinct
def unionExpr(myCols: Seq[String], allCols: Seq[String]): Seq[org.apache.spark.sql.Column] = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
// Create EmptyDF , ignoring different Datatype in StructField and treating them same based on Name ignoring cases
val masterSchema = StructType(DFList.map(_.schema.fields).reduce((x, y) => (x.union(y))).groupBy(_.name.toUpperCase).map(_._2.head).toArray)
val masterEmptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], masterSchema).select(MasterColList.head, MasterColList.tail: _*)
DFList.map(df => df.select(unionExpr(df.columns, MasterColList): _*)).foldLeft(masterEmptyDF)((x, y) => x.union(y))
}
Here is the sample test for it -
val aDF = Seq(("A", 1), ("B", 2)).toDF("Name", "ID")
val bDF = Seq(("C", 1, "D1"), ("D", 2, "D2")).toDF("Name", "Sal", "Deptt")
unionPro(List(aDF, bDF), spark).show
Which gives output as -
+----+----+----+-----+
|Name| ID| Sal|Deptt|
+----+----+----+-----+
| A| 1|null| null|
| B| 2|null| null|
| C|null| 1| D1|
| D|null| 2| D2|
+----+----+----+-----+

if you are using spark version > 2.3.0 then you can use the unionByName in-built function to get the required output.
Link to the Git Repo that contains the code for the unionByName code:
https://github.com/apache/spark/blame/cee4ecbb16917fa85f02c635925e2687400aa56b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1894

Spark Dataset API - join

I am trying to use the Spark Dataset API but I am having some issues doing a simple join.
Let's say I have two dataset with fields: date | value, then in the case of DataFrame my join would look like:
val dfA : DataFrame
val dfB : DataFrame
dfA.join(dfB, dfB("date") === dfA("date") )
However for Dataset there is the .joinWith method, but the same approach does not work:
val dfA : Dataset
val dfB : Dataset
dfA.joinWith(dfB, ? )
What is the argument required by .joinWith ?

To use joinWith you first have to create a DataSet, and most likely two of them. To create a DataSet, you need to create a case class that matches your schema and call DataFrame.as[T] where T is your case class. So:
case class KeyValue(key: Int, value: String)
val df = Seq((1,"asdf"),(2,"34234")).toDF("key", "value")
val ds = df.as[KeyValue]
// org.apache.spark.sql.Dataset[KeyValue] = [key: int, value: string]
You could also skip the case class and use a tuple:
val tupDs = df.as[(Int,String)]
// org.apache.spark.sql.Dataset[(Int, String)] = [_1: int, _2: string]
Then if you had another case class / DF, like this say:
case class Nums(key: Int, num1: Double, num2: Long)
val df2 = Seq((1,7.7,101L),(2,1.2,10L)).toDF("key","num1","num2")
val ds2 = df2.as[Nums]
// org.apache.spark.sql.Dataset[Nums] = [key: int, num1: double, num2: bigint]
Then, while the syntax of join and joinWith are similar, the results are different:
df.join(df2, df.col("key") === df2.col("key")).show
// +---+-----+---+----+----+
// |key|value|key|num1|num2|
// +---+-----+---+----+----+
// | 1| asdf| 1| 7.7| 101|
// | 2|34234| 2| 1.2| 10|
// +---+-----+---+----+----+
ds.joinWith(ds2, df.col("key") === df2.col("key")).show
// +---------+-----------+
// | _1| _2|
// +---------+-----------+
// | [1,asdf]|[1,7.7,101]|
// |[2,34234]| [2,1.2,10]|
// +---------+-----------+
As you can see, joinWith leaves the objects intact as parts of a tuple, while join flattens out the columns into a single namespace. (Which will cause problems in the above case because the column name "key" is repeated.)
Curiously enough, I have to use df.col("key") and df2.col("key") to create the conditions for joining ds and ds2 -- if you use just col("key") on either side it does not work, and ds.col(...) doesn't exist. Using the original df.col("key") does the trick, however.

From https://docs.cloud.databricks.com/docs/latest/databricks_guide/05%20Spark/1%20Intro%20Datasets.html
it looks like you could just do
dfA.as("A").joinWith(dfB.as("B"), $"A.date" === $"B.date" )

For the above example, you can try the below:
Define a case class for your output
case class JoinOutput(key:Int, value:String, num1:Double, num2:Long)
Join two Datasets with Seq("key"), this will help you to avoid two duplicate key columns in the output, which will also help to apply the case class or fetch the data in the next step
val joined = ds.join(ds2, Seq("key")).as[JoinOutput]
// res27: org.apache.spark.sql.Dataset[JoinOutput] = [key: int, value: string ... 2 more fields]
The result will be flat instead:
joined.show
+---+-----+----+----+
|key|value|num1|num2|
+---+-----+----+----+
| 1| asdf| 7.7| 101|
| 2|34234| 1.2| 10|
+---+-----+----+----+

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Read csv file in spark of varying columns - scala

Related

Spark generate a list of column names that contains(SQL LIKE) a string

Combine multiple ArrayType Columns in Spark into one ArrayType Column

dynamically pass arguments to function in scala

Spark merge dataframe with mismatching schemas without extra disk IO

Spark Dataset API - join

Categories

Resources