Join two text file with one column different in their schema in spark scala - scala

I have two text files and I am creating data frame out of that. Both files have the same no of columns except one column.
When I crate schema and join both I get error like
java.lang.ArrayIndexOutOfBoundsException
Basically my schema has columns and my one of text file has only 5 columns.
No how to append some null value to already created schema and then do join?
Here is my code
val schema = StructType(Array(
StructField("TimeStamp", StringType),
StructField("Id", StringType),
StructField("Name", StringType),
StructField("Val", StringType),
StructField("Age", StringType),
StructField("Dept", StringType)))
val textRdd1 = sc.textFile("s3://test/Text1.txt")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split(",", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)
val textRdd2 = sc.textFile("s3://test/Text2.txt")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split(",", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)
val df3 = df1.join(df2)
TimeStamp column is not present in the first text file ...

Why don't you just exclude TimeStamp field from schema for first DataFrame?
val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
As mentioned in comments, you need not schemas to be similar. You also can specify you join condition and select columns to join on.

Add the Timestamp column to the 1st dataframe
import spark.sql.functions._
import org.apache.spark.sql.types.DataType
val df1Final = df1.withColumn("TimeStamp", lit(null).cast(Long))
Then proceed with the join

You can create a new schema without this field, and use this schema. What Dmitri was suggesting is to use the original schema and remove the field that you don’t need to save you writing a second schema definition.
Once you have the 2 files loaded in to a dataset you perform the JOIN base in the common fields and remove the duplicate columns, that I guess is what you want, doing this:
df3 = df1.join(df2, (df1("Id") === df2("Id")) && (df1("Name") === df2("Name")) && (df1("Val") === df2("Val")) && (df1("Age") === df2("Age")) && (df1("Dept") === df2("Dept")))
.drop(df2("Id"))
.drop(df2("Name"))
.drop(df2("Val"))
.drop(df2("Age"))
.drop(df2("Dept"))

Related

what is the order guarantee when joining two columns of a spark dataframe which are processed separately?

I have dataframe with 3 columns
date
jsonString1
jsonString2
I want to expand attributes inside json into columns. so i did something like this.
val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))
val json1Table = json1.selectExpr("id", "status")
val json2Table = json2.selectExpr("name", "address")
now i want to put these table together. so i did the following
val json1TableWithIndex = addColumnIndex(json1Table)
val json2TableWithIndex = addColumnIndex(json2Table)
var finalResult = json1Table
.join(json2Table, Seq("columnindex"))
.drop("columnindex")
def addColumnIndex(df: DataFrame) = spark.createDataFrame(
df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
After sampling few rows I observe that rows match exactly as in the source dataframe
I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem. Any help is appreciated.
It is always risky to rely on undocumented behaviours, as your code might not work as you intended, because you only have a partial understanding of it.
You can do the same thing in a much more efficient way without using any split and join approach. Use a from_json function to create two nested columns and then flatten out the nested columns and finally drop out the intermediate JSON string columns and nested columns.
Here is an example fo the whole process.
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val df = (Seq(
("09-02-2020","{\"id\":\"01\", \"status\":\"Active\"}","{\"name\":\"Abdullah\", \"address\":\"Jumeirah\"}"),
("10-02-2020","{\"id\":\"02\", \"status\":\"Dormant\"}","{\"name\":\"Ali\", \"address\":\"Jebel Ali\"}")
).toDF("date","jsonString1","jsonString2"))
scala> df.show()
+----------+--------------------+--------------------+
| date| jsonString1| jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+
val schema1 = (StructType(Seq(
StructField("id", StringType, true),
StructField("status", StringType, true)
)))
val schema2 = (StructType(Seq(
StructField("name", StringType, true),
StructField("address", StringType, true)
)))
val dfFlattened = (df.withColumn("jsonData1", from_json(col("jsonString1"), schema1))
.withColumn("jsonData2", from_json(col("jsonString2"), schema2))
.withColumn("id", col("jsonData1.id"))
.withColumn("status", col("jsonData1.status"))
.withColumn("name", col("jsonData2.name"))
.withColumn("address", col("jsonData2.address"))
.drop("jsonString1")
.drop("jsonString2")
.drop("jsonData1")
.drop("jsonData2"))
scala> dfFlattened.show()
+----------+---+-------+--------+---------+
| date| id| status| name| address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant| Ali|Jebel Ali|
+----------+---+-------+--------+---------+

Spark dataframe with complex & nested data

I have 3 dataframes currently
Call them dfA, dfB, and dfC
dfA has 3 cols
|Id | Name | Age |
dfB has say 5 cols. the 2nd col, is a FK reference back to dFA record.
|Id | AId | Street | City | Zip |
Similarily dfC has 3 cols, also with a reference back to dfA
|Id | AId | SomeField |
Using Spark SQL i can do an JOIN across the 3
%sql
SELECT * FROM dfA
INNER JOIN dfB ON dfA.Id = dfB.AId
INNER JOIN dfC ON dfA.Id = dfC.AId
And I'll get my resultset, but it's been "flattened" as SQL would do with tabular results like this.
I want to load it in to a complex schema like this
val destinationSchema = new StructType()
.add("id", IntegerType)
.add("name", StringType)
.add("age", StringType)
.add("b",
new StructType()
.add("street", DoubleType, true)
.add("city", StringType, true)
.add("zip", StringType, true)
)
.add("c",
new StructType()
.add("somefield", StringType, true)
)
Any ideas how to take the results of the SELECT and save to dataframe with specifying the schema?
I ultimately want to save the complex StructType, or JSON, and load this is to Mongo DB using the Mongo Spark Connector.
Or, is there a better way to accomplish this from the 3 seperate dataframes (which were originally 3 seperate CSV files that were read in)?
given three dataframes, loaded in from csv files, you can do this:
import org.apache.spark.sql.functions._
val destDF = atableDF
.join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
.join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
.select($"id",$"name",$"age",struct($"street",$"city",$"zip") as "b",struct($"somefield") as "c")
val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))
which will output:
row
"{""id"":100,""name"":""John"",""age"":""43"",""b"":{""street"":""Dark Road"",""city"":""Washington"",""zip"":""98002""},""c"":{""somefield"":""appples""}}"
"{""id"":101,""name"":""Sally"",""age"":""34"",""b"":{""street"":""Light Ave"",""city"":""Los Angeles"",""zip"":""90210""},""c"":{""somefield"":""bananas""}}"
"{""id"":102,""name"":""Damian"",""age"":""23"",""b"":{""street"":""Short Street"",""city"":""New York"",""zip"":""70701""},""c"":{""somefield"":""pears""}}"
the previous one worked if all the records had a 1:1 relationship.
here is how you can achieve it for 1:M (hint: use collect_set to group rows)
import org.apache.spark.sql.functions._
val destDF = atableDF
.join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
.join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
.groupBy($"id",$"name",$"age")
.agg(collect_set(struct($"street",$"city",$"zip")) as "b",collect_set(struct($"somefield")) as "c")
val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))
display(jsonDestDF)
which will give you the following output:
row
"{""id"":102,""name"":""Damian"",""age"":""23"",""b"":[{""street"":""Short Street"",""city"":""New York"",""zip"":""70701""}],""c"":[{""somefield"":""pears""},{""somefield"":""pineapples""}]}"
"{""id"":100,""name"":""John"",""age"":""43"",""b"":[{""street"":""Dark Road"",""city"":""Washington"",""zip"":""98002""}],""c"":[{""somefield"":""appples""}]}"
"{""id"":101,""name"":""Sally"",""age"":""34"",""b"":[{""street"":""Light Ave"",""city"":""Los Angeles"",""zip"":""90210""}],""c"":[{""somefield"":""grapes""},{""somefield"":""peaches""},{""somefield"":""bananas""}]}"
sample data I used just in case anyone wants to play:
atable.csv
100,"John",43
101,"Sally",34
102,"Damian",23
104,"Rita",14
105,"Mohit",23
btable.csv:
100,"Dark Road","Washington",98002
101,"Light Ave","Los Angeles",90210
102,"Short Street","New York",70701
104,"Long Drive","Buffalo",80345
105,"Circular Quay","Orlando",65403
ctable.csv:
100,"appples"
101,"bananas"
102,"pears"
101,"grapes"
102,"pineapples"
101,"peaches"

Handling schema mismatches in Spark

I am reading a csv file using Spark in Scala.
The schema is predefined and i am using it for reading.
This is the esample code:
// create the schema
val schema= StructType(Array(
StructField("col1", IntegerType,false),
StructField("col2", StringType,false),
StructField("col3", StringType,true)))
// Initialize Spark session
val spark: SparkSession = SparkSession.builder
.appName("Parquet Converter")
.getOrCreate
// Create a data frame from a csv file
val dataFrame: DataFrame =
spark.read.format("csv").schema(schema).option("header", false).load(inputCsvPath)
From what i read when reading cav with Spark using a schema there are 3 options:
Set mode to DROPMALFORMED --> this will drop the lines that don't match the schema
Set mode to PERMISSIVE --> this will set the whole line to null values
Set mode to FAILFAST --> this will throw an exception when a mismatch is discovered
What is the best way to combine the options? The behaviour I want is to get the mismatches in the schema, print them as errors and ignoring the lines in my data frame.
Basically, I want a combination of FAILFAST and DROPMALFORMED.
Thanks in advance
This is what I eventually did:
I added to the schema the "_corrupt_record" column, for example:
val schema= StructType(Array(
StructField("col1", IntegerType,true),
StructField("col2", StringType,false),
StructField("col3", StringType,true),
StructField("_corrupt_record", StringType, true)))
Then I read the CSV using PERMISSIVE mode (it is Spark default):
val dataFrame: DataFrame = spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "PERMISSIVE")
.load(inputCsvPath)
Now my data frame holds an additional column that holds the rows with schema mismatches.
I filtered the rows that have mismatched data and printed it:
val badRows = dataFrame.filter("_corrupt_record is not null")
badRows.cache()
badRows.show()
Just use DROPMALFORMED and follow the log. If malformed records are present there are dumped to the log, up to the limit set by maxMalformedLogPerPartition option.
spark.read.format("csv")
.schema(schema)
.option("header", false)
.option("mode", "DROPMALFORMED")
.option("maxMalformedLogPerPartition", 128)
.load(inputCsvPath)

Spark Dataframe Join - Duplicate column (non-joined column)

I have two Dataframes df1 (Employee table) & df2 (Department table) with following schema :
df1.columns
// Arrays(id,name,dept_id)
and
df2.columns
// Array(id,name)
After i join these two tables on df1.dept_id and df2.id :
val joinedData = df1.join(df2,df1("dept_id")===df2("id"))
joinedData.columns
// Array(id,name,dept_id,id,name)
While saving it in file,
joined.write.csv("<path>")
it gives error :
org.apache.spark.sql.AnalysisException: Duplicate column(s) : "name", "id" found, cannot save to file.;
I read about using Sequence of Strings to avoid column duplication but that is for columns on which join is to be performed. I need a similar functionality for non-joined columns.
Is there a direct way to embed table name with repeated column so that it can be saved ?
I came up with a solution of matching columns of both dfs and renaming duplicate columns to append table-name to column-name. But is there a direct way ?
Note : This will be a generic code with only column details on which join is performed. Rest columns will be known at runtime only. So we can't rename columns by hard-coding it.
I would just keep all columns by making sure they have different names, e.g. by prepending an identifier to the column names:
val df1Cols = df1.columns
val df2Cols = df2.columns
// prefixes to column names
val df1pf = df1.select(df1Cols.map(n => col(n).as("df1_"+n)):_*)
val df2pf = df2.select(df2Cols.map(n => col(n).as("df2_"+n)):_*)
df1pf.join(df2pf,
$"df1_dept_id"===$"df2_id",
)
After further research and getting views of other developers, it's sure that there is no direct way. One way is to change all column's name as specified by #Raphael. But i solved my problem by changing only duplicate columns :
val commonCols = df1.columns.intersect(df2.columns)
val newDf2 = changeColumnsName(df2,commonCols,"df1")
where changeColumnsName definition is :
#tailrec
def changeColumnsName(dataFrame: DataFrame, columns: Array[String], tableName: String): DataFrame = {
if (columns.size == 0)
dataFrame
else
changeColumnsName(dataFrame.withColumnRenamed(columns.head, tableName + "_" + columns.head), columns.tail, tableName)
}
Now, performing join :
val joinedData = df1.join(newDf2,df1("dept_id")===newDf2("df2_id"))
joinedData.columns
// Array(id,name,dept_id,df2_id,df2_name)
You could try using alias for dataframe,
import spark.implicits._
df1.as("df1")
.join(df2.alias("df2"),df1("dept_id") === df2("id"))
.select($"df1.*",$"df2.*").show()
val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")
val df = left.join(right, left.col("name") === right.col("name"))
display(df)
head(drop(join(left, right, left$name == right$name), left$name))
https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

How to convert all column of dataframe to numeric spark scala?

I loaded a csv as dataframe. I would like to cast all columns to float, knowing that the file is to big to write all columns names:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
Given this DataFrame as example:
val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")
with schema:
StructType(
StructField(id,StringType,true),
StructField(c0,IntegerType,false))
You can loop over DF columns by .columns functions:
val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))
So the new DF schema looks like:
StructType(
StructField(id,FloatType,true),
StructField(c0,FloatType,false))
EDIT:
If you wanna exclude some columns from casting, you could do something like (supposing we want to exclude the column id):
val exclude = Array("id")
val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df)((current, c) =>
current.withColumn(c, col(c).cast("float")))
where exclude is an Array of all columns we want to exclude from casting.
So the schema of this new DF is:
StructType(
StructField(id,StringType,true),
StructField(c0,FloatType,false))
Please notice that maybe this is not the best solution to do it but it can be a starting point.