Spark: choose default value for MergeSchema fields - scala

I have a parquet table with an old schema like this:
| name | gender | age |
| Tom  | Male   | 30  |
Our schema has since been updated to:
| name | gender | age | office |
so we used mergeSchema when reading from the old parquet files:
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
But when reading from these old parquet files, I get the following output:
| name | gender | age | office |
| Tom  | Male   | 30  | null   |
which is normal. But I would like to use a default value for office (e.g. "California"), if and only if the field is not present in the old schema. Is that possible?

There is no simple way to set a default value for a column that exists in some parquet files but not in others.
In the Parquet file format, each parquet file contains its own schema definition. By default, when reading parquet, Spark takes the schema from one of the parquet files. The only effect of the mergeSchema option is that, instead of retrieving the schema from one arbitrary parquet file, Spark reads the schemas of all the parquet files and merges them.
So you can't set a default value without modifying the parquet files.
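You can check this by printing the schema in both cases (a quick check against the same "data/test_table" path as above):
// Without mergeSchema: the schema comes from one file, so "office" may be missing
spark.read.parquet("data/test_table").printSchema()
// With mergeSchema: the schemas of all files are merged, so "office" is always present
spark.read.option("mergeSchema", "true").parquet("data/test_table").printSchema()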
The other possible method is to provide your own schema when reading the parquet files by setting it with .schema(), like this:
spark.read.schema(StructType(Array(StructField("name", StringType), ...))).parquet(...)
But in this case, there is no option to set a default value.
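For completeness, a fully specified schema for the merged layout could look like the sketch below (the column types are assumptions based on the sample data); it still won't give you a default, "office" simply comes back as null for the old files:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val targetSchema = StructType(Array(
  StructField("name", StringType),
  StructField("gender", StringType),
  StructField("age", IntegerType),
  StructField("office", StringType)
))
val df = spark.read.schema(targetSchema).parquet("data/test_table")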
So the only remaining solution is to add the column default value manually.
If we have two parquet files, the first one containing the data with the old schema:
+----+------+---+
|name|gender|age|
+----+------+---+
|Tom |Male |30 |
+----+------+---+
and the second one containing the data with the new schema:
+-----+------+---+------+
|name |gender|age|office|
+-----+------+---+------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
+-----+------+---+------+
If you don't mind replacing all the null values in the "office" column, you can use .na.fill as follows:
spark.read.option("mergeSchema", "true").parquet(path).na.fill("California", Array("office"))
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |California|
|Tom |Male |30 |California|
+-----+------+---+----------+
If you want only the old data to get the default value, you have to read each parquet file into a dataframe, add the column with the default value if necessary, and union all the resulting dataframes:
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// ParquetTable is only used here to retrieve the list of parquet files under `path`
ParquetTable("my_table",
  sparkSession = spark,
  options = CaseInsensitiveStringMap.empty(),
  paths = Seq(path),
  userSpecifiedSchema = None,
  fallbackFileFormat = classOf[ParquetFileFormat]
).fileIndex.allFiles().map(file => {
  val dataframe = spark.read.parquet(file.getPath.toString)
  if (dataframe.columns.contains("office")) {
    dataframe
  } else {
    dataframe.withColumn("office", lit("California"))
  }
}).reduce(_ unionByName _)
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
|Tom |Male |30 |California|
+-----+------+---+----------+
Note that the whole ParquetTable([...]).fileIndex.allFiles() part is only there to retrieve the list of parquet files. It can be simplified if you are on Hadoop or on a local file system, as sketched below.
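For example, you could list the files with the Hadoop FileSystem API instead (a sketch, assuming the parquet files sit directly under path with no partition sub-directories):
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.lit

// List the parquet files directly instead of going through ParquetTable
val fs = new Path(path).getFileSystem(spark.sparkContext.hadoopConfiguration)
val parquetFiles = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

parquetFiles.map { file =>
  val dataframe = spark.read.parquet(file)
  if (dataframe.columns.contains("office")) dataframe
  else dataframe.withColumn("office", lit("California"))
}.reduce(_ unionByName _)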

Related

How to add new rows to DataFrame?

Is there a way in Spark to append a string to a data frame and upload that data frame to S3 as a txt file?

I have built a DF by reading a text file from S3:
val DF = spark.read.textFile("s3_path/file_name.txt")
DF.show(200,false)
+-------------+
|value        |
+-------------+
|country:india|
|address:xyz  |
+-------------+
After this I need to append and update some strings in that file and upload it back to S3 at the same location.
Expected output:
+------------+
|value       |
+------------+
|country:abcd|
|address:xyz |
|pin:1234    |
+------------+
This is the union operation:
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
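A minimal sketch of that idea (the paths, the replaced value, and the appended line here are only illustrative):
import spark.implicits._

// spark.read.textFile gives a Dataset[String], one element per line
val existing = spark.read.textFile("s3_path/file_name.txt")

// Update the line you want to change, then append the new line via union
val updated = existing.map(line => if (line == "country:india") "country:abcd" else line)
val result = updated.union(Seq("pin:1234").toDS())

// Write back as plain text (write to a new location first; Spark cannot
// overwrite the same path it is reading from)
result.coalesce(1).write.mode("overwrite").text("s3_path/output")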

export many files from a table

I have a SQL query that generates a table in the below format:
|sex |country|popularity|
|null |null | x |
|null |value | x |
|value|null | x |
|value|null | x |
|null |value | x |
|value|value | x |
The value for the sex column can be woman or man.
The value for country can be Italy, England, US, etc.
x is an int.
Now I would like to save four files based on the (value, null) combinations. So file1 consists of (value, value) for the columns sex and country,
file2 consists of (value, null), file3 consists of (null, value), and file4 consists of
(null, null).
I have searched a lot but couldn't find any useful info. I have also tried the following:
val df1 = data.withColumn("combination",concat(col("sex") ,lit(","), col("country")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
but I receive more files because this command generates files based on all the possible values of (sex, country).
Same with the following:
val df1 = data.withColumn("combination",concat(col("sex")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
Is there any command similar to partitionBy that partitions by the (value, null) combinations rather than by the column values?
You can convert the columns into Booleans depending on whether they are null or not, and concatenate them into a string, which will look like "true_true", "true_false", etc.
import org.apache.spark.sql.functions.{col, concat, lit}

val dfWithType = df.withColumn("coltype", concat(col("sex").isNull, lit("_"), col("country").isNull))

dfWithType.coalesce(1)
  .write
  .partitionBy("coltype")
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("output")
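With this, partitionBy("coltype") should produce at most four partition directories, e.g. coltype=false_false (both columns have a value), coltype=false_true (sex has a value, country is null), coltype=true_false and coltype=true_true, each containing a single CSV file thanks to coalesce(1).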

Need to read and then remove duplicates from multiple CSV files in Spark Scala

I am new to this space. I have multiple partitioned CSV files with duplicate records. I want to read the CSV files in Spark Scala code and remove the duplicates while reading.
I have tried dropDuplicates() and read.format("csv") with the load option.
val df1 = sparkSession.read.format("csv")
  .option("delimiter", "|")
  .option("header", true)
  .load("path/../../*csv")
df1.dropDuplicates().show()
If, let's say, csv1 has the values
emp1 1000 nuu -1903.33
emp2 1003 yuu 1874.44
and csv2 has
emp1 1000 nuu -1903.33
emp4 9848 hee 1874.33
I need only one record for emp1 to be processed further.
Expected output:
emp1 1000 nuu -1903.33
emp2 1003 yuu 1874.44
emp4 9848 hee 1874.33
dropDuplicates() works perfectly.
val sourcecsv = spark.read.option("header", "true").option("delimiter", "|").csv("path/../../*csv")
sourcecsv.show()
+-----+-----+----+--------+
|empid|idnum|name| credit|
+-----+-----+----+--------+
| emp1| 1000| nuu|-1903.33|
| emp2| 1003| yuu| 1874.44|
| emp4| 9848| hee| 1874.33|
| emp1| 1000| nuu|-1903.33|
| emp2| 1003| yuu| 1874.44|
+-----+-----+----+--------+
//dropDuplicates() on a dataframe works perfect as expected
sourcecsv.dropDuplicates().show()
+-----+-----+----+--------+
|empid|idnum|name| credit|
+-----+-----+----+--------+
| emp1| 1000| nuu|-1903.33|
| emp4| 9848| hee| 1874.33|
| emp2| 1003| yuu| 1874.44|
+-----+-----+----+--------+
Please let us know if there is any other issue.
Based on your input data, the CSV columns are delimited by a pipe. To read the CSV into a data frame you can do:
var df1 = sparkSession.read.option("delimiter","|").csv(filePath)
//Drop duplicates
val result = df1.dropDuplicates
result.show
Output is:
+----+----+---+--------+
| _c0| _c1|_c2| _c3|
+----+----+---+--------+
|emp1|1000|nuu|-1903.33|
|emp4|9848|hee| 1874.33|
|emp2|1003|yuu| 1874.44|
+----+----+---+--------+
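If your files have a header row, adding .option("header", "true") before .csv(filePath), as in the answer above, keeps the real column names (empid, idnum, name, credit) instead of the generated _c0, _c1, ... names.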

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts, where x is the number I want, determined by a variable called howMany.
Given this DataFrame:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if the variable howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe, add each of these to it one by one, and return this newly created dataframe.
val queries = df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
  val middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
  val queryDF = middleDF.sort(col("count").desc).limit(howMany).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
  emptyDF = emptyDF.union(queryDF) // assuming emptyDF is a var with a matching schema
}
emptyDF
I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
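If you then want to remove the count column, as you mentioned, that is simply:
newDF.drop("count")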

In Spark and Scala, how to convert or map a dataframe to specific column info?

Scala, Spark, IntelliJ IDEA.
I have a dataframe (multiple rows, multiple columns) from a CSV file, and I want to map it to another specific set of column info.
I think I need a Scala class (not a case class, because the column count is > 22) or map(), but I don't know how to do the conversion.
Example: a dataframe from the CSV file:
----------------------
| No | price| name |
----------------------
| 1 | 100 | "A" |
----------------------
| 2 | 200 | "B" |
----------------------
The target column info:
=> {product_id, product_name, seller}
First, product_id maps to 'No'.
Second, product_name maps to 'name'.
Third, seller is null or "" (an empty string).
So, finally, I want a dataframe that has the new column info:
-----------------------------------------
| product_id | product_name | seller |
-----------------------------------------
| 1 | "A" | |
-----------------------------------------
| 2 | "B" | |
-----------------------------------------
If you already have a dataframe (e.g. old_df):
val new_df = old_df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name").
  drop("price").
  withColumn("seller", ... )
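Since the question says seller should be null or an empty string, the last step could for example use lit("") (a sketch built on the snippet above):
import org.apache.spark.sql.functions.lit

val new_df = old_df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name").
  drop("price").
  withColumn("seller", lit(""))  // empty-string default, as described in the question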
Let's say your CSV file is "products.csv".
First you have to load it in Spark; you can do that using:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("products.csv")
Once the data is loaded, you will have all the column names in the dataframe df. As you mentioned, your column names will be "No", "price", "name".
To change the name of a column you just have to use the withColumnRenamed API of the dataframe:
val renamedDf = df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name")
Your renamedDf will have the column names you assigned.