Export many files from a table - Scala

I have a SQL query that generates a table in the format below:
|sex |country|popularity|
|null |null | x |
|null |value | x |
|value|null | x |
|value|null | x |
|null |value | x |
|value|value | x |
The value for the sex column could be woman or man.
The value for country could be Italy, England, US, etc.
x is an int.
Now I would like to save four files based on the (value, null) combinations: file1 consists of the (value, value) rows for the columns (sex, country), file2 consists of the (value, null) rows, file3 consists of the (null, value) rows, and file4 consists of the (null, null) rows.
I have searched a lot but couldn't find any useful info. I have also tried the following:
val df1 = data.withColumn("combination",concat(col("sex") ,lit(","), col("country")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
but I receive more files than that, because this command generates one file per distinct (sex, country) value pair. The same happens with the following:
val df1 = data.withColumn("combination",concat(col("sex")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
Is there any command similar to partitionBy that partitions by the (value, null) combination of the pair rather than by the actual column values?

You can convert the columns into Booleans depending on whether they are null or not, and concat them into a string, which will look like "true_true", "true_false", etc.:
df = df.withColumn("coltype", concat(col("sex").isNull(), lit("_"), col("country").isNull()))
df.coalesce(1)
  .write
  .partitionBy("coltype")
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("output")

Spark: choose default value for MergeSchema fields

I have a parquet file that has an old schema like this:
| name | gender | age |
| Tom | Male | 30 |
And as our schema got updated to :
| name | gender | age | office |
we used mergeSchema when reading from the old parquet:
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
But when reading from these old parquet files, I got the following output:
| name | gender | age | office |
| Tom | Male | 30 | null |
which is normal. But I would like to use a default value for office (e.g. "California"), if and only if the field is not present in the old schema. Is it possible?
There is no simple way to set a default value when a column doesn't exist in some parquet files but exists in others.
In the Parquet file format, each parquet file contains its schema definition. By default, when reading parquet, Spark gets the schema from one parquet file. The only effect of the mergeSchema option is that instead of retrieving the schema from one random parquet file, Spark reads the schemas of all parquet files and merges them.
So you can't set a default value without modifying the parquet files.
The other possible method is to provide your own schema when reading the parquets, by passing it to .schema() like this:
spark.read.schema(StructType(Array(StructField("name", StringType), ...))).parquet(...)
But in this case, there is no option to set a default value.
So the only remaining solution is to add the column's default value manually.
If we have two parquets, the first one containing the data with the old schema:
+----+------+---+
|name|gender|age|
+----+------+---+
|Tom |Male |30 |
+----+------+---+
and the second one containing the data with the new schema:
+-----+------+---+------+
|name |gender|age|office|
+-----+------+---+------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
+-----+------+---+------+
If you don't mind replacing all the null values in the "office" column (including those in the new data), you can use .na.fill as follows:
spark.read.option("mergeSchema", "true").parquet(path).na.fill("California", Array("office"))
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |California|
|Tom |Male |30 |California|
+-----+------+---+----------+
If you want only the old data to get the default value, you have to read each parquet file into a dataframe, add the column with the default value if necessary, and union all the resulting dataframes:
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.util.CaseInsensitiveStringMap

ParquetTable("my_table",
  sparkSession = spark,
  options = CaseInsensitiveStringMap.empty(),
  paths = Seq(path),
  userSpecifiedSchema = None,
  fallbackFileFormat = classOf[ParquetFileFormat]
).fileIndex.allFiles().map(file => {
  // read each parquet file separately
  val dataframe = spark.read.parquet(file.getPath.toString)
  if (dataframe.columns.contains("office")) {
    dataframe
  } else {
    // old schema: add the column with the default value
    dataframe.withColumn("office", lit("California"))
  }
}).reduce(_ unionByName _)
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
|Tom |Male |30 |California|
+-----+------+---+----------+
Note that the whole ParquetTable(...).fileIndex.allFiles() part is only there to retrieve the list of parquet files. It can be simplified if you are on Hadoop or on a local file system.
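For reference, here is a minimal sketch of that simplification, assuming the parquet files sit on a Hadoop-compatible file system (HDFS, local, S3A, ...) and reusing the path variable and default value from above:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.lit

// List the parquet part files directly instead of going through ParquetTable
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val parquetFiles = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

val merged = parquetFiles
  .map { f =>
    val df = spark.read.parquet(f)
    if (df.columns.contains("office")) df
    else df.withColumn("office", lit("California"))
  }
  .reduce(_ unionByName _)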

Need to read and then remove duplicates from multiple CSV files in Spark Scala

I am new to this space. I have multiple partitioned CSV files containing duplicate records. I want to read the CSV files in Spark Scala code and remove the duplicates while reading.
I have tried dropDuplicates() and read.format("csv") with the load option.
var df1 = thesparksession.read.format("csv").option("delimiter", "|").option("header", true).load("path/../../*csv")
  .withColumn(col1)
df1.dropDuplicates().show()
If, let's say, csv1 has the values
emp1 1000 nuu -1903.33
emp2 1003 yuu 1874.44
and csv2 has
emp1 1000 nuu -1903.33
emp4 9848 hee 1874.33
I need only the single record for emp1 to be processed further.
Expected output:
emp1 1000 nuu -1903.33
emp2 1003 yuu 1874.44
emp4 9848 hee 1874.33
dropDuplicates() works perfectly.
val sourcecsv = spark.read.option("header", "true").option("delimiter", "|").csv("path/../../*csv")
sourcecsv.show()
+-----+-----+----+--------+
|empid|idnum|name| credit|
+-----+-----+----+--------+
| emp1| 1000| nuu|-1903.33|
| emp2| 1003| yuu| 1874.44|
| emp4| 9848| hee| 1874.33|
| emp1| 1000| nuu|-1903.33|
| emp2| 1003| yuu| 1874.44|
+-----+-----+----+--------+
// dropDuplicates() on a dataframe works perfectly, as expected
sourcecsv.dropDuplicates().show()
+-----+-----+----+--------+
|empid|idnum|name| credit|
+-----+-----+----+--------+
| emp1| 1000| nuu|-1903.33|
| emp4| 9848| hee| 1874.33|
| emp2| 1003| yuu| 1874.44|
+-----+-----+----+--------+
Please let us know if there is any other issue.
Based on your input data, the CSV columns are delimited by a pipe. To read the CSV into a data frame you can do:
var df1 = sparkSession.read.option("delimiter","|").csv(filePath)
//Drop duplicates
val result = df1.dropDuplicates
result.show
Output is:
+----+----+---+--------+
| _c0| _c1|_c2| _c3|
+----+----+---+--------+
|emp1|1000|nuu|-1903.33|
|emp4|9848|hee| 1874.33|
|emp2|1003|yuu| 1874.44|
+----+----+---+--------+
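If the duplicates should be decided by the employee id alone rather than by the whole row, dropDuplicates also accepts a subset of columns. A minimal sketch, reusing the sourcecsv DataFrame from the first answer (which reads the files with a header):

// Keep one row per empid, regardless of differences in the other columns
sourcecsv.dropDuplicates(Seq("empid")).show()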

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to be able to take only the x Ids with the highest counts; x is the number of Ids I want, determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and completely wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then .limit(howMany), and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe, add each of these results to it one by one, and return this newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val howMany = 2
val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
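As the question notes, dropping the count column afterwards is trivial; for completeness, on the result above (the trimmedDF name is just for illustration):

// Keep only the query and the collected Ids
val trimmedDF = newDF.drop("count")
trimmedDF.show(truncate = false)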

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON, which we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':1}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is that I'd like to provide a lambda/udf to merge the data column, but I'm not sure how to think about that during a join.
Alternatively, I could break the properties from the JSON into columns, something like this; might that be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more Spark-esque way to do this, but incorporating the accepted answer, this appears to work for the question update:
for i in range(0, (len(validations) - 1)):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[(i+1)]['df']
    # joins here...
    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
    .unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
    .select("id", f.explode("a_list").alias("a_values"), "count")\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
    .select("id", f.to_json(f.struct("a_list", "count")).alias("data"))

Remove duplicate column from dataframe using Scala

I need to remove one column from a DataFrame that has another column with the same name. I need to remove only one column and keep the other for further use.
For example, given this input DF:
sno | age | psk | psk
---------------------
1 | 12 | a4 | a4
I would like to obtain this output DF:
sno | age | psk
----------------
1 | 12 | a4
Going through an RDD is the way (but you need to know the column index of the duplicate column in order to remove it when converting back to a dataframe).
If you have a dataframe with duplicate columns such as
+---+---+---+---+
|sno|age|psk|psk|
+---+---+---+---+
|1 |12 |a4 |a4 |
+---+---+---+---+
You know that the last two column indexes are duplicates.
The next step is to get the column names with the duplicates removed and form the schema:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val columns = df.columns.distinct
val schema = StructType(columns.map(name => StructField(name, StringType, true)))
The vital part is to convert the dataframe to an RDD and drop the required column index (here it is the 4th):
import org.apache.spark.sql.Row

val rdd = df.rdd.map(row => Row.fromSeq(Seq(row(0).toString, row(1).toString, row(2).toString)))
The final step is to convert the RDD to a dataframe using the schema:
sqlContext.createDataFrame(rdd, schema).show(false)
which should give you
+---+---+---+
|sno|age|psk|
+---+---+---+
|1 |12 |a4 |
+---+---+---+
I hope the answer is helpful
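An alternative sketch (not part of the answer above): rename the columns positionally with toDF so the duplicate gets a unique name, then drop it. The column names below assume the example dataframe (sno, age, psk, psk):

// Positional rename, then drop the renamed duplicate column
val deduped = df.toDF("sno", "age", "psk", "psk_dup").drop("psk_dup")
deduped.show(false)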