How do I save a Spark DataFrame to files using the column values as the filenames? Is it possible?
+---+----------+----------+----------------------------+
|ID |CITY      |DATE      |name                        |
+---+----------+----------+----------------------------+
|1  |          |2011-01-01|20110101_DATA.snappy.parquet|
|2  |          |2011-01-01|20110101_DATA.snappy.parquet|
|3  |          |2011-01-01|20110101_DATA.snappy.parquet|
|4  |Chicago   |2011-01-01|20110101_DATA.snappy.parquet|
|5  |Mansfield |2011-01-02|20110102_DATA.snappy.parquet|
|6  |Pittsburgh|2011-01-02|20110102_DATA.snappy.parquet|
|7  |          |2011-01-02|20110102_DATA.snappy.parquet|
|8  |Clarion   |2011-01-03|20110103_DATA.snappy.parquet|
|9  |Storrs    |2011-01-03|20110103_DATA.snappy.parquet|
|10 |          |2011-01-03|20110103_DATA.snappy.parquet|
+---+----------+----------+----------------------------+
Expected output:
Partition by DATE and use the name value as the filename when saving the data as Parquet. The output would be 3 files:
/DATE=2011-01-01/20110101_DATA.snappy.parquet
/DATE=2011-01-02/20110102_DATA.snappy.parquet
/DATE=2011-01-03/20110103_DATA.snappy.parquet
Spark cannot natively write output Parquet files with the custom names you want. You can use the following code, but it is not a scalable solution because it relies on the .collect() action.
# With a large DataFrame this may not work: collect() pulls all
# distinct names into driver memory.
unique_filenames = [row.name for row in df.select('name').distinct().collect()]
for filename in unique_filenames:
    # Rebuild the date from the yyyyMMdd prefix of the filename
    output_filename = "/DATE=" + filename[0:4] + "-" + filename[4:6] + "-" + filename[6:8] + "/" + filename
    df.select("ID", "CITY", "DATE") \
        .filter(df['name'] == filename) \
        .write \
        .parquet(output_filename)
You will have exactly the layout you want (note that each of these paths is a directory containing Spark's part files, not a single file):
/DATE=2011-01-01/20110101_DATA.snappy.parquet
/DATE=2011-01-02/20110102_DATA.snappy.parquet
/DATE=2011-01-03/20110103_DATA.snappy.parquet
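If the file itself must carry the name, a minimal sketch (in Scala, assuming a Hadoop-compatible filesystem and the example date 2011-01-01) is to write a single part file and then rename it with the Hadoop FileSystem API:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.col

// Write exactly one part file into the date directory...
val dir = "/DATE=2011-01-01"
df.filter(col("name") === "20110101_DATA.snappy.parquet")
  .select("ID", "CITY", "DATE")
  .coalesce(1)
  .write.parquet(dir)

// ...then rename the part file to the desired name.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val part = fs.globStatus(new Path(dir + "/part-*.parquet"))(0).getPath
fs.rename(part, new Path(dir + "/20110101_DATA.snappy.parquet"))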
I have an input dataframe:
inputDF=
+--------------------------+-----------------------------+
| info (String) | chars (Seq[String]) |
+--------------------------+-----------------------------+
|weight=100,height=70 | [weight,height] |
+--------------------------+-----------------------------+
|weight=92,skinCol=white | [weight,skinCol] |
+--------------------------+-----------------------------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |
+--------------------------+-----------------------------+
How do I get this dataframe as an output? I do not know in advance which strings the chars column contains.
outputDF=
+--------------------------+-----------------------------+-------+-------+-------+-------+
| info (String) | chars (Seq[String]) | weight|height |skinCol|hairCol|
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=100,height=70 | [weight,height] | 100 | 70 | null |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=92,skinCol=white | [weight,skinCol] | 92 |null |white |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |null |null |white |gray |
+--------------------------+-----------------------------+-------+-------+-------+-------+
I would also like to save the following Seq[String] as a variable, but without using the .collect() function on the dataframes:
val aVariable: Seq[String] = Seq("weight", "height", "skinCol", "hairCol")
You can create another DataFrame by pivoting on the keys of the info column, then join it back using an id column:
import spark.implicits._
import org.apache.spark.sql.functions._

val data = Seq(
  ("weight=100,height=70", Seq("weight", "height")),
  ("weight=92,skinCol=white", Seq("weight", "skinCol")),
  ("hairCol=gray,skinCol=white", Seq("hairCol", "skinCol"))
)
val df = spark.sparkContext.parallelize(data).toDF("info", "chars")
  .withColumn("id", monotonically_increasing_id() + 1)

val pivotDf = df
  .withColumn("tmp", explode(split(col("info"), ",")))  // one row per key=value pair
  .withColumn("key", split(col("tmp"), "=")(0))
  .withColumn("value", split(col("tmp"), "=")(1))
  .select("id", "key", "value")
  .groupBy("id").pivot("key").agg(first(col("value")))

df.join(pivotDf, Seq("id"), "left").drop("id").show(false)
+--------------------------+------------------+-------+------+-------+------+
|info |chars |hairCol|height|skinCol|weight|
+--------------------------+------------------+-------+------+-------+------+
|weight=100,height=70 |[weight, height] |null |70 |null |100 |
|hairCol=gray,skinCol=white|[hairCol, skinCol]|gray |null |white |null |
|weight=92,skinCol=white |[weight, skinCol] |null |null |white |92 |
+--------------------------+------------------+-------+------+-------+------+
For your second question, you can get those values in a DataFrame like this:
df.withColumn("tmp", explode(split(col("info"), ",")))
  .withColumn("values", split(col("tmp"), "=")(0))
  .select("values").distinct().show()
+-------+
| values|
+-------+
| height|
|hairCol|
|skinCol|
| weight|
+-------+
But you cannot get them into a Seq variable without using collect; that is simply impossible, because a Seq is a local driver-side collection while the values live in the distributed DataFrame.
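If collecting to the driver turns out to be acceptable after all, here is a minimal sketch of how the distinct keys could be materialized into a Seq (using the same df as above):
// Collects the distinct keys to the driver; fine for a small key set.
val keys: Seq[String] = df
  .withColumn("tmp", explode(split(col("info"), ",")))
  .select(split(col("tmp"), "=")(0).as("key"))
  .distinct()
  .collect()
  .map(_.getString(0))
  .toSeq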
I have different DataFrames and I want to select the max common Date across these DFs. For example, I have the following dataframes:
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2012-12-21 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14    |1      |
+--------------+-------+
The selected date would be 2016-09-02 because it is the max date that exists in all 3 DFs (the date 2017-11-19 is not in the third DF).
I am trying to do it with agg(max), but this way I just get the highest date of a single DataFrame:
df1.select("Date").groupBy("Date").agg(max("Date"))
Thanks in advance!
You can use semi joins to keep only the common dates, then aggregate the maximum date. There is no need to group by Date, because you only want its overall maximum.
val result = df1
  .join(df2, Seq("Date"), "left_semi")
  .join(df3, Seq("Date"), "left_semi")
  .agg(max("Date"))
You can also use intersect:
val result = df1.select("Date")
  .intersect(df2.select("Date"))
  .intersect(df3.select("Date"))
  .agg(max("Date"))
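If the number of DataFrames is not fixed, a small sketch that generalizes the intersect approach with reduce (assuming every DataFrame has a Date column):
// Intersect the Date columns of all DataFrames, then take the maximum.
val dfs = Seq(df1, df2, df3).map(_.select("Date"))
val maxCommonDate = dfs.reduce(_ intersect _).agg(max("Date"))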
I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to reduce the rows of the first DataFrame to only the countries that appear in the second DF. An example:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |ARG |5 |
|2015-12-14 |GER |1 |
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
|2015-12-14 |USA |1 |
+--------------+------------+-------+
+----+------------+
|USE | country_id |
+----+------------+
| F  |RUS         |
| F  |CHN         |
+----+------------+
Expected:
+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
+--------------+------------+-------+
How could I do this? I am new to Spark, so I thought of maybe using intersect? Or would another method be more efficient?
Thanks in advance!
You can use left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
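One caveat with the inner-join version: if DF2 contains duplicate country_id values, the join will multiply matching rows of DF1. A sketch that deduplicates DF2 first:
// Deduplicate the lookup side so each DF1 row matches at most once.
val DF3 = DF1.alias("a")
  .join(DF2.select("country_id").distinct(), Seq("country_id"))
  .select("a.*")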
I need to implement the NVL function in Spark while joining two DataFrames.
Input DataFrames:
ds1.show()
---------------
|key | Code |
---------------
|2 | DST |
|3 | CPT |
|null | DTS |
|5 | KTP |
---------------
ds2.show()
------------------
|key | PremAmt |
------------------
|2 | 300 |
|-1 | -99 |
|5 | 567 |
------------------
I need to implement "LEFT JOIN ON NVL(DS1.key, -1) = DS2.key".
So I have written it like this, but the NVL or coalesce function is missing, so it returned wrong values.
How do I incorporate "NVL" into Spark DataFrames?
// nvl function is missing, so this gives the wrong output
ds1.join(ds2, Seq("key"), "left_outer")
-------------------------
|key | Code |PremAmt |
-------------------------
|2 | DST |300 |
|3 | CPT |null |
|null | DTS |null |
|5 | KTP |567 |
-------------------------
Expected Result :
-------------------------
|key | Code |PremAmt |
-------------------------
|2 | DST |300 |
|3 | CPT |null |
|null | DTS |-99 |
|5 | KTP |567 |
-------------------------
I know of one somewhat complex way:
val df = df1.join(df2, coalesce(df1("key"), lit(-1)) === df2("key"), "left_outer")
You should rename the "key" column of one df, and drop that column after the join, since this join condition keeps both key columns.
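A minimal sketch of that full pattern, assuming the df1 and df2 from the line above (df2's key is renamed before the join and dropped afterwards):
import org.apache.spark.sql.functions.{coalesce, col, lit}

// NVL(df1.key, -1) = df2.key, expressed with coalesce.
val joined = df1.join(
    df2.withColumnRenamed("key", "key2"),
    coalesce(df1("key"), lit(-1)) === col("key2"),
    "left_outer"
  ).drop("key2")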
An implementation of nvl in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, lit}

// Returns replaceVal where colIn is null, otherwise colIn itself.
def nvl(colIn: Column, replaceVal: Any): Column =
  when(colIn.isNull, lit(replaceVal)).otherwise(colIn)
Now you can use nvl as you would use any other DataFrame manipulation function, like:
val newDf = df.withColumn("MyColNullsReplaced", nvl($"MyCol", "<null>"))
Obviously, replaceVal must be of the correct type. The example above assumes $"MyCol" is of type string.
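For example, applied to the join from the question (a sketch; note the -1 replacement must match the key column's type):
// Mirrors LEFT JOIN ON NVL(DS1.key, -1) = DS2.key while keeping ds1's
// original (possibly null) key column in the result.
val joined = ds1
  .withColumn("joinKey", nvl($"key", -1))
  .join(ds2.withColumnRenamed("key", "joinKey"), Seq("joinKey"), "left_outer")
  .drop("joinKey")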
This worked for me:
intermediateDF.select(
  col("event_start_timestamp"),
  col("cobrand_id"),
  col("rule_name"),
  col("table_name"),
  // coalesce behaves like NVL: nulls are replaced with 0
  coalesce(col("dimension_field1"), lit(0)),
  coalesce(col("dimension_field2"), lit(0)),
  coalesce(col("dimension_field3"), lit(0)),
  coalesce(col("dimension_field4"), lit(0)),
  coalesce(col("dimension_field5"), lit(0))
)
The answer is to use NVL directly; this code in Python works:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[1]").appName("CommonMethods").getOrCreate()
Note: the SparkSession is built in a "chained" fashion, i.e., three methods are applied on the same line.
Read the CSV file:
df = spark.read.csv('C:\\tableausuperstore1_all.csv', inferSchema='true', header='true')
df.createOrReplaceTempView("ViewSuperstore")
The ViewSuperstore view can now be used from SQL:
print("*trace1-nvl")
df = spark.sql("select nvl(state,'a') testString, nvl(quantity,0) testInt from ViewSuperstore where state='Florida' and OrderDate>current_date()")
df.show()
print("*trace2-FINAL")
df.select(expr("nvl(colname,'ZZ')"))  # 'colname' is a placeholder column name
I have a Parquet file named test.parquet. It contains some integers. When I read it using the following code:
val df = spark.read.parquet("test.parquet")
df.show(false)
+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+
The logs show that 2 jobs were executed: one is a parquet job and the other is a show job. Whereas, when I read the Parquet file using the following code:
import org.apache.spark.sql.types.{StructType, StructField, LongType}

val df = spark.read.schema(StructType(List(StructField("id", LongType, false)))).parquet("test.parquet")
df.show(false)
+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+
Only one job is executed, i.e., the show job.
So, my questions are:
Why does the first approach execute 2 jobs, whereas the second approach executes only one?
And why is the second approach faster than the first one?
Spark reads the file twice:
1- to infer the schema (a small job that reads the Parquet footer);
2- to create the DataFrame.
Once the schema has been obtained, creating the DataFrame is fast. When you supply the schema yourself, the inference job is skipped, which is why only the show job runs and why the second approach is faster.
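If you do not want to hand-write the schema but would still like to pay the inference cost only once, one possible pattern (a sketch) is to capture the inferred schema and reuse it for subsequent reads:
// The first read infers the schema; later reads reuse it and skip that job.
val inferredSchema = spark.read.parquet("test.parquet").schema
val df = spark.read.schema(inferredSchema).parquet("test.parquet")
df.show(false)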