I have a dataframe (df1) with 2 StringType fields.
Field1 (StringType): value X
Field2 (StringType): value 20180101
All I am trying to do is create another dataframe (df2) from df1 with 2 fields:
Field1 (StringType): value X
Field2 (DateType): value 2018-01-01
I am using the below code:
df2 = df1.select(
  col("field1").alias("f1"),
  unix_timestamp(col("field2"), "yyyyMMdd").alias("f2")
)
df2.show
df2.printSchema
For field2 I tried multiple things (unix_timestamp, from_unixtime, to_date, cast("date")), but nothing worked.
I need the following schema as output:
df2.printSchema
|-- f1: string (nullable = false)
|-- f2: date (nullable = false)
I'm using Spark 2.1
to_date seems to work fine for what you need:
import org.apache.spark.sql.functions._
val df1 = Seq( ("X", "20180101"), ("Y", "20180406") ).toDF("c1", "c2")
val df2 = df1.withColumn("c2", to_date($"c2", "yyyyMMdd"))
df2.show
// +---+----------+
// | c1| c2|
// +---+----------+
// | X|2018-01-01|
// | Y|2018-04-06|
// +---+----------+
df2.printSchema
// root
// |-- c1: string (nullable = true)
// |-- c2: date (nullable = true)
[UPDATE]
For Spark 2.1 or earlier, to_date does not take a format string as a parameter, so the column first needs to be reformatted to the standard yyyy-MM-dd pattern, using, say, regexp_replace:
val df2 = df1.withColumn(
"c2", to_date(regexp_replace($"c2", "(\\d{4})(\\d{2})(\\d{2})", "$1-$2-$3"))
)
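Another option for Spark 2.1 (a minimal sketch of mine, not part of the original answer) is to parse with unix_timestamp, which does accept a format string, and then cast the result down to a date:

// unix_timestamp(col, fmt) returns epoch seconds (Long); casting to timestamp
// and then to date yields a DateType column even on Spark 2.1.
val df2Alt = df1.withColumn(
  "c2", unix_timestamp($"c2", "yyyyMMdd").cast("timestamp").cast("date")
)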
I'm trying to filter rows from a dataframe with this structure:
|-- age: integer (nullable = true)
|-- qty: integer (nullable = true)
|-- dates: array (nullable = true)
| |-- element: timestamp (containsNull = true)
For example, in this dataframe I only want the first row:
+---------+------------+------------------------------------------------------------------+
| age | qty |dates |
+---------+------------+------------------------------------------------------------------+
| 54 | 1| [2020-12-31 12:15:20, 2021-12-31 12:15:20] |
| 45 | 1| [2020-12-31 12:15:20, 2018-12-31 12:15:20, 2019-12-31 12:15:20] |
+---------+------------+------------------------------------------------------------------+
Here is my code:
val result = sqlContext.table("scores")
result.filter(array_contains(col("dates").cast("string"), 2021)).show(false)
But I'm getting this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(
due to data type mismatch: Arguments must be an array followed by a value of same type as the array members;
Can anyone help please?
You need to use rlike to check whether each array element contains 2021; array_contains checks for an exact match, not a partial match.
result.filter("array_max(transform(dates, x -> string(x) rlike '2021'))").show(false)
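A close variant (a sketch on my part, assuming Spark 2.4+ where the higher-order SQL functions are available) filters the array and checks that something survived:

// Keep rows where at least one timestamp, rendered as a string, matches '2021'.
result.filter("size(filter(dates, x -> string(x) rlike '2021')) > 0").show(false)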
You can explode the ArrayType column and then process it however you want: cast the column to String, then apply your filter:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.{Row, SparkSession, functions => f}
import org.apache.spark.sql.types._

val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("SparkByExamples")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

// Parse a "yyyy-MM-dd HH:mm:ss" string into a java.sql.Timestamp (HH = 24-hour clock)
def convertToTimeStamp(s: String): Timestamp = {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val parsedDate = dateFormat.parse(s)
  new Timestamp(parsedDate.getTime)
}

val data = Seq(
  Row(54, 1, Array(convertToTimeStamp("2020-12-31 12:15:20"), convertToTimeStamp("2021-12-31 12:15:20"))),
  Row(45, 1, Array(convertToTimeStamp("2020-12-31 12:15:20"), convertToTimeStamp("2018-12-31 12:15:20"), convertToTimeStamp("2019-12-31 12:15:20")))
)

val schema = StructType(Array(
  StructField("age", IntegerType, nullable = true),
  StructField("qty", IntegerType, nullable = true),
  StructField("dates", ArrayType(TimestampType, containsNull = true), nullable = true)
))

val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd, schema)
df.show()
df.printSchema()

// Explode the array so each timestamp gets its own row, then filter on its string form
df = df.withColumn("exp", f.explode(f.col("dates")))
df.filter(f.col("exp").cast(StringType).contains("2021")).show()
You can use the exists function to check whether the dates array contains a date with year 2021:
df.filter("exists(dates, x -> year(x) = 2021)").show(false)
//+---+---+------------------------------------------+
//|age|qty|dates |
//+---+---+------------------------------------------+
//|54 |1 |[2020-12-31 12:15:20, 2021-12-31 12:15:20]|
//+---+---+------------------------------------------+
If you want to use array_contains, you need to transform the timestamp elements into years first:
df.filter("array_contains(transform(dates, x -> year(x)), 2021)").show(false)
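For reference, the same check can also be written with the typed Scala DSL (a sketch, assuming Spark 3.0+, where exists and year are available in org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{col, exists, year}

// Keep rows where any element of the dates array falls in 2021.
df.filter(exists(col("dates"), x => year(x) === 2021)).show(false)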
I have a dataframe (dataDF) which contains data like:
firstColumn;secondColumn;thirdColumn
myText;123;2010-08-12 00:00:00
In my case, all of these columns are StringType.
On the other hand, I have another DataFrame (customTypeDF), which can be modified and which contains a custom type for some columns, like:
columnName;customType
secondColumn;IntegerType
thirdColumn;TimestampType
How can I dynamically apply the new types to my dataDF dataframe?
You can map the column names using the customTypeDF collected as a Seq:
val colTypes = customTypeDF.rdd.map(x => x.toSeq.asInstanceOf[Seq[String]]).collect

val result = dataDF.select(
  dataDF.columns.map(c =>
    if (colTypes.map(_(0)).contains(c))
      col(c).cast(colTypes.filter(_(0) == c)(0)(1).toLowerCase.replace("type", "")).as(c)
    else col(c)
  ): _*
)
result.show
+-----------+------------+-------------------+
|firstColumn|secondColumn| thirdColumn|
+-----------+------------+-------------------+
| myText| 123|2010-08-12 00:00:00|
+-----------+------------+-------------------+
result.printSchema
root
|-- firstColumn: string (nullable = true)
|-- secondColumn: integer (nullable = true)
|-- thirdColumn: timestamp (nullable = true)
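A slightly tidier variant of the same idea (a sketch, assuming customTypeDF holds the two columns columnName and customType in that order) collects the mapping into a Map and looks each column up once:

// Collect columnName -> Spark cast name (e.g. "IntegerType" -> "integer") into a Map.
val typeMap: Map[String, String] = customTypeDF.collect()
  .map(r => r.getString(0) -> r.getString(1).trim.toLowerCase.replace("type", ""))
  .toMap

// Cast only the columns that have an entry in the map; leave the rest untouched.
val result2 = dataDF.select(
  dataDF.columns.map(c => typeMap.get(c).map(t => col(c).cast(t).as(c)).getOrElse(col(c))): _*
)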
I am writing Spark Scala code to write the output to BQ. The following code is used to form the output table, which has two columns (id and keywords):
val df1 = Seq("tamil", "telugu", "hindi").toDF("language")

val df2 = Seq(
  (101, Seq("tamildiary", "tamilkeyboard", "telugumovie")),
  (102, Seq("tamilmovie")),
  (103, Seq("hindirhymes", "hindimovie"))
).toDF("id", "keywords")

val pattern = concat(lit("^"), df1("language"), lit(".*"))

import org.apache.spark.sql.Row
val arrayToMap = udf { (arr: Seq[Row]) =>
  arr.map { case Row(k: String, v: Int) => (k, v) }.toMap
}

val final_df = df2.
  withColumn("keyword", explode($"keywords")).as("df2").
  join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword").
  groupBy("id", "language").agg(size(collect_list($"language")).as("count")).
  groupBy("id").agg(arrayToMap(collect_list(struct($"language", $"count"))).as("keywords"))
The output of final_df is:
+---+--------------------+
| id| app_language|
+---+--------------------+
|101|Map(tamil -> 2, t...|
|103| Map(hindi -> 2)|
|102| Map(tamil -> 1)|
+---+--------------------+
I am defining the function below to pass the schema for this output table. (Since BQ doesn't support a map field, I am using an array of structs, but this is also not working.)
def createTableIfNotExists(outputTable: String) = {
  spark.createBigQueryTable(
    s"""
    |CREATE TABLE IF NOT EXISTS $outputTable(
    |ds date,
    |id int64,
    |keywords ARRAY<STRUCT<key STRING, value INT64>>
    |)
    |PARTITION BY ds
    |CLUSTER BY user_id
    """.stripMargin)
}
Could anyone please help me write a correct schema for this so that it's compatible with BQ?
You can collect an array of structs as below:
val final_df = df2
  .withColumn("keyword", explode($"keywords")).as("df2")
  .join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword")
  .groupBy("id", "language")
  .agg(size(collect_list($"language")).as("count"))
  .groupBy("id")
  .agg(collect_list(struct($"language", $"count")).as("app_language"))
final_df.show(false)
+---+-------------------------+
|id |app_language |
+---+-------------------------+
|101|[[tamil, 2], [telugu, 1]]|
|103|[[hindi, 2]] |
|102|[[tamil, 1]] |
+---+-------------------------+
final_df.printSchema
root
|-- id: integer (nullable = false)
|-- app_language: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- language: string (nullable = true)
| | |-- count: integer (nullable = false)
And then you can use a schema like:
def createTableIfNotExists(outputTable: String) = {
  spark.createBigQueryTable(
    s"""
    |CREATE TABLE IF NOT EXISTS $outputTable(
    |ds date,
    |id int64,
    |keywords ARRAY<STRUCT<language STRING, count INT64>>
    |)
    |PARTITION BY ds
    |CLUSTER BY user_id
    """.stripMargin)
}
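One detail worth checking (an assumption on my side, not part of the original answer): the DDL declares a ds partition column and names the array column keywords, while final_df only has id and app_language, so you may need to align the dataframe with the table before writing, for example:

import org.apache.spark.sql.functions.current_date

// Rename the struct-array column to match the DDL and add the ds partition column.
val to_write = final_df
  .withColumnRenamed("app_language", "keywords")
  .withColumn("ds", current_date())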
Sample Code:
val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")
import sparkSession.implicits._
val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"
val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))
Output:
df2.show(false)
Please notice: out_timestamp loses the millisecond part of the original value.
+-----------------------+-------------------+
|in_timestamp |out_timestamp |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+
df2.printSchema()
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
In the above result, in_timestamp is of string type and I would like to convert it to the timestamp data type. It does get converted, but the millisecond part gets lost. Any idea? Thanks!
Sample code for preserving milliseconds during the conversion from string to timestamp:
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))
df2.show(false)
+-----------------------+-----------------------+
|in_timestamp |out_timestamp |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+
scala> df2.printSchema
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
You just need to remove the format parameter from to_timestamp; the result is still of timestamp data type and the fractional seconds are preserved. Note that this works here because the input string already uses the default yyyy-MM-dd HH:mm:ss.SSS form rather than the yyyy/MM/dd pattern from the question.
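If the source column really does use the yyyy/MM/dd separator from the question, one workaround (a sketch of mine, not part of the answer above) is to normalise the separators first and then let to_timestamp fall back to its default parser, which keeps the fractional seconds:

import org.apache.spark.sql.functions.{regexp_replace, to_timestamp}

// "2018/12/21 08:07:36.927" -> "2018-12-21 08:07:36.927", then parse with the
// default pattern so the .SSS part survives.
val df3 = df.withColumn(
  "out_timestamp",
  to_timestamp(regexp_replace($"in_timestamp", "/", "-"))
)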
I have a column in a Spark dataframe of String datatype (with the date in yyyy-MM-dd pattern).
I want to display the column value in the MM/dd/yyyy pattern.
My data is:
val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
+-----+----------+----------+-----+
df.printSchema()
root
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in the MM/dd/yyyy pattern. All I am able to do is convert the column to DateType from String:
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
df2.show()
+----------+
| endDate|
+----------+
|2000-01-01|
|2001-01-13|
|2001-01-01|
+----------+
df2.printSchema()
root
|-- endDate: date (nullable = true)
I want to show endDate in the MM/dd/yyyy pattern. The only reference I found is this, which doesn't solve the problem.
You can use the date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
  ("steak", "1990-01-01", "2000-01-01", 150),
  ("steak", "2000-01-02", "2001-01-13", 180),
  ("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output:
+-------------------------------+
|date_format(endDate,MM/dd/yyyy)|
+-------------------------------+
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
+-------------------------------+
Use the date_format(date, format) function (documented as pyspark.sql.functions.date_format):
val df2 = df.select(date_format("endDate", "MM/dd/yyyy").alias("endDate"))
Here the Dataframe/Dataset has a string column with a date value in it, and we need to change the date format.
For the question asked, the date format can be changed as below:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd", hence it can be rewritten as:
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
(i) By applying to_date, we change the datatype of this column from string to Date. We also tell to_date that the format in this string column is yyyy-MM-dd, so the column is read accordingly.
(ii) Next, we apply date_format to get the date format we require, which is MM/dd/yyyy.
When a time component is involved, use to_timestamp instead of to_date.
Note that 'MM' represents month and 'mm' represents minutes.
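A small sketch tying the two notes together (the column name and values are made up for illustration, and spark.implicits._ is assumed to be in scope):

import org.apache.spark.sql.functions.{date_format, to_timestamp}

// Parse a string that carries a time component with to_timestamp ('HH' = hours,
// 'mm' = minutes), then format it; 'MM' in the output pattern is the month.
val dfTs = Seq("1990-01-01 10:30:00").toDF("startTs")
dfTs.select(
  date_format(to_timestamp($"startTs", "yyyy-MM-dd HH:mm:ss"), "MM/dd/yyyy HH:mm").as("formatted")
).show()
// +----------------+
// |       formatted|
// +----------------+
// |01/01/1990 10:30|
// +----------------+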