PySpark DateTime Functions returning nulls - pyspark

I am reading in some Chicago Crimes data and need to use the built-in PySpark datetime functions to create Month and Year columns. I have followed the documentation and tried several methods with no luck.
I import the following.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql.window import Window
My schema shows that the Date column has string values.
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- Case_Number: string (nullable = true)
|-- Date: string (nullable = true)
|-- Block: string (nullable = true)
|-- IUCR: string (nullable = true)
|-- Primary_Type: string (nullable = true)
|-- Description: string (nullable = true)
|-- Location_Description: string (nullable = true)
|-- Arrest: boolean (nullable = true)
|-- Domestic: boolean (nullable = true)
|-- District: integer (nullable = true)
|-- Community_Area: integer (nullable = true)
|-- FBI_Code: string (nullable = true)
|-- X_Coordinate: integer (nullable = true)
|-- Y_Coordinate: integer (nullable = true)
|-- Year: integer (nullable = true)
|-- Updated_On: string (nullable = true)
|-- Location: string (nullable = true)
See the values below; there are no nulls in the Date column.
df.select('Date').show()
+--------------+
| Date|
+--------------+
|9/5/2015 13:30|
|9/4/2015 11:30|
| 9/1/2018 0:01|
|9/5/2015 12:45|
|9/5/2015 13:00|
|9/5/2015 10:55|
|9/4/2015 18:00|
|9/5/2015 13:00|
|9/5/2015 11:30|
| 5/1/2016 0:25|
|9/5/2015 14:00|
|9/5/2015 11:00|
| 9/5/2015 3:00|
|9/5/2015 12:50|
|9/3/2015 13:00|
|9/5/2015 11:45|
|9/5/2015 13:30|
| 7/8/2015 0:00|
| 9/5/2015 9:55|
|9/5/2015 12:35|
+--------------+
only showing top 20 rows
Then I call the following, but get nulls for many of the rows.
df2 = df.withColumn("Date", unix_timestamp("Date", "MM/dd/yyyy hh:mm"))
df2.select("Date").show()
+----------+
| Date|
+----------+
| null|
|1441384200|
| null|
|1441431900|
| null|
|1441468500|
| null|
| null|
|1441470600|
| null|
| null|
|1441468800|
|1441440000|
|1441432200|
| null|
|1441471500|
| null|
| null|
|1441464900|
|1441431300|
+----------+
only showing top 20 rows
Casting the Date column directly to a timestamp returns only nulls:
df2 = df.withColumn("Date", df.Date.cast(types.TimestampType()))
df2.select("Date").show()
+----+
|Date|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
only showing top 20 rows
I want to just use the Date column to create Month and Year.
### Get Month from date in pyspark
from pyspark.sql.functions import month, year
#df = df.withColumn("Date", df.Date.cast(types.TimestampType()))
#df = df.withColumn("Date", unix_timestamp("Date", "MM/dd/yyyy"))
df = df.withColumn('Year', year(df['Date']))
df = df.withColumn('Month', month(df['Date']))
In: df.select('Month').distinct().collect()
Out: [Row(Month=None)]

OK, first of all, a reproducible example would have been nice.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('stack').getOrCreate()
data = [("9/5/2015 13:30"),("9/4/2015 11:30"), ("9/1/2018 0:01"), ("9/5/2015 12:45"),("9/5/2015 13:00"), ("9/5/2015 10:55"), ("9/4/2015 18:00"), ("9/5/2015 13:00")]
from pyspark.sql.types import *
dummy_df = spark.createDataFrame(data, StringType()).toDF('Datetime')
dummy_df.show()
dummy_df.printSchema()
that will give you the output:
+--------------+
| Datetime|
+--------------+
|9/5/2015 13:30|
|9/4/2015 11:30|
| 9/1/2018 0:01|
|9/5/2015 12:45|
|9/5/2015 13:00|
|9/5/2015 10:55|
|9/4/2015 18:00|
|9/5/2015 13:00|
+--------------+
root
|-- Datetime: string (nullable = true)
Now, as mentioned before, be careful with the format of the date column and follow the instructions at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
from pyspark.sql.functions import regexp_replace, month, year, col, to_date
#simple regex, watch out for the white space
dummy_df = dummy_df.withColumn('Date', regexp_replace('Datetime', '( \d+:\d+)', ''))
#transform the date-column
dummy_df = dummy_df.withColumn('Date', to_date(col('Date'), "M/d/y"))\
.withColumn('Month', month(col('Date')))\
.withColumn('Year', year(col('Date')))
dummy_df.show()
dummy_df.printSchema()
finally we get:
+--------------+----------+-----+----+
| Datetime| Date|Month|Year|
+--------------+----------+-----+----+
|9/5/2015 13:30|2015-09-05| 9|2015|
|9/4/2015 11:30|2015-09-04| 9|2015|
| 9/1/2018 0:01|2018-09-01| 9|2018|
|9/5/2015 12:45|2015-09-05| 9|2015|
|9/5/2015 13:00|2015-09-05| 9|2015|
|9/5/2015 10:55|2015-09-05| 9|2015|
|9/4/2015 18:00|2015-09-04| 9|2015|
|9/5/2015 13:00|2015-09-05| 9|2015|
+--------------+----------+-----+----+
root
|-- Datetime: string (nullable = true)
|-- Date: date (nullable = true)
|-- Month: integer (nullable = true)
|-- Year: integer (nullable = true)
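As a side note, if the datetime patterns from the link above are used directly, the regexp_replace step can be skipped; a minimal sketch (assuming Spark 3's parser, the column names from dummy_df above, and an illustrative Ts column name) would be:
from pyspark.sql.functions import to_timestamp, month, year, col

# single-letter M/d/H accept values without leading zeros,
# so "9/1/2018 0:01" parses as well
dummy_df = dummy_df.withColumn('Ts', to_timestamp(col('Datetime'), "M/d/yyyy H:mm"))\
    .withColumn('Month', month(col('Ts')))\
    .withColumn('Year', year(col('Ts')))
dummy_df.show()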

I often use the timestampFormat option when reading data in PySpark. I combine it with schema injection when reading CSV files:
# -*- coding: utf-8 -*-
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import *

def main():
    print('Spark read Crimes DataFrames inject schema')

    # Create Spark Session
    spark: SparkSession = (SparkSession
                           .builder
                           # .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
                           .appName('Spark Course')
                           .getOrCreate())

    # Define the schema of the data
    schema: StructType = StructType([
        StructField('ID', IntegerType(), nullable=False),
        StructField('Case_Number', StringType(), nullable=False),
        StructField('Date', TimestampType(), nullable=True),
        StructField('Block', IntegerType(), nullable=True),
        StructField('IUCR', StringType(), nullable=True),
        StructField('Primary_Type', StringType(), nullable=True),
        StructField('Description', StringType(), nullable=True),
        StructField('Location_Description', StringType(), nullable=True),
        StructField('Arrest', BooleanType(), nullable=True),
        StructField('Domestic', BooleanType(), nullable=True),
        StructField('District', IntegerType(), nullable=True),
        StructField('Community_Area', IntegerType(), nullable=True),
        StructField('FBI_Code', StringType(), nullable=True),
        StructField('X_Coordinate', IntegerType(), nullable=True),
        StructField('Y_Coordinate', IntegerType(), nullable=True),
        StructField('Year', IntegerType(), nullable=True),
        StructField('Updated_On', StringType(), nullable=True),
        StructField('Location', StringType(), nullable=True)
    ])

    # Read the CSV, injecting the schema
    crimes: DataFrame = (spark
                         .read
                         # .option('timestampFormat', "MM/dd/yyyy kk:mm:ss a")  # use with "spark.sql.legacy.timeParserPolicy" = "LEGACY"
                         .option('timestampFormat', "M/d/y h:m:s a")
                         .csv('/home/san/Downloads/Crimes_-_2001_to_Present.csv.crdownload', header=True, schema=schema))

    crimes.show()
    crimes.printSchema()

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print('Failed to execute process: {}'.format(e))
You can store this code in crimes.py and run it with
/opt/spark/bin/spark-submit crimes.py
Please pay attention to the new date and timestamp format introduced in Spark 3: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
You can always use the old format by configuring the Spark session with
.config("spark.sql.legacy.timeParserPolicy", "LEGACY")

Related

Transpose DataFrame single row to column in Spark with scala

I saw this question here:
Transpose DataFrame Without Aggregation in Spark with scala and I wanted to do exactly the opposite.
I have this Dataframe with a single row, with values that are string, int, bool, array:
+-----+-------+-----+------+-----+
|col1 | col2 |col3 | col4 |col5 |
+-----+-------+-----+------+-----+
|val1 | val2 |val3 | val4 |val5 |
+-----+-------+-----+------+-----+
And I want to transpose it like this:
+-----------+-------+
|Columns | values|
+-----------+-------+
|col1 | val1 |
|col2 | val2 |
|col3 | val3 |
|col4 | val4 |
|col5 | val5 |
+-----------+-------+
I am using Apache Spark 2.4.3 with Scala 2.11
Edit: Values can be of any type (int, double, bool, array), not only strings.
I thought about this differently, without using arrays_zip (which is only available in Spark >= 2.4), and got the below.
It will work for Spark >= 2.0 onwards in a simpler way (flatMap, map and explode functions).
Here the map function (used inside withColumn) creates a new map column. The input columns must be grouped as key-value pairs.
Case: String data type in the data:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

val df: DataFrame = Seq(("val1", "val2", "val3", "val4", "val5")).toDF("col1", "col2", "col3", "col4", "col5")
val columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.printSchema()
df.withColumn("myMap", map(columnsAndValues: _*)).select(explode($"myMap"))
  .toDF("Columns", "Values").show(false)
Result :
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |val1 |
|col2 |val2 |
|col3 |val3 |
|col4 |val4 |
|col5 |val5 |
+-------+------+
Case: Mix of data types in the data:
If you have different types, convert them to String; the remaining steps won't change.
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
Full Example :
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Column
import spark.implicits._

val df = Seq((2, 3, true, 2.4, "val")).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema()

// convert all columns to string type since that is needed further on
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
df1.printSchema()

val columnsAndValues: Array[Column] = df.columns.flatMap { c =>
  Array(lit(c), col(c))
}

df1.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Columns", "Values")
  .show(false)
Result :
root
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: boolean (nullable = false)
|-- col4: double (nullable = false)
|-- col5: string (nullable = true)
root
|-- col1: string (nullable = false)
|-- col2: string (nullable = false)
|-- col3: string (nullable = false)
|-- col4: string (nullable = false)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |2 |
|col2 |3 |
|col3 |true |
|col4 |2.4 |
|col5 |val |
+-------+------+
From Spark 2.4, use arrays_zip with array(column_values) and array(column_names), then explode to get the result.
Example:
val df=Seq((("val1"),("val2"),("val3"),("val4"),("val5"))).toDF("col1","col2","col3","col4","col5")
val cols=df.columns.map(x => col(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| val1| col1|
//| val2| col2|
//| val3| col3|
//| val4| col4|
//| val5| col5|
//+------+-------+
UPDATE:
val df=Seq(((2),(3),(true),(2.4),("val"))).toDF("col1","col2","col3","col4","col5")
df.printSchema
//root
// |-- col1: integer (nullable = false)
// |-- col2: integer (nullable = false)
// |-- col3: boolean (nullable = false)
// |-- col4: double (nullable = false)
// |-- col5: string (nullable = true)
//cast to string
val cols=df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| 2| col1|
//| 3| col2|
//| true| col3|
//| 2.4| col4|
//| val| col5|
//+------+-------+

to_timestamp with spark scala is returning null

I am trying to convert a column containing a date value in string format to a timestamp in Apache Spark with Scala.
Below is the content of the dataframe (retailsNullRem):
+---------+---------+--------------+----------+
|InvoiceNo|StockCode| InvoiceDate|customerID|
+---------+---------+--------------+----------+
| 536365| 85123A|12/1/2010 8:26| 17850|
| 536365| 71053|12/1/2010 8:26| 17850|
| 536365| 84406B|12/1/2010 8:26| 17850|
| 536365| 84029G|12/1/2010 8:26| 17850|
| 536365| 84029E|12/1/2010 8:26| 17850|
| 536365| 22752|12/1/2010 8:26| 17850|
| 536365| 21730|12/1/2010 8:26| 17850|
| 536366| 22633|12/1/2010 8:28| 17850|
| 536366| 22632|12/1/2010 8:28| 17850|
| 536367| 84879|12/1/2010 8:34| 13047|
"InvoiceDate" is the column that i am converting to timestamp. I tried the below code for the convertion.
val timeFmt = "MM/dd/yyyy HH:mm"
val retails = retailsNullRem
.withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), timeFmt))
In the data source, it is mentioned that the date format is month/day/year hour:min. But the above code is returning 'null' for the InvoiceDateTS column. I even tried a format like ("%M/%d/%y %H:%m"), as in some cases the month, day and hour did not contain a leading 0, but I am still getting null. Please guide me on what I am missing.
Below is the sample output:
+---------+---------+--------------+----------+-------------+
|InvoiceNo|StockCode| InvoiceDate|customerID|InvoiceDateTS|
+---------+---------+--------------+----------+-------------+
| 536365| 85123A|12/1/2010 8:26| 17850| null|
| 536365| 71053|12/1/2010 8:26| 17850| null|
| 536365| 84406B|12/1/2010 8:26| 17850| null|
| 536365| 84029G|12/1/2010 8:26| 17850| null|
| 536365| 84029E|12/1/2010 8:26| 17850| null|
| 536365| 22752|12/1/2010 8:26| 17850| null|
| 536365| 21730|12/1/2010 8:26| 17850| null|
| 536366| 22633|12/1/2010 8:28| 17850| null|
| 536366| 22632|12/1/2010 8:28| 17850| null|
| 536367| 84879|12/1/2010 8:34| 13047| null|
I am not sure why it's not working; I have tried the below and it worked:
import spark.implicits._
scala> val df=Seq("12/1/2010 8:26", "12/1/2010 8:29").toDF("t")
df: org.apache.spark.sql.DataFrame = [t: string]
scala> df.withColumn("s",col("t").cast("timestamp")).show
+--------------+----+
| t| s|
+--------------+----+
|12/1/2010 8:26|null|
|12/1/2010 8:29|null|
+--------------+----+
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.withColumn("s",to_timestamp(col("t"),"MM/dd/yyyy HH:mm")).show
+--------------+-------------------+
| t| s|
+--------------+-------------------+
|12/1/2010 8:26|2010-12-01 08:26:00|
|12/1/2010 8:29|2010-12-01 08:29:00|
+--------------+-------------------+
Maybe there is a problem with your file data. I tried the same with your own data and it works perfectly well; you can try it with DataFrame functions or Spark SQL.
Your data file is from: https://www.kaggle.com/carrie1/ecommerce-data/home#data.csv
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom
536366,22632,HAND WARMER RED POLKA DOT,6,12/1/2010 8:28,1.85,17850,United Kingdom
536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,12/1/2010 8:34,1.69,13047,United Kingdom
536367,22745,POPPY'S PLAYHOUSE BEDROOM ,6,12/1/2010 8:34,2.1,13047,United Kingdom
536367,22748,POPPY'S PLAYHOUSE KITCHEN,6,12/1/2010 8:34,2.1,13047,United Kingdom
536367,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,8,12/1/2010 8:34,3.75,13047,United Kingdom
536367,22310,IVORY KNITTED MUG COSY ,6,12/1/2010 8:34,1.65,13047,United Kingdom
536367,84969,BOX OF 6 ASSORTED COLOUR TEASPOONS,6,12/1/2010 8:34,4.25,13047,United Kingdom
code in IntelliJ
val df = sqlContext
.read
.option("header", true)
.option("inferSchema", true)
.csv("/home/cloudera/files/tests/timestamp.csv")
.cache()
df.show(5, truncate = false)
df.printSchema()
import org.apache.spark.sql.functions._
// You can try this with dataframe functions
val retails = df
.withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), "MM/dd/yyyy HH:mm"))
retails.show(5, truncate = false)
retails.printSchema()
// or sparkSQL
df.createOrReplaceTempView("df")
val retailsSQL = sqlContext.sql(
"""
|SELECT InvoiceNo,StockCode,InvoiceDate,customerID, TO_TIMESTAMP(InvoiceDate,"MM/dd/yyyy HH:mm") AS InvoiceDateTS
|FROM df
|""".stripMargin)
retailsSQL.show(5,truncate = false)
retailsSQL.printSchema()
output
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description |Quantity|InvoiceDate |UnitPrice|CustomerID|Country |
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
|536365 |85123A |WHITE HANGING HEART T-LIGHT HOLDER|6 |12/1/2010 8:26|2.55 |17850 |United Kingdom|
|536365 |71053 |WHITE METAL LANTERN |6 |12/1/2010 8:26|3.39 |17850 |United Kingdom|
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+
only showing top 2 rows
root
|-- InvoiceNo: string (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: integer (nullable = true)
|-- Country: string (nullable = true)
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+-------------------+
|InvoiceNo|StockCode|Description |Quantity|InvoiceDate |UnitPrice|CustomerID|Country |InvoiceDateTS |
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+-------------------+
|536365 |85123A |WHITE HANGING HEART T-LIGHT HOLDER|6 |12/1/2010 8:26|2.55 |17850 |United Kingdom|2010-12-01 08:26:00|
|536365 |71053 |WHITE METAL LANTERN |6 |12/1/2010 8:26|3.39 |17850 |United Kingdom|2010-12-01 08:26:00|
+---------+---------+----------------------------------+--------+--------------+---------+----------+--------------+-------------------+
only showing top 2 rows
root
|-- InvoiceNo: string (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: integer (nullable = true)
|-- Country: string (nullable = true)
|-- InvoiceDateTS: timestamp (nullable = true)
+---------+---------+--------------+----------+-------------------+
|InvoiceNo|StockCode|InvoiceDate |customerID|InvoiceDateTS |
+---------+---------+--------------+----------+-------------------+
|536365 |85123A |12/1/2010 8:26|17850 |2010-12-01 08:26:00|
|536365 |71053 |12/1/2010 8:26|17850 |2010-12-01 08:26:00|
+---------+---------+--------------+----------+-------------------+
only showing top 2 rows
root
|-- InvoiceNo: string (nullable = true)
|-- StockCode: string (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- customerID: integer (nullable = true)
|-- InvoiceDateTS: timestamp (nullable = true)

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to use UDFs and return a ListBuffer as a column from the UDF, and I am getting an error.
I have created the DataFrame by executing the below code:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
It shows like below:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
As per my requirement, I have to split the above columns based on some inputs, so I have tried a UDF like below:
def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
var lit = new ListBuffer[Any]()
if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("#&"){val a=str1.split("#&")}
if(str2.contains("##"){ val b=str2.split("##")}
else if(str2.contains("##"){ val b=str2.split("##") }
else if(str1.contains("##"){val b=str2.split("##")}
var tmp_row = List(a,"test1",b)
lit +=tmp_row
return lit
})
val
I try to call it by executing the below code:
val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))
I am getting the error "this was a bad call". I want to use a ListBuffer/List to store the results and return them to the calling place.
My expected output would be:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
An alternative with my own fictional data, which you can tailor to your case, and no UDF:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "111##cat##666", "222##fritz##777"),
(2, "AAA##cat##555", "BBB##felix##888"),
(3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")
val df2 = df.withColumn( "c_split", split(col("c1"), ("(##)|(##)|(##)|(##)") ))
.union(df.withColumn("c_split", split(col("c2"), ("(##)|(##)|(##)|(##)") )) )
df2.show(false)
df2.printSchema()
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data") )
df3.show(false)
df3.printSchema()
This gives the answer, but with no ListBuffer (is one really necessary?), as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)

Strange behavior when casting array of structs in spark scala

I'm encountering strange behavior using Spark 2.1.1 and Scala 2.11.8:
import spark.implicits._
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
df.show(false)
df.printSchema()
+---+-------+
|id |data |
+---+-------+
|1 |[[a,b]]|
|2 |[[c,d]]|
+---+-------+
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Now I want to rename my struct fields as suggested in https://stackoverflow.com/a/39781382/1138523
df
.select($"id",$"data".cast("array<struct<k:string,v:string>>"))
.show()
Which results in the correct schema, but the content of the dataframe is now:
+---+-------+
| id| data|
+---+-------+
| 1|[[c,d]]|
| 2|[[c,d]]|
+---+-------+
Both lines now show the same array. What am I doing wrong?
EDIT: In Spark 2.1.2 (and also Spark 2.3.0) I get the expected output. I also get the expected output if I cache the dataframe:
val df = Seq(
(1,Seq(("a","b"))),
(2,Seq(("c","d")))
).toDF("id","data")
.cache

How to transform Spark Dataframe columns to a single column of a string array

I want to know how I can "merge" multiple dataframe columns into a single column containing a string array.
For example, I have this dataframe:
val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")
Which looks like this:
scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
| 1|Jack| 125| Text|
| 2|Mary| 152| Text2|
+---+----+------+-------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- Name: string (nullable = true)
|-- Number: string (nullable = true)
|-- Comment: string (nullable = true)
How can I transform it so it would look like this:
scala> df.show
+---+-----------------+
| Id| List|
+---+-----------------+
| 1| [Jack,125,Text]|
| 2| [Mary,152,Text2]|
+---+-----------------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- List: Array (nullable = true)
| |-- element: string (containsNull = true)
Use org.apache.spark.sql.functions.array:
import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
result.show()
// +---+------------------+
// |Id |List |
// +---+------------------+
// |1 |[Jack, 125, Text] |
// |2 |[Mary, 152, Text2]|
// +---+------------------+
It can also be used with withColumn:
import org.apache.spark.sql.{functions => F}

df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))