Convert string to timestamp for Spark using Scala

I have a dataframe called train; it has the following schema:
root
|-- date_time: string (nullable = true)
|-- site_name: integer (nullable = true)
|-- posa_continent: integer (nullable = true)
I want to cast the date_time column to timestamp and create a new column with the year value extracted from the date_time column.
To be clear, I have the following dataframe:
+-------------------+---------+--------------+
|          date_time|site_name|posa_continent|
+-------------------+---------+--------------+
|2014-08-11 07:46:59|        2|             3|
|2014-08-11 08:22:12|        2|             3|
|2015-08-11 08:24:33|        2|             3|
|2016-08-09 18:05:16|        2|             3|
|2011-08-09 18:08:18|        2|             3|
|2009-08-09 18:13:12|        2|             3|
|2014-07-16 09:42:23|        2|             3|
+-------------------+---------+--------------+
I want to get the following dataframe :
+-------------------+---------+--------------+--------+
|          date_time|site_name|posa_continent|    year|
+-------------------+---------+--------------+--------+
|2014-08-11 07:46:59|        2|             3|    2014|
|2014-08-11 08:22:12|        2|             3|    2014|
|2015-08-11 08:24:33|        2|             3|    2015|
|2016-08-09 18:05:16|        2|             3|    2016|
|2011-08-09 18:08:18|        2|             3|    2011|
|2009-08-09 18:13:12|        2|             3|    2009|
|2014-07-16 09:42:23|        2|             3|    2014|
+-------------------+---------+--------------+--------+

Well, if you want to cast the date_time column to timestamp and create a new column with the year value, then do exactly that:
import org.apache.spark.sql.functions.year
df
  .withColumn("date_time", $"date_time".cast("timestamp")) // cast to timestamp
  .withColumn("year", year($"date_time"))                  // add year column

You could also map the dataframe and append the year to each row (this relies on joda-time being on the classpath):
import org.apache.spark.sql.Row
import org.joda.time.DateTime, org.joda.time.format.DateTimeFormat

df.map {
  case Row(col1: String, col2: Int, col3: Int) => (col1, col2, col3, DateTime.parse(col1, DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")).getYear)
}.toDF("date_time", "site_name", "posa_continent", "year").show()

Related

agg function to transform multiple rows to multiple columns with different types

I want to transform the values of multiple rows with the same id into columns, but each column has a different type.
Input data :
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructType}

val dataInput = List(
  Row("meta001", "duration", 2, null, null),
  Row("meta001", "price", 300, null, null),
  Row("meta001", "name", null, null, "name"),
  Row("meta001", "exist", null, true, null),
  Row("meta002", "price", 400, null, null),
  Row("meta002", "duration", 3, null, null)
)

val schemaInput = new StructType()
  .add("id", StringType, true)
  .add("code", StringType, true)
  .add("integer value", IntegerType, true)
  .add("boolean value", BooleanType, true)
  .add("string value", StringType, true)

var dfInput = spark.createDataFrame(
  spark.sparkContext.parallelize(dataInput),
  schemaInput
)
+-------+--------+-------------+-------------+------------+
| id| code|integer value|boolean value|string value|
+-------+--------+-------------+-------------+------------+
|meta001|duration| 2| null| null|
|meta001| price| 300| null| null|
|meta001| name| null| null| name|
|meta001| exist| null| true| null|
|meta002| price| 400| null| null|
|meta002|duration| 3| null| null|
+-------+--------+-------------+-------------+------------+
Expected output :
+-------+--------+-------------+-------------+------------+
| id|duration|price |name |exist |
+-------+--------+-------------+-------------+------------+
|meta001| 2| 300| name| true|
|meta002| 3| 400| null| null|
+-------+--------+-------------+-------------+------------+
I think I should use groupBy and the pivot function, but I am a little bit lost about how to aggregate the result:
dfInput.groupBy("id").pivot("code", Seq("duration", "price", "name", "exist")).agg(???)
You don't need pivot here, just combine first with a when:
dfInput
  .groupBy($"id")
  .agg(
    first(when($"code" === "duration", $"integer value"), ignoreNulls = true).as("duration"),
    first(when($"code" === "price", $"integer value"), ignoreNulls = true).as("price"),
    first(when($"code" === "name", $"string value"), ignoreNulls = true).as("name"),
    first(when($"code" === "exist", $"boolean value"), ignoreNulls = true).as("exist")
  )
  .show()
gives
+-------+--------+-----+----+-----+
| id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001| 2| 300|name| true|
|meta002| 3| 400|null| null|
+-------+--------+-----+----+-----+
You can use the coalesce function, but then all columns should be of the same type, e.g. "string".
df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
.agg(first(coalesce($"integer value".cast("string"), $"boolean value".cast("string"), $"string value".cast("string"))))
.show()
+-------+--------+-----+----+-----+
| id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001| 2| 300|name| true|
|meta002| 3| 400|null| null|
+-------+--------+-----+----+-----+
If you want to preserve the data types, the aggregation needs to be more complex and may require conditional expressions. Alternatively, just pivot over all three value columns and select the ones you need.
df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
.agg(first($"integer value").as("int"), first($"boolean value").as("bool"), first($"string value").as("string"))
.select("id", "duration_int", "price_int", "name_string", "exist_bool")
.show()
+-------+------------+---------+-----------+----------+
| id|duration_int|price_int|name_string|exist_bool|
+-------+------------+---------+-----------+----------+
|meta001| 2| 300| name| true|
|meta002| 3| 400| null| null|
+-------+------------+---------+-----------+----------+
The schema confirms that the original types are preserved:
root
|-- id: string (nullable = true)
|-- duration_int: integer (nullable = true)
|-- price_int: integer (nullable = true)
|-- name_string: string (nullable = true)
|-- exist_bool: boolean (nullable = true)
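If the suffixed names are not wanted, the pivoted columns can simply be aliased back to the original codes (a sketch building on the snippet above):
dfInput.groupBy("id").pivot("code", Seq("duration", "price", "name", "exist"))
  .agg(first($"integer value").as("int"), first($"boolean value").as("bool"), first($"string value").as("string"))
  .select($"id", $"duration_int".as("duration"), $"price_int".as("price"),
          $"name_string".as("name"), $"exist_bool".as("exist"))
  .show()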

Set all values to None for multiple columns

I'm setting up a Spark batch that aims to filter out some fields that need cleaning up. How do I set the values of the columns in question to None for all rows? (I already have a dataframe containing only the rows I want to alter.)
I am far from an expert in Spark, and I searched around a lot before asking here, but I am still at a loss for a simple enough answer.
There are around 50 columns, and I cannot hard-code the column index to access it, as it may change in future batches.
Example Input dataframe (target columns contain data):
id TARGET 1 TARGET 2 TARGET 3 Col6 ...
someid1 Some(String) Some(String) Some(String) val1 ...
someid2 Some(String) Some(String) None val4 ...
someid5 Some(String) Some(String) Some(String) val3 ...
someid6 Some(String) Some(String) Some(String) val7 ...
Expected Output dataframe (all target columns set to None):
id TARGET 1 TARGET 2 TARGET 3 Col6 ...
someid1 None None None val1 ...
someid2 None None None val4 ...
someid5 None None None val3 ...
someid6 None None None val7 ...
AFAIK, Spark doesn't accept None values. A possible solution is to replace them with null values cast to String:
ds
  .withColumn("target1", lit(null).cast(StringType))
  .withColumn("target2", lit(null).cast(StringType))
It produces the following output:
+--------------------+-------+-------+-----+
| id|target1|target2| col6|
+--------------------+-------+-------+-----+
| 4201735573065099460| null| null|疦薠紀趣餅|
|-6432819446886055080| null| null|┵િ塇駢뱪|
|-7700306868339925800| null| null|鵎썢鳝踽嬌|
|-4913818084582557950| null| null|ꢵ痩찢쑖|
| 6731176796531697018| null| null|少⽬ᩢゖ謹|
+--------------------+-------+-------+-----+
only showing top 5 rows
root
|-- id: long (nullable = false)
|-- target1: string (nullable = true)
|-- target2: string (nullable = true)
|-- col6: string (nullable = true)
That's also what you get when you set a value to None in a Dataset.
case class TestData(id: Long, target1: Option[String], target2: Option[String], col6: String)

val res = Seq(
  TestData(1, Some("a"), Some("b"), "c"),
  TestData(2, Some("a"), Some("b"), "c"),
  TestData(3, Some("a"), Some("b"), "c"),
  TestData(4, Some("a"), Some("b"), "c")
).toDS()
res.show(5)
res.map(_.copy(target1 = None, target2 = None)).show(5)
res.printSchema()
This returns:
+---+-------+-------+----+
| id|target1|target2|col6|
+---+-------+-------+----+
| 1| a| b| c|
| 2| a| b| c|
| 3| a| b| c|
| 4| a| b| c|
+---+-------+-------+----+
+---+-------+-------+----+
| id|target1|target2|col6|
+---+-------+-------+----+
| 1| null| null| c|
| 2| null| null| c|
| 3| null| null| c|
| 4| null| null| c|
+---+-------+-------+----+
root
|-- id: long (nullable = false)
|-- target1: string (nullable = true)
|-- target2: string (nullable = true)
|-- col6: string (nullable = true)
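Since the question mentions around 50 target columns whose positions may change, here is a minimal sketch that nulls them all without hard-coding each withColumn call. It assumes the target column names are available as a list (targetCols below is hypothetical):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Hypothetical list of columns to blank out; replace with the real target names.
val targetCols = Seq("target1", "target2", "target3")

// Fold over the names, overwriting each column with a typed null.
val cleaned: DataFrame = targetCols.foldLeft(ds.toDF()) { (acc, c) =>
  acc.withColumn(c, lit(null).cast(StringType))
}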

Calculating the rolling sums in pyspark

I have a dataframe that contains information on daily sales and daily clicks. Before I run my analysis, I want to aggregate the data. To make this clearer, I will explain it with an example dataframe:
item_id date Price Sale Click Discount_code
2 01.03.2019 10 1 10 NULL
2 01.03.2019 8 1 10 Yes
2 02.03.2019 10 0 4 NULL
2 03.03.2019 10 0 6 NULL
2 04.03.2019 6 0 15 NULL
2 05.03.2019 6 0 14 NULL
2 06.03.2019 5 0 7 NULL
2 07.03.2019 5 1 11 NULL
2 07.03.2019 5 1 11 NULL
2 08.03.2019 5 0 9 NULL
If there are two sales for the given day, I have two observations for that particular day. I want to convert my dataframe to the following one by collapsing observations by item_id and price:
item_id Price CSale Discount_code Cclicks firstdate lastdate
2 10 1 No 20 01.03.2019 03.03.2019
2 8 1 Yes 10 01.03.2019 01.03.2019
2 6 0 NULL 29 04.03.2019 05.03.2019
2 5 2 NULL 38 06.03.2019 08.03.2019
Where CSale corresponds to the cumulative sales for the given price and item_id, Cclicks corresponds to the cumulative clicks for the given price and item_id, firstdate is the first date on which the given item was available for the given price, and lastdate is the last date on which the given item was available for the given price.
According to the problem, OP wants to aggregate the DataFrame on the basis of item_id and Price.
# Creating the DataFrame
from pyspark.sql.functions import col, to_date, sum, min, max, first

df = sqlContext.createDataFrame([(2,'01.03.2019',10,1,10,None),(2,'01.03.2019',8,1,10,'Yes'),
                                 (2,'02.03.2019',10,0,4,None),(2,'03.03.2019',10,0,6,None),
                                 (2,'04.03.2019',6,0,15,None),(2,'05.03.2019',6,0,14,None),
                                 (2,'06.03.2019',5,0,7,None),(2,'07.03.2019',5,1,11,None),
                                 (2,'07.03.2019',5,1,11,None),(2,'08.03.2019',5,0,9,None)],
                                ('item_id','date','Price','Sale','Click','Discount_code'))

# Converting the string column date to a proper date
df = df.withColumn('date', to_date(col('date'), 'dd.MM.yyyy'))
df.show()
+-------+----------+-----+----+-----+-------------+
|item_id| date|Price|Sale|Click|Discount_code|
+-------+----------+-----+----+-----+-------------+
| 2|2019-03-01| 10| 1| 10| null|
| 2|2019-03-01| 8| 1| 10| Yes|
| 2|2019-03-02| 10| 0| 4| null|
| 2|2019-03-03| 10| 0| 6| null|
| 2|2019-03-04| 6| 0| 15| null|
| 2|2019-03-05| 6| 0| 14| null|
| 2|2019-03-06| 5| 0| 7| null|
| 2|2019-03-07| 5| 1| 11| null|
| 2|2019-03-07| 5| 1| 11| null|
| 2|2019-03-08| 5| 0| 9| null|
+-------+----------+-----+----+-----+-------------+
As can be seen in the printSchema output below, the dataframe's date column is now a proper date type.
df.printSchema()
root
|-- item_id: long (nullable = true)
|-- date: date (nullable = true)
|-- Price: long (nullable = true)
|-- Sale: long (nullable = true)
|-- Click: long (nullable = true)
|-- Discount_code: string (nullable = true)
Finally, aggregate the columns with agg(). One caveat: since Discount_code is a string column that we need to aggregate as well, we take the first non-null value within each group.
df = df.groupBy('item_id','Price').agg(sum('Sale').alias('CSale'),
                                       first('Discount_code', ignorenulls=True).alias('Discount_code'),
                                       sum('Click').alias('Cclicks'),
                                       min('date').alias('firstdate'),
                                       max('date').alias('lastdate'))
df.show()
+-------+-----+-----+-------------+-------+----------+----------+
|item_id|Price|CSale|Discount_code|Cclicks| firstdate| lastdate|
+-------+-----+-----+-------------+-------+----------+----------+
| 2| 6| 0| null| 29|2019-03-04|2019-03-05|
| 2| 5| 2| null| 38|2019-03-06|2019-03-08|
| 2| 8| 1| Yes| 10|2019-03-01|2019-03-01|
| 2| 10| 1| null| 20|2019-03-01|2019-03-03|
+-------+-----+-----+-------------+-------+----------+----------+
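For readers following the rest of this page in Scala, a roughly equivalent aggregation (a sketch assuming the same column names and an already-parsed date column) would be:
import org.apache.spark.sql.functions.{col, first, max, min, sum}

df.groupBy("item_id", "Price")
  .agg(
    sum("Sale").alias("CSale"),
    first(col("Discount_code"), ignoreNulls = true).alias("Discount_code"),
    sum("Click").alias("Cclicks"),
    min("date").alias("firstdate"),
    max("date").alias("lastdate"))
  .show()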

Spark Graphx: Loading a graph from adjacency matrix

I have been experimenting with the Graphx APIs of Spark, primarily to learn and have a feel of how to use them. In the process, I have to load an adjacency matrix into a graph. The matrix dataset is here.
From the site, the matrix is described as
A number of employees in a factory was interviewed on a question: “Do you like to work with your co-worker?”. Possible answers are 1 for yes and 0 for no. Each employee gave an answer for each other employee, thus creating an adjacency matrix.
So, I decided to name the employees with English letters ("A" onwards). Employees form the nodes of the graph, and their preferences for their co-workers form the edges. I haven't found any straightforward way in Spark to achieve this; my R-programmer friends tell me that it is quite easy to do in their world. So, I set about writing a naive implementation. Here's the code:
val conf = new SparkConf().setMaster("local[*]").setAppName("GraphExploration App")
val spark = SparkSession
  .builder()
  .appName("Spark SQL: beginners exercise")
  .getOrCreate()
val sc = SparkContext.getOrCreate(conf)

val df = spark.read.csv("./BlogInputs/sociogram-employees-un.csv").cache
val allRows = df.toLocalIterator.toIndexedSeq

type EmployeeVertex = (Long, String)
val employeesWithNames = (0 until allRows.length).map(i => (i.toLong, (i + 'A').toChar.toString))
val columnNames = (0 until allRows.length).map(i => "_c" + i).toIndexedSeq // it is a square matrix; rows == columns

val edgesAsCollected = (for {
  rowIndex <- 0 until df.count.toInt
  colIndex <- 0 until df.count.toInt
  if rowIndex != colIndex
} yield {
  // compare the cell's value (a string in this CSV), not its field index
  if (allRows(rowIndex).getAs[String](columnNames(colIndex)) == "1")
    Some(Edge(employeesWithNames(rowIndex)._1, employeesWithNames(colIndex)._1, "Likes"))
  else
    None
}).flatten

val employeeNodes = sc.parallelize(employeesWithNames)
val edges = sc.parallelize(edgesAsCollected)
val employeeGraph = Graph(employeeNodes, edges, "Nobody")
Here is the schema:
scala>df.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: string (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
|-- _c13: string (nullable = true)
|-- _c14: string (nullable = true)
|-- _c15: string (nullable = true)
|-- _c16: string (nullable = true)
|-- _c17: string (nullable = true)
|-- _c18: string (nullable = true)
|-- _c19: string (nullable = true)
|-- _c20: string (nullable = true)
|-- _c21: string (nullable = true)
|-- _c22: string (nullable = true)
|-- _c23: string (nullable = true)
|-- _c24: string (nullable = true)
... and the first few rows here:
scala> df.show
+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|_c20|_c21|_c22|_c23|_c24|
+---+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 0| 1| 0| 1| 1| 0| 1| 1| 1| 0| 0| 1| 0| 1| 1| 0| 1| 1| 0| 1| 0| 1| 0| 1| 1|
| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0|
| 0| 1| 0| 1| 1| 0| 0| 0| 1| 0| 0| 0| 0| 1| 1| 0| 0| 1| 0| 0| 0| 1| 1| 0| 1|
| 0| 1| 1| 0| 0| 0| 1| 0| 0| 0| 1| 1| 0| 1| 0| 0| 1| 1| 0| 0| 1| 0| 1| 1| 0|
This serves my purpose, but I feel there may be a different way. My very limited knowledge of Spark's MLlib APIs is perhaps a barrier. Could someone please comment on this? Even better, could someone show me a better yet simpler way (by editing my code, if necessary)?
I find @DanieldePaula's suggestion acceptable as an answer for the case at hand:
As the matrix is square, a very large number of rows would imply a very large number of columns, in which case using SparkSQL wouldn't seem optimal in my opinion. I think you can use Spark for this problem if the matrix is converted into a Sparse format, e.g. RDD[(row, col, value)], then it would be very easy to create your vertices and edges.
Thanks, Daniel!
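A minimal sketch of that sparse approach, assuming the same CSV of "0"/"1" strings and keeping the letter-naming scheme from the question, could look like this:
import org.apache.spark.graphx.{Edge, Graph}

// Index the rows and keep only the "1" cells as edges (sparse representation).
val rows = spark.read.csv("./BlogInputs/sociogram-employees-un.csv").rdd.zipWithIndex

val sparseEdges = rows.flatMap { case (row, rowIndex) =>
  (0 until row.length).collect {
    case colIndex if rowIndex != colIndex && row.getString(colIndex) == "1" =>
      Edge(rowIndex, colIndex.toLong, "Likes")
  }
}

// 25 employees named "A" onwards, as in the question.
val vertices = sc.parallelize((0 until 25).map(i => (i.toLong, ('A' + i).toChar.toString)))

val employeeGraph = Graph(vertices, sparseEdges, "Nobody")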

Spark SQL: Select with arithmetic on column values and type casting?

I'm using Spark SQL with DataFrames. Is there a way to do a select statement with some arithmetic, just as you can in SQL?
For example, I have the following table:
var data = Array((1, "foo", 30, 5), (2, "bar", 35, 3), (3, "foo", 25, 4))
var dataDf = sc.parallelize(data).toDF("id", "name", "value", "years")
dataDf.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- value: integer (nullable = false)
// |-- years: integer (nullable = false)
dataDf.show()
// +---+----+-----+-----+
// | id|name|value|years|
// +---+----+-----+-----+
// | 1| foo| 30| 5|
// | 2| bar| 35| 3|
// | 3| foo| 25| 4|
//+---+----+-----+-----+
Now, I would like to do a SELECT statement that creates a new column with some arithmetic performed on the existing columns. For example, I would like to compute the ratio value/years. I need to convert value (or years) to a double first. I tried this statement, but it wouldn't parse:
dataDf.
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
show()
<console>:35: error: value toDouble is not a member of org.apache.spark.sql.Column
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
I saw a similar question in "How to change column types in Spark SQL's DataFrame?", but that's not quite what I want.
The proper way to change the type of a Column is to use the cast method. It can take either a type-description string:
dataDf("value").cast("double") / dataDf("years")
or a DataType:
import org.apache.spark.sql.types.DoubleType
dataDf("value").cast(DoubleType) / dataDf("years")
Well, if it's not a requirement to use the select method, you can just use withColumn.
val resDF = dataDf.withColumn("result", col("value").cast("double") / col("years"))
resDF.show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+
If it's a requirement to use a select, one option could be:
val exprs = dataDf.columns.map(col(_)) ++ List((col("value").cast("double") / col("years")).as("result"))
dataDf.select(exprs: _*).show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+
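If SQL-style syntax feels more natural, the same cast and division can also be written with selectExpr (a sketch using the dataDf frame from the question):
dataDf.selectExpr("name", "CAST(value AS DOUBLE) / years AS ratio").show()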