I have a PySpark dataframe with a string column in the format Mon-YY, e.g. 'Jan-17', and I am trying to convert it into a date column.
I tried the following, but it does not work:
df.select(to_timestamp(df.t, 'MON-YY HH:mm:ss').alias('dt'))
Is it possible to do this as in SQL, or do I need to write a special conversion function?
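You should use a valid Java date format. Note the lowercase yy: in Java patterns, uppercase YY is the week-based year, so with 'MMM-YY' the value Apr-19 is parsed as 2018-12-30 instead of 2019-04-01. The following will work:
import pyspark.sql.functions as psf
df.select(psf.to_timestamp(psf.col('t'), 'MMM-yy HH:mm:ss').alias('dt'))
Jan-17 will become 2017-01-01 in that case.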
Example
df = spark.createDataFrame([("Jan-17 00:00:00",'a'),("Apr-19 00:00:00",'b')], ['t','x'])
df.show(2)
+---------------+---+
| t| x|
+---------------+---+
|Jan-17 00:00:00| a|
|Apr-19 00:00:00| b|
+---------------+---+
Conversion to timestamp:
import pyspark.sql.functions as psf
df.select(psf.to_timestamp(psf.col('t'), 'MMM-yy HH:mm:ss').alias('dt')).show(2)
+-------------------+
| dt|
+-------------------+
|2017-01-01 00:00:00|
|2019-04-01 00:00:00|
+-------------------+
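As a sanity check on the values one would expect for these strings, the same inputs can be parsed with Python's standard library, independently of Spark. This is only a plain-Python sketch, not Spark code: the strptime directives %b and %y stand in for the Java letters MMM and yy (calendar year), and the code assumes an English locale for the month abbreviations.

```python
from datetime import datetime

# Sample values copied from the example DataFrame above.
samples = ["Jan-17 00:00:00", "Apr-19 00:00:00"]

# %b = abbreviated month name, %y = two-digit calendar year.
# The day of month is absent from the input, so it defaults to 1.
parsed = [datetime.strptime(s, "%b-%y %H:%M:%S") for s in samples]

for text, value in zip(samples, parsed):
    print(text, "->", value)
```

Both values land on the first day of the named month, which is what a calendar-year pattern should produce.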
I have a dataframe nameDF as below:
scala> val nameDF = Seq(("Johnny","A"), ("Johnny","B"), ("Johnny","C"), ("Johnny","D"), ("Bravo","E"), ("Bravo","F"), ("Bravo","G")).toDF("Name","Init")
nameDF: org.apache.spark.sql.DataFrame = [Name: string, Init: string]
scala> nameDF.show
+------+----+
|Name |Init|
+------+----+
|Johnny| A|
|Johnny| B|
|Johnny| C|
|Johnny| D|
|Bravo | E|
|Bravo | F|
|Bravo | G|
+------+----+
Without using SQL, I am trying to group the names and convert the multiple rows of each "Name" into a single row as given below:
+------+-------+
|Name |Init |
+------+-------+
|Johnny|A,B,C,D|
|Bravo |E,F,G |
+------+-------+
As far as I can see, the available pivot options are not suitable for string operations.
Is pivot the correct option in this case? If not, could anyone let me know how I can achieve this?
Try this:
import org.apache.spark.sql.functions._
nameDF.groupBy($"Name")
  .agg(concat_ws(",", sort_array(collect_list($"Init"))).as("Init"))
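To make the three steps of that aggregation explicit (collect the values per group, sort them, join them with a comma), here is a plain-Python sketch of the same logic on the sample rows. It does not use Spark; the variable names are illustrative only.

```python
from collections import defaultdict

# Sample rows from the question: (Name, Init).
rows = [("Johnny", "A"), ("Johnny", "B"), ("Johnny", "C"), ("Johnny", "D"),
        ("Bravo", "E"), ("Bravo", "F"), ("Bravo", "G")]

# Step 1: gather all Init values per Name (the role of collect_list).
groups = defaultdict(list)
for name, init in rows:
    groups[name].append(init)

# Steps 2 and 3: sort each list (sort_array) and join with "," (concat_ws).
result = {name: ",".join(sorted(inits)) for name, inits in groups.items()}

print(result)
```

Sorting matters because the order in which a distributed engine collects values within a group is not guaranteed.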
I have a PySpark dataframe which has a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100, and I want to map them to groups:
1-10 - group1 <== column values from 1 to 10 should get group1 as the value
11-20 - group2
.
.
.
91-100 - group10
How can I achieve this using a PySpark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. We take the integral part of (Var-1)/10 and add 1 to it, because the group numbering starts from 1 rather than 0. Subtracting 1 first keeps the upper bound of each range (10, 20, ..., 100) in its own group, so that 10 maps to group1 and not group2. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but since the prepended word group is not a column, we need to wrap it in lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor((col('Var')-1)/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
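A quick plain-Python check of the bucketing arithmetic at the range boundaries may help here. It compares 1 + floor(Var/10) with the shifted form 1 + floor((Var-1)/10); only the shifted form puts 10 in group1 and 100 in group10, as the question asks. The function names are hypothetical and nothing here depends on Spark.

```python
from math import floor

def group_unshifted(value):
    # 1 + floor(Var / 10): multiples of 10 spill into the next group.
    return "group" + str(1 + floor(value / 10))

def group_shifted(value):
    # 1 + floor((Var - 1) / 10): 1-10 -> group1, ..., 91-100 -> group10.
    return "group" + str(1 + floor((value - 1) / 10))

for value in (1, 10, 11, 54, 99, 100):
    print(value, group_unshifted(value), group_shifted(value))
```

For the four sample values in the DataFrame above (54, 7, 72, 99) both forms agree; they differ only on exact multiples of 10.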
I want to understand the best way to solve date-related problems in Spark SQL. I'm trying to solve a simple problem where I have a file that has date ranges like below:
startdate,enddate
01/01/2018,30/01/2018
01/02/2018,28/02/2018
01/03/2018,30/03/2018
and another table that has date and counts:
date,counts
03/01/2018,10
25/01/2018,15
05/02/2018,23
17/02/2018,43
Now all I want to find is the sum of counts for each date range, so the expected output is:
startdate,enddate,sum(count)
01/01/2018,30/01/2018,25
01/02/2018,28/02/2018,66
01/03/2018,30/03/2018,0
Following is the code I have written, but it's giving me a cartesian result set:
val spark = SparkSession.builder().appName("DateBasedCount").master("local").getOrCreate()
import spark.implicits._
val df1 = spark.read.option("header","true").csv("dateRange.txt").toDF("startdate","enddate")
val df2 = spark.read.option("header","true").csv("dateCount").toDF("date","count")
df1.createOrReplaceTempView("daterange")
df2.createOrReplaceTempView("datecount")
val res = spark.sql("select startdate,enddate,date,visitors from daterange left join datecount on date >= startdate and date <= enddate")
res.rdd.foreach(println)
The output is:
| startdate| enddate| date|visitors|
|01/01/2018|30/01/2018|03/01/2018| 10|
|01/01/2018|30/01/2018|25/01/2018| 15|
|01/01/2018|30/01/2018|05/02/2018| 23|
|01/01/2018|30/01/2018|17/02/2018| 43|
|01/02/2018|28/02/2018|03/01/2018| 10|
|01/02/2018|28/02/2018|25/01/2018| 15|
|01/02/2018|28/02/2018|05/02/2018| 23|
|01/02/2018|28/02/2018|17/02/2018| 43|
|01/03/2018|30/03/2018|03/01/2018| 10|
|01/03/2018|30/03/2018|25/01/2018| 15|
|01/03/2018|30/03/2018|05/02/2018| 23|
|01/03/2018|30/03/2018|17/02/2018| 43|
Now if I group by startdate and enddate with a sum on count, I see the following result, which is incorrect:
| startdate| enddate| sum(count)|
|01/01/2018|30/01/2018| 91.0|
|01/02/2018|28/02/2018| 91.0|
|01/03/2018|30/03/2018| 91.0|
So how do we handle this, and what is the best way to deal with dates in Spark SQL? Should we build the columns as DateType in the first place, or read them as strings and cast them to dates when necessary?
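The problem is that Spark does not automatically interpret your dates as dates; they are just strings, so the join condition compares them lexicographically. The solution is therefore to convert them into dates:
import org.apache.spark.sql.functions.{to_date, unix_timestamp}
val df1 = spark.read.option("header","true").csv("dateRange.txt")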
.toDF("startdate","enddate")
.withColumn("startdate", to_date(unix_timestamp($"startdate", "dd/MM/yyyy").cast("timestamp")))
.withColumn("enddate", to_date(unix_timestamp($"enddate", "dd/MM/yyyy").cast("timestamp")))
val df2 = spark.read.option("header","true").csv("dateCount")
.toDF("date","count")
.withColumn("date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
Then use the same code as before. The output of the SQL command is now:
+----------+----------+----------+------+
| startdate| enddate| date|counts|
+----------+----------+----------+------+
|2018-01-01|2018-01-30|2018-01-03| 10|
|2018-01-01|2018-01-30|2018-01-25| 15|
|2018-02-01|2018-02-28|2018-02-05| 23|
|2018-02-01|2018-02-28|2018-02-17| 43|
|2018-03-01|2018-03-30| null| null|
+----------+----------+----------+------+
If the last line should be ignored, simply change to an inner join instead.
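Using df.groupBy("startdate", "enddate").sum() on this new dataframe will give the wanted output. Note that the count column is read from the CSV as a string, so cast it to a numeric type first (or read the file with the inferSchema option), since sum() without arguments only aggregates numeric columns. With the left join, a range that has no matching rows gets a null sum rather than 0.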
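The effect of comparing dd/MM/yyyy values as strings versus as dates can be reproduced without Spark. The plain-Python sketch below runs the same range condition over the sample data twice, once on the raw strings and once on parsed dates; the helper names are illustrative only.

```python
from datetime import datetime

# Sample data copied from the question.
ranges = [("01/01/2018", "30/01/2018"),
          ("01/02/2018", "28/02/2018"),
          ("01/03/2018", "30/03/2018")]
counts = [("03/01/2018", 10), ("25/01/2018", 15),
          ("05/02/2018", 23), ("17/02/2018", 43)]

def parse(text):
    # dd/MM/yyyy in Java notation corresponds to %d/%m/%Y here.
    return datetime.strptime(text, "%d/%m/%Y").date()

def sum_per_range(key):
    # Apply "date >= startdate and date <= enddate" after mapping
    # every value through `key`, then sum the matching counts.
    totals = []
    for start, end in ranges:
        total = sum(c for d, c in counts if key(start) <= key(d) <= key(end))
        totals.append((start, end, total))
    return totals

as_strings = sum_per_range(lambda text: text)  # lexicographic comparison
as_dates = sum_per_range(parse)                # real date comparison

print(as_strings)
print(as_dates)
```

Because the strings start with the day, every date sorts between "01/..." and "28/..." or "30/...", so every row matches every range; after parsing, only the rows that really fall inside each range are summed.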
I have a spark dataframe of the below format:
+--------------------+
|value |
+--------------------+
|Id,date |
|000027,2017-11-14 |
|000045,2017-11-15 |
|000056,2018-09-09 |
|C000056,2018-07-01 |
+--------------------+
I need to go through each row, split it on the comma (,) and place the values in different columns (Id and date as two separate columns).
I am new to Spark and not sure whether this can be done with a lambda function. Any suggestions would be appreciated.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
val spark=SparkSession.builder().appName("Demo").getOrCreate()
import spark.implicits._ //needed for toDF on a Seq
val df=Seq("a,b,c,f","d,f,g,h").toDF("value")
df.show //show the dataFrame
+-------+
| value|
+-------+
|a,b,c,f|
|d,f,g,h|
+-------+
//split each row on the "," delimiter and create an RDD[Row]
val rdd=df.rdd.map(x=>Row(x.getString(0).split(","):_*))
val schema=StructType(Array("name","class","rank","grade").map(x=>StructField(x,StringType,true)))
spark.createDataFrame(rdd,schema).show
+----+-----+----+-----+
|name|class|rank|grade|
+----+-----+----+-----+
| a| b| c| f|
| d| f| g| h|
+----+-----+----+-----+
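Applied to the data from the question, the same idea (split each line on the comma and map the pieces to named columns) looks like this in a plain-Python sketch. It is not Spark code; it assumes, as in the question's table, that the first line holds the column names.

```python
# Lines as they appear in the single "value" column of the question.
raw = [
    "Id,date",
    "000027,2017-11-14",
    "000045,2017-11-15",
    "000056,2018-09-09",
    "C000056,2018-07-01",
]

# The first line is treated as the header, the rest as data rows.
header, *lines = raw
columns = header.split(",")

# Split every data line and pair each piece with its column name.
records = [dict(zip(columns, line.split(","))) for line in lines]

for record in records:
    print(record)
```

Keeping the Id as a string preserves the leading zeros and the "C" prefix in the sample values.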
I find it hard to understand the difference between these two methods from pyspark.sql.functions as the documentation on PySpark official website is not very informative. For example the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
So what is the difference between the two, and when should I use one rather than the other?
The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value
Say we have a data frame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If we use col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
If we use F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, in contrast, creates a constant column with the given string as its values.
Both of them return a Column object but the content and meaning are different.
To put it succinctly, col is typically used to refer to an existing column in a DataFrame, whereas lit is typically used to set the value of a column to a literal.
To illustrate with an example:
Assume I have a DataFrame df containing two columns of IntegerType, col_a and col_b.
If I wanted a column total that is the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
If instead I wanted a column fixed_val with the value "Hello" for all rows of the DataFrame df:
df.withColumn('fixed_val', lit('Hello'))
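One way to picture the distinction is a toy model in plain Python, where a "column expression" is just a function evaluated once per row. This is purely illustrative and is not how PySpark implements col or lit; the three functions below are hypothetical stand-ins that only mimic the behaviour shown above.

```python
def col(name):
    # Toy stand-in: look the value up in the row by column name.
    return lambda row: row[name]

def lit(value):
    # Toy stand-in: ignore the row and always return the same constant.
    return lambda row: value

def with_column(rows, name, expr):
    # Toy stand-in for withColumn: evaluate expr for every row.
    return [{**row, name: expr(row)} for row in rows]

rows = [{"A": "a"}, {"A": "b"}, {"A": "c"}]

from_col = with_column(rows, "new", col("A"))  # copies column A
from_lit = with_column(rows, "new", lit("A"))  # constant string "A"

print(from_col)
print(from_lit)
```

In this model col("A") depends on each row's contents, while lit("A") yields the same value everywhere, which mirrors the two tables shown in the first answer.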