PySpark: add colons to separate a time string

I have the string 103400 and I need to write it as 10:34:00 using PySpark. Let's take the following column as an example:
time
130045
230022
And I want it to become like this:
time
13:00:45
23:00:22

You can try regexp_replace:
from pyspark.sql.functions import col, regexp_replace

df.withColumn("time", regexp_replace(col("time"), "(\\d{2})(\\d{2})(\\d{2})", "$1:$2:$3")).show()
+--------+
| time |
+--------+
|13:00:45|
|23:00:22|
+--------+
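If the values are always six digits, a substring-based approach works too. A minimal PySpark sketch, assuming the time column is a fixed-width string:
from pyspark.sql.functions import col, concat_ws, substring

# Build HH:MM:SS by slicing the fixed-width string (substring is 1-indexed)
df.withColumn(
    "time",
    concat_ws(":",
              substring(col("time"), 1, 2),
              substring(col("time"), 3, 2),
              substring(col("time"), 5, 2))
).show()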

Related

PySpark selectExpr() not working for first() and last()

I have two statements which are, to my knowledge, exactly alike; select() works fine, but selectExpr() generates the following results.
+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
| 85123A| 22138|
+-----------------------+----------------------+
+-----------+----------+
|first_value|last_value|
+-----------+----------+
| StockCode| StockCode|
+-----------+----------+
The implementation is as follows:
from pyspark.sql.functions import col, first, last

df.select(first(col("StockCode")), last(col("StockCode"))).show()
df.selectExpr("""first('StockCode') as first_value""", """last('StockCode') as last_value""").show()
Can anyone explain the behaviour?
selectExpr treats everything you pass it as a SQL select clause.
Hence anything you write in single quotes acts as a string literal in SQL. If you want to pass the column to selectExpr, use backticks (`) as below:
df.selectExpr("""first(`StockCode`) as first_value""", """last(`StockCode`) as last_value""").show()
Backticks also let you escape spaces in a column name.
You can omit the backticks as well, provided your column name does not start with a number (like 12col) and does not contain spaces (like column name):
df.selectExpr("""first(StockCode) as first_value""", """last(StockCode) as last_value""").show()
You should pass it like below:
df_b = df_b.selectExpr('first(count) as first', 'last(count) as last')
df_b.show(truncate = False)
+-----+----+
|first|last|
+-----+----+
|2527 |13 |
+-----+----+
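Going back to the question's df and its StockCode column, a small hedged check that makes the contrast visible in one statement (the aliases literal and column_value are just illustrative):
# Single quotes make 'StockCode' a SQL string literal; backticks reference the column
df.selectExpr("first('StockCode') as literal", "first(`StockCode`) as column_value").show()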

Spark 2.0: How to convert a DF date/timestamp column to another date format in Scala?

For my learning, I have been using the sample dataset below.
+-------------------+-----+-----+-----+-----+-------+
| MyDate| Open| High| Low|Close| Volume|
+-------------------+-----+-----+-----+-----+-------+
|2006-01-03 00:00:00|983.8|493.8|481.1|492.9|1537660|
|2006-01-04 00:00:00|979.6|491.0|483.5|483.8|1871020|
|2006-01-05 00:00:00|972.2|487.8|484.0|486.2|1143160|
|2006-01-06 00:00:00|977.8|489.0|482.0|486.2|1370250|
|2006-01-09 00:00:00|973.4|487.4|483.0|483.9|1680740|
+-------------------+-----+-----+-----+-----+-------+
I tried to change the "MyDate" column values to a different format like "YYYY-MON" and wrote this:
citiDataDF.withColumn("New-Mydate",to_timestamp($"MyDate", "yyyy-MON")).show(5)
After executing the code, I found the new column "New-Mydate", but I couldn't see the desired output format. Can you please help?
You need date_format instead of to_timestamp:
import org.apache.spark.sql.functions._

val citiDataDF = List("2006-01-03 00:00:00").toDF("MyDate")
citiDataDF.withColumn("New-Mydate", date_format($"MyDate", "yyyy-MMM")).show(5)
Result:
+-------------------+----------+
| MyDate|New-Mydate|
+-------------------+----------+
|2006-01-03 00:00:00| 2006-Jan|
+-------------------+----------+
Note: Three "M" mean the month as string, if you want a month as Int, you must use only two "M"

How to split column into multiple columns in Spark 2?

I am reading data from HDFS into a DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data into the format shown above.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
F.regexp_extract($"value", r, 1).as("id"),
F.regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you the required result.
import org.apache.spark.sql.functions._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
then you can simply use the split and regexp_replace built-in functions to get your desired output DataFrame:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful
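For PySpark users, a hedged sketch of the regexp_extract approach above, assuming the same single string column value:
from pyspark.sql.functions import col, regexp_extract

r = r"\((\d+),\{.*community:(\d+).*\}\)"
df.select(
    regexp_extract(col("value"), r, 1).alias("id"),
    regexp_extract(col("value"), r, 2).alias("community")
).show()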

Spark-scala: Select distinct arrays from a column dataframe ignoring ordering

I've been thinking about the following problem but haven't reached a solution: I have a DataFrame df with only one column A, whose elements have data type Array[String]. I'm trying to get all the distinct arrays of A, ignoring the order of the Strings in the arrays.
For example, if the dataframe is the following:
df.select("A").show()
+--------+
|A |
+--------+
|[a,b,c] |
|[d,e] |
|[f] |
|[e,d] |
|[c,a,b] |
+--------+
I would like to get the dataframe
+--------+
|[a,b,c] |
|[d,e] |
|[f] |
+--------+
I've tried distinct(), dropDuplicates() and other functions, but it doesn't work.
I would appreciate any help. Thank you in advance.
You can use the collect_list function to collect all the arrays in that column, then use a udf to sort each individual array and return the distinct arrays of the collected list. Finally, use the explode function to distribute the distinct collected arrays into separate rows:
import scala.collection.mutable
import org.apache.spark.sql.functions._

def distinctCollectUDF = udf((a: mutable.WrappedArray[mutable.WrappedArray[String]]) => a.map(array => array.sorted).distinct)

df.select(distinctCollectUDF(collect_list("A")).as("A")).withColumn("A", explode($"A")).show(false)
You should have your desired result.
You might try and use the contains method.
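A UDF-free alternative may also work: sort each array so that differently ordered copies compare equal, then deduplicate. A PySpark sketch, assuming column A holds arrays of strings:
from pyspark.sql.functions import sort_array

# [c,a,b] and [a,b,c] both become [a,b,c], so distinct() removes the duplicate
df.select(sort_array("A").alias("A")).distinct().show(truncate=False)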

Convert date to end of month in Spark

I have a Spark DataFrame as shown below:
#Create DataFrame
df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                 dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
df <- createDataFrame(df)
#Make sure df$dates column is in 'date' format
df <- withColumn(df, 'dates', cast(df$dates, 'date'))
name | dates
--------------------
Thomas |2017-01-05
William |2017-02-23
Bill |2017-03-16
John |2017-04-08
I want to change the dates to the end-of-month date, so they would look as shown below. How do I do this? Either SparkR or PySpark code is fine.
name | dates
--------------------
Thomas |2017-01-31
William |2017-02-28
Bill |2017-03-31
John |2017-04-30
You may use the following (PySpark):
from pyspark.sql.functions import last_day
df.select('name', last_day(df.dates).alias('dates')).show()
To clarify, last_day(date) returns the last day of the month to which date belongs.
I'm pretty sure there is a similar function in SparkR:
https://spark.apache.org/docs/1.6.2/api/R/last_day.html
last_day is a poorly named function and should be wrapped in something more descriptive to make the code easier to read.
endOfMonth is a better function name. Here's how to use this function with the Scala API. Suppose you have the following data:
+----------+
| some_date|
+----------+
|2016-09-10|
|2020-01-01|
|2016-01-10|
| null|
+----------+
Run the endOfMonth function that's part of spark-daria:
import com.github.mrpowers.spark.daria.sql.functions._
df.withColumn("res", endOfMonth(col("some_date"))).show()
Here are the results:
+----------+----------+
| some_date| res|
+----------+----------+
|2016-09-10|2016-09-30|
|2020-01-01|2020-01-31|
|2016-01-10|2016-01-31|
| null| null|
+----------+----------+
I'll try to add this function to quinn too, so that PySpark users also have an easily accessible function.
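In the meantime, a minimal sketch of what such a wrapper might look like in PySpark (the name end_of_month is illustrative, not an existing quinn function; df is assumed to have a some_date column as above):
from pyspark.sql.functions import last_day

# Thin wrapper that only gives last_day a more descriptive name
def end_of_month(col_name):
    return last_day(col_name)

df.withColumn("res", end_of_month("some_date")).show()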
For completeness, here is the SparkR code:
df <- withColumn(df, 'dates', last_day(df$dates))