I have a non-breaking trailing space in a string in a column. I have tried the solutions below but cannot get rid of the space.
df.select(
col("city"),
regexp_replace(col("city"), " ", ""),
regexp_replace(col("city"), "[\\r\\n]", ""),
regexp_replace(col("city"), "\\s+$", ""),
rtrim(col("city"))
).show()
Is there any other possible solution I can try to remove the blank space?
You can use the ltrim, rtrim or trim functions from org.apache.spark.sql.functions:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
("Bengaluru "),
(" Bengaluru"),
(" Bengaluru ")
).toDF("city")
df.show
+----------------+
| city|
+----------------+
| Bengaluru |
| Bengaluru|
| Bengaluru |
+----------------+
df.withColumn("city", ltrim(col("city"))).show
+-------------+
| city|
+-------------+
| Bengaluru |
| Bengaluru|
|Bengaluru |
+-------------+
df.withColumn("city", rtrim(col("city"))).show
+------------+
| city|
+------------+
| Bengaluru|
| Bengaluru|
| Bengaluru|
+------------+
df.withColumn("city", trim(col("city"))).show
+---------+
| city|
+---------+
|Bengaluru|
|Bengaluru|
|Bengaluru|
+---------+
Choose the one you need depending on whether you want to remove leading spaces, trailing spaces, or both.
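If the stray character is specifically a non-breaking space (U+00A0) rather than a plain space, trim and the \s character class will not catch it; a minimal sketch (my addition, not part of the trim-based approach above) that targets it explicitly with regexp_replace:
import org.apache.spark.sql.functions._

// strip trailing regular whitespace and non-breaking spaces (U+00A0)
df.withColumn("city", regexp_replace(col("city"), "[\\s\\u00A0]+$", "")).show()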
Hope this helps!
I have a DF of emails where the first character in the email is occasionally a symbol. I am trying to remove it if it exists.
+--------------------+
| Email |
+--------------------+
|bob#gmail.com |
|*steve#yahoo.com |
|leeroy#hotmail.com |
|#grant#gmail.com |
+--------------------+
The final df would look like this:
+--------------------+
| Email |
+--------------------+
|bob#gmail.com |
|steve#yahoo.com |
|leeroy#hotmail.com |
|grant#gmail.com |
+--------------------+
Is there a way to do this efficiently?
Using the regexp_replace function should do the job. You can adapt the regex if needed; here it removes a single non-alphanumeric character at the beginning of the string.
val df1 = df.withColumn("Email", regexp_replace(col("Email"), "^[^a-zA-Z0-9]", ""))
df1.show()
//+------------------+
//| Email|
//+------------------+
//| bob#gmail.com|
//| steve#yahoo.com|
//|leeroy#hotmail.com|
//| grant#gmail.com|
//+------------------+
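If an address could start with more than one such symbol, a small variation (my addition, using the same column) adds + to the pattern so the whole leading run is stripped:
// strips any run of leading non-alphanumeric characters
val df2 = df.withColumn("Email", regexp_replace(col("Email"), "^[^a-zA-Z0-9]+", ""))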
Say I have this table
+------------+
|value |
+------------+
| 2.3 |
| 2.0 |
| 1.55|
+------------+
I want the output value to always have two decimal places, something like this:
+------------+
|value |
+------------+
| 2.30|
| 2.00|
| 1.55|
+------------+
This is just for the output part, so I can convert it to a String to make it easier. I'm trying to create a regexp to do this, but I feel there should be an easier way to do it with the double value.
Any tips will help.
Thanks!
You can cast the column to a decimal type with two digits after the decimal point:
import spark.implicits._
Seq(2.3, 2.0, 1.55).toDF()
.withColumn("value", 'value.cast("decimal(38, 2)"))
.show()
+-----+
|value|
+-----+
| 2.30|
| 2.00|
| 1.55|
+-----+
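If the goal is purely a formatted String for display, another option (my addition, not part of the answer above) is the built-in format_number function; note that it also inserts thousands separators for values of 1,000 or more:
import spark.implicits._
import org.apache.spark.sql.functions._

// format_number renders the value as a String with exactly two decimal places
Seq(2.3, 2.0, 1.55).toDF()
  .withColumn("value", format_number('value, 2))
  .show()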
I have a Pyspark dataframe with, among other columns, a column of MSNs (of string type) like the following:
+------+
| Col1 |
+------+
| 654- |
| 1859 |
| 5875 |
| 784- |
| 596- |
| 668- |
| 1075 |
+------+
As you can see, those entries with a value of less than 1000 (i.e. three characters) have a - character at the end to make a total of 4 characters.
I want to get rid of that - character, so that I end up with something like:
+------+
| Col2 |
+------+
| 654 |
| 1859 |
| 5875 |
| 784 |
| 596 |
| 668 |
| 1075 |
+------+
I have tried the following code (where df is the dataframe containing the column), but it does not appear to work:
if df.Col1[3] == "-":
df = df.withColumn('Col2', df.series.substr(1, 3))
return df
else:
return df
Does anyone know how to do it?
You can replace - in the column with an empty string ("") using F.regexp_replace. See the code below:
df.withColumn("Col2", F.regexp_replace("Col1", "-", "")).show()
+----+----+
|Col1|Col2|
+----+----+
|589-| 589|
|1245|1245|
|145-| 145|
+----+----+
Here is a solution using the .substr() method:
df.withColumn("Col2", F.when(F.col("Col1").substr(4, 1) == "-",
F.col("Col1").substr(1, 3)
).otherwise(
F.col("Col1"))).show()
+----+----+
|Col1|Col2|
+----+----+
|654-| 654|
|1859|1859|
|5875|5875|
|784-| 784|
|596-| 596|
|668-| 668|
|1075|1075|
+----+----+
Given a DataFrame df, when I do
df.select(df['category_id']+1000), I get results
>>> df.select(df['category_id']).limit(3).show()
+-----------+
|category_id|
+-----------+
| 1|
| 2|
| 3|
+-----------+
>>> df.select(df['category_id']+1000).limit(3).show()
+--------------------+
|(category_id + 1000)|
+--------------------+
| 1001|
| 1002|
| 1003|
+--------------------+
However, when I do df.select(df['category_name'] + 'blah'), I get null:
>>> df.select(df['category_name']).limit(3).show()
+-------------------+
| category_name|
+-------------------+
| Football|
| Soccer|
|Baseball & Softball|
+-------------------+
>>> df.select(df['category_name']+'blah').limit(3).show()
+----------------------+
|(category_name + blah)|
+----------------------+
| null|
| null|
| null|
+----------------------+
Just wondering, what makes one work but not the other? What am I missing?
Unlike Python, the + operator is not defined as string concatenation in Spark (and SQL doesn't do this either). Spark resolves + as numeric addition, so both operands are cast to a numeric type; a non-numeric string like 'Football' casts to null, which is why the result is null. For string concatenation Spark provides concat/concat_ws instead.
import pyspark.sql.functions as f
df.select(f.concat(df.category_name, f.lit('blah')).alias('category_name')).show(truncate=False)
#+-----------------------+
#|category_name |
#+-----------------------+
#|Footballblah |
#|Soccerblah |
#|Baseball & Softballblah|
#+-----------------------+
df.select(f.concat_ws(' ', df.category_name, f.lit('blah')).alias('category_name')).show(truncate=False)
#+------------------------+
#|category_name |
#+------------------------+
#|Football blah |
#|Soccer blah |
#|Baseball & Softball blah|
#+------------------------+
I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using a Zeppelin notebook.
One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, followed by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-MM-dd format for proper String ordering within the Window partitions:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
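As a side note (my addition, not part of the answer above), the UDF could be avoided by casting the reordered strings to dates and using the built-in datediff; a rough sketch with a hypothetical df2NoUdf name, reusing the same df and imports:
// same Days_diff without a UDF, computed on proper DateType columns
val df2NoUdf = df.
  withColumn("Date_ymd",
    to_date(regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1"))).
  withColumn("Prior_date_ymd",
    lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
  withColumn("Days_diff",
    when($"Prior_date_ymd".isNotNull, datediff($"Date_ymd", $"Prior_date_ymd")).
    otherwise(0L))
// df2NoUdf then feeds the same groupBy/avg aggregation as df2 above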
You can use the lag function over a Window to get the previous date, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp so that the sorting works correctly:
import spark.implicits._
import org.apache.spark.sql.functions._
val timestampeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give you a dataframe like this:
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
then you can apply the lag function over a window specification and finally get the date difference:
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestampeddf.withColumn(
  "Day_diff",
  when(lag("Date_d_id", 1).over(windowSpec).isNull, null)
    .otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec)))
)
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Note that avg skips the null Day_diff rows, so the first visit of each store does not count towards the denominator. If you instead want to divide by the total number of rows per store (effectively treating the first visit as a difference of 0), you can do
laggeddf.groupBy("Store_id")
  .agg((sum("Day_diff") / count("*")).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful.