How to replace "|" with "^" using regexp_replace in Spark SQL / PySpark

I have to convert "|" to "^" with spark.sql in PySpark, but it does not work as I expected.
Example:
dd=spark.sql("""select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' desc,
regexp_replace('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9','\\|','\\^') new_desc""")
dd.show(truncate=False)
It is showing:
+------------------------------------------+-------------------------------------------------------------------------------------+
|desc |new_desc |
+------------------------------------------+-------------------------------------------------------------------------------------+
|Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|^Q^1^|^3^1^-^J^U^L^-^1^8^|^C^l^e^a^n^,^ ^Q^3^|^3^1^-^J^A^N^-^1^9^|^C^l^e^a^n^,^ ^Q^9^|
+------------------------------------------+-------------------------------------------------------------------------------------+
Desired output would be:
+------------------------------------------+-------------------------------------------------------------------------------------+
|desc |new_desc |
+------------------------------------------+-------------------------------------------------------------------------------------+
|Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|Q1^31-JUL-18^Clean, Q3^31-JAN-19^Clean, Q9|
+------------------------------------------+-------------------------------------------------------------------------------------+
How can I achieve the desired output? I escaped the pipe using a double backslash, but spark.sql does not seem to handle it. Please advise.

You should only need to single-escape the | for this. I think the double escape makes the pattern match the empty string, which is why you are getting a ^ between every character.
Try
dd=spark.sql("""select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' desc,
regexp_replace('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9','\|','^') new_desc""")
dd.show(truncate=False)
This should just replace any instance of | with ^ which I think is what you're looking for.

Use triple backslash in spark.sql().
spark.sql("""
select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' as desc,
regexp_replace('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9','\\\|','^') as new_desc
"""). \
show(truncate=False)
# +------------------------------------------+------------------------------------------+
# |desc |new_desc |
# +------------------------------------------+------------------------------------------+
# |Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|Q1^31-JUL-18^Clean, Q3^31-JAN-19^Clean, Q9|
# +------------------------------------------+------------------------------------------+
A single escape would work if you're using the dataframe API.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9',)]).toDF(['desc']). \
withColumn('new_desc', func.regexp_replace('desc', '\|', '^')). \
show(truncate=False)
# +------------------------------------------+------------------------------------------+
# |desc |new_desc |
# +------------------------------------------+------------------------------------------+
# |Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|Q1^31-JUL-18^Clean, Q3^31-JAN-19^Clean, Q9|
# +------------------------------------------+------------------------------------------+
P.S. - \ is an escape character in spark.sql(), which is why the extra escaping is needed there.
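Another option (not from the answers above): since this is a single-character substitution, the built-in translate() sidesteps regex escaping altogether. A minimal sketch, assuming an active SparkSession named spark:
spark.sql("""
select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' as desc,
translate('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9', '|', '^') as new_desc
""").show(truncate=False)
# translate() swaps characters one-for-one, so the | needs no escaping at all.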

Related

pyspark regexp_replace replacing multiple values in a column

I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in my dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe (alternation) in the pattern, but it is not working as expected:
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
the "backslash"r (\r) is not showing in your original spark.createDataFrame object because you have to escape it. so your spark.createDataFrame should be. please note the double backslashes
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
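Side note (an assumption, not from the answer): if the column actually holds a real carriage-return character rather than the two literal characters backslash and r, the pattern can target \r directly. A minimal sketch, reusing the dataframe c:
from pyspark.sql import functions as F
# Strip the scheme at the start and a real carriage return at the end of the string.
c = c.withColumn("str", F.regexp_replace("str", r"^https://|\r$", ""))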

pyspark read tab delimiter not behaving as expected

I am using Spark 2.4 and I am trying to read a tab-delimited file; however, while it does read the file, it does not parse the delimiter correctly.
Test file, e.g.,
$ cat file.tsv
col1 col2
1 abc
2 def
The file is tab delimited correctly:
$ cat -A file.tsv
col1^Icol2$
1^Iabc$
2^Idef$
I have tried both delimiter="\t" and sep="\t", but neither gives the expected results.
df = spark.read.format("csv") \
.option("header", "true") \
.option("delimiter", "\t") \
.option("inferSchema","true") \
.load("file.tsv")
df = spark.read.load("file.tsv", \
format="csv",
sep="\t",
inferSchema="true",
header="true")
The result of the read is a single string column.
df.show(10,False)
+---------+
|col1 col2|
+---------+
|1 abc |
|2 def |
+---------+
Am I doing something wrong, or do I have to preprocess the file to convert tabs to pipes before reading?
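A quick diagnostic sketch (not from the thread) to confirm what separator Spark actually sees, by reading the file as plain text and printing the raw first line:
# A real tab shows up as \t in the Python repr of the raw line.
raw = spark.read.text("file.tsv")
print(repr(raw.first().value))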

Reading a CSV file into spark with data containing commas in a quoted field

I have CSV data in a file (data.csv) like so:
lat,lon,data
35.678243, 139.744243, "0,1,2"
35.657285, 139.749380, "1,2,3"
35.594942, 139.548870, "4,5,6"
35.705331, 139.282869, "7,8,9"
35.344667, 139.228691, "10,11,12"
Using the following spark shell command:
spark.read.option("header", true).option("escape", "\"").csv("data.csv").show(false)
I'm getting the following output:
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
I would expect the commas within the double quotes to be ignored in line with RFC 4180, but the parser is interpreting them as a delimiter.
Using the quote option also has no effect:
scala> spark.read.option("header", true).option("quote", "\"").option("escape", "\"").csv("data.csv").show(false)
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
Nor does omitting the options entirely:
scala> spark.read.option("header", true).csv("data.csv").show(false)
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
Notice there is a space after the delimiter (a comma ,).
This breaks quotation processing.
Spark 3.0 allows a multi-character delimiter, which covers the comma plus space in your case.
See https://issues.apache.org/jira/browse/SPARK-24540 for details.
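On Spark 3.0 or later, a rough sketch of that fix in PySpark (assuming sep accepts the two-character string ", " as described in the JIRA):
# Treat ", " (comma followed by a space) as the separator so the quoted field stays intact.
spark.read.option("header", True).option("sep", ", ").csv("data.csv").show(truncate=False)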

How to split using multi-char separator with pipe?

I am trying to split a string column of a dataframe in Spark based on the delimiter ":|:|:".
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and the rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there – just need to escape | within your delimiter, as follows:
val df = Seq(
  (1, "TEST:|:|:51:|:|:PHT054008056"),
  (2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate it's a literal pipe (not the regex alternation operator):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
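For completeness, a rough PySpark equivalent of the same fix, escaping the pipes so they are matched literally (assumes an active SparkSession named spark):
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, "TEST:|:|:51:|:|:PHT054008056")], ["id", "testcolumn"])
# r":\|:\|:" escapes each | so the pattern matches the literal delimiter, not alternation.
df.withColumn("part3", F.split("testcolumn", r":\|:\|:").getItem(2)).show(truncate=False)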

How to use a column value as delimiter in spark sql substring?

I am trying to do a substring operation on a column using another column as the delimiter. Methods like substring_index() expect a string value for the delimiter. Could somebody suggest an approach?
substring_index is defined as substring_index(Column str, String delim, int count).
So if you have a common delimiter in all the strings of that column, as in
+-------------+----+
|col1 |col2|
+-------------+----+
|a,b,c |, |
|d,e,f |, |
|Jonh,is,going|, |
+-------------+----+
You can use the function as follows:
import org.apache.spark.sql.functions._
df.withColumn("splitted", substring_index(col("col1"), ",", 1))
which should give the result:
+-------------+----+--------+
|col1 |col2|splitted|
+-------------+----+--------+
|a,b,c |, |a |
|d,e,f |, |d |
|Jonh,is,going|, |Jonh |
+-------------+----+--------+
Different splitting delimiters on different rows
If you have a different splitting delimiter on different rows, as in
+-------------+----+
|col1 |col2|
+-------------+----+
|a,b,c |, |
|d$e$f |$ |
|jonh|is|going|| |
+-------------+----+
You can define a udf function as
import org.apache.spark.sql.functions._
def subStringIndex = udf((string: String, delimiter: String) => string.substring(0, string.indexOf(delimiter)))
And call it using the .withColumn API as
df.withColumn("splitted", subStringIndex(col("col1"), col("col2")))
the final output is
+-------------+----+--------+
|col1 |col2|splitted|
+-------------+----+--------+
|a,b,c |, |a |
|d$e$f |$ |d |
|jonh|is|going|| |jonh |
+-------------+----+--------+
I hope the answer is helpful
You can try to invoke the related Hive UDF with two different columns as parameters.
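A sketch of that idea in PySpark: the SQL form of substring_index accepts column expressions via expr(), so the delimiter can come from another column without a UDF (assumes an active SparkSession named spark):
from pyspark.sql import functions as F
df = spark.createDataFrame([("a,b,c", ","), ("d$e$f", "$")], ["col1", "col2"])
# In the SQL call, both str and delim are expressions, so the delimiter can come from col2.
df.withColumn("splitted", F.expr("substring_index(col1, col2, 1)")).show()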