How to replace "|" with "^" using regexp_replace in Spark SQL / PySpark

I have to convert "|" to "^" with spark.sql in PySpark, but it does not work as I expected.
Example:
dd=spark.sql("""select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' desc,
regexp_replace('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9','\\|','\\^') new_desc""")
dd.show(truncate=False)
It is showing:
+------------------------------------------+-------------------------------------------------------------------------------------+
|desc |new_desc |
+------------------------------------------+-------------------------------------------------------------------------------------+
|Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|^Q^1^|^3^1^-^J^U^L^-^1^8^|^C^l^e^a^n^,^ ^Q^3^|^3^1^-^J^A^N^-^1^9^|^C^l^e^a^n^,^ ^Q^9^|
+------------------------------------------+-------------------------------------------------------------------------------------+
Desired output would be:
+------------------------------------------+-------------------------------------------------------------------------------------+
|desc |new_desc |
+------------------------------------------+-------------------------------------------------------------------------------------+
|Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|Q1^31-JUL-18^Clean, Q3^31-JAN-19^Clean, Q9|
+------------------------------------------+-------------------------------------------------------------------------------------+
How can I achieve the desired output? I escaped the pipe using a double backslash, but spark.sql does not seem to handle it. Please advise.

You should only need to single-escape the | for this. I think the double escape makes the pattern match the empty string, which is why you are getting a ^ between every character.
Try
dd=spark.sql("""select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' desc,
regexp_replace('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9','\|','^') new_desc""")
dd.show(truncate=False)
This should just replace any instance of | with ^ which I think is what you're looking for.

Use triple backslash in spark.sql().
spark.sql("""
select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' as desc,
regexp_replace('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9','\\\|','^') as new_desc
"""). \
show(truncate=False)
# +------------------------------------------+------------------------------------------+
# |desc |new_desc |
# +------------------------------------------+------------------------------------------+
# |Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|Q1^31-JUL-18^Clean, Q3^31-JAN-19^Clean, Q9|
# +------------------------------------------+------------------------------------------+
A single escape would work if you're using the dataframe API.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9',)]).toDF(['desc']). \
withColumn('new_desc', func.regexp_replace('desc', '\|', '^')). \
show(truncate=False)
# +------------------------------------------+------------------------------------------+
# |desc |new_desc |
# +------------------------------------------+------------------------------------------+
# |Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9|Q1^31-JUL-18^Clean, Q3^31-JAN-19^Clean, Q9|
# +------------------------------------------+------------------------------------------+
P.S. - \ is an escape character in spark.sql(), which is why the extra escaping is needed there.
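Another option (not from the answers above): since this is a single-character substitution, the built-in translate() sidesteps regex escaping altogether. A minimal sketch, assuming an active SparkSession named spark:
spark.sql("""
select 'Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9' as desc,
translate('Q1|31-JUL-18|Clean, Q3|31-JAN-19|Clean, Q9', '|', '^') as new_desc
""").show(truncate=False)
# translate() swaps characters one-for-one, so the | needs no escaping at all.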

Related

pyspark regexp_replace replacing multiple values in a column

I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in my dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe (alternation) in the pattern, but it is not working as expected:
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
the "backslash"r (\r) is not showing in your original spark.createDataFrame object because you have to escape it. so your spark.createDataFrame should be. please note the double backslashes
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
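Side note (an assumption, not from the answer): if the column actually holds a real carriage-return character rather than the two literal characters backslash and r, the pattern can target \r directly. A minimal sketch, reusing the dataframe c:
from pyspark.sql import functions as F
# Strip the scheme at the start and a real carriage return at the end of the string.
c = c.withColumn("str", F.regexp_replace("str", r"^https://|\r$", ""))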

pyspark read tab delimiter not behaving as expected

I am using Spark 2.4 and I am trying to read a tab-delimited file; however, while it does read the file, it does not parse the delimiter correctly.
Test file, e.g.,
$ cat file.tsv
col1 col2
1 abc
2 def
The file is tab delimited correctly:
$ cat -A file.tsv
col1^Icol2$
1^Iabc$
2^Idef$
I have tried both delimiter="\t" and sep="\t", but neither gives the expected results.
df = spark.read.format("csv") \
.option("header", "true") \
.option("delimiter", "\t") \
.option("inferSchema","true") \
.load("file.tsv")
df = spark.read.load("file.tsv", \
format="csv",
sep="\t",
inferSchema="true",
header="true")
The result of the read is a single string column.
df.show(10,False)
+---------+
|col1 col2|
+---------+
|1 abc |
|2 def |
+---------+
Am I doing something wrong, or do I have to preprocess the file to convert tabs to pipes before reading?
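A quick diagnostic sketch (not from the thread) to confirm what separator Spark actually sees, by reading the file as plain text and printing the raw first line:
# A real tab shows up as \t in the Python repr of the raw line.
raw = spark.read.text("file.tsv")
print(repr(raw.first().value))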

Reading a CSV file into spark with data containing commas in a quoted field

I have CSV data in a file (data.csv) like so:
lat,lon,data
35.678243, 139.744243, "0,1,2"
35.657285, 139.749380, "1,2,3"
35.594942, 139.548870, "4,5,6"
35.705331, 139.282869, "7,8,9"
35.344667, 139.228691, "10,11,12"
Using the following spark shell command:
spark.read.option("header", true).option("escape", "\"").csv("data.csv").show(false)
I'm getting the following output:
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
I would expect the commas within the double quotes to be ignored in line with RFC 4180, but the parser is interpreting them as a delimiter.
Using the quote option also has no effect:
scala> spark.read.option("header", true).option("quote", "\"").option("escape", "\"").csv("data.csv").show(false)
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
Nor does omitting the options entirely:
scala> spark.read.option("header", true).csv("data.csv").show(false)
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
Notice there is a space after the delimiter (a comma ,).
This breaks quotation processing.
Spark 3.0 allows a multi-character delimiter, which covers the comma plus space in your case.
See https://issues.apache.org/jira/browse/SPARK-24540 for details.
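On Spark 3.0 or later, a rough sketch of that fix in PySpark (assuming sep accepts the two-character string ", " as described in the JIRA):
# Treat ", " (comma followed by a space) as the separator so the quoted field stays intact.
spark.read.option("header", True).option("sep", ", ").csv("data.csv").show(truncate=False)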

How to split using multi-char separator with pipe?

I am trying to split a string column of a dataframe in Spark based on the delimiter ":|:|:".
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and the rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there – just need to escape | within your delimiter, as follows:
val df = Seq(
  (1, "TEST:|:|:51:|:|:PHT054008056"),
  (2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate it's a literal pipe (not the regex alternation operator):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
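For completeness, a rough PySpark equivalent of the same fix, escaping the pipes so they are matched literally (assumes an active SparkSession named spark):
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, "TEST:|:|:51:|:|:PHT054008056")], ["id", "testcolumn"])
# r":\|:\|:" escapes each | so the pattern matches the literal delimiter, not alternation.
df.withColumn("part3", F.split("testcolumn", r":\|:\|:").getItem(2)).show(truncate=False)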

How to use a column value as delimiter in spark sql substring?

I am trying to do a substring operation on a column using another column as the delimiter. Methods like substring_index() expect a string value for the delimiter. Could somebody suggest an approach?
substring_index is defined as substring_index(Column str, String delim, int count).
So if you have a common delimiter in all the strings of that column, as in
+-------------+----+
|col1 |col2|
+-------------+----+
|a,b,c |, |
|d,e,f |, |
|Jonh,is,going|, |
+-------------+----+
You can use the function as follows:
import org.apache.spark.sql.functions._
df.withColumn("splitted", substring_index(col("col1"), ",", 1))
which should give the result:
+-------------+----+--------+
|col1 |col2|splitted|
+-------------+----+--------+
|a,b,c |, |a |
|d,e,f |, |d |
|Jonh,is,going|, |Jonh |
+-------------+----+--------+
Different splitting delimiters on different rows
If you have a different splitting delimiter on different rows, as in
+-------------+----+
|col1 |col2|
+-------------+----+
|a,b,c |, |
|d$e$f |$ |
|jonh|is|going|| |
+-------------+----+
You can define a udf function as
import org.apache.spark.sql.functions._
def subStringIndex = udf((string: String, delimiter: String) => string.substring(0, string.indexOf(delimiter)))
And call it using the .withColumn API as
df.withColumn("splitted", subStringIndex(col("col1"), col("col2")))
the final output is
+-------------+----+--------+
|col1 |col2|splitted|
+-------------+----+--------+
|a,b,c |, |a |
|d$e$f |$ |d |
|jonh|is|going|| |jonh |
+-------------+----+--------+
I hope the answer is helpful
You can try to invoke the related Hive UDF with two different columns as parameters.
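A sketch of that idea in PySpark: the SQL form of substring_index accepts column expressions via expr(), so the delimiter can come from another column without a UDF (assumes an active SparkSession named spark):
from pyspark.sql import functions as F
df = spark.createDataFrame([("a,b,c", ","), ("d$e$f", "$")], ["col1", "col2"])
# In the SQL call, both str and delim are expressions, so the delimiter can come from col2.
df.withColumn("splitted", F.expr("substring_index(col1, col2, 1)")).show()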