pyspark replace regex with regex - pyspark

I am trying to replaces a regex (in this case a space with a number) with
I have a Spark dataframe that contains a string column. I want to replace a regex (space plus a number) with a comma without losing the number. I have tried both of these with no luck:
df.select("A", f.regexp_replace(f.col("A"), "\s+[0-9]", ' ,
').alias("replaced"))
df.select("A", f.regexp_replace(f.col("A"), "\s+[0-9]", '\s+[0-9] ,
').alias("replaced"))
Any help is appreciated.

What you need is another function, regex_extract
So, you have to divide the regex and get the part you need. It could be something like this:
df.select("A", f.regexp_extract(f.col("A"), "(\s+)([0-9])", 2).alias("replaced"))

Related

How can I use REGEX_REPLACE in pyspark SQL to remove \n and \r from column

I'm trying to read data from ScyllaDB and want to remove \n and \r character from a column. The problem is that these characters are stored as string in the column of a table being read and I need to use REGEX_REPLACE as I'm using Spark SQL for this. The regex pattern don't seem to work which work in MySQL. The string becomes blank but doesn't remove the characters. Below is the snippet of the query being used in Spark SQL. Help appreciated.
The following string is present in the message column: 'hello\nworld\r'
The expected output is 'hello world'
df=spark.sql("select REGEXP_REPLACE(message,'\n|\r|\r\n',' ') as replaced_message from delivery_sms")
Thanks Andrew's for the answer.
The following worked for me:
df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show()
If anybody is looking for a modern solution to this problem:
df.select(
F.translate(F.col("test"), "\n\t", " " * len("\n\t")).alias("test")
)

How to remove double Quotes In DataStage using a transformer stage?

We receiving Input data like below
“VENKATA,KRISHNA”
I want output like below
VENKATA,KRISHNA
Can anyone help me with this
Check out the Ereplace function - it allows to replace certain characters so you could rplace " with '' (empty string).
An alternative is TRIM - you can specify which character the command should trim and also if All occurrences or Both (from both sides of the string) plus more.

How to remove special characters from a string in postgresql

I am trying to remove using REGEXP_REPLACE the following special characters: "[]{}
from the following text field: [{"x":"y","s":"G_1","cn":"C8"},{"cn":"M2","gn":"G_2","cn":"CA99"},{"c":"ME3","gn":"G_3","c":"CA00"}]
and replace them with nothing, not even a space.
*Needless to say, this is just an example string, and I need to find a consistent solution for similar but different strings.
I was trying to run the following: SELECT REGEXP_REPLACE('[{"x":"y","s":"G_1","cn":"C8"},{"cn":"M2","gn":"G_2","cn":"CA99"},{"c":"ME3","gn":"G_3","c":"CA00"}] ','[{[}]":]','')
But received pretty much the same string..
Thanks in advance!
You need to escape the special characters (\), and to specify that you want to repeat the operation for every characters ('g') else it will stop at the 1st match
SELECT REGEXP_REPLACE(
'[{"x":"y","s":"G_1","cn":"C8"},{"cn":"M2","gn":"G_2","cn":"CA99"},{"c":"ME3","gn":"G_3","c":"CA00"}] ',
'[{\[}\]":]',
'',
'g');
regexp_replace
--------------------------------------------------
xy,sG_1,cnC8,cnM2,gnG_2,cnCA99,cME3,gnG_3,cCA00
(1 row)

Strip out the characters which is non numeric, dashes and pipes

I am trying to find a solution but somehow i am getting wrong output (referred some online solutions and confusing myself. please advise where i am going wrong.
I need to Strip out any characters that is non-numeric,dash "-" or pipe "|" using plsql.
As an example:
if I need to filter the string 0094-78556232_imk*.ext|4444; the output should be 0094-78556232|4444
Use REGEXP_REPLACE:
SELECT
col,
REGEXP_REPLACE (col, '[^0-9|-]', '') AS col_updated
FROM yourTable;
Demo
Don't use regexp_replace, especially if performance is important.
Instead use the standard string function TRANSLATE. Like so:
select col,
translate(col, '0123456789|-' || col, '01234567890|-') as col_updated
from yourTable;
This translates each character in the col value, according to the following scheme: 0 is translated to itself, ...., - is translated to itself. Any other character in col, which is not in this list already, is "translated" to nothing, since there is nothing for it to be translated to in the third argument to the function. So those characters that are NOT on the list are simply removed from the string.

finding a comma in string

[23567,0,0,0,0,0] and other value is [452221,0,0,0,0,0] and the value should be contineously displaying about 100 values and then i want to display only the sensor value like in first sample 23567 and in second sample 452221 , only the these values have to display . For that I have written a code
value = str2double(str(2:7));see here my attempt
so I want to find the comma in the output and only display the value before first comma
As proposed in a comment by excaza, MATLAB has dedicated functions, such as sscanf for such purposes.
sscanf(str,'[%d')
which matches but ignores the first [, and returns the next (i.e. the first) number as a double variable, and not as a string.
Still, I like the idea of using regular expressions to match the numbers. Instead of matching all zeros and commas, and replacing them by '' as proposed by Sardar_Usama, I would suggest directly matching the numbers using regexp.
You can return all numbers in str (still as string!) with
nums = regexp(str,'\d*','match')
and convert the first number to a double variable with
str2double(nums{1})
To match only the first number in str, we can use the regexp
nums = regexp(str,'[(\d*),','tokens')
which finds a [, then takes an arbitrary number of decimals (0-9), and stops when it finds a ,. By enclosing the \d* in brackets, only the parts in brackets are returned, i.e. only the numbers without [ and ,.
Final Note: if you continue working with strings, you could/should consider the regexp solution. If you convert it to a double anyways, using sscanf is probably faster and easier.
You can use regexprep as follows:
str='[23567,0,0,0,0,0]' ;
required=regexprep(str(2:end-1),',0','')
%Taking str(2:end-1) to exclude brackets, and then removing all ,0
If there can be values other than 0 after , , you can use the following more general approach instead:
required=regexprep(str(2:end-1),',[-+]?\d*\.?\d*','')