Concatenating new line character char(13) in pySpark - pyspark

We are getting an error while adding a newline character CHAR(13) in the PySpark concat function; below is sample code:
spark.sql("select CONCAT('Vinay',CHAR(13),'AGARWAL') from tempTable")
Is CHAR(13) not supported in the CONCAT function of PySpark?

I found the issue and the solution. It does not accept CHAR(13); instead it takes \n to add the newline character. Below is the solution:
spark.sql("select CONCAT('Vinay\n','AGARWAL') from tempTable")

Related

How can I use REGEXP_REPLACE in PySpark SQL to remove \n and \r from a column

I'm trying to read data from ScyllaDB and want to remove the \n and \r characters from a column. The problem is that these characters are stored as text in the column of the table being read, and I need to use REGEXP_REPLACE since I'm using Spark SQL for this. The regex pattern that works in MySQL doesn't seem to work here: the string becomes blank but the characters aren't removed. Below is the snippet of the query being used in Spark SQL. Help appreciated.
The following string is present in the message column: 'hello\nworld\r'
The expected output is 'hello world'
df=spark.sql("select REGEXP_REPLACE(message,'\n|\r|\r\n',' ') as replaced_message from delivery_sms")
Thanks Andrew for the answer.
The following worked for me:
df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show()
If anybody is looking for a modern solution to this problem:
from pyspark.sql import functions as F

df.select(
    F.translate(F.col("test"), "\n\t", " " * len("\n\t")).alias("test")
)
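To see why the extra escaping matters, here is a small sketch (assuming only a running SparkSession): one row stores the literal two-character text \n / \r, the other stores real control characters, and each needs a different pattern.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(r"hello\nworld\r",), ("hello\nworld\r",)], ["message"]
)

df.select(
    # matches the two-character sequences \n and \r stored as text
    F.regexp_replace("message", r"\\n|\\r", " ").alias("literal_cleaned"),
    # matches actual newline / carriage-return characters
    F.regexp_replace("message", "[\n\r]", " ").alias("control_cleaned"),
).show(truncate=False)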

Strip out the characters which are not numeric, dashes or pipes

I am trying to find a solution but somehow I am getting the wrong output (I referred to some online solutions and am confusing myself). Please advise where I am going wrong.
I need to strip out any character that is not numeric, a dash "-" or a pipe "|", using PL/SQL.
As an example:
If I need to filter the string 0094-78556232_imk*.ext|4444, the output should be 0094-78556232|4444.
Use REGEXP_REPLACE:
SELECT
col,
REGEXP_REPLACE (col, '[^0-9|-]', '') AS col_updated
FROM yourTable;
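If the cleanup ends up being done in PySpark rather than PL/SQL, the same whitelist regex carries over; a minimal sketch, assuming a DataFrame df with a string column named col:

from pyspark.sql import functions as F

# keep only digits, pipes and dashes; everything else is stripped
cleaned = df.withColumn("col_updated", F.regexp_replace("col", "[^0-9|-]", ""))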
Don't use regexp_replace, especially if performance is important.
Instead use the standard string function TRANSLATE. Like so:
select col,
translate(col, '0123456789|-' || col, '0123456789|-') as col_updated
from yourTable;
This translates each character in the col value according to the following scheme: 0 is translated to itself, ..., - is translated to itself. Any other character in col, one that is not already in this list, is "translated" to nothing, since there is nothing for it to be translated to in the third argument of the function. So the characters that are NOT in the list are simply removed from the string.
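The keep-only-listed-characters mechanism is easy to mimic outside the database; here is a small Python illustration of the same idea using str.translate (just a sketch, not Oracle code):

def keep_chars(s, keep="0123456789|-"):
    # characters mapped to None are deleted, mirroring how TRANSLATE drops
    # characters that have no counterpart in its third argument
    table = {ord(c): None for c in set(s) if c not in keep}
    return s.translate(table)

print(keep_chars("0094-78556232_imk*.ext|4444"))  # 0094-78556232|4444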

pyspark replace regex with regex

I have a Spark dataframe that contains a string column. I want to replace a regex (a space followed by a number) with a comma, without losing the number. I have tried both of these with no luck:
df.select("A", f.regexp_replace(f.col("A"), "\s+[0-9]", ' ,
').alias("replaced"))
df.select("A", f.regexp_replace(f.col("A"), "\s+[0-9]", '\s+[0-9] ,
').alias("replaced"))
Any help is appreciated.
What you need is another function, regexp_extract.
So you have to divide the regex and get the part you need. It could be something like this:
df.select("A", f.regexp_extract(f.col("A"), "(\s+)([0-9])", 2).alias("replaced"))

How to escape column names with hyphen in Spark SQL

I have imported a json file in Spark and converted it into a table as
myDF.registerTempTable("myDF")
I then want to run SQL queries on this resulting table
val newTable = sqlContext.sql("select column-1 from myDF")
However this gives me an error because of the hyphen in the name of the column column-1. How do I resolve this in Spark SQL?
Backticks (`) appear to work, so
val newTable = sqlContext.sql("select `column-1` from myDF")
should do the trick, at least in Spark v1.3.x.
I was at it for a bit yesterday; it turns out there is a way to escape the (:) and the (.), like so:
Only the field containing (:) needs to be escaped with backticks
sqlc.select("select `sn2:AnyAddRq`.AnyInfo.noInfo.someRef.myInfo.someData.Name AS sn2_AnyAddRq_AnyInfo_noInfo_someRef_myInfo_someData_Name from masterTable").show()
When you are referencing a json structure with struct.struct.field and there is a namespace present, like
ns2:struct.struct.field, the backticks (`) do not work.
jsonDF = sqlc.read.load('jsonMsgs', format="json")
jsonDF.registerTempTable("masterTable")
sqlc.select("select `sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData.Name` AS sn2_AnyAddRq_AnyInfo_noInfo_someRef_myInfo_someData_Name from masterTable").show()
pyspark.sql.utils.AnalysisException: u"cannot resolve 'sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData.Name'
If I remove the sn2: fields, the query executes.
I have also tried with single quotes ('), backslashes (\) and double quotes ("").
The only way it works is if I register another temp table on the sn2: structure; then I am able to access the fields within it, like so:
anotherDF = jsonDF.select("sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData")
anotherDF.registerTempTable("anotherDF")
sqlc.select("select Name from anotherDF").show()
This is what I do, which also works in Spark 3.x.
I define the function litCols() at the top of my program (or in some global scope):
litCols = lambda seq: ','.join(('`'+x+'`' for x in seq)) # Accepts any sequence of strings.
And then apply it as necessary to prepare my literalized SELECT columns. Here's an example:
>>> UNPROTECTED_COLS = ["RegionName", "StateName", "2012-01", "2012-02"]
>>> LITERALIZED_COLS = litCols(UNPROTECTED_COLS)
>>> print(LITERALIZED_COLS)
`RegionName`,`StateName`,`2012-01`,`2012-02`
The problematic column names in this example are the YYYY-MM columns, which Spark would otherwise resolve as arithmetic expressions, resulting in 2011 and 2010, respectively.
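For completeness, here is a sketch of feeding the literalized list back into a query (the table name housing_prices is just a placeholder):

UNPROTECTED_COLS = ["RegionName", "StateName", "2012-01", "2012-02"]

# backticks keep Spark from parsing 2012-01 as the subtraction 2012 - 1
query = "select {} from housing_prices".format(litCols(UNPROTECTED_COLS))
spark.sql(query).show()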

T-SQL, Remove space in string

I have two strings in SQL and the REPLACE function only works on one of them, why is that?
Example 1:
SELECT REPLACE('18 286.74', ' ', '')
Example 2:
SELECT REPLACE('z z', ' ', '')
Example 1's output is still "18 286.74" whereas Example 2's output is "zz". Why does SQL not react the same way to both strings?
UPDATE:
When running select replace('123 123.12', ' ', '') it works fine, but still not with '18 286.74'.
Test it the following way.
select unicode(substring('18 286.74', 3, 1))
If the code returns 32 then it's a space; if not, it's a different Unicode character and your REPLACE with ' ' won't work.
Maybe a cast is needed.
UPD: or not (it works fine on SQL Server 2005 too).
Are you sure it is a space? i.e. the same whitespace character that you are passing as the second argument? The code you've posted works fine for me on SQL Server 2008.
Re working on your friend's PC - perhaps the whitespace got normalized when you sent it to him?
You are probably dealing with a non-breaking space.
I could reproduce it by typing ALT+0160 into the number in SELECT REPLACE('18 286.74', ' ', '').
Could you please issue the following:
SELECT CAST('18 286.74' AS BINARY), REPLACE('18 286.74', ' ', '')
by copying the '18 286.74' from REPLACE into CAST?
I was having the same issue and found that it was a CHAR(10) (line feed). When copied out of Management Studio it became a CHAR(32), but in the record it was a CHAR(10). Try:
Select Replace(@string, char(10), '')
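If the value can be pulled into Python, a quick way to see exactly which whitespace codepoint is hiding in it is to print the ordinal of every character; a sketch (the sample string is made up):

s = "18\u00a0286.74"  # hypothetical value copied out of the database

# 32 is a normal space, 160 a non-breaking space, 10/13 line feed / carriage return
for ch in s:
    print(repr(ch), ord(ch))

# normalize the odd whitespace first, then strip spaces as usual
cleaned = s.replace("\u00a0", " ").replace(chr(10), "").replace(chr(13), "").replace(" ", "")
print(cleaned)  # 18286.74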