Renaming columns with ' in pyspark - pyspark

How to rename column "RANDY'S" to 'RANDYS' in pyspark?
I tried the code below, but it's not working:
test_rename_df=df.withColumnRenamed('"RANDY''S"','RANDYS')
Note that the original column name has double quotes around it.

You're adding too many quotes around the original column name. Try this:
test_rename_df = df.withColumnRenamed("RANDY\'S", "RANDYS")
Side-note
When you call df.columns, the column RANDY'S is surrounded by double quotes instead of single quotes to avoid confusion.
If your column had the name RANDY"S, df.columns would instead use single quotes around the column name.
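The quoting behaviour described above can be checked in plain Python, without Spark. This is a minimal sketch: it shows that the backslash in "RANDY\'S" is optional inside a double-quoted literal, and that the quotes displayed around a name come from Python's repr, not from the column name itself. The list below is a hypothetical stand-in for df.columns, and the "ID" entry is made up for illustration.

```python
# Inside a double-quoted literal an apostrophe needs no escape,
# so both spellings produce the same 7-character string.
escaped = "RANDY\'S"
plain = "RANDY'S"
print(escaped == plain)

# Hypothetical stand-in for df.columns ("ID" is invented for illustration).
columns = ["RANDY'S", 'RANDY"S', "ID"]

# repr picks the quote style that avoids escaping; the quotes are display only.
shown = [repr(name) for name in columns]
for text in shown:
    print(text)
```

Because the surrounding quotes are only part of the display, the name passed to withColumnRenamed should contain the apostrophe and nothing else.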

Related

SalesForce Spark Delimiter issue

I have a Glue job in which I am reading a table from SF using SOQL:
df = (
    spark.read.format("com.springml.spark.salesforce")
    .option("soql", sql)
    .option("queryAll", "true")
    .option("sfObject", sf_table)
    .option("bulk", bulk)
    .option("pkChunking", pkChunking)
    .option("version", "51.0")
    .option("timeout", "99999999")
    .option("username", login)
    .option("password", password)
    .load()
)
and whenever there is a combination of double-quotes and commas in the string it messes up my table schema, like so:
in source:
Column A | Column B | Column C
000AB | "text with, comma" | 123XX
read from SF in df:
Column A | Column B | Column C
000AB | "text with | comma"
Is there any option to prevent the comma from being treated as a delimiter in such cases? I tried various options, but nothing worked. SOQL doesn't accept REPLACE or SUBSTRING functions; it has basically no text manipulation functions.
All the information I'm giving needs to be tested. I do not have the same environment, so it is difficult for me to try anything, but here is what I found.
When you check the official doc, you find that there is a field metadataConfig. The documentation of this field can be found here: https://resources.docs.salesforce.com/sfdc/pdf/bi_dev_guide_ext_data_format.pdf
On page 2, under the CSV format, it says:
If a field value contains a control character or a new line the field value must be contained within double quotes (or your
fieldsEscapedBy value). The default control characters (fieldsDelimitedBy, fieldsEnclosedBy,
fieldsEscapedBy, or linesTerminatedBy) are comma and double quote. For example, "Director of
Operations, Western Region".
which sounds a lot like your current problem.
By default, the values are comma and double quote, so I do not understand why it is failing. But your output apparently keeps the double quotes, so maybe it only recognizes the single quote as the enclosing character.
You should try to enforce the format by adding this to your code:
.option("metadataConfig", r'{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}')
# Raw string, so the backslash survives and the JSON stays valid.
# Or something similar - I couldn't test this, so you will need to try it yourself.
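Hand-escaping the quote inside that JSON string is easy to get wrong, because Python consumes the backslash in a normal (non-raw) literal. A minimal sketch of building the value with json.dumps instead is shown here. The two keys are the ones suggested above; whether the connector honours them when reading is still an untested assumption.

```python
import json

# Build the JSON with json.dumps instead of hand-escaping quotes.
# The two keys come from the suggestion above; their effect on the
# connector when reading is an untested assumption.
metadata_config = json.dumps({"fieldsEnclosedBy": '"', "fieldsDelimitedBy": ","})
print(metadata_config)

# In a plain (non-raw) literal Python drops the backslash,
# which leaves three quotes in a row and invalid JSON.
hand_written = '{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}'
try:
    json.loads(hand_written)
    hand_written_ok = True
except json.JSONDecodeError:
    hand_written_ok = False
print(hand_written_ok)
```

The resulting metadata_config string could then be passed as .option("metadataConfig", metadata_config).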

PySpark: Reading CSV files with fields containing double quotes and commas

I have a CSV file which I am reading through PySpark and loading into PostgreSQL. One of its fields has strings that contain commas and double quotes. For example:
1. "RACER ""K"", P.L. 9"
2. "JENIS, B. S. ""N"" JENIS, F. T. ""B"" 5"
PySpark parses them as shown below. This causes problems because the values/columns get mixed up when I load the data into PostgreSQL, and the script fails.
1. '\"RACER \"\"K\"\"'
2. '\"JENIS, B. S. \"\"N\"\" JENIS'
I am using Spark 2.42. How can this situation be handled in PySpark?
Basically, I want the program to ignore a comma or double quote if it appears inside double quotes.
You can try removing the commas and double quotes using pandas before reading and loading into PostgreSQL.
You can use str.replace (pass regex=True so the pattern is treated as a regular expression):
df['column_name'] = df['column_name'].str.replace(r"[\"\',]", '', regex=True)
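If the goal is to keep the commas and quotes rather than strip them, note that the sample rows follow the usual CSV convention in which a quote inside a quoted field is doubled. A minimal sketch with Python's standard csv module shows how the two rows parse under that convention. Spark's CSV reader has quote and escape options, and setting both to a double quote is a commonly suggested way to get the same behaviour, but that is not tested here.

```python
import csv
import io

# The two sample rows from the question, each a single quoted field.
raw = (
    '"RACER ""K"", P.L. 9"\n'
    '"JENIS, B. S. ""N"" JENIS, F. T. ""B"" 5"\n'
)

# doublequote=True (the default) reads "" inside a quoted field as one literal ".
rows = list(csv.reader(io.StringIO(raw), quotechar='"', doublequote=True))

for row in rows:
    print(row)
```

Each row comes back as a single field with its commas and inner quotes intact, which is the result the question asks for.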

Error Code: 1582. Incorrect parameter count in the call to native function 'SUBSTRING_INDEX'

When I tried to copy the text from before the first comma in the first column to the second column using the following command:
UPDATE Table_Name
SET second_column = SUBSTRING_INDEX(first_column, ‘,’, 1);
I got the error message:
Error Code: 1582. Incorrect parameter count in the call to native function 'SUBSTRING_INDEX'
What is going wrong?
Thanks in advance.
The error came from copying and pasting the command from a non-programming text editor (e.g. Word):
The quotes around the delimiter were changed from two plain single quotes (') to left and right typographic single quotes (‘ and ’).
Replacing them with plain single quotes solved the problem:
UPDATE Table_Name
SET second_column = SUBSTRING_INDEX(first_column, ',', 1);
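When a statement has passed through a word processor, the typographic quotes can also be replaced programmatically before it is run. The following is only a sketch: normalize_quotes is a hypothetical helper, not part of any library, and it would also alter typographic quotes that legitimately belong inside string data.

```python
# Map typographic quotes to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
})


def normalize_quotes(sql: str) -> str:
    """Return sql with typographic quotes replaced (hypothetical helper)."""
    return sql.translate(SMART_QUOTES)


# The SET line from the question, with the pasted typographic quotes.
pasted = "SET second_column = SUBSTRING_INDEX(first_column, \u2018,\u2019, 1);"
fixed = normalize_quotes(pasted)
print(fixed)
```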

converting 1x1 matrix to a variable

I read data from a CSV file that contains two columns: id, which is text/string, and cancer, which is 1/0. Please see the code below:
M = readtable('data.csv');
I try to access the very first value using
row = M(n,1); % It's from the ID column, which is text
But it comes back as a 1x1 matrix, and I am unable to put it in a single variable.
For example, after the line above runs, I want row to contain a string, like row = 'patientID'. Is there any way to convert it into a single value?
Use row = M{n,1}. Note the curly braces.
The curly braces say "get the contents of the table", as opposed to the parentheses you had been using, which say "get me a portion of the table, as a table".

Using camelCased columns in a postgresql where clause

I have a table with camelCased column names (which I now deeply regret). If I use double quotation marks around the column names as part of the SELECT clause, they work fine, e.g. SELECT "myCamelCasedColumn" FROM the_table;. If, however, I try doing the same in the WHERE clause, then I get an error.
For example, SELECT * FROM the_table WHERE "myCamelCasedColumn" = "hello"; gives me the error column "hello" does not exist.
How can I get around this? If I don't surround the column in double quotation marks then it will just complain that column mycamelcasedcolumn does not exist.
In SQL, string literals are enclosed in single quotes, not double quotes.
SELECT *
FROM the_table
WHERE "myCamelCasedColumn" = 'hello';
See the manual for details:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
The manual also explains why "myCamelCasedColumn" is something different in SQL than myCamelCasedColumn.
In general you should stay away from quoted identifiers. They are much more trouble than they are worth. If you never use double quotes, everything is a lot easier.
The problem is that you use double quotes for the string literal "hello". It should be 'hello'. Double quotes are reserved for identifiers.