How to translate column values from French to English using PySpark

I have a DataFrame with one column whose values are all in French. I would like to know how to translate all of those values to English at once using PySpark on the Palantir platform.

Related

Can I use to_char() and make_date() in postgreSQL table definition?

I'm working on a poc to migrate an on-prem SQL Server database to Amazon Aurora for PostgreSQL. Amazon's Schema Conversion Tool struggled to translate the SQL Server code for the creation of a table on this column:
[DOB] AS (CONVERT([varchar],datefromparts([DOB_year],[DOB_month],[DOB_day]),(120))) PERSISTED,
as the CONVERT function is unsupported in Postgres.
The best translation I can come up with is:
dob varchar(30) GENERATED ALWAYS AS (to_char((make_date(dob_year, dob_month, dob_day))::timestamp, 'YYYY-MM-DD HH24:MI:SS')) STORED,
but neither the SCT nor pgAdmin4 are recognising to_char() and make_date() as functions. 'dob_day', 'dob_month' and 'dob_year' are all column names with datatype of integer. I'm new to all this but another column definition is using other functions, e.g. replace() and right(), successfully, so I'm confused why this isn't working.
When I tried to run the code in pgAdmin I got this error:
ERROR: generation expression is not immutable
SQL state: 42P17
Thanks
to_char() is not marked as immutable, even though in your case it would be. There are format masks that are not immutable, e.g. when time zones or different locales are involved.
If you really want to (or are forced to) convert day, month and year into a formatted string (rather than a proper date, which would be the correct thing to do), then you can only achieve this with a custom function:
create function create_string_date(p_year int, p_month int, p_day int)
returns text
as
$$
select to_char(make_date(p_year, p_month, p_day), 'yyyy-mm-dd hh24:mi:ss');
$$
language sql
immutable;
Marking the function as immutable isn't cheating, because we know that with the given input and format string this is indeed immutable.
dob text generated always as (create_string_date(dob_year, dob_month, dob_day)) stored

Using DDL, how can the default value for a Numeric(8,0) field be set to today's date as YYYYMMDD?

I'm trying to convert this, which works:
create_timestamp for column
CREATETS TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
to something that works like this, but this code is not working:
date_created for column
DTCREATE NUMERIC(8,0) NOT NULL DEFAULT VARCHAR_FORMAT(CURRENT_TIMESTAMP, 'YYYYMMDD'),
Can anyone advise DDL to accomplish what I'm going for? Thank you.
When asking for help with Db2, always specify your Db2-server platform (Z/OS , i-series, linux/unix/windows) and Db2-server version, because the answer can depend on these facts.
The default-clause for a column does not have syntax that you expect, and that is the reason you get a syntax error.
It can be a mistake to store a date as a numeric, because it causes no end of hassle for programmers, reporting tools, and data exchange. It's usually a mistake based on false assumptions.
If you want to store a date (not a timestamp) then use the column datatype DATE which lets you use:
DTCREATE DATE NOT NULL DEFAULT CURRENT DATE
How you, or future programmers, choose to render the value of a date in the SQL output is a different matter.
You may use a BEFORE INSERT trigger to emulate a DEFAULT clause with such an unsupported function instead.
CREATE TRIGGER MYTAB_BIR
BEFORE INSERT ON MYTAB
REFERENCING NEW AS N
FOR EACH ROW
WHEN (N.DATE_CREATED IS NULL)
SET DATE_CREATED = VARCHAR_FORMAT(CURRENT_TIMESTAMP, 'YYYYMMDD');

Postgres truncates trailing zeros for timestamps

Postgres (V11.3, 64bit, Windows) truncates trailing zeros for timestamps. So if I insert the timestamp '2019-06-12 12:37:07.880' into the table and I read it back as text postgres returns '2019-06-12 12:37:07.88'.
Table date_test:
CREATE TABLE public.date_test (
id SERIAL,
"timestamp" TIMESTAMP WITHOUT TIME ZONE NOT NULL,
CONSTRAINT pkey_date_test PRIMARY KEY(id)
)
SQL command when inserting data:
INSERT INTO date_test (timestamp) VALUES( '2019-06-12 12:37:07.880' )
SQL command to retrieve data:
SELECT dt.timestamp ::TEXT FROM date_test dt
returns '2019-06-12 12:37:07.88'
Do you consider this a bug or a feature?
My real issue is: I'm running queries from a C++ program and I have to convert the data returned from the database to appropriate data types. Since the protocol is text-based, everything I read from the database is plain text. When parsing timestamps I first tokenize the string and then convert each token to an integer. Because the millisecond part is truncated, the last token is "88" instead of "880", and converting "88" yields a different value than converting "880" to an integer.
That's the default display format when using a cast to text.
If you want to see all three digits, use to_char()
SELECT to_char(dt.timestamp,'yyyy-mm-dd hh24:mi:ss.ms')
FROM date_test dt;
will return 2019-06-12 12:37:07.880
It’s a matter of presentation only.
First note that 07.88 seconds and 07.880 seconds is the same amount of time (also 7.88 and 07.880000000 for that matter).
PostgreSQL internally represents a timestamp in a way that we shouldn’t be concerned about as long as it’s an unambiguous representation. When you retrieve the timestamp, it is formatted into some string. This is where PostgreSQL apparently chooses not to print redundant trailing zeros. So it’s probably not even correct to say that it truncates anything. It just refrains from generating that 0.
I think that the nice solution would be to modify your parser in C++ to accept any number of decimals and parse them correctly with and without trailing zeroes. Another solution that should work is given in the answer by a_horse_with_no_name.
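As a rough sketch of the "accept any number of decimals" idea: the trick is to treat the digits after the decimal point as a fraction and pad them on the right, rather than converting the token to an integer directly. The snippet below is in Python for brevity, but the same logic should port to C++ without much trouble. The function names are made up for illustration, and it assumes the timestamp text has at most one '.' followed only by digits.

```python
def fraction_to_microseconds(fraction: str) -> int:
    """Convert the digits after the decimal point to microseconds.

    "88" and "880" both mean 0.88 s, so both must give 880000.
    """
    if not (fraction.isascii() and fraction.isdigit()):
        raise ValueError(f"not a fractional-seconds token: {fraction!r}")
    # Keep at most 6 digits (microsecond precision) and pad on the right,
    # so that "88" becomes "880000" instead of being read as 88.
    return int(fraction[:6].ljust(6, "0"))


def split_timestamp(text: str) -> tuple[str, int]:
    """Split 'YYYY-MM-DD HH:MM:SS[.fff]' into the main part and microseconds."""
    main, _, fraction = text.partition(".")
    micros = fraction_to_microseconds(fraction) if fraction else 0
    return main, micros


if __name__ == "__main__":
    for sample in ("2019-06-12 12:37:07.88", "2019-06-12 12:37:07.880"):
        print(sample, "->", split_timestamp(sample))
```

With this approach the parser gives the same result whether or not the server prints the trailing zero, and it also copes with a timestamp that has no fractional part at all.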

multi byte character issue in Redshift

I am unable to convert multibyte characters in Redshift.
create table temp2 (city varchar);
insert into temp2 values('г. красноярск'); -- lower value
insert into temp2 values('Г. Красноярск'); -- upper value
select * from temp2 where city ilike 'Г. Красноярск'
city
-------------
Г. Красноярск
I tried like below, UTF-8 characters are converting into lower.
select lower('Г. Красноярск')
lower
-------------
г. красноярск
In vertica it is working fine with using lowerb() function.
Internally the LIKE and ILIKE operators use PostgreSQL's regular expression support.
Support for proper handling of UTF-8 multibyte characters in regular expressions was added in PostgreSQL 9.2. Redshift is based on PostgreSQL 8.0.2, and it looks like that support has not been backported into their forked version.
See Postgresql regex to match uppercase, Unicode-aware
You can work around this, with limitations, by using LIKE lower('Г. Красноярск') instead. An expression index may be useful.

Casting character varying field into a date

I have two tables,
details
id integer primary key
onsetdate Date
questionnaires
id integer primary key
patient_id integer foreign key
questdate Character Varying
Is it possible to write a SELECT statement that performs a JOIN on these two tables, ordering by the earliest date taken from a comparison of onsetdate and questdate? (Is it possible, for example, to cast questdate to a Date to do this?)
Typical format for questdate is "2009-04-22"
The actual tables have an encrypted BYTEA field for the onsetdate, but I'll leave that part until later (the application is written in RoR using 'ezcrypto' to encrypt the BYTEA field).
something like
SELECT...
FROM details d
JOIN questionnaires q ON d.id=q.id
ORDER BY LEAST (decrypt_me(onsetdate), questdate::DATE)
maybe? I'm not sure about the meaning of 'id', or whether you want to join on it or on something else.
By the way, you can leave out the explicit cast, it's in ISO format after all.
I guess you know what to use in place of decrypt_me()
There is a date parsing function in postgres: http://www.postgresql.org/docs/9.0/interactive/functions-formatting.html
Look for the to_timestamp function.
PostgreSQL supports the standard SQL CAST() function. (And a couple others, too.) So you can use
CAST (questdate AS DATE)
as long as all the values in the column 'questdate' evaluate to a valid date. If this database has been in production for a while, though, that's pretty unlikely. Not impossible, but pretty unlikely.
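If you want to check beforehand whether every value will survive the cast, one option is to pull the column out and test it on the application side. The sketch below is only an illustration: find_invalid_dates is a made-up helper, it assumes the values have already been fetched as strings, and Python's date parsing is not identical to PostgreSQL's (PostgreSQL accepts several formats besides "2009-04-22"), so treat the result as a list of candidates to inspect rather than a definitive answer.

```python
from datetime import date


def find_invalid_dates(values):
    """Return the values that cannot be parsed as ISO dates (YYYY-MM-DD)."""
    invalid = []
    for value in values:
        try:
            date.fromisoformat(value)
        except (TypeError, ValueError):
            # TypeError covers None (SQL NULL); ValueError covers
            # malformed text and impossible dates such as 2009-02-30.
            invalid.append(value)
    return invalid


if __name__ == "__main__":
    sample = ["2009-04-22", "2009-02-30", "n/a", ""]
    print(find_invalid_dates(sample))
```

Any rows flagged this way can be cleaned up or excluded before relying on CAST (questdate AS DATE) in the ORDER BY.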