Retrieving Row column values as their Scala type and not Column

What I'm trying to achieve is inferring values for certain DataFrame columns based on the values of each individual row. My (non-working) attempt looks like this:
.withColumn("date", when(col("date").isNull, lit(new DateTime(col("timestamp").as[Long]).getYear)))
The problem is that I can't wrap my head around how to retrieve, for each Row object, its value for a given column. I've seen other solutions, but they either list the whole set of values for all of the rows, or just take the first of them, which isn't what I'm trying to achieve.
Imagine an example DF like this...
(year, val1, val2, val3, timestamp)
(null, 10, 12, null, 123456789)
(null, 11, 12, null, 234567897)
And what I want to see after applying individual functions (for example, extracting year from timestamp) to each of the Rows is...
(year, val1, val2, val3, timestamp)
(2018 [using DateTime class], 10, 12, 1012, 123456789)
(2018 [using DateTime class], 11, 12, 1112, 234567897)
Is there any way of doing this?

That's where UDFs come into play:
import org.apache.spark.sql.functions.{col, udf, when}
import org.joda.time.DateTime

// joda's DateTime(Long) constructor expects epoch milliseconds
val udf_extractYear = udf((ts: Long) => new DateTime(ts).getYear)
which you can then use like this:
df
  .withColumn("year", when(col("year").isNull, udf_extractYear(col("timestamp"))).otherwise(col("year")))
  .show()
As you can see, the timestamp column is automatically mapped to Long.
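For completeness, a minimal end-to-end sketch, assuming joda-time is on the classpath and that timestamp holds epoch milliseconds (which is what joda's DateTime(Long) constructor expects; the timestamps in the question look like seconds, so you may need to multiply by 1000 first). The val3 column is omitted for brevity:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.joda.time.DateTime

val spark = SparkSession.builder.appName("year-backfill").getOrCreate()
import spark.implicits._

// Sample data mirroring the question: year is null and must be derived
val df = Seq(
  (None: Option[Int], 10, 12, 123456789L),
  (None: Option[Int], 11, 12, 234567897L)
).toDF("year", "val1", "val2", "timestamp")

val udf_extractYear = udf((ts: Long) => new DateTime(ts).getYear)

df.withColumn("year",
    when(col("year").isNull, udf_extractYear(col("timestamp")))
      .otherwise(col("year")))
  .show()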

PySpark and DataFrame: another way to show the type of one specific column?

I'm new to PySpark and I'm struggling when I select one column and want to show its type.
If I have a DataFrame and want to show the types of all columns, this is what I do:
raw_df.printSchema()
If I want a specific column, I'm doing this, but I'm sure we can do it faster:
new_df = raw_df.select( raw_df.annee)
new_df.printSchema()
Do I have to use select, store my column in a new DataFrame, and use printSchema()?
I tried something like this, but it doesn't work:
raw_df.annee.printchema()
Is there another way?

Do I have to use select, store my column in a new DataFrame, and use printSchema()?
Not necessarily - take a look at this code:
raw_df = spark.createDataFrame([(1, 2)], "id: int, val: int")
print(dict(raw_df.dtypes)["val"])
int
The "val" is of course the column name you want to query.

How do you select the 'maximum' struct from each group

I have a DataFrame that contains an id column and a struct of two values, order_value:
example_input = spark.createDataFrame([(1, (1,2)), (1, (2,1)), (2, (1,2))], ["id", "order_value"])
I would like to keep one record per id: the one with the maximum order_value. Specifically, the maximum of order (the first part of order_value), with ties broken by the maximum of value (the second part of order_value).
How can this be done?
example_input.groupby('id').max() doesn't seem to work as it complains that order_value is not numeric.
my desired output is given by:
example_output = spark.createDataFrame([(1, (2,1)), (2, (1,2))], ["id", "order_value"])
Try the array_max function in Spark.
Example:
from pyspark.sql.functions import array_max, col, collect_list

# group by id, then collect_list to build an array and array_max to take the largest struct
example_input.groupBy("id") \
    .agg(array_max(collect_list(col("order_value"))).alias("order_value")) \
    .show(10, False)
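If you would rather avoid building intermediate arrays, an alternative sketch is a window function ordered by the struct itself; Spark compares structs field by field, so order is compared first and value breaks ties:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window.partitionBy("id").orderBy(col("order_value").desc())

# keep only the highest-ranked struct per id
example_input.withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .drop("rn") \
    .show()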

How to format date in SSRS?

In my SSRS report, the query is generating dates as column names, in the format:
Sales ID20200331 ID20200430 ID20200531
To remove the ID I used the following expression:
=Right(Fields!ID20210331.Value, Len(Fields!ID20210331.Value) - 2)
This gives me 84 instead of removing the ID.
How can I remove the ID and format the date as 2020 Mar, etc.?
Thanks
If your field values are "ID20200430" etc., then in SSRS you can use something like this:
=DateSerial(
MID(Fields!IDDate.Value, 3, 4),
MID(Fields!IDDate.Value, 7, 2),
RIGHT(Fields!IDDate.Value, 2)
)
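To get the "2020 Mar" formatting asked for, you could then wrap that in Format() with a .NET date format string (IDDate is still the placeholder field name from above):
=Format(
    DateSerial(
        MID(Fields!IDDate.Value, 3, 4),
        MID(Fields!IDDate.Value, 7, 2),
        RIGHT(Fields!IDDate.Value, 2)
    ),
    "yyyy MMM"
)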
However, it appears that it's your column names that represent dates; is this correct?
If so, then you would have to UNPIVOT the columns in SQL and then convert the resulting values into a real date format.
Here's some sample data to show how to do this.
DECLARE @t TABLE (Sales varchar(10), ID20200331 int, ID20200430 int, ID20200531 int)
INSERT INTO @t VALUES
('A', 1, 2, 3),
('B', 4, 5, 6),
('C', 7, 8, 9)

SELECT
    Sales, IdDate, SomeNumber
    , MyDate = DATEFROMPARTS(SUBSTRING(IdDate, 3, 4), SUBSTRING(IdDate, 7, 2), SUBSTRING(IdDate, 9, 2))
FROM @t
UNPIVOT (
    SomeNumber FOR IdDate IN ([ID20200331], [ID20200430], [ID20200531])
) unpvt
This gives us the result below, including the MyDate column, which has a proper date type:
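Sales  IdDate      SomeNumber  MyDate
A      ID20200331  1           2020-03-31
A      ID20200430  2           2020-04-30
A      ID20200531  3           2020-05-31
B      ID20200331  4           2020-03-31
B      ID20200430  5           2020-04-30
B      ID20200531  6           2020-05-31
C      ID20200331  7           2020-03-31
C      ID20200430  8           2020-04-30
C      ID20200531  9           2020-05-31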
You could then use this in a matrix control in SSRS to get the data back into a pivoted view.

Automatically cast value type to column type

TL;DR: In short, what I need is to somehow cast text to unknown, from which Postgres will magically cast it to the correct type; or some alternative solution to this, keeping in mind the things I want to avoid.
The error in question:
ERROR: column "id" is of type integer but expression is of type text
Say I've got this table:
CREATE TEMP TABLE unknown_test (
    id int,
    some_timestamp timestamp,
    value1 int,
    value2 int,
    value3 text);
Currently I'm doing DML on that table with queries like this:
INSERT INTO unknown_test (id, some_timestamp, value1, value2, value3)
VALUES ('5', '2018-01-10 14:11:03.763396', '3', '15', 'test2');
So the values are of unknown type, and Postgres has some kind of built-in cast for that (it is not listed in select * from pg_cast where castsource = 'unknown'::regtype;). This works, but is somewhat slow.
What I want to do is this (obviously I have an actual table, not VALUES()):
INSERT INTO unknown_test (id, some_timestamp, value1, value2, value3)
SELECT json_data->>'id', json_data->>'some_timestamp', json_data->>'value1', json_data->>'value2', json_data->>'value3'
FROM (VALUES (jsonb_build_object('id', 1, 'some_timestamp', now(), 'value1', 21, 'value2', 5, 'value3', 'test')),
(jsonb_build_object('id', 2, 'some_timestamp', now(), 'value1', 22, 'value2', 15, 'value3', 'test2')),
(jsonb_build_object('id', 3, 'some_timestamp', now(), 'value1', 32, 'value2', 25, 'value3', 'test5')),
(jsonb_build_object('id', 4, 'some_timestamp', now(), 'value1', 42, 'value2', 55, 'value3', 'test7'))
) AS j(json_data);
Sadly, those will come out as text, and Postgres will complain that I need to cast them explicitly. I can't do that, because I don't know what the types are. I could find out, of course, by checking pg_catalog, or by storing type info alongside the JSON data; but both of those require additional computation and/or storage, and I want to avoid any unnecessary overhead (my pg_catalog is really fat).
The second thing I want to avoid is a CREATE CAST for the text type, unless someone can assure me it won't break anything.
A loop with dynamic SQL to get the unknown type is my current approach, and I need something faster; my idea was to not use a loop, but a table instead.
You can use jsonb_populate_record for this:
SELECT (jsonb_populate_record(null::unknown_test, json_data)).*
FROM ...
This will create a record of the same type as the table unknown_test and then the whole record is expanded into individual columns using the (...).* syntax.
This requires that the (first level) keys in the JSON document have exactly the same names as the columns in the table.
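Applied to the INSERT from the question, a minimal sketch would look like this (same jsonb_build_object values as above, shortened to two rows):
INSERT INTO unknown_test (id, some_timestamp, value1, value2, value3)
SELECT (jsonb_populate_record(null::unknown_test, json_data)).*
FROM (VALUES
    (jsonb_build_object('id', 1, 'some_timestamp', now(), 'value1', 21, 'value2', 5, 'value3', 'test')),
    (jsonb_build_object('id', 2, 'some_timestamp', now(), 'value1', 22, 'value2', 15, 'value3', 'test2'))
) AS j(json_data);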

How to simply transpose two columns into a single row in postgres?

Following is the output of my query:
key ;value
"2BxtRdkRvwc-2hPjF8LBmHD-finapril" ;4
"3QXORSfsIY0-2sDizCyvY6m-finapril" ;12
"4QXORSfsIY0-2sDizCyvY6m-curr" ;12
"5QXORSfsIY0-29Xcom4SHVh-finapril" ;12
What I want is simply to bring the rows into columns, so that only one row remains, with the keys as the column names.
I have seen examples with crosstab catering to much more complex use cases, but I want to know if there is a simpler way to achieve this in my particular case.
Any help is appreciated.
Thanks
Postgres version: 9.5.10
It is impossible to execute a query that results in an unknown number of columns with unknown names. The simplest way to get a similar effect is to generate a JSON object, which can easily be interpreted by a client app as a pivot table. Example:
with the_data(key, value) as (
    values
        ('2BxtRdkRvwc-2hPjF8LBmHD-finapril', 4),
        ('3QXORSfsIY0-2sDizCyvY6m-finapril', 12),
        ('4QXORSfsIY0-2sDizCyvY6m-curr', 12),
        ('5QXORSfsIY0-29Xcom4SHVh-finapril', 12)
)
select jsonb_object_agg(key, value)
from the_data;
The query returns this json object:
{
    "4QXORSfsIY0-2sDizCyvY6m-curr": 12,
    "2BxtRdkRvwc-2hPjF8LBmHD-finapril": 4,
    "3QXORSfsIY0-2sDizCyvY6m-finapril": 12,
    "5QXORSfsIY0-29Xcom4SHVh-finapril": 12
}
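And if the keys are in fact known in advance, you can skip JSON entirely; a rough sketch using conditional aggregation (FILTER is available since 9.4, so it works on 9.5.10; the aliases here are just illustrative):
-- one row, one column per known key
with the_data(key, value) as (
    values
        ('2BxtRdkRvwc-2hPjF8LBmHD-finapril', 4),
        ('3QXORSfsIY0-2sDizCyvY6m-finapril', 12)
)
select
    max(value) filter (where key = '2BxtRdkRvwc-2hPjF8LBmHD-finapril') as key1,
    max(value) filter (where key = '3QXORSfsIY0-2sDizCyvY6m-finapril') as key2
from the_data;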