How to get the values in DataFrame with the correct DataType? - scala

When I tried to get some values in a DataFrame, like:
df.select("date").head().get(0) // type: Any
The result type is Any, which is not what I expected.
Since a DataFrame contains the schema of the data, it knows the DataType of each column, so when I try to get a value using get(0), I expect it to return the value with the correct type. However, it does not.
Instead, I have to specify which DataType I want using getDate(0), which seems weird, inconvenient, and makes me mad.
Since I already specified the schema with the correct DataTypes for each column when I created the DataFrame, I don't want to use a different getXXX() for each column.
Are there some convenient ways that I can get the values with their own correct types? That is to say, how can I get the values with the correct DataType specified in the schema?
Thank you!

Scala is a statically typed language, so the get method defined on Row can only have a single return type, and that type is Any. It cannot return an Int for one call and a String for another.
You should call getInt, getDate and the other type-specific getters provided for each type, or the generic getAs method, to which you pass the type as a parameter (for example row.getAs[Int](0)).
As mentioned in the comments, other options are:
- use a Dataset instead of a DataFrame
- use Spark SQL

You can call the generic getAs method as getAs[Int](columnIndex), getAs[String](columnIndex) or use specific methods like getInt(columnIndex), getString(columnIndex).
Link to the Scaladoc for org.apache.spark.sql.Row.
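For illustration, here is a minimal sketch of the three access styles, written for spark-shell (the sample data, column names and the Record case class are assumptions, not part of the question):
import java.sql.Date
// in spark-shell, spark.implicits._ is already in scope
val df = Seq(("2021-01-01", 1)).toDF("date", "n")
  .select($"date".cast("date"), $"n")
// Row-based access: the caller must supply the type at the call site.
val row = df.head()
val d1: Date = row.getDate(0)          // type-specific getter
val d2: Date = row.getAs[Date]("date") // generic getter with a type parameter
// Dataset-based access: the schema is captured in the element type,
// so no per-column getter is needed.
case class Record(date: Date, n: Int)
val ds = df.as[Record]
val d3: Date = ds.head().date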

Related

How to understand the return type?

I'm building a framework for rust-postgres.
I need to know what value type will be returned from row.try_get, so I can put the value in a variable of the appropriate type.
I can get the SQL type from row.columns()[index].type, but not whether the value is nullable, so I can't decide whether to put the value in a plain type or an Option<T>.
I can only use the content of the row itself to figure this out; I can't do things like "get the table structure from PostgreSQL".
Is there a way?
The reason that the Column type does not expose any way to find out if a result column is nullable is that the database does not return this information.
Remember that result columns are derived from running a query, and that query may contain arbitrary expressions. If the query was a simple SELECT of columns from a table, then it would be reasonably simple to determine if a column could be nullable.
But it could also be a very complex expression, derived from multiple columns, subselects or even custom functions. Postgres can figure out the data type of each column, but in the general case it doesn't know if a result column may contain nulls.
If your application is only performing simple queries, and you know which table column each result column comes from, then you can find out if that table column is nullable like this:
SELECT is_nullable
FROM information_schema.columns
WHERE table_schema='myschema'
AND table_name='mytable'
AND column_name='mycolumn';
If your queries are not that simple then I recommend you always get the result as an Option<T> and handle the possibility that the result might be None.
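As a minimal sketch of that recommendation (the connection string, table and column names are placeholders, not from your code):
use postgres::{Client, Error, NoTls};

fn main() -> Result<(), Error> {
    let mut client = Client::connect("host=localhost user=postgres", NoTls)?;

    for row in client.query("SELECT mycolumn FROM myschema.mytable", &[])? {
        // Requesting Option<T> always works for a possibly-nullable column:
        // SQL NULL becomes None instead of a conversion error.
        let value: Option<String> = row.try_get("mycolumn")?;
        match value {
            Some(v) => println!("value = {}", v),
            None => println!("value is NULL"),
        }
    }
    Ok(())
}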

Jooq dsl for batch insert of maps, arrays and so forth

I'm hoping to use the jOOQ DSL to do batch inserts into Postgres. I know it's possible, but I'm having issues getting the data formatted properly.
dslContext.loadInto(table).loadJSON(jsonData).fields(...).execute();
is where I'm starting from. The tricky part seems to be getting a Map<String, String> into a jsonb column.
I have the data formatted according to this description, and jOOQ seems to be OK with it until the map/JSON-in-JSON shows up.
Another json-array column still needs to be dealt with too.
Questions:
Is this a reasonable approach?
If not, what would you recommend instead?
Error(s) I'm seeing:
ERROR: column "changes_to" is of type jsonb but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Edit:
try (DSLContext context = DSL.using(pgClient.getDataSource(), SQLDialect.POSTGRES_10)) {
    context.loadInto(table(RECORD_TABLE))
           .loadJSON(jsonData)
           .fields(field(name(RECORD_ID_COLUMN)),
                   field(name(OTHER_ID_COLUMN)),
                   field(name(CHANGES_TO_COLUMN)),
                   field(name(TYPE_COLUMN)),
                   IDS_FIELD)
           .execute();
} catch (IOException e) {
    throw new RuntimeException(e);
}
with json data:
{"fields":[{"name":"rec_id","type":"VARCHAR"},{"name":"other_id","type":"VARCHAR"},{"name":"changes_to","type":"jsonb"},{"name":"en_type","type":"VARCHAR"},{"name":"ids","type":"BIGINT[]"}],"records":[["recid","crmid","{\"key0\":\"val0\"}","ent type",[10,11,12]],["recid2","crmid2","{\"key0\":\"val0\"}","ent type2",[10,11,12]]]}
The problem(s) being how to format the 'changes_to' and 'ids' columns.
There's a certain price to pay if you're not using jOOQ's code generator (and you should!). jOOQ doesn't know what data type your columns are if you create a field(name("...")), so it won't be able to bind your values correctly. Granted, the Loader API could read the JSON header information, but it currently doesn't.
Instead, why not just either:
- provide explicit type information to your column references, like field(name(CHANGES_TO_COLUMN), SQLDataType.JSONB), as in the sketch below
- much better: use the code generator, in which case you already have all the type information associated with your Field expressions
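For example, a sketch of the loader call with explicit column types, so jOOQ can bind the jsonb and bigint[] values (column names come from the JSON header in the question; the table name, the surrounding class and a jOOQ version with JSONB support are assumptions):
import static org.jooq.impl.DSL.*;
import org.jooq.DSLContext;
import org.jooq.impl.SQLDataType;

public class LoadSketch {
    public void load(DSLContext context, String jsonData) throws java.io.IOException {
        context.loadInto(table(name("record_table")))
               .loadJSON(jsonData)
               .fields(field(name("rec_id"), SQLDataType.VARCHAR),
                       field(name("other_id"), SQLDataType.VARCHAR),
                       field(name("changes_to"), SQLDataType.JSONB),              // binds as jsonb
                       field(name("en_type"), SQLDataType.VARCHAR),
                       field(name("ids"), SQLDataType.BIGINT.getArrayDataType())) // binds as bigint[]
               .execute();
    }
}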

What is the difference between the record type and the row type in PostgreSQL?

As the title says, when reading the manual I found the record type and the row type, which are both composite types. However, I want to figure out the difference between them.
They're similar once defined, but they tend to have different use cases.
A RECORD type has no predefined structure and is typically used when the row type might change or is out of your control, for example if you're referencing a record in a FOR loop.
A %ROWTYPE variable is predefined from a particular table's row structure, so if anything deviates from that structure you will get runtime errors.
It all depends what you're trying to achieve.
For cursor loops I use a RECORD.
For more information:
http://www.postgresql.org/docs/current/static/plpgsql-declarations.html
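For a concrete contrast, a minimal sketch (the table name mytable is an assumption):
DO $$
DECLARE
    r RECORD;           -- no predefined structure
    t mytable%ROWTYPE;  -- structure fixed to mytable's columns
BEGIN
    -- RECORD adapts to whatever each query returns:
    FOR r IN SELECT 1 AS a, 'x' AS b LOOP
        RAISE NOTICE 'a = %, b = %', r.a, r.b;
    END LOOP;
    -- %ROWTYPE only accepts rows matching mytable's structure;
    -- anything else raises a runtime error.
    SELECT * INTO t FROM mytable LIMIT 1;
END $$;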

PL/pgSQL - %TYPE and ARRAY

Is it possible to use the %TYPE and array together?
CREATE FUNCTION role_update(
    IN id "role".role_id%TYPE,
    IN name "role".role_name%TYPE,
    IN user_id_list "user".user_id%TYPE[],
    IN permission_id_list INT[]
)
I get a syntax error from this, but I don't want to duplicate any column type, so I want to use "user".user_id%TYPE instead of simply INT, because then it is easier to modify the column type later.
As the manual explains here:
The type of a column is referenced by writing table_name.column_name%TYPE. Using this feature can sometimes help make a function independent of changes to the definition of a table.
The same functionality can be used in the RETURNS clause.
But there is no simple way to derive an array type from a referenced column, at least none that I would know of.
About modifying any column type later:
You are aware that this type of syntax is only a syntactical convenience to derive the type from a table column? Once created, there is no link whatsoever to the table or column involved.
It helps to keep a whole create script in sync. But it doesn't help with later changes to live objects in the database.
Related answer on dba.SE:
Array of template type in PL/pgSQL function using %TYPE
Using referenced types in a function's parameters makes no sense (in PostgreSQL), because they are translated immediately to actual types, and the function is stored with the actual types. Sorry, PostgreSQL doesn't support this functionality. Using referenced types inside a function's body is different: there, the actual type is detected on the first execution in each session.
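For reference, a sketch of what does work (the tables "role" and "user" with the referenced columns are assumed to exist; the array parameter's element type has to be spelled out):
CREATE FUNCTION role_update(
    IN p_id   "role".role_id%TYPE,    -- scalar %TYPE: resolved once, at creation time
    IN p_name "role".role_name%TYPE,
    IN p_user_id_list INT[],          -- array type written out explicitly
    IN p_permission_id_list INT[]
) RETURNS "role".role_id%TYPE         -- %TYPE also works in the RETURNS clause
LANGUAGE sql AS
$$ SELECT p_id $$;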

Cast to user-defined data type in PostgreSQL

I have created a data type called id which consists of two text values:
id(text, text)
I now need to cast values to this data type before they are inserted into my table. How would I go about doing this?
I created the type as follows:
CREATE TYPE ID AS (id text, source text);
Well, to create a cast you need a function that takes a value of one type as input and outputs the type you wish to cast to (in this case "ID", which I would give a more verbose name if I were you). What type do you want to cast from?
Realize that, without messing with all of that, you should be able to use your type according to this page. Just:
SELECT ROW('foo','bar')::ID;
You have to tell PostgreSQL how to cast, using CREATE CAST.
If we are talking about user-defined types which are really compatible with each other, you can cast the value to text first and then to your custom type: admin_action::text::"UserAction".
In my case "admin_action" was of type "AdminAction" and couldn't be converted to "UserAction" directly, but I've done it through the intermediate "text" step.
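A minimal end-to-end sketch of the ROW(...)::type approach (the table items is an assumption):
CREATE TYPE id AS (id text, source text);
CREATE TABLE items (item_id id, label text);
-- Cast a ROW constructor to the composite type on insert:
INSERT INTO items VALUES (ROW('foo', 'bar')::id, 'example');
-- Parenthesize the column to access the composite type's fields:
SELECT (item_id).id, (item_id).source FROM items;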