PostgreSQL text field loads as strL

I have data stored in PostgreSQL with the data type text.
When I load this data into Stata it has type strL, even if every string in a column is only one character long. This takes up too much memory. I would like to continue using the text type in PostgreSQL.
Is there a way to specify that text data from PostgreSQL is loaded into Stata with type str8?
I also want numeric data to be loaded as numeric values, so allstring is not a good solution. I would also like to avoid specifying the data type on a column-by-column basis.
The command I use to load data into Stata is this:
odbc load, exec("SELECT * FROM mytable") <connect_options>
The file profile.do contains the following:
set odbcmgr unixodbc, permanently
set odbcdriver ansi, permanently
The file odbc.ini contains the following:
[database_name]
Debug = 0
CommLog = 0
ReadOnly = no
Driver = /usr/local/lib/psqlodbcw.so
Servername = <server>
FetchBufferSize = 99
Port = 5432
Database = postgres
In PostgreSQL mytable looks like this:
postgres=# \d+ mytable
Table "public.mytable"
Column | Type | Modifiers | Storage | Stats target | Description
--------+------+-----------+----------+--------------+-------------
c1 | text | | extended | |
c2 | text | | extended | |
postgres=# select * from mytable;
c1 | c2
----+-------
a | one
b | two
c | three
(3 rows)
In Stata mytable looks like this:
. describe
Contains data
obs: 3
vars: 2
size: 500
---------------------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------------------
c1 strL %9s
c2 strL %9s
---------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
I am using PostgreSQL v9.6.5 and Stata v14.2.

You can do this in Stata by compress-ing your data after you load the variables:
clear
input strL string
"My name is Pearly Spencer"
"I am a contributor on Stack Overflow"
"This is an example variable"
end
describe
Contains data
obs: 3
vars: 1
size: 355
------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------
string strL %9s
------------------------------------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
compress, nocoalesce
describe
Contains data
obs: 3
vars: 1
size: 108
------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------
string str36 %36s
------------------------------------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
The option nocoalesce forces Stata to choose the appropriate length for the loaded string variables.
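Applied to the original workflow, a minimal sketch looks like this (the connection options are the placeholder from the question and the result is not shown here):
odbc load, exec("SELECT * FROM mytable") <connect_options>
* demote strL variables to fixed-width str# types without coalescing duplicate values
compress, nocoalesce
describe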

Related

How to convert nested json to data frame with kdb+

I am trying to get data from cryptostats as shown below, but it gives me back nested JSON. I want it in a table format. How do I do that?
query:"https://api.cryptostats.community/api/v1/fees/oneDayTotalFees/2023-02-07";
raw:.Q.hg query;
res:.j.k raw;
To get the JSON file, use https://api.cryptostats.community/api/v1/fees/oneDayTotalFees/2023-02-07
To view the JSON in a table format, use https://jsongrid.com/json-grid
The final result should be a kdb+ table with all the columns from the nested JSON output.
They are all dictionaries:
q)distinct type each res[`data]
,99h
But they do not collapse to a table because they do not all have matching keys:
q)distinct key each res[`data]
`id`bundle`results`metadata`errors
`id`bundle`results`metadata
Looking at a row where errors is populated, we can see it is a dictionary:
q)res[`data;0;`errors]
oneDayTotalFees| "Error executing oneDayTotalFees on compound: Date incomplete"
You can create a prototype dictionary with a blank errors key in it and join (,) each piece of data onto it. This results in uniform dictionaries, which will be promoted to a table (type 98h):
q)table:(enlist[`errors]!enlist (`$())!()),/:res`data
q)type table
98h
The row which already had errors is unaffected:
q)table 0
errors | (,`oneDayTotalFees)!,"Error executing oneDayTotalFees..
id | "compound"
bundle | 0n
results | (,`oneDayTotalFees)!,0n
metadata| `source`icon`name`category`description`feeDescription;..
The row which previously did not have errors now has a valid empty dictionary:
q)table 1
errors | (`symbol$())!()
id | "swapr-ethereum"
bundle | "swapr"
results | (,`oneDayTotalFees)!,24.78725
metadata| `category`name`icon`bundle`blockchain`description`feeDescription..
https://kx.com/blog/kdb-q-insights-parsing-json-files/
https://code.kx.com/q/ref/join/
https://code.kx.com/q/kb/faq/#construction
https://code.kx.com/q/basics/datatypes/
https://code.kx.com/q/ref/maps/#each-left-and-each-right
If you want to explore nested objects you can index at depth (see the blog post linked above). If you have many sparse keys, leaving the data like this is efficient for storage:
q)select tokenSymbol:metadata[::;`tokenSymbol] from table where not ""~/:metadata[::;`tokenSymbol]
tokenSymbol
-----------
"HNY"
If you do wish to explode a nested field, you can run something similar to:
q)table:table,'{flip c!flip table[`metadata]#\:(c:distinct raze key each table[`metadata])}[]
q)meta table
c | t f a
----------------| -----
errors |
id | C
bundle | C
results |
metadata |
source | C
icon | C
name | C
category | C
description | C
feeDescription | C
blockchain | C
website | C
tokenTicker | C
tokenCoingecko | C
protocolLaunch | C
tokenLaunch | C
adapter | C
subtitle | C
events | C
shortName | C
protocolShutdown| C
tokenSymbol | C
subcategory | C
tokenticker | C
tokencoingecko | C
Care needs to be taken when filling in nulls and keeping the types of data in each column consistent. In this dataset the events tag inside metadata is tabular data:
q)select distinct type each events from table
events
------
10
98
0
This would need to be cleaned with something similar to:
q)table:update events:count[i]#enlist ([] date:();description:()) from table where not 98h=type each events
The data returned from the API contains dictionaries with two distinct sets of keys:
q)distinct key each res`data
`id`bundle`results`metadata`errors
`id`bundle`results`metadata
One simple way to convert this to a table is to enlist each dictionary first, converting them to tables, then joining with uj:
q)(uj/)enlist each res`data
id bundle results metadata ..
-----------------------------------------------------------------------------..
"compound" 0n (,`oneDayTotalFees)!,0n `source`i..
"swapr-ethereum" "swapr" (,`oneDayTotalFees)!,24.78725 `category..
...
This works because uj generalises the join operator (,), allowing tables with different but overlapping schemas to be combined.
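As a toy illustration of that null-filling behaviour (the table and column names below are made up, not from the API data):
q)t1:([]id:1 2;name:`a`b)
q)t2:([]id:enlist 3;extra:enlist 1.5)
q)t1 uj t2
The result has columns id, name and extra; name is null for the row from t2 and extra is null for the rows from t1.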

COPY with FORCE NULL to all fields

I have several CSVs with varying field names that I am copying into a Postgres database from an S3 data source. Quite a few of them contain empty strings, "", which I would like to convert to NULL at import. When I attempt to copy I get an error along the lines of this (the same issue occurs for other data types, integer, etc.):
psycopg2.errors.InvalidDatetimeFormat: invalid input syntax for type date: ""
I have tried using FORCE_NULL (field1, field2, field3) and this works for me, except I would like to do FORCE_NULL (*) and apply it to all of the columns, as I have a lot of fields I am bringing in that I'd like this applied to.
Is this available?
This is an example of my csv:
"ABC","tgif","123","","XyZ"
Use the psycopg2 COPY functions, in this case copy_expert:
cat empty_str.csv
1, ,3,07/22/2022
2,test,4,
3,dog,,07/23/2022
create table empty_str_test(id integer, str_fld varchar, int_fld integer, date_fld date);
import psycopg2
con = psycopg2.connect("dbname=test user=postgres host=localhost port=5432")
cur = con.cursor()
with open("empty_str.csv") as csv_file:
cur.copy_expert("COPY empty_str_test FROM STDIN WITH csv", csv_file)
con.commit()
select * from empty_str_test ;
id | str_fld | int_fld | date_fld
----+---------+---------+------------
1 | | 3 | 2022-07-22
2 | test | 4 |
3 | dog | | 2022-07-23
From the PostgreSQL COPY documentation:
NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.
copy_expert allows you to specify the CSV format. If you use copy_from it will use the text format.
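If you do want the effect of FORCE_NULL on every column without listing them by hand, one possible approach (a sketch, not tested; it reuses the table and file names from above) is to read the column names from information_schema and compose the COPY command with psycopg2.sql. Note that FORCE_NULL only affects quoted empty strings (""); unquoted empty fields are already read as NULL in CSV format, which is what the copy_expert example above relies on.
import psycopg2
from psycopg2 import sql

con = psycopg2.connect("dbname=test user=postgres host=localhost port=5432")
cur = con.cursor()

# collect the column names of the target table
cur.execute(
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_name = %s ORDER BY ordinal_position",
    ("empty_str_test",),
)
cols = [row[0] for row in cur.fetchall()]

# compose COPY ... WITH (FORMAT csv, FORCE_NULL (col1, col2, ...))
copy_cmd = sql.SQL(
    "COPY empty_str_test FROM STDIN WITH (FORMAT csv, FORCE_NULL ({}))"
).format(sql.SQL(", ").join(sql.Identifier(c) for c in cols))

with open("empty_str.csv") as csv_file:
    cur.copy_expert(copy_cmd.as_string(con), csv_file)
con.commit()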

KDB: How to assign string datatype to all columns

When I created the table Tab, I specified the columns as string,
Tab: ([Key1:string()] Col1:string();Col2:string();Col3:string())
But the column datatype (t) is empty. I suppose specifying the column as string has no effect.
meta Tab
c t f a
--------------------
Key1
Col1
Col2
Col3
After I do a bulk upsert in Java...
c.Dict dict = new c.Dict((Object[]) columns.toArray(new String[columns.size()]), data);
c.Flip flip = new c.Flip(dict);
conn.c.ks("upsert", table, flip);
The datatypes are all symbols:
meta Tab
c t f a
--------------------
Key1 s
Col1 s
Col2 s
Col3 s
How can I specify the datatype of the columns as string and have it remain as string?
You can't define a column of an empty table as string, because string columns are merely lists of lists of characters.
You can just set them as empty lists which is what your code is doing.
But the column will then take on the type of whatever data is inserted into it.
The real question is why your Java process is sending symbols when it should be sending strings. You need to make the change there before publishing to kdb+.
Note that if you define the columns as char you still won't be able to upsert strings:
q)Tab: ([Key1:`char$()] Col1:`char$();Col2:`char$();Col3:`char$())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
'rank
[0] Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
^
q)Tab: ([Key1:()] Col1:();Col2:();Col3:())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
Key1 | Col1 Col2 Col3
------| --------------------
"test"| "test" "test" "test"
kdb+ does not allow you to define a column type as a list when creating a table. That means you cannot define a column type as string, because a string is itself a list (of characters).
The only way is to define the column as an empty general list, like:
q) t:([]id:`int$();val:())
Then when you insert data into this table, the column will automatically take the type of that data.
q)`t insert (4;"row1")
q) meta t
c | t f a
---| -----
id | i
val| C
In your case, one option is to send string data from your Java process, as mentioned by user 'emc211'; the other is to convert your data to strings in the kdb+ process before insertion.
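For the second option, a minimal sketch (it assumes Tab was created with empty general lists as above; incoming is a hypothetical name for a batch that arrives with symbol columns):
q)incoming:([]Key1:enlist`test;Col1:enlist`test;Col2:enlist`test;Col3:enlist`test)
q)Tab upsert 1!update Key1:string Key1,Col1:string Col1,Col2:string Col2,Col3:string Col3 from incoming
/ 1! re-keys the first column after the symbol columns have been converted to strings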

How do I define a record type for JSON in postgres

Looking at the Postgres documentation for JSON functions (https://www.postgresql.org/docs/9.6/static/functions-json.html), there is a section I don't understand about expanding a JSON object into a set of rows.
The docs give the signature json_populate_recordset(base anyelement, from_json json) and this sample use: select * from json_populate_recordset(null::myrowtype, '[{"a":1,"b":2},{"a":3,"b":4}]')
But I'm not sure what that first argument (null::myrowtype) is -- a table definition?
The description of this function is: Expands the outermost array of objects in from_json to a set of rows whose columns match the record type defined by base (see note below).
None of the notes at the bottom seemed relevant. I'm hoping for a better explanation with sample code to understand it all.
The second note is the one of interest in the docs, as it explains how missing fields/values are handled:
Note: In json_populate_record, json_populate_recordset, json_to_record
and json_to_recordset, type coercion from the JSON is "best effort"
and may not result in desired values for some types. JSON keys are
matched to identical column names in the target row type. JSON fields
that do not appear in the target row type will be omitted from the
output, and target columns that do not match any JSON field will
simply be NULL.
json_populate_recordset maps the field names of the JSON objects to the column names of the table given as the first argument.
create table public.test (a int, b text);
select * from json_populate_recordset(null::public.test, '[{"a":1,"b":"b2"},{"a":3,"b":"b4"}]');
a | b
---+----
1 | b2
3 | b4
(2 rows)
--Wrong column name:
select * from json_populate_recordset(null::public.test, '[{"a":1,"c":"c2"},{"a":3,"c":"c4"}]');
a | b
---+---
1 |
3 |
(2 rows)
--Wrong datatype:
select * from json_populate_recordset(null::public.test, '[{"a":1.1,"b":22},{"a":3.1,"b":44}]');
ERROR: invalid input syntax for integer: "1.1"
Alternatively, instead of using the column names/types from an existing table, you can define the columns on the fly:
select * from json_to_recordset('[{"a":1,"b":"foo"},{"a":"2","c":"bar"}]') as x(a int, b text);
a | b
---+-----
1 | foo
2 |
(2 rows)
Note that a default type cast occurs ("2" is mapped to 2), missing fields (b in the second record) come back as NULL, and fields not defined in the column list (c) are ignored.
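To address the null::myrowtype part of the question directly: the row type does not have to be a table. A composite type created with CREATE TYPE works as well; a minimal sketch (the type name is only for illustration):
create type myrowtype as (a int, b text);
select * from json_populate_recordset(null::myrowtype, '[{"a":1,"b":"foo"},{"a":2}]');
Here the second row's b comes back NULL because that key is absent from the JSON object.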

Strange behaviour in PostgreSQL

I'm new to PostgreSQL and I'm trying to migrate my application from MySQL.
I have a table with the following structure:
Table "public.tbl_point"
Column | Type | Modifiers | Storage | Description
------------------------+-----------------------+-----------+----------+-------------
Tag_Id | integer | not null | plain |
Tag_Name | character varying(30) | not null | extended |
Quality | integer | not null | plain |
Execute | integer | not null | plain |
Output_Index | integer | not null | plain |
Last_Update | abstime | | plain |
Indexes:
"tbl_point_pkey" PRIMARY KEY, btree ("Tag_Id")
Triggers:
add_current_date_to_tbl_point BEFORE UPDATE ON tbl_point FOR EACH ROW EXECUTE PROCEDURE update_tbl_point()
Has OIDs: no
When I run the query through a C program using libpq:
UPDATE tbl_point SET "Execute"=0 WHERE "Tag_Id"=0
I got the following output:
ERROR: record "new" has no field "last_update"
CONTEXT: PL/pgSQL function "update_tbl_point" line 3 at assignment
I get exactly the same error when I try to change the value of "Execute" or any other column using pgAdminIII.
Everything works fine if I change the column name from "Last_Update" to "last_update".
I found the same problem with other tables I have in my database and the column always appears with abstime or timestamp columns.
Your update_tbl_point function is probably doing something like this:
new.last_update = current_timestamp;
but it should be using new."Last_Update", so fix your trigger function.
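A sketch of what the corrected trigger function might look like (the rest of its body is not shown in the question, so everything beyond the assignment is assumed):
CREATE OR REPLACE FUNCTION update_tbl_point() RETURNS trigger AS $$
BEGIN
    -- double quotes preserve the mixed-case column name
    NEW."Last_Update" := current_timestamp;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;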
Column names are normalized to lower case in PostgreSQL (the opposite of what the SQL standard says, mind you), but identifiers that are double quoted keep their case:
Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but "Foo" and "FOO" are different from these three and each other. (The folding of unquoted names to lower case in PostgreSQL is incompatible with the SQL standard, which says that unquoted names should be folded to upper case. Thus, foo should be equivalent to "FOO" not "foo" according to the standard. If you want to write portable applications you are advised to always quote a particular name or never quote it.)
So, if you do this:
create table pancakes (
Eggs integer not null
)
then you can do any of these:
update pancakes set eggs = 11;
update pancakes set Eggs = 11;
update pancakes set EGGS = 11;
and it will work because all three forms are normalized to eggs. However, if you do this:
create table pancakes (
"Eggs" integer not null
)
then you can do this:
update pancakes set "Eggs" = 11;
but not this:
update pancakes set eggs = 11;
The usual practice with PostgreSQL is to use lower case identifiers everywhere so that you don't have to worry about it. I'd recommend the same naming scheme in other databases as well; having to quote everything just leaves you with a mess of double quotes (standard), backticks (MySQL), and brackets (SQL Server) in your SQL, and that won't make you any friends.