How inserting LineStrings to a PostGIS-Database with Python, psycopg2 and ppygis - postgresql

i am trying to insert LineStrings to a local Postgres/PostGIS-database. I use Python 2.7, psycopg2 and ppygis.
Every time when i make a loop-input, only a few records were inserted into the table. I tried to find out the problem with mogrify, but i see no failure.
polyline = []
for row in positions:
lat = row[0]
lon = row[1]
point = ppygis.Point(lon, lat, srid=4326)
polyline.append(point)
linestring = ppygis.LineString(polyline, srid=4326)
self.cursor.execute("BEGIN")
self.cursor.execute("INSERT INTO gtrack_4326 (car, polyline) VALUES (%s,%s);", ("TEST_car", linestring))
self.cursor.execute("COMMIT")
The use of execute.mogrify results in Strings like this:
INSERT INTO gtrack_4326 (car,polyline) VALUES ('TEST_car', '0102000020e610000018000000aefab72638502940140c42d4d899484055a4c2d84250294056a824a1e3994840585b0c795f50294085cda55df1994840edca78a57650294069595249f8994840ec78dd6cbd502940828debdff5994840745314f93f5129407396fecaef994840e1f6bafbd25129404eab329de7994840da588979565229407a562d44e2994840ebc9fca36f522940c2af4797ed9948403bd164b5af5229407a90f9dbf99948407adbf1cb05532940818af4ec039a484062928087585329402834ff9e0e9a4840e8bb5b59a2532940b1ec38341b9a4840dcb28d89de532940afe94141299a484084d3275e0a54294019b1aab9379a484080ca42853454294053509b82469a48408df8043f6054294063844b22569a48406d3e09c7875429406dfbc33b659a4840aa5a77989b542940c20e08196d9a48401b56a7b9cb542940a0a0b9f3699a4840cf2d742502552940192543e9669a484045ac0f351b552940fdb0efd46d9a48406891ed7c3f552940450a0a28799a4840d0189c77525529405f7b6649809a4840');
But if i look into the Database, i see a lot of records without geometry-data in the second column. I did not understand why mogrify shows data in each column and in the DB there is in nearly 50 % of the table no data in the geometry-column.

First, psycopg2 does its own transaction management, so you should generally write:
self.cursor.execute(
"INSERT INTO gtrack_4326 (car, polyline) VALUES (%s,%s);",
("TEST_car", linestring)
)
self.conn.commit()
See the psycopg2 docs.
Second, consider loading data in batches with COPY. See COPY in the psycopg2 docs.
Also consider setting log_statement = 'all' in postgresql.conf and a suitable log_line_prefix then restarting the PostgreSQL server. Examine the logs and see if you can tell what's doing the bogus inserts.
If practical, add a CHECK and/or NOT NULL constraint to the geometry column so that any incorrect INSERTs will fail and report an error to the program doing the insert. This might help you diagnose the problem.

How did you determine that 50% of the rows have no geometry data? I'll warn anyone using clients like pgAdminIII show a blank cell if it contains too much data, so it appears to be NULL, when it isn't. You can also directly view the GIS data in a program like Quantum GIS.
With an SQL client, a better diagnostic to determine if a linestring is really there is to get the number of points in the linestring:
SELECT car, ST_NumPoints(polyline) FROM gtrack_4326;
If the numbers are empty, then your assessment is correct that they are empty. Otherwise, the data are too large to display in your client application.

Related

pq: [parent] Data too large

I'm using Grafana to visualize some data stored in CrateDB in different panes.
Some of my boards work correctly, but there are 3 specific boards (created by someone from my work team), in which at certain times of the day they stop showing data (No Data) and as a warning it shows the following error:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
Honestly, I would like to understand what it means, and how I can fix it.
I remain attentive to any help you can give me, and I deeply appreciate the help.
what I tried
17 SQL Statements of the form:
SELECT
time_index AS "time",
entity_id AS metric,
v1_ps
FROM etsm
WHERE
entity_id = 'SM_B3_RECT'
ORDER BY 1,2
for 17 different entities.
what I hope
I hope to receive the data corresponding to each of the SQL statements for their respective graphing.
The result
As a result, there is no data received on some of the statements made and the warning message I shared:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
As an additional fact, the graph is configured to update every 15 min, but no matter how many times you manually update the graph, the statements that receive data are different.
Example: I refresh the panel and the SQL statements A, B and C get data, while the others don't. I refresh the panel and the SQL statements D, H and J receive data, and the others don't (with a random pattern).
Other additional information:
I have access to the database being consulted with Grafana, and the data is there
You don't have time condition, so query select/process all records all the time and you are hitting limits (e. g. size of processed data) of your DB. Add time condition, so only fraction of all records will be returned.

What could be causing this 'invalid host' error on kdb query?

I get an odd error when trying to query too many dates from a date-partitioned historical database:
q)eod: h"select from eod where date within 2018.01.01 2018.04.22"
'/tablepath/2018.04.04/eod/somecolumn: invalid host
q)eod: h"select from eod where date within 2018.01.17 2018.04.20"
'/tablepath/2018.04.20/eod/othercolumn: invalid host
q)eod: h"select from eod where date within 2018.01.18 2018.04.20"
q)
Note that both dates mentioned in the error messages are within the date range that we manage to extract in the end, and that it fails on a different column each time. This seems to indicate it's something to do with the size of the table being pulled, but when we check the size of the largest table we managed to get:
q)(-22!eod) % 1024 * 1024
646.9043
q)count eod
2872546
we find that it's not particularly large by either memory size nor by number of rows.
Googling for "invalid host" errors doesn't seem to turn up anything relevant, and I'm not seeing anything in the kdb docs about size limits that would be relevant. Anyone got any ideas?
Edit:
When loading the table in a session and making the queries directly, we get what appears to be the same error, but with a different message. For instance:
q)jj: select from eod where date within 2018.01.01 2018.04.22
Too many compressed files open
k){0!(?).#[x;0;p1[;y;z]]}
'./2018.04.04/eod/settlecab: No such file or directory
.
?
(+`exch`date`class..
q.Q))
Note that the file ./2018.04.04/eod/settlecab does in fact exist, and contains data:
I have no problem loading the data for just the date mentioned in the error, and the column mentioned has meaningful values:
q)jj: select from eod where date=2018.04.04
q)select count i by settlecab from jj
settlecab| x
---------| -----
0 | 41573
1 | 2269
The key point seems to be the Too many compressed files open message, but what can I do about this?
Edit for Summary/Solutions:
The table in question had many columns, all stored in a compressed format. When issuing a query against too many dates at once, kdb would try to mmap all of those columns at once, running into a limit on how many compressed files could be open at once.
Once I understood the problem, several solutions were available:
I could pull only certain columns from the database, reducing the number of files that kdb needed to keep open,
I could force kdb to pull all the data into memory by adding a dummy where clause to the query, such as (null column) | not null column (hacky, but it works),
I could have upgraded the kdb version and lifted OS limits (not practical in my case).
I still have no idea why this resulted in an invalid host error when querying the database remotely.
First off, can we just clarify the database structure you're working with. It seems from the filepaths returned in your errors that you've got a date-partitioned database. Did you mean non-segmented database when you said non-partitioned in your original query?
In terms of a fix for your issue, have you tried loading your database into a session, and making those queries directly? If so do you get the same issues?
If that seems to be working alright, the problem might lie with how you're defining your database handle. How is h defined in your original example?
It might also be worth trying to select individual dates from your database, to try and isolate the problem, and to determine if it lies with your on-disk data. Try specifically querying the dates that are mentioned in your errors.
You could also try performing your original queries with a subset a columns, again to try and pinpoint where your issue is coming from.
Let us know if you get any further with this.
Joseph

TorQ: How to use dataloader partitioning with separate tables that have different date ranges?

I'm trying to populate a prices and quotes database using AquaQ's TorQ. For this purpose I use the .loader.loadallfiles function. The difference being that prices is daily data and quotes is more intraday e.g. FX rates.
I do the loading as follows:
/- check the location of database directory
hdbdir:hsym `$getenv[`KDBHDB]
/hdbdir:#[value;`hdbdir;`:hdb]
rawdatadir:hsym `$getenv[`KDBRAWDATA]
target:hdbdir;
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "prices"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype`dataprocessfunc!(`date`sym`open`close`low`high`volume;"DSFFFFF";enlist ",";`prices;target;`date;`year;{[p;t] `date`sym`open`close`low`high`volume xcols update volume:"i"$volume from t}); rawdatadir];
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "quotes"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype`dataprocessfunc!(`date`sym`bid`ask;"ZSFF";enlist ",";`quotes;target;`date;`year;{[p;t] `date`sym`bid`ask`mid xcols update mid:(bid+ask)%2.0 from t}); rawdatadir];
and this works fine. However when loading the database I get errors attemping to select from either table. The reason is that for some partitions there aren't any prices or or there aren't any quotes data. e.g. attempting to:
quotes::`date`sym xkey select from quotes;
fails with an error saying the the partition for year e.g. hdb/2000/ doesn't exist which is true, there are only prices for year 2000 and no quotes
As I see there are two possible solutions but neither I know how to implement:
Tell .loader.loadallfiles to create empty schema for quotes and prices in partitions for which there isn't any data.
While loading the database, gracefully handle the case where there is no data for a given partition i.e. select from ... where ignore empty partitions
Try using .Q.chk[`:hdb]
Where `:hdb is the filepath of your HDB
This fills in missing tables, which will then allow you to preform queries.
Alternatively you can use .Q.bv, where the wiki states:
If your table exists in the latest partition (so there is a prototype
for the schema), then you could use .Q.bv[] to create empty tables
on the fly at run-time without having to create those empties on disk.

PostGIS geography query returns a string value

I have a strange issue. The lonlat column on my app works well on the development server –– its output is in the form of POINT(X Y). But when I move the data to the production server, the output is strange!
ActionView::Template::Error (undefined method `lon' for "0101000020E6100000541B9C887E7A52C02920ED7F80614440":String):
The lonlat value, which is encoded with SRID: 4326, is being read as a string. I am almost certain that there was a corruption in the data during migrating it from development to production because this was not a problem before the migration.
Does anyone know what about the database schema or column may cause this issue?
A geometry field stores its data as WKB. To see the WKT representation you need to change your query to something like
select ST_Astext(the_geom) as geometry from table
However, I don't know why in your development you have some kind of implicit conversion between WKB binary data and WKT strings. ¿What version of postgres and postgis are you using?
What lang is in your app server?
Is that ActiveRecord you're using?
I suggest you try something like
float ST_X(geometry a_point);
To make sure you can read the data properly and determinate if problem is on the data field or somewhere else.
I also would try doing the pg_dump in a single step if you determinate the problem is with the geometry column.
You can use pg_dump with option
--exclude-table-data=reg_expresion_ _tablename_
--exclude-table-data=schema.reg_expresion_ _tablename_
This will bring all the schema definition, but exclude the table data and bring only the data from table you need.
Turns out that when I killed the connection to the server to migrate the data, Rails did not set the schema search path (meaning didn't discover the postgis extension) upon reconnecting. I had to restart the server to solve this problem.

Efficient way of insert millions of rows, convert data and deal with it, on PostgreSQL+PostGIS

I have a big collection of data I want to use for user search later.
Currently I have 200 millions resources (~50GB). For each, I have latitude+longitude. The goal is to create spatial index to be able to do spatial queries on it.
So for that, the plan is to use PostgreSQL + PostGIS.
My data are on CSV file. I tried to use custom function to not insert duplicates, but after days of processing I gave up. I found a way to load it fast in the database: with COPY it takes less than 2 hours.
Then, I need to convert latitude+longitude on Geometry format. For that I just need to do:
ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326))
After some checking, I saw that for 200 millions, I have 50 millions points. So, I think the best way is to have a table "TABLE_POINTS" that will store all the points, and a table "TABLE_RESOURCES" that will store resources with point_key.
So I need to fill "TABLE_POINTS" and "TABLE_RESOURCES" from temporary table "TABLE_TEMP" and not keeping duplicates.
For "POINTS" I did:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326))
FROM TABLE_RESOURCES
I don't remember how much time it took, but I think it was matter of hours.
Then, to fill "RESOURCES", I tried:
INSERT INTO TABLE_RESOURCES (...,point_key)
SELECT DISTINCT ...,point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326) = point;
but again take days, and there is no way to see how far the query is ...
Also something important, the number of resources will continue to grow up, currently should be like 100K added by day, so storage should be optimized to keep fast access to data.
So if you have any idea for the loading or the optimization of the storage you are welcome.
Look into optimizing postgres first (ie google postgres unlogged, wal and fsync options), second do you really need points to be unique? Maybe just have one table with resources and points combined and not worry about duplicate points as it seems your duplicate lookup maybe whats slow.
For DISTINCT to work efficiently, you'll need a database index on those columns for which you want to eliminate duplicates (e.g. on the latitude/longitude columns, or even on the set of all columns).
So first insert all data into your temp table, then CREATE INDEX (this is usually faster that creating the index beforehand, as maintaining it during insertion is costly), and only afterwards do the INSERT INTO ... SELECT DISTINCT.
An EXPLAIN <your query> can tell you whether the SELECT DISTINCT now uses the index.