Database calls, 484ms apart, are producing incorrect results in Postgres

We have "things" sending data to AWS IoT. A rule forwards the payloads to a Lambda which is responsible for inserting or updating the data into Postgres (AWS RDS). The Lambda is written in python and uses PG8000 for interacting with the db. The lambda event looks like this:
{
"event_uuid": "8cd0b9b1-be93-49f8-1234-af4381052672",
"date": "2021-07-08T16:09:25.138809Z",
"serial_number": "a1b2c3",
"temp": "34"
}
Before inserting the data into Postgres, a query is run on the table to look for any existing event_uuids which are required to be unique. For a specific reason, there is no UNIQUE constraint on the event_uuid column. If the event_uuid does not exist, the data is inserted. If the event_uuid does exist, the data is updated. This all works great, except for the following case.
THE ISSUE: one of our things is sending two identical payloads in very quick succession. It's a problem with the device itself, but it's not something we can resolve at the moment, so we need to account for it. Here are the timestamps from CloudWatch of when each payload was received:
2021-07-08T12:10:09.288-04:00
2021-07-08T12:10:09.772-04:00
Because the payloads are received 484ms apart, the Lambda inserts both instead of inserting the first and updating with the second: the second invocation's existence check runs before the first invocation's INSERT has committed, so both take the insert path.
Any ideas on how to get around this?
Here is part of the Lambda code...
conn = make_conn()
event_query = f"""
    SELECT json_build_object('uuid', uuid)
    FROM samples
    WHERE event_uuid='{event_uuid}'
    AND serial_number='{serial_number}'
"""
event_resp = fetch_one(conn, event_query)
if event_resp:
    update_sample_query = f"""
        UPDATE samples SET temp={temp} WHERE uuid='{event_resp['uuid']}'
    """
else:
    insert_sample_query = f"""
        INSERT INTO samples (uuid, event_uuid, temp)
        VALUES ('{uuid4()}', '{event_uuid}', {temp})
    """

Related

pq: [parent] Data too large

I'm using Grafana to visualize some data stored in CrateDB in different panels.
Some of my dashboards work correctly, but there are 3 specific dashboards (created by someone on my team) that, at certain times of the day, stop showing data (No Data) and display the following error:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
I would like to understand what this error means and how I can fix it. Any help is deeply appreciated.
What I tried
17 SQL statements of the form:
SELECT
  time_index AS "time",
  entity_id AS metric,
  v1_ps
FROM etsm
WHERE
  entity_id = 'SM_B3_RECT'
ORDER BY 1, 2
for 17 different entities.
What I expected
I expected to receive the data corresponding to each of the SQL statements for its respective graph.
The result
Instead, some of the statements return no data, together with the warning message I shared above.
Additionally, the panel is configured to refresh every 15 minutes, but no matter how many times I refresh it manually, the set of statements that receive data changes.
Example: I refresh the panel and SQL statements A, B and C get data while the others don't; I refresh again and statements D, H and J receive data while the others don't (the pattern appears random).
Other additional information:
I have access to the database Grafana is querying, and the data is there.
Your queries have no time condition, so every refresh selects and processes all records, and you are hitting a limit of your DB (e.g. the amount of data it may process for a single query, which is exactly what the error above reports). Add a time condition so that only a fraction of the records is returned.
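For example (a sketch, assuming the panels use Grafana's PostgreSQL-compatible data source, where the $__timeFilter macro expands to the panel's current time range):

SELECT
  time_index AS "time",
  entity_id AS metric,
  v1_ps
FROM etsm
WHERE
  entity_id = 'SM_B3_RECT'
  AND $__timeFilter(time_index) -- Grafana expands this to time_index BETWEEN <panel start> AND <panel end>
ORDER BY 1, 2

With the filter in place, each refresh only scans the rows inside the dashboard's time range instead of the whole table.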

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column which can either be NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis (let's say nightly, to keep it simple), from which mappings are looked up.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing not-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to arrive. The downstream job, in addition to the valid-status records, should also check whether a mapping is now available for the errored records and, if so, take them forward as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the delta table/lake with the correct mapping, set the status column to "available_for_reprocessing", and have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data, then after processing update the status back to "processed". But this seems to be super difficult using Delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there is just giving an example for a simple update with constants to update, not for updates from multiple tables.
The other (and most inefficient) option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process at most a thousand records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it seems impossible to update specific records as well. Can someone please help?
I can use pyspark or sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (i.e. where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records based on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
    SELECT *
    FROM your_db.table_name AS b
    JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
    JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
    -- ... add further lookups if needed
    WHERE
        b.status = 'incorrect' AND
        a.lookup_column_a = b.lookup_column_a AND
        a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
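Applied to the scenario in the question, the same MERGE idea might look roughly like this (a sketch only: the table names, the lookup_key join column, the mapped_value column and the 30-day partition pruning are assumptions, not something given in the question):

MERGE INTO lake_db.events AS tgt
USING lake_db.mapping AS m
  ON tgt.lookup_key = m.lookup_key
  AND tgt.status IS NOT NULL                          -- only previously errored rows
  AND tgt.file_date >= date_sub(current_date(), 30)   -- prune to recent partitions
WHEN MATCHED THEN UPDATE SET
  tgt.mapped_value = m.mapped_value,
  tgt.status = 'available_for_reprocessing'

Because there is no WHEN NOT MATCHED clause, rows that still have no mapping are left untouched, and no unique key is needed on the target beyond the join condition matching at most one mapping row per record.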

Slick insert into H2, but no data inserted

I'm sure I am missing something really stupidly obvious here - I have a unit test for a very simple Slick 3.2 setup. The DAO has basic retrieve and insert methods as follows:
override def questions: Future[Seq[Tables.QuestionRow]] =
  db.run(Question.result)

override def createQuestion(title: String, body: String, authorUuid: UUID): Future[Long] =
  db.run(Question returning Question.map(_.id) += QuestionRow(0l, UUID.randomUUID().toString, title, body, authorUuid.toString))
And I have some unit tests - for the tests I'm using in-memory H2 and have a setup script (passed via the JDBC URL) to initialise two basic rows in the table.
The unit test for retrieving works fine and fetches the two rows inserted by the init script. I have just added a simple unit test to create a row and then retrieve them all - assuming it will fetch three rows - but no matter what I do, it only ever retrieves the initial two:
it should "create a new question" in {
  whenReady(questionDao.createQuestion("Question three", "some body", UUID.randomUUID)) { s =>
    whenReady(questionDao.questions) { q =>
      println(s)
      println(q.map(_.title))
      assert(true)
    }
  }
}
The output shows that the original s (the ID returned from the auto-increment) is 3, as I would expect (I have also tried the insert without the returning step, just letting it return the number of rows inserted, which returns 1, as expected), but looking at the values returned in q, it's only ever the first two rows inserted by the init script.
What am I missing?
My assumption is that your JDBC URL is something like jdbc:h2:mem:test;INIT=RUNSCRIPT FROM 'init.sql' and that no connection pooling is used.
There are two scenarios:
1. the connection is performed with keepAliveConnection = true (or by appending DB_CLOSE_DELAY=-1 to the JDBC URL) and the init.sql is something like:
DROP TABLE IF EXISTS QUESTION;
CREATE TABLE QUESTION(...);
INSERT INTO QUESTION VALUES(null, ...);
INSERT INTO QUESTION VALUES(null, ...);
2. the connection is performed with keepAliveConnection = false (the default, i.e. without appending DB_CLOSE_DELAY=-1 to the JDBC URL) and the init.sql is something like:
CREATE TABLE QUESTION(...);
INSERT INTO QUESTION VALUES(null, ...);
INSERT INTO QUESTION VALUES(null, ...);
The call to questionDao.createQuestion will open a new connection to your H2 database and will trigger the initialization script (init.sql).
In both scenarios, right after this call, the database contains a QUESTION table with 2 rows.
In scenario (2) after this call the connection is closed and according to H2 documentation:
By default, closing the last connection to a database closes the database. For an in-memory database, this means the content is lost. To keep the database open, add ;DB_CLOSE_DELAY=-1 to the database URL. To keep the content of an in-memory database as long as the virtual machine is alive, use jdbc:h2:mem:test;DB_CLOSE_DELAY=-1.
The call to questionDao.questions will then open a new connection to your H2 database and trigger the initialization script (init.sql) again.
In scenario (1) the first connection is kept alive (and with it the database content), but the new connection re-executes the initialization script (init.sql), whose DROP TABLE erases the content that was just inserted.
So, in both scenarios, questionDao.createQuestion returns 3 as expected, but the content is then lost, and the subsequent call to questionDao.questions runs against a freshly initialized database.
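A minimal way to make this setup survive multiple connections (a sketch; the column names below are hypothetical, inferred from QuestionRow): append ;DB_CLOSE_DELAY=-1 to the JDBC URL so the in-memory database outlives any single connection, and write init.sql so that re-executing it on each new connection neither drops the table nor duplicates the seed rows:

-- init.sql: safe to re-run on every new connection
CREATE TABLE IF NOT EXISTS QUESTION(
    ID BIGINT AUTO_INCREMENT PRIMARY KEY,
    UUID VARCHAR(36),
    TITLE VARCHAR(255),
    BODY VARCHAR(4000),
    AUTHOR_UUID VARCHAR(36)
);
-- H2's MERGE keyed on ID is idempotent: re-running the script refreshes the
-- two seed rows instead of dropping the table or inserting duplicates
MERGE INTO QUESTION (ID, UUID, TITLE, BODY, AUTHOR_UUID)
    KEY(ID) VALUES (1, 'uuid-one', 'Question one', 'body one', 'author-one');
MERGE INTO QUESTION (ID, UUID, TITLE, BODY, AUTHOR_UUID)
    KEY(ID) VALUES (2, 'uuid-two', 'Question two', 'body two', 'author-two');

With that in place, the row created by the test is still visible when questionDao.questions opens its own connection.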

How does read-through work in Ignite

My cache is empty, so SQL queries return null.
Read-through means that on a cache miss, Ignite will automatically go down to the underlying DB (or persistent store) to load the corresponding data.
If new data is inserted into the underlying DB table, do I have to take the cache server down to load the newly inserted data from the table, or will it sync automatically?
Does this work the same as Spring's @Cacheable, or differently?
It looks to me like the answer is no: the cache SQL query doesn't work because there is no data in the cache. But when I tried cache.get I got the following results:
case 1:
System.out.println("data == " + cache.get(new PersonKey("Manish", "Singh")).getPhones());
result ==> data == 1235
case 2:
PersonKey per = new PersonKey();
per.setFirstname("Manish");
System.out.println("data == " + cache.get(per).getPhones());
throws an error (stack trace attached as screenshots in the original post).
Read-through semantics can be applied when there is a known set of keys to read. This is not the case with SQL, so in case your data is in an arbitrary 3rd party store (RDBMS, Cassandra, HBase, ...), you have to preload the data into memory prior to running queries.
However, Ignite provides native persistence storage [1], which eliminates this limitation. It allows you to use any Ignite API without having all the data in memory, and this includes SQL queries as well. Data will be fetched into memory on demand as you use it.
[1] https://apacheignite.readme.io/docs/distributed-persistent-store
When you insert something into the database and it is not in the cache yet, then get operations will retrieve missing values from DB if readThrough is enabled and CacheStore is configured.
But currently it doesn't work this way for SQL queries executed on cache. You should call loadCache first, then values will appear in the cache and will be available for SQL.
When you perform your second get, the exact combination of firstname and lastname is looked up in the DB. It is converted into a CQL query containing a lastname=null condition, which fails because lastname cannot be null.
UPD:
To get all records that have firstname column equal to 'Manish' you can first do loadCache with an appropriate predicate and then run an SQL query on cache.
cache.loadCache((k, v) -> v.firstname.equals("Manish"));

SqlFieldsQuery qry = new SqlFieldsQuery("select firstname, lastname from Person where firstname='Manish'");
try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
    for (List<?> row : cursor)
        System.out.println("firstname:" + row.get(0) + ", lastname:" + row.get(1));
}
Note that loadCache is a heavy operation that requires running over all records in the DB, so it shouldn't be called too often. You can pass null as the predicate, in which case all records will be loaded from the database.
Also, to make SQL run fast on the cache, you should mark the firstname field as indexed in the QueryEntity configuration.
In your case 2, have you tried specifying lastname as well? By your stack trace it's evident that Cassandra expects it to be not null.

Trying to send data from ThingWorx Composer to a Postgres database

I created a Thing that accesses my PostgreSQL database table named sensordata. Now I have to send data to this table. How can I do this?
I have already set up the connection between ThingWorx Composer and the PostgreSQL database on a local setup.
I am trying to send sensor data from ThingWorx to the PostgreSQL database, but I am not able to send it.
You need to do two things:
1. Create a service on the postgresql_conn Thing to insert a row.
Select 'SQL (Command)' as the script type.
Put something like this into the script area:
INSERT INTO sensordata
(Temperature, Humidity, Vibration)
VALUES ([[TemperatureField]], [[HumidityField]], [[VibrationField]]);
TemperatureField, HumidityField, VibrationField are input fields of the service.
2. Create a Subscription on the sensordata Thing.
Set AnyDataChange as the event.
Put something like this in the script area:
var params = {
    TemperatureField: me.Temperature,
    HumidityField: me.Humidity,
    VibrationField: me.Vibration
};
var result = Things["postgresql_conn"].InsertRecord(params);
Now, whenever the data of sensordata changes, one row is added to the Postgres table.
Sorry for my English.