I have a table with a column named "data" that contains a JSON object of telemetry data.
The telemetry data is received from devices, some of which have received a firmware upgrade. After the upgrade, some of the properties have gained a namespace (e.g. "propertyname" is now "dsns:propertyname"), and a few of the properties are suddenly camel-cased (e.g. "propertyname" is now "propertyName").
Neither the meaning of the properties nor the number of properties has changed.
When querying the table, I do not want to get the whole "data" object, as the JSON is quite large. I only want the properties that are needed.
For now I have simply used json_build_object, such as:
select json_build_object(
         'requestedproperty', "data" -> 'requestedproperty',
         'anotherrequestedproperty', "data" -> 'anotherrequestedproperty'
       ) as "data"
from device_data
where id = 'f4ddf01fcb6f322f5f118100ea9a81432'
  and timestamp >= '2020-01-01 08:36:59.698' and timestamp <= '2022-02-16 08:36:59.698'
order by timestamp desc
But it does not work for fetching data from after the firmware upgrade for the devices that have received it.
It still works for querying data from before the upgrade, and I would like to have only one query for this if possible.
Is there an easy way to "ignore" namespaces and be case-insensitive with json_build_object?
I'm obviously no PostgreSQL expert, and would love all the help I can get.
This kind of insanity is exactly why device developers should not ever be allowed anywhere near a keyboard :-)
This is not an easy fix, but so long as you know the variations on the key names, you can use coalesce() to accomplish your goal and update your query with each new release:
json_build_object(
  'requestedproperty', coalesce(
    data -> 'requestedproperty',
    data -> 'dsns:requestedproperty',
    data -> 'requestedProperty'),
  'anotherrequestedproperty', coalesce(
    data -> 'anotherrequestedproperty',
    data -> 'dsns:anotherrequestedproperty',
    data -> 'anotherRequestedProperty')
) as "data"
Edit to add: You can also use jsonb_each_text(), treat key-value pairs as rows, and then use lower() and a regex split to doctor key names more generally, but I bet that the kinds of scatterbrains behind the inconsistencies you already see will eventually lead to a misspelled key name someday.
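For what it's worth, a rough sketch of that jsonb_each_text() route could look like the following (assuming "data" can be cast to jsonb and that getting the requested values back as text is acceptable); it strips a leading "namespace:" prefix and lower-cases every key before picking out the ones you want:
select json_build_object(
         'requestedproperty', norm.obj ->> 'requestedproperty',
         'anotherrequestedproperty', norm.obj ->> 'anotherrequestedproperty'
       ) as "data"
from device_data d
cross join lateral (
  -- normalize every key: drop "xxx:" prefixes and lower-case, keep the value as text
  select jsonb_object_agg(lower(regexp_replace(e.key, '^[^:]+:', '')), e.value) as obj
  from jsonb_each_text(d."data"::jsonb) as e(key, value)
) as norm
where d.id = 'f4ddf01fcb6f322f5f118100ea9a81432'
  and d.timestamp >= '2020-01-01 08:36:59.698' and d.timestamp <= '2022-02-16 08:36:59.698'
order by d.timestamp desc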
I'm writing a kind of summary page for my FileMaker solution.
For this, I have defined a "statistics" table, which uses calculation fields with ExecuteSQL to gather info from most tables, such as the number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds with a total of about 20k records in about 10 tables. The same SQL on any other database system shouldn't take more than a fraction of a second.
What could the reason be, what can I do about it, and where can I start debugging to figure out what's taking all this time?
The actual code looks like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema - but only "usually", as you have found.
Can you use the DDR (Database Design Report) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception will also give you many of the stats you are looking for. Again, it depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistic table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple things that will affect how ExecuteSQL performs.
One thing to keep in mind: whether it's ExecuteSQL, a Perform Find, or a relationship, it's all the same basic query under the hood. So if it would be slow doing it one way, it's likely going to be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. This is a super-light way to get that info vs. using Count(), which requires FileMaker to touch every record on the other side.
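A minimal sketch of that idea (the "Orders" table occurrence and field names here are made up):
// In the target table (Orders), an unstored calculation field:
cRecordCount = Get ( TotalRecordCount )
// In _Stats, reference it through the relationship; FileMaker only has to fetch the first related record:
Orders::cRecordCount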
Sum of Records Matching a Value.
Using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a summary field in the target table to sum the records may prove to be more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. (Just don't show that field on any of your layouts if you don't need it.)
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
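For instance (hypothetical field names), a single-value lookup like this stays within what the function can answer from an index:
ExecuteSQL ( "SELECT id FROM Orders WHERE order_number = ?" ; "" ; "" ; $orderNumber )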
Also note: if you have an open record (that means you, as the current user), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked whether you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.
We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change on a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
...and a lot more fields. As you can imagine, the log files are going to grow big.
Because the data is sent by different customers and is imported into the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context information. The first one should just count the changes in general, per order:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result over all events (probably this is not possible and this is my mistake).
So after reading the CSV (when csvInputAdapter.Start() returns), I read all events stored in the statement's NewEvents stream.
With 10 entries in the CSV file everything works fine. With 1 million lines it takes way too long. I tried it without any EPL statement (so just the CSV import) - that took about 5 seconds. With the first statement (not the complex pattern statement) I always stopped it after 20 minutes - so I am not sure how long it would take.
Then I changed the EPL of the first statement: I introduced a group by instead of the context:
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents stream after the CSVInputAdapter returns...
My questions:
Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2) this is valid, your EPL design is probably a little inefficient. You would want to understand how patterns work, by using filter indexes and index entries, which are more expensive to create but are extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3) check your code? Don't use a CSV file for large stuff. Make sure a listener is attached.
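As an illustration of that last point, here is a minimal sketch of attaching a listener to the "group by" statement before starting the adapter (this uses the .NET event that NEsper exposes on EPStatement; the console output is just for the test):
var stmt = epService.EPAdministrator.CreateEPL(
    "select Ref, count(*) as x from LogEvent group by Ref");
stmt.Events += (sender, args) =>
{
    if (args.NewEvents == null) return;
    foreach (var ev in args.NewEvents)
    {
        // "Ref" and "x" are the column aliases from the EPL above
        Console.WriteLine("{0}: {1}", ev.Get("Ref"), ev.Get("x"));
    }
};
csvInputAdapter.Start();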
Is it possible to fetch only one field from the database using the SORM Framework?
What I want in plain SQL would be:
SELECT node_id FROM messages
I can't seem to reproduce this in SORM. I know this might be against how SORM is supposed to work, but right now I have two huge tables with different kinds of messages. I was asked to get all the unique node_ids from both tables.
I know I could just query both tables using sorm and parse through all the data but I would like to put the database to work. Obviously, this would be even better if one can get only unique node_ids in a single db call.
Right now with just querying everything and parsing it, it takes way too long.
There doesn't seem to be ORM support for what you want to do, unless node_id happens to be the primary key of your Message object:
val messages = Db.query[Message].fetchIds()
In this case you shouldn't need to worry about it being UNIQUE, since primary keys are by definition unique. Alternatively, you can run a custom SQL query:
Db.fetchWithSql[Message]("SELECT DISTINCT node_id FROM messages")
Note that the latter might be typed wrong: you'd have to try it against your database. You might need fetchWithSql[Int], or some other variation: it is unclear what SORM does in the eventuality that the primary key hasn't been queried.
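If you do end up dropping down to raw SQL anyway, one round trip can cover both tables; UNION (as opposed to UNION ALL) already removes duplicates, so the node_ids come back unique. The second table name below is just a placeholder for your other message table:
SELECT node_id FROM messages
UNION
SELECT node_id FROM other_messages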
I just came across a scenario where occasionally (not for all sets of data) I'm getting an "Error: SQL0802 - Data conversion or data mapping error." exception when adding ORDER BY to a simple query. For example, this works:
SELECT
market,
locationCode,
locationName
FROM locations
and the following is failing miserably:
SELECT
market,
locationCode,
locationName
FROM locations
ORDER BY locationName
I'm getting: Error: SQL0802 - Data conversion or data mapping error. (State:S1000, Native Code: FFFFFCDE)
I get the same error if I try to sort by name, or population, or anything really... but only sometimes. Meaning, when it errors on name or code, it errors when sorted by any field in that subset of locations; if it works for a particular subset of locations, then it works for any sort order.
There are no null values in any of the fields, code and name fields are character fields.
Initially, I got this error when I added ROW_NUMBER column:
ROW_NUMBER() OVER(PARTITION BY market ORDER BY locationCode) as rowNumber
Since then, I have narrowed it down to the failing ORDER BY case. I don't know which direction to go with it. Any thoughts?
Update: there are no blank values in the location name field. And even if I strip the subset down to only the 7-digit numeric id and sort by that field, I still get the same error:
WITH locs as (
SELECT id
FROM locations
)
SELECT *
FROM locs
ORDER BY id
I get this error when I SELECT DISTINCT any field from the subset too.
I had/have the exact same situation as described. The error seemed to be random, but would always appear when sorting was added. Although I can't precisely describe the technical details, what I think is occurring is the "randomness" was actually due to the size of the tables, and the size of the cached chunks of returned rows from the query.
The actual cause of the problem is junk values and/or blanks in the key fields used by the join. If there was no sorting, and the first batch of cached results didn't hit the records with the bad fields, the error wouldn't occur at first... but eventually it always did.
And the two things that ALWAYS drew out the error IMMEDIATELY were sorting or paging through the results. That's because in order to sort, it has to hit every one of those key fields, and then cache the complete results. I think. Like I said, I don't know the complete technobabble, but I'm pretty sure that's close in lay-geek terms.
I was able to solve the error by force-casting the key columns to integer. I changed the join from this...
FROM DAILYV INNER JOIN BXV ON DAILYV.DAITEM=BXV.BXPACK
...to this...
FROM DAILYV INNER JOIN BXV ON CAST(DAILYV.DAITEM AS INT)=CAST(BXV.BXPACK AS INT)
...and I didn't have to make any corrections to the tables. This is a database that's very old, very messy, and has lots of junk in it. Corrections have been made, but it's a work in progress.
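If you want to track down the offending rows rather than just casting around them, a query along these lines can flag blank or non-numeric join keys (a sketch only; note that on DB2 for i, TRANSLATE takes the to-string before the from-string, so double-check the argument order on your release):
-- Keys that are blank, or that still contain something once all digits are blanked out
SELECT DAITEM
FROM DAILYV
WHERE TRIM(DAITEM) = ''
   OR TRIM(TRANSLATE(DAITEM, ' ', '0123456789')) <> ''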
I posted a few weeks back inquiring about the Firebird DB and how to monitor it. Since then I have come up with a nifty script that monitors all of the page reads/writes/fetches/marks. Among the columns I am monitoring are the MON$STAT_ID and MON$STAT_GROUP fields. These print out nice numbers for me; however, I have no way to correlate them and understand what exactly they mean. I thought printing out MON$STAT_GROUP would help, but it has yet to assist me in any way...
I have also looked into the RDB$ commands but have found very limited documentation to see if they might assist me in monitoring my database.
So I decided to come here and ask, first off, whether I am monitoring my database in a way that lets others view the page reads/writes/fetches/marks data and make an intelligent decision on whether or not the database is performing as expected.
Secondly, would adding RDB$ commands to my script add anything to the value of the data that I will be giving our database folks?
Lastly, and maybe most importantly, is there any way to correlate the MON$STAT_ID fields to an actual table in the database, to understand when something is going on that should not be? I currently monitor the database every minute, which may be too frequent, but I am getting valid data out. The only question now is how to interpret this data. Can someone give me advice on methods they use/have used in the past that have worked for them?
(NOTE: Running firebird 2.1)
The column MON$STAT_ID in MON$IO_STATS (and MON$RECORD_STATS and MON$MEMORY_USAGE) is the primary key of the record in the monitoring table. Almost all other monitoring tables include a MON$STAT_ID to point to these statistics: MON$ATTACHMENTS, MON$CALL_STACK, MON$DATABASE, MON$STATEMENTS, MON$TRANSACTIONS.
In other words: the statistics apply on the database, attachment, transaction, statement or call level (PSQL executes). The statistics tables contain a column called MON$STAT_GROUP to discern these types. The values of MON$STAT_GROUP are described in RDB$TYPES:
0 : DATABASE
1 : ATTACHMENT
2 : TRANSACTION
3 : STATEMENT
4 : CALL
Typically the statistics of level 0 contain all from level 1, level 1 contains all from level 2 for that attachment, level 2 contains all from level 3 for that transaction, level 3 contains all from level 4 for that statement.
As there might be data processed unrelated to the lower level, or a specific attachment, transaction or statement handle has already been dropped, the numbers of the lower level do not necessarily aggregate to the entire number of the higher level.
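For example, a query along these lines shows the statement-level (group 3) page counters together with the statement text; it only uses the standard monitoring-table columns, so it should work on Firebird 2.1 and later:
SELECT s.MON$ATTACHMENT_ID,
       s.MON$SQL_TEXT,
       io.MON$PAGE_READS,
       io.MON$PAGE_WRITES,
       io.MON$PAGE_FETCHES,
       io.MON$PAGE_MARKS
FROM MON$STATEMENTS s
JOIN MON$IO_STATS io ON io.MON$STAT_ID = s.MON$STAT_ID
WHERE io.MON$STAT_GROUP = 3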
There is no way to correlate the statistics to a specific table (as this information isn't table related, but - simplified - from executing statements which might cover multiple tables).
As I also commented, I am unsure what you mean by "RDB$ commands", but I am assuming you are talking about RDB$GET_CONTEXT() and RDB$SET_CONTEXT(). You could use RDB$GET_CONTEXT() to obtain the current connection id (SESSION_ID) and transaction id (TRANSACTION_ID). These values can be used for MON$ATTACHMENT_ID and MON$TRANSACTION_ID in the monitoring tables. I don't think the other variables in the SYSTEM namespace are interesting, and those in USER_SESSION and USER_TRANSACTION are all user-defined (and initially those namespaces are empty).
It is far easier to use the CURRENT_CONNECTION and CURRENT_TRANSACTION context variables within a statement. As documented in doc\README.monitoring_tables.txt in the Firebird installation:
System variables CURRENT_CONNECTION and CURRENT_TRANSACTION could be used to select data about the current (for the caller) connection and transaction respectively. These variables correspond to the ID columns of the appropriate monitoring tables.
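As a concrete illustration of that, run from the connection you want to inspect, this returns the page counters for the current attachment:
SELECT io.MON$PAGE_READS,
       io.MON$PAGE_WRITES,
       io.MON$PAGE_FETCHES,
       io.MON$PAGE_MARKS
FROM MON$ATTACHMENTS a
JOIN MON$IO_STATS io ON io.MON$STAT_ID = a.MON$STAT_ID
WHERE a.MON$ATTACHMENT_ID = CURRENT_CONNECTION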
Note: my answer is based on Firebird 2.5.
To present statistics by specific tables I use this SQL (FB 3):
select t.mon$table_name,
       trim(
         case when r.mon$record_seq_reads > 0 then 'Non index Reads: ' || r.mon$record_seq_reads else '' end ||
         case when r.mon$record_idx_reads > 0 then ' Index Reads: ' || r.mon$record_idx_reads else '' end ||
         case when r.mon$record_inserts > 0 then ' Inserts: ' || r.mon$record_inserts else '' end ||
         case when r.mon$record_updates > 0 then ' Updates: ' || r.mon$record_updates else '' end ||
         case when r.mon$record_deletes > 0 then ' Deletes: ' || r.mon$record_deletes else '' end
       )
from mon$table_stats t
join mon$record_stats r on r.mon$stat_id = t.mon$record_stat_id
where t.mon$table_name not starting with 'RDB$'
  and r.mon$stat_group = 2
order by 1