Influxdb howto update tag values - tags

I would like to do a query like this:
SELECT max(value) as value, "counter" = 'countergasconsumption'
INTO "home"."1month"."test"
FROM "telegraf"."autogen"."mqtt_consumer"
WHERE time > now() - 11d AND "topic" = '/utilitiesmonitor/countergasconsumption'
GROUP BY time(5m)
I already have data with the countergasconsumption tag (not the whole topic). I can't get it to work, tried different select into's and the docs are not really helpful on this topic. The old data is one measurement with a tag for the type of counter, the new data comes in on separate topics via Telegraf.

Related

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a delta table on a lake where data is partitioned by say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which can either contain values NULL (if everything looks good on that specific record) or Not null (say if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis, lets say nightly to make it simple, from where mappings are found.
On a daily basis, there is a good chance that about 100~200 rows get errored out (status column containing not null values). From these files, on a daily basis, (hence is the partition by file_date) , a downstream job pulls all the valid records and sends it for further processing ignoring those 100-200 errored records, waiting for the correct mapping file to be received. The downstream job, in addition to the valid status records, should also try and see if a mapping is found for the errored records and if present, take it down further as well (after of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The best way is to directly first update the delta table/lake with the correct mapping and update the status column to say "available_for_reprocessing" and my downstream job, pull the valid data for the day + pull the "available_for_reprocessing" data and after processing, update back with the status as "processed". But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there is just giving an example for a simple update with constants to update, not for updates from multiple tables.
The other but the most inefficient is, say pull ALL the data (both processed and errored) for the last say 30 days , get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient as we are reading everything (hunderds of millions of records) and writing everything back just to process say a 1000 records at the most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I use pyspark or can use sparksql but I am lost
If you want to update 1 column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works both for delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET staus = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.staus = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location

Is it possible to Group By a field extracted from JSON input in a Siddhi Query?

I currently have an stream, with 1 string attribute that contains a Json event.
This stream receives different events, which I want to apply Json path expressions so I can use those attributes on filters and functions.
JsonPath extractors work like a charm on filters and selectors, unfortunately, I am not being able to use them for the 'Group By' part.
I am actually doing it in an embedded Siddhi App with siddhi-execution-json extension added manually, but for the discussion, so everybody can
easily check and test it, I will paste an example app that works on WSO2 Stream Processor.
The objective looks like the following App:
#App:name("Group_by_json_attribute")
define stream JsonStream(json string);
#sink(type='log')
define stream LogStream(myField string, count long);
#info(name='query1')
from JsonStream#window.time(10 sec)
select json:getString(json, '$.myField') as myField, count() as count
group by myField having count > 1
insert into LogStream;
and it can accept the following events:
{"myField": "my_value"}
However, this query will raise the error:
Cannot find attribute type as 'myField' does not exist in 'JsonStream'; define stream JsonStream(json string)
I have also tried to use directly the Json extractor at 'Group by':
group by json:getString(json, '$.myField') as myField having count > 1
However the error now is:
mismatched input ':' expecting {',', ORDER, LIMIT, OFFSET, HAVING, INSERT, DELETE, UPDATE, RETURN, OUTPUT}
which seems to not be expecting to use an extension here
I am just wondering, if it is possible to group by attributes not directly defined in the input stream. In this case is a field extracted from a JSON object, but it could be any other function that generates another attribute.
I am also using versions from maven central repository
Siddhi: io.siddhi:siddhi-core:5.0.1
siddhi-execution-json: io.siddhi.extension.execution.json:siddhi-execution-json:2.0.1
(Edit) Clarification
The objective is, to use attributes not directly defined in the Stream, to be used on the Group By.
The reason why is, I currently have an embedded app which defines the whole set of input streams coming from external sources formatted as JSON, and there are also a set of output streams to inform external components when a query matches.
This app allows users to create custom queries on this set of predefined Streams, but they are not able to create Streams by their own.
Many thanks!
It seems we are expecting the group by fields from the query input stream, in this case, JsonStream. Use another query before this for extraction and the aggregation and filtering in the following query,
#App:name("Group_by_json_attribute")
define stream JsonStream(json string);
#sink(type='log')
define stream LogStream(myField string, count long);
#info(name='extract_stream')
from JsonStream
select json:getString(json, '$.myField') as myField
insert into ExtractedStream;
#info(name='query1')
from ExtractedStream#window.time(10 sec)
select myField, count() as count
group by myField
having count > 1
insert into LogStream;

How to track rows (by id) with a specific column value using Kafka JDBC Connector?

I have a table containing a large number of records. There's a column defining a type of the record. I'd like to collect records with a specific value in that column. Kind of:
Select * FROM myVeryOwnTable WHERE type = "VERY_IMPORTANT_TYPE"
What I've noticed I can't use WHERE clause in a custom query when I choose incremental(+timestamp) mode, otherwise I'd need to take care if filtering on my own.
The background of that I'd like to achieve is that I use Logstash to transfer some type of data from MySQL to ES. That's easily achievable there by using query that can contain where clause. However, with Kafka I can transfer my data much quicker (almost instantly) after inserting new rows in DB.
Thank you for any hints or advices.
Thanks to #wardziniak I was able to set it up.
query=select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p
topic.prefix=test-mysql-jdbc-
incrementing.column.name=id
however, I was expecting a topic test-mysql-jdbc-myVeryOwnTable so I've registered my consumer to that. However, using the query shown above table name is skipped so my topic was named exactly as prefix defined above. So I've just updated my properties topic.prefix=test-mysql-jdbc-myVeryOwnTable and it seems to be working just fine.
You can use subquery in your Jdbc Source Connector query property.
Sample JDBC Source Connector configuration:
{
...
"query": "select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p",
"incrementing.column.name": "id",
...
}

pipelinedb aggregates in where clause

Im am using pipelinedb for testing some analysis of data streams from sensors.
I want to be able, as an example, to find events in a stream that are defined by an aggregate. E.g. find events where the difference between max(temperature) and min(temparature) in the last 5 minutes exceeds a certain range.
When trying to put aggregates in the WHERE clause I get an error message saying something like 'aggregates not allowed in continuous views where clasues'
Am I missing something here or is it just not possible?
Otherwise I like pipelinedb very very much!
Well, pipelinedb says: "continuous queries don't support HAVING clauses".
What I'm trying to do is the following:
I have a stream named geo_vital_stream, which sends some sensor data along with a geolocation. At the moment I am interested
insert into geo_vital_stream (device_id, user_id, measured_at, heartrate, energy, eda, lon, lat) VALUES( 'A005D8-E4 2.0',1,'2015-10-08 15:04:33.134000+02',96.8497201823,351.056269367,0.505791,8.07154018407,52.9531484103 );
My cv looks like this:
CREATE CONTINUOUS VIEW cv_sensor_eda AS
SELECT user_id::integer,
MAX(eda::numeric) - MIN(eda::numeric) as range_eda
FROM geo_vital_stream
WHERE (measured_at > clock_timestamp() - interval '1 minutes')
GROUP BY user_id
Now, I am interested only in those "events", where the range (range_eda execeeds a certain value in the last minute.)
Using an aggregate in a WHERE clause actually isn't legal SQL. That is accomplished using a HAVING clause, but it doesn't seem like that's what you need here. Since aggregates compute values across multiple rows, it's not clear to me how you'd retrieve individual events based on aggregates (min, max) across multiple events. Could you provide an example of what each event looks like?

Converting complex query with inner join to tableau

I have a query like this, which we use to generate data for our custom dashboard (A Rails app) -
SELECT AVG(wait_time) FROM (
SELECT TIMESTAMPDIFF(MINUTE,a.finished_time,b.start_time) wait_time
FROM (
SELECT max(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
FROM mytable
WHERE name IN ('test_name')
AND status = 'SUCCESS'
GROUP by branch) a
INNER JOIN
(
SELECT MIN(start_time) start_time, branch
FROM mytable
WHERE name IN ('test_name_specific')
GROUP by branch) b
ON a.branch = b.branch
HAVING avg_time_spent between 0 and 1000)t
GROUP BY week
Now I am trying to port this to tableau, and I am not being able to find a way to represent this data in tableau. I am stuck at how to represent the inner group by in a calculated field. I can also try to just use a custom sql data source, but I am already using another data source.
columns in mytable -
start_time
avg_time_spent
name
branch
status
I think this could be achieved new Level Of Details formulas, but unfortunately I am stuck at version 8.3
Save custom SQL for rare cases. This doesn't look like a rare case. Let Tableau generate the SQL for you.
If you simply connect to your table, then you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified example to a similar query.
If you define a calculated field called worst_case_test_time
datediff(min(start_time), dateadd('second', max(start_time), avg_time_spent)), which seems close to what your original query says.
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst case bound for avg test time. There may be an even simpler formula, but its hard to know without a little context.
You could filter on status = "Success" and avg_time_spent < 1000, and place branch and WEEK(start_time) on say the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?