druid groupBy query - json syntax - intervals - druid

I'm attempting to create this query (which works as I hope):
SELECT userAgent, COUNT(*) FROM page_hour GROUP BY userAgent order by 2 desc limit 10
as a native JSON query. I've tried this:
{
"queryType": "groupBy",
"dataSource": "page_hour",
"granularity": "hour",
"dimensions": ["userAgent"],
"aggregations": [
{ "type": "count", "name": "total", "fieldName": "userAgent" }
],
"intervals": [ "2020-02-25T00:00:00.000/2020-03-25T00:00:00.000" ],
"limitSpec": { "type": "default", "limit": 50, "columns": ["userAgent"] },
"orderBy": {
"dimension" : "total",
"direction" : "descending"
}
}
but instead of aggregating over the full range, it appears to pick an arbitrary time span (e.g. 2020-03-19T14:00:00Z).

If you want results from the entire interval to be combined in a single result entry per user agent, set granularity to all in the query.

A few notes on Druid queries:
You can generate a native query by entering a SQL statement in the management console and selecting the explain/plan option from the three-dot menu next to the Run button.
It's worth confirming expectations that the count query-time aggregator will return the number of database rows (not the number of ingested events). This could be the reason the resulting number is smaller than anticipated.
A granularity of all will prevent bucketing results by hour.
The count aggregator does not take a fieldName, and I'm not sure what behavior (if any) is defined for one, so I would remove that property. The docs:
see: https://druid.apache.org/docs/latest/querying/aggregations.html#count-aggregator
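Putting these notes together, the query might look roughly like this (a sketch: granularity set to all, fieldName dropped from the count aggregator, and the ordering folded into limitSpec to match the SQL's ORDER BY 2 DESC LIMIT 10):
{
  "queryType": "groupBy",
  "dataSource": "page_hour",
  "granularity": "all",
  "dimensions": ["userAgent"],
  "aggregations": [
    { "type": "count", "name": "total" }
  ],
  "intervals": [ "2020-02-25T00:00:00.000/2020-03-25T00:00:00.000" ],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [ { "dimension": "total", "direction": "descending" } ]
  }
}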

Related

Create a Dataset in BigQuery using API

So forgive my ignorance, but I can't seem to work this out.
I want to create a "table" in BigQuery, from an API call.
I am thinking of https://developer.companieshouse.gov.uk/api/docs/search/companies/companysearch.html#here
I want to easily query the Companies House API without writing oodles of code.
And then cross reference that with other datasets - like Facebook API, LinkedIn API.
e.g. I want to input a company ID/name on Companies House and get a fuzzy list of the people and their likely social connections (Facebook, LinkedIn and Twitter).
Maybe BigQuery is the wrong tool for this? Should I just code it?
Or
it is the right tool, and how to add a dataset from an API is just not obvious to me - in which case, please enlighten me.
You will not be able to use BigQuery directly to perform the task at hand. BigQuery is a web service that lets you analyze massive datasets, working in conjunction with Google Cloud Storage (or any other storage system).
The correct way to go about this would be to perform a curl request to collect all the data you require from Companies House and store it as a CSV file. Afterwards you can upload the CSV to Google Cloud Storage and load the data into BigQuery.
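For example, once the CSV is in a bucket, loading it could look roughly like this (a sketch; the bucket, dataset and table names are illustrative):
# Hypothetical names: my-bucket, companies_data.companies_house
bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
  companies_data.companies_house gs://my-bucket/companies.csv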
If you simply wish to link clients from Companies House with social media applications such as Facebook or LinkedIn, then you may not even need BigQuery. You could construct a structured table using Google Cloud SQL, with fields for the necessary client information, and later compare it against the Facebook or LinkedIn API responses.
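As a rough sketch, such a Cloud SQL table might look like this (all column names are illustrative, not prescribed by the answer):
-- Illustrative schema for holding Companies House client data alongside social links
CREATE TABLE companies_house_clients (
  company_number VARCHAR(16) PRIMARY KEY,
  company_name   VARCHAR(255),
  officer_name   VARCHAR(255),
  facebook_url   VARCHAR(512),
  linkedin_url   VARCHAR(512),
  twitter_handle VARCHAR(64)
);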
If you are looking to load data from various sources and run BigQuery operations through the API: yes, there is a way. Adding to the previous answer, BigQuery is meant for analytical queries on big data; if you intend to run thousands of search-style queries joining various tables on big datasets, it will cost you a lot and be slower than a regular search API.
Let's try querying a public dataset through the BigQuery REST API.
To authenticate, you will need to generate an access token using your application default credentials:
gcloud auth print-access-token
Now you can use the token generated by the gcloud command in REST API calls.
POST https://www.googleapis.com/bigquery/v2/projects/<project-name>/queries
Authorization: Bearer <Token>
Body: {
"query": "SELECT tag, SUM(c) c FROM (SELECT CONCAT('stackoverflow.com/questions/', CAST(b.id AS STRING)), title, c, answer_count, favorite_count, view_count, score, SPLIT(tags, '|') tags FROM \`bigquery-public-data.stackoverflow.posts_questions\` a JOIN (SELECT CAST(REGEXP_EXTRACT(text,r'stackoverflow.com/questions/([0-9]+)/') AS INT64) id, COUNT(*) c FROM `fh-bigquery.hackernews.comments` WHERE text LIKE '%stackoverflow.com/questions/%' AND EXTRACT(YEAR FROM time_ts)>=#year GROUP BY 1 ORDER BY 2 DESC) b ON a.id=b.id), UNNEST(tags) tag GROUP BY 1 ORDER BY 2 DESC LIMIT #limit",
"queryParameters": [
{
"parameterType": {
"type": "INT64"
},
"parameterValue": {
"value": "2014"
},
"name": "year"
},
{
"parameterType": {
"type": "INT64"
},
"parameterValue": {
"value": "5"
},
"name": "limit"
}
],
"useLegacySql": false,
"parameterMode": "NAMED"
}
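For instance, the same call could be made with curl (assuming the request body above is saved as query.json; <project-name> is your project ID):
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @query.json \
  "https://www.googleapis.com/bigquery/v2/projects/<project-name>/queries"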
Response:
{
"kind": "bigquery#queryResponse",
"schema": {
"fields": [
{
"name": "tag",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "c",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
},
"jobReference": {
"projectId": "<project-id>",
"jobId": "<job-id>",
"location": "<location>"
},
"totalRows": "5",
"rows": [
{
"f": [
{
"v": "javascript"
},
{
"v": "102"
}
]
},
{
"f": [
{
"v": "c++"
},
{
"v": "90"
}
]
},
{
"f": [
{
"v": "java"
},
{
"v": "57"
}
]
},
{
"f": [
{
"v": "c"
},
{
"v": "52"
}
]
},
{
"f": [
{
"v": "python"
},
{
"v": "49"
}
]
}
],
"totalBytesProcessed": "3848945354",
"jobComplete": true,
"cacheHit": false
}
Query - The most popular tags on Stack Overflow questions linked from Hacker News since 2014:
#standardSQL
SELECT tag, SUM(c) c
FROM (
SELECT CONCAT('stackoverflow.com/questions/', CAST(b.id AS STRING)),
title, c, answer_count, favorite_count, view_count, score, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
JOIN (
SELECT CAST(REGEXP_EXTRACT(text,
r'stackoverflow.com/questions/([0-9]+)/') AS INT64) id, COUNT(*) c
FROM `fh-bigquery.hackernews.comments`
WHERE text LIKE '%stackoverflow.com/questions/%'
AND EXTRACT(YEAR FROM time_ts)>=2014
GROUP BY 1
ORDER BY 2 DESC
) b
ON a.id=b.id),
UNNEST(tags) tag
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5
Result: the top five tags, as shown in the rows of the API response above.
So, we run some of our analytical queries through the API to build periodic reports. I'll let you explore the other options and the BigQuery API for creating datasets and loading data.

How to get documents from MongoDB based on greater or less than the given date

I need to fetch records from MongoDB based on a date. My collection is below.
f_task:
{
"_id": "5a13731f9402cc17f81ade10",
"taskname": "task1",
"description": "description",
"timestamp": "2017-11-21 05:58:14",
"created_by": "subhra",
"taskid": "858fca9e2e153a61515c0372e079c521",
"created_date": "21-11-2017"
}
Here I need to fetch records by created_date. Suppose the user input is 20-11-2017 or 22-11-2017; then I need a query that returns the record if the "created_date" value is greater than or less than the given date.
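A minimal sketch of one way this is often handled (assuming MongoDB 4.0+, where $dateFromString accepts a format string, and using the collection name f_task from the question; the input date 20-11-2017 is hard-coded for illustration):
// Convert the DD-MM-YYYY string in created_date to a Date at query time
// and compare it against the user-supplied date.
db.f_task.find({
  $expr: {
    $gt: [
      { $dateFromString: { dateString: "$created_date", format: "%d-%m-%Y" } },
      { $dateFromString: { dateString: "20-11-2017", format: "%d-%m-%Y" } }
    ]
  }
})
Swap $gt for $lt for the less-than case.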

How to import Edges from CSV with ETL into OrientDB graph?

I'm trying to import edges from a CSV-file into OrientDB. The vertices are stored in a separate file and already imported via ETL into OrientDB.
So my situation is similar to OrientDB import edges only using ETL tool and OrientDB ETL loading CSV with vertices in one file and edges in another.
Update
Friend.csv
"id","client_id","first_name","last_name"
"0","0","John-0","Doe"
"1","1","John-1","Doe"
"2","2","John-2","Doe"
...
The "id" field is removed by the Friend-Importer, but the "client_id" is stored. The idea is to have a known client-side generated id for searching etc.
PeindingFriendship.csv
"friendship_id","client_id","from","to"
"0","0-1","1","0"
"2","0-15","15","0"
"3","0-16","16","0"
...
The "friendship_id" and "client_id" should be imported as attributes of the "PendingFriendship" edge. "from" is a "client_id" of a Friend. "to" is a "client_id" of another Friend.
For "client_id" exists a unique Index on both Friend and PendingFriendship.
My ETL configuration looks like this:
...
"extractor": {
"csv": {
}
},
"transformers": [
{
"command": {
"command": "CREATE EDGE PendingFriendship FROM (SELECT FROM Friend WHERE client_id = '${input.from}') TO (SELECT FROM Friend WHERE client_id = '${input.to}') SET client_id = '${input.client_id}'",
"output": "edge"
}
},
{
"field": {
"fieldName": "from",
"expression": "remove"
}
},
{
"field": {
"fieldName": "to",
"operation": "remove"
}
},
{
"field": {
"fieldName": "friendship_id",
"expression": "remove"
}
},
{
"field": {
"fieldName": "client_id",
"operation": "remove"
}
},
{
"field": {
"fieldName": "#class",
"value": "PendingFriendship"
}
}
],
...
The issue with this configuration is that it creates two edge entries. One is the expected "PendingFriendship" edge. The second one is an empty "PendingFriendship" edge, with all the fields I removed as attributes with empty values.
The import fails, at the second row/document, because another empty "PendingFriendship" cannot be inserted because it violates a uniqueness constraint.
How can I avoid the creation of the unnecessary empty "PendingFriendship" edge?
What is the best way to import edges into OrientDB? All the examples in the documentation use CSV files where vertices and edges are in one file, but this is not the case for me.
I also had a look at the Edge transformer, but it returns a Vertex, not an Edge!
Created PendingFriendships
After some time I found a way (workaround) to import the above data into OrientDB. Instead of using the ETL tool, I wrote simple Ruby scripts that call OrientDB's HTTP API, using the Batch endpoint.
Steps:
1. Import the Friends.
2. Use the response to create a mapping of client_ids to #rids.
3. Parse the PeindingFriendship.csv and build batch requests.
4. Create each friendship with its own command.
5. Use the mapping from step 2 to insert the #rids into the commands from step 4.
6. Send the batch requests in chunks of 1000 commands.
Example Batch-Request body:
{
"transaction" : true,
"operations" : [
{
"type" : "cmd",
"language" : "sql",
"command" : "create edge PendingFriendship from #27:178 to #27:179 set client_id='4711'"
}
]
}
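Such a body can then be posted to OrientDB's batch endpoint, for example with curl (a sketch; the database name, credentials, port and the batch.json file name are illustrative):
curl -u admin:admin -X POST \
  -H "Content-Type: application/json" \
  -d @batch.json \
  "http://localhost:2480/batch/mydb"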
This isn't an answer to the question I asked, but it achieves the higher-level goal of importing the data into OrientDB for me. Therefore I leave it to the community to decide whether to mark this question as solved.

magento 2 rest api product filters

I am working on the Magento 2 API. I need products based on the filters below:
store id
by product name search
shorting by name
category id
add limit
I have tried this API but these options are not available:
index.php/rest/V1/categories/{id}/products
Please can someone suggest how to achieve this?
Thanks
You are looking for the (GET) API /rest/V1/products.
The store is automatically detected from the store code that you pass at the start of the URL. If you have a store with code test, the API call will start with GET /rest/test/V1/products/[...].
You can use the like condition type. Ex.: products with "sample" in their name: ?searchCriteria[filter_groups][0][filters][0][field]=name
&searchCriteria[filter_groups][0][filters][0][value]=%sample%
&searchCriteria[filter_groups][0][filters][0][condition_type]=like
You are looking for sortOrders. Ex.: searchCriteria[sortOrders][0][field]=name. You can also add the sort direction, for example DESC, with searchCriteria[sortOrders][0][direction]=DESC.
Use the category_id field and the eq condition type. Ex.: if you want products from category 10: searchCriteria[filter_groups][0][filters][0][field]=category_id&
searchCriteria[filter_groups][0][filters][0][value]=10&
searchCriteria[filter_groups][0][filters][0][condition_type]=eq
Use searchCriteria[pageSize] and searchCriteria[currentPage]. Ex.: 20 products starting from the 40th (equivalent in SQL to LIMIT 20 OFFSET 40): &searchCriteria[pageSize]=20&searchCriteria[currentPage]=3
Of course you can perform AND and OR operations with filters: filters within the same filter group are combined with OR, while separate filter groups are combined with AND.
{
"filter_groups": [
{
"filters": [
{
"field": "type_id",
"value": "simple",
"condition_type": "eq"
}
]
},
{
"filters": [
{
"field": "category_id",
"value": "611",
"condition_type": "eq"
}
]
}
],
"page_size": 100,
"current_page": 1,
"sort_orders": [
{
"field": "name",
"direction": "ASC"
}
]
}
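Putting several of these together, a single request might look like the sketch below (line breaks added for readability; %25 is the URL-encoded % wildcard; the filter and paging values are the examples from above):
GET /rest/V1/products
  ?searchCriteria[filter_groups][0][filters][0][field]=name
  &searchCriteria[filter_groups][0][filters][0][value]=%25sample%25
  &searchCriteria[filter_groups][0][filters][0][condition_type]=like
  &searchCriteria[filter_groups][1][filters][0][field]=category_id
  &searchCriteria[filter_groups][1][filters][0][value]=10
  &searchCriteria[filter_groups][1][filters][0][condition_type]=eq
  &searchCriteria[sortOrders][0][field]=name
  &searchCriteria[sortOrders][0][direction]=ASC
  &searchCriteria[pageSize]=20
  &searchCriteria[currentPage]=1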

Is there a possibility to have another timestamp as dimension in Druid?

Is it possible to have a Druid datasource with 2 (or more) timestamps in it?
I know that Druid is a time-based DB and I have no problem with the concept, but I'd like to add another dimension that I can work with like a timestamp.
e.g. user retention: the metric is of course tied to a certain date, but I also need to create cohorts based on users' registration dates and roll those dates up to weeks or months, or filter to only certain time periods...
If the functionality is not supported, are there any plug-ins? Any dirty solutions?
Although I'd rather wait for official full support of timestamp dimensions in Druid, I've found the 'dirty' hack I've been looking for.
DataSource Schema
First things first: I wanted to know how many users logged in each day, while being able to aggregate by date/month/year cohorts.
Here's the data schema I used:
"dataSchema": {
"dataSource": "ds1",
"parser": {
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"dimensionsSpec": {
"dimensions": [
"user_id",
"platform",
"register_time"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{ "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" }
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "DAY",
"intervals": ["2015-01-01/2017-01-01"]
}
},
So the sample data should look something like this (each record is a login event):
{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}
As you can see, the "main" timestamp against which I calculate these metrics is the timestamp field, whereas register_time is only a dimension stored as a string in ISO 8601 UTC format.
Aggregating
And now, for the fun part: I've been able to aggregate by timestamp (date) and register_time (date again) thanks to the Time Format Extraction Function.
The query looks like this:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "timeFormat",
"format": "YYYY-MM-dd",
"timeZone": "Europe/Bratislava" ,
"locale": "sk-SK"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
}
Filtering
The solution for filtering is based on the JavaScript Extraction Function, with which I can transform the date to UNIX time and use it inside (for example) a bound filter:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
"platform",
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
"filter": {
"type": "bound",
"dimension": "register_time",
"outputName": "reg_date",
"alphaNumeric": "true"
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
}
I've tried to filter 'directly' with a JavaScript filter, but I haven't been able to convince Druid to return the correct records, even though I've double-checked the function in various JavaScript REPLs; but hey, I'm no JavaScript expert.
Unfortunately Druid has only one timestamp column that can be used for rollup. Currently Druid treats all other columns as strings (except metrics, of course), so you can add another string column holding timestamp values, but the only thing you can do with it is filtering.
I guess you might be able to hack it that way.
Hopefully in the future Druid will allow different column types, and maybe timestamp will be one of them.
Another solution is to add a longMin-type metric for the timestamp and store the epoch time in that field, or to convert the datetime to a number and store that (e.g. 31st March 2021 08:00 becomes 310320210800).
As of Druid 0.22, the documentation states that secondary timestamps should be handled/parsed as dimensions of type long. Secondary timestamps can be parsed to longs at ingestion time with a transformSpec and be transformed back if needed at query time (link).
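A minimal sketch of what that could look like inside the dataSchema (following the example schema above; the name register_time_long is illustrative):
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "register_time_long", "expression": "timestamp_parse(register_time)" }
  ]
},
"dimensionsSpec": {
  "dimensions": [
    "user_id",
    "platform",
    { "type": "long", "name": "register_time_long" }
  ]
}
The resulting long column can then be filtered with an ordinary bound filter, and an expression such as timestamp_format() can turn it back into a readable date at query time.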