Cassandra: Making range queries

I understand that you can make range queries on column names.
Example: Get all columns whose names are between 100-200.
While I have found many examples on how to create a column-family in such a way, I have not found an example of making such a query in CLI or CQL.
I am looking for something like: GET journals['bob'] WHERE column-names BETWEEN 100 AND 200
Does such a statement exist in CLI or CQL?
Or do I have to resort to thrift?

In CQL the query would be:
select 100..200 from journals where name = 'bob';
Note that this syntax is changing in CQL 3.0 to something like:
select value from journals where name = 'bob' and column > 100 and column < 200;
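For completeness, a minimal CQL 3 sketch of the same idea, assuming a journals table keyed by name with a numeric clustering column (here called seq as a stand-in for whatever the real column name is):

-- Hypothetical schema; table and column names are illustrative
CREATE TABLE journals (
    name text,
    seq bigint,
    value text,
    PRIMARY KEY (name, seq)
);

-- Returns the columns whose names fall between 100 and 200 for row 'bob'
SELECT seq, value FROM journals WHERE name = 'bob' AND seq >= 100 AND seq <= 200;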

Related

Setting column values as column names in the Flink SQL query result

I would like to read a table that has values that will be the column names of the Flink SQL query result. For example, I have t1 as
name value
----------
sp_1 100
sp_2 200
sp_3 300
... ...
Now I want the result of the query to look like this (t2):
sp_1 sp_2 sp_3 ...
100 200 300
Assume all the sp_* have been created in t2.
Is it possible to achieve it through Flink SQL?
Flink version: 1.13.6
I believe this would be possible using something like the PIVOT and UNPIVOT functions, which, at the time of writing, are not yet supported in Flink SQL. You can track https://issues.apache.org/jira/browse/FLINK-23179 for updates.
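As a stopgap, if the set of sp_* keys is known up front, you can emulate the pivot with conditional aggregation, which Flink SQL does support. A minimal sketch, assuming the table and column names from your example:

-- Manual pivot: one conditional aggregate per expected key
SELECT
  MAX(CASE WHEN name = 'sp_1' THEN `value` END) AS sp_1,
  MAX(CASE WHEN name = 'sp_2' THEN `value` END) AS sp_2,
  MAX(CASE WHEN name = 'sp_3' THEN `value` END) AS sp_3
FROM t1;

This only works when the target columns are fixed; a dynamic column set still needs PIVOT support or post-processing outside SQL.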

JPA: Group by and select function in Postgres

I have a simple reporting query group by id and day that looks like the following:
select id,
       avg(case when name = 'temp' then value end) as average_temp,
       date_trunc('day', timestamp) as day
from data
group by id, day
order by id;
The query basically needs to show the average daily temperature for each asset.
The user is able to specify a bunch of different aggregation functions beyond just 'average'; the above is only a simple example. For example: avg temp, max temp, max speed, etc.
I'm trying to translate that into JPA as follows:
CriteriaQuery<Object[]> query = criteriaBuilder.createQuery(Object[].class);
Root<AssetMetricDataPoint> root = query.from(AssetMetricDataPoint.class);
List<Selection<?>> selectionList = getSelections(aggregationQuery, criteriaBuilder, root);
Expression<Instant> groupDate = criteriaBuilder.function("date_trunc", Instant.class, criteriaBuilder.literal("day"), root.get("timestamp"));
selectionList.add(groupDate.alias("day"));
query.multiselect(selectionList);
query.where(getWherePredicates(aggregationQuery, criteriaBuilder, root));
query.orderBy(getOrderBy(aggregationQuery, criteriaBuilder, root));
query.groupBy(root.get("id"), groupDate);
return this.setupPagination(entityManager.createQuery(query), aggregationQuery);
I'm using criteriaBuilder.function to group by the date. However, when I execute the query using JPA I get the following exception:
org.postgresql.util.PSQLException: ERROR: column "data0_.timestamp" must appear in the GROUP BY clause or be used in an aggregate function
This appears to occur because the query is parameterized and Postgres doesn't realize that the 'day' parameter that appears in both the select and group by clauses is the same.
Is there any way around this? Can I somehow bake in the 'day' value so it's not sent as a parameter? Or some other method?
In the end there's a relatively simple solution to the problem. Rather than grouping by the expression, I thought I'd try to group by the alias instead: criteriaBuilder.literal("day"). This didn't work, however, with Postgres complaining about a non-integer literal.
I then realised I could group by a positional integer instead, which in my case ended up looking like:
query.groupBy(root.get("id"), criteriaBuilder.literal(selectionList.size()));
This works as expected.
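To see why this works, it helps to look at the rough shape of the SQL it produces: the integer literal is rendered inline, and Postgres treats a bare integer in GROUP BY as an ordinal reference into the select list. Assuming the select list keeps the order id, aggregate, day, it is effectively:

select d.id,
       avg(case when d.name = 'temp' then d.value end) as average_temp,
       date_trunc('day', d.timestamp) as day
from data d
group by d.id, 3   -- 3 = the third select-list item, i.e. the date_trunc expression
order by d.id;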

PostgreSQL query has = operators inside SELECT column references, what does this syntax mean?

SQLAlchemy generated the following query for me:
SELECT count(client.id = user_accounting_journal_entry.client_id) AS count_1
FROM client, user_accounting_journal_entry
WHERE user_accounting_journal_entry.kind = 'debit'
GROUP BY client.name = user_accounting_journal_entry.client_id
Note the part inside select: count(client.id = user_accounting_journal_entry.client_id).
Having mostly used MySQL, I am not familiar with this syntax, and have a hard time finding documentation.
You should be familiar with the syntax from MySQL, at least in this form:
select sum(client.id = user_accounting_journal_entry.client_id)
This would count the number of matches.
The count() version counts the number of times that the expression is not NULL. Or equivalently, it counts the number of times that both values are not NULL, which seems like a strange thing to want. More commonly in Postgres, I would expect:
select sum((client.id = user_accounting_journal_entry.client_id)::int)
This converts the boolean to an integer and hence counts the number of matches.
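A quick way to see the difference between the two aggregates on a throwaway row set:

-- count() counts non-NULL comparison results (true and false alike);
-- sum(...::int) counts only the true ones
SELECT count(a = b)      AS non_null_comparisons,  -- 3
       sum((a = b)::int) AS matches                -- 2
FROM (VALUES (1, 1), (2, 3), (4, 4), (5, NULL)) AS t(a, b);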
The query itself is awful:
It doesn't use proper join syntax
The join conditions between the tables don't look correct (a name to an id)
It is grouping by a boolean condition
In addition, it doesn't look like it is doing anything really useful.
count(client.id = user_accounting_journal_entry.client_id) counts the number of times the expression is not NULL.

Thinking Sphinx indexing performance

I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.
I saw this question, but can't find documentation about using source: :query, which seems to be part of the solution.
My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612
How can I optimize the generated query to run faster during indexing?
The 'standard' sphinx solution to this would be to use ranged queries.
http://sphinxsearch.com/docs/current.html#ex-ranged-queries
... splitting up the query into lots of small parts, so the database server has a better chance of being able to run the query (rather than one huge query)
But I have no idea how to actually enable that in Thinking Sphinx; I can't see anything in the documentation. I could help you edit the sphinx.conf directly, but I'm also not sure how TS will cope with you manually editing the config file.
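For reference, the ranged-query pattern in the linked Sphinx docs looks roughly like this in a hand-written sphinx.conf source (table name and step size here are illustrative):

sql_query_range = SELECT MIN(id), MAX(id) FROM incidents
sql_range_step  = 1000
sql_query       = SELECT * FROM incidents WHERE id >= $start AND id <= $end

Sphinx then runs sql_query repeatedly with successive $start/$end windows instead of one huge statement.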
This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main sql_query and define it separately as an sql_joined_field in the sphinx.conf file.
It's important to add all relevant sql conditions to each sql_joined_field (such as sharding indexes by modulo on the ID). Here's the new definition:
ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: false,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end

ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: true,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end
The magic that defines the field site as a separate query is the option source: :query at the end of the line.
Notice the core index definition has the parameter delta?: false, while the delta index definition has the parameter delta?: true. That's so I could use the condition WHERE incidents.delta = 1 in the delta index and filter out irrelevant records.
I found sharding didn't perform any better, so I reverted to one unified index.
See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.
Important to remember!
The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document ID will change, and the hard-coded offset must be updated to match.
So, in my example, if I added an index for a different model (not :incident), I would have to run rake ts:configure to find out my new offset and change incidents.id * 51 + 7 accordingly.

Converting complex query with inner join to tableau

I have a query like this, which we use to generate data for our custom dashboard (a Rails app):
SELECT AVG(wait_time) FROM (
SELECT TIMESTAMPDIFF(MINUTE,a.finished_time,b.start_time) wait_time
FROM (
SELECT max(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
FROM mytable
WHERE name IN ('test_name')
AND status = 'SUCCESS'
GROUP by branch) a
INNER JOIN
(
SELECT MIN(start_time) start_time, branch
FROM mytable
WHERE name IN ('test_name_specific')
GROUP by branch) b
ON a.branch = b.branch
HAVING avg_time_spent between 0 and 1000)t
GROUP BY week
Now I am trying to port this to Tableau, and I am not able to find a way to represent this data in Tableau. I am stuck on how to represent the inner GROUP BY in a calculated field. I could also just use a custom SQL data source, but I am already using another data source.
Columns in mytable:
start_time
avg_time_spent
name
branch
status
I think this could be achieved with the new Level of Detail expressions, but unfortunately I am stuck on version 8.3.
Save custom SQL for rare cases; this doesn't look like one of them. Let Tableau generate the SQL for you.
If you simply connect to your table, then you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified example of a similar calculation.
If you define a calculated field called worst_case_test_time as something like
DATEDIFF('minute', MIN(start_time), MAX(DATEADD('second', INT(avg_time_spent), start_time)))
that seems close to what your original query computes.
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst-case bound for average test time. There may be an even simpler formula, but it's hard to know without a little context.
You could filter on status = "Success" and avg_time_spent < 1000, and place branch and WEEK(start_time) on, say, the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?