Wondering about the Hive QL stdev_pop -- this query works:
select ip, sqrt(variance(time)) as stdev from logs group by ip
However, std deviation (which should be the same as above) gives an error:
select ip, stdev_pop(time) as stdev from logs group by ip;
Error while compiling statement: FAILED: SemanticException [Error 10025]:
Line 1:19 Expression not in GROUP BY key 'time'
I tried the stdev_sample as well, with the same error.
Does anybody know why Hive would expect a group-by on time?
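Just a guess, in case it helps: the built-in Hive aggregates I'm aware of are spelled with a double "d" (stddev_pop, stddev_samp), so stdev_pop may simply not be resolving to an aggregate function. Something along these lines might be worth trying:
-- hedged sketch: the same query using the double-"d" built-in names
select ip, stddev_pop(time) as stdev from logs group by ip;
select ip, stddev_samp(time) as stdev from logs group by ip;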
I am trying to execute the following query to get the number of rows in the table:
SELECT SUM (row_count)
FROM sys.partitions
WHERE object_id=OBJECT_ID('Transactions')
AND (index_id=0 or index_id=1);
But I am getting an error saying that relation "sys.partitions" does not exist. Any suggestions for how to get the partitions available in the system?
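For what it's worth, the wording relation "sys.partitions" does not exist reads like a PostgreSQL error, and sys.partitions is SQL Server catalog metadata with no direct PostgreSQL equivalent. A minimal sketch, assuming the database really is PostgreSQL and the table was created with a lower-case name:
-- exact row count
SELECT count(*) FROM transactions;
-- fast approximate count from the planner statistics
SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'transactions';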
I've imported a demo dataset in QuestDB and I can query it successfully from the console. I'm using Grafana to build a dashboard to test visualization.
My QuestDB installation is running on port 9000 and I can import it without any issues:
curl -F data=@weather.csv http://localhost:9000/imp
I'm running the following query which is failing:
SELECT timestamp as time,
avg(visMiles) AS average_visibility
FROM 'weather.csv'
WHERE $__timeFilter(timestamp)
SAMPLE BY $__interval
LIMIT 1000
The error I get is
pq: unknown function name: between(TIMESTAMP,STRING,STRING)
I'm using a dataset provided in their examples.
QuestDB relies on a designated timestamp, which is specified during table creation. The query would not cause an error if one had been provided with the curl request as a URL parameter, given a column named 'timestamp':
curl -F data=@weather.csv http://localhost:9000/imp?timestamp=timestamp
Alternatively, the timestamp() function can designate one dynamically during a SELECT operation. If you've imported using curl and have not set a designated timestamp, there are two options:
Modify your query to use timestamp() on the column you want to designate:
SELECT timestamp as time,
avg(visMiles) AS average_visibility
FROM ('weather.csv' timestamp(timestamp))
WHERE $__timeFilter(timestamp)
SAMPLE BY $__interval
LIMIT 1000
Create a new table which is a copy of your original dataset but designate a timestamp during creation. ORDER BY is used because the demo dataset has unordered timestamp entries:
create table temp_table as (select * from 'weather.csv' order by timestamp) timestamp(timestamp);
And instead of querying your original dataset, use the temp_table:
SELECT timestamp as time,
avg(visMiles) AS average_visibility
FROM temp_table
WHERE $__timeFilter(timestamp)
SAMPLE BY $__interval
LIMIT 1000
If you need more info on the use of designated timestamps, the QuestDB concepts / timestamp docs page has further details.
Edit: There are some more resources related to this topic, such as a guide for Grafana with QuestDB and a GitHub repo with docker-compose.
Queries I can easily run in SQL Workbench just don't work in Tableau - it's literally one Java error after another... rant over.
One thing I've noticed is that Tableau keeps trying to wrap an additional SELECT, which Athena doesn't recognise. I thought I could overcome this using Athena views, but that doesn't seem to work either.
When I do the following in Tableau:
SELECT count(distinct uuid), category
FROM "pregnancy_analytics"."final_test_parquet"
GROUP BY category
I get the following in Athena, which throws an error - SYNTAX_ERROR: line 1:8: Column 'tableausql._col0' cannot be resolved. As I say, it looks like Tableau is trying to "nest" the SELECT:
SELECT "TableauSQL"."_col0" AS "xcol0"
FROM (
SELECT count(distinct uuid)
FROM "pregnancy_analytics"."final_test_parquet"
WHERE category = ''
LIMIT 100
) "TableauSQL"
LIMIT 10000
NB: The error, as I said above, arises because Tableau sticks another SELECT around this, pointing at a table that doesn't exist, and as such Athena kicks up an error.
Starting to feel like Tableau is not a good fit with Athena? Is there a better suggestion, perhaps?
Thanks!
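In case anyone else hits this: one workaround I've seen suggested (not verified against this exact setup) is to give every computed column in the custom SQL an explicit alias, so the SELECT that Tableau wraps around it has a named column to reference instead of Athena's auto-generated _col0:
-- hypothetical tweak: alias the aggregate explicitly
SELECT count(distinct uuid) AS distinct_uuids, category
FROM "pregnancy_analytics"."final_test_parquet"
GROUP BY category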
I'm running queries on a Redshift cluster using DataGrip that take upwards of 10 hours to run, and unfortunately these often fail. Alas, DataGrip doesn't maintain a connection to the database long enough for me to see the error message with which the queries fail.
Is there a way of retrieving these error messages later, e.g. using internal Redshift tables? Alternatively, is there a way to make DataGrip maintain the connection for long enough?
Yes, you can!
Query the stl_connection_log table to find the pid: look at the recordtime column for when your connection was initiated; the dbname, username, and duration columns also help to narrow it down.
select * from stl_connection_log order by recordtime desc limit 100
Once you have found the pid, you can query the stl_query table to check that you are looking at the right query.
select * from stl_query where pid='XXXX' limit 100
Then, check the stl_error table for your pid. This will tell you the error you are looking for.
select * from stl_error where pid='XXXX' limit 100
If I’ve made a bad assumption please comment and I’ll refocus my answer.
I am trying to pivot a column which has more than 10000 distinct values. The default limit in Spark for the maximum number of distinct values is 10000, and I am receiving this error:
The pivot column COLUMN_NUM_2 has more than 10000 distinct values, this could indicate an error. If this was intended, set spark.sql.pivotMaxValues to at least the number of distinct values of the pivot column
How do I set this in PySpark?
You have to add / set this parameter in the Spark interpreter configuration.
I am working with Zeppelin notebooks on an EMR (AWS) cluster; I had the same error message as you, and it worked after I added the parameter in the interpreter settings.
Hope this helps...
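For reference, a minimal PySpark sketch (assuming you can configure the SparkSession yourself rather than going through an interpreter setting; the value 20000 is only illustrative and should be at least the number of distinct values in your pivot column):
# set it when building the session
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("pivot-example")
    .config("spark.sql.pivotMaxValues", "20000")
    .getOrCreate()
)
# or, since it is a runtime SQL conf, change it on an existing session
spark.conf.set("spark.sql.pivotMaxValues", 20000)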