Clickhouse materialized view aggregate ghost rows - apache-zookeeper

So I'm using ClickHouse, and here is my current table architecture.
I have a main table containing my data:
CREATE TABLE default.Liquidity
(
`Date` Date,
`LiquidityId` UInt64,
`TreeId_LQ` UInt64,
`AggregateId` UInt64,
`ClientId` UInt64,
`InstrumentId` UInt64,
`IsIn` String,
`Currency` String,
`Scenario` String,
`Price` String,
`Leg` Int8,
`commit` Int64,
`factor` Int8,
`nb_aggregated` UInt64,
`stream_id` Int64
)
ENGINE = Distributed('{cluster}', '', 'shard_Liquidity', TreeId_LQ)
And I also have a materialized view that aggregates the data and stores it in another table:
CREATE MATERIALIZED VIEW default.mv_Liquidity_facet TO default.shard_state_Liquidity_facet
(
`Date` Date,
`TreeId_LQ` UInt64,
`AggregateId` UInt64,
`ClientId` UInt64,
`InstrumentId` UInt64,
`Currency` String,
`Scenario` String,
`commit` Int64,
`factor` Int8,
`nb_aggregated` AggregateFunction(sum, UInt64)
) AS
SELECT
Date,
TreeId_LQ,
AggregateId,
ClientId,
InstrumentId,
Currency,
Scenario,
commit,
factor,
sumState(nb_aggregated) AS nb_aggregated
FROM default.shard_Liquidity
GROUP BY
Date,
TreeId_LQ,
AggregateId,
ClientId,
InstrumentId,
Currency,
Scenario,
commit,
factor
----------------
CREATE TABLE default.shard_state_Liquidity_facet
(
`Date` Date,
`TreeId_LQ` UInt64,
`AggregateId` UInt64,
`ClientId` UInt64,
`InstrumentId` UInt64,
`Currency` String,
`Scenario` String,
`commit` Int64,
`factor` Int8,
`nb_aggregated` AggregateFunction(sum, UInt64)
)
ENGINE = ReplicatedAggregatingMergeTree('{zoo_prefix}/tables/{shard}/shard_state_Liquidity_facet', '{host}')
PARTITION BY Date
ORDER BY (commit, TreeId_LQ, ClientId, AggregateId, InstrumentId, Scenario)
SETTINGS index_granularity = 8192
As you might have guessed, the nb_aggregated column represents the number of rows that were aggregated to achieve this result.
If I run the following query on my Distributed table with a lot of filters in order to find one row:
select
sum(nb_aggregated) AS nb_aggregated
from Liquidity
where Date = '2022-10-17'
and TreeId_LQ = 1129
and AggregateId = 999999999999
and ClientId = 1
and InstrumentId = 593
and Currency = 'AUD'
and Scenario = 'BAU'
and commit = -2695401333399944382
and factor = 1;
--- Result
1
I end up with only one row. Therefore, if I run the same query with the same filters on the aggregated version of my table (the one populated by the materialized view), I should also end up with only one row, with nb_aggregated = 1. However, I end up with nb_aggregated = 2, as if my row had been aggregated with another one, and most of the other values are wrong too.
I understand that my example is hard to follow, but if you have any lead it would be nice.
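For context, the equivalent query on the aggregated table would look roughly like this (a sketch: sumMerge() finalizes the AggregateFunction(sum, UInt64) state, and the filters are copied from the query above):
select
sumMerge(nb_aggregated) AS nb_aggregated
from shard_state_Liquidity_facet   -- querying the shard-local aggregate table directly
where Date = '2022-10-17'
and TreeId_LQ = 1129
and AggregateId = 999999999999
and ClientId = 1
and InstrumentId = 593
and Currency = 'AUD'
and Scenario = 'BAU'
and commit = -2695401333399944382
and factor = 1;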

Well, I asked the same question on the ClickHouse repository on GitHub, and Denny Crane gave me this answer, which works for me:
https://github.com/ClickHouse/ClickHouse/issues/43988#issuecomment-1339731917
In most cases, a MatView's GROUP BY should match the storage table's ORDER BY.
CREATE MATERIALIZED VIEW default.mv_Liquidity_facet:
GROUP BY Date, TreeId_LQ, AggregateId, ClientId, InstrumentId, Currency, Scenario, commit, factor
CREATE TABLE default.shard_state_Liquidity_facet
PARTITION BY Date
ORDER BY (commit, TreeId_LQ, ClientId, AggregateId, InstrumentId, Scenario)
Your ReplicatedAggregatingMergeTree "CORRUPTS" the Currency / factor columns, using the ANY function (an arbitrary value is kept for them when rows collapse).
The solution is:
ORDER BY (commit, TreeId_LQ, ClientId, AggregateId, InstrumentId, Scenario, Currency , factor)
https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
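Applied here, the corrected storage table would look like this (a sketch: same columns, engine, and paths as in the question, only the sorting key is extended with Currency and factor):
CREATE TABLE default.shard_state_Liquidity_facet
(
`Date` Date,
`TreeId_LQ` UInt64,
`AggregateId` UInt64,
`ClientId` UInt64,
`InstrumentId` UInt64,
`Currency` String,
`Scenario` String,
`commit` Int64,
`factor` Int8,
`nb_aggregated` AggregateFunction(sum, UInt64)
)
ENGINE = ReplicatedAggregatingMergeTree('{zoo_prefix}/tables/{shard}/shard_state_Liquidity_facet', '{host}')
PARTITION BY Date
ORDER BY (commit, TreeId_LQ, ClientId, AggregateId, InstrumentId, Scenario, Currency, factor)
SETTINGS index_granularity = 8192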

Related

Not able to create Hive table with TIMESTAMP datatype in Azure Databricks

org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384;
I'm getting the above error while executing the following code in Azure Databricks:
spark_session.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION "/mnt/data_analysis/pre-processed/"
""")
As per the HIVE-6384 Jira, starting from Hive 1.2 you can use the TIMESTAMP and DATE types in Parquet tables.
Workarounds for Hive versions < 1.2:
1. Using String type:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time STRING
)
PARTITIONED BY (
Date STRING)
Stored as parquet
Location '/mnt/data_analysis/pre-processed/';
Then, while processing, you can cast arrival_time and Date to the TIMESTAMP and DATE types (see the cast sketch after this list).
You could also use a view that casts the columns, but views are slow.
2. Using ORC format:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time Timestamp
)
PARTITIONED BY (
Date date)
Stored as orc
Location '/mnt/data_analysis/pre-processed/';
ORC supports both the TIMESTAMP and DATE types.
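For workaround 1, the cast at read time could look roughly like this (a sketch using the table and column names from the DDL above; the exact string format of arrival_time in your files may need a different conversion):
SELECT
campaign,
status,
file_name,
CAST(arrival_time AS TIMESTAMP) AS arrival_time,  -- assumes 'yyyy-MM-dd HH:mm:ss'-style strings
CAST(`Date` AS DATE) AS `Date`
FROM dev_db.processing_table;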

Grafana query for created table in clickhouse

I am trying to see the data from ClickHouse as a graph in Grafana. I tried a lot of query variations, but I wasn't able to get any points to show up in Grafana. My tables look like:
CREATE TABLE m_psutilinfo (timestamp String, namespace String, data Float, unit String, plugin_running_on String, version UInt64, last_advertised_time String) ENGINE = Kafka('10.224.54.99:9092', 'psutilout', 'group3', 'JSONEachRow');
CREATE TABLE m_psutilinfo_t (timestamp DateTime,namespace String,data Float,unit String,plugin_running_on String,version UInt64,last_advertised_time String,DAY Date)ENGINE = MergeTree PARTITION BY DAY ORDER BY (DAY, timestamp) SETTINGS index_granularity = 8192;
CREATE MATERIALIZED VIEW m_psutilinfo_view TO m_psutilinfo_t AS SELECT toDateTime(substring(timestamp, 1, 19)) AS timestamp, namespace, data, unit, plugin_running_on, version, last_advertised_time, toDate(timestamp) AS DAY FROM m_psutilinfo;
These are the tables I created in ClickHouse. What should my query in Grafana be to get the data as a graph? I currently have:
SELECT
$timeSeries as t,
count()
FROM $table
WHERE $timeFilter
GROUP BY t
ORDER BY t
I used Tabix, but I want this in Grafana.
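For reference, a plain ClickHouse query of roughly this shape (run directly, without Grafana's macros) returns a per-minute series from m_psutilinfo_t; in Grafana, the $timeFilter and $timeSeries macros take over the time range and bucketing, so the hard-coded three-hour window below is only a sketch:
SELECT
toStartOfMinute(timestamp) AS t,
count() AS points
FROM m_psutilinfo_t
WHERE DAY >= toDate(now() - 10800)   -- prune partitions for the last 3 hours
AND timestamp >= now() - 10800
GROUP BY t
ORDER BY t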

How do I load data correctly in Hive using spark?

I want to load data which looks like this:
"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""
My create table statement is:
sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")
When I run the statement:
sqlContext.sql("from dummy11 select age").show()
OR
sqlContext.sql("from dummy11 select y").show()
It returns NULL values instead of the correct values, though the other columns' values are visible.
So how do I correct this?
As you are using Hive QL syntax, you need to validate the input data before processing it.
In your data, some records have fewer columns than the number of columns defined in the DDL.
So, for those records, the remaining columns (counted from the end) are set to NULL, because the row does not have enough values.
That's why the last column, y, has NULL values.
Also, in the DDL the first field's data type is INT, but in the records the first field's values are:
"58
"44
"33
Because of the leading ", these values cannot be cast to INT, so the field is set to NULL.
As per the DDL and the data you provided, the values are assigned as follows:
age             "58
job             ""management""
marital         ""married""
education       ""tertiary""
default         ""no""
housing         2143
loan            ""yes""
contact         ""no""
month           ""unknown""
day_of_week     5
duration        ""may""
campaign        261
pday            1
previous        -1
poutcome        0
emp_var_rate    ""unknown""
cons_price_idx  ""no""
cons_conf_idx   NULL
euribor3m       NULL
nr_employed     NULL
y               NULL
Note the NULL values for the last 4 columns.
So, if that is not expected, you need to validate the data before proceeding.
And for the column age, if you need it as INT, cleanse the data to remove the unwanted " character.
WORKAROUND
As a workaround, you can define age as STRING to begin with, and use Spark transformations to parse the first field and convert it to INT:
import org.apache.spark.sql.functions._

// UDF that strips the leading " and parses the remainder as an integer
val ageInINT = udf { (make: String) =>
  Integer.parseInt(make.substring(1))
}

df.withColumn("ageInINT", ageInINT(df("age"))).show
Here df is your DataFrame created from the Hive DDL with the column age as STRING.
Now you can operate on the new column ageInINT, which holds integer values, rather than on the column age.
Since your data contains " just before the age, it is treated as a string. In your code you have defined it as int, so the SQL parser tries to read an integer value and you get NULL instead. Change age int to age string and you will be able to see the result.
Please see the working example below, using Spark's HiveContext.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// set up the Spark and Hive contexts (a default SparkConf is assumed here)
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

sqlContext.sql("create external table dummy11(age string, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string) row format delimited fields terminated by ';' location '/user/skumar143/stack/'")
sqlContext.sql("select age, job from dummy11").show()
Its output:
+---+----------------+
|age| job|
+---+----------------+
|"58| ""management""|
|"44| ""technician""|
|"33|""entrepreneur""|
|"47| ""blue-collar""|
+---+----------------+

Specified partition columns do not match the partition columns of the table.Please use () as the partition columns

Here I'm trying to persist a DataFrame into a partitioned Hive table and I'm getting this silly exception. I have looked into it many times but I'm not able to find the fault.
org.apache.spark.sql.AnalysisException: Specified partition columns
(timestamp value) do not match the partition columns of the table.
Please use () as the partition columns.;
Here is the script with which the external table was created:
CREATE EXTERNAL TABLE IF NOT EXISTS events2 (
action string
,device_os_ver string
,device_type string
,event_name string
,item_name string
,lat DOUBLE
,lon DOUBLE
,memberid BIGINT
,productupccd BIGINT
,tenantid BIGINT
) partitioned BY (timestamp_val DATE)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 'maprfs:///location/of/events2'
tblproperties ('serialization.null.format' = '');
Here is the result of describe formatted for the table "events2":
hive> describe formatted events2;
OK
# col_name data_type comment
action string
device_os_ver string
device_type string
event_name string
item_name string
lat double
lon double
memberid bigint
productupccd bigint
tenantid bigint
# Partition Information
# col_name data_type comment
timestamp_val date
# Detailed Table Information
Database: default
CreateTime: Wed Jan 11 16:58:55 IST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/location/of/events2
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
serialization.null.format
transient_lastDdlTime 1484134135
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.078 seconds, Fetched: 42 row(s)
Here is the line of code where the data is partitioned and stored into the table:
val tablepath = Map("path" -> "maprfs:///location/of/events2")
AppendDF.write.format("parquet").partitionBy("Timestamp_val").options(tablepath).mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("events2")
While running the application, I'm getting the error below:
Specified partition columns (timestamp_val) do not match the partition
columns of the table.Please use () as the partition columns.
I might be making an obvious error; any help is much appreciated and will get an upvote :)
Please print the schema of the DataFrame:
AppendDF.printSchema()
and make sure there is no type mismatch.

PostgreSQL access to generate_series() cells

I'm generating a date series via PostgreSQL's generate_series(min, max) in the following way:
SELECT
generate_series(getstartdate(some arguments)
, getenddate(some arguments), interval '1 day')
FROM taskresults
getstartdate() and getenddate() each return the start and end date of a given task. I have the additional tables Employees(employeeid, taskid, worktime) and Tasks(taskid, startdate, enddate).
My goal is to get each employee's working time grouped by each day from my generated series. How can I perform this join? Note that I do not have direct access to the columns startdate and enddate in the table Tasks; I can only access the dates via the functions mentioned above. The worktime is in hours per day, so I have to aggregate it via SUM() for each task the employee works on for the given date in the series. The problem is that I don't know how to access a date in the generated series.
EDIT
Data structures:
CREATE TABLE employees
(
employeeid serial NOT NULL,
firstname character varying(32),
lastname character varying(32),
qualification character varying(32),
incomeperhour numeric
)
CREATE TABLE employeetasks
(
projectid integer,
taskid integer,
employeeid integer,
hoursperday real
)
CREATE TABLE taskresults
(
simulationid integer,
taskid integer,
duration integer
)
CREATE TABLE tasks
(
projectid integer NOT NULL,
taskname character varying(32),
startdate character varying(32),
enddate character varying(32),
predecessor integer,
minduration integer,
maxduration integer,
taskid integer
)
Some explanation:
The whole database is for simulation: at first you define a task schedule (in the table tasks) and then run the simulation, which inserts the results into taskresults. As you can see, I only store the duration in the results; that's why I can only access the date ranges for each task with the getstartdate / getenddate functions. The table employeetasks basically assigns employees from the employees table to tasks, with the number of hours they work on that task per day.
You can JOIN on the generated series like anything else.
INNER JOIN generate_series(getstartdate(some arguments), getenddate(some arguments), interval '1 day') workday ON (...)
The join condition is hard to work out without knowing how your data is stored.
Also, your data structure is weird. Employees have a "taskid"? n:1 employees -> task? I can't really write a full query because I don't fully get the data structure.
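For illustration, a minimal sketch of such a join, assuming the per-day hours live in employeetasks.hoursperday and that getstartdate() / getenddate() take a taskid (their real arguments are not shown in the question):
-- one row per employee and day, summing the hours of every task
-- that the employee is assigned to and that is active on that day
SELECT
  et.employeeid,
  gs.workday::date AS workday,
  SUM(et.hoursperday) AS hours_worked
FROM taskresults tr
JOIN employeetasks et ON et.taskid = tr.taskid
CROSS JOIN LATERAL generate_series(
  getstartdate(tr.taskid),   -- hypothetical argument: the task id
  getenddate(tr.taskid),
  interval '1 day') AS gs(workday)
GROUP BY et.employeeid, gs.workday::date
ORDER BY et.employeeid, workday;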