Enabling Parallelization in Spark with Partition Pushdown in MemSQL - scala

I have a columnstore table in MemSQL with a schema similar to the one below:
CREATE TABLE key_metrics (
source_id TEXT,
date TEXT,
metric1 FLOAT,
metric2 FLOAT,
…
SHARD KEY (source_id, date) USING CLUSTERED COLUMNSTORE
);
I have a Spark application (running with Spark Job Server) that queries the MemSQL table. Below is a simplified form of the DataFrame operation I am doing (in Scala):
sparkSession
  .read
  .format("com.memsql.spark.connector")
  .options(Map("path" -> "dbName.key_metrics"))
  .load()
  .filter(col("source_id").equalTo("12345678"))
  .filter(col("date").isin(Seq("2019-02-01", "2019-02-02", "2019-02-03"): _*))
I have confirmed by looking at the physical plan that these filter predicates are being pushed down to MemSQL.
I have also checked that there is a pretty even distribution of the partitions in the table:
+---------------+-------------+--------------+--------+------------+
| DATABASE_NAME | TABLE_NAME  | PARTITION_ID | ROWS   | MEMORY_USE |
+---------------+-------------+--------------+--------+------------+
| dbName        | key_metrics |            0 | 784012 |          0 |
| dbName        | key_metrics |            1 | 778441 |          0 |
| dbName        | key_metrics |            2 | 671606 |          0 |
| dbName        | key_metrics |            3 | 748569 |          0 |
| dbName        | key_metrics |            4 | 622241 |          0 |
| dbName        | key_metrics |            5 | 739029 |          0 |
| dbName        | key_metrics |            6 | 955205 |          0 |
| dbName        | key_metrics |            7 | 751677 |          0 |
+---------------+-------------+--------------+--------+------------+
My question is regarding partition pushdown. It is my understanding that with it, we can use all the cores of the machines and leverage parallelism for bulk loading. According to the docs, this is done by creating as many Spark tasks as there are MemSQL database partitions.
However, when I run the Spark pipeline and observe the Spark UI, only one Spark task is created, and it issues a single query to the database that runs on a single core.
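For what it's worth, these are the checks I run from the job itself to confirm the pushdown and the parallelism (a sketch that uses only the standard Spark API; sparkSession is the same session as above):
import org.apache.spark.sql.functions.col

// same DataFrame as in the snippet above
val df = sparkSession.read
  .format("com.memsql.spark.connector")
  .options(Map("path" -> "dbName.key_metrics"))
  .load()
  .filter(col("source_id").equalTo("12345678"))
  .filter(col("date").isin(Seq("2019-02-01", "2019-02-02", "2019-02-03"): _*))

// the scan node of the physical plan should list the pushed-down predicates
df.explain(true)

// with partition pushdown active I would expect this to match the number of MemSQL partitions (8 here)
println(df.rdd.getNumPartitions)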
I have made sure that the following properties are set as well:
spark.memsql.disablePartitionPushdown = false
spark.memsql.defaultDatabase = "dbName"
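In case it matters, this is roughly how those properties are applied (a sketch; I am assuming the keys can be set through the session builder as shown):
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("key-metrics-reader") // hypothetical app name
  .config("spark.memsql.disablePartitionPushdown", "false")
  .config("spark.memsql.defaultDatabase", "dbName")
  .getOrCreate()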
Is my understanding of partition pushdown incorrect? Is there some other configuration that I am missing?
Would appreciate your input on this.
Thanks!

SingleStore credentials have to be the same on all nodes to take advantage of partition pushdown.
If the credentials are already identical across all nodes, try installing the latest version of the Spark connector, because this kind of behaviour often comes down to compatibility issues between the Spark connector and SingleStore.
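As a rough illustration only (the format name and option keys are assumed from the newer singlestore-spark-connector, and the endpoint and credentials are placeholders), reading the same table through the 3.x connector with explicit, identical credentials would look something like this:
// sketch only; verify the option names against the connector version you install
val df = sparkSession.read
  .format("singlestore")                                     // assumed 3.x data source name
  .option("ddlEndpoint", "singlestore-master.internal:3306") // hypothetical master aggregator endpoint
  .option("user", "app_user")                                // hypothetical credentials, identical on all nodes
  .option("password", "app_password")
  .option("database", "dbName")
  .load("key_metrics")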

Related

How to compare two tables in KSQL using sub query?

I've created two tables in ksqlDB. Both of them have a similar structure.
|ID  |TIMESTAMP   |PAYLOAD                   |
+----+------------+--------------------------+
|"1" |1664248879  |{"ID":"1","channel":"1"}  |
|"2" |1664248879  |{"ID":"2","channel":"2"}  |
|"5" |1664248879  |{"ID":"5","channel":"3"}  |
|"6" |1664248879  |{"ID":"6","channel":"6"}  |
Now I need to find the difference between the two tables. I've tried a few queries and learned that sub-queries are not allowed in ksqlDB. Is it possible to achieve this with a KSQL query?
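One direction that avoids sub-queries is a table-table LEFT JOIN that keeps only the rows with no match. A minimal sketch, assuming both tables are keyed on ID and using TABLE_A / TABLE_B as placeholders for your table names:
-- rows whose ID exists in TABLE_A but not in TABLE_B
CREATE TABLE only_in_a AS
  SELECT a.ID, a.PAYLOAD
  FROM TABLE_A a
  LEFT JOIN TABLE_B b ON a.ID = b.ID
  WHERE b.ID IS NULL
  EMIT CHANGES;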

Timescaledb: retention policy isn't removing data from hypertable

(Note: I've also posted this as a GitHub issue: https://github.com/timescale/timescaledb/issues/3653)
I have a hypertable request_logs configured with a 24 hour retention policy. The retention policy is reported as running successfully; however, no old data is being removed from the table, which continues to grow day by day.
I checked the PostgreSQL log files and don't see any errors.
I could use additional guidance on where to look to troubleshoot this issue.
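For context, a retention policy of this kind is normally attached with the standard TimescaleDB 2.x call, roughly:
-- attach a 24 hour retention policy to the hypertable
SELECT add_retention_policy('request_logs', INTERVAL '24 hours');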
request_logs table structure
\d+ request_logs;
Table "public.request_logs"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------+--------------------------+-----------+----------+---------+---------+--------------+-------------
time | timestamp with time zone | | not null | | plain | |
referer | bigint | | | | plain | |
useragent | bigint | | | | plain | |
Indexes:
"request_logs_time_idx" btree ("time" DESC)
Triggers:
ts_insert_blocker BEFORE INSERT ON request_logs FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_1_37_chunk,
_timescaledb_internal._hyper_1_38_chunk,
_timescaledb_internal._hyper_1_40_chunk
Access method: heap
This is the hypertable description retrieved by running select * from _timescaledb_catalog.hypertable;
id | schema_name | table_name | associated_schema_name | associated_table_prefix | num_dimensions | chunk_sizing_func_schema | chunk_sizing_func_name | chunk_target_size | compression_state | compressed_hypertable_id | replication_factor
----+-------------+--------------+------------------------+-------------------------+----------------+--------------------------+--------------------------+-------------------+-------------------+--------------------------+--------------------
1 | public | request_logs | _timescaledb_internal | _hyper_1 | 1 | _timescaledb_internal | calculate_chunk_interval | 0 | 0 | |
(1 row)
This is the retention policy's job status, retrieved by running SELECT * FROM timescaledb_information.job_stats;
hypertable_schema | hypertable_name | job_id | last_run_started_at | last_successful_finish | last_run_status | job_status | last_run_duration | next_start | total_runs | total_successes | total_failures
-------------------+-----------------+--------+-------------------------------+-------------------------------+-----------------+------------+-------------------+-------------------------------+------------+-----------------+----------------
public | request_logs | 1002 | 2021-10-05 23:59:01.601404+00 | 2021-10-05 23:59:01.638441+00 | Success | Scheduled | 00:00:00.037037 | 2021-10-06 23:59:01.638441+00 | 8 | 8 | 0
| | 1 | 2021-10-05 08:38:20.473945+00 | 2021-10-05 08:38:21.153468+00 | Success | Scheduled | 00:00:00.679523 | 2021-10-06 08:38:21.153468+00 | 45 | 45 | 0
(2 rows)
Relevant system information:
OS: Ubuntu 20.04.3 LTS
PostgreSQL version (output of postgres --version): 12
TimescaleDB version (output of \dx in psql): 2.4.1
Installation method: the apt install process described at https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/ubuntu/installation-apt-ubuntu/#installation-apt-ubuntu
It looks as though this might be related to a bug that has been fixed in version 2.4.2 of TimescaleDB. The GitHub report has been updated; if you find that the issue remains after upgrading, please update the issue on GitHub with your example. Thanks for reporting!
Transparency: I work for Timescale
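If it is useful while checking your own setup, here is a quick sketch (using the standard TimescaleDB 2.x views) to confirm the installed extension version and the retention job attached to the hypertable:
-- installed extension version
SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';

-- retention job and its configuration for this hypertable
SELECT job_id, proc_name, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_retention'
  AND hypertable_name = 'request_logs';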

Handling ksqlDB v0.11 composite key (tables) to replicate in MySQL using JDBC Sink connector

I'm using ksqlDB version 0.11 (I cannot upgrade to a newer version at the moment) and want to replicate a TABLE's data into MySQL using the JDBC Sink connector. ksqlDB v0.11 does not support multiple TABLE keys, and my data needs to be grouped using multiple GROUP BY expressions.
Using this statement I create the table:
CREATE TABLE estads AS SELECT
STID AS stid,
ASIG AS asig,
COUNT(*) AS np,
MIN(NOTA) AS min,
MAX(NOTA) AS max,
AVG(NOTA) AS med,
LATEST_BY_OFFSET(FECHREG) AS fechreg
FROM estads_stm GROUP BY stid, asig EMIT CHANGES;
The resulting table has the following schema:
Name : ESTADS
Field | Type
---------------------------------------------
KSQL_COL_0 | VARCHAR(STRING) (primary key)
NP | BIGINT
MIN | DOUBLE
MAX | DOUBLE
MED | DOUBLE
FECHREG | VARCHAR(STRING)
As you can see, the two primary keys (stid and asig) have been merged into a field called KSQL_COL_0, which is the expected behavior for version 0.11. The problem is that I need to use the JDBC Sink connector to replicate the data into a MySQL table with the following schema:
+---------+--------------+------+-----+-------------------+-----------------------------+
| Field   | Type         | Null | Key | Default           | Extra                       |
+---------+--------------+------+-----+-------------------+-----------------------------+
| stid    | varchar(15)  | NO   | PRI | NULL              |                             |
| asig    | varchar(10)  | NO   | PRI | NULL              |                             |
| np      | smallint(6)  | YES  |     | NULL              |                             |
| min     | decimal(5,2) | YES  |     | NULL              |                             |
| max     | decimal(5,2) | YES  |     | NULL              |                             |
| med     | decimal(5,2) | YES  |     | NULL              |                             |
| fechreg | timestamp    | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+---------+--------------+------+-----+-------------------+-----------------------------+
I don't know how to "unmerge" the automatically generated KSQL_COL_0 in order to tell JDBC that both stid and asig are primary keys in the MySQL table. Any ideas how to manage this? I know that since ksqlDB version 0.15 this is no longer a problem, as ksqlDB tables support multiple keys, but as I said, upgrading is not an option in my case.
Thanks!
I figured it out.
Basically, you need to use the AS_VALUE() function in the table creation query. This way you copy the values of both key columns into new value columns, while the newly generated primary key still gets its own column. Then simply configure the JDBC Sink connector to pick up all the columns except the newly generated primary key (a sketch of such a connector configuration follows the query below).
CREATE TABLE estads AS SELECT
STID AS k1,
ASIG AS k2,
AS_VALUE(STID) AS stid,
AS_VALUE(ASIG) AS asig,
COUNT(*) AS np,
MIN(NOTA) AS min,
MAX(NOTA) AS max,
AVG(NOTA) AS med,
LATEST_BY_OFFSET(FECHREG) AS fechreg
FROM estads_stm GROUP BY k1, k2 EMIT CHANGES;
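For completeness, the sink side then looks roughly like the sketch below (the connector name, connection URL and credentials are placeholders, the backing topic is assumed to be named ESTADS, and the property names are the standard Kafka Connect JDBC sink options):
CREATE SINK CONNECTOR estads_mysql_sink WITH (
  'connector.class'     = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'      = 'jdbc:mysql://mysql-host:3306/mydb',
  'connection.user'     = 'mysql_user',
  'connection.password' = 'mysql_password',
  'topics'              = 'ESTADS',
  'insert.mode'         = 'upsert',
  'pk.mode'             = 'record_value',
  'pk.fields'           = 'stid,asig',
  'fields.whitelist'    = 'stid,asig,np,min,max,med,fechreg',
  'auto.create'         = 'false'
);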

How do I list all streams and continuous views in pipelinedb?

In pipelinedb I can't seem to locate a way to list all of the streams and continuous views that I've created.
I can back into the CVs by looking for the "mrel" tables that are created but it's kind of clunky.
Is there a system table or view I can query that will list them?
You may have an older version of pipelinedb, or you may be looking at an older version of the docs.
You can check your version with psql like so:
pipeline=# select * from pipeline_version();
pipeline_version
-----------------------------------------------------------------------------------------------------------------------------------------------------------
PipelineDB 0.9.0 at revision b1ea9ab6acb689e6ed69fb26af555ca8d025ebae on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4, 64-bit
(1 row)
In the latest version, information about views can be obtained like so:
pipeline=# select * from pipeline_views();
id | schema | name | query
----+--------+------+-----------------------
11 | public | cv | SELECT x::integer, +
| | | count(*) AS count+
| | | FROM ONLY s +
| | | GROUP BY x::integer
(1 row)
Information about streams can be obtained like so:
pipeline=# select * from pipeline_streams();
schema | name | inferred | queries | tup_desc
--------+------+----------+---------+----------------------------------------
public | s | t | {cv} | \x000000017800000006a4ffffffff00000000
(1 row)
More information can be obtained by using \d+:
pipeline=# \d+ cv
Continuous view "public.cv"
Column | Type | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
x | integer | | plain |
count | bigint | | plain |
View definition:
SELECT x::integer,
count(*) AS count
FROM ONLY s
GROUP BY x::integer;
pipeline=# \d+ s
Stream "public.s"
Column | Type | Storage
-------------------+-----------------------------+---------
arrival_timestamp | timestamp(0) with time zone | plain
It's easy: just run
select * from pipeline_streams();
to see the pipeline streams; in the output you can see which stream feeds which continuous views.
Edit:
The snippet above only applies to the 0.9.x versions of PipelineDB. Since version 1.x, PipelineDB is a PostgreSQL extension and streams are foreign tables, so you can list them with:
psql -c "\dE[S+];"
This command shows all foreign tables in psql (i.e. the streams in PipelineDB).
For more information: http://docs.pipelinedb.com/streams.html
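If you prefer plain SQL over psql meta-commands, the same listing is available through the standard information_schema view (a sketch):
-- foreign tables, i.e. streams in PipelineDB 1.x
SELECT foreign_table_schema, foreign_table_name
FROM information_schema.foreign_tables
ORDER BY 1, 2;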

Vacuum does not reclaim disk space

I have a fact table with 9.5M records.
The table uses DISTSTYLE KEY and is hosted on a Redshift cluster with 2 "small" nodes.
I ran many UPDATE and DELETE operations on the table, and as expected, the "real" number of rows is well above 9.5M.
So I ran VACUUM on the table, and to my surprise, after it finished I still see that the number of rows the table allocates did not come back down to 9.5M.
Could you please advise what might be the reason for this behaviour?
What would be the best way to solve it?
A little bit of copy-pastes from my shell:
The fact table I was talking about:
select count(1) from tbl_facts;
9597184
The "real" number of records in the DB:
select * from stv_tbl_perm where id= 332469;
slice | id | name | rows | sorted_rows | temp | db_id | insert_pristine | delete_pristine
-------+--------+--------------------------------------------------------------------------+----------+-------------+------+--------+-----------------+-----------------
0 | 332469 | tbl_facts | 24108360 | 24108360 | 0 | 108411 | 0 | 1
2 | 332469 | tbl_facts | 24307733 | 24307733 | 0 | 108411 | 0 | 1
3 | 332469 | tbl_facts | 24370022 | 24370022 | 0 | 108411 | 0 | 1
1 | 332469 | tbl_facts | 24597685 | 24597685 | 0 | 108411 | 0 | 1
3211 | 332469 | tbl_facts | 0 | 0 | 0 | 108411 | 3 | 0
(Altogether that is almost 100M records.)
Thanks a lot!
I think you need to run ANALYZE on that particular fact table. ANALYZE updates the statistics linked to the table after you run VACUUM (or any other command that changes the row count).
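A minimal sketch of the sequence I would try (standard Redshift commands; VACUUM FULL removes deleted rows and re-sorts, and ANALYZE then refreshes the statistics):
vacuum full tbl_facts;
analyze tbl_facts;
-- re-check the per-slice row counts afterwards
select * from stv_tbl_perm where id = 332469;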
Do let us know whether this was the case or not (I do not have a table handy where I can test this out) :-)