Parallel/Redundant Replication in CouchDB

I have multiple CouchDB servers I want to keep in sync with each other, and I use these servers to share large files (e.g. >100 MB). To keep them synchronized, I have each CouchDB instance do a continuous pull replication from each other instance.
Here's an example: I have three CouchDB servers A, B, & C, all of which have continuous pull replications from each other, like so:
 -------  <-------------  -------
 |  A  |  ------------->  |  B  |
 -------                  -------
  ^  |                     |  ^
  |  |                     |  |
  |  v                     |  |
 -------  <-----------------  |
 |  C  |  ---------------------
 -------
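(For concreteness, each of those continuous pull replications is just a replication document on the pulling node. A minimal sketch of how C's two pulls could be created; the host names, credentials and the database name "shared" below are placeholders, not my actual setup:)

# On server C: pull continuously from A and from B into the local "shared" database
curl -X PUT http://admin:secret@localhost:5984/_replicator/pull-from-a \
     -H 'Content-Type: application/json' \
     -d '{"source": "http://a.example.com:5984/shared", "target": "shared", "continuous": true}'

curl -X PUT http://admin:secret@localhost:5984/_replicator/pull-from-b \
     -H 'Content-Type: application/json' \
     -d '{"source": "http://b.example.com:5984/shared", "target": "shared", "continuous": true}'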
Someone uploads a document to server A with a 500MB attachment. B and C both start replicating the document from A, and B finishes the replication before C does:
 -------      doc      -------
 |  A  | ------------> |  B  |
 -------               -------
    |
    | doc
    v
 -------
 |  C  |
 -------
My question is, will C then start replicating the same document from B (since C also has a continuous pull replication from B), while it is still transferring the document from A?
 -------                -------
 |  A  |                |  B  |
 -------                -------
    |                      |
    | doc              doc |
    |  ---------------------
    |  |
    v  v
 -------
 |  C  |
 -------
I would guess this happens, since as far as I know CouchDB replication doesn't actually write the replicated documents to the target (using the _bulk_docs API) until the documents, including attachments, have been fully fetched from the source [1]. I am worried about this because it would be redundant and a big waste of bandwidth.
[1] https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm

According to a recent discussion on the CouchDB users mailing list and to the document describing the replication algorithm, the replicator knows which attachments are already present on the target. If, however, the attachments are very large and both replications start before either of them has finished, the attachment will be transferred multiple times.
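The check behind this is CouchDB's _revs_diff call: before fetching a document, the pulling side asks its own database which revisions it is still missing and skips anything already stored. A minimal sketch of that call (database name, document ID and revision are placeholders):

# Ask the target database which revisions of "bigdoc" it does not yet have
curl -X POST http://localhost:5984/shared/_revs_diff \
     -H 'Content-Type: application/json' \
     -d '{"bigdoc": ["2-7051cbe5c8faecd085a3fa619e6e6337"]}'
# An empty {} response means the revision is already stored and will not be fetched again;
# a revision that is still in flight is not yet visible to this check, hence the duplicate transfer.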

Related

Run Kafka with podman on multiple hosts

Suppose I have three VMs/hosts which have podman installed:
- each of them runs 1 ZooKeeper instance as a container
- each VM/host runs 2 Kafka instances as containers (6 in total)
- my notebook will use the cluster as Producer/Consumer
The network/service layout I am trying to create:
 VM/host 1     VM/host 2     VM/host 3    My Dev Comp
------------  ------------  ------------  ------------
| zoo 1    |  | zoo 2    |  | zoo 3    |  | my       |
| kafka 1-2|  | kafka 3-4|  | kafka 5-6|  | fancy    |
|          |  |          |  |          |  | services |
------------  ------------  ------------  ------------
|----------same-----------|----------local-----------|------network------|
I am trying to run each container rootless because of security restrictions. With Docker this is of course easy, though you end up running a daemon as root. I tried to understand slirp4netns and containernetworking, but I could not find an answer to my question because I do not understand how they work in great depth.
What is the way to create the setup above, i.e. which commands are required to get this blueprint running?
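One possible direction, sketched only under assumptions (host IPs 10.0.0.1-3, the official zookeeper image, the bitnami/kafka image and all environment variables below are placeholders, not a verified recipe): skip container networking entirely and publish each container's ports on the host, so the services reach each other via host_ip:port, which rootless podman allows for unprivileged ports (>= 1024).

# Sketch for VM/host 1 (assumed IP 10.0.0.1); hosts 2 and 3 are analogous.

# ZooKeeper ensemble member 1
podman run -d --name zoo1 -p 2181:2181 -p 2888:2888 -p 3888:3888 \
  -e ZOO_MY_ID=1 \
  -e ZOO_SERVERS="server.1=0.0.0.0:2888:3888;2181 server.2=10.0.0.2:2888:3888;2181 server.3=10.0.0.3:2888:3888;2181" \
  docker.io/library/zookeeper:3.8

# Kafka broker 1; broker 2 on this host would publish e.g. 9093 instead
podman run -d --name kafka1 -p 9092:9092 \
  -e ALLOW_PLAINTEXT_LISTENER=yes \
  -e KAFKA_CFG_BROKER_ID=1 \
  -e KAFKA_CFG_ZOOKEEPER_CONNECT=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 \
  -e KAFKA_CFG_LISTENERS=PLAINTEXT://0.0.0.0:9092 \
  -e KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://10.0.0.1:9092 \
  docker.io/bitnami/kafka:3.4

The producer/consumer on the dev machine would then bootstrap against 10.0.0.1:9092,10.0.0.2:9092,10.0.0.3:9092.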

How does JasperReports Server store report output internally?

There are a few ways to store report output in JasperReports Server: file system, FTP, and the repository. Repository output is the default. I guess the files in the repository must be stored in the DB or the file system. Are the files kept forever? How can I manage the repository and, for example, set a file's lifetime?
The repository outputs are stored in the database. Usually there is no need to set the lifetime.
As of JasperReports Server v6.3.0, the reference to every resource is kept in the jiresource table, while the content itself is kept in jicontentresource.
In my case I was able to retrieve all output reports with:
select r.id,r.name,r.creation_date
from jiresource r, jicontentresource c
where r.id = c.id;
The definition of jicontentresource is
jasperserver=# \d+ jicontentresource
  Column   |         Type          | Modifiers | Storage  |
-----------+-----------------------+-----------+----------+
 id        | bigint                | not null  | plain    |
 data      | bytea                 |           | extended |
 file_type | character varying(20) |           | extended |
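To see how much space those stored outputs actually take before cleaning anything up, a hedged sketch against a PostgreSQL-backed repository (the database name and the LIMIT are assumptions):

# List the 20 largest stored report outputs with their creation dates
psql -d jasperserver -c "
  select r.id, r.name, r.creation_date, octet_length(c.data) as bytes
  from jiresource r
  join jicontentresource c on c.id = r.id
  order by bytes desc
  limit 20;"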

Start OrientDB without user input

I'm attempting to start OrientDB in distributed mode on AWS.
I have an auto scaling group that creates new nodes as needed. When the nodes are created, they start with a default config without a node name. The idea is that the node name is generated randomly.
My problem is that the server starts up and asks for user input.
+---------------------------------------------------------------+
| WARNING: FIRST DISTRIBUTED RUN CONFIGURATION |
+---------------------------------------------------------------+
| This is the first time that the server is running as |
| distributed. Please type the name you want to assign to the |
| current server node. |
| |
| To avoid this message set the environment variable or JVM |
| setting ORIENTDB_NODE_NAME to the server node name to use. |
+---------------------------------------------------------------+
Node name [BLANK=auto generate it]:
I can't hard-code a node name because each node needs a random one, and the server never starts because it is waiting for user input.
Is there a parameter I can pass to dserver.sh that will skip this prompt and generate a random node name?
You could create a random string to pass to OrientDB as node name with the ORIENTDB_NODE_NAME variable. Example:
ORIENTDB_NODE_NAME=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1)
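For example, exporting it just before launching the distributed server (a sketch; the path to dserver.sh depends on your installation):

# Generate a random node name and start the server with it
export ORIENTDB_NODE_NAME=$(tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 32 | head -n 1)
./bin/dserver.sh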
For more information about this, look at: https://gist.github.com/earthgecko/3089509

Visualizing time-series from a SQL Database (Postgres)

I am building an app that applies a data science model to a SQL database of sensor metrics. For this purpose I chose PipelineDB (based on Postgres), which lets me build a continuous view on my metrics and apply the model to each new row.
For now, I just want to observe the metrics I collect from the sensor on a dashboard. The table "metrics" looks like this:
+---------------------+--------+---------+------+-----+
| timestamp           | T (°C) | P (bar) | n    | ... |
+---------------------+--------+---------+------+-----+
| 2015-12-12 20:00:00 | 20     | 1.13    | 0.9  |     |
| 2015-12-13 20:00:00 | 20     | 1.132   | 0.9  |     |
| 2015-12-14 20:00:00 | 40     | 1.131   | 0.96 |     |
+---------------------+--------+---------+------+-----+
I'd like to build a dashboard on which I can see all my metrics evolving through time, and even be able to choose which columns to display.
I found a few tools that could match my need, namely Grafana and Chronograf for InfluxDB.
But neither of them lets me plug directly into Postgres and query my table to generate the metric-formatted data these tools require.
Do you have any advice on what I should do to use such dashboards with such data?
A bit late here, but Grafana now supports PostgreSQL data sources directly: https://grafana.com/docs/features/datasources/postgres. I've used it in several projects and it has been really easy to set up and use.
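If you provision dashboards from scripts, the data source can also be registered through Grafana's HTTP API; a minimal sketch (URL, credentials, database name and user are placeholders):

# Register a PostgreSQL data source in Grafana
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "metrics-db",
        "type": "postgres",
        "access": "proxy",
        "url": "db.example.com:5432",
        "database": "metrics",
        "user": "grafana_reader",
        "secureJsonData": { "password": "secret" },
        "jsonData": { "sslmode": "disable" }
      }'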

Google Cloud SQL VERY SLOW

I am thinking of migrating my website to Google Cloud SQL and I signed up for a free account (D32).
Upon testing on a table with 23k records, the performance was very poor, so I read that moving from the free account to a full paid account would give me access to faster CPU and HDD... so I did.
Performance is still VERY POOR.
I have been running my own MySQL server for years now, upgrading as needed to handle more and more connections and to gain raw speed (needed because of a legacy application). I heavily optimize tables and configuration and make heavy use of the query cache, etc.
A few pages of our legacy system run over 1.5k queries per page. Currently I am able to push the total MySQL query time (execution and fetching of the data) down to 3.6 seconds for all those queries, meaning that MySQL takes about 0.0024 seconds per query to execute and return the values: not the greatest, but acceptable for those pages.
I uploaded a table involved in many of those queries to Google Cloud SQL. I noticed that an INSERT already takes seconds to execute instead of milliseconds, but I thought it might be the sync vs. async setting. I changed it to async and the execution time for the insert didn't seem to change. Not a big problem for now; I am only testing queries.
I ran a simple SELECT * FROM <table> and noticed that it took over 6 seconds. I thought that maybe the query cache needed to build, so I tried again and this time it took 4 seconds (excluding network traffic). I ran the same query on my backup server after a restart and with no connections at all, and it took less than 1 second; running it again, 0.06 seconds.
Maybe the problem is the cache and the result set is too big... let's try a smaller subset:
select * from <table> limit 5;
to my server: 0.00 seconds
GCS: 0.04
So I decided to try a trivial SELECT on an empty table with no records at all, just created with a single field:
to my server: 0.00 seconds
GCS: 0.03
Profiling doesn't give any insight, except that the query cache is not running on Google Cloud SQL and that query execution seems faster but... is not.
My Server:
mysql> show profile;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000225 |
| Waiting for query cache lock | 0.000116 |
| init | 0.000115 |
| checking query cache for query | 0.000131 |
| checking permissions | 0.000117 |
| Opening tables | 0.000124 |
| init | 0.000129 |
| System lock | 0.000124 |
| Waiting for query cache lock | 0.000114 |
| System lock | 0.000126 |
| optimizing | 0.000117 |
| statistics | 0.000127 |
| executing | 0.000129 |
| end | 0.000117 |
| query end | 0.000116 |
| closing tables | 0.000120 |
| freeing items | 0.000120 |
| Waiting for query cache lock | 0.000140 |
| freeing items | 0.000228 |
| Waiting for query cache lock | 0.000120 |
| freeing items | 0.000121 |
| storing result in query cache | 0.000116 |
| cleaning up | 0.000124 |
+--------------------------------+----------+
23 rows in set, 1 warning (0.00 sec)
Google Cloud SQL:
mysql> show profile;
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000061 |
| checking permissions | 0.000012 |
| Opening tables | 0.000115 |
| System lock | 0.000019 |
| init | 0.000023 |
| optimizing | 0.000008 |
| statistics | 0.000012 |
| preparing | 0.000005 |
| executing | 0.000021 |
| end | 0.000024 |
| query end | 0.000007 |
| closing tables | 0.000030 |
| freeing items | 0.000018 |
| logging slow query | 0.000006 |
| cleaning up | 0.000005 |
+----------------------+----------+
15 rows in set (0.03 sec)
Keep in mind that I connect to both servers remotely from a server located in VA, while my own server is located in Texas (even if it should not matter that much).
What am I doing wrong? Why do simple queries take this long? Am I missing or not understanding something here?
As of right now I won't be able to use Google Cloud SQL, because a page with 1500 queries would take way too long (circa 45 seconds).
I know this question is old but....
Cloud SQL has poor support for MyISAM tables; it's recommended to use InnoDB.
We had poor performance when migrating a legacy app; after reading through the docs and contacting the paid support, we had to migrate the tables to InnoDB. The lack of a query cache was also a killer.
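For reference, the conversion itself is one statement per table; a sketch (host, credentials, database and table name are placeholders, and large tables take a while to rebuild):

# Convert a single table from MyISAM to InnoDB
mysql -h <cloud-sql-ip> -u root -p mydb -e "ALTER TABLE mytable ENGINE=InnoDB;"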
You may also find later on that you need to tweak the MySQL configuration via the 'flags' in the Google console. One example: 'wait_timeout' is set too high by default (IMO).
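Database flags can be changed from the console or with gcloud; a sketch (the instance name and the value are assumptions):

# Set wait_timeout on a Cloud SQL instance; note that --database-flags replaces
# the instance's whole set of custom flags, so include any others you rely on
gcloud sql instances patch my-instance --database-flags wait_timeout=300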
Hope this helps someone :)
The query cache is not yet a feature of Cloud SQL, which may explain the results. However, I recommend closing this question as it is quite broad and doesn't fit the format of a neat and tidy Q&A: there are too many variables not mentioned here, and it isn't clear what a decisive answer to such a general optimization question would look like.