PGSync does not synchronise new data - postgresql

I want to synchronize three tables from a PostgreSQL database to a self-hosted Elasticsearch, and I use PGSync to do so.
To build this stack, I followed this tutorial.
When I start the Docker containers everything works well (except for some errors in PGSync, but that is expected because the tables don't exist yet). After that, I restore my database from a dump (the tables have approximately 30,000, 9,000,000 and 13,000,000 rows). After the restore, PGSync detects the new rows in the database and syncs them to Elasticsearch.
My problem is that after that first synchronisation, PGSync detects new rows:
Polling db cardpricetrackerprod: 61 item(s)
Polling db cardpricetrackerprod: 61 item(s)
but they are never synchronised to Elasticsearch.
Here is what my schema looks like:
[
{
"database": "mydb",
"index": "elastic-index-first-table",
"nodes": {
"table": "first_table",
"schema": "public",
"columns": [
"id",
...
]
}
},
{
"database": "mydb",
"index": "elastic-index-second-table",
"nodes": {
"table": "second_table",
"schema": "public",
"columns": [
"id",
...
]
}
},
{
"database": "mydb",
"index": "elastic-index-third-table",
"nodes": {
"table": "third_table",
"schema": "public",
"columns": [
"id",
...
]
}
}
]
Have I missed a configuration step?

Related

JDBC sink topic with multiple structs to postgres

I am trying to sink a few topics to a Postgres database. However, the topic schema defines an array at the top level containing multiple structs. Automapping does not work and I cannot find any reference on how to handle this. I need all the structs because they are dependent types: the second struct references the first struct as a field.
Currently it breaks on the second struct, stating that statusChangeEvent (struct) has no mapping to a SQL column type. This is because it uses auto.create to make a table (probably called ProcessStatus), and when it hits the second entry there is of course no matching column.
[
{
"type": "record",
"name": "processStatus",
"namespace": "company.some.process",
"fields": [
{
"name": "code",
"doc": "The code of the processStatus",
"type": "string"
},
{
"name": "name",
"doc": "The name of the processStatus",
"type": "string"
},
{
"name": "description",
"type": "string"
},
{
"name": "isCompleted",
"type": "boolean"
},
{
"name": "isSuccessfullyCompleted",
"type": "boolean"
}
]
},
{
"type": "record",
"name": "StatusChangeEvent",
"namespace": "company.some.process",
"fields": [
{
"name": "contNumber",
"type": "string"
},
{
"name": "processId",
"type": "string"
},
{
"name": "processVersion",
"type": "int"
},
{
"name": "extProcessId",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "fromStatus",
"type": "process.status"
},
{
"name": "toStatus",
"doc": "The new status of the process",
"type": "company.some.process.processStatus"
},
{
"name": "changeDateTime",
"type": "long",
"logicalType": "timestamp-millis"
},
{
"name": "isPublic",
"type": "boolean"
}
]
}
]
I am not using ksql at the moment. Which connector settings are suited for this task? If there is a ksql alternative it would be nice to know, but the current requirement is to use the JDBC connector.
I tried using Flatten, but it does not support struct fields that have a schema, which seems kind of weird. Aren't schemas the whole selling point of Connect with Kafka? Or is it more of a constraint you have to work around?
"Aren't schemas the whole selling point of Connect with Kafka?"
Yes, but Postgres (or the JDBC Sink, in general) doesn't really support nested objects within columns. For that, you're better off with a document database, for example MongoDB with the Mongo Sink Connector.
"Which connector settings are suited for this task?"
None, really, other than transforms. You could write your own if flatten doesn't work.
You could try pre-defining your table to use JSONB for the two status columns, however, that's more of a workaround.
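To make the "write your own flatten" idea concrete, here is a minimal Python sketch of the flattening logic only. It is not Kafka Connect code and does not use any Connect API: the function name `flatten_record`, the `_` separator, and the sample event values are all assumptions made for illustration. It shows how a nested status struct could be turned into prefixed flat columns that a relational table can hold.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # A nested struct becomes several prefixed columns.
            flat.update(flatten_record(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical StatusChangeEvent value, shaped like the schema in the question.
event = {
    "contNumber": "C-1",
    "processId": "P-1",
    "toStatus": {"code": "DONE", "name": "Done", "isCompleted": True},
}

flat_event = flatten_record(event)
```

A real transform would have to do the same walk over the Connect schema as well as the value, which is the part the built-in Flatten reportedly refuses to do here.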

Adding a custom tag based on topicName (wildcard) via using JmxTrans to send Kafka JMX to influxDb

Basically, what I wanted to achieve was to get the MessagesInPerSec metric for every topic in Kafka and to add a custom tag, topicName, in InfluxDB, so that I can query by topic rather than by the 'ObjDomain' definition. Below is my JmxTrans configuration (note the wildcard for the topic, used to fetch the MessagesInPerSec JMX attributes for all topics):
{
"servers": [
{
"port": "9581",
"host": "192.168.43.78",
"alias": "kafka-metric",
"queries": [
{
"outputWriters": [
{
"@class": "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url": "http://192.168.43.78:8086/",
"database": "kafka",
"username": "admin",
"password": "root"
}
],
"obj": "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=*",
"attr": [
"Count",
"MeanRate",
"OneMinuteRate",
"FiveMinuteRate",
"FifteenMinuteRate"
],
"resultAlias": "newTopic"
}
],
"numQueryThreads": 2
}
]
}
which yields the following result in InfluxDB:
[name=newTopic, time=1589425526087, tags={attributeName=FifteenMinuteRate,
className=com.yammer.metrics.reporting.JmxReporter$Meter, objDomain=kafka.server,
typeName=type=BrokerTopicMetrics,name=MessagesInPerSec,topic=backblaze_smart},
precision=MILLISECONDS, fields={FifteenMinuteRate=1362.9446063537794, _jmx_port=9581
}]
and creates a tag with the whole objDomain specified in the config, but I wanted to have the topic as a separate tag, something like this:
[name=newTopic, time=1589425526087, tags={attributeName=FifteenMinuteRate,
className=com.yammer.metrics.reporting.JmxReporter$Meter, objDomain=kafka.server,
topic=backblaze_smart,
typeName=type=BrokerTopicMetrics,name=MessagesInPerSec,topic=backblaze_smart},
precision=MILLISECONDS, fields={FifteenMinuteRate=1362.9446063537794, _jmx_port=9581
}]
I was not able to find any adequate documentation on how to use the wildcard value of topic as a separate tag with jmxtrans when writing to InfluxDB.
You just need to add the following additional properties to the InfluxDB output writer. Just make sure you are using the latest jmxtrans release. The docs are here: https://github.com/jmxtrans/jmxtrans/wiki/InfluxDBWriter
"typeNames": ["topic"],
"typeNamesAsTags": "true"
Here is your config with the above modifications:
{
"servers": [
{
"port": "9581",
"host": "192.168.43.78",
"alias": "kafka-metric",
"queries": [
{
"outputWriters": [
{
"@class": "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url": "http://192.168.43.78:8086/",
"database": "kafka",
"username": "admin",
"password": "root",
"typeNames": ["topic"],
"typeNamesAsTags": "true"
}
],
"obj": "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=*",
"attr": [
"Count",
"MeanRate",
"OneMinuteRate",
"FiveMinuteRate",
"FifteenMinuteRate"
],
"resultAlias": "newTopic"
}
],
"numQueryThreads": 2
}
]
}
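As a rough illustration of what the two settings are meant to achieve, the Python sketch below splits a typeName string like the one in the question into key/value pairs and keeps only the keys listed in typeNames. This is an assumption about the intent, not jmxtrans's actual implementation; the helper names `parse_type_name` and `select_tags` are hypothetical.

```python
from typing import Dict, List


def parse_type_name(type_name: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict. Assumes values contain no commas."""
    pairs = (item.split("=", 1) for item in type_name.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def select_tags(type_name: str, type_names: List[str]) -> Dict[str, str]:
    """Keep only the keys listed in type_names, so each becomes its own tag."""
    parsed = parse_type_name(type_name)
    return {key: parsed[key] for key in type_names if key in parsed}


# typeName value taken from the InfluxDB output shown in the question.
type_name = "type=BrokerTopicMetrics,name=MessagesInPerSec,topic=backblaze_smart"
tags = select_tags(type_name, ["topic"])
```

With `["topic"]` as the list, the only extracted tag is `topic`, which is the separate tag the question asks for.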

Copying 6 column table to 7 column table

I'm porting SQL Server Integration Services packages to Azure Data Factory.
I have two tables (Table 1 and Table 2) which live on different servers. One has seven columns, the other six. I followed the example at https://learn.microsoft.com/en-us/azure/data-factory/data-factory-map-columns
Table 1 DDL:
CREATE TABLE dbo.Table1
(
zonename nvarchar(max),
propertyname nvarchar(max),
basePropertyid int,
dfp_ad_unit_id bigint,
MomentType nvarchar(200),
OperatingSystemName nvarchar(50)
)
Table 2 DDL
CREATE TABLE dbo.Table2
(
ZoneID int IDENTITY,
ZoneName nvarchar(max),
propertyName nvarchar(max),
BasePropertyID int,
dfp_ad_unit_id bigint,
MomentType nvarchar(200),
OperatingSystemName nvarchar(50)
)
In ADF, I define Table 1 as:
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
"name": "Table1",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "PlatformX",
"structure": [
{ "name": "zonename" },
{ "name": "propertyname" },
{ "name": "basePropertyid" },
{ "name": "dfp_ad_unit_id" },
{ "name": "MomentType" },
{ "name": "OperatingSystemName" }
],
"external": true,
"typeProperties": {
"tableName": "Platform.Zone"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
In ADF I define Table 2 as:
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
"name": "Table2",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "BrixDW",
"structure": [
{ "name": "ZoneID" },
{ "name": "ZoneName" },
{ "name": "propertyName" },
{ "name": "BasePropertyID" },
{ "name": "dfp_ad_unit_id" },
{ "name": "MomentType" },
{ "name": "OperatingSystemName" }
],
"external": true,
"typeProperties": {
"tableName": "staging.DimZone"
},
"availability": {
"frequency": "Day",
"interval": 1
}
}
}
As you can see, Table2 has an identity column, which will be populated automatically.
This should be a simple Copy activity:
{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Pipeline.json",
"name": "Copy_Table1_to_Table2",
"properties": {
"description": "Copy_Table1_to_Table2",
"activities": [
{
"name": "Copy_Table1_to_Table2",
"type": "Copy",
"inputs": [
{ "name": "Table1" }
],
"outputs": [
{
"name": "Table2"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from dbo.Table1"
},
"sink": {
"type": "SqlSink"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "zonename: ZoneName, propertyname: propertyName, basePropertyid: BasePropertyID, dfp_ad_unit_id: dfp_ad_unit_id, MomentType: MomentType, OperatingSystemName: OperatingSystemName"
}
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 3,
"timeout": "01:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
}
}
],
"start": "2017-07-23T00:00:00Z",
"end": "2020-07-19T00:00:00Z"
}
}
I figured by not mapping ZoneID, it would just be ignored. But ADF is giving me the following error.
Copy activity encountered a user error: GatewayNodeName=APP1250S,ErrorCode=UserErrorInvalidColumnMappingColumnCountMismatch,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: 'zonename: ZoneName, propertyname: propertyName, basePropertyid: BasePropertyID, dfp_ad_unit_id: dfp_ad_unit_id, MomentType: MomentType, OperatingSystemName: OperatingSystemName', Detailed message: Different column count between target structure and column mapping. Target column count:7, Column mapping count:6. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'
In a nutshell, I'm trying to copy a 6-column table into a 7-column table (the extra column being the identity) and Data Factory doesn't like it. How can I accomplish this task?
I realize this is an old question, but I ran into this issue just now. My problem was that I initially generated the destination/sink table, created a pipeline, and then added a column.
Despite clearing and reimporting the schemas, whenever triggering the pipeline, it would throw the above error. I made sure the new column (which has a default on it) was deselected in the mappings, so it would only use the default value. The error was still thrown.
The only way I managed to get things to work was by completely recreating the pipelines from scratch. It's almost as if the old mappings are retained somewhere in the metadata.
I had the exact same issue and solved it by going into the Azure dataset and removing the identity column, then making sure I had the same number of columns in my source and target (sink). After doing this, the copy adds the records and the identity column in the table just works as expected. I did not have to modify the physical table in SQL, only the dataset for the table in Azure.
One option would be to create a view over the 7-column table which does not include the identity column and insert into that view.
CREATE VIEW bulkLoad.Table2
AS
SELECT
ZoneName,
propertyName,
BasePropertyID,
dfp_ad_unit_id,
MomentType,
OperatingSystemName
FROM dbo.Table2
GO
I can do some digging and see if some trick is possible with the column mapping but that should unblock you.
HTH
I was told by MSFT support to just remove the identity column from the table definition. It seems to have worked.
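The count mismatch in the error message can be reproduced offline. The Python sketch below is not ADF code; it only parses the columnMappings string from the question and compares it with the Table2 dataset structure, to show which declared target column has no mapping. The helper names are hypothetical.

```python
from typing import Dict, List


def parse_column_mappings(mappings: str) -> Dict[str, str]:
    """Parse 'src: dst, src2: dst2' into {src: dst}."""
    result: Dict[str, str] = {}
    for pair in mappings.split(","):
        source, target = pair.split(":", 1)
        result[source.strip()] = target.strip()
    return result


def unmapped_target_columns(structure: List[str], mappings: str) -> List[str]:
    """Return target columns declared in the dataset structure but absent from the mapping."""
    mapped = set(parse_column_mappings(mappings).values())
    return [column for column in structure if column not in mapped]


# Mapping string and Table2 structure copied from the question.
column_mappings = (
    "zonename: ZoneName, propertyname: propertyName, basePropertyid: BasePropertyID, "
    "dfp_ad_unit_id: dfp_ad_unit_id, MomentType: MomentType, "
    "OperatingSystemName: OperatingSystemName"
)
table2_structure = ["ZoneID", "ZoneName", "propertyName", "BasePropertyID",
                    "dfp_ad_unit_id", "MomentType", "OperatingSystemName"]

extra = unmapped_target_columns(table2_structure, column_mappings)
# Dropping the identity column from the dataset structure makes the counts agree.
fixed_structure = [c for c in table2_structure if c not in extra]
```

Here `extra` contains only the identity column, which matches the advice above: remove ZoneID from the Table2 dataset definition and leave the physical table alone.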

Exporting an AWS Postgres RDS Table to AWS S3

I wanted to use AWS Data Pipeline to pipe data from a Postgres RDS to AWS S3. Does anybody know how this is done?
More precisely, I wanted to export a Postgres Table to AWS S3 using data Pipeline. The reason I am using Data Pipeline is I want to automate this process and this export is going to run once every week.
Any other suggestions will also work.
There is a sample on github.
https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RDStoS3
Here is the code:
https://github.com/awslabs/data-pipeline-samples/blob/master/samples/RDStoS3/RDStoS3Pipeline.json
You can define a copy-activity in the Data Pipeline interface to extract data from a Postgres RDS instance into S3.
Create a data node of the type SqlDataNode. Specify table name and select query.
Set up the database connection by specifying the RDS instance ID (the instance ID is in your URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com) along with the username, password and database name.
Create a data node of the type S3DataNode.
Create a Copy activity and set the SqlDataNode as input and the S3DataNode as output.
Another option is to use an external tool like Alooma. Alooma can replicate tables from a PostgreSQL database hosted on Amazon RDS to Amazon S3 (https://www.alooma.com/integrations/postgresql/s3). The process can be automated and you can run it once a week.
I built a pipeline from scratch, using the MySQL one and the documentation as a reference.
You need to have the roles in place: DataPipelineDefaultResourceRole and DataPipelineDefaultRole.
I haven't filled in the parameters, so you need to open the pipeline in the Architect view and enter your credentials and folders.
Hope it helps.
{
"objects": [
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "#{myS3LogsPath}",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"database": {
"ref": "DatabaseId_WC2j5"
},
"name": "DefaultSqlDataNode1",
"id": "SqlDataNodeId_VevnE",
"type": "SqlDataNode",
"selectQuery": "#{myRDSSelectQuery}",
"table": "#{myRDSTable}"
},
{
"*password": "#{*myRDSPassword}",
"name": "RDS_database",
"id": "DatabaseId_WC2j5",
"type": "RdsDatabase",
"rdsInstanceId": "#{myRDSId}",
"username": "#{myRDSUsername}"
},
{
"output": {
"ref": "S3DataNodeId_iYhHx"
},
"input": {
"ref": "SqlDataNodeId_VevnE"
},
"name": "DefaultCopyActivity1",
"runsOn": {
"ref": "ResourceId_G9GWz"
},
"id": "CopyActivityId_CapKO",
"type": "CopyActivity"
},
{
"dependsOn": {
"ref": "CopyActivityId_CapKO"
},
"filePath": "#{myS3Container}#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
"name": "DefaultS3DataNode1",
"id": "S3DataNodeId_iYhHx",
"type": "S3DataNode"
},
{
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"instanceType": "m1.medium",
"name": "DefaultResource1",
"id": "ResourceId_G9GWz",
"type": "Ec2Resource",
"terminateAfter": "30 Minutes"
}
],
"parameters": [
]
}
You can now do this with the aws_s3.query_export_to_s3 function from within Postgres itself: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
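For reference, a call to aws_s3.query_export_to_s3 takes a query, an S3 URI built with aws_commons.create_s3_uri, and an options string. The Python sketch below only builds that SQL statement as text; the table name, bucket, key and region are hypothetical, and it assumes the aws_s3 extension is installed and the instance has an IAM role allowed to write to the bucket. Running the statement on a weekly schedule is left to whatever scheduler you already use.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_export_statement(query: str, bucket: str, file_path: str, region: str) -> str:
    """Build a SELECT that calls aws_s3.query_export_to_s3 for the given query."""
    s3_uri = (
        "aws_commons.create_s3_uri("
        f"{sql_literal(bucket)}, {sql_literal(file_path)}, {sql_literal(region)})"
    )
    return (
        "SELECT * FROM aws_s3.query_export_to_s3("
        f"{sql_literal(query)}, {s3_uri}, options := 'format csv');"
    )


# Hypothetical table, bucket, key, and region.
statement = build_export_statement(
    "SELECT * FROM my_table", "my-export-bucket", "exports/my_table.csv", "eu-west-1"
)
print(statement)
```

The resulting string can be executed with any Postgres client connected to the RDS instance.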

ElasticSearch index from mongodb?

I need to create an index from MongoDB. The collection name is Product and it has this structure:
{
"_id": ObjectId("5239656f60663de206b1053e"),
"brand": "<brandName>",
"category": {
"$ref": "Category",
"$id": ObjectId("50cb515760663d3577000043"),
"$db": "<dbName>"
},
"image": "<imageUrl>",
"integraId": "<someId>",
"isActive": <isActive>,
"name": "<productName>",
"slug": "<slug>"
}
The Product collection has more than 30,000 rows, but Elasticsearch indexes only ~10,000 of them.
My query to create the index:
{
"type": "mongodb",
"mongodb": {
"servers": [
{ "host": "127.0.0.1", "port": 27017 }
],
"options": {
"secondary_read_preference": true
},
"db": "<dbName>",
"collection": "Product"
},
"index": {
"name": "test",
"type": "test_type"
}
}
And a second question: how can I index only some fields (name, category (fetching the row by id from the other collection) and brand)?
You may have more luck asking in the Google Group http://groups.google.com/group/elasticsearch/topics or on IRC http://www.elasticsearch.org/community/
MongoDB has experimental built-in full-text search in version 2.4, if you would like to experiment with that: http://docs.mongodb.org/manual/core/index-text/. You may be able to query more efficiently. I realize this isn't the same as the Elasticsearch solution you're looking for, but it might be another way to solve the problem. Good luck!