Spec to Create a Datasource/table in Apache Druid with empty/zero records - druid

Could you please help me get the Druid spec template to create a datasource/table in Druid with zero records?

Druid has no static table definition. This means that today you can import data with, for example, columns A, B and C, and tomorrow you can change your data ingestion so that the table definition becomes B, C and D.
All data is stored in segments (data blocks). Each segment has its own metadata which describes which columns are stored in that segment.
So, without any data, there is simply no column information known.
When I want to know the data structure of a "table" (in Druid it is called a dataSource), I query the metadata of the latest segment. This may of course be incomplete, but if you work with an "append only" strategy when handling columns, this works just fine.
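For example, one way to inspect the column set Druid currently knows for a dataSource is through Druid SQL's INFORMATION_SCHEMA (the dataSource name below is a placeholder); this is an alternative to a native segmentMetadata query and reflects Druid's merged view of recent segments rather than a single segment:

-- 'my_datasource' is a placeholder name
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'druid' AND TABLE_NAME = 'my_datasource';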
If you are a PHP developer, you might want to take a look at this package: https://github.com/level23/druid-client
I have tried to make it easy to query data. I hope it helps. Good luck.

Related

REST V2 ingest of JSON events to an S3 bucket (avoid duplicates)

I would like to ask you for help.
I am trying to ingest events in JSON from a source using a REST API (REST V2 connector) in a raw format.
The source allows me to pass the parameters "take" and "days" in the headers. The parameter "take" specifies how many records to take, and "days" specifies how old the requested events may be.
The job I have created works fine for data ingestion to a database, where I map fields to columns in the database.
I have tried a million things, and these are the two recent problems I am facing when I try to ingest files into a bucket or a database in raw format:
For mass ingestion: there are no incremental job options available (for the REST V2 source), so I am getting duplicate records, and the ingestion never stops.
Is there a way to stop mass ingestion and avoid duplicates when all records are ingested?
For data integration to a DB: each record/event I attempt to ingest has multiple fields. Since I DON'T want to separate the records (I WANT entire documents in JSON), I pack all fields into an array. The problem is that when I request ten records (or N records), all of them get ingested into a single row in the table.
Here is what I mean:
TABLE DB:
ROW1: "array packed" JSON1, JSON2 .... JSNO_N...JSON10 "/array packed"
ROW2: empty
This is what I need (each record in a separate row, in raw format):
TABLE DB:
ROW1: JSON1
ROW2: JSON2
ROWN: JSON_N
ROW10: JSON_10
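Just to illustrate the shape I am after: if the target database happened to support JSON functions (PostgreSQL syntax below, purely as an illustration, with placeholder names), unpacking the array would give one row per element:

-- Purely illustrative; 'staging' and 'payload' are placeholder names
SELECT json_array_elements(payload::json) AS record
FROM staging;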
I was also trying to accomplish this using a lambda function. The problem with lambda is that I would have to make sure there are no duplicates (Informatica has this cool "upsert" option that lets me avoid duplicates).
At the end of the day, I don't care whether this is accomplished using data integration, mass ingestion or lambda, or whether the ingest goes directly into a DB or S3. For now, I am just trying to find a working solution.
If somebody can come up with some ideas, I will appreciate the help.

ELT pipeline for Mongo

I am trying to get my data into Amazon Redshift using Fivetran, but I have some general questions about the ELT/ETL process. My source database is Mongo, but I want to perform deep analysis on the data using a 3rd-party BI tool like Looker, and those tools integrate with SQL. I am new to the ELT/ETL process and was wondering whether it would look like this:
Extract data from Mongo (handled by Fivetran)
Load into Amazon Redshift (handled by Fivetran)
Perform transformation - this is where my biggest knowledge gap is. I obviously have to convert objects and arrays into compatible SQL types. I can perform a transformation on all objects to extract them into columns and transform all arrays into tables. Is this the right idea? Should I design a MYSQL schema and write all the transformations according to that schema design?
As you state, Fivetran will load your data into Redshift, putting individual fields into columns where it can and putting everything else into varchar columns as JSON. So at that point you basically have a data lake: all your data in an analytical platform, still essentially in source format and available for you to do whatever you want with it.
Initially, if you don't know much about your data and just want to investigate it, you can probably leave it as it is. Redshift has SQL functions that allow you to query the elements of a JSON structure so there is no need to build additional tables and more ETL just to allow you to investigate your data - especially as these tables may get thrown away once you understand your data and decide what you want to do with it.
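For example (the table and key names below are only illustrative), a JSON field stored in a varchar column can be pulled out directly:

-- Illustrative only: 'raw_events' and 'payload' are placeholder names
SELECT json_extract_path_text(payload, 'customer', 'id') AS customer_id,
       json_extract_path_text(payload, 'event', 'type')  AS event_type
FROM raw_events;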
If you have proper reporting requirements, then that is the point where you can start to design a schema that will support those requirements (I'm not sure why you suggested a MYSQL schema, as MySQL is a database vendor?). Traditionally an analytical schema would be designed as a Kimball dimensional model (facts and dimensions), but the type of schema you decide to design will depend on:
The database platform you are using (in your case, Redshift) and the type of structures it works best with e.g. star schema or "flat" tables
The BI tool you are using and how it expects to have data presented to it
For example (and I'm not saying this is a real world example), if Redshift works ok with star schemas but better with flat tables and Looker has to have a star schema then it probably makes more sense to build star schemas in Redshift as this is a single modelling exercise - rather than model flat tables in Redshift and then have to model star schemas in Looker.
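As a rough illustration of the two shapes (table and column names invented for the example), a star schema separates descriptive attributes from measures, whereas a flat table keeps everything in one wide table:

-- Star schema sketch with invented names
CREATE TABLE dim_customer (
    customer_key  BIGINT,
    customer_name VARCHAR(200),
    country       VARCHAR(100)
);

CREATE TABLE fact_orders (
    order_date    DATE,
    customer_key  BIGINT,          -- joins to dim_customer
    order_amount  DECIMAL(12,2)
);

-- The "flat" alternative would simply carry customer_name and country
-- on every order row instead of in a separate dimension table.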
Hope this helps?
It depends on how you need the final stage of your data analysis presented, and on what the purpose of your data analysis is. As stated by NickW, assuming you need to integrate your data into a BI tool, the schema should be adapted to that tool's data format requirements.
A MongoDB ETL/ELT process might look like this:
Select Connection: select the configured connection.
Collection Name: choose the collection using the [database].[collection] format.
If you are pulling data from your authentication database, only the [collection] name can be specified. Example: sample.products.
Extract Method:
All: pull all of the data in the table.
Incremental: pull data by an incremental value.
Incremental Attributes: set the name of the incremental attribute to run by, e.g. UpdateTime.
Incremental Type: Timestamp | Epoch. Choose the type of the incremental attribute.
Choose Range:
For Timestamp, choose the date increment range to run by.
For Epoch, choose the value increment range to run by.
If no end date/value is entered, the default is the last date/value in the table.
The increment will be managed automatically.
Include End Value: whether the increment process should include the end value or not.
Interval Chunks: how the pulled data will be chunked. Split the data by minutes, hours, days, months or years.
Filter: filter the data to pull. The filter format is MongoDB Extended JSON.
Limit: limit the number of rows to pull.
Auto Mapping: you can choose the set of columns you want to bring, add a new column, or leave it as it is.
Converting an entire key's data to a STRING
In cases where the data is not in the shape the target expects, such as key names starting with numbers or flexible and inconsistent object data, you can convert attributes to a STRING format by setting their data type to STRING in the mapping section.
The conversion applies to any value under that key.
Arrays and objects will be converted to JSON strings.
Use cases: here are a few filtering examples:
{"account":{"$oid":"1234567890abcde"}, "datasource": "google", "is_deleted": {"$ne": true}}
date(MODIFY_DATE_START_COLUMN) >=date("2020-08-01")

Materialised View in Clickhouse not populating

I am currently working on a project which needs to ingest data from a Kafka topic (JSON format) and write it directly into ClickHouse. I followed the method suggested in the ClickHouse documentation:
Step 1: Created a clickhouse consumer which writes into a table (say, level1).
Step 2: I performed a select query on 'level1' and it gives me a set of results, but it is not particularly useful as the data can be read only once.
Step 3: I created a materialised view that converts data from the engine (level1) and puts it into a previously created table (say, level2). While writing into 'level2', the aggregation is at day level (done by converting the timestamp in level1 to a datetime).
Therefore, the data in 'level2' is: day + all columns in 'level1'.
I intend to use this view (level2) as the base for any future aggregation (say, at level3).
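For reference, a minimal sketch of this layout (column names here are simplified placeholders, not my real schema) is:

CREATE TABLE level1 (ts UInt64, metric String, value Float64)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse_events',
         kafka_format = 'JSONEachRow';

CREATE TABLE level2 (day Date, metric String, value Float64)
ENGINE = MergeTree ORDER BY (day, metric);

-- Converts the epoch timestamp to a day and copies the remaining columns.
CREATE MATERIALIZED VIEW level1_to_level2 TO level2 AS
SELECT toDate(toDateTime(ts)) AS day, metric, value
FROM level1;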
Problem 1: 'level2' is being created but data is not being populated in it, i.e., when I perform a basic select query (select * from level2 limit 10) on the view, the output is "0 rows in set".
Is it because of day level aggregation, and it might populate at the end of the day? Can I ingest data from 'level2' in real-time?
Problem 2: Is there a way of reading the same data from my engine 'level1', multiple times?
Problem 3: Is there a way to convert Avro to JSON while reading from a kafka topic? Or can Clickhouse write data (in Avro format) directly into 'level1' without any conversion?
EDIT: There is latency in ClickHouse while retrieving data from Kafka. I had to make changes to the users.xml file on my ClickHouse server (changing max_block_size).
Problem 1: 'level2' is being created but data is not being populated in it, i.e., when I perform a basic select query (select * from level2 limit 10) on the view, the output is "0 rows in set".
This might be related to the default settings of kafka storage, which always starts consuming data from the latest offset. You can change the behavior by adding this
<kafka>
<auto_offset_reset>earliest</auto_offset_reset>
</kafka>
to config.xml
Problem 2: Is there a way of reading the same data from my engine 'level1', multiple times?
You'd better avoid reading from kafka storage directly. You can set up a dedicated materialized view M1 for 'level1' and use that to populate 'level2' too. Then reading from M1 is repeatable.
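A rough sketch of that suggestion, with invented names (level1_store and m1), could be:

-- Durable copy of the raw stream: the Kafka table pushes every consumed
-- block into this MergeTree table via the materialized view below.
CREATE TABLE level1_store (ts UInt64, metric String, value Float64)
ENGINE = MergeTree ORDER BY ts;

CREATE MATERIALIZED VIEW m1 TO level1_store AS
SELECT ts, metric, value FROM level1;

-- 'level2' can then be fed from level1_store (repeatable reads)
-- instead of directly from the Kafka engine table.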
Problem 3: Is there a way to convert Avro to JSON while reading from a kafka topic? Or can Clickhouse write data (in Avro format) directly into 'level1' without any conversion?
Nope, though you can try using Cap'n Proto, which should provide performance similar to Avro, and it is supported directly by ClickHouse.
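If you do try Cap'n Proto, the relevant Kafka engine settings would look roughly like this; the broker, topic, schema file and struct names below are placeholders:

CREATE TABLE level1_capnp (ts UInt64, metric String, value Float64)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse_events_capnp',
         kafka_format = 'CapnProto',
         kafka_schema = 'events.capnp:Event';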

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2 and can be defined as HIDDEN, so existing SELECT * FROM queries will not retrieve the new column, which would otherwise cause extra costs.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
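A minimal sketch, with an invented table and column name (double-check the exact clause order against the Db2 documentation for your version):

-- Add a hidden, Db2-maintained change timestamp to an existing table
ALTER TABLE orders
  ADD COLUMN row_changed TIMESTAMP NOT NULL
      GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP
      IMPLICITLY HIDDEN;

-- Rows inserted or updated in the last day
SELECT *
FROM orders
WHERE ROW CHANGE TIMESTAMP FOR orders > CURRENT TIMESTAMP - 1 DAY;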
There is a similar concept for Db2 on z/OS - you will have to check that out yourself, as I have not tried that one.
Of course, there are other ways to solve it, like replication, etc.
That is not possible if you do not have a timestamp column. With a timestamp you can know which rows are new or modified.
You can also use the Time Travel feature in order to get the new values, but that also implies a timestamp column.
Another option is to put the tables in append mode and then get the rows after a given one. However, this option is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that requires custom development. Also, just applying the archived logs to the new database is possible; however, that database will remain in roll-forward pending state.

An alternative design to insert/update in Talend

I have a requirement in Talend where I have to update/insert rows from a source table into a destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert these results into the destination table.
I had designed this with 'insert or update' in tMap and tMysqlOutput. However, the job turns out to be super slow.
As an alternative to the above solution, I am trying to design the insert and the update separately. In order to do this, I want to hash the source rows, as the number of rows is usually small.
So, my question: I will hash the input rows, but when I join them with the destination rows in tMap, should I hash the destination rows as well? Or should I use the destination rows as they are and then join them?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I've seen jobs speed up by multiple orders of magnitude using only those components.
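For illustration, the kind of statement that ends up being pushed down to the database here is essentially an upsert; a minimal sketch assuming MySQL, with placeholder table and column names (not necessarily the exact SQL the Talend ELT components generate), would be:

-- Placeholder names; 'id' is assumed to be a primary or unique key
INSERT INTO destination_table (id, col_a, col_b)
SELECT id, col_a, col_b
FROM source_table
ON DUPLICATE KEY UPDATE
    col_a = VALUES(col_a),
    col_b = VALUES(col_b);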
It is still a good idea to use an indexed hash field both in the source and in the target, in the same way Satellites are loaded in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table's database, you could consider adding triggers for C(R)UD scenarios. Doing this, every action on the source database can be reflected in your database immediately. Remember, though, that you might need a buffer ("staging") table where you store the changes, so that you can ingest fast and process later. This table would hold only the changed rows and the change type (create, update, delete) for you to process. This decouples loading from processing, which can be helpful if there is a problem with either of them later on.
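A minimal sketch of such a trigger plus staging table, assuming a MySQL source and invented table and column names, might look like this:

-- Staging table holding only the key and change type of each changed row
CREATE TABLE source_changes (
    id          BIGINT,
    change_type CHAR(1),                             -- 'C', 'U' or 'D'
    changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- One trigger per action; the update case is shown here
CREATE TRIGGER source_after_update
AFTER UPDATE ON source_table
FOR EACH ROW
    INSERT INTO source_changes (id, change_type) VALUES (NEW.id, 'U');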
Yes, I believe you should use a hash component for the destination table as well, because then your processing (the lookup) will be very fast, as it happens in memory.
Otherwise the lookup load may take more time.