Snowflake Data Pipeline problems - in particular stream issue

Background
I have implemented a snowflake data pipeline (s3 log file > SNS > pipe > stage table > stream > task > stored proc/UDF > final table) in our production snowflake database.
While things were working on a smaller scale in our dev database, it seems the production pipeline has stopped working given the amount of data (6416006096 records and growing) attempting to flow through it.
Problem
After some investigation, it looks like s3 log > SNS > pipe > stage table is OK, but things are stuck where the task retrieves records from the stream... The stream is NOT stale. I have spent a lot of time reading the docs on streams and have not found anything in there that helps with my current issue.
It looks like the stream has too much data to return -- when I try to get a count(*) or * with limit 10 from the stream it is not returning after 8 minutes (and counting)...
Even if I could limit the data returned, my experiments show that once you select from the stream within a transaction, you can lose all changes even if you don't want them all (i.e., when you use a where clause to filter)...
Question
Is there any way to get anything to return from the stream without resetting it?
Is there any way to chunk the results from the stream without losing all changes within a transaction?
Is there some undocumented limit with streams--have I hit that?
Concern
I don't want to shut down the data pipeline because that means I may have to start all over, but I guess I will have to if I get no answers (I have contacted support too but have yet to hear back). Given that streams and tasks are still only in preview I guess this shouldn't be a surprise, but Snowflake told me they would be GA by now.

Is there any way to get anything to return from the stream without resetting it?
You should be able to select from the stream without resetting it. Only using it in a DML statement (e.g., insert into mytable select * from stream) will reset it.
Is there any way to chunk the results from the stream without losing all changes within a transaction?
No, streams don't support chunking.
Is there some undocumented limit with streams--have I hit that?
I don't think there are undocumented limits. Streams are essentially ranges on a table, so if there's a lot of data in the underlying table, it could take a while to scan it.
Some other considerations:
Are you using the right sized warehouse? If you have a lot of data in the stream, and a lot of DMLs consisting of updates, deletes, and inserts you might want to reconsider your warehouse size. I believe Snowflake does some partition level comparisons to reconcile added and deleted data.
Can you "tighten" up how often you read from the stream so that there's less data to process each time?
Depending on the type of data you're interested in, Snowflake offers an append only stream type, which only shows added data. This makes scanning much faster.

Related

MarkLogic REST interface to send data to Qlik Sense

I need to present ~10 million XML documents to Qlik Sense using MarkLogic REST interface with the intention of analyzing raw data on Qlik.
I'm unable to send that bulk data using simple cts:search.
A template view with SQL call like below is not helping as it is not recognized at Qlik Sense.
xdmp:to-json(xdmp:sql('select * from SC1.V1'))
Is there a better way to achieve this?
I understand it is not usual to load such huge data to Qlik, but what limitations should I consider?
You are unlikely to be able to transfer that volume of data into or out of ANY system in a single 'transaction' (or request). And if you could, you wouldn't want to, because when it fails, it's likely to fail forever as you have to start all over.
You should 'batch' up the documents into manageable chunks; 100MB or '1 minute' is a reasonable upper bound. As size and time increase, the probability of problems goes up (way up) due to timeouts, memory, temp space, transient internet and network problems, etc.
A simple strategy that often works well is to first produce a 'list' of what to extract (document uris, primary keys ..), save that, and then work your way through the list in batches - retrying as needed. Depending on the destination and local storage etc. you can either combine the lot to send on to the recipient, or generally better, send the target data in batches as well.
This approach has good transactional characteristics ... you effectively 'freeze' the set of data when you make the list, but can take your time collecting and sending it. Depending -- you may be able to do so in parallel.
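A minimal Python sketch of the list-then-batch strategy described above. The names `export_in_batches`, `fetch_batch` and `send_batch` are hypothetical: `fetch_batch` stands in for whatever call reads a group of documents by URI, and `send_batch` for whatever delivers them to the recipient. Neither is a MarkLogic or Qlik API.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def export_in_batches(
    uris: Sequence[str],
    fetch_batch: Callable[[List[str]], List[str]],  # hypothetical reader
    send_batch: Callable[[List[str]], None],        # hypothetical sender
    batch_size: int = 1000,
    max_attempts: int = 3,
    retry_delay_s: float = 0.0,
) -> int:
    """Work through a frozen list of URIs batch by batch, retrying failures.

    Returns the number of documents sent. A batch that still fails after
    `max_attempts` raises, so the caller can resume from that batch later.
    """
    sent = 0
    for batch in chunked(uris, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                docs = fetch_batch(batch)
                send_batch(docs)
                sent += len(docs)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(retry_delay_s)
    return sent
```

Because a retry re-sends the whole batch, the receiving side should tolerate seeing the same documents twice.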

Kafka stream enrichment - Sourcing a lookup table [duplicate]

There is a Kafka stream component that fetches JSON data from a topic. Now I have to do the following:
1. Parse that input JSON data and fetch the value of a certain ID (identifier) attribute
2. Do a lookup against a particular table in Oracle database
3. Enrich that input JSON with data from the lookup table
4. Publish the enriched JSON data to another topic
What is the best design approach to achieve Step#2? I have a fair idea on how I can do the other steps. Any help is very much appreciated.
Depending on the size of the dataset you're talking about, and of the volume of the stream, I'd try to cache the database as much as possible (assuming it doesn't change that often). Augmenting data by querying a database on every record is very expensive in terms of latency and performance.
The way I've done this before is instantiating a thread whose only task is to maintain a fresh local cache (usually a ConcurrentHashMap), and make that available to the process that requires it. In this case, you'll probably want to create a processor, give it the reference to the ConcurrentHashMap described above, and when the Kafka record comes in, look up the data with the key, augment the record, and send it to either a Sink processor, or to another Streams processor, depending on what you want to do with it.
In case the lookup fails, you can fall back to actually doing a query on demand to the database, but you probably want to test and profile this, because in the worst-case scenario of 100% cache misses, you're going to be querying the database a lot.
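To make the shape of that cache concrete, here is a rough sketch in Python rather than Java, with a plain dict behind a lock standing in for the ConcurrentHashMap. `LookupCache`, `load_all`, `load_one` and `enrich` are hypothetical names; `load_all` and `load_one` stand in for the real database queries, and nothing here uses the Kafka Streams API.

```python
import threading
from typing import Callable, Dict, Optional


class LookupCache:
    """In-memory copy of a lookup table with an on-demand fallback query."""

    def __init__(
        self,
        load_all: Callable[[], Dict[str, dict]],     # hypothetical: read whole table
        load_one: Callable[[str], Optional[dict]],   # hypothetical: query one key
    ) -> None:
        self._load_all = load_all
        self._load_one = load_one
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._data: Dict[str, dict] = {}
        self.misses = 0  # worth watching: every miss is a database round trip

    def refresh(self) -> None:
        """Replace the cached table with a fresh copy."""
        fresh = self._load_all()
        with self._lock:
            self._data = fresh

    def start(self, interval_s: float) -> None:
        """Load once, then reload every `interval_s` seconds in a daemon thread."""
        self.refresh()

        def loop() -> None:
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(interval_s):
                self.refresh()

        threading.Thread(target=loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()

    def get(self, key: str) -> Optional[dict]:
        with self._lock:
            row = self._data.get(key)
        if row is None:
            # Cache miss: fall back to querying the database for this key.
            self.misses += 1
            row = self._load_one(key)
            if row is not None:
                with self._lock:
                    self._data[key] = row
        return row


def enrich(record: dict, cache: LookupCache, id_field: str = "id") -> dict:
    """Return a copy of `record` merged with its lookup row, if one exists."""
    row = cache.get(str(record[id_field]))
    return {**record, **(row or {})}
```

Keys that do not exist in the table at all are queried again on every record in this sketch, which is exactly the 100% miss case worth profiling.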

Idempotent streams or preventing duplicate rows using PipelineDB

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30MB, with 100ks of rows)
I'd like to feed the logs into PipelineDB (running on the same single machine), where a continuous view can create exactly the aggregations I need over the metrics.
I can easily ship the logs to PipelineDB using copy from stdin, which works great.
However, a machine might occasionally power off unexpectedly (e.g. due to power shortage) during the copy of a log file, which means that once it is back online there is uncertainty about how much of the file has been inserted into PipelineDB.
How could I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that I get complete and accurate aggregations)
Notice each row in the log file has a unique identifier (serial number created by my application), but I can't find in the docs the option to define a unique field in the stream. I assume that PipelineDB's design is not meant to handle unique fields in stream rows
Nonetheless, are there any alternative solutions to this issue?
Exactly once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.
That being said, there are a couple of approaches you could use here that may be worth thinking about.
First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the line number that it has successfully written to PipelineDB. When beginning to ship a new logfile, check it against this table to determine which line number to start at.
Secondly, you could separate your aggregations by logfile (by including a path or something in the grouping) and simply DELETE any existing rows for that logfile before sending it. Then use combine to aggregate over all logfiles at read time, possibly with a VIEW.
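The second approach can be modelled in a few lines of plain Python to show why resending a whole logfile becomes safe. This is only an in-memory illustration: the `partials` dict stands in for a continuous view grouped by logfile, and `load_logfile` and `combined` are hypothetical names, not PipelineDB functions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (logfile, metric) -> (count, total): one partial aggregate per logfile.
Partials = Dict[Tuple[str, str], Tuple[int, float]]


def load_logfile(
    partials: Partials, logfile: str, rows: Iterable[Tuple[str, float]]
) -> None:
    """(Re)load one logfile: drop its old partial aggregates, then rebuild them."""
    for key in [k for k in partials if k[0] == logfile]:
        del partials[key]
    for metric, value in rows:
        count, total = partials.get((logfile, metric), (0, 0.0))
        partials[(logfile, metric)] = (count + 1, total + value)


def combined(partials: Partials) -> Dict[str, Tuple[int, float]]:
    """Merge per-logfile partials into one (count, total) per metric at read time."""
    out: Dict[str, Tuple[int, float]] = defaultdict(lambda: (0, 0.0))
    for (_, metric), (count, total) in partials.items():
        c, t = out[metric]
        out[metric] = (c + count, t + total)
    return dict(out)
```

After a crash in the middle of a file, calling `load_logfile` again with the full file gives the same combined result as a single clean load, because whatever the interrupted attempt wrote is deleted first.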

How to ensure external projections are in sync when using CQRS and EventSourcing?

I'm starting a new application and I want to use CQRS and event sourcing. I understand the idea of replaying events to recreate aggregates, and of snapshotting to speed things up if needed, using in-memory models, caching, etc.
My question is regarding large read models I don't want to hold in memory. Suppose I have an application where I sell products, and I want to listen to a stream of events like "ProductRegistered" "ProductSold" and build a table in a relational database that will be used for reporting or integration with another system. Suppose there are lots of records and this table may take from a few seconds to minutes to truncate/rebuild, and the application exports dozens of these projections for multiple purposes.
How does one handle the consistency of the projections in this scenario?
With in-memory data, it's quite simple and fast to replay the events. But I feel that external projections that are kept on disk will be much slower to rebuild.
Should I always start my application with a TRUNCATE TABLE + rebuild for every external projection? This seems impractical to me over time, but I may be worrying about a problem I don't have yet.
Since the table is itself like a snapshot, I could keep a "control table" to tell which event was the last one I handled for that projection, so I can replay only what's needed. But I'm worried about inconsistencies if the application or database crashes. It seems that checking the consistency of the table and rebuilding would be the same, which points to the solution 1 again.
How would you handle that in a way that is maintainable over time? Are there better solutions?
Thank you very much.
One way to handle this is the concept of checkpointing. Essentially either your event stream or your whole system has a version number (checkpoint) that increments with each event.
For each projection, you store the last committed checkpoint that was applied. At startup, you pull events greater than the last checkpoint number that was applied to the projection, and continue building your projection from there. If you need to rebuild your projection, you delete the data AND the checkpoint and rerun the whole stream (or set of streams).
Caution: the last applied checkpoint and the projection's read models need to be persisted in a single transaction to ensure they do not get out of sync.
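A small sketch of that caution, using Python's built-in sqlite3 as a stand-in for the relational database. The table names `product_sales` and `checkpoints`, and the event tuple layout, are assumptions made for the example. The point is that each event's projection change and its checkpoint update are committed together.

```python
import sqlite3
from typing import Iterable, Tuple

Event = Tuple[int, str, str]  # (checkpoint, event type, product id)


def init(db: sqlite3.Connection) -> None:
    db.execute(
        "CREATE TABLE IF NOT EXISTS product_sales "
        "(product_id TEXT PRIMARY KEY, sold INTEGER NOT NULL)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(projection TEXT PRIMARY KEY, last_applied INTEGER NOT NULL)"
    )
    db.execute("INSERT OR IGNORE INTO checkpoints VALUES ('product_sales', 0)")
    db.commit()


def last_checkpoint(db: sqlite3.Connection) -> int:
    row = db.execute(
        "SELECT last_applied FROM checkpoints WHERE projection = 'product_sales'"
    ).fetchone()
    return row[0]


def catch_up(db: sqlite3.Connection, events: Iterable[Event]) -> int:
    """Apply only events newer than the stored checkpoint; return how many."""
    start = last_checkpoint(db)
    applied = 0
    for checkpoint, kind, product_id in events:
        if checkpoint <= start:
            continue  # already reflected in the projection
        with db:  # one transaction: read model change + checkpoint together
            if kind == "ProductRegistered":
                db.execute(
                    "INSERT OR IGNORE INTO product_sales VALUES (?, 0)",
                    (product_id,),
                )
            elif kind == "ProductSold":
                db.execute(
                    "UPDATE product_sales SET sold = sold + 1 WHERE product_id = ?",
                    (product_id,),
                )
            db.execute(
                "UPDATE checkpoints SET last_applied = ? "
                "WHERE projection = 'product_sales'",
                (checkpoint,),
            )
        applied += 1
    return applied
```

If the process dies mid-event, the transaction rolls back and the checkpoint still points at the last fully applied event, so a restart picks up from there.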

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi real-time basis (someone is bound to ask what semi real-time means, and the answer is as frequently as I reasonably can, but I will be pragmatic; as a benchmark, let's say we are hoping for every 15 min) and feed it into a data-warehouse.
How much data? At peak times we are talking approx 80-100k rows per min hitting the OLTP side, off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each but there are various tables etc so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Create a TRIGGER to write all DML activity to a rotating CSV log file
Perform whatever transformations are required
Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
Why this approach?
TRIGGERS allow selective tables to be targeted rather than being system wide + output is configurable (i.e. into a CSV) and are relatively easy to write and deploy. SLONY uses similar approach and overhead is acceptable
CSV easy and fast to transform
Easy to pump CSV into the DW
Alternatives considered ....
Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). Problem with this is it looked very verbose relative to what I needed and was a little trickier to parse and transform. However it could be faster as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system wide but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... problem is the OLTP schema would need to be tweaked to support this and that has many negative side-effects
Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
Using the WAL
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, you will get much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the highest key value from the previous extraction (0 for the first extraction). This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to DEFAULT so it draws a new value from the sequence and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
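A minimal sketch of the incremental extraction, using Python's built-in sqlite3 in place of PostgreSQL so it runs anywhere. The table `orders` and its columns are made up for the example, and `extract_new_rows` is a hypothetical helper.

```python
import sqlite3
from typing import List, Tuple


def extract_new_rows(
    db: sqlite3.Connection, last_max_key: int
) -> Tuple[List[tuple], int]:
    """Return rows with id > last_max_key, plus the new high-water mark.

    The caller persists the returned mark and passes it in next time.
    """
    rows = db.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id",
        (last_max_key,),
    ).fetchall()
    new_max = rows[-1][0] if rows else last_max_key
    return rows, new_max
```

One thing to check on a busy OLTP system: sequence values are handed out before commit, so a transaction that commits late can surface a key lower than a mark you have already passed. Extracting only up to a slightly older mark is one way to leave room for that.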
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
If you can maintain a 'checksum table' that contains only the IDs and a 'checksum' of each row, you can do a quick select not only of the new records but also of the changed and deleted records.
The checksum could be CRC32 or any other checksum function you like.
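As an illustration of the checksum-table idea, here is a small Python sketch using zlib.crc32. The dicts stand in for the source table and the checksum table, and `checksum` and `diff_against_checksums` are hypothetical names. Hashing `repr(row)` is just one simple way to serialise a row for the example.

```python
import zlib
from typing import Dict, List, Tuple


def checksum(row: tuple) -> int:
    """CRC32 over a simple text rendering of the row's values."""
    return zlib.crc32(repr(row).encode("utf-8"))


def diff_against_checksums(
    current: Dict[int, tuple],  # id -> row as it is in the source table now
    known: Dict[int, int],      # id -> checksum stored at the last extraction
) -> Tuple[List[int], List[int], List[int]]:
    """Return (new ids, changed ids, deleted ids)."""
    new = sorted(k for k in current if k not in known)
    changed = sorted(
        k for k in current if k in known and checksum(current[k]) != known[k]
    )
    deleted = sorted(k for k in known if k not in current)
    return new, changed, deleted
```

CRC32 is only 32 bits wide, so two different row versions can occasionally collide; a wider hash lowers that risk if missed updates matter.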
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT ... DO UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" approaches (update, then insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
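A runnable sketch of the staging-table upsert, using Python's built-in sqlite3 as a stand-in for PostgreSQL. It assumes SQLite 3.24 or later, which accepts a similar ON CONFLICT ... DO UPDATE clause. The tables `staging` and `target` and their columns are made up for the example. The `WHERE true` is a SQLite parsing requirement for INSERT ... SELECT with an upsert clause and is not needed in PostgreSQL.

```python
import sqlite3


def upsert_from_staging(db: sqlite3.Connection) -> None:
    """Merge everything in `staging` into `target` in a single statement."""
    with db:  # commit both statements together, or roll both back
        db.execute(
            """
            INSERT INTO target (id, value, row_update_timestamp)
            SELECT id, value, row_update_timestamp FROM staging WHERE true
            ON CONFLICT (id) DO UPDATE SET
                value = excluded.value,
                row_update_timestamp = excluded.row_update_timestamp
            """
        )
        # Empty the staging table so the next load starts clean.
        db.execute("DELETE FROM staging")
```

Rows whose id already exists in `target` are overwritten with the staged values, and the rest are inserted.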