Idempotent streams or preventing duplicate rows using PipelineDB - pipelinedb

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30MB, with 100ks of rows)
I'd like to feed the logs into PipelineDB (running on the same single machine) which Countiuous View can create for me exactly the aggregations I need over the metrics.
I can easily ship the logs to PipelineDB using copy from stdin, which works great.
However, a machine might occasionally power off unexpectedly (e.g. due to power shortage) during the copy of a log file. Which means that once back online there is uncertainty how much of the file has been inserted into PipelineDB.
How could I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that I get complete and accurate aggregations)
Notice each row in the log file has a unique identifier (serial number created by my application), but I can't find in the docs the option to define a unique field in the stream. I assume that PipelineDB's design is not meant to handle unique fields in stream rows
Nonetheless, are there any alternative solutions to this issue?

Exactly once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.
That being said, there are a couple of approaches you could use here that may be worth thinking about.
First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the line number that it has successfully written to PipelineDB. When beginning to ship a new logfile, check it against this table to determine which line number to start at.
Secondly, you could separate your aggregations by logfile (by including a path or something in the grouping) and simply DELETE any existing rows for that logfile before sending it. Then use combine to aggregate over all logfiles at read time, possibly with a VIEW.

Related

google dataprep (clouddataprep by trifacta) tip: jobs will not be able to run if they are to large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or a joins with these sets, dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI based inputs and translates it into Apache Beam code. When you pass in a reference data set, it probably writes some AB code that turns the reference data set into a side-input (https://beam.apache.org/documentation/programming-guide/). DataFlow will perform a Parellel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Neo4j's MERGE command on big datasets

Currently, I am working on a project of implementing a Neo4j (V2.2.0) database in the field of web-analytics. After loading some samples, I'm trying to load a big data set (>1GB, >4M lines). The problem I am facing, is that the usage of the MERGE command takes exponentially more time as the data size grows. Online sources are ambiguous on what the best way is to load big sets of data when not every line has to be loaded as a node, and I would like some clarity on the subject. To emphasize, in this situation I am just loading the nodes; relations are the next step.
Basically there are three methods
i) Set a uniqueness constraint for a property, and create all nodes. This method was used mainly before the MERGE command was introduced.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
followed by
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR'\t'
CREATE (:Book{isbn=row.isbn, title=row.title, etc})
In my experience, this will return a error if a duplicate is found, which stops the query.
ii) Merging the nodes with all their properties.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR'\t'
MERGE (:Book{isbn=row.isbn, title=row.title, etc})
I have tried loading my set in this manner, but after letting the process run for over 36 hours and coming to a grinding halt, I figured there should be a better alternative, as ~200K of my eventual ~750K nodes were loaded.
iii) Merging nodes based on one property, and setting the rest after that.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR'\t'
MERGE (b:Book{isbn=row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
etc
I am running a test now (~20K nodes) to see if switching from method ii to iii will improve execution time, as a smaller sample gave conflicting results. Are there methods which I am overseeing and could improve execution time? If I am not mistaken, the batch inserter only works for the CREATE command, and not the MERGE command.
I have permitted Neo4j to use 4GB of RAM, and judging from my task manager this is enough (uses just over 3GB).
Method iii) should be the fastest solution since you MERGE against a single property. Do you create the uniqueness constraint before you do the MERGE? Without an index (constraint or normal index), the process will take a long time with a growing number of nodes.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
Followed by:
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR'\t'
MERGE (b:Book{isbn=row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
This should work, you can increase the PERIODIC COMMIT.
I can add a few hundred thousand nodes within minutes this way.
In general, make sure you have indexes in place. Merge a node first on the basis of the properties that are indexed (to exploit fast lookup) and then modify that node's properties as needed with SET.
Beyond that, both of your approaches are going through the transaction layer. If you need to jam a lot of data into the DB really quickly, you probably don't want to use transactions to do that, because they're giving you functionality you might not need, and they require overhead that's slowing you down. So a larger solution would be to not insert data with LOAD CSV but go another route entirely.
If you're using the 2.2 series of neo4j, you can go for the batch inserter via java, or the neo4j-import tool sadly not available prior to 2.2. What they both have in common is that they don't use transactions.
Finally, either way you go you should read Michael Hunger's article on importing data into neo4j as it provides a good conceptual discussion of what's happening, and why you need to skip transactions if you're going to load big huge piles of data into neo4j.

Incrementing hundreds of counters at once, redis or mongodb?

Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating systems. The desired effect would be:
US # of events
iOS - # of events that occured in US
Android - # of events that occured in US
CA # of events
iOS - # of events that occured in CA
Android - # of events that occured in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
Which means in order to do the relational counters for something like the above I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked and I also want to do it in such a manner that the lookups on the data remains super quick. As such I have been looking into using redis or mongodb for this.
Questions:
Is there a better way to do this then counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in mongodb to increment 100+ counters at the same time in one operation be viable and not slow? The upside here being I can retrieve all of the statistics for one 'campaign' quickly in a single query.
Would this be better suited to redis and to do a zincrby for all of the applicable counters for the event?
Thanks
Depending on how your key structure is laid out I would recommend pipelining the zincr commands. You have an easy "commit" trigger - the request. If you were to iterate over your parameters and zincr each key, then at the end of the request pass the execute command it will be very fast. I've implemented a system like you describe as both a cgi and a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
And was able to process Something like 150000-200000 increments per second on the redis side with a single process which should be plenty for your described scenario. This key structure allows me to grab data based on windows of time. I also added an expire to the keys to avoid writing a db cleanup process. I then had a cronjob that would do set operations to "roll-up" stats in to hourly, daily, and weekly using variants of the aforementioned key pattern. I bring these ideas up as they are ways you can take advantage of the built in capabilities of Redis to make the reporting side simpler. There are other ways of doing it but this pattern seems to work well.
As noted by eyossi the global lock can be a real problem with systems that do concurrent writes and reads. If you are writing this as a real time system the concurrency may well be an issue. If it is an "end if day" log parsing system then it would not likely trigger the contention unless you run multiple instances of the parser or reports at the time of input. With regards to keeping reads fast In Redis, I would consider setting up a read only redis instance slaved off of the main one. If you put it on the server running the report and point the reporting process at it it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the redis instance you might consider running a 32bit redis server to keep the memory usage down. A 32b instance should be able to keep a lot of this type of data in a small chunk of memory, but if running the normal 64 bit Redis isn't taking too much memory feel free to use it. As always test your own usage patterns to validate
In redis you could use multi to increment multiple keys at the same time.
I had some bad experience with MongoDB, i have found that it can be really tricky when you have a lot of writes to it...
you can look at this link for more info and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which maybe already improved in version 2.x - i didn't check it)
On the other hand, i had a good experience with Redis, i am using it for a lot of read / writes and it works great.
you can find more information about how i am using Redis (to get a feeling about the amount of concurrent reads / writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipelinethan multiif you don't need the atomic feature..

NoSQL recommendation for a project (text streaming)

I'm looking for a NoSQL DB recommendation... here's what I'm working on:
I'm writing a web-based client for delivering text streams (basically, real-time captions) to a significant number of consumers. Once things are fully ramped up, there might be 100+ events happening at any given moment. Many will be small (< 10 consumers) but some of them could be quite large (10,000+ simultaneous consumers, maybe more?).
During the course of each event, text will be accumulating at a rate of anywhere from a few words per minute up to 200+ words per minute. Each consumer will be running a web client (a browser on a desktop/laptop/tablet/smartphone) which will poll periodically for any text that it hasn't already received. It will also be possible for a given user to ask for the full text of the event up to the time that they make the request. Completed events have to stick around for a while, but will be removed within about 24-36 hours of their completion.
My first thought is to use Redis, which has methods for appending to a text value in the datastore as well as built-in support for getting a substring from the end of a text value (i.e. a client could just hold the character offset of the last character it received and would pass that to the client API and that would be used to pull a substring from the event text). I am concerned though that the growth of the string containing the event text might be an unusual use of Redis and could cause me some issues.
So... is there a NoSQL DB that seems particularly well suited to this sort of application? Is there any significant reason NOT to use Redis for something like this?
An underlying open question is what to do about new clients. For example, say an event has started and someone connects a few minutes into it. Do they need everything from the beginning or just from when they connected?
If the latter I'd recommend a message system instead of appending strings to strings. One way would be to use Redis' Pub/Sub instead. That seems a better fit overall, and especially if new connections do not need everything from the beginning. For longer term storage, a client that listens as any other and the archive entry - preferably by local cache and then upload the completed transcript when completed or in-progress. I'd keep the real-time need and code separate from requesting history and archives.
Another route would be to use an ordered set, using a timestamp for the time the entry was made. As a result the client only keeps track of the last update and retrieves anything from that time on. Ordered Sets documentation can be found here. This method also provides the ability to select a region of time from the transcript. With a bit of math you could even replay the event from a transcript viewpoint as if it were live. If you've got tens of thousands of clients pulling the entire transcript each poll
Another advantage of the timestamp ordered set is string encoding. When using Redis strings and getrange you have to use fixed-width encodings. The range is byte-offsets, not character offsets. If you need the ability to support, say UTF-8, this might be a problem for you.
A third option is to append a string of text to a list. This is similar to the sorted set except that your client stores the last index (size of the list) and on each poll tries to get anything from lastIndex+1 to the end.

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi real-time basis (some-one is bound to ask what semi real-time means and the answer is as frequently as I reasonably can but I will be pragmatic, as a benchmark lets say we are hoping for every 15min) and feed it into a data-warehouse.
How much data? At peak times we are talking approx 80-100k rows per min hitting the OLTP side, off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each but there are various tables etc so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Create a TRIGGER to write all DML activity to a rotating CSV log file
Perform whatever transformations are required
Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
Why this approach?
TRIGGERS allow selective tables to be targeted rather than being system wide + output is configurable (i.e. into a CSV) and are relatively easy to write and deploy. SLONY uses similar approach and overhead is acceptable
CSV easy and fast to transform
Easy to pump CSV into the DW
Alternatives considered ....
Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). Problem with this is it looked very verbose relative to what I needed and was a little trickier to parse and transform. However it could be faster as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system wide but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... problem is the OLTP schema would need tweaked to support this and that has many negative side-effects
Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
Using the WAL
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if first extraction.) This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code everytime you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
if you can think of a 'checksum table' that contains only the id's and the 'checksum' you can not only do a quick select of the new records but also the changed and deleted records.
the checksum could be a crc32 checksum function you like.
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table then in one SQL statement INSERT into the target table with ON CONFLICT UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the the Temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to to other "UPSERT" systems (Update, insert if zero rows etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for it's money!