We have a system where our primary data store (and "Universal Source of Truth") is Postgres, but we replicate that data both in real-time as well as nightly in aggregate. We currently replicate to Elasticsearch, Redis, Redshift (nightly only), and are adding Neo4j as well.
Our ETL pipeline has gotten expansive enough that we're starting to look at tools like Airflow and Luigi, but from what I can tell from my initial research, these tools are meant almost entirely for batch loads in aggregate.
Is there any tool that can handle both large batch ETL processes and on-the-fly, high-volume, individual-record replication? Do Airflow or Luigi handle this and I just missed it?
Thanks!
As far as Luigi goes, you would likely end up with a micro-batch approach, running the jobs on a short interval. For example, you could trigger a cron job every minute to check for new records in Postgres tables and process that batch (something like the sketch below). You can create a task for each item, so that your processing flow itself is built around a single item. At high volumes, say more than a few hundred updates per second, this becomes a real challenge.
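A minimal sketch of that micro-batch pattern, assuming a hypothetical events table with an updated_at column and an Elasticsearch target; all connection details, table, and index names here are made up:

import datetime

import luigi
import psycopg2                                      # assumed Postgres driver
from elasticsearch import Elasticsearch, helpers     # assumed ES client


class ReplicateNewRows(luigi.Task):
    """Micro-batch: copy rows updated in a one-minute window from Postgres to Elasticsearch."""
    window = luigi.DateMinuteParameter()   # cron passes the minute to process

    def output(self):
        # Marker file so the same window is never processed twice.
        return luigi.LocalTarget("markers/replicated-%s" % self.window.isoformat())

    def run(self):
        conn = psycopg2.connect(dbname="app", host="pg.internal")   # hypothetical DSN
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload FROM events "
                "WHERE updated_at >= %s AND updated_at < %s",
                (self.window, self.window + datetime.timedelta(minutes=1)),
            )
            rows = cur.fetchall()

        es = Elasticsearch(["http://es.internal:9200"])
        helpers.bulk(es, (
            {"_index": "events", "_id": row_id, "_source": {"payload": payload}}
            for row_id, payload in rows
        ))

        with self.output().open("w") as marker:
            marker.write("done")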
Apache Spark has a scalable batch mode and a micro-batch mode of processing, and some basic pipelining operators that can be adapted to ETL. However, the complexity of the solution, in terms of supporting infrastructure, goes up quite a bit.
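To give a flavour of the micro-batch side, a toy PySpark sketch using the classic DStream API; the socket source, host, and port are placeholders for whatever change feed you would actually consume:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="micro-batch-etl-sketch")
ssc = StreamingContext(sc, batchDuration=60)   # one micro-batch per minute

# Placeholder source; in practice this would be Kafka, Kinesis, etc.
lines = ssc.socketTextStream("change-feed.internal", 9999)

# Toy "transform": count records in each batch and print the result.
lines.count().pprint()

ssc.start()
ssc.awaitTermination()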
I'm no crazy expert on the different ETL engines, but I've done a lot with Pentaho Kettle and am pretty happy with it performance-wise, especially if you tune your transformations to take advantage of the parallel processing.
I've mostly used it for handling (real-time) integrations and nightly ETL jobs that drive our reporting DB, but I'm pretty sure you could set it up to perform many real-time tasks.
I did once set up web services that called all sorts of things on our back end in real time, but that was very much not under any kind of load, and it sounds like you're doing heftier things than we are. Then again, it has functionality to cluster the ETL servers and scale out that I've never really played with.
It feels like Kettle could do these things if you spent the time to set it up right. Overall I love the tool; it's a joy to work in the GUI, TBH. If you're not familiar with it, or doubt the power of doing ETL from a GUI, you should check it out. You might be surprised.
My plan:
Move all data processing to Spark (preferably PySpark), with the final (consumer-facing) output data going to Redshift only. Spark seems to connect well to all the various sources (DynamoDB, S3, Redshift), and output can go to Redshift/S3 etc. depending on customer need. This avoids multiple Redshift clusters, overuse of broken, unsupported internal ETL tools, and copies of the same data across clusters, views, tables, etc. (which is the current setup). A rough sketch of this part follows below.
Use Luigi to provide a web UI for daily monitoring of pipelines, visualising the dependency tree, and scheduling ETLs. Email notification on failure should be an option too. An alternative is AWS Data Pipeline, but Luigi seems to have a better UI for showing what is happening when many dependencies are involved (some trees are 5 levels deep, though perhaps that can also be avoided with better Spark code).
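For the Spark half of the plan, a minimal PySpark sketch of the intended flow, reading raw data from S3 and writing curated output back to S3 for Redshift to COPY; the bucket paths and the aggregation are placeholders, and writing directly to Redshift would additionally require a Redshift/JDBC connector on the classpath:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Raw data landed in S3 (paths are placeholders).
raw = spark.read.parquet("s3a://my-raw-bucket/events/")

# Example transformation: daily event totals per customer.
daily = (
    raw.groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
       .agg(F.count("*").alias("events"))
)

# Consumer-facing output back to S3; Redshift can then load it with COPY,
# or a spark-redshift style connector could write it there directly.
daily.write.mode("overwrite").parquet("s3a://my-curated-bucket/daily_events/")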
Questions:
Does Luigi integrate with Spark? (I have only used PySpark before, not Luigi, so this is a learning curve for me.) The plan was to schedule 'applications', and I believe Spark has ETL capabilities of its own too, so I'm unsure how Luigi fits in here.
How do I account for the fact that some pipelines may be 'real time'? Would I need to spin up the Spark/EMR job hourly, for example?
I'm open to thoughts / suggestions / better ways of doing this too!
To answer your questions directly,
1) Yes, Luigi does play nicely with PySpark, just like any other library. We certainly have it running without issue; the only caveat is that you have to be a little careful with imports and keep them inside the methods of the Luigi task class, because in the background Luigi is spinning up new Python instances (see the sketch after point 2).
2) There are ways of getting Luigi to slurp in streams of data, but it is tricky to do. Realistically, you'd fall back to running an hourly cron cycle to just call the pipeline and process any new data. This roughly reflects Spotify's use case for Luigi, where they run daily jobs to calculate top artists, etc.
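To make point 1 concrete, here is a minimal sketch of the kind of task we run; the paths and names are made up, and note the pyspark import living inside run(), per the caveat above. (Luigi also ships Spark helpers in luigi.contrib.spark, e.g. SparkSubmitTask, if you would rather shell out via spark-submit.)

import luigi


class DailyAggregate(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("markers/daily-aggregate-%s" % self.date.isoformat())

    def run(self):
        # Import here, not at module level: Luigi spins up fresh Python
        # processes for workers, and top-level pyspark imports can misbehave.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("daily-aggregate").getOrCreate()
        events = spark.read.parquet("s3a://my-bucket/events/%s/" % self.date.isoformat())
        (events.groupBy("customer_id").count()
               .write.mode("overwrite")
               .parquet("s3a://my-bucket/daily/%s/" % self.date.isoformat()))
        spark.stop()

        with self.output().open("w") as marker:
            marker.write("done")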
As #RonD suggests, if I were building a new pipeline now, I'd skip Luigi and go straight to Airflow. If nothing else, look at the release history: Luigi hasn't really been significantly worked on for a long time (because it works for the main dev), whereas Airflow is actively being incubated by Apache.
Instead of Luigi, use Apache Airflow for workflow orchestration (workflows are written in Python). It has a lot of operators and hooks built in which you can call from DAGs (workflows). For example, create one task that calls an operator to spin up an EMR cluster, another to run a PySpark script located in S3 on that cluster, and another to watch the run for status, along the lines of the sketch below. You can use tasks to set up dependencies etc. too.
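A rough sketch of that shape of DAG, using the EMR operators from the Amazon provider package; this is a sketch only, the import paths have moved between Airflow versions, and the cluster and step configs here are heavily abbreviated placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# One Spark step run via command-runner; the script path is a placeholder.
SPARK_STEPS = [{
    "Name": "run-etl",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/scripts/etl.py"],
    },
}]

with DAG(
    dag_id="emr_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={"Name": "etl-cluster"},   # heavily abbreviated cluster config
    )

    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id=create_cluster.output,            # cluster id via XCom
        steps=SPARK_STEPS,
    )

    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
    )

    create_cluster >> add_step >> watch_step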
I'm trying to optimise ADO.NET (.NET 4.5) data access with the Task Parallel Library (.NET 4.5). For example, when selecting 1,000,000,000 records from a database, how can we use the machine's multicore processors effectively with the Task Parallel Library? If anyone has found useful sources to get some ideas, please post :)
The following applies to all DB access technologies, not just ADO.NET.
Client-side processing is usually the wrong place to solve data access problems. You can achieve several orders of magnitude improvement in performance by optimizing your schema, creating proper indexes, and writing proper SQL queries.
Why transfer 1M records to a client for processing, over a limited network connection with significant latency, when a proper query could return the 2-3 records that matter?
RDBMS systems are designed to take advantage of available processors, RAM and disk arrays to perform queries as fast as possible. DB servers typically have far larger amounts of RAM and faster disk arrays than client machines.
What type of processing are you trying to do? Are you perhaps trying to analyze transactional data? In that case you should first extract the data to a reporting database or, better yet, an OLAP database. A star schema with proper indexes and precalculated analytics can be 1000x faster than an OLTP schema for analysis.
Improved SQL coding can also yield a 10x-50x improvement or more. A typical mistake by programmers not accustomed to SQL is to use cursors instead of set operations to process data. This usually leads to horrendous performance degradation, on the order of 50x or worse.
Pulling all the data to the client to process it row by row is even worse. This is essentially the same as using cursors, except that the data also has to travel over the wire and processing has to use the client's often limited memory.
The only place where asynchronous processing offers any advantage is when you want to fire off a long operation and execute code when processing finishes. ADO.NET already provides asynchronous operations using the APM model (BeginExecute/EndExecute). You can use the TPL to wrap these in a task to simplify programming, but you won't get any performance improvement.
It could be that your problem is not suited to database processing at all. If your algorithm requires that you scan the entire dataset multiple times, it would be better to extract all the data to a suitable file format in one go, and transfer it to another machine for processing.
Suggestions for a NoSQL datastore so that we can push data and generate real time Qlikview reports easily?
Easily means:
1. Qlikview support for reads (a MongoDB connector is available; otherwise we could maybe write a JDBC connector, or a custom QVX connector to the datastore)
2. Easily adaptable to changes in schema, or schemaless. We change our schema quite frequently ...
3. Java support for writes
4. Super-fast reads: real-time incremental access, as well as batch access to old data within a time range. I read that Cassandra excels at ranges.
5. Reasonably fast writes
6. Reasonably big data storage: 20 million rows stored per day, about 200 bytes each
7. Would be nice if it can scale to a year's worth of data; elasticity is not so important
8. Easy to use, install, and run, with minimal setup and configuration time
9. Matlab support for ad hoc querying
Initially I don't think we need a distributed system, though a cluster is a possibility.
I've looked at MongoDB, Cassandra, and HBase. I don't think going over REST is a good idea due to (theoretically) slower performance.
I'm leaning towards MongoDB at the moment due to its ease of use, Matlab support, being totally schemaless, and Qlikview support (a beta connector is available). However, if anyone can suggest something better, that would be great!
Depending on the server infrastructure you will use, I guess the best choice is Amazon's NoSQL service, available at aws.amazon.com.
The fact is, any DB will have poor performance on cloud infrastructure due to the way it stores data; Amazon EC2 with EBS, for instance, is VERY slow for this task, requiring you to connect up to 20 EBS volumes in RAID to get decent speed. They solved this issue by creating this NoSQL service, which I have never used, but it seems nice.
I just wanted to know if there is a fundamental difference between HBase, Cassandra, CouchDB, and MongoDB. In other words, are they all competing in the exact same market and trying to solve the exact same problems, or do they fit best in different scenarios?
It all comes down to the question: what should I choose when? Is it a matter of taste?
Thanks,
Federico
Those are some long answers from #Bohzo. (but they are good links)
The truth is, they're "kind of" competing. But they definitely have different strengths and weaknesses and they definitely don't all solve the same problems.
For example Couch and Mongo both provide Map-Reduce engines as part of the main package. HBase is (basically) a layer over top of Hadoop, so you also get M-R via Hadoop. Cassandra is highly focused on being a Key-Value store and has plug-ins to "layer" Hadoop over top (so you can map-reduce).
Some of the DBs provide MVCC (Multi-version concurrency control). Mongo does not.
All of these DBs are intended to scale horizontally, but they do it in different ways. All of these DBs are also trying to provide flexibility in different ways. Flexible document sizes or REST APIs or high redundancy or ease of use, they're all making different trade-offs.
So to your question: In other words, are they all competing in the exact same market and trying to solve the exact same problems?
Yes: they're all trying to solve the issue of database-scalability and performance.
No: they're definitely making different sets of trade-offs.
What should you start with?
Man, that's a tough question. I work for a large company pushing tons of data, and we've been through a few of these over the years. We tried Cassandra at one point a couple of years ago and it couldn't handle the load. We're using Hadoop everywhere, but it definitely has a steep learning curve and it hasn't worked out in some of our environments. More recently we've tried Cassandra + Hadoop, but it turned out to be a lot of configuration work.
Personally, my department is moving several things to MongoDB. Our reasons for this are honestly just simplicity.
Setting up Mongo on a Linux box takes minutes and doesn't require root access, changes to the file system, or anything fancy. There are no crazy config files or Java recompiles required. So from that perspective, Mongo has been the easiest "gateway drug" for getting people onto KV/document stores.
CouchDB and MongoDB are document stores
Cassandra and HBase are key-value based
Here is a detailed comparison between HBase and Cassandra
Here is a (biased) comparison between MongoDB and CouchDB
Short answer: test before you use in production.
I can offer my experience with both HBase (extensive) and MongoDB (just starting).
Even though they are not the same kind of stores, they solve the same problems:
scalable storage of data
random access to the data
low latency access
We were very enthusiastic about HBase at first. It is built on Hadoop (which is rock-solid), it is under Apache, it is active... what more could you want? Our experience:
HBase is fragile
an administrator's nightmare (full of configuration settings where the defaults are less than perfect, non-transparent configuration, changes from version to version, ...)
loses data (unless you have set the X configuration and changed Y to... you get the point :) - we found that out when HBase crashed and we lost 2 hours (!!!) of data because the WAL was not set up properly
lacks secondary indexes
lacks any way to perform a backup of database without shutting it down
All in all, HBase was a nightmare. Wouldn't recommend it to anyone except to our direct competitors. :)
MongoDB solves all these problems and more. It is a delight to set up, it makes administration simple and transparent, and the default configuration settings actually make sense. You can perform (hot) backups and you can have secondary indexes. From what I've read, I wouldn't recommend MapReduce on MongoDB (JavaScript, only 1 thread per node), but you can use Hadoop for that.
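For example, a secondary index is a one-liner with pymongo (the database, collection, and field names below are just placeholders), and a hot backup is just running mongodump against the live server:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.mydb.events   # placeholder database/collection

# Secondary index on non-key fields, something HBase gives you no help with.
events.create_index([("user_id", ASCENDING), ("created_at", ASCENDING)])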
And it is also VERY active when compared to HBase.
Also:
http://www.google.com/trends?q=HBase%2CMongoDB
Need I say more? :)
UPDATE: many months later I must say MongoDB delivered on all accounts and more. The only real downside is that hosting companies do not offer it the way they offer MySQL. ;)
It also looks like MapReduce is bound to become multi-threaded in 2.2. Still, I wouldn't use MR this way. YMMV.
Cassandra is good for writing data; it has the advantage that "writes never fail". It has no single point of failure.
HBase is very good for data processing. HBase is built on the Hadoop Distributed File System (HDFS), so HBase doesn't need to worry about data replication or data consistency. HBase does have a single point of failure. I am not really sure what that implies; if it has a single point of failure, then it is somehow similar to an RDBMS, where we also have a single point of failure. I might be wrong about this, since I am quite new.
How about Riak? Does anyone have experience using Riak? I read somewhere that you need to pay; I am not sure. I'd appreciate an explanation.
One more thing: which one would you prefer when your only concern is reading a lot of data, and you have no concern with writing? Just imagine you have a petabyte-scale database and you want to make fast searches; which NoSQL database would you prefer?
I have a system that collects real-time Apache log data from about 90-100 web servers. I have also defined some URL patterns.
Now I want to build another system that updates the number of times each pattern occurs, based on those logs.
I had thought about using MySQL to store the statistics and update them with a statement like:
"Update table set count=count+1 where ....",
but I'm afraid that MySQL will be slow for data from that many servers. Moreover, I'm looking for a database/storage solution that is more scalable and simpler. (As an RDBMS, MySQL supports too many things that I don't need in this situation.) Do you have any ideas?
Apache Cassandra is a high-performance column-family store and can scale extremely well. The learning curve is a bit steep, but it will have no problem handling large amounts of data.
A more simple solution would be a key-value store, like Redis. It's easier to understand than Cassandra. Redis only seems to support master-slave replication as a way to scale, so the write performance of your master server could be a bottleneck. Riak has a decentralized architecture without any central nodes. It has no single point of failure nor any bottlenecks, so it's easier to scale out.
Key-value storage seems to be an appropriate solution for my system. After taking a quick look at those storages, I'm concerned about a race condition, as there will be a lot of clients trying to do these steps on the same key:
count = storage.get(key)
storage.set(key,count+1)
I have worked with Tokyo Cabinet before, and it has an 'addint' method which perfectly matched my case; I wonder if other storages have a similar feature? I didn't choose Tokyo Cabinet/Tyrant because I had experienced some issues with its scalability and data stability (e.g. repairing corrupted data, ...).
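For what it's worth, Redis covers exactly this case with its atomic INCR/INCRBY commands, so the get-then-set race above disappears; a minimal sketch with the redis-py client (the key name is made up):

import redis

r = redis.Redis(host="localhost", port=6379)

# Server-side atomic increments: no read-modify-write race between clients,
# analogous to Tokyo Cabinet's addint.
r.incr("pattern:/checkout/*:count")           # +1
r.incrby("pattern:/checkout/*:count", 5)      # +n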