Can a Mobile Operator use Netflow same as used by a WireLine Operator - netflow

Can a Mobile Operator use Netflow same as used by a WireLine Operator to gain insights in to subscriber behavior. The reason I am asking this question is because the network typologies are very different.

I've implemented solutions for both types and you are correct in your point about different requirements. The number of netflow collection points is dramatically increased when deploying for a mobile provider as the have a lot of Point of Presence (PoP) for the last mile. The network complexity of many routes requires that the netflow product be smart about de-duplication of netflow for accurate visibility. Different network topologies also require different mitigation architectures. I would recommend a netflow solution that:
1. Handles de-duplication for complex networks easily and accurately - identify Core /Edge and so on.
2. Does not charge by users or objects protected so greater value in a large PoP deployment. Billing by overall flowrate is the easiest to budget, maximize value.
3. Play nice with others- To get the most from netflow, it needs to be shared and propagated to existing ops systems
4. Scale and sustainability- The growth of mobile has increased flowrate, so having a system that can scale to unlimited flowrate, and handle all the storage and load balancing automatically if key.
I highly recommend the power of netflow to improve QoS and reduce costs for a mobile operator. Here is some additional detail from and blog post https://www.flowtraq.com/network-flow-analysis-for-maximum-security/
Bottom line- A netflow product with Topology awareness.
-Gurdev

Related

What are the pros and cons of DynamoDB with respect to Google Cloud Datastore

My understanding is DynamoDB behave like a giant table which you must specify a hash key and range key.
The core concept of Google Cloud Datastore is entity based (like Cassandra) and is more flexible, i.e. can use more than 1 index.
But are there any more in-depth comparison?
AWS DynamoDB is a pretty simple flat key-value store. It has support for conditional writes and sets which allow for some cool features. You specify the amount of horsepower you want (which you can only adjust a few times a day) and AWS splits up your dataset uniformly across enough database nodes to meet your demands. You have to make sure your key values are sufficiently random as to guarantee balanced access across your dataset. AWS almost guarantees single-digit latencies. Transactions are not supported. You specify the consistency of operations.
Google Cloud Datastore is a more sophisticated key-valueish store with built-in transaction support and entity hierarchy. You don't have to worry about the capacity of the system, it automatically scales to your data size and access patterns. You have less control of some things so you have to pay attention. You cannot specify for a read to be consistent, but you can force consistency by structuring your entities in a certain way.
One downside of Google Cloud products I have experienced is that documentation and language support is not very uniform. Sometimes you have to read documentation of another language to understand the system fully and many features are not supported in certain languages.
There are a lot of other differences. Look at the API reference of your favorite language on both documentation pages and you'll get a decent feel of the specific features of each.

What are the real-time compute solutions that can take raw semistructured data as input?

Are there any technologies that can take raw semi-structured, schema-less big data input (say from HDFS or S3), perform near-real-time computation on it, and generate output that can be queried or plugged in to BI tools?
If not, is anyone at least working on it for release in the next year or two?
There are some solutions with big semistructured input and queried output, but they are usually
unique
expensive
secret enough
If you are able to avoid direct computations using neural networks or expert systems, you will be close enough to low latency system. All you need is a team of brilliant mathematicians to make a model of your problem, a team of programmers to realize it in code and some cash to buy servers and get needed input/output channels for them.
Have you taken a look at Splunk? We use it to analyze Windows Event Logs and Splunk does an excellent job indexing this information to allow for fast querying of any string that appears in the data.

Can ArangoDB scale like MongoDB or CouchDB

I'm reading about ArangoDB and it is more interesting but I can't find where in the documentation how ArangoDB scales. Does ArangoDB scale and can it use sharding like MongoDB or CouchDB?
EDIT
ArangoDB supports sharding since Version 2.0.
Version 3.0 will bring VelocyPack, which is a binary JSON representation optimized for compactness, parseability and composeability. It supersedes the shape concept / shaped JSON.
/EDIT
I am the chief architect of ArangoDB.
​monkegjinni is right, ArangoDB did not support sharding, but replication. Why?
​Short version:
Offering a support for fairly complex data models like graphs and documents gets into conflicts with how sharding works. However, with the efficiency of modern SSD and computers, we believe that almost all projects no longer need sharding. Today's computer will easily store all data on a single nodes. What these projects need is replication for load distribution which is supported by ArangoDB.
Long version:
There are actually to separate scaling issues.
The first issue is distributing the request over several servers to balance the request load.
ArangoDB will support this through synchronous replication of writes and distribution of the read requests.
Note that most database systems follow a very similar path, i.e., they support distributing the requests either with restricted consistency guarantees or they allow writes only on one node and distribute the read requests. They have this restriction because distributing write requests and supporting full consistency is impossible to do efficiently. And doing it inefficiently will negate the gain that we wanted to achieve through distribution.
​The second issue is distributing the data over several servers to allow larger datasets.
ArangoDB does not support distributing the data over several servers.
​We have made this decision, because distributing the data over several servers always comes at a price.
​This price can be very explicit. For example it can be that the data model is very limited. This is the route that key value stores such as Dynamo or RIAK have taken. Here the data model and the supported queries are so simple, that it is always possible to direct a query to the server (or the small number of servers) on which the requested value live.
​Note that we do believe that this approach is valid for some applications (e.g. Amazons database). But we believe that the number of applications that truly need to store so much data that they must distribute it over a large number of servers and must therefore restrict the access pattern to key-value is very small.
​Or the price can be hidden. This is for example the case if the data is distributed and the database system allows general queries. In that case the query must be distributed over all servers (because the data you are looking for may live on any of the servers). That makes the queries inefficient.
​The ArangoDB approach is rather to squeeze the most onto one server (well ArangoDB supports multiple servers - but to support availability). For this it uses two main strategies.
​One strategy is to make use of SSDs. Note that the capacity of SSDs is growing larger at an incredible rate (you can buy a Terabyte of SSD for by far less money that a second server would cost you). And endurance (the total amount of data that can be written to a SSD) goes up to Petabytes (now that vendors finally get the wear leveling algorithms right) - so reliability of SSDs is no longer an issue. And the performance of those SSDs is very nice (closer to main memory than to ordinary disks).
​The other strategy is to store the data efficiently. ArangoDB uses shapes to store documents: A shape is the information which attributes and attribute types a document has - all document with the same shape share the representation of this information. This means that documents can be stored in less space than the JSON or BSON representation would require.
As I understand, it does not allow sharding (prior to version 2.0), but replication.
From the link
AvocadoDB effortlessly permits replications. We like the “zero-admin principle”. Making replications with AvocadoDB is really easy: Insert IP address and go!
Following replication types are intended for version 2:
master-master synchronous,
master-master asynchronous,
master-slave synchronous,
master-slave asynchronous

StatsD type server for advanced analytics

My company is growing and so is our need for analytics.
But concurrently we are needing greater speed and complexity.
I'm looking for recommendations to FOSS stats servers similar to StatsD, but capable of more complexity. On the scale of things like cohort charts, grouping actions by category and extracting unique vs total.

Is NoSQL 100% ACID 100% of the time?

Quoting: http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-death/
There have been various attempts to
overcome SQL’s performance and
scalability problems, including the
buzzworthy NoSQL movement that burst
onto the scene a couple of years ago.
However, it was quickly discovered
that while NoSQL might be faster and
scale better, it did so at the expense
of ACID consistency.
Wait - am I reading that wrongly?
Does it mean that if I use NoSQL, we can expect transactions to be corrupted (albeit I daresay at a very low percentage)?
It's actually true and yet also a bit false. It's not about corruption it's about seeing something different during a (limited) period.
The real thing here is the CAP theorem which simply states you can only choose two of the following three:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition
tolerance (the system continues to operate despite arbitrary message loss)
The traditional SQL systems choose to drop "Partition tolerance" where many (not all) of the NoSQL systems choose to drop "Consistency".
More precise: They drop "Strong Consistency" and select a more relaxed Consistency model like "Eventual Consistency".
So the data will be consistent when viewed from various perspectives, just not right away.
NoSQL solutions are usually designed to overcome SQL's scale limitations. Those scale limitations are explained by the CAP theorem. Understanding CAP is key to understanding why NoSQL systems tend to drop support for ACID.
So let me explain CAP in purely intuitive terms. First, what C, A and P mean:
Consistency: From the standpoint of an external observer, each "transaction" either fully completed or is fully rolled back. For example, when making an amazon purchase the purchase confirmation, order status update, inventory reduction etc should all appear 'in sync' regardless of the internal partitioning into sub-systems
Availability: 100% of requests are completed successfully.
Partition Tolerance: Any given request can be completed even if a subset of nodes in the system are unavailable.
What do these imply from a system design standpoint? what is the tension which CAP defines?
To achieve P, we needs replicas. Lots of em! The more replicas we keep, the better the chances are that any piece of data we need will be available even if some nodes are offline. For absolute "P" we should replicate every single data item to every node in the system. (Obviously in real life we compromise on 2, 3, etc)
To achieve A, we need no single point of failure. That means that "primary/secondary" or "master/slave" replication configurations go out the window since the master/primary is a single point of failure. We need to go with multiple master configurations. To achieve absolute "A", any single replica must be able to handle reads and writes independently of the other replicas. (in reality we compromise on async, queue based, quorums, etc)
To achieve C, we need a "single version of truth" in the system. Meaning that if I write to node A and then immediately read back from node B, node B should return the up-to-date value. Obviously this can't happen in a truly distributed multi-master system.
So, what is the "correct" solution to the problem? It details really depend on your requirements, but the general approach is to loosen up some of the constraints, and to compromise on the others.
For example, to achieve a "full write consistency" guarantee in a system with n replicas, the # of reads + the # of writes must be greater or equal to n : r + w >= n. This is easy to explain with an example: if I store each item on 3 replicas, then I have a few options to guarantee consistency:
A) I can write the item to all 3 replicas and then read from any one of the 3 and be confident I'm getting the latest version B) I can write item to one of the replicas, and then read all 3 replicas and choose the last of the 3 results C) I can write to 2 out of the 3 replicas, and read from 2 out of the 3 replicas, and I am guaranteed that I'll have the latest version on one of them.
Of course, the rule above assumes that no nodes have gone down in the meantime. To ensure P + C you will need to be even more paranoid...
There are also a near-infinite number of 'implementation' hacks - for example the storage layer might fail the call if it can't write to a minimal quorum, but might continue to propagate the updates to additional nodes even after returning success. Or, it might loosen the semantic guarantees and push the responsibility of merging versioning conflicts up to the business layer (this is what Amazon's Dynamo did).
Different subsets of data can have different guarantees (ie single point of failure might be OK for critical data, or it might be OK to block on your write request until the minimal # of write replicas have successfully written the new version)
The patterns for solving the 90% case already exist, but each NoSQL solution applies them in different configurations. The patterns are things like partitioning (stable/hash-based or variable/lookup-based), redundancy and replication, in memory-caches, distributed algorithms such as map/reduce.
When you drill down into those patterns, the underlying algorithms are also fairly universal: version vectors, merckle trees, DHTs, gossip protocols, etc.
It does not mean that transactions will be corrupted. In fact, many NoSQL systems do not use transactions at all! Some NoSQL systems may sometimes lose records (e.g. MongoDB when you do "fire and forget" inserts rather than "safe" ones), but often this is a design choice, not something you're stuck with.
If you need true transactional semantics (perhaps you are building a bank accounting application), use a database that supports them.
First, asking if NoSql is 100% ACID 100% of the time is a bit of a meaningless question. It's like asking "Are dogs 100% protective 100% of the time?" There are some dogs that are protective (or can be trained to be) such as German Shepherds or Doberman Pincers. There are other dogs that could care less about protecting anyone.
NoSql is the label of a movement, and not a specific technology. There are several different types of NoSql databases. There are document stores, such as MongoDb. There are graph databases such as Neo4j. There are key-value stores such as cassandra.
Each of these serve a different purpose. I've worked with a proprietary database that could be classified as a NoSql database, it's not 100% ACID, but it doesn't need to be. It's a write once, read many database. I think it gets built once a quarter (or once a month?) and then is read 1000s of time a day.
There is a lot of different NoSQL store types and implementations. Every of them can solve trade-offs between consistency and performance differently. The best you can get is a tunable framework.
Also the sentence "it was quickly discovered" from you citation is plainly stupid, this is no surprising discovery but a proven fact with deep theoretical roots.
In general, it's not that any given update would fail to save or get corrupted -- these are obviously going to be a very big issue for any database.
Where they fail on ACID is in data retrieval.
Consider a NoSQL DB which is replicated across numerous servers to allow high-speed access for a busy site.
And lets say the site owners update an article on the site with some new information.
In a typical NoSQL database in this scenario, the update would immediately only affect one of the nodes. Any queries made to the site on the other nodes would not reflect the change right away. In fact, as the data is replicated across the site, different users may be given different content despite querying at the same time. The data could take some time to propagate across all the nodes.
Conversely, in a transactional ACID compliant SQL database, the DB would have to be sure that all nodes had completed the update before any of them could be allowed to serve the new data.
This allows the site to retain high performance and page caching by sacrificing the guarantee that any given page will be absolutely up to date at an given moment.
In fact, if you consider it like this, the DNS system can be considered to be a specialised NoSQL database. If a domain name is updated in DNS, it can take several days for the new data to propagate throughout the internet (depending on TTL configuration).
All this makes NoSQL a useful tool for data such as web site content, where it doesn't necessarily matter that a page isn't instantly up-to-date and consistent as long as it is reasonably up-to-date.
On the other hand, though, it does mean that it would be a very bad idea to use a NoSQL database for a system which does require consistency and up-to-date accuracy. An order processing system or a banking system would definitely not be a good place for your typical NoSQL database engine.
NOSQL is not about corrupted data. It is about viewing at your data from a different perspective. It provides some interesting leverage points, which enable for much easier scalability story, and often usability too. However, you have to look at your data differently, and program your application accordingly (eg, embrace consequences of BASE instead of ACID). Most NOSQL solutions prevent you from making decisions which could make your database hard to scale.
NOSQL is not for everything, but ACID is not the most important factor from end-user perspective. It is just us developers who cannot imagine world without ACID guarantees.
You are reading that correctly. If you have the AP of CAP, your data will be inconsistent. The more users, the more inconsistent. As having many users is the main reason why you scale, don't expect the inconsistencies to be rare. You've already seen data pop in and out of Facebook. Imagine what that would do to Amazon.com stock inventory figures if you left out ACID. Eventual consistency is merely a nice way to say that you don't have consistency but you should write and application where you don't need it. Some types of games and social network application does not need consistency. There are even line-of-business systems that don't need it, but those are quite rare. When your client calls when the wrong amount of money is on an account or when an angry poker player didn't get his winnings, the answer should not be that this is how your software was designed.
The right tool for the right job. If you have less than a few million transactions per second, you should use a consistent NewSQL or NoSQL database such as VoltDb (non concurrent Java applications) or Starcounter (concurrent .NET applications). There is just no need to give up ACID these days.