Neptune-Gremlin-Python | Best practices for scaling network analysis and serving use cases like recommendations in real time - networkx

I have a general question around best practices for using Neptune DB as a graph database and its ability to scale up for complex computation. I want to develop a user recommendation system where incoming users on the platform are shown other users they are likely to follow, in order to grow the network.
For implementing a simple technique like triadic closure, should I use Gremlin queries against the graph DB (AWS Neptune in my case) to generate the recommendations? I believe in this case I would have to write Python scripts that parallelise queries across multiple nodes and generate recommendations for each node at scale.
OR is it more common practice to store the network data (nodes, edges and their properties) in a relational database, run SQL queries to load that data into Python, and then use packages like NetworkX on top of it? In this case I wouldn't have to worry about batch computation, since a relational database like Redshift would take care of it; however, I would be writing the Python logic to implement techniques such as triadic closure myself.
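For concreteness, the Python/NetworkX logic I have in mind for this second option would be roughly the sketch below (the table, column and variable names are just placeholders, and I'm treating "follows" edges as undirected):
import networkx as nx

# Assume `rows` was loaded with something like:
#   SELECT follower_id, followee_id FROM follows;
rows = [(1, 2), (2, 3), (1, 4), (4, 3), (2, 5)]

G = nx.Graph()
G.add_edges_from(rows)

def triadic_closure_candidates(graph, user, top_n=10):
    # Rank users not yet followed by the number of connections shared with `user`.
    neighbours = set(graph[user])
    scores = {}
    for friend in neighbours:
        for fof in graph[friend]:
            if fof != user and fof not in neighbours:
                scores[fof] = scores.get(fof, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

print(triadic_closure_candidates(G, 1))  # -> [(3, 2), (5, 1)]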
Additionally, in the future I may want to use more complex graph-computation techniques like graph clustering, partitioning, and calculation of different kinds of centrality. Are all/any of these possible within the framework of Neptune + Gremlin?
With the above context, below are the questions I am seeking answers to:
What is the tech stack commonly used by a data science team working with graph data to build solutions such as user recommendations? By data-science tech stack I mean technologies that help query, analyse, visualise, compute and serve.
Can Neptune + Gremlin replace python packages such as NetworkX for network analysis and centrality measurement?
Is Neptune DB ideal only as a data store OR can it also support complex network analysis and recommendation serving?
Any insight/resources on this would be really helpful!

It is definitely possible to do triadic closure in Gremlin (a rough sketch is included after the links below). I have also seen data scientists use both NetworkX and Gremlin together by running the gremlin-python client in a Jupyter Notebook. As this question is quite specific to Amazon Neptune, you may want to post to the Neptune support forum at [1]. There are also some useful Gremlin Recipes at [2].
If you post to the support forum I am sure someone will respond.
[1] https://forums.aws.amazon.com/forum.jspa?forumID=253&start=0
[2] http://tinkerpop.apache.org/docs/current/recipes/
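For what it's worth, a triadic-closure-style query from Python would look roughly like the sketch below. It follows the same pattern as the TinkerPop recommendation recipe; the endpoint, vertex id and the "follows" edge label are assumptions, not anything Neptune-specific.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import P, T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

user_id = '1234'  # hypothetical vertex id
scores = (
    g.V(user_id).as_('me')
     .out('follows').aggregate('direct')   # users already followed
     .out('follows')                       # their follows (friends of friends)
     .where(P.neq('me'))                   # exclude the user themselves
     .where(P.without('direct'))           # exclude users already followed
     .groupCount().by(T.id)                # score by number of shared connections
     .next()
)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:10])
conn.close()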

Related

AWS SageMaker - Realtime Data Processing

My company does online consumer behavior analysis and we do real-time predictions using the data we collect from various websites (with our JavaScript embedded).
We have been using AWS ML for real-time prediction, but now that we are experimenting with AWS SageMaker it has occurred to us that real-time data processing is a problem compared to AWS ML. For example, we have some string variables that AWS ML can automatically convert to numerics and use for real-time prediction. It does not look like SageMaker can do that.
Does anyone have any experience with real time data processing and prediction in AWS SageMaker?
It sounds like you're only familiar with the training component of SageMaker. SageMaker has several different components:
Jupyter Notebooks
Labeling
Training
Inference
You're most likely dealing with #3 and #4. There are a few ways to work with SageMaker here. You can use one of the built-in algorithms, which provide both training and inference containers that can be launched on SageMaker. To use these you can work entirely from the console and just point at your data in S3, similar to AWS ML. If you're not using the built-in algorithms, and you're on a common framework like TensorFlow, MXNet or PyTorch, you can use the sagemaker-python-sdk to create both training and prediction containers. Finally, if you're using a very custom algorithm (which you probably aren't if you're porting from AWS ML), you can bring your own Docker containers for training and inference.
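As a rough illustration of the sagemaker-python-sdk route (not a complete recipe; the role ARN, bucket, script name and versions are placeholders, and parameter names differ a little between SDK versions):
from sagemaker.tensorflow import TensorFlow

role = "arn:aws:iam::123456789012:role/SageMakerRole"   # hypothetical IAM role

# Training: your own train.py runs inside the managed TensorFlow container.
estimator = TensorFlow(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.4",
    py_version="py37",
)
estimator.fit({"train": "s3://your-bucket/train/"})

# Inference: deploy the trained model as a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.predict([[1.0, 2.0, 3.0]]))   # input shape depends on your model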
To create an inference endpoint you can go to the console, under the Inference section, and click through the endpoint-creation workflow.
Beyond that, if you want to use code to invoke the endpoint in real time you can use any of the AWS SDKs; I'll demonstrate with the Python SDK, boto3, here:
import boto3
sagemaker = boto3.client("runtime.sagemaker")  # client for the SageMaker runtime (invoke) API
# EndpointName is your deployed endpoint; Body must match the content type the model container expects
response = sagemaker.invoke_endpoint(EndpointName="herpderp", Body="some content")
In this code, if you needed to convert the incoming string values to numerical values, you could easily do that in your own code before building the request body.
Yes it can! You have to create a pipeline (preprocess + model + postprocess) and deploy it as an endpoint for real-time inference. You can check the inference examples on the SageMaker GitHub site; they use the sagemaker-python-sdk to train and deploy.
1: This one is for a small-data scikit-learn model:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline
2: It also supports big data (a Spark ML pipeline serving container); you can find that example in the same official GitHub repository.
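A minimal sketch of that inference-pipeline idea with the sagemaker-python-sdk, assuming a scikit-learn preprocessor and model (artifact paths, scripts, role and names are placeholders; the linked example has the full details):
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

role = "arn:aws:iam::123456789012:role/SageMakerRole"   # hypothetical IAM role

# First container: converts raw strings into numeric features.
preprocessor = SKLearnModel(
    model_data="s3://your-bucket/preprocessor/model.tar.gz",
    role=role,
    entry_point="preprocess.py",
    framework_version="0.23-1",
)
# Second container: the actual predictor.
model = SKLearnModel(
    model_data="s3://your-bucket/model/model.tar.gz",
    role=role,
    entry_point="predict.py",
    framework_version="0.23-1",
)

# One endpoint; each request flows through the containers in order.
pipeline = PipelineModel(name="preprocess-plus-model", role=role,
                         models=[preprocessor, model])
pipeline.deploy(initial_instance_count=1, instance_type="ml.m5.large",
                endpoint_name="my-inference-pipeline")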
AWS SageMaker is a robust machine learning service in AWS that manages every major aspect of machine learning implementation, including data preparation, model construction, training and fine-tuning, and deployment.
Preparation
SageMaker uses a range of resources to make it simple to prepare data for machine learning models, even when it comes from many sources or is in a variety of formats.
With SageMaker Ground Truth it's simple to label data, including video, images and text, and have it automatically processed into usable training data. Ground Truth processes and merges this data using auto-segmentation and a suite of tools to create labels that can be used in machine learning models. Together with SageMaker Data Wrangler and SageMaker Processing, this reduces a data preparation phase that might otherwise take weeks or months to a matter of days, if not hours.
Build
SageMaker Studio Notebooks centralize everything relevant to your machine learning models, allowing them to be conveniently shared along with their associated data. You can choose from a variety of built-in, open-source algorithms to start processing your data with SageMaker JumpStart, or you can build custom parameters for your machine learning model.
Once you've chosen a model, SageMaker starts processing data automatically and offers a simple, easy-to-understand interface for tracking your model's progress and performance.
Training
SageMaker provides a range of tools for training your model from the data you've prepared, including a built-in debugger for detecting possible errors.
The training job's results are saved in an Amazon S3 bucket, where they can be explored with other AWS services, including Amazon QuickSight.
Deployment
It's pointless to have strong machine learning models if they can't be easily deployed to your hosting infrastructure. Fortunately, SageMaker makes deploying machine learning models to your current services and applications as easy as a single click.
SageMaker allows for real-time data processing and prediction once a model is deployed. This has far-reaching consequences in a variety of areas, including finance and health. Businesses operating in the stock market, for example, can make real-time decisions about stocks and better-timed acquisitions by pinpointing the best time to buy.
Integration with Amazon Comprehend allows for natural language processing, transforming human language into usable data to train better models, or providing a chatbot to customers through Amazon Lex.
In conclusion…
Machine Learning is no longer a niche technological curiosity; it now plays a critical role in the decision-making processes of thousands of companies around the world. There has never been a better time to start your Machine Learning journey than now, with virtually unlimited frameworks and simple integration into the AWS system.
In this case, you will need to preprocess your data before feeding it into the InvokeEndpoint request body. If you use Python, you can use int('your_integer_string') or float('your_float_string') to convert a string to an integer or float. If you use Java, you can use Integer.parseInt("yourIntegerString"), Long.parseLong("yourLongString"), Double.parseDouble("yourDoubleString") or Float.parseFloat("yourFloatString").
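For example, in Python the conversion plus invocation could look roughly like this (field names and the endpoint name are placeholders; the ContentType has to match what your model container expects):
import boto3

raw_record = {"age": "42", "income": "55000.5", "clicks": "7"}   # strings as collected
features = [float(raw_record["age"]),
            float(raw_record["income"]),
            int(raw_record["clicks"])]

body = ",".join(str(v) for v in features)   # "42.0,55000.5,7"

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="herpderp",      # same placeholder endpoint name as above
    ContentType="text/csv",
    Body=body,
)
print(response["Body"].read())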
Hope this helps!
-Han

Social-networking: Hadoop, HBase, Spark over MongoDB or Postgres?

I am architecting a social-network, incorporating various features, many powered by big-data intensive workloads (such as Machine Learning). E.g.: recommender systems, search-engines and time-series sequence matchers.
Given that I currently have very few users (on the order of five) but foresee significant growth, what metrics should I use to decide between:
Spark (with/without HBase over Hadoop)
MongoDB or Postgres
I'm looking at Postgres as a means of reducing porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I've found its scaling and map-reduce features quite limiting.
I think you are heading in the right direction by searching for a software stack/architecture which can:
handle different types of load: batch, real time computing etc.
scale in size and speed along with business growth
be a live software stack which is well maintained and supported
have common library support for domain specific computing such as machine learning, etc.
On those merits, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature at handling large-scale data in a batch manner. It supports reliable and scalable storage (HDFS) and computation (MapReduce/YARN). With the addition of Spark, you keep the storage (HDFS) and gain the real-time computing performance that Spark brings.
In terms of development, both systems natively support Java/Scala. Advice on library support and performance tuning is abundant here on Stack Overflow and everywhere else. There are at least a few machine learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
For deployment, AWS and other cloud providers offer hosted solutions for Hadoop/Spark, so that's not an issue either.
I guess you should separate data storage and data processing. In particular, "Spark or MongoDB?" is not a good thing to ask, but rather "Spark or Hadoop or Storm?" and also "MongoDB or Postgres or HDFS?"
In any case, I would refrain from having the database do processing.
I have to admit that I'm a little biased, but if you want to learn something new, you have serious spare time, you're willing to read a lot, and you have the resources (in terms of infrastructure), go for HBase*; you won't regret it. A whole new universe of possibilities and interesting features opens up when you can have billions of atomic counters in real time.
*Alongside Hadoop, Hive, Spark...
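To make the atomic-counter point concrete, here's a tiny Python sketch using the happybase client against an HBase Thrift server (the host, table and column names are made up):
import happybase

connection = happybase.Connection("hbase-thrift-host")   # requires the HBase Thrift server
table = connection.table("page_metrics")

# Atomic increment on the server side: no read-modify-write race,
# even with many concurrent writers.
views = table.counter_inc(b"page:12345", b"stats:views", value=1)
print("views so far:", views)

# Counters can also be read without incrementing.
print(table.counter_get(b"page:12345", b"stats:views"))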
In my opinion, it depends more on your requirements and the data volume you will have than on the number of users (which is also a requirement). The Hadoop ecosystem (Hive/Impala, HBase, MapReduce, Spark, etc.) works fine with large amounts of data (GB/TB per day) and scales horizontally very well.
In the big-data environments I have worked with, I have always used Hadoop HDFS to store raw data and leveraged the distributed file system to analyse the data with Apache Spark. The results were stored in a database system like MongoDB to get low-latency queries or fast aggregates with many concurrent users. Then we used Impala for on-demand analytics. The main challenge when using so many technologies is scaling the infrastructure and the resources given to each one appropriately. For example, Spark and Impala consume a lot of memory (they are in-memory engines), so it's a bad idea to put a MongoDB instance on the same machine.
I would also suggest a graph database since you are building a social-network architecture, but I don't have any experience with those...
Are you looking to stay purely open-sourced? If you are going to go enterprise at some point, a lot of the enterprise distributions of Hadoop include Spark analytics bundled in.
I have a bias, but, there is also the Datastax Enterprise product, which bundles Cassandra, Hadoop and Spark, Apache SOLR, and other components together. It is in use at many of the major internet entities, specifically for the applications you mention. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
You want to think about how you will be hosting this as well.
If you are staying in the cloud, you will not have to choose, you will be able to (depending on your cloud environment, but, with AWS for example) use Spark for continuous-batch process, Hadoop MapReduce for long-timeline analytics (analyzing data accumulated over a long time), etc., because the storage will be decoupled from the collection and processing. Put data in S3, and then process it later with whatever engine you need to.
If you will be hosting the hardware, building a Hadoop cluster will give you the ability to mix hardware (heterogeneous hardware is supported by the framework), a robust and flexible storage platform, and a mix of tools for analysis, including HBase and Hive, plus support for most of the other things you've mentioned, such as Spark on Hadoop (not a port; actually the original design of Spark). It is probably the most versatile platform, and can be deployed/expanded cheaply, since the hardware does not need to be the same for every node.
If you are self-hosting, going with other cluster options will force hardware requirements on you that may be difficult to scale with later.
We use Spark + HBase + Apache Phoenix + Kafka + Elasticsearch, and scaling has been easy so far.
*Phoenix is a JDBC driver for HBase; it lets you use java.sql with HBase, Spark (via JdbcRDD) and Elasticsearch (via the JDBC river), which really simplifies integration.

Combining MongoDB and a GraphDB like Neo4J

As part of a CMS I'm developing, I've got MongoDB as the primary datastore, which feeds into ElasticSearch and Redis. All this is configured declaratively.
I'm currently trying to develop a declarative API in JSON (a DSL of sorts) which, when implemented, will let me write uniform queries in JSON while, at the backend, these datastores work in tandem to come up with the result. Federated search, if you will.
Now, while fleshing out the supported types of queries for this Json api, I've come across a class of queries not (efficiently) supported by my current setup: graph-based queries, like friend-of-friend, RDF-queries, etc. Something I'd like to support as well.
So I'm looking for a way to introduce a GraphDB into this ecosystem with the best fit. I should probably say the app-layer sits in Node.js.
I've come across lots of articles comparing Neo4j (a popular graph DB) with MongoDB, but not many actual use cases or real-world scenarios in which the two complement each other.
Any pointers highly appreciated.
You might want to take a look at structr[1], which has a RESTful graph database backend that you can configure using Java beans. In future versions, there will be a configuration option using REST calls only, so that you can fire up a structr server and configure and use it as a standalone graph database backend.
Just contact us on twitter or via email.
(disclaimer: I'm one of the developers of structr, so this comment may not be 100% impartial :))
[1] http://structr.org
The databases are very much complementary.
Use MongoDB to store your raw data / system of record and load the raw data into Neo4j for additional insights/analysis. When you are dealing with unstructured data, you want a datastore that is conducive to unstructured data; MongoDB fits the bill (as do other similar NoSQL databases). While Neo4j is considered a NoSQL database, it doesn't fit the bill for unstructured data: because you have to determine what is a relationship, what is a node, and what properties are stored on each, it's better suited to semi-structured data where you have some understanding of the type of analysis you want to do.
A great architecture is to store your unstructured data in MongoDB and use jobs to load it into Neo4j. This allows you to reload your graph if you discover new pieces of information you'd like to store in the graph for additional analysis.
They are definitely NOT replacements for each other. They fit very different use cases.
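A bare-bones version of such a MongoDB-to-Neo4j load job in Python could look like this, using pymongo and a recent neo4j driver (database, collection and field names plus credentials are made up; MERGE keeps the job re-runnable so the graph can be rebuilt at any time):
from pymongo import MongoClient
from neo4j import GraphDatabase

mongo = MongoClient("mongodb://localhost:27017")
users = mongo["cms"]["users"]                       # hypothetical collection

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_follow(tx, user_id, friend_id):
    # MERGE is idempotent, so re-running the job never duplicates nodes or edges.
    tx.run("MERGE (u:User {id: $uid}) "
           "MERGE (f:User {id: $fid}) "
           "MERGE (u)-[:FOLLOWS]->(f)",
           uid=user_id, fid=friend_id)

with driver.session() as session:
    for doc in users.find({}, {"_id": 1, "follows": 1}):
        for friend_id in doc.get("follows", []):
            session.execute_write(load_follow, str(doc["_id"]), str(friend_id))

driver.close()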

What are the real-time compute solutions that can take raw semistructured data as input?

Are there any technologies that can take raw semi-structured, schema-less big data input (say from HDFS or S3), perform near-real-time computation on it, and generate output that can be queried or plugged in to BI tools?
If not, is anyone at least working on it for release in the next year or two?
There are some solutions that take big semistructured input and produce queryable output, but they are usually:
unique
expensive
fairly secretive
If you are able to avoid direct computation with neural networks or expert systems, you will get close to a low-latency system. All you need is a team of brilliant mathematicians to model your problem, a team of programmers to realise it in code, and some cash to buy servers and the input/output channels they need.
Have you taken a look at Splunk? We use it to analyze Windows Event Logs and Splunk does an excellent job indexing this information to allow for fast querying of any string that appears in the data.

Are there any data warehouse frameworks?

I've got a lot of MySQL data that I need to generate reports from. It's mostly historical data, so it won't change much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of PHP scripts that run some complex queries and output CSV and Excel files. I also use phpMyAdmin with bookmarked queries, which I manually edit to change the parameters. The amount of data is growing and the number of people who need access to it is also growing, so I'm making the time to improve this situation.
I started reading about data warehousing the other day and it seems that this an area that relates to what I need to do. I've read some good articles and am even waiting on a book. I think I'm getting a handle on what these sorts of systems do and what's possible.
Creating a reporting system for my data has always been on my to-do list, but until recently I figured it would be a highly niche programming venture. Since I now know data warehousing is a common thing, I figure there must be some sort of reporting/warehousing frameworks available to ease the development. I'd gladly skip writing interfaces and scripts to schedule and email reports and the like, and stick to writing queries and setting up relations.
I've mostly been a LAMP guy, but I'm not above switching languages or platforms. I just need a more robust solution, as my one-off scripts don't scale well.
So where's a good place to get started?
I'll discuss a few points along the {budget, business utility function, time frame} spectrum. For convenience, let's follow the architecture conceptualization in the Wikipedia data warehouse article you linked to.
Operational database layer
The source data for the data warehouse, normalized for "in one place only" data maintenance
Data access layer
The transformation of your source data into your informational access layer. ETL tools to extract, transform, load data into the warehouse fall into this layer.
Informational access layer
• Report-facilitating data structure
Data is not maintained here; it is merely a reflection of your source data. Hence, denormalized structures (containing duplicate, but systematically derived, data) are usually most effective here.
• Reporting tools
How do you actually allow your users access to the data:
• pre-canned reports (simple)
• more dynamic slice-and-dice access methods
The data accessed for reporting and analysis, and the tools for reporting and analyzing that data, fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in the Wikipedia article, have to do with this layer.
Metadata layer (facilitates automation, organization, etc)
Roll your own (low-end)
For very little out-of-pocket cost, simply recognizing the need for denormalized structures can buy some efficiencies for those not already using them.
Get in the ballgame (some outlays required)
You don't need to use all the functionality of a platform right off the bat.
IMO, however, you want to be on a platform that you know will grow, and in the highly competitive and consolidating BI environment, that seems to mean one of the four enterprise mega-vendors:
Microsoft (the platform of our 110 employee firm)
SAP
Oracle
IBM
My firm is at this stage. In the "data access layer" we use some of the ETL capability offered by SQL Server Integration Services (SSIS), with some alternate use of the open-source (but, in practice, license-requiring) Talend product; a denormalized reporting structure (implemented entirely in the basic SQL Server database); and SQL Server Reporting Services (SSRS) to largely automate (depending on your skill) the production of pre-specified reports. Note that an SSRS "report" is merely a (scalable) XML configuration/specification that gets rendered at runtime by the SSRS engine. Choices such as export to an Excel file are simple options.
Serious Commitment (some significant human commitment required)
Notice that, above, we have yet to utilize the data-mining / dynamic slicing-and-dicing capabilities of SQL Server Analysis Services. We are working toward that, but are now focused on improving the quality of our data cleansing in the "data access layer".
I hope this helps you to get a sense of where to start looking.
Pentaho has put together a pretty comprehensive suite of products. The products are "free", but be prepared for the usual heavy sell once you fork over your identifying information.
I haven't had a chance to really stretch them as we're a Microsoft shop from one sad end to the other.
I think you should first check out Kimball and Inmon and see if you want to approach your data warehouse in a particular way. Kimball, in particular, lays out a very good framework for the modelling and construction of the warehouse.
There are a number of tools which try to ease the process of designing, implementing and managing/operating a data warehouse; they each have their strengths and weaknesses, and often vastly different price points. Under the covers, you are always going to be best off if you have a good knowledge of warehousing principles from the Kimball and/or Inmon camps.
As well as tools like Kalido and WhereScape RED (which do similar things in very different ways), many of the ETL platforms now have good built-in support for the donkey work of implementation: SCD components, lineage tracking, and so on.
Best, though, to view all these as tools in the hands of you, the craftsman: they make certain easy things even easier (or even trivial) and some hard things easier, but for some things they just get in the way, IMHO ;) Learn the methodology and principles first and get a good understanding of them, and then you will know which tools to apply from your kitbag, and when...
It hasn't been updated in a while but there's a nice Data Warehousing/ETL Ruby package called ActiveWarehouse.
But I would check out the Pentaho products like Nick mentioned in another answer. It should easily handle the volume of data you have and may provide you with more ways to slice and dice your data than you could have ever imagined.
The best framework you can currently get is Anchor Modeling.
It might look quite complex because of its generic structure and built-in capability to historize data.
The modelling technique is also quite different from ERD.
But you end up with SQL code to generate all DB objects, including 3NF views, and:
inserts/updates handled by triggers
the ability to query any point/range in history
your application developers will not see the underlying 6NF anchor model.
The technology is open source and, at the moment, unbeatable.
If you have Anchor Modeling questions, you may want to ask them under the anchor-modeling tag.
Kimball is the simpler method for data warehousing.
We use Informatica for moving data around, but it doesn't do DW things like indexing by default.
I like the idea of WhereScape RED as a DW tool, and of using MS SQL's linked servers to obviate the need for an ETL tool.