What is the performance of the Weaviate Automatic Classification process?

I would like to investigate the possibility of enriching Splunk-ingested data by using Weaviate Automatic Classification in the streaming ingestion pipeline.
This can only work if the Automatic Classification process has only a minor impact on the ingestion rate.
Is there any benchmarking data available for the Automatic Classification process (varying text size, schema complexity etc.)?
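If no published numbers turn up, one option is to measure it against your own schema and text sizes. Below is a minimal, hypothetical timing harness; `trigger_classification()` is a placeholder you would replace with the classification call exposed by your Weaviate client version, and the batch sizes are purely illustrative.

```python
# Hypothetical timing harness: measure how long one Automatic Classification
# run takes for different batch sizes. trigger_classification() is a
# placeholder; swap in the classification call of your Weaviate client.
import statistics
import time


def trigger_classification(batch_size: int) -> None:
    # Placeholder: start a classification run over `batch_size` freshly
    # ingested objects and block until it finishes. Simulated here with a
    # sleep so the harness runs as-is.
    time.sleep(0.01 * batch_size ** 0.5)


def benchmark(batch_sizes, repeats=3):
    results = {}
    for size in batch_sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            trigger_classification(size)
            timings.append(time.perf_counter() - start)
        results[size] = statistics.median(timings)
    return results


for size, seconds in benchmark([100, 1_000, 10_000]).items():
    print(f"{size} objects: {seconds:.2f}s (~{size / seconds:.0f} objects/s)")
```

Comparing the resulting objects/s figure against your Splunk ingestion rate should show whether the classification step can keep up in the streaming pipeline.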

Related

Using clustering classification as regression feature?

I am attempting to use KMeans clustering to create a feature for an XGBoost regression. The problem is, I am not sure whether there is data leakage. The data has a date column, so right now I am clustering on the first 70% of the data sorted by date and using that same portion as my training set.
The target variable is included in the clustering. Using the cluster as a feature provides a huge boost to test scores, so I worry that this is causing data leakage. However, the cluster labels used for the test scores are assigned to unseen data in the test set.
Is this valid, or is it causing data leakage? Thank you
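For reference, here is a minimal sketch of a leakage-safe variant of this setup, assuming scikit-learn and xgboost; the file and column names (`data.csv`, `date`, `target`) are illustrative. The key difference from the setup described above is that the target variable is excluded from the clustering, and the fitted KMeans model is only applied (not refit) to the later test rows.

```python
# Leakage-safe sketch: cluster on training features only (no target), then
# reuse the fitted model to label the held-out, later rows.
import pandas as pd
from sklearn.cluster import KMeans
from xgboost import XGBRegressor

df = pd.read_csv("data.csv").sort_values("date")   # hypothetical file/columns
split = int(len(df) * 0.7)                         # first 70% by date
train, test = df.iloc[:split].copy(), df.iloc[split:].copy()

# Assumes the remaining feature columns are numeric.
feature_cols = [c for c in df.columns if c not in ("date", "target")]

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train[feature_cols])
train["cluster"] = km.labels_
test["cluster"] = km.predict(test[feature_cols])   # test rows stay unseen

model = XGBRegressor(n_estimators=200)
model.fit(train[feature_cols + ["cluster"]], train["target"])
print("R^2 on held-out period:",
      model.score(test[feature_cols + ["cluster"]], test["target"]))
```

If the score boost survives with the target removed from the clustering, the cluster feature is genuinely informative rather than a proxy for the label.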

Advantages of ARIMA with Spark

I'm new to Spark and Scala. I'm working on a project doing forecasting with ARIMA models. I see from the posts below that I can train ARIMA models with Spark.
I'm wondering: what is the advantage of using Spark for ARIMA models?
How to do time-series simple forecast?
https://badrit.com/blog/2017/5/29/time-series-analysis-using-spark#.W9ONGBNKi7M
The advantage of Spark is that it is a distributed processing engine. If you have a huge amount of data, which is typically the case in real-life systems, you need such a processing engine. Running any algorithm, not only ARIMA, on a platform like Spark brings benefits in scalability and performance.
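One concrete pattern (a sketch, assuming PySpark 3.x and statsmodels; the input path and column names are illustrative) is to fit one independent ARIMA model per series and let Spark parallelize across the series:

```python
# Sketch: fit an independent ARIMA(1,1,1) per series, parallelized by Spark.
# Assumes PySpark 3.x (applyInPandas) and statsmodels on the executors.
import pandas as pd
from pyspark.sql import SparkSession
from statsmodels.tsa.arima.model import ARIMA

spark = SparkSession.builder.appName("parallel-arima").getOrCreate()

# Long format: one row per (series_id, date, value); path is illustrative.
df = spark.read.parquet("sales_by_store.parquet")


def forecast_one_series(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on one executor: ordinary single-machine ARIMA on a single series.
    pdf = pdf.sort_values("date")
    fitted = ARIMA(pdf["value"], order=(1, 1, 1)).fit()
    forecast = fitted.forecast(steps=7)
    return pd.DataFrame({
        "series_id": pdf["series_id"].iloc[0],
        "step": range(1, 8),
        "forecast": forecast.to_numpy(),
    })


result = df.groupBy("series_id").applyInPandas(
    forecast_one_series,
    schema="series_id string, step int, forecast double",
)
result.show()
```

Note that the ARIMA fit itself is not distributed here; Spark's contribution is fanning many independent fits (and the data handling around them) out across the cluster, which is usually where the scalability benefit shows up.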

What is the algorithm of OrientDB partitioning?

I can't find which partitioning algorithm OrientDB supports.
I need a graph database that supports a clever partitioning or rebalancing algorithm to reduce the number of cut edges (edges that point to another server), because I have a lot of reads but few writes.
Also, does the Titan database support such a clever algorithm?

Does MATLAB support the parallelization of supervised machine learning algorithms? Alternatives?

Up to now I have used RapidMiner for some data/text mining tasks, but with an increasing amount of data there are huge performance issues. AFAIK the RapidMiner Parallel Processing Extension is only available for the enterprise version; unfortunately, I am limited to the community version.
Now I want to move these tasks to a high-performance cluster using MATLAB (academic license). I did not find any information that the Parallel Computing Toolbox supports e.g. SVM or KNN.
Does MATLAB or any additional library support the parallelization of data mining algorithms?
Most data mining and machine learning functionality for MATLAB is contained within Statistics Toolbox (in recent versions, that's called Statistics and Machine Learning Toolbox). To enable parallelization, you'll also need Parallel Computing Toolbox, and to enable that parallelization to be carried out on an HPC cluster, you'll need to install MATLAB Distributed Computing Server on the cluster.
There are lots of ways that you might want to parallelize data mining tasks - for example, you might want to parallelize an individual learning task, or parallelize a cross-validation, or parallelize several learning tasks across multiple datasets.
The first is possible for some, but not all of the data mining algorithms in Statistics Toolbox. MathWorks are gradually introducing that piece by piece. For example, kmeans is parallelized, and there is a parallelized algorithm for bagged decision trees, but I believe SVM learning is currently not parallelized. You'll need to look into the documentation for Statistics Toolbox to find out if the algorithms you require are on the list.
The second two are possible. Functionality in Statistics Toolbox for cross-validation (and bootstrapping, jack-knifing) is parallelized, as are some feature selection algorithms. And in order to parallelize running several jobs over multiple datasets, you can use functionality from Parallel Computing Toolbox (such as a parfor or parallel for loop) to iterate over them.
In addition, the upcoming R2015b release of MATLAB (out in September) will include GPU-enabled statistics functionality, providing additional speedups.

Drawbacks of Spark Streaming in Comparison With Real Streaming Computing Systems

Some say that Spark Streaming, even though it can handle streams in the form of micro-batches, is still not quite a streaming computing system like Storm. So what are the limiting factors of this micro-batch approach? What makes it less than a real streaming system? Thanks!
Take a look at the Spark Streaming paper here. It compares record-at-a-time and batch streaming. Since it's a paper on Spark, it's biased towards Spark's approach of batch streaming. Also, the paper is two years old and a lot has happened in that time frame. Here is another slide deck to get started.
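To make the micro-batch idea concrete, here is a minimal sketch using the classic DStream API (assuming a Spark version that still ships pyspark.streaming, and `nc -lk 9999` as a local text source): arriving records are grouped into 2-second batches, and each batch is processed as a small Spark job.

```python
# Micro-batch word count with the (legacy) DStream API: incoming lines are
# collected into 2-second batches, and each batch runs as a small RDD job.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "micro-batch-wordcount")
ssc = StreamingContext(sc, batchDuration=2)      # 2-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # e.g. fed by `nc -lk 9999`
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # one output per micro-batch

ssc.start()
ssc.awaitTermination()
```

The limitation the question is getting at falls out of this structure: end-to-end latency can never drop below the batch interval, whereas a record-at-a-time system like Storm processes each tuple as it arrives.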