Can fastText train with a corpus bigger than RAM? - fasttext

I need to train a fastText model on a 400GB corpus. As I don't have a machine with 400GB of RAM, I want to know whether the fastText implementation (for example, following this tutorial https://fasttext.cc/docs/en/unsupervised-tutorial.html) supports corpora bigger than RAM, and what RAM requirements I would have.

Generally for such models, the peak RAM requirement is a function of the size of the vocabulary of unique words, rather than the raw training material.
So, are there only 100k unique words in your 400GB? No problem, it'll only be reading a range at a time, & updating a small, stable amount of RAM. Are there 50M unique words? You'll need a lot of RAM.
Have you tried it to see what would happen?
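If you just want to try it, a minimal sketch with the official Python bindings (https://fasttext.cc/docs/en/python-module.html) could look like the following; the file names and parameter values are only illustrative. Training streams the corpus file from disk, so the 400GB file itself never has to fit in RAM - it is the vocabulary and the subword hash buckets that do.

import fasttext

# Vocabulary-related parameters, not corpus size, drive memory use:
#   minCount - dropping rare words shrinks the vocabulary
#   bucket   - number of hash buckets for character n-grams (default 2,000,000)
#   dim      - embedding dimension
model = fasttext.train_unsupervised(
    "corpus.txt",        # path to the (possibly 400GB) training file on disk
    model="skipgram",
    dim=100,
    minCount=20,         # raising this on a huge corpus keeps the vocabulary manageable
    bucket=2000000,
)
model.save_model("vectors.bin")

As a rough rule of thumb, the embedding matrices need on the order of (vocabulary size + bucket) * dim * 4 bytes, which is why capping the vocabulary matters far more than the corpus size.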

Related

yolov4.cfg: increasing subdivisions parameter consequences

I'm trying to train a custom dataset using the Darknet framework and YOLOv4. I built my own dataset, but I get an "Out of memory" message in Google Colab. It also said "try to change subdivisions to 64" or something like that.
I've searched around for the meaning of the main .cfg parameters such as batch, subdivisions, etc., and I understand that increasing the subdivisions number means splitting into smaller "pictures" before processing, thus avoiding the fatal "CUDA out of memory". And indeed switching to 64 worked well. Now I couldn't find anywhere the answer to the ultimate question: are the final weight file and accuracy "crippled" by doing this? More specifically, what are the consequences on the final result? If we put aside the training time (which would surely increase since there are more subdivisions to train), what happens to the accuracy?
In other words: if we use exactly the same dataset and train using 8 subdivisions, then do the same using 64 subdivisions, will the best_weight file be the same? And will the object detection success % be the same or worse?
Thank you.
First, read the comments.
Suppose you have 100 batches with:
batch size = 64
subdivisions = 8
Darknet will divide each batch into 64/8 => 8 mini-batches.
It then loads and works on those 8 parts one by one in RAM; because of low RAM capacity you can change the parameter according to your RAM capacity.
You can also reduce the batch size, so it takes up less space in RAM.
This does nothing to the dataset images.
It just splits a large batch that can't be loaded into RAM at once into smaller pieces.
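A conceptual sketch of that splitting (plain Python, not Darknet itself, with dummy stand-ins for image loading and the forward/backward pass): the full batch of 64 still feeds a single weight update, it is just pushed through the network in batch/subdivisions-sized chunks so only one chunk sits in memory at a time.

batch = 64
subdivisions = 8
mini_batch = batch // subdivisions        # 64 / 8 -> 8 images in memory at once

def load_images(start, count):
    # dummy loader standing in for reading `count` images from the dataset
    return list(range(start, start + count))

def forward_backward(images):
    # dummy stand-in for the network's forward/backward pass on one chunk;
    # returns that chunk's contribution to the gradient
    return float(sum(images))

accumulated = 0.0
for i in range(subdivisions):
    chunk = load_images(i * mini_batch, mini_batch)
    accumulated += forward_backward(chunk)

# one weight update per full batch, averaged over all 64 images
update = accumulated / batch
print("processed", batch, "images in chunks of", mini_batch, "-> one update")

If the update really is accumulated over the whole batch like this (which is how Darknet-style mini-batching is usually described), the learned weights should come out essentially the same for 8 vs 64 subdivisions; what changes is peak memory use and speed.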

scikit-learn: Hierarchical Agglomerative Clustering performance with increasing dataset

scikit-learn==0.21.2
The Hierarchical Agglomerative Clustering response time increases sharply (much faster than linearly) as the dataset grows.
My dataset is textual; each document is 7-10 words long.
I am using the following code to perform the clustering:
from sklearn.cluster import AgglomerativeClustering

# cluster with cosine distance and complete linkage, cutting the tree
# at a distance threshold instead of a fixed number of clusters
hac_model = AgglomerativeClustering(affinity='cosine',
                                    linkage='complete',
                                    compute_full_tree=True,
                                    connectivity=None, memory=None,
                                    n_clusters=None,
                                    distance_threshold=0.7)
# matrix is the TF-IDF document-term matrix described below
cluster_matrix = hac_model.fit_predict(matrix)
where the matrix sizes and timings are:
5000 x 1500: 17 seconds
10000 x 2000: 113 seconds
13000 x 2418: 228 seconds
I can't control 5000, 10000, 15000, as that is the size of the input, nor the feature set size (i.e. 1500, 2000, 2418), since I am using a BOW model (TF-IDF).
I end up using all the unique words (after removing stopwords) as my feature list, and this list grows as the input size increases.
So two questions.
How do I avoid an increase in feature set size irrespective of the increase in the size of the input dataset?
Is there a way I can improve on the performance of the Algorithm without compromising on the quality?
Standard AGNES hierarchical clustering is O(n³+n²d) in complexity. So the number of instances is much more a problem than the number of features.
There are approaches that typically run in O(n²d), although the worst case remains the same, and they will be much faster than this. With those you'll usually run into memory limits first... Unfortunately, this isn't implemented in sklearn as far as I know, so you'll have to use other clustering tools - or write the algorithm yourself.
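A quick back-of-the-envelope check of that cost model against the timings in the question (treating time as proportional to n³ + n²d, so the unknown constant cancels in the ratios):

def cost(n, d):
    # AGNES-style cost model from above: n^3 + n^2 * d
    return n**3 + n**2 * d

n0, d0, t0 = 5000, 1500, 17.0            # baseline run from the question
for n, d, t in [(10000, 2000, 113.0), (13000, 2418, 228.0)]:
    predicted = t0 * cost(n, d) / cost(n0, d0)
    print(f"n={n}, d={d}: observed {t:.0f}s, predicted ~{predicted:.0f}s")

That predicts roughly 126s and 273s against the observed 113s and 228s - the same ballpark, consistent with the instance count, not the feature count, dominating the runtime.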

Estimating Redshift Table Size

I am trying to create an estimate of how much space a table in Redshift is going to use; however, the only resource I found covers calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is that I need to calculate how much space a table with the following dimensions is going to occupy without running out of space on Redshift (i.e. it will define how many nodes we end up using).
Rows : ~500 Billion (The exact number of rows is known)
Columns: 15 (The data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries via Zone Maps, which identify the minimum and maximum value stored in each 1MB block. Highly compressed data is stored in fewer blocks, thereby requiring fewer blocks to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (eg 1 billion rows), allow Redshift to automatically select the compression types and then extrapolate to your full data size.
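A sketch of that extrapolation in Python (the sample figures below are hypothetical placeholders; on a real cluster you would read the measured size of the trial table, for example from SVV_TABLE_INFO, where size is reported in 1 MB blocks):

# e.g. SELECT "table", size, tbl_rows FROM svv_table_info WHERE "table" = 'my_sample_table';

sample_rows = 1_000_000_000          # rows loaded in the trial (1 billion, as suggested above)
sample_size_mb = 250_000             # hypothetical measured size of that sample, in 1 MB blocks
target_rows = 500_000_000_000        # ~500 billion rows in the full table

bytes_per_row = sample_size_mb * 1024 * 1024 / sample_rows
estimated_tb = bytes_per_row * target_rows / 1024**4

print(f"~{bytes_per_row:.1f} compressed bytes/row -> ~{estimated_tb:.1f} TB estimated")

The per-row figure already includes the effect of whatever compression encodings Redshift chose for the sample, which is the part that is otherwise hard to predict.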

Size of a random forest model in MLlib

I have to compute and keep in memory several (e.g. 20 or more) random forest models with Apache Spark.
I have only 8 GB available on the driver of the YARN cluster I use to launch the job, and I am facing OutOfMemory errors because the models do not fit in memory. I have already decreased the ratio spark.storage.memoryFraction to 0.1 to try to increase the non-RDD memory.
I have thus two questions:
How could I make these models fit in memory?
How could I check the size of my models?
EDIT
I have 200 executors, each with 8GB of memory.
I am not sure my models live in the driver, but I suspect they do, as I get OutOfMemory errors while there is plenty of free memory on the executors. Furthermore, I store these models in Arrays.
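As a rough way to approach the second question, a minimal PySpark sketch along these lines might help (old MLlib API, toy data standing in for the real RDD, arbitrary app name); node count and the length of the text dump are only crude proxies for the in-memory footprint, and on the Scala side org.apache.spark.util.SizeEstimator can give a byte-level estimate of an object.

from pyspark import SparkConf, SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

conf = (SparkConf()
        .setAppName("rf-size-check")
        # as in the question: shrink the RDD cache to leave more heap for non-RDD objects
        .set("spark.storage.memoryFraction", "0.1"))
sc = SparkContext(conf=conf)

# toy training data; replace with the real RDD of LabeledPoint
data = sc.parallelize([LabeledPoint(i % 2, [float(i), float(i % 7)]) for i in range(200)])

model = RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=20, maxDepth=10)

# crude size proxies: deep forests with many nodes are what blow up the driver heap
print("total nodes:", model.totalNumNodes())
print("debug string length (chars):", len(model.toDebugString()))

sc.stop()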

What is the max size of collection in mongodb

I would like to know what the max size of a collection in MongoDB is.
The MongoDB limitations documentation mentions that a single MMAPv1 database has a maximum size of 32TB.
Does this mean the max size of a collection is 32TB?
If I want to store more than 32TB in one collection, what is the solution?
There are theoretical limits, as I will show below, but even the lower bound is pretty high. It is not easy to calculate the limits correctly, but the order of magnitude should be sufficient.
mmapv1
The actual limit depends on a few things, like the length of shard names and the like (which adds up if you have a couple of hundred thousand of them), but here is a rough calculation with real-life data.
Each shard needs some space in the config db, which, like any other database, is limited to 32TB on a single machine or in a replica set. On the servers I administer, the average size of an entry in config.shards is 112 bytes. Furthermore, each chunk needs about 250 bytes of metadata information. Let us assume optimal chunk sizes of close to 64MB.
We can have at most 500,000 chunks per shard. 500,000 * 250 bytes equals 125MB of chunk information per shard. Adding the config.shards entry, that is 125.000112 MB per shard if we max everything out. Dividing 32TB by that value shows that we can have a maximum of slightly under 256,000 shards in a cluster.
Each shard in turn can hold 32TB worth of data. 256,000 * 32TB is 8.192 exabytes, or 8,192,000 terabytes. That would be the limit for our example.
Let's call it 8 exabytes. As of now, this easily translates to "enough for all practical purposes". To give you an impression: all the data held by the Library of Congress (arguably one of the biggest libraries in the world in terms of collection size) is estimated at around 20TB, including audio, video, and digital materials. You could fit that into our theoretical MongoDB cluster some 400,000 times. Note that this is the lower bound of the maximum size, using conservative values.
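The arithmetic above can be reproduced in a few lines (decimal units, matching the figures used in the answer):

TB = 10**12
MB = 10**6

chunk_size = 64 * MB                              # assumed near-optimal chunk size
chunks_per_shard = (32 * TB) // chunk_size        # 500,000 chunks per shard
meta_per_shard = chunks_per_shard * 250 + 112     # bytes of metadata: ~125.000112 MB
max_shards = (32 * TB) // meta_per_shard          # slightly under 256,000 shards
total_capacity = max_shards * 32 * TB             # ~8.192 exabytes

print("chunks per shard:  ", chunks_per_shard)
print("metadata per shard:", meta_per_shard / MB, "MB")
print("max shards:        ", max_shards)
print("total capacity:    ", total_capacity / 10**18, "EB")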
WiredTiger
Now for the good part: the WiredTiger storage engine does not have this limitation: the database size is not limited (since there is no limit on how many data files can be used), so we can have an unlimited number of shards. Even when we have those shards running on mmapv1 and only our config servers on WT, the size of a cluster becomes nearly unlimited – the limitation to 16.8M TB of RAM on a 64-bit system might cause problems somewhere and cause the indices of the config.shards collection to be swapped to disk, stalling the system. I can only guess, since my calculator refuses to work with numbers in that area (and I am too lazy to do it by hand), but I estimate the limit here to be in the two-digit yottabyte area (and the space needed to host it somewhere around the size of Texas).
Conclusion
Do not worry about the maximum data size in a sharded environment. No matter what, it is more than enough, even with the most conservative approach. Use sharding, and you are done. By the way: even 32TB is a hell of a lot of data: most clusters I know of hold less data than that and shard because IOPS and RAM utilization exceeded a single node's capacity.