Maximum size for a single wordlist - UIMA Ruta

What is the maximum size for a wordlist in Uima Ruta? Because I want to store list of countries, states and cities name.

There is no maximum size for wordlists in UIMA Ruta. The lines of the file are normally transferred into a char-based in-memory tree structure (a trie). This means that the size is only restricted by the available RAM, and memory consumption grows less than linearly with the number of entries because common prefixes are shared.
My largest wordlist consisted of about 500k entries, as far as I remember. So a list of country names should not be a problem.
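To illustrate why a char-based trie keeps memory growth below linear, here is a minimal Python sketch of the general idea (not the actual UIMA Ruta implementation, just an assumption of how prefix sharing works in any trie):

```python
# Minimal char-based trie to illustrate prefix sharing.
# This is NOT the UIMA Ruta implementation, only a sketch of the idea.

class TrieNode:
    def __init__(self):
        self.children = {}    # one child node per distinct next character
        self.is_entry = False # marks the end of a complete wordlist entry

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_entry = True

def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_entry

root = TrieNode()
for country in ["Germany", "Georgia", "Greece"]:
    insert(root, country)   # the shared prefix "Ge" is stored only once

print(contains(root, "Greece"))   # True
print(contains(root, "Gre"))      # False (prefix, not a full entry)
```

Because entries like country, state and city names share many prefixes, adding more entries often reuses existing nodes instead of allocating new ones.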
DISCLAIMER: I am a developer of UIMA Ruta

Related

Can fastText train with a corpus bigger than RAM?

I need to train a fastText model on a 400GB corpus. As I don't have a machine with 400GB of RAM, I want to know whether the fastText implementation (for example, following this tutorial: https://fasttext.cc/docs/en/unsupervised-tutorial.html) supports a corpus bigger than RAM, and what RAM requirements I would have.
Generally for such models, the peak RAM requirement is a function of the size of the vocabulary of unique words, rather than the raw training material.
So, are there only 100k unique words in your 400GB? No problem, it'll only be reading a range at a time, & updating a small, stable amount of RAM. Are there 50M unique words? You'll need a lot of RAM.
Have you tried it to see what would happen?
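As a rough illustration of why vocabulary size (not corpus size) is what matters, here is a sketch that streams a large corpus line by line to count unique words before training; the file name corpus.txt is just a placeholder:

```python
# Sketch: estimate vocabulary size by streaming the corpus from disk,
# so the 400GB file itself is never held in RAM -- only the word counts are.
from collections import Counter

vocab = Counter()
with open("corpus.txt", encoding="utf-8") as f:   # placeholder path
    for line in f:                                # one line in memory at a time
        vocab.update(line.split())

# fastText's unsupervised mode drops words below a minimum frequency
# (minCount, 5 by default), so the effective vocabulary is usually smaller.
effective = sum(1 for count in vocab.values() if count >= 5)
print(f"{len(vocab)} raw unique tokens, ~{effective} after minCount=5")
```

If the effective count is in the tens of millions, raising minCount is one way to keep the model's memory footprint manageable.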

What are space partitioning and dimensions in TimescaleDB?

I am new to TimescaleDB. I was learning about chunks and how to create chunks based on time.
But there is another time/space chunking concept which is confusing me a lot. Please help me with the queries below.
What is a dimension in TimescaleDB?
What is space chunking and how does it work?
Thanks in advance.
A dimension in TimescaleDB is associated with a column. Each hypertable requires at least a time dimension to be defined, which is a time column for the time series. The hypertable is then divided into chunks, where each chunk contains data for a time interval of the time dimension. As a result, all new data usually arrives in the latest chunk, while other chunks contain older data.
It is then possible to define space dimensions on other columns, for example a device column and/or a location column. No interval is defined for space dimensions; instead, a number of partitions is defined. So for the same time interval, several chunks will be created, equal to the number of partitions. Data are distributed by a hashing function on the values of the space dimension. For example, if 3 partitions are defined for a space dimension on the device column and 12 different device values are present in the data, each space chunk will contain 4 different values, assuming the hash function distributes the values uniformly.
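For concreteness, here is a minimal sketch (in Python with psycopg2; the table and column names "conditions", "time" and "device" are placeholders, and it assumes the timescaledb extension is already installed) of how a time dimension plus a space dimension might be defined:

```python
# Sketch: create a hypertable with a time dimension plus a space dimension.
# Table/column names are placeholders; adjust the connection string as needed.
import psycopg2

conn = psycopg2.connect("dbname=tsdb user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE conditions (
            time        TIMESTAMPTZ NOT NULL,
            device      TEXT        NOT NULL,
            temperature DOUBLE PRECISION
        );
    """)
    # Time dimension: chunks cover 1-day intervals of the "time" column.
    cur.execute("""
        SELECT create_hypertable('conditions', 'time',
                                 chunk_time_interval => INTERVAL '1 day');
    """)
    # Space dimension: rows are hash-partitioned into 3 partitions by "device",
    # so each 1-day interval is split into 3 chunks.
    cur.execute("SELECT add_dimension('conditions', 'device', number_partitions => 3);")
conn.close()
```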
Space dimensions are specifically useful for parallel I/O, when data are stored on several disks. Another scenario is multinode, i.e., the distributed version of a hypertable (a beta feature coming in the 2.0 release).
There are some complex usage cases where space partitioning will also be helpful.
You can read more in the add_dimension docs and in the cloud KB article about space partitioning.
A note in the doc:
Supporting more than one additional dimension is currently experimental.

How should I store data in a recommendation engine?

I am developing a recommendation engine. I think I can’t keep the whole similarity matrix in memory.
I calculated similarities of 10,000 items and it is over 40 million float numbers. I stored them in a binary file and it becomes 160 MB.
Wow!
The problem is that I could have nearly 200,000 items.
Even if I cluster them into several groups and create a similarity matrix for each group, I still have to load them into memory at some point.
But it will consume a lot of memory.
So, is there any way to deal with this data?
How should I store it and load it into memory while ensuring my engine responds reasonably fast to an input?
You could use memory mapping to access your data. This way you can view your data on disk as one big memory area (and access it just as you would access memory), with the difference that only the pages where you actually read or write data are (temporarily) loaded into memory.
If you can group the data somewhat, only smaller portions would have to be read into memory while accessing the data.
As for the floats: if you can make do with less resolution and store the values in, say, 16-bit integers, that would also halve the size.
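As a sketch of the memory-mapping idea (assuming the similarities are written as a flat binary file of 32-bit floats in row-major order; the file name and shape are placeholders):

```python
# Sketch: memory-map an n x n float32 similarity matrix stored on disk.
# Only the pages you actually touch are paged into RAM by the OS.
import numpy as np

n_items = 10_000
sim = np.memmap("similarities.bin", dtype=np.float32, mode="r",
                shape=(n_items, n_items))

item_id = 42
row = np.array(sim[item_id])          # loads only the pages holding this row
top10 = np.argsort(row)[::-1][:10]    # the 10 most similar items to item 42
print(top10)

# As noted above, quantizing the values to 16-bit (e.g. int16 scaled by 32767,
# or float16) would halve both the file size and the paging footprint.
```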

What is the maximum number of columns in the input data in MATLAB?

I must import a big data file into MATLAB, and its size is about 300 MB.
Now I want to know the maximum number of columns that I can import into MATLAB, so that I can divide the file into smaller files if necessary.
Please help me.
There is no "maximum" number of columns that you can create for a matrix. The limiting factors are your RAM (à la knedlsepp), the data type of the matrix (which is also important... a lot of people overlook this), your operating system, and also what version of MATLAB you're using - specifically whether it's 32-bit or 64-bit.
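As a rough back-of-the-envelope check (sketched here in Python, but the arithmetic is identical in MATLAB), you can estimate whether a matrix of a given shape and element type will fit in RAM before importing it; the example shape is hypothetical:

```python
# Sketch: estimate the memory a numeric matrix will need before importing it.
# MATLAB's default numeric type is double (8 bytes per element).
def matrix_bytes(rows, cols, bytes_per_element=8):
    return rows * cols * bytes_per_element

# A 300 MB text file parsed into doubles can expand or shrink in memory,
# since the text width per value rarely matches 8 bytes per double.
rows, cols = 1_000_000, 50          # hypothetical shape of the imported data
need = matrix_bytes(rows, cols)     # 400,000,000 bytes
print(f"~{need / 1024**3:.2f} GiB needed as double, "
      f"~{matrix_bytes(rows, cols, 4) / 1024**3:.2f} GiB as single")
```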
If you want a more definitive answer, there is a comprehensive chart on the MathWorks forums showing what you can allocate given your OS version, MATLAB version and the data type of the matrix you want to create.
The link to this post is here: http://www.mathworks.com/matlabcentral/answers/91711-what-is-the-maximum-matrix-size-for-each-platform
Even though the above chart is for MATLAB R2007a, the sizes will most likely not have changed over the evolution of the software.
There are a few caveats with the above figure that you need to take into account:
The above table also takes your workspace size into account. As such, if you have other variables in memory and you try to allocate a matrix that approaches the limit seen in the chart, the allocation will not succeed.
The above table assumes that MATLAB has just been launched with no major processing carried out in a startup.m file.
The above table assumes that there is unlimited system memory available, i.e. RAM plus any virtual memory or swap file.
The above table's actual limits will be less if there is insufficient system memory available, usually due to the swap file being too small.

Total MongoDB storage size

I have a sharded and replicated MongoDB with dozens of millions of records. I know that Mongo writes data with some padding factor to allow fast updates, and I also know that to replicate the database Mongo has to store an operation log, which requires some (actually, a lot of) space. Even with that knowledge, I have no idea how to estimate the actual size required by Mongo given the size of a typical database record. By now I have a discrepancy of a factor of 2 - 3 between weekly repairs.
So the question is: How to estimate a total storage size required by MongoDB given an average record size in bytes?
The short answer is: you can't, not based solely on avg. document size (at least not in any accurate way).
To explain more verbosely:
The space needed on disk is not simply a function of the average document size. There is also the space needed for any indexes you create. Then there is the space needed if you do trigger document moves (despite padding, this does happen) - that space is placed on a list to be re-used, but depending on the data you subsequently insert, it may or may not be possible to re-use that space.
You can also add in the fact that pre-allocation means that occasionally a handful of documents will increase your on-disk space utilization by ~2GB as a new data file is allocated. Of course, with sufficient data, this will be essentially a rounding error, but it is worth bearing in mind.
The only way to estimate this kind of data-to-size ratio, assuming a consistent usage pattern, is to trend it over time for your particular use case and track the disk space usage versus the data inserted (the number of documents might be a better measure than data volume, depending on the variability of doc size).
Similarly, track the insertion rate, doc size, and the space gained back from a resync/repair. FYI - you can resync a secondary from scratch to get a "fresh" copy of the data files rather than running a repair, which can be less disruptive and use less space, depending on your setup.
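To trend the data-to-size ratio as suggested above, a minimal sketch (using pymongo; the database and collection names "mydb" and "events" are placeholders) could periodically record collection stats and compare logical data size with on-disk size:

```python
# Sketch: periodically snapshot collection stats to trend disk usage vs. data inserted.
# Database and collection names are placeholders; run e.g. daily via cron.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

def snapshot():
    stats = db.command("collstats", "events")
    return {
        "ts": time.time(),
        "count": stats["count"],                # number of documents
        "data_bytes": stats["size"],            # logical data size
        "storage_bytes": stats["storageSize"],  # allocated on-disk size
        "index_bytes": stats["totalIndexSize"],
    }

s = snapshot()
overhead = (s["storage_bytes"] + s["index_bytes"]) / max(s["data_bytes"], 1)
print(f"{s['count']} docs, on-disk/logical ratio ~ {overhead:.2f}")
```

Comparing these snapshots before and after a resync/repair also tells you how much of the growth is reusable free space versus real data.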