Ag-grid: Data Size Limitation for Enterprise license - ag-grid

I am integrating ag-grid in my product and wondering if anyone has done stress testing of ag-grid to see the limits of # of rows/#of columns (or file size) can break ag-grid?

Both the free and enterprise versions will take the same amount of data - the limiting factor will be your users browsers, not the grid itself.
The grid (by default) will only render what's displayed so really the only limiting factor will be the memory available in your users browser.
Practically we've found you'll almost certainly want to use a row model (infinite for example) or pagination for truly vast amounts of data as it simply isn't useful for a user to have massive amounts of data within the grid for user consumption.

Related

CPU L1/L2 cache size over the years

I'm trying to find a graph with information on the CPU L1/L2 cache sizes over the years.
I have only succeeded to find an old chart from 2008 (here).
I know that the cache sizes have stayed roughly the same over the past twenty years, I just want a graphical representation of this info.
Does anyone know where to find one?
Thanks!
There is apparently an open database of historical CPU specifications at http://cpudb.stanford.edu/. I don't see the specific visualization you are asking for, but the data contains cache sizes for processors spanning from 1970s to 2016, so it might be possible to create your own plot from it.

System Design - how to Pick CPU, Memory for an application

I am practicing System Design concepts and I am not clear what configuration (cpu, memory, disk storage) to pick for an application instance? Also, how many instances are needed (assuming you are running your application on Kubernetes cluster)
For Back of the envelope calculation ,I saw examples of calculating tps for read and write calls, calculate bandwidth needs, database storage needs etc. but I have not seen how to determine cpu, memory needs and how many instances are enough. Is there a procedure that guides to solve this problem?
My hunch says that we pick small to medium sized server instance (if we use cloud provider like AWS) and run stress tests for calculated TPS and see CPU and memory usage and see if we need to increase or decrease server configuration based on results?
I would greatly appreciate any inputs you may have.
I am not clear what configuration (cpu, memory, disk storage) to pick for an application instance? Also, how many instances are needed (assuming you are running your application on Kubernetes cluster)
This is mostly a question about economics. If resources was very cheap, you could use a lot of them - but unfortunately, they have an economic cost.
Scale out horizontal or scale up vertical
The first fundamental question to ask, should you scale up your app vertically (e.g. to bigger instances) or should you scale out your app horizontally.
The most important thing here is that scaling out horizontally is much easier. But wether you can scale out horizontally of if you have to scale up vertically depends on your app. If your app is a stateless webserver, it typically is very easy to scale out, but if you have a stateful cache or database, scale up vertically might be your only short term option. Try to design so that you can scale out horizontally since that is much easier.
Accurate size - use observability
To find your accurate size, use observability and investigate your bottlenecks and adjust relatively to that.
E.g. if you use too little memory, your app will be terminated, or if you use too little CPU, your response time will be slow. Just start somewhere and adjust.
In addition to Jonas's answer:
You have two approaches (which are not mutually exclusive):
Estimate your needs based on expected load, etc.
Adjust you needs based on what you observe in production.
Regarding the first approach:
Have you done any analysis into what your expected load is? E.g. how many users (unique sessions), how many requests on average per hour (page views, API calls, etc), potential peaks in activity leading to increased load, etc.
Have you done any benchmarking?
Have you looked at your system and what it does, and worked out if it has any specific resource (CPU, memory, disk, etc) needs?
Estimating resources ahead of time requires some knowledge (or informed guesses) regarding what the load will be, as per the 3 points above. Having an idea of what the daily or hourly request average is isn't a bad place to start.
Also make sure you aware if any potential spikes that might catch you out (end of month for financial systems/services). Whether or not these are significant enough that is worth worrying about is another thing. A friend of mine was working on a ticketing system once, and they had massive traffic spikes for major events that did warrant serious scaling-out and back... but your average system probably won't need to be that extreme.
CPU is probably only worth "worrying" about if you have anything that does any above average processing - this should be obvious through benchmarking or if you/your team has good knowledge of your code.
Disk usage can be calculated - e.g.
If on average a user generates 1Mb of data in a session (not including system logs), and you get 100 sessions a day then that's 100Mb a day, 500Mb a working week, 200Mb a month, etc.
If a user profile has on average 200Kb of data and 300Kb of storage space (images) then you can calculate that.
You can also do this for records, especially for records that you know are "large" (e.g. >25mb) or where there will be lots of them (e.g. millions).
You can also start to forecast growth over time if you allow a growth rate (e.g. number of users and their sessions, and the amount of data generated). A simple way to do that is to have a spreadsheet with some simple formulas that take various inputs like number of users, average requests per user, disk space per user, etc. You can then do what-if modelling by playing with the inputs.
In terms of the second approach - as Jonas says, observe and adjust. Make sure you know how to do that, and that your solution provides the data you need. This might be using metrics provided by your cloud-provider (if applicable) or instrumentation / reporting you have custom built into you solution.
Scaling-Up is probably more relevant in scenarios where you have a central point/resource that cannot be scaled-out, like a central database.

Smartsheet Sheet data limits

What are the current limits of the data contained on a smartsheet, and what is it based on ?
I've found some links talking about 5000 records and 500 columns, but I've reached the point with a sheet containing only 1626 rows and 123 columns.
What are the exact specifications, and is there any way to override the settings using the api ?
There are maximums on the number of rows, columns, and cells in a sheet. There is no method via the Smartsheet API to override these limits. On this Smartsheet Help Center article it currently states the maximums as the following:
5,000 Rows
200 Columns
200,000 Cells
It is possible to reach one of these maximums before reaching another. 5,000 rows at 200 columns would actually be more than 200,000 cells. Once one of these maximums is reached there may be errors and the amount of data in the sheet should be decreased.
But, these aren't absolute limits. Depending on the usage of other Smartsheet features, like Formulas, Cell Linking, and Conditional Formatting, there can be limits to the amount of data that can be stored in a sheet. There are a lot of variables that can affect the performance of a sheet and the amount of data that can be stored in it, so these limits aren't absolutely specific.
This can sometimes require a process of elimination to see what can be stored in your sheet. It is best to divide up the data into logical groupings as best you can to help keep the sheets working as you need.

What is a better approach of storing and querying a big dataset of meteorological data

I am looking for a convenient way to store and to query huge amount of meteorological data (few TB). More information about the type of data in the middle of the question.
Previously I was looking in the direction of MongoDB (I was using it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about HDF5 data format. Reading about it, I found some similarities with Mongo:
HDF5 simplifies the file structure to include only two major types of
object: Datasets, which are multidimensional arrays of a homogenous
type Groups, which are container structures which can hold datasets
and other groups This results in a truly hierarchical, filesystem-like
data format. Metadata is stored in the form of user-defined, named
attributes attached to groups and datasets.
Which looks like arrays and embedded objects in Mongo and also it supports indices for querying the data.
Because it uses B-trees to index table objects, HDF5 works well for
time series data such as stock price series, network monitoring data,
and 3D meteorological data.
The data:
Specific region is divided into smaller squares. On the intersection of each one of the the sensor is located (a dot).
This sensor collects the following information every X minutes:
solar luminosity
wind location and speed
humidity
and so on (this information is mostly the same, sometimes a sensor does not collect all the information)
It also collects this for different height (0m, 10m, 25m). Not always the height will be the same. Also each sensor has some sort of metainformation:
name
lat, lng
is it in water, and many others
Giving this, I do not expect the size of one element to be bigger than 1Mb.
Also I have enough storage at one place to save all the data (so as far as I understood no sharding is required)
Operations with the data.
There are several ways I am going to interact with a data:
convert as store big amount of it: Few TB of data will be given to me as some point of time in netcdf format and I will need to store them (and it is relatively easy to convert it HDF5). Then, periodically smaller parts of data (1 Gb per week) will be provided and I have to add them to the storage. Just to highlight: I have enough storage to save all this data on one machine.
query the data. Often there is a need to query the data in a real-time. The most of often queries are: tell me the temperature of sensors from the specific region for a specific time, show me the data from a specific sensor for specific time, show me the wind for some region for a given time-range. Aggregated queries (what is the average temperature over the last two months) are highly unlikely. Here I think that Mongo is nicely suitable, but hdf5+pytables is an alternative.
perform some statistical analysis. Currently I do not know what exactly it would be, but I know that this should not be in a real time. So I was thinking that using hadoop with mongo might be a nice idea but hdf5 with R is a reasonable alternative.
I know that the questions about better approach are not encouraged, but I am looking for an advice of experienced users. If you have any questions, I would be glad to answer them and will appreciate your help.
P.S I reviewed some interesting discussions, similar to mine: hdf-forum, searching in hdf5, storing meteorological data
It's a difficult question and I am not sure if I can give a definite answer but I have experience with both HDF5/pyTables and some NoSQL databases.
Here are some thoughts.
HDF5 per se has no notion of index. It's only a hierarchical storage format that is well suited for multidimensional numeric data. It's possible to extend on top of HDF5 to implement an index (i.e. PyTables, HDF5 FastQuery) for the data.
HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).
HDF5 supports compression filters which can - unlike popular belief - make data access actually faster (however you have to think about proper chunk size which depends on the way you access the data).
HDF5 is no database. MongoDB has ACID properties, HDF5 doesn't (might be important).
There is a package (SciHadoop) that combines Hadoop and HDF5.
HDF5 makes it relatively easy to do out core computation (i.e. if the data is too big to fit into memory).
PyTables supports some fast "in kernel" computations directly in HDF5 using numexpr
I think your data generally is a good fit for storing in HDF5. You can also do statistical analysis either in R or via Numpy/Scipy.
But you can also think about a hybdrid aproach. Store the raw bulk data in HDF5 and use MongoDB for the meta-data or for caching specific values that are often used.
You can try SciDB if loading NetCDF/HDF5 into this array database is not a problem for you. Note that if your dataset is extremely large, the data loading phase will be very time consuming. I'm afraid this is a problem for all the databases. Anyway, SciDB also provides an R package, which should be able to support the analysis you need.
Alternatively, if you want to perform queries without transforming HDF5 into something else, you can use the product here: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf
Moreover, if you want to perform a selection query efficiently, you should use index; if you want to perform aggregation query in real time (in seconds), you can consider approximate aggregation. Our group has developed some products to support those functions.
In terms of statistical analysis, I think the answer depends on the complexity of your analysis. If all you need is to compute something like entropy or correlation coefficient, we have products to do it in real time. If the analysis is very complex and ad-hoc, you may consider SciHadoop or SciMATE, which can process scientific data in the MapReduce framework. However, I am not sure if SciHadoop currently can support HDF5 directly.

What are the limitations of GeoServer and OpenLayers when showing a large number of points?

We are trying to show a map with a large number of points (ranging from 1000 up to 20000 depending on the users criteria) using OpenLayers and GeoServer. The points are stored in a PostgreSQL database.
Whilst the application seems to have little problem displaying the lower range, its practical limit seems to be around 5000 points. The SLD we are applying is also huge (listing all the points individually by criteria that isn’t the feature Id). At higher numbers, the image is not guaranteed to be returned, and the request sometimes crashes GeoServer, requiring the service to be reset.
Does anyone know if such a thing is feasible, and if so, of any configuration tips?
We have applied a btree index on the field used for filtering.
What type of layer are you adding to OpenLayers?
You could use a WMS layer rather than having the points as vector features:
http://dev.openlayers.org/docs/files/OpenLayers/Layer/WMS-js.html
GeoServer would then generate an image of the points, and would only need to pass a PNG of JPEG of a few kbs rather than geometry and styling info which would be a lot larger. You'd lose some of the client-side functionality though (mouse-over events etc.)
If you are already doing this, then there may be a separate problem. 5000 points should be fine to handle on the server.
Alternatively you may want to rethink how you are diplaying the points. 5000 points at one time sounds as though it could be very confusing. Perhaps using different sized circles to represent 10, 100, 500 points etc. would be easier in terms of processing and visualisation.