How should I handle a large number of facets in Algolia?

I set maxValuesPerFacet=7000, but it seems the maximum number of facet values returned is 1000. Is there a limit breaker I can use?
I'm using the Algolia helper JS library.

Related

Number of facet values

I'm investigating Algolia as a search technology for an application that will have a large number of facet values (~4000) for one of the fields being faceted on.
Does anyone know if there is a practical limit to the number of facet values that are supported (i.e. will ~4000 have significant performance implications as we scale)?
It really depends on what you mean by scaling.
But as a point of reference, consider Medium, which uses Algolia: it has, I would guess, millions of articles and tens of thousands of tags, and has no performance issues.
If your use-case grows a lot and you start having Enterprise plan needs, you'll have people to talk to directly at Algolia to think with you about those performance implications.
If you're not convinced and are on trial, you should have no issues pushing some fake data to see if everything works as expected!

Retrieve a million rows from the database (PostgreSQL) in GSP

I need to fetch a million rows from the database into a GSP page. I wrote a query like
"select * from tablename";
Right now I can only retrieve about a thousand rows at a time; if I load more than that I get an error like
java.lang.OutOfMemoryError: GC overhead limit exceeded
I am not using Hibernate. How can I fetch a large amount of data in a Grails project?
You have two options: use pagination or use a query result iterator.
If you're using Grails, I recommend using Hibernate, which lets you build SQL queries without writing them by hand and handles many security concerns for you. Moreover, be restrictive in your query: SELECT * is not always necessary, and selecting only the columns you need saves query time and memory.
Pagination
This is the best way to handle a large amount of data: you just split the query into sub-queries, each returning a known number of rows. To do so, you use the SQL clauses LIMIT and OFFSET.
For example, your query could be: select * from tablename LIMIT 100 OFFSET 2000. You just have to change the OFFSET parameter to retrieve all values.
Thanks to that, your backend will not have to handle a huge amount of data at a time. Moreover, you can use JavaScript to send requests to your backend while it is rendering the previous results, which improves response time (asynchronous/infinite scroll works like this, for example).
Grails has a default pagination system that you can use as is. Please look at the official documentation here. You may have to tweak it a little bit if you don't use Hibernate.
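For illustration, here is a minimal plain-JDBC sketch of that LIMIT/OFFSET loop (the question says Hibernate is not used). The connection URL, credentials, table and column names are placeholders, not anything from the original setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PaginatedFetch {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; adjust them for your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            int pageSize = 100;
            int offset = 0;
            boolean more = true;
            // "tablename" and the columns are hypothetical; LIMIT/OFFSET pagination
            // needs a stable ORDER BY so that pages do not overlap or skip rows.
            String sql = "SELECT id, name FROM tablename ORDER BY id LIMIT ? OFFSET ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                while (more) {
                    ps.setInt(1, pageSize);
                    ps.setInt(2, offset);
                    int rows = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++;
                            // Render or buffer only this page of rows.
                            System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                        }
                    }
                    more = (rows == pageSize); // the last page comes back short (or empty)
                    offset += pageSize;
                }
            }
        }
    }
}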
Query result iterator
You can handle a huge amount of data by using an iterator over the result, but it depends on the querying framework. Moreover, with that method you will generate huge HTML pages, whose size may itself become a problem (remember: you already hit an OutOfMemoryError, so you're talking about hundreds or thousands of MB that the user would have to download synchronously!).
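As a rough sketch of the iterator approach with plain JDBC against PostgreSQL (table and column names are again placeholders): the PostgreSQL driver only streams rows instead of buffering the whole result set when autocommit is off and a positive fetch size is set, so the sketch assumes those two settings.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingFetch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            // Required for the PostgreSQL JDBC driver to stream rows rather than
            // loading the entire result set into memory.
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(500); // rows fetched per round trip
                // "tablename" is a placeholder for your table.
                try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM tablename")) {
                    while (rs.next()) {
                        // Process each row and let it go; never accumulate all rows
                        // in one list -- that is what triggers the OutOfMemoryError.
                        process(rs.getLong("id"), rs.getString("name"));
                    }
                }
            }
            conn.commit();
        }
    }

    private static void process(long id, String name) {
        // Write to the response, a file, etc. incrementally.
    }
}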

Core Reporting API Total results found

I want to return a large result-set of Google Analytics data across a two month period.
However, the total results found is not accurate or what I expect.
If I narrow the start-date and end-date to a particular day, it returns roughly 40k results, so over a two-month period there should be around 2.4 million records. However, the total results found reported by the API suggests 350k.
There is some discrepancy and the numbers do not add up when selecting a larger date range. I can confirm there is no gap in the GA data over the two-month period.
It would be great if someone has come across this issue and found a reason for it.
In your query you need to supply a sampling level:
samplingLevel=DEFAULT Optional.
Use this parameter to set the sampling level (i.e. the number of visits used to calculate the result) for a reporting query. The allowed values are consistent with the web interface and include:
• DEFAULT — Returns a response with a sample size that balances speed and accuracy.
• FASTER — Returns a fast response with a smaller sample size.
• HIGHER_PRECISION — Returns a more accurate response using a large sample size, but this may result in the response being slower.
If not supplied, the DEFAULT sampling level will be used.
There is no way to completely remove sampling: large requests will still return sampled data even if you have set the level to HIGHER_PRECISION. Make your requests smaller and go day by day if you have to.
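As a sketch of what that looks like in practice, the request below queries the Core Reporting API v3 one day at a time and asks for HIGHER_PRECISION explicitly. The view ID, access token, metrics and dimensions are placeholders to swap for your own:

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;

public class DailyGaExport {
    public static void main(String[] args) throws Exception {
        // Placeholders: your view (profile) ID and a valid OAuth 2.0 access token.
        String viewId = "ga:12345678";
        String accessToken = "ya29.your-access-token";

        LocalDate start = LocalDate.of(2014, 1, 1);
        LocalDate end = LocalDate.of(2014, 2, 28);

        // One request per day keeps each query small enough to avoid (or reduce) sampling.
        for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
            String url = "https://www.googleapis.com/analytics/v3/data/ga"
                    + "?ids=" + URLEncoder.encode(viewId, StandardCharsets.UTF_8)
                    + "&start-date=" + day
                    + "&end-date=" + day
                    + "&metrics=ga:sessions"          // example metric
                    + "&dimensions=ga:date,ga:hour"   // example dimensions
                    + "&samplingLevel=HIGHER_PRECISION"
                    + "&max-results=10000";

            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("Authorization", "Bearer " + accessToken);
            System.out.println(day + " -> HTTP " + conn.getResponseCode());
            // Read and page through conn.getInputStream() here; the response's
            // containsSampledData flag tells you whether sampling was still applied.
            conn.disconnect();
        }
    }
}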
If you pay for a Google Analytics Premium account, you can export your data into BigQuery and get access to unsampled reports.

Good data store for millions of events?

We have a number of systems that together generate a total of around 5M events per day. Currently we save these for around 10 days, totaling around 40-50M events. We are using an RDBMS as the persistence layer with a web GUI slapped onto it, but we are experiencing certain performance problems.
An event consists of 20-30 fields composed of the following:
fields representing the event itself (e.g. OrderReceived)
fields representing the system that generated the event (e.g. ERP system)
fields representing the business context in which the event was generated (e.g. OrderManagement)
fields representing other details that we consider relevant/important
Roughly 5-6 of the fields are identifiers, most of them unique, representing the event itself, the business entity/object, the context and similar. Using these identifiers we can also relate events to each other chaining them together. The time difference in an event chain may be hours or in rare cases even days.
Currently we use the solution for analysis of individual event chains, mostly for error and outlier analysis (where did my order go?). In the future we may also want to gather statistics about events and event chains (how many orders per day? how many orders are handled by system X?). If possible, the solution should also be able to grow to at least double the current size (we foresee an increase in the number of events as new systems are enabled). Analysis is currently performed by human beings, so searches need to be reasonably fast (searching for an event chain should take seconds, not minutes). The data store should also allow for cleaning out stale events.
As mentioned at the beginning, we're using a standard RDBMS for this. We were using a fairly normalized structure, which we've now started denormalizing to try to increase performance. I can't help wondering whether some other solution might be a better fit, though. I've started looking around at different NoSQL databases (in my own opinion MongoDB seems promising) and also trying to gather information on search engines and the like (e.g. Solr and Elasticsearch).
The question is: what type of data store/solution would be a good fit for these events? Should we head into the NoSQL space, is a search engine perhaps what we want, or are we barking up the wrong tree when what we really need is someone who's really good at optimizing RDBMSs?
I would suggest a hybrid solution: a conventional SQL server for the actual storage and a Lucene-based front-end search engine that is populated from the SQL store by some automatic or timed event. The web layer queries the Lucene layer and writes to the SQL store.
An SQL backend keeps your options open for the future (OLAP, etc.) and also provides a standard, scalable and multi-user way to accept data from the outside world through DB connection libraries and UI tools. In short, if your data is stored in SQL you cannot go wrong...
The Lucene layer provides extreme query performance if its query capabilities suffice. (In a nutshell: field-value search for numbers, dates, strings, etc., range search, multiple-field-value search (a field is actually an array), all combinable with logical operators and boolean expressions, plus sorting and paging. HOWEVER, it cannot do groupings or aggregate functions such as sum and avg.)
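To make that concrete, here is a minimal sketch of what the Lucene side could look like for this event use case, using recent Lucene class names (older versions use different numeric-field classes). The field names, values and index path are invented examples:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class EventIndex {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("/data/event-index"));

        // One event per Lucene document; the field names are invented examples.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("eventType", "OrderReceived", Field.Store.YES));
            doc.add(new StringField("sourceSystem", "ERP", Field.Store.YES));
            doc.add(new StringField("correlationId", "order-4711", Field.Store.YES));
            long timestamp = System.currentTimeMillis();
            doc.add(new LongPoint("timestamp", timestamp));   // indexed for range queries
            doc.add(new StoredField("timestamp", timestamp)); // stored so it can be read back
            writer.addDocument(doc);
        }

        // Multi-field query with logical operators: event type AND source system AND a time range.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("eventType", "OrderReceived")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("sourceSystem", "ERP")), BooleanClause.Occur.MUST)
                    .add(LongPoint.newRangeQuery("timestamp", 0L, Long.MAX_VALUE), BooleanClause.Occur.MUST)
                    .build();
            for (ScoreDoc hit : searcher.search(query, 200).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("correlationId"));
            }
        }
    }
}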
UPDATE: several years passed. Solr now has statistical capabilities like sum, avg, etc...
Query performance: in a 100M-record index, selecting a couple of hundred items with a multi-field query predicate takes under 100 ms.
Populating the index takes constant time per document (it does not slow down as the index grows) because of the internal split-file implementation. It is possible to build up a 5-million-entry index in minutes, 20 tops, depending mainly on your storage controller. Lucene also supports real-time updates to the index, a feature that we have used extensively with success on high-load websites.
Lucene supports splitting an index into sub-indexes and index hierarchies, so you can create an index per day but search all of them (or a specific subset of them) with a single query (using the multi-index adapter). I tried it with 2000 unique index files and the performance was amazing.
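A small sketch of that multi-index search using Lucene's MultiReader (one way to get the "multi-index adapter" behavior); the per-day index paths and the field name are placeholders:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // One index directory per day; the paths are placeholders.
        String[] days = { "2014-03-01", "2014-03-02", "2014-03-03" };
        IndexReader[] readers = new IndexReader[days.length];
        for (int i = 0; i < days.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get("/data/events/" + days[i])));
        }

        // MultiReader presents the daily indexes as one logical index,
        // so a single query searches all of them at once.
        try (MultiReader multi = new MultiReader(readers)) {
            IndexSearcher searcher = new IndexSearcher(multi);
            ScoreDoc[] hits = searcher
                    .search(new TermQuery(new Term("correlationId", "order-4711")), 100)
                    .scoreDocs;
            System.out.println("hits across all days: " + hits.length);
        }
        // Closing the MultiReader also closes the sub-readers it wraps (with this constructor).
    }
}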
This architecture can be implemented without much effort in Java or .NET; both have great SQL and Lucene support.

Zend Search Lucene HTTP 500 Internal Server Error while bulk indexing on small tables

I am just getting started with Zend Search Lucene and am testing on a GoDaddy shared Linux account. Everything is working: I can create and search Lucene Documents. The problem is that when I try to index my whole table for the first time, I get an HTTP 500 Internal Server Error after about 30 seconds. If I rewrite my query so that I only select 100 rows of the table to index, it works fine.
I have already increased my PHP memory_limit setting to 128M. The table I'm trying to index is only 3000 rows, and I'm indexing a few columns from each row.
Any thoughts?
Zend_Search_Lucene does not work very well for large data-sets in my experience. For that reason I switched the search backend to Apache Lucene in a larger project.
Did you try setting your timeout to something higher than 30 seconds (the default in php.ini)? The memory limit can also be exceeded easily with 3000 rows, depending on what you're indexing. If you're indexing everything as Text fields, and perhaps indexing related data as well, you can easily gobble up that memory.