Google Search Appliance (GSA) feeds - unpredictable behavior

We have a metadata-and-url feed and a content feed in our project. The indexing behaviour of the documents submitted using either feed is completely unpredictable. For the content feed, the documents get removed from the index after a seemingly random interval every time. For the metadata-and-url feed, the additional metadata we add is ignored, again at random; the documents themselves do remain in the index in this case, only our custom metadata gets dropped. Basically, it looks like the feeds get "forgotten" by the GSA after some time. What could be the cause of this issue, and how do we go about debugging it?
Points to note:
1) For unavoidable reasons, our GSA index is always hovering around the license limit (+/- 1000 documents or so). Could this have an effect? Are feeds purged when nearing the license limit? We do have lock="true" set in the feed records though (a sketch of how we post these records follows this list).
2) These fed documents are not linked to from other pages and hence (I believe) would have a low PageRank. Are fed documents automatically purged if nothing links to them?
3) Our follow patterns include the fed documents.
4) We do not use action=delete on these documents, so that possibility is ruled out. Also, for the content feed we always post the full set of documents, so they are not being removed through the feeds themselves.
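For reference, here is a minimal sketch of how a metadata-and-url record with lock="true" can be posted to the appliance's feedergate. The hostname, datasource name, document URL and metadata below are made up, and the port (19900) and form fields are the ones described in the GSA feeds protocol documentation, so verify them against your appliance's version:

```python
import requests

FEEDERGATE = "http://gsa.example.com:19900/xmlfeed"  # hypothetical appliance host

feed_xml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">
<gsafeed>
  <header>
    <datasource>intranet_docs</datasource>
    <feedtype>metadata-and-url</feedtype>
  </header>
  <group>
    <record url="http://intranet.example.com/docs/123" action="add"
            mimetype="text/html" lock="true">
      <metadata>
        <meta name="department" content="finance"/>
      </metadata>
    </record>
  </group>
</gsafeed>"""

# The feedergate expects a multipart form with feedtype, datasource and data fields.
response = requests.post(
    FEEDERGATE,
    files={
        "feedtype": (None, "metadata-and-url"),
        "datasource": (None, "intranet_docs"),
        "data": ("feed.xml", feed_xml, "text/xml"),
    },
    timeout=30,
)
print(response.status_code, response.text)  # the body should confirm the feed was accepted
```

Checking the feed status in the admin console after posting tells you whether the records were accepted, which helps separate feed-submission problems from index-retention problems.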

When you hit the license limit, the GSA will start dropping documents from the index, so I'd say that's definitely your problem.

Related

github search limit results

I need to do a very large search on Github for a statistic in my thesis.
For example, I need to explore a large number of Android projects on GitHub, but the site limits the search results to 1000 (e.g. https://github.com/search?l=java&q=onCreate&ref=searchresults&type=Code&utf8=%E2%9C%93). I also tried the Java GitHub API library org.eclipse.egit.github.core.client.GitHubClient and its GitHubClient.searchRepositories() method, but the number of results is limited there as well.
Does anyone know how to get all results?
The Search API will return up to 1000 results per query (including pagination), as documented here:
https://developer.github.com/v3/search/#about-the-search-api
However, there's a neat trick you can use to fetch more than 1000 results when executing a repository search: split your search into segments by the date the repositories were created. For example, you could first search for repositories created in the first week of October 2013, then the second week, then September, and so on.
Because you would be restricting the search to a narrow period, you will probably get fewer than 1000 results and will therefore be able to retrieve all of them. If you notice that more than 1000 results are returned for a period, narrow the period further until all results can be collected.
https://help.github.com/articles/searching-repositories/#search-based-on-when-a-repository-was-created-or-last-updated
You should be able to automate this via the API.
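A rough sketch of automating the date segmentation with the REST search API (Python with the requests library; the token placeholder and query are made up, and you should check the current search rate limits):

```python
import time
import requests

API = "https://api.github.com/search/repositories"
HEADERS = {"Authorization": "token YOUR_TOKEN"}  # hypothetical personal access token

def search_segment(query: str, created_range: str):
    """Collect all results for one created-date segment (must stay under 1000 hits)."""
    items, page = [], 1
    while True:
        resp = requests.get(
            API,
            headers=HEADERS,
            params={"q": f"{query} created:{created_range}", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        data = resp.json()
        items.extend(data["items"])
        if len(data["items"]) < 100 or page == 10:  # 10 pages x 100 = the 1000-result cap
            break
        page += 1
        time.sleep(2)  # stay well under the search rate limit
    return data["total_count"], items

# Example: one week of repositories at a time; narrow the range further
# whenever total_count comes back above 1000.
total, repos = search_segment("android language:java", "2013-10-01..2013-10-07")
```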
If you are searching for all files on GitHub with filename:your-file-name, you can also slice the search with the size qualifier.
For example, if you are looking for all files named test.rb on GitHub, the API may report more than 11M results, but you can only retrieve 1,000 of them because the Search API provides up to 1,000 results for each search. A URL like https://api.github.com/search/code?q=filename:test.rb+size:1000..1500 lets you slice the search by varying the size range.
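A hedged sketch of the size slicing, under the same assumptions as above (made-up token; any bucket whose total_count still exceeds 1000 would need to be split further):

```python
import requests

API = "https://api.github.com/search/code"
HEADERS = {"Authorization": "token YOUR_TOKEN"}  # code search requires authentication

all_items = []
# Walk non-overlapping size buckets so each query stays under the 1000-result cap.
for lower in range(0, 10000, 500):
    q = f"filename:test.rb size:{lower}..{lower + 499}"
    resp = requests.get(API, headers=HEADERS, params={"q": q, "per_page": 100})
    resp.raise_for_status()
    data = resp.json()
    all_items.extend(data["items"])
    # Pagination over the remaining pages of each bucket is omitted for brevity.
    # If data["total_count"] > 1000 for a bucket, split the size range further.
```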

Popularity of each Wikipedia article

I would like to store a list of all en.wikipedia articles in my database. For each article I want to store the pageid, title and the popularity. I thought about using the view count (over the last month) as a measure of popularity, but if that is not possible I could imagine going for something else (maybe the number of revisions). I'm aware of http://dumps.wikimedia.org/enwiki/latest/ and that I can get a full list of articles from there (current count 36508337). However, I cannot find a clever way to get the view count for each article.
// Updates, Edits, ...
The suggested duplicate does not help me because
a) I was looking for a popularity measurement. The answer to the other question just states that it is not possible to get the number of watchers for a page, which is fine with me.
b) There is no answer there that gives me the page views (or any other metric) for every page.
Okay I'm finally done. Here is what I did:
I found http://dumps.wikimedia.org/other/pagecounts-ez/ which provides page views per month. This seems promising, but it doesn't include the pageid, so what I'm doing is getting a list of all articles from http://dumps.wikimedia.org/enwiki/latest/, creating a mapping name->pageid and then parsing the pagecount dump. This takes about 30 minutes; here are some statistics:
68% of the articles in the page count file do not exist in the latest dump. This is probably due to some users linking to, for example, Misfits_(TV_series) while others link to Misfits_(tv_series), and even things like Misfits_%28TV_series%29... I did not bother with those because my program already took long enough to run.
The top 3 pages are:
1. the front page, with 639 million views (in the last month)
2. Malware, with 8.5 million views
3. Falcon 9 v1.1, with 4.7 million views (cool!)
I also made a histogram of the number of pages with a given view count, and plotted how many pages remain when all articles below a certain view-count threshold are disregarded.
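For what it's worth, here is a stripped-down sketch of the title->pageid mapping and pagecounts join described above. The file names are made up, and the exact layout of the pagecounts-ez lines should be checked against the dump's README before relying on it:

```python
import gzip

# Hypothetical local file names; both come from dumps.wikimedia.org and need to be
# prepared first (e.g. the pageids can be extracted from the page.sql.gz dump).
TITLE_TO_ID_FILE = "enwiki-title-pageid.tsv"          # "title<TAB>pageid" per line
PAGECOUNTS_FILE = "pagecounts-monthly-totals.txt.gz"  # pagecounts-ez monthly file

# 1) Build the title -> pageid mapping from the latest dump.
title_to_id = {}
with open(TITLE_TO_ID_FILE, encoding="utf-8") as f:
    for line in f:
        title, pageid = line.rstrip("\n").split("\t")
        title_to_id[title] = int(pageid)

# 2) Walk the pagecounts file and join on the title.
#    Lines are assumed to look roughly like: "en.z Article_Title 12345 ..."
popularity = {}
unmatched = 0
with gzip.open(PAGECOUNTS_FILE, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split(" ")
        if len(parts) < 3 or parts[0] != "en.z":
            continue
        title, monthly_views = parts[1], int(float(parts[2]))
        pageid = title_to_id.get(title)
        if pageid is None:
            unmatched += 1   # the large share of titles that don't match the dump
            continue
        popularity[pageid] = (title, monthly_views)

print(len(popularity), "articles matched,", unmatched, "titles unmatched")
```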

REST pagination content duplicates

I am creating a REST application that returns a collection of items (a topic with a collection of posts), sorted from newest to oldest.
Following HATEOAS principles, the content is chunked into pages, and the client gets, for example, the current page id, the offset, the data limit, and links to the first, current and next pages.
Getting the data for the next page is not a problem, but if somebody adds content while the client is reading the current page, the new data is pushed onto the start of the collection and the last item of the current page gets moved to the next page.
If I just skip the posts that were already loaded, the next page contains fewer items. Another option is to count the items pushed onto the start of the list and increment the offset accordingly.
What are the best practices for this?
Not using offset indexes, but instead skip tokens that indicate the first value not to include (or the first value to include), is a good technique, provided the value is unique for every item in your result set and is an orderable field under the current sort. It's not flawless, but usually this just doesn't matter.
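To illustrate, here is a minimal in-memory sketch of that skip-token (cursor) style; the Post fields, route shape and limit are made up, and a real service would run the equivalent query against its datastore:

```python
import time
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class Post:
    id: int
    created_at: float  # sort key, newest first

# Hypothetical store, already sorted newest-first.
POSTS: List[Post] = [Post(id=i, created_at=time.time() - i) for i in range(1, 101)]

def get_page(after_id: Optional[int], limit: int = 20) -> dict:
    """Cursor ("skip token") pagination: the client sends the id of the last item it
    has seen rather than a numeric offset, so items inserted at the top of the
    collection cannot push old items onto the next page."""
    if after_id is None:
        start = 0
    else:
        # First index strictly after the cursor item under the current sort order.
        start = next((i + 1 for i, p in enumerate(POSTS) if p.id == after_id), 0)
    page = POSTS[start:start + limit]
    return {
        "items": [asdict(p) for p in page],
        # HATEOAS-style link to the next chunk, keyed on the last item returned.
        "links": {"next": f"/posts?after={page[-1].id}&limit={limit}" if page else None},
    }

first = get_page(None)
second = get_page(first["items"][-1]["id"])  # unaffected by posts added at the top
```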
If it really does matter, you have to put the IDs of everything on the first page into the call for the second page, and so on for every page. HATEOAS helps you do stuff like this... but it can get very messy, and things can still pop up on page 1 (given the current sorting) when you make a request for page 5... what do you do with that?
Another trick to avoid dupes in a UI is to use the self or canonical link relationships to uniquely identify resources in a page and compare those to existing resources in the UI. Updating the UI with the latest matching resources is usually a simple task. This of course puts some burden on the client.
There is no one-size-fits-all solution to this problem. You have to design for the UX you intend to fulfill.

ExpressionEngine missing channel entries

I am working on a new web app which is based on ExpressionEngine, and for the most part I am basing the content on channel entries. However, I am experiencing some very weird issues with the {exp:channel:entries} tag in that it does not return all relevant entries each time. I can't figure out what's going on: the entries are definitely available when viewing them in the control panel, and they will also show up as requested in my template, but sometimes they just disappear and/or are not processed properly. This happens with both small and large sets of entries, ranging from 3 entries that fit the criteria specified within the tag up to 500.
Any thoughts or feedback would be greatly appreciated.
There could be a number of things going on here, so here are some things to look at, just in case:
If the entries have entry dates in the future - you'll need your channel entries tag to have the parameter show_future_entries = "yes"
Likewise, if the entries are closed you'll need to add status="open|closed", and if they have expired you'll need show_expired="yes"
Are you looking at a particular category and these entries aren't assigned to the category?
Are you looking at a particular category but have excluded category data from the entries tag?
Are you retrieving more than 100 entries? There is a default limit of 100 entries returned unless you specify a limit parameter.

Last Updated Date: Antipattern?

I keep seeing questions floating through that make reference to a column in a database table named something like DateLastUpdated. I don't get it.
The only companion field I've ever seen is LastUpdateUserId or such. There's never an indicator about why the update took place; or even what the update was.
On top of that, this field is sometimes written from within a trigger, where even less context is available.
It certainly doesn't even come close to being an audit trail, so that can't be the justification. And if there is an audit trail somewhere in a log or whatever, this field would be redundant.
What am I missing? Why is this pattern so popular?
Such a field can be used to detect whether there are conflicting edits made by different processes. When you retrieve a record from the database, you get the previous DateLastUpdated field. After making changes to other fields, you submit the record back to the database layer. The database layer checks that the DateLastUpdated you submit matches the one still in the database. If it matches, then the update is performed (and DateLastUpdated is updated to the current time). However, if it does not match, then some other process has changed the record in the meantime and the current update can be aborted.
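As a small illustration of that optimistic-concurrency check (table and column names are invented; SQLite is used only to keep the sketch self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, date_last_updated TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Alice', '2014-01-01 10:00:00')")

def save(conn, customer_id, new_name, date_last_updated_when_read):
    """Apply the update only if nobody else has touched the row since we read it."""
    cur = conn.execute(
        """UPDATE customer
              SET name = ?, date_last_updated = datetime('now')
            WHERE id = ? AND date_last_updated = ?""",
        (new_name, customer_id, date_last_updated_when_read),
    )
    if cur.rowcount == 0:
        raise RuntimeError("Conflicting edit: the record changed since it was read")

save(conn, 1, "Alice Smith", "2014-01-01 10:00:00")      # succeeds
try:
    save(conn, 1, "Alice Jones", "2014-01-01 10:00:00")  # stale timestamp
except RuntimeError as e:
    print(e)                                             # conflict detected, update aborted
```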
It depends on the exact circumstance, but a timestamp like that can be very useful for autogenerated data - you can figure out whether something needs to be recalculated if a dependency has changed later on (this is how build systems work out which files need to be recompiled).
Also, many websites will have data marking "Last changed" on a page, particularly news sites that may edit content. The exact reason isn't necessary (and there likely exist backups in case an audit trail is really necessary), but this data needs to be visible to the end user.
These sorts of things are typically used for business applications where user action is required to initiate the update. Typically, there will be some kind of business app (eg a CRM desktop application) and for most updates there tends to be only one way of making the update.
If you're looking at address data, that was done through the "Maintain Address" screen, etc.
Such database auditing is there to augment business-level auditing, not to replace it. Call centres will sometimes (or always in the case of financial services providers in Australia, as one example) record phone calls. That's part of the audit trail too but doesn't tend to be part of the IT solution as far as the desktop application (and related infrastructure) goes, although that is by no means a hard and fast rule.
Call centre staff will also typically have some sort of "Notes" or "Log" functionality where they can type freeform text as to why the customer called and what action was taken so the next operator can pick up where they left off when the customer rings back.
Triggers will often be used to record exactly what was changed (eg writing the old record to an audit table). The purpose of all this is that with all the information (the notes, recorded call, database audit trail and logs) the previous state of the data can be reconstructed as can the resulting action. This may be to find/resolve bugs in the system or simply as a conflict resolution process with the customer.
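A tiny sketch of that trigger-to-audit-table idea (names are invented; SQLite trigger syntax is shown, and a real system would use its own dialect):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_address (
    customer_id  INTEGER PRIMARY KEY,
    street       TEXT,
    last_updated TEXT DEFAULT (datetime('now'))
);
CREATE TABLE customer_address_audit (
    customer_id  INTEGER,
    street       TEXT,
    last_updated TEXT,
    archived_at  TEXT DEFAULT (datetime('now'))
);
-- Copy the old row to the audit table whenever an address row is updated.
CREATE TRIGGER trg_address_audit
BEFORE UPDATE ON customer_address
BEGIN
    INSERT INTO customer_address_audit (customer_id, street, last_updated)
    VALUES (OLD.customer_id, OLD.street, OLD.last_updated);
END;
""")

conn.execute("INSERT INTO customer_address (customer_id, street) VALUES (1, '1 Main St')")
conn.execute("UPDATE customer_address SET street = '2 High St', last_updated = datetime('now') "
             "WHERE customer_id = 1")
print(conn.execute("SELECT * FROM customer_address_audit").fetchall())  # the pre-update row
```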
It is certainly popular - Rails, for example, has a shorthand for it alongside a creation timestamp (t.timestamps adds created_at and updated_at).
At the application level it's very useful, as the same pattern is very common in views - look at the questions here for example (answered 56 secs ago, etc).
It can also be used retrospectively in reporting to generate stats (e.g. what is the growth curve of the number of records in the DB).
There are a couple of scenarios.
Let's say you have an address table for your customers.
In your CRM app, a customer calls to say that his address changed a month ago; with the LastUpdate column you can see that this customer's row hasn't been touched in 4 months.
Usually you use triggers to populate a history table so that you can see all the older versions; if the creation date and the updated date are the same, there is no point hitting the history table since you won't find anything.
If you calculate indexes (stock market), you can easily see when one was recalculated just by looking at this column.
If there are 2 DB servers, comparing the date column lets you find out whether all the changes have been replicated or not.
This is also very useful if you have to send delta feeds out to clients, where only the records that have been changed or inserted since the date of the last feed are sent.
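For example, a delta feed can be as simple as selecting everything whose last-updated value is newer than the date of the previous feed run (again with made-up table and column names):

```python
import sqlite3

# Minimal setup so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_address (
    customer_id  INTEGER PRIMARY KEY,
    street       TEXT,
    last_updated TEXT)""")
conn.executemany(
    "INSERT INTO customer_address VALUES (?, ?, ?)",
    [(1, "1 Main St", "2014-01-01 09:00:00"),
     (2, "5 High St", "2014-02-15 14:30:00")],
)

def delta_feed(conn, last_feed_date):
    """Rows inserted or updated since the previous feed was generated."""
    return conn.execute(
        "SELECT customer_id, street, last_updated "
        "FROM customer_address WHERE last_updated > ? ORDER BY last_updated",
        (last_feed_date,),
    ).fetchall()

print(delta_feed(conn, "2014-02-01 00:00:00"))  # only customer 2 has changed since then
```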