I would like to store a list of all en.wikipedia articles in my database. For each article I want to store the pageid, the title, and a popularity measure. I thought about using the view count over the last month as the measure of popularity, but if that is not possible I could imagine using something else (maybe the number of revisions). I'm aware of http://dumps.wikimedia.org/enwiki/latest/ and that I can get a full list of articles from there (current count 36508337). However, I cannot find a clever way to get the view count for each article.
// Updates, Edits, ...
The suggested duplicate does not help me because
a) I was looking for a popularity measurement. The answer to the other question just states that it is not possible to get the number of watchers for a page, which is fine with me.
b) There is no answer there that gives me the page views (or any other metric) for every page.
Okay I'm finally done. Here is what I did:
I found http://dumps.wikimedia.org/other/pagecounts-ez/ which provides page views per month. This seems promising, but it does not include the pageid, so what I'm doing is getting a list of all articles from http://dumps.wikimedia.org/enwiki/latest/, building a mapping from title to pageid, and then parsing the pagecounts dump (a rough sketch of this step follows the statistics below). This takes about 30 minutes; here are some statistics:
68% of the articles in the pagecounts file do not exist in the latest dump. This is probably due to some users linking to, for example, Misfits_(TV_series) while others link to Misfits_(tv_series), and even things like Misfits_%28TV_series%29... I did not bother with those because my program already took long enough to run.
The top 3 pages are:
1. Front page, with 639 million views (in the last month)
2. Malware, with 8.5 million views
3. Falcon 9 v1.1, with 4.7 million views (cool!)
I made a histogram of the number of pages with a given view count; here it is:
I also plotted the number of pages I would have to deal with if I disregard all articles below a certain view count. Here it is:
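For reference, here is a rough sketch of the mapping step described above. The file names and layouts are assumptions made for illustration: a titles.tsv holding (pageid, title) pairs extracted from the enwiki dump, and a pagecounts file with "en.z <title> <monthly views> ..." lines in the pagecounts-ez style.

    import csv
    from urllib.parse import unquote

    def normalize(title):
        """Best-effort normalization: underscores, percent-decoding, and
        capitalizing the first character (Wikipedia titles are case-sensitive
        except for their first letter)."""
        title = unquote(title.replace(' ', '_'))
        return title[:1].upper() + title[1:]

    # title -> pageid from the article list (assumed pre-extracted TSV)
    pageid_by_title = {}
    with open('titles.tsv', encoding='utf-8') as f:
        for pageid, title in csv.reader(f, delimiter='\t'):
            pageid_by_title[normalize(title)] = int(pageid)

    # pageid -> monthly view count from the pagecounts file
    views_by_pageid = {}
    with open('pagecounts.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.split(' ')
            if len(parts) < 3 or parts[0] != 'en.z':
                continue
            pageid = pageid_by_title.get(normalize(parts[1]))
            if pageid is not None:
                views_by_pageid[pageid] = views_by_pageid.get(pageid, 0) + int(parts[2])

The normalize() helper also covers percent-encoded variants like Misfits_%28TV_series%29, though it will not catch every mismatched title.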
I am studying for my final OS exam and am currently stuck on a question:
Assume a system uses demand paging as its fetch policy.
The resident set size is 2 pages.
Replacement policy is Least Recently Used (LRU).
Initial free frame list: 10, 20, 30, 40, 50
Assume a program runs with the following sequence of page references:
3(read), 2(read), 3(write), 1(write), 1(write), 0(write), 3(read)
I am asked to show the final contents of the free frame list, modified list, and the page table.
Here is the model answer.
This is what I managed to do.
The final resident set is correct, but the free frame list and the modified list are not. I just cannot see how the modified list does not contain page number 0 (it got written to), while page number 1 was not written even though it was referenced before it.
Any help would be appreciated.
Why do you recycle 3 (frame 10) to the free list in step 4? It was the most recently used (and is dirty), so you would want to keep it and get rid of 2 (frame 20), which is the least recently used. That appears to be what the model answer is based on.
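For what it's worth, here is a minimal simulation of that trace, assuming the usual textbook bookkeeping: the least recently used resident page is evicted on a fault, its frame goes to the modified list if the page is dirty and back to the free frame list if it is clean, and new pages always take the frame at the head of the free list. This simple model does not reclaim pages from the modified list on a soft fault, so it may differ from the exact scheme your course uses.

    from collections import OrderedDict

    RESIDENT_SIZE = 2
    free_frames = [10, 20, 30, 40, 50]
    modified = []                       # frames holding evicted dirty pages
    resident = OrderedDict()            # page -> (frame, dirty), ordered by recency

    references = [(3, 'read'), (2, 'read'), (3, 'write'),
                  (1, 'write'), (1, 'write'), (0, 'write'), (3, 'read')]

    for page, op in references:
        if page in resident:                        # hit: refresh recency, maybe dirty it
            frame, dirty = resident.pop(page)
            resident[page] = (frame, dirty or op == 'write')
            continue
        if len(resident) == RESIDENT_SIZE:          # fault with a full resident set
            victim, (vframe, vdirty) = resident.popitem(last=False)   # LRU victim
            (modified if vdirty else free_frames).append(vframe)
        frame = free_frames.pop(0)                  # load the page into a free frame
        resident[page] = (frame, op == 'write')

    print('resident:', dict(resident))
    print('free frames:', free_frames)
    print('modified list:', modified)

In this run, 2's frame (frame 20) goes back to the free list at step 4, and the frames that held the dirty pages 3 and 1 end up on the modified list; whether page 3 is then reclaimed from the modified list at the last reference depends on the scheme your course assumes.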
I am creating a REST application that returns a collection of items (a topic with a collection of posts), sorted from newest to oldest.
Following HATEOAS principles, the content is chunked into pages, and the client gets, for example, the current page id, the offset, the data limit, and links to the first, current, and next pages.
There is no problem getting data from the next page, but if somebody adds content while the client is reading the current page, the new data is pushed onto the start of the collection and the last item of the current page moves to the next page.
If I just skip posts that have already been loaded, I get fewer items on the next page. One option is to get a count of the items pushed onto the start of the list and increment the offset accordingly.
What are the best practices for this?
Not using offset indexes, but instead skip tokens that indicate the first value not to include (or the first value to include), is a good technique, provided the value is unique for every item in your result set and is an orderable field based on the current sort. But it's not flawless. Usually this just doesn't matter.
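As an illustration, here is a minimal sketch of that skip-token (keyset) idea, assuming each post has a unique, sortable (created_at, id) pair; the names and data shapes are made up for the example.

    from datetime import datetime
    from typing import NamedTuple

    class Post(NamedTuple):
        id: int
        created_at: datetime
        body: str

    def page_after(posts, token, limit=10):
        """Return one page of posts plus the skip token for the next page.
        `posts` must be sorted newest-first; `token` is the (created_at, id)
        of the last item the client has already seen, or None for the first page."""
        if token is not None:
            posts = [p for p in posts if (p.created_at, p.id) < token]   # strictly older
        page = posts[:limit]
        next_token = (page[-1].created_at, page[-1].id) if page else None
        return page, next_token

Because the token identifies the last item the client has seen rather than a position, items inserted at the top of the collection no longer shift the boundary of the next page.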
If it really does matter, you have to put the IDs of everything on the first page into the call for the second page, and so on for each subsequent page. HATEOAS helps you do stuff like this, but it can get very messy, and things can still pop up on page 1 (given the current sorting) when you make a request for page 5... what do you do with that?
Another trick to avoid dupes in a UI is to use the self or canonical link relationships to uniquely identify resources in a page and compare those to existing resources in the UI. Updating the UI with the latest matching resources is usually a simple task. This of course puts some burden on the client.
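Here is a small sketch of that deduplication trick, assuming HAL-style resources where each item carries a _links.self.href; that structure is an assumption for illustration.

    def merge_by_self_link(existing, incoming):
        """Merge freshly fetched resources into those already shown in the UI,
        using each resource's 'self' link as its identity: matching resources
        are replaced with the latest copy, new ones are appended."""
        by_link = {r['_links']['self']['href']: r for r in existing}
        for resource in incoming:
            by_link[resource['_links']['self']['href']] = resource
        return list(by_link.values())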
There is no one-size-fits-all solution to this problem. You have to design for the UX you intend to fulfill.
We have a metadata-and-url feed and a content feed in our project. The indexing behaviour of the documents submitted using either feed is completely unpredictable. For the content feed, the documents get removed from the index after a random interval every time. For the metadata-and-url feed, the additional metadata we add is ignored, again at random. The documents themselves do remain in the index in the latter case; only our custom metadata gets removed. Basically, it looks like the feeds get "forgotten" by the GSA after some time. What could be the cause of this issue, and how do we go about debugging it?
Points to note:
1) For unavoidable reasons, our GSA index is always hovering around the license limit (+/- 1000 documents or so). Could this have an effect? Are feeds purged when nearing the license limit? We do have "lock = true" set in the feed records, though.
2) These fed documents are not linked to from pages and hence (I believe) would have low page rank. Are feeds automatically purged if not linked to from pages?
3) Our follow patterns include the fed documents.
4) We do not use action=delete with the same documents, so that possibility is ruled out. Also for the content feed we always post all the documents. So they are not removed through feeds.
When you hit the license limit, the GSA will start dropping documents from the index, so I'd say that's definitely your problem.
I am working on a new web app built on ExpressionEngine, and for the most part I am basing the content on channel entries. However, I am experiencing some very weird issues with the exp:channel:entries tag: it does not return all relevant entries every time. I can't figure out what's going on, as the entries are definitely available when viewing them in the control panel, and they will also show up as requested in my template, but sometimes they just disappear and/or are not processed properly. This happens for both large and small sets of entries, ranging from 3 channel entries that fit the criteria specified within the exp tag to 500 entries.
Any thoughts or feedback would be greatly appreciated.
There could be a number of things going on here, so here are some things to look at, just in case:
If the entries have entry dates in the future, you'll need your channel entries tag to include the parameter show_future_entries="yes"
Likewise, if the entries are closed you'll need to add status="open|closed", and if they have expired you'll need show_expired="yes"
Are you looking at a particular category and these entries aren't assigned to the category?
Are you looking at a particular category but have excluded category data from the entries tag?
Are you retrieving more than 100 entries? There is a default limit of 100 entries returned unless you specify a limit parameter.
I am new to BIRT.
I have a requirement to print a HEADING based on a database value. How do I do that?
How do I leave a blank line upon a break in one of the fields I am reporting?
In the footer, I need to show "Page X of Y", where Y is the total number of pages. How do I do that?
After creating the data source, switch to the "Master Page" and drag the field in question into the header.
No idea.
There are a couple of computed fields (called "AutoText") available. One of them renders itself as the total number of pages.
As you are new to BIRT, it would be good to start with the tutorials. Reading the FAQs is usually enlightening, too. Also make sure to check out the BIRT community resources, such as the mailing list archives.