Sentence similarity - How to calculate the depth of subsumer using WordNet?

I am trying to build a tool to calculate the similarity between 2 words, and I found a formula for this from Manchester Metropolitan University; in the linked paper the word similarity is s(w1, w2) = e^(-α·l) · (e^(β·h) - e^(-β·h)) / (e^(β·h) + e^(-β·h)), where l is the path length between the words.
I am still confused about how to get h, which is the depth of the subsumer in the hierarchical semantic nets.
As I understand it, h is the path length from the top word down to a certain word; according to the author, the top word for nouns is 'entity'.
But what about other kinds of words, such as ADJ, ADV, VERB...?
And if we already have the top word, how can we list the path from it to the word we want to calculate?
The paper is at the following link: https://www.researchgate.net/profile/Keeley_Crockett/publication/232645326_Sentence_Similarity_Based_on_Semantic_Nets_and_Corpus_Statistics/links/0deec51b8db68f19fa000000.pdf
I would really appreciate any answer.
Thanks.

Every time I've tried to understand the WordNet hierarchy, I've found something that invalidates everything I previously assumed :)
Regarding similarities: if you are using Python and NLTK, I'd recommend using the provided similarity metrics; if not, they may still be a good starting point for understanding how things work.
In this link, scroll down to Similarity:
http://www.nltk.org/howto/wordnet.html
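For example, a minimal sketch using NLTK's WordNet interface (this assumes NLTK and its wordnet corpus are installed; the sense numbers kiss.n.01 and kick.n.04 are just illustrative choices, use wn.synsets('kick') to list the real ones):

from nltk.corpus import wordnet as wn

# Similarity is defined between synsets (word senses), not raw words.
kiss = wn.synset('kiss.n.01')
kick = wn.synset('kick.n.04')

# Shortest-path based similarity in the hypernym/hyponym taxonomy.
print(kiss.path_similarity(kick))

# Wu-Palmer similarity, which uses the depth of the least common subsumer.
print(kiss.wup_similarity(kick))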

I would like to add more details which I have just found.
These details are enough for my purposes and may not answer the question above exactly, but I want to share them for anyone who needs them in the future.
'Entity' is not only the root of nouns, but also the root of any word, even if it is a VERB, ADJ or ADV.
E.g. the full path for the word 'kiss': ROOT#n#1 < entity#n#1 < abstraction#n#6 < psychological_feature#n#1 < event#n#1 < act#n#2 < touch#n#5 < kiss#n#1
E.g. the full path for the word 'kick': ROOT#n#1 < entity#n#1 < abstraction#n#6 < psychological_feature#n#1 < event#n#1 < act#n#2 < speech_act#n#1 < objection#n#2 < kick#n#4
To calculate the depth of any word, we count from the root word ('entity'), based on the WordNet hierarchical database.
Coming back to the example above, h (the depth of the subsumer of 'kiss' and 'kick') is 6, counted from the top tree node (ROOT) down to the word 'act'.
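To make the counting above concrete, here is a small NLTK sketch (the sense numbers kiss.n.01 and kick.n.04 are assumptions matching the paths above; note that NLTK has no extra ROOT node and counts entity.n.01 at depth 0, so its depth values are offset from the counting used above):

from nltk.corpus import wordnet as wn

kiss = wn.synset('kiss.n.01')
kick = wn.synset('kick.n.04')

# The subsumer is the lowest (deepest) common hypernym of the two synsets.
subsumer = kiss.lowest_common_hypernyms(kick)[0]
print(subsumer)              # expected to be the 'act' synset, per the paths above

# Depth of the subsumer: length of the hypernym chain from the root down to it.
print(subsumer.max_depth())

# Full hypernym path(s) from the root down to a synset, e.g. for 'kiss':
for path in kiss.hypernym_paths():
    print(' < '.join(s.name() for s in path))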


Google Sheets - Retrieve "A:File1" to "A:File2" where "Sheetname:File1" = "B:File2" if "C:File2" is between "E" and "F" in "File1"

Sorry for the somewhat long title, but I was told to be as specific as possible. :D
My problem will require some explanation.
So, I have 2 spreadsheets files ("Konverteringstabeller" and "Tee Posen").
In "Tee Posen" I have a sheet named "Scores MIK" (golf scorecard and my name).
In "Konverteringstabeller" I have sheets with conversion tables for multiple golf courses, but if one works, all should.
What I need is to find out what course handicap I would get if my golf handicap is "HCP 26,0" (as shown in File 2 Picture), and in this case that result should be 29 (not visible), but you should get the point.
(example: golf hcp 10 would result in course hcp 11, because 10 is between 9,9-10,7)
While I have been able to find the right result, it has only been in the "Konverteringstabeller" spreadsheet file and that is not the place I need it.
I want to have it written in E6 in the "Scores MIK" sheet in File 2.
I should mention that in "Scores MIK : File 2", cell C2 (Ikast Golf Klub) has data validation so I can easily change between the different courses in the "Konverteringstabeller" file once I add more.
What I have been messing with is something with vlookup and importrange with concatenate in it, but I can't figure out how to do it, so I ask for your help.
And I am by no means skilled in the art of Spreadsheets, so I would very much appreciate a detailed explanation.
Picture - Scores MIK (File 2)
Picture - Ikast Golf Klub (File 1)
Thanks in advance!
// Mikkel Christensen
OK, so a couple of notes. One is that to reference a fixed range while allowing the sheet name to change, you should add '$' to it; also, if the rows B8-E70 will always be in the same position on the various sheets, you should add '$' around those as well.
Here is an example of the whole formula:
=IFERROR(ARRAYFORMULA(VLOOKUP(E5:E25;IMPORTRANGE("spreadsheet key";"'"&C2&"'!$B$8:$E$70");4;TRUE)))
And lastly, using the "&" operator to concatenate is better, at least in my opinion, because CONCATENATE sometimes does not work as well with ARRAYFORMULA; plus I personally find it quicker and easier than having to wrap yet another function around everything.

matlab: check which lines of a path are used - graphshortestpath

The related problem comes from the power grid in Germany. I have a network of substations which are connected according to the lines. The shortest way from point A to B was calculated using the graphshortestpath function. The result is a path with the used substation IDs. I am interested in the Line IDs though, so I have written sequential code to figure out the used Line_IDs for each path.
This algorithm uses two for-loops: the first for-loop accesses each path from a cell array, the second for-loop looks at each connection and searches for the Line_ID in the array.
Question: Is there a better way of coding this? I am looking for the Line_IDs; graphshortestpath only returns the node IDs.
Here is the main code:
for i = i_entries
    path_i = LKzuLK_path{i};            % index with the loop variable i, not i_entries
    if length(path_i) > 3               % if length <= 3, no lines are used
        id_vb = 2:length(path_i) - 2;   % first and last entries are area IDs, skip them
        for id = id_vb
            node_start = path_i(id);
            node_end   = path_i(id+1);
            idx_line = find_line_idx(newlinks_vertices, node_start, node_end);
            Zuordnung_LKzuLK_pathLines(ind2sub(size_path,i), idx_line) = true;
        end
    end
end
Note: The first and last entry of path_i are area IDs, so they are not considered in the search for the Line_IDs.
function idx_line = find_line_idx(newlinks_vertices, v_id_1, v_id_2)
    % newlinks_vertices contains the Line_ID, then the two connecting substations
    % Mirror the vertex IDs so both directions of each line can be matched:
    check_links = [newlinks_vertices; ...
                   newlinks_vertices(:,1), newlinks_vertices(:,3), newlinks_vertices(:,2)];
    tmp_dist1 = find(check_links(:,2) == v_id_1);
    tmp_dist2 = find(check_links(tmp_dist1,3) == v_id_2, 1);
    tmp_dist3 = tmp_dist1(tmp_dist2);
    idx_line  = check_links(tmp_dist3,1);
end
Note: I have already tried to shorten the first find-search routine by indexing the links list. This step returns a short list with only the relevant link entries, so the algorithm avoids the first and most time-consuming find call. The result wasn't much better; the calculation time was still approximately 7 hours for 401*401 connections, which is too long to be practical.
I would look into Dijkstra's algorithm to get a faster implementation. This is what Matlab's graphshortestpath uses by default. The linked wiki page probably explains it better than I ever could and even lays it out in pseudocode!
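Since what you ultimately want are the Line_IDs rather than the node IDs, another angle is to store the Line_ID on each edge of the graph, so the IDs can be read straight off the node pairs of the returned path. A language-agnostic sketch of the idea (here in Python with networkx; the toy graph, weights and attribute names are assumptions, not your data):

import networkx as nx

# Build the grid graph: each line becomes an edge that remembers its Line_ID.
G = nx.Graph()
for line_id, node_a, node_b, length in [(1, 'A', 'B', 5.0),
                                        (2, 'B', 'C', 3.0),
                                        (3, 'A', 'C', 9.0)]:
    G.add_edge(node_a, node_b, weight=length, line_id=line_id)

# Shortest node path (Dijkstra), then read the Line_IDs off consecutive node pairs.
path = nx.shortest_path(G, 'A', 'C', weight='weight')
line_ids = [G.edges[u, v]['line_id'] for u, v in zip(path, path[1:])]
print(path)      # ['A', 'B', 'C']
print(line_ids)  # [1, 2]

The same idea works with the edge list you already have in MATLAB: sorting or hashing the (node_start, node_end) pairs once lets you replace the repeated find calls with a direct lookup.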

Unicode character usage statistics [closed]

I am looking for some statistical data on the usage of Unicode characters in textual documents (with any markup). Googling brought no results.
Background: I am currently developing a finite-state-machine-based text processing tool. Statistical data on characters might help when searching for the right transitions. For instance, Latin characters are probably the most used, so it might make sense to check for those first.
Did anyone by chance gather or see such statistics?
(I'm not focused on specific languages or locales. Think general-purpose parser like an XML parser.)
To sum up current findings and ideas:
Tom Christiansen gathered such statistics for the PubMed Open Access Corpus (see this question). I have asked if he could share these statistics and am waiting for the answer.
As @Boldewyn and @nwellnhof suggested, I could run the analysis over the complete Wikipedia dump or over CommonCrawl data. I think these are good suggestions; I'll probably go with CommonCrawl.
So sorry, this is not an answer, but a good research direction.
UPDATE: I have written a small Hadoop job and ran it on one of the CommonCrawl segments. I have posted my results in a spreadsheet here. Below are the first 50 characters:
0x000020 14627262
0x000065 7492745 e
0x000061 5144406 a
0x000069 4791953 i
0x00006f 4717551 o
0x000074 4566615 t
0x00006e 4296796 n
0x000072 4293069 r
0x000073 4025542 s
0x00000a 3140215
0x00006c 2841723 l
0x000064 2132449 d
0x000063 2026755 c
0x000075 1927266 u
0x000068 1793540 h
0x00006d 1628606 m
0x00fffd 1579150
0x000067 1279990 g
0x000070 1277983 p
0x000066 997775 f
0x000079 949434 y
0x000062 851830 b
0x00002e 844102 .
0x000030 822410 0
0x0000a0 797309
0x000053 718313 S
0x000076 691534 v
0x000077 682472 w
0x000031 648470 1
0x000041 624279 A
0x00006b 555419 k
0x000032 548220 2
0x00002c 513342 ,
0x00002d 510054 -
0x000043 498244 C
0x000054 495323 T
0x000045 455061 E
0x00004d 426545 M
0x000050 423790 P
0x000049 405276 I
0x000052 393218 R
0x000044 381975 D
0x00004c 365834 L
0x000042 353770 B
0x000033 334689 3
0x00004e 325299 N
0x000029 302497 )
0x000028 301057 (
0x000035 298087 5
0x000046 295148 F
To be honest, I have no idea whether these results are representative. As I said, I only analysed one segment, but it looks quite plausible to me. One can also easily see that the markup has already been stripped off, so the distribution is not directly suitable for my XML parser. But it gives valuable hints on which character ranges to check first.
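For reference, the per-character counting itself is simple; here is a minimal single-machine sketch (plain Python rather than a Hadoop job, with a placeholder file name) that produces the same kind of codepoint / count / character rows:

from collections import Counter

# Count Unicode codepoints in a text file (a placeholder stand-in for a crawl segment).
counts = Counter()
with open('segment.txt', encoding='utf-8', errors='replace') as f:
    for line in f:
        counts.update(line)

# Print the 50 most frequent characters as hex codepoint, count, character.
for char, n in counts.most_common(50):
    print('0x%06x %d %s' % (ord(char), n, char if char.isprintable() else ''))

(Note that errors='replace' maps undecodable bytes to U+FFFD, which is one possible explanation for the 0xfffd entry above.)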
I personally think the link to http://emojitracker.com/ in the near-duplicate question is the most promising resource for this. I have not examined the sources (I don't speak Ruby), but from a real-time Twitter feed of character frequencies I would expect quite a different result than from static web pages, and probably a radically different language distribution (I see lots more Arabic and Turkish on Twitter than in my otherwise ordinary life). It's probably not exactly what you are looking for, but if we just look at the title of your question (which probably most visitors will have followed to get here), then that is what I would suggest as the answer.
Of course, this raises the question of what kind of usage you are attempting to model. For static XML, which you seem to be after, maybe the CommonCrawl set is a better starting point after all. Text coming out of an editorial process (however informal) looks quite different from spontaneous text.
Out of the suggested options so far, Wikipedia (and/or Wiktionary) is probably the easiest, since it's small enough for local download, far better standardized than a random web dump (all UTF-8, all properly tagged, most of it properly tagged by language and proofread for markup errors, orthography, and occasionally facts), and yet large enough (and probably already overkill by an order of magnitude or more) to give you credible statistics. But again, if the domain is different than the domain you actually want to model, they will probably be wrong nevertheless.

What is the best way to get the short version of an article?

What do you think? What's the best way to get the short version of an article from the administrator (in a CMS)?
I have 3 possibilities in mind:
1) Create two textareas (WYSIWYG), one for the short and one for the full version of the article
2) The short version will be (for example) the first 100 words of the article
3) Have one separator (like <hr>), and the short version will be the text from the beginning up to the separator (the full version runs to the end)
A possibility would be to mix your first two ideas (I don't quite like the third one, where the content cannot live without the excerpt):
Use an additional textarea for the excerpt,
And if the user doesn't input anything in it, just use the first X words / sentences of the content.
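A minimal sketch of that fallback logic (field names are hypothetical; any server-side language would look similar):

def article_excerpt(excerpt, content, max_words=100):
    # Use the editor-supplied excerpt if present; otherwise fall back to the
    # first max_words words of the full content.
    if excerpt and excerpt.strip():
        return excerpt.strip()
    words = content.split()
    short = ' '.join(words[:max_words])
    return short + ('...' if len(words) > max_words else '')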

Sum of DOM elements using XPath

I am using MSXML v3.0 in a VB 6.0 application. The application calculates the sum of an Amount value across all Transaction nodes using a For Each loop, as shown below:
Set subNodes = docXML.selectNodes("//Transaction")
For Each subNode In subNodes
    total = total + Val(subNode.selectSingleNode("Amount").nodeTypedValue)
Next
This loop is taking too much time; sometimes it takes 15-20 minutes for 60 thousand nodes.
I am looking for an XPath/DOM solution to eliminate this loop, probably something like
docXML.selectNodes("//Transaction").Sum("Amount")
or
docXML.selectNodes("Sum(//Transaction/Amount)")
Any suggestion for getting this sum faster is welcome.
You can let the XPath engine compute the sum in a single call with the XPath sum() function; here is the .NET example from the KB article linked below:
// Open the XML.
XPathDocument docNav = new XPathDocument(@"c:\books.xml");
// Create a navigator to query with XPath.
XPathNavigator nav = docNav.CreateNavigator();
// Find the sum.
// This expression uses standard XPath syntax.
string strExpression = "sum(/bookstore/book/price)";
// Use the Evaluate method to return the evaluated expression.
Console.WriteLine("The price sum of the books is {0}", nav.Evaluate(strExpression));
source: http://support.microsoft.com/kb/308333
Any solution that uses the XPath // pseudo-operator on an XML document with 60000+ nodes is going to be quite slow, because //x causes a complete traversal of the tree starting at the root of the document.
The solution can be sped up significantly if a more specific XPath expression is used that doesn't include the // pseudo-operator.
If you know the structure of the XML document, always use a specific chain of location steps -- never //.
If you provide a small example, showing the specific structure of the document, then many people will be able to provide a faster solution than any solution that uses //.
For example, if it is known that all Transaction elements can be selected using this XPath expression:
/x/y/Transaction
then the evaluation of
sum(/x/y/Transaction/Amount)
is likely to be significantly faster than sum(//Transaction/Amount)
Update:
The OP has revealed in a comment that the structure of the XML file is quite simple.
Accordingly, I tried the following with an XML document containing 60000 Transaction nodes:
/*/*/Amount
With .NET's XslCompiledTransform (yes, I used XSLT as the host for the XPath engine) this took 220 ms, i.e. 0.22 seconds, to produce the sum.
With MSXML3 it takes 334 seconds.
With MSXML6 it takes 76 seconds -- still quite slow.
Conclusion: This is a bug in MSXML3 -- try to upgrade to another XPath engine, such as the one offered by .NET.
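For comparison, the same sum() expression evaluated through another XPath 1.0 engine, here Python's lxml (just a sketch to show the idea, not the OP's VB6/MSXML setup; the file name is a placeholder):

from lxml import etree

# Parse the document once and let the XPath engine do the summation in one call.
doc = etree.parse('transactions.xml')
total = doc.xpath('sum(//Transaction/Amount)')   # numeric XPath result comes back as a float
print(total)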