Is defaultdict() the correct choice?

EDIT: mistake fixed
The idea is to read text from a file, clean it, and pair consecutive words (not permutations):
file = f.read()
words = [word.strip(string.punctuation).lower() for word in file.split()]
pairs = [(words[i]+" " + words[i+1]).split() for i in range(len(words)-1)]
Then, for each pair, create a list of all the possible individual words that can follow that pair throughout the text. The dict will look like
[ConsecWordPair]:[listOfFollowers]
Thus, referencing the dictionary for a given pair will return all of the words that can follow that pair. E.g.
wordsThatFollow[('she', 'was')]
>> ['alone', 'happy', 'not']
My algorithm to achieve this involves a defaultdict(list)...
wordsThatFollow = defaultdict(list)
for i in range(len(words)-1):
    try:
        # pairs overlap, want second word of next pair
        # wordsThatFollow[tuple(pairs[i])] = pairs[i+1][1]
        EDIT: wordsThatFollow[tuple(pairs[i])].update(pairs[i+1][1][0]
    except Exception:
        pass
I'm not so worried about the value error I have to circumvent with the 'try-except' (unless I should be). The problem is that the algorithm only successfully returns one of the followers:
wordsThatFollow[('she', 'was')]
>> ['not']
Sorry if this post is bad for the community I'm figuring things out as I go ^^

Your problem is that you are always overwriting the value, when you really want to extend it:
# Instead of this
wordsThatFollow[tuple(pairs[i])] = pairs[i+1][1]
# Do this
wordsThatFollow[tuple(pairs[i])].append(pairs[i+1][1])
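For reference, here is a minimal self-contained sketch of the same idea. The function name build_followers and the sample sentence are made up for illustration; the cleaning step mirrors the one in the question. Zipping three staggered views of the word list avoids both the index arithmetic and the try/except:

```python
from collections import defaultdict
import string


def build_followers(text):
    """Map each pair of consecutive words to the list of words that follow it."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    followers = defaultdict(list)
    # (w1, w2) is the consecutive pair, w3 is the word right after it
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        followers[(w1, w2)].append(w3)  # append, never overwrite
    return followers


sample = "She was alone. She was happy, she was not."
words_that_follow = build_followers(sample)
print(words_that_follow[("she", "was")])  # ['alone', 'happy', 'not']
```

Because zip stops at the shortest input, the last pair simply gets no follower and no exception has to be swallowed.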

Need to create dictionary of idf values, associating words with their idf values

I understand how to get the idf values and vocabulary using the vectorizer. With vocabulary the frequency of the word is the value and the word is the key of a dictionary, however, what I want the value to be is the idf value.
I haven't been able to try anything because I don't know how to work with sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
The code provided above is what I was originally trying to work with.
I have since come up with a new solution that does not use scikit:
for string in text_array:
    for word in string:
        if word not in total_dict.keys(): # build up a word frequency in the dictionary
            total_dict[word] = 1
        else:
            total_dict[word] += 1
for word in total_dict.keys(): # calculate the tf-idf of each word in the dictionary using this url: https://nlpforhackers.io/tf-idf/
    total_dict[word] = math.log(len(text_array) / float(1 + total_dict[word]))
    print("word", word, ":" , total_dict[word])
Let me know if the code snippet above is enough to allow a reasonable estimation of what is going on. I included a link to what I was using for guidance.
You can directly use vectorizer.fit_transform(text) for the first time.
What it does is build a vocabulary set according to all the word/tokens in the text.
And then you can use vectorizer.transform(anothertext) to vectorize another text with the same mapping as the previous text.
More explanation:
fit() learns the vocabulary and the idf values from the training set. transform() transforms documents using the vocabulary learned by the previous fit().
So you should only do fit() once, and can transform many times.
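To get the word-to-idf dictionary the question asks for, one option is to compute it by hand. The sketch below is only an illustration, not scikit-learn's code: the function name idf_dictionary is made up, the tokenizer is a simple regex that keeps words of two or more characters, and the formula is the smoothed idf, ln((1 + n) / (1 + df)) + 1, that the scikit-learn documentation describes for its default settings. Note that it counts document frequency (how many documents contain the word), not total occurrences.

```python
import math
import re


def idf_dictionary(documents):
    """Return {word: idf} using the smoothed formula ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # a set, so each word counts at most once per document
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower()))
        for token in tokens:
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {
        token: math.log((1 + n_docs) / (1 + df)) + 1
        for token, df in doc_freq.items()
    }


text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
idf = idf_dictionary(text)
for word in sorted(idf):
    print(word, idf[word])
```

With scikit-learn itself (version 1.0 or later, assumed here), pairing the fitted vectorizer's get_feature_names_out() with its idf_ array, for example dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)), should give the same kind of mapping.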

Optimize code for string split and extraction

I have some code whose overall aim is to extract two numbers from a string.
I get the string from a cell array. To simplify this example, I have created the string directly in the test code below. I want to extract the 1400 into one cell and the 2 into a second cell. The code I have written works fine, but I think it can be optimized a lot (faster and more condensed). Do any of you have any suggestions?
Code:
test{1,1}='1:1400 og 2-fold'
FD1=test{1,1};
C = strsplit(FD1);
C2 = cell2mat(cellfun(@str2num,strrep(C,':',' '),'un',0));
C3 = cell2mat(C(1,3));
C3=strsplit(C3,'-');
Dilut1=C2(1,2);
Fold1=str2double(C3(1,1));
It really depends on your general structure. For this case, you can split the string at colon, space and dash by using:
A = strsplit(test{1,1},{':',' ','-'});
and then simply extract the two numbers as the second and fourth element
Dilut1=str2num(A{2});
Fold1 = str2num(A{4});
But as said it really comes down to your general structure. The more cases you have to account for the longer the code.
Thus it would maybe be better if you could write out something like
test{1,1}='1 dilute 1400 fold 2';
Then you could split at spaces, and search for the word you are interested in and the next string is then the number, ie
A = strsplit(test{1,1});
Dilute = str2num(A{circshift(strcmp(A,'dilute'),1)})
Fold = str2num(A{circshift(strcmp(A,'fold'),1)})

Find a Name in an Email (Low-Level I/O)

Round 2: Picking out leaders in an email
Alrighty, so my next problem is trying to figure out who the leader is in a project. In order to determine this, we are given an email and have to find who says "Do you want..." (capitalization may vary). I feel like my code should work for the most part, but I really have an issue figuring out how to correctly populate my cell array. I can get it to create the cell array, but it just puts the first line of the email in it over and over again. So each cell is basically the same.
function[Leader_Name] = teamPowerHolder(email)
    email = fopen(email, 'r'); %// Opens my file
    lines = fgets(email); %// Reads the first line
    conversations = {lines}; %// Creates my cell array
    while ischar(lines) %// Populates my cell array, just not correct
        Convo = fgets(email);
        if Convo == -1 %// Prevents it from just logging -1 into my cell array like a jerk
            break; %// Returns to function
        end
        conversations = [conversations {lines}]; %// Populates my list
    end
    Sentences = strfind(conversations,'Do you want'); %// Locates the leader position
    Leader_Name = Sentences{1}; %// Indexes that position
    fclose(email);
end
What I ideally need it to do is find the '/n' character (hence why I used fgets) but I'm not sure how to make it do that. I tried to have my while loop be like:
while lines == '/n'
but that's incorrect. I feel like I know how to do the '/n' bit, I just can't think of it. So I'd appreciate some hints or tips to do that. I could always try to strsplit or strtok the function, but I need to then populate my cell array so that might get messy.
Please and thanks for help :)
Test Case:
Anna: Hey guys, so I know that he just assigned this project, but I want to go ahead and get started on it.
Can you guys please respond and let me know a weekly meeting time that will work for you?
Wiley: Ummmmm no because ain't nobody got time for that.
John: Wiley? What kind of a name is that? .-.
Wiley: It's better than john. >.>
Anna: Hey boys, let's grow up and talk about a meeting time.
Do you want to have a weekly meeting, or not?
Wiley: I'll just skip all of them and not end up doing anything for the project anyway.
So I really don't care so much.
John: Yes, Anna, I'd like to have a weekly meeting.
Thank you for actually being a good teammate and doing this. :)
out2 = teamPowerHolder('teamPowerHolder_convo2.txt')
=> 'Anna'
The main reason why it isn't working is because you're supposed to update the lines variable in your loop, but you're creating a new variable called Convo that is updating instead. This is why every time you put lines in your cell array, it just puts in the first line repeatedly and never quits the loop.
However, what I would suggest you do is read in each line, then look for the : character, then extract the string up until the first time you encounter this character minus 1 because you don't want to include the actual : character itself. This will most likely correspond to the name of the person that is speaking. If we are missing this occurrence, then that person is still talking. As such, you would have to keep a variable that keeps track of who is still currently talking, until you find the "do you want" string. Whoever says this, we return the person who is currently talking, breaking out of the loop of course! To ensure that the line is case insensitive, you'll want to convert the string to lower.
There may be a case where no leader is found. In that case, you'll probably want to return the empty string. As such, initialize Leader_Name to the empty string. In this case, that would be []. That way, should we go through the e-mail and find no leader, MATLAB will return [].
The logic that you have is pretty much correct, but I wouldn't even bother storing stuff into a cell array. Just examine each line in your text file, and keep track of who is currently speaking until we encounter a sentence that has another : character. We can use strfind to facilitate this. However, one small caveat I'll mention is that if the person speaking includes a : in their conversation, then this method will break.
Judging from the conversation that I'm seeing in your test case, this probably won't be the case so we're OK. As such, borrowing from your current code, simply do this:
function[Leader_Name] = teamPowerHolder(email)
    Leader_Name = []; %// Initialize leader name to empty
    name = [];
    email = fopen(email, 'r'); %// Opens my file
    lines = fgets(email); %// Reads the first line
    % // fgets returns -1 (not a char) at the end of the file,
    % // so the loop quits like a boss when there is nothing left
    while ischar(lines)
        % // Check if this line has a ':' character.
        % // If we do, then another person is talking.
        % // Extract the characters just before the first ':' character
        % // as we don't want the ':' character in the name
        % // If we don't encounter a ':' character, then the same person is
        % // talking so don't change the current name
        idxs = strfind(lines, ':');
        if ~isempty(idxs)
            name = lines(1:idxs(1)-1);
        end
        % // If we find "do you want" in this sentence, then the leader
        % // is found, so quit.
        if ~isempty(strfind(lower(lines), 'do you want'))
            Leader_Name = name;
            break;
        end
        % // Get the next line in your e-mail. Reading at the end of the
        % // loop means the first line is examined too
        lines = fgets(email);
    end
    fclose(email); %// Close the file
end
By running the above code with your test case, this is what I get:
out2 = teamPowerHolder('teamPowerHolder_convo2.txt')
out2 =
Anna
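For anyone who wants to see the same bookkeeping outside MATLAB, here is a rough Python sketch of the logic described above. The name find_leader and the shortened conversation are made up for illustration, and it shares the caveat already mentioned: a ':' inside someone's message is mistaken for a new speaker.

```python
def find_leader(lines, phrase="do you want"):
    """Return the speaker of the first line containing phrase, or '' if none."""
    speaker = ""
    for line in lines:
        # a ':' marks a new speaker; the name is everything before the first one
        if ":" in line:
            speaker = line.split(":", 1)[0].strip()
        # case-insensitive search for the phrase
        if phrase in line.lower():
            return speaker
    return ""


conversation = [
    "Anna: Hey guys, I want to go ahead and get started on it.",
    "Wiley: Ummmmm no because ain't nobody got time for that.",
    "Anna: Hey boys, let's grow up and talk about a meeting time.",
    "Do you want to have a weekly meeting, or not?",
    "John: Yes, Anna, I'd like to have a weekly meeting.",
]
print(find_leader(conversation))  # Anna
```

To run it on a file, the lines could come from open(path) instead of a list, since iterating over a file object yields one line at a time.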

JQuery Wildcard for using attributes in selectors

I've researched this topic extensively and I'm asking as a last resort before assuming that there is no wildcard for what I want to do.
I need to pull up all the text input elements from the document and add them to an array. However, I only want to add the input elements that have an id.
I know you can use the \S* wildcard when using an id selector such as $(#\S*), however I can't use this because I need to filter the results by text type only as well, so I'm searching by attribute.
I currently have this:
values_inputs = $("input[type='text'][id^='a']");
This works how I want it to but it brings back only the text input elements that start with an 'a'. I want to get all the text input elements with an 'id' of anything.
I can't use:
values_inputs = $("input[type='text'][id^='']"); //or
values_inputs = $("input[type='text'][id^='*']"); //or
values_inputs = $("input[type='text'][id^='\\S*']"); //or
values_inputs = $("input[type='text'][id^=\\S*]");
//I either get no values returned or a syntax error for these
I guess I'm just looking for the equivalent of * in SQL for JQuery attribute selectors.
Is there no such thing, or am I just approaching this problem the wrong way?
Actually, it's quite simple:
var values_inputs = $("input[type=text][id]");
Your logic is a bit ambiguous. I believe you don't want elements with any id, but rather elements where id does not equal an empty string. Use this.
values_inputs = $("input[type='text']")
    .filter(function() {
        return this.id != '';
    });
Try changing your selector to:
$("input[type='text'][id]")
I figured out another way to use wild cards very simply. This helped me a lot so I thought I'd share it.
You can use attribute wildcards in the selectors in the following way to emulate the use of '*'. Let's say you have dynamically generated form in which elements are created with the same naming convention except for dynamically changing digits representing the index:
id='part_x_name' //where x represents a digit
If you want to retrieve only the text input ones that have certain parts of the id name and element type you can do the following:
var inputs = $("input[type='text'][id^='part_'][id$='_name']");
and voila, it will retrieve all the text input elements that have "part_" in the beginning of the id string and "_name" at the end of the string. If you have something like
id='part_x_name_y' // again x and y representing digits
you could do:
var inputs = $("input[type='text'][id^='part_'][id*='_name_']"); //the *= operator means that it will retrieve this part of the string from anywhere where it appears in the string.
Depending on what the names of other id's are it may start to get a little trickier if other element id's have similar naming conventions in your document. You may have to get a little more creative in specifying your wildcards. In most common cases this will be enough to get what you need.

Generate unique 3 letter/number code and compare to existing ones in PHP/MySQL

I'm making a code generation script for UN/LOCODE system and the database has unique 3 letter/number codes in every country. So for example the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia, "AR TLL" can also exist (the country code and the 3 letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations, the user has also the possibility of entering the 3 letter/number him/herself (which will be checked against the database before submission automatically).
Finally, neither 0 nor 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've come up with:
1. I'd check from AAA to 999, but then each code would require a new query (slow?).
2. I could store all the 40000 possibilities in an array and subtract all the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually, maybe 40000 isn't such a big number).
3. Generate a random code, check whether it already exists, and start over again if it does. That's just risk taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I will go with number 2, it is simple and 40000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 (A-Z, 2-9) letters.
I would go for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.
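As a quick check of that arithmetic, here is a small Python sketch of the mapping (A-Z = 0..25, 2-9 = 26..33). The names ALPHABET, code_to_number and number_to_code are made up for illustration, and this is plain Python rather than the MySQL approach:

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789"  # 34 symbols, no 0 or 1


def code_to_number(code):
    """Read the code as a base-34 number, most significant symbol first."""
    number = 0
    for ch in code:
        number = number * len(ALPHABET) + ALPHABET.index(ch)
    return number


def number_to_code(number, length=3):
    """Inverse of code_to_number, padded with 'A' (zero) up to length symbols."""
    chars = []
    for _ in range(length):
        number, digit = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


print(code_to_number("XYZ"))  # 27429
print(number_to_code(27429))  # XYZ
```

With this mapping AAA is 0 and 999 is 34^3 - 1 = 39303, so "the next code" is just the next integer converted back.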
I went with the 2nd option. I was also able to make a script that will try to match the location name as closely as possible: for example, for Tartu it will try to match T** then TA* and if possible TAR; if not, it will try TAT, as T is the next letter after R in Tartu.
The code is quite extensive, I'll just post the part that takes the first possible code:
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();
// store all possibilities in a huge array
for($i=0;$i<$length;$i++)
    for($j=0;$j<$length;$j++)
        for($k=0;$k<$length;$k++)
            $codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);
$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
    $used[] = $result['code'];
$remaining = array_diff($codes, $used);
// array_diff() preserves keys, so index 0 is missing whenever 'AAA' is taken;
// reset() returns the first remaining element regardless of its key
$code = reset($remaining);
Thanks for your opinion, this will be the key to transport codes all over the world :)
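A variation on the same idea, sketched in Python rather than PHP: instead of building all 34^3 codes up front, generate them lazily in the same order and stop at the first one that is not taken. The name first_free_code is made up, and used_codes is assumed to hold the result of the single per-country query shown above.

```python
from itertools import product

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789"  # 34 symbols, no 0 or 1


def first_free_code(used_codes, length=3):
    """Return the first code in AAA..999 order that is not in used_codes."""
    used = set(used_codes)  # set membership tests are fast
    # product() yields AAA, AAB, ... in the same order as the nested loops
    for letters in product(ALPHABET, repeat=length):
        code = "".join(letters)
        if code not in used:
            return code
    return None  # every code for this country is taken


print(first_free_code(["AAA", "AAB", "TLL"]))  # AAC
```

Only one query per country is needed, and the loop usually ends after a handful of iterations unless the country's code space is nearly full.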