Database language organization, by word or by language? - mongodb

I want to add a dictionary of phrases (most of them single words), but I am undecided whether to organize them by word, such as:
{word:"bacon", en:"bacon", ro:"sunca", fr:"jambon"}
or by language:
{
  en:{ bacon:"bacon" },
  ro:{ bacon:"sunca" },
  fr:{ bacon:"jambon" }
}
I realize both have pros and cons and are equally valid, but I am seeking the wisdom of those who have met this problem before, made a choice, and are either happy with it or regret it, and, of course, why.
Thank you.

The representation below is simple and elegant. Keep in mind, though, that document design in MongoDB (or most NoSQL databases, for that matter) is heavily influenced by the usage pattern of the data.
{word:"bacon", en:"bacon", ro:"sunca", fr:"jambon"}
Assuming you want to look up the translations by passing in the word, this representation has the following merits:
Intuitive
No duplication
You can put an index on the word
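To make the lookup pattern concrete, here is a minimal Python sketch of the word-keyed layout. It is not MongoDB code: the list `entries` is a hypothetical in-memory stand-in for the collection, the dict `by_word` plays the role of the index on `word`, and the `egg` entry is sample data added for illustration.

```python
# Hypothetical stand-in for the collection: one document per word.
entries = [
    {"word": "bacon", "en": "bacon", "ro": "sunca", "fr": "jambon"},
    {"word": "egg", "en": "egg", "ro": "ou", "fr": "œuf"},
]

# Plays the role of the index on `word`: one dict lookup per query.
by_word = {doc["word"]: doc for doc in entries}


def translate(word, to):
    """Return the translation of `word` into language `to`, or None if unknown."""
    doc = by_word.get(word)
    return None if doc is None else doc.get(to)
```

A single lookup returns every translation of the word at once, which is why this layout fits the "pass in a word, get the other languages" usage pattern.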

Of the two options that you provide, the first one is better. But since the number of languages is bounded, I would actually go for arrays in your case:
LANG = { en: 0, ro: 1, fr: 2 }
DICT = [
  [:ham, :sunca, :jambon],
  [:egg, :ou, :œuf]
]
For translation, write e.g.:
# Find the row whose `from` column matches `word`, then return its `to` column.
def translate( word, from: nil, to: nil )
  from = LANG[from]
  to = LANG[to]
  entry = DICT.find { |e| e[from] == word }
  entry && entry[to]  # nil when the word is not in the dictionary
end
translate( :egg, from: :en, to: :fr )
#=> :œuf
Please note that this layout was chosen to minimize the size of the dictionary. As for speed, the lookup above is a linear scan; faster matching algorithms are available, e.g. using suffix trees.
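As one simple way to speed up the lookup, here is a Python sketch that keeps the same array layout but adds a hash index per language. This is a swapped-in technique (plain dict lookup, not a suffix tree), and the name `INDEX` and the parameter names `source` and `target` are my own, not part of the answer above.

```python
LANG = {"en": 0, "ro": 1, "fr": 2}
DICT = [
    ("ham", "sunca", "jambon"),
    ("egg", "ou", "œuf"),
]

# One dict per language, mapping each word to its row in DICT.
# Built once; it trades extra memory for constant-time lookups on average.
INDEX = {
    lang: {row[col]: row for row in DICT}
    for lang, col in LANG.items()
}


def translate(word, source, target):
    """Translate `word` from `source` to `target`; None if the word is unknown."""
    row = INDEX[source].get(word)
    return None if row is None else row[LANG[target]]
```

The index costs one extra dict per language, so it gives back some of the size savings in exchange for avoiding the scan.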

Swift - should I create a local variable for a string's "count"?

Does it matter if I use a string's 'count' multiple times within a function? That is, does Swift cache the 'count' after it first computes it? Below are two examples; does it matter which one I use? I assume the second is definitely okay, but what about the first? I see example code like the first one all the time.
func Foo1 (str: String) {
    ...
    // calling str.count twice
    if x < str.count && y < str.count {
        ...
    }
}
func Foo2 (str: String) {
    ...
    // calling str.count once
    let c = str.count
    if x < c && y < c {
        ...
    }
}
.count is defined by the Collection protocol with the following complexity:
Complexity: O(1) if the collection conforms to RandomAccessCollection; otherwise, O(n), where n is the length of the collection.
String is not a RandomAccessCollection. It's a BidirectionalCollection, so it does not promise O(1). It only promises O(n).
It definitely does not promise any caching (and you shouldn't expect any).
It happens to be true that in many (probably most) cases, String's count is cached. It's part of _StringObject, which is part of the low-level storage abstraction, and it's often inlined by the optimizer. But none of this is promised.
That said, unless you expect the String to be extremely large (10kB at a minimum, possibly more), it is difficult to imagine this being a major bottleneck by being called twice outside a tight loop. As with most things, you should write clearly, and then profile. I would likely create an extra variable just for clarity, but you shouldn't second-guess here too much. Write clearly. Then profile.
Do you have particularly large strings that you're working with?

Index of word in string 'covering' certain position

Not sure if this is the right place to ask but I couldn't find any related or similar questions.
Anyway: imagine you have a certain string like
val exampleString = "Hello StackOverflow this is my question, cool right?"
Given a position in this string, for example 23, return the index of the word that 'occupies' this position. Looking at the example string, the character at position 23 (counting from 0) is the letter 's' (the last character of 'this'), so we should return index = 4 (because 'this' is the 5th word, and word indices also start at 0). In my question, spaces are counted as words. If, for example, we were given position 5, we land on the first space and thus we should return index = 1.
I'm implementing this in Scala (but this should be quite language-agnostic and I would love to see implementations in other languages).
Currently I have the following approach (assume exampleString is the given string and charPosition the given position):
exampleString.split("((?<= )|(?= ))").scanLeft(0)((a, b) => a + b.length()).drop(1).zipWithIndex.takeWhile(_._1 <= charPosition).last._2 + 1
This works, but it is way too complex, to be honest. Is there a better (more efficient?) way to achieve this? I'm fairly new to functions like fold, scan, map, filter ... but I would love to learn more.
Thanks in advance.
def wordIndex(exampleString: String, index: Int): Int = {
  exampleString.take(index + 1).foldLeft((0, exampleString.head.isWhitespace)) {
    case ((n, isWhitespace), c) =>
      if (isWhitespace == c.isWhitespace) (n, isWhitespace)
      else (n + 1, !isWhitespace)
  }._1
}
This will fold over the string, keeping track of whether the previous character was a whitespace or not, and if it detects a change, it will flip the boolean and add 1 to the count (n).
This handles groups of spaces (e.g. in "hello   world", world would be at position 2). Spaces at the start of the string count as index 0, so the first word would then be index 1.
Note that this can't handle an empty string as input; I'll let you decide what you want to do in that case.
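Since the question asks for implementations in other languages, here is a Python sketch of the same fold. It follows the answer's logic (a run of whitespace counts as one "word"); raising `ValueError` on an empty string is my own assumption about the undecided edge case, and the names `word_index`, `text` and `position` are mine.

```python
def word_index(text: str, position: int) -> int:
    """Return the 0-based index of the run (word or whitespace group)
    that covers the 0-based character `position`."""
    if not text:
        # Assumption: the empty-string case is treated as an error.
        raise ValueError("text must not be empty")
    index = 0
    previous_is_space = text[0].isspace()
    for ch in text[: position + 1]:
        # Every switch between whitespace and non-whitespace starts a new run.
        if ch.isspace() != previous_is_space:
            index += 1
            previous_is_space = not previous_is_space
    return index
```

Like the Scala version, this reads at most `position + 1` characters and needs no regular expression or intermediate array.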

Iterating through all but the last index of an array

I understand that in Swift 3 there have been some changes from typical C Style for-loops. I've been working around it, but it seems like I'm writing more code than before in many cases. Maybe someone can steer me in the right direction because this is what I want:
let names: [String] = ["Jim", "Jenny", "Earl"]
for var i = 0; i < names.count - 1; i+=1 {
    NSLog("%@ loves %@", names[i], names[i+1])
}
Pretty simple stuff. I like to be able to get the index I'm on, and I like the for-loop to not run if names.count == 0. All in one go.
But it seems like my options in Swift 3 aren't allowing me this. I would have to do something like:
let names: [String] = ["Jim", "Jenny", "Earl"]
if names.count > 1 {
    for i in 0...(names.count - 2) {
        NSLog("%@ loves %@", names[i], names[i+1])
    }
}
The if statement at the start is needed because my program will crash whenever the upper bound of the range falls below the lower bound, as in: for i in 0...(-1) { }
I also like the idea of being able to just iterate through everything without explicitly setting the index:
// Pseudocode
for name in names.exceptLastOne {
    NSLog("%@ loves %@", name, name.plus(1))
}
I feel like there is some sort of syntax that mixes all my wants, but I haven't come across it yet. Does anyone know of a way? Or at least a way to make my code more compact?
UPDATE: Someone suggested that this question has already been asked, citing an SO post where the solution was something along the lines of:
for (index, name) in names.enumerated() {}
The problem with this, compared to Hamish's answer, is that I am only given the index and value of the current name. To get the next value I would still need to do something like:
names[index + 1]
That's just one extra variable to keep track of. I prefer Hamish's which is:
for i in names.indices.dropLast() {
    print("\(names[i]) loves \(names[i + 1])")
}
Short, simple, and only have to keep track of names and i, rather than names, index, and name.
One option would be to use dropLast() on the array's indices, allowing you to iterate over a CountableRange of all but the last index of the array.
let names = ["Jim", "Jenny", "Earl"]
for i in names.indices.dropLast() {
    print("\(names[i]) loves \(names[i + 1])")
}
If the array has fewer than two elements, the loop won't be entered.
Another option would be to zip the array with the array where the first element has been dropped, allowing you to iterate through the pairs of elements with their successor elements:
for (nameA, nameB) in zip(names, names.dropFirst()) {
    print("\(nameA) loves \(nameB)")
}
This takes advantage of the fact that zip truncates the longer of the two sequences if they aren't of equal length. Therefore if the array has less than two elements, again, the loop won't be entered.
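For readers coming from other languages, the zip-with-an-offset idea is not specific to Swift. Here is the same pairing as a short Python sketch; the helper name `adjacent_pairs` is hypothetical.

```python
def adjacent_pairs(items):
    """Pair each element with its successor.

    zip stops at the shorter input, so lists with fewer than
    two elements produce no pairs at all.
    """
    return list(zip(items, items[1:]))


for name_a, name_b in adjacent_pairs(["Jim", "Jenny", "Earl"]):
    print(f"{name_a} loves {name_b}")
```

No index variable and no explicit length check are needed, which is the same property the Swift version relies on.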

Alternative compound key ranges in CouchDB

Assuming a mapreduce function representing object relationships like:
function (doc) {
  emit([doc.source, doc.target, doc.name], null);
}
The normal example of filtering a compound key is something like:
startKey = [ a_source ]
endKey = [ a_source, {} ]
That should provide a list of all objects referenced from a_source.
Now I want the opposite, and I am not sure whether that is possible. I have not been able to find an example where the variant part comes first, like:
startKey = [ *symbol_first_match* , a_destination ]
endKey = [ {} , a_destination ]
Is that possible? Are the (1) filter and (2) sort operations on compound keys within a query limited to the order in which the elements appear in the key?
I know I could define another view/mapreduce, but I would like to avoid the extra disk space cost if possible.
No, you can't do that. See here where I explained how keys work in view requests with CouchDB.
Compound keys are nothing special, no filtering or anything. Internally you have to imagine that there is no array anymore. It's just syntactic sugar for us developers. So [a,b] - [a,c] is treated just like 'a_b' - 'a_c' (with _ being a special delimiter).
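To illustrate why only a leading prefix can be range-queried, here is a small Python sketch. It uses Python's tuple ordering as a stand-in for the view's sorted key order; this is an analogy, not CouchDB's actual collation, and the sample keys, `HIGH` and `prefix_range` are all made up for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical [source, target, name] keys, kept in sorted order like a view index.
keys = sorted([
    ("a", "x", "r1"),
    ("a", "y", "r2"),
    ("b", "x", "r3"),
    ("b", "z", "r4"),
    ("c", "x", "r5"),
])

# Sentinel that sorts after any ordinary string; stands in for the {} end marker.
HIGH = chr(0x10FFFF)


def prefix_range(sorted_keys, prefix):
    """Return the contiguous slice of keys that start with `prefix`."""
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_right(sorted_keys, prefix + (HIGH,))
    return sorted_keys[lo:hi]


# Keys sharing a first element sit next to each other, so one range covers them.
from_a = prefix_range(keys, ("a",))

# Keys sharing a second element are scattered, so no single start/end pair covers them.
target_x_positions = [i for i, key in enumerate(keys) if key[1] == "x"]
```

Here `target_x_positions` comes out as non-adjacent positions, which is why filtering on the target first would need a view whose key starts with `doc.target`.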

Best way to store data in a hash for flexible "pivot-table" like calculations

I have a data set with the following fields:
host name, model, location, port number, activated?, up?
I would like to convert them into a hash structure (perhaps similar to the one below):
my %switches = (
    a => {
        "hostname" => "SwitchA",
        "model" => "3750",
        "location" => "Building A",
        "total_ports" => 48,
        "configured_ports" => 30,
        "used_ports" => 24,
    },
    b => {
        "hostname" => "SwitchB",
        "model" => "3560",
        "location" => "Building B",
        "total_ports" => 48,
        "configured_ports" => 36,
        "used_ports" => 20,
    },
);
In the end I want to generate statistics such as:
No. of switches per building,
No. of switches of each model per building
Total no. of up ports per building
The statistics need not be restricted to buildings; they may also be switch-based (e.g. the number of switches that are 95% used). With the given data structure, how can I compute those counters?
Conversely, is there a better way to store my data? I can think of at least one format:
# while iterating over records
{
    $hash{$location}{$model_name}{count}++;
    if ($State eq 'Active') { $hash{$location}{up_ports}{count}++ }
}
What would be the better way to go about this? If I chose the first format (where all information is intact inside the hash) how can I mash the data to produce different statistics? (some example code snippets would be of great help!)
If you want querying flexibility, a "database" strategy is often good. You can do that directly, by putting the data into something like SQLite. Under that approach, you would be able to issue a wide variety of queries against the data without much coding of your own.
Alternatively, if you're looking for a pure Perl approach, the way to approximate a database table is by using an array-of-arrays or, even better for code readability, an array-of-hashes. The outer array is like the database table. Each hash within that array is like a database record. Your Perl-based queries would end up looking like this:
my @query_result = grep {
    $_->{foo} == 1234 and
    $_->{bar} eq 'fubb'
} @data;
If you have so many rows that query performance becomes a bottleneck, you can create your own indexes, using a hash.
%data_by_switch = (
    'SwitchA' => [0, 4, 13, ...], # Subscripts to @data.
    'SwitchB' => [1, 12, ...],
    ...
);
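The same table-like approach can be sketched in Python, in case it helps to see the pieces side by side. Everything here is illustrative: the per-port sample rows are invented, the field names simply follow the question's column list, and `rows_by_switch` mirrors the hand-built index idea above.

```python
from collections import Counter, defaultdict

# Hypothetical per-port records: one dict per row, like one hash per record.
rows = [
    {"hostname": "SwitchA", "model": "3750", "location": "Building A", "port": 1, "activated": True, "up": True},
    {"hostname": "SwitchA", "model": "3750", "location": "Building A", "port": 2, "activated": True, "up": False},
    {"hostname": "SwitchB", "model": "3560", "location": "Building B", "port": 1, "activated": True, "up": True},
    {"hostname": "SwitchC", "model": "3750", "location": "Building A", "port": 1, "activated": False, "up": False},
]

# A "query" is just a filter over the rows, like the grep block.
up_in_building_a = [r for r in rows if r["location"] == "Building A" and r["up"]]

# Hand-built index: hostname -> subscripts into rows.
rows_by_switch = defaultdict(list)
for i, r in enumerate(rows):
    rows_by_switch[r["hostname"]].append(i)

# Pivot-style counters.
up_ports_per_building = Counter(r["location"] for r in rows if r["up"])
distinct_switches = {(r["location"], r["hostname"]) for r in rows}
switches_per_building = Counter(location for location, _ in distinct_switches)
```

Each new statistic is one more comprehension or counter over the same rows, so the stored data never has to be reshaped for a particular report.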
My answer is based on answers I received for this question, which has some similarities with your question.
As far as I can see, you have a list of tuples. For the sake of discussion, it is enough to consider objects with two attributes, for example location and ports_used. So, for example:
(["locA", 23], ["locB", 42], ["locA", 13]) # just the values as tuples, no keys
And you want a result like:
("locA" => 36, "locB" => 42)
Is this correct? If so, what is the problem you are facing?
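If that reading is right, the aggregation is a plain group-and-sum. A minimal Python sketch, assuming the input really is a list of (location, ports_used) pairs; the function name `sum_by_key` is hypothetical:

```python
from collections import defaultdict


def sum_by_key(pairs):
    """Add up the second element of each pair, grouped by the first."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


totals = sum_by_key([("locA", 23), ("locB", 42), ("locA", 13)])
```

With the sample tuples above this yields 36 for locA and 42 for locB, matching the expected result.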