Extracting important sub-sections and the subset of documents associated with them from a set of documents - cluster-analysis

I have a set of documents all of which come under the category "crime".
Now, I want to categorize them into a number of (possibly overlapping) clusters, where each cluster is formed under a sub-category such as murder or kidnapping.
I want to accomplish this using some way of identifying the importance of the individual words occurring in each document. I have already tried TF-IDF, but it is not giving me satisfactory results.

An alternative is to assign weights to frequently occurring words, and then group the words using a k-prototypes or k-modes approach.
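For instance, a hedged sketch of that idea: binarize word presence and cluster with the third-party kmodes package. The sample documents, feature cap, and cluster count here are all made up.

    # Sketch only: cluster documents on binary word-presence features with k-modes.
    # Assumes scikit-learn and the third-party 'kmodes' package (pip install kmodes).
    from sklearn.feature_extraction.text import CountVectorizer
    from kmodes.kmodes import KModes

    docs = [
        "the suspect fired a gun at the victim",
        "the victim was kidnapped and held for ransom",
        "police recovered the stolen vehicle",
    ]

    # binary=True keeps only word presence/absence, which suits k-modes
    X = CountVectorizer(binary=True, max_features=500).fit_transform(docs).toarray()

    km = KModes(n_clusters=2, init="Huang", n_init=5)
    print(km.fit_predict(X))  # one hard cluster label per document

Note that k-modes yields hard, non-overlapping assignments; overlapping sub-categories would need a soft or multi-label method on top.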

You'll need supervision.
Words such as "suspect" and "gun" are likely significant, but on their own they do not produce desirable categories. An unsupervised approach cannot know what a "kind of" crime is.
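To make the supervision point concrete, here is a hedged sketch of one common route: hand-label a small seed set with sub-categories, then let a multi-label classifier (which allows overlapping categories) tag the rest. The seed set and every name here are illustrative, and it assumes scikit-learn.

    # Sketch only: multi-label classification from a small hand-labeled seed set.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    seed_docs = [
        "the suspect fired a gun at the victim",
        "the child was kidnapped and held for ransom",
        "the victim was shot during the abduction",
    ]
    seed_labels = [["murder"], ["kidnapping"], ["murder", "kidnapping"]]

    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(seed_labels)  # one binary column per sub-category

    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(LogisticRegression()))
    clf.fit(seed_docs, y)

    pred = clf.predict(["a ransom note was found at the scene"])
    print(mlb.inverse_transform(pred))  # sub-categories for the new document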

Suggested way to structure Firestore database for deeply nested set of spreadsheet-like objects

My application is used for creating production budgets for complex projects (construction, media productions, etc.).
The structure of the budget is as follows:
The budget contains "sections",
the sections contain "accounts",
the accounts contain "subaccounts",
the subaccounts contain line items.
Line items have a number of fields (units, rate, currency, tax, etc.) and a calculated total.
Certain fields in line items may contain alphanumeric codes which represent numeric values, which a user can use instead of a hard-coded number; e.g. the user can enter "=build-weeks" and define that with a formula that evaluates to, say, "7", which is then used in the calculation of a total.
Line items bubble up their totals: subaccounts have a total equal to the sum of their line items,
account totals equal the sum of their subaccount totals,
section totals equal the sum of account totals,
and the budget total is the sum of section totals.
The question is how to aggregate this data into the documents comprising the budget.
Budgets may be fairly long, say 5,000 line items or more in total. Single accounts may have hundreds of line items.
Users will most likely look at all of the line items for a given account, so it occurred to me
to make individual documents for sections, accounts and subaccounts, and to make line items a map within a subaccount.
The main concern I have with this approach is that when the user changes, say, the exchange rate of a line item's currency, or changes the calculated value of a named value like "build-weeks", I will have to retrieve all the individual line items containing that currency or named value, recalculate their totals, and then bubble the changes up through the hierarchy.
This seems not that complicated if each line item is its own document: I can just search the collection for the presence of the code in question, recalculate the line item, and perhaps use a Cloud Function to bubble up the changes.
But if all the line items are contained in an array of maps within each subaccount document,
it seems like it will be quite tedious to find and change them when necessary.
On the other hand, keeping these documents so small seems like a lot of document reads when somebody is reviewing a budget or, say, printing it. If somebody just clicks on a bunch of accounts, it might be hundreds of reads per click to retrieve all the line items, and hundreds or a thousand writes when somebody changes the value of an often-used named value like "build-weeks".
Or perhaps using Firestore to do these cascading calculations is the wrong approach? Should I just load a single complex budget document into my app, do all the calculations and updates on the client, and then write back the entire budget as a single document when the user presses "save budget"?
Does anybody have any thoughts on the obvious "right" answer to this? Or does it just depend on what I want to optimize for: Firestore costs, responsiveness of the app, complexity of the code?
From my standpoint there is no obvious answer to your problem, and indeed it does depend on what you want to optimize for.
However, there are a few points that you need to consider in your decision:
Documents in Firestore have a limit of 1 MiB per document;
Documents in Firestore have a limit of 20,000 fields;
Queries are shallow, so you don't get data from subcollections in the same query.
Considerations 1 and 2 mean that if you choose to design your database as one big document containing everything, you need to watch those limits. Even though you said that your app will have lots of data, I doubt it will exceed the limits mentioned; still, do consider them. Also ask how necessary it is to get all the data at once: this could cause performance issues and user battery/data usage issues (if you are making a mobile app).
Consideration 3 means that you would have to make many reads if you choose to split the data for your sections into subdocuments; this will mean more cost to you but better performance for users.
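To make consideration 3 concrete, here is a minimal sketch (Python Admin SDK) of the split design with line items in their own subcollection, so that a change to a named value like "build-weeks" can be found with a single collection-group query. The layout, field names, and formula are all assumptions, not a recommendation:

    # Sketch only, assuming a layout of
    #   budgets/{b}/sections/{s}/accounts/{a}/subaccounts/{sa}/lineItems/{li}
    # and line items that carry a denormalized budgetId plus a 'codes' list
    # of the named values they reference.
    from google.cloud import firestore

    db = firestore.Client()

    def update_named_value(budget_id: str, code: str, new_value: float) -> None:
        """Recalculate every line item that references a named value."""
        # collection_group() spans all 'lineItems' subcollections in one query
        # (Firestore will prompt you to create a composite index for this)
        items = (db.collection_group("lineItems")
                   .where("budgetId", "==", budget_id)
                   .where("codes", "array_contains", code)
                   .stream())
        batch = db.batch()  # note: a single batch is limited to 500 writes
        for snap in items:
            item = snap.to_dict()
            total = item["units"] * new_value * item["rate"]  # stand-in formula
            batch.update(snap.reference, {"total": total})
        batch.commit()
        # subaccount/account/section totals would then bubble up, e.g. via a
        # Cloud Function triggered on lineItems writes

The trade-off is the one described above: reviewing an account costs one read per line item, but a change to "build-weeks" touches only the affected documents.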
To make the right call on this problem, I suggest that you talk to possible users of your solution, to understand the problem that you are trying to fix and what they expect of the app. Also, it might be interesting to take a look at the How to Structure Your Data and the Maps, Arrays and Subcollections videos, as they explain in a more visual way how Firestore behaves, and they could help you anticipate problems that the approach you choose might cause.
Hope I was able to help with these considerations.

Can I find text that's "close" to some query with PostgreSQL?

I have a table in my DB called text. It will have something like "this is an example of lite coin". I want to query this for "litecoin" and things that are close (like "lite coin"). Is there some way to do this generically, as I will have multiple queries? Maybe something with a maximum Levenshtein distance?
There is a core extension to PostgreSQL which implements the Levenshtein distance. For strings of very unequal length, as in your example, the distance will of necessity be large. So you would have to implement some normalization method, unless all phrases being searched within are the same length.
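For reference, a sketch in SQL: the extension is fuzzystrmatch, and dividing by the longer string's length is just one ad-hoc normalization. The table and column (texts.body) are stand-ins for yours:

    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- normalize by the longer length so long rows aren't penalized;
    -- note that levenshtein() rejects strings longer than 255 characters
    SELECT body,
           levenshtein(body, 'litecoin')::float
             / greatest(length(body), length('litecoin')) AS norm_distance
    FROM texts
    ORDER BY norm_distance
    LIMIT 10;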
I don't think Levenshtein is indexable. You could instead look into trigram distance, which is indexable.
+1 on the trigram suggestion. Trigrams in Postgres are excellent and, for sure, indexable. Depending on the index option you choose (GIN or GiST), you get access to different operators. If I remember correctly off the top of my head, GiST gives you distance tolerances for the words and lets you search for them in order; you can specify the number of words expected between two search words, and more. Both GIN and GiST are worth experimenting with.
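A sketch of the trigram route, again against a stand-in texts.body:

    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX texts_body_trgm ON texts USING gin (body gin_trgm_ops);

    SELECT body, similarity(body, 'litecoin') AS sim
    FROM texts
    WHERE body % 'litecoin'  -- true above pg_trgm.similarity_threshold (default 0.3)
    ORDER BY sim DESC;

On PostgreSQL 9.6 and later, word_similarity() and its <% operator are, if memory serves, the better fit when the query should match a word inside a longer string such as "this is an example of lite coin".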
Levenshtein compares two specific strings, so it doesn't lend itself to indexing. What would you index? The comparison string is unknown in advance. You could index every string by every other string in a column and, apart from the O(aaaargh!) complexity, you still might not have anything like your search string in the index.
Tip: If you must use Levenshtein (and it is pretty great where it's useful), you can cheaply eliminate many rows from your comparison. Since a single edit changes a string's length by at most one, if you've got a 10-character search string and want only strings within a distance of 2, you can eliminate strings shorter than 8 or longer than 12 characters from consideration without fear of losing any matches.
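Expressed as a WHERE clause for a 10-character search term and a maximum distance of 2 (the 'searchterm' literal and texts.body are placeholders):

    SELECT body
    FROM texts
    -- the cheap length test lets most rows skip the expensive call
    WHERE length(body) BETWEEN length('searchterm') - 2 AND length('searchterm') + 2
      AND levenshtein(body, 'searchterm') <= 2;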
You might find that you want to apply Levenshtein (or Jaccard, etc.) to possible matches found by the trigrams. But, honestly, Levenshtein is, by nature, biased towards strings in the same order. That's okay for lite coin/light coin/litecoin, but not helpful when the words can be in any order, like with first and last name, much address data, and many, many phrase-like searches.
The other thing to consider, depending on your range of queries, is full-text search with tsvectors. These are also indexable, and they support a range of operators too.
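A sketch of the full-text route; note that it matches word stems, so it finds "lite coin" as two words but would not, by itself, bridge "litecoin" and "lite coin":

    CREATE INDEX texts_body_fts ON texts
        USING gin (to_tsvector('english', body));

    SELECT body
    FROM texts
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'lite coin');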

MongoDB database design - contest application

I'm building a contest application, which has 4 collections so far:
contest
questions
matches
users
I want to store every user's score for every match he's assigned to, but I really can't find a proper way to achieve this.
All I've come up with is to replace matches in users with an array in which each element contains a reference to the matches collection and a score field. But I think this is not very efficient.
EDIT
I was thinking about another solution: a separate collection called scores that contains three fields: user, match and score.
Here's my schema structure:
Contests:
Questions:
Matches:
Users:
Note: Any recommended adjustments to the current design are welcome too.
Since MongoDB is not designed to support relationships between collections, you might end up with some duplication; I would suggest you find a way of storing as much data as you can in a single document.
Your scores would go in each match document; the users array would probably have this structure: {'users': [{'user_id': 'xxx', 'score': xxx}, {'user_id': 'xxx', 'score': xxx}]}
The other solution would be what you say: to have in each user document a matches array with a structure like this: {'matches': [{'match_id': 'xxx', 'score': xxx}, {'match_id': 'xxx', 'score': xxx}]}
You can also have both; this might be more efficient depending on the kind of queries you will need to do. You can also have a field in the subdocuments that stores the user/match name/title.
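As an illustration, a minimal pymongo sketch of the first layout (scores embedded in each match document); the database, collection, and field names are all made up:

    from pymongo import MongoClient

    matches = MongoClient()["contest_app"]["matches"]

    # add a user to a match with an initial score (upsert creates the
    # match document on first use)
    matches.update_one(
        {"_id": "match1"},
        {"$push": {"users": {"user_id": "user1", "score": 0}}},
        upsert=True,
    )

    # later, update that user's score in place via the positional $ operator
    matches.update_one(
        {"_id": "match1", "users.user_id": "user1"},
        {"$set": {"users.$.score": 42}},
    )

    # all scores for one match come back in a single read
    print(matches.find_one({"_id": "match1"}, {"users": 1}))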
Note: As you can see, you have two options: either you optimize for document size (so you can store more) or you optimize for performance (so you can read faster, with fewer resources).
Hope this is of some help.

Storing word frequency data

I am trying to store word frequency data using Mongo. Each word needs to be associated to a user so I can calculate how often an individual uses each word. Currently my words collection looks like this:
{'Hello':3, 'user_id':1}
This obviously only works on a one-to-one basis and is no good.
I am trying to work out how best to make this a one-to-many relationship between the user and the words. Would I store the user relationship in my words collection, like so:
{'word':"Hello", 'users':[{'id':1, 'count':4},{'id':2, 'count':10}]}
Or would I attach the word counts to the user collection instead?
{'id':1, 'username':'SomeUser', 'words':[{'Hello':4}]}
The obvious disadvantage of the second approach is that the same words will be used across different users, so having a single words collection would help to keep the data size down.
Can anyone advise me as to what I should do here? Is there a method I have perhaps overlooked in the documentation?
"The obvious disadvantage of the second approach is that the same words will be used across different users, so having a single words collection would help to keep the data size down."
Nope, that's the nature of using a document DB. Data size is really not the main concern in NoSQL solutions; the important thing is how easily and how quickly you can access your data.
Your first approach is a typical textbook relational model, and there is no advantage to using it in Mongo (though you can model things relationally in Mongo). The second approach instead gives you:
Faster reads/writes, since every word count is stored inside the user document; you don't need to perform multiple queries for this.
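For what it's worth, a minimal pymongo sketch of that embedded approach, keeping the counts in a words sub-document keyed by the word itself; the database and field names are made up:

    from pymongo import MongoClient

    users = MongoClient()["wordstats"]["users"]

    def record_word(user_id, word):
        """Increment a per-user counter, creating the user document and the
        counter on first use. Caveat: words containing '.' or starting with
        '$' would need escaping before being used as keys."""
        users.update_one(
            {"_id": user_id},
            {"$inc": {"words." + word: 1}},
            upsert=True,
        )

    record_word(1, "hello")
    record_word(1, "hello")
    print(users.find_one({"_id": 1}))  # {'_id': 1, 'words': {'hello': 2}}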

Query for set complement in CouchDB

I'm not sure that there is a good way to do this with the facilities CouchDB provides, but I'd like to somehow extract the relative complement of the sets of two different document types over a particular key.
For example, let's say that I have documents representing users and posts, both of which have a (unique) username field. There's a validation in place ensuring that a user document exists for the username in every post, but there may be any number of post documents with a given username, including none. It's trivial to create a view which counts the number of posts per username. The view can even include zero counts, by emitting zero post-counts for the user documents in the view's map function. What I want to do, though, is retrieve just the list of users who have zero associated posts.
It's possible to build the view I described above and filter client-side for zero-value results, but in my actual situation the number of results could be very, very large, and the interesting results a relatively small proportion of the total. Is there a way to do this server-side and retrieve back just the interesting results?
I would write a map function to iterate through the documents and emit the users (or just the usernames) with 0 posts.
Then I would write a list function to iterate through the map function results and format them however you want (JSON, csv, etc).
(I would NOT use a reduce function to format the results, even if a reduce function appears to work OK in development. That is just my own experience from lessons learned the hard way.)
Personally I would filter on the client-side until I had performance issues. Next I would probably use Teddy's _filter technique—all pretty standard CouchDB stuff.
However, I stumbled across (IMO) an elegant way to find set complements. I described it when exploring how to find documents missing a field.
The basic idea
Finding non-members of your view obviously can't be done with a simple query (and a straightforward index scan). However, it can be done in constant memory and linear time by iterating through two query results simultaneously.
One query is for all possible document ids. The other query is for matching documents (those you don't want). Importantly, CouchDB sorts query results, therefore you can calculate the complement efficiently.
See my details in the previous question. The basic idea is that you iterate through both (sorted) lists simultaneously, and when you find a document id that is listed in the full set but missing from the sub-set, that is a hit.
(You don't have to query _all_docs, you just need two queries to CouchDB: one returning all possible values, and the other returning values not to be counted.)
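Here is that merge sketched in Python; fetching the two sorted id lists from the two CouchDB queries is left out, and the sample data is made up:

    def complement(all_ids, matching_ids):
        """Yield ids present in all_ids but absent from matching_ids.

        Both inputs must be sorted the same way, which CouchDB guarantees
        for view results; runs in linear time and constant extra memory."""
        matching = iter(matching_ids)
        current = next(matching, None)
        for doc_id in all_ids:
            while current is not None and current < doc_id:
                current = next(matching, None)
            if current != doc_id:
                yield doc_id  # in the full set, missing from the sub-set: a hit

    all_users = ["alice", "bob", "carol", "dave"]   # every username
    with_posts = ["alice", "dave"]                  # usernames with >= 1 post
    print(list(complement(all_users, with_posts)))  # ['bob', 'carol']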