Query a large list of metadata in Weaviate

I have 100,000 images, each with 500 ORB vectors, and each image has a unique tag.
My general issue is: when I insert a new image (i.e. 500 new vectors), how can I know whether the image's tag is already in the database?
What I do is attach a "tag" metadata property to each vector. I can retrieve the inserted tags with
result = client.query.get('orb_vector', ['tag']) \
    .with_limit(200) \
    .do()
This returns roughly 200 tags out of the 100,000 that exist.
According to the documentation, this approach is not scalable.
How should I do this?
Context:
My database is not very dynamic; apart from the initial big insertion (100,000+ images), there will only be a few insertions each day. So I'm okay with a request that takes 5 minutes and with keeping the result in memory in a non-dynamic way. A plain Python list is fine.
Clarification: each image has one tag, but 500 vectors. So each tag is present 500 times in the database.
I'm using Python.
What I can do:
Write the list of tags to a JSON file/Mongo/other store and read/update it each time I insert new images.
I would prefer to avoid this solution, since keeping the Weaviate database and that external store in sync would just be a nightmare.

Have you considered creating a separate class for the tags and using query filters?
For example, define a schema for a class named Tag where:
it has a property called "name" to store the tag's name, e.g. outdoors, indoors, etc.
it has a property called "images" to store cross-references to the images that carry that tag.
Then, when you want to insert an image with the tag "car", for example, you run a WHERE filter on the Tag class where the name property is Equal to "car".
If the result is empty, then that tag does not exist yet.
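A minimal sketch of that check with the v3 Python client (the endpoint, the OrbVector class name and the cross-reference property are assumptions for illustration, not taken from the question):

import weaviate

client = weaviate.Client("http://localhost:8080")  # assumed local instance

# Hypothetical schema for the Tag class described above.
client.schema.create_class({
    "class": "Tag",
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "images", "dataType": ["OrbVector"]},  # cross-reference to the vector class
    ],
})

def tag_exists(tag_name):
    # WHERE filter on Tag.name; an empty result means the tag is new.
    result = (
        client.query.get("Tag", ["name"])
        .with_where({"path": ["name"], "operator": "Equal", "valueText": tag_name})
        .with_limit(1)
        .do()
    )
    return len(result["data"]["Get"]["Tag"]) > 0

This way the existence check is a single filtered query per inserted image, instead of paging through all 50,000,000 vector objects.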


Sulu CMS: how to search/filter for content of a specific type with specific values for specific attributes?

Short description of the situation:
We're running a forked version of Sulu 1.5.2, PHP 7.1, a Windows server environment, and a database connection to PostgreSQL.
We have a website structure/tree where we have house templates at the top level; each house has one house_rooms and one house_occupants template; each house_rooms template has N house_rooms_room templates, and each house_occupants template has N house_occupants_occupant templates. This represents an actual House that has N Rooms and N Occupants.
Now I'd like to know if there is a way to specifically get, for instance, all the house_occupants_occupant content that matches a certain pattern of attributes (for instance: their gender attribute having the value 'female' and their date_of_birth attribute being >= 1990/01/01), without having to load each house, then find its house_occupants page among the children, and then loop over that template's house_occupants_occupant children and filter the resulting content on their gender and date of birth attributes.
I already found that there is a ContentRepository class that can ::findAll() and ::findByUuids(), but there doesn't seem to be a way to filter on specific attributes (like template type, template attributes, ...). So I took the roundabout route of creating my own "repository" that runs direct PDO queries against the phpcr_nodes table in the database, specifically scanning the props column for the occurrence of a certain template name:
$this->pdo->query("SELECT identifier, props FROM phpcr_nodes WHERE props LIKE '%>house_occupants_occupant<%'");
I can see that the props column contains a string value representing an XML document that somehow translates into the entire template with attribute-value pairs, but the tag levels and the way attributes relate to values are obscured. So in theory I could use a specific XML parser to turn this into something human-readable, so that for my house_occupants_occupant data I could get something like:
// what I would get after putting the props through a certain XML parser:
$xmlHumanReadableData = [
    '<the_uuid_of_occupant_1>' => [
        ...
        'gender' => 'female',
        'date_of_birth' => '1992-05-18T00:00:00.000+00:00',
        ...
    ],
    ... // etcetera etcetera
];
Once I had that, I could filter the readable data to determine which content I want to keep, add each node's uuid to some $theUuids variable, and then retrieve the actual content using Sulu's ContentRepository::findByUuids($theUuids) method. That would "only" require 2 queries and some PHP array filtering in between, which is a great deal better than looping over all the children content starting from a certain parent and repeating this until you've traversed all the parents and all their children... (Certainly, the overhead would increase if you wanted to search for, for instance, all house nodes where at least one of the house_occupants_occupant nodes represents a child less than 10 years old, since you'd need extra queries to "set up" the filter data used in the final query. But still: a great deal better than looping over everything... ;-) )
So my question is sort-of twofold:
What is the Sulu-specific XML parser I can use to turn the XML string value in this props column into something human-readable, with proper attribute-value pairs?
And/or, hopefully: is there a way I can avoid all this nonsense and just use a less low-level way of retrieving content of a specific template type with specific values for specific attributes?
The ContentRepository you've found is already an abstraction over some of our requirements for pages. Your requirements are quite specific, so you should write your own query using SQL-2, the query language for PHPCR.
This should enable you to write a query which matches your requirements.

Parse Data Model for users tagging items

I'm trying to figure out a good data model approach when using Parse (MongoDB underneath) for users that are allowed to select a tag and provide a value for it; let's just call it a rating, since that's the easiest to understand.
I'm working on a class called "user tags" whose collection currently has the following structure:
User (pointer to user class)
Object (pointer to object to tag)
Tags (array of tags with values)
There can be up to maybe 30 tags, and each of them can have a rating of 1-5 in this case...
I was wondering if I could use a PFRelation-like structure in an array that has the objectId of the tag as the key and the 1-5 rating as the value. Here is an example JSON object mocked up to show what I mean:
{
    "3q24afadfadf": 3.5,  // parse relation object id : value
    "234rrdfadfk": 2.4,   // parse relation object id : value
    "as4q2w34lsdf": 2.3   // parse relation object id : value
}
This way I can store one row for the item that the user tagged, with all the tags and their rating values along with it.
I'm not sure this is the right way, or whether it's scalable for queries like "get me all the items a user has tagged, along with the tag names and values".
On top of that, when many users tag the same item with different values, I need to figure out a way to build up some analytics, or maybe a counter class that gets incremented or averaged into, to then be displayed along with the item. I might try Cloud Code with an afterSave hook to update the analytics class for that item.
Anyhow, any thoughts on this model would be appreciated. Most importantly, I need to be able to get at the data inside the tag array, ideally with the key being a pointer; if it can't be a pointer I'm open to suggestions, because the result should return:
Item A
Tag name 1 with value 4.5
Tag Name 2 with value 3.5
and so on..
...
Also, I'd appreciate any pointers on how to build aggregated data for an item and its overall value as many users tag it over time. My thought, as above, is to have an analytics class that Cloud Code (or the app) increments; the challenge is to load all the user tags of item X, get each tag and value out of the array, and then add them to the analytics class. I could run this at night since it doesn't have to be real time.
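For what it's worth, the aggregation bookkeeping itself is cheap if you keep a running count and mean per item/tag pair instead of re-reading every row. A rough sketch of the arithmetic in Python (the names are made up; in Parse this would live in a Cloud Code afterSave hook):

# Hypothetical running aggregate for one (item, tag) pair.
def update_aggregate(current_count, current_mean, new_rating):
    # Fold one more rating into the stored average without reloading history.
    new_count = current_count + 1
    new_mean = current_mean + (new_rating - current_mean) / new_count
    return new_count, new_mean

# Example: a pair averaging 3.5 over 10 ratings receives a new rating of 5.0.
count, mean = update_aggregate(10, 3.5, 5.0)  # -> (11, ~3.64)

The nightly batch job you describe would still work, but with this kind of incremental update the analytics row can stay current on every save.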

elasticsearch array field of keywords - how to index it

I've got input that is analogous to tags, where there are a couple of strings per record, and they should be thought of as keywords, not to be tokenized or broken up or analyzed in any particular way. I want it to show up in faceting "as-is", including spaces, slashes, dashes and ampersands.
I don't think I need multi_field here. There is one input field per record, "keyPhrases", but its value is a simple JSON array of strings.
I want Elasticsearch to insert each of those values into the facets and tag the record with all of the phrases.
Usually there are only one or two or three phrases per record, but there could be more. The set of keyPhrases is fairly small, like 30 or at most like 50. They could be thought of as "categories".
The faceting keeps breaking up the input strings and using lowercasing, even though I'm trying to specify not_analyzed, keyword tokenizer, keyword analyzer, and trying things like that.
I have other fields that keep their spacing and capitalization as I want in the returned facets; however, those fields are not_analyzed and store: true, and they also have exactly one string value per record rather than many.
I could just take the top 1 keyPhrase per record and flatten it, but ideally all the tags would work and be available as facets.
Any ideas on how to do this?
Well, this is embarrassing.
My strict mapping wasn't actually committed to the server at the time I was trying this.
(I was dropping and recreating the index with each new mapping without realizing that the mapping being applied was not the final one, so the mapping I thought was in effect was getting loaded and then dropped.)
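For reference, a minimal sketch of the kind of mapping that keeps the phrases intact, using the Python client against a pre-5.x cluster (the index, type and field names here are only illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# Recreate the index with keyPhrases mapped as not_analyzed, so terms facets
# return each phrase verbatim (spaces, slashes, dashes and case preserved).
es.indices.delete(index="records", ignore=404)
es.indices.create(
    index="records",
    body={
        "mappings": {
            "record": {
                "properties": {
                    "keyPhrases": {"type": "string", "index": "not_analyzed"}
                }
            }
        }
    },
)

An array of strings needs no special array mapping; a document can simply hold several keyPhrases values. On Elasticsearch 5.x and later the equivalent is "type": "keyword".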

Efficiently updating cosine similarity scores

My iPhone application is using a SQLite database with the following schema:
items(id, name, ...) -> this table contains 50 records
tags(id, name) -> this table contains 50 records
item_tags(id, item_id, tag_id, user_id)
similarities(id, item1_id, item2_id, score)
The items, tags, item_tags and similarities tables are populated with pre-defined records, hence the similarities between the different items have also already been calculated offline (using the cosine similarity algorithm based on the items' tags).
Users are able to add additional tags to items and to remove their custom tags later on. Whenever this happens the similarity scores between the items should be updated locally, i.e. without contacting the server application.
My question now is the following:
What is the most efficient way to do so? So far, on startup of the iPhone application, I compute a term-document matrix for all the items and tags (which reflects the tag frequencies for each item) and keep this matrix in memory as long as the application is running. Whenever a tag is added or removed, I use this matrix to update the similarities in the database. However, this is rather inefficient. Do you have any suggestions?
Thanks!
This presentation might help you:
http://www.slideshare.net/jnvms/incremental-itembased-collaborative-filtering-4095306
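In the same spirit as that presentation: since adding or removing one tag for item i only changes i's own norm and the dot products involving i, only that item's rows in the similarities table have to be rewritten. A rough sketch of the bookkeeping in Python (the variable names are invented; in the app this would be Objective-C plus SQL updates):

import math

# counts[item][tag] holds tag frequencies; dot[(i, j)] and norm_sq[i] are kept
# alongside the similarity table so a single tag change can be patched in.
def apply_tag_change(item, tag, delta, counts, dot, norm_sq, items):
    # delta is +1 when a tag is added to `item`, -1 when it is removed
    old = counts[item].get(tag, 0)
    counts[item][tag] = old + delta
    norm_sq[item] += 2 * old * delta + delta * delta  # ||item||^2 update

    for other in items:
        if other == item:
            continue
        key = tuple(sorted((item, other)))
        dot[key] += delta * counts[other].get(tag, 0)  # only this coordinate moved
        denom = math.sqrt(norm_sq[item] * norm_sq[other])
        score = dot[key] / denom if denom else 0.0
        # here: UPDATE similarities SET score = ? for this (item, other) pair

With 50 items that is at most 49 similarity updates per tag change, instead of rebuilding the whole term-document matrix.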

Is it possible to store hidden metadata information that is tied to a specific Table or Cell within a Word document?

I am trying to store metadata (basically a unique id) along with each cell of a table in a Word document. Currently, for the add-in I'm developing, I am querying the database, and building a table inside the Word document using the data that is retrieved.
I want to be able to save any of the user's edits to the document and persist them back to the database. My initial thought was to store a unique id along with each cell in the table so that I would be able to tell which records to update. I would also like to store some sort of "isChanged" flag within each cell so that I could tell which cells were changed. I found that I could add the needed information to the "ID" property of the cell; however, that information was not retained if the user saved the document, closed it, and re-opened it. I then tried storing the data by adding a field to the "Fields" collection, but that did not work and threw a runtime error. Here is the code that I tried:
object t1 = Word.WdFieldType.wdFieldEmpty;
object val = "myValue: " + counter;
object preserveFormatting = true;
tbl.Cell(i, j).Range.Fields.Add(tbl.Cell(i, j).Range, ref t1, ref val, ref preserveFormatting);
This compiles fine, but throws this runtime error "This command is not available".
So, is this possible at all? Or am I headed in the wrong direction?
Thanks in advance.
Wound up using "ContentControls" to store the information I needed. I used the "Title" property to store the unique id, and the "Tag" property to track whether the field had been changed or not. See this link for more info: http://blogs.technet.com/gray_knowlton/archive/2010/01/15/associating-data-with-content-controls.aspx
Since a "Word 2007 Document" is XML, you can add a namespace to the document then adore the elements with attributes from your namespace. Word should ignore your namespace when loading and saving. Moreover, you can add new elments to store any information (metadata) needed.
With that said, I have not used this technique with Word, but I have done it successfully using Excel 2003.
The first thing to try is to create a bare "Word 2007 Document". In your case, add a simple two-by-two table. Open it with a text or XML editor, add your namespace, adorn an element with an attribute, and add a new element. Open it with Word, make a change, then save it. Open it with the editor again and make sure your namespace attribute and element have not been changed.