Training a neural network without historical data - neural-network

I am building a highly personalised recommender system from scratch where I have no historical data for the interactions between users and items.
However, when a user is added to the system, he must provide a list of tags for the items he:
1. really likes;
2. has no opinion about;
3. dislikes.
Then, based on those tags, I am able to match items to groups 1, 2 and 3.
So I am thinking of sampling items from groups 1, 2 and 3 and assigning them the target values 1, 0 and -1 respectively in order to train my neural network. After the training step I would have a neural network highly personalised for each user, which would allow me to start recommending items that match each user's preferences despite having no historical data.
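Roughly, the per-user training step I have in mind would look something like this (just a sketch: the tag vocabulary, the sampled items and the network size are made-up placeholders, and I'm using scikit-learn's MLPRegressor only for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical tag vocabulary; in reality this would come from the item catalogue.
TAGS = ["sci-fi", "sports", "cooking", "travel", "history"]

def item_features(item_tags):
    """Encode an item as a multi-hot vector over the tag vocabulary."""
    return np.array([1.0 if t in item_tags else 0.0 for t in TAGS])

# Items sampled from the user's three groups, labelled 1 / 0 / -1.
liked    = [["sci-fi", "history"], ["sci-fi"]]
neutral  = [["travel"]]
disliked = [["cooking"], ["sports", "cooking"]]

X = np.array([item_features(t) for t in liked + neutral + disliked])
y = np.array([1.0] * len(liked) + [0.0] * len(neutral) + [-1.0] * len(disliked))

# One small network per user, trained only on that user's tag-derived labels.
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, y)

# Score an unseen item: a higher output means a better candidate to recommend.
print(model.predict(item_features(["sci-fi", "travel"]).reshape(1, -1)))
```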
Of course, as the user starts providing feedback on the recommended items, I would update the network to match his new preferences.
With that said, does this approach make sense, or are neural networks not the best fit for this specific case?

First of all, you did not explain your specific question or problem clearly enough, which usually results in an answer you probably did not expect, but I'll try to give some meaningful information rather than a plain 42.
You did not specify what you would like your recommendation system to achieve, so it is not clear what exactly you are planning to base your recommendations on. Is it a correlation between user A's preferences and all other users' preferences that should suggest products user A has not seen but might like?
That seems to be the most likely case based on the description. So you are looking for some sort of solution to the Netflix challenge, usually called collaborative filtering. Your model as described is much simpler than the data Netflix or Amazon has, but it still cannot operate without any data, so the initial guesses are going to be completely off and will annoy users. One of my friends is constantly annoyed by recommendations that other people who liked this movie also watched that - he says it's always wrong, even though Netflix has lots of data and a comprehensive recommendation engine. So expect a lot of frustration and possibly even vandalism (as when users deliberately provide incorrect feedback because of the poor quality of the recommendations). The only way to avoid it is to collect data first by asking for feedback and only give recommendations after you have collected a sufficient number of samples.
We are slowly getting to the actual question as stated: is a neural network a good tool for the job? If you have a sufficient amount of data that fits a simple model like the one you described, with a small number of false positives (poor recommendations) and a large number of true positives (correct recommendations), it is. How much data you need depends on the number of products and the strength of the correlation between them being liked and disliked. If you have two products with no correlation, no amount of collected data will do any good. If all your products are very similar, the correlation will be strong but spread equally between all of them, so again you wouldn't be able to give any useful advice until you collect a very large amount of data, which would simply filter out some poor goods. The best case is highly correlated yet very different products (something like a high-end mountain bike and a GoPro camera). Those should be reliably chained together based on other users' preferences.
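To make the correlation point a bit more concrete, here is a minimal sketch of item-item similarity computed from a toy user-product feedback matrix (all the numbers are invented for illustration):

```python
import numpy as np

# Toy feedback matrix: rows = users, columns = products,
# 1 = liked, -1 = disliked, 0 = no feedback. All numbers are invented.
R = np.array([
    [ 1,  1,  0, -1],
    [ 1,  1, -1,  0],
    [-1,  0,  1,  1],
    [ 0, -1,  1,  1],
])

# Item-item cosine similarity: how strongly two products are "chained"
# together by the users' preferences.
norms = np.linalg.norm(R, axis=0)
similarity = (R.T @ R) / np.outer(norms, norms)
print(np.round(similarity, 2))
```

With real data, a user who liked one product would be shown the products whose columns are most similar to it; with uncorrelated or uniformly correlated products this matrix is flat and useless, which is the point above.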
So without further information you won't get much useful insight. What you describe, if I have filled in the blanks somewhat correctly, makes sense, but whether it will work and how much data you'll need will really depend on the specifics of the products and users involved.
I hope it helps.

Related

In NoSQL, how do you handle massive updates to common dependent data?

I really want to understand the NoSQL approach, but some aspects baffle me. And the most prominent docs don't seem to address them (that I've found, so far).
For example, I'm looking at the CouchDB website...
Self-Contained Data
An invoice contains all the pertinent information about a single transaction: the seller, the buyer, the date, and a list of the items or services sold. [...] With self-contained documents, there's no abstract reference on this piece of paper that points to some other piece of paper with the seller's name and address. Accountants appreciate the simplicity of having everything in one place. And given the choice, programmers appreciate that, too.
By "one abstract reference" I think they mean an FK, right? And in an analogous SQL DB the "some other piece of paper" would be a row in a sellers table?
Ok, but what happens when it turns out someone messed up and the seller's address is actually on Maple Avenue, not Maple Lane? And you have 96,487 invoices that say Maple Lane.
What is the orthodox NoSQL way of dealing with that inevitability?
Do you scan your 4.8 million invoice "documents" for the 96k with "Lane", dredge them up, and execute 96k writes?
And if so, in this described CouchDB-based app, WHO goes in and performs that? Because, guessing here, but I imagine your front end probably doesn't have a view with a Seller form. Because your sellers are all embedded inside invoices, right? So in NoSQL, does this sort of data correction & maintenance become the DBA's job?
(Also, do you actually repeat all of the seller's info on every single invoice involving that seller? Doesn't that get expensive?)
And in a huge, busy system, how do you ensure that all that repeated seller data is correct and consistent?
I'm considering which storage technology to look at for a series of upcoming projects. NoSQL is obviously extremely popular and widely adopted. In some domains it's kind of the "Golden Path"/default choice. If I want to use PostgreSQL with Node.js I'll have to scrounge for info about less popular libraries and support.
So there's significant real-world pressure towards MongoDB, CouchDB, etc.
Yet in the systems I'm designing, the questions I mention above are going to really matter. Is there a proven, established, and practical way of addressing these concerns?
What is the orthodox NoSQL way of dealing with that inevitability?
Two possible approaches:
1. Essentially the same as the pre-SQL (i.e. paper filing cabinets) way:
Update the master file for the customer.
Use the new address on all new invoices.
Historical invoices will continue to have wrong data. But that's okay, and arguably even better than the RDBMS way, because it accurately reflects history.
2. Go to the extra work of updating all the affected documents. With properly built indexes or views, this isn't that hard (you won't have to scan all 4.8 million invoices - your view will direct you straight to the 96k actually affected by the change).
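As a sketch of option 2, assuming MongoDB with pymongo (collection and field names here are invented; in CouchDB you would query a view and write the changed documents back in bulk instead):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
invoices = client.billing.invoices   # hypothetical database/collection names

# An index on the embedded field means the update touches only matching
# documents instead of scanning all 4.8 million invoices.
invoices.create_index([("seller.address.street", ASCENDING)])

result = invoices.update_many(
    {"seller.name": "Acme Corp", "seller.address.street": "Maple Lane"},
    {"$set": {"seller.address.street": "Maple Avenue"}},
)
print(result.modified_count)  # e.g. the ~96k affected invoices
```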
I imagine your front end probably doesn't have a view with a Seller form.
Why not? If you do seller-based queries, I sure hope you have a seller-based view (or several).
Because your sellers are all embedded inside invoices, right?
That's irrelevant. Views can index any part of the data.
do you actually repeat all of the seller's info on every single invoice involving that seller?
Of course. You would repeat it every time you print an invoice on paper, right? Your database document is a "document", same as a printed invoice is.
Doesn't that get expensive?
If you're storing your entire database on a mobile phone, maybe. Otherwise, hard drives are cheap these days.
Yet in the systems I'm designing, the questions I mention above are going to really matter.
NoSQL isn't right for every job. If transactional integrity is important (and it likely is for a financial app like the one you seem to be discussing), NoSQL is likely not the right tool.
Think of CouchDB as a sync protocol with a database tacked on for good luck.
If your core feature is the ability to sync, then CouchDB is probably a good fit. If that's not a feature core to your application, then it's probably the wrong tool for the job.

Item recommendation service

I'm supposed to build a book recommendation service using MyMediaLite. So far I have collected books from a website using the Nutch crawler and I'm storing the info in HBase. The problem is that I don't fully understand how all of this works. Judging by the examples, I have to pass in test and training data files with user-item id pairs and a rating. But what about other information about a book, like categories and authors? How is it possible to find "similar" books by their information etc., without information about the user (so far)? Is it possible to pass data directly from HBase, without storing it to a file and then loading it in?
Or is Apache Mahout or LibRec better suited for this job?
User-item-rating information, often in a matrix, is the basis for collaborative filtering algorithms (user-user CF, item-item CF, matrix factorization, and others). You're using other people's opinions to form recommendations. There's no innate recognition of the content of the items themselves. For that, you'll need some sort of content-based filtering algorithm or data mining technique. These are often used in the "user cold start" scenario you described: you have lots of information about items but not about a particular user's preferences.
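As a small illustration of the content-based side, here is a minimal sketch that uses scikit-learn rather than MyMediaLite, with entirely made-up book metadata:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up book metadata: concatenate whatever content you have
# (categories, authors, description) into one text field per book.
books = {
    "Dune":               "science fiction space politics Frank Herbert",
    "Foundation":         "science fiction empire psychohistory Isaac Asimov",
    "Salt Fat Acid Heat": "cooking food technique Samin Nosrat",
}

titles = list(books.keys())
vectors = TfidfVectorizer().fit_transform(books.values())

# Item-item similarity from content alone - no user data needed,
# which is exactly the cold-start situation described above.
sim = cosine_similarity(vectors)
for i, title in enumerate(titles):
    ranked = sorted(zip(sim[i], titles), reverse=True)
    print(title, "->", [t for _, t in ranked if t != title][0])
```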
First, think about your end goal and the data you have. Based on your product needs and available data, you can choose the right algorithm for your purposes. I highly recommend the RecSys course on Coursera for learning more: https://www.coursera.org/learn/recommender-systems. It's taught by a leader in the field.

Is OLAP the right approach

I have a requirement to develop a reporting solution for a system which has a large number of data items, with a significant number of these being free-text fields. Almost every value in the tables needs to be accessible to a team of analysts who carry out reporting, analysis and data provision.
It has been suggested that an OLAP solution would be appropriate for delivering this; however, the general need is to retrieve records, not aggregates, and each cube would have a large number of dimensions (~150) and very few measures (number of records, length of time). I have been told that this approach will let us answer any question we ask of it, yet we do not have many repeated business questions - mostly we need to list the raw records out.
Is OLAP really a logical way to go here, or will the cubes take too long to process and limit the level of access to the data that the users require?

What are the real-time compute solutions that can take raw semistructured data as input?

Are there any technologies that can take raw semi-structured, schema-less big data input (say from HDFS or S3), perform near-real-time computation on it, and generate output that can be queried or plugged in to BI tools?
If not, is anyone at least working on it for release in the next year or two?
There are some solutions with big semistructured input and queried output, but they are usually
unique
expensive
secret enough
If you are able to avoid direct computation using neural networks or expert systems, you will be close enough to a low-latency system. All you need is a team of brilliant mathematicians to build a model of your problem, a team of programmers to realize it in code, and some cash to buy servers and the input/output channels they need.
Have you taken a look at Splunk? We use it to analyze Windows Event Logs and Splunk does an excellent job indexing this information to allow for fast querying of any string that appears in the data.

Would MongoDB be a good fit for my industry?

I work in the promotional products industry. We sell pretty much anything that you can print, embroider, engrave, or use any other method to customize. Popular products are pens, mugs, shirts, caps, etc. Because we have such a large variety of products, storing information about them, including all the possible product options, decoration options, and all associated extra charges, gets extremely complicated. So much so that although many have tried, no one has been able to provide industry product data in such a way that you could algorithmically turn the data into an eCommerce store without some degree of data massaging. It seems near impossible to store this information properly in a relational database. I am curious whether MongoDB, or any other NoSQL option, would allow me to model the information in a way that makes it easier to store and manipulate our product information than an RDBMS like MySQL. The company I work for is over 100 years old and has been using DB2 on an AS400 for many years. I'll need some good reasons to convince them to go with a non-relational DB solution.
A common example product used in our industry is the Bic Clic Stic Pen, which has over 20 color options each for barrel and trim colors. There are even more colors to choose from for silkscreen decoration. Then you can choose additional options for what type of ink to use. There are multiple options for packaging. After all that is selected, you have an additional option for rush processing. All of these options may or may not have additional charges, which can be based on how many pens you order or how many colors are in your decoration. Pricing is usually based on quantity, so ordering 250 pens would cost more per pen than ordering 1000. Similarly, the extra charge for special ink would be cheaper per pen when you order 1000 than when you order 250.
Without wanting to sound harsh, this has the ring of a silver bullet question.
You have an inherently complex business domain. It's not clear to me that a different way of storing your data will have any impact on that complexity - storing documents rather than relational data probably doesn't make it easier to price your pen at $0.02 less if the customer orders more than 250.
I'd recommend focussing on the business domain, and not worrying too much about the storage mechanism. I'm a big fan of Domain Driven Design - this sounds like a perfect case for that approach.
Using a document database won't solve your problem completely, but it probably can help.
If your documents represent the options available on a product and an order for that product, in most cases you will be accessing the document as a whole - it's nothing you can't do with SQL, but a good fit for a document database. Since the structure of the documents is flexible, it is relatively easy to define an object within the document as a complex type to define a particular option or rule without changing the database.
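For example, a single self-contained product document might look roughly like this (every field name, option list and price is invented - a sketch of the idea, not a schema recommendation):

```python
# A sketch of one self-contained product document, e.g. for MongoDB.
pen = {
    "name": "Clic Stic Pen",
    "options": {
        "barrel_color": ["blue", "red", "white"],   # trimmed; 20+ in reality
        "trim_color":   ["blue", "red", "white"],
        "ink":          ["standard", "gel"],
        "packaging":    ["bulk", "gift_box"],
    },
    "decorations": [
        {"method": "silkscreen", "max_colors": 4,
         "extra_charge_per_color": [          # charge depends on quantity
             {"min_qty": 250,  "charge": 0.10},
             {"min_qty": 1000, "charge": 0.06},
         ]},
    ],
    "price_breaks": [                          # quantity-based unit pricing
        {"min_qty": 250,  "unit_price": 0.89},
        {"min_qty": 1000, "unit_price": 0.65},
    ],
    "rush_processing": {"available": True, "charge": 25.00},
}

# With pymongo this whole structure is stored and fetched as one object:
#   db.products.insert_one(pen)
#   db.products.find_one({"name": "Clic Stic Pen"})
```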
However, that only helps with the data - your real problem is on the UI side. The two documents together map directly to the order form, but whatever method you use to define the options/rules, some of the products are going to end up with extremely complex settings pages.
Yes, MongoDB is what you need. It doesn't enforce a strict document structure, so you'll be able to create the set of models you need and embed them into your product page in any order and combination you need. It's actually possible to work with this data without describing the real model fields directly, so I (for example) can use fields my Rails application doesn't know about at all.
MongoDB is also extremely easy to set up for replication and sharding. It also supports the GridFS virtual filesystem, so you can store images for your products alongside the documents that describe them and manipulate them easily as a single object.
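A rough sketch of that, assuming pymongo and a hypothetical "shop" database (file and collection names are made up):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["shop"]        # hypothetical database name
fs = gridfs.GridFS(db)

# Store a product image in GridFS and reference it from the product document.
with open("pen.jpg", "rb") as image:
    image_id = fs.put(image, filename="pen.jpg")

db.products.update_one({"name": "Clic Stic Pen"},
                       {"$set": {"image_id": image_id}})
```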
You should definitely give it a try.
UPD: Anyway, it would be good to keep your RDBMS for financial data and number crunching, like grouped reports for sales analysis and so on. NoSQL databases aren't very good at this.