How do you find clusters based on association rules in data mining/data science?

Suppose you have a database
Transaction-ID Item_list
1 [bread, butter, milk, diaper]
2 [bread, butter]
3 [coke, diaper]
4 [chips, beer, bread]
In this case, after finding the association rules, how do we find similar items based on those rules (that is, cluster similar items)?
Hence, how do we cluster items such as "bread", "butter", and "milk" onto the same aisle, and "chips", "beer", and "coke" together onto another aisle? How do we find the item clusters using the association rules? How do we read or interpret the association rules?

How is this different from a cluster?
1 [bread, butter, milk, diaper]
Every itemset is a cluster of similar items.

Based on similarity scores, we can find similar transactions by setting a threshold similarity score.
But finding similar transactions takes more space and time.
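To make this concrete, here is a minimal sketch, using only the Python standard library, of one way to read association strength out of the toy database above and turn it into item clusters: compute pairwise support and lift (a standard association-rule metric) and put two items in the same cluster whenever their lift clears a threshold. The threshold is an arbitrary choice for illustration; in practice you would mine full rules with a library such as mlxtend and tune the metric.

from itertools import combinations
from collections import Counter

# The toy transaction database from the question.
transactions = [
    {"bread", "butter", "milk", "diaper"},
    {"bread", "butter"},
    {"coke", "diaper"},
    {"chips", "beer", "bread"},
]
n = len(transactions)

item_count = Counter(i for t in transactions for i in t)
pair_count = Counter(frozenset(p) for t in transactions
                     for p in combinations(sorted(t), 2))

def lift(a, b):
    # lift(A, B) = P(A and B) / (P(A) * P(B)); lift > 1 means the items
    # co-occur more often than expected if they were independent.
    return (pair_count[frozenset((a, b))] / n) / (
        (item_count[a] / n) * (item_count[b] / n))

# Link every pair whose lift exceeds the threshold, then read the
# connected components as item clusters (simple union-find).
parent = {item: item for item in item_count}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

LIFT_THRESHOLD = 1.5  # arbitrary; raise or lower it to split or merge clusters
for a, b in combinations(sorted(item_count), 2):
    if lift(a, b) > LIFT_THRESHOLD:
        parent[find(a)] = find(b)

clusters = {}
for item in item_count:
    clusters.setdefault(find(item), set()).add(item)
print(list(clusters.values()))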

Related

Merge node list and edge list in Gephi

I have a question about Gephi.
I'm doing a project for my school where I analyse a website using Python. I decided to look into the moviesdatabase.org website. From the https://www.themoviedb.org website I extracted the most popular movies, and from these I built the nodes. For the edges I used genre as the relationship (e.g. two films that share a genre, for example horror, are connected). Note: for simplicity I only considered the first page of the most popular movies, so I have 20 nodes. In the node list I have 20 nodes and 0 edges. In the edge list I have 15 nodes and 150 edges.
The problem is that when I import the node list first and then the edge list, Gephi doesn't match the nodes in the edge list against the node list.
Looking at the graph and at the Data Laboratory, it is as if the node list is separate from the edge list. Why is this happening?
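For what it's worth, Gephi joins the two spreadsheets on the identifiers: the Source and Target columns of the edge list must contain exactly the same values as the Id column of the node list, and the edge list is normally imported with "Append to existing workspace" so it lands on top of the already-imported nodes. As a rough sketch of generating two matching CSV files (the movie titles and genres below are placeholders, not real TMDB data):

import csv
from itertools import combinations

movies = {
    "The Example": ["Horror", "Thriller"],
    "Another Film": ["Horror"],
    "Third Movie": ["Comedy"],
}

# Node list: Gephi expects an "Id" column (and optionally "Label").
with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    for title in movies:
        writer.writerow([title, title])

# Edge list: "Source" and "Target" must reuse the same Ids as above;
# two movies are connected when they share at least one genre.
with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Type"])
    for a, b in combinations(movies, 2):
        if set(movies[a]) & set(movies[b]):
            writer.writerow([a, b, "Undirected"])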

How do I design a data warehouse model that allows me to dynamically query for total action count, unique user count, and a count of total users

I'm currently facing a problem where I'm trying to create a login-utilization report for a web application. To describe the report a bit: users in our system are tagged with different metadata. For example, I could be tagged with "New York City" and "Software Engineer", while other users may be tagged with different locations and job titles. The utilization report is essentially the following:
Time period (quarterly)
Total number of logins
Unique logins
Total users
"Engagement percentage" (Unique logins / Total users)
The catch is, the report needs to be a bit dynamic. I need to be able to apply any combination of job titles and locations and have each of the numbers reflect the applied metadata. The time period also needs to be easily adjustable to support weekly, monthly, and yearly breakdowns as well. Ideally, I can create a view in Redshift that allows our BI software users to run this report whenever they see fit.
My question is, what is an ideal strategy to design a data model to support this report? I currently have an atomic fact table that contains all logins with this schema:
User ID
Login ID
Login Timestamp
Job Title Group ID (MD5 hash of job titles, to support multi-valued attributes)
Location Group ID (MD5 hash of locations, to support multi-valued attributes)
The fact table allows me to easily write a query to aggregate on total (count of login id) and unique (distinct count of user id).
How can I supplement the data I have to include a count of total users? Is what I currently have the best approach?
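For illustration, here is a rough sketch of that aggregation in pandas, with a hypothetical user dimension supplying the total-user denominator; the table and column names are placeholders, not the actual schema:

import pandas as pd

# fact_logins: one row per login. dim_users: one row per user, assumed
# to be already filtered down to the selected job titles / locations.
fact_logins = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "login_ts": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-04-02"]),
})
dim_users = pd.DataFrame({"user_id": [1, 2, 3, 4]})

period = fact_logins["login_ts"].dt.to_period("Q")  # "W"/"M"/"Y" also work
report = fact_logins.groupby(period).agg(
    total_logins=("user_id", "size"),
    unique_logins=("user_id", "nunique"),
)
# Total users comes from the user dimension, not the login fact table,
# so that users with zero logins in a period are still counted.
report["total_users"] = len(dim_users)
report["engagement_pct"] = report["unique_logins"] / report["total_users"]
print(report)

The same shape translates to a Redshift view: a GROUP BY over DATE_TRUNC('quarter', login_timestamp) on the fact table, combined with a filtered user dimension for the denominator.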
Hierarchical, fixed-depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you’ve spent most of your career designing entity-relationship models for transaction processing systems, you’ll need to resist your instinctive tendency to normalize or snowflake a M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling.
It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll-up attributes are frequently changing. However, using the fact table to resolve M:1 relationships should be done sparingly.
In your case, I recommend the following design as a solution:

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here knows of one.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way, I have multiple spellings of various permutations of a parent-level variable (fruits or vegetables in this example) and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records with multiple spellings and abbreviated spellings (e.g. apple, appls) and I want to group them together in some way, in this example by fruits and vegetables.
Clustering would be challenging, since each entry is most often one or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table by hand (not likely with millions of rows), is there any approach I can take to this problem?
You will first need to solve the spelling problem, unless you have Google-scale data that would let you learn spelling correction from Google-scale statistics.
Then you will still have the problem that "apple" could be a fruit or a computer, and that "apple" and "Granny Smith" will look completely different. Your best bet at this second stage is something like word2vec trained on massive data. Then you get high-dimensional word vectors, and you can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.
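As a concrete starting point for the spelling step, here is a minimal sketch using only the standard library: fuzzy-match each noisy "type" value against a small hand-curated vocabulary of canonical terms and map it to its parent class. The vocabulary and the cutoff are assumptions for illustration; values that don't match anything clearly are left unresolved.

from difflib import get_close_matches

# Hypothetical canonical vocabulary mapping terms to parent classes.
canonical = {
    "apple": "fruits", "orange": "fruits", "pear": "fruits",
    "cauliflower": "vegetables", "broccoli": "vegetables",
    "leek": "vegetables",
}

def normalise(raw_type):
    # Map a noisy 'type' value to its parent class, or None if unsure.
    token = raw_type.strip().lower()
    match = get_close_matches(token, canonical.keys(), n=1, cutoff=0.6)
    return canonical[match[0]] if match else None

for value in ["aple", "appls", "ornge", "colifower", "brocli", "veg"]:
    print(value, "->", normalise(value))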

Introduction to object databases

I'm trying to understand the idea of NoSQL databases, or to be more precise, the concept behind the Neo4j graph database. I have experience with SQL databases (MySQL, MS SQL), but the limitations of managing hierarchical data made me expand my knowledge. Now I have some questions and I can't find their answers (maybe I don't know what to search for).
Imagine we have a list of the countries of the world. Each country has its GDP for every year, and each country's GDP is calculated by different sources: the World Bank, its own government, the CIA, etc. What's the best way to organise the data in this case?
The simplest thing that came to mind is to have a node like this (the values are imaginary):
China:
GDPByWorldBank2012: 999,
GDPByCIA2011: 994,
GDPByGovernment2012: 1102,
In relational database, I would split the data in three tables: Countries, Sources and Values, where in Values I would have value of GDP, year, id of the country and id of the source.
Another thing that came to mind is to create nodes for the CIA and the World Bank, but a Government node looks really weird. Anyway, the idea is to have relationships (valueOfGDP):
CIA -> valueOfGDP - {year: 2011, value: 994} -> China
World Bank -> valueOfGDP - {year: 2012, value: 999} -> China
This looks pretty weird to me. What's more, what happens when we add the values for all the years from one source? Would we just end up with multiple relationships?
I'm sorry if my questions are too dumb, and I would be happy if someone could explain this to me or show me what book/article to read.
Thanks in advance. :)
Your questions are very legit, and you're not the only one who has difficulty grasping graph modelling at first ;)
It is always easier to start by thinking about the questions you want to answer with your data before modelling it up front.
Let's imagine you want to retrieve the GDP for 2012 as computed by the CIA, for all countries.
A simple way to achieve this is to label the country nodes uniformly and give each one a name attribute, which obviously holds the country name.
Moreover, the CIA, the World Bank, and governments are all "sources" in this domain, so let's label them uniformly as well.
For instance, that could give something like:
(ORGANIZATION {name: "CIA"})-[:HAS_COMPUTED_GDP {year: 2011, value: 994}]->(COUNTRY {name: "China"})
With Cypher Query Language, following this model, you would execute the following query:
START cia = node:nodes(name = "CIA")
MATCH cia-[gdp:HAS_COMPUTED_GDP]->(country)
WHERE gdp.year = 2012
RETURN cia, country, gdp
In this query, I used an index lookup as a starting point (rather than node IDs, which are an internal technical notion that shouldn't be relied upon) to retrieve the CIA node by name, matched the relevant subgraph, and finally returned the CIA node, the GDP relationships, and their linked countries matching the input constraints.
Although Neo4J is totally schemaless, this does not mean you should necessarily have a totally flexible data model. Having a little structure will always help to make your queries or traversals easier to read.
If you're not familiar with Cypher Query Language (which is not the only way to read or write data into the graph), have a look at the excellent documentation of Neo4J (Cypher: http://docs.neo4j.org/chunked/stable/cypher-query-lang.html, complete: http://docs.neo4j.org/chunked/stable/index.html) and try some queries there: http://console.neo4j.org/!
And to answer your second question: if you want to add another year of GDP computations, it just boils down to adding new "HAS_COMPUTED_GDP" relationships between the organizations and the countries, no more, no less.
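For example, inserting one more computed value from Python with the official Neo4j driver could look like the sketch below (current Cypher syntax; the connection details are placeholders, and the labels and property names simply follow the model above):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

query = """
MERGE (org:ORGANIZATION {name: $org})
MERGE (country:COUNTRY {name: $country})
CREATE (org)-[:HAS_COMPUTED_GDP {year: $year, value: $value}]->(country)
"""

with driver.session() as session:
    # One new relationship per (source, country, year) computation.
    session.run(query, org="CIA", country="China", year=2013, value=1050)

driver.close()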
Hope it helps :)

Partition Lucene Index by ID across multiple indexes

I am trying to put together my Lucene search solution, and I'm having trouble figuring out how to start.
On my site, I want one search to span 5 different types of objects in my model.
I want my results to come back as one list, ordered by best match first, with a way to differentiate the type so I can show the data appropriately.
Our system is split out into what we call sites. I want to index the 5 different model objects by site. Searching will always be done by site.
I'm not sure where to begin indexing this system for optimal performance. I'm also not sure how best to implement the search for this setup. Any advice, articles, and examples are greatly appreciated.
EDIT:
Since it has been said this is too broad,
Let's say I have 3 sites: Site 1, Site 2, and Site 3.
Let's say I am indexing Dogs, Cats, and Hamsters. A record of each of these types is linked to a site.
So, for instance, my data might be (Type, Name, SiteId)
Dog, "Fido" 1
Cat, "Sprinkles", 2
Hamster, "Sprinkles", 2
Cat, "Mr. Pretty", 3
Cat, "Mr. Pretty 2", 3
So, when I do a search for "Mr. Pretty", I want to target a specific Site Id. If I search against site id 1, I'll get 0 results. If I search against site id 3, I'll get:
Mr. Pretty
Mr. Pretty 2
And if I search for "Sprinkles" on Site 2, I will know that one result is a cat and the other result is a hamster.
What is the best way I can go about achieving this sort of search index?
As goalie7960 suggested, you can add a "SiteID" field to each document and add a query term like siteid:3 to your query, in order to retrieve documents only from that site. You can also improve performance by creating and storing a Filter for each different site, so you can apply it to the corresponding queries.
Regarding the different types in the same index, you can use the same strategy: create a "type" field for each document with the corresponding type (maybe just an ID). Elasticsearch uses the same strategy to keep different, distinguishable types in the same index. Again, you can use Filters on the types to speed up queries (Elasticsearch does the same).
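Lucene itself is Java, but the idea translates directly to any inverted-index engine: index siteid and type as non-analyzed keyword fields and apply them as filters at query time. As a rough Python illustration of that pattern (using the Whoosh library rather than Lucene, with the toy data from the question):

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser
from whoosh.query import Term

# "name" is full-text searchable; "type" and "site_id" are exact-match
# keyword fields used for differentiation and filtering.
schema = Schema(name=TEXT(stored=True), type=ID(stored=True),
                site_id=ID(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(name="Fido", type="dog", site_id="1")
writer.add_document(name="Sprinkles", type="cat", site_id="2")
writer.add_document(name="Sprinkles", type="hamster", site_id="2")
writer.add_document(name="Mr. Pretty", type="cat", site_id="3")
writer.add_document(name="Mr. Pretty 2", type="cat", site_id="3")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("name", ix.schema).parse('"mr. pretty"')
    # Restrict the search to site 3, analogous to a siteid:3 filter.
    results = searcher.search(query, filter=Term("site_id", "3"))
    for hit in results:
        print(hit["name"], hit["type"])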