Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I want to know how hashing is used in searching. For example, do Google or Yahoo use hash algorithms? Do big companies use hashing?
Yes. Refer to the book PageRank and Beyond; there you will find that Google uses hashing. Hashing lowers the time complexity of operations such as searching and adding: an average-case lookup is O(1) instead of O(n).
Suppose you are building an online chat website and have to handle a million users. You could use linear search, which in the worst case takes about 1 million × the time to fetch one element, so the user has to wait a long time on the client side. In exchange you save money, because you use no extra space. If you use hashing instead, the time taken is about the time to fetch a single element, but the system costs more because you pay for the extra storage (a hash table over 1 million records). The challenge is to pick a good hash function that causes as few collisions as possible when storing elements.
Hashing is a big topic that I cannot explain in a short answer. Refer to these links:
What is a good Hash Function?
http://en.wikipedia.org/wiki/Hash_function
http://www.cs.cmu.edu/~clo/www/CMU/DataStructures/Lessons/lesson11_2.htm
http://www.tutorialspoint.com/dbms/dbms_hashing.htm
http://www.internetlivestats.com/total-number-of-websites/
Google links roughly a billion websites, about 1,156,000,000 according to the last link above. Let us assume it takes 1 millisecond to get one page from the database. In the worst case a linear scan would take around 1,156,000,000 × 1 ms = 1,156,000 s, or about 13.4 days. A user would have to wait almost two weeks for a single search, so this cannot be done with a simple linear search. Google has its own undisclosed, complex algorithms (you can find some of them in the book above). Google has its own servers to store the hashed records, from which records are fetched using hash functions. I don't have much idea about how Google works internally; what I know is that Google uses probability a lot. This book explains how Google works: http://langvillea.people.cofc.edu/UIUC.pdf
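The linear-versus-hash trade-off in this answer can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout, the names `linear_search` and `build_index`, and the scaled-down user count are all made up for the example, and a Python `dict` stands in for whatever hash structure a real system would use.

```python
import time

USER_COUNT = 100_000  # assumption: scaled down from the million users in the text


def linear_search(records, user_id):
    """Scan every record until the id matches: O(n) in the worst case."""
    for record in records:
        if record["id"] == user_id:
            return record
    return None


def build_index(records):
    """Build a hash table keyed by id; this is the extra storage being paid for."""
    return {record["id"]: record for record in records}


records = [{"id": i, "name": f"user{i}"} for i in range(USER_COUNT)]
index = build_index(records)
target = USER_COUNT - 1  # worst case for the linear scan: the last record

start = time.perf_counter()
found_linear = linear_search(records, target)
linear_seconds = time.perf_counter() - start

start = time.perf_counter()
found_hashed = index.get(target)  # average-case O(1) lookup
hashed_seconds = time.perf_counter() - start

print(f"linear scan: {linear_seconds:.6f} s, hash lookup: {hashed_seconds:.6f} s")
```

Both lookups return the same record; only the time (and the memory spent on `index`) differs.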
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am working on a real time feed which gives me real time data.
The number of records is up to 1 million per month, and I need to provide reports based on these records.
I chose MongoDB because it performs well when fetching records.
I am having trouble managing the data because there are 12 million records.
Do I need to store the data month by month?
Should I use a different collection per month?
There are a lot of select queries for analytics reports and so on.
It depends on how you want to use the data; that's up to you to decide. There is nothing wrong with a lot of data, you just need to limit your heavy queries, using the same logic as a cache (easier access, but less fresh). A common method is:
You have a "raw data" table which contains your millions of records. This table is very large, but contains 'pure' data. You want to access this table as little as possible as it'll be slow.
The next table is less detailed and summarizes the information you need. In your case this could be a 'month_summary' table which you create after a month ends. That way you still have the complete dataset, but also a small table with relevant info (e.g. num lines, sumOfX, averageOfY, etc). Your heavy query now runs once per month and you can base your stats on this.
If you need data per week, say, you'd make a 'week_summary' table. Or if you need stats per day, you make one per day; 365 entries per year is still a whole lot less than millions.
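As a rough sketch of the roll-up described above, the following Python builds one summary row per month from a list of raw records. The field names (`timestamp`, `x`), the function `summarize_by_month`, and the sample data are assumptions for illustration; in practice the same aggregation would run inside the database once per month rather than in application memory.

```python
from collections import defaultdict
from datetime import datetime


def summarize_by_month(raw_records):
    """Roll raw records up into one small summary row per calendar month."""
    totals = defaultdict(lambda: {"num_lines": 0, "sum_of_x": 0.0})
    for record in raw_records:
        month = record["timestamp"].strftime("%Y-%m")
        totals[month]["num_lines"] += 1
        totals[month]["sum_of_x"] += record["x"]

    summary = {}
    for month, row in totals.items():
        # The average is derived once here so reports never touch the raw data.
        summary[month] = {**row, "average_of_x": row["sum_of_x"] / row["num_lines"]}
    return summary


# Hypothetical raw data standing in for the large "raw data" table.
raw_data = [
    {"timestamp": datetime(2024, 1, 5), "x": 10.0},
    {"timestamp": datetime(2024, 1, 20), "x": 30.0},
    {"timestamp": datetime(2024, 2, 3), "x": 5.0},
]

month_summary = summarize_by_month(raw_data)
for month, row in sorted(month_summary.items()):
    print(month, row)
```

Reports then read the handful of rows in `month_summary` instead of scanning millions of raw records.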
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I have a data set that consists of 1500 columns and 6500 rows, and I am trying to figure out the best way to shape the data for web-based, interactive visualizations.
What I am trying to do is make the data more interactive and create an admin console that allows anyone to filter the data visually.
The front end could potentially be based on Crossfilter, D3 and DC.js and give the user basically endless filtering possibilities (date, value, country). In addition there will be some predefined views, like the top and bottom 10 values.
I have seen and tested some great examples like this one, but after testing, it did not really fit the large number of columns I had, and it was based on a full JSON dump from MongoDB. This resulted in very long loading times and a loss of full interactivity with the data.
So in the end my question is: what is the best approach (starting with normalization) to getting the data shaped in the right way so that it can be manipulated from a front end? Changing the number of columns is a priority.
A quick look at the piece of data that you shared suggests that the dataset is highly denormalized. To allow for querying and visualization from a database backend, I would suggest normalizing it. This is no small amount of software work, but in the end you will have relational data, which is much easier to deal with.
It's hard to guess where you would start, but from the bit of data you showed there would be a country table, an event table of some sort, and probably some tables of enumerated values.
In any case you will have a hard time finding a DB engine that allows that many columns. The row count is not a problem. I think in the end you will want a DB with dozens of tables.
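To make the idea of normalizing a very wide table concrete, here is a small Python sketch. The shared data is not shown in this thread, so every column name here (`country`, `date`, `metric_a`, `metric_b`) and the function `normalize` are purely hypothetical; the point is only the shape of the transformation, from one wide row per observation to a country table plus a narrow event table.

```python
def normalize(wide_rows, key_columns=("country", "date")):
    """Split wide rows into a country lookup table and a narrow event table."""
    countries = {}  # country name -> surrogate country_id
    events = []     # one row per (country, date, metric) instead of one column per metric
    for row in wide_rows:
        country_id = countries.setdefault(row["country"], len(countries) + 1)
        for column, value in row.items():
            if column in key_columns:
                continue
            events.append(
                {
                    "country_id": country_id,
                    "date": row["date"],
                    "metric": column,
                    "value": value,
                }
            )
    return countries, events


# Hypothetical stand-in for the 1500-column data set.
wide_rows = [
    {"country": "NL", "date": "2014-01-01", "metric_a": 1, "metric_b": 2},
    {"country": "NL", "date": "2014-01-02", "metric_a": 3, "metric_b": 4},
    {"country": "DE", "date": "2014-01-01", "metric_a": 5, "metric_b": 6},
]

countries, events = normalize(wide_rows)
print(countries)
print(len(events), "event rows")
```

The event table stays four columns wide no matter how many metrics exist, which is what lets the front end filter by date, value, or country without loading every column.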
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Good sirs.
I've just started planning a new project, and it seems that I should stick with a relational database (even though I want to play with Mongo). Tell me if I'm mistaken!
There will be box models, each of which can contain hundreds to thousands of items.
At any time, the user can move an item to another box.
For example, using some Railsy pseudocode...
item = Item.find(5676)
item.box_id # returns 24
item.update(box_id: 25)
item.box_id # returns 25
This sounds like a simple SQL join table to me, but an expensive array manipulation operation for MongoDB.
Or is removing an object from one (huge) array and inserting it into another (huge) array not a big problem for Mongo?
Thanks for any wisdom. I've only just started with mongo.
If you want to use big arrays, stay away from MongoDB. I speak from personal experience. There are two big problems with arrays. First, if they start to grow, the document grows and needs to be moved on disk, which is a very, very slow operation. Second, if you need to scan an array to get to the 10,000th element, that will be very slow, as it needs to check the 9,999 elements before it.
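The difference between the two layouts can be sketched in plain Python. This is not MongoDB or SQL code: the dictionaries below merely stand in for the two data models, and the names `move_embedded` and `move_referenced` are made up for the illustration.

```python
# Layout A (assumed): each box document embeds all of its items in one big array.
boxes_embedded = {24: list(range(10_000)), 25: []}


def move_embedded(boxes, item_id, src, dst):
    """Moving an item means scanning and rewriting the large source array."""
    boxes[src].remove(item_id)  # linear scan, then shifts every later element
    boxes[dst].append(item_id)


# Layout B (assumed): each item row stores the id of its box, like a foreign key.
item_box = {item_id: 24 for item_id in range(10_000)}


def move_referenced(items, item_id, dst):
    """Moving an item is a single small update on the item itself."""
    items[item_id] = dst


move_embedded(boxes_embedded, 5676, src=24, dst=25)
move_referenced(item_box, 5676, dst=25)

print(len(boxes_embedded[24]), len(boxes_embedded[25]))
print(item_box[5676])
```

With the reference stored on the item, the cost of a move does not depend on how many items a box holds, which is the behaviour the question's `item.update` pseudocode assumes.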
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm starting out to design a site that has some requirements that I've never really dealt with. Specifically, the data objects will have similar, but not exact, attributes. Yes, I could probably figure out most of the possible attributes and then just not populate the ones that don't make sense, and therefore keep a traditional "Relational" table and column design, but I'm thinking this might be a really good time to learn NoSQL.
In addition, the user will have 1, and only 1, textbox to search, and I will need to search all data objects and their attributes to find that string.
Ideally, I'd like to have the search return in order of "importance", meaning that if a match for the user's entered string is found in a "name" attribute, it would be returned as a higher confidence match than if the string was matched on a sub-attribute.
Anyone have any experience in this sort of situation? What have you tried that worked or didn't work? Am I wrong in thinking that this project is very well suited to a NoSQL type of database?
Stick with a traditional relational database such as MySQL or PostgreSQL. I would suggest sorting by relevance in your application code after obtaining the matching results. The size of your result set should influence your design choices, but if you will have fewer than 1-2k results, then just keep it simple and don't worry too much about optimization.
NoSQL is just a dumb key-value store, a persistent dictionary that can be shared across multiple application instances. It can solve scalability issues, but it introduces new ones, since you now just have a dumb data store. Relational databases have had years of performance tuning and do a great job.
I find NoSQL to be much better suited to storing state data, like a user's preferences or a cache. If you are analyzing the relationships between data, then you need a relational database.
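Sorting by relevance in application code could look something like the sketch below. The weights, the record layout (`name` plus an `attributes` dictionary), and the function names are assumptions chosen for the example; the matching rows would normally come from the database query rather than from an in-memory list.

```python
NAME_WEIGHT = 10       # assumption: a hit in "name" counts much more
ATTRIBUTE_WEIGHT = 1   # assumption: each matching sub-attribute counts a little


def score(obj, query):
    """Score one object: higher means a more confident match."""
    needle = query.lower()
    total = 0
    if needle in obj["name"].lower():
        total += NAME_WEIGHT
    for value in obj.get("attributes", {}).values():
        if needle in str(value).lower():
            total += ATTRIBUTE_WEIGHT
    return total


def search(objects, query):
    """Return matching objects, most relevant first."""
    scored = [(score(obj, query), obj) for obj in objects]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [obj for _, obj in matches]


# Hypothetical data objects with similar, but not identical, attributes.
objects = [
    {"name": "Desk", "attributes": {"material": "red oak"}},
    {"name": "Red lamp", "attributes": {"color": "red", "watts": 40}},
    {"name": "Chair", "attributes": {"color": "blue"}},
]

for result in search(objects, "red"):
    print(result["name"])
```

For a result set of a thousand or two rows this in-memory sort is cheap, which matches the advice to keep it simple.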
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I have 2 computers which are connected to each other via serial communication.
The main computer holds a DB (about 10K words) and runs at a 20 Hz rate.
I need real-time synchronization of the DB for the other computer: if data is added, deleted, or updated, I want the other computer to see or get the changes in real time.
If I transfer the whole DB periodically, it will take about 5 seconds to update the other side, which is not acceptable.
Does anyone have an idea?
As you said, the other computer has to get the changes (i.e. insert, delete, update) via the serial link.
The easiest way to do this (but maybe impossible, if you can't change certain things) is to extend the database-change methods (or, if that's not possible, every call) to send an insert/delete/update datagram with all required data over the serial link. The link protocol has to be robust against packet loss (i.e. error detection, retransmission, etc.).
On the other end you have to implement a semantically equivalent database where you replay all the received changes.
Of course you still have to synchronize the databases at startup/initialization or maybe periodically (e.g. once per day).
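A minimal sketch of such change datagrams, assuming a made-up frame layout (2-byte length, JSON payload, 4-byte CRC32) and a plain dictionary as the database on both sides. The function names and the frame format are illustrative assumptions, not an existing protocol; the serial port itself and the retransmission logic are left out, so a frame that fails the check is simply reported as `None` for the caller to request again.

```python
import json
import struct
import zlib


def encode_change(seq, op, key, value=None):
    """Pack one insert/update/delete into a frame: length + payload + CRC32."""
    payload = json.dumps({"seq": seq, "op": op, "key": key, "value": value}).encode("utf-8")
    header = struct.pack(">H", len(payload))
    crc = struct.pack(">I", zlib.crc32(header + payload))
    return header + payload + crc


def decode_change(frame):
    """Return the change as a dict, or None if the frame is truncated or corrupted."""
    if len(frame) < 6:
        return None
    (length,) = struct.unpack(">H", frame[:2])
    if len(frame) != 2 + length + 4:
        return None
    body, crc_bytes = frame[:-4], frame[-4:]
    (expected_crc,) = struct.unpack(">I", crc_bytes)
    if zlib.crc32(body) != expected_crc:
        return None  # error detected: the receiver would ask for a retransmission
    return json.loads(body[2:].decode("utf-8"))


def apply_change(replica, change):
    """Replay one received change on the semantically equivalent replica."""
    if change["op"] == "delete":
        replica.pop(change["key"], None)
    else:  # "insert" and "update" both set the value
        replica[change["key"]] = change["value"]


# Changes made on the main computer, in order.
frames = [
    encode_change(1, "insert", "a", 1),
    encode_change(2, "update", "a", 2),
    encode_change(3, "insert", "b", 3),
    encode_change(4, "delete", "b"),
]

replica = {}
for frame in frames:
    change = decode_change(frame)
    if change is not None:
        apply_change(replica, change)

print(replica)
```

Each frame carries only the changed entry, so the traffic per update is a few dozen bytes instead of the full 10K-word database; the `seq` field is there so a receiver could notice a missing frame.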