I am working on a real-time feed which gives me real-time data.
The number of records is up to 1 million per month, and I need to provide reports based on these records.
I chose MongoDB because it performs well at fetching records.
I am having trouble managing that data, because there are already about 12 million records.
Do I need to keep all the data month-wise?
Should I use a different collection per month?
There are a lot of select queries for analytics reports and the like.
It depends on how you want to use the data; that's up to you to decide. There is nothing wrong with a lot of data, you just need to limit your heavy queries with the same logic a cache uses (easier access, but less fresh). A common method is:
You have a "raw data" table which contains your millions of records. This table is very large but contains 'pure' data. You want to access this table as little as possible, as it will be slow.
The next table is less accurate and summarizes the information you need. In your case this could be a 'month_summary' which you create after a month ends. That way you still have the complete dataset, but also a small table with the relevant info (e.g. number of records, sumOfX, averageOfY, etc.). Your heavy query now runs once per month, and you can base your stats on it.
If you need data per week, you'd make a 'week_summary' table. Or if you need stats per day, make it per day; 365 entries per year is still a whole lot less than millions.
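For illustration, here is a minimal sketch of that 'month_summary' idea as a MongoDB aggregation driven from Python (pymongo). The collection and field names (raw_data, createdAt, x, y) are assumptions, and the $merge stage needs MongoDB 4.2 or newer; on older versions you can insert the aggregation result yourself.

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["feeds"]                               # hypothetical database name

def build_month_summary(year, month):
    """Aggregate one month of raw records into a single small summary document."""
    start = datetime(year, month, 1)
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)

    db.raw_data.aggregate([
        {"$match": {"createdAt": {"$gte": start, "$lt": end}}},
        {"$group": {
            "_id": {"year": year, "month": month},
            "numRecords": {"$sum": 1},
            "sumOfX": {"$sum": "$x"},      # replace x / y with your real fields
            "averageOfY": {"$avg": "$y"},
        }},
        # Upsert the summary document into the small reporting collection.
        {"$merge": {"into": "month_summary", "whenMatched": "replace"}},
    ])

# Run shortly after each month ends, e.g. from a cron job.
build_month_summary(2021, 3)

Reports then read month_summary, and the multi-million-record collection is only touched once per month.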
I wonder what would be the more efficient way to partition Parquet data when storing it in S3.
In my cluster I currently have a folder data with a huge number of Parquet files. I would like to change the way I save data in order to simplify data retrieval.
I have two options. One option is to store the Parquet files under the following folder path:
PARTITION_YEAR=2017/PARTITION_MONTH=07/PARTITION_DAY=12/my-parquet-files-go-here
or
PARTITION_DATE=20170712/my-parquet-files-go-here
Which of these two alternatives would be more recommended if I need to read a range of 7 days in Spark using spark.read.parquet?
Which alternative would be faster?
Since in both cases you are storing data with daily granularity, given an appropriate implementation at read time the two should be equivalent, but the former allows you to define finer-grained pruning based on your needs: you can easily get data for a whole year, a single month or a single day (or a combination of those) with well-supported glob patterns.
I'd encourage you to use the former solution to be more flexible; for your current use case the efficiency doesn't change significantly.
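For example, a 7-day read that straddles a month boundary could look roughly like this in PySpark (the bucket name and the s3a scheme are assumptions; basePath keeps the partition columns in the schema when you point directly at sub-folders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = "s3a://my-bucket/data"  # hypothetical bucket/prefix

# Seven days spanning July and August 2017, expressed as glob patterns.
paths = [
    f"{base}/PARTITION_YEAR=2017/PARTITION_MONTH=07/PARTITION_DAY={{29,30,31}}",
    f"{base}/PARTITION_YEAR=2017/PARTITION_MONTH=08/PARTITION_DAY={{01,02,03,04}}",
]

df = (spark.read
      .option("basePath", base)  # so PARTITION_YEAR/MONTH/DAY stay as columns
      .parquet(*paths))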
I would strongly advise against having many, many folders in your S3 store. Why? Spark uses S3 connectors which mimic directory trees through multiple HTTP requests: the deeper and wider the tree, the more inefficient this becomes, not least because AWS S3 throttles HTTP requests.
The year/month/day naming scheme works well with Hive and Spark, but if you go into too much depth (by day, by hour) then you may experience worse performance than if you didn't.
The answer is quite simple... it depends on how you will query the data!
If you are querying purely on a range of days, then the second option is the easiest:
SELECT ...
FROM table
WHERE date BETWEEN ... AND ...
If you partition by month and day, you'd have to write a WHERE clause that uses both fields, which gets awkward when the desired 7-day range straddles two months (e.g. 2018-05-27 to 2018-06-02):
SELECT ...
FROM table
WHERE (month = 5 AND day BETWEEN 27 AND 31) OR
      (month = 6 AND day BETWEEN 1 AND 2)
This is how you'd make month/day partitions work, but it is not very convenient to code against.
Thus, if you are using a WHERE on the date, then partition by date!
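For completeness, with the single PARTITION_DATE layout the same 7-day read in Spark is just a range filter on the partition column, which is answered by partition pruning (the bucket name is an assumption, and Spark infers the partition values as integers here):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3a://my-bucket/data")

# Only the seven matching PARTITION_DATE=... folders are listed and read.
week = df.where(F.col("PARTITION_DATE").between(20170706, 20170712))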
I have a data set that consists of 1500 columns and 6500 rows, and I am trying to figure out the best way to shape the data for web-based, user-interactive visualizations.
What I am trying to do is make the data more interactive and create an admin console that allows anyone to filter the data visually.
The front end could potentially be based on Crossfilter, D3 and DC.js and give the user basically endless filtering possibilities (date, value, country). In addition there will be some predefined views like top and bottom 10 values.
I have seen and tested some great examples like this one, but after testing it did not really fit the large number of columns I had, and it was based on a full JSON dump from MongoDB. This resulted in very long loading times and a loss of full interactivity with the data.
So in the end my question is: what is the best approach (starting with normalization) to getting the data shaped the right way so it can be manipulated from a front end? Changing the number of columns is a priority.
A quick look at the piece of data that you shared suggests that the dataset is highly denormalized. To allow for querying and visualization from a database backend I would suggest normalizing it. This is no small bit of software work, but in the end you will have relational data, which is much easier to deal with.
It's hard to guess where you would start, but from the bit of data you showed there would be a country table, an event table of some sort, and probably some tables of enumerated values.
In any case you will have a hard time finding a database engine that allows that many columns. The row count is not a problem. I think in the end you will want a database with dozens of tables.
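As a first step toward fewer columns, a common trick is to reshape the wide table into a long (tidy) one before worrying about full normalization. A minimal sketch in Python/pandas; the file name, the 'country' id column and the assumption that the other columns are dated measurements are all hypothetical:

import pandas as pd

wide = pd.read_csv("dataset.csv")  # hypothetical file: 6500 rows x ~1500 columns

# Wide -> long: every (row, column) cell becomes one (country, date, value) row.
tidy = wide.melt(id_vars=["country"], var_name="date", value_name="value")

# ~6500 x 1500 cells becomes roughly 10 million narrow rows, which is easy to
# filter, aggregate or pre-summarize server-side before sending to Crossfilter.
print(tidy.head())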
Good sirs.
I've just started planning a new project, and it seems that I should stick with a relational database (even though I want to play with Mongo). Tell me if I'm mistaken!
There will be box models, each of which can contain hundreds to thousands of items.
At any time, the user can move an item to another box.
For example, using some Railsy pseudocode...
item = Item.find(5676)
item.box_id        # => 24
item.update(box_id: 25)
item.box_id        # => 25
This sounds like a simple SQL join table to me, but like an expensive array-manipulation operation for MongoDB.
Or is removing an object out of one (huge) array and inserting it in another (huge) array not a big problem for mongo?
Thanks for any wisdom. I've only just started with mongo.
If you want to use big arrays, stay away from MongoDB. I say this from personal experience. There are two big problems with arrays. First, as an array grows, the document grows with it, and eventually the document needs to be moved on disk. That is a very, very slow operation. Second, if you need to scan an array to get to the 10,000th element, that will be very slow, because the 9,999 elements before it have to be checked first.
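If you do end up on MongoDB anyway, the usual way around both problems is to reference instead of embed: give each item a box_id field rather than keeping item ids in a big array on the box document. A minimal sketch with pymongo (collection and field names are made up):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["warehouse"]                           # hypothetical database name

db.items.insert_one({"_id": 5676, "box_id": 24})

# Moving an item is a single in-place field update: no array rewrite,
# no document growth, no relocation on disk.
db.items.update_one({"_id": 5676}, {"$set": {"box_id": 25}})

# Listing a box's contents is an indexed lookup (create the index once).
db.items.create_index("box_id")
items_in_box_25 = list(db.items.find({"box_id": 25}))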
I want to create an app like Twitter. Now I have a question about this project's database architecture. I want to show each user's follower/following counts in his or her profile, like Twitter does, but I don't know whether I have to query the Followers/Followings table/collection every time, or whether these values can be two small separate fields in the user record. If I query every time, it will certainly take a lot of time and add database overhead. On the other hand, if I store them in two fields per user, then whenever there is a change I have to do two actions: modify the Followers or Followings table and also those two fields in the user record. My database will be huge, with a very large amount of data.
Which approach is good and standard?
Well, if you want to know what is right, there is only one answer.
Each of the separate fields in the user record contains derived data (data that can be easily derived via a query). Therefore it constitutes a duplication. Therefore it fails Normalisation.
The consequence of failed Normalisation is, you have an Update Anomaly. You no longer have One Fact in One Place, you have one fact in two places. And you have to update them every time one fact changes, every time the Followers/Followed per User changes. Within a Transaction.
That isn't a "trade-off" against performance concerns, that is a crime. When the fact in two places gets "out of synch", your crimes will be exposed. You will have to re-visit the app and database and perform some hard labour to make amends. And you may have to do that several times. Until you remove the causative problem.
Performance
As for the load on the database, if your application is serious, and you expect to be in business next year, get a real SQL platform.
Population or load for this requirement is simply not an issue on a commercial platform. You always get what you pay for, so pay something of value, and get something of value.
Note that if you have millions of Users, that does not mean you have millions of Followers per User. Note that your files will be indexed, so you will not chase down 16 million Users to count 25 Followers, your index will allow you to identify 25 Followers in a maximum of 25 index rows, in very few pages. This kind of concern simply does not exist on a commercial platform, it is the concern of people with no platform.
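To make the "derive it with a query" point concrete, here is a minimal sketch using SQLite from Python; the follows table and its column names are assumptions, not a prescribed schema:

import sqlite3

con = sqlite3.connect("social.db")  # hypothetical database file
con.execute("""
    CREATE TABLE IF NOT EXISTS follows (
        follower_id INTEGER NOT NULL,
        followed_id INTEGER NOT NULL,
        PRIMARY KEY (follower_id, followed_id)
    )
""")
# The index is what makes the count cheap: counting one user's followers
# touches only that user's index entries, not millions of rows.
con.execute("CREATE INDEX IF NOT EXISTS idx_followed ON follows (followed_id)")

def follower_count(user_id):
    row = con.execute(
        "SELECT COUNT(*) FROM follows WHERE followed_id = ?", (user_id,)
    ).fetchone()
    return row[0]

There is one fact in one place, so there is nothing to keep in sync.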
Well, it depends: who is it for?
If it's for your users, so they can see how many followers they have, I would make this Twitter API call only when a user logs in to your service.
If for some reason it must be done for all users, I think the best way would be to make this followers-count API call, for example, once an hour, every second hour, or just daily. This could be achieved by a script that runs in cron.
Do you really need the followers, or just the follower count? Or both?
If both, you can request a Twitter user's followers and limit the request to 100 (if your cron runs every minute to every fifteen minutes). Then loop those follower IDs against your database and keep inserting them until there is a match, as sketched below. Twitter returns the newest follower IDs first by default, so this is possible at the moment.
Just remember that you can make only 15 requests per user token against the Twitter API when requesting followers. This limit can vary between endpoints.
It's worth mentioning that I assumed you are fetching only follower IDs; those you can get 5000 at a time. If you want to request full follower objects, the limit is only 200 per request.
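A rough sketch of that insert-until-match loop, using MongoDB from Python; fetch_newest_follower_ids stands in for whatever Twitter client call you use (it is assumed to return IDs newest-first), and all collection and field names are made up:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["app"]                                 # hypothetical database name

def sync_followers(user_id, fetch_newest_follower_ids):
    """Insert newly gained followers until we hit one we already know about."""
    for follower_id in fetch_newest_follower_ids(user_id, limit=100):
        known = db.followers.find_one(
            {"user_id": user_id, "follower_id": follower_id}
        )
        if known:
            break  # everything older than this was stored on a previous run
        db.followers.insert_one({"user_id": user_id, "follower_id": follower_id})

    # Cache the count on the user record so profile pages never scan followers.
    count = db.followers.count_documents({"user_id": user_id})
    db.users.update_one({"_id": user_id}, {"$set": {"followers_count": count}})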
Hope this helps :D
I want to know about the usage of hashing in searching. For example, do Google or Yahoo use hash algorithms? Do big companies use hashing?
Yes. Refer to the book PageRank and Beyond; there you will find that Google uses hashing. Hashing keeps your complexity low in operations such as searching and adding. Consider a situation where you are building an online chat website and have to handle a million users. You could use linear search, which in the worst case takes around 1 million times the time to fetch one element, so the user would have to wait a long time on the client side; on the other hand, you would save money because you are not using extra space. If you use hashing instead, a lookup takes roughly the time to fetch a single element, but the system will cost more because you have to pay for the extra storage (1 million records plus a good hash function). The challenge is to choose a hash function that causes as few collisions as possible when storing elements. Hashing is a big topic that I cannot explain in full here; refer to these links:
What is a good Hash Function?
http://en.wikipedia.org/wiki/Hash_function
http://www.cs.cmu.edu/~clo/www/CMU/DataStructures/Lessons/lesson11_2.htm
http://www.tutorialspoint.com/dbms/dbms_hashing.htm
http://www.internetlivestats.com/total-number-of-websites/
According to the last link above, there are roughly 1,156,000,000 websites. Let us assume it takes 1 millisecond to get one page from the database. In the worst case a linear search would take around 1,156,000,000 × 1 ms = 1,156,000 seconds, which is about 13 days; no user can wait that long for a search, so it cannot be done with a simple linear search. Google has its own complex (and largely undisclosed) algorithms (you can find more in the book above) and its own servers that store the hashed records, from which results are fetched using hashing functions. I don't know exactly how Google works; what I do know is that Google uses probability a lot. You can read more about how Google works here: http://langvillea.people.cofc.edu/UIUC.pdf
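To see the difference hashing makes in practice, here is a toy Python comparison of a linear scan versus a hash-based lookup over a million synthetic user records:

import time

users = [f"user{i}" for i in range(1_000_000)]  # synthetic records
user_set = set(users)                           # hash-based structure, built once

target = "user999999"  # worst case for the linear scan: it sits at the very end

t0 = time.perf_counter()
found_linear = target in users      # O(n): walks up to a million entries
t1 = time.perf_counter()
found_hashed = target in user_set   # O(1) on average: one hash plus a bucket probe
t2 = time.perf_counter()

print(f"linear search: {t1 - t0:.6f}s, hash lookup: {t2 - t1:.6f}s")

On typical hardware the hash lookup is several orders of magnitude faster, and the gap grows with the number of records.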