What is the best way to store data where one column has values that repeat ranging anywhere from 1-300+ times? - postgresql

I've used web scraping to grab approximately 10,000 movies and all their associated review pages URLs, and the next step for me is to grab every single one of those reviews so that I can get the overall positive/negative reviews using sentiment analysis.
I'm writing all this in Python and am using the Pandas library as my means of pre-processing and structuring all the data. Already I have around 36,000 rows containing the name of the movie in one column and the URLs in the other, with the movie name being repeated over and over again, and with the average reviews per page being 20 I'm looking at roughly 720,000 rows when all things are said and done.
This is for the final project of the college course I'm taking, and throughout my schooling I've come to fear data redundancy in databases. I will eventually be writing all of this to a PostgreSQL database so users can query any movie to get back the prediction, and I'm having a hard time overlooking the fact that these movie titles are being repeated so often.
I was wondering if there was a better way to go about this (which could also hopefully save me some processing time), any help would be greatly appreciated!
I feel like this is more of a direct question than a code issue, but if necessary I can provide any relevant code.

If the movie name is all the information you have about each movie, there is no redundancy (in the relational sense), since the name is the unique identifier.
You could save some space by having a separate movie table that contains an artificial numeric ID and the name and reference the ID from the main table, but that will make your queries more complicated and seems unnecessary for a small table like this.
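As a rough sketch of that separate-table layout (not something the question requires), the following uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table names `movie` and `review_page`, the sample rows, and the URLs are all made up for illustration, and the UNIQUE constraint assumes each title identifies exactly one movie.

```python
import sqlite3

# Hypothetical sample of the scraped (movie title, review URL) pairs.
rows = [
    ("Heat", "https://example.com/heat/reviews?page=1"),
    ("Heat", "https://example.com/heat/reviews?page=2"),
    ("Alien", "https://example.com/alien/reviews?page=1"),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.executescript("""
    CREATE TABLE movie (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL UNIQUE
    );
    CREATE TABLE review_page (
        id       INTEGER PRIMARY KEY,
        movie_id INTEGER NOT NULL REFERENCES movie(id),
        url      TEXT NOT NULL
    );
""")

for title, url in rows:
    # Store each title once; a repeated title is silently skipped.
    conn.execute("INSERT OR IGNORE INTO movie (title) VALUES (?)", (title,))
    movie_id = conn.execute(
        "SELECT id FROM movie WHERE title = ?", (title,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO review_page (movie_id, url) VALUES (?, ?)", (movie_id, url)
    )

# Getting the title back now needs a join, which is the added complexity.
pages_for_heat = conn.execute(
    """SELECT m.title, r.url
       FROM review_page AS r
       JOIN movie AS m ON m.id = r.movie_id
       WHERE m.title = ?
       ORDER BY r.id""",
    ("Heat",),
).fetchall()
```

Each title is stored once and the large table only carries a small integer, at the price of a join in every query that needs the name.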
What I would be more concerned about is whether the movie name is a good identifier at all: what if two movies have the same name? In this age of remakes, that is not a rarity.

Related

Postgres or MongoDB

I have to make a website. I have the choice between Postgres and MongoDB.
Case 1 : Postgres
Each page has one table and each table has only one row, for one page (each page is not structured like another)
I have a timelined page with medias (albums and videos)
So I have multiple medias (pictures, videos), and I display it as well by an album of pictures page and a videos page.
Therefore I have a medias table, linked with an album table (many-to-many), and a type column for determining if it's picture or video.
Case 2 : MongoDB
I'm completely new to NoSQL and I don't know how to store the data.
Problems that I see
Having only one row per table bothers me.
In the medias table, an album could end up containing videos, which I'd like to avoid. But if I split this table into a pictures table and a videos table, how can I fetch all the medias for the timelined page in a single call?
That's why I think it would be better for me to build the website with MongoDB.
What is the best solution, Postgres or MongoDB? What do I need to know if it's MongoDB? Or maybe I am missing something about Postgres.
It depends on time: if you don't have time to learn another technology, go straight ahead with the one you know and solve the issues with it.
If scalability is more important, then you'll have to take a deeper look at your architecture and know very well how to scale PostgreSQL.
PostgreSQL can handle JSON columns for unstructured data; I use them and they work great. I would have a single table with the unstructured data in a column named page_structure, so you'd have one single big indexed table instead of a lot of one-row tables.
It's relatively easy to query just what you want, so there is no need for separate tables for images and videos. To be more specific, you'll need to provide some schema.
I think you are coming to the right conclusion in choosing a NoSQL database, because you are not sure about the columns in a table for a page, and that's the reason you are creating different tables for different pages. I would still recommend keeping the columns somewhat consistent across records. Anyway, with MongoDB you can have different records (called documents in MongoDB) with different columns, based on the attributes of your page, in a single collection (a table in SQL). You can keep pictures and videos in separate collections if you want and wire them to your page collection using a foreign key such as page_id. Or you can query the page collection to get all the attributes, including an array containing the IDs of all videos or pictures, and use those IDs to retrieve the corresponding videos and pictures of a particular page, as illustrated below:
Collections
Pages [{id, name, ...., [video1_id, video2_id,..], [pic1_id, pic2_id, pic78_id,...]}, {id, name, ...., [video1_id, video2_id,..], [pic1_id, pic2_id, pic78_id,...]},...]
Videos [{video1_id, content,... }, {video2_id, content,...}]
Pictures [{pic1_id, content,... }, {pic2_id, content,...}]
I suggest you use the Clean Code architecture. Personally, I believe that you MUST separate your application logic from your data access functions so they can both work independently. Your code must not rely on your database. I prefer to code in a way that lets me migrate my data to any database I like and have everything still work.
Think about when your project gets big and you want to try caching to solve a problem. If your data access functions are not separated from your business logic, you cannot easily achieve that.
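A minimal sketch of that separation, assuming a made-up `PageStore` interface (none of these names come from the thread): the business function `page_title` only knows the interface, so a Postgres-backed store, a MongoDB-backed store, or a caching wrapper can be swapped in without touching it.

```python
from typing import Dict, Optional, Protocol


class PageStore(Protocol):
    """Hypothetical data-access interface; any database can sit behind it."""

    def get_page(self, page_id: str) -> Optional[dict]: ...
    def save_page(self, page_id: str, page: dict) -> None: ...


class InMemoryPageStore:
    """Stand-in backend; a Postgres or MongoDB version would expose the same methods."""

    def __init__(self) -> None:
        self._pages: Dict[str, dict] = {}

    def get_page(self, page_id: str) -> Optional[dict]:
        return self._pages.get(page_id)

    def save_page(self, page_id: str, page: dict) -> None:
        self._pages[page_id] = page


class CachedPageStore:
    """Adds a read cache around any PageStore without changing its callers."""

    def __init__(self, inner: PageStore) -> None:
        self._inner = inner
        self._cache: Dict[str, Optional[dict]] = {}

    def get_page(self, page_id: str) -> Optional[dict]:
        if page_id not in self._cache:
            self._cache[page_id] = self._inner.get_page(page_id)
        return self._cache[page_id]

    def save_page(self, page_id: str, page: dict) -> None:
        self._inner.save_page(page_id, page)
        self._cache[page_id] = page  # keep the cache in step with the backend


def page_title(store: PageStore, page_id: str) -> str:
    """Business logic: depends only on the interface, not on a database."""
    page = store.get_page(page_id)
    return page["title"] if page else "Not found"


store = CachedPageStore(InMemoryPageStore())
store.save_page("home", {"title": "Home"})
```

Calling `page_title(store, "home")` works the same whether `store` is the plain backend or the cached wrapper, which is the point of keeping the two layers apart.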
I agree with #espino316 about using the technology you are already familiar with.
I also agree with #Actung that you should consider learning a database like MongoDB, but in some training projects first, because for many projects NoSQL is the best way to go.
Just consider that you might find this out 2 years AFTER you deployed your website. Or the opposite: you go for MongoDB and then realize the best way to go was Postgres or, I don't know, MySQL, etc.
I think the best way to go is to make the migration easy for yourself.
all the best <3

Calculating price drop Apps or Apps gonna free - App Store

I am working on a website that displays all the apps from the App Store. I am getting the App Store data from their EPF Data Feeds through EPF Importer. In that database I get the pricing of each app for every store. There are dozens of rows in that data set, whose table structure is:
application_price
The retail price of an application.
Name | Key | Description
export_date | | The date this application was exported, in milliseconds since the UNIX Epoch.
application_id | Y | Foreign key to the application table.
retail_price | | Retail price of the application, or null if the application is not available.
currency_code | | The ISO3A currency code.
storefront_id | Y | Foreign key to the storefront table.
This is the table I get. My problem is that I can't find a way to calculate price reductions and newly free apps from this dataset. Does anyone have an idea how I can calculate this?
Any idea or answer will be highly appreciated.
I tried storing the previous data and the current data and then matching them. The problem is that the table itself is too large, and the comparison requires a JOIN that pushes the query execution time to more than an hour, which I cannot afford. There are approximately 60,000,000 rows in the table.
With these fields you can't directly determine price drops or new applications. You'll have to insert these in your own database, and determine the differences from there. In a relational database like MySQL this isn't too complex:
To determine which applications are new, you can add your own column "first_seen", and then query your database for all rows where first_seen is no more than a day old.
To calculate price drops you'll have to calculate the difference between the retail_price of the current import, and the previous import.
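The comparison described above could be sketched roughly as follows. This uses Python's sqlite3 module purely as a stand-in for whatever DBMS holds the data; the two table names (`price_previous`, `price_current`) and the sample rows are assumptions, and only the column names come from the feed description. The composite primary key doubles as the index the join needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per import; the primary key gives the join an index to use.
    CREATE TABLE price_previous (
        application_id INTEGER,
        storefront_id  INTEGER,
        retail_price   REAL,
        PRIMARY KEY (application_id, storefront_id)
    );
    CREATE TABLE price_current (
        application_id INTEGER,
        storefront_id  INTEGER,
        retail_price   REAL,
        PRIMARY KEY (application_id, storefront_id)
    );
""")

# Made-up sample data: (application_id, storefront_id, retail_price).
conn.executemany(
    "INSERT INTO price_previous VALUES (?, ?, ?)",
    [(1, 1, 2.99), (2, 1, 0.99), (3, 1, 4.99)],
)
conn.executemany(
    "INSERT INTO price_current VALUES (?, ?, ?)",
    [(1, 1, 0.99), (2, 1, 0.0), (3, 1, 4.99), (4, 1, 1.99)],
)

# Price drops: same app and storefront, lower price in the current import.
price_drops = conn.execute("""
    SELECT c.application_id, c.storefront_id, p.retail_price, c.retail_price
    FROM price_current AS c
    JOIN price_previous AS p
      ON p.application_id = c.application_id
     AND p.storefront_id  = c.storefront_id
    WHERE c.retail_price < p.retail_price
    ORDER BY c.application_id
""").fetchall()

# Newly free apps are the drops whose new price is zero.
newly_free = [row for row in price_drops if row[3] == 0]

# New applications: present in the current import, absent from the previous one.
new_apps = conn.execute("""
    SELECT c.application_id
    FROM price_current AS c
    LEFT JOIN price_previous AS p
      ON p.application_id = c.application_id
     AND p.storefront_id  = c.storefront_id
    WHERE p.application_id IS NULL
""").fetchall()
```

Rows whose retail_price is null in either import drop out of the price comparison, which matches the feed's meaning of "not available".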
Since you've edited your question, my edited answer:
It seems like you're having storage/performance issues, and you know what you want to achieve. To solve this you'll have to start measuring and debugging: with datasets this large you'll have to make sure you have the correct indexes. Profiling your queries should help you find out whether you do.
And probably your environment is "write once a day" and "read many times a minute" (I'm guessing you're creating a website). So you could speed up the frontend by processing the differences (price drops and new applications) on import, rather than when displaying them on the website.
If you still are unable to solve this, I suggest you open a more specific question, detailing your DBMS, queries, etc, so the real database administrators will be able to help you. 60 million rows are a lot, but with the correct indexes it should be no real trouble for a normal database system.
Compare the table with one you've downloaded the previous day, and note the differences.
Added:
For only 60 million items, and on a contemporary PC, you should be able to store a sorted array of the store id numbers and previous prices in memory, and do an array lookup faster than the data is arriving from the network feed. Mark any differences found and double-check them against the DB in post-processing.
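A small sketch of that in-memory lookup, assuming the previous import has been reduced to (application_id, storefront_id) keys with their prices. The sample data and function names are invented for illustration; at 60 million rows a compact structure such as the `array` module or NumPy would be a better fit than Python lists.

```python
from bisect import bisect_left

# Previous import, sorted by (application_id, storefront_id). Made-up data.
previous = sorted([((1, 1), 2.99), ((2, 1), 0.99), ((3, 1), 4.99)])
prev_keys = [key for key, _ in previous]
prev_prices = [price for _, price in previous]


def previous_price(key):
    """Binary-search the sorted key array; return None for unseen keys."""
    i = bisect_left(prev_keys, key)
    if i < len(prev_keys) and prev_keys[i] == key:
        return prev_prices[i]
    return None


def find_changes(feed):
    """Scan incoming rows once and collect those whose price changed."""
    changes = []
    for application_id, storefront_id, price in feed:
        old = previous_price((application_id, storefront_id))
        if old is not None and price is not None and price != old:
            changes.append((application_id, storefront_id, old, price))
    return changes


# Incoming rows as they arrive from the feed (again, made-up values).
incoming = [(1, 1, 0.99), (2, 1, 0.99), (3, 1, 5.99), (4, 1, 1.99)]
changes = find_changes(incoming)
```

Each lookup costs O(log n) and touches no database, so only the handful of flagged rows needs to be double-checked against the DB afterwards.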
Actually, I am also trying to work with this data, and I think the best approach for you is based on how Apple delivers it.
You have 2 types of data: full and incremental (updated daily). The incremental data is not nearly as big as the full set, so you can check only the records that were updated and insert them into another table to determine which prices have changed.
That gives you a list of records (app, song, video...) updated daily whose price has changed. Just read from the new table you created instead of comparing or joining various tables.
Cheers

How to design the DB for a complex wall like Facebook

I'm creating a Facebook-like social-network website.
For my "wall", I have many different kinds of information, like status, messages, user likes/dislikes a page, user has updated his profile, ...
I would like to know how to design my DB (wall-related tables) to be as efficient as possible (in terms of speed) when I fetch the wall items.
Thanks in advance!
EDIT: I have 2 ideas:
Have a big table with enough columns to handle all possibilities (user_a, user_b, message, page, is_like, is_dislike, ...). It will be fast but will have many 'NULL' values and will take a lot of space in the DB.
Have a 'wall_item' table with just three columns (id, user_a, user_b) and a table for each kind of wall item (messages, likes, status, ...). It will be normalized but will take much more time because of the number of left joins needed to get all the information.
I propose you have 2 tables - one for content and the other one for likes/dislikes. Having a lot of null values is not a problem - nulls won't take up space. Keeping likes/dislikes separately is probably needed, because they happen a lot and they are not content themselves.
If you want your system to be scalable then avoid JOINs. It is better to execute 2-3 queries in a row than one big massive query with a lot of JOINs. Also, if you have a lot of READ operations and not so many WRITEs (compared to the number of READs), it is wise to do additional work during each WRITE.
For example you could have a separate table for wall (containing user id and post id). When somebody makes a new post, id of post is written to table WALL for each friend. So displaying wall is just reading post-ids from table WALL and pulling the content - rather than searching for content from posts of all users during displaying.
And when new people become friends, you just copy the ids of their recent posts to each other's walls. Good luck with the project!
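The write-time fan-out described in this answer could look roughly like the sketch below. It uses Python's sqlite3 module as a stand-in database; the table names (`friendship`, `post`, `wall`) and the helper functions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friendship (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT);
    CREATE TABLE wall (user_id INTEGER, post_id INTEGER);
    CREATE INDEX wall_by_user ON wall (user_id, post_id);
""")
# Users 2 and 3 are both friends with user 1 (stored in both directions).
conn.executemany(
    "INSERT INTO friendship VALUES (?, ?)", [(1, 2), (2, 1), (1, 3), (3, 1)]
)


def create_post(author_id, body):
    """Insert the post, then do the extra write: one wall row per friend."""
    cur = conn.execute(
        "INSERT INTO post (author_id, body) VALUES (?, ?)", (author_id, body)
    )
    post_id = cur.lastrowid
    conn.execute(
        "INSERT INTO wall (user_id, post_id) "
        "SELECT friend_id, ? FROM friendship WHERE user_id = ?",
        (post_id, author_id),
    )
    return post_id


def read_wall(user_id, limit=20):
    """Two simple queries instead of one join: ids first, then content."""
    post_ids = [
        row[0]
        for row in conn.execute(
            "SELECT post_id FROM wall WHERE user_id = ? "
            "ORDER BY post_id DESC LIMIT ?",
            (user_id, limit),
        )
    ]
    if not post_ids:
        return []
    placeholders = ",".join("?" * len(post_ids))
    bodies = dict(
        conn.execute(
            f"SELECT id, body FROM post WHERE id IN ({placeholders})", post_ids
        )
    )
    return [bodies[post_id] for post_id in post_ids]  # newest first


create_post(1, "hello from user 1")
create_post(2, "hi from user 2")
```

Here `read_wall(3)` only reads rows addressed to user 3; the cost of finding which friends posted was already paid when each post was written.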
I think, NOSQL is used to gain performance. Any large table db access will be slow for this kind of application, let alone any joined tables.

SQL table structure

I am starting a new project that will handle surveys and reviews. At this point I am trying to figure out what would be the best sql table structure to store and handle such information.
Basically, the survey will contain ratings, text reviews and additional optional information that clients can share. Now I am thinking of either storing each piece of information in a separate column or merging all this data and storing it as XML in one column.
I am not sure what would be a better solution, but I have the following issues on my mind:
- would a possible increase in the information collected be a problem with a single XML column?
- would a single XML column have any serious impact on performance when extracting and handling information from it?
If you ever have a reason to query on a single piece of info, or update it alone, then don't store that data in XML, but instead as a separate column.
It is rare, IMO, that storing XML (or any other composite type of data) is a good idea in a DB. Although there are always exceptions.
Well, to keep this simple, you have two choices: dynamic or static surveys.
Dynamic surveys would look like this:
Not only would reporting be more complicated, but so would the UI. The number of questions is unknown and you would eventually need logic to handle order, grouping, and data types.
Static surveys would look more like this:
Although you certainly give up some flexibility, the solution (including reports) is considerably simpler. You need not handle order, grouping, or data types (at least dynamically).
I like to argue that "Simplicity is the best Design" in almost everything.
Since I cannot know your requirements in detail, I cannot assume which is the better fit. But I can tell you this, the dynamic is often built when the static is sufficient.
Best of luck!
If you don't want to fight with a relational database that expects relational data you probably want reasonably normalized data. I don't see in your case what advantage the XML would give you. If you have multiple values entered in the survey, you probably want another table for survey entries with a foreign key to the survey.
If this is going to be a relatively extensive application you might think about a table for survey definition, a table for survey question, a table for survey response, and a table for survey question response. If the survey data can be multiple types, you might need a table for each kind of question that might be asked, though in some cases a column might do.
EDIT - I think you would at least have one row per answer to a question. If the answer is complex (doesn't correspond to just one instance of a simple data type) it might actually be multiple rows (though denormalizing into multiple columns is probably O.K. if the number of columns is small and fixed). If an answer to one question needs to be stored in multiple rows, you would almost certainly end up with one table that represents the answer, and has one row per answer, plus another table that represents pieces of the answer, and has one row per piece.
If the reason you are considering XML is that the answers are going to be of very different types (for example, a review with a rating, a title, a header, a body, and a comments section for one question; a list of hyperlinks for another question, etc.) then the answer table might actually have to be several tables, so that you can model the data for each type of question. That would be a pretty complicated case though.
Hopefully one row per response, in a single table, would be sufficient.
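One possible reading of the four-table layout described in this answer, sketched with Python's sqlite3 module as a stand-in database. All table and column names below are guesses made for illustration, and the sketch takes the simple case where an answer is a rating plus optional text in a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE survey_question (
        id        INTEGER PRIMARY KEY,
        survey_id INTEGER NOT NULL REFERENCES survey(id),
        position  INTEGER NOT NULL,
        prompt    TEXT NOT NULL
    );
    -- One row each time a client submits the survey.
    CREATE TABLE survey_response (
        id           INTEGER PRIMARY KEY,
        survey_id    INTEGER NOT NULL REFERENCES survey(id),
        submitted_at TEXT NOT NULL
    );
    -- One row per answer to a question within a submission.
    CREATE TABLE survey_question_response (
        response_id INTEGER NOT NULL REFERENCES survey_response(id),
        question_id INTEGER NOT NULL REFERENCES survey_question(id),
        rating      INTEGER,
        review_text TEXT,
        PRIMARY KEY (response_id, question_id)
    );
""")

# Made-up sample data.
conn.execute("INSERT INTO survey VALUES (1, 'Customer feedback')")
conn.execute("INSERT INTO survey_question VALUES (1, 1, 1, 'Overall rating')")
conn.executemany(
    "INSERT INTO survey_response VALUES (?, 1, ?)",
    [(1, "2024-01-01"), (2, "2024-01-02")],
)
conn.executemany(
    "INSERT INTO survey_question_response VALUES (?, 1, ?, ?)",
    [(1, 4, "Good"), (2, 5, "Great")],
)

# Because each value lives in its own column, reporting is a plain query.
average_rating = conn.execute(
    "SELECT AVG(rating) FROM survey_question_response WHERE question_id = 1"
).fetchone()[0]
```

With the ratings in their own column, an aggregate such as the average is a one-line query, which is the kind of work an XML column would make harder.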
To piggyback off of Flimzy's answer, you want to simply store the data in the database and not in a specific format (i.e. XML). You might have a requirement at the moment for XML, but tomorrow it might be a CSV or a fixed-width DAT file. Also, if you store just the data, then you can use the "power" of the database to search on specific columns of information and then return it as XML, if desired.

What's the fastest way to save data and read it next time in a IPhone App?

In my dictionary iPhone app I need to save an array of strings containing about 125,000 distinct words; this comes to approx. 3.2 MB of data.
The first time I run the app I get this data from an SQLite db. As it takes ages for this query to run, I need to save the data somehow, to read it faster each time the app launches.
Until now I've tried serializing the array and writing it to a file, and afterward I tested writing directly to NSUserDefaults to see if there's any speed gain, but there's none. Either way it takes about 7 seconds on the device to load the data. It seems that it is not reading from the file (or NSUserDefaults) that takes all that time, but the deserialization:
objectsForCharacters = [[NSKeyedUnarchiver unarchiveObjectWithData:data] retain];
Do you have any ideas about how I could write this data structure so that I could read it into memory faster?
The UITableView is not really designed to handle tens of thousands of records. It would take a long time for a user to find what they want.
It would be better to load a portion of the table, perhaps a few hundred rows, as the user enters data, so that it appears they have all the records available to them (perhaps providing a label which shows the number of records they have left in their filtered view).
The SQLite db should be perfect for this job. Add an index to the words table and then select a limited number of rows from it to show the user some progress. Adding an index makes a big difference to the performance of even this simple table.
For example, I created two tables in a sqlite db and populated them with around 80,000 words
-- Create and populate the indexed table
create table words(word);
.import dictionary.txt words
create unique index words_index on words(word DESC);
-- Create and populate the unindexed table
create table unindexed_words(word);
.import dictionary.txt unindexed_words
Then I ran the following query and got the CPU Time taken for each query
.timer ON
select * from words where word like 'sn%' limit 5000;
...
>CPU Time: user 0.031250 sys 0.015625;
select * from unindexed_words where word like 'sn%' limit 5000;
...
>CPU Time: user 0.062500 sys 0.0312
The results vary but the indexed version was consistently faster than the unindexed one.
With fast access to parts of the dictionary through an indexed table, you can bind the UITableView to the database using NSFetchedResultsController. This class takes care of fetching records as required, caches results to improve performance and allows predicates to be easily specified.
An example of how to use the NSFetchedResultsController is included in the iPhone Developers Cookbook. See main.m
Just keep the strings in a file on the disk, and do the binary search directly in the file.
So: you say the file is 3.2mb. Suppose the format of the file is like this:
key DELIMITER value PAIRDELIMITER
where key is a string, and value is the value you want to associate with it. The DELIMITER and PAIRDELIMITER must be chosen such that they never occur in a key or a value.
Furthermore, the file must be sorted on the key
With this file you can just do the binary search in the file itself.
Suppose one types a letter: you jump to the middle of the file and search (forwards or backwards) for the first PAIRDELIMITER. Then check the key and see whether you have to search upwards or downwards. Repeat until you find the key you need.
I'm betting this will be fast enough.
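A sketch of that in-file binary search, written in Python rather than Objective-C to keep it short. It assumes a tab as DELIMITER, a newline as PAIRDELIMITER, and keys sorted by their UTF-8 bytes; the function names and sample entries are invented for illustration.

```python
import os
import tempfile

DELIMITER = b"\t"       # assumed never to occur inside a key or value
PAIR_DELIMITER = b"\n"  # assumed never to occur inside a key or value


def write_dictionary(path, entries):
    """Write `key DELIMITER value PAIRDELIMITER` records sorted by key bytes."""
    records = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in entries.items())
    with open(path, "wb") as f:
        for key, value in records:
            f.write(key + DELIMITER + value + PAIR_DELIMITER)


def lookup(path, key):
    """Binary-search the file on disk; return the value or None."""
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)  # a match must start in [lo, hi)
        while lo < hi:
            mid = (lo + hi) // 2
            if mid > 0:
                f.seek(mid - 1)
                f.readline()  # move forward to the next record boundary
            else:
                f.seek(0)
            start = f.tell()  # first record starting at or after mid
            line = f.readline()
            if start >= hi or not line:
                hi = mid  # no record starts in [mid, hi)
                continue
            found, _, value = line.rstrip(PAIR_DELIMITER).partition(DELIMITER)
            if found == target:
                return value.decode("utf-8")
            if found < target:
                lo = start + len(line)  # continue after this record
            else:
                hi = mid  # continue before this record
    return None


# Made-up sample dictionary written to a temporary file.
sample_path = os.path.join(tempfile.mkdtemp(), "dictionary.txt")
write_dictionary(sample_path, {
    "apple": "a fruit",
    "banana": "a yellow fruit",
    "cherry": "a small red fruit",
    "date": "a sweet fruit",
})
```

Each lookup reads only a handful of records, so nothing has to be deserialized or held in memory at launch.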
Store your dictionary in Core Data and use NSFetchedResultsController to manage the display of these dictionary entries in your table view. Loading all 125,000 words into memory at once is a terrible idea, both performance- and memory-wise. Using the -setFetchBatchSize: method on your fetch request for loading the words for your table, you can limit NSFetchedResultsController to only handling the small subset of words that are visible at any given moment, plus a little buffer. As the user scrolls up and down the list of words, new batches of words are fetched in transparently.
A case like yours is exactly why this class (and Core Data) was added to iPhone OS 3.0.
Do you need to store/load all data at once?
Maybe you can just load the chunk of strings you need to display and load all other strings in the background.
Perhaps you can load data into memory in one thread and search from it in another? You may not get search results instantly, but having some searches feel snappier may be better than none at all, by waiting until all data are loaded.
Are some words searched more frequently or repeatedly than others? Perhaps you can cache frequently searched terms in a separate database or other store. Load it in a separate thread as a searchable store, while you are loading the main store.
As for a data structure solution, you might look into a suffix trie to search for substrings in linear time. This will probably increase your storage requirements, though, which may affect your ability to implement this with an iPhone's limited memory and disk storage capabilities.
I really don't think you're on the right path trying to load everything at once.
You've already determined that your bottleneck is the deserialization.
Regardless what the UI does, the user only sees a handful (literally) of search results at a time.
SQLite already has a robust indexing mechanism, so there is likely no need to re-invent that wheel with your own indexing, etc.
IMHO, you need to rethink how you are using UITableView. It only needs a few screenfuls of data at a time, and you should reuse cell objects as they scroll out of view rather than creating a ton of them to begin with.
So, use SQLite's indexing and grab the "TOP x" rows (LIMIT x in SQLite), where x strikes the right balance between giving the user some immediately available rows to scroll through and not spending too much time loading them. Set the table's scroll bar scaling using a separate SELECT COUNT(*) query, which only needs to be updated when the user types something different.
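The two queries suggested here could be sketched as follows. Python's sqlite3 module is used only to show the SQL; an iPhone app would issue the same statements through its own SQLite binding. The `words` table layout, the `search` helper, and the sample words are assumptions, and the sketch does not escape `%` or `_` in the typed prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.execute("CREATE UNIQUE INDEX words_index ON words (word)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(w,) for w in ["snack", "snail", "snake", "snap", "sneak", "snow", "soap", "sun"]],
)


def search(prefix, limit=50):
    """Return (total matches, first `limit` matches) for the typed prefix."""
    pattern = prefix + "%"
    # Total count, used only to scale the scroll bar.
    total = conn.execute(
        "SELECT COUNT(*) FROM words WHERE word LIKE ?", (pattern,)
    ).fetchone()[0]
    # Just the first screenfuls of rows, not the whole result set.
    first_rows = [
        row[0]
        for row in conn.execute(
            "SELECT word FROM words WHERE word LIKE ? ORDER BY word LIMIT ?",
            (pattern, limit),
        )
    ]
    return total, first_rows


total, first_rows = search("sn", limit=3)
```

Both queries are re-run only when the typed text changes; scrolling further would fetch the next slice with an OFFSET or a "greater than the last word shown" condition.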
You can always go back and cache aggressively after you deserialize enough to get something up on-screen. A slight lag after the first flick or typing a letter is more acceptable than a 7-second delay just starting the app.
I have currently a somewhat similar coding problem with a large amount of searchable strings.
My solution is to store the prepared data in one large memory array, containing both the textual data and offsets as links. This means I do not allocate objects for each item. This makes the data use less memory and also allows me to load and save it to a file without further processing.
Not sure if this is an option for you, since this is quite an obvious solution once you've realized that the object tree is causing the slowdown.
I use a large NSData memory block, then search through it. Well, there's more to it, it took me about two days to get it well optimized.
In your case I suspect you have a dictionary with a lot of words that have similar beginnings. You could prepare them on another computer in a format that both compacts the data and facilitates fast lookup. As a first step, the words should be sorted. With that, you can already perform a binary search on them for a fast lookup. If you store it all in one large memory area, you can do the search quite fast, compared to how sqlite would search, I think.
Another way would be to see the words as a kind of tree: you have many thousands that begin with the same letter. So you divide your data accordingly: you have a SQL table for each beginning letter of your set of words. That way, if you look up a word, you'd select one of the now-smaller tables depending on the first letter. This already makes the amount that has to be searched much smaller. You can do this for the 2nd and 3rd letter as well, and you could already have quite fast access.
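To make the "one large memory area plus offsets" idea concrete, here is a small sketch in Python (the answer itself works with NSData, and its actual format is not shown, so everything below is an assumed layout). The sorted words are packed back to back into one bytes object, an integer array records where each word starts, and a binary search runs directly over that block without creating one object per word up front.

```python
from array import array


def pack_words(words):
    """Pack sorted UTF-8 words into one bytes blob plus an offsets array."""
    blob = bytearray()
    offsets = array("I")
    for encoded in sorted(w.encode("utf-8") for w in words):
        offsets.append(len(blob))  # where this word starts
        blob += encoded
    offsets.append(len(blob))  # sentinel: end of the last word
    return bytes(blob), offsets


def word_at(blob, offsets, index):
    """Slice a single word out of the blob on demand."""
    return blob[offsets[index]:offsets[index + 1]]


def contains(blob, offsets, word):
    """Binary search over the packed words; no per-word objects are kept."""
    target = word.encode("utf-8")
    lo, hi = 0, len(offsets) - 1  # hi is the number of words
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = word_at(blob, offsets, mid)
        if candidate == target:
            return True
        if candidate < target:
            lo = mid + 1
        else:
            hi = mid
    return False


# Made-up sample word list.
blob, offsets = pack_words(["pear", "apple", "fig", "banana"])
```

Both `blob` and `offsets` can be written to and read from a file as raw bytes, which is why loading this layout needs no deserialization step.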
Did this give you some ideas?
Well, actually I figured it out myself in the end, but of course I thank you all for your quick and pertinent answers. To be concise, I will just say that Objective-C, like any other object-based programming language, is significantly slower than procedural programming languages because of introspection and other object-related overhead.
The solution was in fact to load all my data into a contiguous chunk of memory using malloc (a char **), search in it on demand, and transform the results to objects. This resulted in a 0.5 sec loading time (from file to memory) and reasonable (should be read "fast") operations during execution. Thank you all again, and if you have any questions I'm here for you. Thanks