I'm building a database to track usage of and generate reports on particular products on a daily basis. There is a limited number of products (about 20); they're removed from inventory, used in production, and then the remaining product is returned to inventory but not put back into the system. The idea is to have production record how much they receive and then have storage record how much they get back from production.
The tables are pretty straightforward - I have one table with product properties that I completely control for the foreseeable future and another that will house usage data. The issue I'm having is how to design the form operations will use when they receive product. I don't want to have a huge list of product #'s with entries for each one (usually 5 to 10 are used on a daily basis). I also don't want to populate the data table with a bunch of blank records. I want them to add a line, select the product code from a drop-down, record the amount received, and then repeat. Preferably the drop-down would update to exclude any previously filled-in codes on the form. I want to do this all at once to limit duplicate records. So they fill out all the product #'s they received and how many of each, then click save to have it populate the data table.
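In case it helps, here is a rough SQL sketch of what I have in mind for the two tables (all of the names below are just placeholders); the unique constraint is my attempt to keep duplicate records out of the data table:

-- Placeholder names only; a sketch of the two tables described above.
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_code VARCHAR(20) NOT NULL UNIQUE,
    description  VARCHAR(100)
);

CREATE TABLE usage_record (
    usage_id     INTEGER PRIMARY KEY,
    product_id   INTEGER NOT NULL REFERENCES product (product_id),
    usage_date   DATE NOT NULL,
    qty_received NUMERIC(10,2),  -- entered by production
    qty_returned NUMERIC(10,2),  -- entered by storage
    CONSTRAINT one_row_per_product_per_day UNIQUE (product_id, usage_date)
);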
Is there a way to have an "add line" option for a form of this design? I know this isn't a terribly extensive database but I want to design and test it prior to integration into our plant's larger scale product tracking system.
There is a scenario where I need to add entries for every user in a table. There will be around 5-10 records per user, and there are approximately 1,000 users. So, if I add each user's data to a single table every day, the table becomes very heavy and the read/write operations on the table take some time to return the data (which would mostly be for a particular user).
The back-end tech stack is Spring Boot and PostgreSQL.
Is there any way to create a new table for every user dynamically from the Java code, and is it really a good way to manage the data, or should all the data be in a single table?
I'm concerned about query performance once a single table holds many records for every user.
The model will contain similar fields like userName, userData, time, etc.
Thank you for your time!
Creating one table per user is not good practice. Based on the information you provided, at most about 10,000 rows are created per day. Any RDBMS will handle this amount of data without any performance issues.
By making use of indexing and partitioning, you will be able to address any potential performance issues.
PS: It is always recommended to define a retention period for the data you want to keep in the operational database. I am not sure about your use case, but if possible define a retention period and move older data out of the operational table into backup storage.
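For example, a rough PostgreSQL sketch (table and column names invented) of time-based partitioning with an index on the user column, plus retention by detaching old partitions:

-- Invented names; PostgreSQL 11+ declarative partitioning syntax.
CREATE TABLE user_activity (
    id        BIGSERIAL,
    user_name TEXT        NOT NULL,
    user_data TEXT,
    recorded  TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (recorded);

-- One partition per month; create these ahead of time or from a scheduled job.
CREATE TABLE user_activity_2024_01 PARTITION OF user_activity
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Most reads are for a particular user, so index that column.
CREATE INDEX idx_user_activity_user_recorded ON user_activity (user_name, recorded);

-- Retention: detach (or drop) old partitions and archive them elsewhere.
ALTER TABLE user_activity DETACH PARTITION user_activity_2024_01;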
I am working on a big data project where large amounts of product information are gathered from different online sellers, such as prices, titles, sellers and so on (30+ data points per item).
In general, there are 2 use cases for the project:
Display the latest data points for a specific product in a web app or widget
Analyze historical data, e.g. price history, product clustering, semantic analysis and so on
I first decided to use MongoDB to be able to scale horizontally, as the data stored for the project is assumed to be in the range of hundreds of GBs and the data could be sharded dynamically across many MongoDB instances.
The 30+ data points per product won't be collected at once, but at different times, e.g. one crawler collects the prices, and a couple of days later another one collects the product description. However, some data points might overlap because both crawlers collect, e.g., the product title. For example, the result could look like this:
Document 1:
{
    '_id': 1,
    'time': ISODate('01.05.2016'),
    'price': 15.00,
    'title': 'PlayStation4',
    'description': 'Some description'
}
Document 2:
{
    '_id': 1,
    'time': ISODate('02.05.2016'),
    'price': 16.99,
    'title': 'PlayStation4',
    'color': 'black'
}
Therefore I initially came up with the following idea (Idea 1):
All the data points found at one specific crawl process end up in one document as described above. To get the latest product info, I would then query each data point individually and get the newest entry that is not older than some threshold, e.g. a week, to make sure that the product info is not outdated for "Use Case 1" and that we have all the data points (because a single document may not include all data points but only a subset).
However, as some data points (e.g. product titles) do not change regularly, just saving all the data all the time (to be able to do time series analysis and advanced analytics) would lead to massive redundancy in the database, e.g. the same product description would be saved every day even though it doesn't change. Therefore I thought I might check the latest value in the DB and only save the value if it has changed. However, this leads to a lot of additional DB queries (one for each data point) and, due to the time threshold mentioned above, we would lose the information whether the data point did not change or was removed from the website by the owner of the shop.
Thus, I was thinking about a different solution (Idea 2):
I wanted to split up all the data points into different documents, e.g. the price and the title are stored in separate documents with their own timestamps. If a data point does not change, the timestamp can be updated to indicate that the data point did not change and is still available on the website. However, this would lead to a tremendous overhead for small data points, e.g. just boolean values, because every document needs its own key, timestamp and so on to be able to find / filter / sort them quickly using indexes.
For example:
{
    '_id': 1,
    'timestamp': ISODate('04.05.2016'),
    'type': 'price',
    'value': 15.00
}
Therefore, I am struggling to find the right model and / or database to use for this project. To sum it up, these are the requirements:
Collect hundreds of millions of products (hundreds of GBs, even TBs)
Overlapping subsets of product information are retrieved by distributed crawlers at different points in time
Information should be stored in a distributed, horizontally scalable database
Data redundancy should be reduced to a minimum
Time series information about the data points should be retained
I would be very grateful for any ideas (data model / architecture, different database, ...) that might help me advance the project. Thanks a lot in advance!
Are the fields / data points already known and specified? I.e., do you have a fixed schema? If so, then you can consider relational databases as well.
DB2 has what they call temporal tables. In the 'system' form, the DB handles versioning transparently. Any inserts are automatically timestamped, and whenever you update a row, the previous row is automatically migrated to a history table (keeping its old timestamp). Thereafter, you can run SQL queries as of any given point in time, and DB2 will return the data as it was at the time (or time range) specified. They also have an 'application' form, in which you specify the time periods that the row is valid for when you insert the row (e.g. if prices are valid for a specific period of time), but the ultimate SQL queries still work the same way. What's nice is that either way, all the time complexity is managed by the database and you can write relatively clean SQL queries.
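Roughly, the 'system' form looks like this (a sketch from memory, so double-check the exact syntax in the DB2 documentation):

-- Sketch of a DB2 system-period temporal table; verify details against the docs.
CREATE TABLE product_price (
    product_id INT           NOT NULL,
    price      DECIMAL(9,2)  NOT NULL,
    sys_start  TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
    sys_end    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
    trans_id   TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
    PERIOD SYSTEM_TIME (sys_start, sys_end)
);

CREATE TABLE product_price_history LIKE product_price;
ALTER TABLE product_price ADD VERSIONING USE HISTORY TABLE product_price_history;

-- "As of" query: the data as DB2 saw it at a given point in time.
SELECT price
FROM product_price FOR SYSTEM_TIME AS OF '2016-05-01-00.00.00'
WHERE product_id = 1;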
You can check out more at their DeveloperWorks site.
I know that other relational DBs like Oracle also have special capabilities for time series data that manage the versioning / timestamping stuff for you.
As far as space efficiency and scale, I'm not sure as I don't run any databases that big :-)
(OTOH, if you don't have a fixed schema, or you know you'll have multiple schemas for the different data inputs and you can't model it with sparse tables, then a document DB like mongo might be your best bet)
I have been working on measuring data completeness and creating actionable reports for our HRIS system for some time.
Until now I have used Excel, but now that the reporting requirements have stabilized and the need for quicker response times has increased, I want to move the work to another level. At the same time I would also like more detailed options for distinguishing between different units.
As an example I am looking at missing fields. So for each employee in every company I simply want to count how many fields are missing.
For other fields I am looking to validate data - like birthdays compared to hiring dates, thresholds for different values, employee groups compared to responsibility level, and so on.
My question is where to move from here. Is there any language that is better than the others when it comes to importing lists, evaluating fields in the lists, and then quantifying the results at company and other levels? I want to be able to extract data from our different systems, then have a program do all the calculations and summarize the findings in some way. (I consider it to be a good learning experience.)
I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required, but necessary for data integrity) and dumped those to an incomplete-record table that was cleared each night before the process ran. I then sent batch emails to the group responsible for each missing element (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .NET against an Oracle database and sent emails via Lotus Notes, but a similar design should work in just about any environment.
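The SQL side of that nightly job can be as simple as this sketch (the table and column names are invented; adapt them to your own extract):

-- Invented names; rebuild the incomplete-record table each night before the run.
DELETE FROM incomplete_record;

-- Count how many required fields are missing per employee.
INSERT INTO incomplete_record (employee_id, company, missing_count)
SELECT employee_id,
       company,
       CASE WHEN birth_date  IS NULL THEN 1 ELSE 0 END
     + CASE WHEN hire_date   IS NULL THEN 1 ELSE 0 END
     + CASE WHEN cost_center IS NULL THEN 1 ELSE 0 END
FROM employee_export
WHERE birth_date IS NULL OR hire_date IS NULL OR cost_center IS NULL;

-- Validation rules work the same way, e.g. flag birthdays on or after the hire date.
SELECT employee_id, company
FROM employee_export
WHERE birth_date >= hire_date;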
I am working on a website that displays all the apps from the App Store. I am getting the App Store data from their EPF Data Feeds through the EPF Importer. In that database I get the pricing of each app for every storefront. There are dozens of rows in that data set, in a table whose structure looks like this:
application_price
The retail price of an application.
Name            Key   Description
export_date           The date this application was exported, in milliseconds since the UNIX Epoch.
application_id  Y     Foreign key to the application table.
retail_price          Retail price of the application, or null if the application is not available.
currency_code         The ISO3A currency code.
storefront_id   Y     Foreign key to the storefront table.
This is the table I get. My problem now is that I cannot figure out how to calculate the price reductions and the newly free apps from this particular dataset. Does anyone have an idea how I can calculate them?
Any idea or answer will be highly appreciated.
I tried storing the previous data and the current data and then matching them. The problem is that the table itself is too large, and the comparison requires a JOIN that pushes the query execution time to more than an hour, which I cannot afford. There are approximately 60,000,000 rows in the table.
With these fields you can't directly determine price drops or new applications. You'll have to insert these into your own database and determine the differences from there. In a relational database like MySQL this isn't too complex:
To determine which applications are new, you can add your own column "first_seen", and then query your database for all objects whose first_seen value is no more than a day old.
To calculate price drops you'll have to calculate the difference between the retail_price of the current import, and the previous import.
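A sketch of both in MySQL-flavored SQL (application_id, storefront_id and retail_price come from the EPF table above; the table name, import_date and first_seen are things you add yourself):

-- New applications: first_seen is set once, when an application first shows up in an import.
SELECT application_id
FROM my_application_price
WHERE first_seen >= CURRENT_DATE - INTERVAL 1 DAY;

-- Price drops: compare today's import with yesterday's per application and storefront.
-- A composite index on (application_id, storefront_id, import_date) keeps this join fast.
SELECT cur.application_id,
       cur.storefront_id,
       prev.retail_price AS old_price,
       cur.retail_price  AS new_price
FROM my_application_price cur
JOIN my_application_price prev
  ON  prev.application_id = cur.application_id
  AND prev.storefront_id  = cur.storefront_id
  AND prev.import_date    = cur.import_date - INTERVAL 1 DAY
WHERE cur.import_date = CURRENT_DATE
  AND cur.retail_price < prev.retail_price;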
Since you've edited your question, my edited answer:
It seems like you're having storage/performance issues, and you know what you want to achieve. To solve this you'll have to start measuring and debugging: with datasets this large you'll have to make sure you have the correct indexes. Profiling your queries should help you find out whether they do.
Your environment is probably "write once a day" and read "many times a minute" (I'm guessing you're building a website). So you could speed up the front end by processing the differences (price drops and new applications) on import, rather than when displaying them on the website.
If you still are unable to solve this, I suggest you open a more specific question, detailing your DBMS, queries, etc, so the real database administrators will be able to help you. 60 million rows are a lot, but with the correct indexes it should be no real trouble for a normal database system.
Compare the table with one you've downloaded the previous day, and note the differences.
Added:
For only 60 million items, and on a contemporary PC, you should be able to store a sorted array of the store id numbers and previous prices in memory, and do an array lookup faster than the data is arriving from the network feed. Mark any differences found and double-check them against the DB in post-processing.
Actually, I have also been playing with this data, and I think the best approach for you is based on the data Apple provides.
You have two types of data: full and incremental (updated daily). Within the new incremental data (nowhere near as big as the full feed) you can compare only the records that were updated, and insert those into another table to determine which prices have changed.
That way you have a list of records (app, song, video, ...) updated daily whose price has changed; you just read from the new table you created instead of comparing or joining the large tables.
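A rough SQL sketch of that idea (the table names here are made up):

-- "application_price" holds the current full data,
-- "incremental_application_price" is where the daily incremental feed is loaded.
INSERT INTO price_change (application_id, storefront_id, old_price, new_price, change_date)
SELECT inc.application_id,
       inc.storefront_id,
       cur.retail_price,
       inc.retail_price,
       CURRENT_DATE
FROM incremental_application_price inc
JOIN application_price cur
  ON  cur.application_id = inc.application_id
  AND cur.storefront_id  = inc.storefront_id
WHERE inc.retail_price <> cur.retail_price;
-- (Handle NULL prices separately if an app becomes unavailable.)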
Cheers
If I were to create a basic personal accounting system (because I'm like that - it's a hobby project about a domain I'm familiar enough with to avoid getting bogged-down in requirements), would a NoSQL/document database like RavenDB be a good candidate for storing the accounts and more importantly, transactions against those accounts? How do I choose which entity is the "document"?
I suspect this is one of those cases where a SQL database is actually the right fit and trying to go NoSQL is the mistake, but then when I think of what little I know of CQRS and event sourcing, I wonder if the entity/document is actually the Account, and the transactions are Events stored against it, and that when these "events" occur, maybe my application also then writes out to an easily queryable read store like a SQL database.
Many thanks in advance.
Personally, I think it is a good idea, but I am a little biased because my full-time job is building an accounting system based on CQRS, Event Sourcing, and a document database.
Here is why:
Event Sourcing and Accounting are based on the same principle. You don't delete anything, you only modify. If you add a transaction that is wrong, you don't delete it. You create an offset transaction. Same thing with events: you don't delete them, you just create an event that cancels out the first one. This means you are publishing a lot of TransactionAddedEvents.
Next, if you are doing double-entry accounting, recording a transaction is different from the way you view it on a screen (especially in a balance sheet). Hence my liking for CQRS again. We can store the data using correct accounting principles, but our read model can be optimized to show the data the way you want to view it.
In a balance sheet, you want to view all entries for a given account. You don't want to see the transaction because the transaction has two sides. You only want to see the entry that affects that account.
So in your document db you would have an entries collection.
This makes querying very easy. If you want to see all of the entries for an account you just say SELECT * FROM Entries WHERE AccountId = 1. I know that is SQL, but everyone understands the simplicity of this query. It is just as easy in a document db. Plus, it will be lightning fast.
You can then create a balance sheet with a query grouping by AccountId and setting a restriction on the date. Notice that no joins are needed at all, which makes a document db a great choice.
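For example, the balance sheet read model boils down to something like this (Entries and AccountId as above; Amount and EntryDate are hypothetical names for the entry's value and date):

-- Balance per account as of a given date, read straight from the entries read model.
SELECT AccountId,
       SUM(Amount) AS Balance
FROM Entries
WHERE EntryDate <= '2016-12-31'
GROUP BY AccountId;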
Theory and Architecture
If you dig around in accounting theory and history a while, you'll see that the "documents" ought to be the source documents -- purchase order, invoice, check, and so on. Accounting records are standardized summaries of those usually-human-readable source documents. An accounting transaction is two or more records that hit two or more accounts, tied together, with debits and credits balancing. Account balances, reports like a balance sheet or P&L, and so on are just summaries of those transactions.
Think of it as a layered architecture -- the bottom layer, the foundation, is the source documents. If the source is electronic, then it goes into the accounting system's document storage layer -- this is where a nosql db might be useful. If the source is a piece of paper, then image it and/or file it with an index number that is then stored in the accounting system's document layer. The next layer up is digital records summarizing those documents; each document is summarized by one or more unbalanced transaction legs. The next layer up is balanced transactions; each transaction is composed of two or more of those unbalanced legs. The top layer is the financial statements that summarize those balanced transactions.
Source Documents and External Applications
The source documents are the "single source of truth" -- not the records that describe them. You should always be able to rebuild the entire db from the source documents. In a way, the db is just an index into the source documents in the first place. Way too many people forget this, and write accounting software in which the transactions themselves are considered the source of truth. This causes a need for a whole 'nother storage and workflow system for the source documents themselves, and you wind up with a typical modern corporate mess.
This all implies that any applications that write to the accounting system should only create source documents, adding them to that bottom layer. In practice though, this gets bypassed all the time, with applications directly creating transactions. This means that the source document, rather than being in the accounting system, is now way over there in the application that created the transaction; that is fragile.
Events, Workflow, and Digitizing
If you're working with some sort of event model, then the right place to use an event is to attach a source document to it. The event then triggers that document getting parsed into the right accounting records. That parsing can be done programmatically if the source document is already digital, or manually if the source is a piece of paper or an unformatted message -- sounds like the beginnings of a workflow system, right? You still want to keep that original source document around somewhere though. A document db does seem like a good idea for that, particularly if it supports a schema where you can tie the source documents to their resulting parsed and balanced records and vice versa.
You can certainly create such a system.
In that scenario, you have the Account Aggregate, and you also have the TimePeriod Aggregate.
The time period is usually a Month, a Quarter or a Year.
Inside each TimePeriod, you have the Transactions for that period.
That means that loading the current state is very fast, and you have the full log in which you can go backward.
The reason for TimePeriod is that this is usually the boundary in which you actually think about such things.
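Purely as a sketch (invented names), the same idea expressed in relational terms looks like this; the point is that loading the current state only ever touches the current period's rows:

-- Invented names; one row per transaction, keyed by account and time period.
CREATE TABLE account_transaction (
    account_id  INT           NOT NULL,
    period      CHAR(7)       NOT NULL,  -- e.g. '2016-05' for a monthly TimePeriod
    occurred_on DATE          NOT NULL,
    amount      DECIMAL(12,2) NOT NULL,
    description VARCHAR(200)
);

-- Loading the current state of an account reads only the current period.
SELECT occurred_on, amount, description
FROM account_transaction
WHERE account_id = 1
  AND period = '2016-05'
ORDER BY occurred_on;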
In this case, a relational database is the most appropriate, since you have relational data (e.g. rows and columns).
Since this is just a personal system, you are highly unlikely to have any scale or performance issues.
That being said, it would be an interesting exercise for personal growth and learning to use a document-based DB like RavenDB. Traditionally, finance has always been a very formal thing, and relational databases are typically considered more formal and rigorous than document databases. But, like you said, the domain for this application is under your control and is fairly straightforward, so complexity and requirements would not get in the way of designing the system.
If it was my own personal pet project, and I wanted to learn more about a new-ish technology and see if it worked in a particular domain, I would go with whatever I found interesting and if it didn't work very well, then I learned something. But, your mileage may vary. :)