Transferring data into a new ERP for implementation - data-cleaning

I'm transferring data from our old ERP system to a new ERP. The first step is data cleaning. The old ERP holds 23 years of data. How do I decide which data to transfer and which data to archive? Do I transfer only data that has been active in the last 10 years?
Hoping someone has come across this question before.
Thanks
Adriana

Related

How to handle lots of 'archived' data in Postgres

We have a huge Postgres database where we store fiscal data (invoices, bank statements, sales orders) for thousands of companies. In the UI of our app the data is divided per fiscal year (which is usually one calendar year), so a user chooses a year and only sees data for that specific year.
For example, we have a table that stores journal entries (every invoice line can result in multiple journal entries). This table is quite slow on the more complex queries. It's one big table going back about 15 years. However, users rarely access old data anymore: only the past 2 or 3 years are actively accessed, and anything older is almost never touched.
What is the best way to deal with this old, almost-archived data? Partitioning? Clustering? If anyone could point me in the right direction that would be of great help.
P.S. Our database is hosted in Google Cloud.
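One common approach for this access pattern is declarative range partitioning by fiscal year. Below is a minimal sketch; the table and column names (journal_entries_part, booking_date, etc.) are made up for the example, not taken from the actual schema, and Postgres 11+ is assumed for the primary key on a partitioned table.

```python
# Minimal sketch: declarative range partitioning by fiscal year.
# Hypothetical table/column names; drop the primary key for Postgres 10.
import psycopg2

DDL = """
CREATE TABLE journal_entries_part (
    id           bigint        NOT NULL,
    company_id   integer       NOT NULL,
    booking_date date          NOT NULL,
    amount       numeric(14,2),
    PRIMARY KEY (id, booking_date)
) PARTITION BY RANGE (booking_date);

-- One partition per fiscal year. Old years can later be detached and
-- dumped or moved to cheaper storage without touching the hot partitions.
CREATE TABLE journal_entries_2023 PARTITION OF journal_entries_part
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE journal_entries_2024 PARTITION OF journal_entries_part
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
"""

conn = psycopg2.connect("dbname=fiscal user=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(DDL)  # all statements run in one transaction
conn.close()
```

Queries that filter on the partition key only scan the relevant year's partitions (partition pruning), which matches the "users only touch the last 2 or 3 years" pattern; indexes and clustering can still be applied per partition.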

Predicting answers from data in a CSV file when a user asks a question

I have a CSV file with columns on crop production.
Columns of the CSV file
If a user asks a question like
"What should I produce in Maharashtra in 2019?"
the user should get an answer based on the history of the data in the CSV.
Can someone help me with the implementation?
For a good answer (training), you probably need the amount of requested or sold crop rather than the amount of produced crop.
A first implementation could give the average of the past years' production, possibly weighting the past few years more.
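As a rough sketch of that idea, assuming the CSV has columns named State_Name, Crop, Crop_Year and Production (adjust to the real header), a recency-weighted average per crop could look like this:

```python
# Sketch of a recency-weighted historical average per crop.
# Column names are assumptions; adjust them to the actual CSV header.
import pandas as pd

def predict_top_crops(csv_path, state, year, top_n=5, decay=0.7):
    df = pd.read_csv(csv_path)
    hist = df[(df["State_Name"] == state) & (df["Crop_Year"] < year)].copy()

    # Weight recent years more: weight = decay ** (years back from the asked year)
    hist["weight"] = decay ** (year - hist["Crop_Year"])
    hist["weighted_prod"] = hist["Production"] * hist["weight"]

    scores = (hist.groupby("Crop")["weighted_prod"].sum()
              / hist.groupby("Crop")["weight"].sum())
    return scores.sort_values(ascending=False).head(top_n)

# Example: what to produce in Maharashtra in 2019, based on history
print(predict_top_crops("crop_production.csv", "Maharashtra", 2019))
```

The decay factor controls how strongly recent years dominate; setting it to 1.0 reduces this to a plain historical average.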

Which are the best free data warehouse products? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I am developing a system which contains a lot of OLAP work. According to my research, a column-based data warehouse is the best choice, but I am puzzled about how to choose a good data warehouse product.
All the data warehouse comparison articles I can find are from before 2012, and there seem to be few recent ones. Is data warehousing out of date? Is Hadoop HBase better?
As far as I know, InfiniDB is a high-performance open source data warehouse product, but it has not been maintained for 2 years (https://github.com/infinidb/infinidb) and there is little documentation about it. Has InfiniDB been abandoned by developers?
Which is the best data warehouse product by now?
How do I incrementally move my business data stored in a MySQL database to the data warehouse?
Thank you for your answer!
Data warehousing is still a hot topic, and HBase is not the fastest, but it is a very well-known and compatible option (many applications build on it).
I went on the journey for a good column store some years ago and finally settled on InfiniDB because of the easy migration from plain MySQL. It's a nice piece of software, but it still has bugs, so I cannot fully recommend it for production use (not without a second failover instance).
However, MariaDB has picked up the InfiniDB technology and is porting it over to their MariaDB Database Server. This new product is called MariaDB ColumnStore [1], of which a testing build is available. They have already put a lot of effort into it, so I think ColumnStore will become a major MariaDB product within the next two years.
I can't answer that; I'm still with InfiniDB and also helping others with their projects.
This totally depends on your data structure and usage.
InfiniDB is great at querying; in my tests it had ~8% better performance than Impala. However, while InfiniDB supports INSERT, UPDATE, DELETE and transactions, it is not great on transactional workloads, i.e. just moving a community-driven website where visitors constantly manipulate data to InfiniDB will NOT work well: one insert with 10,000 rows will work fine, but 10,000 inserts with 1 row each will kill it.
We deployed InfiniDB for our customers to 'aid' the query performance of a regular MariaDB installation: we created a tool that imports and updates MariaDB database tables into InfiniDB for faster querying. Manipulations on those tables are still done in MariaDB, and the changes are batch-imported into InfiniDB with about a 30-second delay. Since the original and InfiniDB tables have the same structure and are accessible through the MySQL API, we can simply switch the database connection and get super-fast SELECT queries. This works well for our use case.
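For illustration, a rough sketch of that kind of delayed batch import, assuming a hypothetical invoices table with an updated_at column used as a high-watermark (the real tool is certainly more involved; both servers speak the MySQL protocol, so one connector works for both):

```python
# Sketch: rows changed in the row store (MariaDB) since the last sync are
# re-inserted into the column store in one batch. Table/column names and
# the updated_at high-watermark column are assumptions.
import time
import mysql.connector

ROW_STORE = dict(host="mariadb", user="app", password="...", database="erp")
COL_STORE = dict(host="columnstore", user="app", password="...", database="erp")

def sync_once(last_sync):
    src = mysql.connector.connect(**ROW_STORE)
    dst = mysql.connector.connect(**COL_STORE)
    try:
        cur = src.cursor()
        cur.execute(
            "SELECT id, customer_id, amount, updated_at "
            "FROM invoices WHERE updated_at > %s", (last_sync,))
        rows = cur.fetchall()
        if rows:
            wcur = dst.cursor()
            # Delete-then-insert keeps the copy idempotent; one batched
            # INSERT (many rows) instead of many single-row INSERTs.
            wcur.executemany("DELETE FROM invoices WHERE id = %s",
                             [(r[0],) for r in rows])
            wcur.executemany(
                "INSERT INTO invoices (id, customer_id, amount, updated_at) "
                "VALUES (%s, %s, %s, %s)", rows)
            dst.commit()
        return max((r[3] for r in rows), default=last_sync)
    finally:
        src.close()
        dst.close()

# Poll roughly every 30 seconds, as in the setup described above.
last = "1970-01-01 00:00:00"
while True:
    last = sync_once(last)
    time.sleep(30)
```

The key point, as noted above, is that changes reach the column store in a few large batched statements rather than thousands of single-row inserts.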
We also built new statistics/analytics applications from the ground up to work with InfiniDB and replace an older MySQL-based system, which also works great and exceeds all performance expectations. (We now have 15x the data we had in MariaDB, and it is still easier to maintain and much faster to query.)
[1] https://mariadb.com/products/mariadb-columnstore
I would give Splice Machine a shot (open source). It stores data on HBase and provides the core data management functions that a warehouse provides (primary keys, constraints, foreign keys, etc.).

What data structure to use for time-series data logging in MongoDB

I have 21 million rows (lines in csv files) that I want to import into MongoDB to report on.
The data comes from a process on each PC within our organisation, which creates a row every 15 minutes showing who is logged on.
Columns are: date/time, PC name, username, idle time (if a user is logged on).
I need to be able to report from a PC POV (PC usage metrics) and a User POV (user dwell time and activity/movement).
Initially I just loaded the data using mongoimport. But this raw data structure is not easy to report on. This could simply be my lack of knowledge of MongoDB.
I have been reading http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb which is a great article on schema design for time series data in mongodb.
This makes sense for reporting on PC usage - as I could pre-process the data and load it into Mongo as one document per PC/date combination, with an array of hourly buckets.
However I suspect this would make reporting from the user POV difficult.
I'm now thinking of creating two collections - one for PC data and another for user data (one document per user/date combination, etc.).
I would like to know if I'm on the right track - or if anyone could suggest a better solution, or if indeed the original raw data would suffice and I just need to know how to query it from both angles (some kind of map-reduce).
Thanks
Tim
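As a sketch of the bucketing approach described in the question (one document per PC per day with hourly buckets, plus a second per-user collection), with hypothetical field names and pymongo assumed as the client library:

```python
# Sketch of "one document per PC per day, with hourly buckets", following
# the pattern in the linked MongoDB time-series post. Field names are
# illustrative only.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["usage"]

def record_sample(pc_name, user_name, ts: datetime, idle_minutes):
    day = ts.strftime("%Y-%m-%d")
    hour = str(ts.hour)

    # PC point of view: one doc per PC per day, hourly buckets inside.
    db.pc_daily.update_one(
        {"pc": pc_name, "day": day},
        {"$inc": {f"hours.{hour}.samples": 1,
                  f"hours.{hour}.idle_minutes": idle_minutes}},
        upsert=True,
    )

    # User point of view: a second collection, one doc per user per day.
    db.user_daily.update_one(
        {"user": user_name, "day": day},
        {"$inc": {f"hours.{hour}.samples": 1},
         "$addToSet": {f"hours.{hour}.pcs": pc_name}},
        upsert=True,
    )

# One 15-minute sample: user 'jdoe' on 'PC042', idle for 3 minutes.
record_sample("PC042", "jdoe", datetime(2014, 3, 1, 9, 15), 3)
```

With this layout, a year of reporting for one PC or one user touches at most around 365 documents, and both points of view can be served with the aggregation framework rather than map-reduce.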

Should I use a database instead of serialized files? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I am working on my first real-world application, which consists of keeping track of medical studies for a medium-size medical office. The system needs to keep track of doctors, users, patients, study templates and study reports. The purpose of this program is to apply a preformatted study template to any possible study, keep track of each patient's studies and keep an easy-to-search file system. Each study report is saved in a specific folder as an HTML file that can be used or printed from Windows directly.
I estimate that at any given time there would be about 20 active doctors, 30 different study templates and 12 users; the patients and study reports are cumulative and will remain active indefinitely. I estimate we are talking about 2,000 new patients and 6,000 new study reports a year.
I have almost completed the job, but initially I chose to store the data in a serialized file and did not consider using a database instead. Now, considering that the size of the data will grow rapidly, I believe I should work with a database instead. Among other reasons, I am concerned about the serialized-file choice because any change I make to a class in the future may conflict with the serialized file and stop me from reopening it. I appreciate any comments: how large is too large for a file to work with? Is a serialized file acceptable in this case? Please pass along any ideas or comments. Thanks for the help.
Your concern about breaking compatibility with these files is absolutely reasonable.
I solved the same problem in a small inventory project by taking these steps:
Setup of a DB server (MySQL)
Integration of Hibernate into the project
Reimplementation of the serializable classes within a new package using JPA annotations (if the DB schema won't break, add the annotations to the existing classes)
Generation of the DB schema from the JPA entities
Implementation of an importer for existing objects (deserialization, conversion and persisting with referential integrity)
Import and validation of existing data objects
Any required refactoring from the old classes to the new JPA entities within the whole project
Removal of the old classes and their importer (they should slumber in a repository)
Most people will say that you should use a database regardless. If this is a professional application, you can't risk the data being corrupted, and that is a real possibility, e.g. due to a bug in your code or someone using the program incorrectly.
It is the value of the data, not the size, which matters here. Say it has been running for a year and the file becomes unusable. Are you going to tell them they have to enter all the data again from scratch?
If it's just an exercise, I still suggest you use a database, as you will learn something. A popular choice is Hibernate, and it is CV++. ;)