Data Integration [closed]

I have been looking at the data integration methods Global As View (GAV) and Local As View (LAV), but I cannot find any examples of how queries would be formed for them. Could anyone give me examples of how these methods of data integration can be queried using GAV and LAV, please?
I am specifically asking about GAV and LAV here.
I know that in GAV the mediated schema is described in terms of the data sources, and that in LAV the sources are described in terms of the mediated schema. However, I am not totally sure what those terms mean, nor how they affect the query produced.
There is a Wikipedia page for GAV, but it has no example of a query, and sadly there isn't a Wikipedia page for LAV.

I think these terms are not widely used in industry; the only references I can find for them appear to arise from academic work. They apply to Enterprise Information Integration (EII), a genre of technology where a client-side reporting or integration layer is placed over existing databases without actually persisting the data into a separate reporting database.
Essentially, 'Global As View' describes the case where data is transformed into a unified representation before reporting queries are issued. In a data warehouse (where the data is transformed and persisted into a separate database) this view would be the data warehouse tables. An EII tool can do this by issuing queries to the underlying data sources and merging the results into the centralised schema. EII is not a widely used technology, though.
'Local As View' techniques query all the sources individually and then merge the result sets together. Conceptually, this means writing several queries against the different sources that produce result sets in the same format, sourcing the data from wherever it is found in the underlying systems. The data integration is then done in the reporting layer.
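To make the question's request for query examples concrete, here is a toy sketch of my own (not taken from any particular source) showing how the same user query is formed under each approach. The mediated relation Movie, the sources s1 and s2, and the data are all hypothetical:

    # Mediated (global) schema: Movie(title, year, director)
    # Hypothetical source relations:
    s1 = [("Alien", 1979, "Scott")]        # s1(title, year, director)
    s2 = [("Blade Runner", "Scott")]       # s2(title, director), year unavailable

    # GAV: the mediated relation is defined as a view over the sources, roughly
    #   Movie(t, y, d) :- s1(t, y, d)    and    Movie(t, null, d) :- s2(t, d)
    def movie_gav():
        rows = [(t, y, d) for (t, y, d) in s1]
        rows += [(t, None, d) for (t, d) in s2]
        return rows

    # The user query "titles of movies directed by Scott" is answered by simply
    # unfolding that view definition:
    print(sorted(t for (t, _y, d) in movie_gav() if d == "Scott"))

    # LAV: each source is instead described as a view over the mediated schema, roughly
    #   s1(t, y, d) :- Movie(t, y, d)    and    s2(t, d) :- Movie(t, y, d)
    # The same user query must now be rewritten in terms of the sources (bucket /
    # MiniCon-style algorithms do this automatically); done by hand, the rewriting
    # is the union of one query per relevant source:
    answers = {t for (t, _y, d) in s1 if d == "Scott"} | \
              {t for (t, d) in s2 if d == "Scott"}
    print(sorted(answers))

The practical difference is that in GAV the hard work is done once, when the view definitions are written, and queries are answered by unfolding; in LAV adding a source is easy, but every query has to be rewritten against the source descriptions.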


Should I use a database instead of serialized files? [closed]

I am working on my first real-world application, which consists of keeping track of the medical studies of a medium-sized medical office. The system needs to keep track of doctors, users, patients, study templates and study reports. The purpose of the program is to apply a preformatted study template to any possible study, keep track of each patient's studies and keep an easy-to-search filing system. Each study report is saved in a specific folder as an HTML file that can be used or printed from Windows directly.
I estimate that at any given time there would be about 20 active doctors, 30 different study templates and 12 users; the patients and study reports are cumulative and will remain active indefinitely. I estimate we are talking about 2,000 new patients and 6,000 new study reports a year.
I have almost completed the job, but initially I chose to store the data in a serialized file and did not consider using a database instead. Now, considering that the size of the data will grow rapidly, I believe I should work with a database instead. Among many other reasons, I am concerned about the serialized-file choice because any change I make to a class in the future may conflict with the serialized file and stop me from reopening it. I would appreciate any comments: how large is too large for a file to work with, and is a serialized file acceptable in this case? Please pass along any ideas or comments. Thanks for the help.
Your concern about breaking compatibility with these files is absolutely reasonable.
I solved the same problem in a small inventory project by taking these steps (a rough sketch of the same pattern follows the list):
1. Set up a DB server (MySQL).
2. Integrate Hibernate into the project.
3. Reimplement the serializable classes within a new package using JPA annotations (if the DB schema won't break, add the annotations to the existing classes instead).
4. Generate the DB schema from the JPA entities.
5. Implement an importer for the existing objects (deserialization, conversion and persisting with referential integrity).
6. Import and validate the existing data objects.
7. Do any required refactoring from the old classes to the new JPA entities throughout the project.
8. Remove the old classes and their importer (they can slumber in a repository).
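The steps above are Java-specific. Purely as an illustration of the same migrate-then-import pattern, here is a rough sketch in Python using pickle and SQLAlchemy in place of Java serialization and Hibernate/JPA; every class, column and file name here is hypothetical:

    import pickle
    from sqlalchemy import create_engine, Column, Integer, String, Date
    from sqlalchemy.orm import declarative_base, Session

    Base = declarative_base()

    class Patient(Base):                       # step 3: re-model the data as mapped entities
        __tablename__ = "patient"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)
        birth_date = Column(Date)

    engine = create_engine("sqlite:///clinic.db")   # step 1 (SQLite stands in for MySQL here)
    Base.metadata.create_all(engine)                # step 4: generate the schema from the entities

    # Steps 5-6: import the legacy serialized objects and persist them
    # (assumes the old classes are still importable so the file can be unpickled).
    with open("patients.pickle", "rb") as f:
        legacy_patients = pickle.load(f)

    with Session(engine) as session:
        for old in legacy_patients:
            session.add(Patient(name=old.name, birth_date=old.birth_date))
        session.commit()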
Most people will say that you should use a database regardless. If this is a professional application you can't risk the data being corrupted, and that is a real possibility, e.g. due to a bug in your code or someone using the program incorrectly.
It is the value of the data, not the size which matters here. Say it has been running for a year and the file becomes unusable. Are you going to tell them they should enter all the data again from scratch?
If it's just an exercise, I still suggest you use a database, as you will learn something. A popular choice is Hibernate, and it is CV++. ;)

How to convert a word document into a PostgreSQL table [closed]

I have a Word document that contains data dictionaries.
For example, a variable called FUEL is described as follows:
FUEL -- What type of fuel does it take?
1 Gas
2 Diesel
3 Hybrid
4 Flex fuel
7 OTHER, SPECIFY
I want to convert the document into a PostgreSQL table. Do you have any suggestions?
In general, this sort of thing takes two stages: first, massage the data into a sane tabular format using text-processing tools and scripting, or with something like Excel.
Once you have a tabular format, output the data as CSV (say, with Save As in Excel) and load it into PostgreSQL using the COPY command or psql's \copy, after running appropriate CREATE TABLE commands to define a table structure that matches the structure of the CSV.
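For instance, here is a minimal sketch of the load step driven from Python with psycopg2 rather than psql; the table layout and file names are assumptions based on the FUEL example above:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")          # connection string is an assumption
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS data_dictionary (
                variable    text NOT NULL,
                code        integer NOT NULL,
                label       text NOT NULL,
                PRIMARY KEY (variable, code)
            )
        """)
        with open("dictionary.csv") as f:           # the CSV produced in the first stage
            cur.copy_expert(
                "COPY data_dictionary (variable, code, label) FROM STDIN WITH (FORMAT csv, HEADER)",
                f,
            )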
Edit: Given the updated post, I'd say you probably have to write a simple parser for this, unless the document contains internal structured markup. Save the document as plain text. Then write a script in a language like Perl or Python that looks for the heading that defines a variable, extracts the capitalized variable name and the description from that line, then reads numbered options until it runs out and is ready to read the next variable. If the document is uniformly structured, this should only take a few lines of code with some basic regular expressions; you could probably even do it in awk. Have the script either write CSV ready for importing later, or use a database interface like DBD::Pg (Perl) or psycopg2 (Python) to store the data directly.
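As an illustration, here is a short sketch of such a parser in Python, assuming the plain-text export keeps the shape of the FUEL example ("NAME -- description" followed by "code label" lines); the file names are assumptions:

    import csv
    import re

    heading = re.compile(r"^([A-Z][A-Z0-9_]*)\s*--\s*(.+)$")   # e.g. "FUEL -- What type of fuel does it take?"
    option  = re.compile(r"^\s*(\d+)\s+(.+)$")                 # e.g. "1 Gas"

    rows, variable = [], None
    with open("dictionary.txt") as src:
        for line in src:
            m = heading.match(line)
            if m:
                variable = m.group(1)        # start of a new variable definition
                continue
            m = option.match(line)
            if m and variable:
                rows.append((variable, int(m.group(1)), m.group(2).strip()))

    with open("dictionary.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["variable", "code", "label"])
        writer.writerows(rows)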
If you don't know any scripting tools, you'll either need to learn or get very good at copy and paste.

Validation in postgreSQL [closed]

I made an application to collect data from users. The data will be collected at different locations and sent from those locations to a central server. I need to design a validation plan for the central server in PostgreSQL. The data must be checked against various validation rules, and if a validation fails an error message must be raised.
It is database-to-database transfer validation.
Yes, you're on the right track: you'll use triggers and/or check constraints to do this.
Also, PostgreSQL has a very flexible type system. Make sure to select the most appropriate, restrictive types. You can even define custom types yourself.
UNIQUE constraints
CHECK Constraints
FOREIGN KEY constraints
Triggers, which can call helper functions written in any supported procedural language. Triggers can RAISE EXCEPTION to abort a transaction.
Domain Types
EXCLUSION constraints (PostgreSQL 9.0 and newer)
Multi-column PRIMARY KEYs
Partial UNIQUE indexes
Note that instead of using varchar(length) you're usually better off using text and a check constraint, as in the sketch below.
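For illustration, here is a minimal sketch that applies a few of these mechanisms from Python with psycopg2; the table, columns and rules are hypothetical examples, not taken from the question:

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS reading (
        id          bigserial PRIMARY KEY,
        site_code   text NOT NULL CHECK (site_code ~ '^[A-Z]{3}[0-9]{2}$'),  -- CHECK constraint on text
        recorded_at timestamptz NOT NULL,
        value       numeric NOT NULL CHECK (value >= 0),
        UNIQUE (site_code, recorded_at)                                      -- UNIQUE constraint
    );

    CREATE OR REPLACE FUNCTION reject_future_readings() RETURNS trigger AS $$
    BEGIN
        IF NEW.recorded_at > now() THEN
            RAISE EXCEPTION 'recorded_at % is in the future', NEW.recorded_at;  -- aborts the transaction
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS reading_validate ON reading;
    CREATE TRIGGER reading_validate
        BEFORE INSERT OR UPDATE ON reading
        FOR EACH ROW EXECUTE PROCEDURE reject_future_readings();
    """

    conn = psycopg2.connect("dbname=central")   # connection string is an assumption
    with conn, conn.cursor() as cur:
        cur.execute(DDL)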

Need to transpose a LARGE csv file in perl [closed]

The CSV data file is 3.2 GB in total, with God knows how many rows and columns (assume very many). The file contains genomic SNP data for a population of individuals, so it holds IDs such as TD102230 and genotype values such as A/A and A/T.
I have tried the Text::CSV and Array::Transpose modules but couldn't seem to get it right (as in, the computing cluster froze). Is there a specific module that would do this? I am new to Perl (not much experience in low-level programming; I have mostly used R and MATLAB before), so detailed explanations are especially welcome!
As a direct answer: read the file line by line, process each line with Text::CSV, and push the values onto arrays, one array per original column; then output each array with join (or similar) to get a transposed representation of the original. Disposing of each array right after its join helps with the memory problem, too.
Writing the values to external files instead of in-memory arrays and then joining them with OS facilities is another way around the memory requirements.
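For what it's worth, here is a tiny sketch of the in-memory variant in Python rather than Perl (the file names are hypothetical). It only works when the whole table fits in RAM, which is exactly what failed on the 3.2 GB file, so treat it as the baseline the suggestions above improve on:

    import csv

    with open("genotypes.csv", newline="") as src:
        rows = list(csv.reader(src))            # every row held in memory at once

    columns = zip(*rows)                        # transpose: the i-th output row is input column i

    with open("genotypes_transposed.csv", "w", newline="") as out:
        csv.writer(out).writerows(columns)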
You should also think about why you need this. Is there really no better way to solve the real task at hand, since transposing by itself serves no purpose on its own?
Break down the task into several steps to save memory (a Python sketch of the same idea follows the steps):
1. Read a line and write its fields into a file named after the line number, one field per line.
2. Repeat step 1 until the input CSV file is exhausted.
3. Use paste to merge all the per-line files into one big output file.
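A sketch of those steps in Python rather than Perl (the question asks for Perl, so treat this as an illustration of the approach; file and directory names are assumptions):

    import csv
    import glob
    import os
    import subprocess

    os.makedirs("tmp", exist_ok=True)

    # Steps 1-2: stream the input once, writing each row to its own file,
    # one field per line. Zero-padded names keep the files in row order.
    with open("genotypes.csv", newline="") as src:
        for lineno, row in enumerate(csv.reader(src), start=1):
            with open(f"tmp/row_{lineno:09d}.txt", "w") as out:
                # Assumes fields (IDs, A/A, A/T, ...) contain no commas or newlines.
                out.write("\n".join(row) + "\n")

    # Step 3: let the OS merge the row files column-wise. `paste -d,` joins the
    # k-th line of every file into the k-th output line, which is the transpose.
    # With millions of rows you will hit open-file and argument-length limits,
    # so the merge may need to be done in batches.
    row_files = sorted(glob.glob("tmp/row_*.txt"))
    with open("genotypes_transposed.csv", "w") as out:
        subprocess.run(["paste", "-d,"] + row_files, stdout=out, check=True)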

which is faster? [closed]

Which is faster and more useful: an accumulator, a register, or the stack?
Registers are the fastest. An accumulator is also a register in which intermediate arithmetic and logic results are stored (info from Wikipedia).
A stack will be slower since it's a region of memory, and memory will always be slower than registers.
However, you will always have more memory available than registers since CPU storage is very costly.
Bottom line: they're all useful and their speed is inversely proportional to their available storage.
Questions like this cannot be answered in any useful way without some context about the CPU architecture or about what you want to accomplish.
Usually the accumulator is just one of the registers; modern CPUs don't differentiate any more, so on an older one the accumulator might be faster, or might actually be the only register that allows certain operations. Registers are always faster than external memory, but there is only a limited number of them (and they need to be explicitly named by the compiler/assembler).
The stack is an area of RAM used to store data. So that's slower for sure :)
The question is not quite correct: "fast" relates to the operations, not to the registers and so on. Another point: there is nothing about the CPU architecture in the first message. :-)
Depending on the CPU architecture, the accumulator is a register but may have a special implementation, so operations that use the accumulator are usually faster than ordinary register operations.
About the stack: some processors have no support for register-to-register operations (e.g. an input/output processor). In that case some operations on the stack can be faster because no effective address has to be calculated.
Registers are always faster because the CPU does not have to fetch the data from memory, but be clearer about the situation.
Registers are most useful when you have many of them, as on the x64 or ARM architectures.
Generally, registers are faster because they are actually part of the microprocessor. And the accumulator is just one of the registers (the one that normally stores the result of various operations).
The stack is just memory like any other memory, allocated for the purpose of tracking return addresses and local variables.
But you can't use registers for everything because there are only a very limited number of them available.
If you explained why you were asking these questions, they might make a little more sense.