Need to transpose a LARGE CSV file in Perl [closed]

The CSV data file is 3.2 GB in total, with god knows how many rows and columns (assume very large). The file contains genomics data, namely SNP data for a population of individuals, so it holds IDs such as TD102230 and genetic data such as A/A and A/T.
I have tried the Text::CSV and Array::Transpose modules but couldn't seem to get it right (as in, the computing cluster froze). Is there a specific module that would do this? I am new to Perl (not much experience with low-level programming; I have mostly used R and MATLAB before), so detailed explanations are especially welcome!

As a direct answer: read the file line by line, parse each line with Text::CSV, and push the values onto arrays, one array per original column; then output each array with join (or the like) to get the transposed representation of the original. Disposing of each array right after its join will help with the memory problem too.
Writing the values to external files instead of arrays and joining them with OS facilities is another way around the memory requirements.
You should also think about why you need this. Is there really no better way to solve the real task at hand, since transposing by itself serves no real purpose?
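For illustration, here is a minimal Perl sketch of the array-based approach; the file names are placeholders, the fields are assumed to contain no embedded commas, and it still holds every column in memory at once, which may be too much for a 3.2 GB file (in that case see the file-based approach below).

    #!/usr/bin/env perl
    # Minimal sketch of the array-based approach described above.
    # Assumes plain comma-separated fields with no embedded commas or newlines;
    # "input.csv" and "transposed.csv" are placeholder names.
    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<', 'input.csv' or die "input.csv: $!";

    my @columns;    # $columns[$i] collects the values of original column $i
    while (my $row = $csv->getline($fh)) {
        push @{ $columns[$_] }, $row->[$_] for 0 .. $#$row;
    }
    close $fh;

    open my $out, '>', 'transposed.csv' or die "transposed.csv: $!";
    for my $col (@columns) {
        print {$out} join(',', @$col), "\n";   # each original column becomes one row
        @$col = ();                            # dispose of the array right after the join
    }
    close $out;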

Break the task down into several steps to save memory.
Step 1: Read a line and write its fields to a file named after the line number, one field per line.
Step 2: Repeat step 1 until the input CSV file is exhausted.
Step 3: Use paste to merge all the output files into one big one.
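A sketch of steps 1 and 2 in Perl, with paste doing step 3 from the shell; the row_*.txt naming is only an example (zero-padded so the shell glob keeps the original row order), and the fields are assumed to contain no embedded commas or newlines.

    #!/usr/bin/env perl
    # Sketch of steps 1 and 2: one output file per input line, one field per line.
    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    open my $in, '<', 'input.csv' or die "input.csv: $!";

    my $n = 0;
    while (my $row = $csv->getline($in)) {
        my $name = sprintf 'row_%06d.txt', ++$n;   # zero-padded so row_*.txt sorts correctly
        open my $out, '>', $name or die "$name: $!";
        print {$out} "$_\n" for @$row;
        close $out;
    }
    close $in;

    # Step 3, from the shell:
    #   paste -d, row_*.txt > transposed.csv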

Related

Should I use a database instead of serialized files? [closed]

I am working on my first real-world application, which consists of keeping track of medical studies for a medium-sized medical office. The system needs to keep track of doctors, users, patients, study templates and study reports. The purpose of this program is to apply a preformatted study template to any possible study, keep track of each patient's studies, and maintain an easy-to-find file system. Each study report is saved in a specific folder as an HTML file that can be used or printed from Windows directly.
I estimate that at any given time there would be about 20 active doctors, 30 different study templates and 12 users; the patients and study reports would be cumulative and would remain active indefinitely. I estimate that we are talking about 2000 new patients and 6000 new study reports a year.
I have almost completed the job, but initially I chose to store the data in a serialized file and did not consider using a database instead. Now, considering that the size of the data will grow rapidly, I believe I should work with a database instead. Among many other reasons, I am concerned about the serialized-file choice because any change I make to a class in the future may conflict with the serialized file and stop me from reopening it. I would appreciate any comments: how large a file is too large to work with? Is a serialized file acceptable in this case? Please pass along any ideas or comments. Thanks for the help.
Your concern about breaking compatibility with these files is absolutely reasonable.
I solved the same problem in a small inventory project by taking these steps:
Setup of a DB server (MySQL)
Integration of hibernate into the project
Reimplementation of the serializable classes within a new package using JPA annotations (if the DB schema won't break, add the annotations to existing classes)
Generation of the DB schema using the JPA entities
Implementation of an importer for existing objects (deserialization, conversion and persisting with referential integrity)
Import and validation of existing data objects
Any required refactoring from old classes to the new JPA entities within the whole project
Removal of old classes and their importer (should slumber in a repository)
Most people will say that you should use a database regardless. If this is a professional application, you can't risk the data being corrupted, and that is a real possibility, e.g. due to a bug in your code or someone using the program incorrectly.
It is the value of the data, not the size which matters here. Say it has been running for a year and the file becomes unusable. Are you going to tell them they should enter all the data again from scratch?
If it's just an exercise, I still suggest you use a database, as you will learn something. A popular choice is Hibernate, and it's CV++. ;)

Which data structure would you use to store the pointers to the adjacency linked lists for a very big graph? [closed]

Suppose you want to implement a graph which can have a million nodes. The node count will grow from 0 towards a million; it is uncertain whether it will reach the million mark, and it may also grow past it to multiple millions of nodes.
I know an adjacency list is what is used for this. But a typical adjacency list implementation needs a data structure that maintains the pointers to the linked lists.
What data structure, then, should be used to store the pointers to the adjacency lists?
Take Facebook, for example: it has millions of users. Suppose each user represents a node. Now all users are represented as nodes of a single very big graph, and you want to do operations on it. How would you store it?
Well, if you know the basics behind hash tables, it shouldn't be too hard.
Generally you create an array of "buckets", where each bucket holds a key and a value, plus an optional pointer used to build a linked list.
When you access the hash table with a key, you run the key through a hash function, which returns an integer. You then take that result modulo the number of buckets, and that is the index of your bucket in the array. You compare the unhashed key with the stored key, and if they match, you have found the right place.
Otherwise, you've had a "collision" and must crawl through the linked list, comparing keys until you find a match. (Note that some implementations use a binary tree instead of a linked list for collisions.)
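Here is a toy Perl sketch of that bucket-and-chain mechanism, just to make it concrete; in real Perl code you would simply use a built-in hash (%h), which already does all of this internally.

    #!/usr/bin/env perl
    # Toy chained hash table, only to illustrate the bucket/modulus/collision idea.
    use strict;
    use warnings;

    my $NBUCKETS = 8;
    my @buckets;    # each bucket is a reference to a list of [key, value] pairs

    sub bucket_index {
        my ($key) = @_;
        my $h = 0;
        $h = ($h * 31 + ord($_)) % 2**32 for split //, $key;  # simple string hash
        return $h % $NBUCKETS;                                # modulus picks the bucket
    }

    sub put {
        my ($key, $value) = @_;
        my $chain = $buckets[ bucket_index($key) ] ||= [];
        for my $pair (@$chain) {                 # walk the collision chain
            if ($pair->[0] eq $key) { $pair->[1] = $value; return }
        }
        push @$chain, [ $key, $value ];
    }

    sub get {
        my ($key) = @_;
        my $chain = $buckets[ bucket_index($key) ] or return undef;
        for my $pair (@$chain) {
            return $pair->[1] if $pair->[0] eq $key;   # compare the unhashed keys
        }
        return undef;
    }

    put('TD102230', 'A/T');
    print get('TD102230'), "\n";    # prints A/T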
Check out this fast hash table implementation:
http://attractivechaos.awardspace.com/khash.h.html
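Coming back to the original graph question: in Perl the bucket machinery above is built into the language, so the adjacency lists for a large graph are usually just kept in a native hash of arrays keyed by node ID. A minimal sketch (the node IDs and edges are made up):

    #!/usr/bin/env perl
    # Adjacency lists kept in a native Perl hash of arrays, keyed by node ID.
    use strict;
    use warnings;

    my %adjacency;    # node ID => reference to array of neighbour IDs

    sub add_edge {
        my ($from, $to) = @_;
        push @{ $adjacency{$from} }, $to;
        push @{ $adjacency{$to} },   $from;   # undirected graph
    }

    add_edge('user_1', 'user_2');
    add_edge('user_1', 'user_3');

    for my $node (sort keys %adjacency) {
        print "$node -> @{ $adjacency{$node} }\n";
    }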

How to convert a Word document into a PostgreSQL table [closed]

I have a Word document that contains data dictionaries.
For example, a variable called FUEL is described as follows:
FUEL -- What type of fuel does it take?
1 Gas
2 Diesel
3 Hybrid
4 Flex fuel
7 OTHER, SPECIFY
I want to convert the document into a PostgreSQL table. Do you have any suggestions?
In general, this sort of thing takes two stages. First, massage the data into a sane tabular format using text-processing tools and scripting, or with something like Excel.
Once you have a tabular format, output the data as CSV (say, with Save As in Excel) and load it into PostgreSQL using the COPY command or psql's \copy after running appropriate CREATE TABLE commands to define a table structure that matches the structure of the CSV.
Edit: Given the updated post, I'd say you probably have to write a simple parser for this, unless the document contains internal structured markup. Save the document as plain text. Then write a script in a language like Perl or Python that looks for the heading that defines a variable, extracts the capitalized variable name and the description from that line, and then reads numbered options until it runs out and is ready for the next variable. If the document is uniformly structured, this should only take a few lines of code with some basic regular expressions; you could probably even do it in awk. Have the script either write CSV ready for importing later, or use a database interface like DBD::Pg (Perl) or psycopg2 (Python) to store the data directly.
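As an example, here is a minimal Perl sketch of such a parser for the FUEL-style layout shown above. It writes CSV to standard output for a hypothetical data_dictionary table; the regular expressions would need adjusting to the document's actual structure.

    #!/usr/bin/env perl
    # Minimal sketch of the parser described above, for text in the FUEL-style
    # layout shown in the question. The table and column names are placeholders.
    use strict;
    use warnings;

    my $variable;
    print "variable,code,label\n";
    while (my $line = <>) {
        chomp $line;
        if ($line =~ /^([A-Z][A-Z0-9_]+)\s+--\s+(.*)$/) {
            $variable = $1;                      # e.g. "FUEL -- What type of fuel does it take?"
        }
        elsif (defined $variable && $line =~ /^\s*(\d+)\s+(.+?)\s*$/) {
            my ($code, $label) = ($1, $2);       # e.g. "1 Gas"
            $label =~ s/"/""/g;                  # escape double quotes for CSV
            print qq{$variable,$code,"$label"\n};
        }
    }

    # Then load it in psql, assuming a matching table exists:
    #   CREATE TABLE data_dictionary (variable text, code int, label text);
    #   \copy data_dictionary FROM 'dictionary.csv' CSV HEADER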
If you don't know any scripting tools, you'll either need to learn or get very good at copy and paste.

SHA1 brute force program [closed]

Let's say that, given two hash codes, you have been given the values of characters 1 to 9. The remaining characters are unknown, and the message length is unknown too.
It happens that these two hash codes were generated from two different plaintexts in which only the first character differs; the remaining characters are exactly the same.
First hash code = *********************
Second hash code = *********************
plaintext1 = 1************************
plaintext2 = 2************************
Is it possible to brute-force this to recover the plaintext?
Brute-forcing is always possible; whether it is applicable depends on your intention.
Finding a collision (password login)
If you only need to find a collision (a value that results in the same hash value), brute-forcing is applicable. An off-the-shelf GPU is able to calculate 3 billion SHA-1 hash values per second. That's why a fast hash function like SHA-1 is a bad choice for hashing passwords; instead one should use a key derivation function like BCrypt or PBKDF2.
Finding original password
Finding a collision will be relatively fast; finding the original password (not just a collision) can take much longer, depending on the strength of the password.
With a good cryptographic hash function, knowledge of the shared characters should give you no advantage.
Modification of plaintext (digital signature)
If you want to alter the plaintext so that it produces the same hash value, then you will probably spend the rest of your life looking for such a text. This is much harder, because the new text also has to make sense.
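To make the brute-force discussion concrete, here is a naive, single-threaded Perl sketch using Digest::SHA. The target digest is just the well-known SHA-1 of "abc", and the alphabet and maximum length are deliberately tiny so the example finishes quickly; a real attack would be GPU-based and massively parallel, as noted above.

    #!/usr/bin/env perl
    # Naive brute force against a SHA-1 digest, for illustration only.
    use strict;
    use warnings;
    use Digest::SHA qw(sha1_hex);

    my $target   = 'a9993e364706816aba3e25717850c26c9cd0d89d';   # sha1_hex("abc")
    my @alphabet = ('a' .. 'z');

    my @queue = ('');                       # candidates of the previous length
    for my $len (1 .. 4) {
        my @next;
        for my $prefix (@queue) {
            for my $ch (@alphabet) {
                my $candidate = $prefix . $ch;
                if (sha1_hex($candidate) eq $target) {
                    print "found: $candidate\n";
                    exit;
                }
                push @next, $candidate;
            }
        }
        @queue = @next;
    }
    print "not found within the search space\n";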
Cryptographic hash algorithms are designed to spread small changes in the plaintext across the whole of the computed hash. The kind of attack you're asking about is not feasible.

Data Integration [closed]

I have been looking at the data integration methods Global As View (GAV) and Local As View (LAV), but I cannot find any examples of how queries would be formed for them. Could anyone give me examples of how queries are formed using GAV and LAV?
I am specifically asking about GAV and LAV here.
I know that GAV (Global As View) is described over the data sources and that LAV (Local As View) is described over the mediated schema. However, I am not totally sure what those terms mean, nor how they affect the queries produced.
There is a Wikipedia page for GAV, but it has no example of a query, and sadly there isn't a Wikipedia page for LAV.
I think these terms are not widely used in industry - the only references I can see for them appear to arise from academic work. They apply to Enterprise Information Integration - a genre of technology where a client-side reporting or integration layer is placed over existing databases without actually persisting the data into a separate reporting database.
Essentially, 'Global As View' describes the approach where data is transformed into a unified representation before reporting queries are issued. In a data warehouse (where the data is transformed and persisted into a separate database) this view would be the data warehouse tables. An EII tool can do this by issuing queries to the underlying data sources and merging the results into the centralised schema. EII is not a widely used technology, though.
'Local as view' techniques query all the sources individually and then merge the result sets together. Conceptually, this is an act of making up several queries to the different sources that produce result sets in the same format, but source the data from wherever it is found in the underlying systems. The data integration is then done in the reporting layer.
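As a rough illustration of that 'local as view' style in Perl with DBI, each source gets its own query shaped to a common format and the merge happens in the reporting layer; the DSNs, credentials, tables and columns below are all invented.

    #!/usr/bin/env perl
    # Sketch of querying each source individually with queries that produce
    # result sets in the same (name, email) shape, then merging the results.
    use strict;
    use warnings;
    use DBI;

    my @sources = (
        { dsn => 'dbi:Pg:dbname=crm',
          sql => 'SELECT full_name AS name, email FROM customers' },
        { dsn => 'dbi:mysql:database=orders',
          sql => 'SELECT cust_name AS name, mail AS email FROM clients' },
    );

    my @merged;
    for my $src (@sources) {
        my $dbh  = DBI->connect($src->{dsn}, 'user', 'password', { RaiseError => 1 });
        my $rows = $dbh->selectall_arrayref($src->{sql}, { Slice => {} });  # array of hash refs
        push @merged, @$rows;       # the integration happens here, in the reporting layer
        $dbh->disconnect;
    }

    print "$_->{name} <$_->{email}>\n" for @merged;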