Data Transformation Help - Variety of Documents - Distinct Fields - mongodb

Say I want to transfer data from one MongoDB cluster with 50 million records to another one where the self-imposed 'schema' has changed drastically, and I want to test the import + conversion before actually running it.
I am able to find a list of distinct fields just fine, but I want to pull a variety of documents so that every distinct field is represented. This data would then be the source for testing my Map-Reduce script.
The issue arose from many years of use and changes in the way the data is stored. What originally was user.orgId became user.organizationid.
Any suggestions? Even on 3rd party tools?
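For reference, one way to build such a sample set is sketched below in Python with pymongo; the connection details, collection name, and field list are placeholders rather than anything from the question. Given the list of distinct fields you already have, it simply pulls one document containing each field:

```python
from pymongo import MongoClient

coll = MongoClient()["source_db"]["users"]   # hypothetical database/collection

# The distinct field names you already extracted (placeholder values here).
distinct_fields = ["user.orgId", "user.organizationid", "email", "createdAt"]

samples = {}
for field in distinct_fields:
    # Grab one document that actually contains this field.
    doc = coll.find_one({field: {"$exists": True}})
    if doc is not None:
        samples[field] = doc

# samples.values() is now a small, varied corpus covering every field,
# suitable as input for a dry run of the Map-Reduce conversion.
```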

Basically it seems like you have two related questions:
1. How can I run an import and conversion without affecting the final collection?
2. How can I verify that the documents in a collection match a particular schema definition?
Both questions have a variety of appropriate answers.
For question 1:
a. You can create a temporary duplicate of your cluster, then run your import and conversion in that environment. This is the safest way.
b. You can simply run the import and conversion with a different final collection (see the sketch below). This isn't as safe as (a), because it requires the developer to be diligent about selecting the appropriate collections at test time and at final deployment time.
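A minimal sketch of option (b) in Python with pymongo, assuming hypothetical database and collection names and using the orgId rename from the question as the example conversion:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]                      # hypothetical database name

source = db["users"]                     # existing data
target = db["users_converted_TEST"]      # throwaway target, NOT the real final collection

# Dry run over a small slice of the 50M records.
for doc in source.find().limit(10000):
    user = doc.get("user")
    if isinstance(user, dict) and "orgId" in user:
        user["organizationid"] = user.pop("orgId")   # the rename described in the question
    target.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```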
For question 2:
This depends very much on the environment you are developing for, which I don't know anything about. But, for the sake of an example, if you were working in Python you could use something like https://pypi.python.org/pypi/jsonschema and iterate over each document, confirming that it conforms to the schema you require. If you already have an ODM in place, with mappings that describe the schema, it should be possible to validate documents using those mappings.
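To make the jsonschema idea concrete, here is a rough Python sketch; the schema, database, and collection names are invented for illustration:

```python
from jsonschema import Draft4Validator
from pymongo import MongoClient

# Hypothetical target shape after the conversion.
expected = {
    "type": "object",
    "required": ["user"],
    "properties": {
        "user": {
            "type": "object",
            "required": ["organizationid"],
            "properties": {"organizationid": {"type": "string"}},
        }
    },
}

validator = Draft4Validator(expected)
coll = MongoClient()["mydb"]["users_converted_TEST"]

failures = 0
for doc in coll.find():
    doc.pop("_id", None)                       # ObjectId is not JSON-serializable
    errors = list(validator.iter_errors(doc))
    if errors:
        failures += 1
        print(doc, [e.message for e in errors])
print("documents failing validation:", failures)
```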

Related

What is the recommended way to create a consolidated data store for large SPSS files that have survey data (having 600–800 columns)?

Hello everyone, I just need your suggestions on the best way to store data retrieved from SPSS files, whether into MongoDB, an RDBMS, or anything else.
The data comprises responses to survey questionnaires, which can span a large number of columns (600-800) depending on the number of questions and other attributes recorded for the respondent and the survey study. These surveys are also conducted periodically; however, the questions do not necessarily remain exactly the same and may vary from survey to survey.
The need is to consolidate this data into a uniform structure and enable further analysis over the consolidated data spanning multiple surveys, for which the plan is again to use SPSS.
One option I considered was to store the data in MongoDB, since that gives flexibility in how the schema can be modified across surveys, i.e. a rigid schema definition can be avoided. However, in this case I am not sure whether SPSS would support working against Mongo.
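To make the MongoDB option concrete, here is a minimal Python/pymongo sketch (database, collection, and field names are invented) of how responses with varying question sets could be stored and still queried across surveys; whether SPSS can consume this directly is a separate question:

```python
from pymongo import MongoClient

surveys = MongoClient()["research"]["survey_responses"]   # hypothetical names

# Two waves of the same survey with slightly different question sets.
surveys.insert_one({
    "survey": "wellbeing", "wave": 2013, "respondent_id": 101,
    "answers": {"Q1_age": 34, "Q2_income_band": "B", "Q3_satisfaction": 4},
})
surveys.insert_one({
    "survey": "wellbeing", "wave": 2014, "respondent_id": 205,
    "answers": {"Q1_age": 41, "Q3_satisfaction": 5, "Q4_new_question": "yes"},
})

# Consolidated analysis can still query across waves on the common keys.
for doc in surveys.find({"answers.Q3_satisfaction": {"$gte": 4}}):
    print(doc["wave"], doc["respondent_id"])
```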
I would be very interested to know whether someone has experience in this area or could provide some suggestions.
Another thing to consider, if you plan to create generalized jobs that can be run over surveys that are similar but differ in details, is to set up a classification system for the variables, such as demographic, opinion, economic, etc., and assign these using custom attributes when the .sav files are created. You can then use these attributes in generalized jobs to determine what to do based on generic properties rather than tying the code to specific variable names.
You can use SPSSINC SELECT VARIABLES to define macros based on variable properties, including custom attributes, and then use those macros in your syntax in place of specific variable names.
We have seen that an approach like this can dramatically reduce the number of different but similar jobs that an organization would otherwise have to maintain.

Calculating and reporting Data Completeness

I have been working on measuring data completeness and creating actionable reports for our HRIS system for some time.
Until now I have used Excel, but now that the reporting requirements have stabilized and the need for quicker response times has increased, I want to move the work to another level. At the same time, I would also like more detailed options for distinguishing between different units.
As an example I am looking at missing fields. So for each employee in every company I simply want to count how many fields are missing.
For other fields I am looking to validate data: birthdays compared to hiring dates, thresholds for different values, employee groups compared to responsibility level, and so on.
My question is where to go from here. Is there any language that is better than the others for importing lists, evaluating fields in those lists, and then quantifying the results at the company and other levels? I want to be able to extract data from our different systems, then have a program do all the calculations and summarize the findings in some way. (I consider it to be a good learning experience.)
I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required, but necessary for data integrity) and dumped those to an incomplete-record table that was cleared each night before the process ran. I then sent batch emails to the group responsible for each missing element (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .NET against an Oracle database and sent emails via Lotus Notes, but a similar design should work in just about any environment.
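As a small illustration of the missing-field counting and cross-field checks described above, here is a hedged Python/pandas sketch; the file name, column names, and required-field list are placeholders:

```python
import pandas as pd

REQUIRED = ["birth_date", "hire_date", "cost_center", "manager_id"]   # placeholder list

employees = pd.read_csv("hris_export.csv")   # hypothetical extract

# Count of missing required fields per employee row.
employees["missing_count"] = employees[REQUIRED].isna().sum(axis=1)

# Summarise per company, ready for a report or an email to the data owners.
report = (employees.groupby("company")["missing_count"]
          .agg(["sum", "mean"])
          .rename(columns={"sum": "missing_fields", "mean": "avg_per_employee"}))
print(report)

# Simple cross-field validation, e.g. hired before born.
bad_dates = employees[pd.to_datetime(employees["hire_date"], errors="coerce")
                      < pd.to_datetime(employees["birth_date"], errors="coerce")]
print(f"{len(bad_dates)} employees hired before their birth date")
```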

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical research studies, separated into topics, or projects in their jargon. These studies are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like date, subject IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as key/value pairs, and there shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but 2% have a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL 2008R2. I've designed EAVs for each segment (before anything please keep in mind that this is not our first EAV design. It worked well before with less data.) and the mid-tier/front-end is PHP 5.3 and Laravel 4 framework with Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL in a timely fashion when there are more than 100k rows, because there's a lot of pivoting involved and, on top of that, PHP needs to fetch all the field IDs first before it can start inserting.
I'll explain: this is necessary because the client wants some control over the field names. We created a repository of all the possible fields to try to minimize ambiguity; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user first has to register his csv fields in another table, which we call the properties table. This doesn't completely solve the problem, but as he's entering the fields he sees possible matches already registered: when the user types in "blood", a panel shows all the fields already used containing the word blood. If the user decides it's the same thing, he has to change the csv header to that field.
Anyway, all this is to explain that it's not a simple EAV structure and there's a lot of back and forth of IDs.
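For illustration only, here is a small Python sketch of the general idea of resolving csv headers to canonical field IDs once per file, instead of once per row, before any EAV inserts; the alias table, IDs, and headers are invented and this is not the poster's actual code:

```python
# Canonical field ids keyed by every accepted spelling, as kept in the
# hypothetical "properties" table.
ALIASES = {
    "blood pressure": 101,
    "bp": 101,
    "bloodpressure": 101,
    "blood-pressure": 101,
    "date": 1,
    "subject id": 2,
}

def resolve_header(header: str) -> int:
    """Map one csv header to its canonical field id, or fail loudly."""
    key = header.strip().lower()
    if key not in ALIASES:
        raise KeyError(f"unknown field {header!r}: register it in the properties table first")
    return ALIASES[key]

csv_header = ["Subject ID", "Date", "Blood-Pressure"]

# Resolve all ids once, up front; the per-row inserts then reuse this list
# instead of looking ids up again for every value.
field_ids = [resolve_header(h) for h in csv_header]
print(field_ids)    # -> [2, 1, 101]
```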
This issue is giving us second thoughts about our technology stack, but we have limitations on our possible choices: I have only worked with relational DBs so far, and only SQL Server at that, and the other guys know only PHP. I guess an MS full stack is out of the question.
It seems to me that a non-SQL approach would be the best fit. I have read a lot about MongoDB, but honestly I think it would be a very steep learning curve for us, and if the client wants to start crunching the numbers or even have some reporting capabilities,
I doubt Mongo would be up to that. I'm now reading about PostgreSQL, which is relational, and its famous HStore type. So here is where my questions start:
Would you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects, or something similar, to be stored in HStore fields and remain somewhat queryable? (There is a rough sketch of this further down.)
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins. Nor do we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables, bulk inserts, or some other technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)
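Regarding the question above about converting the csv files into something queryable, here is a hedged Python sketch using psycopg2 and a jsonb column (hstore would look very similar); the table, column, and file names are invented, and this is only one possible layout, not a recommendation for Postgres over SQL Server:

```python
import csv
import psycopg2
from psycopg2.extras import Json

FIXED = ["date", "subject_id"]           # the mandatory fixed fields (placeholder names)

conn = psycopg2.connect("dbname=research user=app")   # hypothetical connection
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS segment_rows (
        project    text,
        segment    text,
        date       date,
        subject_id text,
        dynamic    jsonb,
        PRIMARY KEY (project, segment, date, subject_id)
    )
""")

with open("toxicology_batch_01.csv", newline="") as f:
    for row in csv.DictReader(f):
        fixed = {k: row.pop(k) for k in FIXED}       # remaining keys are the dynamic fields
        cur.execute(
            "INSERT INTO segment_rows (project, segment, date, subject_id, dynamic) "
            "VALUES (%s, %s, %s, %s, %s)",
            ("ProjectX", "toxicology", fixed["date"], fixed["subject_id"], Json(row)),
        )
conn.commit()

# The dynamic fields stay queryable, e.g.:
# SELECT * FROM segment_rows WHERE dynamic->>'blood_pressure' = '120/80';
```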

Automatically trigger operations when creating collections

I know that Mongo creates things on the fly, but I would like to have a server-side script so that each time a new collection is created, Mongo automatically executes that script or set of operations.
The idea is that my application code can be unaware of indexes and sharding configuration etc.
Can I do such thing, and if so, how?
I answered this over on the Google Group: http://groups.google.com/group/mongodb-user/browse_thread/thread/94d19658299f6bcc
The question is quite vague, but I took a shot at it anyway - try being a bit more specific in terms of what you are trying to do and you may get better responses.
There is no such functionality. Implement something inside your application code.
A possible approach is to check whether a collection exists prior to performing any ISUD (insert, select, update, delete) operation against it. It seems a bit of a sledgehammer, though. I'm also not sure how you would know what index to apply to an arbitrarily named collection, unless you are taking some free text from user input and executing that against your mongo install? If you're looking to verify the db structure against an 'expected' structure, then you could look into testing your document structure for inconsistencies.
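One way to approximate this from the application side, since MongoDB offers no server-side hook for collection creation, is to route all collection access through a helper that ensures the indexes on first use. A minimal Python/pymongo sketch, with hypothetical collection and index names:

```python
from pymongo import MongoClient, ASCENDING

INDEX_SPECS = {
    # collection name -> list of index definitions to ensure (placeholders)
    "events": [[("user_id", ASCENDING)], [("created_at", ASCENDING)]],
}

_prepared = set()
db = MongoClient()["mydb"]

def get_collection(name):
    """Return the collection, creating its indexes the first time it is used."""
    if name not in _prepared:
        for keys in INDEX_SPECS.get(name, []):
            db[name].create_index(keys)        # idempotent if the index already exists
        _prepared.add(name)
    return db[name]

# All inserts/updates/etc. go through this helper instead of db[name] directly.
get_collection("events").insert_one({"user_id": 1, "created_at": "2014-01-01"})
```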

Is there a way to get around space usage issues when using long field names in MongoDB?

It looks like having descriptive field names (the ones I like the most) can take up a lot of space in memory for big collections. I don't like the idea of giving them short and cryptic names to save memory, nor do I like the idea of translating field names to shortened ones somewhere in the application.
Is there a way to tell mongo not to store every field name as text?
For now the only thing you can do is vote and wait for SERVER-863 to be resolved. After almost a year of discussion, the status of this issue has been changed to planned but not scheduled...
The workaround is to use document mapping libraries like Spring Data Document or Morphia (in the Java world) and work with nicely named objects. But the underlying database field names are still cryptic.
If you are using an "object-document mapper" library to access MongoDB, many of them provide facilities for using descriptive names within your application code, but storing short names in the database. If your application has a data access layer, it may be possible for you to implement this logic in your application code, as well.
Since you haven't said what language you're using, or whether you're using an ODM at all, I can't provide any more guidance on which ODMs might fit your needs.
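Purely as an illustration (the answer above rightly notes that the asker's language and ODM are unknown), here is how one Python ODM, MongoEngine, supports descriptive names in code with short names on disk via db_field; the model itself is made up:

```python
from mongoengine import Document, StringField, IntField, connect

connect("mydb")                                    # hypothetical database

class UserProfile(Document):
    meta = {"collection": "up"}                    # short collection name on disk
    first_name = StringField(db_field="fn")        # stored as "fn"
    last_name = StringField(db_field="ln")         # stored as "ln"
    organization_id = IntField(db_field="o")       # stored as "o"

UserProfile(first_name="Ada", last_name="Lovelace", organization_id=7).save()
# The BSON on disk only contains _id, fn, ln and o,
# while application code keeps the descriptive attribute names.
```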