We are working towards migrating data from MongoDB to Teradata (DW).
We feel that transformations on the data will be necessary.
Could you please help me answer the questions below, which will guide us in developing a migration solution:
Which would be the best and most efficient format for exporting data from MongoDB to load into Teradata (DW), considering the transformations involved? (CSV/JSON/other)
Transformations could include omission of line(s) from the exported file, omission of fields, aggregation (sum/count) across fields, etc.
If we develop a framework for ETL, will Java be a good choice?
We noticed that '\n' (the newline character) is also part of some records, so we are seeing blank lines in between in the CSV.
Do we need to be concerned about choosing the right line delimiter, or can the export format help us in this regard?
We are seeing some records getting truncated because the length of the record exceeds 1024 characters.
We get a 'Line too long' message in the vi editor, and we don't have an alternate editor on our system. Is there a way around this apparent truncation?
CSV is not particularly well-specified - there are several variants of it in the wild with slightly different escaping behaviors. I almost always prefer anything-but-csv.
JSON
Yes
This is not a question, but ok.
Don't edit the data with vi; this is purely a limitation of the editor, not of the export format. Do the transformations programmatically.
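As a rough illustration only (not a prescribed implementation), the kinds of transformations listed in the question can be applied to a mongoexport JSON dump with a few lines of scripting before the Teradata load. The file name, field names, and the aggregation below are invented for the example.

# Hedged sketch: filter/trim/aggregate records from a mongoexport --jsonArray dump
# before producing a delimiter-safe load file. "orders.json", "status", "amount",
# etc. are hypothetical names, not taken from the original question.
import json
from collections import defaultdict

with open("orders.json", encoding="utf-8") as f:
    records = json.load(f)                      # exported with: mongoexport --jsonArray

totals = defaultdict(float)                     # aggregation (sum) per region
kept = []
for rec in records:
    if rec.get("status") == "cancelled":        # omission of records
        continue
    rec.pop("internal_notes", None)             # omission of fields (may contain '\n')
    totals[rec.get("region", "unknown")] += rec.get("amount", 0)
    kept.append(rec)

# Write a load file for Teradata; '|' is only an example delimiter, and embedded
# newlines are replaced so each record stays on one line.
with open("orders_clean.txt", "w", encoding="utf-8") as out:
    for rec in kept:
        fields = [str(rec.get(k, "")).replace("\n", " ")
                  for k in ("order_id", "region", "amount")]
        out.write("|".join(fields) + "\n")

The per-region totals could be written out the same way to feed a summary table, if needed.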
Hey guys, I work as a DB consultant. I am new to the DataStage software and have been running into some issues. I need to convert .xls files to .txt files and then upload them, but I am running into issues with the special characters below. I am not sure what data type or syntax I could use to upload the data to the DB, as the job crashes every time it encounters anything other than plain alphanumeric characters, e.g.:
Á, ', ’, Ç, Í, `, É, etc.
What attempts have we made?
Convert(Convert('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ', '', Column_Name1), '', Column_Name1)
What issues persist?
Á,',’,Ç,Í,`,É and Spanish tildes
Question?
Is there not a way to bring in the data as-is? I don't want the Convert statement, because it only works on special characters; accented letters and German umlauts are part of the word, and I don't want to replace them with a space or remove them altogether. How do I handle this error?
Check out the Unstructured Data stage.
The Knowledge Center provides a lot of information on how to design a job with an Excel file as the data source.
In addition to that, there are blogs and videos out there as well.
Once you understand the design/interface of the stage, it is not difficult to do.
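Outside of DataStage itself, a minimal sketch of the general point: accented characters survive an .xls-to-.txt conversion as long as an explicit Unicode encoding is used end to end. The file names are placeholders, and pandas/xlrd are assumptions here, not part of the original setup.

# General illustration only, not a DataStage job: keep Á, Ç, É, ü, ñ, etc. intact
# by reading the workbook and writing UTF-8 text, rather than stripping characters.
import pandas as pd

df = pd.read_excel("input.xls")   # legacy .xls files need the xlrd engine installed

# Tab-delimited UTF-8 output; no Convert()-style replacement of accented letters.
df.to_csv("output.txt", sep="\t", index=False, encoding="utf-8")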
I was given the task of deciding whether our technology stack is adequate for the project at hand, or whether we should change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical research studies, separated into topics, or "projects" in their jargon. These studies are sent as CSV files and are roughly structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subject IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as key/value pairs, and there shouldn't be more than 200 fields).
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there's a 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL Server 2008 R2. I've designed EAVs for each segment (before anything, please keep in mind that this is not our first EAV design; it worked well before with less data), and the mid-tier/front-end is PHP 5.3 with the Laravel 4 framework and Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL Server in a timely fashion when there are more than 100k rows, because there's a lot of pivoting involved and, on top of that, PHP needs to fetch all the field IDs first before it can start inserting.

I'll explain: this is necessary because the client wants some control over field names. We created a repository of all possible fields to try to minimize ambiguity problems; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user first has to insert his CSV fields into another table, which we call the properties table. This won't completely solve the problem, but as he's inserting the fields he sees possible matches that were already inserted: when the user types in "blood", a panel shows all the fields already used containing the word "blood". If the user thinks it's the same thing, he has to change the CSV header to that field. Anyway, all this is to explain that it's not a simple EAV structure, and there's a lot of back and forth of IDs.
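To make the pivot concrete, here is a rough sketch of the row-to-EAV explosion described above (in Python purely for brevity; the file name, column names, and property IDs are invented for the example):

# Illustrative only: one CSV row with N dynamic fields becomes N EAV rows,
# after each header has been resolved to a property ID from the properties table.
import csv

# In the real application these IDs come from the properties table lookup.
property_ids = {"Blood Pressure": 17, "Heart Rate": 23}

eav_rows = []
with open("segment.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        key = (row["Date"], row["SubjectID"])          # the fixed/PK fields
        for field, value in row.items():
            if field in ("Date", "SubjectID"):
                continue
            # The header must already be registered in the properties table.
            eav_rows.append((*key, property_ids[field], value))

# A file with a couple of million rows and ~200 dynamic fields explodes into
# hundreds of millions of EAV rows here, which is why row-at-a-time inserts
# plus the ID round trips from the mid-tier become the bottleneck.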
This issue is giving us second thoughts about our technology stack choice, but we have limitations on our possible choices: I have only worked with relational DBs so far (only SQL Server, actually) and the other guys know only PHP. I guess a full MS stack is out of the question.
It seems to me that a non-SQL approach would be best. I've read a lot about MongoDB, but honestly I think it would be a very steep learning curve for us, and if they want to start crunching the numbers or even have some reporting capabilities, I guess Mongo wouldn't be up to that. I'm now reading about PostgreSQL, which is relational, and its famous HStore type. So here is where my questions start:
Would you guys think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the CSV files into JSON objects or similar to be stored in HStore fields and still be somewhat queryable?
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins. Nor do we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables, bulk insert, or another technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)
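Regarding the second question above (getting the CSV into something queryable), here is a minimal sketch assuming PostgreSQL 9.4+ and the psycopg2 driver. It uses jsonb rather than hstore (hstore is limited to flat text values), and every table, column, and file name below is invented for the example, not taken from the post:

# Rough sketch only: the fixed fields become real columns, all dynamic fields
# go into one jsonb column instead of an EAV pivot.
import csv
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=research user=app")    # hypothetical connection
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS segment_data (
        record_date date,
        subject_id  text,
        extra       jsonb,                              -- the ~200 dynamic fields
        PRIMARY KEY (record_date, subject_id)
    )
""")

with open("toxicology.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        record_date = row.pop("Date")
        subject_id = row.pop("SubjectID")
        cur.execute(
            "INSERT INTO segment_data (record_date, subject_id, extra)"
            " VALUES (%s, %s, %s)",
            (record_date, subject_id, Json(row)),       # remaining columns -> jsonb
        )

conn.commit()
# Example query: SELECT extra->>'Blood Pressure' FROM segment_data WHERE subject_id = 'S01';

jsonb columns can be GIN-indexed, so the dynamic fields stay searchable without the per-field ID round trips.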
I have several (1 million+) documents, email messages, etc., that I need to index and search through. Each document potentially has a different encoding.
What products (or configuration for the products) do I need to learn and understand to do this properly?
My first guess is something Lucene-based, but this is something I'm just learning as I go. My main desire is to start the time-consuming encoding process ASAP so that we can concurrently build the search front end. This may require some sort of normalisation of double-byte characters.
Any help is appreciated.
Convert everything to UTF-8 and run it through Normalization Form D, too. That will help for your searches.
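For reference, a tiny sketch of that normalisation step in Python (the strings are just examples):

# Normalise to NFD (decomposed form) after decoding everything to UTF-8, so that
# precomposed and decomposed accented characters compare and search the same.
import unicodedata

raw = b"r\xc3\xa9sum\xc3\xa9"                  # UTF-8 bytes for "résumé"
text = raw.decode("utf-8")
nfd = unicodedata.normalize("NFD", text)       # 'é' -> 'e' + combining acute accent

assert unicodedata.normalize("NFD", "re\u0301sume\u0301") == nfd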
You could try Tika.
Are you implying you need to transform the documents themselves? This sounds like a bad idea, especially on a large, heterogeneous collection.
A good search engine will have robust encoding detection. Lucene does, and Solr uses it (Hadoop isn't a search engine). And I don't think it's possible to have a search engine that doesn't use a normalised encoding in its internal index format. So normalisation won't be a selection criterion, though trying out the encoding detection would be.
I suggest you use Solr. The ExtractingRequestHandler handles encodings and document formats. It is relatively easy to get a working prototype using Solr. DataImportHandler enables importing a document repository into Solr.
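A minimal sketch of pushing one document through the ExtractingRequestHandler over plain HTTP; the Solr URL, core name ("docs"), document ID, and file name are placeholders:

# Hedged example: POST a file to Solr's /update/extract endpoint
# (the ExtractingRequestHandler / Solr Cell). Assumes a running Solr core.
import requests

with open("message.eml", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/docs/update/extract",
        params={"literal.id": "message-0001", "commit": "true"},
        files={"file": f},
    )
resp.raise_for_status()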
I intend to build a RESTful service which will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row based text format.
CSV is an obvious candidate. I'm wondering, however, whether there isn't something better out there. The only ones I've found through a bit of research are CTX and Fielded Text.
I'm looking for a format which offers the following:
Plain text, easy to read
Very easy to parse by most software platforms
Column definitions can change without requiring changes in software clients
Fielded Text is looking pretty good, and I could definitely build a specification myself, but I'm curious to know what others have done, given that this must be a very old problem. It's surprising that there isn't a better standard out there.
What suggestions do you have?
I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, newline at the end of each row).
I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.
If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.
And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
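For what it's worth, a small sketch of the gzip idea, framework-agnostic and with a made-up payload:

# Compressing a verbose JSON payload before sending it; clients that send
# "Accept-Encoding: gzip" can decompress transparently on the other side.
import gzip
import json

rows = [{"id": i, "name": f"item-{i}", "value": i * 1.5} for i in range(10_000)]
payload = json.dumps(rows).encode("utf-8")
compressed = gzip.compress(payload)

print(len(payload), "->", len(compressed), "bytes")   # redundancy compresses well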
You could try YAML; its overhead is relatively small compared to formats such as XML or JSON.
Examples here: http://www.yaml.org/
Surprisingly, the website's text itself is YAML.
I have been thinking on that problem for a while. I came up with a simple format that could work very well for your use case: JTable.
{
"header": ["Column1", "Column2", "Column3"],
"rows" : [
["aaa", "xxx", 1],
["bbb", “yyy”, 2],
["ccc", “zzz”, 3]
]
}
If you wish, you can find a complete specification of the JTable format, with details and resources, but it's pretty self-explanatory and any programmer would know how to handle it. The only thing you really need to say is that it's JSON.
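Consuming it is as simple as zipping the header with each row; a throwaway sketch:

# Tiny sketch: turn the JTable payload above into a list of dicts.
import json

payload = """
{
  "header": ["Column1", "Column2", "Column3"],
  "rows": [
    ["aaa", "xxx", 1],
    ["bbb", "yyy", 2],
    ["ccc", "zzz", 3]
  ]
}
"""

table = json.loads(payload)
records = [dict(zip(table["header"], row)) for row in table["rows"]]
print(records[0])   # {'Column1': 'aaa', 'Column2': 'xxx', 'Column3': 1}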
Looking through the existing answers, most struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:
ORC: 'Optimized Row Columnar' is a column-oriented storage format, usable from Python/Pandas. It originated in Hive and was optimised by Hortonworks. The schema is stored in the file footer. The Wikipedia entry is currently quite terse (https://en.wikipedia.org/wiki/Apache_ORC), but Apache has a lot of detail.
Parquet : Similarly column-based, with similar compression. Often used with Cloudera Impala.
Avro: from Apache Hadoop. Row-based, but uses a JSON schema. Less capable support in Pandas. Often found in Apache Kafka clusters.
All are splittable, none are human-readable, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where accumulated data are read often; for write-heavy workloads, Avro may be more suited. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
Compression of the column formats can use SNAPPY (faster) or GZIP (slower but more compression).
You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).
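As a quick illustration of the column formats from Python (this assumes pandas with the pyarrow engine installed; the file name and data are invented):

# Write and re-read a Parquet file with Snappy compression via pandas/pyarrow.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "label": ["aaa", "bbb", "ccc"]})
df.to_parquet("example.parquet", compression="snappy")   # "gzip" trades speed for size

back = pd.read_parquet("example.parquet")
print(back.dtypes)    # the schema travels with the file, unlike CSV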
I'm guessing this won't apply to 99.99% of anyone that sees this. I've been doing some Sawtooth survey programming at work and I've been needing to create a webpage that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates and how you went about doing it. I only know very basic Perl and the server I use does not have PHP so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated. It's not a single file and it occasionally gets moved to a different file with a different format. The complexities added in there are why I ask this question.
Doesn't Sawtooth export into CSV format? There are many Perl parsers for CSV files. Just about every language has a CSV parser or two (or twelve), and MS Excel can open them directly, and they're still plaintext so you can look at them in any text editor.
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions or just taking substrings of each line one at a time, assuming you know the width of the fields. Your question is too general to give much better advice, sorry.
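For the fixed-width case, a hedged sketch of that slicing approach; the answer mentions Perl, but the idea is the same in any language, shown here in Python. The field names and widths are invented, since the real layout wasn't posted.

# Illustrative only: parse a fixed-width "flat file" by slicing each line at known
# column offsets. The (name, width) layout below is made up for the example.
layout = [("respondent_id", 8), ("q1", 2), ("q2", 2), ("comment", 40)]

def parse_line(line):
    record, pos = {}, 0
    for name, width in layout:
        record[name] = line[pos:pos + width].strip()
        pos += width
    return record

with open("survey.dat", encoding="utf-8") as f:
    records = [parse_line(line.rstrip("\n")) for line in f]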
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly though the easiest thing to do if you're posting statistical data on a website is get the data into SPSS or SAS or another statistics package format and post those files for download directly. Then you don't have to worry about it.