Tools for one-off processing and conversion of large data - multicore

I am about to start a research project that will require a lot of data conversion and processing operations. On one hand, the data is rather massive - 10GB is typical for a raw dataset - so efficiency is an issue. On the other hand, many of these operations will be one-off, and rarely re-run, so building a deploy-able application is an overkill. It is not a user application, but mostly an experiment.
Some characteristics and constraints:
A lot of chained format conversions - JSON and XML to tabular format, then some patching, then text indexing, then exporting to some other format, etc.
I have a multi-core machine, but not several machines, at least to begin with.
Data does not fit as a whole in main memory, and from my experience, exploiting several cores is called for.
What are some recommended tools for handling such a project? My preferences are:
Easy-as-possible handling of multiple formats (JSON, XML, CSV)
Supporting multiple sources and sinks (text files, archives, databases)
Makes use of multiple cores
Little as possible administration, deployment issues, etc.
Programming language is not an issue, and I can manage Windows or Linux. Thanks!

Related

How can I visualize GNU remake profile data for multithreaded processes

I'm trying to profile a large multi-threaded Make-based system.
I recently found GNU remake and was able to use it as a drop-in replacement for gmake.
Since my system is multi-threaded and also has many processes, remake generated a large amount of data ~30K callgrind files(10GB).
I tried using kcachegrind to visualize the data, but it can load a maximum of 499 files which doesn't come close to the amount of data I have.
Are there any tools for visualizing profiling data of this magnitude?
Other tools I tried: gprof2dot
Another idea I had was to stitch multiple callgrind files together. But I didn't find any tools for this as well

Does NoSQL Fit the Bill for Reporting Software?

I am currently developing a software system that imports and normalizes historical data in various formats (XML, JSON, CSV). As of right now, we are using SQL server, and are looking to find the best replacement for this tool (Postgres or NoSQL). 90% of the time, the (archived/historical/static)data is accessed via a web client, and is used in a READ only format with users picking and choosing canned reports. Changes to the data only occur to update incorrect information .
The replacement DB must be able to store and report on 10s of millions of rows very quickly, and scale across multiple servers with ease (data replication, clustering, etc). There must also be data integrity, so if I update one KPI (lets say Cost per Hr), then all the reports that rely on the KPI will be updated accordingly.
Having no prior experience with NoSQL databases, I am wondering if it is even the right choice to use in a reporting software. We would like to allow for users to create their own custom reports, and that means being able to query any data as opposed to our canned reports, but I don't know if this would throw a wrench in the comparison between SQL vs NoSQL.
There are a few too many variables in the question, to comfortably answer it in entirety, but here's an attempt.
Your choice in SQL vs NoSQL should be based on data structure. Scalability is generally a second-tier concern, and is only slightly easy on some NoSQL platforms, but as always, isn't always free.
If you're looking for 10s of millions of rows 'very quickly' you are seriously testing the limits of what you can do with it. An RDBMS would allow you a plethora of options at the cost of speed, and a NoSQL although quite fast an inputting at that speed, would make you code most of the RDBMS smartness in your application. Chose your poison.
Updating a metric and 'automagically' updating reports is clearly a business-logic smartness, that shouldn't be tied down to platform selection.
PostgreSQL has in the near past, really picked up a lot of arsenal to deal with file formats (JSON et al) and is clearly worth a try (sans easy scalability).
Having said that, you should really investigate Postgres' otherwise forgotten asset, FDWs. You can clearly consider using a NoSQL setup to ingest large unstructured data, and thence utilize postgres' powerful semantics to use that and create a asynchronous yet structured backend for your application. If done well, that could mean the best of both worlds.

Best Time Series Format for Querying and Converting to Matlab (HDF5)

I have somewhat of a unique problem that looks similar to the problem here :
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and picking out specific packets from this to save into some format in a C++ program. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond level and have another application sort the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My two requirements are:
Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
Be able to export chunks of this data into MATLAB (HDF5).
Query this data once or twice a day for analytics purposes
Another nice thing that's not a hard requirement is :
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second one is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into matlab are something like "Pull metric X between time Y and Z", so this would eventually have to go into an HDF5 format. There is an external library called MatIO that I can use to write matlab files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and there are many options that appear to stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get it into the MATLAB HDF5 format, and if any of these would make it harder than others.
If anyone has experience doing something similar, it would be a big help. Thanks!
A KDB+ tickerplant is certainly capable of capturing data at that rate, however there's lots of things you need to make sure (whatever solution you pick)
Do the machine(s) that are capturing the data have enough cores? Best to taskset a tickerplant, for example, to a core that nothing else will contend with
Similarly with disk - SSD, be sure there is no contention on the bus
Separate the workload - can write different types of data (maybe packets can be partioned by source or stream?) to different cpus/disks/tickerplant processes.
Basically there's lots of ways you can cut this. I can say though that with the appropriate hardware KDB+ could do the job. However, given you want HDF5 it's probably even better to have a simple process capturing the data and writing/converting to disk on the fly.

XML versus MongoDB

I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated from consideration Cassandra (due to complex setup) and Couch DB (due to lack of familiar features such as indexing, which I will need initially due to my RDBMS ways of thinking).
So finally here's my actual question...
Is it worthwhile to look for a native XML database, which are not as mature as MongoDB, or should I bite the bullet and convert all the XML to JSON as it arrives and just use MongoDB?
You may have a look at BaseX, (Basex.org), with built in XQuery processor and Lucene text indexing.
That Data Volume is Small
If there is no need for parallel data processing, there is no need for Mongo DB. Especially if dealing with small data amounts like 4GB, the overhead of distributing work can easily get larger than the actual evaluation effort.
4GB / 60k nodes is not large of XML databases, either. After some time of getting into it you will realize XQuery as a great tool for XML document analysis.
Is it Really?
Or do you get daily 4GB and have to evaluate that and all data you already stored? Then you will get to some amount which you cannot store and process on one machine any more; and distributing work will get necessary. Not within days or weeks, but a year will already bring you 1TB.
Converting to JSON
How does you input look like? Does it adhere any schema or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured are way worse than what XML databases provide. On the other hand, if you only want to pull a few fields on well-defined paths and you can analyze one input file after the other, Mongo DB probably will not suffer much.
Carrying XML into the Cloud
If you want to use both an XML database's capabilities in analyzing the data and some NoSQL's systems capabilities in distributing the work, you could run the database from that system.
BaseX is getting to the cloud with exactly the capabilities you need -- but it will probably still take some time for that feature to get production-ready.

Are there any data warehouse frameworks?

I've got a lot of mysql data that I need to generate reports from. It's mostly historic data so it won't be changing much, but it weighs in at 20-30 gigabytes easily and is expected to grow. I currently have a collection of php scripts that will do some complex queries and output csv and excel files. I also use phpMyAdmin with bookmarked queries. I manually edit them to change the parameters. The amount of data is growing and the number of people who need access to it is also growing, so I'm making the time to improve this situation.
I started reading about data warehousing the other day and it seems that this an area that relates to what I need to do. I've read some good articles and am even waiting on a book. I think I'm getting a handle on what these sorts of systems do and what's possible.
Creating a reporting system for my data has always been on a todo list, but until recently I figured it would be a highly niche programing venture. Since I now know data warehousing is a common thing, I figure there must be some sort of reporting/warehousing frames available to ease in the development. I'd gladly skip writing interfaces and scripts to schedule and email reports and the like and stick to writing queries and setting up relations.
I've mostly been a lamp guy, but I'm not above switching languages or platforms. I just need a more robust solution as my one off scripts don't scale well.
So where's a good place to get started?
I'll discuss a few points on the {budget, business utility function, time frame} spectrum out there. For convenience, let's follow the architecture conceptualization you linked to at
WikipediaDataWarehouseArticle
Operational database layer
The source data for the data warehouse - Normalized for In One Place Only data maintenance
Data access layer
The transformation of your source data into your informational access layer. ETL tools to extract, transform, load data into the warehouse fall into this layer.
Informational access layer
• Report-facilitating Data Structure
Data is not maintained here. It is merely a reflection of your source data
Hence, denormalized structures (containing duplicate, but systematically derived data)
are usually most effective here
• Reporting tools
How do you actually allow your users access to the data
• pre-canned reports (simple)
• more dynamic slice-and-dice access methods
The data accessed for reporting and analyzing and the tools for reporting and analyzing data
fall into this layer. And the Inmon-Kimball differences about design methodology,
discussed later in the Wikipedia article, have to do with this layer.
Metadata layer (facilitates automation, organization, etc)
Roll your own (low-end)
For very little out-of-pocket cost, just recognizing the need for the denormalized structures can buy those that are not using it some efficiencies
Get in the ballgame (some outlays required)
You don't need to use all the functionality of a platform right off the bat.
IMO, however, you want to be on a platform that you know will grow, and in the highly competitive and consolidating BI environment, that seems to be one of the four enterprise mega-vendors (my opinion)
Microsoft (the platform of our 110 employee firm)
SAP
Oracle
IBM
BiMarketStateArticle
My firm is at this stage, using some of the ETL capability offered by SQL Server Integration Services (SSIS) and some alternate usage of the open source, but in practice license requiring Talend product in the "Data Access Layer", a denormalized reporting structure (implemented completely in the basic SQL Server database), and SQL Server Reporting Services (SSRS) to largely automate (based on your skill) the production of pre-specified reports. Note that an SSRS "report" is merely a (scalable) XML configuration/specification that gets rendered at runtime via the SSRS engine. Choices such as export to an excel file are simple options.
Serious Commitment (some significant human commitment required)
Notice above that we have yet to utilize the data mining/dynamic slicing/dicing
capabilities of SQL Server Analysis Services. We are working toward that,
but now focused on improving the quality of our data cleansing in the "Data Access Layer".
I hope this helps you to get a sense of where to start looking.
Pentaho has put together a pretty comprehensive suite of products. The products are "free", but be prepared for the usual heavy sell once you fork over your identifying information.
I haven't had a chance to really stretch them as we're a Microsoft shop from one sad end to the other.
I think you should first check out Kimball and Inmon and see if you want to approach your data warehouse in a particular way. Kimball, in particular, lays out a very good framework for the modelling and construction of the warehouse.
There are a number of tools which try to make the process of designing, implementing and managing/operating a Data Warehouse and they each have their strengths and weaknesses and often vastly differing price points. Under the covers you are always going to be best off if you have a good knowledge of warsehousing principles from the Kimball and/or Inmon camps.
As well as tools like Kalido and Wherescape RED (which do similar thing in very different ways), many of the ETL platforms now have good in-built support for the donkey work of implementation - SCD components etc and lineage tracking.
Best though to view all these as tools to be used in the hands of you, the craftsman, they make certain easy things even easier (or even trivial), some hard things easier but some things they just get in they way of IMHO ;) Learn the methodology and principles first and get a good understanding of them and then you will know which tools to apply from your kitbag and when...
It hasn't been updated in a while but there's a nice Data Warehousing/ETL Ruby package called ActiveWarehouse.
But I would check out the Pentaho products like Nick mentioned in another answer. It should easily handle the volume of data you have and may provide you with more ways to slice and dice your data than you could have ever imagined.
The best framework you can currently get is Anchor Modeling.
It might look quite complex because of it's generic structure and built-in capability to historize data.
Also modeling technique is quite different than ERD.
But you end-up with sql code to generate all db objects including 3NF views and:
insert/update handled by triggers
query any point/range in history
you application developers will not see underlying 6NF anchor model.
The technology is open sourced and at the moment is unbeatable.
If you would have AM question you may want to ask on that tag anchor-modeling.
Kimball is the simpler method for data warehousing.
We use Informatica for moving data around, but it doesn't do DW things like indexing by default.
I like the idea of Wherescape RED, as a DW tool and using MS SQL's Linked Servers to obviate the need for an ETL tool.