How to build Scala report projects

Is there a common standard to follow for building a Scala-based report engine from scratch? Data will be sourced from HDFS, then filtered, formatted, and emailed. Please share any experience or hurdles to expect.

I used to do such reports as PDF, HTML and XLSX.
We used Elasticsearch, but here was the general workflow (a rough Scala sketch of the pipeline follows the list):
get the filtered data from storage into Scala (no real trouble, just make sure your filters are well tested)
fill the holes so the data is consistent: think about missing points, awkward timezones, ...
format the output (we used an XSLT processor to produce the email HTML; email HTML is very specific and its size is limited, so aim for ~15 MB as an absolute maximum)
if the file is too big, store it somewhere and send a link instead
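The list above maps fairly directly onto a small pipeline. Here is a minimal sketch in Scala; every name in it (Record, fetchFiltered, fillGaps, renderHtml, uploadAndGetLink, sendEmail) is a hypothetical placeholder for whatever your HDFS client, gap-filling logic, XSLT step, and mailer actually look like, and the 15 MB limit is just the rule of thumb mentioned above:

```scala
// Minimal sketch of the report pipeline described above.
// All names (Record, fetchFiltered, fillGaps, renderHtml, uploadAndGetLink,
// sendEmail) are hypothetical placeholders, not a real library API.
case class Record(timestamp: Long, value: Double)

object ReportPipeline {
  val MaxEmailBytes: Long = 15L * 1024 * 1024 // ~15 MB, per the advice above

  def buildReport(fetchFiltered: () => Seq[Record],
                  fillGaps: Seq[Record] => Seq[Record],
                  renderHtml: Seq[Record] => String,
                  uploadAndGetLink: String => String,
                  sendEmail: String => Unit): Unit = {
    val raw      = fetchFiltered()        // 1. get filtered data from storage
    val complete = fillGaps(raw)          // 2. fill missing points, normalize timezones
    val html     = renderHtml(complete)   // 3. format (e.g. via an XSLT step)

    // 4. if the result is too big for email, store it and send a link instead
    val body =
      if (html.getBytes("UTF-8").length > MaxEmailBytes)
        s"Report too large for email, download it here: ${uploadAndGetLink(html)}"
      else html

    sendEmail(body)
  }
}
```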

Related

How do you generate a CAD geometry of randomly oriented objects?

How can one generate CAD geometries of randomly oriented and randomly sized objects (3D)? I need to model randomly sized and randomly oriented rectangles--thousands to millions of them.
I have not yet come across any CAD tools that have =rand() functions that can be inputted into dimensions. Is one way perhaps to have a CAD program import a CSV file of these randomly generated parameter values?
In SolidWorks, you can have model parameters (dimension lengths/angles, constraints, etc.) stored in an Excel spreadsheet called a Design Table. Each row in the spreadsheet will represent a different configuration of your model, and each column a different parameter. You can use Excel's built-in capabilities or an export-capable tool of your choosing to generate the configurations according to your desired distribution. I don't recall off the top of my head the easiest way to get a large number of instances with different configurations into the same assembly, but you haven't really told us what you're trying to accomplish so I can't give you specific recommendations anyways.
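For example, if you go the spreadsheet/CSV route, generating the configuration rows is a few lines of scripting. Here is a sketch in Scala; the column names and value ranges below are made up, so match them to the actual dimension names in your model:

```scala
// Sketch: generate a CSV of randomly sized/oriented rectangle parameters that a
// Design Table (or any CAD import script) could consume. Column names here are
// hypothetical; match them to whatever dimension names your model actually uses.
import java.io.PrintWriter
import scala.util.Random

object DesignTableGenerator {
  def main(args: Array[String]): Unit = {
    val n   = 10000          // number of configurations to generate
    val rng = new Random(42) // fixed seed for reproducibility

    val out = new PrintWriter("design_table.csv")
    try {
      out.println("Config,Width_mm,Height_mm,RotX_deg,RotY_deg,RotZ_deg")
      for (i <- 1 to n) {
        val width  = 1.0 + rng.nextDouble() * 99.0 // 1..100 mm
        val height = 1.0 + rng.nextDouble() * 99.0
        val rx     = rng.nextDouble() * 360.0
        val ry     = rng.nextDouble() * 360.0
        val rz     = rng.nextDouble() * 360.0
        out.println(f"cfg$i,$width%.3f,$height%.3f,$rx%.2f,$ry%.2f,$rz%.2f")
      }
    } finally out.close()
  }
}
```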
If you have a specific CAD tool then you can often find documentation on the internal file format. With a little experimentation you can sometimes write a small external program that will generate the header of the CAD file and then loop thousands or millions of times generating each individual object. Finally you generate the lines needed to complete the file. That can sometimes be easier than trying to force a tool to do something the designers never expected. And this might let you use the software of your choice to generate the file.
I would suggest starting small. Use the CAD tool to create a file with two or three of your rectangles. Save and inspect the contents of the file to see that it matches your understanding of the needed format. Then try externally creating what should be the same file and verify your version is correctly accepted.
You might consider that some tool designers never expected someone to want thousands or millions of anything. I would suggest sneaking up on the problem. Try doubling the number of items, check this works as expected and then repeat this process again and again until either you successfully get to millions or until you find the CAD tool won't be able to handle this.
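To make the header / loop / footer idea concrete, here is a rough Scala sketch. The HEADER/RECT/FOOTER lines are purely hypothetical placeholders for whatever your real tool writes; the point is only the structure: emit the header, loop over generated objects, emit the closing lines.

```scala
// Sketch of the header / loop / footer approach, assuming a *hypothetical*
// line-oriented text CAD format. The HEADER/RECT/FOOTER tokens are placeholders:
// replace them with whatever you observe in a small file saved by your real tool.
import java.io.PrintWriter
import scala.util.Random

object CadFileWriter {
  def main(args: Array[String]): Unit = {
    val n   = 100000
    val rng = new Random()
    val out = new PrintWriter("rectangles.cad")
    try {
      out.println("HEADER version=1") // opening lines copied from a real sample file
      for (_ <- 1 to n) {
        val (w, h)    = (rng.nextDouble() * 100, rng.nextDouble() * 100)
        val (x, y, z) = (rng.nextDouble() * 1000, rng.nextDouble() * 1000, rng.nextDouble() * 1000)
        val angle     = rng.nextDouble() * 360
        out.println(f"RECT $x%.3f $y%.3f $z%.3f $w%.3f $h%.3f $angle%.2f")
      }
      out.println("FOOTER") // closing lines copied from the sample
    } finally out.close()
  }
}
```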

DATASTAGE capabilities

I'm a Linux programmer.
I used to write code to get things done: Java, Perl, PHP, C.
I need to start working with DataStage.
All I see is that DataStage works on table/CSV-style data and processes it line by line.
I want to know whether DataStage can work on files that are not table/CSV-like. Can it load data into data structures and run functions on them, or is it limited to working on only one line at a time?
Thank you for any information you can give on the capabilities of DataStage.
IBM (formerly Ascential) DataStage is an ETL platform that, indeed, works on data sets by applying various transformations.
This does not necessarily mean that you are constrained to applying only single-line transformations (you can also aggregate, join, split, etc.). Also, DataStage has its own programming language - BASIC - that allows you to modify the design of your jobs as needed.
Lastly, you are still free to call external scripts from within DataStage (either using the DSExecute function, Before Job property, After Job property or the Command stage).
Please check the IBM Information Center for comprehensive documentation on BASIC programming.
You could also check the DSXchange forums for DataStage specific topics.
Yes it can. As Razvan said, you can join, aggregate, and split. It can use loops and external scripts, and it can also handle XML.
My advice: if you have large quantities of data to work on, then DataStage is your friend; if the data you have to load is not very big, it will be easier to use Java, C, or any other programming language that you know.
You can apply all kinds of functions and conversions and manipulate the data. DataStage is mainly valued for its ease of use when you are handling huge volumes of data from a data mart / data warehouse.
The main process in DataStage is ETL: Extraction, Transformation, Loading.
Where a programmer might use 100 lines of code to connect to some database, here it can be done with one click.
Almost anything can be done here, even C or C++ code in a routine activity.
If you are talking about hierarchical files, like XML or JSON, the answer is yes.
If you are talking about complex files, such as are produced by COBOL, the answer is yes.
All of this uses built-in functionality (e.g. the Hierarchical Data stage, the Complex Flat File stage). Review the DataStage palette to find other examples.

XML versus MongoDB

I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated from consideration Cassandra (due to complex setup) and CouchDB (due to lack of familiar features such as indexing, which I will need initially due to my RDBMS ways of thinking).
So finally here's my actual question...
Is it worthwhile to look at native XML databases, which are not as mature as MongoDB, or should I bite the bullet, convert all the XML to JSON as it arrives, and just use MongoDB?
You may have a look at BaseX (Basex.org), which has a built-in XQuery processor and Lucene text indexing.
That Data Volume is Small
If there is no need for parallel data processing, there is no need for MongoDB. Especially when dealing with small amounts of data like 4 GB, the overhead of distributing the work can easily become larger than the actual evaluation effort.
4 GB / 60k nodes is not large for XML databases, either. After getting into it for a while, you will find XQuery to be a great tool for XML document analysis.
Is it Really?
Or do you receive 4 GB daily and have to evaluate that plus all the data you have already stored? Then you will eventually reach an amount you cannot store and process on one machine any more, and distributing the work will become necessary. Not within days or weeks, but a year will already bring you 1 TB.
Converting to JSON
What does your input look like? Does it adhere to any schema, or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are much weaker than what XML databases provide. On the other hand, if you only want to pull a few fields at well-defined paths and you can analyze one input file after the other, MongoDB will probably not suffer much.
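As an illustration of the "pull a few fields at well-defined paths" case, here is a small Scala sketch, assuming the scala-xml module is available. The element names are invented; a real loader would also keep the untouched raw XML alongside the extracted fields and hand the result to whatever MongoDB driver you use:

```scala
// Illustrative sketch of extracting a few fields from well-known paths.
// Element names (<order>, <id>, <customer>, <total>) are hypothetical.
import scala.xml.XML

object XmlToJsonSketch {
  def main(args: Array[String]): Unit = {
    val doc = XML.loadString(
      """<order><id>42</id><customer>ACME</customer><total>19.99</total></order>""")

    // Extract only the fields the website / analytics actually need.
    val fields = Map(
      "id"       -> (doc \ "id").text,
      "customer" -> (doc \ "customer").text,
      "total"    -> (doc \ "total").text
    )

    // Naive JSON rendering, good enough for flat string fields; a real loader
    // would hand this map (plus the raw XML) to the MongoDB driver instead.
    val json = fields
      .map { case (k, v) =>
        val escaped = v.replace("\\", "\\\\").replace("\"", "\\\"")
        "\"" + k + "\": \"" + escaped + "\""
      }
      .mkString("{", ", ", "}")

    println(json) // {"id": "42", "customer": "ACME", "total": "19.99"}
  }
}
```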
Carrying XML into the Cloud
If you want to use both an XML database's capabilities for analyzing the data and a NoSQL system's capabilities for distributing the work, you could run the database from within that system.
BaseX is getting to the cloud with exactly the capabilities you need, but it will probably still take some time for that feature to become production-ready.

Load and perform search on large amount of data

I need a suggestion on how to work with a large amount of data on the iPhone. Let's say I have an XML file with ~120k text records, and I need to perform searches on this data. The solution I have tried is to use Core Data to store the information in sorted order in caches, and then use binary search, which works fast. But the problem is building these caches: on first launch the application takes about 15-25 seconds to build them. Maybe I need a different approach to searching the data?
Thanks in advance.
If you're using an XML file with the requirement that you can't cache, then you're not going to succeed unless you somehow carefully format your XML file to have useful data traversal properties -- but then you may as well use a binary file that's more useful unless you have some very esoteric requirements.
Really what you want is one of the typical indexing algorithms (on disk hash, B-tree, etc) from the get-go.
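One way to picture that: build the index once (offline or server-side), ship it with the app already sorted, and only do lookups at runtime. A minimal sketch of the lookup side in Scala (Record and loadIndex are hypothetical; on iOS the same idea would be written against a bundled, pre-sorted file):

```scala
// Sketch of the "pre-built index" idea: build the sorted index once, ship it
// with the app, and binary-search it at runtime instead of rebuilding caches
// on first launch. Record and loadIndex are hypothetical placeholders.
import scala.collection.Searching._

object PrebuiltIndexSearch {
  final case class Record(key: String, payload: String)

  // Pretend this was loaded from a bundled file, already sorted by key.
  def loadIndex(): Vector[Record] =
    Vector(Record("apple", "..."), Record("banana", "..."), Record("cherry", "..."))

  def lookup(index: Vector[Record], key: String): Option[Record] =
    index.map(_.key).search(key) match { // a real app would keep keys in their own array
      case Found(i) => Some(index(i))
      case _        => None
    }

  def main(args: Array[String]): Unit = {
    val index = loadIndex()
    println(lookup(index, "banana")) // Some(Record(banana,...))
  }
}
```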
However...
If you have to read in and parse your XML text file, then you can skirt using a typical big and slow generic XML parser and write a fast hackish version since most of the data records you'll need to recognize are probably formatted the same way over and over. Nothing special, just find where the relevant data fields start, grab the data until it ends, move on to the next data field.
Honestly, 120k of text isn't very much; it sounds like whatever XML parser you're using is just slow. (I use this trick all the time for autogenerated XML data that just represents things like tables or simple data records -- my own parser is faster than any generic XML parser.)
This is probably the solution you actually want since you sound fairly attached to the XML file format. It won't be as error-proof as a generic XML parser if you're not careful, however it will eat that 120KB file up like nobody's business. And it's entry level CS work -- read in a file with certain specific formatting and grab the data values from it. Regexps are your friend if you have access to them.
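Here is a sketch of that "fast hackish parser" in Scala, purely to illustrate the technique; the record layout and tag names are invented, so adapt the regex to the real file:

```scala
// Sketch of a quick, format-specific scan over repetitive autogenerated XML:
// skip a generic XML parser and grab the fields with a regex. The <record>/<name>
// tag names are hypothetical; adapt the pattern to the real file.
import scala.io.Source
import scala.util.matching.Regex

object QuickXmlScan {
  // Matches one record per line, e.g. <record><name>foo</name><value>12</value></record>
  private val RecordPattern: Regex =
    """<record><name>([^<]*)</name><value>([^<]*)</value></record>""".r

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("data.xml", "UTF-8")
    try {
      val records = source.getLines().flatMap { line =>
        RecordPattern.findFirstMatchIn(line).map(m => (m.group(1), m.group(2)))
      }.toVector
      println(s"Parsed ${records.size} records")
    } finally source.close()
  }
}
```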
Try storing the data and doing your searches in the cloud (using a database stored on a server somewhere), unless you specifically need ALL of the information on the device.

How can I create a web page that shows aggregate data from Sawtooth surveys?

I'm guessing this won't apply to 99.99% of anyone that sees this. I've been doing some Sawtooth survey programming at work and I've been needing to create a webpage that shows some aggregate data from the completed surveys. I was just wondering if anyone else has done this using the flat files that Sawtooth generates and how you went about doing it. I only know very basic Perl and the server I use does not have PHP so I'm somewhat at a loss for solutions. Anything you've got would be helpful.
Edit: The problem with offering example files is that it's more complicated. It's not a single file and it occasionally gets moved to a different file with a different format. The complexities added in there are why I ask this question.
Doesn't Sawtooth export into CSV format? There are many Perl parsers for CSV files. Just about every language has a CSV parser or two (or twelve), and MS Excel can open them directly, and they're still plaintext so you can look at them in any text editor.
I know our version of Sawtooth at work (which is admittedly very old) exports Sawtooth data into SPSS format, which can then be exported into various spreadsheet formats including CSV, if all else fails.
If you have a flat (fixed-width field) file, you can easily parse it in Perl using regular expressions or just taking substrings of each line one at a time, assuming you know the width of the fields. Your question is too general to give much better advice, sorry.
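The answer above talks about Perl, but the substring technique is the same in any language. Here is a Scala sketch with a made-up field layout, just to show the shape of it:

```scala
// Fixed-width parsing by taking substrings at known column offsets.
// The field widths below are invented; use the widths from your actual layout file.
import scala.io.Source

object FixedWidthParser {
  // (fieldName, width) pairs describing one line of the flat file -- hypothetical layout.
  private val layout = Seq("respondentId" -> 8, "question1" -> 2, "question2" -> 2, "comment" -> 40)

  def parseLine(line: String): Map[String, String] = {
    var offset = 0
    layout.map { case (name, width) =>
      val value = line.slice(offset, offset + width).trim
      offset += width
      name -> value
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("survey.dat", "UTF-8")
    try source.getLines().map(parseLine).take(5).foreach(println)
    finally source.close()
  }
}
```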
Matching the values up from a plaintext file with meta-data (variable names and labels, value labels etc.) is more complicated unless you already have the meta-data in some script-readable format. Making all of that stuff available on a web page is more complicated still. I've done it and it can be a bit of a lengthy project to roll your own. There are packages you can buy, like SDA, which will help you build a website where people can browse and download your survey data and view your codebooks.
Honestly, though, the easiest thing to do if you're posting statistical data on a website is to get the data into SPSS or SAS or another statistics package format and post those files for download directly. Then you don't have to worry about it.