I am looking for an approach to load and print data from the .n3 files inside a .tar.gz archive in Scala. Or should I extract the archive first?
If you want to download the file, it is located at
http://wiki.knoesis.org/index.php/LinkedSensorData
Could anyone describe how I can print the data from this archive to the screen using Scala?
The files you are dealing with are large. I therefore suggest you import them into an RDF store of some sort rather than try to parse them yourself. You can use GraphDB, Blazegraph, Virtuoso, and the list goes on; a search for RDF stores will turn up many other options. You can then use SPARQL to query the RDF store (SPARQL is to an RDF store what SQL is to a relational database).
To find a Scala library that can access RDF data you can see this related SO question, though the options do not look promising. I would suggest you look at Apache Jena, a Java library, instead.
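If you do want to read the files directly, Apache Commons Compress plus Jena can stream the archive entries and print the triples without extracting the archive first. This is only a sketch: the archive path is a placeholder, and each .n3 entry is read fully into memory before parsing, which may be heavy for the larger files.

import java.io.{BufferedInputStream, ByteArrayInputStream, FileInputStream}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.compress.utils.IOUtils
import org.apache.jena.rdf.model.ModelFactory

object PrintN3FromTarGz extends App {
  // Wrap the .tar.gz in a gzip decompressor and a tar reader.
  val tarIn = new TarArchiveInputStream(
    new GzipCompressorInputStream(
      new BufferedInputStream(new FileInputStream("linked-sensor-data.tar.gz"))))

  var entry = tarIn.getNextTarEntry
  while (entry != null) {
    if (!entry.isDirectory && entry.getName.endsWith(".n3")) {
      // Copy the current entry into memory, then let Jena parse it as N3.
      val bytes = IOUtils.toByteArray(tarIn)
      val model = ModelFactory.createDefaultModel()
      model.read(new ByteArrayInputStream(bytes), null, "N3")

      // Print every triple in this file.
      val it = model.listStatements()
      while (it.hasNext) println(it.nextStatement())
    }
    entry = tarIn.getNextTarEntry
  }
  tarIn.close()
}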
You may also want to look at the DBPedia Extraction Framework where they extract data from Wikipedia and store it as RDF data using Scala. It is certainly not exactly what you are trying to do, but it could give you insight into the tools they used for generating RDF from Scala and related issues.
Related
I've downloaded the Wikidata truthy dump in RDF format (an .nt.bz2 file). I want to restrict the dump to English-language content only and write the filtered result out as a new .nt file.
I've tried using parallel grep to filter lines with '#en' text, but it consumes a lot of processing time.
Is there some much faster way to generate filtered dumps? Something like using Spark?
Maybe it is a bit late for you, but in the meantime a tool has been created for building custom dumps: https://tools.wmflabs.org/wdumps/
With this tool you can define a language filter online and then download an .nt file containing only the relevant triples.
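If you would rather produce the filtered dump yourself, the Spark route mentioned in the question is also workable, since bzip2 is a splittable codec and the filter runs in parallel across the file. A minimal sketch; the paths and the "@en" language-tag pattern are assumptions, so adjust the filter to whatever your grep matched.

import org.apache.spark.sql.SparkSession

object FilterEnglishTriples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-english-triples").getOrCreate()

    // Spark decompresses .bz2 text files transparently and reads them line by line.
    val lines = spark.sparkContext.textFile("wikidata-truthy.nt.bz2")

    // Keep only triples whose literal carries an English language tag.
    val english = lines.filter(_.contains("@en"))

    // The output is a directory of part files; concatenate them to get a single .nt file.
    english.saveAsTextFile("wikidata-truthy-en")

    spark.stop()
  }
}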
Is there a common standard to follow for building a Scala-based report engine from scratch? Data will be sourced from HDFS, filtered, formatted, and emailed. Please share any experience or hurdles to expect.
I used to produce such reports as PDF, HTML and XLSX.
We used Elasticsearch, but the general workflow was:
get the filtered data from storage into Scala (no real trouble, just make sure your filters are well tested)
fill the holes so the data is consistent: think about missing points, awkward timezones, and so on
format the output (we used an XSLT processor to produce the email HTML; email HTML is very idiosyncratic and email size is limited, so aim for ~15 MB as an absolute maximum; see the sketch after this list)
if the file is too big, store it somewhere and send a link instead
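For the XSLT formatting step, the core is just the standard JAXP transformer. A minimal sketch; the file names are placeholders, with report.xml standing in for the filtered data serialized as XML and report.xsl for your stylesheet.

import java.io.File
import javax.xml.transform.{OutputKeys, TransformerFactory}
import javax.xml.transform.stream.{StreamResult, StreamSource}

object RenderEmailHtml {
  def main(args: Array[String]): Unit = {
    // Compile the stylesheet and emit HTML suitable for embedding in an email.
    val transformer = TransformerFactory.newInstance()
      .newTransformer(new StreamSource(new File("report.xsl")))
    transformer.setOutputProperty(OutputKeys.METHOD, "html")

    // Apply the stylesheet to the serialized report data.
    transformer.transform(
      new StreamSource(new File("report.xml")),
      new StreamResult(new File("report.html")))
  }
}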
I have over 100 thousand files on Google Cloud Storage that contain JSON objects, and I'd like to create a mirror that maintains the filesystem structure but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>
But Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files and output them with the same directory and file structure?
Alternatively, is there a better system for doing this kind of edit on a large number of files?
You cannot use TextIO directly for this, but Beam 2.2.0 will include a feature that helps you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0.
Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern
Map over the PCollection<Metadata> with a ParDo that does what you want to each file using FileSystems (a full sketch follows below):
Use the FileSystems.open() API to read the input file and then standard Java utilities for working with ReadableByteChannel.
Use the FileSystems.create() API to write the output file.
Note that Match is a very simple PTransform (it uses FileSystems under the hood); another way to use it is to copy-paste (the necessary parts of) its code into your project, or to study its code and reimplement something similar. This can be an option if you're hesitant to update your Beam SDK version.
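Putting those steps together, here is a rough sketch. It is written against FileIO.match(), which is where this functionality surfaced in released Beam versions (taking over from the snapshot-era Match.filepatterns() mentioned above); the bucket names and the JSON-stripping step are placeholders.

import java.nio.channels.Channels
import java.nio.charset.StandardCharsets

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.{FileIO, FileSystems}
import org.apache.beam.sdk.io.fs.MatchResult.Metadata
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.{DoFn, ParDo}
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.util.MimeTypes

class RewriteFileFn extends DoFn[Metadata, Void] {
  @ProcessElement
  def processElement(c: DoFn[Metadata, Void]#ProcessContext): Unit = {
    val inputId = c.element().resourceId()

    // Read the whole input file through FileSystems.open().
    val source = scala.io.Source.fromInputStream(
      Channels.newInputStream(FileSystems.open(inputId)), "UTF-8")
    val content = try source.mkString finally source.close()

    // stripFields is whatever JSON transformation you need; it is not shown here.
    val rewritten = content // stripFields(content)

    // Mirror the input path under the output bucket, preserving reports/YYYY/MM/DD/<filename>.
    val outputId = FileSystems.matchNewResource(
      inputId.toString.replace("gs://input-bucket/", "gs://output-bucket/"), false)

    // Write the rewritten content through FileSystems.create().
    val out = Channels.newOutputStream(FileSystems.create(outputId, MimeTypes.TEXT))
    try out.write(rewritten.getBytes(StandardCharsets.UTF_8)) finally out.close()
  }
}

object MirrorJsonFiles {
  def main(args: Array[String]): Unit = {
    val pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())
    pipeline
      .apply(FileIO.`match`().filepattern("gs://input-bucket/reports/**"))
      .apply(ParDo.of(new RewriteFileFn))
    pipeline.run().waitUntilFinish()
  }
}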
I'm looking to use WEKA to train and predict from data in MongoDB. Specifically, I intend to use the WEKA API to analyse the data (e.g. build a recommendation engine). But I have no idea how to proceed, because the data in MongoDB is stored in the BSON format, while WEKA uses the ARFF format. I would like to use the WEKA API to read data from MongoDB, analyse it, and provide recommendations to the user in real time. I cannot find a bridge between WEKA and MongoDB.
Is this even possible or should I try another approach?
Before I begin, I should say that WEKA isn't the best tool for working with Big Data. If you really have Big Data, you will likely want to use Spark and the Hadoop family, as they are better suited to this kind of analysis.
To answer your question as written, I would advise doing the training manually (i.e. creating a training file using any programmatic tools available to you) and pretraining a model. The model can then be saved and integrated into your program accordingly.
For testing, you can follow the official instructions, but I usually take a shortcut: I preprocess my data into a CSV-like format (as if it were going into an ARFF file) and just prepend a valid ARFF header (the same one your training file uses). From there it is very easy to test the instances. In my experience, this greatly simplifies writing code that actually makes novel predictions.
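As a sketch of that shortcut applied to MongoDB: the database, collection, and attribute names and the model.model file are all placeholders, and the header must match the one your training ARFF file uses.

import java.io.StringReader
import com.mongodb.client.MongoClients
import weka.classifiers.Classifier
import weka.core.{Instances, SerializationHelper}

object MongoToWeka {
  def main(args: Array[String]): Unit = {
    val collection = MongoClients.create("mongodb://localhost:27017")
      .getDatabase("mydb").getCollection("users")

    // Same header as the training ARFF file, followed by one CSV line per document.
    val header =
      """@relation users
        |@attribute age numeric
        |@attribute income numeric
        |@attribute buys {yes,no}
        |@data
        |""".stripMargin

    val sb = new StringBuilder(header)
    val cursor = collection.find().iterator()
    while (cursor.hasNext) {
      val doc = cursor.next()
      sb.append(s"${doc.get("age")},${doc.get("income")},?\n") // '?' marks the class to predict
    }
    cursor.close()

    val data = new Instances(new StringReader(sb.toString))
    data.setClassIndex(data.numAttributes() - 1)

    // Load the model trained offline and classify each instance.
    val model = SerializationHelper.read("model.model").asInstanceOf[Classifier]
    for (i <- 0 until data.numInstances()) {
      val pred = model.classifyInstance(data.instance(i))
      println(s"instance $i -> ${data.classAttribute().value(pred.toInt)}")
    }
  }
}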
I'm looking for something in Scala similar to what PyTables provides in Python. PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data.
Any suggestions?
I had a quick look at PyTables, and I don't think there's anything remotely like it in Scalaland (or indeed Javaland), but we have a few of the ingredients necessary to make it a possibility if you want to invest the time:
scala.Dynamic to do idiomatic selection on data-driven structures (see the sketch after this list)
A bunch of graph databases to provide the underlying navigational persistence substrate (I've had acceptable results from OrientDB, which has a better license than most)
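To illustrate the first ingredient: scala.Dynamic lets field access be resolved at runtime against whatever structure the data actually has. A toy sketch; the DataNode class and its fields are made up for illustration.

import scala.language.dynamics

// A node whose fields are only known at runtime (e.g. loaded from a graph database).
class DataNode(fields: Map[String, Any]) extends Dynamic {
  def selectDynamic(name: String): Any = fields(name)
}

object Demo extends App {
  val reading = new DataNode(Map("sensor" -> "A1", "temperature" -> 21.5))
  // These look like static field accesses but are resolved through selectDynamic.
  println(reading.sensor)       // A1
  println(reading.temperature)  // 21.5
}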
PyTables is a Python implementation of HDF5 with some added niceties that let you work on it in a Pythonic way and get good indexing support. I'm not sure there is a package implemented in a similar way in Scala, but you can use the same HDF5-based hierarchical data storage via the HDF5 implementation in Java: HDF Java.
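As a concrete illustration of the Java route from Scala, here is a rough sketch that uses the pure-Java jhdf reader rather than the HDF Java wrapper mentioned above (a deliberate swap for brevity); the file name and dataset path are placeholders.

import java.nio.file.Paths
import io.jhdf.HdfFile

object ReadHdf5 extends App {
  // Open the file and pull one dataset out of the hierarchy by its path.
  val hdfFile = new HdfFile(Paths.get("sensors.h5"))
  try {
    val dataset = hdfFile.getDatasetByPath("/site1/temperature")
    // For a one-dimensional float64 dataset, getData returns an Array[Double];
    // multi-dimensional datasets come back as nested Java arrays.
    val values = dataset.getData.asInstanceOf[Array[Double]]
    println(values.take(10).mkString(", "))
  } finally hdfFile.close()
}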