Hash Partitioning RDF (OWL/N3/NT) datasets

I have an N3 dataset that contains triples. I wish to hash partition this dataset. Is there a hash partitioner that hash partitions OWL/NT/N3 datasets? If not, could you please provide me with some code/tips on how to proceed with parsing the file in an effective way.

Parsing an RDF file is an entirely different task from storing the resulting triples in an efficient way. For simply parsing the RDF file, you can use one of the many RDF processing libraries out there, and that works just fine. (StackOverflow really isn't the place for lists of tools, but the question Which Tools and Libraries do you use to develop Semantic Web applications? on http://answers.semanticweb.com has a bunch listed.) As you clarified in the comments:
I generated an OWL dataset using LUBM's (Lehigh University Benchmark)
data generator, and converted it to N3 format using an online
converter. Now, I would like to hash partition the dataset and store
each partition on a worker machine. Before implementing my own, I
wanted to know if there is such a library out there. Could you please
point me to some of the available libraries. As for efficiency, I
mentioned it because the dataset I have is very large and using a
sequential hash partitioner might consume a lot of time to finish the
task.
There are at least two important things to note here.
OWL is not the same as RDF, but OWL can be serialized in RDF. It appears that you've already serialized your OWL in RDF.
RDF can be serialized in a number of forms. One of the most common is RDF/XML, but there are also N3, Turtle (a subset of N3), and N-Triples (NT).
N-Triples is a line-based format with exactly one triple per line. If you just need to split your data into three pieces and send them to different places, convert it to N-Triples, where k triples will be on k lines. You could then send the first k/3 lines to worker A, the second k/3 to worker B, and the last k/3 to worker C. Alternatively, you could iterate through the lines one at a time, sending a line to A, then a line to B, then a line to C, or hash each line (or just its subject) and let the hash pick the worker. This is one of the big advantages of N-Triples: it's very cheap to split or combine datasets. As an example, consider this DBpedia query and its results in N-Triples; a sketch of a simple subject-hash partitioner follows the results below. You can just split the results up into three chunks of 3, 3, and 4 lines, and send them off to your workers.
construct where {
dbpedia:Mount_Monadnock ?prop ?obj
}
limit 10
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Mountain> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/NaturalPlace> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.opengis.net/gml/_Feature> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/GeologicalFormation109287968> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/Mountain> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Mountain> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/Object100002684> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Place> .
<http://dbpedia.org/resource/Mount_Monadnock> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
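If you do want hash partitioning rather than positional splitting, a minimal sketch (in Python) might look like the following. This is not an existing library; the file names, partition count, and choice of MD5 are assumptions for illustration. Hashing the subject keeps all triples about the same resource on the same worker, whereas hashing the whole line would spread them evenly with no locality.

# Minimal sketch: hash-partition an N-Triples file by subject into k partition files.
# "dataset.nt", "partition-<i>.nt", and NUM_PARTITIONS = 3 are made-up example values.
import hashlib

NUM_PARTITIONS = 3

def partition_of(subject: str, k: int = NUM_PARTITIONS) -> int:
    # Use a stable hash (hashlib rather than Python's built-in hash()) so the
    # same subject lands in the same partition on every run and every machine.
    digest = hashlib.md5(subject.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % k

outputs = [open(f"partition-{i}.nt", "w", encoding="utf-8")
           for i in range(NUM_PARTITIONS)]
with open("dataset.nt", encoding="utf-8") as nt:
    for line in nt:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        subject = line.split(None, 1)[0]  # subjects contain no whitespace in N-Triples
        outputs[partition_of(subject)].write(line + "\n")
for f in outputs:
    f.close()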

Related

KubeFlow, handling large dynamic arrays and ParallelFor with current size limitations

I've been struggling to find a good solution for this for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as a file HTTP URL due to the pipeline input argument size limitations of Argo and Kubernetes (or that is what I understood from the current open issues), but when I try to read the file from one Op to use as input for the ParallelFor, I hit the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!
the array comes in as a file HTTP URL due to the pipeline input argument size limitations of Argo and Kubernetes
Usually the external data is first imported into the pipeline (downloaded and output). Then the components use inputPath and outputPath to pass big data pieces as files.
The size limitation only applies to data that you consume by value (with inputValue) instead of as a file (with inputPath).
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example, if your data is a JSON list of big objects [{obj1}, {obj2}, ... , {objN}], you can transform it into a list of indexes [1, 2, ... , N], pass that small list to the loop, and then inside the loop have a component that uses the index and the data file to select the single piece to work on (N -> {objN}).
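A rough sketch of that index trick with the kfp v1 SDK (Python) is below. The component names (download_data, list_indexes, pick_item) and the overall pipeline shape are illustrative assumptions, not a reference implementation.

from kfp import dsl
from kfp.components import create_component_from_func, InputPath, OutputPath

def download_data(url: str, data_path: OutputPath(str)):
    # Import the big JSON array into the pipeline once, as a file artifact.
    import urllib.request
    urllib.request.urlretrieve(url, data_path)

def list_indexes(data_path: InputPath(str)) -> list:
    # Only this small list of indexes is passed by value into the loop.
    import json
    with open(data_path) as f:
        return list(range(len(json.load(f))))

def pick_item(index: int, data_path: InputPath(str), item_path: OutputPath(str)):
    # Inside the loop: use the index plus the full data file to select one entry.
    import json
    with open(data_path) as f:
        items = json.load(f)
    with open(item_path, "w") as f:
        f.write(items[index])  # each entry is already a stringified object

download_op = create_component_from_func(download_data)
list_indexes_op = create_component_from_func(list_indexes)
pick_item_op = create_component_from_func(pick_item)

@dsl.pipeline(name="indexed-parallelfor")
def pipeline(data_url: str):
    data = download_op(url=data_url)
    indexes = list_indexes_op(data=data.output)
    with dsl.ParallelFor(indexes.output) as index:
        item = pick_item_op(index=index, data=data.output)
        # hand item.output to the real worker component here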

Filtering GCS uris before job execution

I have a frequent use case I couldn't solve.
Let's say I have a filepattern like gs://mybucket/mydata/*/files.json where * is supposed to match a date.
Imagine I want to keep 251 dates (this is an example; say a large number of dates, with no meta-pattern like 2019* that matches them all).
For now, I have two options:
create a TextIO for every single file, which is overkill and fails almost every time (graph too large)
read ALL the data and then filter it within my job, which is also overkill when you have 10 TB of data and only need, say, 10 GB
In my case, I would like to just do something like this (pseudo-code):
Read(LIST[uri1,uri2,...,uri251])
and have that instruction actually spawn a single TextIO step in the graph.
I am sorry if I missed something, but I couldn't find a way to do it.
Thanks
OK, I found it; the naming was misleading me:
Example 2: reading a PCollection of filenames.
Pipeline p = ...;
// E.g. the filenames might be computed from other data in the pipeline, or
// read from a data source.
PCollection<String> filenames = ...;
// Read all files in the collection.
PCollection<String> lines =
    filenames
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles());
(Quoted from Apache Beam documentation https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/io/TextIO.html)
So we need to generate a PCollection of URIs (with Create.of) or read it from the pipeline, then match all the URIs (or patterns, I guess), and then read all the files.
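For reference, if you happen to be on the Beam Python SDK instead, the rough equivalent (with made-up bucket paths) is a Create of the explicit URI list followed by ReadAllFromText:

import apache_beam as beam

# The explicit list of files/dates you want to keep (example paths).
uris = [
    "gs://mybucket/mydata/2019-01-03/files.json",
    "gs://mybucket/mydata/2019-02-17/files.json",
    # ... the remaining dates
]

with beam.Pipeline() as p:
    lines = (
        p
        | "ExplicitUris" >> beam.Create(uris)          # filenames provided up front
        | "ReadEachFile" >> beam.io.ReadAllFromText()  # matches and reads each file's lines
    )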

Is it possible to only evaluate the Key when reading a SequenceFile in Spark?

I'm trying to read a SequenceFile with custom Writable subclasses for both the key and the value as input to a Spark job.
The vast majority of rows need to be filtered out by matching the key class's getId against a broadcast variable ("candidateSet"). Unfortunately, with the standard approach the values are deserialized for every record no matter what, and according to a profile that is where the majority of the time is being spent.
Here is my code. Note that my most recent attempt reads the values generically as Writable and casts them back later, which worked functionally but still caused the full deserialization in the iterator.
val rdd = sc.sequenceFile(
  path,
  classOf[MyKeyClassWritable],
  classOf[Writable]
).filter(a => candidateSet.value.contains(a._1.getId))
It turns out Twitter has a library that handles this case pretty well. Specifically, using this class lets you defer evaluation of the serialized fields to a later step by reading them as DataInputBuffers:
https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java

using linked lists to add integers and read them

Infinite Adder
Write a program that can add 2 integers of ANY LENGTH (limited only by computer memory). Store the 2 integers as a linked list of digits (each node is a single digit from 0 – 9). Read each number from a comma-delimited file of digits in reverse order (for example, a file with “2,3,7,0,1” represents the number 10732). One file should be called “num1.txt” and the other file should be called “num2.txt”. The Hard Part: When you print out the answer, it should be in “normal” order.
It may be wise to write a doubly-linked list in order to make things easier.
Help would be very much appreciated.
I can tell you what I need to do, but I don't know how to write it code-wise.
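One way to sketch the structure (Python here for brevity; the file names follow the assignment, and you would translate this to whatever language your course requires):

class Node:
    def __init__(self, digit):
        self.digit = digit
        self.next = None

def read_number(filename):
    # "2,3,7,0,1" -> linked list 2 -> 3 -> 7 -> 0 -> 1 (least significant digit first)
    with open(filename) as f:
        digits = [int(d) for d in f.read().strip().split(",")]
    head = tail = Node(digits[0])
    for d in digits[1:]:
        tail.next = Node(d)
        tail = tail.next
    return head

def add(a, b):
    # Grade-school addition: walk both lists, keep a carry, and build the
    # result list least-significant-digit first (same order as the inputs).
    head = tail = None
    carry = 0
    while a or b or carry:
        total = (a.digit if a else 0) + (b.digit if b else 0) + carry
        carry, digit = divmod(total, 10)
        node = Node(digit)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
        a = a.next if a else None
        b = b.next if b else None
    return head

def print_normal_order(node):
    # Collect the digits and reverse them for printing, so a singly linked
    # list is enough and no doubly linked list is strictly needed.
    digits = []
    while node:
        digits.append(str(node.digit))
        node = node.next
    print("".join(reversed(digits)))

print_normal_order(add(read_number("num1.txt"), read_number("num2.txt")))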

how to get probability of each topic in mallet

I am doing topic modelling with mallet. I have imported my file (each document on a line) and trained mallet with 200 topics. Now I have 200 topics, each with its related words. Now I need to know each topic's probability. How can I find it?
Thank you
The command bin/mallet train-topics has an option --output-doc-topics topic-composition.txt. This outputs a big table in TAB-separated text format containing the topic composition of each text.