Why do we need to set the output key/value class explicitly in the Hadoop program?

In the "Hadoop : The Definitive Guide" book, there is a sample program with the below code.
JobConf conf = new JobConf(MaxTemperature.class);
conf.setJobName("Max temperature");
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(MaxTemperatureMapper.class);
conf.setReducerClass(MaxTemperatureReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
The MapReduce framework should be able to figure out the output key and value classes from the Mapper and Reducer classes that are set on the JobConf. Why do we need to set the output key and value classes on the JobConf explicitly? Also, there is no similar API for the input key/value pair.

The reason is type erasure [1]. You set the output key/value classes as generics on your Mapper and Reducer, and generics are erased by the time the job is set up (which happens at run time, not compile time), so the framework cannot read them back from those classes.
The input key/value classes, on the other hand, can be read from the input file: in the case of SequenceFiles the classes are stored in the file header, and you can see them when opening a sequence file.
That header has to be written in the first place, and since every map output is a SequenceFile, you need to provide the classes yourself.
[1] http://download.oracle.com/javase/tutorial/java/generics/erasure.html
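To make the erasure point concrete, here is a minimal sketch (in Scala, purely for illustration; it is not from the book) showing that a generic container's type arguments are gone at run time, so nothing inspecting the objects during job setup could recover them:

object ErasureDemo extends App {
  val strings: List[String] = List("a", "b")
  val ints: List[Int]       = List(1, 2)

  // Both values share exactly the same runtime class; the type arguments
  // String and Int were erased by the compiler and are not recoverable here.
  println(strings.getClass == ints.getClass) // prints: true
}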

Related

Spark serializes variable value as null instead of its real value

My understanding of how Spark distributes code to the nodes that run it is merely cursory, and I am failing to get my code to run successfully within Spark's mapPartitions API when I want to instantiate a class for each partition, with an argument.
The code below worked perfectly, up until I evolved the class MyWorkerClass to require an argument:
val result : DataFrame =
  inputDF.as[Foo].mapPartitions(sparkIterator => {
    // (1) initialize heavy class instance once per partition
    val workerClassInstance = MyWorkerClass(bar)
    // (2) provide an iterator using a function from that class instance
    new CloseableIteratorForSparkMapPartitions[Post, Post](sparkIterator, workerClassInstance.recordProcessFunc)
  })
Once MyWorkerClass took a constructor argument, the passed argument value turned out to be null in the worker, instead of the real value of bar; somehow the serialization of the argument fails to work as intended.
How would you go about this?
Additional Thoughts/Comments
I'll avoid adding the bulky code of CloseableIteratorForSparkMapPartitions; it merely provides a Spark-friendly iterator, and may well not be the most elegant implementation of that.
As I understand it, the constructor argument is not passed correctly to the Spark worker because of how Spark captures state when it serializes the closure to ship for execution on the worker. However, instantiating the class does seamlessly make the heavy-to-load assets included in that class available to the function provided on the last line of the code above, and the class did seem to be instantiated once per partition, which is a valid, if not the key, use case for using mapPartitions instead of map.
It's the passing of an argument to that instantiation that I am having trouble enabling or working around. In my case the argument is a value only known after the program has started running (it is invariant throughout a single execution of the job; it is in fact a program argument), and I need it passed along for the initialization of the class.
I tried to work around this by providing a function that instantiates MyWorkerClass with its input argument, rather than instantiating it directly as above, but this did not solve matters.
The root symptom of the problem is not an exception, but simply that the value of bar when MyWorkerClass is instantiated is just null, instead of the actual value of bar, which is known in the scope of the code enclosing the snippet included above!
* one related old Spark issue discussion here
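For what it's worth, a common pattern when a value captured by a Spark closure arrives as null on the executors is to copy it into a plain local val before building the closure, so that only that value (and not an enclosing, possibly non-serializable object) is captured and serialized. Below is a minimal, self-contained sketch of that pattern; Worker, localBar and the toy dataset are placeholders rather than the question's code, and this is only an illustration of the idea, not a confirmed fix for the exact setup above.

import org.apache.spark.sql.SparkSession

object MapPartitionsArgDemo {

  // Stand-in for a heavy worker class that takes a runtime argument
  // (hypothetical class, not the asker's MyWorkerClass).
  class Worker(arg: String) {
    def process(s: String): String = s + ":" + arg
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    val bar = "runtime-value" // e.g. a program argument
    val localBar = bar        // local copy: the closure captures only this String

    val out = spark.createDataset(Seq("a", "b", "c"))
      .mapPartitions { it =>
        val worker = new Worker(localBar) // built on the executor, once per partition
        it.map(worker.process)
      }
      .collect()

    println(out.mkString(", "))
    spark.stop()
  }
}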

Losing path dependent type when extracting value from Try in scala

I'm working with scalax to generate a graph of my Spark operations. I have a custom library that generates my graph, so let me show a sample:
val DAGWithoutGet = createGraphFromOps(ops)
val DAGWithGet = createGraphFromOps(ops).get
The return type of DAGWithoutGet is
scala.util.Try[scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge]],
and for DAGWithGet it is
scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge].
Here, typeA is a project-related class representing a single Spark operation, not relevant to the context of this question. (For context only: what my custom library does is, essentially, generate a map of dependencies between those operations, creating a big Map object and calling Graph(myBigMap: _*) to generate the graph.)
As far as I know, calling .get at this point in my code or later should not make any difference, but that is not what I'm seeing.
Calling DAGWithoutGet.get.nodes has a return type of scalax.collection.Graph[typeA,DiEdge]#NodeSetT,
while calling DAGWithGet.nodes returns DAGWithGet.NodeSetT.
When I extract one of those nodes (using the .find method), I receive scalax.collection.Graph[typeA,DiEdge]#NodeT and DAGWithGet.NodeT types, respectively. Much to my dismay, even the methods available in each case are different - I cannot use pathTo (which happens to be what I want) or withSubgraph on the former, only on the latter.
My question, then, after this relatively complex example, is: what is going on here? Why does extracting the value from the Try construct at different moments lead to different types, one path-dependent and the other not? Or, if that isn't correct, what may I be missing here?
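For illustration, here is a minimal, scalax-free sketch of the distinction that seems to be in play (Container and Node are made-up stand-ins): binding the result of .get to a stable val gives the compiler a stable prefix for the path-dependent types, while going through the Try each time leaves only the type-projection view.

import scala.util.Try

object PathDependentDemo extends App {

  class Container {
    class Node
    def find(): Node = new Node
  }

  val tried: Try[Container] = Try(new Container)

  // Bound to a stable val: members keep their path-dependent type `stable.Node`.
  val stable = tried.get
  val n1: stable.Node = stable.find()

  // Going through the Try each time leaves no stable prefix to name, so the
  // compiler can only give the type-projection view `Container#Node`.
  val n2: Container#Node = tried.get.find()

  // The runtime classes are identical; the difference is purely in the types.
  println(n1.getClass == n2.getClass) // prints: true
}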

I am getting an error while trying to pass data from the scoreboard to a sequence, how do I get rid of it?

I am new to UVM and I am trying to verify a memory design, where I run a write sequence multiple times followed by a read sequence the same number of times, so that I can read back the same addresses I wrote to and compare. For this I created a new class extended from uvm_object with a queue that stores the addresses I write to, so that I can use them in the read sequence. I instantiate this class in the scoreboard and then send the class handle to the read sequence via uvm_config_db. The issue is that I am able to store addresses in the queue, but I am unable to get the class handle in the read sequence. Is this the right way of checking, or is there a better way to check the write and the read back from memory? Please help!
entire code link (yet to complete): https://www.edaplayground.com/x/3iTr
Relevant code snippets:
This is the class I created to store the addresses
class address_list extends uvm_object;
  reg [7:0] addr_q[$];
  function new(string name);
    super.new(name);
  endfunction
endclass
In my scoreboard, I pass the handle of the class holding the address queue to the read sequence; here is the snippet from the scoreboard:
virtual function void write(mem_seq_item pkt);
  if(pkt.wr_en==1)
  begin
    pkt_qu_write.push_back(pkt);
    addr.addr_q.push_back(pkt.addr);
    uvm_config_db#(address_list)::set(uvm_root::get(),"*","address",addr);
  end
  if(pkt.rd_en==1)
    pkt_qu_read.push_back(pkt);
  `uvm_info(get_type_name(),$sformatf("Address list is %p",addr.addr_q),UVM_LOW)
endfunction : write
In my read sequence, I am trying to get the handle
virtual task body();
  repeat(3)
    `uvm_do(wr_seq)
  if(!uvm_config_db#(address_list)::get(this, " ", "address", addr_))
    `uvm_fatal("NO_VIF",{"virtual interface must be set for:",get_full_name(),".addr_"});
  `uvm_info(get_type_name(),$sformatf("ADDR IS %p",addr_),UVM_LOW)
  repeat(3)
    `uvm_do(rd_seq)
endtask
Error-[ICTTFC] Incompatible complex type usage
mem_sequence.sv, 137 {line where I try to get from uvm_config_db}
Incompatible complex type usage in task or function call.
The following expression is incompatible with the formal parameter of the
function. The type of the actual is 'class $unit::wr_rd_sequence', while the
type of the formal is 'class uvm_pkg::uvm_component'. Expression: this
Source info: uvm_config_db#(_vcs_unit__3308544630::address_list)::get(this, " ", "address", this.addr_)
There are two problems with this line:
if(!uvm_config_db#(address_list)::get(this, " ", "address", addr_))
One is causing your error. One might lead to you not being able to find what you're looking for in the database.
This (literally this) is causing your error. You are calling get from a class derived from uvm_sequence. The first argument to get is expecting a class derived from uvm_component. Your problem is that a sequence is not part of the testbench hierarchy, so you cannot use a sequence as the first argument to a call to get (or set) in a uvm_config_db. Instead the convention is to use the sequencer that the sequence is running on, which is returned by a call to the sequence's get_sequencer() method. This solves your problem:
if(!uvm_config_db#(address_list)::get(get_sequencer(), "", "address", addr_))
This works because you used a wildcard when you called set.
Notice that I also removed the space from between the quotes. That might not give you a problem, because you used the wildcard when you called set, but in general this string should either be empty or should be a real hierarchical path. (The hierarchy input to the set and get calls is split between the first argument - a SystemVerilog hierarchical path - and the second - a string representing a hierarchical path).
uvm_config_db is basically for passing configuration between components.
For the purpose of passing data from a scoreboard to a sequence, you can use a uvm_event instead:
trigger the event in the scoreboard using event.trigger(address_list)
have the sequence wait for it using event.wait_trigger_data(address_list)

Transformer's Op name isn't available when setting opName

I created my custom transformer (a simple model that adds a string to a column value) to test MLeap serialization, but while writing my Op file for MLeap and Spark serialization, I couldn't use my transformer's name.
My reference.conf file looks like this
my.domain.mleap.spark.ops = ["spark_side.CustomTransformerOp"]
// include the custom transformers ops we have defined to the default Spark registries
ml.combust.mleap.spark.registry.v20.ops += my.domain.mleap.spark.ops
ml.combust.mleap.spark.registry.v21.ops += my.domain.mleap.spark.ops
ml.combust.mleap.spark.registry.v22.ops += my.domain.mleap.spark.ops
ml.combust.mleap.spark.registry.v23.ops += my.domain.mleap.spark.ops
my.domain.mleap.ops = ["mleap_side.CustomTransformerOp"]
// include the custom transformers we have defined to the default MLeap registry
ml.combust.mleap.registry.default.ops += my.domain.mleap.ops
When I run the pipeline with only that stage on my dataset, it works fine; I'm even able to save the pipeline if I set opName to some string or to one of the Bundle.BuiltinOps members.
If I put in some arbitrary string, an error pops up saying "unable to find key : thatString", and if I use another member, the error states that it's unable to find a key for that member (which is completely reasonable, and I understand why it happens).
My question is: how do I make the name of my transformer available when declaring opName in my Op files?
(if somebody could hit up Hollin Wilkins that would be amazing :D)
I had the same question. According to this link:
https://github.com/combust/mleap/wiki/Adding-an-MLeap-Spark-Transformer
you'll need to add it yourself to ml.combust.bundle.dsl.Bundle.BuiltinOps. In section 3, "Implement Bundle.ML serialization for MLeap", it says:
Note: if implementing a vanilla Spark transformer, make sure to add the opName to ml.combust.bundle.dsl.Bundle.BuiltinOps.

Writable type family resolution in Scrunch vs Crunch

I have a Scrunch Spark pipeline, and when I try to save its output to Avro format using:
data.write(to.avroFile(path))
I get the following Exception:
java.lang.ClassCastException: org.apache.crunch.types.writable.WritableType cannot be cast to org.apache.crunch.types.avro.AvroType
at org.apache.crunch.io.avro.AvroFileTarget.configureForMapReduce(AvroFileTarget.java:77) ~[crunch-core-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
at org.apache.crunch.impl.spark.SparkRuntime.monitorLoop(SparkRuntime.java:327) [crunch-spark-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
at org.apache.crunch.impl.spark.SparkRuntime.access$000(SparkRuntime.java:80) [crunch-spark-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
at org.apache.crunch.impl.spark.SparkRuntime$2.run(SparkRuntime.java:139) [crunch-spark-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
while this instead works just fine:
data.write(to.textFile(path))
(in both cases data being the same instance of PCollection and path a String)
I understand why that error happens: the PCollection I am trying to write belongs to the Writable type family instead of the Avro one. What is not clear to me is how Scrunch decides that my PCollection belongs to one family and not the other.
This mechanism, however, seems somewhat clearer in Crunch. According to the official Crunch documentation:
Crunch supports two different type families, which each implement the PTypeFamily interface: one for Hadoop's Writable interface and another based on Apache Avro. There are also classes that contain static factory methods for each PTypeFamily to allow for easy import and usage: one for Writables and one for Avros.
And then:
For most of your pipelines, you will use one type family exclusively, and so you can cut down on some of the boilerplate in your classes by importing all of the methods from the Writables or Avros classes into your class
import static org.apache.crunch.types.avro.Avros.*;
In fact, in the examples provided for Crunch in the official repo, you can see how this is made explicit. See the following code snippet from the WordCount example:
PCollection<String> lines = pipeline.readTextFile(args[0]);
PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      emitter.emit(word);
    }
  }
}, Writables.strings()); // Indicates the serialization format
PTable<String, Long> counts = words.count();
While the equivalent Scrunch version goes like this:
def countWords(file: String) = {
  read(from.textFile(file))
    .flatMap(_.split("\\W+").filter(!_.isEmpty()))
    .count
}
And no explicit, nor as far as I can see implicit, reference to the Writable type family is provided.
So how does Scrunch decide which type family to use? Is it based on the default of the original input Source (e.g. if it reads from a text file it's Writable, if from Avro then Avro)? If that is the case, how could I read from one source and write to a target that belongs to a different type family in Scrunch?