Mahout clustered points - cluster-analysis

When I run kmeans clustering in Mahoot I get two folders, clusters-x and clusteredPoints.
I have read cluster centers using cluster dumper, but I somehow can't get to clusteredPoints? Concretely, I need to do it from code.
The strange thing is that I file size in clusteredPoints is always 128 bytes, and when I try to loop through results, using next code, it just goes out of the loop, like there is no result, but I get the cluster centers, which leads to assumption that points are clustered.
IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
while (reader.next(key, value)) {
System.out.println(
value.toString() + " belongs to cluster " + key.toString());
}
It just goes out of the loop?
It is really strange, any help would be great, thanks.

You need to open up your final cluster file ('clusteredPoints/part-m-0') with:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("output/clusteredPoints/part-m-0"), conf);
then, assuming your keys are int's, iterate through it (as you already did), with:
IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
while (reader.next(key, value)) {
LOG.info("{} belongs to cluster {}", value.toString(), key.toString());
}
reader.close();
I can post a fully working example if you still have trouble doing this.

Related

Is there any way to use groupBy on a stream and send each substream to a different file?

For example, if I am parsing a log that starts with a server name and I want to split it in a file for each server, is there a way of doing that without knowing how many servers there are?
FileIO.fromPath(Paths.get("in.log"))
.via(Framing.delimiter(ByteString("\n".getBytes), maximumFrameLength = 4000)).map(_.utf8String)
.map(_.span(_ == ' '))
.groupBy(100, _._1)
This would result in a substream of (filename, logged), but I don't know if it's possible to connect each substream to a separate sink.
Do you need to further process the log lines? If not, you can just use a custom sink.
def writeLineByServerName(line: String): Unit = {
val name = getServerNameFromLine(line)
// either create a new output stream or get an existing one from
// a pool. you may need to manage your resources to limit open
// buffers
val outputStream = getOutputStream(name)
outputStream.write(line)
}
FileIO.fromPath(Paths.get("in.log"))
.via(Framing.delimiter(ByteString("\n".getBytes), maximumFrameLength = 4000)).map(_.utf8String)
.map(_.span(_ == ' '))
.to(Sink.foreach[String](writeLineByServerName))
Otherwise, you can read the file twice. The first read will figure out how many servers there are.

SparkSQL performance issue with collect method

We are currently facing a performance issue in sparksql written in scala language. Application flow is mentioned below.
Spark application reads a text file from input hdfs directory
Creates a data frame on top of the file using programmatically specifying schema. This dataframe will be an exact replication of the input file kept in memory. Will have around 18 columns in the dataframe
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
Creates a filtered dataframe from the first data frame constructed in step 2. This dataframe will contain unique account numbers with the help of distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two dataframes constructed in step 2 & 3, we will get all the records which belong to one account number and do some Json parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally the json parsed data will be put into Hbase table
Here we are facing performance issues while calling the collect method on top of the data frames. Because collect will fetch all the data into a single node and then do the processing, thus losing the parallel processing benefit.
Also in real scenario there will be 10 billion records of data which we can expect. Hence collecting all those records in to driver node will might crash the program itself due to memory or disk space limitations.
I don't think the take method can be used in our case which will fetch limited number of records at a time. We have to get all the unique account numbers from the whole data and hence I am not sure whether take method, which takes
limited records at a time, will suit our requirements
Appreciate any help to avoid calling collect methods and have some other best practises to follow. Code snippets/suggestions/git links will be very helpful if anyone have had faced similar issues
Code snippet
val eqpSchemaString = "acoountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)));
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....)
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema);
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}
The approach usually take in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism of Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means the every executor opens a connection (which is not serializable but lives within the boundaries of the function, thus not needing to be sent across the network) and independently sends its contents to HBase, without any need to collect all data on the driver (or any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
foreachPartition { rows =>
val conn = ??? // Create HBase connection
for (row <- rows) { // Loop over the iterator
val data = parseJson(row) // Your parsing logic
??? // Use 'conn' to save 'data'
}
}
You can ignore collect in your code if you have large set of data.
Collect Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Also this can cause the driver to run out of memory, though, because collect() fetches the entire RDD/DF to a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}

Flink: join file with kafka stream

I have a problem I don't really can figure out.
So I have a kafka stream that contains some data like this:
{"adId":"9001", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
And I want to replace 'adId' with another value 'bookingId'.
This value is located in a csv file, but I can't really figure out how to get it working.
Here is my mapping csv file:
9001;8
9002;10
So my output would ideally be something like
{"bookingId":"8", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
This file can get refreshed every hour at least once, so it should pick up changes to it.
I currently have this code which doesn't work for me:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30000); // create a checkpoint every 30 seconds
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
DataStream<String> adToBookingMapping = env.readTextFile(parameters.get("adToBookingMapping"));
DataStream<Tuple2<Integer,Integer>> input = adToBookingMapping.flatMap(new Tokenizer());
//Kafka Consumer
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", parameters.get("bootstrap.servers"));
properties.setProperty("group.id", parameters.get("group.id"));
FlinkKafkaConsumer010<ObjectNode> consumer = new FlinkKafkaConsumer010<>(parameters.get("inbound_topic"), new JSONDeserializationSchema(), properties);
consumer.setStartFromGroupOffsets();
consumer.setCommitOffsetsOnCheckpoints(true);
DataStream<ObjectNode> logs = env.addSource(consumer);
DataStream<Tuple4<Integer,String,Integer,Float>> parsed = logs.flatMap(new Parser());
// output -> bookingId, action, impressions, sum
DataStream<Tuple4<Integer, String,Integer,Float>> joined = runWindowJoin(parsed, input, 3);
public static DataStream<Tuple4<Integer, String, Integer, Float>> runWindowJoin(DataStream<Tuple4<Integer, String, Integer, Float>> parsed,
DataStream<Tuple2<Integer, Integer>> input,long windowSize) {
return parsed.join(input)
.where(new ParsedKey())
.equalTo(new InputKey())
.window(TumblingProcessingTimeWindows.of(Time.of(windowSize, TimeUnit.SECONDS)))
//.window(TumblingEventTimeWindows.of(Time.milliseconds(30000)))
.apply(new JoinFunction<Tuple4<Integer, String, Integer, Float>, Tuple2<Integer, Integer>, Tuple4<Integer, String, Integer, Float>>() {
private static final long serialVersionUID = 4874139139788915879L;
#Override
public Tuple4<Integer, String, Integer, Float> join(
Tuple4<Integer, String, Integer, Float> first,
Tuple2<Integer, Integer> second) {
return new Tuple4<Integer, String, Integer, Float>(second.f1, first.f1, first.f2, first.f3);
}
});
}
The code only runs once and then stops, so it doesn't convert new entries in kafka using the csv file. Any ideas on how I could process the stream from Kafka with the latest values from my csv file?
Kind regards,
darkownage
Your goal appears to be to join steaming data with a slow-changing catalog (i.e. a side-input). I don't think the join operation is useful here because it doesn't store the catalog entries across windows. Also, the text file is a bounded input whose lines are read once.
Consider using connect to create a connected stream, and store the catalog data as managed state to perform lookups into. The operator's parallelism would need to be 1.
You may find a better solution by researching 'side inputs', looking at the solutions that people use today. See FLIP-17 and Dean Wampler's talk at Flink Forward.

Compute file content hash with Scala

In our app, we are in need to compute file hash, so we can compare if the file was updated later.
The way I am doing it right now is with this little method:
protected[services] def computeMigrationHash(toVersion: Int): String = {
val migrationClassName = MigrationClassNameFormat.format(toVersion, toVersion)
val migrationClass = Class.forName(migrationClassName)
val fileName = migrationClass.getName.replace('.', '/') + ".class"
val resource = getClass.getClassLoader.getResource(fileName)
logger.debug("Migration file - " + resource.getFile)
val file = new File(resource.getFile)
val hc = Files.hash(file, Hashing.md5())
logger.debug("Calculated migration file hash - " + hc.toString)
hc.toString
}
It all works perfectly, until the code get's deployed into different environment and file file is located in a different absolute path. I guess, the hashing take the path into account as well.
What is the best way to calculate some sort of reliable hash of a file content that well produce the same result for as log as the content of a file stays the same?
Thanks,
Having perused the source code https://github.com/google/guava/blob/master/guava/src/com/google/common/io/Files.java - only the file contents are hashed - the path does not come into play.
public static HashCode hash(File file, HashFunction hashFunction) throws IOException {
return asByteSource(file).hash(hashFunction);
}
Therefore you need not worry about locality of the file. Now why you end up with a different hash on a different fs .. maybe you should compare the size/contents to ensure eg no compound eol's were introduced.

Empty data while reading data from kafka using Trident Topology

I am new to Trident. I am writing a trident topology which reads data from kafka. Topic name is 'test'. I have local kafka setup. I started zookeeper, kafka in local. And created a topic 'test' in kafka and opened the producer and typed the message 'Hello Kafka!'.
I want to read the message 'Hello Kafka' from the 'test' topic using trident.
Below is my code. I am getting empty tuple.
TridentTopology topology = new TridentTopology();
BrokerHosts brokerHosts = new ZkHosts("localhost:2181");
TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(brokerHosts, "test");
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
kafkaConfig.bufferSizeBytes = 1024 * 1024 * 4;
kafkaConfig.fetchSizeBytes = 1024 * 1024 * 4;
kafkaConfig.forceFromStart = false;
OpaqueTridentKafkaSpout opaqueTridentKafkaSpout = new OpaqueTridentKafkaSpout(kafkaConfig);
topology.newStream("TestSpout", opaqueTridentKafkaSpout).parallelismHint(1)
.each(new Fields(), new TestFilter()).parallelismHint(1)
.each(new Fields(), new Utils.PrintFilter());
and this is my TestFilter class code
public TestFilter()
{
//
}
#Override
public boolean isKeep(TridentTuple tuple) {
boolean isKeep=true;
System.out.println("TestFilter is called...");
if (tuple != null && tuple.getValues().size()>0) {
System.out.println("data from kafka ::: "+tuple.getValues());
}
return isKeep;
}
Whenever i type message in kafka producer to the 'test' topic, first sysout getting printed but it doesn't pass the if loop. I am simply getting message 'TestFilter is called...' not more than that.
I want to get the actual data i produced to the 'test' topic. How?
The problems lies in the parameters to Stream.each. The relevant portion of the javadoc for the method is:
each(Fields inputFields, Filter filter)
The documentation is't too clear about it, but the semantic is that you should specifies all the fields used by your filter using the inputFields parameter.
Storm will then apply a projection on the input tuple and forward it to the filter.
Given that you didn't specified any input fields, the projection resulted in an empty tuple thus resulting in the failure of the tuple.getValues().size()>0 condition inside the filter.
It's worth mentioning also the other variants of each:
each(Fields inputFields, Function function, Fields functionFields)
each(Function function, Fields functionFields)
These will apply the provided function on the projection of the input tuple, appending the resulting tuple to the original input tuple renaming the new fields as functionFields (i.e. the projection is only used for applying the function).
In particular the second version is equivalent to invoke each with inputFields set to null (or new Fields()) and will result in an empty tuple getting passed to function.