Left join operation in Apache Beam

I have two datasets with a common key column and I want to perform a left join. Is there a corresponding function in Apache Beam that performs a left join?

There is a small library of joins available in the Beam Java SDK; see if the implementation works for you: org.apache.beam.sdk.extensions.joinlibrary.Join
Update:
You can implement it yourself with a similar approach, using CoGroupByKey:
- put both PCollections into a KeyedPCollectionTuple;
- apply a CoGroupByKey, which groups elements from both PCollections per key per window;
- apply a ParDo that loops over the results of the CoGroupByKey, joins left and right records one at a time, and emits the results (see the CoGroupByKey example in the Beam Programming Guide).
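The steps above can be sketched in plain Python, without Beam, to show the logic only. This is a minimal illustration, not the Beam API: the function name left_outer_join and the in-memory lists are assumptions made for the example, and windowing is ignored.

```python
from collections import defaultdict


def left_outer_join(left, right, null_value):
    """Join two lists of (key, value) pairs, keeping every left record."""
    # Step 1: group the right-hand values by key (the role CoGroupByKey plays).
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    # Step 2: pair each left record with every matching right value,
    # or with null_value when the key has no match on the right.
    joined = []
    for key, left_value in left:
        matches = right_by_key.get(key)
        if matches:
            for right_value in matches:
                joined.append((key, (left_value, right_value)))
        else:
            joined.append((key, (left_value, null_value)))
    return joined


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, 1), (1, 10), (2, 2), (4, 4)]
result = left_outer_join(left, right, 0)
print(result)
```

Key 4 exists only on the right, so it is dropped; key 3 exists only on the left, so it is paired with the supplied null value.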

Example of left outer join:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.joinlibrary.Join;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Test;

public class TestJoin {
    @Test
    public void left_join_example() {
        Pipeline pipeline = Pipeline.create();
        PCollection<KV<Integer, String>> leftCollection = pipeline.apply(Create.of(KV.of(1, "a"), KV.of(2, "b"), KV.of(3, "c")));
        PCollection<KV<Integer, Integer>> rightCollection = pipeline.apply(Create.of(KV.of(1, 1), KV.of(1, 10), KV.of(2, 2), KV.of(4, 4)));
        // The third argument (0) is the value used for the right side when a left key has no match.
        PCollection<KV<Integer, KV<String, Integer>>> leftJoinCollection = Join.leftOuterJoin(leftCollection, rightCollection, 0);
        PAssert.that(leftJoinCollection).containsInAnyOrder(KV.of(1, KV.of("a", 1)), KV.of(1, KV.of("a", 10)),
            KV.of(2, KV.of("b", 2)), KV.of(3, KV.of("c", 0)));
        pipeline.run().waitUntilFinish();
    }
}

Related

Best approach for building an LSH table using Apache Beam and Dataflow

I have an LSH table builder utility class which goes as follows (referred from here):
class BuildLSHTable:
    def __init__(self, hash_size=8, dim=2048, num_tables=10, lsh_file="lsh_table.pkl"):
        self.hash_size = hash_size
        self.dim = dim
        self.num_tables = num_tables
        self.lsh = LSH(self.hash_size, self.dim, self.num_tables)
        self.embedding_model = embedding_model
        self.lsh_file = lsh_file

    def train(self, training_files):
        for id, training_file in enumerate(training_files):
            image, label = training_file
            if len(image.shape) < 4:
                image = image[None, ...]
            features = self.embedding_model.predict(image)
            self.lsh.add(id, features, label)
        with open(self.lsh_file, "wb") as handle:
            pickle.dump(self.lsh,
                        handle, protocol=pickle.HIGHEST_PROTOCOL)
I then execute the following in order to build my LSH table:
training_files = zip(images, labels)
lsh_builder = BuildLSHTable()
lsh_builder.train(training_files)
Now, when I am trying to do this via Apache Beam (code below), it's throwing:
TypeError: can't pickle tensorflow.python._pywrap_tf_session.TF_Operation objects
Code used for Beam:
def generate_lsh_table(args):
    options = beam.options.pipeline_options.PipelineOptions(**args)
    args = namedtuple("options", args.keys())(*args.values())
    with beam.Pipeline(args.runner, options=options) as pipeline:
        (
            pipeline
            | 'Build LSH Table' >> beam.Map(
                args.lsh_builder.train, args.training_files)
        )
This is how I am invoking the beam runner:
args = {
    "runner": "DirectRunner",
    "lsh_builder": lsh_builder,
    "training_files": training_files
}
generate_lsh_table(args)
Apache Beam pipelines are converted to a standard (for example, proto) format before being executed. As part of this, certain pipeline objects such as DoFns get serialized (pickled). If your DoFns have instance variables that cannot be serialized, this process cannot continue.
One way to solve this is to load/define such instance objects or modules during execution instead of creating and storing such objects during pipeline submission. This might require adjusting your pipeline.
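As a rough illustration of that idea, here is a minimal sketch in plain Python. It uses the standard pickle module and a threading.Lock as stand-ins, which are assumptions for the example: Beam uses its own pickler, and the unpicklable object in the question is a TensorFlow model. The class names EagerWorker and LazyWorker are hypothetical. In a real pipeline the deferred creation would typically go in the DoFn's setup method.

```python
import pickle
import threading


class EagerWorker:
    """Creates its resource at construction time, i.e. at pipeline submission."""

    def __init__(self):
        # A lock stands in for any object that cannot be pickled.
        self.resource = threading.Lock()


class LazyWorker:
    """Defers creating the resource until the worker actually runs."""

    def __init__(self):
        self.resource = None  # nothing unpicklable is stored yet

    def setup(self):
        # Create the resource on first use, after deserialization.
        if self.resource is None:
            self.resource = threading.Lock()

    def process(self, element):
        self.setup()
        with self.resource:
            return element * 2


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


eager_ok = can_pickle(EagerWorker())
lazy_ok = can_pickle(LazyWorker())
restored = pickle.loads(pickle.dumps(LazyWorker()))
doubled = restored.process(3)
print(eager_ok, lazy_ok, doubled)
```

The eager version fails to serialize because the lock is already an instance attribute; the lazy version serializes cleanly and builds the lock only when process is first called.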

How do you write to Kafka using the ABRiS library in PySpark?

Has anyone been able to write to Kafka using this library using PySpark?
I've been able to successfully read using the code from the README documentation:
import logging, traceback
import requests
from pyspark.sql import Column
from pyspark.sql.column import *
jvm_gateway = spark_context._gateway.jvm
abris_avro = jvm_gateway.za.co.absa.abris.avro
naming_strategy = getattr(getattr(abris_avro.read.confluent.SchemaManager, "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
schema_registry_config_dict = {"schema.registry.url": schema_registry_url,
                               "schema.registry.topic": topic,
                               "value.schema.id": "latest",
                               "value.schema.naming.strategy": naming_strategy}
conf_map = getattr(getattr(jvm_gateway.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
for k, v in schema_registry_config_dict.items():
    conf_map = getattr(conf_map, "$plus")(jvm_gateway.scala.Tuple2(k, v))
deserialized_df = data_frame.select(Column(abris_avro.functions.from_confluent_avro(data_frame._jdf.col("value"), conf_map))
                                    .alias("data")).select("data.*")
However, I am struggling to extend the behaviour by writing to topics via the to_confluent_avro function.

How to implement LEFT or RIGHT JOIN using spark-cassandra-connector

I have a Spark Streaming job. I am using Cassandra as the datastore.
I have a stream which needs to be joined with a Cassandra table.
I am using spark-cassandra-connector. It has a great method, joinWithCassandraTable, which as far as I can understand implements an inner join with a Cassandra table:
val source: DStream[...] = ...
source.foreachRDD { rdd =>
  rdd.joinWithCassandraTable( "keyspace", "table" ).map{ ...
  }
}
So the question is: how can I implement a left outer join with a Cassandra table?
Thanks in advance
This is currently not supported, but there is a ticket to introduce the functionality. Please vote on it if you would like it introduced in the future.
https://datastax-oss.atlassian.net/browse/SPARKC-181
A workaround is suggested in the ticket
As RussS mentioned, this feature is not yet available in the spark-cassandra-connector driver. As a workaround, I propose the following code snippet:
rdd.foreachPartition { partition =>
  CassandraConnector(rdd.context.getConf).withSessionDo { session =>
    for (
      leftSide <- partition;
      rightSide <- {
        val rs = session.execute(s"""SELECT * FROM "keyspace".table where id = "${leftSide._2}"""")
        val iterator = new PrefetchingResultSetIterator(rs, 100)
        if (iterator.isEmpty) Seq(None)
        else iterator.map(r => Some(r.getString(1)))
      }
    ) yield (leftSide, rightSide)
  }
}
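The shape of that workaround (one lookup per left-hand record, emitting an empty right side when nothing is found) can be sketched in plain Python. This is only an illustration of the logic: the dictionary table and the lookup function are hypothetical stand-ins for the per-row Cassandra query, and None plays the role of Scala's None.

```python
def left_join_by_lookup(left_rows, lookup):
    """Yield (left_row, match) pairs; match is None when the lookup finds nothing."""
    for left_row in left_rows:
        matches = lookup(left_row)
        if not matches:
            # Keep the left row even though nothing matched on the right.
            yield (left_row, None)
        else:
            for match in matches:
                yield (left_row, match)


# Hypothetical stand-in for the table being queried, keyed by id.
table = {"k1": ["v1"], "k2": ["v2a", "v2b"]}


def lookup(row):
    # The second field of the left row is used as the lookup id.
    return table.get(row[1], [])


rows = [("a", "k1"), ("b", "k2"), ("c", "missing")]
joined = list(left_join_by_lookup(rows, lookup))
print(joined)
```

Note that this issues one query per left-hand record, which is the main cost of the approach.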

How to extract best parameters from a CrossValidatorModel

I want to find the parameters of ParamGridBuilder that produce the best model in CrossValidator in Spark 1.4.x.
In the Pipeline Example in the Spark documentation, they add different parameters (numFeatures, regParam) by using ParamGridBuilder in the Pipeline. Then the following line of code produces the best model:
val cvModel = crossval.fit(training.toDF)
Now, I want to know which parameters (numFeatures, regParam) from ParamGridBuilder produce the best model.
I already used the following commands without success:
cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()
Any help?
Thanks in advance,
One method to get a proper ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:
implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
  def bestEstimatorParamMap: ParamMap = {
    cvModel.getEstimatorParamMaps
      .zip(cvModel.avgMetrics)
      .maxBy(_._2)
      ._1
  }
}
Running this on the CrossValidatorModel trained in the Pipeline Example you cited gives:
scala> println(cvModel.bestEstimatorParamMap)
{
hashingTF_2b0b8ccaeeec-numFeatures: 100,
logreg_950a13184247-regParam: 0.1
}
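The argmax idea is independent of Spark and can be sketched in plain Python. The parameter maps and metric values below are made up for illustration, and the function name best_param_map is hypothetical. The larger_is_better flag is there because taking the maximum is only right for metrics where higher is better; for an error metric such as RMSE you would want the minimum (Spark evaluators expose isLargerBetter for this).

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return the parameter map whose average metric is best."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    # Pair each parameter map with its metric, then pick the best pair.
    pairs = zip(param_maps, avg_metrics)
    pick = max if larger_is_better else min
    best_params, _ = pick(pairs, key=lambda pair: pair[1])
    return best_params


# Illustrative values only, not real cross-validation results.
param_maps = [
    {"numFeatures": 10, "regParam": 0.1},
    {"numFeatures": 100, "regParam": 0.1},
    {"numFeatures": 100, "regParam": 0.01},
]
avg_metrics = [0.71, 0.89, 0.84]

best = best_param_map(param_maps, avg_metrics)
print(best)
```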
val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages
val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)
val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)
To print everything in paramMap, you actually don't have to call parent:
cvModel.bestModel.extractParamMap()
To answer OP's question, to get a single best parameter, for example regParam:
cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))
This is how you get the chosen parameters:
println(cvModel.bestModel.getMaxIter)
println(cvModel.bestModel.getRegParam)
This Java code should work:
cvModel.bestModel().parent().extractParamMap()
You can translate it to Scala code. The parent() method returns an estimator, from which you can get the best params.
This is the ParamGridBuilder()
paraGrid = ParamGridBuilder().addGrid(
    hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
    lr.regParam, [0.1, 0.01, 0.001]
).build()
There are 3 stages in the pipeline. It seems we can access the parameters as follows:
for stage in cv_model.bestModel.stages:
    print('stage: {}'.format(stage))
    print(stage.params)
    print('\n')
stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]
stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]
stage: LogisticRegression_451b8c8dbef84ecab7a9
[]
However, there is no parameter in the last stage, LogisticRegression.
We can also get the weights and intercept from the logistic regression stage, like the following:
cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])
Full exploration:
http://kuanliang.github.io/2016-06-07-SparkML-pipeline/
I am working with Spark Scala 1.6.x and here is a full example of how I set up and fit a CrossValidator and then return the value of the parameter used to get the best model (assuming that training.toDF gives a DataFrame ready to be used):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Instantiate a LogisticRegression object
val lr = new LogisticRegression()
// Instantiate a ParamGrid with different values for the 'RegParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()
// Setting and fitting the CrossValidator on the training set, using 'MultiClassClassificationEvaluator' as evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)
// Getting the value of the 'RegParam' used to get the best model
val bestModel = cvModel.bestModel // Getting the best model
val paramReference = bestModel.getParam("regParam") // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.get(paramReference) // Getting the value of this parameter
print(paramValue) // In my case : 0.001
You can do the same for any parameter or any other type of model.
If you are using Java, see what the debugger shows:
bestModel.parent().extractParamMap()
Building on the solution of @macfeliga, a one-liner that works for pipelines:
cvModel.bestModel.asInstanceOf[PipelineModel]
  .stages.foreach(stage => println(stage.extractParamMap))
This SO thread kinda answers the question.
In a nutshell, you need to cast each object to its supposed-to-be class.
For the case of CrossValidatorModel, the following is what I did:
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel
// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)
// To get the parameters of the best model
(
reloadedCvModel.bestModel
.asInstanceOf[PipelineModel]
.stages(1)
.asInstanceOf[RandomForestRegressionModel]
.extractParamMap()
)
In the example, my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index is 1 for my model.
For me, the @orangeHIX solution is perfect:
val cvModel = cv.fit(training)
val cvMejorModelo = cvModel.bestModel.asInstanceOf[ALSModel]
cvMejorModelo.parent.extractParamMap()
res86: org.apache.spark.ml.param.ParamMap =
{
als_08eb64db650d-alpha: 0.05,
als_08eb64db650d-checkpointInterval: 10,
als_08eb64db650d-coldStartStrategy: drop,
als_08eb64db650d-finalStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-implicitPrefs: false,
als_08eb64db650d-intermediateStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-itemCol: product,
als_08eb64db650d-maxIter: 10,
als_08eb64db650d-nonnegative: false,
als_08eb64db650d-numItemBlocks: 10,
als_08eb64db650d-numUserBlocks: 10,
als_08eb64db650d-predictionCol: prediction,
als_08eb64db650d-rank: 1,
als_08eb64db650d-ratingCol: rating,
als_08eb64db650d-regParam: 0.1,
als_08eb64db650d-seed: 1994790107,
als_08eb64db650d-userCol: user
}

Get the first elements (take function) of a DStream

I am looking for a way to retrieve the first elements of a DStream created as:
val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble))
Unfortunately, there is no take function on a DStream (as there is on an RDD), so dstream.take(2) does not work.
Does anyone have an idea how to do it? Thanks.
You can use the transform method on the DStream object: take n elements of the input RDD and save them to a list, then filter the original RDD to keep only the elements contained in that list. This returns a new DStream containing the first n elements.
val n = 10
val partOfResult = dstream.transform(rdd => {
  val list = rdd.take(n)
  rdd.filter(list.contains)
})
partOfResult.print
The previously suggested solution did not work for me: the take() method returns an Array, which is not serializable, so Spark Streaming fails with a java.io.NotSerializableException.
A simple variation on the previous code that worked for me:
val n = 10
val partOfResult = dstream.transform(rdd => {
  rdd.filter(rdd.take(n).toList.contains)
})
partOfResult.print
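One caveat with the take-then-filter approach is worth showing. Here is a small plain-Python sketch of the same logic on an ordinary list (the function names are hypothetical, and a list stands in for a single RDD batch). Filtering by membership keeps every element equal to one of the first n, so duplicates can make the result longer than n. Filtering by position avoids that; on an RDD, zipWithIndex gives you the positions.

```python
def first_n_by_membership(batch, n):
    """Keep every element that equals one of the first n elements."""
    head = list(batch[:n])
    return [item for item in batch if item in head]


def first_n_by_index(batch, n):
    """Keep only the elements whose position is below n."""
    return [item for index, item in enumerate(batch) if index < n]


batch = [3, 1, 3, 2, 1]
by_membership = first_n_by_membership(batch, 2)
by_index = first_n_by_index(batch, 2)
print(by_membership, by_index)
```

With distinct elements both functions return the same result, so the difference only matters when a batch can contain repeated values.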
Sharing a Java-based solution that is working for me. The idea is to use a custom function that passes through only the top row of a sorted RDD.
someData.transform(
    rdd -> {
        JavaRDD<CryptoDto> result =
            rdd.keyBy(Recommendations.volumeAsKey)
                .sortByKey(new CryptoComparator()).values().zipWithIndex()
                .map(row -> {
                    CryptoDto purchaseCrypto = new CryptoDto();
                    // zipWithIndex is zero-based, so the top row gets buyIndicator 1
                    purchaseCrypto.setBuyIndicator(row._2 + 1L);
                    purchaseCrypto.setName(row._1.getName());
                    purchaseCrypto.setVolume(row._1.getVolume());
                    purchaseCrypto.setProfit(row._1.getProfit());
                    purchaseCrypto.setClose(row._1.getClose());
                    return purchaseCrypto;
                })
                .filter(Recommendations.selectTopInSortedRdd);
        return result;
    }).print();
The custom function selectTopInSortedRdd looks like this:
public static Function<CryptoDto, Boolean> selectTopInSortedRdd = new Function<CryptoDto, Boolean>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Boolean call(CryptoDto value) throws Exception {
        if (value.getBuyIndicator() == 1L) {
            System.out.println("Value of buyIndicator :" + value.getBuyIndicator());
            return true;
        }
        else {
            return false;
        }
    }
};
It basically compares all incoming elements, and returns true only for the first record from the sorted RDD.
This seems to be always an issue with DStreams as well as regular RDDs.
If you don't want to (or can't) use .take(), especially in DStreams, you can think outside the box and use reduce instead. It is a valid function for both DStreams and RDDs.
Think about it. If you use reduce like this (Python example):
.reduce( lambda x, y : x)
Then what happens is: for every 2 elements you pass in, it always returns only the first. So if you have a million elements in your RDD or DStream, it shrinks them to a single element.
Simple and clean.
Keep in mind, however, that .reduce() does not take order into account, so the surviving element is not guaranteed to be the first one in any particular order. You can overcome this with a custom function.
Example: let's assume your data looks like this: x = (1, [1,2,3]) and y = (2, [1,2]), that is, tuples whose 2nd element is a list. If you want the element with the longest list, for example, your code could look like this (adapt as needed):
def your_reduce(x, y):
    if len(x[1]) > len(y[1]):
        return x
    else:
        return y

yourNewRDD = yourOldRDD.reduce(your_reduce)
Accordingly you will get '(1, [1,2,3])' as that has the longer list. There you go!
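To try both reducers without a cluster, here is a small sketch using functools.reduce on an ordinary list. The sample records are made up, and the function names keep_first and keep_longer are hypothetical. Note the assumption: functools.reduce walks a local list strictly left to right, whereas a distributed reduce combines partitions in no guaranteed order, which is why the order-independent version is the safer one on an RDD or DStream.

```python
from functools import reduce


def keep_first(x, y):
    """Always keep the left-hand element."""
    return x


def keep_longer(x, y):
    """Keep whichever record carries the longer list in its second field."""
    return x if len(x[1]) > len(y[1]) else y


records = [(1, [1, 2, 3]), (2, [1, 2]), (3, [7])]

first = reduce(keep_first, records)
longest = reduce(keep_longer, records)
print(first, longest)
```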
This has caused me some headaches in the past until I finally tried this. Hopefully this helps.