There's already a similar question here, but it uses Maven and I'm using sbt. Moreover, none of the solutions there worked for me.
I'm using Spark 2.4.0, Scala 2.11.12 and IntelliJ IDEA 2019.1
My build.sbt looks like:
libraryDependencies ++= Seq(
"com.groupon.sparklint" %% "sparklint-spark212" % "1.0.12" excludeAll ExclusionRule(organization = "org.apache.spark"),
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"org.apache.spark" %% "spark-streaming" % "2.4.0",
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.2",
"com.datastax.spark" %% "spark-cassandra-connector" % "2.4.0",
"com.typesafe.slick" %% "slick" % "3.3.0",
"org.slf4j" % "slf4j-nop" % "1.6.4",
"com.typesafe.slick" %% "slick-hikaricp" % "3.3.0",
"com.typesafe.slick" %% "slick-extensions" % "3.0.0"
)
Edit (question rewritten):
I will be receiving a stream of data from Kafka, which will be sent to the Spark Streaming context using:
val rawWeatherStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
From this, I want to create a stream of RawWeatherData objects. A sample output from the stream would look like:
(null,725030:14732,2008,12,31,11,0.6,-6.7,1001.7,80,6.2,8,0.0,0.0)
Everything looks fine, except that I need to drop the leading null value (the Kafka key) before creating the stream of RawWeatherData objects: the constructor cannot accept that null, but it can accept all the other values from the stream.
Just for clarity's sake, here's what RawWeatherData looks like (I cannot edit this):
case class RawWeatherData(
wsid: String,
year: Int,
month: Int,
day: Int,
hour: Int,
temperature: Double,
dewpoint: Double,
pressure: Double,
windDirection: Int,
windSpeed: Double,
skyCondition: Int,
skyConditionText: String,
oneHourPrecip: Double,
sixHourPrecip: Double) extends WeatherModel
To achieve that purpose, I send my stream into a function, which returns me the desired stream of RawWeatherData objects:
def ingestStream(rawWeatherStream: InputDStream[(String, String)]): DStream[RawWeatherData] = {
rawWeatherStream.map(_._2.split(",")).map(RawWeatherData(_))
}
Now I am looking to insert this stream into a MySQL/DB2 database. In this RawWeatherData object (725030:14732,2008,12,31,11,0.6,-6.7,1001.7,80,6.2,8,0.0,0.0), the leading fields (725030:14732, 2008, 12, 31, i.e. wsid, year, month, day) form the primary key, and oneHourPrecip is the value that has to be reduced/aggregated.
So essentially I want my DStream to have key-value pairs of ([725030:14732,2008,12,31] , <summed up values for the key>)
So after ingestStream, I try to perform this:
parsedWeatherStream.map { weather =>
(weather.wsid, weather.year, weather.month, weather.day, weather.oneHourPrecip)
}.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
After the end of the map, I try to append .reduceByKey(), but when I do, the error says Cannot resolve symbol reduceByKey. I'm not sure why this is happening, as the function is available in the Spark documentation.
PS. Right now weather.oneHourPrecip is set to a counter column in Cassandra, so Cassandra will automatically aggregate the value for me. But this will not be possible in other databases like DB2, hence I want an apt replacement, like reduceByKey in Spark. Is there any way to proceed in such a case?
The type of your stream is DStream[RawWeatherData], while reduceByKey is available only on streams of type DStream[(K, V)], i.e. streams of tuples consisting of a key and a value.
What you probably want to do is use mapValues instead of map:
val parsedWeatherStream: DStream[(String, RawWeatherData)] = rawWeatherStream
.mapValues(_.split(","))
.mapValues(RawWeatherData(_))
As you can see from the type of parsedWeatherStream in the snippet above, using mapValues keeps your keys, so you can then use reduceByKey.
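For the MySQL/DB2 case from the question, here is a minimal sketch of the aggregation (not from the original answer; it assumes parsedWeatherStream: DStream[(String, RawWeatherData)] as in the snippet above, field names come from the RawWeatherData case class, and the JDBC write is only a placeholder):
// Key by (wsid, year, month, day) and sum oneHourPrecip per micro-batch.
val dailyPrecip: DStream[((String, Int, Int, Int), Double)] =
  parsedWeatherStream
    .map { case (_, weather) =>
      ((weather.wsid, weather.year, weather.month, weather.day), weather.oneHourPrecip)
    }
    .reduceByKey(_ + _)
// Each batch can then be written with plain JDBC instead of relying on a Cassandra counter:
dailyPrecip.foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    // open a JDBC connection here and upsert each ((wsid, year, month, day), sum) row
    rows.foreach(println) // placeholder for the actual JDBC write
  }
}
Note that reduceByKey on a DStream aggregates within each batch interval; if you need a running total across batches, you would reach for something like updateStateByKey or mapWithState instead.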
If you want to load a module's sources and/or javadocs, in sbt you write the following:
lazy val joda_timeV = "1.2.3"
lazy val scalatagsV = "1.2.3"
lazy val upickleV = "1.2.4"
lazy val autowireV = "1.2.5"
lazy val scalarxV = "1.2.6"
libraryDependencies ++= Seq(
"joda-time" % "joda-time" % joda_timeV withJavadoc(),
"com.lihaoyi" %%% "scalatags" % scalatagsV withSources() withJavadoc(),
"com.lihaoyi" %% "upickle" % upickleV withSources() withJavadoc(),
"com.lihaoyi" %%% "autowire" % autowireV withJavadoc(),
"com.lihaoyi" %%% "scalarx" % scalarxV withSources(),
"org.scalatestplus.play" %% "scalatestplus-play" % scalatestplus_playV % "test" withSources() withJavadoc()
)
In Mill you write:
override def ivyDeps = Agg(
ivy"joda-time:joda-time:${joda_timeV}",
ivy"com.lihaoyi:::scalatags:${scalatagsV}",
ivy"com.lihaoyi::upickle:${upickleV}",
ivy"com.lihaoyi:::autowire:${autowireV}",
ivy"com.lihaoyi:::scalarx:${scalarxV}"
)
but how can you add withJavadoc(), withSources(), or withSources() withJavadoc() into the Mill build.sc?
There is a function .withConfiguration(String), but there is no scaladoc explaining how to use it.
Is it possible to define that a module is available only in test (like org.scalatestplus.play in the previous code), or should I create a separate ivyDeps for the testing module?
Regarding your first question, I assume you are interested in good IDE support, e.g. completion and jumping to the sources of your dependencies.
Mill already supports IDE integration. It comes with a project generator for IntelliJ IDEA (mill mill.scalalib.GenIdea/idea), which automatically downloads the sources for you. Alternatively, you can use the newer BSP (Build Server Protocol) support, which, in combination with the Metals language server (https://scalameta.org/metals/), should provide a nice editing experience in various IDEs and editors. Unfortunately, at the time of this writing, Mill's built-in BSP server isn't as robust as its IDEA generator, but there is yet another alternative: the Bloop contrib module. All these methods should provide decent code navigation through dependencies and completion.
And to your second question:
Is it possible to define that a module is available only in test (like org.scalatestplus.play in the previous code), or should I create a separate ivyDeps for the testing module?
Test dependencies are declared in the test modules (which are technically regular modules too).
// build.sc
// ...
object yourplaymodule extends PlayModule {
override def ivyDeps = Agg(
ivy"joda-time:joda-time:${joda_timeV}",
ivy"com.lihaoyi:::scalatags:${scalatagsV}",
ivy"com.lihaoyi::upickle:${upickleV}",
ivy"com.lihaoyi:::autowire:${autowireV}",
ivy"com.lihaoyi:::scalarx:${scalarxV}"
)
// ...
object test extends PlayTests {
override def ivyDeps = Agg(
ivy"org.scalatestplus.play::scalatestplus-play:${scalatestplus_playV}"
)
}
}
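With this layout the test-only dependency stays scoped to the test module; assuming the module names from the sketch above, you would run the tests with mill yourplaymodule.test.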
Edit 2021-09-16: Added the answer to the first question.
How to write complex numbers in an HDF5 dataset with Matlab?
The high-level API (h5create) does not support complex data.
% Write the matrix A (complex data) to HDF5.
% Preserve the memory layout: what is contiguous in MATLAB (the
% columns) remains contiguous in the HDF5 file (the rows).
% In other words, the HDF5 dataset appears to be transposed.
%
% Adapted from https://support.hdfgroup.org/ftp/HDF5/examples/examples-by-api/matlab/HDF5_M_Examples/h5ex_t_cmpd.m
A = reshape((1:6)* (1 + 2 * 1i), 2, 3);
fileName = 'complex_example.h5';
datasetName = 'A';
%
% Initialize data. It is more efficient to use Structures with array fields
% than arrays of structures.
%
wdata = struct;
wdata.r = real(A);
wdata.i = imag(A);
%% File creation/opening
file = H5F.create(fileName, 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
%file = H5F.open(fileName);
%% Datatype creation
%Create the complex datatype:
doubleType = H5T.copy('H5T_NATIVE_DOUBLE');
sz = [H5T.get_size(doubleType), H5T.get_size(doubleType)];
% Compute the offsets to each field. The first offset is always zero.
offset = [0, sz(1)];
% Create the compound datatype for the file and for the memory (same).
filetype = H5T.create ('H5T_COMPOUND', sum(sz));
H5T.insert (filetype, 'r', offset(1), doubleType);
H5T.insert (filetype, 'i', offset(2), doubleType);
%% Write data
% Create dataspace. Setting maximum size to [] sets the maximum
% size to be the current size.
space = H5S.create_simple (ndims(A), fliplr(size(A)), []);
% Create the datasetName and write the compound data to it.
dset = H5D.create (file, datasetName, filetype, space, 'H5P_DEFAULT');
H5D.write (dset, filetype, 'H5S_ALL', 'H5S_ALL', 'H5P_DEFAULT', wdata);
%% Finalise
% Close and release resources.
H5D.close(dset);
H5S.close(space);
H5T.close(filetype);
H5F.close(file);
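To quickly verify the result you can run h5disp(fileName), which should list a compound dataset /A with members r and i holding the real and imaginary parts.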
I am using Play Framework 2.4.x and these libraries:
"org.scalatest" %% "scalatest" % "2.2.1" % "test"
"org.scalatestplus" %% "play" % "1.4.0-M3" % "test"
I want to check whether certain strings are in a List I build in the test. This is the code:
val userTeams = validateAndGet((teamsUserResponse.json \ "teams").asOpt[List[TeamUser]]).map( x => x.teamKey )
userTeams must contain ("team1", "team2")
But I am getting this error
List("team1", "team2") did not contain element (team1,team2)
If you write ("team1", "team2"), then you're actually creating a tuple of two strings, which, from the perspective of the ScalaTest matcher, is a single element.
Based on the documentation, you have to use allOf:
userTeams must contain allOf ("team1", "team2")
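As a side note, allOf expects at least two distinct elements; for a single expected element, plain contain ("team1") is enough, and if the list should contain exactly those elements and nothing else, contain theSameElementsAs Seq("team1", "team2") is another matcher worth knowing.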
I have a very simple question, but I couldn't figure out how to solve this. I have the function definition below:
function model = oasis(data, class_labels, parms)
% model = oasis(data, class_labels, parms)
%
% Code version 1.3 May 2011 Fixed random seed setting
% Code version 1.2 May 2011 added call to oasis_m.m
% Code version 1.1 May 2011 handle gaps in class_labels
%
% Input:
% -- data - Nxd sparse matrix (each instance being a ROW)
% -- class_labels - label of each data point (Nx1 integer vector)
% -- parms (do sym, do_psd, aggress etc.)
%
% Output:
% -- model.W - dxd matrix
% -- model.loss_steps - a binary vector: was there an update at
% each iterations
% -- modeo.parms, the actual parameters used in the run (inc. defaults)
%
% Parameters:
% -- aggress: The cutoff point on the size of the correction
% (default 0.1)
% -- rseed: The random seed for data point selection
% (default 1)
% -- do_sym: Whether to symmetrize the matrix every k steps
% (default 0)
% -- do_psd: Whether to PSD the matrix every k steps, including
% symmetrizing them (defalut 0)
% -- do_save: Whether to save the intermediate matrices. Note that
% saving is before symmetrizing and/or PSD in case they exist
% (default 0)
% -- save_path: In case do_save==1 a filename is needed, the
% format is save_path/part_k.mat
% -- num_steps - Number of total steps the algorithm will
% run (default 1M steps)
% -- save_every: Number of steps between each save point
% (default num_steps/10)
% -- sym_every: An integer multiple of "save_every",
% indicates the frequency of symmetrizing in case do_sym=1. The
% end step will also be symmetrized. (default 1)
% -- psd_every: An integer multiple of "save_every",
% indicates the frequency of projecting on PSD cone in case
% do_psd=1. The end step will also be PSD. (default 1)
% -- use_matlab: Use oasis_m.m instead of oasis_c.c
% This is provided in the case of compilation problems.
%
I want to use this function, but I can't figure out how to set the parameters or use the default values. What is the variable parms in this case; is it an object that keeps all the other variables? Can I use something like Python's keyword-argument syntax, where we pass the name of the parameter plus its value? For example:
model = oasis(data_example, labels_example, agress = 0.2)
Additionally, if I have understood correctly, I get two objects in the output, model and modeo, so do I need to make this call to receive everything this function returns?
[model,modeo] = oasis(data_example, labels_example, ?(parms)?)
From your function definition it seems like parms is simply a placeholder for the parameters. Typically the parameters themselves are passed as pairs of inputs in the form:
model = oasis(data, class_labels, 'do_sym',do_symValue, 'do_psd', do_psdValue,...)
where do_symValue and do_psdValue are the values you want to pass as the respective parameters.
As for the function's return value, it returns a single struct with members W, loss_steps, and parms. I believe that what you thought was a second output (modeo) is simply a typo in the comment text, at least based on the function's definition.
From the documentation above I don't know which one is right, but there are two common ways to handle optional parameters in MATLAB.
parameter value pairs:
model = oasis(data, class_labels, 'do_sym',1,'do_psd',0)
structs:
params.do_sym=1
params.do_psd=0
model = oasis(data, class_labels, params)
Probably one of these two possibilities is right.
I just migrated an existing project from Play Framework 2.0 to 2.1-RC1, and for some reason I'm now having to convert everything from Java classes to Scala classes when I render the views. (Obviously I'm using Play in a Java project rather than a Scala project.)
Below is an example error...
render(java.lang.String,scala.collection.immutable.List<models.User>) in views.html.list cannot be applied to (java.lang.String,java.util.List<models.User>)
And the top line of my view...
#(message: String, users : List[models.User])
From this I surmise that, for some reason, classes aren't being automatically converted from java.util.List to the Scala equivalent. I'm a Java guy, not a Scala guy, at this stage, so I may be doing something stupid.
Example code that calls render...
public static Result list() {
List<User> users = MorphiaManager.getDatastore().find(User.class).asList();
System.out.println("about to try to display list of " + users.size() + " users");
return ok(list.render("Welcome", users));
}
Build.scala below
import sbt._
import Keys._
import PlayProject._
object ApplicationBuild extends Build {
val appName = "blah-worker"
val appVersion = "1.0-SNAPSHOT"
val appDependencies = Seq(
// Play framework dependencies
javaCore, javaJdbc, javaEbean,
// Add your project dependencies here,
"org.apache.camel" % "camel-core" % "2.10.0",
"org.apache.camel" % "camel-jms" % "2.10.0",
"org.apache.camel" % "camel-mail" % "2.10.0",
"org.apache.camel" % "camel-jackson" % "2.10.0",
"org.apache.camel" % "camel-gson" % "2.10.0",
"org.apache.activemq" % "activemq-core" % "5.6.0",
"org.apache.activemq" % "activemq-camel" % "5.6.0",
"org.apache.activemq" % "activemq-pool" % "5.6.0",
"com.google.code.morphia" % "morphia" % "0.99.1-SNAPSHOT",
"com.google.code.morphia" % "morphia-logging-slf4j" % "0.99",
"cglib" % "cglib-nodep" % "[2.1_3,)",
"com.thoughtworks.proxytoys" % "proxytoys" % "1.0",
"org.apache.james" % "apache-mime4j" % "0.7.2",
("org.jclouds" % "jclouds-allblobstore" % "1.5.0-beta.4").exclude("com.google.guava", "guava").exclude("org.reflections", "reflections"),
("org.reflections" % "reflections" % "0.9.7").exclude("com.google.guava", "guava").exclude("javassist", "javassist")
)
val main = play.Project(appName, appVersion, appDependencies).settings(
// Add your own project settings here
resolvers += "Morphia repo" at "http://morphia.googlecode.com/svn/mavenrepo/",
resolvers += "CodeHaus for ProxyToys" at "http://repository.codehaus.org/",
checksums := Nil
)
}
Are you missing the new 'javaCore' dependency? It is required for Java projects using Play 2.1. Look here for migration details:
https://github.com/playframework/Play20/wiki/Migration
Figured it out: I hadn't updated one of the imports in Build.scala.
Specifically...
import PlayProject._
should be updated to...
import play.Project._
which is also detailed in the migration guide (but I missed it): https://github.com/playframework/Play20/wiki/Migration
I'm not sure if it will fix your problem, but can you try adding this import:
import scala.collection.JavaConversions.*;
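A sketch of how that suggestion could be applied inside the template itself (the import placement and loop body are illustrative, not from the original answers): declare the parameter as java.util.List so the generated render method matches the Java call, and import the conversions so Scala's for syntax still works over it:
@(message: String, users: java.util.List[models.User])
@import scala.collection.JavaConversions._

<ul>
@for(user <- users) {
  <li>@user</li>
}
</ul>
With the parameter declared as java.util.List, the controller can pass the Morphia result directly without any conversion on the Java side.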