Create a weighted feeder in Gatling - scala

I have a few .csv files I want to use for the same data in Gatling. Each of these files has a certain number of ID's that I want to be accessed fairly equally. I don't want to put them all in the same file because the .csv files are generated from SQL queries and, while I may have a lot of IDs in one file, I only have a few in another. What's important to me is that I have a random sample from each of my files and a way to specify the distribution.
I found an example of how to do this but I'm having trouble applying it in my case. Here is the code I have so far. I try to both 1) print out the value from the feeder in the session and 2) try to use the value from the feeder in a get request. Both attempts fail with various errors which I detail below:
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.jdbc.Predef._
import util.Random
class FeederTest extends Simulation {
//headers...
val userCreds = csv("user_creds.csv")
val sample1 = csv("sample1.csv")
val sample2 = csv("sample2.csv")
def randFeed(): String = {
val foo = Random.nextInt(2)
var retval = ""
if (foo == 0) retval = "file1"
if (foo == 1) retval = "file2"
return retval
}
val scn = scenario("feeder test")
.repeat(1) {
feed(userCreds)
.doSwitch(randFeed)(
"file1" -> feed(sample1),
"file2" -> feed(sample2)
)
.exec(http("request - login")
.post("<URL>")
.headers(headers_login)
.formParam("email", "${username}")
.formParam("password", "<not telling>"))
.exec(session => {
println(session)
println(session("first").as[String])
session})
.exec(http("goto_url")
.get("<my url>/${first}"))
}
setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}
This is the error I get when I attempt to printout the feeder value in the session (as in the above code using session(<value>).as[String]):
[ERROR] [03/13/2015 10:22:38.221] [GatlingSystem-akka.actor.default-dispatcher-8] [akka://GatlingSystem/user/sessionHook-2] key not found: first
java.util.NoSuchElementException: key not found: first
at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at io.gatling.core.session.SessionAttribute.as(Session.scala:40)
at FeederTest$$anonfun$2.apply(feeder_test.scala:81)
at FeederTest$$anonfun$2.apply(feeder_test.scala:79)
at io.gatling.core.action.SessionHook.executeOrFail(SessionHook.scala:35)
at io.gatling.core.action.Failable$class.execute(Actions.scala:71)
at io.gatling.core.action.SessionHook.execute(SessionHook.scala:28)
at io.gatling.core.action.Action$$anonfun$receive$1.applyOrElse(Actions.scala:29)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at io.gatling.core.akka.BaseActor.aroundReceive(BaseActor.scala:22)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I have also tried just using the EL expression ${first} in the session. All this does is print out the string ${first}. Similarly in the last .get line I get an error saying "No attribute named 'first' is defined.
The CSV files I'm currently using just have two columns of first and last in them like so:
sample1.csv:
first, last
george, bush
bill, clinton
barak, obama
sample2.csv:
first, last
super, man
aqua, man
bat, man
I'm using Gatling 2.1.4.

doSwitch takes an Expression[Any], which is a type alias for Session => Validation[Any]. Gatling has an implicit conversion that let you pass a static value instead, see documentation.
Which is exactly what you do. Even if randFeed is a def, it still doesn't return a function, but a String.
As you want randFeed to be called every time a virtual user pass through this step, you have to wrap the randFeed inside a function, even if you don't use the Session input parameter.
doSwitch(_ => randFeed)
Then, your randFeed is both ugly (no offense) and inefficient (Random is synchronized):
import scala.concurrent.forkjoin.ThreadLocalRandom
def randFeed(): String =
ThreadLocalRandom.current().nextInt(2) match {
case 0 => "file1"
case 1 => "file2"
}

I've never used feeders, but from the documentation, I see two possible problems:
randFeed returns either "foo" or "bar" but you're mapping with "file1" and "file2", so is anything being loaded? (The documentation says "If no switch is selected, the switch is bypassed.")
The examples for feeders I see don't show accessing fed data with session(varname) but rather "${varname}".

Related

Mock inner class's attributes using MagicMock

Apologies for a length post. I have been trying to beat my head around reading about mock, MagicMock, and all the time getting confused. Hence, decided to write this post.
I know several questions, and pages have been written on this. But, still not able to wrap my head around this.
My Setup:
All the test code, and the 2 module files come under one "folder" mymodule
my_module_1.py file contains
class MyOuterClass(object):
MyInnerClass(object):
attribute1: str
attribute2: str
attribute3: str
def get(self) -> MyInnerClass:
'''
pseudocode
1. a call to AWS's service is made
2. the output from call in step 1 is used to set attributes of this InnerClass
3. return innerclass instance
'''
I use the OuterClass in another file(my_module_2.py), to set some values and return a string as follows:
class MyModule2():
def get_foo(self, some_boolean_predicate):
if some_boolean_predicate:
temp = my_module_1.OuterClass().get()
statement = f'''
WITH (
BAR (
FIELD_1 = '{temp.attribute1}',
FIELD_2 = '{temp.attribute2}',
FIELD_3 = '{temp.attribute3}'
)
)
'''
else:
statement = ''
return statement
I want to write the unit tests for the file my_module_2.py file, and test the function get_foo
How I am writing the tests(or planning on)
a test file by name test_my_module2.py
I started with creating a pytest.fixture for the MyOuterClass's get function as follows since I will be reusing this piece of info again in my other tests
#pytest.fixture
def mock_get(mocker: MockerFixture) -> MagicMock:
return mocker.patch.object(MyOuterClass, 'get')
Finally,
Then I proceeded to use this fixture in my test as follows:
from unittest import mock
from unittest.mock import MagicMock, Mock, patch, PropertyMock
import pytest
from pytest_mock import MockerFixture
from my_module.my_module_1 import myOuterClass
def test_should_get_from_inner_class(self, mock_get):
# mock call to get are made
output = mock_get.get
#update the values for the InnerClass's attributes here
output.attribute1.side_effect = 'attr1'
output.attribute2.side_effect = 'attr2'
output.attribute3.side_effect = 'attr3'
mock_output_str = '''
WITH (
BAR (
FIELD_1 = 'attr1',
FIELD_2 = 'attr2',
FIELD_3 = 'attr3'
)
)
'''
module2Obj = MyModule2()
response = module2Obj.get_foo(some_boolean_predicate=True)
# the following assertion passes
assert mock_get.get.called_once()
# I would like match `response to that with mock_output_str instance above
assert response == mock_output_str
But, the assertion as you might have guessed failed, and I know I am comparing completely different types, since I see
errors such as
FAILED [100%]
WITH (
BAR (
FIELD1 = '<MagicMock name='get().attr1' id='4937943120'>',
FIELD3 = '<MagicMock name='get().attr2' id='4937962976'>',
FIELD3 = '<MagicMock name='get().attr3' id='4937982928'>'
)
)
Thank you for being patient with me till here, i know its a really lengthy post, but stuck on this for a few days, ended up creating a post here.
How do i get to validate the mock's value with the mock_output_str?
yess! the hint was in the #gold_cy's answer. I was only calling my mock and never setting its values
this is what my test case ended up looking
mock_obj = OuterClass.InnerClass()
mock_obj.attribute1='some-1'
mock_obj.attribute2='some-2'
mock_obj.attribute3='some-3'
mock_get.return_value = mock_obj
once my mock was setup properly, then the validation became easy! Thank you!

How to do a case insensitive match for command line arguments in scala?

I'm working on a command line tool written in Scala which is executed as:
sbt "run --customerAccount 1234567"
Now, I wish to make this flexible to accept "--CUSTOMERACCOUNT" or --cUsToMerAccount or --customerACCOUNT ...you get the drift
Here's what the code looks like:
lazy val OptionsParser: OptionParser[Args] = new scopt.OptionParser[Args]("scopt") {
head(
"XML Generator",
"Creates XML for testing"
)
help("help").text(s"Prints this usage message. $envUsage")
opt[String]('c', "customerAccount")
.text("Required: Please provide customer account number as -c 12334 or --customerAccount 12334")
.required()
.action { (cust, args) =>
assert(cust.nonEmpty, "cust is REQUIRED!!")
args.copy(cust = cust)
}
}
I assume the opt[String]('c', "customerAccount") does the pattern matching from the command line and will match with "customerAccount" - how do I get this to match with "--CUSTOMERACCOUNT" or --cUsToMerAccount or --customerACCOUNT? What exactly does the args.copy (cust = cust) do?
I apologize if the questions seem too basic. I'm incredibly new to Scala, have worked in Java and Python earlier so sometimes I find the syntax a little hard to understand as well.
You'd normally be parsing the args with code like:
OptionsParser.parse(args, Args())
So if you want case-insensitivity, probably the easiest way is to canonicalize the case of args with something like
val canonicalized = args.map(_.toLowerCase)
OptionsParser.parse(canonicalized, Args())
Or, if you for instance wanted to only canonicalize args starting with -- and before a bare --:
val canonicalized =
args.foldLeft(false -> List.empty[String]) { (state, arg) =>
val (afterDashes, result) = state
if (afterDashes) true -> (arg :: result) // pass through unchanged
else {
if (arg == "==") true -> (arg :: result) // move to afterDash state & pass through
else {
if (arg.startsWith("--")) false -> (arg.toLowerCase :: result)
else false -> (arg :: result) // pass through unchanged
}
}
}
._2 // Extract the result
.reverse // Reverse it back into the original order (if building up a sequence, your first choice should be to build a list in reversed order and reverse at the end)
OptionsParser.parse(canonicalized, Args())
Re the second question, since Args is (almost certainly) a case class, it has a copy method which constructs a new object with (most likely, depending on usage) different values for its fields. So
args.copy(cust = cust)
creates a new Args object, where:
the value of the cust field in that object is the value of the cust variable in that block (this is basically a somewhat clever hack that works with named method arguments)
every other field's value is taken from args

Pytest - skip (xfail) mixed with parametrize

is there a way to use the #incremental plugin like described att Pytest: how to skip the rest of tests in the class if one has failed? mixed with #pytest.mark.parametrize like below:
#pytest.mark.incremental
Class TestClass:
#pytest.mark.parametrize("input", data)
def test_preprocess_check(self,input):
# prerequisite for test
#pytest.mark.parametrize("input",data)
def test_process_check(self,input):
# test only if test_preprocess_check succeed
The problem i encountered is, at the first fail of test_preprocess_check with a given input of my data set, the following test_preprocess_check and test_process_check are labeled "xfail".
The behaviour i expect will be, at each new "input" of my parametrized data set, the test will act in an incremental fashion.
ex: data = [0,1,2]
if only test_preprocess_check(0) failed:
i got the following report:
1 failed, 5 xfailed
but i expect the report:
1 failed, 1 xfailed, 4 passed
Thanks
After some experiments i found a way to generalize the #incremental to works with parametrize annotation. Simply rewrite the _previousfailed argument to make it unique for each input. The argument _genid was excactly the need.
I added a #pytest.mark.incrementalparam to achieve this.
Code become:
def pytest_runtest_setup(item):
previousfailed_attr = getattr(item, "_genid",None)
if previousfailed_attr is not None:
previousfailed = getattr(item.parent, previousfailed_attr, None)
if previousfailed is not None:
pytest.xfail("previous test failed (%s)" %previousfailed.name)
previousfailed = getattr(item.parent, "_previousfailed", None)
if previousfailed is not None:
pytest.xfail("previous test failed (%s)" %previousfailed.name)
def pytest_runtest_makereport(item, call):
if "incrementalparam" in item.keywords:
if call.excinfo is not None:
previousfailed_attr = item._genid
setattr(item.parent,previousfailed_attr, item)
if "incremental" in item.keywords:
if call.excinfo is not None:
parent = item.parent
parent._previousfailed = item
It's interesting to mention that's it can't be used without parametrize cause parametrize annotation creates automatically _genid variable.
Hope this can helps others than me.

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is represented by a normal C/C++ application that usually runs from command line. It accepts a plain text file as input and generate another plain text file as output. As I need to integrate the flow of this application with something bigger (always in Spark), I was wondering if there is a way to do this.
The process can be easily run in parallel (at the moment I use GNU Parallel) just splitting its input in (for example) 10 part files, run 10 instances in memory of it, and re-join the final 10 part files output in one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes to file, executes an external program, and outputs results to the standard output. After that all you have to do is to use pipe method:
rdd.pipe("your_wrapper")
The only serious considerations is IO performance. If it is possible it would be better to adjust program you want to call so it can read and write data directly without going through disk.
Alternativelly you can use mapPartitions combined with process and standard IO tools to write to the local file, call your program and read the output.
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer
val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap{ case(file, pds) => { // pds is a PortableDataStream
val rows = new ArrayBuffer[Array[String]]()
var errors = List[String]()
val io = new ProcessIO (
in => { // "in" is an OutputStream; write the encrypted contents of the
// input file (pds) to this stream
IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
in.close
},
out => { // "out" is an InputStream; read the decrypted data off this stream.
// Even though this runs in another thread, we can write to rows, since it
// is part of the closure for this function
for(line <- scala.io.Source.fromInputStream(out).getLines) {
// ...decode line here... for my data, it was pipe-delimited
rows += line.split('|')
}
out.close
},
err => { // "err" is an InputStream; read any errors off this stream
// errors is part of the closure for this function
errors = scala.io.Source.fromInputStream(err).getLines.toList
err.close
}
)
val cmd = List("/my/decryption/program", "--decrypt")
val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
println(s"-- Results for file $file:")
if (exitValue != 0) {
// TBD write to string accumulator instead, so driver can output errors
// string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
println(s"exit code: $exitValue")
errors.foreach(println)
} else {
// TBD, you'll probably want to move this code to the driver, otherwise
// unless you're using the shell, you won't see this output
// because it will be sent to stdout of the executor
println(s"row count: ${rows.size}")
if (showSampleRows) {
println("6 sample rows:")
rows.slice(0,6).foreach(row => println(" " + row.mkString("|")))
}
}
rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

How do I create a Spark RDD from Accumulo 1.6 in spark-notebook?

I have a Vagrant image with Spark Notebook, Spark, Accumulo 1.6, and Hadoop all running. From notebook, I can manually create a Scanner and pull test data from a table I created using one of the Accumulo examples:
val instanceNameS = "accumulo"
val zooServersS = "localhost:2181"
val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
val connector: Connector = instance.getConnector( "root", new PasswordToken("password"))
val auths = new Authorizations("exampleVis")
val scanner = connector.createScanner("batchtest1", auths)
scanner.setRange(new Range("row_0000000000", "row_0000000010"))
for(entry: Entry[Key, Value] <- scanner) {
println(entry.getKey + " is " + entry.getValue)
}
will give the first ten rows of table data.
When I try to create the RDD thusly:
val rdd2 =
sparkContext.newAPIHadoopRDD (
new Configuration(),
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
I get an RDD returned to me that I can't do much with due to the following error:
java.io.IOException: Input info has not been set. at
org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
at
org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
at
org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
at
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at scala.Option.getOrElse(Option.scala:120) at
org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1367) at
org.apache.spark.rdd.RDD.count(RDD.scala:927)
This totally makes sense in light of the fact that I haven't specified any parameters as to which table to connect with, what the auths are, etc.
So my question is: What do I need to do from here to get those first ten rows of table data into my RDD?
update one
Still doesn't work, but I did discover a few things. Turns out there are two nearly identical packages,
org.apache.accumulo.core.client.mapreduce
&
org.apache.accumulo.core.client.mapred
both have nearly identical members, except for the fact that some of the method signatures are different. not sure why both exist as there's no deprecation notice that I could see. I attempted to implement Sietse's answer with no joy. Below is what I did, and the responses:
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
val jobConf = new JobConf(new Configuration)
import org.apache.hadoop.mapred.JobConf import
org.apache.hadoop.conf.Configuration jobConf:
org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml
Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
val rdd2 =
sparkContext.hadoopRDD (
jobConf,
classOf[org.apache.accumulo.core.client.mapred.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value],
1
)
rdd2: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key,
org.apache.accumulo.core.data.Value)] = HadoopRDD[1] at hadoopRDD at
:62
rdd2.first
java.io.IOException: Input info has not been set. at
org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
at
org.apache.accumulo.core.client.mapred.AbstractInputFormat.validateOptions(AbstractInputFormat.java:308)
at
org.apache.accumulo.core.client.mapred.AbstractInputFormat.getSplits(AbstractInputFormat.java:505)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at scala.Option.getOrElse(Option.scala:120) at
org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at
org.apache.spark.rdd.RDD.take(RDD.scala:1077) at
org.apache.spark.rdd.RDD.first(RDD.scala:1110) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:69)
at...
* edit 2 *
re: Holden's answer - still no joy:
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
InputFormatBase.setInputTableName(jobConf, "batchtest1")
val rddX = sparkContext.newAPIHadoopRDD(
jobConf,
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
rddX: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key,
org.apache.accumulo.core.data.Value)] = NewHadoopRDD[0] at
newAPIHadoopRDD at :58
Out[15]: NewHadoopRDD[0] at newAPIHadoopRDD at :58
rddX.first
java.io.IOException: Input info has not been set. at
org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
at
org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
at
org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
at
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at scala.Option.getOrElse(Option.scala:120) at
org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at
org.apache.spark.rdd.RDD.take(RDD.scala:1077) at
org.apache.spark.rdd.RDD.first(RDD.scala:1110) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:61)
at
edit 3 -- progress!
i was able to figure out why the 'input INFO not set' error was occurring. the eagle-eyed among you will no doubt see the following code is missing a closing '('
AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")
as I'm doing this in spark-notebook, I'd been clicking the execute button and moving on because I wasn't seeing an error. what I forgot was that notebook is going to do what spark-shell will do when you leave off a closing ')' -- it will wait forever for you to add it. so the error was the result of the 'setConnectorInfo' method never getting executed.
unfortunately, I'm still unable to shove the accumulo table data into an RDD that's useable to me. when I execute
rddX.count
I get back
res15: Long = 10000
which is the correct response - there are 10,000 rows of data in the table I pointed to. however, when I try to grab the first element of data thusly:
rddX.first
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0.0 in stage 0.0 (TID 0) had a not serializable result:
org.apache.accumulo.core.data.Key
any thoughts on where to go from here?
edit 4 -- success!
the accepted answer + comments are 90% of the way there - except for the fact that the accumulo key/value need to be cast into something serializable. i got this working by invoking the .toString() method on both. i'll try to post something soon that's complete working code incase anyone else runs into the same issue.
Generally with custom Hadoop InputFormats, the information is specified using a JobConf. As #Sietse pointed out there are some static methods on the AccumuloInputFormat that you can use to configure the JobConf. In this case I think what you would want to do is:
val jobConf = new JobConf() // Create a job conf
// Configure the job conf with our accumulo properties
AccumuloInputFormat.setConnectorInfo(jobConf, principal, token)
AccumuloInputFormat.setScanAuthorizations(jobConf, authorizations)
val clientConfig = new ClientConfiguration().withInstance(instanceName).withZkHosts(zooKeepers)
AccumuloInputFormat.setZooKeeperInstance(jobConf, clientConfig)
AccumuloInputFormat.setInputTableName(jobConf, tableName)
// Create an RDD using the jobConf
val rdd2 = sc.newAPIHadoopRDD(jobConf,
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
Note: After digging into the code, it seems the the is configured property is set based in part on the class which is called (makes sense to avoid conflicts with other packages potentially), so when we go and get it back in the concrete class later it fails to find the is configured flag. The solution to this is to not use the Abstract classes. see https://github.com/apache/accumulo/blob/bf102d0711103e903afa0589500f5796ad51c366/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/ConfiguratorBase.java#L127 for the implementation details). If you can't call this method on the concrete implementation with spark-notebook probably using spark-shell or a regularly built application is the easiest solution.
It looks like those parameters have to be set through static methods : http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/mapred/AccumuloInputFormat.html. So try setting the non-optional parameters and run again. It should work.