I am using ScalarDB, which provides ACID functionality on top of Cassandra. While deleting a record, I am getting a com.scalar.db.exception.transaction.InvalidUsageException: the record to be deleted must be existing and read beforehand.
I am deleting entries from several tables (hence using ScalarDB to provide atomicity). I created a DistributedTransaction at the start and then started deleting the entries.
def deleteQuestion(questionKey:PracticeQuestionKeys,user:User) = {
logger.trace(s"delete question request ${questionKey}, ${user}")
val transaction = transactionService.start
val questionGetResult = getQuestionFromQuestionID(transaction,questionKey)
if(questionGetResult.isLeft) throw questionGetResult.left.get
val question = questionGetResult.right.get
deleteQuestionIfUserIsAuthorized(transaction,questionKey, question, user)
deleteQuestionTagFromTagRepository(transaction,question)
deleteQuestionFromProfileAndPortfolio(transaction, question, user.id)
commitTransaction(transaction)
}
All steps before commitTransaction seem to be OK, but commitTransaction fails with this error:
2020-08-02 13:19:12,883 [WARN] from com.scalar.db.transaction.consensuscommit.CommitHandler in scala-execution-context-global-141 - preparing records failed
com.scalar.db.exception.transaction.InvalidUsageException: the record to be deleted must be existing and read beforehand
at com.scalar.db.transaction.consensuscommit.PrepareMutationComposer.add(PrepareMutationComposer.java:89)
at com.scalar.db.transaction.consensuscommit.PrepareMutationComposer.add(PrepareMutationComposer.java:45)
at com.scalar.db.transaction.consensuscommit.Snapshot.lambda$to$1(Snapshot.java:134)
at java.util.concurrent.ConcurrentHashMap$EntrySetView.forEach(ConcurrentHashMap.java:4795)
at com.scalar.db.transaction.consensuscommit.Snapshot.to(Snapshot.java:130)
at com.scalar.db.transaction.consensuscommit.CommitHandler.prepareRecords(CommitHandler.java:104)
at com.scalar.db.transaction.consensuscommit.CommitHandler.commit(CommitHandler.java:40)
at com.scalar.db.transaction.consensuscommit.ConsensusCommit.commit(ConsensusCommit.java:121)
at services.QuestionsTransactionDatabaseService.commitTransaction(QuestionsTransactionDatabaseService.scala:251)
at services.QuestionsTransactionDatabaseService.deleteQuestion(QuestionsTransactionDatabaseService.scala:388)
at services.QuestionsTransactionService.$anonfun$deleteQuestion$1(QuestionsTransactionService.scala:57)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:653)
at scala.util.Success.$anonfun$map$1(Try.scala:251)
at scala.util.Success.map(Try.scala:209)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:287)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:140)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Referring to https://scalardl.readthedocs.io/en/latest/trouble-shooting-guide/, it seems a blind delete is not allowed. But I have tried doing a get before the delete and have also tried conditions like DeleteIfExists, and I am still getting the error.
def delete(transaction:DistributedTransaction,key:PracticeKeys) = {
logger.trace("Deleting question. Checking if question exists for" + key)
get(transaction,key) // I have tried this both with and without the get (i.e., reading the question before deleting)
//Perform the operations you want to group in the transaction
val pKey = new Key(new TextValue("id", key.id.toString))
logger.trace(s"created question keys ${pKey}")
logger.trace(s"getting question using ${keyspaceName}, ${tablename}")
val deleteToken:Delete = new Delete(pKey)
.forNamespace(keyspaceName)
.forTable(tablename)
.withCondition(new DeleteIfExists)
transaction.delete(deleteToken)
}
Why do I have to get a record before deleting it? Is there a way to delete it directly? Am I not using the library correctly?
As it turns out, the problem was in my code. Firstly, we have to do a get before doing a delete. I don't know why, but that seems to be the rule.
So I added a get in each delete:
def delete(transaction:DistributedTransaction,questionKey:QuestionsCreatedByAUserForATagKeys) = {
logger.trace(s"deleting question created for tag with key ${questionKey}")
get(transaction,questionKey) // <-- HERE. I should probably also check if get returned a valid record.
val pQuestionsCreatedKey = new Key(
new TextValue("question_creator", questionKey.creator.creatorId.toString),
new TextValue("tag", questionKey.tag)
)
val cQuestionKey = if(questionKey.creationMonth !=0 && questionKey.creationYear !=0 && questionKey.questionId.isDefined){
new Key(
new BigIntValue("creation_year", questionKey.creationYear),
new BigIntValue("creation_month", questionKey.creationMonth),
new TextValue("question_id", questionKey.questionId.get.toString),
)
} else {
throw CannotDeleteWithoutEntireCompositeKeyException()
}
logger.trace(s"deleting question with keys ${pQuestionsCreatedKey},${cQuestionKey}")
val deleteToken = new Delete(pQuestionsCreatedKey,cQuestionKey)
.forNamespace(keyspaceName)
.forTable(tablename)
.withCondition(new DeleteIfExists)
logger.trace(s"deleting from user profile ${deleteToken}")
transaction.delete(deleteToken)
}
The reason I was still getting the exception despite adding the get earlier was a programming mistake (I wasn't assigning variables correctly). Say the delete was for record A, but I hadn't stored A correctly and had stored it as A1 instead. So the get was actually failing.
In general, I think this error means that the record being deleted doesn't exist.
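For anyone hitting the same exception, here is a minimal sketch of the read-then-delete pattern that avoids a blind delete. It reuses the keyspaceName/tablename values and the key layout from the snippets above and assumes ScalarDB's Get/Delete builders (com.scalar.db.api) and Key/TextValue (com.scalar.db.io); it is illustrative, not the exact production code:
import com.scalar.db.api.{Delete, DistributedTransaction, Get}
import com.scalar.db.io.{Key, TextValue}

def deleteIfPresent(transaction: DistributedTransaction, id: String): Unit = {
  val pKey = new Key(new TextValue("id", id))
  // Read the record in the same transaction first, so the delete is not "blind"
  val get = new Get(pKey).forNamespace(keyspaceName).forTable(tablename)
  val existing = transaction.get(get)
  if (existing.isPresent) {
    val delete = new Delete(pKey).forNamespace(keyspaceName).forTable(tablename)
    transaction.delete(delete)
  }
  // If the record does not exist, skipping the delete avoids the
  // InvalidUsageException at commit time.
}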
I am new to Scala and Spark, so I am struggling with a map function I am trying to create.
The map function on the DataFrame gives me a Row (org.apache.spark.sql.Row).
I have been loosely following this article.
val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
val parsed = Try(from_avro(???, currentValueSchema.value, fromAvroOptions)) match {
case Success(parsedValue) => List(parsedValue, null)
case Failure(ex) => List(null, ex.toString)
}
Row.fromSeq(row.toSeq.toList ++ parsed)
}
The from_avro function wants to accept a Column (org.apache.spark.sql.Column), however I don't see a way in the docs to get a column from a Row.
I am fully open to the idea that I may be doing this whole thing wrong.
Ultimately my goal is to parse the bytes coming in from a Structured Stream.
Parsed records get written to Delta Table A and the failed records to another Delta Table B.
For context, the source table looks as follows:
Edit - from_avro returning null on "bad record"
There have been a few comments saying that from_avro returns null if it fails to parse a "bad record". By default, from_avro uses the FAILFAST mode, which throws an exception if parsing fails. If one sets the mode to PERMISSIVE, an object in the shape of the schema is returned, but with all properties being null (also not particularly useful...). Link to the Apache Avro Data Source Guide - Spark 3.1.1 Documentation.
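For reference, this is roughly how the parse mode can be passed (a minimal sketch; it assumes Spark 3.x's org.apache.spark.sql.avro.functions.from_avro, which takes the options as a java.util.Map, and the currentValueSchema broadcast from my code below):
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col
import scala.collection.JavaConverters._

// "FAILFAST" (the default) throws on a malformed record;
// "PERMISSIVE" returns a struct with all-null fields instead
val fromAvroOptions = Map("mode" -> "PERMISSIVE").asJava
val parsedValueCol = from_avro(col("fixedValue"), currentValueSchema.value, fromAvroOptions)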
Here is my original command:
val parsedDf = filterValueDF.select($"topic",
$"partition",
$"offset",
$"timestamp",
$"timestampType",
$"valueSchemaId",
from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as('parsedValue))
If there are ANY bad rows, the job is aborted with org.apache.spark.SparkException: Job aborted.
A snippet of the log of the exception:
Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:732)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:291)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
... 10 more
Suppressed: java.lang.NullPointerException
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:1099)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
at shaded.parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:107)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:252)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:234)
at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1184)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1051)
at org.apache.parquet.format.FileMetaData.write(FileMetaData.java:949)
at org.apache.parquet.format.Util.write(Util.java:222)
at org.apache.parquet.format.Util.writeFileMetaData(Util.java:69)
at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:757)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:750)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:135)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:84)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$3(FileFormatWriter.scala:297)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
... 11 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
at org.apache.spark.sql.avro.AvroDeserializer.createDecimal(AvroDeserializer.scala:308)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16(AvroDeserializer.scala:177)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16$adapted(AvroDeserializer.scala:174)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:336)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:332)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:354)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:351)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:75)
at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:89)
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
... 16 more
In order to get a specific column from the Row object, you can use either row.get(i) or the column name with row.getAs[T]("columnName"). Here you can check the details of the Row class.
Then your code would look like this:
val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
val binaryFixedValue = row.getSeq[Byte](6) // or row.getAs[Seq[Byte]]("fixedValue")
val parsed = Try(from_avro(binaryFixedValue, currentValueSchema.value, fromAvroOptions)) match {
case Success(parsedValue) => List(parsedValue, null)
case Failure(ex) => List(null, ex.toString)
}
Row.fromSeq(row.toSeq.toList ++ parsed)
}
Although in your case you don't really need to go into the map function at all, because there you have to work with primitive Scala types, whereas from_avro works with the DataFrame API. This is the reason you can't call from_avro directly from map: instances of the Column class can only be used in combination with the DataFrame API, i.e. df.select($"c1"), where c1 is an instance of Column. In order to use from_avro as you initially intended, just write:
filterValueDF.select(from_avro($"fixedValue", currentValueSchema))
As #mike already mentioned, if from_avro fails to parse the Avro content, it will return null. Finally, if you want to separate the succeeded rows from the failed ones, you could do something like:
val includingFailuresDf = filterValueDF.select(
from_avro($"fixedValue", currentValueSchema) as "avro_res")
.withColumn("failed", $"avro_res".isNull)
val successDf = includingFailuresDf.where($"failed" === false)
val failedDf = includingFailuresDf.where($"failed" === true)
Please be aware that the code was not tested.
From what I understand, you just need to fetch a column value from a row. You can probably do that by getting the column value at a specific index using row.get(i).
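For example, a quick sketch (the index 6 and the column name "fixedValue" are just taken from the snippets above, and the binary type is assumed):
val anyValue = row.get(6) // by position, typed as Any
val fixedValue = row.getAs[Array[Byte]]("fixedValue") // by name, with the expected type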
I am using an H2 database to test my Postgres Slick functionality.
I created the H2DBComponent below:
trait H2DBComponent extends DbComponent {
val driver = slick.jdbc.H2Profile
import driver.api._
val h2Url = "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false;INIT=runscript from './test/resources/schema.sql'\\;runscript from './test/resources/schemadata.sql'"
val logger = LoggerFactory.getLogger(this.getClass)
val db: Database = {
logger.info("Creating test connection ..................................")
Database.forURL(url = h2Url, driver = "org.h2.Driver")
}
}
In the above snippet I am creating my tables using schema.sql and inserting a single row (record) with schemadata.sql.
Then I am trying to insert a record into the table using my test case, as below:
class RequestRepoTest extends FunSuite with RequestRepo with H2DBComponent {
test("Add new Request") {
val response = insertRequest(Request("XYZ","tk", "DM", "RUNNING", "0.1", "l1", "file1",
Timestamp.valueOf("2016-06-22 19:10:25"), Some(Timestamp.valueOf("2016-06-22 19:10:25")), Some("scienceType")))
val actualResult=Await.result(response,10 seconds)
assert(actualResult===1)
val response2 = getAllRequest()
assert(Await.result(response2, 5 seconds).size === 2)
}
}
The insert assert above works fine, indicating that the record is inserted. But the getAllRequest() assert fails, as the output still contains only the single row inserted by schemadata.sql, which means the insertRequest change is not persisted. However, the statements below indicate that the record is inserted, as the insert returned 1, i.e. one record inserted.
val response = insertRequest(Request("CMP_XYZ","tesco_uk", "DM", "RUNNING", "0.1", "l1", "file1",
Timestamp.valueOf("2016-06-22 19:10:25"), Some(Timestamp.valueOf("2016-06-22 19:10:25")),
Some("scienceType")))
val actualResult=Await.result(response,10 seconds)
Below is my definition of insertRequest:
def insertRequest(request: Request):Future[Int]= {
db.run { requestTableQuery += request }
}
I am unable to figure out how I can see the inserted record. Is there any property/config that I need to add?
But the getAllRequest() assert fails as the output still contains the single row(as inserted by schemadata.sql) => which means the insertRequest change is not persisted
I would double-check that the assert(Await.result(response2, 5 seconds).size === 2) line is failing because of a size difference. Could it be failing for some other reason?
For example, as INIT is run on each connection it could be that you are re-creating the database for each connection. Unless you're careful with the SQL, that could produce an error such as "table already exists". Adding TRACE_LEVEL_SYSTEM_OUT=2; to your H2 URL can be helpful in tracking what H2 is doing.
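For instance, adapting the URL from the question (only TRACE_LEVEL_SYSTEM_OUT=2 is added):
val h2Url = "jdbc:h2:mem:test;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DATABASE_TO_UPPER=false;TRACE_LEVEL_SYSTEM_OUT=2;" +
  "INIT=runscript from './test/resources/schema.sql'\\;runscript from './test/resources/schemadata.sql'"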
A couple of suggestions.
First, you could ensure your SQL only runs as needed. For example, your schema.sql could add checks to avoid trying to create the table twice:
CREATE TABLE IF NOT EXISTS my_table( my_column VARCHAR NULL );
And likewise for your schemadata.sql:
MERGE INTO my_table KEY(my_column) VALUES ('a') ;
Alternatively, you could establish the schema and test data around your tests (e.g., possibly in Scala code, using Slick), as in the sketch below. Your test framework probably has a way to ensure something is run before and after a test or test suite.
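For example, with ScalaTest something along these lines could work (a rough sketch; it assumes the db, driver, and requestTableQuery values from your components, and that the INIT scripts are removed from the H2 URL):
import org.scalatest.{BeforeAndAfterAll, FunSuite}
import scala.concurrent.Await
import scala.concurrent.duration._

class RequestRepoTest extends FunSuite with RequestRepo with H2DBComponent with BeforeAndAfterAll {
  import driver.api._

  // Create the schema once before the suite runs, instead of via INIT on every connection
  override def beforeAll(): Unit =
    Await.result(db.run(requestTableQuery.schema.create), 10.seconds)

  // Drop it afterwards so each run starts from a clean in-memory database
  override def afterAll(): Unit =
    Await.result(db.run(requestTableQuery.schema.drop), 10.seconds)
}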
I am working with .NET Core Entity Framework. I have two lists of a class type: one for updates and the other for new entries. Adding new records works fine, which is done with context.[Model].Add, but the update, which is done with context.[Model].Update, throws an exception. I know the records have not been modified or deleted by anything else, as this is running locally.
$exception {Microsoft.EntityFrameworkCore.DbUpdateConcurrencyException: Database operation expected to affect 1 row(s) but actually affected 0 row(s). Data may have been modified or deleted since entities were loaded.
Code
List<AnswerDataModel> surveyResponseListToCreate = new List<AnswerDataModel>();
List<AnswerDataModel> surveyResponseListToUpdate = new List<AnswerDataModel>();
if (surveyResponseListToUpdate.Count > 0)
{
foreach (var answerObject in surveyResponseListToUpdate)
{
Context.Answers.Update(answerObject);
if (answerObject.AnswerOptions.Count > 0)
{
foreach (var optItem in answerObject.AnswerOptions)
{
AnswerOptionDataModel answOpt = new AnswerOptionDataModel();
answOpt = optItem;
Context.AnswerOptions.Update(answOpt);
}
}
}
}
var recordsAffected = Context.SaveChanges();
if (!UsingExternalTransaction)
{
FinalizeTransaction(recordsAffected);
}
I can't resist a quote:
"I do not think [your code] means what you think it means."
Assuming that surveyResponseListToUpdate was a list of entities previously loaded and modified:
if (answerObject.AnswerOptions.Count > 0) // Unnecessary...
{
foreach (var optItem in answerObject.AnswerOptions)
{
AnswerOptionDataModel answOpt = new AnswerOptionDataModel(); // does nothing.
answOpt = optItem; // references existing answer option..
Context.AnswerOptions.Update(answOpt);
}
}
The whole block boils down to:
foreach (var optItem in answerObject.AnswerOptions)
Context.AnswerOptions.Update(optItem);
The error you are likely running into is because Update will recurse through navigation properties automatically, so when the parent (Answer) is updated, its AnswerOptions will be updated as well. So when you go through the extra steps to try to save the answer options, they have already been updated when the answer was saved. Provided the Answer was loaded by the same context that you are saving it to, you should be in the clear with:
foreach (var answerObject in surveyResponseListToUpdate)
Context.Answers.Update(answerObject);
var recordsAffected = Context.SaveChanges();
This should update the answer and its associated answer options. Even if options were added or removed, the change tracking should do its job and ensure all of the associated data records are updated.
The extra if checks and such aren't necessary and just add nesting depth, making the code harder to read.
However, I suspect that your real code is doing something different from the example, given that in my tests, where I tried to reproduce your error, the code worked fine even when updating the child references after updating the parent. If the above still raises issues, please update your example with the code you are running.
I have a Vagrant image with Spark Notebook, Spark, Accumulo 1.6, and Hadoop all running. From the notebook, I can manually create a Scanner and pull test data from a table I created using one of the Accumulo examples:
val instanceNameS = "accumulo"
val zooServersS = "localhost:2181"
val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
val connector: Connector = instance.getConnector( "root", new PasswordToken("password"))
val auths = new Authorizations("exampleVis")
val scanner = connector.createScanner("batchtest1", auths)
scanner.setRange(new Range("row_0000000000", "row_0000000010"))
for(entry: Entry[Key, Value] <- scanner) {
println(entry.getKey + " is " + entry.getValue)
}
This will give the first ten rows of table data.
When I try to create the RDD thusly:
val rdd2 =
sparkContext.newAPIHadoopRDD (
new Configuration(),
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
I get an RDD returned to me that I can't do much with due to the following error:
java.io.IOException: Input info has not been set.
  at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
  at org.apache.spark.rdd.RDD.count(RDD.scala:927)
This totally makes sense in light of the fact that I haven't specified any parameters as to which table to connect with, what the auths are, etc.
So my question is: What do I need to do from here to get those first ten rows of table data into my RDD?
update one
Still doesn't work, but I did discover a few things. It turns out there are two nearly identical packages, org.apache.accumulo.core.client.mapreduce and org.apache.accumulo.core.client.mapred. Both have nearly identical members, except that some of the method signatures are different. I am not sure why both exist, as there's no deprecation notice that I could see. I attempted to implement Sietse's answer with no joy. Below is what I did, and the responses:
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
val jobConf = new JobConf(new Configuration)
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
val rdd2 =
sparkContext.hadoopRDD (
jobConf,
classOf[org.apache.accumulo.core.client.mapred.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value],
1
)
rdd2: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = HadoopRDD[1] at hadoopRDD at <console>:62
rdd2.first
java.io.IOException: Input info has not been set.
  at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
  at org.apache.accumulo.core.client.mapred.AbstractInputFormat.validateOptions(AbstractInputFormat.java:308)
  at org.apache.accumulo.core.client.mapred.AbstractInputFormat.getSplits(AbstractInputFormat.java:505)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
  at ...
edit 2
re: Holden's answer - still no joy:
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
InputFormatBase.setInputTableName(jobConf, "batchtest1")
val rddX = sparkContext.newAPIHadoopRDD(
jobConf,
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
rddX: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = NewHadoopRDD[0] at newAPIHadoopRDD at <console>:58
Out[15]: NewHadoopRDD[0] at newAPIHadoopRDD at <console>:58
rddX.first
java.io.IOException: Input info has not been set.
  at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
  at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
  at ...
edit 3 -- progress!
I was able to figure out why the 'input INFO not set' error was occurring. The eagle-eyed among you will no doubt see that the following code is missing a closing ')':
AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")
As I'm doing this in spark-notebook, I'd been clicking the execute button and moving on because I wasn't seeing an error. What I forgot was that the notebook is going to do what spark-shell does when you leave off a closing ')': it will wait forever for you to add it. So the error was the result of the setConnectorInfo method never getting executed.
Unfortunately, I'm still unable to get the Accumulo table data into an RDD that is usable to me. When I execute
rddX.count
I get back
res15: Long = 10000
which is the correct response; there are 10,000 rows of data in the table I pointed to. However, when I try to grab the first element of data thusly:
rddX.first
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0.0 in stage 0.0 (TID 0) had a not serializable result:
org.apache.accumulo.core.data.Key
Any thoughts on where to go from here?
edit 4 -- success!
The accepted answer plus the comments got me 90% of the way there, except for the fact that the Accumulo key/value need to be cast into something serializable. I got this working by invoking the .toString() method on both. I'll try to post some complete working code soon in case anyone else runs into the same issue.
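In case it is useful to anyone, the workaround described above boils down to something like the following (rddX is the RDD from the earlier edits; this is a sketch, not the full notebook code):
// Map the non-serializable Accumulo Key/Value pairs to plain strings
val stringRdd = rddX.map { case (key, value) => (key.toString, value.toString) }
stringRdd.first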
Generally with custom Hadoop InputFormats, the information is specified using a JobConf. As #Sietse pointed out there are some static methods on the AccumuloInputFormat that you can use to configure the JobConf. In this case I think what you would want to do is:
val jobConf = new JobConf() // Create a job conf
// Configure the job conf with our accumulo properties
AccumuloInputFormat.setConnectorInfo(jobConf, principal, token)
AccumuloInputFormat.setScanAuthorizations(jobConf, authorizations)
val clientConfig = new ClientConfiguration().withInstance(instanceName).withZkHosts(zooKeepers)
AccumuloInputFormat.setZooKeeperInstance(jobConf, clientConfig)
AccumuloInputFormat.setInputTableName(jobConf, tableName)
// Create an RDD using the jobConf
val rdd2 = sc.newAPIHadoopRDD(jobConf,
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
)
Note: after digging into the code, it seems the "is configured" property is set based in part on the class from which it is called (which makes sense, to avoid conflicts with other packages), so when we go to read it back in the concrete class later, it fails to find the "is configured" flag. The solution is to not use the abstract classes (see https://github.com/apache/accumulo/blob/bf102d0711103e903afa0589500f5796ad51c366/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/ConfiguratorBase.java#L127 for the implementation details). If you can't call these methods on the concrete implementation with spark-notebook, using spark-shell or a regularly built application is probably the easiest solution.
It looks like those parameters have to be set through static methods: http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/mapred/AccumuloInputFormat.html. So try setting the non-optional parameters and run again. It should work.
I am using EF 6 Code First, and I need to delete an item and then also update a different item within a collection of entities. If I try to delete one item and then modify a completely different item, I get the error message "An object with the same key already exists in the ObjectStateManager". This is inaccurate because the two objects have completely different PK IDs, yet the error is thrown when the update happens. If I comment out the delete code, then the update works just fine with multiple items to update. Why would it complain about the "same key" when the keys are different?
foreach (var phone in phones)
{
if (!_isValidPhone(phone))
{
if(phone.PhoneId != 0)
{
var deletePhone = _db.Phones.FirstOrDefault(r => r.PhoneId == phone.PhoneId);
_db.Entry(deletePhone).State = EntityState.Deleted;
continue;
}
}
if (_isNewPhone(phone))
{
AddNewPhone(phone, _person);
}
else
{
UpdatePhoneData(phone, _person.Phones.FirstOrDefault(r => r.Order == phone.Order));
}
}
private void UpdatePhoneData(Phone phoneFrom, Phone phoneTo)
{
phoneTo.Note = phoneFrom.Note;
phoneTo.PhoneNumber = phoneFrom.PhoneNumber;
phoneTo.Order = phoneFrom.Order;
_db.Entry(phoneTo).State = EntityState.Modified;
}
If a phone is not valid and has an id, you try to attach it to the context in two ways:
While deleting it:
_db.Entry(deletePhone).State = EntityState.Deleted;
Besides, when checking whether it is new or not, you either add or update it. That's the problem.
So what you need to do is wrap the add-or-update part inside an else, so that you add or update only if the phone has not been deleted.
This is not really a logic issue; the phone that was being updated is completely different from the phone that was being deleted. The issue is in the ObjectStateManager, because I was doing this in the delete:
var deletePhone = _db.Phones.FirstOrDefault();
And then later I had a separate list of Phones where I tried to set one of them to modified:
_db.Entry(phoneTo).State = EntityState.Modified;
Well, the object state manager now basically has each phone loaded twice. So if I just use _person.Phones for both my delete and my modify, then the Phones list is only loaded once and there are no duplicate keys:
_db.Entry(_person.Phones.FirstOrDefault(r => r.PhoneId == phone.PhoneId)).State = EntityState.Deleted;