Can I use non-volatile external variables in a Scala Enumeratee?

I need to group the output of my Enumerator into different ZipEntries, based on a specific property (providerId). The original chartPreparations stream is ordered by providerId, so I can just keep a reference to the current provider and add a new entry when the provider changes.
Enumerator.outputStream(os => {
val currentProvider = new AtomicReference[String]()
// Step 1. Creating zipped output file
val zipOs = new ZipOutputStream(os, Charset.forName("UTF8"))
// Step 2. Processing chart preparation Enumerator
val chartProcessingTask = (chartPreparations) run Iteratee.foreach(cp => {
// Step 2.1. Write new entry if needed
if(currentProvider.get() == null || cp.providerId != currentProvider.get()) {
if (currentProvider.get() != null) {
zipOs.write("</body></html>".getBytes(Charset.forName("UTF8")))
}
currentProvider.set(cp.providerId)
zipOs.putNextEntry(new ZipEntry(cp.providerName + ".html"))
zipOs.write(HTML_HEADER)
}
// Step 2.2 Write chart preparation in HTML format
zipOs.write(toHTML(cp).getBytes(Charset.forName("UTF8")))
})
// Step 3. On Complete close stream
chartProcessingTask.onComplete(_ => zipOs.close())
})
Since the current provider reference changes during the output, I made it an AtomicReference so that I could handle access from different threads.
Can currentProvider just be a var Option[String], and why?
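For reference, this is roughly the var-based variant I have in mind; it is only a sketch, and whether it is safe presumably depends on whether the Iteratee.foreach callback is guaranteed to run sequentially with the necessary memory visibility between steps:
Enumerator.outputStream { os =>
  val zipOs = new ZipOutputStream(os, Charset.forName("UTF8"))
  // Plain var instead of AtomicReference; the open question is whether this
  // is safe given how the Iteratee schedules its steps.
  var currentProvider: Option[String] = None
  val chartProcessingTask = chartPreparations run Iteratee.foreach { cp =>
    if (currentProvider != Some(cp.providerId)) {
      if (currentProvider.isDefined) {
        zipOs.write("</body></html>".getBytes(Charset.forName("UTF8")))
      }
      currentProvider = Some(cp.providerId)
      zipOs.putNextEntry(new ZipEntry(cp.providerName + ".html"))
      zipOs.write(HTML_HEADER)
    }
    zipOs.write(toHTML(cp).getBytes(Charset.forName("UTF8")))
  }
  chartProcessingTask.onComplete(_ => zipOs.close())
}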

Related

Extend HBase Put to avoid original Row Check in add method

HBase Need to export data from one cluster and import it to another with slight modification in row key
As I referred to in the above post, I need to export the HBase data of a table from one cluster and import it into another cluster, changing the row key based on our match pattern.
In "org.apache.hadoop.hbase.mapreduce.Import" there is an option to change the column family using the "HBASE_IMPORTER_RENAME_CFS" argument.
I have slightly modified the Import code to support changing the row key. My code is available on Pastebin:
https://pastebin.com/ticgeBb0
I changed the row key using the code below.
private static Cell convertRowKv(Cell kv, Map<byte[], byte[]> rowkeyReplaceMap) {
if (rowkeyReplaceMap != null) {
byte[] oldrowkeyName = CellUtil.cloneRow(kv);
String oldrowkey = Bytes.toString(oldrowkeyName);
Set<byte[]> keys = rowkeyReplaceMap.keySet();
for (byte[] key : keys) {
if (oldrowkey.contains(Bytes.toString(key))) {
byte[] newrowkeyName = rowkeyReplaceMap.get(key);
ByteBuffer buffer = ByteBuffer.wrap(oldrowkeyName);
buffer.get(key);
ByteBuffer newbuffer = buffer.slice();
ByteBuffer bb = ByteBuffer.allocate(newrowkeyName.length + newbuffer.capacity());
byte[] newrowkey = bb.array();
kv = new KeyValue(newrowkey, // row buffer
0, // row offset
newrowkey.length, // row length
kv.getFamilyArray(), // CF buffer
kv.getFamilyOffset(), // CF offset
kv.getFamilyLength(), // CF length
kv.getQualifierArray(), // qualifier buffer
kv.getQualifierOffset(), // qualifier offset
kv.getQualifierLength(), // qualifier length
kv.getTimestamp(), // timestamp
KeyValue.Type.codeToType(kv.getTypeByte()), // KV Type
kv.getValueArray(), // value buffer
kv.getValueOffset(), // value offset
kv.getValueLength()); // value length
}
}
}
return kv;
}
I executed the import:
hbase org.apache.hadoop.hbase.mapreduce.ImportWithRowKeyChange -DHBASE_IMPORTER_RENAME_ROW=123:123456 import file:///home/nshsh/export/
The row key was successfully changed. But when putting the Cell into the HBase table using "org.apache.hadoop.hbase.client.Put.add(Cell)", there is a check that the row of the kv is the same as the row of the Put, and since we are changing the row key, it fails here.
I then commented out the check in the Put class and updated hbase-client.jar. I have also tried writing an HBasePut class that extends Put:
public class HBasePut extends Put {
public HBasePut(byte[] row) {
super(row);
// TODO Auto-generated constructor stub
}
public Put add(Cell kv) throws IOException{
byte [] family = CellUtil.cloneFamily(kv);
System.err.print(Bytes.toString(family));
List<Cell> list = getCellList(family);
//Checking that the row of the kv is the same as the put
/*int res = Bytes.compareTo(this.row, 0, row.length,
kv.getRowArray(), kv.getRowOffset(), kv.getRowLength());
if (res != 0) {
throw new WrongRowIOException("The row in " + kv.toString() +
" doesn't match the original one " + Bytes.toStringBinary(this.row));
}*/
list.add(kv);
familyMap.put(family, list);
return this;
}
}
In the MapReduce job, the task always fails with the exception below:
2020-07-24 13:37:15,105 WARN [htable-pool1-t1] hbase.HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
2020-07-24 13:37:15,122 INFO [LocalJobRunner Map Task Executor #0] client.AsyncProcess: , tableName=import
2020-07-24 13:37:15,178 INFO [htable-pool1-t1] client.AsyncProcess: #2, table=import, attempt=18/35 failed=7ops, last exception: org.apache.hadoop.hbase.client.WrongRowIOException: org.apache.hadoop.hbase.client.WrongRowIOException: The row in \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/vfrt:con/1589541180643/Put/vlen=225448/seqid=0 doesn't match the original one 123_abcf
at org.apache.hadoop.hbase.client.Put.add(Put.java:330)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:574)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:744)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:720)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2168)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745)
I don't know where the old Put class is still being referenced in the task.
Can someone please help me fix this?

How do I update a MongoDB document with a new value using Reactor's Mono? (Kotlin)

The context is that I need to update a value in a single document. I have a Mono<ChangeBalanceRequestResource>; the parameter object contains values such as a username (to find the correct user by their unique username) and an amount.
The problem is that this value (due to other components of my application) is the amount by which I need to increase/decrease the user's balance, as opposed to being the new balance itself. I intend to do this using two Monos: one finds the user, and it is then combined with the other Mono carrying the inbound request, so that I can perform a simple sum (i.e. balance + changeRequest.amount) and write the result back to the document database.
override fun increaseBalance(changeRequest: Mono<ChangeBalanceRequestResource>): Mono<ChangeBalanceResponse> {
val changeAmount: Mono<Decimal128> = changeRequest.map { it.transactionAmount }
val user: Mono<User> = changeRequest.flatMap { rxUserRepository.findByUsername(it.username) }
val newBalace = user.map {
val r = changeAmount.block()
it.balance = sumBalance(it.balance!!, r!!)
rxUserRepository.save(it)
}
.flatMap { it }
.map { it.balance!! }
return Mono.just(ChangeBalanceResponse("success", newBalace.block()!!))
}
Obviously I'm trying to achieve this in a non-blocking fashion. I'm also open to using only a single Mono if that's possible/optimal. I also appreciate I've truly butchered the example and used .block as a placeholder to illustrate what I'm trying to achieve.
P.S this is my first post, so any tips on how to express my problem clearer would be useful.
Here's how I would do this in Java (Using Double instead of Decimal128):
public Mono<ChangeBalanceResponse> increaseBalance(Mono<ChangeBalanceRequestResource> changeRequest) {
Mono<Double> changeAmount = changeRequest.map(a -> a.transactionAmount());
Mono<User> user = changeRequest.map(a -> a.username()).flatMap(rxUserRepository::findByUsername);
return Mono.zip(changeAmount, user).flatMap(t2 -> {
Double amount = t2.getT1();
User u = t2.getT2();
// assumes User's setters are chained (fluent), so balance(...) returns the User
return rxUserRepository.save(u.balance(sumBalance(amount, u.balance())));
}).map(res -> new ChangeBalanceResponse("success", res.newBalance()));
}

How do you update the CanExecute value after the ReactiveCommand has been declared

I am using ReactiveUI with AvaloniaUI and have a ViewModel with several ReactiveCommands namely Scan, Load, and Run.
Scan is invoked when an Observable<string> is updated (when I receive a barcode from a scanner).
Load is triggered from within the Scan command.
Run is triggered from a button on the UI.
Simplified code below:
var canRun = Events.ToObservableChangeSet().AutoRefresh().ToCollection().Select(x => x.Any());
Run = ReactiveCommand.CreateFromTask<bool>(EventSuite.RunAsync, canRun);
var canLoad = Run.IsExecuting.Select(x => x == false);
var Load = ReactiveCommand.CreateFromTask<string, Unit>(async (barcode) =>
{
//await - go off and load Events.
}, canLoad);
var canReceiveScan = Load.IsExecuting.Select(x => x == false)
.Merge(Run.IsExecuting.Select(x => x == false));
var Scan = ReactiveCommand.CreateFromTask<string, Unit>(async (barcode) =>
{
//do some validation stuff
await Load.Execute(barcode)
}, canReceiveScan);
Barcode
.SubscribeOn(RxApp.TaskpoolScheduler)
.ObserveOn(RxApp.MainThreadScheduler)
.InvokeCommand(Scan);
Each command can only be executed if no other command is running (including itself). But I can't reference a command's IsExecuting property before it is declared. So I have been trying to merge the "CanExecute" observables like so:
canRun = canRun
.Merge(Run.IsExecuting.Select(x => x == false))
.Merge(Load.IsExecuting.Select(x => x == false))
.Merge(Scan.IsExecuting.Select(x => x == false))
.ObserveOn(RxApp.MainThreadScheduler);
// same for canLoad and canScan
The issue I'm having is that the ReactiveCommand will continue to execute when another command is executing.
Is there a better/correct way to implement this?
But I can't reference the commands' IsExecuting property before it is declared.
One option is to use a Subject<T>, pass it as the canExecute: parameter to the command, and later emit new values using OnNext on the Subject<T>.
Another option is to use WhenAnyObservable:
this.WhenAnyObservable(x => x.Run.IsExecuting)
// Here we get IObservable<bool>,
// representing the current execution
// state of the command.
.Select(executing => !executing)
Then, you can apply the Merge operator to the observables generated by WhenAnyObservable. To skip initial null values, if any, use either the Where operator or .Skip(1).
To give an example of the Subject<T> option described in the answer by Artyom, here is something inspired by Kent Boogaart's book p. 82:
var canRun = new BehaviorSubject<bool>(true);
Run = ReactiveCommand.Create...(..., canExecute: canRun);
Load = ReactiveCommand.Create...(..., canExecute: canRun);
Scan = ReactiveCommand.Create...(..., canExecute: canRun);
Observable.Merge(Run.IsExecuting, Load.IsExecuting, Scan.IsExecuting)
.Select(executing => !executing).Subscribe(canRun);

Entity Framework is too slow when mapping up to 100k rows

I have at least 100,000 rows in a Job_Details table and I'm using Entity Framework to map the data.
This is the code:
public GetJobsResponse GetImportJobs()
{
GetJobsResponse getJobResponse = new GetJobsResponse();
List<JobBO> lstJobs = new List<JobBO>();
using (NSEXIM_V2Entities dbContext = new NSEXIM_V2Entities())
{
var lstJob = dbContext.Job_Details.ToList();
foreach (var dbJob in lstJob.Where(ie => ie.IMP_EXP == "I" && ie.Job_No != null))
{
JobBO job = MapBEJobforSearchObj(dbJob);
lstJobs.Add(job);
}
}
getJobResponse.Jobs = lstJobs;
return getJobResponse;
}
I found that this line takes about 2-3 minutes to execute:
var lstJob = dbContext.Job_Details.ToList();
How can I solve this issue?
To outline the performance issues with your example (see inline comments):
public GetJobsResponse GetImportJobs()
{
GetJobsResponse getJobResponse = new GetJobsResponse();
List<JobBO> lstJobs = new List<JobBO>();
using (NSEXIM_V2Entities dbContext = new NSEXIM_V2Entities())
{
// Loads *ALL* entities into memory. This effectively takes all fields for all rows across from the database to your app server. (Even though you don't want it all)
var lstJob = dbContext.Job_Details.ToList();
// Filters from the data in memory.
foreach (var dbJob in lstJob.Where(ie => ie.IMP_EXP == "I" && ie.Job_No != null))
{
// Maps the entity to a DTO and adds it to the return collection.
JobBO job = MapBEJobforSearchObj(dbJob);
lstJobs.Add(job);
}
}
// Returns the DTOs.
getJobResponse.Jobs = lstJobs;
return getJobResponse;
}
First, pass your WHERE clause to EF so it is executed on the DB server, rather than loading all entities into memory:
public GetJobsResponse GetImportJobs()
{
GetJobsResponse getJobResponse = new GetJobsResponse();
using (NSEXIM_V2Entities dbContext = new NSEXIM_V2Entities())
{
// Will pass the Where expression to the DB server to be executed. Note: no .ToList() yet, to leave this as IQueryable.
var jobs = dbContext.Job_Details.Where(ie => ie.IMP_EXP == "I" && ie.Job_No != null);
Next, use Select to load your DTOs. Typically these won't contain as much data as the main entity, and as long as you're working with IQueryable you can load related data as needed. Again, this will be sent to the DB server, so you cannot use functions like "MapBEJobforSearchObj" here because the DB server does not know that function. You can Select into a simple DTO object, or into an anonymous type to pass to a dynamic mapper.
var dtos = jobs.Select(ie => new JobBO
{
JobId = ie.JobId,
// ... populate remaining DTO fields here.
}).ToList();
getJobResponse.Jobs = dtos;
return getJobResponse;
}
Moving the .ToList() to the end materializes the data directly into your JobBO DTOs/ViewModels, pulling just enough data from the server to populate the desired rows with just the desired fields.
In cases where you may have a large amount of data, you should also consider supporting server-side pagination where you pass a page # and page size, then utilize a .Skip() + .Take() to load a single page of entries at a time.

How to count new elements from a stream using Spark Streaming

I have implemented a daily computation. Here is some pseudo-code.
"newUser" might also be called a first-activated user.
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(newHistoryUser, path + todayDate)
The computation of "activeUser" can be converted to Spark Streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
val time = System.currentTimeMillis()
val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})
val activeUser = transformedLog.groupByKeyAndWindow(Seconds(86400), Seconds(60)).mapValues(x => {
var firstLine = x.head
x.foreach(line => {
if (line._2 < firstLine._2) firstLine = line
})
firstLine
})
But the approach for "newUser" and "historyUser" is confusing me.
I think my question can be summarized as "how to count new elements from a stream". As in my pseudo-code above, "newUser" is part of "activeUser", and I must maintain a set of "historyUser" to know which part is "newUser".
I have considered an approach, but I think it may not work the right way:
Load the history users as an RDD. For each batch of the "activeUser" DStream, find the elements that do not exist in "historyUser". One problem here is when I should update this "historyUser" RDD so that I get the right "newUser" for a window.
Updating the "historyUser" RDD means adding "newUser" to it, just like what I did in the pseudo-code above, where "historyUser" is updated once a day. Another problem is how to perform this RDD update from a DStream. I think updating "historyUser" when the window slides is proper, but I haven't found a suitable API to do this. A rough sketch of the idea is below.
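A minimal sketch of that idea, reusing the pseudo-helpers from my batch code and assuming "historyUser" is keyed the same way as "activeUser" (by (appid, bcode, uid)):
// Diff each batch of activeUser against the currently known history.
var historyUserRdd = loadFromHdfs(path + yesterdayDate)
val newUser = activeUser.transform { rdd =>
  rdd.subtractByKey(historyUserRdd)
}
// The open question is when and how to refresh historyUserRdd
// (e.g. union it with newUser as the window slides).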
So what is the best practice for solving this problem?
updateStateByKey would help here, as it allows you to set an initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept:
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))
case class UserStatusState(isNew: Boolean, values: UserData)
// this will prepare the RDD of already known historical users
// to pass into updateStateByKey as initial state
val initialStateRDD = historyUsers.map(user => UserStatusState(false, user))
// stateful stream
val trackUsers = sdkLogDs.updateStateByKey(updateState, new HashPartitioner(sdkLogDs.ssc.sparkContext.defaultParallelism), true, initialStateRDD)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)
def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
// Group all values for specific user as needed
val groupedUserData: UserData = newValues.reduce(...)
// prevState is defined only for users previously seen in the stream
// or loaded as initial state from historyUsers RDD
// For new users it is None
val isNewUser = !prevState.isDefined
// as you return state here for the user - prevState won't be None on next iterations
Some(UserStatusState(isNewUser, groupedUserData))
}
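If the updated history still needs to be written out periodically, as in the daily pseudo-code in the question, one possible addition (a sketch only, reusing the question's saveToHdfs pseudo-helper and its path prefix) is a foreachRDD on the stateful stream:
// Persist a snapshot of all known users so a later run can reload it as initial state.
// UserStatusState.values holds the per-user data; saveToHdfs is the question's pseudo-helper.
trackUsers.foreachRDD { (rdd, time) =>
  val snapshot = rdd.mapValues(_.values)
  saveToHdfs(snapshot, path + time.milliseconds)
}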