ResultSet exhausted error when accessing values of a Cassandra table through Scala - scala

I have written simple code to connect to Cassandra and fetch a table's values in Scala. Here is my code:
import com.datastax.driver.core.{Cluster, Session}

def t() = {
  val cluster: Cluster = connect("127.0.0.1")
  println(cluster)
  val session: Session = cluster.connect("podv1")
  // val tx = session.execute("CREATE KEYSPACE myfirstcassandradb WITH " +
  //   "replication = {'class':'SimpleStrategy','replication_factor':1}")
  val p = session.execute("select data from section_tbl limit 2;")
  session.close()
  cluster.close()
}
Although the create-keyspace part works (which I have commented out), when I try to fetch data from the table it gives an error like
ResultSet[ exhausted: false, Columns[data(varchar)]]. I have just started connecting Scala to Cassandra, so I may be making some mistake. Any help on how to get the data would be appreciated.

it gives an error like ResultSet[ exhausted: false, Columns[data(varchar)]]
This is not an error: ResultSet is the object returned by execute. exhausted indicates that you have not yet consumed all the rows, and Columns indicates the name and type of each column.
So, for example, with p being the variable assigned to the result of session.execute, you can do something like:
var p = session.execute("select data from section_tbl limit 2;")
println(p.all())
all() grabs all the rows associated with your query from Cassandra and captures them as a List[Row]. If you have a really large result set (which in your case you don't, since you are limiting to two rows), you may want to iterate over the ResultSet instead, which allows the driver to use paging to retrieve chunks of Rows at a time from Cassandra.
A ResultSet implements java.lang.Iterable<Row>, so you can do all the operations you can with an Iterable. Refer to the ResultSet API for more operations.
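For example, a minimal Scala sketch (untested) of iterating the ResultSet instead of calling all(), assuming the same Cluster/Session-style Java driver and the section_tbl schema from the question; the driver fetches further pages as the iterator advances:
import scala.collection.JavaConverters._ // ResultSet is a java.lang.Iterable[Row]

val rows = session.execute("select data from section_tbl limit 2;")
rows.asScala.foreach { row =>
  println(row.getString("data")) // "data" is the varchar column shown in the question
}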

Related

Spring Data JPA issue with Postgres - Tried to send an out-of-range integer as a 2-byte value

quoteEntitiesPage = quoteRepository.findAllByQuoteIds(quoteIds, pageRequest);
The above query gives me the error "Tried to send an out-of-range integer as a 2-byte value" if the number of elements in the quoteIds parameter is above Short.MAX_VALUE.
What is the best approach to get all quote entities here? My Quote class has id(long) and quoteId(UUID) fields.
When using a query of the type "select ... where x in (list)", such as yours, Spring adds a bind parameter for each list element. PostgreSQL limits the number of bind parameters in a query to Short.MAX_VALUE, so when the list is longer than that, you get that exception.
A simple solution for this problem would be to partition the list in blocks, query for each one of them, and combine the results.
Something like this, using Guava:
List<QuoteEntity> result = new ArrayList<>();
List<List<Long>> partitionedQuoteIds = Lists.partition(quoteIds, 10000);
for (List<Long> partitionQuoteIds : partitionedQuoteIds) {
    result.addAll(quoteRepository.findAllByQuoteIds(partitionQuoteIds));
}
This is very wasteful when paginating, but it might be enough for your use case.

Bulk delete records from HBase - how to convert an RDD to Array[Byte]?

I have an RDD of objects that I want to bulk delete from HBase. After reading HBase documentation and examples I came up with the following code:
hc.bulkDelete[Array[Byte]](salesObjects, TableName.valueOf("salesInfo"),
putRecord => new Delete(putRecord), 4)
However, as far as I understand, salesObjects has to be converted to Array[Byte].
Since salesObjects is an RDD[Sale], how do I convert it to Array[Byte] correctly?
I've tried Bytes.toBytes(salesObjects), but the method doesn't accept RDD[Sale] as an argument. Sale is a complex object, so it would be problematic to convert each field to bytes.
For now I've converted RDD[Sale] with val salesList: List[Sale] = salesObjects.collect().toList, but I'm currently stuck on where to proceed next.
I've never used this method but I'll try to help:
the method accepts an RDD of any type T: https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseContext.scala#L290 ==> so you should be able to use it on your RDD[Sale]
bulkDelete expects a function transforming your Sale object to HBase's Delete object (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html)
Delete object represents a row to delete. You can get an example of Delete object initialization here: https://www.tutorialspoint.com/hbase/hbase_delete_data.htm
depending on what you want to remove and how, you should convert the relevant parts of your Sale into bytes. For instance, if you want to remove data by row key, you should extract the key and put it into the Delete object (see the sketch below)
In my understanding, the bulkDelete method will accumulate batchSize Delete objects and send them to HBase at once. Otherwise, could you please show some code to get a more concrete idea of what you're trying to do?
Doing val salesList: List[Sale] = salesObjects.collect().toList is not a good idea, since it brings all the data into your driver; potentially it can lead to OOM problems.
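Putting those points together, a rough sketch (not tested) of what the call could look like, assuming Sale exposes a field, here called id, that holds the HBase row key:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.util.Bytes

// salesObjects stays an RDD[Sale]; no collect() is needed
hc.bulkDelete[Sale](salesObjects,
  TableName.valueOf("salesInfo"),
  sale => new Delete(Bytes.toBytes(sale.id)), // assumes `id` holds the row key; adapt to your schema
  4) // batch size: number of Deletes buffered per round trip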

How does read-through work in Ignite

My cache is empty, so SQL queries return null.
Read-through means that if the cache is missed, Ignite will automatically go down to the underlying DB (or persistent store) to load the corresponding data.
If new data is inserted into the underlying DB table, do I have to bring the cache server down to load the newly inserted data from the DB table, or will it sync automatically?
Does it work the same as Spring's @Cacheable, or does it work differently?
It looks to me like the answer is no. Cached SQL queries don't work since there is no data in the cache, but when I tried cache.get I got the following results:
case 1:
System.out.println("data == " + cache.get(new PersonKey("Manish", "Singh")).getPhones());
result ==> data == 1235
case 2:
PersonKey per = new PersonKey();
per.setFirstname("Manish");
System.out.println("data == " + cache.get(per).getPhones());
throws an error (the stack trace was attached as screenshots).
Read-through semantics can be applied when there is a known set of keys to read. This is not the case with SQL, so if your data is in an arbitrary 3rd-party store (RDBMS, Cassandra, HBase, ...), you have to preload the data into memory prior to running queries.
However, Ignite provides native persistence storage [1], which eliminates this limitation. It allows you to use any Ignite API without having anything in memory, and this includes SQL queries as well. Data will be fetched into memory on demand as you use it.
[1] https://apacheignite.readme.io/docs/distributed-persistent-store
When you insert something into the database and it is not in the cache yet, get operations will retrieve the missing values from the DB, provided readThrough is enabled and a CacheStore is configured.
But currently it doesn't work this way for SQL queries executed on the cache. You should call loadCache first; then the values will appear in the cache and will be available to SQL.
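As a hedged illustration of "readThrough is enabled and a CacheStore is configured" (written in Scala against Ignite's Java API; PersonCacheStore is a hypothetical CacheStore implementation backed by your database), the configuration might look roughly like this:
import javax.cache.configuration.FactoryBuilder
import org.apache.ignite.Ignition
import org.apache.ignite.configuration.CacheConfiguration

val ignite = Ignition.start()

val cacheCfg = new CacheConfiguration[PersonKey, Person]("personCache")
cacheCfg.setReadThrough(true) // misses on get() fall through to the CacheStore
cacheCfg.setCacheStoreFactory(
  FactoryBuilder.factoryOf(classOf[PersonCacheStore])) // hypothetical CacheStore over your DB

val cache = ignite.getOrCreateCache(cacheCfg)
// cache.get(key) now loads missing keys from the DB; SQL still needs loadCache first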
When you perform your second get, the exact combination of firstname and lastname is looked up in the DB. It is converted into a CQL query containing a lastname=null condition, which fails because lastname cannot be null.
UPD:
To get all records whose firstname column equals 'Manish', you can first call loadCache with an appropriate predicate and then run an SQL query on the cache.
cache.loadCache((k, v) -> v.firstname.equals("Manish"));
SqlFieldsQuery qry = new SqlFieldsQuery("select firstname, lastname from Person where firstname='Manish'");
try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
    for (List<?> row : cursor)
        System.out.println("firstname:" + row.get(0) + ", lastname:" + row.get(1));
}
Note that loadCache is a heavy operation that has to run over all records in the DB, so it shouldn't be called too often. If you provide null as the predicate, all records will be loaded from the database.
Also, to make SQL run fast on the cache, you should mark the firstname field as indexed in the QueryEntity configuration.
In your case 2, have you tried specifying lastname as well? By your stack trace it's evident that Cassandra expects it to be not null.

C# Comparing lists of data from two separate databases using LINQ to Entities

I have 2 SQL Server databases hosted on two different servers. I need to extract data from the first database, which is going to be a list of integers. Then I need to compare this list against data in multiple tables in the second database. Depending on some conditions, I need to update or insert some records in the second database.
My solution:
(WCF Service/Entity Framework using LINQ to Entities)
Get the list of integers from the 1st DB; this takes less than a second and returns 20,942 records.
I use the list of integers to compare against a table in the second DB using the following query:
List<int> pastDueAccts; // Assuming this is the list from Step #1
var matchedAccts = from acct in context.AmAccounts
                   where pastDueAccts.Contains(acct.ARNumber)
                   select acct;
The above query takes so long that it gives a timeout error, even though the AmAccount table only has ~400 records.
After I get these matchedAccts, I need to update or insert records in a separate table in the second db.
Can someone help me make step #2 more efficient? I think the Contains function makes it slow. I tried brute force too, putting a foreach loop in which I extract one record at a time and do the comparison. It still takes too long and gives a timeout error. The database server shows only 30% of its memory in use.
Profile the SQL query being sent to the database using SQL Profiler. Capture the SQL statement sent to the database and run it in SSMS. You should be able to capture the overhead imposed by Entity Framework at this point. Can you paste the SQL statement emitted in step #2 into your question?
The query itself is going to have all 20,942 integers in it.
If your AmAccount table will always have a low number of records like that, you could just return the entire list of ARNumbers, compare them to the list, then be specific about which records to return:
List<int> pastDueAccts; // Assuming this is the list from Step #1
List<int> amAcctNumbers = (from acct in context.AmAccounts
                           select acct.ARNumber).ToList();
// Get a list of integers that are in both lists
var pastDueAmAcctNumbers = pastDueAccts.Intersect(amAcctNumbers).ToList();
var pastDueAmAccts = from acct in context.AmAccounts
                     where pastDueAmAcctNumbers.Contains(acct.ARNumber)
                     select acct;
You'll still have to worry about how many ids you are supplying to that query, and you might end up needing to retrieve them in batches.
UPDATE
Hopefully somebody has a better answer than this, but with so many records and doing this purely in EF, you could try batching it like I stated earlier:
// Suggest disabling auto detect changes,
// otherwise you will probably have some serious memory issues
// with 2MM+ records
context.Configuration.AutoDetectChangesEnabled = false;
List<int> pastDueAccts; // Assuming this is the list from Step #1
const int batchSize = 100;
for (int i = 0; i < pastDueAccts.Count; i += batchSize)
{
    var batch = pastDueAccts.GetRange(i, Math.Min(batchSize, pastDueAccts.Count - i));
    var pastDueAmAccts = from acct in context.AmAccounts
                         where batch.Contains(acct.ARNumber)
                         select acct;
}

ETL: parallel lookup and insert in Scala

For our ETL, the fact data doesn't have item_key, but it has item_number. During loading, if we can find the item_key for the item_number we just use it; if we cannot find it, we auto-create an item_key. Currently the process is not parallel; I am thinking about running it in parallel using Scala, since Scala has built-in concurrent collections.
Let's use a simple example:
val keys = 1 to 1000
val items = keys map { num => "Item" + num }
var itemMap = (items zip keys).toMap
and now we have millions of rows to load, whose item numbers are:
def g(v: String) = List.fill(5000)(v)
var fact = "Item2000" :: items.flatMap(g).toList
Since the fact data has an item Item2000 which can't be found in the item master data in itemMap, we need to auto-create a mapping (Item2000, 2000) and add it to itemMap, so that if we encounter Item2000 again in the future we can use the same item key.
How can this be implemented using a concurrent collection? For each row in the fact data, if the item key can't be found we auto-create it, so we need a way to lock itemMap; otherwise multiple threads might try to insert auto-created data into itemMap at the same time.
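Not a settled design, but a minimal sketch of the lock-free direction I'm considering: a scala.collection.concurrent.TrieMap for the shared item map plus an AtomicInteger for generating new keys (both are my own illustrative choices), so that concurrent auto-creates agree on a single key without explicit locking:
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap

val itemMap = TrieMap((items zip keys): _*) // thread-safe item master map
val nextKey = new AtomicInteger(keys.max)   // new keys simply continue from the current max

// Resolve an item number to a key, auto-creating one if it is missing.
// putIfAbsent is atomic: if two threads race on the same item number,
// only one insert wins and both end up using the same key.
def resolve(itemNumber: String): Int =
  itemMap.get(itemNumber) match {
    case Some(k) => k
    case None =>
      val candidate = nextKey.incrementAndGet()
      itemMap.putIfAbsent(itemNumber, candidate).getOrElse(candidate)
  }

val factKeys = fact.par.map(resolve) // .par assumes Scala 2.12-style parallel collections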