Scala: how to use reduceByKey when I have two keys

Data format of one row:
id: 123456
Topiclist: ABCDE:1_8;5_10#BCDEF:1_3;7_11
One id can have many rows:
id: 123456
Topiclist:ABCDE:1_1;7_2;#BCDEF:1_2;7_11#
Target: (123456, (ABCDE,9,2),(BCDEF,5,2))
Records in the topic list are split by #, so ABCDE:1_8;5_10 is one record.
A record is in the format <topicid>:<topictype>_<topicvalue>.
E.g. ABCDE:1_8 has
topicid = ABCDE
topictype = 1
topicvalue = 8
Target: sum the total value of topic type 1 and count its frequency per topic,
so the result should be (id, (topicid, value, frequency)), e.g. (123456, (ABCDE,9,2), (BCDEF,5,2)).
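To make the format concrete, here is a small parsing sketch for a single record (the variable names are only illustrative):
val record = "ABCDE:1_8;5_10"
val Array(topicId, pairs) = record.split(":")   // topicId = "ABCDE", pairs = "1_8;5_10"
val typeValues = pairs.split(";").map { tv =>
  val Array(tpe, value) = tv.split("_")
  (tpe, value.toInt)                            // ("1", 8) and ("5", 10)
}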

Assume that your data are "123456!ABCDE:1_8;5_10#BCDEF:1_3;7_11" and "123456!ABCDE:1_1;7_2#BCDEF:1_2;7_11", so we use "!" to get the userID "123456":
rdd.map { f =>
  val userID = f.split("!")(0)
  val items = f.split("!")(1).split("#")
  var result = List[Array[String]]()
  for (item <- items) {
    val topicID = item.split(":")(0)
    for (topicTypeValue <- item.split(":")(1).split(";")) {
      println(topicTypeValue)
      if (topicTypeValue.split("_")(0) == "1") {
        result = result :+ Array(topicID, topicTypeValue.split("_")(1), "1")
      }
    }
  }
  (userID, result)
}
  .flatMapValues(x => x).filter(f => f._2.length == 3)
  .map { f => ((f._1, f._2(0)), (f._2(1).toInt, f._2(2).toInt)) }
  .reduceByKey { case (x, y) => (x._1 + y._1, x._2 + y._2) }
  .map(f => (f._1._1, (f._1._2, f._2._1, f._2._2))) // (userID, (TopicID, valueSum, frequency))
The output is ("123456",("ABCDE",9,2)) and ("123456",("BCDEF",5,2)), which is slightly different from your target output; you can group this result if you really need ("123456", ("ABCDE",9,2), ("BCDEF",5,2)).
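If the grouped shape is really needed, here is a minimal sketch, assuming the chained expression above is assigned to a value (the name result is only an example):
// Assuming `result` holds the (userID, (topicID, valueSum, frequency)) pairs produced above
val grouped = result.groupByKey().mapValues(_.toList)
// grouped: (userID, List((topicID, valueSum, frequency)))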

H2: counting (with table lock)

I need to implement a counter by prefix and get the current value. Therefore I created a table UPLOAD_ID:
CREATE TABLE UPLOAD_ID
(
COUNTER INT NOT NULL,
UPLOAD_PREFIX VARCHAR(60) PRIMARY KEY
);
Using H2 and a Spring nativeQuery:
@Query(nativeQuery = true, value = MYQUERY)
override fun nextId(@Param("prefix") prefix: String): Long
with MYQUERY being
SELECT COUNTER FROM FINAL TABLE (
  MERGE INTO UPLOAD_ID T
  USING (SELECT CAST(:prefix AS VARCHAR) AS UPLOAD_PREFIX) S
  ON T.UPLOAD_PREFIX = S.UPLOAD_PREFIX
  WHEN MATCHED
    THEN UPDATE
      SET COUNTER = COUNTER + 1
  WHEN NOT MATCHED
    THEN INSERT (UPLOAD_PREFIX, COUNTER)
      VALUES (S.UPLOAD_PREFIX, 1) );
I'm unable to lock the table to avoid a "Unique index or primary key violation" in my test. In MSSQL I can solve this by adding WITH (HOLDLOCK), i.e. MERGE INTO UPLOAD_ID WITH (HOLDLOCK) T.
The gist of my test looks like
try { uploadIdRepo.deleteById(prefix) } catch (e: EmptyResultDataAccessException) { }
val startCount = uploadIdRepo.findById(prefix).map { it.counter }.orElseGet { 0L }
val workerPool = Executors.newFixedThreadPool(35)
val nextValuesRequested = 100
val res = (1..nextValuesRequested).toList().parallelStream().map { i ->
    workerPool.run {
        uploadIdRepo.nextId(prefix)
    }
}.toList()
res shouldHaveSize nextValuesRequested // result count
res.toSet() shouldHaveSize nextValuesRequested // result spread
res.max() shouldBeEqualComparingTo startCount + nextValuesRequested
Can I solve this with H2?

LINQ Group By or JOIN with OrderBy

I have two tables:
1) BankAccount -> Fields (BankAccountID, Name)
2) BnkTransaction -> Fields (ID, Amount, TransactionType, Total, BankId), FK (BankId)
What I'm trying to do is: I need Name, Amount, Credit or Debit (TransactionType), and Total,
and the result should be grouped by Name, using Entity Framework LINQ.
Here is the code I tried; however, I am not sure how I would get the desired output. By using GroupBy? If so, how?
var joinResult = Entity.BnkTransactions
    .Include("BankAccounts")
    .Select(x => new
    {
        Name = x.BankAccount.Name,
        Amount = x.Amount,
        Credit = x.TransType == 1 ? x.TransType : 0,
        DEBIT = x.TransType == 2 ? x.TransType : 0,
        Total = x.Total
    }).OrderBy(x => x.Name).ToList();
foreach (var item in joinResult)
{
    string credit = item.Credit == 1 ? "Credit" : "---";
    string Debit = item.DEBIT == 2 ? "Debit" : "---";
    Console.WriteLine("Name:-{0} Amount: {1} Credit: {2} DEBIT: {3} Total: {4}",
        item.Name, item.Amount, credit, Debit, item.Total);
}
Please help: how can I achieve this?
You can do this with a group by, something like this:
(not tested)
var joinResult = Entity.BnkTransactions
    .Include("BankAccounts")
    .GroupBy(x => x.BankAccount.Name, (key, g) => new
    {
        Name = key,
        Transactions = g.ToList()
    }).OrderBy(x => x.Name).ToList();
This should get you an anonymous type with one name and a list of transactions attached to it.
See if you can loop through that to output the results you need.

Counter inside of scala for comprehension

I have this piece of code.
for {
  country <- getCountryList
  city <- getCityListForCountry(country)
  person <- getPersonListForCity(city)
} {...}
When we run this code, we need to have a counter inside the body of the loop which increments every time the loop executes. This counter needs to show the number of people processed per country. So it has to reset itself to 0 every time we start executing the loop for a new country.
I tried
for {
  country <- getCountryList
  counterPerCountry = 0
  city <- getCityListForCountry(country)
  person <- getPersonListForCity(city)
} {counterPerCountry = counterPerCountry + 1; ...}
but this says that I am trying to reassign a value to a val.
so I tried
var counterPerCountry = 0
for {
  country <- getCountryList
  counterPerCountry = 0
  city <- getCityListForCountry(country)
  person <- getPersonListForCity(city)
} {counterPerCountry = counterPerCountry + 1; ...}
I also tried
for {
  country <- getCountryList
  var counterPerCountry = 0
  city <- getCityListForCountry(country)
  person <- getPersonListForCity(city)
} {counterPerCountry = counterPerCountry + 1; ...}
If you're just trying to figure out how to assign a value to a var within a for-comprehension for science, here's a solution:
var counter = 0
for {
  a <- getList1
  _ = { counter = 0 }
  b <- getList2(a)
  c <- getList3(b)
} {
  counter = counter + 1
  ...
}
If you're actually trying to count the number of people in a country, and you say it's the number of people in a city times the number of cities in a country, then it comes down to simple arithmetic:
for {
  country <- getCountryList
  cities = getCityListForCountry(country)
  city <- cities
  persons = getPersonListForCity(city)
  personsPerCountry = cities.length * persons.length
  person <- persons
} {...}
I agree with @pamu that a for-comprehension does not seem like a natural choice here. But if you turn the for comprehension into the underlying operations, I think you can get a solution that, while not as readable as a for comprehension, works with Scala's functional style and avoids mutable variables. I'm thinking of something along this line:
getCountryList flatMap (country =>
  (getCityListForCountry(country) flatMap (city =>
    getPersonListForCity(city))
  ).zipWithIndex
)
That should yield a list of (person, index) tuples where the index starts at zero for each country.
The inner part could be turned back into a for comprehension, but I'm not sure whether that would improve readability.
I don't think a for-comprehension allows this naturally. You have to do it in a slightly hacky way. Here is one way to do it:
var counter = 0
for {
  country <- getCountryList.map { elem => counter = 0; elem }
  city <- getCityForCountry(country)
  person <- getPersonForCity(city)
} {
  counter += 1
  //do something else here
}
Or use a function to keep it modular:
var counter = 0
def reset(): Unit = counter = 0
for {
  country <- getCountryList
  _ = reset()
  city <- getCityForCountry(country)
  person <- getPersonForCity(city)
} {
  counter += 1
  //do something else here
}
People per country:
val peoplePerCountry =
  for {
    country <- getCountryList
    cities = getCityForCountry(country)
    city <- cities
    persons = getPersonForCity(city)
  } yield (country -> (cities.length * persons.length))
The code returns a list of (country, persons in that country) pairs.
The above for-comprehension is the answer; you do not have to go for the counter approach. This is functional and clean, with no mutable state.
One more approach, if your only need is the actual sum, would be something compact and functional such as:
getCountryList.map( country =>                      //-- for each country
  (country,                                         //-- return country, and ...
    getCityListForCountry(country).map( city =>     //-- the sum across cities
      getPersonListForCity(city).length             //-- of the number of people in that city
    ).sum
  )
)
which is a list of tuples of countries with the number of people in each country. I like to think of map as the "default" loop where I would have used a for in the past. I've found the index value is very seldom needed. The index value is available with the zipWithIndex method as mentioned in another answer.
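For reference, a tiny zipWithIndex illustration with made-up data:
List("Ann", "Bob", "Carl").zipWithIndex
// List((Ann,0), (Bob,1), (Carl,2))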

In Scala, how do I return a list of all object instances with a property set to a certain value

I am using Scala, and would like to be able to select all instances of an object with a property set to a given value. Say I have an Order class and an OrderItem class, and I want to select or pull all of the order items with an order id of, say, 1 and return them in a list. How do I do this?
I would like to be able to return a list of order line items that have an order id of, say, 1.
Here is my Line Item class definition:
Code:
case class LineItem(val itemId : Int, val orderId:Int, val productId:Int, val quantity:Int)
I then have some order items defined in an object:
Code:
object LineItem {
  val li1 = new LineItem(1, 1, 1, 10)
  val li2 = new LineItem(2, 1, 4, 1)
  val li3 = new LineItem(3, 1, 2, 1)
  val li4 = new LineItem(4, 1, 3, 1)
  val li5 = new LineItem(5, 2, 1, 1)
  val li6 = new LineItem(6, 1, 7, 1)
  var lineItems = Set(li1, li2, li3, li4, li5, li6)
  def findItemsInOrder(ordId: Int) = lineItems.find(_.orderId == ordId)
}
As you can see 5 of the line items belong to the order with an id of 1.
So first a list of orders is printed out for the user to see; then I want the user to be able to select an order and have all the line items within that order printed out in the shell.
So the order id is inputted by the user:
Code:
val orderNo = readLine("Which order would you like to view?").toInt
And then this line:
Code:
println("OrderID: " + LineItem.findItemsInOrder(orderNo).get.orderId + ", ProductID: " + LineItem.findItemsInOrder(orderNo).get.productId + ", Quantity: " + LineItem.findItemsInOrder(orderNo).get.quantity)
Only prints out the first line item with an order id of 1.
I have then tried to loop through all line items, like this:
Code:
var currentLineItems = new ListBuffer[LineItem]()
for (item <- LineItem.lineItems) {
  val item1: LineItem = LineItem(LineItem.findItemsInOrder(orderNo).get.lineId, LineItem.findItemsInOrder(orderNo).get.orderId, LineItem.findItemsInOrder(orderNo).get.productId, LineItem.findItemsInOrder(orderNo).get.quantity)
  currentLineItems += item1
}
for (item <- currentLineItems) {
  println("OrderID: " + LineItem.findItemsInOrder(orderNo).get.orderId + ", ProductID: " + LineItem.findItemsInOrder(orderNo).get.productId + ", Quantity: " + LineItem.findItemsInOrder(orderNo).get.quantity)
}
But this code just prints out the same line item 6 times.
I would be very grateful for any help in solving this.
Thanks,
Jackie
Define findItemsInOrder to keep all elements that match the order id (lineItems is a Set, so convert the result to a List):
def findItemsInOrder(ordId: Int): List[LineItem] = lineItems.filter(_.orderId == ordId).toList
find will locate the first element matching the id; if one is found it returns Some[T], where T is the element type, else None:
Finds the first element of the sequence satisfying a predicate, if any.
If you have multiple elements which match against the id, you need filter:
Selects all elements of this traversable collection which satisfy a predicate.
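For completeness, a minimal usage sketch, assuming the LineItem object above with the filter-based findItemsInOrder:
val orderNo = readLine("Which order would you like to view?").toInt
for (item <- LineItem.findItemsInOrder(orderNo)) {
  println("OrderID: " + item.orderId + ", ProductID: " + item.productId + ", Quantity: " + item.quantity)
}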

Fetching Cassandra row keys

Assume a Cassandra datastore with 20 rows, with row keys named "r1" .. "r20".
Questions:
How do I fetch the row keys of the first ten rows (r1 to r10)?
How do I fetch the row keys of the next ten rows (r11 to r20)?
I'm looking for the Cassandra analogy to:
SELECT row_key FROM table LIMIT 0, 10;
SELECT row_key FROM table LIMIT 10, 10;
Take a look at:
list<KeySlice> get_range_slices(keyspace, column_parent, predicate, range, consistency_level)
Where your KeyRange tuple is (start_key, end_key) == (r1, r10)
Based on my tests there is no order for the rows (unlike columns). CQL 3.0.0 can retrieve row keys but not distinct ones (there may be a way that I do not know of). In my case I do not know what my key range is, so I tried to retrieve all the keys with both Hector and Thrift, and sort the keys later. The performance test with CQL 3.0.0 for 100000 columns and 200 rows took about 500 milliseconds, Hector around 100 and Thrift about 50 milliseconds. My row key here is an integer. The Hector code follows:
public void queryRowkeys() {
    myCluster = HFactory.getOrCreateCluster(CLUSTER_NAME, "127.0.0.1:9160");
    ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
    ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
    myKeyspace = HFactory.createKeyspace(KEYSPACE_NAME, myCluster, ccl);
    RangeSlicesQuery<Integer, Composite, String> rangeSlicesQuery = HFactory.createRangeSlicesQuery(myKeyspace, IntegerSerializer.get(),
            CompositeSerializer.get(), StringSerializer.get());
    long start = System.currentTimeMillis();
    QueryResult<OrderedRows<Integer, Composite, String>> result =
            rangeSlicesQuery.setColumnFamily(CF).setKeys(0, -1).setReturnKeysOnly().execute();
    OrderedRows<Integer, Composite, String> orderedRows = result.get();
    ArrayList<Integer> list = new ArrayList<Integer>();
    for (Row<Integer, Composite, String> row : orderedRows) {
        list.add(row.getKey());
    }
    System.out.println((System.currentTimeMillis() - start));
    Collections.sort(list);
    for (Integer i : list) {
        System.out.println(i);
    }
}
This is the Thrift code:
public void retreiveRows() {
    try {
        transport = new TFramedTransport(new TSocket("localhost", 9160));
        TProtocol protocol = new TBinaryProtocol(transport);
        client = new Cassandra.Client(protocol);
        transport.open();
        client.set_keyspace("prefdb");
        ColumnParent columnParent = new ColumnParent("events");
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 1));
        KeyRange keyRange = new KeyRange(); // Get all keys
        keyRange.setStart_key(new byte[0]);
        keyRange.setEnd_key(new byte[0]);
        long start = System.currentTimeMillis();
        List<KeySlice> keySlices = client.get_range_slices(columnParent, predicate, keyRange, ConsistencyLevel.ONE);
        ArrayList<Integer> list = new ArrayList<Integer>();
        for (KeySlice ks : keySlices) {
            list.add(ByteBuffer.wrap(ks.getKey()).getInt());
        }
        Collections.sort(list);
        System.out.println((System.currentTimeMillis() - start));
        for (Integer i : list) {
            System.out.println(i);
        }
        transport.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
First, you should modify cassandra.yaml (in Cassandra 1.1.0), setting:
partitioner: org.apache.cassandra.dht.ByteOrderedPartitioner
Secondly, you should define the schema as follows:
create keyspace DEMO with placement_strategy =
'org.apache.cassandra.locator.SimpleStrategy' and
strategy_options = [{replication_factor:1}];
use DEMO;
create column family Users with comparator = AsciiType and
key_validation_class = LongType and
column_metadata = [
{
column_name: aaa,
validation_class: BytesType
},{
column_name: bbb,
validation_class: BytesType
},{
column_name: ccc,
validation_class: BytesType
}
];
Finally, you can insert data into Cassandra and run range queries over the row keys.