Iterating over PTable in crunch - apache-crunch

I have following PTables,
PTable<String, String> somePTable1 = somePCollection1.parallelDo(new SomeClass(),
Writables.tableOf(Writables.strings(), Writables.strings()));
PTable<String, Collection<String>> somePTable2 = somePTable1.collectValues();
For somePTable2 described above, I want to create a new file for every record in somePTable2. Is there any way to iterate over somePTable2 so that I can access each record? I know I can apply a DoFn to somePTable2, but is it possible to apply a pipeline.write() operation inside a DoFn?

Try this to store your list as is (note that write() takes a Target, e.g. To.textFile(...)):
somePTable2.values().write(To.textFile("/path/to/output")); // the output path is just an example
If you want to generate one record for each element of the collection inside your PTable, you will need to apply a DoFn and emit one record per element before writing.
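For illustration, a minimal sketch of such a DoFn (the tab separator, the output path, and the use of To.textFile are assumptions, not from the original answer):
// uses org.apache.crunch.DoFn, Emitter, Pair, PCollection,
// org.apache.crunch.io.To and org.apache.crunch.types.writable.Writables
PCollection<String> flattened = somePTable2.parallelDo(
    new DoFn<Pair<String, Collection<String>>, String>() {
      @Override
      public void process(Pair<String, Collection<String>> input, Emitter<String> emitter) {
        // emit one output record per element of the collected values
        for (String value : input.second()) {
          emitter.emit(input.first() + "\t" + value);
        }
      }
    }, Writables.strings());
flattened.write(To.textFile("/path/to/output")); // illustrative output path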

Related

Filter on CassandraJoinRDD

I have applied a join between a file and an existing Cassandra table via joinWithCassandraTable. Now, I want to apply a filter on the resulting joinCassandraRDD. Here is the code I have written for extracting the data:
var outrdd = sc.textFile("/usr/local/spark/bin/select_element/src/main/scala/file_small.txt")
.map(_.toString).map(Tuple1(_))
.joinWithCassandraTable(settings.keyspace, settings.table)
.select("id", "listofitems")
Here "/usr/local/spark/bin/select_element/src/main/scala/file_small.txt" is the text file which is having a list of ids. Now, I have some elements in another list, say userlistofitems=["jas", "yuk"], I need to search 'userlistofitems' sublist from 'listofitems' column of joinCassandraRDD.
We have around 2Million ids where we have several user_lists for which we have to extract the data from Cassandra. We are using versions spark=2.4.4, scala=2.11.12, and spark-cassandra-connector=spark-cassandra-connector-2.4.2-3-gda70746.jar.
Any help is highly appreciated.
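A rough, untested sketch of such a filter (assuming listofitems is a list column, so CassandraRow.getList applies); the join result is still an RDD, so an ordinary filter can check the containment:
val userlistofitems = Seq("jas", "yuk")
val filtered = outrdd.filter { case (_, row) =>
  // row is the joined CassandraRow; keep it only if every requested item is present
  val items = row.getList[String]("listofitems")
  userlistofitems.forall(items.contains)
}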
References Used:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc,
https://www.youtube.com/watch?v=UsenTP029tM

How to get counts from AlterRow transformation in Azure Data Factory

I have an AlterRow transformation that marks each row with the appropriate CRUD operation in an ADFv2 data flow. I don't see any output variables on this activity that will give me the total inserts, updates, etc. I do, however, see methods in the expression syntax to tell me if a particular row is an IsInsert(), IsUpdate(), etc.
Would the correct way to get counts be to
Add another output from the AlterRow transformation
Add a derived column that uses the expression syntax IsInsert(), IsUpdate() to set the operation type (I, U, D)
Add an aggregate to group by this column to get total counts for each operation
When creating the aggregate, I don't see any metadata that would allow me to group by the CRUD operation type so I assume I would have to create this myself, but it seems like it should already be there since that's the purpose of the AlterRow transformation. Am I working too hard to get these counts?
Add an aggregate after your AlterRow with no group-by and use conditional-count formulas for each operation type.
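A sketch of what those aggregate expressions could look like, assuming the countIf aggregate and the isInsert()/isUpdate()/isDelete()/isUpsert() row markers from the data flow expression language (verify the exact names against the expression reference):
insertCount = countIf(isInsert())
updateCount = countIf(isUpdate())
deleteCount = countIf(isDelete())
upsertCount = countIf(isUpsert())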

Merge vertex list in gremlin orientDb

I am new to the graph database world. I have a query that gets the leaves of a tree, and I also have a list of Ids. I want to merge both lists of vertices, remove duplicates, and sum a property over the combined set. I cannot manage to merge the first two sets of vertices:
g.V().hasLabel('Group').has('GroupId','G001').
  repeat(outE().inV()).emit().hasLabel('User').as('UsersList1').
  V().has('UserId', within('001','002')).as('UsersList2').
  select('UsersList1','UsersList2').dedup().values('petitions').sum().unfold()
Regards
There are several things wrong in your query:
you call V().has('UserId', within('001','002')) for every user that was found by the first part of the traversal
the traversal could emit more than just the leaves
select('UsersList1','UsersList2') creates pairs of users
values('petitions') tries to access the property petitions of each pair, which will always fail
The correct approach would be:
g.V().has('User', 'UserId', within('001','002')).fold().
union(unfold(),
V().has('Group','GroupId','G001').
repeat(out()).until(hasLabel('User'))).
dedup().
values('petitions').sum()
I didn't test it, but I think the following will do:
g.V().union(
hasLabel('Group').has('GroupId','G001').repeat(
outE().inV()
).until(hasLabel('User')),
has('UserId', within('001','002')))
.dedup().values('petitions').sum()
In order to get only the tree leaves, it is better to use until. Using emit will output all inner tree nodes as well.
union merges the two inner traversals.

How to obtain the output of the pipeline and perform read&write to Cloud Firestore

I am using Apache Beam to take logs from Pub/Sub containing pageview traffic information. Each page has a unique ID, and when a pageview log comes in from Pub/Sub, Cloud Dataflow collects the logs in a constant windowed manner and counts them. At the end of the combiner, we get something like this:
12345, 2
12456, 1
15213, 1
...
As I know, ParDo is a Beam transform for generic parallel processing. After the combine, I wish to implement a transform that queries Cloud Firestore for the existing pageview ID, takes the current view count, adds the new count to it, and performs a write operation to update the view count, one by one for the combined output shown above. Any suggestion?
Below is my code so far for the UpdateViewCount. When I get the query result, it seems impossible to have a for loop over it (it will only be one row anyway, since the pageview ID is unique).
class UpdateIntoFireStore(beam.DoFn):
    def process(self, element):
        listingid, count = element
        doc_ref = db.collection('listings').where('listingid', u'==', '12345')
        try:
            docs = doc_ref.get()
            for doc in docs:
                print(doc)
        except NotFound:
            print(u'No such document!')
I solved it. There is no need for a loop to retrieve the data; I should retrieve the particular document by using the listing ID as the document name.
doc_ref = db.collection(u'listings').document(listingid)
try:
    doc = doc_ref.get()
    doc_dict = doc.to_dict()
    self.cur_count = doc_dict[u'count']
    doc_ref.update({
        u'count': self.cur_count + count
    })
except NotFound:
    doc_ref.set({'count': count})
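For completeness, a minimal sketch of wiring this DoFn into the pipeline after the combiner (pageview_counts is an assumed name for the combined (id, count) PCollection):
# pageview_counts is the windowed, combined (id, count) PCollection from the earlier steps
pageview_counts | 'UpdateFirestore' >> beam.ParDo(UpdateIntoFireStore())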

Bulk delete records from HBase - how to convert an RDD to Array[Byte]?

I have an RDD of objects that I want to bulk delete from HBase. After reading HBase documentation and examples I came up with the following code:
hc.bulkDelete[Array[Byte]](salesObjects, TableName.valueOf("salesInfo"),
putRecord => new Delete(putRecord), 4)
However as far as I understand salesObjects has to be converted to Array[Byte].
Since salesObjects is an RDD[Sale] how to convert it to Array[Byte] correctly?
I've tried Bytes.toBytes(salesObjects) but the method doesn't accept RDD[Sale] as an argument. Sale is a complex object so it will be problematic to parse each field to bytes.
For now I've converted RDD[Sale] to val salesList: List[Sale] = salesObjects.collect().toList, but I'm currently stuck on where to proceed next.
I've never used this method but I'll try to help:
the method accepts an RDD of any type T: https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseContext.scala#L290 ==> so you should be able to use it on your RDD[Sale]
bulkDelete expects a function transforming your Sale object to HBase's Delete object (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html)
Delete object represents a row to delete. You can get an example of Delete object initialization here: https://www.tutorialspoint.com/hbase/hbase_delete_data.htm
depending on what and how you want to remove rows, you should convert the relevant parts of your Sale into bytes. For instance, if you want to remove data by row key, you should extract it and put it into the Delete object (see the sketch at the end of this answer)
In my understanding, the bulkDelete method accumulates batchSize Delete objects and sends them to HBase at once. Otherwise, could you please show some code to give a more concrete idea of what you're trying to do?
Doing val salesList: List[Sale] = salesObjects.collect().toList is not a good idea since it brings all data into your driver. Potentially it can lead to OOM problems.
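A minimal sketch of the above, assuming Sale exposes a field (here called saleId, a hypothetical name) whose value is the HBase row key:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.util.Bytes

// build one Delete per Sale directly from the RDD, no collect() required
hc.bulkDelete[Sale](salesObjects, TableName.valueOf("salesInfo"),
  sale => new Delete(Bytes.toBytes(sale.saleId)), 4)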