How to obtain the output of the pipeline and perform read & write to Cloud Firestore - google-cloud-firestore

I am using Apache Beam to take logs from Pub/Sub which contain pageview traffic information. Each page has a unique ID, and when a pageview log comes in from Pub/Sub, Cloud Dataflow collects them in fixed windows and counts them. At the end of the combiner, we get something like this:
12345, 2
12456, 1
15213, 1
...
As far as I know, ParDo is the Beam transform for generic parallel processing. After the combine, I wish to implement a transform that queries Cloud Firestore for the existing pageview ID, takes the current view count, adds the new count to it, and writes the updated view count back, one by one for each element of the combined output shown above. Any suggestions?
Below is my code so far for UpdateViewCount. When I get the query result, it seems wrong to need a for loop just to read it (it will only ever be one row, since the pageview ID is unique):
class UpdateIntoFireStore(beam.DoFn):
    def process(self, element):
        listingid, count = element
        doc_ref = db.collection('listings').where('listingid', u'==', '12345')
        try:
            docs = doc_ref.get()
            for doc in docs:
                print(doc)
        except NotFound:
            print(u'No such document!')

I solved it. There is no need for a loop to retrieve the data; instead I should look up the particular ID by document name:
class UpdateIntoFireStore(beam.DoFn):
    def process(self, element):
        listingid, count = element
        doc_ref = db.collection(u'listings').document(listingid)
        try:
            doc = doc_ref.get()
            doc_dict = doc.to_dict()
            self.cur_count = doc_dict[u'count']
            doc_ref.update({
                u'count': self.cur_count + count
            })
        except NotFound:
            doc_ref.set({'count': count})

Related

Remove rows from search expression solr

I'm trying to search my large dataset for the items whose attribute matches the function given below, but I'm facing a problem here.
The rows parameter only selects the first 300 objects and the function then filters the matching results, but I want to search the whole index, not just the first few. How can I rewrite this to achieve that?
having(
  select(search(myIndex, q="*:*", fl="*", rows=300),
    id,
    dotProduct(ATTRIBUTE, array(4,5,2)) as prod,
    l1norm(array(1,2,3)) as a,
    l1norm(ATTRIBUTE) as b,
    div(prod, add(a, sub(b, prod))) as c
  ),
  and(gteq(c, 5), lteq(c, 8)))
The simplest fix would be to increase rows to cover the number of entries in the index.
However, if this number is huge, you should probably use the /export request handler instead of a regular select-like handler:
The /export request handler allows a fully sorted result set to be streamed out of Solr using a special rank query parser and response writer. These have been specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Depending on your needs, you could also run multiple queries and page through the results using the start and rows parameters, or, if the number of entries is not known to the client code, use cursorMark.
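If you go the cursorMark route from a Java client, a minimal SolrJ sketch could look like the following. The Solr URL, the collection name myIndex, the id uniqueKey and the page size of 500 are assumptions here, not taken from the question; the deep-paging contract itself (start from cursorMark=*, sort on the uniqueKey, and loop until nextCursorMark stops changing) is standard Solr behaviour.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkExample {
    public static void main(String[] args) throws Exception {
        // Assumed Solr location and collection name.
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/myIndex").build();

        SolrQuery query = new SolrQuery("*:*");
        query.setFields("*");
        query.setRows(500);                             // page size, tune as needed
        query.setSort(SolrQuery.SortClause.asc("id"));  // cursorMark requires a sort on the uniqueKey

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = client.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                // Apply the dotProduct/l1norm style scoring client-side here,
                // or keep it server-side by moving the expression to /export.
                System.out.println(doc.getFieldValue("id"));
            }
            String nextCursorMark = rsp.getNextCursorMark();
            done = cursorMark.equals(nextCursorMark);   // no more pages
            cursorMark = nextCursorMark;
        }
        client.close();
    }
}

If you prefer to stay with streaming expressions, the search() source can also be pointed at the /export handler (qt="/export"), which expects an explicit fl and sort instead of a rows limit.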

Is it possible to Group By a field extracted from JSON input in a Siddhi Query?

I currently have a stream with one string attribute that contains a JSON event.
This stream receives different events, to which I want to apply JSON path expressions so I can use the extracted attributes in filters and functions.
JsonPath extractors work like a charm in filters and selectors; unfortunately, I am not able to use them in the 'group by' part.
I am actually doing this in an embedded Siddhi app with the siddhi-execution-json extension added manually, but for the discussion, so everybody can easily check and test it, I will paste an example app that works on WSO2 Stream Processor.
The objective looks like the following app:
#App:name("Group_by_json_attribute")
define stream JsonStream(json string);
#sink(type='log')
define stream LogStream(myField string, count long);
#info(name='query1')
from JsonStream#window.time(10 sec)
select json:getString(json, '$.myField') as myField, count() as count
group by myField having count > 1
insert into LogStream;
and it can accept the following events:
{"myField": "my_value"}
However, this query will raise the error:
Cannot find attribute type as 'myField' does not exist in 'JsonStream'; define stream JsonStream(json string)
I have also tried to use the JSON extractor directly in the 'group by':
group by json:getString(json, '$.myField') as myField having count > 1
However the error now is:
mismatched input ':' expecting {',', ORDER, LIMIT, OFFSET, HAVING, INSERT, DELETE, UPDATE, RETURN, OUTPUT}
which suggests that an extension function is not expected here.
I am just wondering whether it is possible to group by attributes that are not directly defined in the input stream. In this case it is a field extracted from a JSON object, but it could be any other function that generates another attribute.
I am using the following versions from the Maven Central repository:
Siddhi: io.siddhi:siddhi-core:5.0.1
siddhi-execution-json: io.siddhi.extension.execution.json:siddhi-execution-json:2.0.1
(Edit) Clarification
The objective is to use attributes that are not directly defined in the stream in the group by.
The reason is that I currently have an embedded app which defines the whole set of input streams coming from external sources formatted as JSON, and there is also a set of output streams to inform external components when a query matches.
This app allows users to create custom queries on this set of predefined streams, but they are not able to create streams of their own.
Many thanks!
It seems the group by fields are expected to come from the query's input stream, in this case JsonStream. Use another query before this one to do the extraction, and do the aggregation and filtering in the following query:
@App:name("Group_by_json_attribute")
define stream JsonStream(json string);

@sink(type='log')
define stream LogStream(myField string, count long);

@info(name='extract_stream')
from JsonStream
select json:getString(json, '$.myField') as myField
insert into ExtractedStream;

@info(name='query1')
from ExtractedStream#window.time(10 sec)
select myField, count() as count
group by myField
having count > 1
insert into LogStream;
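Since the question mentions an embedded Siddhi app, here is a minimal, hedged sketch of how the two-query app above could be driven from Java with siddhi-core 5.0.x and siddhi-execution-json on the classpath. The @sink(type='log') annotation is replaced by a programmatic StreamCallback, and the class name, sample payloads and the one-second sleep are purely illustrative.

import io.siddhi.core.SiddhiAppRuntime;
import io.siddhi.core.SiddhiManager;
import io.siddhi.core.event.Event;
import io.siddhi.core.stream.input.InputHandler;
import io.siddhi.core.stream.output.StreamCallback;

public class GroupByJsonAttribute {
    public static void main(String[] args) throws Exception {
        String app =
            "@App:name(\"Group_by_json_attribute\") " +
            "define stream JsonStream(json string); " +
            "define stream LogStream(myField string, count long); " +
            "@info(name='extract_stream') " +
            "from JsonStream " +
            "select json:getString(json, '$.myField') as myField " +
            "insert into ExtractedStream; " +
            "@info(name='query1') " +
            "from ExtractedStream#window.time(10 sec) " +
            "select myField, count() as count " +
            "group by myField having count > 1 " +
            "insert into LogStream;";

        SiddhiManager siddhiManager = new SiddhiManager();
        SiddhiAppRuntime runtime = siddhiManager.createSiddhiAppRuntime(app);

        // Print every aggregated result that reaches LogStream.
        runtime.addCallback("LogStream", new StreamCallback() {
            @Override
            public void receive(Event[] events) {
                for (Event e : events) {
                    System.out.println(e);
                }
            }
        });

        InputHandler input = runtime.getInputHandler("JsonStream");
        runtime.start();
        input.send(new Object[]{"{\"myField\": \"my_value\"}"});
        input.send(new Object[]{"{\"myField\": \"my_value\"}"});

        Thread.sleep(1000);   // give the window time to emit
        runtime.shutdown();
        siddhiManager.shutdown();
    }
}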

Using NEsper to read LogFiles for reporting purposes

We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change to a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
...and many more fields. As you can imagine, the log files are going to grow big.
Because the data is sent by different customers and imported into the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context information. The first one should just count the changes per order in general:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result over all events (probably this is not possible and that is my mistake).
So after reading the CSV (csvInputAdapter.Start() returns) I read all events, which are stored in the statement's NewEvents stream.
With 10 entries in the CSV file everything works fine. With 1 million lines it takes way too long. I tried it without any EPL statement (so just the CSV import) - it took about 5 seconds. With the first statement (not the complex pattern statement) I always stopped after 20 minutes - so I am not sure how long it would take.
Then I changed the EPL of the first statement: I introduced a group by instead of the context.
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents Stream after the CSVInputAdapter comes back...
My questions:
Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2) this is valid, your EPL design is probably a little inefficient. You would want to understand how patterns work, by using filter indexes and index entries, which are more expensive to create but are extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3) check your code. Don't use a CSV file for large amounts of data. Make sure a listener is attached.

Concise way to filter on two child attributes in ArangoDB (AQL / Spring Data ArangoDB)

In ArangoDB I have documents in a trip collection which are related to documents in a driver collection via edges in a tripToDriver collection, and the trip documents are also related to documents in a departure collection via edges in a departureToTrip collection.
To fetch trips where their driver has a given idNumber and their associated departure has a startTime after a supplied date/time, I've successfully written the following AQL:
FOR doc IN trip
  LET drivers = (FOR v IN 1..1 OUTBOUND doc tripToDriver RETURN v)
  LET departures = (FOR v IN 1..1 INBOUND doc departureToTrip RETURN v)
  FILTER drivers[0].idNumber == '999999-9999' AND departures[0].startTime >= '2018-07-30'
  RETURN doc
But I wonder if there is a more concise / elegant way to achieve the same results?
A related question, since I'm using Spring Data ArangoDB: Is it possible to achieve this result with derived queries?
For a single relation I was able to create a query like:
Iterable<Trip> findTripsByDriversIdNumber(String driverId);
but I haven't had luck incorporating the departure relation into this signature (maybe because it's inbound?).
First of all, your query only works if you have exactly one connected driver/departure for every trip. You're fetching all linked drivers but only checking the first one found.
If this matches your model it is totally OK, but I would recommend doing the idNumber/startTime check within the sub-queries. Then, because we only need to know that at least one driver/departure fits our filter condition, we add a LIMIT 1 to the sub-query and return only true. This is all we need for the FILTER in our main query.
FOR doc IN trip
  FILTER (FOR v IN 1..1 OUTBOUND doc tripToDriver FILTER v.idNumber == @idNumber LIMIT 1 RETURN true)[0]
  FILTER (FOR v IN 1..1 INBOUND doc departureToTrip FILTER v.startTime >= @startTime LIMIT 1 RETURN true)[0]
  RETURN doc
I tried to solve your case with a derived query. It would work if there wasn't a bug in the current arangodb-spring-data release. The bug is already fixed, but not yet released. You can already test it using a snapshot version of arangodb-spring-data (1.3.1-SNAPSHOT or 2.3.1-SNAPSHOT, depending on your Spring Data version; see supported versions).
The following derived query method worked for me with the snapshot version.
Iterable<Trip> findByDriversIdNumberAndDeparturesStartTimeGreaterThanEqual(String idNumber, LocalDate startTime);
To make the derived query work you need the following annotated fields in your class Trip:
@Relations(edges = TripToDriver.class, direction = Direction.OUTBOUND, maxDepth = 1)
private Collection<Driver> drivers;

@Relations(edges = DepartureToTrip.class, direction = Direction.INBOUND, maxDepth = 1)
private Collection<Departure> departures;
I also created a working example project on GitHub.

Iterating over PTable in crunch

I have the following PTables:
PTable<String, String> somePTable1 = somePCollection1.parallelDo(new SomeClass(),
        Writables.tableOf(Writables.strings(), Writables.strings()));
PTable<String, Collection<String>> somePTable2 = somePTable1.collectValues();
For somePTable2 described above, I want to create a new file for every record in somePTable2. Is there any way to iterate over somePTable2 so that I can access each record? I know I can apply a DoFn to somePTable2, but is it possible to apply a pipeline.write() operation inside a DoFn?
Try this to store your list as-is:
somePTable2.values().write()
If you want to generate one record for each element of the collection inside your PTable, you will need to apply a DoFn and emit one record per element of the collection before writing it, as sketched below.
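A minimal sketch of that approach, assuming the pipeline and somePTable2 from the question are in scope. The class and method names, the tab-separated line format and the output path are illustrative assumptions, not part of the original answer.

import java.util.Collection;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.types.writable.Writables;

public class FlattenAndWrite {

    // Emits one "key<TAB>element" line per element of each collection in the PTable,
    // then writes the flattened result once through the pipeline.
    public static void writeOnePerElement(Pipeline pipeline,
                                          PTable<String, Collection<String>> somePTable2,
                                          String outputPath) {
        PCollection<String> lines = somePTable2.parallelDo(
                new DoFn<Pair<String, Collection<String>>, String>() {
                    @Override
                    public void process(Pair<String, Collection<String>> input,
                                        Emitter<String> emitter) {
                        for (String value : input.second()) {
                            emitter.emit(input.first() + "\t" + value);
                        }
                    }
                },
                Writables.strings());

        // The write itself stays on the pipeline, outside the DoFn.
        pipeline.writeTextFile(lines, outputPath);
    }
}

Note that Crunch targets generally produce a directory of part files rather than one physical file per record; if you really need a separate file per record, that splitting would have to happen in a custom output format or a downstream step.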