How can I avoid a CoRB timeout when running a large batch data pull of over 10 million PDF/XML documents? Do I need to reduce the thread count and batch size?
uris-module:
let $uris := cts:uris(
  (),
  (),
  cts:and-query((
    cts:collection-query("/sites"),
    cts:field-range-query("cdate", "<", "2019-10-01"),
    cts:not-query(
      cts:or-query((
        cts:field-word-query("dcax", "200"),
        more code...,
      ))
    )
  ))
)
return (fn:count($uris), $uris)
process.xqy:
declare variable $URI as xs:string external;
let $uris := fn:tokenize($URI,";")
let $outputJson := "/output/json/"
let $outputPdf := "/output/pdf/"
for $uri1 in $uris
let $accStr := fn:substring-before(fn:substring-after($uri1,"/sites/"),".xml")
let $pdfUri := fn:concat("/pdf/iadb/",$accStr,".pdf")
let $doc := fn:doc($uri1)
let $obj := json:object()
let $_ := map:put($obj,"PaginationOrMediaCount",fn:number($doc/rec/MediaCount))
let $_ := map:put($obj,"Abstract",fn:replace($doc/rec/Abstract/text(),"[^a-zA-Z0-9 ,.\-\r\n]",""))
let $_ := map:put($obj,"Descriptors",json:to-array($doc/rec/Descriptor/text()))
let $_ := map:put($obj,"FullText",fn:replace($doc/rec/FullText/text(),"[^a-zA-Z0-9 ,.\-\r\n]",""))
let $_ := xdmp:save(
    fn:concat($outputJson, $accStr, ".json"),
    xdmp:to-json($obj)
  )
let $_ :=
  if (fn:doc-available($pdfUri))
  then xdmp:save(
      fn:concat($outputPdf, $accStr, ".pdf"),
      fn:doc($pdfUri)
    )
  else ()
return $URI
It would be easier to diagnose and suggest improvements if you shared the CoRB job options and the code for your URIS-MODULE and PROCESS-MODULE.
The general concept of a CoRB job is that it splits the work up into multiple module executions rather than trying to do all of the work in a single execution, in order to avoid timeout issues and excessive memory consumption.
For instance, if you wanted to download 10 million documents, the URIS-MODULE would select the URIs of all of those documents, and then each URI would be sent to the PROCESS-MODULE, which would be responsible for retrieving it. Depending upon the THREAD-COUNT, you could be downloading several documents at a time but they should all be returning very quickly.
Is the execution of the URIs module what is timing out, or the process module?
You can increase the request timeout from the default limit up to the maximum timeout limit by calling xdmp:request-set-time-limit().
Generally, the process module should execute quickly and shouldn't be timing out. One possible reason would be performing too much work in the transform (e.g. setting BATCH-SIZE very large and doing too much at once), or a misconfiguration or poorly written query (e.g. performing a search and retrieving all of the docs each time the process module is executed, rather than fetching the single doc identified by the $URI value).
Related
I want a few jobs executed every day at specific times.
The first job should acquire data from the database and store it in a global variable.
The second job should run a few minutes after the first and use the data that the first job stored in that global variable.
import psycopg2
import schedule

global dataacq
dataacq = None

def condb():
    global check
    global dataacq
    conn = psycopg2.connect(#someinformation)
    cursor = conn.cursor()
    query = "SELECT conversation_id FROM tablename"
    cursor.execute(query)
    dataacq = cursor.fetchall()
    print(dataacq)
    cursor.close()
    conn.close()
    check = True
    print(check)
    return dataacq

def printresult(result):
    print(result)

schedule.every().day.at("08:59").do(condb)
schedule.every().day.at("09:00").do(printresult, dataacq)
Above is part of the code I am using for testing. The problem is that when the printresult function is called, it displays None as output. But if I execute all the functions directly, without any scheduling, it works and displays what I need. So why is this happening?
I have a query built using the "labix.org/v2/mgo" library:
err = getCollection.Find(bson.M{}).Sort("department").Distinct("department", &listedDepartment)
This is working fine. But now I'm moving to the official Go MongoDB driver, "go.mongodb.org/mongo-driver/mongo", and I want to run the same query with that library, but there is no direct way to chain Find, then Sort, then Distinct. How can I achieve this with the mongo-driver? The variable listedDepartment is of type []string. Please suggest a solution.
You may use Collection.Distinct() but it does not yet support sorting:
// Obtain collection:
c := client.Database("dbname").Collection("collname")
ctx := context.Background()
results, err := c.Distinct(ctx, "department", bson.M{})
It returns a value of type []interface{}. If you know it contains string values, you may use a loop and type assertions to obtain the string values like this:
listedDepartment = make([]string, len(results))
for i, v := range results {
listedDepartment[i] = v.(string)
}
And if you need it sorted, simply sort the slice:
sort.Strings(listedDepartment)
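Putting the pieces together, here is a minimal, self-contained sketch; the connection URI, database name, and collection name below are placeholders to adapt to your setup:
package main

import (
    "context"
    "fmt"
    "sort"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    // Connect to MongoDB (placeholder URI).
    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        panic(err)
    }
    defer client.Disconnect(ctx)

    c := client.Database("dbname").Collection("collname")

    // Distinct returns the values as []interface{}.
    results, err := c.Distinct(ctx, "department", bson.M{})
    if err != nil {
        panic(err)
    }

    // Convert to []string with type assertions, then sort client-side.
    listedDepartment := make([]string, len(results))
    for i, v := range results {
        listedDepartment[i] = v.(string)
    }
    sort.Strings(listedDepartment)

    fmt.Println(listedDepartment)
}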
I'm trying to run a Pipe that doesn't return any results, because the last pipeline stage is $out.
// { $out: "y" }
pipeline := DB.C("x").Pipe(stages).AllowDiskUse()
result := []bson.M{}
err := pipeline.All(&result)
When running the pipe, I'm getting a timeout. I assume mgo is waiting forever for results to be read.
Solved. Instead of calling All(&result), call Iter().
All would call Next on an iterator that is empty from the beginning, obviously leading to the timeout.
Iter returns an iterator, that will just get discarded. No calls to Next, no timeouts.
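For illustration, here is a minimal sketch of that fix; the import path, dial string, database/collection names, and the pipeline stages are assumptions standing in for the code in the question:
package main

import (
    mgo "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

func main() {
    session, err := mgo.Dial("localhost") // placeholder connection
    if err != nil {
        panic(err)
    }
    defer session.Close()

    // Hypothetical pipeline whose last stage is $out, as in the question.
    stages := []bson.M{
        {"$match": bson.M{}},
        {"$out": "y"},
    }

    // Iter() runs the aggregation; because $out produces no output documents,
    // the iterator is simply closed instead of calling All() or Next().
    iter := session.DB("dbname").C("x").Pipe(stages).AllowDiskUse().Iter()
    if err := iter.Close(); err != nil {
        panic(err)
    }
}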
I am writing a small script to receive SNMP traps with PySnmp.
I am able to get the OID = value pairs, but the value is very long, with the small piece of information I need at the end. How can I access only the OctetString that comes at the end of the value? Is there a way other than string manipulation? Please comment.
OID =_BindValue(componentType=NamedTypes(NamedType('value', ObjectSyntax------------------------------------------------(DELETED)-----------------(None, OctetString(b'New Alarm'))))
Is it possible to get the output like the following, as is available from another SNMP client:
.iso.org.dod.internet.private.enterprises.xxxx.1.1.2.2.14: CM_DAS Alarm Traps:
Edit: the code is:
for oid, val in varBinds:
    print('%s = %s' % (oid.prettyPrint(), val.prettyPrint()))
    target.write(str(val))
On screen the output is short, but in the file, val is very long.
Using target.write(str(val[0][1][2])) does not work for all varbinds (the program stops with an error), but it works fine for the first OID (the time ticks).
How can I get the value from the tail, since that is where the actual value is for all OIDs?
Thanks.
SNMP transfers information in the form of a sequence of OID-value pairs called variable-bindings:
variable_bindings = [[oid1, value1], [oid2, value2], ...]
Once you get the variable-bindings sequence from the SNMP PDU, to access value1, for example, you might do:
variable_binding1 = variable_bindings[0]
value1 = variable_binding1[1]
To access the tail part of value1 (assuming it's a string), you could simply slice it:
tail_of_value1 = value1[-10:]
I guess in your question you operate on a single variable_binding, not a sequence of them.
If you want pysnmp to translate oid-value pair into a human-friendly representation (of MIB object name, MIB object value), you'd have to pass original OID-value pair to the ObjectType class and run it through MIB resolver as explained in the documentation.
Thanks... the following code works somewhat like what I was looking for:
if str(oid) == "1.3.6.1.2.1.1.3.0":
    target.write(" = str(val[0][1]['timeticks-value']) = " + str(val[0][1]['timeticks-value']))  # time ticks
else:
    target.write("= val[0][0]['string-value']= " + str(val[0][0]['string-value']))
I'm trying to read a collection dump generated by mongodump. The file is a few gigabytes so I want to read it incrementally.
I can read the first object with something like this:
buf := make([]byte, 100000)
f, _ := os.Open(path)
f.Read(buf)
var m bson.M
bson.Unmarshal(buf, &m)
However I don't know how much of the buf was consumed, so I don't know how to read the next one.
Is this possible with mgo?
Using mgo's bson.Unmarshal() alone is not enough -- that function is designed to take a []byte representing a single document, and unmarshal it into a value.
You will need a function that can read the next whole document from the dump file, then you can pass the result to bson.Unmarshal().
Comparing this to encoding/json or encoding/gob, it would be convenient if mgo.bson had a Reader type that consumed documents from an io.Reader.
Anyway, from the source for mongodump, it looks like the dump file is just a series of bson documents, with no file header/footer or explicit record separators.
BSONTool::processFile shows how mongorestore reads the dump file. Their code reads 4 bytes to determine the length of the document, then uses that size to read the rest of the document. Confirmed that the size prefix is part of the bson spec.
Here is a playground example that shows how this could be done in Go: read the length field, read the rest of the document, unmarshal, repeat.
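In case the playground link is unavailable, here is a minimal sketch of that loop; it assumes mgo's bson package and a dump file named dump.bson:
package main

import (
    "encoding/binary"
    "fmt"
    "io"
    "os"

    "gopkg.in/mgo.v2/bson"
)

func main() {
    f, err := os.Open("dump.bson") // placeholder path to the mongodump output
    if err != nil {
        panic(err)
    }
    defer f.Close()

    sizeBuf := make([]byte, 4)
    for {
        // Each BSON document starts with its total length as a little-endian int32
        // (the length includes these 4 bytes).
        if _, err := io.ReadFull(f, sizeBuf); err == io.EOF {
            break // no more documents
        } else if err != nil {
            panic(err)
        }
        size := binary.LittleEndian.Uint32(sizeBuf)

        // Read the rest of the document into a buffer that also keeps the prefix.
        docBuf := make([]byte, size)
        copy(docBuf, sizeBuf)
        if _, err := io.ReadFull(f, docBuf[4:]); err != nil {
            panic(err)
        }

        // Unmarshal this single document, then repeat for the next one.
        var m bson.M
        if err := bson.Unmarshal(docBuf, &m); err != nil {
            panic(err)
        }
        fmt.Println(m)
    }
}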
The method File.Read returns the number of bytes read.
File.Read
Read reads up to len(b) bytes from the File. It returns the number of bytes read and an error, if any. EOF is signaled by a zero count with err set to io.EOF.
So you can get the number of bytes read by simply storing the return values of your read:
n, err := f.Read(buf)
I managed to solve it with the following code:
for len(buf) > 0 {
    var r bson.Raw
    var m userObject
    bson.Unmarshal(buf, &r)
    r.Unmarshal(&m)
    fmt.Println(m)
    buf = buf[len(r.Data):]
}
Niks Keets' answer did not work for me. Somehow len(r.Data) was always the whole buffer length, so I came up with this other code:
for len(buff) > 0 {
    messageSize := binary.LittleEndian.Uint32(buff)
    err = bson.Unmarshal(buff, &myObject)
    if err != nil {
        panic(err)
    }
    // Do your stuff
    buff = buff[messageSize:]
}
Of course, you have to handle truncated structs at the end of the buffer. In my case I could load the whole file into memory.