Read mongodump output with go and mgo - mongodb

I'm trying to read a collection dump generated by mongodump. The file is a few gigabytes so I want to read it incrementally.
I can read the first object with something like this:
buf := make([]byte, 100000)
f, _ := os.Open(path)
f.Read(buf)
var m bson.M
bson.Unmarshal(buf, &m)
However I don't know how much of the buf was consumed, so I don't know how to read the next one.
Is this possible with mgo?

Using mgo's bson.Unmarshal() alone is not enough -- that function is designed to take a []byte representing a single document, and unmarshal it into a value.
You will need a function that can read the next whole document from the dump file, then you can pass the result to bson.Unmarshal().
Compared with encoding/json or encoding/gob, it would be convenient if mgo's bson package had a Reader type that consumed documents from an io.Reader.
Anyway, from the source for mongodump, it looks like the dump file is just a series of BSON documents, with no file header/footer or explicit record separators.
BSONTool::processFile shows how mongorestore reads the dump file: it reads 4 bytes to determine the length of the document, then uses that size to read the rest of the document. The size prefix is indeed part of the BSON spec: every document begins with its total length as a little-endian int32, and that length includes the 4 prefix bytes themselves.
Here is a playground example that shows how this could be done in Go: read the length field, read the rest of the document, unmarshal, repeat.
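In the same spirit, a minimal sketch of that loop using mgo's bson package (the file name and panic-style error handling are placeholders for illustration):

package main

import (
    "encoding/binary"
    "fmt"
    "io"
    "os"

    "gopkg.in/mgo.v2/bson"
)

func main() {
    f, err := os.Open("dump.bson") // placeholder path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    sizeBuf := make([]byte, 4)
    for {
        // Each BSON document starts with its total length as a little-endian int32.
        if _, err := io.ReadFull(f, sizeBuf); err == io.EOF {
            break // clean end of the dump
        } else if err != nil {
            panic(err)
        }
        size := binary.LittleEndian.Uint32(sizeBuf)

        // The length includes the 4 prefix bytes, so read size-4 more.
        doc := make([]byte, size)
        copy(doc, sizeBuf)
        if _, err := io.ReadFull(f, doc[4:]); err != nil {
            panic(err)
        }

        var m bson.M
        if err := bson.Unmarshal(doc, &m); err != nil {
            panic(err)
        }
        fmt.Println(m)
    }
}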

The method File.Read returns the number of bytes read. From its documentation:
Read reads up to len(b) bytes from the File. It returns the number of bytes read and an error, if any. EOF is signaled by a zero count with err set to io.EOF.
So you can get the number of bytes read by simply storing the return parameters of your read:
n, err := f.Read(buf)
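Note that a single Read may return fewer bytes than len(buf). If you need the buffer filled before parsing, a small sketch using io.ReadFull (from the io package):
n, err := io.ReadFull(f, buf)
if err == io.ErrUnexpectedEOF {
    // fewer than len(buf) bytes were left in the file; buf[:n] holds them
} else if err != nil {
    // handle other errors, including io.EOF when nothing was read
}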

I managed to solve it with the following code:
for len(buf) > 0 {
    var r bson.Raw
    var m userObject
    bson.Unmarshal(buf, &r)
    r.Unmarshal(&m)
    fmt.Println(m)
    buf = buf[len(r.Data):]
}

Niks Keets' answer did not work for me. Somehow len(r.Data) was always the whole buffer length. So I came up with this other code:
for len(buff) > 0 {
    messageSize := binary.LittleEndian.Uint32(buff)
    err = bson.Unmarshal(buff, &myObject)
    if err != nil {
        panic(err)
    }
    // Do your stuff
    buff = buff[messageSize:]
}
Of course you have to handle truncated structs at the end of the buffer. In my case I could load the whole file into memory.
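If you cannot load the whole file, a sketch of that truncation check (assuming the same buff, myObject, and imports as above) stops when the buffer no longer holds a complete document:
for len(buff) >= 4 {
    messageSize := binary.LittleEndian.Uint32(buff)
    if int(messageSize) > len(buff) {
        break // truncated document: read more data into buff before continuing
    }
    if err := bson.Unmarshal(buff[:messageSize], &myObject); err != nil {
        panic(err)
    }
    // Do your stuff
    buff = buff[messageSize:]
}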

Related

How to perform Find, Distinct & Sort all together using go mongo driver

I have a command written using the "labix.org/v2/mgo" library:
err = getCollection.Find(bson.M{}).Sort("department").Distinct("department", &listedDepartment)
This works fine. But now I'm moving to the official Go driver "go.mongodb.org/mongo-driver/mongo" and I want to run this command with that library, but there is no direct way to chain Find, then Sort, then Distinct. How can I achieve this using the mongo-driver? The variable listedDepartment is of type []string. Please suggest a solution.
You may use Collection.Distinct() but it does not yet support sorting:
// Obtain collection:
c := client.Database("dbname").Collection("collname")
ctx := context.Background()
results, err := c.Distinct(ctx, "department", bson.M{})
It returns a value of type []interface{}. If you know it contains string values, you may use a loop and type assertions to obtain the string values like this:
listedDepartment = make([]string, len(results))
for i, v := range results {
    listedDepartment[i] = v.(string)
}
And if you need it sorted, simply sort the slice:
sort.Strings(listedDepartment)
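If the results might contain non-string values, a defensive sketch of the same loop uses the comma-ok form of the type assertion so an unexpected value cannot panic:
listedDepartment = listedDepartment[:0]
for _, v := range results {
    if s, ok := v.(string); ok {
        listedDepartment = append(listedDepartment, s)
    }
}
sort.Strings(listedDepartment)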

How to mimic the function map.getOrElse with a CSV file

I have a CSV file that represents a Map[String,Int], and I am reading the file as follows:
def convI2N(vkey: Int): String = {
  val in = new Scanner("dictionaryNV.csv")
  loop.breakable {
    while (in.hasNext) {
      val nodekey = in.next(',')
      val value = in.next('\n')
      if (value == vkey.toString) {
        n = nodekey
        loop.break()
      }
    }
  }
  in.close
  n
}
The function gives the String for a given Int. The problem is that I must scan the whole file, and the file is too big, so the procedure is too slow. Someone told me that this is O(n) time and recommended getting it down to O(log n). I suppose that the function map.getOrElse is O(log n).
Can someone help me find a way to get better performance out of this code?
As an additional comment, the dictionaryNV file is sorted by the Int values.
Maybe I can divide the file by lines, or sets of lines. The CSV has about 167000 (String, Int) tuples.
Or, put another way: how do you do some kind of binary search through a CSV in Scala?
If you are calling the convI2N function many times, the job will definitely be slow, because each call has to scan the big file. If the function is called many times, it is recommended to load the file once into a hashmap or a collection of Tuple2 and look values up there, rather than re-reading the file.
You can try the following approach, which should be faster than the Scanner-based one. Assuming that your CSV file is comma-separated, like:
key1,value1
key2,value2
using Source.fromFile can be your solution:
def convI2N(vkey: Int): String = {
  var n = "not found"
  val filtered = Source.fromFile("<your path to dictionaryNV.csv>")
    .getLines()
    .map(line => line.split(","))
    .filter(sline => sline(1).equalsIgnoreCase(vkey.toString)) // column 1 holds the Int value
  for (str <- filtered) {
    n = str(0) // column 0 holds the String key to return
  }
  n
}

reusing read buffers when working with sockets

I'd like to know the proper way to reuse a []byte buffer in Go. I declare it like this:
buf := make([]byte, 1024)
and then use it like this:
conn, _ := net.Dial("tcp", addr)
_, err = conn.Read(buf)
I heard that declaring a new buffer isn't efficient, since it involves memory allocations, and that we should reuse existing buffers instead. However, I am not sure whether I can just pass the buffer again and it will be wiped, or whether it can still hold parts of previous messages (especially if the current message from the socket is shorter than the previous one)?
The Read method reads up to len(buf) bytes into the buffer and returns the number of bytes read.
The Read method does not modify the length of the caller's slice; it cannot, because the slice header is passed by value. The application must use the returned length to get a slice of the bytes actually read.
n, err = conn.Read(buf)
bufRead := buf[:n]
The application can call Read multiple times using the same buffer.
conn, err := net.Dial("tcp", addr)
if err != nil {
    // handle error
}
buf := make([]byte, 1024)
for {
    n, err := conn.Read(buf)
    if err != nil {
        // handle error
    }
    fmt.Printf("Read %s\n", buf[:n]) // buf[:n] is the slice of bytes read from conn
}
In practice you rarely call io.Reader.Read() directly; instead you pass the reader down to whatever code expects an io.Reader.
The buffer will not be wiped; you must do that by hand. Or, if you want buffering, you can use bufio:
conn, _ := net.Dial("tcp", addr)
r := bufio.NewReader(conn)
You can then, for example, pipe the buffered data on for further processing:
r.WriteTo(w) // w is any io.Writer
and reset the reader to reuse it with a new connection:
r.Reset(newConn)
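Putting that together, a minimal sketch of the pattern (addr and newConn are placeholders for your address and a later connection):
conn, err := net.Dial("tcp", addr)
if err != nil {
    // handle error
}
r := bufio.NewReader(conn)

// Drain the connection into any io.Writer, here os.Stdout.
if _, err := r.WriteTo(os.Stdout); err != nil {
    // handle error
}

// Reuse the same bufio.Reader, and its internal buffer, for another connection.
r.Reset(newConn)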
From the io package documentation:
type Reader interface {
    Read(p []byte) (n int, err error)
}
Reader is the interface that wraps the basic Read method.
Read reads up to len(p) bytes into p. It returns the number of bytes
read (0 <= n <= len(p)) and any error encountered. Even if Read
returns n < len(p), it may use all of p as scratch space during the
call. If some data is available but not len(p) bytes, Read
conventionally returns what is available instead of waiting for more.
When Read encounters an error or end-of-file condition after
successfully reading n > 0 bytes, it returns the number of bytes read.
It may return the (non-nil) error from the same call or return the
error (and n == 0) from a subsequent call. An instance of this general
case is that a Reader returning a non-zero number of bytes at the end
of the input stream may return either err == EOF or err == nil. The
next Read should return 0, EOF.
Callers should always process the n > 0 bytes returned before
considering the error err. Doing so correctly handles I/O errors that
happen after reading some bytes and also both of the allowed EOF
behaviors.
Implementations of Read are discouraged from returning a zero byte
count with a nil error, except when len(p) == 0. Callers should treat
a return of 0 and nil as indicating that nothing happened; in
particular it does not indicate EOF.
Implementations must not retain p.
Read may use all of the buffer as scratch space during the call.
For example,
buf := make([]byte, 4096)
for {
    n, err := r.Read(buf[:cap(buf)])
    buf = buf[:n]
    if err != nil {
        // handle error
    }
    // process buf
}

Golang md5 Sum() function

package main

import (
    "crypto/md5"
    "fmt"
)

func main() {
    hash := md5.New()
    b := []byte("test")
    fmt.Printf("%x\n", hash.Sum(b))
    hash.Write(b)
    fmt.Printf("%x\n", hash.Sum(nil))
}
Output:
*md5.digest74657374d41d8cd98f00b204e9800998ecf8427e
098f6bcd4621d373cade4e832627b4f6
Could someone please explain why/how I get different results for the two prints?
I'm building up on the already good answers. I'm not sure if Sum is actually the function you want. From the hash.Hash documentation:
// Sum appends the current hash to b and returns the resulting slice.
// It does not change the underlying hash state.
Sum(b []byte) []byte
This function has a dual use-case, which you seem to mix in an unfortunate way. The use-cases are:
Computing the hash of a single run
Chaining the output of several runs
In case you simply want to compute the hash of something, either use md5.Sum(data) or
digest := md5.New()
digest.Write(data)
hash := digest.Sum(nil)
This code will, according to the excerpt of the documentation above, append the checksum of data to nil, resulting in the checksum of data.
If you want to chain several blocks of hashes, the second use-case of hash.Sum, you can do it like this:
hashed := make([]byte, 0)
for hasData {
    digest.Write(data)
    hashed = digest.Sum(hashed)
}
This will append each iteration's hash to the already computed hashes. Probably not what you want.
So, now you should be able to see why your code is failing. If not, take this commented version of your code (On play):
hash := md5.New()
b := []byte("test")
fmt.Printf("%x\n", hash.Sum(b)) // gives 74657374<hash> (74657374 = "test")
fmt.Printf("%x\n", hash.Sum([]byte("AAA"))) // gives 414141<hash> (41 = 'A')
fmt.Printf("%x\n", hash.Sum(nil)) // gives <hash> as append(nil, hash) == hash
fmt.Printf("%x\n", hash.Sum(b)) // gives 74657374<hash> (74657374 = "test")
fmt.Printf("%x\n", hash.Sum([]byte("AAA"))) // gives 414141<hash> (41 = 'A')
hash.Write(b)
fmt.Printf("%x\n", hash.Sum(nil)) // gives a completely different hash since internal bytes changed due to Write()
You have two ways to actually get an MD5 sum of a byte slice:
func main() {
    hash := md5.New()
    b := []byte("test")
    hash.Write(b)
    fmt.Printf("way one : %x\n", hash.Sum(nil))
    fmt.Printf("way two : %x\n", md5.Sum(b))
}
According to http://golang.org/src/pkg/crypto/md5/md5.go#L88, your hash.Sum(b) is like calling append(b, actual-hash-of-an-empty-md5-hash).
The definition of Sum :
func (d0 *digest) Sum(in []byte) []byte {
    // Make a copy of d0 so that caller can keep writing and summing.
    d := *d0
    hash := d.checkSum()
    return append(in, hash[:]...)
}
When you call Sum(nil) it returns d.checkSum() directly as a byte slice, however if you call Sum([]byte) it appends d.checkSum() to your input.
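A quick sketch to check that behavior: with nothing written, hash.Sum(b) equals b with the MD5 of the empty message appended.

package main

import (
    "bytes"
    "crypto/md5"
    "fmt"
)

func main() {
    b := []byte("test")
    hash := md5.New()

    // md5.Sum(nil) is the digest of the empty message.
    empty := md5.Sum(nil)

    // With nothing written yet, hash.Sum(b) appends that digest to b.
    fmt.Println(bytes.Equal(hash.Sum(b), append(b, empty[:]...))) // true
}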
From the docs:
// Sum appends the current hash to b and returns the resulting slice.
// It does not change the underlying hash state.
Sum(b []byte) []byte
so "*74657374*d41d8cd98f00b204e9800998ecf8427e" is actually a hex representation of "test", plus the initial state of the hash.
fmt.Printf("%x", []byte{"test"})
will result in... "74657374"!
So basically hash.Sum(b) is not doing what you think it does. The second statement is the right hash.
I would like to get straight to the point:
why/how do I get different result for the two print ?
Answer:
hash := md5.New()
creates a new, empty hash instance. When you call hash.Sum(b), Sum appends the digest of the still-empty hash to b, hence the output 74657374d41d8cd98f00b204e9800998ecf8427e: the hex of "test" followed by the MD5 of the empty message.
In the next statement, hash.Write(b) writes b into the hash instance; calling hash.Sum(nil) then returns the MD5 of everything written so far, i.e. 098f6bcd4621d373cade4e832627b4f6, the MD5 of "test".
This is the reason you are getting these outputs.
For your reference, look at the Sum implementation:
func (d0 *digest) Sum(in []byte) []byte {
    // Make a copy of d0 so that caller can keep writing and summing.
    d := *d0
    hash := d.checkSum()
    return append(in, hash[:]...)
}

Result of an assignment in Java

I am looking at the code on page 11 here http://www.cs.usfca.edu/~parrt/doc/java/JavaIO-notes.pdf
I have trouble with one statement. I thought the result of an assignment was an lvalue. So ((byteRead = inFile.read()) != -1) should be the same as (inFile.read() != -1). This doesn't seem to be the case, though, looking at the output. So my question is: how is the statement ((byteRead = inFile.read()) != -1) parsed?
EDIT: It seems from the responses that I had the correct interpretation of the result of an assignment. I was wondering what goes wrong by replacing the code fragment
int byteRead;
while ((byteRead = inFile.read()) != -1)
    outFile.write(byteRead);
with
while (inFile.read() != -1)
    outFile.write(inFile.read());
So, now that you posted both versions of code, the answer is clear:
In your first version, each byte read is assigned to byteRead and then written to the output stream.
In the second version, you consume a byte with the read() but don't assign it to a variable. Then, you read another byte (the next one in the stream) which you write to the output stream.
So, if the input file is:
abcdefghijklmnopqrstuvwxyz
the output of the first version will be:
abcdefghijklmnopqrstuvwxyz
and the output of the second will be:
bdfhjlnprtvxz
((byteRead = inFile.read()) != -1) and (inFile.read() != -1) are, in one sense, equivalent boolean expressions. However, the first one has a side effect: It stores the result of inFile.read() in the variable byteRead.
The code example you referenced uses this for a compact while loop that reads one byte from input, writes it to output and keeps doing that until inFile.read() returns -1 (meaning end of file has been reached).