Is there something similar to journal for read operations in MongoDB?

I am currently developing a program that reads documents from MongoDB and writes them to a file, something like this:
for doc in db.col.find({"field": "bla"}):
    file.write(f"{doc}\n")
My problem is that something can go wrong during this process (it is going to take a week to do all the writes), for example a shutdown or a network problem. My question is: is there something similar to the journal for write operations that would let me recover from a checkpoint, so I don't need to redo all the writes to the file?

This program will do what you want. It writes the IDs to a separate log file. If everything runs fine, it will just work. If it fails, the log file ensures you restart after the last write. It will also work if the dataset is changing, as long as only inserts are done.
It needs pymongo and Python 3.6 or later for the f-strings.
import os
import bson.json_util
import pymongo
log_file = "logfile.txt"
output_file = "zip_codes.json"
host = "mongodb+srv://readonly:readonly@demodata.rgl39.mongodb.net/test?retryWrites=true&w=majority"
log_set = frozenset()
# Do we have a log file of previous writes?
if os.path.isfile(log_file):
    with open(log_file, "r") as log_input:
        log_set = frozenset(line.strip() for line in log_input)
    print(f"{log_file} contains {len(log_set)} items")
else:  # it will be created when we open it for appending below
    print(f"creating {log_file}")
# Connect to MongoDB. We are using a read-only dataset for testing.
client = pymongo.MongoClient(host)
db = client["demo"]
zipcodes = db["zipcodes"]
count = 0
# Note we use bson.json_util to dump the documents rather than json.dumps.
# This ensures that we can read this file back into MongoDB.
# Both files are opened in append mode so a restart keeps the earlier output.
with open(output_file, "a") as data_output, open(log_file, "a") as log_output:
    for doc in zipcodes.find():
        doc_id = str(doc["_id"])  # the log stores IDs as text
        if doc_id not in log_set:  # did we write this record already?
            count = count + 1
            data_output.write(f"{bson.json_util.dumps(doc)}\n")
            log_output.write(f"{doc_id}\n")
print(f"inserted {count} docs")
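A lighter variant of the same idea keeps only the last exported _id instead of a log of all of them. The sketch below is an illustration under assumptions: the cursor is sorted by _id, the _id values are strings that sort in insertion order, and only inserts happen during the export. The helper names read_checkpoint, build_resume_query and export, and the file names in the comments, are hypothetical and not part of pymongo.

```python
import json
import os


def read_checkpoint(path):
    """Return the last exported _id (as text), or None on the first run."""
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        return f.read().strip() or None


def build_resume_query(base_query, last_id):
    """Extend the caller's filter so the scan restarts after last_id."""
    query = dict(base_query)
    if last_id is not None:
        query["_id"] = {"$gt": last_id}
    return query


def export(docs, out_path, checkpoint_path):
    """Append each document to out_path and record its _id as the checkpoint."""
    count = 0
    with open(out_path, "a") as out:
        for doc in docs:
            out.write(json.dumps(doc, default=str) + "\n")
            out.flush()
            # Overwrite the checkpoint only after the document is written.
            with open(checkpoint_path, "w") as cp:
                cp.write(str(doc["_id"]))
            count += 1
    return count


# With pymongo this would be wired up roughly as (assumed names):
#   last_id = read_checkpoint("checkpoint.txt")
#   query = build_resume_query({"field": "bla"}, last_id)
#   export(db.col.find(query).sort("_id", 1), "out.json", "checkpoint.txt")
```

If the process dies between writing a document and updating the checkpoint, that one document may be written twice after a restart, so a consumer of the output file should tolerate a duplicate line.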

Related

find_one() finds duplicates that are not there

I am trying to copy a remote MongoDB Atlas server to a local one. I do this with a Python script which also checks whether the record is already there. Even though the local database is empty, my script finds duplicates which are not in the remote MongoDB Atlas database (at least I cannot find them). I am not very experienced with MongoDB and pymongo, but I cannot see what I am doing wrong. Sometimes find_one() finds exactly the same record as before (all the fields are the same, even the _id).
I removed the collection completely from my local server and tried again, but got the same result.
UserscollectionRemote = dbRemote['users']
UserscollectionNew = dbNew['users']
LogcollectionRemote = dbRemote['events']
LogcollectionNew = dbNew['events']
UsersOrg = UserscollectionRemote.find()
for document in UsersOrg:  # loop over all users
    print(document)
    if UserscollectionNew.find_one({'owner_id': document["owner_id"]}) is None:  # check if already there
        UserscollectionNew.insert_one(document)
    UserlogsOrg = LogcollectionRemote.find({'owner_id': document["owner_id"]})  # get all logs from this user
    for doc in UserlogsOrg:
        try:
            if LogcollectionNew.find_one({'date': doc["date"]}) is None:  # there was no entry yet with this date
                LogcollectionNew.insert_one(doc)
            else:
                print("duplicate")
                print(doc)
        except:
            print("an error occured finding the document")
            print(doc)
You have the second for loop inside the first; that could be trouble.
On a separate note, you should investigate mongodump and mongorestore for copying collections; unless you need to be doing it in code, these tools are better suited for your use case.
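If the copy does have to be done in code, one way to sidestep both the nested loop and the find_one checks is to copy each collection in its own pass and upsert on _id. This is only a sketch: copy_collection is a hypothetical helper, and it assumes _id uniquely identifies a document on both servers. It may also explain the reported duplicates, since the original check matches on date alone, so two different events with the same date would count as duplicates.

```python
def copy_collection(source, target):
    """Copy every document, keyed on _id, so reruns never create duplicates."""
    copied = 0
    for doc in source.find():
        # replace_one with upsert=True inserts when the _id is new and
        # overwrites the existing document otherwise.
        target.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        copied += 1
    return copied


# Assumed usage with the databases from the question:
#   copy_collection(dbRemote["users"], dbNew["users"])
#   copy_collection(dbRemote["events"], dbNew["events"])
```

Because the write is an upsert, running the script a second time leaves the target collection unchanged instead of reporting duplicates.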

How can I merge multiple tfrecords file into one file?

My question is: if I create a single tfrecords file for my data, it will take approximately 15 days to finish. The data has 500,000 pairs of templates, and each template is 32 frames (images). To save time, I have 3 GPUs, so I thought I could create three tfrecords files, one per GPU, and finish in 5 days. But when I searched for a way to merge these three files into one, I couldn't find a proper solution.
So is there any way to merge these three files into one file, or is there any way to train my network by feeding batches of examples extracted from the three tfrecords files, given that I am using the Dataset API?
Since the question was asked two months ago, I assume you have already found a solution. For those who follow, the answer is NO, you do not need to create a single HUGE tfrecord file. Just use the new Dataset API:
dataset = tf.data.TFRecordDataset(
    filenames_to_read,
    compression_type=None,  # or 'GZIP', 'ZLIB' if you compress your data
    buffer_size=10240,  # read buffer size in bytes
    num_parallel_reads=os.cpu_count(),  # or None to read the files sequentially
)
# Maybe you want to prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)
# Decode the example
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())
dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)
...
For details, check the documentation.
Addressing the question title directly for anyone looking to merge multiple .tfrecord files:
The most convenient approach would be to use the tf.Data API:
(adapting an example from the docs)
# Create dataset from multiple .tfrecord files
list_of_tfrecord_files = [dir1, dir2, dir3, dir4]
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)
# Save dataset to .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)
However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single tensorflow dataset.
You may also refer to a longer discussion regarding multiple .tfrecord files on Data Science Stackexchange
The answer by MoltenMuffins works for newer versions of TensorFlow. However, if you are using an older version, you have to iterate through the three tfrecords files and save their records into a new record file as follows. This works for TensorFlow 1.x releases that ship tf.data (part of the core API since 1.4).
import tensorflow as tf
def comb_tfrecord(tfrecords_path, save_path, batch_size=128):
    with tf.Graph().as_default(), tf.Session() as sess:
        ds = tf.data.TFRecordDataset(tfrecords_path).batch(batch_size)
        batch = ds.make_one_shot_iterator().get_next()
        writer = tf.python_io.TFRecordWriter(save_path)
        while True:
            try:
                records = sess.run(batch)
                for record in records:
                    writer.write(record)
            except tf.errors.OutOfRangeError:
                break
        writer.close()  # flush the remaining records to disk
Customizing the script above to list the tfrecords files with glob:
import os
import glob
import tensorflow as tf
save_path = 'data/tf_serving_warmup_requests'
tfrecords_path = glob.glob('data/*.tfrecords')
dataset = tf.data.TFRecordDataset(tfrecords_path)
writer = tf.data.experimental.TFRecordWriter(save_path)
writer.write(dataset)

Pyspark - Transfer control out of Spark Session (sc)

This is a follow up question on
Pyspark filter operation on Dstream
To keep a count of how many error/warning messages have come through in, say, a day or an hour, how does one design the job?
What I have tried:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
def counts():
    counter += 1
    print(counter.value)
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e : e.foreach(counts))
    errors.pprint()
    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; I have read about it, and the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am not sure why the program just exits; there is no error/dump anywhere to look further into (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity. Another use case is to count how many transactions were processed by looking for certain keywords.
Pseudo Code for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword,0) + 1 //add the keyword if not present and increase the counter
    print dictionary //or send this dictionary to else where
The last part of sending or printing dictionary requires switching out of spark streaming context - Can someone explain the concept please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(items):
    # Runs once per partition, on the worker that holds it
    for item in items:
        print(item)
lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e: e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print will execute on the workers and not on the driver, so if you're running this code on a cluster you will only see the output in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
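As a rough illustration of what that looks like, updateStateByKey takes an update function that receives the new values for a key in the current batch plus the previous state for that key. The sketch below keeps a running count per severity. The function names, the keyword list and the checkpoint directory are assumptions made for the example, and the Spark wiring is shown only in comments because it needs a running streaming context.

```python
def update_count(new_values, running_count):
    """Update function in the shape updateStateByKey expects.

    new_values: the values seen for one key in the current batch.
    running_count: the previous state for that key, or None if unseen.
    """
    return sum(new_values) + (running_count or 0)


def severity_key(line):
    """Map a log line to a (severity, 1) pair; the keywords are illustrative."""
    lowered = line.lower()
    for severity in ("error", "warning"):
        if severity in lowered:
            return (severity, 1)
    return ("other", 1)


# Assumed wiring inside the streaming job (stateful operations need checkpointing):
#   ssc.checkpoint("checkpoint_dir")   # hypothetical directory name
#   counts = lines.map(severity_key).updateStateByKey(update_count)
#   counts.pprint()
```

The state lives inside the DStream, so there is no need for a driver-side dictionary or an accumulator to carry counts from one batch to the next.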

date in pig latin

I am trying to do the following. I have multiple dates and I want to create a pig script which gets unknown number of input dates and then runs the pig script for the input arguments. My question is:
How can I send an unknown number of input variables to a pig script and then handle them within the pig script?
Thanks
Sara
I have some trouble understanding what you actually want to do. This would be my solution for your problem, sending an unknown number of dates (stored as chararray):
A = load 'input_dates' AS (date:chararray);
B = my_macro(A);
It's quite basic, so I guess I didn't understand your problem correctly. Could you elaborate a little more on your problem?
UPDATE: How about something like this, if you use Pig 0.11 (there is a bug with module imports up to 0.10):
#!/usr/bin/python
import os
from org.apache.pig.scripting import *
P = Pig.compile("""
data = LOAD '$docs_in' AS (a:int);
-- do something
""")
lof = os.listdir("/home/.../dates/")
# Build one set of parameters (one binding) per input file
params = []
for elem in lof:
    params.append({'docs_in': str(elem)})
bound = P.bind(params)
stats = bound.run()  # runs the script once per binding
If each run is counting on the result of the previous one, use runSingle() instead.
If I understand the question correctly, you want to load a number of files or directories. You can pass them as a comma-separated list in the input parameter.
Below is an example:
load.pig (content):
A = LOAD '$input' using PigStorage();
dump A;
command to run ( to run locally):
pig -x local -param input=20120301,20120302,20120304 load.pig
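When the number of dates is only known at run time, the command line above can be assembled by a small wrapper. This is a minimal sketch, assuming the dates arrive as a list of strings and that load.pig is the script shown above; build_pig_command is a hypothetical helper name.

```python
def build_pig_command(dates, script="load.pig"):
    """Build the argument list for one local Pig run over all given dates."""
    if not dates:
        raise ValueError("at least one date is required")
    # All dates go into a single comma-separated value for the input parameter.
    joined = ",".join(dates)
    return ["pig", "-x", "local", "-param", "input=" + joined, script]


command = build_pig_command(["20120301", "20120302", "20120304"])
print(" ".join(command))
# prints: pig -x local -param input=20120301,20120302,20120304 load.pig
# To actually launch it: subprocess.run(command, check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems if a path ever contains spaces.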

SQLAlchemy, Psycopg2 and Postgresql COPY

It looks like Psycopg has a custom command for executing a COPY:
psycopg2 COPY using cursor.copy_from() freezes with large inputs
Is there a way to access this functionality from within SQLAlchemy?
The accepted answer is correct, but if you want more than just EoghanM's comment to go on, the following worked for me in COPYing a table out to CSV:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
eng = create_engine("postgresql://user:pwd@host:5432/db")
ses = sessionmaker(bind=eng)
copy_sql = 'COPY some_table TO STDOUT WITH CSV HEADER'
fake_conn = eng.raw_connection()
fake_cur = fake_conn.cursor()
# Stream the table straight into the CSV file
with open('/tmp/some_table_copy.csv', 'wb') as dbcopy_f:
    fake_cur.copy_expert(copy_sql, dbcopy_f)
fake_cur.close()
fake_conn.close()
The sessionmaker isn't necessary, but if you're in the habit of creating the engine and the session at the same time, you'll need to separate them to use raw_connection (unless there is some way to access the engine through the session object that I don't know of). The SQL string provided to copy_expert is also not the only way to do it; there is a basic copy_to function that you can use with a subset of the parameters that you could pass to a normal COPY TO query. Overall performance of the command seems fast for me, copying out a table of ~20000 rows.
http://initd.org/psycopg/docs/cursor.html#cursor.copy_to
http://docs.sqlalchemy.org/en/latest/core/connections.html#sqlalchemy.engine.Engine.raw_connection
If your engine is configured with a psycopg2 connection string (which is the default, so either "postgresql://..." or "postgresql+psycopg2://..."), you can create a psycopg2 cursor from an SQL Alchemy session using
cursor = session.connection().connection.cursor()
which you can use to execute
cursor.copy_from(...)
The cursor will be active in the same transaction as your session currently is. If a commit or rollback happens, any further use of the cursor will throw a psycopg2.InterfaceError; you would have to create a new one.
You can use:
import io
def to_sql(engine, df, table, if_exists='fail', sep='\t', encoding='utf8'):
    # Create the (empty) table from the DataFrame's columns
    df[:0].to_sql(table, engine, if_exists=if_exists)
    # Prepare data as an in-memory CSV buffer (io.StringIO replaces Python 2's cStringIO)
    output = io.StringIO()
    df.to_csv(output, sep=sep, header=False, encoding=encoding)
    output.seek(0)
    # Insert data with COPY
    connection = engine.raw_connection()
    cursor = connection.cursor()
    cursor.copy_from(output, table, sep=sep, null='')
    connection.commit()
    cursor.close()
I insert 200000 lines in 5 seconds instead of 4 minutes
It doesn't look like it.
You may have to just use psycopg2 to expose this functionality and forego the ORM capabilities. I guess I don't really see the benefit of ORM in such an operation anyway since it's a straight bulk insert and dealing with individual objects a la an ORM would not really make a whole lot of sense.
If you're starting from SQLAlchemy, you need to first get to the connection engine (also known by the property name bind on some SQLAlchemy objects):
engine = create_engine('postgresql+psycopg2://myuser:password@localhost/mydb')
# or
engine = session.get_bind()
# or any other way you know to get to the engine
From the engine you can isolate a psycopg2 connection:
# get a psycopg2 connection
connection = engine.connect().connection
# get a cursor on that connection
cursor = connection.cursor()
Here are some templates for the COPY statement to use with cursor.copy_expert(), a more complete and flexible option than copy_from() or copy_to(), as indicated here: https://www.psycopg.org/docs/cursor.html#cursor.copy_expert.
# to dump to a file
dump_to = """
COPY mytable
TO STDOUT
WITH (
FORMAT CSV,
DELIMITER ',',
HEADER
);
"""
# to copy from a file:
copy_from = """
COPY mytable
FROM STDIN
WITH (
FORMAT CSV,
DELIMITER ',',
HEADER
);
"""
Check out what the options above mean and others that may be of interest to your specific situation https://www.postgresql.org/docs/current/static/sql-copy.html.
IMPORTANT NOTE: The documentation of cursor.copy_expert() says to use STDOUT to write out to a file and STDIN to copy from a file. If you look at the syntax in the PostgreSQL manual, you'll notice that you can also name the file directly in the COPY statement. Don't do that: the file is then read or written by the database server itself, which requires superuser-level privileges on the database, so you're likely just wasting your time during development. Just do what's indicated in the psycopg2 docs and specify STDIN or STDOUT in your statement with cursor.copy_expert(); it should be fine.
# running the copy statement
with open('/path/to/your/data/file.csv') as f:
    cursor.copy_expert(copy_from, file=f)
# don't forget to commit the changes.
connection.commit()
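If the table name or delimiter varies between runs, the two templates can be generated instead of hand-written. The helper below is a hypothetical sketch, not part of psycopg2 or SQLAlchemy: it only builds the statement text for cursor.copy_expert(), and it assumes plain unquoted table names and the CSV options used in the templates above.

```python
import re

# Accept only simple unquoted identifiers; anything else is rejected.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_copy_sql(table, direction, delimiter=",", header=True):
    """Return a COPY ... STDIN/STDOUT statement for cursor.copy_expert()."""
    if not _IDENTIFIER.match(table):
        raise ValueError("unexpected table name: %r" % table)
    if direction not in ("to", "from"):
        raise ValueError("direction must be 'to' or 'from'")
    if len(delimiter) != 1 or delimiter == "'":
        raise ValueError("delimiter must be a single character other than a quote")
    stream = "STDOUT" if direction == "to" else "STDIN"
    options = ["FORMAT CSV", "DELIMITER '%s'" % delimiter]
    if header:
        options.append("HEADER")
    return "COPY %s %s %s WITH (%s)" % (
        table, direction.upper(), stream, ", ".join(options))


print(build_copy_sql("mytable", "to"))
# prints: COPY mytable TO STDOUT WITH (FORMAT CSV, DELIMITER ',', HEADER)
```

The table name is checked against a strict pattern because it is interpolated into the statement text, which bind parameters cannot do for identifiers.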
You don't need to drop down to psycopg2, or use raw_connection or a cursor.
Just execute the SQL as usual; you can even use bind parameters with text():
engine.execute(text('''copy some_table from :csv
                       delimiter ',' csv'''
                   ).execution_options(autocommit=True),
               csv='/tmp/a.csv')
You can drop the execution_options(autocommit=True) if this PR is accepted.