Today I was stress testing multiprocessing in Python against our cloud MongoDB (Atlas).
It's currently running at 100%, and I'd like to do something like a "restart".
I have found a "Shutdown" command, but I can't find a command to start it up after it has shutdown, so I'm afraid to run just the "Shutdown".
I have tried killing processes one at a time in the lower right section of the screen below, but after refreshing the page, the same process numbers are there, and I think there are more at the bottom of the list. I think they are all backed up.
An insert of a large document does not return to the Python program within 5 minutes. I need to get that working again (it should complete in 10-15 seconds, as it has in the past).
I am able to open a command window and connect to that server. Just unclear what commands to run.
Here is an example of how I tried to kill some of the processes:
Also note that the "Performance Advisor" page is not recommending any new indexes.
Update 1:
Alternatively, can I kill all running, hung, or locked processes?
I was reading about killOp here (https://docs.mongodb.com/manual/tutorial/terminate-running-operations/), but found it confusing given the different versions and the fact that I'm using Atlas.
I'm not sure if there is an easier way, but this is what I did.
First, I ran a Python program to extract all the desired operation IDs based on my database and collection name. You have to look at the file it creates to understand the if statements in the code below. NOTE: PyMongo says that db.current_op is deprecated, and I haven't found out how to do this without that command (see the sketch at the end of this update).
Note the doc page warns against killing certain types of operations, so I was careful to pick ones that were doing inserts on one specific collection. (Do not attempt to kill all processes in the JSON returned).
import requests
import os
import sys
import traceback
import pprint
import json
from datetime import datetime as datetime1, timedelta, timezone
import datetime
from time import time
import time as time2
import configHandler
import pymongo
from pymongo import MongoClient
from uuid import UUID
def uuid_convert(o):
    if isinstance(o, UUID):
        return o.hex

# This gets all my config from a config.json file; not including that code here.
config_dict = configHandler.getConfigVariables()
cluster = MongoClient(config_dict['MONGODB_CONNECTION_STRING_ADMIN'])
db = cluster[config_dict['MONGODB_CLUSTER']]
current_ops = db.current_op(True)
count_ops = 0
for op in current_ops["inprog"]:
    count_ops += 1
    # db.kill - no such command
    if op["type"] == "op":
        if "op" in op:
            if op["op"] == "insert" and op["command"]["insert"] == "TestCollectionName":
                # print(op["opid"], op["command"]["insert"])
                print('db.adminCommand({"killOp": 1, "op": ' + str(op["opid"]) + '})')
print("\n\ncount_ops=", count_ops)
currDateTime = datetime.datetime.now()
print("type(current_ops) = ", type(current_ops))
# this dictionary has nested fields
# current_ops_str = json.dumps(current_ops, indent=4)
# https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
filename = "./data/currents_ops" + currDateTime.strftime("%y_%m_%d__%H_%M_%S") + ".json"
with open(filename, "w") as file1:
    # file1.write(current_ops_str)
    json.dump(current_ops, file1, indent=4, default=uuid_convert)
print("Wrote to filename=", filename)
The script writes the full ops JSON to disk, but for the kill commands I did a copy/paste from the command window to a file. Then from the command line, I ran something like this:
mongo "mongodb+srv://mycluster0.otwxp.mongodb.net/mydbame" --username myuser --password abc1234 <kill_opid_script.js
The kill_opid_script.js looked like this. I added the print(db) calls because the first time I ran it, it didn't seem to do anything.
print(db)
db.adminCommand({"killOp": 1, "op": 648685})
db.adminCommand({"killOp": 1, "op": 667396})
db.adminCommand({"killOp": 1, "op": 557439})
etc... for 400+ times...
print(db)
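Since PyMongo flags db.current_op() as deprecated, here is a minimal sketch of what I believe is the non-deprecated route: a $currentOp aggregation on the admin database (PyMongo 3.9+) plus killOp sent through admin.command. This is only an illustration and is untested against Atlas; the filter mirrors the one in my script above, and killOp needs sufficient privileges (opids on sharded clusters can also be strings like "shard:id").
from pymongo import MongoClient

# Assumption: config_dict comes from configHandler.getConfigVariables(), as in the script above
client = MongoClient(config_dict['MONGODB_CONNECTION_STRING_ADMIN'])

pipeline = [
    {"$currentOp": {"allUsers": True, "idleConnections": False}},
    {"$match": {"op": "insert", "command.insert": "TestCollectionName"}},
]

for op in client.admin.aggregate(pipeline):
    print("killing opid", op["opid"])
    client.admin.command({"killOp": 1, "op": op["opid"]})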
Related
I am attempting to run the following code in a Dataproc cluster (you can find the software versions I am using here):
# IMPORTANT: THIS CODE WAS RUN IN A SINGLE JUPYTER NOTEBOOK CELL
print("IMPORTING LIBRARIES...")
import pandas as pd
import numpy as np
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, pandas_udf
# https://spark.apache.org/docs/3.1.3/api/python/_modules/pyspark/sql/types.html
from pyspark.sql.types import ArrayType, StringType
print("STARTING SPARK SESSION...")
spark = SparkSession.builder.appName('SpacyOverPySpark') \
.getOrCreate()
print("FUNCTION DEFINTION...")
def load_spacy_model():
    import spacy
    print("\tLoading spacy model...")
    return spacy.load("./spacy_model")  # This model exists locally
@pandas_udf(ArrayType(StringType()))
def get_entities(list_of_text: pd.Series) -> pd.Series:
    # retrieving the shared nlp object
    nlp = broadcasted_nlp.value
    # batch processing our list of text
    docs = nlp.pipe(list_of_text)
    # entity extraction (`ents` is a list[list[str]])
    ents = [
        [ent.text for ent in doc.ents]
        for doc in docs
    ]
    return pd.Series(ents)
# loading spaCy model and broadcasting it
broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
print("DATA READING (OR MANUAL DATA GENERATION)...")
# # Manually-generated data (DISABLED BY DEFAULT, USE FOR "TESTING")
# # IMPORTANT: Code works well for this case !!!
# pdf = pd.DataFrame(
# [
# "Python and Pandas are very important for Automation",
# "Tony Stark is an Electrical Engineer",
# "Pipe welding is a very dangerous task in Oil mining",
# "Nursing is often underwhelmed, but it's very interesting",
# "Software Engineering now opens a lot of doors for you",
# "Civil Engineering can get exiting, as you travel very often",
# "I am a Java Programmer, and I think I'm quite good at what I do",
# "Diane is never bored of doing the same thing all day",
# "My father is a Doctor, and he supports people in condition of poverty",
# "A janitor is required as soon as possible"
# ],
# columns=['posting']
# )
# sdf=spark.createDataFrame(pdf)
# Reading data from CSV stored in GCS (ENABLED BY DEFAULT, USE FOR "PRODUCTION")
sdf = spark.read.csv("gs://onementor-ml-data/1M_indeed_eng_clean.csv", header=True) # ~1M rows, 1 column 'posting', ~1GB in size
print("\tDataFrame shape: ", (sdf.count(), len(sdf.columns)))
print("NAMED ENTITY RECOGNITION USING SPACY OVER PYSPARK...")
t1 = time.time()
# df_dummy2.withColumn("entities", get_entities(col("posting"))).show(5, truncate=10)
sdf_new = sdf.withColumn('skills',get_entities('posting'))
sdf_new.show(5, truncate=10)
print("\tData mined in {:.2f} seconds (Dataframe shape: ({}, {}))".format(
time.time()-t1,
sdf_new.count(),
len(sdf_new.columns))
)
BTW, some basic specs of my cluster (this info can be updated, please request it in the comment section):
Master node
Standard (1 master, N workers)
Machine type: n1-highmem-4 (originally n1-standard-4, still with errors)
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500GB
Local SSDs: 0
Worker nodes
(Qty.:) 10 (originally 2, still with errors)
Machine type: n1-standard-4
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500GB
Local SSDs: 0
Secondary worker nodes: 0
When running the previous script with the "manually-generated data", the entity extraction works OK (if you need details about how I created my cluster, hit that link too); however, when importing the .csv data from Cloud Storage, the following error appears (both VM and cluster names have been changed for safety):
ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 11 on my-vm-w-9.us-central1-a.c.my-project.internal: Container marked as failed: container_1661960727108_0002_01_000013 on host: my-vm-w-9.us-central1-a.c.my-project.internal. Exit status: -100. Diagnostics: Container released on a *lost* node.
I have also read in the logs the following warning:
WARN org.apache.spark.storage.BlockManagerMasterEndpoint: No more replicas available for broadcast_0_python !
I have done some quick research, but I was astonished by the considerable number of very different possible causes of that error (none of which seem valid for PySpark over Dataproc), so I am not sure whether there is a better troubleshooting approach for this case than blindly trying case after case I find on the web.
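In case it is useful for reproducing, this is roughly how I would re-run the same UDF on a small sample of the CSV (a sketch only, reusing the spark session and get_entities from the code above) to check whether the problem only shows up at full data volume:
# Sketch: same pipeline, but limited to 1000 rows of the CSV
sample_sdf = spark.read.csv(
    "gs://onementor-ml-data/1M_indeed_eng_clean.csv", header=True
).limit(1000)

sample_sdf.withColumn("skills", get_entities("posting")).show(5, truncate=10)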
What could be happening here?
Thank you
Even after a successful execution of my tests in Device Farm, I get an empty screenshots report. I have kept my code as simple as possible, as shown below -
from appium import webdriver
import time
import unittest
import os
class MyAndroidTest(unittest.TestCase):

    def setUp(self):
        caps = {}
        self.driver = webdriver.Remote("http://127.0.0.1:4723/wd/hub", caps)

    def test1(self):
        self.driver.get('http://docs.aws.amazon.com/devicefarm/latest/developerguide/welcome.html')
        time.sleep(5)
        screenshot_folder = os.getenv('SCREENSHOT_PATH', '/tmp')
        self.driver.save_screenshot(screenshot_folder + 'screen1.png')
        time.sleep(5)

    def tearDown(self):
        self.driver.quit()

if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(MyAndroidTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
I tested on a single device pool -
How can I make this work?
TIA.
You are missing a slash (/) before the filename (i.e., screen1.png). The save_screenshot call should be as below -
self.driver.save_screenshot(screenshot_folder + '/screen1.png')
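Alternatively (just a small variation on the same fix, nothing Device Farm specific), os.path.join sidesteps the separator problem entirely:
import os

screenshot_folder = os.getenv('SCREENSHOT_PATH', '/tmp')
# os.path.join inserts the separator, so a missing trailing slash no longer matters
self.driver.save_screenshot(os.path.join(screenshot_folder, 'screen1.png'))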
Though I'm not sure exactly how to write this to a file in Device Farm, here are the Appium docs for the screenshot endpoint and a Python example.
https://github.com/appium/appium/blob/master/docs/en/commands/session/screenshot.md
It returns a base64-encoded string, which we would then just need to save somewhere like the Appium screenshot directory the other answers mentioned. Otherwise, we could also save it in the /tmp directory and then export it using the custom artifacts feature.
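Roughly like this (a sketch I have not run on Device Farm itself; it uses the standard get_screenshot_as_base64 call from the Python client and the SCREENSHOT_PATH convention from the question):
import base64
import os

# driver is the Appium webdriver instance from your test
encoded = driver.get_screenshot_as_base64()
target_dir = os.getenv('SCREENSHOT_PATH', '/tmp')
# decode the base64 payload and write it out as a PNG
with open(os.path.join(target_dir, 'screen1.png'), 'wb') as f:
    f.write(base64.b64decode(encoded))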
Let me know if that link helps.
James
I am running a script to copy millions (2.4 million to be exact) of images from several GCS buckets into one central bucket, with all buckets in the same region. I was originally working from one CSV file but broke it into 64 smaller ones so each process can iterate through its own file and not wait for the others. When the script launches on a 64 vCPU, 240 GB memory instance on GCP, it runs fine for about an hour and a half. In 75 minutes, 155 thousand files were copied over. The CPU usage was registering a sustained 99%. After this, the CPU usage drastically declines to 2% and the transfer rate falls significantly. I am really unsure why this happens.
I am keeping track of files that fail by creating blank files in an errors directory, so that there is no write lock from writing to a central error file. The code is below. It is not a spacing or syntax error; some spacing got messed up when I copied it into the post. Any help is greatly appreciated.
Thanks,
Zach
import os
import subprocess
import csv
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing import Pool as ProcessPool
import multiprocessing
gcs_destination = 'gs://dest-bucket/'
source_1 = 'gs://source-1/'
source_2 = 'gs://source-2/'
source_3 = 'gs://source-3/'
source_4 = 'gs://source-4/'
def copy(img):
    try:
        imgID = img[0]        # extract name
        imgLocation = img[9]  # extract its location on gcs
        print img[0] + " " + imgLocation
        source = ""
        if imgLocation == '1':
            source = source_1
        elif imgLocation == '2':
            source = source_2
        elif imgLocation == '3':
            source = source_3
        elif imgLocation == '4':
            source = source_4
        print str(os.getpid())
        command = "gsutil -o GSUtil:state_dir=.{} cp {}{}.tar.gz {}".format(os.getpid(), source, imgID, gcs_destination)
        prog = subprocess.call(command, shell=True)
        if prog != 0:
            # record failures as empty marker files instead of a shared error log
            command = "touch errors/{}_{}".format(imgID, imgLocation)
            os.system(command)
    except:
        print "Doing nothing with the error"

def split_into_threads(csv_file):
    with open(csv_file) as f:
        csv_f = csv.reader(f)
        pool = ThreadPool(15)
        pool.map(copy, csv_f)

if __name__ == "__main__":
    file_names = [None] * 64
    # Read in CSV file of all records
    for i in range(0, 64):
        file_names[i] = 'split_origin/origin_{}.csv'.format(i)
    process_pool = ProcessPool(multiprocessing.cpu_count())
    process_pool.map(split_into_threads, file_names)
For gsutil, I agree strongly with the multithreading suggestion by adding -m. Further, composite uploads, -o, may be unnecessary and undesirable as the images are not GB each in size and need not be split into shards. They're likely in the X-XXMB range.
Within your Python function, you are calling gsutil commands, which are in turn calling further Python functions. It should be cleaner and more performant to leverage the Google-made client library for Python, available below. gsutil is built for interactive CLI use rather than for calling programmatically.
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
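As a rough illustration only (the bucket names and CSV row layout are taken from the question and would need adapting), a per-row copy with the client library might look like this:
from google.cloud import storage

client = storage.Client()
dest_bucket = client.bucket('dest-bucket')
source_buckets = {'1': 'source-1', '2': 'source-2',
                  '3': 'source-3', '4': 'source-4'}

def copy_one(img):
    # img[0] is the object name and img[9] its source bucket id, as in the question
    src_bucket = client.bucket(source_buckets[img[9]])
    blob = src_bucket.blob(img[0] + '.tar.gz')
    # copy_blob performs a server-side copy, so no object data flows through the VM
    src_bucket.copy_blob(blob, dest_bucket)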
Also, for gsutil, see your ~/.boto file and look at the multi-processing and multi-threading values. Beefier machines can handle more processes and threads. For reference, I work from my MacBook Pro with 1 process and 24 threads. I use an ethernet adapter to hardwire into my office connection and get incredible performance off the internal SSD (>450 Mbps). That's megabits, not bytes. The transfer rates are impressive nonetheless.
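For example, those values live in the [GSUtil] section of ~/.boto (the numbers below are only placeholders to illustrate the settings, not a recommendation):
# ~/.boto
[GSUtil]
parallel_process_count = 4
parallel_thread_count = 24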
I strongly recommend using the "-m" flag with gsutil to enable multi-threaded copying.
Also, as an alternative, you can use the Storage Transfer Service [1] to move data between buckets.
[1] https://cloud.google.com/storage/transfer/
I would like to change the following shell script to Scala (just for fun); however, the script must keep running and listen for changes to the *.mkd files. If any file is changed, then the script should re-generate the affected doc. File IO has always been my Achilles heel...
#!/bin/sh
for file in *.mkd
do
pandoc --number-sections $file -o "${file%%.*}.pdf"
done
Any ideas around a good approach to this will be appreciated.
The following code, taken from my answer on Watch for project files, can also watch a directory and execute a specific command:
#!/usr/bin/env scala
import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._
val file = Paths.get(args(0))
val cmd = args(1)
val watcher = FileSystems.getDefault.newWatchService
file.register(
  watcher,
  StandardWatchEventKinds.ENTRY_CREATE,
  StandardWatchEventKinds.ENTRY_MODIFY,
  StandardWatchEventKinds.ENTRY_DELETE
)
def exec = cmd run true
@scala.annotation.tailrec
def watch(proc: Process): Unit = {
  val key = watcher.take
  val events = key.pollEvents
  val newProc =
    if (!events.isEmpty) {
      proc.destroy()
      exec
    } else proc
  if (key.reset) watch(newProc)
  else println("aborted")
}
watch(exec)
Usage:
watchr.scala markdownFolder/ "echo \"Something changed!\""
Extensions have to be made to the script to inject file names into the command. As of now this snippet should just be regarded as a building block for the actual answer.
Modifying the script to incorporate the *.mkd wildcards would be non-trivial as you'd have to manually search for the files and register a watch on all of them. Re-using the script above and placing all files in a directory has the added advantage of picking up new files when they are created.
As you can see, it gets pretty big and messy pretty quickly just relying on the Scala & Java APIs; you would be better off relying on alternative libraries, or just sticking to bash while using inotify.
I designed a GUI application using wxPython that communicates with a local database (MongoDB) located in the same folder. My main application has a relative path to the database daemon so it can start it every time the GUI is launched.
This is the main.py:
import wx
import mongodb

class EVA(wx.App):
    # wxPython GUI here
    pass

if __name__ == "__main__":
    myMongodb = mongodb.Mongodb()
    myMongodb.start()
    myMongodb.connect()
    app = EVA(0)
    app.MainLoop()
This is the mongodb.py module:
from pymongo import Connection
import subprocess, os , signal
class Mongodb():
    pid = 0

    def start(self):
        path = "/mongodb-osx-x86_64-1.6.5/bin/mongod"
        data = "/data/db/"
        cmd = path + " --dbpath " + data
        MyCMD = subprocess.Popen([cmd], shell=True)
        self.pid = MyCMD.pid

    def connect(self):
        try:
            connection = Connection(host="localhost", port=27017)
            db = connection['Example_db']
            return db
        except Exception as inst:
            print "Database connection error: ", inst

    def stop(self):
        os.kill(self.pid, signal.SIGTERM)
Everything works fine from the terminal. However, when I use py2app to make a standalone version of my program on Mac OS (OS v10.6.5, Python v2.7), I am able to launch the GUI but can't start the database. It seems py2app changed the location of the MongoDB executable folder and broke my code.
I use the following parameters with py2app:
$ py2applet --make-setup main.py
$ rm -rf build dist
$ python setup.py py2app --iconfile /icons/main_icon.icns -r /mongodb-osx-x86_64-1.6.5
How to force py2app to leave my application structure intact?
Thanks.
Py2app changes the current working directory to the foo.app/Contents/Resources folder within the app bundle when it starts up. It doesn't seem to be the case from the code you show above, but if you have any paths that depend on the CWD (including relative pathnames), you'll have to deal with that somehow. One common way is to also copy the other stuff you need into that folder within the application bundle; it will then truly be a standalone bundle that is not dependent on its location in the filesystem, and hopefully also not dependent on the machine it is running on.
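For example (just a sketch of that idea, assuming you copy the mongodb-osx-x86_64-1.6.5 folder and a data/db directory into the bundle's Resources folder), the start logic could build its paths from the current working directory instead of hard-coding them:
import os
import subprocess

def start_mongod():
    # py2app sets the CWD to <app>.app/Contents/Resources at startup, so
    # paths built from it remain valid wherever the bundle is moved
    base = os.getcwd()
    mongod = os.path.join(base, "mongodb-osx-x86_64-1.6.5", "bin", "mongod")
    data = os.path.join(base, "data", "db")
    return subprocess.Popen([mongod, "--dbpath", data]).pid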