How do I write messages to the output log on AWS Glue? - pyspark

AWS Glue jobs log output and errors to two different CloudWatch logs, /aws-glue/jobs/error and /aws-glue/jobs/output by default. When I include print() statements in my scripts for debugging, they get written to the error log (/aws-glue/jobs/error).
I have tried using:
log4jLogger = sparkContext._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")
but "Hello World!" doesn't show up in either of the logs for the test job I ran.
Does anyone know how to go about writing debug log statements to the output log (/aws-glue/jobs/output)?
TIA!
EDIT:
It turns out the above actually does work. What was happening was that I was running the job in the AWS Glue Script editor window which captures Command-F key combinations and only searches in the current script. So when I tried to search within the page for the logging output it seemed as if it hadn't been logged.
NOTE: I did discover through testing the first responder's suggestion that AWS Glue scripts don't seem to output any log message with a level less than WARN!

Try to use built-in python logger from logging module, by default it writes messages to standard output stream.
import logging
MSG_FORMAT = '%(asctime)s %(levelname)s %(name)s: %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
logging.basicConfig(format=MSG_FORMAT, datefmt=DATETIME_FORMAT)
logger = logging.getLogger(<logger-name-here>)
logger.setLevel(logging.INFO)
...
logger.info("Test log message")

I know the article is not new but maybe it could be helpful for someone:
For me logging in glue works with the following lines of code:
# create glue context
glueContext = GlueContext(sc)
# set custom logging on
logger = glueContext.get_logger()
...
#write into the log file with:
logger.info("s3_key:" + your_value)

I noticed the above answers are written in python. For Scala you could do the following
import com.amazonaws.services.glue.log.GlueLogger
object GlueApp {
def main(sysArgs: Array[String]) {
val logger = new GlueLogger
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
}
}
You can find both Python and Scala solution from official doc here

Just in case this helps. This works to change the log level.
sc = SparkContext()
sc.setLogLevel('DEBUG')
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info('Hello Glue')

This worked for INFO level in a Glue Python job:
import sys
root = logging.getLogger()
root.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)
root.info("check")
source

I faced the same problem. I resolved it by added
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))
Before there was no prints at all, even ERROR level
The idea was taken from here
https://medium.com/tieto-developers/how-to-do-application-logging-in-aws-745114ac6eb7
Another option would be to log to stdout and glue AWS logging to stdout (using stdout is actually one of the best practices in cloud logging).
Update: it works only for setLevel("WARNING") and when prints ERROR or WARING. I didn't find how to manage it for the INFO level :(

If you're just debugging, print() (Python) or println() (Scala) works just fine.

Related

Flink Read AvroInputFormat PROCESS_CONTINUOUSLY

I am new to Flink (v1.3.2) and trying to read avro record continuously in scala on EMR. I know for file you can use something like following and it will keep running and scanning directory.
val stream = env.readFile(
inputFormat = textInputFormat,
filePath = path,
watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
interval = Time.seconds(10).toMilliseconds
)
Is there a similar way in Flink for avro record? So I have the following code
val textInputFormat = new AvroInputFormat(new Path(path), classOf[User])
textInputFormat.setNestedFileEnumeration(true)
val avroInputStream = env.createInput(textInputFormat)
val output = avroInputStream.map(line => (line.getUserID, 1))
.keyBy(0)
.timeWindow(Time.seconds(10))
.sum(1)
output.print()
I am able to see the output there then Flink switched to FINISHED, but still want to get the code running/waiting for any new files arrive in the future. Is there something like FileProcessingMode.PROCESS_CONTINUOUSLY? Please suggest!
I figure out this by setting up a flink-yarn-session on EMR and make it run PROCESS_CONTINUOUSLY.
env.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
Create a new flink yarn session using flink-yarn-session -n 2 -d
Get application_id using yarn application -list, for example, it is application_0000000000_0002
Attached flink run job with the application_id,flink run -m yarn-cluster -yid application_0000000000_0002 xxx.jar
More detail can be found on EMR documentation now: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html

pytest implementing a logfile per test method

I would like to create a separate log file for each test method. And i would like to do this in the conftest.py file and pass the logfile instance to the test method. This way, whenever i log something in a test method it would log to a separate log file and will be very easy to analyse.
I tried the following.
Inside conftest.py file i added this:
logs_dir = pkg_resources.resource_filename("test_results", "logs")
def pytest_runtest_setup(item):
test_method_name = item.name
testpath = item.parent.name.strip('.py')
path = '%s/%s' % (logs_dir, testpath)
if not os.path.exists(path):
os.makedirs(path)
log = logger.make_logger(test_method_name, path) # Make logger takes care of creating the logfile and returns the python logging object.
The problem here is that pytest_runtest_setup does not have the ability to return anything to the test method. Atleast, i am not aware of it.
So, i thought of creating a fixture method inside the conftest.py file with scope="function" and call this fixture from the test methods. But, the fixture method does not know about the the Pytest.Item object. In case of pytest_runtest_setup method, it receives the item parameter and using that we are able to find out the test method name and test method path.
Please help!
I found this solution by researching further upon webh's answer. I tried to use pytest-logger but their file structure is very rigid and it was not really useful for me. I found this code working without any plugin. It is based on set_log_path, which is an experimental feature.
Pytest 6.1.1 and Python 3.8.4
# conftest.py
# Required modules
import pytest
from pathlib import Path
# Configure logging
#pytest.hookimpl(hookwrapper=True,tryfirst=True)
def pytest_runtest_setup(item):
config=item.config
logging_plugin=config.pluginmanager.get_plugin("logging-plugin")
filename=Path('pytest-logs', item._request.node.name+".log")
logging_plugin.set_log_path(str(filename))
yield
Notice that the use of Path can be substituted by os.path.join. Moreover, different tests can be set up in different folders and keep a record of all tests done historically by using a timestamp on the filename. One could use the following filename for example:
# conftest.py
# Required modules
import pytest
import datetime
from pathlib import Path
# Configure logging
#pytest.hookimpl(hookwrapper=True,tryfirst=True)
def pytest_runtest_setup(item):
...
filename=Path(
'pytest-logs',
item._request.node.name,
f"{datetime.datetime.now().strftime('%Y%m%dT%H%M%S')}.log"
)
...
Additionally, if one would like to modify the log format, one can change it in pytest configuration file as described in the documentation.
# pytest.ini
[pytest]
log_file_level = INFO
log_file_format = %(name)s [%(levelname)s]: %(message)
My first stackoverflow answer!
I found the answer i was looking for.
I was able to achieve it using the function scoped fixture like this:
#pytest.fixture(scope="function")
def log(request):
test_path = request.node.parent.name.strip(".py")
test_name = request.node.name
node_id = request.node.nodeid
log_file_path = '%s/%s' % (logs_dir, test_path)
if not os.path.exists(log_file_path):
os.makedirs(log_file_path)
logger_obj = logger.make_logger(test_name, log_file_path, node_id)
yield logger_obj
handlers = logger_obj.handlers
for handler in handlers:
handler.close()
logger_obj.removeHandler(handler)
In newer pytest version this can be achieved with set_log_path.
#pytest.fixture
def manage_logs(request, autouse=True):
"""Set log file name same as test name"""
request.config.pluginmanager.get_plugin("logging-plugin")\
.set_log_path(os.path.join('log', request.node.name + '.log'))

Logging from pyspark application to a local or hdfs file

I have an application in pyspark includes closure functions that contain logging statements, I don't know how to log messages to local/hdfs file in pyspark.
I tried something as below but doesn't work:
import json
from pyspark import SparkContext
import logging
def parse_json(text_line):
try:
return(json.loads(text_line))
except ValueError:
# here I need to log a warning message to a local file or even to default spark logs
logger.warn("invalid json structure" + text_line)
return({})
if __name__ == "__main__":
my_data = ['{"id": "111", "name": "aaa"}',
'{"wrong json", "name": "bbb"}',
'{"id": "333", "name": "ccc"}']
sc = SparkContext()
logger = logging.getLogger('py4j')
lines = sc.parallelize(my_data)
my_data_json = lines.map(parse_json).filter(lambda x: x)
print(my_data_json.collect())
Any help please!
You can configure the log4j appender in your log4j setting and use it inside your pyspark application. I haven't tried storing logs on HDFS, however this method will definitely help you get started with logging onto console and locally to files.
I have written a small blog post to address your solution.
https://www.shantanualshi.com/logging-in-pyspark/2016-07-04-logging-in-pyspark-scripts/
Let me know if that works!

How to execute JMeter test case from Java code

How do I run a JMeter test case from Java code?
I have followed the example Here from Blazemeter.com
My code is as follows:
public class BasicSampler {
public static void main(String[] argv) throws Exception {
// JMeter Engine
StandardJMeterEngine jmeter = new StandardJMeterEngine();
// Initialize Properties, logging, locale, etc.
JMeterUtils.loadJMeterProperties("/home/stone/Workbench/automated-testing/apache-jmeter-2.11/bin/jmeter.properties");
JMeterUtils.setJMeterHome("/home/stone/Workbench/automated-testing/apache-jmeter-2.11");
JMeterUtils.initLogging();// you can comment this line out to see extra log messages of i.e. DEBUG level
JMeterUtils.initLocale();
// Initialize JMeter SaveService
SaveService.loadProperties();
// Load existing .jmx Test Plan
FileInputStream in = new FileInputStream("/home/stone/Workbench/automated-testing/apache-jmeter-2.11/bin/examples/CSVSample.jmx");
HashTree testPlanTree = SaveService.loadTree(in);
in.close();
// Run JMeter Test
jmeter.configure(testPlanTree);
jmeter.run();
}
}
but I keep getting the following messages in the console and my test never executes.
INFO 2014-09-23 12:04:40.492 [jmeter.e] (): Listeners will be started after enabling running version
INFO 2014-09-23 12:04:40.511 [jmeter.e] (): To revert to the earlier behaviour, define jmeterengine.startlistenerslater=false
I have also tried uncommented jmeterengine.startlistenerslater=false from jmeter.properties file
How do you know that your "test never executes"?
What is in jmeter.log file (it should be in the root of your project). Or alternatively comment JMeterUtils.initLogging() line to see the full output in STDOUT
Have you changed relative path CSVSample_user.csv in "Get user details" CSV Data Set Config as it may resolve into a different location as it recommended in Using CSV DATA SET CONFIG
Is CSVSample.jtl file generated anywhere (again it should be in the root of your project by default)? What is in it?
The code looks good and I'm pretty sure that the problem is with the path to CSVSample_user.csv file and you have something like java.io.FileNotFoundException in your log. Please double check that CSVSample.jmx file contains valid full path to CSVSample_user.csv.
UPDATE TO ANSWER QUESTIONS IN COMMENTS
jmeter.log file should be under your Eclipse workspace folder by default
Looking into CSVSample.jmx there is a View Resulst in Table listener which is configured to store results under ~/CSVSample.jtl
If you want to see summarizer messages and "classic" .jtl reporting add next few lines before jmeter.configure(testPlanTree); stanza
Summariser summer = null;
String summariserName = JMeterUtils.getPropDefault("summariser.name", "summary");
if (summariserName.length() > 0) {
summer = new Summariser(summariserName);
}
String logFile = "/path/to/jtl/results/file.jtl";
ResultCollector logger = new ResultCollector(summer);
logger.setFilename(logFile);
testPlanTree.add(testPlanTree.getArray()[0], logger);
Try using library - https://github.com/abstracta/jmeter-java-dsl.
It supports implementing JMeter test as java code.
Below example shows how to implement and execute test for REST API. Same approach could be applied to other type of tests as well.
#Test
public void testPerformance() throws IOException {
TestPlanStats stats = testPlan(
threadGroup(2, 10,
httpSampler("http://my.service")
.post("{\"name\": \"test\"}", Type.APPLICATION_JSON)
),
//this is just to log details of each request stats
jtlWriter("test" + Instant.now().toString().replace(":", "-") + ".jtl")
).run();
assertThat(stats.overall().elapsedTimePercentile99()).isLessThan(Duration.ofSeconds(5));
}

How do you write a Scala script that will react to file changes

I would like to change the following batch script to Scala (just for fun), however, the script must keep running and listen for changes to the *.mkd files. If any file is changed, then the script should re-generate the affected doc. File IO has always been my Achilles heel...
#!/bin/sh
for file in *.mkd
do
pandoc --number-sections $file -o "${file%%.*}.pdf"
done
Any ideas around a good approach to this will be appreciated.
The following code, taken from my answer on: Watch for project files also can watch a directory and execute a specific command:
#!/usr/bin/env scala
import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._
val file = Paths.get(args(0))
val cmd = args(1)
val watcher = FileSystems.getDefault.newWatchService
file.register(
watcher,
StandardWatchEventKinds.ENTRY_CREATE,
StandardWatchEventKinds.ENTRY_MODIFY,
StandardWatchEventKinds.ENTRY_DELETE
)
def exec = cmd run true
#scala.annotation.tailrec
def watch(proc: Process): Unit = {
val key = watcher.take
val events = key.pollEvents
val newProc =
if (!events.isEmpty) {
proc.destroy()
exec
} else proc
if (key.reset) watch(newProc)
else println("aborted")
}
watch(exec)
Usage:
watchr.scala markdownFolder/ "echo \"Something changed!\""
Extensions have to be made to the script to inject file names into the command. As of now this snippet should just be regarded as a building block for the actual answer.
Modifying the script to incorporate the *.mkd wildcards would be non-trivial as you'd have to manually search for the files and register a watch on all of them. Re-using the script above and placing all files in a directory has the added advantage of picking up new files when they are created.
As you can see it gets pretty big and messy pretty quick just relying on Scala & Java APIs, you would be better of relying on alternative libraries or just sticking to bash while using INotify.