How to collate pytest logging output to console? - pytest

I'd like to collate logging output to console such that the repeated "----ClassName.TestName---" and "-- Captured log call---" lines are removed or limited to a single entry. The simplified example below, with output, demonstrates the problem.
desired output:
2020-08-14 13:51:50 INFO test[test_01]
2020-08-14 13:51:50 INFO test[test_02]
2020-08-14 13:51:50 INFO test[test_03]
========= short test summary info =====================================
PASSED tests/test_logging.py::Test_Logging::test_01
PASSED tests/test_logging.py::Test_Logging::test_02
PASSED tests/test_logging.py::Test_Logging::test_03
source code:
import logging
import pytest

@pytest.mark.testing
class Test_Logging:
    _logger = None

    def setup_method(self):
        self._logger = logging.getLogger('Test Logger')

    def test_01(self, request):
        self._logger.info(f"test[{request.node.name}]")

    def test_02(self, request):
        self._logger.info(f"test[{request.node.name}]")

    def test_03(self, request):
        self._logger.info(f"test[{request.node.name}]")
current output:
_____________ Test_Logging.test_01 ____________________________________
-------- Captured log call --------------------------------------------
2020-08-14 13:51:50 INFO test[test_01]
_____________ Test_Logging.test_02 ____________________________________
-------- Captured log call --------------------------------------------
2020-08-14 13:51:50 INFO test[test_02]
_____________ Test_Logging.test_03 ____________________________________
-------- Captured log call --------------------------------------------
2020-08-14 13:51:50 INFO test[test_03]
========= short test summary info =====================================
PASSED tests/test_logging.py::Test_Logging::test_01
PASSED tests/test_logging.py::Test_Logging::test_02
PASSED tests/test_logging.py::Test_Logging::test_03
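One direction worth trying (a sketch, not a verified reproduction of the exact desired format above) is pytest's built-in live logging, which prints each record to the console as it is emitted instead of collecting it into a per-test "Captured log call" block:

# pytest.ini -- a minimal sketch; the format strings are illustrative
[pytest]
log_cli = true
log_cli_level = INFO
log_cli_format = %(asctime)s %(levelname)s %(message)s
log_cli_date_format = %Y-%m-%d %H:%M:%S

Depending on verbosity, pytest may still print a short "live log call" header per test, so this reduces rather than completely removes the repeated section lines.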

Related

Distribute group tasks evenly using pandas_udf in PySpark

I have a Spark Dataframe which contains groups of training data. Each group is identified by the "group" column.
group | feature_1 | feature_2 | label
--------------------------------------
1 | 123 | 456 | 0
1 | 553 | 346 | 1
... | ... | ... | ...
2 | 623 | 498 | 0
2 | 533 | 124 | 1
... | ... | ... | ...
I want to train a python ML model (lightgbm in my case) for each group in parallel.
Therefore I have the following working code:
import pickle

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

schema = T.StructType([T.StructField("group_id", T.IntegerType(), True),
                       T.StructField("model", T.BinaryType(), True)])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def _fit(pdf):
    group_id = pdf.loc[0, "group"]
    X = pdf.loc[:, X_col]
    y = pdf.loc[:, y_col].values
    # train model
    model = ...
    out_df = pd.DataFrame(
        [[group_id, pickle.dumps(model)]],
        columns=["group_id", "model"]
    )
    return out_df

df.groupby("group").apply(_fit)
I have 10 groups in the dataset and 10 worker nodes.
Most of the time, each group is assigned to its own executor and the processing is very quick.
Sometimes, however, more than one group is assigned to the same executor while other executors are left idle.
This makes the processing very slow, because that executor has to train multiple models at the same time.
Question: how do I schedule each group to train on a separate executor to avoid this problem?
I think you're going to want to look into setting the following two Spark configurations:
spark.task.cpus (the number of CPUs per task)
spark.executor.cores (the number of CPUs per executor)
I believe setting spark.executor.cores = spark.task.cpus = (cores per worker - 1) might solve your problem, since each executor would then only have room to schedule one of these tasks at a time.
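For example, a minimal sketch of how those could be set when building the session (the core counts below are illustrative and assume 8-core workers):

from pyspark.sql import SparkSession

# Assuming 8-core workers: with task.cpus equal to executor.cores, Spark can
# schedule at most one grouped-map task per executor at a time.
spark = (
    SparkSession.builder
    .appName("per-group-training")
    .config("spark.executor.cores", "7")
    .config("spark.task.cpus", "7")
    .getOrCreate()
)

The same settings can also be passed on the command line, e.g. spark-submit --conf spark.executor.cores=7 --conf spark.task.cpus=7.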

Read pipe separated values in ksql

I am working on a POC where I have to read a pipe-separated value file and insert the records into MS SQL Server.
I am using Confluent 5.4.1 so that I can use the VALUE_DELIMITER property of CREATE STREAM, but it is throwing the exception: Delimeter only supported with DELIMITED format
1. Start Confluent (version: 5.4.1):
[Dev root # myip ~]
# confluent local start
The local commands are intended for a single-node development environment
only, NOT for production usage. https://docs.confluent.io/current/cli/index.html
Using CONFLUENT_CURRENT: /tmp/confluent.vHhSRAnj
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]
Starting ksql-server
ksql-server is [UP]
Starting control-center
control-center is [UP]
[Dev root # myip ~]
# jps
49923 KafkaRestMain
50099 ConnectDistributed
49301 QuorumPeerMain
50805 KsqlServerMain
49414 SupportedKafka
52103 Jps
51020 ControlCenter
1741
49646 SchemaRegistryMain
[Dev root # myip ~]
#
2. Create Topic:
[Dev root # myip ~]
# kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic SampleData
Created topic SampleData.
3. Provide pipe-separated data to the SampleData topic:
[Dev root # myip ~]
# kafka-console-producer --broker-list localhost:9092 --topic SampleData <<EOF
> this is col1|and now col2|and col 3 :)
> EOF
>>[Dev root # myip ~]
#
4. Start KSQL:
[Dev root # myip ~]
# ksql
===========================================
= _ __ _____ ____ _ =
= | |/ // ____|/ __ \| | =
= | ' /| (___ | | | | | =
= | < \___ \| | | | | =
= | . \ ____) | |__| | |____ =
= |_|\_\_____/ \___\_\______| =
= =
= Streaming SQL Engine for Apache Kafka® =
===========================================
Copyright 2017-2019 Confluent Inc.
CLI v5.4.1, Server v5.4.1 located at http://localhost:8088
Having trouble? Type 'help' (case-insensitive) for a rundown of how things work!
5. Declare a schema for the existing topic: SampleData
ksql> CREATE STREAM sample_delimited (
> column1 varchar(1000),
> column2 varchar(1000),
> column3 varchar(1000))
> WITH (KAFKA_TOPIC='SampleData', VALUE_FORMAT='DELIMITED', VALUE_DELIMITER='|');
Message
----------------
Stream created
----------------
6. Verify the data in the KSQL stream:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> SELECT * FROM sample_delimited emit changes limit 1;
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|ROWTIME |ROWKEY |COLUMN1 |COLUMN2 |COLUMN3 |
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|1584339233947 |null |this is col1 |and now col2 |and col 3 :) |
Limit Reached
Query terminated
7. Write a new Kafka topic, SampleDataAvro, that serializes all the data from the sample_delimited stream to Avro format:
ksql> CREATE STREAM sample_avro WITH (KAFKA_TOPIC='SampleDataAvro', VALUE_FORMAT='AVRO') AS SELECT * FROM sample_delimited;
Delimeter only supported with DELIMITED format
ksql>
8. The above statement gives the exception:
Delimeter only supported with DELIMITED format
9. Load the MS SQL Kafka Connect configuration (a representative properties file is sketched below):
confluent local load test-sink -- -d ./etc/kafka-connect-jdbc/sink-quickstart-mssql.properties
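For reference, the kind of JDBC sink configuration that command loads looks roughly like the following; the values are illustrative (they mirror the connector config shown later in the answer), not the actual contents of the quickstart file:

name=test-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=SampleDataAvro
connection.url=jdbc:sqlserver://mssql:1433
connection.user=sa
connection.password=Admin123
auto.create=true
insert.mode=insert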
The only time you need to specify the delimiter is when you define the stream that is reading from the source topic.
Here's my worked example:
Populate a topic with pipe-delimited data:
$ kafkacat -b localhost:9092 -t SampleData -P<<EOF
this is col1|and now col2|and col 3 :)
EOF
Declare a stream over it
CREATE STREAM sample_delimited (
column1 varchar(1000),
column2 varchar(1000),
column3 varchar(1000))
WITH (KAFKA_TOPIC='SampleData', VALUE_FORMAT='DELIMITED', VALUE_DELIMITER='|');
Query the stream to make sure it works
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> SELECT * FROM sample_delimited emit changes limit 1;
+----------------+--------+---------------+--------------+--------------+
|ROWTIME |ROWKEY |COLUMN1 |COLUMN2 |COLUMN3 |
+----------------+--------+---------------+--------------+--------------+
|1583933584658 |null |this is col1 |and now col2 |and col 3 :) |
Limit Reached
Query terminated
Reserialise the data to Avro:
CREATE STREAM sample_avro WITH (KAFKA_TOPIC='SampleDataAvro', VALUE_FORMAT='AVRO') AS SELECT * FROM sample_delimited;
Dump the contents of the topic - note that it is now Avro:
ksql> print SampleDataAvro;
Key format: UNDEFINED
Value format: AVRO
rowtime: 3/11/20 1:33:04 PM UTC, key: <null>, value: {"COLUMN1": "this is col1", "COLUMN2": "and now col2", "COLUMN3": "and col 3 :)"}
The error that you're hitting is a result of bug #4200. You can wait for the next release of Confluent Platform, or use standalone ksqlDB, in which the issue is already fixed.
Here's using ksqlDB 0.7.1 streaming the data to MS SQL:
CREATE SINK CONNECTOR SINK_MSSQL WITH (
'connector.class' = 'io.confluent.connect.jdbc.JdbcSinkConnector',
'connection.url' = 'jdbc:sqlserver://mssql:1433',
'connection.user' = 'sa',
'connection.password' = 'Admin123',
'topics' = 'SampleDataAvro',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter',
'auto.create' = 'true',
'insert.mode' = 'insert'
);
Now query the data in MS SQL
1> Select @@version
2> go
---------------------------------------------------------------------
Microsoft SQL Server 2017 (RTM-CU17) (KB4515579) - 14.0.3238.1 (X64)
Sep 13 2019 15:49:57
Copyright (C) 2017 Microsoft Corporation
Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS)
(1 rows affected)
1> SELECT * FROM SampleDataAvro;
2> GO
COLUMN3 COLUMN2 COLUMN1
-------------- --------------- ------------------
and col 3 :) and now col2 this is col1
(1 rows affected)

@eval escapes $ in @testset name argument

I have several @testsets in a file and I want to align their output.
At first it looks like this:
Test Summary: | Pass  Total
short         |    1      1
Test Summary:         | Pass  Total
test with longer name |    1      1
I want the | aligned so that it goes like this:
Test Summary:             | Pass  Total
short                     |    1      1
Test Summary:             | Pass  Total
test with longer name     |    1      1
I tried using @sprintf to make the names the same length, but the @testset macro needs a string literal. So I used @eval to interpolate a formatted string into the name:
using Test
using Printf

testName(name) = @sprintf("%-25s", name)

@eval begin
    @testset $(testName("short")) begin
        @test true
    end
    @testset $(testName("test with longer name")) begin
        @test true
    end
end # @eval
This gives me what I want for simple testsets, but falls short on ones with for
loop variables interpolated into the name.
If I add a test like this:
@testset $(testName("some for test \$i")) for i in 9:11
    @test i > 0
end
the output is
Test Summary:             | Pass  Total
short                     |    1      1
Test Summary:             | Pass  Total
test with longer name     |    1      1
Test Summary:             | Pass  Total
some for test $i          |    1      1
Test Summary:             | Pass  Total
some for test $i          |    1      1
Test Summary:             | Pass  Total
some for test $i          |    1      1
It looks like the eval macro escapes the $ somehow and it no longer works with the @testset macro.
Why doesn't it work and how can I force the $ to exist unescaped in the name string literal? (suggestions on how to align the results also welcome).
Use a String with interpolation as the argument to the @testset macro:
julia> @testset "$(testName("some for test $i"))" for i in 9:11
           @test i > 0
       end;
Test Summary:             | Pass  Total
some for test 9           |    1      1
Test Summary:             | Pass  Total
some for test 10          |    1      1
Test Summary:             | Pass  Total
some for test 11          |    1      1
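Putting it together, the @eval wrapper can be dropped entirely. Here is a sketch of the whole file using only string interpolation (keeping the testName helper from the question):

using Test
using Printf

# Pad every testset name to a fixed width so the summary columns line up.
testName(name) = @sprintf("%-25s", name)

@testset "$(testName("short"))" begin
    @test true
end

@testset "$(testName("test with longer name"))" begin
    @test true
end

@testset "$(testName("some for test $i"))" for i in 9:11
    @test i > 0
end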

Splunk query using append

I have a query that counts batch logs from two different time slots and shows the output using the append command. In the first time slot I get a batch log which is not there in the second time slot, but in the output of the appended query the jobs that appear in only one time slot are missing.
Queries I am using:
index=main sourcetype=xml "MSR*" earliest=-30d latest=-15d
In the above query I get the MSR1451 batch in the output.
index=main sourcetype=xml "MSR*" earliest=-14d latest=now()
In the above query I do not get that MSR1451 batch.
index=main sourcetype=xml "MSR*" earliest=-30d latest=-15d
| fields jobName
| eval marker="Before 15 days"
| append
[search index=main sourcetype=xml "MSR*" earliest=-30d latest=-15d
|fields jobName
| eval marker="After 15 days"]
| stats count (eval(marker="Before 15 days")) AS Before 15 days, count (eval(marker="After 15 days")) AS After 15 days by JobName
In the above query I'm getting only the common jobs that appear in both time slots. I need the jobs that appear in only one time slot to be listed as well.
Do you really mean to use earliest=-30d and latest=-15d in both your main search and your subsearch?
In the query you posted, you are using the same values for earliest and latest in both searches. You need to do something like:
index=main sourcetype=xml "MSR*" earliest=-30d latest=-15d
| fields jobName
| eval marker="Before 15 days"
| append
    [search index=main sourcetype=xml "MSR*" earliest=-15d latest=now()
    | fields jobName
    | eval marker="After 15 days"]
| stats count(eval(marker=="Before 15 days")) AS "Before 15 days",
        count(eval(marker=="After 15 days")) AS "After 15 days" BY jobName
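As a side note, here is a sketch of an alternative that avoids the subsearch entirely by running one 30-day search and bucketing events on _time (untested against your data, and assuming jobName is the extracted field):

index=main sourcetype=xml "MSR*" earliest=-30d latest=now()
| eval marker=if(_time < relative_time(now(), "-15d"), "Before 15 days", "After 15 days")
| stats count(eval(marker=="Before 15 days")) AS "Before 15 days",
        count(eval(marker=="After 15 days")) AS "After 15 days" BY jobName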

Spark Scala Cassandra

Please see the code below and let me know where I am going wrong.
Using:
DSE Version - 5.1.0
Connected to Test Cluster at 172.31.16.45:9042.
[cqlsh 5.0.1 | Cassandra 3.10.0.1652 | DSE 5.1.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
Thanks
Cassandra Table :
cqlsh:tdata> select * from map;
sno | name
-----+------
1 | One
2 | Two
-------------------------------------------
scala> :showSchema tdata
========================================
Keyspace: tdata
========================================
Table: map
----------------------------------------
- sno : Int (partition key column)
- name : String
scala> val rdd = sc.cassandraTable("tdata", "map")
scala> rdd.foreach(println)
I am not getting any output here, not even an error.
You have hit a very common Spark issue. Your println code is being executed on your remote executor JVMs, which means the output goes to the STDOUT of the executor JVM process. If you want to bring the data back to the driver JVM before printing, you need a collect call.
rdd
.collect //Change from RDD to local collection
.foreach(println)
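If the table is large, collect will pull the entire RDD onto the driver; a lighter-weight spot check (a sketch using the standard RDD API) is to bring back only a few rows:

rdd
  .take(10)          // returns at most 10 rows as a local Array
  .foreach(println)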