What is the encoding of the GIFs included in the following SEC document?
https://www.sec.gov/Archives/edgar/data/1010775/000095012310034416/0000950123-10-034416.txt
Also I wonder about the whole document, but really I just want to see the juicy pics...
Example text, too long to fully quote:
<DOCUMENT>
<TYPE>GRAPHIC
<SEQUENCE>2
<FILENAME>h72151fh72151z0001.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h72151fh72151z0001.gif
M1TE&.#=AT`(<`O<``````.##X$!`0``X<$B#T"`#($"0R!#8&%!XJ`!0B"##
M(`!8F$!#B,"XP$"8,"A8#``H8*B#H.#P^"!`:`!(>(B(B$"08)"#N#"0&&!X
MH%"8B'AX>"A0<*C`T!#X8$"0T"B(P"A(<&"0P("#N,##Z"`X8"A0#"!(:&AP
M:`!XN!#X6"!(<)BPR.C#X&B0L,C8X&AH:$!XF!!`<"!8#"!`8*BXR&B8R(BH
MP'"HT-#0T%!84)"HP"`X6!A(<%"`H"A#B+#(V.#PX!"`N&B8V`#("$"8R)#`
MX("`B!A`8/CX^"!(>,#(V`!PL)B#H+BXN"A0>("8L.CP\&B(H("PX&B#V-C8
MV)B8F%"(L"!0>$AHD'"#V##P.%AXF*C0Z$A(2,#0V-C#X&B8T.#HZ"A(>&"8
MR!A`6!A(>#!HD`!XP)"PP("XV'B0J+C`T%"`N+"PJ#!8#-#8X$AXL'B`>/#X
M^&"HT'BHV`!HJ``X:`!PN`!`<$!HD,C`R)C(X%!PF'"(J!A`<!`0$/#P\$AP
MD)"0D#`P,!A0#`A(>""(P.CH\+C8Z$"0:-CHZ,C0V#B0R&!#8&"8T-##Z#A#
MB%"`L'B0L/#P^)BXR!!0#&B`H(B8L#!8B'"8L.#P\%!04'"HV)C`X,#8Z&BP
MV`!`>)BHP.#H\&"#T&"(J*"PP+C(T`#X:&"8P&"0R.##Z$!HB#AHD!B`P%!X
MF%BHV(C`X'BXV%!PD+#`T`!PP)"HN#"0R'"0J,C0X-##\*"8H`AXN+C`P`!#
MH*BHJ+"PN``P:%"(N'B8L&"`F`!0F*#`T+#0Z+"PL-C8X-C#Z'"PV&"`H*"P
<!-- etc -->
</DOCUMENT>
It's uuencoding. If you drop the contents into a file:
begin 644 h72151fh72151z0001.gif
M1TE&.#=AT`(<`O<``````.##X$!`0``X<$B#T"`#($"0R!#8&%!XJ`!0B"##
M(`!8F$!#B,"XP$"8,"A8#``H8*B#H.#P^"!`:`!(>(B(B$"08)"#N#"0&&!X
MH%"8B'AX>"A0<*C`T!#X8$"0T"B(P"A(<&"0P("#N,##Z"`X8"A0#"!(:&AP
...
M^\B/):"/_]B/`EF.`4F0[TB0`)F0"IF0"'F0"^F0`AF1_S#TD?Q8D?AXD?68
MD?&XD?#8D>SXD>L8DNHXDNE8DOD(D2G9D"H)D?ZXD!1)C_/XDN9(`V40$``A
>_AI3;V9T=V%R93H#36EC<F]S;V9T($]F9FEC90`[
`
end
You can use the uudecode utility or Python's uu module to decode it. Here's a Python program to extract all of them from your example text file:
import re
import sys
import uu
from io import BytesIO

# Match each uuencoded block, from its "begin" line through its "end" line.
p = re.compile(r"^begin.*\n(.*\n)+?^end$", re.M)

with open(sys.argv[1]) as f:
    for m in p.finditer(f.read()):
        # uu.decode writes the output to the filename given on the "begin" line.
        uu.decode(BytesIO(m.group(0).encode("utf-8")))
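If you want to avoid the uu module (deprecated in Python 3.11 and removed in 3.13), the lower-level binascii functions in the standard library do the same per-line conversion; a minimal round-trip sketch (the data bytes here are just a stand-in, not from the SEC file):

```python
import binascii

# Encode up to 45 bytes as one uuencoded line, then decode it back.
data = b"GIF87a..."           # stand-in for real image bytes
line = binascii.b2a_uu(data)  # one uuencoded line, ending in a newline
decoded = binascii.a2b_uu(line)
assert decoded == data
```

To decode a whole uuencoded body this way, strip the `begin`/`end` lines and call `a2b_uu` on each remaining line, concatenating the results.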
Result: the decoded .gif files are written to the current directory.
I want to know how big files are within my repository in terms of lines of code, to see the 'health' of a repository.
In order to answer this, I would like to see a distribution (visualised or not) of the number of files for a specific range (can be 1):
#lines of code    #files
1-10              1
11-20             23
etc...
(A histogram of this would be nice)
Is there a quick way to get this, with for example cloc or any other (command-line) tool?
A combination of cloc and Pandas can handle this. First, capture the line counts with cloc to a CSV file using the --by-file and --csv switches, for example
cloc --by-file --csv --out data.csv curl-7.80.0.tar.bz2
then use the Python program below to aggregate and bin the data by folders:
./aggregate_by_folder.py data.csv
The code for aggregate_by_folder.py is
#!/usr/bin/env python
import sys
import os.path
import pandas as pd

def add_folder(df):
    """
    Return a Pandas dataframe with an additional 'folder' column
    containing each file's parent directory
    """
    header = 'github.com/AlDanial/cloc'
    df = df.drop(df.columns[df.columns.str.contains(header)], axis=1)
    df['folder'] = df['filename'].dropna().apply(os.path.dirname)
    return df

def bin_by_folder(df):
    bins = list(range(0, 1000, 50))
    return df.groupby('folder')['code'].value_counts(bins=bins).sort_index()

def file_count_by_folder(df):
    df_files = pd.pivot_table(df, index=['folder'], aggfunc='count')
    file_counts = df_files.rename(columns={'blank': 'file count'})
    return file_counts[['file count']]

def main():
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} data.csv")
        print("       where the .csv file is created with")
        print("       cloc --by-file --csv --out data.csv my_code_base")
        raise SystemExit
    pd.set_option('display.max_rows', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)  # -1 is deprecated in recent pandas
    df = add_folder(pd.read_csv(sys.argv[1]))
    print(pd.pivot_table(df, index=['folder'], aggfunc='sum'))
    print('-' * 50)
    print(file_count_by_folder(df))
    print('-' * 50)
    print(bin_by_folder(df))

if __name__ == "__main__":
    main()
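If pandas is not to hand, the binning step on its own can be sketched with just the standard library: bucket each file's line count into a 50-line-wide bin keyed by the bin's lower edge, then count files per bin (the file names and counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical per-file line counts standing in for cloc's CSV output.
line_counts = {"src/a.c": 120, "src/b.c": 35, "lib/util.c": 60}

# Map each count to its 50-line bin's lower edge and tally files per bin.
hist = Counter((n // 50) * 50 for n in line_counts.values())
# hist now maps bin lower edges (0, 50, 100, ...) to file counts
```

Printing `sorted(hist.items())` gives the same range/count table as the question asks for.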
How can I rename the files efficiently by the number in the name (see picture)? I did not succeed with Windows PowerToys, and I don't want to click each file and rename it to the number (e.g. 290).
Alternatively, how can I read the files in this order and define a name? If I try it with a script (see below), the following error occurs:
ValueError: invalid literal for int() with base 10: '211001_164357_P_Scripted_Powermeasurement_Wavelength_automatic_Powermeter1_0'
Or how can I select only the numbers (290 to 230, see picture) within the name when reading?
Script:
# import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

data_location = r'C:\Users\...\Characterization_OPO\Data'
data_folder = Path(data_location)
data = {}
allist = list(data_folder.glob('*'))
for i, file in enumerate(allist):
    file = str(file)
    file_name = file.split('\\')[-1]
    wavelength = int(file_name.split('.')[0])
    tmp = pd.read_csv(file, skiprows=20, skipfooter=59, index_col="PixelNo")
    data[f'{wavelength} nm'] = tmp
    # data.plot(x='Wavelength', y='CCD_1', label=f"{wavelength} nm")
Picture:
I removed all the words with Windows PowerToys PowerRename and then took the last three digits:
for i, file in enumerate(allist):
    file = str(file)
    file_name = file.split('\\')[-1]
    wavelength = int(file_name.split('.')[0])
    tmp = pd.read_csv(file, skiprows=26, skipfooter=5)
    data[f'{wavelength % 1000} nm'] = tmp
    # data.plot(x='Wavelength', y='CCD_1', label=f"{wavelength} nm")
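An alternative that avoids renaming altogether is to pull the trailing digit run out of the original name with a regular expression; a minimal sketch, with file names modeled on the question (the exact names are assumptions):

```python
import re

# Hypothetical names like those in the question; the regex grabs the
# trailing run of digits just before the extension.
names = [
    "211001_164357_P_Scripted_Powermeasurement_Wavelength_automatic_Powermeter1_0290.txt",
    "211001_164357_P_Scripted_Powermeasurement_Wavelength_automatic_Powermeter1_0230.txt",
]
wavelengths = [int(re.search(r"(\d+)\.\w+$", n).group(1)) % 1000 for n in names]
# wavelengths -> [290, 230]
```

This also sidesteps the ValueError, since int() only ever sees the digit group instead of the whole stem.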
I am using TensorRT to perform inference with CUDA. I'd like to use CuPy to preprocess some images that I'll feed to the TensorRT engine. The preprocessing function, called my_function, works fine as long as TensorRT is not run between different calls of the my_function method (see code below). Specifically, the issue is not strictly related to TensorRT but to the fact that TensorRT inference needs to be wrapped by push and pop operations on the PyCUDA context.
With respect to the following code, the last execution of my_function will raise the following error:
File "/home/ubuntu/myfile.py", line 188, in _pre_process_cuda
img = ndimage.zoom(img, scaling_factor)
File "/home/ubuntu/.local/lib/python3.6/site-packages/cupyx/scipy/ndimage/interpolation.py", line 482, in zoom
kern(input, zoom, output)
File "cupy/core/_kernel.pyx", line 822, in cupy.core._kernel.ElementwiseKernel.__call__
File "cupy/cuda/function.pyx", line 196, in cupy.cuda.function.Function.linear_launch
File "cupy/cuda/function.pyx", line 164, in cupy.cuda.function._launch
File "cupy_backends/cuda/api/driver.pyx", line 299, in cupy_backends.cuda.api.driver.launchKernel
File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
Note: in the following code I haven't included the entire TensorRT inference code; in fact, simply pushing and popping a PyCUDA context is enough to generate the error.
Code:
import numpy as np
import cv2
import time
from PIL import Image
import requests
from io import BytesIO
from matplotlib import pyplot as plt
import cupy as cp
from cupyx.scipy import ndimage
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
def my_function(numpy_frame):
    dtype = 'float32'
    img = cp.array(numpy_frame, dtype='float32')
    # print(img)
    img = ndimage.zoom(img, (0.5, 0.5, 3))
    img = (cp.array(2, dtype=dtype) / cp.array(255, dtype=dtype)) * img - cp.array(1, dtype=dtype)
    img = img.transpose((2, 0, 1))
    img = img.ravel()
    return img
# load image
url = "https://www.pexels.com/photo/109919/download/?search_query=&tracking_id=411xe21veam"
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img = np.array(img)
# initialize tensorrt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
cfx = cuda.Device(0).make_context()
my_function(img) # ok
my_function(img) # ok
# ----- TENSORRT ---------
cfx.push()
# .... tensorrt inference....
cfx.pop()
# ----- TENSORRT ---------
my_function(img) # <---- error
I even tried to do it other ways, but unfortunately with the same result:
cfx.push()
my_function(img) # ok
cfx.pop()
cfx.push()
my_function(img) # error
cfx.pop()
#admin: if you can think of a better name for this question feel free to edit it :)
There were multiple contexts open. For instance, it seems that each of the following opens a context:
import pycuda.autoinit
cfx = cuda.Device(0).make_context()
cfx.push()
So if you run the three commands above, then simply running one cfx.pop() won't be enough. You will need to run cfx.pop() three times to pop all the contexts.
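The bookkeeping behind that rule can be pictured as a plain stack (this is a toy model with strings, not real PyCUDA calls): every autoinit/make_context()/push() adds a frame, and each pop() removes exactly one:

```python
# Toy model of the CUDA context stack.
stack = []
stack.append("autoinit")       # import pycuda.autoinit
stack.append("make_context")   # cfx = cuda.Device(0).make_context()
stack.append("push")           # cfx.push()
stack.pop()                    # one cfx.pop() still leaves two frames
assert len(stack) == 2         # two more pops are needed to empty it
```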
I have a set of .msg files stored on the E:/ drive that I have to read and extract some information from. For that I am using the code below in Python 3.6:
from email.parser import Parser
p = Parser()
headers = p.parse(open('E:/Ratan/msg_files/Test1.msg', encoding='Latin-1'))
print('To: %s' % headers['To'])
print('From: %s' % headers['From'])
print('Subject: %s' % headers['subject'])
In the output I am getting as below.
To: None
From: None
Subject: None
I am not getting the actual values in the To, From and Subject fields.
Any thoughts on why it is not printing the actual values?
Please download my sample msg file from this link:
drive.google.com/file/d/1pwWWG3BgsMKwRr0WmP8GqzG3WX4GmEy6/view
Here is a demonstration of how to use some of python's standard email libraries.
You didn't show us your input file in the question, and the g-drive URL is a dead link.
The code below looks just like yours and works fine, so I don't know what is odd about your environment, modulo some Windows 'rb' binary-open issue, CRLFs, or the Latin-1 encoding.
I threw in .upper() but it does nothing beyond showing that the API is case insensitive.
#! /usr/bin/env python3

from email.parser import Parser
from pathlib import Path
import mailbox

def extract_messages(maildir, mbox_file, k=2, verbose=False):
    # Split the mbox into one plain-text file per message.
    for n, message in enumerate(mailbox.mbox(mbox_file)):
        with open(maildir / f'{n}.txt', 'w') as fout:
            fout.write(str(message))

    hdrs = 'From Date Subject In-Reply-To References Message-ID'.split()
    p = Parser()
    for i in range(min(k, n)):
        with open(maildir / f'{i}.txt') as fin:
            msg = p.parse(fin)
        print([len(msg[hdr.upper()] or '')
               for hdr in hdrs])
        for k, v in msg.items():
            print(k, v)
        print('')
        if verbose:
            print(msg.get_payload())

if __name__ == '__main__':
    # from https://mail.python.org/pipermail/python-dev/
    maildir = Path('/tmp/py-dev/')
    extract_messages(maildir, maildir / '2018-January.txt')
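As a quick sanity check that the parser itself behaves, you can feed Parser a known-good RFC 822 message from a string. If headers still come back as None on your own file, the input is probably not plain-text mail at all (an Outlook .msg is a binary OLE file, which email.parser cannot read):

```python
from email.parser import Parser

# A minimal RFC 822 message: headers, blank line, body.
raw = (
    "To: alice@example.com\n"
    "From: bob@example.com\n"
    "Subject: test message\n"
    "\n"
    "Hello!\n"
)
msg = Parser().parsestr(raw)
# Header lookup is case-insensitive, so msg['subject'] works too.
```

Here `msg['To']` returns the address string rather than None, which confirms the parsing side is fine.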
For Matplotlib plots in iPython/Jupyter you can make the notebook plot plots inline with
%matplotlib inline
How can one do the same for NLTK's tree draw()? Here is the documentation: http://www.nltk.org/api/nltk.draw.html
Based on this answer:
import os
from IPython.display import Image, display
from nltk.draw import TreeWidget
from nltk.draw.util import CanvasFrame

def jupyter_draw_nltk_tree(tree):
    cf = CanvasFrame()
    tc = TreeWidget(cf.canvas(), tree)
    tc['node_font'] = 'arial 13 bold'
    tc['leaf_font'] = 'arial 14'
    tc['node_color'] = '#005990'
    tc['leaf_color'] = '#3F8F57'
    tc['line_color'] = '#175252'
    cf.add_widget(tc, 10, 10)
    cf.print_to_file('tmp_tree_output.ps')
    cf.destroy()
    os.system('convert tmp_tree_output.ps tmp_tree_output.png')
    display(Image(filename='tmp_tree_output.png'))
    os.system('rm tmp_tree_output.ps tmp_tree_output.png')
A little slow, but it does the job. If you're doing it remotely, don't forget to run your ssh session with the -X flag (like ssh -X user@server.com) so that Tk can initialize itself (otherwise you'll get a "no display name and no $DISPLAY environment variable" kind of error).
UPD: it seems recent versions of jupyter and nltk work nicely together, so you can just do IPython.core.display.display(tree) to get a nice-looking tree render embedded into the output.
2019 Update:
This runs on Jupyter Notebook:
from nltk.tree import Tree
from IPython.display import display
tree = Tree.fromstring('(S (NP this tree) (VP (V is) (AdjP pretty)))')
display(tree)
Requirements:
NLTK
Ghostscript