How do you use BeautifulSoup to fetch data in a specific format? - dom

I would like to write Python code that fetches the average volume of a stock from a given link using BeautifulSoup.
What I have done so far:
import bs4
import requests
from bs4 import BeautifulSoup
r=requests.get('https://finance.yahoo.com/quote/M/key-statistics?p=M')
soup=BeautifulSoup(r.content,"html.parser")
# p = soup.find_all(class_="Fw(500) Ta(end) Pstart(10px) Miw(60px)")[1].get_text
# p = soup.find_all('td')[2].get_text
# p = soup.find_all('table', class_='W(100%) Bdcl(c)')[70].tr.get_text
Anyway, I was able to get that number directly from the browser console using this command:
document.querySelectorAll('table tbody tr td')[71].innerText
"21.07M"
Please help with a basic explanation; I only know a little about the DOM.

You can use this logic to make it easier:
Use find_all to find all the spans in the HTML
Search the spans for the correct label (Avg Vol...)
Use parent to go up the hierarchy to the full table row
Use find_all again from that row to get the last cell, which contains the value
Here is the updated code:
import bs4
import requests
from bs4 import BeautifulSoup
r=requests.get('https://finance.yahoo.com/quote/M/key-statistics?p=M')
soup=BeautifulSoup(r.content,"html.parser")
p = soup.find_all('span')
for s in p:  # each span
    if s.text == 'Avg Vol (10 day)':  # the label cell we are looking for
        pnt = s.parent.parent  # up 2 levels, to the full table row
        print(pnt.find_all('td')[-1].text)  # last table cell holds the value
Output
21.76M
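A slightly more direct variant (just a sketch; it assumes the label text matches exactly) lets BeautifulSoup locate the label and the enclosing row for you:
label = soup.find('span', string='Avg Vol (10 day)')
if label is not None:
    row = label.find_parent('tr')  # nearest enclosing table row, however deep the nesting
    print(row.find_all('td')[-1].text)  # last cell holds the value
find_parent('tr') walks up until it hits a tr ancestor, so it does not depend on the label sitting exactly two levels below the row.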

Related

Pyspark - How to calculate file hashes

I have a bunch of CSV files in a mounted blob container, and I need to calculate the SHA1 hash value of every file to store as an inventory. I'm very new to Azure cloud and pyspark, so I'm not sure how this can be achieved efficiently. I have written the following code in Python/pandas and I'm trying to use it in pyspark. It seems to work, however it takes quite a while to run as there are thousands of CSV files. I understand that things work differently in pyspark, so can someone please advise whether my approach is correct, or whether there is a better piece of code I can use to accomplish this task?
import os
import hashlib
import pandas as pd

class File:
    def __init__(self, path):
        self.path = path

    def get_hash(self):
        # read in 4 KB chunks so large files do not exhaust memory
        hash = hashlib.sha1()
        with open(self.path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash.update(chunk)
        self.sha1hash = hash.hexdigest()
        return self.sha1hash

path = '/dbfs/mnt/data/My_Folder'  # Path to CSV files
cnt = 0
rlist = []
for path, subdirs, files in os.walk(path):
    for fi in files:
        if cnt < 10:  # check only 10 files for now as it takes ages!
            f = File(os.path.join(path, fi))
            cnt += 1
            hash_value = f.get_hash()
            results = {'File_Name': fi, 'File_Path': f.path, 'SHA1_Hash_Value': hash_value}
            rlist.append(results)
            print(fi)

df = pd.DataFrame(rlist)
print(str(cnt) + ' files processed')
#df.to_csv('/dbfs/mnt/workspace/Inventory/File_Hashes.csv', mode='a', header=False)  # not sure how to write files in pyspark!
display(df)
Thanks
Since you want to treat the files as blobs rather than read them into a table, I would recommend spark.sparkContext.binaryFiles. That lands you an RDD of pairs where the key is the file path and the value is the file content as bytes, so you can calculate the hash in a map function, e.g. rdd.mapValues(calculate_hash_of_bytes).
For more information, refer to the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.binaryFiles.html#pyspark.SparkContext.binaryFiles
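As an illustration (a sketch, assuming a Databricks-style SparkSession named spark; the paths and column names mirror the ones in the question):
import hashlib
# (file path, file content as bytes) pairs, one per file under the folder
rdd = spark.sparkContext.binaryFiles('dbfs:/mnt/data/My_Folder')
hashes = rdd.mapValues(lambda content: hashlib.sha1(content).hexdigest())
# convert the pair RDD to a DataFrame so it can be inspected or written out
df = hashes.toDF(['File_Path', 'SHA1_Hash_Value'])
df.write.mode('append').csv('dbfs:/mnt/workspace/Inventory/File_Hashes')
This keeps the hashing distributed across the cluster instead of looping over files on the driver, which is why it should scale better than the pandas version.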

Accessing array contents inside .mat file in python

I want to read a .mat file available at http://www.eigenvector.com/data/tablets/index.html. To access the data inside this file, I am trying the following:
import scipy.io as spio
import numpy as np
import matplotlib.pyplot as plt
mat = spio.loadmat('nir_shootout_2002.mat')
# http://pyhogs.github.io/reading-mat-files.html
def print_mat_nested(d, indent=0, nkeys=0):
    if nkeys > 0:
        d = {k: d[k] for k in d.keys()[:nkeys]}
    if isinstance(d, dict):
        for key, value in d.iteritems():
            print '\t' * indent + 'Key: ' + str(key)
            print_mat_nested(value, indent+1)
    if isinstance(d, np.ndarray) and d.dtype.names is not None:
        for n in d.dtype.names:
            print '\t' * indent + 'Field: ' + str(n)
            print_mat_nested(d[n], indent+1)

print_mat_nested(mat, nkeys=1)
The above call shows that the first key in the dictionary is "validate_1" and that it has a field "data". To access this field, I try:
t = mat['validate_1']
print(t['data'])
It prints an array, but when I use np.shape(t['data']) it just returns (1,1), whereas the data seems to be larger. I am not sure how to access the array inside t['data'].
Thanks for the help.
I found that the following works:
t = mat['validate_1']['data'][0,0]
print(np.shape(t))
It returns that t is an array of shape (40,650).
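For context: loadmat represents a MATLAB struct as a (1,1) object array, which is why the [0,0] indexing is needed before the real array appears. As an alternative (a sketch using documented scipy.io.loadmat options), squeeze_me strips the singleton dimensions at load time and struct_as_record=False exposes struct fields as attributes:
import scipy.io as spio
mat = spio.loadmat('nir_shootout_2002.mat', squeeze_me=True, struct_as_record=False)
t = mat['validate_1'].data  # fields become attributes with struct_as_record=False
print(t.shape)  # expected (40, 650), matching the answer above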

Print output to file no longer working - RPi 2, PN532 NFC RDW

For my classroom, I have a PN532 NFC card reader/writer hooked up via UART to a Raspberry Pi 2, and I'm using Type 2 NXP NTAG213 NFC cards to store information, specifically in the text record. While I'm weak in Python, I used the example under subheader 8.3 in the NFCPy documentation to write to the card, and used "How to redirect 'print' output to a file using python?" to complete the output process to a text file. For a while, the reading, writing, and outputting to my text file worked:
import nfc
import nfc.ndef
import nfc.tag
import os, sys
import subprocess
import glob
from os import path
import datetime

f = open('BankTransactions.txt', 'a')
sys.stdout = f  # redirect all print output to the file
path = '/home/pi/BankTransactions.txt'

def connected(tag): print(tag); return False

clf = nfc.ContactlessFrontend('tty:AMA0:pn532')
clf.connect(rdwr={'on-connect': connected})
tag = clf.connect(rdwr={'on-connect': connected})
record_1 = tag.ndef.message[0]
signature = nfc.tag.tt2_nxp.NTAG213  # nfcpy's Type 2 Tag module is tt2_nxp
today = datetime.date.today()
print(record_1.pretty())
if tag.ndef is not None:
    print(tag.ndef.message.pretty())
    if tag.ndef.is_writeable:
        text_record = nfc.ndef.TextRecord("Jessica has 19 GP on card")
        tag.ndef.message = nfc.ndef.Message(text_record)
print >> f, "Edited by Roman", today, record_1, signature, '\n'
f.close()
Now, however, when I use the same card for testing, it will not append the data to the text file. The data is still being written to the card, as I can read the information on the card with a simple read program.
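One guess at the failure mode (an assumption, since no traceback is shown): the script both redirects sys.stdout to f and later closes f, so any print that runs after f.close() targets a closed file. Writing to the log explicitly, without touching sys.stdout, sidesteps that class of problem; a minimal sketch with placeholder values:
import datetime
today = datetime.date.today()
record_text = "Jessica has 19 GP on card"  # placeholder for the record read from the tag
with open('/home/pi/BankTransactions.txt', 'a') as log:
    # the with-block flushes and closes the file even if an error occurs
    log.write("Edited by Roman {0} {1}\n".format(today, record_text))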

How do I add category names to my seaborn boxplot when my data is from a python dictionary?

I have some data that is sitting in a python dictionary of lists.
How can I use the keys from the dictionary as category labels for this boxplot?
Here is a sample of the dictionary, plot_data:
plot_data = {
    'Group1': [0.02339976, 0.03235323, 0.12835462, 0.10238375, 0.04223188],
    'Group2': [0.02339976, 0.03235323, 0.12835462, 0.10238375, 0.04223188]
}
This code is probably a mess, but here it is:
data = plot_data.values()
#Get data in proper format
fixed_data = list(sorted(data))
#Set up the graph parameters
sns.set(context='notebook', style='whitegrid')
sns.axlabel(xlabel="Groups", ylabel="Y-Axis", fontsize=16)
#Plot the graph
sns.boxplot(data=fixed_data, whis=np.inf, width=.18)
sns.swarmplot(data=fixed_data, size=6, edgecolor="black", linewidth=.9)
Here is how to add category labels "manually":
import seaborn as sns, matplotlib.pyplot as plt, operator as op
plot_data = {
    'Group1': range(10,16),
    'Group2': range(5,15),
    'Group3': range(1,5)
}
# sort keys and values together
sorted_keys, sorted_vals = zip(*sorted(plot_data.items(), key=op.itemgetter(1)))
# almost verbatim from question
sns.set(context='notebook', style='whitegrid')
sns.axlabel(xlabel="Groups", ylabel="Y-Axis", fontsize=16)
sns.boxplot(data=sorted_vals, width=.18)
sns.swarmplot(data=sorted_vals, size=6, edgecolor="black", linewidth=.9)
# category labels
plt.xticks(plt.xticks()[0], sorted_keys)
plt.show()
And here is the output: a boxplot/swarmplot with the dictionary keys as category labels on the x-axis.
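As a side note (assuming a reasonably recent seaborn), feeding the dictionary through a pandas DataFrame gives you the category labels for free, because seaborn uses the column names in wide-form data (values shortened from the question for brevity):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plot_data = {
    'Group1': [0.023, 0.032, 0.128, 0.102, 0.042],
    'Group2': [0.015, 0.044, 0.091, 0.077, 0.060]
}
sns.boxplot(data=pd.DataFrame(plot_data), width=.18)
plt.show()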

How to sort tree view on click on column header

I have a GtkTreeView with a GtkTreeStore in it, and I want to sort the entire table (GtkTreeView) when the user clicks on any of the column headers. I also want to left-align the text in the cells.
How to do this?
Enjoy!
#! /usr/bin/python
###########################################################
#
# Basic Gtk.TreeView Example with two sortable columns
#
###########################################################

# use the new PyGObject binding
from gi.repository import Gtk
import os
import getpass  # this is only to automatically find your home folder.

class MyWindow(Gtk.Window):

    def __init__(self):
        Gtk.Window.__init__(self, title='My Window Title')
        self.connect('delete-event', Gtk.main_quit)

        # Gtk.ListStore will hold data for the TreeView
        # Only the first two columns will be displayed
        # The third one is for sorting file sizes as numbers
        store = Gtk.ListStore(str, str, long)
        # Get the data - see below
        self.populate_store(store)

        treeview = Gtk.TreeView(model=store)

        # The first TreeView column displays the data from
        # the first ListStore column (text=0), which contains
        # file names
        renderer_1 = Gtk.CellRendererText()
        column_1 = Gtk.TreeViewColumn('File Name', renderer_1, text=0)
        # Calling set_sort_column_id makes the TreeViewColumn sortable
        # by clicking on its header. The column is sorted by
        # the ListStore column index passed to it
        # (in this case 0 - the first ListStore column)
        column_1.set_sort_column_id(0)
        treeview.append_column(column_1)

        # xalign=1 right-aligns the file sizes in the second column
        renderer_2 = Gtk.CellRendererText(xalign=1)
        # text=1 pulls the data from the second ListStore column
        # which contains filesizes in bytes formatted as strings
        # with thousand separators
        column_2 = Gtk.TreeViewColumn('Size in bytes', renderer_2, text=1)
        # Make the TreeView column sortable by the third ListStore column
        # which contains the actual file sizes
        column_2.set_sort_column_id(2)
        treeview.append_column(column_2)

        # Use ScrolledWindow to make the TreeView scrollable
        # Otherwise the TreeView would expand to show all items
        # Only allow vertical scrollbar
        scrolled_window = Gtk.ScrolledWindow()
        scrolled_window.set_policy(Gtk.PolicyType.NEVER, Gtk.PolicyType.AUTOMATIC)
        scrolled_window.add(treeview)
        scrolled_window.set_min_content_height(200)

        self.add(scrolled_window)
        self.show_all()

    def populate_store(self, store):
        directory = '/home/' + getpass.getuser()
        for filename in os.listdir(directory):
            size = os.path.getsize(os.path.join(directory, filename))
            # the second element is displayed in the second TreeView column
            # but that column is sorted by the third element
            # so the file sizes are sorted as numbers, not as strings
            store.append([filename, '{0:,}'.format(size), size])

# The main part:
win = MyWindow()
Gtk.main()
In order:
You really need to start looking more at the fine documentation.
Make sure you set a sort column id on each of your columns, then look at the GtkTreeSortable interface. This tutorial section is helpful, too.
Set the "xalign" property of your GtkCellRenderer to 0.f.