pickle.PicklingError: Cannot pickle files that are not opened for reading - pyspark

I'm getting this error while running a PySpark job on Dataproc. What could be the reason?
This is the stack trace of the error:
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 553, in save_reduce
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 582, in save_file
pickle.PicklingError: Cannot pickle files that are not opened for reading

The issue was that I was using a dictionary inside the map function.
The reason it was failing: the worker nodes couldn't access the dictionary I was passing to the map function.
Solution:
I broadcast the dictionary and then used it in the map function:
sc = SparkContext()
lookup_bc = sc.broadcast(lookup_dict)
Then, inside the function, I looked the value up with:
data = lookup_bc.value.get(key)
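Putting the pieces together, here is a minimal sketch of the pattern (the RDD, the lookup_dict contents, and the enrich function are placeholders for illustration, not the original job):

from pyspark import SparkContext

sc = SparkContext()
lookup_dict = {"a": 1, "b": 2}           # placeholder lookup table
lookup_bc = sc.broadcast(lookup_dict)    # ship the dict to the workers once

def enrich(key):
    # Read the broadcast value on the worker instead of closing over the dict.
    return (key, lookup_bc.value.get(key))

rdd = sc.parallelize(["a", "b", "c"])
print(rdd.map(enrich).collect())         # [('a', 1), ('b', 2), ('c', None)]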


Is there a way to save each HDF5 data set as a .csv column?

I'm struggling with an H5 file, trying to extract the data and save it as a multi-column CSV. As shown in the picture, the structure of the H5 file consists of main groups (Genotypes, Positions, and taxa). The main group Genotypes contains more than 1500 subgroups (genotype partial names), and each subgroup contains sub-subgroups (complete names of genotypes). There are about 1 million datasets (named calls), each one located in a sub-subgroup, and I need each of them written to a separate column. The problem is that when I use h5py (the group.get function) I have to use the path of every calls dataset. I extracted all the paths ending in "calls", but I can't reach all 1 million calls to get them into a CSV file.
Could anybody help me extract the "calls" datasets, which are 8-bit integers, as separate columns in a CSV file?
By running the code in the first answer I get this error:
Traceback (most recent call last):
  File "path/file.py", line 32, in <module>
    h5r.visititems(dump_calls2csv)  #NOTE: function name is NOT a string!
  File "path/file.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "path/file.py", line 564, in proxy
    return func(name, self[name])
  File "path/file.py", line 10, in dump_calls2csv
    np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
  File "<array_function internals>", line 6, in savetxt
  File "path/file.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_ArgentineFlintyComposite-C(1)-37-B-B-B2-1-B25-B2-B?-1-B:100000977_calls.csv
16-May-2020 Update:
Added a second example that reads and exports using PyTables (aka tables) with .walk_nodes(). I prefer this method over h5py's .visititems().
For clarity, I separated the code that creates the example file from the 2 examples that read and export the CSV data.
Enclosed below are 2 simple examples that show how to recursively loop over all top-level objects. For completeness, the code to create the test file is at the end of this post.
Example 1: with h5py
This example uses the .visititems() method with a callable function (dump_calls2csv).
Summary of this procedure:
1) Checks for dataset objects with calls in the name.
2) When it finds a matching object it does the following:
a) reads the data into a NumPy array,
b) creates a unique file name (using string substitution on the H5 group/dataset path name to ensure uniqueness),
c) writes the data to the file with numpy.savetxt().
import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        print('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################
with h5py.File('SO_61725716.h5', 'r') as h5r:
    h5r.visititems(dump_calls2csv)  #NOTE: function name is NOT a string!
If you want to get fancy, you can replace arr in np.savetxt() with node[:].
Also, if you want headers in your CSV, extract and reference the dtype field names from the dataset (I did not create any in this example).
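For instance, a minimal sketch of the header idea, assuming the dataset has a compound dtype with named fields (the example data in this post uses plain integer arrays, so it does not apply there):

field_names = node.dtype.names                 # None for plain (non-compound) dtypes
if field_names is not None:
    np.savetxt(csvfname, node[:], fmt='%5d', delimiter=',',
               header=','.join(field_names),   # column names on the first row
               comments='')                    # drop numpy's default '# ' prefix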
Example 2: with PyTables (tables)
This example uses the .walk_nodes() method with a filter: classname='Leaf'. In PyTables, a Leaf can be any of the storage classes (Arrays and Tables).
The procedure is similar to the one above. walk_nodes() simplifies the process of finding datasets and does NOT require a call to a separate function.
import tables as tb
import numpy as np

with tb.File('SO_61725716.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        print('visiting object:', node._v_pathname, 'export data to CSV')
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(csvfname, node.read(), fmt='%d', delimiter=',')
For completeness, use the code below to create the test file used in the examples.
import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

with h5py.File('SO_61725716.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)

pyspark read error when I save data frame in orc format and read that

I have a data frame that I save using the code below:
df.write.orc("file:///home/test/path/orc")
The save succeeds and does not give any error, but when I read it back using
df1=spark.read.orc("file:///home/test/path/orc")
I get the error below:
Traceback (most recent call last):
File "/home/user1/soft/spark/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/home/user1/soft/spark/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o24.orc.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting ':'(line 1, pos 515)
== SQL ==
But if I save and read using the parquet format, it works fine:
df.write.parquet("file:///home/test/path/parquet")
df1=spark.read.parquet("file:///home/test/path/parquet")

Automate Boring Stuff Ch13 PyPDF2: pdfReader is not defined

While trying to follow the instructions in the book, I encountered an error which I am afraid is due to my misunderstanding of the loop. My code is below.
#! Python3
import PyPDF2, os

# Loop through all the PDF files
for filename in pdfFiles:
    try:
        pdfFileObj = open(filename, 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    except FileNotFoundError:
        print('File not found ' + filename)
        pass

# Read through all the PDF files.
for pageNum in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfWriter.addPage(pageObj)
=====
and the result is:
Traceback (most recent call last):
File "C:/…automate_py/combinesPdf.py", line 24, in <module>
for pageNum in range(1, pdfReader.numPages):
NameError: name 'pdfReader' is not defined
Does anyone know why pdfReader is not found? Much appreciated.
I have tried adjusting the indentation but it didn't seem to work. :(
You have made an indentation mistake: pdfReader is defined inside the first loop of your code, so the second loop should be inside that block. This is the entire commented code for that sample from the book:
#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into
# a single PDF.

import PyPDF2, os

# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort()

pdfWriter = PyPDF2.PdfFileWriter()

# Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Loop through all the pages (except the first) and add them.
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

# Save the resulting PDF to a file.
pdfOutput = open('allminutes.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()
Had the same issue (Python 3.x, Windows 10). This worked for me:
from PyPDF2 import PdfFileReader

with open("C:/yourpdf.pdf", 'rb') as f:
    pdf = PdfFileReader(f)
    page = pdf.getPage(1)
    text = page.extractText()
    print(text)

inferSchema in spark csv package

I am trying to read a CSV file as a Spark DataFrame with inferSchema enabled, but then I am unable to get fv_df.columns. Below is the error message:
>>> fv_df = spark.read.option("header", "true").option("delimiter", "\t").csv('/home/h212957/FacilityView/datapoints_FV.csv', inferSchema=True)
>>> fv_df.columns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 687, in columns
return [f.name for f in self.schema.fields]
File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 227, in schema
self._schema = _parse_datatype_json_string(self._jdf.schema().json())
File "/home/h212957/spark/python/pyspark/sql/types.py", line 894, in _parse_datatype_json_string
return _parse_datatype_json_value(json.loads(json_string))
File "/home/h212957/spark/python/pyspark/sql/types.py", line 911, in _parse_datatype_json_value
return _all_complex_types[tpe].fromJson(json_value)
File "/home/h212957/spark/python/pyspark/sql/types.py", line 562, in fromJson
return StructType([StructField.fromJson(f) for f in json["fields"]])
File "/home/h212957/spark/python/pyspark/sql/types.py", line 428, in fromJson
_parse_datatype_json_value(json["type"]),
File "/home/h212957/spark/python/pyspark/sql/types.py", line 907, in _parse_datatype_json_value
raise ValueError("Could not parse datatype: %s" % json_value)
ValueError: Could not parse datatype: decimal(7,-31)
However, if I don't infer the schema, then I am able to fetch the columns and do further operations. I can't see why it behaves this way. Can anyone please explain?
I suggest you use the function '.load' rather than '.csv', something like this:
data = sc.read.load(path_to_file,
                    format='com.databricks.spark.csv',
                    header='true',
                    inferSchema='true').cache()
Of course you can add more options. Then you can simply get what you want:
data.columns
Another way of doing this (to get the columns) is to use it this way:
data = sc.textFile(path_to_file)
And to get the headers (columns) just use
data.first()
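For example, assuming the tab delimiter from the question, the header line can then be split into column names:

header = data.first()            # first line of the file
columns = header.split('\t')     # tab-delimited, per the question
print(columns)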
It looks like you are trying to get your schema from your CSV file without opening it! The above should help you get the columns and then manipulate whatever you like.
Note: to use '.columns' your 'sc' should be configured as:
spark = SparkSession.builder \
    .master("yarn") \
    .appName("experiment-airbnb") \
    .enableHiveSupport() \
    .getOrCreate()
sc = SQLContext(spark)
Good luck!
Please try the code below; it infers the schema along with the header:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('operation').getOrCreate()
df=spark.read.csv("C:/LEARNING//Spark_DataFrames/stock.csv ",inferSchema=True, header=True)
df.show()
It would be good if you could provide some sample data next time; otherwise we cannot know what your CSV looks like. Concerning your question, it looks like your CSV column is not a decimal all the time. inferSchema takes the first row and assigns a datatype; in your case it is a DecimalType, but then in the second row you might have text, so the error occurs.
If you don't infer the schema then, of course, it will work, since everything will be cast as a StringType.
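If the inferred decimal(7,-31) type is the blocker, one workaround is to declare the schema yourself instead of inferring it. A minimal sketch, assuming made-up column names and types (replace them with the ones in your CSV):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("facility", StringType(), True),    # hypothetical column
    StructField("datapoint", DoubleType(), True),   # hypothetical column
])

fv_df = (spark.read
         .option("header", "true")
         .option("delimiter", "\t")
         .schema(schema)
         .csv('/home/h212957/FacilityView/datapoints_FV.csv'))
print(fv_df.columns)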

Reading and writing error pajek file in Networkx

I am receiving an error when I write to a Pajek file and then read back the same file using the NetworkX library in Python.
>>> G=nx.read_pajek("eatRS.net")
>>> nx.write_pajek(G,"temp.net")
>>> G1=nx.read_pajek("temp.net")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in read_pajek
File "/usr/local/lib/python2.7/dist-packages/networkx/utils/decorators.py", line 193, in _open_file
result = func(*new_args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 132, in read_pajek
return parse_pajek(lines)
File "/usr/local/lib/python2.7/dist-packages/networkx/readwrite/pajek.py", line 168, in parse_pajek
splitline=shlex.split(str(next(lines)))
File "/usr/lib/python2.7/shlex.py", line 279, in split
return list(lex)
File "/usr/lib/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
Creating a graph within networkx, writing it in Pajek format and then reading it back works fine for me, e.g. with gnm_random_graph:
import networkx as nx

n = 10
m = 20
G = nx.gnm_random_graph(n, m)
nx.write_pajek(G, "temp.net")
G1 = nx.read_pajek("temp.net")
Only if I edit the intermediate file to contain, say, a line like
"vertex one 0.3456 0.1234 box ic White fos 20
do I get the ValueError: No closing quotation error you have. Node labels can be numeric or strings, but if they include spaces, the name must be quoted. From the Pajek manual:
label - if label starts with character A..Z or 0..9 first blank determines end of the label
(example: vertex1), labels consisting of more words must be enclosed in pair of special
characters (example: "vertex 1")
Thus, I suggest that you inspect your input file "eatRS.net". Perhaps there is an issue with character encoding, mismatched quotes (e.g. opening with " and closing with '), or a line break within the node label?
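If it helps, here is a small diagnostic sketch (not part of the original answer) that flags lines with an odd number of double quotes, which is what makes shlex raise "No closing quotation":

# Scan the Pajek file for lines whose double quotes are not balanced.
with open("eatRS.net") as f:
    for lineno, line in enumerate(f, start=1):
        if line.count('"') % 2 != 0:
            print(lineno, line.rstrip())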