getting error when export data to csv by using python - mongodb

I tried to export data from MongoDB to CSV using Python, but I get this error: "TypeError: can only join an iterable"
MongoDB data sample:
{ "_id" : ObjectId("51dc52fec0d988a9547b5201"),
"hiringManagerIds" : [
"529f5ad1030dedd0a88ed7be",
"529f5ad1030dedd0a88ed7bf"
]
}
Python script:
import codecs
import csv
cursor = db.jobs.find({}, {'_id': 1, 'hiringManagerIds': 1})
with codecs.open('jobs_array.csv', 'w', encoding='utf-8') as outfile:
    fields = ['_id', 'hiringManagerIds']
    write = csv.writer(outfile)
    write.writerow(fields)
    for x in cursor:
        write.writerow((x["_id"], u','.join(x.get('hiringManagerIds'))))
error messages:
Traceback (most recent call last):
File "<stdin>", line 6, in <module>
TypeError: can only join an iterable
I want to remove the u prefixes, so I use u','.join in the script. The hiringManagerIds field is missing in some documents, so I use get. If I don't add u','.join in the script, it works, but the u prefixes end up in the CSV file. I tried many different ways, but none of them worked. Any help is appreciated. Thanks.

The error comes from a document that has no hiringManagerIds field. When the key is missing, x.get('hiringManagerIds') returns None, and the join method requires an iterable (None is not iterable).
Check whether the key is present in the dictionary first:
if 'hiringManagerIds' in x:
    write.writerow((x["_id"], u','.join(x['hiringManagerIds'])))
else:
    write.writerow((x["_id"], ''))
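The same check can be folded into a small helper. This is a minimal Python 3 sketch, not the asker's script: job_row and sample_docs are hypothetical names, and the list of plain dictionaries stands in for the cursor returned by db.jobs.find(...).

```python
import csv
import io

def job_row(doc):
    """Build one CSV row; a missing or None hiringManagerIds becomes ''."""
    manager_ids = doc.get('hiringManagerIds') or []
    return [str(doc['_id']), ','.join(manager_ids)]

# Hypothetical stand-in for the documents yielded by the MongoDB cursor.
sample_docs = [
    {'_id': '51dc52fec0d988a9547b5201',
     'hiringManagerIds': ['529f5ad1030dedd0a88ed7be',
                          '529f5ad1030dedd0a88ed7bf']},
    {'_id': '51dc52fec0d988a9547b5202'},  # no hiringManagerIds field
]

buffer = io.StringIO()  # in-memory file, so the sketch writes nothing to disk
writer = csv.writer(buffer)
writer.writerow(['_id', 'hiringManagerIds'])
for doc in sample_docs:
    writer.writerow(job_row(doc))
print(buffer.getvalue())
```

Using `or []` also covers documents where the field exists but is stored as null, which a plain key check would not catch.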

Is there a way to save each HDF5 data set as a .csv column?

I'm struggling to extract data from an H5 file and save it as a multi-column CSV. As shown in the picture, the H5 file consists of three main groups (Genotypes, Positions, and taxa). The main group Genotypes contains more than 1500 subgroups (partial genotype names), and each subgroup contains sub-subgroups (complete genotype names). There are about 1 million datasets (named calls), each in its own sub-subgroup, and I need each one written to a separate column. The problem is that when I use h5py (the group.get function) I have to give the path of every calls dataset. I extracted all the paths ending in "calls", but I can't reach all 1 million calls to get them into a CSV file.
Could anybody help me extract the "calls" datasets, which are 8-bit integers, as separate columns in a CSV file?
By running the code in first answer I get this error:
Traceback (most recent call last):
  File "path/file.py", line 32, in <module>
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
  File "path/file.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "path/file.py", line 564, in proxy
    return func(name, self[name])
  File "path/file.py", line 10, in dump_calls2csv
    np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
  File "<array_function internals>", line 6, in savetxt
  File "path/file.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_ArgentineFlintyComposite-C(1)-37-B-B-B2-1-B25-B2-B?-1-B:100000977_calls.csv
16-May-2020 Update:
Added a second example that reads and exports using PyTables (aka tables) with .walk_nodes(). I prefer this method over h5py's .visititems().
For clarity, I separated the code that creates the example file from the 2 examples that read and export the CSV data.
Below are 2 simple examples that show how to recursively loop over all top-level objects. For completeness, the code to create the test file is at the end of this post.
Example 1: with h5py
This example uses the .visititems() method with a callable function (dump_calls2csv).
Summary of this procedure:
1) Checks for dataset objects with calls in the name.
2) When it finds a matching object it does the following:
a) reads the data into a Numpy array,
b) creates a unique file name (using string substitution on the H5 group/dataset path name to ensure uniqueness),
c) writes the data to the file with numpy.savetxt().
import h5py
import numpy as np
def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        print('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
##########################
with h5py.File('SO_61725716.h5', 'r') as h5r:
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
If you want to get fancy, you can replace arr in np.savetxt() with node[:].
Also, if you want headers in your CSV, extract and reference the dtype field names from the dataset (I did not create any in this example).
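The OSError in the question shows a generated file name containing ? and :, which Windows does not allow in file names. A minimal sketch of one way to handle that is to replace such characters before calling np.savetxt(). The helper name safe_csv_name and the shortened sample path are hypothetical, not taken from the real file.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_INVALID_FILENAME_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def safe_csv_name(h5_path):
    """Turn an HDF5 object path into a file name that Windows accepts."""
    stem = h5_path.lstrip('/')
    return _INVALID_FILENAME_CHARS.sub('_', stem) + '.csv'

# Hypothetical path, shortened from the name in the traceback.
print(safe_csv_name('/Genotypes/B?-1-B:100000977/calls'))
```

In Example 1 this would replace the line that builds csvfname. Because several different characters map to the same underscore, two distinct paths could in principle end up with the same file name, so it may be worth checking for collisions on a file with a million datasets.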
Example 2: with PyTables (tables)
This example uses the .walk_nodes() method with a filter: classname='Leaf'. In PyTables, a leaf can be any of the storage classes (Arrays and Table).
The procedure is similar to the method above. walk_nodes() simplifies the process to find datasets and does NOT require a call to a separate function.
import tables as tb
import numpy as np
with tb.File('SO_61725716.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        print('visiting object:', node._v_pathname, 'export data to CSV')
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(csvfname, node.read(), fmt='%d', delimiter=',')
For completeness, use the code below to create the test file used in the examples.
import h5py
import numpy as np
ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2
with h5py.File('SO_61725716.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)

inferSchema in spark csv package

I am trying to read a CSV file as a Spark DataFrame with inferSchema enabled, but then I am unable to get fv_df.columns. Below is the error message:
>>> fv_df = spark.read.option("header", "true").option("delimiter", "\t").csv('/home/h212957/FacilityView/datapoints_FV.csv', inferSchema=True)
>>> fv_df.columns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 687, in columns
return [f.name for f in self.schema.fields]
File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 227, in schema
self._schema = _parse_datatype_json_string(self._jdf.schema().json())
File "/home/h212957/spark/python/pyspark/sql/types.py", line 894, in _parse_datatype_json_string
return _parse_datatype_json_value(json.loads(json_string))
File "/home/h212957/spark/python/pyspark/sql/types.py", line 911, in _parse_datatype_json_value
return _all_complex_types[tpe].fromJson(json_value)
File "/home/h212957/spark/python/pyspark/sql/types.py", line 562, in fromJson
return StructType([StructField.fromJson(f) for f in json["fields"]])
File "/home/h212957/spark/python/pyspark/sql/types.py", line 428, in fromJson
_parse_datatype_json_value(json["type"]),
File "/home/h212957/spark/python/pyspark/sql/types.py", line 907, in _parse_datatype_json_value
raise ValueError("Could not parse datatype: %s" % json_value)
ValueError: Could not parse datatype: decimal(7,-31)
However, if I don't infer the schema, then I am able to fetch the columns and do further operations. I don't understand why it works this way. Can anyone please explain?
I suggest you use '.load' rather than '.csv', something like this:
data = sc.read.load(path_to_file,
                    format='com.databricks.spark.csv',
                    header='true',
                    inferSchema='true').cache()
Of course, you can add more options. Then you can simply get what you want:
data.columns
Another way of doing this (to get the columns) is to use it this way:
data = sc.textFile(path_to_file)
And to get the headers (columns) just use
data.first()
It looks like you are trying to get the schema from your CSV file without opening it! The above should help you get the columns and then manipulate them however you like.
Note: to use '.columns', your 'sc' should be configured as:
spark = SparkSession.builder \
    .master("yarn") \
    .appName("experiment-airbnb") \
    .enableHiveSupport() \
    .getOrCreate()
sc = SQLContext(spark)
Good luck!
Please try the code below; it infers the schema and reads the header:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('operation').getOrCreate()
df = spark.read.csv("C:/LEARNING/Spark_DataFrames/stock.csv", inferSchema=True, header=True)
df.show()
It would help if you provided some sample data next time; otherwise we cannot know what your CSV looks like. Regarding your question, it looks like your CSV column is not a decimal all the time. inferSchema takes the first row and assigns a data type, in your case a DecimalType, but the second row might contain text, which would cause the error.
If you don't infer the schema, then of course it works, since everything is read as a StringType.
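Another possibility worth checking is where the negative scale in decimal(7,-31) comes from. A number written in scientific notation, once parsed as an arbitrary-precision decimal, has exactly that shape. The sketch below uses Python's standard decimal module rather than Spark, and the value 1.234567E+37 is a made-up example; whether the file in the question contains such values is an assumption.

```python
from decimal import Decimal

def precision_and_scale(text):
    """Return (precision, scale) of a number given as text.

    Precision is the count of significant digits; scale is the number of
    digits to the right of the decimal point (negative for large exponents).
    """
    parts = Decimal(text).as_tuple()
    return len(parts.digits), -parts.exponent

print(precision_and_scale('12.5'))          # ordinary decimal value
print(precision_and_scale('1.234567E+37'))  # hypothetical value in scientific notation
```

If a column does hold values like the second one, reading it as a string or a double through an explicit schema, instead of relying on inference, should avoid the unparseable type.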

pyspark : Categorical variables preparation for kmeans

I know k-means is not a good choice for categorical data, but we don't have many options in Spark 1.4 for clustering categorical data.
That issue aside, I'm getting errors in the code below.
I read my table from Hive, use OneHotEncoder in a pipeline, and then send the result into KMeans.
I'm getting an error when running this code.
Could the error be in the data type fed to KMeans? Does it expect NumPy array data? If so, how can I convert my indexed data to a NumPy array?
All comments are appreciated, and thanks for your help!
The error I'm getting:
Traceback (most recent call last):
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 544, in read_int
    raise EOFError
EOFError
  File "", line 1
Traceback (most recent call last):
My code:
#aline will be passed in from another rdd
aline=["xxx","yyy"]
# get data from Hive table & select the column & convert back to Rdd
rddRes2=hCtx.sql("select XXX, YYY from table1 where xxx <> ''")
rdd3=rddRes2.rdd
#fill the NA values with "none"
Rdd4=rdd3.map(lambda line: [x if len(x) else 'none' for x in line])
# convert it back to Df
DataDF=Rdd4.toDF(aline)
# Indexers encode strings with doubles
string_indexers=[
StringIndexer(inputCol=x,outputCol="idx_{0}".format(x))
for x in DataDF.columns if x not in '' ]
encoders=[
OneHotEncoder(inputCol="idx_{0}".format(x),outputCol="enc_{0}".format(x))
for x in DataDF.columns if x not in ''
]
# Assemble multiple columns into a single vector
assembler=VectorAssembler(
inputCols=["enc_{0}".format(x) for x in DataDF.columns if x not in ''],
outputCol="features")
pipeline= Pipeline(stages=string_indexers+encoders+[assembler])
model=pipeline.fit(DataDF)
indexed=model.transform(DataDF)
labeled_points=indexed.select("features").map(lambda row: LabeledPoint(row.features))
# Build the model (cluster the data)
clusters = KMeans.train(labeled_points, 3, maxIterations=10,runs=10, initializationMode="random")
I guess the correction would not solve the problem.
You can convert dense vectors to a NumPy array by using XXX.toArray().

MongoDB/PyMongo won't $set attribute to document - but sets all other attributes! (bizarre error)

I'm trying to write a defaultdict variable to a document in my MongoDB. Everything else sets fine, just not this one attribute. It's bizarre! I'm setting a rather large defaultdict called 'domains', which has worked many times before. Check out this terminal output:
So here's my defaultdict:
>>> type(domains)
<type 'collections.defaultdict'>
It's pretty big, about 3 MB:
>>> sys.getsizeof(domains)
3146008
Here's the document we'll set it to:
>>> db.AggregateResults.find_one({'date':'20110409'}).keys()
[u'res', u'date', u'_id']
Let's grab that document's ID:
>>> myID = db.AggregateResults.find_one({'date':'20110409'})['_id']
>>> myID
ObjectId('50870847f49a00509a000000')
Great, let's set the attribute:
>>> db.AggregateResults.update({'_id':myID}, {"$set": {'domains':domains}})
>>> db.AggregateResults.find_one({'date':'20110409'}).keys()
[u'res', u'date', u'_id']
EH? It didn't save??
Hmmm...does anything save at all?
>>> db.AggregateResults.update({'_id':myID}, {"$set": {'myTest':'hello world'}})
>>> db.AggregateResults.find_one({'date':'20110409'}).keys()
[u'myTest', u'res', u'date', u'_id']
Okay...so it can save things fine...perhaps it's because MongoDB doesn't like defaultdicts? Let's try:
>>> myDD = defaultdict(int)
>>> myDD['test'] = 1
>>> myDD
defaultdict(<type 'int'>, {'test': 1})
>>> db.AggregateResults.update({'_id':myID}, {"$set": {'myDD':myDD}})
>>> db.AggregateResults.find_one({'date':'20110409'}).keys()
[u'myTest', u'res', u'date', u'myDD', u'_id']
So it can save defaultdicts fine, just not this one??
So strange! Any ideas why??
EDIT with safe=True:
>>> db.AggregateResults.update({'_id':myID}, {"$set": {'domains':domains}}, safe=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/site-packages/pymongo-2.1.1_-py2.6-linux-x86_64.egg/pymongo/collection.py", line 405, in update
_check_keys, self.__uuid_subtype), safe)
File "/usr/lib64/python2.6/site-packages/pymongo-2.1.1_-py2.6-linux-x86_64.egg/pymongo/connection.py", line 796, in _send_message
return self.__check_response_to_last_error(response)
File "/usr/lib64/python2.6/site-packages/pymongo-2.1.1_-py2.6-linux-x86_64.egg/pymongo/connection.py", line 746, in __check_response_to_last_error
raise OperationFailure(error["err"], error["code"])
pymongo.errors.OperationFailure: not okForStorage
This GoogleGroup discussion says that could be due to having fullstops in the keys, but:
>>> [x for x in domains.keys() if '.' in x]
[]
Aha! Found it!
Not only can keys in MongoDB not have '.', they also cannot have '$' in them.
See:
>>>[x for x in domains.keys() if '$' in x]
['$some_key_']
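The list comprehension above only looks at top-level keys. A minimal sketch of a recursive check for nested dictionaries follows. The function name find_invalid_keys and the sample data are hypothetical, and the rule it applies (a key is rejected if it contains '.' or begins with '$') is my reading of the restriction behind "not okForStorage", so adjust the test if your server version behaves differently.

```python
def find_invalid_keys(doc, path=''):
    """Return slash-separated paths of keys that contain '.' or start with '$'."""
    bad = []
    if isinstance(doc, dict):
        for key, value in doc.items():
            here = path + '/' + str(key)
            if '.' in str(key) or str(key).startswith('$'):
                bad.append(here)
            # Keep descending: nested documents are subject to the same rule.
            bad.extend(find_invalid_keys(value, here))
    elif isinstance(doc, (list, tuple)):
        for index, item in enumerate(doc):
            bad.extend(find_invalid_keys(item, '%s/%d' % (path, index)))
    return bad

# Hypothetical data shaped like the 'domains' defaultdict in the question.
domains = {'example_com': {'hits': 3}, '$some_key_': 1, 'nested': {'a.b': 2}}
print(find_invalid_keys(domains))
```

Since defaultdict is a subclass of dict, the same function can be called on the original 'domains' object before sending the update.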
My guess is that you are trying to save a document that is too large. MongoDB imposes a 16MB maximum size on all of its documents.
Try running the update command with the parameter safe=True. This runs it in safe mode, which instructs the database to send back the result of the attempted update.

Python 3.2 lxml fill and submit form, select multiple, how to do it? value not working

Great site, this one. I'm coming from the Perl world, and after several years of doing nothing I've started programming again (this site didn't exist back then; how things change). Now, after 2 full days of searching, I'm playing my last card and asking here for help.
I'm working on a Mac, with Python 3.2 and lxml 2.3 (installed following www.jtmoon.com/?p=21). What I am trying to do:
web: http://biodbnet.abcc.ncifcrf.gov/db/db2db.php
fill in the form that you find there
submit it
My code. I include several attempts and their output.
from lxml.html import parse, submit_form, tostring
page = parse('http://biodbnet.abcc.ncifcrf.gov/db/db2db.php').getroot()
page.forms[0].fields['input'] = 'GI Number'
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
page.forms[0].fields['hasComma'] = 'no'
page.forms[0].fields['removeDupValues'] = 'yes'
page.forms[0].fields['request'] = 'db2db'
page.forms[0].action = 'http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php'
page.forms[0].fields['idList'] = '86439006'
submit_form(page.forms[0])
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1058, in _value__set
"You must pass in a sequence")
TypeError: You must pass in a sequence
So, since that element is a multi-select element, I understand that I have to give a list
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1059, in _value__set
self.value.clear()
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/_setmixin.py", line 115, in clear
self.remove(item)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1159, in remove
"The option %r is not currently selected" % item)
ValueError: The option 'Affy ID' is not currently selected
'Affy ID' is the first option value of the list, and it is not selected. But what's the problem with it?
Surprisingly, if I instead put
page.forms[0].inputs['outputs[]'].multiple = list('Gene ID')
#page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Then, somehow lxml likes it and moves on. However, the multiple attribute should be a boolean (it actually is one if I print the value), so I shouldn't touch it, and the "value" of the item should actually point to the selected items, according to the lxml docs.
The new output
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 87, in <module>
submit_form(page.forms[0])
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 856, in submit_form
return open_http(form.method, url, values)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 876, in open_http_urllib
return urlopen(url, data)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 364, in open
req = meth(req)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 1052, in do_request_
raise TypeError("POST data should be bytes"
TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
So, what can be done? I am sure that with Python 2.6 I could use mechanize, or that perhaps lxml would work. But I really don't want to code in a sort-of deprecated version. I am enjoying Python a lot, but I am starting to consider going back to Perl. Perhaps that would be a smart move?
Any help will be hugely appreciated.
Gerard
Reading this forum, I found pythonpaste.org. Could it be a replacement for lxml?
Passing a sequence to list() generates a list from that sequence. 'Gene ID' is a sequence (namely a sequence of characters), so list('Gene ID') generates a list of characters, like so:
>>> list('Gene ID')
['G', 'e', 'n', 'e', ' ', 'I', 'D']
That's not what you want. Try this:
>>> ['Gene ID']
['Gene ID']
In other words:
page.forms[0].inputs['outputs[]'].value = ['Gene ID']
That should take you a bit forward.
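The last traceback in the question ("POST data should be bytes") is a separate issue: in Python 3, urlopen() wants the POST body as bytes, not str. A minimal sketch of a workaround is to give submit_form your own opener through its open_http argument, which is called as open_http(method, url, values). The names encode_post_data and open_http_bytes are hypothetical, and the sketch assumes the form values are plain ASCII.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def encode_post_data(values):
    """URL-encode (name, value) pairs and return bytes for urlopen()."""
    return urlencode(list(values)).encode('ascii')

def open_http_bytes(method, url, values):
    """Opener with the (method, url, values) signature that submit_form calls."""
    if method.upper() == 'GET':
        # GET requests carry the form values in the query string.
        separator = '&' if '?' in url else '?'
        return urlopen(url + separator + urlencode(list(values)))
    return urlopen(url, encode_post_data(values))

print(encode_post_data([('input', 'GI Number'), ('outputs[]', 'Gene ID')]))
```

With that in place, the submit call would become submit_form(page.forms[0], open_http=open_http_bytes). Later lxml releases may already encode the data themselves, so upgrading lxml is also worth trying.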