How to read a graphml file into networkx with weird characters?

I am trying to read a graphml file of my Facebook network into NetworkX. However, because some of my friends' names contain unusual characters, such as accents, they cannot be read into networkx.
I ran the command:
g = nx.read_graphml("/Users/juliehui/Desktop/MyGraph.graphml")
I then get the error:
TypeError: int() argument must be a string or a number, not 'NoneType'
I looked at the graphml file in Sublime Text, and it seems to have trouble with names, such as Andrés
I then looked at the graphml file in Gephi to see what it looked like. The name, Andrés, in Gephi looks like:
Andrés
When I export the data from Gephi into a separate graphml file without making any edits, and try to read that file in, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
When I delete the problem names in Gephi, the file reads fine.
I am not sure if there is some way to edit my original graphml file to fix the names with unusual characters.
I have looked at this page: Graphml parse error
But I could not figure out whether my graphml file is in UTF-8, needs to be in UTF-8, or needs to be in ASCII.
I have also tried:
data="/Users/juliehui/Desktop/MyGraph.graphml"
udata=data.decode("utf-8")
asciidata=udata.encode("ascii","ignore")
g = nx.read_graphml(asciidata)
But, this gave the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
How do I resolve this error?

This worked for me in Python 2.7. You have to specify the node type as unicode:
nx.read_graphml('/path/to/my/file.graphml', unicode)
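Spelled out with the keyword argument and the path from the question, that call looks like this (a Python 2 sketch; in Python 3 the default node_type=str is already unicode, so nothing needs to change there):
import networkx as nx
# Python 2: parse node ids as unicode so names like "Andrés" survive
g = nx.read_graphml("/Users/juliehui/Desktop/MyGraph.graphml", node_type=unicode)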

I would suggest using unidecode to transliterate all non-ASCII characters in the file to their closest ASCII equivalents (so "Andrés" becomes "Andres"):
from unidecode import unidecode
data_in = "/Users/juliehui/Desktop/MyGraph.graphml"
data_ascii = "/Users/juliehui/Desktop/MyGraph_ASCII.graphml"
f_in = open(data_in, 'rb')
f_out = open(data_ascii, 'wb')
for line in f_in:
    # decode the UTF-8 bytes first; unidecode expects unicode input
    f_out.write(unidecode(line.decode('utf-8')))
f_in.close()
f_out.close()
Then you can hopefully use:
g = nx.read_graphml(data_ascii)

Related

How to overarchingly apply "utf-8" to opening csv/txt files in pandas dataframe?

I am trying to import data from text files at a particular file path, but am getting the error: 'utf-8' codec can't decode byte 0xa5 in position 18: invalid start byte
My question: is there any way I can apply "utf-8" encoding to all the text files (about 20 others) I will have to open eventually, so that I can prevent the above error?
Code:
import pandas as pd
filelist = [r'D:/file1',r'D:/file2']
print (len((pd.concat([pd.read_csv(item, names=[item[:-4]]) for item in filelist],axis=1))))
I am also open to any suggestions if I am doing something wrong.
Thank you in advance.
I am not aware of a way to automatically convert a file's encoding to UTF-8 in Python.
Alternatively, you can find out what the encoding is, read the file accordingly, and then write it back out in UTF-8.
This solution worked well for my files (credit: maxnoe):
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])
Don't forget to pip install chardet.
If you then write the file using pd.to_csv(), pandas defaults to encoding in utf-8.
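To apply this across the whole batch from the original question, you can wrap the detection in a small helper and reuse it for every file. A sketch along those lines, using the question's placeholder paths:
import chardet
import pandas as pd
filelist = [r'D:/file1', r'D:/file2']
def read_detected(path):
    # sniff the encoding from the raw bytes, then let pandas decode with it
    with open(path, 'rb') as f:
        enc = chardet.detect(f.read())['encoding']
    return pd.read_csv(path, names=[path[:-4]], encoding=enc)
print(len(pd.concat([read_detected(item) for item in filelist], axis=1)))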

Scala java.nio.charset.UnmappableCharacterException: Input length = 1

I've found several questions with similar titles, but couldn't seem to use any of them to resolve my issue. I can't seem to load my .csv file:
val source = io.Source.fromFile("C:/mon_usatotaldat.csv")
Returns:
java.nio.charset.UnmappableCharacterException: Input length = 1
So I tried:
val source = io.Source.fromFile("UTF-8", "C:/mon_usatotaldat.csv")
and got:
java.nio.charset.IllegalCharsetNameException: C:/mon_usatotaldat.csv
I guess UTF-8 wouldn't work if the file isn't in UTF-8 format, so that makes sense, but I don't know what to do next.
I've managed to discover that the encoding is windows-1252 using:
val source = io.Source.fromFile("C:/mon_usatotaldat.csv").codec.decodingReplaceWith("UTF-8")
But this didn't do what I had expected, which was to convert the file to UTF-8. I have no idea how to work with it.
Another thing I've tried was:
val source = io.Source.fromFile("windows-1252","C:/mon_usatotaldat.csv")
But that returned:
java.nio.charset.IllegalCharsetNameException: C:/mon_usatotaldat.csv
Please help. Thanks in advance.
Try mapping your Excel file to UTF-8 first, and then try val source = io.Source.fromFile("C:/mon_usatotaldat.csv", "UTF-8"). Note that the file name comes first and the charset second; passing them in the other order is why the path showed up in the IllegalCharsetNameException.
To map to UTF-8, try:
(1) Open the Excel file where you have the info (.xls, .xlsx).
(2) In Excel, choose "CSV (Comma Delimited) (*.csv)" as the file type and save as that type.
(3) In Notepad (found under "Programs" and then "Accessories" in the Start menu), open the saved .csv file.
(4) Then choose Save As, and at the bottom of the "Save As" box there is a select box labelled "Encoding". Select UTF-8 (do NOT use ANSI or you lose all accents etc.). After selecting UTF-8, save the file under a slightly different file name from the original.
This file is in UTF-8 and retains all characters and accents, and can be imported, for example, into MySQL and other database programs.
Reference: Excel to CSV with UTF8 encoding
Hope this helps!
Set up a java.io.InputStreamReader with the windows-1252 charset to read the file correctly as-is. Don't bother with an intermediate UTF-8 conversion.

UnicodeDecodeError: 'ascii' codec can't decode byte in position : ordinal not in range(128)

I have done a bit of research on this error and can't really get my head around what's going on. As far as I understand it, I am basically having problems because I am converting from one type of encoding to another.
def write_table_to_file(table, connection):
    db_table = io.StringIO()
    cur = connection.cursor()
    #pdb.set_trace()
    cur.copy_to(db_table, table)
    cur.close()
    return db_table
This is the method that is giving me headaches. The error below is output when I run it:
[u350932#config5290vm0 python3]$ python3 datamain.py
Traceback (most recent call last):
  File "datamain.py", line 48, in <module>
    sys.exit(main())
  File "datamain.py", line 40, in main
    t = write_table_to_file("cms_jobdef", con_tctmsv64)
  File "datamain.py", line 19, in write_table_to_file
    cur.copy_to(db_table, table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 40: ordinal not in range(128)
The client encoding on the database I'm retrieving the table from is:
tctmsv64=> SHOW CLIENT_ENCODING;
client_encoding
-----------------
sql_ascii
(1 row)
The database encoding is LATIN1
The encoding for the database I am putting them into is:
S104838=# SHOW CLIENT_ENCODING;
client_encoding
-----------------
WIN1252
(1 row)
The database encoding is UTF8
The threads I have found recommend changing the encoding:
To correct your function, you'll have to know what encoding the byte string is in, and convert it to unicode using the decode() method, and compare that result to the unicode string.
http://www.thecodingforums.com/threads/unicodedecodeerror-ascii-codec-cant-decode-byte-0xa0-in-position-10-ordinal-not-in-range-128.336691/
The problem is that when I try to use the decode methods, I get complaints that it's not a file type. I have had a look at the Python 3.4 documentation for io.StringIO(initial_value='', newline='\n'), but could not find anything on changing the encoding.
I also found this page, which outlined the problem, but I couldn't figure out what I needed to do to solve it:
https://wiki.python.org/moin/UnicodeDecodeError
Basically I'm quite confused as to what is going on and not sure how to fix it. Any help would be greatly appreciated.
Cheers
Python 3 changed file I/O behaviour around text encodings - massively for the better, IMO. You may find Processing Text Files in Python 3 informative.
It looks like psycopg2 is seeing that you passed a raw file object and is trying to encode the strings it's working with into byte sequences for writing to the file, with the assumption (since you didn't specify anything else) that you want to use the ascii encoding for the file.
I'd use an io.BytesIO object instead of StringIO, and specify the source encoding when you do a copy_from into the new database.
I'll be surprised if you don't have problems due to invalid, mixed, or otherwise borked text from your SQL_ASCII source database, though.
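A minimal sketch of that suggestion (my illustration of the BytesIO swap, not the asker's final code, which follows below):
import io
def write_table_to_file(table, connection):
    # BytesIO accepts raw bytes, so psycopg2 does not try to encode them to ascii
    db_table = io.BytesIO()
    cur = connection.cursor()
    cur.copy_to(db_table, table)
    cur.close()
    db_table.seek(0)  # rewind so the caller can read the copied data back
    return db_table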
First of all, thanks Craig for your response. It was very helpful in making me realise that I needed to find a good way of doing this, otherwise the data in my new database would be corrupt. Not something we want! After a bit more googling, this link was very useful:
https://docs.python.org/3/howto/unicode.html
I ended up using the codecs.StreamRecoder class and it works very well. Below is a snippet of my working code:
import codecs
import io

def write_table_to_file(table, connection):
    db_table = io.BytesIO()
    cur = connection.cursor()
    # re-encode the latin-1 data as utf-8 on the fly while copying
    cur.copy_to(codecs.StreamRecoder(db_table, codecs.getencoder('utf-8'), codecs.getdecoder('latin-1'),
                codecs.getreader('utf-8'), codecs.getwriter('utf-8')), table)
    cur.close()
    return db_table
Long story short, I convert from latin-1 to utf-8 on the fly, it all works, and my data looks good. Thanks again for the feedback, Craig :)

WinJS: error reading file with FileIO.readTextAsync

I am reading a .json file from disk using Windows.Storage.FileIO.readTextAsync.
All is fine until I put some non-English letters in the file, like Æ, Å, Ø.
The error I get is (roughly translated from Danish):
WinRT: No mapping for the Unicode character exists in the target multi-byte code page.
Any idea how to read those characters in WinJS?
I found the problem.
When I created the file manually with Notepad, I saved it as ANSI instead of UTF-8.
I reopened the file, chose Save As, changed the encoding, and overwrote it.
You may be able to solve this by changing the encoding from the default (Utf8) to Utf16. The readTextAsync method accepts a second parameter which is a UnicodeEncoding flag:
Windows.Storage.FileIO.readTextAsync(
file,
Windows.Storage.Streams.UnicodeEncoding.utf16LE
).done( ... );
Or, if you need to, you can use the utf16BE flag.

How is this file encoded?

I have a .csv file generated by Excel that I got from my customer. My software has to open and parse it in Java. I'm using universalchardet, but it did not detect the encoding from the first 1,000 bytes of the file.
Within those first 1,000 bytes there is a sequence that should read as Boîte; however, I cannot find the correct encoding to use to convert this file to UTF-8 strings in Java.
In the file, Boîte is encoded as 42,6F,94,74,65 (read using a hex editor). B, o, t, and e use the standard Latin encoding with one byte per character. The î is also encoded as a single byte, 0x94.
I don't know how to guess this charset, none of my searches online yielded relevant results.
I also tried to use file on that file:
$ file export.csv
/Users/bicou/Desktop/export.csv: Non-ISO extended-ASCII text, with CR line terminators
However, when I looked at an extended-ASCII chart, the value 0x94 stands for ö, not î.
Have you got other ideas for guessing the encoding of that file?
This was Mac OS Roman encoding. When using the following Java code, the text was properly decoded:
InputStreamReader isr = new InputStreamReader(new FileInputStream(targetFileName), "MacRoman");
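As a quick sanity check of that mapping (a small Python snippet, purely illustrative since the question itself is about Java):
raw = bytes([0x42, 0x6F, 0x94, 0x74, 0x65])  # the bytes seen in the hex editor
print(raw.decode('mac_roman'))  # prints: Boîte (0x94 maps to î in Mac OS Roman)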
I don't know how to delete my own question. I don't think it is useful anymore...