Why is Jupyter using a column's values to populate column names? - jupyter

I'm using an SPSS .sav file that has typical column names like name, type, width, and so forth. The 'names' column labels the rows m1, I1, I2, etc.
Here's the Jupyter notebook:
https://imgur.com/9hXuL7u
import pandas as pd
df = pd.read_spss('./Data.sav')
df.head()
As you can see, the column names are the entries for 'Name':
https://imgur.com/ZVMS0F0
I.e., rather than 'name', 'type', 'width' as column names, there are the values for 'name': m1, I1, I2, etc.
I'm quite new to Jupyter and SPSS and have no idea where to start.
EDIT:
Following Rahul Singh's suggestions, I've added header=None, though read_spss() doesn't seem to recognize the argument.
import pandas as pd
df = pd.read_spss('./Data.sav',header=None)
df.head()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-77d006c914c9> in <module>
1 import pandas as pd
----> 2 df = pd.read_spss('./Data.sav',header=None)
3 df.head()
TypeError: read_spss() got an unexpected keyword argument 'header'

Actually this issue is not with Jupyter, it's with pandas (and it isn't really an issue :) ).
Usually, when you read data from a file (.csv, .txt, etc.) that has no header (column names) in it, pandas will automatically take the first row as the header.
To get rid of this problem you can pass `header=None`.
Code :
import savReaderWriter
import numpy as np
import pandas as pd
# Convert the .sav file into a .csv, since read_spss() has no header argument
reader_np = savReaderWriter.SavReaderNp("Data.sav")
array = reader_np.to_structured_array("outfile.dat")
np.savetxt("Data.csv", array, delimiter=",", fmt="%s")  # fmt="%s" handles mixed dtypes
reader_np.close()
# Read the .csv file without a header (note: read_csv, not read_spss)
df = pd.read_csv("Data.csv", header=None)
df.head()
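For reference, this is how header=None behaves with pandas' read_csv; the file and column names below are illustrative:
import pandas as pd
# With header=None the first row stays data and the columns are numbered 0, 1, 2, ...
df = pd.read_csv("Data.csv", header=None)
# Or supply column names yourself (placeholders here)
df = pd.read_csv("Data.csv", header=None, names=["name", "type", "width"])
df.head()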

What you are looking at in SPSS is not the data but the metadata (the Variable View), which shows the characteristics of your columns rather than the data itself. Pandas is reading the data correctly; switch to the Data View in SPSS's Data Editor to see what I mean.
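If you want to see both the data and the variable metadata from Python, here is a minimal sketch using pyreadstat (the library pandas' read_spss wraps); the file name is illustrative:
import pyreadstat
df, meta = pyreadstat.read_sav("Data.sav")
print(df.head())          # the actual data, as in SPSS's Data View
print(meta.column_names)  # variable names, as shown in the Variable View
print(meta.column_labels) # variable labels, if any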

Related

How to get a list view of columns and % of nans/nulls in Pyspark?

I am running a simple EDA on my dataset that has 59K rows and 21 columns. What I would like to see is a list of all columns and the % of the nulls/nans. I ran the following code in Jupyter in my virtual machine:
#Checking nulls by column
from pyspark.sql.functions import *
null_df = datingDF.select([(count(when(isnan(c) | col(c).isNull(), c))/count(lit(1))).alias(c) for c in datingDF.columns])
null_df.show()
The output is really cluttered and not a clean list (see attached)
Replace null_df.show() with:
for i, j in null_df.first().asDict().items():
    print(i, j)
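Alternatively, since null_df holds a single row, you can convert it to pandas for a tidy listing (a sketch assuming the same null_df as above):
null_pct = null_df.toPandas().T  # one row becomes one column after transposing
null_pct.columns = ["null_fraction"]
print(null_pct.sort_values("null_fraction", ascending=False))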

psycopg2 error while writing data from csv file: extra data after last expected column

I am trying to insert data from a csv file (file.csv) into two columns of the table in Postgres. The data looks like this:
#Feature AC;Feature short label
EBI-517771;p.Leu107Phe
EBI-491052;p.Gly23Val
EBI-490120;p.Pro183His
EBI-517851;p.Gly12Val
EBI-492252;p.Lys49Met
EBI-527190;p.Cys360Ser
EBI-537514;p.Cys107Ser
The code I am running is as follows:
# create table in ebi_mut_db schema
cursor.execute("""
    CREATE TABLE IF NOT EXISTS ebi_mut_db.mutations_affecting_interactions(
        feature_ac TEXT,
        feature_short_label TEXT)
""")
with open('file.csv', 'r') as f:
    # Notice that we don't need the `csv` module.
    next(f)  # Skip the header row.
    cursor.copy_from(f, 'ebi_mut_db.mutations_affecting_interactions', sep=';')
conn.commit()
The table is created, but while writing the data it shows the error below.
Traceback (most recent call last):
File "stdin<>", line 38, in <module>
cursor.copy_from(f, 'ebi_mut_db.mutations_affecting_interactions', sep=';')
psycopg2.errors.BadCopyFileFormat: extra data after last expected column
CONTEXT: COPY mutations_affecting_interactions, line 23: "EBI-878110;"p.[Ala223Pro;Ala226Pro;Ala234Asp]""
There are no extra columns beyond the two. My understanding is that the command is detecting more than two columns.
Thanks
You have not told COPY that you are using CSV format, so it is using the default TEXT format. In that format, quoting does not protect special characters, and since there is more than one ;, there are more than two columns.
If you want COPY to know that a ; inside quotes does not count as a separator, you have to tell it to use CSV format. In psycopg2, I think you have to use copy_expert, not copy_from, to accomplish this.
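A minimal sketch of the copy_expert route, assuming the same connection, cursor, and file as above:
with open('file.csv', 'r') as f:
    next(f)  # Skip the header row.
    cursor.copy_expert(
        "COPY ebi_mut_db.mutations_affecting_interactions "
        "FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
        f)
conn.commit()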

How to avoid markdown typesetting of $ signs in Jupyter output?

I am reading an Excel file in Jupyter which contains income data, e.g. $2,500 to $4,999. Jupyter renders the text between the dollar signs as LaTeX math in the output.
How can I avoid this formatting?
In pandas>=0.23.0, you can prevent MathJax from rendering perceived LaTeX found in DataFrames. This is achieved using:
import pandas as pd
pd.options.display.html.use_mathjax = False
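For example, with illustrative data:
import pandas as pd
pd.options.display.html.use_mathjax = False
pd.DataFrame({"Income": ["$2,500 to $4,999"]})  # now displayed literally, not as math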
In Jupyter you can use a backslash ( \ ) before the dollar sign to avoid starting a LaTeX math block.
So write \$2,500 in your markdown instead of $2,500.
A markdown cell like this:
Characteristic | Total with Income| \$1 to \$2,499 | \$2,500 to \$4,999
--------------|------------------|----------------|--------------
data | data |data | data
data | data |data | data
will be rendered by Jupyter like so:
If the table is handled with typical Jupyter tools (python, numpy, pandas), you can alter the column names with a short code snippet.
The snippet below replaces every $ in the column names with \$ so that Jupyter renders them without LaTeX math.
import pandas as pd
data = pd.read_excel("test.xlsx")
# Escape each $ so MathJax does not treat it as a math delimiter
data.columns = [col.replace("$", r"\$") for col in data.columns]
data
Before and after screenshot:

Import csv and make mathematical operation with its cells in matlab

I want to import a csv file and do mathematical operations with some specific cells, for example: (C111-C12)/(B111-B12). I tried to import the csv like this:
A_data = dataset('xlsfile','exceldata_A.csv');
and then the operation I tried is:
(A_data.C111-A_data.C12)/(A_data.B111-A_data.B12)
but I am getting a bunch of errors. How can I specify the cells I want to use?
Not sure about dataset (maybe your problem is due to using 'xlsfile' for a CSV file?), but if the file is in CSV format I would use the csvread function:
A_data = csvread('exceldata_A.csv');
Then, instead of (A_data.C111-A_data.C12)/(A_data.B111-A_data.B12), you can access row 111 and column C (column number 3) with A_data(111,3):
(A_data(111,3)-A_data(12,3))/(A_data(111,2)-A_data(12,2))

ipython notebook and patsy categorical variable (formula)

I had the same error as in this question.
What is weird is that it works (with the answer provided) in an IPython shell, but not in an IPython notebook. It is related to the C() operator, because without it the formula works (Region is still treated as categorical, just not via the operator).
The same happens with this example:
import statsmodels.formula.api as smf
import numpy as np
import pandas
url = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv"
df = pandas.read_csv(url)
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())
This works well, both in the IPython notebook and in the shell, and patsy treats Region as a categorical variable because it is composed of strings.
But if I try this (as in the tutorial):
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
I get an error in the IPython notebook:
TypeError: 'Series' object is not callable
Note that the statsmodels and patsy versions are the same in both the notebook and the shell (0.5.0 and 0.3.0, respectively).
Do you have the same error?
I eventually found the problem.
It is because there was a variable called C that I had used much earlier in the notebook. What is surprising, though, is that it was not a column of the df I used.
Anyway, the basic solution is:
del C
before running the regression.
Hope this helps people facing the same problem.
I'm still not sure whether this is expected behavior of patsy.
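For what it's worth, patsy evaluates the names in a formula against the calling namespace before falling back to its own builtins, so a stray variable can shadow the categorical operator. A minimal sketch reproducing the clash (the data is made up):
import pandas as pd
import statsmodels.formula.api as smf
demo = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0],
                     "g": ["a", "b", "a", "b"]})
C = demo["y"]  # a stray variable named C, as in the notebook
try:
    smf.ols("y ~ C(g)", data=demo).fit()  # patsy finds our C first
except TypeError as e:
    print(e)  # 'Series' object is not callable
del C  # removing the shadowing name restores patsy's C()
print(smf.ols("y ~ C(g)", data=demo).fit().params)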