pyspark map function not showing output - pyspark

Here some simple code:
%python
allPaths=dbutils.fs.ls("/user/hive/warehouse")
allPathsFiltered = map(lambda x:(x[0]),allPaths)
matching = [s for s in allPathsFiltered if "tabx" in s]
print(list(allPaths))
print(list(allPathsFiltered))
print(list(matching))
returns:
[FileInfo(path='dbfs:/user/hive/warehouse/aaa/', name='aaa/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/bbb/', name='bbb/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/example_03/', name='example_03/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/sox/', name='sox/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/sox2/', name='sox2/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/sox3/', name='sox3/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/src/', name='src/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/src2/', name='src2/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/tabx/', name='tabx/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/taby/', name='taby/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/vrba/', name='vrba/', size=0), FileInfo(path='dbfs:/user/hive/warehouse/zzz/', name='zzz/', size=0)]
[]
['dbfs:/user/hive/warehouse/tabx/']
Why does the 2nd print not show any data?

allPathsFiltered is an iterator, and you have already iterated through it when defining matching, so if you try to iterate through it again using list(allPathsFiltered) it won't return anything.
You can try
allPaths = dbutils.fs.ls("/user/hive/warehouse")
allPathsFiltered = list(map(lambda x:(x[0]),allPaths))
matching = [s for s in allPathsFiltered if "tabx" in s]
Personally I prefer list comprehensions though, like
allPathsFiltered = [x[0] for x in allPaths]

Related

KeyError: exp(t) using dsolve from sympy for simple ODE

I am struggling to understand the behaviour of dsolve for this simple ODE:
Y''(t) = b*Y'(t) + f(t)
For some reason, dsolve throws an error if I use f(t)=exp(t-a), but for general f(t) or f(t)=exp(a*t) or if I put a value for a, dsolve succeeds. The complete error message:
File "~/.local/lib/python3.7/site-packages/sympy/solvers/ode.py",
line 679, in dsolve
return _helper_simplify(eq, hint, hints, simplify, ics=ics)
File "~/.local/lib/python3.7/site-packages/sympy/solvers/ode.py",
line 704, in _helper_simplify
sols = solvefunc(eq, func, order, match)
File "~/.local/lib/python3.7/site-packages/sympy/solvers/ode.py",
line 5674, in ode_nth_linear_constant_coeff_undetermined_coefficients
return _solve_undetermined_coefficients(eq, func, order, match)
File "~/.local/lib/python3.7/site-packages/sympy/solvers/ode.py",
line 5766, in _solve_undetermined_coefficients
coeffsdict[s[x]] += s['coeff']
KeyError: exp(t)
I am using this code:
from sympy import symbols, Function, dsolve, exp, Eq
a, b, t = symbols('a b t')
Y = Function('Y')(t)
#f = Function('f')(t) # works
#f = exp(a*t) # works
f = exp(t-a) # KeyError: exp(t)
#f = exp(t-2) # works
odeY = Eq( Y.diff(t,t), b*Y.diff(t) + f )
dsolve(odeY,Y)
I am using sympy version 1.5.1 with python3.7
Many thanks!

Compare contrasts in linear model in Python (like Rs contrast library?)

In R I can do the following to compare two contrasts from a linear model:
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
install.packages("downloader", repos="http://cran.us.r-project.org")
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv(filename, skip=1)
head(spider, 5)
# leg type friction
# 1 L1 pull 0.90
# 2 L1 pull 0.91
# 3 L1 pull 0.86
# 4 L1 pull 0.85
# 5 L1 pull 0.80
fit = lm(friction ~ type + leg, data=spider)
fit
# Call:
# lm(formula = friction ~ type + leg, data = spider)
#
# Coefficients:
# (Intercept) typepush legL2 legL3 legL4
# 1.0539 -0.7790 0.1719 0.1605 0.2813
install.packages("contrast", repos="http://cran.us.r-project.org")
library(contrast)
l4vsl2 = contrast(fit, list(leg="L4", type="pull"), list(leg="L2",type="pull"))
l4vsl2
# lm model parameter contrast
#
# Contrast S.E. Lower Upper t df Pr(>|t|)
# 0.1094167 0.04462392 0.02157158 0.1972618 2.45 277 0.0148
I have found out how to do much of the above in Python:
import pandas as pd
df = pd.read_table("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv", sep=",", skiprows=1)
df.head(2)
import statsmodels.formula.api as sm
model1 = sm.ols(formula='friction ~ type + leg', data=df)
fitted1 = model1.fit()
print(fitted1.summary())
Now all that remains is finding the t-statistic for the contrast of leg pair L4 vs. leg pair L2. Is this possible in Python?
statsmodels is still missing some predefined contrasts, but the t_test and wald_test or f_test methods of the model Results classes can be used to test linear (or affine) restrictions. The restrictions either be given by arrays or by strings using the parameter names.
Details for how to specify contrasts/restrictions should be in the documentation
for example
>>> tt = fitted1.t_test("leg[T.L4] - leg[T.L2]")
>>> print(tt.summary())
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 0.1094 0.045 2.452 0.015 0.022 0.197
==============================================================================
The results are attributes or methods in the instance that is returned by t_test. For example the conf_int can be obtained by
>>> tt.conf_int()
array([[ 0.02157158, 0.19726175]])
t_test is vectorized and treats each restriction or contrast as separate hypothesis. wald_test treats a list of restrictions as joint hypothesis:
>>> tt = fitted1.t_test(["leg[T.L3] - leg[T.L2], leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 -0.0114 0.043 -0.265 0.792 -0.096 0.074
c1 0.1094 0.045 2.452 0.015 0.022 0.197
==============================================================================
>>> tt = fitted1.wald_test(["leg[T.L3] - leg[T.L2], leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
<F test: F=array([[ 8.10128575]]), p=0.00038081249480917173, df_denom=277, df_num=2>
Aside: this also works for robust covariance matrices if cov_type was specified as argument to fit.

MATLAB: errorn in butter() command

I wrote the following function:
function [output_signal] = AddDirectivityError (bat_loc_index, butter_deg_vector, sound_matrix)
global chirp_initial_freq ;
global chirp_end_freq;
global sampling_rate;
global num_of_mics;
global sound_signal_length;
for (i=1 : num_of_mics)
normalized_co_freq = (chirp_initial_freq + chirp_end_freq)/ (1.6* sampling_rate);
A=sound_matrix ( i, : ) ;
peak_signal=max(A);
B=find(abs(A)>peak_signal/100);
if (butter_deg_vector(i)==0)
butter_deg_vector(i)=2;
end
[num, den] = butter(butter_deg_vector(i), normalized_co_freq, 'low');// HERE!!!
filtered_signal=filter(num,den, A );
output_signal(i, :)=filtered_signal;
end
This functions runs many-many times without any error. However, when I reach the line: [num, den] = butter ( butter_deg_vector(i), normalized_co_freq, 'low');
And the local variables are: i=3, butter_deg_vector(i)=1, normalized_co_freq=5.625000e-001
MATLAB prompts an error says:
??? Error using ==> buttap Expected N to be integer-valued.
"Error in ==> buttap at 15 validateattributes(n,{'numeric'},{'scalar','integer','positive'},'buttap','N');
Error in ==> butter at 70 [z,p,k] = buttap(n);"
I don't understand why this problem occurs especially in this iteration. Why does this function prompt an error especially in this case?
Try to change the code line for:
[num, den] = butter (round(butter_deg_vector(i)), normalized_co_freq, 'low');

"LapackError: Parameter a has non-native byte order in lapack_lite.dgesdd" when importing from Matlab files

After importing this data file from Matlab with scipy.io.loadmat, things appeared to work fine until we tried to calculate the conditioning number of one of the matrixes within.
Here's the minimum amount of code that reproduces for us:
import scipy
import numpy
stuff = scipy.io.loadmat("dati-esercizio1.mat")
numpy.linalg.cond(stuff["A"])
Here's the extended stacktrace courtesy of iPython:
In [3]: numpy.linalg.cond(A)
---------------------------------------------------------------------------
LapackError Traceback (most recent call last)
/snip/<ipython-input-3-15d9ef00a605> in <module>()
----> 1 numpy.linalg.cond(A)
/snip/python2.7/site-packages/numpy/linalg/linalg.py in cond(x, p)
1409 x = asarray(x) # in case we have a matrix
1410 if p is None:
-> 1411 s = svd(x,compute_uv=False)
1412 return s[0]/s[-1]
1413 else:
/snip/python2.7/site-packages/numpy/linalg/linalg.py in svd(a, full_matrices, compute_uv)
1313 work = zeros((lwork,), t)
1314 results = lapack_routine(option, m, n, a, m, s, u, m, vt, nvt,
-> 1315 work, -1, iwork, 0)
1316 lwork = int(work[0])
1317 work = zeros((lwork,), t)
LapackError: Parameter a has non-native byte order in lapack_lite.dgesdd
All obvious ideas (like flattening and reshaping the matrix or recreating the matrix from scratch reassigning it element by element) failed. How can I want to massage the data, then, in order to make it more agreeable with numpy?
It's a bug, fixed some time ago: https://github.com/numpy/numpy/pull/235
Workaround:
np.linalg.cond(stuff['A'].newbyteorder('='))
This works for me:
In [33]: stuff = loadmat('dati-esercizio1.mat')
In [34]: a = stuff['A']
In [35]: try: np.linalg.cond(a)
....: except: print "Fail!"
Fail!
In [36]: b = np.array(a, dtype='>d')
In [37]: np.linalg.cond(b)
Out[37]: 62493201976.673141
In [38]: np.all(a == b) # Verify they hold the same data.
Out[38]: True
Apparently it's something wrong with the byte order (endianness?) of each number in the resulting ndarray and not just with the ndarray object itself.
Something like this but more elegant should do the trick:
n, m = A.shape()
B = numpy.empty_like(A)
for i in xrange(n):
for j in xrange(m):
B[i,j] = float(A[i,j])
del A
B = A
print numpy.linalg.cond(A) # 62493210091.354507
(For some reason an in-place replacement still gives that error - so there's something wrong with the byte order of the whole object, too.)

DNA to RNA and Getting Proteins with Perl

I am working on a project(I have to implement it in Perl but I am not good at it) that reads DNA and finds its RNA. Divide that RNA's into triplets to get the equivalent protein name of it. I will explain the steps:
1) Transcribe the following DNA to RNA, then use the genetic code to translate it to a sequence of amino acids
Example:
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
2) To transcribe the DNA, first substitute each DNA for it’s counterpart (i.e., G for C, C for G, T for A and A for T):
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
AGTATTATGCAAAACATAAGCGGTCGCGAAGCCACA
Next, remember that the Thymine (T) bases become a Uracil (U). Hence our sequence becomes:
AGUAUUAUGCAAAACAUAAGCGGUCGCGAAGCCACA
Using the genetic code is like that
AGU AUU AUG CAA AAC AUA AGC GGU CGC GAA GCC ACA
then look each triplet (codon) up in the genetic code table. So AGU becomes Serine, which we can write as Ser, or
just S. AUU becomes Isoleucine (Ile), which we write as I. Carrying on in this way, we get:
SIMQNISGREAT
I will give the protein table:
So how can I write that code in Perl? I will edit my question and write the code that what I did.
Try the script below, it accepts input on STDIN (or in file given as parameter) and read it by line. I also presume, that "STOP" in the image attached is some stop state. Hope I read it all well from that picture.
#!/usr/bin/perl
use strict;
use warnings;
my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;
LINE: while (<>) {
chomp;
y/GCTA/CGAU/; # translate (point 1&2 mixed)
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
else {
print "Whoops, stop state?\n";
next LINE;
}
}
print "\n"
}