I am using Apache Zeppelin. My Anaconda version is conda 4.8.4, and my Spark version is:
%spark2.pyspark
spark.version
u'2.3.1.3.0.1.0-187'
When I run my code, it throws the following error:
Exception AttributeError: "'StringIndexer' object has no attribute '_java_obj'" in <object repr() failed> ignored
Fail to execute line 4: indexerFeatures = StringIndexer(inputCols=catColumns, outputCols=catIndexedColumns, handleInvalid="keep")
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-66369397479549554.py", line 375, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 4, in <module>
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/__init__.py", line 105, in wrapper
return func(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'outputCols'
I ran the same code in Databricks and everything worked fine. I also checked the StringIndexer signature with the help() function, and it didn't include the outputCols argument.
It should be outputCol, not outputCols. In Spark 2.3.1, StringIndexer only supports a single input/output column; the multi-column inputCols/outputCols parameters were added in Spark 3.0, which is why the same code works on a newer Databricks runtime.
For spark 2.3.1, you can refer to: https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer
class pyspark.ml.feature.StringIndexer(inputCol=None, outputCol=None, handleInvalid='error', stringOrderType='frequencyDesc')
Related
I am just getting started with Python, though I know a bit of R. I want to replicate something someone has already done. I am getting this error in one of my Jupyter kernels and don't immediately know what to do about it. Does anyone have any input or experience with it?
Traceback (most recent call last):
File "parse.py", line 8, in <module>
from param import ranked_blast_output_schema, blast_outfmt6_schema
ImportError: cannot import name 'ranked_blast_output_schema' from 'param' (/Users/myaccount/miniconda3/lib/python3.8/site-packages/param/__init__.py)
Traceback (most recent call last):
File "lca_analysis.py", line 52, in <module>
if ("~" in blast_results["query"].iloc[0]):
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 894, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1500, in _getitem_axis
self._validate_integer(key, axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1443, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
Ah okay. I am replicating a Jupyter notebook, and I was told to run this command in bash:
lca_analysis.py --blast_type nt --fpath s3://org_name/contigs/CMS001_002_Ra_S1/blast_nt.m9 --filtered_blast_path s3://bucket_name/contig_quality/CMS001_002_Ra_S1/blast_nt_filtered.m9 --excluded_contigs_path s3://bucket_name/contig_quality/CMS001_002_Ra_S1/exclude_contigs_nt.txt --outpath s3://bucket_name/contig_quality/CMS001_002_Ra_S1/lca_nt.m9 --read_count_path s3://bucket_name/contigs/CMS001_002_Ra_S1/contig_stats.json --verbose True
But then I got this error:
Read counts have been loaded: s3://bucket_name/contigs/CMS001_002_Ra_S1/contig_stats.json| elapsed time: 0.73 seconds
/var/folders/ns/gdtc2hvx1g13_29wct4qkhd80000gq/T/tmp3m3k6w9o blast file downloaded to this tempfile
Traceback (most recent call last):
File "parse.py", line 8, in <module>
from param import ranked_blast_output_schema, blast_outfmt6_schema
ImportError: cannot import name 'ranked_blast_output_schema' from 'param' (/Users/myaccount/miniconda3/lib/python3.8/site-packages/param/__init__.py)
Traceback (most recent call last):
File "lca_analysis.py", line 52, in <module>
if ("~" in blast_results["query"].iloc[0]):
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 894, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1500, in _getitem_axis
self._validate_integer(key, axis)
File "/Users/myaccount/miniconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1443, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
I had just used pip install param to get param, and the install went just fine.
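For what it's worth, `pip install param` installs the HoloViz parameter library, which is probably not what parse.py expects: the ImportError suggests the script wants a project-local param.py (one that defines ranked_blast_output_schema), and the pip-installed package is being resolved instead. A quick, hedged way to check which module Python actually resolves:

```python
import importlib.util

# Ask Python where the name "param" would be imported from,
# without actually importing it.
spec = importlib.util.find_spec("param")
if spec is None:
    print("no module named 'param' on sys.path")
else:
    print("'param' resolves to:", spec.origin)
# If this prints a site-packages path rather than a file in the
# project directory, the pip-installed package is standing in for
# the project's own param.py.
```

If that is the case, running the script from the repository directory that contains its param.py (or uninstalling the pip package) should change which module wins.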
I've installed all the prerequisites. The build fails with the following error when running the ./waf command.
Traceback (most recent call last):
File "/usr/lib/python3.6/py_compile.py", line 125, in compile
_optimize=optimize)
File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3/dist-packages/visualizer/visualizer/svgitem.py", line 123
raise AttributeError, 'unknown property %s' % pspec.name
^
SyntaxError: invalid syntax
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 3, in <module>
File "/usr/lib/python3.6/py_compile.py", line 129, in compile
raise py_exc
py_compile.PyCompileError: File "/usr/local/lib/python3/dist-packages/visualizer/visualizer/svgitem.py", line 123
raise AttributeError, 'unknown property %s' % pspec.name
^
SyntaxError: invalid syntax
I would appreciate any help. Thank you.
The error comes from visualizer code written for Python 2 (the `raise AttributeError, ...` form is a SyntaxError in Python 3), so one workaround is to disable the Python bindings. First clean your previous failed build, then configure with Python disabled:
./waf distclean
./waf --disable-python configure
./waf
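If you actually need the visualizer rather than disabling it, the offending line can be ported to Python 3 syntax. A sketch of the change (the surrounding class is simplified away; `FakePropertySpec` is a made-up stand-in for the GObject property spec that svgitem.py receives):

```python
class FakePropertySpec:
    """Stand-in for the property-spec object handled in svgitem.py."""
    name = "unknown-prop"

def do_set_property(pspec, value=None):
    # Python 2 form, rejected by Python 3's parser:
    #     raise AttributeError, 'unknown property %s' % pspec.name
    # Python 3 equivalent: call the exception class instead.
    raise AttributeError('unknown property %s' % pspec.name)
```

The same one-line change (comma form to call form) applies wherever the old raise syntax appears in the module.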
I have a custom JDBC Dialect in Scala, which works flawlessly through registerDialect method in Scala Spark API. I was hoping to use the same class in PySpark by accessing it through
sc._jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(sc._jvm.com.me.MyJDBCDialect)
But I receive this error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1094, in _build_args
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 289, in get_command_part
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1363, in __getattr__
py4j.protocol.Py4JError: com.me.MyJDBCDialect._get_object_id does not exist in the JVM
I'm totally unfamiliar with Py4J, but it sounds like the _get_object_id error is raised because sc._jvm.com.me.MyJDBCDialect is a Python-side reference, and I'm trying to pass it to sc._jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect, which is a Java construct. How do I get around this problem?
This worked for me. Make sure that your dialect is declared as a class, not an object:
from py4j.java_gateway import java_import
gw = spark.sparkContext._gateway
java_import(gw.jvm, "com.me.MyJDBCDialect")
gw.jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(gw.jvm.com.me.MyJDBCDialect())
Note the trailing (): it calls the class constructor, so you pass an instance of your dialect rather than the class reference itself.
I just installed gsutil on OS X, exactly following Google's instructions, and am seeing errors of the following format when running any gsutil command:
Traceback (most recent call last):
File "/Users//gsutil/gsutil", line 22, in <module>
gsutil.RunMain()
File "/Users//gsutil/gsutil.py", line 101, in RunMain
sys.exit(gslib.__main__.main())
File "/Users//gsutil/gslib/__main__.py", line 175, in main
command_runner = CommandRunner()
File "/Users//gsutil/gslib/command_runner.py", line 107, in __init__
self.command_map = self._LoadCommandMap()
File "/Users//gsutil/gslib/command_runner.py", line 113, in _LoadCommandMap
__import__('gslib.commands.%s' % module_name)
File "/Users//gsutil/gslib/commands/disablelogging.py", line 16, in <module>
from gslib.command import COMMAND_NAME
ImportError: cannot import name COMMAND_NAME
This error occurs in several modules in the commands directory. The only way I could get rid of these errors was to remove the modules that reference COMMAND_NAME: disablelogging, enablelogging, getacl, getcors, getdefacl, getlogging, setacl, setcors, setdefacl.
Did I do the right thing here? Is this a bug in gsutil?
I tried to install and run gsutil and am getting the following error:
Traceback (most recent call last):
File "/Users/groovebug/gsutil/gsutil", line 88, in <module>
sys.exit(gslib.__main__.main())
File "/Users/groovebug/gsutil/gslib/__main__.py", line 93, in main
command_runner = CommandRunner(config_file_list)
File "/Users/groovebug/gsutil/gslib/command_runner.py", line 102, in __init__
self.command_map = self._LoadCommandMap()
File "/Users/groovebug/gsutil/gslib/command_runner.py", line 112, in _LoadCommandMap
__import__('gslib.commands.%s' % module_name)
File "/Users/groovebug/gsutil/gslib/commands/disablelogging.py", line 18, in <module>
from gslib.command import CONFIG_REQUIRED
ImportError: cannot import name CONFIG_REQUIRED
I reinstalled and continued to get it, and haven't found a solution elsewhere.
gsutil no longer uses that variable. If you update to the latest version of gsutil this problem should no longer happen:
gsutil update
CONFIG_REQUIRED is just a constant name for a key used in a dictionary.
If you open ${Directory_Containing_gsutil}/gsutil/gslib/command.py
and add the line
CONFIG_REQUIRED = 'config_required'
it solves the problem. I'm not sure why that line is missing.