I have a Spark DataFrame and I would like to use VectorAssembler to create a "features" column.
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=sel_cols, outputCol='features')
transformed_data = assembler.transform(sdf)
sel_cols is a list of 150 strings, which looks as follows:
['ASP.NET Core',
'ASP.NET MVC',
'AWS',
'AWS DynamoDB',
'AWS EMR',
'AWS SDK',
'Adobe Photoshop',
'Agile',
'Agile software development29',
'Ajax',
'Amazon CloudFront CDN',
'Amazon EC2',
'Android',
'Angular',
'AngularJS',
'Apache',
'Apache Hive',
'Apache Spark',
'Atom',...]
And sdf.columns consists of 340 items and looks as follows:
['.NET',
'.NET 4',
'.NET Core',
'ADO.NET',
'AFNetworking',
'API Architecture',
'API Design',
'API Development',
'APIs',
'ASP.NET',
'ASP.NET Core',
'ASP.NET MVC',
'ASP.NET Web API',
'AWS',
'AWS DynamoDB',...]
I am getting this error when applying transformed_data = assembler.transform(sdf):
AnalysisException: Cannot resolve column name "ASP.NET Core" among (.NET, .NET 4, .NET Core, ADO.NET, AFNetworking, API Architecture, API Design, API Development, APIs, ASP.NET, ASP.NET Core, ASP.NET MVC, ASP.NET Web API, AWS, AWS DynamoDB, AWS EC2, AWS ECS, AWS EMR, AWS HA, AWS Lambda, AWS RDS, AWS S3, AWS SDK, Adobe Illustrator,...
As shown, "ASP.NET Core" is definitely among my sdf.columns, and as far as I understand it, passing sel_cols as a list of strings to VectorAssembler's inputCols should work... I would really appreciate any insight, as I haven't worked with Spark DataFrames before :)
Thank you!
VectorAssembler cannot handle columns with a space or a dot in the column name. Another answer of mine provides some technical background on why it does not work.
The only option is to rename the columns:
from pyspark.sql import functions as F

# build a dict (original column name -> new column name without dots or spaces)
mapping = {col: col.replace('.', '_').replace(' ', '_') for col in sel_cols}

# select all columns and create an alias if there is a mapping for this column
df_renamed = sdf.select([F.col('`' + c + '`').alias(mapping.get(c, c)) for c in sdf.columns])

# create a VectorAssembler that uses the renamed columns as input
assembler = VectorAssembler(inputCols=list(mapping.values()), outputCol='features')
transformed_data = assembler.transform(df_renamed)
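As a quick sanity check (purely illustrative), the column from the error message ends up with neither dots nor spaces after the renaming:

# hypothetical check of the mapping built above
print(mapping['ASP.NET Core'])   # -> ASP_NET_Core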
I have asked the same question on the Microsoft Q&A site too.
In ADF, I tried to call the Get Blob REST API: https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob
I got this error message: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature."
I'd like to read an image or other unstructured file and insert it into a varchar(max) column in SQL Server (source: binary to sink: binary in SQL Server).
My pipeline is configured as below.
linked service:
base url: https://{account name}.blob.core.windows.net/
authentication type: anonymous
server certificate: disabled
type: Rest
data set
type :Rest
relative url: {container name}/xyz.jpeg
copy data activity
request method: get
x-ms-date: #concat(formatDateTime(utcNow(), 'yyyy-MM-ddTHH:mm:ss'), 'Z')
x-ms-version: 2018-11-09
x-ms-blob-type: BlockBlob
Authorization: SharedKey {storage name}:CBntp....{SAS key}....LsIHw%3D
(I took the key from a SAS connection string....https&sig=CBntp{SAS key}LsIHw%3D)
Is it possible to call the Azure Blob REST API in ADF pipelines?
Unfortunately this is not possible, because when using a Binary dataset in a copy activity, you can only copy from a Binary dataset to a Binary dataset.
See the source and sink dataset property tables for the Binary format:
Reference - https://learn.microsoft.com/en-us/azure/data-factory/format-binary#copy-activity-properties
Using the steps documented in structured streaming pyspark, I'm unable to create a DataFrame in PySpark from the Azure Event Hub I have set up in order to read the stream data.
Error message is:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have installed the following Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable), but none appear to work:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
I have also tried ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString), but the error message returned is:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct as it is also used in a console application that writes to the Azure Event Hub and that works.
Can someone point me in the right direction, please. Code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
To resolve the issue, I did the following:
Uninstall the previously installed Azure Event Hubs library versions
Install the com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library from Maven Central
Restart the cluster
Validate by re-running the code provided in the question
I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a Spark cluster running Spark 3.0 with Scala 2.12.
For anyone else finding this via Google: check whether you have the correct Scala library version. In my case, my cluster is Spark v3 with Scala 2.12.
Changing the "2.11" in the library version from the tutorial I was using to "2.12", so it matches my cluster runtime version, fixed the issue.
I had to take this a step further. In the format method I had to pass the fully qualified provider class name directly, as in the sketch below:
.format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider")
Check the cluster Scala version and the library version.
Uninstall the older libraries and install:
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
in the shared workspace (right click and install library) and also on the cluster.
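A quick, hedged way to check those versions from a notebook (assuming a running SparkSession named spark, as in the question) is:

# print the Spark and Scala versions of the cluster so the
# azure-eventhubs-spark artifact suffix (_2.11 vs _2.12) can be matched to them
print("Spark version:", spark.version)
print("Scala version:", spark.sparkContext._jvm.scala.util.Properties.versionString())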
I am new to Apache Beam and I come from the Spark world, where the API is so rich.
How can I get the schema of a Parquet file using Apache Beam, without loading the data into memory? The file can be huge, and I am interested only in knowing the columns, and optionally the column types.
The language is Python.
The storage system is Google Cloud Storage, and the Apache Beam job must be run in Dataflow.
FYI, I have tried the following, as suggested on Stack Overflow:
from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata
First, it didn't work when I gave it a gs://.. path, giving me this error: error: No such file or directory
Then I tried with a local file on my machine, and slightly changed the code to:
from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata.schema
And so I could get the columns:
<pyarrow._parquet.ParquetSchema object at 0x10927cfd0>
name: BYTE_ARRAY
age: INT64
hobbies: BYTE_ARRAY String
But this solution, as it seems to me, requires me to fetch the file locally (onto the Dataflow server??), and it doesn't use Apache Beam.
Any (better) solution?
Thank you!
I'm happy I could come up with a hand-made solution after reading the source code of apache_beam.io.parquetio:
import os
import pyarrow.parquet as pq
from apache_beam.io.parquetio import _ParquetSource

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '<json_key_path>'

# arguments: file_pattern, min_bundle_size, validate, columns
ps = _ParquetSource("", None, None, None)
with ps.open_file("<GCS_path_of_parquet_file>") as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.schema)
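If you would rather not depend on the private _ParquetSource class, a similar hand-made sketch can go through Beam's public FileSystems API instead; this is an assumption on my part that the opened GCS stream is seekable enough for pyarrow to read just the footer metadata, so treat it as a variant to try rather than a guaranteed solution:

import pyarrow.parquet as pq
from apache_beam.io.filesystems import FileSystems

# open the GCS object through Beam's filesystem abstraction and let pyarrow
# read only the footer metadata instead of the whole file
with FileSystems.open("<GCS_path_of_parquet_file>") as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.schema)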
I'm using the Netflix/Archaius library for configuration management of a Scala/Spark job. Currently my property files are in the Java properties format, like:
com.comp.appname.solrconfig.collectionname=testcollection
com.comp.appname.solrconfig.url=http://someip:8983/solr
com.comp.appname.solrconfig.zkhost=someip:2181
com.comp.appname.sparkconfig.executors=20
com.comp.appname.sparkconfig.executormb=200
I would like to convert it to YAML format (which is simpler and more readable):
solrconfig:
collectionName: testcollection
url: "http://someip:8983/solr"
zkhost: "someip:2181"
sparkConfig:
executors: 20
executormb: 200
My need is to use the Netflix/Archaius library when initiating a Scala Spark job.
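For the purely mechanical part of the conversion (independent of whether Archaius can consume YAML), a throwaway script along these lines can turn the flat properties into nested YAML; it assumes PyYAML is available, simply drops the com.comp.appname. prefix, and does not re-camel-case keys such as collectionName, which would need an extra mapping:

# hypothetical one-off conversion script, not related to Archaius itself
import yaml

props = {
    "com.comp.appname.solrconfig.collectionname": "testcollection",
    "com.comp.appname.solrconfig.url": "http://someip:8983/solr",
    "com.comp.appname.solrconfig.zkhost": "someip:2181",
    "com.comp.appname.sparkconfig.executors": 20,
    "com.comp.appname.sparkconfig.executormb": 200,
}

PREFIX = "com.comp.appname."
nested = {}
for key, value in props.items():
    parts = key[len(PREFIX):].split(".")   # e.g. ['solrconfig', 'collectionname']
    node = nested
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value

print(yaml.safe_dump(nested, default_flow_style=False))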
I am trying to download a file for the first time from Google Cloud Storage.
I set the path to the googstruct.json service account key file that I downloaded from https://cloud.google.com/storage/docs/reference/libraries#client-libraries-usage-python
Do I need to set up the authorization to Google Cloud outside the code somehow? Or is there a better "How to use Google Cloud Storage" guide than the one on the Google site?
It seems like I am passing the wrong type to storage_client = storage.Client(). The exception string is below.
Exception has occurred: google.auth.exceptions.DefaultCredentialsError
The file C:\Users\Cary\Documents\Programming\Python\QGIS\GoogleCloud\googstruct.json does not have a valid type.
Type is None, expected one of ('authorized_user', 'service_account').
MY PYTHON 3.7 CODE
from google.cloud import storage
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "C:\\GoogleCloud\\googstruct.json"


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name
    ))


# Instantiates a client
storage_client = storage.Client()

bucket_name = 'structure_ssi'
destination_file_name = "C:\\Users\\18809_PIPEM.shp"
source_blob_name = '18809_PIPEM.shp'

download_blob(bucket_name, source_blob_name, destination_file_name)
I did look at this but I cannot tell if this is my issue. I have tried both.
('Unexpected credentials type', None, 'Expected', 'service_account') with oauth2client (Python)
This error means that the JSON service account credentials file that you are trying to use, C:\\GoogleCloud\\googstruct.json, is corrupt or of the wrong type.
The first (or second) line in the file googstruct.json should be "type": "service_account".
A few more items to improve your code:
You do not need to use \\; just use / to make your code easier and cleaner to read.
Load your credentials directly and do not modify environment variables:
storage_client = storage.Client.from_service_account_json('C:/GoogleCloud/googstruct.json')
Wrap API calls in try / except. Stack traces do not impress customers. It is better to have clear, simple, easy to read error messages.
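Putting those suggestions together, here is a minimal sketch of the download with from_service_account_json and basic error handling (the bucket, blob, and path names are taken from the question; the exception classes are the ones I would expect from google-cloud-storage, so adjust if your version differs):

from google.cloud import storage
from google.cloud.exceptions import NotFound, GoogleCloudError


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket, with basic error handling."""
    # load credentials directly from the key file instead of the environment variable
    storage_client = storage.Client.from_service_account_json(
        'C:/GoogleCloud/googstruct.json')
    try:
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(source_blob_name)
        blob.download_to_filename(destination_file_name)
        print('Blob {} downloaded to {}.'.format(source_blob_name, destination_file_name))
    except NotFound:
        print('Bucket or blob not found: {}/{}'.format(bucket_name, source_blob_name))
    except GoogleCloudError as e:
        print('Download failed: {}'.format(e))


download_blob('structure_ssi', '18809_PIPEM.shp', 'C:/Users/18809_PIPEM.shp')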