DataBricks ADLS Gen 2 Mount missing all subfolders and files - pyspark

I'm finally getting to grips with Databricks.
I've mounted an ADLS Gen2 account within Databricks, but when I list my files I only see one folder within the mount. Can someone help me see where I am going wrong?
cmd1
adlsAccountName = "accountname"
adlsContainerName = "databricks"
adlsFolderName = "ARCHIVE"
mountpoint = "/mnt/files"
applicationID = dbutils.secrets.get(scope = "Secret02", key="AppID")
authenticationKey = dbutils.secrets.get(scope = "Secret02", key="ClientSecret")
tenantID = dbutils.secrets.get(scope = "Secret02", key="tenantid")
endpoint = "https://login.microsoft.com/" + tenantID + "/oauth2/token"
source = "abfss://" +adlsContainerName+"@"+adlsAccountName+".dfs.core.windows.net/"+adlsFolderName+"/"
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": applicationID,
"fs.azure.account.oauth2.client.secret": authenticationKey,
"fs.azure.account.oauth2.client.endpoint": endpoint,
"spark.databricks.sqldw.jdbc.service.principal.client.id": applicationID,
"spark.databricks.sqldw.jdbc.service.principal.client.secret": authenticationKey
}
dbutils.fs.mount(
source = source,
mount_point = mountpoint,
extra_configs = configs)
cmd2
%fs
ls "mnt/files"
ADLS Gen2 Storage
Databricks

If the data you need is in a subfolder, append that subfolder to the end of the storage URL and then mount it.
Based on your code, the source would follow this syntax:
source = "abfss://" +adlsContainerName+"@"+adlsAccountName+".dfs.core.windows.net/"+adlsFolderName+"/"+subfolderName+"/"
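One way to keep the source string consistent is to build it from its parts. The following is only a sketch: `build_abfss_source` is a hypothetical helper (it is not part of dbutils), and the subfolder name "2021" is a made-up placeholder.

```python
def build_abfss_source(container, account, *folders):
    """Return an abfss:// URL for a container, optionally narrowed to a folder path.

    Hypothetical helper for illustration; it is not part of dbutils.
    """
    base = "abfss://{0}@{1}.dfs.core.windows.net/".format(container, account)
    # Drop empty segments and stray slashes so the path has single separators.
    parts = [f.strip("/") for f in folders if f and f.strip("/")]
    return base + "".join(p + "/" for p in parts)


# Container, account and folder come from the question; "2021" is an assumed subfolder.
source = build_abfss_source("databricks", "accountname", "ARCHIVE", "2021")
print(source)
```

The resulting string can be passed as the `source` argument of `dbutils.fs.mount`, exactly as in the question's cmd1.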

Create an Azure PostgreSQL schema using Terraform on an Azure PostgreSQL database

I am able to create an azurerm_postgresql_flexible_server and an azurerm_postgresql_flexible_server_database using Terraform.
I am not able to create a schema using Terraform, and I could not find much help in the documentation.
I also checked https://registry.terraform.io/providers/cyrilgdn/postgresql/latest/docs/resources/postgresql_schema
but that uses a different provider. I am not sure what I am missing here.
This is the Terraform template that creates the Azure PostgreSQL server and database:
module "common_modules" {
source = "../modules/Main"
}
provider "azurerm" {
features {}
}
locals {
#Construct Tag Data for Resource
resourceTags = {
environment = var.environment
createdBy = var.createdBy
managedBy = var.managedBy
colorBand = var.colorBand
purpose = var.purpose
lastUpdateOn = formatdate("DD-MM-YYYY hh:mm:ss ZZZ", timestamp())
}
}
resource "azurerm_postgresql_flexible_server" "postgreSQL" {
name = var.postgreSQL
location = var.location
resource_group_name = var.ckeditorResorceGroup
administrator_login = var.postgreSQLAdmin
administrator_password = var.password
sku_name = "B_Standard_B1ms"
version = "13"
storage_mb = 32768
backup_retention_days = 7
geo_redundant_backup_enabled = false
tags = local.resourceTags
}
resource "azurerm_postgresql_flexible_server_database" "postgreSQLDB" {
name = var.postgreSQLDB
server_id = azurerm_postgresql_flexible_server.postgreSQL.id
collation = "en_US.utf8"
charset = "utf8"
}
resource "azurerm_postgresql_flexible_server_firewall_rule" "postgreSQLFirewallRule" {
name = "allow_access_to_azure_services"
server_id = azurerm_postgresql_flexible_server.postgreSQL.id
start_ip_address = "0.0.0.0"
end_ip_address = "0.0.0.0"
}
Have a look at https://registry.terraform.io/providers/cyrilgdn/postgresql or https://github.com/cyrilgdn/terraform-provider-postgresql.
It is usable, but you need network connectivity to resolve names (Azure private DNS zone) and to connect to the PostgreSQL flexible server. The Terraform code should run in the same VNet as the flexible server.

Alpakka S3 library for "assume role"

I am trying to push records to AWS S3 with the Alpakka S3 library. For security reasons I have to assume a role, because my IAM user doesn't have PUT access. With the AWS CLI I can push to S3 successfully by passing the --profile parameter (aws s3 --profile <profile>). How do I assume a role in the Alpakka S3 library?
My application.conf file has the credentials as in https://github.com/akka/alpakka/blob/v3.0.4/s3/src/main/resources/reference.conf:
alpakka.s3 {
# default values for AWS configuration
aws {
# to use the same configuration as if credentials.provider = default
credentials {
# static credentials
provider = default //static
access-key-id = <> //valid access key exists in original code
secret-access-key = <>
}
My upload code is:
object Experiments extends App{
//AWS S3 configs
val s3BucketName = config.getString("alpakka.s3.bucket_name")
val s3BucketRegion = config.getString("alpakka.s3.bucket_region")
val bucket_path = config.getString("alpakka.s3.path_inside_bucket")
Source.single(record)
.map { Record => println("Record value: " + Record)
val K_record = //some parsing for the record into JSON
//push to S3
val bucketKey = s"$bucket_path/${K_record.session_id}.json"
try{
Source.single(ByteString(K_record.featureSet))
.runWith(S3.multipartUpload(bucket = s3BucketName, key = bucketKey))
}
catch{
case e2:Exception => println(s"S3 pushing error: ${e2.getMessage}")
}
}
.runWith(Sink.ignore)

How to get the configuration file path within JAR file for KafkaProducer SSL setup?

I have a JAR file with below structure:
example.jar
|
+-org
| +-springframework
| +-boot
| +-loader.jar
+-BOOT-INF
+-classes
| +- kafka
| truststore.jks ==> I want to get the path here
+-lib
+-dependency1.jar
How can I get the path, as a string, of the 'kafka/truststore.jks' file?
Because I am applying SSL for the KafkaProducer, I am using the code below, which works fine locally:
@Value("classpath:kafka/truststore.jks")
private org.springframework.core.io.Resource sslTruststoreResource;
...
String sslTruststoreLocation = sslTruststoreResource.getFile().getAbsolutePath(); // ==> throws FileNotFoundException on the deployed server; runs fine locally
Map<String, Object> config = Maps.newHashMap();
config.put("ssl.truststore.location", sslTruststoreLocation);
But when I deploy to the server, it throws a FileNotFoundException.
After many days of research, I found that sslTruststoreResource.getFile() fails when the resource is inside a JAR file.
sslTruststoreResource.getInputStream() and sslTruststoreResource.getFilename() work for a JAR file, but they do not give the path I need for the Kafka configuration.
In my project, the 'truststore.jks' file is located as below:
src
-- java
-- resources
. -- kafka
-- truststore.jks
So, is there any solution for my issue? Thank you.
I tried to use ClassPathResource and ResourcePatternResolver, but they did not work.
After trying many approaches I still could not get a path to the JKS file, so I copy it to a location outside the JAR file where my code can refer to it by path:
final String FILE_NAME = env.getProperty("kafka.metadata.ssl.truststore.location");
String sslTruststoreLocation = "*-imf-kafka-client.truststore.jks";
try {
InputStream is = getClass().getClassLoader().getResourceAsStream(FILE_NAME);
// Get the destination path where contains JKS file
final String HOME_DIR = System.getProperty("user.home");
final Path destPath = Paths.get(HOME_DIR, "tmp");
if (!Files.isDirectory(destPath)) {
Files.createDirectories(destPath);
}
// Copy JKS file to destination path
sslTruststoreLocation = destPath.toFile().getAbsolutePath() + "/" + FILE_NAME;
File uploadedFile = new File(sslTruststoreLocation);
if(!uploadedFile.exists()) {
uploadedFile.getParentFile().mkdirs();
uploadedFile.createNewFile();
FileCopyUtils.copy(is, new FileOutputStream(sslTruststoreLocation));
log.debug("Copied {} file from resources dir to {} done !", FILE_NAME, sslTruststoreLocation);
}
config.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, sslTruststoreLocation);
} catch (IOException e) {
final String message = "The " + sslTruststoreLocation + " file not found to construct a KafkaProducer";
log.error(message);
}
Looks like this is a known issue in Kafka.
Spring Boot proposes a workaround similar to yours.

Get the AKS public IP/load balancer IP address of a Kubernetes cluster after applying Bedrock Terraform

Currently I'm applying the following Terraform template to create a Kubernetes cluster, and everything works as I expected.
module "subnet" {
source = "git::https://github.com/microsoft/bedrock//cluster/azure/subnet/?ref=master"
subnet_name = var.subnet_name
vnet_name = var.vnet_name
resource_group_name = data.azurerm_resource_group.keyvault.name
address_prefixes = [var.subnet_prefix]
}
module "aks-gitops" {
source = "git::https://github.com/microsoft/bedrock//cluster/azure/aks-gitops/?ref=master"
acr_enabled = var.acr_enabled
agent_vm_count = var.agent_vm_count
agent_vm_size = var.agent_vm_size
cluster_name = var.cluster_name
dns_prefix = var.dns_prefix
flux_recreate = var.flux_recreate
gc_enabled = var.gc_enabled
gitops_ssh_url = var.gitops_ssh_url
gitops_ssh_key_path = var.gitops_ssh_key_path
gitops_path = var.gitops_path
gitops_poll_interval = var.gitops_poll_interval
gitops_label = var.gitops_label
gitops_url_branch = var.gitops_url_branch
kubernetes_version = var.kubernetes_version
resource_group_name = data.azurerm_resource_group.cluster_rg.name
service_principal_id = var.service_principal_id
service_principal_secret = var.service_principal_secret
ssh_public_key = var.ssh_public_key
vnet_subnet_id = module.subnet.subnet_id
network_plugin = var.network_plugin
network_policy = var.network_policy
oms_agent_enabled = var.oms_agent_enabled
}
The next step in Terraform is to configure the CDN/domain setup, which requires the public IP address (already created in the steps above under module "aks-gitops"), but the module outputs do not seem to return that IP address.
Any ideas? I have already searched every resource I could find on the internet.
Every comment is appreciated!
Thanks, mates!
To retrieve the FQDN which resolves to the public IP of the cluster, create a data source that references the newly created cluster.
data "azurerm_kubernetes_cluster" "aks-cluster" {
name = var.cluster_name
resource_group_name = data.azurerm_resource_group.cluster_rg.name
}
The address of the newly created cluster can then be accessed via data.azurerm_kubernetes_cluster.aks-cluster.fqdn
You can follow a similar pattern to retrieve details of a load balancer, or any other resource that is not returned in the module outputs.

No FileSystem for scheme: cos

I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme based on the stocator docs. However, the error suggests stocator isn't available or is an old version?
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9), which supports the fs.cos scheme, is not yet deployed on Spark as a Service (it will be soon). Please use the Stocator scheme "fs.s3d" to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The library is already installed on our service. Example:
import ibmos2spark
credentials = {
'endpoint': 'endpointXXXX',
'access_key': 'XXXX',
'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
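If you would rather check the scheme list programmatically than read the file by eye, a minimal sketch along these lines may help. It assumes the file has the usual Hadoop layout of `<property>` elements under a `<configuration>` root; `stocator_schemes` is a hypothetical helper name, and the sample input simply mirrors the property shown above.

```python
import xml.etree.ElementTree as ET


def stocator_schemes(core_site_xml):
    """Return the schemes listed under fs.stocator.scheme.list, or [] if absent."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.stocator.scheme.list":
            value = prop.findtext("value") or ""
            return [s.strip() for s in value.split(",") if s.strip()]
    return []


# Sample input mirroring the property above, wrapped in an assumed <configuration> root.
sample = """<configuration>
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
</configuration>"""

schemes = stocator_schemes(sample)
print(schemes)
print("cos" in schemes)
```

In a notebook you could read `core-site.xml` from `$SPARK_CONF_DIR` and pass its text to the function instead of the sample string.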
It looks like the cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
UPDATE 1:
You also need to ensure the Stocator classes are on the classpath. You can use the packages system by executing pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with the swift2d and cos schemes.
UPDATE 2:
Just follow the Stocator documentation (https://github.com/CODAIT/stocator). It contains all the details on how to install it, which branch to use, etc.
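The service name appears both in the fs.cos.<service>... keys and in the cos:// URL, and the question's first attempt mixed 'myCos' and 'mycos'. Deriving both from a single variable avoids that kind of mismatch. The sketch below only assembles the settings from this answer; `cos_settings` and `cos_url` are hypothetical helper names, not part of Stocator, and whether the settings are sufficient still depends on the Stocator version on the classpath.

```python
def cos_settings(service, access_key, secret_key, endpoint):
    """Build the Hadoop settings from this answer for one COS service name."""
    prefix = "fs.cos.{0}".format(service)
    return {
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.scheme.list": "cos",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme": "cos",
        prefix + ".access.key": access_key,
        prefix + ".endpoint": endpoint,
        prefix + ".secret.key": secret_key,
        # Copied as-is from the configuration in this answer.
        "fs.cos.service.v2.signer.type": "false",
    }


def cos_url(bucket, service, obj):
    """Build the cos:// URL using the same service name as the settings."""
    return "cos://{0}.{1}/{2}".format(bucket, service, obj)


service = "mycos"
settings = cos_settings(service, "XXX", "XXX", "http://lon.ibmselect.objstor.com")
url = cos_url("mybucket", service, "mydata.tsv.gz")
print(url)

# In a Spark session you would then apply the settings, for example:
# for key, value in settings.items():
#     hconf.set(key, value)
```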
I hit the same issue, and solved it by changing the environment:
Within IBM Watson Studio, if you start a Jupyter notebook in an environment without a pre-configured Spark cluster, then you get that error. Installing PySpark is not enough.
Instead, if you start a notebook with the Spark cluster available, you will be just fine.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some other fs.cos... configurations.
Here's an end-to-end code example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosIntanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_servicce_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
SparkSession
.builder
.appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_servicce_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
f'cos://{bucket_name}.{cos_instance_name}/test.csv',
mode='overwrite',
header=True)
spark_sess.stop()
print('done!')