Context
The Spark reader has the function format, which is used to specify a data source type, for example JSON, CSV, or a third-party type such as com.databricks.spark.redshift.
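For illustration (the path and format name here are arbitrary):

val df = spark.read.format("json").load("/tmp/example.json")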
Help
How can I check whether a third-party format exists or not? Let me give a case:
In local Spark, to connect to Redshift there are two open-source libs available: 1. com.databricks.spark.redshift 2. io.github.spark_redshift_community.spark.redshift. How can I determine which lib the user has put on the classpath?
What I tried
Class.forName("com.databricks.spark.redshift"), which did not work.
I tried to check the Spark code for how it throws the error (here is the line), but unfortunately Utils is not available publicly.
Instead of targeting the format option, I tried to target the JAR file via System.getProperty("java.class.path").
Wrapping spark.read.format("..").load() in try/catch.
I am looking for a proper and reliable solution.
Maybe this answer will help you.
To simply check whether a Spark format exists or not, wrapping
spark.read.format("..").load()
in a try/catch is enough.
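For instance, a rough sketch (the probe path and the exception handling here are illustrative, not any official API):

import org.apache.spark.sql.SparkSession

def formatExists(spark: SparkSession, format: String): Boolean =
  try {
    spark.read.format(format).load("/tmp/probe")
    true
  } catch {
    // Spark reports an unknown format as "Failed to find data source"
    case _: ClassNotFoundException => false
    // any other failure (missing path, missing options) means the format resolved
    case _: Throwable => true
  }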
And since all data sources usually register themselves using the DataSourceRegister interface (and use shortName to provide their alias), you can use Java's ServiceLoader.load method to find all registered implementations of the DataSourceRegister interface:
import java.util.ServiceLoader
import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.DataSourceRegister

val formats = ServiceLoader.load(classOf[DataSourceRegister])
formats.asScala.map(_.shortName).foreach(println)
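To go one step further and tell which of the two Redshift connectors is actually on the classpath, you can inspect the implementing class names (assuming, as is usual, that the connector registers itself via DataSourceRegister):

formats.asScala
  .filter(_.shortName.equalsIgnoreCase("redshift"))
  .foreach(r => println(r.getClass.getName))
// prints e.g. com.databricks.spark.redshift.DefaultSource
// or       io.github.spark_redshift_community.spark.redshift.DefaultSource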
Related
I just moved some of my code from a/b.dart to a new a/b1.dart file, and now I am getting a lot of import errors.
Is there any command or other fix to import a/b1.dart in all these files, instead of manually opening each file and adding the import one by one?
I understand that when a function or property is defined in more than two files, Dart can't make the right choice; but if a function or property is defined in just one place, I think there must be some way to import it, other than searching for a/b.dart and replacing it with a/b.dart + a/b1.dart and then optimizing all imports.
As far as I am aware, plugins/extensions for your specific IDE (for Dart) can be found that will help you with this problem.
I would recommend using dartdev tools - dartfix
I have a set of .xml documents that I want to parse.
I have previously tried to parse them using methods that take the file contents and dump them into a single cell; however, I've noticed this doesn't work in practice, since I'm seeing slower and slower run times, often with one task taking tens of hours to run.
My first transform takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. From this document I'm able to extract properties and return a DataFrame.
I'm using a UDF to conduct the process of mapping the string contents to the fields I want.
How can I make this faster / work better with large .xml files?
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
Theory
When parsing raw files, you have a couple of options you can consider:
❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
This is strongly discouraged due to the engineering time required and the unscalable architecture. It doesn't take advantage of distributed compute, as you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
⚠ You can use your own parser library not made for Spark, such as the XML Python library mentioned in the question
While this is less difficult to accomplish than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a limit of performance because it does not take advantage of low-level Spark functionality only exposed when writing a Spark library.
✅ You can use a Spark-native raw file parser
This is the preferred option in all cases as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.
In our case, we can use the Databricks parser to great effect.
In general, you should also avoid using the .udf method, as it is likely being used in place of functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available.
A good example of UDFs covering up hidden problems is string manipulation of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these operations already exist in the Spark API and will be orders of magnitude faster than your own code.
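As an illustration (written in Scala for brevity; PySpark exposes the same functions in pyspark.sql.functions, and the sample DataFrame here is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split, trim, udf}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("  a,b ", " c,d").toDF("raw")

// Native column functions run inside Spark's optimized engine:
df.withColumn("parts", split(trim(col("raw")), ",")).show()

// An equivalent UDF does the same work row by row as opaque code:
val splitTrim = udf { s: String => s.trim.split(",") }
df.withColumn("parts", splitTrim(col("raw"))).show()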
Design
Our design is going to use the following:
Low-level Spark-optimized file parsing done via the Databricks XML Parser
Test-driven raw file parsing as explained here
Wire the Parser
First, we need to add the .jar to the spark_session available inside Transforms. Thanks to recent improvements, this argument, when configured, will allow you to use the .jar both in Preview/Test and at full build time. Previously this required a full build, but no longer.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:
Enable the pytest plugin
Enable the condaJars argument and declare the .jar dependency
My /transforms-python/build.gradle now looks like the following:
buildscript {
    repositories {
        // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'

// ... some other awesome features you should enable
After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh.
After refreshing Code Assist, we now have the low-level functionality available to parse our .xml files; now we need to test it!
Testing the Parser
If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
... an example file /transforms-python/test/myproject/datasets/sample.xml with contents:
<tag>
    <field1>
        my_value
    </field1>
</tag>
And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}
We now have:
A distributed-compute, low-level .xml parser that is highly scalable
A test-driven setup that we can quickly iterate on to get our exact functionality right
Cheers
I have a script which creates a new referenceId each time it is executed. I used
.check(regex("orders.(.*?)\"").saveAs("referenceId")))
to extract the referenceId. Now, how can I write/append it to a file without impacting the script, even when I run it as a load test?
I used session in .exec to write my value into a file. Here it is:
.exec { session =>
  scala.tools.nsc.io.File("../user-files/data/referenceId.csv")
    .appendAll(session("referenceId").as[String] + "\n")
  session
}
Your solution works, but...
First of all, do not use anything (if you don't have to) from the scala.tools.nsc.io package. It is an internal package meant only for the Scala compiler; it is not a public API included in the Scala runtime library (see the official Scaladoc). More about the topic here. Scala does not have its own abstraction for writing to files, hence one needs to use the normal java.io.File & co.
Secondly, opening a file on each execution can (but may not) slow down your load test. It strongly depends on the rate at which you are making requests. At higher rates you can experience contention when more concurrent executions try to write to the same file. The simplest solution is to write to different files, but then you can run out of the maximum possible number of open files. Another solution is to use a shared java.io.FileOutputStream or java.io.FileWriter on the desired target file with proper synchronisation (it will be accessed from various threads), which is still blocking IO. Yet another solution is to use the Java NIO API to write to a shared file via a Channel (non-blocking) or an OutputStream (not sure if non-blocking).
Of course, the solutions differ in difficulty of implementation.
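A minimal sketch of the shared-writer approach (the object name and file path are made up for illustration):

import java.io.{FileWriter, PrintWriter}

object ReferenceIdWriter {
  // one writer shared by all virtual users; opened once, appending to the file
  private val writer = new PrintWriter(new FileWriter("../user-files/data/referenceId.csv", true))

  def append(id: String): Unit = synchronized {
    writer.println(id)
    writer.flush()
  }
}

// in the scenario:
.exec { session =>
  ReferenceIdWriter.append(session("referenceId").as[String])
  session
}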
I'm having some trouble trying to load zexy and iemlib into Pd Vanilla 0.46-7. I had no problems compiling and installing cyclone from https://github.com/electrickery/pd-cyclone; it works fine. So I tried installing iemlib and zexy from https://github.com/iem-projects/pd-iem using their binaries, but something is going wrong. When I turn on "verbose" under path preferences, Pd seems to be looking for a file with the same name as the object I'm trying to use. Using [zexy/multiplex] in a patch gives:
tried ~/Library/Pd/zexy/multiplex.d_fat and failed
tried ~/Library/Pd/zexy/multiplex.pd_darwin and failed
tried ~/Library/Pd/zexy/multiplex/multiplex.d_fat and failed
But there's no multiplex.d_fat, only zexy.d_fat. Same with iemlib: there's no dollarg.d_fat or dollarg.pd_darwin, only iem_mp3.d_fat, iem_t3_lib.d_fat, iemlib1.d_fat, and iemlib2.d_fat. I'm guessing these files are where the externals were compiled into.
I tried using deken, and iemlib installs the .pd_darwin files, but I guess this is an older version(?); zexy still installs zexy.d_fat, so I can't load its objects.
I also tried loading the lib "zexy/zexy" under startup preferences, and it loads OK, but then I get messages like:
warning: class 'abs~' overwritten; old one renamed 'abs~_aliased'
and I seem to lose namespace functionality: I can no longer refer to [zexy/multiplex] and need to use only [multiplex], which I guess is the correct behaviour.
How does Pd know how to look for objects in files with different names?
Any advice?
This thread is marked as solved (http://forum.pdpatchrepo.info/topic/9677/having-trouble-with-deken-plugin-and-zexy-library-solved) and sounds like a similar problem, but I haven't been successful.
zexy is built as a multi-object library, so there is no separate binary for zexy/multiplex.
As you have correctly guessed, the correct way is to load zexy as a whole (either using [declare -lib zexy] in your patch or adding zexy to the startup libs; there is no need to use zexy/zexy) and to ignore the warning about abs~.
As for how loading works:
Pd maintains a list of objects it knows how to create. E.g. whenever you create [pack], Pd will look up pack in its list of known objects and use the information found there to actually create the object.
If you try to create an object that Pd doesn't know about yet (e.g. [foo]), then Pd will look for a library named foo (e.g. foo.pd_linux) and, if found, it will "load" it.
Loading a library means that Pd will call a special function in the library (this special function is the entry point of the library and is called foo_setup() in our case).
After that, Pd will check whether it now has foo in the list of known objects. If it does, it will create the object.
Now the magic is done in the special function that is called when Pd loads the library: this function's main purpose is to tell Pd about new objects (basically saying: "if somebody asks for object foo, I can make one for you").
When zexy's special function is called, it tells Pd about all zexy objects (including multiplex), so after Pd has loaded zexy, it knows how to create the [multiplex] object.
If the special function registers an object that Pd already knows about (e.g. in the case of zexy it tries to register a new object abs~, even though Pd already has a built-in object of the same name), then Pd will rename the original object by appending _aliased, and the newly registered object will take over the name.
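To make the mechanism concrete, here is a toy model of that registry in Scala (purely illustrative; Pd itself is written in C, and all names here are made up):

object PdRegistry {
  // name -> constructor of the object
  private var known = Map.empty[String, () => AnyRef]

  def register(name: String, make: () => AnyRef): Unit = {
    known.get(name).foreach { old =>
      // an already-known class gets renamed, exactly like the abs~ warning above
      println(s"warning: class '$name' overwritten; old one renamed '${name}_aliased'")
      known += s"${name}_aliased" -> old
    }
    known += name -> make
  }

  // the "special function": zexy's entry point registers all of its objects at once
  def zexySetup(): Unit = {
    register("multiplex", () => new Object)
    register("abs~", () => new Object) // clashes with the built-in abs~
  }

  def create(name: String): Option[AnyRef] = known.get(name).map(_.apply())
}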
We are building an application using ScalaFX. When I run the project in IntelliJ IDEA, everything works fine. However, when I create a jar file and try to execute it, I get errors reading an xml file.
I tried various solutions posted on SO, but to no avail.
package com.app.adt
import scalafx.application.JFXApp
import scalafx.Includes._
import scalafx.scene.Scene
import scala.reflect.runtime.universe.typeOf
import scalafxml.core.{FXMLView, DependenciesByType}
object App extends JFXApp {

  val root = FXMLView(getClass.getResource("/com/app/adt/Home.fxml"),
    new DependenciesByType(Map(
      typeOf[TestDependency] -> new TestDependency("ADT"))))

  stage = new JFXApp.PrimaryStage() {
    title = "ADT"
    scene = new Scene(root)
  }
}
The xml file (Home.fxml) is placed in the com/app/adt package. I am creating the jar file using sbt-one-jar.
I have tried different combinations of paths, but it always gives the same error.
Error Stack:
Caused by: javafx.fxml.LoadException:
file:/adt-app_2.11-1.3-SNAPSHOT-one-jar.jar!/main/adt-app_2.11-1.3-SNAPSHOT.jar!/com/app/adt/Home.fxml
    at javafx.fxml.FXMLLoader.constructLoadException(FXMLLoader.java:2611)
    at javafx.fxml.FXMLLoader.loadImpl(FXMLLoader.java:2589)
    at javafx.fxml.FXMLLoader.loadImpl(FXMLLoader.java:2435)
    at javafx.fxml.FXMLLoader.load(FXMLLoader.java:2403)
    at scalafxml.core.FXMLView$.apply(FXMLView.scala:17)
Jar Structure:
adt-app_2.11-1.3-SNAPSHOT-one-jar.jar
└── main
    └── adt-app_2.11-1.3-SNAPSHOT.jar
        └── com/app/adt
            ├── App.scala
            └── Home.fxml
Also, I have tried sbt-assembly instead of sbt-one-jar, but I'm still getting the same error. :(
Tried with below answers in SO:
Q1
Q2
The real problem is rather tricky. Firstly, one needs to realize that a JAR is an archive (similar to ZIP, for example) and archives are regular files. Thus the archive itself is located somewhere in the file system and is hence accessible via a URL.
The "subfiles" (entries), on the contrary, are just data blocks within the archive. Neither the operating system nor the JVM knows that this particular file is an archive, therefore they treat it as a regular file.
If you're interested in deeper archive handling, try to figure out how ZipFile works. A JAR is basically a ZIP, so you're able to apply this class to it.
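For example, you can open the one-jar like any ZIP archive and list its entries (a quick sketch; the jar name is taken from the question):

import java.util.jar.JarFile
import scala.collection.JavaConverters._

val jar = new JarFile("adt-app_2.11-1.3-SNAPSHOT-one-jar.jar")
jar.entries().asScala.foreach(entry => println(entry.getName))
jar.close()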
Java provides the Class.getResourceAsStream methods, which enable the programmer to read files as streams. This solution is obviously useless in this particular example, since the ScalaFX method expects a File instead.
So basically you have three options:
Use the stream API to duplicate the XML into a temporary file, then pass this file to the method (see the sketch after this list).
Deploy your resources separately, in a way that they remain regular files.
Re-implement JavaFX so that it accepts streams (this should probably happen anyway).
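A minimal sketch of the first option, reusing the imports and resource path from the question (the temp-file prefix is arbitrary):

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Copy the FXML entry out of the jar into a real file on disk...
val in = getClass.getResourceAsStream("/com/app/adt/Home.fxml")
val tmp = File.createTempFile("Home", ".fxml")
tmp.deleteOnExit()
Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
in.close()

// ...then hand FXMLView a URL that points to a regular file.
val root = FXMLView(tmp.toURI.toURL,
  new DependenciesByType(Map(
    typeOf[TestDependency] -> new TestDependency("ADT"))))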