Convert mp3 to wav using Pyspark (cloud environment) - pyspark

I have some mp3 files that I need to convert to wav (to use a recognition service that only supports wav).
If I were on my local computer I would use the solution of:
from pydub import AudioSegment
sound = AudioSegment.from_mp3("02_Not_Work_AudioFile.mp3")
sound.export("pepe.wav", format="wav")
but in this case the data is in an Azure container, and the way to access it is by reading it through Spark.
full_path = "path_to_file.mp3"  # placeholder for the file's path in the container
input_df = spark.read.format("binaryFile").load(full_path)
pandas_df = input_df.toPandas()
content = pandas_df['content'][0] # here is my binary
Sending the binary data to Microsoft's speech-translation service would not work, as it only accepts wav. So somehow I have to transform that binary to wav.
Does anyone know how?
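One possible approach, sketched here rather than tested against the Azure setup, is to keep everything in memory: pydub can read from and write to file-like objects, so the bytes from the `content` column can be wrapped in `io.BytesIO`. This assumes pydub and an ffmpeg binary are available on the machine that runs the conversion. The function names below are hypothetical.

```python
import io


def mp3_bytes_to_wav_bytes(mp3_bytes: bytes) -> bytes:
    """Decode MP3 bytes and re-encode them as WAV, entirely in memory."""
    # Imported lazily so this module still loads where pydub is not installed.
    # Assumption: pydub can find an ffmpeg binary to decode the MP3 data.
    from pydub import AudioSegment

    sound = AudioSegment.from_file(io.BytesIO(mp3_bytes), format="mp3")
    out = io.BytesIO()
    sound.export(out, format="wav")
    return out.getvalue()


def looks_like_wav(data: bytes) -> bool:
    """Cheap sanity check: a WAV file starts with a RIFF header tagged WAVE."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"
```

With the snippet from the question, this would be called as `wav_bytes = mp3_bytes_to_wav_bytes(bytes(content))`, and `looks_like_wav(wav_bytes)` can be checked before sending the result to the service.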

Related

Is it possible to import data into MATLAB from a SAS database?

I'm working on an automation project and was wondering if it's possible to establish a connection between MATLAB and a SAS database?
Yes. For instance, you can use the sasread function from the MATLAB File Exchange to read from SAS:
sasfile = 'C:\sasreaddemo.sas7bdat';
xlsfile = 'C:\SAS-Matlab Converter.xls';
[numeric,text,raw] = sasread(sasfile,xlsfile)
Example files are available:
https://www.mathworks.com/matlabcentral/fileexchange/15835-import-data-from-sas

How can I decrypt the Triplestore files of an RDF4J database?

I am currently trying to read the files of an RDF4J triplestore from the universAAL platform and put them into an InfluxDB to merge the data from different smart living systems.
However, I have noticed that the individual index files of the Native repository are encrypted/unreadable (See image below).
Is there any experience from the community on how to get human readable content out of the RDF4J files (namespace, triples.prop, triples-cosp, triples-posc, triples-spoc, values.hash, values.dat, values.id) and merge them into another database?
The documentation of RDF4J did not help me here, so I could not create a decent export.
Encrypted File from Triplestore
The files are not encrypted; they are simply in a binary format, optimized for efficient storage and retrieval, that is used by RDF4J's Native Store database implementation. They are not meant for direct manipulation.
The easiest way to convert them to readable RDF is to spin up a Native Store on top of them and then use the RDF4J API to query/export its data. Assuming you have a complete set of data files it should be as simple as something like this:
Repository rep = new SailRepository(new NativeStore(new File("/path/to/datafiles/")));
try(RepositoryConnection conn = rep.getConnection()) {
conn.export(Rio.createWriter(RDFFormat.TURTLE, System.out));
}
finally {
rep.shutDown();
}
Obviously, replace System.out with a FileOutputStream if you want to write the data to a file rather than the console, and change RDFFormat.TURTLE to something else if you want a different syntax format.

decompressing files from hdfs in spark

I am using Spark and I have different kinds of compressed files on HDFS (zip, gzip, 7zip, tar, bz2, tar.gz, etc.). Could anyone please tell me the best way to decompress them? For some formats I could use CompressionCodec, but it does not support all compression formats. For zip files I did some searching and found that ZipFileInputFormat could be used, but I could not find any jar for it.
For some compressed formats (I know that it is true for tar.gz and zip; I haven't tested the others), you can use the DataFrame API directly and it will take care of the decompression for you:
val df = spark.read.json("compressed-json.tar.gz")
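For archive formats that the reader does not unpack on its own, one fallback is to load each archive as raw bytes and unpack it with ordinary library code. The sketch below does this for zip files in Python with the standard `zipfile` module. The function name and the HDFS path in the usage note are hypothetical, and each archive is held fully in memory, so this only suits archives that fit on an executor.

```python
import io
import zipfile


def unzip_members(path_and_bytes):
    """Yield (archive_path/member_name, member_bytes) for each file in a zip archive."""
    path, data = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            # Skip directory entries; only real files carry content.
            if not info.is_dir():
                yield f"{path}/{info.filename}", archive.read(info)
```

In PySpark this could be applied as `sc.binaryFiles("hdfs:///data/*.zip").flatMap(unzip_members)`, since `binaryFiles` yields `(path, bytes)` pairs.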

Convert MathType equation embedded in OLE Binary file to MathML

I am trying to convert MathType's equation which is stored as OLE binary file to MathML using MathType's SDK.
The input file for my program is a DocX which contains embedded MathType equations. I am looking for a solution that is independent of MS Word. A DocX is a zip file, and once it is extracted we can find a binary file for each OLE object in the folder "word/embeddings/". Typically the file names are oleObject1.bin, oleObject2.bin, etc.
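The extraction step described above can be sketched without Word, for example in Python with the standard `zipfile` module. This only pulls the raw OLE binaries out of the DocX; it does not parse the OLE container or the MTEF data inside it. The function name is hypothetical.

```python
import zipfile


def list_embedded_ole_objects(docx_path):
    """Return {member_name: raw_bytes} for every file under word/embeddings/."""
    with zipfile.ZipFile(docx_path) as docx:
        return {
            name: docx.read(name)
            for name in docx.namelist()
            if name.startswith("word/embeddings/") and not name.endswith("/")
        }
```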
When I checked with MathType SDK it has a class "ConvertEquation" which has following method:
virtual public bool Convert(EquationInput ei, EquationOutput eo)
EquationInput is an abstract class for which following concrete classes are made available:
EquationInputFileText
EquationInputFileWMF2
EquationInputFileWMF
EquationInputFileGIF
EquationInputFileEPS
None of the classes listed above seems to support OLE binary.
According to MathType's SDK documentation, MTEF data is saved as the native data format of the object. Whenever an equation object is to be written to an OLE "stream", a 28-byte header is written, followed by the MTEF data. I guess this is exactly what is present in this binary file, but there seems to be no way to feed this format to the SDK to convert it into MathML. Any thoughts?
Thanks
You can convert a MathType WMF file to MathML as follows:
ConvertEquation conv = new ConvertEquation();
var input = new EquationInputFileWMF("mathTYpe.wmf");
var output = new EquationOutputFileText("MathMLName.txt", "MathML2 (m namespace).tdl");
conv.Convert(input, output);
The "MathML2 (m namespace).tdl" string names a translator (.tdl) file located in the "MathType\Translators" folder; if you open that folder, you will find many other translators.
You may try MathMagic equation editor (Windows version).
MathMagic can extract all embedded equations from Word documents (.doc or .docx) and save/convert them to other formats (such as JPG, PNG, BMP, PDF, TeX, LaTeX, MathML, ...) as a batch conversion job.
Unfortunately, their trial version does not support this batch conversion. A valid license (even 1-month or 2-month license) is required to enable the Conversion feature.

Accessing files on webserver with Matlab

I have written a Matlab script to perform some analysis on audio samples recorded at different locations. I have a mobile app that records audio and stores it on a web server. Is there a way I can access this file from Matlab as an input to the script? The URL and individual file names will be available; I assume there is a Matlab command that uses this information.
Thanks,
Try something like this:
str = urlread('http://stackoverflow.com');