I have a Scala Spark notebook on an AWS EMR cluster that loads data from an AWS S3 bucket. Previously, I had standard code like the following:
var stack = spark.read.option("header", "true").csv("""s3://someDirHere/*""")
This loaded multiple directories of files (.txt.gz) into a Spark DataFrame object called stack.
Recently, new files were added to this directory. The content of the new files looks the same (I downloaded a couple of them and opened them in both Sublime Text and Notepad++). I tried two different text editors to see whether some invisible, non-Unicode characters were disrupting the interpretation of the first line as a header. The new data files cause my code above to ignore the first header line and instead interpret the second line as the header. I have tried a few variations without luck. Here are a few examples of things I tried:
var stack = spark.read.option("quote", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
var stack = spark.read.option("escape", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
var stack = spark.read.option("escape", "\"").option("quote", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
I wish I could share the files, but they contain confidential information. I'm just wondering if there are any ideas as to what I can try.
How many files are there? If there are too many to check manually, you could try reading them without the header option. Your expectation is that the header is the same in every file, right?
If that's truly the case, this should return a count of 1:
spark.read.csv('path').limit(1).dropDuplicates().count()
If not, you can see which different headers there are like this:
spark.read.csv('path').limit(1).dropDuplicates().show()
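Remember that it's important not to use the header option here, so that the header line stays in the data and you can operate on it. Also note that limit(1) keeps a single row of the whole DataFrame rather than one row per file, so a count of 1 on its own does not prove that the headers match; run the show() variant on one file at a time to compare them.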
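Since limit(1) only ever returns one row of the combined DataFrame, a per-file comparison may be easier outside Spark. Below is a minimal sketch in plain Python, assuming a sample of the .txt.gz files has been downloaded to a local directory. The function names and the directory are hypothetical. It groups files by the raw bytes of their first line, so a byte-order mark or other invisible bytes in front of the header show up in the output.

```python
import gzip
from collections import defaultdict
from pathlib import Path


def first_lines_by_file(directory, pattern="*.txt.gz"):
    """Group file names by the raw bytes of their first line."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).rglob(pattern)):
        with gzip.open(path, "rb") as handle:
            # Keep the bytes undecoded so invisible characters are preserved.
            first_line = handle.readline().rstrip(b"\r\n")
        groups[first_line].append(path.name)
    return dict(groups)


def report(groups):
    """Print one line per distinct header; repr() exposes hidden bytes."""
    for header, names in groups.items():
        print(f"{len(names)} file(s) start with {header!r}")
```

Calling report(first_lines_by_file("downloaded_files")), where "downloaded_files" is a placeholder for the local directory, prints more than one line if the old and new files differ in their first line.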
As with others who have posted in the past, I cannot figure out how to download a csv.gz file from a URL in Python 3.7. I have seen posts about this, but the approaches in them only produce a 2 KB file.
I am a 100% newbie using Python. What follows is the code for one file that I am trying to obtain. I can't even do that. The final goal would be to request all files that start with 2019* using python. Please try the code below to save the file. As others stated, the file is just a name without the true content - Ref: Downloading a csv.gz file from url in Python
import requests
url = 'https://public.bitmex.com/?prefix=data/trade/20191026.csv.gz'
r = requests.get(url, allow_redirects=True)
open('20191026.csv.gz', 'wb').write(r.content)
Yields:
Out[40]:
1245
I've tried "wget" and urllib.request along with "urlretrieve" also.
I wish I could add a screenshot or attach a file. The file created is 2kb and not even a csv.gz file. But the true file that I can download from a web browser is 78mb. The file is 20191026.csv.gz not that it matters as they all do the same thing. The location is https://public.bitmex.com/?prefix=data/trade/
Again, if you know of a way to obtain all the files using a filter such as 2019*csv.gz, that would be fantastic.
You are trying to download the files from https://public.bitmex.com/?prefix=data/trade/.
To achieve your final goal of downloading all the files starting with 2019, you have to proceed in 3 steps:
1) Read the content of https://public.bitmex.com/?prefix=data/trade/
2) Convert the content into a list, and from that list pick out the file names starting with 2019.
3) For each name in the resulting list, download the csv.gz file using the example you are referring to.
Hope this approach will help you
Happy coding.
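Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration under stated assumptions: the list of file names from step 1 is taken as given, BASE_URL is a guess at a direct download path (without the "?prefix=" query) that should be checked against the real link in a browser, and filter_names and download are made-up helper names. As a side note, the 1245 printed in the question is the return value of write(), i.e. the number of bytes written, which suggests the response was a small page rather than the 78 MB archive.

```python
import fnmatch
import shutil
import urllib.request

# Assumption: each file is served at a direct path under this base URL.
# Verify the real download link before relying on it.
BASE_URL = "https://public.bitmex.com/data/trade/"


def filter_names(names, pattern="2019*.csv.gz"):
    """Step 2: keep only the file names matching the wildcard pattern."""
    return fnmatch.filter(names, pattern)


def download(name, base_url=BASE_URL):
    """Step 3: stream one file to disk instead of holding it in memory."""
    with urllib.request.urlopen(base_url + name) as response, open(name, "wb") as out:
        shutil.copyfileobj(response, out)
    return name
```

With a listing in hand, the loop would be along the lines of: for name in filter_names(listing): download(name). Streaming with copyfileobj avoids loading each 78 MB file fully into memory.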
I'm trying to create a web service in PHP that can deliver an SVG with reference to a PNG raster image. Both the data for the SVG as well as the binary PNG image come from a MySQL database on the server.
Option A: Encode the PNG data in base-64 and embed it directly in the SVG, such as:
<image xlink:href="data:image/png;base64,..."/>
Concerns: roughly 33% heavier payload than loading it as pure binary (base64 emits 4 characters for every 3 bytes), and a noticeable delay when loading it with Postman (or is this just because of Postman?).
Option B: Call the PNG data as binary and save it as a file on the file system, then call the SVG file, which would then reference the physical PNG file.
Concerns: Involvement of the file system (which implies I need to start managing physical files, expiration dates etc).
Is there perhaps another way that an SVG can reference the binary data on the fly without it having to be on the file system?
To accomplish something similar (in my case sending data for SVGs with additional data about each file as binary files, which are much smaller than sending xml, text, or json) - I use CBOR. In my case, I compress the SVG using LZString compression first, and add this along with additional data attributes to a JSON object. Then I convert the JSON object to CBOR. I think CBOR can handle your base 64 data without any need for conversion - more information about it is here: cbor.io
I found a PHP library for CBOR here: https://github.com/2tvenom/CBOREncode
This may not be the way to go at all for you, but I thought I'd throw it out there just in case.
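For the size concern under Option A in the question, the overhead can be estimated before committing to either option. The sketch below is in Python purely for illustration (the service itself is PHP, where base64_encode plays the same role). The function names and the width and height parameters are assumptions. It builds the SVG with an embedded data URI and computes the encoded length, which is 4 characters per 3 input bytes, rounded up to a multiple of 4.

```python
import base64


def svg_with_embedded_png(png_bytes, width, height):
    """Return an SVG document that embeds the PNG as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink" '
        f'width="{width}" height="{height}">'
        f'<image width="{width}" height="{height}" '
        f'xlink:href="data:image/png;base64,{encoded}"/>'
        "</svg>"
    )


def base64_length(raw_length):
    """Length of padded base64 output for raw_length input bytes."""
    return 4 * ((raw_length + 2) // 3)
```

For a 3072-byte image, base64_length gives 4096, which is the roughly one-third increase mentioned above. Any delay beyond that is more likely down to the client or the database query than to the encoding itself, though that would need measuring.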
I have a filepicker.io instance where I am using the pickAndStore function to allow users to upload various files. However, while testing Microsoft Visio files, I found that the uploads are being blocked by a yellow error stating that the file does not register as an accepted file type (it then lists all the file types it believes are allowed).
In my logs of the arguments sent to the function, I can see the full array of file types I allow, and the 4 Visio variants I added are clearly there:
The four I added:
".vss", ".vssx", ".vsd", ".vsdx"
Full array:
[".doc", ".dot", ".docx", ".docm", ".dotx", ".xls", ".xlt", ".xlsx", ".xltx", ".xlsm", ".xlsb", ".oft", ".msg", ".ppt", ".pptx", ".pptm", ".pps", ".ppsx", ".mpp", ".pub", ".pdf", ".html", ".mhtml", ".txt", ".rtf", ".csv", ".xml", ".css", ".zip", ".tar", ".rar", ".vss", ".vssx", ".vsd", ".vsdx", ".mp3", ".wav", ".swf", ".ics", ".srt", ".wmf", ".eps", ".ai", ".psd", ".gif", ".jpg", ".jpeg", ".png", ".bmp", ".m4v", ".mp4", ".flv", ".f4v", ".mov", ".wmv", ".wm", ".webm", ".3gp", ".3gpp", ".m2p", ".rv", ".rm", ".avi", ".3gp2", ".mpg", ".mpeg", ".ts", ".vp6", ".h264", ".arf", ".wrf", ".m2ts"]
However, when I use "My Computer" as a source and upload any one of the twenty-odd .vsd files I use as tests, all of them trigger the error and the upload is denied:
[Screenshot: the error I am seeing, saying that .vsd does not register]
I'm not sure what else I can do at this point to fix this. I don't particularly want to use mimetype in this one instance, as the filepicker documentation suggests not using it alongside extension.
Here is the link I used, which provides various Visio files you can use to test. I would rather not use the files clients upload using our platform, as I would need to ask permission and they may be sensitive. I don't think there has been a single successful upload of a Visio file (others are fine), so I would be surprised if it was file specific.
https://www.microsoft.com/en-gb/download/confirmation.aspx?id=24023
Thanks!
All extensions are converted back to mimetypes, therefore you can't mix extension and mimetype.
It appears that ".vss", ".vssx", ".vsd", ".vsdx" are in the database.
Could you post some of the files you are testing so we can check them ourselves?
Regards,
Dylan
Filepicker needs to include "application/vnd.ms-visio.viewer" in the mapping from those file extensions. It looks like that's what the browser is reporting the MIME type for those files to be.
This might be a question that may not be answered due to the nature of the external tool I am using (lack of documentation).
Basically, I am using a tool that pushes and pulls messages from the queue; more precisely, it pushes and pulls files. It worked perfectly for text files, but when I tried pushing and then pulling a binary file, the pulled one was corrupted: its size had increased in comparison with the original file (a ratio of 1.33).
For example moving a zip file wouldn't work...
I suppose it has something to do with the tool's configuration. The only settings that can be changed regarding the problem are CCSID and encoding (UTF-8, Base16, etc.). I tried playing with both, unfortunately without success.
I tried the following CCSIDs: 65535, 1208, 819
and encodings: UTF-8, Base16, Base64
In every case the binary file was corrupted after pulling it from the queue. I'm not entirely sure how the tool accomplishes that; it's written in Java. I'm also new to MQ, so I tried searching for the correct options in IBM's docs, but I haven't found anything that makes more sense than 65535 and Base16, yet it still doesn't work. Could anyone with more experience with MQ tell me whether playing with these options makes sense at all in this case and, if so, suggest which CCSID and encoding I could try to accomplish what I've described above?
More information is really needed, but my suspicion is that you are putting the message on the queue as a text message and playing around with encodings and CCSIDs to try to get it right. You really need to know how the Java app achieves this: is it using JMS (e.g. JMSBytesMessage) or base Java (something like setMessageData)?
At a high level, there is a header on a message (the message descriptor, or MD) which describes the data, in particular the MD Format field. If you say the data is a string, then MQ can convert between code pages should the getter request it, etc. Put a tiny binary file into a message on a queue, and browse the queue with amqsbcg or the GUI: what are the MD fields for format? What headers are on the payload, anything like RFH2s?
Post the code here to give us a clue, or at least the amqsbcg output.
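One more observation on the numbers in the question: a size ratio of 1.33 matches the 4/3 expansion of Base64 (4 output characters per 3 input bytes), whereas Base16 would double the size. So one hypothesis worth checking is that the tool Base64-encodes the payload on push and the pulled file is simply not decoded again. The sketch below is a small diagnostic under that assumption. The function name diagnose and its return labels are made up, and it works on the two files' bytes outside MQ.

```python
import base64
import binascii


def diagnose(original, pulled):
    """Describe how the pulled bytes relate to the original bytes."""
    if pulled == original:
        return "identical"
    try:
        # validate=True rejects input containing non-Base64 characters.
        decoded = base64.b64decode(pulled.strip(), validate=True)
    except (binascii.Error, ValueError):
        decoded = None
    if decoded == original:
        return "base64-encoded copy"
    return "different content"
```

Reading both files with open(path, "rb").read() and passing them to diagnose would show whether a plain Base64 decode recovers the original. If it does, the fix is more likely in how the tool reads the message than in the CCSID.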