I am writing Spark output to an external system that does not like file extensions (I know, I know, don't start).
Something like:
df.write.partitionBy("date").parquet(some_path)
creates files like: some/path/date=2021-01-01/part-00000-77dd02e8-1a67-4f0d-9c07-b55b4f2e5efc-c000.snappy.parquet
And that makes that external system unhappy.
I am looking for a way to tell Spark to write those files without an extension.
I know I can just rename them afterwards, but that seems ... stupid (and there are a lot of files) :/
Is there some option I could use to tell Spark to just write them the way I want?
df.write.partitionBy("date").parquet(some_path_ends_with_file_name)
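For completeness, the rename-afterwards fallback I mentioned would look roughly like this, assuming the output lands on a local filesystem (HDFS or S3 would need their respective clients) and with "some/path" standing in for the real output path:

from pathlib import Path

# strip the extension from every part file Spark produced, e.g.
# part-00000-...-c000.snappy.parquet -> part-00000-...-c000
for f in Path("some/path").rglob("*.snappy.parquet"):
    f.rename(f.with_name(f.name[: -len(".snappy.parquet")]))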
I have to perform a find/replace across my project's files using a rename rule-set which I have in CSV format.
My rename CSV is simple and in the format "from value,to value":
foo,bar
car,dog
...
zip,zip
All from and to values are exact (so no need to do weird regex).
Is there any way (even w/ an extension) to feed this CSV into VS Code and have it perform the find and replace against all files in my project?
I can of course reformat this CSV to other formats (JSON, excel, etc.) fairly easily if that helps.
You could write a simple Python script to do the replacing for you.
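A minimal sketch of such a script, assuming UTF-8 text files; "renames.csv" and "my-project" are placeholders for your rule file and project root:

import csv
from pathlib import Path

# load the rename rules, one (from, to) pair per row
with open("renames.csv", newline="") as f:
    rules = [row for row in csv.reader(f) if row]

for path in Path("my-project").rglob("*"):
    if not path.is_file():
        continue
    try:
        text = path.read_text(encoding="utf-8")
    except (UnicodeDecodeError, PermissionError):
        continue  # skip binary or unreadable files
    new_text = text
    for old, new in rules:
        new_text = new_text.replace(old, new)
    if new_text != text:
        path.write_text(new_text, encoding="utf-8")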
I ended up using the Batch Replace extension for VS Code.
https://marketplace.visualstudio.com/items?itemName=angelomollame.batch-replacer
Originally I had tried this extension, but it wasn't working. I had an a-ha moment as to why (I have about 500 replace rules). I also use a local history VS Code extension, which creates a (massive) local history in a .history folder in the workspace. Batch Replace was choking on processing the tens of thousands of files in there (since technically it's in my workspace).
Once I excluded that folder, it worked, though it did take ~1 minute to process all my files, and during that time there is no indication that it's running.
I have one folder with four files: sales_jan, sales_feb, debt_jan, debt_feb. I created a specific job for each of sales and debt. The thing is, if I have already run the job for sales_jan and sales_feb arrives afterwards, I don't want to read sales_jan again; I only want to read the newest file that hasn't been processed yet. To read the files, I pass a pattern for the specific file type (e.g. sales_*), but with that pattern the stage will reprocess sales_jan even though it has already been handled. I want to move files that have already been read into another folder. How exactly do I do that in IBM DataStage? If there's no way to do it, what would you suggest for my problem? Any ideas would be appreciated.
The easiest solution is to use an after-job subroutine (ExecSH on Linux/UNIX, ExecDOS on Windows) to move the file to a different location.
Since you're using wildcards for the Sequential File stage, you're going to have to be a bit more clever in handling a situation where your job processes only some of the files. I would prefer to write this using a loop in a sequence, processing one file at a time, so that the move can be handled per-file.
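A rough sketch of that per-file idea, written in Python rather than as an actual sequence loop; the folder names and the process() stand-in are placeholders:

import glob
import os
import shutil

def process(path):
    # stand-in for running the DataStage job against this single file
    print(f"processing {path}")

for path in sorted(glob.glob("/data/landing/sales_*")):
    process(path)
    # move the file out of the landing folder only after it has been read,
    # so a rerun never picks it up again
    shutil.move(path, os.path.join("/data/processed", os.path.basename(path)))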
You might keep a flag for every file your job has already read. For example, record a max-date for each file; when the first file's max date is less than that of the second (newer) file, read only the latest file. This can be done with a simple Linux command in a sequence or in a Transformer stage, just like Ray mentioned before.
Alright, here's what I'm dealing with (you can skip to TLDR if all you need to see is what I want to run):
I'm having an issue with file formatting for a nasty conglomeration of several ancient programs I've strung together. I have some data in .CSV format, and I need to put it into .SPC format. I've tried a set of proprietary MATLAB programs called 'GS tools' for fast and easy conversion, but fast and easy doesn't look like it's gonna happen here, since there are discrepancies between how .spc files are organized now and how they were organized back when my ancient programs were written.
If I could find the source code for the old programs I could probably alter the GS tools code to write my .spc files appropriately, but all I can find are broken links circa 2002 and earlier. Seeing as I don't know what my programs are looking for, I have no choice but to try resaving my data with other programs until one of them produces something workable.
I found my Cinderella program: if I open the data I have in a program called Spekwin and save the file with a .spc extension... voilà! Everything else runs on those files. The problem is that I have hundreds of these files and I'd like to automate the conversion process.
I either need to extract the writing rubric Spekwin uses for .spc files (I believe that info is stored in a dll file within the program, but I'm not sure if that actually makes sense) and use it as a rule to write a file from my input data, or I need a piece of code that will open a file with Spekwin, tell Spekwin to save that file under the .spc extension, and terminate Spekwin.
TLDR: Need a command that tells the computer to open a file with a certain program, save that file under a different extension through that program (essentially open *.csv > save as > *.spc), then terminate the program.
OR: I need a way to tell MATLAB to write a file according to rules specified by a .dll, but I'm not sure I fully understand what that entails.
Of course I'm open to suggestions on other ways to handle this.
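One route I'm considering for the TLDR workflow is driving the program's GUI from a script. The sketch below is only that: the executable name, the delays, and every keystroke are guesses about how Spekwin behaves, not documented behavior, and would need tuning against the real dialogs.

import subprocess
import time
from pathlib import Path

import pyautogui  # pip install pyautogui

for csv_file in Path("data").glob("*.csv"):
    # open the file in Spekwin (hypothetical executable name and argument)
    proc = subprocess.Popen(["spekwin.exe", str(csv_file)])
    time.sleep(5)  # crude wait for the file to load

    # hypothetical Save As shortcut, then type the name with a .spc extension
    pyautogui.hotkey("ctrl", "shift", "s")
    time.sleep(1)
    pyautogui.write(str(csv_file.with_suffix(".spc")))
    pyautogui.press("enter")
    time.sleep(1)

    proc.terminate()  # close Spekwin before moving on to the next file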
I need to replace a file inside a zip archive on iOS. I have tried many libraries with no results. The only one that kind of did the trick was zipzap (https://github.com/pixelglow/zipzap), but it's no good for me, because what it really does is re-zip the archive with the change; besides this process being too slow for me, it also loads the whole file into memory, which makes my application crash.
PS: If this is not possible or way too complicated, I can settle for renaming or deleting a specific file.
You need to find a framework where you can modify how data is read and written. You would then use some form of mmap to essentially read and write small chunks. Searching on NSData and mmap turned up this post; however, you can use mmap at the POSIX level too. PS: it will be slower than working purely in memory, no way around that.
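To illustrate the chunked-access idea, here is a sketch in Python, whose mmap module wraps the same POSIX call; the filename and offsets are made up, and patching bytes blindly like this would of course corrupt a real zip:

import mmap

with open("archive.zip", "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:  # map the whole file, paged in lazily
        header = m[:4]             # read a small window, not the whole archive
        m[100:104] = b"\x00" * 4   # rewrite four bytes in place
        m.flush()                  # push the dirty pages back to disk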
Got it WORKING!! JXZip (https://github.com/JanX2/JXZip) does exactly what I need. It links to libzip (http://www.nih.at/libzip/), which is a fully equipped library for working with ZIP files, and JXZip has all the necessary Objective-C wrapper code. Thanks for all the replies.
For archive purposes, as the author of zipzap:
Actually zipzap does exactly what you want. If you replace an entry within a zip file, zipzap will do the minimum necessary to update it: it will skip writing all entries before the replaced entry, then write out the entry, then write out all entries after the replaced entry without recompressing. At the moment, it does require sufficient memory for the entries after the replaced entry though.
I have several files in a GridFS document store, and what I'd like to do is pipe this data into a zip file via stdin in Node.js, so that I end up with a zip file containing all these files.
Now my question is: how can I give the files a valid filename inside the zip file? I think I need to emulate/fake a file header containing the filename?
Any help is appreciated!
Thanks
I had problems when writing zip files with Node.js not long ago. I ended up doing something similar to what is described in Zip archives in node.js
I can't help you directly with your problem, but at least I hope I can point out some things:
Don't try to use node-archive. Even though the description says it allows creating zip files, the moment I read the source code (since documentation is nonexistent) I realized that's just a lie. It only exposes methods for reading.
Using zip by spawning a process, as recommended in the linked post, seems to be the best way. Something that would work is copying the files to a local folder with whatever names you desire, then calling the zip command and deleting the files afterwards (see the sketch after these notes).
The other option, which seems OK, is to use zipper (https://github.com/rubenv/zipper, although better to just use npm). The reason I'm not really keen to use it is that there's not much flexibility; it seems to have been done in a day and hasn't been modified since the first commit, so I'm not sure it will receive maintenance (sure, you could just fork it...).
I swear, the day I have an entire free weekend with no work, I will write a freaking module that does this as completely as possible. It's silly that there isn't one and it shouldn't be that much of a struggle. blablablarant.
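For the spawn-a-process option above, the shape of it is roughly this, sketched in Python for brevity (a Node version would do the same with child_process.spawn); the file name and contents are placeholders:

import os
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as staging:
    # stand-in for streaming each GridFS document to disk under the
    # filename it should carry inside the archive
    with open(os.path.join(staging, "report.txt"), "w") as f:
        f.write("contents pulled from GridFS")

    # -j junks the staging path, so only the chosen filenames end up in the zip
    files = [os.path.join(staging, name) for name in os.listdir(staging)]
    subprocess.run(["zip", "-j", os.path.abspath("out.zip")] + files, check=True)
# the staging folder (and the renamed copies) is cleaned up on exit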
Edit:
Not sure if it was there before, but now I've been using the node-compress module (also using gzippo). It works fine.