I'm using Modeler 18.0 and I'm new to the tool.
I inherited 30 stream files. Each stream starts with 30 excel source nodes with file names like \.xlsx (e.g. c:\source\regl_sales_01_WI_2017Q3.xlsx). I need to update all 900 nodes for the 2017Q4 versions of the source files.
Can I do this with some type of find-and-replace script? Would this be a standalone script? It seems like I could use something like node.setPropertyValue("full_filename", "c:\source\regl_sales_01_WI_2017Q3.xlsx"), if only I could identify the right kind of script and the nodes.
Thank you
I think your problem can be solved with a standalone script and the stream.findAll() function.
You could do something like this:
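from os import listdir
from os.path import isfile, join

session = modeler.script.session()
tasks = session.getTaskRunner()
mypath = 'C:\\yourpath'
# Collect every .str stream file in the folder
streams = [f for f in listdir(mypath) if isfile(join(mypath, f)) and f.endswith(".str")]
for streamFile in streams:
    streamPath = join(mypath, streamFile)
    stream = tasks.openStreamFromFile(streamPath, True)
    print(stream.getName() + " is getting processed.")
    # Find all Excel source nodes in the stream
    inputNodes = stream.findAll("excelimport", None)
    for node in inputNodes:  # "in" is a reserved word, so use another name
        ff = node.getPropertyValue("full_filename")
        # replace() returns a new string, so keep the result
        ff = ff.replace("2017Q3.xlsx", "2017Q4.xlsx")
        node.setPropertyValue("full_filename", ff)
    # Save the change back to the stream file, otherwise it is lost on close
    tasks.saveStreamToFile(stream, streamPath)
    print(stream.getName() + " is processed.")
    stream.close()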
I did not test this, but it shouldn't need a lot of tweaking to work.
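If other parts of a path could also contain the text "2017Q3", a stricter replacement may be safer than a plain replace(). Here is a minimal sketch of such a helper; the name swap_quarter and its default arguments are made up for illustration, and it only changes paths that end in the old quarter tag followed by .xlsx:

```python
def swap_quarter(full_filename, old="2017Q3", new="2017Q4"):
    """Replace the quarter tag only when it sits right before the .xlsx extension."""
    old_suffix = old + ".xlsx"
    if full_filename.endswith(old_suffix):
        # Cut off the old suffix and append the new one
        return full_filename[:-len(old_suffix)] + new + ".xlsx"
    # Leave paths that do not match untouched
    return full_filename
```

Inside the loop you would then call node.setPropertyValue("full_filename", swap_quarter(ff)) instead of using ff.replace(...).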
I have a bunch of CSV files in a mounted blob container and I need to calculate the 'SHA1' hash values for every file to store as inventory. I'm very new to Azure cloud and pyspark so I'm not sure how this can be achieved efficiently. I have written the following code in Python Pandas and I'm trying to use this in pyspark. It seems to work however it takes quite a while to run as there are thousands of CSV files. I understand that things work differently in pyspark, so can someone please guide if my approach is correct, or if there is a better piece of code I can use to accomplish this task?
import os
import hashlib
import pandas as pd

class File:
    def __init__(self, path):
        self.path = path

    def get_hash(self):
        # Read in 4 KB chunks so large files are never loaded whole
        hasher = hashlib.sha1()
        with open(self.path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hasher.update(chunk)
        self.sha1hash = hasher.hexdigest()
        return self.sha1hash

path = '/dbfs/mnt/data/My_Folder' #Path to CSV files
cnt = 0
rlist = []
for root, subdirs, files in os.walk(path):
    for fi in files:
        if cnt < 10: #check on only 10 files for now as it takes ages!
            f = File(os.path.join(root, fi))
            cnt += 1
            hash_value = f.get_hash()
            # File has no "filename" attribute; the full path is stored in f.path
            results = {'File_Name': fi, 'File_Path': f.path, 'SHA1_Hash_Value': hash_value}
            rlist.append(results)
            print(fi)
df = pd.DataFrame(rlist)
print(str(cnt) + ' files processed')
#df.to_csv('/dbfs/mnt/workspace/Inventory/File_Hashes.csv', mode='a', header=False) #not sure how to write files in pyspark!
display(df)
Thanks
Since you want to treat the files as blobs rather than read them into a table, I would recommend spark.sparkContext.binaryFiles. In PySpark this gives you an RDD of pairs where the key is the file path and the value is the file's content as bytes, so you can calculate the hash in a map function, e.g. rdd.mapValues(calculate_hash_of_bytes).
For more information, refer to the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.binaryFiles.html#pyspark.SparkContext.binaryFiles
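A minimal sketch of the hashing step, assuming the value of each pair is the file's raw bytes. The function names are made up for illustration, and the Spark calls are left as comments (with an assumed mount path) so the hashing logic can be checked without a cluster:

```python
import hashlib

def sha1_of_bytes(content):
    """Return the hex SHA-1 digest of a file's raw bytes."""
    return hashlib.sha1(content).hexdigest()

def hash_pairs(pairs):
    """Local stand-in for rdd.mapValues(sha1_of_bytes) on (path, bytes) pairs."""
    return [(path, sha1_of_bytes(content)) for path, content in pairs]

# On a cluster (assumptions: `spark` is the active SparkSession and the
# folder is reachable under this path):
# rdd = spark.sparkContext.binaryFiles("/mnt/data/My_Folder")
# inventory = rdd.mapValues(sha1_of_bytes).toDF(["File_Path", "SHA1_Hash_Value"])
```

Note that this reads each file into memory as a single record, so it suits many small or medium files better than a few very large ones.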
I want to write to a gs file but I don’t know the file name at compile time. Its name is based on behavior that is defined at runtime. How can I proceed?
If you're using Beam Java, you can use FileIO.writeDynamic() for this (starting with Beam 2.3 which is currently in the process of being released - but you can already use it via the version 2.3.0-SNAPSHOT), or the older DynamicDestinations API (available in Beam 2.2).
Example of using FileIO.writeDynamic() to write a PCollection of bank transactions to different paths on GCS depending on the transaction's type:
PCollection<BankTransaction> transactions = ...;
transactions.apply(
    // Type parameters are <destination type, element type>
    FileIO.<TransactionType, BankTransaction>writeDynamic()
        .by(BankTransaction::getType)
        .via(BankTransaction::toString, TextIO.sink())
        .to("gs://bucket/myfolder/")
        // Include the type in the prefix so each destination gets its own files
        .withNaming(type -> defaultNaming("transactions_" + type, ".txt")));
For an example of DynamicDestinations use, see example code in the TextIO unit tests.
Alternatively, if you want to write each record to its own file, just use the FileSystems API (in particular, FileSystems.create()) from a DoFn.
For the Python crowd:
An experimental write was added to the Beam python SDK in 2.14.0, beam.io.fileio.WriteToFiles:
my_pcollection | beam.io.fileio.WriteToFiles(
    path='/my/file/path',
    destination=lambda record: 'avro' if record['type'] == 'A' else 'csv',
    sink=lambda dest: AvroSink() if dest == 'avro' else CsvSink(),
    file_naming=beam.io.fileio.destination_prefix_naming())
which can be used to write to different files per-record.
If your filename is based on data within your pcollections, you can use the destination and file_naming to create files based on each record's data.
More documentation here:
https://beam.apache.org/releases/pydoc/2.14.0/apache_beam.io.fileio.html#dynamic-destinations
And the JIRA issue here:
https://issues.apache.org/jira/browse/BEAM-2857
As @anrope mentioned already, apache_beam.io.fileio seems to be the latest Python API for writing files. The WordCount example is currently outdated since it uses the WriteToText class, which inherits from the now deprecated apache_beam.io.filebasedsink / apache_beam.io.iobase.
To add to existing answers, here is my pipeline in which I dynamically name the output files during runtime. My pipeline takes N input files and creates N output files, which are named based on their corresponding input file name.
import apache_beam as beam
from apache_beam.io.fileio import MatchAll, ReadMatches, TextSink, WriteToFiles

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'CreateFiles' >> beam.Create(input_file_paths)
     | 'MatchFiles' >> MatchAll()
     | 'OpenFiles' >> ReadMatches()
     | 'LoadData' >> beam.Map(custom_data_loader)
     | 'Transform' >> beam.Map(custom_data_transform)
     | 'Write' >> custom_writer
    )
When I load the data I create a PCollection of tuple records (file_name, data). All my transforms are applied to data, but I pass file_name through to the end of the pipeline to generate the output file names.
def custom_data_loader(f: beam.io.fileio.ReadableFile):
    file_name = f.metadata.path.split('/')[-1]
    data = custom_read_function(f.open())
    return file_name, data

def custom_data_transform(record):
    file_name, data = record
    data = custom_transform_function(data)  # not defined
    return file_name, data
And I save the file with:
def file_naming(record):
    file_name, data = record
    file_name = custom_naming_function(file_name)  # not defined
    return file_name

def return_destination(*args):
    """Optional: Return only the last arg (destination) to avoid sharding name format"""
    return args[-1]

custom_writer = WriteToFiles(
    path='path/to/output',
    file_naming=return_destination,
    destination=file_naming,
    sink=TextSink()
)
Replace all of the custom_* functions with your own logic.
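For reference, here is a sketch of a file_naming callable that keeps a shard suffix instead of returning the bare destination. It assumes Beam calls the naming function with (window, pane, shard_index, total_shards, compression, destination), which is my understanding of apache_beam.io.fileio and worth checking against your installed version. The function name and the .txt suffix are made up for illustration:

```python
def name_by_destination(window, pane, shard_index, total_shards, compression, destination):
    """Hypothetical file_naming callable: one file name per destination and shard."""
    # Defensive assumption: shard information may be missing
    if shard_index is None or total_shards is None:
        return f"{destination}.txt"
    return f"{destination}-{shard_index:05d}-of-{total_shards:05d}.txt"
```

It would be passed as file_naming=name_by_destination in place of return_destination. Keeping the shard in the name avoids two shards of the same destination claiming the same file name.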
I know this is a bit of an old question but I struggled with the examples in the documentation.
Here is a simple example of how to split files based on dict items.
import json

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).save_main_session = False

def file_names(*args):
    # Keep only the destination part of the default "destination----..." name
    file_name = fileio.destination_prefix_naming()(*args)
    destination, *_ = file_name.split("----")
    return f"{destination}.json"

class JsonSink(fileio.TextSink):
    def write(self, element):
        # Drop the "id" key; it is already encoded in the file name
        record = json.loads(element)
        record.pop("id")
        self._fh.write(json.dumps(record).encode("utf8"))
        self._fh.write("\n".encode("utf8"))

def destination(element):
    return json.loads(element)["id"]

with beam.Pipeline(options=pipeline_options) as p:
    data = [
        {"id": 0, "message": "whhhhhhyyyyyyy"},
        {"id": 1, "message": "world"},
        {"id": 1, "message": "hi there!"},
        {"id": 1, "message": "what's up!!!?!?!!?"},
    ]
    (
        p
        | "CreateEmails" >> beam.Create(data)
        | "JSONify" >> beam.Map(json.dumps)
        | "Write Files"
        >> fileio.WriteToFiles(
            path="path/",
            destination=destination,
            sink=lambda dest: JsonSink(),
            file_naming=file_names,
        )
    )
For my classroom, I have a PN532 NFC card reader/writer hooked up via UART to a Raspberry Pi 2, and I'm using Type 2 NXP NTAG213 NFC cards to store information specifically to the text record. While weak in Python, I used the example under subheader 8.3 in the NFCPy Documentation to write to the card and used "How to redirect 'print' output to a file using python?" in order to complete the output process to a text file. For a while, the reading, writing, and outputting to my text file worked:
import nfc
import nfc.ndef
import nfc.tag
import os, sys
import subprocess
import glob
from os import path
import datetime

f = open('BankTransactions.txt', 'a')
sys.stdout = f
path = '/home/pi/BankTransactions.txt'

def connected(tag): print(tag); return False

clf = nfc.ContactlessFrontend('tty:AMA0:pn532')
clf.connect(rdwr={'on-connect': connected})
tag = clf.connect(rdwr={'on-connect': connected})
record_1 = tag.ndef.message[0]
signature = nfc.tag.tty2_nxp.NTAG213
today = datetime.date.today()
print(record_1.pretty())
if tag.ndef is not None:
    print(tag.ndef.message.pretty())
    if tag.ndef.is_writeable:
        text_record = nfc.ndef.TextRecord("Jessica has 19 GP on card")
        tag.ndef.message = nfc.ndef.Message(text_record)
        print >> f, "Edited by Roman", today, record_1, signature, '\n'
f.close()
Now, however, when I use the same card for testing, it will not append the data within the text file. The data is still being written to the card, as I can read the information on the card with a simple read program.
I was able to remove the first few lines of a single file using the code below:
scala> val file = sc.textFile("file:///root/path/file.csv")
Removing first 5 lines:
scala> val Data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(5) else iter }
The problem is: Suppose that I have multiple files with the same columns, and I want to load all of them into rdd, removing the first few lines of each file.
Is this actually possible?
I'd appreciate any help. Thanks in advance!
Let's assume there are two files.
ravis-MacBook-Pro:files raviramadoss$ cat file.csv
first_file_first_record
first_file_second_record
first_file_third_record
first_file_fourth_record
first_file_fifth_record
first_file_sixth_record
ravis-MacBook-Pro:files raviramadoss$ cat file_2.csv
second_file_first_record
second_file_second_record
second_file_third_record
second_file_fourth_record
second_file_fifth_record
second_file_sixth_record
second_file_seventh_record
second_file_eight_record
Scala Code
sc.wholeTextFiles("/Users/raviramadoss/files").flatMap( _._2.lines.drop(5) ).collect()
Output:
res41: Array[String] = Array(first_file_sixth_record, second_file_sixth_record, second_file_seventh_record, second_file_eight_record)
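The same per-file logic can be sketched in Python. The helper below is a local stand-in that works on (path, content) pairs, which is the shape wholeTextFiles returns. The name drop_header_lines is made up for illustration, and the PySpark call is left as a comment that assumes an active SparkContext named sc:

```python
def drop_header_lines(files, n=5):
    """Skip the first n lines of every (path, content) pair and flatten the rest."""
    records = []
    for _path, content in files:
        # splitlines() yields the file's lines; the slice drops the first n
        records.extend(content.splitlines()[n:])
    return records

# PySpark equivalent (assumption: `sc` is an active SparkContext):
# sc.wholeTextFiles("/Users/raviramadoss/files") \
#   .flatMap(lambda kv: kv[1].splitlines()[5:]).collect()
```

Keep in mind that wholeTextFiles reads each file as a single record, so every file has to fit in memory on one executor.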
In Spark/Hadoop if you give the input path as the directory containing all the files then the code which you have written will work on all the individual files separately.
So to achieve your objective, just give the input path as the directory containing all the files. So the first few lines will be removed from all the files.
I would like to know if it's possible to use "^%GOF" without user interaction. I'm using Caché 2008. ^%GO isn't an option as it's to slow. I'm using input from a temporary file for automatically answer the questions, but it can fail (rarely happens).
I couldn't find the routine of this utility in %SYS. Where is it located?
Thanks,
Answer: use "%SYS.GlobalQuery:NameSpaceList" to get the list of globals (excluding system globals).
Set Rset = ##class(%ResultSet).%New("%SYS.GlobalQuery:NameSpaceList")
d Rset.Execute(namespace, "*", 0)
s globals=""
while (Rset.Next()){
    s globalName=Rset.Data("Name")_".gbl"
    if (globals=""){
        s globals = globalName
    }else{
        s globals = globals_","_globalName
    }
}
d ##class(%Library.Global).Export(namespace, globals, "/tmp/export.gof", 7)
The only drawback is that if the concatenated list of globals in a namespace exceeds the maximum string length, the program crashes. In that case you should split the list of globals.
I would recommend that you look at the %Library.Global() class with output format 7.
classmethod Export(Nsp As %String = $zu(5), ByRef GlobalList As %String, FileName As %String, OutputFormat As %Integer = 5, RecordFormat As %String = "V", qspec As %String = "d", Translation As %String = "") as %Status
Exports a list of globals GlobalList from a namespace Nsp to FileName using OutputFormat and RecordFormat.
OutputFormat can take the values below:
1 - DTM format
3 - VAXDSM format
4 - DSM11 format
5 - ISM/Cache format
6 - MSM format
7 - Cache Block format (%GOF)
RecordFormat can take the values below:
V - Variable Length Records
S - Stream Data
You can find it in the class documentation here: http://docs.intersystems.com/cache20082/csp/documatic/%25CSP.Documatic.cls
I've never used it, but it looks like it would do the trick.
export your global to file
d $system.OBJ.Export("myGlobal.GBL","c:\global.xml")
import global from your file
d $system.OBJ.Load("c:\global.xml")
Export items as an XML file
The extension of the items determine what
type they are, they can be one of:
CLS - classes
CSP - Cache Server Pages
CSR - Cache Rule files
MAC - Macro routines
INT - Non-macro routines
BAS - Basic routines
INC - Include files
GBL - Globals
PRJ - Studio Projects
OBJ - Object code
PKG - Package definition
If you wish to export multiple classes then separate them with commas or
pass the items("item")="" as an array or use wild cards.
If filename is empty then it will export to the current device.
link to docbook
edit: adding "-d" as the qspec value will suppress the terminal output of the export. If you want to use this programmatically, that output might otherwise get in the way.
And just for completeness' sake:
SAMPLES>s IO="c:\temp\test.gof"
SAMPLES>s IOT="RMS"
SAMPLES>s IOPAR="WNS"
SAMPLES>s globals("Sample.PersonD")=""
SAMPLES>d entry^%GOF(.globals)
SAMPLES>
-> results in c:\temp\test.gof containing the export. You can define up to 65435 globals in your array (named globals in this example).
But I would recommend you go with DAiMor's answer as this is the more 'modern' way.
To avoid the maximum string error, you should use subscripts instead of a comma-delimited string:
Set Rset = ##class(%ResultSet).%New("%SYS.GlobalQuery:NameSpaceList")
d Rset.Execute(namespace, "*", 0)
while (Rset.Next()) {
s globals(Rset.Data("Name"))="" // No need for _".gbl" in recent Cache
}
d ##class(%Library.Global).Export(namespace, .globals, "/tmp/export.gof", 7) // Note dot before globals