Write to a specific Pub/Sub topic based on a condition in Apache Beam with Python - apache-beam

I would like to read messages from Pub/Sub topic1 and write the cleaned JSON to topic2 or topic3 based on a condition.
Let's say I have a flag in the JSON coming from topic1; I do some transformations, then check the flag value and write to topic2 or topic3 based on that value.
I have tried the below, but I am not able to move further from here, since I have no idea how to branch the pipeline based on the condition.
My Beam pipeline code is as below:
with beam.Pipeline(options=pipeline_options) as p:
    Ingest = (p
              | 'Read from Topic' >> beam.io.ReadFromPubSub(topic=known_args.topic).with_output_types(bytes)
              | 'Decode' >> beam.Map(decode_message)
              | 'Make One Json' >> beam.Map(make_one)
              | 'Split based on event' >> beam.Map(split)
              # when event_name == 'aa_afo_addtocart_clicked'
              | 'write to topic2'
              # when event_name == 'aa_afo_merchantpage_visited'
              | 'write to topic3'
              )
In the 4th step I am calling the split function, but please guide me on how to write the split output to multiple topics.
The split Python function does the following:
it gets the one single input JSON -> checks the flag and splits the result in two -> one part should go to topic2 and the other to topic3.
def split(p):
    json_obj_list = json.load(p)
    jb = []
    for json_obj in json_obj_list:
        if json_obj['event_name'] == 'aa_afo_addtocart_clicked':
            filename = json_obj['event_name'] + '.json'
            with open(filename, 'a') as out_json_file:
                json_string = json.dumps(json_obj)
                print(json_string)
                #json.dump(json_obj, out_json_file)
        if json_obj['event_name'] == 'aa_afo_merchantpage_visited':
            filename = json_obj['event_name'] + '.json'
            with open(filename, 'a') as out_json_file:
                json_string = json.dumps(json_obj)
                print(json_string)

The solution here is to create several outputs, as described in the programming guide. That way you perform your split and get two PCollections as output.
Then process the two PCollections independently: sink one into topic2 and the other into topic3.
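A minimal sketch of that multi-output approach for this pipeline (an illustration, not a drop-in solution): it assumes decode_message and make_one return the cleaned JSON as a string, and TOPIC2 / TOPIC3 are placeholders for your output topic paths.

import json
import apache_beam as beam
from apache_beam import pvalue

class SplitByEvent(beam.DoFn):
    """Tags each cleaned JSON string for the topic it should go to."""
    def process(self, element):
        record = json.loads(element)
        if record['event_name'] == 'aa_afo_addtocart_clicked':
            yield pvalue.TaggedOutput('topic2', element)
        elif record['event_name'] == 'aa_afo_merchantpage_visited':
            yield pvalue.TaggedOutput('topic3', element)

with beam.Pipeline(options=pipeline_options) as p:
    cleaned = (p
               | 'Read from Topic' >> beam.io.ReadFromPubSub(topic=known_args.topic)
               | 'Decode' >> beam.Map(decode_message)
               | 'Make One Json' >> beam.Map(make_one))

    split_result = (cleaned
                    | 'Split based on event' >> beam.ParDo(SplitByEvent()).with_outputs('topic2', 'topic3'))

    # WriteToPubSub expects bytes, so encode each branch before writing.
    (split_result.topic2
     | 'Encode 2' >> beam.Map(lambda s: s.encode('utf-8'))
     | 'Write to topic2' >> beam.io.WriteToPubSub(topic=TOPIC2))
    (split_result.topic3
     | 'Encode 3' >> beam.Map(lambda s: s.encode('utf-8'))
     | 'Write to topic3' >> beam.io.WriteToPubSub(topic=TOPIC3))

Each tagged output is its own PCollection, which is exactly the "several outputs" the programming guide describes.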

Related

Split data into good and bad rows and write to output files using a Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write them to separate files in HDFS.
I ran the below commands in spark-shell.
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I check the first record, the value can be seen in the spark shell:
good.first()
But when I try to write to an output file, I see records like the ones below.
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], which leads to writing object references instead of string values in the write operation.
You should convert each array of strings back to a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")

Escape hyphen when reading multiple dataframes at once in pyspark

I can't seem to find any documentation on this in pyspark's docs.
I am trying to read multiple parquets at once, like this:
df = sqlContext.read.option("basePath", "/some/path")\
    .load("/some/path/date=[2018-01-01, 2018-01-02]")
and receive the following exception:
java.util.regex.PatternSyntaxException: Illegal character range near index 11
E date=[2018-01-01,2018-01-02]
I tried replacing the hyphen with \-, but then I just receive a file not found exception.
I would appreciate help with this.
Escaping the - is not your issue. You cannot specify multiple dates in a list like that. If you'd like to read multiple files, you have a few options.
Option 1: Use * as a wildcard:
df = sqlContext.read.option("basePath", "/some/path").load("/some/path/date=2018-01-0*")
But this will also read any files named /some/path/date=2018-01-03 through /some/path/date=2018-01-09.
Option 2: Read each file individually and union the results
from functools import reduce  # reduce is not a builtin in Python 3

dates = ["2018-01-01", "2018-01-02"]
df = reduce(
    lambda a, b: a.union(b),
    [
        sqlContext.read.option("basePath", "/some/path").load(
            "/some/path/date={d}".format(d=d)
        ) for d in dates
    ]
)

PowerShell Header Record

I have a scenario like the one below.
I have a .dat file where the header field names come in like this:
2_a 2_b 2_c 2_d 2_e - Header
a b c d e - Data
f g h i j - Trailer
Next time:
1_a 1_b 1_c 1_d 1_e - Header
c d e f g - Data
b d f j k - Trailer
The number prefixing my header record changes dynamically. Is there a way I can achieve this so that I just supply a value and it gets used as the prefix, e.g. if I input the value 3 the header record becomes 3_a 3_b and so on?
After that my data comes, and then the trailer. Please suggest a process, as I am new to PowerShell.
If you want to create a line like
2_a 2_b 2_c 2_d 2_e
or
1_a 1_b 1_c 1_d 1_e
dynamically, you could use the string format operator, -f, like this
$index = 2
$header = "{0}_a {0}_b {0}_c {0}_d {0}_e" -f $index
This will create the first header and save it to a variable. Change the $index variable to create another string with a different number.
See the documentation for the -f operator for more info on its usage.

How to write to a file name defined at runtime?

I want to write to a gs file but I don’t know the file name at compile time. Its name is based on behavior that is defined at runtime. How can I proceed?
If you're using Beam Java, you can use FileIO.writeDynamic() for this (starting with Beam 2.3 which is currently in the process of being released - but you can already use it via the version 2.3.0-SNAPSHOT), or the older DynamicDestinations API (available in Beam 2.2).
Example of using FileIO.writeDynamic() to write a PCollection of bank transactions to different paths on GCS depending on the transaction's type:
PCollection<BankTransaction> transactions = ...;
transactions.apply(
    FileIO.<BankTransaction, TransactionType>writeDynamic()
        .by(BankTransaction::getType)
        .via(BankTransaction::toString, TextIO.sink())
        .to("gs://bucket/myfolder/")
        .withNaming(type -> defaultNaming("transactions_", ".txt")));
For an example of DynamicDestinations use, see example code in the TextIO unit tests.
Alternatively, if you want to write each record to its own file, just use the FileSystems API (in particular, FileSystems.create()) from a DoFn.
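For the Python SDK, a minimal sketch of that last suggestion might look like the DoFn below; it assumes each element is a (filename, text) tuple, and gs://my-bucket/output/ is a placeholder path.

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class WriteOneFilePerRecord(beam.DoFn):
    """Writes each (filename, text) element to its own object via FileSystems.create()."""
    def process(self, element):
        filename, text = element
        path = 'gs://my-bucket/output/{}.txt'.format(filename)  # placeholder bucket/path
        writer = FileSystems.create(path)
        writer.write(text.encode('utf-8'))
        writer.close()
        yield path  # emit the written path so downstream steps can log it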
For the Python crowd:
An experimental write was added to the Beam python SDK in 2.14.0, beam.io.fileio.WriteToFiles:
my_pcollection | beam.io.fileio.WriteToFiles(
    path='/my/file/path',
    destination=lambda record: 'avro' if record['type'] == 'A' else 'csv',
    sink=lambda dest: AvroSink() if dest == 'avro' else CsvSink(),
    file_naming=beam.io.fileio.destination_prefix_naming())
which can be used to write to different files per-record.
If your filename is based on data within your PCollection, you can use the destination and file_naming parameters to create files based on each record's data.
More documentation here:
https://beam.apache.org/releases/pydoc/2.14.0/apache_beam.io.fileio.html#dynamic-destinations
And the JIRA issue here:
https://issues.apache.org/jira/browse/BEAM-2857
As @anrope mentioned already, apache_beam.io.fileio seems to be the latest Python API for writing files. The WordCount example is currently outdated, since it uses the WriteToText class, which inherits from the now deprecated apache_beam.io.filebasedsink / apache_beam.io.iobase.
To add to existing answers, here is my pipeline in which I dynamically name the output files during runtime. My pipeline takes N input files and creates N output files, which are named based on their corresponding input file name.
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'CreateFiles' >> beam.Create(input_file_paths)
     | 'MatchFiles' >> MatchAll()
     | 'OpenFiles' >> ReadMatches()
     | 'LoadData' >> beam.Map(custom_data_loader)
     | 'Transform' >> beam.Map(custom_data_transform)
     | 'Write' >> custom_writer
     )
When I load the data I create a PCollection of tuple records (file_name, data). All my transforms are applied to data, but I pass file_name through to the end of the pipeline to generate the output file names.
def custom_data_loader(f: beam.io.fileio.ReadableFile):
    file_name = f.metadata.path.split('/')[-1]
    data = custom_read_function(f.open())
    return file_name, data

def custom_data_transform(record):
    file_name, data = record
    data = custom_transform_function(data)  # not defined
    return file_name, data
And I save the file with:
def file_naming(record):
    file_name, data = record
    file_name = custom_naming_function(file_name)  # not defined
    return file_name

def return_destination(*args):
    """Optional: Return only the last arg (destination) to avoid sharding name format"""
    return args[-1]

custom_writer = WriteToFiles(
    path='path/to/output',
    file_naming=return_destination,
    destination=file_naming,
    sink=TextSink()
)
Replace all of the custom_* functions with your own logic.
I know this is a bit of an old question but I struggled with the examples in the documentation.
Here is a simple example of how to split files based on dict items.
import json

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).save_main_session = False

def file_names(*args):
    file_name = fileio.destination_prefix_naming()(*args)
    destination, *_ = file_name.split("----")
    return f"{destination}.json"

class JsonSink(fileio.TextSink):
    def write(self, element):
        record = json.loads(element)
        record.pop("id")
        self._fh.write(json.dumps(record).encode("utf8"))
        self._fh.write("\n".encode("utf8"))

def destination(element):
    return json.loads(element)["id"]

with beam.Pipeline(options=pipeline_options) as p:
    data = [
        {"id": 0, "message": "whhhhhhyyyyyyy"},
        {"id": 1, "message": "world"},
        {"id": 1, "message": "hi there!"},
        {"id": 1, "message": "what's up!!!?!?!!?"},
    ]
    (
        p
        | "CreateEmails" >> beam.Create(data)
        | "JSONify" >> beam.Map(json.dumps)
        | "Write Files"
        >> fileio.WriteToFiles(
            path="path/",
            destination=destination,
            sink=lambda dest: JsonSink(),
            file_naming=file_names,
        )
    )

Real-time output from engines in IPython parallel?

I am running a bunch of long-running tasks with IPython's great parallelization functionality.
How can I get real-time output from the ipengines' stdout in my IPython client?
E.g., I'm running dview.map_async(fun, lots_of_args) and fun prints to stdout. I would like to see the outputs as they are happening.
I know about AsyncResult.display_output(), but it's only available after all tasks have finished.
You can see stdout in the meantime by accessing AsyncResult.stdout, which will return a list of strings, which are the stdout from each engine.
The simplest case being:
print(ar.stdout)
You can wrap this in a simple function that prints stdout while you wait for the AsyncResult to complete:
import sys
import time
from IPython.display import clear_output

def wait_watching_stdout(ar, dt=1, truncate=1000):
    while not ar.ready():
        stdouts = ar.stdout
        if not any(stdouts):
            time.sleep(dt)  # nothing yet; wait before polling again
            continue
        # clear_output doesn't do much in terminal environments
        clear_output()
        print('-' * 30)
        print("%.3fs elapsed" % ar.elapsed)
        print("")
        for eid, stdout in zip(ar._targets, ar.stdout):
            if stdout:
                print("[ stdout %2i ]\n%s" % (eid, stdout[-truncate:]))
        sys.stdout.flush()
        time.sleep(dt)
An example notebook illustrating this function.
Now, if you are using older IPython, you may see an artificial restriction on access of the stdout attribute ('Result not ready' errors).
The information is available in the metadata, so you can still get at it while the task is not done:
rc.spin()
stdout = [ rc.metadata[msg_id]['stdout'] for msg_id in ar.msg_ids ]
This is essentially the same thing that the ar.stdout attribute access does.
Just in case somebody is still struggling with getting the ordinary print outputs of the individual kernels: I adapted minrk's answer so that I get the output of each kernel as if it were a local one, by constantly checking whether the stdout of each kernel changes while the program is running.
asdf = dview.map_async(function, arguments)

# initialize a stdout0 array for comparison
stdout0 = asdf.stdout

while not asdf.ready():
    # check if stdout changed for any kernel
    if asdf.stdout != stdout0:
        for i in range(0, len(asdf.stdout)):
            if asdf.stdout[i] != stdout0[i]:
                # print only new stdout's without previous message and remove '\n' at the end
                print('kernel ' + str(i) + ': ' + asdf.stdout[i][len(stdout0[i]):-1])
        # set stdout0 to last output for new comparison
        stdout0 = asdf.stdout
    else:
        continue

asdf.get()
Outputs will then be something like:
kernel0: message 1 from kernel 0
kernel1: message 1 from kernel 1
kernel0: message 2 from kernel 0
kernel0: message 3 from kernel 0
kernel1: message 2 from kernel 1
...