Python code for a Beam pipeline with HDFS support - apache-beam

I am running the sentiment example here for TensorFlow Transform:
https://github.com/tensorflow/transform/blob/master/examples/sentiment_example.py
For the fn ReadAndShuffleData() defined in lines 78-98, is it possible to load files in a similar way, but from HDFS instead of GCS?
I have spent a whole day trying several Beam APIs (beam 2.8.0) but failed; the most promising one, I think, is beam.io.hadoopfilesystem. However, that module produces a Python file object, which cannot be read with beam.io.ReadFromText() inside a Beam pipeline.
I also passed in HadoopFileSystemOptions correctly. Can anyone point me in a direction, share a two- or three-line code snippet, or suggest a workaround? Thank you very much!
P.S. Hadoop 2.7.7, Beam 2.8, and the data itself loads correctly.
I think I may lack some theoretical understanding here; any references would be appreciated!

You might use the apache_beam.Create transform:
Init signature: beam.Create(self, values, reshuffle=True)
Docstring: A transform that creates a PCollection from an iterable.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, HadoopFileSystemOptions
from apache_beam.io.hadoopfilesystem import HadoopFileSystem
HDFS_HOSTNAME = 'foo.hadoop.com'
HDFS_PORT = 50070
hdfs_client_options = HadoopFileSystemOptions(hdfs_host=HDFS_HOSTNAME, hdfs_port=HDFS_PORT, hdfs_user="foobar")
hdfs_client = HadoopFileSystem(hdfs_client_options)
input_file_hdfs = "hdfs://foo/bar.csv"
f = hdfs_client.open(input_file_hdfs)  # file-like object for the HDFS path
p = beam.Pipeline(options=PipelineOptions())
lines = p | 'ReadMyFile' >> beam.Create(f)  # build a PCollection from the opened file object
res = lines | "WriteMyFile" >> beam.io.WriteToText("./bar", ".csv")
p.run()
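As an alternative (not from the original answer), here is a minimal sketch that skips the manual client entirely: pass the HDFS settings as pipeline options and let beam.io.ReadFromText resolve the hdfs:// path itself. Hostname, port, user, and paths are the same placeholders as above.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# Sketch only: the HDFS connection settings become pipeline options, so the
# hdfs:// scheme can be resolved by Beam's HadoopFileSystem at read time.
options = PipelineOptions([
    '--hdfs_host', 'foo.hadoop.com',
    '--hdfs_port', '50070',
    '--hdfs_user', 'foobar',
])
with beam.Pipeline(options=options) as p:
    lines = p | 'ReadFromHdfs' >> beam.io.ReadFromText('hdfs://foo/bar.csv')
    lines | 'WriteLocal' >> beam.io.WriteToText('./bar', '.csv')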

Related

K-means in PySpark running infinitely in Jupyter notebook, works fine in Zeppelin notebook

I am running a k-means algorithm in pyspark:
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
import numpy as np
kmeans_modeling = KMeans(k = 3, seed = 0)
model = kmeans_modeling.fit(data.select("parameters"))
The data is a pyspark sql dataframe: pyspark.sql.dataframe.DataFrame
However, the algorithm runs indefinitely (it takes much, much longer than it should for the amount of data in the dataframe).
Does anyone know what could be causing the algorithm to behave like this? I ran this exact code for a different dataframe of the same type, and everything worked fine.
The dataset I used before (that worked) had 72020 rows and 35 columns, and the present dataset has 60297 rows and 31 columns, so it is not a size-related problem. The data was normalized in both cases, but I assume the problem has to be in the data treatment. Can anyone help me with this? If any other information is needed let me know in the comments and I will answer or edit the question.
EDIT:
This is what I can show about creating the data:
aux1 = temp.filter("valflag = 0")
sample = spark.read.option("header", "true").option("delimiter", ",").csv("gs://LOCATION.csv").select("id")
data_pre = aux1.join(sample, sample["sample"] == aux1["id"], "leftanti").drop("sample")
data_pre.createOrReplaceTempView("data_pre")
data_pre = spark.table("data_pre")
data_pre = data_pre.withColumn(col, functions.col(col).cast("double"))  # presumably applied per column `col` in a loop
data_pre = data_pre.na.fill(0)
data = vectorization_function(df = data_pre, inputCols = inputCols, outputCol = "parameters")
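(For reference only, and as an assumption rather than the poster's actual code: vectorization_function presumably wraps a pyspark.ml VectorAssembler, roughly like this.)
from pyspark.ml.feature import VectorAssembler
# Assumed shape of vectorization_function: pack the numeric input columns
# into a single vector column named "parameters" for KMeans.
assembler = VectorAssembler(inputCols=inputCols, outputCol="parameters")
data = assembler.transform(data_pre)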
EDIT 2: I cannot provide additional information about the data, but I have now realized that the algorithm runs without problems in a Zeppelin notebook, but it is not working in a Jupyter notebook; I have edited the tags and title accordingly. Does anyone know why this could be happening?
Here is some documentation about running clustering jobs in Spark.
https://spark.apache.org/docs/latest/ml-clustering.html
Here is another, very similar, reference.
https://spark.apache.org/docs/latest/mllib-clustering.html
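Not from the original answers, but as a debugging sketch it can help to first force the upstream plan to materialize and then bound the KMeans fit so it cannot run indefinitely. The featuresCol name "parameters" comes from the question; the maxIter and tol values are placeholders.
from pyspark.ml.clustering import KMeans
# Force the upstream plan (joins, casts, fills, assembler) to evaluate first;
# if this count already hangs, the problem is in the data preparation, not KMeans.
features = data.select("parameters")
print("rows to cluster:", features.count())
# Cap the number of iterations and loosen the tolerance so the fit is bounded.
kmeans = KMeans(k=3, seed=0, maxIter=20, tol=1e-4, featuresCol="parameters")
model = kmeans.fit(features)
print(model.clusterCenters())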

How to write back to a dataframe using transform_df in Palantir Foundry?

I created a library for updating the descriptions of the columns of an input dataset. The function takes three parameters (input_dataset, output_dataset, config file) and eventually writes back the descriptions of the output dataset. We now want to import this library across various use cases. How do I handle the cases where we write Spark transformations, i.e. take inputs through transform_df, given that there we can't assign the output to an output variable? In that situation, how can I call my description library function? How should I proceed in Palantir Foundry? Any suggestions?
This method isn't currently supported using the @transform_df decorator; you'll have to use the @transform decorator at the moment.
The reasoning behind this resulted from recognizing the need for broader access to metadata APIs, which the @transform decorator already allows. It therefore seemed more in line with this pattern to keep it there, since the @transform_df decorator is inherently higher-level.
You can always simply move over your transformations from...
from transforms.api import transform_df, Input, Output
@transform_df(
    Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input):
    df = my_input
    # ... logic ...
    return df
...to...
from transforms.api import transform, Input, Output
@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ...
    my_output.write_dataframe(df)
...in which only a few lines of code need to change.
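Coming back to the original question, here is a hypothetical sketch of where the description-update call could then live; update_descriptions and the config path are placeholder names for the question's library function and its config file, not a Foundry API.
from transforms.api import transform, Input, Output
# Hypothetical import of the description-update library from the question
# from my_description_lib import update_descriptions
@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ...
    my_output.write_dataframe(df)
    # With @transform you still hold the output object after writing it, so the
    # hypothetical library call can receive both datasets plus its config file.
    update_descriptions(my_input, my_output, "config.yml")  # placeholder config path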

Flink: How to write DataSet to a variable instead of to a file

I have a Flink batch program written in Scala using the DataSet API, which results in a final dataset I am interested in. I would like to get that dataset as a variable or value (e.g. a list or sequence of String) within my program, without having to write it to any file. Is it possible?
I have seen that Flink allows for collection data sinks for debugging (the only example in the docs is in Java). However, this is only allowed in local execution, and in any case I don't know its equivalent in Scala. What I would like is to write the final resulting dataset, after the whole Flink parallel execution is done, to a program value or variable.
First, try this for the Scala version of the collection data sink:
import org.apache.flink.api.scala._
import org.apache.flink.api.java.io.LocalCollectionOutputFormat
// ...
val env = ExecutionEnvironment.getExecutionEnvironment
// Create a DataSet from a list of elements
val words = env.fromElements("w1", "w2", "w3")
var outData: java.util.List[String] = new java.util.ArrayList[String]()
words.output(new LocalCollectionOutputFormat(outData))
// execute program
env.execute("Flink Batch Scala")
println(outData)
Second, if your dataset fits in the memory of a single machine, why do you need a distributed processing framework at all? Think more about your use case, and try to use the right transformations on your dataset.
I used Flink 1.7.2 with Scala 2.12. This is a streaming prediction using an SVM that I wrapped up in a Model class. I think the most correct answer is to use collect(); it returns a Seq. I got this answer after searching for hours, and I got the idea from Flink Git - Line 95.
var temp_jaringan: DataSet[(Vector, Double)] = model.predict_jaringan(value)
temp_jaringan.print()
var temp_produk: DataSet[(Vector, Double)] = model.predict_produk(value)
temp_produk.print()
var result_jaringan: Seq[(Vector, Double)] = temp_jaringan.collect()
var result_produk: Seq[(Vector, Double)] = temp_produk.collect()
if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == 1.0) {
  println("Keduanya")       // "Both"
} else if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == -1.0) {
  println("Jaringan")       // "Network"
} else if (result_jaringan(0)._2 == -1.0 && result_produk(0)._2 == 1.0) {
  println("Produk")         // "Product"
} else {
  println("Bukan Keduanya") // "Neither"
}
It may vary for other versions; after using and searching Flink material like a mad dog for weeks, even months, for my final graduation project, I know that these Flink development projects need more documentation and tutorials, especially for beginners like me.
Anyway, correct me if I'm wrong. Thanks!

In Spyder, is there some way to plot a dataframe by right-clicking it like you can with arrays?

I've just recently discovered that you can right-click an array in Spyder and get a quick plot of the data. With sample data like this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Some numbers in a data frame
nsample = 440
x1 = np.linspace(0, 100, nsample)
y = np.sin(x1)
dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
df = pd.DataFrame({'dates':dates, 'x1':x1, 'y':y})
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
you can go to the Variable Explorer, right-click y, and get a quick plot of the data directly in the console. The same option does not seem to be available for a pandas DataFrame.
Sure, you could easily go for df.plot() instead. But I really like the right-click option to check whether the variables and dataframes look the way I expect them to when I'm messing around with a lot of data. So, is there any library I'd have to import? Or maybe something in the settings? I've also noticed that what happens in the console is this little piece of magic: %varexp --plot y, but I can't seem to find an equivalent for DataFrames.
Thank you for any suggestions!
(Spyder developer here) This is just a bit of missing functionality for DataFrames, but it's very easy to implement.
Please open an issue in our issue tracker, so we don't forget to do it in a future release.
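Not part of the Spyder answer above, but as a stopgap sketch: pulling a column out of the DataFrame as a plain NumPy array makes it show up in the Variable Explorer with the right-click plot available (the variable names are just illustrative).
# Expose individual columns as arrays so the Variable Explorer's right-click
# plot (i.e. %varexp --plot <name>) can be used on them.
y_arr = df['y'].values
x1_arr = df['x1'].values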

Dates in Pig Latin

I am trying to do the following: I have multiple dates, and I want to create a Pig script that receives an unknown number of input dates and then runs for those input arguments. My question is:
How can I send an unknown number of input variables to a Pig script and then handle them within the script?
Thanks
Sara
I have some trouble understanding what you actually want to do. This would be my solution for your problem, sending an unknown number of dates (stored as chararray):
A = load 'input_dates' AS (date:chararray);
B = my_macro(A);
It's quite basic, so I guess I didn't understand your problem correctly. Could you maybe elaborate a bit more on your problem?
UPDATE: How about something like this if you use Pig 0.11 (there is a bug up to 0.10 with module imports):
#!/usr/bin/python
import os
from org.apache.pig.scripting import *
P = Pig.compile("""
data = LOAD '$docs_in' AS (a:int);
-- do something
""")
lof = os.listdir("/home/.../dates/")
params = []
for elem in lof:
    params.append({'docs_in': str(elem)})
bound = P.bind(params)  # bind one parameter set per date entry
stats = bound.run()     # runs all bound parameter sets in parallel
If each run depends on the result of the previous one, use runSingle() instead.
If I understand the question correctly, you want to load a number of files or directories. You can specify them as a comma-separated list in the input parameter.
Below is an example:
load.pig (content):
A = LOAD '$input' using PigStorage();
dump A;
Command to run (locally):
pig -x local -param input=20120301,20120302,20120304 load.pig
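Not part of the original answers, but if the set of dates is genuinely unknown ahead of time, here is a hedged sketch of building that comma-separated -param value dynamically and invoking the same load.pig (the /data/dates directory is a placeholder).
import os
import subprocess
# Placeholder directory whose entries are date-named folders, e.g. 20120301/
dates_dir = "/data/dates"
dates = ",".join(sorted(os.listdir(dates_dir)))
# Run the load.pig shown above with the dynamically built date list
subprocess.run(
    ["pig", "-x", "local", "-param", "input=" + dates, "load.pig"],
    check=True,
)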