Return a dataframe from another notebook in databricks - pyspark

I have a notebook that processes a file and creates a data frame in a structured format.
Now I need to import that data frame into another notebook. The problem is that I only need to run the first notebook in some scenarios, so I have to validate a condition before running it.
Usually we use %run to import all data structures, but in my case it has to be a combination of an if clause and a notebook run:
if "dataset" in path: %run ntbk_path
This gives the error "path not exist".
if "dataset" in path: dbutils.notebook.run(ntbk_path)
With this one I cannot get all the data structures.
Can someone help me to resolve this error?

To implement it correctly you need to understand how things are working:
%run is a separate directive that must be placed in its own notebook cell; you can't mix it with Python code. Also, it can't accept the notebook name as a variable. What %run does is evaluate the code from the specified notebook in the context of the current Spark session, so everything defined in that notebook (variables, functions, etc.) is available in the caller notebook.
dbutils.notebook.run is a function that takes a notebook path plus parameters and executes the notebook as a separate job on the current cluster. Because it's executed as a separate job, it doesn't share the context with the current notebook, and nothing defined in it will be available in the caller notebook (you can return a simple string as the execution result, but it has a relatively small maximum length). One of the problems with dbutils.notebook.run is that scheduling the job takes several seconds, even if the code is very simple.
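The "simple string" result mentioned above can carry a small JSON payload. The following is only a sketch: the helper name run_notebook_for_json and the payload keys are hypothetical, and dbutils is passed in explicitly so the function can be tried outside Databricks with a stand-in object.

```python
import json


def run_notebook_for_json(dbutils, ntbk_path, arguments, timeout_seconds=300):
    """Run a notebook as a separate job and decode the JSON string it exits with."""
    # Assumption: the called notebook ends with something like
    #   dbutils.notebook.exit(json.dumps({"status": "ok", "rows": 1000}))
    result = dbutils.notebook.run(ntbk_path, timeout_seconds, arguments)
    return json.loads(result)
```

This only suits small results such as a status, a row count, or the name of a view; the data frame itself still has to be shared some other way.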
How can you implement what you need?
If you use dbutils.notebook.run, the called notebook can register a temp view, and the caller notebook can read data from it (the examples are adapted from this demo).
Called notebook (Code1; it requires two parameters: name for the view name and n for the number of entries to generate):
name = dbutils.widgets.get("name")
n = int(dbutils.widgets.get("n"))
df = spark.range(0, n)
df.createOrReplaceTempView(name)
Caller notebook (let's call it main):
if "dataset" in path:
    view_name = "some_name"
    dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})
    df = spark.sql(f"select * from {view_name}")
    # ... work with data
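The caller-side logic above can also be wrapped in a small helper. This is a sketch under assumptions: the function name load_optional_dataframe and its parameters are hypothetical, and dbutils and spark are passed in explicitly so the helper can be exercised with stand-ins outside Databricks.

```python
def load_optional_dataframe(path, ntbk_path, dbutils, spark,
                            view_name="some_name", n=1000, timeout_seconds=300):
    """Run the notebook only when `path` mentions "dataset".

    Returns the DataFrame behind the temp view, or None when the
    notebook was not run.
    """
    if "dataset" not in path:
        return None
    # Notebook arguments are passed as strings; the called notebook
    # reads them with dbutils.widgets.get.
    dbutils.notebook.run(ntbk_path, timeout_seconds,
                         {"name": view_name, "n": str(n)})
    return spark.table(view_name)
```

Returning None when the condition is not met lets the caller test the result with "is not None" before working with the data.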
It's even possible to do something similar with %run, but it requires a kind of "magic". It relies on the fact that you can pass arguments to the called notebook using $arg_name="value", and you can even refer to values specified in widgets. But in any case, the check of the value will happen in the called notebook.
The called notebook could look like this:
flag = dbutils.widgets.get("generate_data")
dataframe = None
if flag == "true":
    dataframe = ...  # create the dataframe
and the caller notebook could look like this:
------ cell in python
if "dataset" in path:
    gen_data = "true"
else:
    gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)
------- cell for %run
%run ./notebook_name $generate_data=$gen_data
------ again in python
dbutils.widgets.remove("gen_data")  # remove widget
if dataframe is not None:  # dataframe was created
    ...  # do something with dataframe


how to pass the outputs from Get metadata stage and use it for file name comparison in databricks notebook

I have 2 Get Metadata stages in ADF that fetch file names from 2 different folders. I need to use these outputs for file name comparison in a Databricks notebook and return true if all the files are present.
How do I pass the output from the Get Metadata stages to Databricks, perform a string comparison, and return true if all files are present and false if even 1 file is missing?
How can I achieve this?
The answer below is explained with 1 Get Metadata stage; the same can be replicated for more than one.
Create an ADF pipeline with the activities below.
In the Get Metadata activity, add childItems as an argument in the Field list so that the output of Get Metadata can be passed to the notebook, as shown below.
In the Databricks Notebook activity, add the parameter below as a base parameter; it captures the output of Get Metadata and passes it as an input parameter to the notebook. This parameter is normally of object datatype, but I converted it to a string so the file names can be accessed in the notebook, as shown below.
@string(activity('Get Metadata1').output.childItems)
Now we can access the Get Metadata output as a string in the notebook.
import ast
required_filenames = ['File1.csv','File2.csv','File3.csv'] ##This is for comparing with the output we get from GetMetadata activity.
metadata_value = dbutils.widgets.get('metadata_output') ##Accessing the output from Get Metadata and storing into a variable using databricks widgets.
metadata_list = ast.literal_eval(metadata_value) ##Converting the above string datatype to the list datatype.
blob_output_list=[] ##Creating an empty list to add the names of files we get from GetMetadata activity.
for i in metadata_list:
    blob_output_list.append(i['name']) ##This adds the names of all files from blob storage to the empty list created above.
validateif = all(item in blob_output_list for item in required_filenames) ##validateif is True only if every required file name appears in the Get Metadata output, otherwise False.
I tried it this way and was able to meet the requirement. Hope this helps.
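As a variation on the check above, the comparison can also report which files are missing. This is a sketch, not the answer's own method: it assumes the string produced by string(...) is a JSON array of objects with a name key, parses it with json.loads instead of ast.literal_eval, and uses the hypothetical helper name missing_files.

```python
import json


def missing_files(metadata_json, required_filenames):
    """Return the required file names that are absent from the childItems JSON."""
    present = {item["name"] for item in json.loads(metadata_json)}
    return [name for name in required_filenames if name not in present]


# Hypothetical sample of what the notebook parameter might contain.
sample = '[{"name": "File1.csv", "type": "File"}, {"name": "File2.csv", "type": "File"}]'
print(missing_files(sample, ["File1.csv", "File2.csv", "File3.csv"]))  # ['File3.csv']
```

An empty list means all files are present, so "not missing_files(...)" gives the same True/False answer as validateif.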

How to writeback to dataframe using transform_df in palantir foundry?

I created a library for updating the descriptions of the columns of an input dataset. The function takes three parameters as input (input_dataset, output_dataset, config file) and eventually writes the descriptions back to the output dataset. Now we want to import this library across various use cases. How should we handle the cases where we write a Spark transformation that takes its inputs through transform_df, since there we can't assign the output to an output variable? How can I call my description library function in that situation in Palantir Foundry? Any suggestions?
This method isn't currently supported using the @transform_df decorator; you'll have to use the @transform decorator at the moment.
The reasoning is that this kind of feature needs broader access to metadata APIs, which the @transform decorator already allows. It seemed more in line with that pattern to keep it there, since the @transform_df decorator is inherently higher-level.
You can always simply move your transformations over from...
from transforms.api import transform_df, Input, Output
@transform_df(
    Output("/my/output"),
    my_input=Input("/my/input"),
)
def my_compute_function(my_input):
    df = my_input
    # ... logic ....
    return df
...to...
from transforms.api import transform, Input, Output
@transform(
    my_output=Output("/my/output"),
    my_input=Input("/my/input")
)
def my_compute_function(my_input, my_output):
    df = my_input.dataframe()
    # ... logic ....
    my_output.write_dataframe(df)
...in which only 6 lines of code need be changed.

neo4j import script endless loop because 2 properties with same name

I just managed to freeze my whole environment with a cypher import script. The process was running with 99% CPU uncontrollably until we killed it.
I am not sure, but I think the bug was in the import script - trying to set 2 properties with the same name - reading like
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS import FIELDTERMINATOR ';'
... (some WITH / WHERE clauses)
CREATE (:Mylabel {myproperty: import.column1, myproperty: import.column2});
Does anyone have experience with behaviour like that?
EDIT:
I am not allowed to copypaste the exact code, but I can try and leave it semantically intact:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///db.csv' AS row FIELDTERMINATOR ';'
WITH row
WHERE row.typerow = 'Some_Identifier'
WITH head(collect(row.id)) as aid, row.exclusive AS excl, toInteger(row.alwsel) AS alwsel
CREATE (:Mylabel:Mytype {aid: toInteger(aid), exclusive: toString(excl),
exclusive: CASE WHEN alwsel=1 THEN true ELSE NULL END});
As was inquired below: there is no constraint on the property in question. I am currently not able to do any tests. I will be in a few days.

How to track function calls in MATLAB?

I would like a way to track all function calls that have operated on a specific workspace variable -- for instance, a sound waveform that will be transformed by various signal processing functions.
One cumbersome and fragile way is to do this:
>> cfg = [];
>> curr_call = 'data_out = my_function(cfg,data_in);';
>> eval(curr_call);
>> data_out.cfg.history = cat(1,data_out.cfg.history,{curr_call});
What would be much better is the following:
>> cfg = [];
>> data_out = my_function(cfg,data_in);
>> data_out.cfg.history
'data_out = my_function(cfg,data_in);'
EDIT for clarification: In other words, this variable has a field, cfg.history, that keeps track of all history-enabled functions that have operated on it (ideally with arguments). The history field should be updated regardless of where function calls originate: my example above is from the command line, but calls made from cell mode or within a script should also be appended to the history. Obviously, I can edit my_function() in the above example so that it can modify the history field.
NOTE in response to discussion below: the motivation for doing this is to have the history "attached" to the data in question, rather than say, in a separate log file which would then need to be packaged with the data somehow.
Can this be done?
You can access the full Session history using this code:
import com.mathworks.mlservices.MLCommandHistoryServices
history=MLCommandHistoryServices.getSessionHistory;
To achieve what you want, use this code:
import com.mathworks.mlservices.MLCommandHistoryServices;
startcounter=numel(MLCommandHistoryServices.getSessionHistory);
disp('mydummycommand');
disp('anotherdummycommand');
history=MLCommandHistoryServices.getSessionHistory;
commands=cell(history(startcounter-2:end-1));
Be aware that these functions are undocumented. They use the command history, which is typically shown at the bottom right of the MATLAB window.

date in pig latin

I am trying to do the following. I have multiple dates, and I want to create a pig script that takes an unknown number of input dates and then runs for those input arguments. My question is:
How can I send an unknown number of input variables to a pig script and then handle them within the pig script?
Thanks
Sara
I have some trouble understanding what you actually want to do. This would be my solution for your problem, sending an unknown number of dates (stored as chararray):
A = load 'input_dates' AS (date:chararray);
B = my_macro(A);
It's quite basic, so I guess I didn't understand your problem correctly. Could you maybe elaborate a little more on your problem?
UPDATE: How about something like this if you use Pig 0.11 (there is a bug with module imports up to 0.10):
#!/usr/bin/python
import os
from org.apache.pig.scripting import *
P = Pig.compile("""
data = LOAD '$docs_in' AS (a:int);
-- do something
""")
lof = os.listdir("/home/.../dates/")
# build one set of parameters (one binding) per input file
params = []
for elem in lof:
    params.append({'docs_in': str(elem)})
# bind all parameter sets, then run every bound pipeline
bound = P.bind(params)
stats = bound.run()
If each run is counting on the result of the previous one, use runSingle() instead.
If I understand the question correctly, you want to load a number of files or directories. You can specify them as a comma-separated list in the input parameter.
Below is an example:
load.pig (content):
A = LOAD '$input' using PigStorage();
dump A;
Command to run (locally):
pig -x local -param input=20120301,20120302,20120304 load.pig
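Since the number of dates is not known in advance, the command above can be assembled by a small wrapper. This is only a sketch; the function name build_pig_command and its defaults are hypothetical, and it builds the argument list without running pig.

```python
def build_pig_command(dates, script="load.pig", local=True):
    """Build the argument list for a pig run over any number of dates."""
    if not dates:
        raise ValueError("at least one date is required")
    command = ["pig"]
    if local:
        command += ["-x", "local"]
    # The dates are joined with commas, as in the command shown above.
    command += ["-param", "input=" + ",".join(dates), script]
    return command


print(" ".join(build_pig_command(["20120301", "20120302", "20120304"])))
# pig -x local -param input=20120301,20120302,20120304 load.pig
```

The resulting list can be handed to subprocess.run if the wrapper should also launch the script.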