I have multiple .csv files in a folder on Azure. Using PySpark I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full as some of them can be quite large.
I am new to PySpark and do not yet understand the basics, so I would appreciate any help.
I have written code for your scenario and it works fine.
Create an empty list and append to it all the filenames stored in the source folder:
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)

# converting filenames to tuples
d = [(x,) for x in filenames]
print(d)
Read the first row from each file and store the results in a list:
# create data by reading the first row from each file
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]      # first row as a Row object
    temp = str(temp)
    s = d[i] + (temp,)            # (filename, first row as string) tuple
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)
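If reading each file through spark.read is still too heavy, a possible variation is to pull only the head of each file with dbutils.fs.head. This is just a sketch: it assumes the header plus the first data row fit within the first few kilobytes of each file, and max_bytes is an arbitrary cap introduced here for illustration.

```python
# Sketch: build (filename, firstrow) pairs without a Spark read per file.
max_bytes = 4096  # arbitrary cap; raise it if your rows are longer

rows = []
for f in dbutils.fs.ls("/FileStore/tables/"):
    head = dbutils.fs.head(f.path, max_bytes)        # reads only the first bytes of the file
    lines = head.splitlines()
    first_row = lines[1] if len(lines) > 1 else ""   # line 0 is the header
    rows.append((f.name, first_row))

df = spark.createDataFrame(rows, ["filename", "firstrow"])
display(df)
```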
I have a bunch of CSV files in a mounted blob container and I need to calculate the SHA1 hash value for every file to store as an inventory. I'm very new to Azure and PySpark, so I'm not sure how this can be achieved efficiently. I have written the following code in Python/pandas and I'm trying to use it in PySpark. It seems to work, however it takes quite a while to run as there are thousands of CSV files. I understand that things work differently in PySpark, so can someone please advise whether my approach is correct, or whether there is a better piece of code I can use to accomplish this task?
import os
import hashlib
import pandas as pd

class File:
    def __init__(self, path):
        self.path = path

    def get_hash(self):
        hash = hashlib.sha1()
        with open(self.path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash.update(chunk)
        self.sha1hash = hash.hexdigest()
        return self.sha1hash

path = '/dbfs/mnt/data/My_Folder'  # Path to CSV files
cnt = 0
rlist = []

for path, subdirs, files in os.walk(path):
    for fi in files:
        if cnt < 10:  # check only 10 files for now as it takes ages!
            f = File(os.path.join(path, fi))
            cnt += 1
            hash_value = f.get_hash()
            results = {'File_Name': fi, 'File_Path': f.path, 'SHA1_Hash_Value': hash_value}
            rlist.append(results)
            print(fi)

print(str(cnt) + ' files processed')
df = pd.DataFrame(rlist)
#df.to_csv('/dbfs/mnt/workspace/Inventory/File_Hashes.csv', mode='a', header=False)  # not sure how to write files in pyspark!
display(df)
Thanks
Since you want to treat the files as blobs rather than read them into a table, I would recommend spark.sparkContext.binaryFiles. That gives you an RDD of key-value pairs where the key is the file path and the value is the file's content as bytes, on which you can compute the hash in a map function (e.g. rdd.mapValues(calculate_hash_of_bytes)).
For more information, refer to the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.binaryFiles.html#pyspark.SparkContext.binaryFiles
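A minimal sketch of that approach (the mount path, output location, and the hash_bytes helper are placeholders, not something from the question; note that binaryFiles loads each file in full, so it works best when individual files fit in executor memory):

```python
import hashlib

def hash_bytes(content):
    # content is the whole file as bytes, as delivered by binaryFiles
    return hashlib.sha1(content).hexdigest()

# One record per file: (file path, file content as bytes)
files_rdd = spark.sparkContext.binaryFiles("dbfs:/mnt/data/My_Folder")

hashes = files_rdd.mapValues(hash_bytes)   # -> (file path, SHA1 hex digest)

# Turn the result into a DataFrame and write the inventory out
df = hashes.toDF(["File_Path", "SHA1_Hash_Value"])
df.write.mode("overwrite").csv("dbfs:/mnt/workspace/Inventory/File_Hashes")
```

The work is distributed across the cluster, so thousands of files should fare much better than the single-threaded pandas loop.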
I would like to read messages from Pub/Sub topic1 and write the cleaned JSON to topic2 and topic3 based on a condition.
Let's say I have a flag in the JSON that comes from topic1; I do some transformations, then check the flag value and write to topic2 or topic3 based on it.
I have tried the below, but I am not able to move further from here since I have no idea how to route the pipeline based on the condition.
My Beam pipeline code is below:
with beam.Pipeline(options=pipeline_options) as p:
    Ingest = (
        p
        | 'Read from Topic' >> beam.io.ReadFromPubSub(topic=known_args.topic).with_output_types(bytes)
        | 'Decode' >> beam.Map(decode_message)
        | 'Make One Json' >> beam.Map(make_one)
        | 'Split based on event' >> beam.Map(split)
        # when event_name == 'aa_afo_addtocart_clicked'
        # | 'write to topic2'
        # when event_name == 'aa_afo_merchantpage_visited'
        # | 'write to topic3'
    )
In the 4th step I am calling the split function, but please guide me on how to write the split output to multiple topics.
The split Python function does the following:
it gets a single input JSON -> checks the flag and splits the result in two -> one part should go to topic2 and the other to topic3.
def split(p):
    json_obj_list = json.loads(p)   # p is the decoded JSON string from the previous step
    jb = []
    for json_obj in json_obj_list:
        if json_obj['event_name'] == 'aa_afo_addtocart_clicked':
            filename = json_obj['event_name'] + '.json'
            with open(filename, 'a') as out_json_file:
                json_string = json.dumps(json_obj)
                print(json_string)
                #json.dump(json_obj, out_json_file)
        if json_obj['event_name'] == 'aa_afo_merchantpage_visited':
            filename = json_obj['event_name'] + '.json'
            with open(filename, 'a') as out_json_file:
                json_string = json.dumps(json_obj)
                print(json_string)
The solution here is to create several outputs, as described in the programming guide. That way you perform your split and you get 2 PCollections as output.
Then process the 2 PCollections independently: sink one into topic2 and the other into topic3.
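For example, a minimal sketch of that pattern with a multi-output ParDo (the tag names, the known_args.topic2 / known_args.topic3 options, and the encoding step are assumptions for illustration; decode_message and the event names come from the question):

```python
import json
import apache_beam as beam

class SplitByEvent(beam.DoFn):
    """Route each element to a tagged output based on its event_name flag."""
    def process(self, element):
        if element['event_name'] == 'aa_afo_addtocart_clicked':
            yield beam.pvalue.TaggedOutput('addtocart', element)
        elif element['event_name'] == 'aa_afo_merchantpage_visited':
            yield beam.pvalue.TaggedOutput('merchantpage', element)

with beam.Pipeline(options=pipeline_options) as p:
    split_result = (
        p
        | 'Read from Topic' >> beam.io.ReadFromPubSub(topic=known_args.topic)
        | 'Decode' >> beam.Map(decode_message)   # assumed to yield a dict per message
        | 'Split based on event' >> beam.ParDo(SplitByEvent()).with_outputs('addtocart', 'merchantpage')
    )

    _ = (split_result.addtocart
         | 'Encode for topic2' >> beam.Map(lambda d: json.dumps(d).encode('utf-8'))
         | 'Write to topic2' >> beam.io.WriteToPubSub(topic=known_args.topic2))   # hypothetical option

    _ = (split_result.merchantpage
         | 'Encode for topic3' >> beam.Map(lambda d: json.dumps(d).encode('utf-8'))
         | 'Write to topic3' >> beam.io.WriteToPubSub(topic=known_args.topic3))   # hypothetical option
```

The with_outputs call makes the split step return a tuple-like object whose addtocart and merchantpage members are the two PCollections, and each one is then sunk into its own topic.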
I'm currently using the TensorFlow Transform library to convert and save the transformation. It used to work just fine, but currently I'm facing an issue similar to the below.
I keep getting the same error:
'BeamDatasetMetadata' object has no attribute 'schema' [while running
'AnalyzeAndTransformDataset/TransformDataset/ConvertAndUnbatch']
Is someone familiar with the error above, and how can it be resolved?
My transform function looks like below:
# ### Transformation Function

def transform_data(train_data_file, test_data_file, working_dir):
    """Transform the data and write out as a TFRecord of Example protos.

    Read in the data using the CSV reader, and transform it using a
    preprocessing pipeline that scales numeric data and converts categorical data
    from strings to int64 value indices, by creating a vocabulary for each
    category.

    Args:
      train_data_file: File containing training data
      test_data_file: File containing test data
      working_dir: Directory to write transformed data and metadata to
    """

    def preprocessing_fn(inputs):
        """Preprocess input columns into transformed columns."""
        outputs = {}

        # Scale numeric columns to have range [0, 1].
        for key in NUMERIC_FEATURE_KEYS:
            outputs[key] = tft.scale_to_0_1(inputs[key])

        # For all categorical columns except the label column, we use
        # tft.string_to_int which computes the set of unique values and uses this
        # to convert the strings to indices.
        for key in CATEGORICAL_FEATURE_KEYS:
            tft.uniques(inputs[key], vocab_filename=key)

        # We would use the lookup table when the label is a string value.
        # In our case here creative_id = 0/1, so we can directly assign the output as is.
        outputs[LABEL_KEY] = inputs[LABEL_KEY]

        return outputs

    # The "with" block will create a pipeline, and run that pipeline at the exit
    # of the block.
    with beam.Pipeline() as pipeline:
        with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
            # Create a coder to read the data with the schema. To do this we
            # need to list all columns in order since the schema doesn't specify the
            # order of columns in the csv.
            ordered_columns = [
                'app_category', 'connection_type', 'creative_id', 'day_of_week',
                'device_size', 'geo', 'hour_of_day', 'num_of_connects',
                'num_of_conversions', 'opt_bid', 'os_version'
            ]
            converter = csv_coder.CsvCoder(ordered_columns, RAW_DATA_METADATA.schema)

            # Read in raw data and convert using the CSV converter. Note that we apply
            # some Beam transformations here, which will not be encoded in the TF
            # graph since we don't do them from within tf.Transform's methods
            # (AnalyzeDataset, TransformDataset etc.). These transformations are just
            # to get data into a format that the CSV converter can read, in particular
            # removing empty lines and removing spaces after commas.
            raw_data = (
                pipeline
                | 'ReadTrainData' >> textio.ReadFromText(train_data_file)
                | 'FilterTrainData' >> beam.Filter(
                    lambda line: line and line != 'app_category,connection_type,creative_id,day_of_week,device_size,geo,hour_of_day,num_of_connects,num_of_conversions,opt_bid,os_version')
                | 'FixCommasTrainData' >> beam.Map(
                    lambda line: line.replace(', ', ','))
                | 'DecodeTrainData' >> MapAndFilterErrors(converter.decode))

            # Combine data and schema into a dataset tuple. Note that we already used
            # the schema to read the CSV data, but we also need it to interpret
            # raw_data.
            raw_dataset = (raw_data, RAW_DATA_METADATA)
            transformed_dataset, transform_fn = (
                raw_dataset | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))
            transformed_data, transformed_metadata = transformed_dataset
            transformed_data_coder = example_proto_coder.ExampleProtoCoder(transformed_metadata.schema)

            _ = (
                transformed_data
                | 'EncodeTrainData' >> beam.Map(transformed_data_coder.encode)
                | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(
                    os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE)))

            # Now apply the transform function to the test data. In this case we also
            # remove the header line from the CSV file and the trailing period at the
            # end of each line.
            raw_test_data = (
                pipeline
                | 'ReadTestData' >> textio.ReadFromText(test_data_file, skip_header_lines=1)
                | 'FixCommasTestData' >> beam.Map(
                    lambda line: line.replace(', ', ','))
                | 'DecodeTestData' >> beam.Map(converter.decode))
            raw_test_dataset = (raw_test_data, RAW_DATA_METADATA)

            transformed_test_dataset = (
                (raw_test_dataset, transform_fn) | beam_impl.TransformDataset())
            # Don't need the transformed data schema, it's the same as before.
            transformed_test_data, _ = transformed_test_dataset

            _ = (
                transformed_test_data
                | 'EncodeTestData' >> beam.Map(transformed_data_coder.encode)
                | 'WriteTestData' >> tfrecordio.WriteToTFRecord(
                    os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE)))

            _ = (
                transform_fn
                | 'WriteTransformFn' >>
                transform_fn_io.WriteTransformFn(working_dir))
Output of:
pip show tensorflow-transform apache-beam
Name: tensorflow-transform
Version: 0.4.0
Summary: A library for data preprocessing with TensorFlow
Home-page: UNKNOWN
Author: Google Inc.
Author-email: tf-transform-feedback@google.com
License: Apache 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: six, apache-beam, protobuf
---
Name: apache-beam
Version: 2.4.0
Summary: Apache Beam SDK for Python
Home-page: https://beam.apache.org
Author: Apache Software Foundation
Author-email: dev@beam.apache.org
License: Apache License, Version 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: oauth2client, httplib2, mock, crcmod, grpcio, futures, pyvcf, avro, typing, pyyaml, dill, six, hdfs, protobuf
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The above issue doesn't seem to occur all the time, though! It looks like there is some conflict with other packages.
I can't see the rest of your code, but this line looks incomplete:
```
converter = csv_coder.CsvCoder(ordered_columns, RAW_DATA_METADATA.schema)
```
A possible way to do it:
```
INPUT_SCHEMA = dataset_schema.from_feature_spec({
    'label': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'id': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'date': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'random': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'name': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'tweet': tf.FixedLenFeature(shape=[], dtype=tf.string),
})
```
```
converter_input = coders.CsvCoder(
    ['label', 'id', 'date', 'random', 'name', 'tweet'],
    INPUT_SCHEMA,
    delimiter=delimiter)
```
Then, for the transform step, which is where your actual problem seems to be, here is an example as well.
```
TRANSFORM_INPUT_SCHEMA = dataset_schema.from_feature_spec({
    'id': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'label': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'tweet': tf.FixedLenFeature(shape=[], dtype=tf.string),
    'answer_to_nbr': tf.FixedLenFeature(shape=[], dtype=tf.float32),
    'nbr_of_tags': tf.FixedLenFeature(shape=[], dtype=tf.float32),
})

input_metadata = dataset_metadata.DatasetMetadata(schema=TRANSFORM_INPUT_SCHEMA)

train_dataset = (train_dataset, input_metadata)
transformed_dataset, transform_fn = (
    train_dataset
    | 'AnalyzeAndTransform' >> beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))
```
Hope it helps you :) If you push your code to a GitHub repo, I could look at the full code and see if I can help! Good luck!
Look at this repo for help https://github.com/Fematich/tftransform-demo
I would like to know if it's possible to use "^%GOF" without user interaction. I'm using Caché 2008. ^%GO isn't an option as it's too slow. I'm currently feeding it input from a temporary file to answer the questions automatically, but that can fail (it rarely happens).
I couldn't find the routine of this utility in %SYS. Where is it located?
Thanks,
Answer: use "%SYS.GlobalQuery:NameSpaceList" to get the list of globals (excluding system globals).
Set Rset = ##class(%ResultSet).%New("%SYS.GlobalQuery:NameSpaceList")
d Rset.Execute(namespace, "*", 0)
s globals=""
while (Rset.Next()) {
    s globalName=Rset.Data("Name")_".gbl"
    if (globals="") {
        s globals = globalName
    } else {
        s globals = globals_","_globalName
    }
}
d ##class(%Library.Global).Export(namespace, globals, "/tmp/export.gof", 7)
The only drawback is that if you have a namespace where the concatenation of global names exceeds the maximum string length allowed, the program crashes. You should then split the globals list.
I would recommend that you look at the %Library.Global() class with output format 7.
classmethod Export(Nsp As %String = $zu(5), ByRef GlobalList As %String, FileName As %String, OutputFormat As %Integer = 5, RecordFormat As %String = "V", qspec As %String = "d", Translation As %String = "") as %Status
Exports a list of globals GlobalList from a namespace Nsp to FileName using OutputFormat and RecordFormat.
OutputFormat can take the values below:
1 - DTM format
3 - VAXDSM format
4 - DSM11 format
5 - ISM/Cache format
6 - MSM format
7 - Cache Block format (%GOF)
RecordFormat can take the values below:
V - Variable Length Records
S - Stream Data
You can find it in the class documentation here: http://docs.intersystems.com/cache20082/csp/documatic/%25CSP.Documatic.cls
I've never used it myself, but it looks like it would do the trick.
Export your global to a file:
d $system.OBJ.Export("myGlobal.GBL","c:\global.xml")
Import the global from your file:
d $system.OBJ.Load("c:\global.xml")
Export items as an XML file
The extension of the items determine what
type they are, they can be one of:
CLS - classes
CSP - Cache Server Pages
CSR - Cache Rule files
MAC - Macro routines
INT - None macro routines
BAS - Basic routines
INC - Include files
GBL - Globals
PRJ - Studio Projects
OBJ - Object code
PKG - Package definition
If you wish to export multiple classes then separate them with commas,
pass them as an array of the form items("item")="", or use wildcards.
If filename is empty then it will export to the current device.
link to docbook
Edit: adding "-d" as the qspec value will suppress the terminal output of the export. If you want to use this programmatically, that output might otherwise get in the way.
And just for completeness' sake:
SAMPLES>s IO="c:\temp\test.gof"
SAMPLES>s IOT="RMS"
SAMPLES>s IOPAR="WNS"
SAMPLES>s globals("Sample.PersonD")=""
SAMPLES>d entry^%GOF(.globals)
SAMPLES>
-> results in c:\temp\test.gof containing the export. You can define up to 65435 globals in your array (named globals in this example).
But I would recommend you go with DAiMor's answer as this is the more 'modern' way.
To avoid the maximum string error, you should use subscripts instead of a comma-delimited string:
Set Rset = ##class(%ResultSet).%New("%SYS.GlobalQuery:NameSpaceList")
d Rset.Execute(namespace, "*", 0)
while (Rset.Next()) {
    s globals(Rset.Data("Name"))=""  // No need for _".gbl" in recent Cache
}
d ##class(%Library.Global).Export(namespace, .globals, "/tmp/export.gof", 7)  // Note the dot before globals