I have a large csv log file. Here is a simplified sample:
ts,a.b.c,a.b.d,a.b.e,b.c,b.d,c.d.e,c.d.f,c.g
2021-03-29 06:38:39,1.0000,2,3,28.20,1,2,3,4
2021-03-29 06:38:40,1.0000,2,3,28.20,1,2,3,0.000000
I am using MATLAB's Import Data tool to import it, but, unfortunately, it removes all dots from the header and imports all variables as, e.g.: abc, abd, abe etc.
What is an efficient way to import a csv like the one above as structs?
I am looking for a way to have the data imported as structs (a, b and c for this particular log file), so that I can easily access variables as a.b.c or c.d.f.
Here is what I came up with, by simply using readtable.
function res = log_import(logfile)
    log_table = readtable(logfile);
    res = [];
    for i = 1:width(log_table)
        str_path = log_table.Properties.VariableDescriptions{i};
        fields = strsplit(str_path, '.');
        res = setfield(res, fields{1:end}, log_table{:, i});
    end
end
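For comparison, the core of this approach (split each dotted header on '.' and descend into a nested container) can be sketched outside MATLAB as well. The following is a minimal Python sketch under stated assumptions, not a replacement for the MATLAB function: the name `nest_columns` and the in-memory sample are made up for illustration, and all values are kept as strings.

```python
import csv
import io


def nest_columns(csv_text):
    """Group CSV columns into nested dicts keyed by the dotted header parts."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = [[] for _ in header]
    for row in reader:
        for column, value in zip(columns, row):
            column.append(value)

    result = {}
    for name, column in zip(header, columns):
        parts = name.split(".")
        node = result
        # Walk (and create) the intermediate levels, e.g. a -> b for "a.b.c".
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = column
    return result


# Hypothetical sample, shortened from the log excerpt in the question.
sample = (
    "ts,a.b.c,a.b.d,c.g\n"
    "2021-03-29 06:38:39,1.0000,2,4\n"
    "2021-03-29 06:38:40,1.0000,2,0.000000\n"
)
log = nest_columns(sample)
print(log["a"]["b"]["c"])  # column a.b.c as a list of strings
print(log["c"]["g"])
```

A header without dots, such as ts, simply ends up as a top-level key.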
I need to convert String data from a HDF5 File to Float format to use in a Skyplot (Astropy) with l b coordinates. The data is present here:
https://wwwmpa.mpa-garching.mpg.de/~ensslin/research/data/faraday2020.html
(Faraday Sky 2020)
The code I have programmed until now is:
from astropy import units as u
from astropy.coordinates import SkyCoord
import matplotlib.pyplot as plt
import numpy as np
import h5py
dat = []
ggl = []
ggb = []
f1 = h5py.File('/home/nikita/faraday_2020/faraday2020.hdf5','r')
data = f1.get('faraday_sky_mean')
faraday_sky_mean = np.array(data)
data1 = f1.get('faraday_sky_std')
faraday_sky_std = np.array(data1)
n1 = 0
for line in f1:
    s = line.split()
    dat.append(s)
    n1 = n1 + 1
#
for i in range(0,n1):
    ggl.append(float(dat[i][0])) # galactic coordinates input
    ggb.append(float(dat[i][1]))
f1.close()
However I am getting the error:
ggl.append(float(dat[i][0])) # galactic coordinates input
ValueError: could not convert string to float: 'faraday_sky_mean'
Please help with this. Thanks.
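What you asked and what (I think) you need are two different things.
This line is NOT the way to read an HDF5 file: for line in f1:
Iterating over an h5py File object yields the names of its top-level objects, not lines of data, which is why float() was handed the string 'faraday_sky_mean'. You need to use an HDF5 API to read the file (h5py is one of many).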
I think you want to read datasets faraday_sky_mean and faraday_sky_std and load arrays into lists ggl and ggb. To do that, use this code. It will create 2 lists with 3145728 float64 values in each.
import h5py

with h5py.File('faraday2020.hdf5','r') as hdf:
    print(list(hdf.keys()))
    faraday_sky_mean = hdf['faraday_sky_mean'][:]
    faraday_sky_std = hdf['faraday_sky_std'][:]
    print(faraday_sky_mean.shape, faraday_sky_mean.dtype)
    print(f'Max Mean={max(faraday_sky_mean)}, Min Mean={min(faraday_sky_mean)}')
    print(faraday_sky_std.shape, faraday_sky_std.dtype)
    print(f'Max StdDev={max(faraday_sky_std)}, Min StdDev={min(faraday_sky_std)}')
    ggl = faraday_sky_mean.tolist()
    print(len(ggl),type(ggl[0]))
    ggb = faraday_sky_std.tolist()
    print(len(ggb),type(ggb[0]))
The procedure above saves the data as both NumPy arrays and Python lists. If you only need the lists (don't need the arrays), you can shorten the code as shown below:
with h5py.File('faraday2020.hdf5','r') as hdf:
    ggl = hdf['faraday_sky_mean'][:].tolist()
    print(len(ggl),type(ggl[0]))
    ggb = hdf['faraday_sky_std'][:].tolist()
    print(len(ggb),type(ggb[0]))
I want to union multiple datasets in Palantir Foundry. The names of the datasets are dynamic, so I cannot list them statically in transform_df(). Is there a way to pass multiple inputs into transform_df dynamically and union all of those dataframes?
I tried looping over the datasets like:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
    @transforms_df(
        my_input = Input(p),
        Output(p+"_output")
    )
    def my_compute_function(my_input):
        return my_input
    if union_df is None:
        union_df = my_compute_function
    else:
        union_df = union_df.union(my_compute_function)
But, this doesn't generate the unioned output.
This should work for you with some changes. Below is a generic example that generates transforms dynamically for a set of JSON files uploaded to a dataset in the platform. Your situation may differ a little, but the pattern should adapt to other input file types, or to regular Foundry datasets that you specify. Once the inputs are generated this way, doing a union afterwards should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {}  # enter your dynamic mappings here
    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext
            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)
            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            # drop_duplicates returns a new DataFrame, so keep the result
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)
        transforms.append(update_set)
    return transforms


TRANSFORMS = transform_generator()
So this question breaks down into two questions.
How to handle transforms with programmatic input paths
To handle transforms with programmatic inputs, it is important to remember two things:
1st - Transforms determine their inputs and outputs at CI time. This means that you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded into the Python code that generates the transform.
2nd - Your transforms are created once, during the CI execution. This means that you can't have an increment or special logic to generate different paths whenever the dataset builds.
With these two premises, as in your example or @jeremy-david-gamet's (ty for the reply, gave you a +1), you can have Python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']
for path in dataset_paths:
    @transform_df(
        Output(f"{path}_output"),
        my_input=Input(path),
    )
    def my_compute_function(my_input):
        return my_input
However to union them you'll need a second transform to execute the union, you'll need to pass multiple inputs, so you can use *args or **kwargs for this:
from pyspark.sql import DataFrame
from transforms.api import transform_df, Input, Output

dataset_paths = ['dataset1_path', 'dataset2_path']
# transform_df takes the Output first and the Inputs as keyword arguments
all_inputs = {f"input_{i}": Input(path) for i, path in enumerate(dataset_paths)}

@transform_df(Output("path/to/unioned_dataset"), **all_inputs)
def my_compute_function(**kwargs):
    input_dfs = []
    for arg in kwargs.values():
        # check the type in case anything other than a DataFrame is passed in
        if isinstance(arg, DataFrame):
            input_dfs.append(arg)
    # now that you have your dfs in a list you can union them
    # Note I didn't test this code, but it should be something like this
    ...
How to union datasets with different schemas.
For this part there are plenty of Q&A out there on how to union different dataframes in spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
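To make the column alignment in customUnion easier to follow, here is a plain-Python sketch of the same idea that runs without Spark. It is only an illustration under stated assumptions: rows are modeled as dicts, None stands in for Spark's null, and the name `custom_union` and the sample rows are hypothetical.

```python
def custom_union(rows1, cols1, rows2, cols2):
    """Union two row lists over the sorted set of all column names."""
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def align(row):
        # Columns absent from a row are filled with None (Spark's null).
        return {col: row.get(col) for col in total_cols}

    return total_cols, [align(r) for r in rows1] + [align(r) for r in rows2]


# Hypothetical inputs with partly different schemas.
left = [{"id": 1, "temp": 20.5}]
right = [{"id": 2, "prod": "x"}]
cols, unioned = custom_union(left, ["id", "temp"], right, ["id", "prod"])
print(cols)
print(unioned)
```

Both sides are projected onto the same ordered column list before being concatenated, which is what lets a positional union line the columns up.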
Since inputs and outputs are determined at CI time, we cannot form true dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of datasets share the same root, the following seems to require minimum maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce

datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}

@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)
I have an excel file that I grab by:
ds = dataset('XLSFile',fullfile('file path here', 'waterReal.xlsx'))
It looks like this:
I want each column in its own numeric array though! Like how when I load an example dataset: load carsmall, I get a bunch of individual numeric arrays. But I can't figure out how to do that.
I can do this individually by writing:
A = ds.TEMP, B = ds.PROD, ...
But what if I had a big Excel file with many columns? What then?
You can convert a dataset to a struct or a cell like this:
To struct:
s = dataset2struct(ds, 'AsScalar',true)
To cell:
fnames = fieldnames(ds);
c = cell(1, numel(fnames));
for i = 1:numel(fnames)
    c{i} = ds.(fnames{i});
end
By the way: use tables instead of datasets. They're newer and better. Use the readtable function to read your Excel file into a table. And tables are nicer enough that you might not want to bother converting them into a simpler cell array, because you can just grab the columns out with t{:,i} where t is your table and i is the index of the column you want.
I want to save the financial data I get from the following code:
data = FinancialData["GE","OHLCV", "Jan. 1, 2000"];
The format is:
{{yy, mm, dd}, {O, H, L, C, V}}
I want 2 columns, one for the {date}, other for the {O, H, L, C, V} but inside the second column I want to treat each individual value (like a list?)
I have tried:
Export[dir <> filename <> ".csv", data];
data1 = Import[dir <> filename <> ".csv", "Table"];
And also with other formats, "List", etc.
The problem is that I have a running program to backtest the data, and it works fine when I get the data from FinancialData, but I just can't find a way to export and import it so that it behaves as if it came straight from FinancialData.
For example, I can't do things like:
C = Table[data1[[i]][[2]][[4]], {i, 1, n}];
(Of course everything works if I put data, instead of data1)
Any ideas?
Why do you export to csv when you want to keep the Mathematica list structure intact? Please try the following
data = FinancialData["GE", "OHLCV", "Jan. 1, 2000"];
Export["tmp/test.m", data]
data2 = Import["tmp/test.m"];
and you will see that
data2 == data
gives True
It's not clear what you are trying to accomplish. Presumably some external program needs to read the data?
If so, you might do
Export["file.csv", Flatten[data,{2}]]
then partition it when you read it back:
{#[[;;3]],#[[4;;]]}&/@Import["file.csv"]
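The flatten-then-partition round trip is not specific to Mathematica. As a rough illustration, here is a minimal Python sketch of the same idea; the two sample records are made up, and an in-memory buffer stands in for file.csv.

```python
import csv
import io

# Hypothetical rows in the {{yy, mm, dd}, {O, H, L, C, V}} layout.
data = [
    [[2000, 1, 3], [50.0, 51.0, 49.5, 50.5, 1000]],
    [[2000, 1, 4], [50.5, 50.8, 48.9, 49.0, 1200]],
]

# Export: flatten each record to eight plain CSV fields.
buffer = io.StringIO()
writer = csv.writer(buffer)
for date, ohlcv in data:
    writer.writerow(date + ohlcv)

# Import: CSV yields strings, so convert, then split after the third field.
buffer.seek(0)
restored = []
for row in csv.reader(buffer):
    date = [int(x) for x in row[:3]]
    ohlcv = [float(x) for x in row[3:7]] + [int(row[7])]
    restored.append([date, ohlcv])

# The closing prices can be indexed exactly as in the nested original.
closes = [record[1][3] for record in restored]
print(restored == data)
print(closes)
```

The point is that CSV only stores flat rows, so the nesting has to be rebuilt on import, whereas a native format keeps it intact.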
I have tried to import several csv files into one set of variables. However, each newly imported file overwrites the previous ones.
Only the last processed file ends up imported. Something may be wrong with the loop, but I don't know what to change.
This is what I have:
p = dir('C:\foldername\*.csv');
for i = 1:length(p)
    [num, text, all] = xlsread(['C:\foldername\', p(i).name]);
end
You are overwriting the variables on every iteration of the loop.
Try collecting everything in cell arrays:
num = {};
text = {};
all = {};
p = dir('C:\foldername\*.csv');
for i = 1:length(p)
    [num{end+1}, text{end+1}, all{end+1}] = xlsread(['C:\foldername\', p(i).name]);
end
You cannot read all the things into the same variables, but you can put them in different dimensions.
p = dir('C:\foldername\*.csv');
num = cell(size(p));
text = cell(size(p));
all = cell(size(p));
for i = 1:length(p)
    [num{i}, text{i}, all{i}] = xlsread(['C:\foldername\', p(i).name]);
end
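The same "one slot per file" pattern, sketched in Python for comparison. This is only an illustration under stated assumptions: the helper name `read_all_csv` is hypothetical, and two throwaway files in a temporary folder stand in for C:\foldername.

```python
import csv
import tempfile
from pathlib import Path


def read_all_csv(folder):
    """Return {file name: list of rows} for every .csv file in folder."""
    tables = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            # Each file gets its own entry, so nothing is overwritten.
            tables[path.name] = list(csv.reader(handle))
    return tables


# Demonstration with two small throwaway files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.csv").write_text("x,y\n1,2\n")
    Path(folder, "b.csv").write_text("x,y\n3,4\n")
    tables = read_all_csv(folder)

print(sorted(tables))
print(tables["b.csv"])
```

Keying the results by file name plays the role of the cell index i in the MATLAB loop.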