Inline parsing of IObservable<byte> - system.reactive

I have an observable query that produces an IObservable<byte> from a stream that I want to parse inline. I want to be able to use different strategies depending on the data source to parse discrete messages from this sequence. Bear in mind I am still on the upward learning curve of RX. I have come up with a solution, but am unsure if there is a way to accomplish this using out-of-the-box operators.
First, I wrote the following extension method to IObservable:
public static IObservable<IList<T>> Parse<T>(
this IObservable<T> source,
Func<IObservable<T>, IObservable<IList<T>>> parsingFunction)
{
return parsingFunction(source);
}
This allows me to specify the message framing strategy in use by a particular data source. One data source might be delimited by one or more bytes while another might be delimited by both start and stop block patterns while another might use a length prefixing strategy. So here is an example of the Delimited strategy that I have defined:
public static class MessageParsingFunctions
{
public static Func<IObservable<T>, IObservable<IList<T>>> Delimited<T>(T[] delimiter)
{
if (delimiter == null) throw new ArgumentNullException("delimiter");
if (delimiter.Length < 1) throw new ArgumentException("delimiter must contain at least one element.");
Func<IObservable<T>, IObservable<IList<T>>> parser =
(source) =>
{
var shared = source.Publish().RefCount();
var windowOpen = shared.Buffer(delimiter.Length, 1)
.Where(buffer => buffer.SequenceEqual(delimiter))
.Publish()
.RefCount();
return shared.Buffer(windowOpen)
.Select(bytes =>
bytes
.Take(bytes.Count - delimiter.Length)
.ToList());
};
return parser;
}
}
So ultimately, as an example, I can use the code in the following fashion to parse discrete messages from the sequence any time the byte pattern for the string '<EOF>' is encountered in the sequence:
var messages = ...operators that surface an IObservable<byte>
.Parse(MessageParsingFunctions.Delimited(Encoding.ASCII.GetBytes("<EOF>")))
...further operators to package discrete messages along with additional metadata
Questions:
Is there a more straight-forward way to accomplish this using just out of the box operators?
If not, would it be preferable to just define the different parsing functions (i.e. ParseDelimited, ParseLengthPrefixed, etc.) as local extensions instead of having a more generic Parse extension method that accepts a parsing function?
Thanks in advance!

Take a look at Rxx Parsers. Here's a related lab. For example:
IObservable<byte> bytes = ...;
var parsed = bytes.ParseBinary(parser =>
from next in parser
let magicNumber = parser.String(Encoding.UTF8, 3).Where(value => value == "RXX")
let header = from headerLength in parser.Int32
from header in next.Exactly(headerLength)
from headerAsString in header.Aggregate(string.Empty, (s, b) => s + " " + b)
select headerAsString
let message = parser.String(Encoding.UTF8)
let entry = from length in parser.Int32
from data in next.Exactly(length)
from value in data.Aggregate(string.Empty, (s, b) => s + " " + b)
select value
let entries = from count in parser.Int32
from entries in entry.Exactly(count).ToList()
select entries
select from _ in magicNumber.Required("The file's magic number is invalid.")
from h in header.Required("The file's header is invalid.")
from m in message.Required("The file's message is invalid.")
from e in entries.Required("The file's data is invalid.")
select new
{
Header = h,
Message = m,
Entries = e.Aggregate(string.Empty, (acc, cur) => acc + cur + Environment.NewLine)
});

Related

Use method inside LINQ GroupBy

I'm trying to manipulate properties in a GroupBy clause to be used in a dictionary:
var lifeStages = await _dbContext.Customers
.GroupBy(x => GetLifeStage(x.DoB))
.Select(x => new { LifeStage = x.Key, Count = x.Count() })
.ToDictionaryAsync(x => x.LifeStage, x => x.Count);
I'm expecting results like
adolescent: 10,
adult: 15,
senior: 12 etc
But getting error:
Either rewrite the query in a form that can be translated,
or switch to client evaluation explicitly
by inserting a call to either AsEnumerable(), AsAsyncEnumerable(), ToList(), or ToListAsync().
Offcourse I can't combine ToDictionary() with any of the mentioned calls, and splitting up the query did not resolve the issues or taught my anything)
I've tried with making GetLifeStage() static and async, no difference there as well. The method gets called, performs what it needs to do, and still GroupBy can't be translated
If I leave out the Select() part and work with the Key of the GroupBy, same error:
"...could not be translated."
I saw an error too that said I couldn't combine a GroupBy() with a ToDictionary() during try-outs, but doesn't seem to pop up atm.
As I'm running out of ideas, all suggestions are welcome!
update:
private LifeStage GetLifeStage(DateTimeOffset doB)
{
var ageInMonths = Math.Abs(12 * (doB.Year - DateTimeOffset.UtcNow.Year) + doB.Month - DateTimeOffset.UtcNow.Month);
switch (ageInMonths)
{
case < 216:
return LifeStage.Adolescent;
case < 780:
return LifeStage.Adult;
case >= 780:
return LifeStage.Senior;
}
}
The problem is the usage of the custom GetLifeStage method inside the GroupBy expression. Custom methods cannot be translated to SQL because the query translator code has no way to know what is inside that method. And it cannot be called because there are no objects at all during the translation process.
In order to make it translatable, you have to replace the custom method call with its body, converted to translatable expression - basically something which can be used as expression bodied method. You can't use variables and switch, but you can use conditional operators. Instead of variable, you could use intermediate projection (Select).
Here is the equivalent translatable query:
var lifeStages = await _dbContext.Customers
.Select(c => new { Customer = c, AgeInMonths = Math.Abs(12 * (c.DoB.Year - DateTimeOffset.UtcNow.Year) + c.DoB.Month - DateTimeOffset.UtcNow.Month) })
.GroupBy(x => x.AgeInMonths < 216 ? LifeStage.Adolescent : x.AgeInMonths < 780 ? LifeStage.Adult : LifeStage.Senior)
// the rest is the same as original
.Select(x => new { LifeStage = x.Key, Count = x.Count() })
.ToDictionaryAsync(x => x.LifeStage, x => x.Count);

Confusion over behavior of Publish().Refcount()

I've got a simple program here that displays the number of letters in various words. It works as expected.
static void Main(string[] args) {
var word = new Subject<string>();
var wordPub = word.Publish().RefCount();
var length = word.Select(i => i.Length);
var report =
wordPub
.GroupJoin(length,
s => wordPub,
s => Observable.Empty<int>(),
(w, a) => new { Word = w, Lengths = a })
.SelectMany(i => i.Lengths.Select(j => new { Word = i.Word, Length = j }));
report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));
word.OnNext("Apple");
word.OnNext("Banana");
word.OnNext("Cat");
word.OnNext("Donkey");
word.OnNext("Elephant");
word.OnNext("Zebra");
Console.ReadLine();
}
And the output is:
Apple 5
Banana 6
Cat 3
Donkey 6
Elephant 8
Zebra 5
I used the Publish().RefCount() because "wordpub" is included in "report" twice. Without it, when a word is emitted first one part of the report would get notified by a callback, and then the other part of report would be notified, double the notifications. That is kindof what happens; the output ends up having 11 items rather than 6. At least that is what I think is going on. I think of using Publish().RefCount() in this situation as simultaneously updating both parts of the report.
However if I change the length function to ALSO use the published source like this:
var length = wordPub.Select(i => i.Length);
Then the output is this:
Apple 5
Apple 6
Banana 6
Cat 3
Banana 3
Cat 6
Donkey 6
Elephant 8
Donkey 8
Elephant 5
Zebra 5
Why can't the length function also use the same published source?
This was a great challenge to solve!
So subtle the conditions that this happens.
Apologies in advance for the long explanation, but bear with me!
TL;DR
Subscriptions to the published source are processed in order, but before any other subscription directly to the unpublished source. i.e. you can jump the queue!
With GroupJoin subscription order is important to determine when windows open and close.
My first concern would be that you are publish refcounting a subject.
This should be a no-op.
Subject<T> has no subscription cost.
So when you remove the Publish().RefCount() :
var word = new Subject<string>();
var wordPub = word;//.Publish().RefCount();
var length = word.Select(i => i.Length);
then you get the same issue.
So then I look to the GroupJoin (because my intuition suggests that Publish().Refcount() is a red herring).
For me, eyeballing this alone was too hard to rationalise, so I lean on a simple debugging too I have used dozens of times of the years - a Trace or Log extension method.
public interface ILogger
{
void Log(string input);
}
public class DumpLogger : ILogger
{
public void Log(string input)
{
//LinqPad `Dump()` extension method.
// Could use Console.Write instead.
input.Dump();
}
}
public static class ObservableLoggingExtensions
{
private static int _index = 0;
public static IObservable<T> Log<T>(this IObservable<T> source, ILogger logger, string name)
{
return Observable.Create<T>(o =>
{
var index = Interlocked.Increment(ref _index);
var label = $"{index:0000}{name}";
logger.Log($"{label}.Subscribe()");
var disposed = Disposable.Create(() => logger.Log($"{label}.Dispose()"));
var subscription = source
.Do(
x => logger.Log($"{label}.OnNext({x.ToString()})"),
ex => logger.Log($"{label}.OnError({ex})"),
() => logger.Log($"{label}.OnCompleted()")
)
.Subscribe(o);
return new CompositeDisposable(subscription, disposed);
});
}
}
When I add the logging to your provided code it looks like this:
var logger = new DumpLogger();
var word = new Subject<string>();
var wordPub = word.Publish().RefCount();
var length = word.Select(i => i.Length);
var report =
wordPub.Log(logger, "lhs")
.GroupJoin(word.Select(i => i.Length).Log(logger, "rhs"),
s => wordPub.Log(logger, "lhsDuration"),
s => Observable.Empty<int>().Log(logger, "rhsDuration"),
(w, a) => new { Word = w, Lengths = a })
.SelectMany(i => i.Lengths.Select(j => new { Word = i.Word, Length = j }));
report.Subscribe(i => ($"{i.Word} {i.Length}").Dump("OnNext"));
word.OnNext("Apple");
word.OnNext("Banana");
word.OnNext("Cat");
word.OnNext("Donkey");
word.OnNext("Elephant");
word.OnNext("Zebra");
This will then output in my log something like the following
Log with Publish().RefCount() used
0001lhs.Subscribe()
0002rhs.Subscribe()
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe()
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe()
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Banana 6
...
However when I remove the usage Publish().RefCount() the new log output is as follows:
Log without only Subject
0001lhs.Subscribe()
0002rhs.Subscribe()
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe()
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe()
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Apple 6
OnNext
Banana 6
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
...
This gives us some insight, however when the issue really becomes clear is when we start annotating our logs with a logical list of subscriptions.
In the original (working) code with the RefCount our annotations might look like this
//word.Subsribers.Add(wordPub)
0001lhs.Subscribe() //wordPub.Subsribers.Add(0001lhs)
0002rhs.Subscribe() //word.Subsribers.Add(0002rhs)
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe() //wordPub.Subsribers.Add(0003lhsDuration)
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe() //wordPub.Subsribers.Add(0005lhsDuration)
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose() //wordPub.Subsribers.Remove(0003lhsDuration)
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Banana 6
So in this example, when word.OnNext("Banana"); is executed the chain of observers is linked in this order
wordPub
0002rhs
However, wordPub has child subscriptions!
So the real subscription list looks like
wordPub
0001lhs
0003lhsDuration
0005lhsDuration
0002rhs
If we annotate the Subject only log we see where the subtlety lies
0001lhs.Subscribe() //word.Subsribers.Add(0001lhs)
0002rhs.Subscribe() //word.Subsribers.Add(0002rhs)
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe() //word.Subsribers.Add(0003lhsDuration)
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe() //word.Subsribers.Add(0005lhsDuration)
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Apple 6
OnNext
Banana 6
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
So in this example, when word.OnNext("Banana"); is executed the chain of observers is linked in this order
1. 0001lhs
2. 0002rhs
3. 0003lhsDuration
4. 0005lhsDuration
As the 0003lhsDuration subscription is activated after the 0002rhs, it wont see the "Banana" value to terminate the window, until after the rhs has been sent the value, thus yielding it in the still open window.
Whew
As #francezu13k50 points out the obvious and simple solution to your problem is to just use word.Select(x => new { Word = x, Length = x.Length });, but as I think you have given us a simplified version of your real problem (appreciated) I understand why this isn't suitable.
However, as I dont know what your real problem space is I am not sure what to suggest to you to provide a solution, except that you have one with your current code, and now you should know why it works the way it does.
RefCount returns an Observable that stays connected to the source as long as there is at least one subscription to the returned Observable. When the last subscription is disposed, RefCount disposes it's connection to the source, and reconnects when a new subscription is being made. It might be the case with your report query that all subscriptions to the 'wordPub' are disposed before the query is fulfilled.
Instead of the complicated GroupJoin query you could simply do :
var report = word.Select(x => new { Word = x, Length = x.Length });
Edit:
Change your report query to this if you want to use the GroupJoin operator :
var report =
wordPub
.GroupJoin(length,
s => wordPub,
s => Observable.Empty<int>(),
(w, a) => new { Word = w, Lengths = a })
.SelectMany(i => i.Lengths.FirstAsync().Select(j => new { Word = i.Word, Length = j }));
Because GroupJoin seems to be very tricky to work with, here is another approach for correlating the inputs and outputs of functions.
static void Main(string[] args) {
var word = new Subject<string>();
var length = new Subject<int>();
var report =
word
.CombineLatest(length, (w, l) => new { Word = w, Length = l })
.Scan((a, b) => new { Word = b.Word, Length = a.Word == b.Word ? b.Length : -1 })
.Where(i => i.Length != -1);
report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));
word.OnNext("Apple"); length.OnNext(5);
word.OnNext("Banana");
word.OnNext("Cat"); length.OnNext(3);
word.OnNext("Donkey");
word.OnNext("Elephant"); length.OnNext(8);
word.OnNext("Zebra"); length.OnNext(5);
Console.ReadLine();
}
This approach works if every input has 0 or more outputs subject to the constraints that (1) outputs only arrive in the same order as the inputs AND (2) each output corresponds to its most recent input. This is like a LeftJoin - each item in the first list (word) is paired with items in the right list (length) that subsequently arrive, up until another item in the first list is emitted.
Trying to use regular Join instead of GroupJoin. I thought the problem was that when a new word was created there was a race condition inside Join between creating a new window and ending the current one. So here I tried to elimate that by pairing every word with a null signifying the end of the window. Doesn't work, just like the first version did not. How is it possible that a new window is created for each word without the previous one being closed first? Completely confused.
static void Main(string[] args) {
var lgr = new DelegateLogger(Console.WriteLine);
var word = new Subject<string>();
var wordDelimited =
word
.Select(i => Observable.Return<string>(null).StartWith(i))
.SelectMany(i => i);
var wordStart = wordDelimited.Where(i => i != null);
var wordEnd = wordDelimited.Where(i => i == null);
var report = Observable
.Join(
wordStart.Log(lgr, "word"), // starts window
wordStart.Select(i => i.Length),
s => wordEnd.Log(lgr, "expireWord"), // ends current window
s => Observable.Empty<int>(),
(l, r) => new { Word = l, Length = r });
report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));
word.OnNext("Apple");
word.OnNext("Banana");
word.OnNext("Cat");
word.OnNext("Zebra");
word.OnNext("Elephant");
word.OnNext("Bear");
Console.ReadLine();
}

Merging multiple streams, keeping ordering and avoiding duplicates

I have a problem that I do not know how to handle beautifully with RX.
I have multiple streams that all supposedly contain the same elements
However​ each stream may lose messages (UDP is involved) or be late/early compared to others. Each of these messages have a sequence number.
Now what I want to achieve is get a single stream out of all those streams, ​without duplicate and keeping the message order​. In other words, the same sequence number should not appear twice and their values only have to increase, never decrease.
When a message was lost on all the streams, I'm OK with losing it (as there is another TCP mechanism involved that allows me to ask explicitly for missing messages).
I am looking to do that in RxJava, but I guess my problem is not specific to Java.
Here's a marble diagram to help visualizing what I want to achieve:
marble diagram
You can see in that diagram that we are waiting for 2 on the first stream to output 3 from the second stream.
Likewise, 6 is only outputted once we receive 6 from the second stream because only at that point can we know for sure that 5 will never be received by any stream.
This is browser code, but I think it should give you a good idea of how you could solve this.
public static IObservable<T> Sequenced<T>(
this IObservable<T> source,
Func<T, int> getSequenceNumber,
int sequenceBegin,
int sequenceRedundancy)
{
return Observable.Create(observer =>
{
// The next sequence number in order.
var sequenceNext = sequenceBegin;
// The key is the sequence number.
// The value is (T, Count).
var counts = new SortedDictionary<int, Tuple<T, int>>();
return source.Subscribe(
value =>
{
var sequenceNumber = getSequenceNumber(value);
// If the sequence number for the current value is
// earlier in the sequence, just throw away this value.
if (sequenceNumber < sequenceNext)
{
return;
}
// Update counts based on the current value.
Tuple<T, int> count;
if (!counts.TryGetValue(sequenceNumber, out count))
{
count = Tuple.Create(value, 0);
}
count = Tuple.Create(count.Item1, count.Item2 + 1);
counts[sequenceNumber] = count;
// If the current count has reached sequenceRedundancy,
// that means any seqeunce values S such that
// sequenceNext < S < sequenceNumber and S has not been
// seen yet will never be seen. So we emit everything
// we have seen up to this point, in order.
if (count.Item2 >= sequenceRedundancy)
{
var removal = counts.Keys
.TakeWhile(seq => seq <= sequenceNumber)
.ToList();
foreach (var seq in removal)
{
count = counts[seq];
observer.OnNext(count.Item1);
counts.Remove(seq);
}
sequenceNext++;
}
// Emit stored values as long as we keep having the
// next sequence value.
while (counts.TryGetValue(sequenceNext, out count))
{
observer.OnNext(count.Item1);
counts.Remove(sequenceNext);
sequenceNext++;
}
},
observer.OnError,
() =>
{
// Emit in order any remaining values.
foreach (var count in counts.Values)
{
observer.OnNext(count.Item1);
}
observer.OnCompleted();
});
});
}
If you have two streams IObservable<Message> A and IObservable<Message> B, you would use this by doing Observable.Merge(A, B).Sequenced(msg => msg.SequenceNumber, 1, 2).
For your example marble diagram, this would look like the following, where the source column shows the values emitted by Observable.Merge(A, B) and the counts column shows the contents of the SortedDictionary after each step of the algorithm. I am assuming that the "messages" of the original source sequence (without any lost values) is (A,1), (B,2), (C,3), (D,4), (E,5), (F,6) where the second component of each message is its sequence number.
source | counts
-------|-----------
(A,1) | --> emit A
(A,1) | --> skip
(C,3) | (3,(C,1))
(B,2) | (3,(C,1)) --> emit B,C and remove C
(D,4) | --> emit D
(F,6) | (6,(F,1))
(F,6) | (6,(F,2)) --> emit F and remove
A similar question came up a while ago and I have a custom merge operator that when given ordered streams, it merges them in order but doesn't do deduplication.
Edit:
If you can "afford" it, you can use this custom merge and then distinctUntilChanged(Func1) to filter out subsequent messages with the same sequence number.
Observable<Message> messages = SortedMerge.create(
Arrays.asList(src1, src2, src3), (a, b) -> Long.compare(a.id, b.id))
.distinctUntilChanged(v -> v.id);

Apache-Spark: method in foreach doesn't work

I read file from HDFS, which contains x1,x2,y1,y2 representing a envelope in JTS.
I would like to use those data to build STRtree in foreach.
val inputData = sc.textFile(inputDataPath).cache()
val strtree = new STRtree
inputData.foreach(line => {val array = line.split(",").map(_.toDouble);val e = new Envelope(array(0),array(1),array(2),array(3)) ;
println("envelope is " + e);
strtree.insert(e,
new Rectangle(array(0),array(1),array(2),array(3)))})
As you can see, I also print the e object.
To my surprise, when I log the size of strtree, it is zero! It seems that insert method make no senses here.
By the way, if I write hard code some test data line by line, the strtree can be built well.
One more thing, those project is packed into jar and submitted in the spark-shell.
So, why does the method in foreach not work ?
You will have to collect() to do this:
inputData.collect().foreach(line => {
... // your code
})
You can do this (for avoiding collecting all data):
val pairs = inputData.map(line => {
val array = line.split(",").map(_.toDouble);
val e = new Envelope(array(0),array(1),array(2),array(3)) ;
println("envelope is " + e);
(e, new Rectangle(array(0),array(1),array(2),array(3)))
}
pairs.collect().foreach(pair => {
strtree.insert(pair._1, pair._2)
}
Use .map() instead of .foreach() and reassign the outcome.
Foreach does not return the outcome of applyied function. It can be used for sending data somewhere, storing to db, printing, and so on.

Support for basic datatypes in H5Attributes?

I am trying out the beta hdf5 toolkit of ilnumerics.
Currently I see H5Attributes support only ilnumerics arrays. Is there any plan to extend it for basic datatypes (such as string) as part of the final release?
Does ilnumerics H5 wrappers provide provision for extending any functionality to a particular
datatype?
ILNumerics internally uses the official HDF5 libraries from the HDF Group, of course. H5Attributes in HDF5 correspond to datasets with the limitation of being not capable of partial I/O. Besides that, H5Attributes are plain arrays! Support for basic (scalar) element types is given by assuming the array stored to be scalar.
Strings are a complete different story: strings in general are variable length datatypes. In terms of HDF5 strings are arrays of element type Char. The number of characters in the string determines the length of the array. In order to store a string into a dataset or attribute, you will have to store its individual characters as elements of the array. In ILNumerics, you can convert your string into ILArrray or ILArray (for ASCII data) and store that into the dataset/ attribute.
Please consult the following test case which stores a string as value into an attribute and reads the content back into a string.
Disclaimer: This is part of our internal test suite. You will not be able to compile the example directly, since it depends on the existence of several functions which may are not available. However, you will be able to understand how to store strings into datasets and attributes:
public void StringASCIAttribute() {
string file = "deleteA0001.h5";
string val = "This is a long string to be stored into an attribute.\r\n";
// transfer string into ILArray<Char>
ILArray<Char> A = ILMath.array<Char>(' ', 1, val.Length);
for (int i = 0; i < val.Length; i++) {
A.SetValue(val[i], 0, i);
}
// store the string as attribute of a group
using (var f = new H5File(file)) {
f.Add(new H5Group("grp1") {
Attributes = {
{ "title", A }
}
});
}
// check by reading back
// read back
using (var f = new H5File(file)) {
// must exist in the file
Assert.IsTrue(f.Get<H5Group>("grp1").Attributes.ContainsKey("title"));
// check size
var attr = f.Get<H5Group>("grp1").Attributes["title"];
Assert.IsTrue(attr.Size == ILMath.size(1, val.Length));
// read back
ILArray<Char> titleChar = attr.Get<Char>();
ILArray<byte> titleByte = attr.Get<byte>();
// compare byte values (sum)
int origsum = 0;
foreach (var c in val) origsum += (Byte)c;
Assert.IsTrue(ILMath.sumall(ILMath.toint32(titleByte)) == origsum);
StringBuilder title = new StringBuilder(attr.Size[1]);
for (int i = 0; i < titleChar.Length; i++) {
title.Append(titleChar.GetValue(i));
}
Assert.IsTrue(title.ToString() == val);
}
}
This stores arbitrary strings as 'Char-array' into HDF5 attributes and would work just the same for H5Dataset.
As an alternative solution you may use HDF5DotNet (http://hdf5.net/default.aspx) wrapper to write attributes as strings:
H5.open()
Uri destination = new Uri(#"C:\yourFileLocation\FileName.h5");
//Create an HDF5 file
H5FileId fileId = H5F.create(destination.LocalPath, H5F.CreateMode.ACC_TRUNC);
//Add a group to the file
H5GroupId groupId = H5G.create(fileId, "groupName");
string myString = "String attribute";
byte[] attrData = Encoding.ASCII.GetBytes(myString);
//Create an attribute of type STRING attached to the group
H5AttributeId attrId = H5A.create(groupId, "attributeName", H5T.create(H5T.CreateClass.STRING, attrData.Length),
H5S.create(H5S.H5SClass.SCALAR));
//Write the string into the attribute
H5A.write(attributeId, H5T.create(H5T.CreateClass.STRING, attrData.Length), new H5Array<byte>(attrData));
H5A.close(attributeId);
H5G.close(groupId);
H5F.close(fileId);
H5.close();