I want to read a rather large CSV file and process it (slice, dice, summarize, etc.) interactively (data exploration). My idea is to read the file into a database (H2) and use SQL to process it:
Read the file: I use the Ostermiller CSV parser.
Determine the type of each column: I randomly select 50 rows and derive the type (int, long, double, date, string) of each column (a rough sketch of this step follows below).
I want to use Squeryl for the processing. To do so I need to create a case class dynamically; that's the bottleneck so far!
I upload the file to H2 and use any SQL command.
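For reference, a rough sketch of what step 2 could look like (the type tags, the helper names and the date format are only illustrative, not my exact code):

    import java.text.SimpleDateFormat
    import scala.util.{Random, Try}

    // One tag per supported column type (names are illustrative).
    sealed trait ColType
    case object IntCol    extends ColType
    case object LongCol   extends ColType
    case object DoubleCol extends ColType
    case object DateCol   extends ColType
    case object StringCol extends ColType

    // Infer the type of one column from its sampled string values:
    // try the narrowest parse first and fall back to String.
    def inferColumn(samples: Seq[String]): ColType = {
      val dateFormat = new SimpleDateFormat("yyyy-MM-dd") // assumed date format
      def allParse(parse: String => Any): Boolean =
        samples.forall(s => Try(parse(s)).isSuccess)

      if      (allParse(_.toInt))              IntCol
      else if (allParse(_.toLong))             LongCol
      else if (allParse(_.toDouble))           DoubleCol
      else if (allParse(dateFormat.parse(_)))  DateCol
      else                                     StringCol
    }

    // Sample ~50 parsed rows and infer one type per column index.
    def inferTypes(rows: Seq[Array[String]]): Seq[ColType] = {
      val sample = Random.shuffle(rows).take(50)
      sample.head.indices.map(i => inferColumn(sample.map(_(i))))
    }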
My questions:
Is there a better general interactive way of doing this in Scala?
Is there a way to solve the 3rd point? To state it differently: given a list of types (corresponding to the columns in the CSV file), is it possible to dynamically create a case class corresponding to the table in Squeryl? To my understanding I could do that using macros, but I do not have enough exposure to them to do it.
I think your approach to the first question sounds reasonable.
Regarding your 2nd question, as an addition to drexin's answer: it is possible to generate the bytecode with a library such as ASM. With such a library you can generate the same bytecode the compiler would emit for a case class.
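As a rough, hedged illustration (assuming the org.ow2.asm library is on the classpath; generateRowClass and ByteClassLoader are made-up names), generating a plain class with public fields looks roughly like this. A full case-class replacement would additionally have to emit equals, hashCode, toString and a companion object:

    import org.objectweb.asm.{ClassWriter, Opcodes}

    // Generate the bytecode for a plain class named `className` (a JVM internal
    // name, i.e. slashes instead of dots for packages) with one public field per
    // entry of `fields` (field name -> JVM type descriptor, e.g. "age" -> "J").
    def generateRowClass(className: String, fields: Map[String, String]): Array[Byte] = {
      val cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES)
      cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, className, null, "java/lang/Object", null)

      fields.foreach { case (name, descriptor) =>
        cw.visitField(Opcodes.ACC_PUBLIC, name, descriptor, null, null).visitEnd()
      }

      // No-arg constructor that just calls super().
      val ctor = cw.visitMethod(Opcodes.ACC_PUBLIC, "<init>", "()V", null, null)
      ctor.visitCode()
      ctor.visitVarInsn(Opcodes.ALOAD, 0)
      ctor.visitMethodInsn(Opcodes.INVOKESPECIAL, "java/lang/Object", "<init>", "()V", false)
      ctor.visitInsn(Opcodes.RETURN)
      ctor.visitMaxs(0, 0) // recomputed automatically because of COMPUTE_FRAMES
      ctor.visitEnd()

      cw.visitEnd()
      cw.toByteArray
    }

    // A ClassLoader that exposes defineClass, so the generated bytes can be loaded.
    class ByteClassLoader(parent: ClassLoader) extends ClassLoader(parent) {
      def define(name: String, bytes: Array[Byte]): Class[_] =
        defineClass(name, bytes, 0, bytes.length)
    }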
As Scala is a statically typed language, there is no way to dynamically create classes except through reflection, which is slow and dangerous and should therefore be avoided. Even with macros you cannot do this: macros are evaluated at compile time, not at runtime, so you need to know the structure of your data at compile time. What do you need the case classes for, if you don't even know what your data looks like? What benefit do you expect from this over using a Map[String,Any]?
I think you want to create a sealed base class and then a series of case classes as subclasses of it. Each subclass will wrap a different type that you support.
Then you can use match statements and deconstruction to deal with the individual types, and treat them generically via the base class in the places where it doesn't matter.
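A hedged sketch of that encoding (the type names and the SQL-rendering helper are only illustrative):

    // One sealed base type for a cell value, one case class per supported column type.
    sealed trait Cell
    case class IntCell(value: Int)             extends Cell
    case class LongCell(value: Long)           extends Cell
    case class DoubleCell(value: Double)       extends Cell
    case class DateCell(value: java.util.Date) extends Cell
    case class StringCell(value: String)       extends Cell

    // A row is then just a sequence of cells; the individual types are handled by
    // matching where it matters, e.g. when rendering a value as an SQL literal.
    def asSqlLiteral(cell: Cell): String = cell match {
      case IntCell(v)    => v.toString
      case LongCell(v)   => v.toString
      case DoubleCell(v) => v.toString
      case DateCell(v)   => s"'${new java.sql.Timestamp(v.getTime)}'"
      case StringCell(v) => s"'${v.replace("'", "''")}'"
    }

A row is then simply a Seq[Cell], which keeps the per-cell types while letting generic code treat everything uniformly through the base trait.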
You can't create a class for an entire row since you don't know enough about it at compile time. Even if you could dynamically generate a class (maybe by invoking the compiler at runtime), you wouldn't be able to benefit from type-safety and most of your code would have to treat it generically anyway.
C# 9 introduces record reference types. A record provides some synthesized methods like a copy constructor, a clone operation, hash code calculation, and comparison/equality operations. It seems to me convenient to use records instead of classes in general. Are there reasons not to do so?
It seems to me that currently Visual Studio as an editor does not support records as well as classes, but this will probably change in the future.
Firstly, be aware that if it's possible for a class to contain circular references (which is true for most mutable classes), then many of the auto-generated record members can overflow the stack. So that's a pretty good reason not to use records for everything.
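For readers coming from the Scala case-class discussion above (case classes are the closest Scala analogue of a record), the same hazard takes two lines to reproduce; this is a made-up Scala example, not C#:

    // A mutable case class whose instances can form a cycle.
    case class Node(var next: Node)

    val a = Node(null)
    val b = Node(a)
    a.next = b // a -> b -> a

    // The synthesized members now recurse through the cycle:
    // a.toString, a.hashCode and a == b all throw StackOverflowError.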
So when should you use a record?
Use a record when an instance of a class is entirely defined by the public data it contains, and has no unique identity of its own.
This means that the record is basically just an immutable bag of data. I don't really care about that particular instance of the record at all, other than that it provides a convenient way of grouping related bits of data together.
Why?
Consider the members a record generates:
Value Equality
Two instances of a record are considered equal if they have the same data (by default: if all fields are the same).
This is appropriate for classes with no behavior, which are just used as immutable bags of data. However this is rarely the case for classes which are mutable, or have behavior.
For example if a class is mutable, then two instances which happen to contain the same data shouldn't be considered equal, as that would imply that updating one would update the other, which is obviously false. Instead you should use reference equality for such objects.
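Sketched with the Scala analogue used earlier in this document (case classes give value equality, plain mutable classes keep reference equality; Point and MutablePoint are made up):

    // Value equality: compared field by field, like a record.
    case class Point(x: Int, y: Int)
    val valueEqual = Point(1, 2) == Point(1, 2)                       // true

    // Reference equality: two instances are two distinct identities.
    class MutablePoint(var x: Int, var y: Int)
    val refEqual = new MutablePoint(1, 2) == new MutablePoint(1, 2)   // false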
Meanwhile if a class is an abstraction providing a service you have to think more carefully about what equality means, or if it's even relevant to your class. For example imagine a Crawler class which can crawl websites and return a list of pages. What would equality mean for such a class? You'd rarely have two instances of a Crawler, and if you did, why would you compare them?
with blocks
with blocks provide a convenient way to copy an object and update specific fields. This is always safe if the object has no identity, as copying it doesn't lose any information. Copying a mutable class, however, loses the identity of the original object, as updating the copy won't update the original. As such you have to consider whether this really makes sense for your class.
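In Scala terms, the counterpart of a with block is the generated copy method on a case class (Config is a made-up type):

    case class Config(host: String, port: Int)

    val base = Config("localhost", 8080)
    val prod = base.copy(host = "prod.example.com") // `base` itself is untouched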
ToString
The generated ToString prints out the values of all public properties. If your class is entirely defined by the properties it contains, then this makes a lot of sense. However if your class is not, then that's not necessarily the information you are interested in. A Crawler for example may have no public fields at all, but the private fields are likely to be highly relevant to its behavior. You'll probably want to define ToString yourself for such classes.
All properties of a record are public by default.
All properties of a record are immutable by default.
(By default, I mean when using the simple record definition syntax.)
Also, records can only derive from records and you cannot derive a regular class from a record.
Suppose you are doing some integration testing: you store some bigger entity in the DB, then read it back and want to compare it. Obviously it has some associations as well, but that's just the cherry on top of a very unpleasant cake. How do you compare those entities? I have seen a lot of incorrect ideas and feel that this has to be written manually. How do you do it?
Issues:
you cannot use equals/hashCode: those are for the natural id.
you cannot use a subclass with a fixed equals, as that would test a different class and can give wrong results when persisting data, since the data are handled differently in the persistence context.
lots of fields: you don't want to write all the comparisons by hand; you want reflection.
@Temporal annotations: you cannot use trivial "reflection equals" approaches, because with @Temporal(TIMESTAMP), java.util.Date <> java.sql.Date.
associations: a typical entity you would like to have properly tested will have several associations, so the tool/approach should ideally support deep comparison. Also, cycles in the object graph can ruin the fun.
The best solution I have found:
don't use transmogrifying data types (like Date) in JPA entities.
all associations should be initialized in the entity, because null <> empty list.
calculate a toString externally, via say ReflectionToStringBuilder, and compare those (a sketch follows below). The reason is to let the entity keep its own toString; the tests should not depend on nobody ever changing it. Theoretically, a toString can be deep, but the commons recursive ToStringStyle includes the object identifier, which ruins it.
I thought I could use a JSON-format toString, but commons supports that only for a shallow toString, and Jackson (without further instructions on the entity) fails on cycles over associations.
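A minimal sketch of that toString comparison (written in Scala to match the other snippets in this document, but the commons-lang3 calls are the same from Java; saved/loaded and the excluded "id" field are placeholders):

    import org.apache.commons.lang3.builder.{ReflectionToStringBuilder, ToStringStyle}

    // Build a reflective, shallow fingerprint of an entity, leaving out the surrogate id
    // (which differs before and after persisting). Associations still need their own
    // fingerprints, since this style does not recurse.
    def fingerprint(entity: AnyRef): String = {
      val builder = new ReflectionToStringBuilder(entity, ToStringStyle.SHORT_PREFIX_STYLE)
      builder.setExcludeFieldNames("id")
      builder.toString
    }

    // In a test (saved/loaded are hypothetical entities):
    // assert(fingerprint(saved) == fingerprint(loaded))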
An alternative solution would be to actually declare subclasses with a generated id (say via Lombok) and use some automatic mapping tool (say the remondis mapper), with an option to overcome the differences in Dates/collections.
But I'm listening. Does anyone have a better solution?
I read in this answer, A generic list of anonymous class, how to load a list with anonymous class objects. My question is why and when it is recommendable to use this approach instead of a struct, considering performance and good practices.
An exposed-field structure is essentially a group of variables bound together with duct tape. It won't behave as an "object", and may thus be seen as evil by those who think everything should behave like an object; nonetheless, in cases where one doesn't really want an object, but rather a group of variables bound together with duct tape, an exposed-field structure may be a perfect fit.
Anonymous classes have only a few advantages over exposed-field structures:
The syntax to declare them is at least slightly smaller; depending upon coding standards, it may be a lot smaller. If coding standards will allow one to write internal struct WeightAndVolume { public double weight, volume;} and say that the struct is "self-explanatory" [it contains two public fields of type double, named weight and volume, each of which will hold whatever was last written to it by outside code], anonymous classes won't save much, but if coding standards would require that every named data type have many pages of associated documentation, including an analysis of required unit-test procedures, anonymous classes could avoid such hassle.
Copying class references is slightly cheaper than copying structures larger than 8 bytes, though unless a reference would be copied many times, the cost of creating the object will outweigh any savings in copying.
Casting an anonymous class to Object is much cheaper than casting a struct. The first time an anonymous class instance gets cast to Object will make up for the extra costs of creating it. Every additional time will represent a savings of that amount.
Passing a structure to a generic method will require the JITter to produce a specialized version of the code for that type; by contrast, the JITter would only have to produce one piece of code to handle all anonymous classes.
In general, structures will work better than anonymous classes. On the other hand, there are a few scenarios (mostly related to the third point above) where classes can end up being much better.
I wouldn't say it is ever recommended to use anonymous classes, in the sense that it's never wrong to not use them. But they typically get used when
it's a one-shot job, for which creating a proper named type would be cumbersome, and
the consumer of the objects is either compiler-generated code (you don't have access to the types backing those anonymous classes, but the compiler does) or uses reflection (in which case you don't need access to the types at compile time)
The most common scenario where this occurs is in LINQ queries.
I have very large result sets being imported from JSON. Each row of data in the JSON comes in a very specific "column" order, which I would like to iterate through quickly. I'd prefer to avoid the overhead of checking/matching keys to process each piece of data. Unfortunately, scala.util.parsing.json puts these columns into a Map object, and when iterating through the Map, the order of iteration is arbitrary and does not necessarily mirror the order of the columns in the JSON result. Is there a way to make the parser enforce the order of the JSON columns? One thought was whether there is a way to tell the parser to use a LinkedHashMap or ListMap as it generates the objects. Would this be possible by extending the class or adding other traits? Do I have alternative options?
I'd strongly discourage you from relying on the order of key/value pairs. JSON objects are defined as:
An object is an unordered set of name/value pairs.
Relying on the order will most likely introduce subtle bugs and incompatibilities into your code. Trading correctness for speed is always a bad deal.
Instead I'd suggest finding a fast, correct parser. I've used Jackson before, which is very fast and works well with Scala. You annotate an arbitrary class of yours and Jackson parses the JSON into instances of that class. You can then process these instances as native Java/Scala objects, which is both very fast and robust.
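A minimal sketch of what that looks like from Scala (assuming jackson-databind plus the jackson-module-scala artifact; with the Scala module, a simple case class needs no annotations at all; the Row class and sample document are made up):

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    // The target class: fields are bound by name, so the column order in the JSON
    // no longer matters.
    case class Row(id: Long, name: String, score: Double)

    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)

    val row: Row = mapper.readValue("""{"id": 1, "name": "a", "score": 0.5}""", classOf[Row])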
I would consider trying something like json4s.
It appears the JObject type has ordered fields.
https://github.com/json4s/json4s
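A quick sketch of what that looks like (assuming the json4s-native artifact; the sample document is made up):

    import org.json4s._
    import org.json4s.native.JsonMethods._

    // JObject keeps its fields in a List, so the order from the document is preserved.
    val json = parse("""{"first": 1, "second": 2, "third": 3}""")

    val fields: List[(String, JValue)] = json match {
      case JObject(obj) => obj
      case _            => Nil
    }

    fields.foreach { case (key, value) => println(s"$key -> $value") }
    // first -> JInt(1), second -> JInt(2), third -> JInt(3), in that order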
Otherwise I would ask: why do you need them ordered?
You can always map.get by key.
I need to store a Scala class in Morphia. With annotations it works well, unless I try to store a collection of _ <: Enumeration.
Morphia complains that it does not have serializers for that type, and I am wondering how to provide one. For now I have changed the type of the collection to Seq[String], and fill it by invoking toString on every item in the collection (sketched below).
That works well; however, I'm not sure whether it is the right way.
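Roughly, the workaround looks like this (Color is just an example enumeration; withName is the standard inverse of toString for Enumeration values):

    object Color extends Enumeration {
      val Red, Green, Blue = Value
    }

    // Storing: the collection Morphia can handle.
    val stored: Seq[String] = Seq(Color.Red, Color.Blue).map(_.toString)

    // Reading back: withName turns the stored names into enumeration values again.
    val restored: Seq[Color.Value] = stored.map(Color.withName)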
This problem is common to several of the available layers of abstraction on top of MongoDB. It all comes back to one underlying reason: there is no enum equivalent in JSON/BSON. Salat, for example, has the same problem.
In fact, the MongoDB Java driver does not support enums, as you can read in the discussion going on here: https://jira.mongodb.org/browse/JAVA-268 where you can see the problem is still open. Most of the frameworks I have seen for using MongoDB from Java do not implement low-level functionality such as this. I think this choice makes a lot of sense, because they leave you the choice of how to deal with data structures not handled by the low-level driver, instead of imposing a way to do it on you.
In general I feel that the absence of support comes not from a technical limitation but rather from a design choice. For enums there are multiple ways to map them, each with its pros and cons, while for other data types it is probably simpler. I don't know the MongoDB Java driver in detail, but I guess supporting multiple "modes" would have required some refactoring (maybe that's why they are talking about a new version of the serialization?).
These are the two strategies I am thinking about (both sketched below):
If you want to index on an enum and minimize space usage, map the enum to an integer (not the ordinal, please; see "can set enum start value in java").
If your concern is queryability in the mongo shell, because your data will be accessed by data scientists, you would rather store the enum using its string value.
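A hedged sketch of both options with Scala's Enumeration (Priority and its ids are made up):

    object Priority extends Enumeration {
      val Low  = Value(10) // explicit ids, decoupled from declaration order (not the ordinal)
      val High = Value(20)
    }

    // Strategy 1: store the explicit numeric id (compact, index-friendly).
    def toDbId(p: Priority.Value): Int        = p.id
    def fromDbId(id: Int): Priority.Value     = Priority(id)

    // Strategy 2: store the name (readable when querying in the mongo shell).
    def toDbName(p: Priority.Value): String   = p.toString
    def fromDbName(s: String): Priority.Value = Priority.withName(s)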
To conclude, there is nothing wrong with adding an intermediate data structure between your native object and MongoDB. Salat supports this through CustomTransformers; with Morphia you may need to do the conversion explicitly. Go for it.