What does record-oriented data integration mean? (JCA)

In this JCA tutorial, under "Data integration", it says:
Lastly, data-level integration implies that the data passed between systems will be in a data/record-oriented manner.
What is 'record-oriented'?

Here is a pretty good description of a record-oriented filesystem:
In computer science, a record-oriented filesystem is a file system
where files are stored as collections of records. There are several
different record formats; the details vary depending on the particular
system. In general the formats can be fixed-length or variable length,
with different physical organizations or padding mechanisms; metadata
may be associated with the file records to define the record length,
or the data may be part of the record. Different methods to access
records may be provided, for example sequential, by key or by record
number.
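To make "fixed-length records" concrete, here is a minimal Python sketch that treats a file as a sequence of fixed-length records rather than as an unstructured byte stream (the 40-byte record layout is invented purely for illustration):

    import struct

    RECORD_SIZE = 40                      # hypothetical fixed record length
    RECORD_FORMAT = "<10si26s"            # name (10 bytes), id (4-byte int), payload (26 bytes)

    def read_records(path):
        """Read a file as a sequence of fixed-length records instead of an arbitrary byte stream."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(RECORD_SIZE)
                if len(chunk) < RECORD_SIZE:
                    break                 # ignore a trailing partial record
                name, rec_id, payload = struct.unpack(RECORD_FORMAT, chunk)
                yield name.rstrip(b"\x00"), rec_id, payload

    # access "by record number" is then simple arithmetic: f.seek(n * RECORD_SIZE)

Record-oriented data integration in JCA follows the same spirit: the adapter exchanges structured records with the backend system rather than arbitrary byte streams or objects.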

Related

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (on GCP) for storing a large set of text files of different sizes (1 KB to 100 MB) about different subjects. One fileset could be more than 10 GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP and, since the pre-processed data is saved separately, they will not be accessed frequently after pre-processing. So saving all the files in MongoDB seems unnecessary.
I'm considering:
- saving all files into some cloud storage,
- saving file information like name and path in MongoDB as JSON.
The above folders would turn into:
{
    "name": "dataset_about_some_subject",
    "path": "path_to_cloud_storage",
    "files": [
        {
            "name": "file1.txt",
            ...
        },
        ...
    ]
}
When any fileset is needed, search its name in MongoDB and read the files from cloud storage.
Is this a valid approach? Will there be any I/O speed problems?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum document size of 16 MB, which is below the maximum size of your files (100 MB). This means that, unless you split them, storing the original files as plain JSON strings would not work.
The approach you describe, however, is sensible. Storing the files on cloud storage such as S3 or Azure is common, not very expensive, and does not require a lot of maintenance compared to running your own HDFS cluster. I/O will be best if you perform the computations on machines of the same provider and make sure the machines are in the same region as the data.
Note that document stores, in general, are very good at handling large collections of small documents. Retrieving file metadata in the collection would thus be most efficient if you store the metadata of each file in a separate object (rather than in an array of objects in the same document), and have a corresponding index for fast lookup.
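As a minimal sketch of that layout (the database, collection, field and index names are placeholders, not something from the question):

    from pymongo import ASCENDING, MongoClient

    client = MongoClient()                      # assumes a reachable MongoDB instance
    files = client["nlp"]["file_metadata"]      # hypothetical database/collection names

    # one small document per file, instead of one huge document per dataset
    files.insert_one({
        "dataset": "dataset_about_some_subject",
        "name": "file1.txt",
        "path": "gs://some-bucket/dataset_about_some_subject/file1.txt",  # made-up path
        "size_bytes": 12345,
    })

    # compound index so both "all files of a dataset" and "one specific file" are fast lookups
    files.create_index([("dataset", ASCENDING), ("name", ASCENDING)])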
Finally, there is another aspect to consider, namely, whether your NLP scenario will process the files by scanning them (reading them all entirely) or whether you need random access or lookup (for example, a certain word). In the first case, which is throughput-driven, cloud storage is a very good option. In the latter case, which is latency-driven, there are document stores like Elasticsearch that offer good fulltext search functionality and can index text out of the box.
I recommend storing the large files using one of the storage services below. They also support multi-regional access through a CDN to ensure fast file access.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
For the metadata storage you propose in MongoDB, you can rest assured that speed will not be a problem.
However, for storing the files themselves, you have various options to consider:
- Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices), data transfer over the public network for every access (might be a performance problem)
- MongoDB GridFS: already in place, operating cost varies, data transfer is just as fast as from MongoDB itself
- Hadoop cluster: high initial hardware and setup cost, lower cost over time. Data transfer over the local network (provided you build it on-premise). Specialized administration skills needed. Possibility to use the cluster for parallel calculations (i.e. this is not only storage, it is also computing power). As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwhile.
If you are not sure about the amount of data you will cover and just want to get started, I recommend starting out with GridFS, but encapsulate it in a way that lets you easily exchange the storage later.
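A rough sketch of that encapsulation, assuming pymongo/gridfs (the class and method names are invented for illustration):

    import gridfs
    from pymongo import MongoClient

    class FileStore:
        """Thin wrapper so GridFS can later be swapped for S3/GCS without touching callers."""

        def __init__(self, db_name="nlp"):                    # hypothetical database name
            self._fs = gridfs.GridFS(MongoClient()[db_name])

        def save(self, name, data):
            return self._fs.put(data, filename=name)          # returns the GridFS file id

        def load(self, name):
            grid_out = self._fs.find_one({"filename": name})
            return grid_out.read() if grid_out else None

As long as callers only go through save/load, moving to a cloud bucket later is a matter of swapping out this one class.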
I have another answer: as you say, 10 GB is really not big at all. You may also want to consider the option of storing it on your local computer (or locally on a single machine in the cloud), simply on your regular file system, and executing in parallel on your cores (Hadoop and Spark will do this too).
One way of doing it is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV...), the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file, and thus process the actual files in parallel.
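For instance, a minimal PySpark sketch of that pattern (the metadata file name, its fields, and the word-count placeholder are assumptions):

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nlp-preprocessing").getOrCreate()

    def preprocess(meta):
        """Placeholder for the real NLP pre-processing of one file."""
        with open(meta["path"], encoding="utf-8") as f:
            text = f.read()
        return meta["name"], len(text.split())    # e.g. a trivial word count

    # one JSON object per line, each line describing one text file
    metadata = spark.sparkContext.textFile("metadata.jsonl").map(json.loads)
    results = metadata.map(preprocess).collect()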
Depending on your use case, this might turn out to be faster than on a cluster, or not exceedingly slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from the disk fast enough, and for workloads executed occasionally, this is a problem that one starts having from the TB range.
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

Kafka: Generating unique IDs for strings across partitions

I'm trying to assess whether Kafka could be used to scale out our current solution.
I can identify partitions easily. Currently, the requirement is for 1500 partitions, each having 1-2 events per second, but in the future it might go as high as 10000 partitions.
But there is one part of our solution that I don't know how to solve in Kafka.
The problem is that each message contains a string and I want to assign a unique ID to each string across the whole topic. So same strings have the same ID while different strings have different IDs. The IDs don't need to be sequential, nor do they need to be always-growing.
The IDs will then be used down-stream as unique keys to identify those strings. The strings can be hundreds of characters long, so I don't think they would make efficient keys.
More advanced usage would be where messages might have different "kinds" of strings, so there would be multiple unique sequences of IDs. And messages will contain only some of those kinds depending on the type of the message.
Another advanced usage would be that the values are not strings but structures, and whether two structures are "the same" is determined by a more elaborate rule, e.g. if PropA is equal, the structures are equal; if not, they are equal if PropB is equal.
To illustrate the problem: each partition is a computer in a network. Each event is an action on that computer. Events need to be ordered per computer so that events which change the state of the computer (e.g. a user logged in) can affect other types of events; ordering is critical for that. E.g. the user opened an application, a file was written, a flash drive was inserted, etc. And I need each application, file, flash drive, and many other things to have unique identifiers across all computers. This is then used to calculate statistics downstream. And sometimes an event can have multiple of those, e.g. an operation on a specific file on a specific flash drive.
There is a very nice post about Kafka and blockchain. It is collective work, and I think it could solve your ID scalability issue. For the solution, refer to the "Blockchain: reasons." part. All credit goes to the respective authors.
The idea is simple, yet efficient:
- Data is hash-based, with a link to the previous block
- The data may very well be the same hashes, with links to the respective blocks of each type
- A custom blockchain solution means you are in control of data encoding/decoding
- Each hash chain is self-contained, and may essentially be your process (HDD/RAM/CPU/word/app, etc.)
- Each hash chain may be a message itself
- Bonus: statistics and analytics may very well be stored in the blockchain, with good support for compression and replication. Consumers are pretty cheap in that context (scalability).
Pros:
- Unique identifier issue solved
- All records are linked and, thanks to Kafka and blockchain, highly ordered
- Data is extendable
- Kafka's properties still apply
Cons:
- Encryption/decryption is CPU intensive
- Growing hash calculation complexity
Problem: without more context it's hard to estimate the limitations that would need to be addressed further. However, assuming the computed solution is finite in nature, you should have no issues scaling it in the regular way.
Bottom line:
Without knowing the requirements in terms of speed/cost/quality, it's hard to give a better, well-backed answer with a working example. Extra CPU in the cloud may be comparatively cheap; data storage cost depends on how long and how much data you want to store, replayability, etc. It's a good chunk of work. A prototype? The concept is in the referenced article.
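If the main thing you need from the hash-based idea is that identical strings always get the same ID, a minimal sketch for such a prototype could look like this (SHA-256 and the "kind" parameter are my assumptions, the latter covering the multiple-ID-sequence case):

    import hashlib

    def string_id(value: str, kind: str = "default") -> str:
        """Deterministic ID: equal (kind, value) pairs always map to the same ID,
        on any partition and without coordination between producers."""
        digest = hashlib.sha256(f"{kind}\x00{value}".encode("utf-8")).hexdigest()
        return digest[:16]  # truncate for compactness; keep more bytes if collisions are a concern

    # same string -> same ID, across all partitions
    assert string_id("some very long string...", kind="file") == \
           string_id("some very long string...", kind="file")

For the structured case, you would hash a canonical serialization of whichever properties define equality, rather than the raw value.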

What is the recommended way to create a consolidated data store for large SPSS files that have survey data (having 600–800 columns)?

Hello everyone, I just need your suggestion: what is the best way to store the data retrieved from SPSS files, whether in MongoDB, an RDBMS, or something else?
The data comprises responses to survey questionnaires, which can span a large number of columns (600-800) depending on the number of questions and other attributes recorded for the respondent and the survey study. These surveys are conducted periodically; however, the questions do not necessarily stay exactly the same and may vary from survey to survey.
The need is to consolidate this data into a uniform structure and enable further analysis over the consolidated data spanning multiple surveys, for which the plan is again to use SPSS.
One option I considered was to store the data in MongoDB, since that gives flexibility in how the schema can be modified across surveys, i.e. a rigid schema definition can be avoided. However, in this case I am not sure whether SPSS would support working against MongoDB.
I would be very interested to know if someone has experience in this area or could provide some suggestions.
Another thing to consider, if you plan to create generalized jobs that can be run over surveys that are similar but differ in details, is to set up a classification system for the variables (such as demographic, opinion, economic, etc.) and assign these classes using custom attributes when the .sav files are created. You can then use these attributes in generalized jobs to determine what to do based on generic properties rather than tying the code to specific variable names.
You can use the SPSSINC SELECT VARIABLES extension command to define macros based on variable properties, including custom attributes, and then use those macros in your syntax in place of specific variable names.
We have seen that an approach like this can dramatically reduce the number of different but similar jobs that an organization has to otherwise maintain.
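To illustrate the selection-by-attribute idea outside of SPSS (this is plain Python pseudocode of the concept, not SPSSINC SELECT VARIABLES syntax, and the variable names are made up):

    # hypothetical variable metadata, as it might be attached via custom attributes in the .sav files
    variable_attrs = {
        "age":        {"class": "demographic"},
        "income":     {"class": "economic"},
        "q12_rating": {"class": "opinion"},
        "q13_rating": {"class": "opinion"},
    }

    def select_by_class(attrs, wanted):
        """Pick variables by their generic classification instead of hard-coding names."""
        return [name for name, a in attrs.items() if a.get("class") == wanted]

    opinion_vars = select_by_class(variable_attrs, "opinion")
    # a generalized job would build its syntax from opinion_vars, so it keeps working
    # even when the concrete question names change from survey to survey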

Why does DHT hash the filenames?

One of the objectives of a DHT is to partition the keyspace, so that each node (or group of nodes) has a share of it. To do so, it hashes the filename of the file to be saved and stores it on the node responsible for that part of the network. But why does it have to hash the filename? Couldn't it just work like a dictionary, so that instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
But, why does it have to hash the filename?
It doesn't have to be a filename. It can hash other things too. E.g. file contents. Or metadata. Or cryptographic keys used as identities of users in the network.
Couldn't it just work like a dictionary, so instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
Because filenames are not uniformly distributed throughout the possible keyspace (how often do you see filenames starting with some exotic unicode character?) and their entropy is spread over a variable length, leading to even more clustering at the top level.
If you were to index all existing unix filesystems in the world you would have massive clustering around the /etc/... prefix for example.
There are other p2p network overlays that can deal with heavy clustering in the keyspace, often by rearranging the nodes around the hotspots to increase network capacity in regions of the affected keyspace, e.g. based on levenshtein distance, but they generally aren't distributed hash tables because they do not employ hashing.
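A quick way to see the clustering argument, using a made-up handful of paths (not data from the question):

    import hashlib
    from collections import Counter

    paths = ["/etc/passwd", "/etc/hosts", "/etc/fstab", "/home/a/notes.txt", "/var/log/syslog"]

    # keyspace position by name prefix: heavy clustering around "/etc"
    print(Counter(p[:4] for p in paths))

    # keyspace position by hash prefix: spread roughly uniformly over the keyspace
    print(Counter(hashlib.sha1(p.encode()).hexdigest()[:2] for p in paths))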
Because searches are done on numbers.
When you hash a file, you end up with a number, and that number will be allocated in the k-buckets of the nearest k peers.
Names are irrelevant; you're performing XOR searches on a numeric space, so that you can discard half of the remaining space on every hop.
Once you find a peer that has the bucket pointed to by the hash, you can communicate with that peer and exchange the related information.
A DHT, like libtorrent's Kademlia implementation, has to be seen more as a distributed routing data structure. The problem you're solving is: how do I find a number among billions of numbers, how do I find a peer among millions, in the fewest hops possible? The answer is that every node on the network has to follow a set of simple rules about how it organizes the numbers it stores and the peers it knows about.
I recommend you read these notes on how a real DHT actually works.
https://gist.github.com/gubatron/cd9cfa66839e18e49846
Also, storing a number takes a lot less space than storing a word.
If you know the word, you can hash the word and search for the hash.
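A tiny sketch of the "XOR searches on a numeric space" point, assuming 160-bit IDs as in Kademlia-style DHTs (the node IDs here are random placeholders):

    import hashlib
    import random

    def key_for(word: str) -> int:
        """Word/content -> 160-bit number; the DHT only ever searches for numbers."""
        return int.from_bytes(hashlib.sha1(word.encode()).digest(), "big")

    def xor_distance(a: int, b: int) -> int:
        return a ^ b

    node_ids = [random.getrandbits(160) for _ in range(8)]  # pretend these are peers
    target = key_for("some_filename.txt")

    # the peers "closest" to the key (smallest XOR distance) are the ones asked to store/find it
    closest = sorted(node_ids, key=lambda n: xor_distance(n, target))[:3]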
Yes, it could work like a dictionary. However, it would be missing some desirable (for the typical DHT use case) emergent properties that come from using a hash.
One property that hashing (along with the XOR distance metric) gives you is an even distribution of content amongst all the nodes participating in a DHT. "Even" here is caveated by how the k-bucket data structure works (here's an overview: k-bucket slides), but in aggregate you get nodes distributing data evenly amongst the DHT peers... in theory. In practice, you can get hotspots.
Another property of using a hash appears if you're looking for a file with specific contents. If you use hashes of the file contents as the identifiers, you can be statistically sure (the guarantee comes from your hash function's collision properties) that you're getting the contents you're looking for. Relying on a filename introduces a level of indirection that can serve different contents for the same filename. Depending on your use case, that may or may not be acceptable.
I've considered what you're proposing before as a prefix to a SHA-1 hash. So, something like node1-cd9cf... (the prefix could be anything really, doesn't need to be human readable). This would ensure that all the things with that prefix end up pretty much on a node that identifies itself with an id starting with "node1-". But, you'd have to have a DHT implementation (including k-bucket implementation) that supports variable length ids. In this case, you're guaranteeing a hotspot. It's an equivalent of artificially ensuring that things are "close together" as in the difference between them in the XOR metric is very small. Why would anyone want to do this? For example: com.example.www-cd9cf... combined with some crypto could ensure that while you're participating in a DHT, the data is stored on your servers. I haven't seen this implemented before though.

One big and wide table or many smaller ones for statistics data

I'm writing the simplest analytics system for my company. I have about 100 different event types that should be collected across tens of projects. We are not interested in cross-project analytics queries, but events have similar types across all projects. I use PostgreSQL as the primary storage for this system. Now I have to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all types of events. It will have about 20 or more columns, many of them nullable. Partitioning might be used to split this table by event type, but the table would still be very wide.
The second architecture is a lot of tables (fairly big in terms of row count but not so wide), with one table per event type.
I am going to retrieve analytics data from these tables using various join queries (self-joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD: All events have about 10 common attributes. The remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once / a little at a time), the volume of your data per project (hundreds of data points vs. millions of data points), and the querying pattern (i.e. querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be whether new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: are all the "data points" the same data type (or at least very similar)? Are some text and others numeric? Are some integers and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK, especially if you don't predict having a bunch of new project types coming down the pike anytime soon. Otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQL's "document store" style data types. You could store this information in arrays, hstore, or json. Note that this will be fairly inefficient if you're doing a ton of aggregate functions, as you might be left calculating the aggregates outside of PostgreSQL or, at a minimum, running an inefficient query. You could store the 10 common fields as normal columns and the additional ones as hstore or json.
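A rough sketch of that hybrid layout (the table and column names, and the choice of jsonb over hstore, are assumptions for illustration):

    import psycopg2

    ddl = """
    CREATE TABLE IF NOT EXISTS events (
        id          bigserial PRIMARY KEY,
        project_id  integer     NOT NULL,
        event_type  text        NOT NULL,
        occurred_at timestamptz NOT NULL,
        -- ...the remaining common attributes as ordinary typed columns...
        extra       jsonb       NOT NULL DEFAULT '{}'::jsonb  -- event-type-specific attributes
    );
    CREATE INDEX IF NOT EXISTS events_type_time_idx
        ON events (project_id, event_type, occurred_at);
    """

    with psycopg2.connect("dbname=analytics") as conn:        # hypothetical connection string
        with conn.cursor() as cur:
            cur.execute(ddl)
            # common attributes stay queryable/aggregatable as plain columns,
            # per-type attributes go into jsonb and can still be filtered with ->> if needed
            cur.execute(
                "INSERT INTO events (project_id, event_type, occurred_at, extra) "
                "VALUES (%s, %s, now(), %s::jsonb)",
                (1, "page_view", '{"url": "/home", "referrer": null}'),
            )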
I didn't ask, but it would be nice to know whether each event within a project has more than one data point (i.e. are you logging changes, or just updating data). If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.