Hash calculation in torrent clients

I was wondering if someone knows what a "hash" in a BitTorrent client refers to. It is clearly not the hash code of the file, but something different.
I think it is more like a magnet link to a file, but how is it connected to the file itself?
I just want to understand the construct behind the scenes.
File <--> Hash <---> hash in torrent client

The hash in a torrent client, or the hash you find in a magnet URI, is the SHA-1 hash of the raw bencoded info dictionary of a torrent file.
To understand how that works you need to know two things:
How a torrent file is built.
How bencoding is done.
Both of these are explained in the official BitTorrent specification that you can find here: http://bittorrent.org/beps/bep_0003.html
However, I recommend that you instead read the unofficial specification that you can find here: https://wiki.theory.org/BitTorrentSpecification as it is much easier to understand.
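To make the "raw bencoded info dictionary" part concrete, here is a minimal Python sketch. It assumes a version 1 torrent as described in BEP 3, and the sample torrent bytes, the tracker URL and the helper names (end_of, info_hash) are made up for illustration. The point is that the hash is taken over the exact bytes of the info value as they appear in the file, not over the downloaded content.

```python
import hashlib


def end_of(data, i):
    """Return the index just past the bencoded value that starts at data[i]."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i<digits>e
        return data.index(b"e", i) + 1
    if c in (b"l", b"d"):              # list or dict: l...e / d...e
        i += 1
        while data[i:i + 1] != b"e":
            i = end_of(data, i)        # dict keys and values are both values
        return i + 1
    colon = data.index(b":", i)        # byte string: <length>:<bytes>
    return colon + 1 + int(data[i:colon])


def info_hash(torrent_bytes):
    """SHA-1 over the raw bytes of the value stored under the b'info' key."""
    if torrent_bytes[:1] != b"d":
        raise ValueError("a torrent file must be a bencoded dictionary")
    i = 1
    while torrent_bytes[i:i + 1] != b"e":
        key_end = end_of(torrent_bytes, i)
        colon = torrent_bytes.index(b":", i)
        key = torrent_bytes[colon + 1:key_end]
        value_end = end_of(torrent_bytes, key_end)
        if key == b"info":
            return hashlib.sha1(torrent_bytes[key_end:value_end]).hexdigest()
        i = value_end
    raise ValueError("no info dictionary found")


# Hypothetical single-file torrent with one 20-byte piece hash.
raw_info = (b"d6:lengthi5e4:name5:a.txt12:piece lengthi16384e"
            b"6:pieces20:" + bytes(20) + b"e")
sample_torrent = b"d8:announce14:http://x/track4:info" + raw_info + b"e"

digest = info_hash(sample_torrent)
print("magnet:?xt=urn:btih:" + digest)
```

Slicing the original bytes, rather than decoding and re-encoding, avoids any mismatch if the file was not encoded canonically.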

What is multicodec and how it is related to multihash?

I don't have any background with this subject.
To try to understand them better, I read:
Multihash
CIDv1: Multicodec prefix
From what I understand, the multihash is the algorithm used to hash (one way) the value, which means we can't go back (we can't decode the hash to the value).
Questions
I don't understand, in simple words, what multicodec is and whether it is related to decoding the hash to a value (which makes no sense).
What is the motivation for the multicodec prefix?
The multicodec is related to decoding the value the hash points to, if that makes it easier to understand. Don't worry, no magic hash decoding is happening ;). Remember we're making CIDs, and we can use CIDs to look up content. But then we have the question of "how do we decode this data we just retrieved?", and the multicodec solves that problem for us. Reading From Data to Data Structures might help clear up some confusion.
The multicodec prefix allows IPFS to evolve to support new and different encodings for the data that's actually put into IPFS. This refers to IPLD, and you can actually find the answer you're looking for under Links (with information about the codecs under Codecs):
For links we use a CID. A CID is an extension of multihash, in fact a multihash is part of a CID. We simply add a codec to a multihash that tells us what format the data is in (JSON, CBOR, Bitcoin, Ethereum, etc). This way, we can actually link between data in different formats and any link to data anyone ever gives us can be decoded so that it can become more than just a series of bytes.
CID is a standard that anyone can implement, even people that have no other interest in IPLD beyond the need for hash links to different data types can use it.
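As a rough illustration of "a CID is a multihash plus a codec", the following Python sketch builds and parses a CIDv1 by hand. It is only a sketch under assumptions: the code values (0x55 for raw, 0x70 for dag-pb, 0x12 for sha2-256) and the "b" multibase prefix for lower-case base32 are taken from the multicodec and multibase tables as I understand them, and the function names are made up.

```python
import base64
import hashlib

# Assumed multicodec table entries (check the official table before relying on them).
RAW, DAG_PB = 0x55, 0x70
SHA2_256 = 0x12


def varint(n):
    """Encode a non-negative int as an unsigned varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf, i):
    """Decode a varint at buf[i]; return (value, index after it)."""
    shift = value = 0
    while True:
        b = buf[i]
        i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7


def make_cid_v1(data, codec=RAW):
    digest = hashlib.sha256(data).digest()
    multihash = varint(SHA2_256) + varint(len(digest)) + digest
    binary = varint(1) + varint(codec) + multihash   # version, codec, multihash
    text = base64.b32encode(binary).decode("ascii").lower().rstrip("=")
    return "b" + text                                # "b" = base32 multibase prefix


def parse_cid_v1(cid):
    if not cid.startswith("b"):
        raise ValueError("only base32 CIDs are handled in this sketch")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)                   # restore stripped padding
    binary = base64.b32decode(body)
    version, i = read_varint(binary, 0)
    codec, i = read_varint(binary, i)
    hash_fn, i = read_varint(binary, i)
    length, i = read_varint(binary, i)
    return {"version": version, "codec": codec,
            "hash_fn": hash_fn, "digest": binary[i:i + length]}


cid = make_cid_v1(b"hello world")
print(cid, parse_cid_v1(cid))
```

Nothing here reverses the hash: parsing only recovers which hash function was used and which codec tells a reader how to interpret the bytes that the digest identifies.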

Reading & writing text in Scala, getting the encoding right?

I'm reading and writing some text files in Scala. As a complete beginner in the language, I wanted to make sure to find the right way to do it, e.g. get the encoding right.
Most of the stuff I found (also on SO) recommends I use io.Source.fromFile. I tried it out like so, reading a UTF-8 file:
val user_list = Source.fromFile("usernames.txt").getLines.toList
val user_list = Source.fromFile("usernames.txt", enc="UTF8").getLines.toList
I looked at the docs but was left with some questions.
Get the encoding right:
The docs show that I can set an encoding in Source.fromFile as I tried above. Looking at the docs for Codec and the types listed there, I was wondering if those are all my codec options. Is there, for example, no UTF-16, Big-Endian vs Little-Endian, etc.?
I am slightly obsessed with this since it used to trip me up in Python a lot. Is this less of a concern with Scala for some reason?
Get the reading in right:
All the examples I looked at used the getLines method and postprocessed it with mkString or toList, etc. Is there any advantage to that over just reading in the entire file (my files are small) in one go?
Get the writing out right:
Every source I could find tells me that Scala has no file writing function and to use the Java FileWriter. I was surprised by this - is this still accurate?
Looking at it I feel the question might be a little broad for SO, so I'd be happy to take it back if it does not meet the requirements. At this point, I'm not struggling with specific examples but rather trying to set things up in a way I don't get in trouble later.
Thanks!
Scala only has a basic IO API in the standard library. For the most part you just use the Java APIs. The fact that a decent API from Java exists is probably why the Scala team is not prioritizing a robust and fully featured IO API.
There are also third-party Scala libraries you could use. I've never used Better Files, but I've heard good things about it as a Scala file API. There is also fs2, which provides functional, streaming IO. I'm sure there are others out there as well.
For encoding, there are many encodings available. Only a couple of the most common ones are available as static fields; the rest you typically access through Codec("Encoding Name"). Most APIs will also let you just enter a String directly instead of needing to get a Codec instance first. The codec is really just a wrapper over java.nio.charset.Charset. You can run java.nio.charset.Charset.availableCharsets() to see all of the encodings available on your system.
As far as reading goes, if the files are small you can load them fully into memory if you prefer that. The only reason not to do so is to avoid the extra memory use of loading the entire file at once when reading through it line by line is enough. You may want to use Vector instead of List for efficiency reasons. Vector is better in many cases and should probably be preferred as a default collection, but tradition and old habits die hard and most people and guides seem to default to List. That is a whole other topic, though.

Is there a port for KStem for .NET?

I'm about to launch into a Lucene.NET implementation and I am concerned about using the PorterStemFilter. Reading here, and reading source code, it appears to be far, far too aggressive for my needs.
I need something simpler that doesn't look for roots but just removes suffixes such as "er", "ed", "s", etc. From what I've read, KStem would do the trick.
I can't for the life of me find a .NET version of KStem. I can't even find source code for the Java version to handroll a port.
Could someone point me in the right direction?
Looks like it is easy enough to handcraft a reduced PorterStemmer by simply removing steps I don't want. Anyone have success with that?
You could use the HunspellStemmer, part of contrib. It can use freely available hunspell dictionaries to provide proper stemming.

MD5 hash of a file in ABAP

I want to generate an MD5 hash of a text file in ABAP. I have not found any standard solution for generating it for a very big file. Function module CALCULATE_HASH_FOR_CHAR does not meet my requirements because it takes a string as an input parameter. Although it works for smaller files, for a 4 GB file, for example, one cannot construct such a big string.
Does anybody know whether there is a standard piece of coding for doing that (my google efforts did not bring me anything) or maybe someone has an MD5 algorithm in ABAP that calculates the hash of a file?
It looks like implementing this algorithm is impossible in ABAP because the language does not allow arithmetic overflows during calculations. This should also answer the question of why it has not been implemented in the SAP system so far. Either way, it looks like there is no other way than to call an external tool, which of course is, regrettably, hardly platform independent.
EDIT: OK! With great help from René and the code of Fast MD5 Implementation in Java, I created an implementation of the MD5 algorithm in ABAP. This implementation allows the calculated hash to be updated with more bytes, which of course might come from different sources.
There is no method that takes a file so far, but most of the work has been done anyway.
Some simple ABAP Unit tests are included in the code, which also document how to use it.
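To show what "updating the calculated hash with more bytes" buys you, here is a short sketch in Python rather than ABAP, since the ABAP implementation itself is not reproduced here. The function name md5_of_file and the 4 MB block size are arbitrary choices for illustration. The file is read block by block, so the whole file never has to fit in memory, yet the result equals the MD5 of the complete file.

```python
import hashlib


def md5_of_file(path, block_size=4 * 1024 * 1024):
    """Feed the file to MD5 one block at a time and return the hex digest."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for block in iter(lambda: f.read(block_size), b""):
            md5.update(block)
    return md5.hexdigest()
```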
Perhaps you could read the file in data blocks of a couple megabytes and create a hash list of those using the suggested function. And then create a single top hash using the generated hash list.
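A minimal Python sketch of that hash-list idea follows; the name top_hash is made up, and the blocks would in practice come from reading the file a few megabytes at a time. Note that the resulting top hash is not the same value as the plain MD5 of the file, and it changes if the block size changes, so both sides have to agree on the block size.

```python
import hashlib


def top_hash(blocks):
    """Hash each block, then hash the concatenation of the block digests."""
    block_digests = [hashlib.md5(block).digest() for block in blocks]
    return hashlib.md5(b"".join(block_digests)).hexdigest()


# Hypothetical data split into fixed-size blocks.
data = b"some file content " * 1000
block_size = 4096
blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
print(top_hash(blocks))
```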
The SDN is usually a very good starting point for finding ABAP-related solutions. I was able to find this post: http://scn.sap.com/thread/1483479
The author suggests:
Upload the .txt file BUT as BIN.
Calculate the hash code using function MD5_CALCULATE_HASH_FOR_RAW
Are you able to get your file in binary format and use MD5_CALCULATE_HASH_FOR_RAW?
Edit: This post even has a more detailed answer using CALCULATE_HASH_FOR_RAW: http://scn.sap.com/thread/1298723
Quote of Shivanand Kalagi's answer:
STR_LEN = XSTRLEN( DATA ).
CALL FUNCTION 'CALCULATE_HASH_FOR_RAW'
  EXPORTING
    ALG    = 'MD5'
    DATA   = DATA
    LENGTH = STR_LEN
  IMPORTING
    HASH   = L_MD5_HASH.

Approaches to programming application-level protocols?

I'm doing some simple socket programming in C#. I am attempting to authenticate a user by reading the username and password from the client console, sending the credentials to the server, and returning the authentication status from the server. Basic stuff. My question is, how do I ensure that the data is in a format that both the server and client expect?
For example, here's how I read the user credentials on the client:
Console.WriteLine("Enter username: ");
string username = Console.ReadLine();
Console.WriteLine("Enter password: ");
string password = Console.ReadLine();
StreamWriter clientSocketWriter = new StreamWriter(new NetworkStream(clientSocket));
clientSocketWriter.WriteLine(username + ":" + password);
clientSocketWriter.Flush();
Here I am delimiting the username and password with a colon (or some other symbol) on the client side. On the server I simply split the string using ":" as the token. This works, but it seems sort of... unsafe. Shouldn't there be some sort of delimiter token that is shared between client and server so I don't have to just hard-code it in like this?
It's a similar matter for the server response. If the authentication is successful, how do I send a response back in a format that the client expects? Would I simply send a "SUCCESS" or "AuthSuccessful=True/False" string? How would I ensure the client knows what format the server sends data in (other than just hard-coding it into the client)?
I guess what I am asking is how to design and implement an application-level protocol. I realize it is sort of unique to your application, but what is the typical approach that programmers generally use? Furthermore, how do you keep the format consistent? I would really appreciate some links to articles on this matter as well.
Rather than reinvent the wheel, why not code up an XML schema and send and receive XML "files"?
Your messages will certainly be longer, but with gigabit Ethernet and ADSL this hardly matters these days. What you do get is a protocol where all the issues of character sets and complex data structures have already been solved, plus an embarrassingly large choice of tools and libraries to support and ease your development.
I highly recommend using plain ASCII text if at all possible.
It makes bugs much easier to detect and fix.
Some common, machine-readable ASCII text protocols (roughly in order of complexity):
netstring
Tab Delimited Tables
Comma Separated Values (CSV) (strings that include both commas and double-quotes are a little awkward to handle correctly)
INI file format
property list format
JSON
YAML Ain't Markup Language
XML
The world is already complicated enough, so I try to use the least-complex protocol that would work.
Sending two user-generated strings from one machine to another -- netstrings is the simplest protocol on my list that would work for that, so I would pick netstrings.
(netstrings will work fine even if the user types in a few colons or semi-colons or double-quotes or tabs -- unlike other formats that choke on certain commonly-typed characters).
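For reference, a netstring is the payload length in decimal, a colon, the payload bytes, and a trailing comma. The Python sketch below (function names are made up) shows why colons or commas inside the username or password cause no trouble: the receiver counts bytes instead of searching for a delimiter.

```python
def encode_netstring(payload: bytes) -> bytes:
    """Frame a payload as <length>:<payload>,"""
    return b"%d:%s," % (len(payload), payload)


def decode_netstrings(buf: bytes) -> list:
    """Split a buffer holding zero or more complete netstrings."""
    items = []
    i = 0
    while i < len(buf):
        colon = buf.index(b":", i)          # end of the length prefix
        length = int(buf[i:colon])
        start = colon + 1
        end = start + length
        if buf[end:end + 1] != b",":
            raise ValueError("malformed netstring")
        items.append(buf[start:end])
        i = end + 1
    return items


# Credentials containing the very characters a naive split(":") would trip on.
message = encode_netstring(b"al:ice") + encode_netstring(b"p,a:ss")
username, password = decode_netstrings(message)
```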
I agree that it would be nice if there existed some way to describe a protocol in a single shared file such that both the server and the client could somehow "#include" or otherwise use that protocol.
Then when I fix a bug in the protocol, I could fix it in one place, recompile both the server and the client, and then things would Just Work -- rather than digging through a bunch of hard-wired constants on both sides.
Kind of like the way well-written C code and C++ code uses function prototypes in header files so that the code that calls the function on one side, and the function itself on the other side, can pass parameters in a way that both sides expect.
Tell me if you discover anything like that, OK?
Basically, you're looking for a standard. "The great thing about standards is that there are so many to choose from". Pick one and go with it; it's a lot easier than rolling your own. For this particular situation, look into HTTP "basic" authentication (as supported by Apache, among others), which joins the username and password with a colon and base64-encodes the result, as one possibility.
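As a small sketch of that scheme in Python (the function names are made up; the "Basic " prefix follows the HTTP Authorization header convention): the username and password are joined with a colon and the result is base64-encoded. Base64 is an encoding, not encryption, so this only makes sense over an encrypted connection. Because the receiver splits on the first colon, the username must not contain one.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def parse_basic_auth(header: str):
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")   # split on the first colon only
    return username, password
```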
I have worked with two main approaches.
The first is an ASCII-based protocol.
An ASCII-based protocol is usually built on a set of text commands that end with some defined delimiter (like a carriage return or semicolon), or on a text format such as XML or JSON. If your protocol is command based and not a lot of data is transferred back and forth, then this is the best way to go.
FIND\r
DO_SOMETHING\r
It has the advantage of being easy to read and understand because it is text based.
The disadvantage (may not be a problem but can be) is that there can be an unknown number of bytes being transferred back and forth from the client and the server. So if you need to know exactly how many bytes are being sent and received this may not be the type of protocol you want.
The other type of protocol is binary based, with message sizes that are fixed or sent in a header. This has the advantage that the client knows exactly how much data to expect. It can also potentially save you bandwidth, depending on what you're sending across, although ASCII can save you space too; it depends on your application requirements. The disadvantage of a binary protocol is that it is difficult to understand just by looking at it, so you constantly have to consult the documentation.
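Here is one possible shape for such a binary protocol, sketched in Python. The header layout (a 1-byte message type followed by a 4-byte big-endian payload length) and the message type constants are invented for illustration; a real protocol would define its own.

```python
import struct

# Hypothetical header: 1-byte message type + 4-byte big-endian payload length.
HEADER = struct.Struct("!BI")
MSG_LOGIN, MSG_AUTH_RESULT = 1, 2      # made-up message types


def pack_message(msg_type: int, payload: bytes) -> bytes:
    return HEADER.pack(msg_type, len(payload)) + payload


def unpack_messages(buf: bytes):
    """Return (complete messages, leftover bytes that need more data)."""
    messages = []
    offset = 0
    while len(buf) - offset >= HEADER.size:
        msg_type, length = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        if len(buf) - start < length:
            break                       # payload not fully received yet
        messages.append((msg_type, buf[start:start + length]))
        offset = start + length
    return messages, buf[offset:]


stream = pack_message(MSG_LOGIN, b"alice") + pack_message(MSG_AUTH_RESULT, b"\x01")
messages, rest = unpack_messages(stream + b"\x02\x00")   # trailing partial header
```

Because the receiver always knows how many bytes to wait for, it can handle messages that arrive split across several reads, which is the property described above.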
In practice, I tend to mix both strategies in protocols I have defined based on my application's requirements.