Extract subtitles from mp4 without time marks or position locators

I have an mp4 from a university lecture that has embedded subtitles. I am aware there are tools to extract the captions.
The lecturer is reading from a script which we don't have access to. I just want to extract all the text of the subtitles without the time stamps so I have the text in a word document to study.
Is this possible? If not, is there any tool or script that could help me eliminate the post-extraction time stamps?
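One possible approach, sketched under assumptions: first dump the subtitle track to a text-based file with a tool such as ffmpeg (for example `ffmpeg -i lecture.mp4 -map 0:s:0 subs.srt`, where the file names are placeholders and the track is assumed to be a text subtitle stream), then strip everything except the spoken text. The function name `subtitles_to_text` below is made up for illustration. It handles SRT and WebVTT timing lines, drops cue numbers, headers, and position settings, and removes inline tags such as `<i>`.

```python
import re

# Matches SRT ("00:00:01,000 --> 00:00:04,000") and WebVTT
# ("00:01.000 --> 00:04.000 position:10%") timing lines.
TIMING = re.compile(r"^\s*(?:\d+:)?\d{1,2}:\d{2}[.,]\d{3}\s+-->\s+")
# Inline markup such as <i> or <c.yellow>.
TAG = re.compile(r"<[^>]+>")


def subtitles_to_text(raw: str) -> str:
    """Return only the caption text from SRT or WebVTT content."""
    kept = []
    # Cues are separated by blank lines in both formats.
    blocks = re.split(r"\n\s*\n", raw.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        timing_idx = next(
            (i for i, line in enumerate(lines) if TIMING.match(line)), None
        )
        if timing_idx is None:
            continue  # header, NOTE, or STYLE block: no caption text here
        # Everything before the timing line is a cue number or identifier.
        for line in lines[timing_idx + 1:]:
            text = TAG.sub("", line).strip()
            # Skip empty lines and immediate repeats (common in rolling captions).
            if text and (not kept or kept[-1] != text):
                kept.append(text)
    return "\n".join(kept)
```

To use it, read the extracted file into a string, pass it to `subtitles_to_text`, and write the result to a `.txt` file that can be opened in Word.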

Related

Writing MP4 tags for M4A or MP4 audio files

I have a strange problem with MP4 tagging. I can figure out two styles of tags: one that works with mp3tag and tagscanner, and another that works with MusicBee. But I can't figure out one that works universally with all of them, so I write two sets of tags into the file.
Even this isn't enough. Players like AIMP and Clementine still can't read MP4 files I tagged this way. I need to open mp3tag, load my files, and save them; only then does it write tags that those music players understand. I can't find good documentation anywhere.
Does anyone know what kind of tags I need to write so that all of them can read the tags? I tried looking at MP4 files that work in all of them, and it is no use. I see tags like "Artist", but I already write a tag called "Artist". It shows up as "Artist" in exif as well, and this is the tag I wrote that MusicBee understands.
I use the AudioGenie Windows Library to write the tags. There are two different methods for writing a tag. One is called an ISLT text frame (I have no idea what that is) and requires an integer code as well as text when writing. The other is called an iTune text frame and requires a string frame ID as well as text.
I also tried to shove MP3 ID3v2 tags into both of those, to see if that was what the third group of players that can't read my tags wanted, but that didn't work. I only tried this because I read somewhere that ID3v2 tags are widely used in MP4 files (it was only in one comment on Stack Overflow, so I'm skeptical).
Could someone point me in the right direction?

How can I upload Kayaking or Rowing data via a TCX formatted file to Strava?

I'm recording workouts with a Flutter-based mobile application. I can successfully upload bike workouts. https://github.com/BirdyF/strava_flutter/blob/master/lib/Models/activity.dart#L916 lists a pretty wide variety of sports. However, I already noticed that "VirtualRide" reverts to a regular bike ride once it is uploaded to Strava.
Now that I'm uploading Kayaking data, it also shows up as a bike ride (in the middle of a small lake). At least it has the speed and the rpm (which is actually strokes per minute for kayaking). However, if I switch that activity over to Kayaking in Strava's UI, Strava stops showing the pace (speed) and the rpm.
I peeked at https://github.com/sanderroosendaal/rowingdata/blob/master/rowingdata/writetcx.py and that seems to output the rowing activities as "Other" sport. Its tests contain such TCXs as well. I just cannot believe TCX would be so limited. Does anyone have a pointer for me to solve this?
I'm outputting TCX because it's a textual format so it's easier to interpolate and debug than a binary FIT format. Since I can gzip it for upload its size is OK as well when compressed.
It seems that TCX is too limited a file format with respect to sport selection. I am developing a FIT-file-based upload instead, and that format is able to carry Kayaking and many more sports.

Generate Audio using a VTT as the text source with Google Text to Speech?

For sign language videos, there is no audio already present normally; however, we do provide the VTT caption file for captioning into English. The VTT does have a time-start and time-end for each text block of the cues.
I am wondering if it's possible to use the VTT as the text source to generate the Text to Speech audio, wherein the speed is controlled by the time codes in the caption file.
I'm currently not finding anything. Usually it's the other way around, audio to subtitles, but I want to go from subtitles to audio (US English).
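As a starting point, here is a minimal sketch of the timing side only: it parses the VTT cues and estimates how much faster or slower than a baseline each cue would need to be spoken to fit its time window. It does not call any text-to-speech service. The function names and the baseline of 2.5 words per second are assumptions for illustration, and how the resulting multiplier maps onto a particular service's speed setting would need to be checked against that service's documentation.

```python
import re

# One WebVTT timing line: optional hours, then mm:ss.mmm on both sides of "-->".
CUE = re.compile(
    r"(?:(\d+):)?(\d{1,2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{1,2}):(\d{2})\.(\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert captured timestamp parts to seconds (hours may be missing)."""
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_vtt(raw):
    """Return a list of cues as dicts with start, end (seconds) and text."""
    cues = []
    for block in re.split(r"\n\s*\n", raw.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = CUE.search(line)
            if match:
                parts = match.groups()
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append({
                    "start": _seconds(*parts[:4]),
                    "end": _seconds(*parts[4:]),
                    "text": text,
                })
                break  # one timing line per cue block
    return cues


def required_rate(cue, baseline_wps=2.5):
    """Speed multiplier needed to fit the cue text into its time window.

    baseline_wps is an assumed normal speaking rate in words per second.
    """
    duration = cue["end"] - cue["start"]
    words = len(cue["text"].split())
    if duration <= 0 or words == 0:
        return 1.0
    return (words / duration) / baseline_wps
```

Each cue's text could then be synthesized separately, sped up or slowed down by the computed multiplier, and placed at its `start` offset in the final audio track.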

Extract data from many PDF forms

I regularly receive large numbers of the same PDF form. I want to extract the data from them into a text file. I'd like to do this via a script of some sort. I'm working in a UNIX environment.
Is this possible? I've googled my brains out and can't find anything.
Text in PDF is represented by text elements in page content streams. The streams are commonly compressed. If you have the time and resources you can use ISO 32000-1:2008 or Adobe PDF 1.7 specification to build your own PDF parser. Or it may be more practical to use a 3rd party app as an intermediate translation step.
There are utilities that will decode the stream and give you clear text. One option is PDFtk Server which will work in your environment. Another option is to use the Poppler PDF Rendering Library which has a command line utility "pdftotext" useful for searching for strings in PDFs.
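For fillable forms specifically, a minimal sketch along these lines may help. It assumes PDFtk is installed and on the PATH, and that its `dump_data_fields` operation prints blocks separated by `---` lines in which `FieldName:` comes before `FieldValue:`. The function names are made up for illustration.

```python
import subprocess


def parse_fields(dump: str) -> dict:
    """Turn the text printed by `pdftk ... dump_data_fields` into a dict."""
    fields = {}
    name = None
    for line in dump.splitlines():
        key, sep, value = line.partition(": ")
        if line.startswith("---"):
            name = None  # a new field block begins
        elif key == "FieldName" and sep:
            name = value
            fields.setdefault(name, "")  # empty until a value is seen
        elif key == "FieldValue" and sep and name is not None:
            fields[name] = value
    return fields


def read_form(pdf_path: str) -> dict:
    """Run pdftk on one PDF and return its form fields (requires pdftk)."""
    result = subprocess.run(
        ["pdftk", pdf_path, "dump_data_fields"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_fields(result.stdout)
```

Calling `read_form` in a loop over the received files and writing each dict as one line of a text or CSV file would give the batch extraction asked about. For PDFs that are not fillable forms, `pdftotext` is the more likely fit.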

asp.net web application to convert pdf to word

Is there any clear and proper process to convert a pdf file into a word file with all formatting and images in asp.net web application?
The best way to do that is to use OCR. It will recognize the text and the images in the PDF file, and then you can save the result to a DOC file. I know of a third-party toolkit named leadtools that should help you meet your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo.
You can also check their website for more information, or contact their support team.
PDF is a presentational format where all the content is placed at absolute positions. There are no paragraphs or other structural elements (unless it is a Tagged PDF). Technically, a PDF can output every word character by character in any order and still look like normal text when rendered. Thus, a proper conversion to Word requires content recognition or some kind of OCR (e.g. ABBYY FineReader).
There are some paid components on the market that do text extraction, and some that convert pages to images (obviously, the latter is not a desirable approach for converting to Word).