byroot / pysrt

Python parser for SubRip (srt) files

License: GNU General Public License v3.0

Python 100.00%

pysrt's Issues

Deleting a subtitles line should also trigger a renumber

I'm using your library to build a simple script which downloads subtitles from opensubtitles.org, removes all the unnecessary lines ("synced by..." and similar entries), saves the srt file and then encodes it into the video file with ffmpeg.
I've found that ffmpeg doesn't like srt files where there's a missing line, and it refuses to work with them.

This is what happens now:

1
00:00:22,712 --> 00:00:24,478
first line

2
00:00:25,000 --> 00:00:31,074
second line

3
00:00:57,413 --> 00:01:00,180
third line

When I run

del sub[1]

I get

1
00:00:22,712 --> 00:00:24,478
first line

3
00:00:57,413 --> 00:01:00,180
third line

Which is not valid according to ffmpeg. Instead I should get

1
00:00:22,712 --> 00:00:24,478
first line

2
00:00:57,413 --> 00:01:00,180
third line

Could you please make pysrt automatically renumber all the lines when I close an srt file, so that there are no gaps in the numbering?
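As a stopgap outside the library, the renumbering described above can be sketched in plain Python. This is only a minimal sketch, not pysrt code: `renumber_srt` is a hypothetical helper, and it assumes that blocks are separated by blank lines and that the first line of each block is the counter. Other reports in this tracker call `subs.clean_indexes()` after deleting items, which may be the simpler route inside pysrt.

```python
import re


def renumber_srt(text):
    """Rewrite the counter line of every block so indexes run 1, 2, 3, ..."""
    # Assumption: blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    renumbered = []
    for new_index, block in enumerate(blocks, start=1):
        lines = block.split("\n")
        # Assumption: the first line of a block is the old counter.
        lines[0] = str(new_index)
        renumbered.append("\n".join(lines))
    return "\n\n".join(renumbered) + "\n"


sample = """1
00:00:22,712 --> 00:00:24,478
first line

3
00:00:57,413 --> 00:01:00,180
third line
"""
result = renumber_srt(sample)
```

A caption whose text itself contains a blank line would be split incorrectly by this sketch.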

Can't easy_install the package

On a Kubuntu Linux system, installing the package with easy_install (on both Python 2.4 and 2.6) is not so simple. Somehow the downloaded egg is NOT the right one.

Running "easy_install pysrt" uses this file:
http://pypi.python.org/packages/any/p/pysrt/pysrt-0.2.3.macosx-10.6-universal.tar.gz

After manually downloading this:
http://pypi.python.org/packages/source/p/pysrt/pysrt-0.2.3.tar.gz

the installation went well.

Join subtitles

Hi,
It would be great if srt command could 'unsplit' subtitles. Something like:
$ srt join movie.1.srt movie.2.srt movie.3.srt > movie.srt

pysrt 0.4.6 install error

pysrt 0.4.6 using pip install
Python 2.7
Mac OS 10.8

File "pysrt/__init__.py", line 3, in <module>

    from pysrt.srtfile import SubRipFile, SUPPORT_UTF_32_LE, SUPPORT_UTF_32_BE

  File "pysrt/srtfile.py", line 12, in <module>

    import chardet as charade

ImportError: No module named chardet

the latest code in master was not released

Would it be possible to publish a release and push it to PyPI? The 1.1.1 release is from April 2016 and does not contain the Python 3 classifiers added a few months later.

Inserting Subtitle Snippet

I think it would be nice to have a method where you could insert a subtitle into an existing .srt file. For example:

subs = pysrt.open('some/file.srt')
subs.insert("some subtitle", [start-time], [end-time])

The method would then shift all the existing subtitles down accordingly.
Just a thought...

Include tests into releases

Please include tests in releases; Gentoo Python packages run them before installing.

Improperly formatted timestamp results in an empty list

If I have an srt_string like this:

1
00:00:20 --> 00:00:24
I also enjoy the fruits of our labor.

2
00:00:24 --> 00:00:27
We truly are blessed creatures.

and then execute SubRipFile.fromstring(srt_string), it will result in an empty list. Shouldn't it just be able to recognize the improperly formatted timestamp and append ",00" to them? Getting an empty list (which can cause serious issues with programs that are interacting with SRTs) seems like a rather harmful result in this case.
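The pre-processing suggested above could be sketched as follows, applied to the text before it is handed to the parser. This is an assumption-laden sketch: `add_missing_milliseconds` is a hypothetical helper, not part of pysrt, and it appends ",000" (three digits, as in the well-formed timestamps elsewhere in this tracker) to any `HH:MM:SS` on a timestamp line that has no fractional part.

```python
import re

# Matches HH:MM:SS that is not already followed by a fraction or more digits.
_MISSING_MS = re.compile(r"\b(\d{2}:\d{2}:\d{2})(?![,.\d])")


def add_missing_milliseconds(srt_text):
    """Append ',000' to timestamps that lack a milliseconds part."""
    fixed_lines = []
    for line in srt_text.splitlines():
        # Only touch timestamp lines, never caption text.
        if "-->" in line:
            line = _MISSING_MS.sub(r"\1,000", line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)


broken = "1\n00:00:20 --> 00:00:24\nI also enjoy the fruits of our labor."
fixed = add_missing_milliseconds(broken)
```

Timestamps that already carry milliseconds are left untouched.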

Explicitly include the license in the repository.

Hi @ThiefMaster, @MestreLion, @chenhsiu, @ichernev

Thanks again for your contribution to pysrt.
Since the beginning pysrt was tagged as licensed under GNU GPL on PyPI, but I never explicitly included the license in the repository.

So it's not really a change of licensing, but I agree that you may not have been aware of pysrt licensing. So if you have any objection about this licensing, please tell me and I'll make sure to remove the code you own from the repository.

Regards.

Missing git tag for 1.1.1

It would be nice to keep git tags and PyPI releases in sync :)

Phantom pointers when assigning fields

I am trying to split subtitles where two speakers show up in the same frame. In my case this is indicated by newlines and hyphens ('\n-'). I use this code snippet to split such subtitles into multiple ones:

# Split any multi-speaker subtitles (denoted by '\n-') into multiple single-speaker subtitles
for i in reversed(xrange(len(subs))):
    if '\n-' in subs[i].text:
        # Split the subtitle at the hyphen and format the list
        lines = [line[1:] if line[0] == '-' else line for line in subs[i].text.split('\n-')]
        length_milli = 1000 * float(subs[i].end.seconds - subs[i].start.seconds) + float(subs[i].end.milliseconds - subs[i].start.milliseconds)
        interval_milli = int(length_milli / len(lines))
        dummy = pysrt.SubRipItem(0, start=subs[i].start, end=subs[i].end, text="") # Use this just to get the right formatting for the time
        dummy.shift(milliseconds =+ interval_milli) # Shift the dummy so its start time is now the end time we want
        for j in xrange(len(lines)):
            new_sub = pysrt.SubRipItem(0, start=subs[i].start, end=dummy.start, text=lines[j])
            new_sub.shift(milliseconds =+ (j * interval_milli))
            subs.append(new_sub)
        del subs[i]
subs.clean_indexes()

The basic gist is that I use a dummy object to format the times, so that I can take advantage of shifting. For example, a 3-phrase frame lasting 3 seconds is split 3 ways, giving 1 second for each new frame.

When I create the dummy as above using start=sub.start and end=sub.end and then shift the dummy, it also shifts the original subtitle. I suspect this was not the intended behavior.

I found that casting sub.start and sub.end to strings in the assignment (e.g. start=str(sub.start)) solved the issue. It appears that without the cast, however, I am actually assigning a reference or pointer of some kind rather than the value of the string.
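The behavior described above is what ordinary Python object aliasing looks like: passing a mutable object to a constructor that stores it directly shares the object instead of copying it. The sketch below reproduces the effect with hypothetical stand-in classes (`Time` and `Item` are illustrations, not pysrt's classes) and shows `copy.copy` as one way to get an independent value.

```python
import copy


class Time:
    """Hypothetical stand-in for a mutable time object."""

    def __init__(self, ms):
        self.ms = ms

    def shift(self, ms):
        self.ms += ms  # mutates the object in place


class Item:
    """Hypothetical stand-in for a subtitle item."""

    def __init__(self, start, end):
        # Storing the arguments directly keeps references to the caller's objects.
        self.start = start
        self.end = end


original = Item(Time(0), Time(3000))

# Aliased: both items share the same Time objects.
aliased = Item(original.start, original.end)
aliased.start.shift(1000)
shared_value = original.start.ms  # the original moved too

# Independent: copy the time objects before handing them over.
original.start.ms = 0
independent = Item(copy.copy(original.start), copy.copy(original.end))
independent.start.shift(1000)
```

Converting to a string and back, as in the workaround above, has the same effect as copying because it forces a new object to be built.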

BOM markers are not handled properly

If the srt file starts with a BOM ('\xef\xbb\xbf') it fails the subtitle parse, so the first subtitle is missing.

Maybe a manual test after open to check for these bytes, or a library to handle it automatically?
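For UTF-8 input, Python's standard library already offers the automatic handling asked about: the `utf-8-sig` codec strips a leading BOM if one is present and otherwise behaves like `utf-8`. A minimal sketch with made-up sample data:

```python
import codecs

# Sample bytes: a UTF-8 BOM followed by one subtitle block.
raw = codecs.BOM_UTF8 + "1\n00:00:02,000 --> 00:00:03,000\nHello world\n".encode("utf-8")

plain = raw.decode("utf-8")         # the BOM survives as U+FEFF before the "1"
stripped = raw.decode("utf-8-sig")  # the BOM is removed

first_line = stripped.splitlines()[0]
```

Whether pysrt should apply this by default is a design decision for the library; the sketch only shows the decoding step.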

time passed to at() will not find caption if the time passed in equals start time of caption

For example in the file captions.srt:
1 00:00:02,000 00:00:03,000 Hello world

import pysrt
captions = pysrt.open('captions.srt')
select = captions.at(seconds=2)
select
[]
select = captions.at(seconds=3)
select
[]

In-place mode does not write the entire file

For large-enough files (100KiB+), the in-place output always gets cut off.

I'm guessing the buffer does not get flushed / the file handle does not get closed?

merge feature 2

Hello, I wrote a small script that merges two subtitle files (assuming each is in a different language) into one file. My motivation: my wife is Taiwanese and I am Czech. We want to watch movies together with both Chinese and Czech subtitles, and I want her to have a chance to learn the other language too.

Here is the small script. I hope somebody else can use it too.

#!/usr/bin/env python
# -*- coding: utf8 -*-

import sys
import getopt
from pysrt import SubRipFile
from pysrt import SubRipItem
from pysrt import SubRipTime


def join_lines(txtsub1, txtsub2):
    if len(txtsub1) > 0 and len(txtsub2) > 0:
        return txtsub1 + '\n' + txtsub2
    else:
        return txtsub1 + txtsub2


def find_subtitle(subtitle, from_t, to_t, lo=0):
    i = lo
    while (i < len(subtitle)):
        if (subtitle[i].start >= to_t):
            break

        if subtitle[i].start <= from_t and to_t <= subtitle[i].end:
            return subtitle[i].text, i
        i += 1

    return "", i



def merge_subtitle(sub_a, sub_b, delta):
    out = SubRipFile()
    intervals = [item.start.ordinal for item in sub_a]
    intervals.extend([item.end.ordinal for item in sub_a])
    intervals.extend([item.start.ordinal for item in sub_b])
    intervals.extend([item.end.ordinal for item in sub_b])
    intervals.sort()

    j = k = 0
    for i in xrange(1, len(intervals)):
        start = SubRipTime.from_ordinal(intervals[i-1])
        end = SubRipTime.from_ordinal(intervals[i])

        if (end-start) > delta:
            text_a, j = find_subtitle(sub_a, start, end, j)
            text_b, k = find_subtitle(sub_b, start, end, k)

            text = join_lines(text_a, text_b)
            if len(text) > 0:
                item = SubRipItem(0, start, end, text)
                out.append(item)

    out.clean_indexes()
    return out

def usage():
    print "Usage: ./srtmerge [options] lang1.srt lang2.srt out.srt"
    print
    print "Options:"
    print "  -d <milliseconds>         The shortest time length of the one subtitle"
    print "  --delta=<milliseconds>    default: 500"
    print "  -e <encoding>             Encoding of input and output files."
    print "  --encoding=<encoding>     default: utf_8"


def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hd:e:', ["help", "encoding=", "delta="])
    except getopt.GetoptError, err:
        print str(err)
        usage()
        sys.exit(2)

    #Settings default values
    delta = SubRipTime(milliseconds=500)
    encoding="utf_8"
    #-

    if len(args) != 3:
        usage()
        sys.exit(2)

    for o, a in opts:
        if o in ("-d", "--delta"):
            delta = SubRipTime(milliseconds=int(a))
        elif o in ("-e", "--encoding"):
            encoding = a
        elif o in ("-h", "--help"):
            usage()
            sys.exit()

    subs_a = SubRipFile.open(args[0], encoding=encoding)
    subs_b = SubRipFile.open(args[1], encoding=encoding)
    out = merge_subtitle(subs_a, subs_b, delta)
    out.save(args[2], encoding=encoding)

if __name__ == "__main__":
    main()

Unicode support somewhat broken

Running

a = pysrt.open('tests/static/utf-8.srt')
print a[1]

Gives me:

<ipython-input-11-e9d376687425> in <module>()
----> 1 print a[1]

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 51: ordinal not in range(128)

Running `print a[1].__str__()` is fine though, which puzzles me. When dealing with Unicode, I gather that `str(text)` and `str(position)` in the init method of SubRipItem should not be used, but I am not so sure. I tried fixing this, but since pysrt wants to support both Python 2 and 3, I am not sure how to go about it and have failed miserably so far. Do you have any suggestions?

bypass the first item

Hi Jean,
The first item is skipped. I have only tried it on one file, so maybe that file is malformed, but I don't think so.
Thanks

Subtitles validation

I try to validate subtitles with this:

import codecs
import pysrt
from charade.universaldetector import UniversalDetector


def is_valid_subtitle(path):
    u = UniversalDetector()
    for line in open(path, 'rb'):
        u.feed(line)
    u.close()
    encoding = u.result['encoding']
    source_file = codecs.open(path, 'rU', encoding=encoding, errors='replace')
    try:
        for _ in pysrt.SubRipFile.stream(source_file, error_handling=pysrt.SubRipFile.ERROR_RAISE):
            pass
    except pysrt.Error:
        return False
    except UnicodeEncodeError:  # Workaround for https://github.com/byroot/pysrt/issues/12
        pass
    return True

But unfortunately for some subtitles it fails even though the file is a valid subtitle. For example this one: https://docs.google.com/open?id=0B2q9iBGZdj6qOXZrbFpiV2ozOHc
I think there should be a different kind of InvalidItem error. It could be subclassed to raise, in this case, an EmptyText error.

Although, I'm not sure this should raise an error at all because this doesn't mean the item is invalid, it just has its text empty.

Error when shifting files

I have these subs:

1
00:00:00,058 --> 00:00:02,942
Previously on AMC's
Breaking Bad...

2
00:00:02,984 --> 00:00:04,513
Sooner or later
someone is gonna flip.

If I shift these subs by -1 second, the resulting file has all of them at 00:00:00,000

1
00:00:00,000 --> 00:00:00,000
Previously on AMC's
Breaking Bad...

2
00:00:00,000 --> 00:00:00,000
Sooner or later
someone is gonna flip.

3
00:00:00,000 --> 00:00:00,000
I've got nine guys.
They were part of the

I think it should be 00:00:00,000 at first and then -1 second.

New release

Would it be possible to release a new version with the changes currently in master? Just so that it's possible to use the new features.

merge feature

Hello,

I was looking for a dual subtitle feature. I didn't find one.
But I found this fantastic package, I just would like to share my merge script here.
For me it works perfectly on my VLC player.
Perhaps it might be added to your command line tool.

Usage:
./subtitle_merge.py en.srt de.srt en_de.srt

Best regards,
Karel.

See the code below....

SubRipTime.__init__ should maybe cast the arguments to int or float (aka “TypeError: '>' not supported between instances of 'SubRipTime' and 'dict'” in slice())

If you receive some times as strings, split them into parts and try calling SubRipFile.slice with a dict of those parts, e.g.:

subs.slice(starts_after={'minutes': '11', 'seconds': '22'})

then you'll get a rather cryptic error: TypeError: '>' not supported between instances of 'SubRipTime' and 'dict'. This is caused by a different error: TypeError: '>' not supported between instances of 'str' and 'int' in ComparableMixin._compare. Which in turn means that the ordinal field in one of the objects is a string.

The root cause is that, when passed strings as the arguments, the SubRipTime constructor multiplies them by HOURS_RATIO, MINUTES_RATIO and SECONDS_RATIO respectively, and adds them all together, silently resulting in a long-ass string instead of a number.

To either handle the use-case or make the output more informative, it would be prudent to convert the arguments to numbers, or to explicitly forbid non-number arguments. IMO the first approach is better, especially since Python itself would then complain if the arguments really don't contain numbers. One other question to decide is whether the constructor should accept fractional times and thus convert to float and not just int. Milliseconds should probably be integers, but I might want to cut e.g. 1.5 hours into a film. Some workflows involving arithmetic might even produce fractional milliseconds, which of course should still be cut to integers after parsing.

Alternatively, or in addition, you might want to pass exceptions through in ComparableMixin._compare instead of returning NotImplemented: firstly, AttributeError and TypeError may have more possible causes when calling _cmpkey than just the two envisioned cases. Secondly, use of a mixin suggests more complex workflows than plain comparison of two values—as in the very case of SubRipTime—while Python's resulting message is rather unenlightening. So passing exceptions through seems to be more informative, as they would properly indicate invalid use of ComparableMixin.
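The casting approach discussed above can be sketched as follows. The helper `to_ordinal` is hypothetical, and the ratio values are an assumption (milliseconds per unit); only the constant names come from the report. The last line shows why uncast strings fail silently: multiplying a `str` by an `int` repeats the string.

```python
# Assumed values: milliseconds per unit. Only the names come from the report.
HOURS_RATIO = 3600000
MINUTES_RATIO = 60000
SECONDS_RATIO = 1000


def to_ordinal(hours=0, minutes=0, seconds=0, milliseconds=0):
    """Hypothetical helper: total milliseconds from str, int or float parts."""
    # float() accepts numeric strings and fractions, and raises ValueError
    # for anything that is not a number.
    total = (float(hours) * HOURS_RATIO
             + float(minutes) * MINUTES_RATIO
             + float(seconds) * SECONDS_RATIO
             + float(milliseconds))
    # Cut any fractional milliseconds down to an integer.
    return int(total)


from_strings = to_ordinal(minutes="11", seconds="22")
from_fraction = to_ordinal(hours=1.5)

# Without a cast, str * int silently repeats the string.
repeated = "2" * 3
```

Converting through `float` first covers the fractional-hours case raised above, while non-numeric input still fails loudly.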

Encode error

I can't run srt with this file http://dl.dropbox.com/u/1788271/Bones.S07E01.HDTVRip.srt
It is cp1251
I have the following error:

Traceback (most recent call last):
  File "/usr/local/bin/srt", line 9, in <module>
    load_entry_point('pysrt==0.4.1', 'console_scripts', 'srt')()
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 190, in main
    SubRipShifter().run(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 118, in run
    self.arguments.action()
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 164, in break_lines
    self.input_file.break_lines(self.arguments.length)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 177, in input_file
    encoding=encoding, error_handling=SubRipFile.ERROR_LOG)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 131, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 159, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/usr/lib/python2.7/UserList.py", line 88, in extend
    self.data.extend(other)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 190, in stream
    yield SubRipItem.from_lines(source)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtitem.py", line 79, in from_lines
    return cls(index, start, end, body, position)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtitem.py", line 21, in __init__
    self.index = int(index)
UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in position 0: invalid decimal Unicode string

JUNK

Fix small typo in example

part = subs.slice(starts_after={'minutes': 2, seconds': 30}, ends_before={'minutes': 3, 'seconds': 40})
part.shift(seconds=-2)

should be (the opening ' before seconds is missing)

part = subs.slice(starts_after={'minutes': 2, 'seconds': 30}, ends_before={'minutes': 3, 'seconds': 40})
part.shift(seconds=-2)

Sphinx documentation

Even though it is a small library, a few pages of documentation would be welcome.

http://subliminal.readthedocs.org/

Python 2.5 compatibility issue (encodings)

Under python 2.5 utf32 codecs are not defined.
There is also an issue with BOM handling that often strips the first real character.

ValueError: invalid literal for int()

Hi,

I encountered this error:

Traceback (most recent call last):
  File "/home/antoine/workspace/python/subliminal/subliminal/api.py", line 250, in download_best_subtitles
    subtitle_text = provider.download_subtitle(subtitle)
  File "/home/antoine/workspace/python/subliminal/subliminal/providers/podnapisi.py", line 161, in download_subtitle
    if not is_valid_subtitle(subtitle_text):
  File "/home/antoine/workspace/python/subliminal/subliminal/subtitle.py", line 106, in is_valid_subtitle
    pysrt.from_string(subtitle_text, error_handling=pysrt.ERROR_RAISE)
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srtfile.py", line 188, in from_string
    new_file.read(source.splitlines(True), error_handling=error_handling)
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srtfile.py", line 202, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/usr/lib/python2.7/UserList.py", line 88, in extend
    self.data.extend(other)
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srtfile.py", line 243, in stream
    yield SubRipItem.from_lines(source)
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srtitem.py", line 66, in from_lines
    return cls(index, start, end, body, position)
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srtitem.py", line 28, in __init__
    self.end = SubRipTime.coerce(end or 0)
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srttime.py", line 128, in coerce
    return cls.from_string(other)
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srttime.py", line 170, in from_string
    return cls(*(int(i) for i in items))
  File "/home/antoine/.virtualenvs/subliminal/local/lib/python2.7/site-packages/pysrt/srttime.py", line 170, in <genexpr>
    return cls(*(int(i) for i in items))
ValueError: invalid literal for int() with base 10: '197?'

You can download the subtitle causing this here : http://podnapisi.net/static/podnapisi/c/1/8/c18482a60f7ce6f94a8a33947aa723e6c3bd2e18.zip

Tests are broken in tar from pypi

https://pypi.python.org/packages/source/p/pysrt/pysrt-0.5.1.tar.gz#md5=c5d44c8abac6089cb8cd03ddee26faa5 does not have tests/static/, so 'nosetests --with-coverage --cover-package=pysrt' fails.

how to create a new line?

maybe you can add some method like:

subs.addAfter(LineNumber, StartTime, EndTime, SubtitleContent)
subs.addBefore(LineNumber, StartTime, EndTime, SubtitleContent)

subs.addAfterLastLine(StartTime, EndTime, SubtitleContent)
subs.addAfterLastLine(duringtime, SubtitleContent)

File encoding guessing with charade (= chardet)

This could be used on a failed attempt to open a file due to UnicodeDecodeError. charade would be called to detect the encoding and a second attempt to open the file would be done.
This would be the default behavior and suppressed if encoding argument is not None.

What do you think?

python 2.4 compatibility

I'm planning to use this inside a project using Plone (a well-known CMS built with Zope and Python). The last stable version of Plone uses Python 2.4, so the current egg is not compatible because of its use of try-except-finally.

Changing the block at line 87 of pysrtfile.py like this makes it usable!

            try:
                try:
                    new_item = SubRipItem.from_string(source)
                    new_file.append(new_item)
                except InvalidItem, error:
                    cls._handle_error(error, error_handling, path, index)
            finally:
                string_buffer.truncate(0)

srt shift with negative offset does not work under Windows and Linux

installation
pip install pysrt

run Windows command
C:\usr\local\python27\Scripts\srt.exe -i shift -65s sample.srt

or linux
srt -i shift -65s sample.srt

return output

usage: srt shift [-h] offset
srt-script.py shift: error: too few arguments

I tried running
C:\usr\local\python27\Scripts\srt.exe -i shift "-65s" sample.srt
with the same error, and
C:\usr\local\python27\Scripts\srt.exe -i shift '-65s' sample.srt
just shifts sample.srt by a positive offset, exactly like
C:\usr\local\python27\Scripts\srt.exe -i shift 65s sample.srt
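The usage message above looks like argparse output. If so, a likely explanation is that argparse treats an argument such as `-65s` as an option, because it starts with `-` and is not a plain number. The general argparse remedy is the `--` separator, which marks the end of options. The sketch below uses a hypothetical stand-alone parser (`shift-demo`), not the real `srt` command, and whether `srt` itself accepts `--` in this position is not verified here.

```python
import argparse

# Hypothetical parser with two positional arguments, for illustration only.
parser = argparse.ArgumentParser(prog="shift-demo")
parser.add_argument("offset")
parser.add_argument("file")

# Everything after "--" is taken as positional, even if it starts with "-".
args = parser.parse_args(["--", "-65s", "sample.srt"])
```

Quoting the value in the shell does not help, since the quotes are removed before the program sees its arguments.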

Get rid of the pysrt3 egg on pypi

Maybe it can just be an empty egg that requires pysrt >= 1.0.0

Install tests to subdir

Installing tests to /usr/lib64/python2.7/site-packages/ is wrong, they should be installed to /usr/lib64/python2.7/site-packages/pysrt. Else file collisions will happen.

Subtitles file validation

Would be great to do the following:

import pysrt
pysrt.is_valid('/path/to/Inception.srt')

What it would do is check if the file is a subtitle or not, with the correct format.
I tried to open an invalid subtitle file and it raised a ValueError somewhere in the code, would be better to have it raise an "InvalidFormatError" or something.

>>> SubRipFile.open('test.srt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/pysrt/srtfile.py", line 127, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/usr/local/lib/python2.6/dist-packages/pysrt/srtfile.py", line 155, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/usr/lib/python2.6/UserList.py", line 88, in extend
    self.data.extend(other)
  File "/usr/local/lib/python2.6/dist-packages/pysrt/srtfile.py", line 186, in stream
    yield SubRipItem.from_lines(source)
  File "/usr/local/lib/python2.6/dist-packages/pysrt/srtitem.py", line 56, in from_lines
    start, end, position = cls.split_timestamps(lines[1])
  File "/usr/local/lib/python2.6/dist-packages/pysrt/srtitem.py", line 62, in split_timestamps
    start, end_and_position = line.split('-->')
ValueError: need more than 1 value to unpack

Problem with SubRipFile.from_string

I have a problem with the from_string API: a Unicode error I'm not able to fix.

The file is here (but for application reasons I can't use the SubRipFile.open method):
http://releases.flowplayer.org/data/buffalo.srt

Some tested examples:

from pysrt import SubRipFile
p = '/Users/luca/Documents/buffalo.srt'
SubRipFile.open(p)
Traceback (most recent call last):
File "", line 1, in ?
File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 81, in open
source = unicode(string_buffer.read(), new_file.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 47-48: invalid data
SubRipFile.open(p, encoding='latin1')
[... THIS IS OK, IT WORKS ...]
st = open(p).read()
SubRipFile.from_string(st)
Traceback (most recent call last):
File "", line 1, in ?
File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 107, in from_string
return cls.open(file_descriptor=StringIO(source))
File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 81, in open
source = unicode(string_buffer.read(), new_file.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 49-50: invalid data
SubRipFile.from_string(st.decode('iso-8859-1'))
Traceback (most recent call last):
File "", line 1, in ?
File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 107, in from_string
return cls.open(file_descriptor=StringIO(source))
File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 81, in open
source = unicode(string_buffer.read(), new_file.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 17348-17349: invalid data
SubRipFile.from_string(st.decode('iso-8859-1').encode('utf-8'))
[]

Any tips? Right now I can work around this problem using a temp file, but it seems there is some problem in the method.

chardet dependency does not resolve with easy_install

Hello

easy_install does not resolve chardet dependency by itself. It has to be installed manually before.

Regards,

Opening a .srt file seems to load a list-like object

Following the documentation I tried this:

from pysrt import SubRipFile, SubRipItem, SubRipTime
subs = SubRipFile('buffalo.srt')
subs
['b', 'u', 'f', 'f', 'a', 'l', 'o', '.', 's', 'r', 't']
subs[0]
'b'
subs[1]
'u'
subs[2]
'f'

The "buffalo.srt" file is the one used for a demo of the Flowplayer Flash player, at this URL:
http://releases.flowplayer.org/data/buffalo.srt

I can't tell whether something is going wrong or whether the srt file is somehow corrupted or not well formed.
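The output above is consistent with the path string being treated as a sequence of items rather than as a file name. Tracebacks elsewhere in this tracker pass through `UserList.extend`, which suggests `SubRipFile` is built on `UserList`; assuming its constructor forwards its first argument as the initial items, the standard library alone reproduces the effect. `FileLike` below is a hypothetical stand-in, not pysrt's class.

```python
from collections import UserList


class FileLike(UserList):
    """Hypothetical stand-in for a list-like subtitle container."""


# A string is iterable, so it is unpacked into single characters.
subs = FileLike("buffalo.srt")
first_three = list(subs)[:3]
```

Other reports in this tracker read files with `pysrt.open(path)` or `SubRipFile.open(path)` instead of calling the constructor with a path.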

Script: parsing transcript .srt files into readable text

Hello,

I am working through an online class and trying to produce notes based on the instructional video content. Since many of the concepts covered in these videos are worth taking note of, I'm finding myself writing out nearly every line spoken by the instructor. Obviously, this process is laborious and extremely time-consuming. I am wondering if there is an easier way to extract the text from these videos using an srt tool to help parse and modify the text.

The syntax of the transcript files for each video is identical to the standard srt format. Here's an example:

1
00:00:00,710 --> 00:00:03,220
Rob just showed us how we can
make things accessible to

2
00:00:03,220 --> 00:00:05,970
anyone who can't use a mouse or
pointing device.

3
00:00:05,970 --> 00:00:09,130
Whether that's because it's any
type of physical impairment or

4
00:00:09,130 --> 00:00:11,510
a technology issue or
simply personal preference.

Does pysrt currently provide any tools for modifying text content so that it's formatted into a more readable format? To clarify, for the above example, I would like to remove blank lines, lines beginning with the record number and time-stamp, and then join the remaining lines, adding spaces after periods, like so:

Rob just showed us how we can make things accessible to anyone who can't use a mouse or pointing device. Whether that's because it's any type of physical impairment or a technology issue or simply personal preference.

I would like to produce the output above from the example and to apply the same transformation to the other files in the series. I am pretty rusty with Python, though I believe this could be implemented fairly easily with an understanding of common string methods.

Can anyone contributing to this project let me know how this is done or if the functionality already exists in pysrt?

Thanks!
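The transformation requested above needs nothing beyond common string methods. Here is a minimal sketch that works on the raw text without pysrt; `srt_to_text` is a hypothetical helper, and it assumes that no caption line consists only of digits.

```python
def srt_to_text(srt_text):
    """Join subtitle text into one paragraph, dropping counters and timestamps."""
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank separator between blocks
        if line.isdigit():
            continue  # counter line (assumes no caption is digits only)
        if "-->" in line:
            continue  # timestamp line
        kept.append(line)
    # A single space between lines also puts a space after each period.
    return " ".join(kept)


sample = """1
00:00:00,710 --> 00:00:03,220
Rob just showed us how we can
make things accessible to

2
00:00:03,220 --> 00:00:05,970
anyone who can't use a mouse or
pointing device.
"""
text = srt_to_text(sample)
```

To process a whole series, the same function can be applied to the contents of each file in turn.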

Use chardet

charade has been merged into chardet and is not maintained anymore so you might want to switch :)

Newline between subs should be done in SubRipFile

Right now the SubRipItem tests contain a trailing newline, like this:

u'MILES:\nNo one stops us.\n'
u'No one ever has.\n'

When changing the text you need to ensure to keep (or re-add) that trailing newline since otherwise the next sub in the file will come directly after the edited one instead of after the edited one plus one newline.

This separating newline is not really part of the subtitle itself though, so it shouldn't be there and be added automatically when saving the subtitles to a file.

UnicodeDecodeError

About 40% of subtitles fail to parse because of Unicode errors.

Traceback (most recent call last):
  File "/home/username/bin/nocc", line 11, in <module>
    load_entry_point('nocc', 'console_scripts', 'nocc')()
  File "/home/username/projects/nocc/nocc/nocc.py", line 155, in main
    nocc(fn)
  File "/home/username/projects/nocc/nocc/nocc.py", line 47, in nocc
    subs = pysrt.open(filename)
  File "/home/username/.local/venvs/nocc/lib/python3.5/site-packages/pysrt/srtfile.py", line 153, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/home/username/.local/venvs/nocc/lib/python3.5/site-packages/pysrt/srtfile.py", line 180, in read
    self.eol = self._guess_eol(source_file)
  File "/home/username/.local/venvs/nocc/lib/python3.5/site-packages/pysrt/srtfile.py", line 257, in _guess_eol
    first_line = cls._get_first_line(string_iterable)
  File "/home/username/.local/venvs/nocc/lib/python3.5/site-packages/pysrt/srtfile.py", line 269, in _get_first_line
    first_line = next(iter(string_iterable))
  File "/home/username/.local/venvs/nocc/lib/python3.5/codecs.py", line 711, in __next__
    return next(self.reader)
  File "/home/username/.local/venvs/nocc/lib/python3.5/codecs.py", line 642, in __next__
    line = self.readline()
  File "/home/username/.local/venvs/nocc/lib/python3.5/codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "/home/username/.local/venvs/nocc/lib/python3.5/codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 4: invalid start byte

Please enable "errors=ignore" in open()
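Until such an option exists, one workaround is to decode the bytes yourself before parsing. The sketch below tries a list of encodings in order and only falls back to lossy decoding at the end. `decode_with_fallback` is a hypothetical helper, and the choice of `cp1252` as the second candidate is an assumption that suits Western European files only. Note that `errors="ignore"` silently drops bytes, so the sketch uses `errors="replace"`, which leaves a visible marker instead.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: decode bytes using the first encoding that fits."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: substitute U+FFFD for bytes that cannot be decoded.
    return data.decode(encodings[0], errors="replace")


# 0xa3 is not valid as a UTF-8 start byte, but is the pound sign in cp1252.
text = decode_with_fallback(b"Price: \xa35")
```

Other reports in this tracker pass an explicit `encoding` argument to `open`, which is the better choice when the encoding is known.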

python3

Would improve support for languages and UTF-8 a lot.

Is it planned?

Edit: sorry, just found python3 branch.

Be more liberal in what to expect

The pysrt library should also minimally support the WebVTT subtitle format for input and output, as it is so very close to the original SRT format, and one would need a program to convert an SRT to the WebVTT format and vice versa for HTML5 videos.

Furthermore, the counter field before the time code should also be made optional for SRT too, as I have seen subtitles without the counter (only the timestamp); they also currently work in many players as is, but pysrt just returns an empty list for such files unless I add the 1, 2, 3... by hand before the timestamps.

call clean_indexes after opening a file?

The following is a perfectly valid SRT file (at least anything that handles them will open it just fine), let's call it subs.srt:

3
00:08:17,317 --> 00:08:19,328
It is life or death, James.

The following happens

a = pysrt.open('subs.srt')
a[0].index
>>>> 3
a.clean_indexes()
a[0].index
>>>> 1

Should not clean_indexes() be called after opening the subtitles by default? There is also the issue that index after calling clean_indexes() starts at 1 and python lists start at 0, but I do not think it would be wise to do anything about it.

No tags for releases

Hi,

I'm trying to create a Debian/Ubuntu package for your library, and it would be really helpful if you could create a git tag for each release you make.

Thanks in advance!

Can't parse text with empty line

Hi, and first, thanks for this handy library. I don't know if my error is due to a bad srt file or if it's a bug, but with an srt file like this:

1
00:22:10,440 --> 00:22:15,195
Je suis coincée au boulot,

j'aurai 10 minutes de retard.

305
00:22:15,960 --> 00:22:19,157
John, je suis dans les embouteillages.
La 5e Avenue est en travaux.

When I run the command: srt shift 35s file_with_empty_line.srt, I've got the following error:

PySRT-InvalidItem(line 5): 
Traceback (most recent call last):
  File "/home/john/Documents/git/pysrt/pysrt/srtfile.py", line 212, in stream
    yield SubRipItem.from_lines(source)
  File "/home/john/Documents/git/pysrt/pysrt/srtitem.py", line 83, in from_lines
    raise InvalidItem()
pysrt.srtexc.InvalidItem: j'aurai 10 minutes de retard.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/miniconda3/bin/srt", line 9, in <module>
    load_entry_point('pysrt', 'console_scripts', 'srt')()
  File "/home/john/Documents/git/pysrt/pysrt/commands.py", line 222, in main
    SubRipShifter().run(sys.argv[1:])
  File "/home/john/Documents/git/pysrt/pysrt/commands.py", line 140, in run
    self.arguments.action()
  File "/home/john/Documents/git/pysrt/pysrt/commands.py", line 161, in shift
    self.input_file.shift(milliseconds=self.arguments.time_offset)
  File "/home/john/Documents/git/pysrt/pysrt/commands.py", line 205, in input_file
    encoding=encoding, error_handling=SubRipFile.ERROR_LOG)
  File "/home/john/Documents/git/pysrt/pysrt/srtfile.py", line 153, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/home/john/Documents/git/pysrt/pysrt/srtfile.py", line 181, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/opt/miniconda3/lib/python3.5/collections/__init__.py", line 1091, in extend
    self.data.extend(other)
  File "/home/john/Documents/git/pysrt/pysrt/srtfile.py", line 215, in stream
    cls._handle_error(error, error_handling, index)
  File "/home/john/Documents/git/pysrt/pysrt/srtfile.py", line 311, in _handle_error
    sys.stderr.write(error.args[0].encode('ascii', 'replace'))
TypeError: write() argument must be str, not bytes

Is it possible to create a new srt file using this?

This seems geared towards editing existing files. Is there a way to create a new .srt file in python with the existing functions?

Python 3 support

Please add support for Python 3
