ccextractor / ccextractor

CCExtractor - Official version maintained by the core team

License: GNU General Public License v2.0

Makefile 0.81% Shell 0.55% C 92.48% CMake 0.50% Batchfile 0.06% Objective-C 0.40% M4 0.83% Python 0.10% Starlark 0.04% Rust 4.23%

subtitles c tesseract ocr tesseract-ocr dvb video image-processing image teletext

ccextractor's Introduction

CCExtractor

CCExtractor is a tool used to produce subtitles for TV recordings from almost anywhere in the world. We intend to keep up with all sources and formats.

Subtitles are important for many people. If you're learning a new language, subtitles are a great way to learn it from movies or TV shows. If you are hard of hearing, subtitles can help you better understand what's happening on the screen. We aim to make it easy to generate subtitles by using the command line tool or Windows GUI.

The official repository is CCExtractor/ccextractor, and master is the most stable branch.

Features

Extract subtitles in real-time
Translate subtitles
Extract closed captions from DVDs
Convert closed captions to subtitles

Programming Languages & Technologies

The core functionality is written in C. Other languages used include C++ and Python.

Installation and Usage

Downloads for precompiled binaries and source code can be found on our website.

Extracting subtitles is relatively simple. Just run the following command:

ccextractor <input>

This will extract the subtitles.

More usage information can be found on our website.

You can also find the list of parameters and their brief description by running ccextractor without any arguments.

You can find sample files on our website to test the software.

Compiling CCExtractor

To learn more about how to compile and build CCExtractor for your platform check the compilation guide.

Support

By far the best way to get support is by opening an issue at our issue tracker.

When you create a new issue, please fill in the needed details in the provided template. That makes it easier for us to help you more efficiently.

If you have a question or a problem you can also contact us by email or chat with the team in Slack.

If you want to contribute to CCExtractor but can't submit code patches, issues, or video samples, you can also donate to us.

Contributing

You can contribute to the project by reporting issues, forking it, modifying the code and making a pull request to the repository. We have some rules, outlined in the contributor's guide.

News & Other Information

News about releases and modifications to the code can be found in the CHANGES.TXT file.

For more information visit the CCExtractor website: https://www.ccextractor.org

License

GNU General Public License version 2.0 (GPL-2.0)

ccextractor's People

Contributors

rkuchumov moonfunjohn priyankbhatt mikayuoadas mayur-desh brooss rahul-raturi akirato theorange okkosh hardikchoudhary12 ineskhandelwal gvishal amoljangam vishalmidha siddhanjay dhaval27 okisseloff codeman38 slifty december-soul afeldspar iha sagar20896 kwikadi rahulroxx avikantz karanagarwal17 kunaf arbitrer bigharshrag ioanas96 yorkhe at25sep aypatil93 abhinav95 nikofil phanindra707 ps259 isacdaavid abhishek-vinjamoori nehabagari vinayakathavale ishan-maheshwari liontooth shubhsingh594 gshruti95 asnelchristian nimitbhardwaj breezet breezetemple maxkoryukov ajeyodey eztourist itzik009 xuhairong 17eparker mauzyprime danilafe doluz ychunwei isamorphic eviltak deeprajpandey andreea-zaharia r4huln kerct soumyaranjanbhol2002 harshkraj tarunbod harpreetk1 eianklock marissaangell adityajayantgupta sedoid karinakozarova e-l-e-e jeffsieu arpitadash awesometushar000 jaygupta2003 vibhorgupta96 mediathand kalpana2k1 saurabhkapur sidgairo18 himanshu-dixit namanyadav12 zeokav mrmc madan96 hexiay barun511 sza-1 satyammittal alexandrumc siddharthjindal1997 hurda sathkrith jainamritanshu

ccextractor's Issues

Seek logic using chapters in ifo/bup

When startat is passed, seeking should be done using the IFO/BUP files on the DVD.

GSOC - Worldwide repository

Building on the "Real time uploading" task, it makes sense to aggregate all available caption data.

A problem with captioning is of course that you need to be able to receive the TV signal, which makes it impossible for any individual entity to capture data from all TV channels in the world. We want to allow everyone with the required hardware to contribute to a global effort; the task is to build a scalable system that is able to receive data from a large number of providers. Google's App Engine would be a good option to start building on.

CEA-708 is not supported at all for MP4

The MP4 code doesn't detect CEA-708, so obviously the decoder doesn't have a chance to process it.

I looked into this a bit and it might not be a lot of work.

We have one MP4 sample that contains 708 (only, no 608) in this directory:

/Chenders

From what I can see, the track type is c708, which is correct (however, the code in mp4.c skips it).
After that comes the interesting part. The atom type is not cdat (as the specs say it should be for subtitles) but ccpd, which is undocumented.

Someone at ffmpeg did some digging and explains it here:
https://trac.ffmpeg.org/ticket/3250

I think that should be enough to put us in the right direction.

File not being correctly parsed (Sample.vob)

This file:
https://docs.google.com/uc?id=0B3-vhpZ_3PTLOUJoc1JBZW5FSTg&export=download

is not correctly processed. VLC says there are 4 caption tracks there. CCExtractor seems quite confused, though.

GSOC - Create a Linux GUI

CCExtractor is a console program, which is very convenient for a number of things, particularly anything automated. It also makes it harder for regular users to use on their desktops. For Windows we just have a nice GUI that calls the console program. We'd like to have a GUI written for Linux as well.

Word-by-word tagging

Hi there,

I am interested in identifying the precise in and out timestamps of specific words embedded in the closed caption data of an mpeg2 stream. It seems that with CCExtractor, only lines of text are indexed in this way. Is this a limitation of CCExtractor specifically, or the standards of CC in digital broadcast? If this functionality is not directly built into CCExtractor, would you have any suggestions as to how to extract and use this very specific data?

GSOC - Real time uploading

Closed captions are used by a number of data aggregation companies as one of the sources they add to their data pool. They use the information to correlate appearances in media with things like stock. While doing that for Twitter (to mention just one) is easy because there's even an API, it's a lot more difficult to do it for closed captions. We want CCExtractor to be able to upload data as it's produced so it can be used for real-time analysis.
The task includes everything from defining the protocol to making the changes in CCExtractor and writing a reference implementation of the data receiver.

GSOC - Finish CEA-708 support

CEA-708 is the "new" standard for closed captioning. While the specification has been around for some years and support for it is mandatory in the US for both TV receivers and stations, until very recently almost all stations have just converted their CEA-608 data to 708; this means that none of the 708 features have actually been used, and you still see many captions in all uppercase (to mention just one thing). This is starting to change, though, so it makes sense for CCExtractor to fully implement a 708 decoder. Some work has been done already, and you can actually see 708 output in CCExtractor in debug mode. But it needs to be completed by adding the actual export features.

GSOC - ffmpeg integration

As you may know, CCExtractor has built-in parsers for everything. Zero dependencies. Quite convenient, but of course it means that, except for ports, we aren't building on the work of giants like ffmpeg. We would like to (optionally) build CCExtractor using ffmpeg's parsers instead of the internal ones. This will probably require some changes in ffmpeg itself (because as far as we know it doesn't allow getting the subtitle data, even in raw format). Your job would be to make changes in both projects and have them added to the mainstream versions.

GSOC - Configurable transcript output

The current transcript format is fixed (well, there's a -ucla parameter that changes to an alternate format, but that's all). Since transcripts are not really an official format such as .srt, it would be convenient to allow the user to specify what the transcript should look like: for example, with or without timing, or maybe just the start time but not the end, colors or no colors, timestamps in UTC or relative to the start of the stream, and so on.

Example:

ccextractor -txtformat "starttime,endtime,mode,window,text" ...
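A minimal sketch of how such a field list could be parsed, assuming the five field names from the example are the only ones accepted. The enum, the table and `parse_txtformat` are hypothetical names chosen for illustration; nothing here reflects how CCExtractor actually handles the option.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical field identifiers; the names mirror the example above. */
enum txt_field { TF_STARTTIME, TF_ENDTIME, TF_MODE, TF_WINDOW, TF_TEXT, TF_COUNT };

static const char *const txt_field_names[TF_COUNT] = {
    "starttime", "endtime", "mode", "window", "text"
};

/*
 * Parse a comma-separated field list into `out` (capacity `max`).
 * Returns the number of fields parsed, or -1 on an unknown or empty
 * field name or when the list does not fit in `out`.
 */
int parse_txtformat(const char *spec, enum txt_field *out, int max)
{
    int count = 0;

    assert(spec != NULL && out != NULL);
    while (1) {
        size_t len = strcspn(spec, ",");   /* length of the current token */
        int i, found = -1;

        for (i = 0; i < TF_COUNT; i++) {
            if (strlen(txt_field_names[i]) == len &&
                strncmp(spec, txt_field_names[i], len) == 0) {
                found = i;
                break;
            }
        }
        if (found < 0 || count >= max)
            return -1;
        out[count++] = (enum txt_field)found;

        if (spec[len] == '\0')
            break;
        spec += len + 1;                   /* skip the comma */
    }
    return count;
}
```

The transcript writer could then walk the resulting array in order and emit one column per entry, so the order given on the command line becomes the column order.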

Personal Pronoun Capitalization

I have developed a word list which I use in conjunction with your option to convert the case of captions and use an external list for names and such. In version 0.69, this works great and is perfect, except that it always capitalizes the start of a new line, whether or not it is the continuation of an existing sentence. Unfortunately, every newer release fails to capitalize the personal pronoun "I". If it were possible to do a simple search and replace on this, it would not be an issue, but I have found no way to do that, as every subtitle editor I use (I have many) is incapable of locating every instance of the simple word "I" correctly, making it a manual task of sometimes herculean proportions. I am hoping you will eventually be able to correct this situation, as I'm sure I'm missing out on other improvements I have so far been unable to enjoy.

GSOC - DeC++ spupng

The Spupng code contains C++ stuff. Since CCExtractor is C, we need to convert that part of the code.

Some closed captioning does not work properly

I emailed a file to you as well (espn.ts). We’ve found some transport streams in which the CC ends up jumbled. It’s the same with VLC or FFmpeg-based software, but our IP set-top boxes display the CC just fine. They use software made by bitrouter: http://www.bitrouter.com/products/capstack.htm

mp4 files not automatically detected

History: discussion regarding this started at https://github.com/CCExtractor/ccextractor/pull/58/files#r14395363

[question] Badly corrupted .SRT file from TiVo units

I hope this question here is permissible - I am having trouble with subtitles generated by ccextractor (although I have determined that ccextractor is likely NOT the cause) and could use some guidance on where to take this for resolution.

I have been observing badly corrupted subtitle/caption files being generated from recordings downloaded from TiVo lately. I have been using TiVo + ccextractor as part of my workflow for years, and while there was frequently the occasional corrupted line (there are almost always about 4 to 5 dialogue lines corrupted), it has recently become very bad.

Lately the corruption issues with my TiVo downloads have been so bad that nearly every line (99%) is corrupted. The corruption sometimes appears as if part of one (left) half of the subtitles is somehow overlapping the other (right) half.

Here is an example of the .SRT output:

00:04:37,077 --> 00:04:38,377
SOUNDS GRE!                     

33
00:04:38,379 --> 00:04:40,545
     AND CAN I PRACTICE         
     MY SPANI.                  

34
00:04:57,797 --> 00:04:59,765
WHAT DO YOU THINK?              

35
00:05:00,566 --> 00:05:03,035
            OF COUE! IRS'D LOVE 
            TO GO TOEXIC MO!    

36
00:05:03,037 --> 00:05:04,403
                    YE

I have determined that the issue is NOT a result of the decoding/encoding/ccextractor process (which was my belief previously) as I have found that the corrupted captions appear this way on the .tivo video file itself.

I previously surmised, with my limited understanding of ccextractor, that perhaps the problem was related to the analog vs. digital captions on the video and that ccextractor was possibly pulling captions from "the wrong one", however, from my efforts in reading up on this, it appears that this hypothesis is unlikely to be correct.

I welcome anyone's help in identifying why this is happening and where I should bring this for further resolution.

alternate subtitles missing on UK freeview .TS files

Input
https://www.dropbox.com/s/nvi4j0amowkthjy/Harry%20Potter%20Chamber%20of%20Secrets_20150530_14401740.ts?dl=0
Output
https://www.dropbox.com/s/a7w3suf4n1ke7cn/Harry%20Potter%20Chamber%20of%20Secrets_20150530_14401740.srt?dl=0

Several files tried with the same result; command line and GUI produce the same results,
using default options only.

Apple iPhone 5 sample

I'll submit it here, and I'll be working on it. I got some extra information and have the following conclusions so far:

The file comes from Apple's site. Other samples from the same site are apparently OK; this one isn't.
Subtitles are messed up on several different players (VLC, MPC-HC, ...), but play fine on QuickTime.
Subtitles are in English.

Some media info:

General:
Format : MPEG-4
Format profile : QuickTime
Codec ID : qt
File size : 15.9 MiB
Duration : 30s 167ms
Overall bit rate : 4 430 Kbps
Encoded date : UTC 2014-02-11 15:46:11
Tagged date : UTC 2014-02-11 15:46:12
Writing library : Apple QuickTime

Text
ID : 5-CC1
Format : EIA-608
Muxing mode : Final Cut
Codec ID : c608
Duration : 30s 167ms
Source duration : 30s 163ms / 30s 155ms
Bit rate mode : Constant
Stream size : 0.00 Byte (0%)
Source stream size : 32.7 KiB (0%)
Language : English
Encoded date : UTC 2014-02-11 15:46:11
Tagged date : UTC 2014-02-11 15:46:12

This would lead me to think that QuickTime doesn't 100% follow the standards or does something special, because they generated the file and can read it out again, but CCExtractor & other media players can't.

GSOC - Create an OSX GUI

CCExtractor is a console program, which is very convenient for a number of things, particularly anything automated. It also makes it harder for regular users to use on their desktops. For Windows we just have a nice GUI that calls the console program. We'd like to have a GUI written for OSX as well.

CEA-708 support files

Someone just sent a number of useful samples. They're available in /repository/Cristiano708

MPEG-PS containing both CEA608 and CEA708 captions
(On version 0.76, 608 extraction works, but 708 does not)
captions_test.mpg

MPEG-TS containing both CEA608 and CEA708 captions
(On version 0.76, 608 extraction works, but 708 does not)
captions_test_ts.mpg

CEA608 TTML file generated by Adobe Premiere (CEA608 track)
(the great thing about this version is that it contains the correct positions)
captions-test_608.xml

CEA708 TTML file (slightly different from the CEA608 TTML):
(the great thing about this version is that it contains the correct positions)
captions-test_708.xml

SCC File (CEA608)
captions-test_608.scc

MCC File (CEA708)
captions-test_708.mcc

GSOC - CC insertion

CCExtractor is able to extract the subtitles from almost anything you throw at it: MPEG2, H264... in MP4, TS... doesn't matter if it's teletext or closed captions, or if the media is from Europe, Australia or North America. If the data is there, chances are CCExtractor can give it to you in a text file. However, what CCExtractor cannot yet do is insert data into existing files, i.e. add captions where there aren't any. The job here: be able to take an uncaptioned media file and insert the CC data.

GSOC - Test suite

We have a reasonably decent collection of samples of all kinds, from a number of sources. Often, fixing a problem that appears in just one sample breaks something else. We need to automate tests so we can easily compare the output of different CCExtractor versions and get useful reports (which files changed and what changed).

GSOC - Multiprogram

In the digital world, a number of programs are transmitted simultaneously (multiplexed) in a single channel. The tuner receives all those programs, and then the receiver filters the one the user wants to watch, discarding all the others. CCExtractor does this too - if a stream contains more than one program you have to pick one. The goal is to modify CCExtractor so it's able to process all programs in the stream at the same time, generating the transcript for all of them in one pass.

3 new samples that don't work

Directory "/Ramit Bhalla" in the repository.

Garbled up HDHomeRun samples

There are a couple of samples in the RaviUSATV folder on the dev server that contain garbled output.

hdhomerun_station_44.1_KBCW-DT
hdhomerun_station_4.2_KRON-SD
hdhomerun_station_66.3_Bounce
hdhomerun_station_7.1_KGO-HD
hdhomerun_station_14.1_KDTV-HD

hdhomerun_station_26.1_KTSF-D1 (partially)
hdhomerun_station_20.1_KOFY-HD (partially)

They all behave the same in VLC (also garbled output).

(originally posted by @cfsmp3 in Slack)

Garbled output in some Tivo samples

Output is garbled in some (but not all) recordings from Tivo.

3 samples can be checked out here:

https://drive.google.com/folderview?id=0B3bPKNXgZu0-fjAxWFN2YXJSSFdZSlpRYllPSDBxTk9xUlU4dDZiUllxRE5kZXp1cEpSX2c

OCR issue

I've found a problem with the OCR feature that causes problems with these two samples - #172 and #151.

I've compiled ccextractor with the OCR feature and extracted PNG subs using the -out=spupng option. The PNGs were extracted fine, but there are not enough subtitles in the SRT file: some lines are missing. The first thing that strikes the eye is that there are no multi-line subs there. After that, I found that some single-line subs are missing too.

Then I ran the tesseract CLI tool on some of the PNG sources that were excluded from the SRT file, to check whether it can recognize the text. Some of the multi-line sources were recognized well, and some of them could not be recognized. More than that, lots of single-line sources could not be recognized either. Error messages appeared:

...
Error in pixReduceRankBinary2: hs must be at least 2
Error in pixDilateBrick: pixs not defined
Error in pixExpandReplicate: pixs not defined
Error in pixAnd: pixs1 not defined
Error in pixDilateBrick: pixs not defined
...

These led me to Tesseract's bug tracker, https://code.google.com/p/tesseract-ocr/issues/detail?id=605, where they say it is a Leptonica issue.

I am not so familiar with OCR-related code in ccextractor, but probably some of you guys are.

Buggy ttml support

The generated TTML files are not valid: they start and end with valid TTML markup (XML), but the rest of the file is just regular SubRip format.

Here's a short example of the kind of file I get with ccextractor -out=smptett video.ts:

<?xml version="1.0" encoding="UTF-8" ?>
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
<body>
<div>
1
00:00:48,280 --> 00:00:49,880
Mon pauvre, je suis désolée !

2
00:00:50,080 --> 00:00:50,720
Ca va ?

3
00:00:50,960 --> 00:00:51,920
T'as rien ?

</div></body></tt>

The input files I have are all .ts with embedded dvb_teletext subtitles.
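For comparison, a valid body would carry each cue as a `<p>` element with `begin` and `end` attributes instead of a SubRip block. The sketch below formats one cue that way; `ttml_format_cue` and `append` are hypothetical helpers written for this illustration, not CCExtractor functions, and the sketch assumes times are available in milliseconds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append `s` to `out` at offset *used; returns 0 on success, -1 if it would not fit. */
static int append(char *out, size_t cap, size_t *used, const char *s)
{
    size_t len = strlen(s);

    if (len >= cap - *used)
        return -1;                      /* keep room for the terminating NUL */
    memcpy(out + *used, s, len + 1);
    *used += len;
    return 0;
}

/*
 * Format one cue as a TTML <p> element into `out` (capacity `cap`).
 * Times are in milliseconds. XML special characters are escaped and
 * line breaks become <br/>. Returns the length written, or -1 if the
 * buffer is too small.
 */
int ttml_format_cue(char *out, size_t cap, long start_ms, long end_ms, const char *text)
{
    size_t used;
    int n;

    assert(out != NULL && text != NULL);
    assert(start_ms >= 0 && end_ms >= start_ms);
    n = snprintf(out, cap,
                 "<p begin=\"%02ld:%02ld:%02ld.%03ld\" end=\"%02ld:%02ld:%02ld.%03ld\">",
                 start_ms / 3600000, start_ms / 60000 % 60,
                 start_ms / 1000 % 60, start_ms % 1000,
                 end_ms / 3600000, end_ms / 60000 % 60,
                 end_ms / 1000 % 60, end_ms % 1000);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (; *text != '\0'; text++) {
        char one[2] = { *text, '\0' };
        const char *rep = one;

        if (*text == '&')       rep = "&amp;";
        else if (*text == '<')  rep = "&lt;";
        else if (*text == '>')  rep = "&gt;";
        else if (*text == '\n') rep = "<br/>";
        if (append(out, cap, &used, rep) != 0)
            return -1;
    }
    if (append(out, cap, &used, "</p>") != 0)
        return -1;
    return (int)used;
}
```

Note the decimal point in the timestamps: SubRip separates milliseconds with a comma, while TTML clock times use a period.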

Incorrect timing in iTunes MP4

Looks like we are doing it wrong in iTunes MP4.

There's a track that contains the captions (as opposed to being embedded in the video track), and timing is a bit different than usual because that track can contain a lot of data.

We've received a couple of useful links about this:

http://forum.doom9.org/showthread.php?p=1718273#post1718273
(It continues to the bottom of the page.)

https://trac.videolan.org/vlc/ticket/12685
(There's an extra sample here you should be able to download.)

CEA-608 Export to RAW/BIN Fails

Hey, think I found a bug...

My input files are m4v with CEA-608 stream (iTunes). If I output to SRT it works fine. But attempting to output to RAW / DVDRAW / BIN doesn’t work (output files are empty.)

Command line:
C:\Program Files (x86)\CCExtractor\ccextractorwin.exe --gui_mode_reports -in=mp4 -autoprogram -out=raw -o "C:\somefolder\s01e01_ccextract.raw" -latin1 --nofontcolor --notypesetting -noteletext [+input files]

The only option I’m changing (other than the output file name) is the -out argument. The CCExtractor GUI "Preview" text area displays all captions as normal every time, but unless output is set to SRT the files are empty.

CCExtractor 0.73 (also tested with CCExtractor 0.69)
Windows 8.1 Pro 64-Bit with UAC enabled; NTFS file system; Visual C++ 2013 Runtime installed.

Happy to answer any question, but go easy on me - I'm a CC n00b.

XDS: Polluting .srt

XDS output should only be added to transcript, not .srt or any other format that doesn't support it.

Crash on PMT parsing

"http://www.worldtrad.org/Bones - 16.ts"

That file causes a crash here:
int write_section(struct lib_ccx_ctx *ctx, struct ts_payload *payload, unsigned char *buf, int size, int pos)
{
    if (payload->pesstart)
    {
        memcpy(payload->section_buf, buf, size);

(size == -3)
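Because memcpy takes its length as a size_t, a size of -3 is converted to a very large unsigned count, which would explain a crash at this call. One way to guard the copy is sketched below. The struct, the buffer capacity and `copy_section_checked` are hypothetical stand-ins for illustration; the real `ts_payload` layout and the appropriate limit are not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTION_BUF_CAP 1024  /* assumed capacity, for illustration only */

/* Hypothetical stand-in for the real payload structure. */
struct demo_payload {
    unsigned char section_buf[SECTION_BUF_CAP];
    size_t section_len;
};

/*
 * Copy `size` bytes into the section buffer only if the length is sane.
 * Returns 0 on success, -1 if the length is negative or too large.
 */
int copy_section_checked(struct demo_payload *p, const unsigned char *buf, int size)
{
    assert(p != NULL && buf != NULL);
    if (size < 0 || (size_t)size > sizeof(p->section_buf))
        return -1;              /* reject instead of handing memcpy a huge size_t */
    memcpy(p->section_buf, buf, (size_t)size);
    p->section_len = (size_t)size;
    return 0;
}
```

A check like this only stops the crash; why the caller computes a negative size in the first place would still need to be tracked down.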

sendto doesn't seem to be sending

First: Awesome tool, you guys are awesome.

Second: I'm pulling captions from a UDP stream and I would like to send them to a TCP server.

> ccextractor -s -udp 239.255.251.9:1234 -sendto localhost:9500

In another shell, I've set up a netcat listening to 9500 to test this out:

> nc -l -p 9500

When I run these two commands, the output is as follows:

CCExtractor 0.76, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: Network, 239.255.251.9:1234
[Extract: 1] [Stream mode: Autodetect]
[Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto]

[Timing mode: Auto] [Debug: No] [Buffer input: Yes]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .bin] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: Yes, no timeout] [Clock frequency: 90000]
Teletext page: [Autodetect]
Start credits text: [None]

----------------------------------------------------------------------
Connecting to localhost:9500
connect() error: Connection refused
trying next addres ...

However, it seems to be "connecting" since when I stop the nc early, it shows:

Error: read() error: Connection refused

Note, I'm building something to make it easy to broadcast the captions extracted using ccextractor across a websocket API. I will be sure to update this issue when I figure out what's going on.

Worth noting that when I send the output to stdout with a -stdout flag (and removing sendto) I am seeing captions, so the UDP extraction portion is functioning as expected.

Bad timing in the DVD "The Net"

The original DVD of The Net (which had closed captions instead of DVD subtitles) doesn't get along well with CCExtractor. In particular, timing is incorrect (it is off by around 2 minutes by the end of the movie).

VOBs are available in the developer repository.

Improve built-in dictionary code

CCExtractor has a small (maybe 20 words) dictionary that is used to correct capitalization.

Because the dictionary is so small, the implementation that uses it to do the correction is extremely trivial, and also time-consuming (basically, it checks against each word in the dictionary rather than doing a binary search).

We didn't care about this until someone sent an 11,000-word dictionary for us to use :-)

The function is in 608_helpers.
void correct_case (int line_num, struct eia608_screen *data)

Job: Implement sort and binary search so we can use that dictionary efficiently.
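A minimal sketch of the idea using the standard library's qsort and bsearch, assuming the dictionary is held as an array of correctly cased strings and that lookups should ignore case (captions typically arrive in all caps). `dict_sort`, `dict_lookup` and the comparators are hypothetical names; how this would plug into correct_case is left open.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Portable case-insensitive string comparison. */
static int cmp_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        int d = tolower((unsigned char)*a) - tolower((unsigned char)*b);
        if (d != 0)
            return d;
        a++;
        b++;
    }
    return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}

/* qsort/bsearch comparator: both arguments point to `const char *` elements. */
static int cmp_entries(const void *a, const void *b)
{
    return cmp_nocase(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the dictionary once, after loading it. */
void dict_sort(const char **words, size_t n)
{
    qsort(words, n, sizeof(words[0]), cmp_entries);
}

/*
 * Look up `word` regardless of case. Returns the dictionary spelling
 * (for example "Carlos" for "CARLOS"), or NULL if it is not listed.
 */
const char *dict_lookup(const char **words, size_t n, const char *word)
{
    const char **hit;

    assert(words != NULL && word != NULL);
    hit = bsearch(&word, words, n, sizeof(words[0]), cmp_entries);
    return hit ? *hit : NULL;
}
```

With the array sorted once at load time, each lookup costs on the order of log2(n) comparisons, roughly 14 for 11,000 entries, instead of a scan over the whole list.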

GSOC - DVB Subtitles

DVB subtitles are bitmap-based subtitles used in Europe (there's also teletext there, which comes from the analog world). For now CCExtractor is unable to extract them. However, because specifications are freely available, because there are some open source implementations we can use as a reference (such as Project X), and because part of the work is common to everything else (i.e. we can use what we already have to get to the DVB data, we just need to process it), this shouldn't be extremely difficult.

Apparently nasty corruption in some files

A hard one. The file

/repository/UCLACorruption/2015-06-25_1800_US_KNBC_Access_Hollywood_Live.mpg

Produces some garbage under some conditions.

Report says that it happens with these parameters:

" 463 F=2015-06-25_1800_US_KNBC_Access_Hollywood_Live
464 cx=ccextractor-0.78-alpha1
466 $cx -ts -autoprogram -UCLA -12 -noru -out=ttxt -utf8 -unixts 0 -o $F.test2 $F.mpg

Note that the problem goes away if you use -1 instead of -12. It's looking for the second channel that triggers the junk inclusion, so it's finding this somewhere."

To make it worse, it doesn't happen (to me) on Windows.

Output looks like this:
19700101000012.412|19700101000014.180|CC1|RU2|>>> AH, TODAY ON "ACCESS
19700101000014.247|19700101000015.215|CC1|RU2|HOLLYWOOD LIVE," I AM SO
19700101000015.281|19700101000015.482|CC1|RU2|EXCITED.
1061371091103224844.814|1978833380815014111.993|CC-1956779514|???|�^�^��^֠^@^@xm^N^E�^�^��^֠^@^@?1 ^A^@^@^@^@?1 ^A^@^@^@^@?¢üí
1061371091103224844.814|1978833380815014111.993|CC-1956779514|???|^K�^�/F?�^�?EI??D^B�^�
^Y�^�^�ú2]ëäSs?^C?~^Q?�^�^�0

For more details, contact David at UCLA.

Captions not detected in WTV file

The repository contains this directory: /MurdochMisteries_nocaptions

There's a file there for which we find no captions but reports say that MCEBuddy does.

I'm assigning this to Anshul since he's doing stream work :-)

Line Breaks: %%--PP

I'm exploring the idea of creating a new "network" output type which could send the output to a TCP as opposed to stdout or a file.

When the output type is txt, I'm noticing that new lines are being sent as 0x25 0x25 0xAD 0xAD 0xD0 0xD0 which, when parsed as ASCII, amounts to %%--PP.

00:00:06:673   0   FC:%:25    <>   ..   ..
00:00:06:706   0   FC:%:25    <>   ..   ..
00:00:06:740   0   FC:-:AD    <>   ..   ..
00:00:06:773   0   FC:-:AD    <>   ..   ..
00:00:06:806   0   FC:P:D0    <>   ..   ..
00:00:06:840   0   FC:P:D0    <>   ..   ..
00:00:06:873   0   FC:I':A7    I'   ..   ..
00:00:06:906   0   FC:M :20    M    ..   ..
00:00:06:940   0   FC:NO:4F    NO   ..   ..
00:00:06:973   0   FC:T :20    T    ..   ..

I've spent a very long time scouring the code base to better understand why this might be, and more specifically, why this does NOT render as %%--PP when writing to stdout, but I think I have to give up in the name of my sanity.

Does anyone know the answer to this riddle?
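One thing that can be checked directly from the dump: CEA-608 bytes carry an odd-parity bit in bit 7, so 0xAD and 0xD0 are '-' (0x2D) and 'P' (0x50) with that bit set, while 0x25 already has odd parity and is unchanged. The sketch below verifies and strips the parity bit; the function names are hypothetical. One reading, not confirmed in this thread, is that these are the second bytes of doubled control-code pairs rather than literal line-break characters.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the byte has an odd number of set bits (valid odd parity). */
int has_odd_parity(unsigned char b)
{
    int ones = 0;

    while (b) {
        ones += b & 1;
        b >>= 1;
    }
    return ones & 1;
}

/* Drop the parity bit (bit 7), leaving the 7-bit character code. */
unsigned char strip_parity(unsigned char b)
{
    return b & 0x7F;
}

/* Strip the parity bit from every byte of a buffer, in place. */
void strip_parity_buf(unsigned char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++)
        buf[i] = strip_parity(buf[i]);
}
```

Applied to the six bytes quoted above, `strip_parity_buf` yields exactly the characters `%%--PP`.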

Case fixing in teletext

For American closed captions CCExtractor is able to apply some case conversion rules so instead of all caps we get reasonably correct case.

We need to apply the same logic to Teletext (and well, DVB).

UDP stream segfault

I've finally caught a segfault with valgrind

==13192== Invalid write of size 1
==13192==    at 0x4C2A88A: memcpy (mc_replace_strmem.c:838)
==13192==    by 0x504ED9: write_section (ts_tables.c:419)
==13192==    by 0x4F1FF1: ts_readstream (ts_functions.c:325)
==13192==    by 0x4F257C: ts_getmoredata (ts_functions.c:486)
==13192==    by 0x507E1C: general_loop (general_loop.c:570)
==13192==    by 0x49E724: main (ccextractor.c:261)
==13192==  Address 0xffffffff8f2822d6 is not stack'd, malloc'd or (recently) free'd
==13192== 
==13192== 
==13192== Process terminating with default action of signal 11 (SIGSEGV)
==13192==  Access not within mapped region at address 0xFFFFFFFF8F2822D6
==13192==    at 0x4C2A88A: memcpy (mc_replace_strmem.c:838)
==13192==    by 0x504ED9: write_section (ts_tables.c:419)
==13192==    by 0x4F1FF1: ts_readstream (ts_functions.c:325)
==13192==    by 0x4F257C: ts_getmoredata (ts_functions.c:486)
==13192==    by 0x507E1C: general_loop (general_loop.c:570)
==13192==    by 0x49E724: main (ccextractor.c:261)
==13192==  If you believe this happened as a result of a stack
==13192==  overflow in your program's main thread (unlikely but
==13192==  possible), you can try to increase the size of the
==13192==  main thread stack using the --main-stacksize= flag.
==13192==  The main thread stack size used in this run was 8388608.
==13192== 
==13192== HEAP SUMMARY:
==13192==     in use at exit: 208,281,265 bytes in 421 blocks
==13192==   total heap usage: 4,910 allocs, 4,489 frees, 220,492,160 bytes allocated
==13192== 
==13192== LEAK SUMMARY:
==13192==    definitely lost: 1,845 bytes in 369 blocks
==13192==    indirectly lost: 0 bytes in 0 blocks
==13192==      possibly lost: 7,248 bytes in 6 blocks
==13192==    still reachable: 208,272,172 bytes in 46 blocks
==13192==         suppressed: 0 bytes in 0 blocks
==13192== Rerun with --leak-check=full to see details of leaked memory
==13192== 
==13192== For counts of detected and suppressed errors, rerun with: -v
==13192== Use --track-origins=yes to see where uninitialised values come from
==13192== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)
Killed

support WebVTT output format

I would like to see the WebVTT output format.

Proper Australian teletext support

Australian Teletext (I suppose it's not specific to Australia, but the samples that show the issue are Australian) requires extra parameters to be parsed, i.e. we have to tell CCExtractor lots of things (-teletext, -datapid, -datastreamtype) or it won't find the data.

This is what the user sent:

I have just uploaded a number of samples recorded with EyeTV into the new directory norman_dahl on your FTP server. They will be enough, I hope, to illustrate the point that some Australian DVB-T broadcasters format their data streams differently from others.

The directory contains a text file that explains the layout of the samples and the problem I have found. I hope this will be enough for you to analyse what is going on.

It's probably not a lot of work: find out why, after finding the private MPEG stream in the PMT parser, it doesn't find the teletext later.

Prevent overwriting files

If at any point a check could be inserted to check whether a .bin file is being processed with -out=bin as an output setting, that would be great. As is, accidentally doing that results in deleting the source file (overwriting it with a blank file, more accurately) which is very bad! (And something I just did.)
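A minimal sketch of such a check, assuming a POSIX system where two paths name the same file exactly when their device and inode numbers match (on Windows a different test would be needed). `same_file` and `open_output_safely` are hypothetical helpers, not existing CCExtractor functions.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return 1 if both paths exist and name the same file (same device and
 * inode), 0 otherwise. A missing output file is fine: nothing to clobber.
 */
int same_file(const char *input_path, const char *output_path)
{
    struct stat in, out;

    assert(input_path != NULL && output_path != NULL);
    if (stat(input_path, &in) != 0 || stat(output_path, &out) != 0)
        return 0;
    return in.st_dev == out.st_dev && in.st_ino == out.st_ino;
}

/* Open the output for writing, refusing to truncate the input file. */
FILE *open_output_safely(const char *input_path, const char *output_path)
{
    if (same_file(input_path, output_path)) {
        fprintf(stderr, "Refusing to overwrite input file %s\n", input_path);
        return NULL;
    }
    return fopen(output_path, "wb");
}
```

Comparing device and inode rather than the path strings also catches the case where the same file is reached through different spellings, such as `file.bin` and `./file.bin`.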

GSOC - Config file

A simple one. Have a config file with default options so users don't need to specify the same things each time.

/etc/ccextractor would be nice.
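A minimal sketch of reading one line of such a file, assuming a simple `key=value` format with `#` comments. The format, the key names and `parse_config_line` are all assumptions made for illustration; the issue does not specify any of them.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Trim leading and trailing whitespace in place; returns the trimmed start. */
static char *trim(char *s)
{
    char *end;

    while (isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/*
 * Split one "key=value" line in place. Returns 1 and sets *key / *value on
 * success, 0 for blank lines and '#' comments, -1 for malformed lines.
 */
int parse_config_line(char *line, char **key, char **value)
{
    char *eq;

    assert(line != NULL && key != NULL && value != NULL);
    line = trim(line);
    if (*line == '\0' || *line == '#')
        return 0;
    eq = strchr(line, '=');
    if (eq == NULL || eq == line)
        return -1;
    *eq = '\0';
    *key = trim(line);
    *value = trim(eq + 1);
    return 1;
}
```

Each parsed pair could then be applied before the command line is processed, so that explicit arguments still override the defaults from the file.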

DVB subtitles from TNT (France)

I have tried to extract DVB subtitles in text or picture format out of a
DVB-T recording from TNT in France without any success.
Are there any limitations in DVB subtitle support in the current version of
CCExtractor?

Here is the extract of the recording I tried:
http://dl.free.fr/gn7ofWDq0

[Carlos' note: Uploaded to /Franck/m6record.ts in developer's repository]

VLC can display the subtitles without any problem.
I tried with a build of version 0.76 of ccextractor.
It looks like there is a problem in PTS parsing, as min and max PTS are identical.

Here is the result:
Opening file: /home/franck/sub/videos/m6record.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode

Number of NAL_type_7: 0
Number of VCL_HRD: 0
Number of NAL HRD: 0
Number of jump-in-frames: 0
Number of num_unexpected_sei_length: 0

Total frames time: 00:00:00:000 (0 frames at 29.97fps)

Min PTS: 13:33:58:190
Max PTS: 13:33:58:190
Length: 00:00:00:000

GSOC - Library-ize CCExtractor

CCExtractor has quite a robust CEA-608 decoder that could be used by any other program. However, in the way it's currently packaged, such a program would need to copy and paste some of our init code, have some global variables... so as you can see, it's not really a library you can just embed into a 3rd party program. The job: Refactor it, and produce a reference program that builds on the refactored code to process a sample file.

GSOC - Multichannel

A number of TV devices (most famously HDHomeRun) come with more than one tuner (some models as many as 6), allowing the user to watch several channels at the same time. CCExtractor is able to receive the data from HDHomeRun directly (no need for intermediate files) but it only listens to one tuner. The job: Modify CCExtractor so it's able to listen to any number of tuners at the same time.

GSOC - Decode other tables

On top of the essential PAT and PMT, Transport Streams may contain additional tables (for example the EIT, Event Information Tables). Sometimes the PMT contains an EIT indicator, so you have to look something up in the EIT.

So far we've found one sample (from Australia) in which EIT parsing would be required to properly autodetect the stream with the teletext information.

So the task is to implement EIT parsing. Not complex, since this is well documented.

GSOC - New standards

CEA-608 is the old (analog) standard for closed captions. CEA-708 is the standard for digital TV. But what about the standard for internet media? CCExtractor has some basic support for TTML, but there are other emerging standards that we should support as well, both for input and for output.

GSOC - File analysis functionality

We need a feature that, using everything that is already in place, consumes part of a stream (up to a limit specified by the user) and generates an easy-to-parse report.

The limit can be time (for example, the first minute of the file), size (such as the first 10 MB), or until something is found (for example if captions are found, stop).

The report will be text sent to stdout, and contain things like this:
File: ...........
AnyCC608: Yes
AnyCC708: No
Programs: 3
PrimaryLanguagePresent: Yes
SecondaryLanguagePresent: No
XDSPresent: Yes

and so on

This functionality is easy to add, since all info is already in the internal status. We just need the ability to display it in an easy to parse format.
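As a rough illustration of the report format listed above, the sketch below renders a summary struct as `Key: Value` lines. The struct and `format_report` are hypothetical; in CCExtractor the values would come from the existing internal status rather than from a separate structure.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical summary of what was seen in the analyzed part of a stream. */
struct analysis_report {
    const char *file;
    int any_cc608;
    int any_cc708;
    int programs;
    int primary_language_present;
    int secondary_language_present;
    int xds_present;
};

static const char *yes_no(int flag)
{
    return flag ? "Yes" : "No";
}

/*
 * Render the report as "Key: Value" lines into `out` (capacity `cap`).
 * Returns the number of characters written, or -1 if `out` is too small.
 */
int format_report(char *out, size_t cap, const struct analysis_report *r)
{
    int n;

    assert(out != NULL && r != NULL && r->file != NULL);
    n = snprintf(out, cap,
                 "File: %s\n"
                 "AnyCC608: %s\n"
                 "AnyCC708: %s\n"
                 "Programs: %d\n"
                 "PrimaryLanguagePresent: %s\n"
                 "SecondaryLanguagePresent: %s\n"
                 "XDSPresent: %s\n",
                 r->file, yes_no(r->any_cc608), yes_no(r->any_cc708),
                 r->programs, yes_no(r->primary_language_present),
                 yes_no(r->secondary_language_present), yes_no(r->xds_present));
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Keeping one key per line with a fixed `Key: Value` shape means a calling script can split on the first colon without needing a real parser.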