Comments (38)

prwteas commented on May 22, 2024

Same problem here:

Downloaded http://class.coursera.org/sna-002/lecture/index (5309 bytes)
Found 0 sections and 0 lectures on this page
Probably bad cookies file (or wrong class name)

from coursera-dl.

HurryUpAndWait commented on May 22, 2024

Ditto:

python coursera-dl -u -p econ1scientists-2012-001
Downloaded http://class.coursera.org/econ1scientists-2012-001/lecture/index (5737 bytes)
Found 0 sections and 0 lectures on this page
Probably bad cookies file (or wrong class name)

knudvaneeden commented on May 22, 2024

Ditto:

python coursera-dl -u myusername -p mypassword --path=mypath eefun-001
Downloading class: eefun-001
Downloaded http://class.coursera.org/eefun-001/lecture/index (5352 bytes)
Found 0 sections and 0 lectures on this page
Probably bad cookies file (or wrong class name)

with debugging set on:

python coursera-dl --debug -u myusername -p mypassword --path=mypath eefun-001
root[main] Downloading class: eefun-001
root[get_syllabus] Downloaded http://class.coursera.org/eefun-001/lecture/index (5352 bytes)
root[parse_syllabus] Found 0 sections and 0 lectures on this page
root[parse_syllabus] Probably bad cookies file (or wrong class name)


So the failure surfaces in the 'parse_syllabus' method, where the tags <.>...</> are parsed and the URLs are extracted.

The same command definitely worked fine a few days ago.

  1. So, with high probability, the content of the web page has changed.
  2. Note: I compared the web page at http://class.coursera.org/eefun-001/lecture/index as downloaded on 14 March and again on 15 March. There was basically no difference (only some temporary numbers, which are probably not relevant). I could no longer find a copy of the page from around 12 March, so I could not diff against that. So if the page did change, the change probably took place between 12 and 14 March.
  3. Around 12 March it definitely still worked (nothing has been changed).
  4. Note: I have seen the same error, 'Probably bad cookies file (or wrong class name)', before, when the user name and/or password were not correct. To test whether logging in has changed, I logged out of Coursera and logged back in through the web site, which worked fine. So logging in at the web site level is presumably not the issue.

nymanjens commented on May 22, 2024

In coursera_dl.py -> write_cookie_file(), if you allow errors to be thrown like

except urllib2.HTTPError as e:
    if e.code == 404:
        raise ClassNotFoundException(className)
    raise

then you can see that the login script throws a 403 Forbidden error:

Downloading class: nutrition-001
Traceback (most recent call last):
  File "./coursera-dl", line 745, in <module>
    main()
  File "./coursera-dl", line 739, in main
    download_class(args, class_name)
  File "./coursera-dl", line 698, in download_class
    args.password)
  File "./coursera-dl", line 133, in write_cookie_file
    opener.open(req)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: FORBIDDEN

HurryUpAndWait commented on May 22, 2024

I'm trying to debug what's going on. From what I can see (zero Python experience!), get_syllabus() returns HTML without any videos in it, or at least no HTML containing "course-item-list-header".

Contrast this to just saving the lecture video webpage, where I can see multiple "^course-item-list-header" which I believe is used in the parse_syllabus() method to obtain the different sections.

So when parse_syllabus() is run on what is returned from get_syllabus(), no sections are found, as printed out to stdout. So a problem with get_syllabus()? Or as above, is it an authentication problem that prevents the correct html from being read?
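
To tell the two cases apart, one could check the downloaded page for the section marker before parsing it. Below is a minimal sketch in Python 3, assuming the class name "course-item-list-header" mentioned above really is what identifies a section. The function names are made up for illustration and are not part of coursera-dl.

```python
# Marker quoted in this thread; that it identifies a section is an assumption.
SECTION_MARKER = "course-item-list-header"


def looks_like_lecture_index(html):
    """Return True if the page contains at least one section marker."""
    return SECTION_MARKER in html


def describe_page(html):
    """Summarize a downloaded page in one line (hypothetical helper)."""
    count = html.count(SECTION_MARKER)
    if count == 0:
        return "no sections found: probably a login or error page"
    return "found %d section marker(s)" % count
```

Running describe_page() on both the page saved from the browser and the page returned by get_syllabus() would show quickly whether the script is receiving the real lecture index at all.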

knudvaneeden commented on May 22, 2024

I found a workaround.

"1." The conclusion that the main HTML file is not downloaded (around the method get_syllabus()) looks very much correct.

"2." So the root cause is very probably indeed 'urllib2.HTTPError: HTTP Error 403: FORBIDDEN'.

"3." HTTP Error 403 is a client-error status code which indicates that the provider refuses to give you access to that page (here presumably because the authentication failed).

"4." So it looks like something has been changed (maybe temporarily?) in the authentication at the provider's web site, which as expected blocks the download of the main HTML file. At least the HTML page is somehow not arriving in the method 'parse_syllabus()'. But if you use a local file instead, it does arrive, and that works.


Workaround:

"5." The (temporary) workaround is to download the HTML source code of the web page (e.g. http://class.coursera.org/eefun-001/lecture/index ) yourself to your local hard disk.
Typically: go to the URL which lists the videos, right-click on the web page, select 'view source', then save that HTML to disk (e.g. as 'c:\temp\foobar.htm').

"6." Then you run the file extractor using an extra parameter
--process_local_page followed by the full filepath to your downloaded HTML file

"7." The command line typically becomes thus:

python coursera-dl --process_local_page YOURHTMLFILENAME -u yourusername -p yourpassword --path=yourpathwhereyouwanttodownload yourcourseraclassname

For example:

python coursera-dl --process_local_page c:\temp\foobar.htm -u yourusername -p yourpassword --path=c:\TEMP eefun-001

"8." That successfully downloaded all the files on that HTML web page.

HurryUpAndWait commented on May 22, 2024

Thanks knudvaneeden, using the above work-around for some reason the videos on my particular course (econ1scientists-2012-001) are being downloaded as an html file, and one that (without checking) looks suspiciously like the incorrectly downloaded HTML page from before (i.e. the one without videos). Are you seeing that?

The TXT files look fine, there are no PDFs in this course by the looks of it...

knudvaneeden commented on May 22, 2024

"1." I will check it => looks like it works OK here. Many files on that page (which should be downloaded as an HTML file, according to the workaround steps) though,

https://class.coursera.org/econ1scientists-2012-001/lecture/index

so it should be a long download even on a fast connection.

11,447,008 bytes downloaded in 365 files and 23 dirs

"2." Note: If you are having issues, try leaving out the '--path' parameter, which is optional. If you do use it, put no spaces in the path name and no double quotes around it. It only tells the program where to download your files, so that you do not have to search all over the place for them afterwards.


"3." There are quite a few PDF files there, e.g.
https://d19vezwu8eufl6.cloudfront.net/econ1scientists/notes%2Funit1.pdf
so you should (also) find those in your download directory.

"4." It looks like I downloaded most files, but there was one error at the end. I do not think this is general behavior, as some other course downloaded OK without any error; it may be caused by the quite long path I used:

PRINCIPAL_ECONOMICS_FOR_SCIENTISTS\econ1scientists-2012-001\07_Unit_5._Government_intervention_in_competitive_markets_I-_Efficiency\11_5.6._First_welfare_theorem-_The_magic_of_the_market_17-25.mp4
Traceback (most recent call last):
  File "g:\learn\coursera\coursera-master\coursera\coursera_dl.py", line 749, in <module>
    main()
  File "g:\learn\coursera\coursera-master\coursera\coursera_dl.py", line 743, in main
    download_class(args, class_name)
  File "g:\learn\coursera\coursera-master\coursera\coursera_dl.py", line 727, in download_class
    args.verbose_dirs,
  File "g:\learn\coursera\coursera-master\coursera\coursera_dl.py", line 395, in download_lectures
    curl_bin, aria2_bin, axel_bin)
  File "g:\learn\coursera\coursera-master\coursera\coursera_dl.py", line 426, in download_file
    download_file_nowget(url, fn, cookies_file)
  File "g:\learn\coursera\coursera-master\coursera\coursera_dl.py", line 509, in download_file_nowget
    with open(fn, 'wb') as f:
IOError: [Errno 2] No such file or directory: u'g:\mydownloadfiles\FROM_JOB_TO_HOME\TODO\LEARNING\ONLINE\COURSERA\PRINCIPAL_ECONOMICS_FOR_SCIENTISTS\econ1scientists-2012-001\07_Unit_5._Government_intervention_in_competitive_markets_I-_Efficiency\11_5.6._First_welfare_theorem-_The_magic_of_the_market_17-25.mp4'
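
If the long path is indeed the cause, as guessed above, one option is to shorten each directory and file name before creating it. The classic Windows MAX_PATH limit is 260 characters including the terminating NUL. What follows is only a sketch in Python 3: the helper names and the per-name limit of 40 are assumptions for illustration, not anything coursera-dl does.

```python
import ntpath

WINDOWS_MAX_PATH = 260  # classic Win32 limit, counting the terminating NUL


def exceeds_max_path(path):
    """Return True if the path is too long for the classic Win32 file APIs."""
    return len(path) >= WINDOWS_MAX_PATH


def shorten_component(name, limit, keep_ext=False):
    """Truncate one directory or file name to at most `limit` characters."""
    if len(name) <= limit:
        return name
    if keep_ext:
        stem, ext = ntpath.splitext(name)
        # Only preserve the extension if it leaves room for part of the stem.
        if 0 < len(ext) < limit:
            return stem[:limit - len(ext)] + ext
    return name[:limit]


def shorten_path(path, limit=40):
    """Shorten every component of a backslash-separated Windows path."""
    drive, rest = ntpath.splitdrive(path)
    parts = rest.split("\\")
    short = [shorten_component(p, limit) for p in parts[:-1]]
    # Only the last component is treated as a file name with an extension.
    short.append(shorten_component(parts[-1], limit, keep_ext=True))
    return drive + "\\".join(short)
```

Truncation can make two different lecture names collide, so a real fix would also have to keep the leading sequence numbers (which this sketch does, since it cuts from the end).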

rbrito commented on May 22, 2024

Wow, you guys are fast.

Well, from the little that I saw, the problem is that they changed the
authentication from a simple GET request while keeping cookies to a
more elaborate login procedure involving a POST verb, with the use of
Cross-site request forgery (CSRF) tokens, and a lot of redirections.

The login URL also changed to something like:

https://www.coursera.org/maestro/api/user/login

This seems to be intended to keep a lot of state when one is accessing
Coursera's site, and I would guess that some of the blobs that are
being used contain hashes of what the User-Agent is (like the headers
that we send).

I hope to be wrong here, though, and a few liners will be sufficient,
but I think that others will certainly correct me if I am, as we have
wonderful readers.

Regards,

Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://rb.doesntexist.org/blog : Projects : https://github.com/rbrito/
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br

HurryUpAndWait commented on May 22, 2024

Ok, I repeated the download, making sure I had all the arguments set correctly, but the *.mp4 files once downloaded are showing up as html files. This happens on both Mac OS X and Linux (Fedora):

python coursera-dl --process_local_page ~/Downloads/econ1scientists-2012-001.html -u username -p password econ1scientists-2012-001

[554]$ file econ1scientists-2012-001/01_Introduction_and_Logistics/*.mp4
econ1scientists-2012-001/01_Introduction_and_Logistics/01_Welcome_and_introduction_5-48.mp4: UTF-8 Unicode HTML document text, with very long lines
econ1scientists-2012-001/01_Introduction_and_Logistics/02_Logistics_11-21.mp4: UTF-8 Unicode HTML document text, with very long lines
econ1scientists-2012-001/01_Introduction_and_Logistics/03_Course_overview_12-15.mp4: UTF-8 Unicode HTML document text, with very long lines

Sadly I'm at my limit with Python and HTML (which is admittedly pretty weak) - do you expect to be able to make a patch to this John?

Thanks guys.

knudvaneeden commented on May 22, 2024

Ok, I repeated the download, making sure I had all the arguments set correctly, but the *.mp4 files once
downloaded are showing up as html files. This happens on both Mac OS X and Linux (Fedora):

"1." Yes, this is reproducible also on Microsoft Windows 7 Ultimate 64 bits.

Steps:

"2." Using the workaround of downloading the HTML page, e.g.
https://class.coursera.org/econ1scientists-2012-001/lecture/index
(and optionally saving it e.g. with an extension .html or .htm)

"3." It appears to download the .mp4 successfully (you see e.g. the filename in the list of downloaded files), but if you try to play it, the associated player will probably open (because it goes by the extension .mp4) and then fail to play the file.

"4." If you manually open that .mp4 file, you see that it is indeed HTML, instead of the expected binary .mp4 format.

"5." If you rename the .mp4 file to .html and then open it e.g. in a browser, you see that what was downloaded instead of the .mp4 file is the ENROLL / LOGIN page.

"6." Expected: a file in .mp4 format was downloaded.

"7." Actual result: the program tried to log in, but that failed (because the provider changed the login authentication method a few days ago). So instead of the .mp4 file it downloaded the HTML login page and saved it as a .mp4 file.

"8." It looks like this issue only affects .mp4 files. Files with e.g. .pdf, .txt or .srt extensions download without any problem and can be opened OK here, as far as I can tell.

"9." Conclusion: => The root cause of the .mp4-saved-as-HTML issue is, with very high probability, that an extra or different method of login authentication has been in use lately.

This is 'proven' by the fact that the same unchanged software downloaded .mp4 files from another course correctly on 10 March, and those can be played. Only the .mp4 files downloaded more recently, e.g. on 15 March (using the workaround download method), show the .mp4-downloaded-as-HTML issue.

"10." Current (temporary) workaround: the .mp4 files (only those) have to be downloaded separately by hand, so no longer automatically as before, because downloading them directly from the video lecture page still works fine.
E.g. downloading manually
https://class.coursera.org/econ1scientists-2012-001/lecture/download.mp4?lecture_id=13
And because a course usually has only about as many videos as it has weeks, that is probably doable (e.g. 10 weeks on average, so about 10 .mp4 downloads per course).

HurryUpAndWait commented on May 22, 2024

Agreed only *.mp4 are an issue. However, for this particular course, there are 193 videos :(

knudvaneeden commented on May 22, 2024

Well, from the little that I saw, the problem is that they changed the
authentication from a simple GET request while keeping cookies to a
more elaborate login procedure involving a POST verb, with the use of
Cross-site request forgery (CSRF) tokens, and a lot of redirections

Note:
"1." At the end of the day it looks like the class HTML page, e.g.
http://class.coursera.org/econ1scientists-2012-001/lecture/index
is not downloaded at all (because the provider changed the login authentication method a few days ago, and that is the root cause of the failure to download).

"2." Yet the Python program still reports that the HTML page was downloaded successfully.

"3." E.g. if you look at the output of the download you see that:

python coursera-dl -u -p econ1scientists-2012-001
Downloaded http://class.coursera.org/econ1scientists-2012-001/lecture/index (5737 bytes)
Found 0 sections and 0 lectures on this page
Probably bad cookies file (or wrong class name)

"4." What was expected instead was output similar to:

python coursera-dl -u -p econ1scientists-2012-001
Downloaded NOT SUCCESSFUL: http://class.coursera.org/econ1scientists-2012-001/lecture/index (5737 bytes)
Found 0 sections and 0 lectures on this page
Probably bad cookies file (or wrong class name)

knudvaneeden commented on May 22, 2024

Agreed only *.mp4 are an issue. However, for this particular course, there are 193 videos :(

OK, so with that many videos the manual workaround of downloading the .mp4 files separately moves from doable to not doable (in general it can quickly mean hundreds or even thousands of manual download actions and hours of time), so a more fundamental, automatic solution should be found, as before.

knudvaneeden commented on May 22, 2024

FYI

If I download the .mp4 directly using the Microsoft Windows API method 'URLDownloadToFileA' located in the file URLMON.DLL, without involving anything about login, HTTPS, SSL, ...
then that works fine.

This program written in BBCBASIC downloaded e.g. the .mp4 file successfully.

url$ = "https://class.coursera.org/econ1scientists-2012-001/lecture/download.mp4?lecture_id=13"
file$ = "c:\temp\ddd.mp4"
SYS "LoadLibrary", "URLMON.DLL" TO urlmon%
SYS "GetProcAddress", urlmon%, "URLDownloadToFileA" TO URLDownloadToFile
SYS URLDownloadToFile, 0, url$, file$, 0, 0 TO res%
IF res% ERROR 100, "Could not download "+url$
PRINT "File downloaded to " + file$
END

Similarly running this in my Semware TSE text editor macro language SAL also downloads it successfully

DLL "<urlmon.dll>"
INTEGER PROC FNUrlGetSourceApiI(
INTEGER lpunknown,
STRING urlS : CSTRVAL,
STRING filenameS : CSTRVAL,
INTEGER dword,
INTEGER tlpbindstatuscallback
) : "URLDownloadToFileA"
END

PROC Main()
STRING urlS[255] = "https://class.coursera.org/econ1scientists-2012-001/lecture/download.mp4?lecture_id=13"
STRING fileNameS[255] = "c:\temp\ddd.mp4"
FNUrlGetSourceApiI( 0, urlS, filenameS, 0, 0 )
END

=> An equivalent in Python could be created similarly. I will check if I can create this.

If this is generally true (I am not sure of that, but it certainly looks like it), then a consequence is that the whole SSL layer and its methods in coursera-dl.py could probably be removed and the program reduced to a few lines.

Any file download using HTTP GET should then work out of the box (as opposed to downloads needing HTTP POST, but it is assumed that this is not the case here).

E.g. keeping only the get_syllabus method (which would call the equivalent API located in the file URLMON.DLL on Microsoft Windows) and the parse_syllabus method (which uses Beautiful Soup to parse the HTML tags and extract the URLs of the download files).

knudvaneeden commented on May 22, 2024

Interesting.

"1." An alternative would be to check whether I can reproduce it in the latest version of Python 3.x, instead of the Python v2.6 used until now (not done yet, still working in Python v2.6).

"2." I tried to create the equivalent of the URLMON.dll download (which downloads the .mp4 successfully as a binary file that plays in a video player), but in Python.

Conclusion: The root cause for the .mp4 saved as HTML behavior actually lies in using the Python library 'urllib2'.

"3." E.g. I created this Python program (save it e.g. as 'foobar.py', then run 'python.exe foobar.py', or on Linux/Unix/Mac 'python foobar.py', or use the shebang method; the file will be downloaded as 'foobar.mp4'):

import urllib2

# Fetch the lecture URL and write the response body to disk.
u = urllib2.urlopen( 'https://class.coursera.org/econ1scientists-2012-001/lecture/download.mp4?lecture_id=13' )
# 'wb' rather than 'w': the payload is binary, and text mode would alter it on Windows.
localFile = open( 'foobar.mp4', 'wb' )
localFile.write( u.read() )
localFile.close()

"4." If you run that, it does download a file named .mp4, but the file contains HTML.

"5." Current conclusion: the use of the standard Python library 'urllib2' probably introduces unexpected side effects, which probably complicate automatic downloading of files considerably (e.g. it may have forced John (Lehmann) to go the login + SSL route instead of simply passing the URL to download). This could probably be avoided altogether by using a Python equivalent of the function 'URLDownloadToFileA' located in the file URLMON.dll. Checking this further.

"6." If that does not turn out to be possible out of the box in Python, then I will look at creating this e.g. in the Semware TSE text editor instead, using a regular expression search to extract the file download URLs and URLDownloadToFile from the Microsoft Windows file URLMON.dll.

Otherwise, alternatively, extracting the URLs in Python and passing those URLs to wget.exe or curl.exe via a system call to do the actual download.

This instead of doing the whole action in native Python only (e.g. using the urllib2 lib and other libs).

"7." After further searching (e.g. Google), I do not think that Python uses URLDownloadToFile to handle downloads on the Microsoft Windows platform (e.g. through some kind of wrapper). So they have probably written their own native Python solution (typically urllib2).

Now checking whether you can call Microsoft API DLLs from Python. (Note: it looks like you can, using 'ctypes', but that restricts it to the Microsoft Windows platform, because only there do you find the file URLMON.dll with that download functionality. On Mac (which uses 'dylib' libraries), Linux and Unix (which use .so or .sl libraries), other solutions would have to be found.)

See also
http://docs.python.org/2/library/ctypes.html

E.g. save the following program as foobar.py and run it on Microsoft Windows (by design it works only there; it will not work on Mac, Linux or Unix).

It defines the following general download method in Python:

import os

def download_file_dll( url, filename ):
    # Windows only: hand the download to URLDownloadToFileA in URLMON.DLL.
    import ctypes
    myDLL = ctypes.WinDLL( "URLMon.DLL" )
    # The ANSI variant expects byte strings, so encode Unicode input first.
    return( myDLL.URLDownloadToFileA( 0, url.encode( 'ascii', 'ignore' ), filename.encode( 'ascii', 'ignore' ), 0, 0 ) )

if ( os.name == 'nt' ):
    # Raw string: otherwise the "\t" in the path would be read as a tab character.
    download_file_dll( "https://class.coursera.org/econ1scientists-2012-001/lecture/download.mp4?lecture_id=13", r"c:\temp\ddd.mp4" )
else:
    print("this works only on Microsoft Windows. It does not work on Apple Mac OS X, Linux or Unix")

Then call this method with your URL and download filename as parameters (in the filename, use double backslashes, forward slashes or a raw string).

=> Yes, that works fine, and directly and successfully downloads the .mp4, which can be played in your video player.

> python foobar.py

you will find the file in c:\temp\ddd.mp4.

=> So now just feeding this method with your file URLs to download.

"8." So a current workaround would be to (manually for the moment):

  1. Go to the page of your class
  2. Do a view source
  3. Get all the .mp4 URLs
  4. Then, to the above program, add the lines

download_file_dll( url1, "file1.mp4" )
download_file_dll( url2, "file2.mp4" )
...
download_file_dll( urlLast, "fileLast.mp4" )

and then run that Python program. Once the program has been created, it will then download those .mp4 files fully automatically, and also the other files (.pdf, .txt, .srt, ...), which will be (much) faster than doing it all manually.

"9." I will now check if I can add this download method to coursera-dl.py: check if the extension is .mp4 and, if so, call the new method instead of the original one. It will then download automatically only on Microsoft Windows, still using the original workaround of saving the class HTML page.

knudvaneeden commented on May 22, 2024

As a quicker workaround I also tried using wget as the download program. But the .mp4 is still downloaded as HTML.

Use the parameter --wget_bin or -w followed by the full path to your wget program (which must be installed).

E.g.

--wget_bin G:\UTILS\NETWORK\WGET\wget.exe

knudvaneeden commented on May 22, 2024

As another quicker workaround I also tried using curl as the download program. But the .mp4 is still downloaded as HTML.

Use the parameter --curl_bin or -c followed by the full path to your curl program (which must be installed).

E.g.

--curl_bin G:\UTILS\NETWORK\CURL\curl.exe
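
Handing the download to wget or curl can only help if the tool is also given valid login cookies. A sketch of how the two command lines could be assembled in Python 3 follows. The flags used are standard wget and curl options, but the function names and the idea of a ready-made cookies file are assumptions for illustration, and this is not necessarily how coursera-dl builds its commands.

```python
def wget_command(wget_bin, url, filename, cookies_file):
    """Build a wget call that saves `url` to `filename`, sending stored cookies."""
    return [wget_bin, url, "-O", filename, "--load-cookies", cookies_file]


def curl_command(curl_bin, url, filename, cookies_file):
    """Build the matching curl call; -L follows the redirects."""
    return [curl_bin, url, "-L", "-o", filename, "--cookie", cookies_file]


# Usage (not run here): pass the list to subprocess.call(), e.g.
#   import subprocess
#   subprocess.call(wget_command("wget", url, "lecture.mp4", "cookies.txt"))
```

If the cookies file comes from a failed login, both tools will still fetch the login page and save it under the .mp4 name.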

knudvaneeden commented on May 22, 2024

So I implemented the download method using URLDownloadToFileA from URLMON.dll

Language: Computer: Python: Learn: Online: Coursera: File: Operation: Download: All: File: Source: Coursera-dl.py: Knud: 01 [URLDownloadToFileA / URLMON.dll]
http://goo.gl/pwLtu
If you pass the extra parameter on the command line
--dll_download
it will use the DLL download method.
E.g.
python coursera-dl -u username -p password --dll_download econ1scientists-2012-001

=> And the result is strange, as the issue still occurs.

That is, when this method is used inside coursera-dl.py, the .mp4 still ends up as HTML.

But if I use the same urlmon.dll method outside coursera-dl.py, the .mp4 does not end up as HTML and is created correctly.

Actually the same happens when using wget or curl, at least when used inside coursera_dl.py.

This is all unexpected.

Possible explanations for this behavior:

"1." Not so likely: Something is changing the download state (e.g. some SSL state), which somehow influences the download result.

"2." Likely: Otherwise the .mp4 file is changed into HTML (e.g. by some other method in coursera-dl.py) after the download using wget, curl or dll_download has taken place. Such a method would be triggered by the extension .mp4. I will now check in the source code whether there is something like that.

"3." It might even download the file again using e.g. urllib2, because that reproduces the HTML file behavior.

"4." Testing:

If I download using the DLL download method,
using

def download_file_dll( url, filename ):
    import ctypes
    myDLL = ctypes.WinDLL( "URLMon.DLL" )
    logging.debug('Using urlmon.dll: downloading file %s', filename)
    logging.debug('Using urlmon.dll: downloading from url %s', url)
    # Test only: always fetch the same fixed URL, then stop.
    myDLL.URLDownloadToFileA( 0, "https://class.coursera.org/econ1scientists-2012-001/lecture/download.mp4?lecture_id=13", r"c:\temp\ddd123.mp4", 0, 0 )
    exit()
    # Not reached while the test lines above are in place.
    return( myDLL.URLDownloadToFileA( 0, url, filename, 0, 0 ) )

thus always downloading the same mp4 file, that works OK.
Checking after the 'exit()' whether the file exists: it does.

So possibility 1. above is not really an option: if some state made all the .mp4 downloads fail, then this particular .mp4 download should fail as well. It does not fail, so that option is rejected.

So checking possibility 2.
Possible methods to check (because they contain the string 'mp4'):

def parse_syllabus(page, cookies_file, reverse=False)

and or

def grab_hidden_video_url (href, cookies_file)

and or

def get_anchor_format(a)

"5." It looks like at this moment the issue is related to the type of the url passed.
The type is 'unicode'.
Some programs like wget, curl, ... maybe do not like Unicode strings.

E.g. type( url ) gives

<type 'unicode'>

This should be happening for at least the URL download method. So some conversion will have to take place there.
A constant url like "http://www.google.com" works, but a variable url does not at the moment.

OK, fixed this: you should convert the url from Unicode to ASCII.

E.g.
instead of url use url.encode( 'ascii', 'ignore' )

"6." OK, now that the URLMON.dll download is working after converting the Unicode URL to an ASCII URL, I can clearly see that the .mp4 file is downloaded TWICE: first correctly, as a normal .mp4 file, and then it is overwritten by some HTML. So that is the root cause: double downloading.
=> It turned out to be caused by using 'if' instead of 'elif'.
That caused the function 'download_file_nowget' to be run as well.
And that is where the .mp4 issue occurred, because it uses urllib2, and that saves the .mp4 as HTML.
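
The modified source is not shown in this thread, so the following Python 3 sketch is only a reconstruction of the kind of 'if' versus 'elif' slip described in point 6. The function names and the string labels standing in for the real download calls are invented for illustration.

```python
def pick_downloaders_buggy(use_dll, use_wget):
    """With a plain 'if', the final 'else' belongs to the wget test only."""
    chosen = []
    if use_dll:
        chosen.append("dll")
    if use_wget:
        chosen.append("wget")
    else:
        # Also reached when use_dll is True: the file is fetched a second time.
        chosen.append("urllib2")
    return chosen


def pick_downloaders_fixed(use_dll, use_wget):
    """With 'elif', exactly one download method is chosen."""
    chosen = []
    if use_dll:
        chosen.append("dll")
    elif use_wget:
        chosen.append("wget")
    else:
        chosen.append("urllib2")
    return chosen
```

In the buggy variant a DLL download is followed by the urllib2 fallback, which would overwrite the good file with whatever the fallback receives.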

rbrito commented on May 22, 2024

@knudvaneeden, you won't be successful in your explorations of just trying to download the files with wget or curl or whatever.

The problem is understood: we have problems in the login/authentication phase.

The reason why you didn't need further authentication in some circumstances is that your browser/system is simply sending the cookies that coursera stored in your system.

When you try to use a program like wget or curl that doesn't know about the inner workings of your system (i.e., where the cookies are stored), you get failed downloads, with a web page asking you to supply your credentials.

I have mostly figured out the problem, but I guess that there will be a lot of information to carry along to (in the jargon) keep state over HTTP (which is a stateless protocol).

The most fruitful approach here is to just use a module like Mechanize or requests (or something similar) to keep the state instead of reinventing the wheel.
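
To illustrate what "keeping state" means here, the sketch below uses only the Python 3 standard library (http.cookiejar) rather than Mechanize or requests, so that it runs without installing anything. The function name is made up, and nothing in it is specific to Coursera.

```python
import http.cookiejar
import urllib.request


def make_stateful_opener(cookies_file=None):
    """Return (opener, jar): the opener stores cookies and resends them later."""
    jar = http.cookiejar.MozillaCookieJar(cookies_file)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Usage (not run here): every request made through the same opener shares
# the cookies set by earlier responses, including those set during redirects.
#   opener, jar = make_stateful_opener("cookies.txt")
#   opener.open(login_request)   # the server sets session cookies
#   opener.open(lecture_url)     # the cookies are sent back automatically
#   jar.save()                   # Netscape format, readable by wget and curl
```

Mechanize and requests (via its Session object) wrap the same idea in a more convenient interface.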

For those that want to get their hands dirty, as I mentioned before, the starting point is:

https://www.coursera.org/maestro/api/user/login

knudvaneeden commented on May 22, 2024

I have got it working: this has successfully downloaded all Coursera files for that particular course, including the .mp4 files in a working, binary format.

http://goo.gl/pwLtu

knudvaneeden commented on May 22, 2024

you won't be successful in your explorations of just trying to download the files with wget or curl or whatever.

You may be right, but in any case my changed program downloaded all files of the tested course successfully in the current state (cookie state, ...).

But I had to use the workaround of first downloading the class HTML source file which lists the videos (manually, once for each course), and it might be that that step (also) requires the cookies.

knudvaneeden commented on May 22, 2024

you won't be successful in your explorations of just trying to download the files with wget or curl or whatever.
...
The reason why you didn't need further authentication in some circumstances is that your browser/system is
simply sending the cookies that coursera stored in your system.

Testing the hypothesis that login and/or authentication is necessary to download the files.

"1." I created a clean virtual machine (Oracle VirtualBox) running Microsoft Windows 7 Ultimate 64 bit.

"2." Coursera was never opened or logged into there, so no Coursera cookies are to be found on it.

"3." For quick testing I downloaded BBCBASIC: I know that it works, and I could always create an equivalent program in Python, but that would require more steps.

"4." Then I tried to download the .mp4 file at the URL using URLMON.dll method

url$ = "https://class.coursera.org/econ1scientists-2012-001/lecture/download.mp4?lecture_id=13"
file$ = "c:\temp\ddd.mp4"
SYS "LoadLibrary", "URLMON.DLL" TO urlmon%
SYS "GetProcAddress", urlmon%, "URLDownloadToFileA" TO URLDownloadToFile
SYS URLDownloadToFile, 0, url$, file$, 0, 0 TO res%
IF res% ERROR 100, "Could not download "+url$
PRINT "File downloaded to " + file$
END

"5." For comparison purposes I also tried to download "http://www.google.com", and that downloaded successfully.

"6." The first time it indeed downloaded the Coursera HTML login page (instead of the .mp4 file), as expected.

"7." Then I logged in (for the first time) and tried again.

"8." Result: it again downloaded the HTML login page (and not the .mp4 file).

"9." Expected was that it would download the .mp4 file after the first login.

"10." Conclusion: generalizing from this result by induction, and if it is indeed correct, you cannot circumvent the login and automatically download the files using only the given URL.
So if, on the other hand, it does work, it is indeed using some existing (cookie, ...) information which is already stored or created on the system.

jplehmann commented on May 22, 2024

Since Mechanize doesn't understand Javascript, we have to direct it manually to the authentication endpoint @rbrito pointed out. I also saw that endpoint with Chrome Developer Tools' Network tab.

I used this approach to manually feed Mechanize a form for it to fill out, but my initial attempt is still getting a 403. You can see what I did here. To use this script supply an email and password as two arguments.
http://pastebin.com/fC9Qhim3

All the debugging is turned on so you can see the Request which Mechanize sends, which is this:

send: 'POST /maestro/api/user/login HTTP/1.1
Accept-Encoding: identity
Content-Length: 47
Host: www.coursera.org
User-Agent: Python-urllib/2.7
Connection: close
Referer: https://www.coursera.org/account/signin
Content-Type: application/x-www-form-urlencoded
signin-email=jplehmann&signin-password=XXXX'

I compared that request to the one my browser is (successfully) issuing, and the most noticeable difference was the X-CSRFtoken header. Even when I supplied the one my browser used to Mechanize that didn't work (I'm guessing it is stale at this point). I also included the User-Agent for good measure.

I'm going to sleep, but I thought someone else might be able to push this forward.

from coursera-dl.

jplehmann avatar jplehmann commented on May 22, 2024

Here is the browser's successful request, for comparison:

Request URL:https://www.coursera.org/maestro/api/user/login
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Content-Length:56
Content-Type:application/x-www-form-urlencoded
Cookie:XXXX (I'm guessing this didn't matter)
Host:www.coursera.org
Origin:https://www.coursera.org
Referer:https://www.coursera.org/account/signin
User-Agent:Mozilla/5.0 (...)
X-CSRFToken:XXXX
X-Requested-With:XMLHttpRequest
Form Data
email_address:[email protected]
password:XXXX
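
The two requests can also be compared mechanically. The sketch below copies the header and form-field names from the two dumps quoted in this thread; the helper name `missing_names` is made up for illustration.

```python
# Names copied from the Mechanize request and the browser request quoted above.
mechanize_headers = {
    "Accept-Encoding", "Content-Length", "Host", "User-Agent",
    "Connection", "Referer", "Content-Type",
}
browser_headers = {
    "Accept", "Accept-Charset", "Accept-Encoding", "Accept-Language",
    "Connection", "Content-Length", "Content-Type", "Cookie", "Host",
    "Origin", "Referer", "User-Agent", "X-CSRFToken", "X-Requested-With",
}
mechanize_fields = {"signin-email", "signin-password"}
browser_fields = {"email_address", "password"}


def missing_names(sent, reference):
    """Return names in `reference` that are absent from `sent`, ignoring case."""
    sent_lower = {name.lower() for name in sent}
    return sorted(name for name in reference if name.lower() not in sent_lower)


print("headers only the browser sent:", missing_names(mechanize_headers, browser_headers))
print("form fields only the browser sent:", missing_names(mechanize_fields, browser_fields))
```

Besides X-CSRFToken, the comparison shows that the form field names differ between the two requests (signin-email/signin-password versus email_address/password). Whether that contributes to the 403 is not established in this thread.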

from coursera-dl.

jetume avatar jetume commented on May 22, 2024

One way to get the csrf token and reuse it in the POST to https://www.coursera.org/maestro/api/user/login: http://pastebin.com/FdCGViGC This logs in successfully, getting me session_id and maestro_login cookies. The csrf, session_id and maestro_login cookies seem to be used in the 4-step redirects in order to get the session cookie, which is then used to get the course's lecture/index page...
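
For readers who cannot open the pastebin, here is a rough sketch of the general idea using only the Python 3 standard library. It is not the pastebin's code. The URLs, form field names and extra headers are taken from the requests quoted earlier in this thread; the cookie name `csrftoken` and the function names are assumptions made for illustration, and the endpoints may no longer behave this way.

```python
import http.cookiejar
import urllib.parse
import urllib.request

SIGNIN_URL = "https://www.coursera.org/account/signin"
LOGIN_URL = "https://www.coursera.org/maestro/api/user/login"
CSRF_COOKIE = "csrftoken"  # assumption: the cookie name is not given in this thread


def build_login_request(email, password, csrf_token):
    """Build the login POST, echoing the CSRF token in the X-CSRFToken header."""
    form = urllib.parse.urlencode(
        {"email_address": email, "password": password}
    ).encode("ascii")
    headers = {
        "Referer": SIGNIN_URL,
        "X-CSRFToken": csrf_token,
        "X-Requested-With": "XMLHttpRequest",
    }
    # Passing `data` makes urllib issue a POST instead of a GET.
    return urllib.request.Request(LOGIN_URL, data=form, headers=headers)


def login(email, password):
    """Fetch a CSRF cookie, then log in; return the populated cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # First request: made only for its side effect of filling the jar.
    opener.open(SIGNIN_URL).close()
    token = next((c.value for c in jar if c.name == CSRF_COOKIE), None)
    if token is None:
        raise RuntimeError("no CSRF cookie received")
    # Second request: the jar also collects whatever session cookies come back.
    opener.open(build_login_request(email, password, token)).close()
    return jar
```

Calling `login("you@example.com", "secret")` would perform both requests. Because the same opener and jar are reused, the token sent in the header is fresh, which avoids the stale-token problem suspected above.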

from coursera-dl.

jplehmann avatar jplehmann commented on May 22, 2024

That seems to be a bad url.

On Sun, Mar 17, 2013 at 2:17 PM, Suresh Jayanty [email protected]:

I seem to have it working on at least os x. The diff is here:
https://gist.github.com/jetume/5183131



from coursera-dl.

prwteas avatar prwteas commented on May 22, 2024

It seems the link is bad, but I managed to save the original. Here it is:

https://gist.github.com/prwteas/5183341

On Mar 17, 2013, at 8:17 PM, Suresh Jayanty wrote:

I seem to have it working on at least os x. The diff is here: https://gist.github.com/jetume/5183131



from coursera-dl.

jetume avatar jetume commented on May 22, 2024

Sorry, I deleted that gist after realizing that the fix was only partial. I am posting a new one which works at least with the download_file_nowget option: https://gist.github.com/jetume/ac8142f104c8ae70b485

It does not work with the wget, curl, or aria2 options yet.

from coursera-dl.

jetume avatar jetume commented on May 22, 2024

Updated the gist with proposals for the other download options (wget, curl, and aria2). So far I've tested this only on OS X.

from coursera-dl.

rbrito avatar rbrito commented on May 22, 2024

@jetume, could you update your gist to our latest version from the master branch? I noticed that I made some changes after you created the fork. That would make things easier to review...

from coursera-dl.

rbrito avatar rbrito commented on May 22, 2024

For everybody, here is an excerpt of what I had been discussing with @jplehmann in private:

(...)

As I mentioned there, I think that
trying to keep state when there are solutions that do the job already
is reinventing the wheel badly. And while I was looking at one of
those libraries/modules that already do the job of keeping state, I
found out....

drumroll, please...

https://github.com/sharat87/coursera-downloader

which works (well, at least it did for me) and it is a fork of our
code, but the person never bothered to contribute back.

His solution is not clean (the cookies are hardcoded), but those can
be cleaned up.

(...)

So, I would be willing to merge a minimally intrusive patch (better yet, a sequence of patches) that fixes the problem by replacing uses of urllib2 with the requests library, as done above. The smaller the patches are, the better for the review process.

We can clean up the code by installing context managers (the things that get called when the word with is used) after the change. The same goes for other non-essential things.
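
As a small illustration of what such a cleanup could look like, here is a sketch of a context manager written with `contextlib`. The name `temporary_cookie_file` and the idea of a throwaway cookie file are hypothetical and only serve as an example; this is not code from the script.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_cookie_file(contents):
    """Yield the path of a temp file holding `contents`; delete it on exit."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(contents)
        yield path
    finally:
        # Runs even if the body of the `with` block raises.
        os.remove(path)


with temporary_cookie_file("# Netscape HTTP Cookie File\n") as cookie_path:
    print(os.path.exists(cookie_path))  # the file exists inside the block
print(os.path.exists(cookie_path))      # and has been removed afterwards
```

The benefit is that the cleanup lives in one place and cannot be forgotten on an error path.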

Thanks.

Rogério Brito.

from coursera-dl.

rbrito avatar rbrito commented on May 22, 2024

@jetume, I looked a little bit closer at your code:

  1. It's incomplete. In particular, it stops in the middle of a line (a return 1)
  2. Your gist contains only a patch. Can you update it with the corresponding versions of the actual script, so that we can see diffs of Python instead of diffs of diffs?
  3. In the write_cookie_file of your patch, you set a bunch of handlers (OK here), then you build an opener (OK), then you actually make a request with the line res = opener.open(req), but the variable res does not seem to be used in the rest of the script and it is not clear what you meant by that. Were you only interested in the side effects of the call?
  4. You have an import of the pdb module that appears to be unused.

If you fix the above and send a pull request, I am willing to merge your changes, which we can clean further at a later moment.

Thanks,

Rogério.

from coursera-dl.

jplehmann avatar jplehmann commented on May 22, 2024

@rbrito I heartily agree about leveraging Requests. It looks very nice, and would be a big improvement to our codebase.

https://github.com/kennethreitz/requests

from coursera-dl.

jplehmann avatar jplehmann commented on May 22, 2024

I want to thank @jetume and everyone else who looked into fixing this!

from coursera-dl.

rbrito avatar rbrito commented on May 22, 2024

Yes, the reason for merging @jetume's patch was that he did a great job without requiring any extra modules, and that it was the most conservative/low-risk change of all the options we had.

So, thanks again to @jetume, @prwteas and everybody who shed light on this. I am sure that we all learned a little bit and that we will be better equipped for similar situations in the future (which will certainly happen, given the nature of the web).

from coursera-dl.

jetume avatar jetume commented on May 22, 2024

Thanks guys! :) happy to contribute

from coursera-dl.

Smitty010 avatar Smitty010 commented on May 22, 2024

Thanks to everyone who worked on this. I've been following the emails and can tell it didn't come easy. I confirm that the new version works (and that was testing on about a dozen different classes).

from coursera-dl.
