Comments (16)
The error happened again, and this time we had the log statement in place for more details:
https://github.com/CouncilDataProject/test-deployment/runs/3854359377?check_suite_focus=true#step:8:433
Based on the error message, this means the validator method really did return `False`.
Since the validation that takes place here checks whether the resource exists, it should be that `fsspec` just isn't finding the resource. I have no idea why this would happen unless it's some transient type of error. Will debug further, probably by looking into `fsspec`.
from cdp-backend.
It honestly may just be that we are storing that file and then instantly checking whether it exists. I think most storage systems are "eventually consistent": it may take some time for Google to properly register the file after storing it (especially when we are uploading hundreds of files at a time...).
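If eventual consistency is the culprit, one hedge is to poll for the resource a few times with backoff instead of doing a single existence check immediately after upload. A minimal sketch, not cdp-backend code; the `check` callable and the delay values are hypothetical:

```python
import time


def wait_for_exists(check, attempts=5, base_delay=0.5):
    """Poll an exists-style callable with exponential backoff.

    `check` should return True once the resource is visible; retrying
    papers over eventual-consistency lag right after an upload.
    """
    for attempt in range(attempts):
        if check():
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False
```

The existing validator could wrap its `fs.exists(...)` call in something like this rather than failing on the first `False`.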
Happened on the new seattle-staging deployment too: https://github.com/CouncilDataProject/seattle-staging/runs/3873780139?check_suite_focus=true#step:9:305
@isaacna I think these may also be timeout errors or something similar: fsspec/filesystem_spec#619 (comment)
I am seeing a lot of failures on seattle-staging backfill runs because of timeout errors: https://github.com/CouncilDataProject/seattle-staging/runs/3875382085?check_suite_focus=true#step:9:157
> It honestly may be just that we are storing that file and then instantly checking if it exists. I think most storage systems are "eventually consistent" / it may take some time for Google to properly hit the result after storing. (Especially when we are uploading hundreds of files at a time...)
I'm kinda confused about this, but for this specific case isn't it checking that the remote resource https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf exists via an HTTP request? Or am I misunderstanding what `fsspec.url_to_fs` does? I was thinking that sometimes there's just an error while making the HTTP request (for the non-Google URIs): https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/http.py#L305-L315

Also, for the timeout errors during `resource_copy`: I don't think that's related to the validator issue, but it's still a problem. Maybe we could add some retry logic to `resource_copy`?
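Retry logic around `resource_copy` could be as simple as a small decorator that re-attempts on exception. A sketch under stated assumptions: the attempt count, delay, and exception tuple are placeholders, not the actual cdp-backend implementation:

```python
import functools
import time


def with_retries(attempts=3, delay=1.0, exceptions=(Exception,)):
    """Retry the wrapped function up to `attempts` times on failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of retries, surface the error
                    time.sleep(delay)
        return wrapper
    return decorator
```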
I also created a script that just creates a file store using `fsspec` and checks whether the resource exists. I ran it in a loop but haven't seen a case where the resource suddenly isn't found. Maybe the GitHub Actions runner is just more prone to HTTP errors or timeouts?

This may be going too deep into this specific issue, but `HttpFileSystem`'s `exists` uses a GET request, and according to this, HEAD is more efficient. Maybe we could try that if intermittent HTTP errors are giving us issues?
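For comparison, a HEAD-based existence check needs nothing beyond the standard library; something along these lines (a sketch of the idea, not the fsspec implementation):

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def url_exists(url, timeout=10):
    """Check a URL with a HEAD request: no response body is transferred,
    so it is cheaper than a GET for large files."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (HTTPError, URLError):
        return False
```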
For the `FSTimeoutError`: the videos we're trying to download in the run you sent earlier are pretty large, all of them being 2+ hours, with one being over 3 hours. The ones we use in the test deployment data tend to be much shorter, with only one of them being close to 3 hours and most being under 30 minutes: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/mock_get_events.py#L54-L69
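Since the timeouts correlate with video length, one option would be to scale the timeout with the expected download size rather than using a fixed value. A hypothetical helper; the 1 MB/s bandwidth floor and 60-second minimum are assumptions, not measured numbers:

```python
def estimate_timeout(size_bytes, min_bytes_per_sec=1_000_000, floor_sec=60):
    """Rough download timeout: assume a conservative bandwidth floor so
    multi-hour videos get proportionally more time than short clips."""
    return max(floor_sec, size_bytes // min_bytes_per_sec)
```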
> I also created a script that just creates a file store using `fsspec` and checks whether the resource exists. I ran it in a loop but haven't seen a case where the resource suddenly isn't found. Maybe the GitHub Actions runner is just more prone to HTTP errors or timeouts? This may be going too deep into this specific issue, but `HttpFileSystem`'s `exists` uses a GET request, and according to this, HEAD is more efficient. Maybe we could try that if intermittent HTTP errors are giving us issues?

Oh interesting... Making a HEAD request for `.get` on `HttpFileSystem` seems smart.
> For the `FSTimeoutError`: the videos we're trying to download in the run you sent earlier are pretty large, all of them being 2+ hours, with one being over 3 hours.

Yea, those are likely real timeouts then. In which case we should add the `timeout={some_int}`. And btw, we already have a `retry=3` on the task.
It looks like event gather runs 160-164 (the ones since bumping the cdp-backend version) haven't encountered the `could not be archived` issue. I'll wait for a few more runs before saying that the HEAD request fixed the issue, but fingers crossed that it did.
> Yea, those are likely real timeouts then. In which case we should add the `timeout={some_int}`. And btw, we already have a `retry=3` on the task.

Also, I may just not be reading the fsspec docs right, but where is the `timeout` property that you're passing to `fs.get` actually defined in `fsspec`? I don't see it in `HttpFileSystem`, and its parent class `AsyncFileSystem` is pretty confusing. It looks like the `FSTimeoutError` gets thrown here, but I'm not sure if the `timeout` param we're passing actually makes it to this method.
Yep. I have seen far fewer errors in the logs as well.

> Also, I may just not be reading the fsspec docs right, but where is the `timeout` property that you're passing to `fs.get` actually defined in `fsspec`? I don't see it in `HttpFileSystem`, and its parent class `AsyncFileSystem` is pretty confusing. It looks like the `FSTimeoutError` gets thrown here, but I'm not sure if the `timeout` param we're passing actually makes it to this method.

I know. I have been digging through the code as well, and the `AsyncFileSystem` is hard to navigate. I was able to reproduce the timeout error on my local machine, so I can try to debug it as well.

I will say, this issue and #120 combined are the primary reasons for the v3 pipeline to fail. More on #120 in its issue comments.
> I know. I have been digging through the code as well, and the `AsyncFileSystem` is hard to navigate. I was able to reproduce the timeout error on my local machine, so I can try to debug it as well.

Sounds good. There's probably some simple way to increase the timeout, but it isn't obvious based on the docs or code. It may be worth trying to clone `fsspec` and mess around with it directly.

> I will say, this issue and #120 combined are the primary reasons for the v3 pipeline to fail. More on #120 in its issue comments.

Yeah, I think fixing these should be a priority, but at least issue #120 seems to have a fairly straightforward solution.
> Sounds good. There's probably some simple way to increase the timeout, but it isn't obvious based on the docs or code. It may be worth trying to clone `fsspec` and mess around with it directly.

If you find an event that is over 3 hours and 10 minutes, you can probably just give `fsspec` plus that weird HTTP options dict a try. Err, really, just find an event video that is really long and run the pipeline with the `-f` and `-t` options with dates that surround the video date, imo.
> fsspec + that weird HTTP options dict a try

Which HTTP options are you referring to? Also, if we have trouble timing out with `fsspec`, we could consider using something else instead, like `requests` or `urllib`.
From this comment: fsspec/filesystem_spec#619 (comment)

```python
import aiohttp
import fsspec

with fsspec.open(
    'filecache::https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/physical/ne_10m_land.zip',
    https={'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=1)}},
) as f:
    ...
```

I am wondering if we can pass that `https={...}` to the `.get` function.
Looked into the issue and I figured out how to pass the `client_kwargs` to `HttpFileSystem`. In our case, we'd have to pass it to `url_to_fs` so that it instantiates the `aiohttp.ClientSession` in `HttpFileSystem` with the kwargs. Will put out a PR sometime tomorrow.
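Based on the linked fsspec comment, the wiring might look roughly like this. The helper name and the 1800-second total are illustrative; the real change would live wherever cdp-backend calls `url_to_fs`:

```python
import aiohttp
import fsspec.core


def url_to_fs_with_timeout(uri, total_timeout_sec=1800):
    """Build the filesystem via url_to_fs with client_kwargs, so that
    HttpFileSystem creates its aiohttp.ClientSession with a longer
    total timeout instead of aiohttp's default."""
    fs, path = fsspec.core.url_to_fs(
        uri,
        client_kwargs={"timeout": aiohttp.ClientTimeout(total=total_timeout_sec)},
    )
    return fs, path
```

Note that the timeout has to be set when the filesystem (and thus the session) is constructed; no network traffic happens until the filesystem is actually used.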