Comments (27)
Unfortunately fixing this is going to be rather involved due to the way a tar file is structured. Since members of an archive are a data blob preceded by a header, you have to perform a lot of IO to remove a member (find the header, find the end of the data blob, then overwrite from the former with data from the latter). For this reason most tar implementations don't have an option to remove a member, and this is true of Python's tarfile (see Python issue 6818 for more details: http://bugs.python.org/issue6818).
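To illustrate the cost: the only way to "remove" a member with the stdlib tarfile module is to copy every other member into a fresh archive and swap it into place. This is a sketch, not sos code; the helper name is made up.

```python
import os
import tarfile

def remove_member(path, name):
    """Illustrative sketch: 'removing' a member means rewriting the
    whole archive, copying every member we want to keep."""
    tmp = path + ".tmp"
    with tarfile.open(path) as src, tarfile.open(tmp, "w") as dst:
        for info in src.getmembers():
            if info.name == name:
                continue
            # extractfile() only works for regular files; other member
            # types (dirs, symlinks) carry no data blob
            fobj = src.extractfile(info) if info.isreg() else None
            dst.addfile(info, fobj)
    os.replace(tmp, path)
```

So every removal is a full rewrite of the archive, which is exactly the IO cost described above.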
This makes this pretty much unfixable with the current TarFileArchive implementation (and probably ZipFileArchive too - I haven't checked).
I am not sure right now whether it's better to plough on fixing this or to revert to shelling out to these programs.
The problem is that when we call the self.addCopySpec*() or getExtOutput() functions in setup(), we will later call copy_stuff() -> doCopyFileOrDir() -> self.archive.add_file(). This writes the header and data blobs to the tarfile, and a later run of doFileSub() etc. will just call self.archive.add_string(), adding a new member to the archive with a colliding name.
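The collision is easy to reproduce with the stdlib alone; tarfile happily appends a second member under the same name rather than replacing the first. The add_string() helper here is a minimal stand-in for the archive class's method, not the actual sos implementation.

```python
import io
import os
import tarfile
import tempfile

def add_string(tar, name, data):
    """Minimal stand-in for the archive's add_string() (illustrative only)."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

path = os.path.join(tempfile.mkdtemp(), "collide.tar")
with tarfile.open(path, "w") as tar:
    add_string(tar, "etc/conf", b"rootpw secret\n")   # written by the copy phase
    add_string(tar, "etc/conf", b"rootpw ******\n")   # added again by post-processing

with tarfile.open(path) as tar:
    names = [m.name for m in tar.getmembers()]
# both blobs are now in the archive under the same name
```

On extraction the second member typically overwrites the first, but both copies (including the unredacted one) remain readable in the archive itself.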
from sos.
Zipfiles have this problem too. I wasn't aware of it in tarfiles (sadly disappointed, but not surprised). Maybe we should figure out a way to stream files to the archive and substitute/filter before they are ever written rather than try to archive them then re-archive them.
Shelling out to the archive tools is bad from a cross platform standpoint, but it might be preferable to do the previous trick of writing to a temp directory then using the python libraries to compress the finished directories.
from sos.
Afaics that's the only way this can work without going back to shelling out.
Unfortunately both of these look quite involved - the file list stuff would need to move from Plugin to the archive classes (along with the copy_stuff() machinery) or we'd need to re-work copy_stuff() and postproc() to be able to do the post-processing in-line. Going back to shelling out is also complicated now as we'd have to reintroduce all the temp file handling stuff and callouts for cp, tar etc.
If it wasn't for the security aspect I'd almost say ignore it (with a hack to fix the 555 perms caused by fscaps, the unpack error goes away), but redaction is the whole point of postproc() and something we can't ignore.
from sos.
I think we could hide the temp folder behind the Archive class, so that plugins wouldn't have to know that we are really going to spool to a temp directory. The archive could do all its actual work on a close() function. I'll try to make time to at least mock this up soon.
I think going back to shelling out for tar/xz/gz/bz2 etc is a bad idea (causes problems for sure.)
I now think that the streaming/pipeline stuff is probably more difficult than just using a temp directory.
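The temp-directory idea could look something like this. This is a sketch only; the class name and methods are hypothetical, not the actual sos Archive API.

```python
import os
import shutil
import tarfile
import tempfile

class TempDirArchive:
    """Sketch: plugins keep calling add_string()/add_file() as before,
    but data is spooled to a hidden temp directory. The real tarball
    is only built when close() is called, so postproc() can freely
    rewrite the spooled files before that point."""
    def __init__(self, tarball, root="report"):
        self.tarball = tarball
        self.root = root
        self.tmpdir = tempfile.mkdtemp()

    def add_string(self, content, dest):
        path = os.path.join(self.tmpdir, dest.lstrip("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(content)

    def close(self):
        # all post-processing has happened on plain files by now
        with tarfile.open(self.tarball, "w:gz") as tar:
            tar.add(self.tmpdir, arcname=self.root)
        shutil.rmtree(self.tmpdir)
```

Since substitutions then operate on ordinary files in the temp directory, the member-collision and size problems disappear entirely.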
from sos.
Yeah, that could work - I might try to hack something up here. It's a bit annoying in some ways as it means we're double-copying everything (triple-copying everything that gets postproc()'d), but I agree trying to remodel postproc() to work in-line with copy_stuff() sounds more complex and fragile, and possibly more inefficient too. It depends how you organise the regex substitution calls - if you did it the dumb way, adapting what we have now, you'd wind up compiling a regex for each line. Obviously you wouldn't do it that way, but it means more work to do it right.
from sos.
I've started playing with this in a private branch to see how complicated the in-line processing is. From what I've got so far I think I either need to:
- Pass a filter object over to the Archive class for each file to do substitutions
or:
- Only use add_string() to put content into the archive. Do all the processing inside the Plugin class and hand ready-processed strings to the archive.
The only other way I could think to do it would mean returning an object with an IO-like interface from the archive using it in Plugin which seems over complicated.
I don't like the second option because, although all our current substitutions are on small-ish files, it's possible we could want to filter logs at some point. This would mean needing to scoop potentially very large logs into memory. Although that's how the current code works, it's probably something we should try to avoid.
from sos.
Pass a filter object over to the Archive class for each file to do substitutions
I don't see there being a problem with this approach. It could look like this:
self.addCopySpec("*/foo/bar/*.log", filter=my_filter_func)
from sos.
I think we need a closure of some sort in order to bind an re object - at least, with my current approach - I don't want to end up compiling a regex for every line of a file.
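A closure works well for this: compile the regex once when the filter is created, and reuse it for every line. A minimal sketch (names illustrative):

```python
import re

def make_filter(pattern, replace):
    """Return a per-line filter bound to one compiled regex, so
    filtering a file line by line never recompiles the pattern."""
    rx = re.compile(pattern)
    def apply(line):
        return rx.sub(replace, line)
    return apply

mask_rootpw = make_filter(r"(rootpw\s+)\S+", r"\1******")
# mask_rootpw("rootpw s3cret") -> "rootpw ******"
```

Each filter object then carries its compiled pattern with it wherever the Archive class calls it.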
from sos.
Doing this in-line is getting a bit nasty. The problem is that we're using addfile() to add members to the archive. That wants a fileobj to read from and a pre-filled tarinfo, and tarinfo.size MUST be the size of the file /after/ our substitutions. If it ends up too small we truncate the file; if it ends up too big we hit an IO exception.
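Both failure modes are easy to demonstrate with the stdlib alone:

```python
import io
import os
import tarfile
import tempfile

path = os.path.join(tempfile.mkdtemp(), "size.tar")
data = b"password"

# tarinfo.size smaller than the data: tarfile copies exactly size
# bytes, silently truncating the member
info = tarfile.TarInfo("a.txt")
info.size = 4
with tarfile.open(path, "w") as tar:
    tar.addfile(info, io.BytesIO(data))
with tarfile.open(path) as tar:
    truncated = tar.extractfile("a.txt").read()   # b"pass"

# tarinfo.size larger than the data: tarfile hits end of file and
# raises ("unexpected end of data")
info = tarfile.TarInfo("b.txt")
info.size = len(data) + 8
hit_io_error = False
try:
    with tarfile.open(path, "w") as tar:
        tar.addfile(info, io.BytesIO(data))
except OSError:
    hit_io_error = True
```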
The only way I hit on to fix this so far is to double-substitute the string, calculate the change in length and then pad the string out to fit (so there is no change in size). This "works" fine for the common case of shortening the file but risks truncation if we ever lengthen it:
    def filter(self, buf):
        print "filtering: %s" % buf
        saved_buf = buf
        buf = re.sub(self.re, self.replace, buf)
        pad = len(saved_buf) - len(buf)
        return re.sub(self.re, self.replace + (pad * ' '), saved_buf)
I could add code to shorten the replacement string if pad becomes negative, but I'm not sure it is worth it unless we re-work the way we are processing files to begin with (I am starting to think that may be necessary anyway...), because we slurp up the whole file content in TarFileArchive.add_file(). This worries me when it comes to multi-GB files; I haven't tested yet, but I suspect that it sucks horribly.
from sos.
This looks like it may be getting sideways on you. Could we recalculate the size and update the TarInfo object with the proper size?
In TarFileArchive.add_file(), why are we reading the entire file? We call stat; is there a reason that doesn't work? I don't remember the reasoning now...
from sos.
Well, there's a couple of things going on here:
- Reading whole files into memory is probably bad (the code already does this but it would be good to fix it)
- Filtering in-line is very complicated to do right with the current interfaces
Right now we do:
    content = fp.read()
    fileobj = StringIO(content)
    self.tarfile.addfile(tar_info, fileobj)
So all the IO is done in one go and effectively passed to the tarfile as a string. This is the part that concerns me for large files.
I'd really like to get rid of this and pass a "real" fileobj into self.tarfile.addfile() - this already uses streamed IO with a 16K buffer to copy the file content which should reduce memory impact for large copies.
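For comparison, passing a real file object looks like this (a sketch; the helper name is made up, but gettarinfo() and addfile() are the stdlib calls involved):

```python
import tarfile

def stream_file(tar, src_path, arcname):
    """Pass a real file object so tarfile.addfile() streams the copy
    in fixed-size chunks rather than slurping the file via read()."""
    info = tar.gettarinfo(src_path, arcname=arcname)
    with open(src_path, "rb") as f:
        tar.addfile(info, f)
```

Here gettarinfo() fills tarinfo.size from stat(), so the size is right by construction as long as nothing rewrites the content in flight.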
Setting that problem aside for now though, my first cut at in-line filtering was to define a filter class:
class RegexFilter(object):

    pattern = None
    replace = None
    re = None

    def __init__(self, pattern, replace):
        self.pattern = pattern
        self.replace = replace
        self.re = re.compile(pattern)

    def filter(self, buf):
        print "filtering: %s" % buf
        saved_buf = buf
        buf = re.sub(self.re, self.replace, buf)
        delta = len(saved_buf) - len(buf)
        if delta < 0:
            return re.sub(self.re, self.replace[:delta], saved_buf)
        return re.sub(self.re, self.replace + (delta * '\x00'), saved_buf)
Filters are then attached to paths via a per-plugin list (like files and commands):
self.addFileSub("/root/anaconda-ks.cfg",
                r"(\s*rootpw\s*).*",
                r"\1******")
To make this work with streaming IO I was planning to use a FilterIO class that takes a filter and a fileobj and wraps the IO calls, calling the filter's filter() method on each buffer read out from the file source.
The thinking was that this would allow us to filter efficiently while the tarfile class is reading the file from the fileobj; a file would be read in small units, filtered and written out without using lots of memory or making a temporary copy in the file system.
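A minimal sketch of that FilterIO wrapper (hypothetical, not committed code) makes the two structural caveats visible: the filter must not change the buffer length, and a match that straddles two buffer reads will be missed entirely.

```python
class FilterIO:
    """Sketch: wrap a source fileobj and run the filter over each
    buffer as tarfile reads it out. Assumes a length-preserving
    filter; matches spanning buffer boundaries are not handled."""
    def __init__(self, fileobj, filter_func):
        self.fileobj = fileobj
        self.filter_func = filter_func

    def read(self, size=-1):
        buf = self.fileobj.read(size)
        return self.filter_func(buf) if buf else buf
```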
The problem is that this means that the filter must not change the file's size. We often shrink the file slightly due to substitutions like:
rootpw --iscrypted $6$9EBYuYGhyUZZfXHq$/JYm4xHxNfFil26RxkjKWlh9.H9he8TX9E6kjJlGpILnW6439LBvaTCaz5t3i6.Tn53HWXnAxEvlSPby6GJ.a0
(125 chars)
To:
rootpw ******
(13 chars)
The code above handles this by checking for a change in size and padding the string out or truncating it to suit. That works for simple cases but messes up when, e.g., the replacement is <tag>******</tag> - we'd delete the close tag rather than the asterisks.
Making this work for the general case with the pattern/replace interface seems complicated - we'd need to parse the replacement string and remove the blanks without disturbing other text or capture group references.
One option would be to restrict the set of replacements allowed.
from sos.
I understand now. On the surface, it doesn't seem like streaming will work well with the tarfile api in python's stdlib.
from sos.
If you are really feeling reckless/brave you could reimplement addfile():
def addfile(self, tarinfo, fileobj=None):
    """Add the TarInfo object `tarinfo' to the archive. If `fileobj' is
       given, tarinfo.size bytes are read from it and added to the archive.
       You can create TarInfo objects using gettarinfo().
       On Windows platforms, `fileobj' should always be opened with mode
       'rb' to avoid irritation about the file size.
    """
    self._check("aw")

    tarinfo = copy.copy(tarinfo)

    buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
    self.fileobj.write(buf)
    self.offset += len(buf)

    # If there's data to follow, append it.
    if fileobj is not None:
        copyfileobj(fileobj, self.fileobj, tarinfo.size)
        blocks, remainder = divmod(tarinfo.size, BLOCKSIZE)
        if remainder > 0:
            self.fileobj.write(NUL * (BLOCKSIZE - remainder))
            blocks += 1
        self.offset += blocks * BLOCKSIZE

    self.members.append(tarinfo)
Though I don't know that there is much we can do. The format goes:
[tarinfo]
[data]
[tarinfo]
[data]
...
from sos.
That may actually be easier than all this :-)
It may also still prove better to go back to building the tree in /tmp - I'm testing with a ~250M file in /var/log now and sure enough sosreport is eating a quarter gig:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2955 root 20 0 466m 263m 4208 R 91.9 26.5 2:37.27 sosreport
from sos.
It really does seem like the simplest and most correct way to do this is to write to a tmp directory. We could still use the streaming filter style, but it seems like all the tools oppose streaming directly into a tarfile.
from sos.
Using /tmp has its own set of problems, but we've managed with it for a long time (although tmp-on-tmpfs in some ways renders all this worrying about buffering moot, since the data's going to wind up in RAM or on swap anyway).
I really like the idea of doing this inline; it's elegant and potentially more efficient but implementing it is currently looking to be a giant can of worms.
from sos.
I like the idea of streaming-to-tmp. We'd have to change the way we handle substitutions to play nicely with streaming (i.e. don't use re.sub against the entire file). That'd mean we only end up with one copy before going to tar/zip.
from sos.
I've committed the whole sticky mess on a private branch for now:
I'll think about this some more and come back to it..
from sos.
btw the sort of interface change I was thinking of was along these lines:
def addFileSub(path, pattern, replace):
Becomes:
def addFileSub(path, prefix, replace, suffix):
And the substitution is always of the form:
$prefix$replace$suffix -> $prefix*{1,len($replace)}$suffix
That makes the size change problem much easier to tackle correctly as you know the length of $prefix and $suffix and the replacement is always a single character repeat.
The parameters would just be concatenated to generate the regex, e.g.:
addFileSub("/etc/some.xml", "<secrettag>", ".*", "</secrettag>")
Becomes (<secrettag>)(.*)(</secrettag>)
and the replacement string is "\1" + len(\2) * '*' + "\3"
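A sketch of that interface (the helper name is made up; this is the proposal, not shipped sos code): building the regex from the three concatenated groups and masking the match with an equal number of asterisks means the substitution can never change the file size.

```python
import re

def make_sub(prefix, pattern, suffix=""):
    """Sketch of the proposed addFileSub() interface: the regex is
    (prefix)(pattern)(suffix) and the replacement masks the middle
    group with '*' of equal length, preserving overall size."""
    groups = "(%s)(%s)" % (prefix, pattern)
    if suffix:  # empty suffix: 3rd group is simply dropped
        groups += "(%s)" % suffix
    rx = re.compile(groups)

    def repl(m):
        tail = m.group(3) if suffix else ""
        return m.group(1) + "*" * len(m.group(2)) + tail

    return lambda text: rx.sub(repl, text)

mask = make_sub("<secrettag>", ".*", "</secrettag>")
# mask("<secrettag>hunter2</secrettag>") -> "<secrettag>*******</secrettag>"
```

Because the output length always equals the input length, tarinfo.size computed before filtering remains valid, which is exactly what the streaming approach needs.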
from sos.
How would that work with things like:
[client]
user=foo
password=bar
There is no suffix in this case? Or are you thinking of '\n' as the suffix?
from sos.
Sorry, should've mentioned that - Null or empty string would just get dropped (i.e. no 3rd group).
from sos.
Just curious, did you ever arrive at a solution?
from sos.
I have two: the fully streamed version that I prototyped last year, and reverting to a tempdir-based fix. Either works - we need to make a decision in the next two weeks on which one we'll go with in the next release.
from sos.
Do we have any more thoughts on how to handle this? Are we still leaning towards reverting this functionality to a tmpdir?
from sos.
Right now that's my plan. I haven't been able to put any time into it yet because of other work. Aside from the security problems, I really think the streaming code needs a complete rewrite - on fairly modest runs I now see sosreport chewing through ~750M of RSS, which is obviously unacceptable considering the environments we expect the tool to run in. I've not put any time into debugging this and I'm unsure if it's a leak or a consequence of bad Python, but it seems to me there are just too many cracks to be papered over in the current approach, and it was given too little testing to understand the real impact of the change when it was first written. Also, although it potentially enables some neat stuff in the future, none of that is actually working today, so the new code is not really buying us anything at all right now (apart from bugs and breakage).
from sos.
on fairly modest runs I now see sosreport chewing through ~750M of RSS - obviously that's completely useless considering when we're expecting the tool to run
Yeah, I've gotten one or two reports mentioning the high resource usage but haven't really looked into it. It looks like this is a known issue due to this bug.
from sos.
Fixed in commit 0178d5f.
from sos.