
Comments (27)

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Unfortunately fixing this is going to be rather involved due to the way a tar file is structured. Since each member of an archive is a data blob preceded by a header, you have to perform a lot of IO to remove a member (find the header, find the end of the data blob, then overwrite the former with everything that follows the latter). For this reason most tar implementations don't offer an option to remove a member, and the same is true of Python's tarfile (see Python issue 6818 for more details: http://bugs.python.org/issue6818).
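
For illustration, the only portable way to "remove" a member is to rewrite the whole archive without it - a minimal sketch (hypothetical helper, not sos code) would be something like:

import tarfile

def remove_member(src_path, dst_path, name_to_drop):
    # rewrite the archive, copying every member except the one to drop
    src = tarfile.open(src_path, "r")
    dst = tarfile.open(dst_path, "w")
    for member in src.getmembers():
        if member.name == name_to_drop:
            continue
        fileobj = src.extractfile(member) if member.isreg() else None
        dst.addfile(member, fileobj)
    dst.close()
    src.close()

That copies every remaining blob, which is exactly the IO cost described above.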

This makes this pretty much unfixable with the current TarFileArchive implementation (and probably ZipFileArchive too - I haven't checked).

I am not sure right now whether it's better to plough on fixing this or to revert to shelling out to these programs.

The problem is that when we call the self.addCopySpec*() or getExtOutput() functions in setup() we later call copy_stuff() -> doCopyFileOrDir() -> self.archive.add_file(). That writes the header and data blobs to the tarfile, and a later run of doFileSub() etc. just calls self.archive.add_string(), adding a new member to the archive with a colliding name.
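
A standalone demonstration of the colliding-name problem (nothing sos-specific, just the stdlib):

import io
import tarfile

buf = io.BytesIO()
tf = tarfile.open(fileobj=buf, mode="w")
for data in (b"rootpw secret", b"rootpw ******"):
    info = tarfile.TarInfo("root/anaconda-ks.cfg")
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))
tf.close()
# the archive now contains two members with the same name; extraction
# typically keeps the later one, but the original data is still present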

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

Zipfiles have this problem too. I wasn't aware of it in tarfiles (sadly disappointed, but not surprised). Maybe we should figure out a way to stream files to the archive and substitute/filter before they are ever written rather than try to archive them then re-archive them.

Shelling out to the archive tools is bad from a cross-platform standpoint, but it might be preferable to go back to the previous trick of writing to a temp directory and then using the Python libraries to compress the finished directories.

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Afaics that's the only way this can work without going back to shelling out.

Unfortunately both of these look quite involved - the file list stuff would need to move from Plugin to the archive classes (along with the copy_stuff() machinery) or we'd need to re-work copy_stuff() and postproc() to be able to do the post-processing in-line. Going back to shelling out is also complicated now as we'd have to reintroduce all the temp file handling stuff and callouts for cp, tar etc.

If it wasn't for the security aspect I'd almost say ignore it (with a hack to fix the 555 perms caused by fscaps, the unpack error goes away), but that's the whole point of postproc() and something we can't ignore.

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

I think we could hide the temp folder behind the Archive class, so that plugins wouldn't have to know that we are really going to spool to a temp directory. The archive could do all its actual work on a close() function. I'll try to make time to at least mock this up soon.
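
Roughly the shape I'm thinking of (just a sketch with made-up names - a real class would need error handling, permissions, compression, etc.):

import os
import shutil
import tarfile
import tempfile

class SpoolingArchive(object):
    """Sketch: spool everything into a private temp dir and only build
    the tarball when close() is called."""

    def __init__(self, name):
        self.name = name
        self.tmpdir = tempfile.mkdtemp()

    def _dest(self, dest):
        path = os.path.join(self.tmpdir, dest.lstrip('/'))
        if not os.path.isdir(os.path.dirname(path)):
            os.makedirs(os.path.dirname(path))
        return path

    def add_file(self, src, dest):
        shutil.copy(src, self._dest(dest))

    def add_string(self, content, dest):
        f = open(self._dest(dest), 'w')
        f.write(content)
        f.close()

    def close(self):
        # all the real archive work happens here
        tar = tarfile.open(self.name + ".tar", "w")
        tar.add(self.tmpdir, arcname=self.name)
        tar.close()
        shutil.rmtree(self.tmpdir)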

I think going back to shelling out for tar/xz/gz/bz2 etc. is a bad idea (it causes problems for sure).
I now think that the streaming/pipeline stuff is probably more difficult than just using a temp directory.

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Yeah, that could work - I might try to hack something up here. It's a bit annoying in some ways, as it means we're double-copying everything (and triple-copying everything that gets postproc()'d), but I agree trying to remodel postproc() to work in-line with copy_stuff() sounds more complex and fragile, and possibly more inefficient too (it depends how you organise the regex substitution calls - if you did it the dumb way, adapting what we have now, you'd wind up compiling a regex for each line; obviously you wouldn't do it that way, but it means more work to do it right).

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

I've started playing with this in a private branch to see how complicated the in-line processing is. From what I've got so far I think I either need to:

  • Pass a filter object over to the Archive class for each file to do substitutions
    or:
  • Only use add_string() to put content into the archive. Do all the processing inside the Plugin class and hand ready-processed strings to the archive.

The only other way I could think to do it would mean returning an object with an IO-like interface from the archive and using it in Plugin, which seems over-complicated.

I don't like the second option: although all our current substitutions are on small-ish files, it's possible we could want to filter logs at some point. That would mean scooping potentially very large logs into memory. Although that's how the current code works, it's probably something we should try to avoid.

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

Pass a filter object over to the Archive class for each file to do substitutions

I don't see there being a problem with this approach. It could look like this:

self.addCopySpec("*/foo/bar/*.log", filter=my_filter_func)

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

I think we need a closure of some sort in order to bind an re object - at least, with my current approach - I don't want to end up compiling a regex for every line of a file.
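
Something like this (hypothetical helper) would do it - the regex is compiled once and the returned function is what gets passed as filter=:

import re

def make_filter(pattern, repl):
    regex = re.compile(pattern)           # compiled once, up front
    def my_filter_func(buf):
        return regex.sub(repl, buf)       # reused for every line/buffer
    return my_filter_func

# e.g. filter=make_filter(r"(\s*rootpw\s*).*", r"\1******")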

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Doing this in-line is getting a bit nasty. The problem is that we're using addfile() to add members to the archive, which wants a fileobj to read from and a pre-filled tarinfo - and tarinfo.size MUST be the size of the file /after/ our substitutions. If it ends up too small we truncate the file; if it ends up too big we hit an IO exception.

The only way I hit on to fix this so far is to double-substitute the string, calculate the change in length and then pad the string out to fit (so there is no change in size). This "works" fine for the common case of shortening the file but risks truncation if we ever lengthen it:

    def filter(self, buf):
        print "filtering: %s" % buf
        saved_buf = buf
        buf = re.sub(self.re, self.replace, buf)
        pad = len(saved_buf) - len(buf)
        # re-run the substitution, padding the replacement with spaces so
        # the overall length of the buffer does not change
        return re.sub(self.re, self.replace + (pad * ' '), saved_buf)

I could add code to shorten the replacement string if pad becomes negative, but I'm not sure it's worth it unless we re-work the way we process files to begin with (I'm starting to think that may be necessary anyway...), because we slurp up the whole file content in TarFileArchive.add_file(). That worries me when it comes to multi-GB files; I haven't tested yet but I suspect it sucks horribly.

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

This looks like it may be getting sideways on you. Could we recalculate the size and update the TarInfo object with the proper size?

In TarFileArchive.add_file(), why are we reading the entire file? We call stat - is there a reason that doesn't work? I don't remember the reasoning now...

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Well, there are a couple of things going on here:

  • Reading whole files into memory is probably bad (the code already does this but it would be good to fix it)
  • Filtering in-line is very complicated to do right with the current interfaces

Right now we do:

content = fp.read()
fileobj = StringIO(content)
self.tarfile.addfile(tar_info, fileobj)

So all the IO is done in one go and effectively passed to the tarfile as a string. This is the part that concerns me for large files.

I'd really like to get rid of this and pass a "real" fileobj into self.tarfile.addfile() - that already uses streamed IO with a 16K buffer to copy the file content, which should reduce the memory impact for large copies.
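
i.e. something along these lines (a sketch of the add I'd like to end up with, not the current code):

import tarfile

def add_file_streamed(tar, path, arcname):
    tar_info = tar.gettarinfo(path, arcname=arcname)  # size comes from stat()
    fileobj = open(path, 'rb')
    tar.addfile(tar_info, fileobj)  # tarfile streams the copy in 16K chunks
    fileobj.close()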

Setting that problem aside for now though, my first cut at in-line filtering was to define a filter class:

import re

class RegexFilter(object):
    pattern = None
    replace = None
    re = None

    def __init__(self, pattern, replace):
        self.pattern = pattern
        self.replace = replace
        self.re = re.compile(pattern)

    def filter(self, buf):
        print "filtering: %s" % buf
        saved_buf = buf
        buf = re.sub(self.re, self.replace, buf)
        delta = len(saved_buf) - len(buf)
        if delta < 0:
            # substitution grew the buffer: truncate the replacement to fit
            return re.sub(self.re, self.replace[:delta], saved_buf)
        # substitution shrank (or kept) the buffer: pad it back out with NULs
        return re.sub(self.re, self.replace + (delta * '\x00'), saved_buf)

Filters are then attached to paths via a per-plugin list (like files and commands):

        self.addFileSub("/root/anaconda-ks.cfg",
                        r"(\s*rootpw\s*).*",
                        r"\1******")

To make this work with streaming IO I was planning to use a FilterIO class that takes a filter and a fileobj and wraps the IO calls, calling the filter's filter() method on each buffer read out from the file source.

The thinking was that this would allow us to filter efficiently while the tarfile class is reading the file from the fileobj; a file would be read in small units, filtered and written out without using lots of memory or making a temporary copy in the file system.
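
Roughly (a sketch - FilterIO isn't written yet):

class FilterIO(object):
    """Wrap a fileobj and run each buffer through a filter as it is read."""

    def __init__(self, fileobj, filter_obj):
        self.fileobj = fileobj
        self.filter = filter_obj

    def read(self, size=-1):
        buf = self.fileobj.read(size)
        # only safe if filter() never changes the length of buf; a match
        # split across two reads would also be missed
        return self.filter.filter(buf)

    def close(self):
        self.fileobj.close()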

The problem is that this means that the filter must not change the file's size. We often shrink the file slightly due to substitutions like:

rootpw --iscrypted $6$9EBYuYGhyUZZfXHq$/JYm4xHxNfFil26RxkjKWlh9.H9he8TX9E6kjJlGpILnW6439LBvaTCaz5t3i6.Tn53HWXnAxEvlSPby6GJ.a0

(125 chars)

To:

rootpw ******

(13 chars)

The code above handles this by checking for a change in size and padding the string out or truncating it to suit. That works for simple cases but messes up when, e.g., the replacement is <tag>******</tag> - truncating would delete the close tag rather than the asterisks.

Making this work for the general case with the pattern/replace interface seems complicated - we'd need to parse the replacement string and remove the blanks without disturbing other text or capture group references.

One option would be to restrict the replacements that are allowed.

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

I understand now. On the surface, it doesn't seem like streaming will work well with the tarfile API in Python's stdlib.

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

If you are really feeling reckless/brave you could reimplement addfile():

    def addfile(self, tarinfo, fileobj=None):
        """Add the TarInfo object `tarinfo' to the archive. If `fileobj' is
           given, tarinfo.size bytes are read from it and added to the archive.
           You can create TarInfo objects using gettarinfo().
           On Windows platforms, `fileobj' should always be opened with mode
           'rb' to avoid irritation about the file size.
        """
        self._check("aw")

        tarinfo = copy.copy(tarinfo)

        buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
        self.fileobj.write(buf)
        self.offset += len(buf)

        # If there's data to follow, append it.
        if fileobj is not None:
            copyfileobj(fileobj, self.fileobj, tarinfo.size)
            blocks, remainder = divmod(tarinfo.size, BLOCKSIZE)
            if remainder > 0: 
                self.fileobj.write(NUL * (BLOCKSIZE - remainder))
                blocks += 1 
            self.offset += blocks * BLOCKSIZE

        self.members.append(tarinfo)

Though I don't know that there is much we can do. The format goes:

[tarinfo]
[data]
[tarinfo]
[data]
...

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

That may actually be easier than all this :-)

It may also still prove better to go back to building the tree in /tmp - I'm testing with a ~250M file in /var/log now and sure enough sosreport is eating a quarter gig:

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                                                            
 2955 root      20   0  466m 263m 4208 R  91.9 26.5   2:37.27 sosreport    

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

It really does seem like the simplest and most correct way to do this is to write to a tmp directory. We could still use the streaming filter style, but it seems like all the tools work against streaming directly into a tarfile.

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Using /tmp has its own set of problems but we've managed with it for a long time (although tmp-on-tmpfs in some ways renders all this worrying about buffering moot since the data's going to wind up in RAM or on swap anyway..).

I really like the idea of doing this inline; it's elegant and potentially more efficient but implementing it is currently looking to be a giant can of worms.

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

I like the idea of streaming-to-tmp. We'd have to change the way we handle substitutions to play nicely with streaming (i.e. don't use re.sub against the entire file). That'd mean we only end up with one copy before going to tar/zip.
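
Something like this, say (rough sketch, made-up helper name):

def copy_filtered(src_path, dst_path, filter_func):
    # scrub line by line while spooling into the temp tree, so the
    # whole file never has to sit in memory
    src = open(src_path, 'r')
    dst = open(dst_path, 'w')
    for line in src:
        dst.write(filter_func(line))
    dst.close()
    src.close()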

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

I've committed the whole sticky mess on a private branch for now:

I'll think about this some more and come back to it..

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

btw the sort of interface change I was thinking of was along these lines:

    def addFileSub(path, pattern, replace):

Becomes:

    def addFileSub(path, prefix, replace, suffix):

And the substitution is always of the form:

$prefix$replace$suffix -> $prefix*{1,len($replace)}$suffix

That makes the size change problem much easier to tackle correctly as you know the length of $prefix and $suffix and the replacement is always a single character repeat.

The parameters would just be concatenated to generate the regex, e.g.:

addFileSub("/etc/some.xml", "<secrettag>", ".*", "</secrettag>")

Becomes (<secrettag>)(.*)(</secrettag>) and the replacement string is "\1" + len(\2) * '*' + "\3"
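
A sketch of a helper that builds the regex and replacement from those parameters (hypothetical, names made up):

import re

def make_sub(prefix, replace, suffix=None):
    groups = "(%s)(%s)" % (prefix, replace)
    if suffix:
        groups += "(%s)" % suffix
    regex = re.compile(groups)

    def repl(match):
        # mask the middle group with '*' so the overall length never changes
        masked = match.group(1) + '*' * len(match.group(2))
        if suffix:
            masked += match.group(3)
        return masked

    return regex, repl

# regex, repl = make_sub("<secrettag>", ".*", "</secrettag>")
# regex.sub(repl, "<secrettag>hunter2</secrettag>")
#   -> '<secrettag>*******</secrettag>'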

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

How would that work with things like:

[client]
user=foo
password=bar

There is no suffix in this case? Or are you thinking of '\n' as the suffix?

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Sorry, should've mentioned that - a null or empty string would just get dropped (i.e. no 3rd group).

from sos.

jhjaggars avatar jhjaggars commented on July 24, 2024

Just curious, did you ever arrive at a solution?

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

I have two: the fully streamed version that I prototyped last year, and reverting to a tempdir-based fix. Either works - we need to decide in the next two weeks which one we'll go with for the next release.

from sos.

adam-stokes avatar adam-stokes commented on July 24, 2024

Do we have any more thoughts on how to handle this? Are we still leaning towards reverting this functionality to a tmpdir?

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Right now that's my plan. I haven't been able to put any time into it yet because of other work. Aside from the security problems, I really think the streaming code needs a complete rewrite - on fairly modest runs I now see sosreport chewing through ~750M of RSS, which is obviously completely useless given the conditions we expect the tool to run under. I've not put any time into debugging this, and I'm unsure whether it's a leak or a consequence of bad Python, but it seems to me there are just too many cracks to be papered over in the current approach, and it was given too little testing to understand the real impact of the change when it was first written. Also, although it potentially enables some neat stuff in the future, none of that is actually working today, so the new code isn't really buying us anything at all right now (apart from bugs & breakage..).

from sos.

adam-stokes avatar adam-stokes commented on July 24, 2024

on fairly modest runs I now see sosreport chewing through ~750M of RSS, which is obviously completely useless given the conditions we expect the tool to run under

Yeah, I've gotten one or two reports mentioning the high resource usage but haven't really looked into them. It looks like this is a known issue, though, and due to this bug.

from sos.

bmr-cymru avatar bmr-cymru commented on July 24, 2024

Fixed in commit 0178d5f.

from sos.
