
marfs's People

Contributors

atorrez, bertschinger, brettkettering, cadejager, dmlb2000, gransom, jti-lanl, leicao88124, sfpwhite, shanegoff, thewacokid, wfvining


marfs's Issues

Xattr Not Being Created

I ran make_uni_mult from the script directory on Friday and noticed that my quota script was not finding the files that were created. It turns out that the files existed in the GPFS space, but the only xattr on every file was the one below:

file: uni.195

user.marfs_restart=0sAQA=

In addition, when I deleted these files (from the fuse mount) I could not find them in trash.

Alfred

Two senses of marfs "version"

There are two senses in which we have configuration versions:

(1) config-file parsing. In this case the config-file has a version, and the config-reader could compare that with some hardwired defines, to ensure that it is competent to perform the parse.

(2) xattr parsing. Any objects written by a given version of the software are stamped with the "version" that was in effect when they were written. This includes the xattr-parsers and xattr-writers (e.g. str_2_pre() and pre_2_str(), respectively).

But these are two different things. Additions to the config-file structure may require changes in the config-reader, so maybe it makes sense that there would be #defines that identify the SW version, which the reader would compare with the config file. But the same is true of the xattr parser/writer. And the two can change independently, though they are related.

So, what do we really care about? We really care about xattrs on files. The SW should stamp the xattrs with a SW version-number, so that we can know how they were written, so we can know how to read them. It may also tell us something about how chunk-info is formatted inside MD files, or about how recovery-info is formatted in objects.

The "version" in the config file is an independent thing, which could let the config-reader know something about how to read it. Maybe it's not needed? The config reader is just supposed to be flexible? But if you have a newer config-file, don't you want old software to realize that it is out of its depth?

I propose that the version in the config-file pertains to the config-reader only, and that the version written into xattrs is a different thing. Thus, we have two sets of #defines. One tells us about what kinds of configuration-versions we can handle, and changes whenever the configuration-reader changes. The other goes into objects, and tells us how we have to read them and their metadata.
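A minimal sketch of the proposal, with hypothetical names (the actual MarFS identifiers may differ): one pair of defines that the config-reader compares against the version in the config-file, and an independent pair that the xattr code stamps into (and checks against) file metadata.

    /* What the config-READER can parse. Bump when the reader changes. */
    #define MARFS_CONFIG_MAJOR  0
    #define MARFS_CONFIG_MINOR  1

    /* What the xattr writers/parsers (e.g. pre_2_str()/str_2_pre())
     * emit and understand. Bump when the xattr format changes. */
    #define MARFS_SW_MAJOR      0
    #define MARFS_SW_MINOR      1

    /* config-reader: refuse a config-file newer than we understand */
    int check_config_version(int file_maj, int file_min) {
        if ((file_maj != MARFS_CONFIG_MAJOR) || (file_min > MARFS_CONFIG_MINOR))
            return -1;   /* old software, newer config: out of our depth */
        return 0;
    }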

Scality recommends tcp_syncookies=0

Ensure that this network setting is set to 0 (zero) on the ring servers for sure. We may not want to do it on the FTAs, because they are somewhat "outward facing" and setting it there could open the FTAs to a SYN-flood attack. Though, if anyone on our network does that to our FTAs, they're going to lose their privilege to compute here.
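For reference, assuming the ring servers run Linux, the persistent form of the setting lives in /etc/sysctl.conf (applied with sysctl -p, or immediately with sysctl -w):

    # ring servers only; leave syncookies enabled on the outward-facing FTAs
    net.ipv4.tcp_syncookies = 0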

Where to retrieve configuration parser code?

Hi,

While trying to build the MarFS code, I got stuck on the compilation in common/configuration/src, which requires PARSE_DIR to be defined, pointing to "the path to your install of Ron's code". If I'm not mistaken, this code isn't present in the repository, and I couldn't find it elsewhere.

Any pointers? Thanks!

Export FUSE mount from batch FTAs

Work with Chris Mitchell and team to set up the batch FTAs to re-export the FUSE mount.

This allows us not to worry about unsharing the MarFS GPFS metadata file system mount. And it allows us to use iptables to deny the interactive FTAs access to the object store.

Implement "fake" marfs_statfs()

A comment in the code says "NOTE: Until fsinfo is available, we're just ignoring". Then we return ENOSYS.

The idea behind the comment is that once we have fsinfo maintained semi-regularly by Alfred's scripts, we can provide semi-up-to-date information about available storage, etc., in a MarFS file system. What will be needed is to read fsinfo and extract info (if fsinfo contains detailed output as a result of the scripts), or to stat it (if fsinfo just gets truncated to size). I believe the former is the case. Then we fill out the statvfs struct and return.
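A minimal sketch of that approach, assuming fsinfo holds a detailed total/used byte-count pair produced by the scripts (the fsinfo format and both function names below are hypothetical):

    #include <stdio.h>
    #include <string.h>
    #include <sys/statvfs.h>

    /* hypothetical fsinfo format: "<total_bytes> <used_bytes>" */
    static int read_fsinfo(const char* path,
                           unsigned long long* total, unsigned long long* used) {
        FILE* f = fopen(path, "r");
        if (!f) return -1;
        int n = fscanf(f, "%llu %llu", total, used);
        fclose(f);
        return (n == 2) ? 0 : -1;
    }

    int marfs_statfs_fake(const char* fsinfo_path, struct statvfs* stv) {
        unsigned long long total = 0, used = 0;
        if (read_fsinfo(fsinfo_path, &total, &used))
            return -1;                     /* no fsinfo: caller falls back to ENOSYS */
        memset(stv, 0, sizeof(*stv));
        stv->f_bsize  = 512;               /* nominal block-size */
        stv->f_frsize = 512;
        stv->f_blocks = total / 512;
        stv->f_bfree  = (total - used) / 512;
        stv->f_bavail = stv->f_bfree;
        return 0;
    }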

Chris H and Dave B were speculating about exporting fuse via NFS from the interactive FTA to one of the batch FTAs. (Or maybe going the other direction would address fears of allowing access to GPFS when fuse goes down?) Anyhow, NFS needs statfs for that to work.

Also, statfs is needed to support 'df', which is not necessarily important.

Users unable to delete files that they own unless they also own the trash directory

Users must own both their directory in GPFS as well as their directory in the GPFS trash in order to delete a file.

This might not be a problem. I just wanted it noted.

Example:

[conn001 57] /gpfs/marfs-gpfs/trash > ls -alh dejager.0/
total 64K
drwxr-x---  12 root root 512 Nov 17 13:50 ./
drwxr-xr-x.  9 root root 32K Nov 17 13:51 ../
drwxr-x---  12 root root 512 Nov 17 13:50 0/
drwxr-x---  12 root root 512 Nov 17 13:50 1/
drwxr-x---  12 root root 512 Nov 17 13:50 2/
drwxr-x---  12 root root 512 Nov 17 13:50 3/
drwxr-x---  12 root root 512 Nov 17 13:50 4/
drwxr-x---  12 root root 512 Nov 17 13:50 5/
drwxr-x---  12 root root 512 Nov 17 13:50 6/
drwxr-x---  12 root root 512 Nov 17 13:50 7/
drwxr-x---  12 root root 512 Nov 17 13:50 8/
drwxr-x---  12 root root 512 Nov 17 13:50 9/
[dejager@conn001 dejager]$ pwd
/marfs/dejager
[dejager@conn001 dejager]$ touch hello
[dejager@conn001 dejager]$ rm hello
rm: cannot remove `hello': Permission denied
[conn001 58] /gpfs/marfs-gpfs/trash > chown -R dejager:dejager dejager.0/
[dejager@conn001 dejager]$ rm hello

Ensure MarFS pfcp restart works

This may be more appropriate for the PFTool project, but I thought I'd put it here to remind us that we need this capability working for MarFS.

pfcp -n will only copy N-N files that have not already succeeded, so the N-N case is covered: we won't have to start from the beginning in the event of a failure. If Jeff adds retries for writes, that will alleviate most other failure scenarios (at least those we've encountered to this point in our work). That leaves a probably infrequent occurrence where an N-1 copy does not finish because even write retries could not get it to work.

For Open Science it would be nice to have this work, but we can live without it.

For further releases we want restarts to work. Irrespective of file type (PACKED, UNI, MULTI), we need a way to clear out any partially written object and resume writing, beginning with the last object that was not completed.

Configure Open Science batch FTA

This FTA should mount and re-export the MarFS FUSE mount RW, as well as the other file systems required for Open Science (Trinity Sonexion as RW, Turquoise Archive RO, Trinity Home/Projects/Netscratch RW). Users can only move files between these storage systems with PFTool commands submitted from the interactive FTA. Users should be informed that this is for high-bandwidth needs, and that moving files will only be supported through PFTool jobs submitted to these FTAs.

Make MarFS FUSE available to users on interactive FTAs

We're using an unshare technique to make the GPFS metadata mount visible only to the FUSE daemon process. We need to test and ensure that, if the FUSE daemon process is killed, other processes on the node still cannot see the GPFS metadata mount.

Dave Bonnie and I talked about this and I wanted to assign this to him, but it appears that he has not accepted the invitation to join the MarFS project. Assigning to Chris Hoffman for now because they are in adjoining offices and work closely together.

Establish a mechanism for where MarFS configuration file is

Right now I have the API user pass the path to the configuration file. I propose changing this so that API clients don't have to implement the same mechanism multiple times: the MarFS configuration code should search for the configuration file in this order:

  1. Value of the MARFSCONFIGRC environment variable.

  2. $HOME/.marfsconfigrc file.

  3. /etc/marfsconfigrc file.

If none of those is found, read_configuration fails.
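A sketch of that search order (MARFSCONFIGRC and the file names come from the proposal above; the function name is hypothetical):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* returns a malloc'ed path the caller must free(), or NULL (-> fail) */
    char* find_marfs_config(void) {
        char buf[4096];

        const char* env = getenv("MARFSCONFIGRC");       /* 1. env var */
        if (env && !access(env, R_OK))
            return strdup(env);

        const char* home = getenv("HOME");               /* 2. per-user file */
        if (home) {
            snprintf(buf, sizeof(buf), "%s/.marfsconfigrc", home);
            if (!access(buf, R_OK))
                return strdup(buf);
        }

        if (!access("/etc/marfsconfigrc", R_OK))         /* 3. system-wide */
            return strdup("/etc/marfsconfigrc");

        return NULL;   /* none found: read_configuration fails */
    }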

Implement PACKED file creation in PFTool

After the Open Science MarFS version is released, we need to add the PACKED file creation capability to PFTool so that when it copies files to a repository that is an object store it creates large objects consisting of small files.

Add per-file locks, so file can be converted to PACKED atomically

When creating packed files, we take many small files, concatenate their contents into one large object, and then update the xattrs on each original file, to indicate it is packed and is a member of this new object, at a given offset, etc. Just before updating the xattr, we stat the file to see whether it has changed while we were packing. If so, we leave it as is. It is okay that we didn't update the xattrs, because the file was overwritten. (The remaining files in the packed object can all still be considered valid packed files. And the wrong contents of the changed file will be ignored.)

The problem is that (in the case where the stat does not indicate anybody touched the file during packing), between the moment we stat the file and the moment we update the xattrs, someone could now change the original file. In this case, it is not okay that we updated the xattrs.

The problem scenario is:
[a] packer stats file
[b] fuse writes new file with new xattrs
[c] packer overwrites new xattrs to correspond with the (obsolete) packed file.

This is a race-condition, albeit very brief. The proposed solution is to provide a special "lock" (e.g. an xattr) which can be put on the file before stat'ing and removed after updating the other xattrs. Fuse and pftool would both forbid trashing the file while the lock was held.

Okay, but what if someone writes the file after the lock has been removed? Well, that is just a normal fact of life. They overwrote the file. The packed file should be trashed as usual, and the new file written in its place. Thus, the locks are only preventing the xattrs from becoming incorrect.

What if someone had the file open for writing, while the packer was running? We put an xattr onto open files. The packer should avoid those. Its stat should also check for this xattr.

There may be other use-cases, as well. To lock an entire namespace, admins can just change the access perms in the configuration, so they don't need this technique. However, this scheme could perhaps be used to lock directory-trees, etc.
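A sketch of the proposed lock around the stat/update window, using an xattr as the lock (the xattr name and the exact flow are illustrative, not a final design; fuse and pftool would both refuse to trash a file carrying the lock):

    #include <sys/stat.h>
    #include <sys/xattr.h>

    #define LOCK_XATTR "user.marfs_packer_lock"   /* hypothetical name */

    int packer_update_xattrs(const char* md_path, const struct stat* pre_pack) {
        struct stat st;

        /* take the lock; XATTR_CREATE fails if it already exists */
        if (setxattr(md_path, LOCK_XATTR, "1", 1, XATTR_CREATE))
            return -1;

        /* stat under the lock: did anyone touch the file while we packed?
         * (also the place to check for the open-for-write xattr) */
        if (stat(md_path, &st)
            || (st.st_mtime != pre_pack->st_mtime)
            || (st.st_size  != pre_pack->st_size)) {
            removexattr(md_path, LOCK_XATTR);
            return -1;   /* file changed: leave it as-is */
        }

        /* ... update marfs_objid/marfs_post to reference the packed object ... */

        removexattr(md_path, LOCK_XATTR);   /* later writers win normally */
        return 0;
    }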

marfs_configuration doesn't work right for repo range lists in namespaces

It was hacked temporarily to handle a single repo range for a namespace, but it needs to be a list. I think I've figured out how to get the base PA2X configuration parsing to handle embedded repo ranges. I need to fix marfs_configuration to read those and build variable sized repo range lists for the various namespaces.

Implement full marfs_statfs()

Refer to the initial explanation in #34 Implement "fake" marfs_statfs().

In the full version, we will have Alfred’s tool create an fsinfo in the root namespace where the sum of everything lives.

test

If I create an issue, does email get sent to someone?

deadlock with concurrent readers

[reported by Chris DeJager]

Easy to reproduce:
for i in `seq 1 2`; do cat /marfs/jti/test 2>&1 > foo.test.$i & wait; echo done; done

deleting directory leaves strange trash

[from Alfred Torrez]

Yesterday, I mentioned that I was seeing an issue with file counts and sizes following a delete. The counts and sizes did not match up with du. It turns out that when I run a variant of make_gc_data (only creates multi and uni in one directory, no delete), my file counts and sizes match up with du. When I delete as follows: rm -rf /marfs/atorrez/d1, I am seeing an extra count of files and my size is wrong in the project_a/d1 directory. If I go back and do an ls on /marfs/atorrez, I see ????????? on the .. dir entry. If I remount and run my quota script again, the counts return to correct values and the size is correct.

I turned logging on for the remove, and I am seeing something related to /atorrez/tail followed by some errors about "no such file or directory"...

MarFS xattr inconsistencies

There are inconsistencies between what the MarFS documentation says for marfs_objid and marfs_post xattrs versus what is actually being written in marfs code.

We should verify that marfs writes according to the MarFS Documentation and that:
- All fields are included
- All fields are in the correct order

Set tcp_sack=1 on FTAs and Ring Servers

This setting enables TCP selective acknowledgments (SACK), which let the receiver report exactly which segments arrived, so only the lost segments are retransmitted. Though one might first think this would slow things down, it in fact yields much more consistent performance. It helps HTTP-based protocols avoid the switch "incast" problem.

Set the FTAs and Ring servers to use tcp_sack=1.
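Assuming Linux on both, the persistent form:

    # /etc/sysctl.conf on the FTAs and the ring servers
    net.ipv4.tcp_sack = 1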

Configure Open Science interactive FTA

This FTA should mount the re-exported MarFS FUSE mount RW, as well as the other file systems required for Open Science (Trinity Sonexion as RW, Turquoise Archive RO, Trinity Home/Projects/Netscratch RW). Users can use UNIX commands to move files between these storage systems. Users should be informed that this is for low-bandwidth needs, and that moving files will not be high performance on the interactive FTAs.

look at stream_close() in the case of OSF_TIMEOUT

This currently skips any attempt to close the stream (see the old "signal QUIT to writefunc" code). Instead we just cancel the GET-request thread. I'm thinking this could either (a) leave connections hanging open, or (b) add extra time to the reopening of a new connection.

PFTool performance adequate for Open Science

With 9KB MTU and tcp_sack=1 on FTAs and ring servers, tcp_syncookies=0 on ring servers, TCP buffers at 64MB on FTAs and ring servers, and read retries enabled in PFTool, we are able to do PFTool transfers without errors. Write performance (reading sparse files from the POSIX file system) on two dual-bonded 10 GigE FTAs is 2.26 GB/s, and read performance (writing to /dev/null, not the POSIX file system) is 3.52 GB/s.

Let's discuss if this is adequate.

Set default and max TCP buffers to 64MB

We've discovered that the TCP buffer default and max settings need to be the same, so that the OS doesn't try to adjust them at run-time, causing stalls. Set these buffers to 64MB on the FTAs and the ring servers.
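Assuming Linux, one way to pin default == max at 64MB (67108864 bytes) persistently:

    # /etc/sysctl.conf on the FTAs and the ring servers
    net.core.rmem_default = 67108864
    net.core.rmem_max     = 67108864
    net.core.wmem_default = 67108864
    net.core.wmem_max     = 67108864
    # per-socket TCP buffers: min, default, max (default pinned to max,
    # so autotuning has nowhere to go)
    net.ipv4.tcp_rmem = 4096 67108864 67108864
    net.ipv4.tcp_wmem = 4096 67108864 67108864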

FUSE daemon authentication for sproxyd

Require the FUSE daemon to use a username and password to access Scality objects via sproxyd. The username and password will be protected such that the FUSE daemon must escalate to root to access them.

Design to maximize transfer bandwidth in the long-term

We are putting in some code segments to deal with a 10 GigE interface to the object store. Ultimately we intend to target installations that use IB interfaces which, in our experimentation, do not show the server congestion issues that cause frequent dropped packets and necessitate retries.

We need to develop a design that can maximize bandwidth, especially for reads (where we mostly see the dropped packet issue), and enable more successful transfers on the first try of using PFTool.

This involves finding the right network parameters; setting timeouts that aren't too short (killing transfers that would otherwise have completed on their own) yet aren't too long (waiting too long for something that isn't going to complete); enabling retries (but not so many that a doomed transfer never gives up); allocating object clients to object data servers (well-distributed requests to the object-store servers); etc.

Files created by user with fuse are in root group

If I create a file as a user in marfs, the group is set to root. This is with fuse.

Example:
[dejager@conn001 bin]$ echo "It was nice to meet you" > /marfs/dejager/goodbye
[dejager@conn001 bin]$ ls -lh /marfs/dejager/goodbye
-rw-r--r-- 1 dejager root 24 Oct 8 11:14 /marfs/dejager/goodbye

GPFS gpfs_iattr Structure Member Value Change

I ran the quota script on the tr-FTA cluster and noticed I was not finding MarFS files. The quota and garbage collection scripts rely on reading the gpfs_iattr structures during the inode scan. The code was checking the ia_xperm member for a value of 2, which implies extended attributes. I was seeing a value of 18, which did not make sense initially. It turns out that ia_xperm is a bit field, and in this case I was getting a value of 0x0012 (18), which implies extended attributes (bit 1) plus bit 4, which implies the file has restore-policy attrs. The code was comparing against a single value instead of masking appropriately.

The code will be modified for bit masking and tested. Update to this issue will occur after testing.
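The fix amounts to testing the bit instead of comparing the whole field; a sketch, with an illustrative flag value (the GPFS headers define the real constant):

    #define IA_XPERM_XATTR 0x0002   /* bit 1: file has extended attributes */

    int has_xattrs(unsigned int ia_xperm) {
        /* old, broken check: (ia_xperm == 2) missed values like 0x0012 */
        return ((ia_xperm & IA_XPERM_XATTR) != 0);
    }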

Break large GET requests into a sequence of smaller ones

(See email, with subject "incast on Trinity-Island FTAs")

Unit tests on 1 FTA show multiple large GET requests to multiple servers results in dramatically diminished BW. This is probably explained as an "incast" issue, where packets are colliding in the buffers of the recv port on the switch, and getting dropped. This in turn leads to a "TCP throughput collapse", because of congestion-control at the server, and the need to time-out to detect lost-packets.

When the multiple requests all target a single server, the problem goes away, presumably because the server can only inject into the switch at about the same rate that the client is pulling data out. However, that is an awkward solution, because it implies some sort of coordinated response to allow clients to fail over from one server to another, making sure that it isn't just a matter of one task dropping a connection, etc. The old scheme of picking servers at random from a range of IP-addresses is much simpler and more robust, and we'd like to keep that.

It appears that another way to resolve our incast issue is to break larger requests up into a sequence of smaller ones. So instead of a single GET request with a 1GB byte-range, we request a synchronous series of smaller ranges. This means the packets hitting the recv-port can't stack up as much data there. It seems to resolve the problem in unit-tests.

So, TBD:

(1) The configuration should add a max-request-size to the Repo. Actually, we don't want people having to divide repo.chunk_size to get a request-size that divides chunk-size evenly. So, the new config-field should say how many requests-per-chunk, and we'll do the division, with rounding and padding, etc. (see the sketch after this list).

[Question: but what if two namespaces have different connectivity to the same repo?]

(2) marfs_read(), or something, should break requests up into this smaller size. We only increment the chunk-number in the URL at object-boundaries.

(3) Should this also apply to N:1 PUTs from pftool?
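For item (1), a sketch of deriving the per-request byte-range from a requests-per-chunk config field (the field and function names are hypothetical):

    #include <stdint.h>

    /* Round chunk_size/requests_per_chunk UP, so every request but the
     * last is the same size and the count never exceeds requests_per_chunk. */
    uint64_t request_size(uint64_t chunk_size, unsigned requests_per_chunk) {
        if (requests_per_chunk <= 1)
            return chunk_size;
        return (chunk_size + requests_per_chunk - 1) / requests_per_chunk;
    }

    /* marfs_read() would then loop over the chunk with synchronous GETs,
     * e.g. "Range: bytes=<off>-<off+len-1>", bumping the chunk-number in
     * the URL only at object boundaries:
     *
     *   for (off = 0; off < chunk_size; off += req) {
     *       len = (chunk_size - off < req) ? (chunk_size - off) : req;
     *       ...
     *   }
     */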

Recovery-Info needs to include data-size (or use two 8-byte lengths)

That's so that we can walk backwards through a packed object that is formatted in our new, simpler, packed format:

FileData1, RecoveryInfo1, 8_byte_length
FileData2, RecoveryInfo2, 8_byte_length

This is just a concatenation of the raw objects, and is what we already support for reads. The 8-byte length holds the length of the recovery-info immediately preceding it. Then we read that recovery-info to get the length of the data immediately preceding that. Etc.

(We can't assume we know the length of the recovery-info, because that may change with newer versions of the system, and we want to be able to deal with old objects.)

To make this work, the recovery-info has to include the length of the data.

*** BETTER YET, the 8-byte length could be extended to two 8-byte lengths, one for the recovery-info and one for the data. Then (a) we can skip over the recovery-info without parsing it, if we're trying to get to a specific object, and (b) we don't have to waste additional xattr bytes on values that are really only needed for recovery.
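A sketch of the two-length tail and the backwards walk it enables (struct layout illustrative; a real implementation would pin endianness):

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t rec_info_len;   /* bytes of RecoveryInfoN just before the tail */
        uint64_t data_len;       /* bytes of FileDataN before that */
    } PackedTail;

    /* Given the offset just past one FileData/RecoveryInfo/tail group in a
     * packed-object buffer, return the offset just past the previous group
     * (0 when we reach the front). No recovery-info parsing required. */
    uint64_t prev_group(const uint8_t* obj, uint64_t end) {
        PackedTail t;
        memcpy(&t, obj + end - sizeof(t), sizeof(t));
        return end - sizeof(t) - t.rec_info_len - t.data_len;
    }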

another test issue

If I create multiple issues, what kind of interface do I get for looking at them?

Implement SEMI_DIRECT access method

This method is where the data is stored in a separate file system that allows update-in-place. Typically, the repository would be a parallel file system or other fully POSIX-compliant file system that allows for efficient editing of files and not just complete overwrites.

object recovery-info is still just fake data

We install recovery-info into all objects, and are forced to do arithmetic to manage writing this info into object streams with a given content-length (i.e. to include user-data plus recovery-info), but the info itself is still just fake, intended to support debugging.

Monitor for users querying object store directly

Develop a monitor to look for users who are doing object queries and getting "object not found" errors. This will identify users who are trying to subvert MarFS and get at object data directly, without going through the MarFS metadata store.

ls -li /marfs/jti/foo shows incorrect inode

I've walked this through fuse, and fuse is returning a stat struct containing the proper inode, but maybe something else is wrong with it, such that the kernel decides to display it differently?

test #3

Apparently, members of the Owners group for a github repository do not get email notifications for events. So, I've added myself to the Developers group. Do I get email notifications now?

occasional deadlock in stream_put() / streaming_readfunc() ?

Chris Hoffman has noticed that there are cases (in older code only?) where we were having write-failures even though the sproxyd log indicated that a 200-OK response had been sent.

I suspect there may be some race-condition between the PUT operation completing, and the readfunc getting a final callback from curl. There may need to be some robustification in s3_op(), or in streaming_readfunc(), to get the final handshaking to complete.

I suspect that this may be what is causing this problem. (And maybe something similar is behind the "deadlock with concurrent readers" bug.)

Question: why doesn't the overlying operation time-out? We have timeouts around our write-functions for just this purpose.

pftool should write MD chunk-info from single-task

Gary points out that if all the write-threads writing N:1 data are also updating the MD file with chunk-info, there will be a lot of lock-contention for small writes.

We should let the updates be done en masse from a single thread. However, we still need this done per-file in fuse. Sounds like more conditionalization of the behavior of marfs_release(), based on details in the FileHandle. Fuse could continue doing this in marfs_write() or marfs_release(), whereas pftool could do it in MARFS_Path::post_process(), or something.

Open Science Campaign Storage Security Baseline

We need to have an Open Science Campaign Storage Security Baseline approved by the CCB by 1/14/16 so that we can begin use of it on 1/19/16 when Open Science is scheduled to begin. The decision will be made on 12/21/15 about whether it will be MarFS over Scality or GPFS over ZFS.

Implement DIRECT access method

I was not sure if this is officially supported or not. We don't need it for the Open Science MarFS release, but we'll need it afterwards. Please close this issue if it already works, i.e., if a repository can be accessed using this method when so indicated in a configuration.

fuse sometimes creates files with RESTART xattr still on?

[transferring bugs from my every-other-weekly-reports]

Alfred has occasionally seen files with the RESTART xattr still attached. (This xattr is installed by mknod and removed by delete, so I'd guess that where it's found, it's an artifact of fuse crashing or punting. No, wait, RESTART is also installed by truncate, for the same reason as mknod, so this could also be a file that was truncated, without any fuse problem.)

Obsolete?

remove obsolete static configuration

This was training wheels, while we were moving into using PA2X for defining/parsing our config file. We've outgrown it, and there is some clutter in source-files and Makefile.

Update fsinfo tool to account for sum of all namespaces

Alfred’s tool shall create an fsinfo in the root namespace where the sum of everything lives. This will be used by capabilities that need to know the latest status of all metadata information for the file system's namespaces in total.
