Summary

this repo is notes of Linux f2fs file system in my preparation of porting f2fs to Windows.

F2FS

Linux Implementation

Porting Gap

Gap1

F2FS

disk layout

figure

description

meta data

Super Block
Checkpoint
SIT, segment information table
NAT, node address table
SSA, segment summary area

frequently accessed meta datas(Super Block / Checkpoint / SIT / NAT, except SSA) are versioned(there are 2 versions)

the purpose of meta data versioning is to balance the wirte of meta area
version switching happened at checking point time

Super Block, has following important information

magic number, is used to recognize f2fs volume
major and minor version of f2fs
sector size, block size
sectors per block, blocks per segment (512), segments per section (usually 1, settable by mkfs.f2fs), sections per zone (usually 1)

SIT is basicly an array of SIT entries , indexed by Segment No
SIT entry has following information

allocated block count of the segment
bitmap of allocated blocks
segment temperature
the average access time of the segment
refer to nat/sit/ssa for detail

NAT is basicly an array of NAT entries , indexed by Node ID
NAT entry has following information

version
I Node identifier
block address
refer to nat/sit/ssa for detail

SSA is basicly an array of SSA entry, indexed by Segment No. the size of SSA entry is 4K
SSA entry has following information

summary entries
journal
segment type
check sum
refer to nat/sit/ssa for detail

checkpoint

figure

description

Header and Footer, they are indentical if this is a valid checkpoint, have following information

size (in unit of block) of checkpoint : from Header to Footer
current segments : segment No, next block offset, allocation type etc.
NAT version bitmap
SIT version bitmap

Payload, if Header can not accommodate the SIT version bitmap, it will be stored in Payload, or else this area is empty
Orphan inode, if any, or else this area is empty
DataSummary, summaries of the current data segments, has 2 formats

compact format : if size of NAT journal + SIT journal + SSA summary entries is less than 3 blocks

NAT journal
SIT journal
SSA summary entries for hot data
SSA summary entries for warm data
SSA summary entries for cold data

normal format

SSA entry for hot data
SSA entry for warm data
SSA entry for cold data

NodeSummary : summaries of the current node segments if checking point reason is CP_UMOUNT or CP_FASTBOOT, or else this area is empty

SSA entry for hot node
SSA entry for warm node
SSA entry for cold node

NATBits, if enabled, or else this area is empty

NAT/SIT/SSA

figure	description
	NAT is organized in 4K sized blocks, an NAT block accommodates 455 NAT entries Version, every time the BlkAddr is changed from non-NULL_ADDR to NULL_ADDR , Version will increase by 1 INO BlkAddr NULL_ADDR, the NAT entry is free for allocating NEW_ADDR, the NAT entry is allocated but node is not allocated COMPRESS_ADDR Used (none of the above), the NAT entry is allocated and node is allocated, BlkAddr is the address of the node
	SIT is organized in 4K sized blocks, a SIT block accommodates 55 SIT entries Blocks, is consist of the following components bits 0-9, is the allocated block count in the segment bits 10-15, is the segment temperature : hot/warm/cold BlockBitmap, every allocated block in this segment has its bit set in this bitmap Mtime, average access time of the segment when a block in this segment is allocated or freed, the access time is calculated and averaged with Mtime is used in victim segment selection
	SSA summary entry is the summary of a block in a segment, there are 512 such entries NID Version copy of NAT entry version of Direct Node if this is a data block 0 if this is a node block Offset offset in Direct Node if this is a data block 0 if this is a node block Footer Type: data or node. f2fs does not mix data block and node block in the same segment

node

figure	description
	NID is the node identifier INO is the I Node identifier the node belong to. If NID==INO this is an I Node Flag, is consist of the following components bit 0 : cold flag bit 1 : fsync flag bit 2 : dentry flag bit 3-31 : offset, refer to node config for detail CkpVer NextBlkAddr
	InlineDataAddrs, this is an array of 923 entries , each entry is 4 bytes optionally, this array can be used to store extra attributes inline (ExtraAttr) optionally, this array can also be used to store extension attributes inline (InlineXAttr) the other entries (InlineDataAddrs) are used to store addresses of inlined data or inlined data NID, if file size can not fit in InlineDataAddrs completely, f2fs will allocate additional nodes, refer to node config
	a direct node has 1018 block address entries , it will cover 1018*4K bytes of data
	an indirect node has 1018 nid entries, it will cover 1018 direct nodes or 1018 indirect nodes. in the former an indirect node will cover 101810184K bytes of data, in the latter, an indirect node will cover 101810181018*4K bytes of data

node config

figure

description

f2fs use 5 configurations to accommodate different file sizes, this makes up a tree structure

blue circle is I Node (file)
light green circle is Indirect Node (I)
green circle is Direct Node (D)

file size and configuration(Isz is the size that can be inlined, N==1018)

file size	node config (cumulative)	NID Index
<=Isz	I Node	-
<=Isz+N*4K	D	0
<=Isz+N4K2	D	1
<=Isz+NN4K	I+D	2
<=Isz+NN4K*2	I+D	3
<=Isz+NNN*4K	I+I+D	4

for example, if file size is between Isz and Isz+N*4K, f2fs will allocate the first Direct Node. portion of the file size equal to Isz is covered by I Node, portion greater than Isz is covered by the first Direct Node.
if files size is between Isz+N*4K and Isz+N*4K*2, f2fs will allocate the first and second Direct Node, etc.
there are 2 kind of offsets in a file : node offset and data offset

node offset, is the numbering of the tree structure from top to down and left to right : the offset of I Node is 0, offset of the first and second Direct Node are 1 and 2, offset of the first and second Indirect Node are 3 and 4+1018, and so on...
data offset

Linux Implementation

Locks

name	type	description
cp_lock	spin	description
cp_global_sem	RW	description
cp_rwsem	RW	write, when checking point read, when fs operation
locks on Node Manager	-	Node Manager
locks on Segment Manager	-	Segment Manager

Node Manager

figure

description

locks

name	type	protecting
nid_list_lock	spin	FreeNID Cache FreeNIDBitmaps
nat_list_lock	spin	NatE Cache entry LRU
nat_tree_lock	RW	NatE Cache
build_lock	mutex	NATBlockBitmap

data flows

green arrow is the data flow of FreeNID Cache building
red arrow is the data flow of checkpoiting
light blue arrow is the data flow of NatE Cache loading

FreeNIDBitmaps, an array of bitmap, indexed by NAT block No, each bitmap is indexed by offset of nid in NAT block

is guarded by NATBlockBitmap : to update bit in bitmap of an NAT block, bit of which in NATBlockBitmap must be set
updated in NAT scanning (the light green arrow)
updated from NATBits (the blue arrow)
updated at checking point time
updated at pre-allocation time (clear)
updated in fail function call (set)
used in run time FreeNID Cache building

NATBits, is consist of 2 bitmaps : FullBitmap and EmptyBitmap, both indexed by NAT block No

is enabled when CP_NAT_BITS_FLAG is set
if an NAT block is empty (all NAT entries are free, NULL_ADDR), the bit in EmptyBitmap will be set
if an NAT block is full (all NAT entries are used, non-NULL_ADDR), the bit in FullBitmap will be set
a dirty NAT block will has its bit set neither in EmptyBitmap nor in FullBitmap
updated from FreeNIDBitmaps (the purple arrow) the first time NATBits is enabled
updated at checking point time

NATBlockBitmap, indexed by NAT block No

used as a guard for updating FreeNIDBitmaps
used as NAT scanning hint : NAT block set in this bitmap will be skipped
updated in process of scanning NAT (orange arrow)
updated from NATBits (orange arrow)

FreeNID Cache

updated at mount time : f2fs will scan the NAT and NAT Journal
updated at run time : when there is not enough free nid, f2fs will scan the FreeNIDBitmaps, NAT Jounal and NAT
updated at checking point time : if an NatE Cache entry becomes free
updated when f2fs decide that there are too many FreeNID Cache, portion of the entries will be deleted, but the bits of which in FreeNIDBitmaps will keep set, in case of there is not enough FreeNID Cache, the deleted entries can be brought back quickly by scanning FreeNIDBitmaps

NatE Cache

an NatE Cache entry is added on-demand

added in the code path of f2fs_get_node_info
added in the code path of set_node_addr

an entry becomes dirty when its address changed

FreeNID Cache entries are organized in a balanced tree with node id as its key
at the same time, these entries are linked into a list, entry will be allocated from the head of the list. the allocation is divided into 2 stages

the first stage is pre-allocation stage : entry is deleted from the list (1)
the second stage is succeeding/failing stage

if f2fs decide that the pre-allocation is succeeded, it will call the done function, node manager will delete the entry from the tree (3)
if f2fs decide that the pre-allocation is failed, it will call the fail function, node manager will delete the entry from the tree(3) or append it back to the list (2)

Segment Manager

figure

description

locks

name	type	protect
segmap_lock	spin	FreeSegBitmap FreeSecBitmap
sentry_lock	RW	SitE Cache
journal_rwsem	RW	NAT/SIT journal in CurSegs[n]
seglist_lock	mutex	DirtySegBitmaps DirtySecBitmap VictimSecBitmap
curseg_mutex	mutex	CurSegs[n]

data flows

red arrow is the data flow of checkpointing
green arrow is the data flow of SitE Cache loadinig
light blue arrow is the data flow of segment to section bitmap rollup / dirty SitE bitmap update
blue arrow is the data flow of segment switching
orange arrow is the data flow of dirty segment update

DirtySegBitmaps is consist of 8 bitmaps, all are indexed by segment No

the DIRTY bitmap, is the union of the following 6 sub-DIRTY bitmaps, any 2 sub-DIRTY bitmaps have no intersection

hot data dirty bitmap
warm data dirty bitmap
cold data dirty bitmap
hot node dirty bitmap
warm node dirty bitmap
cold node dirty bitmap

the PRE bitmap, the prefreed segment has its bit set in this bitmap. dirty segment may become free in the process of block address release or GC. these kind of segment is recorded in this PRE bitmap. at checking point time, the prefreed segment(s) is moved to free segment(s) (PRE DirtySegBitmap->FreeSegBitmap)

CurSegs, an array of current segments, f2fs allocate node or data from current segments, there are 8 types of current segment

hot data
warm data
cold data
hot node, from which Direct Node is allocated
warm node
cold node, from which Indriect Node is allocated
pinned cold data
atgc

main processes

nid allocation

block allocation

checking point

victim selection

segment selection

when there is no free block for a current segment, f2fs will select a new segment. if there is need for SSR(Slack Space Recycling), f2fs will select from dirty segments, or else f2fs will select from free segments.

free segment selection

dirty segment selection

gc

gc is the process of cleaning dirty segment, blocks are moved out from dirty segment to other segments, which makes more free segments.

Porting Gap

Gap1

f2fs use page cache to read node into memory , and will lock the page cache from concurrent access. it has the exclusive semantics.
for example, in f2fs_get_dnode_of_data, f2fs will read I Node , if needed, f2fs will allocate a Direct Node, and fill its nid into I Node. if the page cache is not locked, 2 thread might overwrite each other.
Windows lacks similar facility.

oldoldman / f2fs-design Goto Github PK

f2fs-design's Introduction

Summary

Table of Contents

F2FS

disk layout

checkpoint

NAT/SIT/SSA

node

node config

Linux Implementation

Locks

Node Manager

Segment Manager

main processes

nid allocation

block allocation

checking point

victim selection

segment selection

free segment selection

dirty segment selection

gc

Porting Gap

Gap1

f2fs-design's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org