[NTOSKRNL]

Write a lot of general prose about the operation of the Mm parts of NewCC.

Dedicate to timo and chongo.

No code changes.

svn path=/trunk/; revision=56267
Art Yerkes 2012-03-29 04:43:44 +00:00
parent b993c4ba6f
commit 8a3e9c37be
7 changed files with 442 additions and 4 deletions

View file

@@ -42,6 +42,36 @@
* Herve Poussineau
*/
/*
A note on this code:
Unlike the previous section code, this code does not rely on an active map
for a page to exist in a data segment. Each mapping contains a large integer
offset to map at, and the segment always represents the entire section space
from zero to the maximum long long. This allows us to associate one single
page map with each file object, and to let each mapping view an offset into
the overall mapped file. Temporarily unmapping the file has no effect on the
section membership.
This necessitates a change in the section page table implementation, which is
now an RtlGenericTable. This is elaborated further in sptab.c. One upshot
of this change is that a mapping of a small file takes a bit more than 1/4
of the nonpaged kernel space that it did previously.
When we need other threads that may be competing for the same page fault to
wait, we have a mechanism separate from PageOps for dealing with that, which
was suggested by Travis Geiselbrecht after a conversation I had with Alex
Ionescu. That mechanism is the MM_WAIT_ENTRY, which is the all-ones SWAPENTRY.
When we wish for other threads to know that we're waiting and will finish
handling a page fault, we place the swap entry MM_WAIT_ENTRY in the page table
at the fault address (this works on either the section page table or a process
address space), perform any blocking operations required, then replace the
entry with the final page or swap entry and signal the wait entry event.
*/
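To make the description above concrete, here is a minimal sketch of the wait
entry protocol. The entry accessors and the MAKE_*_SSE helpers are named in
the style this code uses, but treat the exact spellings as illustrative; the
real accessors are described in sptab.c.

/* Sketch: serialize competing faults on one segment page. */
Entry = MmGetPageEntrySectionSegment(Segment, &Offset);
if (MM_IS_WAIT_PTE(Entry))
{
    /* Another thread owns this fault: drop the lock and wait. */
    MmUnlockSectionSegment(Segment);
    KeWaitForSingleObject(&MmWaitPageEvent, WrPageIn, KernelMode, FALSE, NULL);
    /* ... then retry the fault from the top ... */
}
else if (Entry == 0)
{
    /* We own the fault: publish the wait entry, then block safely. */
    MmSetPageEntrySectionSegment(Segment, &Offset, MAKE_SWAP_SSE(MM_WAIT_ENTRY));
    MmUnlockSectionSegment(Segment);
    /* ... blocking I/O to materialize the page goes here ... */
    MmLockSectionSegment(Segment);
    MmSetPageEntrySectionSegment(Segment, &Offset, MAKE_PFN_SSE(Page));
    MmUnlockSectionSegment(Segment);
    KePulseEvent(&MmWaitPageEvent, IO_NO_INCREMENT, FALSE); /* wake waiters */
}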
/* INCLUDES *****************************************************************/
#include <ntoskrnl.h>
@@ -163,6 +193,17 @@ MiZeroFillSection(PVOID Address, PLARGE_INTEGER FileOffsetPtr, ULONG Length)
return STATUS_SUCCESS;
}
/*
MiFlushMappedSection
Called from cache code to cause dirty pages of a section
to be written back. This doesn't affect the mapping.
BaseOffset is the base at which to start writing in file space.
FileSize is the length of the file as understood by the cache.
*/
NTSTATUS
NTAPI
_MiFlushMappedSection(PVOID BaseAddress,
@@ -299,6 +340,13 @@ _MiFlushMappedSection(PVOID BaseAddress,
return Status;
}
/*
This deletes a segment entirely, including its page map.
It must have been unmapped in every address space.
*/
VOID
NTAPI
MmFinalizeSegment(PMM_SECTION_SEGMENT Segment)
@@ -346,7 +394,7 @@ MmCreateCacheSection(PROS_SECTION_OBJECT *SectionObject,
ULONG AllocationAttributes,
PFILE_OBJECT FileObject)
/*
* Create a section backed by a data file.
*/
{
PROS_SECTION_OBJECT Section;
@@ -641,6 +689,13 @@ _MiMapViewOfSegment(PMMSUPPORT AddressSpace,
return STATUS_SUCCESS;
}
/*
Completely remove the page at FileOffset in Segment. The page must not
be mapped.
*/
VOID
NTAPI
MiFreeSegmentPage(PMM_SECTION_SEGMENT Segment,

View file

@@ -43,6 +43,34 @@
* Herve Poussineau
*/
/*
I've generally organized fault handling code in newmm as handlers that run
under a single lock acquisition, check the state, and either take necessary
action atomically, or place a wait entry and return a continuation to the
caller. This lends itself to code that has a simple, structured form,
doesn't make assumptions about lock taking and breaking, and provides an
obvious, graphic separation between code that may block and code that isn't
allowed to. This file contains the non-blocking half.
In order to request a blocking operation to happen outside locks, place a
function pointer in the provided MM_REQUIRED_RESOURCES struct and return
STATUS_MORE_PROCESSING_REQUIRED. The function indicated will receive the
provided struct and take action outside of any mm related locks and at
PASSIVE_LEVEL. The same fault handler will be called again after the
blocking operation succeeds. In this way, the fault handler can accumulate
state, but will freely work while competing with other threads.
Fault handlers in this file should check for an MM_WAIT_ENTRY in a page
table they're using and return STATUS_SUCCESS + 1 if it's found. In that
case, the caller will wait on the wait entry event until the competing thread
is finished, and then re-call this handler in the current thread.
Another thing to note here is that we require mappings to exactly mirror
rmaps, so each mapping should be immediately followed by an rmap addition.
*/
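Concretely, a non-blocking handler in this file ends in one of three ways,
sketched below. The MM_REQUIRED_RESOURCES member names mirror usage visible
elsewhere in this commit, but treat DoAcquisition's exact spelling, the
mapping call, and the signatures as assumptions.

/* Sketch: the three exits of a non-blocking fault handler. */
if (MM_IS_WAIT_PTE(Entry))
{
    /* A competing thread owns this fault: ask the caller to wait. */
    MmUnlockSectionSegment(Segment);
    return STATUS_SUCCESS + 1;
}
if (NeedPage)
{
    /* Request a blocking callout; this handler will be re-called. */
    Required->Consumer = MC_CACHE;
    Required->Amount = 1;
    Required->DoAcquisition = MiGetOnePage; /* from reqtools.c */
    return STATUS_MORE_PROCESSING_REQUIRED;
}
/* Otherwise act atomically under the lock: map, then add the rmap. */
MmCreateVirtualMapping(Process, Address, Attributes, &Required->Page[0], 1);
MmInsertRmap(Required->Page[0], Process, Address);
return STATUS_SUCCESS;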
/* INCLUDES *****************************************************************/
#include <ntoskrnl.h>
@@ -56,6 +84,22 @@
extern KEVENT MmWaitPageEvent;
extern PMMWSL MmWorkingSetList;
/*
Multiple stage handling of a not-present fault in a data section.
Required->State is used to accumulate flags that indicate the next action
the handler should take.
State & 2 is currently used to indicate that the page acquired by a previous
callout is a global page to the section and should be placed in the section
page table.
Note that the primitive tail recursion done here reaches the base case when
the page is present.
*/
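For instance, the State & 2 case might be consumed like this (a sketch;
TotalOffset and MAKE_PFN_SSE are illustrative names):

/* Sketch: a previous callout produced a section-global page. */
if (Required->State & 2)
{
    MmLockSectionSegment(Segment);
    MmSetPageEntrySectionSegment(Segment, &TotalOffset,
                                 MAKE_PFN_SSE(Required->Page[0]));
    MmUnlockSectionSegment(Segment);
    Required->State &= ~2;
}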
NTSTATUS
NTAPI
MmNotPresentFaultCachePage(PMMSUPPORT AddressSpace,
@@ -168,6 +212,10 @@ MmNotPresentFaultCachePage(PMMSUPPORT AddressSpace,
}
else if (MM_IS_WAIT_PTE(Entry))
{
// Whenever MM_WAIT_ENTRY is required as a swap entry, we need to
ask the fault handler to wait until we should continue. Rather
// than recopy this boilerplate code everywhere, we just ask them
// to wait.
MmUnlockSectionSegment(Segment);
return STATUS_SUCCESS + 1;
}
@@ -255,6 +303,18 @@ MiCopyPageToPage(PFN_NUMBER DestPage, PFN_NUMBER SrcPage)
return STATUS_SUCCESS;
}
/*
This function is deceptively named, in that it does the actual work of handling
access faults on data sections. In the case of the code that's present here,
we don't allow CoW sections, but we do need this to unset the initial
PAGE_READONLY condition of pages faulted into the cache so that we can add
a dirty bit in the section page table on the first modification.
In the ultimate form of this code, CoW is re-enabled.
*/
NTSTATUS
NTAPI
MiCowCacheSectionPage(PMMSUPPORT AddressSpace,
@@ -344,6 +404,8 @@ MiCowCacheSectionPage(PMMSUPPORT AddressSpace,
else
return STATUS_SUCCESS; // Nonwait swap entry ... handle elsewhere
}
/* Call out to acquire a page to copy to. We'll be re-called when
* the page has been allocated. */
Required->Page[1] = MmGetPfnForProcess(Process, Address);
Required->Consumer = MC_CACHE;
Required->Amount = 1;
@@ -403,6 +465,15 @@ typedef struct _WORK_QUEUE_WITH_CONTEXT
AcquireResource DoAcquisition;
} WORK_QUEUE_WITH_CONTEXT, *PWORK_QUEUE_WITH_CONTEXT;
/*
This is the work item used to do blocking resource acquisition when a fault
handler returns STATUS_MORE_PROCESSING_REQUIRED. It's used to allow resource
acquisition to take place on a different stack, and outside of any locks used
by fault handling, making recursive fault handling possible when required.
*/
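A hypothetical sketch of how the outer handler hands a blocking acquisition
to this worker; the Wait and DoAcquisition members are visible in this
commit, while WorkItem and Required are assumed member names.

WORK_QUEUE_WITH_CONTEXT Context;
RtlZeroMemory(&Context, sizeof(Context));
Context.Required = Required;                    /* assumed member */
Context.DoAcquisition = Required->DoAcquisition;
KeInitializeEvent(&Context.Wait, NotificationEvent, FALSE);
ExInitializeWorkItem(&Context.WorkItem,         /* assumed member */
                     (PWORKER_THREAD_ROUTINE)MmpFaultWorker, &Context);
ExQueueWorkItem(&Context.WorkItem, DelayedWorkQueue);
/* Sleep on our own stack while the worker acquires the resource. */
KeWaitForSingleObject(&Context.Wait, Executive, KernelMode, FALSE, NULL);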
VOID
NTAPI
MmpFaultWorker(PWORK_QUEUE_WITH_CONTEXT WorkItem)
@@ -415,6 +486,38 @@ MmpFaultWorker(PWORK_QUEUE_WITH_CONTEXT WorkItem)
KeSetEvent(&WorkItem->Wait, IO_NO_INCREMENT, FALSE);
}
/*
This code separates the action of fault handling into an upper and lower
handler to allow the inner handler to optionally be called in a work item
if the stack is getting too deep. My experiments show that the third
recursive page fault taken at PASSIVE_LEVEL must be shunted away to a
worker thread. In the ultimate form of this code, the primary fault handler
makes this decision by using a thread-local counter to detect a too-deep
fault stack and call the inner fault handler in a worker thread if required.
Note that faults are taken at passive level and have access to ordinary
driver entry points such as those that read and write files, and filesystems
should use paged structures whenever possible. This makes recursive faults
both a perfectly normal occurrence and a worthwhile case to handle.
The code below will repeatedly call MiCowCacheSectionPage as long as it returns
either STATUS_SUCCESS + 1 or STATUS_MORE_PROCESSING_REQUIRED. In the more
processing required case, we call out to a blocking resource acquisition
function and then re-call the fault handler with the shared state represented
by the MM_REQUIRED_RESOURCES struct.
In the other case, we wait on the wait entry event and recall the handler.
Each time the wait entry event is signalled, one thread has removed an
MM_WAIT_ENTRY from a page table.
In the ultimate form of this code, there is a single system wide fault handler
for each of access fault and not present and each memory area contains a
function pointer that indicates the active fault handler. Since the mm code
in ReactOS is currently fragmented, I didn't bring this change to trunk.
*/
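The retry loop described above, as a sketch (the inner handler's argument
list and the callout's argument list are assumptions):

do
{
    Status = MmpSectionAccessFaultInner(Mode, AddressSpace, Address,
                                        FromMdl, &Resources);
    if (Status == STATUS_MORE_PROCESSING_REQUIRED)
    {
        /* Run the blocking callout outside all mm locks, then retry. */
        Status = Resources.DoAcquisition(AddressSpace, MemoryArea, &Resources);
        if (NT_SUCCESS(Status))
            Status = STATUS_MORE_PROCESSING_REQUIRED;
    }
    else if (Status == STATUS_SUCCESS + 1)
    {
        /* Wait for a competing thread to retire its MM_WAIT_ENTRY. */
        KeWaitForSingleObject(&MmWaitPageEvent, WrPageIn, KernelMode,
                              FALSE, NULL);
        Status = STATUS_MORE_PROCESSING_REQUIRED;
    }
}
while (Status == STATUS_MORE_PROCESSING_REQUIRED);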
NTSTATUS
NTAPI
MmpSectionAccessFaultInner(KPROCESSOR_MODE Mode,
@@ -564,6 +667,17 @@ MmpSectionAccessFaultInner(KPROCESSOR_MODE Mode,
return Status;
}
/*
This is the outer fault handler mentioned in the description of
MmpSectionAccessFaultInner. It increments a fault depth count in the current
thread.
In the ultimate form of this code, the lower fault handler will optionally
use the count to keep the kernel stack from overflowing.
*/
NTSTATUS
NTAPI
MmAccessFaultCacheSection(KPROCESSOR_MODE Mode,
@@ -613,6 +727,16 @@ MmAccessFaultCacheSection(KPROCESSOR_MODE Mode,
return Status;
}
/*
As above, this code separates the active part of fault handling from a carrier
that can use the thread's active fault count to determine whether a work item
is required. Also as above, this function repeatedly calls the active
not-present fault handler until a clear success or failure is received,
retrying on a return of STATUS_MORE_PROCESSING_REQUIRED or STATUS_SUCCESS + 1.
*/
NTSTATUS
NTAPI
MmNotPresentFaultCacheSectionInner(KPROCESSOR_MODE Mode,
@@ -765,6 +889,14 @@ MmNotPresentFaultCacheSectionInner(KPROCESSOR_MODE Mode,
return Status;
}
/*
Call the inner not-present fault handler, keeping track of the fault count.
In the ultimate form of this code, optionally use a worker thread to handle
the fault in order to sidestep stack overflow in the multiple-fault case.
*/
NTSTATUS
NTAPI
MmNotPresentFaultCacheSection(KPROCESSOR_MODE Mode,

View file

@@ -64,9 +64,13 @@ MmGetDeviceObjectForFile(IN PFILE_OBJECT FileObject)
return IoGetRelatedDeviceObject(FileObject);
}
/*
Note:
This completion function is really required, because paging I/O completion
does almost nothing, not even freeing the MDLs.
*/
NTSTATUS
NTAPI
MiSimpleReadComplete(PDEVICE_OBJECT DeviceObject,
@@ -95,6 +99,15 @@ MiSimpleReadComplete(PDEVICE_OBJECT DeviceObject,
return STATUS_SUCCESS;
}
/*
MiSimpleRead is a convenience function that provides either paging or
non-paging reads. The caching and mm systems use this in paging mode, where
a completion function is required as above. The Paging BOOLEAN determines
whether the read is issued as a paging read or as an ordinary buffered read.
*/
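A hypothetical call in paging mode, assuming the parameter order (file
object, offset, buffer, length, paging flag, status block):

/* Sketch: read one page of file data into a kernel buffer. */
IO_STATUS_BLOCK Iosb;
NTSTATUS Status;
Status = MiSimpleRead(FileObject,
                      &FileOffset,  /* LARGE_INTEGER file offset      */
                      Buffer,       /* kernel-space destination       */
                      PAGE_SIZE,
                      TRUE,         /* TRUE -> issue as a paging read */
                      &Iosb);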
NTSTATUS
NTAPI
MiSimpleRead(PFILE_OBJECT FileObject,
@@ -177,6 +190,13 @@ MiSimpleRead(PFILE_OBJECT FileObject,
return Status;
}
/*
Convenience function for writing from kernel space. This issues a paging
write in all cases.
*/
NTSTATUS
NTAPI
_MiSimpleWrite(PFILE_OBJECT FileObject,
@@ -259,6 +279,15 @@ _MiSimpleWrite(PFILE_OBJECT FileObject,
extern KEVENT MpwThreadEvent;
FAST_MUTEX MiWriteMutex;
/*
Function which uses MiSimpleWrite to write back a single page to a file.
The page in question does not need to be mapped. This function could be
made a bit more efficient by avoiding the copy and making a system-space
MDL.
*/
NTSTATUS
NTAPI
_MiWriteBackPage(PFILE_OBJECT FileObject,

View file

@@ -42,6 +42,13 @@
* Herve Poussineau
*/
/*
This file contains functions used by fault.c to do blocking resource
acquisition. To call one of these functions, fill out your
MM_REQUIRED_RESOURCES with a pointer to the desired function and configure
the other members as below.
*/
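For example, a fault handler that needs a fresh page before it can proceed
might fill the struct like this (a sketch; DoAcquisition matches the member
named in fault.c's WORK_QUEUE_WITH_CONTEXT and is assumed to exist here too):

/* Sketch: schedule MiGetOnePage to run outside the locks. */
Required->Consumer = MC_CACHE;          /* consumer to charge */
Required->Amount = 1;                   /* one page wanted    */
Required->DoAcquisition = MiGetOnePage; /* blocking function  */
return STATUS_MORE_PROCESSING_REQUIRED;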
/* INCLUDES *****************************************************************/
#include <ntoskrnl.h>
@@ -55,6 +62,22 @@ VOID
NTAPI
MmBuildMdlFromPages(PMDL Mdl, PPFN_NUMBER Pages);
/*
Blocking function to acquire zeroed pages from the balancer.
Upon entry:
Required->Amount: Number of pages to acquire
Required->Consumer: consumer to charge the page to
Upon return:
Required->Page[0..Amount-1]: The allocated pages.
The function fails unless all requested pages can be allocated.
*/
NTSTATUS
NTAPI
MiGetOnePage(PMMSUPPORT AddressSpace,
@@ -85,6 +108,26 @@ MiGetOnePage(PMMSUPPORT AddressSpace,
return Status;
}
/*
Blocking function to read (part of) a page from a file.
Upon entry:
Required->Context: a FILE_OBJECT to read
Required->Consumer: consumer to charge the page to
Required->FileOffset: Offset to read at
Required->Amount: Number of bytes to read (0 -> 4096)
Upon return:
Required->Page[Required->Offset]: The allocated and read in page
The indicated page is filled with file data up to Required->Amount and
zero-filled from there to the end of the page.
*/
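The corresponding setup in a fault handler might look like this (a sketch
under the member names documented above; DoAcquisition is an assumed name):

/* Sketch: schedule a blocking page-in of file data. */
Required->Context = FileObject;      /* FILE_OBJECT to read      */
Required->Consumer = MC_CACHE;
Required->FileOffset = TotalOffset;  /* page-aligned file offset */
Required->Amount = PAGE_SIZE;        /* read a whole page        */
Required->DoAcquisition = MiReadFilePage;
return STATUS_MORE_PROCESSING_REQUIRED;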
NTSTATUS
NTAPI
MiReadFilePage(PMMSUPPORT AddressSpace,
@@ -158,6 +201,21 @@ MiReadFilePage(PMMSUPPORT AddressSpace,
return STATUS_SUCCESS;
}
/*
Blocking function to read a swap page into a memory page.
Upon entry:
Required->Consumer: consumer to charge the page to
Required->SwapEntry: swap entry to use
Upon return:
Required->Page[Required->Offset]: Populated page
*/
NTSTATUS
NTAPI
MiSwapInPage(PMMSUPPORT AddressSpace,
@@ -193,6 +251,22 @@ MiSwapInPage(PMMSUPPORT AddressSpace,
return Status;
}
/*
A way to write a page, without holding any lock, using the same blocking
mechanism as resource acquisition.
Upon entry:
Required->Page[Required->Offset]: Page to write
Required->Context: FILE_OBJECT to write to
Required->FileOffset: offset to write at
This always issues a paging write of a whole page. Note that paging I/O
doesn't change the valid data length of a file.
*/
NTSTATUS
NTAPI
MiWriteFilePage(PMMSUPPORT AddressSpace,

View file

@@ -23,6 +23,34 @@
* PROGRAMMERS: arty
*/
/*
This file implements the section page table. It relies on rtl generic table
functionality to provide access to 256-page chunks. Calls to
MiSetPageEntrySectionSegment and MiGetPageEntrySectionSegment must be
synchronized by holding the segment lock.
Each page table entry is a ULONG as in x86.
Bit 1 is used as a swap entry indication as in the main page table.
Bit 2 is used as a dirty indication. A dirty page will eventually be written
back to the file.
Bits 3-11 are used as a map count in the legacy mm code. Note that zero is
illegal, as the legacy code does not take advantage of segment rmaps.
Therefore, every segment page is mapped in at least one address space, and
MmUnsharePageEntry is quite complicated. In addition, the page may also be
owned by the legacy cache manager, giving an implied additional reference.
Upper bits are a PFN_NUMBER.
These functions, in addition to maintaining the segment page table also
automatically maintain the segment rmap by calling MmSetSectionAssociation
and MmDeleteSectionAssociation. Segment rmaps are discussed in rmap.c. The
upshot is that it is impossible to have a page properly registered in a segment
page table and not also present in a segment rmap reachable from the
paging machinery.
*/
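Decoding helpers implied by that layout might look like this (illustrative
only, using the description's 1-based bit numbering; the real macros live in
the mm headers):

/* Sketch: decode a section page table entry per the layout above. */
#define SSE_IS_SWAP(E)     ((E) & 0x1)           /* bit 1: swap entry */
#define SSE_IS_DIRTY(E)    ((E) & 0x2)           /* bit 2: dirty      */
#define SSE_SHARE_COUNT(E) (((E) >> 2) & 0x1FF)  /* bits 3-11: maps   */
#define SSE_PFN(E)         ((PFN_NUMBER)((E) >> PAGE_SHIFT))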
/* INCLUDES *****************************************************************/
#include <ntoskrnl.h>
@@ -232,6 +260,17 @@ _MmGetPageEntrySectionSegment(PMM_SECTION_SEGMENT Segment,
return Result;
}
/*
Destroy the RTL generic table that serves as the section's page table, calling
the FreePage function for each non-zero entry in the section page table as
we go. Note that the page table is still technically valid until after all
pages are destroyed, as we don't finally destroy the table until we've freed
each slice. There is no order guarantee for deletion of individual elements,
although it's in-order as written now.
*/
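The shape of that teardown, sketched with the standard RTL generic table
calls (assuming the segment embeds its table as PageTable; the per-entry
FreePage calls are elided):

/* Sketch: walk and delete every 256-entry slice of the page table.
   Restarting the enumeration after each delete keeps it valid. */
PVOID Slice;
while ((Slice = RtlEnumerateGenericTable(&Segment->PageTable, TRUE)) != NULL)
{
    /* ... call FreePage for each of the slice's non-zero entries ... */
    RtlDeleteElementGenericTable(&Segment->PageTable, Slice);
}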
VOID
NTAPI
MmFreePageTablesSectionSegment(PMM_SECTION_SEGMENT Segment,
@@ -271,6 +310,21 @@ MmFreePageTablesSectionSegment(PMM_SECTION_SEGMENT Segment,
DPRINT("Done\n");
}
/*
Retrieves the MM_SECTION_SEGMENT and fills in the LARGE_INTEGER Offset given
by the caller that corresponds to the page specified. This uses
MmGetSegmentRmap to find the rmap belonging to the segment itself, and uses
the result as a pointer to a 256-entry page table structure. The rmap also
includes 8 bits of offset information indicating which of 256 page entries
the rmap corresponds to. This information together gives us an exact offset
into the file, as well as the MM_SECTION_SEGMENT pointer stored in the page
table slice.
NULL is returned if there is no segment rmap for the page.
*/
PMM_SECTION_SEGMENT
NTAPI
MmGetSectionAssociation(PFN_NUMBER Page,

View file

@@ -43,6 +43,21 @@
* Herve Poussineau
*/
/*
This file implements page out infrastructure for cache type sections. This
is implemented a little differently from the legacy mm because mapping in an
address space and membership in a segment are considered separate.
The general strategy here is to try to remove all mappings as gently as
possible, then to remove the page entry from the section itself as a final
step. If at any time during the page-out operation the page is mapped in
a new address space by a competing thread, the operation will abort before
the segment page is finally removed, and the page will be naturally faulted
back into any address spaces required in the normal way.
*/
/* INCLUDES *****************************************************************/
#include <ntoskrnl.h>
@@ -58,6 +73,15 @@ extern PMMWSL MmWorkingSetList;
FAST_MUTEX MiGlobalPageOperation;
/*
MmWithdrawSectionPage removes a page entry from the section segment, replacing
it with a wait entry. The caller must replace the wait entry with a 0 when
any required writing is done. The wait entry must remain until the page is
written to protect against cases where a fault brings a stale copy of the page
back before writing is complete.
*/
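The caller-side discipline this implies, sketched with helper names used
elsewhere in this commit (the exact signatures are assumptions):

/* Sketch: write back a withdrawn page, then retire the wait entry. */
Page = MmWithdrawSectionPage(Segment, &FileOffset, &Dirty);
if (Dirty)
{
    /* The wait entry blocks stale re-faults while we write. */
    Status = MiWriteBackPage(FileObject, &FileOffset, PAGE_SIZE, Page);
}
MmLockSectionSegment(Segment);
MmSetPageEntrySectionSegment(Segment, &FileOffset, 0); /* clear wait entry */
MmUnlockSectionSegment(Segment);
KePulseEvent(&MmWaitPageEvent, IO_NO_INCREMENT, FALSE); /* wake waiters */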
PFN_NUMBER
NTAPI
MmWithdrawSectionPage(PMM_SECTION_SEGMENT Segment,
@@ -120,6 +144,22 @@ MmWithdrawSectionPage(PMM_SECTION_SEGMENT Segment,
}
}
/*
This function determines whether the segment holds the very last reference to
the page being considered and, if so, writes it back or discards it as
appropriate. One small niggle here is that we might be holding the last
reference to the section segment associated with this page. That happens
when the segment is destroyed at the same time that an active swap operation
is occurring, and all maps were already withdrawn. In that case, it's our
responsibility to finalize the segment.
Note that in the current code, WriteZero is always TRUE because the section
always backs a file. In the ultimate form of this code, it also writes back
pages without necessarily evicting them. In ReactOS' trunk, this is vestigial.
*/
NTSTATUS
NTAPI
MmFinalizeSectionPageOut(PMM_SECTION_SEGMENT Segment,
@@ -220,6 +260,20 @@ MmFinalizeSectionPageOut(PMM_SECTION_SEGMENT Segment,
return Status;
}
/*
The slightly misnamed MmPageOutCacheSection removes a page from an address
space in the manner of fault handlers found in fault.c. In the ultimate form
of the code, this is one of the function pointers stored in a memory area
to control how pages in that memory area are managed.
Also misleading is the call to MmReleasePageMemoryConsumer, which releases
the reference held by this address space only. After all address spaces
have had MmPageOutCacheSection succeed on them for the indicated page,
paging out of the cache page can continue.
*/
NTSTATUS
NTAPI
MmPageOutCacheSection(PMMSUPPORT AddressSpace,
@@ -257,6 +311,7 @@ MmPageOutCacheSection(PMMSUPPORT AddressSpace,
MmDeleteVirtualMapping(Process, Address, FALSE, Dirty, &OurPage);
ASSERT(OurPage == Required->Page[0]);
/* Note: this releases the reference held by this address space only. */
MmReleasePageMemoryConsumer(MC_CACHE, Required->Page[0]);
MmUnlockSectionSegment(Segment);
@@ -264,6 +319,25 @@ MmPageOutCacheSection(PMMSUPPORT AddressSpace,
return STATUS_SUCCESS;
}
/*
This function is called by rmap when spare pages are needed by the balancer.
It attempts first to release the page from every address space in which it
appears, and, after a final check that no competing thread has mapped the
page again, uses MmFinalizeSectionPageOut to completely evict the page. If
that's successful, then a suitable non-page entry will be left in the segment
page table; otherwise, the original page is replaced in the section page
map. Failure may result from a variety of conditions, but always leaves
the page mapped.
This code is like the other fault handlers, in that MmPageOutCacheSection has
the option of returning either STATUS_SUCCESS + 1 to wait for a wait entry
to disappear, or to use the blocking callout facility by returning
STATUS_MORE_PROCESSING_REQUIRED and placing a pointer to a function from
reqtools.c in the MM_REQUIRED_RESOURCES struct.
*/
NTSTATUS
NTAPI
MmpPageOutPhysicalAddress(PFN_NUMBER Page)

View file

@@ -418,6 +418,20 @@ MmDeleteRmap(PFN_NUMBER Page, PEPROCESS Process,
KeBugCheck(MEMORY_MANAGEMENT);
}
/*
Return the process pointer given to a previous call to MmInsertRmap, where
the process and address arguments conform to the segment rmap
schema. In short, this requires the address part to be 0xffffff00 + n
where n is between 0 and 255. When such an rmap exists, it specifies a
segment rmap in which the process part is a pointer to a slice of a section
page table, and the low 8 bits of the address represent a page index in the
page table slice. Together, this information is used by
MmGetSectionAssociation to determine which page entry points to this page in
the segment page table.
*/
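The schema can be summarized with an illustrative encode/decode pair
(RMAP_SEGMENT_BASE is a hypothetical name for the 0xffffff00 constant
described above):

/* Sketch: encode/decode a segment rmap per the schema above. */
#define RMAP_SEGMENT_BASE 0xffffff00UL

/* Insert: Process = page table slice, Address = base | page index. */
MmInsertRmap(Page, (PEPROCESS)Slice,
             (PVOID)(RMAP_SEGMENT_BASE | PageIndexInSlice));

/* Look up: recover the slice pointer and the 0..255 index. */
ULONG RawOffset;
PVOID SlicePtr = MmGetSegmentRmap(Page, &RawOffset); /* index in RawOffset */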
PVOID
NTAPI
MmGetSegmentRmap(PFN_NUMBER Page, PULONG RawOffset)
@@ -445,6 +459,12 @@ MmGetSegmentRmap(PFN_NUMBER Page, PULONG RawOffset)
return NULL;
}
/*
Remove the section rmap associated with the indicated page, if it exists.
*/
VOID
NTAPI
MmDeleteSectionAssociation(PFN_NUMBER Page)