/*
 * PROJECT:         ReactOS Kernel
 * LICENSE:         BSD - See COPYING.ARM in the top level directory
 * FILE:            ntoskrnl/mm/ARM3/syspte.c
 * PURPOSE:         ARM Memory Manager System PTE Allocator
 * PROGRAMMERS:     ReactOS Portable Systems Group
 *                  Roel Messiant (roel.messiant@reactos.org)
*/
/* INCLUDES *******************************************************************/
#include <ntoskrnl.h>
#define NDEBUG
#include <debug.h>
#define MODULE_INVOLVED_IN_ARM3
#include <mm/ARM3/miarm.h>
/* GLOBALS ********************************************************************/
PMMPTE MmSystemPteBase;
PMMPTE MmSystemPtesStart[MaximumPtePoolTypes];
PMMPTE MmSystemPtesEnd[MaximumPtePoolTypes];
MMPTE MmFirstFreeSystemPte[MaximumPtePoolTypes];
ULONG MmTotalFreeSystemPtes[MaximumPtePoolTypes];
ULONG MmTotalSystemPtes;
ULONG MiNumberOfExtraSystemPdes;
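//
// Size buckets used for small System PTE requests. MmSysPteIndex lists the
// five bucket sizes (1, 2, 4, 8 and 16 PTEs), and MmSysPteTables, indexed by
// the requested number of PTEs (1 through 16), yields the index of the
// smallest bucket that can satisfy the request. MmSysPteListBySizeCount holds
// one counter per bucket.
//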
const ULONG MmSysPteIndex[5] = { 1, 2, 4, 8, 16 };
const UCHAR MmSysPteTables[] = { 0, // 1
                                 0, // 1
                                 1, // 2
                                 2, 2, // 4
                                 3, 3, 3, 3, // 8
                                 4, 4, 4, 4, 4, 4, 4, 4 // 16
};
LONG MmSysPteListBySizeCount[5];
/* PRIVATE FUNCTIONS **********************************************************/
//
// The free System Page Table Entries are stored in a bunch of clusters,
// each consisting of one or more PTEs. These PTE clusters are connected
// in a singly linked list, ordered by increasing cluster size.
//
// A cluster consisting of a single PTE is marked by having the OneEntry flag
// of its PTE set. The forward link is contained in the NextEntry field.
//
// Clusters containing multiple PTEs have the OneEntry flag of their first PTE
// reset. The NextEntry field of the first PTE contains the forward link, and
// the size of the cluster is stored in the NextEntry field of its second PTE.
//
// Reserving PTEs currently happens by walking the linked list until a cluster
// is found that contains the requested number of PTEs or more. This cluster
// is removed from the list, and the requested number of PTEs is taken from the
// tail of this cluster. If any PTEs remain in the cluster, the linked list is
// walked again until a second cluster is found that contains at least as many
// PTEs as remain in the first. The first cluster is then inserted in front of
// the second one, which keeps the list ordered by increasing cluster size.
//
// Releasing PTEs currently happens by walking the whole linked list, recording
// the first cluster that contains at least the number of PTEs being released.
// When a cluster is found that is adjacent to the PTEs being released, this
// cluster is removed from the list and merged into the PTEs being released.
// This ensures no two free clusters are ever adjacent, which keeps clusters as
// large as possible. After the walk is complete, a new cluster is created from
// the PTEs being released, which is then inserted in front of the recorded
// cluster.
//
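//
// As an illustration of the layout described above (the indices used here are
// made up), a free list holding a single-PTE cluster at index 100 and a
// three-PTE cluster at index 200, both relative to MmSystemPteBase, would be
// encoded as:
//
//   MmFirstFreeSystemPte[Type].u.List.NextEntry = 100   // smallest cluster first
//   MmSystemPteBase[100].u.List.OneEntry  = 1           // single-PTE cluster
//   MmSystemPteBase[100].u.List.NextEntry = 200         // forward link
//   MmSystemPteBase[200].u.List.OneEntry  = 0           // multi-PTE cluster
//   MmSystemPteBase[200].u.List.NextEntry = MM_EMPTY_PTE_LIST  // end of list
//   MmSystemPteBase[201].u.List.NextEntry = 3           // size in the second PTE
//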
FORCEINLINE
ULONG
MI_GET_CLUSTER_SIZE(IN PMMPTE Pte)
{
    //
    // First check for a single PTE
    //
    if (Pte->u.List.OneEntry)
        return 1;

    //
    // Then read the size from the trailing PTE
    //
    Pte++;
    return (ULONG)Pte->u.List.NextEntry;
}
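//
// Reserves NumberOfPtes contiguous System PTEs from the pool identified by
// SystemPtePoolType. Alignment must not exceed PAGE_SIZE. Returns a pointer
// to the first PTE of the reserved range, or NULL if no free cluster is large
// enough to satisfy the request.
//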
PMMPTE
NTAPI
MiReserveAlignedSystemPtes(IN ULONG NumberOfPtes,
                           IN MMSYSTEM_PTE_POOL_TYPE SystemPtePoolType,
                           IN ULONG Alignment)
{
    KIRQL OldIrql;
    PMMPTE PreviousPte, NextPte, ReturnPte;
    ULONG ClusterSize;

    //
    // Sanity check
    //
    ASSERT(Alignment <= PAGE_SIZE);

    //
    // Acquire the System PTE lock
    //
    OldIrql = KeAcquireQueuedSpinLock(LockQueueSystemSpaceLock);

    //
    // Find the last cluster in the list that doesn't contain enough PTEs
    //
    PreviousPte = &MmFirstFreeSystemPte[SystemPtePoolType];
    while (PreviousPte->u.List.NextEntry != MM_EMPTY_PTE_LIST)
    {
        //
        // Get the next cluster and its size
        //
        NextPte = MmSystemPteBase + PreviousPte->u.List.NextEntry;
        ClusterSize = MI_GET_CLUSTER_SIZE(NextPte);

        //
        // Check if this cluster contains enough PTEs
        //
        if (NumberOfPtes <= ClusterSize)
            break;

        //
        // On to the next cluster
        //
        PreviousPte = NextPte;
    }

    //
    // Make sure we didn't reach the end of the cluster list
    //
    if (PreviousPte->u.List.NextEntry == MM_EMPTY_PTE_LIST)
    {
        //
        // Release the System PTE lock and return failure
        //
        KeReleaseQueuedSpinLock(LockQueueSystemSpaceLock, OldIrql);
        return NULL;
    }

    //
    // Unlink the cluster
    //
    PreviousPte->u.List.NextEntry = NextPte->u.List.NextEntry;

    //
    // Check if the reservation spans the whole cluster
    //
if (ClusterSize == NumberOfPtes)
{
//
// Return the first PTE of this cluster
//
ReturnPte = NextPte;
//
// Zero the cluster
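// (a one-entry cluster keeps all of its bookkeeping in that single PTE;
// larger clusters also store their size in the second PTE, so both
// PTEs are cleared before the range is handed to the caller)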
//
if (NextPte->u.List.OneEntry == 0)
{
NextPte->u.Long = 0;
NextPte++;
}
NextPte->u.Long = 0;
}
else
{
//
// Divide the cluster into two parts
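// (the request is cut from the tail: the low ClusterSize PTEs stay in
// the pool as a free cluster, the top NumberOfPtes PTEs are returned)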
//
ClusterSize -= NumberOfPtes;
ReturnPte = NextPte + ClusterSize;
//
// Set the size of the first cluster, zero the second if needed
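// (if only one PTE remains free, mark it as a one-entry cluster and
// clear the old size field, which now falls inside the returned range;
// otherwise store the reduced size in the cluster's second PTE)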
//
if (ClusterSize == 1)
{
NextPte->u.List.OneEntry = 1;
ReturnPte->u.Long = 0;
}
else
{
NextPte++;
NextPte->u.List.NextEntry = ClusterSize;
}
//
// Step through the cluster list to find out where to insert the
// remaining (first) part of the cluster
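// (insertion preserves the list's ordering by increasing cluster size,
// as enforced by the break condition below)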
//
PreviousPte = &MmFirstFreeSystemPte[SystemPtePoolType];
while (PreviousPte->u.List.NextEntry != MM_EMPTY_PTE_LIST)
{
//
// Get the next cluster
//
NextPte = MmSystemPteBase + PreviousPte->u.List.NextEntry;
//
// Check if the cluster to insert is smaller or of equal size
//
if (ClusterSize <= MI_GET_CLUSTER_SIZE(NextPte))
break;
//
// On to the next cluster
//
PreviousPte = NextPte;
}
//
// Compute the base of the remaining (first) part of the cluster and
// link it back into the cluster list
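// (NextEntry stores a PTE index relative to MmSystemPteBase rather than
// a pointer, hence the base adjustments when linking)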
//
NextPte = ReturnPte - ClusterSize;
NextPte->u.List.NextEntry = PreviousPte->u.List.NextEntry;
PreviousPte->u.List.NextEntry = NextPte - MmSystemPteBase;
}
//
// Decrease availability
//
MmTotalFreeSystemPtes[SystemPtePoolType] -= NumberOfPtes;
//
// Release the System PTE lock
//
KeReleaseQueuedSpinLock(LockQueueSystemSpaceLock, OldIrql);
//
// Flush the TLB
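// (so no stale translations for the recycled PTE range survive from an
// earlier mapping)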
//
KeFlushProcessTb();
    //
    // Return the reserved PTEs
    //
    return ReturnPte;
}
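
//
// Reserve system PTEs with no particular alignment: thin wrapper that calls
// the aligned variant below with an alignment argument of 0
//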
PMMPTE
NTAPI
MiReserveSystemPtes(IN ULONG NumberOfPtes,
                    IN MMSYSTEM_PTE_POOL_TYPE SystemPtePoolType)
{
    PMMPTE PointerPte;
    //
    // Use the extended function
    //
    PointerPte = MiReserveAlignedSystemPtes(NumberOfPtes, SystemPtePoolType, 0);
    //
    // Return the PTE Pointer
    //
    return PointerPte;
}
VOID
NTAPI
MiReleaseSystemPtes(IN PMMPTE StartingPte,
                    IN ULONG NumberOfPtes,
                    IN MMSYSTEM_PTE_POOL_TYPE SystemPtePoolType)
{
    KIRQL OldIrql;
    ULONG ClusterSize;
    PMMPTE PreviousPte, NextPte, InsertPte;
    //
    // Check to make sure the PTE address is within bounds
    //
    ASSERT(NumberOfPtes != 0);
    ASSERT(StartingPte >= MmSystemPtesStart[SystemPtePoolType]);
    ASSERT(StartingPte + NumberOfPtes - 1 <= MmSystemPtesEnd[SystemPtePoolType]);
    //
    // Zero PTEs
    //
    RtlZeroMemory(StartingPte, NumberOfPtes * sizeof(MMPTE));
    //
    // Acquire the System PTE lock
    //
    OldIrql = KeAcquireQueuedSpinLock(LockQueueSystemSpaceLock);
    //
    // Increase availability
    //
    MmTotalFreeSystemPtes[SystemPtePoolType] += NumberOfPtes;
    //
    // Step through the cluster list to find where to insert the PTEs
    //
    PreviousPte = &MmFirstFreeSystemPte[SystemPtePoolType];
    InsertPte = NULL;
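    //
    // The free cluster list is terminated by MM_EMPTY_PTE_LIST and links
    // clusters by their PTE offset from MmSystemPteBase
    //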
    while (PreviousPte->u.List.NextEntry != MM_EMPTY_PTE_LIST)
    {
        //
        // Get the next cluster and its size
        //
        NextPte = MmSystemPteBase + PreviousPte->u.List.NextEntry;
        ClusterSize = MI_GET_CLUSTER_SIZE(NextPte);
        //
        // Check if this cluster is adjacent to the PTEs being released
        //
        if ((NextPte + ClusterSize == StartingPte) ||
            (StartingPte + NumberOfPtes == NextPte))
        {
            //
            // Add the PTEs in the cluster to the PTEs being released
            //
            NumberOfPtes += ClusterSize;
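            //
            // If the adjacent cluster lies below the released range, the
            // merged block now starts at the cluster's base
            //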
            if (NextPte < StartingPte)
                StartingPte = NextPte;
            //
            // Unlink this cluster and zero it
            //
            PreviousPte->u.List.NextEntry = NextPte->u.List.NextEntry;
            if (NextPte->u.List.OneEntry == 0)
            {
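                //
                // Clusters spanning more than one PTE keep their bookkeeping
                // in two PTEs, so clear the first one and advance; the shared
                // path below zeroes the remaining bookkeeping PTE
                //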
                NextPte->u.Long = 0;
                NextPte++;
            }
            NextPte->u.Long = 0;
            //
// Invalidate the previously found insertion location, if any
//
InsertPte = NULL;
}
else
{
//
// Check if the insertion location is right before this cluster
//
if ((InsertPte == NULL) && (NumberOfPtes <= ClusterSize))
InsertPte = PreviousPte;
//
// On to the next cluster
//
PreviousPte = NextPte;
}
}
//
// If no insertion location was found, use the tail of the list
//
if (InsertPte == NULL)
InsertPte = PreviousPte;
//
// Create a new cluster using the PTEs being released
//
if (NumberOfPtes != 1)
{
StartingPte->u.List.OneEntry = 0;
NextPte = StartingPte + 1;
NextPte->u.List.NextEntry = NumberOfPtes;
}
else
StartingPte->u.List.OneEntry = 1;
//
// Link the new cluster into the cluster list at the insertion location
//
StartingPte->u.List.NextEntry = InsertPte->u.List.NextEntry;
InsertPte->u.List.NextEntry = StartingPte - MmSystemPteBase;
//
// Release the System PTE lock
//
KeReleaseQueuedSpinLock(LockQueueSystemSpaceLock, OldIrql);
}
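
/*
 * Illustrative sketch (kept as a comment, not compiled): the release path
 * above encodes a free cluster in its own PTEs -- the first PTE carries the
 * OneEntry flag and the NextEntry link, and for multi-PTE clusters the
 * second PTE carries the cluster size. Given a pointer to the first PTE of
 * a free cluster, the size could be read back as shown; PointerPte is a
 * hypothetical name used only for this example:
 *
 *     ULONG ClusterSize;
 *
 *     if (PointerPte->u.List.OneEntry)
 *         ClusterSize = 1;
 *     else
 *         ClusterSize = (ULONG)((PointerPte + 1)->u.List.NextEntry);
 */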
CODE_SEG("INIT")
VOID
NTAPI
MiInitializeSystemPtes(IN PMMPTE StartingPte,
IN ULONG NumberOfPtes,
IN MMSYSTEM_PTE_POOL_TYPE PoolType)
{
//
// Sanity checks
//
ASSERT(NumberOfPtes >= 1);
//
// Set the starting and ending PTE addresses for this space
//
MmSystemPteBase = MI_SYSTEM_PTE_BASE;
MmSystemPtesStart[PoolType] = StartingPte;
MmSystemPtesEnd[PoolType] = StartingPte + NumberOfPtes - 1;
DPRINT("System PTE space for %d starting at: %p and ending at: %p\n",
PoolType, MmSystemPtesStart[PoolType], MmSystemPtesEnd[PoolType]);
//
// Clear all the PTEs to start with
//
RtlZeroMemory(StartingPte, NumberOfPtes * sizeof(MMPTE));
//
// Make the first entry free and link it
//
StartingPte->u.List.NextEntry = MM_EMPTY_PTE_LIST;
MmFirstFreeSystemPte[PoolType].u.Long = 0;
MmFirstFreeSystemPte[PoolType].u.List.NextEntry = StartingPte -
MmSystemPteBase;
//
// The second entry stores the size of this PTE space
//
StartingPte++;
StartingPte->u.Long = 0;
StartingPte->u.List.NextEntry = NumberOfPtes;
//
// We also keep a global for it
//
MmTotalFreeSystemPtes[PoolType] = NumberOfPtes;
//
// Check if this is the system PTE space
//
if (PoolType == SystemPteSpace)
{
//
// Remember how many PTEs we have
//
MmTotalSystemPtes = NumberOfPtes;
}
}
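
/*
 * Usage sketch (kept as a comment, not compiled): how a caller would
 * typically consume the pool initialized above, assuming the reservation
 * and release routines declared by this module. The two-PTE count and the
 * MiPteToAddress conversion are example choices, not requirements:
 *
 *     PMMPTE PointerPte;
 *     PVOID MappedVa;
 *
 *     PointerPte = MiReserveSystemPtes(2, SystemPteSpace);
 *     if (PointerPte != NULL)
 *     {
 *         MappedVa = MiPteToAddress(PointerPte);
 *         ... build valid PTEs for the physical pages and use MappedVa ...
 *         MiReleaseSystemPtes(PointerPte, 2, SystemPteSpace);
 *     }
 */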
/* EOF */