Garbage Collection Internals (Part 1 – C)

Why do programs produce garbage? 🤔

Well, we first have to understand memory internals of programming languages to understand what is dynamic memory and what constitutes “garbage”. So let’s take a look at this Digital ConceptVisual video which explains Process Map, Segments, Global Data area, Stack and Heap, the last one being the focus of our attention.

Compilation Process of C/C++ and Process Map

As we can see from the above video, global variables (if any in the language) are stored in DATA/BSS segment, local variables are in the STACK and any data created at run time are stored in HEAP.

Languages like C/C++ were mostly Global and Local variables and only when the developer specifically called malloc/calloc or new, it allocated memory in the HEAP.

One great thing about HEAP allocation is that it’s allocated at runtime and can be “released” by the developer. If the developer stops using the allocated memory and does not release it, well, that’s when it’s called “Garbage” memory.

Different languages have different strategy for releasing heap allocated memory once it’s no longer needed

TechnologyGrabage Collection
CYou clean up your own mess
C++You clean up your own mess
JavaMark and Sweep Garbage Collector (GC)
SwiftAutomatic Reference Counting (ARC)
COMReference Counting (RC)
C#Mark and Sweep Garbage Collector (GC)
PythonAutomatic Reference Counting (ARC) + Mark and Sweep Garbage Collector (GC)
JavaScriptEarlier was ARC but now most JS Engine implement Mark and Sweep (GC)
Language/Technology and Garbage Collection

So there are 4 Garbage collection strategies

  1. No automatic Garbage Collection – C/C++
  2. Manual Reference Counting (RC) – COM
  3. Automatic ReferenceCounting (ARC) – Swift
  4. Mark and Sweep (Compact or Copy) – Java, .NET, JavaScript
1. No Automatic Garbage Collection - C/C++
/*
 * File:      NoGC.c
 * Project:   CPP
 * Author:    Sanjay Vyas
 * 
 * Description:
 *      C program to demonstarte No GC
 * 
 * Revision History:
 * 2020-Jun-20	[SV]: Created
 */

#include <stdio.h>  // printf/scanf
#include <stdlib.h> // malloc/qsort

// Maximum random number
#define MAX (100)

// Helper function to print array
void print(int *array, int size)
{
    int offset;

    // Print comma (upto 2nd last)
    for (offset = 0;
         offset < size-1;
         offset++)
    {
        printf("%d, ", array[offset]);
    }

    // Print last element
    printf("%d\n", array[size-1]);
}

// Compare function required for qsort
// Return
//      0 if both are equal
//     <0 if lhs is < rhs
//     >0 if lhs is > rhs
int intcmp(const void *lhs, const void *rhs) 
{
    // We are given 2 elements as void *
    // We need to typecast them back to int
    return *(int *)lhs - *(int *)rhs;
}

int main()
{
    int *pointer = NULL;    // Defensive programming 🙂
    int size;
    int offset;

    while (printf("Enter size (0 to quit): "),
           scanf("%d", &size),
           size)
    {
        // Allocate heap memory
        pointer = (int *)malloc(size * sizeof(int));

        // Randomize it
        for (offset = 0;
             offset < size;
             offset++)
        {
            pointer[offset] = rand()%MAX;
        }

        // Now print the array
        print(pointer, size);

        // Let's sort it using built in function
        qsort(pointer, size, sizeof(int), intcmp);

        // And print it again
        print(pointer, size); 

        // WARNING! Memory is LEAKING here
        // We did not free(pointer)
        // The loop will use same pointer
        // to malloc again, losing the previous address
    }
}
What happens we do a malloc or new
This is what happens when we do a malloc
And what is a memory leak?
Losing address without doing free is MEMORY LEAK

What we see in the above image is a result of not deallocating memory (free or delete) and losing the address. Over a period of time, the application will fill up its heap area and would no longer be able to allocate more memory as previously unused allocations are occupying the heap area. These unused allocations are called GARBAGE.

Remember to free your allocations
/*
 * File:      FreeMem.c
 * Project:   CPP
 * Author:    Sanjay Vyas
 * 
 * Description:
 *      C program to demonstarte No GC
 * 
 * Revision History:
 * 2020-Jun-20	[SV]: Created
 */

#include <stdio.h>  // printf/scanf
#include <stdlib.h> // malloc/qsort

// Maximum random number
#define MAX (100)

// Helper function to print array
void print(int *array, int size)
{
    int offset;

    // Print comma (upto 2nd last)
    for (offset = 0;
         offset < size - 1;
         offset++)
    {
        printf("%d, ", array[offset]);
    }

    // Print last element
    printf("%d\n", array[size - 1]);
}

// Compare function required for qsort
// Return
//      0 if both are equal
//     <0 if lhs is < rhs
//     >0 if lhs is > rhs
int intcmp(const void *lhs, const void *rhs)
{
    // We are given 2 elements as void *
    // We need to typecast them back to int
    return *(int *)lhs - *(int *)rhs;
}

int main()
{
    int *pointer = NULL; // Defensive programming 🙂
    int size;
    int offset;

    while (printf("Enter size (0 to quit): "),
           scanf("%d", &size),
           size)
    {
        // Allocate heap memory
        pointer = (int *)malloc(size * sizeof(int));

        // Randomize it
        for (offset = 0;
             offset < size;
             offset++)
        {
            pointer[offset] = rand() % MAX;
        }

        // Now print the array
        print(pointer, size);

        // Let's sort it using built in function
        qsort(pointer, size, sizeof(int), intcmp);

        // And print it again
        print(pointer, size);

        // Remember to deallocate memory
        free(pointer);
    }
}

In C programming there is no built-in GC (Garbage Collection) and as developers we must keep track of the memory we are allocating and runtime and deallocate it when not required. This is tough given that the codebase may be huge and memory addresses may be passed around, losing track of who allocated it and who will deallocate it. Add to this, many developers do not have visualisation of what is happening in memory.

No wonder, languages which came after C/C++ implemented some from of Garbage Collection (Java, .NET, Python, Swift)

Garbage Collection Internals Series

  1. Part 1 – C language
  2. Part 2 – C++ Language
  3. Part 3 – Java

Process Map

I am in love with Process Maps. What are they?

Whenever a program is loaded into memory and executed as a process, the OS doesn’t simply dump the executable, rather, arranges and structures various sections into segments – code, data, heap, stack etc. In case of dynamic languages there may not be a binary executable to begin with, so their process maps are populated gradually. C/C++ have relatively simple process maps where as others can be quite complex, like .NET has 9 distinct types of heap alone.

To understand how something works, it’s best to look it how it works internally. Visualise opening up the engine of a car to understand what gears do, what happens when we press the accelerator etc. Similarly, to understand programming, visualising the process map instead of just reading the syntax will make us understand it better

Compilation Process and Process Map of C/C++

Here is a video which explains what happens when we compile a program, say in C/C++, and execute it. It is interesting to visualise process maps of Java, C# and more interesting are JavaScript and Python as they are dynamic languages.

Fundamentals

Maybe it is the way we learnt programming by patterns – “Do this and that will happen” or “this is how it is done”. Maybe we don’t stop to think how things work at more fundamental level.

Here is a piece of code which works in some languages (C, C++, JavaScript) but won’t work in others (Java, C#)

C/C++ allows a statement like 1, 2, 3;
/*
   Works in C/C++
*/
int main()
{
    1, 2, 3;  // Why does this work? What does it mean?
}

Java and C# don't allow 1, 2, 3
/*
   Java (and C#) don't allow
*/
class MyClass {
    static void main(String[] args) {
        1, 2, 3; // Gives an error.. Why?
}

Why do you think this is? Is there a fundamental reason these languages differ? Why did Java disallow this? If you know why, write it in comments below