Date: 2023-02-06
Author: namhas

Data structures in C

For working with runtime data, life is easy for today's programmers, because modern, advanced, high-level programming languages come with efficient data structures.

So, instead of having to know how data structures work, instead of having to write them from scratch, today's programmers simply put their data to the structures provided for them.

Why some programmers opt to develop software using the low-level C programming language, which does not come with data structures? Some programmers even does not use a library to bring in the common data structures.

One of the common data structures is the linked list. It is efficient, it is flexible, but it is missing from C. However, it is used all the time in C, but some people don't bother taking it from a library. Why don't the programmers simply use one common linked list implementation?

I tested this. I had a 300 lines long module which was using custom linked lists very extensively. Then, I wrote a generic linked list implementation, that could be used for all the use cases of this module, and I reimplemented the module.

The reimplemented module turned out to be 288 lines long. So, it became 4% smaller. A negligible benefit. Furthermore, the generic linked list implementation became a 270 lines long module of its own, almost doubling the size if combined. We would need 23 similarly sized modules that are using linked lists extensively, before we would offset the cost added by the generic implementation.

Apart from these numbers, before jumping to conclusions that even a 4% reduction in code size would automatically translate to better maintainability and readability, there was also other problems:

Memory management: We need to detach the information used for the data structure from the data itself, so instead of managing the memory for one piece of data, we need to track two pieces separately: the data itself, and the metadata used for the data structure. Otherwise we would need to export the data structures or otherwise compromise modularity or generality. In a custom implementation, they can be combined with ease.
Iteration: We obviously need to use function calls, which already makes the iteration a bit harder, but we need to work with three types of data. The list structure, which contains 'head' and 'tail' pointers, the node structure, which contains 'next', and 'prev' pointers, and then the data itself. So, we need more variables, while in a custom implementation all these can be combined. More variables mean more lines of code, but it also means higher cognitive load when deciphering the code.
Deleting elements or accessing neighbors: We need to store a separate pointer or instance of an iterator, with our data object, if we want to have O(1) complexity for the delete operation or nearest neighbor operations. Otherwise we need to search for the data first, which creates O(n) complexity.
Type safety: For being generic, the implementation needs to support all types. So, we have two choices: either we use macros, so that preprocessor would write type-specific code for us, or we replace C with something like C++, and use templates for the generics programming, which does the same in a glorified way. Or then we discard typesafety and use "void pointers", which is possible in C.
Modularity: Information hiding helps to achieve modularity if the rule of decomposition for modularity is taken as limiting changes to one component when changes are needed. However, for achieving either type safety, and/or simplified memory management for bundling data and metadata to one object through the use of macros or templates, we need to export the data structures, which decreases information hiding. If we want to avoid this, we need to compromise and e.g. use "void pointers" and multiple objects.
Debugging: If we use macros or templates for ensuring continued type safety, our source code is modified in an intermediate step before it is being compiled to the machine code, which means the code we see in a debugger may look unfamiliar. Similarly, compile errors might become harder to decipher, because the errors are based on the code that is being generated on our behalf. Using a modern language with a good generics support helps here, or does it? Many C++ programmers know the horror of C++ template errors.
Performance: We need to manage allocations and deallocations for more separate memory regions and we need to do a function call in tight inner loops when iterating the data.
Flexibility: Since it is a generic implementation, we lose the flexibility of suddenly turning the data structure to some sort of a combined composite data structure.
Clarity: Data structures are central to programming. If we hide how we work with the data, what will our programs become? The data may become buried. It is there, and it is here. We may lose control, because we made it easy to lose the control. When we work with data structures explicitly, it is easier to contemplate on who owns the data and how it relates to other data, because we do the work and nothing happens automatically.
Dependency: The modules that use the generic linked list implementation will be tied to that particular API and need to carry the linked list implementation module with them wherever they go. In worst case, when combining different modules from different programs, each use a different generic linked list implementation.

For the sake of mere 4% reduction in user code, do we welcome all these problems?

C++ is great as it would fix many of these problems. However, we would still have the following problems:

Iteration: We need to use a separate templated iterator instead of simply directly accessing the data, even though C++ makes this so much nicer due to overloading and templates, using which we can access the actual data in a type safe way, but we still need to create the iterator.
Deleting items: Same reasons as with C.
Type safety: For enabling type safety, we need to work with templates, which introduce problems of their own.
Debugging: Since we are using templates, translation to machine code is no longer one-to-one with the source code.
Performance: Performance obviously suffers because we need to manage more memory regions than if we would simply bundle data and metadata to the same object. We also need to create separate iterators and manage their memory behind the scenes.
Flexibility: For turning the data structure to a purpose-specific composite data structure, when following the C++ design style, we would need to do a whole new templated object, which requires creating a templatized-object that has uses operator overloading, and that is a lot of work for a simple thing, so the programmer might avoid it due to the complexity.
Clarity: Same reasons as with C.

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships." -- Linus Torvalds.

"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't need your flowchart, it'll be obvious." -- Fred Brooks.

Comparisons with examples

Lines

Type	Init	Add x3	Iterate x6	Delete x3	Total
C (generic)	4	3	18	3	28
C (double, funcs)	4	7	12	10	33
Javascript (dynamic hash)	2	3	12	18	35
C (single)	4	6	12	27	49
C (double)	4	12	12	24	52
C++	7	18	18	24	60

Theoretical performance

Cost is added if:

Allocation
Deallocation
External function call
Any operation on data

Assuming existing n in the data structure is 100.

Type	Init	Add x3	Iterate x6	Delete x3	Total
C (double)	1	3	600	6	610
C (double, funcs)	1	3	600	6	610
C (single)	1	3	600	600	1204
C (generic)	2	6	1200	606	1814
C++	1	6	1800	1200	3007
Javascript (dynamic hash)	2	6	2400	2400	4808

Examples

Example C++ version (43 lines)

The C++ version is most inefficient because it does not allow O(1) delete operation. When deleting, we need to search through the list to find an item to delete first. The C++ version also results in a binary that is almost 5 times larger than the C counterpart.

Also note two allocations and two deallocations are necessary for each item, as well as stack allocations and function calls when iterating.


1	#include <list>
2	#include <iostream>
3
4	struct foo {
5		int value;
6	};
7
8	int main()
9	{
10		std::list<struct foo *> fl;
11		struct foo *foo;
12
13		// Allocate the data itself (7 lines)
14		try {
15			foo = new struct foo;
16		} catch (std::bad_alloc &ba) {
17			std::cerr << "foo: " << ba.what() << "\n";
18			return 1;
19		}
20		foo->value = 5;
21
22		// Add the data item to the list (6 lines)
23		try {
24			fl.push_front(foo);
25		} catch (std::bad_alloc &ba) {
26			std::cerr << "list node: " << ba.what() << "\n";
27			return 1;
28		}
29
30		// Show all items in the list (3 lines)
31		std::list<struct foo *>::iterator i;
32		for (i = fl.begin(); i != fl.end(); i++)
33			std::cout << (*i)->value << "\n";
34
35		// Delete our data item (8 lines)
36		std::list<struct foo *>::iterator j;
37		for (j = fl.begin(); j != fl.end(); j++)
38			if ((*j) == foo) {
39				fl.erase(j);
40				break;
41			}
42		delete foo;
43	}

Example C version, generic (37 lines)

Surprisingly, this takes little less space than the C++ version, but it compromises on type safety because a void pointer is used. However, it is quite readable compared to the C++ version.

Also, this is faster than the C++ version, because it allows O(1) delete operation, but like C++ version, this requires double allocations and deallocations.


1	#include "list.h"
2	
3	#include <err.h>
4	#include <stdlib.h>
5	#include <stdio.h>
6
7	struct foo {
8		int value;
9		struct listnode *ln;
10	};
11	
12	int
13	main()
14	{
15		struct foo *foo;
16		struct foo *p;
17		struct list list = { 0 };
18		struct listnode *np;
19	
20		/* Allocate the data itself (4 lines) */
21		foo = calloc(1, sizeof(struct foo));
22		if (!foo)
23			err(1, "malloc foo");
24		foo->value = 5;
25	
26		/* Add the data item to the list (1 line) */
27		foo->ln = list_add(&list, foo);
28
29		/* Show all items in the list (3 lines) */
30		np = list_first(&list);
31		while (p = listnode_next_value(&np))
32			printf("%d\n", p->value);
33	
34		/* Delete our data item (1 line) */
35		listnode_remove(&list, foo->ln);
36		free(foo);
37	}

Example C version, singly linked version (42 lines)

This is fast, type safe version. It is comparable to the C++ version from algorithmic complexity point of view.


1	#include <err.h>
2	#include <stdlib.h>
3	#include <stdio.h>
4
5	struct foo {
6		int value;
7		struct foo *next;
8	};
9	
10	int
11	main()
12	{
13		struct foo *p;
14		struct foo *foo;
15		struct foo *head = NULL;
16		struct foo *prev = NULL;
17	
18		/* Allocate the data itself (4 lines) */
19		foo = calloc(1, sizeof(struct foo));
20		if (!foo)
21			err(1, "malloc foo");
22		foo->value = 5;
23
24		/* Add the data item to the list (2 lines) */
25		foo->next = head;
26		head = foo;
27	
28		/* Show all items in the list (2 lines) */
29		for (p = head; p; p = p->next)
30			printf("%d\n", p->value);
31
32		/* Delete our data item (9 lines) */
33		for (p = head; p; prev = p, p = p->next)
34			if (p == foo) {
35				if (prev)
36					prev->next = p->next;
37				else
38					head = p->next;
39				free(p);
40				break;
41			}
42	}

Example C version, doubly linked list (42 lines)

This is the fastest, type safe version, but requires a little more lines and compromises on readability because there is already too much code in one function.

It is the fastest because it has O(1) complexity deletion, instead of O(n), and allows for nearest neighbor operations.


1	#include <err.h>
2	#include <stdlib.h>
3	#include <stdio.h>
4
5	struct foo {
6		int value;
7		struct foo *prev;
8		struct foo *next;
9	};
10	
11	int
12	main()
13	{
14		struct foo *p;
15		struct foo *foo;
16		struct foo *head = NULL;
17	
18		/* Allocate the data itself (4 lines) */
19		foo = calloc(1, sizeof(struct foo));
20		if (!foo)
21			err(1, "malloc foo");
22		foo->value = 5;
23
24		/* Add the data item to the list (4 lines) */
25		if (head)
26			head->prev = foo;
27		foo->next = head;
28		head = foo;
29	
30		/* Show all items in the list (2 lines) */
31		for (p = head; p; p = p->next)
32			printf("%d\n", p->value);
33	
34		/* Delete our data item (8 lines) */
35		if (foo == head)
36			head = foo->next;
37		if (foo->next)
38			foo->next->prev = foo->prev;
39		if (foo->prev)
40			foo->prev->next = foo->next;
41		free(foo);
42	}

Example C version, doubly linked list, readable (53 lines)

This is fast, type safe version. It is the longest, but does not compromise on readability, because the functions are well named and operations that are related are well bundled together.

When related operations are bundled together, such as the initialization of the actual data object, and adding the data object to the list, the add and remove operations become custom: when we want to add, or remove, we often do more than just add or remove to a linked list. In this way we get the readability benefit even if we sprinkle the linked list code to these functions once per each custom data object.

If we assume we would in any case write _add and _remove functions for our custom object, for calculating the extra added lines of the linked list code is 3 lines for _add, and 5 lines for _remove, assuming the alternative would be 1 line of code for each. So, it would be only 8 lines extra for each module.

Also, whenever we work with linked lists, or any data structure manually in this way, with small tweaks we can make it to be, for example, a singly linked list instead of a doubly linked list, and we can add to the tail, to the head, or after a particular item as we want.


#include <err.h>
#include <stdlib.h>
#include <stdio.h>

static struct foo {
	int value;
	struct foo *prev;
	struct foo *next;
} *head;

struct foo *
foo_add(int value)
{
	struct foo *foo;

	foo = calloc(1, sizeof(struct foo));
	if (!foo)
		err(1, "malloc foo");
	foo->value = value;

	if (head)
		head->prev = foo;
	foo->next = head;
	head = foo;

	return foo;
}

void
foo_remove(struct foo *foo)
{
	if (foo == head)
		head = foo->next;
	if (foo->next)
		foo->next->prev = foo->prev;
	if (foo->prev)
		foo->prev->next = foo->next;	
	free(foo);
}

int
main()
{
	struct foo *foo, *p;

	foo = foo_add(5);

	for (p = head; p; p = p->next)
		printf("%d\n", p->value);

	foo_remove(foo);
}

Example C, static array (30 lines)


#include <stdio.h>

struct foo {
	int values[1];
	int sz;
};

int
main()
{
	int i;

	/* Allocate the data itself (1 line) */
	struct foo foo = { 0 };

	/* Add the data item to the list (3 lines) */
	if (foo.sz == sizeof(foo.values))
		err(1, "list is full");
	foo.values[foo.sz++] = 5;

	/* Show all items in the list (2 lines) */
	for (i = 0; i < foo.sz; i++)
		printf("%d\n", foo.values[i]);

	/* Delete our data item (3 lines) */
	for (i = 0; i < foo.sz; i++)
		if (foo.values[i] == 5)
			memmove(foo.values[i], foo.values[i+1], foo.sz-i);
}

Example Javascript (24 lines)

This obviously is not a linked list and the performance suffers greatly when compared to a real linked list. Behind the scenes this is a dynamic array, or a hash map, so operations like arbitrary add or removal are expensive, but short from source code lines point of view. Also, there is no type safety whatsoever and the garbage collection process may do whatever it wants, so there is no control of memory allocation.


var obj = {
	"value" : null
};

function main()
{
	// Allocate the data itself (2 lines)
	var list = new Array();
	obj["value"] = 5;	

	// Add the data item to the list (1 line)
	list.push(obj);

	// Show all items in the list (2 lines)
	for (i = 0; i < list.length; i++)
		console.log(list[i]);

	// Delete our data item (6 lines)
	for (i = 0; i < list.length; i++)
		if (list[i].value == 5) {
			list.slice(i, 1);
			break;
		}
}