Algorithms and Data Structures — Part I
Data Structures
Data structures exist to handle large amounts of data efficiently for different uses. Avoiding nested for loops, writing fewer operations, and making every calculation as fast as possible are all things to consider when designing an algorithm, but algorithms can also be boosted, quite considerably, by choosing the proper data structure. We must consider data structures to make our programs faster.
Bad programmers worry about the code. Good programmers worry about data structures and their relationships. — Linus Torvalds
Dijkstra’s algorithm would be O(N²) without the proper data structure (heap / priority queue). With it, the time complexity is O(N log N). There is a trade-off between run time complexity and memory complexity: we choose one at the expense of the other. We can use data structures (and more memory) to boost an algorithm’s run time speed. The same holds for the spanning tree algorithms (*research this*).
Abstract data types (interfaces, etc.) define a model (logical description) for the data structure. Functions, like push() and pop(), that determine the basic behavior of the data structure are defined here. They do not specify the implementation, just the interface. Data structures themselves define the concrete implementation, or actual representation, of the data. The aim is to store and retrieve data in an efficient manner: we want to insert, find, and/or remove items in O(1) time.
Arrays
Arrays are data structures: a collection of elements or values, each identified by an array index (or key). The index always starts at zero. Because of the index, random access is possible using the notation array_name[index]. Random access has O(1) time complexity. You can have two-dimensional arrays (an array of arrays), and these have two indices, one for the rows and another for the columns. These are accessed as array_name[row_index][col_index]. Usually the items of an array are the same type, but they don’t have to be in Python. A dynamic array is an array whose size changes dynamically. Applications: lookup tables / hashtables, heaps.
The main disadvantage of an array is that you have to know its size beforehand. This is true in languages like Java, C#, and C++. If one wants to make an existing array larger, one has to copy the existing array into a new, larger array. This copying has O(N) complexity. In Python this is done for you, so the list seems dynamic. We can solve the resizing issue with the help of linked lists.
Array Operations:
- Add(item): we can add values to the array as long as the array is not full. We assume add (or push) is a function of the array abstract data type (see the difference between data structures and abstract data types above) that takes a parameter of the type the array supports. Add HAS to add at the end of the array: if it tried to add at the beginning, it would have to move all the items forward one slot, reconstructing the whole array. Adding at the front is O(N); adding at the back is O(1) (when you keep track of the current empty index).
- Insert(item, index): inserting into an array means moving the item at the target index, and all items beyond it, one slot over, then placing the new item at the desired index. This requires reconstructing the array, so it has a time complexity of O(N). Inserting at the back is called adding and is covered above.
- Remove(index): removing the last item in the array is O(1). If we want to remove an item at a random index, we have to reconstruct the array, so it is O(N).
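As a quick sketch, the three operations above map directly onto Python’s built-in list (which is a dynamic array under the hood):

```python
# A sketch of the array operations above using a Python list.
# append/pop at the back are O(1) (amortized); insert/pop at the
# front force every element to shift one slot, so they are O(N).
array = [10, 20, 30]

array.append(40)       # add at the back: O(1)
array.insert(0, 5)     # insert at the front: shifts everything, O(N)
array.pop()            # remove the last item: O(1)
array.pop(2)           # remove at a random index: shifts the tail, O(N)

print(array)           # [5, 10, 30]
```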
Linked Lists
Linked lists consist of nodes connected to each other in either one direction (self.next = next_node_obj) or, in the case of doubly linked lists, two directions (self.next = next_node_obj; self.prev = prev_node_obj). The last node’s next reference is NULL.
The node of a linked list looks like this:
The node stores data, either a variable or another data structure/type, and a reference (pointer) to another node. A prevNode reference can be added to make this a two-way list. Linked lists can be used to build common data structures, such as stacks and queues.
Simple linked lists do not allow for random access, one has to traverse, or search, the entire list to find the desired value (O(N) time). Basic operations like finding the last node, or finding a specific value in the list, or finding the place in the list where a node should be placed or removed, require a partial, or complete, scan of the list (O(N) time).
Advantages of linked lists: they are dynamic data structures and can change their size (add more nodes) dynamically in O(1); arrays cannot (O(N) to insert or remove an item at the beginning of an array). They can allocate needed memory at run time (O(1)); arrays cannot (O(N) to allocate memory for a bigger array and move the existing array into it). They are very efficient if we only want to manipulate the first element (O(1)), so stacks and queues use linked lists as the underlying data structure (queues, being FIFO, need an extra variable tracking the last element of the linked list with a pointer). They are easy to implement and can store items of different sizes or types (an array assumes the same size and type). It is easier for a linked list to grow organically, where an array must be recreated (O(N)) when it needs to grow.
Disadvantages of linked lists: every node (except the last one) carries a reference to the next node (and the previous node, in a doubly linked list), which costs extra memory. Accessing a specific node takes O(N) time, where an array allows random access by index (O(1)). It is hard to traverse a singly linked list backwards (possible, but awkward), so it takes even more memory to do so (just make a doubly linked list, IMO). [FIND OUT HOW TO TRAVERSE A SINGLY LINKED LIST BACKWARDS]
Linked List Operations
Insertion: inserting items at the beginning of a linked list is very simple. Create the node (O(1)), point the new node’s “next_node” reference at the existing “first_node” (O(1)), then point “first_node” at the new node (O(1)).
linked_list.insert_at_start(23)
You can do the same thing for insertion at the back; you just need another variable that points to the last node. Without this pointer, finding the last node takes O(N), as you have to traverse the whole list.
linked_list.insert_at_back(32)
If you are not adding an extra pointer for the “first” and “last” node, then things change a bit. You can still add in the front in O(1), but adding to the back would take a full list traversal, so would take O(N) time. Just add the two extra pointers and these problems go away (at the expense of the extra memory).
Remove: removing an item in the list. Removing the first item just means reference changes (O(1)). Removing an item in the middle or end of the list requires a full traversal (O(N)), plus reference changes (O(1)). Remember to clean up the memory used by the removed node.
linked_list.remove(10)
Get: access a value at a certain index in the list. This takes O(N) time, as one has to traverse the list up to the given index.
linked_list.get(3)
Double Linked List
We can get from 4 to 25 because we just hop to the next nodes via the pointers (references) to them. But we cannot go from 25 to 4 (see how to move backwards through a singly linked list) because the references point in the opposite direction.
We solve this with doubly linked lists, where there are references to both the next node and the previous node (this causes more memory consumption, as there are more pointers/references).
Linked List vs Arrays: A Review
It should be said that to make insertion at the end of a linked list run in O(1) time, one needs to store a reference to the last item of the list. This takes up more memory (the trade-off between time complexity and memory complexity rears its head once again).
Here we see the concrete examples of the negative correlation between time and memory complexities
Implementing Linked List in Python
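A minimal sketch of a singly linked list supporting the operations used above (insert_at_start, insert_at_back, remove, get); this is my own illustrative version, not the course’s exact code. The tail pointer is what makes insert_at_back O(1), as discussed:

```python
class Node:
    """A node stores data plus a reference to the next node."""
    def __init__(self, data):
        self.data = data
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None   # the "first_node" pointer
        self.tail = None   # the "last_node" pointer makes insert_at_back O(1)

    def insert_at_start(self, data):   # O(1)
        node = Node(data)
        node.next = self.head
        self.head = node
        if self.tail is None:
            self.tail = node

    def insert_at_back(self, data):    # O(1) thanks to the tail pointer
        node = Node(data)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node

    def remove(self, data):            # O(N): may need a full traversal
        prev, cur = None, self.head
        while cur and cur.data != data:
            prev, cur = cur, cur.next
        if cur is None:
            return                     # value not found
        if prev is None:
            self.head = cur.next
        else:
            prev.next = cur.next
        if cur is self.tail:
            self.tail = prev

    def get(self, index):              # O(N): traverse up to the index
        cur = self.head
        for _ in range(index):
            cur = cur.next
        return cur.data

ll = LinkedList()
ll.insert_at_back(32)
ll.insert_at_start(23)
print(ll.get(0), ll.get(1))   # 23 32
```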
Arrays vs Linked List Again
Linked lists win in this round
Arrays win in this one, though
Disadvantages for both data structures
Linked List Applications
Doubly Linked Lists
To insert an item at the end without a reference to the tail, one must traverse the entire linked list: O(N).
Linked List Interview Questions:
Stacks
Stacks in Memory Management (stacks and heaps):
Stack memory has something to do with the stack data structure, but heap memory has nothing to do with the heap data structure.
Here we see three function calls, each nested within another. Each call gets its own context (stack frame) on the stack, holding the variables within that context. The House object created in function 3 is placed on the heap.
Once function 3 completes, the flow of control goes back to function 2, and function 3’s context is flushed off the stack.
There is now no reference to the object on the heap, so it is eligible for garbage collection.
The process continues with function 2; once it completes, its context is removed, and function 1 is left to finish the process.
Real World Applications of Stack:
Find some besides memory and execution stacks, lol
Queues
Interview Questions — Stacks and Queues:
Using recursion (the second stack is the execution stack)
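As a sketch of the “recursion as the second stack” idea: a queue built on a single explicit stack, where dequeue recurses so the execution (call) stack temporarily holds the items. The class and method names here are my own:

```python
class QueueViaStack:
    """Queue built on one explicit stack; dequeue uses recursion,
    so the execution (call) stack plays the role of the second stack."""
    def __init__(self):
        self.stack = []

    def enqueue(self, item):
        self.stack.append(item)

    def dequeue(self):
        top = self.stack.pop()
        if not self.stack:          # bottom of the stack = front of the queue
            return top
        front = self.dequeue()      # recurse down to the bottom element
        self.stack.append(top)      # rebuild the stack on the way back up
        return front

q = QueueViaStack()
q.enqueue(1); q.enqueue(2); q.enqueue(3)
print(q.dequeue())   # 1 (FIFO order)
```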
Binary Search Trees
This is a cyclic graph, where there are two paths to node D. This is not a tree
We can also define parent/child relationships. Shown in blue are all of node A’s children
The yellow shows all the values less than 12 (the root node), and the blue all the values larger than twelve. This is a property of binary search trees.
Height of a Tree:
A tree can have multiple layers, as some branches end and have no children. In a perfect binary tree, all the parent nodes have two children at every layer. A non-perfect binary tree can have branches of different depths.
The number of nodes is our N, just like the number of nodes in a linked list constitutes its N. Here we calculate the number of nodes in a binary search tree and see how the time complexity is derived.
If the tree is not balanced (balance depends in part on picking the correct root node, I think; see if there is an algorithm for this), then h != log N and the O(log N) search time no longer holds.
Here is an example of an imbalanced tree vs a balanced tree (note how picking 4 as the root node of the imbalanced tree would make it balanced; check what the algorithm is for balancing an imbalanced tree).
You can see that the imbalanced tree is basically a linked list and now has O(N) time for search. This is a large difference.
Binary Search Trees Commands
Insert: to insert, one just needs to start at the root node, compare whether the value being inserted is greater or smaller, and move left or right accordingly until an empty spot is found; the new node is attached there as a leaf.
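That insert walk as a sketch (the Node class and names are my own):

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None

def insert(root, data):
    """Walk left/right by comparison until an empty slot is found."""
    if root is None:
        return Node(data)          # attach the new node as a leaf
    if data < root.data:
        root.left = insert(root.left, data)
    else:
        root.right = insert(root.right, data)
    return root

root = None
for value in [12, 4, 20, 1, 8]:
    root = insert(root, value)
print(root.data, root.left.data, root.right.data)   # 12 4 20
```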
Max
Min
Delete
If you considered what this would look like if it was a sorted array, the successor of the node you want to remove (in this case the root node) is the node to the right of that node in the imaginary array. In the tree structure, it is the smallest value to the right of the node you want to remove.
The predecessor of the node we want to remove (in this case the root node) is the value to the left of it, if we imagine the tree as a sorted array. In the tree, it is the largest value to the left of the node in question.
If we swap the predecessor with the node we want to delete, then we still have a binary search tree:
All we have to do is remove the leaf node (which is easier) and then we have removed the node in question.
Binary Search Tree Traversal
Pre order traversal is top down (left to right)
Post order traversal is bottom up (left to right, starting from the deepest leaves).
In order traversal yields the sorted order as a treat!
Max is the node furthest to the right and min is the node furthest to the left.
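A sketch of the three traversals on a tree with values 1, 4, 8, 12, 16, 20, 27 (names are my own); note that in-order comes back sorted:

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def pre_order(node):
    """Node first, then children: top-down, left to right."""
    if node is None:
        return []
    return [node.data] + pre_order(node.left) + pre_order(node.right)

def in_order(node):
    """Left, node, right: yields the sorted order for a BST."""
    if node is None:
        return []
    return in_order(node.left) + [node.data] + in_order(node.right)

def post_order(node):
    """Children first, then node: bottom-up."""
    if node is None:
        return []
    return post_order(node.left) + post_order(node.right) + [node.data]

# Root 12; left subtree 4 (children 1, 8); right subtree 20 (children 16, 27).
root = Node(12, Node(4, Node(1), Node(8)), Node(20, Node(16), Node(27)))
print(in_order(root))    # [1, 4, 8, 12, 16, 20, 27]
```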
Stack Memory Visualization of Getting Maximum Value Recursively:
Below we see the binary search tree we are going to traverse and a representation of the stack memory
Here we see what happens when we have called the max function recursively 3 times and reach the base-case return statement for the first time (active line in green):
Here we see the base-case value get passed back up through the execution stack (like using an explicit stack)
Stack Memory Visualization of In Order Traversal Recursive:
The tree, code and stack illustration before we traverse
The traverse function starts at the root and moves left until it reaches the left most node (data: 1). Below is how the stack looks when it reaches this node and node.left == None, so it moves to the print statement and prints the data
After 1 has been printed, it goes to the next line of code, which looks for a right child. It doesn’t find one, so 1 is popped off the stack, and we move back up to the function context of node 4. In node 4 the recursive calls for the left-node check are all finished, so it goes on to print node.data, printing 4. Now we see 1 4 on the console.
After printing 4, we then move to node 4’s right child check and we call traverse recursively on node 8
Node 8 is a leaf node, so the only effect of calling traverse on it is to print its data; now we see 1 4 8 on the console. The end of the function for node 8 pops it from the stack, and we then reach the end of the function for its parent. Since node 8 is a right child, and right children are checked last, this means the end of the function for node 4, which is also popped off the stack. We are now finished with the left-child recursive calls for the root node and can print its value. 1 4 8 12 is now on the console.
After 12 is printed the right child checks for the root node begin and 20 is placed on the stack
The left children of 20 are checked and placed on the stack
Node 16 is a leaf node, so 16 is printed (1 4 8 12 16 on console) and we move back up to 20, which is done with its left child checks and now prints out 20 (1 4 8 12 16 20) and moves to its right child nodes, places 27 on the stack, 27 is printed (1 4 8 12 16 20 27) and the right child recursive calls all finish up and pop the corresponding values off the stack.
Remove Implementation
For a node with two children
We have to find the largest item in the left subtree or the smallest item in the right subtree. Once we have picked one, we swap that node with the node we want to delete, then delete it.
We will choose the predecessor when it is a leaf node, since we know how to delete a leaf node. A predecessor can have a left child, but never a right child. That case is taken care of if we call remove_node recursively, using the predecessor as the starting node.
Real World Applications of Trees
Binary Tree Interview Questions:
Here we have two binary search trees that have the same values, but not the same topology
Here are two trees with the same values and topology
The implementation is as follows. Test with the above examples and post here
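One way to implement the comparison, as a sketch (names are my own): traverse both trees in lockstep, so two trees only match when both the values and the topology agree:

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def same_tree(a, b):
    """True only if both trees have the same values AND the same topology."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False                   # shapes diverge here
    return (a.data == b.data
            and same_tree(a.left, b.left)
            and same_tree(a.right, b.right))

# Same values (1, 2, 3) but different topology:
t1 = Node(2, Node(1), Node(3))
t2 = Node(1, None, Node(2, None, Node(3)))
print(same_tree(t1, t2))                           # False
print(same_tree(t1, Node(2, Node(1), Node(3))))    # True
```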
AVL Trees
Motivation
When considering searching, for arrays and linked lists we have linear search which is O(N). A binary search tree greatly improves the best case
But at worst case, when the tree is unbalanced, the time complexity mimics that of an array or linked list (sorted arrays can utilize binary search, which is O(logN))
We introduce another parameter called the height of the tree, and we want this parameter to follow this equation: h = log N. If the tree is imbalanced h will not be equal to this. We see the worst case scenario of an imbalanced tree is h = N.
This height parameter will allow us to create AVL trees, which are trees that are self balancing and ensure that insert/remove/search algos have a time complexity of O(logN).
What Are They
AVL trees, named after inventors Adelson-Velsky and Landis, are self-balancing binary search trees. The heights (longest path from a given node down to a leaf node) of a node’s two child subtrees can differ by at most one; if this property is violated, rebalancing is done.
AVL Trees are rigidly balanced and guarantee a O(logN) time complexity. Red Black Trees (future concept) are faster to construct.
The operations for AVL trees are exactly the same as for binary search trees, but after each operation we must check whether the tree has become imbalanced.
With the help of this thing called rotations, we can achieve the following
This is a marked improvement to unbalanced binary search trees, unsorted arrays, and linked lists.
Height
The biggest problem with binary search trees is that they might become imbalanced, and we lose the math behind O(log N) time complexity. To keep it, we must ensure h = log N. We define height as follows (edges are the connections between nodes).
If we calculate the height for the example binary tree, we see this
The height of a null (None) node is -1, so leaf nodes have a height of 0, because max(-1, -1) + 1 = 0. We can see the resulting heights when calculated for each level up to the root node in this fashion. We determine balance by checking whether the difference between the left subtree’s height and the right subtree’s height is greater than 1. In this case the left subtree height is 1 and the right subtree height is 1, so this tree is considered balanced.
Here we see that the height parameter will determine if a tree is balanced or not
We perform the same operations as a binary search tree, but also calculate the height and the difference in height between the left subtree and the right subtree. We call this difference the balance factor. If the balance factor is less than zero, the tree is right heavy; if it is greater than zero, the tree is left heavy. In right-heavy situations we have to make a left rotation, and in left-heavy situations a right rotation.
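The height and balance factor calculations as a sketch (assuming, as defined above, that a missing child has height -1):

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def height(node):
    """Height of a missing (None) node is -1; a leaf is therefore 0."""
    if node is None:
        return -1
    return max(height(node.left), height(node.right)) + 1

def balance_factor(node):
    """height(left) - height(right): greater than 1 means left heavy,
    less than -1 means right heavy, and a rotation is needed."""
    return height(node.left) - height(node.right)

# A left-heavy chain: 3 -> 2 -> 1
root = Node(3, Node(2, Node(1)))
print(height(root), balance_factor(root))   # 2 2
```

(A real AVL tree would cache each node’s height instead of recomputing it, so updates stay O(log N).)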
Intro to Rotations
Depending on the sign of the balance factor we make either a left rotation or a right rotation
Rotation illustrated
When we rotate to the right, we see the root node D change to node B. D becomes the right child of B, and the previous right child of B, node C, becomes the left child of D. In a left rotation, node B is the root node and node D is the new root node. The left child of D becomes B, and the old left child of D, node C, becomes the right child of B.
The progression of a right rotation
The rotations should not change the fundamental properties of the tree. In order traversal, for example, should return the same thing, an ordered list of the tree’s nodes.
The progression of a left rotation
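The two rotations as a sketch (node names follow the B/C/D description above; the final check confirms that in-order traversal is unchanged by the rotation):

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def rotate_right(d):
    """B (the left child) becomes the new root; D inherits B's old
    right child C as its new left child."""
    b = d.left
    d.left = b.right     # C moves under D
    b.right = d          # D becomes B's right child
    return b             # B is the new subtree root

def rotate_left(b):
    """Mirror image: D (the right child) becomes the new root."""
    d = b.right
    b.right = d.left
    d.left = b
    return d

def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.data] + in_order(node.right)

# A left-heavy tree rooted at D, with left child B.
tree = Node('D', Node('B', Node('A'), Node('C')), Node('E'))
before = in_order(tree)
tree = rotate_right(tree)
print(tree.data, in_order(tree) == before)   # B True
```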
Rotation Examples
Balance factor is (1 - (-1)) = 2, which is positive, as we expect from a left-heavy tree. A right rotation is needed and we get this:
In this example the balance factor is (-1 - 1) = -2, which is negative, as we expect from a right-heavy tree. A left rotation is needed and, when that is applied, we get
This is the so called left/right heavy situation
In terms of node D, the tree is left heavy, and in terms of node B, it is right heavy. How do we balance this? First we have to make a left rotation on node B, which produces this
Now we have just a left heavy tree, and we need to make a right rotation to finally balance it out
Here we have the opposite of the above: a right/left heavy tree
Right rotation on F produces
Then a left rotation on B
A rotation itself should be O(1) and no more, but a rotation can cause balance issues further up, so every vertex on the path to the root node needs to be checked. This check takes O(log N), as there are log N vertices on that path in a balanced tree.
AVL Trees Illustrative Introduction
Example tree
The first operation we will consider is insert, and we will now insert 12
Now we must figure out if we need to balance the tree or not. We calculate the height parameter for each node on the path up from the node that was just inserted.
We see that node 19 is the root of a left-heavy subtree. We calculate the balance factor to be 1 - (-1) = 2, which is positive, as we expect from left-heavy trees, so we know we need a right rotation.
Here is the new tree with the new height calculations. We then do our check up the levels (this check is why the time complexity is O(log N)). We need to do this because any rotation can affect the balance of the tree further up; going up the chain of vertices corrects this.
Implementation
Practical Applications of AVL Trees
Abstract Syntax Trees use this (the Python interpreter and JavaScript engines use ASTs)
This is not an in-place algorithm: it needs extra memory to store the nodes of the tree. The memory complexity is O(N), as we are storing all the nodes. Quicksort can achieve the same time complexity as an AVL sort without the need for extra memory.
What is a B tree. Look it up!
Red Black Trees
AVL Trees base their calculations on height, and from this the balance factor is calculated and from this a determination of if and how to rotate the nodes. Red Black Trees use a completely different set of parameters to determine balance.
The root node is always Black (CAUSE YOU ALREADY KNOW), as are the null pointers
If the properties of the red-black tree are violated, we can do one of two things: 1) recolor the nodes (make sure the root node is black and the null nodes are black, and work from there), or 2) rotate the nodes until the properties are met.
Look at AVL tree rotations to see how this works. We will see the implementation soon, but the rotations are basically the same as AVL trees
The Logic Behind Red Black Trees
The shortest path in a red-black tree contains only black nodes. This is an example of a valid red-black tree; if we say m is the number of black nodes, then the shortest path has length m.
This, also valid, red black tree shows that the longest path is a path with red nodes interspersed between the black. Remember that null nodes are black, so each red node has two black children, and all properties are respected.
This shows that the longest path is when there are red nodes interspersed between the black nodes, giving a length of roughly 2m (where m is the number of black nodes). Thus no path is more than twice as long as any other in a tree. This satisfies the standards of a balanced tree (or at least it is more balanced than nothing at all, lol. It’s actually pretty good). This is enough to guarantee O(log N)
Red Black Trees: Recoloring and Rotation
Case 1: Insert Red, Uncle is Red
Opposite Case
Case 2: Insert Red, Uncle is Black
Opposite Case
Case 3
Right rotation on the grandparent of the inserted node
Recoloring, and a check up to the root node in case anything else has to be done
Opposite Case
Case 4: Red Child, of a Red Child of a Black Parent, with a Red Uncle
Recoloring takes O(1), as do rotations.
Red Black Trees Illustrated
insert(1)
All new nodes are red, and then recolored if necessary. So the first node is inserted red, but because it is the root node, it is recolored to black.
insert(2)
Rotation on the grandparent
Recoloring
insert(4)
insert(5)
Example 2
insert(12)
Differences Between AVL Trees and Red Black Trees
Heaps
What are Priority Queues
Heap Introduction — Basics
Heaps are trees where the max (or min) item is the root node, so finding the max in a max heap or the min in a min heap takes O(1) time.
Usually for AVL trees, finding the max or min value (going all the way to the end of the rightmost or leftmost branch) takes O(log N) time.
The heap (tree) must be balanced; because heaps are complete, they cannot become imbalanced.
Complete here means we add values from left to right. Root first, then root.left, then root.right, then root.left.left then root.left.right, then root.right.left, then root.right.right, and so on, as so
This is an invalid insertion according to the completeness property rule
Valid Maximum Heap as each parent is greater than its children. It doesn’t matter that the left child node is larger than the right, only that the parents are greater than the children
Minimum Heap
Heap Introduction — Array Representation
First we place indices on the nodes in the heap
Here we follow the heap rules and count from left to right starting with the root node. Here is how we represent a heap with a one dimensional array
The formulas left_child_index = 2 × [node index] + 1 and right_child_index = 2 × [node index] + 2 give us the ability to calculate the indices of the left and right children from the index of the node in question. We can find the parent node with the inverse formula ([node index] - 1) // 2 (the integer division rounds the value down to the proper int index).
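The index arithmetic as a sketch:

```python
# Index arithmetic for a heap stored in a 0-indexed array:
# children at 2i+1 and 2i+2, parent at (i-1)//2.
def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * i + 2

def parent(i):
    return (i - 1) // 2   # integer division rounds down to the right index

heap = [92, 78, 23, 2, 5]    # root at index 0
i = 1                        # the node holding 78
print(heap[parent(i)], heap[left_child(i)], heap[right_child(i)])  # 92 2 5
```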
Step by Step Guide to Building a Maximum Heap
insert(23)
insert(5)
After each insertion we have to check whether the heap property rules are violated or not (if so, swaps are in order, as we will see).
insert(78)
Here we see the heap property rules are violated, as 78 is larger than its parent. So in order to fix that, we have to swap the two.
insert(2)
insert(92)
This insertion violates the heap property rules: 92 is larger than its parent and grandparent. So we swap.
Then we have to swap again (notice how the array changes compared to the abstraction of the tree). It is interesting how the property rules of trees (AVL vs Red-Black vs heaps) define the functions executed when the rules are violated. For example, in a heap all that is required is the left-to-right insertion and the fact that parent nodes must be greater in value than their children. Enforcing this requires only a simple swap operation when the child is larger than the parent. Compare how the rules define the violation behaviors in AVL and Red-Black trees.
This swap went all the way to the root node, so we know it is a valid heap data structure.
insert(12)
Valid heap, move on
insert(21)
Valid heap, so move on
insert(99)
Heap property rules violated, swap!
then swap again
then check and swap again, all the way up to the root node, ensuring the heap properties are respected
And now we have a valid heap. Why do we like heaps?
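The insert-and-swap-up procedure from the walkthrough above, as a sketch (sift-up is the usual name for the swap-up step):

```python
def insert(heap, item):
    """Append at the next left-to-right slot, then swap the item up
    while it is larger than its parent (sift-up)."""
    heap.append(item)
    i = len(heap) - 1
    while i > 0:
        p = (i - 1) // 2                # parent index
        if heap[i] <= heap[p]:
            break                       # heap property holds, done
        heap[i], heap[p] = heap[p], heap[i]
        i = p

heap = []
for value in [23, 5, 78, 2, 92, 12, 21, 99]:
    insert(heap, value)
print(heap[0])   # 99, the max bubbled all the way up to the root
```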
Time complexity of python max/min functions and the comparison to heaps:
Heap Introduction — Remove Operation
Heaps are built for max/min operations, as these happen in O(log N) time (rebuilding the heap after removing the root node is, I am guessing, the bottleneck). Removing or finding an arbitrary item takes O(N) time. AVL and Red-Black trees have better time complexity for the remove and find operations.
remove(12) # from heap
removing 12 produces a hole in the array data structure
So we move up the values to fill the gap, and check through to the root node that the heap property rules are followed
So removing takes O(N) time, because we have to do a linear search to find the item (O(N)), then remove it and move up the tree to ensure the heap rules are followed (O(log N)).
How do we get rid of the root node?
First we remove it and swap it with the last item in the array
Then we check that it follows heap rules with the heapify function. We check to see if the root’s children are greater than the root and if so, swap the root with its largest child.
We now check again to see if the heap property rules are followed and see that they are not. We then swap the node with its largest child and we are done!
To reiterate, removing the root node takes O(log N) and removing an arbitrary node takes O(N)
Heapsort
How to sort a heap, step by step. Say we have this heap:
First we know the top node is the largest node, so we 1) record it in the sorted output (O(1)), 2) swap it with the last (far right) node (O(1), since in the array representation the last node is simply the last index), 3) consider the old root, now at the far right, sorted, and finally 4) run heapify to restore the heap property rules (O(log N)). Repeating this for all N nodes gives a total run time complexity of O(N log N).
run heapify, and see that swapping it with its largest child does the trick
Now repeat with the new root node
run heapify, and swap with the greatest child
Repeat the process with the new root node
heapify, swap with greatest child
Repeat with the new root:
heapify, swap with greatest child
repeat
heapify
We see that if we keep taking the root node, swapping it with the last item in the array (filling the array from the right, the inverse of the left-to-right insert order), and heapifying the tree, we get the sorted order. This happens in O(N log N) time no matter what.
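That loop as a sketch (heapify here is the sift-down step; the input list is assumed to already satisfy the max-heap property):

```python
def heapify(heap, size, i):
    """Sift the value at index i down until both children are smaller."""
    while True:
        largest = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < size and heap[child] > heap[largest]:
                largest = child
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def heap_sort(heap):
    """Repeatedly swap the root (the max) with the last unsorted slot,
    shrink the heap by one, and heapify: O(N log N) overall."""
    for end in range(len(heap) - 1, 0, -1):
        heap[0], heap[end] = heap[end], heap[0]
        heapify(heap, end, 0)
    return heap

# The input below is already a valid max heap.
print(heap_sort([99, 92, 78, 23, 21, 12, 5, 2]))
# [2, 5, 12, 21, 23, 78, 92, 99]
```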
Running Time Complexities of Heap Operations
The key difference between heaps and binary search trees (and AVLs) is that finding the min/max on binary search trees and AVLs takes O(log N) time complexity, but in heaps it is constant time (O(1)).
Binomial and Fibonacci Heaps
Look up how Binomial Heaps and Fibonacci Heaps are implemented
Implementation of Binary Heap
Heaps In Python
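Python ships a heap in the standard library: heapq, which implements a min heap on top of a plain list. A small sketch:

```python
import heapq

# heapq maintains the MIN-heap property on an ordinary Python list.
data = [23, 5, 78, 2, 92, 12]
heapq.heapify(data)           # O(N): rearrange the list in place into a heap
print(data[0])                # 2, the minimum sits at the root

heapq.heappush(data, 1)       # O(log N) insert
print(heapq.heappop(data))    # 1, pop always returns the smallest item

# For a max heap, a common trick is to push negated values.
```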
Heap Interview Questions