Algorithms and Data Structures — Part I
Data Structures
Data structures exist to handle large amounts of data efficiently for different uses. Avoiding nested for loops, writing fewer operations, and making every calculation as fast as possible are all things to consider when designing an algorithm, but algorithms can also be boosted, quite considerably, by choosing the proper data structure. We must consider data structures to make our programs faster.
Bad programmers worry about the code. Good programmers worry about data structures and their relationships. — Linus Torvalds
Dijkstra’s algorithm would be O(N²) without the proper data structure (heap / priority queue). With it, the time complexity is O(N log N). There is a trade-off between run time complexity and memory complexity: we choose one at the expense of the other. We can use data structures (and more memory) to boost an algorithm’s run time speed. The same holds for the spanning tree algorithms (*research this*).
Abstract data types (interfaces, etc.) define a model (logical description) for the data structure. Functions, like push() and pop(), that determine the basic behavior of the data structure are defined here. They do not specify the implementation, just the interface. Data structures themselves define the concrete implementation, or actual representation, of the data. The aim is to store and retrieve data in an efficient manner: we want to insert, find, and/or remove items in O(1) time.
Arrays
Arrays are data structures: a collection of elements or values, each identified by an array index (or key). The index always starts at zero. Because of the index, random access is possible using the notation array_name[index]. Random access has O(1) time complexity. You can have two-dimensional arrays (an array of arrays), and these have two indices, one for the rows and another for the columns. These are accessed as array_name[row_index][col_index]. Usually the items of an array are the same type, but they don’t have to be in Python. A dynamic array is an array whose size changes dynamically. Applications: lookup tables / hashtables, heaps.
The main disadvantage of an array is that you have to know its size beforehand. This is true in languages like Java, C#, and C++. If one wants to make an existing array larger, one has to copy the existing array into a new, larger array. This copying has O(N) complexity. In Python this is done for you, so the list seems dynamic. We can solve the resizing issue with the help of linked lists.
Array Operations:
- Add(item): we can add values to the array as long as the array is not full. We assume add (or push) is a function of the array abstract data type (see the difference between data structures and abstract data types above) that takes a parameter of the type the array supports. Add HAS to add at the end of the array: if it tried to add at the beginning, it would have to move all the items forward one slot, reconstructing the whole array. Adding at the front is O(N); adding at the back is O(1) (when you keep track of the current empty index).
- Insert(item, index): inserting into an array means moving the item at the target index, and all items beyond it, one slot over, then placing the new item at the desired index. This requires reconstructing the array, so it has a time complexity of O(N). Inserting at the back is called adding and is covered above.
- Remove(index): removing the last item in the array is O(1). If we want to remove an item at a random index, we have to reconstruct the array, so it is O(N).
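As a quick sketch, the three operations above map directly onto Python’s built-in list (which is a dynamic array under the hood):

```python
# A sketch of the array operations above using a Python list.
# append/pop at the back are O(1) (amortized); insert/pop at the
# front force every element to shift one slot, so they are O(N).
array = [10, 20, 30]

array.append(40)       # add at the back: O(1)
array.insert(0, 5)     # insert at the front: shifts everything, O(N)
array.pop()            # remove the last item: O(1)
array.pop(2)           # remove at a random index: shifts the tail, O(N)

print(array)           # [5, 10, 30]
```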
Linked Lists
Linked lists consist of nodes connected to each other in either one direction (self.next = next_node_obj) or, in the case of doubly linked lists, two directions (self.next = next_node_obj; self.prev = prev_node_obj). The last node’s next reference is NULL.
The node of a linked list looks like this:
The node stores data, either a variable or another data structure/type, and a reference (pointer) to another node. A prevNode reference can be added to make this a two-way list. Linked lists can be used to build common data structures, such as stacks and queues.
Simple linked lists do not allow for random access, one has to traverse, or search, the entire list to find the desired value (O(N) time). Basic operations like finding the last node, or finding a specific value in the list, or finding the place in the list where a node should be placed or removed, require a partial, or complete, scan of the list (O(N) time).
Advantages of linked lists: they are dynamic data structures and can change their size (add more nodes) dynamically in O(1); arrays cannot (O(N) to insert or remove an item at the beginning of an array). They can allocate needed memory at run time (O(1)); arrays cannot (O(N) to allocate memory for a bigger array and move the existing array into it). They are very efficient if we only want to manipulate the first element (O(1)), so stacks and queues use linked lists as the underlying data structure (queues, being FIFO, need an extra variable tracking the last element of the linked list with a pointer). They are easy to implement and can store items of different sizes or types (an array assumes the same size and type). It is easier for a linked list to grow organically, where an array must be recreated (O(N)) when it needs to grow.
Disadvantages of linked lists: every node (except the last one) carries a reference to the next node (and the previous node, in a doubly linked list), which costs extra memory. Accessing a specific node takes O(N) time, where an array allows random access by index (O(1)). It is hard to traverse a singly linked list backwards (possible, but awkward), so it takes even more memory to do so (just make a doubly linked list, IMO). [FIND OUT HOW TO TRAVERSE A SINGLY LINKED LIST BACKWARDS]
Linked List Operations
Insertion: inserting items at the beginning of a linked list is very simple. Create the node (O(1)), point the new node’s “next_node” reference at the existing “first_node” (O(1)), then point “first_node” at the new node (O(1)).
linked_list.insert_at_start(23)
You can do the same thing for insertion at the back; you just need another variable that points to the last node. Without this pointer, finding the last node takes O(N), as you have to traverse the whole list.
linked_list.insert_at_back(32)
If you are not adding an extra pointer for the “first” and “last” node, then things change a bit. You can still add in the front in O(1), but adding to the back would take a full list traversal, so would take O(N) time. Just add the two extra pointers and these problems go away (at the expense of the extra memory).
Remove: removing an item in the list. Removing the first item just means reference changes (O(1)). Removing an item in the middle or end of the list requires a full traversal (O(N)), plus reference changes (O(1)). Remember to clean up the memory used by the removed node.
linked_list.remove(10)
Get: access a value at a certain index in the list. This takes O(N) time, as one has to traverse the list up to the given index.
linked_list.get(3)
Double Linked List
We can get from 4 to 25 because we just hop to the next nodes via the pointers (references) to them. But we cannot go from 25 to 4 (see how to move backwards through a singly linked list) because the references point in the opposite direction.
We solve this with doubly linked lists, where there are references to both the next node and the previous node (this causes more memory consumption, as there are more pointers/references).
Linked List vs Arrays: A Review
It should be said that to make insertion at the end of a linked list run in O(1) time, one needs to store a reference to the last item of the list. This takes up more memory (the trade-off between time complexity and memory complexity rears its head once again).
Here we see the concrete examples of the negative correlation between time and memory complexities
Implementing Linked List in Python
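A minimal sketch of a singly linked list supporting the operations used above (insert_at_start, insert_at_back, remove, get); this is my own illustrative version, not the course’s exact code. The tail pointer is what makes insert_at_back O(1), as discussed:

```python
class Node:
    """A node stores data plus a reference to the next node."""
    def __init__(self, data):
        self.data = data
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None   # the "first_node" pointer
        self.tail = None   # the "last_node" pointer makes insert_at_back O(1)

    def insert_at_start(self, data):   # O(1)
        node = Node(data)
        node.next = self.head
        self.head = node
        if self.tail is None:
            self.tail = node

    def insert_at_back(self, data):    # O(1) thanks to the tail pointer
        node = Node(data)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node

    def remove(self, data):            # O(N): may need a full traversal
        prev, cur = None, self.head
        while cur and cur.data != data:
            prev, cur = cur, cur.next
        if cur is None:
            return                     # value not found
        if prev is None:
            self.head = cur.next
        else:
            prev.next = cur.next
        if cur is self.tail:
            self.tail = prev

    def get(self, index):              # O(N): traverse up to the index
        cur = self.head
        for _ in range(index):
            cur = cur.next
        return cur.data

ll = LinkedList()
ll.insert_at_back(32)
ll.insert_at_start(23)
print(ll.get(0), ll.get(1))   # 23 32
```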
Arrays vs Linked List Again
Linked lists win in this round
Arrays win in this one, though
Disadvantages for both data structures
Linked List Applications
Doubly Linked Lists
To insert an item at the end without a reference to the tail, one must traverse the entire linked list: O(N).
Linked List Interview Questions:
Stacks
Stacks in Memory Management (stacks and heaps):
Stack memory has something to do with the stack data structure, but heap memory has nothing to do with the heap data structure.
Here we see three function calls, each nested within another. Each call gets its own context (stack frame) on the stack, holding the variables within that context. The House object created in function 3 is placed on the heap.
Once function 3 completes, the flow of control goes back to function 2, and function 3’s context is flushed off the stack.
There is now no reference to the object on the heap, so it is eligible for garbage collection.
The process continues with function 2; once it completes, its context is removed, and function 1 is left to finish the process.
Real World Applications of Stack:
Find some besides memory and execution stacks, lol
Queues
Interview Questions — Stacks and Queues:
Using recursion (the second stack is the execution stack)
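As a sketch of the “recursion as the second stack” idea: a queue built on a single explicit stack, where dequeue recurses so the execution (call) stack temporarily holds the items. The class and method names here are my own:

```python
class QueueViaStack:
    """Queue built on one explicit stack; dequeue uses recursion,
    so the execution (call) stack plays the role of the second stack."""
    def __init__(self):
        self.stack = []

    def enqueue(self, item):
        self.stack.append(item)

    def dequeue(self):
        top = self.stack.pop()
        if not self.stack:          # bottom of the stack = front of the queue
            return top
        front = self.dequeue()      # recurse down to the bottom element
        self.stack.append(top)      # rebuild the stack on the way back up
        return front

q = QueueViaStack()
q.enqueue(1); q.enqueue(2); q.enqueue(3)
print(q.dequeue())   # 1 (FIFO order)
```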
Binary Search Trees
This is a cyclic graph, where there are two paths to node D. This is not a tree
We can also define parent/child relationships. Shown in blue are all of node A’s children
The yellow shows all the values less than 12 (the root node), and the blue all the values larger than twelve. This is a property of binary search trees.
Height of a Tree:
A tree can have multiple layers, as some branches end and have no children. In a perfect binary tree, all the parent nodes have two children at every layer. A non-perfect binary tree can have branches of different depths.
The number of nodes is our N, just like the number of nodes in a linked list constitutes its N. Here we calculate the number of nodes in a binary search tree and see how the time complexity is derived.
If the tree is not balanced (balance depends in part on picking the correct root node, I think; see if there is an algorithm for this), then h != log N and the O(log N) search time no longer holds.
Here is an example of an imbalanced tree vs a balanced tree (note how picking 4 as the root node of the imbalanced tree would make it balanced; check what the algorithm is for balancing an imbalanced tree).
You can see that the imbalanced tree is basically a linked list and now has O(N) time for search. This is a large difference.
Binary Search Trees Commands
Insert: to insert, one just needs to start at the root node, compare whether the value being inserted is greater or smaller, and move left or right accordingly until an empty spot is found; the new node is attached there as a leaf.
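That insert walk as a sketch (the Node class and names are my own):

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None

def insert(root, data):
    """Walk left/right by comparison until an empty slot is found."""
    if root is None:
        return Node(data)          # attach the new node as a leaf
    if data < root.data:
        root.left = insert(root.left, data)
    else:
        root.right = insert(root.right, data)
    return root

root = None
for value in [12, 4, 20, 1, 8]:
    root = insert(root, value)
print(root.data, root.left.data, root.right.data)   # 12 4 20
```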
Max
Min
Delete
If you considered what this would look like if it was a sorted array, the successor of the node you want to remove (in this case the root node) is the node to the right of that node in the imaginary array. In the tree structure, it is the smallest value to the right of the node you want to remove.
The predecessor of the node we want to remove (in this case the root node) is the value to the left of it, if we imagine the tree as a sorted array. In the tree, it is the largest value to the left of the node in question.
If we swap the predecessor with the node we want to delete, then we still have a binary search tree:
All we have to do is remove the leaf node (which is easier) and then we have removed the node in question.
Binary Search Tree Traversal
Pre order traversal is top down (left to right)
Post order traversal is bottom up (left to right, starting from the deepest leaves).
In order traversal yields the sorted order as a treat!
Max is the node furthest to the right and min is the node furthest to the left.
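A sketch of the three traversals on a tree with values 1, 4, 8, 12, 16, 20, 27 (names are my own); note that in-order comes back sorted:

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def pre_order(node):
    """Node first, then children: top-down, left to right."""
    if node is None:
        return []
    return [node.data] + pre_order(node.left) + pre_order(node.right)

def in_order(node):
    """Left, node, right: yields the sorted order for a BST."""
    if node is None:
        return []
    return in_order(node.left) + [node.data] + in_order(node.right)

def post_order(node):
    """Children first, then node: bottom-up."""
    if node is None:
        return []
    return post_order(node.left) + post_order(node.right) + [node.data]

# Root 12; left subtree 4 (children 1, 8); right subtree 20 (children 16, 27).
root = Node(12, Node(4, Node(1), Node(8)), Node(20, Node(16), Node(27)))
print(in_order(root))    # [1, 4, 8, 12, 16, 20, 27]
```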
Stack Memory Visualization of Getting Maximum Value Recursively:
Below we see the binary search tree we are going to traverse and a representation of the stack memory
Here we see what happens when we have called the max function recursively 3 times and reach the base-case return statement for the first time (active line in green):
Here we see the base-case value get passed back up through the execution stack (like using an explicit stack)
Stack Memory Visualization of In Order Traversal Recursive:
The tree, code and stack illustration before we traverse
The traverse function starts at the root and moves left until it reaches the left most node (data: 1). Below is how the stack looks when it reaches this node and node.left == None, so it moves to the print statement and prints the data
After 1 has been printed, it goes to the next line of code, which looks for a right child. It doesn’t find one, so 1 is popped off the stack, and we move back up to the function context of node 4. In node 4 the recursive calls for the left-node check are all finished, so it goes on to print node.data, printing 4. Now we see 1 4 on the console.
After printing 4, we then move to node 4’s right child check and we call traverse recursively on node 8
Node 8 is a leaf node, so the only effect of calling traverse on it is to print its data; now we see 1 4 8 on the console. The end of the function for node 8 pops it from the stack, and we then reach the end of the function for its parent. Since node 8 is a right child, and right children are checked last, this means the end of the function for node 4, which is also popped off the stack. We are now finished with the left-child recursive calls for the root node and can print its value. 1 4 8 12 is now on the console.
After 12 is printed the right child checks for the root node begin and 20 is placed on the stack
The left children of 20 are checked and placed on the stack
Node 16 is a leaf node, so 16 is printed (1 4 8 12 16 on console) and we move back up to 20, which is done with its left child checks and now prints out 20 (1 4 8 12 16 20) and moves to its right child nodes, places 27 on the stack, 27 is printed (1 4 8 12 16 20 27) and the right child recursive calls all finish up and pop the corresponding values off the stack.
Remove Implementation
For a node with two children
We have to find the largest item in the left subtree or the smallest item in the right subtree. Once we have picked one, we swap that node with the node we want to delete, then delete it.
We will choose the predecessor when it is a leaf node, since we know how to delete a leaf node. A predecessor can have a left child, but never a right child. That case is taken care of if we call remove_node recursively, using the predecessor as the starting node.
Real World Applications of Trees
Binary Tree Interview Questions:
Here we have two binary search trees that have the same values, but not the same topology
Here are two trees with the same values and topology
The implementation is as follows. Test with the above examples and post here
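One way to implement the comparison, as a sketch (names are my own): traverse both trees in lockstep, so two trees only match when both the values and the topology agree:

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def same_tree(a, b):
    """True only if both trees have the same values AND the same topology."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False                   # shapes diverge here
    return (a.data == b.data
            and same_tree(a.left, b.left)
            and same_tree(a.right, b.right))

# Same values (1, 2, 3) but different topology:
t1 = Node(2, Node(1), Node(3))
t2 = Node(1, None, Node(2, None, Node(3)))
print(same_tree(t1, t2))                           # False
print(same_tree(t1, Node(2, Node(1), Node(3))))    # True
```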
AVL Trees
Motivation
When considering searching, for arrays and linked lists we have linear search which is O(N). A binary search tree greatly improves the best case
But at worst case, when the tree is unbalanced, the time complexity mimics that of an array or linked list (sorted arrays can utilize binary search, which is O(logN))
We introduce another parameter called the height of the tree, and we want this parameter to follow this equation: h = log N. If the tree is imbalanced h will not be equal to this. We see the worst case scenario of an imbalanced tree is h = N.
This height parameter will allow us to create AVL trees, which are trees that are self balancing and ensure that insert/remove/search algos have a time complexity of O(logN).
What Are They
AVL trees, named after inventors Adelson-Velsky and Landis, are self-balancing binary search trees. The heights (longest path from a given node down to a leaf node) of a node’s two child subtrees can differ by at most one; if this property is violated, rebalancing is done.
AVL Trees are rigidly balanced and guarantee a O(logN) time complexity. Red Black Trees (future concept) are faster to construct.
The operations for AVL trees are exactly the same as for binary search trees, but after each operation we must check whether the tree has become imbalanced.
With the help of this thing called rotations, we can achieve the following
This is a marked improvement to unbalanced binary search trees, unsorted arrays, and linked lists.
Height
The biggest problem with binary search trees is that they might become imbalanced, and we lose the math behind O(log N) time complexity. To keep it, we must ensure h = log N. We define height as follows (edges are the connections between nodes).
If we calculate the height for the example binary tree, we see this
The height of a null (None) node is -1, so leaf nodes have a height of 0, because max(-1, -1) + 1 = 0. We can see the resulting heights when calculated for each level up to the root node in this fashion. We determine balance by checking whether the difference between the left subtree’s height and the right subtree’s height is greater than 1. In this case the left subtree height is 1 and the right subtree height is 1, so this tree is considered balanced.
Here we see that the height parameter will determine if a tree is balanced or not
We perform the same operations as a binary search tree, but also calculate the height and the difference in height between the left subtree and the right subtree. We call this difference the balance factor. If the balance factor is less than zero, the tree is right heavy; if it is greater than zero, the tree is left heavy. In right-heavy situations we have to make a left rotation, and in left-heavy situations a right rotation.
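The height and balance factor calculations as a sketch (assuming, as defined above, that a missing child has height -1):

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def height(node):
    """Height of a missing (None) node is -1; a leaf is therefore 0."""
    if node is None:
        return -1
    return max(height(node.left), height(node.right)) + 1

def balance_factor(node):
    """height(left) - height(right): greater than 1 means left heavy,
    less than -1 means right heavy, and a rotation is needed."""
    return height(node.left) - height(node.right)

# A left-heavy chain: 3 -> 2 -> 1
root = Node(3, Node(2, Node(1)))
print(height(root), balance_factor(root))   # 2 2
```

(A real AVL tree would cache each node’s height instead of recomputing it, so updates stay O(log N).)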
Intro to Rotations
Depending on the sign of the balance factor we make either a left rotation or a right rotation
Rotation illustrated
When we rotate to the right, we see the root node D change to node B. D becomes the right child of B, and the previous right child of B, node C, becomes the left child of D. In a left rotation, node B is the root node and node D is the new root node. The left child of D becomes B, and the old left child of D, node C, becomes the right child of B.
The progression of a right rotation
The rotations should not change the fundamental properties of the tree. In order traversal, for example, should return the same thing, an ordered list of the tree’s nodes.
The progression of a left rotation
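The two rotations as a sketch (node names follow the B/C/D description above; the final check confirms that in-order traversal is unchanged by the rotation):

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def rotate_right(d):
    """B (the left child) becomes the new root; D inherits B's old
    right child C as its new left child."""
    b = d.left
    d.left = b.right     # C moves under D
    b.right = d          # D becomes B's right child
    return b             # B is the new subtree root

def rotate_left(b):
    """Mirror image: D (the right child) becomes the new root."""
    d = b.right
    b.right = d.left
    d.left = b
    return d

def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.data] + in_order(node.right)

# A left-heavy tree rooted at D, with left child B.
tree = Node('D', Node('B', Node('A'), Node('C')), Node('E'))
before = in_order(tree)
tree = rotate_right(tree)
print(tree.data, in_order(tree) == before)   # B True
```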
Rotation Examples
Balance factor is (1 - (-1)) = 2, which is positive, as we expect from a left-heavy tree. A right rotation is needed and we get this:
In this example the balance factor is (-1 - 1) = -2, which is negative, as we expect from a right-heavy tree. A left rotation is needed and, when that is applied, we get
This is the so called left/right heavy situation
In terms of node D, the tree is left heavy, and in terms of node B, it is right heavy. How do we balance this? First we have to make a left rotation on node B, which produces this
Now we have just a left heavy tree, and we need to make a right rotation to finally balance it out
Here we have the opposite of the above: a right/left heavy tree
Right rotation on F produces
Then a left rotation on B
A rotation itself should be O(1) and no more, but a rotation can cause balance issues further up, so every vertex on the path to the root node needs to be checked. This check takes O(log N), as there are log N vertices on that path in a balanced tree.
AVL Trees Illustrative Introduction
Example tree
The first operation we will consider is insert, and we will now insert 12
Now we must figure out if we need to balance the tree or not. We calculate the height parameter for each node on the path up from the node that was just inserted.
We see that node 19 is the root of a left-heavy subtree. We calculate the balance factor to be 1 - (-1) = 2, which is positive, as we expect from left-heavy trees, so we know we need a right rotation.
Here is the new tree with the new height calculations. We then do our check up the levels (this check is why the time complexity is O(log N)). We need to do this because any rotation can affect the balance of the tree further up; going up the chain of vertices corrects this.
Implementation
Practical Applications of AVL Trees
Abstract Syntax Trees use this (the Python interpreter and JavaScript engines use ASTs)
This is not an in-place algorithm: it needs extra memory to store the nodes of the tree. The memory complexity is O(N), as we are storing all the nodes. Quicksort can achieve the same time complexity as an AVL sort without the need for extra memory.
What is a B tree. Look it up!
Red Black Trees
AVL Trees base their calculations on height, and from this the balance factor is calculated and from this a determination of if and how to rotate the nodes. Red Black Trees use a completely different set of parameters to determine balance.
The root node is always Black (CAUSE YOU ALREADY KNOW), as are the null pointers
If the properties of the red-black tree are violated, we can do one of two things: 1) recolor the nodes (make sure the root node is black and the null nodes are black, and work from there), or 2) rotate the nodes until the properties are met.
Look at AVL tree rotations to see how this works. We will see the implementation soon, but the rotations are basically the same as AVL trees
The Logic Behind Red Black Trees
The shortest path in a red-black tree contains only black nodes. This is an example of a valid red-black tree; if we say m is the number of black nodes, then the shortest path has length m.
This, also valid, red black tree shows that the longest path is a path with red nodes interspersed between the black. Remember that null nodes are black, so each red node has two black children, and all properties are respected.
This shows that the longest path is when there are red nodes interspersed between the black nodes, giving a length of roughly 2m (where m is the number of black nodes). Thus no path is more than twice as long as any other in a tree. This satisfies the standards of a balanced tree (or at least it is more balanced than nothing at all, lol. It’s actually pretty good). This is enough to guarantee O(log N)
Red Black Trees: Recoloring and Rotation
Case 1: Insert Red, Uncle is Red
Opposite Case
Case 2: Insert Red, Uncle is Black
Opposite Case
Case 3
Right rotation on the grandparent of the inserted node
Recoloring, and a check up to the root node in case anything else has to be done
Opposite Case
Case 4: Red Child, of a Red Child of a Black Parent, with a Red Uncle
Recoloring takes O(1), as do rotations.
Red Black Trees Illustrated
insert(1)
All new nodes are red, and then recolored if necessary. So the first node is inserted red, but because it is the root node, it is recolored to black.
insert(2)
Rotation on the grandparent
Recoloring
insert(4)
insert(5)
Example 2
insert(12)
Differences Between AVL Trees and Red Black Trees
Heaps
What are Priority Queues
Heap Introduction — Basics
Heaps are trees where the max (or min) item is the root node, so finding the max in a max heap or the min in a min heap takes O(1) time.
Usually for AVL trees, finding the max or min value (going all the way to the end of the rightmost or leftmost branch) takes O(log N) time.
The heap (tree) must be balanced; because heaps are complete, they cannot become imbalanced.
Complete here means we add values from left to right. Root first, then root.left, then root.right, then root.left.left then root.left.right, then root.right.left, then root.right.right, and so on, as so
This is an invalid insertion according to the completeness property rule
Valid Maximum Heap as each parent is greater than its children. It doesn’t matter that the left child node is larger than the right, only that the parents are greater than the children
Minimum Heap
Heap Introduction — Array Representation
First we place indices on the nodes in the heap
Here we follow the heap rules and count from left to right starting with the root node. Here is how we represent a heap with a one dimensional array
The formulas left_child_index = 2 × [node index] + 1 and right_child_index = 2 × [node index] + 2 give us the ability to calculate the indices of the left and right children from the index of the node in question. We can find the parent node with the inverse formula ([node index] - 1) // 2 (the integer division rounds the value down to the proper int index).
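The index arithmetic as a sketch:

```python
# Index arithmetic for a heap stored in a 0-indexed array:
# children at 2i+1 and 2i+2, parent at (i-1)//2.
def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * i + 2

def parent(i):
    return (i - 1) // 2   # integer division rounds down to the right index

heap = [92, 78, 23, 2, 5]    # root at index 0
i = 1                        # the node holding 78
print(heap[parent(i)], heap[left_child(i)], heap[right_child(i)])  # 92 2 5
```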
Step by Step Guide to Building a Maximum Heap
insert(23)
insert(5)
After each insertion we have to check whether the heap property rules are violated or not (if so, swaps are in order, as we will see).
insert(78)
Here we see the heap property rules are violated, as 78 is larger than its parent. So in order to fix that, we have to swap the two.
insert(2)
insert(92)
This insertion violates the heap property rules: 92 is larger than its parent and grandparent. So we swap.
Then we have to swap again (notice how the array changes compared to the abstraction of the tree). It is interesting how the property rules of trees (AVL vs Red-Black vs heaps) define the functions executed when the rules are violated. For example, in a heap all that is required is the left-to-right insertion and the fact that parent nodes must be greater in value than their children. Enforcing this requires only a simple swap operation when the child is larger than the parent. Compare how the rules define the violation behaviors in AVL and Red-Black trees.
This swap went all the way to the root node, so we know it is a valid heap data structure.
insert(12)
Valid heap, move on
insert(21)
Valid heap, so move on
insert(99)
Heap property rules violated, swap!
then swap again
then check and swap again, all the way up to the root node, ensuring the heap properties are respected
And now we have a valid heap. Why do we like heaps?
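The insert-and-swap-up procedure from the walkthrough above, as a sketch (sift-up is the usual name for the swap-up step):

```python
def insert(heap, item):
    """Append at the next left-to-right slot, then swap the item up
    while it is larger than its parent (sift-up)."""
    heap.append(item)
    i = len(heap) - 1
    while i > 0:
        p = (i - 1) // 2                # parent index
        if heap[i] <= heap[p]:
            break                       # heap property holds, done
        heap[i], heap[p] = heap[p], heap[i]
        i = p

heap = []
for value in [23, 5, 78, 2, 92, 12, 21, 99]:
    insert(heap, value)
print(heap[0])   # 99, the max bubbled all the way up to the root
```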
Time complexity of python max/min functions and the comparison to heaps:
Heap Introduction — Remove Operation
Heaps are built for max/min operations, as these happen in O(log N) time (rebuilding the heap after removing the root node is, I am guessing, the bottleneck). Removing or finding an arbitrary item takes O(N) time. AVL and Red-Black trees have better time complexity for the remove and find operations.
remove(12) # from heap
removing 12 produces a hole in the array data structure
So we move up the values to fill the gap, and check through to the root node that the heap property rules are followed
So removing takes O(N) time, because we have to do a linear search to find the item (O(N)), then remove it and move up the tree to ensure the heap rules are followed (O(log N)).
How do we get rid of the root node?
First we remove it and swap it with the last item in the array
Then we check that it follows heap rules with the heapify function. We check to see if the root’s children are greater than the root and if so, swap the root with its largest child.
We now check again to see if the heap property rules are followed and see that they are not. We then swap the node with its largest child and we are done!
To reiterate, removing the root node takes O(log N) and removing an arbitrary node takes O(N)
Heapsort
How to sort a heap, step by step. Say we have this heap:
First we know the top node is the largest node, so we 1) record it in the sorted output (O(1)), 2) swap it with the last (far right) node (O(1), since in the array representation the last node is simply the last index), 3) consider the old root, now at the far right, sorted, and finally 4) run heapify to restore the heap property rules (O(log N)). Repeating this for all N nodes gives a total run time complexity of O(N log N).
run heapify, and see that swapping it with its largest child does the trick
Now repeat with the new root node
run heapify, and swap with the greatest child
Repeat the process with the new root node
heapify, swap with greatest child
Repeat with the new root:
heapify, swap with greatest child
repeat
heapify
We see that if we keep taking the root node, swapping it with the last item in the array (filling the array from the right, the inverse of the left-to-right insert order), and heapifying the tree, we get the sorted order. This happens in O(N log N) time no matter what.
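That loop as a sketch (heapify here is the sift-down step; the input list is assumed to already satisfy the max-heap property):

```python
def heapify(heap, size, i):
    """Sift the value at index i down until both children are smaller."""
    while True:
        largest = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < size and heap[child] > heap[largest]:
                largest = child
        if largest == i:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def heap_sort(heap):
    """Repeatedly swap the root (the max) with the last unsorted slot,
    shrink the heap by one, and heapify: O(N log N) overall."""
    for end in range(len(heap) - 1, 0, -1):
        heap[0], heap[end] = heap[end], heap[0]
        heapify(heap, end, 0)
    return heap

# The input below is already a valid max heap.
print(heap_sort([99, 92, 78, 23, 21, 12, 5, 2]))
# [2, 5, 12, 21, 23, 78, 92, 99]
```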
Running Time Complexities of Heap Operations
The key difference between heaps and binary search trees (and AVLs) is that finding the min/max on binary search trees and AVLs takes O(log N) time complexity, but in heaps it is constant time (O(1)).
Binomial and Fibonacci Heaps
Look up how Binomial Heaps and Fibonacci Heaps are implemented
Implementation of Binary Heap
Heaps In Python
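Python ships a heap in the standard library: heapq, which implements a min heap on top of a plain list. A small sketch:

```python
import heapq

# heapq maintains the MIN-heap property on an ordinary Python list.
data = [23, 5, 78, 2, 92, 12]
heapq.heapify(data)           # O(N): rearrange the list in place into a heap
print(data[0])                # 2, the minimum sits at the root

heapq.heappush(data, 1)       # O(log N) insert
print(heapq.heappop(data))    # 1, pop always returns the smallest item

# For a max heap, a common trick is to push negated values.
```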
Heap Interview Questions