Reinforcement Sort

You know Quicksort and Merge sort and Timsort and Shell sort
Heapsort and Bubble sort and Insertion sort and Selection sort
But do you recall the most god-awful sort of all?

A Reinforcement Sort (a sort done through reinforcement(ish) learning) is the Nickelback of sorting algorithms - a perfect superposition of horrible and awesome. This rockstar was developed using an epsilon greedy Q-learning approach with the aim of generating the best python sorting script. This mainly involved combining code fragments together and giving rewards based on how well these scripts sort a list. The exact setup is summarized as follows:

1) State: A binary array chad-kroegered from a md5 hash of the python code string
2) Actions: I figured sorting will require some looping, swapping and probably a conditional or two, so I made these actions on a state
3) Reward: The state receives a score of -1 if the python code doesn't run, -10 is the code is over 300 characters or +1000 if the list is correctly sorted.

After training up a neural net, I received this optimal sort:


Now you may be telling yourself that this looks exactly like a Bubble Sort, but I assure you it's not. A Bubble Sort has a reasonable time complexity of O(n^2), but a Reinforcement Sort runs in O(oh god, why am I doing this).