How to write code?

Posted on Jan 1, 0001

How to write code? Or: things I would have loved to have heard 20 years ago.

DEFINE YOUR DATA FIRST

Start with your data-model. Make sure you define all data you need, what the types are, and how they are related.

If possible, allow some flexibility, for instance by allowing key => value strings to extend the properties an entity can have.

In case of doubt, use SQLite for data-storage. Try to only use SQL that also works on MySQL, MS-SQL, PostgreSQL, etc. Don’t worry about scaling. If your project becomes so big that you need to scale, you can hire an external DB service, or have a team of people dedicated only to making a DB cluster scale (if it worked for Facebook and Twitter, it will work for you). How they do that, and how they keep it running, is not and will never be your problem as a programmer.

Now that you have your data-model, create wrapper functions to insert, update, delete, find, etc.: whatever you need to query from this data. Don’t start with caching layers. If an SQL query is slow, then cache results in an (in-memory, non-ACID) SQL table, not in Memcached, Redis, or whatever. But only do this as a last resort, not as a first. Remember that cache invalidation is an unsolved problem.
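As a minimal sketch of what this could look like, the Python example below uses the standard sqlite3 module. The users table, the user_properties key => value table and all function names are made up for illustration. The SQL is kept plain on purpose, although auto-generated ids are declared differently on other database engines.

```python
import sqlite3

# Hypothetical data-model: one entity table plus a key => value table
# that lets an entity grow extra properties without a schema change.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL UNIQUE,
    email TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS user_properties (
    user_id    INTEGER NOT NULL REFERENCES users(id),
    prop_key   TEXT NOT NULL,
    prop_value TEXT NOT NULL,
    PRIMARY KEY (user_id, prop_key)
);
"""

def open_db(path=":memory:"):
    """Open the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def insert_user(conn, name, email):
    """Insert a user and return the new id."""
    cur = conn.execute(
        "INSERT INTO users (name, email) VALUES (?, ?)", (name, email))
    conn.commit()
    return cur.lastrowid

def set_user_property(conn, user_id, key, value):
    """Set one extra property; delete + insert avoids vendor-specific upserts."""
    conn.execute(
        "DELETE FROM user_properties WHERE user_id = ? AND prop_key = ?",
        (user_id, key))
    conn.execute(
        "INSERT INTO user_properties (user_id, prop_key, prop_value)"
        " VALUES (?, ?, ?)", (user_id, key, value))
    conn.commit()

def find_user(conn, name):
    """Return the user as a plain dict, or None if there is no such user."""
    row = conn.execute(
        "SELECT id, name, email FROM users WHERE name = ?", (name,)).fetchone()
    if row is None:
        return None
    props = conn.execute(
        "SELECT prop_key, prop_value FROM user_properties WHERE user_id = ?",
        (row[0],)).fetchall()
    return {"id": row[0], "name": row[1], "email": row[2],
            "properties": dict(props)}
```

The rest of the code only ever calls insert_user(), set_user_property() and find_user(), so moving to another database later should mean editing these few functions and nothing else.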

DEFINE YOUR INPUT FUNCTIONS

If you are writing a daemon / web-app of some sort, or even a command-line app, it is best to define all the different input functions and the params they require.

The input functions should take simple data-structures as params, such as hashes of simple params, or arrays, and return a simple value or a hash. Always return error-levels. Do not do the protocol handling inside the input functions. If possible, also include functions that make life easier for clients, so that they don’t have to do 40 calls when one call would be enough. For example: authenticate_and_return_config( user, pass ) instead of auth_user( user, pass ); get_session_token( ticket ); get_config( token, ‘foo’ ); get_config( token, ‘bar’ ); The authenticate_and_return_config() call can internally do the heavy lifting much faster and more efficiently than having all these round trips and protocol encoding / decoding.
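A rough sketch of such an input function in Python is shown below. The in-memory _USERS and _CONFIG dicts are hypothetical stand-ins for real data functions, and the error-level numbers are an arbitrary choice made for this example.

```python
# Hypothetical stand-ins for the data functions. A real application would
# query its data-model and compare password hashes, not plain text.
_USERS = {"alice": "s3cret"}
_CONFIG = {"alice": {"foo": "1", "bar": "2"}}

def authenticate_and_return_config(params):
    """Input function: plain dict in, plain dict with an error level out.

    No protocol handling happens here; the dispatcher does that.
    """
    user = params.get("user")
    password = params.get("pass")
    if not isinstance(user, str) or not isinstance(password, str):
        return {"error": 2, "message": "user and pass are required", "config": {}}
    if _USERS.get(user) != password:
        return {"error": 1, "message": "authentication failed", "config": {}}
    # One call returns everything the client needs, saving several round trips.
    return {"error": 0, "message": "ok", "config": dict(_CONFIG.get(user, {}))}
```

A client gets the whole config in one call: authenticate_and_return_config({"user": "alice", "pass": "s3cret"}) returns error level 0 together with both config values.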

MAGICAL FUNCTIONS

Now that you have the data functions and the input functions, it’s time for the fun to begin. The things before could be done without much effort, since it is all default boilerplate code. If all went well, lots of thought has gone into the data-model and the API / input functions, so they should be solid. Now comes the part where you connect everything and write the real logic. You will find very quickly that you write the same small code over and over. This is mostly small utility code, which can be placed in a utils function library. But you will also find yourself staring at bigger problems, where you have no idea how to start. What to do, what to do?

Simply imagine that there is a magic fairy who will magically write the code for you. The only thing you have to do is be really precise about what the input has to be and what the output has to be. Also remember that ‘pure functions’ are better than functions that take a complex input, so do try to split up the magic functions into magic functions that take simple input and return simple output.

Now that you have done this, create the prototype for the function, and already create fake, fixed output as the return value of the function. Put MAGICAL FUNCTION as a comment in the prototype. Continue writing the input function, and don’t worry about the magical functions. Do the same for all the other input functions. Now that the input functions are done, we only have the magical functions to worry about. And you probably noticed that several of the input functions have the same problems, so they use the same magical function. Go to one of the magical functions. You probably know how to do one part, but not the other (or that one part is a lot of work). Again we will simply create a new magical function to do the heavy lifting, and we do the stuff that is relatively easy first. Always make sure that you finish the whole function, and test it before going to the next magical function, even though you think you can create it on the fly.
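To make this concrete, here is a small Python sketch. The function names, the search scenario and the fake output are all invented for this example. The input function is complete and testable, while the hard part is parked in a stub.

```python
def rank_search_results(query, documents):
    # MAGICAL FUNCTION
    # Input:  a query string and a list of document ids.
    # Output: the document ids, ordered by relevance to the query.
    # For now it returns fake, fixed output so the caller can be finished.
    return ["doc-1", "doc-2"]

def input_search(params):
    """Input function: finished and testable, even though ranking is not."""
    query = str(params.get("query", "")).strip()
    if not query:
        return {"error": 1, "results": []}
    # Hypothetical: a real version would get these from the data functions.
    documents = ["doc-1", "doc-2", "doc-3"]
    return {"error": 0, "results": rank_search_results(query, documents)}
```

Once input_search() has a test, rank_search_results() becomes the next function to finish, possibly by deferring its own hard parts to yet another magical function.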

The main goal is to not get sucked into diving in sub-functions, but to finish a complete function, and to DEFER even looking into those new functions.

If you switch to the newly created magical function, you will partly forget what you were doing in the current function. Instead of having one function that works and one that still needs to be written, you will have two functions, neither of which works, and you will start to suffer from cognitive overload. Debugging will also become more complicated, since problems could be in any of several half-finished pieces of code. Therefore: one function at a time, and only switch to a new magical function once the current function is finished, debugged, and has a test.

What if you have a magical function that you can’t directly split up into sub magical functions? Simply skip it, and go to the next magical function that needs finishing. What if you still have open magical functions that seem hard? Go for a walk, get your mind off things. When you look at the same problems with fresh eyes, you will see a solution. If not, bother a friend, a colleague, the Internets.

A very nice mind trick is to visualize / describe the problem as if it were people getting things done. This allows you to connect the problem to the fast, unconscious social knowledge that you have (think of the ‘dining cryptographers problem’, which is easier to grasp than the mathematical notation). Last but not least: explain the problem to somebody, or even to yourself, out loud. Just by being forced to explain the problem, you will often find the solution.

DISPATCHER

Since you have defined the input functions, but don’t allow them to do the nitty-gritty of the protocol, you now have to write the dispatcher. The dispatcher function can be part of main, or can be a function that is called by a framework. This function is very dependent on what it is that you write, so don’t worry that it is not portable. By now you have probably also separated the input functions into a library, as well as the data functions. So the code for the dispatcher will probably look a bit like these examples:

 use <my_functions>
 
 function main 
    // call functions to get the current config
    // setup stuff that need to be done
    while connection = connections_come_in()
        dispatch(connection)
    end
    // cleanup code 
    // shutdown code
 end
 
 function dispatch (connection)
     // "path" stands for whatever your protocol uses to select an action
     if connection.path == "bla"
            // extract variables code
            result = input_bla ( variables ) 
            // massage results and output the protocol data code 
     elseif connection.path == "foo"
            // more of the same
     end 
 end

Or something as simple and nasty as:

require "functions";

while(<>) {
    chomp;
    if ( m/^bla (.+)/ ) { print input_bla($1) }
    if ( m/^foo (.+)/ ) { print input_foo($1) }
    if ( m/^bar (.+)/ ) { print input_bar($1) }
}

The thing is: since the dispatch mechanism is highly dependent on the functions of the application and the hooks / protocol used, it does not make sense to try to make it reusable or generic. It is highly specific, and you should treat it as such. It exists independently of the libraries which do all of the heavy lifting. The dispatcher can be several functions, but just don’t mix the dispatching system with the libraries of the data-model(s) and the input libraries. If your libraries are well maintained and don’t have hidden state, then anybody (you, someone in your team, or an outsider) should be able to connect any protocol / system to your application and have something that works within a couple of days, since they only have to focus on implementing the protocol specifics / hook handling, etc.

The value of writing the data, input and other functions first is that it allows you to model for how things should be in a perfect world. You will probably be forced to add extra functions to deal with the messy reality of (protocol) handling, but if you start by writing the handling first, then any subsequent code will look like that protocol / system, which will lock you in that way of thinking, and make it difficult and messy to extend or use the code for other uses.

If on the other hand you start with the data-model, and the input functions, then you create a high level design that can and will be used for other goals and will outlive the dispatch system.

Other things I strongly recommend are:

CODE PROMISES

ALL functions should not just define what their input and output is, but they ALWAYS should promise not to CHANGE their behavior, and especially not their input/output!

If for whatever reason new params are needed, then you are required to create the new function first with a new name, and only after this new function works, back-port all the existing code to use the new function (with the new name). After you are sure no code remains that uses the old function you can delete it from the code-base, but not in a hurry (print warnings to logs/stderr first).
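A small Python sketch of this migration path follows. The function names and the _CONFIG store are hypothetical. The old function keeps its exact behavior and only gains a warning in the log, while the new behavior lives under a new name.

```python
import logging

log = logging.getLogger("deprecations")

# Hypothetical stand-in for the real data functions.
_CONFIG = {"tok-1": {"foo": "1"}}

def get_config(token, key):
    """Old function: input, output and behavior stay exactly as promised."""
    log.warning("get_config() is deprecated, use get_config_with_default()")
    return _CONFIG[token][key]  # still raises KeyError, as it always did

def get_config_with_default(token, key, default):
    """New function with the extra param, under a new name."""
    return _CONFIG.get(token, {}).get(key, default)
```

Callers move to get_config_with_default() one at a time. Once the deprecation warning stops showing up in the logs, get_config() can be deleted.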

WRITE TESTS

Write a test for every function. Make them part of your build process. Include timing information. Compare timing information to see if functions suddenly become much slower.
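One possible way to get timing information into a test, sketched in Python: slugify() is just a made-up function under test, and the baseline and tolerance values are assumptions you would replace with numbers recorded from earlier runs.

```python
import time

def timed(func, *args, repeat=1000):
    """Call func(*args) `repeat` times; return (last result, seconds per call)."""
    start = time.perf_counter()
    for _ in range(repeat):
        result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed / repeat

def slugify(title):
    """Hypothetical function under test."""
    return "-".join(title.lower().split())

def test_slugify(baseline_seconds=None, tolerance=10.0):
    """Check the output, and optionally compare timing to a stored baseline."""
    result, per_call = timed(slugify, "How to Write Code")
    assert result == "how-to-write-code"
    if baseline_seconds is not None:
        assert per_call <= baseline_seconds * tolerance, "slugify became much slower"
    return per_call  # store this to compare against in the next build
```

The build process can save the returned timing and pass it in as baseline_seconds on the next run, so a sudden slowdown fails the build instead of going unnoticed.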

ALWAYS WORK THROUGH YOUR INPUT FUNCTIONS

If you need a search engine, etc., then still do those things THROUGH your input functions. If on the other hand you expose those services directly, you are stuck with their interfaces, and you can’t change anything. Changes in the backend should only require changes in your code. Your application server can always scale if you don’t maintain state, so don’t worry about that.

ALWAYS DO INPUT VALIDATION

Do the input validation inside your input functions. Make sure that you convert to the correct data-type first, BEFORE you do any tests on the correctness of the input.
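For instance, a validator for an age field could look like the Python sketch below. The function name, the error-level numbers and the 0 to 150 range are arbitrary choices for illustration. The point is the order: convert to an integer first, and only then test the value.

```python
def validate_age(raw):
    """Return (error_level, value); value is None unless error_level is 0."""
    # Step 1: convert to the correct data-type.
    try:
        age = int(str(raw).strip())
    except ValueError:
        return 1, None
    # Step 2: only now test the correctness of the converted value.
    if not 0 <= age <= 150:
        return 2, None
    return 0, age
```

Checking the raw string first (for example by comparing it to "150") would test the wrong thing, since the comparison would be on text and not on the number.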

Do not allow any tainted variable to be used in any other function. Depending on what the function does, consider adding a random sleep of milliseconds to prevent all sorts of timing attacks. Never allow user provided XML to be interpreted by a standard un-configured XML parser, since this will open up all sorts of XXE vulnerabilities.

It is insane, but I have had discussions with developers who were convinced that simply by using a language with strong typing, it was impossible to get hacked / have invalid input. Check the ‘OWASP Top Ten’, and know this is just the tip of the iceberg. I left out all the obvious things such as SQL bind parameters. You should also consider all common security recommendations as part of defense in depth, but know that if input validation is done, then (almost) all other security problems simply cease to exist.

GO FOR PURE/PERFECT FUNCTIONS

If a function can be a ‘perfect pure function’, stick to that. If a function can promise to always return the same output if the input is the same, go for that. It makes debugging easier, and allows caching of results. If you have the choice of creating 2 INTERNAL functions (not input functions), or only 1, but of the 2 functions one can be a pure function, go for the 2 functions. For example: if pure_function_is_valid_hash(hash) then return get_config_from_storage(user) is better than return get_config_from_storage_if_valid_hash(hash,user).
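A sketch of that split in Python follows. The HMAC scheme, the SECRET constant and the _STORAGE dict are assumptions made for this example, and the check takes the user as an extra argument. The hash check is pure (same input, same output, no side effects), and the storage lookup is the only impure part.

```python
import hashlib
import hmac

SECRET = b"example-secret"           # hypothetical constant
_STORAGE = {"alice": {"foo": "1"}}   # hypothetical stand-in for real storage

def expected_hash(user):
    """Pure: the same user always gives the same hash."""
    return hmac.new(SECRET, user.encode("utf-8"), hashlib.sha256).hexdigest()

def is_valid_hash(user, candidate):
    """Pure: depends only on its arguments and the constant SECRET."""
    return hmac.compare_digest(
        expected_hash(user).encode("utf-8"), candidate.encode("utf-8"))

def get_config_from_storage(user):
    """Impure: reads from storage, which can change between calls."""
    return dict(_STORAGE.get(user, {}))

def get_config_if_valid(user, candidate):
    """Thin glue: the pure check first, the impure lookup second."""
    if is_valid_hash(user, candidate):
        return get_config_from_storage(user)
    return None
```

The pure half can be tested without any storage set up at all, and its results can be cached safely.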

The obvious exception is a ‘magical function’. But magical functions (defer based programming) are there to help you overcome being stuck in a problem, or winding up with a cluster-fuck of code that is all ‘almost done’.

NO STATE

Only keep state in the data, not in persistent variables in the code. The exception of course is database / connection handles.

OO is problematic since it always insists on keeping state inside of objects, creating the need for locks and other (multi-threading) problems. But OO has its useful use cases, so use OO if your application requires it. State will bite you in ways which you cannot foresee now. You are also better off copying values to pass as a param to a function, instead of using references. Optimize code by using pointers only if you have no other choice.

WRAP MODULE FUNCTIONS

If you require external modules, wrap those functions in your own functions.

If for whatever reason you need to change modules, then you only have to edit those functions, instead of going on a wild-goose chase through all of your code. If at all possible, do the loading of external code / modules inside of the wrapper function as well: if not defined( function_foo ) load-module(‘foo’). You should at all times be able to phase out a module / library by simply rewriting a couple of functions. This is not possible if the behavior of those external libraries is hard-coded in the logic of all your code. The same applies to frameworks: they dictate how your input functions should be. Frameworks can be OK if used in the dispatcher, but they should never in any way be part of your library of data, input and other functions.

INCLUDE BATTERIES WHEN SHIPPING

Ship required modules / code WITH your source code / software distribution.

Your code, and how it works with extra modules, is not a problem to be dumped on the shoulders of the people who are least capable of dealing with it. Do not create a DLL / dependency hell for others, but make sure that the code has everything required to run, except for the interpreter and the base modules which are shipped by default with the interpreter (if you are writing in a higher level language). If possible, compile / concat all the dependencies into a single (source / executable) file, and ship that. If there is a bug in one of the modules, it’s your responsibility to ship a new version without the bug ASAP. Everything you call directly is your domain to test and ship. Your responsibility stops at the protocol level (don’t ship database daemons, MTAs, etc.).

If your application is shipping its own modules / libraries / deps, then it can simply run from any directory, thus negating the need to put everything in a Docker container just to be able to run. You can still put it in a Docker container (just put in the package with the interpreter, and drop the directory in the container), but there is no need to do so. It also means you won’t be receiving all kinds of support questions / tickets stating that it won’t work on system X, which has lib version Y. These bugs will have you modify the code to work around all these different situations, instead of spending the time coding features, which is where the fun is. So do yourself and your users a favor, and ship with all the code you need, except the interpreter and its basic libraries. Things like snap packages are an overly complex way to fix what is mostly just a dependency problem. These kinds of solutions bring their own complexity and limitations (if you don’t believe me, just try to install a snap package on ChromeOS, Windows, FreeBSD or OSX).

NO PREMATURE OPTIMIZING

Write functions in a higher level language, and do things as they come naturally within the language.

Once the functionality is done, time the functions on your test code to find bottlenecks (profiling). Remember: if you want to rewrite code to make it faster, do not rewrite the function in place. Create a new function like ‘functionname_foo_v2’, test it, and then change the existing callers one by one to use the new function (testing in between). Also test in production with one changed function at a time, instead of doing a dramatic all-out migration. After one altered function in production has the desired result, do the next (and wait a day, etc.). Ops will thank you for it.

If things are still not fast enough, and the limitation is in the code, you can choose to rewrite a function in Rust / C. Almost every language has a way to link in externally compiled code. This should not be the first course of action though. Also: only rewrite things that clearly benefit from it, and NEVER rewrite the input code in C.

If data access is too slow, see if you can CACHE results IN THE DATABASE in temporary (in-memory) tables. Since they don’t provide ACID, they should be almost as fast as Redis / Memcached, etc., but since they are part of the database, you don’t have to reinvent or set up an extra system to distribute data and keep it consistent. Also, since all systems use the same database, you can scale the application-server (horizontal scalability). If you decide to create an in-memory cache, or use the local filesystem for caching, then you have just burned that bridge. If data access is slow, and you are still running on SQLite, change your data functions to use MySQL (MariaDB), and migrate to a MySQL or a PostgreSQL cluster (or use a hosted SQL solution, such as PlanetScale).

Remember that being a DBA is a different profession than being a programmer, and if you work for a company, bring in a DBA. Don’t do what every programmer is doing, and try to build all kinds of caching in the code besides temp tables in the database. Such caches will hamper your ability to scale your application-server, and they will introduce all kinds of extra services, such as extra daemons that can fail, and extra complexity. Also: CACHE INVALIDATION is an UNSOLVED PROBLEM.

If you are building a planet-scale solution, and really have a need for speed, then take a hint from Google’s playbook. They index and pre-create all the data at INGEST (when it is being digested / taken in). In this way data is ‘chewed’ before it goes into various indexes in several systems, so the pre-calculations are already done, and lookups can be lightning fast.

SOMETIMES JOBS ARE REQUIRED

A common mistake made by web-developers is that they try to update everything in one request, while the user is waiting. This means that the load-time can be very long, since a request can take some time to update everything if systems become very complex. This also raises the risk that mutations did not completely come through ( ignore_user_abort ). The better approach is often to simply create a job in a queue to ‘do stuff’. The client can check what the status is, while it is guaranteed that the jobs will finish with the correct update (use a background worker to process all job requests). A simple job queue system can be created in the database itself. Using an external daemon such as RabbitMQ is overkill, and hardly ever the best solution. But again: creating a job queue from the start is most of the time a form of premature optimization. Don’t do it if not required.
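A minimal sketch of such a database-backed queue, in Python with the standard sqlite3 module, is shown below. The jobs table, its status values and all function names are invented for this example, and the LIMIT clause would need adjusting for engines that don’t support it.

```python
import sqlite3

def open_queue(path=":memory:"):
    """Open the database and make sure the (hypothetical) jobs table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')")
    conn.commit()
    return conn

def enqueue_job(conn, payload):
    """Called inside the request: cheap, returns a job id the client can poll."""
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid

def job_status(conn, job_id):
    """Return 'queued', 'running', 'done' or 'failed'; None if unknown."""
    row = conn.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)).fetchone()
    return row[0] if row else None

def run_next_job(conn, handler):
    """Called by the background worker: run the oldest queued job, if any."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'queued'"
        " ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    job_id, payload = row
    # Claim the job; the status check stops two workers taking the same one.
    claimed = conn.execute(
        "UPDATE jobs SET status = 'running'"
        " WHERE id = ? AND status = 'queued'", (job_id,))
    conn.commit()
    if claimed.rowcount != 1:
        return None
    try:
        handler(payload)
        status = "done"
    except Exception:
        status = "failed"
    conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    conn.commit()
    return job_id
```

The request handler only calls enqueue_job() and hands the id back to the client, which polls job_status(). A background process calls run_next_job() in a loop.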

DECLARATIVE LANGUAGES ARE AWESOME

SQL, HTML, regular expressions or even Puppet’s template language are all examples of declarative languages. The reason that they work is that they allow you to declare what the intent is, without requiring you to specify how it is done. They have their limitations, in that they often require you to give hints (create an index, customise behavior through style sheets), but in general they save you lots of time.

BE PRAGMATIC, DON’T BE DOGMATIC ( Apply Or Explain )

Guidelines are not a straitjacket, just recommendations. But where you deviate from the guidelines, try to write a comment explaining WHY this is the better solution. At least this will force you to think about the pros and cons. It will also give guidance to reviewers and to other people working on the same code base.

Also accept that DSLs (domain specific languages) are there to narrow the problems you encounter. Don’t be the idiot who writes command-line utilities in Java, which require GBs of memory just to run and take forever, when a simple sh + awk script does the same in a fraction of the time using almost no resources at all. Accept that every language has its own problem domain.