Executive Summary
Smart Contracts are applications built on a blockchain that, once deployed, cannot be altered or updated. Because of this, testing them is crucial, even more so than in traditional software development.
Several techniques exist for testing Smart Contracts, and it is up to developers to decide when each should be used so that the resulting tests provide sufficient validation. This is a non-standardized, individualistic approach: there is no established methodology for it, and the developers' skill plays an essential part.
This research focuses on the most widely used testing techniques and showcases them in order to give a sense of what kinds of testing are possible and where each makes sense.
In testing, there is always the question of whether the collection of tests (test suite) covers all of the cases - “Who will guard the guards themselves?”*.
To answer this question, at least to a certain degree, the paper elaborates on evaluation tools that indicate whether more tests should be written or whether a case has been overlooked.
As the techniques and tools mature and increase in complexity, we may see the introduction of standardized methodologies that provide a thinking framework on how code should be written and/or tested, as well as a separation of roles between developers and testers.
*Quis custodiet ipsos custodes? - a Latin phrase found in the work of the Roman poet Juvenal (Satire VI, lines 347–348)
Introduction
Testing involves thinking about how code should behave in the intended (ideal) case, but also about the consequences of unintended actions performed by unaware or even malevolent actors.
The logical question arises: “what needs to be tested … and how?”. This is an extremely hard question, and the answer lies in the considerations that developers consciously or unconsciously make. It is important that they keep up to date with the latest techniques, now more than ever, as the past experience of others helps establish best practices and can be applied to similar or completely new problems.
Sanity checks can be performed by using evaluation tools that help cast light on areas that were previously overlooked. Still, to add another layer of confidence, a fresh set of trusted eyes should separately go through the code and try to find bugs and/or exploits in it. This is referred to as a Smart Contract audit and is the last step before deployment to mainnet.
Goals & Methodology
The focus of this research is on the testing and evaluation techniques that are currently being used in the area of Smart Contract development in the Ethereum Virtual Machine (EVM) ecosystem without going down the rabbit hole of what the best practices are.
The methodology consists of describing a technique and then giving an appropriate example that shows when it is adequate to use it. Frameworks are purposefully not mentioned, as it is more important to first understand the key concepts of what is being done rather than the unique and specific details of how something is done.
Results & Discussion
So as not to give overly abstract and vague descriptions, an example smart contract is given on which the testing can be performed and through which a better understanding can be built.
Contract example
The example contract DummyToken can wrap/unwrap Ether through the deposit and withdraw functions and transfer tokens between two addresses using a function of the same name - transfer. During the execution of those functions, a corresponding event is emitted.
The implementation details are purposefully hidden with the intention of starting the thinking process of how those functions should behave both when called in intended and non-intended ways.
/**
* @dev Implementation of the Dummy Token.
*/
contract DummyToken {
/**
* @dev Emitted when tokens are moved from one account (`from`) to
* another (`to`) of the `value` amount.
*/
event Transfer(address indexed from, address indexed to, uint value);
/**
* @dev Emitted when a new Deposit is made
*/
event Deposit(address indexed to, uint value);
/**
* @dev Emitted when new Withdrawal is made
*/
event Withdrawal(address indexed to, uint value);
...
/**
* @dev Mints `value` tokens to `msg.sender`, where `value` corresponds to `msg.value`.
*
* Returns a boolean value indicating whether the operation succeeded.
*
* Emits a {Deposit} event.
*/
function deposit () public payable returns (bool) {...}
/**
* @dev Burns `value` tokens if the `msg.sender` balance can cover it.
*
* Returns a boolean value indicating whether the operation succeeded.
*
* Emits a {Withdrawal} event.
*/
function withdraw (uint value) public returns (bool) {...}
/**
* @dev Moves `value` tokens from the caller's account to `to`.
*
* Returns a boolean value indicating whether the operation succeeded.
*
* Emits a {Transfer} event.
*/
function transfer (address to, uint value) public returns (bool) {...}
/**
* @dev Returns the number of tokens owned by `account`.
*/
function balanceOf(address account) public view returns (uint) {...}
/**
* @dev Returns the total amount of tokens in existence.
*/
function totalSupply() public view returns (uint) {...}
}
Specification of the transfer function
To understand the forms of testing that can be performed, let us write a specification of what one of the functions needs to accomplish, namely the transfer function.
High level specification of the transfer function
This function transfers an amount of tokens (value) from the msg.sender's balance to the to address' balance.
Low level specification of the transfer function
- After a successful transfer, the balance of the to address is incremented by the value amount and the msg.sender's balance is decremented by it.
- If the msg.sender's balance is smaller than the value, the transaction should revert with the "Transfer amount exceeds balance" message.
- If the transfer is successful, the function returns true - otherwise, it returns false.
- If the transfer is successful, the Transfer event should be emitted with the corresponding fields:
  - from: msg.sender
  - to: value of the to argument
  - value: value of the value argument
Forms of testing
Unit Testing
Unit Testing relies on keeping the tests separate from each other and as simple as possible, with each unit test being responsible for testing a single module ("unit").
These tests follow a common pattern referred to as Arrange-Act-Assert (AAA). First, the "arrangements" are made to put the system in the desired state, then the "act" is performed (most often a function call) that leads the system to the next state, after which that state is "asserted" for correctness.
In an individual unit test, most often only one assertion is made, which increases the number of tests. This, however, has the benefit of giving a clear indication of why a test has failed and of increasing code readability.
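To make the pattern concrete, below is a minimal sketch of an AAA-style unit test for the deposit function. The paper deliberately avoids prescribing frameworks, so the harness shown here (a Foundry/forge-std-style Solidity test contract with vm cheatcodes and assertEq helpers) and the test account alice are illustrative assumptions only, not the only way to do this.

// Minimal Arrange-Act-Assert sketch (assumes a Foundry-style Solidity test harness
// and that the DummyToken contract shown above is available to the test file).
import "forge-std/Test.sol";

contract DummyTokenDepositTest is Test {
    DummyToken token;
    address alice = address(0xA11CE); // hypothetical test account

    function setUp() public {
        token = new DummyToken();
    }

    function test_DepositMintsTokens() public {
        // Arrange: fund alice with Ether and make her the caller of the next call
        vm.deal(alice, 1 ether);
        vm.prank(alice);

        // Act: wrap 1 ether into DummyToken
        token.deposit{value: 1 ether}();

        // Assert: alice's token balance reflects the deposited amount
        assertEq(token.balanceOf(alice), 1 ether);
    }
}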
When thinking about unit testing the DummyToken contract, we will take only the transfer function as an example. Following is an incomplete list of test scenarios for this functionality that should serve as a starting point.
Test Scenarios:
To form a part of a test suite, let us divide the test scenarios into two sections (generalized and edge cases) and write some examples of tests for each of them.
Generalized:
- Valid* Transfer of amount** of DummyToken from address0 to address1, where address0 != address1
  - Tests:
    - address0's balance is decremented by the amount
    - address1's balance is incremented by the amount
    - balances of other addresses have not changed
    - Transfer event was emitted with the corresponding fields
- Invalid* Transfer of amount of DummyToken from address0 to address1, where address0 != address1
  - Tests:
    - transaction was reverted with the right message ("Transfer amount exceeds balance")
- …
Edge Cases:
- Valid/Invalid Transfer of amount of DummyToken from address0 to address1, where address0 == address1
- Valid Transfer of 0/1 DummyToken from address0 to address1, where address0 != address1
- Valid Transfer of 0/1 DummyToken from address0 to address1, where address0 == address1
- …
*The term "Valid/Invalid" refers to whether this transfer should be possible (due to balance amounts).
**amount can be any uint (including a value greater than the total supply)
We can notice that for the first scenario of the generalized section, four tests need to be written, with each of them being a unit test that checks a specific thing (e.g., that the sender's balance has been decremented by the right amount).
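Two of those four tests are sketched below, using the same assumed Foundry-style harness as in the earlier sketch; address0 and address1 are hypothetical test accounts.

// Sketch of two of the four unit tests for the first generalized scenario
// (framework and account names are assumptions, not prescriptions).
import "forge-std/Test.sol";

contract DummyTokenTransferTest is Test {
    // Event redeclared locally so it can be used with expectEmit
    event Transfer(address indexed from, address indexed to, uint value);

    DummyToken token;
    address address0 = address(0x1); // hypothetical test accounts
    address address1 = address(0x2);

    function setUp() public {
        token = new DummyToken();
        // Shared arrangement: address0 wraps 5 ether so it has tokens to send
        vm.deal(address0, 5 ether);
        vm.prank(address0);
        token.deposit{value: 5 ether}();
    }

    // Unit test: the sender's balance is decremented by the transferred amount
    function test_Transfer_DecrementsSenderBalance() public {
        vm.prank(address0);
        token.transfer(address1, 2 ether);            // Act
        assertEq(token.balanceOf(address0), 3 ether); // Assert
    }

    // Unit test: the Transfer event is emitted with the corresponding fields
    function test_Transfer_EmitsTransferEvent() public {
        vm.expectEmit(true, true, false, true);       // Assert is registered before the Act
        emit Transfer(address0, address1, 2 ether);
        vm.prank(address0);
        token.transfer(address1, 2 ether);            // Act
    }
}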
It is important to note that a "Property-based Testing" technique was used in the above list, which is a form of an automated process called "fuzzing" that is used to find bugs by feeding randomized data into the system. This technique focuses on the "properties" of the code that should always hold. The tests are not concerned with the actual values of amount, address0, and address1, which can be anything in the allowed range of possibilities. Rather, they aim to say whether the properties around the balances hold in the test scenario - i.e., if an account transfers some tokens to another account, only those two balances should be affected.
Integration Testing
In the context of Smart Contract testing, integration tests validate interactions between different components of a single contract or across multiple different contracts and are more complex when compared to unit tests.
One form of integration testing is Stateful testing, an advanced method of property-based testing, where a single test is defined by:
- an initial state that can, after deployment, be kept as it is or be created by some fixed sequence of actions
- actions - transactions that lead to a transition of state
- invariants - properties that should always hold true
Starting from the initial state, a randomized sequence of actions is carried out, where after each action, all of the invariants are tested.
For example, when writing a "stateful" test for the DummyToken contract (a minimal sketch follows the list below):
- the initial state can be created such that each test account calls the deposit function with a random amount of Ether provided
- actions can be kept basic (deposit, transfer and withdraw) or more complex (nested - i.e. one action can be [deposit, withdraw, withdraw, …])
- one of the invariants can be that the sum of account balances of the DummyToken must always be equal to the Ether amount that the contract holds
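The sketch below assumes a Foundry-style invariant-testing harness, in which the framework itself generates the random sequence of actions against the target contract; totalSupply() is used here as a stand-in for the sum of all account balances.

// Stateful (invariant) test sketch: the framework repeatedly calls random functions
// of the target contract with random arguments and checks every invariant after each call.
import "forge-std/Test.sol";

contract DummyTokenInvariantTest is Test {
    DummyToken token;

    function setUp() public {
        token = new DummyToken();
        // The actions are the contract's own functions: deposit, withdraw and transfer
        targetContract(address(token));
    }

    // Invariant: the tokens in existence must always be fully backed by the Ether the contract holds
    function invariant_SupplyIsBackedByEther() public {
        assertEq(token.totalSupply(), address(token).balance);
    }
}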
Besides being more complex, integration tests require more resources and execution time.
Static (code) analysis
Both of the above-mentioned forms of testing are considered a type of “dynamic code analysis” that searches for bugs during the execution of the program, and they are the main topic of this research.
It is worth mentioning its counterpart - Static code analysis or just Static analysis, which is a debugging method that examines the source code before a program is run. This is done by analyzing the code against a set of detection rules that include: timestamp dependency, integer underflow/overflow, re-entrancy issues, use of tx.origin instead of msg.sender, … It remains up to the developer to implement or reject the recommendations of these rules.
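As an illustration of one such detection rule, the snippet below shows the kind of pattern a static analyzer would flag: authorizing a caller via tx.origin. A phishing contract can abuse this because tx.origin remains the original externally owned account even when the call is forwarded through an attacker's contract. The Wallet contract is purely illustrative.

// Illustrative only: a pattern that static-analysis rules flag (tx.origin-based authorization).
contract Wallet {
    address owner = msg.sender; // the deployer becomes the owner

    receive() external payable {}

    function withdrawAll(address payable to) public {
        // require(tx.origin == owner);            // flagged: any contract the owner calls
        //                                         // could forward into this function
        require(msg.sender == owner, "not owner"); // recommended replacement
        to.transfer(address(this).balance);
    }
}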
General Considerations
Smart Contracts operate in an extremely hostile environment, and this should always be taken into account. During development and testing, the most valuable guiding principle is that everything that can go wrong will eventually go wrong, especially if someone stands to benefit from it.
A set of principles can be adopted to make the functionality of a contract and its complexity more manageable, so as to reduce the probability of bugs or exploits. Some of those include the following:
- code should be modularized and kept simple (KISS and DRY principles*** should be followed)
- clarity should be preferred over performance (if possible)
- latest versions of battle-tested tools and frameworks should be used
- the blockchain characteristics should be considered
- the latest security developments should always be incorporated
- deployment and testing should be done on Testnet before moving to Mainnet
*** KISS (Keep It Simple, Stupid) and DRY (Don't Repeat Yourself) are software programming principles: KISS states that the simplest solutions often work best, while DRY follows the reasoning that the same or similar code sections should not be replicated across the code base.
Evaluation
The purpose of tests is to verify the correctness of the implementation, which poses the question of whether or not the test suite is sufficient for the implementation requirements. To address this and to have a sanity check for a developer’s thought process, evaluation tools have been created.
Code Coverage
The term code coverage refers to the set of evaluation metrics that are used to determine how much of the program has been tested by the test suite - how many functions have been called, how many statements have been executed, etc.
For example, in the code below, to reach 100% coverage for the function fcn, at least one of the tests would need to call it with parameters that pass all three of the if statements (i.e. fcn(32, 300, 500)).
function fcn(uint a, uint b, uint c) public {
    if (a < 100) {
        if (b > 200) {
            if (c > 300 && c < 600) {
                ...
            }
        }
    }
}
While high coverage doesn't in itself mean the tests are good, low coverage helps identify gaps in the test suite that can be filled by adding new, carefully designed tests.
Coverage-guided Fuzzing
During testing, feeding purely randomized values is often wasteful and time-consuming. In the example above, parameter a is of type uint, which means it can hold any value in the range [0, 2**256 - 1], but the condition a < 100 will hold true only for a tiny fraction of those values.
Coverage-guided Fuzzing takes into account code coverage information for each random value it tries, and if that value executes new code, it is put into the set of promising values. For example, if a = 32 has been generated, the fuzzer will keep note of it, as it opens the door to new code - it can then keep a fixed and randomize parameters b and c, thus reducing the search space.
Mutation Testing (Mutation analysis)
Mutation testing is a technique used to evaluate the effectiveness of a test suite by introducing minor modifications, called “mutations”, in the code, thus producing “mutants”.
These modifications are performed using a fixed set of mutation operators like operand replacement, expression modification, statement modification, etc.
Listed below is an example of original code as well as one potential mutant that can be generated from it.
Original Code
function fcn(uint a, uint b) public pure returns (bool) {
    if (a > b) {
        return true;
    }
    return false;
}
Mutant #1 - produced by using an expression modification operator (replaced > with <)
function fcn(uint a, uint b) public pure returns (bool) {
    if (a < b) {
        return true;
    }
    return false;
}
These mutants are then tested and, ideally, each of them gets caught (killed) by at least one of the tests. The percentage of killed mutants is referred to as the mutation score.
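For example, a test like the sketch below kills Mutant #1: on the original code both assertions pass, while on the mutant (with > replaced by <) they fail. The assert helpers and the access to fcn are assumed to come from the same kind of test harness used in the earlier sketches.

// Kills Mutant #1: the original fcn returns true for (2, 1) and false for (1, 2),
// whereas the mutant returns the opposite, so the assertions fail on it.
function test_FcnComparesArgumentsCorrectly() public {
    assertTrue(fcn(2, 1));
    assertFalse(fcn(1, 2));
}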
These techniques can give insight into what the tests are missing and where the blind spots are, as well as which tests rarely kill mutants - both of which are valuable when improving the test suite.
If a mutant cannot be compiled (i.e., mutation produced a syntax error), it is called stillborn and is not taken into consideration. Sometimes, mutants can have the same behavior as the original code, in which case, they are referred to as equivalent mutants. These mutants will not get killed by the test suite and will lower the mutation score. Detecting and taking them out of consideration is not an easy task and is the biggest obstacle to the widespread application of mutation testing.
Conclusion
This research concludes that the space of testing techniques is vast and evolving. As the complexity of the challenges that developers face is rapidly increasing, staying up to date is a task in itself.
With time, we will probably see more and more specialized roles and a separation of responsibilities, as happened in traditional software development. For this to happen, some standards should be formed that would enable effective communication between team members, namely clear requirement specification documents.
Rather than each individual or team having a different approach, we may also see the evolution of techniques and frameworks leading up to completely standardized methodologies, though this poses a risk of the field becoming too rigid.
Possible roads for future research on this topic include tools that are used when creating a well-documented functional specification for a project, as well as digging deeper into the evaluation methods, specifically mutation testing, which is an active area of research.