Archive: August 2018

C# Protecting private collections from modification

Why do we want to hide data?

One of the big advantages claimed for object-oriented programming is the ability to encapsulate data and hide internal implementation details. This lets us change our implementation without impacting our users, and it ensures that our internal data does not get modified in ways we do not control. In other words, it helps us stay sane.

Protecting data in a class

In C# fields in a class are protected from outside access by using the private keyword.

public class MyClass
{
  private List<int> _myInts;
  ...
}

Now _myInts is only accessible from within MyClass.

Breaking encapsulation

Now, you might want users of MyClass to be able to read the data that _myInts contains. The easiest way to do this is to make _myInts public.

public class MyClass
{
  public List<int> _myInts;
  ...
}

But now you no longer have control over what happens to _myInts. Any external user might alter the list or even change the reference to point at a completely different list object (or null).

Re-adding the protection

In an attempt to prevent a user from modifying the private list, you might add a getter, for example by creating a read-only property.

public class MyClass
{
  private List<int> _myInts;
  public List<int> MyInts => _myInts;
  ...
}

This code can be simplified by converting to an auto property.

public class MyClass
{
  public List<int> MyInts { get; } = new List<int>();
  ...
}

Under the hood a private field is still created; in C# lingo this is called a backing field. By supplying only a get and no set, you prevent an external user from setting the backing field to reference another list (or null). However, an external user may still modify the contents of the original list. For example, var m = new MyClass(); m.MyInts.Add(1); is still possible.

To prevent this you might attempt to give an external user access to the list by supplying it via an IEnumerable or IReadOnlyCollection interface. But doing that prevents you from modifying the list yourself, so you will have to re-add the _myInts field.

public class MyClass
{
  private readonly List<int> _myInts = new List<int>();
  public IEnumerable<int> MyInts => _myInts;
  // Alternatively (use one or the other, two properties cannot share a name):
  // public IReadOnlyCollection<int> MyInts => _myInts;
  ...
}

OK, so now we must be safe, right? Well, what if the user of your class does something like this?

var m = new MyClass();
var i = m.MyInts as List<int>;
i.Add(1);

The user is still able to modify your private readonly list. But don't give up! There is still one last thing to try. Here comes the AsReadOnly method to the rescue!

public class MyClass
{
  private readonly List<int> _myInts = new List<int>();
  public IEnumerable<int> MyInts => _myInts.AsReadOnly();
  // Alternatively (use one or the other, two properties cannot share a name):
  // public IReadOnlyCollection<int> MyInts => _myInts.AsReadOnly();
}

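The AsReadOnly method adds a read-only wrapper (a ReadOnlyCollection<int>) around the list. Any attempt to cast it back to List<int> will fail, and the mutating members it exposes through interfaces such as IList<int> throw a NotSupportedException. The casting code still compiles, though; it fails at runtime.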

var m = new MyClass();
var i = m.MyInts as List<int>; // Would set i to null
i.Add(1); // Would throw a null reference exception
...
var j = (List<int>) m.MyInts; // Would throw an InvalidCastException
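
The wrapper can still be cast to IList<int>, since ReadOnlyCollection<T> implements that interface; what it refuses is the write itself. A minimal sketch of that behavior follows. The ReadOnlyDemo class and its member names are made up for illustration and are not part of the example above.

```csharp
using System;
using System.Collections.Generic;

public static class ReadOnlyDemo
{
    private static readonly List<int> _myInts = new List<int> { 1, 2, 3 };

    // Expose the list through a read-only wrapper, as MyClass does.
    public static IEnumerable<int> MyInts => _myInts.AsReadOnly();

    // Number of elements in the private list.
    public static int Count => _myInts.Count;

    // Returns true if a write attempted through IList<int> is rejected.
    public static bool WriteIsRejected()
    {
        // This cast succeeds: ReadOnlyCollection<int> implements IList<int>.
        var asList = (IList<int>)MyInts;
        try
        {
            asList.Add(4);
            return false; // Not reached for a read-only wrapper.
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }
}
```

Calling ReadOnlyDemo.WriteIsRejected() should return true, and the private list keeps its three elements.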

Conclusion

To prevent modifications to a collection, expose it only through a ReadOnlyCollection<T> wrapper. The easiest way to do this is to use the AsReadOnly method. However, it is also possible to wrap the collection by creating the wrapper yourself, for example new ReadOnlyCollection<int>(_myInts); (the type lives in the System.Collections.ObjectModel namespace).

Also note that this is an O(1) operation. The elements in the collection are not copied, so wrapping adds only a very small overhead.
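
Because nothing is copied, the wrapper is a live view of the underlying list: changes the owner makes to the private list show up through a wrapper that was handed out earlier. A small sketch of this; the LiveViewDemo class is hypothetical.

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

public static class LiveViewDemo
{
    // Returns the wrapper's count before and after the owner adds an element.
    public static (int Before, int After) Run()
    {
        var inner = new List<int> { 1, 2 };
        ReadOnlyCollection<int> view = inner.AsReadOnly();

        int before = view.Count; // The wrapper sees the two initial elements.
        inner.Add(3);            // The owner still has full write access.
        int after = view.Count;  // The same wrapper now sees three elements.

        return (before, after);
    }
}
```

If callers need a snapshot that never changes, a copy (for example via ToList) would be required instead, at O(n) cost.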

C# Pitfalls of returning IEnumerable

Generic and specific return types

It is often preferable to work with types that are as general as possible. Doing so gives you more flexibility and lets you modify parts of the code without affecting other parts. One example of this is IEnumerable<T>, which is implemented by several generic collection types, such as HashSet<T>, List<T>, and LinkedList<T>.

If you write a method that works with a List<T> internally and returns it as an IEnumerable<T>, the caller does not need to know that it is actually a List<T> you are using. And if you later decide to change your implementation to use a different type internally, the caller does not need to be updated as long as the new type also implements IEnumerable<T>.

This might lead you to think that you should always return IEnumerable<T> when possible. But this might come back and bite you.

The multiple enumeration issue

Whenever you iterate over a collection via the IEnumerable<T> interface, it gets enumerated. This means that an enumerator object is created that steps through the members one at a time. If this is done more than once, the work of producing the sequence is repeated.

IEnumerable<string> names = GetNames();
foreach (var name in names)
{
  Console.WriteLine(name);
}
var sb = new StringBuilder();
foreach (var name in names)
{
  sb.Append(name).Append(" ");
}

The above code enumerates the names enumerable twice. In the best case this just introduces some extra work. But it can get really nasty if GetNames returns a database query. The code will then evaluate the query twice, possibly even getting different results each time.
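
The effect is easy to observe with a lazy source. The sketch below uses an iterator method as a hypothetical stand-in for a database query (EnumerationDemo, Runs and the names it yields are all invented for illustration) and counts how often the source actually runs.

```csharp
using System.Collections.Generic;

public static class EnumerationDemo
{
    // Counts how many times the body of GetNames has started running.
    public static int Runs;

    // Hypothetical stand-in for a lazy source such as a database query.
    public static IEnumerable<string> GetNames()
    {
        Runs++;
        yield return "Ada";
        yield return "Bob";
    }

    // Enumerates the same IEnumerable twice and reports how often the source ran.
    public static int EnumerateTwice()
    {
        Runs = 0;
        IEnumerable<string> names = GetNames(); // Nothing has run yet.
        foreach (var name in names) { }         // First pass runs the body.
        foreach (var name in names) { }         // Second pass runs it again.
        return Runs;
    }
}
```

EnumerateTwice returns 2: the call to GetNames only creates the enumerable, and each foreach starts the body from the top.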

Avoiding multiple enumerations

Fortunately it is quite easy to avoid multiple enumerations. The calling code can force the enumeration to happen once, during variable initialization.

IList<string> names = GetNames().ToList();
...

With this change the enumeration will only happen once and the rest of the code can stay the same. The ToList method allocates a new List<string> and populates it with the strings returned from the enumerator.
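
To see that materializing really limits the source to a single run, the sketch below counts executions of a lazy iterator. MaterializeDemo, Runs and the yielded names are hypothetical and stand in for whatever GetNames does in real code.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class MaterializeDemo
{
    // Counts how many times the body of GetNames has started running.
    public static int Runs;

    // Hypothetical lazy source, standing in for a query.
    public static IEnumerable<string> GetNames()
    {
        Runs++;
        yield return "Ada";
        yield return "Bob";
    }

    // Materializes once, iterates twice, and reports how often the source ran.
    public static int EnumerateMaterialized()
    {
        Runs = 0;
        IList<string> names = GetNames().ToList(); // The source runs once, here.
        foreach (var name in names) { }            // Iterates the in-memory list.
        foreach (var name in names) { }            // Same list, the source is not touched.
        return Runs;
    }
}
```

EnumerateMaterialized returns 1, no matter how many times the list is iterated afterwards.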

The price of being generic

Now, assume the GetNames method actually does use a List<string> internally and returns it as an IEnumerable<string>, but then the caller calls the ToList method. Isn't that just adding extra complexity? Yes! That is exactly right. And not only that: the call to ToList creates a new List<string> that is populated with the exact same elements that already exist in the original list. So there is an efficiency penalty involved as well.

So what to do? Well, that depends. If you are writing a library aimed at users outside your team or organization, in other words where you don't know how the returned type will be used and it is important that your code is flexible while the interface is stable, then you should probably return an IEnumerable<T>.

But if you are in control of all the source code involved, then you can afford to be more specific in what types you return.
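
As one possible illustration of "more specific", a method that already holds its data in memory could return IReadOnlyList<T>, which List<T> implements. This is only a sketch; the NameSource class and its contents are made up.

```csharp
using System.Collections.Generic;

public static class NameSource
{
    private static readonly List<string> _names = new List<string> { "Ada", "Bob" };

    // The return type promises an in-memory, indexable collection,
    // so callers can use Count and the indexer without calling ToList.
    public static IReadOnlyList<string> GetNames() => _names;
}
```

A caller can then write NameSource.GetNames().Count or NameSource.GetNames()[0] and iterate as often as needed without triggering any re-evaluation.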

Building strings in C# with good performance

A little bit on strings

Strings in C# are immutable. That means you cannot modify an existing string; you must create a new one.

Strings are also objects (sequential read-only collections of char values), so they are allocated on the managed heap and therefore managed by the Garbage Collector (GC). Hence, if you wish to "modify" an existing string, you use it as input and create a new string. For example, let's assume you have a string "Hello" that you wish to append your friend's name to.

var str = "Hello";
...
str = string.Concat(str, " Bob");

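What happens under the hood here is that String.Concat creates a new string, "Hello Bob", and the str variable is updated to reference the newly created string. The old string, "Hello", is left untouched. A string built at runtime that is no longer referenced becomes eligible for garbage collection. (A literal such as "Hello" is interned by the runtime and stays alive, but the principle is the same.)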
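
That the original string is untouched can be checked by keeping a second reference to it. A minimal sketch, with the ImmutabilityDemo class invented for illustration:

```csharp
public static class ImmutabilityDemo
{
    // Returns the original string, the concatenated one,
    // and whether both variables reference the same object.
    public static (string Original, string Updated, bool SameObject) Run()
    {
        var original = "Hello";
        var str = original;               // Both variables reference the same string.
        str = string.Concat(str, " Bob"); // Creates a new string; str now points at it.

        // original still references the unmodified "Hello".
        return (original, str, object.ReferenceEquals(original, str));
    }
}
```

Run returns ("Hello", "Hello Bob", false): the concatenation produced a second object instead of changing the first.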

Building strings in a loop

Now, assume that you wish to build up a long string by concatenating new words in a loop. We can simulate this behavior by creating a loop like the one below.

public static string BuildString()
{
  var str = string.Empty;
  for (var i = 0; i < 100_000; i++)
  {
    var append = (char)('a' + i % ('z' - 'a' + 1));
    str = string.Concat(str, append);
  }
  return str;
}

The code looks decent. But, since strings are immutable, a new string object is created for each iteration of the loop. That is 100 000 objects that need to be created and garbage collected.

Fortunately there is a better way to do this. If we could use a mutable object in the loop instead, we could just keep adding to it and avoid all this object creation. The .NET Framework provides a type just for this purpose, System.Text.StringBuilder. With some small modifications to the code we can use this instead.

private static string BuildString()
{
  var sb = new System.Text.StringBuilder(100_000);
  for (var i = 0; i < 100_000; i++)
  {
    var append = (char)('a' + i % ('z' - 'a' + 1));
    sb.Append(append);
  }
  return sb.ToString();
}        

The numbers

I did some measurements using the two different approaches above. In figure 1 below you can see the memory allocations.


Figure 1. Memory allocations

The "Exclusive Allocations" column shows how many allocations of managed memory were made in each method. In the method using String.Concat there were 100 000 allocations, and an additional 199 999 allocations were made by String.Concat itself. A plausible explanation for the extra ones is that string.Concat has no overload taking a char, so the call binds to Concat(object, object): the char is boxed in the calling method, and Concat then converts it to a string before allocating the result.

In the method using StringBuilder there is only a single allocation, and only two more are added by calls to other methods (one allocation in the StringBuilder constructor and a second one in the ToString method).

I also measured the time it took to execute the two different implementations. The String.Concat version took 1.3 seconds to run while the StringBuilder version took 1.0 milliseconds. That is 1 300 times faster.
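
If you want to try a rough version of the timing comparison yourself, a Stopwatch-based harness along the following lines is one option. It is only a sketch: the BuildStringTiming class and its method names are invented here, it is not the tooling used for the measurements above, and absolute numbers will vary with machine, runtime and build configuration.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class BuildStringTiming
{
    // Same loop as the String.Concat version above, with a configurable length.
    public static string WithConcat(int count)
    {
        var str = string.Empty;
        for (var i = 0; i < count; i++)
        {
            var append = (char)('a' + i % ('z' - 'a' + 1));
            str = string.Concat(str, append);
        }
        return str;
    }

    // Same loop as the StringBuilder version above, with a configurable length.
    public static string WithStringBuilder(int count)
    {
        var sb = new StringBuilder(count);
        for (var i = 0; i < count; i++)
        {
            var append = (char)('a' + i % ('z' - 'a' + 1));
            sb.Append(append);
        }
        return sb.ToString();
    }

    // Runs an action once and returns the elapsed wall-clock time in milliseconds.
    public static double Time(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

For example, BuildStringTiming.Time(() => BuildStringTiming.WithConcat(100_000)) can be compared with the same call using WithStringBuilder. A single run like this is noisy, so repeat it a few times before drawing conclusions.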

Conclusion

Memory allocation and garbage collection are expensive operations. As can be seen in the case above, it can be difficult to determine whether a piece of code will perform well or badly without knowledge of the underlying implementation (in this case how strings are implemented in the .NET Framework). Many insights about performance come with experience.

Also, remember that in almost all cases code readability is much more important than blazing fast performance. Do not prioritize optimization over readability unless you know you need to. However, there are cases when the code can be both clear and perform well. Try to aim for that.

C# Collections – Non-generic or Generic?

Non-generic vs. generic collections

When coding in C# you will pretty soon encounter collections (lists, queues, stacks, and so on). As soon as you wish to use a collection of some sort, you have the choice of using one from the System.Collections namespace or one from the System.Collections.Generic namespace. The most commonly used collection is the list, and the non-generic version of the list is the ArrayList.

Using the non-generic version forces you to cast the objects into the correct type.

ArrayList l = new ArrayList { new MyType() };
...
// MyType t = l[0]; This won't compile
MyType t = (MyType)l[0]; // But this will
MyOtherType t2 = (MyOtherType)l[0]; // Unfortunately this will also compile

As can be seen in the code above, the non-generic ArrayList is error-prone. There is no compile-time check that catches a cast to the wrong type; the mistake only surfaces at runtime as an InvalidCastException. The generic version of ArrayList is List<T>. Using it forces you to define the type of objects the list will contain when you create the list, and the static code analysis tools and the compiler will give you errors if you misuse the type.

var l = new List<MyType> { new MyType() };
...
MyType t = l[0]; // Now this compiles fine
MyType t2 = (MyType)l[0]; // This also works fine, but the cast is not needed
// MyOtherType t3 = (MyOtherType)l[0]; This won't compile

So: no boxing and unboxing of value types, no need to help the compiler understand what types of objects the list holds, and no runtime errors due to incorrect casts.

Now, List<T> is only one of many collection types in the System.Collections.Generic namespace; you will find generic counterparts of the commonly used non-generic collections there. My advice is to simply pretend that the non-generic collection types don't even exist, and to use the generic ones, always.
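
The difference can be seen side by side in a small sketch. The CastDemo class is hypothetical, and int and string are used instead of MyType and MyOtherType so that the example is self-contained.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class CastDemo
{
    // Returns true if reading an int out of an ArrayList as a string fails at runtime.
    public static bool WrongCastThrows()
    {
        var untyped = new ArrayList { 42 }; // The int is boxed and stored as object.
        try
        {
            _ = (string)untyped[0];         // Compiles, but the element is not a string.
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    // Reads the same value from a generic list.
    public static int TypedRead()
    {
        var typed = new List<int> { 42 };   // Stored as int, no boxing.
        return typed[0];                    // No cast needed.
    }
}
```

With List<int>, the equivalent of the faulty line, (string)typed[0], is rejected by the compiler instead of failing at runtime.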