SelectMany, Sorting and Grouping Objects

So here is the problem: I have a list of items that looks like this:

var collection = new[]
{
    new { Title = "One", References = "1;3" },
    new { Title = "Two", References = "2;3" },
    new { Title = "Three", References = "1;4" },
    new { Title = "Four", References = "4"}
};

The References field of each object is a semicolon-separated list of categories. What I want is one list for each distinct reference (in this example: 1, 2, 3 and 4) containing all the items that carry that reference. Items will be duplicated if they are in more than one category.

To sum it up, the expected output would be: One, Three, Two, One, Two, Three, Four (that is, One and Three for reference 1, Two for 2, One and Two for 3, Three and Four for 4).

After fooling around a bit, here is the query I came up with:

var query = from c in collection
            from d in c.References.Split(';')
            orderby d
            group c by d into groups
            select groups;

This does exactly what I want and produces the output I expected from the input data.

However, when I use LINQ, I generally call the extension methods directly instead of using the pretty query syntax. This is mostly because I want to understand what happens behind the scenes, and I have to admit that this query was quite a beast.

First, as there are two from clauses, there is a SelectMany somewhere. You probably know that SelectMany is a bit of a beast and that understanding it fully is quite a challenge compared to the other operators/extension methods. Also, I thought that the GroupBy clause was going to be tough, as we group the c items by d, which comes from the other collection.

I couldn’t figure out by myself how to write that query using extension methods, so I fell back on the good old Reflector that gave me a straight answer:

var query = collection.SelectMany(delegate (<>f__AnonymousType0 c) {
    return c.References.Split(new char[] { ';' });
}, delegate (<>f__AnonymousType0 c, string d) {
    return new { c = c, d = d };
}).OrderBy(delegate (<>f__AnonymousType1<<>f__AnonymousType0, string> <>h__TransparentIdentifier0) {
    return <>h__TransparentIdentifier0.d;
}).GroupBy(delegate (<>f__AnonymousType1<<>f__AnonymousType0, string> <>h__TransparentIdentifier0) {
    return <>h__TransparentIdentifier0.d;
}, delegate (<>f__AnonymousType1<<>f__AnonymousType0, string> <>h__TransparentIdentifier0) {
    return <>h__TransparentIdentifier0.c;
}).Select(delegate (IGrouping<string, <>f__AnonymousType0> groups) {
    return groups;
});

After reading that, it made much more sense. Here is what I came up with when writing it on my own:

var p = collection
    .SelectMany(c => c.References.Split(';'), (c, d) => new { c, d })
    .OrderBy(t => t.d)
    .GroupBy(t => t.d, t => t.c);

Much more readable. The idea here is that SelectMany outputs a sequence of anonymous-type objects, each pairing an item (c) with one of its references (d). This sequence is then sorted by OrderBy, and finally fed through a GroupBy that uses the d property as the grouping key and the c property as the projection in the resulting groups. Not that difficult after all…

Here is another version that is probably a bit clearer:
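To make the three stages easier to follow, here is a small console sketch that runs the same query on the sample data and formats one line per group. The class name SelectManyDemo and the method GroupTitlesByReference are hypothetical names chosen for this illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, not part of the original code.
public static class SelectManyDemo
{
    // Returns one line per reference, formatted as "key: title, title".
    public static List<string> GroupTitlesByReference()
    {
        var collection = new[]
        {
            new { Title = "One", References = "1;3" },
            new { Title = "Two", References = "2;3" },
            new { Title = "Three", References = "1;4" },
            new { Title = "Four", References = "4" }
        };

        // Stage 1: one (item, reference) pair for each reference of each item.
        var pairs = collection
            .SelectMany(c => c.References.Split(';'), (c, d) => new { c, d });

        // Stages 2 and 3: sort the pairs by reference, then group the items
        // under the reference they were paired with.
        var groups = pairs
            .OrderBy(t => t.d)
            .GroupBy(t => t.d, t => t.c);

        return groups
            .Select(g => g.Key + ": " + string.Join(", ", g.Select(c => c.Title)))
            .ToList();
    }

    public static void Main()
    {
        foreach (var line in GroupTitlesByReference())
        {
            Console.WriteLine(line);
        }
    }
}
```

With the sample data this yields four lines: "1: One, Three", "2: Two", "3: One, Two" and "4: Three, Four", which matches the expected output once the groups are flattened.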

var q = collection
    .SelectMany(c => c.References.Split(';'), (c, d) => new { Title = c.Title, Reference = d })
    .GroupBy(c => c.Reference, c => c.Title)
    .OrderBy(g => g.Key);

Note that this is a simplified version of the original issue, which was to do this with some ListItems retrieved from SharePoint. The objects were a bit more complicated, but the logic is the same.

Enumerable.Empty with the null Coalescing Operator

Today I read the latest entry on Eric Lippert’s blog. It is about the semantic difference between null and empty. An easy example is with collections: an empty collection is not the same as a non-existing (i.e. null) collection.

However, as mentioned there, how many times did you wish that the foreach statement would work on a null collection? How convenient would that be!

So far, what I was doing was:

foreach (var item in list ?? new List<SomeType>())
{
    //Do Stuff...
}

But something nicer exists: Enumerable.Empty. Using that, the code becomes:

foreach (var item in list ?? Enumerable.Empty<SomeType>())
{
    //Do Stuff
}

It’s actually longer, but it reads more easily. The intent is quite obvious: I bet you can show this to someone who doesn’t know about the null coalescing operator and they would get what it does! Unfortunately I’m now working on Java and my colleagues would burn me alive if they knew I secretly pledged allegiance to .NET…
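The snippets above leave list undeclared, so here is a minimal self-contained sketch, assuming list is a List<string>. The class and method names are made up for the example. Since List<string> converts implicitly to IEnumerable<string>, the whole ?? expression is typed as IEnumerable<string> and no cast is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, not part of the original code.
public static class NullSafeLoop
{
    // Counts the items of a list, treating a null list as an empty one.
    public static int CountItems(List<string> list)
    {
        int count = 0;

        // When list is null, the loop runs over an empty sequence
        // instead of throwing a NullReferenceException.
        foreach (var item in list ?? Enumerable.Empty<string>())
        {
            count++;
        }

        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountItems(null));
        Console.WriteLine(CountItems(new List<string> { "a", "b" }));
    }
}
```

Unlike new List<SomeType>(), Enumerable.Empty does not need a fresh list to be created just to iterate over nothing.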

LINQ to SQL and auto increment fields

When using LINQ to SQL in Visual Studio 2005, the table and column mapping has to be coded manually. This is not a very difficult job but it’s repetitive and boring (of course, if you are using Visual Studio 2008, the mapping code can be generated automatically).

I ran into an issue while following a guide on manually mapping database tables to classes. When I tried to insert a new record, leaving the identity field empty in the object, I had the following error:

"Cannot insert explicit value for identity column in table ‘Contexts’ when IDENTITY_INSERT is set to OFF."

It was pretty obvious that, as the identity column was an Int32, it defaulted to 0, and when LINQ tried to insert that row in the table, SQL Server raised an error because this auto-incremented field cannot be set explicitly.

This is simply solved by setting the IsDbGenerated property of the Column attribute on the field’s property, as shown in the following code.

private Int32 _id;

[Column(IsPrimaryKey = true, IsDbGenerated = true)]
public Int32 Id { get { return this._id; } set { this._id = value; } }

This setting tells LINQ not to supply a value for this field, as it is generated by the database at insert time.

Another very nice thing about this is that once you have inserted the object in the database, this field will hold the value generated by the database. Practically, this means that after executing the following code

DataTable.InsertOnSubmit(obj);
DataContext.SubmitChanges();

the obj.Id property will contain the value that was assigned by the database.

LINQ in Visual Studio 2005

A few months ago, we wanted to query an XML log file on our project. On that particular project, we can use .NET 3.5 features, while we have to stick to Visual Studio 2005.

For this task, LINQ seemed like a natural choice, provided that we could use it inside Visual Studio 2005 which doesn’t natively support the sexy LINQ query syntax.

At first, we thought that it was not possible to use LINQ in Visual Studio 2005, but after some research it turned out that we were wrong. I found some clues on how to achieve that in one of the comments on this post, written by Charles Young.

The LINQ libraries are located in the System.Core assembly, so it has to be referenced in the project. Then, each file that uses LINQ has to include a using directive for System.Linq.

A simple LINQ query

Let’s take a standard LINQ query on a String array

String[] persons = new String[] { "Philippe", "Steve", "Bill" };

var search = from p in persons
             where p.Contains("l")
             select p.ToUpper();

This is a fairly simple query, involving only filtering (where clause) and projection (select with transformation of the output). This native query is translated by the compiler in several steps; we just have to follow the pipeline until we find something that Visual Studio 2005 understands.

The first translation results in a statement that uses extension methods on the array:

var search = persons
    .Where(p => p.Contains("l"))
    .Select(p => p.ToUpper());

These extension methods come from the Enumerable class. However, extension methods are not natively supported in Visual Studio 2005.

Because extension methods are really static methods, the previous statement can be rewritten as direct calls to Enumerable’s static methods, passing the object the method appeared to be called on as the first parameter. In this particular example, it will be translated into:

var search = Enumerable.Select(
    Enumerable.Where(persons, p => p.Contains("l")),
    p => p.ToUpper());

Please note that this translation results in using Enumerable because for this particular case we are using LINQ to Objects. I suppose that using LINQ to SQL would result in using Queryable static methods, but I haven’t tried that yet.

Also note that there is a call to Enumerable.Select because our initial query involves a projection, in the sense that the result of the query (here a String that is a transformation of the input) is different from the input (here a plain String). If the projection statement of the initial query was "select p" (no transformation of the output), there would be no call to Enumerable.Select, since the where clause alone already produces the result. More on this below.

Still, this statement is not supported in Visual Studio 2005 because lambda expressions are not supported there.

The next step for the compiler is to translate these lambda expressions into anonymous delegates:

var search = Enumerable.Select(
    Enumerable.Where(
        persons,
        delegate(String p)
        {
            return p.Contains("l");
        }),
    delegate(String p)
    {
        return p.ToUpper();
    });

Even if this looks like something we could use in Visual Studio 2005, it is not. Visual Studio 2005 will complain because it is unable to infer types (and because of the var keyword that is also part of C# 3.0 hence not supported in Visual Studio 2005).

We have to specify types explicitly for the Enumerable methods:

IEnumerable<String> search = Enumerable.Select<String, String>(
    Enumerable.Where<String>(
        persons,
        delegate(String p)
        {
            return p.Contains("l");
        }),
    delegate(String p)
    {
        return p.ToUpper();
    });

There we are, a LINQ query that can be used inside Visual Studio 2005! It’s a bit confusing to write at first, but you’ll get used to it.

Now that we have seen a simple query, let’s have a look at the other query statements.

Ordering

Still using the same String array, here is the C# 3.0 style LINQ query

var search = from p in persons
             orderby p
             select p;

that translates into

IEnumerable<String> search = Enumerable.OrderBy<String, String>(
        persons,
        delegate(String p)
        {
            return p;
        });

As you can see, there is no need for an Enumerable.Select call as the projection returns the same object.

If you want to order descending, use the Enumerable.OrderByDescending method instead. It is also possible to specify the IComparer to be used by the OrderBy/OrderByDescending method, which allows you to order by custom objects or with a custom ordering.
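As an illustration of the IComparer overload, here is a sketch written in the same Visual Studio 2005 friendly style (explicit types, anonymous delegates). LengthComparer and OrderingDemo are hypothetical names invented for this example; the comparer orders strings by length and falls back to an ordinal comparison for equal lengths.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: shorter strings first, ordinal order on ties.
public class LengthComparer : IComparer<String>
{
    public int Compare(String x, String y)
    {
        int byLength = x.Length.CompareTo(y.Length);
        return byLength != 0 ? byLength : String.CompareOrdinal(x, y);
    }
}

// Hypothetical demo class, not part of the original code.
public static class OrderingDemo
{
    public static List<String> SortByLength(String[] persons)
    {
        // Same shape as the OrderBy call above, with the comparer
        // passed as an additional third argument.
        IEnumerable<String> search = Enumerable.OrderBy<String, String>(
            persons,
            delegate(String p)
            {
                return p;
            },
            new LengthComparer());

        return new List<String>(search);
    }

    public static void Main()
    {
        String[] persons = new String[] { "Philippe", "Steve", "Bill" };

        foreach (String p in SortByLength(persons))
        {
            Console.WriteLine(p);
        }
    }
}
```

With the persons array this gives Bill, Steve, Philippe, whereas the default alphabetical ordering would give Bill, Philippe, Steve.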

Joining

For joining, we need to define a struct in order to have two distinct lists of elements.

So, here is a simple struct:

struct Name
{
    public int Id;
    public String Value;

    public Name(int Id, String Value)
    {
        this.Id = Id;
        this.Value = Value;
    }
}

And here are two array declarations:

Name[] firstNames = new Name[] {
    new Name(1, "Philippe"),
    new Name(2, "Steve"),
    new Name(3, "Bill")
};

Name[] lastNames = new Name[] {
    new Name(1, "Vlérick"),
    new Name(2, "Balmer"),
    new Name(3, "Gates")
};

Using two arrays, we will be joining on the Id field of the Name struct.

The C# 3.0 query:

var search = from fn in firstNames
             join ln in lastNames on fn.Id equals ln.Id
             select fn.Value + " " + ln.Value;

Here is the translation:

IEnumerable<String> search = Enumerable.Join<Name, Name, int, String>(
    firstNames,
    lastNames,
    delegate(Name n)
    {
        return n.Id;
    },
    delegate(Name n)
    {
        return n.Id;
    },
    delegate(Name n1, Name n2)
    {
        return n1.Value + " " + n2.Value;
    });

This one is a bit tricky and needs some clarification.

The first two types given are the types of the objects that each collection contains. The third type is the type that the actual join will be made on and has to be the same on both collections. The last type is the type of the objects that will be stored in the returned collection. So, in this particular case, the first collection will contain Name objects, the second collection will contain Name objects as well, the join will be made on int values and the returned collection will contain String objects.

As for the parameters, the first two are the collections that contain the objects to use for the join. The next two parameters are the delegates that must return the value the join has to be made on (in this case, an int). The first delegate will be used with each object of the first collection (in this case, firstNames) and the second delegate will be used with each object of the second collection (lastNames). The last parameter is a delegate that receives the two joined objects as parameters and must return the given return type (here a String object).

Note that we don’t need to explicitly call the projection method (Enumerable.Select) as there is a selector delegate that builds the output object.
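One way to keep track of which delegate plays which role is to store each one in a named Func variable before the call. The sketch below does that in Visual Studio 2005 friendly syntax; JoinDemo, FullNames and the variable names are assumptions made for the illustration, and the Name struct is repeated so that the example stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct Name
{
    public int Id;
    public String Value;

    public Name(int Id, String Value)
    {
        this.Id = Id;
        this.Value = Value;
    }
}

// Hypothetical demo class, not part of the original code.
public static class JoinDemo
{
    public static List<String> FullNames(Name[] firstNames, Name[] lastNames)
    {
        // Key selector applied to each element of the first collection.
        Func<Name, int> firstKey = delegate(Name n) { return n.Id; };
        // Key selector applied to each element of the second collection.
        Func<Name, int> lastKey = delegate(Name n) { return n.Id; };
        // Builds the output object from each pair of matching elements.
        Func<Name, Name, String> result =
            delegate(Name fn, Name ln) { return fn.Value + " " + ln.Value; };

        IEnumerable<String> search = Enumerable.Join<Name, Name, int, String>(
            firstNames, lastNames, firstKey, lastKey, result);

        return new List<String>(search);
    }

    public static void Main()
    {
        Name[] firstNames = new Name[] {
            new Name(1, "Philippe"), new Name(2, "Steve"), new Name(3, "Bill") };
        Name[] lastNames = new Name[] {
            new Name(1, "Vlérick"), new Name(2, "Balmer"), new Name(3, "Gates") };

        foreach (String fullName in FullNames(firstNames, lastNames))
        {
            Console.WriteLine(fullName);
        }
    }
}
```

The join matches elements on the Id key and not on their position, so the two arrays do not have to be in the same order.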

Grouping

As a reminder, grouping means splitting the output into a sequence of groups whose elements share the same key value. The output of this type of query is a bit different from the previous queries.

A custom struct is needed to have an example object that we can test grouping on:

struct Student
{
    public String Name;
    public String Course;

    public Student(String Name, String Course)
    {
        this.Course = Course;
        this.Name = Name;
    }
}

We can then declare an array that we will query on

Student[] students = new Student[] {
    new Student("Philippe", "Data Structure"),
    new Student("Bill", "Marketing"),
    new Student("Steve", "Finance"),
    new Student("Bill", "Data Structure"),
    new Student("Steve", "Dance"),
    new Student("Bill", "Finance")
};

Let’s group on the Course field of the struct. Here is the C# 3.0 query:

var search = from s in students
             group s by s.Course;

As you can see, there is no projection. The output is an IEnumerable of IGrouping objects. An IGrouping object is a collection of objects that share a common key, here Student objects with String keys.

To iterate through this, we need two foreach loops. The first one will iterate through each IGrouping object, and the second one will iterate through each Student.

foreach (var course in search)
{
    Console.WriteLine(course.Key);
    foreach (var student in course)
    {
        Console.WriteLine("- " + student.Name);
    }
}

This is how this query translates in Visual Studio 2005:

IEnumerable<IGrouping<String, Student>> search = Enumerable.GroupBy<Student, String>(
    students,
    delegate(Student s)
    {
        return s.Course;
    });

And the loops to display the content:

foreach (IGrouping<String, Student> course in search)
{
    Console.WriteLine(course.Key);
    foreach (Student student in course)
    {
        Console.WriteLine("- " + student.Name);
    }
}

Again, you have to specify types everywhere in Visual Studio 2005.

Other Thoughts

One of the other painful things with Visual Studio 2005 is that there are no anonymous types, which are very handy to use with LINQ. Every type used in a query has to be written out as a class (or better, a struct).
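As a sketch of what that means in practice, a C# 3.0 projection such as select new { Name = p, Length = p.Length } needs a hand-written type in Visual Studio 2005. The NameLength struct and the NamedTypeDemo class below are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical named type standing in for the anonymous type
// new { Name = p, Length = p.Length }.
public struct NameLength
{
    public String Name;
    public int Length;

    public NameLength(String Name, int Length)
    {
        this.Name = Name;
        this.Length = Length;
    }
}

// Hypothetical demo class, not part of the original code.
public static class NamedTypeDemo
{
    public static List<NameLength> Measure(String[] persons)
    {
        // The result type has to be spelled out in the type arguments,
        // which is why the projected type needs a name.
        IEnumerable<NameLength> search = Enumerable.Select<String, NameLength>(
            persons,
            delegate(String p)
            {
                return new NameLength(p, p.Length);
            });

        return new List<NameLength>(search);
    }

    public static void Main()
    {
        String[] persons = new String[] { "Philippe", "Steve", "Bill" };

        foreach (NameLength item in Measure(persons))
        {
            Console.WriteLine(item.Name + ": " + item.Length);
        }
    }
}
```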

Conclusion

Using static methods from Enumerable (and possibly Queryable), it is possible to use LINQ queries inside Visual Studio 2005. However, queries are a bit more complicated to write.

A good practice would be to comment the intent of each query extensively, as the query statement itself is hard to read.

Resources

Here are some of the resources I used to write this entry: