September 2011

Volume 26 Number 09

The Working Programmer - Multiparadigmatic .NET, Part 10: Choosing an Approach

By Ted Neward | September 2011

In my last column (Part 9 of this series), I suggested that any time an article series gets close to double digits, the author is either pretentious enough to think his readers are actually interested in that subject that many times in a row, or he’s just too boneheaded to come up with a new topic. Readers are left to speculate as to which of those two cases is at work here.

Nevertheless, commentary from the field has made it clear that despite the dangers of traversing into double-digit territory, one more article on multiparadigmatic design seems necessary to try to tie all the individual elements together—to demonstrate how to use each of these different paradigms and choose among them for a real-world problem. “Real-world,” for our purposes, means non-trivial enough to hint at how the approach might be useful for problems that aren’t as simple as those chosen to solve in a magazine article.

Trying to create such a problem turns out to be more difficult than it might appear; either the idea is too complicated, with too many distractions in it to get a clear view of the solutions used; or the idea is too simple, allowing for too little variation in its implementation to illustrate how the different paradigms might each be used to solve it. Fortunately, to get started we can make use of some work that has already been done—in the form of Dave Thomas’ “Code Katas.”

Code Katas

On his Web site, Thomas writes about dropping his son Zachary off at karate practice and discovering that the dojo had no room left in the parents’ viewing area for him. This left him with 45 minutes to himself, and he began to play around with some code whose performance implications he’d been idly considering. He writes:

I just wanted to play with some code and experiment with a technique I hadn’t used before. I did it in a simple, controlled environment and I tried many different variations (more than I’ve listed here). And I’ve still got some more playing to do …

What made this a practice session? Well, I had some time without interruptions. I had a simple thing I wanted to try, and I tried it many times. I looked for feedback each time so I could work to improve. There was no pressure: the code was effectively throwaway. It was fun: I kept making small steps forward, which motivated me to continue. Finally, I came out of it knowing more than when I went in.

Ultimately, it was having the free time that allowed me to practice. If the pressure was on, if there was a deadline to deliver the blog search functionality, then the existing performance would have been acceptable and the practice would never have taken place. But those 45 pressure-free minutes let me play.

So, my challenge for the day: See if you can carve out 45 to 60 minutes to play with a small piece of code. You don’t necessarily have to look at performance; perhaps you could play with the structure, or the memory use or the interface. In the end, it doesn’t matter. Experiment, measure, improve.

In other words, the code kata is a (relatively) simple problem—one that’s not hard to grasp conceptually—that offers a framework in which to explore. In our particular case, the goal will be to design. We’ll base our exploration on Thomas’ “Kata Four: Data Munging.”

Kata Four: Data Munging

The code kata in this case has three parts, which we will take in discrete steps, one at a time, designing as we go and considering each of the axes available to us within C# (though, again, solutions in Visual Basic are equally approachable).

Step One

Step one reads as follows:

In weather.dat (http://bit.ly/ksbVPs) you’ll find daily weather data for Morristown, N.J., for June 2002. Download this text file, then write a program to output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).

The weather.dat file looks like what’s shown in Figure 1 (except it stretches out for the full 30 days).

Figure 1 The Weather.dat Text File

MMU June 2002                                
Dy MxT MnT AvT HDDay AvDP 1HrP TPcpn WxType PDir AvSp Dir MxS SkyC MxR MnR AvSLP
1 88 59 74   53.8   0.00 F 280 9.6 270 17 1.6 93 23 1004.5
2 79 63 71   46.5   0.00   330 8.7 340 23 3.3 70 28 1004.5
3 77 55 66   39.6   0.00   350 5.0 350 9 2.8 59 24 1016.8
4 77 59 68   51.1   0.00   110 9.1 130 12 8.6 62 40 1021.1
. ..                              
28 84 68 76   65.6   0.00 RTFH 280 7.6 340 16 7.0 100 51 1011.0
29 88 66 77   59.7   0.00   040 5.4 020 9 5.3 84 33 1020.6
30 90 45 68   63.6   0.00 H 240 6.0 220 17 4.8 200 41 1022.7
mo 82.9 60.5 71.7 16 58.8   0.00     6.9     5.3      


It’s immediately clear that the file is not comma-separated, but positionally separated—the “MxT” (max temp) column always starts at the same place and the “MnT” (min temp) column starts at another fixed location. A glance at the file in Visual Studio reveals that each line is precisely 90 characters long; parsing this file into a line-by-line, string-by-string arrangement will be trivial, because we can split on newlines or 90-character lengths.

After that, however, things become less clear. The exercise asks us only to output the smallest temperature spread for the month, which we could do by hardcoding position values and calling String.Substring on each line, but that misses some of our goals. So let’s take things just a step further and refine the problem to also require the ability to examine any data element for any day within the month.

Remember, the core of multiparadigmatic design is to identify the commonality/variability axis, and for this problem thus far, the commonality/variability axis is pretty unintuitive—though it’s easy to see that we need to parse a text file and examine the results. That data is essentially tabular in nature, and can be looked at in several different ways based on the paradigms we’ve investigated.

From a procedural perspective, the data structure is the definition of each line, and an object-oriented solution doesn’t give us much more. It’s fairly easy to simply take the procedural data structure and toss in the methods to parse the data on the class. Essentially, we’re back to “smart data,” which although not particularly object-ish, works. For the moment.

Nothing here suggests the practical use of meta-programming tactics yet, or of dynamic programming, except perhaps the data structure allowing some kind of lookup based on column name, so that day1[“MxT”] would return “88.” The functional approach could be interesting, because parsing can be seen as a collection of functions, running one after another, taking input and returning strings (or other parsed data) and more functions to parse the rest of the file as results. Such a technique is known as parser combinators and is a discussion well beyond the scope of this particular article.
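That day1[“MxT”] idea is worth a quick sketch before moving on. A minimal, hypothetical version (the DataRow name and its constructor shape are mine, not part of any solution so far) just hides a Dictionary behind an indexer:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a row type whose indexer looks up values by
// column name, so that day1["MxT"] returns "88" without committing
// to a fixed data structure up front.
public class DataRow
{
    private readonly Dictionary<string, string> values =
        new Dictionary<string, string>();

    public DataRow(string[] columnNames, string[] columnValues)
    {
        for (int i = 0; i < columnNames.Length; i++)
            values[columnNames[i]] = columnValues[i];
    }

    public string this[string columnName]
    {
        get { return values[columnName]; }
    }
}

public static class Program
{
    public static void Main()
    {
        DataRow day1 = new DataRow(
            new[] { "Dy", "MxT", "MnT" },
            new[] { "1", "88", "59" });
        Console.WriteLine(day1["MxT"]); // prints "88"
    }
}
```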

None of these approaches seem to help much; so far, the best solution seems to be a procedural approach, doing something like what’s shown in Figure 2.

Figure 2 Taking a Procedural Approach

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace DataMunger
{
  public struct WeatherData
  {
    public int Day;
    public float MxT;
    public float MnT;
    // More columns go here
  }
  class Program
  {
    static void Main(string[] args)
    {
      TextReader reader = new StreamReader("weather.dat");
           
      // Read past first four lines
      reader.ReadLine();
      reader.ReadLine();
      reader.ReadLine();
      reader.ReadLine();
 
      // Start reading data
      List<WeatherData> weatherInfo = new List<WeatherData>();
      while (reader.Peek() >= 0)
      {
        string line = reader.ReadLine();
 
        // Guard against "mo" summation line
        if (line.Substring(0, 4) == "  mo")
          continue;
 
        WeatherData wd = new WeatherData();
        wd.Day = Int32.Parse(line.Substring(0, 4).Replace("*", " "));
        wd.MxT = Single.Parse(line.Substring(5, 6).Replace("*", " "));
        wd.MnT = Single.Parse(line.Substring(12, 6).Replace("*", " "));
        // More parsing goes here
 
        weatherInfo.Add(wd);
      }
 
      Console.WriteLine("Min spread: " +
        weatherInfo.Select((wd) => wd.MxT - wd.MnT).Min());
    }
  }
}

Note that this code already presents a problem; when the parsing reaches day 9, an asterisk shows up in the MnT column, indicating that this is the “low” for the month, just as day 26 is the “high” for the month. This is solvable by stripping the “*” out of the string, but the point is made: The procedural axis focuses on establishing the data structure and operating on it—in this case, parsing it.

Also, note the use of List<> here; just because we’re using a procedural approach to parse the file doesn’t mean we can’t take advantage of useful classes elsewhere within the Base Class Library. This makes it trivial to figure out the smallest spread—a single LINQ-to-Objects query will give the needed result. (We should, of course, grab which day that spread occurs on, to be faithful to the problem, but that’s really just a matter of returning and Min()ing on something other than a raw float; consider that an exercise for the reader.)
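For the curious, that exercise might look something like the following sketch, which reuses the WeatherData struct from Figure 2 and orders rows by spread rather than projecting to a raw float (the sample data here is abbreviated, not the full month):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct WeatherData
{
    public int Day;
    public float MxT;
    public float MnT;
}

public static class Program
{
    public static void Main()
    {
        // Abbreviated stand-in for the parsed weather.dat rows
        List<WeatherData> weatherInfo = new List<WeatherData>
        {
            new WeatherData { Day = 1, MxT = 88, MnT = 59 },
            new WeatherData { Day = 2, MxT = 79, MnT = 63 },
            new WeatherData { Day = 3, MxT = 77, MnT = 55 },
        };

        // Order whole records by spread, then take the first,
        // so the day number comes along for the ride
        WeatherData smallest = weatherInfo
            .OrderBy(wd => wd.MxT - wd.MnT)
            .First();
        Console.WriteLine("Day {0} has the smallest spread ({1})",
            smallest.Day, smallest.MxT - smallest.MnT);
        // Day 2 has the smallest spread (16)
    }
}
```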

Step Two

We could spend a lot of time projecting how we might write “reusable” code for this, but doing so would be like trying to predict the future; this code works, so let’s move on to the next step in the kata:

The file football.dat (http://bit.ly/lyNLya) contains the results from the English Premier League for 2001/2002. The columns labeled “F” and “A” contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them). Write a program to print the name of the team with the smallest difference in “for” and “against” goals.

More text parsing. Joy. The football.dat file is shown in Figure 3; it’s similar in many ways to weather.dat (Figure 1), yet different enough to merit some different parsing code.

Figure 3 The Football.dat Text File

  Team P W L D F   A Pts
1. Arsenal 38 26 9 3 79 - 36 87
2. Liverpool 38 24 8 6 67 - 30 80
3. Manchester_U 38 24 5 9 87 - 45 77
4. Newcastle 38 21 8 9 74 - 52 71
5. Leeds 38 18 12 8 53 - 37 66
6. Chelsea 38 17 13 8 66 - 38 64
7. West_Ham 38 15 8 15 48 - 57 53
8. Aston_Villa 38 12 14 12 46 - 47 50
9. Tottenham 38 14 8 16 49 - 53 50
10. Blackburn 38 12 10 16 55 - 51 46
11. Southampton 38 12 9 17 46 - 54 45
12. Middlesbrough 38 12 9 17 35 - 47 45
13. Fulham 38 10 14 14 36 - 44 44
14. Charlton 38 10 14 14 38 - 49 44
15. Everton 38 11 10 17 45 - 57 43
16. Bolton 38 9 13 16 44 - 62 40
17. Sunderland 38 10 10 18 29 - 51 40
18. Ipswich 38 9 9 20 41 - 64 36
19. Derby 38 8 6 24 33 - 63 30
20. Leicester 38 5 13 20 30 - 64 28


If we consider these two programs independently, it’s pretty easy to see that, individually, each is solved in much the same way—there’s not a lot of point in rehashing that exercise for football.

Step Three

The last step to the kata, however, leads us to the pay dirt of the exercise:

Take the two programs written previously and factor out as much common code as possible, leaving you with two smaller programs and some kind of shared functionality.

This is intriguing, and factors right into commonality/variability analysis.

Commonality/Variability

Because we now have more than just one problem to examine—that is to say, a family of concerns—it becomes easier to tease out the common and the variable between the two problems. That analysis, starting with analysis of the text files, reveals the exercise essentially boils down to two steps: parsing the file into a list of row data, and then examining that data in terms of some kind of calculation or analysis.

There are some issues to consider when parsing the text files:

  • Both file formats have a “header” that needs to be ignored.
  • Both file formats are positionally based.
  • Both file formats have lines that will need to be ignored (the “mo” summation line in weather.dat, the “------” visual marker in football.dat).
  • Weather.dat has some empty columns; football.dat does not.
  • Both file formats support principally numeric and string columns (except that weather.dat also includes “*” values, which may need to be captured somehow).

Calculating the results depends strongly on what the parsed data ends up like, so it seems reasonable to begin there. We have several approaches, based on each paradigm, from which we could start:

Procedural This axis focuses on capturing commonality in data structures. However, with two different file formats, that’s clearly where we want to capture variability, so procedural looks like it’s getting the boot. It might be possible to create some kind of data structure that captures both file formats, but it’s probably going to be awkward compared to the other paradigms.

Object-Oriented The commonality between the two files suggests using a base abstract “TextParser” class to provide base parsing functionality, including the ability to skip lines. Variability comes with parsing each line, which means that subclasses must override some kind of “ParseLine” method to do the actual line-by-line parsing. How the parsed values are retrieved out of the TextParser subtype, however—in order to do the min/max comparison—could be tricky, because the types of the columns will also vary. Historically, the Microsoft .NET Framework solved this (with SQL datasets) by returning objects, which we could use if necessary. But that introduces potential type-safety errors, because objects would need to be downcast to be useful, and that might be dangerous.
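To make that split concrete, here’s one possible sketch of the object-oriented approach; the TextParser, ShouldSkip and ParseLine names are illustrative rather than canonical, and the object[] rows show exactly the downcasting concern just described:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Sketch: the abstract base owns line reading and skipping;
// subclasses own the per-line parsing.
public abstract class TextParser
{
    public List<object[]> Parse(TextReader reader)
    {
        List<object[]> rows = new List<object[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (ShouldSkip(line))
                continue;
            rows.Add(ParseLine(line));
        }
        return rows;
    }

    protected abstract bool ShouldSkip(string line);
    protected abstract object[] ParseLine(string line);
}

// Weather-specific subclass: knows the column positions for weather.dat
public class WeatherParser : TextParser
{
    protected override bool ShouldSkip(string line)
    {
        // Headers, the "mo" summary and blank lines don't start with a digit
        return line.Trim().Length == 0 || !Char.IsDigit(line.Trim()[0]);
    }

    protected override object[] ParseLine(string line)
    {
        return new object[]
        {
            Int32.Parse(line.Substring(0, 4).Replace("*", " ")),
            Single.Parse(line.Substring(5, 6).Replace("*", " ")),
            Single.Parse(line.Substring(12, 6).Replace("*", " "))
        };
    }
}
```

A caller gets back rows of object values and must downcast each one ((int)row[0], (float)row[1]), which is precisely the type-safety trade-off noted above.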

Meta Several different solutions present themselves along the meta-object/meta-programmatic range. The attributive approach suggests that the TextParser class could take a “Record” type, in which each described column has a start/length custom attribute describing how to parse the lines, like so:

public struct WeatherData
{
  [Parse(0, 4)]
  public int Day;
  [Parse(5, 6)]
  public float MxT;
  [Parse(12, 6)]
  public float MnT;
}

The TextParser could then be parameterized to take the Record type (TextParser<RecordT>) and use the custom attributes at run time to discover how to parse each line. This would then return a List of Records (List<RecordT>), from which we could do the aforementioned calculations.
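A sketch of what that might look like follows; the ParseAttribute type and the reflection loop are my invention for illustration, not an existing API:

```csharp
using System;
using System.Reflection;

// Hypothetical custom attribute recording where each column lives
[AttributeUsage(AttributeTargets.Field)]
public class ParseAttribute : Attribute
{
    public int Start { get; private set; }
    public int Length { get; private set; }
    public ParseAttribute(int start, int length)
    {
        Start = start;
        Length = length;
    }
}

// Generic parser that discovers column positions via reflection
public class TextParser<RecordT> where RecordT : new()
{
    public RecordT ParseLine(string line)
    {
        object record = new RecordT(); // boxed so field sets stick on structs
        foreach (FieldInfo field in typeof(RecordT).GetFields())
        {
            object[] attrs =
                field.GetCustomAttributes(typeof(ParseAttribute), false);
            if (attrs.Length == 0)
                continue;
            ParseAttribute pa = (ParseAttribute)attrs[0];
            string raw = line.Substring(pa.Start, pa.Length)
                             .Replace("*", " ").Trim();
            field.SetValue(record, Convert.ChangeType(raw, field.FieldType));
        }
        return (RecordT)record;
    }
}

public struct WeatherData
{
    [Parse(0, 4)]  public int Day;
    [Parse(5, 6)]  public float MxT;
    [Parse(12, 6)] public float MnT;
}
```

Usage is then just new TextParser&lt;WeatherData&gt;().ParseLine(line) per data line, with a List&lt;WeatherData&gt; accumulating the results.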

Alternatively, generative programming suggests that a source format of some kind describes the text file, and the parser for each kind of file is generated based on that source format.

Dynamic Taking a dynamic approach is a bit strange, given it’s a newer approach for .NET, but here we might imagine a TextParser class taking a string that describes how to parse each file. It’d be almost like a tiny programming language in its own right, perhaps incorporating regular expressions, or (in our case) just using positional data, because that’s all that’s required for this problem:

string parseCommands =
  "Ignore; Ignore; Ignore; Ignore; Repeat(Day:0-4, MxT:5-6, MnT:12-6)";
TextParser parser = new TextParser(parseCommands);
List<string> MxTData = parser["MxT"];
List<string> MnTData = parser["MnT"];

This is clearly different from the meta-approach in that it lacks the type-safety the meta-approach could provide. However, a meta-approach requires compile-time changes if the file format changes, whereas a dynamic one, because it acts on name-based variability, could roll with the changes. The dynamic approach offers more flexibility at the cost of more complexity internally.
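To give a feel for the internals, here’s a hypothetical sketch of how such a TextParser might digest the column specs in that little command language; the grammar and the ParseCommandReader name are invented for illustration, matching only the example string above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// A parsed "Name:start-length" column spec from the command string
public class ColumnSpec
{
    public string Name;
    public int Start;
    public int Length;
}

public static class ParseCommandReader
{
    public static List<ColumnSpec> ReadSpecs(string commands)
    {
        List<ColumnSpec> specs = new List<ColumnSpec>();
        // Match entries like "Day:0-4" inside the Repeat(...) clause
        foreach (Match m in Regex.Matches(commands, @"(\w+):(\d+)-(\d+)"))
        {
            specs.Add(new ColumnSpec
            {
                Name = m.Groups[1].Value,
                Start = Int32.Parse(m.Groups[2].Value),
                Length = Int32.Parse(m.Groups[3].Value)
            });
        }
        return specs;
    }
}
```

Given the specs, each column value for a data line is just line.Substring(spec.Start, spec.Length), keyed by spec.Name; that name-based lookup is what lets the format change without recompiling.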

The source format used by the generative meta-approach could be what’s passed at run time, building the parser at run time instead of at source time.

Functional In some ways, the functional approach is the most curious, as it would suggest that the algorithm—the parsing—is where the variability lies, and therefore the TextParser class should take one or more functions that describe how to parse the text, whether line-by-line or column-by-column, or both. For example, two things the TextParser needs to know how to do are ignore non-parseable lines and break a line down into constituent parts. These could be established as function instances on TextParser itself, passed in either through a constructor or set via properties, as shown in Figure 4.

Figure 4 A Functional Approach

TextParser<WeatherData> parser = new TextParser<WeatherData>();
parser.LineParseVerifier =
  (line) =>
  {
    if ( (line.Trim().Length == 0) || // Empty line
         (line.Contains("MMU")) ||    // First header line
         (line.Contains("Dy")) ||     // Second header line
         (line.Contains("mo")))
        return false;
    else
      return true;
  };
parser.ColumnExtracter =
  (line) =>
  {
    WeatherData wd = new WeatherData();
    wd.Day = Int32.Parse(line.Substring(0, 4).Replace("*", " "));
    wd.MxT = Single.Parse(line.Substring(5, 6).Replace("*", " "));
    wd.MnT = Single.Parse(line.Substring(12, 6).Replace("*", " "));
    return wd;
  };
List<WeatherData> results = parser.Parse("weather.dat");

TextParser uses the parameterized type solely to capture the WeatherData type for greater type-safety; it could be written to return generic System.Object if desired or if easier.
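For readers wanting to see behind the curtain, here’s one way the TextParser&lt;T&gt; used in Figure 4 might look inside; this is a sketch under my own assumptions, not a definitive implementation:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Sketch: TextParser<T> owns the file I/O and delegates both decisions --
// keep this line? how to turn it into a T? -- to function-valued properties.
public class TextParser<T>
{
    public Func<string, bool> LineParseVerifier { get; set; }
    public Func<string, T> ColumnExtracter { get; set; }

    public List<T> Parse(TextReader reader)
    {
        List<T> results = new List<T>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Skip lines the verifier rejects (headers, summaries, blanks)
            if (LineParseVerifier != null && !LineParseVerifier(line))
                continue;
            results.Add(ColumnExtracter(line));
        }
        return results;
    }

    public List<T> Parse(string filename)
    {
        using (TextReader reader = new StreamReader(filename))
            return Parse(reader);
    }
}
```

Notice how little weather-specific knowledge lives here: everything format-dependent arrives as a function, which is exactly where this paradigm puts the variability.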

Wrapping Up

Clearly there’s no real “one way” to approach the problem—as other requirements come into focus, we’d get a better sense of where the variabilities are, and thus a better sense of the commonalities that need to be captured. This, by the way, highlights the real power of refactoring: Because we can’t predict the future, refactoring code means we can be wrong about our commonality/variability analysis and yet not have to “throw it all away and start over” when these errors come to light.

Multiparadigm design is not an easy subject to grasp, particularly not in one sitting. It took 10 articles to describe, and it’s quite reasonable to expect that it will take much longer to incorporate into your own thinking. But doing so can lead to much more flexible and comfortable designs over time, and even offer solutions to thorny problems that leave other developers a little wild-eyed at how you managed to see that solution through the haze of the problem. Remember, in the end, a multiparadigmatic approach begins and ends with commonality and variability—keep that in mind, and lots of things will begin to sort themselves out.

Toward that end, I strongly suggest readers take the Data Munging kata and deliberately try to solve each of the two file-parse exercises using each of the five different paradigms separately, then look for ways to combine them to create powerful, reusable code. What happens when object-oriented is paired up with functional or dynamic? Does the solution get easier or more complicated? Then take the approach and throw a different file format at it—how does a CSV file format screw it up? More importantly, do this during some empty moments in between stories or project meetings; the experience gained from this will pay off in spades when you attempt to do this kind of design under the gun of a real production deadline.

As already mentioned but worth repeating: multiparadigm languages are in common use and seem to be here to stay. The various Visual Studio 2010 languages each exhibit some degree of each of these paradigms; C++, for example, has some parametric meta-programming facilities that aren’t possible in managed code, owing to the way that the C++ compiler operates, and just recently (in the latest C++0x standard) it gained lambda expressions. Even the much-maligned ECMAScript/JavaScript/JScript language can do objects, procedures, meta-programming, and dynamic and functional paradigms; in fact, much of jQuery is built on these ideas. The sooner these ideas take root in your brain, the sooner you’ll be able to code your way through the stickiest of situations.

Happy coding!


Ted Neward is a principal with Neward & Associates, an independent firm specializing in enterprise .NET Framework and Java platform systems. He has written more than 100 articles, is a C# MVP, INETA speaker, and has authored or coauthored a dozen books, including “Professional F# 2.0” (Wrox, 2010). He consults and mentors regularly—reach him at ted@tedneward.com if you’re interested in having him come work with your team, or read his blog at http://blogs.tedneward.com.

Thanks to the following technical expert for reviewing this article: Anthony D. Green