Government 2013

Volume 28 Number 10A

HTML - Leverage Existing Web Assets as Data Sources for Apps

By Frank La Vigne | Government 2013

We live in exciting times—we’re experiencing a major shift from a Web-centric industry to one driven by mobile apps. This transition will redefine our industry and even the rest of the world. Despite this, many state and local governments have not yet begun producing mobile apps. However, most do have a Web site and some even have a social media presence.

Why are state and local governments so Web savvy, yet lagging in the mobile revolution? The answer may be complex, but it likely all comes down to money and organizational inertia. Web technology is now a fairly mainstream and effective way to communicate with citizens. Given the wide deployment of Web hosting services, there’s ample competition to reduce the costs of Web publishing. The same applies to social media. With hundreds of millions of users on Twitter and Facebook, even the most remote community can be assured that a fair share of its constituents can be reached.

Services and Mobile Apps

Along with widespread deployment of smartphones, the market has segmented to such a degree that developing a mobile app quickly escalates to a multiplatform development story with multiple codebases. Furthermore, many Web-savvy developers and architects have already moved to a service-oriented architecture (SOA). 

SOA offers a number of benefits, such as separation of concerns and scalability. It also provides a nice platform-agnostic means of publishing data. The code on the mobile device becomes more about rendering the data or “painting the glass” for the consuming device. With the majority of business logic centralized on the server, multiplatform development becomes more manageable. Each client application needs only a basic codebase, making it simpler and more maintainable, as shown in Figure 1.

Figure 1 Service-Oriented Architecture

With these benefits in mind, why aren’t more state and local governments adopting this approach? Once again, it boils down to budget. Quite often, the costs in terms of money, time and political determination are too high to reengineer systems and publishing processes. All this combines to make app creation a non-starter for this market segment.

Given these constraints, what you can do is leverage your existing Web assets and set up a façade that mimics the SOA approach. It may not be an ideal solution, but it can enable cash-strapped organizations to provide new mobile services to their constituents, as well as offer a roadmap for future enhancements.

A Common Scenario

Web sites are marvels of engineering. Using protocols and standards reaching back to the 1990s, Web site publishers can broadcast information and provide automated services to a worldwide audience at an extremely low cost. One way to repurpose Web-based content is simply to do nothing at all. Virtually all mobile platforms ship with a native Web browser. This means any Web site will be accessible on any mobile device with an Internet connection. There’s no additional cost or effort to this approach; it’s free.

However, many sites don’t render gracefully on smaller screens and usability suffers accordingly. Fortunately, this can be mitigated by using CSS media queries. Like CSS, CSS media queries provide guidance on how content should render based on certain parameters, such as screen size and orientation. Best of all, you can add CSS media queries to your site and they’ll apply only to browsers that understand them and be ignored by browsers that don’t. Still, though CSS media queries can help bridge the gap between desktop and mobile screen sizes, this technology won’t create native mobile apps.

Why Write an App?

The question today for cross-platform mobile developers is: Mobile-optimized Web site or native app? While developers debate which is better, business decision makers and stakeholders are listening to their customers and the market. Native apps optimized for a particular platform appear to be winning. Native apps have access to device-specific APIs, can access more on-device hardware, and can be written to take advantage of each mobile platform’s aesthetic or design language.

Recently, users have been doing what they do best: behaving unexpectedly. While on their desktop, they’ll consult a search engine such as Bing or Google. On their mobile devices, they’ll look for apps. The people I see doing this tend to be both business decision makers and general users. When I ask why they behave differently on different devices, responses range from, “I don’t know why, I just do,” to, “App store results are more curated than search engine results,” to, “It will look better on this screen.”

If these observations are indicative of the general user population, then Web developers are on the cusp of apps becoming the new Web site. Apps will be the must-have technology to engage with citizens more effectively and more cost-effectively.

The Multiplatform Burden

One major challenge to making app development cost-effective is multiplatform support. In order to reach the widest audience, state and local governments must develop applications for Android, iOS, Windows Phone and BlackBerry devices. That means at least four different codebases to write, maintain and manage.

Private sector organizations generally pick one or two platforms on which to focus. State and local governments may not have that option because there could be regulatory requirements to make information accessible to everyone. Laws vary from locale to locale, but cutting off citizens from information based on their choice of mobile platform seems like an unwise decision in the public sector.

It would be so much easier if more state and local governments implemented the SOA approach to their infrastructures. Rearchitecting and reengineering seem like unnecessary obstacles.

Another Way to Look at HTML

Nearly every state and local government entity has a Web site, each one supported by databases, content management systems and workflows. What if this existing infrastructure could be repurposed to serve mobile apps? What if these existing assets could have a façade layer placed around them so they could mimic the SOA approach?

Structured Data, Unstructured Data and Semi-Structured Data

The industry has been dealing with structured data for a long time. Lately there’s been a lot of talk about unstructured data. To many, unstructured data represents the final frontier of computer science. This is data that lacks structure, but has meaning and purpose. Faster processors, larger data stores and smarter, if not artificially intelligent, algorithms are needed to process this raw data into logical pieces of information.

In addition to structured and unstructured data, I propose a third data classification: semi-structured data. This is data with some internal structure that both presents and occludes information. HTML represents the simplest and most widely deployed form of semi-structured data.

Take a look at this sample HTML from a fairly typical HTML page (it’s hard to call this “unstructured data”):

<tr><td width="30%">Events</td>
<tr><td><a href="events/julypicnic.htm">Fourth of July Picnic</a></td></tr>
<tr><td><a href="events/labor.htm">Labor Day Parade</a></td></tr>
<tr><td><a href="events/sept11.htm">Sept 11 Candlelight vigil</a></td></tr>


From this snippet of HTML, it’s clear that HTML does contain structure. It may not be a structure you prefer, such as XML or JSON, but the sample clearly contains information that can be extracted. The question is how to extract it in a programmatic way that can be configured. The challenge is separating the HTML structures from the data.

There are a number of ways to do this: regular expressions, string matching, even CSS selectors. For my Screen Scraper Utility Kit, I chose string matching. It’s simple, performs well and is easy to maintain.
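To make the string-matching idea concrete, here is a minimal sketch that pulls the event titles and links out of the sample HTML above. It is not the Screen Scraper Utility Kit's code; the function names `extractBetween` and `extractAll` and the specific marker strings are assumptions chosen for illustration.

```javascript
// Finds the first `start` marker at or after `fromIndex` and the next `end`
// marker after it. Returns the text in between plus the position just past
// the end marker, or null if either marker is missing.
function extractBetween(text, start, end, fromIndex = 0) {
  const startIndex = text.indexOf(start, fromIndex);
  if (startIndex === -1) return null;
  const valueStart = startIndex + start.length;
  const endIndex = text.indexOf(end, valueStart);
  if (endIndex === -1) return null;
  return {
    value: text.slice(valueStart, endIndex),
    nextIndex: endIndex + end.length
  };
}

// Collects every non-overlapping start...end match in the text.
function extractAll(text, start, end) {
  const values = [];
  let cursor = 0;
  let match;
  while ((match = extractBetween(text, start, end, cursor)) !== null) {
    values.push(match.value);
    cursor = match.nextIndex;
  }
  return values;
}

const sampleHtml = [
  '<tr><td width="30%">Events</td>',
  '<tr><td><a href="events/julypicnic.htm">Fourth of July Picnic</a></td></tr>',
  '<tr><td><a href="events/labor.htm">Labor Day Parade</a></td></tr>',
  '<tr><td><a href="events/sept11.htm">Sept 11 Candlelight vigil</a></td></tr>'
].join("\n");

// Each row looks like: events/labor.htm">Labor Day Parade</a>
const events = extractAll(sampleHtml, '<tr><td><a href="', "</td>").map(function (row) {
  return {
    link: row.slice(0, row.indexOf('"')),
    title: extractBetween(row, '">', "</a>").value
  };
});

console.log(events);
```

The header row is skipped because it never matches the start marker. The same two-marker cut is then reused inside each row, which is what keeps the approach simple to maintain.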

Screen Scraping?

Screen scraping—or the process of pulling down content from a display format, parsing it and then using it—generally has a bad reputation. Many consider it the option of last resort.

Why the reluctance to implement screen scraping? The objections generally fall into two categories:

  1. Using a presentation layer as a data layer
  2. Lack of resilience

Using one application’s presentation layer as another’s data layer is generally inferior design. However, in cases where there’s no viable alternative, it can provide a bridge you can build rather inexpensively. As for screen scraping’s lack of resilience, the problem is that should the source screen data change in any way, the reliant application will fail.

In this scenario, I have no Web API and no budget to build one. Screen scraping existing Web assets is the only viable way to extract data. As for the lack of resiliency, I have a solution for that.

Resilient Screen Scraping

To make screen scraping more resilient, I added a definition file. Similar to the way antivirus software updates malware definitions, my definition file can change independently of the core parsing engine or the consuming app. The definition file can reside at a location that can be updated independently of the application. Thus, when a source page changes, all you need to do is update the definition file. A user doesn’t need to wait for a newer version of the app to get certified and published into the app store. The process is shown in Figure 2.

Figure 2 Using a Definition File to Update an App

The Screen Scraper Utility Kit text-parsing engine takes raw text, primarily HTML, and uses text-pattern definitions from a Definitions object containing certain metadata and a list of Definition objects. Definitions have name attributes so they can be programmatically referenced.
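The following sketch illustrates why keeping the markers in data, rather than in code, adds resilience. The object shape (`isReady`, `definitions`, `name`, `start`, `end`), the helper names and the redesigned page markup are all hypothetical; they only mirror the description above and are not the kit's actual object model.

```javascript
// Hypothetical in-memory form of a definition file: some metadata plus a
// list of named definitions, each holding a start and an end marker.
const definitionsV1 = {
  isReady: true,
  definitions: [{ name: "Title", start: '">', end: "</a>" }]
};

// Looks a definition up by name so callers can reference it programmatically.
function findDefinition(definitions, name) {
  for (const definition of definitions.definitions) {
    if (definition.name === name) return definition;
  }
  return null;
}

// The "engine": it knows nothing about the page, only how to cut text
// between whatever markers the named definition supplies.
function parseWith(definitions, name, text) {
  if (!definitions.isReady) return null; // assumed meaning of the flag
  const definition = findDefinition(definitions, name);
  if (definition === null) return null;
  const startIndex = text.indexOf(definition.start);
  if (startIndex === -1) return null;
  const valueStart = startIndex + definition.start.length;
  const endIndex = text.indexOf(definition.end, valueStart);
  return endIndex === -1 ? null : text.slice(valueStart, endIndex);
}

const oldRow = '<tr><td><a href="events/labor.htm">Labor Day Parade</a></td></tr>';
// Hypothetical redesign of the source page.
const newRow = '<li><a href="events/labor.htm"><span>Labor Day Parade</span></a></li>';

// Only the definition data changes; parseWith stays untouched.
const definitionsV2 = {
  isReady: true,
  definitions: [{ name: "Title", start: "<span>", end: "</span>" }]
};

const before = parseWith(definitionsV1, "Title", oldRow);
const broken = parseWith(definitionsV1, "Title", newRow); // no longer a clean title
const after = parseWith(definitionsV2, "Title", newRow);

console.log({ before: before, broken: broken, after: after });
```

Because `definitionsV2` could be served from a URL, a fix like this would reach users without republishing the app.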

Debrief: Sourcing App Data from Existing Web Sites

Nearly every small town and county in the United States has a Web site, yet these communities rarely have apps. Part of this is the lack of a Web API and the money to build one. In this article, I discuss how to leverage existing Web assets and repurpose them for the era of the mobile app.

IT Brief:

As mobile apps increase in importance, the pressure for organizations to build them will increase. For apps to be useful, they need to contain relevant and up-to-date content. In most scenarios, this means exposing data via a robust Web-based API. For many cash-strapped smaller towns and counties, this represents a major obstacle. In this article, I demonstrate how to extract data from existing Web properties.

  • Web pages present content in text-based HTML
  • HTML on Web pages has structure
  • The HTML structure can provide an app with real-time data with minimal effort

Dev Brief:

HTML is semi-structured data. It has internal structure that both presents and occludes information. This structure can be parsed into meaningful, discrete data elements via screen scraping.

  • Create a Web API façade by screen scraping the HTML on your existing Web properties
  • Definition files add resiliency and flexibility
  • ShaZapp gives you a head start on creating an app that aesthetically and functionally aligns more to your brand and mission

Jump-Starting State and Local Government App Development

While the built-in starter templates in Visual Studio 2012 or later are a great place for you to start writing Windows Store apps, they don’t necessarily help you to jump-start real-world app development. You’ll still have to write code to connect to data sources. This can be particularly challenging for cash-strapped state and local government developers who don’t have a Web API.

To help accelerate Windows Store app development for such organizations, our team developed ShaZapp. ShaZapp is an experiment in taking the screen-scraping model further, so resource-strapped communities can jump-start their Windows Store app efforts. ShaZapp lets less technically savvy users create a starting point from which developers can leverage their data already exposed on the Web. They can also use Bing images to create visually rich apps with minimal effort and cost.

Via the ShaZapp Web site, it’s straightforward to generate the source code for a Windows Store app that’s already tailored to your project.

Connecting with Citizens

Suppose you work for a small, rural town that wants to create an app to keep its citizens informed of local events. Your goal is to create an app that loads event data from the ContosoVille Web site, but displays the information within a Windows Store app. Let’s assume this is your first Windows Store app and you may not be a full-time developer. You want to create an app that conveys the correct brand and messaging.

Creating the App Framework

ShaZapp walks you through the process of creating an app with custom content and branding. It’s designed to give an app builder a starting project geared more directly to her needs. The process consists of three steps: adding app metadata, creating data groups and generating the app source code.

The first step in creating an app with ShaZapp is naming the app, providing basic information about your organization, and setting some display and branding elements (see Figure 3). This lets you create a more customized project to work with in Visual Studio. It also fills in the required fields for getting your application published into the Windows Store.

Figure 3 Setting Basic Properties for Your App in ShaZapp

Once these fields are filled in, you’re ready to move on to the next step: creating data groups.

Creating Data Groups

Now that your project has some basic information, the next step is to add data. All apps require data of some sort to be useful. ShaZapp provides two kinds of data group templates: static and dynamic. Static data groups contain data that remains largely unchanged. Good examples of this kind of data would be contact information for various city hall offices, trash pickup schedules and so forth.

Creating a static data group is straightforward. Simply click on Create New Group on the Groups page, then choose Static from the Group Type dropdown list. You’ll notice right away that some fields automatically disappear. The remaining fields all pertain to metadata about this data group. Only the Group Key and Group Name fields are required. Once all the field entries are complete, click on Create to create the group.

Next, add data items to the group. Click on Create New Static Item to bring up the Create New Static Item page. On this page, you can add data to the item. When you’re done entering the data, click on Create and you’ll see this item in the static data group. Repeat this for all the data you’d like to include in the data group.

Pulling in Data Dynamically

To add real value, you need to be able to pull in real-time data. Under ideal circumstances, you’d have a robust Web API to access the data. Not every organization has such an API. However, nearly every organization has a Web site. Web sites present HTML to an end-user browser. Often, the HTML has a consistent structure that you can leverage as a data source.

Both the Screen Scraper Utility Kit and ShaZapp use definition files that define start markers and end markers. Definition files are either in XML or JSON format and let the screen-scraper text parser know which character patterns define a starting point and end point for data embedded in the HTML.

To get started, I’ll pull in each event’s name and a link to the event detail page. The HTML is easy enough to parse. I’ve already created a definition file in XML (see Figure 4) and placed it on my Web server.

Figure 4 The Definition File in XML

<?xml version="1.0" encoding="UTF-8"?>
<Definitions IsReady="true">
  <Definition Name="Events">
    <Marker Value="&lt;tr&gt;&lt;td&gt;&lt;a href="/>
    <Marker Value="&lt;/td&gt;"/>
  </Definition>
  <Definition Name="Title">
    <Marker Value="&quot;&gt;"/>
    <Marker Value="&lt;/a&gt;"/>
  </Definition>
  <Definition Name="Link">
    <Marker Value="."/>
    <Marker Value="&quot;"/>
  </Definition>
</Definitions>

First, I’ll add a new group and make sure Dynamic is selected from the Group Type dropdown list, as shown in Figure 5. The Definition File URI field is where you reference the URI to the definition file. In Iterator Name, enter “Events.” The remaining fields are display only and could have any number of values.

Figure 5 Adding a Dynamic Group in ShaZapp

ShaZapp automatically adds default field mappings when you add a dynamic group. These fields are Title, Link and imageUri (see Figure 6). Normally, this is a nice time-saver, as these are common kinds of data. However, in this case, the source HTML doesn’t contain an image associated with each event. As a result, I don’t have an imageUri field definition. Be sure to remove this field mapping by clicking on Delete; otherwise, the final project will hit a runtime error.

Figure 6 Dynamic Data Field Mappings
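To illustrate how an iterator and field mappings could work together, and why a mapping without a matching definition is a problem, here is a hedged sketch. The data shapes, the function names (`cutAll`, `missingDefinitions`, `mapItems`) and the markers used for Link are assumptions for this example, not ShaZapp's generated code or the markers from Figure 4.

```javascript
// Returns every piece of text found between marker.start and marker.end.
function cutAll(text, marker) {
  const pieces = [];
  let cursor = 0;
  for (;;) {
    const startIndex = text.indexOf(marker.start, cursor);
    if (startIndex === -1) break;
    const valueStart = startIndex + marker.start.length;
    const endIndex = text.indexOf(marker.end, valueStart);
    if (endIndex === -1) break;
    pieces.push(text.slice(valueStart, endIndex));
    cursor = endIndex + marker.end.length;
  }
  return pieces;
}

// Lists the mapped fields that have no definition to back them.
function missingDefinitions(definitions, mappings) {
  return mappings
    .filter(function (m) {
      return !Object.prototype.hasOwnProperty.call(definitions, m.definition);
    })
    .map(function (m) { return m.field; });
}

// The iterator definition splits the page into blocks; each field mapping
// then pulls one value out of every block.
function mapItems(html, definitions, iteratorName, mappings) {
  const missing = missingDefinitions(definitions, mappings);
  if (missing.length > 0) {
    throw new Error("No definition for mapped field(s): " + missing.join(", "));
  }
  return cutAll(html, definitions[iteratorName]).map(function (block) {
    const item = {};
    mappings.forEach(function (m) {
      const values = cutAll(block, definitions[m.definition]);
      item[m.field] = values.length > 0 ? values[0] : null;
    });
    return item;
  });
}

const pageHtml = [
  '<tr><td width="30%">Events</td>',
  '<tr><td><a href="events/julypicnic.htm">Fourth of July Picnic</a></td></tr>',
  '<tr><td><a href="events/labor.htm">Labor Day Parade</a></td></tr>',
  '<tr><td><a href="events/sept11.htm">Sept 11 Candlelight vigil</a></td></tr>'
].join("\n");

const definitions = {
  Events: { start: "<tr><td><a href=", end: "</td>" },
  Title: { start: '">', end: "</a>" },
  Link: { start: '"', end: '"' } // assumed markers for this sketch only
  // There is deliberately no imageUri definition.
};

const defaultMappings = [
  { field: "title", definition: "Title" },
  { field: "link", definition: "Link" },
  { field: "imageUri", definition: "imageUri" }
];

// Equivalent of clicking Delete on the imageUri mapping.
const trimmedMappings = defaultMappings.filter(function (m) {
  return m.field !== "imageUri";
});

const items = mapItems(pageHtml, definitions, "Events", trimmedMappings);
console.log(items);
```

Calling `mapItems` with `defaultMappings` instead would throw, which is the kind of failure that deleting the unused mapping avoids.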

Next, click on Generate App and then click on Generate Your App. Shortly thereafter, you’ll see a download prompt for a zipped file. Download the file, extract the contents and then open the solution file. Once the solution is loaded into Visual Studio, run the project and you should see an app that looks something like what’s shown in Figure 7.

Figure 7 The ContosoVille Windows Store App

Now bring up the Charms and click on About to bring up the About page shown in Figure 8.

Figure 8 The About Page for the ContosoVille Windows Store App

This is where the metadata entered in ShaZapp gets placed in the generated project.

This app isn’t perfect, but in mere moments you’ve created an app that displays data relevant to your purpose, instead of using a generic, out-of-the-box template. You also have the source code and can start modifying the app further based on your particular needs.

Examining the Generated Code

Take a closer look at that source code. In Solution Explorer, you’ll see a structure similar to the output of a Grid Template project. However, if you open up the References folder, you’ll notice this solution already has a reference to the Screen Scraper Utility Kit. If you look at the data.js file, you’ll see further modifications, including common functions that interface with the Screen Scraper Utility Kit component. Following that code, you should see a section with two functions, loadEventsItems and parseEventsRawHtml. The loadEventsItems function downloads the definition file and then calls the parseEventsRawHtml function. This is where you can add an image to each of the events.

In the interest of brevity, I’ll add the same image to each element, even though that’s probably not the ideal end product:

var itemObject = {
  group: dataGroups[0],
  title: title,
  subtitle: subtitle,
  index: index
};
With that simple modification, I’ve removed the broken images and now have a slightly more polished product, as Figure 9 shows.

Figure 9 Fixed Images in the ContosoVille Windows Store App
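As a hedged sketch of assigning one shared image to every item, the item object could be produced by a small helper like the one below. The `backgroundImage` property name follows the convention of the Grid App template's sample data and is an assumption here, as are the `buildItem` helper, the `SHARED_IMAGE` constant and the image path.

```javascript
// Hypothetical path to an image packaged with the app.
const SHARED_IMAGE = "/images/event.png";

// Builds one item for the data list, giving every event the same image.
function buildItem(group, title, subtitle, index) {
  return {
    group: group,
    title: title,
    subtitle: subtitle,
    backgroundImage: SHARED_IMAGE, // assumed property name
    index: index
  };
}

const eventsGroup = { key: "events", title: "Events" }; // stand-in for dataGroups[0]
const titles = ["Fourth of July Picnic", "Labor Day Parade", "Sept 11 Candlelight vigil"];

const eventItems = titles.map(function (title, index) {
  return buildItem(eventsGroup, title, "", index);
});

console.log(eventItems);
```

A later refinement could replace the constant with a per-event image once the source page, and a matching definition, provide one.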

What This Means

I took an existing HTML asset and converted it to a real-time, dynamic data source in just a few short steps. With some simple XML, the Screen Scraper Utility Kit can read and interpret the raw text of the Web page and convert it into a meaningful data model. Think of the power this gives smaller towns and communities.

Of course, the app as it exists now is hardly a finished product. There’s more to be done in terms of parsing the data into more granular parts. Having the source code means you can continue to improve and customize the end product as you like.  

Looking to the Cloud

Thus far, a service façade has been built on a device, an approach that can help state and local governments create Windows Store apps. How can you create a multiplatform solution that works for virtually any client? This issue is especially important to public-sector organizations, as they may have regulatory requirements to provide content to all citizens.

Instead of having the service façade on the Windows 8 device, what if you could move it to the cloud? Rather than duplicating the code across multiple platforms, it makes sense to perform the screen scraping at one central location and expose the extracted data structure at one REST service endpoint. Doing so would make the data pulled from the raw HTML accessible to all platforms that can understand a REST-based service (see Figure 10).

Figure 10 Moving the Screen Scraper Utility Kit Service to the Cloud

Moreover, it would give smaller communities a way to experiment with cloud computing with fairly low-risk data, meaning content that already resides on their public Web sites. This also provides a REST-based service architecture that can empower multiplatform development. From a developer’s point of view, this architecture façade would be indistinguishable from an actual SOA implementation.
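One way to picture such a centralized façade is the sketch below, which uses Node.js and its built-in `http` module purely for illustration; the article does not prescribe a server technology. The `/events` route, the function names and the synchronous `getHtml` callback (which stands in for downloading or caching the town's page) are all assumptions.

```javascript
const http = require("http");

// Scrapes event titles and links out of the page markup by string matching.
function scrapeEvents(html) {
  const rowStart = '<tr><td><a href="';
  const events = [];
  let cursor = html.indexOf(rowStart);
  while (cursor !== -1) {
    const linkStart = cursor + rowStart.length;
    const linkEnd = html.indexOf('">', linkStart);
    const titleEnd = html.indexOf("</a>", linkEnd);
    if (linkEnd === -1 || titleEnd === -1) break;
    events.push({
      title: html.slice(linkEnd + 2, titleEnd),
      link: html.slice(linkStart, linkEnd)
    });
    cursor = html.indexOf(rowStart, titleEnd);
  }
  return events;
}

// Pure request handler: easy to test without opening a socket.
function handleRequest(method, url, html) {
  if (method !== "GET" || url !== "/events") {
    return { status: 404, body: JSON.stringify({ error: "Not found" }) };
  }
  return { status: 200, body: JSON.stringify(scrapeEvents(html)) };
}

// Wraps the handler in an HTTP server. The caller decides when to listen.
function createFacadeServer(getHtml) {
  return http.createServer(function (request, response) {
    const result = handleRequest(request.method, request.url, getHtml());
    response.writeHead(result.status, { "Content-Type": "application/json" });
    response.end(result.body);
  });
}

const samplePage = [
  '<tr><td width="30%">Events</td>',
  '<tr><td><a href="events/julypicnic.htm">Fourth of July Picnic</a></td></tr>',
  '<tr><td><a href="events/labor.htm">Labor Day Parade</a></td></tr>'
].join("\n");

// Example start-up, left commented out so the sketch has no side effects:
// createFacadeServer(function () { return samplePage; }).listen(8080);
```

Any client that can issue an HTTP GET and parse JSON could then consume the same endpoint, regardless of platform.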

Wrapping Up

With the growing importance of mobile platforms, communities that lack native apps for the major mobile platforms will miss out on the opportunity to connect with their constituents. However, today’s budget-crunched communities lack the funding to reengineer their back-end data systems to provide an ideal architectural pattern for multiplatform development. The short-term solution is to leverage existing Web-based assets and repackage them behind an on-device service façade. While screen scraping is rarely an ideal solution, the Screen Scraper Utility Kit makes the process easy, and it’s hardy enough to adapt to changes in the HTML source data.

Looking forward, I hope to migrate my parsing engine to a Windows Azure-based model. This model puts the service façade into the cloud. With a mere change of a service endpoint, communities could easily swap out the mocked-up SOA with a real one once budgets allow for such development.

Frank La Vigne is a technical evangelist for the Microsoft U.S. Public Sector DPE team, where he helps public sector customers leverage technology in order to better serve their constituents. He blogs regularly and recently started a YouTube channel called Frank’s World TV.

Thanks to the following technical experts for reviewing this article: Rachel Appel (consultant) and Roberto Hernandez