About Me

I now work for Microsoft Federal in Chevy Chase, MD.

Dedicated to providing customer-driven, results-focused solutions to the complex business problems of today... and tomorrow.

At SQLTrainer.com, LLC  we understand that the technical challenges faced by businesses today are much greater in both scope and complexity than they have ever been. Businesses today are faced with understanding both local IT infrastructures as well as cloud-based technologies.

What is SQLTrainer.com?

Founded in 1998 by Ted Malone, SQLTrainer.com is a technical consulting, training and content development firm dedicated to the following core principles:

  • Technology Alone is NOT the Answer! - Implementing a particular technology because it is interesting or "cool" will not solve customer problems.
  • Technology Solutions do NOT need to be Overly Complex! - Many times developers and technical practitioners will attempt to build the most clever solution possible. While this serves to stroke the egos of those involved, it doesn't provide a maintainable solution.
  • Consultants Should be Mentors First! - When looking to hire an external consultant, businesses should look to the consultant who's willing to train themselves out of a paycheck.

Why the name, SQLTrainer.com?

SQL (pronounced See-Quell) stands for Structured Query Language, which is at the heart of every modern-day relational database system. Since many technology solutions today rely on some form of database storage or interaction, it was only logical to find a way to incorporate SQL into the name of the organization. Given that one of our core principles is to be a mentor/training above everything, the name SQLTrainer made sense. Since we also wanted to represent our embracing of the cloud, it seemed logical to add the ".com", referring to the biggest "cloud" of them all.

Live Feeds

Wednesday, December 17, 2014 10:00:00 AM

When you put big data to work, results can be beautiful. Especially when those results are as impactful as saving lives. Here are four best practice examples of how big data is being used in healthcare to improve, and often save, lives.

Aerocrine improves asthma care with near-real-time data

Millions of asthma sufferers worldwide depend on Aerocrine monitoring devices to diagnose and treat their disease effectively. But those devices are sensitive to small changes in ambient environment. That’s why Aerocrine is using a cloud analytics solution to boost reliability. Read more.

Virginia Tech advances DNA sequencing with cloud big data solution

DNA sequencing analysis is a form of life sciences research that has the potential to lead to a wide range of medical and pharmaceutical breakthroughs. However, this type of analysis requires supercomputing resources and Big Data storage that many researchers lack. Working through a grant provided by the National Science Foundation in partnership with Microsoft, a team of computer scientists at Virginia Tech addressed this challenge by developing an on-demand, cloud-computing model using the Windows Azure HDInsight Service. By moving to an on-demand cloud computing model, researchers will now have easier, more cost-effective access to DNA sequencing tools and resources, which could lead to even faster, more exciting advancements in medical research. Read more.

The Grameen Foundation expands global humanitarian efforts with cloud BI

Global nonprofit Grameen Foundation is dedicated to helping as many impoverished people as possible, which means continually improving the way Grameen works. To do so, it needed an ongoing sense of its programs’ performance. Grameen and Microsoft brought people and technology together to create a BI solution that helps program managers and financial staff: glean insights in minutes, not hours; expand services to more people; and make the best use of the foundation’s funding. Read more.

Ascribe transforms healthcare with faster access to information

Ascribe, a leading provider of IT solutions for the healthcare industry, wanted to help clinicians identify trends and improve services by supplying faster access to information. However, exploding volumes of structured and unstructured data hindered insight. To solve the problem, Ascribe designed a hybrid-cloud solution with built-in business intelligence (BI) tools based on Microsoft SQL Server 2012 and Windows Azure. Now, clinicians can respond faster with self-service BI tools. Read more.

Learn more about Microsoft’s big data solutions

Tuesday, December 16, 2014 10:00:00 AM

This blog post was authored by: Matt Usher, Senior PM on the Microsoft Analytics Platform System (APS) team

Microsoft is happy to announce the release of the Analytics Platform System (APS) Appliance Update (AU) 3. APS is Microsoft’s big data in a box appliance for serving the needs of relational data warehouses at massive scale. With this release, the APS appliance supports new scenarios for utilizing Power BI modeling, visualization, and collaboration tools over on premise data sets. In addition, this release extends the PolyBase to allow customers to utilize the HDFS infrastructure in Hadoop for ORC files and directory modeling to more easily integrate non-relational data into their data insights.

The AU3 release includes:

  • PolyBase recursive Directory Traversal and ORC file format support
  • Integrated Data Management Gateway enables query from Power BI to on premise APS
  • TSQL compatibility improvements to reduce migration friction from SQL Server SMP
  • Replatformed to Windows Server 2012 R2 and SQL Server 2014

PolyBase Directory Transversal and ORC File Support

PolyBase is an integrated technology that allows customers to utilize the skillset that they have developed in TSQL for querying and managing data in Hadoop platforms. With the AU3 release, the APS team has augmented this technology with the ability to define an external table that targets a directory structure as a whole. This new ability unlocks a whole new set of scenarios for customers to utilize their existing investments in Hadoop as well as APS to provide greater insight into all of the data collected within their data systems. In addition, AU3 introduces full support for the Optimized Row Column (ORC) file format – a common storage mechanism for files within Hadoop.

As an example of this new capability, let’s examine a customer that is using APS to host inventory and Point of Sale (POS) data in an APS appliance while storing the web logs from their ecommerce site in a Hadoop path structure. With AU3, the customer can simply maintain a structure for their logs in Hadoop in a structure that is easy to construct such as year/month/date/server/log for simple storage and recovery within Hadoop that can then be exposed as a single table to analysts and data scientists for insights.

In this example, let’s assume that each of the Serverxx folders contains the log file for that server on that particular day. In order to surface the entire structure, we can construct an external table using the following definition:

	[Uri] NVARCHAR(256) NULL,
	[Server] NVARCHAR(256) NULL,
	[Referrer] NVARCHAR(256) NULL
	FILE_FORMAT = LogFileFormat,

By setting the LOCATION targeted at the //Logs/ folder, the external table will pull data from all folders and files within the directory structure. In this case, a simple select of the data will return data from only the last 10 entries regardless of the log file that contains the data:


The results are:

Note: PolyBase, like Hadoop, will not return results from hidden folders or any file that begins with an underscore (_) or period(.).

Integrated Data Management Gateway

With the integration of the Microsoft Data Management Gateway into APS, customers now have a scale-out compute gateway for Azure cloud services to more effectively query sophisticated sets of on-premises data.  Power BI users can leverage PolyBase in APS to perform more complicated mash-ups of results from on-premises unstructured data sets in Hadoop distributions. By exposing the data from the APS Appliance as an OData feed, Power BI is able to easily and quickly consume the data for display to end users.

For more details, please look for an upcoming blog post on the Integrated Data Management Gateway.

TSQL Compatibility improvements

The AU3 release incorporates a set of TSQL improvements targeted at richer language support to improve the types of queries and procedures that can be written for APS. For AU3, the primary focus was on implementing full error handling within TSQL to allow customers to port existing applications to APS with minimal code change and to introduce full error handling to existing APS customers. Released in AU3 are the following keywords and constructs for handling errors:

In addition to the error handling components, the AU3 release also includes support for the XACT_STATE scalar function that is used to indicate the current running transaction state of a user request.

Replatformed to Windows Server 2012 R2 and SQL Server 2014

The AU3 release also marks the upgrade of the core fabric of the APS appliance to Windows Server 2012 R2 and SQL Server 2014. With the upgrade to the latest versions of Microsoft’s flagship server operating system and core relational database engine, the APS appliance takes advantage of the improved networking, storage and query execution components of these products. For example, the APS appliance now utilizes a virtualized Active Directory infrastructure which helps to reduce cost and increase domain reliability within the appliance helping to make APS the price/performance leader in the big data appliance space.

APS on the Web

To learn more about the Microsoft Analytics Platform System, please visit us on the web at http://www.microsoft.com/aps

Tuesday, December 16, 2014 9:30:00 AM

As the end of 2014 nears, now is the perfect time to review IT infrastructure plans for the coming year.  If you haven’t made supportability a key initiative for 2015, there are some important dates that you should know about:

After the end of extended support security updates will no longer be available for these products.  Staying ahead of these support dates will help achieve regulatory compliance and mitigate potential future security risks. That means SQL Server 2005 users, especially those running databases on Windows Server 2003, should make upgrading the data platform an IT priority. 

Security isn’t the only reason to think about upgrading. Here are six benefits to upgrading and migrating your SQL Server 2005 databases before the end of extended support:

  1. Maintain compliance – It will become harder to prove compliance with the latest regulations such as the upcoming PCI DSS 3.0. Protect your data and stay on top of regulatory compliance and internal security audits by running an upgraded version of SQL Server.
  2. Achieve breakthrough performance – Per industry benchmarks, SQL Server 2014 delivers 13x performance gains relative to SQL Server 2005 and 5.5x performance gains over SQL Server 2008.  Customers using SQL Server 2014 can further accelerate mission critical applications with up to 30x transaction performance gains with our new in-memory OLTP engine and accelerate queries up to 100x with our in-memory columnstore. 
  3. Virtualize and consolidate with Windows Server – Scale up on-premises or scale-out via private cloud with Windows Server 2012 R2. Reduce costs by consolidating more database workloads on fewer servers, and increase agility using the same virtualization platform on-premises and in the cloud.
  4. Reduce TCO and increase availability with Microsoft AzureAzure Virtual Machines can help you reduce the total cost of ownership of deployment, management, and maintenance of your enterprise database applications. And, it’s easier than ever to upgrade your applications and achieve high availability in the cloud using pre-configured templates in Azure.
  5. Use our easy on-ramp to cloud for web applications – The new preview of Microsoft Azure SQL Database announced last week has enhanced compatibility with SQL Server that makes it easier than ever to migrate from SQL Server 2005 to Microsoft Azure SQL Database. Microsoft’s enterprise-strength cloud brings global scale and near zero maintenance to database-as-a-service, and enables you to scale out your application on demand.
  6. Get more from your data platform investments - Upgrading and migrating your databases doesn’t have to be painful or expensive. A Forrester Total Economic ImpactTM of Microsoft SQL Server study found a payback period of just 9.5 months for moving to SQL Server 2012 or 2014.

Here are some additional resources to help with your upgrade or migration:

Monday, December 15, 2014 10:00:00 AM

As part of SQL Server’s ongoing interoperability program, we are pleased to announce the general availability of two SQL Server drivers: the Microsoft JDBC Driver for SQL Server and the SQL Server Driver for PHP are now available!

Both drivers provide that robust data access to Microsoft SQL Server and Microsoft Azure SQL Database. The JDBC Driver for SQL Server is a Java Database Connectivity (JDBC) type 4 driver supporting Java Development Kit (JDK) version 1.7. The PHP driver will allow developers who use the PHP scripting language version 5.5 to access Microsoft SQL Server and Microsoft Azure SQL Database, and to take advantage of new features implemented in ODBC 

You can download the JDBC driver here, and download the PHP driver hereWe invite you to explore the latest the Microsoft Data Platform has to offer via a trial evaluation of Microsoft SQL Server 2014, or by trying the new preview of Microsoft Azure SQL Database.

Thursday, December 11, 2014 12:00:00 PM

By Tiffany Wissner, Senior Director, Data Platform

Making it easier for more of our customers to access our latest big data technologies, we are announcing updates to some of our flagship data platform products and services. These updates are part of our approach to make it easier for our customers to work with data of any type and size – using the tools, languages and frameworks they want – in a trusted environment, on-premises and in the cloud. 

Azure SQL Database

Announced last month and available today is a new version of Azure SQL Database that represents a major milestone for this database-as-a-service. With this preview, we are adding near-complete SQL Server engine compatibility, including support for larger databases with online indexing and parallel queries, improved T-SQL support with common language runtime and XML index, and monitoring and troubleshooting with extended events. Internal tests using over 600 million rows of data show query performance improvements up to 5x in the Premium tier of the new preview relative to today’s offering. Continuing on our journey to bring in-memory technologies to the cloud, when applying in-memory columnstore in the new preview, performance is also improved up to 100x.

“From a strategy perspective, these SQL Database service updates are our answer to migrating and working with large data types by leveraging features such as online index rebuild, and partitioning,” said Joe Testa, vice president of Systems Development at Weichert, one of the nation’s leading full-service real estate providers. “Simply put, the results so far have been fantastic—we’re seeing >2x better performance and the advanced features that were only previously available in SQL Server, now make it easier to work with our applications as we continue to migrate our mission-critical apps to Azure.”

These new preview capabilities are offered as part of service tiers introduced earlier this year, which deliver 99.99% availability, larger database sizes, restore and geo-replication capabilities, and predictable performance. When combined with our recently announced elastic scale technologies that scale out to thousands of databases for processing 10s of terabytes of OLTP data and new auditing capabilities, Azure SQL Database service is a clear choice for any cloud-based mission critical application.

Analytics Platform System

As Microsoft’s “big data in a box” solution built with HP, Dell and Quanta, the Analytics Platform System is a data warehousing appliance that supports the ability to query across traditional relational data and data stored in a Hadoop region – either in the appliance or in a separate Hadoop cluster. This latest release includes a data management gateway that establishes a secure connection between on-premises data stored in the Analytics Platform System and Microsoft’s cloud business intelligence and advanced analytics services such as Power BI and Azure Machine Learning. This capability, coupled with PolyBase, a feature of the Analytics Platform System, allows for seamless integration of data stored in SQL Server with data stored in Hadoop. This now enables users of Power BI and Azure Machine Learning to gain insights from Analytics Platform System, whether on-premises or in the Azure cloud.

New Java, PHP and migration tools

Microsoft is also making available new tools and drivers that support greater interoperability with PHP and Java and make it easier for customers to migrate to and use our big data technologies.

Azure DocumentDB is our fully-managed NoSQL document database service with native support for JSON and JavaScript. DocumentDB already includes SDKs for popular languages, including Node.js, Python, .NET, and JavaScript – today we are adding a new Java SDK that will make DocumentDB easier to use within a Java development environment. The SDK provides easy-to-use methods to manage and query DocumentDB resources including collections, stored procedures and permissions. The Java SDK is also available on Github and welcomes community contributions.

Additionally, we are bolstering our SQL Server tools and drivers with updates to the Microsoft JDBC Driver for SQL Server the SQL Server Driver for PHP. Available early next week, these drivers will make it easier for our customers’ applications to access both SQL Server and Azure SQL Database.

For customers that are migrating their IBM DB2 workloads to SQL Server, we are also making available today the SQL Server Migration Assistant (SSMA) tool which automates all aspects of database migration including migration assessment analysis, schema and SQL statement conversion, data migration as well as migration testing to reduce cost and reduce risk of database migration projects. SSMA 6.0 for IBM DB2 automates migrations from IBM DB2 databases to SQL Server and Azure SQL Database and is free to download and use. Support for IBM DB2 is in addition to earlier updates to SSMA 6.0 including migration support for larger Oracle databases.

Microsoft data platform

These new updates will enable more customers to use Microsoft’s data platform to build, extend and migrate more applications. Microsoft’s data platform includes all the building blocks customers need to capture and manage all of their data, transform and analyze that data for new insights, and provide tools which enable users across their organization to visualize data and make better business decisions. To learn more, go here

Monday, December 8, 2014 10:50:04 AM

Have you been watching Data Exposed over on Channel 9? If you’re a data developer, Data Exposed is a great place to learn more about what you can do with data: relational and non-relational, on-premises and in the cloud, big and small.

On the show, Scott Klein and his guests demonstrate features, discuss the latest news, and share their love for data technology – from SQL Server, to Azure HDInsight, and more!

We rounded up the year’s top 10 most-watched videos from Data Exposed. Check them out below – we hope you learn something new!

  • Introducing Azure Data Factory: Learn about Azure Data Factory, a new service for data developers and IT pros to easily transform raw data into trusted data assets for their organization at scale.
  • Introduction to Azure DocumentDB: Get an introduction to Azure DocumentDB, a NoSQL document database-as-a-service that provides rich querying, transactional processing over schema free data, and query processing and transaction semantics that are common to relational database systems.
  • Introduction to Azure Search: Learn about Azure Search, a new fully-managed, full-text search service in Microsoft Azure which provides powerful and sophisticated search capabilities to your applications.
  • Azure SQL Database Elastic Scale: Learn about Azure SQL Database Elastic Scale, .NET client libraries and Azure cloud service packages that provide the ability to easily develop, scale, and manage the stateful data tiers of your SQL Server applications.
  • Hadoop Meets the Cloud: Scenarios for HDInsight: Explore real-life customer scenarios for big data in the cloud, and gain some ideas of how you can use Hadoop in your environment to solve some of the big data challenges many people face today.
  • Azure Stream Analytics: See the capabilities of Azure Stream Analytics and how it helps make working with mass volumes of data more manageable.
  • The Top Reasons People Call Bob Ward: Scott Klein is joined by Bob Ward, Principle Escalation Engineer for SQL Server, to talk about the top two reasons why people want to talk to Bob Ward and the rest of his SQL Server Services and Support team.
  • SQL Server 2014 In-Memory OLTP Logging: Learn about In-Memory OLTP, a memory-optimized and OLTP-optimized database engine integrated into SQL Server. See how transactions and logging work on memory-optimized-tables, and how a system can recover in-memory data in case of a system failure.
  • Insights into Azure SQL Database: Get a candid and insightful behind-the-scenes look at Azure SQL Database, the new service tiers, and the process around determining the right set of capabilities at each tier.
  • Using SQL Server Integration Services to Control the Power of Azure HDInsight: Join Scott and several members of the #sqlfamily to talk about how to control cloud from on-premises SQL Server.

Interested in taking your learning to the next level? Try SQL Server or Microsoft Azure now.

Thursday, December 4, 2014 10:00:00 AM

Historically, Hadoop has been a platform for big data that you either deploy on-premises with your own hardware or in the cloud and managed by a hosting vendor. Deploying on-premises affords you specific benefits, like control and flexibility over your deployment.  But the cloud provides other benefits like elastic scale, fast time to value, and automatic redundancy, amongst others.

With the recent announcement of the Hortonworks Data Platform 2.2 being made generally available, Microsoft and Hortonworks are partnered to deliver Hadoop on Hybrid infrastructure in both on-premises and cloud.  This will give customers the best of both worlds with control & flexibility of on-premises deployments and the elasticity & redundancy of the cloud.

What are some of the top scenarios or use cases for Hybrid Hadoop? And what are the benefits of taking advantage of a hybrid model?

  • Elasticity: Easily scale out during peak demand times by quickly spinning up more Hadoop nodes (with HDInsight)
  • Reliability: Use the cloud as an automated disaster recovery solution that automatically geo-replicates your data. Or
  • Breadth of Analytics Offerings: If you’re already working with on-prem Hortonworks offerings, you now have access to a suite of turn-key data analytics and management services in Azure, like HDInsight, Machine Learning, Data Factory, and Stream Analytics.

To get started, customers need Hortonworks Data Platform 2.2 with Apache Falcon configured to move data from on-premises into Azure.  Detailed instructions can be found here.

We are excited to be working with Hortonworks to give Hadoop users Hadoop/Big Data on a hybrid cloud. For more resources:

Wednesday, November 19, 2014 10:00:00 AM

To allow developers in Visual Studio to more easily incorporate the benefits of “big data” with their custom applications, Microsoft is adding a deeper tooling experience for HDInsight in Visual Studio in the most recent version of the Azure SDK. This extension to Visual Studio helps developers to visualize their Hadoop clusters, tables and associated storage in familiar and powerful tools. Developers can now create and submit ad hoc Hive queries for HDInsight directly against a cluster from within Visual Studio, or build a Hive application that is managed like any other Visual Studio project.

Download the Azure SDK now for VS 2013 | VS 2012 | VS 2015 Preview.

Integration of HDInsight objects into the “Server Explorer” brings your Big Data assets onto the same page as other cloud services under Azure. This allows for quick and simple exploration of clusters, Hive tables and their schemas, down to querying the first 100 rows of a table.  This helps you to quickly understand the shape of the data you are working with in Visual Studio.

Also, there is tooling to create Hive queries and submit them as jobs. Use the context menu against a Hadoop cluster to immediately begin writing Hive query scripts. In the example below, we create a simple query against a Hive table with geographic info to find the count of all countries and sort them by country. The Job Browser tool helps you visualize the job submissions and status.  Double click on any job to get a summary and details in the Hive Job Summary window. 

You can also navigate to any Azure Blob container and open it to work with the files contained there. The backing store is associated with the Hadoop cluster during cluster creation in the Azure dashboard. Management of the Hadoop cluster is still performed in the same Azure dashboard.

For more complex script development and lifecycle management, you can create Hive projects within Visual Studio. In the new project dialog (see below) you will find a new HDInsight Template category. A helpful starting point is the Hive Sample project type. This project is pre-populated with a more complex Hive query and sample data for the case of processing web server logs.

To get started visit the Azure HDInsight page to learn about Hadoop features on Azure. 

Monday, November 17, 2014 10:00:00 AM

Apache Spark is a popular open source framework for distributed cluster computing. Spark has been gaining popularity for its ability to handle both batch and stream processing as well as supporting in-memory and conventional disk processing. Starting today, Azure HDInsight will make it possible to install Spark as well as other Hadoop sub-projects on its clusters. This is delivered through a new customization feature called Script Action. This will allow you to experiment and deploy Hadoop projects to HDInsight clusters that were not possible before. We are making this easier specifically for Spark and R by documenting the process to install these modules.

To do this, you will have to create an HDInsight cluster with Spark Script Action. Script Action allow users to specify PowerShell scripts that will be executed on cluster nodes during cluster setup. One of the sample scripts that are released with the preview is Script Action to install Spark. During preview the feature is available through PowerShell, so you will need to run PowerShell scripts to create your Spark cluster. Below is the snippet of the PowerShell code where “spark-installer-v01.ps1” is the Script Action that installs Spark on HDInsight:

New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes

| Set-AzureHDInsightDefaultStorage -StorageAccountName $storageAccountName 
        -StorageAccountKey $storageAccountKey -StorageContainerName $containerName

| Add-AzureHDInsightScriptAction -Name "Install Spark"
        -ClusterRoleCollection HeadNode,DataNode
        -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv01/spark-installer-v01.ps1

| New-AzureHDInsightCluster -Name $clusterName -Location $location

Once the cluster is provisioned it will have the Spark component installed on it. You can RDP into the cluster and use Spark shell:

  • In Hadoop command line window change directory to C:\apps\dist\spark-1.0.2
  • Run the following command to start the Spark shell.


  • On the Scala prompt, enter the spark query to count words in a sample file stored in Azure Blob storage account:

val file = sc.textFile("example/data/gutenberg/davinci.txt")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

Read more on installing and using Spark on HDInsight here:

Read more on Script Action to make other customizations here :

For more information on Azure HDInsight:

Friday, November 14, 2014 9:00:00 AM

by Rob Farley, LobsterPot Solutions

The Analytics Platform System, with its MPP SQL Server engine (SQL Server Parallel Data Warehouse) can deliver performance and scalability for analytics workloads that you may not have expected from SQL Server. But there are key differences in working with SQL Server PDW and SQL Server Enterprise Edition that one should be aware of in order to take full advantage of the SQL Server PDW capabilities. One of the most important considerations when tuning queries in Microsoft SQL Server Parallel Data Warehouse is the minimisation of data movement. This post shows a useful technique regarding the identification of redundant joins through additional predicates that simulate check constraints.

Microsoft’s PDW, part of the Analytics Platform System (APS), offers scale-out technology for data warehouses. This involves spreading data across a number of SQL Server nodes and distributions, such that systems can host up to many petabytes of data. To achieve this, queries which use data from multiple distributions to satisfy joins must leverage the Data Movement Service (DMS) to relocate data during the execution of the query. This data movement is both a blessing and a curse; a blessing because it is the fundamental technology which allows the scale-out features to work, and a curse because it can be one of the most expensive parts of query execution. Furthermore, tuning to avoid data movement is something which many SQL Server query tuning experts have little experience, as it is unique to the Parallel Data Warehouse edition of SQL Server.

Regardless of whether data in PDW is stored in a column-store or row-store manner, or whether it is partitioned or not, there is a decision to be made as to whether a table is to be replicated or distributed. Replicated tables store a full copy of their data on each compute node of the system, while distributed tables distribute their data across distributions, of which there are eight on each compute node. In a system with six compute nodes, there would be forty-eight distributions, with an average of less than 2.1% (100% / 48) of the data in each distribution.

When deciding whether to distribute or replicate data, there are a number of considerations to bear in mind. Replicated data uses more storage and also has a larger management overhead, but can be more easily joined to data, as every SQL node has local access to replicated data. By distributing larger tables according to the hash of one of the table columns (known as the distribution key), the overhead of both reading and writing data is reduced – effectively reducing the size of databases by an order of magnitude.

Having decided to distribute data, choosing which column to use as the distribution key is driven by factors including the minimisation of data movement and the reduction of skew. Skew is important because if a distribution has much more than the average amount of data, this can affect query time. However, the minimisation of data movement is probably the most significant factor in distribution-key choice.

Joining two tables together involves identifying whether rows from each table match to according a number of predicates, but to do this, the two rows must be available on the same compute node. If one of the tables is replicated, this requirement is already satisfied (although it might need to be ‘trimmed’ to enable a left join), but if both tables are distributed, then the data is only known to be on the same node if one of the join predicates is an equality predicate between the distribution keys of the tables, and the data types of those keys are exactly identical (including nullability and length). More can be read about this in the excellent whitepaper about Query Execution in Parallel Data Warehouse.

To avoid data movement between commonly-performed joins, creativity is often needed by the data warehouse designers. This could involve the addition of extra columns to tables, such as adding the CustomerKey to many fact data tables (and using this as the distribution key), as joins between orders, items, payments, and other information required for a given report, as all these items are ultimately about a customer, and adding additional predicates to each join to alert the PDW Engine that only rows within the same distribution could possibly match. This is thinking that is alien for most data warehouse designers, who would typically feel that adding CustomerKey to a table not directly related to a Customer dimension is against best-practice advice.


Another technique commonly used by PDW data warehouse designers that is rarely seen in other SQL Server data warehouses is splitting tables up into two, either vertically or horizontally, whereas both are relatively common in PDW to avoid some of the problems that can often occur.

Splitting a table vertically is frequently done to reduce the impact of skew when the ideal distribution key for joins is not evenly distributed. Imagine the scenario of identifiable customers and unidentifiable customers, as increasingly the situation as stores have loyalty programs allowing them to identify a large portion (but not all) customers. For the analysis of shopping trends, it could be very useful to have data distributed by customer, but if half the customers are unknown, there will be a large amount of skew.

To solve this, sales could be split into two tables, such as Sales_KnownCustomer (distributed by CustomerKey) and Sales_UnknownCustomer (distributed by some other column). When analysing by customer, the table Sales_KnownCustomer could be used, including the CustomerKey as an additional (even if redundant) join predicate. A view performing a UNION ALL over the two tables could be used to allow reports that need to consider all Sales.

The query overhead of having the two tables is potentially high, especially if we consider tables for Sales, SaleItems, Deliveries, and more, which might all need to be split into two to avoid skew while minimising data movement, using CustomerKey as the distribution key when known to allow customer-based analysis, and SalesKey when the customer is unknown.

By distributing on a common key the impact is to effectively create mini-databases which are split out according to groups of customers, with all of the data about a particular customer residing in a single database. This is similar to the way that people scale out when doing so manually, rather than using a system such as PDW. Of course, there is a lot of additional overhead when trying to scale out manually, such as working out how to execute queries that do involve some amount of data movement.

By splitting up the tables into ones for known and unknown customers, queries that were looking something like the following:

FROM Sales AS s
JOIN SaleItems AS si
   ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems AS dsi
   ON dsi.LineItemKey = si.LineItemKey
JOIN Deliveries AS d
   ON d.DeliveryKey = dsi.DeliveryKey

…would become something like:

FROM Sales_KnownCustomer AS s
JOIN SaleItems_KnownCustomer AS si
   ON si.SalesKey = s.SalesKey
   AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems_KnownCustomer AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.CustomerKey = s.CustomerKey
JOIN Deliveries_KnownCustomer AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.CustomerKey = s.CustomerKey
FROM Sales_UnknownCustomer AS s
JOIN SaleItems_UnknownCustomer AS li
   ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems_UnknownCustomer AS dsi
   ON dsi.LineItemKey = s.LineItemKey
   AND dsi.SalesKey = s.SalesKey
JOIN Deliveries_UnknownCustomer AS d
   ON d.DeliveryKey = s.DeliveryKey
   AND d.SalesKey = s.SalesKey

I’m sure you can appreciate that this becomes a much larger effort for query writers, and the existence of views to simplify querying back to the earlier shape could be useful. If both CustomerKey and SalesKey were being used as distribution keys, then joins between the views would require both, but this can be incorporated into logical layers such as Data Source Views much more easily than using UNION ALL across the results of many joins. A DSV or Data Model could easily define relationships between tables using multiple columns so that self-serving reporting environments leverage the additional predicates.

The use of views should be considered very carefully, as it is easily possible to end up with views that nest views that nest view that nest views, and an environment that is very hard to troubleshoot and performs poorly. With sufficient care and expertise, however, there are some advantages to be had.


The resultant query would look something like:

FROM Sales AS s
JOIN SaleItems AS li
   ON si.SalesKey = s.SalesKey
   AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.CustomerKey = s.CustomerKey
   AND dsi.SalesKey = s.SalesKey
JOIN Deliveries AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.CustomerKey = s.CustomerKey
   AND d.SalesKey = s.SalesKey

Joining multiple sets of tables which have been combined using UNION ALL is not the same as performing a UNION ALL of sets of tables which have been joined. Much like any high school mathematics teacher will happily explain that (a*b)+(c*d) is not the same as (a+c)*(b+d), additional combinations need to be considered when the logical order of joins and UNION ALLs.

Notice that when we have (TableA1 UNION ALL TableA2) JOIN (TableB1 UNION ALL TableB2), we must perform joins not only between TableA1 and TableB1, and TableA2 and TableB2, but also TableA1 and TableB2, and TableB1 and TableA2. These last two combinations do not involve tables with common distribution keys, and therefore we would see data movement. This is despite the fact that we know that there can be no matching rows in those combinations, because some are for KnownCustomers and the others are for UnknownCustomers. Effectively, the relationships between the tables would be more like the following diagram:

There is an important stage of Query Optimization which must be considered here, and which can be leveraged to remove the need for data movement when this pattern is applied – that of Contradiction.

The contradiction algorithm is an incredibly useful but underappreciated stage of Query Optimization. Typically it is explained using an obvious contradiction such as WHERE 1=2. Notice the effect on the query plans of using this predicate.

Because the Query Optimizer recognises that no rows can possibly satisfy the predicate WHERE 1=2, it does not access the data structures seen in the first query plan.

This is useful, but many readers may not consider queries that use such an obvious contradiction are going to appear in their code.

But suppose the views that perform a UNION ALL are expressed in this form:

FROM dbo.Sales_KnownCustomer
WHERE CustomerID > 0
FROM dbo.Sales_UnknownCustomer
WHERE CustomerID = 0;

Now, we see a different kind of behaviour.

Before the predicates are used, the query on the views is rewritten as follows (with SELECT clauses replaced by ellipses).

        FROM   (SELECT ...
                FROM   [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
                UNION ALL
                SELECT …
                FROM   [tempdb].[dbo].[TEMP_ID_4208] AS T4_1) AS T2_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
                        UNION ALL
                        SELECT …
                        FROM   [tempdb].[dbo].[TEMP_ID_4209] AS T5_1) AS T3_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
                                UNION ALL
                                SELECT …
                                FROM   [tempdb].[dbo].[TEMP_ID_4210] AS T6_1) AS T4_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
                                UNION ALL
                                SELECT …
                                FROM   [tempdb].[dbo].[TEMP_ID_4211] AS T6_1) AS T4_2
                               ON (([T4_2].[CustomerKey] = [T4_1].[CustomerKey])
                                   AND ([T4_2].[SalesKey] = [T4_1].[SalesKey])
                                       AND ([T4_2].[DeliveryKey] = [T4_1].[DeliveryKey]))) AS T3_2
                       ON (([T3_1].[CustomerKey] = [T3_2].[CustomerKey])
                           AND ([T3_1].[SalesKey] = [T3_2].[SalesKey])
                               AND ([T3_2].[SaleItemKey] = [T3_1].[SaleItemKey]))) AS T2_2
               ON (([T2_2].[CustomerKey] = [T2_1].[CustomerKey])
                   AND ([T2_2].[SalesKey] = [T2_1].[SalesKey]))) AS T1_1

Whereas with the inclusion of the additional predicates, the query simplifies to:

        FROM   (SELECT …
                FROM   [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
                WHERE  ([T4_1].[CustomerKey] > 0)) AS T3_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
                        WHERE  ([T5_1].[CustomerKey] > 0)) AS T4_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] > 0)) AS T5_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] > 0)) AS T5_2
                               ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
                                   AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
                                       AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
                       ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
                           AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
                               AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
               ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
                   AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))
        UNION ALL
        SELECT …
        FROM   (SELECT …
                FROM   [sample_vsplit].[dbo].[Sales_UnknownCustomer] AS T4_1
                WHERE  ([T4_1].[CustomerKey] = 0)) AS T3_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_UnknownCustomer] AS T5_1
                        WHERE  ([T5_1].[CustomerKey] = 0)) AS T4_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_UnknownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] = 0)) AS T5_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_UnknownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] = 0)) AS T5_2
                               ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
                                   AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
                                       AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
                       ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
                           AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
                               AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
               ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
                   AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))) AS T1_1

This may seem more complex – it’s certainly longer – but this is the original, preferred version of the join. This is a powerful rewrite of the query.

Furthermore, the astute PDW-familiar reader will quickly realise that the UNION ALL of two local queries (queries that don’t require data movement) is also local, and that therefore, this query is completely local. The TEMP_ID_NNNNN tables in the first rewrite are more evidence that data movement has been required.

When the two plans are shown using PDW’s EXPLAIN keyword, the significance is shown even clearer.

The first plan appears as following, and it is obvious that there is a large amount of data movement involved.

The queries passed in are identical, but the altered definitions of the views have removed the need for any data movement at all. This should allow your query to run a little faster. Ok, a lot faster.


When splitting distributed tables vertically to avoid skew, views over those tables should include predicates which reiterate the conditions that cause the data to be populated into each table. This provides additional information to the PDW Engine that can remove unnecessary data movement, resulting in much-improved performance, both for standard reports using designed queries, and ad hoc reports that use a data model.

Site Map | Printable View | © 2008 - 2014 SQLTrainer.com, LLC | Powered by mojoPortal | HTML 5 | CSS | Design by mitchinson