
Available Now: Preview Release of the SQL Server PHP Driver


Today we are pleased to announce the availability of a community technology preview release of the SQL Server Driver for PHP! Download the preview driver today here.

This release will allow developers who use the PHP scripting language version 5.5 to access Microsoft SQL Server and Microsoft Azure SQL Database. The full source code for the driver will be made available on GitHub, at https://github.com/azure/msphpsql.
 
The updated driver is part of SQL Server’s wider interoperability program, which includes the upcoming release of a JDBC Driver for SQL Server compatible with JDK 7. This driver will enable customers to develop applications that connect to SQL Server from Java 7 and move forward with the Java platform.

We look forward to hearing your feedback about the new driver. Let us know what you think on Microsoft Connect.


Innovation Through Data – Introduction


The Data Insight team at Microsoft Consulting Services, made up of recognized experts in the data management platform around SQL Server and Microsoft Business Intelligence (MSBI), brings you a new series of posts on building innovative solutions, focusing on concrete business cases and implementation examples that answer a functional need.

...(read more)

Innovation Through Data – B2C Sales Management


This article presents a solution for real-time management of a B2C (Business to Consumer) sales activity, demonstrating the added value and agility that cloud-based architectures and connected mobile devices can offer businesses.

...(read more)

Azure Event Hubs now commercially available


Yesterday, we announced the commercial availability of Azure Event Hubs.

In the IoT (Internet of Things) world that we live in, ever-larger amounts of telemetry data are being emitted by multiple distinct sources such as sensors and applications embedded into things that we use or interact with every day. Azure Event Hubs allows for the secure and reliable ingestion of such data torrents at very high sustained throughput rates. Many innovative new consumer and commercial experiences can be delivered on the basis of gathering and processing such data.

For instance, most new vehicles or industrial machinery sold these days capture a ton of such data, allowing for rich insights and actions that could be taken on the basis of a careful analysis of the data emitted.

By combining Event Hubs with the Azure Stream Analytics service, which is now offered as a preview, you can have automated alerts go out when you determine that a piece of machinery is on the cusp of failure, and you could even dispatch a repair technician proactively, well before your customer calls your support line.

Event Hubs supports over 1 GB/second of cumulative throughput from hundreds of thousands of concurrently connected clients, and the best part is that the hard scale-out and reliability problems are handled entirely by our service.

ML Blog Team

 

 

 

Information Week: Dell Bolsters Analytics Software, Taps Microsoft Azure ML

VentureBeat: Microsoft gives out free access to its Azure Machine Learning service

[Announcement] OData V4 Client Code Generator 2.1.0 Release


Features

  1. The OData v4 Client Code Generator now supports generating properties whose types, and functions whose parameter and return types, are Edm.TimeOfDay or Edm.Date.
  2. The OData v4 Client Code Generator now generates an additional ByKey method for each EntityType, which accepts all key values directly as parameters instead of a dictionary.
 

Bug Fixes:

  1. Fixed a bug where the OData v4 Client Code Generator could generate an empty ExtensionMethods class.
  2. Fixed a bug where the OData v4 Client Code Generator could generate duplicate names for a property and a private field.
  3. [GitHub issue #10] Fixed a bug where the OData v4 Client Code Generator could not correctly generate code when an EntityType name and one of its property names are the same.
  4. Fixed a bug where the OData v4 Client Code Generator could not correctly generate VB code for a bound function that returns a collection.

Microsoft Research Grants Available for Azure, Including Machine Learning


This article is a re-post from the Microsoft Research Connections Blog.

A year ago, the Microsoft Azure for Research project began as a small effort to help external researchers and scientists (and Microsoft) understand how the cloud could accelerate research insights. The project enables researchers to take advantage of cloud computing for collaboration, computation, and data-intensive processing, and gives researchers access to events, online training, technical papers and more.

The project also features a very popular award program, which provides qualified research proposals with substantial grants of Azure storage and compute for one year. We received over 700 proposals in the past year, from all seven continents, including one from researchers in Antarctica! We granted awards to over half the submitted project proposals, facilitating research in a wide range of disciplines, including computer science, biology, environmental science, genomics, and planetary science.

The program also issues special-opportunity RFPs including one for Azure Machine Learning. Click here to learn more about our research collaboration, including how to apply for a grant.

ML Blog Team


Five things to know about SQL Server’s in-memory technology


Last week was an exciting week for the SQL Server team, as one of our favorite events happened – PASS Summit. If you attended PASS, you probably heard a ton about the latest version, SQL Server 2014.

One of the key drivers of SQL Server 2014’s design was the in-memory technology that is built into the product. These capabilities and the way they were designed are a key differentiator for SQL Server 2014. Recently we discussed how using SQL Server 2014’s in-memory technology can have a dramatic impact on your business – speeding transactions, queries, and insights. Today let’s delve a little deeper into our in-memory solution and our unique approach to its design.

We built in-memory technology into SQL Server from the ground up, making it the first in-memory database that works across all workloads. These in-memory capabilities are available not only on-premises, but also in the cloud when you use SQL Server in an Azure VM or use the upcoming in-memory columnstore capabilities within Azure SQL Database. So just what makes our approach so unique? This video describes it well.

We have five core design points for SQL Server in-memory. These are: 

  1. It’s built-in. If you know SQL Server, you’re ready to go. You don’t need new development tools, you don’t have to rewrite the entire app, and you don’t have to learn new APIs (a minimal T-SQL sketch follows this list).
  2. It increases speed and throughput. SQL Server’s in-memory OLTP design removes database contention with a lock- and latch-free table architecture while maintaining 100 percent data durability. This means you can take advantage of all your compute resources in parallel, for more concurrent users.
  3. It’s flexible. Your entire database doesn’t need to be in-memory. You can choose to store hot data in-memory and cold data on disk, while still being able to access both with a single query. This gives you the ability to optimize new or existing hardware.
  4. It’s easy to implement. The new migration advisor built right into SQL Server Management Studio lets you easily decide what to migrate to memory.
  5. It’s workload-optimized. In-memory OLTP is optimized for faster transactions, the enhanced in-memory ColumnStore gives you faster queries and reports, and the in-memory technology built into Excel and Analysis Services speeds analytics.
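
To give a feel for the first point, here is a minimal T-SQL sketch of enabling In-Memory OLTP and creating a memory-optimized table in SQL Server 2014. The database, filegroup, path, table, and column names are hypothetical, and settings such as bucket counts would be tuned for a real workload.

-- Add a memory-optimized filegroup and container to an existing (hypothetical) database
ALTER DATABASE SalesDB ADD FILEGROUP SalesDB_mod CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE SalesDB ADD FILE (NAME = 'SalesDB_mod', FILENAME = 'C:\Data\SalesDB_mod')
    TO FILEGROUP SalesDB_mod;
GO

-- Create a memory-optimized table with ordinary T-SQL;
-- DURABILITY = SCHEMA_AND_DATA keeps the data fully durable
CREATE TABLE dbo.ShoppingCart
(
    CartId     INT       NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    CustomerId INT       NOT NULL,
    CreatedUtc DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);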

All of this combined leads to up to 30x faster transactions, over 100x faster queries and reporting, and easy management of millions of rows of data in Excel. Think about what this can do for your business.

Learn more about SQL Server 2014 in-memory, or try SQL Server 2014 now. 

Visual Studio 2015 Preview and Entity Framework


Today we are pleased to announce the availability of Visual Studio 2015 Preview. You can download it and read more about the release on the Visual Studio team blog.

This post details the places that Entity Framework is included in Visual Studio 2015 Preview. Our team is concurrently working on two versions of Entity Framework, both of which are included in this preview.

 

Entity Framework 6.1.2 (Beta 1)

This release includes Beta 1 of the Entity Framework 6.1.2 runtime and tooling. EF6.1.2 includes a number of bug fixes and community contributions; you can see a list of the changes included in EF6.1.2 on our CodePlex site.

The Entity Framework 6.1.2 runtime is included in a number of places in this release.

  • The runtime will be installed if you create a new model using the Entity Framework Tools in a project that does not already have the EF runtime installed.
  • The runtime is pre-installed in new ASP.NET projects, depending on the project template you select.

If you encounter any issues using this beta of the 6.1.2 release, be sure to report them on our CodePlex site so that we can look at fixing them for RTM.

 

Entity Framework 7

EF7 is the next major version of Entity Framework and is still in the early stages of development.  EF7 is a lightweight and extensible version of EF that enables new platforms and new data stores.

You can find more detailed information about EF7 at http://aka.ms/AboutEF7. The page includes design information, links to relevant blog posts, and instructions for trying out the latest builds.

Visual Studio 2015 Preview includes a pre-release version of the EF7 runtime that is installed in new ASP.NET 5 projects.

Quality of EF7

We’d love to have you try out EF7, but just remember there are still a lot of rough edges and missing functionality. The EF7 project involves some major changes in the core of Entity Framework; you can read more about this in our recent ‘EF7 – v1 or v7?’ blog post.

This release is designed to give you an idea of what the experience will be like and you will quickly hit limitations if you deviate from the examples.

If you have a keen eye you may notice that the EF7 package is marked as ‘Beta 1’. This is a side effect of being part of a larger set of previews that are currently marked as Beta; we do not consider the EF7 code base to be at the level of quality or functionality where we would typically mark it as beta. This is just a result of the complexities of having a series of smaller autonomous products that are also involved in an all-up release.

SQL Server 2014 In-Memory Gives Dell the Boost it Needed to Turn Time into Money


There’s an old adage: time is money. Technology and the internet have changed the value of time and created a very speed-oriented culture. The pace at which you as a business deliver information, react to customers, enable online purchases, etc. directly correlates with your revenue. For example, reaction times and processing speeds can mean the difference between making a sale and a consumer losing interest. This is where the right data platform comes into play.

If you attended PASS Summit or watched the keynotes online, you saw us speak about Dell and the success they’ve had in using technology performance to drive their business. For Dell, providing its customers with the best possible online experience is paramount. That meant boosting its website performance so that each day its 10,000 concurrent shoppers (this number jumps to nearly 1 million concurrent shoppers during the holiday season) could enjoy faster, frustration-free shopping experiences. For Dell, time literally means money.

With a very specific need and goal in mind, Dell evaluated numerous in-memory tools and databases, but ultimately selected SQL Server 2014.

Dell turned to Microsoft’s in-memory OLTP (online transaction processing) technology because of its unique lock- and latch-free table architecture, which removed database contention while still guaranteeing 100 percent durability. By removing database contention, Dell could utilize far more parallel processors to not only improve transactional speed but also significantly increase the number of concurrent users. And choosing SQL Server 2014, with in-memory built in, meant Dell did not have to learn new APIs or tools; their developers could use familiar SQL Server tools and T-SQL to easily implement the new in-memory technologies.

All of this meant Dell was able to double its application speeds and process transactions 9x faster. Like Dell, you too can take advantage of the workload-optimized in-memory technologies built into the SQL Server 2014 data platform for faster transactions, faster queries, and faster analytics. And you can do it all without expensive add-ons, utilizing your existing hardware and existing development skills.

Learn more about SQL Server 2014 in-memory technology

AlwaysOn Availability Groups Now Support Internal Listeners on Azure Virtual Machines


We’re excited to announce that AlwaysOn Availability Groups now support Internal Listeners on Azure Virtual Machines. Today we updated our official documentation accordingly.

Availability Groups and Listeners on Azure Virtual Machines

Availability Groups, released in SQL Server 2012 and enhanced in SQL Server 2014, detect conditions impacting SQL Server availability (e.g. SQL service being down or losing connectivity).  When detecting these conditions, the Availability Group fails over a group of databases to a secondary replica. In the context of Azure Infrastructure Services, this significantly increases the availability of these databases during Microsoft Azure’s VM Service Healing (e.g. due to physical hardware failures), platform upgrades, or your own patching of the guest OS or SQL Server.

Client applications connect to the primary replica of an availability group using an Availability Group Listener. The Listener specifies a DNS name that remains the same, irrespective of the number of replicas or where these are located.  

For example: Server=tcp:ListenerName,1433;Database=DatabaseName;

To support this in Azure Virtual Machines, the Listener must be assigned the IP address of an Azure Load Balancer. The Load Balancer routes connections to the endpoint of the primary replica of the Availability Group.

Internal Availability Group Listeners

Until now, the IP address of the Azure Load Balancer had to be a public IP reachable over the Internet. To restrict access to the listener only to trusted entities, you could configure an access control list for the Load Balancer IP. However, maintaining this list could be cumbersome over time.

To simplify this, you can now configure an Internal Azure Load Balancer. This has an internal IP address reachable only within a Virtual Network, which makes the Listener accessible only to client applications located inside that Virtual Network or in Virtual Networks connected to it.

This is depicted in the picture below. An availability group has three replicas, two in Virtual Network 1 and one in Virtual Network 2. The Virtual Networks are connected via a VPN tunnel. The Availability Group has a Listener configured using an Internal Load Balancer, which prevents access from outside the connected Virtual Networks.

To create an Internal Azure Load Balancer, execute the PowerShell cmdlet Add-AzureInternalLoadBalancer. As shown below, this cmdlet receives the name of the Load Balancer, the Cloud Service where it will be created, and a static IP address in the Virtual Network. This is the internal IP address that should be used for the listener.

Add-AzureInternalLoadBalancer -InternalLoadBalancerName $ILBName -ServiceName $ServiceName -StaticVNetIPAddress $ILBStaticIP
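
For reference, the generic T-SQL for creating an availability group listener with a static IP looks like the sketch below; the availability group name, listener name, IP address, and subnet mask are hypothetical. In the Azure scenario the address used would be the internal load balancer IP, and additional cluster configuration (such as the load-balanced endpoint and probe) is required as described in the official documentation.

-- Hedged sketch only: hypothetical names and addresses
ALTER AVAILABILITY GROUP [AG1]
    ADD LISTENER N'AGListener' (WITH IP ((N'10.0.0.8', N'255.255.255.0')), PORT = 1433);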

Check our official documentation and start using Internal Availability Groups today!

To learn more about SQL Server in Azure Virtual Machines check our start page.

In-Memory Technology in SQL Server 2014 Provides Samsung ElectroMechanics with Huge Performance Gains


We’ve been talking a lot lately about our in-memory technology in SQL Server. If you attended the PASS Summit last week you likely heard a fair share. So, why all the fuss? Simply put, SQL Server 2014’s in-memory technology delivers serious business impact. According to CMS Wire, “Microsoft SQL 2014 just may be the most complete in-memory solution on the market.”

Last week we told you the story of Dell and how they have boosted website performance and enabled faster online shopping experiences with SQL Server’s in-memory online transaction processing technology. Dell is not alone. Nasdaq, Bwin and EdgeNet all have seen significant performance gains. Let’s take a look at another customer, Samsung Electro-Mechanics.

Samsung Electro-Mechanics, an electrical and mechanical devices manufacturer, uses its Statistical Process Control system to manage quality control for its large-scale manufacturing facilities.  As the system evolved and became more complex, database performance suffered, impacting manufacturing quality.  To stabilize and increase performance, Samsung Electro-Mechanics implemented SQL Server 2014 in-memory OLTP and CCI (Clustered Columnstore Indexes).

By doing so, Samsung Electro-Mechanics was able to increase transactional performance by 24x using in-memory OLTP, and improve query and reporting performance by 22x using the in-memory columnstore. These performance gains far exceeded their initial goal of improving overall performance by 2x.
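
For context, the columnstore side of such a change is a single statement in SQL Server 2014; the fact table name below is hypothetical, not Samsung Electro-Mechanics’ actual schema.

-- Convert a (hypothetical) fact table to clustered columnstore storage
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactProcessControl ON dbo.FactProcessControl;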

So consider what impact SQL Server in-memory could have on your business.

Learn more about SQL Server 2014 in-memory, or try SQL Server 2014 now.

Preview Release of the SQL Server JDBC Driver


Today we are pleased to announce the availability of a community technology preview release of the Microsoft JDBC Driver for SQL Server! Download the preview driver today here.

The JDBC Driver for SQL Server is a Java Database Connectivity (JDBC) 4.1 compliant driver that provides robust data access to Microsoft SQL Server and Microsoft Azure SQL Database.  Microsoft JDBC Driver 4.1 (Preview) for SQL Server now supports Java Development Kit (JDK) version 7.0.

The updated driver is part of SQL Server’s wider interoperability program, which includes the recent announcement of a preview driver for SQL Server compatible with PHP 5.5.

We look forward to hearing your feedback about the new driver. Let us know what you think on Microsoft Connect.

SQL Server database tooling November update


We’d like to announce the availability of the latest November 2014 release of SQL Server database tooling for Visual Studio. This update is now available for Visual Studio 2012 and 2013. For Visual Studio 2012, use the “SQL –> Check for Updates” tool inside Visual Studio. For Visual Studio 2013, check the Visual Studio update channel (Tools –> Extensions and Updates –> Updates) for this update. Alternatively, administrative installs are available via the link below.

Get it here: http://msdn.microsoft.com/en-us/data/hh297027

Contact Us

If you have any questions or feedback, please visit our forum or Microsoft Connect page.
We look forward to hearing from you.

 

What’s New

The November 2014 update includes bug fixes along with the following enhancements:

  • "Open in Visual Studio" support from Azure Preview Portal for Microsoft Azure SQL Databases
  • Improved support for High DPI screens
  • Database Unit Test support for Visual Studio 2013 Express for Web and Express for Windows Desktop

Forget the pollsters: Microsoft's Bing predicted midterm election with 95% accuracy


This is a re-post of an article from NetworkWorld.


The search engine continues its track record of astonishingly accurate predictions.

"Now that the dust has settled from the elections, Bing Predict has won out again with a 95% accuracy rate in calling the House, Senate, and Governor's races. It got 34 out of 35 Senate races correct, 419 out of 435 House seats correct, and 33 out of 36 Governor's races correct. That's a better prediction rate than even Nate Silver’s lauded FiveThirtyEight blog."

Free webinar: Learn how to run your R code in record time with just your browser


R is the most widely used language today for machine learning, but its power is sometimes limited by gaps in the technology meant to bring it to life. In this webinar, learn how you can use your existing skills in R in new ways, including deploying models as web services with a few clicks. The first half will be presentation content and the second will be open Q&A for everyone interested in optimizing their machine learning solutions.

Friday, November 14th, 2014

9:30 A.M. – 10:30 A.M. PDT

Register now!

APS Best Practice: How to Optimize Query Performance by Minimizing Data Movement


by Rob Farley, LobsterPot Solutions

The Analytics Platform System, with its MPP SQL Server engine (SQL Server Parallel Data Warehouse), can deliver performance and scalability for analytics workloads that you may not have expected from SQL Server. But there are key differences between working with SQL Server PDW and SQL Server Enterprise Edition that one should be aware of in order to take full advantage of the SQL Server PDW capabilities. One of the most important considerations when tuning queries in Microsoft SQL Server Parallel Data Warehouse is the minimisation of data movement. This post shows a useful technique for identifying redundant joins through additional predicates that simulate check constraints.

Microsoft’s PDW, part of the Analytics Platform System (APS), offers scale-out technology for data warehouses. This involves spreading data across a number of SQL Server nodes and distributions, such that systems can host up to many petabytes of data. To achieve this, queries which use data from multiple distributions to satisfy joins must leverage the Data Movement Service (DMS) to relocate data during the execution of the query. This data movement is both a blessing and a curse; a blessing because it is the fundamental technology which allows the scale-out features to work, and a curse because it can be one of the most expensive parts of query execution. Furthermore, tuning to avoid data movement is something with which many SQL Server query tuning experts have little experience, as it is unique to the Parallel Data Warehouse edition of SQL Server.

Regardless of whether data in PDW is stored in a column-store or row-store manner, or whether it is partitioned or not, there is a decision to be made as to whether a table is to be replicated or distributed. Replicated tables store a full copy of their data on each compute node of the system, while distributed tables distribute their data across distributions, of which there are eight on each compute node. In a system with six compute nodes, there would be forty-eight distributions, with an average of less than 2.1% (100% / 48) of the data in each distribution.

When deciding whether to distribute or replicate data, there are a number of considerations to bear in mind. Replicated data uses more storage and also has a larger management overhead, but can be more easily joined to other data, as every SQL node has local access to replicated data. By distributing larger tables according to the hash of one of the table columns (known as the distribution key), the work of both reading and writing data is spread across the distributions – effectively reducing the volume of data each distribution must handle by an order of magnitude or more.
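
As a rough sketch of the two options (the table and column names here are hypothetical), PDW expresses the choice in the table’s WITH clause:

-- Small dimension table: a full copy is stored on every compute node
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT          NOT NULL,
    ProductName VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Large fact table: rows are hash-distributed across the distributions on the distribution key
CREATE TABLE dbo.FactSales
(
    SalesKey    BIGINT NOT NULL,
    CustomerKey INT    NOT NULL,
    Amount      MONEY  NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);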

Having decided to distribute data, choosing which column to use as the distribution key is driven by factors including the minimisation of data movement and the reduction of skew. Skew is important because if a distribution has much more than the average amount of data, this can affect query time. However, the minimisation of data movement is probably the most significant factor in distribution-key choice.

Joining two tables together involves identifying whether rows from each table match according to a number of predicates, but to do this, the two rows must be available on the same compute node. If one of the tables is replicated, this requirement is already satisfied (although it might need to be ‘trimmed’ to enable a left join), but if both tables are distributed, then the data is only known to be on the same node if one of the join predicates is an equality predicate between the distribution keys of the tables, and the data types of those keys are exactly identical (including nullability and length). More can be read about this in the excellent whitepaper about Query Execution in Parallel Data Warehouse.

To avoid data movement in commonly-performed joins, the data warehouse designers often need to get creative. This could involve adding extra columns to tables – such as adding the CustomerKey to many fact tables (and using it as the distribution key), because orders, items, payments, and the other information required for a given report are all ultimately about a customer – and then adding additional predicates to each join to alert the PDW Engine that only rows within the same distribution could possibly match. This is thinking that is alien to most data warehouse designers, who would typically feel that adding CustomerKey to a table not directly related to the Customer dimension is against best-practice advice.

 

Another technique commonly used by PDW data warehouse designers, but rarely seen in other SQL Server data warehouses, is splitting a table into two, either vertically or horizontally. Both are relatively common in PDW to avoid some of the problems that can otherwise occur.

Splitting a table vertically is frequently done to reduce the impact of skew when the ideal distribution key for joins is not evenly distributed. Imagine the scenario of identifiable and unidentifiable customers – increasingly the situation, as stores run loyalty programs that allow them to identify a large portion (but not all) of their customers. For the analysis of shopping trends, it could be very useful to have data distributed by customer, but if half the customers are unknown, there will be a large amount of skew.

To solve this, sales could be split into two tables, such as Sales_KnownCustomer (distributed by CustomerKey) and Sales_UnknownCustomer (distributed by some other column). When analysing by customer, the table Sales_KnownCustomer could be used, including the CustomerKey as an additional (even if redundant) join predicate. A view performing a UNION ALL over the two tables could be used to allow reports that need to consider all Sales.

The query overhead of having the two tables is potentially high, especially if we consider tables for Sales, SaleItems, Deliveries, and more, which might all need to be split into two to avoid skew while minimising data movement, using CustomerKey as the distribution key when known to allow customer-based analysis, and SalesKey when the customer is unknown.

By distributing on a common key, the effect is to create mini-databases which are split out according to groups of customers, with all of the data about a particular customer residing in a single database. This is similar to the way that people scale out when doing so manually, rather than using a system such as PDW. Of course, there is a lot of additional overhead when trying to scale out manually, such as working out how to execute queries that do involve some amount of data movement.

By splitting up the tables into ones for known and unknown customers, queries that looked something like the following:

SELECT …
FROM Sales AS s
JOIN SaleItems AS si
   ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems AS dsi
   ON dsi.LineItemKey = si.LineItemKey
JOIN Deliveries AS d
   ON d.DeliveryKey = dsi.DeliveryKey

…would become something like:

SELECT …
FROM Sales_KnownCustomer AS s
JOIN SaleItems_KnownCustomer AS si
   ON si.SalesKey = s.SalesKey
   AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems_KnownCustomer AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.CustomerKey = s.CustomerKey
JOIN Deliveries_KnownCustomer AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.CustomerKey = s.CustomerKey
UNION ALL
SELECT …
FROM Sales_UnknownCustomer AS s
JOIN SaleItems_UnknownCustomer AS si
   ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems_UnknownCustomer AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.SalesKey = s.SalesKey
JOIN Deliveries_UnknownCustomer AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.SalesKey = s.SalesKey

I’m sure you can appreciate that this becomes a much larger effort for query writers, and the existence of views to simplify querying back to the earlier shape could be useful. If both CustomerKey and SalesKey were being used as distribution keys, then joins between the views would require both, but this can be incorporated into logical layers such as Data Source Views much more easily than using UNION ALL across the results of many joins. A DSV or Data Model could easily define relationships between tables using multiple columns so that self-serving reporting environments leverage the additional predicates.

The use of views should be considered very carefully, as it is easily possible to end up with views that nest views that nest views that nest views, and an environment that is very hard to troubleshoot and performs poorly. With sufficient care and expertise, however, there are some advantages to be had.

 

The resultant query would look something like:

SELECT …
FROM Sales AS s
JOIN SaleItems AS si
   ON si.SalesKey = s.SalesKey
   AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.CustomerKey = s.CustomerKey
   AND dsi.SalesKey = s.SalesKey
JOIN Deliveries AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.CustomerKey = s.CustomerKey
   AND d.SalesKey = s.SalesKey

Joining multiple sets of tables which have been combined using UNION ALL is not the same as performing a UNION ALL over sets of tables which have been joined. Much like any high school mathematics teacher will happily explain that (a*b)+(c*d) is not the same as (a+c)*(b+d), additional combinations need to be considered when changing the logical order of joins and UNION ALLs.

Notice that when we have (TableA1 UNION ALL TableA2) JOIN (TableB1 UNION ALL TableB2), we must perform joins not only between TableA1 and TableB1, and TableA2 and TableB2, but also TableA1 and TableB2, and TableB1 and TableA2. These last two combinations do not involve tables with common distribution keys, and therefore we would see data movement. This is despite the fact that we know that there can be no matching rows in those combinations, because some are for KnownCustomers and the others are for UnknownCustomers. Effectively, the relationships between the tables would be more like the following diagram:

There is an important stage of Query Optimization which must be considered here, and which can be leveraged to remove the need for data movement when this pattern is applied – that of Contradiction.

The contradiction algorithm is an incredibly useful but underappreciated stage of Query Optimization. Typically it is explained using an obvious contradiction such as WHERE 1=2. Notice the effect on the query plans of using this predicate.
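
(Since the original plan screenshots are not reproduced here, a hypothetical pair of queries illustrating the comparison would look something like this.)

-- Normal query: the plan must access the table's data structures
SELECT SalesKey, Amount FROM dbo.Sales;

-- Obvious contradiction: the optimizer proves no row can satisfy WHERE 1 = 2,
-- so the resulting plan contains no table access at all
SELECT SalesKey, Amount FROM dbo.Sales WHERE 1 = 2;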

Because the Query Optimizer recognises that no rows can possibly satisfy the predicate WHERE 1=2, it does not access the data structures seen in the first query plan.

This is useful, but many readers may not expect queries with such an obvious contradiction to appear in their code.

But suppose the views that perform a UNION ALL are expressed in this form:

CREATE VIEW dbo.Sales AS
SELECT *
FROM dbo.Sales_KnownCustomer
WHERE CustomerKey > 0
UNION ALL
SELECT *
FROM dbo.Sales_UnknownCustomer
WHERE CustomerKey = 0;

Now, we see a different kind of behaviour.

Without the predicates in the view definitions, the query over the views is rewritten as follows (with SELECT clauses replaced by ellipses).

SELECT …
FROM   (SELECT …
        FROM   (SELECT ...
                FROM   [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
                UNION ALL
                SELECT …
                FROM   [tempdb].[dbo].[TEMP_ID_4208] AS T4_1) AS T2_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
                        UNION ALL
                        SELECT …
                        FROM   [tempdb].[dbo].[TEMP_ID_4209] AS T5_1) AS T3_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
                                UNION ALL
                                SELECT …
                                FROM   [tempdb].[dbo].[TEMP_ID_4210] AS T6_1) AS T4_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
                                UNION ALL
                                SELECT …
                                FROM   [tempdb].[dbo].[TEMP_ID_4211] AS T6_1) AS T4_2
                               ON (([T4_2].[CustomerKey] = [T4_1].[CustomerKey])
                                   AND ([T4_2].[SalesKey] = [T4_1].[SalesKey])
                                       AND ([T4_2].[DeliveryKey] = [T4_1].[DeliveryKey]))) AS T3_2
                       ON (([T3_1].[CustomerKey] = [T3_2].[CustomerKey])
                           AND ([T3_1].[SalesKey] = [T3_2].[SalesKey])
                               AND ([T3_2].[SaleItemKey] = [T3_1].[SaleItemKey]))) AS T2_2
               ON (([T2_2].[CustomerKey] = [T2_1].[CustomerKey])
                   AND ([T2_2].[SalesKey] = [T2_1].[SalesKey]))) AS T1_1

Whereas with the inclusion of the additional predicates, the query simplifies to:

SELECT …
FROM   (SELECT …
        FROM   (SELECT …
                FROM   [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
                WHERE  ([T4_1].[CustomerKey] > 0)) AS T3_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
                        WHERE  ([T5_1].[CustomerKey] > 0)) AS T4_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] > 0)) AS T5_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] > 0)) AS T5_2
                               ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
                                   AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
                                       AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
                       ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
                           AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
                               AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
               ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
                   AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))
        UNION ALL
        SELECT …
        FROM   (SELECT …
                FROM   [sample_vsplit].[dbo].[Sales_UnknownCustomer] AS T4_1
                WHERE  ([T4_1].[CustomerKey] = 0)) AS T3_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_UnknownCustomer] AS T5_1
                        WHERE  ([T5_1].[CustomerKey] = 0)) AS T4_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_UnknownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] = 0)) AS T5_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_UnknownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] = 0)) AS T5_2
                               ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
                                   AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
                                       AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
                       ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
                           AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
                               AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
               ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
                   AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))) AS T1_1

This may seem more complex – it’s certainly longer – but the query is back in its original, preferred shape: a UNION ALL of two separate join trees. This is a powerful rewrite of the query.

Furthermore, the astute PDW-familiar reader will quickly realise that the UNION ALL of two local queries (queries that don’t require data movement) is also local, and that therefore this query is completely local. The TEMP_ID_NNNNN tables in the first rewrite are further evidence that data movement was required there.

When the two plans are shown using PDW’s EXPLAIN keyword, the significance is shown even more clearly.

The first plan appears as follows, and it is obvious that there is a large amount of data movement involved.

The queries passed in are identical, but the altered definitions of the views have removed the need for any data movement at all. This should allow your query to run a little faster. Ok, a lot faster.

Summary

When splitting distributed tables vertically to avoid skew, views over those tables should include predicates which reiterate the conditions that cause the data to be populated into each table. This provides additional information to the PDW Engine that can remove unnecessary data movement, resulting in much-improved performance, both for standard reports using designed queries, and ad hoc reports that use a data model.

Azure HDInsight Clusters Allows Custom Installation of Spark Using Script Action


Apache Spark is a popular open source framework for distributed cluster computing. Spark has been gaining popularity for its ability to handle both batch and stream processing, as well as supporting in-memory and conventional disk processing. Starting today, Azure HDInsight makes it possible to install Spark, as well as other Hadoop sub-projects, on its clusters. This is delivered through a new customization feature called Script Action, which allows you to experiment with and deploy Hadoop projects on HDInsight clusters in ways that were not possible before. We are making this easier specifically for Spark and R by documenting the process to install these modules.

To do this, you will have to create an HDInsight cluster with the Spark Script Action. Script Actions allow users to specify PowerShell scripts that will be executed on cluster nodes during cluster setup. One of the sample scripts released with the preview is a Script Action that installs Spark. During the preview the feature is available through PowerShell, so you will need to run PowerShell scripts to create your Spark cluster. Below is a snippet of the PowerShell code, where “spark-installer-v01.ps1” is the Script Action that installs Spark on HDInsight:

New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes |
    Set-AzureHDInsightDefaultStorage -StorageAccountName $storageAccountName `
        -StorageAccountKey $storageAccountKey -StorageContainerName $containerName |
    Add-AzureHDInsightScriptAction -Name "Install Spark" `
        -ClusterRoleCollection HeadNode,DataNode `
        -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv01/spark-installer-v01.ps1 |
    New-AzureHDInsightCluster -Name $clusterName -Location $location

Once the cluster is provisioned, it will have the Spark component installed on it. You can RDP into the cluster and use the Spark shell:

  • In the Hadoop command line window, change directory to C:\apps\dist\spark-1.1.0.
  • Run the following command to start the Spark shell.

.\bin\spark-shell

  • At the Scala prompt, enter the following Spark query to count words in a sample file stored in the Azure Blob storage account:

val file = sc.textFile("example/data/gutenberg/davinci.txt")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.toArray().foreach(println)

Read more on installing and using Spark on HDInsight here:

Read more on Script Action to make other customizations here:

For more information on Azure HDInsight:

Cumulative Update #3 for SQL Server 2012 SP2

Dear Customers, The 3rd cumulative update release for SQL Server 2012 SP2 is now available for download at the Microsoft Support site. Cumulative Update 3 contains all hotfixes which have been available since the initial release of SQL Server 2012...(read more)