Channel: Data Platform

ICYMI - Free Machine Learning and Predictive Analytics Training


In case you missed it:

If you're not a data scientist but are interested in the power of ML and predictive analytics, be sure to check out this level 100 Microsoft Virtual Academy course on Azure ML.

In a demo-rich format led by entertaining experts Buck Woody, Seayoung Rhee and Scott Klein, you will get a real-world look at different ways to embed predictive analytics in your big data solutions. You will explore best practices for analyzing trends and patterns, learn how to use the Azure ML API services and how to monetize your predictive analytics solutions via the Azure Marketplace.

ML Blog Team


Trill - Processing "Big Data" Orders of Magnitude Faster


Re-posted from the Inside Microsoft Research blog.

One of the underpinnings of the Azure Stream Analytics service – which we have covered in earlier blog posts – is the query processor from Trill, a high-performance streaming analytics engine.

Developed by Microsoft researchers, Trill processes data a whole lot faster than today’s streaming engines – anywhere from two to four orders of magnitude faster – by using newer algorithms and techniques to process events in batches.

Although not directly available to the public, Trill is finding wide applicability within Microsoft in diverse areas such as web advertising, gaming and IoT.

Learn more about Trill and the researchers behind it here.

ML Blog Team  

 

 

 

Data-Driven Innovation – Direct Marketing


Winning new customers is a key factor for every commercial business. Converting a mere prospect into a customer comes at a significant cost, so it is essential to have efficient, well-controlled processes in place.

...(read more)

Download Free O'Reilly Report - Data Science in the Cloud


Re-post of Stephen Elston’s post on O’Reilly Radar

O’Reilly's new report, titled “Data Science in the Cloud, with Azure Machine Learning and R," shows how newer Cloud-based tools, combined with established techniques such as R, make sophisticated ML models accessible to a wide range of users. Through a practical data science example, with relevant data sets and R scripts available on GitHub, it helps you navigate through tasks such as:

  • Data management

  • Data transformation

  • Building and evaluating ML models

  • Producing R graphics

  • Publishing your models as web services 

All this is done using a free account in the Azure ML cloud environment. You can click here to read the original post – and, once there, be sure to download your free copy!

  

 

 

ML Blog Team

SQL Server Reporting Services 2012 SP2 CU3 Report Rendering Issues

I wanted to make you aware of an issue that we’ve seen on a few support cases this week. In these cases, the PDF, Print Preview, and TIFF rendering formats are affected. If you apply SQL Server 2012 SP2 CU3 or CU4, you may see a behavior where the...(read more)

Popular on KDnuggets: Publish R Code as a Web Service - In Just a Few Clicks


As reported by KDnuggets, one of the most viewed and tweeted-about recent posts on that site was an article by Jaya Mathew, Data Scientist at Microsoft.

Jaya’s article talks about how, with Azure ML Studio, users can write R code and – within just a few clicks – publish it as a web service.

By helping you operationalize your R scripts in this fashion, Azure ML helps your API become available, discoverable and consumable from the Azure Marketplace. Developers from around the world can now have their apps call your API and readily leverage your idea as part of their solutions.

The original article, which illustrates how this all works, is available here.

 

ML Blog Team

[Announcement] ODataLib 6.10.0 Release


We are happy to announce that ODataLib 6.10.0 has been released and is available on NuGet. Detailed release notes are listed below:

 New Features:

[GitHub issue #1] EdmLib now supports EdmError/EdmLocation containing file name.

[GitHub issue #26] OData Client for .NET now supports getting instance annotations from payload or getting metadata annotations.

[GitHub issue #41] OData Client for .NET now supports posting an action with entity-valued parameters that contain only the properties that have been set.

 

Bug Fixes:

[GitHub issue #34] Fixed a bug where OData Client for .NET did not follow the ABNF rule for the OData-Version/OData-MaxVersion header.

[GitHub issue #47] Fixed a bug where EdmReader could not parse an undefined Enum type in EnumMember.

[GitHub issue #61] Fixed a bug where EnumMember could not reference an Enum type defined outside the current schema.

[GitHub issue #66] Fixed a bug where ODataUriParser failed to parse a null value for a nullable enum function parameter in Enum qualified namespace free mode.

 

 Call to Action:

You and your team are welcome to try out this new version if you are interested in the new features and fixes above. For any feature request, issue, or idea, please feel free to reach out to us at odatafeedback@microsoft.com.

[Announcement] OData Web API 5.4 Release


The NuGet packages for OData Web API 5.4 are now available on the NuGet gallery.

Download this release

You can install or update the NuGet packages for OData Web API 5.4 using the Package Manager Console:

PM> Install-Package Microsoft.AspNet.OData -Version 5.4.0
PM> Install-Package Microsoft.AspNet.WebApi.OData -Version 5.4.0

What’s in this release?

This release primarily includes new features for the OData (v4 and v3) Web API, as summarized below:

  • The V4 package now has a dependency on ODataLib 6.9.

Questions and feedback

You can submit questions related to this release, any issues you encounter and feature suggestions for future releases on our GitHub site.


Join us at the PASS Business Analytics Conference this Year and Explore What’s New in Data Analysis


 Guest post by
Kasper de Jonge, Senior PM
Microsoft

It’s that exciting time of the year again, spring is in the air and conference season is starting up again. Usually one of the first conferences that I am really excited about is the PASS Business Analytics Conference. The PASS BAC is a great place where I love to go and meet peers in the field of data analytics and spend days talking about all the nerdy things that excite us from PivotTables to XIRR, DAX calculations and Tree maps. 

This year I am really excited and happy to talk about some great additions to the new Power BI designer (go here and try it out if you haven't), and about the great features we are adding around data modelling and analytics. In this demo-packed session, we will take a look at how the new Power BI designer can solve complex business problems with ease. The session will explore new functionality in the Power BI designer around complex relationships between tables, as well as additions to the DAX language to support new types of business logic. I'll mainly be demoing with real-world examples to highlight these new features.

Check out the full session details here.

Looking forward to seeing you there!

Want to learn more about Microsoft business intelligence offerings? Check out the details here

6 Minutes to Learn How to Get a Cloud-Based IoT Solution Running!


This post is by Santosh Balasubramanian, a Program Manager in the Microsoft Azure Stream Analytics team

Have you ever wondered how to connect cheap devices like Arduino boards or Raspberry Pis with off-the-shelf sensors (which measure things such as temperature, light, motion, sound, etc.) to create your own monitoring solutions? Have you assumed that it might get pretty complicated or overwhelming to efficiently collect such sensor data, analyze it, and then visualize it on dashboards or set up notifications?

Well, it’s time to think again.

This problem – collecting, analyzing and acting on high volumes of IoT or sensor data in real time – is relevant not just to the hobbyists among us, but is also of critical importance to many large enterprises involved in a myriad of activities such as manufacturing, energy-efficient buildings, smart meters, connected cars and more.

Through the short 6-minute video below, you will see just how simple we have made it to connect sensor data to the cloud and run sophisticated data analytics on it using a set of Microsoft Azure services including Event Hubs, Stream Analytics and Machine Learning.

At the end of it, you will have created custom “live” dashboards and notifications on data emitted by a weather shield sensor.

And, best of all, you don’t have to be a data scientist to do any of this.


Getting Data into Azure and Performing Analytics on it

We started building the IoT solution using Azure Event Hubs, a highly scalable publish-subscribe ingestor. It can take in millions of events per second, so you can process and analyze massive amounts of data produced by your connected devices or applications. There’s code running on the Arduino boards and Raspberry Pi’s to take sensor data and stream it in real-time to Event Hubs. Once this is done, you are ready to create live dashboards and view your current sensor data, such as the temperature and humidity charts shown in the video.
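As a rough illustration of that ingestion step (the connection string, event hub name and payload shape below are placeholders, and the actual device code is available from the Microsoft Open Technologies blog post linked later in this post), sending a single reading to Event Hubs from .NET could look like this:

using System;
using System.Text;
using Microsoft.ServiceBus.Messaging; // NuGet package: WindowsAzure.ServiceBus

class SensorSender
{
    static void Main()
    {
        // Placeholders – use your own Event Hub connection string and name.
        string connectionString = "<Service Bus connection string>";
        string eventHubName = "<event hub name>";

        var client = EventHubClient.CreateFromConnectionString(connectionString, eventHubName);

        // One temperature/humidity reading serialized as JSON (field names are illustrative).
        string json = "{\"dspl\":\"sensor01\",\"temp\":72.3,\"hmdt\":41.0,\"time\":\""
                      + DateTime.UtcNow.ToString("o") + "\"}";

        // Send the event; real device code would do this in a loop as readings arrive.
        client.Send(new EventData(Encoding.UTF8.GetBytes(json)));
    }
}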

Now say that you have thousands of temperature sensors – in a large building, for instance – and, rather than seeing each sensor's data individually, you wish to see aggregated information such as the average, maximum or minimum temperature for the building each hour. To do this, you can use Azure Stream Analytics, our fully managed stream processing solution, which seamlessly connects to Event Hubs. It allows you to write stream processing logic in a SQL-like language, includes several temporal functions such as TumblingWindow, SlidingWindow and HoppingWindow, and lets you Join multiple streams, detect patterns and create your own stream processing logic. It provides enterprise-grade SLAs and easily enables you to scale your resource needs up or down based on the incoming throughput. You can create the cheapest stream-processing jobs for as little as $25/month (and currently at half that price, as this service is still in public preview). With Azure Stream Analytics, there is no writing or debugging of complex temporal logic in Java or .NET – if you know SQL, you are ready.

Real-Time Notifications and Alerts

Once you see live or aggregate data in your dashboards, you will likely want to set up rules or conditions under which you get notified about issues in real time. For this you can set up thresholds for alerts in Azure Stream Analytics. These alerts can be as simple as "show me alerts when the temperature is over 79 degrees," or as complex as "alert me when the average humidity in the last second is 20% greater than the average humidity in the previous 5 seconds."

As a system gets more complex and involves an ever increasing number and variety of sensors, it often becomes necessary for alert thresholds to be adjusted periodically. This is often a very manual and cumbersome or complicated process. Hence the need for systems that recognize “normal” data patterns as opposed to “outlier” situations where something unusual may be happening. Better yet, such a system should teach itself to recognize such anomalous patterns, so that rules would not need to be manually or continuously adjusted. This is where Azure Machine Learning comes in. We use Azure ML models to detect anomalies and raise alerts in our example. We simply used a pre-existing Anomaly Detection API available from the Azure Marketplace. Our streaming sensor data is sent to this model, where anomalies are detected in real-time and get displayed in our alerts. An ML model such as this one, made available as an API on the Azure Marketplace, allows folks such as myself who are not data scientists to easily consume such “best of breed” models, even if we don’t know their full inner workings.

If you wish to create your own weather sensor on Raspberry Pis, Arduino boards, etc., please see the following blog post by Microsoft Open Technologies. There you can download the code and provision your sensors easily. This code not only reads data from the sensors, but also streams it in real time to Event Hubs as described above. Additionally, you will get the code for creating Stream Analytics queries and custom real-time dashboards.

I hope you found this post useful.

Santosh
Subscribe to this blog. Follow us on twitter

Automated Backup & Automated Patching Best Practices


We recently released the Automated Backup and Automated Patching features, which automate the processes of backing up and patching your SQL Server virtual machine to provide an added level of convenience. We'd like to outline some best practices to ensure that you get the most out of these features.

Automated Backup

Backup of Encryption Certificates and Data

When backup encryption is enabled, we strongly recommend that you verify that the encryption certificate has been successfully created and uploaded, to ensure the restorability of your databases. You can do so by creating a database right away and checking that the encryption certificates and data were backed up properly to the newly created container. This confirms that everything was configured correctly and that no anomalies took place.

If the certificate failed to upload for some reason, you can use the certificate manager to export the certificate and save it. Do not save it on the same VM, however, as that does not guarantee you access to the certificate when the VM is down. To know whether the certificate was backed up properly after changing or creating the Automated Backup configuration, check the event logs in the VM (Figure 1); if the backup failed, you will see this error message:

Figure 1: Error Message Shown in Event Log in VM

If the certificates were backed up correctly, you will see this message in the Event Logs:

Figure 2: Successful Backup of Encryption Certificate in Event Logs

As a general practice, it is recommended to check on the health of your backups from time to time. In order to be able to restore your backups, you should do the following:

  1. Confirm that your encryption certificates have been backed up and that you remember your password. If you do not do this, you will not be able to decrypt and restore your backups. If for some reason your certificates were not properly backed up, you can back them up manually by executing the following T-SQL (substituting your own file paths and passwords for the placeholders):

    BACKUP MASTER KEY TO FILE = '<master key backup file>' ENCRYPTION BY PASSWORD = '<password>';
    BACKUP CERTIFICATE [AutoBackup_Certificate] TO FILE = '<certificate backup file>' WITH PRIVATE KEY (FILE = '<private key backup file>', ENCRYPTION BY PASSWORD = '<password>');

  2. Confirm that your backup files have been uploaded and include at least one full backup. Because mistakes happen, make sure you always have at least one full backup before deleting your VM – or in case your VM gets corrupted – so you know you can still access your data. Make sure the backup in storage is safe and recoverable before deleting your VM's data disks.

Disaster Recovery

It is recommended that you select a storage account in a different region for your backups to provide disaster recovery for your data. Putting your backups in another region is critical for scenarios where a datacenter goes down and you need uninterrupted access to your data. However, if a short recovery time matters more to you than disaster recovery, it may be better to store your backups in the same region. This decision depends on your specific requirements.

Encryption Password

Be sure to use a strong password to protect your certificates. Have some method of ensuring that you remember the password when the time comes to decrypt and restore your backup.

 

Automated Patching

Schedule

Be sure to schedule the Patching window during a time with low workload, but when the VM is still active. If you schedule during a window where the VM will be down, patching will not take place.

Windows Update compatibility boundaries

If you would like to manually install a specific update that you see in the Windows Update UI, you can do so without interfering with Automated Patching. However, keep in mind that switching Windows Update to automatic install mode will cause Automated Patching to be disabled. Your settings will persist, however, and you can manually re-enable Automated Patching to continue using it.

Azure only

Both Automated Backup and Automated Patching rely heavily on the Azure VM Agent infrastructure, which means there is no support for on-premises deployments. If you plan to move your VM from Azure to any other environment, plan on uninstalling both the SQL Server IaaS Agent and the Azure VM Agent.

 

Try these features out for yourself at https://portal.azure.com.

If you haven’t already, start a free trial on SQL Server in Virtual Machines today.

Free Webinar - ML for Business Users & Enterprise Developers


Re-posted from Gigaom Research.

This webinar will explore ML concepts in terms that business users can understand. A panel will discuss how ML can be used by mainstream developers and database professionals as an important tool for business decision making. This session will be most relevant for business users, IT decision makers, data architects, enterprise developers and developer managers.

Click here to register for this free webinar, which runs tomorrow, Thursday, February 12th, 2015:

 

ML Blog Team

 

Building Web Services with R and Azure ML


This post is by Raymond Laghaeian, a Senior Program Manager on Microsoft Azure Machine Learning.

The support for R in Azure ML allows you to easily integrate existing R scripts into an experiment. From there, you are just a couple of clicks away from publishing your script as a web service. In this post, we walk through the steps needed to accomplish that, and then consume the resulting web service in an ASP.NET web application.

Create New Experiment

First, we create a new experiment by clicking on "+New" in Azure ML Studio and selecting Blank Experiment. We then drag the Execute R Script module and add it to the experiment, as shown below:


Figure 1: Execute R Script module added to the experiment

Add the Script

We next click in the R Script Property in the right pane, delete the existing sample code, and paste the following script:

a = c("name", "Joe", "Lisa")

b = c("age", "20", "21")

c = c("married", TRUE, FALSE)

df = data.frame(a, b, c)

#Select the second row

data.set = df[2,]

# Return the selected row as output

maml.mapOutputPort("data.set");

The resulting experiment looks as follows:


Figure 2: Adding R script to the module

Publish the Web Service

To publish this as a web service, we drag the Web service output from the lower left pane and attach it to the right output of the Execute R Script module. You can use the Web Service view switch to toggle between experiment and web service flows.


Figure 3: Web Service Setup

 

We next click Run and, upon completion, click the Publish Web Service button when it becomes enabled at the bottom of the screen. We then click Yes twice to publish the web service.

After the Web Service is published, its Dashboard will be displayed with the API Key and a Test link to call the Request Response Service (RRS):

 
Figure 4: Web Service Dashboard
 

Test the Web Service

To test the web service, we click on the Test link for the Request Response service, and on the checkmark on the dialog (“This web service does not require any input”).

Note the result on the bottom of the screen. Clicking the Details button will show the row we had selected in the dataset returned in the API response.

 
Figure 5: Web Service call result showing R script output
 

Create an ASP.NET Client

To consume the web service in an ASP.NET web application, we start Visual Studio and create a new ASP.NET Web Application (File -> New Project -> ASP.NET Web Application). In the Name field, we type in R Web Service Client.


Figure 6: Create an ASP.NET Web Application project

 

In the new ASP.NET Project window, we select Web Forms, then click OK.

Next, in the Solution Explorer window, we right click on the project name (R Web Service Client) and add a new Web Form. When prompted for the name, we name it CallR.

Set Up the UI

In CallR.aspx, we add the client UI markup inside the <form> tag (between the opening <div> and closing </div>): the heading text, a button that invokes the web service, and a label that displays the result. The Button1 and Label1 IDs correspond to the controls used in the code-behind below:

R Web Service client

<asp:Button ID="Button1" runat="server" Text="Call Web Service" OnClick="Button1_Click" />

<asp:Label ID="Label1" runat="server"></asp:Label>

Set Up the Code to Call Your Web Service

The code for this section is copied from the C# sample code on the API help page for RRS (see Figure 4 above) with some modifications for ASP.NET. First, as described at the top of the sample C# code, we install Microsoft.AspNet.WebApi.Client. We then add the following “using” statements:

using System.Net.Http;
using System.Net.Http.Formatting;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

Next, in the code view of the page (CallR.aspx.cs), we add a button-click event:

protected void Button1_Click(object sender, EventArgs e)
{
    InvokeRequestResponseService().Wait();
}

We next add the following method below the Button1_Click event.

async Task InvokeRequestResponseService()
{
    using (var client = new HttpClient())
    {
        ScoreData scoreData = new ScoreData()
        {
            FeatureVector = new Dictionary<string, string>() { },
            GlobalParameters = new Dictionary<string, string>() { }
        };

        ScoreRequest scoreRequest = new ScoreRequest()
        {
            Id = "score00001",
            Instance = scoreData
        };

        // Set the API key (use the API key for your web service – see Figure 4)
        const string apiKey = "a4C/IkyCy6N4Gm80aF6A==";
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        // Replace with your web service URL from the C# sample code on the API help page for RRS
        client.BaseAddress = new Uri("https://ussouthcentral.services.azureml.net/workspaces/b3692371e94aa5bec7a28889/services/874c7886a2453d947113e48c1d/score");

        HttpResponseMessage response = await client.PostAsJsonAsync("", scoreRequest).ConfigureAwait(false);

        if (response.IsSuccessStatusCode)
        {
            string result = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
            Label1.Text = "Result: " + result;
        }
        else
        {
            Label1.Text = "Failed with status code: " + response.StatusCode;
        }
    }
}

Finally, we add the following two classes after the partial class CallR.

public class ScoreData
{
    public Dictionary<string, string> FeatureVector { get; set; }
    public Dictionary<string, string> GlobalParameters { get; set; }
}

public class ScoreRequest
{
    public string Id { get; set; }
    public ScoreData Instance { get; set; }
}

The class should now look like this:

Run the Application

We then run the application (F5) and click on the Call Web Service button to get the results:

If you got this far, we hope you enjoyed reading this post! To quickly summarize, using Azure ML, it is easy to create a web service from an R script and consume it in an ASP.NET web application – just as this example demonstrated.

Raymond
Contact me on twitter. Get started with Azure ML at the Machine Learning Center.

EF6.1.3 Beta 1 Available


Today we are making Beta 1 of the EF6.1.3 release available. This patch release contains only high priority bug fixes.

 

What are the 6.1.3 release timelines?

At this stage we are planning for our next release to be the RTM. This may change if we get additional reports of high priority bugs that we decide should be fixed in 6.1.3.

We expect to ship the next release sometime next month, but this may change if we decide to take additional changes.

 

What’s in Beta 1?

EF6.1.3 will just contain fixes to high priority issues that have been reported on the 6.1.2 release. The fixes included in beta 1 are:

 

Where do I get the beta?

The runtime is available on NuGet. Follow the instructions on our Get It page for installing the latest pre-release version of Entity Framework runtime.

The tooling for Visual Studio 2012, Visual Studio 2013, and Visual Studio 2015 Preview is available on the Microsoft Download Center.

 

Support

This is a preview of changes that will be available in the final release of EF6.1.3 and is designed to allow you to try out the new features and report any issues you encounter. Microsoft does not guarantee any level of support on this release.

If you need assistance using the new features, please post questions on Stack Overflow using the entity-framework tag.

Data-Driven Innovation – Optimizing Customer Relationships by Combining CRM and Machine Learning


This article presents an analytical CRM solution that demonstrates the added value and optimization that predictive solutions based on Machine Learning can bring to businesses.

The Business Problem

  • Give sales and marketing departments the ability to interpret customer data and transaction history
  • Give customer service teams the ability to respond to customer needs in the best possible way and secure their loyalty

  • Avoid customer churn by identifying different behaviors and proposing preventive actions

  • Put the end customer at the center of the company's digital transformation

 

Benefits

  • Optimization and personalization of the customer relationship through this solution
  • The ability for teams to understand the customer and his or her future needs

  • Reduced risk of losing customer loyalty and more opportunities to convert prospects into customers

  • Agility in representing customer data and in decision making

 

From Operational CRM to Analytical and Predictive CRM

The evolution of CRM is at the heart of the digital transformation of companies, driven by the new demands and needs of their customers while also taking the changing competitive landscape into account. It is in this context that analytical and predictive technologies are moving the capabilities of "classic" operational CRM toward analytical and predictive CRM.

Although the concepts of operational CRM and analytical CRM are close to one another, operational CRM is used to build and manage business activities such as SFA (Sales Force Automation) or contact-center management, whereas analytical and predictive CRM provides high-value information-extraction capabilities that enable efficient, accelerated business decisions based on all of the customer's data and the known transaction history for that customer. The two concepts remain complementary, and an effective operational CRM feeds the analytical and predictive CRM both quantitatively and qualitatively.

Indeed, the data coming from the operational CRM is organized and accumulated, on the one hand, to respond as quickly and effectively as possible to identified customer needs and thereby increase sales; on the other hand, it is a mine of high-value information that can act as a lever when the analytical CRM identifies a customer behavior and an appropriate automatic response or solution. The CRM stores not only quantitative data but also non-quantitative data that is normally difficult to exploit. For example, it is possible to build a predictive model that offers recommendations to the customer based on his or her data, previous purchases, current contracts, or information searches. Another example based on the same principle would be a predictive model identifying the products likely to interest the customer, in order to offer a personalized package.

Use Cases for Analytical and Predictive CRM

 

Sales/Marketing

 

Personalizing the customer relationship during the sales phases is one of the most widespread applications of analytical CRM. By understanding a customer's interests and automatically determining which segment they belong to, it is possible to produce:

  • Targeted newsletters

  • A tailored purchase journey and experience

  • Targeted purchase recommendations

     

Another application of analytical and predictive CRM for sales is predicting the conversion of a prospect into a customer in order to:

  • Accelerate that conversion,

  • Present the right products/services to the right prospect at the right time

  • Reduce costs by not targeting low-potential prospects

 

 

After-Sales Service and Support

If a company does not pay attention to all the information reported by the customer, whatever the channel (phone, face-to-face, website, email, social networks), and does not connect that information together, the customer's engagement with the brand can deteriorate. Conversely, a company's ability to commit to a strong relationship with its customers can become a differentiator against the competition:

  • By understanding and anticipating customer requests. Customers are less patient than ever; they expect fast answers and relevant results.

  • By harmonizing the customer experience. A customer who performs an action or search on the company's website and then calls customer service with a follow-up question could instead be contacted proactively by that service to address the request.

  • By identifying customers who are losing interest in the offering

 

These use cases also show that speed has become a new currency of the customer relationship. Having the right information at the right time to make the right decision, and to anticipate future decisions, helps avoid customer churn and builds loyalty.

 

The diagram below illustrates the four business steps for enriching customer information through a Machine Learning strategy.

  1. Collecting information from operational and digital CRM solutions (social networks, etc.)

  2. Extracting this data and processing it through a predictive business-analysis model.

  3. Following this extraction, defining quantitative indicators and KPIs to optimize decision making

  4. Feeding the information computed by the predictive and analytical systems back into the existing operational CRM and digital marketing solutions.

 

Figure 1: The customer information enrichment process

 

Solution Overview

The solution presented in this article aims to:

  • Detect customer churn,

  • Identify customers ready to move up-market (upselling)

  • Generate additional sales (cross-selling)

 

This is made possible by building a predictive model from the data provided by the company's CRM. The system is then used to feed the CRM back in real time, over the lifecycle of the customer account, with the account's key churn indicator and its upselling and cross-selling probabilities.

The solution is built mainly around the following elements:

  • The CRM data source

  • Three scoring models based on the boosted decision tree algorithm (the algorithm that proved most effective when testing our three models on the data, as shown by the ROC curve and the AUC, the area under the curve)

 

The model below, built in Azure ML Studio, shows the classic scoring approach used:


Figure 2: Boosted decision tree principle

 

It comprises:

  • The dataset coming from the CRM,

  • The parameters used to score our three indicators,

  • The model tests

  • The model execution

 

Note that while the input data does need to be prepared, the Azure ML modeling tool offers all the capabilities of an ETL (Extract, Transform and Load) tool. In our case, since the data came directly from the CRM, only minimal preparation was required.

Once the model has been built – which is very easy in Azure ML – it is just as easy to generate a web service that calls it (using the Set as Publish Input and Set as Publish Output actions, followed by Publish Web Service), so that the model can be used directly from the CRM's customer forms.

By implementing the call to this web service from Dynamics CRM – whenever a customer record and its purchase history are created or updated in the CRM – the model is invoked with the customer's new data and returns the churn indicator along with the up-selling and cross-selling probabilities, which are then displayed to the user and stored on the customer record.
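As a rough, hypothetical sketch of such a plug-in (the custom field names, endpoint URL, API key and helper methods below are illustrative assumptions, not the actual implementation), the call pattern could look like this:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using Microsoft.Xrm.Sdk;

// Hypothetical plug-in registered on Create/Update of the account entity.
public class ChurnScoringPlugin : IPlugin
{
    public void Execute(IServiceProvider serviceProvider)
    {
        var context = (IPluginExecutionContext)serviceProvider.GetService(typeof(IPluginExecutionContext));
        if (!context.InputParameters.Contains("Target")) return;
        var account = context.InputParameters["Target"] as Entity;
        if (account == null) return;

        using (var client = new HttpClient())
        {
            // Placeholders: API key and scoring URL of the web service published from Azure ML Studio.
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", "<API key>");
            var response = client.PostAsync(
                "<scoring endpoint URL>",
                new StringContent(BuildScoreRequest(account))).GetAwaiter().GetResult();
            string body = response.Content.ReadAsStringAsync().GetAwaiter().GetResult();

            // Hypothetical custom fields on the account form that hold the three indicators.
            double[] scores = ParseScores(body);   // { churn, up-sell, cross-sell }
            account["new_churnscore"] = scores[0];
            account["new_upsellscore"] = scores[1];
            account["new_crosssellscore"] = scores[2];
        }
    }

    // Hypothetical helpers: serialize the account attributes into the web service's
    // input format, and parse the returned probabilities from its JSON response.
    private static string BuildScoreRequest(Entity account) { return "{}"; }
    private static double[] ParseScores(string json) { return new[] { 0.0, 0.0, 0.0 }; }
}

Registered in the pre-operation stage, such a plug-in would let the computed scores be saved with the record in the same operation.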

Figure 3: Customer account record in Dynamics CRM

 

Depending on the interaction with the customer, the user can then:

  • Offer the higher-end version of the product being sold,

  • Offer other items that complement the current offer,

  • Question and reassure the customer about their interest in the products and services sold, or even launch a marketing action.

 

Solution Architecture

The solution described above is architected as follows:

  • CRM data is collected from the Dynamics CRM application

  • The Machine Learning model is built and trained in Azure ML Studio

  • A REST web service is generated and published, along with documentation on how to use it

  • The Dynamics CRM forms carry fields dedicated to the indicators, which are updated by CRM plug-ins that call the web service when they are triggered

 

This solution can scale and adapt to the constraints of modern businesses, offering capabilities that can be implemented and deployed quickly.

It is built around a collection of Microsoft Windows Azure and Office 365 cloud services.

 

Figure 4: Orchestration of the solution architecture

 

Summary

This article presented a solution that builds on the many capabilities of the Microsoft Dynamics platform to offer sales organizations an analytical and predictive CRM solution – one that anticipates customer churn, up-selling and cross-selling, and is based on Machine Learning with Azure ML.

Products involved:

  • Dynamics CRM Server 2015 & Dynamics Marketing

  • CRM Online

  • Azure Machine Learning

 

Innovative Solutions with Microsoft Consulting Services

Microsoft has demonstrated its interest in Machine Learning and Big Data for many years, on the one hand in its products (speech recognition, Kinect, SQL Server Data Mining, Power BI, and more) and on the other hand through massive R&D investments dedicated to these topics, as well as partnerships with several universities and research institutes (INRIA, France; University of Trento, Italy; Barcelona Super Computing Centre, Spain). Applied to CRM, the result of these investments is a complete application and infrastructure offering that addresses the future challenges of tomorrow's businesses.

MCS architects and consultants are trained on Microsoft's entire application and infrastructure portfolio in order to offer their customers innovative solutions that address their operational challenges and support their digital transformation.

For more information on Microsoft Consulting Services packaged offerings, visit http://www.microsoft.com/france/services, where you will find the various CRM and Business Analytics offerings available.

For more information on the "L'innovation par les données" (data-driven innovation) blog series, see L'innovation par les données.

 

Stéphanie Monpain, CRM Consultant, Microsoft Consulting Services

A CRM consultant for 8 years, I joined Microsoft Services France in 2013.

My role is to support our business customers through their transformation process with Dynamics CRM. I work mainly with the banking and insurance sectors, but also with services and industrial companies.

 

Jérôme Coquidé, Data Insight and CRM Consultant, Microsoft Consulting Services

Initially a CRM consultant, I joined the Microsoft Services division in 2011 on the team dedicated to Dynamics solutions (CRM, ERP), which allowed me to grow my interest and skills in databases, reporting and business intelligence, and to naturally join the SQL/BI team in 2013.

I now work on business intelligence and reporting topics as well as data integration (data quality, master data management), without forgetting customer relationship management.

 

 

 

 

 

 


Data. Insights. Action. Listen. A TechNet Radio Discussion about the Upcoming PASS BA Conference


The PASS Business Analytics Conference 2015 is right around the corner!  Listen to some of our favorite Data and BI experts chat about this year’s event, April 20th – 22nd in Santa Clara, California.

Are you a key part of a team delivering business analytics and intelligence for your organization? Are you finding BA/BI becoming more a part of your day-to-day work life? Find out why anyone trying to stay ahead of the analytics curve and position their data career for success should attend the PASS Business Analytics Conference.

Tune in to TechNet Radio to find out about the 60+ how-to sessions, practical case studies with hands-on workshops and expert panels. This is a can’t miss event!

Neural Nets in Azure ML – Introduction to Net#


This blog post is authored by Alexey Kamenev, Software Engineer at Microsoft.

Neural networks are one of the most popular machine learning algorithms today. One of the challenges when using neural networks is how to define a network topology, given the variety of possible layer types, connections among them, and activation functions. Net# solves this problem by providing a succinct way to define almost any neural network architecture in a descriptive, easy-to-read format. This post provides a short tutorial on building a neural network with the Net# language to classify images of handwritten digits in Microsoft Azure Machine Learning.

It is useful to have basic knowledge of neural networks for this tutorial. The following links provide good starting points to catch up:

http://www.coursera.org/course/neuralnets

http://en.wikipedia.org/wiki/Artificial_neural_network

Let us start with a very simple one-hidden-layer neural network architecture. We'll walk through the "Sample Experiment – Digit Recognition (MNIST), Neural Net: 1 fully-connected hidden layer" experiment, which is included in the Samples list in every Azure ML workspace; you will need to sign up for our free trial to run this sample.

The network has 3 layers of neurons: an input layer of size 28*28 = 784, one hidden layer of size 100, and an output layer of size 10. The input layer is written as 28x28 because we train on the MNIST dataset, a collection of handwritten digit images in which each image is a 28x28 grayscale picture.

Here is the corresponding network definition in Net#:

input Picture [28, 28];
// Note that alternatively we could declare input layer as:
// input Picture [28 * 28];
// or just
// input Picture [784];
// Net# compiler will be able to infer the number of dimensions automatically.

// This defines a fully-connected (to the input layer 'Picture')
// hidden layer of size 100 with sigmoid activation function
// (which is a default activation function).
hidden H [100] from Picture all; 

// This defines an output layer of size 10 which is fully-connected to layer 'H',
// with softmax activation function.
output Result [10] softmax from H all;

To add a Net# definition to a neural network module in Azure ML, you drag the learner module onto the canvas (in this case “Multiclass Neural Network”) and in the properties window for the module, under “Hidden layer specification,” select “Custom definition script” from the dropdown list.  Then you will see the Neural Network definition script box in the properties window where you can enter your Net# definition. If you select the “Multiclass Neural Network” module in the sample experiment, you will see the following definition.

Using this topology, you can run a simple experiment using the default values for learning rate and initial weight diameter, while reducing the number of iterations to 30 for faster training. The experiment should run in less than 2 minutes and yield an accuracy of 97.7% (or 2.3% error), which is not bad given such a simple net and short training time.

Net#’s lexical grammar and rules are very similar to those of C#/C++. For example:

  • Net# is case sensitive.

  • Net# supports standard C#/C++ comments.

  • Net# constant literals are similar to C#, including decimal and hexadecimal integer literals, floating point literals, and string literals (including verbatim string literals) with escape sequence support.

  • Prefixing a keyword with the @ character makes it a normal identifier.

The language also supports various types of layers which will be described in subsequent posts.

Once you have the basic experiment in place, you can try playing with the network and the algorithm parameters to improve your results.  For example, what happens if you:

  1. Change the number of nodes in the hidden layer H? Does it change your accuracy if you use 200 nodes? Or 1000?

  2. Change parameters like learning rate, initial weights diameter and number of iterations?

You can easily add more layers, resulting in a more complex neural network. For example, to define a fully-connected net with two hidden layers, use the following Net# script:

input Picture [28, 28];

hidden H1 [200] from Picture all; 

// Note that H2 is fully connected to H1.
hidden H2 [200] from H1 all; 

// Output layer is now fully connected to H2.
output Result [10] softmax from H2 all;

If you train this "deeper" net, which should take about 4 minutes (30 iterations), you should get an accuracy of about 98.1% (1.9% error), which is certainly better than our previous single-hidden-layer net. Note that you might not get exactly the same results if you haven't fixed your random seed, but they should be close to those shown above.

In addition to changing the layers, changing various parameters of the network and observing results may be an interesting exercise and may improve the results.   

In subsequent posts, we will cover more advanced topics, such as activation functions, and different layer types:  sparse and convolutional. A guide to Net# is also available in case you want to get an overview of most important features of Net#.

Please do not hesitate to ask questions or share your thoughts – we value your opinion – and enjoy training the nets in Azure ML!

Alexey

 

Our new engineering blog post

In addition to this blog, which is primarily intended to announce releases for the different SQL Server versions (CUs, SPs, etc.), we have started a new blog that will mainly target technical content. You can follow this blog at http://blogs.msdn.com/b/sql_server_team...(read more)

Cumulative Update #6 for SQL Server 2014 RTM

Dear Customers, The 6th cumulative update release for SQL Server 2014 RTM is now available for download at the Microsoft Support site. To learn more about the release or servicing model, please visit: CU#6 KB Article: http://support.microsoft...(read more)

Big Learning Made Easy – with Counts!


This post is by Misha Bilenko, Principal Researcher in Microsoft Azure Machine Learning.

This week, Azure ML is launching an exciting new capability for training on terabytes of data. It is based on a surprisingly simple yet amazingly robust learning algorithm that is widely used by practitioners, yet receives virtually no dedicated attention in ML literature or courses. At the same time, the algorithm has been mentioned in passing in numerous papers across many fields, dating back to literature on branch prediction in compilers from the early 1990s. It is also the workhorse for several multi-billion-dollar applications of ML, such as online advertising and fraud detection. Known under different names – 'historical statistics', 'risk tables', 'CTR features' – it retains the same core technique under the covers of different applications. In this blog post, we introduce the general form of this learning with counts method (which we call 'Dracula' – to avoid saying "Distributed Robust Algorithm for Count-based Learning" each time, and to honor this children's classic). We illustrate its use in Azure ML, where it allows learning to scale to terabytes of data with just a few clicks, and summarize the aspects that make it a great choice for practitioners.

In many prediction problems, the most informative data attributes are identities of objects, as they can be directly associated with the historical statistics collected for each object. For example, in click probability prediction for online advertising, key objects are the anonymous user, the ad, and the context (e.g., a query or a webpage). Their identities may be captured by such attributes as the user’s anonymous ID, a machine’s IP, an identifier for the ad or advertiser, and the text or hash for the query or webpage URL.

Identity attributes, as well as their combinations, hold enormous predictive power: e.g. given a user’s historical propensity to react to ads, the likelihood that a given ad is clicked for a particular query, the tendency to click on a certain advertiser’s ads from a particular location etc. While attribute values can be encoded as one-of-many (also known as one-hot) binary features, this results in very high-dimensional representation, which, when training with terabytes of data, practically limits one to use linear models. Although using attribute combinations can mitigate the paucity of linear representation, the resulting high-dimensional parameters remain difficult to interpret or monitor.

An attractive alternative that allows easy inspection of the model is to directly associate each attribute or combination with the likelihoods (or propensities) mentioned above. Computing these conditional probabilities requires aggregating past data in a very simple data structure:  a set of count tables, where each table associates an attribute or combination with its historical counts for each label value. The following figure illustrates two count tables in the online advertising domain:

User       Number of clicks    Number of non-clicks
Alice      7                   134
Bob        17                  735
Joe        2                   274

QueryHash, AdDomain    Number of clicks    Number of non-clicks
598fd4fe, foo.com      8465                28216
50fa3cc0, bar.org      497                 10984
437a45e1, qux.net      6                   23

These tables make it easy to compute the click probability for each seen object or combination. In the example above, user Bob clicks on ads 17/(17+235)=6.75% of the time, while users entering the query with hash 598fd4fe click on the ad from foo.com a whopping 30% of the time (which becomes less surprising once one realizes that 598fd4fe is the hash for the query foo – in which case the ad is likely 'navigational', providing a shortcut to the most-desired search result for the query).

These tables may remind the reader of Naïve Bayes, the classic learning algorithm that multiplies conditional probabilities for different attributes assuming their independence. While the simplicity of multiplying historical estimates is attractive, it can produce less-than-desirable accuracy when the independence assumptions are violated – as they inevitably are when the same object is involved in computing multiple estimates for different combinations. While Bayesian Networks extend Naïve Bayes by explicitly modeling relationships between attributes, they require encoding the dependency structure (or learning it), and often demand computationally expensive prediction (inference).

A slight departure from the rigor of Bayesian methods reveals a more general algorithm that maintains the simplicity of aggregating count tables, yet provides the freedom to utilize state-of-the-art algorithms such as boosted trees or neural networks to maximize the overall predictive accuracy. Instead of multiplying probabilities obtained from each table, one can choose to treat them as features – along with the raw per-label counts (or any transform thereof).  

Going back to the example above, the feature vector for estimating Bob’s probability to click on the bar.org ad for query 50fa3cc0 is {17; 735; 0.0675; 497; 10984; 0.045} – simply a concatenation of the counts and resulting probabilities. This representation can be used to train a powerful discriminative learner (such as boosted trees), which has the capacity to take into account all conditional probabilities, as well as the strength of evidence behind them (corresponding to actual counts). As done in Bayesian methods, probabilities are typically adjusted via standard smoothing techniques, such as adding pseudocounts based on priors.
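To make this concrete, here is a minimal, illustrative C# sketch of the featurization step (the table layout, the smoothing pseudocounts alpha and beta, and the Featurize helper are assumptions made for illustration, not the actual Count Featurizer implementation):

using System;
using System.Collections.Generic;

class CountFeaturizerSketch
{
    // One count table: attribute value -> (clicks, non-clicks).
    // In practice there is one such table per attribute or attribute combination.
    static readonly Dictionary<string, Tuple<long, long>> UserTable =
        new Dictionary<string, Tuple<long, long>>
        {
            { "Alice", Tuple.Create(7L, 134L) },
            { "Bob",   Tuple.Create(17L, 735L) },
            { "Joe",   Tuple.Create(2L, 274L) }
        };

    // Turn raw counts into features: the counts themselves plus a smoothed click
    // probability. alpha and beta are pseudocounts encoding a prior click rate
    // (the values here are illustrative assumptions).
    static double[] Featurize(string user, double alpha = 1.0, double beta = 30.0)
    {
        Tuple<long, long> counts;
        if (!UserTable.TryGetValue(user, out counts))
            counts = Tuple.Create(0L, 0L);   // unseen value: fall back to zero counts

        long clicks = counts.Item1, nonClicks = counts.Item2;
        double smoothedProbability = (clicks + alpha) / (clicks + nonClicks + alpha + beta);
        return new[] { (double)clicks, (double)nonClicks, smoothedProbability };
    }

    static void Main()
    {
        // The vectors produced from each count table are concatenated and fed to a
        // downstream learner such as boosted decision trees.
        Console.WriteLine(string.Join("; ", Featurize("Bob")));
    }
}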

One may wonder how this algorithm can work in domains where millions or billions of objects exist: wouldn’t we need extremely large tables, particularly for combinations that may be very rare? The method scales via two approaches that preserve most useful statistical information by combining rare-item counts deterministically or statistically: compressing the tables either via back-off, or via count-min sketches. Back-off is the simpler of the two: it involves including only rows that have some minimal number of counts, also keeping a single “back-off bin” where the counts for the rest are combined. An additional binary feature can then be added to the representation to differentiate between attribute values for which counts were stored, versus those that were looked up in the back-off bin. The alternative to back-off is count-min sketch– an elegant technique that was invented just over a decade ago. It stores the counts for each row in multiple hash tables indexed by independent hashing functions. The impact of collisions is reduced by taking the smallest of counts retrieved via the different hashes.
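As an illustrative sketch of the count-min idea (the hash function, width and depth below are arbitrary choices, not Azure ML's implementation), each key is counted in several rows of counters and the estimate is the minimum across rows:

using System;

class CountMinSketch
{
    private readonly long[,] counts;   // one row of counters per hash function
    private readonly int width;
    private readonly int depth;

    public CountMinSketch(int width = 1 << 20, int depth = 4)
    {
        this.width = width;
        this.depth = depth;
        counts = new long[depth, width];
    }

    // Simple seeded hash; a real implementation would use stronger independent hashes.
    private int Hash(string key, int seed)
    {
        unchecked
        {
            int h = 17 + seed * 31;
            foreach (char c in key) h = h * 31 + c;
            return (int)((uint)h % (uint)width);
        }
    }

    // Increment the counter for 'key' in every row.
    public void Increment(string key, long by = 1)
    {
        for (int d = 0; d < depth; d++)
            counts[d, Hash(key, d)] += by;
    }

    // Estimate the count by taking the minimum across rows,
    // which limits the over-count caused by hash collisions.
    public long Estimate(string key)
    {
        long min = long.MaxValue;
        for (int d = 0; d < depth; d++)
            min = Math.Min(min, counts[d, Hash(key, d)]);
        return min;
    }
}

In the counting context above, one such sketch (or one counter array per label value) would be maintained for each attribute or combination being counted.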

Next, we turn to Azure ML to summarize the algorithm and illustrate its basic use, as shown in the following screenshot from Azure ML Studio:

Data is input via the Reader module and passed to the Build Count Table module, which aggregates the statistics for specified attributes. The tables (along with metadata for smoothing) are then sent to the Count Featurizer module, which injects the count-based statistics and passes the resulting data representation to train boosted trees downstream. We note that the algorithm is not limited to binary classification: one can use it for regression or multi-class classification as well – the only difference being that instead of two count cells, more are needed, corresponding to discretization of the numeric label or multiple categories, respectively.

We conclude this post by summarizing the key benefits of learning with counts that made it so popular in the industry.  

  • First and foremost, the algorithm is simple to understand, inspect and monitor: whenever one desires to perform root-cause analysis for a particular prediction, they can directly examine the evidence supplied by the count tables for each attribute or combination. 

  • Second, the algorithm is highly modular in multiple aspects. It provides modeling modularity: one can experiment with different downstream learners utilizing the same tables, or construct multiple tables using different back-off parameters. It also provides modularity with respect to data: counts aggregated from multiple slices of data or different time periods can be added (or subtracted to remove problematic data).

  • It also provides elasticity to high data cardinality by automatically compressing tables using either back-off or count-min sketches.  

  • Finally, the algorithm is very well-suited for scaling out, as the count table aggregation is a classic map-reduce computation that can be easily distributed.  

In follow-up posts, we will discuss how the algorithm can be scaled out via Azure ML’s integration with HDInsight (Hadoop) to easily learn from terabytes of data, as well as techniques for avoiding overfitting and label leakage.

Misha
