Channel: Data Platform

Cloudera Selects Azure as a Preferred Cloud Platform


We are working to make Azure the best cloud platform for big data, including Apache Hadoop. To accomplish this, we deliver a comprehensive set of solutions such as our Hadoop-based solution Azure HDInsight and managed data services from partners, including Hortonworks. Last week, Hortonworks announced the most recent milestone in our partnership, and yesterday we announced even more data options for our Azure customers through a partnership with Cloudera.

Cloudera is recognized as a leader in the Hadoop community, and that’s why we’re excited Cloudera Enterprise has achieved Azure Certification. As a result of this certification, organizations will be able to launch a Cloudera Enterprise cluster from the Azure Marketplace starting October 28. Initially, this will be an evaluation cluster with access to MapReduce, HDFS and Hive. At the end of this year when Cloudera 5.3 releases, customers will be able to leverage the power of the full Cloudera Enterprise distribution including HBase, Impala, Search, and Spark.

We’re also working with Cloudera to ensure greater integration with Analytics Platform System, SQL Server, Power BI and Azure Machine Learning. This will allow organizations to build big data solutions quickly and easily by using the best of Microsoft and Cloudera, together. For example, Arvato Bertelsmann was able to help clients cut fraud losses in half and speed up credit calculations by 1,000x.

Our partnership with Cloudera allows customers to use the Hadoop distribution of their choice while getting the cloud benefits of Azure. It is also a sign of our continued commitment to make Hadoop more accessible to customers by supporting the ability to run big data workloads anywhere – on hosted VMs and managed services in the public cloud, on-premises, or in hybrid scenarios.

From Strata in New York to our recent news from San Francisco, it’s exciting times ahead for those in the data space. We hope you’ll join us for this ride!

Eron Kelly
General Manager, Data Platform


EF7 - What Does “Code First Only” Really Mean?


A while back we blogged about our plans to make EF7 a lightweight and extensible version of EF that enables new platforms and new data stores. We also talked about our EF7 plans in the Entity Framework session at TechEd North America.

Prior to EF7 there are two ways to store models: in the xml-based EDMX file format, or in code. Starting with EF7 we will be retiring the EDMX format and having a single code-based format for models. A number of folks have raised concerns around this move, and most of them stem from a misunderstanding about what a statement like “EF7 will only support Code First” really means.

 

Code First is a bad name

Prior to EF4.1 we supported the Database First and Model First workflows. Both of these use the EF Designer to provide a boxes-and-lines representation of a model that is stored in an xml-based .edmx file. Database First reverse engineers a model from an existing database and Model First generates a database from a model created in the EF Designer.

In EF4.1 we introduced Code First. Understandably, based on the name, most folks think of Code First as defining a model in code and having a database generated from that model. In actual fact, Code First can be used to target an existing database or generate a new one. There is tooling to reverse engineer a Code First model based on an existing database. This tooling originally shipped in the EF Power Tools and then, in EF6.1, was integrated into the same wizard used to create EDMX models.

Another way to sum this up is that rather than a third alternative to Database & Model First, Code First is really an alternative to the EDMX file format. Conceptually, Code First supports both the Database First and Model First workflows.

Confusing… we know. We got the name wrong. Calling it something like “code-based modeling” would have been much clearer.

 

Is code-based modeling better?

Obviously there is overhead in maintaining two different model formats. But aside from removing this overhead, there are a number of other reasons that we chose to just go forward with code-based modeling in EF7.

  • Source control merging, conflicts, and code reviews are hard when your whole model is stored in an xml file. We’ve had lots of feedback from developers that simple changes to the model can result in complicated diffs in the xml file. On the other hand, developers are used to reviewing and merging source code.
  • Developers know how to write and debug code. While a designer is arguably easier for simple tasks, many projects end up with requirements beyond what you can do in the designer. When it comes time to drop down and edit things, xml is hard and code is more natural for most developers.
  • The ability to customize the model based on the environment is a common requirement we hear from customers. This includes scenarios such as multi-tenant database where you need to specify a schema or table prefix that is known when the app starts. You may also need slight tweaks to your model when running against a different database provider. Manipulating an xml-based model is hard. On the other hand, using conditional logic in the code that defines your model is easy.
  • Code-based modeling is less repetitive because your CLR classes also make up your model and there are conventions that take care of common configuration. For example, consider a Blog entity with a BlogId primary key. In EDMX-based modeling you would have a BlogId property in your CLR class, a BlogId property (plus column and mapping) specified in xml and some additional xml content to identify BlogId as the key. In code-based modeling, having a BlogId property on your CLR class is all that is needed (see the sketch just after this list).
  • Providing useful errors is also much easier in code. We’ve all seen the “Error 3002: Problem in mapping fragments starting at line 46:…” errors. The error reporting on EDMX could definitely be improved, but throwing an exception from the line of code-based configuration that caused an issue is always going to be easier.
    We should note that in EF6.x you would sometimes get these unhelpful errors from the Code First pipeline; this is because it was built over the infrastructure designed for EDMX. In EF7 this is not the case.
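To make the convention point above concrete, here is a minimal sketch of a code-based model in the style used throughout this post. The class and property names are illustrative; by convention BlogId and PostId are picked up as primary keys and Post.BlogId as a foreign key, with no xml and no explicit configuration.

using System.Collections.Generic;
using Microsoft.Data.Entity; // EF7 preview namespace; use System.Data.Entity for EF6.x

public class Blog
{
    public int BlogId { get; set; }        // primary key by convention
    public string Name { get; set; }
    public string Url { get; set; }
    public List<Post> Posts { get; set; }
}

public class Post
{
    public int PostId { get; set; }        // primary key by convention
    public string Title { get; set; }
    public int BlogId { get; set; }        // foreign key to Blog by convention
}

public class BloggingContext : DbContext
{
    public DbSet<Blog> Blogs { get; set; }
    public DbSet<Post> Posts { get; set; }
}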

There is also an important feature that could have been implemented for EDMX, but was only ever available for code-based models.

  • Migrations allows you to create a database from your code-based model and evolve it as your model changes over time. For EDMX models you could generate a SQL script to create a database to match your current model, but there was no way to generate a change script to apply changes to an existing database.

 

So, what will be in EF7?

In EF7 all models will be represented in code. There will be tooling to reverse engineer a model from an existing database (similar to what’s available in EF6.x). You can also start by defining the model in code and use migrations to create a database for you (and evolve it as your model changes over time).

We should also note that we’ve made some improvements to migrations in EF7 to resolve the issues folks encountered trying to use migrations in a team environment.

 

What about…

We’ve covered all the reasons we think code-based modeling is the right choice going forward, but there are some legitimate questions this raises.

What about visualizing the model?

The EF Designer was all about visualizing a model and in EF6.x we also had the ability to generate a read-only visualization of a code-based model (using the EF Power Tools). We’re still considering what is the best approach to take in EF7. There is definitely value in being able to visualize a model, especially when you have a lot of classes involved.

With the advent of Roslyn, we could also look at having a read/write designer over the top of a code-based model. Obviously this would be significantly more work and it’s not something we’ll be doing right away (or possibly ever), but it is an idea we’ve been kicking around.

What about the “Update model from database” scenario?

“Update model from database” is a process that allows you to incrementally pull additional database objects (or changes to existing database objects) into your EDMX model. Unfortunately the implementation of this feature wasn’t great, and you would often end up losing customizations you had made to the model, or having to manually fix up some of the changes the wizard tried to apply (often dropping down to hand-editing the xml).

For Code First you can re-run the reverse engineer process and have it regenerate your model. This works fine in basic scenarios, but you have to be careful how you customize the model; otherwise your changes will get reverted when the code is regenerated. There are some customizations that are difficult to apply without editing the scaffolded code.

Our first step in EF7 is to provide a similar reverse engineer process to what’s available in EF6.x – and that is most likely what will be available for the initial release. We do also have some ideas around pulling in incremental updates to the model without overwriting any customization to previously generated code. These range from only supporting simple additive scenarios, to using Roslyn to modify existing code in place. We’re still thinking through these ideas and don’t have definite plans as yet.

What about my existing models?

We’re not trying to hide the fact that EF7 is a big change from EF6.x. We’re keeping the concepts and many of the top level APIs from past versions, but under the covers there are some big changes. For this reason, we don’t expect folks to move existing applications to EF7 in a hurry. We are going to be continuing development on EF6.x for some time.

We have another blog post coming shortly that explores how EF7 is part v7 and part v1 and the implications this has for existing applications.

 

Is everyone going to like this change?

We’re not kidding ourselves, it’s not possible to please everyone and we know that some folks are going to prefer the EF Designer and EDMX approach over code-based modeling.

At the same time, we have to balance the time and resources we have and deliver what we think is the best set of features and capabilities to help developers write successful applications. This wasn’t a decision we took lightly, but we think it’s the best thing to do for the long-term success of Entity Framework and its customers – the ultimate goals being to provide a faster, easier to use stack and reduce the cost of adding support for highly requested features as we move forward.

Cumulative Update #4 for SQL Server 2014 RTM

Dear Customers, The 4th cumulative update release for SQL Server 2014 RTM is now available for download at the Microsoft Support site. To learn more about the release or servicing model, please visit: CU#4 KB Article: http://support.microsoft...(read more)

Video - Joseph Sirosh Interview with theCUBE at BigDataNYC 2014


Joseph Sirosh was recently interviewed in NYC by Dave Vellante and Jeff Frick on theCUBE. He covers a lot of ground, including suggestions for aspiring data scientists, the great opportunity in the Azure Marketplace, and the future of machine learning and Azure ML. Check out the video below.

ML Blog Team

Embracing Uncertainty – the Role of Probabilities


This is the first of a 2-part blog post by Chris Bishop, Distinguished Scientist at Microsoft Research

Almost every application of machine learning (ML) involves uncertainty. For example, if we are classifying images according to the objects they contain, some images will be difficult to classify, even for humans. Speech recognition too, particularly in noisy environments, is notoriously challenging and prone to ambiguity. Deciding which movies to recommend to a user, or which web page they are searching for, or which link they will click on next, are all problems where uncertainty is inevitable.

Quantifying Uncertainty

Uncertainties are a source of errors, and we are therefore tempted to view uncertainty as a problem to be avoided. However, the best way to handle uncertainty is to approach it head on and treat it as a first-class citizen of the ML world.

To do this we need a mathematical basis for quantifying and manipulating uncertain quantities, and this is provided by probability theory. We often think of probabilities in terms of the rate of occurrence of a particular event. For example, we say that the probability of a coin landing heads is 50% (or 0.5) if the fraction of heads in a long series of coin flips is a half. But we also need a way to handle uncertainty for events which cannot be repeated many times. For example, our particular coin might be badly bent in which case there is no reason to be sure that heads and tails are equally likely. The rate at which it will land heads is itself an uncertain quantity, and yet there is only one instance of this bent coin. This more general problem of quantifying uncertainty in a consistent way has been studied by many researchers, and although various different schemes have been proposed it turns out that they are all equivalent to probability theory.

Image Classification Example

To see how probabilities can be valuable in practice, let’s consider a simple example. Suppose we have been asked to build a system to detect cancer as part of a mass screening programme. The system will take medical images (for instance X-rays or MRI images) as input and will provide as output a decision on whether or not the patient is free of cancer. The judgement of human experts will be treated as ‘ground truth’ and our goal is to automate their expertise to allow screening on a mass scale. We will also imagine that we have been supplied with a large number of training images each of which has been labelled as normal or cancerous by a human expert.

The simplest approach would be to train a classifier, such as a neural network, to assign each new image to either ‘cancer’ or ‘normal’. While this appears to offer a solution to our problem, we can do much better by training our neural network instead to output the probability that the image represents cancer (this is called the inference step), and then subsequently using this probability to decide whether to assign the image to the normal class or the cancer class (this is called the decision step). 

If our goal is to misclassify as few images as possible, then decision theory tells us that we should assign a new image to the class for which our neural network assigns the higher probability. For instance, if the neural network says the probability that a particular image represents cancer is 30% (and therefore that the probability that it is normal is 70%) then the image would be classified as normal. At this point our two-stage approach is equivalent to a simple classifier.

Minimising Costs

Of course we would not feel happy with a screening programme that gave an all-clear to someone with a 30% chance of having cancer, because an error in which cancer is mis-classified as normal is far more costly (the patient could develop advanced cancer before it is detected) than an error in which a normal image is mis-classified as cancer (the image would be sent to a human expert whose time would be wasted assessing it). In a screening programme we might therefore require the probability of cancer to be lower than some very low threshold, say 1%, before we are willing to allow the system to classify an image as normal. This can be formalised by introducing cost values for the two types of mis-classification. Decision theory then provides a simple procedure for classifying an image so as to minimise the average cost, given the probabilities for the two classes.

The key point is that if there are changes to the cost values, for example due to a change in the cost of human time to assess images, then only the trivial decision step needs to be changed. There is no need to repeat the complex and expensive process of retraining the neural network.
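As a rough sketch of that decision step (the cost values below are purely illustrative, not taken from any real screening programme), the class with the lower expected cost is chosen, and only this tiny function needs to change when the costs change:

// Decision step only: the trained model supplies pCancer; the cost values are illustrative.
static string Decide(double pCancer)
{
    const double costMissedCancer = 100.0;  // cancer image classified as normal
    const double costFalseAlarm = 1.0;      // normal image sent to a human expert

    double expectedCostIfNormal = pCancer * costMissedCancer;
    double expectedCostIfCancer = (1 - pCancer) * costFalseAlarm;

    return expectedCostIfNormal <= expectedCostIfCancer ? "normal" : "cancer";
}

With these example costs an image is only labelled normal when the probability of cancer is below roughly 1% (1/101), matching the kind of threshold discussed above.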

Cancer is Rare

In screening programmes we are typically looking for rare events. Let’s suppose that only 1 in 1,000 of the people being screened in our example have cancer. If we collected 10,000 images at random and hand-labelled them then typically only around 10 of those images would represent cancer – hardly enough to characterise the wide variability of cancer images. A better approach is to balance the classes and have, say, 5,000 images each of normal and cancer. Again decision theory tells us how to take the probabilities produced by a network trained on balanced classes and correct those probabilities to allow for the actual frequency of cancer in the population. Furthermore, if the system is applied to a new population with a different background rate of cancer, the decision step can trivially be modified, again without needing to retrain the neural network.
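The correction for class priors can also be written down directly. Here is a sketch, assuming a model trained on a 50/50 balanced set and a hypothetical true rate of 1 in 1,000: each balanced-set probability is re-weighted by the ratio of the true prior to the training prior (Bayes’ rule) and then renormalised.

// Re-weight a probability from a model trained on balanced classes so that it
// reflects the true prevalence of cancer in the population being screened.
static double CorrectForPrior(double pCancerBalanced,
                              double trueCancerRate = 0.001,    // e.g. 1 in 1,000
                              double trainingCancerRate = 0.5)  // balanced training set
{
    double wCancer = pCancerBalanced * (trueCancerRate / trainingCancerRate);
    double wNormal = (1 - pCancerBalanced) * ((1 - trueCancerRate) / (1 - trainingCancerRate));
    return wCancer / (wCancer + wNormal);
}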

Incidentally, failure to take account of these so-called prior class probabilities for rare events lies at the heart of the prosecutor’s fallacy, a statistical blunder which has been responsible for numerous major miscarriages of justice in the court room.

Improving Accuracy by Rejection

Finally, in our screening example we might imagine a further improvement to the system in which the neural network only classifies the ‘easy’ cases, and rejects those images for which there is significant ambiguity. The downside is that a human must then examine the rejected images, but we would expect that the performance of the neural network on the remaining examples would be improved. This intuition turns out to be correct, and decision theory tells us that we should reject any image for which the higher class probability is below some threshold. By changing this threshold we can change the fraction of images which are rejected, and hence optimise the trade-off between improving system performance and minimizing human effort.
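And a sketch of the reject option, with a purely illustrative threshold: the system only makes an automatic decision when the larger class probability is confident enough, and refers everything else to a human expert.

static string ClassifyOrReject(double pCancer, double rejectThreshold = 0.95)
{
    double maxClassProbability = System.Math.Max(pCancer, 1 - pCancer);
    if (maxClassProbability < rejectThreshold)
        return "reject - refer to a human expert";  // ambiguous case
    return pCancer > 0.5 ? "cancer" : "normal";
}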

Probabilities Everywhere

We have seen some of the numerous benefits of training classifiers to generate probabilities rather than simply make decisions. Furthermore, many other ML tasks can benefit from a probabilistic approach, including regression, clustering, recommendation, and forecasting. You can find out more about how to use probabilities in ML in Chapter 1 of Pattern Recognition and Machine Learning. You can also try out the Two-Class Bayes Point Machine Classifier in the Microsoft Azure Machine Learning service as an example of a classifier that generates probabilities rather than decisions.

Next week we’ll see how to take probabilities to the next stage and use them to describe the uncertainty in the parameters of our learning model itself.

Chris Bishop
Learn about my research

How to Hadoop: 4 Resources to Learn and Try Cloud Big Data


Are you curious about how to begin working with big data using Hadoop? Perhaps you know you should be looking into big data analytics to power your business, but you’re not quite sure about the various big data technologies available to you, or you need a tutorial to get started.  

  1. If you want a quick overview on why you should consider cloud Hadoop: read this short article from MSDN Magazine that explores the implications of combining big data and the cloud and provides an overview of where Microsoft Azure HDInsight sits within the broader ecosystem.

  2. If you’re a technical leader who is new to Hadoop: check out this webinar about Hadoop in the cloud, and learn how you can take advantage of the new world of data and gain insights that were not possible before.

  3. If you’re on the front lines of IT or data science and want to begin or expand your big data capabilities: check out the ‘Working with big data on Azure’ Microsoft Press eBook, which provides an overview of the impact of big data on businesses and a step-by-step guide for deploying Hadoop clusters and running MapReduce in the cloud, and covers several use cases and helpful techniques.

  4. If you want a deeper tutorial for taking your big data capabilities to the next level: master the ins and outs of Hadoop for free on the Microsoft Virtual Academy with this ‘Implementing Big Data Analysis’ training series.

What question do you have about big data or Hadoop? Are there any other resources you might find helpful as you learn and experiment? Let us know. And if you haven’t yet, don’t forget to claim your free one-month Microsoft Azure trial.

EF7 – v1 or v7?


A while ago we blogged about EF7 targeting new platforms and new data stores. In that post we shared that our EF6.x code base wasn’t setup to achieve what we wanted to in EF7, and that EF7 would be a “lightweight and extensible version of EF”.

That raises the question: is EF7 the next version of EF, or is it something new? Before we dig into the answer, let’s cover exactly what’s the same and what’s changing.

 

What’s staying the same?

When it comes to writing code, most of the top level experience is staying the same in EF7.

  • You still create a class that derives from DbContext and has DbSet properties for each type in your model.
  • You still use LINQ to write queries against your DbSet properties.
  • You still Add and Remove instances of types from your DbSet properties.
  • There are still DbContext.ChangeTracker and DbContext.Database properties for accessing change tracking information and database related APIs.

An example

For example, this code looks exactly the same in EF6.x and EF7.

using (var db = new BloggingContext())
{
    db.Blogs.Add(new Blog { Url = "blogs.msdn.com/adonet" });
    db.SaveChanges();

    var blogs = from b in db.Blogs.Include(b => b.Posts)
                orderby b.Name
                select b;

    foreach (var blog in blogs)
    {
        Console.WriteLine(blog.Name);
        foreach (var post in blog.Posts)
        {
            Console.WriteLine(" -" + post.Title);
        }
    }
}

 

What’s changing?

While the top level API remains the same (or very similar), EF7 does also include a number of significant changes. These changes can be grouped into a series of buckets.

Bucket #1: New Features

One of the key motivations behind EF7 is to provide a code base that will allow us to more quickly add new features. While many of these will come after the initial RTM, we have been able to easily implement some of them as we build out the core framework.

Some examples of features already added to EF7 include:

  • Batching of updates for relational databases means that EF7 no longer sends an individual command for every insert/update/delete statement. In many situations EF7 will batch multiple statements together into a single roundtrip to the database. We’ll expand the capabilities of batching in future releases too.
  • Unique constraints allow you to identify additional unique keys within your entities in addition to the primary key. You can then use these alternate keys as the target of foreign key relationships (see the sketch just after this list).
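Here is a sketch of how an alternate key might be configured in the fluent API. The method names shown (HasAlternateKey, HasPrincipalKey, and so on) are the ones that later stabilised in EF Core, so the exact surface in the EF7 previews may differ; the Post.Blog navigation and Post.BlogUrl property are hypothetical additions used only for illustration.

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Blog.Url becomes an alternate (unique) key alongside the BlogId primary key.
    modelBuilder.Entity<Blog>()
        .HasAlternateKey(b => b.Url);

    // A relationship can then target the alternate key instead of the primary key.
    modelBuilder.Entity<Post>()
        .HasOne(p => p.Blog)
        .WithMany(b => b.Posts)
        .HasForeignKey(p => p.BlogUrl)   // hypothetical string property on Post
        .HasPrincipalKey(b => b.Url);
}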
 

Bucket #2: Behavior Changes

EF6 and earlier releases have some unintuitive behavior in the top level APIs. While the APIs are staying the same, we are taking the opportunity to remove some limitations and choose more expected behavior.

An example

An example of this is how queries are processed. In EF6.x the entire LINQ query was translated into a single SQL query that was executed in the database. This meant your query could only contain things that EF knew how to translate to SQL and you would often get complex SQL that did not perform well.

In EF7 we are adopting a model where the provider gets to select which bits of the query to execute in the database, and how they are executed. This means that queries can now evaluate parts of the query on the client rather than in the database. It also means providers can make use of queries with multiple result sets, etc., rather than creating one single SELECT with everything in it.
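As a sketch of what this enables (StandardizeUrl is a hypothetical helper with no SQL translation), the provider can run the translatable filter in the database and evaluate the remaining projection on the client rather than failing outright:

var blogs = db.Blogs
    .Where(b => b.Name.StartsWith("Azure"))                     // translated to SQL, runs in the database
    .Select(b => new { b.Name, Slug = StandardizeUrl(b.Url) })  // no SQL translation, evaluated on the client
    .ToList();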

 

Bucket #3: Simple, lightweight components

Under the covers EF7 is built over the top of a lighter weight and more flexible set of components. Many of these provide the same functionality as components from EF6.x, but are designed to be faster, easier to use, and easier to replace or customize. To achieve this they are factored differently and bear varying degrees of resemblance to their counterparts from EF6.x.

An example

A good example of this is the metadata that EF stores about your entity types and how they map to the data store. The MetadataWorkspace from EF6.x (and earlier versions) was a complex component with a difficult API. MetadataWorkspace was not built with a lightweight and performant O/RM in mind, and achieving basic tasks is difficult. For example, here is the code to find out which table the Blog entity type is mapped to:

using (var context = new BloggingContext())
{
    var metadata = ((IObjectContextAdapter)context).ObjectContext.MetadataWorkspace;
    var objectItemCollection = ((ObjectItemCollection)metadata.GetItemCollection(DataSpace.OSpace));

    var entityType = metadata
        .GetItems<EntityType>(DataSpace.OSpace)
        .Single(e => objectItemCollection.GetClrType(e) == typeof(Blog));

    var entitySet = metadata
        .GetItems<EntityContainer>(DataSpace.CSpace).Single()
        .EntitySets
        .Single(s => s.ElementType.Name == entityType.Name);

    var mapping = metadata.GetItems<EntityContainerMapping>(DataSpace.CSSpace).Single()
        .EntitySetMappings
        .Single(s => s.EntitySet == entitySet);

    var table = mapping
        .EntityTypeMappings.Single()
        .Fragments.Single()
        .StoreEntitySet;

    var tableName = (string)table.MetadataProperties["Table"].Value ?? table.Name;
}

In EF7 we are using a metadata model that is simple to use and purpose built for the needs of Entity Framework. To highlight this point, here is the EF7 code to achieve the same thing as the EF6.x code listed above.

using (var db = new BloggingContext())
{
    var tableName = db.Model.GetEntityType(typeof(Blog)).Relational().Table;
}

 

Bucket #4: Removal of some features

Removing features is always a tough decision, and not something we take lightly. Given the major changes in EF7 we have identified some features that we will not be bringing forward.

Most of the features not coming forward in EF7 are legacy features that are only used by a very small number of developers.

  • Multiple Entity Sets per Type (MEST) is a legacy feature that allows you to use the same CLR type for multiple entity sets (i.e. you have Products and RetiredProducts tables that are both mapped to the Product class). This feature was never supported through the DbContext API or code-based models. Although possible, it was difficult to use from the EF Designer too. Requirements like this are better solved with inheritance.
  • Very complex type to table mappings were possible in EF6.x. For example you could have an inheritance hierarchy that combined TPH, TPT, and TPC mappings as well as Entity Splitting all in the same hierarchy. This sounds great, but is one of the major contributing factors to the complexity of the MetadataWorkspace in EF6.x. In EF7 there will be cases where your CLR types need to more closely match your table structure.

Some of the features we are retiring because there is already another (we believe better) way of doing things. While we’d love to continue pulling everything forward, we need to balance time, resources, and the cost of adding support for highly requested features as we move forward. To be able to continue developing and improving the stack we need to shed some of the baggage.

  • Retiring the EDMX model format in favor of code-based modeling is perhaps the most significant change in EF7. You can read more about this change and the reasoning behind it in our recent post on the topic.
  • ObjectContext API was the primary Entity Framework API until DbContext was introduced in EF4.1. Since then we have seen DbContext quickly become the API of choice for EF developers. Given this, and the much cleaner API surface that DbContext provides, we are not bringing ObjectContext forward into EF7. Of course, the important features you needed to drop down to ObjectContext API for in the past will be available from DbContext API, but factored into a cleaner API surface.

 

Not everything will be there in the initial release

Because much of the core of EF7 is new, the first release of EF7 isn’t going to have all the features that are required for all applications. There is always a tension between wanting to ship quickly and wanting to have more features in a given release. As soon as we have the core framework and basic functionality implemented we will provide a release of EF7 for folks to use in applications with simpler requirements. We’ll then provide a series of quick releases that add more and more features.

Of course, this means EF7 isn’t going to be usable for every application when it is first released, and for that reason we are continuing development of EF6.x for some time and expect many of our customers to remain on that release.

An example of this is lazy loading support: we know this is a critical feature for a number of developers, but at the same time there are many applications that can be developed without it. Rather than making everyone wait until it is implemented, we will ship when we have a stable code base and are confident that we have the correct factoring in our core components.

 

So, is it a v1 or a v7?

The answer is both. There were actually three options we discussed in terms of naming/branding for EF7:

  1. Call it v7 of Entity Framework – Given the top level API and patterns are the same as past releases, this is in many ways a major version of the same product. Per semantic versioning, breaking API changes and removal of features is a permissible (and inevitable) part of major releases.
  2. Create a sub-product under Entity Framework – This option was somewhat of a middle ground. While the developer experience is undoubtedly EF, creating a sub-product would help communicate that there are also a significant number of changes. This would be akin to the “Entity Framework Everywhere” name we used for the initial design document we published on CodePlex.
  3. Call it something new and make it v1 – Given the number of changes, we did consider naming it something new.

We decided that once you start writing code, this feels so much like Entity Framework that it really isn’t something new (that ruled out option #3). While there are going to be some nuances between the v6 and v7 transition that need to be documented and explained, it would ultimately be more confusing to have two different frameworks that have almost identical APIs and patterns.

Options #1 and #2 both seem valid to us. Our ultimate conclusion was that #1 is going to cause some confusion in the short term, but make the most sense in the long term. To a lesser extent we’ve tackled similar hurdles in the past with the introduction of DbContext API and Code First in EF4.1 and then the move out of the .NET Framework in EF6 (and subsequent duplications of types, namespace changes, etc.). While these were confusing things to explain, in the long term it seems to have been the correct decision to continue with one product name. 

Of course, this is a somewhat subjective decision and there are no doubt folks who are going to agree and some who will disagree (there are even mixed opinions within our team).

PASS Summit: Networking, Onsite Activities & SQL Family


Just one week from now, PASS Summit will bring together the #SQLFamily in Seattle for the best week of SQL Server and BI learning and networking on the calendar, Nov. 4-7. With a record 200+ technical sessions across 3 jam-packed days of connecting and sharing with 5,000 fellow SQL Server professionals from around the world, PASS Summit 2014 will be the biggest Summit yet.

In addition to sessions with top community and Microsoft experts, guidance from Microsoft CSS, SQLCAT, and SQL Tiger teams at the popular SQL Server Clinic, and hands-on instructor-led workshops, Summit attendees can get 50% off Microsoft certification exams onsite. What else can you look forward to at this year’s Summit? Here are just some of the networking activities, onsite events, and opportunities to immerse yourself in the #SQLCommunity like never before.

First Timers

First-time Summit attendee? Don’t know what to expect at the conference? Don’t miss one of our First-Timers’ orientation sessions Tuesday before the Welcome Reception to get an inside look at what’s in store for you at PASS Summit and tips on getting the most from your week. Then, jump into a Speed Networking session with your fellow First-Timers, and start making connections. We’ll also have daily “Get to Know Your Community” sessions on how to navigate Summit and get more involved in PASS and the #SQLCommunity year-round.

Speaker Idol

For the first time ever, watch as 12 presenters compete for a guaranteed speaking spot at PASS Summit 2015. With three rounds across three days, a panel of judges from the community will give Speaker Idol contestants real-time feedback and select the finalists for Friday’s “speak-off.” Drop by the Community Session Room (Room 400) to watch this competition and cheer on your favorite speakers.

Community Zone

The Community Zone, on the level-4 Skybridge, is the place to mix and mingle with members of the community. Local and Virtual Chapter leaders, Regional Mentors, SQLSaturday organizers, and MVPs will be on hand Wednesday through Friday to answer any questions you have about PASS. The Community Zone will also feature a different country/language spotlight every hour – come by and talk with community leaders from your area who speak your native language. Plus, meet community leaders from around the world and you could win a $250 VISA gift card in our SQL Around the World scavenger hunt-style game.

Luncheons

This year’s Summit features two great opportunities to learn more as you dig into lunch. Thursday’s Women in Technology Luncheon welcomes special guest keynoter Kimberly Bryant, founder of Black Girls CODE, to share her thoughts in a question-and-answer session. And join with MVPs, speakers, PASS Board members, and fellow attendees in our closing day Birds of a Feather Luncheon, focused on bringing people with the same passions together.

Evening Events

The fun continues even after sessions are over, with evening events designed to help you engage with the community and relax after an intense day of training. Help us kick off PASS Summit at Tuesday’s Welcome Reception, and rub shoulders with our sponsors and exhibitors at Wednesday’s Exhibitor Reception. Then enjoy Thursday’s Community Appreciation Party at the contemporary, cutting-edge Experience Music Project Museum, sponsored by PASS and Microsoft as a special thank you for being part of the #SQLCommunity.

Read what community bloggers are looking forward to at the SQL Server event of the year. And if you haven’t already, make sure you register by Oct. 31 to save $200 off the onsite rate. We can’t wait to see everyone there!


Microsoft Adds IoT Streaming Analytics, Data Production and Workflow Services to Azure


This blog post is authored by Joseph Sirosh, Corporate Vice President of Machine Learning at Microsoft.

Today, I am excited to announce three new services: Azure Stream Analytics, Azure Data Factory and Azure Event Hubs. These services continue to make Azure the best cloud platform for our customers to build big data solutions.

Azure Stream Analytics and Azure Data Factory are available in preview and Azure Event Hubs is now generally available. These new capabilities help customers process data from devices and sensors within the Internet of Things (IoT), and manage and orchestrate data across diverse sources. 

  • Stream Analytics is a cost-effective event processing engine that helps uncover real-time insights from devices, sensors, infrastructure, applications and data quickly and easily.
  • Azure Data Factory enables information production by orchestrating and managing diverse data.
  • Azure Event Hubs is a scalable service for collecting data from millions of “things” in seconds.

Azure Stream Analytics and Azure Event Hubs

Every day, IoT is fueling vast amounts of data from millions of endpoints streaming at high velocity in the cloud. Examples of streaming analytics can be found across many businesses, such as stock trading, fraud detection, identity protection services, sensors, web clickstream analytics and alerts from CRM applications. In this new and fast-moving world of cloud and devices, businesses can no longer wait months or weeks for insights generated from data.

With Azure Stream Analytics, businesses can gain insights in real time from data generated by devices, sensors, infrastructure, applications and other sources. Developers can easily combine streams of data – such as clickstreams, logs, metering data or device-generated events – with historic records or reference data. Complementing Stream Analytics, Azure Event Hubs is a highly scalable publish-subscribe ingestor that collects millions of events per second, allowing users to process and analyze data produced by connected assets such as devices and sensors. Stream Analytics provides out-of-the-box integration with Event Hubs – when connected, these two solutions enable customers to harness IoT by processing and analyzing massive amounts of data in real time.
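As a rough sketch of the ingestion side, this is roughly how a device or gateway could publish an event to Event Hubs with the .NET Service Bus SDK of this era (Microsoft.ServiceBus.Messaging, from the WindowsAzure.ServiceBus NuGet package); the connection string, hub name and payload shape are placeholders.

using System;
using System.Text;
using Microsoft.ServiceBus.Messaging;
using Newtonsoft.Json;

class TelemetrySender
{
    static void Main()
    {
        // Placeholders: substitute your own namespace connection string and Event Hub name.
        var connectionString = "<Service Bus connection string>";
        var client = EventHubClient.CreateFromConnectionString(connectionString, "devicetelemetry");

        // Serialize a reading as JSON; Stream Analytics can then read it straight from the hub.
        var reading = new { DeviceId = "sensor-42", Temperature = 21.5, Time = DateTime.UtcNow };
        client.Send(new EventData(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(reading))));
    }
}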

One customer already using Stream Analytics and Event Hubs is Aerocrine, a medical products company focused on the improved management and care of patients with inflammatory airway diseases. The company is developing devices that include the ability to collect telematics data from clinics. The devices will connect to Azure and use Stream Analytics and Event Hubs to collect telematics information and perform near real-time analytics on top of the stream of the data from the instruments. The system will collect data about usage and performance to further improve the customer service experience and send out real-time alerts for maintenance.

Azure Data Factory

Most organizations today are dealing with massive amounts of varied data from many different sources: across geographic locations, on-premises and cloud, unstructured and structured. Effectively managing, coordinating and processing this data can be challenging, especially when the system needs to constantly evolve to deal with new business requirements, scale to handle growing data volume, and have broad enough scope to manage diverse systems – commercial or open source – from a single place.

Azure Data Factory helps solve this problem by providing customers with a single place to manage data movement, orchestration and monitoring of diverse data sources, including SQL Server and Azure Blobs, Tables, Azure SQL Database and SQL Server in Azure Virtual Machines. Developers can efficiently build data-driven workflows that join, aggregate and transform data from local, cloud-based and internet services, and set up complex data processing systems with little programming.

Milliman, an independent actuarial and consulting firm, is continuously innovating solutions for its clients and is now taking advantage of Azure Data Factory to unlock Azure HDInsight to organize and report over large and disorganized data sets. Milliman’s SaaS solution, Integrate™, will provide a data management environment to support both the creation of input data for the models and reporting across the vast amount of data generated from the models.

Rockwell Automation, the world’s largest company dedicated to industrial automation and information, is demonstrating IoT capabilities by offering remote monitoring services that collect data from sensors which is then securely sent to Microsoft Azure. A key component of their architecture is Data Factory. With Data Factory, Rockwell Automation is able to orchestrate critical data pipelines for time series sensor data by leveraging Microsoft Azure HDInsight so users can work with the data in Power BI and Azure Machine Learning.
 

Microsoft data services

Azure Stream Analytics, Azure Event Hubs and Data Factory are just a few of the data services we’ve added to Azure recently. Just this month at Strata + Hadoop World we introduced support for Apache Storm in Azure HDInsight, and over the past few months we announced Azure SQL Database, Azure DocumentDB, Azure Search and Azure Machine Learning. We’re delivering these new services so our customers have easier ways to manage, analyze and act on their data – using the tools, languages and frameworks they are familiar with – in a scalable and reliable cloud environment. To learn more, go here.


Microsoft Adds Streaming Analytics, Data Production and Workflow Services to Azure


This is a repost of an article by Joseph Sirosh on Microsoft’s Data Platform Insider blog in which we announced three new services earlier today:

  • Azure Stream Analytics, a cost-effective event processing engine that helps uncover real-time insights from devices, sensors, infrastructure, applications and data quickly and easily.

  • Azure Data Factory, which enables information production by orchestrating and managing diverse data.

  • Azure Event Hubs, a scalable service for efficiently ingesting data from millions of sensors or other similar streaming inputs.

Azure Stream Analytics and Azure Data Factory are available in preview and Azure Event Hubs is now generally available.

Alongside our Azure Machine Learning service, these new capabilities help our customers process data from devices and sensors within the Internet of Things (IoT), and manage, orchestrate and analyze data across diverse sources in a scalable and reliable cloud environment. To learn more about why these and other services make Azure the best cloud platform for customers to build their big data solutions, go here.

ML Blog Team

Embracing Uncertainty – Probabilistic Inference


This is the second of a 2-part blog post by Chris Bishop, Distinguished Scientist at Microsoft Research. The first part is available here.

Last week we explored the key role played by probabilities in machine learning, and we saw some of the advantages of arranging for the outputs of a classifier to represent probabilities rather than decisions. In fact, nearly all of the classifiers in Azure ML can be used to generate probabilistic outputs. However, this represents only a glimpse of the many ways in which probabilistic methods can be used in ML. This week, we will explore some further benefits arising from a deeper use of probabilities as part of the training process.

The Curse of Over-Fitting

Traditionally, the parameters of a classifier are tuned by minimizing an error function which is defined using a set of labelled data. This is a computationally expensive process and constitutes the training phase of developing a ML solution. A major practical problem in this approach is that of over-fitting, whereby a trained model appears to give good results on the training data, but where its performance measured on independent validation data is significantly worse. We can think of over-fitting as arising from the fact that any practical training set is finite in size, and during training the parameters of the model become finely tuned to the noise on the individual data points in the training set.

Over-fitting can be controlled by a variety of measures, for example by limiting the number of free parameters in the model, by stopping the training procedure before the minimum training error is reached, or by adding regularization terms to the error function. In each case there is at least one hyper-parameter (the number of parameters, the stopping time, or the regularization coefficient) that must be set using validation data that is independent of the data used during training. More sophisticated models can involve multiple hyper-parameters, and typically many training runs are performed using different hyper-parameter values in order to select the model with the best performance on the validation data. This process is time-consuming for data scientists, and requires a lot of computational resources.

So is over-fitting a fundamental and unavoidable problem in ML? To find out we need to explore a broader role for probabilities.

Parameter Uncertainty

We saw last week how probabilities provide a consistent way to quantify uncertainty. The parameters of a ML model (for example the weights in a neural network) are quantities whose values we are uncertain about, and which can therefore be described using probabilities. The initial uncertainty in the parameters can be expressed as a broad prior distribution. We can then incorporate the training data, using an elegant piece of mathematics called Bayes’ theorem, to give a refined distribution representing the reduced uncertainty in the parameters. If we had an infinitely large training set then this uncertainty would go to zero. We would then have point values for the parameters, and this approach would become equivalent to standard error-function minimization.
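In symbols, this is just Bayes’ theorem applied to the model parameters w given the training data D:

p(w | D) = p(D | w) p(w) / p(D)

where p(w) is the broad prior distribution over the parameters, p(D | w) is the likelihood of the training data, and p(w | D) is the refined (posterior) distribution that captures the remaining uncertainty.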

For a finite data set, the problem of over-fitting, which arose because error-function minimization set the parameters to very specific values with zero uncertainty, has disappeared! Instead, the uncertainty in the weights provides an additional contribution to the uncertainty in the model predictions, over and above that due to noise on the target labels. If we have lots of parameters in the model and only a few training points, then there is a large uncertainty on the predictions. As we include more and more training data, the level of uncertainty on our predictions decreases, as we would intuitively expect.

Hierarchical Models

A natural question is how to set the hyper-parameters of the prior distribution. Because we are uncertain of their values they should themselves be described by probability distributions, leading – it seems – to an infinite hierarchy of probabilities. However, in a recent breakthrough by scientists at Microsoft Research it has been shown that only a finite hierarchy is required in practice, provided it is set up in the correct way (technically a mixture of Gamma-Gamma distributions). The result is a model that has no external hyper-parameters. Consequently, the model can be trained in a single pass, without over-fitting, and without needing pre-processing steps such as data re-scaling. The structure of such a model is illustrated for a simple two-class classifier known as the Bayes Point Machine, in the figure below:

This diagram is an example of a factor graph which illustrates the variables in the model along with their probability distributions and inter-dependencies. Although the training of such a model using distributions is slightly more costly compared to the use of an optimization algorithm, the model only needs to be trained once, resulting in a significant overall saving in time and resources. The two-class Bayes Point Machine is available in Azure ML, and its use is illustrated in the following screenshot:

Towards Artificial Intelligence

So have we exhausted the possibilities of what can be achieved using probabilities in ML? Far from it – we have just scratched the surface. As the field of ML moves beyond simple prediction engines towards the ultimate goal of intelligent machines, rich probabilistic models will become increasingly vital in allowing machines to reason and learn about the world and to take actions in the face of uncertainty.

We must leave a more detailed discussion of this fascinating frontier of ML to a future article. Meanwhile, I encourage you to explore the world of probabilities further using the Bayes Point Machine in Azure ML.

Chris Bishop
Learn about my research

The Ins and Outs of Azure Data Factory – Orchestration and Management of Diverse Data


Yesterday at TechEd Europe 2014, Microsoft announced the preview of Azure Data Factory. This post will give you the ins and outs of this new service.

What is Azure Data Factory?

Azure Data Factory is a fully managed service that enables information production by orchestrating data and processing services as managed data pipelines. A pipeline connects diverse data (like SQL Server on-premises, or cloud data like Azure SQL Database, Blobs, Tables, and SQL Server in Azure Virtual Machines) with diverse processing techniques (like Azure HDInsight (Hive and Pig) and custom C# activities). This allows the data developer to transform and shape the data (join, aggregate, cleanse, enrich) so that it becomes authoritative and trustworthy enough to be consumed by BI tools. These pipelines are all managed within a single pane of glass where rich health and lineage information is available to diagnose issues or do impact analysis across all data and processing assets. Some unique points about Data Factory are:

  • Ability to process data from diverse locations and data types.  Data Factory can pull data from relational, on-premises sources like SQL Server and join with non-relational, cloud sources like Azure Blobs.
  • Provide a holistic view of the entire IT infrastructure that includes both commercial and open source together. Data Factory can orchestrate Hive and Pig using Hadoop while also bringing in commercial products and services like SQL Server and Azure SQL Database in a single view.

What can it do?

With the ability to manage and orchestrate the collection, movement and transformation of semi-structured and structured data together, Data Factory provides customers with a central place to manage their processing of web log analytics, click stream analysis, social sentiment, sensor data analysis, geo-location analysis, and more. In public preview, Microsoft views Data Factory as a key tool for customers who are looking to have a hybrid story with SQL Server or who currently use Azure HDInsight, Azure SQL Database, Azure Blobs, and Power BI for Office 365. In the future, we’ll bring more data sources and processing capabilities to the Data Factory.

How do I get started?

For Microsoft customers, we are offering Azure Data Factory as a public preview.  To get started, customers will need to have an Azure subscription or a free trial to Azure. With this in hand, you should be able to get Azure Data Factory up and running in minutes. Start by reading this getting started guide.

For more information on Azure Data Factory:

The Ins and Outs of Azure Stream Analytics – Real-Time Event Processing


Yesterday at TechEd Europe 2014, Microsoft announced the preview of Azure Stream Analytics. This post will give you the ins and outs of this new service.

What is Azure Stream Analytics?

Azure Stream Analytics is a cost-effective event processing engine that helps uncover real-time insights from devices, sensors, infrastructure, applications, and data. Deployed in the Azure cloud, Stream Analytics has elastic scale where resources are efficiently allocated and paid for as requested. Developers are given a rapid development experience where they describe their desired transformations in SQL-like syntax. Some unique aspects about Stream Analytics are:

  • Low cost: Stream Analytics is architected for multi-tenancy meaning you only pay for what you use and not for idle resources.  Unlike other solutions, small streaming jobs will be cost effective.
  • Faster developer productivity: Stream Analytics allows developers to use a SQL-like syntax that can speed up development time from thousands of lines of code down to a few lines. The system abstracts the complexities of parallelization, distributed computing, and error handling away from developers.
  • Elasticity of the cloud: Stream Analytics is built as a managed service in Azure.  This means customers can spin up or down any number of resources on demand.  Customers will not have to setup costly hardware or install and maintain software.

Similar to the recent announcement Microsoft made in making Apache Storm available in Azure HDInsight, Stream Analytics is a stream processing engine that is integrated with a scalable event queuing system like Azure Event Hubs. By making both Storm and Stream Analytics available, Microsoft is giving customers options to deploy their real-time event processing engine of choice.

What can it do?

Stream Analytics will enable various scenarios, including Internet of Things (IoT) scenarios such as real-time fleet management or gaining insights from devices like mobile phones and connected cars. Specific scenarios that customers are addressing with real-time event processing include:

  • Real-time ingestion, processing and archiving of data: Customers will use Stream Analytics to ingest a continuous stream of data and do in-flight processing like scrubbing PII information, adding geo-tagging, and doing IP lookups before being sent to a data store.
  • Real-time Analytics: Customers will use Stream Analytics to provide real-time dashboarding where customers can see trends that happen immediately when they occur.
  • Connected devices (Internet of Things): Customers will use Stream Analytics to get real-time information from their connected devices like machines, buildings, or cars so that relevant action can be taken. This can include scheduling a repair technician, pushing down software updates, or performing a specific automated action.

How do I get started?

For Microsoft customers, we are offering Azure Stream Analytics as a public preview.  To get started, customers will need to have an Azure subscription or a free trial to Azure. With this in hand, you should be able to get Azure Stream Analytics up and running in minutes. Start by reading this getting started guide.

For more information on Azure Stream Analytics:

The Ins and Outs of Azure Data Factory – Orchestration and Management of Diverse Data

The Ins and Outs of Azure Stream Analytics – Real-Time Event Processing


Earlier this week, at TechEd Europe 2014 in Barcelona, we announced the preview of Azure Stream Analytics. Azure Stream Analytics is a cost-effective event processing engine that helps uncover real-time insights from devices, sensors, infrastructure, applications and data quickly and easily.

Learn about the ins and outs of this new service here.

ML Blog Team


ZDNet: Microsoft takes wraps off preview of its Azure Data Factory service

The Oracle and Teradata connector V3.0 for SQL Server 2014 Integration Services is now available for download

Dear Customers, The Oracle and Teradata connector V3.0 for SQL Server 2014 Integration Services is now available for download at the Microsoft Download Center. Microsoft SSIS Connectors by Attunity Version 3.0 is a minor release. It supports SQL...(read more)

Microsoft announces major update to Azure SQL Database, adds free tier to Azure Machine Learning


This morning at the Professional Association for SQL Server (PASS) Summit, we celebrated SQL Server 2014’s strong momentum and introduced new services that further expand Microsoft’s big data platform. We announced a major update coming to our database-as-a-service, Azure SQL Database, and easier access to our machine learning service, Azure Machine Learning. These new updates continue our efforts to bring big data to everyone by delivering a comprehensive platform that ensures every organization, every team and every individual is empowered to do more and achieve more because of the data at their fingertips.

I’m really pleased to be making these announcements today at PASS Summit, where, along with my colleagues Joseph Sirosh, corporate vice president of Machine Learning and Information Management; and James Phillips, general manager of Data Experiences; I delivered a keynote highlighting the momentum of SQL Server 2014 and other recent releases in our data platform such as Azure Stream Analytics, Azure Data Factory, Azure DocumentDB, Azure Search and Azure HDInsight. As the world’s largest gathering of SQL Server and business intelligence professionals, PASS is hugely important, enabling us to connect with SQL Server customers and gain valuable feedback to help inform the product’s development.

Customers embrace SQL Server 2014

SQL Server is the cornerstone of our big data platform. It is the world’s most widely-deployed database and is used across the globe for mission critical enterprise deployments. Last month, based largely on our work with SQL Server, Microsoft was recognized as a Leader in Gartner's Magic Quadrant for Operational Database Management Systems, and positioned furthest to the right in completeness of vision*. Earlier this year, we released SQL Server 2014, which includes built-in breakthrough in-memory OLTP and columnstore technologies, as well as hybrid cloud capabilities. Since then, SQL Server 2014 has seen tremendous growth and positive reception among customers, with more than 1.2 million downloads to date and 30 percent of Azure SQL Server virtual machines currently running SQL Server 2014.

Clalit, Dell, Eastman Chemical Company, GameStop, Kiwibank, LC Waikiki, Pros, Saab and Stack Overflow are just a few of the customers using SQL Server 2014. GameStop is using SQL Server 2014 in two main scenarios: disaster recovery and backup to Azure to accommodate the company’s infrastructure freeze for holidays and big game launches, and as the default install for new SQL Server instances. Stack Overflow is a question and answer site for professional and enthusiast programmers. By basing their platform on technologies like SQL Server 2014 (specifically taking advantage of AlwaysOn Availability Group replicas), they can have a highly available, high-performing platform that easily and quickly gets answers to thousands of global users.

Azure SQL Database, Azure Machine Learning

Later this year, we will preview a new version of Azure SQL Database that represents another major milestone for this database-as-a-service. With this preview, we will add SQL Server capabilities that will make it easier to extend and migrate applications to the cloud, including support for larger databases with online indexing and parallel queries, improved T-SQL support with common language runtime and XML index, and monitoring and troubleshooting with extended events. In addition, the preview will unlock our in-memory columnstore, which will deliver greater performance for data marts and continue our journey of bringing in-memory technologies to the cloud. We will offer these new preview capabilities as part of the service tiers introduced earlier this year, which deliver 99.99% availability, larger database sizes, restore and geo-replication capabilities, and predictable performance.

Microsoft Azure Machine Learning is a fully managed cloud service for building predictive analytics solutions, and helps overcome the challenges most businesses face in deploying and using machine learning. Starting today, it will be easier than ever for anyone to try Azure Machine Learning, as the service is now available to test free of charge without a subscription or credit card – all customers need to get started is a Microsoft account ID. This free tier is one more way Azure Machine Learning is making advanced analytics more accessible to more people. DBAs, developers, business intelligence professionals and nascent data scientists can now experiment with Azure Machine Learning at no cost.

Microsoft Data Platform

We are making all these investments in SQL Server and the rest of our data platform because we are living and working in an amazing time where organizations are utilizing data to make smarter decisions, better predict their customers’ needs and provide more differentiated products and services. Data has become the new currency and it is helping to differentiate today’s leading companies. 

To get there, organizations need a comprehensive platform to capture and manage all of their data, transform and analyze that data for new insights, and provide tools which enable users across their organization to visualize data and make better business decisions. Microsoft’s approach is to make it easier for our customers to work with data of any type and size – using the tools, languages and frameworks they want – in a trusted environment on-premises and in the cloud. To learn more about our approach to big data, visit our web page.

*Disclaimer:
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Anomaly Detection – Using Machine Learning to Detect Abnormalities in Time Series Data


This post was co-authored by Vijay K Narayanan, Partner Director of Software Engineering at the Azure Machine Learning team at Microsoft.

Introduction

Anomaly Detection is the problem of finding patterns in data that do not conform to a model of “normal” behavior. Detecting such deviations from expected behavior in temporal data is important for ensuring the normal operation of systems across multiple domains such as economics, biology, computing, finance, ecology and more. Applications in these domains need the ability to detect abnormal behavior, which can be an indication of system failure or malicious activity, and they need to be able to trigger the appropriate steps toward corrective action. In each case, it is important to characterize what is normal, what is deviant or anomalous, and how significant the anomaly is. This characterization is straightforward for systems whose behavior can be specified using simple mathematical models – for example, the output of a Gaussian distribution with known mean and standard deviation. However, most interesting real-world systems have complex behavior over time. It is therefore necessary to characterize the normal state of the system by observing data about the system over a period when the system is deemed normal by its observers and users, and to use this characterization as a baseline to flag anomalous behavior.
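To make the simple case concrete, here is a minimal sketch (ours, not part of any service) that flags points lying far from a Gaussian baseline with known mean and standard deviation; the three-standard-deviation cutoff and the toy data are illustrative assumptions.

  # Minimal sketch: flag values that deviate from a known Gaussian baseline.
  # The mean, standard deviation, and 3-sigma cutoff are illustrative choices.
  def gaussian_anomalies(values, mean, std, k=3.0):
      """Return indices of points more than k standard deviations from the mean."""
      return [i for i, v in enumerate(values) if abs(v - mean) > k * std]

  series = [10.1, 9.8, 10.3, 10.0, 15.7, 9.9]             # toy data; 15.7 is the outlier
  print(gaussian_anomalies(series, mean=10.0, std=0.3))   # -> [4]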

Machine learning is useful for learning the characteristics of the system from observed data. Common anomaly detection methods on time series data learn the parameters of the data distribution in windows over time and identify anomalies as data points that have a low probability of being generated from that distribution. Another class of methods includes sequential hypothesis tests such as cumulative sum (CUSUM) charts and the sequential probability ratio test (SPRT), which can identify certain types of changes in the distribution of the data in an online manner. All these methods use predefined thresholds to alert on changes in the values of some characteristic of the distribution, and they operate on the raw time series values. At their core, all of them test whether the sequence of values in a time series is consistent with having been generated by an i.i.d. (independent and identically distributed) process.
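As a rough illustration of the sequential-test family mentioned above, the sketch below implements a basic one-sided CUSUM chart for an upward shift in the mean; the reference mean, slack value and alerting threshold are assumptions chosen for the example rather than values used by any particular product.

  # Rough sketch of a one-sided CUSUM chart for detecting an upward mean shift.
  # mu (reference mean), slack k, and threshold h are illustrative assumptions.
  def cusum_upward(values, mu, k=0.5, h=5.0):
      """Yield indices where the cumulative positive deviation exceeds h."""
      s = 0.0
      for i, x in enumerate(values):
          s = max(0.0, s + (x - mu - k))
          if s > h:
              yield i
              s = 0.0                      # reset after an alert

  data = [0.1, -0.2, 0.0, 2.5, 2.8, 3.1, 2.9, 0.2]
  print(list(cusum_upward(data, mu=0.0)))  # -> [5]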

Exchangeability Martingales

A direct way to detect changes in the distribution of time series values uses exchangeability martingales (EM) to test if the time series values are i.i.d ([3], [4] and [5]). A distribution of time series values is exchangeable if the distribution is invariant to the order of the variables. The basic idea is that an EM remains stable if the data is drawn from the same distribution, while it grows to a large value if the exchangeability assumption is violated.
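As a minimal sketch of this idea, the code below builds a simple power martingale on conformal p-values; the strangeness function (absolute deviation from the mean of the history seen so far), the epsilon value and the alerting threshold are all assumptions made for illustration, not choices prescribed by the references or by the Azure service.

  import random

  # Sketch of a power martingale over conformal p-values. A run of small
  # p-values (i.e., unusually strange points) makes the martingale grow,
  # signalling a likely violation of exchangeability.
  def strangeness(history, x):
      mean = sum(history) / len(history) if history else 0.0
      return abs(x - mean)                 # illustrative strangeness function

  def power_martingale(stream, epsilon=0.92, threshold=20.0):
      """Yield (index, martingale value) whenever the martingale exceeds the threshold."""
      history, scores, m = [], [], 1.0
      for i, x in enumerate(stream):
          a = strangeness(history, x)
          scores.append(a)
          # Randomized conformal p-value of the newest score among all scores so far.
          gt = sum(s > a for s in scores)
          eq = sum(s == a for s in scores)
          p = max((gt + random.random() * eq) / len(scores), 1e-12)
          m *= epsilon * p ** (epsilon - 1)     # power martingale update
          history.append(x)
          if m > threshold:
              yield i, m

  random.seed(0)                                          # reproducible tie-breaking
  data = [0.0] * 50 + [float(i) for i in range(1, 21)]    # flat segment, then a trend
  print([i for i, _ in power_martingale(data)])           # alarms expected during the trend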

EM-based anomaly scores for detecting changes in the distribution of time series values have a few properties that are useful for anomaly detection in dynamic systems.

  1. Different types of anomalies (e.g. an increased dynamic range of values, a threshold change in the values, slow trends, etc.) can be detected by transforming the raw data to capture strangeness (abnormal behavior) in the domain. For example, an upward trend in the values is probably indicative of a memory leak in a computing context, while it may be expected behavior in the growth rate of a population. When the time series is seasonal or has other predictable patterns, the strangeness functions can also be defined on the residuals remaining after subtracting a forecast from the observed values (a small sketch of such a residual-based strangeness function follows this list).

  2. Anomalies are computed in an online manner by keeping some of the historical time series in a window.

  3. A threshold on the martingale value used for alerting can control false positives. Further, the threshold has the same dynamic range irrespective of the absolute value of the time series or the strangeness function, and it has a physical interpretation in terms of the expected false positive rate ([3]).
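As a small sketch of the residual idea from property 1, the function below subtracts a naive seasonal forecast (the value one period earlier) and uses the absolute residual as the strangeness score; the period of 24 is an assumption suited to an hourly series with a daily cycle.

  # Sketch of a residual-based strangeness function for a seasonal series.
  # The forecast is simply the value one period earlier; period=24 assumes
  # hourly data with a daily cycle.
  def residual_strangeness(series, period=24):
      """Return |observed - forecast| for each point (0 for the first period)."""
      scores = []
      for i, x in enumerate(series):
          forecast = series[i - period] if i >= period else x
          scores.append(abs(x - forecast))
      return scores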

Anomaly Detection Service on Azure Marketplace

We have published an anomaly detection service in the Azure marketplace for intelligent web services. This anomaly detection service can detect the following different types of anomalies on time series data:

  1. Positive and negative trends: When monitoring memory usage in computing, for instance, an upward trend is indicative of a memory leak,

  2. Increase in the dynamic range of values: As an example, when monitoring the exceptions thrown by a service, any increases in the dynamic range of values could indicate instability in the health of the service, and

  3. Spikes and Dips: For instance, when monitoring the number of login failures to a service or number of checkouts in an e-commerce site, spikes or dips could indicate abnormal behavior.

The service provides a REST-based API over HTTPS that can be consumed in different ways, including from a web or mobile application, R, Python, Excel, etc. We have an Azure web application that demonstrates the anomaly detection web service. You can also send your time series data to this service via a REST API call, and it runs a combination of the three anomaly types described above. The service runs on the Azure Machine Learning platform, which scales seamlessly with your business needs and provides an SLA of 99.9%.
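For illustration only, a call from Python might look like the sketch below; the endpoint URL, header names and JSON payload shape are placeholders rather than the service's documented contract, so consult the marketplace listing for the actual request schema.

  import requests

  # Hypothetical sketch of calling the anomaly detection web service over HTTPS.
  # The URL, authorization header and payload shape are placeholders, not the
  # documented API; replace them with the values from the marketplace listing.
  API_URL = "https://api.datamarket.azure.com/.../anomalydetection"   # placeholder
  API_KEY = "<your-account-key>"                                      # placeholder

  payload = {
      "data": [
          {"time": "2014-11-01T00:00:00Z", "value": 10.2},
          {"time": "2014-11-01T01:00:00Z", "value": 10.4},
          {"time": "2014-11-01T02:00:00Z", "value": 42.0},   # suspected spike
      ]
  }

  response = requests.post(API_URL, json=payload,
                           headers={"Authorization": "Bearer " + API_KEY})
  response.raise_for_status()
  print(response.json())   # anomaly scores / flags returned by the service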

The figure below shows an example of anomalies detected in a time series using the above framework. The time series has 2 distinct level changes and 3 spikes. The red dots show the times at which the level changes are detected, while the red upward arrows show the detected spikes.

Application to Cloud Service Monitoring

Clusters of commodity compute and storage devices interconnected by networks are routinely used to deliver high quality services for enterprise and consumer applications in a cost-effective manner. Real-time operational analytics to monitor, alert and recover from failures in any of the components of the system are necessary to guarantee the SLAs of these services. A naïve rule-based approach, i.e., alerting whenever the KPIs of these components take on anomalous values, could easily lead to a large number of false positive alerts in any service of reasonable size. Further, tuning the thresholds for thousands of KPIs in a dynamic system is non-trivial. EMs are particularly well suited for detecting and alerting on changes in the KPIs of these systems due to the advantages mentioned earlier. The alerts generated by this system are handled by automated healing processes and human systems experts to help the SQL Database service on Azure meet its SLA of 99.99%, the first cloud database to achieve this level of SLA.

Anomaly Detection for Log Analytics

Most log analytics platforms provide an easy way to search through system logs once a problem has been identified. However, proactive detection of ongoing anomalous behavior is important for staying ahead of the curve in managing complex systems. Microsoft and Sumo Logic have been partnering to broaden the machine learning based anomaly detection capabilities for log analytics. The seamless cloud-to-cloud integration between Microsoft AzureML and Sumo Logic provides customers with a comprehensive machine learning solution for detecting and alerting on anomalous events in logs. End users can consume the integrated anomaly detection capabilities in their Sumo Logic service with minimal effort, relying on the combined power of proven technologies to monitor and manage complex system deployments.

Vijay K Narayanan, Alok Kirpal, Nikos Karampatziakis
Follow Vijay on twitter.

References

  1. Intelligent web services on Azure marketplace

  2. Anomaly detection service on Azure marketplace.

  3. Vladimir Vovk, Ilia Nouretdinov, Alex J. Gammerman, "Testing Exchangeability Online", ICML 2003.

  4. Shen-Shyang Ho, H. Wechsler, "A Martingale Framework for Detecting Changes in Data Streams by Testing Exchangeability", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2113-2127, Dec. 2010.

  5. Valentina Fedorova, Alex J. Gammerman, Ilia Nouretdinov, Vladimir Vovk, "Plug-in martingales for testing exchangeability on-line", ICML 2012.

Microsoft adds free tier to Azure Machine Learning


Starting today, we made it easier than ever for anyone to try Azure Machine Learning. Our service is now available to test free of charge without a subscription or credit card – all you need to get going is a Microsoft account! You can read more about this announcement, made at the PASS Summit this morning in Seattle.

The new free tier is one more way in which Microsoft and Azure ML are making advanced analytics more broadly accessible. Go ahead and give it a spin.


ML Blog Team

