Channel: Data Platform

PASS Summit 2014 - Don’t miss the largest conference for SQL Server Professionals


Do you have the “Data Gene”? 

Preparations for PASS Summit 2014 in Seattle, Washington are well underway.  We are very excited to have this year’s event back in Seattle and look forward to bringing you some great sessions and activities throughout the event.    

Tune in to this week’s TechNet Radio special to listen to Jennifer Moser, Lara Rubbelke, Ann Bachrach, and SQL RockStar Thomas LaRock talk about the “data gene” and why you don’t want to miss this year’s event.

Don’t procrastinate, get registered for PASS Summit 2014!  


Announcing the PASS 24 HOP Challenge


Calling all data junkies! How smart are you?  Want to get smarter?

Play along with the #pass24hop Challenge on Twitter starting at 5:00 AM PT Tuesday, September 9, 2014 to win a free Microsoft Exam Voucher!  Simply watch 24 Hours of PASS and be the first to answer the question correctly. At the beginning of each of the 24 live 24 Hours of PASS sessions (approximately 5-8 minutes into each hour), a new question regarding the session will be posted on the @SQLServer Twitter account. The first tweet with the correct answer will win a prize.  Your answer must include the hashtags #pass24hop and #24hopquiz.

To take part in the #pass24hop Challenge, you must:

  1. Sign in to your Twitter account. If you do not have an account, visit www.twitter.com to create one. Twitter accounts are free.
  2. Once logged into your Twitter account, follow the links and instructions to become a follower of @SQLServer.
  3. From your own account, reply to the question tweeted by @SQLServer with your answer.
  4. Your tweet must contain both the #pass24hop and #24hopquiz hashtags to be eligible for entry.
  5. Your tweet must include the complete answer to the question, or it will be disqualified.
  6. The first person to tweet a correct reply to the corresponding question will win the prize described below.

Register now for 24 Hours of PASS and get ready for 24 hours of play!  

To learn more about 24 Hours of PASS, read the official rules below.

 

NO PURCHASE NECESSARY. COMMON TERMS USED IN THESE RULES:

These are the official rules that govern how the ’24 Hours of PASS Social Media Answer & Question Challenge (“Sweepstakes”) promotion will operate. This promotion will be simply referred to as the “Sweepstakes” throughout the rest of these rules. In these rules, “we,” “our,” and “us” refer to Microsoft Corporation, the sponsor of the Sweepstakes. “You” refers to an eligible Sweepstakes entrant.

WHAT ARE THE START AND END DATES?

This Sweepstakes starts at 5:00 AM PT Tuesday, September 9, 2014 and ends at 7:00 AM PT Wednesday, September 10, 2014 (“Entry Period”). The Sweepstakes consists of 24 prizes. Each Prize Period will begin immediately following each of the 24 sessions and run for 60 minutes.

CAN I ENTER?

You are eligible to enter this Sweepstakes if you meet the following requirements at time of entry:

· You are a professional or enthusiast with expertise in SQL Server or Business Intelligence and are 18 years of age or older; and

o If you are 18 years of age or older, but are considered a minor in your place of residence, you should ask your parent’s or legal guardian’s permission prior to submitting an entry into this Sweepstakes; and

· You are NOT a resident of any of the following countries: Cuba, Iran, North Korea, Sudan, and Syria.

PLEASE NOTE: U.S. export regulations prohibit the export of goods and services to Cuba, Iran, North Korea, Sudan and Syria. Therefore, residents of these countries / regions are not eligible to participate.

• You are NOT an employee of Microsoft Corporation or an employee of a Microsoft subsidiary; and

• You are NOT involved in any part of the administration and execution of this Sweepstakes; and

• You are NOT an immediate family (parent, sibling, spouse, child) or household member of a Microsoft employee, an employee of a Microsoft subsidiary, or a person involved in any part of the administration and execution of this Sweepstakes.

This Sweepstakes is void wherever prohibited by law.

HOW DO I ENTER?  

At the beginning of each of the 24 live 24 Hours of PASS sessions (approximately 5-8 minutes into each hour), a new question regarding the session will be posted on the @SQLServer Twitter account. The first tweet with the correct answer will win a prize.  Your answer must include the hashtags #pass24hop and #24hopquiz.  Failure to use these hashtags will automatically disqualify you.

To enter, you must do all of the following:

  1. Sign in to your Twitter account. If you do not have an account, visit www.twitter.com to create one. Twitter accounts are free.
  2. Once logged into your Twitter account, follow the links and instructions to become a follower of @SQLServer
  3. From your own account, reply to the question tweeted by @SQLServer with your answer
  4. Your tweet must contain both the #pass24hop and #24hopquiz hashtags to be eligible for entry
  5. Your tweet must include the complete answer to the question, or it will be disqualified.
  6. The first person to tweet a correct reply to the corresponding question will win the prize described below.

Limit one entry per person, per session.  For the purposes of these Official Rules, a “day” begins at 5:00 AM PT Tuesday, September 9, 2014 and ends at 7:00 AM PT Wednesday, September 10, 2014 (“Entry Period”). If you reply with more than one answer per session, all replies received from you for that session will be automatically disqualified.  You may submit one answer to each session, but will be eligible to win only one prize within the 24-hour contest period.

We are not responsible for entries that we do not receive for any reason, or for entries that we receive but are not decipherable for any reason, or for entries that do not include your Twitter handle.

We will automatically disqualify:

  • Any incomplete or illegible entry; and
  • Any entries that we receive from you that do not meet the requirements described above.

WINNER SELECTION AND PRIZES

The first person to correctly respond will receive a Microsoft Exam Voucher with an approximate retail value of $150.  A total of twenty-four prizes are available.

Within 48 hours following the Entry Period, we, or a company acting under our authorization, will select one winner per session to win one free Microsoft Certification Exam.  The voucher has a retail value of $150.  Prize eligibility is limited to one prize within the contest period.  If you are selected as a winner for a session, you will be ineligible for additional prizes for any other session.  In the event that you are the first to answer correctly on multiple sessions, the prize will go to the next person with the correct answer.

If there is a dispute as to who is the potential winner, we reserve the right to make final decisions on who is the winner based on the accuracy of the answer provided, whether the rules for including hashtags were followed, and the times the answers arrived as listed on www.twitter.com.

Selected winners will be notified via a Direct Message (DM) on Twitter within 48 business hours of the daily drawing. The winner must reply to our Direct Message (DM) within 48 hours of notification via DM on Twitter. If the notification that we send is returned as undeliverable, or you are otherwise unreachable for any reason, or you do not respond within 48 business hours, we will award the prize to an alternate winner as randomly selected. Only one alternate winner will be selected and notified; after which, if unclaimed, the prize will remain unclaimed.

If you are a potential winner, we may require you to sign an Affidavit of Eligibility, Liability/Publicity Release within 10 days of notification. If you are a potential winner and you are 18 or older, but are considered a minor in your place of legal residence, we may require your parent or legal guardian to sign all required forms on your behalf. If you do not complete the required forms as instructed and/or return the required forms within the time period listed on the winner notification message, we may disqualify you and select an alternate winner.

If you are confirmed as a winner of this Sweepstakes:

  • You may not exchange your prize for cash or any other merchandise or services. However, if for any reason an advertised prize is unavailable, we reserve the right to substitute a prize of equal or greater value; and
  • You may not designate someone else as the winner. If you are unable or unwilling to accept your prize, we will award it to an alternate potential winner; and
  • If you accept a prize, you will be solely responsible for all applicable taxes related to accepting the prize; and
  • If you are otherwise eligible for this Sweepstakes, but are considered a minor in your place of residence, we may award the prize to your parent/legal guardian on your behalf.

WHAT ARE YOUR ODDS OF WINNING? 
There will be 24 opportunities to respond with the correct answer. Your odds of winning this Challenge depend on the number of responses and being the first to answer with the correct answer.

WHAT OTHER CONDITIONS ARE YOU AGREEING TO BY ENTERING THIS CHALLENGE? 
By entering this Challenge you agree:

· To abide by these Official Rules; and

· To release and hold harmless Microsoft, and its respective parents, subsidiaries, affiliates, employees and agents from any and all liability or any injury, loss or damage of any kind arising from or in connection with this Challenge or any prize won; and

· That Microsoft’s decisions will be final and binding on all matters related to this Challenge; and

· That by accepting a prize, Microsoft may use your proper name and state of residence online and in print, or in any other media, in connection with this Challenge, without payment or compensation to you, except where prohibited by law.

WHAT LAWS GOVERN THE WAY THIS CHALLENGE IS EXECUTED AND ADMINISTRATED? 
This Challenge will be governed by the laws of the State of Washington, and you consent to the exclusive jurisdiction and venue of the courts of the State of Washington for any disputes arising out of this Challenge.

WHAT IF SOMETHING UNEXPECTED HAPPENS AND THE CHALLENGE CAN’T RUN AS PLANNED? 
If cheating, a virus, bug, catastrophic event, or any other unforeseen or unexpected event that cannot be reasonably anticipated or controlled (also referred to as force majeure) affects the fairness and / or integrity of this Challenge, we reserve the right to cancel, change or suspend this Challenge. This right is reserved whether the event is due to human or technical error. If a solution cannot be found to restore the integrity of the Challenge, we reserve the right to select winners from among all eligible entries received before we had to cancel, change or suspend the Challenge. If you attempt to compromise the integrity or the legitimate operation of this Challenge by hacking or by cheating or committing fraud in ANY way, we may seek damages from you to the fullest extent permitted by law. Further, we may ban you from participating in any of our future Challenges, so please play fairly.

HOW CAN YOU FIND OUT WHO WON? 
To find out who won, send an email to v-daconn@microsoft.com by September 15, 2014 with the subject line: “SQL Server QQ Winners”.

WHO IS SPONSORING THIS CHALLENGE? 
Microsoft Corporation 
One Microsoft Way 
Redmond, WA 98052

Video – Pier 1 Imports Uses Azure ML to Build a Better Relationship with their Customers


This is the first in a series of posts on how Microsoft customers are gaining actionable insights on their data by operationalizing Machine Learning at scale in the cloud.

With over 1,000 stores, Pier 1 Imports aims to be their customers’ neighborhood store for furniture and home décor. But the way customers are shopping is different today and Pier 1 Imports recently launched a multi-year, omni-channel strategy called “1 Pier 1”, a key goal of which is to understand customers better and serve them with a more personalized experience across their multiple interactions and touch points with the Pier 1 brand.

Pier 1 Imports recently adopted Microsoft Azure Machine Learning to help them predict what their customers might like to buy next. Working with Microsoft partner MAX451, they built an Azure ML solution that predicts what a customer’s future product preferences might be and how they might like to purchase and receive these products.

Learn more about this innovative solution in the customer’s own voice by clicking the video below: 

Many Microsoft customers across a broad range of industries are deploying enterprise-grade predictive analytics solutions using Azure ML. You too can get started on Azure ML today.

ML Blog Team

[Announcement] ODataLib 6.7.0 Release


We are happy to announce that ODataLib (ODL) 6.7.0 has been released and is available on NuGet. Detailed release notes are listed below:

New Features

  1. EdmLib supports a list of additional core vocabulary terms
        -  IsLanguageDependent
        -  RequiresType
        -  ResourcePath
        -  DereferenceableIDs
        -  ConventionalIDs
        -  Immutable
        -  Computed
        -  IsURL
        -  AcceptableMediaTypes
        -  MediaType
        -  IsMediaType
  2. EdmLib supports adding metadata annotations to elements in an Edm model.
  3. ODataLib supports computing Uris with the KeyAsSegment convention for deserializing.
  4. ODataUriParser supports Enum values as entity keys
  5. ODataLib supports overriding default Uri parsing behavior, including:
        -  Resolving property name
        -  Resolving type name
        -  Resolving navigation source name
        -  Resolving operation import name
        -  Resolving bound operation name
        -  Resolving function parameters
        -  Resolving entity set key
        -  Resolving binary operator node elements
        We have also provided built-in implementations to enable the following scenarios:
        1) Case insensitivity for built-in identifiers, including:
            •  In path segment
                $batch, $metadata, $count, $ref, $value
            •  In query options
                $id, $select, $expand (including nested query options), $levels, $filter, $orderby, $skip, $top, $count, $search, max, asc, desc, any, all, contains, startswith, endswith, length, indexof, substring, tolower, toupper, trim, concat, year, month, day, hour, minute, second, fractionalseconds, totalseconds, totaloffsetminutes, mindatetime, maxdatetime, now, and, or, eq, ne, lt, le, gt, ge, has, add, sub, mul, div, mod, not, round, floor, ceiling, isof, cast, geo.distance, geo.length, geo.intersects
            •  $it, true, false, and null are not supported and remain case sensitive
            •  AND, OR, NOT for $search are not supported and remain case sensitive
        2) Case insensitivity for user metadata, including:
            •  EntitySet/Singleton Name
            •  EntitySet key predicate name
            •  Property Name
            •  Type Name
            •  FunctionImport name/parameter name
            •  ActionImport name
            •  Bound function name/parameter name
            •  Bound action name
        3) Unqualified function call, including:
            •  Bound function call without namespace prefix
            •  Bound action call without namespace prefix
        4) Omitting Enum type name prefix for Enum value, including:
            •  Omitting Enum type prefix in binary operator
            •  Omitting Enum type prefix in entity set key value
            •  Omitting Enum type prefix in function parameter

Bug fixes

  1. Fixes a bug where the DataServiceCollection constructor runs into a stack overflow when the parameter T derives from a generic type of T.
  2. Fixes a bug where the Edm validator doesn't correctly report an error when the Edm model contains duplicated key properties.

Call to Action

We encourage you and your team to try out this new version if you are interested in the new features and fixes above. For any feature request, issue, or idea, please feel free to reach out to us at odatafeedback@microsoft.com.

Predict What's Next: Getting Started on Azure Machine Learning - Part 1


Earlier this summer at WPC, we announced the preview of Microsoft Azure Machine Learning, a fully-managed cloud service for building predictive analytics solutions. With this service, you can overcome the challenges most businesses have in deploying and using machine learning. How? By delivering a comprehensive machine learning service that has all the benefits of the cloud. In mere hours, with Azure ML, customers and partners can build data-driven applications to predict, forecast and change future outcomes – a process that previously took weeks and months.

But once you get your hands on Azure ML, what do you do with it? Some examples we already see happening include:

  • Consumer oriented firms with targeted marketing, churn analysis and online advertising
  • Manufacturing companies enabling failure and anomaly forecasting for predictive maintenance
  • Financial services companies providing credit scoring, bankruptcy prediction and fraud detection
  • Retailers doing demand forecasting, inventory planning, promotions and markdown management
  • Healthcare firms and hospitals supporting patient outcome prediction and preventive care.

So how can machine learning impact your organization? Walk through these tutorials and start exploring the possibilities. The video tutorials and learning resources below will help you to quickly get up and running on Azure ML.

  1. Create an Azure Account
    Before you begin, you must create an Azure account. Create a free trial here.

  2. Overview of Azure ML: Watch an overview of the Azure Machine Learning service: a browser-based workbench for the data science workflow, which includes authoring, evaluating, and publishing predictive models.


     
  3. Getting started with Azure ML Studio: Walk through a visual tour of the Azure Machine Learning studio workspaces and collaboration features.


     
  4. Introduction to Azure ML API Service: Learn about the Azure Machine Learning API service capabilities.


     
  5. Provisioning Azure ML Workspaces: Walk through steps needed to provision a Machine Learning workspace from the Azure Portal.

     

Look for more video tutorials later this week, when we’ll cover getting and saving data in Azure ML, pre-processing that data, how we handle R in Azure ML Studio, and deploying predictive models with Azure ML.

In the meantime, there are a ton of resources you can use to continue your learning:

From Stumps to Trees to Forests


This blog post is authored by Chris Burges, Principal Research Manager at Microsoft Research, Redmond.

In my last post we looked at how machine learning (ML) provides us with adaptive learning systems that can solve a wide variety of industrial strength problems, using Web search as a case study. In this post I will describe how a particularly popular class of learning algorithms called boosted decision trees (BDTs) works. I will keep the discussion at the ideas level, rather than give mathematical details, and focus here on the background you need to understand BDTs. I mentioned last time that BDTs are very flexible and can be used to solve problems in ranking (e.g. how should web search results be ordered?), classification (e.g., is an email spam or not?) and regression (e.g., how can you predict what price your house will sell for?).  Of these, the easiest to describe is the binary classification task (that is, the classification task where one of two labels is attached to each item), so let’s begin with that.

A BDT model is a collection (aka ensemble) of binary decision trees (that is, decision trees in which each internal node has two children), each of which attempts to solve part of the problem. To understand how single decision trees work let’s consider the simplest possible decision tree, known as a decision stump, which is just a tree with two leaf nodes. To make things concrete, let’s consider the classification task of predicting someone’s gender based only on other information, and to start with, let’s assume that we only know their height. A decision stump in this case looks like this:

Thus, the decision stump asserts that if a person’s height h is less than a threshold height ht, then that person is female, else male. The parameter ht is found using a “training set” of labeled data by determining the ht such that, if a leaf node asserts male or female based on the majority vote of the labels of the training samples that land there, the overall error rate on that training set is minimized. It is always possible to pick such a threshold: the corresponding error rate is known as the Bayes error. Let’s suppose we have such a training set and that doing this yields ht = 1.4 meters.
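To make the threshold search concrete, here is a minimal sketch in Python of how ht could be chosen by scanning candidate thresholds and keeping the one with the lowest training error; the heights, labels, and function names are invented for illustration and are not from the original post.

    import numpy as np

    # Toy training set: heights in meters and gender labels (0 = female, 1 = male).
    heights = np.array([1.55, 1.60, 1.62, 1.70, 1.72, 1.78, 1.80, 1.85])
    labels  = np.array([0,    0,    0,    1,    0,    1,    1,    1])

    def fit_stump(h, y):
        """Pick the threshold ht that minimizes training error, with each leaf
        predicting the majority label of the training samples that land in it."""
        best_t, best_err, best_leaves = None, np.inf, None
        candidates = (np.sort(h)[:-1] + np.sort(h)[1:]) / 2   # midpoints between sorted heights
        for t in candidates:
            left, right = y[h < t], y[h >= t]
            left_lbl  = int(round(left.mean()))               # majority vote in the left leaf
            right_lbl = int(round(right.mean()))              # majority vote in the right leaf
            pred = np.where(h < t, left_lbl, right_lbl)
            err = np.mean(pred != y)                          # training error for this threshold
            if err < best_err:
                best_t, best_err, best_leaves = t, err, (left_lbl, right_lbl)
        return best_t, best_err, best_leaves

    t, err, leaves = fit_stump(heights, labels)
    print(f"ht = {t:.2f} m, training error = {err:.2f}, leaf labels = {leaves}")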

Now of course there are tall women and short men, and so we expect the Bayes error to be nonzero. How might we build a system that has a lower error rate? Well, suppose we also have access to additional data, such as the person’s weight w. Our training set now looks like a long list of tuples of type (height, weight, gender). The input data types (height and weight) are called features. The training data will also have labels (gender) but of course the test data won’t, and our task is to predict well on previously unseen data. We haven’t gone far yet, but we’ve already run into three classic machine learning issues: generalization, feature design, and greedy learning. Machine learning usually pins its hopes on the assumption that the unseen (test) data is distributed similarly to the training set, so let’s adopt that assumption here (otherwise we will have no reason to believe that our model will generalize, that is, perform well on test data). Regarding feature design, we could just use h and w as features, but we know that they will be strongly correlated: perhaps a measure of density such as w/h would be a better predictor (i.e. result in splits with lower Bayes error). If it so happened that w/h is an excellent predictor, the tree that approximates this using only w and h would require many more splits (and be much deeper) than one that can split directly on w/h. So perhaps we’d settle on h, w and w/h as our feature set. As you can see, feature design is tricky, but a good rule of thumb is to find as many features as possible that are largely uncorrelated, yet still informative, to pick the splits from. 
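As a tiny illustration of that kind of feature construction (the values and column order are invented for this example), one might stack the derived ratio alongside the raw measurements:

    import numpy as np

    # Made-up (height in meters, weight in kilograms) pairs.
    hw = np.array([[1.60, 55.0],
                   [1.75, 80.0],
                   [1.82, 90.0]])

    # Feature matrix with columns h, w, and the derived linear-density feature w/h.
    features = np.column_stack([hw, hw[:, 1] / hw[:, 0]])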

Notice that our features have different dimensions – in the metric system, h is measured in meters, w in kilograms, and w/h is measured in kilograms per meter. Other machine learning models such as neural networks will happily do arithmetic on these features, which works fine as long as the data is not scaled; but it does seem odd to be adding and comparing quantities that have different dimensions (what does it mean to add a meter to a kilogram?). Decision trees do not have this problem because all comparisons are done on individual features: for example it seems more natural to declare that if a person’s linear density is larger than 40 kilos per meter, and their height is greater than 1.8 meters, then that person is likely male. If after training, some law was passed requiring that all new data be represented in Imperial rather than in Metric units, for any machine learning system we can simply scale the new data back to what the system was trained on; but for trees we have an alternative, namely to rescale the comparison in each node, in this case from Metric to Imperial units. (Not handling such rescaling problems correctly can cause big headaches, such as the loss of one’s favorite Mars orbiter.) 

Regarding greediness, when growing the tree, choosing all of its splits simultaneously to give the overall optimal set of splits is a combinatorially hard problem, so instead we simply scan each leaf node and choose the node and split that results in the lowest overall Bayes error. This often works well but there are cases (we’ll look at one below) for which it fails miserably. So, after two splits our decision tree might look like this, where we’ve put in the concrete values found by minimizing the error on a training set:

Now let’s turn to that miserable failure. Suppose the task is to determine the parity of a 100-bit bit vector (that is, whether the total number of bits set to 1 is even or odd, not whether the vector represents an even or odd number!), and that for training data we’re given access to an unlimited number of labeled vectors. Any choice of bit position to split on in the first node will give the worst possible Bayes error of ½ (up to small statistical fluctuations, whose size depends on the size of the training set we wind up using). The only binary tree that can fully solve this will have to have depth 100, with 1,267,650,600,228,229,401,496,703,205,376 leaf nodes. Even if this were practical to do, the tree would have to be constructed by hand (i.e. not using the greedy split algorithm) because even at the penultimate layer of the tree (one layer above the leaf nodes), any split still gives a Bayes error of ½. Clearly this algorithm won’t work for this task, which can be solved in nanoseconds by performing ninety-nine XOR operations in sequence. The problem is that trying to split the task up at the individual bit level won’t work. As soon as we are allowed to consider more than one bit at a time (just two, in the case of XOR) we can solve the problem very efficiently. This might look like a minor concern, but it does illustrate that greedily trained trees aren’t likely to work as well on tasks for which the label depends only on combinations of many features.
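As a quick empirical check of this point (an illustrative sketch, not from the post), the following generates random parity data and measures the best accuracy any single-bit stump can reach; every bit position hovers around 50%, which is why the greedy algorithm cannot even get started here.

    import numpy as np

    rng = np.random.default_rng(0)
    n_bits, n_samples = 100, 20000
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    y = X.sum(axis=1) % 2          # parity label: 1 if an odd number of bits are set

    # Best accuracy achievable by splitting on any single bit position,
    # allowing either orientation of the leaf labels.
    best_single = max(
        max(np.mean((X[:, j] == 1) == y), np.mean((X[:, j] == 0) == y))
        for j in range(n_bits)
    )
    print(f"best single-bit stump accuracy: {best_single:.3f}")   # ~0.5 up to noise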

Let’s look now at binary trees for regression. Binary classification may be viewed as a special case of regression in which the targets take only two values. The case in which the targets take on more than two values, but still a finite number of values, can be thought of as multiclass classification (for example: identify the species of animal shown in a photo). The case in which the targets take on a continuum of values, so that a feature vector x maps to a number f(x), where f is unknown, is the general regression task. In this case the training data consists of a set of feature vectors xi (for example, for the house price prediction problem, the number of bedrooms, the quality of the schools, etc.) each with f(xi) (the selling price of the house) provided, and the task is to model f so that we can predict f(x) for some previously unseen vector x. How would a tree work in this more general case? Ideally we would like a given leaf to contain training samples that all have the same f value (say, f0), since then prediction is easy – we predict that any sample that falls in that leaf has f value equal to f0. However there will almost always be a range of f values in a given leaf, so instead we try to build the tree so that each leaf has the smallest such range possible (i.e. the variances of the f values in its leaves are minimized) and we simply use the mean value of the f values in a given leaf as our prediction for that leaf. We have just encountered a fourth classic machine learning issue: choice of a cost function. For simplicity we will stick with the variance as our cost function, but several others can also be used.

So now we know how to train and test with regression trees: during training, loop through the current leaves and split on that leaf, and feature, that gives the biggest reduction in variance of the f’s. During the test phase, see which leaf your sample falls into and declare your predicted value as the mean of the f’s of the training samples that fell there. However this raises the question: does it make sense to split if the number of training samples in the leaf is small? For example, if it’s two, then any split will reduce the variance of that leaf to zero. Thus looms our fifth major issue in machine learning: regularization, or the avoidance of overfitting. Overfitting occurs when we tune to the training data so well that accuracy on test data begins to drop; we are learning particular vagaries of our particular training set, which don’t transfer to general datasets (they don’t “generalize”). One common method to regularize trees is indeed to stop splitting when the number of training samples in a leaf falls below a fixed, pre-chosen threshold value (that might itself be chosen by trying out different values and seeing which does best on a held-out set of labeled data).
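Putting the last two paragraphs together, here is a minimal illustrative sketch of a greedy regression tree: each split is chosen to maximize the reduction in variance of the f values, and splitting stops when either child would receive fewer training samples than a pre-chosen minimum (the simple regularizer just described). The names and structure are invented for exposition, not production code.

    import numpy as np

    def build_tree(X, f, min_samples_leaf=5):
        """Greedy regression tree: split on the (feature, threshold) pair that most
        reduces the variance of f; stop when a child would get too few samples."""
        node = {"value": f.mean()}                       # prediction if this node stays a leaf
        best_gain, best = 0.0, None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] < t
                if left.sum() < min_samples_leaf or (~left).sum() < min_samples_leaf:
                    continue                             # respect the regularization threshold
                p = left.mean()                          # fraction of samples going left
                gain = f.var() - (p * f[left].var() + (1 - p) * f[~left].var())
                if gain > best_gain:
                    best_gain, best = gain, (j, t, left)
        if best is not None:
            j, t, left = best
            node.update(feature=j, threshold=t,
                        left=build_tree(X[left], f[left], min_samples_leaf),
                        right=build_tree(X[~left], f[~left], min_samples_leaf))
        return node

    def predict(node, x):
        """Route a single sample x down the tree and return the leaf mean."""
        while "feature" in node:
            node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
        return node["value"]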

The regression tree is a function (that maps each data point to a real number). It is a good rule to use only the simplest function one can to fit the training data, to avoid overfitting. Now, for any positive integers n and k, a tree with n^k leaves has n^k - 1 parameters (one for the split threshold at each internal node). Suppose instead that we combine k trees, each with n leaves, using k-1 weights to linearly combine the outputs of the other k-1 trees with the first. This model can also represent exactly n^k different values, but it has only nk-1 parameters – exponentially fewer than the single tree. This fits our “Occam’s Razor” intuition nicely, and indeed we find that using such “forests”, or ensembles of trees, instead of a single tree, often achieves higher test accuracy for a given training set. The kinds of functions the two models implement are also different. When the input data has only two features (is two dimensional), each leaf in a tree represents an axis-parallel rectangular region in the plane (where the rectangles may be open, that is, missing some sides). For data with more than two dimensions the regions are hyper-rectangles (which are just the high dimensional analog of rectangles). An ensemble of trees, however, is taking a linear combination of the values associated with each rectangle in a set of rectangles that all overlap at the point in question. But, do we lose anything by using ensembles? Yes: interpretability of the model. For a single tree we can follow the decisions made to get to a given leaf and try to understand why they were sensible decisions for the data in that leaf. But except for small trees it’s hard to make sense of the model in this way, and for ensembles, it’s even harder.
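The parameter counting behind that “exponentially fewer” claim is short enough to write out, assuming (as the text does) one threshold parameter per internal node and one mixing weight per additional tree:

    \begin{align*}
    \text{single tree, } n^k \text{ leaves:} \quad & n^k - 1 \text{ internal nodes} \;\Rightarrow\; n^k - 1 \text{ parameters},\\
    \text{ensemble of } k \text{ trees, } n \text{ leaves each:} \quad & k(n-1) \text{ thresholds} + (k-1) \text{ weights} = kn - 1 \text{ parameters},
    \end{align*}

yet both models can represent n^k distinct output values.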

So, finally, what is boosting? The idea is to build an ensemble of trees, and to use each new tree to try to correct the errors made by the ensemble built so far. AdaBoost was the first boosting algorithm to become widely adopted for practical (classification) problems. Boosting is a powerful idea that is principled (meaning that one can make mathematical statements about the theory, for example showing that the model can learn even if, for the classification task, each subsequent tree does only slightly better than random guessing on the task) and very general (one can solve a wide range of machine learning problems, each with its own cost function, with it). A particularly general and powerful way to view boosting is as gradient descent in function space, which makes it clear how to adapt boosting to new tasks. In that view, adding a new tree is like taking a step downhill in the space in which the functions live, to find the best overall function for the task at hand.
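For squared-error regression, the “gradient descent in function space” view reduces to the familiar recipe of repeatedly fitting a small tree to the current residuals. A minimal sketch, reusing the toy build_tree and predict helpers from the regression-tree example above and an assumed learning rate, might look like this:

    import numpy as np

    def boost(X, f, n_trees=100, learning_rate=0.1, min_samples_leaf=5):
        """Gradient boosting for squared loss: each new tree is fit to the residuals
        (the negative gradient of the loss) left by the ensemble built so far."""
        prediction = np.full(len(f), f.mean())           # start from the constant model
        trees = []
        for _ in range(n_trees):
            residuals = f - prediction                   # negative gradient of 1/2 (f - prediction)^2
            tree = build_tree(X, residuals, min_samples_leaf)
            step = np.array([predict(tree, x) for x in X])
            prediction += learning_rate * step           # a damped step "downhill" in function space
            trees.append(tree)
        return f.mean(), trees

    def boosted_predict(base, trees, x, learning_rate=0.1):
        """Prediction of the whole ensemble for a single sample x."""
        return base + learning_rate * sum(predict(t, x) for t in trees)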

To go much further would require that I break my promise of keeping the ideas high level and non-mathematical, so I will stop here – but the math is really not hard, and if you’re interested, follow the links above and the references therein to find out more. If you’d like to try BDTs for yourself, you can use the Scalable Boosted Decision Trees module in AzureML to play around and build something cool.

Chris Burges
Learn about my research.

Predict What's Next: How to Get Started with Machine Learning Part 2


Earlier this week we talked about getting up and running on Azure to start exploring the new machine learning service. From here, you can really start to dig in, try out the capabilities of cloud ML, and put predictive analytics into practice.

Below you’ll find several more video tutorials that help you learn your way around the service. Check them out, and let us know what predictions you uncover or big ideas you solve.

  1. Getting and Saving Data in Azure ML Studio: Data access is the first step of the data science workflow. Azure Machine Learning supports numerous ways to connect to your data. This video illustrates several methods of data ingress in Azure Machine Learning.


     
  2. Pre-processing data in Azure ML Studio: Data preprocessing is the next step in the data science workflow and in general data analysis projects.  This video illustrates the commonly used modules for cleaning and transforming data in Azure Machine Learning.


     
  3. R in Azure ML Studio: Azure Machine Learning supports R. You can bring your existing R code into Azure Machine Learning, run it in the same experiment with the provided learners, and publish it as a web service via Azure Machine Learning. This video illustrates how to incorporate your R code in ML Studio.


     
  4. Predictive Modeling with Azure ML: Azure Machine Learning features a palette of modules for building a predictive model, including state-of-the-art ML algorithms such as Scalable Boosted Decision Trees, Bayesian Recommendation Systems, Deep Neural Networks and Decision Jungles developed at Microsoft Research. This video walks through the steps of building, scoring and evaluating a predictive model in Azure Machine Learning.


     
  5. Deploying a Predictive Model as a Service – Part 1: This video walks through creating a Web service graph for a predictive model and putting the predictive model into staging, using the Azure Machine Learning API service


     
  6. Deploying a Predictive Model as a Service – Part 2: Azure Machine Learning enables you to put a staging service into production via the Azure Management portal.  This video walks through putting the predictive model staging service into production.


     

As a reminder, there are a ton of resources you can use to continue your learning:

Azure SQL Database new service tiers now generally available


We are excited to announce that the new SQL Database service tiers, Basic, Standard, and Premium are now generally available. These service tiers raise the bar for what you can expect from a database-as-a-service with business-class functionality that is both built-in and seamless to use—allowing you to dramatically increase the number of databases managed under a single database administrator.

Today is an important milestone for the Azure SQL Database community. Since the first public introduction in 2009, our journey has been influenced by our direct and deep engagements with customers and partners.  Along the way we have increased the global scale and reach of the service, increased database sizes, and made it easier to run database diagnostics, to name a few improvements. Your drive to push the boundaries on what is possible in the cloud brought us to a million databases. Your feedback on what you need from a relational database-as-a-service helped us reimagine an approach that aligns best with the unique needs of cloud-based database workloads.

In April, we introduced the Basic, Standard, and Premium tiers into preview. These tiers address the needs of today’s demanding cloud applications by providing predictable performance for your light- to heavy-weight transactional applications while also ensuring the performance of your apps is no longer affected by other customer workloads. Additionally, the new tiers provide you with the following new capabilities:

  • Higher uptime SLA: previously at 99.9%, uptime is now 99.99%, one of the highest in the database-as-a-service industry
  • Point-in-time-restore, with built-in backups and up to 35 days of data retention
  • Active geo-replication and standard geo-replication options for continuous data replication to geographically dispersed secondaries
  • Larger database sizes, previously at 150 GB, database max size is now up to 500 GB
  • Auditing for added security confidence (remains in preview)

Since the preview release in April, we listened to important feedback on the new tiers and, as a result, pre-announced changes to GA based on this direct dialogue. A summary of the key changes is:

  • New S0 performance level:  Within the Standard service tier, we have introduced an S0 performance level to ease the transition from Basic to Standard.
  • Premium and Standard price reductions:  Final pricing reflects up to 50% savings from previously-published GA pricing. GA pricing will take effect on November 1, 2014.
  • Hourly billing:  Starting today, Azure SQL Database will move to hourly billing in the new service tiers. 

I am incredibly excited about the value our ongoing investments will continue to deliver to you. Our customers Samsung, ESRI, Callaway Golf, and Pottermore, to name a few, are already using Azure SQL Database as a relational database service platform to help grow their cloud-based businesses. With an expanding portfolio of cloud services, including DocumentDB, Azure Search, Azure Machine Learning, and Azure HDInsight, and complementary data services from our partners, we’re committed to delivering a complete data platform that makes it easier for you to work with data of any type and size—using the tools, languages and frameworks you want in a trusted cloud environment.

Try the new SQL Database service tiers today.


2014 winners: 24 Hours of PASS


Thank you to everyone who took the time to enter the contest #pass24hop Challenge!  As always, we had a great time listening to such passionate community speakers.

Congratulations to all the winners!  You will be notified via a Direct Message (DM) with details on how to redeem your free Microsoft Certification Exam.

So without further ado, the winners are……

Session #1

Taiob Alia @SQLTaiob correctly answered the question: “What did N.U.S.E stand for?” posed by Brent Ozar who conducted the “Who Needs a DBA??”

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7273

Session #2

John Fewer @johnfewer correctly answered the question: “What is the default buffer size for SSIS pipelines?” posed by Brian Knight who conducted the “Performance Tuning SQL Server Integration Services”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7292

Session #3

Yang Shuai @shuaiyang correctly answered the question: “If it was required, what is the type of witness for quorum that would be used if using standalone instances of SQL Server in an availability group?” posed by Allan Hirt who conducted the “Availability Groups vs Failover Cluster Instances: What’s the Difference?

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7275

Session #4

Mike Cornell @DataMic correctly answered the question: “In the UPDATE demo, what city does John Smith move to?” posed by Kalen Delaney who conducted the “In-Memory OLTP Internals: How is a 30x Performance Boost Possible?”

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7276

Session #5

John Ludvig Brattas @intolerance correctly answered the question: “How many relationships can you define between two tables in Tabular?” posed by Marco Russo who conducted the “Create your first SSAS Tabular Model”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7277

Session #6

Sergio Pacheco @Sergmis correctly answered the question: “Using XE, how can you determine the number of times a query was executed during a specific timeframe?” posed by Erin Stellato and Jonathan Kehayias who conducted the “Everything You Never Wanted to Know about Extended Events”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7278

Session #7

The listeners were stumped on the question: “Who are the complainer hotel guests from the demo? What show are they from?” posed by Hope Foley who conducted the “Spatial Data: Looking Outside the Map”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7295

Session #8

The listeners were stumped on the question: “Without stats, what will the row estimation be for an equality predicate?” posed by Gail Shaw who conducted the “Guessing Games: Statistics, Heuristics, and Row Estimations”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7279

Session #9

@fja correctly answered the question: “What is the name of the executable that ships with SQL Server that can be used to collect diagnostic information?” posed by Tim Chapman and Denzil Ribeiro who conducted the “Troubleshoot Customer Performance Problems Like a Microsoft Engineer”

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7293

Session #10

Andrea Allred @RoyalSQL correctly answered the question: “What technology should you think about when you get a requirement to encrypt data on the fly?” posed by Argenis Fernandez who conducted the “Secure Your SQL Server Instance without Changing Any Code”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7280

Session #11

Theresa Iserman @Theresalserman correctly answered the question: “What are the 4 C’s of Hiring?”  posed by Joe Webb who conducted the “Hiring the Right People: Interviewing and Selecting the Right Team”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7294

Session #12

Andy Pho @andycsuf correctly answered the question:  “Why does Chris (@SQLShaw) recommend database mirroring as his primary technology choice to upgrade SQL Server?” posed by Chris Shaw and John Morehouse who conducted the “Real World SQL 2014 Migration Path Decisions”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7291

Session #13

Stephen Radford @stephen_radford correctly answered the question: “What are the first two cmdlets you should learn, when learning PowerShell?” posed by Robert Cain, Bradley Ball and Jason Strate who conducted the “Are your Indexes Hurting you or Helping You?”

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7281

Session #14

Mark Holmes @SQLJuJu correctly answered the question: “Which Excel data mining add-in offers functionality for the entire data mining lifecycle?” posed by Peter Myers who conducted the “Past to Future: Self-service Forecasting with Microsoft BI”

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7283

Session #15

P K Towett @pthepebble correctly answered the question: “SQL Server 2014 adds one piece of functionality to statistics maintenance, what is it?” posed by Grant Fritchey who conducted the “Query Performance Tuning in SQL Server 2014”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7284

Session #16

Allen Smith @cognitivebi correctly answered the question: “What is your first mission when trying to determine if indexes help or hurt you?” posed by Jes Borland who conducted the “Are your Indexes Hurting you or Helping You?”

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7296

Session #17

Andre Ranieri @sqlinseattle correctly answered the question: “What behavior is sometimes called 'the king of irreplaceable behaviors'?” posed by Kevin Kline who conducted the “Techniques to Fireproof Your IT Career”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7285

Session #18

Ginger Grant @DesertisleSQL correctly answered the question: “In what part of a predictive analytics project do people typically spend 60-80% of their time?” posed by Carlos Bossy who conducted the “Predictive Analytics in the Enterprise”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7286

Session #19

The listeners were stumped on the question: “What WSMan feature do you need to set up on your client system to remote to Azure VMs?” posed by Allen White who conducted the “Manage Both On-Prem and Azure Databases with PowerShell?”

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7287

Session #20

@wBob_uk correctly answered the question: “What is the name of the base theme used in the Adventure Works Power View demo today?” posed by Julie Koesmarno who conducted the “I Want It NOW!" Data Visualization with Power View”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7290

Session #21

Conan Farrell @SQL_Dub correctly answered the question: “What is the "mantra" for engineering a DWH solution?” posed by Davide Mauri who conducted the “Agile Data Warehousing: Start to Finish”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7289

Session #22

Nicky @ ADA @NickyvV correctly answered the question: “Which DAX function lets you access columns in other tables?” posed by Alberto Ferrari who conducted the “DAX Formulas in action”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7274

Session #23

Regis Baccaro @regbac correctly answered the question: “What R package do you need to use to connect R and PowerBI?” posed by Jen Stirrup who conducted the “Business Intelligence Toolkit Overview: Microsoft Power BI & R”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7288

Session #24

Anil Maharjan @Anil_Maharjan correctly answered the question: “Can the SSRS databases be clustered?” posed by Ryan Adams who conducted the “SQL Server AlwaysOn Quickstart”.

Webcast Link: http://www.sqlpass.org/24hours/2014/summitpreview/Sessions/SessionDetails.aspx?sid=7282

 

This is just a taste of what you can expect from PASS Summit 2014. To ensure you have a seat, register today at the PASS Summit 2014 registration page.

Announcing the Release of Web API OData 5.3


The NuGet packages for ASP.NET Web API OData 5.3 are now live on the NuGet gallery!

Download this release

You can install or update the NuGet packages for ASP.NET Web API OData 5.3 using the NuGet Package Manager Console, like this:

  • Install-Package Microsoft.AspNet.OData -Version 5.3.0
  • Install-Package Microsoft.AspNet.WebApi.OData -Version 5.3.0

What’s in this release?

This release primarily includes great new features for Web API OData v4 as summarized below:

ASP.NET Web API OData 5.3

Additional Documentation

Tutorials and other information about Web API OData are available from the ASP.NET web site (http://www.asp.net/web-api/overview/odata-support-in-aspnet-web-api).

Questions and feedback

You can submit questions related to this release on the ASP.NET Web API forums. Please submit any issues you encounter and feature suggestions for future releases on our CodePlex site.

Thanks and enjoy!

New VM Images Optimized for Transactional and DW workloads in Azure VM Gallery


We are delighted to announce the release of new optimized SQL Server images in the Microsoft Azure Virtual Machines Gallery. These images are pre-configured with optimizations for transactional and Data Warehousing workloads respectively by baking in our performance best practices for running SQL in Azure VMs.

What preconfigured VM images are available?

The following four new pre-configured VM images are now available in the Azure VM Gallery:

  • SQL Server 2014 Enterprise Optimized for Transactional Workloads on Windows Server 2012 R2
  • SQL Server 2014 Enterprise Optimized for Data Warehousing on Windows Server 2012 R2
  • SQL Server 2012 SP2 Enterprise Optimized for Transactional Workloads on Windows Server 2012
  • SQL Server 2012 SP2 Enterprise Optimized for Data Warehousing on Windows Server 2012

Currently we support these images on VM instances that allow up to 16 data disks attached to provide the highest throughput (or aggregate bandwidth). Specifically, these instances are Standard Tier A4, A7, A8 and A9 and Basic tier A4. Please refer to Virtual Machine and Cloud Service Sizes for Azure for further details on the sizes and options.

How to provision a VM from the gallery using the new transactional/DW images?

To provision an optimized transactional or DW VM image by using the Azure Management Portal,

  1. Sign in to the Azure Management Portal.
  2. Click VIRTUAL MACHINE in the Azure menu items in the left pane.
  3. Click NEW in the bottom left corner, and then choose COMPUTE, VIRTUAL MACHINE, and FROM GALLERY.
  4. On the Virtual machine image selection page, select one of the SQL Server for transactional or Data Warehousing images.
  5. On the Virtual machine configuration page, in the SIZE option, choose from the supported sizes.

    Please note that only Standard tier A4, A7, A8 and A9 and Basic Tier A4 are supported at this point and attempts to provision unsupported VM sizes will fail.
  6. Wait for the provisioning to finish. While waiting, you can see the provisioning status on the virtual machines page (as in the picture below). When the provisioning is finished, the status will be Running with a checkmark.

Alternatively, you can use the PowerShell cmdlet New-AzureQuickVM to create the VM. You will need to pass your cloud service name, VM name, image name, admin user name, password, and so on as parameters. A simple way to obtain the image name is to use Get-AzureVMImage to list all the available VM images.

What are the specific configurations included in the transactional/DW images?

The optimizations we include in the optimized images are based on the Performance Best Practices for SQL Server in Azure Virtual Machines. Specifically, they include:

 

 

Disk configurations (applies to both image types unless noted)

  • Number of data disks attached: 15 (one more data disk is left for the user to attach and determine its usage)
  • Storage Spaces: two storage pools
      -  1 data pool with 12 data disks; fixed size 12 TB; Column = 12
      -  1 log pool with 3 data disks; fixed size 3 TB; Column = 3
  • Stripe size: 64 KB for the Transactional images, 256 KB for the DW images
  • Disk sizes, caching, allocation: 1 TB each, HostCache = None, NTFS allocation unit size = 64 KB

SQL Server configurations

  • Startup parameters: -T1117 to help keep data files the same size in case the DB needs to autogrow, and -T1118 to assist in TEMPDB scalability (see here for more details)
  • Recovery model: no change for the Transactional images; set to “SIMPLE” for the MODEL database using ALTER DATABASE for the DW images
  • Setup default locations: SQL Server error log and trace file directories moved to data disks
  • Default locations for databases: system databases moved to data disks; the location for creating user databases changed to data disks
  • Instant File Initialization: enabled
  • Locked pages: enabled (see here for more details)

 

FAQ

  • Any pricing difference between the optimized images and the non-optimized ones?
    No. The new optimized images follow exactly the same pricing model (details here) with no additional cost. Note that larger VM instance sizes do carry a higher cost.
  • Any other performance fixes I should consider?
    Yes, consider applying the relevant performance fixes for SQL Server.
  • How can I find more information on Storage Spaces?
    For further details on Storage Spaces, please refer to Storage Spaces Frequently Asked Questions (FAQ).
  • What is the difference between the new DW image and the previous one?
    The previous DW image requires customers to perform additional steps, such as attaching the data disks after VM creation, while the new DW image is ready for use upon creation, so it is more streamlined and less error prone.
  • What if I need to use the previous DW image? Is there any way I can access it?
    The previous VM images are still available, just not directly accessible from the gallery. Instead, you can continue using Powershell commandlets. For instance, you can use Get-AzureVMImage to list out all images and once you locate the previous DW image based on the description and publish date, you can use New-AzureVM to provision it accordingly.

Visit our Azure portal and give this new SQL VM image offering a try, and let us know what you think.

Let your colleagues know about the new VM images by sharing via your preferred social channels, and don’t forget to follow @SQLServer on Twitter and find SQL Server on Facebook.

Cumulative Update #12 for SQL Server 2012 SP1

Dear Customers, The 12th cumulative update release for SQL Server 2012 SP1 is now available for download at the Microsoft Support site. Cumulative Update 12 contains all the SQL Server 2012 SP1 hotfixes which have been available since the initial...(read more)

Cumulative Update #2 for SQL Server 2012 SP2

Dear Customers, The 2nd cumulative update release for SQL Server 2012 SP2 is now available for download at the Microsoft Support site. Cumulative Update 2 contains all hotfixes which have been available since the initial release of SQL Server 2012...(read more)

Microsoft Machine Learning Hackathon 2014


This blog post is authored by Ran Gilad-Bachrach, a Researcher with the Machine Learning Group in Microsoft Research.

Earlier this summer, we had our first broad internal machine learning (ML) hackathon at the Microsoft headquarters in Redmond. One aspect of this hackathon was a one-day competition, the goal of which was to work in teams to get the highest possible accuracy on a multi-class classification problem. The problem itself was based on a real world issue being faced by one of our business groups, namely that of automatically routing product feedback being received from customers to the most appropriate feature team (i.e. the team closest to the specific customer input). The data consisted of around 15K records, of which 10K were used for training and the rest were split for validation and test. Each record contained a variety of attributes including such properties as device resolution, a log file created on the device, etc. Overall, the size of each record was about 100KB. Each record could be assigned to one of 16 possible feature teams or “buckets”. Participating teams had the freedom to use any tool of their choice to extract features and train models to map these records automatically into the right bucket.

The hackathon turned out to be a fun event with hundreds of engineers and data scientists participating. Aside from being a great learning experience it was also an opportunity to meet other people in the company with a shared passion for gleaning insights from data using the power of ML. We also used this event as an opportunity to gain some wisdom around the practice of ML, and I would like to share some of our interesting findings with you.

We had more than 160 participants in the competition track. We asked them to form teams of 1-4 participants and ended up with 52 teams. Many participants were new to the field of ML and therefore, unsurprisingly, 11 of 52 teams failed to submit any meaningful solution to the problem at hand. However, when we looked closer at the teams that dropped out, we found out that all teams with just a single member had dropped out! While it is quite possible that the participants who showed up without a team were only there for the free breakfast, when we surveyed our participants and asked them whether working in teams was beneficial, a vast majority, well over 90%, agreed or strongly agreed with the statement.

We also found out that there were two strategies for splitting the problem workload within teams. Most teams assigned specific roles to their team members, with 1 or 2 participants working on “feature engineering” while others tried different learning algorithms, ramped up on the tools, or created the requisite “plumbing”. The other strategy we saw teams use was to have each participant try a different approach to the problem, i.e. build multiple end-to-end solutions, with each team member using a different strategy, and later zoom in on the most promising of these approaches.

We did not notice a significant difference in the performance of teams based on the strategy they used to split the workload. However, there was evidence that it was important to be working in teams and to be thoughtful about how to split a given ML challenge between team members. Assigning roles and having clarity on what tools to use were critical considerations.

Over the course of the day, teams were allowed to make multiple submissions of their candidate solutions. We scored these solutions against the test set but these scores were revealed only at the end of the event, when winners were announced. There were more than 270 submissions made, overall. It is interesting to look at these submissions when they are grouped together – the graph below shows all the submissions as blue dots, with the X-axis representing accuracy on the training set and the Y-axis representing accuracy on the test set.

Most submissions with a training accuracy below 0.55 showed a good match between train and test accuracy (the gray line marks equal train and test accuracy). However, test accuracy keeps improving even as the gap between train and test accuracy becomes ridiculously large. For example, the winner (the red dot) had a training accuracy of 94% and a test accuracy of 62%.

Next, let us look at the behavior of the algorithms used by the different teams in this particular competition. We are able to understand and analyze these algorithms only because we had asked participants to add a short description to every submission they made (i.e. akin to textual comments provided by software developers each time they update code in a source control system).

It is interesting to see these algorithms plotted on a graph (below). The submissions with a large gap between training and test accuracy were mostly using boosted trees. This makes sense, since boosting works by aggressively reducing the training error. Additionally, note that 4 of the leading teams – each of which placed among the top 15 submissions overall – were using boosted trees to solve this particular problem.

We have seen similar patterns in other cases too, and boosted trees are a strong candidate for many ML tasks. If your data has some spatial or temporal structure, it may be easier to encode it using neural nets (although exploiting it may be non-trivial). However, if there is no structure to the features, or if you have a limited amount of time to spend on a problem, you should definitely consider boosted trees.
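
As a rough, self-contained illustration of both points – boosted trees performing strongly while opening a large gap between training and test accuracy – here is a small sketch on synthetic data (scikit-learn's gradient boosting is assumed here; it is not necessarily the implementation any team used):

```python
# Fit a boosted-tree classifier and compare train vs. test accuracy to see the
# kind of train/test gap discussed above. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, n_informative=20,
                           n_classes=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = GradientBoostingClassifier(n_estimators=150, max_depth=3, random_state=0)
model.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```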

Beyond the numbers and the graphs, what was cool about this event was that hundreds of engineers got a chance to work together and learn and have some plain fun – do check out our quick 1 minute time-lapse video of this, our inaugural ML hackathon.  

With over half our hackathon participants indicating that they were new to ML, it was great that they showed up in such big numbers and did as well as they did. In fact, one of our Top 5 teams was made up entirely of summer interns who were new to this space. If some of you out there are emboldened by that and wish to learn ML for yourself, you can start at the Machine Learning Center – there are videos and other resources available, as well as a free trial.

Ran
Follow my research

Extensibility and R Support in the Azure ML Platform


This blog post is authored by Debi Mishra, Partner Engineering Manager in the Information Management and Machine Learning team at Microsoft.

The open source community practicing machine learning (ML) has grown significantly over the last several years, with R and Python tools and packages especially gaining adoption among ML practitioners. Many powerful ML libraries have been developed in these languages, resulting in a virtuous cycle that draws even more people to them. The popularity of R has a lot to do with CRAN, while Python adoption has been significantly aided by the SciPy stack. In general, though, these languages and their associated tools and packages are a bit like islands – there is not much interoperability across them. The interoperability challenge is not just at the language or script level: there are specialized objects for “dataset”, specialized interpretations of “columnar schema” and other key data science constructs in these environments. To truly enable the notion of “ambient intelligence in the cloud”, ML platforms need to allow developers and data scientists to mix and match the languages and frameworks used to compose their solutions. Data science solutions frequently involve many stages of computation and data flow, including data ingestion, transformation, optimization and ML algorithms. Different languages, tools and packages may be optimal for different steps, because they fit the needs of a particular stage better.

The Azure ML service is an extensible, cloud-based, multi-tenant service for authoring and executing data science workflows and putting such workflows into production. A unique capability of the Azure ML Studio toolset is the ability to perform functional composition and execute arbitrary workflows with data and compute. Such workflows can be operationalized as REST end-points on Azure. This enables a developer or data scientist to author their data and compute workflows using a simple “drag, drop and connect” paradigm, test these workflows and then stand them up as production web services with a single click.
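
As a purely illustrative example of what consuming such a REST endpoint might look like from client code, here is a hedged Python sketch using the requests package; the URL, API key, and payload schema below are placeholders, and the real request format is given by the API help page generated for each published service:

```python
# Hypothetical client call to a published Azure ML request-response endpoint.
# URL, key and input schema are placeholders -- substitute the values from the
# service's API help page.
import json
import requests  # third-party package: pip install requests

ENDPOINT_URL = "https://example.azureml.net/workspaces/<id>/services/<id>/execute"
API_KEY = "<your-api-key>"

payload = {
    "Inputs": {"input1": {"ColumnNames": ["feature_1", "feature_2"],
                          "Values": [[0.42, 1.7]]}},   # one record to score
    "GlobalParameters": {},
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + API_KEY},
    data=json.dumps(payload),
)
response.raise_for_status()
print(response.json())  # scored results returned by the workflow
```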

A key part of the vision for the Azure ML service is the emphasis on the extensibility of the ML platform and its support for open source software such as R, Python and other similar environments. This way, the skills as well as the code and scripts that exist among current ML practitioners can be directly brought into and operationalized within the context of Azure ML in a friction-free manner. We built the foundations of the Azure ML platform with this tenet in mind.

R is the first such environment that we support, specifically in the following manner:

  • Data scientists can bring their existing assets in R and integrate them seamlessly into their Azure ML workflows.

  • Using Azure ML Studio, R scripts can be operationalized as scalable, low latency web services on Azure in a matter of minutes!

  • Data scientists have access to over 400 of the most popular CRAN packages, pre-installed. Additionally, they have access to optimized linear algebra kernels that are part of the Intel Math Kernel Library.

  • Data scientists can visualize their data using R plotting libraries such as ggplot2.

  • The platform and runtime environment provide extensibility and interoperability via high fidelity, bi-directional dataframe and schema bridges between R and Azure ML workflows.

  • Developers can access common ML algorithms from R and compose them with other algorithms provided by the Azure ML platform.

The pictures below show how one would use the “Execute R Module” in Azure ML to visualize a sample dataset, namely “Breast cancer data”.

It is gratifying to see how popular R has been with our first wave of users. Interestingly, the most common errors our users see happen to be syntax errors that they discover in their R scripts! Usage data shows R being used in about one quarter of all Azure ML modelling experiments. The R forecasting package is being used by some of Microsoft’s key customers as well as some of our own teams, internally.

You too can get started with R on Azure ML today. Meanwhile, our engineering teams are working hard to extend Azure ML with similar support for Python – for more information on that, just stay tuned to this blog.

Debi
Follow me on Twitter
Email me at debim@microsoft.com


New Optimized OLTP and Data Warehousing SQL Server Images on Azure


Microsoft Azure continues to offer customers more reasons why Azure Virtual Machines are an ideal place to develop, test and run SQL Server applications. This promise is further enhanced with the release of new workload-optimized images for SQL Server that provide greater performance and simplified setup for running data warehousing and OLTP workloads in Azure. There are four new SQL Server images being released, with their benefits outlined below:

  • New Data Warehousing Optimized Image for both SQL Server 2014 and SQL Server 2012 – These images simplify the setup process for customers by adding more automation, for example automatically attaching disks to the SQL Server VM running a data warehouse workload.
  • New OLTP Optimized Image for both SQL Server 2014 and SQL Server 2012 – These new images allow customers to get better performance for IO-intensive OLTP workloads. One of the key improvements in this tuned image is the ability to attach many disks to the SQL Server VM, which is critical for improving IO in an OLTP workload because the number of disks has a direct impact on OLTP performance. In addition, new Windows features such as Storage Pools are used in multi-disk environments to improve IO performance and latency.

The image below highlights the new VM images:

Be sure to read the blog post “New VM Images Optimized for Transactional and DW Workloads in Azure VM Gallery” for a more detailed technical overview.

Continue learning more about Microsoft Azure, Virtual Machines, and how our customers are reaping the benefits of these technologies:

Try Microsoft Azure.

Learn more about Virtual Machines.

Read how Amway and Lufthansa leveraged Microsoft SQL Server 2014 and Windows Azure.

SQL Application Column Encryption Sample (Codeplex) available


To meet many compliance guidelines on Azure SQL Database, the application needs to encrypt the data. The intent of this article is to provide some guidelines and an example library for encrypting data at rest in relational databases.

We just published the source code for a library, the “SQL Application Column Encryption Sample”, on Codeplex (https://sqlcolumnencryption.codeplex.com/); it can help developers encrypt data (columns) at rest in an Azure SQL database. This library is intended as sample code and is published as open source, with the goal of allowing the community to improve it while we make a better solution available for Azure SQL Database.
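
To illustrate the general pattern only – this is not the API of the sample library – here is a hedged Python sketch of application-side column encryption, using the third-party cryptography package and sqlite3 as a stand-in for the relational database:

```python
# The application encrypts sensitive column values before they reach the
# database and decrypts them after reading, so only ciphertext is at rest.
import sqlite3
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, load the key from a secure store
cipher = Fernet(key)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, ssn BLOB)")

ciphertext = cipher.encrypt(b"123-45-6789")          # encrypt before writing
conn.execute("INSERT INTO customers (ssn) VALUES (?)", (ciphertext,))

stored = conn.execute("SELECT ssn FROM customers").fetchone()[0]
print(cipher.decrypt(stored).decode())               # decrypt after reading
```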

We would appreciate your comments and feedback on this library; it will help us improve it and build better solutions in the future.

Please use the Discussions section of the Codeplex project, or leave a comment on this post, to share your feedback.

Microsoft Azure Offers HDInsight (Hadoop-as-a-service) to China


Microsoft today announced that Azure HDInsight is now available to all customers in China as a public preview, making Microsoft the first global cloud provider with a publicly available cloud Hadoop offering in China. With this launch, Microsoft is bringing Azure HDInsight’s ability to process big data volumes from unstructured and semi-structured sources to China. Both local Chinese organizations and multinational corporations with offices in China can spin up a Hadoop cluster within minutes.

Hadoop is an open-source platform for storing and processing massive amounts of data. By using Hadoop alongside your traditional data architecture, you can gain deep insights into data you never imagined being able to access. As an example, Blackball, a Taiwanese tea and dessert restaurant chain, was able to combine traditional data from its point-of-sale systems with new data from social sentiment and weather feeds to understand why customers purchase its products. By combining traditional sources with new “Big Data”, Blackball found that hot or cold weather was not really a factor in its sales of hot and cold drinks, and it was able to adjust to customer demand accordingly.

It is these types of results that are causing a wave of demand for big data. We invite you to try HDInsight today either in China or across the other Azure data centers. 

Learn more through the following resources:

EF6.1.2 Beta 1 Available


Today we are making Beta 1 of the EF6.1.2 release available. This patch release contains bug fixes and some contributions from our community.

 

What’s in Beta 1?

EF6.1.2 is mostly about bug fixes; you can see a list of the fixes included in EF6.1.2 on our CodePlex site.

We also accepted a couple of noteworthy changes from members of the community:

  • Query cache parameters can be configured from the app/web.config file.
  • SqlFile and SqlResource methods on DbMigration allow you to run a SQL script stored as a file or embedded resource.

 

Where do I get the beta?

The runtime is available on NuGet. Follow the instructions on our Get It page for installing the latest pre-release version of Entity Framework runtime.

The tooling for Visual Studio 2012 and 2013 is available on the Microsoft Download Center. The tooling will also be included in the next preview of Visual Studio “14”.

 

Support

This is a preview of changes that will be available in the final release of EF6.1.2 and is designed to allow you to try out the new features and report any issues you encounter. Microsoft does not guarantee any level of support on this release.

If you need assistance using the new features, please post questions on Stack Overflow using the entity-framework tag.

 

Thank you to our contributors

We’d like to say thank you to folks from the community who have contributed to the 6.1.2 release so far:

  • BrandonDahler
  • Honza Široký
  • martincostello
  • UnaiZorrilla

KDD – Two Themes


This blog post is authored by Jacob Spoelstra, Director of Data Science at the Information Management & Machine Learning (IMML) team at Microsoft.

The recently concluded KDD conference reaffirmed its claim as the premier conference for Data Science, for both theory and practice, as evidenced by the sold-out crowd of over 2000 that packed the halls at the New York Sheraton. Premier sponsorship by Bloomberg and record-setting attendance (almost double last year’s) indicate this remains a white-hot field.

Every year brings a mix of new algorithms and applications. In line with this year’s theme of Data Mining for Social Good, two key aspects came to the fore: operationalization and interpretability. Right from the opening remarks, we heard several times about the need to get predictive models out of the lab, into real-world systems, and driving real actions. Appropriately, the winning Social Good paper, “Targeting Direct Cash Transfers to the Extremely Poor” by Brian Abelson, Kush Varshney and Joy Sun, describes applying image recognition to locate villages suffering extreme poverty in Kenya and Uganda, driving the deployment of aid and staff.

In his keynote, “Data, Predictions and Decisions in Support of People and Society”, Eric Horvitz challenged the community to build systems that change the world. Deployment of predictive models remains a tough issue – data scientists are familiar with the long and painful process of going from good performance on training data to a production system. This typically involves documenting all the data transformations and model details, then handing them over to engineers to implement. Foster Provost and Tom Fawcett, in their excellent book “Data Science for Business”, remind us that the solution you deliver is not the model your data scientist developed; it is the algorithm your IT department implemented.

Business sponsors want to comprehend the model and understand the drivers of outcomes. The interpretability issue is known to those who work in regulated industries such as consumer credit. The Fair Credit Reporting Act requires that consumers be provided with actionable reasons if declined for credit. In general, credit reports come with reasons behind the score. Explanations are a common requirement in customer-facing scenarios such as credit card fraud prevention (e.g. why are you blocking my card?) and online merchant suggestions (e.g. why is this being recommended to me?). In his keynote talk, “A Data Driven Approach to Diagnosing and Treating Disease”, Eric Schadt of the Icahn School of Medicine explained that medical professionals need to understand why a model produced a specific diagnosis. Interpretability is often accomplished by sacrificing accuracy – resorting to a relatively simple model such as linear or logistic regression or basic decision trees where the behavior can be understood by examining the model parameters and structure. In many cases, higher accuracy can be achieved using more complex non-linear methods such as neural networks or boosted decision trees but at the cost of comprehensibility. These models are best understood by their behavior, as opposed to inspecting the formula.
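
To make the trade-off tangible, here is a small hedged sketch (synthetic data and hypothetical feature names) contrasting a logistic regression, whose signed coefficients can be read directly as reasons, with a boosted-tree model that is better understood by probing its behavior:

```python
# Interpretable vs. black-box: coefficients double as "reasons"; the boosted
# model's behavior has to be probed rather than read from its parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)
feature_names = ["utilization", "late_payments", "income", "tenure", "inquiries"]

linear = LogisticRegression(max_iter=1000).fit(X, y)
boosted = GradientBoostingClassifier(random_state=0).fit(X, y)

# Signed coefficients, largest effect first.
for name, coef in sorted(zip(feature_names, linear.coef_[0]),
                         key=lambda pair: -abs(pair[1])):
    print(f"{name:>14}: {coef:+.2f}")

print("linear accuracy :", linear.score(X, y))
print("boosted accuracy:", boosted.score(X, y))
```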

Interpretability is more than being able to comprehend the relation between model inputs and outputs. A point that Eric Horvitz emphasized is the importance of translating analytical results into business terms: simulate the system, show the costs and the assumptions about the efficacy of treatments, and present the true net benefit for realistic business scenarios.

Both themes play well to the strengths of the product we just launched, Azure ML. Deploying a model to a cloud-hosted web service is just a few clicks away, and from there it is easy to integrate it into production systems where real-life decisions can be affected. This easy deployment also facilitates interpretability, in the sense that such a model can readily be queried as part of a what-if simulation.

Historically this has been difficult to accomplish because translating a lab model into a production system that can score new data was complex and time consuming, and would only be done once the model had been finalized. Easy deployment opens up the opportunity to observe the behavior of the system directly, by manipulating the input data in interesting ways.

As a proof of concept, we developed an Excel plug-in that can call a published request-response service using data in Excel tables as input. This allowed us to use Excel’s GUI tools both to manipulate data and to graph results. Here are two examples:

  1. Direct “what if” scenarios: Using GUI controls, a user can manipulate inputs to define a specific case and observe the outcome. This could be used to explore the effect of perturbations around a specific case: e.g., what would the prediction be if inflation rates were 1% higher.

  2. Monte Carlo simulation: The user sets ranges of inputs (with probability distributions), then the system samples from possible scenarios, calculates and plots the distribution of outcomes. This is useful for getting estimates of best and worst cases, and most likely outcomes.
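
As a hedged sketch of the second pattern, with a simple local function standing in for a call to the published model, the simulation might look like this:

```python
# Sample inputs from assumed distributions, score every scenario, and summarize
# the distribution of outcomes. score_scenario() is a placeholder for the
# deployed model (e.g. a batched call to the request-response service).
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

inflation = rng.normal(loc=0.02, scale=0.005, size=N)       # ~2% +/- 0.5%
demand_growth = rng.uniform(low=-0.05, high=0.10, size=N)   # -5% to +10%

def score_scenario(inflation, demand_growth):
    # Hypothetical net benefit; replace with a real model or web service call.
    return 100 * demand_growth - 400 * inflation

outcomes = score_scenario(inflation, demand_growth)
p5, p50, p95 = np.percentile(outcomes, [5, 50, 95])
print(f"worst case  (5th pct): {p5:.1f}")
print(f"most likely (median) : {p50:.1f}")
print(f"best case  (95th pct): {p95:.1f}")
```

In the proof-of-concept described above, the scoring step was a call to the published web service rather than a local function, with Excel handling the input ranges and the plotting.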

As data scientists, we have our work cut out for us to get our models integrated into applications. While new tools do lower the technology barriers, consumers and business owners still need systems that they can trust and relate to. For a walk-through of how to build, score and evaluate a predictive model in Azure ML, get started by watching this step-by-step video.

Jacob Spoelstra
