From Data to Operationalized ML in 60 Minutes!
This blog post was co-authored by Debi Mishra, Jacob Spoelstra and Dmitry Pechyony of the Information Management & Machine Learning team at Microsoft.
Microsoft has a strong track record for crafting tools, such as our Office apps and Visual Studio, which millions of users find relatively easy to use. These apps have set the industry standard for individual and team productivity in terms of how quickly a new user can learn the tool, use it to accomplish their tasks, and automate tedious or mundane activities so they can better focus on their job.
Great tools spark creativity and do not get in the way of the user. They make seemingly difficult things easy to accomplish. Great tools often end up creating an entirely new breed of empowered users. Take, for instance, how Microsoft Visual Basic in the 1990s expanded the base of software programmers by millions worldwide – users who might otherwise never have taken up such a pursuit.
When we created Azure Machine Learning Studio, our target audience included all data scientists – from aspiring students and hobbyists to enthusiasts and seasoned experts. Our primary vision for the tool is captured in the title of this blog post – we truly believe that the ease and power of a tool such as Azure ML Studio can be measured by the time it takes a typical data scientist, even someone relatively new to the field, to go all the way from raw data to a fully operationalized web service, powered by the intelligence harnessed from that data.
In this post, we talk about how Azure ML Studio is helping drive greater productivity and ease of use for our Data Scientist audience. In particular, we focus on how the tool enables users to stand up intelligent web services powered by predictive analytics in a matter of an hour or less.
The ease of use starts with Azure ML being cloud hosted. There is no software to install, no hardware to manage, no dependency on IT and practically no constraints on disk space or CPU cycles. With our free option – which no longer requires an Azure subscription or credit card – you can start developing ML models in a matter of minutes and you can do so from anywhere, using any device and using nothing but a web browser. You can start work on an ML experiment at your workplace, pick things up from where you left them during your commute and – later the same evening – continue running your experiment from your tablet at home.
Model Authoring Experience
Azure ML Studio lets you set up experiments as simple visual data flow graphs, with an easy to use drag, drop and connect paradigm. The tool also makes many common data science tasks easy and intuitive. For instance, you can do the following:
Bootstrap from a set of pre-authored templates of fully working experiments, representing common data science patterns.
Compose an experiment workflow using “modules” as algorithmic building blocks. All our modules are plug ‘n play with strong “typing” and have reasonable default settings pre-selected. So simply dropping in a module without any customization works as a reasonable starting point.
Bring in data from multiple sources, including SQL, Hadoop, OData, and Azure Storage.
Use our powerful built-in suite of world class ML algorithms. All our learners can be used in the same way and swapped with each other as needed, so there is little effort to use a new learner.
Handle feature selection and parameter tuning with the built-in feature selection and parameter sweeper modules.
Easily compare the performance of several algorithms and choose the one that works best for your problem. Since our data flow graphs support multiple parallel paths, you can make side-by-side comparisons easily.
Use our built-in support for R. Over 400 of the most popular CRAN packages come preinstalled, allowing your existing R skills and scripts to be brought directly into Azure ML and integrated seamlessly – see an earlier post on this topic. Our team is working to add Python support soon.
Easily revisit prior runs of an experiment, using our lineage tracking capability – so you can get a complete view of your prior experimentation.
Avoid programming for a large number of common tasks, which lets you focus on experiment design and iteration.
Collaborate with others worldwide on your project. Azure ML Studio lets teammates virtually look over each other’s shoulders, share data and intermediate results, and pick up work where others left off.
This screenshot of a typical Azure ML Studio experiment showcases many of these points:
We are gratified to receive many positive comments from our customers regarding our ease of use. Here is one such comment, from Yogesh Dandawate of Icertis Applied Cloud: “The standout benefit for us was to be able to quickly build and test predictive models and verify their results. There is no cognitive overhead to learn a new scripting or coding language”.
Models in Production
Data scientists want to see their models deployed and functional in the real world. A common frustration is how hard it is to put built models into production, and indeed, a large percentage of models never see real-world usage. Azure ML Studio makes it super simple to deploy a model into production use, with a single click. The operationalized workflow – containing the data transformations and the model – is deployed as a web service supported by the fully managed, secure, reliable, and elastic Azure cloud infrastructure, which provides worldwide access. The model that you build can be called from any modern programming language used by the engineering team that consumes it. As you publish the model, Azure ML Studio provides you with sample code in C#, R or Python for immediate consumption of the published web service within an app or a productivity tool like Excel. Azure ML also provides an operationalization layer for R code: you can easily transform your existing R code into a cloud-based model with REST APIs. This is a critically important feature given how large the R developer community is, and the fact that it has historically not had such an easy way to operationalize R code. The blog post Running R in the Azure ML cloud on R-bloggers discusses how Azure ML enables easy deployment of R models.
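To make that concrete, here is a minimal Python sketch of calling a published web service. The URL, API key, and request body below are placeholders – Studio generates exact, ready-to-run sample code for your specific service when you publish:

```python
import json
import requests

# Placeholders: the API help page for your published service provides
# the real scoring URL, API key, and exact request body schema.
API_URL = "<scoring URL from the service's API help page>"
API_KEY = "<API key from the service dashboard>"

# Hypothetical input record; use the feature columns your model expects.
body = {"age": 39, "education": "Bachelors", "hours-per-week": 40}

response = requests.post(
    API_URL,
    headers={"Authorization": "Bearer " + API_KEY,
             "Content-Type": "application/json"},
    data=json.dumps(body),
)
print(response.json())
```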
Our team has much work ahead as we aim to make our tool even more widely accessible and productive for our users. For instance:
We have heard from you that full REPL capability inside the Studio is desirable.
We are adding support to allow the dataset schema information to “flow” down the workflow, so that the column selector works even before you have run the experiment.
We are working on “composite modules” that will enable users to save common workflows as pre-fabricated compositions which they can reuse across many experiments.
Our design team is conducting user studies to create a continuous feedback loop, and we are combining those inputs with our service analytics to ensure that our team is fully aware of areas where the tool could do even better.
“The ease of implementation makes machine learning accessible to a larger number of investigators with various backgrounds—even non-data scientists.” says Bertrand Lasternas of Carnegie Mellon University. Hans Kristiansen of Capgemini agrees: "Azure ML offers a data science experience that is directly accessible to business analysts and domain experts, reducing complexity and broadening participation through better tooling."
If you have not done so already, please go to www.azure.com/ml and start using Azure ML Studio for free. Be sure to check out our samples, create a new experiment and stand up your own ML web service – not in weeks and months as it used to take, but in a matter of an hour or less! Send us your feedback and thoughts.
We believe that, with the right future investments, Azure ML can truly help attract many more practitioners to the data science community, just as Visual Basic did for an earlier generation of software developers.
Debi, Jacob and Dmitry
Follow Debi on Twitter
Rapid Progress in Automatic Image Captioning
This blog post is authored by John Platt, Deputy Managing Director and Distinguished Scientist at Microsoft Research.
I have been excited for many years about the grand challenge of image understanding. There are as many definitions of image understanding as there are computer vision researchers, but if we can create a system that can automatically generate descriptive captions of an image as well as a human can, then I think we’ve achieved the goal.
This summer, about 12 interns and researchers at Microsoft Research decided to “go for it” and create an automatic image captioning software system. Given all of the advances in deep learning for object classification and detection, we thought it was time to build a credible system. Here’s an example output from our system: which caption do you think was generated by a person and which by the system?
An ornate kitchen designed with rustic wooden parts
A kitchen with wooden cabinets and a sink
[The answer is below]
The project itself was amazingly fun to work on; for many of us, it was the most fun we've had at work in years. The team was multi-disciplinary, involving researchers with expertise in computer vision, natural language, speech, machine translation, and machine learning.
Not only was the project great to work on; I’m also proud of the results, which are in a preprint. You can think of a captioning system as a machine translation system, from pixels to (say) English. Machine translation experts use the BLEU metric to compare the output of a system to a human translation. BLEU breaks the captions into chunks of 1 to 4 words, then measures the amount of overlap between the system and human translations. It also penalizes short system captions.
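For reference, the standard BLEU formulation combines the modified n-gram precisions $p_n$ (for chunks of $n = 1$ to $4$ words, weighted uniformly) with a brevity penalty that discounts system output shorter than the reference:

$$\mathrm{BLEU} = \min\!\left(1,\; e^{\,1 - r/c}\right) \cdot \exp\!\left(\sum_{n=1}^{4} \tfrac{1}{4}\,\log p_n\right)$$

where $c$ is the length of the system output, $r$ is the reference length, and the first factor is the penalty for short captions mentioned above.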
To understand the highest possible BLEU score we could attain, we tested one human-written caption (as a hypothetical “system”) vs. four others. I’m happy to report that, in terms of BLEU score, we actually beat humans! Our system achieved 21.05% BLEU score, while the human “system” scored 19.32%.
Now, you should take this superhuman BLEU score with a gigantic boulder of salt. BLEU has many limitations that are well-known in the machine translation community. We also tried testing with the METEOR metric, and got somewhat below human performance (20.71% vs 24.07%).
The real gold standard is to conduct a blind test and ask people which caption is better (sort of like what I asked you above). We used Amazon’s Mechanical Turk to ask people to compare pairs of captions: is one better, the other one, or are they about the same? For 23.3% of test images, people thought that the system caption was the same or better than a human caption.
The team is pretty psyched about the result. It’s quite a tough problem to even approach human levels of image understanding. Here’s a tricky example:
System says: “A cat sitting on top of a bed”
Human says: “A person sitting on bed behind an open laptop computer and a cat sitting beside and looking at the laptop screen area”
As you can see, the system is perfectly correct, but the human uses his or her experience in the world to create a much more detailed caption.
[The answer to the puzzle, above: the system said “A kitchen with wooden cabinets and a sink”]
How it works
At a high level, the system has three components, as shown below:
First, the system breaks the image into a number of regions that are likely to be objects (based on edges). Then a deep neural network is applied to each region to generate a high-level feature vector that captures the relevant visual information. Next, we take that feature vector as input to a neural network that is trained to produce words that appear in the relevant captions. During that training, we don’t hand-assign each word to each region; instead, we use a trick (called “Multiple Instance Learning”) to let the neural network figure out which region best matches each word.
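To illustrate the Multiple Instance Learning step, here is a toy Python sketch (with made-up numbers) of one common MIL formulation, the noisy-OR rule: an image is scored as containing a word if at least one of its regions appears to trigger that word.

```python
import numpy as np

def noisy_or_word_probability(region_probs):
    """Combine per-region word probabilities into one image-level
    probability: the word is present if ANY region triggers it."""
    region_probs = np.asarray(region_probs)
    return 1.0 - np.prod(1.0 - region_probs)

# Three candidate regions and their (made-up) probabilities of "cabinet":
# one strong region is enough to make the word likely for the image.
print(noisy_or_word_probability([0.05, 0.70, 0.10]))  # ~0.74
```

In a setup like this, the training signal flows through the combined probability, so the network can learn which regions are responsible for which words without any region-level labels.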
The result is a bag of words that are detected within the image, in no particular order. It’s interesting to look at which regions caused which words to be detected:
Next, we put together the words in a sensible sentence, using a language model. You may have heard of language models: they take a training corpus of text (say, Shakespeare), and generate new text that “sounds like” that corpus (e.g., new pseudo-Shakespeare). What we do is train a caption language model to produce new captions. We add a “steering wheel” to the language model, by creating a “blackboard” of the words detected from the image. The language model is encouraged to produce those words, and as it does, it erases each one from the “blackboard”. This discourages the system from repeating the same words over and over again (which I call the Malkovich problem).
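Here is a toy sketch of the blackboard idea. The scoring function and numbers below are made up for illustration; the real system uses a trained caption language model and searches over many candidate sentences:

```python
def generate_caption(next_word_scores, detected_words, max_len=20):
    """Greedy toy decoder: words still on the blackboard get a score
    boost, and each is erased once used, discouraging repetition."""
    blackboard = set(detected_words)
    caption = []
    for _ in range(max_len):
        scores = dict(next_word_scores(caption))
        for word in blackboard:
            scores[word] = scores.get(word, 0.0) + 1.0  # encourage detected words
        word = max(scores, key=scores.get)
        if word == "</s>":          # end-of-sentence token
            break
        caption.append(word)
        blackboard.discard(word)    # erase the word from the blackboard
    return " ".join(caption)

# A trivial stand-in for a language model, for illustration only.
def toy_lm(words_so_far):
    return {"a": 0.5, "kitchen": 0.4, "with": 0.3,
            "cabinets": 0.2, "sink": 0.1, "</s>": 0.55}

print(generate_caption(toy_lm, {"kitchen", "cabinets", "sink"}))
# -> "kitchen cabinets sink" (boosted words first, then stop)
```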
The word detector and the language model are both local: the detector looks at only one region of the image to generate each word, and the language model generates only one word at a time. Neither has a sense of the global semantics, or of the appropriateness of the caption to the image as a whole. To solve this, we create a similarity model, using deep learning to learn which captions are most appropriate for which images. We re-rank the candidate captions using this similarity model (plus features of the overall sentence) and produce the final answer.
This, of course, is a high-level description of the system. You can find out more in the preprint.
Plenty of research activity
Sometimes, an idea is “in the air” and gets invented by multiple groups at the same time. That certainly seems to be true of image captioning. Before 2014, there were previous attempts at automatic image captioning systems that did not exploit deep learning. Some examples are Midge and BabyTalk. We certainly benefited from the experience of these previous systems.
This year, there has been a delightful Cambrian explosion of image captioning systems based on deep learning. It appears as if many groups were aiming towards submitting papers to the CVPR 2015 conference (with a due date of Friday, Nov 14). The papers I know about (from Andrej Karpathy and my co-authors from Berkeley) are:
Baidu/UCLA: http://arxiv.org/pdf/1410.1090v1.pdf
Berkeley: http://arxiv.org/pdf/1411.4389v1.pdf
Google: http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html
Stanford: http://cs.stanford.edu/people/karpathy/deepimagesent/
University of Texas: http://arxiv.org/pdf/1411.2539v1.pdf
This type of collective progress is just awesome to see. Image captioning is a fascinating and important problem, and I would like to better understand the strengths and weaknesses of these approaches. (I note that several groups used recurrent neural networks and/or LSTM models.) As a field, if we can agree on standardized test sets (such as COCO) and standard metrics, we'll continue to move closer to the goal of creating a system that can automatically generate descriptive captions of an image as well as a human. The results from our work this summer and from others suggest we're moving in the right direction.
John
Learn more about my research. Follow me on twitter.
Azure HDInsight Adds Deeper Tooling Experience in Visual Studio
To allow developers in Visual Studio to more easily incorporate the benefits of “big data” with their custom applications, Microsoft is adding a deeper tooling experience for HDInsight in Visual Studio in the most recent version of the Azure SDK. This extension to Visual Studio helps developers to visualize their Hadoop clusters, tables and associated storage in familiar and powerful tools. Developers can now create and submit ad hoc Hive queries for HDInsight directly against a cluster from within Visual Studio, or build a Hive application that is managed like any other Visual Studio project.
Download the Azure SDK now for VS 2013 | VS 2012 | VS 2015 Preview.
Integration of HDInsight objects into the “Server Explorer” brings your Big Data assets onto the same page as other cloud services under Azure. This allows for quick and simple exploration of clusters, Hive tables and their schemas, down to querying the first 100 rows of a table. This helps you to quickly understand the shape of the data you are working with in Visual Studio.
Also, there is tooling to create Hive queries and submit them as jobs. Use the context menu on a Hadoop cluster to immediately begin writing Hive query scripts. In the example below, we create a simple query against a Hive table with geographic info to count the rows for each country and sort the results by country. The Job Browser tool helps you visualize job submissions and their status. Double-click on any job to get a summary and details in the Hive Job Summary window.
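A query of the kind described might look like the following HiveQL, written here against hivesampletable, the sample table provisioned on HDInsight clusters (which includes a country column):

```sql
-- Count the rows for each country and sort the results by country.
SELECT country, COUNT(*) AS record_count
FROM hivesampletable
GROUP BY country
ORDER BY country;
```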
You can also navigate to any Azure Blob container and open it to work with the files contained there. The backing store is associated with the Hadoop cluster during cluster creation in the Azure dashboard. Management of the Hadoop cluster is still performed in the same Azure dashboard.
For more complex script development and lifecycle management, you can create Hive projects within Visual Studio. In the new project dialog (see below) you will find a new HDInsight Template category. A helpful starting point is the Hive Sample project type. This project is pre-populated with a more complex Hive query and sample data for the case of processing web server logs.
To get started visit the Azure HDInsight page to learn about Hadoop features on Azure.
AzureML Web Service Parameters
Overview
AzureML Web Service APIs are published from Experiments that are built using modules with configurable parameters. There is often a need to change the module behavior during Web Service execution. The Web Service Parameters feature enables this functionality.
A common example is setting up the Reader module to read from a different source, or the Writer module to write to a different destination. Other examples include changing the number of bits for the Feature Hashing module, changing the number of desired features for the Filter-Based Feature Selection module, or training and generating a forecast with newly incoming data in a time-series forecasting scenario. Parameters can be marked as required or optional at the time of creation.
How to set and use Web Service Parameters
In the following example we’ll walk through setting up and using the feature in AzureML Studio. (Click here to get started with AzureML)
We will first create a predictive Web Service from one of the sample experiments. We will parameterize the API to enable the client calling it to write the results of the prediction to an Azure Storage Blob location different from the one specified in the Experiment. This gives the client control over where to write the results of the prediction.
1. Build a Training experiment and save the Trained Model
a. Start with Sample 5: Train, Test, Evaluate for Binary Classification: Adult Dataset
b. Click Save As, then name the experiment as Web Service Parameters Example - Training
c. Remove some nodes to simplify the graph (see screenshot below), then Save and Run
d. After run is done, save the Trained Model by right-clicking on the lower pin of the Train module, then selecting Save As Trained Model. Call it Trained Model Web Service Parameters.
e. Note the newly saved Trained Model in the left menu under Trained Models.
2. Build a Scoring Experiment
Now that we have a Trained Model, we will build a Scoring experiment which we will publish as a Web Service. To do that:
a. Click on Save As and create a copy of the Experiment; call it Web Service Parameters Example – Scoring.
b. From under Trained Models in the left menu, drag the Trained Model Web Service Parameters and add it to the graph.
c. Remove the Two-Class Boosted Decision Tree learning algorithm and the Train and Split modules (we have already trained and saved a model in the step above, so we don’t need to train again).
d. Click on Project Columns and add Income to the list of excluded columns (this is the target value we will predict). The graph should now look like the one below.
3. Set a Web Service Parameter
Here is where we will use the Web Service Parameters feature to dynamically change the Writer module’s destination at run time.
a. Drag a Writer module under the Score module, then connect Score to Writer.
b. Click on the Writer module to select it, then view its properties on the right-hand side of the screen.
c. Enter the account information for the AzureBlobStorage option of the data destination. This information is available through the Azure Management Portal’s Storage option. (You will need to set up your Azure Storage in advance for this.)
d. Note the icons next to the module properties. Click on the icon next to the Path to blob, and select Set as web service parameter.
e. Set the path to container1/output1.csv
f. Note the Web Service Parameters list item added to the Properties with the Path to blob beginning with… under it.
g. To rename the parameter, click on the name and type in blobpath, then hit enter. Note the property’s new name (bottom of the list).
h. Click run.
4. Publish the Web Service
a. Set the input port of the Web Service by right-clicking on the input pin (top) of the Project Columns module and selecting Set as Publish Input. Then right-click the output pin of the Score Model module and select Set as Publish Output.
b. Click run, then click on Publish Web Service after run is completed successfully.
c. In the resulting Web Service Dashboard, note the API Key. We will copy this into the C# code later.
d. Click on the API help page link for the Batch Execution option (second entry). Note the Sample Request Payload shows the newly added parameter – blobpath.
e. Click on the Sample Code link on the Web Service help page to view the C# sample code. We will paste this code into a C# Console client App.
5. Build a client application to call the new Web Service
a. Start Visual Studio, and create a new C# Console Application. (File->New->Project->Windows Desktop->Console Application). Call it AzureMLClientApp.
b. Return to the AzureML Studio API help page, and copy the code from the C# sample into the Program.cs file of AzureMLClientApp. (Note and follow the instructions in the sample code about installing libraries and setting references.)
c. Update the parameters defined as constants in the code. A few to note:
i. BaseUrl: the Post URL on the Web Service’s API help page for the Batch Execution Service (see Step 4.e above)
ii. StorageAccountKey: the key from Azure Management Portal -> Storage -> Manage Access Keys
iii. StorageContainerName: name of the Storage Container from Azure Management Portal -> Storage -> Containers
iv. InputFileLocation: file location for the input file that we will do prediction on, e.g. C:\Temp\censusinput.csv. To download a sample input file for testing the API, return to the Training or Scoring Experiment created above, right-click on the output pin of the Adult Census Income Binary Classification Dataset (the first module at the top of either Experiment), then select Download.
v. OutputFileLocation: file location for the local output file generated after prediction, e.g. c:\Temp\censusoutput.csv.
vi. apiKey: to get this, in AzureML Studio, click on Web Services in the left menu bar, then click on the Web Service name (Web Service Parameters Example – Scoring). Then copy the API Key from the Web Service Dashboard.
d. Set the Web Service Parameter’s value
In the Program.cs file’s InvokeBatchExecutionService method, we set the value of “blobpath” to the desired blob name (e.g. container1/outputFromWebParam.csv). This will be used as the value of the parameter we set when setting up the Experiment. (A Python sketch of such a batch request appears after this walkthrough.)
e. Optional: Tweak the final Console.WriteLine statement to show the blob path we are passing in
f. Run the C# application
g. Validate the result
The output file containing the prediction results is written to the Storage blob path specified in the client application (in Azure Management Portal -> Storage -> Containers):
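For readers who prefer not to use the C# sample, here is a rough Python equivalent of the batch call with the web service parameter set. The URL and exact payload shape below are placeholders – copy both from the Sample Request Payload on your own Batch Execution API help page, which shows exactly where blobpath goes:

```python
import json
import requests

API_KEY = "<API key from the Web Service Dashboard>"
BASE_URL = "<POST URL from the Batch Execution API help page>"

# Assumed payload shape; the Sample Request Payload on the help page is
# authoritative. The web service parameter we created is set here.
payload = {
    "GlobalParameters": {
        "blobpath": "container1/outputFromWebParam.csv"
    }
}

response = requests.post(
    BASE_URL,
    headers={"Authorization": "Bearer " + API_KEY,
             "Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(response.status_code, response.text)
```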
Wrapping Up
We walked through setting up the Writer module in an AzureML BES service with a parameter to specify the destination Storage blob path at run time. During that example, we:
Created a Training Experiment and saved a Trained model
Created a Scoring Experiment using the Trained model
Used a Writer module and the new Web Service Parameters feature to set the Storage blob path as an input parameter
Published a Web Service from the Experiment, and used the Batch Execution Service (BES) to do batch prediction on an input file
Used the Web Service Parameter to set the location of the output of the prediction at run time
Wrote to the destination Storage blob path specified by the client application
We will be releasing a new feature in the near future, called Retraining APIs, which will allow programmatic retraining of trained models by using this feature to set the location of the input file at run time. We will share more details on that later.
ML Blog Team
Microsoft ML featured on CIO magazine, WIRED, KDnuggets and PCWorld in the past week
Microsoft’s Machine Learning technology got a bit of press coverage in the past week – here’s a quick round up of the major stories:
1. Internet of Things Helps Asthma Patients Breathe Easily
Medical device company Aerocrine is reducing device downtime and better servicing hospitals and clinics by using new IoT services from Microsoft including the Azure Stream Analytics real-time event processing engine and Azure Event Hubs scalable publish-subscribe ingestor.
2. The Internet of Anything: A Smartphone App That Lets You Control Your Office Environment
As part of a collection of workplace technologies, Carnegie Mellon University has built an app that gives workers more control over their office environments, letting them actively oversee things like lighting and temperature from their smartphones. The project, which uses Azure Machine Learning, will sell to both businesses and government agencies in the coming months.
3. Why Azure ML is the Next Big Thing for Machine Learning
“With advanced capabilities, free access, strong support for R, cloud hosting benefits, drag-and-drop development and many more features, Azure ML is ready to take the consumerization of ML to the next level.”
4. Microsoft, five other groups race toward automated image captioning
“Automated image processing could not only improve the Web's search engines, it could help you: automatically tagging all your vacation photos of the Eiffel Tower, for example, rather than hunting them down by the dates that you were actually in Paris.”
We wish our readers a very happy Thanksgiving!
ML Blog Team
Tweet Chat with John Platt – Thu December 4th, 1pm PDT
Tweet Chat with John Platt
Thu December 4th, 1 pm PDT
Hosted by: @MLatMSFT
Hashtag: #MLatMSFT
He is one of the leading experts in Machine Learning, a multi-patented inventor and data guru. During his Ph.D. at Caltech he even managed to discover a couple of asteroids! He is a Distinguished Scientist and Deputy Managing Director at Microsoft Research. He is John Platt.
Want to learn more about John’s story and ML at Microsoft? Do you have an ML question and want to hear the answer from an expert?
Join us @MLatMSFT on December 4th at 1pm PDT for a one hour Tweet Chat with the man, the myth, the legend: John Platt.
ML Blog Team
Python Tools for Visual Studio now integrates with Azure Machine Learning
This blog post is authored by Shahrokh Mortazavi, Partner Director of Program Management on the Microsoft Azure Machine Learning team.
Two languages are closely associated with Data Science today – R and Python. In Azure ML we’ve supported R for some time – and very soon we’ll add full Python support as well. This includes a world-class Python experience in Visual Studio, in Azure ML Studio and in the browser via Jupyter/IPython. As a first step, we’re excited to announce that the Python Tools for Visual Studio (PTVS) team has added features to integrate with Azure Machine Learning APIs hosted in the cloud.
I’m also happy to announce that PTVS 2.1 RTW was recently released and is available from CodePlex. Note that this is an officially supported OSS plug-in. When installed into the Professional version of Visual Studio (free, available here), it gives you a powerful Python-centric Data Science IDE that is completely free. We believe powerful open source tools such as PTVS will greatly empower developers and help democratize frontier technologies such as machine learning and advanced analytics.
PTVS 2.1: A Quick Overview
Python Tools for Visual Studio offers an IDE experience for general scripting, web programming and Data Science. With integrated IPython REPL support for smart history, shell commands and inline images, PTVS provides a great exploratory coding environment. With unique features such as mixed mode debugging of Python with C++ and remote debugging of Linux servers in Azure, Visual Studio provides a productive development environment for Python developers:
[Screenshot callouts: 1. Multi-lingual projects; 2. Editor with deep code intelligence; 3. VS debugger; 4. Integrated IPython REPL; 5. VS/Excel live bridge]
For a quick walkthrough of PTVS 2.1 features, take a look at this video on YouTube.
PTVS “ML Pack” and Azure ML Web Service consumption
While the focus of the 2.1 release of PTVS was web frameworks, the team has already created a “Machine Learning Pack” which can be downloaded from CodePlex to give you a taste of ML and Azure ML web services. The ML Pack has three starter templates that include everything you need, from data acquisition, cleaning and training all the way to visualization using matplotlib:
Simply select your template and hit F5 to get a sense of a typical ML workflow. Then browse through the code and customize it as you like for your particular scenario. As with everything else in PTVS, the code is open source (Apache 2.0), so feel free to send us your feedback and PRs.
Azure ML Studio is a powerful, easy-to-use canvas that enables rapid composition of ML experiments along with one-click operationalization. PTVS has full support for quickly building web apps and dashboards using frameworks such as Django, Flask and Bottle. The ML Pack now brings the two together via a wizard that enables easy consumption of published predictive APIs into your web app:
Simply fill out the form after you’ve published, and PTVS will generate a skeleton dashboard that you can deploy to Azure Web Sites.
IPython/Jupyter
Azure ML Studio provides a convenient drag/drop model for quickly building ML workflows and operationalizing them. PTVS provides a desktop Data Science workbench with excellent support for large projects, debugging, profiling, intellisense, git, etc. The last piece missing from this picture is IPython (now the polyglot “Jupyter”), which is a browser-based “notebook” REPL. Azure ML will be adding this third canvas in the near future, enabling a fully cloud hosted, cross-platform, browser based experience for data science. You’ll be able to use Jupyter on Azure ML with both Python and R. Each of these authoring environments has its own center of gravity. Our plan is to provide an integrated experience where you can use the right tool at the right time for your project.
Conclusion
Python and its ecosystem of rich libraries are a perfect fit for Data Science. You can pair PTVS with a scientific distro such as Anaconda or Canopy today, use scikit-learn, Pandas, matplotlib, etc. for analytics and Data Science work, and deploy to a VM or Cloud Service in Azure. In the near future we plan to bring you a fully integrated Visual Studio, Jupyter and Azure ML Studio experience to maximize your productivity as developers and Data Scientists. Stay tuned!
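As a minimal sketch of the kind of workflow described above (the CSV file and column names are hypothetical):

```python
# Train a quick classifier with pandas + scikit-learn and plot the most
# informative features with matplotlib (all included in distributions
# such as Anaconda). The dataset and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("census.csv")
X = pd.get_dummies(data.drop(columns=["income"]))  # one-hot encode categoricals
y = data["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Quick look at the ten most informative features.
pd.Series(model.feature_importances_, index=X.columns).nlargest(10).plot.barh()
plt.show()
```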
Shahrokh
EF7 - Priorities, Focus, and Initial Release
We all know you don’t get to write software in a vacuum. Aside from the technical task we are trying to achieve, there are things such as dependencies, stakeholders, and release dates that impact the order and priority of the tasks we do. In this post we wanted to share the top factors influencing the features and scenarios we focus on first.
Where are we at?
Up until recently we’ve been focused on validating our “New Platforms, New Data Stores” vision for EF7. This has meant building out a core framework, experimenting with patterns to support data stores with different capabilities, and implementing providers to validate our work.
In terms of data stores, we have experimented with SQL Server, SQLite, InMemory, Azure Table Storage, and Redis providers. We’ve also had discussions with other folks looking at DocumentDB, SQL Compact, and MongoDB providers.
For platforms, we have targeted traditional .NET Framework applications (WPF, WinForms, Console, and ASP.NET 4), Phone/Store/Universal, and ASP.NET 5 (a.k.a ASP.NET vNext).
The vision for EF7 is “a core framework that handles concepts common to most data stores with provider specific extensions that become available when you target a specific provider”. At this stage we feel confident that our architecture works well for our initial scenarios/providers and can evolve to accommodate future requirements.
What’s next?
Now that we are confident we are building the right thing, it’s time to start working towards a product that developers can use to write real applications.
This means we need to focus on tasks such as completing functionality, improving quality, ensuring performance, adding logging, exploring usability, etc.
One approach would be to plug away until we have a production-ready product for all providers and platforms, with all the standard features we expect from an O/RM (and, of course, plenty of previews along the way). If we operated in a vacuum, this is the approach our team would choose, as it would mean launching with the best possible product. However, in the real world we have a series of partners and commitments that mean we need to focus on meeting these goals in incremental steps.
What’s our top priority?
Our team’s top commitment is to provide a data access stack for ASP.NET 5. Because ASP.NET 5 allows apps to target CoreCLR, the existing EF6.x product cannot provide this functionality (as it does not run on CoreCLR and it is not feasible to update it to do so). Within ASP.NET 5 our primary focus is on SQL Server, and then PostgreSQL to support the standard Mac/Linux environment.
Because ASP.NET 5 applications can also target the full .NET Framework, fulfilling our commitments to ASP.NET 5 will also allow EF7 to be used in other applications that target full .NET (WPF, WinForms, Console, and ASP.NET 4).
After fulfilling our ASP.NET 5 commitments, the other priorities of our team are as follows. These are in no particular order and we will likely work on them in parallel.
- Implement additional features
- Support EF7 on other platforms (Phone, Store, etc.)
- Deliver additional providers that our team will own (SQLite, Azure Table Storage, etc.)
What does this mean?
Scoping functionality for ASP.NET 5 release
To meet the dates for the initial release of ASP.NET 5 we are going to have to scope the functionality of EF7 to exclude some features that we would consider basic O/RM functionality. Examples of this include lazy loading and inheritance mapping patterns.
Because of this, we won’t be pushing EF7 as the ‘go-to release’ for all platforms at the time of initial ASP.NET 5 release. More details on this later in the post.
Temporary removal of providers from working branch
In order to focus on our top priorities we are going to move several providers to a separate branch that we will not always keep updated with the latest changes. The packages will also not be published as part of nightly builds or pre-releases to NuGet.org. The impacted providers are SQLite, Azure Table Storage, and Redis. This will allow us to iterate on the core framework without the overhead of keeping multiple providers and their tests up-to-date as we go through various interim stages of the core framework.
Once we have stabilized the core and the code churn reduces, we will bring back the additional providers to validate the core framework again and begin working towards an initial release of these providers.
One important implication is that the SQLite provider will temporarily not be kept up-to-date or published in our nightly builds. This temporarily removes the ability to target Phone/Store applications during this period.
Tasks we’ll be working on
Scoping features and removing some providers will allow us to focus on the following activities for the initial ASP.NET 5 release.
- Complete in-flight features
- Fix bugs
- Add test coverage
- Test and improve performance
- API reviews
- Exploratory Testing
- Clear exceptions for unsupported scenarios
- Documentation
- Add logging throughout stack
Initial release for ASP.NET 5 != recommended release
As previously mentioned, we won’t be pushing EF7 as the ‘go-to release’ for all platforms at the time of the initial release to support ASP.NET 5. EF7 will be the default data stack for ASP.NET 5 applications, but we will not recommend it as an alternative to EF6 in other applications until we have more functionality implemented.
Given that ASP.NET 5 is in the same ‘part v1 and part vNext’ position as EF, the missing features will be less of an issue. We will of course be working to make them available ASAP.
We are discussing ways to make this clearer when it comes time to ship a stable version of the EntityFramework package to support ASP.NET 5. We haven’t locked down the details yet (and won’t until we get closer to release), but some options we are considering are:
- Still have the NuGet package marked as pre-release so that it is not installed when you ask for the latest stable version.
- Only support the ASP.NET 5 platforms so that it is not inadvertently installed in other applications. In this scenario you could still install a pre-release package for use on other platforms.
Summary
To ensure we meet our primary commitments to ASP.NET 5 we are going to be focusing on rounding out the existing features and the SQL Server provider for EF7.
This means scoping out some very important features, and we will not be encouraging folks to transition from EF6 to EF7 until these features are added.
We will temporarily be suspending work on SQLite, Azure Table Storage, and Redis providers and they will not be included in nightly builds. These are important scenarios for us and we will re-enable them as soon as we meet our ASP.NET 5 commitments.
We understand this is disappointing to many folks. Most notably, temporarily disabling Phone/Store support was a painful decision for us to make. Our team would love to work in isolation and make EF7 a complete and polished product on many providers and platforms before the initial release. Unfortunately that is not the case, so we’re working to meet our commitments while still delivering on the all-up vision in a reasonable time frame.
The Best of Both Worlds: Top Benefits and Scenarios for Hybrid Hadoop, from On-Premises to the Cloud
Historically, Hadoop has been a platform for big data that you deploy either on-premises with your own hardware, or in the cloud, managed by a hosting vendor. Deploying on-premises affords you specific benefits, like control and flexibility over your deployment. But the cloud provides other benefits, like elastic scale, fast time to value, and automatic redundancy, amongst others.
With the recent announcement that Hortonworks Data Platform 2.2 is generally available, Microsoft and Hortonworks have partnered to deliver Hadoop on hybrid infrastructure, spanning both on-premises and cloud deployments. This gives customers the best of both worlds: the control and flexibility of on-premises deployments, and the elasticity and redundancy of the cloud.
What are some of the top scenarios or use cases for Hybrid Hadoop? And what are the benefits of taking advantage of a hybrid model?
- Elasticity: Easily scale out during peak demand times by quickly spinning up more Hadoop nodes (with HDInsight)
- Reliability: Use the cloud as an automated disaster recovery solution that automatically geo-replicates your data.
- Breadth of Analytics Offerings: If you’re already working with on-prem Hortonworks offerings, you now have access to a suite of turn-key data analytics and management services in Azure, like HDInsight, Machine Learning, Data Factory, and Stream Analytics.
To get started, customers need Hortonworks Data Platform 2.2 with Apache Falcon configured to move data from on-premises into Azure. Detailed instructions can be found here.
We are excited to be working with Hortonworks to give users Hadoop and big data on a hybrid cloud. For more resources:
- Step-by-step instructions on configuring Hortonworks HDP 2.2 to move data into Azure
- Hortonworks Data Platform 2.2
- Azure HDInsight Service Page
- Free 30-day Trial for Azure (including HDInsight)
- Learning Map – HDInsight documentation
Weekend reading - 3 recent stories
Three new stories about Microsoft ML and Advanced Analytics.
1. Fueling the Oil and Gas industry with IoT
The oil and gas industry’s supply chain starts in some of the world’s most remote areas and serves consumers globally, wherever the finished product gets consumed. The industry depends on complex and expensive equipment from hundreds of manufacturers to extract, move, refine and sell fuel 24 hours a day. One challenge, and a significant opportunity for the industry, is to monitor these assets and use sensor data to improve the efficiency of the system and enable innovation. Learn how Rockwell Automation is using Azure and Machine Learning to take advantage of the Internet of Things (IoT) and bring its vision for The Connected Enterprise to life. In doing so, they are building intelligence that is transforming the petroleum supply chain, bringing enhanced productivity from the fuel source to the pump.
2. Using Visual Studio to build Apps that Integrate with HDInsight / Hadoop
Oliver Chiu talks about how Microsoft is making developers more productive with “big data” by adding additional tooling for HDInsight in Visual Studio as part of recent updates to the Azure SDK. These new VS extensions help developers to visualize their Hadoop clusters, tables and storage in familiar tools. You can now create and submit ad hoc Hive queries for HDInsight directly against a cluster from within VS, or build a Hive application that is managed like any other VS project.
3. New Azure ML book now the #1 New Release in ML, on Amazon.com
Co-authored by Microsoft insiders Valentine Fontama and Wee Hyong Tok, this new book provides an introduction to data science and ML with a focus on building and deploying predictive models. It explains the concepts of predictive analytics and ML through practical tasks and applications. Readers need to have a basic knowledge of statistics and data analysis, but not deep experience in data science. Advanced programming skills are not required either, although some R experience would prove handy.
ML Blog Team
Best of 2014: Top 10 Data Exposed Channel 9 Videos for Data Devs
Have you been watching Data Exposed over on Channel 9? If you’re a data developer, Data Exposed is a great place to learn more about what you can do with data: relational and non-relational, on-premises and in the cloud, big and small.
On the show, Scott Klein and his guests demonstrate features, discuss the latest news, and share their love for data technology – from SQL Server, to Azure HDInsight, and more!
We rounded up the year’s top 10 most-watched videos from Data Exposed. Check them out below – we hope you learn something new!
- Introducing Azure Data Factory: Learn about Azure Data Factory, a new service for data developers and IT pros to easily transform raw data into trusted data assets for their organization at scale.
- Introduction to Azure DocumentDB: Get an introduction to Azure DocumentDB, a NoSQL document database-as-a-service that provides rich querying, transactional processing over schema free data, and query processing and transaction semantics that are common to relational database systems.
- Introduction to Azure Search: Learn about Azure Search, a new fully-managed, full-text search service in Microsoft Azure which provides powerful and sophisticated search capabilities to your applications.
- Azure SQL Database Elastic Scale: Learn about Azure SQL Database Elastic Scale, .NET client libraries and Azure cloud service packages that provide the ability to easily develop, scale, and manage the stateful data tiers of your SQL Server applications.
- Hadoop Meets the Cloud: Scenarios for HDInsight: Explore real-life customer scenarios for big data in the cloud, and gain some ideas of how you can use Hadoop in your environment to solve some of the big data challenges many people face today.
- Azure Stream Analytics: See the capabilities of Azure Stream Analytics and how it helps make working with mass volumes of data more manageable.
- The Top Reasons People Call Bob Ward: Scott Klein is joined by Bob Ward, Principal Escalation Engineer for SQL Server, to talk about the top two reasons why people want to talk to Bob Ward and the rest of his SQL Server Services and Support team.
- SQL Server 2014 In-Memory OLTP Logging: Learn about In-Memory OLTP, a memory-optimized and OLTP-optimized database engine integrated into SQL Server. See how transactions and logging work on memory-optimized-tables, and how a system can recover in-memory data in case of a system failure.
- Insights into Azure SQL Database: Get a candid and insightful behind-the-scenes look at Azure SQL Database, the new service tiers, and the process around determining the right set of capabilities at each tier.
- Using SQL Server Integration Services to Control the Power of Azure HDInsight: Join Scott and several members of the #sqlfamily to talk about how to control the cloud from on-premises SQL Server.
Interested in taking your learning to the next level? Try SQL Server or Microsoft Azure now.
Machine Learning – Hype or Reality? Microsoft ML Experts Weigh In
The recent Practice of Machine Learning Conference at Microsoft concluded with a lively panel discussion moderated by principal researcher Misha Bilenko on the topic of: "Are We at Peak ML, or at the Start of AI Takeover? Hype vs. Reality of Machine Learning.” Our panelists were:
Greg Buehrer, Partner Development Manager, Bing Ads
John Platt, Distinguished Scientist, Microsoft Research
Joseph Sirosh, Corporate Vice President of Machine Learning
This post recaps their conversation and the kinds of issues raised by our audience.
The first question played off the title of the panel: "As humans get replaced by machines, how will data scientists be useful, and will they be replaced as well?"
Greg commented that he didn't see that happening soon. Machines are good at mechanical and physical processes, but still not that great at automating decision-making especially in very complex environments. John and Joseph took the conversation further by discussing the challenges associated with employment and wealth distribution in an environment of seemingly ever-increasing automation, but the panelists admitted they couldn’t predict exactly how this trend might play out.
After discussing the Azure ML marketplace as a place where data scientists can publish their innovative ideas as web services, all panelists issued a call to action to the audience to be ever more data-driven in their work and build their ML skills, as there are many possibilities ahead of us to make products and services more intelligent.
Next, in response to an audience question, the panelists gave their opinions on whether there is a danger to using ML systems as "black boxes" inside systems such as drones, cars, etc. without fully understanding what's inside.
Joseph's opinion was that "I don't think of ML as being any different than any software algorithm collection, and there is a lot of software you can ask the same question about". He felt that there must be systems in place to ensure reliability. Greg agreed that mistakes can be made, and that experience and decision-making responsibilities have to be assigned appropriately. John then argued that "People are teaching ML incorrectly. They teach what's inside the black box, but before that you need to learn statistical hygiene: You need to have a test set, you need to not cheat, you have to do confidence intervals, and you need to worry about outliers. Learn statistical hygiene to avoid disasters". All three panelists agreed on this.
An audience member suggested that many ML use cases focus on mitigating negatives (e.g. intrusion detection, fraud detection) and asked about positive implementations. Greg suggested that avoiding these negatives in itself was a positive for customers and businesses :-) but also mentioned things such as recommendation systems which help consumers discover things that they may like. John brought up new experiences such as Office Delve which are helping people become more productive. Joseph mentioned a conversation that he had with the founder of eHarmony about how they use ML to help compatible people discover each other. He added that there are a whole host of other scenarios where ML is making some extremely positive contributions, including speech recognition, visual and gesture recognition. John closed with the tongue-in-cheek comment that if eHarmony is using ML, then "the future evolution of human DNA is being driven by machine learning."
Next, the panelists debated how privacy and ML intersect. Some of this discussion focused on the experience that Julia Angwin describes in her book Dragnet Nation, about how difficult it can be to gain true privacy in the Internet age. Joseph’s takeaway from hearing Julia speak recently was that the discussion might be changing from whether one has privacy to "whether people who have your data are using it responsibly… it's about justice, people who have your data are accountable for it, are responsible for using it in a just way."
The discussion also touched upon the enormous possibilities that the cloud opens up for data scientists and ML practitioners, as well as the potential of massive amounts of data that are becoming available. One example, for instance, focused on airplanes collecting vast amounts of weather data as they fly, which can be used to predict weather patterns much more accurately.
John noted that, if you can solve the privacy problem, then "One of the powers of data is that if you pool data together, you get more out than you put in, because you get more information by correlation". Greg added, "Not only do you want to encourage people to put data in a certain spot, but you want to ensure that the applications collecting the data collect it in the most structured way possible" so that you can make the most use of it.
Joseph chimed in with the comment that "When you put enormous compute against enormous data, and you bring machine learning to bear along with it, and the Internet of Things feeding data into the cloud…and streaming analytics running on live data…I think that in a very short time, you will see a completely different picture of analytics".
The discussion lasted an hour and the panelists addressed well over a dozen questions, so this recap is necessarily incomplete. However, you can stay tuned for more ML happenings around Microsoft by subscribing to our ML blog feed and by following us on Twitter @MLatMSFT.
ML Blog Team
Channel 9 Video on Azure Stream Analytics
In case you missed it: Channel 9 recently featured a discussion and demos of our preview release of Azure Stream Analytics.
“Data Exposed” is a good place on Channel 9 to learn about Microsoft’s world of data – be it relational or non-relational, in established products like SQL Server or in brand-new services.
The site now features some of our latest advanced analytics capabilities. In this video, Judy Meyer and Dipanjan Banik from our Azure Stream Analytics (ASA) team introduce their fully managed streaming service, which can provide real-time insights into huge volumes of data in a matter of seconds.
They show the core capabilities of their service, including a demo of how ASA can help you perform sentiment analysis over live Twitter feeds. To check it out for yourself, click here or on the image below.
ML Blog Team
Advancing Research in Sign Language Recognition
Re-post of a recent article that ran on the
An estimated 360 million people worldwide suffer from hearing loss. But a majority of hearing individuals do not understand sign language. So communication between the hearing and the deaf can be challenging.
Now researchers are poised to make such interactions much more feasible through an easy, cost-effective and efficient prototype called the Kinect Sign Language Translator that translates sign language into spoken language – and spoken language into sign language – in real time.
Early last month, the Kinect Sign Language Working Group, a research community that includes a website for sharing data and algorithms, was established at the Institute of Computing Technology, CAS in Beijing. The community’s founding members are the CAS, Beijing Union University and Microsoft Research. This group has a very broad mission that spans machine learning, sign language, social science, and much more. We are encouraging experts from other research institutions, schools for the deaf and hard of hearing, and non-government organizations to join the Working Group.
Learn more by visiting this post or clicking on the image below.
ML Blog Team
Microsoft Strengthens Data Platform with SQL Database and Big Data Appliance Updates, Adds New Java SDK for its NoSQL Service
By Tiffany Wissner, Senior Director, Data Platform
Making it easier for more of our customers to access our latest big data technologies, we are announcing updates to some of our flagship data platform products and services. These updates are part of our approach to make it easier for our customers to work with data of any type and size – using the tools, languages and frameworks they want – in a trusted environment, on-premises and in the cloud.
Azure SQL Database
Announced last month and available today is a new version of Azure SQL Database that represents a major milestone for this database-as-a-service. With this preview, we are adding near-complete SQL Server engine compatibility, including support for larger databases with online indexing and parallel queries, improved T-SQL support with common language runtime and XML index support, and monitoring and troubleshooting with extended events. Internal tests using over 600 million rows of data show query performance improvements of up to 5x in the Premium tier of the new preview relative to today’s offering. Continuing our journey to bring in-memory technologies to the cloud, the new preview also improves performance by up to 100x when applying in-memory columnstore.
“From a strategy perspective, these SQL Database service updates are our answer to migrating and working with large data types by leveraging features such as online index rebuild, and partitioning,” said Joe Testa, vice president of Systems Development at Weichert, one of the nation’s leading full-service real estate providers. “Simply put, the results so far have been fantastic—we’re seeing >2x better performance and the advanced features that were only previously available in SQL Server, now make it easier to work with our applications as we continue to migrate our mission-critical apps to Azure.”
These new preview capabilities are offered as part of the service tiers introduced earlier this year, which deliver 99.99% availability, larger database sizes, restore and geo-replication capabilities, and predictable performance. Combined with our recently announced elastic scale technologies, which scale out to thousands of databases for processing tens of terabytes of OLTP data, and new auditing capabilities, the Azure SQL Database service is a clear choice for any cloud-based mission-critical application.
Analytics Platform System
As Microsoft’s “big data in a box” solution built with HP, Dell and Quanta, the Analytics Platform System is a data warehousing appliance that supports the ability to query across traditional relational data and data stored in a Hadoop region – either in the appliance or in a separate Hadoop cluster. This latest release includes a data management gateway that establishes a secure connection between on-premises data stored in the Analytics Platform System and Microsoft’s cloud business intelligence and advanced analytics services such as Power BI and Azure Machine Learning. This capability, coupled with PolyBase, a feature of the Analytics Platform System, allows for seamless integration of data stored in SQL Server with data stored in Hadoop. This now enables users of Power BI and Azure Machine Learning to gain insights from Analytics Platform System, whether on-premises or in the Azure cloud.
New Java, PHP and migration tools
Microsoft is also making available new tools and drivers that support greater interoperability with PHP and Java and make it easier for customers to migrate to and use our big data technologies.
Azure DocumentDB is our fully managed NoSQL document database service with native support for JSON and JavaScript. DocumentDB already includes SDKs for popular languages, including Node.js, Python, .NET, and JavaScript – today we are adding a new Java SDK that makes DocumentDB easier to use within a Java development environment. The SDK provides easy-to-use methods to manage and query DocumentDB resources, including collections, stored procedures and permissions. The Java SDK is also available on GitHub and welcomes community contributions.
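For a flavor of the new SDK, here is a minimal sketch of creating and querying a document, modeled on the SDK’s published samples; the endpoint, key, and the salesdb/orders database and collection are placeholders, and exact class and method names should be verified against the documentation on GitHub:

```java
import com.microsoft.azure.documentdb.ConnectionPolicy;
import com.microsoft.azure.documentdb.ConsistencyLevel;
import com.microsoft.azure.documentdb.Document;
import com.microsoft.azure.documentdb.DocumentClient;
import com.microsoft.azure.documentdb.DocumentClientException;

public class DocumentDbQuickstart {
    public static void main(String[] args) throws DocumentClientException {
        // Endpoint and key come from your DocumentDB account in the Azure portal.
        DocumentClient client = new DocumentClient(
                "https://<your-account>.documents.azure.com:443/",
                "<your-account-key>",
                ConnectionPolicy.GetDefault(),
                ConsistencyLevel.Session);

        // Documents are plain JSON; the SDK parses the string into a Document.
        Document order = new Document(
                "{ \"id\": \"order-1001\", \"customer\": \"Contoso\", \"total\": 42.5 }");

        // Collection links take the form dbs/<database-id>/colls/<collection-id>.
        client.createDocument("dbs/salesdb/colls/orders", order, null, false);

        // Query with DocumentDB's SQL-like grammar.
        for (Document d : client.queryDocuments(
                "dbs/salesdb/colls/orders",
                "SELECT * FROM orders o WHERE o.customer = 'Contoso'",
                null).getQueryIterable()) {
            System.out.println(d.toString());
        }
    }
}
```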
Additionally, we are bolstering our SQL Server tools and drivers with updates to the Microsoft JDBC Driver for SQL Server and the SQL Server Driver for PHP. Available early next week, these drivers will make it easier for our customers’ applications to access both SQL Server and Azure SQL Database.
For customers migrating their IBM DB2 workloads to SQL Server, we are also making available today the SQL Server Migration Assistant (SSMA) tool, which automates all aspects of database migration – migration assessment analysis, schema and SQL statement conversion, data migration, and migration testing – to reduce the cost and risk of database migration projects. SSMA 6.0 for IBM DB2 automates migrations from IBM DB2 databases to SQL Server and Azure SQL Database and is free to download and use. Support for IBM DB2 is in addition to earlier SSMA 6.0 updates, including migration support for larger Oracle databases.
Microsoft data platform
These new updates will enable more customers to use Microsoft’s data platform to build, extend and migrate more applications. Microsoft’s data platform includes all the building blocks customers need to capture and manage all of their data, transform and analyze that data for new insights, and provide tools which enable users across their organization to visualize data and make better business decisions. To learn more, go here.
Bing brings the world’s knowledge to your Office documents
Imagine your child is writing a report about Abraham Lincoln. They have just started and so far they’ve typed: “Lincoln was the 16th president of the United States. He was born in…” – but then they realize they’ve forgotten when Honest Abe was born.
Ordinarily, they would have to leave Word, open a browser window and search for “Lincoln” – all of which takes time and breaks their workflow. Worse, their search results would include many other “Lincolns,” including the car, the movie and the town in Nebraska. The browser search obviously doesn’t know their intent.
Well, now you have a solution.
Earlier this week, Bing and Office introduced Insights for Office, a cool new way to find the information you need right within the documents you are creating.
We encourage you to go ahead and try a free version right here – just click the previous link, choose the New blank document template, paste the quoted text above into the blank document, then select the text, right-click and choose “Insights” to see this in action.
Bing indexes and stores entity data from around the web, representing people, places and things. Insights for Office combines Bing’s index of the world’s knowledge with its machine-learned relevance models and text analytics capabilities to semantically understand the most important content in the user’s document and return the most relevant results.
Intrigued? Learn more about this cool new capability here.
ML Blog Team
[Announcement] ODataLib 6.9.0 Release
We are happy to announce that ODataLib 6.9.0 has been released. Detailed release notes are listed below.
New Features:
[GitHub issue #7, #24, #25]
- ODataUriParser can now parse complex values, entities, entity references and collections of them as function parameter aliases.
- OData Client for .NET now supports functions that take complex values, entities, entity references and collections of them as parameters.
[GitHub issue #33] ODataLib now supports custom payload formats, including:
- Support for resolving custom media types for request and response messages
- Support for reading and writing the following kinds of payloads in a custom payload format:
- Feed
- Entry
- Property
- Collection
- Parameter
- Error
Improvements:
[GitHub issue #12] OData Client for .NET now supports using the DELETE HTTP method in DataServiceContext.Execute.
Bug Fixes:
[GitHub issue #2] Fix a bug where model validation fails when a complex type is marked as abstract in the EDM model.
[GitHub issue #15] Fix a bug where ODataQueryOptionParser does not handle query options case-insensitively when the OData URI resolver has case insensitivity enabled.
[GitHub issue #20] Fix a bug where the client does not annotate derived complex values in a collection with the odata.type annotation.
Call to Action:
You and your team are very welcome to try out this new version if you are interested in the new features and fixes above. For any feature request, issue or idea, please feel free to reach out to us at odatafeedback@microsoft.com.
Now Available: Updated SQL Server PHP and JDBC Drivers
As part of SQL Server’s ongoing interoperability program, we are pleased to announce the general availability of two SQL Server drivers: the Microsoft JDBC Driver for SQL Server and the SQL Server Driver for PHP!
Both drivers provide robust data access to Microsoft SQL Server and Microsoft Azure SQL Database. The JDBC Driver for SQL Server is a Java Database Connectivity (JDBC) type 4 driver supporting Java Development Kit (JDK) version 1.7. The PHP driver allows developers who use PHP version 5.5 to access Microsoft SQL Server and Microsoft Azure SQL Database, and to take advantage of new features implemented in ODBC.
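As a minimal sketch of what this looks like from Java – the server, database and credentials below are placeholders – connecting to Azure SQL Database with the JDBC driver is standard JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlDatabaseConnect {
    public static void main(String[] args) throws Exception {
        // Placeholder server, database and credentials; encryption is
        // required when connecting to Azure SQL Database.
        String url = "jdbc:sqlserver://yourserver.database.windows.net:1433;"
                + "database=AdventureWorks;"
                + "user=youruser@yourserver;"
                + "password=<your-password>;"
                + "encrypt=true;hostNameInCertificate=*.database.windows.net;";

        // The type 4 driver is loaded automatically under JDK 1.7 (JDBC 4.0),
        // so no Class.forName call is needed.
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT TOP 10 name FROM sys.objects")) {
            while (rs.next()) {
                System.out.println(rs.getString("name"));
            }
        }
    }
}
```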
You can download the JDBC driver here, and download the PHP driver here. We invite you to explore the latest the Microsoft Data Platform has to offer via a trial evaluation of Microsoft SQL Server 2014, or by trying the new preview of Microsoft Azure SQL Database.
Machine Learning Trends from NIPS 2014
This blog post is authored by John Platt, Deputy Managing Director and Distinguished Scientist at Microsoft Research.
I just returned from the Neural Information Processing Systems (NIPS) 2014 conference, which was held this year in Montreal, Canada. NIPS is one of the two main machine learning (ML) conferences, the other being ICML.
NIPS has broad coverage of many ML sub-fields, including links to neuroscience (hence the name). I thought that the program chairs and committee created a conference which appealed to many different ML specialists – excellent job!
I want to share three exciting trends that I saw in NIPS this year:
Continued rapid progress in deep learning and neural networks
Making large-scale learning more practical
Research into constraints that arise in the real practice of ML
Deep Learning
Deep learning is the automatic construction of deep models from data. They are called “deep” because the models compute desired functions in multiple steps, rather than trying to solve problems in one or two steps. Deep learning is typically accomplished using neural networks, which are models that use matrix multiplication and non-linearities to build their functions.
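As a toy illustration of that definition – ours, not from any particular system – here is a two-step “network” where each step is a matrix multiplication followed by a non-linearity; stacking more such steps is what makes a model “deep”:

```java
import java.util.Arrays;

// Toy forward pass: each layer computes y = relu(W*x + b), and depth is
// simply the composition of several such steps.
public class TinyForwardPass {

    static double[] layer(double[][] w, double[] b, double[] x) {
        double[] y = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = b[i];
            for (int j = 0; j < x.length; j++) {
                sum += w[i][j] * x[j];
            }
            y[i] = Math.max(0.0, sum); // ReLU non-linearity
        }
        return y;
    }

    public static void main(String[] args) {
        double[] input = {1.0, -2.0, 0.5};

        // Hand-picked weights for illustration; a real network learns
        // millions of these parameters from data.
        double[][] w1 = {{0.2, -0.1, 0.4}, {0.7, 0.3, -0.5}};
        double[] b1 = {0.0, 0.1};
        double[][] w2 = {{1.0, -1.0}};
        double[] b2 = {0.05};

        double[] hidden = layer(w1, b1, input);  // step 1
        double[] output = layer(w2, b2, hidden); // step 2
        System.out.println(Arrays.toString(output));
    }
}
```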
Progress in deep learning since 2011 has been amazingly rapid. For example, on a benchmark of recognizing objects in images, the error rate has decreased by roughly 40% per year in relative terms. Deep learning has also become applicable well beyond image classification.
One challenging problem in ML is the co-estimation of outputs that are strongly coupled. For example, when translating a sentence from one language to another, you don’t want to translate word-by-word. You have to think about the entire sentence you would produce.
Previously, when ML algorithms estimated coupled outputs, they would explicitly use inference, which can be slow at run time. Recently, there has been some exciting work on having neural networks do the inference implicitly. At NIPS, Ilya Sutskever showed that you can use a deep LSTM model to do machine translation (MT) and perform almost as well as the state-of-the-art MT system. Ilya’s system is also more general: it can map arbitrary input sequences to output sequences. There was other work at NIPS on coupling outputs across large spans of space or time. For example, Jason Weston had a workshop paper on a neural network that used a content-addressable memory to perform question answering. The “Neural Turing Machine” uses a similar idea.
Given the successes of deep learning, researchers are trying to understand how these deep models work. Ba and Caruana had a NIPS paper which showed that, once a deep network is trained, a shallow network can learn the same function from the outputs of the deep network, even though the shallow network can’t learn that function directly from the data. This indicates that depth could be an optimization/learning trick.
Many people (including us!) have used the middle layers of deep neural networks as feature detectors for related tasks. There was a wonderful talk at NIPS where the authors ran a set of careful experiments examining this pattern. They trained a deep network on one set of 500 visual categories, kept the first N layers, and then retrained on a different set of 500 categories. They found that if you keep the middle layers and retrain on top of them, you lose some accuracy due to sub-optimal training; if you keep the highest layers, you lose some accuracy because those features are too specific to the original task. Fine-tuning everything recovers all of the lost accuracy. Very useful to know.
Large-Scale Training
Large-scale training (of all sorts of models) has continued to be an interesting research vein. While not many people have training sets above 1TB, the models trained on data at that scale tend to be commercially very valuable.
Training in machine learning is a form of parameter optimization: an ML model can be viewed as having a set of knobs that are adjusted to make the model perform well on a training set. Large-scale training then becomes large-scale optimization. Yurii Nesterov, a famous optimization expert, gave an interesting invited talk about how to solve certain optimization problems that arise from ML in time that is logarithmic in the number of parameters.
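As a toy illustration of training-as-optimization – a sketch of plain gradient descent, not Nesterov’s method – here is a single “knob” (the slope of a line) adjusted to minimize squared error on four training points:

```java
// Fit y ~ w*x by gradient descent on mean squared error.
public class TrainingAsOptimization {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2.1, 3.9, 6.2, 7.8}; // roughly y = 2x

        double w = 0.0;              // the "knob" being tuned
        double learningRate = 0.01;

        for (int step = 0; step < 1000; step++) {
            // Gradient of mean((w*x - y)^2) with respect to w.
            double grad = 0.0;
            for (int i = 0; i < x.length; i++) {
                grad += 2 * (w * x[i] - y[i]) * x[i];
            }
            grad /= x.length;
            w -= learningRate * grad; // move the knob downhill
        }
        System.out.printf("learned w = %.3f (true slope is about 2)%n", w);
    }
}
```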
When ML training is distributed across many computers, it is challenging to minimize the amount of communication between the computers. Training time is typically dominated by communication time.
One very nice NIPS talk described a method of performing distributed feature selection which only requires two phases of communicating models between all of the nodes. This looks promising.
Practice of Machine Learning
One quite positive trend I saw at NIPS was algorithmic and theoretical researchers examining issues that ML practitioners frequently encounter.
In the last few years, adversarial training has been a topic of research interest. In adversarial training, you don’t try to model the world as a probability distribution, but rather as an adversary who is trying to make your algorithm perform poorly. You then measure your performance relative to the best possible model that could be trained from the adversarial data, in hindsight.
A lot of the work in adversarial training has been quite interesting. At this NIPS, I saw some work that showed its practicality. It’s the nature of adversarial training to provide worst-case bounds. If you have an algorithm that is adapted to “easy data”, you normally lose the worst-case guarantees. A paper in the main conference showed that you can have your cake (perform well on easy data) and eat it too (get a worst-case guarantee). Drew Bagnell gave a clear talk at one of the Reinforcement Learning workshops that illustrated how adversarial learning is required in order to learn control policies in the real world (because you should treat your own mistaken decisions as an adversary).
There was a delightful workshop on Software Engineering for Machine Learning. Speakers from LinkedIn, Microsoft, Netflix, and Facebook talked about their experiences putting ML into production. Some Google engineers produced a very trenchant paper about the technical debt incurred by putting ML into production; I highly recommend reading it if you are planning to do the same.
Summary
Between the progress in deep and large-scale learning, and the theoretical focus on practical issues, I learned a lot at NIPS. I’ve gone every year that the conference has existed, and I’m looking forward to the next one.
John
Learn more about my research. Follow me on Twitter.