
Online Learning and Sub-Linear Debugging


This blog post is authored by Paul Mineiro, Research Software Developer at Microsoft.

Online learning algorithms are a class of machine learning (ML) techniques that consume the input as a stream and adapt as they consume input. They are often used for their computational desirability, e.g., for speed, the ability to consume large data sets, and the ability to handle non-convex objectives. However, they come with another useful benefit, namely “sub-linear debugging”.

The prototypical supervised online learning algorithm receives an example, makes a prediction, receives a label and experiences a loss, and then makes an update to the model. If the examples are independent samples from the evaluation distribution, then the instantaneous loss experienced by the algorithm is an unbiased estimate of the generalization error. By keeping track of this progressive validation loss, a practitioner can assess the impact of a proposed model change prior to consuming all the training input, hence sub-linear (in the training set size) debugging.
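
To make the mechanics concrete, here is a minimal, self-contained Python sketch of progressive validation on a synthetic stream, using a hand-rolled online logistic learner (an illustration of the idea, not VW's implementation):

import math
import random

random.seed(0)
w = [0.0, 0.0]  # model weights, updated online

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))  # predicted probability that y = 1

total_loss, n = 0.0, 0
for t in range(1, 10001):
    # synthetic example stream
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    y = 1 if x[0] + 0.5 * x[1] + random.gauss(0, 0.5) > 0 else 0
    p = predict(x)  # predict BEFORE seeing the label
    total_loss += -math.log(max(p if y == 1 else 1.0 - p, 1e-12))  # log loss
    n += 1
    for i in range(2):  # only then update (SGD on the log loss)
        w[i] -= 0.1 * (p - y) * x[i]
    if t % 2000 == 0:
        print(t, total_loss / n)  # progressive validation loss so far

If a proposed change makes this running average worse early in the stream, you usually do not need to finish the run to reject it.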

For example, when using the online learning software Vowpal Wabbit at a terminal, the output might look something like this:
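
An illustrative rendering of that progress table (the numbers here are made up):

average    since       example   example  current  current  current
loss       last        counter   weight   label    predict  features
0.693147   0.693147    1         1.0      -1.0000  0.0000   15
0.486261   0.279374    2         2.0      1.0000   0.5231   15
0.419893   0.353525    4         4.0      1.0000   0.6043   13
0.376312   0.332731    8         8.0      -1.0000  -0.2710  14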

The second column is the average progressive loss since the previous print.

Sub-linear debugging is an important technique to master because one of the greatest enemies of the ML practitioner is a slow experimental cycle. Most ideas, sadly, are not good ones, and if it takes an hour to rule out a bad idea you will only eliminate a handful of bad ideas a day. If, however, you take 5 minutes to rule out a bad idea, you are clearly a whole lot more productive. 

As a real-world example, consider evaluating a new feature engineering strategy. The new training data are prepared, but progressive validation loss on this new input is much worse than previous results after less than a minute of processing. A likely culprit is a software defect introduced during the modification of the feature pipeline.

Going beyond detecting obvious large errors requires some intuition about the different components of generalization error. Here's one fruitful way of thinking about it. The Bayes error rate is the best error rate possible from any prediction rule, and is a measure of the amount of noise inherent in the problem. The bias is the best error rate possible from a prediction rule that can actually be rendered by the learning algorithm. Roughly speaking, it is how well the learning algorithm would do given infinite data. The variance is how much error is caused by the learning algorithm picking a prediction rule which differs from the best prediction rule available to the algorithm. Roughly speaking, it is due to the algorithm having to filter out all the poorly performing available models using finite training data, and therefore variance tends to increase whenever the model class is made more powerful without easing the search problem for the learning algorithm. For an excellent introduction to these concepts, I refer you to master lecturer Abu-Mostafa.

With these concepts in mind, let's consider some common scenarios:

New features get introduced, and progressive validation loss is worse over the entire training run.  This is the most common case: the progressive validation loss starts off worse and stays worse. What’s going on? Adding new features cannot increase the Bayes error rate, as the set of possible prediction rules includes those that ignore the new features. For similar reasons, with many learning algorithms adding new features cannot increase the bias, because the learning algorithm could choose not to use them (e.g., assign them zero weight in a linear model). However, if the previous features were corrupted by virtue of introducing new features, then bias could go up. This corresponds to the software defect scenario outlined above.

Ruling out corruption of previous features, the remaining possibility is that variance has increased while the bias is unaffected (or even reduced). The hallmark of this effect is that the progressive loss starts to catch up to that of a previous good run as training proceeds. This is because variance is reduced with more training data. You might be tempted, under these conditions, to let the algorithm run for a long time in order to ultimately exceed previous performance, but this is a dangerous habit, as it defeats the purpose of sub-linear debugging. A better question to ask is how to achieve the (putative) bias decrease that the new features provide while avoiding much of the variance increase. Think “regularization”. Possibilities for the new features might include vector quantization, projection onto (a small number of) principal components, or application of a compressing nonlinearity.

If none of these things help, the new features might be counterproductive. Unless you are certain that these features reduce the Bayes error rate (e.g., from human experimentation), it is probably best to look elsewhere for model improvement.

Feature preprocessing gets changed, and progressive validation loss is initially better, but then is worse with more training data. This sometimes happens: new preprocessing is introduced and progressive validation loss starts off better, sometimes dramatically better. Feeling satisfied, you let the algorithm run while you get some lunch, and, when you return, you discover that with more data the progressive loss improvement slowed and performance is now worse than under the old preprocessing.

This scenario is consistent with the new preprocessing decreasing variance in exchange for increasing bias. With less data resources, this tradeoff can lower generalization error, but as data resources increase the tradeoff is no longer beneficial. Under these conditions some interpolation between the original and new preprocessing might yield best results, as the regularization effect of the new preprocessing is beneficial but too strong.

As an example, I worked on a text problem where I tried limiting the input to the first 6 characters of each word. Initially, progressive validation loss was much better, but after an hour of training it fell behind the run that used complete words. Ultimately I discovered that using the first 8 characters of each word gave a mild lift on the complete training set. Presumably, if my training set had been larger, the right prefix length would have been larger as well (e.g., 9 or 10). The idea was right (“reduce variance by treating long words the same as their shorter prefixes”) but the strength had to be adjusted to fit the data resources (“with enough data, the best strategy is to treat words as different from their shorter prefixes”).

New features or preprocessing get introduced, and progressive validation loss starts better and stays better. Congratulations, you should consider buying some lottery tickets today! More seriously, just make sure you aren't looking at the previous scenario. That means enduring a full training run. As Reagan famously said, “Trust (sub-linear debugging), but verify (on the complete data set)”.

To get going on ML, a good place to begin is the Machine Learning Center where we have several resources available.

Paul Mineiro
Follow my personal blog here. Follow me on Twitter.


Using SSDs in Azure VMs to store SQL Server TempDB and Buffer Pool Extensions


A common on-premises practice to improve the performance of SQL Server workloads is storing TempDB and/or Buffer Pool Extensions (SQL Server 2014) on SSDs. The former improves the performance of workloads that use temporary objects heavily (e.g., queries handling large recordsets, index rebuilds, row versioning isolation levels, temp tables, and triggers). The latter improves the performance of read workloads whose working set doesn’t fit in memory.

Now you can do the same in Azure VMs using the new D-Series VM Sizes.

Important: The SSD drive is transient

The SSD drive (D:\) is not persistent, so its contents and permissions will be lost if the VM moves to a different host. This can happen in case of a host failure or a VM resize operation.

  • Do not store your data or log files there. Use (persistent) drives from Azure Storage.
  • If you store TempDB and/or Buffer Pool Extensions on it, SQL Server requires the specified directory to exist when the service starts. The following section describes how to store SQL Server TempDB and/or Buffer Pool Extensions on the SSD drive and automatically recreate the directory if the VM moves to a different host.

 

Configuration Steps

1) Create a folder in the D:\ drive

This is the folder that you will store TempDB and/or Buffer Pool Extensions in, for example as “D:\SQLTEMP”.

2) To move TempDB to the SSD

Using SSMS, connect to your SQL Server instance and execute the following T-SQL commands to change the location of the TempDB files (the new locations take effect after the SQL Server service restarts):

USE master;
GO

ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'D:\SQLTEMP\tempdb.mdf');
GO

ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'D:\SQLTEMP\templog.ldf');
GO
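
To confirm the new locations were recorded (they take effect the next time the SQL Server service starts), you can check the catalog:

SELECT name, physical_name FROM sys.master_files WHERE database_id = DB_ID('tempdb');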

 

3) To configure Buffer Pool Extensions on the SSD

Using SSMS, connect to your SQL Server instance and execute the following T-SQL commands to configure the Buffer Pool Extension, specifying the location and size of its file. The general recommendation is to set the size to 4-6 times the size of the VM memory. For more details, read the documentation.

ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'D:\SQLTEMP\ExtensionFile.BPE', SIZE = <size> [ KB | MB | GB ]);
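
As a rough worked example: on a VM with 28 GB of RAM (e.g., a D4), the 4-6x guidance works out to a SIZE in the 112-168 GB range.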

 

4) Configure the Start Mode of SQL Server and SQL Agent as Manual

Using SQL Server Configuration Manager, set the Start Mode of the SQL Server and SQL Server Agent services to Manual.

 

5) Create a PowerShell script to recreate the folder in D:\ if needed and start SQL Server

Copy and paste the following script and save it as a PowerShell file on the C:\ drive (the OS drive), for example as “C:\SQL-startup.ps1”. If needed, change the D:\ folder to match the one you created in Step 1.

# service display names for a default instance; adjust for a named instance
$SQLService="SQL Server (MSSQLSERVER)"
$SQLAgentService="SQL Server Agent (MSSQLSERVER)"
$tempfolder="D:\SQLTEMP"
# recreate the TempDB/BPE folder if the transient D:\ drive came back empty
if (!(Test-Path -Path $tempfolder)) {
    New-Item -ItemType Directory -Path $tempfolder
}
# start the services (set to Manual in Step 4) by display name
Start-Service -DisplayName $SQLService
Start-Service -DisplayName $SQLAgentService

 

6) Change the PowerShell Execution Policy to allow the execution of signed scripts

From a PowerShell console, execute: Set-ExecutionPolicy RemoteSigned




7) Create a scheduled task at system startup to execute the PowerShell script

Using Task Scheduler, create a Basic Task that runs when the computer starts and executes the script in the PowerShell file. For this, specify:

  • Program/script: powershell
  • Arguments: -File "C:\SQL-startup.ps1"


8) Test the setup

Restart the VM and verify using Configuration Manager that the SQL Server service is started.

This guarantees that if the VM moves to a different host SQL Server will start successfully.

 

Try the D-Series VM sizes in the Azure Portal today.

SQL Server 2008 R2 Service Pack 3 has released.

Dear Customers, Microsoft SQL Server Product team is pleased to announce the release of SQL Server 2008 R2 Service Pack 3 (SP3). As part of our continued commitment to software excellence for our customers, this upgrade is free and doesn’t...(read more)

[Announcement] ODataLib 6.8.0 Release


We are happy to announce that ODataLib 6.8.0 has been released. Detailed release notes are listed below.

Bug Fixes

  • Removed the ordering constraint in ODataLib that forced instance annotations of a property to come before the property in a complex value.
  • Fixed a race condition in finding navigation targets in EdmLib.

New Features

  • EdmLib supports edm:EnumMember Expression for metadata annotations.

  • EdmLib & ODataLib now support TimeOfDay and Date.

  • ODataLib supports reading and writing entities as parameters.

  • ODataUriParser supports two new built-in functions: time and date.

  • ODataUriParser supports overriding default behavior for resolving unbound operation name in query options.

 

Call to Action

You and your team are welcome to try out this new version if you are interested in the new features and fixes above. For any feature request, issue or idea, please feel free to reach out to us at odatafeedback@microsoft.com.

Video – ThyssenKrupp Uses Predictive Analytics to Give Burgeoning Cities a Lift


This is our second post in a series on how Microsoft customers are gaining actionable insights on data by operationalizing ML at scale in the cloud. Based on a case study on IoT (Internet of Things), the post was edited by Vinod Anantharaman of the Information Management and Machine Learning (IMML) team at Microsoft.

Urban migration is one of the megatrends of our time. A majority of the world’s population now lives in its cities. By 2050, seven of every ten humans will call a city their home. To make room for billions of urban residents to live, work and play, there is only one direction to go – up.

As one of the world’s leading elevator manufacturers, ThyssenKrupp Elevator maintains over 1.1 million elevators worldwide, including those at some of the world’s most iconic buildings such as the new 102-story One World Trade Center in New York (featuring the fastest elevators in the western hemisphere) and the Bayshore Hotel in Dalian, China.

ThyssenKrupp wanted to gain a competitive edge by focusing on the one thing that matters most to their customers – having elevators run safely and reliably, round the clock. In the words of Andreas Schierenbeck, ThyssenKrupp Elevator CEO, “We wanted to go beyond the industry standard of preventative maintenance, to offer predictive and even preemptive maintenance, so we can guarantee a higher uptime percentage on our elevators.”

Fix it before it breaks – ‘Smart’ elevators
ThyssenKrupp teamed up with Microsoft and CGI to create a connected intelligent system to help raise their elevator uptime. Drawing on the potential of the Internet of Things (IoT), the solution securely connects the thousands of sensors in ThyssenKrupp’s elevators – sensors that monitor cab speed, door functioning, shaft alignment, motor temperature and much more – to the cloud, using Microsoft Azure Intelligent Systems Service (Azure ISS). The system pulls all this data into a single integrated real-time dashboard of key performance indicators. Using the rich data visualization capabilities of Power BI for Office 365, ThyssenKrupp knows precisely which elevator cabs need service and when. Microsoft Azure Machine Learning (Azure ML) is used to feed the elevator data into dynamic predictive models, which then allow elevators to anticipate what specific repairs they need.

As Dr. Rory Smith, Director of Strategic Development for the Americas at ThyssenKrupp Elevator, sums it up, “When the elevator reports that it has a problem, it sends out an error code and the three or four most probable causes of that error code. In effect, our field technician is being coached by this expert citizen.”

In other words, these ‘Smart’ elevators are actually teaching technicians how to fix them, thanks to Azure ML. With up to 400 error codes possible on a given elevator, such “coaching” is significantly sharpening efficiency in the field.

Hear the ThyssenKrupp story in the customer’s own voice in the video below:

Rather than respond to failure alarms after-the-fact, ThyssenKrupp technicians are now using real-time data to identify needed repairs even before breakdowns happen. The Azure ML predictive models used in this solution are continually updated via seamless integration with Azure ISS, creating an intelligent information loop. These models are expected to continually improve with time as more datasets get fed into the system. Because of two-way flow of data and control, technicians can even put an elevator into diagnostics mode and take actions remotely, reducing the need to travel. 

Customers across a swathe of industries are deploying enterprise-grade predictive analytics solutions using Microsoft Azure ML – we make it easy for you to get started today.

By using IoT and predictive analytics in the cloud to increase the efficiency of their maintenance operations and elevator uptime, ThyssenKrupp is giving the world’s burgeoning cities a lift they can rely on.

SQL Server 2008 Service Pack 4 has released.

Dear Customers, Microsoft SQL Server Product team is pleased to announce the release of SQL Server 2008 Service Pack 4 (SP4). As part of our continued commitment to software excellence for our customers, this upgrade is free and doesn’t...(read more)

SQL Server 2008 R2 SP3 and SQL Server 2008 SP4 are now available!


Microsoft is pleased to announce the release of SQL Server 2008 R2 Service Pack 3 and SQL Server 2008 Service Pack 4 . The Service Packs are available for download on the Microsoft Download Center. As part of our continued commitment to software excellence for our customers, this upgrade is available to all customers with existing SQL Server 2008 and SQL Server 2008 R2 deployments.

SQL Server 2008 R2 SP3 and SQL Server 2008 SP4 contain fixes to issues that have been reported through our customer feedback platforms.  They contain Hotfix solutions provided in SQL Server 2008 R2 cumulative updates up to and including Cumulative Update 13, and SQL Server 2008 cumulative updates up to and including Cumulative Update 17. The Service Packs also include the security bulletin MS14-044.  SQL Server 2008 and 2008 R2 are now in extended support, which means there will not be Cumulative Updates for these Service Packs.

For more on SQL Server 2008 R2 SP3, please read here. For more on SQL Server 2008 SP4, please read here.  To obtain SQL Server 2008 R2 SP3 and SQL Server 2008 SP4 please visit the links below:

SQL Server 2008 R2 SP3

SQL Server 2008 R2 SP3 Feature Packs

SQL Server 2008 SP4

SQL Server 2008 SP4 Feature Packs

Are you harnessing all that in-memory can do for your business?


Faster transactions, faster queries and faster analytics. Sounds like nirvana, right? Just imagine it… your customers can find what they want from your large product catalog more quickly and purchase without huge lags. Your business divisions can perform timely analytics highlighting product, web site, and data trends. It’s all possible with in-memory technologies, and Microsoft SQL Server 2014’s in-memory is the secret speed sauce you need to realize these benefits.

Microsoft SQL Server 2014 offers optimized in-memory technologies for transaction processing (OLTP), data warehousing and data analytics built right into the product. We have a long history with in-memory technologies in SQL Server (more on that in a subsequent blog post), and the enhancements we’ve made to the In-Memory ColumnStore provide greater data compression and increased performance, resulting in world-record benchmarks on industry standard hardware.

So what does all this mean for you? Significant performance gains for starters. Microsoft’s in-memory solution leads to up to 30x faster transactions, over 100x faster queries and reporting, and easy management of millions of rows of data in Excel. The following video highlights just how in-memory can help speed your business:

Of course, gains vary by situation, but check out a few of our customers and how they’ve benefited from the latest in-memory improvements in SQL Server 2014:

  • Nasdaq was able to decrease query times from days to minutes, while at the same time reducing storage costs by 10x.
  • Bwin, using our in-memory technology on standard commodity servers, was able to boost transaction performance by 17x and speed up queries by 340x.
  • EdgeNet realized near-real time inventory updates and higher customer satisfaction because of the 7x faster performance our in-memory gave them.

Best of all, Microsoft’s in-memory solution is included in SQL Server 2014 at no additional cost. It can be used on industry-standard hardware, without the need for expensive upgrades, and there are no new development tools, management tools or APIs to learn. We invite you to visit http://www.microsoft.com/en-us/server-cloud/solutions/in-memory.aspx where you can see more about our in-memory solution, how customers are using it to speed their business, and how you can get started.


Vowpal Wabbit Modules in AzureML


This post is authored by Sudarshan Raghunathan, Principal Development Lead for modules in the Microsoft Azure ML Studio team based in Cambridge, MA.

In his blog post last month, John Langford wrote about the open source Vowpal Wabbit (VW) machine learning (ML) system. He highlighted some of the main advantages of VW, e.g. its performance and ability to handle large sparse datasets, which make it particularly popular both within and outside Microsoft for applications such as sentiment analysis and recommendation systems.

When we initially released the public preview of Azure ML in July this year, we exposed a small subset of VW functionality as part of our Feature Hashing module. The latter transforms datasets with text features into binary features using the feature hashing algorithm (MurmurHash) implemented in VW. When we refreshed the service earlier this month, we added two new modules to our palette, Vowpal Wabbit Train and Vowpal Wabbit Score. These expose almost all the functionality in VW with very similar performance characteristics and, very importantly, allow models trained by VW learners to be operationalized as web services on Azure. In the rest of this post, I will describe a few of the design decisions behind these two new modules, some of the implementation details, and work in progress to extend the functionality of other modules in Azure ML with VW.
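
To picture what feature hashing does, here is a toy Python sketch of the hashing trick (VW uses MurmurHash; Python's built-in hash stands in here purely for illustration, and note that it is randomized across processes):

def hash_features(tokens, bits=18):
    # map each token to one of 2**bits columns and accumulate counts
    vec = {}
    for tok in tokens:
        idx = hash(tok) % (1 << bits)
        vec[idx] = vec.get(idx, 0) + 1
    return vec

print(hash_features("the quick brown fox".split()))

Collisions are possible but rare enough in a large hash space, and the model never needs a dictionary of all feature names up front.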

Design of the module

Our primary target users for the new VW-based modules are data scientists who have already invested in VW for their ML tasks, be it in traditional areas such as classification and regression or more contemporary ones like topic modeling or matrix factorization. The VW-based modules allow such users to easily port the modeling phase of their workflows to Azure ML (taking full advantage of the powerful features and native performance of VW) and easily publish the trained model as operationalized services.

To this end, both modules expect their input datasets to be in the native text format supported by VW and the user of the modules to be familiar with the command line arguments necessary to perform the modeling task. Finally, today we only support reading training data from Azure Blobs.
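
For reference, each line of VW's text format carries a label, an optional tag, and namespaced features, along these lines (hypothetical tags and feature names):

1 'example_1 |features price:0.23 sqft:0.25 age:0.05
-1 'example_2 |features price:0.18 sqft:0.15 age:0.35

A binary run might pass arguments such as --loss_function logistic -b 24; a multi-class one-vs-all run like the one in Figure 1 below would instead use labels 1 through k and add something like --oaa k (again, illustrative settings rather than recommendations).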

Figure 1 illustrates a simple training experiment that creates a multi-class classifier using one-vs-all. When executed, the module produces a model that can be saved and used for creating a scoring workflow.

Figure 1. A sample training experiment using the Vowpal Wabbit Train module.

Figure 2 illustrates a sample experiment that scores a trained model produced by Vowpal Wabbit Train using data from any of the data ingress sources supported by Azure ML. The data to be scored must once again be in the VW text format, and the scored outputs from VW are returned as an Azure ML dataset.

Figure 2. A sample experiment to train and score a model using VW. 

Technical Details

Recognizing the importance of the features in VW and the potential for their use throughout the system, the Azure ML team invested in creating a rich C++/CLI wrapper around the underlying native APIs in VW. All functionality in VW (including ones that will be added in the future) can easily be called through our managed wrappers and exposed as modules in Azure ML or used by other parts of our ML infrastructure.

The Vowpal Wabbit Train module simply calls into the general-purpose VW wrapper. It downloads the training dataset in blocks from Azure (utilizing the high bandwidth between the worker roles executing the computations and the store) and streams it to the learners in VW. This strategy allows us to achieve training performance quite similar to what one might get from an on-premises machine. The resulting model is generally very compact thanks to the internal compression done by VW, and is copied back to the model store utilized by other modules in Azure ML.

The Vowpal Wabbit Score module works in a similar manner. The only difference is that the data to be scored typically comes in through a client of the published web service as opposed to the user’s Azure storage.

Limitations and further work

As mentioned above, our wrapper modules are geared towards existing users of VW, in order to enable them to easily on-board parts of their modeling workflow to Azure ML and take optimal advantage of the ability to publish and scale out web services backed by ML models. The modules therefore expect the data to be in VW’s native text format (rather than the dataset representation used by other modules in Azure ML). Further, the training data is directly streamed into VW from Azure for maximal performance and minimal parsing overhead, as opposed to other modules in Azure ML that pre-process the data to handle missing values and different data types such as numeric, categorical, text, date-time, etc. Therefore, the interoperability between the VW-based modules and other modules in Azure ML is currently somewhat limited. Over the coming months, we intend to expose selective functionality in VW, such as topic modeling, in a more turnkey manner that consumes Azure ML datasets and interoperates seamlessly with other modules.

Conclusions

Tools such as VW provide data scientists easy access to state-of-the-art ML algorithms that can churn through massive amounts of training data in a short amount of time. However, turning the resulting models into operationalized, scalable, reliable web services that can be used to drive business decisions remains a non-trivial problem. Azure ML reduces the process of publishing such web services to a few mouse clicks. The two new modules in Azure ML based on Vowpal Wabbit aim to give the data scientist the best of both worlds: the state-of-the-art performance and functionality of VW plus the ease of operationalization of Azure ML.

I hope you have a chance to try these new modules out and give us feedback so we can continue to improve.

Sudarshan

#SQLHaikusweeps – We have a winner!


Thank you for participating in our Twitter sweepstakes.

As always, the creativity of our community never fails to amaze. Congratulations to @dragonfurther for the winning haiku.

Notable mentions include entries from @John_Deardurff and @sdevanny.

We look forward to seeing everyone at PASS Summit 2014. Can’t make it in person? Don’t miss the live streaming of the keynotes on November 5th and 6th at www.passsummit.com.



Azure ML is Helping CMU Become More Energy Efficient


Posted by Vinod Anantharaman, head of business strategy, Microsoft Information Management and Machine Learning (IMML).

Buildings are powered by multiple systems such as heating, cooling, lighting, ventilation, security and more, each of which affects occupant comfort and energy consumption. Traditionally, each system comes with its own sensors, actuators and the like, and some of these may gather and analyze data specific to that particular system. Because of this silo-based approach to building management, there traditionally has not been a holistic, dashboard-like view of the operational efficiency of a building. This has made it challenging to accurately predict energy use or waste.

Based in Pittsburgh, Pennsylvania, Carnegie Mellon University (CMU) is a leading research university with over 12,000 students and 5,000 faculty and staff, and a birthplace of innovation since its founding in 1900. At CMU, the Center for Building Performance and Diagnostics is responsible for developing hardware and software solutions that improve the efficiency of campus buildings while achieving higher occupant comfort. 

 

CMU Center for Building Performance and Diagnostics ©Carnegie Mellon University. All rights reserved.

The Center saw an opportunity to create an integrated, automated system that could increase the energy performance of their buildings and deliver cost savings by predicting energy consumption patterns, detecting faults and taking actions in real-time. Such a system would anticipate heating and cooling needs and adjust thermostats accordingly, and it would alert building managers to repair or replace worn-out parts before they failed altogether.

In pulling together such a predictive analytics system, two of CMU’s primary requirements were:

  1. It had to be easy to implement; and

  2. It also had to be accessible to non-technical personnel.

The Solution
Working in partnership with OSIsoft, the Center created an integrated system to harness all its historical and current sensor data using the power of predictive analytics. Azure Machine Learning is one of a few major components in the CMU solution. The solution begins with an on-premises PI Server™ that collects sensor data from across the campus and forwards it, via Microsoft Azure-based PI Cloud Services™, to a PI Server running in Azure. There, an OSIsoft research tool cleanses, aggregates, shapes, and transmits the data in real time to an Azure repository, where it is accessed by Azure ML for predictive analytics. The predictive insights are then made accessible through Power BI, with predictions being stored in the PI Server for use by the building systems applications.

The solution was fast, easy, and inexpensive to set up and use. “We immediately began using Azure Machine Learning without having to prepare on-premises software; everything’s ready-to-use in the cloud,” says Bertrand Lasternas, a Researcher at the Center. “It’s significantly easier to use than other tools we’ve tried, and it fit seamlessly with the PI System and Microsoft cloud solution we already had."

Here are a couple of illustrative use cases involving the CMU solution:  

  • A specified building’s temperature needs to be brought up to 72 degrees at the start of business at 9 a.m. The heating system is typically engaged at 6 a.m. or, on warmer days, at 6:30 a.m. But that likely wastes energy and CMU wanted to use predictive analytics to identify the ideal time to start heating the building. Researchers aimed to predict the internal temperature of the building at 9 a.m. using a model that included recent internal and external temperature, anticipated solar radiation levels, and several other factors. Since anticipated solar radiation data was not available, researchers first had to predict this variable. They trained a solar radiation model using a boosted decision tree algorithm in Azure ML, tested the model to confirm its accuracy, and then used it in the internal temperature model to address the question of when to start the heating, resulting in predicted energy savings.

  • CMU also wanted to address the challenge of fault detection and diagnosis for components that are hidden from visual inspection because they are behind walls or under floors. By using Azure ML on the historical data gathered by the PI System, they are able to predict such faults, resulting in potential cost savings.

The Azure ML solution also fosters collaboration by allowing teams of researchers or graduate students to share workspaces with each other.

Benefits
Based on the experimental results, CMU researchers estimate their solution can cut energy costs by 20 percent. Discussions are underway to implement it campus-wide, where it could save several hundred thousand dollars annually. “The savings come both from reducing energy use and from being able to shift some energy use to hours of lower demand and cost,” says Lasternas.

The CMU researchers envision the PI System and Microsoft Azure supporting not just researchers but also the engineers and technicians who interact daily with building systems. For example, field service technicians could access the insights from predictive analytics on their tablets to check and update remote equipment before it fails. Smartphone notifications could alert engineers to energy demand spikes. Because the solution is scalable and cost-effective, it could be used at building complexes and public-utility systems that cannot be served by traditional solutions.

Customers across a swathe of industries are deploying enterprise-grade predictive analytics solutions using Microsoft Azure ML – you too can get started today.

At CMU, they are anticipating broad new uses for the solution they have built. “We see Azure Machine Learning and the PI System ushering in an era of self-service predictive analytics for the masses,” says Lasternas. “We can only imagine the possibilities.”

Microsoft sessions at PASS Summit 2014


Got SQL Server 2005 running on Windows Server 2003?  We have fantastic pre-con and general sessions to help you plan your upgrade and migration strategies.

Interested in the new Azure data services like Azure DocumentDB, Azure ML, and Azure Search?  We have awesome people lined up to give you all the details.

Want to know the nitty-gritty about Azure SQL Database? We’ve got you covered.

It is PASS Summit time and we are counting down the days. We have Microsoft experts from our Redmond campus, field experts flying in from Italy and the U.K., and we’re bringing customers to share their stories – just to name a few.  Check out the Microsoft sessions below and add them to your PASS Summit session builder along with great community sessions. 

7 Databases in 70 Minutes: A Primer for NoSQL in Azure, Lara Rubbelke and Karen Lopez

Analytics Platform System Deep Dive (APS), Paul Dyke

Analytics Platform System Overview (APS), Nicolle Whitman

Analyzing tweets with HDInsight, Excel and Power BI, Miguel Martinez and Sanjay Soni

Application Lifecycle Management for SQL Server database development, Lonny Bastien and Steven Green

Azure CAT: Azure Data Platform: Picking the right storage solution for the right problem, Kun Cheng, Rama Ramani, and Ewan Fairweather

Azure CAT: Azure SQL DB Performance Tuning & Troubleshooting, Sanjay Mishra, Kun Cheng, and Silvano Coriani

Azure CAT: Deep dive of Real world complex Azure data solutions, Lindsey Allen and Rama Ramani

Azure CAT: Running your Line of Business application on Azure Virtual Machine Services, Juergen Thomas

Azure CAT: SQL Server 2014 Gems, Shep Sheppard

Azure CAT: SQL Server 2014 In-Memory Customer Deployments: Lessons Learned, Michael Weiner and Stephen Baron

Azure Search Deep Dive, Pablo Castro

Azure SQL Database Business Continuity and Auditing Deep Dive, Nadav Helfman and Sasha Nosov

Azure SQL Database Overview, Bill Gibson and Sanjay Nagamangalam

Azure SQL Database Performance and Scale Out Deep Dive, Torsten Garbs and Michael Ray

BI Power Hour, Matt Masson and Matthew Roche

Building a Big Data Predictive Application, Nishant Thacker and Karan Gulati

Built for Speed: Database Application Design for Performance, Pam Lahoud

ColumnStore Index: SQL Server 2014 and Beyond, Sunil Agarwal and Jamie Reding

Connecting SAP ERP and Microsoft BI Platform, Sanjay Soni

Data-tier Considerations of Cloud-based Modern Application Design, Scott Klein

Deep Dive into Power Query Formula Language, Matt Masson and Theresa Palmer-Boroski

Deploying Hadoop in a Hybrid Environment, Matt Winkler

Deployment and best practices for Power BI for Office 365, Miguel Llopis

End-to-End Demos with Power BI, Kasper de Jonge and Sanjay Soni

HBase: Building real-time big data apps in the cloud, Maxim Lukiyanov

Improve Availability using Online Operations in SQL Server 2014, Ajay Jagannathan and Ravinder Vuppula

In-Memory OLTP in SQL Server 2014: End-to-End Migration, George Li

Interactive Data Visualization with Power View, Will Thompson

Introducing Azure Machine Learning, Raymond Laghaeian

Introduction to Azure HDInsight and Visual Studio customizations, Matt Winkler

Just in Time Data Analytics with SQL Server 2014, Binh Cao and Tomas Polanco

Leveraging SQL Server in Azure Virtual Machines Best Practices, Scott Klein

Life in the fast lane with Azure DocumentDB, Stephen Baron

Making the most of Azure Machine Learning end-to-end, Parmita Mehta

Managing 1 Million+ DBs-How Big Data is used to run SQL Azure, Conor Cunningham

Match the database to the data – from on prem to the cloud, Buck Woody

Microsoft Azure SQL Database – Resource Management, Mine Tokus

Migration and Deployment Principles for SQL Server in Azure VMs, Selcin Turkarslan

Polybase in the Modern Data Warehouse, Artin Avanes

Power BI Hybrid Data Access via Data Management Gateway, Luming Han and Mini Nair

Power View with Analysis Services Multidimensional Models, Kasper de Jonge

Real world Healthcare BI transformations in the cloud, Matt Smith and Michael Wilmot

SQL Server 2014 AlwaysOn (High Availability and Disaster Recovery), Luis Carlos Vargas Herring

SQL Server 2014 in-Memory OLTP - Memory/Storage Monitoring and Troubleshooting, Sunil Agarwal

SQL Server 2014 In-Memory OLTP Query Processing, Jos de Bruijn

SQL Server 2014 In-Memory OLTP Transaction Processing, Jos de Bruijn

SQL Server 2014: In-Memory Overview, Kevin Farlee

SQL Server Hybrid Features End to End, Xin Jin

SQL Server in Azure VM Roadmap, Luis Carlos Vargas Herring

To The Cloud, Infinity, & Beyond: Top 10 Lessons Learned at MSIT, Jimmy May

Upgrading and Migrating SQL Server, John Martin

What's New in Microsoft Power Query for Excel, Miguel Llopis

Who Dunnit? A Walk Around the SQL Server 2014 Audit Feature, Timothy McAliley and Michael Ray

 

Still want more? No problem. Check back November 5th for additional sessions and speakers.

Only 28 more days until PASS Summit. You won’t want to miss it!

Predict the 2014 U.S. Elections and more - at Microsoft Prediction Lab


Microsoft Prediction Lab lets you make predictions about upcoming events and view the combined predictions of the crowd!

  • Go ahead and predict every Senate, House and Gubernatorial race out there.
  • Weigh in on a range of other topics, e.g. foreign affairs, social issues, science, technology and more.

Your predictions get integrated into our crowd forecasts, and we will show how likely it is that certain candidates will win or that certain other events will occur. Challenge your friends to see who is best at forecasting the future - register here and participate today!


Distributed Cloud-Based Machine Learning


This post is authored by Dhruv Mahajan, Sundararajan Sellamanickam and Keerthi Selvaraj, Researchers at Microsoft’s Cloud & Information Services Lab (CISL) and at Microsoft Research.

Enterprises of all stripes are amassing huge troves of data assets, e.g. logs pertaining to user behavior, system access, usage patterns and much more. Companies will benefit enormously by using the power of cloud services platforms such as Microsoft Azure not merely to host such data or perform classic “look-in-the-rear-view mirror” BI, but by applying the power and scale of cloud-based predictive analytics. Using modern tools such as Azure Machine Learning, for instance, companies can obtain actionable insights about how the future of their businesses might evolve – insights that can give them a competitive edge.

Gathering and maintaining “big data” is becoming a common need across many applications. As data sizes explode, it becomes necessary to store data in a distributed fashion. In many applications, the collection of data itself is a decentralized process, naturally leading to distributed data storage. In such situations it becomes necessary to build machine learning (ML) solutions over distributed data using distributed computing. Examples of such situations include click-through rate estimation via logistic regression in the online advertising universe, or deep learning solutions applied to huge image or speech training datasets, or log analytics to detect anomalous patterns.

Efficient distributed training of ML solutions on a cluster, therefore, is an important focus area at the Microsoft Cloud & Information Services Lab (CISL, that’s pronounced “sizzle” :-)) to which the authors belong. In this post, we delve a bit into this topic, discuss a few related issues, and describe our recent research that tries to address some of them. Some of the details presented here are rather technical, but we attempt to explain the central ideas in as simple a manner as possible. Anybody interested in doing distributed ML on big data will gain by understanding these ideas, and we look forward to your comments and feedback too.

Choosing the Right Infrastructure

In a recent post, John Langford described the Vowpal Wabbit (VW) system for fast learning, where he briefly touched on distributed learning over terascale datasets. Most ML algorithms being iterative in nature, choosing the right distributed framework to run them is crucial.

Map Reduce and its open source implementation, Hadoop, are popular platforms for distributed data processing. However, they are not well-suited for iterative ML algorithms as each iteration has large overheads – e.g. job scheduling, data transfer and data parsing.

Better alternatives would be to add communication infrastructure such as All Reduce, which is compatible with Hadoop (as in VW), or to employ newer distributed frameworks such as REEF which support efficient iterative computation.

SQM

Current state-of-the-art algorithms for distributed ML such as the one in VW are based on the Statistical Query Model (SQM). In SQM, learning is based on doing some computation on each data point and then accumulating the resulting information over all the data points. As an example, consider linear ML problems where the output is formed by doing a dot product of a feature vector with the vector of weight parameters. This includes important predictive models such as logistic regression, SVMs and least squares fitting. In this case, at each iteration, the overall gradient of the training objective function is computed by summing the gradients associated with individual data points. Each node forms the partial gradient corresponding to the training data present in that node and then an All Reduce operation is used to get the overall gradient.
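
As a toy illustration of this pattern for logistic regression, here is a Python sketch in which the nodes are simulated in-process (a real deployment would exchange the partial gradients with an All Reduce primitive rather than a local sum):

import numpy as np

np.random.seed(0)
d, n_nodes = 5, 4
w = np.zeros(d)
# each "node" holds its own shard of the training data
shards = [(np.random.randn(100, d), np.random.randint(0, 2, 100))
          for _ in range(n_nodes)]

def partial_gradient(X, y, w):
    p = 1.0 / (1.0 + np.exp(-X @ w))  # per-example predictions on this shard
    return X.T @ (p - y)              # this shard's contribution to the gradient

n_total = sum(len(y) for _, y in shards)
for it in range(100):  # each iteration costs one All Reduce round
    grad = sum(partial_gradient(X, y, w) for X, y in shards)  # the "All Reduce"
    w -= 0.5 * grad / n_total  # global gradient step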

Communication Bottleneck

Distributed computing often faces a critical bottleneck in the form of a large ratio of computation to communication bandwidth. E.g. it is quite common to see communication being 10x to 50x slower than computation.

Let Tcomm and Tcomp denote the per-iteration times for communication and computation, respectively. The overall cost of an iterative ML algorithm can then be written as:

Toverall = (Tcomm + Tcomp) * #iterations

Tcomp typically decreases linearly with an increasing number of nodes, while Tcomm increases or remains constant (in the best implementations of All Reduce). ML solutions involving Big Data often have a huge number of weight parameters (d) that must be updated and communicated between the computing nodes of a cluster in each iteration. Moreover, there are other steps, like the gradient computation in SQM, that also require O(d) communication. The situation is even worse in Map Reduce, where each iteration requires a separate Map Reduce job. Hence, Tcomm is large when d is large. SQM does not place sufficient emphasis on the inefficiency associated with this.
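
To put hypothetical numbers on this: with Tcomp = 1 second, Tcomm = 20 seconds and 500 iterations, Toverall is about 2.9 hours, nearly all of it spent communicating; an algorithmic change that halves the number of iterations buys far more than one that halves Tcomp.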

Overcoming the Communication Bottleneck

Our recent research addresses this important issue. It is based on the following observation: Consider a scenario in which Tcomm, the time for communicating the weight parameters between nodes, is large. In each iteration, what happens with a standard approach such as SQM is that Tcomp, the time associated with computations within each node, is a lot less than Tcomm. So we ask the following question: Is it possible to modify the algorithm and its iterations in such a way that Tcomp is increased to come closer to Tcomm, and, in the process, make the algorithm converge to the desired solution in fewer iterations?

Of course, answering this question is non-trivial since it requires a fundamental algorithmic change.

More Nitty-Gritty Details

Consider the ML problem of learning linear models. In our algorithm, the weight updates and gradients in the nodes are shared in a way similar to the SQM based method. However, at each node, the gradient (computed using All Reduce) and the local data in the node are used in a non-trivial way to form a local approximation of the global problem. Each node solves its approximate problem to form local updates of weight variables. Then the local updates from all nodes are combined together to form a global update of the weight variables. Note that solving the approximate problem leads to increased computation in each node, but it does not require any extra communication. As a result Tcomp increases and, since Tcomm is already high, the per-iteration cost is not affected significantly. However, since each node now is solving the approximate global view of the problem, the number of iterations needed to solve the problem is reduced significantly. Think of a case where the amount of data is so large that the data present within each node is itself sufficient to do good learning. For this case, the approximate problem formed in each node is close to the global problem; the result is that our algorithm requires just one or two iterations while SQM based methods need hundreds or even thousands of iterations. In addition, our approach is flexible and allows a class of approximations rather than a specific one. In general, our algorithm is almost always faster than SQM and, on average, about two to three times faster.

One could also think of distributing the weight vector over many cluster nodes and setting up the distributed data storage and computation in such a way that all updates for any one weight variable happen only in one cluster node. This turns out to be attractive in some situations, for example when one is interested in zeroing out irrelevant weight variables in linear ML problems or for doing distributed deep net training. Here again, we have developed specialized iterative algorithms that do increased computation in each node while decreasing the number of iterations.

Evaluation

We focused above on algorithms suited to communication-heavy situations. But not all problems solved in practice are of this nature. For general situations, there exists a range of good distributed ML algorithms in the recent academic literature, but a careful evaluation of these methods has not been performed. The best methods are finding their way into cloud ML libraries.

Automating Distributed ML to Suit User Needs

There is also another important side to the whole story. Users of distributed ML on the cloud have a variety of needs. They may be interested in minimizing the total solution time, or the cost in dollars associated with the solution. Users may be willing to sacrifice accuracy a bit while optimizing the above mentioned variables. Alternatively, they may be keen to get the best accuracy irrespective of time and cost. Given a problem description, a varied set of such user specifications, and details of the system configuration available for use, it is important to have an automatic procedure for choosing the right algorithm and its parameter settings. Our current research focuses on this aspect.

Automated distributed ML solutions will be one of the important areas/considerations for Azure ML as we evolve our product and expand our offering in the future.

Dhruv, Sundar and Keerthi

Microsoft announces real-time analytics in Hadoop and new ML capabilities in Marketplace


This morning at Strata + Hadoop World, Microsoft announced the preview of Apache Storm clusters inside HDInsight as well as new machine learning capabilities in the Azure Marketplace.

Apache Storm is an open source project in the Hadoop ecosystem which gives users access to an event-processing analytics platform that can reliably process millions of events. Now, users of Hadoop can gain insights into events as they happen, in real time. Learn more here.

As part of Strata, Microsoft partner Hortonworks announced that the next version of their Hadoop distribution, HDP 2.2, will include capabilities to orchestrate data from on-premises to Azure. This will allow customers to back up their on-premises data or elastically scale out using the power of the cloud.

Finally, Microsoft is offering new machine learning capabilities as part of the Azure Marketplace. Customers can now access ML web services that enable scenarios like anomaly detection, recommendations and fraud detection, as well as a set of R packages.

Read more of Microsoft’s Strata announcements on the Official Microsoft Blog


Web Services and Marketplaces Create a New Data Science Economy


This blog post is authored by Joseph Sirosh, Corporate Vice President of Machine Learning at Microsoft.

Yesterday, at Strata + Hadoop World, we announced the expansion of our data services with support of real-time analytics for Apache Hadoop in Azure HDInsight and new machine learning (ML) capabilities in the Azure Marketplace. Today, I would like to expand on the new ML capabilities that we announced and share how this is an important step in our journey to jump-start the new data science economy. I’ll also be speaking more about this in my keynote presentation tomorrow at Strata.

Data scientists and their management are often frustrated by just how little of their work makes it into production deployments. Consider this hypothetical, though not uncommon, scenario. A data scientist and his team are asked to create a new sales prediction model that can be run whenever needed. The data scientists perfect the sales model using the popular statistical modeling language R. The new model is presented to management, who want to get the model up and running right away as a web app and as a mobile client. Unfortunately, engineering is unable to deploy the model as they don’t have R, and the only option is to convert it all to Java - something that will take months to get up and running. So the data scientists end up preparing a batch job to run R code and mail reports on a daily basis, leaving everyone unsatisfied.

Well, now there’s a better way, thanks to Azure Machine Learning.

We built Azure ML to empower data science with all the benefits of the cloud. Data scientists can bring R code and use Microsoft's world class ML algorithms in our web-based ML Studio. No software installs required for analysis or production – our browser UI works on any machine and operating system. Teams can collaborate in the cloud, share projects, experiment with world-class algorithms and include data from databases or blob storage. They can use enormous storage and compute resources in the cloud to develop the best models from their data, unrestrained by server or storage capacity.

Perhaps best of all, with just one-click, users can publish a web service with their data science code embedded in it. Data transformations and models can now run in a web service in the cloud – fully managed, secure, reliable, available, and callable from anywhere in the world.

These web service APIs can be invoked from Excel, as shown in this video, by using this simple plug-in. Now, instead of emailing reports, users can surprise management with cloud-hosted apps that are built in hours. Engineering can hook up APIs to any application easily and even create custom mobile apps. Users can publish as many web services as they like, test multiple models in production and update models with new data. The data science team just became several times more productive and engineering is happy because integration is so easy.
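
As an illustration, invoking such a published web service is a plain authenticated HTTP call. The sketch below uses Python's standard library; the endpoint URL, API key and input schema are placeholders to be replaced with the values shown on a service's API help page:

import json
import urllib.request

# placeholders: copy the real values from your service's API help page
url = ("https://<region>.services.azureml.net/workspaces/<workspace>"
       "/services/<service>/execute?api-version=2.0")
api_key = "<your-api-key>"

body = json.dumps({
    "Inputs": {"input1": {"ColumnNames": ["feature1", "feature2"],
                          "Values": [["1.0", "2.0"]]}},
    "GlobalParameters": {}
}).encode("utf-8")

request = urllib.request.Request(url, body, {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + api_key,
})
print(urllib.request.urlopen(request).read().decode("utf-8"))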

But wait, there's still more.

Imagine a data scientist hits upon that perfect idea for an intelligent web service that everyone else in the world should be building into their apps. Maybe it is a great forecasting method, or a new churn prediction technique, or a novel approach to pattern recognition. Data scientists can now build that web service in Azure ML, publish the ML web service on the Azure Marketplace and start charging for it in over one hundred currencies. Published APIs can be found via search engines. Anyone in the world can pay and subscribe to them and use them in their apps.

For the first time, data scientists can monetize their know-how and creativity just as app developers do. When this happens, we start changing the dynamics of the industry – essentially, data scientists are able to “self-publish” their domain expertise as cloud services which can then be made accessible to billions of users via smartphone apps that tap into those services.

The Azure Marketplace already has an emerging selection of such services. In just a couple of weeks, four of our data scientists published over 15 analytics APIs into the marketplace by wrapping functions from CRAN. Among others, these include APIs for forecasting, survival analysis and sentiment analysis.

Our marketplace has much more than basic analytics APIs. For example, we went and built a set of finished end-to-end ML applications, all using Azure ML, to solve specific business needs. These ML apps do not require a data scientist or ML expertise to use – the science is already baked into our solution. Users can just bring their own data and start using them. These include APIs for recommendations, items that are frequently bought together as well as anomaly detection to spot anomalous events in time-series data such as server telemetry.

A similar anomaly detection API is used by Sumo Logic, a cloud-based machine data analytics company. They have collaborated with Microsoft to bring metric-based anomaly detection capability to their customers. Our metric-based anomaly detection perfectly complements Sumo Logic's structure-based anomaly detection capabilities. Any Sumo Logic query which results in a numerical time-series now has a special “metric anomaly detection” button which sends the pre-aggregated time series data to Azure ML for analysis. The data is then annotated with labels provided by the Azure ML service indicating unusual spikes or level shifts. Sumo Logic is now offering this optional integration in a limited beta release.

Third parties too are starting to publish APIs into our marketplace. For instance, Versium, a predictive analytics startup, has published three sophisticated customer scores, all based on public marketing data: Giving Score (predicts customer propensity to donate), Green Score (predicts customer propensity to make environmentally conscious purchase decisions) and Wealth Score (helps companies estimate the net worth of customers and prospects). Versium offers these scores by analyzing and associating billions of LifeData® attributes and building predictive models using Azure ML.

Our marketplace also hosts a number of other exciting APIs that use ML, including the Bing Speech Recognition Control, Microsoft Translator, Bing Synonyms API and Bing Search API.

By bringing ML capabilities to the Azure Marketplace and making it easy for anyone to access, we are liberating data science from its confines. This two-minute video recaps how:

Get going today – sign up for Azure ML and try out some of our easy to use samples.

A new future for machine learning is being born in the cloud. 

Joseph
Follow me on Twitter.

Report Builder of SQL Server 2008 R2 Service Pack 3 does not launch.

Dear Customers we have discovered a problem with Report Builder that ships with SQL Server 2008 R2 Service Pack 3. If you installed SQL Server 2008 R2, have upgraded it to Service Pack 2 and then applied Service Pack 3, then Report Builder will...(read more)

Expanding the Availability and Deployment of Hadoop Solutions on Microsoft Azure


Author: Shaun Connolly
VP Corporate Strategy - Hortonworks

Data growth threatens to overwhelm existing systems and bring those systems to their knees. That’s one of the big reasons we’ve been working with Microsoft to enable a Modern Data Architecture for Windows users and Microsoft customers.

A history of delivering Hadoop to Microsoft customers

Hortonworks and Microsoft have been partnering to deliver solutions for big data on Windows since 2011. Hortonworks is the company that Microsoft relies on for providing the industry’s first and only 100% native Windows distribution of Hadoop, as well as the core Hadoop platform for Microsoft Azure HDInsight and the Hadoop region of the Microsoft Analytics Platform System.

This week we made several announcements that further enable hybrid choice for Microsoft-focused enterprises interested in deploying Apache Hadoop on-premises and in the cloud.

New capabilities for Hadoop on Windows

Hortonworks has announced the newest version of the Hortonworks Data Platform for Windows - the market’s only Windows-native Apache Hadoop-based platform. This brings many new innovations for managing, processing and analyzing big data, including:

  • Enterprise SQL at scale
  • New capabilities for data scientists
  • Internet of Things with Apache Kafka
  • Management and monitoring improvements
  • Easier maintenance with rolling upgrades

Automated cloud backup for Microsoft Azure
Data architects require Hadoop to act like other systems in the data center, and business continuity through replication across on-premises and cloud-based storage targets is a critical requirement.  In HDP 2.2, we extended the capabilities of Apache Falcon to establish an automated policy for cloud backup to Microsoft Azure.  This is an important first step in a broader vision to enable seamlessly integrated hybrid deployment models for Hadoop.

Certified Hadoop on Azure Infrastructure as a Service (IaaS)

Increasingly, the cloud is an important component of big data deployments. On Wednesday, October 15, we announced that the Hortonworks Data Platform (HDP) is the first Hadoop platform to be Azure certified to run on Microsoft Azure Virtual Machines. This gives customers new deployment choices for small and large deployments in the cloud. With this new certification, Hortonworks and Microsoft make Apache Hadoop more widely available and easier to deploy for data processing and analytic workloads, enabling enterprises to expand their modern data architecture.

Maximizing Hadoop Deployment choice for Microsoft Customers

These latest efforts further expand the deployment options for Microsoft customers while providing them with complete interoperability between workloads on-premises and in the cloud. This means that applications built on-premises can be moved to the cloud seamlessly. Complete compatibility between these infrastructures gives customers the freedom to use the infrastructure that best meets their needs. You can back up data where the data resides (geographically) and provide the flexibility and opportunity for others to do Hadoop analytics in the cloud (globally).

We are excited to be the first Hadoop vendor to offer Hadoop on Azure Virtual Machines, and we look forward to continuing our long history of working with Microsoft to engineer and offer the most flexible and easy-to-use deployment options for big data available, further increasing the power of the Modern Data Architecture.


Video - Joseph Sirosh Keynote: "A New Data Science Economy" at Strata + Hadoop 2014


Be sure to check out Joseph's keynote talk below, under 10 minutes long, summarizing how, in the emerging new Data Science Economy, data scientists are able to monetize their skills - at scale, in the cloud - just like app developers have been able to do for several years now.

Joseph blogged about this topic yesterday, for those of you interested in more details on how Azure ML makes this possible.

ML Blog Team

[Announcement] ODataLib 6.8.1 Release


We are happy to announce that ODataLib 6.8.1 has been released. Detailed release notes are listed below.

Bug Fixes

  • [GitHub issue #3] Fix a bug where string function parameters containing specific characters were handled incorrectly by the URI parser.

  • Fix a bug where OData Client for .NET failed to serialize and materialize null values in collections of complex or primitive types.

New Features

  • OData Client for .NET now supports Edm.TimeOfDay/Edm.Date.

  • OData Client for .NET can now take an entity or a collection of entities as a parameter of an action.

 

Call to Action

You and your team are welcome to try out this new version if you are interested in the new features and fixes above. For any feature request, issue or idea, please feel free to reach out to us at odatafeedback@microsoft.com.

 
