
What Beer with that Burger? Ziosk Can Help!


We continue our series of posts on how Microsoft customers are gaining actionable insights on their data through the power of advanced analytics – at scale and in the cloud.

A new tabletop platform from Ziosk will use Microsoft big data and cloud technologies to predict restaurant guests’ preferences – things such as a personalized recommendation for a dish or wine pairing perhaps – and serve them up as part of a better dining experience. The company’s goal is to create personalized experiences that result in happier guests and more business for restaurants.

Working with Artis Consulting, a Microsoft data analytics partner, Ziosk is building out its next generation data infrastructure, including new predictive analytics capability. Their solution uses a range of Microsoft cloud technologies including Azure Machine Learning, Azure Data Factory, Azure HDInsight and Power BI.
 

“The data we’re talking about has always been there, but until now there’s been no way to capture it, analyze it, and use it to drive more business and more effective operations. We’re using Azure and Power BI to change all that.”

Kevin Mowry, Chief Software Architect

Learn more about this story here, or by clicking the graphic below.

 

ML Blog Team


Free Webinar Tomorrow: The Cloud Data Science Process


At this webinar, Microsoft Data Scientists will demonstrate the end-to-end data science process in the cloud using Python, R, a range of Azure cloud technologies, SQL Server, IPython Notebook and more.

The process starts from raw data and ends at the point where we have a ready-to-consume web service API for an ML model that can predict new observations. We will use a public dataset – the NYC Taxi Trips Data – for this exercise. There will be time for Q&A at the end.

To register for this webinar click here or on the graphic below.


 

ML Blog Team

Cumulative Update #15 for SQL Server 2012 SP1

Dear Customers, The 15th cumulative update release for SQL Server 2012 SP1 is now available for download at the Microsoft Support site. Cumulative Update 15 contains all the SQL Server 2012 SP1 hotfixes which have been available since the initial...(read more)

Cumulative Update #5 for SQL Server 2012 SP2

Dear Customers, The 5th cumulative update release for SQL Server 2012 SP2 is now available for download at the Microsoft Support site. Cumulative Update 5 contains all hotfixes which have been available since the initial release of SQL Server 2012...(read more)

Row-Level Security for Middle-Tier Apps – Using Disjunctions in the Predicate


In Building More Secure Middle-Tier Applications with Azure SQL Database using Row-Level Security, we discussed how CONTEXT_INFO could be used for middle-tier based RLS predicate definitions.

On many occasions it is necessary to introduce a disjunction into the predicate definition for scenarios that need to distinguish between users whose queries are filtered and users who must not be subject to filtering (e.g. administrators), and such disjunctions can significantly affect performance.

The reason for this performance impact is that, once the security policy is applied, the predicate function is applied to the query as a predicate, and because of the disjunction the query may result in a scan rather than a seek. For details on the difference between scans and seeks, I recommend reading Craig Freedman’s “scans vs. seeks” article.

We are working on trying to optimize some of these scenarios for RLS usage, but we also know we may not be able to address all possible scenarios right away. Because of that, we would like to share an example on how to improve performance under similar circumstances on your own.

The scenario we will analyze is a slight modification of the one from the previous RLS blog post, with one addition: the application needs to allow a super-user/administrator to access all rows.

The way we identify the super-user in our application is by not setting CONTEXT_INFO to any value (i.e. CONTEXT_INFO returns null). So we decide to create a new predicate function and alter the security policy to use it:

CREATE FUNCTION [rls].[fn_userAccessPredicate_with_superuser](@TenantId int) 
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_accessResult
WHERE DATABASE_PRINCIPAL_ID() = DATABASE_PRINCIPAL_ID ('AppUser')
AND
( CONVERT(int, CONVERT( varbinary(4), CONTEXT_INFO())) = @TenantId
OR CONTEXT_INFO() is null )
GO

ALTER SECURITY POLICY [rls].[tenantAccessPolicy]
ALTER FILTER PREDICATE [rls].[fn_userAccessPredicate_with_superuser]([TenantId]) on [dbo].[Sales]
GO


Unfortunately, this seemingly simple change triggers a regression in your application’s performance, and you decide to investigate by comparing the plan for the new predicate against the old one.

 
Fig 1. Plan when using [rls].[fn_userAccessPredicate] as a predicate.


 
Fig 2. Plan when using [rls].[fn_userAccessPredicate_with_superuser] as a predicate.

And after the analysis, the reason seems obvious: the disjunction you just added is transforming the query from a seek to a scan.  

You also realize that this disjunction has a peculiarity: one side can be satisfied with a seek (i.e. TenantId = value), while the other side (the administrator case) forces a scan. In this case it may be possible to get better performance by transforming both sides of the disjunction into seeks.

How to address this problem? One possibility in a scenario like this one is to transform the disjunction into a range. How would we accomplish it? By transforming the notion of null into a range that encompasses all values.

First, we alter the security policy to use the older version of the predicate; after all, we don’t want to leave our table unprotected while we fix the new one:

ALTER SECURITY POLICY [rls].[tenantAccessPolicy]
ALTER FILTER PREDICATE [rls].[fn_userAccessPredicate]([TenantId]) on [dbo].[Sales]
GO

Then we create a couple of functions that will help us define the min and max for our range based on the current state of CONTEXT_INFO. Please notice that these functions will be data type-specific:

-- If context_info is not set, return MIN_INT, otherwise return context_info value as int
CREATE FUNCTION [rls].[int_lo]() RETURNS int
WITH SCHEMABINDING
AS BEGIN
RETURN CASE WHEN context_info() is null THEN -2147483648 ELSE convert(int, convert(varbinary(4), context_info())) END
END
GO

-- If context_info is not set, return MAX_INT, otherwise return context_info value as int
CREATE FUNCTION [rls].[int_hi]() RETURNS int
WITH SCHEMABINDING
AS BEGIN
RETURN CASE WHEN context_info() is null THEN 2147483647 ELSE convert(int, convert(varbinary(4), context_info())) END
END
GO

And then we proceed to redefine the predicate function and security policy using a range:

-- Now rewrite the predicate
ALTER FUNCTION [rls].[fn_userAccessPredicate_with_superuser](@TenantId int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_accessResult
WHERE DATABASE_PRINCIPAL_ID() = DATABASE_PRINCIPAL_ID ('AppUser') -- the shared application login
AND
-- tenant info within the range:
-- If context_info is set, the range will point only to one value
-- If context_info is not set, the range will include all values
@TenantId BETWEEN [rls].[int_lo]() AND [rls].[int_hi]()
GO

-- Replace the predicate with the newly written one
ALTER SECURITY POLICY [rls].[tenantAccessPolicy]
ALTER FILTER PREDICATE [rls].[fn_userAccessPredicate_with_superuser]([TenantId]) on [dbo].[Sales]
GO


To finalize let’s look at the new actual execution plans:

 
Fig 3. Plan when using [rls].[fn_userAccessPredicate_with_superuser] as a predicate.

This new function allows the predicate to be evaluated as a single range in both circumstances: when CONTEXT_INFO is set the range covers exactly one TenantId value, and when it is not set the range covers all values (“between MIN_INT and MAX_INT”). Either way, the query optimizer can take advantage of the index on TenantId.

NOTE: When you test this functionality with a small table, you may see a scan instead of a seek even though you have a nonclustered index on the TenantId column. The reason is that the query optimizer may calculate that, for a particular table, a scan is cheaper than a seek. If you hit this behavior, try adding “WITH (FORCESEEK)” to your SELECT statement to hint to the optimizer that a seek is preferred.

Obviously this is not the only scenario where you may need to rewrite a security predicate to improve performance, and this is certainly not the only workaround, but hopefully this example gives you a pattern to follow for similar scenarios and ideas for others.

To conclude, I would like to reiterate that we are currently investigating how to improve performance on predicates similar to the one I showed here with a disjunction being used to distinguish between filtered queries and cases where a user must not be subject to filtering. We will update you with news on the potential solution once it becomes available.

Laundry Innovator Improves Customer Service Using Cloud Analytics


We continue our series of posts on how Microsoft customers are gaining actionable insights on their data by operationalizing ML and advanced analytics – at scale and in the cloud.

WASH Multifamily Laundry Systems provides outsourced laundry services to apartment buildings, university housing, motels, hotels and more. With an installed base of over 500,000 machines in 75,000 locations, over 5 million people depend on WASH to do their laundry every week. Yet the real product at WASH is not so much laundry as it is world-class technology, logistics and superior customer service.

On the one hand, WASH embraces its family-owned legacy, culture and dedication to service excellence; on the other, it keeps pace with changing times. They take pride in being a metric-driven company. They have pioneered new advances in laundry room technology and found ways to modernize systems to make their employees more efficient and serve their customers’ needs better.

A core foundational element of the technology stack at WASH is Microsoft Dynamics. All key aspects of the business – customers, equipment, locations, payees and more – are captured in Dynamics. Over the last ten years WASH has dramatically driven productivity in the organization by standardizing on Dynamics as its single enterprise-wide “source of truth”, one in which all key assets, customers, transactions and much more get captured and shared throughout the organization.

Over the last two years, WASH has started to transform the way it works with cloud-enabled technology. Specifically, through the use of Azure and Office 365, WASH is getting out of the business of worrying about server and networking infrastructure, wiring, and packaged software installation and maintenance, and is instead focusing on its core business so it can zoom further ahead to the next generation of service delivery. Play the video below to learn more:

Using Power BI, for instance, their finance and operations teams are able to take information out of data warehouses and pivot, slice and dice that data and share it via drag and drop – all without involving IT, which they find liberating. Power BI also lets them access public data sets and marry them with internal data – for instance, demographics, gas prices and location-specific factors such as the weather, which may influence usage.

To help improve customer service WASH is tapping into the power of Azure Machine Learning to unlock even more insights into their business on an automated basis and – importantly – integrate those insights into their everyday applications. For instance, they are eager to discover problems in advance through the power of predictive analytics. It’s critical that a field service technician arrives with the right parts in his or her truck, and that they fix the problem the very first time. WASH also wants to determine, when a service call is open, whether or not a real problem will actually be discovered. For instance, in some situations, service technicians get dispatched but there are no underlying problems found. If the system dynamically determines that there’s a likelihood of there being no real underlying problem, WASH would keep the customer on the phone a little bit longer to walk them through their issue and solve the problem remotely.

As Adam Coffey, President and CEO of WASH, says:

“I need our systems to think. I need them to learn, and I need them to present issues and problems and anomalies to the employees, to the managers. One of the keys to our success is to put more of our organization into the cloud so that we can leverage the enormous capabilities that Microsoft brings to the table.”

  

ML Blog Team

Thomas LaRock Invites You to Attend the PASS Business Analytics Conference 2015


This is a guest blog post from Thomas LaRock, the President of PASS (Professional Association for SQL Server)

It’s no secret that the role of data in the IT industry, in business, and in the world at large is changing at a rapid pace. As technology continues to become a more integrated and integral part of our lives, the value of data continues to rise.

At PASS we have a 16-year history of empowering IT professionals who use Microsoft data technologies. The SQL Server community is the largest and most engaged group of data pros in the world. PASS volunteers and the PASS Board of Directors work together to help the PASS community succeed in connecting, sharing, and learning from one another. A big part of that effort is keeping an eye on the future of the data profession.

What we see is that data analytics is the next frontier for professionals passionate about data. The growth of Big (and Little) Data, the advent of cloud computing, and advances in machine learning are all areas that present challenges for our community. Data analysts, data scientists, and line-of-business managers are in high demand as organizations realize the potential of collecting and understanding the wealth of information that is available from a variety of sources.

PASS is dedicated to helping our members harness the technologies, skills, and networks that are the foundation of solid and successful careers. We believe that keeping up with industry advances is a vital skill for all data professionals. Setting and achieving new goals as well as learning new ways of working with data is a must.

Whether you’re coming from a background in SQL Server, business intelligence, or social media there are specific cornerstones of turning all this data into something that can benefit your organization. We call this the “analyst’s journey.”

One such cornerstone is data discovery and integration. We want our members to be aware of the latest technologies in collecting, modeling, and preparing data for analysis. Next is data analysis and interpretation. We want to help our members understand the techniques and tools that enable sophisticated analysis, prediction, and optimization. Then there’s visualization: the creative side of things, where we get into report and dashboard design.

As with any career another key skillset is communication. The people who analyze and work with data are in the best position to help gain executive buy-in for data-driven business decisions. For years PASS has been the leader in helping data professionals improve their communication and soft skills.

One way in which we’re reaching out to those who want to learn more about analytics is the PASS Business Analytics Conference. This premier event brings together a stellar lineup of business and analytics speakers, including our keynote speakers Carlo Ratti and BI Brainz founder Mico Yuk. We have created a series of webinars and a Meet the Expert interview series to give people an idea of what the conference will offer. We also have replays from last year’s conference, and we have hours of training available through our Virtual Chapters.

We’re excited about data and analytics and we’re hearing from more and more SQL Server pros who share that excitement.

It’s a wonderful time to be a data professional.

See you in Santa Clara!

Thomas LaRock

President, PASS

PASSBAConference.com

Azure ML: Now With Even More Python!


This post is authored by Shahrokh Mortazavi, Partner Director of Program Management, Microsoft Azure Machine Learning

Hello again Python enthusiasts!  In a previous post I discussed how PTVS can be used as a powerful Data Science workbench. I'm very excited to talk about two important new Azure ML features for Python users:

Azure ML Studio Now Supports Python 

As you know, the Studio already supported running R scripts. You now have the same capability with Python, backed by its rich ecosystem of libraries. Simply type or paste in your Python script and it will be run under CPython 2.7 (64bit) with access to the Anaconda Distro.
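For orientation, here is a minimal sketch of what such a script can look like. The Execute Python Script module expects an entry-point function named azureml_main that receives its input ports as pandas data frames and returns a data frame on its output port; the column names below are hypothetical.

import pandas as pd

# Entry point invoked by the Execute Python Script module;
# dataframe1/dataframe2 correspond to the module's two optional input ports.
def azureml_main(dataframe1 = None, dataframe2 = None):
    # Example transformation on hypothetical Iris-style columns:
    # drop incomplete rows and add a derived column.
    df = dataframe1.dropna(subset=["sepal_length"])
    df["sepal_area"] = df["sepal_length"] * df["sepal_width"]
    # Whatever is returned here is exposed on the module's output port.
    return df,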

Azure ML Python SDK 

This SDK provides programmatic access to your Experiments and Datasets in Azure ML. Thus far, these were available only via the Studio, but you can now access, manipulate and upload these via the SDK. 

Let's look at a simple scenario where you can author some Python code, debug it, use it as script in the Studio and use IPython to visualize some intermediate data.

We'll be using the Iris dataset that's already available on AzureML: 

Next I will click on “Generate Data Access Code” to get a Python snippet which enables secure access to my experiments and data: 
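The generated snippet looks roughly like the following – a sketch based on the Azure ML Python client library, with placeholder workspace credentials and dataset name:

from azureml import Workspace

# Placeholder values - the real ones are filled in by "Generate Data Access Code"
ws = Workspace(workspace_id='your-workspace-id',
               authorization_token='your-authorization-token')
ds = ws.datasets['Iris Two Class Data']

# Materialize the dataset as a pandas DataFrame for local exploration and debugging
frame = ds.to_dataframe()
print(frame.head())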

Here I have PTVS up, the access code pasted in, with the debugger at a breakpoint so I can inspect the data. Note that you can use any Python IDE or environments of your choice, including IPython: 

Here you can enter the relevant scikit-learn data processing or modeling code as needed. With the code verified, let us Alt-Tab to the Studio and run the code on Azure ML. 

I've created a simple experiment to grab the Iris data to use with my debugged Python script:

The “Execute Python Script” node is where I’ve added my Python code (just as you’ve done with R before).  I’ve also added a “Convert to CSV” node so it can be read by the Python SDK and converted into a Pandas dataframe.    

Now I would like to take a look at my data while it's in flight between the DAG nodes to do some data debugging. I’ll right-click to get my Data Access code again.  For this exercise, I'll quickly fire up IPython and use Bokeh to visualize the data. Note that your instance of IPython could be anywhere – local, in the cloud, console or notebook: 
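As an illustration, a quick Bokeh cell in that notebook might look like the sketch below; the column names are hypothetical, and frame is the pandas DataFrame obtained with the data access code above.

from bokeh.plotting import figure, show, output_notebook

output_notebook()  # render the plot inline in the IPython notebook

# Hypothetical Iris columns pulled from the intermediate dataset
p = figure(title="Iris: sepal length vs. petal length")
p.circle(frame["sepal_length"], frame["petal_length"], size=6, alpha=0.5)
show(p)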

Conclusion 

Azure ML now does Python! This is a major step forward, as the two main languages used in data science – R and Python – are now fully supported. You can use scikit-learn, pybrain, statsmodels, pandas, bokeh and more to do a variety of data science tasks in an easy-to-use language. Additionally, you can use the SDK to download, upload or enumerate your data and experiments, which enables manipulation and visualization programmatically or interactively from PTVS, IPython, or any other environment.

For further documentation on using Python with Azure ML, please refer to the resources below:

Official docs for Azure ML Python Client Library

Official docs for the Execute Python Script module

Shahrokh


DacSamples Moves from CodePlex to GitHub


The DacSamples project is moving from CodePlex to GitHub under the new name /Microsoft/DACExtensions/. As part of the move the team is adding a new extension to the DacFx API to allow easier usage of the public model.

Licensing

The new GitHub project, Microsoft/DACExtensions, is licensed under the MIT License, as this is the preferred license for new projects in the Microsoft GitHub organization. For those wanting to use the existing code under the existing Apache 2.0 license, the DacSamples CodePlex project will be left up; however, all new additions will be made only to the GitHub project.

Extended Public Model

We have received lots of feedback about the usability and discoverability of the existing public API. The overwhelming sentiment is that the current API, which is very similar to using reflection in .NET, is too cumbersome to use, and the notion of Referenced and Referencing relationships is difficult to understand. The good news is that the metadata provided by the public API has most of the information needed to create a strongly-typed API. To demonstrate the richness of the model metadata, we have created a set of T4 templates that generate a strongly-typed API so it can be easily incorporated into custom Deployment Contributors (DeploymentPlanModifier) as well as custom Source Code Analysis Rules (SqlCodeAnalysisRule) or other custom applications.

The new strongly-typed model provides much better discoverability and aligns better with modern API designs. Along with the strongly-typed properties and references, the strongly-typed API provides interfaces for each SQL Server version, allowing users to program against a specific version such as SQL Server 2014 or Microsoft Azure SQL Database. The following examples illustrate how to use the new strongly-typed API and how to leverage the version-specific interfaces:

 

Existing API Usage:

TSqlModel model = new TSqlModel(SqlServerVersion.Sql120, new TSqlModelOptions() { });

// Create the Identifier for the dbo.users table
ObjectIdentifier tableId = new ObjectIdentifier("dbo", "users");

// Query the model for the dbo.users table
// Note the return type is the generic TSqlObject, not a Table object
TSqlObject table = model.GetObject(Table.TypeClass, tableId, DacQueryScopes.UserDefined);

// Get all the columns that do not support NULL values
// Note the use of GetProperty and the explicit cast
IEnumerable<TSqlObject> columns = table
    .GetReferenced(Table.Columns)
    .Where(c => !((bool)c.GetProperty(Column.Nullable)));

 

New Strongly-Typed API Usage:

 

TSqlTypedModel model = new TSqlTypedModel(SqlServerVersion.Sql120, new TSqlModelOptions() { });

// Create the Identifier for the dbo.users table
ObjectIdentifier tableId = new ObjectIdentifier("dbo", "users");

// Query the model for the dbo.users table
// Note that the return type is TSqlTable, not TSqlObject
TSqlTable table = model.GetObject<TSqlTable>(tableId, DacQueryScopes.UserDefined);

// Get all the columns that do not support NULL values
// Note the Columns reference property that returns an IEnumerable<TSqlColumn>
// and the Nullable property on TSqlColumn that exposes the Boolean value directly
IEnumerable<TSqlColumn> columns = table.Columns.Where(c => !c.Nullable);

 

Version-Specific Interfaces

The new interfaces for each SQL Server version allow programming against the correct set of properties and relationships for a target SQL Server version. The ISql90TSqlLogin and ISqlAzureTSqlLogin interfaces illustrate the differences in surface area between platforms:

SQL Server 2005:

public interface ISql90TSqlLogin : ISqlModelElement
{
    Boolean CheckExpiration { get; }
    Boolean CheckPolicy { get; }
    String DefaultDatabase { get; }
    String DefaultLanguage { get; }
    Boolean Disabled { get; }
    LoginEncryptionOption EncryptionOption { get; }
    Boolean MappedToWindowsLogin { get; }
    String Password { get; }
    Boolean PasswordHashed { get; }
    Boolean PasswordMustChange { get; }
    String Sid { get; }
    IEnumerable<ISql90TSqlAsymmetricKey> AsymmetricKey { get; }
    IEnumerable<ISql90TSqlCertificate> Certificate { get; }
    IEnumerable<ISql90TSqlCredential> Credential { get; }
}

SQL Azure:

public interface ISqlAzureTSqlLogin : ISqlModelElement
{
    Boolean Disabled { get; }
    String Password { get; }
}

 

These version-specific interfaces allow consumers to leverage compile-time validation and IntelliSense for specific versions of SQL Server.

TSqlTypedModel model = new TSqlTypedModel(SqlServerVersion.Sql90, new TSqlModelOptions() { });

// Create the Identifier for the l1 login
ObjectIdentifier loginId = new ObjectIdentifier("l1");

// Get the login from the model
TSqlLogin login = model.GetObject<TSqlLogin>(loginId, DacQueryScopes.UserDefined);

// Downcast login to ISql90TSqlLogin to ensure only
// SQL 2005 properties and references are used
ISql90TSqlLogin sql90Login = (ISql90TSqlLogin)login;

// Downcast login to ISqlAzureTSqlLogin to ensure only
// Microsoft Azure SQLDB properties and references are used
ISqlAzureTSqlLogin sqlAzureLogin = (ISqlAzureTSqlLogin)login;

Future Plans

This project will be a vehicle for sharing examples of using the extensibility APIs. The main focus of future additions will be addressing customer pain points we see through the forum and other customer engagements. We look forward to your feedback on examples that will help you better understand DacFx extensibility and the platform as a whole.

Contact Us

The development team would really like to hear your feedback on this project. For issues and Design Change Requests (DCR) please use the issue tracker. For general questions and help using the public APIs or SQL Server Data Tools please use the team's MSDN forum: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/home?forum=ssdt.

Community Spotlight: Grant Fritchey, Red Gate Software


In honor of the upcoming PASS Business Analytics conference, we wanted to take some time to spotlight the great work happening in the SQL and BI communities across the world. The conference is focused on business analytics, but PASS offers many great community activities for SQL Server and beyond. Learn about the various local and digital opportunities to connect with the PASS community here.

Name: Grant Fritchey
Role: Product Evangelist, Red Gate Software
Location: Grafton, MA, USA

What is an exciting project that you’re working on right now?

I’m helping to build a set of classes to teach people how to automate their database deployments in support of Database Lifecycle Management. Development is moving faster and faster in order to keep up with the demands of business. Because of this, databases must also be deployed faster and faster. But, you still have to ensure the protection of the vital business information stored within your databases. In the class I’m working on, we’ll show you how to get your database into source control alongside your application and how to perform continuous integration with databases. We’re going to cover all sorts of mechanisms for automating database deployments and database testing in order to work Database Lifecycle Management right into your Application Lifecycle Management.

What are your current analytics and/or database challenges, and how are you solving them?

The main challenges we have with databases are the same ones we’ve always had: performance and uptime. The thing is, we have blazing fast hardware these days. Or, if you’re looking at online solutions like Azure, we have very large VMs as well as methods for sharing across servers and databases. All this means that the underlying architectures of our database systems can perform very well. But, we still have to deal with the database design and the T-SQL code being run against the database. More and more we’re taking advantage of ORM tools such as Entity Framework, which really do speed up development. But, around 10% of the queries still need to be coded by hand in order to ensure adequate performance. Add to this the fact that we need to deploy all this while still ensuring up-time on the databases… Figuring out how to get adequate functionality in place without affecting up-time is tough work.

How does data help you do your job better?

Decisions on what to do with systems need to be based on information, not guesses. Data gathered about my systems shows me where I need to prioritize my work and directs choices on resource allocation.

What’s your favorite example of how data has provided an insight, a decision, or a shift in how business gets done?

Recently I found that I was seeing a serious “observer effect” in how I was collecting performance data. While tuning queries I was using STATISTICS IO and STATISTICS TIME, as I normally do. As I adjusted the code, I wasn’t seeing the kind of performance improvements I expected. In fact, some of my solutions seemed to be working even worse. I was a little surprised because I thought I was following a good methodology, so I tried turning off all the STATISTICS capturing and just used Extended Events. Suddenly, the tuning started working extremely well. I went back and experimented until I discovered that for some of my queries STATISTICS IO was actually impacting query execution, affecting both the time and the reads. Turning it off cleared the problem completely. I’ve now switched to using Extended Events most of the time in order to minimize, or eliminate, that issue. Best of all, I’m able to use them within Azure SQL Database as well as on my earthed servers.

What or who do you read, watch, or follow to help grow your data skills?

I go to SQLSkills.com over and over, sometimes multiple times in a day. It’s one of the single best resources for detailed SQL Server information. I also go to SQLServerCentral.com regularly to ask and answer questions. It’s a great resource for expanding your knowledge.

What’s your favorite SQL command and why?

RESTORE DATABASE: Because it has saved my job and the companies I’ve worked for so many times.

How does Azure help you protect your local databases?

There are a couple of ways you can use Azure to extend local capabilities. The first, and probably the easiest, is to use Azure Blob Storage as a means of ensuring that you have off-site storage of your backup files. You could pretty easily write a PowerShell script that copies your backups to Azure Storage. But, starting in SQL Server 2012, you can also issue a backup command to go straight to Azure Storage. Either way, you can be sure there’s a copy of your backups in case you suffer a catastrophic event locally.

Another way to extend your local capabilities to the cloud is to set up a virtual network. You can incorporate Azure Virtual Machines directly into your local network. Because of this, you can set up Availability Groups between Azure VMs and your local machines. This would enable you to have a failover setup to Azure, allowing for additional protection of your data and your systems.

Are there other ways Azure can be used in combination with local databases?

It’s my opinion that every developer should be using a local copy of SQL Server for their development. This is to allow them to experiment, learn, and, well, break stuff, without affecting anyone else. But, some laptops might be underpowered, or this could in some way violate a corporate policy. As a workaround, people can take advantage of the fact that SQL Database covers the vast majority of standard SQL Server functionality, at the database level. This makes it a great place to develop and test databases, especially if you’re already developing applications for Azure. You only need to keep the database around while you’re developing, and because you can keep the size small, the costs are extremely minimal.

Any other benefits for Azure in combination with local machines?

Tons. For one, expanded capacity. What if you need to get a lot more servers online quickly, but you’re hurting on disk space, or the servers are on back-order? Go back to that virtual network we talked about earlier. Set that up and now you can very quickly, even in an automated fashion through PowerShell, add SQL Server machines to your existing systems.

Another thing you could do, although this is not something I’ve tried yet, is take advantage of the fact that in SQL Server 2014 you can actually add file groups that are in Azure Blob Storage. Do you need extra disks, right now, that you can’t get from the SAN team? Well, if you can afford a bit of latency, you can just expand immediately into Azure Storage. I’d certainly be cautious with this one, but it’s exciting to think about the expanded capabilities this offers for dealing with certain kinds of disk space emergencies.

Thanks for joining us, Grant!

Know someone doing cool work with data? Nominate them for a spotlight in the comments.

Row-Level Security: Blocking unauthorized INSERTs


Row-Level Security (RLS) for Azure SQL Database enables you to transparently filter all “get” operations (SELECT, UPDATE, DELETE) on a table according to some user-defined criteria.

Today, however, there is no built-in support for blocking “set” operations (INSERT, UPDATE) according to the same criteria, so it is possible to insert or update rows such that they will subsequently be filtered away from your own view. In a multi-tenant middle-tier application, for instance, an RLS policy in your database can automatically filter results returned by “SELECT * FROM table,” but it cannot block the application from accidentally inserting rows for the wrong tenant. For additional protection against mistakes in application code, developers may want to implement constraints in their database so that an error is thrown if the application tries to insert rows that violate an RLS filter predicate. This post describes how to implement this blocking functionality using check and default constraints.

We’ll expand upon the example in a prior post, Building More Secure Middle-Tier Applications with Azure SQL Database using Row-Level Security. As a recap, we have a Sales table where each row has a TenantId, and upon opening a connection, our application sets the connection's CONTEXT_INFO to the TenantId of the current application user. After that, an RLS security policy automatically applies a predicate function to all queries on our Sales table to filter out results where the TenantId does not match the current value of CONTEXT_INFO.

Right now there is nothing preventing the application from errantly inserting a row with an incorrect TenantId or updating the TenantId of a visible row to a different value. For peace of mind, we’ll create a check constraint that prevents the application from accidentally inserting or updating rows to violate our filter predicate in this way:

-- Create scalar version of predicate function so it can be used in check constraints
CREATE FUNCTION rls.fn_tenantAccessPredicateScalar(@TenantId int)
RETURNS bit
AS
BEGIN
IF EXISTS(SELECT 1 FROM rls.fn_tenantAccessPredicate(@TenantId))
RETURN 1
RETURN 0
END
go

-- Add this function as a check constraint on our Sales table
ALTER TABLE Sales
WITH NOCHECK -- don't check data already in table
ADD CONSTRAINT chk_blocking_Sales -- needs a unique name
CHECK(rls.fn_tenantAccessPredicateScalar(TenantId) = 1)
go

Now if we grant our shared AppUser INSERT permissions on our Sales table and simulate inserting a row that violates the predicate function, the appropriate error will be raised:

GRANT INSERT ON Sales TO AppUser
go
EXECUTE AS USER = 'AppUser' -- simulate app user
go
EXECUTE rls.sp_setContextInfoAsTenantId 2 -- tenant 2 is logged in
go
INSERT INTO Sales (OrderId, SKU, Price, TenantId) VALUES (100, 'Movie000', 100, 1); -- fails: "The INSERT statement conflicted with CHECK constraint"
go
INSERT INTO Sales (OrderId, SKU, Price, TenantId) VALUES (101, 'Movie111', 5, 2); -- succeeds because correct TenantId
go
SELECT * FROM Sales -- now Movie001, Movie002, and Movie111
go
REVERT
go

Likewise for UPDATE, the app cannot inadvertently update the TenantId of a row to new value:

GRANT UPDATE ON Sales TO AppUser
go
EXECUTE AS USER = 'AppUser'
go
UPDATE Sales SET TenantId = 99 WHERE OrderID = 2 -- fails: "The UPDATE statement conflicted with CHECK constraint"
go
REVERT
go

Note that while our application doesn’t need to specify the current TenantId for SELECT, UPDATE, and DELETE queries (this is handled automatically via CONTEXT_INFO), right now it does need to do so for INSERTs. To make tenant-scoped INSERT operations transparent for the application just like these other operations, we can use a default constraint to automatically populate the TenantId for new rows with the current value of CONTEXT_INFO.

To do this, we’ll need to slightly modify the schema of our Sales table:

ALTER TABLE Sales
ADD CONSTRAINT df_TenantId_Sales DEFAULT CONVERT(int, CONVERT(varbinary(4), CONTEXT_INFO())) FOR TenantId
go

And now our application no longer needs to specify the TenantId when inserting rows:

EXECUTE AS USER = 'AppUser'
go
EXECUTE rls.sp_setContextInfoAsTenantId 2
go
INSERT INTO Sales (OrderId, SKU, Price) VALUES (102, 'Movie222', 5); -- don't specify TenantId
go
SELECT * FROM Sales -- Movie222 has been inserted with the current TenantId
go
REVERT
go

At this point, our application code just needs to set CONTEXT_INFO to the current TenantId after opening a connection. After that, the application no longer needs to specify the TenantId; SELECTs, INSERTs, UPDATEs, and DELETEs will automatically apply only to the current tenant. Even if the application code does accidentally specify a bad TenantId on an INSERT or UPDATE, no rows will be inserted or updated and the database will return an error.

In sum, this post has shown how to complement existing RLS filtering functionality with check and default constraints to block unauthorized inserts and updates. Implementing these constraints provides additional safeguards to ensure that your application code doesn’t accidentally insert rows for the wrong users. We’re working to add built-in support for this blocking functionality in future iterations of RLS, so that you won’t need to maintain the check constraints yourself. We’ll be sure to post here when we have updates on that. In the meantime, if you have any questions, comments, or feedback, please let us know in the comments below.

Azure ML Powers the Brain of the Modern Smart Grid


The next post in our series on how Microsoft customers are gaining actionable insights on their data through the power of advanced analytics – at scale and in the cloud.

What could big data have to do with the reliable flow of electricity? As it turns out – a lot.

Electrical grids include power plants, transmission lines, and a network of substations that serve thousands of buildings and a lot of our public infrastructure. These are complex networks powered by data stored in building meters, customer information systems and in the supervisory control and data acquisition (SCADA) systems used to manage the grid. Although there’s a lot of data available, pulling the right information is a challenge as there are many disparate systems and applications involved which do not talk to each other. Getting trustworthy insights on all of this data is therefore difficult.

Disparate software and data sources are just one problem. Most grids, built decades ago, find it hard to keep up with today’s demand. Although the recent advent of renewable energy sources such as wind and solar is a big plus, one issue is that these sources can sometimes be erratic.

Utilities typically cannot afford the cost and downtime of big infrastructure upgrades but are increasingly investing in smart sensors and meters to improve efficiency via better monitoring and failure detection. But, even with the adoption of newer technologies, utilities still struggle to get value from their data because of the lack of a central way to bring it all together before they can use it in their operations.


Picture courtesy eSmart Systems

Building the Brains of the Modern Smart Grid

eSmart Systems is in the business of creating the next generation of smart grid software. Their goal is to create systems with optimal demand flexibility and response capability. A critical part of their vision is the ability to process huge data flows and leverage the insights from data in critical business decisions.

eSmart Systems focused their initial efforts on demand response management. They looked to the cloud to address the scale challenges of big data and the computing needed to perform the analytics, and decided to build the brains of their modern smart grid on Azure and Azure ML.

eSmart Systems has designed an automated demand response solution that collects data from virtually any type of meter or sensor into Azure Storage blobs. The solution then runs predictive models in Azure ML to forecast potential capacity problems and automatically controls the load to buildings and other infrastructure to prevent outages. Information gets analyzed on a Hadoop cluster with Azure HDInsight for a closer look at usage. One thing they loved was the fact that they didn’t have to operate at the level of virtual machines – rather, everything could be done through Azure services.

The eSmart solution provides a short-term 24-hour forecast, a long-term monthly forecast, and a temperature forecast, and it offers a centralized way to monitor and manage the entire grid.

With their Azure-based platform they have been able to connect all of their data together and provide grid managers with a single user interface for all their tasks. Data is visualized through an eSmart interface and also made more widely available to business users through an interactive Power BI dashboard.


Picture courtesy eSmart Systems

With eSmart Systems technology running in the Microsoft cloud, everyone has less to worry about. “We’re providing the brains of the smart grid – the software platform that operates the grid. We’re using a lot of different Microsoft Azure technologies, including Azure Machine Learning, but the core solution is to help operators predict issues to prevent blackouts” says Knut Johansen, CEO at eSmart Systems.

The company plans to extend its solution to include predictive maintenance. Their eSmart System Platform can easily optimize for other energy management scenarios such as in commercial buildings and for charging electric vehicles. The company is also working on additional consumer features such as providing status updates via smartphones.

Because the eSmart Systems solution is simpler, more affordable and offered as a service, utilities do not need an IT technician to navigate it. eSmart simply spins up an Azure environment for them and it just works – all it takes is an Internet connection.

 

ML Blog Team
Get started on Azure ML today

Video – JJ Food Service Predicts Customers’ Future Orders Using Azure ML


In a popular earlier post, we had talked about the creative use of cloud analytics at JJ Food Service, a large food delivery service company in the UK.

In this video, their Chief Operating Officer, Mushtaque Ahmed, talks about how JJ Food Service tapped into a rich trove of existing data to anticipate customers’ future orders. In doing so, they have further personalized and simplified their users' experience and are creating even more delighted customers.

 

ML Blog Team
Subscribe to this blog. Follow us on twitter

Data-Driven Innovation – Improving Online Retail Efficiency with Real-Time Audience Measurement


Website audience measurement aims to help the marketing and sales teams of a B2C company build their commercial strategy from real data – for example, the web pages that led to the largest number of purchases.

In this article, we will see how to build a simple, agile and fast solution for real-time traffic analysis based on the log files produced by websites, relying on information that can be exploited without requiring any code changes on the site.

...(read more)

Free Webinar Tomorrow: Building Predictive Models with Large Datasets


Predictive analytics problems often involve large datasets that aren’t manageable on a single local client or even a server machine.

This webinar will use the public NYC taxi ride dataset to discuss how to store, manipulate and analyze such large data sets using Azure storage, HDInsight (Hadoop) and Azure ML. We will use the new "Learning with Counts" capability in Azure ML to train predictive models with large data sets and specifically use this to create a model to predict tips on NYC taxi rides.

To register for this free webinar, click here or on the image below.

  

ML Blog Team


In ML, What They Know Can Hurt You


This blog post is authored by Juan M. Lavista, Principal Data Scientist at Microsoft.

Imagine you own a restaurant in a city (any city but not Seattle*). From inside the restaurant there’s no way to see the outside. You come up with a marketing plan that if it rains, your restaurant will provide free food. Given that there are no windows, you use the following model to determine if it is raining – if there’s at least one customer that comes into the restaurant with a wet umbrella you conclude that it’s raining. Initially, this model will be very accurate. After all, in what other circumstances would someone walk in with a wet umbrella? However, if you decide to disclose the umbrella rule to all your customers, do you believe that this model would continue to be accurate?

Information on how the model works and how features behave within a given model can be very useful if used wisely. However, if this information is used incorrectly, it can actually hurt the accuracy of the model.

Models require that the set of signals/features contain information that is predictive. However, the relationship between these signals/features and the outcome does not necessarily need to be causal. This means that the feature might be indicative/correlated but not necessarily the cause of what we are trying to predict.

For example, let's say we need to predict [C], but we can only measure [B] as a feature, and [B] does not affect [C]. The real cause that affects [C] is [A], but we cannot measure [A]. [A], on the other hand, also affects [B], so we can use [B] as a way to predict [C].

So let’s say we estimate this model and it gives very good predictions of [C]. However, we later release the model to the public, who take advantage of this information and start targeting changes in [B] directly. At this point, the information provided by [B] to the model is no longer associated only with [A], so it loses its power to predict [C]. By releasing the model to the public, we actually end up hurting the accuracy of the model.
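A tiny synthetic simulation (not from the original post) makes the mechanism concrete: a predictor trained on the proxy [B] works well until [B] starts being manipulated independently of the hidden cause [A].

import numpy as np

rng = np.random.RandomState(0)
n = 100000

# [A] causes both [B] and [C]; [B] has no causal effect on [C].
A = rng.normal(size=n)
B = A + 0.1 * rng.normal(size=n)   # B tracks A (wet umbrellas track rain)
C = A + 0.1 * rng.normal(size=n)   # C is what we want to predict

# Fit a simple linear predictor of C from B on the "natural" data.
w = np.cov(B, C)[0, 1] / np.var(B)
print("MSE before gaming:", np.mean((C - w * B) ** 2))

# Now the signal is gamed: B is set directly, independently of A.
B_gamed = rng.normal(size=n)
print("MSE after gaming: ", np.mean((C - w * B_gamed) ** 2))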

Examples

Pagerank

A real-world example is PageRank [1]. PageRank provides a way to rank the importance of a webpage based on which other websites link to it. The basis of the model is that if a document is important or relevant, other websites will reference it and include links to it, and when those links happen naturally the ranking works. However, once you make this model known to the general public, there are clear incentives to rank higher in search results. What ends up happening is that users try to game the system, for example by buying backlinks to try to increase the ranking of their website. In reality, artificially increasing the links to a website does not make the website more relevant; it is just gaming the system. By disclosing how the algorithm works we actually hurt the accuracy of the model.

Credit Scores

The design objective of the FICO credit risk score is to predict the likelihood that a consumer will go 90 days past due (or worse) in the 24 months after the score has been calculated. Credit scores are another place where disclosing rules ends up hurting the model. For example, myFICO states: “Research shows that opening several credit accounts in a short period of time represents a greater risk – especially for people who don't have a long credit history.”[2] This type of study shows correlation but not causation. These findings can help predict credit risk; however, by disclosing these rules, the credit agency runs the risk that users will learn how to game the system and undermine its credit risk predictions.

Not All Information Hurts the Models

If the relationship between the feature and the outcome is causal, the situation is different. For example, if I’m building a model for predicting blood pressure using the amount of exercise as a feature, providing this information to the public will actually help users. Given that the relationship between exercise and blood pressure is causal [3], disclosure will not hurt the accuracy of the model.

Conclusion

Before publicly sharing information about how your model works, it is important to understand if the relationship between the features and the outcome are causal. The rule of thumb is the following:

If the relationship between the feature and the outcome is not causal, especially if the signal/feature is easy to change – for example, by buying links, or by walking in with an umbrella in the examples above – and there are reasons why people have incentives to affect the actual outcome, then we might be at risk of users gaming the system.

Even if we do not disclose how the model works, we still might be at risk because users may find out. It’s important to understand and evaluate the risk and monitor your systems periodically.

Juan 
Follow me on Twitter

 

*This model will not work in Seattle because Seattleites do not carry umbrellas :-)

  

References

[1] Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry (1999). The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab.

[2] How my FICO Score is calculated

[3] Exercise: A drug-free approach to lowering high blood pressure by Mayo Clinic

Now Available on Azure ML – Criteo's 1TB Click Prediction Dataset


This post is by Misha Bilenko, Principal Researcher in Microsoft Azure Machine Learning.

Measurement is the bedrock of all science and engineering. Progress in the field of machine learning has traditionally been measured against well-known benchmarks such as the many datasets available in the UCI-ML repository, in the KDDCup and Kaggle contests and on ImageNet.

Today, we are delighted to announce the availability of the largest ever publicly released ML dataset, produced by our friends at Criteo and now hosted on Microsoft Azure Machine Learning. This new benchmark allows us to compare the performance of supervised learning algorithms on a realistic dataset representing an industry-defining, multi-billion dollar task – namely, advertisement click prediction.

The scale of the data that ML systems are expected to consume keeps growing steadily. In domains such as computational finance and online advertising, thousands of training points arrive every second. The development of ML algorithms and systems capable of learning on large-scale data that yield high-throughput predictions is a key challenge. Yet, few public benchmarks exist that provide a realistic snapshot of a revenue-critical predictive problem.

An earlier competition by Criteo, as well as contests by Yandex, Tencent and Avazu have all brought attention to variants of the task: namely, that of predicting the probability of a user clicking on an item (e.g. an advertisement or webpage link) based on attributes describing various properties of the context, the item and the user. However, these datasets tended to be small samples of realistic production datasets, often small enough to fit in the RAM of a high-end modern workstation.

By contrast, the newly available Criteo 1TB dataset provides over 4 billion examples with binary labels (click vs. no-click). There are 156 billion total (dense) feature-values and over 800 million unique attribute values. While this record-breaking scale may seem formidable for classic ML algorithms, the emergence of cloud ML platforms makes it straightforward for every data scientist to train predictive models on such datasets from their laptop, utilizing distributed learning techniques such as Learning with Counts (codenamed Dracula); we will present a detailed experiment on this dataset in a follow-up post.

We salute Criteo for giving our field a practical baseline that allows us to quantify progress in “big learning”. And we really look forward to all the new algorithms and systems that this dataset will motivate and inspire you to create. 

 

Misha

[Announcement] ODataLib 6.11.0 Release

$
0
0

We are happy to announce that the ODataLib 6.11.0 is released and available on NuGet. Detailed release notes are listed below:

New Features:

[GitHub issue #23] ODataLib now supports parsing URI path template.

[GitHub issue #71] EdmLib now supports adding vocabulary annotations to EdmEnumMember.

[GitHub issue #80] OData client for .NET now supports abstract entity type without key.

[GitHub issue #85] ODataLib now supports additional preference headers: odata.track-changes, odata.maxpagesize and odata.continue-on-error.

[GitHub issue #87] ODataLib now supports setting filter query option in ExpandedNavigationSelectItem.

[GitHub issue #94] ODataLib now supports $levels in ODataUriBuilder.

[Github issue #144] ODataLib now suppresses the errors in reading open entity’s undeclared primitive, collection and complex property value.

Improvements:

[GitHub issue #101] Improve the performance of DataServiceContext.SaveChanges when the entities are tracked by a DataServiceCollection.

Bug Fixes:

[GitHub issue #93] Fix a bug that DataServiceContext.CreateFunctionQuery should set isComposable property of DataServiceOrderedQuery.

[GitHub issue #95] Fix a bug that OData client for .NET does not support composing a query operation onto a composable function.

Call to Action:

You and your team are very welcome to try out this new version if you are interested in the new features and fixes above. For any feature request, issue or idea, please feel free to reach out to us at odatafeedback@microsoft.com.

[Announcement] ODataLib 5.6.4 Release


We are happy to announce that the ODataLib 5.6.4 is released and available on NuGet. Detailed release notes are listed below:

New Features:

[GitHub issue #144] ODataLib now suppresses the errors in reading open entity’s undeclared collection or complex property value

Bug Fixes:

[GitHub issue #60] Fix an issue that $select does not work with EntityFramework 5

Call to Action:

You and your team are very welcome to try out this new version if you are interested in the new features and fixes above. For any feature request, issue or idea, please feel free to reach out to us at odatafeedback@microsoft.com.

Building Azure ML Models on the NYC Taxi Dataset


This blog post is by Girish Nathan, a Senior Data Scientist at Microsoft.

The NYC taxi public dataset consists of over 173 million NYC taxi rides in the year 2013. The dataset includes driver details, pickup and drop-off locations, time of day, trip locations (longitude-latitude), cab fare and tip amounts. An analysis of the data shows that almost 50% of the trips did not result in a tip, that the median tip on Friday and Saturday nights was typically the highest, and that the largest tips came from taxis going from Manhattan to Queens.

This post talks about Azure ML models that we built on this data, with the goal of understanding it better.

   

Problem Statement

We categorized tip amounts into the following five bins:

Class 0: Tip < $1
Class 1: $1 <= Tip < $5
Class 2: $5 <= Tip < $10
Class 3: $10 <= Tip < $20
Class 4: Tip >= $20

Although there are several ML problems that this dataset can be used for, we focused on the following problem statement:

Given a trip and fare - and possibly other derived features - predict which bin the tip amount will fall into.

We can model this as a multiclass classification problem with 5 classes. An alternative is ordinal regression, which we do not discuss here. An Azure ML technology we used is Learning with Counts (aka Dracula) for building count features on the categorical data, and a multiclass logistic regression learner for the multiclass classification problem.

The Dataset

The NYC taxi dataset is split into Trip data and Fare data. Trip data has information on driver details (e.g. medallion, hack license and vendor ID), passenger count, pickup date and time, drop off date and time, trip time in seconds and trip distance. Fare data has information on the trip fare, relevant tolls and taxes, and tip amount.

An excerpt of the datasets is shown below.

Trip data example:

Schema:
medallion, hack_license, vendor_id, rate_code, store_and_fwd_flag, pickup_datetime, dropoff_datetime, passenger_count, trip_time_in_secs, trip_distance, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude

Example:
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171

Fare data example:

Schema:
medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount

Example:
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7

Data Preprocessing

We perform a join operation of the Trip and Fare data on the medallion, hack_license, vendor_id, and pickup_datetime to get a dataset for building models in Azure ML. We then use this dataset to derive new features for tackling the problem.
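Conceptually, the join looks like the following pandas sketch; the file names are hypothetical, and the actual processing was done at scale rather than on local CSV extracts.

import pandas as pd

trip_df = pd.read_csv("trip_data.csv")   # hypothetical local extract of the Trip data
fare_df = pd.read_csv("trip_fare.csv")   # hypothetical local extract of the Fare data

join_keys = ["medallion", "hack_license", "vendor_id", "pickup_datetime"]
joined = pd.merge(trip_df, fare_df, on=join_keys, how="inner")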

Derived Features

We preprocess the pickup date time field to extract the day and hour. This is done in an Azure HDInsight (Hadoop) cluster running Hive. The final variable is of the form “day_hour” where day takes values from 1 = Monday to 7 = Sunday and the hour takes values from 00 to 23. We also use the hour to extract a categorical variable called “that_time_of_day” – this takes 5 values, Morning (05-09), Mid-morning(10-11), Afternoon(12-15), Evening(16-21), and Night(22-04).
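The post derives these features in Hive; a pandas equivalent of the two derived features might look like this sketch (not the actual Hive script):

import pandas as pd

def add_time_features(df):
    ts = pd.to_datetime(df["pickup_datetime"])
    day = ts.dt.dayofweek + 1              # 1 = Monday ... 7 = Sunday
    hour = ts.dt.hour
    df["day_hour"] = day.astype(str) + "_" + hour.map("{:02d}".format)

    # Map the hour into the five "that_time_of_day" buckets described above
    def bucket(h):
        if 5 <= h <= 9:
            return "Morning"
        if 10 <= h <= 11:
            return "Mid-morning"
        if 12 <= h <= 15:
            return "Afternoon"
        if 16 <= h <= 21:
            return "Evening"
        return "Night"                     # hours 22-04
    df["that_time_of_day"] = hour.map(bucket)
    return df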

Since the data has longitude-latitude information on both pickup and drop-off, we can use this in conjunction with some reverse geocoding software (such as this one, provided by the Bing Maps API) to derive a start and end neighborhood location and code these as categorical variables. These new features are created to reduce the dimensionality of the pickup-drop-off feature space and they help in better model interpretation.

Dealing with High Dimensional Categorical Features

In this dataset, we have many high dimensional categorical features. For example, medallion, hack_license, day_hour and also raw longitude-latitude values. Traditionally, these features are dealt with by one-hot encoding them (for an explanation, click here). While this approach works well when the categorical features have only a few values, it results in a feature space explosion for high dimensional categorical features and is generally unsuitable in such contexts.

An efficient way of dealing with this is using Learning with Counts to derive conditional counts based on the categorical features. These conditional counts can then be directly used as feature vectors or transformed into something more convenient.
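The following pandas sketch illustrates the idea of count featurization; it is a conceptual illustration only, not the implementation behind the Azure ML Build Count Table and Count Featurizer modules.

import pandas as pd

def build_count_table(counts_df, cat_col, label_col):
    # Rows: values of the categorical column; columns: one count per class label.
    return counts_df.groupby([cat_col, label_col]).size().unstack(fill_value=0)

def count_featurize(df, table, cat_col, prefix):
    # Replace the raw categorical value with its per-class conditional counts.
    feats = table.reindex(df[cat_col]).fillna(0).add_prefix(prefix)
    return pd.concat([df.reset_index(drop=True), feats.reset_index(drop=True)], axis=1)

# counts_df, train_df and test_df are assumed to be the counts/train/test splits
# described in the modeling section below, for example:
# table = build_count_table(counts_df, "medallion", "tip_class")
# train_feat = count_featurize(train_df, table, "medallion", "medallion_cnt_")
# test_feat  = count_featurize(test_df,  table, "medallion", "medallion_cnt_")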

Deriving Class Labels for the Multiclass Classification Problem

We map the tip amounts into the 5 bins mentioned in the Problem Statement above using an Execute R module in Azure ML and then save the train and test data for further use.
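For reference, a conceptual pandas equivalent of that binning step (the post itself uses an Execute R module) could look like this:

import pandas as pd

# df is assumed to be the joined Trip/Fare dataset with a tip_amount column
bins   = [-float("inf"), 1, 5, 10, 20, float("inf")]
labels = [0, 1, 2, 3, 4]        # Class 0 .. Class 4 as defined above
df["tip_class"] = pd.cut(df["tip_amount"], bins=bins, labels=labels, right=False)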

Model Building in Azure ML

After subsampling the data so that Azure ML can consume it, we are ready to build models for the problem at hand.

Class Distributions in Train Dataset

For our training data, the class distributions are as follows:

Class 0 : 2.55M examples (49%)
Class 1 : 2.36M examples (45%)
Class 2 : 212K examples (4.1%)
Class 3 : 65K examples (1.2%)
Class 4 : 26K examples (0.7%)

Modeling the Problem Using Learning with Counts, and Experimental Results

As mentioned in the earlier blog post on Learning with Counts, using conditional counts requires building count tables on the data. In our case, we split our final dataset into three components: one for building count tables, another for training the model, and a third for the test dataset. Once the count tables are built on the data allocated to them, we featurize our train and test datasets using the resulting count features.

We show a sample of the experiment to illustrate the set-up:

The Build Count Table module builds a count table for the high dimensional categorical features on a 33GB dataset allocated for counts using MapReduce. We then use those count features in the Count Featurizer module to featurize the train, test, and validation datasets. As is standard practice, we use the train and validation datasets for parameter sweeps, and then use the “best learned model” to score on the test dataset. This part of the experiment is shown below:

The model training and scoring portion of the experiment looks like so: 


We can now use a standard confusion matrix to summarize our results. As is common, the rows represent the true labels and the columns the predicted labels.

Modeling the Problem Without Learning with Counts, and Experimental Results

For comparison, we show the results of using a multiclass logistic regression learner on data where feature hashing is used for reducing the dimensionality of the categorical features. The resulting confusion matrix on the test dataset is as follows:


Conclusion

We see that the use of conditional counts results in higher prediction accuracy for the rarer classes. This benefit is likely due to the count tables producing compact representations of the high dimensional categorical features, which helped in building better models for the NYC taxi dataset. When count features are not used, the high dimensional representation of the categorical features results in a higher-variance model for the rarer classes, as they have less data available for learning.

An additional benefit of using count features is that both the train and test times are reduced. For instance, for the NYC taxi dataset, we find that the model building time using count features is half of that without using count features. Similarly, the scoring time is also about half due to more effective featurization. This can be another significant benefit when dealing with large datasets.

Girish
