Thursday, 28 December 2017

The Data Science of Predicting Disk Drive Failures

With the expanding volume of information in the digital universe and the increasing number of disk drives required to store that information, disk drive reliability prediction is imperative for EMC and EMC customers.

Figure 1- An illustration of the expansion of information in recent years and its expected growth

Disk drive reliability analysis, which is a general term for the monitoring and “learning” process of disk drive prior-to-failure patterns, is a highly explored domain both in academia and in the industry. The Holy Grail for any data storage company is to be able to accurately predict drive failures based on measurable performance metrics.

Naturally, improving the logistics of drive replacements is worth big money for the business. In addition, predicting that a drive will fail long enough in advance can facilitate product maintenance, operation and reliability, dramatically improving Total Customer Experience (TCE). In the last few months, EMC’s Data Science as a Service (DSaaS) team has been developing a solution capable of predicting the imminent failures of specific drives installed at customer sites.

Accuracy is key


Predicting drive failures is known to be a hard problem, requiring good data (exhaustive, clean and frequently sampled) to develop an accurate model. Well, how accurate should the developed model be? Very accurate! Since there are many more healthy drives than failed drives (a ratio of about 100:2 annually) the model has to be very precise when making a positive prediction in order to provide actual value.
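To get a feel for why the imbalance makes precision so demanding, here is a minimal back-of-the-envelope sketch in Python. The roughly 2 percent failure rate comes from the ratio above; the recall and false positive rate values are illustrative assumptions, not measured results:

    # Illustration: with ~2% of drives failing per year, even a small false
    # positive rate can swamp the true positives and hurt precision.
    drives = 100_000
    failure_rate = 0.02            # roughly the 100:2 healthy-to-failed ratio
    recall = 0.70                  # assumed fraction of failing drives we catch
    false_positive_rate = 0.02     # assumed fraction of healthy drives flagged

    failed = drives * failure_rate
    healthy = drives - failed

    true_positives = recall * failed
    false_positives = false_positive_rate * healthy
    precision = true_positives / (true_positives + false_positives)
    print(f"precision = {precision:.1%}")   # ~41.7%: most alerts would be false alarms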

EMC has been collecting telemetry on disk drives and other components for many years now, constantly improving the process efficacy. In a former project, we used drive error metrics collected from EMC VNX systems to develop a solution able to predict individual drive failures and thus avoid repeated customer engineer visits to the same customer location (for more details contact the DSaaS team). For the current proof of concept (POC) we used a dataset that Backblaze, an online backup company, has released publicly. This dataset includes S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) measurements collected daily throughout 2015 from approximately 50,000 drives in their data center. Developing the model on an open dataset makes it easier to validate the achieved performance and, equally important, allows for knowledge sharing and for contributing back to the community.
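As a minimal sketch of how such a dataset can be loaded for exploration (assuming the public 2015 Backblaze daily CSV snapshots have been downloaded into a local data/2015/ folder; the column names follow the published Backblaze schema):

    import glob
    import pandas as pd

    # Load the daily SMART snapshots into one dataframe.
    files = sorted(glob.glob("data/2015/*.csv"))
    df = pd.concat((pd.read_csv(f, parse_dates=["date"]) for f in files),
                   ignore_index=True)

    # 'failure' is 1 on the last day a drive reported before failing, else 0.
    print(df["serial_number"].nunique(), "drives,", int(df["failure"].sum()), "failures")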

As of today, SMART attribute thresholds, which are the attribute values that should not be exceeded under normal operation, are set individually by manufacturers by means that are often considered a trade secret. Since there are more than 100 SMART attributes (whose interpretation is not always consistent across vendors), rule-based learning of disk drive failure patterns is quite complicated and cumbersome, even without taking into account how different attributes affect each other (i.e., their correlations). Automated machine learning algorithms are very useful in dealing with these types of problems.

Implementation in mind


An important point to consider in this kind of analysis is that, unlike purely academic research, the solutions we are developing are meant to be implemented either in a product or as part of some business process once their value is assessed in a POC. Thus, at each step of the model development process, we need to ask ourselves whether the tools and information we are using will be accessible when the solution runs in real time. To that end, we perform a temporal division of the data into train and test sets. The division date can be set as a free parameter of the learning process, leaving room for different assignments and manipulations (several examples are illustrated in the plot below). The critical point is that in the learning phase the model only has access to samples taken up to a certain date, simulating the “current date” in a real-life scenario.


Figure 2- Examples of a temporal division of the data into train and test sets. The model can only learn from samples taken in the train period. The evaluation of the model is then performed on unseen samples taken in the test period.
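A temporal split of this kind can be expressed in a few lines (a sketch that assumes the dataframe from the loading example above; the division date is the free parameter mentioned earlier):

    import pandas as pd

    # The model may only learn from samples taken up to the division date,
    # which simulates the "current date" of a real-life deployment.
    division_date = pd.Timestamp("2015-09-30")

    train = df[df["date"] <= division_date]
    test = df[df["date"] > division_date]
    print(len(train), "training samples /", len(test), "test samples")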

In order to evaluate how well our solution can predict drive failures when applied to new data, we evaluate its performance solely on unseen samples taken in the test period. There are multiple performance metrics available, such as model precision, recall and false positive rate. For this specific use case we are mainly looking to maximize the precision, which is the proportion of drives that will actually fail out of all the drives we predict as “going to fail.” Another measure of interest is how much time in advance we can predict a drive failure. The accuracy of the provided solution, and how far into the future it can reliably see, will have a direct effect on its eventual adoption by the business.
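The two quantities we care about most, precision on positive predictions and how far in advance the alert arrives, can be computed along these lines (a sketch; the dataframes and column names are assumptions for illustration):

    import pandas as pd

    def precision_and_lead_time(alerts: pd.DataFrame, failures: pd.DataFrame):
        """alerts: one row per flagged drive with its first_alert_date;
        failures: actual failure_date per serial_number."""
        merged = alerts.merge(failures, on="serial_number", how="left")
        hits = merged.dropna(subset=["failure_date"])   # flagged drives that really failed
        precision = len(hits) / len(merged) if len(merged) else float("nan")
        lead_days = (hits["failure_date"] - hits["first_alert_date"]).dt.days
        return precision, lead_days.mean()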

Engineering for informative features


At first, we only used the most recent sample from each drive for model training. As we did not get accurate enough results (the initial precision was only 65 percent), we decided to add features describing longer time periods from the drive’s history. The more meaningful information we feed the model, the more accurate it will be. And since we have daily samples taken over a full year, why limit ourselves to the last sample only?

To picture the benefits of using a longer drive history, imagine yourself visiting the doctor and asking her for a full evaluation of your physical condition. She may run multiple tests and give you a diagnosis based solely on your physical measurements on the current date, but these are subject to high variability: maybe today, of all days, is not a good representation of your overall condition. A more informative picture of your current state is gained by running extensive, consecutive tests over continuous time periods and looking at all the gathered data as a whole. In a sense, drives are like human beings, and we would like to capture and model their behavior over longer time periods.

Most of the newly calculated features capture different aspects of the trend and rate of attribute change over time, as well as other statistical properties of each attribute’s sample population. For example, we calculate the slope and intercept of the line that best describes the attribute’s trend over time, and the variance associated with each attribute during the specified time period.


Figure 3- Examples of features that can be extracted from the temporal ‘raw’ data, to capture the behavior of the drive in a continuous time window

We can choose the size of the measured time period according to our interest. For example, a reasonable choice would be to look at the drive’s behavior during the last two weeks or the last month. Once we have the new “historical” features, we can train a model whose predictions are based on their values in addition to the original “raw” SMART attribute values.
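A sketch of what such window-based “historical” features might look like for a single drive and attribute (the column and feature names are illustrative; the window length is the free parameter discussed above):

    import numpy as np
    import pandas as pd

    def history_features(drive_df: pd.DataFrame, attr: str, window_days: int = 14) -> dict:
        """Summarize the recent behavior of one SMART attribute for one drive."""
        recent = drive_df.sort_values("date").tail(window_days)
        x = np.arange(len(recent))                       # day index within the window
        slope, intercept = np.polyfit(x, recent[attr].to_numpy(), deg=1)
        return {
            f"{attr}_slope": slope,                      # trend of the attribute
            f"{attr}_intercept": intercept,
            f"{attr}_var": recent[attr].var(),           # variability in the window
            f"{attr}_last": recent[attr].iloc[-1],       # most recent raw value
        }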

Taking into account that the samples are collected daily, we use a “continuous” evaluation approach, in which we apply the model and acquire predictions consecutively for each of the daily samples in the test period (see also Figure 4 for an illustration of the evaluation process). This allows us to update our knowledge regarding the state of the drive with the most recent data at our disposal. Training and applying the model with the new “historical” features greatly improved our results, increasing the precision to 83.3 percent, with a mean prediction lead time of 14 days before an actual failure.


Figure 4- A snapshot of the results for multiple runs of the trained model on samples taken on consecutive days of the evaluation period. These samples belong to a drive that failed during the test period, and the figure depicts the evolution of the model predictions as the drive approaches its failure date. The numbers and colors represent the assigned probability that the drive will fail, with -1 (in green) representing drives that have already failed.

Tuesday, 26 December 2017

Build a Last Line of Defense Against Cyber Attacks


Channel partners looking for new revenue sources should consider qualifying to sell the Dell EMC Isolated Recovery solution. It’s the last line of data protection defense for your customers against cyber attacks that could deal their businesses a devastating blow if an attack corrupted or compromised their mission-critical or system-of-record data.

Since cyber attacks are increasing in frequency and sophistication — such as ransomware spread by phishing, which is hard to defend against — the need for large enterprises to protect their most important data has never been greater. Now you can help them protect their most important data with the Dell EMC Isolated Recovery solution, for another layer of defense-in-depth protection.

Hardware, Software and High-Margin Professional Consulting Services Needed


The Dell EMC Isolated Recovery solution requires hardware, software and high-margin professional consulting services to set up. It complements existing backup-and-recovery systems, which themselves can be subject to malware infection and data corruption simply via their 24×7 network connections.

What’s more, the unique Dell EMC Isolated Recovery solution is an extremely competitive offering. While many of our data protection competitors talk about hardened approaches to critical data, we have fully operationalized it. And this solution has many years of maturity and proven deployments behind it that you can use to help in your presentations to prospects.

Two of the world’s largest business consulting firms are already selling and deploying the solution to their large enterprise clients, generating professional services fees as well as commissions on Dell EMC data protection hardware and software. As a Dell EMC channel partner, you can find plenty of candidates for this solution among your own customers and prospects in virtually every industry, with a heightened awareness among these industries:

◉ Finance
◉ Government
◉ Education
◉ Healthcare
◉ Utilities

How the Dell EMC Isolated Recovery Solution Works


In short, the Dell EMC Isolated Recovery solution works by replicating a copy of backup data from the most critical applications to a physically and logically isolated and protected “gold copy.”

Once the update completes — in minutes or hours, depending on the change rate from the prior update and the total volume — the data streaming stops and automated software reestablishes an air gap between the source and target servers. It’s like opening and closing a vault door.
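The automation that drives this cycle is Dell EMC’s own purpose-built software; purely as a conceptual sketch of the open-replicate-close pattern (every function below is a hypothetical placeholder, not a real API):

    import time

    def replication_cycle(enable_link, replicate, disable_link, interval_hours=24):
        """Conceptual only: open the 'vault door', copy changes, close it again."""
        while True:
            enable_link()        # connect the isolated target to the source
            try:
                replicate()      # stream only the data changed since the last update
            finally:
                disable_link()   # re-establish the air gap even if replication fails
            time.sleep(interval_hours * 3600)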

The diagram below illustrates how to implement this architectural approach using Dell EMC Data Domain hardware. (Note that Dell EMC VMAX servers or RecoverPoint appliances can also be used). The Data Domain hardware provides the capability to replicate data securely and asynchronously between systems with the connection air gap implemented when replication is idle. This isolates the replica Data Domain system, safeguarding a no-access vault copy of your critical data for the purposes of recovery.


Ultimately, the goal of the Dell EMC Isolated Recovery solution is to streamline the process of isolating a trusted gold copy of critical data, while keeping that copy updated. This will help enable the fastest possible recovery to a trusted state. Below are the four key elements of the Dell EMC Isolated Recovery solution:

◉ Isolation: The solution requires a physically isolated data center environment — sometimes called a vault — that is normally disconnected from the network and restricted to IT personnel with the proper clearance. It is normally located in a cage at the production site; no third site is necessary.
◉ Data copy and scheduled air gap: This core process is automated, using purpose-built software developed by Dell EMC, and manages the process of copying data to the isolated target. The software also manages the air gap’s scheduled window between the production environment and the isolated recovery zone. Automation is essential to managing the air gap and data movement.
◉ Integrity checking and alerting: These are workflows to stage copied data in the isolated recovery zone and perform integrity checks to verify it has not been affected by malware, plus mechanisms to trigger alerts in the event of a cyber attack or breach.
◉ Recovery and remediation: The solution includes validated procedures to perform recovery and remediation processes after an incident by leveraging a customer’s regular restore procedures, according to specific recovery point objectives (RPOs) and recovery time objectives (RTOs).

How to Sell the Dell EMC Isolated Recovery Solution


Talk to your customers and articulate the risks posed by hacktivist cyber attacks. Highlight that the Dell EMC Isolated Recovery solution leverages long-proven EMC infrastructure and software and isolates their critical data from cyber attacks.

Thursday, 21 December 2017

Future-Proof Storage Programs – Know the Facts!


With our newly expanded Future-Proof Storage Loyalty Program, Dell EMC continues to raise the bar across the industry. The Dell EMC Future-Proof Storage Loyalty Program gives customers additional peace of mind with guaranteed satisfaction and investment protection for future technology changes. This program not only provides longer coverage terms than other similar programs, it also has the fewest restrictions. Quite simply, we believe it’s the best program in the industry and we aren’t done enhancing it yet!


We encourage everyone reading this to fully examine similar competitive programs to ensure you really understand what you’re getting and how much you are paying for it.  Specifically, in recent years, Pure Storage has dramatically modified its “Evergreen Storage” program and many customers tell us they are not happy with what they’ve seen.

To help you uncover the not-so-obvious details of some key elements of Pure Storage’s current Evergreen Storage program we’d like to provide some simple questions you can ask Pure Storage when reviewing their Evergreen Storage program.  Let’s focus on the three program benefits that require the higher priced Evergreen Gold subscription, and are not included in the Evergreen Silver subscription.

1. Free Every Three
2. Upgrade Flex
3. Capacity Consolidation

Free Every Three – This is Pure’s Evergreen Gold offering that claims you get upgraded controllers for free every three years.

Questions to ask:

◉ Do I really need to pay for six years of Evergreen Gold to receive the ‘free’ controllers? That is, the first three years to become eligible, and then the new controllers only when I renew for another three years?
◉ What will it really cost me (over six years of a Gold Subscription) to get the free controllers?
◉ How much am I really saving versus just buying new controllers at a negotiated discount?

From what we’ve seen and heard from customers the six years of required Evergreen Gold more than pays for the two new controllers.  So, ‘Free Every Three’ is really just a ‘Pre-Pay’ program!  It’s a trade-in program you need to pre-pay for, even if you never use it.

Upgrade Flex Controllers – This is Pure’s Evergreen Gold offering for upgrading your controllers before they are three years old.

Questions to ask:

◉ Why am I forced to buy additional capacity (Upgrade Flex Bundles) in order to upgrade my controllers? What if I don’t need more capacity?
◉ How much would it cost to buy a whole new Flash Array system, with only the capacity I require, compared to using the Upgrade Flex program?
◉ Does an Upgrade Flex purchase reset my Free Every Three subscription? (e.g. – If I use Upgrade Flex after two years do I then need to pay for another three years of Evergreen Gold (five total) to get the Free Every Three controllers?)

Upgrade Flex appears to force customers to buy additional capacity, at a significant price premium, if they want to upgrade their controllers within three years of purchasing their FlashArray. Pure tries to make it sound attractive by highlighting that you get trade-in credits for your old controllers, but look carefully at the price of the additional capacity you are forced to purchase in order to trade in your old controllers for new ones. In total, the customer ends up paying for all of the following to get the new controllers:

1. An Evergreen Gold subscription
2. Highly priced additional storage even if the customer doesn’t need it
3. The cost of the new controllers minus a trade-in credit for their current controllers (credit determined by Pure)

Capacity Consolidation – This is Pure’s Evergreen Gold offering for trading in existing installed flash drives for newer higher capacity flash drives, or flash modules.

Questions to ask:

◉ If 25% of the new capacity is the maximum trade-in credit I can get then what should I expect as the ‘normal’ trade in credit amount, and how is it calculated? How do I get the full 25% credit?
◉ How much of a premium am I going to pay for my new storage versus buying the extra capacity at time of original purchase?

Due to the use of an active-passive (backend) controller architecture, Pure’s FlashArray products have a limited number of drives that can be installed without affecting overall performance, which limits their overall scalability. To overcome this, they need a program (Capacity Consolidation) that allows customers to swap in newer, higher-capacity drives for older, smaller-capacity drives. Unfortunately for Pure’s customers, like all the Evergreen Gold-specific features, this comes at a price premium.


In contrast to Pure’s Evergreen Storage program, now a profit center for Pure, Dell EMC offers our recently expanded and enhanced Future-Proof Storage Loyalty Program to provide customers with the peace of mind and investment protection they need to make educated decisions today. There are no hidden fees or additional purchases required to take advantage of the benefits!

Dell EMC’s Future-Proof Storage Loyalty Program is comprised of seven industry-leading benefits:

◉ 3 Year Satisfaction Guarantee – All products in the program now carry a 3-Year Satisfaction Guarantee
◉ 4:1 Storage Efficiency Guarantee – by following Dell EMC’s recommended and straightforward best practices you will get effective logical storage capacity of at least 4X your purchased physical capacity. And there are no complex pre-assessments required
◉ Never-Worry Data Migrations – leverage seamless upgrades and built-in data migration technology, so it’s always easy to upgrade and/or migrate your data without disruption to the business
◉ Hardware Investment Protection – you can trade-in Dell EMC or even competitive systems for credit towards next gen Dell EMC storage or HCI product offerings
◉ All-Inclusive Software – with Dell EMC Storage you will have everything you need to store and manage your data included with product purchase
◉ Built-in Virtustream Storage Cloud – Dell EMC Unity All-Flash customers immediately get the benefit of a built-in hybrid cloud with one year of free Virtustream capacity. You get 20% of your purchased storage capacity free for a year; that’s like getting 20% more storage for free
◉ New! Clear Price – Consistent and predictable support pricing and services for all your Dell EMC storage and data protection appliances

Newly Expanded!


Our Future-Proof program now also includes our Storage and Data Protection platforms!


1. Satisfaction Guarantee: Requires purchase of our standard 3-year ProSupport agreement. Compliance is based on product specifications. Any refund will be prorated.
2. Hardware Investment Protection: Trade-In value determined based on market conditions at Dell EMC’s sole discretion.
3. 4:1 Storage Efficiency Guarantee: Requires customer signature and purchase of ProSupport with Mission Critical.
4. Never-Worry Migrations: Does not include data transfer services. Customer responsible for ensuring data is backed-up.
5. Built-in Virtustream Storage Cloud: Benefit available only with Dell EMC Unity purchase; Free capacity is limited to 20% of purchased storage capacity.
6. All-Inclusive Software: Includes select software needed to store and manage data.

Tuesday, 19 December 2017

Enterprise Integration Portal for One-Stop Tracking of Business Transactions

Tracking integrated business and service transactions across multiple IT systems is important in today’s fast-moving business climate. Being able to track them across two major IT companies (Dell and EMC) that recently merged to form Dell Technologies, one of the largest technology companies in the world, is absolutely vital.

That’s why Dell IT’s recent launch of a visualization tool that lets our business and IT teams use Dell technology to monitor transactions on a single dashboard is a critical step in the EMC/Dell integration.

It is called the Enterprise Integration and Services Business Monitoring Portal (EISBMP), a system that combines cutting-edge Dell technology solutions to bring together all integrated transaction data into a single view where business and support people can see transactions across multiple systems in a comprehensive dashboard.

Accessed via single sign-on, the EISBMP has allowed us to consolidate more than 15 different tracking applications into one. It also lets us showcase our Pivotal platform and a range of other technologies across the powerful Dell portfolio that, combined with open-source apps, create this cloud-native application.

The first phase of this project, which tracks legacy EMC transactions and serves internal IT customers, was launched in May and currently handles 20 million transactions a day. We plan to extend the portal to include legacy Dell over the next three to four quarters, expanding that volume to 120 million per day as our dashboard expands companywide. We also plan to open the EISBMP to our channel partners to allow business-to-business dashboard access in the near future.

The need for a single view

Like most growing organizations, EMC and Dell had amassed an array of systems and apps used to monitor transactions and support requests across various databases. Our business and support users would have to check multiple locations to track a transaction from one phase to the next across our organization.

For example, prior to EISBMP, IT might get a call from a sales representative asking for insights into what is happening with a given sales order. Maybe they already confirmed the transaction data had been processed from the SAP database but didn’t know if it ever reached the third-party order fulfillment system we use to ensure our transactions conform to trade compliance regulations. Or they may be looking for its status in Salesforce.com, a third-party customer relationship management app we use.

There was no good portal that gave sales reps insight into what was happening with a given customer order to keep track of its movement across the many systems and tools.

With EISBMP, sales reps, as well as numerous other business users and support providers, can enter the transaction number to see where their customer order is within the business lifecycle.

Our new centralized dashboard tracks transactions for some 1,500 to 1,600 services that we support. And we expect to support another 3,500 services in the near future.


Bringing it all together

When EMC and Dell merged a year ago, the array of systems and platforms on which we processed business and support transactions became even more complex than the already complicated network of processes each individual company was grappling with.

Around that time, our Cloud Native Interactions (CNI) team, part of Dell IT’s Enterprise Integration Services, took on the job of creating a system that would integrate the various data sources and consolidate the various tools for the legacy EMC network.

Our task was to create a framework that would allow our various systems to work together, to connect to each other and exchange data and messages. The goal was to bring everything together in one dashboard.

We began by collaborating with business and support users to understand their needs and planning the high-level architecture we would need to build such capabilities. We then analyzed the kind of tools and technologies we would need to make it work.

My team decided to use Pivotal Spring Framework to build our portal. Not only did using Spring Framework fit into our practice of using our own technology, but it also provided an open-source product that is easy to use and offers standardization.

Technologies we chose include our own Pivotal Cloud Foundry and Pivotal GemFire, Spring Boot and Spring Cloud microservices.

The result is a single, customizable and easily maintained dashboard that offers better insights into our integrated transaction data. Users can search for transactions faster and see their status all in one place. Replacing 15+ apps with just ONE has resulted in significant cost and resource savings.


We are now working on completing the domain requirement to let our business partners access the EISBMP on a business-to-business basis. We are also working to extend the portal to integrate legacy Dell transaction data so that the dashboard will track transactions across the combined company. And eventually, we will offer mobile capabilities so that team members in the field can tap into such real time tracking information.

Overall, our new tool is enabling business and support teams to stay on top of their day-to-day transactions, track and address issues as they arise, analyze companywide metrics and leverage data to improve their performance and stay ahead of industry trends.

Saturday, 16 December 2017

Driving the Next-Generation Sales Solution on SAP HANA


As EMC IT remains focused on its IT Transformation journey, we continue to take calculated risks in emerging technologies and adopt bleeding-edge solutions — but not without bumps along the way. However, we have found that leveraging a strong partnership with a trusted supplier is a crucial strategy to help smooth the road ahead.

Recently, EMC IT went live with the third release of its SAP implementation, replacing our direct and channel ‘selling’ tools with a suite of software products predominantly within the SAP landscape. This grouping of products – while not currently sold by SAP as a “solution” – needed strong partnership between EMC leadership and SAP, as we cooperated to string them together.

Back in 2003, EMC’s ‘selling’ tools, Direct and Channel Xpress, went live as two separate applications as a conscious choice in a very different selling model. At the time, EMC’s direct sales force was focused on high-end deals with little to no partner interaction. But as the channel grew and the go-to-market model adapted, it became clear that this (EMC+Channel) partnership needed a more tightly coupled application – one that we just couldn’t deliver with two disparate applications.

Choosing the New Solution


The core of the new application is SAP’s CRM (customer relationship management). EMC IT quickly made the decision to run it on the SAP HANA database platform – our first venture to deploy an online transaction processing (OLTP) platform using SAP HANA. The capabilities within SAP CRM for quoting, pricing and customer management were going to be the core of the entire system.

The next major decision focused on the next-generation product configurator. The unlimited configurability of our products and solutions is a major competitive advantage for our sales force, and EMC needed an agile tool to maintain these tens of thousands of rules on a daily basis. We moved forward with a plan to use SAP’s Solution Sales Configurator (SSC) – a relatively new product with only one other production customer at the time.

It quickly became apparent that SSC’s ability to be modular and to configure solutions rather than individual products matched EMC’s vision as our product portfolio continues to grow. But like all new products, there are growing pains as the application evolves. SAP stepped up to the task and partnered directly with EMC – from the senior leadership level down – prioritizing bug fixes directly affecting EMC and enhancing the tool as we found deficiencies in our process.

The last product selected was Vendavo – a leader in the Pricebook and Deal Management space as well as a close partner of SAP. Touting out-of-the-box integrations, Vendavo was a natural fit for us.


The “Wow” Factor


With the main components selected, the focus quickly turned to user experience. Ensuring ease in selling EMC’s products would make the salesforce more efficient as well as help our partners choose to sell EMC instead of one of our competitors.

Our first user-experience step was to enlist the help of SAP’s Custom Development organization to build a tool for easily creating standardized UIs on top of our complex product configurator. Using SAPUI5 technology and partnering with the SSC product team, SAP delivered a robust tool that allows the business to use drag-and-drop technology to create easy-to-use UIs so sales engineers can quickly configure products.

As the project continued, in pure EMC IT fashion, along came a new curveball. SAP had recently purchased an eCommerce market leader – a software platform known as Hybris. EMC had used Hybris sparingly, most notably in its online eStore, but this was a large-scope change very late in the project timeline. Again, partnering with SAP allowed us to get our hands on early releases of its integration roadmap using SAP’s RFC jCo connections. A stateful connection (one where the client stores and re-uses the backend session) between Hybris and the SAP CRM backend allowed us to use Hybris as simply the front-end window into the quote while still taking advantage of Hybris’ core capabilities – especially its versatile product catalog, which EMC uses extensively for search and enrichment of products. EMC IT then reached out to Pivotal Labs – an EMC federation company that had done numerous user experience projects. Their design services approach, a skill that most IT shops simply don’t have, led to major improvements and efficiencies in laying out the ‘price quoting’ user experience.

Stability and Performance


Tightly coupling these software products and ensuring consistency among them is the main challenge for us as we look to stabilize the solution. Our UI strategy strives to make each disparate system feel seamless so users don’t realize they are switching to a completely different tool each time. This raises the need for real-time data consistency across all applications. Using a mix of direct web services and our Application Integration Cloud (AIC) technology (a custom EAI layer built on Pivotal SpringSource) for real-time integration, data flows each time a price quotation is saved (which is done often to avoid data loss).

SAP has partnered with us again, sending out their brightest minds from Germany to optimize the solution. With a mix of short-, medium- and long-term changes planned in both the configuration and pricing area, the goal is to not only improve the end-to-end experience but optimize business processes and the user experience.  A range of changes are slated for Q1 (improving the run times and handling quote saves more efficiently) with much larger enhancements in mind for the coming quarters and beyond.

Thursday, 14 December 2017

The Analytics Journey Leading to the Business Data Lake

More than ever, businesses see their futures tied to their ability to harness the explosive growth in data. You may even be familiar with the Business Data Lake concept—a central repository of vast information which can be used across an enterprise to drive all business intelligence, advanced analytics and even, eventually, intelligent applications.

We, at EMC IT, are in the process of creating a Business Data Lake, and I will be sharing insights about our efforts in this blog. To start, let’s trace the vision that’s leading EMC IT and other businesses to the shores of this new data landmark.


Path to Analytics Maturity

Let’s consider business use of analytics. In today’s competitive and ever-changing marketplace, key questions businesses need analytics to answer to effectively manage operations and strategize for future growth and competitiveness include:

◉ What happened? How is my business doing and what’s its performance level?
◉ Why did it happen? Why did we experience high cost?
◉ What will happen? Where is more potential for market share? Will my customers stay with me?
◉ How can we make it happen? What can we offer by predicting a customer problem and solving it proactively – so we can ensure higher satisfaction and loyalty?

Based on these questions, Gartner, Inc. provides a maturity path for analytics. Essentially, the first two questions above are answered by traditional business intelligence–where business looks back on more operational data to describe and diagnose problems. This is mapped to descriptive and diagnostic analytics. For example, a sales analytics solution provides a view of current and historical opportunities and bookings, end of quarter reports, etc.

As data matures into information and provides optimization opportunities through advanced analytics, the business begins to look forward to predict and act on opportunities and change.

Thus, the “what will happen” and “how can we make this happen” questions are answered with advanced analytics, which predict, explore and open up new opportunities for growth. For example, there is a solution available that helps sales organizations predict the sales conversion and close risks of a potential customer deal. The solution collects a broad range of opportunity data and runs it through an analytical model to identify key win drivers. This is a great tool for the sales force to focus on the right deals and ensure closure to increase revenue. Thus analytics serves to predict and manage market opportunities.

This is the transition from traditional business intelligence to advanced analytics.

Along the same line, today’s applications are also moving towards more data-driven capabilities and utilizing analytics at their core. This is a major architectural shift from traditional applications using embedded analytics. For example, customer profile analytics can drive various customer-centric applications to provide more customized, profile-based interactions and outcomes.

Our Journey with Business Data Lake

At EMC, we have grown tremendously, progressing from business intelligence capabilities via our global data warehouse to the world of advanced analytics using Business Analytics as a Service (BAaaS) and EMC Connected Proactive Services (ECPS) – two large-scale data stores. Along with high-performing end-of-quarter reporting and real-time sales analytics, we are exploring data-science-driven advanced analytics solutions to identify new product opportunities, standardize our product configuration in the most effective way, and develop models for optimizing sales deals. Yet, we believe we are just scratching the surface of Big-Data-driven advanced analytics.

Key challenges we face in moving toward more innovation in advanced analytics areas include:

◉ Siloed data assets serving pockets of analytics.
◉ Current data assets sit on aging platforms with limited technology support and cannot be expanded.
◉ Limited consolidation of all structured and unstructured data together.
◉ Limited data and knowledge sharing across analytics users.
◉ Data gathering slowing innovation.
◉ Insufficient governance and data quality management.

We continue to strive towards building a consolidated, more scalable data lake platform where we can empower innovation through agility and more collaboration, providing access to all enterprise data, machine data, vendor data and social media data in one logical space. This will help realize our vision of a business data lake driving all business intelligence, advanced analytics and, eventually, intelligent applications via one logical data lake.


Our data lake is one logical data platform with multiple tiers of performance and storage levels to optimally serve various data needs based on Service Level Agreements (SLA). It will provide a vast amount of structured and unstructured data at the Hadoop and Greenplum layers to data scientists for advanced analytics innovation. The higher performance levels powered by Greenplum and in-memory caching databases will serve mission-critical and real-time analytics and application solutions.

With more robust data governance and data quality management, we can ensure that authoritative, high-quality data drives all of EMC’s business insights and analytics-driven applications using data services from the lake. Thus we will move towards the next level of analytics maturity with:

◉ A more scalable Big Data platform with various performance levels serving deep data science innovation to real-time analytics.
◉ Tools to share data and analytics quickly among groups.
◉ Faster innovation through easy and self-service data availability.
◉ Workflow-driven robust data governance.

Our Business Data Lake implementation is a multi-phase journey. Our first-round implementation in July consolidated major large-scale structured and unstructured data for heavy-duty self-service analytics and enabled the business to deliver innovative solutions with quick turnaround. With this foundational Data Lake setup, we have experimented with various new technology and architecture concepts. We have made great strides with creative solutions. Along the way, we have also faced the challenges and limitations of evolving Big Data technologies.

Tuesday, 12 December 2017

The Power of Self-Service Big Data

From using analytics to predict how our storage arrays will perform in the field, to engineering product configurations to best meet customers’ future needs, EMC is just beginning to tap into the gold mine of intelligence waiting to be extracted from our new data lake.

In fact, we are currently working on dozens of business use cases that are projected to drive millions in revenue opportunities. And we are just scratching the surface. There’s a lot more data available, more to be harvested, and more analytics to be built out as data scientists and business users hit their stride in exploring a new era of data-driven innovation at EMC.

EMC IT embarked on creating a data lake to transition from traditional business intelligence to advanced analytics more than two years ago. A key focus of this effort was to address the fact that data scientists and business users seeking to leverage our growing amount of data were stifled by the need for such projects to go through IT, which was a costly and slow process that discouraged innovation.

We now have the foundation and tools in place to use data and analytics to create sustainable, long-term competitive differentiation. To get here, we worked closely with EMC affiliate Pivotal Software, Inc. to mature together and leverage the multi-tenancy capabilities of their Big Data Suite.

Building the Lake


Here are some highlights of our data lake journey.

We began our effort to create a scalable but cost-effective foundation for EMC’s data analytics by building a Hadoop-based data lake to store and process EMC’s growing data footprint. Hadoop runs on commodity hardware and scales linearly, making it less expensive to scale than a traditional database appliance. It also runs on EMC’s industry-leading IT Proven technologies, including XtremIO, ScaleIO, Isilon and Data Domain, to enable enterprise capabilities such as built-in name node fail-over, replication, storage efficiency, disaster recovery, backup and recovery, snapshots, and the ability to scale out compute and storage separately.


Once the lake was operational, we used Pivotal Spring XD, a data integration and pipelining tool, to orchestrate batch and streaming data flows into the data lake. It didn’t take long for the data lake to exceed the size of EMC’s legacy global data warehouse. Today, it is more than 500 terabytes and continues to grow.

With the foundation in place, our team then turned our attention to building analytics capabilities. The most important requirement was that whatever tools and technologies we chose, they must provide self-service capabilities so users could easily spin up new analytics projects without having to involve IT. We wanted self-service capabilities that allowed users to easily identify the data sets they needed, to integrate new data sources, to create analytical workspaces to blend and interrogate data, and to publish the results of analysis for collaboration with colleagues.

We chose to use a mix of technologies centered around an EMC-developed framework for Data API/Services based on Pivotal Cloud Foundry (PCF) and Big Data Suite (BDS) that enables seamless interface with the data lake.

For the analytics itself, we chose Pivotal Greenplum, a massively parallel processing analytical database that is part of Pivotal’s BDS. Users bring their desired data sets into an analytics workspace powered by Greenplum, where they can run different styles of analytics—including machine learning, geospatial analytics and text analytics. They can visualize the results with the tools of their choice. EMC’s data scientists primarily use MADlib, R and SAS to develop and run algorithms and predictive models inside Greenplum, while business users tend to use business intelligence tools like Tableau and Business Objects.
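As a rough illustration of this in-database style of work (a sketch only: it assumes Greenplum connection details, a MADlib installation in a schema named madlib, and hypothetical table and column names):

    import psycopg2

    conn = psycopg2.connect(host="greenplum.example.com", dbname="analytics",
                            user="data_scientist", password="...")
    with conn, conn.cursor() as cur:
        # Train a linear regression model inside the database with MADlib.
        cur.execute("""
            SELECT madlib.linregr_train(
                'sales.opportunities',          -- source table (hypothetical)
                'sales.deal_size_model',        -- output model table
                'deal_size',                    -- dependent variable
                'ARRAY[1, discount, age_days]'  -- independent variables
            );
        """)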

Finally, users can publish the analytical results via a data hub. This significantly shortens time to insight since users can build on one another’s work rather than constantly starting from scratch.

Self-Service Tools Empower Users


EMC is already reaping the benefits of its new data lake and analytics capabilities.

One such opportunity involves log data created by EMC storage arrays and other products as they operate in the field. With its new Hadoop-based data lake, EMC is now equipped to ingest, store and process this log data, which is then analyzed to predict and prevent problems before they occur and impact the customer.


Analysis of log data might reveal, for example, that a particular component in a customer’s storage array is likely to fail in the next 8 to 12 hours. With that insight in hand, EMC support can reach out to the customer and take steps to prevent the component failure before it disrupts important business processes. Our support and sales folks are at the door before a problem happens.

This preventative maintenance capability, which would not be possible without EMC’s new data lake and agile analytics capabilities, results in higher levels of customer satisfaction and customer loyalty, which has a direct impact on EMC’s bottom line. It also helps EMC’s engineers determine the optimal product configurations for various scenarios and use cases, as well as provides valuable insights as they develop new products and services.

Our cutting-edge Big Data analytics should yield many more such opportunities now that the data lake enables our innovators to experiment and explore at their own pace. After all, data scientists and business users no longer need to go through IT when they want to start a new analytics project. Instead, they log into the data hub and use self-service tools to identify potentially valuable data sets for analysis. They can bring in their own data or outside vendor data and mesh it with our enterprise data.

With the IT bottleneck out of the way, everyone is very excited and they feel very empowered to tap into and explore new data analytics frontiers.

Thursday, 7 December 2017

Building a Software Defined Data Center: Automation, Orchestration and Agility

Despite the emergence of IT as a Service and the rise of self-service catalogues, most IT operations—including EMC’s—have remained largely manual when it comes to filling users’ requests for networking, storage and compute, struggling to keep pace with growing demand. Until now, that is.

EMC IT is in the process of rolling out a new set of tools, based on a combined approach to infrastructure and automation that will reduce the time it takes to fill customers’ infrastructure demands from months to days or even hours.

The new production environment uses EMC’s Federation Enterprise Hybrid Cloud (FEHC) management platform on VCE Vblock™ converged and hyper-converged infrastructure to provide the abstraction of hardware through software. Translation: IT clients will no longer have to come to the IT infrastructure team every time they need a new environment or an additional server. They can self-provision these services using a truly automated portal and with a standardized set of components.


EMC IT is initially launching the new FEHC model to a limited number of internal users in IT, starting with providing non-mission-critical cloud services to Cloud Infrastructure Services. We will then progress to providing mission-critical, middleware and database services. The rollout is a first step towards our planned launch of a software defined data center that is slated to automate and orchestrate 50 percent of our 30-petabyte data center workload by year’s end.

With this new approach, our internal IT customers will be able to provision a new virtual machine (VM) in an hour rather than the three weeks it took previously using the traditional, manual IT process—certainly something that most organizations’ IT operations are striving to achieve.


But how did we get here and why should it matter to your organization?

Industry-wide, the need for IT automation and orchestration is recognized as key to achieving the speed and agility that IT users require. Among the goals of automation:

1. Reducing deployment time. How long customers wait for service impacts your organization’s return on investment (ROI) and service level agreements (SLAs). Reducing manual labor is crucial.

2. Reducing complexity. Traditional manual IT processes can lead to the proliferation of home-grown processes with a variety of machines stemming from doing things differently each time. With automation, we strive for standardized processes using vendor-based application programming interfaces (APIs) to provide for machine-to-machine communications.

3. Creating standard and repeatable processes. This is a hard one. You have to go through the task of identifying your standard processes so you can repeat them. People are quite used to dealing with manual processes such as spreadsheets and emails. When I come in, create a block diagram of the processes and identify APIs that will reduce manual touch points, it is a big adjustment for them.

4. Creating a self-service process. Once you have achieved the previous three goals, you gain the ability to scale quickly, create use cases, onboard more customers faster and create an overall better customer experience. Now they can go to a portal and request a catalog item with just a few clicks.

5. Getting people to adopt the new model. We can build automated services but if no one uses them, they are worthless. That is the challenge EMC IT is currently facing—capturing the minds of our users to change the culture toward automation. Users and IT experts need to trust that the automation process will build out what they need instead of relying on a person as they always have.

These goals cover why automation is needed, but how would your organization get started building the greenfield environment and then making the transformation to a software defined data center?

In our case, this involved transcending the traditional siloes of IT services and collaborating on defining user needs. For the past six months, EMC IT subject matter experts in automation, network, storage and compute have been working together to create software and services that will automate the way we manage IT infrastructure.

We began with a week-long planning session to identify our use cases and create a log of what services we were going to need in order to roll out this new model to our internal IT customers. We needed architects from the four groups to be on the same page regarding what the transformation to automation and orchestration would look like.

Next, we created an agile work stream, using a DevOps approach in which software people (like me) worked directly with our counterparts in compute, network, and storage to develop software and operations processes. Teaming up with data center operations, we started with a smaller, pilot-stage of DevOps where we created services, reviewed them with our internal customer to get feedback, and then made changes.

It was an evolutionary, agile process for the development — prioritizing services and developing them in a series of sprints. At the same time, the network, storage and compute teams were working on their own architecture. We worked in parallel and bridged their processes where needed. Our goal was to publish fully compliant Windows 2012 Core, Windows 2012 GUI and Red Hat Linux VMs, as well as ViPR Software Defined Storage, which we’ll be offering to our initial customer group in Q3.

We are building out a new environment in our Software Defined Data Center for EMC IT where all the automated requests for VMs, storage and networking will be handled. We plan to use this distributed model to integrate all the different software APIs and then orchestrate the code library, removing the need for manual processes.
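As a conceptual sketch of what the machine-to-machine, catalog-driven request model looks like from a consumer’s point of view (the endpoint, payload fields and token below are hypothetical, standing in for the vendor APIs exposed through the portal):

    import requests

    payload = {
        "catalog_item": "redhat-linux-vm",   # a standardized catalog offering
        "cpu": 4,
        "memory_gb": 16,
        "storage_gb": 200,
        "environment": "non-mission-critical",
    }
    resp = requests.post("https://selfservice.example.com/api/requests",
                         json=payload,
                         headers={"Authorization": "Bearer <token>"},
                         timeout=30)
    resp.raise_for_status()
    print("Provisioning request submitted:", resp.json().get("request_id"))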

The result is similar to building an airplane on a factory line compared to building one by hand: By hand, even a small plane would take a year to build and be difficult to test pre-flight, but if you build it using a mile-long hangar and automated assembly, you can create a much higher quality airplane in 50 days. And you’ve done it using a repeatable process.

It is a longstanding concept that makes sense for today’s fast-moving IT infrastructure needs.

Tuesday, 5 December 2017

Data Science Lessons: Insights from an Agricultural Proof of Concept

Agriculture has come a long way from ancient times through the industrial revolution to the current digital era. In 2017, modern agricultural organizations have access to increasingly large amounts of data collected by sensors, from soil quality measurements and weather stations to GPS-guided machinery and more. According to a recent USDA survey, more than 60 percent of corn and soybean crops are monitored by data collection devices. However, there is still a substantial gap between the potential of utilizing this data and what happens in reality. Despite having the data, many companies lack the capability to effectively process and analyze it, and to efficiently build informative models in order to make data-driven decisions.

That’s where guidance from data service providers, such as Virtustream, can help. Virtustream provides data management expertise, tools and data science consulting to enable customers across different industries to get value from their data resources.

Our data science team in Dell IT recently initiated a Data-Science-as-a-Service Proof of Concept (PoC) as part of a Virtustream service engagement with a large company that plants thousands of farms across the USA. Virtustream had enabled the company to become more data-driven by harnessing its large amounts of data, as well as by developing and implementing different applications that enable scalable, faster, and more accurate operations – operations that couldn’t be executed with existing tools. Our PoC sought to demonstrate the speed and efficiency of those analytics applications.

Our goal in this PoC was to enable fast and automated execution of seasonal yield predictions for each field in terms of tons per acre using Virtustream cloud services. Such a prediction has a significant business value for the company as it enables efficient resource allocation among the different fields or, in other words, maximal crop yield with minimal costs.
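To make the target concrete, a per-field yield regression of this kind might be sketched as follows (illustrative only: the feature and column names are hypothetical; the actual internal and external data sources used in the PoC are described below):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    fields = pd.read_csv("fields.csv")    # one row per field and season (hypothetical file)
    features = ["soil_type_code", "plant_age", "fertilizer_kg_per_acre",
                "rainy_days", "mean_temp", "ndvi_mean", "prev_year_yield"]

    model = RandomForestRegressor(n_estimators=300, random_state=0)
    scores = cross_val_score(model, fields[features], fields["yield_tons_per_acre"],
                             cv=5, scoring="neg_mean_absolute_error")
    print("Cross-validated MAE (tons/acre):", -scores.mean())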

In a three-week sprint, we delivered a model that is at least as accurate as, and much faster than, the existing model (which runs on a single laptop). The model calculated the predicted yield of a given field based on internal data sources (soil type, fertilizers, plant ages, etc.) and external public sources like satellite images and historical weather data. This short-term engagement highlights four points that we find worth sharing with readers:

Data Science as a part of a bigger solution


Cloud and storage services serve as tools to achieve what the customer really wants – business value. A key way to illustrate that business value to the customer is via Data Science PoC outputs that provide concrete proof of what can be achieved. We ask the customer to share a dataset (or a fraction of it), and after several days we deliver concrete evidence of the potential ROI that can be gained with the proposed solution. In the discussed PoC, we proved that using the proposed Virtustream cloud infrastructure, a prediction model that used to take several days to create can be built in less than 30 minutes.

There is a fine line between Small Data and Big Data


‘Big Data’ has become one of the most widespread buzzwords of this decade. But what is Big Data? Is it a 2GB dataset, 100GB or 1TB? From our perspective, the answer is very clear: any data processing task that requires a scalable programming paradigm to be completed in a reasonable amount of time is in the “Big Data zone.” If possible, we always prefer to develop with standard tools (like simple R or Python scripting), as this usually results in shorter and clearer code that requires minimal configuration effort. But what happens when you start with relatively simple procedures that take several seconds and, after three days of development, find yourself waiting more than an hour for one procedure to finish?

When things get rough, we want to shift our code to the Big Data zone as quickly as possible. In this PoC, we started to develop our solution with standard Python libraries. When we wanted to integrate image processing data into our model, it became clear that we needed to shift our solution to the Big Data zone. Since we worked in a very scalable environment, this shift was very easy. We quickly configured our environment so it could run the existing code in Big Data mode (using PySpark), allowing more complicated procedures to be integrated within it.
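The shift itself can be fairly mechanical when the environment is ready for it. A minimal sketch of the same aggregation written with standard tools and then in the Big Data zone (the file and column names are hypothetical):

    # Standard-tools version (pandas): fine while the data fits in memory.
    import pandas as pd
    daily = pd.read_csv("weather.csv", parse_dates=["date"])
    rainy_days = (daily[daily["precip_mm"] > 1.0]
                  .groupby("field_id")["date"].count())

    # Big Data zone version (PySpark): the same logic, scalable across a cluster.
    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.csv("weather.csv", header=True, inferSchema=True)
    rainy_days_sdf = (sdf.filter(F.col("precip_mm") > 1.0)
                         .groupBy("field_id")
                         .agg(F.count("*").alias("rainy_days")))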


Focus on the business problem, not on the dataset


“I have a lot of data. Let’s do Data Science!” may be the first sentence that initiates an engagement around a data science project. But taking this approach for the rest of the project is likely to result in burnout. The first two questions that should be asked before initiating every data science customer engagement are, “what is the business problem?” and “can we solve this problem with the available data sources?” Answering these two questions is often complicated. Sometimes it requires conducting a three-to-five-day workshop that includes brainstorming and data analysis sessions with the customer. However, it is a prerequisite for a successful project. Every data source should be considered when working to solve the business problem.

The business goal that the company presented in this PoC was very clear: predict farm yield in terms of tons per acre. Given this business problem, we were able to search for the relevant datasets and available knowledge, whether they were internal from the customer (soil types, previous years’ yields, planted species, etc.) or external (public weather datasets, satellite images of farms).

Pair mode developing


Delivering a complete solution in three weeks leaves no time for distractions and redundancies. In this PoC, we were working as a pair, so the risk of duplicated work or disconnected silos (pun unintended: not to be confused with grain silos) was even higher. In order to mitigate this risk, it was necessary to deliberately modularize the work. It was very clear from an early stage of this engagement that there were two main tasks in the PoC: building the model pipeline and conducting the satellite image processing.

Building a designated model pipeline is a complicated task in which features are engineered from the given data (for example: number of rainy days in the season, average temperature in the season, previous year’s yield, etc.) and then fed into a model-competition module that chooses the best machine learning model for the given problem (e.g. Random Forest, Neural Network, Linear Regression). On the other hand, satellite image outputs serve as a crucial feature in the model, as the Normalized Difference Vegetation Index (NDVI) and Normalized Difference Water Index (NDWI) are very indicative for yield prediction. Converting an image into a single number like NDVI or NDWI is a non-trivial task as well, so we decided to separate this task from the rest of the development process and combine its outputs into the developed model when ready.
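For reference, the two indices reduce to simple band arithmetic once the satellite imagery has been cleaned and aligned. A minimal sketch (the band arrays are assumed to be co-registered NumPy reflectance rasters clipped to one field):

    import numpy as np

    def ndvi(nir: np.ndarray, red: np.ndarray) -> float:
        """Normalized Difference Vegetation Index, averaged over the field."""
        return float(((nir - red) / (nir + red + 1e-9)).mean())

    def ndwi(green: np.ndarray, nir: np.ndarray) -> float:
        """Normalized Difference Water Index (McFeeters formulation)."""
        return float(((green - nir) / (green + nir + 1e-9)).mean())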

To summarize, in each of our engagements we learn different lessons that sharpen our analytics skills. We share these lessons because we feel it is highly important for us to give a true sense of data science work.