
What’s new in Pentaho 9.1

By Anand Sagar Rao Vala

Hello everyone, welcome to the Pentaho 9.1 GA sales enablement. We are excited to be doing the sales training on the day the product goes GA. This training is the first level of training, primarily targeted to salespeople. There are additional trainings, which dive deeper into the technical aspects, targeted towards the sales engineers, professional services and technical support.

My name is Anand Rao. I’m part of the product marketing team. We have Jason Tiret from the product management team here to help answer questions.

So let’s go over the agenda. We will be providing an overview of the Pentaho 9.1 release and going into the details of the three largest capabilities and the benefits customers and prospects will receive from them. We will be providing an update on the Dataflow Manager which, you may recall, is the web-based version of Pentaho Data Integration targeted towards a less technical audience. We will go over the enablement resources that you have, and then we will wrap up with a summary along with Q&A. Please enter questions as we go along and we will address them in batches.

Let’s start with the overview.

There are three new capabilities that we want to highlight in this release. The first is the support for Google Dataproc, which is Google’s Hadoop and Spark cluster in its cloud environment. This expands our multicloud support footprint and accelerates data integration projects while reducing management cost and complexity.

The second is the integration between Pentaho Data Integration and Lumada Data Catalog, creating a powerful combination of our products to increase the productivity of data producers and consumers. This integration reduces the data discovery time and increases the pipeline resiliency, boosting data producer productivity.

Third, we have streamlined the upgrade process from Pentaho 9 to 9.1. This decreases the testing and production downtime, cutting IT maintenance time and costs during upgrades and thus encouraging customers to upgrade their environments to the latest release.

Let’s dive deeper into the first capability: support for Google Dataproc, which will help our customers migrate their on-premise environments to the Google cloud and reduce the management cost and complexity.

Let’s look at some market trends identified by IDC. The total size of the Hadoop market is roughly 4 billion dollars. This is roughly 8% the size of the overall database market. In terms of growth, the cloud Hadoop market is growing five times faster than the on-premise Hadoop market, at roughly 25% YoY. To top it all, datastores on the Google Cloud Platform grew 60% last year, which is very impressive. We want to tap into that growth and ensure that our customers and prospects doing these migration projects are successful.

Here are the different datastores we support on each of the three big cloud vendors. The type of store is mapped by the position. As you can see, our coverage is greatest for Amazon, where we support all their data stores, followed by Google Cloud, where we support their data warehouse BigQuery, their relational database Cloud SQL, their object storage Cloud Storage, as well as their Hadoop and Spark environment Dataproc.

Let us look at how one of our lighthouse customers, HSBC, has started to use Dataproc for its Spark and Hadoop capabilities, along with BigQuery. The digital finance group within HSBC is currently using PDI for orchestration in Hortonworks with a metadata-driven approach. They have 200 nodes. HSBC, as part of its digital transformation, is moving to GCP and wants to move the existing PDI processes to the cloud. Their data volumes are increasing, they want to leverage Spark, and they want to maintain some level of cloud-agnostic design to support a multicloud strategy in the future. Right now, they are focused on Dataproc and BigQuery, and we now support both these datastores. They anticipate this migration project to last two years and they plan to transition their processing from MapReduce to Spark. They plan to utilize our AEL Spark capability and store their files in Avro and Parquet formats in Google Cloud Storage.

Let us briefly look at what the competition is doing for Dataproc support. Our traditional competitor Informatica is providing Dataproc support in their big data management platform. Talend offers YARN client mode support, which means it can be difficult to implement. Dataproc itself has automatic integrations with other stores in Google’s ecosystem.

We differentiate by promoting our multicloud support for AWS, Google Cloud and Azure. With strong Hadoop support, we can get customers into production faster and offer an end-to-end data analytics platform. With our adaptive execution layer (AEL), customers can choose which engine to use at runtime without making changes to their pipeline, while our competitors must code to specific processing engines.
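To make the runtime engine choice concrete, here is a minimal Python sketch of the idea: one pipeline definition, with the engine picked by name when it runs. Everything in it is a hypothetical stand-in for illustration (the engine names, `execute`, and the row-level steps); it is not the actual AEL interface.

```python
from typing import Callable

RowStep = Callable[[int], int]  # one row-level transformation step


def run_local(steps: list[RowStep], rows: list[int]) -> list[int]:
    """Stand-in for a single-process engine: apply every step to every row."""
    out = rows
    for step in steps:
        out = [step(row) for row in out]
    return out


def run_partitioned(steps: list[RowStep], rows: list[int], partitions: int = 2) -> list[int]:
    """Stand-in for a cluster engine: process contiguous partitions independently."""
    size = max(1, -(-len(rows) // partitions))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    return [row for chunk in chunks for row in run_local(steps, chunk)]


# Hypothetical engine registry; the names are illustrative only.
ENGINES = {"local": run_local, "partitioned": run_partitioned}


def execute(steps: list[RowStep], rows: list[int], engine: str) -> list[int]:
    """Run the same pipeline definition on whichever engine is named at runtime."""
    return ENGINES[engine](steps, rows)


# The pipeline is defined once and never mentions an engine.
pipeline = [lambda r: r * 2, lambda r: r + 1]
rows = [1, 2, 3, 4, 5]

local_result = execute(pipeline, rows, "local")
partitioned_result = execute(pipeline, rows, "partitioned")
```

The point of the sketch is that `pipeline` is identical in both calls; only the engine argument changes, and both runs produce the same rows.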

So here is what you need to look for while prospecting. Look for cloud migration or data modernization projects involving GCP where they are modernizing their databases, data warehouses or data lakes. Look for customers who have recently received huge AWS cloud bills and are looking to avoid lock-in with the help of GCP. As part of the new architecture, they are considering a variety of Google data stores, including the object store, the data warehouse, as well as the Spark clusters.

Now let us look at the second exciting capability: reducing data discovery time and increasing pipeline resilience with the data catalog integration, resulting in a boost to data producer productivity.

Here are some statistics from Gartner and Forrester that show that data discovery is hard and delays data pipeline projects. A Forrester survey found that close to 70% of firms found curating discovered data time-consuming, delaying further data processing. Gartner found that by using a curated catalog of data, organizations get twice the business value from their analytics investments. Another Forrester survey found that enterprises with machine-learning-driven catalogs see a 50% increase in their big data use, outstripping their competitors.

Here is a high-level view of the change. When creating data pipelines, data engineers had to specify the location of the data source as well as the name of the data source. Now in 9.1 they simply need to provide the business name of the data set they are looking for in the input step of the data pipeline. This greatly reduces their work of finding the data source and keeping track of it.

Let’s dive a little deeper. The challenge data engineers face is that there are several manual steps and multiple user interfaces they need to use in order to locate and securely access data. It was difficult for them to individually search and read the metadata in a variety of files saved on premise and in the cloud. Also, given the volume of data coming into the data lake, it took a long time to register new objects with standalone, unintegrated catalogs. Now, with this deep product integration between the catalog and data integration, the time to discover data, onboard it, and later prepare it for further processing decreases tremendously. The pipeline is now resilient to data source location changes and format changes, as the catalog registry abstracts the location of the data stores from the pipeline. The metadata can now be easily read and searched for use cases that span archiving old files and retrieving tags for further processing and visualization.
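The abstraction described here can be illustrated with a small Python sketch. It assumes a toy in-memory catalog keyed by business name; `Catalog`, `CatalogEntry`, `input_step`, the data set name, and the paths are all hypothetical and are not the Lumada Data Catalog API or the actual PDI catalog steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Physical details the catalog tracks for one data set (hypothetical model)."""
    location: str
    file_format: str


class Catalog:
    """Toy in-memory catalog keyed by business name."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, business_name: str, location: str, file_format: str) -> None:
        # Re-registering a name overwrites the old entry, modelling a source move.
        self._entries[business_name] = CatalogEntry(location, file_format)

    def resolve(self, business_name: str) -> CatalogEntry:
        return self._entries[business_name]


def input_step(catalog: Catalog, business_name: str) -> str:
    """Pipeline input step that knows only the business name, never the path."""
    entry = catalog.resolve(business_name)
    return f"reading {entry.file_format} from {entry.location}"


catalog = Catalog()
catalog.register("Customer Orders", "hdfs://onprem/orders.avro", "avro")
before_move = input_step(catalog, "Customer Orders")

# The data set moves to cloud storage in a new format; only the catalog changes.
catalog.register("Customer Orders", "gs://example-bucket/orders.parquet", "parquet")
after_move = input_step(catalog, "Customer Orders")
```

Because the input step refers only to "Customer Orders", the move from the on-premise path to the cloud bucket needs no change to the pipeline itself, which is the resilience being described.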

Let us look now at our traditional data integration competitors who have catalogs.

Informatica offers an integrated data catalog with machine learning driven data discovery. It has a large market presence. It is strong in adjacent areas of data management including master data management and data quality.
However, it has an antiquated architecture with several services to run the catalog and multiple interfaces for administrators and data stewards. It leverages its market leader status to be very expensive. The catalog itself has no ability to learn from manual curation resulting in lower accuracy.

Looking at Talend, they have bolted on their catalog to their data integration platform. They have a large market presence in the cloud. They have a per-user, per-month pricing model, which appeals to smaller customers.
Their bolt-on architecture requires multiple steps to discover and tag the data. Their metadata discovery capabilities focus on highly used metadata. Because their pipelines are code-driven, they lack flexibility in abstracting their data sources. And they cannot visualize data in their transformation steps the way we can with our out-of-the-box visualization.

So when you are prospecting, look for frustration over wait times to access onboarded data. Data engineers and data consumers could be waiting for data for further preparation and analytics while the value of the data expires with time. Data engineers could be dissatisfied with their pipeline resiliency, and they could be facing frequent breakdowns due to changes in the source location or source formatting.

This is an opportunity to cross-sell the catalog to PDI customers and increase the deal size with prospects by selling the combination. You now can talk about patented AI capabilities within the catalog and talk about how faster data discovery will speed up data onboarding and data integration efforts.

Please remember this allows you to sell the power duo, if you will, of the Lumada DataOps Suite, namely PDI and LDC.

The third big message is to decrease testing and production downtime during upgrades to cut IT maintenance time and cost.

The challenge that many of our customers face is that upgrades take a long time and they create production failures. The customizations that they have made in current releases are not maintained after upgrade.
The solution that we offer with the streamlined upgrade will provide them painless, reliable, and faster upgrades, and rollbacks if something goes wrong. We offer them a pre-upgrade check and upgrade only the components of our software that are installed. We will maintain a whitelist of customizations that they have made so that all those customizations are retained after upgrade. We will persist all plugins that they have added to their installation, as well as maintain a whitelist of all their database connectors.

This upgrade utility and the extensive documentation on using this utility are available on our help website and they can always reach support if they have questions.

So when you’re prospecting existing customers who are stuck in old releases, look for pain points where they are hesitating to upgrade due to customizations they’ve made, plugins they’ve added, or additional connectors that they are relying on. They could be waiting for many smaller disjointed fixes before deciding on an upgrade. Also, they could be planning a data modernization or a data migration project by leveraging the GCP cloud. The opportunity here is that many pending support issues can be streamlined and fixed. You have an opportunity to promote multicloud messaging with the public cloud options that they are looking at. And you have an opportunity to promote data discovery capabilities to accelerate data onboarding and data integration projects.

We continue to invest in the community edition of our product. We offer the non-secured version of Google Dataproc support in the community edition. However, we do not offer the upgrade utility or the four catalog steps in the community edition.

Here is a quick overview of the Dataflow Manager. It will be in limited availability in December. The goal of the Dataflow Manager is to empower analysts with pipeline templates and then allow analysts and engineers to collaboratively monitor the performance and health of data pipelines.

We have BNY Mellon as a lighthouse customer, and it has one of the largest datasets in the financial world, giving us some truly unique perspectives. They are a former Informatica customer who is now using Pentaho everywhere. They’re looking to offer business users self-service dataflows. They want granular access to deploy the dataflows, monitor the jobs and kill resource-consuming jobs when needed. The business user takes the dataflow, customizes it with parameters and submits the job, which is executed in the mainframe environment. They use Splunk today to access the logs because Pentaho did not do a good job of making the logs usable. They would like to use Spark in the future.

The end result is that they would like to lower operational costs by improving operational efficiency and administration productivity while at the same time having better governance.
The feedback from BNY Mellon has allowed us to improve the user interface for all these activities of kicking off data pipelines, monitoring them, and viewing the logs after they have completed.

The Dataflow Manager is expected to have limited availability on December 7th. As you are prospecting, look for data democratization and self-service initiatives. These could be combined with issues with monitoring and managing execution schedules of pipelines. They may also be looking to make processing capacity more dynamic by moving to the cloud. The opportunity is to pitch self-service data integration by empowering more engineers and analysts and removing IT as a bottleneck. If you have interested customers or prospects, please contact PM for further qualification and next steps.

So for 9.1, we have the following enablement resources available.

Our website is getting a refresh and that is the landing page. We have a customer webinar on November 12th. The webinar will be after the marketing launch on November 10th, where we will launch the Lumada DataOps Suite. Registration for the webinar is already open, so you can drive customers and prospects to register. We will have an FAQ coming up along with a few blogs. We will have updated data sheets coming shortly, between now and November 10th. We will make a reference deck available with selected slides from this deck. And you can point customers to the 9.1 documentation.

So help us promote the Pentaho 9.1 webinar on your social networks. Share one of our launch blog posts from Twitter, LinkedIn or Facebook on the Pentaho channel or the Hitachi Vantara channel.

In summary, there are three main messages. Customers and prospects can accelerate data integration projects across multiple clouds, including in the private cloud, while reducing management costs and complexity with the Google Dataproc integration. They can reduce data discovery time and increase pipeline resilience, boosting productivity, with the data catalog integration. They can decrease testing and production downtime during upgrades, hence cutting IT maintenance cost.

Now I will open it up for any other questions.
