

Loading Data to Wave Analytics

Authors: Terence Wilson, Rachana Gogate, Umair Rauf - Analytics Cloud ASG #AsktheACExperts

Overview

Data Integration (DI) combines data residing in different, often disparate, sources and provides users with a unified view of this data. A robust and scalable DI strategy provides consistent access to, and delivery of, data across multiple data sources (e.g. finance, HR, sales, marketing, IoT, mobile, logs, etc.) to meet the needs of analytic applications and business processes.

Salesforce Analytics Cloud provides a range of flexible options for users to load, transform, analyze and collaborate on data to accelerate their analytics journey. Here we list the architecture, design and implementation options to Extract, Transform and Load data (ETL / ELT / EtLT) to the Wave platform. Product documentation, declarative syntax and step-by-step directions are available in the Wave product documentation.


Enterprise Data Architecture

A Data Architecture is composed of the models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and organizations. Companies use data architecture to standardize the management of their data assets. Modern analytics platforms have allowed business users to convert data into actionable information, driving IT analysts and enterprises to introduce Information Architectures.

The foundational elements in these architectures include:

  • Data Sources: data originates from sources, which typically include transaction systems or systems-of-record. More recent data sources also include enriched data from analytic systems, machine-generated data (e.g. sensors), geo-spatial, social and log data.
  • Data Integration to process the data and build a unified view (Integrated Data layer, Unified Data layer, data marts or data lakes).
  • Data Access to provide power users or end users access to this data (Semantic layer).

The figure below shows a visual representation of an information architecture:

ASGLoadDatatoWaveImage17.png
Figure: a High-level Information Architecture


Building an integrated data layer is traditionally a significant organizational commitment for both Business and IT; this drives the need for a well structured data management program for longer term success.


Enterprise Data Management (EDM) or Enterprise Information Management (EIM) is an established discipline in the analytics space which helps companies manage their data assets. Gartner defines EIM as “an integrative discipline for structuring, describing and governing information assets across organizational and technological boundaries to improve efficiency, promote transparency and enable business insight”. EDM or EIM covers a broad range of disciplines to manage data and information, including data architecture, data integration, data quality, data security, data compliance, master data management and metadata management.

Wave Analytics loads data from Salesforce or external sources, making it easy to connect with an organization’s information architecture:

ASGLoadDatatoWaveImage19.png
Figure: Salesforce Wave flexibly fits into any Information Architecture


Data Integration using Wave

Salesforce Wave models data through an EtLT process. Data is extracted from the source, may undergo light transforms (e.g. filters and schema changes), and is then loaded to the Wave platform. Additional transforms are performed on the Wave platform (e.g. augment, append, etc.); for a full list please see this link. After completing transformations, the load process can “Register” a dataset, making it accessible to Wave users. A Wave dataset is a collection of related data stored in a denormalized form, housing a combination of dimensions, measures and date attributes. Datasets are stored as highly compressed key-value pairs, with dimensions represented as an inverted index and measures represented in a columnar-like structure. Wave datasets have a schema-free design, enabling them to scale with new dimensions or measures without disrupting existing applications.

ASGLoadDatatoWaveImage12.png
Figure: EtLT flow for Data using Wave


Data loading options

Wave provides a scalable Dataflow to pull data natively from Salesforce (data that lives in a Salesforce object) and an External Data API which provides options to load data from Salesforce or external sources into a Wave dataset.

Here are some options to load data into Wave:

Figure: Options to load data to Wave Analytics


  1. The Dataset Builder visually constructs a Dataflow JSON to digest and augment data from Salesforce objects. Any constructed JSON is appended to the Org’s main Dataflow.
  2. You can load CSV formatted files using Wave’s dataset creation UI, the Microsoft Excel Connector for Wave, the Wave dataset utility or Informatica Rev.
  3. You can custom-code data loading logic using Wave’s External Data API. An open-source reference implementation is provided here. You can write your connector on the Force.com platform using Apex for Salesforce data, or use popular languages like Java, Python and others to build custom connectors for external data sources (e.g. Hadoop).
  4. Finally, you can use tools from our leading ETL partners to pull data into Wave. This gives you the flexibility to connect with a wide range of data sources (e.g. SAP, ERP, etc.).These tools also offer extensive data mappings and transformation libraries, which can speed up the integration effort. (Recommended For External Data)


Figure: Access the dataset builder or CSV upload UI using the “Create” button in Wave


ASGLoadDatatoWaveImage23.png
Figure: Wave Create Dataset page in Salesforce Winter 16 release


Wave Dataflow

The Dataflow is an integrated feature in Wave which allows native data loads from multiple Salesforce objects or existing datasets. Each node in the Dataflow JSON represents a transformation step. The entire workflow graph is defined by the Dataflow JSON, which is loaded into Wave using the Data Monitor screen.

ASGLoadDatatoWaveImage05.png
Figure: Wave provides a native Dataflow, to extract data from Salesforce objects

Here is a summary of the transformations using the Dataflow.


Extract and Load your Data

  • sfdcDigest: Extracts data from Salesforce objects. sfdcDigest provides the ability to filter the data being extracted from Salesforce and to apply special field properties (e.g. load fiscal dates). With the Spring ‘16 release, some new features were added to sfdcDigest to optimize the time a dataflow takes to run (a minimal sketch follows below):

	Incrementals: With incrementals, you can configure the dataflow to extract only updated records from Salesforce.

	Replication: Replication runs as a separate dataflow. Salesforce objects are replicated in Wave instead of being extracted during every dataflow run; your dataflow’s digest transformation then picks up the replicated object instead of extracting it again from Salesforce.

  • edgemart: Loads an existing Wave dataset for further transformations. The dataset may have been created earlier from Salesforce data or external data.

Transform your Data

  • augment: Augments an input dataset by combining columns from another related dataset, e.g. you can augment an opportunity with account data using an Account ID relationship, so you can search your opportunities by various attributes on the Account object.
  • flatten: Builds a multi-value field to flatten hierarchical data, e.g. a product hierarchy, manager hierarchy or role hierarchy.
  • append: Combines records from multiple datasets into a single dataset. The transformation works best for datasets with the same columns.
  • filter: Removes records from an existing dataset based on filter criteria.
  • delta: Calculates the change in value of a measure column in a dataset over a period of time. This transformation can help you create trending in your data, e.g. changes in opportunity value over a period by Stage.
  • dim2mea: Creates a new measure attribute for data loaded as a dimension, e.g. you may want to load ‘account tenure’ or ‘customer age’ as both a dimension and a measure. This lets you group by the age and also apply range filters or calculate the average age.
  • update: Updates a field in one dataset based on data from another dataset.
  • computeRelative: Additional transformation available with the Spring ‘16 release. Allows computations that span consecutive rows. This transformation should have an sfdcDigest transformation as its source node; for example, calculating the days between two dates to determine the number of days an opportunity spends in one stage. Note that this transformation currently does not work with incrementals; you need a full extract in your dataflow.
  • computeExpression: Additional transformation available with the Spring ‘16 release. Creates a new field in the dataset based on a row-level operation using a SAQL-like expression. Operates only within the row.

Register Your Data

  • sfdcRegister: Registers a new dataset or overwrites/updates an existing dataset (the dataset alias being the unique dataset name), while updating key dataset properties, e.g. folder, security predicate, etc.

Product documentation describing how to use the Dataflow, schedule jobs, check status, and apply Wave transformations is available at the following link.
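To make the sfdcDigest and dim2mea entries above concrete, here is a minimal three-node sketch. The node names and the Account_Tenure__c field are hypothetical, the "incremental" flag reflects our understanding of the Spring ‘16 incremental extract syntax (and assumes replication is enabled), and the dim2mea parameter names (dimension, measure, measureDefault) follow our reading of the transformation reference; verify the exact parameter names against the product documentation linked above.

{
  "Extract_Account": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "incremental": true,
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "Account_Tenure__c" }
      ]
    }
  },
  "Tenure_To_Measure": {
    "action": "dim2mea",
    "parameters": {
      "source": "Extract_Account",
      "dimension": "Account_Tenure__c",
      "measure": "Account_Tenure_Years",
      "measureDefault": "0"
    }
  },
  "Register_Account": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Tenure_To_Measure",
      "alias": "account_tenure",
      "name": "Account Tenure"
    }
  }
}

The dim2mea node keeps the original dimension and adds a parallel measure, so you can both group by tenure and compute averages on it.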


Here is a visual representation of a Dataflow and the sequence of activity on the Wave platform:

ASGLoadDatatoWaveImage14.png

1) The ‘sfdcDigest’ and ‘edgemart’ steps produce a Dataset that is stored locally on a Wave worker node. The sfdcDigest step gets data from core Salesforce objects to create a Dataset. The edgemart step loads an existing Dataset and makes it referenceable within a Wave Dataflow.

2) The “augment” is a transformation that takes two datasets as inputs and generates a third Dataset. This Dataset is just stored locally on the Wave worker node.

3) The “filter” transformation is applied on the dataset Ext_Data to keep only the records which meet the filter criteria.

4) The “delta” transformation calculates the difference in a measure value across consecutive records. This transformation is valuable for understanding the change in a measure value across dates for a combination of dimensions.

5) A dataset isn’t available for exploration by users in the Wave explorer or Mobile until it has been registered. Registering a Dataset makes it available to Wave users. (A minimal JSON sketch of this flow follows.)
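As a minimal sketch of this flow (assuming an existing dataset registered with the alias Ext_Data and hypothetical Status__c, StageName and Amount columns; the filter expression follows the Field:Operator:Value pattern used later in this document, and the delta parameter names follow our reading of the transformation reference and should be verified there):

{
  "Load_Ext_Data": {
    "action": "edgemart",
    "parameters": { "alias": "Ext_Data" }
  },
  "Filter_Ext_Data": {
    "action": "filter",
    "parameters": {
      "source": "Load_Ext_Data",
      "filter": "Status__c:N:Inactive"
    }
  },
  "Delta_Amount": {
    "action": "delta",
    "parameters": {
      "source": "Filter_Ext_Data",
      "dimension": "StageName",
      "epoch": "week",
      "inputMeasure": "Amount",
      "outputMeasure": "DeltaAmount"
    }
  },
  "Register_Trend": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Delta_Amount",
      "alias": "ext_data_trend",
      "name": "Ext Data Trend"
    }
  }
}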


Choosing your ETL option

To choose the right data strategy for loading data into Wave, there are many considerations:


  1. Are you loading the data one-off to quickly get a development effort off the ground or is there a scheduled process required for wider deployment?
  2. What are the primary data sources, e.g. files, Salesforce objects, 3rd party databases, 3rd party applications and/or published web APIs?
  3. Are there any calculations, aggregations, complex lookups or projected fields required?
  4. Is a daily refresh sufficient or is there a need for intraday incremental loads?

The following table provides options to weigh against Salesforce data loading requirements:

ASGLoadDatatoWaveImage24.png

X: indicates viable option; 〄: indicates a potential option (higher effort, manual intervention or specific situations)

The following table provides options to weigh against External data loading requirements:

ASGLoadDatatoWaveImage25.png


Common Data Loading Use-Cases

There are a number of ways to combine these options when loading data into Wave; a few examples are provided below:


Data from Salesforce: Sales and Service Cloud

Requirement: Load data from Salesforce: Accounts, Opportunities, Opportunity Lines, Products and Cases. You can load this data using the Wave Dataflow. Dataflow development is faster if you build the first iteration using the Dataset Builder UI, although you will need to build the flow in multiple steps. Based on your data shape, you could plan the grain at Opportunity Lines (which gives you additional product detail) or at an aggregated Opportunity level. Initially, you may want to pull in both before doing further transformations.
  1. Use the Dataset Builder with Opportunity object and pull additional relationship attributes from Accounts, Owners, and Product (if maintained at opportunity level)
  2. Use the Dataset Builder with Opportunity Lines object and augment additional attributes from Accounts, Opportunity, Owners, and Product
  3. Use the Dataset Builder with Cases as a grain and augment with Account and Users
  4. Edit the Dataflow JSON in a file editor and consolidate any repeated nodes, e.g. the Account, User and Opportunity objects are pulled multiple times in the JSON generated during steps 1, 2 and 3. If you use the Dataset Builder to first create the Opportunity flow (with Accounts, Owners and Products) and then the Opportunity Lines flow, you may notice additional digest nodes being created to extract Opportunity, Account and Owner again, which means the Dataflow extracts the same object twice. Consolidate this by removing the duplicate digest node and pointing all references at the single remaining digest node per object, as discussed in the #WaveDataBuilder section, and ensure that all fields required by the subsequent augments are extracted in that digest step. (A sketch of the consolidated pattern follows this list.)
  5. Use the Dataflow to apply the security predicates (a registration sketch follows the figure below).
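Here is a minimal sketch of the consolidation described in step 4: a single Account digest node feeds both the Opportunity branch and the Case branch, instead of each branch keeping its own copy. The node names are illustrative, and the Opportunity and Case digest nodes are omitted for brevity.

{
  "Extract_Account": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Account",
      "fields": [ { "name": "Id" }, { "name": "Name" }, { "name": "OwnerId" } ]
    }
  },
  "Augment_Oppty_Account": {
    "action": "augment",
    "parameters": {
      "left": "Extract_Opportunity",
      "left_key": [ "AccountId" ],
      "relationship": "Account",
      "right": "Extract_Account",
      "right_key": [ "Id" ],
      "right_select": [ "Name" ]
    }
  },
  "Augment_Case_Account": {
    "action": "augment",
    "parameters": {
      "left": "Extract_Case",
      "left_key": [ "AccountId" ],
      "relationship": "Account",
      "right": "Extract_Account",
      "right_key": [ "Id" ],
      "right_select": [ "Name" ]
    }
  }
}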
ASGLoadDatatoWaveImage08.png
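For step 5, a security predicate can be applied when the dataset is registered. The sketch below assumes a simple owner-based predicate and that rowLevelSecurityFilter is the sfdcRegister parameter used for predicates (as we recall from the security documentation); confirm the parameter name and predicate syntax before use.

{
  "Register_Oppty": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Augment_Oppty_Account",
      "alias": "sf_opportunity",
      "name": "SFOpportunity",
      "rowLevelSecurityFilter": "'OwnerId' == \"$User.Id\""
    }
  }
}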


Data from Salesforce - Sales and Service Cloud Augmented with External Data

Requirement: Load data from Salesforce (Accounts, Opportunities, Opportunity Lines, Products, Cases) and load external data for Orders (SAP), Revenue (data warehouse) and Expenses (Concur) to support a Revenue Scorecard and Pipeline Analysis.


  1. Load external data using an ETL tool from the data warehouse, SAP and Concur. Perform the required transformations in the ETL tool to load a denormalized dataset from these sources. (Note: you can get started quickly in development by manually loading external data using CSV. Once all ETL jobs are complete, you can replace the manual load process with the ETL process, reusing the load schema file.)
  2. Load data from Salesforce using the Dataflow. An example of building the Dataflow using the Dataset Builder is provided at #WaveDataBuilder.
  3. Use the Dataflow to augment external data with Salesforce’s common matching attributes (e.g. account, users, cases) to analyze across datasets. Detailed instructions to download and edit the Dataflow are available here.
  4. Use the Dataflow to apply any security predicates.
ASGLoadDatatoWaveImage04.png
Alternate Implementation: Transform the data from Sales Cloud, Service Cloud and the external sources using an ETL tool, then load it into Wave through the ETL tool’s connector built on the External Data API. The disadvantage of this approach is that Salesforce data traverses outside the Salesforce platform.


Data from Multiple Salesforce Orgs

  • Match accounts between the source and destination Org to find any matching accounts; you also have the option to use the Global Ultimate Business Number to find all linked accounts in each Org
  • You can digest data from Salesforce objects in the destination Org into Analytics Cloud using the Wave native Dataflow
  • Use the Dataflow to apply the security predicate


Please review ASG’s Multi-Org Guide to look at other consolidation scenarios for multi-org data.
ASGLoadDatatoWaveImage10.png


Analyze Salesforce data in combination with Social Data

Requirement: Social media signal data provides additional insight into social sentiment around a company’s products and/or the company as a whole. Marketing and Support teams may want to analyze sentiment on channels like Twitter and directly take action by creating a case or opportunity from this social data against products within Salesforce.


Many ETL tools today provide direct connectivity with APIs from leading social media sites, including Twitter. We recommend using one of these ETL tools to load social media feeds, since they offer pre-built connectors and enable tokenization of the data feed (e.g. tweets) to compute social sentiment. You can pull this data from Twitter against a pre-defined set of Twitter hashtags, e.g. customer product tags or company-level tags. You can also define specific tags on an account or product record in Salesforce and match these with the external data in the ETL tool (this provides a strong match between Salesforce data and social data).


  • Use the ETL tool to extract data from the Twitter API and apply any sentiment analysis logic to tweets. Augment details about which product these tweets may belong to or if these are company-wide social signals.
  • Use the native Dataflow to pull objects from Sales and Service Cloud.
  • The Dataflow can help augment any data loaded externally from Twitter with Product data in Salesforce (if not already done in the ETL tool), as sketched below. It is better to do incremental loads on the dataset to pull only the most current tweets.
  • Apply data security as needed using predicates.
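A minimal sketch of this flow, assuming the ETL tool has already registered a tweet dataset with the alias Twitter_Sentiment containing a Hashtag column, and that products carry a hypothetical Social_Tag__c custom field used for matching:

{
  "Load_Tweets": {
    "action": "edgemart",
    "parameters": { "alias": "Twitter_Sentiment" }
  },
  "Extract_Product": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Product2",
      "fields": [ { "name": "Id" }, { "name": "Name" }, { "name": "Social_Tag__c" } ]
    }
  },
  "Augment_Tweet_Product": {
    "action": "augment",
    "parameters": {
      "left": "Load_Tweets",
      "left_key": [ "Hashtag" ],
      "relationship": "Product",
      "right": "Extract_Product",
      "right_key": [ "Social_Tag__c" ],
      "right_select": [ "Name" ]
    }
  },
  "Register_Social": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Augment_Tweet_Product",
      "alias": "social_sentiment",
      "name": "Social Sentiment"
    }
  }
}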
ASGLoadDatatoWaveImage16.png


Salesforce Marketing Cloud provides the Radian6 social monitoring platform, which enables monitoring of social channels. The tool provides the capability to export data to CSV, can help you determine sentiment, and lets you load this data into Wave for analysis.


Load Product Usage Data from Hadoop and Analyze it against Sales Data

Requirements: The customer wants to explore the use of their products at the feature level across customers (data maintained in Hadoop) and look this up against data from Salesforce.
There are various implementation options to load data from Hadoop, e.g. Sqoop, Pig / Hadoop Get / Storm, ETL connectors, etc. You can pre-process results and then make that file available to load into Wave. Typical scenarios would be to measure usage and add a risk-of-churn field, or to analyze product-mix usage and suggest the propensity to buy the next product or feature, and then load this data to augment Account data in Wave.
  • Transfer the pre-processed file from Hadoop to a file system. Sqoop or Hadoop file system commands are possible options for this.
  • Upload the data to Wave using an ETL tool, the Wave UI or the Dataset Utilities. We discuss the last two options in detail in this document.
  • Use the native Dataflow to pull in data from Salesforce objects (e.g. Accounts, Cases, etc) and augment with externally uploaded data.
ASGLoadDatatoWaveImage01.png
Figure: Customer use case specific data ingestion architecture examples for Wave



Designing your datasets in Wave

Factors to consider while designing Wave datasets

Historically, databases have stored data in a row / column format which is maintained in a normalized fashion using a star-schema, snowflake schema or third-normal form. In Wave, it is recommended that the data be denormalized at its raw data grain to gain performance benefit from Wave’s inverted index technology, highly compressed data storage and indexing techniques.


Wave Dataflow transformations are designed to facilitate the rapid denormalization of data. A key part of the dataset technology is not only fast query time, but the ability to execute these transformations at high speed.


ASGLoadDatatoWaveImage20.png


Here are some design considerations while building datasets.


  • Use Wave as an exploration tool to arrive quickly at the final shape of your datasets. Exploring the data itself can help analyze data shape and drive qualified conversations with data owners to clarify business requirements.
  • Identify candidate data dimensions and measures which are important for the use case.
  • Denormalize your data by augmenting related data from other objects so you can flexibly analyze your data across multiple business dimensions. Here are some denormalization tips:
  • Identify your data grain to address the business questions.
  • Create your dataset at the lowest grain level.
  • A single dataset makes it easy to rapidly build lenses and dashboards.
  • Using multiple datasets to manage varied levels of grain requires dashboard builders to ramp up on advanced dashboarding techniques, including selection binding, results binding and/or SAQL co-grouping, to pull data from various marts into common lenses, dashboards and applications.
  • In most cases, we recommend not augmenting a measure from a higher-level data grain to a lower-level data grain. This can cause an aggregated metric, like opportunity amount, to repeat for every record in the corresponding detailed data, like opportunity lines. For scenarios where this is unavoidable, use the XMD to hide the aggregated measures, e.g. “hide_dimensions”: [“api_dimension_1”, “api_dimension_2”], “hide_measures”: [“api_measure_1”, “api_measure_2”].
  • Flatten any role hierarchies and master/child relationships using the flatten transformation (a sketch follows this list).
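As a sketch of the flatten tip above, here is how a role hierarchy could be flattened into a multi-value field. The node names follow the common UserRole pattern, and the flatten parameter names (self_field, parent_field, multi_field, path_field) are our recollection of the transformation reference; verify them there before use.

{
  "Extract_UserRole": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "UserRole",
      "fields": [ { "name": "Id" }, { "name": "Name" }, { "name": "ParentRoleId" } ]
    }
  },
  "Flatten_Roles": {
    "action": "flatten",
    "parameters": {
      "source": "Extract_UserRole",
      "self_field": "Id",
      "parent_field": "ParentRoleId",
      "multi_field": "Roles",
      "path_field": "RolePath"
    }
  }
}

The resulting Roles multi-value field can then be augmented onto other datasets, e.g. to support hierarchy-based filtering or security predicates.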


Sales cloud dataset modeling

Let's take a Sales Cloud example to see how these modeling recommendations can be applied on Opportunity Data.


Scenario 1: Opportunity Analysis

ASGLoadDatatoWaveImage03.png


This example analyzes the opportunity data and augments additional identifying dimensions for each opportunity, including the account it belongs to, the campaign it was part of, the owner it belongs to, etc. The entities above use 1-to-1 relationships, making use of the augment transformation to pull additional identifying information onto the Opportunity object.


Scenario 2: Opportunity Analysis with Product Details

Now, if you want to determine which products an opportunity relates to, there are two implementation scenarios: i) the product information may exist in a multi-select picklist field on the Opportunity object, or ii) the opportunity may have line items which map multiple products to an opportunity using the Opportunity Product object.


Reporting in scenario i) is easy with the above flow, but for ii) we need to qualify the use cases further before finalizing the design. If the requirement is simply to find opportunities by a specific product, the Dataflow documented below augments the Product Name from Opportunity Product onto the Opportunity object. An important operation you will notice in step “109” is “LookupMultiValue”, which collapses all values from the Opportunity Product object (on the right) into a multi-value field on the Opportunity object.


{ 
  "100": {
    "action": "sfdcDigest",
		"parameters": {
          "object": "Opportunity",
                "fields": [ .... ]
     } 
  },

  "101": {
    "action": "sfdcDigest",
		"parameters": {
          "object": "Account",
                "fields": [ ....]
     }
  },

  "102": {
	"action": "sfdcDigest",
		"parameters": {
          "object":  "User"
                "fields": [ .... ]
     } 
  },

  "103": {
	"action": "sfdcDigest",
		"parameters": {
          "object":  "Campaign",
                "fields": [ .... ]
     } 
  },

  "104": {
	"action": "augment",
		"parameters": {
          "left":   "100",
          "left_key": [ "AccountId" ],
          "relationship": "AccountId",
          "right":   "101",
          "right_key": [ "Id" ],
          "right_select": 
                  [ 
                    "Name", 
                    "Region__c", 
                    "OwnerId" 
                  ]
     } 
  },

  "105": {
	"action": "augment",
		"parameters": {
          "left":   "104",
          "left_key": [ "AccountId.OwnerId" ],
          "relationship": "AccountId.OwnerId",
          "right":   "102",
          "right_key": [ "Id" ],
          "right_select": [ "Name" ]
     } 
  },

  "106": {
	"action": "augment",
		"parameters": {
		  "left":   "105",
		  "left_key": [ "CampaignId" ],
		  "relationship": "CampaignId",
		  "right":   "103",
		  "right_key": ["Id" ],
			"right_select": 
			       [
			        "Name", 
			        "IsActive", 
			        "EndDate", 
			        "ActualCost", 
			        "Type", 
			        "Status" 
			       ]
     } 
  },

  "107": {
	"action": "augment",
		 "parameters": {
			"left":   "106",
			"left_key": ["OwnerId"],
			"relationship": "OwnerId",
			"right":   "102",
			"right_key": [ "Id" ],
			"right_select": ["Name" ]
     } 
  },

  "108": {
	"action": "sfdcDigest",
		"parameters": {
			"object":  "OpportunityLine",
			"fields": [ .... ]
     } 
  },

  "109": {
	"action": "augment",
		"parameters": {
			"left":   "107",
			"operation": "LookupMultiValue",
			"left_key": ["Id"],
			"relationship": "OpportunityLine",
			"right":   "108",
			"right_key": [ "OpportunityId" ],
			"right_select": ["Name" ]
     } 
  },

  "110": {
	"action": "sfdcRegister",
		"parameters": { 
			"source":   "109", 
			"alias": "sf_opportunity", 
			"name": "SFOpportunity" 
	 }
  } 
}

Some customers like to see an ‘Opportunity Amount by Product’ breakdown. This requires an alternate design: the recommendation is to use two datasets, i) an Opportunity dataset and ii) an OpportunityProduct dataset, and then combine these in the dashboard.

Append the following flow to your Dataflow above (a JSON sketch follows the figure).

ASGLoadDatatoWaveImage22.png
Figure: Keeping data at Opportunity Product grain.
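Since the figure above is an image, here is a minimal sketch of the additional nodes. It reuses node “108” (the line-item digest) and node “107” (the fully augmented Opportunity) from the Dataflow above, so make sure node “108” extracts the line-item fields you need (e.g. OpportunityId, Quantity, TotalPrice) and that node “100” extracts Name, StageName and CloseDate. Amount is deliberately not selected from the Opportunity side, per the grain advice below; the selected field names are illustrative.

{
  "111": {
    "action": "augment",
    "parameters": {
      "left": "108",
      "left_key": [ "OpportunityId" ],
      "relationship": "Opportunity",
      "right": "107",
      "right_key": [ "Id" ],
      "right_select": [ "Name", "StageName", "CloseDate" ]
    }
  },
  "112": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "111",
      "alias": "sf_opportunity_product",
      "name": "SFOpportunityProduct"
    }
  }
}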

A few things to watch out for in this Dataflow:

  • While using the Dataset Builder, ensure each object (e.g. Account, User, Campaign, Opportunity) is digested only once by consolidating digest steps.
  • As suggested earlier, taking a measure to the wrong level of grain is not recommended, e.g. if you choose to augment ‘Amount’ from Opportunity to Opportunity Product, the same ‘Amount’ for an opportunity repeats on every line item, so sum(Amount) double-counts. You could use avg(Amount), but users need to know to do so. It is recommended not to take measures to the wrong grain, or to hide these fields using extended metadata (see the XMD sketch after this list).
  • In the dashboard, you will have to bind the two datasets together using selection binding and/or results binding between steps from the two datasets. The following document gives more details on bindings in dashboards.
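A minimal user XMD sketch for hiding the repeated measure, using the hide_dimensions/hide_measures keys shown earlier in this document (the field API names are illustrative and assume the augment relationship is named “Opportunity”):

{
  "hide_dimensions": [ "Opportunity.AccountId" ],
  "hide_measures": [ "Opportunity.Amount" ]
}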


Importing Salesforce data to Analytics cloud

Using Wave Dataset Builder

The Dataset Builder builds an ETL workflow for importing Salesforce data into Wave. The documentation for creating a dataset using dataset builder is available here.


A few considerations while using Dataset Builder:

  • For each object in your Dataset Builder graph, select the standard and custom fields that you would like to import. By default no fields are selected, so at least one field has to be selected in each object.
  • Certain objects are currently not supported. Refer to this document for a list of unsupported objects.
  • The Dataflows are internally executed by the ‘Analytics Cloud Integration User’ (a default user profile created at the time of enabling analytics). If you select a field that this user does not have access to, your workflow will fail with an error message: [root cause: Field XYZ is not available]
  • The Wave dataset builder appends new digest and transformation nodes into an existing Dataflow file. Always download the latest version before making any further modifications and uploading those to the Dataflow. Use the following link to download and edit the Dataflow.


Data transformations in Analytics cloud

In addition to importing data into Analytics Cloud, Wave can also transform the data. This link lists all available transformations, and this next link gives details on how to create a workflow using these transformations.


Here are some examples to help understand the workflows in more detail.


Example Case Study : Working with Salesforce Data

Background

Wave is a great tool for analyzing Salesforce data in Salesforce. Wave can pull in data from various Salesforce objects - Users, Opportunity, Account, Events, Tasks, etc. The data can then be transformed (augment, filter, etc) to create a single dataset. This document from the product team gives us a list of unsupported objects.


Requirements

In this use case, we want to analyze Tasks, Contact, Opportunity and User together to answer questions such as: “How many Tasks do I have for a particular opportunity?”, “Who owns a particular task?”, etc. In this example, we are using “Task” as the left table and augmenting with Contacts, Opportunity and User (sequentially) as the right tables.


Implementation details

Here is the Dataflow we used for this:


Dataflow JSON

Explanation

{
 	"Task": {
		"action": "sfdcDigest",
		"parameters": {
			"object":  "Task",
			"fields": [
			  {
				"name": "AccountId"
			  }, 
			   ....
			  {
				"name": "OwnerId"
			  }
			]
		}
 	},
	"Contact": {
		....
	},
	"Oppty": {
		....
	},
	"User": {
		....
	},

Extract the objects you need from Salesforce. You need to specify the following:

  1. action: sfdcDigest - Specifies that this node pulls data in from Salesforce
  2. object: The Salesforce Object you need to pull from
  3. fields: list of all the fields you need from that object.

Refer to the “Using Wave Dataset Builder” section for common errors you might encounter.

The nodes for Contact, Oppty and User are similar to the Task node. These nodes now hold the data contained in the corresponding objects.

	"TaskContact": {
		"action": "augment",
		"parameters": {
		"left": "Task",
			"left_key": [ "WhoId" ],
			"relationship": "TaskContact",
			"right": "Contact",
			"right_key": [ "Id" ],
			"right_select": [	"Name" ]
		}
	},
	

Augment task and contact nodes. The purpose of this augment is to bring in Contact details such as “Name”, etc.

“WhoId” field from the “Task” node gives the contact ID. We lookup the “Contact” node using this as the “left_key” field. We check matches based on this key and “right_key” from Contact node. In the “right_select” section, you can specify all the fields you need to pull in.

Note: Augment is different from a join. It pulls in only matched fields from the right object.

	"TaskContactOppty": {
		"action": "augment",
		"parameters": {
			"left": "TaskContact",
			"left_key": [ "WhatId" ],
			"relationship": "TaskContactOppty",
			"right": "Oppty",
			"right_key": [ "Id" ],
			"right_select": [ "Name" ]
		}
	},
	

Augment the previous result with Opportunity data.

	"TaskUser": {
		 "action": "augment",
		 "parameters": {
			"left": "TaskContactOppty",
			"left_key": [ "OwnerId" ],
			"relationship": "TaskUser",
			"right": "User",
			"right_key": [ "Id" ],
			"right_select": [ "Name" ]
		}
	},
	

Augment previous result with User data.


	"Register": {
		"action": "sfdcRegister",
		"parameters": {
			"source": "TaskUser",
			"alias": "Task_Final",
			"name": "Task_Final"
		}
	}
}

Create a dataset with the output from the augments.


Example Case Study : How long has a case been in each status?

Requirements

In this use case, we want to understand all the statuses a case has been through, the number of days spent in each status, etc. The CaseHistory object stores all the status changes in the fields ‘OldValue’ and ‘NewValue’, along with ‘CreatedDate’ for the date the change occurred. We need to subtract the dates across rows to get the days in each status.


Implementation details

Here is the detailed dataflow for this:


Dataflow JSON


Explanation

{
	"Extract_Case": {
		"action": "sfdcDigest",
		"parameters": {
	        "fields": [
					{	
						"name": "Id"
					}, 
					....
			],
        "object":  "Case"
		}
	},


	"Extract_CaseHistory_Status": {
		 "action": "sfdcDigest",
		 "parameters": {
		 	"filterConditions": [
							{
								"field": "Field",
								"value": [ "Status" ],
								"operator": "IN"
							}
			],
		    "fields": [
							{
								"name": "CaseId"
							},
							{
								"name": "CreatedDate"
							},
							{
								"name": "Field"
							},
							{
								"name": "Id"
							},
							{
								"name": "NewValue"
							},
							{
								"name": "OldValue"
							}
			],
			"object":  "CaseHistory"
		}
	}
},

The difference from the sfdcDigest transformation in the previous example is the additional filterConditions parameter.

The CaseHistory object stores various changes to cases, such as status changes, owner changes, etc. As we are interested only in status changes, we add this condition.

We need the Case object as well because the CaseHistory object is only populated when there is a “change” to the case; it is not populated when a case is first created.

	"computeDatePrevious_Status": {
		 "action": "computeRelative",
	     "parameters": {
			"partitionBy": [ "CaseId" ],
			"orderBy": [
				{
					"name": "CreatedDate",
					"direction": "asc"
				}
			],
			"source": "Extract_CaseHistory_Status",
			"computedFields": [
			{
				"expression": {
						"sourceField": "CreatedDate",
						"default": null,
						"offset": "next()"
						},
					"name": "CreatedDateNext",
					"description": "Next Created Date",
					"label": "Next CreatedDate"
				},
			{
			"expression": {
						"sourceField": "CreatedDate",
						"default": null,
						"offset": "previous()"
						},
					"name": "CreatedDatePrev",
					"description": "Previous Created Date",
					"label": "Previous CreatedDate"
				}
			]
		}
	},

To find the difference between dates across rows, we use computeRelative. Here is how it works:

  1. The data is partitioned by ‘CaseId’ and sorted in ascending order by ‘CreatedDate’; see the “partitionBy” and “orderBy” parameters.
  2. We then compute fields for the next CreatedDate and the previous CreatedDate. computeRelative adds these fields to the dataset.

The purpose of calculating the previous CreatedDate as well is to identify whether this was the first status change.

	"Join_CaseHistoryCase": {
	 	 "action": "augment",
	     "parameters": {
			"right_key": [ "Id" ],
			"left": "computeDatePrevious_Status",
			"left_key": [	"CaseId" ],
			"right_select": [
					"IsClosed",
					"ClosedDate",
					"CreatedDate",
					"Duration__c",
					"Product_Family_KB__c",
					"Origin",
					"Reason",
					"Customer_Type__c",
					"Priority"
			],
			"right": "Extract_Case",
			"relationship": "Case"
		}
	},

This is a standard augment. The reason for bringing in fields from the Case object is to track the first change in status.

	"computeCurrentDate": {
		"action": "computeExpression",
	    "parameters": {
		"source": "Join_CaseHistoryCase",
		"computedFields": [
					{
					"format": "yyyy-MM-ddTHH:mm:ss.SSS&Z",
					"name": "CurrentDate",
					"saqlExpression": "now()",
					"label": "CurrentDate",
					"type": "Date"
					}
		],
		"mergeWithSource": true
		}
	},

Here we use the computeExpression transformation, which adds a field as defined in “saqlExpression”.

“now()” gives us today’s date. We need this to find out the number of days the case has been in its last status.

	"computeOldStatusTime": {
		"action": "computeExpression",
		     "parameters": {
				"source": "computeCurrentDate",
				"computedFields": [
						{
						"name": "Status",
						"saqlExpression": "'OldValue'",
						"label": "Status",
						"type": "Text"
						},
						{
						"defaultValue": "0",
						"precision": 18,
						"name": "Duration",
						"saqlExpression": "case when ('CreatedDatePrev' is null) then ('CreatedDate_sec_epoch' - 'Case.CreatedDate_sec_epoch')/86400 else ('CreatedDate_sec_epoch' - 'CreatedDatePrev_sec_epoch')/86400 end",
						"scale": 2,
						"label": "Duration",
						"type": "Numeric"
						}
				],
				"mergeWithSource": true
			}
	},

Here we use computeExpression again to add fields for Status and Duration (in Days)

  1. Status: ‘OldValue’ from CaseHistory object
  2. Duration: Calculates the number of days a case was in a particular status.

Notes:

  • We need the case statement for Duration to make sure that the duration of the first status is calculated correctly.
  • This transform does not calculate the number of days the case has been in its current status.
	"computeLastStatusTime": {
		"action": "computeExpression",
		"parameters": {
			"source": "computeCurrentDate",
			"computedFields": [
				{
					"name": "Status",
					"saqlExpression": "case when ('CreatedDateNext' is null && 'Case.IsClosed' == \"false\") then 'NewValue' else \"skip\" end",
					"label": "Status",
					"type": "Text"
				},
				{
					"defaultValue": "0",
					"precision": 18,
					"name": "Duration",
					"saqlExpression": "case when ('Case.IsClosed' == \"false\" && 'CreatedDateNext' is null) then ('CurrentDate_sec_epoch' -'CreatedDate_sec_epoch')/86400 else 0 end",
					"scale": 2,
					"label": "Duration",
					"type": "Numeric"
				}
			],
			"mergeWithSource": true
		}
	},

Let us consider an example: If the current status of the case is “Waiting for Customer” we would like to know the number of days it has been in this status.


  1. Status: This gives us the current status of the case. Since we are only interested in the current status, we “skip” the rest of the values.
  2. Duration: Calculate the duration using CurrentDate and the date of the last status change.
	"Filter_skip": {
		"action": "filter",
		"parameters": {
			"source": "computeLastStatusTime",
			"filter": "Status:N:skip"
		}
	},

Filter out the rows where Status is “skip”.


	"Append_AddLastStatus": {
		"action": "append",
		"parameters": {
			"sources": [
				"computeOldStatusTime",
				"Filter_skip"
			]
		}
	},

Combine the results

	"Register_ServiceCaseHistory": {
		"action": "sfdcRegister",
		"parameters": {
			"name": "CaseHistory",
			"alias": "CaseHistory",
			"source": "Append_AddLastStatus"
		}
	}
}

Create a dataset



External file uploads

Dataset requirements while using the Wave External Data API

The dataset loaded using the External Data API requires a few key files associated with the load:


  1. The data file:
  • The data file contains the external data to be loaded, in comma-separated value (CSV) format. It is important that the file adheres to proper CSV format to avoid data loading issues. Detailed CSV specifications can be found in this document.
  • It is helpful to convert the file to UTF-8 prior to load, especially for a file saved directly from Microsoft Excel or exported from another application or datastore.
  • On a Unix/macOS terminal, use the following command for UTF-8 encoding:

iconv -f ISO-8859-1 -t UTF-8 input.csv > output.csv

  • Some tools, including Microsoft Excel, give you an option to save the file as CSV with UTF-8 encoding. Here is a link to convert to the right encoding.
  2. Schema file:
  • This metadata file describes the structure of the data file and is in JSON format. It identifies which columns are dimensions (text), measures (numbers) and date fields.
  • If you are using the CSV upload feature in Wave or the Dataset Utilities, this file is optional when uploading the data; Analytics Cloud will automatically detect the schema.
  • If you need to change the detected schema, you need to upload this file with the correct schema. For example, an all-number user_id field would be detected as a ‘measure’; if you need the field to be a ‘dimension’, you need to explicitly upload this file. Similarly, dates default to calendar dates, and you can specify fiscal date offsets if you want to leverage them. (A sketch of a schema file follows this list.)
  • The schema file can be downloaded using the Datasetutil jar file with the ‘downloadxmd’ option. Further details are in the #Datasetutils section.
  3. Extended metadata file (user XMD):
  • This is an optional file used to change how data is displayed.
  • Architecturally, Wave datasets have multiple XMD files: i) a system XMD file, ii) a user XMD file, and iii) a main XMD file. The system and user XMD combine to build the main.xmd.json file. The system and user XMD are auto-generated when a new dataset is registered. You can overwrite a user XMD by editing a dataset, which results in a new main.xmd.json being generated.
  • For example, if you want a field “Opp.AccountName__c” to be displayed as “Account Name”, you would add this to the XMD.
  • You can also use the XMD to hide fields, set default values for the raw data table, change dimension colors, link to Salesforce objects, set up object actions, etc.
  • The detailed XMD reference is available at the following link.
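As a sketch of the schema (metadata JSON) file, here is the general shape for a CSV with a text ID, a numeric measure and a yyyyMMdd date. The attribute names (charsetName, fieldsDelimitedBy, numberOfLinesToIgnore, fullyQualifiedName, type, precision, scale, format) follow our reading of the External Data Format guide referenced in this document and should be confirmed there; the object and field names are hypothetical.

{
  "fileFormat": {
    "charsetName": "UTF-8",
    "fieldsDelimitedBy": ",",
    "numberOfLinesToIgnore": 1
  },
  "objects": [
    {
      "fullyQualifiedName": "ProductUsage",
      "name": "ProductUsage",
      "label": "Product Usage",
      "fields": [
        {
          "fullyQualifiedName": "ProductUsage.user_id",
          "name": "user_id",
          "label": "User Id",
          "type": "Text"
        },
        {
          "fullyQualifiedName": "ProductUsage.usage_count",
          "name": "usage_count",
          "label": "Usage Count",
          "type": "Numeric",
          "precision": 18,
          "scale": 0,
          "defaultValue": "0"
        },
        {
          "fullyQualifiedName": "ProductUsage.usage_date",
          "name": "usage_date",
          "label": "Usage Date",
          "type": "Date",
          "format": "yyyyMMdd"
        }
      ]
    }
  ]
}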


Uploading using the UI

This link gives the documentation from the product team for uploading a dataset using the UI.


Notes:


  1. The Salesforce UI now allows you to change the schema of an uploaded file. If you need a field to be treated as a date, make sure to select the correct format.
  2. There may be certain date formats that you cannot specify in the UI (e.g. 20150131 - yyyyMMdd). In these cases, you need to use the jar file and the schema file.
  3. The UI has a file size limit of 512MB. For the latest limits, refer to the following link.


Using the Microsoft Excel connector for Wave

The Wave Connector for Excel gives you a great way to get your data into Salesforce Analytics Cloud using Microsoft Excel 2013 or Microsoft Office 365. After installing the Connector, you simply select data in Excel, click Submit, and the Connector does the work for you, importing the data to Analytics Cloud and creating a dataset. Installation details are provided at the following link.


Using Datasets utils jar file

The Dataset Utilities jar is a reference implementation of the External Data API for Analytics Cloud which is covered in more detail in the next section. The API provides a robust capability to build your own data loading connector for Wave, an example implementation of this is shown in a later section.


The Dataset Utilities jar provides methods that interact with various Salesforce external APIs to load CSV files, augment existing datasets, view data in existing datasets, export data, monitor jobs and download data loading errors.


You can find the latest jar file at the following link. Source code and usage details are available at this link.


Features:


  • Used for uploading CSV files.
  • Can auto-generate a schema for the CSV file. You can edit the schema for specific fields; for example, an all-number user_id field would be detected as a ‘measure’, and you would need to edit the schema JSON to make sure it is treated as a ‘dimension’ (text) field.
  • The jar file provides some visibility into job status, and keeps a compressed copy of the CSV and other log files local to your computer.

Usage details:


  • Click on the Datasetutils.dmg file, followed by Datasetutils.app. It will open a webUI to sign in to your org.
  • Console mode: If you want to use the CLI

java -jar datasetutil-<version>.jar --server false


  • You could pass multiple parameters to the jar file at once

java -jar datasetutil-<version>.jar --action <action> --u <username@domain.com> --p <password> --dataset <dataset> --app <app> --inputFile <inputFile> --operation <operation> --endpoint <endPoint>


Here are some of the most commonly used actions from the jar file:


  • Uploading a CSV

java -jar datasetutil-<version>.jar --action <action> --u <username@domain.com> --p <password> --dataset <dataset> --app <app> --inputFile <inputFile> --operation <operation> --endpoint <endPoint>

--action: In this case the action should be “load”

--dataset: Specify dataset name

--inputFile: Specify input file to be uploaded. File should be in csv format

--app: ID of the app, if you want to upload to a specific app. Default private app

--endpoint: The Salesforce SOAP API endpoint (test: test.salesforce.com / prod: login.salesforce.com). Default: prod

--operation: There are four operations available with the load action:

  • Overwrite: The default operation if the parameter is not specified. Overwrites the dataset, or creates a new dataset if it is not present.
  • Append: Adds rows to an existing dataset. If your dataset has a unique primary key, this operation fails; use Upsert instead.
  • Upsert: Update + insert operation (updates existing rows, inserts new rows). Requires a unique primary key (see the field sketch after the notes below).
  • Delete: Deletes specific rows from a dataset that has a unique key. You need to provide a file containing those rows. (There is a separate delete option in the UI which deletes the whole dataset.)

Notes
  1. Make sure the file being uploaded is a proper CSV in UTF-8 encoding.
  2. The only supported field delimiter at present is a comma.
  3. It is possible to append/upsert/delete with a CSV file that does not have headers; the “numberOfLinesToIgnore” parameter in the schema file can be set to 0.
  4. All delimiters within a field must be quoted. For example, if a field ‘RecordCount’ contains 123,456, it would be interpreted as two fields (123 and 456); the value should be quoted as “123,456”.
  5. There should be no dangling quotes. For example, if a field ‘Description’ contains “Dangling”Quote”, the file is not in proper CSV format and will not be processed correctly; it could result in erroneous data being uploaded.
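For the Upsert and Delete operations listed above, the dataset needs a unique row identifier. To our understanding, this is declared in the schema file by flagging a field with isUniqueId (verify the attribute name in the External Data Format guide); a hypothetical field entry would look like:

{
  "fullyQualifiedName": "Orders.Id",
  "name": "Id",
  "label": "Order Id",
  "type": "Text",
  "isUniqueId": true
}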
Download an XMD

This action is used to download schema and XMD files associated with the dataset.

java -jar datasetutil-<version>.jar --action downloadxmd --u <username@domain.com> --p <password> --dataset <dataset> --endpoint <endPoint>

--action: downloadxmd

--dataset: the name of the dataset to download

To use downloadxmd, you need to make sure that you have edit access to that particular dataset.


Limits

Note that the jar file compresses your data file before upload. All the limits mentioned below are for the compressed file.


  1. Single file: 40 GB
  2. Maximum file size for all external data uploads in a rolling 24-hour period: 50 GB
  3. Maximum number of external data files that can be uploaded in a rolling 24-hour period: 50
Some more notes

Make sure you have an up-to-date version of Java; the current jar requires 1.7. This is the error you would get otherwise:

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file


Using Informatica Rev

Informatica Rev is a 3rd party data preparation tool, available in the cloud, which provides a visual and intuitive way to manipulate spreadsheet data loaded from Excel, a Salesforce org or a 3rd party database, and push it directly into Wave using Informatica’s Salesforce Analytics Cloud connector. You can sign up or get more information at this link.


Custom coding for data uploads

Using the External Data API to implement your own custom load connector:

If you wish to implement your own connector to load data into Wave, you can use the External Data API, namely the InsightsExternalData and InsightsExternalDataPart objects. Several ETL vendors’ connectors, as well as the aforementioned DatasetUtils, use the External Data API objects to upload data from their ETL integration jobs. A high-level architecture for the External Data API is provided below.


ASGLoadDatatoWaveImage21.png

The InsightsExternalData and InsightsExternalDataPart Objects

The InsightsExternalData object is used as a control object to configure the metadata for the dataset, such as its name, the data format (CSV) and the update operation (Append, Upsert, Overwrite), as well as to control the update action that triggers the workflow to create the dataset. The initial step is to insert a row and retrieve the InsightsExternalData row ID (parent ID), which is then used by the InsightsExternalDataPart object to maintain the relationship. The InsightsExternalDataPart object holds the actual data, which is loaded into one or more rows in 10 MB chunks, with the parent ID and part number so that the data can be reconstituted into a single dataset.


The External Data API objects are described more fully here: Analytics Guide External Data API Developers Guide. The data format for the CSV file can be found here: Analytics Cloud External Data Format Guide


Approaches to load data into the External Data API Objects

Salesforce provides several approaches to load data into Salesforce objects. We will look at two of the preferred ways to load data into the Wave External Data API:


  1. SOAP API. The SOAP API allows you to perform CRUD actions on Salesforce objects. It provides authentication service and interoperability between platforms.
  2. REST Bulk API. The REST API provides a simpler REST based URL services interface. This particular interface also supports the update of a large number of records at once as well as supporting asynchronous updates.

On the face of it, the REST Bulk API may seem the obvious choice here, as we are likely to be loading a large number of records to create the dataset. However, the InsightsExternalDataPart object already supports a bulk load approach: we chunk the data file into 10 MB parts, and it is those parts that are inserted into the BLOB field of individual rows, so we are not creating a large number of rows corresponding to the rows in the data file, but relatively few. Both approaches use an initial SOAP call to log into the org and return a session ID for the partner connection. Overall, the SOAP approach is less complicated, but you may wish to use a REST interface and an asynchronous data load approach due to your own architectural considerations. Additionally, there seems to be little difference in performance between the two approaches.


The general steps for both approaches are :


  1. Prepare your data in a CSV format. Create a metadata json schema file to describe the structure of the data.
  2. Connect to the Salesforce organisation to create a session.
  3. Initially prepare and configure the upload by inserting a row into the InsightsExternalData with the name of the dataset, the data format and the update operation. This returns a parent ID to pass to the InsightsExternalDataParts.
  4. Split your data into 10Mb parts if the data file is greater than 10Mb. Upload the parts into the InsightsExternalDataParts object maintaining their sequence and relationship to the parent ID.
  5. Trigger the workflow to create the dataset by updating the action on the InsightsExternalData object to ‘process’.
  6. Monitor the InsightsExternalData status to see if the dataset creation has completed successfully.

There is a fully worked and tested example in Java as an appendix to this document; it covers both approaches. It does not, however, monitor the eventual success of the dataset creation, as this would require monitoring the status code of the InsightsExternalData row after the data has been uploaded and the action has been set to ‘Process’; thereafter the workflow takes over.


The DatasetLoaderExample.java class

The example code uses the Web Services Connector ( WSC) jar and the Partner Connector API jar. These provide a java implementation and utility methods to access the Web Services API and REST API. It is suggested to start with the example code as it is directed at the Wave External Data API and is specific and complete, however, several resources that may be of use in understanding the APIs more fully can be found here.


For both SOAP and REST, the example code initially uses a SOAP call to get the partner connection and create the entry in InsightsExternalData. The example also uses a SOAP call to update the InsightsExternalData row at the end of the data upload to process the upload. It is the data part load processes that diverge and use either the SOAP or REST approach. The data parts are Base64 encoded in the BLOB field; both APIs take care of this encoding for you, so there is no need to explicitly encode the data file.

Source Code Download

Download Java Source

Getting the partner connection

The partner connection is obtained using login credentials and the SOAP endpoint. An example endpoint url is https://login.salesforce.com/services/Soap/u/32.0


PartnerConnection partnerConn =null;
ConnectorConfig config = new ConnectorConfig();
config.setAuthEndpoint(aEndpoint);
config.setUsername(aUserName);
config.setPassword(aPassword);
partnerConn =new PartnerConnection(config);
return partnerConn;


Configure the InsightsExternalData object for upload

Insert a row into the InsightsExternalData object with the dataset configuration and retrieve the parent ID for use with the InsightsExternalDataPart object. The object type is InsightsExternalData, and the field format is CSV (Binary is deprecated for the External Data API and should no longer be used). The dataset alias is the unique name for the dataset, the metadata JSON is the JSON schema file, and the operation is Append, Overwrite or Upsert. Initially the action is None so that the workflow is not yet triggered.


SObject sobj = new SObject();
sobj.setType(EXT_DATA_OBJECT);
sobj.setField(FIELD_FORMAT, "CSV");
sobj.setField(FIELD_EDGEMART_ALIAS, aDatasetAlias);
sobj.setField(FIELD_METADATA_JSON,
	FileUtils.readFileToByteArray(aMetadataJson));
sobj.setField(FIELD_OPERATION, aOperation);
sobj.setField(FIELD_ACTION, "None");
SaveResult[] results = aPartnerConn.create(new SObject[] { sobj });
for (SaveResult sv : results) 
		if (sv.isSuccess()) 
			parentID = sv.getId();


Using the Soap API to load the data parts

The data file is gzipped (if it is not already) to compress it and then split into 10 MB parts using the chunkBinary method in the sample code. No Base64 encoding is necessary here, as the SOAP call takes care of the transformation. The file parts list is iterated through, inserting a row into the InsightsExternalDataPart object with the data file part, the parent ID and a sequential part number; this allows the data parts to be reconstituted into the dataset correctly.


	List<File> fileParts = chunkBinary(fileToChunk); // Split the file
	for (int i = 0; i < fileParts.size(); i++) {
		logger.info("Adding data part number : " + (i + 1));
		SObject sobj = new SObject();
		sobj.setType(EXT_DATA_OBJECT_PARTS);
		sobj.setField(FIELD_DATA_FILE,
			FileUtils.readFileToByteArray(fileParts.get(i)));
		sobj.setField(FIELD_INSIGHTS_EXT_DATA_ID, aParentID);
		sobj.setField(FIELD_PART_NUMBER, i + 1); // Part numbers should start at 1
		SaveResult[] results = aPartnerConn.create(new SObject[] { sobj });
		for (SaveResult sv : results)
			if (sv.isSuccess()) {
				rowId = sv.getId();
			}
	}


Using the REST Bulk API to load the data parts

Unlike the SOAP approach, the file is not compressed before being split into 10 MB parts. This is because each data file part is compressed into a zip file with a corresponding manifest file called request.txt; the createBatchZip method is used in the example to do this. The request.txt file is a standard file required by the REST Bulk API to describe the handling of a binary attachment upload; the data file part is uploaded to the InsightsExternalDataPart object as a binary attachment. Each part has a corresponding text file that includes the parent ID (InsightsExternalData ID), the part number and the data file part name. The contents of a request.txt for, say, data file part 5 are shown here:


	InsightsExternalDataId,PartNumber,DataFile
	06VB00000004XTbMAM,5,#InkStat.csv.4.csv

A bulk connection is obtained from the SOAP partner connection created on login. It is this connection that provides the REST Url endpoint


	logger.info("Getting bulk connection");
	ConnectorConfig config = new ConnectorConfig();
	config.setSessionId(aPartnerConfig.getSessionId());

	// The endpoint for the Bulk API service is the same as for the normal
	// SOAP uri until the /Soap/ part. From here it's '/async/versionNumber'
	String soapEndpoint = aPartnerConfig.getServiceEndpoint();
	String restEndpoint = soapEndpoint.substring(0, soapEndpoint.indexOf("Soap/"))
	+ "async/" + API_VERSION;
	config.setRestEndpoint(restEndpoint);
	// This should only be false when doing debugging.
	config.setCompression(true);
	// Set this to true to see HTTP requests and responses on stdout
	config.setTraceMessage(false);
	BulkConnection connection = new BulkConnection(config);
	return connection;

The batch job is then created to provide information on which object is to be updated as a batch job. The object type is InsightsExternalDataPart.


	logger.info("Creating batch job ");
	JobInfo job = new JobInfo();
	job.setObject(aSobjectType);
	job.setOperation(OperationEnum.insert);
	job.setContentType(ContentType.ZIP_CSV);
	job = aConnection.createJob(job);
	logger.info(job);
	return job;

The overall process of iterating through each zip data file part is shown here:

	List<File> fileParts = chunkBinary(aDataFile); // Split the file
	List<File> batchZips=createBatchZip(fileParts, aParentID);
	BulkConnection bulkConnection = getBulkConnection(aPartnerConn.getConfig());
	JobInfo job = createJob(EXT_DATA_OBJECT_PARTS, bulkConnection);
	LinkedHashMap<BatchInfo,File> batchInfoMap = new LinkedHashMap<BatchInfo,File>();
	for (int i = 0; i < batchZips.size(); i++) {
		logger.info("Adding data part number : " + (i + 1));
		createBatch(batchZips.get(i), batchInfoMap, bulkConnection, job);
	}
	closeJob(bulkConnection, job.getId());
	awaitCompletion(bulkConnection, job, batchInfoMap);
	checkResults(bulkConnection, job, batchInfoMap);

For each zip data file part, a batch is created using the bulk connection and a corresponding BatchInfo object is returned. Because the batches are applied asynchronously, the BatchInfo objects are the means of monitoring the batches as they run and of checking their status on completion. In the example code, threads are started to monitor completion.
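
A simple polling approach could look like the sketch below, which assumes the standard Bulk API Java client; the sample code's threaded monitor may be structured differently.

	// Assumed imports: java.io.File, java.util.*, com.sforce.async.*
	private static void awaitCompletion(BulkConnection conn, JobInfo job,
			Map<BatchInfo, File> batchInfoMap) throws AsyncApiException, InterruptedException {
		Set<String> incomplete = new HashSet<String>();
		for (BatchInfo bi : batchInfoMap.keySet()) {
			incomplete.add(bi.getId());
		}
		while (!incomplete.isEmpty()) {
			Thread.sleep(10000); // wait 10 seconds between polls
			for (BatchInfo bi : conn.getBatchInfoList(job.getId()).getBatchInfo()) {
				if (bi.getState() == BatchStateEnum.Completed
						|| bi.getState() == BatchStateEnum.Failed) {
					incomplete.remove(bi.getId()); // this batch has finished
				}
			}
		}
	}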

The createBatch code is shown here:

	FileInputStream zipFileStream = new FileInputStream(aZipFile);
	logger.info("creating bulk api batch for file {" + aZipFile + "}");
	BatchInfo batchInfo = aBulkConn.createBatchFromZipStream(aJobInfo, zipFileStream);


Trigger the dataset creation

Finally, a SOAP update call is made to the InsightsExternalData row that was configured for the upload, setting its Action field to Process. This triggers the workflow that builds the dataset. The row's status can then be monitored for successful completion, or the error status examined.


	SObject sobj = new SObject();
	sobj.setType(EXT_DATA_OBJECT);          // InsightsExternalData
	sobj.setField(FIELD_ACTION, "Process"); // triggers dataset creation
	sobj.setId(aParentId);
	SaveResult[] results = aPartnerConn.update(new SObject[] { sobj });
	for (SaveResult sv : results) {
		if (sv.isSuccess()) {
			rowId = sv.getId();
		}
	}
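
One way to monitor the row afterwards is to poll its Status and StatusMessage fields over the Partner API. The sketch below is illustrative only; the polling interval and attempt count are arbitrary choices, and exception handling is omitted.

	// Assumed imports: com.sforce.soap.partner.*, com.sforce.soap.partner.sobject.SObject
	String soql = "SELECT Status, StatusMessage FROM InsightsExternalData WHERE Id = '"
			+ aParentId + "'";
	for (int attempt = 0; attempt < 30; attempt++) {
		SObject row = aPartnerConn.query(soql).getRecords()[0];
		String status = (String) row.getField("Status");
		if ("Completed".equals(status) || "Failed".equals(status)) {
			logger.info("Dataset upload finished with status " + status
					+ " : " + row.getField("StatusMessage"));
			break;
		}
		Thread.sleep(10000); // poll every 10 seconds
	}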


Data loads using Force.com APEX

Apex is a strongly typed, object-oriented programming language that lets you centralize and execute flow and transaction control statements on the Salesforce Force.com platform, in conjunction with calls to the Force.com APIs. Using syntax that looks like Java and acts like database stored procedures, Apex lets you add custom business logic to your data pull and perform custom transformations before using the External Data API to create a Wave dataset.


To pull data from a Salesforce object, we use a SOQL query; to write data into Analytics Cloud, we use the External Data API. The detailed steps are:


  • Insert a record into the InsightsExternalData object using Apex code:
  • Set the dataset alias
  • Set the schema metadata JSON
  • Action = None
  • Format = CSV
	InsightsExternalData upload = new InsightsExternalData(Format = 'CSV',
		EdgemartAlias = uploadName, Action = 'None', Operation = 'Overwrite');
	System.debug('Insights upload data: ' + upload.EdgemartAlias);
	String metaData =
		'{"fileFormat":{"charsetName":"UTF-8","numberOfLinesToIgnore":1},"objects":[{"name":"OpportunityEx", ... }]}';
	upload.MetadataJson = Blob.valueOf(metaData); // MetadataJson is a base64 (Blob) field
	List<Folder> folderList = [Select Id From Folder Where Type = 'Insights' And
		DeveloperName = 'InsightsPublicEdgeMarts'];
	if (folderList.size() > 0)
		upload.EdgemartContainer = folderList[0].Id;
	insert upload;
	insightsExtDataId = upload.Id;

	// Initialize the query
	if (query == null) {
		query = 'SELECT Id, Name FROM Opportunity';
	}
	batchCount = 1;
	csvHeader = 'Id,Name';
  • Store the record identifier that is returned
  • Execute SOQL query asynchronously
  • Page: Up to 10K records at a time
  • Transform data
  • Update fields
  • Create new fields
	for (Opportunity opp : opps) {
		// append the record id
		recordString += '"' + opp.Id + '",';
		// append the name and end the CSV row
		recordString += '"' + opp.Name + '"\n';
	}

	return Blob.valueOf(recordString);

  • Construct the CSV
  • Chunk the CSV into parts: up to 10 MB per part
  • No more than 4000 parts
  • Insert a record into the InsightsExternalDataPart object for each part
  • Update the InsightsExternalData record
	InsightsExternalData upload = new InsightsExternalData(Format = 'CSV',
		EdgemartAlias = uploadName, Action = 'Process', Operation = 'Overwrite');
	upload.Id = insightsExtDataId;
	update upload;

  • Scheduler picks up jobs ready for processing
  • Merges CSV parts
  • Digester builds the dataset

This is a webinar on uploading data to Analytics Cloud through Apex.


Loading data using an ETL tool

This section is provided for completeness. A number of leading ETL tool providers have pre-built connectors to load data into Wave using the Wave External Data API. Here is a list of providers with pre-built connectors:


  • Informatica: Informatica Cloud Connector (link); Informatica Rev (link)
  • Mulesoft: Mulesoft connector
  • Jitterbit: Jitterbit connector
  • Dell Boomi: Dell Boomi connector
  • Snaplogic: Snaplogic connector
  • IBM Cast Iron: IBM Cast Iron connector
  • Talend: Talend Connector

The following example promotes understanding of how data is prepared for submission to the External Data API interface. It uses Talend Open Studio, but could be implemented in the other ETL tools with similar effort.


Example Case Study : Loading Pardot Data into Wave

Background

Pardot is a lead generation management application and is part of the Salesforce Marketing Cloud. Although there is some integration between Pardot objects and Salesforce core objects, a few objects reside outside of SFDC core and are therefore not accessible through the Wave dataflow for digest. An ETL tool is used to get this data into Wave.


Requirements

Prospect, visitor and visit data is to be loaded into Wave so it can be segmented and tied to campaigns and, where possible, to existing opportunities and customers. The data is accessible through Pardot's REST API and can be downloaded in bulk. You can find the Pardot API [here].


Implementation

The overall process to load Pardot object data into Wave is as follows:


  1. Pardot Connector. Authenticate on the Pardot API and make the bulk download REST request specifying the object name and bulk download format (JSON), iterate until all the data is retrieved. Persist the data to a single JSON file.
  2. JSON Schema Component. A JSON schema component is configured so that the structure of the Pardot object JSON is understood and can be propagated and transformed in the Talend job. This component reads the aforementioned object JSON file for transmission to the Wave Connector Component. It transforms the JSON file into a schema that can be easily transformed to CSV.
  3. Wave Analytics Connector. The Wave Analytics Connector component creates the dataset metadata schema file from the synchronized schema and transforms the JSON file into CSV for upload. It connects to the Analytics Cloud org and uploads the data to Wave to create a dataset.


Pardot Connector

The Pardot connector here is custom Java code wrapped in a Talend component, tPardot. Both sets of code are available with this document. The connector works as follows:

Post login and get the api_key from the response. The connector posts the following REST URL, with the user_key provided by the Pardot administrator and the user credentials, and receives an api_key that is valid for one hour and is used in subsequent requests. The response is in XML.

https://pi.pardot.com/api/login/version/3?email=auser@salesforce.com&password=apassword123&user_key=225f3fd7n97c067babf23cce9bfec78c1


Response:

	<?xml version="1.0" encoding="UTF-8"?>
	<rsp stat="ok" version="1.0">
	<api_key>7f476a878f6abc37a9324403043fa736</api_key>
	</rsp>
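
The api_key can be pulled out of this XML with any XML parser; for example, using the JDK's XPath API as in the sketch below (the variable loginResponseXml is illustrative, and the actual connector may parse the response differently).

	// Assumed imports: javax.xml.parsers.DocumentBuilderFactory, javax.xml.xpath.XPathFactory,
	// org.xml.sax.InputSource, java.io.StringReader, org.w3c.dom.Document
	Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
			.parse(new InputSource(new StringReader(loginResponseXml)));
	String apiKey = XPathFactory.newInstance().newXPath().evaluate("/rsp/api_key", doc);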

Once the api_key is obtained, the connector sends a query for the specific object; in this example the prospect object is queried. Because each response is limited to 200 records, repeated requests are made until all records have been returned. The id_greater_than parameter is used to paginate over the records and is set from the last id obtained. The output is bulk and the format is JSON.


https://pi.pardot.com/api/prospect/version/3/do/query?user_key=225f3fd7n97c067babf23cce9bfec78c1&api_key=7d9a94c3565501ded38f5b9ec99ae3a8&output=bulk&format=json&id_greater_than=45444646


Response:

	{
		"@attributes": {
			"stat": "ok",
			"version": 1
		},
		"result": {
			"prospect": [
				{"id": 118240185, "campaign_id": 214121, "user_id": null, "salutation": null, …
					"updated_at": "2015-10-04 15:18:41",
					"campaign": {
						"id": 214113,
						"name": "ACME Existing Customer Upsell Campaign"
					}
				}
			]
		}
	}

The Pardot connector iterates over the responses for all records, removes the header attributes and adds the records to an array, which is persisted to a single <object>.json file, in this example prospect.json.

{ 
	"records" :[
	{"id":118240185,"first_name":"Garfield","last_name":"Moore" ....."campaign":{"id":214121,"name":"ACME Existing Customer - SEO"}},

	......
	{"id":118405958,"first_name":"Lauren","last_name":"Boyle",... "campaign":{"id":214115,"name":"ACME"}}
	]
}
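
Putting the login, pagination and persistence steps together, a rough sketch of the connector's fetch loop might look like the code below. The org.json and Commons IO libraries, the userKey and apiKey variables, and the simplified response shape are assumptions; the real tPardot implementation may differ.

	// Assumed imports: java.io.*, java.net.URL, java.nio.charset.StandardCharsets,
	// org.json.*, org.apache.commons.io.*
	long lastId = 0;
	JSONArray records = new JSONArray();
	while (true) {
		URL url = new URL("https://pi.pardot.com/api/prospect/version/3/do/query"
				+ "?user_key=" + userKey + "&api_key=" + apiKey
				+ "&output=bulk&format=json&id_greater_than=" + lastId);
		String body = IOUtils.toString(url.openStream(), StandardCharsets.UTF_8);
		JSONArray prospects = new JSONObject(body)
				.getJSONObject("result").optJSONArray("prospect");
		if (prospects == null || prospects.length() == 0) {
			break; // no more records to fetch
		}
		for (int i = 0; i < prospects.length(); i++) {
			JSONObject p = prospects.getJSONObject(i);
			records.put(p);
			lastId = Math.max(lastId, p.getLong("id")); // drives id_greater_than
		}
	}
	// strip the header attributes and persist all records to a single prospect.json
	FileUtils.writeStringToFile(new File("prospect.json"),
			new JSONObject().put("records", records).toString(), StandardCharsets.UTF_8);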


JSON Schema Component

The Talend tJSON component creates a schema from the prospect.json file and maps JSON attributes, including nested values, to field values, producing a flattened structure that can be sent to Wave. This schema can be propagated to other components so that the JSON data can be handled and transformed.
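
Purely to illustrate what that flattening amounts to, the sketch below maps one prospect record, including the nested campaign name, to flat field values. tJSON does this declaratively inside Talend; the org.json library and the recordJson variable are assumptions here.

	// Assumed import: org.json.JSONObject
	JSONObject p = new JSONObject(recordJson); // one element of the "records" array
	String[] row = new String[] {
			String.valueOf(p.getLong("id")),
			p.optString("first_name", ""),
			p.optString("last_name", ""),
			p.getJSONObject("campaign").optString("name", "") // nested value flattened
	};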


Salesforce Wave Output Component

The tSalesforceWaveOutputBulkExec component creates the metadata JSON schema for the dataset, synchronized from the JSON schema component; data types and formatting can be changed here. The component uses the SOAP API load method of the External Data API to load the data into Wave for dataset creation.


Talend Job

The image below shows the Talend job to load Pardot prospect data into Wave.


ASGLoadDatatoWaveImage07.jpg

Source Code Download

Download Java Source

Conclusion

Integrating data from multiple sources is important in many use cases, particularly where users require a complete view of what is happening across multiple business domains. Achieving this requires loading and transforming data from multiple sources into Wave.


Wave Analytics provides a variety of flexible options for loading data. Use the native dataflow to load data from Salesforce, or use one of our partner ETL tools for external data. Other options, such as CSV upload, let you analyze data shape and validate business requirements early, and a custom connector can be built using Wave's External Data API. In short, Wave is a flexible platform for users to change the next analytics conversation.


Related Resources

Official Salesforce Documentation

The following resources are written and maintained by Salesforce's award-winning documentation team. The content is updated every time Salesforce is upgraded with a new release, typically three times per year. You will always find the most current and trusted content in the official Salesforce documentation.

Other Resources

The following resources contain valuable content written by Salesforce and our community. Note that the content may not be regularly maintained.

About these Best Practices

The Advanced Solutions Group (ASG) is a specialized analytics team in the Analytics Cloud product organization that provides world-class technical leadership and guidance focused on empowering the Wave ecosystem. ASG helps drive rapid value by leveraging deep technical knowledge of the Wave platform and apps. These best practices cover design and implementation alternatives for real-world analytics use cases.

Reach the ASG experts at #AsktheACExperts

ASG Logo Final.png