As more organizations move their business intelligence (BI) environments to the cloud, loading and accessing enterprise data will become as important as the front-end visualizations. Oracle’s BI Cloud Service (BICS) offers several options for those data requirements that go beyond simple data upload. Each has a specific purpose, features, benefits, and limitations. One of the more powerful options for true enterprise-level data transfer to the cloud is Oracle’s Data Sync tool.
Data Sync Overview
Data Sync provides a full-featured data transfer tool with a client interface that allows for scheduling load jobs that efficiently move data from flat files, database tables, and other cloud data sources into the BICS Database Schema Service or Oracle Database Cloud Service. It can also directly load data as a data set source for the Visual Analyzer projects that are available in BICS.
It includes many of the features found in other data loading tools: logging of load job execution steps, restarting after failures, incremental loading of new or modified data, and configuring the sequence of load operations. A Data Sync solution lends itself either to new development of data load processes or to a more agile analytics environment in which processes and data models change more rapidly than would be practical with an on-premises database.
Data Sync Configuration Steps
Data Sync’s primary function is to upload data into a BICS environment. Data can be loaded from flat files (CSV or XLSX), relational database sources (either tables, views, or SQL statements that are executed dynamically), Oracle Transactional BI (OTBI), JDBC data sources (e.g., MongoDB, Impala, SalesForce, RedShift), or the Oracle Service Cloud. Uploaded data can be stored in cloud-based tables or data sets accessible to the data visualization components.
Data Sync can:
- Load data sources other than Oracle in addition to data files or Oracle tables
- Execute incremental data loads or rolling deletes and insert / append strategies
- Merge data from multiple sources
- Schedule data loads
Data Sync is installed on a local computer running either the Windows or Linux operating system. Before installing Data Sync, ensure that Java Development Kit (JDK) 1.7 or later is installed on the local computer; it must be the JDK, not a JRE. It is also necessary to verify that the user account that will access the BICS database schema has the proper permissions. Work with your cloud administrator to grant permission to upload data to Oracle BICS by assigning the BI Data Load Author application role to the account. To upload data to a data set instead of a table, assign the BI Advanced Content Author application role instead.
- Download the Data Sync software from Oracle Technology Network. Currently located at: Data Sync Download on OTN
- Unzip BICSDataSync.Zip to a local directory (no spaces in directory name)
- Set the JAVA_HOME variable in config.bat or config.sh to point to JDK location
- Copy database-specific JDBC drivers to Data Sync’s \lib directory
Data Sync comprises both a server component and a client GUI. To start Data Sync and its server component, run datasync.bat (Windows) or datasync.sh (Linux/UNIX) from the Data Sync installation directory. The Data Sync icon appears in the system tray to indicate that the server is up and running.
To access the client interface, click on the icon and choose Start UI to open the Data Sync client.
Click on the icon and choose Exit to stop the Data Sync server.
Data Sync Updates
Data Sync automatically checks against the cloud environment before each load job executes to see whether a new version is available. It is possible that a new version of Data Sync has been installed in the cloud that is incompatible with the local version. If the versions match, the data load continues unaffected. If only the minor version of the tool has changed in the cloud, a patch is available; this triggers a one-time alert and an optional email if the Data Sync server is configured for email notifications. If the change is a major version release, an alert is created, an email is sent, and the data load job is stopped and will not run until the new version is installed.
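The version-compatibility rule described above can be sketched as follows. This is a hypothetical model assuming a simple "major.minor" version string; the actual version format and comparison logic are internal to Data Sync and may differ.

```python
# Hypothetical model of Data Sync's pre-job version check (illustrative only).
def check_version(local: str, cloud: str) -> str:
    """Return the action taken before a load job runs, given 'major.minor' versions."""
    local_major, local_minor = (int(p) for p in local.split("."))
    cloud_major, cloud_minor = (int(p) for p in cloud.split("."))
    if (local_major, local_minor) == (cloud_major, cloud_minor):
        return "run"    # versions match: the load continues unaffected
    if cloud_major != local_major:
        return "stop"   # major release: alert, email, job halted until upgrade
    return "alert"      # minor change: patch available, one-time alert, job still runs

print(check_version("2.3", "2.3"))  # run
print(check_version("2.3", "2.4"))  # alert
print(check_version("2.3", "3.0"))  # stop
```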
New versions (minor or major) are simple to install by following the standard installation process into a new home directory and then by copying the existing environment configuration into the new installation.
Data Sync Terminology
Connection: Defines the data sources and target databases.
Project: A container which describes and organizes the information related to a data load.
There can be any number of projects within a Data Sync instance.
Job: The mechanism for uploading all the data sources defined in a project to BICS.
When moving data to the cloud, a load strategy defines how that data is loaded from the data source into the target. Incremental data loading is available as long as the source table has a column which uniquely identifies each row (a primary key) as well as another column with a “Date/Timestamp” data type that can be used to identify which rows have been added or modified since the previous load. During the initial load job, the full source table is transmitted to the cloud. In subsequent loads, each row’s timestamp is compared to the stored timestamp of the last successful load, and only rows added or changed since that time are uploaded.
Load strategy options include:
- Replace data in table: Truncates existing data and always reloads the table. Any indexes are dropped prior to data load and are recreated after the load
- Append data to table: New data is added to the table without checking for any prior existence of data. Indexes are not dropped before the load. Any new indexes defined on the source are created after the data load
- Update table (add new records only): Requires a primary key column. If a row with the same key does not exist, it is inserted; otherwise the record is ignored
- Update table (update existing records): Requires a primary key column. If a row with the same key exists, it is updated; otherwise the record is ignored
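The row-level semantics of the four strategies can be sketched in Python. This is an illustrative model working on in-memory (key, value) rows; Data Sync performs the equivalent operations inside the target database.

```python
# Illustrative model of Data Sync's load strategies on lists of (pk, value) rows.
def load(target, source, strategy):
    """Return the new target rows after applying the given load strategy."""
    if strategy == "replace":
        return list(source)                     # truncate and reload everything
    if strategy == "append":
        return target + source                  # no key check; duplicates possible
    existing = {pk for pk, _ in target}
    if strategy == "insert_new":                # Update table (add new records only)
        return target + [(pk, v) for pk, v in source if pk not in existing]
    if strategy == "update_existing":           # Update table (update existing records)
        updates = dict(source)
        return [(pk, updates.get(pk, v)) for pk, v in target]
    raise ValueError(f"unknown strategy: {strategy}")

target = [(1, "a"), (2, "b")]
source = [(2, "B"), (3, "c")]
print(load(target, source, "insert_new"))       # [(1, 'a'), (2, 'b'), (3, 'c')]
print(load(target, source, "update_existing"))  # [(1, 'a'), (2, 'B')]
```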
Loading Data Sets
Data sets are separate data objects used primarily by the Visual Analyzer component of BICS. They can be thought of as separate data files stored in the cloud. Data Sync can load those data sets with the following guidelines:
- If a data set of the same name does not exist, one is created automatically with default settings
- All string and timestamp-based source columns are set to the attribute type, and numeric datatype columns are set as measures in the target data set
- The maximum data set size is 50MB; data uploads fail if the data set exceeds the 50MB limit
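A simple pre-flight size check against the 50MB limit can save a failed upload. The helper below is illustrative, not part of the Data Sync API; on a real file you would pass `os.path.getsize(path)`.

```python
# Hypothetical pre-flight check for the 50MB data set limit (not a Data Sync API).
MAX_DATASET_BYTES = 50 * 1024 * 1024

def within_dataset_limit(num_bytes: int) -> bool:
    """True if a payload of this size can be uploaded as a BICS data set."""
    return num_bytes <= MAX_DATASET_BYTES

print(within_dataset_limit(10 * 1024 * 1024))  # True
print(within_dataset_limit(51 * 1024 * 1024))  # False
```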
Loading Data with Jobs
Data Sync uses the concept of a job to organize, schedule, and execute load processes. A run is an instance of a data loading job. For example, if you run a job twice, then you’ll see two run records on the History tab. As is common with most scheduling applications, Data Sync allows for job settings to recur on a scheduled basis to meet whatever load frequency is required.
Parameters are available in Data Sync to customize data loads at run time by dynamically changing the conditions on the data selection step. Rather than changing the SQL statements used to query the data sources, a parameter can be changed at the project or job level instead. A job-level parameter overrides a parameter with the same name at the project level.
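The override rule can be illustrated with a short sketch; the parameter names here are made up for the example, and the resolution function is a model of the behavior, not Data Sync code.

```python
# Illustrative model of parameter resolution: job-level values win over
# project-level values of the same name (parameter names are hypothetical).
def resolve_parameters(project: dict, job: dict) -> dict:
    effective = dict(project)
    effective.update(job)  # a job-level parameter overrides the project-level one
    return effective

params = resolve_parameters(
    {"LOAD_DATE": "2016-01-01", "REGION": "EMEA"},  # project level
    {"LOAD_DATE": "2016-06-30"},                    # job level
)
print(params)  # {'LOAD_DATE': '2016-06-30', 'REGION': 'EMEA'}
```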
Chaining Data Sync Load Jobs
There may be occasions where it would be beneficial to the load process to chain load jobs in a specific order. Possible reasons might be the need to load multiple separate data source tables into a common target or the need to load aggregate tables after base level data tables have completed. The first step would be to separate the load jobs into distinct projects.
When a job starts, a signal file named “job_name”_StartSignal.txt is created in Data Sync’s log\jobSignal directory for each run of the job. A file with the naming pattern “job_name”_CompletedSignal.txt is created when the job completes successfully, or “job_name”_FailedSignal.txt when the job fails. Data Sync has a polling mechanism that looks for the existence of these files before executing another load job.
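The polling behavior can be approximated in a short sketch. The signal-file names follow the patterns described above, but the function itself and its timeout handling are illustrative stand-ins for Data Sync's internal mechanism.

```python
# Illustrative poller: wait for a job's completed/failed signal file before
# launching a dependent job (a stand-in for Data Sync's own polling).
import os
import time

def wait_for_completion(signal_dir: str, job_name: str,
                        poll_seconds: float = 1.0, timeout: float = 60.0) -> str:
    completed = os.path.join(signal_dir, f"{job_name}_CompletedSignal.txt")
    failed = os.path.join(signal_dir, f"{job_name}_FailedSignal.txt")
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(completed):
            return "completed"   # safe to start the dependent load job
        if os.path.exists(failed):
            return "failed"      # upstream job failed; do not chain
        time.sleep(poll_seconds)
    return "timeout"
```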
By editing the on_demand_job.xml file located in the conf-shared directory, you can specify the name of a file that will trigger a specific load job.
<TriggerFile job="Load_HR_Data" file="d:\Load_The_HR_Data_Now.txt"/>
In the example above, Data Sync polls for the presence of the Load_The_HR_Data_Now.txt file, and when it is found, the Load_HR_Data job is executed. When the job starts, the triggering file is deleted. A time window for polling can also be specified in the XML file.
<TriggerFile job="Load_HR_Data" file="d:\Load_The_HR_Data_Now.txt">
<TimeWindow startTime="00:00" endTime="05:00"/>
<TimeWindow startTime="21:00" endTime="23:00"/>
</TriggerFile>
Two other options exist for triggering Data Sync load jobs, either from the command line using the datasyncCmdLine.bat/.sh script file or else with polling tables stored in a database. Both of these methods are described in detail in the Data Sync documentation.
Monitoring and Troubleshooting Data Loads
Data load jobs can be monitored while in progress using the Current Jobs tab or History tab in Data Sync. The top panel shows the job summary information. The Tasks tab beneath that panel shows one record per user-defined data flow. The tasks and their details show important metrics, including start and end timestamps, the number of rows read and written, and throughput.
The run log files are stored in the Data Sync log directory. One directory per run is created; within it, log files follow the naming convention CR_<Table/File Name>_<From Connection>_<To Connection>.<Timestamp>.log.
There are options available within the Connections configuration screens to reset a data target to do a full reload if necessary. The reload can be for all connections or individual tables within a connection.
Data Sync Configuration Migration
As with most development environments, it may be necessary to maintain multiple instances of Data Sync for DEV, TEST, and PROD. This can be managed within the Data Sync interface by exporting the metadata from one environment and importing it into the next. The export and import options are available under the Tools menu.
When exporting, there are options to export the following metadata types:
- Logical: Exports all information contained in the Project view
- System: Exports all information contained in the Connections view, except passwords for servers and database connections
- Run Time: Exports information about jobs and schedules contained in the Jobs view
- User Data: Exports users, roles, and passwords