Merge In Bigquery


Learn how to use the Google BigQuery Handler, which streams change data capture data from source trail files into Google BigQuery.

BigQuery is a RESTful web service that enables interactive analysis of very large datasets, working in conjunction with Google Cloud Storage. See https://cloud.google.com/bigquery/.

  • Detailing the Functionality
  • Setting Up and Running the BigQuery Handler
    The Google BigQuery Handler uses the Java BigQuery client libraries to connect to BigQuery.

2.1 Detailing the Functionality

  • Data Types
  • Operation Modes
  • Operation Processing Support
  • Proxy Settings

2.1.1 Data Types

The BigQuery Handler supports most standard SQL data types. A data type conversion is required from the column value in the trail file to the corresponding Java type representing the BigQuery column type in the BigQuery Handler.

The following data types are supported:

The BigQuery Handler does not support complex data types, such as ARRAY and STRUCT.

2.1.2 Operation Modes

You can configure the BigQuery Handler in one of these two modes:

auditLogMode = true
gg.handler.name.auditLogMode=true

When the handler is configured to run with audit log mode true, the data is pushed into Google BigQuery without a unique row identification key. As a result, Google BigQuery is not able to merge different operations on the same row. For example, a source row with an insert operation, two update operations, and then a delete operation would show up in BigQuery as four rows, one for each operation.

Also, the order in which the audit log is displayed in the BigQuery data set is not deterministic.

To overcome these limitations, specify optype and position in the meta columns template for the handler. This adds two columns with the same names to the schema of the table in Google BigQuery. For example: gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}

The optype column identifies the operation type for each row in the audit log.

To view the audit log in the order in which operations were processed in the trail file, specify position, which can then be used in the ORDER BY clause when querying the table in Google BigQuery. For example:
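A query along the following lines orders the audit log by trail position; the project, dataset, and table names are placeholders, not names from this document:

```sql
SELECT *
FROM `my-project.dataset_US.mytable`
ORDER BY position;
```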

auditLogMode = false

gg.handler.name.auditLogMode=false

When the handler is configured to run with audit log mode false, the data is pushed into Google BigQuery using a unique row identification key. Google BigQuery is then able to merge different operations for the same row. However, the behavior is complex: Google BigQuery maintains a finite deduplication period within which it merges operations for a given row, so the results can be somewhat non-deterministic.


The trail source needs to have a full image of the records in order to merge correctly.

Example 1

An insert operation is sent to BigQuery and, before the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is a single row in BigQuery reflecting the update operation.

Example 2

An insert operation is sent to BigQuery and, after the deduplication period expires, an update operation for the same row is sent to BigQuery. The result is that both the insert and the update operations show up in BigQuery.

This behavior has confounded many users, but it is the documented behavior of the BigQuery SDK and is a feature, not a defect. The documented length of the deduplication period is at least one minute; however, Oracle testing has shown that the period can be significantly longer. Therefore, unless you can guarantee that all operations for a given row occur within a very short period, it is likely that there will be multiple entries for a given row in BigQuery. It is therefore just as important to configure meta columns with the optype and position so that you can determine the latest state for a given row. To read more about audit log mode, see the Google BigQuery documentation: Streaming data into BigQuery.
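The two examples above can be illustrated with a small, self-contained simulation of the deduplication window. This is a sketch of the documented behavior, not the BigQuery SDK or handler code; the class and its names are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class DedupSimulator:
    """Simulates BigQuery streaming-insert deduplication: operations for
    the same row key arriving within the dedup window replace the earlier
    entry; after the window expires, a new entry is appended instead."""
    window_secs: float = 60.0  # documented minimum; the real window may be longer
    rows: list = field(default_factory=list)  # entries of (key, optype, arrival_time)

    def send(self, key, optype, t):
        for i, (k, _, t0) in enumerate(self.rows):
            if k == key and (t - t0) <= self.window_secs:
                self.rows[i] = (key, optype, t0)  # merged: latest operation wins
                return
        self.rows.append((key, optype, t))

# Example 1: update arrives inside the window -> one merged row remains
s1 = DedupSimulator()
s1.send("row1", "insert", t=0)
s1.send("row1", "update", t=30)
assert [op for _, op, _ in s1.rows] == ["update"]

# Example 2: update arrives after the window -> both operations remain
s2 = DedupSimulator()
s2.send("row1", "insert", t=0)
s2.send("row1", "update", t=120)
assert [op for _, op, _ in s2.rows] == ["insert", "update"]
```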

2.1.3 Operation Processing Support

The BigQuery Handler pushes operations to Google BigQuery using the synchronous API. Insert, update, and delete operations are processed differently in BigQuery than in a traditional RDBMS.

The following explains how insert, update, and delete operations are interpreted by the handler depending on the mode of operation:


auditLogMode = true
  • insert – Inserts the record with optype as an insert operation in the BigQuery table.

  • update – Inserts the record with optype as an update operation in the BigQuery table.

  • delete – Inserts the record with optype as a delete operation in the BigQuery table.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sends out a delete operation followed by an insert operation. Both rows have the same position in the BigQuery table, which helps identify this as a primary key operation rather than separate delete and insert operations.

auditLogMode = false
  • insert – If the row does not already exist in Google BigQuery, then an insert operation is processed as an insert. If the row already exists in Google BigQuery, then an insert operation is processed as an update. The handler sets the deleted column to false.

  • update – If the row does not exist in Google BigQuery, then an update operation is processed as an insert. If the row already exists in Google BigQuery, then an update operation is processed as an update. The handler sets the deleted column to false.

  • delete – If the row does not exist in Google BigQuery, then a delete operation has no effect. If the row exists in Google BigQuery, then a delete operation is processed as a delete. The handler sets the deleted column to true.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sets the deleted column to true for the row whose primary key is updated. This is followed by a separate insert operation with the new primary key and the deleted column set to false for that row.
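The auditLogMode = false rules above can be sketched as follows, using a plain dictionary to stand in for the target table. This is a minimal illustration of the described semantics, not handler code; the function and its parameters are invented for illustration:

```python
def apply_op(table, key, optype, row, new_key=None):
    """Apply one change record to a dict standing in for the BigQuery
    target table, following the auditLogMode=false rules above."""
    if optype in ("insert", "update"):
        # Missing rows are inserted; existing rows are updated.
        table[key] = {**row, "deleted": False}
    elif optype == "delete":
        # Deletes of missing rows have no effect.
        if key in table:
            table[key] = {**table[key], "deleted": True}
    elif optype == "pkUpdate":
        # delete-insert handling: old key marked deleted, new key inserted.
        if key in table:
            table[key] = {**table[key], "deleted": True}
        table[new_key] = {**row, "deleted": False}
    return table

t = {}
apply_op(t, 1, "insert", {"name": "a"})
apply_op(t, 1, "update", {"name": "b"})   # existing row: processed as update
apply_op(t, 1, "delete", {})
assert t[1]["deleted"] is True
apply_op(t, 2, "insert", {"name": "c"})
apply_op(t, 2, "pkUpdate", {"name": "c"}, new_key=3)
assert t[2]["deleted"] is True and t[3]["deleted"] is False
```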

Do not toggle the audit log mode, because doing so forces the BigQuery Handler to abend: Google BigQuery cannot alter the schema of an existing table. The existing table must be deleted before switching audit log modes.

Note:

The BigQuery Handler does not support the truncate operation. It abends when it encounters a truncate operation.

2.1.4 Proxy Settings

To connect to BigQuery using a proxy server, you must configure the proxy host and the proxy port in the properties file as follows:
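The exact form depends on your installation; a commonly used approach (an assumption here, not confirmed by this document) passes the proxy to the JVM through the jvm.bootoptions property, with placeholder host and port values:

```
jvm.bootoptions=-Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=3128
```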


2.2 Setting Up and Running the BigQuery Handler

The Google BigQuery Handler uses the Java BigQuery client libraries to connect to BigQuery.

These client libraries are located using the following Maven coordinates:


  • Group ID: com.google.cloud
  • Artifact ID: google-cloud-bigquery
  • Version: 1.111.1

The BigQuery client libraries do not ship with Oracle GoldenGate for Big Data. Additionally, Google appears to have removed the direct download link for the BigQuery client libraries. You can download them with Maven using the coordinates listed above, although this requires proficiency with Maven. If you cannot download the Google BigQuery client libraries, contact Oracle Support for assistance. See Google BigQuery Dependencies.

  • Schema Mapping for BigQuery
  • Understanding the BigQuery Handler Configuration
  • Review a Sample Configuration
  • Configuring Handler Authentication


2.2.1 Schema Mapping for BigQuery

The table schema name specified in the Replicat MAP statement is mapped to the BigQuery dataset name. For example: map QASOURCE.*, target 'dataset_US'.*;

This MAP statement replicates tables to the BigQuery dataset 'dataset_US'.


2.2.2 Understanding the BigQuery Handler Configuration

The following are the configurable values for the BigQuery Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

To enable the selection of the BigQuery Handler, you must first configure the handler type by specifying gg.handler.name.type=bigquery and the other BigQuery properties as follows:

Each property below is listed in the order: property name, required or optional, legal values, default, explanation.

gg.handlerlist

Required

Any string

None

Provides a name for the BigQuery Handler. The BigQuery Handler name then becomes part of the property names listed in this table.

gg.handler.name.type

Required

bigquery

None

Selects the BigQuery Handler for streaming change data capture into Google BigQuery.

gg.handler.name.credentialsFile

Optional

Relative or absolute path to the credentials file

None

The credentials file downloaded from Google BigQuery for authentication. If you do not specify the path to the credentials file, you need to set it as an environment variable, see Configuring Handler Authentication.

gg.handler.name.projectId

Required

Any string

None

The name of the project in Google BigQuery. The handler needs the project ID to connect to Google BigQuery.

gg.handler.name.batchSize

Optional

Any number

500

The maximum number of operations to be batched together. This is applicable for all target table batches.

gg.handler.name.batchFlushFrequency

Optional

Any number

1000

The maximum amount of time in milliseconds to wait before executing the next batch of operations. This is applicable for all target table batches.

gg.handler.name.skipInvalidRows

Optional

true | false

false

Sets whether to insert all valid rows of a request, even if invalid rows exist. If not set, the entire insert request fails if it contains an invalid row.

gg.handler.name.ignoreUnknownValues

Optional

true | false

false

Sets whether to accept rows that contain values that do not match the schema. If not set, rows with unknown values are considered to be invalid.

gg.handler.name.connectionTimeout

Optional

Positive integer

20000

The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery.

gg.handler.name.readTimeout

Optional

Positive integer

30000

The maximum amount of time in milliseconds to wait for the handler to read data from an established connection.

gg.handler.name.metaColumnsTemplate

Optional

A legal string

None

A legal string specifying the metaColumns to be included. If you set auditLogMode to true, it is important that you set the metaColumnsTemplate property to view the operation type for the row inserted in the audit log, see Setting Metacolumn Output.

gg.handler.name.auditLogMode

Optional

true | false

false

When set to true, the handler writes each record to the target without any primary key; every operation is processed as an insert.

When set to false, the handler tries to merge incoming records into the target table if they have the same primary key. Primary keys are required for this setting. The trail source records need to contain full image updates to merge correctly.

gg.handler.name.pkUpdateHandling

Optional

abend | delete-insert

abend

Sets how the handler handles update operations that change a primary key. Primary key operations can be problematic for the BigQuery Handler and require special consideration:

  • abend – the process abends.

  • delete-insert – the process treats the operation as a delete followed by an insert. The full before image is required for this option to work correctly; without full before and after row images, the insert data is incomplete. Oracle recommends this option.

gg.handler.name.adjustScale

Optional

true | false

false

The BigQuery numeric data type supports a maximum scale of 9 digits. If a field is mapped into a BigQuery numeric data type, then it fails if the scale is larger than 9 digits. Set this property to true to round fields mapped to BigQuery numeric data types to a scale of 9 digits. Enabling this property results in a loss of precision for source data values with a scale larger than 9.

gg.handler.name.includeDeletedColumn

Optional

true | false

false

Set to true to include a boolean column in the output called deleted. The value of this column is set to false for insert and update operations, and is set to true for delete operations.

gg.handler.name.enableAlter

Optional

true | false

false

Set to true to enable altering the target BigQuery table. This allows the BigQuery Handler to add columns or metacolumns configured on the source that are not currently in the target BigQuery table.
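The rounding that adjustScale performs can be illustrated with a short plain-Python sketch. This is not handler code, and the rounding mode shown is an assumption; the document does not state which mode the handler uses:

```python
from decimal import Decimal, ROUND_HALF_UP

def adjust_scale(value: Decimal, max_scale: int = 9) -> Decimal:
    """Round a value to the maximum scale the BigQuery numeric type
    supports (9 digits), mirroring the adjustScale behavior described
    above. ROUND_HALF_UP is an assumed rounding mode."""
    quantum = Decimal(1).scaleb(-max_scale)  # Decimal('1E-9')
    return value.quantize(quantum, rounding=ROUND_HALF_UP)

v = Decimal("3.12345678949")  # scale 11: too precise for BigQuery numeric
assert adjust_scale(v) == Decimal("3.123456789")  # precision beyond 9 digits is lost
```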


2.2.3 Review a Sample Configuration

The following is a sample configuration for the BigQuery Handler:
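The sample below is a representative sketch built only from the properties documented in the table above; the handler name, project ID, and credentials path are placeholders:

```
gg.handlerlist=bigquery
gg.handler.bigquery.type=bigquery
gg.handler.bigquery.credentialsFile=/path/to/credentials.json
gg.handler.bigquery.projectId=my-project-id
gg.handler.bigquery.auditLogMode=false
gg.handler.bigquery.pkUpdateHandling=delete-insert
gg.handler.bigquery.metaColumnsTemplate=${optype}, ${position}
```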


2.2.4 Configuring Handler Authentication

You must configure BigQuery Handler authentication using the credentials in the JSON file downloaded from Google BigQuery.


Download the credentials file:

  1. Log in to your Google account at cloud.google.com.

  2. Click Console, and then go to the Dashboard where you can select your project.

  3. From the navigation menu, click APIs & Services then select Credentials.

  4. From the Create Credentials menu, choose Service account key.

  5. Choose the JSON key type to download the JSON credentials file for your system.

Once you have the credentials file, you can authenticate the handler in one of these two ways:


  • Specify the path to the credentials file in the properties file with the gg.handler.name.credentialsFile configuration property.

    The path to the credentials file must be specified without a wildcard. If you include the * wildcard in the path to the credentials file, the file is not recognized.

    Or

  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable on your system. For example:

    Then restart the Oracle GoldenGate manager process.
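The environment-variable setting mentioned above might look like the following; the path shown is a placeholder:

```shell
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```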

