This topic introduces the terms related to the data transformation feature.

Basic terms

  • ETL

    Extract, transform, and load (ETL) is a process during which data is extracted from business systems, cleansed, transformed, and loaded. This process unifies and standardizes data from different sources. Log Service can load data from a source Logstore, transform data, and then write transformed data to destination Logstores. Log Service can also load data from Object Storage Service (OSS) buckets, ApsaraDB RDS instances, or other Logstores.

  • event, data, and log

    In data transformation, events and data are represented by logs. For example, the event time is equivalent to the log time, and the drop_event_fields function discards log fields.

  • log time

    The log time indicates the point in time at which an event occurs. The log time is also known as the event time. The log time is indicated by the reserved field __time__ in Log Service. The value of this field is extracted from the time information in logs. The value is a UNIX timestamp representing the number of seconds that have elapsed since the epoch time January 1, 1970, 00:00:00 UTC. Data type: integer. Unit: seconds.

  • log receiving time

    The log receiving time indicates the point in time at which a log is received by a server of Log Service. By default, this time is not saved in logs. However, if you turn on Log Public IP for a Logstore, this time is recorded in the log tag field __receive_time__. In the data transformation process, the complete name of this field is __tag__:__receive_time__. The value is a UNIX timestamp representing the number of seconds that have elapsed since the epoch time January 1, 1970, 00:00:00 UTC. Data type: integer. Unit: seconds.

    Note: In most scenarios, logs are sent to Log Service in real time, and the log time is the same as the log receiving time. If you import historical logs, the log time is different from the log receiving time. For example, if you import logs generated during the last 30 days by using an SDK, the log receiving time is the current time and is different from the log time.
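
The relationship between the log time and the log receiving time can be illustrated with a minimal Python sketch. It is not part of the DSL for Log Service. It assumes that a log is represented as a dictionary whose field values are strings, and the helper names to_utc and ingestion_delay are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Full name of the receive-time tag as it appears during data transformation.
RECEIVE_TIME_FIELD = "__tag__:__receive_time__"


def to_utc(timestamp: str) -> datetime:
    """Convert a UNIX timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc)


def ingestion_delay(log: dict) -> Optional[int]:
    """Return receive time minus log time in seconds.

    Returns None when the receive-time tag is absent, which is the default
    unless the tag is recorded for the Logstore.
    """
    if RECEIVE_TIME_FIELD not in log:
        return None
    return int(log[RECEIVE_TIME_FIELD]) - int(log["__time__"])


# A hypothetical historical log that was imported 30 days after the event.
historical_log = {
    "__time__": "1700000000",
    "__tag__:__receive_time__": "1702592000",
}
delay_in_days = ingestion_delay(historical_log) / 86400
```

For a log that is sent in real time, the delay would be close to zero. For the imported log above, the two fields differ by the age of the log.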

  • tag

    Logs have tags. Each tag field is prefixed with __tag__:. Log Service supports two types of tags.
    • Custom tags: the tags that you add when you call the PutLogs operation to write data.
    • System tags: the tags that are added by Log Service, including __client_ip__ and __receive_time__.
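
The tag naming convention can be sketched in plain Python as follows. This is an illustration only, not DSL code. It assumes a log is a dictionary, treats only the two system tags named above as system tags (Log Service may add others), and uses a hypothetical custom tag named env.

```python
from typing import Dict, Tuple

TAG_PREFIX = "__tag__:"
# Assumption: only the two system tags named in this topic are listed here.
SYSTEM_TAGS = {"__client_ip__", "__receive_time__"}


def split_tags(log: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate the tag fields of a log into (system_tags, custom_tags).

    Fields without the __tag__: prefix are ordinary log fields and are ignored.
    """
    system: Dict[str, str] = {}
    custom: Dict[str, str] = {}
    for key, value in log.items():
        if not key.startswith(TAG_PREFIX):
            continue
        name = key[len(TAG_PREFIX):]
        target = system if name in SYSTEM_TAGS else custom
        target[name] = value
    return system, custom


sample_log = {
    "__time__": "1700000000",
    "__tag__:__client_ip__": "1.2.3.4",
    "__tag__:env": "staging",  # hypothetical custom tag
    "message": "hello",
}
system_tags, custom_tags = split_tags(sample_log)
```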

Configuration-related terms

  • source Logstore

    The data transformation feature reads data from a source Logstore for transformation.

    You can configure only one source Logstore for a data transformation task. However, you can configure the same source Logstore for different data transformation tasks.

  • destination Logstore

    The data transformation feature writes transformed data to destination Logstores.

    You can configure one or more destination Logstores for a data transformation task. Data can be written to destination Logstores in static or dynamic mode. For more information, see Distribute data to multiple destination Logstores.

  • DSL for Log Service

    The domain-specific language (DSL) for Log Service is a Python-compatible scripting language that is built on top of Python and used for data transformation in Log Service. The DSL provides more than 200 built-in functions to simplify common data transformation tasks and also allows you to use custom Python extensions. For more information, see Language introduction.

  • transformation rule

    A transformation rule is a data transformation script that is written in the DSL for Log Service.

  • data transformation task

    A data transformation task is the minimum scheduling unit of data transformation. You must configure a source Logstore, one or more destination Logstores, a transformation rule, a transformation time range, and other parameters for a data transformation task.

Rule-related terms

  • resource

    Resources refer to third-party data sources that are referenced during data transformation. The data sources include but are not limited to on-premises resources, Object Storage Service (OSS), ApsaraDB RDS, and Logstores other than the source and destination Logstores. The resources may be referenced to enrich data. For more information, see Resource functions.

  • dimension table

    A dimension table contains dimension information that can be used to enrich data. A dimension table is an external table. For example, a dimension table can contain the information of users, products, and geographical locations of a company. In most scenarios, dimension tables are included in resources and may be dynamically updated.

  • enrichment or mapping

    If the information contained in a log cannot meet your requirements, you can map one or more fields in the log by using a dimension table to obtain more information. This process is called enrichment or mapping.

    For example, a request log contains the status field that specifies the HTTP status code. You can map the field to the status_desc field to obtain the HTTP status description by using the following table.
    Before enrichment    After enrichment
    status               status_desc
    200                  Success
    300                  Redirect
    400                  Permission error
    500                  Server error

    If a source log contains the user_id field, you can map the field by using a dimension table that contains account details to obtain more information. For example, you can obtain the user name, gender, registration time, and email address for each user ID. Then, you can add the information to the source log and write the log to the destination Logstores. For more information, see Mapping and enrichment functions.
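
The idea of enrichment can be sketched in plain Python as a dictionary lookup. This is a conceptual illustration under the assumption that a log is a dictionary of string fields. It is not the DSL implementation, and the function name enrich is hypothetical. The DSL provides its own functions for this purpose, which are described in Mapping and enrichment functions.

```python
from typing import Dict

# Dimension table taken from the status example in this topic.
STATUS_DESC = {
    "200": "Success",
    "300": "Redirect",
    "400": "Permission error",
    "500": "Server error",
}


def enrich(log: Dict[str, str], table: Dict[str, str],
           field: str, output_field: str) -> Dict[str, str]:
    """Return a copy of the log with output_field set from the dimension table.

    If the field is missing or its value has no entry in the table,
    the log is returned unchanged.
    """
    enriched = dict(log)
    value = log.get(field)
    if value in table:
        enriched[output_field] = table[value]
    return enriched


request_log = {"status": "200", "path": "/index.html"}
enriched_log = enrich(request_log, STATUS_DESC, "status", "status_desc")
```

The source log is copied rather than modified in place, so the original fields remain available alongside the added field.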

  • event splitting

    If a log contains multiple pieces of information, the log can be split into multiple logs. This process is called event splitting.

    For example, a log contains the following information:
    __time__: 1231245
    __topic__: "win_logon_log"
    content: 
    [ {
      "source": "1.2.3.4",
      "dest": "1.2.3.4",
      "action": "login",
      "result": "pass"
    },{
      "source": "1.2.3.5",
      "dest": "1.2.3.4",
      "action": "logout",
      "result": "pass"
    }
    ]
    The log can be split into two logs.
    __time__: 1231245
    __topic__: "win_logon_log"
    content: 
    {
      "source": "1.2.3.4",
      "dest": "1.2.3.4",
      "action": "login",
      "result": "pass"
    }

    __time__: 1231245
    __topic__: "win_logon_log"
    content: 
    {
      "source": "1.2.3.5",
      "dest": "1.2.3.4",
      "action": "logout",
      "result": "pass"
    }
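
The splitting behavior described above can be sketched in plain Python. This is a conceptual illustration, not the DSL function that performs splitting. It assumes a log is a dictionary and that the content field holds a JSON array serialized as a string. The function name split_event is hypothetical.

```python
import json
from typing import Dict, List


def split_event(log: Dict[str, str], field: str = "content") -> List[Dict[str, str]]:
    """Split one log into one log per element of the JSON array in `field`.

    All other fields, such as __time__ and __topic__, are copied to each
    resulting log. A log whose field is not a JSON array is returned as-is.
    """
    items = json.loads(log[field])
    if not isinstance(items, list):
        return [dict(log)]
    return [{**log, field: json.dumps(item)} for item in items]


source_log = {
    "__time__": "1231245",
    "__topic__": "win_logon_log",
    "content": json.dumps([
        {"source": "1.2.3.4", "dest": "1.2.3.4", "action": "login", "result": "pass"},
        {"source": "1.2.3.5", "dest": "1.2.3.4", "action": "logout", "result": "pass"},
    ]),
}
split_logs = split_event(source_log)
```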
  • grok

    Grok uses patterns to replace complex regular expressions.

    For example, the grok("%{IPV4}") pattern indicates a regular expression that is used to match IPv4 addresses and is equivalent to the expression "(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])". For more information, see Grok function.
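
To show what the expanded expression matches, the following sketch applies it with the re module of standard Python. It does not call grok, which is available only in the DSL. The pattern is assembled from one octet sub-pattern repeated four times, which yields the same expression as the one quoted above. The sample strings are made up for illustration.

```python
import re

# One octet of an IPv4 address, exactly as it appears in the expanded expression.
OCTET = r"(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])"

# Four octets joined by literal dots; the lookarounds reject adjacent digits.
IPV4_PATTERN = r"(?<![0-9])(?:" + "[.]".join([OCTET] * 4) + r")(?![0-9])"
IPV4 = re.compile(IPV4_PATTERN)

match = IPV4.search("client 192.168.0.1 connected")
matched_address = match.group(0) if match else None

# An octet above 255 prevents a match.
no_match = IPV4.search("client 256.1.1.1 connected")

addresses = IPV4.findall("from 10.0.0.1 to 10.0.0.2")
```

The comparison makes the benefit of a named pattern clear: the short name stands in for an expression that is hard to read and easy to mistype.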

  • content capturing by using a regular expression

    You can use a regular expression to capture specified content in a field and include the content in a new field.

    For example, the function e_regex("content", "(?P<email>[a-zA-Z][a-zA-Z0-9_.+-=:]+@\w+\.com)") extracts the email address from the content field and includes the extracted email address in the email field. In this example, the email address is extracted by using a common regular expression. We recommend that you use the following grok pattern to simplify the regular expression: e_regex("content", grok("%{EMAILADDRESS:email}")). For more information, see Regular expressions.
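
The capturing behavior can be approximated in standard Python with a named group. This sketch is not the implementation of e_regex. It assumes a log is a dictionary of string fields, and the function name capture is hypothetical. The regular expression is the one shown above. Note that +-= inside the character class is interpreted as a character range from + to =.

```python
import re
from typing import Dict

# Same expression as in the e_regex example; the named group becomes the new field.
EMAIL = re.compile(r"(?P<email>[a-zA-Z][a-zA-Z0-9_.+-=:]+@\w+\.com)")


def capture(log: Dict[str, str], field: str, pattern: "re.Pattern") -> Dict[str, str]:
    """Return a copy of the log with each matched named group added as a field.

    If the field is missing or the pattern does not match, the log is
    returned unchanged.
    """
    result = dict(log)
    match = pattern.search(log.get(field, ""))
    if match:
        # Skip optional groups that did not take part in the match.
        result.update({k: v for k, v in match.groupdict().items() if v is not None})
    return result


captured = capture({"content": "user alice@example.com logged in"}, "content", EMAIL)
```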