Intel® Extension for TensorFlow* Code Guide

ID 766683
Updated 11/9/2022
Version Latest



Contributor: Feng Ding


This article helps developers who want to work from the source code of Intel® Extension for TensorFlow*, for example to write a custom op or to debug, by outlining the code structure, explaining how to read the code, and pointing to related resources.

The TensorFlow* community proposed the PluggableDevice architecture (link), a plug-in mechanism that lets devices register with TensorFlow* without changes to the TensorFlow* code, so that an accelerator integrates seamlessly with TensorFlow*. Building on it, Intel released a high-performance plugin, Intel® Extension for TensorFlow* (source code), which lets TensorFlow* code run on Intel XPUs (GPU and CPU).

The PluggableDevice mechanism has four main components: the PluggableDevice type, custom operations and kernels, device execution and memory management, and custom graph optimization passes. More details here.

The code structure of Intel® Extension for TensorFlow* is shown in the figure below.

Pluggable Device


TensorFlow extends its device class hierarchy with a standardized pluggable device, PluggableDevice, which is built on top of StreamExecutor. A new third-party device that wants to integrate with the current TensorFlow stack only needs to implement the StreamExecutor C API.

Related Code:

itex/core/devices/  SE_InitPlugin()

When a program runs “import tensorflow”, TensorFlow loads the GPU or CPU plugin library from “tensorflow-plugins/” and invokes SE_InitPlugin(). DEVICE_XPU_NAME is “XPU”. When a session is created, “xpu_create_stream_executor” creates the StreamExecutor, which implements memory management (allocate, free, merge), memcpy (host to device, device to host, device to device), stream (queue) management, event management, timers, and so on.

For more details, click here.

More DPC++/SYCL functions, for example “aligned_alloc_device” and “aligned_alloc_host”, are defined in intel/llvm.
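
The loading sequence described above can be pictured with a small, purely conceptual Python model. It is not the StreamExecutor C API: `DeviceRegistry`, `ToyStreamExecutor`, and `se_init_plugin` are hypothetical names chosen for illustration, and only the device name “XPU” and the list of responsibilities (allocate/free, host-to-device and device-to-host copies) come from the text.

```python
"""Conceptual model of plugin-based device registration (not the real C API)."""

DEVICE_XPU_NAME = "XPU"  # device type name taken from the article


class ToyStreamExecutor:
    """Hypothetical stand-in for the object a plugin creates per device."""

    def __init__(self):
        self._buffers = {}      # handle -> bytearray, models device memory
        self._next_handle = 1

    def allocate(self, size):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(size)
        return handle

    def free(self, handle):
        del self._buffers[handle]

    def memcpy_htod(self, handle, host_bytes):
        # Host-to-device copy: write host data into the modeled device buffer.
        self._buffers[handle][: len(host_bytes)] = host_bytes

    def memcpy_dtoh(self, handle, size):
        # Device-to-host copy: read bytes back from the modeled device buffer.
        return bytes(self._buffers[handle][:size])


class DeviceRegistry:
    """Hypothetical framework-side registry that plugins fill in."""

    def __init__(self):
        self._factories = {}

    def register(self, device_type, executor_factory):
        self._factories[device_type] = executor_factory

    def create_executor(self, device_type):
        return self._factories[device_type]()

    def device_types(self):
        return sorted(self._factories)


def se_init_plugin(registry):
    """Toy analogue of the plugin entry point: registers the device type."""
    registry.register(DEVICE_XPU_NAME, ToyStreamExecutor)


registry = DeviceRegistry()
se_init_plugin(registry)                     # corresponds to "import tensorflow"
executor = registry.create_executor("XPU")   # corresponds to session creation
buf = executor.allocate(4)
executor.memcpy_htod(buf, b"itex")
print(registry.device_types(), executor.memcpy_dtoh(buf, 4))
```

The point of the sketch is the direction of control: the framework owns the registry, and the plugin only supplies a factory for its executor, which is why no TensorFlow code has to change when a new device is added.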

Pluggable Graph

TensorFlow provides a plug-in mechanism with a C API for registering custom graph optimizers.

Related Code:

itex/core/graph/  TF_InitGraph()  -> Optimizer_Optimize()
  • RunRemapper

    Finds fusable patterns and modifies the graph.

    For example, “_ITEXFusedAddV2WithSoftmax” implements “AddV2 + Softmax”: “FindAddV2WithSoftmax()” finds the matched pattern, and “AddFusedAddV2WithSoftmaxNode()” modifies the graph.

    Test code: test/tensorflow/python/grappler/

  • RunAutoMixedPrecision

    Following a defined sequence of algorithm steps, converts some nodes to their FP16 or BF16 implementations and inserts Cast nodes.
    For more details, click here.

  • RunOneDnnGraph

    The oneDNN Graph optimizer. It is disabled by default.

  • RunOneDnnLayout

    Rewrites and replaces GPU operators and fused operators.
    Related Code:

    RunOneDnnLayout() -> CheckForNodeRewrite() -> GetRewriteInfo()

  • RunNativeLayout

    Rewrites and replaces CPU operators and fused operators.
    Related Code:

    RunNativeLayout() -> CheckForNodeNativeFormat() -> GetNativeFormatInfo()

    For more details, click here.
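
To make the remapper step concrete, here is a minimal sketch of pattern matching and graph rewriting on a toy graph. It assumes a graph stored as a plain Python dict, which is not how TensorFlow or the extension represent a GraphDef; the function names are hypothetical Python analogues of “FindAddV2WithSoftmax()” and “AddFusedAddV2WithSoftmaxNode()”, and only the fused op name comes from the text.

```python
"""Toy remapper: fuse AddV2 -> Softmax into one node (illustrative only)."""

FUSED_OP = "_ITEXFusedAddV2WithSoftmax"  # op name taken from the article


def consumers(graph, name):
    """Return the names of all nodes that read the output of `name`."""
    return [n for n, node in graph.items() if name in node["inputs"]]


def find_addv2_with_softmax(graph):
    """Find (add, softmax) pairs where the AddV2 feeds only that Softmax."""
    matches = []
    for name, node in graph.items():
        if node["op"] != "Softmax" or len(node["inputs"]) != 1:
            continue
        producer = node["inputs"][0]
        if graph.get(producer, {}).get("op") != "AddV2":
            continue
        # Fusing is only safe if no other node needs the intermediate sum.
        if consumers(graph, producer) == [name]:
            matches.append((producer, name))
    return matches


def add_fused_node(graph, add_name, softmax_name):
    """Replace the matched pair with one fused node that keeps the output name."""
    add_inputs = graph[add_name]["inputs"]
    del graph[add_name]
    graph[softmax_name] = {"op": FUSED_OP, "inputs": list(add_inputs)}


def run_remapper(graph):
    """Collect all matches first, then rewrite, so iteration stays safe."""
    for add_name, softmax_name in find_addv2_with_softmax(graph):
        add_fused_node(graph, add_name, softmax_name)
    return graph


graph = {
    "x": {"op": "Placeholder", "inputs": []},
    "bias": {"op": "Const", "inputs": []},
    "add": {"op": "AddV2", "inputs": ["x", "bias"]},
    "prob": {"op": "Softmax", "inputs": ["add"]},
}
run_remapper(graph)
print(graph["prob"])
```

Keeping the Softmax node's name for the fused node means downstream consumers need no rewiring, which is one common way to keep such a rewrite local.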

Pluggable Kernel

TensorFlow provides a plug-in mechanism with a C API for registering custom kernel and op implementations.
Related Code:

itex/core/kernels/   TF_InitKernel
itex/core/utils/  RegisterCPUKernels()  RegisterGPUKernels()

For more details, click here.
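
The idea behind kernel registration can be sketched as a lookup table keyed by op name and device type. This is a conceptual Python analogue only: `register_kernel` and `run_op` are hypothetical helpers, not the TensorFlow kernel C API or the REGISTER_KERNEL_BUILDER macro, and the Relu kernels are placeholders.

```python
"""Toy kernel registry keyed by (op name, device type); illustrative only."""

_KERNELS = {}


def register_kernel(op_name, device):
    """Decorator that records a kernel function for an (op, device) pair."""
    def decorator(fn):
        key = (op_name, device)
        if key in _KERNELS:
            raise ValueError(f"kernel already registered for {key}")
        _KERNELS[key] = fn
        return fn
    return decorator


def run_op(op_name, device, *args):
    """Look up the kernel registered for this op on this device and call it."""
    try:
        kernel = _KERNELS[(op_name, device)]
    except KeyError:
        raise LookupError(f"no {device} kernel registered for {op_name}") from None
    return kernel(*args)


@register_kernel("Relu", "CPU")
def relu_cpu(values):
    return [max(v, 0.0) for v in values]


@register_kernel("Relu", "GPU")
def relu_gpu(values):
    # A real GPU kernel would launch device code; this toy stays in Python.
    return [v if v > 0.0 else 0.0 for v in values]


print(run_op("Relu", "CPU", [-1.0, 2.0]))
print(run_op("Relu", "GPU", [-1.0, 2.0]))
```

Separating registration from dispatch is what allows CPU and GPU kernels for the same op to live in different source files and be registered by separate functions such as RegisterCPUKernels() and RegisterGPUKernels().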

Pluggable Profiler

TensorFlow provides a plug-in mechanism with a C API for implementing and registering pluggable profilers.

Related Code:

itex/core/profiler/   TF_InitProfiler

For more details, click here.

How to Build Itex from Source Code

Refer to
$ bazel build -c opt --config=gpu  //itex/tools/pip_package:build_pip_package
Changing “-c opt” to “-c dbg” builds debug information into the binary, so that it can be debugged with gdb.
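
When builds are scripted, the switch between the optimized and the debug build can be kept in one place. The helper below is a small sketch that only assembles the command quoted above; `bazel_build_command` is a hypothetical name, and the sketch does not run bazel itself.

```python
"""Assemble the bazel command line quoted in the article (opt or dbg)."""
import shlex

TARGET = "//itex/tools/pip_package:build_pip_package"


def bazel_build_command(debug=False):
    """Return the argv list for an optimized (-c opt) or debug (-c dbg) build."""
    mode = "dbg" if debug else "opt"
    return ["bazel", "build", "-c", mode, "--config=gpu", TARGET]


print(shlex.join(bazel_build_command()))
print(shlex.join(bazel_build_command(debug=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the root of the source tree.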

VERBOSE for Debug

  • Itex Verbose

    $ export ITEX_VERBOSE=1

    The value can be set to 1, 2, 3, or 4. Level 4 dumps the graph.
    Related Code: itex/core/graph/  DumpGraphDefToFile
    For more details, click here.

  • oneDNN Verbose

    $ export ONEDNN_VERBOSE=1

    oneDNN verbose mode traces the execution of oneDNN primitives and collects basic statistics such as execution time and primitive parameters. When verbose mode is enabled, oneDNN prints this information to `stdout`.
    The oneDNN version is set here.
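
The same two variables can be set from Python instead of the shell. The sketch below assumes that they are read when the libraries are loaded, so they should be set before `import tensorflow`; `enable_verbose` is a hypothetical helper, and the demo writes into an empty dict rather than the real environment.

```python
"""Set the verbose-logging variables before TensorFlow is imported."""
import os


def enable_verbose(itex_level=1, onednn=True, environ=os.environ):
    """Set ITEX_VERBOSE (1 to 4) and optionally ONEDNN_VERBOSE=1."""
    if itex_level not in (1, 2, 3, 4):
        raise ValueError("ITEX_VERBOSE level must be 1, 2, 3, or 4")
    environ["ITEX_VERBOSE"] = str(itex_level)
    if onednn:
        environ["ONEDNN_VERBOSE"] = "1"
    return environ


# Demo on a scratch dict; call enable_verbose(4) to change os.environ itself.
env = enable_verbose(itex_level=4, environ={})
print(env)
```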

How to Write Custom Op

  • Refer to ResizeBilinear op

    ResizeBilinearOp CPU and GPU implementations and REGISTER_KERNEL_BUILDER
    Register the op's inputs, attrs, and so on: Register_ITEXResizeBilinearOp

    Add Register_ITEXResizeBilinearOp() or Register_OneDnnResizeBilinearOp() to RegisterOps()

    itex/core/kernels/gpu/BUILD (for GPU)
    itex/core/kernels/cpu/BUILD (for CPU)

    Add the kernel implementation to the build system
    Add the op to the build system
    itex/core/graph/onednn_layout/  Map ResizeBilinear to the oneDNN GPU implementation
    itex/core/graph/native_layout/  Map ResizeBilinear to the oneDNN CPU implementation

    Build the whole code as described in “How to Build Itex from Source Code”. The kernel implementation above is compiled into the GPU or CPU plugin library.

    TensorFlow's standard resize_bilinear op is mapped to a oneDNN implementation in RunOneDnnLayout, or to a CPU implementation in RunNativeLayout, both described under “Pluggable Graph”.

  • Test resize_bilinear
    We can call TensorFlow's standard tf.image.resize_bilinear interface, or call resize_bilinear through load_ops_library. Both calls reach the same underlying implementation; turn on verbose logging to see the details.

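    import tensorflow as tf
    import numpy as np
    from intel_extension_for_tensorflow.python.ops.load_ops_library import load_ops_library

    # Session.run() needs graph mode; TensorFlow 2.x runs eagerly by default.
    tf.compat.v1.disable_eager_execution()

    resize_shape = (10, 10)
    # 1x2x2x1 input with two diagonal pixels set to 5.0.
    a = np.ones((1, 2, 2, 1), dtype=np.float32)
    a[0, 0, 0, 0] = 5.0
    a[0, 1, 1, 0] = 5.0
    b = tf.constant(a, dtype=tf.float32)
    # Standard TensorFlow interface.
    c = tf.compat.v1.image.resize_bilinear(b, resize_shape)
    # Same op called through the extension's ops library.
    d = load_ops_library.resize_bilinear(b, resize_shape)
    with tf.compat.v1.Session() as sess:
        np_c = sess.run(c)
        print(np_c[0, :, :, 0])
        np_d = sess.run(d)
        print(np_d[0, :, :, 0])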
  • Refer to ItexRnn op
    itex/core/kernels/gpu/
    itex/core/kernels/gpu/rnn_ops.h
    itex/core/kernels/gpu/
    itex/core/kernels/gpu/rnn_ops_gpu.h

    Define RnnOp and RnnGradOp, and register “ItexRnn” and “ItexRnnGrad” with REGISTER_KERNEL_BUILDER(“ItexRnn”)

    itex/core/ops/  Register the ITEXRnnOp inputs, attrs, and so on
    itex/core/kernels/gpu/BUILD  Add the GPU kernel implementation to the build system
    itex/core/ops/BUILD  Add the op implementation to the build system
    itex/core/ops/  itex/core/ops/op_init.h

    Add Register_ITEXRnnOp() and Register_ITEXRnnGradOp() to RegisterOps()

    itex/python/ops/  Register ItexRnnGrad
    itex/python/ops/  Register the ItexLSTM class and the custom Python API in the custom itex package
  • Test itex_rnn
    import tensorflow as tf
    import intel_extension_for_tensorflow as itex
    inputs = tf.random.normal([32, 10, 8])
    lstm = itex.ops.ItexLSTM(4)
    output = lstm(inputs)
    lstm = itex.ops.ItexLSTM(4, return_sequences=True, return_state=True)
    whole_seq_output, final_memory_state, final_carry_state = lstm(inputs)
  • Refer to _ITEXFusedAddV2WithSoftmax

    Define AddV2WithSoftmaxOp and register “_ITEXFusedAddV2WithSoftmax” with REGISTER_KERNEL_BUILDER(“_ITEXFusedAddV2WithSoftmax”)

    itex/core/kernels/gpu/softmax_op_functor.h  Kernel implementation AddV2WithSoftmaxFunctor
    itex/core/ops/  Register inputs, attrs, and so on
    itex/core/kernels/gpu/BUILD  Add the GPU kernel implementation to the build system
    itex/core/ops/BUILD  Add the op implementation to the build system
    Add Register_ITEXFusedAddV2WithSoftmaxOp to RegisterOps

    Find the AddV2 + Softmax pattern and modify the graph

    test/tensorflow/python/grappler/  Test case
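
When testing a fused op such as the AddV2 + Softmax fusion above, a useful check is that the fused result matches the unfused one. The following pure-Python reference is a sketch of that check under simple assumptions (2-D inputs of equal shape, softmax over the last axis, no broadcasting); it does not call the extension, and `fused_add_softmax` is a hypothetical stand-in for the fused kernel.

```python
"""Pure-Python reference: softmax(a + b) over the last axis of 2-D lists."""
import math


def add_v2(a, b):
    """Elementwise sum of two equally shaped 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(rows):
    """Row-wise softmax; subtracts the row max for numerical stability."""
    out = []
    for row in rows:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def fused_add_softmax(a, b):
    """Add and normalize row by row, without materializing the full sum."""
    out = []
    for ra, rb in zip(a, b):
        summed = [x + y for x, y in zip(ra, rb)]
        m = max(summed)
        exps = [math.exp(v - m) for v in summed]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


a = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
b = [[0.5, 0.5, 0.5], [1.0, 2.0, 3.0]]
unfused = softmax(add_v2(a, b))
fused = fused_add_softmax(a, b)

# The fused path must agree with the unfused reference element by element.
assert all(
    math.isclose(u, f, rel_tol=1e-12)
    for ru, rf in zip(unfused, fused)
    for u, f in zip(ru, rf)
)
print([round(v, 4) for v in fused[1]])
```

In a real test the reference side would be the unfused TensorFlow ops and the other side the graph after the remapper has run, compared with a tolerance suited to the data type.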

More References