Instruction Manual of Ninf-G

These guidelines provide detailed examples for using Ninf-G. In addition to the instructions on Ninf-G given in Chapters 2 to 11, also see the sample programs.


Table of contents


1. Basic information

Ninf-G supports the development and execution of parallel task applications that run on a computational Grid consisting of multiple clusters. Ninf-G is built on top of the Globus Toolkit. The batch scheduler, the Globus Toolkit, and Ninf-G must all be properly installed and operational on every cluster. An outline of each is given below.

1.1 Batch system

A batch system executes, one after another, the jobs that users place in its queue. It is introduced to provide the following functions:

These functions are important for managing and using large-scale clusters; in fact, most large-scale clusters have introduced a batch system and require users to submit all work through it. The Portable Batch System (PBS) and Sun Grid Engine (SGE) are widely used batch systems.

1.2 Globus Toolkit

The Globus Toolkit is a software package that provides UNIX commands and C and Java APIs implementing some of the functions necessary for developing and executing Grid software and applications. The Globus Toolkit is currently released as Versions 2, 3, and 4. Ninf-G Version 2 is implemented on Globus Toolkit Version 2 and on the Pre-WS GRAM of Globus Toolkit Versions 3 and 4.

With Pre-WS GRAM, the Globus Gatekeeper usually runs on the front-end (entrance) host of each cluster, and the default job manager it spawns starts jobs on the computation hosts through a batch system such as PBS or SGE. In other words, unless a job manager is explicitly specified, jobs are started on the computation hosts through the batch system. Although the Globus Toolkit libraries are installed on the computation hosts as well, the Globus Gatekeeper does not run there; the computation hosts are therefore used through the Gatekeeper on the front-end host and its job manager.

1.3 Ninf-G

Ninf-G supports the development and execution of programs that use the Grid Remote Procedure Call (GridRPC) on a Grid. This document covers Ninf-G Version 2 (Ninf-G2). The C/Java APIs of the Globus Toolkit are primitive, which makes application development with them difficult. Ninf-G2, as middleware layered on top of the Globus Toolkit, hides the complexity of the Grid and the Globus Toolkit and assists the development and execution of applications that use distributed computational resources on the Grid. For details on Ninf-G, see Chapter 2 "How to use Ninf-G" of this document and the "Ninf-G Users' Manual" (http://ninf.apgrid.org/documents/ng2-manual/user-manual.html).

1.3.1 What can Ninf-G do?

Ninf-G allows you to carry out the following tasks.

1.3.2 Unsupported functions in Ninf-G

Simply put, Ninf-G does not provide the following functions. If they are needed, they must be implemented in the programs that use Ninf-G.


2. How to use Ninf-G

In Chapter 3 and the following chapters, the actual use of Ninf-G is described through the sample programs. First, a sequential C program that does not use Ninf-G is rewritten into a program that does, and the use of Ninf-G is then refined step by step.

In this and the following chapters, use the sample programs to check the procedure.

2.1 Technical terms and definitions

Here we define the technical terms used in this and the following chapters and needed for working with Ninf-G.

User:

The person who creates Ninf-G based programs.

RPC function:

In Ninf-G, an RPC function is a function that actually carries out numerical computation. It is executed on a computational server and consumes computational resources (CPU time and memory) there. Users create RPC functions themselves and can also use (share) RPC functions that other users have created. (A function that requires substantial computational resources is the kind that should be made an RPC function.)

Ninf-G client:

A program or process that calls RPC functions and controls the entire computation. The Ninf-G client program is created by the user.

Ninf-G Executable, Ninf-G server program:

Ninf-G Executable is a program that performs numerical computation as an RPC function. It is started when a function handle is created by the Ninf-G client; once started, it waits for a request to begin executing (computing) the RPC function.

There are two types of Ninf-G Executable: one defines only a single RPC function and does not retain state between calls (i.e., the result of the previous call is not kept); the other can define more than one RPC function (method) and retains the result of previous calls. The latter is called the Ninf-G remote object.

Client machine:

A machine on which users execute the Ninf-G client program

Server machine:

A machine on which users execute the Ninf-G server program

Function handle:

grpc_function_handle_t type data prepared or created by the Ninf-G client to call an RPC function. It is an abstraction of the connection (communication path) between the Ninf-G Executable and the Ninf-G client.

A computational server name must be given to create a function handle. A function handle has a one-to-one correspondence with the Ninf-G Executable process started on the computational server.

Object handle:

grpc_object_handle_t_np type data prepared or created by the Ninf-G client to create a Ninf-G remote object and call its methods. It is an abstraction of the connection (communication path) between the Ninf-G remote object and the Ninf-G client.

A computational server name must be given to create an object handle. An object handle has a one-to-one correspondence with the Ninf-G Executable process started on the computational server.

IDL file:

IDL stands for Interface Description Language. An IDL file defines the types and order of the inputs and outputs of each RPC function call.

In Ninf-G IDL, object files (ending in .o) and other files can be specified so that the routine that actually implements the function described in the IDL is linked in and executed.

The Ninf-G server program and the local LDIF file are generated from this user-created IDL file.

Local LDIF file:

A file storing the information, such as argument order, needed to call the user-created RPC function.

It is created when the IDL file is compiled on the server machine, and it is used by the client, which loads it via the client configuration file.

In the command-line examples that follow, lines start with "%" or "$". These symbols represent the shell's command prompt: a line starting with "%" is executed on csh (C shell) or tcsh, and a line starting with "$" on sh (Bourne shell) or bash.

2.2 Assumed operating environment

In this document, the following operating environment is assumed.

The machine names above are placeholders and do not exist. Replace them with the actual machine names you use when editing and executing the samples.

2.3 Sample program

The sample programs are listed below.

The directory of each sample program is presented on the right side of the symbol ":".

Chapter  Contents                                                       : Directory name
4        Programs not Ninf-G based                                      : serial
5        Test of Ninf-G2 on a single machine                            : sync
6        Test of Ninf-G2 with asynchronous calls                        : 1site
7        Test of Ninf-G2 on multiple machines (2 sites)                 : 2sites
         (The samples for Chapters 6 and 7 have the same contents.)
8        Test of Ninf-G2 on multiple machines (3 sites)                 : 3sites
9        Multiple job execution with a single request of job activation : array
10       Using remote object capability of Ninf-G                       : object
11       Using MPI                                                      : mpi

2.4 Ninf-G user-prepared files and settings

To use Ninf-G, users must prepare the following files; the sample programs include them.

  1. Files necessary for server machine
  2. Files necessary for client machines

Before executing the sample programs, make the following modifications.


3. Configuration

Before using Ninf-G, you need to configure your user environment for the Globus Toolkit and Ninf-G. The configuration is required on every login, but you do not have to set it manually each time if you put the steps in a file that the shell reads at login, such as .cshrc, .login, .profile, or .bashrc.

3.1 Configuration of Globus Toolkit

For the Globus Toolkit user environment, see the Globus Toolkit manual. For Globus Toolkit Version 2, for example, the configuration is set in the following steps.

(Configuration for sh or bash)

$ GPT_LOCATION=[directory where GPT is installed]
$ GLOBUS_LOCATION=[directory where Globus Toolkit is installed]
$ export GPT_LOCATION GLOBUS_LOCATION
$ . $GLOBUS_LOCATION/etc/globus-user-env.sh

(Configuration for csh or tcsh)

% setenv GPT_LOCATION [directory where GPT is installed]
% setenv GLOBUS_LOCATION [directory where Globus Toolkit is installed]
% source $GLOBUS_LOCATION/etc/globus-user-env.csh

3.2 Configuration of Ninf-G

User configuration of Ninf-G should be set in the following steps.

(Configuration for sh or bash)

$ NG_DIR=[directory where Ninf-G is installed]
$ export NG_DIR
$ . $NG_DIR/etc/ninfg-user-env.sh

(Configuration for csh or tcsh)

% setenv NG_DIR [directory where Ninf-G is installed]
% source $NG_DIR/etc/ninfg-user-env.csh


4. Programs not Ninf-G based

A sample program that does not use Ninf-G is available in the directory "serial".

This is a simple program that computes pi (the circle ratio) with the Monte Carlo method. In the Monte Carlo method, dots are placed at random within a square whose side length is 1, and for each dot we check whether its distance from the origin is less than 1. Repeating these steps many times yields an approximation of pi.

Pi is estimated from the following equation:

  pi = 4 x (number of dots whose distance from the origin is less than 1)
         / (total number of dots)

With this method, the precision of the computed value of pi depends on the number of dots; the number of dots must be increased to improve the precision.

Placing dots within the square requires computing power, and more power is needed to achieve a higher-precision result. Ninf-G uses RPC to bring more computing power to bear.

When the non-Ninf-G program pi_serial.c is executed directly, only one CPU is available as a computational resource. Ninf-G, however, allows multiple computational servers to be used simultaneously in parallel, so more dots can be produced in the same computation time and, consequently, a more precise result can be obtained.

A function is well suited to RPC when it can start its computation using only the information passed as input, like the pi_trial() function used in this program.
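As a concrete illustration, a self-contained sketch of such a trial function is shown below. The signature and names are illustrative assumptions; the sample's actual pi_trial() may differ.

```c
#include <stdlib.h>

/* Place "times" random dots in the unit square and count how many
 * fall within distance 1 of the origin.  Everything the function
 * needs arrives through its arguments, which is what makes it a
 * good candidate for RPC.  (Illustrative sketch; the signature of
 * the sample's pi_trial() may differ.) */
long pi_trial(int seed, long times)
{
    long i, counter = 0;

    srand((unsigned int)seed);
    for (i = 0; i < times; i++) {
        double x = (double)rand() / (double)RAND_MAX;
        double y = (double)rand() / (double)RAND_MAX;
        if (x * x + y * y < 1.0)
            counter++;
    }
    return counter;
}
```

The caller then estimates pi as 4.0 * pi_trial(seed, times) / times.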

Execute the program in the following steps.

  1. Change the directory.
    % cd serial
    
  2. Compile.
    % make
    

The executable file "pi_serial" is created.

  3. Execute.
    % ./pi_serial 100000
    

The computed value of pi is displayed as the result. Here 100000 is the number of dots; you may enter an arbitrary number.


5. Test of Ninf-G2 on a single machine

First, the operation of Ninf-G is tested on a single machine as the simplest test case.

A server program is executed with a client program. The client and the server use the same machine.

Sample program : sync
Server machine : client.example.org (Step 5.1)
Client machine : client.example.org (Step 5.2)

Move to the sample program directory before proceeding.

% cd sync

5.1 Server setup

  1. Compiling of IDL file
    % ng_gen pi.idl
    
  2. Creation of Ninf-G Executable
    % make -f pi.mak
    

5.2 Client setup

  1. Compiling of client program
    % ng_cc -o pi_client_sync pi_client_sync.c
    

(The above steps, i.e. server setup steps 1 and 2 and client setup step 1, are also written in the Makefile, so simply running the make command performs the same procedure.)

  2. Modification of configuration file

Edit the client.conf file (using a text editor such as vi or emacs).

  3. Execution of client program
    % grid-proxy-init
    'Enter passphrase'
    

The grid-proxy-init command does not need to be run before every execution of the client program; it need not be executed again until the proxy certificate it created expires.

    % ./pi_client_sync 10000 client.example.org
    

pi_client_sync receives the number of dots and the server host name as arguments.
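For reference, the configuration file edited in step 2 might look roughly like the sketch below. This is an assumption-laden sketch, not the shipped file: the section and attribute names should be checked against the Ninf-G Users' Manual, and the local LDIF file name must match the one generated on your machine.

```
<LOCAL_LDIF>
    filename    pi.client.example.org.ngdef
</LOCAL_LDIF>

<SERVER>
    hostname    client.example.org
</SERVER>
```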

Note that a single job activation takes several seconds or more. Most of this time is overhead of the Globus Toolkit, such as authentication and authorization by GSI and process invocation via the Globus jobmanager. In Ninf-G, job activation corresponds to the creation of a function handle, so creating a function handle likewise takes several seconds or more.

This means that even if the actual RPC computation takes less than one second, executing the client program takes several seconds or more. For large-scale applications, however, this overhead is negligible.

This waiting time also depends on the batch system and its job queue: under some scheduling policies, a job may not start immediately and may have to wait until other jobs finish.

5.3 Adapting the program to Ninf-G

sync/pi_client_sync, executed in this chapter, is the program explained in Chapter 4 modified into a form suited to Ninf-G.

The Ninf-G server program for calculating pi is defined in the IDL file pi.idl. More precisely, the program is modified so that the pi_trial() function can be used as an RPC function and executed through Ninf-G: the pi_trial() function is called from the IDL file, and the RPC function pi_trial() defined in the IDL is called from the Ninf-G client program.
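The structure of such an IDL file is roughly as follows. This is an illustrative sketch, not a verbatim copy of the sample's pi.idl; the argument names and the linked object file are assumptions.

```
Module pi;

Define pi_trial(IN int seed, IN long times, OUT long *count)
"Monte Carlo pi trial"
Required "pi_trial.o"
{
    /* Body in C: call the linked routine and store its result
     * in the OUT argument. */
    *count = pi_trial(seed, times);
}
```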

A procedure for calling the defined RPC function pi_trial() is also added in the Ninf-G client program. The added procedure calls the following GridRPC API functions.

These GridRPC APIs enable RPC-style computation with Ninf-G.
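Put together, the client side can be sketched as follows. This is illustrative, not the text of pi_client_sync.c: the entry name "pi/pi_trial" and the argument list are assumptions, and error checking is omitted for brevity. The API names follow the GridRPC API as implemented by Ninf-G2; compile such a program with ng_cc, which supplies the grpc.h header and the Ninf-G libraries.

```c
#include <stdio.h>
#include "grpc.h"

int main(int argc, char *argv[])
{
    grpc_function_handle_t handle;
    long times = 10000, count = 0;

    /* Read the client configuration file and initialize Ninf-G. */
    grpc_initialize("client.conf");

    /* Create a function handle: this activates the Ninf-G
     * Executable for "pi/pi_trial" on the given server. */
    grpc_function_handle_init(&handle, argv[1], "pi/pi_trial");

    /* Synchronous RPC: blocks until the server returns the result. */
    grpc_call(&handle, 1, times, &count);

    printf("PI = %f\n", 4.0 * (double)count / (double)times);

    grpc_function_handle_destruct(&handle);
    grpc_finalize();
    return 0;
}
```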


6. Test of Ninf-G2 with the API of asynchronous call on a single machine

Next, the operation of Ninf-G is tested on a single machine with the API of asynchronous call.

A server program is executed with a client program. The client and the server use the same machine.

Sample program : 1site
Server machine : client.example.org (Step 6.1)
Client machine : client.example.org (Step 6.2)

Move to the sample program directory before proceeding.

% cd 1site

6.1 Server setup

  1. Compiling of IDL file
    % ng_gen pi.idl
    
  2. Creation of Ninf-G Executable
    % make -f pi.mak
    

6.2 Client setup

  1. Modification of client program file

Modify pi_client_sync.c, used for the test in Chapter 5, to perform asynchronous computation. (The modified source file is pi_client_single.c.)

  2. Compiling of client program
    % ng_cc -o pi_client_single pi_client_single.c
    

(The above steps, i.e. server setup steps 1 and 2 and client setup step 2, are also written in the Makefile, so simply running the make command performs the same procedure.)

  3. Modification of configuration file

    Edit client.conf file.

  4. Execution of client program
    
    % grid-proxy-init
      'Enter passphrase'
    % ./pi_client_single 10000 client.example.org
    
    

pi_client_single receives the number of dots and the server host name as arguments.

These steps confirm the usage of the asynchronous GridRPC call API in the Ninf-G client program.

The synchronous GridRPC call API used in the previous chapter is blocking: the call does not return until the server completes the computation and the client receives the results. The Ninf-G client can therefore do nothing until the call returns.

The asynchronous GridRPC call is non-blocking: it does not wait for the server to finish the computation. The Ninf-G client can therefore call other functions immediately after calling grpc_call_async(), regardless of the status of the computation on the server. For example, you can call grpc_call_async() for another server right after the first call. To wait for the end of the computation and receive the result from the server, call a waiting function such as grpc_wait() or one of its variants (e.g. grpc_wait_all()).

Asynchronous GridRPC calls are typically used to implement task-parallel applications: issuing multiple asynchronous GridRPC calls simultaneously to different servers parallelizes the tasks.
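The difference from the synchronous form can be sketched as follows (illustrative fragment; the entry name and arguments are assumptions, and error checking is omitted):

```c
grpc_function_handle_t handle;
grpc_sessionid_t id;
long times = 10000, count = 0;

grpc_function_handle_init(&handle, "client.example.org", "pi/pi_trial");

/* Returns immediately; the computation proceeds on the server. */
grpc_call_async(&handle, &id, 1, times, &count);

/* ... the client is free to do other work here ... */

/* Block until the session finishes and the result has arrived. */
grpc_wait(id);
```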


7. Test of Ninf-G2 on multiple machines (2 sites)

Next we perform a test with multiple machines.

A server program is executed with a client program. The client and the server use different machines.

Sample program : 2sites
Server machine : server01.example.org (Step 7.1)
Client machine : client.example.org (Step 7.2)

Move to the sample program directory before proceeding.

% cd 2sites

7.1 Server setup

Log in to the server machine and go to the directory 2sites. Follow the same steps as in 5.1. (Since the Makefile contains the rules, running the make command creates the server program.)

The local LDIF file "pi.server01.example.org.ngdef", whose name contains the server machine name, is created at the same time. This file is used by the Ninf-G client.

7.2 Client setup

  1. Compiling of client program

Log in to the client machine and run the following.

    % ng_cc -o pi_client_single pi_client_single.c
    
  2. Copy of local LDIF file

    The file "pi.server01.example.org.ngdef" in the directory on server01.example.org, where Ninf-G Executable is compiled, must be copied to the client machine client.example.org.

    % scp server01.example.org:/path/to/pi.server01.example.org.ngdef .
    

(Here the scp command is used for copying. It is a program included in SSH (Secure Shell: remote-operation commands over an encrypted communication path). scp copies the remote file into the current directory.)

  3. Modification of configuration file

    Edit client.conf file.

  4. Execution of client program
    % grid-proxy-init
      'Enter passphrase'
    % ./pi_client_single 10000 server01.example.org
    

pi_client_single receives the number of dots and the server host name as arguments.

These steps confirm that the Ninf-G client program running on client.example.org requests server01.example.org to execute the pi library function.

The client machine and the server machine may be geographically far apart. As shown here, Ninf-G makes it possible to compute with remote computational resources over the network.


8. Test of Ninf-G2 on multiple machines (3 sites)

Next we perform a test to call RPC in parallel with multiple computational servers.

Two server programs are executed with one client program. The client and the servers run on different machines. Parallel computation is realized by distributing the work among the computational servers.

Sample program : 3sites
Server machine 1 : server01.example.org (Step 8.1)
Server machine 2 : server02.example.org (Step 8.1)
Client machine : client.example.org (Step 8.2)

Move to the sample program directory before proceeding.

% cd 3sites

8.1 Server setup

Log in to each server machine and follow the steps in 5.1 on each of them.

8.2 Client setup

  1. Modification of client program file

Modify pi_client_single.c, used for the tests in Chapters 6 and 7, to perform multi-server computation. (The modified source file is pi_client_multi.c.)

  2. Compiling client program
    % ng_cc -o pi_client_multi pi_client_multi.c
    
  3. Copy of local LDIF file

The files "pi.server01.example.org.ngdef" and "pi.server02.example.org.ngdef", located in the directories on server01.example.org and server02.example.org where the Ninf-G Executable was compiled, should be copied to the client machine client.example.org.

    % scp server01.example.org:/path/to/pi.server01.example.org.ngdef .
    % scp server02.example.org:/path/to/pi.server02.example.org.ngdef .
    
  4. Modification of configuration file

    Edit client.conf file.

  5. Execution of client program
    % grid-proxy-init
      'Enter passphrase'
    % ./pi_client_multi 10000 \
      server01.example.org server02.example.org
    

(Here \ (the backslash at the end of the line) is a shell feature meaning that the command line continues on the next line.)

pi_client_multi receives the number of dots and the server host names as arguments. More than one server host name can be specified; the computation is carried out on all hosts given in the arguments.

This procedure parallelizes the computation across multiple servers. In this program, the computational servers divide the dots, whose total number is specified as a command-line argument, and carry out the computation in parallel.

The computation time with two servers is therefore shorter than with a single server. If the number of dots is on the order of 10,000, the computation finishes immediately and the effect of parallelization is not visible. But when the computation takes a long time, such as one day, using multiple servers reduces the computation time significantly: the more computational servers, the shorter the computation time.
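The division of work can be sketched as follows (illustrative fragment, assuming the same hypothetical "pi/pi_trial" entry as earlier sketches; error checking omitted):

```c
#define NSERVERS 2
grpc_function_handle_t handles[NSERVERS];
grpc_sessionid_t ids[NSERVERS];
char *servers[NSERVERS] = { "server01.example.org",
                            "server02.example.org" };
long count[NSERVERS];
long total = 10000, share = total / NSERVERS, sum = 0;
int i;

for (i = 0; i < NSERVERS; i++) {
    grpc_function_handle_init(&handles[i], servers[i], "pi/pi_trial");
    /* Each server computes its share of the dots in parallel. */
    grpc_call_async(&handles[i], &ids[i], i, share, &count[i]);
}

/* Wait for every outstanding session, then combine the results. */
grpc_wait_all();
for (i = 0; i < NSERVERS; i++)
    sum += count[i];
printf("PI = %f\n", 4.0 * (double)sum / (double)(share * NSERVERS));
```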


9. Multiple job execution with a single request for job activation

A cluster system usually has multiple computational servers (computational nodes). The jobs submitted to the job submission node of the cluster system are allocated to each computational node and executed in the system.

To use more than one computational node from Ninf-G, a function handle needs to be created for each of the nodes.

Creating one function handle takes at least several seconds, and creating many handles one at a time takes time proportional to their number. In addition, a job manager process is started on the front-end host every time a handle-creation request is submitted, so the front-end host becomes overloaded if tens or hundreds of function handles are created in the naive way, one by one. To prevent this, Ninf-G can create many function handles at once from a single request. With this feature, creation time is saved and a large number of function handles (tens to hundreds) can be created.

A test on using this function is conducted in this chapter.

Sample program : array
Server machine : server01.example.org (Step 9.1)
Client machine : client.example.org (Step 9.2)

Move to the sample program directory before proceeding.

% cd array

9.1 Server setup

Log in to the server machine (the cluster's management node) and follow the steps in 5.1.

9.2 Client setup

  1. Modification of client program file

pi_client_multi.c, used as the sample in Chapter 8, is modified to use the API grpc_function_handle_array_init_np(), which creates more than one function handle at once. (The modified source file is pi_client_array.c.)

  2. Compiling of client program
    % ng_cc -o pi_client_array pi_client_array.c
    
  3. Copy of local LDIF file

    The file "pi.server01.example.org.ngdef" in the directory on server01.example.org, where Ninf-G Executable is compiled, should be copied to the client machine client.example.org.

    % scp server01.example.org:/path/to/pi.server01.example.org.ngdef .
    
  4. Modification of configuration file

    Edit client.conf file.

  5. Execution of client program
    % grid-proxy-init
      'Enter passphrase'
    % ./pi_client_array 10000 server01.example.org 4
    

pi_client_array receives the number of dots, the server host name, and the number of handles as arguments.

The "4" at the end of the command line is the number of function handles to create. It is set to 4 here, but any number may be used. If the cluster system is shared with other users, pay attention to how many handles you create and stay within what the system's policy allows.

This program uses the simultaneous function-handle creation feature to suppress the overhead of handle creation and to keep the number of job managers started on the front-end host as small as possible. This feature is needed to use a large cluster system with tens to hundreds of processors.
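The creation step can be sketched as follows (illustrative fragment; the _np suffix marks Ninf-G-specific extensions to the GridRPC API, and the exact signatures should be checked in the Ninf-G Users' Manual):

```c
#define NHANDLES 4
grpc_function_handle_t handles[NHANDLES];

/* A single job-activation request creates NHANDLES handles at once,
 * instead of paying the activation overhead NHANDLES times. */
grpc_function_handle_array_init_np(handles, NHANDLES,
                                   "server01.example.org", "pi/pi_trial");

/* ... issue grpc_call_async() on each handles[i] as in Chapter 8 ... */

grpc_function_handle_array_destruct_np(handles, NHANDLES);
```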

10. Using remote object capability of Ninf-G

Ninf-G Executable is stateless: even when an RPC function is called repeatedly through the same function handle, the state from the previous call is not retained in the Ninf-G Executable. Data therefore has to be sent for every call, even in repeated computations on the same data.

Ninf-G2 also provides a Ninf-G Executable that can retain state: the remote object. More than one function (method) can be defined in a remote object, and the Ninf-G client can call these methods.

To use the remote object capability, object handles are created instead of function handles. An object handle is grpc_object_handle_t_np type data, an abstraction of the connection (communication path) between the Ninf-G remote object and the Ninf-G client. Once an object handle is created, the corresponding remote object retains its state until the handle is discarded, and various methods that change that state can be called through the handle.

The remote object capability pays off, for example, when the RPC input data is extremely large and the same input is used in every call. If an initialization method that sends the large input data is called just once after the handle is created, and the data is kept in the Ninf-G Executable, the data need not be transferred again in subsequent RPC calls, which improves performance.
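Usage of an object handle can be sketched as follows (illustrative fragment; the class name "pi_object" and the method names are assumptions, and the exact _np signatures should be checked in the Ninf-G Users' Manual):

```c
grpc_object_handle_t_np handle;
long times = 10000, count = 0;

/* Activates the remote object; it keeps its state until destructed. */
grpc_object_handle_init_np(&handle, "server01.example.org", "pi_object");

/* A hypothetical initialization method, called once, e.g. to send
 * large input data that the object retains for later calls. */
grpc_invoke_np(&handle, "init", 1);

/* Subsequent method calls reuse the state kept in the object. */
grpc_invoke_np(&handle, "pi_trial", times, &count);

grpc_object_handle_destruct_np(&handle);
```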

The following is another example of the implementation of methods in Ninf-G class.

Sample program : object
Server machine : server01.example.org (Step 10.1)
Client machine : client.example.org (Step 10.2)

Move to the sample program directory before proceeding.

% cd object

10.1 Server setup

  1. Modification of IDL file

The IDL file pi.idl used in Chapters 5-9 is changed into an object definition. (The modified source file is pi_object.idl.) Edit pi_object.idl.

  2. Compiling of IDL file
    % ng_gen pi_object.idl
    
  3. Creation of Ninf-G Executable
    % make -f pi_object.mak
    

10.2 Client setup

  1. Modification of client program file

    pi_client_multi.c, used as an example in Chapter 8, is modified to a form that executes computation using Ninf-G remote object capability. (The source file obtained in the modification is pi_client_object.c)

  2. Compiling client program
    % ng_cc -o pi_client_object pi_client_object.c
    
  3. Copy of local LDIF file

The file "pi_object.server01.example.org.ngdef" in the directory on server01.example.org, where the Ninf-G Executable was compiled, should be copied to the client machine client.example.org.

    % scp server01.example.org:/path/to/pi_object.server01.example.org.ngdef .
    
  4. Modification of configuration file

    Edit client.conf file.

  5. Execution of client program
    % grid-proxy-init
      'Enter passphrase'
    % ./pi_client_object 10000 server01.example.org
    

    pi_client_object receives arguments of the number of dots and server hostname.


11. Using MPI

Parallel programs written in MPI can be registered as RPC functions in Ninf-G. In addition to the master-worker style task-parallel programming described above, this feature makes it possible to perform fine-grained parallel processing with MPI on the server machines that are called by remote procedure call.

Sample program : mpi
Server machine : server01.example.org (Step 11.1)
Client machine : client.example.org (Step 11.2)

Move to the sample program directory before proceeding.

% cd mpi

11.1 Server setup

  1. Modification of IDL file

    The IDL file pi.idl used in Chapters 5-9 is modified to a form that uses MPI. (The modified source file is pi_mpi.idl)

    Edit pi_mpi.idl.

  2. Compiling of IDL file
    % ng_gen pi_mpi.idl
    
  3. Creation of Ninf-G Executable
    % make -f pi_mpi.mak
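The change made to the IDL file in step 1 can be sketched as follows. This is an illustrative sketch, not the text of pi_mpi.idl: the entry and file names are assumptions, and the exact syntax of the Backend declaration should be checked in the Ninf-G Users' Manual.

```
Module pi_mpi;

Define pi_trial_mpi(IN int seed, IN long times, OUT long *count)
Backend "MPI"
"Monte Carlo pi trial executed as an MPI program"
Required "pi_trial_mpi.o"
{
    /* The linked object is an MPI program; each rank computes a
     * share of the trials and the result is reduced into *count. */
    pi_trial_mpi(seed, times, count);
}
```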
    

11.2 Client setup

  1. Modification of client program file

    pi_client_single.c used as an example in Chapters 6 and 7 is modified to a form that uses MPI for executing computation. (The source file obtained in the modification is pi_client_mpi.c)

  2. Compiling of client program
    % ng_cc -o pi_client_mpi pi_client_mpi.c
    
  3. Copy of local LDIF file

The file "pi_mpi.server01.example.org.ngdef" in the directory on server01.example.org, where the Ninf-G Executable was compiled, should be copied to the client machine client.example.org.

    % scp server01.example.org:/path/to/pi_mpi.server01.example.org.ngdef .
    
  4. Modification of configuration file

    Edit client.conf file.

  5. Execution of client program
    % grid-proxy-init
      'Enter passphrase'
    % ./pi_client_mpi 10000 server01.example.org
    

pi_client_mpi receives the number of dots and the server host name as arguments. The number of CPUs used is set by mpi_runNoOfCPUs in the SERVER section of the configuration file, as described in step 4. Ninf-G provides several other ways to set the number of CPUs for MPI; for details, see the Ninf-G Users' Manual.

MPI is a well-known programming model on existing parallel systems. It has the advantage that existing MPI programs can be executed on a Grid in their original form using a Grid-enabled MPI implementation such as MPICH-G2. On the other hand, MPI requires co-allocation, i.e. the simultaneous start of all MPI processes when the program begins, and co-allocation is difficult to guarantee on a Grid. Features such as process spawning can help to some extent, but in principle the processors and the number of processes MPI uses are determined statically, so the dynamic nature of the Grid does not fit MPI well. One of the most attractive uses of a Grid is running a large application in a large-scale environment for a long time, in which case it is important to consider what happens on system failures, e.g. hardware failure of a computation host. With MPI, however, an error in a single process terminates the entire program execution; in this sense, MPI is not fault-tolerant.

GridRPC is a programming model that circumvents these problems. GridRPC does not require co-allocation, and resources can easily be added or released dynamically by creating and destroying function handles. Moreover, an error on one computational host has no influence on the computation running on other hosts, so failures can be handled simply by aborting the affected computation or by issuing the request again.

MPI is a popular programming model for parallel computation; under the current circumstances, however, even programs ill-suited to MPI are often written in MPI. The most suitable programming model should be chosen according to the characteristics of the application. To implement an application that runs many independent computations on distributed Grid resources, GridRPC should be used (at the very least, MPI should not be). The advantages of GridRPC and MPI can both be exploited in a large-scale application by combining them: use MPI for computations with frequent fine-grained communication and Ninf-G for computations with infrequent communication.


12. FAQ

Past questions and cautions are listed below.

Q1. Is the domain name necessary for the server's name?

Q. Creating a function handle with grpc_function_handle_init() API requires the server's name. Can I omit the domain name in the server's name, or do I have to use the domain name?

A. The domain name is necessary.

When creating a function handle or writing the client configuration file, you must specify the server's name as an FQDN (Fully Qualified Domain Name); "localhost" cannot be used. If the server's name is not properly specified as an FQDN, the Ninf-G API call terminates with an error.

For example, for a server whose host name is server.example.org, you cannot specify the server's name simply as "server"; you have to use "server.example.org".
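For example, assuming the GridRPC API signature grpc_function_handle_init(handle, server_name, function_name) and a hypothetical function name "sample/myfunc", a function handle would be created like this:

  grpc_function_handle_t handle;
  grpc_error_t err;

  /* correct: the server's name is an FQDN */
  err = grpc_function_handle_init(&handle, "server.example.org", "sample/myfunc");

  /* wrong: "server" without the domain name causes an error */
  err = grpc_function_handle_init(&handle, "server", "sample/myfunc");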

Q2. What causes the execution error and log output?

Q. I wrote and executed a Ninf-G based client program, but it did not execute properly. It seems the execution of a Ninf-G API call failed. What is the cause?

A. Ninf-G has a customizable function for log output.

By making log-related settings in the client configuration file and the server configuration file, you can obtain information on errors in the execution of Ninf-G API as well as debugging information. The output can also be directed to a file.

With this log function you can check the execution status of Ninf-G API and trace the cause of the failure in the execution of API.

For details of the log function settings, see the Ninf-G Users' Manual at http://ninf.apgrid.org/documents/ng2-manual/user-manual.html

(Log output setting for client: Section 4.3.9)
(Log output setting for server: Section 3.3.2)
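As an illustrative sketch only (the attribute names loglevel and log_filePath are assumptions; the exact names are given in the manual sections above), log settings in the client configuration file may look like this:

  <CLIENT>
      loglevel      Debug
      log_filePath  /tmp/ninfg-client.log
  </CLIENT>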

Log messages are output in the following format:

"Date and time : Client/Server : Host name : Log level : ID of each data structure : Name of Ninf-G internal function : Log message"

Every Ninf-G API function returns a grpc_error_t value that indicates the success or failure of the execution. User programs should be written to check these values.
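For example, a return value can be checked as follows (a sketch; grpc_error_string() is assumed to be available, as in the GridRPC API, to convert an error code into a message):

  err = grpc_call(&handle, x, &result);
  if (err != GRPC_NO_ERROR) {
      fprintf(stderr, "grpc_call() failed: %s\n", grpc_error_string(err));
      exit(1);
  }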

If the computational function executed by Ninf-G Executable results in a Segmentation fault, enabling the coreDumpSize attribute in the <SERVER> section of the client configuration file may help in obtaining a core dump.

Q3. What does the error message "GRAM Job failed" mean?

Q. grpc_call() or grpc_call_async() is terminated with an error. What is the cause? In the error log, the log message "ngcllJobCallback: GRAM Job failed because ..." appears.

A. The error message "GRAM job failed" indicates an error in GRAM, which is a job execution module of Globus Toolkit used by Ninf-G.

There are various causes of errors in GRAM, and it is difficult to identify a single one. The resolution depends on the error message.

Execute the following command to check if GRAM can be properly used.

% globus-job-run server.example.org /bin/hostname

The above command requests GRAM on server.example.org to execute the command /bin/hostname. If the command is executed successfully, the host name is displayed. Sometimes the error is resolved simply by executing this command.

For the error in GRAM, see the web page of Globus.

GRAM RSL (Resource Specification Language) that Ninf-G client sends to GRAM can be checked if the log level of the client is set to Information or Debug.

Q4. What is the file ~/gram_job_mgr_[numerics].log on server's home directory?

Q. When I used Ninf-G, a file "gram_job_mgr_[numerics].log" was created in the server's home directory. What is this file?

A. This file is a log file from GRAM, the job execution module of Globus Toolkit that Ninf-G uses.

GRAM starts a jobmanager process at the time of job execution. The jobmanager process monitors the job to detect its completion and to perform forced termination.

The jobmanager deletes the log file if the job ends normally, but not if an error occurs or the job execution is canceled. In the latter cases, the GRAM log file remains in the home directory.

If the Ninf-G client is terminated, e.g. by the ^C key, after grpc_function_handle_init() creates a function handle and before grpc_function_handle_destruct() destroys the handle, the job corresponding to the handle is cancelled.

A file gram_job_mgr_[numerics].log remains in the home directory of the server as a result of such abnormal terminations.

You may delete the file if you do not need to find the cause of the job termination.

Q5. Is a sample program for Ninf-G available?

Q. I would like to create a Ninf-G based program, but I don't know how to write one. Is a sample program available?

A. Ninf-G based sample programs are available in the Ninf-G package.

Ninf-G can be downloaded from the web page:

http://ninf.apgrid.org/packages/welcome.shtml

Download Ninf-G version 2 package from the web page.

The package contains four programs.

test/samples/pi : Monte Carlo computation of pi
test/samples/add : Addition of arrays
test/samples/pi_mpi : Computation with the use of MPI
test/diag : Exhaustive test to check that Ninf-G works without problems

Q6. Computation on the server takes a long time. Is that OK?

Q. After making a function handle, I would like to use it for more than a few days. What should I be aware of?

A. Pay attention to the validity period of the temporary certificate. The grid-proxy-init command generates a temporary certificate with a validity period of 12 hours unless specified otherwise.

There are two ways to use function handles for more than 12 hours.

  1. Use the -valid option of the grid-proxy-init command.
  2. Use the Refresh Credentials function of Ninf-G to continue the computation while periodically creating new proxy certificates.
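For example, a temporary certificate valid for one week could be generated as follows (the -valid option is assumed here to take an hours:minutes argument; check the grid-proxy-init documentation of your Globus Toolkit version):

  % grid-proxy-init -valid 168:00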

Q7. Which should I use, non-thread flavor or pthread flavor?

Q. Non-thread flavor and pthread flavor are available in Globus Toolkit, and both can be compiled with Ninf-G. Which should I use?

A. Use the pthread flavor. The Ninf-G client has some features that can be used only when compiled with the pthread flavor, so the client should use the pthread flavor.

If you do not need the features, you can use either flavor. Also either flavor can be used for Ninf-G Executable.

If the user-written program uses pthread, pthread flavor needs to be used in Ninf-G.

Globus Toolkit provides almost the same features with either the non-thread flavor or the pthread flavor, although performance differs between them because the internal implementations are completely different.

Q8. What is MDS?

Q. Ninf-G seems to use MDS. What is MDS? Do I need to use it?

A. MDS is an information service Globus Toolkit provides.

Ninf-G can use MDS. The information on each RPC installed on a computational server can be acquired with MDS. However, the information can also be obtained from a local LDIF file, so MDS is not always necessary. The following are cautions for the use of MDS.

Q9. How do I debug?

Q. It is difficult to debug Ninf-G Executable because it is activated remotely. Is there any convenient method of debugging?

A. Ninf-G has a debugging function for Ninf-G Executable.

Set debug_busyLoop, debug, and coreDumpSize in the section <SERVER> of the client configuration file. For details see the Users' Manual.
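As a sketch (the attribute values shown are placeholders; see the Users' Manual for the exact syntax and permitted values), the debugging attributes are set in the client configuration file like this:

  <SERVER>
      hostname     server.example.org
      debug        true
      coreDumpSize unlimited
  </SERVER>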

The log output function is also available during debugging. Both the client and Ninf-G Executable can output logs, so you can check every change in the operation status of Ninf-G API.

Q10. What does the heartbeat-related warning mean?

Q. The following message appeared in the log output in Ninf-G client. Is there any problem?

"... heartbeat timeout warning (60 seconds) occurred ..."

A. It does not always mean a problem.

The heartbeat is a Ninf-G function for detecting whether the Ninf-G Executable process is running without problems, whether the system is frozen, and whether the network has any problems.

Ninf-G Executable sends the heartbeat periodically to the client. If the network or Ninf-G Executable freezes, the heartbeat cannot be transferred from Ninf-G Executable to the client.

If the heartbeat does not reach the Ninf-G client within a certain period of time, the client considers that the Ninf-G Executable (function handle) has a problem and cannot be used, and raises an error. The function handle then cannot be used any more, and any RPC in execution returns with an error.

Thanks to this function, the Ninf-G client can detect failures without freezing, and users can write trouble-resistant, fault-tolerant programs.

Suppose Ninf-G Executable transmits the heartbeat every 60 seconds. A heartbeat warning appears if the Ninf-G client does not receive the heartbeat for 60 seconds, so the warning appears even when a slight delay in Ninf-G Executable processing delays the periodic heartbeat transmission. This warning therefore does not necessarily indicate a failure.

In the default settings, the interval of the periodic heartbeat transmission is 60 seconds, and the failure judgment is made after the client has not received the heartbeat for 60 seconds x 5 = 300 seconds. (For details see the Users' Manual.)
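As an illustration of the default behavior described above (the attribute names heartbeat and heartbeat_timeoutCount are assumptions; see the Users' Manual for the exact names), the settings in the client configuration file may look like this:

  <SERVER>
      hostname               server.example.org
      heartbeat              60
      heartbeat_timeoutCount 5
  </SERVER>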

If the following message is displayed after the heartbeat warning, it means that the heartbeat transmission has recovered.

"... heartbeat revived again ..."

Q11. Is shared memory available for Ninf-G?

Q. I want to use Ninf-G. Is a shared memory function available in Ninf-G? Is it possible to share the variables in Ninf-G Executable?

A. Ninf-G does not have a shared memory function. A variable sharing function for Ninf-G Executable is also not available.

Q12. Can I set the compiler and linker?

Q. Can I set the compiler and linker?

A. Yes, you can. Compiler, Linker and other declarations are available in the IDL file syntax.
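For example, an IDL file might specify them as follows (a sketch; see the Ninf-G IDL specification for the exact syntax):

  Module sample;
  Compiler "gcc";
  Linker   "gcc";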

Q13. Why are jobs not executed immediately?

Q. I made a function handle, but the job is not executed immediately and the grpc_call() and grpc_call_async() APIs block. How can this happen?

A. It can happen.

For example, if the cluster that the job is submitted to already has many users and other users' jobs are executed first, your job has to wait until those jobs end. So the execution of a job does not always start immediately.

If you want to place a limit on how long to wait for the job to start, set the job_startTimeout attribute in the <SERVER> section of the client configuration file.
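For example, to give up if the job has not started within five minutes (the value is assumed to be given in seconds; see the Users' Manual):

  <SERVER>
      hostname         server.example.org
      job_startTimeout 300
  </SERVER>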

Q14. What is the workDirectory?

Q. What is the workDirectory attribute that can be set in the client configuration file?

A. workDirectory is used to specify the directory in which the Ninf-G Executable process operates.

By default, Ninf-G Executable operates in the directory in which the Ninf-G Executable program is installed. Use workDirectory to change this directory.
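For example (the path is a placeholder):

  <SERVER>
      hostname      server.example.org
      workDirectory /home/user/ninfg-work
  </SERVER>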

If a relative file name is specified in an RPC function or computational function to open a file, the file is searched starting from the workDirectory and opened.

In case of a Segmentation fault due to a computational function failure, a core file, which is a memory image at abnormal termination, is sometimes created. This core file also goes to the directory specified in the workDirectory.

If the specified workDirectory does not exist, an error occurs, so be careful.

Also be careful if the job submission node and the computational node executing the job are different in the cluster system. Due to a restriction of Globus Toolkit, an error occurs if the workDirectory exists on the computational node but not on the job submission node.

Q15. Why aren't stdout and stderr sent to the client?

Q. Character strings are put out from printf() or fprintf() in IDL files or computational functions, but the output result is not sent to the client. There is a setting "redirect_outerr true" in the section <SERVER> in the client configuration file. Why isn't it sent?

A. Ninf-G uses the stdout/stderr transfer function provided by Globus Toolkit GASS module to send stdout and stderr of the computational functions.

The timing of the GASS stdout/stderr transfer depends on the jobmanager that Globus Toolkit GRAM uses, so the transfer timing is not always the same.

For example, if you use the jobmanager-fork, every output line is transferred to the client's stdout and stderr and displayed within several seconds after the line is output.

With some other type of jobmanager, however, the output is all transferred at the end of the job. In this case, the output would not be transferred until grpc_function_handle_destruct() discards the function handle of Ninf-G.

Output is sometimes not transferred when a job is cancelled. In Ninf-G, jobs are cancelled for safety if an error such as an abnormal disconnection occurs in the connection to the corresponding handle.

Output is also not transferred if Ninf-G client is terminated by e.g. exit() before the execution of grpc_function_handle_destruct().

Q16. How can I create a local LDIF file?

Q. In my cluster system, login to the GRAM job submission node (e.g. server01.example.org), which is the designated host for the creation of function handles, is not permitted, so I cannot compile Ninf-G Executable on server01.example.org. As a result, the local LDIF file for server01.example.org is not created. How can I make the file?

A. A local LDIF file is a text file, and users can rewrite its content.

For example, you can obtain a local LDIF file for server01.example.org in the following way.

The local LDIF file created in the above steps should be specified in the Ninf-G client configuration file.

Users take full responsibility for modifications to the local LDIF file.

Q17. When I compiled Ninf-G, I got the message "warning: "GLOBUS_FAILURE" redefined". How can I solve this?

Q. When I compiled Ninf-G or Ninf-G application, I got the following message. How can I solve this?

/usr/Globus/include/gcc32dbg/globus_common.h:290:1: warning: "GLOBUS_FAILURE" redefined

A. It is a bug in Globus Toolkit 2.2.4 and 2.4.0, and it is resolved in Globus Toolkit 2.4.1 or later. (See http://bugzilla.globus.org/globus/show_bug.cgi?id=959 for details.)

So you can solve this problem by updating the Globus Toolkit.


last update : $Date: 2005/10/04 05:47:07 $