These guidelines provide detailed examples for using Ninf-G. In addition to the instructions on Ninf-G given in Chapters 2 to 11, also see the sample programs.
Ninf-G supports the development and execution of task-parallel applications running on a computational Grid consisting of multiple clusters. Ninf-G was developed on top of the Globus Toolkit. The batch scheduler, the Globus Toolkit, and Ninf-G must be properly installed and ready for use on all the clusters. An outline of the batch scheduler, the Globus Toolkit, and Ninf-G is given below.
These functions are important for the management and use of large-scale clusters; in fact, most large-scale clusters have introduced batch systems and require users to submit all work through them. Portable Batch System (PBS) and Sun Grid Engine (SGE) are widely used batch systems.
The Globus Toolkit is a software package that provides UNIX commands and C and Java APIs implementing some of the functions necessary for the development and execution of Grid software and applications. The Globus Toolkit is currently released as Versions 2, 3, and 4. Ninf-G Version 2 is implemented on Globus Toolkit Version 2 and on the Pre-WS GRAM of Globus Toolkit Versions 3 and 4.
With Pre-WS GRAM, the Globus Gatekeeper generally runs on the entrance host of each cluster, and its default job manager activates jobs on the computation hosts through a batch system such as PBS or SGE. In other words, unless a job manager is explicitly specified, jobs are started on the computation hosts through the batch system. Although the Globus Toolkit libraries are installed on the computation hosts as well, the Globus Gatekeeper does not run there; the computation hosts are used through the Gatekeeper on the entrance host and the job manager.
Ninf-G supports the development and execution of programs that use Grid Remote Procedure Call (GridRPC) on the Grid. Here we explain Ninf-G Version 2 (Ninf-G2). Because the C/Java API of the Globus Toolkit is primitive, developing applications directly against it is difficult. Ninf-G2, as middleware layered on the Globus Toolkit, hides the complexity of the Grid and the Globus Toolkit and supports the development and execution of applications that use distributed computational resources on the Grid. For details of Ninf-G, see Chapter 2, "How to use Ninf-G", of this document and the "Ninf-G Users' Manual" (http://ninf.apgrid.org/documents/ng2-manual/user-manual.html).
Ninf-G allows you to carry out the following tasks.
Simply speaking, Ninf-G does not provide the following functions; if necessary, they should be implemented in the programs that use Ninf-G.
In Chapter 3 and the following chapters, the procedure for actual use of Ninf-G is described based on the sample programs. First, we rewrite a sequential, non-Ninf-G C program into a program that uses Ninf-G, and then gradually refine its use of Ninf-G.
In this and the following chapters, use the sample programs to check the procedure.
We define the technical terms that will be used in this and the following chapters and explain the terms necessary for using Ninf-G.
The user is the person who creates Ninf-G-based programs.
In Ninf-G, an RPC function is a function that actually carries out numerical computation. It is executed on a computational server and consumes computational resources (CPU time and memory) on that server. Users create their own RPC functions and can also use (share) RPC functions that other users have made. A function that requires substantial computational resources should be made an RPC function.
It is a program or process that calls RPC functions and controls the entire computation. The Ninf-G client program is created by the user.
Ninf-G Executable is a program that performs numerical computation as an RPC function. It is activated when a function handle is created from the Ninf-G client; once activated, it waits for requests to execute (compute) the RPC function.
There are two types of Ninf-G Executable: one that can define only a single RPC function and does not retain state between calls (i.e. does not keep the result of the previous call), and one that can define more than one RPC function (method) and retains the result of the previous call. The latter is called the Ninf-G remote object.
A machine on which users execute the Ninf-G client program
A machine on which users execute the Ninf-G server program
It is grpc_function_handle_t type data prepared or created by the Ninf-G client to call an RPC function. It is an abstraction of the connection (communication path) between the Ninf-G Executable and the Ninf-G client.
A computational server name must be given when creating a function handle. A function handle has a one-to-one correspondence with the Ninf-G Executable process activated on the computational server.
It is grpc_object_handle_t_np type data prepared or created by the Ninf-G client to create a Ninf-G remote object and call its method functions. It is an abstraction of the connection (communication path) between the Ninf-G remote object and the Ninf-G client.
A computational server name must be given when creating an object handle. An object handle has a one-to-one correspondence with the Ninf-G Executable process activated on the computational server.
IDL is an abbreviation of Interface Description Language. An IDL file defines the input/output types and argument order for an RPC function call.
In Ninf-G IDL, object files (ending with .o) and other files can be specified so that the actual computational procedure is linked into the function described in the IDL.
The Ninf-G server program and local LDIF file are created based on this user-created IDL file.
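As an illustration, an IDL file for the pi_trial() function used in the later chapters might look like the sketch below (modeled on the Ninf-G sample programs; the exact syntax is defined in the Ninf-G Users' Manual):

```
Module pi;

Define pi_trial(IN int seed, IN long times, OUT long *count)
"Monte Carlo trial: place dots at random and count the hits"
Required "pi_trial.o"
{
    /* calls the C function pi_trial() linked in from pi_trial.o */
    *count = pi_trial(seed, times);
}
```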
It is a file storing the information, such as argument order, needed to call the user-created RPC function.
It is created when the IDL file is compiled on the server machine, and is used by loading it from the client configuration file.
In the following command-line explanations, there are lines starting with "%" or "$". These symbols represent the command prompt of different shells: a line starting with "%" is executed on csh (C shell) or tcsh, and a line starting with "$" on sh (Bourne shell) or bash.
In this document, the following operating environment is assumed.
The above machine names are fictitious. Replace them with the actual machine names you use when editing and executing the samples.
The sample programs are listed below.
The directory of each sample program is listed in the right-hand column.
Chapter | Contents | Directory name
---|---|---
4 | Programs not based on Ninf-G | serial
5 | Test of Ninf-G2 on a single machine | sync
6 | Test of Ninf-G2 with asynchronous calls | 1site
7 | Test of Ninf-G2 on multiple machines (2 sites) | 2sites
8 | Test of Ninf-G2 on multiple machines (3 sites) | 3sites
9 | Multiple job execution with a single job-activation request | array
10 | Using the remote object capability of Ninf-G | object
11 | Using MPI | mpi

(The samples for Chapters 6 and 7, 1site and 2sites, have the same contents.)
Users should prepare the following files to use Ninf-G; the sample programs include them.
Before the execution of the sample programs, make the following modification.
You need to configure the user environment for the Globus Toolkit and Ninf-G before you start using Ninf-G. The configuration is necessary on every login, although you do not have to do it manually each time if you put the procedure in a file that the shell loads on login, such as .cshrc, .login, .profile, or .bashrc.
For the user environment of the Globus Toolkit, see the Globus Toolkit manual. For Globus Toolkit Version 2, for example, the configuration is set in the following steps.
(Configuration for sh or bash)
$ GPT_LOCATION=[directory where GPT is installed]
$ GLOBUS_LOCATION=[directory where Globus Toolkit is installed]
$ export GPT_LOCATION GLOBUS_LOCATION
$ . $GLOBUS_LOCATION/etc/globus-user-env.sh
(Configuration for csh or tcsh)
% setenv GPT_LOCATION [directory where GPT is installed]
% setenv GLOBUS_LOCATION [directory where Globus Toolkit is installed]
% source $GLOBUS_LOCATION/etc/globus-user-env.csh
The Ninf-G user configuration is set in the following steps.
(Configuration for sh or bash)
$ NG_DIR=[directory where Ninf-G is installed]
$ export NG_DIR
$ . $NG_DIR/etc/ninfg-user-env.sh
(Configuration for csh or tcsh)
% setenv NG_DIR [directory where Ninf-G is installed]
% source $NG_DIR/etc/ninfg-user-env.csh
The sample program created without using Ninf-G is available in the directory "serial".
This sample program computes pi (the circle ratio) by the Monte Carlo method: dots are placed at random within a square whose side length is 1, and for each dot we check whether its distance from the origin is less than 1. Repeating these steps many times yields an approximation of the circle ratio.
The circle ratio is derived from the following equation:

Circle ratio ≈ 4 × (number of dots whose distance from the origin is less than 1) / (total number of dots)
With this method, the precision of the computed circle ratio depends on the number of dots; the number of dots must be increased to improve the precision.
Placing dots within the square requires computational capacity, and more capacity is needed for a higher-precision result. Ninf-G uses RPC to make more computational capacity available.
When the non-Ninf-G program pi_serial.c is executed by itself, only one CPU is available as computational resource. Ninf-G, however, allows the simultaneous, parallel use of multiple computational servers, so more dots can be produced in the same computation time and, consequently, a more precise result can be obtained.
A function suitable for RPC is one that can carry out its computation using only the information provided as input, like the pi_trial() function used in this program.
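For reference, the core of such a function might look like the following sketch (hypothetical code written for this guide; the actual implementation ships with the sample programs):

```c
#include <stdlib.h>

/* Place "times" random dots in the unit square and count how many fall
 * within distance 1 of the origin.  Everything the function needs is
 * passed in as arguments, which is what makes it suitable for RPC. */
long pi_trial(int seed, long times)
{
    long i, counter = 0;

    srand(seed);
    for (i = 0; i < times; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            counter++;
    }
    return counter;
}
```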
Execute the program in the following steps.
% cd serial
% make
The execution file "pi_serial" is created.
% ./pi_serial 100000
The computed circle ratio is displayed as the result. Here 100000 is the number of dots; you may enter an arbitrary number.
First, the operation of Ninf-G is tested on a single machine as the simplest test case.
A server program is executed with a client program. The client and the server use the same machine.
Sample program | : | sync |
Server machine | : | client.example.org (Step 5.1) |
Client machine | : | client.example.org (Step 5.2) |
Move to the sample program directory before proceeding.
% cd sync
% ng_gen pi.idl
% make -f pi.mak
% ng_cc -o pi_client_sync pi_client_sync.c
(The above steps, i.e. server setup (1) and (2) and client setup (1), are described in the Makefile, so running the make command follows the same procedure.)
Edit the client.conf file. (Use a text editor such as vi or emacs.)
% grid-proxy-init
(Enter the passphrase when prompted.)
The grid-proxy-init command does not need to be run before every execution of a client program; it does not have to be executed again until the proxy certificate it created expires.
% ./pi_client_sync 10000 client.example.org
pi_client_sync receives the number of dots and the server host name in the arguments.
Note that a single job activation takes at least several seconds. Most of this time is overhead of the Globus Toolkit, such as authentication and authorization by GSI and process invocation via the Globus jobmanager. In Ninf-G, job activation corresponds to the creation of a function handle, so creating a function handle takes at least several seconds.
This means that even if the actual RPC computation takes less than one second, executing the client program takes several seconds or more. For large-scale applications, however, this overhead is negligible.
This waiting time also varies with the batch system and its job queue: under some batch-system scheduling policies, job activation may not start immediately and may have to wait until other jobs finish.
sync/pi_client_sync, executed in this chapter, is the program explained in Chapter 4 modified into a form suited to Ninf-G.
The Ninf-G server program for calculating pi is defined in the IDL file pi.idl. More precisely, the program is modified so that the pi_trial() function can be used as an RPC function and executed through Ninf-G: the pi_trial() function is called from the IDL file, and the RPC function pi_trial() defined in the IDL is called from the Ninf-G client program. A procedure for calling the defined RPC function pi_trial() is also added to the Ninf-G client program.
The added procedure calls the following GridRPC API functions:

- grpc_initialize(), grpc_finalize()
- grpc_function_handle_init(), grpc_function_handle_destruct()
- grpc_call()

These GridRPC APIs enable RPC-style computation with Ninf-G.
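Assembled into a skeleton, a synchronous Ninf-G client might look like the sketch below (a simplified illustration only: error checking is omitted, the function name "pi/pi_trial" and the argument list follow the sample IDL, and the exact signatures are defined in the Ninf-G Users' Manual):

```c
#include <stdio.h>
#include <stdlib.h>
#include "grpc.h"

int main(int argc, char *argv[])
{
    grpc_function_handle_t handle;
    long times, count = 0;

    if (argc < 3) {
        fprintf(stderr, "usage: %s <dots> <server>\n", argv[0]);
        return 1;
    }
    times = atol(argv[1]);

    grpc_initialize("client.conf");       /* read the client configuration file */
    grpc_function_handle_init(&handle, argv[2], "pi/pi_trial");

    grpc_call(&handle, 0, times, &count); /* blocks until the result arrives */
    printf("PI = %f\n", 4.0 * count / times);

    grpc_function_handle_destruct(&handle);
    grpc_finalize();
    return 0;
}
```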
Next, the operation of Ninf-G is tested on a single machine with the API of asynchronous call.
A server program is executed with a client program. The client and the server use the same machine.
Sample program | : | 1site |
Server machine | : | client.example.org (Step 6.1) |
Client machine | : | client.example.org (Step 6.2) |
Move to the sample program directory before proceeding.
% cd 1site
% ng_gen pi.idl
% make -f pi.mak
Modify pi_client_sync.c, used for the test in Chapter 5, so that the computation is executed asynchronously. (The modified source file is pi_client_single.c.)
grpc_call() is replaced with grpc_call_async() and grpc_wait_all().
% ng_cc -o pi_client_single pi_client_single.c
(The above steps, i.e. server setup (1) and (2) and client setup (2), are described in the Makefile, so running the make command follows the same procedure.)
Edit client.conf file.
% grid-proxy-init
(Enter the passphrase when prompted.)
% ./pi_client_single 10000 client.example.org
pi_client_single receives the number of dots and the server host name in the arguments.
With these steps, the use of the API for asynchronous GridRPC calls in a Ninf-G client program has been confirmed.
The API for synchronous GridRPC calls used in the previous chapter is a blocking call: it does not return until the server completes the computation and the client receives the results. The Ninf-G client can therefore do nothing until the call returns.
An asynchronous GridRPC call is non-blocking: it does not wait for the server to finish its computation. The Ninf-G client can therefore call other functions immediately after calling grpc_call_async(), regardless of the status of the computation on the server. For example, you can call grpc_call_async() for another server right after the first grpc_call_async(). To wait for the end of the computation and receive the result from the server, call a waiting function such as grpc_wait() or one of its variants (e.g. grpc_wait_all()).
Asynchronous GridRPC calls are usually used to implement task-parallel applications: issuing multiple asynchronous GridRPC calls simultaneously to different servers runs the tasks in parallel.
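A fragment in this style might look like the following sketch (illustrative only; the session-ID type and signatures follow our reading of the GridRPC API, so check the Users' Manual):

```c
#include "grpc.h"

/* Issue one asynchronous pi_trial RPC per server, wait for all of
 * them, then release the handles.  Error checking is omitted. */
void call_servers(char *servers[], int nservers, long times, long counts[])
{
    grpc_function_handle_t handles[nservers];
    grpc_sessionid_t ids[nservers];
    int i;

    for (i = 0; i < nservers; i++) {
        grpc_function_handle_init(&handles[i], servers[i], "pi/pi_trial");
        /* non-blocking: returns at once while the server computes */
        grpc_call_async(&handles[i], &ids[i], i, times / nservers, &counts[i]);
    }

    grpc_wait_all();   /* block until every outstanding session finishes */

    for (i = 0; i < nservers; i++)
        grpc_function_handle_destruct(&handles[i]);
}
```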
Next we perform a test with multiple machines.
A server program is executed with a client program. The client and the server use different machines.
Sample program | : | 2sites |
Server machine | : | server01.example.org (Step 7.1) |
Client machine | : | client.example.org (Step 7.2) |
Move to the sample program directory before proceeding.
% cd 2sites
Log in to the server machine and go to the directory 2sites.
Follow the same steps as in 5.1. (Since the Makefile contains the rule, running the make command will create the server program.)
At the same time, the local LDIF file "pi.server01.example.org.ngdef", whose name contains the server machine name, is created. This file is used by the Ninf-G client.
Log in to the client machine and perform the following operations.
% ng_cc -o pi_client_single pi_client_single.c
The file "pi.server01.example.org.ngdef" in the directory on server01.example.org, where Ninf-G Executable is compiled, must be copied to the client machine client.example.org.
% scp server01.example.org:/path/to/pi.server01.example.org.ngdef .
(Here the scp command is used for copying. It is part of SSH (Secure Shell: remote operation commands with an encrypted communication path). scp copies the remote file into the current directory.)
Edit client.conf file.
% grid-proxy-init
(Enter the passphrase when prompted.)
% ./pi_client_single 10000 server01.example.org
pi_client_single receives the number of dots and the server host name in the arguments.
With these steps, it has been confirmed that the Ninf-G client program started on client.example.org requests server01.example.org to execute the pi library function.
The client machine and the server machine can be geographically far apart. As shown here, Ninf-G can carry out computation using remote computational resources over the network.
Next we perform a test to call RPC in parallel with multiple computational servers.
Two server programs are executed with a client program. The client and the servers use different machines. Parallel computation is realized by distributing the work across the computational servers.
Sample program | : | 3sites |
Server machine 1 | : | server01.example.org (Step 8.1) |
Server machine 2 | : | server02.example.org (Step 8.1) |
Client machine | : | client.example.org (Step 8.2) |
Move to the sample program directory before proceeding.
% cd 3sites
Log in to each server machine and follow the steps in 5.1 on each machine.
Modify pi_client_single.c, used for the tests in Chapters 6 and 7, so that the computation uses multiple servers. (The modified source file is pi_client_multi.c.)
% ng_cc -o pi_client_multi pi_client_multi.c
The files "pi.server01.example.org.ngdef" and "pi.server02.example.org.ngdef" "pi.server01.example.org.ngdef" in the directories respectively on server01.example.org and server02.example.org, where Ninf-G Executable is compiled, should be copied to the client machine client.example.org.
% scp server01.example.org:/path/to/pi.server01.example.org.ngdef .
% scp server02.example.org:/path/to/pi.server02.example.org.ngdef .
Edit client.conf file.
% grid-proxy-init
(Enter the passphrase when prompted.)
% ./pi_client_multi 10000 \
server01.example.org server02.example.org
(The backslash at the end of the line is a shell feature indicating that the command continues on the next line.)
pi_client_multi receives the number of dots and the server host name in the arguments. More than one server host name can be specified, and the computation will be carried out in all the hosts specified in the argument.
This procedure enables parallelization of computation on multiple servers. In this program, the computational servers share the dots, whose total number is specified in the command line argument, and carry out the computation in parallel.
The computation time of this program with two servers is therefore shorter than with a single server. If the number of dots is of the order of 10,000, the computation finishes immediately and the effect of parallelization is not visible. If the computation takes a long time, such as a day, however, multi-server computation reduces the computation time significantly. The more computational servers there are, the shorter the computation time.
A cluster system usually has multiple computational servers (computational nodes). The jobs submitted to the job submission node of the cluster system are allocated to each computational node and executed in the system.
To use more than one computational node from Ninf-G, a function handle needs to be created for each of the nodes.
Creating one function handle takes at least several seconds, and creating many handles takes time proportional to their number. In addition, a job manager process is started on the entrance host for every function-handle creation request, so the entrance host becomes overloaded if several tens or hundreds of function handles are created in the simple way, i.e. one by one. To prevent this, Ninf-G can create multiple function handles with a single creation request. This saves creation time and makes it practical to create a large number (several tens to several hundreds) of function handles.
A test on using this function is conducted in this chapter.
Sample program | : | array |
Server machine | : | server01.example.org (Step 9.1) |
Client machine | : | client.example.org (Step 9.2) |
Move to the sample program directory before proceeding.
% cd array
Log in to the server machine (cluster management node) and follow the steps in 5.1.
pi_client_multi.c, used as the sample in Chapter 8, is modified to use the API grpc_function_handle_array_init_np(), which creates more than one function handle simultaneously. (The modified source file is pi_client_array.c.) The calls to grpc_function_handle_init() are changed to a single call to grpc_function_handle_array_init_np().
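The essential change might look like this fragment (a sketch; the exact signature of the _np call is given in the Users' Manual):

```c
#define N_HANDLES 4

grpc_function_handle_t handles[N_HANDLES];

/* A single request creates all N_HANDLES jobs at once, instead of
 * N_HANDLES separate submissions through the entrance host. */
grpc_function_handle_array_init_np(handles, N_HANDLES,
                                   "server01.example.org", "pi/pi_trial");
```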
% ng_cc -o pi_client_array pi_client_array.c
The file "pi.server01.example.org.ngdef" in the directory on server01.example.org, where Ninf-G Executable is compiled, should be copied to the client machine client.example.org.
% scp server01.example.org:/path/to/pi.server01.example.org.ngdef .
Edit client.conf file.
% grid-proxy-init
(Enter the passphrase when prompted.)
% ./pi_client_array 10000 server01.example.org 4
pi_client_array receives arguments of the number of dots, server hostname, and the number of handles.
"4" at the end of the command line represents the number of function handles to be created. Here the number is set to 4, although it is not necessarily 4. If the cluster system you use is shared with other users, pay attention to the number of the created handles and be sure to use the system insofar as it is allowed.
This program uses "Simultaneous creation function of function handles" to suppress the overhead in the creation of function handles and keep the number of the job managers activated in the entrance host as small as possible. This function is needed for the usage of a large size cluster system with tens to hundreds processors.
Ninf-G Executable is stateless: even when an RPC function is called repeatedly through the same function handle, the state from the previous call is not retained in the Ninf-G Executable. Data therefore has to be sent for each computation, even when repeated computations use the same data.
Ninf-G2 also provides a Ninf-G Executable capable of retaining state, called the remote object. More than one function (method) can be defined in a remote object, and the Ninf-G client can call these methods.
To use the remote object capability, object handles, not function handles, are created. An object handle is grpc_object_handle_t_np type data and an abstraction of the connection (communication path) between the Ninf-G remote object and the Ninf-G client.
Once an object handle is created, the corresponding remote object retains its state until the handle is discarded. Various methods that change that state can be called through the object handle.
The remote object capability is effective, for example, when the input data of an RPC is extremely large and the same input is used in every call. If an initialization method that sends the large input data is called just once after the handle is created, and the data is kept in the Ninf-G Executable, the input does not have to be transferred again in subsequent RPC calls, which improves performance.
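In code, using a remote object might look like the following sketch (hedged: the _np calls are the Ninf-G-specific API as we understand it, and the class name, method names, and variables big_data, big_size, and result are hypothetical):

```c
grpc_object_handle_t_np obj;

/* Create the remote object; the Ninf-G Executable process starts here
 * and keeps its state until the handle is destructed. */
grpc_object_handle_init_np(&obj, "server01.example.org", "sample/sample_object");

/* Send the large input once; it is kept inside the Ninf-G Executable. */
grpc_invoke_np(&obj, "initialize", big_data, big_size);

/* Later calls reuse the stored data instead of resending it. */
grpc_invoke_np(&obj, "compute", &result);

grpc_object_handle_destruct_np(&obj);
```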
The following is another example of how methods can be implemented in a Ninf-G class:

1. Arguments are given; they are stored in the data-holding area.
2. Computation 1 is executed using the data in the holding area; the result is stored back in the holding area and not returned.
3. Computation 2 is executed using the result in the holding area; the final result of Computation 2 is returned to the client.
Sample program | : | object |
Server machine | : | server01.example.org (Step 10.1) |
Client machine | : | client.example.org (Step 10.2) |
Move to the sample program directory before proceeding.
% cd object
The IDL file pi.idl used in Chapters 5-9 is changed into an object definition. (The modified source file is pi_object.idl.) Edit pi_object.idl.
% ng_gen pi_object.idl
% make -f pi_object.mak
pi_client_multi.c, used as an example in Chapter 8, is modified to a form that executes the computation using the Ninf-G remote object capability. (The modified source file is pi_client_object.c.)
% ng_cc -o pi_client_object pi_client_object.c
The file "pi.server01.example.org.ngdef" in the directory on server01.example.org, where Ninf-G Executable is compiled, should be copied to the client machine client.example.org.
% scp server01.example.org:/path/to/pi_object.server01.example.org.ngdef .
Edit client.conf file.
% grid-proxy-init
(Enter the passphrase when prompted.)
% ./pi_client_object 10000 server01.example.org
pi_client_object receives arguments of the number of dots and server hostname.
Parallel programs written in MPI can be provided as RPC functions in Ninf-G. In addition to the master-worker style task-parallel programming described above, this feature allows fine-grained parallel processing with MPI on the server machine that is called by remote procedure call.
Sample program | : | mpi |
Server machine | : | server01.example.org (Step 11.1) |
Client machine | : | client.example.org (Step 11.2) |
Move to the sample program directory before proceeding.
% cd mpi
The IDL file pi.idl used in Chapters 5-9 is modified to a form that uses MPI. (The modified source file is pi_mpi.idl.)
Edit pi_mpi.idl.
Code for exchanging the given arguments and the computation results through MPI communication is added so that the computation is carried out on each MPI rank.
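As a hedged sketch (the actual pi_mpi.idl in the sample differs in detail), the body of such an MPI-enabled computational function might look like this:

```c
#include <mpi.h>

extern long pi_trial(int seed, long times);  /* the sequential trial function */

/* Rank 0 receives the RPC arguments from the Ninf-G client, broadcasts
 * them, every rank computes its share of the dots, and the partial
 * counts are reduced back to rank 0, which returns the total. */
void pi_trial_mpi(int seed, long times, long *count)
{
    int rank, size;
    long my_count, total = 0;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(&times, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    my_count = pi_trial(seed + rank, times / size);

    MPI_Reduce(&my_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        *count = total;
}
```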
% ng_gen pi_mpi.idl
% make -f pi_mpi.mak
pi_client_single.c, used as an example in Chapters 6 and 7, is modified to a form that executes the computation with MPI. (The modified source file is pi_client_mpi.c.) The RPC is still issued with grpc_call_async().
% ng_cc -o pi_client_mpi pi_client_mpi.c
The file "pi.server01.example.org.ngdef" in the directory on server01.example.org, where Ninf-G Executable is compiled, should be copied to the client machine client.example.org.
% scp server01.example.org:/path/to/pi_mpi.server01.example.org.ngdef .
Edit client.conf file.
% grid-proxy-init
(Enter the passphrase when prompted.)
% ./pi_client_mpi 10000 server01.example.org
pi_client_mpi receives the number of dots and the server hostname as arguments. The number of CPUs used here is set by the mpi_runNoOfCPUs attribute in the SERVER section of the configuration file, as described in (4). Ninf-G offers several other ways to set the number of CPUs for MPI; for details, see the Ninf-G Users' Manual.
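For example, the relevant part of the client configuration file might look like this (the attribute name mpi_runNoOfCPUs is taken from the text above; the surrounding <SERVER> syntax follows our reading of the Users' Manual):

```
<SERVER>
    hostname        server01.example.org
    mpi_runNoOfCPUs 4
</SERVER>
```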
MPI is a well-known programming model for existing parallel systems. It has the advantage that existing MPI programs can run on a Grid in their original form using a Grid-enabled MPI implementation such as MPICH-G2. On the other hand, MPI requires co-allocation, i.e. the simultaneous activation of all MPI processes at the start of execution, and co-allocation is difficult to guarantee on a Grid. Functions such as process spawning may help to some extent, but in principle the processors and the number of processes MPI uses are determined statically, so the dynamic nature of the Grid is not entirely compatible with MPI. One of the most attractive features of Grids is running a large application in a large-scale environment for a long time, which makes it important to consider what happens on system failures, e.g. hardware failure of a computation host. With MPI, however, the entire program terminates if an error occurs in a single process; in this sense, MPI is not completely fault-tolerant.
GridRPC is a programming model that circumvents these problems. GridRPC does not require co-allocation, and resources can be added or released fairly easily through the dynamic creation and destruction of function handles. Moreover, an error on one computational host has no influence on computations running on other hosts, so failures can be handled simply by abandoning the affected computation or by reissuing the request.
MPI is a popular programming model for parallel computation; under current circumstances, however, even programs that are not suited to MPI are often written with it. The most suitable programming model should be chosen according to the characteristics of the application. To implement an application that executes multiple independent computations on distributed Grid resources, GridRPC should be used (at least, MPI should not be). The advantages of both can be combined in a large-scale application by using MPI for computations with frequent fine-grained communication and Ninf-G for computations with infrequent communication.
Frequently asked questions and cautions from past use are listed below.
Q. Creating a function handle with the grpc_function_handle_init() API requires the server's name. Can I omit the domain name from the server's name, or do I have to include it?
A. The domain name is necessary.
When creating a function handle or writing the client configuration file, you must specify the server's name as an FQDN (Fully Qualified Domain Name); "localhost" cannot be used. If the server's name is not properly specified as an FQDN, the Ninf-G API call terminates with an error.
For example, for a server whose domain name is server.example.org, you cannot specify the name simply as "server"; you must use "server.example.org".
Q. I wrote and executed a Ninf-G-based client program, but it does not run properly; the execution of a Ninf-G API call seems to have failed. What is the cause?
A. Ninf-G has a customizable function for log output.
By making log-related settings in the client configuration file and the server configuration file, you obtain output about errors in the execution of the Ninf-G APIs as well as debugging information. The output can be directed to a file.
With this log function you can check the execution status of Ninf-G API and trace the cause of the failure in the execution of API.
For details of the log settings, see the Ninf-G Users' Manual at http://ninf.apgrid.org/documents/ng2-manual/user-manual.html (log output settings for the client: Section 4.3.9; for the server: Section 3.3.2).
Log messages are output in the following format:
"Date and time:Client/Server:Host name:Log level: ID of each data structure:Name of Ninf-G internal function:Log message"
Every Ninf-G API call returns a grpc_error_t value that indicates the success or failure of the execution. User programs should be written to check these values.
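For example (a minimal sketch; grpc_error_string() is the GridRPC helper that converts an error code into a message, assuming it is available in your Ninf-G version):

```c
grpc_error_t err;

err = grpc_call(&handle, seed, times, &count);
if (err != GRPC_NO_ERROR) {
    /* report the failure instead of silently continuing */
    fprintf(stderr, "grpc_call failed: %s\n", grpc_error_string(err));
}
```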
If the computational function executed in Ninf-G Executable causes a segmentation fault, enabling the coreDumpSize attribute of the <SERVER> section in the client configuration file can help.
Q. grpc_call() or grpc_call_async() terminates with an error; the error log contains the message "ngcllJobCallback: GRAM Job failed because ...". What is the cause?
A. The error message "GRAM job failed" indicates an error in GRAM, which is a job execution module of Globus Toolkit used by Ninf-G.
GRAM errors have various causes and it is difficult to single one out; the resolution depends on the error message.
Execute the following command to check if GRAM can be properly used.
% globus-job-run server.example.org /bin/hostname
The above command requests GRAM in server.example.org to execute the command /bin/hostname. If the command is executed successfully the host name is displayed. Sometimes the error is fixed with the execution of this command.
For the error in GRAM, see the web page of Globus.
The GRAM RSL (Resource Specification Language) that the Ninf-G client sends to GRAM can be inspected by setting the client's log level to Information or Debug.
Q. When I used Ninf-G, a file "gram_job_mgr_[numerics].log" was created in the server's home directory. What is this file?
A. This file is a log file from GRAM, the job execution module of the Globus Toolkit that Ninf-G uses.
GRAM starts a jobmanager process when the job is executed. The jobmanager monitors the job, detects its completion, and performs forced termination when necessary.
The jobmanager deletes the log file if the job ends normally, but not if an error occurs or the job execution is cancelled. In the latter case, the GRAM log file remains in the home directory.
If the Ninf-G client is terminated, e.g. with the ^C key, after the grpc_function_handle_init() API has created a function handle and before grpc_function_handle_destruct() has destructed it, the job corresponding to the handle is cancelled. A file gram_job_mgr_[numerics].log then remains in the server's home directory as a result of such an abnormal termination.
You may delete the file if you do not need to find the cause of the job termination.
Q. I would like to create a Ninf-G-based program but do not know how to write one. Are sample programs available?
A. Ninf-G based sample programs are available in the Ninf-G package.
Ninf-G can be downloaded from the web page:
http://ninf.apgrid.org/packages/welcome.shtml
Download Ninf-G version 2 package from the web page.
The package contains four programs.
test/samples/pi | : | Monte Carlo computation for obtaining circle ratio |
test/samples/add | : | Addition of arrays |
test/samples/pi_mpi | : | Computation with the use of MPI |
test/diag | : | Exhaustive test to check if Ninf-G can be used with no problem. |
Q. After creating a function handle, I would like to keep using it for more than a few days. What should I be aware of?
A. Pay attention to the valid term of the temporary (proxy) certificate. The grid-proxy-init command generates a temporary certificate whose valid term is 12 hours unless specified otherwise.
There are two ways to use function handles for more than 12 hours. One is to specify a longer valid term with the -valid option of the grid-proxy-init command.
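For example, the following creates a proxy certificate that is valid for 48 hours (the -valid option takes an hours:minutes value):

% grid-proxy-init -valid 48:00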
Q. Non-thread and pthread flavors are available in the Globus Toolkit, and Ninf-G can be compiled with either. Which should I use?
A. Use the pthread flavor. The Ninf-G client has some features that are available only when it is compiled with the pthread flavor, so the client should use the pthread flavor.
If you do not need the features, you can use either flavor. Also either flavor can be used for Ninf-G Executable.
If the user-written program uses pthread, pthread flavor needs to be used in Ninf-G.
The Globus Toolkit provides almost the same features with either the non-thread or the pthread flavor, although performance differs between them because their internal implementations are completely different.
Q. Ninf-G seems to use MDS. What is MDS? Do I need to use it?
A. MDS is an information service Globus Toolkit provides.
Ninf-G can use MDS: the information on each RPC installed on a computational server can be acquired through MDS. However, the same information can also be obtained from a local LDIF file, so MDS is not strictly necessary. The following are cautions for the use of MDS.
The installation of the Ninf-G RPC function information into the Globus Toolkit directory is required for the use of MDS. The information is provided by installing an LDIF file, which the LDAP server internally used by MDS refers to, into var/gridrpc under the Globus Toolkit directory. Note that MDS has a cache and it can take about 10 minutes for the information to be updated, so an information search may fail just after the LDIF file is installed. The Ninf-G command ng_dump_functions can be used to see which RPCs are installed.
The directory where the LDIF file is installed for use of MDS is a shared directory. Each user does not have his/her own directory. So, users are not allowed to create and install RPC functions of the same name in the same server.
You need to be careful not to disturb other users when using the shared directory.
You should know the VO_NAME in advance. To search for information with MDS, you need the VO_NAME set on the server from which you request information; without it you cannot search. If you can log in to the MDS server, you can find the VO_NAME in etc/grid-info-slapd.conf under the Globus Toolkit directory.
Q. It is difficult to debug Ninf-G Executable because it is activated remotely. Is there a convenient way to debug it?
A. Ninf-G has a debugging function for Ninf-G Executable.
Set debug_busyLoop, debug, and coreDumpSize in the section <SERVER> of the client configuration file. For details see the Users' Manual.
The log output function also works during debugging: both the client and Ninf-G Executable can output logs, so every change in the operating status of the Ninf-G APIs can be checked.
Q. The following message appeared in the log output in Ninf-G client. Is there any problem?
"... heartbeat timeout warning (60 seconds) occurred ..."
A. It does not always mean a problem.
Heartbeat is a Ninf-G function for detecting whether the Ninf-G Executable process is running normally, whether the system is frozen, and whether the network has problems.
Ninf-G Executable sends the heartbeat periodically to the client. If the network or Ninf-G Executable freezes, the heartbeat cannot be transferred from Ninf-G Executable to the client.
If the heartbeat does not reach the Ninf-G client within a certain period, the client judges that the Ninf-G Executable (function handle) has a problem and can no longer be used, and raises an error. The function handle then becomes unusable, and any RPC in execution returns with an error.
Thanks to this function, the Ninf-G client can detect failures without hanging, and users can write trouble-resistant, fault-tolerant programs.
Suppose Ninf-G Executable transmits a heartbeat every 60 seconds. A heartbeat warning appears if the Ninf-G client receives no heartbeat for 60 seconds, so the warning can also be triggered by a slight processing delay in Ninf-G Executable that delays the periodic transmission. The warning alone therefore does not necessarily indicate a failure.
In the default setting, the heartbeat transmission interval is 60 seconds, and the failure judgment is made after the client has received no heartbeat for 60 seconds × 5 = 300 seconds. (For details see the Users' Manual.)
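For reference, the relevant part of a client configuration file might look like the sketch below (the attribute names heartbeat and heartbeat_timeoutCount follow our reading of the Users' Manual; verify them before use):

```
<SERVER>
    hostname               server01.example.org
    heartbeat              60
    heartbeat_timeoutCount 5
</SERVER>
```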
If the following message is displayed after the heartbeat warning, it means the recovery of the heartbeat transmission.
"... heartbeat revived again ..."
Q. I want to use Ninf-G. Is a shared memory function available in Ninf-G? Is it possible to share the variables in Ninf-G Executable?
A. Ninf-G does not have a shared memory function. A variable sharing function for Ninf-G Executable is also not available.
Q. Can I set the compiler and linker?
A. Yes, you can. Compiler, linker, and other settings are available in the IDL file syntax.
Q. I created a function handle, but the job is not executed immediately and the grpc_call() and grpc_call_async() APIs stall. How can this happen?
A. It can happen.
For example, if the cluster the job was submitted to already has many users and other users' jobs are executed preferentially, your job has to wait until those jobs end. Job execution therefore does not always start immediately.
If you want to limit how long to wait for the job to start, set the job_startTimeout attribute in the <SERVER> section of the client configuration file.
Q. What is the workDirectory attribute that can be set in the client configuration file?
A. workDirectory specifies the directory in which the Ninf-G Executable process runs.
By default, Ninf-G Executable runs in the directory where the Ninf-G Executable program is installed. Use workDirectory to change this directory.
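A minimal sketch of such a setting, with a hypothetical path, might be:

```
<SERVER>
    hostname      server01.example.org
    workDirectory /tmp/ninfg-work
</SERVER>
```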
If an RPC function or computational function opens a file with a relative file name, the file is searched for starting from the workDirectory.
If a computational function fails with a segmentation fault, a core file (a memory image taken at abnormal termination) is sometimes created; it, too, is placed in the directory specified by workDirectory.
If the specified workDirectory does not exist, an error occurs, so be careful.
Care is also needed when the job submission node and the computational node that executes the job differ in the cluster system: owing to a restriction of the Globus Toolkit, an error occurs if the workDirectory exists on the computational node but not on the job submission node.
Q. Character strings are output with printf() or fprintf() in IDL files or computational functions, but the output does not reach the client, even though "redirect_outerr true" is set in the <SERVER> section of the client configuration file. Why is it not sent?
A. Ninf-G uses the stdout/stderr transfer function provided by Globus Toolkit GASS module to send stdout and stderr of the computational functions.
The timing of the GASS stdout/stderr transfer depends on the jobmanager that Globus Toolkit GRAM uses, so the transfer timing is not always the same.
For example, if you use the jobmanager-fork, every output line is transferred to the client's stdout and stderr and displayed within several seconds after the line is output.
With some other types of jobmanager, however, all the output is transferred at the end of the job. In that case, the output is not transferred until grpc_function_handle_destruct() discards the Ninf-G function handle.
Output is sometimes not transferred when the job is cancelled. Ninf-G cancels jobs for safety if an error, such as an abnormal disconnection, occurs on the connection to the corresponding handle.
Output is also not transferred if the Ninf-G client is terminated, e.g. by exit(), before grpc_function_handle_destruct() is executed.
Q. On my cluster system, login to the GRAM job submission node (e.g. server01.example.org), the designated host for creating function handles, is not permitted, so I cannot compile Ninf-G Executable on server01.example.org and the local LDIF file for it is not created. How can I make the file?
A. The local LDIF file is a text file, and users can edit its contents.
For example, you can obtain a local LDIF file for server01.example.org in the following way.
First, compile the IDL file for the RPC on a host other than server01.example.org (e.g. client.example.org). A local LDIF file for client.example.org is then created.
Next, modify the obtained local LDIF file with a text editor: change every line that contains the host name (from client.example.org to server01.example.org).
The file name of the local LDIF file also contains the host name. This has no influence on client execution, but to prevent confusion, rename the file so that it contains server01.example.org.
The local LDIF file created in the above steps should be specified in the Ninf-G client configuration file.
Take full responsibility for the modification of the local LDIF file.
Q. When I compiled Ninf-G or Ninf-G application, I got the following message. How can I solve this?
/usr/Globus/include/gcc32dbg/globus_common.h:290:1:
warning: "GLOBUS_FAILURE" redefined
A. It is a bug in Globus Toolkit 2.2.4 and 2.4.0 and is resolved in Globus Toolkit 2.4.1 and later. (See http://bugzilla.globus.org/globus/show_bug.cgi?id=959 for details.)
So you can solve this problem by updating the Globus Toolkit.