General Questions

Paramshakti is the name of the supercomputer hosted at IIT Kharagpur.

All IITKGP Faculty Advisers and their research group members working in the HPC domain can get an account, provided their account request form is approved by the PS administration.

A new user can start with the quick start guide at http://www.hpc.iitkgp.ac.in/HPCF/quickGuide.

At present, any user can run jobs under the “standard-low” partition/queue for free. In addition, each faculty group is given a default virtual currency which they can use for running jobs under the standard, hm and gpu partitions/queues.

Users are required to acknowledge the use of PS in all publications, presentations, theses, webpages, etc., by including the following or a similar statement:

“This work used the Supercomputing facility of IIT Kharagpur established under the National Supercomputing Mission (NSM), Government of India and supported by the Centre for Development of Advanced Computing (CDAC), Pune.”

Users are also requested to inform the PS administration of any such outcome, for annual reports, documentation, uploading to the website, etc.

For documentation specific to software applications, see http://hpc.iitkgp.ac.in/HPCF/softwareAvail.

Raise Support ticket https://paramshakti.iitkgp.ac.in/support.

Account Questions

You should contact your Faculty Adviser.

After the faculty member has applied for an account on PS, students may submit the form at http://hpc.iitkgp.ac.in/HPCF/user_form.

No. A graduated student's account will be deleted within 3 months. Please read the user guidelines at http://www.hpc.iitkgp.ac.in/HPCF/userpol.

No. Please read the User guidelines on http://hpc.iitkgp.ac.in/HPCF/userpol

Please refer to the link http://hpc.iitkgp.ac.in/HPCF/accessSystem.

Use the sbalance command to see the account balance.

Your Faculty Adviser is the Slurm account coordinator and can modify the distribution of computing resources among the group members.

Disk Storage Questions

Please use the myquota command and check the "quota" column to see your disk quota. The maximum limits are listed at http://hpc.iitkgp.ac.in/HPCF/Queues.

Please use the myquota command and check the "used" column to see your used disk space.

Each individual case will be reviewed by the PS administration, which will decide how much disk space can be allocated. Raise a support ticket at https://paramshakti.iitkgp.ac.in/support.

To find the directories in your account that take up the most disk space, you can use the du, sort and tail commands. For example, to display the ten largest directories, change to your home directory and then run: du . | sort -n | tail -n 10
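As a runnable sketch of that pipeline (the directory and file names here are hypothetical):

```shell
# Build a small sample tree with one large and one small directory.
mkdir -p proj/big proj/small
dd if=/dev/zero of=proj/big/data.bin bs=1024 count=200 2>/dev/null
echo "tiny" > proj/small/note.txt

# du reports per-directory sizes, sort ranks them numerically,
# and tail keeps only the largest entries (largest last).
du proj | sort -n | tail -n 10
```

The last line of the output is the biggest entry; since du counts subdirectory totals into their parent, the top-level directory always appears last.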

Please use the myquota command and check the "used" column to see your used disk quota.

The commands tar and gzip can be used together to produce compressed archives of entire directory structures. For example, to package a directory structure rooted at src/, use "tar -czvf src.tar.gz src/". The archive can then be unpacked using "tar -xzvf src.tar.gz", and the resulting directory/file structure is identical to the original.
The programs zip, bzip2 and compress can also be used to create compressed file archives. See the man pages of these programs for more details.
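To make the round trip concrete (directory and file names below are hypothetical):

```shell
# Create a sample directory structure to archive.
mkdir -p src/subdir
echo "hello" > src/file.txt
echo "world" > src/subdir/nested.txt

# Package and compress the whole tree into a single archive.
tar -czvf src.tar.gz src/

# Unpack into a separate location and verify the trees are identical.
mkdir -p restore
tar -xzvf src.tar.gz -C restore
diff -r src restore/src && echo "trees match"
```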

Currently, users are limited to 50 GB of space on /home and 2 TB (soft limit) on /scratch. Users can check their quota usage with the following commands:
lfs quota -hu $USER /home
lfs quota -hu $USER /scratch

The following commands list all regular files in a user's directories that are more than 30 days old:
lfs find $HOME -mtime +30 -type f -print | xargs du -sh
lfs find $SCRATCH -mtime +30 -type f -print | xargs du -sh
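Note that lfs find is specific to Lustre file systems such as /home and /scratch on PS; the same idea can be tried on any Linux machine with plain find. A self-contained sketch (file names hypothetical; the backdating relies on GNU touch's -d option):

```shell
# Create one fresh file and one whose modification time is 40 days ago.
mkdir -p demo
touch demo/recent.txt
touch -d "40 days ago" demo/old.txt

# List regular files not modified in the last 30 days, with their sizes.
find demo -mtime +30 -type f -print | xargs -r du -sh
```

Only demo/old.txt is listed; the fresh file falls inside the 30-day window.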

Email Questions

Raise Support ticket https://paramshakti.iitkgp.ac.in/support.

Linux Questions

Linux is an open-source operating system that is similar to UNIX. It is widely used in High Performance Computing.

There are also many tutorials available on the web which can be a good starting point.

SSH Questions

Secure Shell (SSH) is a program to log into another computer over a network, to execute commands in a remote machine, and to move files from one machine to another. It provides strong authentication and secure communications over insecure channels. SSH provides secure X connections and secure forwarding of arbitrary TCP connections.

Do NOT use IP address(es) to access the HPC system! Please use the domain name paramshakti.iitkgp.ac.in (i.e., $ ssh paramshakti.iitkgp.ac.in).

Please refer to http://hpc.iitkgp.ac.in/HPCF/accessSystem.

If you are using a Linux system, use ssh -X <username>@paramshakti.iitkgp.ac.in.
Windows users: please see "X11 Forwarding in PuTTY".

If you are using a Linux system, ensure that the correct search domain and DNS entry exist in /etc/resolv.conf.
You can also check with the CIC Helpdesk for assistance in configuring the correct nameserver IP address for your machine.

[user1@localhost ~]$ cat /etc/resolv.conf
search iitkgp.ac.in

For Windows users: if the machine is not DHCP enabled, go to Control Panel \ Network and Internet \ Network Connections => Properties => Internet Protocol Version 4 (TCP/IPv4) Properties => Advanced => DNS tab => Add DNS servers.

Batch Processing Questions

On all PS systems, batch processing is managed by Slurm. Slurm batch requests (jobs) are shell scripts that contain the same set of commands you would enter interactively. These requests may also include options for the batch system that provide timing, memory, and processor information. For more information, see the general procedure on the website and refer to the man page "man sbatch".
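A minimal sketch of such a request follows (the job name, resource counts and output file name are hypothetical; standard-low is the free partition mentioned earlier). Since the #SBATCH lines are shell comments, the script also runs as an ordinary shell script:

```shell
#!/bin/bash
#SBATCH --job-name=hello          # hypothetical job name
#SBATCH --partition=standard-low  # free partition mentioned above
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks=1                # total number of tasks
#SBATCH --time=00:10:00           # walltime estimate (HH:MM:SS)
#SBATCH --output=hello_%j.out     # stdout file; %j expands to the job ID

# Commands below run exactly as they would interactively.
echo "Job started on $(hostname)"
```

Submit it with "sbatch <scriptname>" and monitor it with "squeue".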

Slurm uses sbatch to submit a batch request, squeue to check its status, and scancel to delete it. For more information, see the http://hpc.iitkgp.ac.in/HPCF/jobSubmissions page.

Yes. See the --mail-user and --mail-type options. Please refer to the Slurm documentation page.

There are numerous reasons why a job might not run even though there appear to be processors and/or memory available. These include:
a. Your account may be at or near the job count or processor count limit for an individual user.
b. Your group/Faculty adviser may be at or near the job count or processor count limit for a group.
c. The scheduler may be trying to free enough processors to run a large parallel job.
d. Your job may need to run longer than the time left until the start of a scheduled downtime.
e. You may have requested a scarce resource or node type, either inadvertently or by design.
For more detail on the scheduling policies, refer to http://hpc.iitkgp.ac.in/HPCF/Queues.

Ideally, it should be at the job output location given in the batch script.

Use the command "sacct -X". For similar job accounting information, refer to "man sacct".

By default, we don't start an X server on the GPU nodes because it impacts computational performance; therefore it is not possible to use them for visualization.

Provide the directive "#SBATCH -p standard-low" in job submission script.

The job priority depends on the job's current wait time, the queue priority, the size of the job, and the job walltime.

The codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.


AssociationJobLimit             The job's association has reached its maximum job count.
AssocGrpNodeLimit               The job requested a number of nodes above that allowed for the entire project/association/group.
AssocMaxNodesPerJobLimit        The job requested a number of nodes above the allowed maximum.
AssocMaxJobsLimit               Generally occurs when you have exceeded the number of jobs allowed to run in the queue.
AssocMaxWallDurationPerJobLimit The job requested a runtime greater than that allowed by the queue.
AssociationResourceLimit        The job's association has reached some resource limit.
AssociationTimeLimit            The job's association has reached its time limit.
BadConstraints                  The job's constraints cannot be satisfied.
BeginTime                       The job's earliest start time has not yet been reached.
Cleaning                        The job is being requeued and is still cleaning up from its previous execution.
Dependency                      This job is waiting for a dependent job to complete.
FrontEndDown                    No front-end node is available to execute this job.
InactiveLimit                   The job reached the system InactiveLimit.
InvalidAccount                  The job's account is invalid.
InvalidQOS                      The job's QOS is invalid.
JobHeldAdmin                    The job is held by a system administrator.
JobHeldUser                     The job is held by the user.
JobLaunchFailure                The job could not be launched. This may be due to a file system problem, an invalid program name, etc.
Licenses                        The job is waiting for a license.
NodeDown                        A node required by the job is down.
NonZeroExitCode                 The job terminated with a non-zero exit code.
PartitionConfig                 The job requested more resources, or the wrong kind of resources, than the partition is configured for.
PartitionDown                   The partition required by this job is in a DOWN state.
PartitionInactive               The partition required by this job is in an Inactive state and not able to start jobs.
PartitionNodeLimit              The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit              The job's time limit exceeds its partition's current time limit.
Priority                        One or more higher-priority jobs exist for this partition or advanced reservation.
Prolog                          Its PrologSlurmctld program is still running.
QOSJobLimit                     The job's QOS has reached its maximum job count.
QOSResourceLimit                The job's QOS has reached some resource limit.
QOSTimeLimit                    The job's QOS has reached its time limit.
ReqNodeNotAvail                 Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified in the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.
Reservation                     The job is waiting for its advanced reservation to become available.
Resources                       The job is waiting for resources to become available.
SystemFailure                   Failure of the Slurm system, a file system, the network, etc.
TimeLimit                       The job exhausted its time limit.
QOSUsageThreshold               The required QOS threshold has been breached.
WaitingForScheduling            No reason has been set for this job yet; waiting for the scheduler to determine the appropriate reason.

Compiling System Questions

Fortran, C, and C++ compilers are available on all Paramshakti systems. For more details, refer to http://hpc.iitkgp.ac.in/HPCF/softwareAvail; for the commands used to invoke the compilers and/or loaders, refer to http://hpc.iitkgp.ac.in/HPCF/jobSubmissions.

Use ‘module avail compiler’ to list the names of all compilers available on Paramshakti.

Although it may be possible to use executables generated on other machines, users are advised to recompile the software in their home directories if it is not available on Paramshakti.

Libraries/Software Questions

Please use the ‘module avail apps’ command to list the applications available on Paramshakti.

Use the command “module avail <software-name>” to list the available versions of a given package.

Use the command ‘module avail libraries’ to see the available libraries. Use the commands ‘module avail lapack’ or ‘module avail atlas’ to check their availability.

The NumPy and SciPy modules are installed with the Python software. See the output of the “module avail python” command to check the available Python versions.

You may install open-source software yourself in your home directory. If you have a license for commercial software that allows all institute users to use it, please raise a ticket providing all details of the software along with the license information. We will make the software available as a module for everyone’s use.

Most packages have an option to install under a normal user account. We avoid installing user software as root.

Modules are used to manage the environment variable settings associated with software packages in a shell-independent way. On Paramshakti, you will by default have modules in your environment for MPI, compilers, and a few other pieces of software. For information on using the module system, see http://hpc.iitkgp.ac.in/HPCF/jobSubmissions.

Performance Analysis Questions

MegaFLOPS/GigaFLOPS/TeraFLOPS/PetaFLOPS are millions/billions/trillions/quadrillions of FLoating-point Operations (calculations) Per Second.

The command "sacct -j 'jobid' --format=user,JobID,JobName,MaxRSS,Elapsed" will give you statistics on completed jobs by job ID. Once your job has completed, you can get additional information that was not available during the run, including run time, memory used, etc. The sample output below illustrates this. See the man page ("man sacct") for how to capture additional job-related information.

$ sacct -j 269667 --format=JobID,Jobname,partition,state,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist,AveDiskWrite
JobID JobName Partition State Elapsed MaxRSS MaxVMSize NNodes NCPUS NodeList AveDiskWrite
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- --------------- --------------
269667 SBBS_dist standard-+ COMPLETED 03:43:01 16 625 cn[285-300]
269667.batch batch COMPLETED 03:43:01 12564K 1530304K 1 40 cn285 3.43M
269667.0 pmi_proxy COMPLETED 03:42:59 17989840K 43527004K 16 16 cn[285-300] 62530.07M

You can list your currently running jobs, and the compute nodes they are running on, with the "squeue" command.

While the job is running, you can log in to those nodes using the command "ssh 'nodename'".

You can see the resource allocation for the job using the command "scontrol show job -dd 'JobID' | grep NumCPUs".

You can see the load average using the command "ssh 'nodename' uptime".

$ squeue
269667 standard- SBBS_dis sandeepa R 2:53:19 16 cn[285-300]

$ scontrol show job -dd 269667|grep NumCPUs
NumNodes=16 NumCPUs=625 NumTasks=625 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

$ ssh cn291 uptime
19:40:07 up 7 days, 10:21, 0 users, load average: 40.21, 40.10, 40.00

The above example shows a total of 625 cores allocated across 16 nodes (nodelist cn[285-300]) for job ID 269667. So, on average, approximately 40 cores are allocated per node, and a random check of node cn291 shows a load average of about 40, which can be considered a 100% efficient job.

For single- or multi-node jobs, the average node load is an important indicator of whether your job runs efficiently, at least with respect to CPU usage. If you use the whole node, the average node load should be close to the number of CPU cores of that node (i.e., 40 for Paramshakti, since each compute node has 40 cores). In some cases it is perfectly acceptable to have a low load, for example if your job is memory-intensive, but in general either the CPU or the memory load should be high.

If you detect an inefficient job, either look for ways to improve its resource usage or request fewer resources in your SLURM batch script. In summary, you should know the rough wall time and resource requirements (cores, memory) of your job; otherwise you waste your allocated resource quota and may also experience longer than necessary queuing wait times.

Other Common Questions

Programs run on the login nodes are subject to strict CPU time limits. To run an application that needs more time, you need to create a batch request. Your batch request should include an appropriate estimate of the wall time your application will need.

Programs run on the login nodes are subject to strict CPU time limits. Because file transfers use encryption, you may hit this limit when transferring a large file. To run longer programs, use the batch system.

It is possible that the program runs out of the default memory (stack) limit. Users are advised to raise the limit with the following command: $ ulimit -s unlimited
If this resolves the issue, the same command should be added to the ~/.bashrc file.
A segmentation fault is most commonly caused by trying to access an array beyond its bounds -- for example, trying to access element 15 of an array with only 10 elements. Unallocated arrays and invalid pointers are other causes. You may wish to debug your program using one of the available tools, such as gdb.

Windows and Mac have different end-of-line conventions for text files than UNIX and Linux systems do, and most UNIX shells (including the ones interpreting your batch script) don't like seeing the extra character that Windows appends to each line or the alternate character used by Mac. You can use the following commands on the Linux system to convert a text file from Windows or Mac format to UNIX format:
$dos2unix myfile.txt
$mac2unix myfile.txt
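If dos2unix happens not to be installed, the standard tr utility can strip the Windows carriage returns; this is a generic alternative, not the PS-documented tool (file names below are hypothetical):

```shell
# Simulate a Windows-format file: each line ends in \r\n.
printf 'line one\r\nline two\r\n' > winfile.txt

# Delete every carriage-return byte, leaving UNIX \n line endings.
tr -d '\r' < winfile.txt > unixfile.txt

# Verify that no carriage returns remain.
if grep -q "$(printf '\r')" unixfile.txt; then echo "still DOS"; else echo "converted"; fi
```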

A text file created on Linux/UNIX will usually display correctly in WordPad but not in Notepad. You can use the following command on the Linux system to convert a text file from UNIX format to Windows format:
$unix2dos myfile.txt

If you are using a Linux system, use ssh -X <username>@paramshakti.iitkgp.ac.in.
Windows users: please see "X11 Forwarding in PuTTY".