Molecule Input Formats
CHEMSMART provides flexible molecule input capabilities, supporting multiple file formats and molecular representations. This page describes the various ways you can create molecules for use in quantum chemistry calculations.
Note
For more information about CHEMSMART’s design and capabilities, see the preprint: https://arxiv.org/abs/2508.20042
Workflow Overview
The following diagram illustrates how different input formats are processed to create molecules for quantum chemistry calculations:
┌──────────────────┐
│ Input Sources │
└──────────────────┘
│
├─── ASE Atoms
├─── Pymatgen Molecule
├─── RDKit Molecule
├─── File Formats (.xyz, .sdf, .com/.log, .inp/.out, .cdx/.cdxml, etc.)
└─── PubChem queries (by name, CID, or SMILES)
│
▼
┌──────────────────────────────┐
│ Input Processing │ ──────► Creates Molecule Object
│ (chemsmart.io) │ (parses all supported input formats)
└──────────────────────────────┘
│
▼
┌──────────────────┐
│ Molecule │ ──────► Central molecular representation
│ Object │ (symbols, positions, charge, multiplicity)
└──────────────────┘
│
├─────────────────────────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ chemsmart.jobs │ │ chemsmart.jobs │
│ .gaussian.writer │ │ .orca.writer │
└──────────────────┘ └──────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ .com/.gjf files │ │ .inp files │
│ (Gaussian, Inc.)│ │ (ORCA) │
└──────────────────┘ └──────────────────┘
Supported File Formats
Coordinate Files
XYZ Files
Standard XYZ format for atomic coordinates:
# Single structure
chemsmart sub -s server gaussian -p project -f molecule.xyz -c 0 -m 1 opt
# Multi-structure file (select 5th structure)
chemsmart sub -s server gaussian -p project -f molecules.xyz -i 5 -c 0 -m 1 opt
Note
XYZ files do not contain charge or multiplicity information. You must specify these using -c and -m flags.
SDF Files
Structure-Data Format files with 2D or 3D coordinates:
chemsmart sub -s server gaussian -p project -f molecule.sdf -c 0 -m 1 opt
Gaussian Files
Gaussian Input Files (.com, .gjf)
Gaussian input files contain route, charge, multiplicity, and coordinates:
chemsmart sub -s server gaussian -p project -f input.com opt
Tip
CHEMSMART reads the existing charge and multiplicity from .com files. Override with -c and -m if needed.
Gaussian Output Files (.log, .out)
Use optimized structures from Gaussian output:
# Use last structure from optimization
chemsmart sub -s server gaussian -p project -f opt.log sp
# Use specific structure (e.g., 10th structure from scan)
chemsmart sub -s server gaussian -p project -f scan.log -i 10 sp
ORCA Files
ORCA Input Files (.inp)
ORCA input files with coordinates and calculation settings:
chemsmart sub -s server gaussian -p project -f input.inp opt
ORCA Output Files (.out)
Extract structures from ORCA calculations:
chemsmart sub -s server gaussian -p project -f orca_opt.out sp
ChemDraw Files
ChemDraw XML (.cdxml) and Binary (.cdx) Files
CHEMSMART supports reading molecular structures directly from ChemDraw files, including organometallic complexes with aromatic ligands such as Cp, Cp*, and benzene rings.
# Organic molecule
chemsmart sub -s server gaussian -p project -f molecule.cdxml -c 0 -m 1 opt
# Organometallic complex (charge and multiplicity must be specified explicitly)
chemsmart sub -s server gaussian -p project -f ferrocene.cdxml -c 0 -m 1 opt
Tip
You can submit a Gaussian optimization directly from a ChemDraw file in a single command:
chemsmart sub -s server gaussian -p project -f molecule.cdxml -c 0 -m 1 opt
This will:
Read the molecular structure from the ChemDraw file
Generate 3D coordinates using RDKit’s EmbedMolecule
Create a Gaussian input file with specified charge and multiplicity
Submit the geometry optimization job to the HPC cluster
Note
Both binary (
.cdx) and XML-based (.cdxml) ChemDraw formats are supported.RDKit is used internally to parse ChemDraw files and generate 3D coordinates.
For multi-molecule ChemDraw files, use
-ito select a specific molecule.3D coordinates are automatically generated from 2D structures.
Reading binary
.cdxfiles requires Open Babel (obabel) to be installed. If Open Babel is not available, save the file as.cdxmlinstead.Charge and multiplicity of organometallic complexes are not inferred from the ChemDraw file – always specify
-cand-mexplicitly.
Example: Multi-molecule ChemDraw file
# Select the 2nd molecule from a ChemDraw file with multiple structures
chemsmart sub -s server gaussian -p project -f molecules.cdxml -i 2 -c 0 -m 1 opt
For full details on organometallic complex support and its restrictions, see ChemDraw Organometallic Complex Files.
Molecular Databases
PubChem Integration
Query PubChem directly by name, CID, or SMILES:
# By compound name
chemsmart sub -s server gaussian -p project -P water -c 0 -m 1 -l water opt
# By CID (Compound ID)
chemsmart sub -s server gaussian -p project -P 962 -c 0 -m 1 -l water opt
# By SMILES string
chemsmart sub -s server gaussian -p project -P "O=Cc1ccccc1" -c 0 -m 1 -l benzaldehyde opt
Note
When using
-P(PubChem), the-lflag specifies the output filename.PubChem queries support compound names, CIDs (Compound IDs), and SMILES strings.
SMILES strings are processed by PubChem to retrieve 3D structures and generate coordinates.
ASE Database Files (.db, .traj)
Use structures from ASE database or trajectory files:
# From ASE database
chemsmart sub -s server gaussian -p project -f molecules.db -i 5 -c 0 -m 1 opt
# From ASE trajectory
chemsmart sub -s server gaussian -p project -f md.traj -i 10 -c 0 -m 1 opt
Python Object Integration
CHEMSMART’s Molecule class provides seamless integration with popular Python chemistry libraries:
From ASE Atoms
from ase import Atoms
from chemsmart.io.molecules.structure import Molecule
# Create ASE Atoms object
atoms = Atoms('H2O', positions=[[0, 0, 0], [0, 0, 1], [0, 1, 0]])
# Convert to CHEMSMART Molecule
molecule = Molecule.from_ase_atoms(atoms)
From RDKit Mol
from rdkit import Chem
from rdkit.Chem import AllChem
from chemsmart.io.molecules.structure import Molecule
# Create RDKit molecule from SMILES
rdkit_mol = Chem.MolFromSmiles('CCO')
AllChem.EmbedMolecule(rdkit_mol)
# Convert to CHEMSMART Molecule
molecule = Molecule.from_rdkit_mol(rdkit_mol)
Aromaticity Detection (is_aromatic)
The is_aromatic property detects aromaticity by converting the molecule to an RDKit representation and checking
whether any atom both carries the aromatic flag and belongs to a ring. This guards against false positives in
acyclic molecules (e.g. H₂O, MgI₂) that can arise when the geometry-based bond-order heuristic assigns a bond order of
1.5 to short single bonds.
Note
Known limitations of aromaticity detection:
Bond orders are inferred purely from 3D geometry (interatomic distances), not from an electronic structure calculation. This means the detection is model-dependent and may not match formal aromaticity criteria in all cases.
Edge cases such as the cyclopropenyl cation (aromatic) versus the cyclopropenyl radical (non-aromatic) may not be distinguished correctly, because the outcome depends on how bond orders and electron counts are assigned from the geometry alone.
For borderline or unusual systems (strained rings, metal-organic frameworks, non-Kekulé structures, etc.) the result should be treated as a heuristic estimate rather than a definitive answer.
If precise aromaticity information is required, consider constructing the RDKit molecule directly from a SMILES string or from an output file that encodes explicit bond orders.
From Pymatgen
CHEMSMART molecules can be converted to and from Pymatgen format:
from chemsmart.io.molecules.structure import Molecule
# Convert CHEMSMART Molecule to Pymatgen
molecule = Molecule.from_filepath('input.xyz')
pymatgen_mol = molecule.to_pymatgen()
Note
For converting Pymatgen molecules to CHEMSMART, you can use the ASE Atoms adaptor as an intermediate format.
Best Practices
Charge and Multiplicity
Always specify charge and multiplicity for:
XYZ files
SDF files
ASE database/trajectory files
PubChem queries
ChemDraw files
# Correct usage with charge and multiplicity
chemsmart sub -s server gaussian -p project -f molecule.xyz -c -1 -m 2 opt
Structure Selection
For multi-structure files, use -i to select a specific structure:
# Use 1-based indexing
chemsmart sub -s server gaussian -p project -f scan.log -i 15 sp
Warning
CHEMSMART uses 1-based indexing to match most molecular visualization software, unlike Python’s 0-based indexing.
File Format Auto-Detection
CHEMSMART automatically detects file formats based on extensions:
.xyz→ XYZ format.sdf→ SDF format.com,.gjf→ Gaussian input.log→ Gaussian output.inp→ ORCA input.out→ ORCA/Gaussian output (auto-detected by reading file header).cdx,.cdxml→ ChemDraw format.db,.traj→ ASE database/trajectory
Note
For .out files, CHEMSMART automatically detects whether the file is from ORCA or Gaussian by examining the file
header. If detection fails, an error will be raised indicating the unsupported format.
For unsupported extensions, CHEMSMART falls back to ASE’s file reading capabilities.
See Also
For more technical details on the implementation, see the CHEMSMART preprint: https://arxiv.org/abs/2508.20042