Molecule Input Formats

CHEMSMART provides flexible molecule input capabilities, supporting multiple file formats and molecular representations. This page describes the various ways you can create molecules for use in quantum chemistry calculations.

Note

For more information about CHEMSMART’s design and capabilities, see the preprint: https://arxiv.org/abs/2508.20042

Workflow Overview

The following diagram illustrates how different input formats are processed to create molecules for quantum chemistry calculations:

┌──────────────────┐
│  Input Sources   │
└──────────────────┘
        │
        ├─── ASE Atoms
        ├─── Pymatgen Molecule
        ├─── RDKit Molecule
        ├─── File Formats (.xyz, .sdf, .com/.log, .inp/.out, .cdx/.cdxml, etc.)
        └─── PubChem queries (by name, CID, or SMILES)
        │
        ▼
┌──────────────────────────────┐
│  Input Processing            │  ──────►  Creates Molecule Object
│  (chemsmart.io)              │           (parses all supported input formats)
└──────────────────────────────┘
        │
        ▼
┌──────────────────┐
│    Molecule      │  ──────►  Central molecular representation
│     Object       │           (symbols, positions, charge, multiplicity)
└──────────────────┘
        │
        ├─────────────────────────────────┐
        ▼                                 ▼
┌──────────────────┐            ┌──────────────────┐
│ chemsmart.jobs   │            │ chemsmart.jobs   │
│ .gaussian.writer │            │   .orca.writer   │
└──────────────────┘            └──────────────────┘
        │                                 │
        ▼                                 ▼
┌──────────────────┐            ┌──────────────────┐
│  .com/.gjf files │            │    .inp files    │
│  (Gaussian, Inc.)│            │      (ORCA)      │
└──────────────────┘            └──────────────────┘
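
The last two stages of the diagram can be illustrated with a minimal sketch. It is not CHEMSMART's implementation: `SketchMolecule`, `gaussian_input`, and `orca_input` are hypothetical names used only for illustration, and the route and keyword lines are placeholder defaults. The sketch only shows how one object holding symbols, positions, charge, and multiplicity can feed both input formats.

```python
from dataclasses import dataclass


@dataclass
class SketchMolecule:
    """Hypothetical stand-in for the central molecule object (not CHEMSMART's class)."""

    symbols: list
    positions: list  # one (x, y, z) tuple per atom, in Angstrom
    charge: int
    multiplicity: int

    def coordinate_lines(self):
        # One "symbol x y z" line per atom, shared by both writers below.
        return [
            f"{symbol:<2s} {x:14.8f} {y:14.8f} {z:14.8f}"
            for symbol, (x, y, z) in zip(self.symbols, self.positions)
        ]


def gaussian_input(mol, route="# opt", title="sketch"):
    """Lay out route, title, charge/multiplicity, and geometry as Gaussian input."""
    lines = [route, "", title, "", f"{mol.charge} {mol.multiplicity}"]
    lines += mol.coordinate_lines()
    lines.append("")  # blank line that terminates the geometry section
    return "\n".join(lines) + "\n"


def orca_input(mol, keywords="! Opt"):
    """Lay out keywords and an inline '* xyz charge multiplicity' block as ORCA input."""
    lines = [keywords, f"* xyz {mol.charge} {mol.multiplicity}"]
    lines += mol.coordinate_lines()
    lines.append("*")  # closes the coordinate block
    return "\n".join(lines) + "\n"


water = SketchMolecule(
    symbols=["O", "H", "H"],
    positions=[(0.0, 0.0, 0.1173), (0.0, 0.7572, -0.4692), (0.0, -0.7572, -0.4692)],
    charge=0,
    multiplicity=1,
)
print(gaussian_input(water))
print(orca_input(water))
```

Keeping charge and multiplicity on the molecule object, rather than in each writer, is what lets a single parsed structure be written to either program.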

Supported File Formats

Coordinate Files

XYZ Files

Standard XYZ format for atomic coordinates:

# Single structure
chemsmart sub -s server gaussian -p project -f molecule.xyz -c 0 -m 1 opt

# Multi-structure file (select 5th structure)
chemsmart sub -s server gaussian -p project -f molecules.xyz -i 5 -c 0 -m 1 opt

Note

XYZ files do not contain charge or multiplicity information. You must specify these using -c and -m flags.
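
To see why charge and multiplicity must come from the command line, it helps to look at what a multi-structure XYZ file actually holds. The sketch below is a simplified parser written for illustration, not CHEMSMART's reader; `split_xyz_frames` is a hypothetical name. Each frame yields only a comment line, element symbols, and coordinates.

```python
def split_xyz_frames(text):
    """Split concatenated XYZ text into a list of (comment, atoms) frames."""
    lines = text.splitlines()
    frames = []
    pos = 0
    while pos < len(lines):
        if not lines[pos].strip():
            pos += 1  # tolerate blank lines between frames
            continue
        n_atoms = int(lines[pos])  # first line of a frame: atom count
        if pos + 1 >= len(lines):
            raise ValueError("XYZ frame is missing its comment line")
        comment = lines[pos + 1]  # second line: free-form comment
        atoms = []
        for raw in lines[pos + 2 : pos + 2 + n_atoms]:
            symbol, x, y, z = raw.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        if len(atoms) != n_atoms:
            raise ValueError("XYZ frame has fewer atoms than declared")
        frames.append((comment, atoms))
        pos += 2 + n_atoms
    return frames


SAMPLE = """\
2
H2, frame 1
H 0.0 0.0 0.0
H 0.0 0.0 0.74
2
H2, frame 2
H 0.0 0.0 0.0
H 0.0 0.0 0.80
"""

frames = split_xyz_frames(SAMPLE)
print(len(frames), frames[1][0])
```

Nothing in the parsed frames encodes a charge or spin state, so those values have to be supplied separately.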

SDF Files

Structure-Data File (SDF) format with 2D or 3D coordinates:

chemsmart sub -s server gaussian -p project -f molecule.sdf -c 0 -m 1 opt

Gaussian Files

Gaussian Input Files (.com, .gjf)

Gaussian input files contain route, charge, multiplicity, and coordinates:

chemsmart sub -s server gaussian -p project -f input.com opt

Tip

CHEMSMART reads the existing charge and multiplicity from Gaussian input files. Override them with -c and -m if needed.

Gaussian Output Files (.log, .out)

Use optimized structures from Gaussian output:

# Use last structure from optimization
chemsmart sub -s server gaussian -p project -f opt.log sp

# Use specific structure (e.g., 10th structure from scan)
chemsmart sub -s server gaussian -p project -f scan.log -i 10 sp

ORCA Files

ORCA Input Files (.inp)

ORCA input files with coordinates and calculation settings. The structure is read from the ORCA file; the example below submits it as a Gaussian job:

chemsmart sub -s server gaussian -p project -f input.inp opt

ORCA Output Files (.out)

Extract structures from ORCA calculations:

chemsmart sub -s server gaussian -p project -f orca_opt.out sp

ChemDraw Files

ChemDraw XML (.cdxml) and Binary (.cdx) Files

CHEMSMART supports reading molecular structures directly from ChemDraw files, including organometallic complexes with aromatic ligands such as Cp, Cp*, and benzene rings.

# Organic molecule
chemsmart sub -s server gaussian -p project -f molecule.cdxml -c 0 -m 1 opt

# Organometallic complex (charge and multiplicity must be specified explicitly)
chemsmart sub -s server gaussian -p project -f ferrocene.cdxml -c 0 -m 1 opt

Tip

You can submit a Gaussian optimization directly from a ChemDraw file in a single command:

chemsmart sub -s server gaussian -p project -f molecule.cdxml -c 0 -m 1 opt

This will:

  1. Read the molecular structure from the ChemDraw file

  2. Generate 3D coordinates using RDKit’s EmbedMolecule

  3. Create a Gaussian input file with specified charge and multiplicity

  4. Submit the geometry optimization job to the HPC cluster

Note

  • Both binary (.cdx) and XML-based (.cdxml) ChemDraw formats are supported.

  • RDKit is used internally to parse ChemDraw files and generate 3D coordinates.

  • For multi-molecule ChemDraw files, use -i to select a specific molecule.

  • 3D coordinates are automatically generated from 2D structures.

  • Reading binary .cdx files requires Open Babel (obabel) to be installed. If Open Babel is not available, save the file as .cdxml instead.

  • Charge and multiplicity of organometallic complexes are not inferred from the ChemDraw file – always specify -c and -m explicitly.

Example: Multi-molecule ChemDraw file

# Select the 2nd molecule from a ChemDraw file with multiple structures
chemsmart sub -s server gaussian -p project -f molecules.cdxml -i 2 -c 0 -m 1 opt

For full details on organometallic complex support and its restrictions, see ChemDraw Organometallic Complex Files.

Molecular Databases

PubChem Integration

Query PubChem directly by name, CID, or SMILES:

# By compound name
chemsmart sub -s server gaussian -p project -P water -c 0 -m 1 -l water opt

# By CID (Compound ID)
chemsmart sub -s server gaussian -p project -P 962 -c 0 -m 1 -l water opt

# By SMILES string
chemsmart sub -s server gaussian -p project -P "O=Cc1ccccc1" -c 0 -m 1 -l benzaldehyde opt

Note

  • When using -P (PubChem), the -l flag specifies the output filename.

  • PubChem queries support compound names, CIDs (Compound IDs), and SMILES strings.

  • SMILES strings are processed by PubChem to retrieve 3D structures and generate coordinates.
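
For orientation, the three query types map onto the namespaces of PubChem's public PUG REST interface. The sketch below only builds the request URL for a 3D SDF record and does not contact the network. It is an illustration of the PubChem side, not of how CHEMSMART issues its queries; `pubchem_3d_sdf_url` is a hypothetical helper, and treating an all-digit query as a CID is an assumption of this sketch.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_3d_sdf_url(identifier, namespace=None):
    """Build a PUG REST URL requesting a 3D SDF record for one compound."""
    identifier = str(identifier)
    if namespace is None:
        # Assumption of this sketch: all-digit queries are CIDs, anything else a name.
        namespace = "cid" if identifier.isdigit() else "name"
    if namespace not in {"cid", "name", "smiles"}:
        raise ValueError(f"unsupported namespace: {namespace}")
    # Percent-encode characters such as '=' or '#' that occur in SMILES strings.
    encoded = quote(identifier, safe="")
    return f"{PUG_REST}/compound/{namespace}/{encoded}/SDF?record_type=3d"


print(pubchem_3d_sdf_url("water"))
print(pubchem_3d_sdf_url("962"))
print(pubchem_3d_sdf_url("O=Cc1ccccc1", namespace="smiles"))
```

A SMILES string has to be labelled explicitly here, because a short SMILES such as `CCO` cannot be told apart from a compound name by inspection alone.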

ASE Database Files (.db, .traj)

Use structures from ASE database or trajectory files:

# From ASE database
chemsmart sub -s server gaussian -p project -f molecules.db -i 5 -c 0 -m 1 opt

# From ASE trajectory
chemsmart sub -s server gaussian -p project -f md.traj -i 10 -c 0 -m 1 opt

Python Object Integration

CHEMSMART’s Molecule class provides seamless integration with popular Python chemistry libraries:

From ASE Atoms

from ase import Atoms
from chemsmart.io.molecules.structure import Molecule

# Create ASE Atoms object (water; positions in Å, ordered O, H, H)
atoms = Atoms('OH2', positions=[[0.0, 0.0, 0.0], [0.757, 0.586, 0.0], [-0.757, 0.586, 0.0]])

# Convert to CHEMSMART Molecule
molecule = Molecule.from_ase_atoms(atoms)

From RDKit Mol

from rdkit import Chem
from rdkit.Chem import AllChem
from chemsmart.io.molecules.structure import Molecule

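# Create RDKit molecule from SMILES and add explicit hydrogens,
# which the 3D embedding needs to produce a complete structure
rdkit_mol = Chem.AddHs(Chem.MolFromSmiles('CCO'))
AllChem.EmbedMolecule(rdkit_mol)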

# Convert to CHEMSMART Molecule
molecule = Molecule.from_rdkit_mol(rdkit_mol)

Aromaticity Detection (is_aromatic)

The is_aromatic property detects aromaticity by converting the molecule to an RDKit representation and checking whether any atom both carries the aromatic flag and belongs to a ring. This guards against false positives in acyclic molecules (e.g. H₂O, MgI₂) that can arise when the geometry-based bond-order heuristic assigns a bond order of 1.5 to short single bonds.

Note

Known limitations of aromaticity detection:

  • Bond orders are inferred purely from 3D geometry (interatomic distances), not from an electronic structure calculation. This means the detection is model-dependent and may not match formal aromaticity criteria in all cases.

  • Edge cases such as the cyclopropenyl cation (aromatic) versus the cyclopropenyl radical (non-aromatic) may not be distinguished correctly, because the outcome depends on how bond orders and electron counts are assigned from the geometry alone.

  • For borderline or unusual systems (strained rings, metal-organic frameworks, non-Kekulé structures, etc.) the result should be treated as a heuristic estimate rather than a definitive answer.

  • If precise aromaticity information is required, consider constructing the RDKit molecule directly from a SMILES string or from an output file that encodes explicit bond orders.

From Pymatgen

CHEMSMART molecules can be converted to Pymatgen format directly; the reverse direction goes through ASE, as described in the note below:

from chemsmart.io.molecules.structure import Molecule

# Convert CHEMSMART Molecule to Pymatgen
molecule = Molecule.from_filepath('input.xyz')
pymatgen_mol = molecule.to_pymatgen()

Note

For converting Pymatgen molecules to CHEMSMART, you can use the ASE Atoms adaptor as an intermediate format.

Best Practices

Charge and Multiplicity

Always specify charge and multiplicity for:

  • XYZ files

  • SDF files

  • ASE database/trajectory files

  • PubChem queries

  • ChemDraw files

# Correct usage with charge and multiplicity
chemsmart sub -s server gaussian -p project -f molecule.xyz -c -1 -m 2 opt
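
A quick sanity check before submitting is that the charge and multiplicity are compatible with the electron count. The sketch below is a generic parity check, not a CHEMSMART feature; the function names and the deliberately small `ATOMIC_NUMBERS` table are assumptions made for illustration.

```python
# Minimal lookup table for illustration; extend it for other elements.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9, "P": 15, "S": 16, "Cl": 17, "Fe": 26}


def electron_count(symbols, charge):
    """Total number of electrons for the given atoms and net charge."""
    return sum(ATOMIC_NUMBERS[symbol] for symbol in symbols) - charge


def is_consistent(symbols, charge, multiplicity):
    """Return True if the multiplicity is possible for this electron count.

    A multiplicity of 2S + 1 implies (multiplicity - 1) unpaired electrons;
    the remaining electrons must pair up, so their number must be even.
    """
    n_electrons = electron_count(symbols, charge)
    unpaired = multiplicity - 1
    return (
        multiplicity >= 1
        and 0 <= unpaired <= n_electrons
        and (n_electrons - unpaired) % 2 == 0
    )


print(is_consistent(["O", "H", "H"], charge=0, multiplicity=1))
print(is_consistent(["O", "H", "H"], charge=0, multiplicity=2))
print(is_consistent(["O", "O"], charge=-1, multiplicity=2))
```

The check only rules out impossible combinations. It cannot tell which of the allowed spin states is the ground state.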

Structure Selection

For multi-structure files, use -i to select a specific structure:

# Select the 15th structure (indices are 1-based)
chemsmart sub -s server gaussian -p project -f scan.log -i 15 sp

Warning

CHEMSMART uses 1-based indexing to match most molecular visualization software, unlike Python’s 0-based indexing.
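
When scripting around the command line, the 1-based convention has to be translated to Python's 0-based lists. The helper below is a minimal sketch with a hypothetical name, `select_structure`. It rejects 0 and negative values outright, which is a choice made for this sketch and not a statement about how CHEMSMART treats such values.

```python
def select_structure(structures, index):
    """Return the structure at a 1-based index, as passed to -i."""
    if index < 1 or index > len(structures):
        raise IndexError(
            f"index {index} is out of range; valid values are 1 to {len(structures)}"
        )
    return structures[index - 1]  # shift to Python's 0-based indexing


frames = ["frame-1", "frame-2", "frame-3"]
print(select_structure(frames, 1))
print(select_structure(frames, 3))
```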

File Format Auto-Detection

CHEMSMART automatically detects file formats based on extensions:

  • .xyz → XYZ format

  • .sdf → SDF format

  • .com, .gjf → Gaussian input

  • .log → Gaussian output

  • .inp → ORCA input

  • .out → ORCA/Gaussian output (auto-detected by reading file header)

  • .cdx, .cdxml → ChemDraw format

  • .db, .traj → ASE database/trajectory

Note

For .out files, CHEMSMART automatically detects whether the file is from ORCA or Gaussian by examining the file header. If detection fails, an error will be raised indicating the unsupported format.

For unsupported extensions, CHEMSMART falls back to ASE’s file reading capabilities.
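
The dispatch described above can be sketched as an extension lookup plus a header check for .out files. This is an illustration, not CHEMSMART's detection code: the function names and format labels are hypothetical, and the two marker strings are assumptions based on the banners that typical Gaussian and ORCA outputs print near the top of the file.

```python
from pathlib import Path

EXTENSION_FORMATS = {
    ".xyz": "xyz",
    ".sdf": "sdf",
    ".com": "gaussian-input",
    ".gjf": "gaussian-input",
    ".log": "gaussian-output",
    ".inp": "orca-input",
    ".cdx": "chemdraw",
    ".cdxml": "chemdraw",
    ".db": "ase-database",
    ".traj": "ase-trajectory",
}

# Assumed banner fragments; real detection logic may look for other text.
GAUSSIAN_MARKER = "Entering Gaussian System"
ORCA_MARKER = "* O   R   C   A *"


def sniff_out_program(header_text):
    """Decide which program wrote an .out file from its first lines."""
    if GAUSSIAN_MARKER in header_text:
        return "gaussian-output"
    if ORCA_MARKER in header_text:
        return "orca-output"
    raise ValueError("unsupported .out file: no known program banner found")


def detect_format(path, header_lines=200):
    """Map a file path to a format label, reading the file only for .out."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".out":
        with path.open(errors="replace") as handle:
            # Read at most header_lines lines; the banner sits near the top.
            header = "".join(line for _, line in zip(range(header_lines), handle))
        return sniff_out_program(header)
    # Unknown extensions are labelled for the ASE fallback.
    return EXTENSION_FORMATS.get(suffix, "ase-fallback")


print(detect_format("molecule.xyz"))
print(detect_format("structure.pdb"))
```

Only the ambiguous .out extension requires opening the file; every other format is resolved from the name alone.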

See Also

For more technical details on the implementation, see the CHEMSMART preprint: https://arxiv.org/abs/2508.20042