# Deep learning can accelerate grasp-optimized motion planning

This section describes the methods in DJ-GOMP. Underlying DJ-GOMP is a jerk- and time-optimizing constrained motion planner based on an SQP. Because of the complexity of solving this SQP, computation time can far exceed the trajectory’s execution time. DJ-GOMP uses this SQP on a random set of pick-and-place inputs to generate training data (trajectories) to train a neural network. During pick-and-place operation, DJ-GOMP uses the neural network to compute an approximate trajectory for the given pick and place frames, which it then uses to warm start the SQP.

### Jerk- and time-optimized trajectory generation

To generate a jerk- and time-optimized trajectory, DJ-GOMP extends the SQP formulated in GOMP (2). The solver for this SQP, following the method in TrajOpt (3) and summarized in Algorithm 1, starts with a discretized estimate of the trajectory τ as a sequence of H waypoints after the starting configuration, in which each waypoint represents the robot’s configuration q, velocity v, acceleration a, and jerk j at a moment in time. The waypoints are sequentially separated by tstep seconds. This discretization is collected into x(0), where the superscript represents a refinement iteration. Thus

The choice of H and tstep is application specific, although in physical experiments, we set tstep to match (an integer multiple of) the control frequency of the robot, and we set H such that H · tstep is an estimate of the upper bound of the minimum trajectory time for the workspace and task input distribution.

The initial value of x(0) seeds (or warm starts) the SQP computation. Without the approximation generated by the neural network (e.g., for training data set generation), this trajectory can be initialized to all zeros. In practice, the SQP can converge faster by first computing a trajectory between inverse kinematic solutions to g0 and gH with only the convex kinematic and dynamic constraints (described below).

In each iteration k = (0,1,2, …, m) of the SQP, DJ-GOMP linearizes the nonconvex constraints of obstacles and pick-and-place locations and solves a QP of the following form

where A defines constraints enforcing the trust region, joint limits, and dynamics, and P is defined such that xTPx is a sum-of-squared jerks. To enforce the linearized nonconvex constraints, DJ-GOMP adds constrained nonnegative slack variables penalized using appropriate coefficients in p. As DJ-GOMP iterates over the SQP, it increases the penalty term exponentially, terminating on the iteration m at which x(m) meets the nonconvex constraints.

Algorithm 1: Jerk-limited Motion Plan

Require: x(0) is an initial guess of the trajectory, h + 1 is the number of waypoints in x(0), tstep is the time between each waypoint, g0 and gH are the pick and place frames, βshrink ∈ (0,1), βgrow > 1, and γ > 1

1: μ ← initial penalty multiple

2: ϵtrust ← initial trust region size

3: k ← 0

4: P, p, A, b ← linearize x(0) as a QP

5: while μ < μmaxdo

6:

7:      if sufficient decrease in trajectory cost then

8:        kk + 1                     /*accept trajectory */

9:        ϵtrust ← ϵtrustβgrow      /* grow trust region */

10:       A, b ← update linearization using x(k)

11:    else

12:    ϵtrust ← ϵtrustβshrink      /* shrink trust region */

13:      b ← update trust region bounds only

14:    if ϵtrust < ϵmin_trustthen

15:      μ ← γμ                       /* increase penalty */

16:      ϵtrust ← initial trust region size

17:      p ← update penalty in QP to match μ

18: return x(k)

To enforce joint limits and dynamic constraints, Algorithm 1 creates a matrix A and a vector b that enforce the following linear inequalities on joint limits

$qmin≤qt≤qmax$

$−vmax≤vt≤vmax$

$−amax≤at≤amax$

$−jmax≤jt≤jmax$

and the following equalities that enforce dynamic constraints between variables

$qt+1=qt+tstepvt+12tstep2at+16tstep3jt$

$at+1=at+tstepjt$

In addition, Algorithm 1 linearizes nonconvex constraints by adding slack variables to implement L1 penalties. Thus, for a nonconvex constraint gj(x) ≤ c, the algorithm adds the linearization

$g¯j(x)$

as a constraint in the form

$g¯j(x)−μyj++μyj−≤c$

where μ is the penalty, and the slack variables are constrained such that

$yj+≥0$

and

$yj−≥0$

.

In the QP, obstacle avoidance constraints are linearized on the basis of the waypoints of the current iteration’s trajectory (Algorithm 2). To compute these constraints, the algorithm evaluates the spline

$qspline(s;t)=qt+svt+12s2at+16s3jt$

between each pair of waypoints (xt, xt + 1) against a depth map of obstacles to find the time s ∈ [0, tstep) and corresponding configuration

$q̂t$

that minimizes signed distance separation from any obstacle. In this evaluation, a negative signed distance means that the configuration is in collision. The algorithm then uses this

$q̂t$

to computes a separating hyperplane in the form nTq + d = 0. The hyperplane is either the top plane of the obstacle it is penetrating or the plane that separates

$q̂t$

from the nearest obstacle (see Fig. 8). By selecting the top plane of the penetrated obstacle, this pushes the trajectory above the obstacle, which is a specialization of TrajOpt’s more general obstacle avoidance approach that is useful in bin picking. By selecting the hyperplane of the nearest obstacle, the algorithm keeps the trajectory from entering the obstacle. The linearize constraint for this point is

$nTĴt(k)x̂t(k+1)≥−d−nTp(x̂t(k))+nTĴt(k)x̂t(k)$

where

$Ĵt$

is the Jacobian of the robot’s position at

$q̂t$

. Because

$q̂t$

and

$Ĵt$

are at an interpolated state between optimization variables at xt and xt + 1, linearizing this constraint requires computing the chain rule for the Jacobian

$Ĵt=Jp(q̂t)Jq(s)$

where

$Jp(q̂t)$

is the Jacobian of the position at configuration qt, and Jq(s) is the Jacobian of the configuration on the spline at s

$Jq(s)=[∂p∂qt∂p∂qt+1∂p∂vt∂p∂vt+1]T=[−3s2t2+2s3t3+13s2t2−2s3t3−2s2t+s3t3+ss3t2−s2t]T$

We observe that linearization at each waypoint will safely avoid obstacles with a sufficient buffer around obstacles (e.g., via a Minkowski difference with obstacles); however, slight variations in pick or place frames can shift the alignment of waypoints to obstacles. This shift of alignment (see Fig. 8C) can lead to solutions that vary disproportionately to small changes in input. Although this may be acceptable in operation, it can lead to data that can be difficult for a neural network to learn.

Algorithm 2: Linearize Obstacle-Avoidance Constraint

1: for t ∈ [0, H) do

2:      (nmin, dmin) ← linearize obstacle nearest to p(qt)

3:      qminqt

4:      for all s ∈ [0, tstep) do /* line search s to desired resolution */

5:

$qs←qt+svt+12s2at+16s3jt$

6:        (ns, ds)← linearize obstacle nearest to p(qs)

7:        if

$ns⊤p(qs)+ds

then /* compare signed distance */

8:           (nmin, dmin, qmin) ← (ns, ds, qs)

9:           Jq ← Jacobian of qs

10:     Jp ← Jacobian of position at qmin

11:

$Ĵt←JpJq$

12:

$bt←−dmin−nmin⊤p(qmin)+nmin⊤Ĵtxt−μyt+$

/* lower bound with slack

$yt+$

*/

$(nmin⊤Ĵtxt≥bt)$

and

$(yt+≥0)$

to set of linearconstraints in QP

As with GOMP, DJ-GOMP allows degrees of freedom in rotation and translation to be added to start and goal grasp frames. Adding this degree of freedom allows DJ-GOMP to take a potentially shorter path when an exact pose of the end effector is unnecessary. For example, a pick point with a parallel-jaw gripper can rotate about the axis defined by antipodal contact points (see Fig. 2), and the pick point with a suction gripper can rotate about the normal of its contact plane. Similarly, a task may allow for a place point anywhere within a bounded box. The degrees of freedom about the pick point can be optionally added as constraints that are linearized as

$wmin≤J0(k)q0(k+1)−(g0−p(q0(k)))+J0(k)q0(k)≤wmin$

where

$q0(k)$

and

$J0(k)$

are the configuration and Jacobian of the first waypoint in the accepted trajectory,

$q0(k+1)$

is one of variables the QP is minimizing, and wminwmax defines the twist allowed about the pick point. Applying a similar set of constraints to gH allows degrees of freedom in the place frame as well.

The SQP establishes trust regions to constrain the optimized trajectory to be within a box with extents defined by a shrinking trust region size. Because convex constraints on dynamics enforce the relationship between configuration, velocity, and acceleration of each waypoint, we observe that trust regions only need to be defined as box bounds around one of the three for each waypoint. In experiments, we established trust regions on configurations.

Algorithm 3: Time-optimal Motion Plan

Require: g0 and gH are the start and end frames, γ > 1 is the search bisection ratio

1: Hupper ← fixed or estimated upper limit of maximum time

2: Hlower ← 3

3: vupper ← ∞                         /* constraint violation */

4: while vupper> tolerance do      /* find upper limit */

5:      (xupper, vupper) ← call Alg. 1 with cold-start trajectory for Hupper

6:      Hupper ← max(Hupper + 1, ⌈γ Hupper⌉)

7: while Hlower < Hupperdo        /* search for shortest H */

8:      HminHlower + ⌊(HupperHlower)/γ⌋

9:      (xmid, vmid) ← call Alg. 1 with warm-start trajectory xupper interpolated to Hmid

10:    if vmid≤ tolerance then

11:        (Hupper, xupper, vupper) ← (Hmid, xmid, vmid)

12:    else

13:        HlowerHmid + 1

14: return xupper

To find the minimum time-time trajectory, J-GOMP searches for the shortest jerk-minimized trajectory that solves all constraints. This search, shown in Algorithm 3, starts with a guess of H and then performs an exponential search for the upper bound, followed by a binary search for the shortest H by repeatedly performing the SQP of Algorithm 1. The binary search warm starts each SQP with an interpolation of the trajectory of the current upper bound of H. The search ends when the upper and lower bounds of H are the same.

### Deep learning of trajectories

To speed up motion planning, we add a deep neural network to the pipeline. This neural network treats the trajectory optimization process as a function fτ to approximate

$fτ:SE(3)×SE(3)→ℝH*×n×4$

where the arguments to the function are the pick and place frames, and the output is a discretized trajectory of variable length H* waypoints, each of which has a configuration, velocity, acceleration, and jerk for all n joints of the robot. We assume that the neural network

$f˜τ$

can only approximate fτ, thus

$f˜τ(·)=fτ(·)+E(τ)$

for some unknown error distribution E(τ). Hence, the output of

$f˜τ$

may not be dynamically or kinematically feasible. To address this potential issue, we use the network’s output to warm start a final pass through the SQP. This process can be thought of as polishing the output of the neural network approximation to overcome any errors in the network’s output.

In this section, we describe a proposed neural network architecture, its loss function, training and testing dataset generation, and the training process. Although we posit that a more general approximation could include the amount of pick and place degrees of freedom as inputs, for brevity, we assume that fτ and its neural network approximation are parameterized by a preset amount of pick and place degrees of freedom. In practice, it may also be appropriate to train multiple neural networks for different parameterizations of fτ.

#### Architecture

The deep neural network architecture we propose is depicted in Fig. 3. It consists of an input layer connected through four fully connected blocks to multiple output blocks. The input layer takes in the concatenated grasp frames

$[g0TgHT]T$

. Because the optimal trajectory length H* can vary, the network has multiple output heads for each of the possible values for H*. To select which of the outputs to use, we use a separate classification network with two fully connected layers with one-hot encoding trained using a cross-entropy loss.

We refer to the horizon classification and multiple-output network as a HYDRA (Horizon Yielding Distillation through Retained Activations) network. The network yields both an optimal horizon length and the trajectory for that horizon. To train this network (detailed below), the trajectory output layers’ activation values for horizons not in the training sample are retained using a zero gradient so as to weight the contribution of the layers during backprop to the input layers.

In experiments, a neural network with a single output head was unable to produce a consistent result for predicting varied length horizons. We conjecture that the discontinuity between trajectories of different horizon lengths made it difficult to learn. In contrast, we found that a network was able to accurately learn a function for a single horizon length but was computationally and space inefficient, because each network should be able to share information about the function between the horizons. This led to the proposed design in which a single network with multiple output heads shares weights through multiple shared input layers.

#### Dataset generation

We propose generating a training dataset by randomly sampling start and end pairs from the likely distribution of tasks. For example, in a warehouse pick-and-place operation, the pick frames will be constrained to a volume defined by the picking bin, and the place frames will be constrained to a volume defined by the placement or packing bin. For each random input, we generate optimized trajectories for all time horizons from Hmax to the optimal H*. Although this process generates more trajectories than necessary, generating each trajectory is efficient because the optimization for a trajectory of size H warm starts from the trajectory of size H + 1. Overall, this process is efficient and, with parallelization, can quickly generate a large training dataset.

This process can also help detect whether the analysis of the maximum trajectory duration was incorrect. If all trajectories are significantly shorter than Hmax, then one may reduce the number of output heads. If any trajectory exceeds Hmax, then the number of output heads can be increased.

We also note that in the case where the initial training data do not match the operational distribution of inputs, the result may be that the neural network produces suboptimal motions that are substantially, kinematically, and dynamically infeasible. In this case, the subsequent pass through the optimization (see “Fast pipeline for trajectory generation” section) will fix these errors, although with a longer computation time. We propose, in a manner similar to DAgger (48), that trajectories from operation can be continually added to the training dataset for subsequent training or refinement of the neural network.

#### Training

To train the network in a way that encourages matching the reference trajectory while keeping the output trajectory kinematically and dynamically feasible, we propose a multipart loss function ℒ. This loss function includes a weighted sum of MSE loss on the trajectory

$LT$

, a boundary loss

$LB$

, which enforces the correct start and end positions, and a dynamics loss

$Ldyn$

that enforces the dynamic feasibility of the trajectory. The MSE loss is the mean of the sum of squared differences of the two vector arguments:

$LMSE(a˜,a¯)=1nΣi=1n(a˜i−a¯i)2$

. The dynamics loss attempts to mimic the convex constraints of the SQP. Given the predicted trajectories

$X˜=(x˜Hmin,…,x˜Hmax)$

, where

$x˜h=(q˜,v˜,a˜,j˜)t=0h$

and the ground-truth trajectories from dataset generation

$X¯=(x¯H*,…,x¯Hmax)$

, the loss functions are

$LT=αqLMSE(q˜,q¯)+αvLMSE(v˜,v¯)+αaLMSE(α˜,a¯)+αjLMSE(j˜,j¯)$

$LB=LMSE(q˜0,q¯0)+LMSE(q˜H,q¯H)$

$Lh=αTLTh+αBLBh+αdynLdynh$

where values of αq = 10, αv = 1, αa = 1, αj = 1, αB = 4 × 103, and αdyn = 1 were chosen empirically. This loss is combined into a single loss for the entire network by summing the losses of all horizons while multiplying by an indicator function for the horizons that are valid

$L=Σh=HminHmaxLh𝟙[H¯*,Hmax](h)$

Because the ground-truth trajectories for horizons shorter than H* are not defined, we must ensure that some finite data are present so that the multiplication of an indicator value of 0 results in 0 (and not NaN). Multiplying by indicator function in this way results in a zero gradient for the part of the network that is not included in the trajectory data.

During training, we observed that the network would often exhibit behavior of coadaptation in which it would learn either

$LT$

or

$Ldyn$

but not both. This showed up as a test loss for one going to small values, whereas the other remained high. To address this problem, we introduced dropout layers (49) with dropout probability pdrop = 0.5 between each fully connected layer, shown in Fig. 3. We annealed (50) pdrop to 0 over the course of the training to reduce the loss further.

Tagged with: