Deep learning can accelerate grasp-optimized motion planning

This section describes the methods in DJ-GOMP. Underlying DJ-GOMP is a jerk- and time-optimizing constrained motion planner based on an SQP. Because of the complexity of solving this SQP, computation time can far exceed the trajectory’s execution time. DJ-GOMP uses this SQP on a random set of pick-and-place inputs to generate training data (trajectories) to train a neural network. During pick-and-place operation, DJ-GOMP uses the neural network to compute an approximate trajectory for the given pick and place frames, which it then uses to warm start the SQP.

Jerk- and time-optimized trajectory generation

To generate a jerk- and time-optimized trajectory, DJ-GOMP extends the SQP formulated in GOMP (2). The solver for this SQP, following the method in TrajOpt (3) and summarized in Algorithm 1, starts with a discretized estimate of the trajectory τ as a sequence of H waypoints after the starting configuration, in which each waypoint represents the robot’s configuration q, velocity v, acceleration a, and jerk j at a moment in time. The waypoints are sequentially separated by t_step seconds. This discretization is collected into x⁽⁰⁾, where the superscript represents a refinement iteration. Thus

x^{(0)} = \left(x_{0}^{(0)}, x_{1}^{(0)}, \dots, x_{H}^{(0)}\right), \quad \text{where} \quad x_{t}^{(k)} = \begin{bmatrix} q_{t}^{(k)} \\ v_{t}^{(k)} \\ a_{t}^{(k)} \\ j_{t}^{(k)} \end{bmatrix}

The choice of H and t_step is application specific. In physical experiments, we set t_step to (an integer multiple of) the robot's control period, and we set H such that H · t_step is an estimate of the upper bound of the minimum trajectory time for the workspace and task input distribution.

The initial value of x⁽⁰⁾ seeds (or warm starts) the SQP computation. Without the approximation generated by the neural network (e.g., for training data set generation), this trajectory can be initialized to all zeros. In practice, the SQP can converge faster by first computing a trajectory between inverse kinematic solutions to g₀ and g_H with only the convex kinematic and dynamic constraints (described below).

In each iteration k = (0,1,2, …, m) of the SQP, DJ-GOMP linearizes the nonconvex constraints of obstacles and pick-and-place locations and solves a QP of the following form

x^{(k + 1)} = \underset{x}{\operatorname{argmin}} \; \frac{1}{2} x^{\top} P x + p^{\top} x \quad \text{s.t.} \quad A x \leq b

where A and b define constraints enforcing the trust region, joint limits, and dynamics, and P is defined such that x^T P x is a sum of squared jerks. To enforce the linearized nonconvex constraints, DJ-GOMP adds constrained nonnegative slack variables penalized using appropriate coefficients in p. As DJ-GOMP iterates over the SQP, it increases the penalty term exponentially, terminating on the iteration m at which x^(m) meets the nonconvex constraints.
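To make the jerk cost concrete, the following is a minimal sketch that builds P for a tiny problem. It assumes a hypothetical variable layout in which x stacks the waypoints x_0, …, x_H and each waypoint stores [q, v, a, j] for n joints in that order; the helper names `jerk_cost_matrix` and `quadratic_form` are illustrative, and a practical solver would use a sparse matrix instead of nested lists.

```python
def jerk_cost_matrix(num_waypoints, n_joints):
    """Return P (nested lists) such that x^T P x is the sum of squared jerks.

    Assumed (hypothetical) layout: x = [x_0, ..., x_H], x_t = [q_t, v_t, a_t, j_t],
    where each of q_t, v_t, a_t, j_t has n_joints entries.
    """
    block = 4 * n_joints
    size = num_waypoints * block
    P = [[0.0] * size for _ in range(size)]
    for t in range(num_waypoints):
        jerk_start = t * block + 3 * n_joints  # offset of j_t inside x
        for i in range(n_joints):
            k = jerk_start + i
            P[k][k] = 1.0  # diagonal entry selects one jerk component
    return P


def quadratic_form(P, x):
    """Compute x^T P x with plain Python lists."""
    n = len(x)
    return sum(x[r] * sum(P[r][c] * x[c] for c in range(n)) for r in range(n))


# Two waypoints, two joints: only the jerk entries contribute to the cost.
x = [0.1, 0.2, 1.0, 1.0, 5.0, 5.0, 3.0, -4.0,   # x_0: q, v, a, j
     0.3, 0.4, 1.0, 1.0, 5.0, 5.0, 1.0, 2.0]    # x_1: q, v, a, j
P = jerk_cost_matrix(num_waypoints=2, n_joints=2)
print(quadratic_form(P, x))  # 3^2 + 4^2 + 1^2 + 2^2 = 30.0
```

Because P is diagonal and only nonzero on jerk entries, it is positive semidefinite, which keeps each QP subproblem convex.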

Algorithm 1: Jerk-limited Motion Plan

Require: x⁽⁰⁾ is an initial guess of the trajectory, H + 1 is the number of waypoints in x⁽⁰⁾, t_step is the time between each waypoint, g₀ and g_H are the pick and place frames, β_shrink ∈ (0,1), β_grow > 1, and γ > 1

1: μ ← initial penalty multiple
2: ϵ_trust ← initial trust region size
3: k ← 0
4: P, p, A, b ← linearize x⁽⁰⁾ as a QP
5: while μ < μ_max do
6:     x^(k+1) ← argmin_x ½ xᵀPx + pᵀx s.t. Ax ≤ b   /* warm start with x^(k) */
7:     if sufficient decrease in trajectory cost then
8:         k ← k + 1   /* accept trajectory */
9:         ϵ_trust ← β_grow ϵ_trust   /* grow trust region */
10:        A, b ← update linearization using x^(k)
11:    else
12:        ϵ_trust ← β_shrink ϵ_trust   /* shrink trust region */
13:        b ← update trust region bounds only
14:    if ϵ_trust < ϵ_min_trust then
15:        μ ← γμ   /* increase penalty */
16:        ϵ_trust ← initial trust region size
17:        p ← update penalty in QP to match μ
18: return x^(k)
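The control flow of Algorithm 1 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not DJ-GOMP's implementation: `solve_qp` and `cost` are hypothetical callbacks standing in for the linearize-and-solve step and the penalized trajectory cost, the "sufficient decrease" test is simplified to a strict decrease, and all default parameter values are illustrative. A one-dimensional toy problem, minimizing (x − 3)² with a closed-form trust-region step, exercises the loop.

```python
def sqp_trust_region(x0, solve_qp, cost, mu0=1.0, mu_max=1e4, eps0=1.0,
                     eps_min=1e-4, beta_grow=1.5, beta_shrink=0.5, gamma=10.0):
    """Trust-region and penalty loop with the same structure as Algorithm 1.

    solve_qp(x_k, eps_trust, mu) -> candidate (hypothetical callback that
    linearizes around x_k and solves the QP inside the trust region).
    cost(x, mu) -> penalized cost used for the acceptance test (hypothetical).
    """
    mu, eps = mu0, eps0
    x = x0
    while mu < mu_max:
        candidate = solve_qp(x, eps, mu)       # QP solve, warm started from x
        if cost(candidate, mu) < cost(x, mu):  # simplified "sufficient decrease"
            x = candidate                      # accept trajectory
            eps *= beta_grow                   # grow trust region
        else:
            eps *= beta_shrink                 # shrink trust region
        if eps < eps_min:
            mu *= gamma                        # increase penalty
            eps = eps0                         # reset trust region size
    return x


def toy_solve(x_k, eps, mu):
    """Closed-form minimizer of (x - 3)^2 inside the box [x_k - eps, x_k + eps]."""
    return min(max(3.0, x_k - eps), x_k + eps)


def toy_cost(x, mu):
    """Toy cost with no nonconvex constraints, so mu has no effect."""
    return (x - 3.0) ** 2


print(sqp_trust_region(0.0, toy_solve, toy_cost))  # 3.0
```

Once the iterate stops improving, repeated shrinking drives ϵ_trust below its minimum, which raises μ; the loop exits after μ reaches μ_max, mirroring how Algorithm 1 escalates the penalty.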

To enforce joint limits and dynamic constraints, Algorithm 1 creates a matrix A and a vector b that enforce the following linear inequalities on joint limits

q_{min} \leq q_{t} \leq q_{max}, \quad -v_{max} \leq v_{t} \leq v_{max}, \quad -a_{max} \leq a_{t} \leq a_{max}, \quad -j_{max} \leq j_{t} \leq j_{max}

and the following equalities that enforce dynamic constraints between variables

q_{t + 1} = q_{t} + t_{step} v_{t} + \frac{1}{2} t_{step}^{2} a_{t} + \frac{1}{6} t_{step}^{3} j_{t}

v_{t + 1} = v_{t} + t_{step} a_{t} + \frac{1}{2} t_{step}^{2} j_{t}

a_{t + 1} = a_{t} + t_{step} j_{t}
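Because the jerk is held constant over each step, these equalities integrate a cubic in time exactly, so two steps of length t_step with the same jerk must agree with one step of length 2 t_step. A quick numerical check for a single joint follows; `step_dynamics` and `rollout` are hypothetical helper names.

```python
def step_dynamics(q, v, a, j, t_step):
    """Propagate (q, v, a) over one step of constant jerk j, per the equalities above."""
    q_next = q + t_step * v + 0.5 * t_step**2 * a + t_step**3 * j / 6.0
    v_next = v + t_step * a + 0.5 * t_step**2 * j
    a_next = a + t_step * j
    return q_next, v_next, a_next


def rollout(q0, v0, a0, jerks, t_step):
    """Apply step_dynamics once per jerk value and return every state."""
    states = [(q0, v0, a0)]
    for j in jerks:
        states.append(step_dynamics(*states[-1], j, t_step))
    return states


# With constant jerk, two steps of t_step should match one step of 2 * t_step.
two_steps = rollout(0.0, 0.5, -1.0, [2.0, 2.0], t_step=0.1)[-1]
one_step = step_dynamics(0.0, 0.5, -1.0, 2.0, t_step=0.2)
print(two_steps, one_step)
```

These are linear in (q_t, v_t, a_t, j_t, q_{t+1}, v_{t+1}, a_{t+1}), which is why they can be encoded directly as rows of A and b.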

In addition, Algorithm 1 linearizes nonconvex constraints by adding slack variables to implement L₁ penalties. Thus, for a nonconvex constraint g_j(x) ≤ c, the algorithm adds the linearization ḡ_j(x) as a constraint in the form

{\bar{g}}_{j} (x) - μ y_{j}^{+} + μ y_{j}^{-} \leq c

where μ is the penalty, and the slack variables are constrained such that y_j^+ ≥ 0 and y_j^- ≥ 0.

In the QP, obstacle avoidance constraints are linearized on the basis of the waypoints of the current iteration's trajectory (Algorithm 2). To compute these constraints, the algorithm evaluates the spline

q_{spline} (s; t) = q_{t} + s v_{t} + \frac{1}{2} s^{2} a_{t} + \frac{1}{6} s^{3} j_{t}

between each pair of waypoints (x_t, x_{t + 1}) against a depth map of obstacles to find the time s ∈ [0, t_step) and corresponding configuration q̂_t that minimize the signed distance to any obstacle. In this evaluation, a negative signed distance means that the configuration is in collision. The algorithm then uses q̂_t to compute a separating hyperplane of the form n^Tq + d = 0. The hyperplane is either the top plane of the obstacle that q̂_t penetrates or the plane that separates q̂_t from the nearest obstacle (see Fig. 8). Selecting the top plane of a penetrated obstacle pushes the trajectory above the obstacle, a specialization of TrajOpt's more general obstacle avoidance approach that is useful in bin picking. Selecting the hyperplane of the nearest obstacle keeps the trajectory from entering the obstacle. The linearized constraint for this point is

n^{T} {\hat{J}}_{t}^{(k)} {\hat{x}}_{t}^{(k + 1)} \geq - d - n^{T} p ({\hat{x}}_{t}^{(k)}) + n^{T} {\hat{J}}_{t}^{(k)} {\hat{x}}_{t}^{(k)}

where Ĵ_t is the Jacobian of the robot's position at q̂_t. Because q̂_t and Ĵ_t are at an interpolated state between the optimization variables at x_t and x_{t + 1}, linearizing this constraint requires applying the chain rule to the Jacobian

{\hat{J}}_{t} = J_{p} ({\hat{q}}_{t}) J_{q} (s)

where J_p(q̂_t) is the Jacobian of the position at configuration q̂_t, and J_q(s) is the Jacobian of the configuration on the spline at s

J_{q} (s) = {\begin{bmatrix} \frac{\partial q(s)}{\partial q_{t}} \\ \frac{\partial q(s)}{\partial q_{t + 1}} \\ \frac{\partial q(s)}{\partial v_{t}} \\ \frac{\partial q(s)}{\partial v_{t + 1}} \end{bmatrix}}^{\top} = {\begin{bmatrix} 1 - 3 \frac{s^{2}}{t_{step}^{2}} + 2 \frac{s^{3}}{t_{step}^{3}} \\ 3 \frac{s^{2}}{t_{step}^{2}} - 2 \frac{s^{3}}{t_{step}^{3}} \\ s - 2 \frac{s^{2}}{t_{step}} + \frac{s^{3}}{t_{step}^{2}} \\ \frac{s^{3}}{t_{step}^{2}} - \frac{s^{2}}{t_{step}} \end{bmatrix}}^{\top}
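The entries of J_q(s) correspond to the cubic Hermite basis functions on an interval of length t_step: 1 − 3u² + 2u³, 3u² − 2u³, s − 2s²/t_step + s³/t_step², and s³/t_step² − s²/t_step, with u = s/t_step. The sketch below evaluates them and checks properties that follow from that basis: at s = 0 only the first entry is 1, at s = t_step only the second is, and the interpolant reproduces a quadratic exactly. The function names are hypothetical.

```python
def spline_config_jacobian(s, t_step):
    """Return [dq/dq_t, dq/dq_{t+1}, dq/dv_t, dq/dv_{t+1}] at time s in [0, t_step]."""
    u = s / t_step  # normalized time in [0, 1]
    h00 = 1.0 - 3.0 * u**2 + 2.0 * u**3
    h01 = 3.0 * u**2 - 2.0 * u**3
    h10 = s - 2.0 * s**2 / t_step + s**3 / t_step**2
    h11 = s**3 / t_step**2 - s**2 / t_step
    return [h00, h01, h10, h11]


def interpolate(q_t, q_next, v_t, v_next, s, t_step):
    """Configuration at s: dot product of the basis with (q_t, q_{t+1}, v_t, v_{t+1})."""
    basis = spline_config_jacobian(s, t_step)
    return sum(b * c for b, c in zip(basis, (q_t, q_next, v_t, v_next)))


print(spline_config_jacobian(0.0, 0.1))  # [1.0, 0.0, 0.0, 0.0]
# q(s) = s**2 on [0, 0.1]: q_t = 0, q_{t+1} = 0.01, v_t = 0, v_{t+1} = 0.2; expect about 0.0009
print(interpolate(0.0, 0.01, 0.0, 0.2, 0.03, 0.1))
```

Since q(s) is linear in these four waypoint variables, J_q(s) is just the row of basis values, and multiplying by J_p(q̂_t) gives Ĵ_t for each joint.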

Fig. 8 Obstacle constraint linearization.

The constraint linearization process keeps the trajectory away from obstacles by adding constraints based on the Jacobian of the configuration at each waypoint of the accepted trajectory x^(k). In this figure, the obstacle is shown from the side, the robot's path along part of x^(k) is shown in blue, and the constraints' normal projections into Euclidean space are shown in red. For waypoints that are outside the obstacle (A), constraints keep the waypoints from entering the obstacle. For waypoints that are inside the obstacle (B), constraints push the waypoints up out of the obstacle. If the algorithm adds constraints only at waypoints as in (C), the optimization can compute trajectories that collide with obstacles and produce discontinuities between trajectories with small changes to the pick or place frame. These effects are mitigated when obstacles are inflated to account for them, but the discontinuities can lead to poor results when training the neural network. The proposed algorithm adds linearized constraints to account for collisions between waypoints, leading to the more consistent results shown in (D).

We observe that linearization at each waypoint will safely avoid obstacles with a sufficient buffer around obstacles (e.g., via a Minkowski difference with obstacles); however, slight variations in pick or place frames can shift the alignment of waypoints to obstacles. This shift of alignment (see Fig. 8C) can lead to solutions that vary disproportionately to small changes in input. Although this may be acceptable in operation, it can lead to data that can be difficult for a neural network to learn.

Algorithm 2: Linearize Obstacle-Avoidance Constraint

1: for t ∈ [0, H) do
2:     (n_min, d_min) ← linearize obstacle nearest to p(q_t)
3:     q_min ← q_t
4:     for all s ∈ [0, t_step) do   /* line search s to desired resolution */
5:         q_s ← q_t + s v_t + ½ s² a_t + ⅙ s³ j_t
6:         (n_s, d_s) ← linearize obstacle nearest to p(q_s)
7:         if n_sᵀ p(q_s) + d_s < n_minᵀ p(q_min) + d_min then   /* compare signed distance */
8:             (n_min, d_min, q_min) ← (n_s, d_s, q_s)
9:     J_q ← Jacobian of the spline configuration at q_min (J_q(s) at the minimizing s)
10:    J_p ← Jacobian of position at q_min
11:    Ĵ_t ← J_p J_q
12:    b_t ← −d_min − n_minᵀ p(q_min) + n_minᵀ Ĵ_t x_t^(k) − μ y_t⁺   /* lower bound with slack y_t⁺ */
13:    Add (n_minᵀ Ĵ_t x_t^(k+1) ≥ b_t) and (y_t⁺ ≥ 0) to the set of linear constraints in the QP
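The inner line search of Algorithm 2 can be illustrated with a one-joint toy. This is a sketch under stated assumptions: the forward kinematics `fk` is a hypothetical 1-D map (the identity by default), the obstacle is a single hypothetical plane with fixed (n, d) rather than one recomputed from a depth map, and s is sampled on a uniform grid. It returns the sample with the smallest signed distance, starting from the waypoint itself as in lines 2 and 3.

```python
def min_signed_distance_on_segment(q_t, v_t, a_t, j_t, t_step, n, d,
                                   fk=lambda q: q, samples=20):
    """Return (s_min, q_min, dist_min) minimizing n * fk(q(s)) + d over s in [0, t_step).

    fk is a hypothetical stand-in for forward kinematics; (n, d) is one fixed plane.
    """
    best = (0.0, q_t, n * fk(q_t) + d)  # start with the waypoint itself (s = 0)
    for i in range(1, samples):
        s = t_step * i / samples  # uniform grid over [0, t_step)
        q_s = q_t + s * v_t + 0.5 * s**2 * a_t + s**3 * j_t / 6.0
        dist = n * fk(q_s) + d  # signed distance; negative means collision
        if dist < best[2]:
            best = (s, q_s, dist)
    return best


# The path dips toward the plane p = 0.5 and rises again: the closest point lies between waypoints.
s_min, q_min, dist = min_signed_distance_on_segment(
    q_t=1.0, v_t=-10.0, a_t=200.0, j_t=0.0, t_step=0.1, n=1.0, d=-0.5)
print(s_min, q_min, dist)  # closest approach near s = 0.05, q = 0.75, distance 0.25
```

Checking only the two waypoints here would report a distance of 0.5, which illustrates how waypoint-only constraints can miss the closest approach that the line search finds.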

As with GOMP, DJ-GOMP allows degrees of freedom in rotation and translation to be added to start and goal grasp frames. Adding this degree of freedom allows DJ-GOMP to take a potentially shorter path when an exact pose of the end effector is unnecessary. For example, a pick point with a parallel-jaw gripper can rotate about the axis defined by antipodal contact points (see Fig. 2), and the pick point with a suction gripper can rotate about the normal of its contact plane. Similarly, a task may allow for a place point anywhere within a bounded box. The degrees of freedom about the pick point can be optionally added as constraints that are linearized as

w_{min} \leq J_{0}^{(k)} q_{0}^{(k + 1)} - (g_{0} - p (q_{0}^{(k)})) - J_{0}^{(k)} q_{0}^{(k)} \leq w_{max}

where q_0^(k) and J_0^(k) are the configuration and Jacobian of the first waypoint in the accepted trajectory, q_0^(k+1) is one of the variables the QP is minimizing over, and w_min ≤ w_max defines the twist allowed about the pick point. Applying a similar set of constraints to g_H allows degrees of freedom in the place frame as well.
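To make the form of this constraint concrete, the sketch below evaluates it for a hypothetical planar 2-link arm, using only end-effector position (a 2-D twist) instead of a full SE(3) twist. The link lengths, bounds, and the helpers `fk`, `jacobian`, `linearized_twist`, and `within_bounds` are illustrative assumptions. The middle expression is built from the first-order expansion p(q^(k+1)) ≈ p(q^(k)) + J_0^(k)(q^(k+1) − q^(k)), so it estimates p(q^(k+1)) − g_0.

```python
import math

LINK1, LINK2 = 0.5, 0.4  # hypothetical link lengths (m)


def fk(q):
    """End-effector position (x, y) of a planar 2-link arm."""
    return (LINK1 * math.cos(q[0]) + LINK2 * math.cos(q[0] + q[1]),
            LINK1 * math.sin(q[0]) + LINK2 * math.sin(q[0] + q[1]))


def jacobian(q):
    """2x2 position Jacobian of fk at q."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-LINK1 * s1 - LINK2 * s12, -LINK2 * s12],
            [LINK1 * c1 + LINK2 * c12, LINK2 * c12]]


def linearized_twist(q_next, q_k, g0):
    """J q_next - (g0 - p(q_k)) - J q_k, the first-order estimate of p(q_next) - g0."""
    J, p = jacobian(q_k), fk(q_k)
    dq = [q_next[0] - q_k[0], q_next[1] - q_k[1]]
    return [J[r][0] * dq[0] + J[r][1] * dq[1] + p[r] - g0[r] for r in range(2)]


def within_bounds(w, w_min, w_max):
    """Check w_min <= w <= w_max component-wise."""
    return all(lower <= wi <= upper for wi, lower, upper in zip(w, w_min, w_max))


q_k = [0.3, 0.6]
g0 = fk(q_k)  # pick frame placed exactly at the current end-effector position
w = linearized_twist([0.31, 0.59], q_k, g0)
print(w, within_bounds(w, [-0.02, -0.02], [0.02, 0.02]))
```

In the QP, the same expression is linear in q_0^(k+1), so the two bounds become two rows of A and b per twist component.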

The SQP establishes trust regions to constrain the optimized trajectory to lie within a box whose extents are set by the trust region size ϵ_trust, which Algorithm 1 grows after accepted steps and shrinks after rejected ones. Because the convex dynamics constraints tie together the configuration, velocity, acceleration, and jerk of each waypoint, we observe that trust regions need only be defined as box bounds on one of these quantities for each waypoint. In experiments, we established trust regions on configurations.
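As a small illustration of trust regions on configurations, the bounds on q^(k+1) can be written as a box of half-width ϵ_trust around each joint of q^(k). Intersecting that box with the joint limits q_min and q_max, as done here, is an illustrative choice, and `trust_region_bounds` is a hypothetical name.

```python
def trust_region_bounds(q_k, eps_trust, q_min, q_max):
    """Per-joint (lower, upper) bounds: |q^(k+1) - q^(k)| <= eps_trust, within joint limits."""
    bounds = []
    for q, lower, upper in zip(q_k, q_min, q_max):
        bounds.append((max(lower, q - eps_trust), min(upper, q + eps_trust)))
    return bounds


# The second joint sits near its upper limit, so its box is clipped at q_max = 3.0.
print(trust_region_bounds([0.0, 2.9], 0.25, [-3.14, -3.14], [3.14, 3.0]))
```

Only the right-hand side b changes when ϵ_trust grows or shrinks, which matches line 13 of Algorithm 1 updating the trust region bounds without relinearizing.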

Algorithm 3: Time-optimal Motion Plan

Require: g₀ and g_H are the start and end frames, γ > 1 is the search bisection ratio

1: H_upper ← fixed or estimated upper limit of maximum time
2: H_lower ← 3
3: v_upper ← ∞   /* constraint violation */
4: while v_upper > tolerance do   /* find upper limit */
5:     (x_upper, v_upper) ← call Alg. 1 with cold-start trajectory for H_upper
6:     H_upper ← max(H_upper + 1, ⌈γ H_upper⌉)
7: while H_lower < H_upper do   /* search for shortest H */
8:     H_mid ← H_lower + ⌊(H_upper − H_lower)/γ⌋
9:     (x_mid, v_mid) ← call Alg. 1 with warm-start trajectory x_upper interpolated to H_mid
10:    if v_mid ≤ tolerance then
11:        (H_upper, x_upper, v_upper) ← (H_mid, x_mid, v_mid)
12:    else
13:        H_lower ← H_mid + 1
14: return x_upper

To find the minimum-time trajectory, J-GOMP searches for the shortest jerk-minimized trajectory that satisfies all constraints. This search, shown in Algorithm 3, starts with a guess of H and then performs an exponential search for the upper bound, followed by a binary search for the shortest H, repeatedly performing the SQP of Algorithm 1. The binary search warm starts each SQP with an interpolation of the trajectory of the current upper bound of H. The search ends when the upper and lower bounds of H are the same.
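The two phases of this search can be sketched with a feasibility oracle in place of the SQP. In this minimal sketch, `is_feasible(H)` is a hypothetical stand-in for "call Algorithm 1 at horizon H and check that the constraint violation is within tolerance", feasibility is assumed to be monotone in H (if H works, so does H + 1), warm starting is omitted, and, unlike the listing, the upper bound is grown only after an infeasible solve.

```python
import math


def shortest_horizon(is_feasible, h_upper, h_lower=3, gamma=2.0):
    """Exponential search for a feasible upper bound, then a gamma-ratio bisection for the shortest H."""
    # Phase 1: grow the upper bound until the (hypothetical) oracle reports feasibility.
    while not is_feasible(h_upper):
        h_upper = max(h_upper + 1, math.ceil(gamma * h_upper))
    # Phase 2: shrink [h_lower, h_upper]; h_upper is always feasible here.
    while h_lower < h_upper:
        h_mid = h_lower + int((h_upper - h_lower) // gamma)  # h_mid < h_upper since gamma > 1
        if is_feasible(h_mid):
            h_upper = h_mid
        else:
            h_lower = h_mid + 1
    return h_upper


# Toy oracle: every horizon of at least 17 waypoints is feasible.
print(shortest_horizon(lambda h: h >= 17, h_upper=10))  # 17
```

With γ = 2 the second phase is ordinary bisection; larger γ probes closer to the lower bound, which can pay off when the initial upper bound is loose.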

Deep learning of trajectories

To speed up motion planning, we add a deep neural network to the pipeline. This neural network treats the trajectory optimization process as a function f_τ to approximate

f_{τ} : SE (3) \times SE (3) \to ℝ^{H^{*} \times n \times 4}

where the arguments to the function are the pick and place frames, and the output is a discretized trajectory of variable length H* waypoints, each of which has a configuration, velocity, acceleration, and jerk for all n joints of the robot. We assume that the neural network f̃_τ can only approximate f_τ; thus

{\tilde{f}}_{τ} (\cdot) = f_{τ} (\cdot) + E (τ)

for some unknown error distribution E(τ). Hence, the output of f̃_τ may not be dynamically or kinematically feasible. To address this potential issue, we use the network's output to warm start a final pass through the SQP. This process can be thought of as polishing the output of the neural network approximation to overcome any errors in the network's output.

In this section, we describe a proposed neural network architecture, its loss function, training and testing dataset generation, and the training process. Although we posit that a more general approximation could take the allowed pick and place degrees of freedom as inputs, for brevity, we assume that f_τ and its neural network approximation are parameterized by a preset choice of pick and place degrees of freedom. In practice, it may also be appropriate to train multiple neural networks for different parameterizations of f_τ.

Architecture

The deep neural network architecture we propose is depicted in Fig. 3. It consists of an input layer connected through four fully connected blocks to multiple output blocks. The input layer takes in the concatenated grasp frames [g₀ᵀ g_Hᵀ]ᵀ. Because the optimal trajectory length H* can vary, the network has one output head for each possible value of H*. To select which output to use, we use a separate classification network with two fully connected layers and a one-hot encoding, trained using a cross-entropy loss.
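At inference time, the classifier's output decides which trajectory head to read. The sketch below assumes, hypothetically, that the classifier emits one score per candidate horizon from H_min to H_max and that each head emits a flat list of (H + 1) · n · 4 values ordered by waypoint, then joint, then (q, v, a, j); `select_trajectory` is an illustrative name, not part of HYDRA.

```python
def select_trajectory(scores, heads, h_min, n_joints):
    """Pick the head for the highest-scoring horizon and reshape it to [H+1][n][4]."""
    best = max(range(len(scores)), key=lambda i: scores[i])  # argmax over horizons
    horizon = h_min + best
    flat = heads[best]
    per_waypoint = n_joints * 4
    if len(flat) != (horizon + 1) * per_waypoint:
        raise ValueError("head size does not match its horizon")
    return horizon, [
        [flat[t * per_waypoint + i * 4: t * per_waypoint + i * 4 + 4] for i in range(n_joints)]
        for t in range(horizon + 1)
    ]


# Two candidate horizons (H = 1 and H = 2) for a 1-joint robot; the classifier prefers H = 2.
heads = [list(range(8)), list(range(12))]
horizon, traj = select_trajectory([0.1, 2.3], heads, h_min=1, n_joints=1)
print(horizon, traj)  # 2 [[[0, 1, 2, 3]], [[4, 5, 6, 7]], [[8, 9, 10, 11]]]
```

Only the selected head's output is needed to warm start the SQP; the other heads are computed but ignored at inference.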

We refer to the horizon classification and multiple-output network as a HYDRA (Horizon Yielding Distillation through Retained Activations) network. The network yields both an optimal horizon length and the trajectory for that horizon. To train this network (detailed below), the trajectory output layers’ activation values for horizons not in the training sample are retained using a zero gradient so as to weight the contribution of the layers during backprop to the input layers.

In experiments, a neural network with a single output head was unable to produce consistent results when predicting trajectories of varying horizon lengths. We conjecture that the discontinuity between trajectories of different horizon lengths made this function difficult to learn. In contrast, we found that a separate network could accurately learn the function for a single horizon length, but training one network per horizon was computationally and space inefficient, because the networks should be able to share information about the function across horizons. This led to the proposed design, in which a single network with multiple output heads shares weights through multiple shared input layers.

Dataset generation

We propose generating a training dataset by randomly sampling start and end pairs from the likely distribution of tasks. For example, in a warehouse pick-and-place operation, the pick frames will be constrained to a volume defined by the picking bin, and the place frames will be constrained to a volume defined by the placement or packing bin. For each random input, we generate optimized trajectories for all time horizons from H_max to the optimal H*. Although this process generates more trajectories than necessary, generating each trajectory is efficient because the optimization for a trajectory of size H warm starts from the trajectory of size H + 1. Overall, this process is efficient and, with parallelization, can quickly generate a large training dataset.
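The per-input loop can be sketched as follows. Here `optimize(pick, place, H, init)` is a hypothetical stand-in for running Algorithm 1 at horizon H (returning None when no feasible trajectory is found), `resample` is a hypothetical helper that interpolates a trajectory to a shorter horizon, and stopping at the first infeasible horizon assumes feasibility is monotone in H, so the last success is H*.

```python
def generate_examples(pick, place, h_max, optimize, resample):
    """Optimize from H_max down to H*, warm starting each H from the H + 1 solution."""
    examples = {}
    warm = None  # cold start at H_max
    for h in range(h_max, 0, -1):
        init = None if warm is None else resample(warm, h)
        traj = optimize(pick, place, h, init)
        if traj is None:
            break  # h is below the optimal horizon H*
        examples[h] = traj
        warm = traj
    h_star = min(examples) if examples else None
    return h_star, examples


# Toy stand-ins: horizons of 5 or more are feasible; record whether a warm start was given.
calls = []


def fake_optimize(pick, place, h, init):
    calls.append((h, init is not None))
    return [float(h)] * (h + 1) if h >= 5 else None


def truncate(traj, h):
    return traj[: h + 1]


h_star, examples = generate_examples("pick", "place", 8, fake_optimize, truncate)
print(h_star, sorted(examples))  # 5 [5, 6, 7, 8]
```

Each (pick, place) pair is independent, so this loop parallelizes naturally across inputs.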

This process can also help detect whether the analysis of the maximum trajectory duration was incorrect. If all trajectories are significantly shorter than H_max, then one may reduce the number of output heads. If any trajectory exceeds H_max, then the number of output heads can be increased.

We also note that when the initial training data do not match the operational distribution of inputs, the neural network may produce suboptimal motions that are substantially kinematically and dynamically infeasible. In this case, the subsequent pass through the optimization (see “Fast pipeline for trajectory generation” section) will fix these errors, although with a longer computation time. We propose, in a manner similar to DAgger (48), that trajectories from operation can be continually added to the training dataset for subsequent training or refinement of the neural network.

Training

To train the network in a way that encourages matching the reference trajectory while keeping the output trajectory kinematically and dynamically feasible, we propose a multipart loss function ℒ. This loss function is a weighted sum of an MSE loss on the trajectory, L_T; a boundary loss, L_B, which enforces the correct start and end positions; and a dynamics loss, L_dyn, which enforces the dynamic feasibility of the trajectory. The MSE loss is the mean of the squared differences between its two vector arguments:

L_{MSE} (\tilde{a}, \underline{a}) = \frac{1}{n} Σ_{i = 1}^{n} {({\tilde{a}}_{i} - {\underline{a}}_{i})}^{2}

The dynamics loss attempts to mimic the convex constraints of the SQP. Given the predicted trajectories

\tilde{X} = ({\tilde{x}}^{H_{min}}, \dots, {\tilde{x}}^{H_{max}})

where

{\tilde{x}}^{h} = {({\tilde{q}}_{t}, {\tilde{v}}_{t}, {\tilde{a}}_{t}, {\tilde{j}}_{t})}_{t = 0}^{h}

and the ground-truth trajectories from dataset generation

\underline{X} = ({\underline{x}}^{H^{*}}, \dots, {\underline{x}}^{H_{max}})

the loss functions are

L_{T} = α_{q} L_{MSE} (\tilde{q}, \underline{q}) + α_{v} L_{MSE} (\tilde{v}, \underline{v}) + α_{a} L_{MSE} (\tilde{a}, \underline{a}) + α_{j} L_{MSE} (\tilde{j}, \underline{j})

L_{B} = L_{MSE} ({\tilde{q}}_{0}, {\underline{q}}_{0}) + L_{MSE} ({\tilde{q}}_{H}, {\underline{q}}_{H})

\begin{matrix} L_{dyn} = \frac{1}{h} Σ_{t = 0}^{h - 1} {‖ {\tilde{q}}_{t} + t_{step} {\tilde{v}}_{t} + \frac{1}{2} t_{step}^{2} {\tilde{a}}_{t} + \frac{1}{6} t_{step}^{3} {\tilde{j}}_{t} - {\tilde{q}}_{t + 1} ‖}^{2} \\ + \frac{1}{h} Σ_{t = 0}^{h - 1} {‖ {\tilde{v}}_{t} + t_{step} {\tilde{a}}_{t} + \frac{1}{2} t_{step}^{2} {\tilde{j}}_{t} - {\tilde{v}}_{t + 1} ‖}^{2} \\ + \frac{1}{h} Σ_{t = 0}^{h - 1} {‖ {\tilde{a}}_{t} + t_{step} {\tilde{j}}_{t} - {\tilde{a}}_{t + 1} ‖}^{2} \\ + \frac{1}{h} Σ_{t = 0}^{h - 1} {‖ \frac{1}{t_{step}} ({\underline{j}}_{t + 1} - {\underline{j}}_{t}) - \frac{1}{t_{step}} ({\tilde{j}}_{t + 1} - {\tilde{j}}_{t}) ‖}^{2} \end{matrix}

L^{h} = α_{T} L_{T}^{h} + α_{B} L_{B}^{h} + α_{dyn} L_{dyn}^{h}

where values of α_q = 10, α_v = 1, α_a = 1, α_j = 1, α_B = 4 × 10³, and α_dyn = 1 were chosen empirically. This loss is combined into a single loss for the entire network by summing the losses of all horizons while multiplying by an indicator function for the horizons that are valid

L = Σ_{h = H_{min}}^{H_{max}} L^{h} 𝟙_{[{\underline{H}}^{*}, H_{max}]} (h)

Because the ground-truth trajectories for horizons shorter than H* are not defined, we must ensure that finite placeholder data are present so that multiplying by an indicator value of 0 yields 0 (and not NaN). Multiplying by the indicator function in this way results in a zero gradient for the parts of the network that are not covered by the trajectory data.
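For a single joint, both L_dyn and the masked sum over horizons can be written directly from their definitions. The sketch below is a minimal, unbatched illustration in plain Python rather than training code: the function names, the dict-per-horizon layout, and the zero placeholder targets are assumptions. It checks that a trajectory rolled out with the dynamics equalities has zero dynamics loss, and it shows why undefined targets need finite placeholders: in IEEE 754 arithmetic, 0 × NaN is NaN.

```python
import math


def dynamics_loss(q, v, a, j, j_ref, t_step):
    """L_dyn for a single joint over one horizon h = len(q) - 1."""
    h = len(q) - 1
    total = 0.0
    for t in range(h):
        r_q = q[t] + t_step * v[t] + 0.5 * t_step**2 * a[t] + t_step**3 * j[t] / 6.0 - q[t + 1]
        r_v = v[t] + t_step * a[t] + 0.5 * t_step**2 * j[t] - v[t + 1]
        r_a = a[t] + t_step * j[t] - a[t + 1]
        r_j = (j_ref[t + 1] - j_ref[t]) / t_step - (j[t + 1] - j[t]) / t_step
        total += r_q**2 + r_v**2 + r_a**2 + r_j**2
    return total / h


def mse(pred, target):
    """Mean of squared differences, as in L_MSE."""
    return sum((p - r) ** 2 for p, r in zip(pred, target)) / len(pred)


def masked_total_loss(preds, targets, h_star):
    """Sum over horizons of mse(pred, target) times the indicator 1[h >= h_star].

    preds: {h: list of floats}; targets: {h: list or None}, None where no ground
    truth exists (h < H*). Zero placeholders (an assumption) keep every term finite.
    """
    total = 0.0
    for h, pred in preds.items():
        target = targets.get(h)
        if target is None:
            target = [0.0] * len(pred)  # finite placeholder instead of NaN
        indicator = 1.0 if h >= h_star else 0.0
        total += indicator * mse(pred, target)
    return total


# A trajectory rolled out with the dynamics equalities has zero dynamics loss.
t_step = 0.1
jerks = [1.0, -2.0, 0.5, 0.0]  # j_0 .. j_H with H = 3
q, v, a = [0.0], [0.0], [0.0]
for jt in jerks[:-1]:
    q.append(q[-1] + t_step * v[-1] + 0.5 * t_step**2 * a[-1] + t_step**3 * jt / 6.0)
    v.append(v[-1] + t_step * a[-1] + 0.5 * t_step**2 * jt)
    a.append(a[-1] + t_step * jt)
print(dynamics_loss(q, v, a, jerks, jerks, t_step))  # 0.0

# A zero indicator cannot cancel a NaN, hence the finite placeholders above.
print(0.0 * math.nan)  # nan
preds = {2: [1.0, 1.0], 3: [1.0, 2.0, 3.0]}
targets = {2: None, 3: [1.0, 2.0, 5.0]}
print(masked_total_loss(preds, targets, h_star=3))  # (0 + 0 + 4) / 3
```

In an autodiff framework, the same zero indicator also zeroes the gradient flowing into the heads for horizons shorter than H*.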

During training, we observed that the network would often exhibit coadaptation, in which it would learn either L_T or L_dyn but not both. This showed up as the test loss for one of them dropping to small values while the other remained high. To address this problem, we introduced dropout layers (49) with dropout probability p_drop = 0.5 between each pair of fully connected layers, as shown in Fig. 3. We annealed (50) p_drop to 0 over the course of training to reduce the loss further.
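The section does not specify the shape of the annealing schedule. Purely to make the idea concrete, the sketch below assumes a linear decay of p_drop from 0.5 at the first epoch to 0 at the last; `annealed_dropout` and the epoch count are hypothetical.

```python
def annealed_dropout(epoch, total_epochs, p_start=0.5):
    """Linearly decay the dropout probability from p_start (first epoch) to 0 (last epoch)."""
    if total_epochs <= 1:
        return 0.0
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)  # clamp progress to [0, 1]
    return p_start * (1.0 - frac)


print([annealed_dropout(e, 5) for e in range(5)])  # [0.5, 0.375, 0.25, 0.125, 0.0]
```

Ending at p_drop = 0 means the final epochs train the full network without dropout, so the weights settle for the deterministic network used at inference.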
