Center Pillars - Anchor-Free Object Detection in 3D
[Project Track: Project 8] This project implements a LiDAR-based 3D object detection pipeline that uses PointPillars to encode raw point clouds into a bird’s-eye-view (BEV) pseudo-image, enabling efficient convolutional feature extraction. On top of this representation, the CenterPoint framework decodes BEV features by predicting object centers and regressing bounding box attributes in an anchor-free manner. This design removes the need for predefined anchors while maintaining accurate spatial localization and computational efficiency.
- Abstract
- 1. Introduction
- 2. Models/Designs
- 3. Our Implementation
- 4. Results
- 5. Conclusion
- References
Abstract
Efficient and accurate 3D perception from LiDAR point clouds is essential for autonomous driving, yet many methods struggle to balance computational efficiency with robustness to object rotation and scale variation. In this project, we implement and analyze two representative 3D detection frameworks: the anchor-based PointPillars and the anchor-free CenterPoint. Building on these approaches, we propose CenterPillar, a hybrid design that combines pillar-based feature encoding with center-based object detection. We evaluate our method on the SemanticKITTI dataset using standard detection and orientation metrics. Our results show that pillar-based bird’s-eye-view representations enable strong localization performance with low computational cost, while center-based detection alleviates several limitations of anchor-based methods.
1. Introduction
1.1 Motivation
Accurate 3D perception is a fundamental requirement for the safe deployment of autonomous vehicles, particularly in complex urban environments where traffic moves in unpredictable patterns. While LiDAR sensors provide rich spatial information, efficiently processing sparse 3D point clouds to detect and track objects in real-time remains a significant computational challenge. Traditional anchor-based methods often struggle to balance inference speed with the ability to accurately regress bounding boxes for objects with diverse orientations and aspect ratios. This project is motivated by the need to evaluate whether shifting from these rigid, predefined anchor boxes to a flexible, center-based object representation can improve detection accuracy and robustness without sacrificing the low-latency performance required for real-world driving systems.
1.2 Problem Setting
The core problem concerns the limitations of applying standard 2D convolutional pipelines to 3D LiDAR data, which is inherently sparse, unordered, and irregular. Specifically, this project investigates the trade-offs between two distinct detection paradigms on the SemanticKITTI dataset: the anchor-based PointPillars architecture and the anchor-free CenterPoint architecture. The challenge lies in accurately detecting and localizing objects, such as cars, pedestrians, and cyclists, that may be rotated or occluded, a task where traditional axis-aligned anchor boxes often fail or require extensive hyperparameter tuning. By implementing and comparing these architectures, we aim to determine whether a center-based representation offers a better mechanism for handling the geometric complexities of 3D scenes than the established anchor-based baseline.
1.3 Datasets
In this project, we use SemanticKITTI, a large-scale dataset for semantic scene understanding in autonomous driving, built on top of the KITTI odometry benchmark (Behley et al., 2019) [1]. It provides dense, point-level semantic annotations for LiDAR point clouds collected in urban driving environments, covering a wide range of classes such as road, buildings, vegetation, vehicles, and pedestrians. The dataset consists of sequential scans captured by a Velodyne HDL-64E LiDAR sensor, enabling both semantic segmentation and temporal modeling of 3D scenes, and has become a standard benchmark for LiDAR-based semantic perception.
1.4 Terminology In Use
1.4.1 Point Cloud
A point cloud is a collection of discrete points in 3D space, typically generated by LiDAR sensors that measure distances by emitting laser pulses and recording their return times (Qi et al., 2017) [5]. Formally, a point cloud can be represented as
\(P = \{(x_i, y_i, z_i, r_i)\}_{i=1}^{N}\)
where \(x_i\), \(y_i\), and \(z_i\) denote the 3D spatial coordinates of each point, and \(r_i\) represents an optional reflectance or intensity value.
Unlike images, point clouds are unordered, sparse, and irregularly distributed, with point density decreasing as distance from the sensor increases. These characteristics make point clouds incompatible with standard convolutional neural networks, which assume dense, grid-structured inputs, and therefore require specialized representations or preprocessing for learning-based methods (Qi et al., 2017) [5].
A sample LiDAR point cloud from the KITTI dataset
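For concreteness, a raw scan can be loaded as an \(N \times 4\) array with a few lines of NumPy. The sketch below assumes the KITTI/SemanticKITTI binary format (flat float32 records of x, y, z, reflectance); the file path is only illustrative.

```python
import numpy as np

def load_kitti_scan(bin_path: str) -> np.ndarray:
    """Read a KITTI-style velodyne scan stored as flat float32 (x, y, z, r) records."""
    points = np.fromfile(bin_path, dtype=np.float32)
    return points.reshape(-1, 4)  # (N, 4): x, y, z, reflectance

# Illustrative path; SemanticKITTI stores scans under sequences/<seq>/velodyne/<frame>.bin
scan = load_kitti_scan("sequences/00/velodyne/000000.bin")
print(scan.shape)  # roughly 120k points per HDL-64E scan
```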
1.4.2 Voxelization
Voxelization is the process of converting a 3D point cloud into a regular volumetric grid by partitioning space into small cubic cells, called voxels, and aggregating points within each cell (Zhou & Tuzel, 2018) [6]. While voxelization enables the use of 3D convolutional neural networks, it introduces significant computational and memory overhead due to the sparsity of point clouds and the high cost of 3D convolutions.
In our use of PointPillars, we avoid voxelization along the vertical dimension and instead organize points into vertical pillars on the ground plane, allowing the model to operate entirely with efficient 2D convolutions while preserving 3D geometric information through learned point-level features (Lang et al., 2019) [2].
2. Models/Designs
2.1 PointPillars
PointPillars is an end-to-end architecture for fast and accurate 3D object detection from LiDAR point clouds that introduces a learned point cloud encoder to avoid expensive 3D convolutions (Lang et al., 2019) [2]. The key idea is to organize the point cloud into vertical columns (“pillars”) in the ground plane, learn features within each pillar using a lightweight PointNet, and then process the resulting representation with 2D convolutional networks.
2.1.1 From Point Cloud to Pillars
Instead of discretizing space into 3D voxels, PointPillars only discretizes the ground plane:
- The x-y plane is divided into a regular grid.
- Each grid cell defines a pillar, which spans the full height range.
- All points whose (x,y) fall into the same cell are assigned to that pillar.
There is no binning in the z-dimension, removing a major source of computational cost and hyperparameter tuning.
Because LiDAR point clouds are sparse, most pillars are empty. The method keeps at most P non-empty pillars per frame, and at most N points per pillar (via random sampling or zero-padding). This yields the point cloud as a dense tensor of shape
\[(D, P, N)\]
where \(D\) is the number of features per point, \(P\) is the number of non-empty pillars, and \(N\) is the maximum number of points per pillar.
2.1.2 Point Decoration and Central References
Each point in a pillar is augmented with relative geometric features:
\((x, y, z, r, x - \bar{x}, y - \bar{y}, z - \bar{z}, x - x_p, y - y_p)\)
where \(\bar{x}\), \(\bar{y}\), and \(\bar{z}\) denote the arithmetic mean of all points in the pillar and \(x_p\), \(y_p\) denote the center of the pillar in the x-y grid. These relative offsets improve translation invariance at the pillar level: they encode the local geometry explicitly and significantly improve detection performance. The final per-point feature dimension is therefore \(D = 9\).
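The NumPy sketch below illustrates the grouping and decoration steps end to end. The detection range, pillar size, and the \(P\)/\(N\) limits are illustrative defaults rather than our exact configuration, and a real pipeline would batch this and select pillars more carefully.

```python
import numpy as np

def build_pillars(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                  pillar_size=0.16, max_pillars=12000, max_points=32):
    """Group points into x-y pillars and decorate each point with the 9-D features
    (x, y, z, r, x-xbar, y-ybar, z-zbar, x-xp, y-yp)."""
    # Keep only points inside the detection range.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    # Integer pillar indices on the ground-plane grid.
    ix = ((pts[:, 0] - x_range[0]) / pillar_size).astype(np.int32)
    iy = ((pts[:, 1] - y_range[0]) / pillar_size).astype(np.int32)
    uniq, inverse = np.unique(np.stack([ix, iy], axis=1), axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    P, D = min(len(uniq), max_pillars), 9
    tensor = np.zeros((D, P, max_points), dtype=np.float32)
    coords = uniq[:P]  # (P, 2) pillar grid coordinates, reused later for scattering

    for p in range(P):
        members = pts[inverse == p][:max_points]              # keep at most N points
        xbar, ybar, zbar = members[:, :3].mean(axis=0)         # mean of the pillar's points
        xp = x_range[0] + (coords[p, 0] + 0.5) * pillar_size   # pillar center (x)
        yp = y_range[0] + (coords[p, 1] + 0.5) * pillar_size   # pillar center (y)
        feats = np.concatenate([
            members[:, :4],                                    # x, y, z, r
            members[:, :3] - np.array([xbar, ybar, zbar]),     # offsets to the mean
            members[:, :2] - np.array([xp, yp]),               # offsets to the pillar center
        ], axis=1)
        tensor[:, p, :len(members)] = feats.T                  # zero-padding elsewhere
    return tensor, coords  # tensor has the (D, P, N) layout described above
```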
2.1.3 Pillar Feature Network (Learned Encoder)
Each pillar is encoded independently using a simplified PointNet-style network:
- Apply a shared linear layer (implemented as a 1×1 convolution)
- BatchNorm + ReLU
- Max-pooling across the point dimension
This produces one feature vector per pillar, which is formatted as: \((C, P)\)
\[C = \text{learned feature dimension}\]
2.1.4 Pseudo-Image Construction (BEV)
To enable efficient convolutional processing, PointPillars converts the unordered set of learned pillar features into a structured bird’s-eye-view (BEV) representation. After the Pillar Feature Network produces one feature vector \(f_i \in \mathbb{R}^{C}\) for each non-empty pillar \(i\), these features are scattered back onto a fixed \(H \times W\) grid defined on the ground plane, forming a pseudo-image \(F \in \mathbb{R}^{C \times H \times W}\) where \(F[:, y_i, x_i] = f_i\) and empty cells are zero.
This operation restores spatial structure while avoiding expensive 3D voxel grids: instead of performing convolutions over \((x,y,z)\), the model operates entirely in \((x,y)\) using fast 2D CNNs, with vertical geometry implicitly encoded in the feature channels through learned point-level statistics (e.g., height distributions and relative offsets). As a result, PointPillars preserves essential 3D information while achieving orders-of-magnitude efficiency gains by reducing the problem to 2D convolution on a BEV feature map.
Fig 3. Pillar Feature Extractor [3]
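A minimal PyTorch sketch of the encoder and the scatter step is shown below. The output channel size (64) is an assumed value, and `coords` is assumed to hold the integer \((x, y)\) grid indices of each non-empty pillar, as in the pillar-builder sketch above.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet: shared 1x1 conv (linear layer), BatchNorm, ReLU,
    then max-pooling across the points of each pillar."""
    def __init__(self, in_channels: int = 9, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(out_channels)

    def forward(self, pillars: torch.Tensor) -> torch.Tensor:
        # pillars: (D, P, N); treat the P pillars as the batch dimension of a 1x1 conv.
        x = torch.relu(self.bn(self.conv(pillars.permute(1, 0, 2))))  # (P, C, N)
        return x.max(dim=2).values                                    # (P, C): one vector per pillar

def scatter_to_bev(pillar_feats: torch.Tensor, coords: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Scatter per-pillar features onto the H x W ground-plane grid, producing the
    pseudo-image F in R^{C x H x W}; cells without a pillar stay zero.
    coords: (P, 2) LongTensor of (x_index, y_index) pillar grid coordinates."""
    bev = torch.zeros(pillar_feats.shape[1], H, W, dtype=pillar_feats.dtype)
    bev[:, coords[:, 1], coords[:, 0]] = pillar_feats.t()  # index rows by y, columns by x
    return bev
```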
2.1.5 2D Backbone Network
PointPillars uses a multi-scale 2D convolutional backbone, similar in spirit to feature pyramid networks, containing several convolutional blocks with increasing stride, upsampling via transposed convolutions, and feature concatenation across scales. All operations are 2D, which is the key reason the model achieves real-time performance compared to voxel-based 3D CNNs.
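The sketch below illustrates this structure with three downsampling blocks and three transposed-convolution upsampling paths concatenated at a common resolution; block depths and channel widths are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int, stride: int, n_layers: int = 3) -> nn.Sequential:
    """A downsampling block: one strided 3x3 conv followed by additional 3x3 convs."""
    layers = [nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    for _ in range(n_layers - 1):
        layers += [nn.Conv2d(cout, cout, 3, padding=1, bias=False),
                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def up_block(cin: int, cout: int, factor: int) -> nn.Sequential:
    """Upsample a feature map by `factor` with a transposed convolution."""
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, factor, stride=factor),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class BEVBackbone(nn.Module):
    """Multi-scale 2D backbone: strided conv blocks, transposed-conv upsampling,
    and channel-wise concatenation across scales (all operations are 2D)."""
    def __init__(self, cin: int = 64):
        super().__init__()
        self.block1, self.block2, self.block3 = conv_block(cin, 64, 2), conv_block(64, 128, 2), conv_block(128, 256, 2)
        self.up1, self.up2, self.up3 = up_block(64, 128, 1), up_block(128, 128, 2), up_block(256, 128, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.block1(x)   # stride 2
        x2 = self.block2(x1)  # stride 4
        x3 = self.block3(x2)  # stride 8
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)  # (B, 384, H/2, W/2)
```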
2.1.6 Detection Head
Once the BEV feature map is constructed, PointPillars uses a Single Shot Detector head to predict 3D objects in a single forward pass. A dense set of predefined anchors is placed on the ground plane, and for each anchor the network outputs a classification score and regression offsets that refine it into a full 3D bounding box \((x, y, z, w, l, h, \theta)\). The offsets are normalized to stabilize training across scales. The original anchor-based detection head is replaced with the CenterPoint-based one (described below) in our implementation.
Fig 2. PointPillars Network Overview [2]
2.2 CenterPoint
CenterPoint proposes a shift from bounding box anchors to a point-based representation, addressing the difficulty anchor-based methods have with fitting axis-aligned boxes to rotated objects. By representing, detecting, and tracking objects as points, the framework simplifies the detection pipeline, removes the need for heuristic anchor design and scale tuning, and improves robustness against rotation and aspect ratio variations.
2.2.1 Objects as Points
Unlike PointPillars, which relies on predefined anchors, CenterPoint identifies objects by locating their center point in a bird’s-eye-view heatmap. This approach eliminates the need for complex anchor tuning and allows the backbone to learn rotational invariance. The model uses a standard 3D backbone (like VoxelNet or PointPillars) to generate map-view features, which are then flattened and processed to find local heatmap peaks corresponding to object centers.
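Peak selection can be implemented with a simple max-pooling comparison, as in the sketch below: a BEV location is kept as a candidate center only if it equals the maximum of its 3x3 neighborhood and its score exceeds a threshold (the top-k limit and threshold are illustrative values).

```python
import torch
import torch.nn.functional as F

def extract_centers(heatmap: torch.Tensor, k: int = 100, threshold: float = 0.1):
    """Pick local maxima of a per-class BEV heatmap as candidate object centers.
    heatmap: (num_classes, H, W) of sigmoid scores."""
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (heatmap == pooled)          # keep only 3x3 local maxima
    scores, idx = peaks.flatten().topk(k)          # top-k over classes x H x W
    keep = scores > threshold
    scores, idx = scores[keep], idx[keep]
    _, H, W = heatmap.shape
    cls = idx // (H * W)                            # class index of each peak
    ys = (idx % (H * W)) // W                       # BEV row of each peak
    xs = idx % W                                    # BEV column of each peak
    return cls, ys, xs, scores
```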
2.2.2 Center-Based Detection Head
Once a center is detected, the model regresses other object properties directly from the feature vector at that center location. Several dense regression heads predict attributes such as the following (a sketch of such a multi-head layout appears after this list):
- 3D Size & Orientation: The width, length, height, and yaw angle (encoded as sine and cosine).
- Sub-voxel Refinement: An offset to correct quantization errors caused by the backbone’s stride.
- Velocity: A 2D velocity vector predicting the object’s motion between frames, enabling simple greedy tracking without Kalman filters.
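A minimal sketch of such a multi-head layout is given below; the channel sizes, head names, and output dimensions are illustrative assumptions rather than the exact heads used in our code.

```python
import torch
import torch.nn as nn

def make_head(cin: int, cout: int) -> nn.Sequential:
    """One per-attribute head: a 3x3 conv + ReLU followed by a 1x1 prediction conv."""
    return nn.Sequential(nn.Conv2d(cin, 64, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(64, cout, 1))

class CenterHead(nn.Module):
    """Anchor-free detection head: a class heatmap plus dense regression heads that
    predict every box attribute at each BEV location."""
    def __init__(self, cin: int = 384, num_classes: int = 3):
        super().__init__()
        self.heatmap = make_head(cin, num_classes)  # one center heatmap per class
        self.offset = make_head(cin, 2)             # sub-voxel (dx, dy) refinement
        self.height = make_head(cin, 1)             # z coordinate of the box center
        self.size = make_head(cin, 3)               # (w, l, h), typically log-scaled
        self.rot = make_head(cin, 2)                # (sin(yaw), cos(yaw))
        self.vel = make_head(cin, 2)                # BEV velocity (vx, vy)

    def forward(self, bev_feats: torch.Tensor) -> dict:
        return {
            "heatmap": torch.sigmoid(self.heatmap(bev_feats)),
            "offset": self.offset(bev_feats),
            "height": self.height(bev_feats),
            "size": self.size(bev_feats),
            "rot": self.rot(bev_feats),
            "vel": self.vel(bev_feats),
        }
```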
2.2.3 Two-Stage Refinement
While the center-based head is efficient, relying solely on the center feature may miss geometric details at the object’s edges. CenterPoint therefore employs a lightweight second stage that extracts point features from the 3D centers of the faces of the predicted bounding box; because the top and bottom face centers project to the same bird’s-eye-view location as the box center, only the four outward-facing face centers and the predicted object center contribute distinct features. These features are concatenated and passed through a multi-layer perceptron (MLP) to predict an IoU-guided confidence score and further refine the bounding box estimates.
Fig 3. Overview of CenterPoint Framework [4]
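The geometric part of this step is sketched below: given boxes parameterized as \((x, y, z, w, l, h, \theta)\), it computes the BEV coordinates of the four outward-facing face centers, at which BEV features would then be gathered (e.g., by bilinear interpolation) and passed to the refinement MLP together with the center feature. This covers the geometry only, not the full second stage.

```python
import torch

def bev_face_centers(boxes: torch.Tensor) -> torch.Tensor:
    """BEV (x, y) coordinates of the four outward-facing face centers of boxes given as
    (x, y, z, w, l, h, yaw); in map view the top/bottom face centers coincide with the center."""
    x, y, w, l, yaw = boxes[:, 0], boxes[:, 1], boxes[:, 3], boxes[:, 4], boxes[:, 6]
    cos, sin = torch.cos(yaw), torch.sin(yaw)
    offsets = torch.stack([
        torch.stack([ l / 2 * cos,  l / 2 * sin], dim=1),   # front face center
        torch.stack([-l / 2 * cos, -l / 2 * sin], dim=1),   # back face center
        torch.stack([-w / 2 * sin,  w / 2 * cos], dim=1),   # left face center
        torch.stack([ w / 2 * sin, -w / 2 * cos], dim=1),   # right face center
    ], dim=1)                                               # (N, 4, 2)
    return torch.stack([x, y], dim=1).unsqueeze(1) + offsets  # (N, 4, 2)
```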
2.3 Loss Function and Training Objective
CenterPoint adopts a fully anchor-free supervision strategy, where objects are represented as points in a bird’s-eye-view (BEV) heatmap and all other object attributes are regressed directly from the predicted center locations. This formulation leads to a multi-task training objective that jointly optimizes center localization, bounding box regression, and object orientation.
The overall loss is defined as:
\[L = \frac{1}{N_{\text{pos}}} \left( \lambda_{hm}L_{hm} + \lambda_{reg}L_{reg} + \lambda_{dir}L_{dir} \right)\]
where \(N_{\text{pos}}\) is the number of positive object centers, and \(\lambda_{hm}\), \(\lambda_{reg}\), and \(\lambda_{dir}\) are scalar weights.
2.3.1 Center Heatmap Loss
To localize object centers, CenterPoint predicts a dense BEV heatmap in which each object center is represented as a two-dimensional Gaussian peak. Supervising only the exact center pixel results in a highly sparse learning signal, as most locations are labeled as background. To alleviate this issue, CenterPoint follows the CenterNet formulation by rendering a Gaussian around each ground-truth center, with the radius determined by the object’s spatial extent.
The heatmap is trained using focal loss, which addresses the severe imbalance between foreground center locations and background pixels. The focal loss is defined as:
\[L_{hm} = -\frac{1}{N} \sum_{i} \begin{cases} (1 - \hat{Y}_i)^{\alpha} \log(\hat{Y}_i), & \text{if } Y_i = 1 \\ (1 - Y_i)^{\beta} \hat{Y}_i^{\alpha} \log(1 - \hat{Y}_i), & \text{otherwise} \end{cases}\]
where \(Y_i\) is the ground-truth heatmap value at location \(i\), \(\hat{Y}_i\) is the predicted heatmap value, and \(\alpha = 2\), \(\beta = 4\) are focusing parameters. This formulation suppresses the influence of easy background examples while emphasizing hard positive center locations, thereby improving recall and training stability.
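A direct PyTorch translation of this loss might look as follows; the clamping epsilon is an implementation detail added only for numerical stability.

```python
import torch

def heatmap_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                       alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    """Penalty-reduced focal loss over a Gaussian-rendered center heatmap.
    pred: predicted sigmoid scores; gt: ground-truth heatmap of the same shape."""
    pred = pred.clamp(1e-6, 1 - 1e-6)              # avoid log(0)
    pos = gt.eq(1).float()                          # exact object centers
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * (1 - pos)
    num_pos = pos.sum().clamp(min=1.0)              # normalize by the number of centers
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```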
2.3.2 Bounding Box Regression Loss
After identifying object centers, CenterPoint directly regresses the remaining 3D bounding box parameters from the feature vector at the center location. Each object is parameterized as:
\[(x, y, z, w, l, h, \theta)\]
To stabilize optimization across objects of different scales, the regression targets are optimized using the Smooth L1 loss. The total regression loss is given by:
\[L_{reg} = \sum_{b \in \{x, y, z, w, l, h, \theta\}} \text{SmoothL1}(\Delta b)\]
The Smooth L1 loss is defined as:
\[\text{SmoothL1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}\]
This loss behaves quadratically for small residuals, enabling precise localization, while remaining linear for large errors, improving robustness during early training.
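In PyTorch this corresponds to `smooth_l1_loss` with `beta=1.0` applied to the residuals gathered at positive center locations, as in the short sketch below (normalization by \(N_{\text{pos}}\) from the overall objective is assumed to happen outside this function).

```python
import torch
import torch.nn.functional as F

def box_regression_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Smooth L1 over box-parameter residuals at positive centers.
    pred / target: (num_pos, 7) rows of (x, y, z, w, l, h, theta) residuals."""
    # beta=1.0 reproduces the piecewise definition above: quadratic below 1, linear beyond.
    return F.smooth_l1_loss(pred, target, reduction="sum", beta=1.0)
```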
2.3.3 Orientation and Direction Loss
Direct regression of object orientation is challenging due to the periodic nature of angles, as rotations differing by \(2\pi\) represent the same physical orientation. To address this issue, CenterPoint encodes yaw using both continuous regression and discrete classification.
The yaw residual is expressed using a sine formulation:
\[\Delta \theta = \sin(\theta_{gt} - \theta_{pred})\]In addition, CenterPoint introduces a direction classification loss by discretizing the yaw angle into two bins (e.g., \([0, \pi)\) and \([\pi, 2\pi)\)) and applying a softmax cross-entropy loss:
\[L_{dir} = -\sum_{k} y_k \log(\hat{y}_k)\]
where \(y_k\) is the ground-truth direction label and \(\hat{y}_k\) is the predicted probability. This auxiliary loss resolves the 180-degree ambiguity common in symmetric objects and significantly improves orientation estimation accuracy.
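A minimal sketch of both orientation terms is shown below; the bin boundaries follow the \([0, \pi)\) / \([\pi, 2\pi)\) convention mentioned above, and the loss reductions are illustrative choices.

```python
import math
import torch
import torch.nn.functional as F

def orientation_losses(yaw_pred: torch.Tensor, yaw_gt: torch.Tensor,
                       dir_logits: torch.Tensor):
    """Sine-encoded yaw regression plus 2-bin direction classification.
    yaw_pred, yaw_gt: (num_pos,) angles in radians; dir_logits: (num_pos, 2)."""
    # Continuous part: the sine residual is insensitive to 180-degree flips...
    residual = torch.sin(yaw_gt - yaw_pred)
    reg_loss = F.smooth_l1_loss(residual, torch.zeros_like(residual), reduction="mean")
    # ...so a discrete classifier decides which half-circle the ground-truth yaw lies in.
    dir_target = ((yaw_gt % (2 * math.pi)) >= math.pi).long()
    dir_loss = F.cross_entropy(dir_logits, dir_target)
    return reg_loss, dir_loss
```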
2.3.4 Final Objective
By combining dense center heatmap supervision, direct regression of object geometry, and explicit orientation classification, the CenterPoint loss function enables stable and efficient training without requiring anchor assignment. This design integrates naturally with pillar-based BEV representations and is well-suited for real-time 3D object detection in autonomous driving scenarios.
3. Our Implementation
For our experiments on the SemanticKITTI dataset, we utilized a hybrid “PointPillars with CenterPoint” architecture that combines the efficient feature encoding of PointPillars with the anchor-free detection head of CenterPoint. In this setup, the PointPillars backbone first discretizes the raw LiDAR point cloud into vertical columns to create a dense 2D pseudo-image, effectively bypassing the computational bottleneck of 3D convolutions. This feature map is then processed by the CenterPoint head, which identifies objects as points in a heatmap rather than relying on predefined anchor boxes; the network localizes object centers and regresses dense 3D attributes, including size, orientation, and velocity, directly from the center features. This design allows us to leverage the high inference speed of the pillar-based encoder while mitigating the difficulties anchor-based methods face when fitting bounding boxes to rotated objects in complex driving environments. This hybrid design is not directly provided in existing codebases and represents our own architectural integration of PointPillars and CenterPoint.
4. Results
We evaluated the PointPillars model on the SemanticKITTI benchmark using the standard KITTI evaluation protocol. The performance was measured using Average Precision (AP) across three difficulty levels (Easy, Moderate, Hard) for three primary classes: Car, Pedestrian, and Cyclist. Additionally, we evaluated the Average Orientation Similarity (AOS) to assess the model’s ability to predict object heading.
Screenshot showing BEV projection and object annotations
4.1 Quantitative Performance
The following table presents the complete evaluation results for 2D bounding boxes, orientation similarity, bird’s-eye-view (BEV) detection, and full 3D object detection. All results are reported after 80 training epochs.
4.2 Analysis
4.2.1 2D Detection vs. 3D Detection
The model achieves its highest scores in the 2D Detection task (Overall AP: 70.75%), particularly for Cars, where performance reaches 82.75% on Easy samples. This behavior is expected, as 2D detection only requires projecting 3D bounding boxes onto the image plane, which is a more forgiving metric than accurately reconstructing the full 3D volume.
The gap between 2D and 3D performance is most pronounced in the Pedestrian class. While pedestrian detection reaches 54.05% AP (Easy) in 2D, it drops to 43.32% AP (Easy) in full 3D detection. This disparity highlights the difficulty of resolving depth and precise spatial volume for small, non-rigid objects, compared to simply identifying their presence in a 2D view. In contrast, larger and more rigid objects, such as cars, maintain relatively strong performance across both detection modalities.
4.2.2 Bird’s Eye View (BEV) Performance
The BEV Detection results closely track the 2D detection trends and consistently outperform full 3D detection. Specifically, the model achieves an Overall BEV AP of 68.20%, compared to 61.45% for 3D detection. This performance gap reflects the architectural strengths of the PointPillars-based pipeline, which operates by projecting features into a 2D ground-plane grid, or pseudo-image.
Because the network is inherently optimized for estimating horizontal spatial coordinates (x, y), it excels at top-down localization, which is critical for navigation and planning tasks. However, the accurate regression of the vertical dimension (z) required for full 3D detection remains a more challenging learned component. These results reinforce the effectiveness of BEV representations for precise ground-plane localization while maintaining computational efficiency.
4.2.3 Orientation Estimation (AOS)
The Average Orientation Similarity (AOS) indicates how well the model understands object heading.
- Cars: The model shows exceptional orientation capability for cars (82.56% Easy), nearly matching its detection accuracy. The rigid, rectangular geometry of cars in the point cloud makes heading estimation reliable.
- Cyclists: Performance for cyclists is strong on easy samples (75.07%), significantly outperforming pedestrians. This is likely because bicycles, like cars, have an elongated shape that provides a clear directional axis. However, there is a substantial drop in performance for moderate samples (down to 60.76%).
- Pedestrians: There is a sharp drop in AOS for pedestrians (40.76% Easy). Unlike cars, pedestrians are roughly cylindrical and non-rigid, making their orientation ambiguous in sparse LiDAR data. This suggests that while the model can detect where a pedestrian is, it struggles to predict which direction they are facing.
4.2.4 Impact of Difficulty Levels
Across all metrics, there is a clear performance degradation from “Easy” to “Hard,” though the severity of this drop varies significantly by class.
- Cars (Robustness): Cars demonstrate the highest resilience to increased difficulty. In 3D detection, the performance drops approximately 10% from Easy (71.23%) to Moderate (61.66%), but then stabilizes, dropping only another ~2.5% for Hard samples. This stability indicates that even when cars are partially occluded or at a distance, their large surface area ensures enough points remain to form a recognizable pillar structure.
- Cyclists (Occlusion Sensitivity): Cyclists suffer the sharpest decline in performance. There is a massive drop of nearly 17% in 3D AP from Easy (69.81%) to Moderate (53.26%). This suggests that cyclists are particularly susceptible to occlusion. Unlike cars, a bicycle is a thin, skeleton-like object; if partially blocked or seen from a poor angle, it may lose critical geometric features (like the gap between wheels), causing the detector to miss it entirely.
- Pedestrians (Inherent Difficulty): The trend for pedestrians is distinct. While the drop from Easy to Hard is numerically smaller than for cyclists (decreasing ~11% total in 3D AP), the baseline performance is significantly lower to begin with (starting at only 61.45%). This implies that pedestrians are inherently difficult to detect regardless of the specific difficulty annotation. Because of their small physical cross-section, even “Easy” (non-occluded, close) pedestrians occupy very few pillars. By the time the difficulty reaches “Hard,” the point density likely becomes too sparse for the pillar feature encoder to distinguish a person from a vertical pole or background noise, resulting in a low 43.52% AP.
5. Conclusion
In summary, we demonstrated that combining a pillar-based encoder with a center-based detection head offers an effective balance between efficiency and robustness. By integrating PointPillars with CenterPoint, we retained real-time performance while reducing sensitivity to anchor design. Future work could explore tighter integration of temporal information or extend the approach to full 3D segmentation tasks. Additionally, the models were trained for only 80 epochs due to computational constraints; with longer training, their relative performance might change. Finally, evaluating the approach on newer datasets such as the Waymo Open Dataset or nuScenes would provide temporal information that could further boost model performance.
References
[1] Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Gall, J., & Stachniss, C. (2019). SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). https://arxiv.org/abs/1904.01416
[2] Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., & Beijbom, O. (2019). PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/1812.05784
[3] Zhang, Y., Liu, Y., Wang, Z., & Chen, X. (2022). DirectionNet: Road main direction estimation for autonomous vehicles from LiDAR point clouds. In Proceedings of the 2022 International Conference on Advanced Robotics and Mechatronics (ICARM). https://doi.org/10.1109/ICARM54641.2022.9959732
[4] Yin, T., Zhou, X., & Krähenbühl, P. (2021). Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2006.11275
[5] Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/1612.00593
[6] Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/1711.06396
Source Repository: https://github.com/HuangEric22/PointPillars