Why do we annotate the AI training data in-house at Samp?

Article written by Shivani Shah, CTO & Co-founder at Samp

The market today offers many tools with mature workflows for annotating and managing AI training data. Yet at Samp we chose to develop custom tools in-house for this task, which may seem puzzling. This article describes the thought process behind that choice and exposes the underlying complexity of the task.

Domains such as autonomous cars and the mapping of geographical territories and building interiors also require annotation of large amounts of 3D lidar data. But this market is smaller than the one for 2D images, and annotation needs are fragmented, so far fewer tools have been developed for labelling point clouds.

What data annotation is needed on Point Clouds?

The role of 3D AI in the Samp ‘Shared reality’ product is to detect each instance of an object and predict its semantic class, a task known as Panoptic Segmentation. For data labelling, we therefore need to separate each object from the scene and attach an instance label and a semantic label to it.
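
To make the labelling target concrete, here is a minimal sketch of what a panoptic label on a point cloud can look like: every point carries both a semantic class and an instance id. The `LabelledPoint` type, the class names and the `group_by_instance` helper are illustrative assumptions, not Samp's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledPoint:
    x: float
    y: float
    z: float
    semantic_class: str  # hypothetical class names, e.g. "pipe" or "valve"
    instance_id: int     # unique per physical object in the scene


def group_by_instance(points):
    """Return {instance_id: (semantic_class, [points])}.

    Raises ValueError if one instance carries two different semantic classes,
    since an object can only belong to a single class.
    """
    instances = {}
    for p in points:
        if p.instance_id not in instances:
            instances[p.instance_id] = (p.semantic_class, [])
        cls, members = instances[p.instance_id]
        if cls != p.semantic_class:
            raise ValueError(
                f"instance {p.instance_id} mixes classes {cls!r} and {p.semantic_class!r}"
            )
        members.append(p)
    return instances


scene = [
    LabelledPoint(0.0, 0.0, 0.0, "pipe", 1),
    LabelledPoint(0.1, 0.0, 0.0, "pipe", 1),
    LabelledPoint(2.0, 1.0, 0.5, "pipe", 2),   # second pipe: same class, new instance
    LabelledPoint(5.0, 5.0, 1.0, "valve", 3),
]
instances = group_by_instance(scene)
```

Two pipes share a semantic class but keep separate instance ids, which is the distinction panoptic segmentation adds over purely semantic labelling.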

We started with open-source software like CloudCompare, and then moved to Leica Cyclone when the datasets got bigger. Such an annotation task requires domain expertise, namely knowing the objects found on an industrial site, so it was important to choose software the domain experts were comfortable with. Eventually, though, we designed our own tool to perform the task, for the reasons below.

Typical instances labelled by their Semantic Class

1. Massive size of the industrial data

Once we started handling real client data, we had to annotate scans with billions of points per site! Put together, that is petabytes of point cloud data to process. This slowed down most operations, and data transfers between two locations could themselves take hours.

Most AI data annotation tools support up to a few million points. This is sufficient for the autonomous car market, mainly because cars use very sparse point clouds. The industrial point clouds we handle, by contrast, require a much higher point density, which leads to very large files.

Industrial software like Cyclone starts to slow down significantly at this data size, adding many minutes of waiting time and eventually crashing. As it is not open source, we also cannot investigate the bottlenecks or improve any features ourselves.

To accelerate the whole annotation process, we identified the following tasks to address:

  • Enable the annotation tool to handle larger data files
  • Centralise the data pipeline to avoid repeated data transfers
  • Speed up reading, writing and transferring the data

Thanks to the insanely skilled team at Samp, we met these objectives by building our own annotation tool, which can handle much larger data files.
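
One common way to handle files that do not fit in memory is to stream them in fixed-size chunks. The sketch below illustrates that general idea only; the record layout (`x, y, z` as little-endian float32) and the function names are assumptions for the example and do not describe Samp's tool.

```python
import io
import struct

# Hypothetical record layout: x, y, z as little-endian float32 (12 bytes per point).
POINT = struct.Struct("<fff")


def iter_chunks(stream, points_per_chunk=1_000_000):
    """Yield lists of (x, y, z) tuples, reading at most points_per_chunk at a time."""
    chunk_bytes = POINT.size * points_per_chunk
    while True:
        data = stream.read(chunk_bytes)
        if not data:
            break
        if len(data) % POINT.size:
            raise ValueError("truncated point record")
        yield list(POINT.iter_unpack(data))


def bounding_box(stream, points_per_chunk=1_000_000):
    """Compute the axis-aligned bounding box without loading the whole file."""
    lo = [float("inf")] * 3
    hi = [float("-inf")] * 3
    for chunk in iter_chunks(stream, points_per_chunk):
        for point in chunk:
            for axis in range(3):
                lo[axis] = min(lo[axis], point[axis])
                hi[axis] = max(hi[axis], point[axis])
    return tuple(lo), tuple(hi)


# Usage with an in-memory stream standing in for a large file on disk.
raw = b"".join(POINT.pack(*p) for p in [(0.0, 1.0, 2.0), (4.0, -1.0, 0.5), (2.0, 3.0, 8.0)])
box = bounding_box(io.BytesIO(raw), points_per_chunk=2)
```

Memory use is bounded by the chunk size rather than by the size of the scan, so the same loop works for three points or for billions.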

This led to some very special in-house technical developments:

  • Custom file format. Our beloved ‘Samp-point-Z’ is designed to hold all the information needed for AI training, for example object labels, which reduced the number of read and write operations.
  • Custom compression algorithm. It is designed specifically for the point cloud density needed for the AI and for the visual quality of Samp products. It suits us better than Draco because no data is lost on read-write, and it gave us about 30% more compression than open formats like .e57.
  • Centralised data flow. By controlling the process end-to-end, we can design optimal ways to transfer data between the steps of the annotation process: data verification, segmentation, labelling and quality checks (QA/QC).
Annotated NavVis VLX 3 Water Room
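
The two ideas behind the file format and compression points above, keeping labels next to the coordinates and compressing without loss, can be illustrated with a small sketch. This is not the ‘Samp-point-Z’ format or Samp's compression algorithm, whose details are not given here. The record layout (integer millimetre coordinates plus class and instance ids) is an assumption, and the standard `zlib` module stands in for the custom compressor.

```python
import struct
import zlib

# Hypothetical record: x, y, z in integer millimetres (int32), then
# semantic class id (uint16) and instance id (uint32). 18 bytes, no padding.
RECORD = struct.Struct("<iiiHI")


def pack_points(points):
    """Serialise labelled points into one compressed blob."""
    raw = b"".join(RECORD.pack(*p) for p in points)
    return zlib.compress(raw, 9)


def unpack_points(blob):
    """Inverse of pack_points: returns the original tuples unchanged."""
    raw = zlib.decompress(blob)
    return [RECORD.unpack_from(raw, offset) for offset in range(0, len(raw), RECORD.size)]


points = [
    (1000, 2000, 3000, 4, 17),
    (1001, 2000, 3000, 4, 17),
    (1002, 2001, 3000, 4, 17),
]
blob = pack_points(points)
restored = unpack_points(blob)
```

Because labels travel in the same record as the coordinates, a training job needs a single read per point rather than a join against a separate label file, and the integer round trip is exact.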

2. Data security

Industrial data is sensitive, and strict regulations are in place to prevent any leakage of asset information. This was always a challenge when transferring data to third-party tools. In the newly designed process, we control user authentication, and the data never leaves Samp's servers and tools, which ensures better data security for customers.

3. Speeding up the data annotation process by leveraging in-house AI

It is common practice in 2D annotation tools to pre-segment the data using AI under the hood, which makes it easier to select regions for labelling. We applied the same idea to 3D: the AI built in-house is used in the annotation tool to pre-segment the data into clusters corresponding to the desired instances.
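
As a rough illustration of what pre-segmenting into clusters means, the sketch below groups points with a purely geometric rule: points falling in neighbouring voxels end up in the same cluster. It is a simple stand-in, not Samp's in-house AI, and the `presegment` function and `voxel_size` parameter are assumptions made for the example.

```python
from collections import deque
from itertools import product


def presegment(points, voxel_size=0.5):
    """Return one cluster id per point using connected components over occupied voxels."""
    # Bucket point indices by the voxel they fall into.
    voxels = {}
    for idx, (x, y, z) in enumerate(points):
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels.setdefault(key, []).append(idx)

    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in voxels:
        if start in seen:
            continue
        # Breadth-first search over the 26-neighbourhood of occupied voxels.
        seen.add(start)
        queue = deque([start])
        while queue:
            vx, vy, vz = queue.popleft()
            for idx in voxels[(vx, vy, vz)]:
                labels[idx] = cluster_id
            for dx, dy, dz in product((-1, 0, 1), repeat=3):
                neighbour = (vx + dx, vy + dy, vz + dz)
                if neighbour in voxels and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        cluster_id += 1
    return labels


cloud = [(0.0, 0.0, 0.0), (0.3, 0.1, 0.0), (0.6, 0.0, 0.0), (10.0, 10.0, 10.0)]
clusters = presegment(cloud, voxel_size=0.5)
```

An annotator can then select a whole cluster with one click and assign its label, instead of selecting points individually.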

The major gain is in time: the AI already does much of the heavy lifting, cutting annotation time in half.

4. Handling data variations

As we deploy across different categories of heavy industry such as water, waste and energy, we find differences from one to the next, typically in taxonomy, data types, data formats, occlusion and density.

Having our own data annotation tool gives us enough flexibility to accommodate these evolving needs efficiently and helps us scale across industries.

5. Productization and Human-in-the-loop

The vision for companies deploying computer vision models is to go from “Manual” to “Partial AI” to “Only AI”. However, mistakes here can have dangerous consequences for safety on these sites, so humans need to verify and, if needed, correct the output of the AI models before final delivery.

By connecting the data annotation tool to the Samp web platform, we have enabled corrections on client data, facilitating the “Human-in-the-loop”.
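
A human-in-the-loop correction step can be sketched as a reviewer's changes overriding the model's predictions, with each change recorded. The function name, the data shapes and the class names below are hypothetical and only illustrate the principle, not the Samp platform's interface.

```python
def apply_corrections(predictions, corrections):
    """Overlay reviewer corrections on AI predictions.

    predictions: {instance_id: predicted_class} from the model.
    corrections: {instance_id: corrected_class} entered by a reviewer.
    Returns (final_labels, audit_log), where audit_log lists
    (instance_id, old_class, new_class) for every label that changed.
    """
    final = dict(predictions)  # leave the model output untouched
    audit_log = []
    for instance_id, new_class in corrections.items():
        if instance_id not in final:
            raise KeyError(f"unknown instance {instance_id}")
        old_class = final[instance_id]
        if old_class != new_class:
            final[instance_id] = new_class
            audit_log.append((instance_id, old_class, new_class))
    return final, audit_log


ai_output = {1: "pipe", 2: "pipe", 3: "valve"}
review = {2: "handrail", 3: "valve"}  # reviewer fixes one label, confirms another
final_labels, audit_log = apply_corrections(ai_output, review)
```

The audit log of corrected instances is also a natural source of new training examples for the next model iteration.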

Every AI company faces its own challenges in obtaining the labelled data it needs. We would be very happy to connect and exchange on this subject.
