Scalability & Data Management Challenges

Academic year: 2022

(1)

Aadhaar

Scalability & Data Management Challenges

Dr. Pramod K. Varma

Chief Architect, UIDAI

twitter.com/pramodkvarma pramodkvarma.com

(2)

Understanding Aadhaar System

(3)

Establishing ID is a Challenge

• A resident typically accesses multiple service providers, at different times

– Mobile phone, bank A/C, rations, NREGA jobs, passport

• Needs to repeatedly re-establish ID = problem for the poor

– Birth records, address proof

– Money to ‘beat’ the system

• = No or limited access to entitlements and opportunities

(4)

Why Aadhaar?

• Difficulty in establishing ID → exclusion

• Weak authentication → inefficient delivery

• Financial: 60% unbanked (~700mn)

• Social Security Net: 45% BPL do not have a ration card

• Entitlements: ghost entries, duplication, multiple layers

• Food, fuel, fertilizer subsidy = ~Rs. 1 lac crore

“…biometric-based unique identity has the potential to address both these dimensions simultaneously.”

- Thirteenth Finance Commission

(5)

Enroll Once …

• Aadhaar Number - Unique, lifetime, biometric based identity

Demographic Data

• Compulsory data:

Name, Age/Date of Birth, Gender and Address of the resident.

• Conditional data:

Parents/Guardian details

• Optional data:

Phone no., email address

Biometric Data

• Resident’s Photograph

• Resident’s Finger Prints

• Resident’s Iris
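
The field groups on this slide could be modelled roughly as follows. This is a minimal sketch only: the class names, field names and the `missing_compulsory` helper are hypothetical and are not taken from the UIDAI schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demographics:
    # Compulsory data
    name: str
    date_of_birth: str  # age or date of birth, kept as text in this sketch
    gender: str
    address: str
    # Conditional data
    guardian_details: Optional[str] = None
    # Optional data
    phone: Optional[str] = None
    email: Optional[str] = None


@dataclass
class Biometrics:
    photograph: bytes
    fingerprints: bytes
    iris: bytes


def missing_compulsory(d: Demographics) -> List[str]:
    """Return the names of compulsory fields that are empty or blank."""
    required = {
        "name": d.name,
        "date_of_birth": d.date_of_birth,
        "gender": d.gender,
        "address": d.address,
    }
    return [field for field, value in required.items() if not value.strip()]
```

A record with every compulsory field filled in yields an empty list; blank fields are reported by name so that a client could reject the enrolment early.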

(6)

… authenticate many times

• Online service to verify the claim – “are you who you claim to be?”

• 1:1 check – only a “yes/no” answer

• Authenticate online

– Anytime, anywhere, multi-factor

– Always responds with “yes” or “no”

• Open identity platform

– Can be used in any service, any domain

– using any protocol, any device, any network

(7)

Application Modules

• Enrolment

• Geographically Distributed Client (mostly offline)

• Enrolment Server with Multi modal, Multi-vendor ABIS

• Authentication

• Geographically Distributed Servers

• Geographically Distributed Devices (several millions)

• Multi-factor support

• Supporting Systems

• Business Intelligence

• Fraud Detection

(8)

Enrolment Process

[Diagram: enrolment flow in five numbered steps. Recoverable labels follow.]

• Components: Enrolment Agency (data capture), Registrar, CIDR (Enrolment Service, enrolment processing, biometric de-duplication, UID assignment, letter delivery & verification), Logistics Partner (India Post), Customer Contact Center (information / issue resolution)

• Flows: enrolment data to CIDR (Option A, Option B); Aadhaar number and rejection data; Aadhaar letter or rejection letter; automatic synch (software/data)

(9)

Authentication Process

(10)

Enrolment Server

• Manages complete Aadhaar enrolment and lifecycle process

• Features

– Data validation

– Operator, supervisor verification

– Biometric de-duplication (1:N matching)

– Manual inspection

– Aadhaar number allocation / rejection

– Letter generation and delivery tracking

– Registrar integration

(11)

Biometric De-duplication

• Multi-modal matching

– 1:N matching (Every resident is matched using his/her biometrics against every entry in the ABIS system)

• Multi-vendor interface through ABIS API

– Dynamic allocation to ABIS vendor based on their accuracy and performance

– Multi-DC architecture adds complexity

• Exception handling

– Mostly automated and manual

– Volumes require highly automated and learning systems to handle exceptions in an effective manner
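
One way to picture the dynamic allocation bullet above: each ABIS vendor receives a share of the de-duplication workload in proportion to a score built from its measured accuracy and throughput. The sketch below is illustrative only; the scoring formula (accuracy times throughput), the `VendorStats` type and the vendor names are assumptions, not UIDAI's actual allocation policy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VendorStats:
    name: str
    accuracy: float    # fraction of correct match decisions, 0..1 (hypothetical metric)
    throughput: float  # matches per second (hypothetical metric)


def allocate(total_requests: int, vendors: List[VendorStats]) -> Dict[str, int]:
    """Split requests across vendors in proportion to accuracy * throughput."""
    scores = {v.name: v.accuracy * v.throughput for v in vendors}
    total_score = sum(scores.values())
    if total_score <= 0:
        raise ValueError("at least one vendor needs a positive score")
    shares = {
        name: int(total_requests * score / total_score)
        for name, score in scores.items()
    }
    # Integer truncation can leave a few requests unassigned;
    # hand the leftovers to the best-scoring vendor.
    best = max(scores, key=scores.get)
    shares[best] += total_requests - sum(shares.values())
    return shares
```

Recomputing the shares whenever fresh accuracy and throughput figures arrive is what would make the allocation dynamic; a vendor whose results degrade would see its share shrink on the next pass.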

(12)

Authentication

• Supports answering the question “is a resident the person he/she claims to be”

• Verifies resident information (demographics, biometrics) for a given Aadhaar number against the stored data

• Online service that is lightweight, ubiquitous, and secure

• Only responds with a “yes/no” and no personal identity information is returned as part of the response

• Supports multi-factor authentication using biometrics, PIN, OTP and combinations thereof

• Supports multiple protocols and devices

– Personal computer, mobile, PoS terminals, etc.

– Many protocols (USSD, SMS, HTTPS) over data and mobile connections

• Works with assisted and self-service applications
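
The yes/no contract described on this slide can be sketched as a function that checks whichever factors the caller supplies against the stored record and returns nothing but a boolean. Everything here is a simplified assumption: the in-memory `STORE`, the sample Aadhaar number, the exact-match name check and the unsalted SHA-256 PIN hash are placeholders for illustration, not the real service design.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


def _digest(value: str) -> str:
    # Sketch only: a production system would use a salted, slow hash.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


@dataclass
class StoredRecord:
    name: str
    pin_hash: str


# Hypothetical in-memory store keyed by Aadhaar number.
STORE = {"999900001111": StoredRecord(name="a resident", pin_hash=_digest("4321"))}


def authenticate(aadhaar: str,
                 name: Optional[str] = None,
                 pin: Optional[str] = None) -> bool:
    """1:1 check: every supplied factor must match; the answer is only yes/no."""
    record = STORE.get(aadhaar)
    if record is None or (name is None and pin is None):
        return False
    ok = True
    if name is not None:
        ok = ok and name.strip().lower() == record.name
    if pin is not None:
        ok = ok and hmac.compare_digest(_digest(pin), record.pin_hash)
    return ok
```

Because the function never returns the stored record, a caller learns only whether the claim holds, which mirrors the "no personal identity information is returned" point above.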

(13)

Scalability and Data Management Challenges

(14)

Architecture Highlights

• Support large scaling of enrolments and authentications

• No vendor lock-in across the system

• Use of open-source technologies wherever available and prudent

• Use of open standards to ensure interoperability

• Ensure wide device driver support for biometric devices through standardization

• Use of widely adopted technology platforms and tools

• Make all performance metrics (no PII) public through business intelligence portal for transparency

• Build strong end-to-end security upfront

(15)

Enrolment Server Architecture

• Throughput is the key

• Fully distributed compute platform

• Data sharded across multiple RDBMS instances and DFS

• Highly asynchronous using a high speed messaging layer

• SEDA (Staged Event-Driven Architecture) allows smarter failure handling

• Multi-DC architecture for near-zero RTO and zero RPO (adds complexity in biometric de-duplication)
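
The SEDA point above can be illustrated with a toy pipeline in which every stage owns its queue and its failure policy, so a bad event is retried and then parked at the stage where it failed instead of blocking the rest of the flow. This is a single-threaded sketch under simplifying assumptions; the `Stage` class, the stage names and the handlers are hypothetical, and a real deployment would run stages concurrently over a messaging layer.

```python
from collections import deque


class Stage:
    """One SEDA stage: its own queue, handler, and failure policy."""

    def __init__(self, name, handler, max_retries=2):
        self.name = name
        self.handler = handler
        self.max_retries = max_retries
        self.queue = deque()
        self.dead_letters = []

    def submit(self, event, attempt=0):
        self.queue.append((event, attempt))

    def drain(self, next_stage=None):
        """Process queued events; retry failures locally, then park them."""
        while self.queue:
            event, attempt = self.queue.popleft()
            try:
                result = self.handler(event)
            except Exception:
                if attempt < self.max_retries:
                    self.submit(event, attempt + 1)
                else:
                    self.dead_letters.append(event)
                continue
            if next_stage is not None:
                next_stage.submit(result)


def validate(event):
    if not event.get("name"):
        raise ValueError("missing name")
    return event


def mark_deduplicated(event):
    return {**event, "deduplicated": True}


def run_pipeline(events):
    validation = Stage("validation", validate)
    dedup = Stage("de-duplication", mark_deduplicated)
    done = Stage("done", lambda e: e)
    for event in events:
        validation.submit(event)
    validation.drain(dedup)
    dedup.drain(done)
    completed = [event for event, _ in done.queue]
    return completed, validation.dead_letters
```

Calling `run_pipeline` with one valid and one invalid event lets the valid one flow through both stages while the invalid one ends up in the validation stage's dead-letter list after its retries are exhausted.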

(16)

Enrolment Volume

• 600 to 800 million UIDs in 4 years

• 1 to 4 million enrolments a day

• When we cover half the country, we will end up doing

– 4 m * 12 * 500 m * 12 biometric matches a day!!!

• Data updates and new enrolments will continue forever

• Enrolment data moves from very hot to cold, needing multi-layered storage architecture
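
To get a feel for the size of the matching workload, the product on this slide can simply be multiplied out. The slide does not say what the two factors of 12 stand for; one plausible reading is ten fingerprints plus two irises per person, but that is an assumption. The variable names below are illustrative.

```python
enrolments_per_day = 4_000_000   # "4 m" on the slide
gallery_size = 500_000_000       # "500 m": residents already enrolled
factor = 12                      # appears twice on the slide; meaning not stated there

matches_per_day = enrolments_per_day * factor * gallery_size * factor
matches_per_second = matches_per_day // 86_400  # spread evenly over 24 hours

print(f"{matches_per_day:.2e} matches per day")        # 2.88e+17 matches per day
print(f"{matches_per_second:.2e} matches per second")  # 3.33e+12 matches per second
```

Even spread evenly across a full day, that is on the order of trillions of comparisons per second.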

(17)

Enrolment Data Management

• Enrolment requires handling of large binary data for all residents

– ~5 MB per resident for biometrics

– ~3 MB for supporting docs

– Maps to about 8 PB of raw data!

– With replication, it means managing about 25 PB of source data

– Replication and backup across DCs of 4+ TB of incremental data every day for near-zero RTO

• Additional workflow/process/event data

– 15+ million events on an average moving through async channels

– Needing complete update and insert guarantees across data stores

• Lifetime updates add several more petabytes
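
The storage figures on this slide can be cross-checked with simple arithmetic. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes); the interpretation of 25 PB as a number of copies is inferred from the ratio and is not stated on the slide.

```python
MB = 10**6
TB = 10**12
PB = 10**15

per_resident_bytes = (5 + 3) * MB   # biometrics + supporting documents
raw_bytes = 8 * PB                  # "about 8 PB of raw data"

# How many residents does 8 PB correspond to at 8 MB each?
implied_residents = raw_bytes // per_resident_bytes

# Ratio of replicated to raw data (25 PB vs 8 PB).
implied_copies = 25 / 8

# How many full 8 MB records fit into 4 TB of daily incremental data?
records_in_4_tb = (4 * TB) // per_resident_bytes

print(implied_residents, implied_copies, records_in_4_tb)
```

So the 8 PB figure corresponds to roughly one billion residents at 8 MB each, and 25 PB is a little over three times the raw volume.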

(18)

Authentication Server

• Authentication poses a response-time challenge

– Match demographics (partial, fuzzy, Indian language matching)

– Match biometrics (balancing FPIR)

• Needs to scale to handle hundreds of millions of requests every day with sub-second response

• Edge cached, in-memory operation

• Async data updates to the cache

• Stateless service

• Audits maintained asynchronously on HDFS
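
The fuzzy demographic matching mentioned above can be sketched with a plain string-similarity check. This is a minimal illustration, not the matcher used by the system: the normalization rules and the 0.85 threshold are assumptions, and the sketch does not attempt Indian-language matching, which would need script-aware and transliteration-aware logic.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def fuzzy_match(claimed: str, stored: str, threshold: float = 0.85) -> bool:
    """Return True when the normalized strings are similar enough."""
    a, b = normalize(claimed), normalize(stored)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

With this threshold, differences in case, punctuation and spacing are ignored and a single-letter spelling variation in a name still matches, while unrelated strings do not. In an in-memory, stateless service a check like this would run against cached records, keeping each request independent.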

(19)

Authentication Volume

• Few 100 million authentications per day, mostly during a 10 hr period

– High variance on peak and average

– Requires async request handling on HTTP server

– Sub second response with support for OTP, guaranteed audits

• Multi-DC architecture

– Fully load balanced

– Mostly reads with some updates (OTP, Audit)

• All changes need to be propagated from enrolment data stores to all authentication sites

– PIN updates, OTP requests, and less frequent demographic data updates

(20)

Authentication Data Management

• Minutiae based authentication request is about 1 KB

– Image based ones are about 10 KB on an average

• 100 million authentications / day means

– 1 billion audit records in 10 days

– 1 TB encrypted audit logs in 10 days

– Need to keep recent audits online, accessible any time, and older ones in archive until deleted

– Audit write must be guaranteed
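
The audit figures above follow from straightforward arithmetic, worked through here with decimal units assumed (1 TB = 10^12 bytes). The average request rate uses the 10-hour window mentioned for authentication volume and treats the load as evenly spread over it, which understates the real peak.

```python
auths_per_day = 100_000_000
window_days = 10
audit_bytes_in_window = 10**12  # "1 TB encrypted audit logs in 10 days"

audit_records = auths_per_day * window_days               # records in 10 days
bytes_per_record = audit_bytes_in_window / audit_records  # average audit record size

# Average rate if a day's load arrives within a 10-hour window.
avg_requests_per_second = auths_per_day / (10 * 3600)

print(audit_records, bytes_per_record, round(avg_requests_per_second))
```

That works out to one billion audit records at about 1 KB each, and an average of roughly 2,800 requests per second during the busy window.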

(21)

Analytics/Mining Architecture

• Analyzing terabytes of data generated out of billion+ events every day

– Constantly aggregating data across billions of records on a distributed compute grid to analyze and create patterns for operational and strategic decision making

• Fraud detection

– Detecting fraud during enrolment

– Detecting identity fraud scenarios near real-time during authentication

– Building mining, clustering, learning tools to work on top of billions of events

(22)

Technology Stack

• Java application deployed on Linux stack with virtualization

• Multiple MySQL instances as RDBMS

• Apache Hadoop (HDFS, Hive, HBase, Pig) stack for large scale compute and distributed storage

• RabbitMQ (AMQP standard) as messaging framework

• Drools for rules engine

• Several other open source libraries

• All 3rd party interfaces abstracted through standard API layer (VDM, ABIS, Language Support, etc)

(23)

Final Thoughts

• Largest biometric identity system is about 120 million. Scaling needs are unprecedented.

• Completely built on open standards and open source platforms

• Scalability, security, interoperability, and vendor neutrality are a must

• Next generation e-governance applications require cloud based, large data-driven, open platforms

• Research community support required

(24)

Thank You!

Dr. Pramod K. Varma

Chief Architect, UIDAI

twitter.com/pramodkvarma pramodkvarma.com
