Boost your ML & AI accuracy with thousands of ready-to-use ML features from external data sources

Discover and integrate new relevant features auto-generated
by LLMs and greedy feature engineering algorithms from
200+ public, community, and premium data sources



Trusted by data scientists and data engineers

❓Why use Upgini

On-the-fly data source optimization for ML models:

Automated feature generation with Large Language Models' data augmentation

Instructed embeddings generation using LLMs (such as GPT) with data augmentation from connected external sources.

If properly prompted with context from all relevant external data, an LLM significantly improves the quality of its embeddings for text data in a source.

Automated feature generation with special GraphNN and RNN


Specialized RNNs and graph neural networks generate features from transactional and graph data sources by extracting information from sequences and from relationships between objects in the data source.

OpenStreetMap is an example of a graph data source.

Multiple data sources ensembling to minimize data errors


Data is not perfect, and different sources, even with the same type of information, have their own errors.

Thus, if multiple sources with different error distributions are used, their ensemble will have better accuracy. This is similar to a consensus forecast.
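The consensus effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data is synthetic, the three sources are hypothetical, their errors are independent and zero-mean, and the ensemble is a plain row-wise average, which is not necessarily how Upgini combines sources.

```python
import random
import statistics


def rmse(predictions, truth):
    """Root mean squared error between two equal-length sequences."""
    return (sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical ground truth, e.g. the real value of some indicator per postal code.
truth = [rng.uniform(20.0, 80.0) for _ in range(5000)]

# Three hypothetical sources reporting the same quantity, each with its own
# independent, zero-mean error (standard deviation 5.0).
sources = [[t + rng.gauss(0.0, 5.0) for t in truth] for _ in range(3)]

# Simple ensemble: average the sources row by row.
ensemble = [statistics.fmean(values) for values in zip(*sources)]

source_errors = [rmse(s, truth) for s in sources]
ensemble_error = rmse(ensemble, truth)

print("per-source RMSE:", [round(e, 2) for e in source_errors])
print("ensemble RMSE:  ", round(ensemble_error, 2))
```

The gain depends on the errors being at least partly independent. If every source copies the same upstream mistake, averaging cannot remove it.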

Iterative search with automatic search keys augmentation from all connected sources

For example, if you lack geographic location information for an IP address, Upgini will search the connected sources for a cross-mapping between IDs.

If it finds the relevant information, it will automatically add a new search key - in this case, the postal code for each IP. This enables searching through all geo data sources in addition to IP sources.
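A minimal sketch of this idea, using plain dictionaries in place of real data sources. The lookup tables, the feature name poi_count_1km, and the function names are all hypothetical and are not part of the Upgini API.

```python
# Hypothetical cross-mapping source: IP address -> postal code.
IP_TO_POSTAL = {
    "203.0.113.7": "65101",
    "198.51.100.23": "10001",
}

# Hypothetical geo source that can only be searched by postal code.
GEO_FEATURES = {
    "65101": {"poi_count_1km": 12},
    "10001": {"poi_count_1km": 240},
}


def augment_search_keys(row):
    """Return a copy of row with a postal_code key added when it can be derived."""
    augmented = dict(row)
    if "postal_code" not in augmented:
        postal = IP_TO_POSTAL.get(augmented.get("ip"))
        if postal is not None:
            augmented["postal_code"] = postal
    return augmented


def enrich(row):
    """Augment the search keys first, then look up geo features by postal code."""
    augmented = augment_search_keys(row)
    geo = GEO_FEATURES.get(augmented.get("postal_code"), {})
    return {**augmented, **geo}


rows = [{"ip": "203.0.113.7"}, {"ip": "192.0.2.1"}]
enriched = [enrich(r) for r in rows]
for r in enriched:
    print(r)
```

The first row gains a postal code and, through it, a geo feature. The second IP has no cross-mapping, so it is returned unchanged.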

What is LLM-based automated feature generation

How this works

Large Language Models (LLMs) are capable of recognizing, summarizing, translating, predicting, and generating text. One of the most popular applications of LLMs is ChatGPT.

Several LLMs are integrated into Upgini data search to improve the accuracy of ML models.

Upgini enriches input texts with contextual information from external data sources and instructs the LLM with that context. The LLM then generates more accurate embeddings from a combination of the initial text, the contextual information, and the generated text.

Use case

Upgini automatically generates optimized embeddings, using LLMs with external data augmentation, for text in both the connected data sources and the training datasets submitted for search.

Just launch a data search with Upgini on your labeled training dataset with text columns. Upgini will generate LLM embeddings from the text columns and check them for predictive power for your ML task.

Finally, Upgini returns your dataset enriched with only the relevant components of the LLM embeddings.

Example

Raw string from a training dataset:
The Nook

Description generated by LLM without augmentation from external data:
The Nook is a line of e-readers and tablets produced by Barnes & Noble...

Description generated by LLM with augmentation from external data sources and advanced instructions:
The Nook is a tattoo shop located in Jefferson City, Missouri. The shop is known for....
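To make the difference between the two descriptions concrete, here is a sketch of how external context might be attached to a raw string before it is sent to an LLM. The prompt wording, the function name, and the context facts are illustrative assumptions. Upgini's actual instructions are not published on this page, and no LLM is called here.

```python
def build_augmented_prompt(raw_text, context_facts):
    """Combine a raw string with external context into one LLM instruction."""
    context_lines = "\n".join(f"- {fact}" for fact in context_facts)
    return (
        "Describe the entity named below. Use the context to decide which "
        "real-world entity is meant.\n"
        f"Entity: {raw_text}\n"
        f"Context from external data sources:\n{context_lines}\n"
    )


# Hypothetical context that a join with external sources could supply.
prompt = build_augmented_prompt(
    "The Nook",
    ["Business category: tattoo shop", "City: Jefferson City, Missouri"],
)
print(prompt)
```

Without the context lines, the model has only the two words "The Nook" to go on, which is why it falls back to the best-known entity with that name.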

Connected data sources

200+ Public, Community and Premium sources
239 countries
40 years of data history

🌐 Public data

Historical weather & Climate normals for postal/ZIP code

68 countries
22 years history
Monthly update

Air temperature
Precipitation
Wind
Air pressure
Normals
Sun hours
Moon phase

Location/Places/POI/Area/Proximity
from OpenStreetMap
for postal/ZIP code
221 countries
2 years history
Monthly update

POI Categories:
Schools, restaurants, hotels, supermarkets, etc
Houses:
Residential buildings, business centers, etc
Transport infrastructure:
Roads, public transport stops, etc
Public facilities:
Gov. offices, post office, police, etc
Natural features:
Public parks, green areas, etc
Stats for different distances (1 km / 3 km / 5 km)
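Statistics "for different distances" can be read as counts of objects within each radius of a reference point. The sketch below assumes that reading and uses the haversine great-circle distance. The coordinates are made up, and the real features may be computed differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def poi_counts(center, pois, radii_km=(1, 3, 5)):
    """Count points of interest within each radius of the center."""
    distances = [haversine_km(center[0], center[1], lat, lon) for lat, lon in pois]
    return {r: sum(d <= r for d in distances) for r in radii_km}


# Hypothetical postal-code centroid and POI coordinates on the same meridian,
# roughly 0.6 km, 2.2 km, 4.4 km and 11 km away from the centroid.
center = (50.0, 10.0)
pois = [(50.005, 10.0), (50.02, 10.0), (50.04, 10.0), (50.1, 10.0)]

counts = poi_counts(center, pois)
print(counts)
```

The counts are cumulative: a point inside the 1 km radius is also counted for 3 km and 5 km.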

International holidays & events, Workweek calendar

232 countries
22 years history
Monthly update

Workweek calendars by countries
Public holidays / Observed holidays
Religious holidays
Sporting events
Political events

Consumer Confidence index


44 countries
22 years history
Monthly update

World economic indicators

191 countries
41 years history
Monthly update

Consumer Price index
GDP
Central Bank Rates
Commodities prices

Markets data

17 years history
Monthly update

Stock prices
Stock volumes
Currencies and exchange rates
Market indexes

👩🏻‍💻 Community shared data

World demographic data
for postal/ZIP code


2 sources for ensemble
90 countries
Annual update

Residential population
Income
Home value
Home ownership
Employment
Industries
Occupations
Population mobility

Public social media profile data
for email & phone


600+ mln phones
350+ mln emails
104 countries
Monthly update

Estimated age
Gender, nationality
Residence & zip/postal code
Marital status
Employer, job title
Duration of employment
Interests

World mobile & fixed broadband network coverage and performance
for postal/ZIP code

4 sources for ensemble
167 countries
Monthly update

Mobile network coverage statistics
Fixed broadband and mobile network performance metrics - download/upload speed, latency
Estimated number of mobile phones & PCs
Statistics for different distances
(1 km / 3 km / 5 km)

Car ownership data and parking statistics
for postal/ZIP code, email & phone

3 countries
Annual update

Car Brand
Car Model
Year statistics
Parking statistics by:
Brand, Model

Geolocation profile
for IPv4 & phone


6 sources for ensemble
2^32 IP addresses
600+ mln phones
239 countries
Monthly update

Country
Region
City
Postal/ZIP code
ISP / ASN
Proxy/VPN/Datacenter flag for IP

World house prices
for postal/ZIP code


3 sources for ensemble
2 countries
Annual update

House price index for countries
House price index for zip/postal code

🛒 Premium data providers

Don’t see the data source you need?
Let us know, we’ll add that!

🔎 Search and enrichment for 6 entity types

Date or DateTime
Country: ISO 3166 codes
Postal/ZIP code: 900 000+ unique codes
Phone number: 600 mln+ phone numbers
Hashed email (HEM): 350 mln+ emails
IP address: 2^32 IP addresses
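A hashed email lets you search by email without sharing the address itself. The sketch below assumes the common convention of trimming and lowercasing the address and then taking its SHA-256 hex digest. Check the library documentation for the exact normalization and hash that Upgini expects. The helper name is hypothetical.

```python
import hashlib


def hashed_email(email):
    """Return a SHA-256 hex digest of a trimmed, lowercased email address.

    The normalization steps and the choice of SHA-256 are assumptions based
    on common HEM practice, not taken from this page.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


hem = hashed_email("  Jane.Doe@Example.com ")
print(hem)
```

Normalizing before hashing matters: without it, two spellings of the same address would produce different hashes and fail to match.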

🏁 Get started with Python

Step by step guide

#1

Install Upgini library

... from PyPI and check out our documentation on GitHub (it's open-source)

#2

Select data enrichment keys
and initiate feature search

You can reuse your existing labeled training dataset
Only relevant features that improve the model metric (ROC AUC, RMSE, etc.) are returned, not features that are merely correlated with the target variable.
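The difference between "correlated with the target" and "improves the metric" can be shown on synthetic data. This is a toy sketch, not Upgini's selection algorithm. The data, the two candidate features, the stagewise least-squares model, and the 1% improvement threshold are all assumptions made for the illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares for y ~ intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5


def holdout_rmse(base, target, split, extra=None):
    """Fit on rows [:split], score on rows [split:].

    With extra, a second line is fitted to the baseline residuals.
    """
    b0, b1 = fit_line(base[:split], target[:split])
    pred = [b0 + b1 * v for v in base]
    if extra is not None:
        resid = [t - p for t, p in zip(target[:split], pred[:split])]
        e0, e1 = fit_line(extra[:split], resid)
        pred = [p + e0 + e1 * v for p, v in zip(pred, extra)]
    errors = [(p - t) ** 2 for p, t in zip(pred[split:], target[split:])]
    return mean(errors) ** 0.5


rng = random.Random(0)
n, split = 3000, 2000
existing = [rng.gauss(0, 1) for _ in range(n)]    # feature already in the dataset
new_signal = [rng.gauss(0, 1) for _ in range(n)]  # information the model lacks
target = [2 * a + 1.5 * b + rng.gauss(0, 0.5) for a, b in zip(existing, new_signal)]

candidates = {
    # Strongly correlated with the target, but only repeats the existing feature.
    "redundant_copy": [a + rng.gauss(0, 0.3) for a in existing],
    "new_signal": new_signal,
}

baseline = holdout_rmse(existing, target, split)
selected = []
for name, values in candidates.items():
    score = holdout_rmse(existing, target, split, values)
    print(name, "corr:", round(correlation(values, target), 2), "RMSE:", round(score, 3))
    if score < 0.99 * baseline:  # keep only features with a real hold-out gain
        selected.append(name)

print("baseline RMSE:", round(baseline, 3), "selected:", selected)
```

Both candidates correlate with the target, but only the one carrying information the baseline model does not already have lowers the hold-out error.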

#3

Enrich ML model with new features and retrain

10-25% accuracy improvement over baseline results from mainstream AutoML frameworks

#4

Add external features into production ML pipeline

Enrich production datasets with up-to-date features and data

Contact us

Our team of ML and AI experts will be happy to answer your questions