Click on the Data analytics icon; this will redirect you to the screen where a project can be created.
This section covers uploading data into SEDGE from different types of databases. SEDGE can connect to databases such as MySQL, MariaDB, Oracle and PostgreSQL.
Future releases of SEDGE will also cover other databases such as Cassandra, SQLite, Presto, Redshift and Redis. If you need assistance connecting to these databases, contact our support team, and they can provide the functionality to connect to them as well.
We can use the SQL Query Generator in many ways, for example to combine records from two or more tables. There are four types of joins – Inner, Full, Left and Right. In our example, we perform a left join – combining records from the first (left-most) table with matching records from the right table. As a result, we get a left join of the tables “cancer diagnostic” and “diabetes”,
using the patient ID as the join condition. This means we get one final table that combines the information on each patient with the same ID.
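As an illustration of the left-join semantics described above (this is not the SQL Query Generator itself, and the tiny patient tables below are hypothetical), a minimal pandas sketch:

import pandas as pd

cancer_diagnostic = pd.DataFrame({"patient_id": [1, 2, 3],
                                  "diagnosis": ["benign", "malignant", "benign"]})
diabetes = pd.DataFrame({"patient_id": [1, 3],
                         "hba1c": [6.1, 7.4]})

# Left join: keep every row of the left table and attach matching right-table rows by patient ID
combined = cancer_diagnostic.merge(diabetes, on="patient_id", how="left")
print(combined)   # patient 2 has no diabetes record, so hba1c is NaN in that row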
A - After adding your files, click on the upload button; the upload might take a while depending on the file size, and you will then be redirected to the data preview page.
B - Users also have the option of zipping the CSV file, so that large CSV files can be uploaded.
C - Users can enter a name for the project (optional). If no name is entered, the system will take the name of the file and set a default project number.
File collaboration in SEDGE facilitates the sending and receiving of files and collaboration among users. This functionality helps users collaborate and exchange opinions with other SEDGE users, which is especially valuable for users in different geographic locations and time zones. It only takes a single click to send a file and grant access to another user.
For our demonstration, we have chosen the “Titanic” data. In the following steps, we will forecast, based on the available data, the chance of survival on board the Titanic.
In this case, the factors taken into consideration are: class, name, sex, age, number of family members, parch, ticket number, fare, cabin, and port of embarkation. Some of these factors might not be relevant for the predictive analysis, so we will also demonstrate how to remove or ignore some of the data.
Note
This page is just a preview of the full data; only a small sample is displayed here to give a first impression.
A - This row displays all the variables in the current analysis. You can sort the data by clicking on the arrows next to the respective variable
B - This row identifies if the data in a column are categorical or numerical (C or N)
C - Monitor in which phase of the analysis you are currently located – identified by the green colour
D - Information about the total number of rows and columns in the uploaded file
E - Action buttons to move either to the next step or return to the previous step
A - Dataset info – basic information about the respective data set, such as number of variables, number of observations/rows, percentage of missing data, and warnings
B - List of all columns/variables in the data set
C - Information if the variable is categorical or numerical
D - How many unique values are there within each variable (sex – male or female – 2 possible values, name – unlimited number of values)
E - Total number of missing values within the respective variable
F - Number of missing values represented in percentage
G - Mean – the average value of the respective numerical variable, found by adding all values in the data set and dividing by the total number of values
H - Median – the middle value in the data set – value that separates the higher half of the data sample from the lower half
I - SD – standard deviation represents how close or far the values are dispersed from the mean
J - In this tab, you have to select the target you are going to be predicting. It is one of the variables that are already part of the dataset. In our case, we are going to be predicting if the Titanic passenger did or did not survive
After selecting your preferences, click on the Apply button.
L - GDPR - Data pseudonymization
Depending on the nature of your data, you can choose whether you wish to protect personal data. If you select Yes, the next step will take you to the GDPR page, where you select which data you want to hide. If you do not need to hide any data, select the No button and the GDPR page will be skipped.
After filling in the GDPR form, you can select whether to run an Automatic Machine Learning or a Custom Machine Learning.
M - Machine learning (ML)
If in the previous step you selected “Yes” for GDPR protection, you will find the ML option under the GDPR form. If you selected “No”, you can select this option under this tab.
Statistics is the branch of mathematics concerned with collecting, analyzing, interpreting, presenting, and organizing data. It encompasses a wide range of techniques and methodologies used to make sense of data, including summarizing data through measures based on samples.
On the statistics page, there are several top functions available, which offer:
Numerical Operations
Strings
Dates
Maths
Text
Data
Advanced
Boolean
Frequently used
To navigate to the Statistics page from the profiling page, click on the next button or click the statistics icon.
A: In the first row, all the possible data operations are displayed – sorted into eight different categories: Numerical, String, Dates, Maths, Text, Data, Advanced, Boolean, and Frequently used. Each tab will be discussed separately on the following pages.
B: The first window in the statistics display showcases the fundamental statistics of the dataset, with the target highlighted in green, as indicated by the legend.
C: Information we can get in this window is:
Column name: Shows the columns or variables that are present in the dataset.
Value type: Numerical, Categorical, Date, Text.
Unique Values: Number of unique values in the variable.
Mean: The average value of the variable.
Median: The middle value of the variable.
Standard Deviation: SD - Measures how the data is spread around the mean.
Missing Values: Number of missing values of the variable in the total count.
% of Missing Values: The number of missing values in the dataset, represented as a percentage. Upon clicking on a column name, the other charts dynamically adjust based on the selection. To perform operations with a specific column, simply click on the checkbox corresponding to that column and proceed with the desired functions.
D: Graphstat - The graphical display illustrates our chosen value – in this case, Sales is selected for demonstration purposes. In this tab, the distribution of sales can be observed. Additionally, options are available to view the display in full-screen mode, download the graphs in different formats, print the chart, or utilize data binning to divide the dataset into appropriate bins (which refers to grouping data into ranges of values). Furthermore, if a Scatter Plot is required, the option can be enabled. This will plot for the target variable and the selected variable.
Bins
E: Transformations - Under this category, various available transformations can be applied to modify the data as needed. To perform transformations, refer to the syntax provided; you can also clear, test, and apply the syntax, as well as view the Expression History.
F: Preview - Displays a preview of the whole dataset (top 50 rows)
G: Advanced statistics - In this tab, advanced statistics for the selected row are displayed; in this case, the ‘Sales’ row is selected.
H: In this line, corresponding information from the advanced statistics is displayed: Minimum value, 5th percentile, First Quartile (25th percentile), Third Quartile (75th percentile), Interquartile Range (IQR), 95th percentile, Maximum value, Variance (measure of data dispersion), Skewness (measure of data symmetry), Kurtosis (measure of data peakiness or tails), Outliers (data points significantly distant from the rest), and the option to delete the Outliers.
J: Variable of Importance - In this window, all the rows of the dataset are visible, sorted by their importance towards the selected target. In this case, Sales is selected as the most important factor, although it may not be accurate. The variable of importance can be refreshed using this icon (M). Additionally, a Chart Context Menu is available where you can view in full screen, print the chart, download the image as PNG, JPEG, PDF, SVG, and download the data as CSV and XLS.
The operation converts a numerical value into its absolute value, exclusively accepting numerical inputs. For instance, temperature can be transformed into its absolute value to determine its distance from zero. Upon applying an operation, a success notification appears in the top-right corner. The newly generated column is displayed at the bottom of the preview. After the column is added, the variable of importance may change; refresh it to observe the impact.
image showing message notification after ABS function (absolute)
Simplifies converting measurements between different units, for example converting inches to centimeters or pounds to kilograms using a unit converter.
Round numbers to a specified number of digits, such as rounding housing measurements to 2 digits. This function allows manual or automatic selection of the column name and, if required, lets the user manually choose the name of the new column.
Converting a numerical column to a categorical format. In case of missing values, they can be managed by either deletion or imputation, techniques elaborated upon in the forthcoming “Data” chapter.
Upon applying this function, the successful validation will appear.
A Substring is a contiguous sequence of characters within a larger string. It is a portion of a string that is extracted or considered separately based on its starting position and length.
For the string “houstan”, the selected position is “Middle” with a start index 1 and length of 5. This operation will return the middle portion. For the middle position it is required to select the start index.
For the string “New York”, the selected position is “Exclude String” with a length of 3 and a start index of 1. This operation will exclude the specified characters from the string.
The output will be: York.
Example for substring for the position - Exclude String
For the string “Los Angels” and the selected character “s” with the position “after”, the operation will split the string after each occurrence of the character “s”.
The output will be: Angel.
Example for substring for the position - Split by Character/Pattern
Note
For the position “Middle” and “Exclude String” it is required to select the start index.
For the position “Split by character/pattern” it is required to choose either “Before” or “After”.
Replace one or more characters with another pattern/set of strings. Under ‘Pattern which exist,’ provide the pattern that needs to be changed. Then, under ‘Pattern to be replaced with,’ specify the pattern for replacement.
To replace “UK” with “United Kingdom,” provide the following:
Replace a set of strings based on its occurrence index value. Specify the start and end indices of the substring to be replaced, along with the replacement text.
Example: Original word: “Chine”
To replace a range of characters from index 1 to index 5 with “India” provide the following,
Start index: 1
End index: 5
String to be replaced: “India”
After completing the replacement operation, the updated word will be: India
To determine if an entry matches a searched string, compare each entry against the specified string. If a match is found, return the matched string. If no match is found, return False to indicate that the string is not present in the column.
Extracts numerical values from text. This function is useful for isolating and retrieving numbers embedded within text, simplifying the process of data cleaning and preprocessing.
Convert categorical variables into numerical values only when the selected categorical column contains numbers exclusively. Upon applying this function, the successful validation will appear.
Note
String operations can be performed only on String variables.
The Dates function menu helps to strip out the day, month, year, etc., from a date column. These stripped-out values act as additional features for the model.
Example: Let's try to extract the Month from the “Date of Hire” column.
This function is used to convert the date pattern to a different format more suitable for the analysis. After selecting the input and desired output pattern, click on the apply button.
Example: Let's convert the date pattern from “yyyy-MM-dd HH:mm:ss” to “yyyy-MM-dd”.
This will convert the existing column to the selected date pattern.
There are several types of date patterns and shortcuts that can be present in a date variable. You can design your own format patterns for dates and times from the list of symbols in the following table:
Characters that are not letters are treated as quoted text. That is, they will appear in the formatted text even if they are not enclosed within single quotes.
Removing or deleting all records or entries that fall within a specified range of dates from a dataset.
Steps to perform:
Select “From date” and “To date”
Select Keep range (will keep only the selected range of dates) or Delete range (will delete only the selected range of dates) based on your preferences.
Example: Let's keep only the records from “6/19/1961” to “8/17/1970”.
This will alter the existing column and will keep only the selected range of values.
Marking or highlighting all records or entries that fall within a specified range of dates. This is often done to identify, review, or take specific actions on the data within that range.
Example: Let's flag the records from “5/30/2011” to “2/19/2013”.
Calculation of the amount of time between two dates. This can be measured in various units such as days, months, or years.
Steps to perform:
Select the Minuend (the number from which another number, the subtrahend, is to be subtracted) and the Subtrahend (the number that is to be subtracted from the minuend).
Choose the output time unit from the dropdown menu.
Example: Let's compute the difference between the columns “Date of Hire” and “Date of Termination”.
This will generate a new column at the bottom of the preview.
Convert a positive to negative sign and vice versa. Choose between positive or negative conversion, and the result will be displayed in the existing column.
Equal width binning is a type of interval binning where each bin has the same width.
For example, if the data represents the heights of students in a class ranging from 120 cm to 180 cm, and we need to create 4 bins:
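With these values, each bin covers an equal width:

\text{bin width} = \frac{180 - 120}{4} = 15 \text{ cm}

giving the bins [120, 135), [135, 150), [150, 165) and [165, 180].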
This function includes several mathematical operations - Box cox, Yeo Johnson, Cubic root, Exponential, Log, Normalization, Power2, Power3, Reciprocal, Square root and Standardization. After selecting your chosen operation a new column is created at the bottom.
The Box-Cox transformation is a statistical method used to stabilize variance and make data more normally distributed (symmetric bell-shaped distribution where the mean, median, and mode are all equal)
This dataset is skewed to the right, indicating that most of the house prices are lower.
To apply the Box-Cox transformation mathematically,
Choose a range of values for the λ (lambda) parameter to test the transformation. Let’s say we try λ values ranging from -2 to 2.
For each λ value, calculate the transformed values using the Box-Cox transformation formula:
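For reference, the standard Box-Cox formula (defined for positive values of x_i) is:

y_i^{(\lambda)} = \begin{cases} \dfrac{x_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x_i), & \lambda = 0 \end{cases}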
The lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; The “optimal value” is the one which results in the best approximation of a normal distribution curve.
The Yeo-Johnson transformation extends the Box-Cox transformation to handle both positive and negative values, as well as zero values. It is useful for stabilizing variance and making data more normally distributed.
Factors influencing the lambda value include the data’s skewness, distribution, variance, sensitivity to outliers and optimization method. The lambda value is empirically determined to maximize the normality and symmetry of the transformed data.
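For reference, the standard Yeo-Johnson transformation is defined piecewise as:

\psi(x_i, \lambda) = \begin{cases} \dfrac{(x_i + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x_i \geq 0 \\ \ln(x_i + 1), & \lambda = 0,\ x_i \geq 0 \\ -\dfrac{(1 - x_i)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ x_i < 0 \\ -\ln(1 - x_i), & \lambda = 2,\ x_i < 0 \end{cases}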
The cubic root of a number is the value that, when multiplied by itself twice, yields the original number. For instance, the cubic root of 8 is 2.
\sqrt[3]{8} = 2, \quad \text{since } 2 \times 2 \times 2 = 8
Exponential refers to a mathematical operation or function where a constant (the base) is raised to the power of an exponent, resulting in a rapidly increasing or decreasing function. It is commonly represented as a^x, where a is the base and x is the exponent.
An example of an exponential function is f(x) = 2^x. Let’s evaluate this function for some values of x:
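f(0) = 2^{0} = 1, \quad f(1) = 2^{1} = 2, \quad f(2) = 2^{2} = 4, \quad f(3) = 2^{3} = 8, \quad f(-1) = 2^{-1} = 0.5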
The log transformation is a mathematical operation that calculates the logarithm of a number. It is commonly used to transform skewed data to make it more symmetrically distributed.
The transformation formula is as follows:
y_i = \log(x_i)
where:
x_i is the original value,
y_i is the transformed value.
Example:
\log_{10}(100) = \frac{\ln(100)}{\ln(10)}
We can substitute \ln(100) and \ln(10) with their respective values:
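\ln(100) \approx 4.6052 \quad \text{and} \quad \ln(10) \approx 2.3026

\log_{10}(100) = \frac{4.6052}{2.3026} = 2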
Normalization is a technique used to scale numeric features to a standard range, typically between 0 and 1 or between -1 and 1. It is used in statistical analyses to ensure that all features have the same scale and to prevent features with larger values from dominating those with smaller values.
For Example, Consider the following dataset,
{Data} = [2, 5, 8, 10, 15]
To normalize these values using min-max scaling, we’ll use the formula:
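x' = \frac{x - \min(x)}{\max(x) - \min(x)}

With min = 2 and max = 15, the normalized data become:

[0, 0.23, 0.46, 0.62, 1.0]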
The term “Power of 2” typically refers to the operation of exponentiation, where a number is raised to the second power. In mathematical notation, it is represented as "x^2" where, x is the base number. This operation involves multiplying the base number by itself.
For example:
2^{2} = 4
In general, raising a number to the power of 2 is known as squaring that number.
The term “power of 3” also refers to exponentiation, where a number is raised to the third power. In mathematical notation, it is represented as "x^3" where, x is the base number. This operation involves multiplying the base number by itself twice.
For example:
1200^{3} = 1,728,000,000
In general, raising a number to the power of 3 is known as cubing that number.
The reciprocal of a number is calculated by dividing 1 by that number.
Mathematically, if we have a number represented as x, the reciprocal of x is represented as \frac{1}{x}.
For example:
If x = 5, the reciprocal of 5 is 1/5.
Similarly, if x = 0.5, the reciprocal of 0.5 is 1/0.5 = 2.
The reciprocal of a number essentially represents how many times that number fits into 1. For instance, the reciprocal of 5 being 1/5 means that 5 fits into 1 one-fifth of a time.
The square root of a number is a value that, when multiplied by itself, gives the original number. In mathematical notation, the square root of a number x is represented as √x.
For example:
\sqrt{25} = 5. Since 5 * 5 = 25
The square root operation can also be denoted using exponentiation: x^(1/2). This means raising x to the power of one-half, which is equivalent to taking the square root of x.
In statistics, standardization is a process used to transform data so that it has a mean of zero and a standard deviation of one. This technique is applied to individual variables within a dataset, making them comparable and facilitating statistical analysis.
The formula for standardizing a variable x is:
z = \frac{x - \mu}{\sigma}
Where:
x is the original value of the variable,
\mu is the mean of the variable,
\sigma is the standard deviation of the variable, and
z is the standardized value.
Standardization transforms the distribution of the data to have a mean of zero and a standard deviation of one. This process does not change the shape of the distribution but places it on a common scale, allowing for easier comparison between variables and datasets.
Standardization is particularly useful in statistical modeling, machine learning, and data analysis, where it helps improve the interpretability and performance of models by ensuring that variables are on a consistent scale.
Example:
Consider the data x= [5, 20, 8, 3, 4, 6, 15, 12, 9, 10]
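Using the population standard deviation (an assumption; a sample standard deviation would give slightly different numbers), the worked values are:

\mu = 9.2, \qquad \sigma = \sqrt{\frac{253.6}{10}} \approx 5.04

z(20) = \frac{20 - 9.2}{5.04} \approx 2.14, \qquad z(3) = \frac{3 - 9.2}{5.04} \approx -1.23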
Text processing is used to analyze the most used words in a written text. It can also be used to identify SPAM emails by searching for specific key words. By performing text processing you will get the top 100 used words as new columns (see the sketch after the steps below).
Select the preferred N grams from 1 to 5.
Select the Encode type: Boolean (0 or 1) or Count.
Select Custom stop words if required.
Click on apply button.
Note
The output will appear at the bottom of the page. You can also preview the output on the preview page.
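For readers who want to see the underlying idea, a minimal scikit-learn sketch of word-count encoding (illustrative only; SEDGE performs this internally, and the column name, sample texts and stop words below are hypothetical):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"email_text": ["win a free prize now",
                                  "meeting moved to friday",
                                  "free prize waiting"]})
vectorizer = CountVectorizer(
    max_features=100,         # keep the top 100 words as new columns
    ngram_range=(1, 1),       # N grams from 1 to 1 (unigrams)
    binary=False,             # Count encoding; set True for Boolean (0 or 1)
    stop_words=["to", "a"],   # custom stop words, if required
)
counts = vectorizer.fit_transform(df["email_text"])
word_columns = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
print(word_columns)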
This operation will not generate a new column for the output; instead, it will only modify the data type within the same column, altering the output accordingly.
Note
Text functions can be performed only on Text variables.
One hot encoding is another method used to represent categorical variables in a format suitable for machine learning algorithms. Instead of assigning numerical labels like label encoding, one hot encoding creates binary vectors for each category.
Note
This function is applicable for both categorical and text columns. It does not require a user-defined output column name; instead, it will automatically generate a column name for the output.
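A minimal pandas illustration of one hot encoding (the “Embarked” column and its values are hypothetical):

import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
encoded = pd.get_dummies(df["Embarked"], prefix="Embarked")
print(encoded)    # each category becomes its own binary column: Embarked_C, Embarked_Q, Embarked_S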
Imputation involves filling in missing data points in a dataset (a short illustrative pandas sketch follows this list).
Techniques available for Numerical Data Type include:
Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)
Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)
Mode (Value that appears most frequently in a dataset. It’s the most common observation or value in the data.)
Default
Moving Average (Smooths out fluctuations in data by calculating the average of a subset of data points within a sliding window.)
Forward Filling (Copying the last observed value forward to fill in missing values.)
Backward Filling (Copying the next observed value backward to fill in missing values in a dataset.)
Custom Filling (Filling missing values using a user-defined method.)
Average of previous and next (Involves taking the average of the values immediately preceding and following the missing value to estimate its value.)
KNN (Missing values are replaced with the average or median value of the k nearest neighbors.)
The available techniques for the “Categorical” variable are:
Deleting rows with missing values can be a preferred approach when those values are critical for analysis and cannot be easily imputed.
This function removes all rows containing missing values across all columns in the dataset, ensuring that only complete observations are retained for analysis.
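The sketch referenced above — a minimal pandas illustration of a few of these techniques plus row deletion (illustrative only; the column names and values are hypothetical):

import pandas as pd

fare = pd.Series([7.25, None, 8.05, None, 71.28], name="Fare")   # hypothetical column with gaps
mean_filled    = fare.fillna(fare.mean())     # Mean imputation
median_filled  = fare.fillna(fare.median())   # Median imputation
forward_fill   = fare.ffill()                 # Forward filling (copy the last observed value)
backward_fill  = fare.bfill()                 # Backward filling (copy the next observed value)

df = pd.DataFrame({"Fare": fare, "Age": [22, None, 26, 35, None]})
complete_rows = df.dropna()                   # delete every row that has a missing value in any column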
Copying a column or creating a duplicate allows for performing two different operations on the same column, which can be useful. This process simply involves creating a copy of the column, and then proceeding with the desired operations. The new column is created at the bottom.
The function is used to concatenate two or more columns into one column. For instance, if there are columns for “First name” and “Surname,” they can be merged into a single column.
For example, “Name” and “Sex” columns are being combined. A delimiter is manually selected to clearly mark the transition between the values of the first and second columns.
Equal to: Filters data where the numerical value is exactly equal to a specified value. For example, filtering sales data for transactions where the amount equals $1000.
Not Equal to: Filters data where the numerical value is not equal to a specified value. For instance, excluding transactions with a sales amount of $1000.
Is Greater Than: Filters data where the numerical value is greater than a specified threshold. For example, filtering sales data for transactions with amounts greater than $1000.
Is Greater Than or Equal to: Filters data where the numerical value is either greater than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts greater than or equal to $1000.
Is Less Than: Filters data where the numerical value is less than a specified threshold. For example, filtering sales data for transactions with amounts less than $1000.
Is Less Than or Equal to: Filters data where the numerical value is either less than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts less than or equal to $1000.
Is Null: Filters data where the numerical value is null or missing.
Is Not Null: Filters data where the numerical value is not null or missing.
Between: Filters data where the numerical value falls within a specified range. For example, filtering sales data for transactions with amounts between $500 and $1000.
Note
This filter is applicable to Categorical and Text data types.
For adding more rules and conditions, click on “Add a Group” and choose between an “and” or “or” option for each step. Once the “Apply” button is clicked, the entire dataset will be modified to include only those entries that meet the specified filter criteria.
Grouped statistics organize data into groups and calculate statistical measures for each group, offering detailed insights into how variables behave within specific categories.
The available aggregations are:
Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)
Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)
Variance (Measures how far the values are spread out from the mean; it is the average of the squared deviations from the mean.)
Standard Deviation (How the data is spread around the mean)
Unique (Number of unique values in the dataset)
Count (Total number of elements present in a particular variable or column)
Sum (Total value obtained by adding together all the individual values in a specific variable)
Example: To calculate the average booking amount for each destination city, select INR_Amount with the Mean aggregation, grouped by the destination city column.
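For illustration only, an equivalent computation in pandas (the column names “to_city” and “INR_Amount” and the values are assumptions based on the example):

import pandas as pd

df = pd.DataFrame({
    "to_city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "INR_Amount": [4500, 7200, 5100, 6900],
})
avg_by_city = df.groupby("to_city")["INR_Amount"].mean()   # mean booking amount per destination city
print(avg_by_city)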
Transpose refers to the operation of flipping or rotating a dataset, typically switching its rows and columns. This operation can be useful for various purposes, such as changing the orientation of data for easier analysis or visualization, or preparing data for specific computations or algorithms. When a dataset is transposed, the rows become columns and vice versa.
Dropping duplicates refers to removing duplicate values in a column or removing entire duplicate rows. This operation ensures that each row in the dataset is unique, which can be important for various data analysis tasks.
The selected column (from_city) contains 2033 missing records.
After applying the function to the selected column, the validation results will be displayed.
Analysis based on pivoting data - a Pivot Table is used to summarise, sort, reorganise, group, count, total or average data stored in a table. It allows us to transform columns into rows and rows into columns.
We have two options in this section - either a “Pivot mode” or a regular chart mode.
In the regular chart mode we can select all the rows that we are interested in, in our example we selected Age, Fare, Sex, Ticket purchase, Cabin Detail and Number of meals. We get a basic chart with the selected values.
In the next step, we can also divide our selected columns in groups, by dragging the preferred group type down to the “Row groups” section. We have selected “Sex” to differentiate the groups. We can see we have 577 items in group “Male” and 314 items in group “Female”. We can open each group by clicking on the group name.
In the next step, we can add other numerical values that we are interested in. In our case, we dragged “Fare” down to the “Values” section and clicked on the desired value in the bottom green box. We can also select in which form the value should be displayed - average, count, first, last, max, min or sum. In our case it was the “average fare” value.
After that, select the fields you want to have in your Pivot table. For our demonstration we selected Survived, Age, Sex, Ticket purchase and Meal Cost. “Ticket purchase” was dragged to the “Row Groups”, so it was automatically shifted to the left side as the first column. We also shifted “Sex” to the “Column Labels” section, so our chart was split into two sections - one showing only the “Female” data and the other showing only the “Male” data. The rest of our selected fields (Survived, Age and Meal Cost) remain in the “Values” section, where we can again choose in which form each field should be displayed (average, count, first, last, max, min or sum; in our case it was the average).
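For readers who prefer code, a rough pandas equivalent of this kind of pivot (illustrative only; the tiny DataFrame and its values are made up, and the column names follow the text above):

import pandas as pd

df = pd.DataFrame({
    "Ticket purchase": ["Online", "Online", "Counter", "Counter"],
    "Sex":             ["male", "female", "male", "female"],
    "Survived":        [0, 1, 0, 1],
    "Age":             [22, 38, 54, 27],
})
pivot = pd.pivot_table(df,
                       index="Ticket purchase",      # Row Groups
                       columns="Sex",                # Column Labels
                       values=["Survived", "Age"],   # Values
                       aggfunc="mean")               # average, count, first, last, max, min or sum
print(pivot)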
Jupyter is an open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. Users can create a customised program/code using a Jupyter notebook and import it back into SEDGE.
Regular Expressions (REs) offer a mechanism to select specific strings from a set of character strings. They provide a context-independent syntax capable of representing a wide variety of character sets and their orderings, with interpretations based on the current locale. In addition to using Search and Replace for finding and replacing simple text strings, you can perform more precise searches using Regular Expressions (RegEx). RegEx allows you to accurately match a particular pattern in your search.
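A tiny Python illustration of the idea (the pattern and strings are hypothetical):

import re

text = "Order IDs: A-1023, B-8841, C-17"
ids = re.findall(r"[A-Z]-\d+", text)    # match a capital letter, a hyphen, then one or more digits
print(ids)                              # ['A-1023', 'B-8841', 'C-17']
print(re.sub(r"\d+", "#", text))        # replace every run of digits with '#'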
Delta (Δ) is a generalized function that calculates the difference between each row value and the value from its corresponding subtrahend position in the given numerical column grouped & sorted through the groupBy & orderBy columns respectively.
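A rough pandas sketch of the same idea (column names and values are hypothetical; SEDGE computes this through its groupBy and orderBy selections):

import pandas as pd

df = pd.DataFrame({
    "city":  ["A", "A", "A", "B", "B"],
    "month": [1, 2, 3, 1, 2],
    "sales": [100, 130, 125, 80, 95],
})
df = df.sort_values(["city", "month"])                    # orderBy column
df["sales_delta"] = df.groupby("city")["sales"].diff()    # difference from the previous row within each group
print(df)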
Refers to missing or nonexistent values in a dataset.
When checking for null values in a specific data field or observation, the result is typically a boolean (true or false) indicating whether the value is null (true) or not null (false). This process of checking for null values is essential for data cleaning and quality assessment before analysis.
Negation means flipping a boolean's truth value. For example, if the original boolean value is true, its negation would be false, and vice versa. It's a way to change a statement from being true to false, or vice versa.
Note
This function applies only to Boolean values.
When selecting a non-boolean data type variable, it will show the validation message “Selected column has no boolean values.”
This feature presents a bar chart for the selected column, displaying the value and its count. It supports numerical, categorical, and date data types for bar chart visualization. Numerical data types offer an option to adjust the number of Bins. Text data types are represented using a word cloud. Scatter plots can be enabled to visualize two numerical variables against each other, typically the selected variable and the target variable. Users can view charts in full-screen mode and utilize a print option. Chart download options include PNG, JPEG, PDF, and SVG formats. Data can be downloaded in CSV or XLS format.
Note
Categorical and date data types display only bar charts, while numerical data types offer both scatter plots and bar charts.
For categorical variables, the ‘sort by’ feature is available.
Select One Sample T Test - One tailed under test name.
Under Numerical column select Numerical variable.
Under Categorical column, select a categorical variable if required.
Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.
The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.
Sampling:
Enable Sampling, if required. Choose the sampling method: Random (each member of the population is selected purely by chance, so every individual or item has an equal chance of being included in the sample and selection is not biased by any personal judgment or preference) or Stratified (the population is divided into subgroups, or strata, based on characteristics relevant to the research question, and samples are then randomly selected from each stratum independently; this ensures that each subgroup is represented in the sample and allows for more precise estimates and comparisons within each stratum).
Size:
% of a Variable: Selecting a sample size that is a certain percentage of a particular variable. For instance, if the variable is the total population, and you choose 10%, your sample will include 10% of the total population. This approach is useful when you want the sample size to be proportional to the size of a certain variable.
Absolute Value: Selecting a sample size by specifying an absolute number. For example, if you decide that your sample size should be 5 individuals, you simply select 5 individuals from the population. This method is straightforward and is often used when the required sample size is predetermined by external factors such as budget or time constraints.
By Calculator: Using a calculator involves employing a statistical tool or software to determine the appropriate sample size based on certain parameters such as confidence level, margin of error, and population size. This method ensures that the sample size is statistically valid and sufficient to make reliable inferences about the population.
Lazy Sampling: A less rigorous approach where the sample is selected based on convenience or ease of access rather than using a strict randomization process. This method is quick and cost-effective but may introduce bias and is not always representative of the population. It’s often used in exploratory research or when resources are limited.
Systematic Sampling: Selecting every nth item from a list or population. For example, if you have a list of 1,000 individuals and you want a sample of 100, you would select every 10th individual. This method is easy to implement and ensures a spread across the population, but it assumes that the list is randomly ordered to avoid periodicity bias.
Filtering: Enable Filtering and filter the data, if required.
Once the steps are done, click on ‘Test’ and verify the result.
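For reference, a minimal SciPy sketch of a one-sample, one-tailed t-test (illustrative only; SEDGE runs the test for you, and the sample values below are made up):

import numpy as np
from scipy import stats

heights = np.array([168, 172, 169, 175, 171, 166, 174, 173])   # hypothetical sample
hypothesis_mean = 170                                           # hypothesised population mean
alpha = 0.05                                                    # level of significance

t_stat, p_value = stats.ttest_1samp(heights, popmean=hypothesis_mean, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis: the sample mean is significantly greater than 170 cm.")
else:
    print("Fail to reject the null hypothesis.")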
Provides a normalized representation of the distribution of data, making it easier to compare distributions and understand the relative proportions of data within different intervals.
Histogram:
The histogram displays the frequency distribution of age data within specified intervals (bins).
The blue bars represent the number of data points (frequency) that fall within each age interval.
The height of each bar corresponds to the count of data points in that interval.
Density:
The black curve represents the density estimate, showing the distribution of the data in a continuous manner.
The area under the density curve represents the entire data set and sums to 1.
The histogram gives a direct count of data points within intervals, while the density plot provides a smoothed view, highlighting the shape of the distribution.
Together, they offer a comprehensive view of the data, with the histogram showing discrete counts and the density plot illustrating the overall distribution pattern.
Provides a visual representation of how the shape of the distribution changes with different sample sizes and illustrates the properties of the t-distribution relevant for statistical inference.
Central Tendency of Age Data:
The peak of the density curve (red line) suggests that the most common or average age in the dataset is around 0 on a standardized scale. This means that the data is centered around this value.
Spread and Distribution of Age Data:
The shape of the density curve (red line) and the t-distribution curve (blue line) indicates how age values are distributed in the dataset.
The t-distribution curve (blue line) closely follows the density curve, but it’s slightly smoother and broader. This suggests that the data may have been fitted to a theoretical distribution, such as the t-distribution, often used in statistics for small sample sizes.
Significance of Tail Area:
The shaded area under the t-distribution curve (pink) represents the tail of the distribution.
This area is significant in statistical terms, especially for hypothesis testing, where it helps determine the probability of observing values beyond a certain threshold (e.g., ages above a certain point).
In a one-sample t-test with a two-tailed hypothesis, one examines whether the sample mean significantly differs from a hypothesized population mean. This involves calculating the t-statistic based on the sample data and comparing it to critical values from a t-distribution. The aim is to determine if the mean significantly deviates from the hypothesized value in either direction, hence the term “two-tailed”.
Select One Sample T Test - Two tailed under test name.
Under Numerical column select Numerical variable.
Under Categorical column, select a categorical variable if required.
Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.
The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.
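For the two-tailed variant, the same SciPy call can be used with its default two-sided alternative (again, the sample values are made up):

from scipy import stats

heights = [168, 172, 169, 175, 171, 166, 174, 173]          # hypothetical sample
t_stat, p_value = stats.ttest_1samp(heights, popmean=170)   # alternative="two-sided" is the default
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")               # compare p against the chosen alpha (e.g. 0.05)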
The Shapiro-Wilk test is a useful tool for determining the normality of a data set, which is a crucial assumption in many statistical analyses. It provides a clear decision-making process based on the p-value derived from the test statistic.
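A minimal SciPy sketch of the Shapiro-Wilk test (illustrative only; the sample is randomly generated):

import numpy as np
from scipy import stats

data = np.random.default_rng(0).normal(loc=170, scale=8, size=50)   # hypothetical, randomly generated sample
w_stat, p_value = stats.shapiro(data)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
# p > 0.05: no evidence against normality; p <= 0.05: the data deviate from a normal distribution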
The paired Student’s t-test compares the means of two related groups to determine if there is a statistically significant difference. It is used when the same subjects are measured before and after a treatment or in matched pairs. The test calculates the difference between paired observations, then computes the mean and standard deviation of these differences to determine the t-statistic. If the p-value is less than a chosen significance level (e.g., 0.05), the null hypothesis (no difference) is rejected, indicating a significant difference between the means.
This won’t display the Histogram-Density Plot and T-distribution.
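A minimal SciPy sketch of a paired t-test (illustrative only; the before/after values are made up):

from scipy import stats

before = [82, 75, 90, 68, 77, 85]    # measurements before treatment (hypothetical)
after  = [78, 70, 88, 65, 74, 80]    # the same subjects after treatment
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# p < 0.05 -> reject the null hypothesis of no difference between the paired means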
The Chi Square test is a statistical method used to determine whether there is a significant association between categorical variables. It compares the observed frequencies of categories with the expected frequencies if there were no association.
This won’t display the Histogram-Density Plot and T-distribution
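A minimal SciPy sketch of a Chi-Square test of independence (illustrative only; the contingency table below uses made-up counts in the spirit of the Titanic example):

import numpy as np
from scipy import stats

# Observed counts: rows = Sex (male, female), columns = Survived (no, yes)
observed = np.array([[468, 109],
                     [81, 233]])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# A small p-value indicates a significant association between the two categorical variables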
abs( ): Absolute value of a variable. The magnitude of a real number without regard to its sign. Numbers with negative values will be converted to positive ones.
Syntax:
abs(#column1#) AS <<Newabscolumn>>
New column name: Newabscolumn
Note
The absolute value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
Example:
abs(#temperature#)
Where (#temperature#) include following values -12, 6, 0.5, -6, ,25
After applying abs(#temperature#) the values will change to 12, 6, 0.5, 6, ,25
The column may contain negative temperature values, and this will convert all negative temperatures to positive.
acos( ): Arc cosine function result is the same as the inverse of cosine of a variable, the function returns angles in radians
Syntax:
acos(#column1#) AS <<Newacoscolumn>>
New column name: Newacoscolumn
Note
The acos function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
acos(#NumericalColumn#) AS <<Newacoscolumn>>
Where (#numericalColumn#) value is between -1 and 1
After applying acos, we will get an angle expressed in radians
Add months( ): The function returns a date with given number of months added (date + integer months)
Syntax:
add_months(#DateColumn#, number_of_months) as <<NewAddColumn>>
New column name: NewAddColumn
Note
The Add months function can only be applied to Date datatype columns
Example:
add_months(#DateOfHire#, 12) as <<NewAddColumn>>
Assuming new employees have a 12 month trial period, after adding 12 months to the Date of hire column we will get the date when the trial period terminates
append( ): Appending a variable with a pattern to create a new variable.
Syntax:
append(#column1#, 'Pattern to be appended') AS <<NewColumnApp>>
New column name: NewColumn
Note
Applies to both Categorical and Numerical values; the new field which is created is a Categorical type.
Example:
append(#temperature#, 'Deg C') AS <<AppCol1>>
Assuming that the temperature column is a numeric value and appending the string ‘Deg C’ will append the text to the numerical value. Example ‘13.5’ is the numerical value and the text ‘Deg C’ will be appended to read ‘13.5 Deg C’, and the new column AppCol1 which is formed will be string.
ascii( ): Function returns numerical value of the first character of a string
Syntax:
ascii(#Column#) as <<NewAsciiColumn>>
New column name: NewAsciiColumn
Note
Works for categorical columns
Example:
ascii(#CategoricalColumn#) AS <<NewAsciiColumn>>
We have a categorical column “Geography” and we want to transform the variables into numerical Ascii representation. We have variable “France”, “Spain” and “Germany” and after the transformation we get results of “70”, “83” and “71” as per the ASCII characters table.
asin( ): The arcsin function is the inverse of the sine function. It returns the angle whose sine is a given number. Use arcsin when you know the sine of an angle and want to know the actual angle
Syntax:
asin(#Column#) as <<NewAsinColumn>>
New column name: NewAsinColumn
Note
The asin function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
asin(#SineColumn#) as <<NewAsinColumn>>
Assuming we have a column with sine value (0 to 1) we can determine the Asin value.
In case our sine value is 0.85, we will get an asin value of 1.0159852938148251.
atan( ): The Atan function returns the arc tangent in radians (which is the inverse tangent), of its argument. The arctangent means the angle whose tangent is the argument.
Syntax:
atan(#Column#) as <<NewAtanColumn>>
New column name: NewAtanColumn
Note
The atan function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
atan(#NumericColumn#) as <<NewAtanColumn>>
Assuming we have a column with numerical value and a row has value of 37, the Atan function will return the arc tangent value of 1.5437758 - angle in radians.
atan2( ): The atan2 function calculates one unique arc tangent value from two variables x and y and returns angle in radians
Syntax:
atan2(#Column_Y#, #Column_X#) as <<NewAtan2Column>>
New column name: NewAtan2Column
Note
The atan2 function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
atan2(#NumericColumnY#, #NumericColumnX#) as <<NewAtan2Column>>
Assuming we have 2 columns with numerical values and a row has values of y=44 and x=10, the Atan2 function will return the value of 1.3473197256542635
base64( ): The base64 function is most commonly used to encode binary data and transform the argument from binary format to a base 64 string
Syntax:
base64(#Column#) as <<NewBaseColumn>>
New column name: NewBaseColumn
Example:
base64(#CategoricalColumn#) as <<NewBaseColumn>>
Assuming we have a categorical column with locations and one variable is France, the Base64 function will transform the value to RnJhbmNl as a base 64 string
Coalesce( ): The function returns the first non-null expression in the list. If all expressions evaluate to null, then the Coalesce function returns null
Syntax:
coalesce(#Column1#, #Column2#, ..., #ColumnN#) as <<NewCoalColumn>>
New column name: NewCoalColumn
Note
The function can be applied to Categorical and Numerical datatype columns
Example:
coalesce(#FinishDate#, #StartDate#) as <<NewCoalColumn>>
Assuming we have a column “StartDate” and “FinishDate”. Some of the variables in “FinishDate” are not filled in as a task was not completed yet. In this case the transformation will look at “FinishDate” first (listed first in the syntax); if the value is null it will return the “StartDate” value.
Concat( ): The function is used to concatenate two strings to make a single string, for example concatenate two or more columns in one
Syntax:
concat(#Column1#, #Column2#, ..., #ColumnN#) as <<NewConcColumn>>
New column name: NewConcColumn
Example:
concat(#Sex#, #Age#) as <<NewConcColumn>>
Assuming we have a column “Sex” and “Age” and we want both of these variables in one single column. After applying the concat function, we will get results like “Male24”, “Female43” etc.
Concat_ws( ): The function joins the input strings into a single string. It separates those concatenated strings with the separator specified in the first argument
Syntax:
concat_ws('separator',#Column1#, #Column2#, ..., #ColumnN#) as <<NewConcWSColumn>>
New column name: NewConcWSColumn
Note
The CONCAT_WS() requires at least two input strings
Example:
concat_ws('/',#Sex#, #Age#) as <<NewConcWSColumn>>
Assuming we have a column “Sex” and “Age” and we want both of these variables in one single column separated by “/”. After applying the concat function, we will get results like “Male/24”, “Female/43” etc.
Cos( ): The Cos function returns the cosine of a number
Syntax:
cos(#Column#) as <<NewCosColumn>>
New column name: NewCosColumn
Note
The Cos function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
cos(#NumericalColumn#) as <<NewCosColumn>>
Assuming we have a numerical column with angles and we want to convert it into cosine values, we use this function. A value of 2 will be converted into -0.416147.
Cosh( ): The Cosh function returns the hyperbolic cosine of a number
Syntax:
cosh(#Column#) as <<NewCoshColumn>>
New column name: NewCoshColumn
Note
The Cosh function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
cosh(#NumericalColumn#) as <<NewCoshColumn>>
Assuming we have a numerical column with angles and we want to convert it into hyperbolic cosine values, we use this function. A value of 3 will be converted into 10.067.
Cot( ): The Cot function returns the cotangent of a number
Syntax:
cot(#Column#) as <<NewCotColumn>>
New column name: NewCotColumn
Note
The Cot function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
cot(#NumericalColumn#) as <<NewCotColumn>>
Assuming we have a numerical column with angles and we want to convert it into cotangent value we use this function. A value of 6 will be converted into -3.436353.
Crc32( ): The Crc32 function function calculates a 32-bit CRC (cyclic redundancy checksum) for a string. This function can be used to validate data integrity
Syntax:
crc32(#Column#) as <<NewCrcColumn>>
New column name: NewCrcColumn
Note
The Crc32 function can only be applied to Categorical datatype columns
Example:
crc32(#CategoricalColumn#) as <<NewCrcColumn>>
Assuming we have a categorical column and we want to convert it into 32-bit CRC value we use this function. A string word “France” will be converted into 1493835906.
ceil() : Ceiling value of a number returns the greatest integer, greater than or equal to the number. The ceil of number 2.5 will generate the value 3. The ceil of the value -2.5 will generate the value -2.
Syntax:
ceil(#column1#) AS <<NewCeilColumn>>
New column name: Newceilcolumn
Note
The ceil value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
Example:
ceil(#NewBornWeight#) AS <<NewCeilColumn>>
Where (#NewBornWeight #) include following values 2.2, 3.5, 2.8, , 2.7
After applying ceil(#NewBornWeight #) the values will change to 3, 4, 3, , 3
Date_trunc ( ): The function returns date with the time portion of the day truncated to the unit specified by the format model
Format should be one of [“YEAR”, “YYYY”, “YY”, “MON”, “MONTH”, “MM”, “DAY”, “DD”, “HOUR”, “MINUTE”, “SECOND”, “MILLISECOND”, “MICROSECOND”, “WEEK”, “QUARTER”]
Syntax:
date_trunc('format',#DateColumn#) as <<NewTruncColumn>>
delete( ): Deleting string character from the 2nd to 5th position in the Column
Syntax:
delete(#column1#, 2, 5) AS <<NewColumnDel>>
New column name: NewColumn
Note
Applies to Categorical data type and the new column which is created is also Categorical type.
Example:
delete(#column1#, 1, 6) AS <<NewColumnDel>>
The characters in column1 from the 1st to the 6th position are deleted. If the data in column1 is “Hello World” and “Great Day”, after applying the formula the new field will have the values “World” and “Day”.
effect( ): The EFFECT Function returns the effective annual interest rate, from the nominal annual interest rate (APR), and the number of compounding periods per year.
The arguments can be explained as:
Nominal - Annual interest rate, also known as APR (Annual Percentage Rate)
Npery - Number of periods that interest is calculated
Syntax:
EFFECT(#nominal_rate#, #npery#) AS <<Newcol_effect>>
exp( ): The exp() function allows users to calculate the exponential value with the base set to e.
e is a mathematical constant with a value approximately equal to 2.71828, also called Euler's number.
Syntax:
exp(#column1#) AS <<NewColumnExpo>>
New column name: NewColumn
Note
A - e is a Mathematical constant, with a value approximately equal to 2.71828
B - Applies to Numerical data type and the new column which is created is also Numerical type.
Example:
exp(#column1#) AS <<NewColumnExpo>>
Assuming the numerical value in the #column1# is 2 then exp(2) means 2.7182 to the power of 2
Expm1 ( ): The expm1 function, (e ^ x) - 1, returns e raised to the given power argument, minus 1.0, where e is the base of the natural logarithms (Euler's number, approximately 2.71828)
Syntax:
expm1(#Column#) as <<NewExpm1Column>>
New column name: NewExpm1Column
Example:
expm1(#NumericalColumn#) AS <<NewExpm1Column>>
Assuming the numerical value in the #column# is 3 then expm1 result will be 19.08554
A - Applies to Numerical data type and the new column which is created is also Numerical type.
B - If the value is non-integer, then the factorial will change the number to integer value and then apply factorial on integer.
Example:
factorial(#column1#) AS <<NewColumnFact>>
Assuming the numerical value in the #column1# is 5 then factorial(5) means 5 x 4 x 3 x 2 x 1 = 120
If the value in the column is non-integer then factorial(4.3) means 4 x 3 x 2 x 1 = 24
find( ): Finds the given word within the field, and returns True if it finds the word, and False if not.
Syntax:
find(#marital_status#,'Married') AS <<NewColumnFind>>
The first word is the field name within #. Example #marital_status#
The second is the text / word within quotes that needs to be found. Example ‘Married’
New column name: NewColumn
Note
A - Applies to Categorical data type; the new column which is created is Categorical type and the output is True or False.
B - If the word exists only partially, it will not return True.
C - The word being searched for is case sensitive.
Example:
find(#marital_status#,'Married') AS <<NewColumnFind>>
If the word ‘Married’ is found in the field, it returns True, else False.
floor( ) : Floor value of a number returns the largest integer value less than or equal to the specified number. The floor of number 2.5 will generate the value 2. The floor of the value -2.5 will generate the value -3.
Syntax:
floor(#column1#) AS <<Newfloorcolumn>>
New column name: Newfloorcolumn
Note
The floor value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
Example:
floor(#NewBornWeight#) AS <<Newfloorcolumn>>
Where (#NewBornWeight#) includes the following values 2.2, 3.5, 2.8, , 2.7
After applying floor(#NewBornWeight#) the values will change to 2, 3, 2, , 2
Format_number ( ): The Format number function formats a number to a format like “#,###,###.##”, rounded to a specified number of decimal places, then it returns the result as a string
Syntax:
format_number(#Column#, number of digits to be rounded to) as <<NewNumForColumn>>
New column name: NewNumForColumn
Example:
format_number(#Distance#, 2) as <<NewNumForColumn>>
Where distance is 153.4625 the result string will be 153.46.
From_utc_timestamp ( ): The function interprets the given date/time value as UTC and converts it to the specified time zone, returning the result as a timestamp.
Syntax:
from_utc_timestamp(#Datecolumn#, 'timezone') as <<NewTimeColumn>>
fv( ): The FV function is a financial function that returns the future value of an investment. We can use the FV function to get the future value of an investment assuming periodic, constant payments with a constant interest rate.
Rate - Interest rate per compound period
N - The total number of payment periods
Pmt - The payment per period
Pv - The present value of the investment
Type - When payments are deposited
end = End of each period
begin = Beginning of each period
Syntax:
fv(#Annual_Interest_Rate#, #Number_of_Periods#, #Amount_of_Payments#, #Present_value#, 'end') as <<Newcol_fv>>
New column name: Newcol_fv
fv(#int_rate#, #open_acc#, #Par#, #pv#) as <<new_fv>>
geo_dist_kms(#Lati_A#, #Lati_B#, #Longi_A#, #Longi_B#) as <<NewGDKColumn>>
Where the selected Latitude A is 38.631913, Longitude A is -121.434879, Latitude B is 38.502519 and Longitude B is -121.420769 the calculated distance is 14.44016 Kms
hex( ): Converts decimal numbers (base 10) to hexa decimal value (base 16). As a base-16 numeral system, it uses 16 symbols. These are the 10 decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and the first six letters of the English alphabet (A, B, C, D, E, F). The letters are used in order to represent the values 10, 11, 12, 13, 14 and 15 each in one single symbol.
Syntax:
hex(#column1#) AS <<NewColumnHex>>
New column name: NewColumn
Note
Applies to numerical data type and the new column which is created is Categorical type.
Example:
hex(#column1#) AS <<NewColumnHex>>
Converts the decimal numeral system to hexa decimal numeral system.
If Else And: The if/else statement executes a code if a specified condition is true. If the condition is false, another code is executed.
Syntax:
if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <<NewAgeColumn2If>>
New column name: NewAgeColumn2
Note
A Applies to Categorical, Numerical, Text or Date data type and the new column which is created is either Categorical or Numerical type.
B The column name is case sensitive and the existing columns should be defined within #.
C The And operator is also used to join multiple conditions.
Example:
if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <<NewAgeColumn2If>>
In the above example a new column NewAgeColumn2If is created from the age column: age values between 0 and 18 are classified as ‘Child’, ages from 19 to 50 as ‘Adult’, ages from 51 to 70 as ‘Mature’, and the rest as ‘Old’.
Initcap ( ): The function returns the expression with the first letter of each word in uppercase and the rest of the letters in lowercase. Words are delimited by white space.
Syntax:
initcap(#Column#) as <<NewInitCapColumn>>
New column name: NewInitCapColumn
Example:
initcap(#Gender#) as <<NewInitCapColumn>>
If values such as ‘male’ and ‘female’ start with a lowercase letter, the new column after applying the function will contain ‘Male’ and ‘Female’.
insert( ): Inserts a word at the specified position in the field.
Syntax:
insert(#column#, position,'_WordToInsert') as <<NewColumnIns>>
The first part is the field name within #, where the word needs to be inserted, for example #occupation#.
The second part is the number giving the position at which the word will be inserted, for example 4.
The third part is the text/word in single quotes that needs to be inserted, for example ‘_SERVICE’.
New column name: NewColumnIns
Note
A Applies to Categorical and Text data type and the new column which is created is Categorical type.
B If the field does not have any word in the field the insert function will not insert the word.
Example:
insert(#occupation#,4,'_SERVICE') as <<NewColumnIns>>
The word ‘_SERVICE’ is inserted after the 4th position of the characters found in the field #occupation#.
Instr ( ): The function returns the position of the first occurrence of a string in another string
Syntax:
instr(#Column#, 'substr') as <<NewInstrColumn>>
New column name: NewInstrColumn
Example:
instr(#Name#, 'Mr.') as <<NewInstrColumn>>
We are searching for the string ‘Mr.’ in the name column and the function returns the position where the string was found. If the string is not there, the function returns ‘0’.
Inverse_sign ( ): The function returns the negated values i.e. positive to negative and vice versa of given numerical column
Syntax:
inverse_sign(#Column#) as <<NewInvSignColumn>>
New column name: NewInvSignColumn
Example:
inverse_sign(#Temperature#) as <<NewInvSignColumn>>
In a selected ‘Temperature’ column we will have values like 20, 7, -6 and -18 and after the transformation the function will return values -20, -7, 6 and 18.
is_text( ): Conditional statement that checks whether each field in the given column is text. Returns True or False. Condition: a string is considered text if its length is greater than 50 characters.
Syntax:
is_text(#column#) AS <<Newcol_istext>>
New column name: Newcol_istext
Note
Applies for both Numerical and Categorical datatypes to check whether the value is Text or Not
Example:
is_text(#age#) as <<istext_col>>
Checks whether the string in the field is text and returns True or False.
Isnan ( ): The function determines whether a value is NaN (Not-a-Number). This function returns true if the value equates to NaN. Otherwise it returns false
Syntax:
isnan(#Column#) as <<NewIsNaNColumn>>
New column name: NewIsNaNColumn
Example:
isnan(#PasswordColumn#) as <<NewIsNaNColumn>>
We are checking if a value in a selected “PasswordColumn” is Not A Number - if we have value ‘567’ the function will return ‘false’, as the expression is a number, if we have value ‘blue’ then function will return ‘true’ as the expression is not a number.
Isnull ( ): The function returns 1 or 0 depending on whether an expression is NULL.
If expression is NULL, this function returns 1 as true. If not, it returns 0 as false.
Syntax:
isnull(#Column#) as <<NewIsNullColumn>>
New column name: NewIsNullColumn
Example:
isnull(#Column#) as <<NewIsNullColumn>>
We are checking if a value in the selected “Column” is NULL. If the value in the column is ‘1’ the function will return 0 (false), as the value is not NULL.
Is not null ( ): The function returns 1 or 0 depending on whether an expression is NULL.
If expression is not NULL, this function returns 1 as true. If it is NULL, it returns 0 as false.
Syntax:
isnotnull(#Column#) as <<NewIsNotNullColumn>>
New column name: NewIsNotNullColumn
Example:
isnotnull(#Column#) as <<NewIsNotNullColumn>>
We are checking if a value in the selected “Column” is NOT NULL. If the value in the column is ‘1’ the function will return 1 (true), as the value is not NULL.
Least ( ): The function returns the smallest value from the list of arguments
Syntax:
least(#Column1#, #Column2#, ..., #ColumnN#) as <<NewLeastColumn>>
New column name: NewLeastColumn
Example:
least(#Balance#, #EstimatedSalary#) as <<NewLeastColumn>>
If we have numerical columns with multiple values, the function will select the smallest value. If one column contains a value of 14,567 and the second column a value of 6,553, the function will return 6,553 as the least value.
Levenshtein ( ): The function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2
Syntax:
levenshtein(#Column1#, #Column2#) as <<NewLevenColumn>>
New column name: NewLevenColumn
Example:
levenshtein(#Password1#, #Password2#) as <<NewLevenColumn>>
If one value is “tomato” and second one is “mate” the Levenshtein distance is 3 - as we need to replace, insert or delete 3 characters to transform “tomato” into “mate”.
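For reference, a minimal sketch of the standard dynamic-programming Levenshtein distance in Python (not SEDGE's internal implementation) reproduces the example:

# Minimal dynamic-programming sketch of the Levenshtein distance.
def levenshtein(s1, s2):
    prev = list(range(len(s2) + 1))               # distances for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("tomato", "mate"))   # 3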
Locate ( ): The function identifies the position of the first occurrence of a substring in a string. If the substring is not found in the original string, the function returns 0. The function ignores upper or lower cases
Syntax:
locate('substr',#Column#) as <<NewLocColumn>>
New column name: NewLocColumn
Example:
locate('r',#Surname#) as <<NewLocColumn>>
We are searching for the expression “r” in the surname column. Surname “Nobe” returns 0 as it does not contain an “r”. Surname “Taylor” returns “6” as the expression “r” is at the 6th position.
log( ): Gives the logarithmic value of a number.
A natural log value is the logarithm to the base of the mathematical constant e, where e is an irrational and transcendental number approximately equal to 2.71828.
Similarly, the user can use any base, such as 2 or 10, to calculate the logarithmic value.
Syntax:
log(Basevalue,#column1#) AS <<NewColumnLog>>
New column name: NewColumnLog
Note
Applies to Numerical data type and the new column which is created is Numerical type.
Example:
log('e',#column1#) AS <<NewColumn1>>
Example:
log(2,#column1#) AS <<NewColumn2>>
Performing a logarithmic transformation helps with data normality, especially if the data is skewed. The log transformation can be used to make highly skewed distributions less skewed. Log transformation is important as it makes patterns in the data more interpretable.
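As a small illustration of this effect outside SEDGE (using NumPy and SciPy; the sample values below are made up):

import numpy as np
from scipy.stats import skew

# Made-up, right-skewed sample; the log compresses the large values.
x = np.array([1, 2, 2, 3, 4, 5, 8, 20, 60, 300], dtype=float)
print(round(skew(x), 2))            # strongly positive skew
print(round(skew(np.log(x)), 2))    # noticeably smaller skew after the log transform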
Md5 ( ): The Message Digest Algorithm 5 (MD5) is a cryptographic hash algorithm that is used to create a 128-bit string value from an arbitrary length string
Syntax:
md5(#Column#) as <<NewMd5Column>>
New column name: NewMd5Column
Example:
md5(#France#) as <<NewMd5Column>>
The function will return 0309a6c666a7a803fdb9db95de71cf01 as a result.
Months_between ( ): The function returns number of months between date1 and date2. If date1 is later than date2, then the result is positive. If date1 and date2 are on the same day of month, or both are the last day of month, time of the day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits. If we mention ‘false’ in the syntax, the number will not be rounded.
Syntax:
months_between(#EndDatecolumn#, #StartDatecolumn#, false) as <<NewMBColumn>>
New column name: NewMBColumn
Example:
months_between(#2003/01/01#, #2003/03/14#, false) as <<NewMBColumn>>
The above mentioned date difference will be calculated as -2.41935483870968.
offday_count( ): Returns the number of off days (holidays and weekends) between two given date columns, based on the country.
Syntax:
offday_count(#column1#,#column2#, 'countrycode') as <<Newcol_offdaycount>>
New column name: Newcol_offdaycount
For the list of country codes, click on the link below: List of country codes
Note
The offday_count function is only applied to date columns.
The Countrycode field should contain valid codes.
Example:
offday_count(#DOB#, #Date of Hire#, 'US') as <<offdaycount>>
The above example returns the number of off days (holidays and weekends) between the two given date columns, based on the country (US).
pmt( ): The PMT Function returns the regular and constant repayments for a loan or mortgage required to reduce the balance to zero, or an amount you specify. It’s based on a constant interest rate.
Syntax:
pmt(#Annual_Interest_Rate#, #Number_of_Payments#, #Loan_amount#, #Final_balance#, 'end') as <<Newcol_pmt>>
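As an illustration of the calculation behind this function, here is a minimal Python sketch of the standard annuity-payment formula. The sign convention used below is an assumption for the sketch, not SEDGE's documented behaviour.

# Minimal sketch of the standard loan-payment (annuity) formula.
def pmt(rate, nper, pv, fv=0.0, when='end'):
    """Constant per-period payment that brings pv down to fv after nper periods."""
    if rate == 0:
        return -(pv + fv) / nper
    factor = (1 + rate) ** nper
    adjust = 1 + rate if when == 'begin' else 1
    return -(pv * factor + fv) * rate / (adjust * (factor - 1))

# Example: 1% per period, 36 periods, a 10,000 loan repaid to zero
print(round(pmt(0.01, 36, 10000), 2))   # about -332.14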
pow( ): The power (or exponent) of a number is the number of times to use the number in a multiplication
Syntax:
pow(#age#,3) as <<power_age_column>>
New column name: power_age_column
Note
A Applies to Numerical data type and the new column which is created is Numerical type.
B The power transformation is used when the numbers are clustered too close to each other. Power Transformer is used to make the data distribution more-Gaussian (Normal).
Example:
pow(#age#,3) as <<power_age_column>>
The power (or exponent) of a number is the number of times to use the number in a multiplication. In the example above the power is raised to 3, that is multiplying the number in the field three times.
Proper( ): The Proper Case converts categorical and text values to upper case for the first character of each word and to lower case for subsequent characters in the word.
Syntax:
Proper(#column1#) AS <<NewProperColumn>>
New column name: NewProperColumn
Note
A Applies to categorical and text data type and the new column which is created is Categorical type.
B Does not apply to Numerical and Date data type columns
Example:
Proper(#column1#) AS <<NewProperColumn>>
Converts characters to proper case (the first letter is capitalised, the rest are lower case).
radians( ): Converts angles in degrees to radians. 1 degree is equivalent to (π/180) radians. Multiply the number of degrees by π/180 to convert it to radians.
Syntax:
radians(#angles#) AS <<NewRadiansColumn>>
New column name: NewRadiansColumn
Note
Applies to Numerical data type and the new column which is created is Numerical type.
Rand ( ): The function returns a random number between 0 (inclusive) and 1 (exclusive). If seed is specified, it returns a repeatable sequence of random numbers. If seed is not specified, it returns a completely random number
Randn ( ): The function returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution
reci( ): The reciprocal transformation is defined as the transformation of x to 1/x. The transformation reverses the order of values with the same sign.
Syntax:
reci(#column1#) AS <<NewReciprocalColumn>>
New column name: NewReciprocalColumn
Note
A Applies to Numerical data type and the new column which is created is Numerical type.
B The transformation can only be used for non-zero values and also values should not be missing.
C If the standard deviation is proportional to the mean squared, a reciprocal transformation can be performed.
D Transformation helps to improve interpretability. If the fuel economy of cars is being compared, there are two ways of looking at fuel consumption: (a) litres per kilometre or (b) kilometres per litre. The reciprocal of ‘litres per kilometre’ is ‘kilometres per litre’.
Example:
reci(#column1#) AS <<NewReciprocalColumn>>
The reciprocal transformation is defined as the transformation of x to 1/x.
(Image: converting MilesPerGallon to GallonsPerMile using the reciprocal transformation)
rint( ): The rint function returns a floating-point value that represents the nearest integer to x. Halfway values are rounded according to the current setting of the floating-point rounding mode.
running_total( ): This function returns a sequence of partial sums of given column which means displaying the summation of the data as it grows over time.
Syntax:
running_total(#column#) as <<Newcol_total>>
New column name: Newcol_total
Note
The running_total function is only applied for numerical columns.
Example:
running_total(#Age#) as <<total_age>>
The above example returns the running total of the Age column.
Sha1( ): The function calculates an SHA-1 160-bit checksum for a string. The function returns a value as a binary string of 40 hex digits. In case the string supplied as the argument is NULL, the function returns NULL.
Syntax:
sha1(#Column#) as <<NewSha1Column>>
New column name: NewSha1Column
Example:
sha1(#Sedge#) as <<NewSha1Column>>
The function will return fca21fee4d117ec70a95dbee415c01e5c6e86df3
Sha2( ): The function uses the SHA2 cryptographic hash function to convert a variable-length string into a character string. The character string is a text representation of the hexadecimal value of the checksum with the specified number of bits
Syntax:
sha2(#Column#, bitLength) as <<NewSha2Column>>
New column name: NewSha2Column
Note
String - A variable-length string
Integer - The number of bits in the hash functions. Valid values are 0 (same as 256), 224, 256, 384, and 512
Sign( ): If the value in the column is greater than 0 the function returns 1, if the value is less than 0 it returns -1, and if the value is 0 it returns 0.
Syntax:
Sign(#MPG#) as <<NewColSign>>
New column name: NewColSign
Note
A Applies to Numerical data type and the new column which is created is Numerical type.
Example:
Sign(#temperature#) as <<NewColSign>>
If the value in the column temperature is 36, the new column (NewColSign) will have the value 1; if the value is -25, the new column will have the value -1; and if the value is 0, the new column will have the value 0.
step_delete( ): Deletes single characters at a regular step interval between a start position and an end position in the field.
Syntax:
step_delete(#Occupation#,1,10,2) as <<NewColSkipDelete>>
New column name: NewColSkipDelete
Note
A Applies to Categorical and text data type and the resulting column is categorical data type.
B Does not apply to numerical data type.
Example:
step_delete(#Occupation#,1,10,2) as <<NewColSkipDelete>>
In the above example, data in the field Occupation is selected. The step delete is applied from the 1st character to the 10th character, skipping 2 characters between deletions.
For example, if the word is ‘Life is beautiful’, the function works as shown in the image below.
standardize( ): Data standardization is the process of rescaling the data from the given numerical column to have a mean of 0 and a standard deviation of 1 (unit variance). The function creates a new standardized column
Syntax:
standardize(#Column#) as <<NewStandColumn>>
New column name: NewStandColumn
Note
The standardize function is only applied to Numerical datatype columns
Example:
standardize(#age#) as <<NewStandColumn>>
Returns the standardized column for given numerical column
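Standardization is the usual z-score transform. As a minimal sketch outside SEDGE (assuming the population standard deviation is used):

import numpy as np

# Minimal z-score sketch: rescale to mean 0 and standard deviation 1.
def standardize(values):
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()

ages = [22, 35, 58, 26, 40]
print(np.round(standardize(ages), 2))   # the result has mean ~0 and std ~1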
To_timestamp( ): The function converts a character datatype to a value of TIMESTAMP datatype. The Timestamp_format_pattern specifies the format of the character value.
Syntax:
to_timestamp(#Column#, 'Timestamp_format_pattern') as <<NewToTimeColumn>>
To_unix_timestamp( ): The function returns the UNIX timestamp of the given time column. The unix time stamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch on January 1st, 1970 at UTC
Syntax:
to_unix_timestamp(#Column#, 'Date/time_format_pattern') as <<NewUnixColumn>>
unix_timestamp( ): The function returns the UNIX timestamp of current time. The unix time stamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch on January 1st, 1970 at UTC
Syntax:
unix_timestamp() as <<NewUnixColumn>>
New column name: NewUnixColumn
Example:
unix_timestamp() as <<NewUnixColumn>>
Above example performed on Mon Jun 07 2021 14:16:31 GMT will return Unix Timestamp 1623075391
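The same conversion can be reproduced outside SEDGE; a minimal Python sketch (the timestamp below matches the example above):

from datetime import datetime, timezone

# Seconds elapsed since the Unix Epoch (1970-01-01 00:00:00 UTC).
dt = datetime(2021, 6, 7, 14, 16, 31, tzinfo=timezone.utc)
print(int(dt.timestamp()))   # 1623075391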
unit_converter( ): Converts a number from one unit to another. The available units for conversion are listed below.
Syntax:
unit_converter(#temperature#, 'from_unit', 'to_unit') AS <<Newconvertedcolumn>>
New column name: Newconvertedcolumn
Note
The from_unit and the to_unit refer to the unit, such as ‘C’ for Celsius or ‘F’ for Fahrenheit.
The UNIT_CONVERTER value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
The user can enter the from_unit and the to_unit in upper case or lower case.
weekend_count( ): Returns the number of weekend days between two given date columns, based on the country.
Syntax:
weekend_count(#column1#,#column2#, 'countrycode') as <<Newcol_weekendcount>>
New column name: Newcol_weekendcount
For the list of country codes, click on the link below: List of country codes
Note
The weekend_count function is only applied to date columns.
The Countrycode field should contain valid codes.
Example:
weekend_count(#DOB#, #Date of Hire#, 'US') as <<weekendcount>>
The above example returns the number of weekend days between the two given date columns, based on the country (US).
workday_count( ): Counts the number of workdays between two given date columns, based on the country.
Syntax:
workday_count(#column1#,#column2#, 'countrycode') as <<Newcol_workdaycount>>
New column name: Newcol_workdaycount
For the list of country codes, click on the link below: List of country codes
Note
The workday_count function is only applied to date columns.
The Countrycode field should contain valid codes.
Example:
workday_count(#DOB#, #Date of Hire#, 'US') as <<workdaycount>>
The above example returns the number of workdays between the two given date columns, based on the country (US).
Cluster analysis is a technique to group similar observations into a number of clusters based on the observed values of several variables for each individual. It is an exploratory data analysis technique that allows us to analyze the multivariate data sets.
Bisecting K-Means is a hybrid approach of the K-Means algorithm to produce partitional/hierarchical clustering. It splits one cluster into two sub-clusters at each bisecting step (using k-means) until k clusters are obtained.
The Gaussian mixture model is defined as a clustering algorithm that is used to discover the underlying groups of data. Gaussian mixture models have a higher chance of finding the right number of clusters in the data compared to k-means.
Elbow Method is an empirical method to find the optimal number of clusters for a dataset. In this method, we pick a range of candidate values of k, then apply K-Means clustering using each of the values of k. Find the average distance of each point in a cluster to its centroid, and represent it in a plot. Pick the value of k, where the average distance falls suddenly.
The silhouette method is also used to find the optimal number of clusters and to interpret and validate the consistency within clusters of data. It computes a silhouette coefficient for each point, measuring how similar a point is to its own cluster compared to other clusters, and provides a succinct graphical representation of how well each object has been classified.
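As an illustration of both methods outside SEDGE, here is a minimal sketch using scikit-learn on made-up data (SEDGE performs the equivalent steps internally):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up 2-D data with three obvious groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 4, 8)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: the inertia (within-cluster sum of squares) drops sharply up to the true k.
    # Silhouette method: the silhouette score peaks for the best-separated clustering.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 2))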
Graphs are used to display large amounts of numerical data and to represent the relationships between numerical values of different variables. They can also be used to derive quantitative relationships between variables.
We have chosen age as an example for our bar chart. We can point the mouse at any part of the bar and the system will tell us which point we are currently standing on - the X axis tells us the location of the mouse and the Y axis tells us the total count in the bar. We can also zoom in or out using the grey tool in the upper part of the graph (highlighted in green).
The “Group by” tool and the “Aggregation” tool work with categorical values.
Note
We can also select the “Use Distinct” tool - which represents percentage of each category in the pie irrespective of the count in each category.
The stack bar chart extends the standard bar chart from looking at numeric values across one categorical variable to two, where each bar is divided into a number of sub-bars.
In this Stack Bar chart we have selected two factors - “port of embarkment” and “survived”. We can see that each port of embarkment - 0, 1 and 2 - is divided into two colours - 1 showing those who survived and 0 showing the opposite. By pointing the mouse at any section of the bar we can see the total number of cases in the group.
A Box Plot graph depicts groups of numerical data through their quartiles. Box plots also have lines extending from the boxes (whiskers), indicating variability outside the upper and lower quartiles; these are the outlier values.
We used age as an example; we can see that the age of passengers was mostly between 22 and 35 years, with other values represented as outliers. By pointing at the graph we can also see the Open, Low, High and Close values.
Pie charts are beneficial in expressing data and information in terms of proportions. A pie chart is most effective when dealing with the collection of data. It is constructed by clustering all the required data into a circular constructed frame wherein the data are depicted in slices.
For our example we used the number of meals and we can see the number was between 1 and 9, each part of the pie is also represented by the percentage and after pointing at each section we can also see the total number of cases
A bubble chart is a type of chart that displays three dimensions of data. Each entity with its triplet of associated values is plotted as a disk that expresses two of the values through the disk’s xy location and the third through its size.
We used Age and Fare for our example and we can now see how the values are spread in the graph.
We can also add another factor to our chart, which represents the “Size” - we selected “Fare”, and we can see that the bubble size corresponds to the increasing fare.
We have another feature that can help us represent the values effectively - we can add another factor and represent it in colour, in our case we have added “Embarkment point” so we can see the results also differentiated by the point of embarkment
The depiction addresses each numerical variable’s correlation against all other variables. The bigger the connection, the better the correlation between the variables.
Scatter graph uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.
For our demonstration we have used three different numerical variables - Age as the target on X axis and Fare and Total meal cost on the Y axis. We can even add more numerical variables and all will be displayed in one chart.
Parallel coordinates are a common way of visualising and analysing high-dimensional datasets. It is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order.
In multivariate statistics and probability theory, the scatter matrix is a statistic that is used to make estimates of the covariance matrix, for instance of the multivariate normal distribution.
On the X axis we need to use a date variable, in our case “Ticket purchase” and on the Y axis we used “Survived”, for aggregation we have selected Average value (we have also option of Count, Sum, Average, Minimum, Maximum, Median, Standard Deviation and Variance). From our graph we can see, that everybody who purchased the ticket on April 1st survived.
Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as outlier detection, it’s an important step in data analysis, as it removes erroneous or inaccurate observations which might otherwise skew conclusions.
Deviation analysis measures the absolute difference between any one number in a set and the mean of the set. It can be used to determine how a specification will behave in the face of such deviations.
Decision tree analysis is the process of drawing a decision tree, which is a graphic representation of various alternative solutions that are available to solve a given problem, in order to determine the most effective courses of action. Decision trees are comprised of nodes and branches. Nodes represent a test on an attribute and branches represent potential alternative outcomes.
After completing our data manipulation and graphical analysis, we can move to the last part, which is the Predictive Analytics. In the “Statistics” part, you need to select all the columns that are important for your research and ignore those that have little or no relation to the target. From the beginning, our target for the research is the “Survived” field. A very important step in this research was to convert the field “Survived” from “Numerical” to “Categorical” type, as keeping it as “Numerical” would result in very low accuracy.
Balancing
In the following step we can choose the type of balancing - “None”, “Down sampling”, “SMOTE” or “Up sampling” - in our example we selected “None”. Sampling is used to resolve class imbalance, because the imbalance in the observed classes might negatively impact our model. Machine learning algorithms have trouble learning when one class dominates the other, so balancing the data helps us avoid these problems; the options are listed below, followed by a short illustrative sketch.
None - no sampling of data performed, data stays as it is the dataset.
Down Sampling - the system will randomly subset all the classes in the data set so that their class frequencies match the least common class
SMOTE - Synthetic Minority Oversampling Technique - this technique synthesises new minority instances between already existing minority instances
Up Sampling - the system will randomly sample and replace the minority class to be the same size as the majority class
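A minimal sketch of these resampling options outside SEDGE, using scikit-learn and the imbalanced-learn package (the dataset below is a placeholder):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Placeholder imbalanced dataset (roughly 90% / 10% classes).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_down, y_down = RandomUnderSampler(random_state=0).fit_resample(X, y)   # down sampling
X_up, y_up = RandomOverSampler(random_state=0).fit_resample(X, y)        # up sampling
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)              # synthetic minority samples
print(len(y), len(y_down), len(y_up), len(y_smote))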
The next step is to select which algorithms should be used, the ratio (the default is 80:20, meaning 80% of the sample is used for training and 20% for cross validation) and the cross validation number (how many groups the cross validation sample should be split into). Then click on the “Start learning” button and the system will run these algorithms.
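Outside SEDGE, the same 80:20 split and k-fold cross validation can be sketched with scikit-learn (the dataset and the model below are placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data

# 80% of the sample for training, 20% held out for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
# 5-fold cross validation on the training portion.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())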
Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes, and leaves.
The random forest is a classification algorithm consisting of many decisions trees. It is an ensemble method that is better than a single decision tree as it reduces over-fitting by averaging the result. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by the committee is more accurate than that of any individual tree.
eXtreme Gradient Boosting, or XGBoost, is another popular boosting algorithm. In fact, XGBoost is simply an improved version of the GBM algorithm. The working procedure of XGBoost is the same as GBM: the trees in XGBoost are built sequentially. XGBoost also includes a variety of regularization techniques that reduce overfitting and improve overall performance.
The Naive Bayes is a classification algorithm that is suitable for binary and multiclass classification. Naïve Bayes performs well in cases of categorical input variables compared to numerical variables. It is useful for making predictions and forecasting data based on historical results.
KNN is a non-parametric method used for classification. It is one of the best-known classification algorithms in that known data are arranged in a space defined by the selected features. When new data is supplied to the algorithm, the algorithm will compare the classes of the k closest data to determine the class of the new data.
Support vector machine algorithm helps to find a hyperplane in N-dimensional space(N — the number of features) that distinctly classifies the data points.
A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers.
Light Gradient Boosted Machine, or LightGBM for short, is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm. It can be used for datasets with more than 10,000 rows, although there is no fixed threshold for deciding when to use LightGBM. It is suitable for large volumes of data, especially when one needs to achieve high accuracy.
We can see the status of each algorithm; some of these are more complicated and will take more time. After all reach 100% completion, the algorithms are sorted by their accuracy.
In our case, we can see 7 algorithms were used.
XGBOOST - decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework - with this algorithm we got an accuracy of 85%.
Decision tree
GBM - Gradient boosting
Random forest
Logistic regression
Naive Bayes
KNN - K nearest neighbors
We can also see further details of the algorithm - after clicking on the “Metrices” button.
Classification report is used to measure the quality of predictions from a classification algorithm. The metrics are calculated by using true and false positives, true and false negatives.
Confusion matrix, also known as an error matrix, is a specific table layout that allows visualisation of the performance of an algorithm. It shows the “True Positives”, “True Negatives”, “False positives” and “False Negatives”.
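For reference, the same two summaries can be produced outside SEDGE with scikit-learn (the label vectors below are made up):

from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and model predictions for a binary target (e.g. Survived).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))        # rows: actual class, columns: predicted class
print(classification_report(y_true, y_pred))   # precision, recall and f1-score per class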
Cumulative Gain charts are used for visually measuring the effectiveness of a classification model and are a very popular metric in predictive analytics. Gain charts measure the performance for a proportion of the population data, calculated as the ratio between the results obtained without the model (the random line) and with the model (the model line). The greater the area between these two lines, the better the model. The Y-axis shows the target data and the X-axis shows the percentage of the population.
In the Gain Chart image, the black line is the random line, and the green and orange lines are the results based on the model for the two classes. From the figure above it can be seen that the orange class model has performed better than the green one, and both lines have performed better than the random line (black).
Lift charts are used for visually measuring the effectiveness of a classification model and are also a popular metric in predictive analytics.
The Lift Chart shows the actual lift. The lift curve is determined by measuring the ratio between the result predicted by the model and the result using no model. A lift chart graphically represents the improvement that a model offers when compared against a random guess, and it measures the change in terms of a lift score. The best model can be determined by comparing lift scores. A lift chart shows the point at which the model’s predictions are strong and when they become weak (less useful).
On the Y axis we can see the True Positive Rate - samples that were evaluated correctly; on the X axis we can see the False Positive Rate - samples evaluated incorrectly. When we plot each of the results, we get the ROC curve. We evaluate the AUC - the area under the ROC curve: if the value is near zero, the selected model is not classifying correctly; if the value is near 1, the selected model is classifying correctly. In our case the number is 0.92.
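A minimal sketch of computing the ROC curve and the AUC outside SEDGE with scikit-learn (the labels and scores below are made up):

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under the curve; closer to 1 is better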
The SEDGE application now allows data analysts to review and compare models by MAPE (mean absolute percentage error), RMSE (root mean square error), MSE (mean squared error), MAE (mean absolute error), and MLCH. This allows the analyst to save and test models for greatest accuracy.
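The same error metrics (apart from MLCH) can be computed outside SEDGE with recent versions of scikit-learn; a minimal sketch on made-up values:

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

# Made-up actual and predicted values from a regression model.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)            # MAE
mse = mean_squared_error(y_true, y_pred)             # MSE
rmse = np.sqrt(mse)                                  # RMSE
mape = mean_absolute_percentage_error(y_true, y_pred)  # MAPE
print(mae, mse, rmse, mape)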
SHAP or Shapley Additive exPlanations is a visualization technique that can be used for making a machine learning model more explainable by visualizing its output. It can be used for explaining the prediction of any model by computing the contribution of each feature to the prediction. If the SHAP value is much closer to zero, we can say that the data point contributes very little to predictions. If the SHAP value is a strong positive or strong negative value, we can say that the data point greatly contributes to predicting the positive or negative class.
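Outside SEDGE, SHAP values for a tree-based model can be computed with the shap library; a minimal sketch (the model and data below are placeholders):

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data and model; SEDGE computes the equivalent internally.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # SHAP explainer for tree-based models
shap_values = explainer.shap_values(X)    # per-feature contribution to each prediction
shap.summary_plot(shap_values, X)         # visualise which features drive the predictions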