Data analytics

_images/Data_Analytics.png

Click on the Data analytics icon to open the project creation screen, where users can create a project.

_images/data_analysis.png

Data Analysis

  • A - For project creation, click on “Create project” which will redirect you to your data sources (screen number 4)

  • B - Here you can see a list of all your current projects

  • C - By clicking on the tick box next to your project, you can select the project

  • D - Delete the selected project

_images/projects.png
Other possible actions with the selected project:
  • E - Load the data to perform analysis

  • F - Delete a file from the project

  • G - Download the data to the local system as a CSV file

Data sources

Connecting to the file system

_images/data_sources.png

Data Sources- Screen number 4.1

  • A - For uploading a file from your computer, select file upload

  • B - Select file that you are planning to upload

  • C - Select the file type - CSV, JSON, Parquet, Avro, S3 or HDFS

_images/file_upload.png

file upload

  • D - Select your delimiters – Tab, Semicolon, Comma, Space, Other

  • E - After completing previous steps click on Add files

Connecting to Database

_images/conn_database.png

Data Sources- Screen number 4.2

  • A - For uploading a file from a database, select DB connection

  • B - If you wish to upload a new file, select New Connection tab

  • C - If you already have a connection, select Existing Connection tab

_images/exist_dbconnection.png
  • D - Within new connection, select the database type from the following options

_images/connection.png
  • E - Fill in your username and password

DB Schemas and Access

This section is about uploading data into SEDGE from different types of databases. SEDGE can connect to databases such as MySQL, MariaDB, Oracle, and PostgreSQL.

Future releases of SEDGE will also cover other databases such as Cassandra, SQLite, Presto, Redshift, and Redis. If you need assistance connecting to these databases, contact our support team, who can enable connectivity to them as well.

Accessing Database

_images/DB_Schemas_and_Access.png

DB Schemas and Access

  • A - For uploading a file from a database, select DB connection

  • B - If you wish to upload a new file, select New Connection tab

  • C - If you already have a connection, select Existing Connection tab

Connecting to DB

_images/existing_connection_blank.png

existing connection 1

  • D - Within new connection, select the database type from the following options. We demonstrate our example on the MySQL database

_images/Database_type.png

database type

  • E - Fill in the class name of your database

  • F - Fill in the host name assigned to your database

  • G - Fill in the port number

  • H - Fill in the username

  • I - Fill in the password

  • J - Proceed with the connect button


Selecting existing DB Connections

_images/save_connection.png

save connection

After filling in the above table, a new window appears:
  • A - Fill in the Database and Connection Name

  • B - Save connection (this will then appear in the Existing Connections tab)

  • C - Or proceed directly to SQL Editor

_images/existing_connections.png

existing connections 2

Your project was successfully uploaded and is among the Existing Connections
  • A - You can open the project by clicking on the Connect button and you will be redirected to the SQL editor

  • B - You can erase the project by clicking on the Bin button


Working with SQL editor

_images/SQL_editor.png

SQL editor

  • A - In the SQL editor, your datasets appear under the Entities box. For our demonstration we have uploaded three medical-related datasets

  • B - In the SQL Query generator box, you can write your commands, for example to open one of your datasets or to join tables

  • C - After finishing your query, click on the Execute button to proceed to the preview


Writing SQL query in Editor

_images/SQL_editor_query_1.png

SQL editor query

  • A - After clicking on one of the datasets, the query generator automatically selects this file

  • B - After selecting Execute button, we can see the preview of the data

  • C - Data preview (top 100 rows)

  • D - If we want to proceed with this dataset, we continue by clicking on the Save button

Joining multiple tables in SQL query Editor

_images/SQL_join.png

SQL join

SQL Join query example

We can use the SQL Query Generator in many ways, for example to combine records from two or more tables. There are four types of joins: Inner, Full, Left, and Right. In our example, we perform a left join, which keeps every record from the first (left-most) table together with the matching records from the right table. Here we left-join the tables “cancer diagnostic” and “diabetes” using the patient ID as the join condition. The result is one final table that combines the information for each patient with the same ID.
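For readers who want to reproduce the join outside SEDGE, the same left join can be sketched in plain SQL, here run through Python's sqlite3 module. The table and column names are hypothetical stand-ins for the datasets shown in the screenshot:

```python
import sqlite3

# Hypothetical tables mirroring the example; real column names may differ.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cancer_diagnostic (patient_id INTEGER, diagnosis TEXT)")
cur.execute("CREATE TABLE diabetes (patient_id INTEGER, glucose REAL)")
cur.executemany("INSERT INTO cancer_diagnostic VALUES (?, ?)",
                [(1, "benign"), (2, "malignant")])
cur.executemany("INSERT INTO diabetes VALUES (?, ?)", [(1, 148.0)])

# LEFT JOIN keeps every row of the left table; unmatched right rows become NULL.
rows = cur.execute("""
    SELECT c.patient_id, c.diagnosis, d.glucose
    FROM cancer_diagnostic AS c
    LEFT JOIN diabetes AS d ON c.patient_id = d.patient_id
    ORDER BY c.patient_id
""").fetchall()
print(rows)  # [(1, 'benign', 148.0), (2, 'malignant', None)]
```

Patient 2 has no diabetes record, so the joined glucose value is NULL, exactly as a left join behaves in the SQL editor.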

ETL Process


File upload

_images/file_upload2.png

upload file - Screen no 5

  • A - After adding your files, which might take a while depending on the file size, click on the Upload button; you will be redirected to the data preview page.

  • B - Users can also upload a zipped CSV file, which makes it possible to upload large CSV files.

  • C - Users can enter the name of the project (optional). If no name is entered, the system takes the file name and assigns a default project number.

Multiple upload

It is also possible to upload multiple files simultaneously; SEDGE allows uploading multiple files in one upload form.

_images/Multipleuploads.png

Multipleupload

File collaboration

File collaboration in SEDGE facilitates sending and receiving files and collaboration among users. This functionality helps users collaborate and exchange opinions with other SEDGE users. It offers a wide array of advantages, especially for users in different geographic locations and time zones. It takes only a single click to send a file and grant access to another user.

_images/Filecollab.png

Filecollaboration

Data preview

_images/data_preview.png

Data Preview - Screen no 6

For our demonstration, we have chosen the “Titanic” data. In the following steps, we will forecast, based on the available data, the chance of survival aboard the Titanic. In this case, the factors taken into consideration are: class, name, sex, age, number of family members, parch, ticket number, fare, cabin, and port of embarkation. Some of these factors might not be relevant for the predictive analysis, so we will also demonstrate how to erase or ignore some of the data.

Note

This page is just a preview of the full data; only a small sample is displayed here to give a first impression.

  • A - This row displays all the variables in the current analysis. You can sort the data by clicking on the arrows next to the respective variable

  • B - This row identifies if the data in a column are categorical or numerical (C or N)

  • C - Monitor in which phase of the analysis you are currently located – identified by the green colour

  • D - Information about the total number of rows and columns in the uploaded file

  • E - Action buttons to move either to the next step or return to the previous step

Profile Dataset

_images/profile_dataset.png

Profile Dataset - Screen no 7

  • A - Dataset info – basic information about the respective data set like number or variables, number of observations/rows, percentage of missing data, warnings

  • B - List of all columns/variables in the data set

  • C - Information if the variable is categorical or numerical

  • D - How many unique values are there within each variable (sex – male or female – 2 possible values, name – unlimited number of values)

  • E - Total number of missing values within the respective variable

  • F - Number of missing values represented in percentage

  • G - Mean – the average value of the respective numerical variable, found by adding all values in the data set and dividing by the number of values

  • H - Median – the middle value in the data set – value that separates the higher half of the data sample from the lower half

  • I - SD – standard deviation represents how close or far the values are dispersed from the mean

  • J - In this tab, you have to select the target you are going to be predicting. It is one of the variables that are already part of the dataset. In our case, we are going to be predicting if the Titanic passenger did or did not survive
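The mean, median, and standard deviation shown in the profile can be reproduced with Python's standard library; the sample values below are hypothetical:

```python
import statistics

ages = [22, 38, 26, 35, 35]  # hypothetical "age" column sample

mean = statistics.mean(ages)      # sum of values divided by their count
median = statistics.median(ages)  # middle value of the sorted data
sd = statistics.pstdev(ages)      # population standard deviation; SEDGE's
                                  # exact variant (population vs sample) is
                                  # an assumption here

print(mean, median)  # 31.2 35
```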

_images/feature_list.png

feature list

  • K - Sampling of data – you can base your research only on a limited part of the dataset (for example when the dataset is extremely large).

  • If you wish to sample your data, after clicking yes, you have to select the type

  • In the next step select the size type – percentage, absolute value or by the calculator

_images/sampling.png

sampling

  • Select the percentage of the sample you are going to base your analytics on

_images/sampling_percent.png

sampling percent

  • Select the variable

_images/variable_selection.png

variable selection

  • After selecting your preferences click on Apply button

  • L - GDPR - Data pseudonymization

  • Depending on the nature of your data, you can choose whether you wish to protect personal data. If you select Yes, the next step takes you to the GDPR page, where you select which data to hide. If you do not need to hide any data, select the No button and the GDPR page will be skipped

_images/data_protection.png

data protection

_images/GDPR.png

GDPR

  • After filling in the GDPR form, you can select whether to run Automatic Machine Learning or Custom Machine Learning

  • M - Machine learning (ML)

  • If in the previous step you have selected “Yes” for GDPR protection, you will find the ML option under the GDPR form. If you selected “No” you can select this option under this tab

_images/machine_learning.png

Machine learning

Statistics

Statistics is the branch of mathematics concerned with collecting, analyzing, interpreting, presenting, and organizing data. It encompasses a wide range of techniques and methodologies used to make sense of data, including summarizing data through measures based on samples.

On the statistics page, there are several top functions available, which offer:

  • Numerical Operations

  • Strings

  • Dates

  • Maths

  • Text

  • Data

  • Advanced

  • Boolean

  • Frequently used

To navigate to the Statistics page from the profiling page, click on the next button or click the statistics icon.

Main Screen

_images/mainscreen.png

Main Screen

A: In the first row, all the possible data operations are displayed – sorted into eight different categories: Numerical, String, Dates, Maths, Text, Data, Advanced, Boolean, and Frequently used. Each tab will be discussed separately on the following pages.

B: The first window in the statistics display showcases the fundamental statistics of the dataset, with the target highlighted in green, as indicated by the legend.

C: Information we can get in this window is:

  • Column name: Shows the columns or variables that are present in the dataset.

  • Value type: Numerical, Categorical, Date, Text.

  • Unique Values: Number of unique values in the variable.

  • Mean: The average value of the variable.

  • Median: The middle value of the variable.

  • Standard Deviation: SD - Measures how the data is spread around the mean.

  • Missing Values: Number of missing values of the variable in the total count.

  • % of Missing Values: The number of missing values in the dataset, represented as a percentage. Upon clicking a column name, the other charts dynamically adjust based on the selection. To perform operations on a specific column, simply click the checkbox corresponding to that column and proceed with the functions.

D: Graphstat - The graphical display illustrates our chosen value – in this case, Sales is selected for demonstration purposes. In this tab, the distribution of sales can be observed. Additionally, options are available to view the display in full-screen mode, download the graphs in different formats, print the chart, or utilize data binning to divide the dataset into appropriate bins (which refers to grouping data into ranges of values). Furthermore, if a Scatter Plot is required, the option can be enabled. This will plot for the target variable and the selected variable.

_images/bins.png

Bins

_images/options.png

E: Transformations - Under this category, various available transformations can be applied to modify the data as needed. To perform transformations, refer to the syntax provided, and you can also clear, test, and apply the syntax. Additionally, view the Expression History.

F: Preview - Displays preview of the whole dataset, top 50 rows

G: Advanced statistics - In this tab, advanced statistics for the selected row are displayed; in this case, the ‘Sales’ row is selected.

H: In this line, corresponding information from the advanced statistics is displayed: Minimum value, 5th percentile, First Quartile (25th percentile), Third Quartile (75th percentile), Interquartile Range (IQR), 95th percentile, Maximum value, Variance (measure of data dispersion), Skewness (measure of data symmetry), Kurtosis (measure of data peakiness or tails), Outliers (data points significantly distant from the rest), and the option to delete the Outliers.
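The quartile, IQR, and outlier figures in this line can be sketched with the standard library alone. The data is hypothetical, and SEDGE's exact percentile method may differ from the one used here (the common 1.5 × IQR fence rule is assumed):

```python
import statistics

sales = [12, 15, 14, 13, 90, 16, 14, 15, 13, 12]  # hypothetical column, one outlier

q1, q2, q3 = statistics.quantiles(sales, n=4)  # quartiles (exclusive method)
iqr = q3 - q1                                  # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers: points significantly distant from the rest of the data.
outliers = [x for x in sales if x < lower_fence or x > upper_fence]
print(outliers)  # the value 90 lies far beyond the upper fence
```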

J: Variable of Importance - In this window, all the variables of the dataset are visible, sorted by their importance towards the selected target. In this case, Sales is selected as the most important factor, although it may not be accurate. The variable of importance can be refreshed using this icon (M). Additionally, a Chart Context Menu is available where you can view in full screen, print the chart, download the image as PNG, JPEG, PDF, or SVG, and download the data as CSV or XLS.

Advanced Statistics

Box Plot will display for Numerical variable. Hovering over the chart the following information will be displayed:

Upper Fence - The maximum limit beyond which data points are considered outliers.

Upper Quartile - The median of the upper half of the dataset, separating the highest 25% of values from the rest.

Median - The middle value of the dataset when arranged in ascending order.

Lower Quartile - The median of the lower half of the dataset, separating the lowest 25% of values from the rest.

Lower Fence - The minimum limit beyond which data points are considered outliers.

_images/ADboxplot.png

When hovering over outliers, the outlier observations will be displayed.

_images/ADbpoutliers.png

Data Table

For Categorical and Date variables, the Data Table will be displayed.

_images/ADcat.png
_images/ADdate.png

Statistics- Numerical Operations

_images/Numericaloperations.png

Numerical Operations

Absolute Value

The operation converts a numerical value into its absolute value, exclusively accepting numerical inputs. For instance, temperature can be transformed into its absolute value to determine its distance from zero. Upon applying an operation, a successful validation message appears in the top-right corner. A newly generated column is displayed at the bottom of the preview. After a column is added, the variable of importance may change.

_images/absoperation.png

image showing message notification after ABS function (absolute)

Ceiling

Returns the closest integer greater than or equal to a given number, used for getting the nearest integer up

Floor

Returns the closest integer less than or equal to a given number, used for getting the nearest integer down

Unit converter

Simplifies converting measurements between different units, for example converting inches to centimeters or pounds to kilograms.

Numerical conversion to hexadecimal form

Convert from a standard numerical format to the hexadecimal system, which uses 16 symbols compared to the 10 symbols used in the decimal system

Rounding of values

Round numbers to a specified number of digits, such as rounding housing measurements to 2 digits. This function should allow for manual or automatic selection of the column name and if required, enable the user to manually choose the name of the new column.

_images/roundoff.png

Round
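The numerical operations above map directly onto Python built-ins, sketched here for reference:

```python
import math

print(abs(-3.7))         # 3.7  (absolute value: distance from zero)
print(math.ceil(3.2))    # 4    (nearest integer up)
print(math.floor(3.8))   # 3    (nearest integer down)
print(round(2.34567, 2)) # 2.35 (round to 2 digits)
print(hex(255))          # 0xff (decimal to hexadecimal)
```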

Num –> Cat

Converting a numerical column to a categorical format. In case of missing values, they can be managed by either deletion or imputation, techniques elaborated upon in the forthcoming “Data” chapter.

Upon applying this function, the successful validation will appear.

_images/Numtocat.png

Num format conversion

Helps in generating the numeric text of values in different languages. The value will be displayed in words in the chosen language.

_images/Numformat.png

Num format conversion

Note

Numerical operations can be performed only on Numerical variables.

Statistics- String

_images/string.png

String

Substring

A Substring is a contiguous sequence of characters within a larger string. It is a portion of a string that is extracted or considered separately based on its starting position and length.

Steps to Perform
  • Select a column, then click on ‘Substring’

  • Choose position (Right, Middle, Left, Exclude String, or Split by Character/Pattern) based on your requirement.

  • Define length and choose ‘Auto’ or ‘User Defined’

  • Click on ‘Apply’

Middle

For the string “houstan”, the selected position is “Middle” with a start index 1 and length of 5. This operation will return the middle portion. For the middle position it is required to select the start index.

The output will be: houst.

_images/substring_middle.png

Example for substring for the position - Middle

Left

For the string “New York”, the selected position is “Left” with a length of 3. This operation will return the leftmost 3 characters.

The output will be: New.

_images/substringleft.png

Example for substring for the position - Left

Exclude String

For the string “New York”, the selected position is “Exclude String” with a length of 3 and a start index of 1. This operation will exclude the specified characters from the string.

The output will be: York.

_images/substringexclude.png

Example for substring for the position - Exclude String

Split by character/pattern

For the string “Los Angels” and the selected character “s” with the position “after”, the operation will split the string after each occurrence of the character “s”.

The output will be: Angel.

_images/substringsplit.png

Example for substring for the position - Split by Character/Pattern

Note

For the position “Middle” and “Exclude String” it is required to select the start index.

For the position “Split by character/pattern” it is required to choose either “Before” or “After”.
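Assuming SEDGE's substring indices are 1-based (as the screenshots suggest), the examples above can be mirrored with Python slicing:

```python
s = "houstan"  # sample string from the screenshots

# Middle: start index 1, length 5 -> "houst"
middle = s[0:5]

city = "New York"
left = city[:3]                              # Left, length 3 -> "New"
exclude = city[3:].strip()                   # Exclude String -> "York"

# Split by character "s" with position "after": keep the part after the split.
# This is an approximation of SEDGE's behavior.
after = "Los Angels".split("s")[1].strip()   # -> "Angel"
```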

Replace

Replace one or more characters with another pattern/set of strings. Under ‘Pattern which exist,’ provide the pattern that needs to be changed. Then, under ‘Pattern to be replaced with,’ specify the pattern for replacement.

To replace “UK” with “United Kingdom,” provide the following:

_images/replaceimg.png
  • Pattern which exists: “UK”

  • Pattern to be replaced with: “United Kingdom”

The output will be: United Kingdom

_images/replace.png

Example for Replace

Replace range

Replace a set of characters based on index values. Specify the start and end indices of the substring to be replaced, along with the replacement text.

Example: Original word: “Chine”

To replace a range of characters from index 1 to index 5 with “India” provide the following,

  • Start index: 1

  • End index: 5

  • String to be replaced: “India”

_images/rr.png

After completing the replacement operation, the updated word will be: India

_images/replacerange.png

Example for ReplaceRange
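The Replace and Replace range operations above can be sketched in Python, again assuming 1-based indices:

```python
# Replace: swap one pattern for another.
replaced = "UK".replace("UK", "United Kingdom")

# Replace range: substitute 1-based indices 1..5 of "Chine" with "India".
word, start, end = "Chine", 1, 5
ranged = word[:start - 1] + "India" + word[end:]

print(replaced, ranged)  # United Kingdom India
```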

Note

After performing the operation a new column will appear at the bottom.

Delete range

Deletes a set of characters based on the index values. To delete a range, specify the start and end indices, then click “Apply”.

Example: Original String: Los Angels

To remove ‘Ang’ from “Los Angels”, specify the indices as follows:

  • Start index: 4

  • End index: 8

_images/dr.png

After applying this deletion, the updated word will be Losels.

_images/deleterange.png

Example for DeleteRange

Append

Append a single or multiple character in each class/row.

Example:

  • Original String: “Japan”

  • To append “ese” to the end of the word, simply add “ese” to it.

_images/ap.png

After appending, the updated word will be: Japanese.

_images/append.png

Example for Append

Insert

Select the index or position where insertion is desired, then specify the value to be inserted.

Example: Original string: “Phoenix”

  • Index to be inserted: 8

  • Inserted value: “City”

_images/ins.png

After insertion, the updated string will be: PhoenixCity.

_images/insert.png

Example for Insert

Repeat

Choose the number of times the selected column should be repeated.

Example:

  • Original string: “UK”

  • To repeat “UK” two times, simply specify the repeat count as 2.

After repeating, the updated string will be: UKUK.

_images/repeat.png

Example for Repeat

Pop

Pop removes and returns the element at the specified index.

Example:

  • Original string: “Chicago”

  • Remove ‘o’ from the string by specifying the index as 7.

_images/popimg.png

After the operation, the updated string will be: Chicag.

_images/pop.png

Example for Pop

Reverse item

Returns the string in reverse order.

Example:

  • Original string: “UK”

  • Reversed string: “KU”

_images/reverse.png

Example for Reverse
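The Append, Insert, Repeat, Pop, and Reverse examples above correspond to simple Python string operations:

```python
appended = "Japan" + "ese"          # Append "ese" -> "Japanese"

s = "Phoenix"
inserted = s[:7] + "City" + s[7:]   # Insert "City" at position 8 -> "PhoenixCity"

repeated = "UK" * 2                 # Repeat twice -> "UKUK"
popped = "Chicago"[:-1]             # Pop the character at index 7 -> "Chicag"
reversed_s = "UK"[::-1]             # Reverse -> "KU"
```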

Upper

Convert a column to upper case.

“New York” will be converted into “NEW YORK”.

_images/upper.png

Example for Upper

Lower

Convert a column to lower case

“New York” will be converted into “new york”.

_images/lower.png

Example for Lower

Find

To determine if an entry matches a searched string, compare each entry against the specified string. If a match is found, return the matched string. If no match is found, return False to indicate that the string is not present in the column.

_images/find.png

Example for Find

Length

Count the number of characters in the string.

Example:

  • String: “Chicago”

  • Length of the string: 7

_images/length.png

Example for Length

Trim

Trimming/erasing of leading & trailing spaces from the string.

_images/trim.png

Example for Trim

CountOccurence

Determines the number of times a specific substring appears.

Steps to perform
  • Select Substring to specify a substring pattern.

  • Select Regex to provide a regular expression pattern.

_images/co.png

Example:

  • String: “Japan”

  • Substring pattern: “a”

_images/cos.png

The output will be 2.

_images/countoccurence.png

Example for Count occurence
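Trim, Length, Find, and CountOccurence behave like the following Python one-liners:

```python
trimmed = "  Chicago  ".strip()  # Trim leading & trailing spaces -> "Chicago"
length = len(trimmed)            # Length -> 7
found = "cag" in trimmed         # Find -> True when the searched string is present
count = "Japan".count("a")       # CountOccurence of substring "a" -> 2
```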

Extract

To extract specific information from a string.

Select the Method such as: First word, Last word, Numbers extract.

_images/ext.png

First word:

  • Original String: New York

  • The output will be: New

_images/extract%28first%29.png

Example for Extract - First Word

Last word:

  • Original String: New York

  • The output will be: York

_images/extract%28last%29.png

Example for Extract - Last Word

Numbers extract:

  • Original String: Hello, I am 22 years Old now turning to 23

  • The output will be: 2223

_images/numextract.png

Example for Numbers Extract
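The three Extract methods above can be reproduced with splitting and a regular expression:

```python
import re

s = "New York"
first_word = s.split()[0]   # First word -> "New"
last_word = s.split()[-1]   # Last word -> "York"

text = "Hello, I am 22 years Old now turning to 23"
# Numbers extract: concatenate every run of digits found in the text.
numbers = "".join(re.findall(r"\d+", text))  # -> "2223"
```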

Split

Splitting is the act of partitioning a string value into its component parts. Below are the methods of splitting.

  • HTTP string

  • Email

  • Invalid

  • URL

_images/split1.png

Split

Split Http string

This processor splits the elements of an HTTP query string.

Split email

This processor splits an e-mail address into the local part and the domain.

Split invalid

This processor takes all values of a column that are invalid.

Split URL

This processor splits the elements of a URL into multiple columns.
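The email and URL splits above can be sketched with Python's standard library; the address and URL are hypothetical:

```python
from urllib.parse import urlsplit, parse_qs

# Split email: local part and domain.
local, domain = "alice@example.com".split("@")

# Split URL / HTTP query string into its elements.
parts = urlsplit("https://example.com/path?q=sedge&page=2")
fields = parse_qs(parts.query)  # query string -> {field: [values]}

print(parts.netloc, parts.path, fields)
```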

ExtractNum

To extract numerical values from text. This function is useful for isolating and retrieving numbers embedded within text, simplifying the process of data cleaning and preprocessing.

Cat -> Num

Convert categorical variables into numerical values only when the selected categorical column contains numbers exclusively. Upon applying this function, the successful validation will appear.

_images/cattonum.png

Note

String operations can be performed only on String variables.

Statistics- Dates

_images/Stat-dates.png

Dates Extraction

The Dates menu helps to strip out the day, month, year, etc., from a date column. These extracted components act as additional features for the model.

Example: Let's try to extract the Month from the “Date of Hire” column:

_images/extracteg.png

_images/extractoutput.png

Dates Pattern Conversion

This function is used to convert the date pattern into a different format that is more suitable for the research. After selecting the input and desired output patterns, click on the Apply button.

Example: Let's convert the date pattern from “yyyy-MM-dd HH:mm:ss” to “yyyy-MM-dd”:

_images/dpconvert.png

This will convert the existing column to the selected date pattern.

_images/dpconvertoutput.png
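The same conversion can be sketched in Python. Note that SEDGE's patterns follow the Java-style letters listed in the table below, while Python's strftime uses percent codes; "yyyy-MM-dd HH:mm:ss" maps to "%Y-%m-%d %H:%M:%S":

```python
from datetime import datetime

# Parse with the input pattern, then reformat with the output pattern.
raw = "1999-03-22 05:06:07"
converted = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d")
print(converted)  # 1999-03-22
```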

There are several types of date patterns and shortcuts that can be present in a date variable. You can design your own format patterns for dates and times using the symbols in the following table:

Date Format Examples

Date format   Description                                  Examples
EEEE          Day of the week                              Monday, Tuesday
EEE           Abbreviated day of the week                  Mon, Tue, Wed
MM            Month of the year in two-digit format        03, 07, 12
MMM           Abbreviated month of the year                Jan, Feb, Mar
MMMM          Month of the year                            January, February, March
dd            Day of the month                             01, 15, 30
D             Day in the year                              yyyy-DDD -> 1999-081
yy            Year in two-digit format                     99, 12, 20
yyyy          Year in four-digit format                    1999, 2012, 2020
YYYY          Week-based year                              2009; 09
HH            Hour of the day (0-23)                       00, 12, 23
hh            Clock hour in AM/PM (1-12) format            01, 05, 12
mm            Minute in hour                               00, 30, 59
ss            Second in minute                             00, 30, 59
SSS           Fraction of a second                         235, 512, 820
a             AM/PM marker                                 AM, PM
G             Era designator                               AD, BC
u             Day number of week (1 = Mon, ..., 7 = Sun)   1, 2, 7
ww            Week in year                                 02, 27, 52
W             Week in month                                1, 2, 4
z             General time zone                            GMT, PST, SGT, CET
Z             UTC time offset                              +0000, -0800, +0100
XXX           ISO 8601 time zone                           +01:00, -08:00
'             Escape for text                              H' h 'mm z -> 5 h 06 CET

Time pattern examples

Date and time pattern                         Example
yyyyMMddZ                                     19990322+0100
yyyyMMdd                                      19990322
yyyy/MM/dd H:mm                               1999/03/22 5:06
yyyy.MM.dd HH:mm:ss                           1999.03.22 05:06:07
yyyy.d.M HH:mm:ss                             1999.22.3 05:06:07
yyyy-MM-dd                                    1999-03-22
yyyy-MM-ddXXX                                 1999-03-22+01:00
yyyy-MM-dd'T'HH:mm:ss.SSSXXX                  1999-03-22T05:06:07.000+01:00
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'                  1999-03-22T05:06:07.000Z
yyyy-MM-dd'T'HH:mm:ss.SSS                     1999-03-22T05:06:07.000
yyyy-MM-dd'T'HH:mm:ss                         1999-03-22T05:06:07
yyyy-MM-dd HH:mm:ss.S                         1999-03-22 05:06:07.0
yyyy-MM-dd h:mm:ss a                          1999-03-22 5:06:07 AM
yyyy-MM-dd H:mm:ss                            1999-03-22 5:06:07
yyyy-MM-dd G                                  1999-03-22 AD
yyyy-M-d HH:mm:ss                             1999-3-22 05:06:07
yyyy-M-d h:mm:ss a                            1999-3-22 5:06:07 AM
yyyy-DDDXXX                                   1999-081+01:00
yyyy MMMM dd E                                2019 July 24 Wed
yyyy-ww                                       1999-02
yyyy-ww-u                                     2001-27-3
yyyy-'W'ww-u                                  2001-W27-3
yy/MM/dd HH:mm                                99/03/22 05:06
yy/MM/dd H:mm:ss                              99/03/22 5:06:07
MMMM d, yyyy h:mm:ss z a                      March 22, 1999 5:06:07 CET AM
MMM.dd.yyyy                                   Mar.22.1999
MMM d, yyyy h:mm:ss a                         Mar 22, 1999 5:06:07 AM
MM/dd/yyyy HH:mm:ss                           03/22/1999 05:06:07
MM/dd/yyyy h:mm:ss a                          03/22/1999 5:06:07 AM
MM-dd-yyyy HH:mm:ss                           03-22-1999 05:06:07
MM-dd-yyyy h:mm:ss a                          03-22-1999 5:06:07 AM
HH:mm:ss dd/MM/yyyy                           05:06:07 22/03/1999
HH:mm:ss dd-MM-yyyy                           05:06:07 22-03-1999
EEEE, MMMM d, yyyy h:mm:ss a z                Monday, March 22, 1999 5:06:07 AM CET
EEEE, MMMM d, yyyy h:mm:ss 'o''clock' a z     Monday, March 22, 1999 5:06:07 o'clock AM CET
EEEE, MMMM d, yyyy                            Monday, March 22, 1999
EEEE, d MMMM yyyy HH:mm:ss 'o''clock' z       Monday, 22 March 1999 05:06:07 o'clock CET
EEEE, d MMMM yyyy                             Monday, 22 March 1999
EEE, d MMM yyyy HH:mm:ss Z                    Mon, 22 Mar 1999 05:06:07 +0100
EEE MMM dd HH:mm:ss z yyyy                    Mon Mar 22 05:06:07 CET 1999
dd/MMM/yy h:mm a                              22/Mar/99 5:06 AM
dd/MM/yyyy HH:mm:ss                           22/03/1999 05:06:07
dd/MM/yyyy h:mm:ss a                          22/03/1999 5:06:07 AM
dd/MM/yyyy H:mm                               22/03/1999 5:06
dd.MM.yyyy. HH.mm.ss z                        22.03.1999. 05.06.07 CET
dd-MMM-yyyy HH:mm:ss                          22-Mar-1999 05:06:07
dd-MM-yyyy HH:mm:ss                           22-03-1999 05:06:07
dd-MM-yy HH:mm                                22-03-99 05:06
dd MMMM yyyy HH:mm:ss z                       22 March 1999 05:06:07 CET

Characters that are not letters are treated as quoted text. That is, they will appear in the formatted text even if they are not enclosed within single quotes.

Delete Date Range

Removing or deleting all records or entries that fall within a specified range of dates from a dataset.

Steps to perform:

  • Select “From date” and “To date”

  • Select Keep range (will keep only the selected range of dates) or Delete range (will delete only the selected range of dates) based on your preferences.

Example: Let's keep only the records from “6/19/1961” to “8/17/1970”

_images/deldatrange.png

This will alter the existing column and will keep only the selected range of values.

Flag Date Range

Marking or highlighting all records or entries that fall within a specified range of dates. This is often done to identify, review, or take specific actions on the data within that range.

Example: Let's try to flag dates from “5/30/2011” to “2/19/2013”

_images/flagdatrang.png

Date Difference

Calculation of the amount of time between two dates. This can be measured in various units such as days, months, or years.

Steps to perform:

  • Select the Minuend (the number from which another number, the subtrahend, is subtracted) and the Subtrahend (the number subtracted from the minuend).

  • Choose the output time unit from the dropdown menu.

_images/dropdown.png

Example: Let's compute the difference between the columns “Date of Hire” and “Date of Termination”

_images/input.png

This will generate a new column at the bottom of the preview.

_images/dd.png
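A date difference in days can be sketched with Python's datetime module; the two dates below are hypothetical values for the columns named above:

```python
from datetime import date

hire = date(2011, 5, 30)         # hypothetical "Date of Hire" (subtrahend)
termination = date(2013, 2, 19)  # hypothetical "Date of Termination" (minuend)

# Date difference: minuend minus subtrahend, measured in days.
delta_days = (termination - hire).days
print(delta_days)  # 631
```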

Flag Holiday

Flags each date that falls on a holiday or weekend, based on the selected country or region.

Steps to perform:

  • Select flag type from the dropdown either holiday or weekend.

  • Select Country or Region.

  • Click on Apply

Example: Let's flag the holidays for Australia.

_images/fh.png

The output will be in the form of boolean (true or false).

_images/fgoutput.png

Statistics- Maths

_images/maths.png

Statistics - Maths

Sign conversion

Convert a positive to negative sign and vice versa. Choose between positive or negative conversion, and the result will be displayed in the existing column.

Before applying sign conversion,

_images/originalsign.png

Example for before applying sign conversion

After applying sign conversion,

_images/negativesign.png

Example for after applying negative sign conversion

Note

This function will not generate a new column for the output. It will modify the original column.

Binning

Select Binning Type such as: Interval binning, Equal width and Custom Binning.

Interval binning

Interval binning involves dividing the range of values into equal-sized intervals.

For example, if the data ranges from 11 to 100 and you want 5 bins:

_images/intervalbinning.png

Example for Interval Binning

Equal width

Equal width binning is a type of interval binning where each bin has the same width. For example, if the data represents the heights of students in a class, ranging from 120 cm to 180 cm, and you need to create 4 bins:

_images/equalwidthbinning.png

Example for Equal Width Binning

Custom Binning

Custom binning involves defining bins based on specific criteria or requirements.
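Both equal-width and custom binning can be sketched with pandas’ `pd.cut` (the height values and custom edges are hypothetical):

```python
import pandas as pd

heights = pd.Series([120, 135, 150, 160, 172, 180])

# Equal-width binning: 4 bins of identical width across the 120-180 cm range
equal_width = pd.cut(heights, bins=4)

# Custom binning: user-defined edges and labels
custom = pd.cut(heights, bins=[119, 140, 165, 180],
                labels=["short", "medium", "tall"])
```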

Addition

Add multiple columns together. At least two columns are required.

_images/add.png

Example for Addition of two columns

Subtraction

Subtract multiple columns.

_images/sub.png

Example for Subtraction of two columns

Multiplication

Multiply multiple columns.

_images/multiplication.png

Example for Multiplication of two columns

Division

Divide multiple columns. It is the inverse of multiplication.

_images/division.png

Example for Division

Note

The functions (D, E, F) are designed to operate sequentially from left to right.

Remainder

The remainder is the integer left over after division, indicating the part that remains unallocated.

_images/Remainder.png

Example for Remainder
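The column-wise operations above (Addition through Remainder) behave like ordinary vectorized arithmetic; a pandas sketch with hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20], "b": [2, 4], "c": [5, 2]})

df["sum"] = df["a"] + df["b"] + df["c"]        # Addition
df["diff"] = df["a"] - df["b"] - df["c"]       # Subtraction, left to right
df["product"] = df["a"] * df["b"] * df["c"]    # Multiplication
df["quotient"] = df["a"] / df["b"] / df["c"]   # Division, left to right
df["remainder"] = df["a"] % df["b"]            # Remainder (modulo)
```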

Data transformation

This function includes several mathematical operations - Box cox, Yeo Johnson, Cubic root, Exponential, Log, Normalization, Power2, Power3, Reciprocal, Square root and Standardization. After selecting your chosen operation a new column is created at the bottom.

Box Cox

The Box-Cox transformation is a statistical method used to stabilize variance and make data more normally distributed (symmetric bell-shaped distribution where the mean, median, and mode are all equal)

Suppose a dataset of house prices that is skewed to the right, indicating that most of the house prices are lower.

To apply the Box-Cox transformation mathematically,

Choose a range of values for the λ (lambda) parameter to test the transformation. Let’s say we try λ values ranging from -2 to 2.

For each λ value, calculate the transformed values using the Box-Cox transformation formula:

y_i = \begin{cases} \frac{x_i^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \ln(x_i), & \text{if } \lambda = 0 \end{cases}

y_i represents the transformed value and x_i represents the original value.

Example:

Consider the dataset:

X={1200,2000,4500,800,3500,3000,1500,2500,4000,1000}

For λ ≠ 0, with λ = 0.2784475681948208:

\begin{align*} y_1 &= \frac{1200^{0.2784475681948208} - 1}{0.2784475681948208} \approx 22.26977 \\ y_2 &= \frac{2000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 26.22267 \\ y_3 &= \frac{4500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 33.77536 \\ y_4 &= \frac{800^{0.2784475681948208} - 1}{0.2784475681948208} \approx 19.50882 \\ y_5 &= \frac{3500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 31.24991 \\ y_6 &= \frac{3000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 29.78606 \\ y_7 &= \frac{1500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 23.92758 \\ y_8 &= \frac{2500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 28.13388 \\ y_9 &= \frac{4000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 32.56974 \\ y_{10} &= \frac{1000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 20.98964 \end{align*}

Note

The lambda (λ) varies from -5 to 5. All values of λ in this range are considered and the optimal value for your data is selected; the “optimal value” is the one which results in the best approximation of a normal distribution curve.

_images/boxcox.png

Example for Box Cox Transformation
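A sketch of the same transformation with scipy, applied to the example dataset; the manual line applies the λ ≠ 0 branch of the formula directly:

```python
import numpy as np
from scipy import stats

prices = np.array([1200, 2000, 4500, 800, 3500,
                   3000, 1500, 2500, 4000, 1000], dtype=float)

# With lmbda=None, scipy also returns the lambda found by maximum likelihood
transformed, fitted_lambda = stats.boxcox(prices)

# Applying the formula directly for a fixed lambda
lam = 0.2784475681948208
manual = (prices ** lam - 1) / lam
```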

Yeo Johnson

The Yeo-Johnson transformation extends the Box-Cox transformation to handle both positive and negative values, as well as zero values. It is useful for stabilizing variance and making data more normally distributed.

y_i = \begin{cases} \frac{(x_i + 1)^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0, x_i \geq 0 \\ \ln(x_i + 1), & \text{if } \lambda = 0, x_i \geq 0 \\ -\frac{(|x_i| + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \text{if } \lambda \neq 2, x_i < 0 \\ -\ln(|x_i| + 1), & \text{if } \lambda = 2, x_i < 0 \end{cases}

where:

  • x_i is the original value,

  • y_i is the transformed value,

  • λ is the transformation parameter, chosen to make the transformed data as close to normal as possible.

Example:

Let’s consider x=3 and lambda (λ) = 0.7995651776710038,

\begin{align*} x'_1 &= \frac{(3 + 1)^{0.7995651776710038} - 1}{0.7995651776710038} \approx 2.5384 \end{align*}

For x = 6 and lambda (λ) = 0.7995651776710038,

\begin{align*} x'_2 &= \frac{(6 + 1)^{0.7995651776710038} - 1}{0.7995651776710038} \approx 4.6766 \end{align*}

Note

Factors influencing the lambda value include the data’s skewness, distribution, variance, sensitivity to outliers and optimization method. The lambda value is empirically determined to maximize the normality and symmetry of the transformed data.


_images/yeojohnson.png

Example for Yeo Johnson Transformation
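A sketch of the same computation with scipy, checking the non-negative, λ ≠ 0 branch of the piecewise formula for a single value:

```python
import numpy as np
from scipy import stats

lam = 0.7995651776710038

# scipy applies the piecewise Yeo-Johnson formula; for x >= 0 and
# lambda != 0 this is ((x + 1)**lam - 1) / lam
y3 = stats.yeojohnson(np.array([3.0]), lmbda=lam)[0]
manual = ((3.0 + 1.0) ** lam - 1.0) / lam
```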

Cubic Root

A value that, when multiplied by itself twice, yields the original number is referred to as the cubic root. For instance, the cubic root of 8 is 2.

y_i = \sqrt[3]{x_i}

where:

  • x_i is the original value,

  • y_i is the transformed value.

Example:

\sqrt[3]{27} = 27^{1/3} = 3.0

\sqrt[3]{64} = 64^{1/3} = 4.0


_images/cubicroot.png

Example for Cubic Root

Exponential

Exponential refers to a mathematical operation or function where a constant (the base) is raised to the power of an exponent, resulting in a rapidly increasing or decreasing function. It is commonly represented as a^x, where a is the base and x is the exponent.

An example of an exponential function is f(x) = 2^x. Let’s evaluate this function for some values of x:

  • When x = 0, f(0) = 2^0 = 1.

  • When x = 1, f(1) = 2^1 = 2.

_images/exponential.png

Example for Exponential Transformation

Log

The log transformation is a mathematical operation that calculates the logarithm of a number. It is commonly used to transform skewed data to make it more symmetrically distributed.

The transformation formula is as follows:

y_i = \log(x_i)

where:

  • x_i is the original value,

  • y_i is the transformed value.

Example:

\log_{10}(100) = \frac{\ln(100)}{\ln(10)}

We can substitute \ln(100) and \ln(10) with their respective values:

\ln(100) \approx 4.60517, \quad \ln(10) \approx 2.30259

Finally:

\log_{10}(100) = \frac{4.60517}{2.30259} = 2

Therefore, \log_{10}(100) is equal to 2.

\log_2(8) = \frac{\ln(8)}{\ln(2)}

We can substitute \ln(8) and \ln(2) with their respective values:

\ln(8) \approx 2.07944, \quad \ln(2) \approx 0.69315

Finally:

\log_2(8) \approx \frac{2.07944}{0.69315}

This simplifies to:

\log_2(8) \approx 3

Therefore, \log_2(8) is equal to 3.

_images/log.png

Example for Log Transformation
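The log transformation and the change-of-base identity used above can be checked with numpy:

```python
import numpy as np

values = np.array([1, 10, 100, 1000], dtype=float)
log10_values = np.log10(values)        # base-10 log of each value

# change of base: log2(8) = ln(8) / ln(2)
log2_of_8 = np.log(8) / np.log(2)
```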

Normalization

Normalization is a technique used to scale numeric features to a standard range, typically between 0 and 1 or between -1 and 1. It is used in statistical analyses to ensure that all features have the same scale and to prevent features with larger values from dominating those with smaller values.

For Example, Consider the following dataset,

{Data} = [2, 5, 8, 10, 15]

To normalize these values using min-max scaling, we’ll use the formula:

x_{\text{norm}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}

Where:

  • x_min= 2 is the minimum value of the dataset,

  • x_max = 15 is the maximum value of the dataset.

Now, we can calculate the normalized values using the formula:

For x = 2:

x_{\text{norm}} = \frac{2 - 2}{15 - 2} = 0

For x = 5:

x_{\text{norm}} = \frac{5 - 2}{15 - 2} = \frac{3}{13} \approx 0.2308

For x = 8:

x_{\text{norm}} = \frac{8 - 2}{15 - 2} = \frac{6}{13} \approx 0.4615

For x = 10:

x_{\text{norm}} = \frac{10 - 2}{15 - 2} = \frac{8}{13} \approx 0.6154

For x = 15:

x_{\text{norm}} = \frac{15 - 2}{15 - 2} = 1

So, after normalizing the data using min-max scaling, the normalized dataset would be:

Normalized Data = [0, 0.2308, 0.4615, 0.6154, 1]

These values are now scaled to the range between 0 and 1.

_images/normalization.png

Example for Normalization
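Min-max scaling for the example dataset is a one-liner in numpy:

```python
import numpy as np

data = np.array([2, 5, 8, 10, 15], dtype=float)

# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
normalized = (data - data.min()) / (data.max() - data.min())
```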

Power2

The term “Power of 2” typically refers to the operation of exponentiation, where a number is raised to the second power. In mathematical notation, it is represented as "x^2", where x is the base number. This operation involves multiplying the base number by itself.

For example:

2^{2} = 4

In general, raising a number to the power of 2 is known as squaring that number.

_images/power2.png

Example for Power 2

Power3

The term “power of 3” also refers to exponentiation, where a number is raised to the third power. In mathematical notation, it is represented as "x^3", where x is the base number. This operation involves multiplying the base number by itself twice.

For example:

1200^{3} = 1,728,000,000

In general, raising a number to the power of 3 is known as cubing that number.

_images/power3.png

Example for Power 3

Reciprocal

The reciprocal of a number is calculated by dividing 1 by that number.

Mathematically, if we have a number represented as x, the reciprocal of x is represented as \frac{1}{x}.

For example:

If x = 5, the reciprocal of 5 is 1/5.

Similarly, if x = 0.5, the reciprocal of 0.5 is 1/0.5 = 2.

The reciprocal of a number essentially represents how many times that number fits into 1. For instance, the reciprocal of 5 being 1/5 means that 5 fits into 1 one-fifth of a time.

_images/reciprocal.png

Example for Reciprocal

Square Root

The square root of a number is a value that, when multiplied by itself, gives the original number. In mathematical notation, the square root of a number x is represented as √x.

For example:

\sqrt{25} = 5. Since 5 * 5 = 25

The square root operation can also be denoted using exponentiation: x^(1/2). This means raising x to the power of one-half, which is equivalent to taking the square root of x.

_images/squareroot.png

Example for Square Root

Standardization

In statistics, standardization is a process used to transform data so that it has a mean of zero and a standard deviation of one. This technique is applied to individual variables within a dataset, making them comparable and facilitating statistical analysis.

The formula for standardizing a variable x is:

z = \frac{x - \mu}{\sigma}

Where:
  • x is the original value of the variable,

  • \mu is the mean of the variable,

  • \sigma is the standard deviation of the variable, and

  • z is the standardized value.

Standardization transforms the distribution of the data to have a mean of zero and a standard deviation of one. This process does not change the shape of the distribution but places it on a common scale, allowing for easier comparison between variables and datasets.

Standardization is particularly useful in statistical modeling, machine learning, and data analysis, where it helps improve the interpretability and performance of models by ensuring that variables are on a consistent scale.

Example:

Consider the data x= [5, 20, 8, 3, 4, 6, 15, 12, 9, 10]

Mean \bar{x} = \frac{{5 + 20 + 8 + 3 + 4 + 6 + 15 + 12 + 9 + 10}}{{10}} = 9.2

Standard Deviation \sigma = \sqrt{\frac{{\sum_{i=1}^{10} (x_i - \bar{x})^2}}{{10}}} \approx 5.04

Standardized Data z_i = \frac{{x_i - \bar{x}}}{{\sigma}}

where x_i is each data point in the given data.

z_1 &= \frac{5 - 9.2}{5.04} \approx -0.83 \\ z_2 &= \frac{20 - 9.2}{5.04} \approx 2.14 \\ z_3 &= \frac{8 - 9.2}{5.04} \approx -0.24 \\ z_4 &= \frac{3 - 9.2}{5.04} \approx -1.23 \\ z_5 &= \frac{4 - 9.2}{5.04} \approx -1.03 \\ z_6 &= \frac{6 - 9.2}{5.04} \approx -0.63 \\ z_7 &= \frac{15 - 9.2}{5.04} \approx 1.15 \\ z_8 &= \frac{12 - 9.2}{5.04} \approx 0.56 \\ z_9 &= \frac{9 - 9.2}{5.04} \approx -0.04 \\ z_{10} &= \frac{10 - 9.2}{5.04} \approx 0.16
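The z-scores use the population standard deviation (the formula divides by the full count, n = 10); a numpy sketch:

```python
import numpy as np

x = np.array([5, 20, 8, 3, 4, 6, 15, 12, 9, 10], dtype=float)

# numpy's std() defaults to ddof=0 (population standard deviation),
# matching the formula above
z = (x - x.mean()) / x.std()
```

By construction the standardized data has mean 0 and standard deviation 1.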

Note

Mathematical operations can be performed only on Numerical variables.

Statistics - Text

Text Processing

Text processing is used to analyze the most frequently used words in a written text. It can also be used to identify spam emails by searching for specific keywords. Text processing outputs the top 100 most-used words as new columns.

  • Select the preferred N grams from 1 to 5.

  • Select the Encode type: Boolean (0 or 1) or Count.

  • Select Custom stop words if required.

  • Click on the Apply button.

Note

The output will appear at the bottom of the page. You can also preview the output on the preview page.


_images/textprocessing.png

Example for Text Processing

Text->Cat

Converts Text data type into Categorical data type.

Text data type before conversion,

_images/beforeconversion.png

Example for Text->Cat before conversion

After conversion,

_images/afterconversion.png

Example for Text->Cat after conversion

Note

This operation will not generate a new column for the output; instead, it will only modify the data type within the same column, altering the output accordingly.

_images/output.png

Note

Text functions can be performed only for the Text variable.

Statistics- Data

Label encoding

Label encoding is a technique used to convert categorical variables into numerical format.

Consider a dataset with a categorical variable “Temperature” and its corresponding values are “Hot,” “Cold,” and “Warm.”

Using label encoding, numerical labels are assigned to each unique category. The labels are assigned based on alphabetical order.

Note

This function is applicable only to Categorical columns.


_images/labelencoding.png

Example for Label Encoding
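Alphabetical label assignment can be sketched in pandas (values taken from the Temperature example above):

```python
import pandas as pd

temperature = pd.Series(["Hot", "Cold", "Warm", "Hot"])

# Sorting the unique categories first reproduces alphabetical label order:
# Cold -> 0, Hot -> 1, Warm -> 2
categories = sorted(temperature.unique())
mapping = {cat: code for code, cat in enumerate(categories)}
encoded = temperature.map(mapping)
```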

One Hot Encoding

One hot encoding is another method used to represent categorical variables in a format suitable for machine learning algorithms. Instead of assigning numerical labels like label encoding, one hot encoding creates binary vectors for each category.

Note

This function is applicable for both categorical and text columns. It does not require a user-defined output column name; instead, it will automatically generate a column name for the output.

_images/onehotencoding.png

Example for One Hot Encoding
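A pandas sketch of one hot encoding, where output column names are generated automatically from the category values:

```python
import pandas as pd

df = pd.DataFrame({"Temperature": ["Hot", "Cold", "Warm"]})

# One binary column per category; names are auto-generated from the prefix
one_hot = pd.get_dummies(df["Temperature"], prefix="Temperature")
```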

Renaming Factor Level

Changing the labels or names of categories in a categorical variable to make them more descriptive or meaningful.

Note

This function is applicable to all data types.

_images/renamingfactor.png

Example for Renaming Factor

_images/validation.png

Ignore

Refers to excluding certain variables to filter out irrelevant or noisy data.

Note

Multiple columns can also be ignored.

_images/ignore.png

Example for Ignore

Unignore

Reverse function of Ignore: previously ignored variables can be restored using this function.

_images/unignore.png

Example for Unignore

Impute Missing

Involves filling in missing data points in a dataset.

Techniques available for Numerical Data Type include:

  • Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)

  • Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)

  • Mode (Value that appears most frequently in a dataset. It’s the most common observation or value in the data.)

  • Default

  • Moving Average (Smooths out fluctuations in data by calculating the average of a subset of data points within a sliding window.)

  • Forward Filling (Copying the last observed value forward to fill in missing values.)

  • Backward Filling (Copying the next observed value backward to fill in missing values in a dataset.)

  • Custom Filling (Filling missing values using a user-defined method.)

  • Average of previous and next (Involves taking the average of the values immediately preceding and following the missing value to estimate its value.)

  • KNN (Missing values are replaced with the average or median value of the k nearest neighbors.)

The available techniques for “Categorical” variable are,

  • Mode

  • Forward Filling

  • Backward Filling

  • Custom Filling

  • KNN

_images/categorical.png

Techniques for categorical variable

The available techniques for the “Date” variable are,

  • Mode

  • Forward Filling

  • Backward Filling

_images/categorical.png

Techniques for Date variable

The available techniques for the “Text” variable are,

  • Mode

  • Forward Filling

  • Backward Filling

  • Custom Filling

  • KNN


_images/imputemissing.png

Example for Impute Missing
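Several of the numerical techniques above map directly onto pandas (series values hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mean_filled = s.fillna(s.mean())   # Mean (NaNs are skipped when averaging)
forward_filled = s.ffill()         # Forward Filling: carry last value forward
backward_filled = s.bfill()        # Backward Filling: carry next value back
custom_filled = s.fillna(0)        # Custom Filling with a chosen default
```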

Delete Missing

Deleting rows with missing values can be a preferred approach when those values are critical for analysis and cannot be easily imputed.

This function removes all rows containing missing values across all columns in the dataset, ensuring that only complete observations are retained for analysis.

_images/deletemissing.png

Example for Delete Missing

Copy

Copying a column or creating a duplicate allows for performing two different operations on the same column, which can be useful. This process simply involves creating a copy of the column, and then proceeding with the desired operations. The new column is created at the bottom.

_images/copy.png

Example for Copy

Column Concat

The function is used to concatenate two or more columns into one column. For instance, if there are columns for “First name” and “Surname,” they can be merged into a single column.

For example, “Name” and “Sex” columns are being combined. A delimiter is manually selected to clearly mark the transition between the values of the first and second columns.

_images/columnconcat.png

Example for Column Concat

Cross Tab

Cross tabulation is a method to organize and summarize data from two categorical variables into a table.

It helps to see how these variables are related by counting the occurrences of each combination of categories.

_images/crosstab.png

Example for Cross Tab

To transpose the output, click on the ‘inverse’ option.

_images/crosstabinverse.png

Example for Cross Tab Inverse
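A minimal pandas sketch of cross tabulation and its transpose (data hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Survived": ["yes", "yes", "no", "yes"],
})

# Counts of each combination of the two categorical variables
table = pd.crosstab(df["Sex"], df["Survived"])

# Transposing swaps rows and columns, like the 'inverse' option
inverse = table.T
```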

Filter

Filter the data using multiple conditions. Available Filters for the Data types are,

Filters for Numerical Data Type

Condition

Numerical data often involves conditions such as:

Equal to: Filters data where the numerical value is exactly equal to a specified value. For example, filtering sales data for transactions where the amount equals $1000.

Not Equal to: Filters data where the numerical value is not equal to a specified value. For instance, excluding transactions with a sales amount of $1000.

Is Greater Than: Filters data where the numerical value is greater than a specified threshold. For example, filtering sales data for transactions with amounts greater than $1000.

Is Greater Than or Equal to: Filters data where the numerical value is either greater than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts greater than or equal to $1000.

Is Less Than: Filters data where the numerical value is less than a specified threshold. For example, filtering sales data for transactions with amounts less than $1000.

Is Less Than or Equal to: Filters data where the numerical value is either less than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts less than or equal to $1000.

Is Null: Filters data where the numerical value is null or missing.

Is Not Null: Filters data where the numerical value is not null or missing.

Between: Filters data where the numerical value falls within a specified range. For example, filtering sales data for transactions with amounts between $500 and $1000.

Note

This filter is applicable to Categorical and Text data types.


_images/filter.png

Example for Filter

For adding more rules and conditions, click on “Add a Group” and choose between an “and” or “or” option for each step. Once the “Apply” button is clicked, the entire dataset will be modified to include only those entries that meet the specified filter criteria.

_images/addrule.png

Example for Add Rule and Add Group
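The filter conditions above correspond to ordinary boolean masks; a pandas sketch with a hypothetical sales column:

```python
import pandas as pd

sales = pd.DataFrame({"amount": [250, 1000, 750, 1500, None]})

equal_1000 = sales[sales["amount"] == 1000]            # Equal to
greater_1000 = sales[sales["amount"] > 1000]           # Is Greater Than
between = sales[sales["amount"].between(500, 1000)]    # Between (inclusive)
is_null = sales[sales["amount"].isna()]                # Is Null

# Combining rules with "and" (&) / "or" (|), as with Add a Group
combined = sales[(sales["amount"] >= 500) | (sales["amount"].isna())]
```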

Grouped Statistics

Grouped statistics organize data into groups and calculate statistical measures for each group, offering detailed insights into how variables behave within specific categories.

The available aggregation are:

  • Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)

  • Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)

  • Variance (Measures how spread out the values are; the average of the squared deviations from the mean.)

  • Standard Deviation (How the data is spread around the mean)

  • Unique (Number of unique values in the dataset)

  • Count (Total number of elements present in a particular variable or column)

  • Sum (Total value obtained by adding together all the individual values in a specific variable)

Example: To calculate the average booking amount for each destination city, select INR_Amount with the following settings,

_images/groupstat.png
_images/groupedstat.png

Example for Grouped Statistics
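Grouped statistics can be sketched with a pandas groupby (the destination-city column name is hypothetical; INR_Amount follows the example above):

```python
import pandas as pd

bookings = pd.DataFrame({
    "to_city": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "INR_Amount": [4000, 6000, 3000, 5000],
})

# One row per group, with the chosen aggregations as columns
grouped = bookings.groupby("to_city")["INR_Amount"].agg(["mean", "count", "sum"])
```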

Transpose

Transpose refers to the operation of flipping or rotating a dataset, typically switching its rows and columns. This operation can be useful for various purposes, such as changing the orientation of data for easier analysis or visualization, or preparing data for specific computations or algorithms. When a dataset is transposed, the rows become columns and vice versa.

Transpose Methods:

  • Split & Transpose

  • Combine & Transpose

Split & Transpose

Dividing a text string based on a separator and then rearranging the resulting parts into columns or rows.

Note

This functionality will work only for Categorical variable(s).


_images/seperator.png
_images/split%26transpose.png

Example for Split & Transpose

Combine & Transpose

Merging multiple columns into one column, and then rearranging the data from rows to columns or vice versa.

_images/transpose.png
_images/combine%26transpose.png

Example for Combine & Transpose

Drop Duplicates(Row Wise)

Dropping duplicates refers to removing duplicate values in a column or entire duplicate rows. This operation ensures that each row in the dataset is unique, which can be important for various data analysis tasks.

The selected column (from_city) contains 2033 duplicate records.

_images/duplicatecount.png
_images/dropduplicates.png

After applying the function to the selected column, the validation results will be displayed.

_images/successvalidation.png

_images/result.png

Result
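Row-wise and column-wise duplicate dropping can be sketched with pandas (data hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "from_city": ["Chennai", "Chennai", "Delhi"],
    "amount": [100, 100, 200],
})

# Entire-row duplicates: the second (Chennai, 100) row is dropped
row_wise = df.drop_duplicates()

# Duplicates in a single column: keeps the first row per from_city value
by_column = df.drop_duplicates(subset=["from_city"])
```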

Statistics- Advanced

Pivot analysis

Analysis based on pivoting data - a Pivot Table is used to summarise, sort, reorganise, group, count, total or average data stored in a table. It allows us to transform columns into rows and rows into columns.

We have two options in this section - either a “Pivot mode” or a regular chart mode.

Regular chart mode

In the regular chart mode we can select all the rows that we are interested in. In our example we selected Age, Fare, Sex, Ticket purchase, Cabin Detail and Number of meals. We get a basic chart with the selected values.

_images/RegularModeSelect.png

RegularModeSelect

In the next step, we can also divide our selected columns in groups, by dragging the preferred group type down to the “Row groups” section. We have selected “Sex” to differentiate the groups. We can see we have 577 items in group “Male” and 314 items in group “Female”. We can open each group by clicking on the group name.

_images/RegularModeRowGroups.png

RegularModeRowGroups

In the next step, we can add other numerical values that we are interested in. In our case, we dragged “Fare” down below to the “Values” section and clicked on the desired value in the bottom green box. We can also select in which form the value should be displayed - average, count, first, last, max, min or sum. In our case it was the “average fare” value.

_images/RegularModeValues.png

RegularModeValues

Pivot mode

Firstly it is necessary to switch to the “Pivot mode” in the top right part of the screen

_images/PivotModeOn.png

PivotModeOn

After that, select the fields you want to have in your Pivot table. For our demonstration we selected - Survived, Age, Sex, Ticket purchase and Meal Cost. “Ticket purchase” was dragged to the “Row Groups”, so it was automatically shifted to the left side as the first column. We also shifted “Sex” to the “Column Labels” section, so our chart was split into two sections - one showing only the “Female” data and the other showing only the “Male” data. The rest of our selected fields (Survived, Age and Meal cost) remain in the “Values” section, where we can again choose the form in which each field should be displayed (average, count, first, last, max, min or sum; in our case, the average).

_images/PivotMode.png

PivotMode

More tools are hidden next to each value column, as per the below image

_images/ToolsButton.png

ToolsButton

_images/MoreTools.png

You can also export or copy the chart with the right click on the table.

_images/RightClick.png

Jupyter Notebook

Jupyter is an open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. User can create a customised program/code using Jupyter notebook and import it back to EDGE.

Regular Expression

Regular Expressions (REs) offer a mechanism to select specific strings from a set of character strings. They provide a context-independent syntax capable of representing a wide variety of character sets and their orderings, with interpretations based on the current locale. In addition to using Search and Replace for finding and replacing simple text strings, you can perform more precise searches using Regular Expressions (RegEx). RegEx allows you to accurately match a particular pattern in your search.

Select the Method such as :

  • Extract

  • Search

  • Replace

_images/extract.png

Extract

Example: Let’s try to extract ‘_’,

_images/regextract.png

Example for Extract

Example: Let’s try to search ‘Chennai’,

_images/search.png
_images/regsearch.png

Example for Search

Example: Let’s replace ‘Bālāpur’ with ‘Balapur’

_images/regreplace.png

Example for Replace
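The three methods map onto Python’s re module (sample strings based on the examples above):

```python
import re

city = "Chennai_600001"

extracted = re.findall(r"_", city)                # Extract: all matches
found = re.search(r"Chennai", city) is not None   # Search: does it occur?
replaced = re.sub("Bālāpur", "Balapur", "Bālāpur town")  # Replace
```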

Delta

Delta (Δ) is a generalized function that calculates the difference between each row value and the value at its corresponding subtrahend position in the given numerical column, grouped and sorted by the groupBy and orderBy columns respectively.

_images/delta.png
_images/egdelta.png

Example for Delta
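A pandas sketch of Delta with a subtrahend position of 1 (i.e. the previous row within each group); column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B"],   # groupBy column
    "day": [1, 2, 3, 1, 2],              # orderBy column
    "sales": [10, 15, 12, 100, 90],      # numerical column
})

# Sort by the orderBy column within each group, then difference each
# row against the row one position back
df = df.sort_values(["city", "day"])
df["delta"] = df.groupby("city")["sales"].diff()
```

The first row of each group has no predecessor, so its delta is missing.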

Statistics- Boolean

Refers to a data type that represents one of two states: true or false.

Booleans are commonly used to filter, subset, or mask data based on certain conditions.

Is Null

Refers to missing or nonexistent values in a dataset.

When checking for null values in a specific data field or observation, the result is typically a boolean (true or false) indicating whether the value is null (true) or not null (false). This process of checking for null values is essential for data cleaning and quality assessment before analysis.

_images/null.png

Example for Is Null

Is Even

Determining whether a given number is divisible by 2 without any remainder, indicating that it is an even number.

It will return true if the number is even and false if it is not.

_images/even.png

Example for Is Even

Is Odd

Determining whether a given number is not divisible by 2 without any remainder, indicating that it is an odd number.

It will return true if the number is odd and false if it is not.

Note

This function is applicable only to Numerical variables.


_images/odd.png

Example for Is Odd

Is Number

Checks whether a given value is a numerical type.

It will return true if the value is a number and false if it is not.

_images/number.png

Example for Is Number

Is Text

Checks whether a given value is a string or textual data type.

It will return true if the value is text and false if it is not.

_images/egtext.png

Example for Is Text

Negate Boolean

Means flipping its truth value. For example, if the original boolean value is true, its negation would be false, and vice versa. It’s a way to change a statement from being true to false, or vice versa.

Note

This function is applicable only to Boolean values.


When selecting a non-boolean data type variable, it will show the validation message “Selected column has no boolean values.”

_images/validation_negate.png
_images/negate.png

Example for Negate Boolean

Frequently Used

The functions commonly used for analysis are:

  • Label Encoding

  • One Hot Encoding

  • Ignore

  • Unignore

  • Impute Missing

  • Extract Date Parts

  • Data Transform

  • Binning

Graphstat

This feature presents a bar chart for the selected column, displaying the value and its count. It supports numerical, categorical, and date data types for bar chart visualization. Numerical data types offer an option to adjust the number of Bins. Text data types are represented using a word cloud. Scatter plots can be enabled to visualize two numerical variables against each other, typically the selected variable and the target variable. Users can view charts in full-screen mode and utilize a print option. Chart download options include PNG, JPEG, PDF, and SVG formats. Data can be downloaded in CSV or XLS format.

Note

Categorical and date data types display only bar charts, while numerical data types offer both scatter plots and bar charts.

For categorical variables, the ‘sort by’ feature is available.

_images/sortby.png
_images/cat.png

Graphstat for Categorical Column

_images/date.png

Graphstat for Date Column

_images/numeric.png

Graphstat for Numerical Column

If Scatter Plot is required, enable the Scatter with Target variable toggle.

_images/scatteroption.png

After enabling the toggle, the Scatter will display

_images/scatterplot.png

Graphstat for Numerical Column–Scatter Plot

The word cloud offers the N-grams feature, which ranges from 1 to 5.

_images/textoptions.png
_images/textwc.png

Graphstat for Text variable

Statistical Test

One Sample T Test - One Tailed

A one-sample t-test is used to determine if the mean of a single sample is significantly different from a known or hypothesized population mean.

Steps to perform

  • Select One Sample T Test - One tailed under test name.

  • Under Numerical column select Numerical variable.

  • Under Categorical column select categorical variable. Select if it is required.

  • Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.

  • The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.

_images/onetailed.png

Sampling:

  • Enable Sampling, if required. Choose the sampling method: Random (each member of the population is selected purely by chance, meaning every individual or item has an equal chance of being included, and selection is not biased by personal judgment or preference) or Stratified (the population is divided into subgroups, or strata, based on characteristics relevant to the research question, and samples are then randomly selected from each stratum independently; this ensures each subgroup is represented and allows more precise estimates and comparisons within each stratum).

Size:

_images/size.png
  • % of a Variable: Selecting a sample size that is a certain percentage of a particular variable. For instance, if the variable is the total population, and you choose 10%, your sample will include 10% of the total population. This approach is useful when you want the sample size to be proportional to the size of a certain variable.

_images/sizeinputs.png
  • Absolute Value: Selecting a sample size by specifying an absolute number. For example, if you decide that your sample size should be 5 individuals, you simply select 5 individuals from the population. This method is straightforward and is often used when the required sample size is predetermined by external factors such as budget or time constraints.

_images/absinputs.png
  • By Calculator: Using a calculator involves employing a statistical tool or software to determine the appropriate sample size based on certain parameters such as confidence level, margin of error, and population size. This method ensures that the sample size is statistically valid and sufficient to make reliable inferences about the population.

_images/bycalculatorinputs.png
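
One common formula such calculators use is Cochran's formula; a sketch, assuming an effectively infinite population and a z-score of 1.96 for 95% confidence:

```python
import math

def cochran_sample_size(z, margin_of_error, proportion=0.5):
    """n = z^2 * p * (1 - p) / e^2, rounded up to a whole subject."""
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)

# 95% confidence (z = 1.96), +/- 5% margin of error
needed = cochran_sample_size(1.96, 0.05)
```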
  • Lazy Sampling: A less rigorous approach where the sample is selected based on convenience or ease of access rather than using a strict randomization process. This method is quick and cost-effective but may introduce bias and is not always representative of the population. It’s often used in exploratory research or when resources are limited.

_images/lazysamplinginputs.png
  • Systematic Sampling: Selecting every nth item from a list or population. For example, if you have a list of 1,000 individuals and you want a sample of 100, you would select every 10th individual. This method is easy to implement and ensures a spread across the population, but it assumes that the list is randomly ordered to avoid periodicity bias.

_images/ssinputs.png
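
The every-nth-item rule described above can be sketched in a few lines of Python:

```python
def systematic_sample(items, sample_size):
    """Pick every nth item, where n = population size // sample size."""
    step = max(1, len(items) // sample_size)
    return items[::step][:sample_size]

# 1,000 individuals, sample of 100: every 10th individual is selected
sample = systematic_sample(list(range(1000)), 100)
```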
  • Filtering: Enable Filtering and filter the data, if required.

Once the steps are done, click on ‘Test’ and verify the result.

_images/resultos.png

Output

Histogram-Density

Provides a normalized representation of the distribution of data, making it easier to compare distributions and understand the relative proportions of data within different intervals.

_images/histogram.png

Histogram:

  • The histogram displays the frequency distribution of age data within specified intervals (bins).

  • The blue bars represent the number of data points (frequency) that fall within each age interval.

  • The height of each bar corresponds to the count of data points in that interval.

Density:

  • The black curve represents the density estimate, showing the distribution of the data in a continuous manner.

  • The area under the density curve represents the entire data set and sums to 1.

The histogram gives a direct count of data points within intervals, while the density plot provides a smoothed view, highlighting the shape of the distribution. Together, they offer a comprehensive view of the data, with the histogram showing discrete counts and the density plot illustrating the overall distribution pattern.
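
The relationship between counts and density can be sketched in plain Python (the age values are hypothetical); note how the density integrates to 1 over the bins:

```python
def histogram_density(values, bins):
    """Return per-bin counts and a density normalized to integrate to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the maximum value
        counts[i] += 1
    density = [c / (len(values) * width) for c in counts]
    return counts, density, width

ages = [21, 22, 23, 25, 30, 31, 33, 35, 40, 41]
counts, density, width = histogram_density(ages, 4)
```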

T distribution

Provides a visual representation of how the shape of the distribution changes with different sample sizes and illustrates the properties of the t-distribution relevant for statistical inference.

_images/t-distribution.png

Central Tendency of Age Data:

  • The peak of the density curve (red line) suggests that the most common or average age in the dataset is around 0 on a standardized scale. This means that the data is centered around this value.

Spread and Distribution of Age Data:

  • The shape of the density curve (red line) and the t-distribution curve (blue line) indicates how age values are distributed in the dataset.

  • The t-distribution curve (blue line) closely follows the density curve, but it’s slightly smoother and broader. This suggests that the data may have been fitted to a theoretical distribution, such as the t-distribution, often used in statistics for small sample sizes.

Significance of Tail Area:

  • The shaded area under the t-distribution curve (pink) represents the tail of the distribution.

  • This area is significant in statistical terms, especially for hypothesis testing, where it helps determine the probability of observing values beyond a certain threshold (e.g., ages above a certain point).

Data Observation

The observation of the data is also displayed.

_images/obs.png

Hypothesis

Here the detailed hypothesis is displayed.

_images/hypothesis.png

Summary

Here the Summary Stats are displayed.

_images/summarystats.png

One Sample T Test - Two Tailed

In a one-sample t-test with a two-tailed hypothesis, one examines whether the sample mean significantly differs from a hypothesized population mean. This involves calculating the t-statistic based on the sample data and comparing it to critical values from a t-distribution. The aim is to determine if the mean significantly deviates from the hypothesized value in either direction, hence the term “two-tailed”.

Steps to perform

  • Select One Sample T Test - Two tailed under test name.

  • Under Numerical column select Numerical variable.

  • Under Categorical column, select a categorical variable if it is required.

  • Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.

  • The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.

  • To learn more: ‘Sampling’

  • To learn more: ‘Filtering’

Histogram-Density

To learn more: ‘Histogram-Density’

T Distribution

To learn more: ‘T distribution’

To learn more: ‘Data Observation’

Shapiro-Wilk

The Shapiro-Wilk test is a useful tool for determining the normality of a data set, which is a crucial assumption in many statistical analyses. It provides a clear decision-making process based on the p-value of the test statistic.
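
The decision rule the test applies can be sketched as:

```python
def shapiro_decision(p_value, alpha=0.05):
    """Reject the normality assumption when p falls below alpha."""
    if p_value < alpha:
        return "not normally distributed"
    return "looks normally distributed"
```

SEDGE computes the p-value itself; this only illustrates how the p-value is compared against the chosen level of significance.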

Steps to perform

  • Select Shapiro-Wilk test under test name.

  • Select Numerical variable under Numerical column.

To learn more: ‘Sampling’

To learn more: ‘Filtering’

Histogram-Density

To learn more: ‘Histogram-Density’

_images/shapirooutput.png

Output

Data Observation

The observation of the selected variable is displayed. Unlike One Sample T test - One Tailed & Two Tailed, this won’t display Standard error.

_images/shapiroobs.png

To learn more: ‘T distribution’

Hypothesis

Under Hypothesis, the result states whether the sample looks normally distributed or not.

_images/hypo.png

Summary

Under Summary, only the P Value and T-Statistic are displayed.

_images/sum.png

Paired Student T-Test

The paired Student’s t-test compares the means of two related groups to determine if there is a statistically significant difference. It is used when the same subjects are measured before and after a treatment or in matched pairs. The test calculates the difference between paired observations, then computes the mean and standard deviation of these differences to determine the t-statistic. If the p-value is less than a chosen significance level (e.g., 0.05), the null hypothesis (no difference) is rejected, indicating a significant difference between the means.
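
The calculation described above can be sketched in plain Python (the before/after values are hypothetical):

```python
import math
import statistics

def paired_t(before, after):
    """t = mean(differences) / (std dev(differences) / sqrt(n))."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical scores measured before and after a treatment
t_stat = paired_t([10, 12, 11, 13], [12, 13, 13, 14])
```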

This won’t display the Histogram-Density Plot and T-distribution.

Steps to perform

  • Select Paired Student T test under test name.

  • Select two Numerical variables under Numerical columns.

  • To learn more: ‘Sampling’

  • To learn more: ‘Filtering’

Data Observation

The observation of the selected variable is displayed. Unlike One Sample T test - One Tailed & Two Tailed, this won’t display Standard error.

_images/pairedobs.png

Hypothesis

Under Hypothesis, this will display only whether the difference between the two pairs is significant or not.

_images/pairedhypo.png

Summary Stats

Under Summary, the summary statistics are displayed.

_images/pairedsum.png

Chi Square Test

The Chi Square test is a statistical method used to determine whether there is a significant association between categorical variables. It compares the observed frequencies of categories with the expected frequencies if there were no association.
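
The observed-versus-expected comparison can be sketched in plain Python (the 2x2 table below is hypothetical):

```python
def chi_square_stat(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts, e.g. sex vs. product preference
stat = chi_square_stat([[10, 20], [20, 10]])
```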

This won’t display the Histogram-Density Plot and T-distribution

Steps to perform

  • Select Chi Square test under test name.

  • Select two Categorical variables under Categorical columns.

  • To learn more: ‘Sampling’

  • To learn more: ‘Filtering’

Data Observation

The observation of the selected variable is displayed. Unlike One Sample T test - One Tailed & Two Tailed, this won’t display Standard error.

_images/chiobs.png

Hypothesis

Under Hypothesis, this will display only whether the association between the two variables is significant or not.

_images/chihypo.png

Summary Stats

Under Summary, the summary statistics are displayed.

_images/chisum.png

Note

In order to view the T distribution, click on ‘T Distribution’ next to ‘Histogram-Density’.

In order to view the Summary Stats click on the ‘Summary’ next to ‘Hypothesis’ in the ‘Hypothesis & Summary’ Panel.

Transformations

Absolute

For video tutorial see: SEDGE YouTube channel - A functions.

abs( ): Absolute value of a variable. The magnitude of a real number without regard to its sign. Numbers with negative values will be converted to positive ones.

Syntax:

abs(#column1#) AS <<Newabscolumn>>

New column name: Newabscolumn

Note

The absolute value can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

abs(#temperature#)

Where #temperature# includes the values -12, 6, 0.5, -6, (missing), 25. After applying abs(#temperature#) the values will change to 12, 6, 0.5, 6, (missing), 25.

The column may have temperatures with negative values, and this will convert all negative temperatures to positive.
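
The same behaviour can be reproduced in plain Python, with None standing in for a missing row:

```python
rows = [-12, 6, 0.5, -6, None, 25]  # None marks a missing row
result = [abs(v) for v in rows if v is not None]  # missing rows are ignored
```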



Accrint

accrint(): The ACCRINT function returns the accrued interest for a security that pays interest periodically.

The arguments can be explained as:

  • Issue - The issue date of the security

  • First_interest - The first date interest will be paid

  • Settlement - The settlement date of the security

  • Rate - The security’s annual coupon rate.

  • Par - The security’s par value.

  • Frequency - The number of interest payments per year.

Choose from the following:

  • 1 = annual payments

  • 2 = semi-annual payments

  • 4 = quarterly payments

  • 12 = monthly payments

Syntax:

accrint(#Issue_Date#, #First_Interest_Date#, #Settlement_Date#, #Annual_Interest_Rate#, #Par#, #Frequency#)  as <<Newcol_accrint>>

New column name: Newcol_accrint

Note

Issue_date, First_Interest_Date and Settlement_Date must be in date datatype.

Example:

accrint(#issue_Date#, #First_Interest#, #Settlement_Date#, #open_acc#, #Par#, 2) as <<new_accrint>>

Above example will return the accrued interest for each row, assuming semi-annual interest payments (frequency = 2).



Arc cosine - Acos

acos( ): The arc cosine function is the inverse of the cosine of a variable; the function returns angles in radians.

Syntax:

acos(#column1#) AS <<Newacoscolumn>>

New column name: Newacoscolumn

Note

The acos function can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

acos(#NumericalColumn#) AS <<Newacoscolumn>>

Where #NumericalColumn# values are between -1 and 1. After applying acos, we will get an angle expressed in radians.



Add months

add_months( ): The function returns a date with the given number of months added (date + integer months)

Syntax:

add_months(#DateColumn#, number_of_months) as <<NewAddColumn>>

New column name: NewAddColumn

Note

The Add months function can only be applied to Date datatype columns

Example:

add_months(#DateOfHire#, 12) as <<NewAddColumn>>

Assuming new employees have a 12-month trial period, adding 12 months to the Date of hire column gives the date when the trial period ends.



Append

append( ): Appending a variable with a pattern to create a new variable.

Syntax:

append(#column1#, 'Pattern to be appended') AS <<NewColumnApp>>

New column name: NewColumnApp

Note

Applies to both Categorical and Numerical columns; the new field which is created is of Categorical type.

Example:

append(#temperature#, 'Deg C') AS <<AppCol1>>

Assuming the temperature column is numeric, appending the string ‘Deg C’ will append the text to the numerical value. For example, the numerical value ‘13.5’ becomes ‘13.5 Deg C’, and the new column AppCol1 will be of string type.



Ascii

ascii( ): Function returns numerical value of the first character of a string

Syntax:

ascii(#Column#) as <<NewAsciiColumn>>

New column name: NewAsciiColumn

Note

Works for categorical columns

Example:

ascii(#CategoricalColumn#) AS <<NewAsciiColumn>>

We have a categorical column “Geography” and we want to transform the variables into their numerical ASCII representation. For the variables “France”, “Spain” and “Germany” the transformation returns “70”, “83” and “71”, the ASCII codes of the first characters ‘F’, ‘S’ and ‘G’.



Arcsin

asin( ): The arcsin function is the inverse of the sine function. It returns the angle whose sine is a given number. Use arcsin when you know the sine of an angle and want to know the actual angle

Syntax:

asin(#Column#) as <<NewAsinColumn>>

New column name: NewAsinColumn

Note

The asin function can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

asin(#SineColumn#) as <<NewAsinColumn>>

Assuming we have a column with sine values (-1 to 1) we can determine the asin value. If our sine value is 0.85 we will get an asin value of 1.0159852938148251.



Atan - Arc tangent

atan( ): The Atan function returns the arc tangent in radians (which is the inverse tangent), of its argument. The arctangent means the angle whose tangent is the argument.

Syntax:

atan(#Column#) as <<NewAtanColumn>>

New column name: NewAtanColumn

Note

The atan function can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

atan(#NumericColumn#) as <<NewAtanColumn>>

Assuming we have a column with numerical value and a row has value of 37, the Atan function will return the arc tangent value of 1.5437758 - angle in radians.



Atan2

atan2( ): The atan2 function calculates one unique arc tangent value from two variables x and y and returns angle in radians

Syntax:

atan2(#Column_Y#, #Column_X#) as <<NewAtan2Column>>

New column name: NewAtan2Column

Note

The atan2 function can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

atan2(#NumericColumnY#, #NumericColumnX#) as <<NewAtan2Column>>

Assuming we have two numerical columns and a row has values y=44 and x=10, the atan2 function will return the value of 1.3473197256542635



Base64

base64( ): The base64 function is most commonly used to encode binary data and transform the argument from binary format to a base 64 string

Syntax:

base64(#Column#) as <<NewBaseColumn>>

New column name: NewBaseColumn

Example:

base64(#CategoricalColumn#) as <<NewBaseColumn>>

Assuming we have a categorical column with locations and one variable is France, the base64 function will transform the value to RnJhbmNl as a base 64 string
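
The same encoding can be checked with Python's standard base64 module:

```python
import base64

# Encode the string "France" exactly as the transformation does
encoded = base64.b64encode("France".encode("utf-8")).decode("ascii")
```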



Bigint

Bigint( ): The bigint function returns a 64-bit integer representation of a number or character string in the form of an integer constant

Syntax:

bigint(#Column#) as <<NewBigintColumn>>

New column name: NewBigintColumn



Bin

Bin( ): The bin function takes in integer x and returns the binary representation of x in a string format

Syntax:

bin(#Column#) as <<NewBinColumn>>

New column name: NewBinColumn

Example:

bin(#NumericColumn#) as <<NewBinColumn>>

Assuming we have a numerical column with a value of 15, the Bin function will return its binary value of 1111



Bit-length

Bit-length( ): The function returns the length of the given string in bits

Syntax:

bit_length(#Column#) as <<NewBitLColumn>>

New column name: NewBitLColumn

Note

The function can only be applied to Categorical datatype columns

Example:

bit_length(#Geography#) as <<NewBitLColumn>>

Assuming we have a column with France as a variable, the function will return a value of 48 as a length of the string in bits



Bround

Bround( ): Returns expression rounded to n decimal places using HALF_EVEN rounding mode

Syntax:

bround(#Column#, n) as <<NewBroundColumn>>

New column name: NewBroundColumn

Example:

bround(#NumericColumn#, 3) as <<NewBroundColumn>>

Assuming we have a numerical column with a value of 51.93487, the Bround function will return its rounded value of 51.935



Char

Char( ): The char function returns the character based on the ASCII code

Syntax:

char(#Column#) as <<NewCharColumn>>

New column name: NewCharColumn

Example:

char(#NumericColumn#) as <<NewCharColumn>>

Assuming we have a numerical column with a value of ASCII = 77, the Char function will return the character M



Char_length

Char_length( ): Gives the length in characters of the input string

Syntax:

char_length(#Column#) as <<NewCharLColumn>>

New column name: NewCharLColumn

Note

The function can only be applied to Categorical datatype columns

Example:

char_length(#CategoricalColumn#) as <<NewCharLColumn>>

Assuming we have a categorical column “Geography” then the variable “Spain” will return 5 as a length of characters, “Germany” will return 7 etc.



Coalesce

Coalesce( ): The function returns the first non-null expression in the list. If all expressions evaluate to null, then the Coalesce function returns null

Syntax:

coalesce(#Column1#, #Column2#, ..., #ColumnN#) as <<NewCoalColumn>>

New column name: NewCoalColumn

Note

The function can be applied to Categorical and Numerical datatype columns

Example:

coalesce(#FinishDate#, #StartDate#) as <<NewCoalColumn>>

Assuming we have columns “StartDate” and “FinishDate”, and some of the values in “FinishDate” are not filled in because the task is not completed yet. In this case the transformation looks at “FinishDate” first (listed first in the syntax); if that value is null it returns the “StartDate” value.
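
The first-non-null rule can be sketched in Python:

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are."""
    return next((v for v in values if v is not None), None)

finish = coalesce(None, "2021-01-05")  # FinishDate missing, StartDate used
```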



Concat

Concat( ): The function is used to concatenate two strings to make a single string, for example concatenate two or more columns in one

Syntax:

concat(#Column1#, #Column2#, ..., #ColumnN#) as <<NewConcColumn>>

New column name: NewConcColumn

Example:

concat(#Sex#, #Age#) as <<NewConcColumn>>

Assuming we have a column “Sex” and “Age” and we want both of these variables in one single column. After applying the concat function, we will get results like “Male24”, “Female43” etc.



Concat_ws

Concat_ws( ): The function joins the input strings into a single string. It separates those concatenated strings with the separator specified in the first argument

Syntax:

concat_ws('separator', #Column1#, #Column2#, ..., #ColumnN#) as <<NewConcWSColumn>>

New column name: NewConcWSColumn

Note

The CONCAT_WS() requires at least two input strings

Example:

concat_ws('/', #Sex#, #Age#) as <<NewConcWSColumn>>

Assuming we have a column “Sex” and “Age” and we want both of these variables in one single column separated by “/”. After applying the concat function, we will get results like “Male/24”, “Female/43” etc.



Conv

Conv( ): The Conv function converts a number from one numeric base system to another, and returns the result as a string value

Syntax:

conv(#Column#, from_base, to_base) as <<NewConvColumn>>

New column name: NewConvColumn

Example:

conv(#Column1#, 11, 3) as <<NewConvColumn>>

Convert a column from numeric base system 11 to numeric base system 3

Note

This function returns NULL if any of the parameters are NULL



Cos

Cos( ): The Cos function returns the cosine of a number

Syntax:

cos(#Column#) as <<NewCosColumn>>

New column name: NewCosColumn

Note

The Cos function can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

cos(#NumericalColumn#) as <<NewCosColumn>>

Assuming we have a numerical column with angles (in radians) that we want to convert into cosine values. A value of 2 will be converted into -0.416147.



Cosh

Cosh( ): The Cosh function returns the hyperbolic cosine of a number

Syntax:

cosh(#Column#) as <<NewCoshColumn>>

New column name: NewCoshColumn

Note

The Cosh function can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

cosh(#NumericalColumn#) as <<NewCoshColumn>>

Assuming we have a numerical column that we want to convert into hyperbolic cosine values. A value of 3 will be converted into 10.0677.



Cot

Cot( ): The Cot function returns the cotangent of a number

Syntax:

cot(#Column#) as <<NewCotColumn>>

New column name: NewCotColumn

Note

The Cot function can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

cot(#NumericalColumn#) as <<NewCotColumn>>

Assuming we have a numerical column with angles (in radians) that we want to convert into cotangent values. A value of 6 will be converted into -3.436353.



Crc32

Crc32( ): The Crc32 function calculates a 32-bit CRC (cyclic redundancy checksum) for a string. This function can be used to validate data integrity

Syntax:

crc32(#Column#) as <<NewCrcColumn>>

New column name: NewCrcColumn

Note

The Crc32 function can only be applied to Categorical datatype columns

Example:

crc32(#CategoricalColumn#) as <<NewCrcColumn>>

Assuming we have a categorical column and we want to convert it into 32-bit CRC value we use this function. A string word “France” will be converted into 1493835906.



Cube Root

For video tutorial see: SEDGE YouTube channel - C functions.

cbrt() : Finds the cube root of the value

Syntax:

cbrt(#MPG#) as <<Newcolcbrt>>

New column name: Newcolcbrt

Note

  • The cbrt function can only be applied to Numerical datatype columns.

  • The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces.

  • If there are any missing rows in the numerical column, those rows will be ignored.

Example:

cbrt(#NewBornWeight#) AS <<Newcbrtcolumn>>

Where #NewBornWeight# includes the following values 729, 27, 5832. After applying cbrt(#NewBornWeight#) the values will change to 9, 3, 18



Ceiling

ceil() : Ceiling value of a number returns the greatest integer, greater than or equal to the number. The ceil of number 2.5 will generate the value 3. The ceil of the value -2.5 will generate the value -2.

Syntax:

ceil(#column1#) AS <<NewCeilColumn>>

New column name: NewCeilColumn

Note

The ceil value can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= or spaces. If there are any missing rows in the numerical column, those rows will be ignored.

Example:

ceil(#NewBornWeight#) AS <<NewCeilColumn>>

Where #NewBornWeight# includes the following values 2.2, 3.5, 2.8, (missing), 2.7. After applying ceil(#NewBornWeight#) the values will change to 3, 4, 3, (missing), 3
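
The same rounding can be checked with Python's math.ceil:

```python
import math

weights = [2.2, 3.5, 2.8, 2.7]
ceiled = [math.ceil(w) for w in weights]
negative_example = math.ceil(-2.5)  # ceiling of a negative moves toward zero
```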



Copy

Copy( ): Copying a variable and naming it as a new variable

Syntax:

copy(#column1#) AS <<NewCopiedColumn>>

New column name: NewCopiedColumn

Note

Applies to both Categorical and Numerical columns; the new field which is created has the same datatype as the copied column.

Example:

copy(#Age#) as <<new_age>>

Column ‘Age’ is copied as new field ‘new_age’



Current_date

Current_date( ): The CURRENT_DATE() function returns the current date at the time of query execution

Syntax:

current_date() as <<NewCDColumn>>

New column name: NewCDColumn

Note

The date is returned as “YYYY-MM-DD” (string) or as YYYYMMDD (numeric)

Example:

current_date() as <<NewCDColumn>>

A new column with today’s date is created - could be used to calculate a time difference between any other date column and current date.



Datediff

datediff( ): Returns the number of days between two date columns

Syntax:

datediff(#column1#, #column2#) as <<Newcol_diff>>

New column name: Newcol_diff

Note

  • The datediff function is only applied for date_columns

  • #column1#, #column2# must be a valid date_column

Example:

datediff(#DOB#, #Date of Hire#) as <<diff_days>>

Above example returns the number of days between two date columns
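
The whole datediff family maps onto Python's datetime arithmetic; a sketch with hypothetical dates:

```python
from datetime import datetime

d1 = datetime(2020, 1, 1, 8, 0, 0)
d2 = datetime(2020, 1, 31, 8, 0, 0)
delta = d2 - d1
diff_days = delta.days                         # datediff
diff_hrs = int(delta.total_seconds() // 3600)  # datediff_hrs
diff_mins = int(delta.total_seconds() // 60)   # datediff_mins
diff_secs = int(delta.total_seconds())         # datediff_secs
```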



Datediff_hrs

datediff_hrs( ): Returns the number of hours between two date columns

Syntax:

datediff_hrs(#column1#, #column2#) as <<Newcol_diffhrs>>

New column name: Newcol_diffhrs

Note

  • The datediff_hrs function is only applied for date_columns

  • #column1#, #column2# must be a valid date_columns

Example:

datediff_hrs(#DOB#, #Date of Hire#) as <<diff_hrs>>

Above example returns the number of hours between two date columns



Datediff_mins

datediff_mins( ): Returns the number of minutes between two date columns

Syntax:

datediff_mins(#column1#, #column2#) as <<Newcol_diffmins>>

New column name: Newcol_diffmins

Note

  • The datediff_mins function is only applied for date_columns

  • #column1#, #column2# must be a valid date_columns

Example:

datediff_mins(#DOB#, #Date of Hire#) as <<diff_mins>>

Above example returns the number of minutes between two date columns



Datediff_secs

datediff_secs( ): Returns the number of seconds between two date columns

Syntax:

datediff_secs(#column1#, #column2#) as <<Newcol_diffsecs>>

New column name: Newcol_diffsecs

Note

  • The datediff_secs function is only applied for date_columns

  • #column1#, #column2# must be a valid date_columns

Example:

datediff_secs(#DOB#, #Date of Hire#) as <<diff_secs>>

Above example returns the number of seconds between two date columns



Date_add

date_add(): add days to the selected date

Syntax:

date_add(#Column#, Number_of_days_to_add) as <<Newcol_dateadd>>

New column name: Newcol_dateadd

Note

  • The date_add function is only applied for date_columns

  • The Number_of_days_to_add field must be an integer

Example:

date_add(#DOB#,3) as <<dateadd>>

Above example adds days to the selected date



Date_value

date_value( ): converts date to serial number

Syntax:

date_value(#column#, 'offset_date') as <<Newcol_value>>

New column name: Newcol_value

Note

  • The date_value function is only applied for date_columns

  • Offset_date must be a valid date format(yyyy-MM-dd)

Example:

date_value(#DOB#, '1900-01-01') as <<datevalue>>

Above example shows the conversion of date to serial number



Current_timestamp

Current_timestamp( ): The CURRENT_TIMESTAMP() function returns the current date and time

Syntax:

current_timestamp() as <<NewCTColumn>>

New column name: NewCTColumn

Note

The date and time is returned as “YYYY-MM-DD HH:MM:SS” (string) or as YYYYMMDDHHMMSS.uuuuuu (numeric)

Example:

current_timestamp() as <<NewCTColumn>>

A new column with today’s date and time is created - could be used to calculate a time difference between any other date column and current date.



Date_isin

Date_isin ( ): The function is used to check if a date column contains a value from selected date interval or not

Syntax:

date_isin(#DateColumn#, 'start_date', 'end_date') as <<NewDIIColumn>>

New column name: NewDIIColumn

Note

  • The date_isin function is only applied for date_column

  • start_date and end_date must be in a valid date format (yyyy-MM-dd)

Example:

date_isin(#DOB#, '2009-05-24', '2020-01-24') as <<NewDIIColumn>>

Above example checks whether each value in the given date column lies between the given start and end dates



Date_sub

Date_sub ( ): The function returns the date after subtracting a selected number of days from the original date in the column

Syntax:

date_sub(#DateColumn#, number_of_days) as <<NewDateSubColumn>>

New column name: NewDateSubColumn

Example:

date_sub(#DateOfHire#, 7) as <<NewDateSubColumn>>

Above example shows the given date column with 7 days subtracted from each variable



Date_trunc

Date_trunc ( ): The function returns date with the time portion of the day truncated to the unit specified by the format model

Format should be one of [“YEAR”, “YYYY”, “YY”, “MON”, “MONTH”, “MM”, “DAY”, “DD”, “HOUR”, “MINUTE”, “SECOND”, “MILLISECOND”, “MICROSECOND”, “WEEK”, “QUARTER”]

Syntax:

date_trunc('format', #DateColumn#) as <<NewTruncColumn>>

New column name: NewTruncColumn



Day

Day ( ): The Day function returns the day of the month for a given date (a number from 1 to 31)

Syntax:

day(#DateColumn#) as <<NewDayColumn>>

New column name: NewDayColumn

Example:

day(#DateOfHire#) as <<NewDayColumn>>

Above example will extract each day from the respective column - for example ‘2020-09-27’ will return ‘27’ as a result



Dayofmonth

Dayofmonth ( ): The function returns the day of the month for a given date (a number from 1 to 31)

Syntax:

dayofmonth(#DateColumn#) as <<NewDOMColumn>>

New column name: NewDOMColumn

Example:

dayofmonth(#DateOfHire#) as <<NewDOMColumn>>

Above example will extract each day from the respective column - for example ‘2020-09-27’ will return ‘27’ as a result



Dayofweek

Dayofweek ( ): The function returns the weekday index for a given date (a number from 1 to 7)

Syntax:

dayofweek(#DateColumn#) as <<NewDOWColumn>>

New column name: NewDOWColumn

Note

1=Sunday, 2=Monday, 3=Tuesday, 4=Wednesday, 5=Thursday, 6=Friday, 7=Saturday Input date column must be in format ‘yyyy-MM-dd’

Example:

dayofweek(#DateOfHire#) as <<NewDOWColumn>>

Above example will extract the number of the day in the week from the respective column - for example ‘2021-03-27’ will return ‘7’ for Saturday as a result
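
Python's isoweekday (1=Monday ... 7=Sunday) can be remapped to this 1=Sunday convention:

```python
from datetime import date

def dayofweek(d):
    """1=Sunday, 2=Monday, ..., 7=Saturday."""
    return d.isoweekday() % 7 + 1

idx = dayofweek(date(2021, 3, 27))  # a Saturday
```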



Dayofyear

Dayofyear ( ): The function returns the day of the year for a given date (a number from 1 to 366)

Syntax:

dayofyear(#DateColumn#) as <<NewDOYColumn>>

New column name: NewDOYColumn

Note

Input date column must be in format ‘yyyy-MM-dd’

Example:

dayofyear(#DateOfHire#) as <<NewDOYColumn>>

Above example will extract the number of the day in the year from the respective column - for example ‘2021-02-27’ will return ‘58’ as a result



Degrees

Degrees ( ): The function converts a value in radians to degrees

Syntax:

degrees(#Column#) as <<NewDegrColumn>>

New column name: NewDegrColumn

Example:

degrees(#RadianColumn#) as <<NewDegrColumn>>

Above example will convert a value in radians to degrees - for example a radian value 2 will give result of 114.592 in degrees.



Delete

For video tutorial see: SEDGE YouTube channel - D functions.

delete( ): Deletes the string characters between the given start and end positions in the column, for example from the 2nd to the 5th position

Syntax:

delete(#column1#, 2, 5) AS <<NewColumnDel>>

New column name: NewColumnDel

Note

Applies to Categorical data type and the new column which is created is also Categorical type.

Example:

delete(#column1#, 1, 6) AS <<NewColumnDel>>

The characters in column 1 in the position from 1st Character to the 6th Character is deleted. If the data in column 1 is “Hello World”, “Great Day”, after applying the formula, the new field will have the values “World” and “Day”.



Double

Double ( ): The function casts the value to the target data type

Syntax:

double(#Column#) as <<NewDoubColumn>>

New column name: NewDoubColumn



Effect

effect( ): The EFFECT Function returns the effective annual interest rate, from the nominal annual interest rate (APR), and the number of compounding periods per year.

The arguments can be explained as:

  • Nominal - Annual interest rate, also known as APR (Annual Percentage Rate)

  • Npery - Number of periods that interest is calculated

Syntax:

effect(#nominal_rate#, #npery#) as <<Newcol_effect>>

New column name: Newcol_effect

Example:

effect (#int_rate#, #open_acc#) as <<new_effect>>

The above example returns the effective annual interest rate calculated from the nominal rate and the number of compounding periods.
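The underlying formula is (1 + nominal/npery)^npery - 1; a Python sketch (the 5.25% APR compounded quarterly is an illustrative figure, not a SEDGE example):

```python
def effect(nominal_rate: float, npery: int) -> float:
    # Effective annual rate from nominal APR and compounding periods per year
    return (1 + nominal_rate / npery) ** npery - 1

print(round(effect(0.0525, 4), 6))  # 5.25% APR, quarterly → ≈ 0.053543
```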



Exponential

For video tutorial see: SEDGE YouTube channel - E functions.

exp( ): The exp() function calculates the exponential value with the base set to e. e is a mathematical constant, approximately equal to 2.71828, also known as Euler's number.

Syntax:

exp(#column1#) AS <<NewColumnExpo>>

New column name: NewColumnExpo

Note

  • A - e is a Mathematical constant, with a value approximately equal to 2.71828

  • B - Applies to Numerical data type and the new column which is created is also Numerical type.

Example:

exp(#column1#) AS <<NewColumnExpo>>

Assuming the numerical value in #column1# is 2, exp(2) means 2.71828 raised to the power of 2 (approximately 7.389)



Expm1

Expm1 ( ): The expm1 function (e ^ x) - 1 returns e raised to the given power argument, minus 1.0, where e is the base of the natural logarithm (Euler's number, approximately 2.71828)

Syntax:

expm1(#Column#) as <<NewExpm1Column>>

New column name: NewExpm1Column

Example:

expm1(#NumericalColumn#) AS <<NewExpm1Column>>

Assuming the numerical value in the #column# is 3 then expm1 result will be 19.08554



Factorial

For video tutorial see: SEDGE YouTube channel - F functions.

factorial( ): Factorial of an integer number.

Syntax:

factorial(#column1#) AS <<NewColumnFact>>

New column name: NewColumnFact

Note

  • A - Applies to Numerical data type and the new column which is created is also Numerical type.

  • B - If the value is non-integer, then the factorial will change the number to integer value and then apply factorial on integer.

Example:

factorial(#column1#) AS <<NewColumnFact>>

Assuming the numerical value in #column1# is 5, factorial(5) means 5 x 4 x 3 x 2 x 1 = 120. If the value in the column is non-integer, factorial(4.3) is evaluated as factorial(4) = 4 x 3 x 2 x 1 = 24.
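The truncate-then-factorial behaviour described in the note can be sketched in Python (function name illustrative):

```python
import math

def sedge_factorial(x: float) -> int:
    # Non-integer inputs are truncated to an integer first, as the note describes
    return math.factorial(int(x))

print(sedge_factorial(5))    # → 120
print(sedge_factorial(4.3))  # → 24
```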



Find

find( ): Finds the given word within the field, and returns True if the full word is found, False if not.

Syntax:

find(#marital_status#,'Married') AS <<NewColumnFind>>

The first argument is the field name within #, for example #marital_status#. The second is the text / word to be found, within single quotes, for example ‘Married’.

New column name: NewColumnFind

Note

  • A Applies to Categorical data type and the new column which is created is Categorical type and the output is True or False.

  • B If only part of the word is present, it will not return True.

  • C The word being found is case sensitive.

Example:

find(#marital_status#,'Married') AS <<NewColumnFind>>

If the word ‘Married’ is found in the field, the function returns True, otherwise False.



Float

Float ( ): The Float function converts a specified value into a floating point number.

Syntax:

float(#Column#) as <<NewFloatColumn>>

New column name: NewFloatColumn



Floor

floor( ) : Floor value of a number returns the largest integer value less than or equal to the specified number. The floor of number 2.5 will generate the value 2. The floor of the value -2.5 will generate the value -3.

Syntax:

floor(#column1#) AS <<Newfloorcolumn>>

New column name: Newfloorcolumn

Note

The floor function can only be applied to Numerical datatype columns. The new column name should not contain special characters such as !@#$%^&*()-+= or spaces. If a row in the numerical column is missing, that row is ignored.

Example:

floor(#NewBornWeight#) AS <<Newfloorcolumn>>

Where #NewBornWeight# includes the values 2.2, 3.5, 2.8, , 2.7, after applying floor(#NewBornWeight#) the values become 2, 3, 2, , 2
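Python's `math.floor` has the same behaviour, including the negative case described above:

```python
import math

print(math.floor(2.5))   # → 2
print(math.floor(-2.5))  # → -3
print(math.floor(2.2))   # → 2
```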



Format_number

Format_number ( ): The Format number function formats a number to a format like “#,###,###.##”, rounded to a specified number of decimal places, then it returns the result as a string

Syntax:

format_number(#Column#, number of digits to be rounded to) as <<NewNumForColumn>>

New column name: NewNumForColumn

Example:

format_number(#Distance#, 2) as <<NewNumForColumn>>

Where distance is 153.4625 the result string will be 153.46.
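Python's format mini-language produces the same grouped, rounded string (function name illustrative):

```python
def format_number(x: float, digits: int) -> str:
    # Thousands separators plus rounding to the requested decimal places
    return f"{x:,.{digits}f}"

print(format_number(153.4625, 2))     # → "153.46"
print(format_number(1234567.891, 2))  # → "1,234,567.89"
```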



From unixtime

From_unixtime ( ): This function converts the UNIX timestamp to default Datetime format (yyyy-MM-dd HH:mm:ss)

Syntax:

from_unixtime(#Column#) as <<NewFrUnColumn>>

New column name: NewFrUnColumn

Example:

from_unixtime(#DateColumn#) as <<NewFrUnColumn>>

Where the UNIX time is 1255033470, the converted result will be 2009-10-08 13:24:30 (in the session time zone)
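Note that the wall-clock result depends on the time zone applied to the epoch value; a Python sketch in UTC (the manual's 13:24:30 corresponds to a UTC-7 offset for the same input):

```python
from datetime import datetime, timezone

def from_unixtime(ts: int) -> str:
    # Interpret the epoch seconds in UTC and render the default format
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(from_unixtime(1255033470))  # → "2009-10-08 20:24:30" (UTC)
```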



From-utc-timestamp

From_utc_timestamp ( ): The function converts a timestamp in UTC into the given time zone and returns the result in ‘YYYY-MM-DD HH:MM:SS’ format.

Syntax:

from_utc_timestamp(#Datecolumn#, 'timezone') as <<NewTimeColumn>>

For the list of time zones click on the link below: List of time zones.

New column name: NewTimeColumn

Example:

from_utc_timestamp(#Datecolumn#, 'Europe/Istanbul') as <<NewTimeColumn>>

Where the selected date is 2016-08-31 and the time zone is Europe/Istanbul, the converted result will be 2016-08-31 03:00:00



FV

fv( ): The FV function is a financial function that returns the future value of an investment. We can use the FV function to get the future value of an investment assuming periodic, constant payments with a constant interest rate.

  • Rate - Interest rate per compound period

  • N - The total number of payment periods

  • Pmt - The payment per period

  • Pv - The present value of the investment

  • Type - When payments are deposited
    1. end = End of each period

    2. begin = Beginning of each period

Syntax:

fv(#Annual_Interest_Rate#, #Number_of_Periods#, #Amount_of_Payments#, #Present_value#, 'end') as <<Newcol_fv>>

New column name: Newcol_fv

Example:

fv(#int_rate#, #open_acc#, #Par#, #pv#) as <<new_fv>>

The above example returns the future value of the investment.



Geo Dist Kms

Geo_dist_kms ( ): The function calculates the distance in kilometres between two geographical points

Syntax:

geo_dist_kms(#Lati_A#, #Lati_B#, #Longi_A#, #Longi_B#) as <<NewGDKColumn>>

New column name: NewGDKColumn

For the example chart click on the link below: Geo lat long chart.

Example:

geo_dist_kms(#Lati_A#, #Lati_B#, #Longi_A#, #Longi_B#) as <<NewGDKColumn>>

Where the selected Latitude A is 38.631913, Longitude A is -121.434879, Latitude B is 38.502519 and Longitude B is -121.420769, the calculated distance is 14.44016 kms
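SEDGE does not document which distance formula it uses, but the haversine great-circle formula reproduces the manual's example to within about 0.01 km; a Python sketch:

```python
import math

def geo_dist_kms(lat_a, lon_a, lat_b, lon_b):
    # Haversine great-circle distance on a sphere of mean Earth radius
    r = 6371.0  # km
    p1, p2 = math.radians(lat_a), math.radians(lat_b)
    dp = math.radians(lat_b - lat_a)
    dl = math.radians(lon_b - lon_a)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

print(round(geo_dist_kms(38.631913, -121.434879, 38.502519, -121.420769), 2))
```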


Greatest

Greatest ( ): The function returns the greatest value from the list of arguments

Syntax:

greatest(#Column1#, #Column2#, ..., #ColumnN#) as <<NewGreatColumn>>

New column name: NewGreatColumn

Example:

greatest(#Temperature1#, #Temperature2#, ..., #TemperatureN#) as <<NewGreatColumn>>

Where the selected temperatures are 10, 4 and 35 degrees the function will select 35 value as the greatest one



Hexadecimal

For video tutorial see: SEDGE YouTube channel - H functions.

hex( ): Converts decimal numbers (base 10) to hexadecimal (base 16). As a base-16 numeral system, it uses 16 symbols: the ten decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and the first six letters of the English alphabet (A, B, C, D, E, F), which represent the values 10 to 15 in a single symbol each.

Syntax:

hex(#column1#) AS <<NewColumnHex>>

New column name: NewColumnHex

Note

Applies to numerical data type and the new column which is created is Categorical type.

Example:

hex(#column1#) AS <<NewColumnHex>>

Converts the value from the decimal numeral system to the hexadecimal numeral system.

_images/hexadecimal.png

image Hexadecimal explanation



Holiday_count

holiday_count( ): Returns the number of holidays between two given date columns based on country

Syntax:

holiday_count(#column1#,#column2#, 'countrycode') as <<Newcol_holidaycount>>

New column name: Newcol_holidaycount

For the list of country codes click on the link below: List of country codes.

Example:

holiday_count(#DOB#, #Date of Hire#, 'US') as <<holidaycount>>

The above example returns the number of holidays between the two given date columns for the specified country



Hour

hour( ): The function returns the hour part of a given datetime (a number from 0 to 23)

Syntax:

hour(#Datecolumn#) as <<NewHourColumn>>

New column name: NewHourColumn

Example:

hour(#Datecolumn#) as <<NewHourColumn>>

The above example returns the hour number from the column - a value of ‘2019-07-28 09:15:00’ will return ‘9’ as a result



Hypot

hypot( ): The function returns the square root of the sum of squares of its arguments - can be used as a calculation of hypotenuse of a triangle

Syntax:

hypot(#Column1#, #Column2#) as <<NewHypotColumn>>

New column name: NewHypotColumn

Example:

hypot(#TriangleLegA#, #TriangleLegB#) as <<NewHypotColumn>>

In case the length of one leg is 10 and second leg is 40 the hypotenuse result will be 41.23106.
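Python's `math.hypot` computes the same value, so the example is easy to verify:

```python
import math

# Hypotenuse of a right triangle with legs 10 and 40: sqrt(10**2 + 40**2)
print(math.hypot(10, 40))  # → 41.23105625617661
```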



If Else And

For video tutorial see: SEDGE YouTube channel - I functions.

If Else And: The if/else statement executes a code if a specified condition is true. If the condition is false, another code is executed.

Syntax:

if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <<NewAgeColumn2If>>

New column name: NewAgeColumn2If

Note

  • A Applies to Categorical, Numerical, Text or Date data type and the new column which is created is either Categorical or Numerical type.

  • B The column name is case sensitive and the existing columns should be defined within #.

  • C The And operator is also used to join multiple conditions.

Example:

if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <<NewAgeColumn2If>>

In the above example a NewAgeColumn2If column is created where age values between 0 and 18 are classified as ‘Child’, ages from 19 to 50 as ‘Adult’, ages between 51 and 70 as ‘Mature’ and the rest as ‘OLD’.

_images/if_else.png

image If Else function
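The nested if() banding above is equivalent to an ordinary if/elif chain; a Python sketch (function name illustrative):

```python
def age_group(age: float) -> str:
    # Same age bands as the nested if() formula above
    if 0 <= age <= 18:
        return "Child"
    elif 19 <= age <= 50:
        return "Adult"
    elif 51 <= age <= 70:
        return "Mature"
    return "OLD"

print(age_group(12), age_group(35), age_group(60), age_group(80))
```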



Ifnull

Ifnull ( ): The function returns a specified value if the expression is NULL. If the expression is NOT NULL, the function returns the expression

Syntax:

ifnull(#Column#, value) as <<NewIfNullColumn>>

New column name: NewIfNullColumn

Example:

ifnull(#Balance#, 0) as <<NewIfNullColumn>>

If the value in the ‘Balance’ column is NULL, the function returns 0; otherwise it returns the original balance



Initcap

Initcap ( ): The function returns the expression with the first letter of each word in uppercase. The rest of the letters are in lowercase. Words are delimited by white space

Syntax:

initcap(#Column#) as <<NewInitCapColumn>>

New column name: NewInitCapColumn

Example:

initcap(#Gender#) as <<NewInitCapColumn>>

In case values like ‘male’ and ‘female’ start with a lowercase letter, the new column after applying the function will contain ‘Male’ and ‘Female’



Int

Int ( ): The function returns the numeric integer equivalent from a given expression

Syntax:

int(#Column#) as <<NewIntColumn>>

New column name: NewIntColumn

Example:

int(#Column#) as <<NewIntColumn>>

If a numerical column contains the value 7.75, the integer after transformation will be 7



Insert

insert( ): Inserts a word at the specified position in the field.

Syntax:

insert(#column#, position,'_WordToInsert') as <<NewColumnIns>>

The first argument is the field name within # where the word is to be inserted, for example #occupation#. The second is the number giving the position at which the word will be inserted, for example 4. The third is the text / word to insert, within single quotes, for example ‘_SERVICE’.

New column name: NewColumnIns

Note

  • A Applies to Categorical and Text data type and the new column which is created is Categorical type.

  • B If the field is empty, the insert function will not insert the word.

Example:

insert(#occupation#,4,'_SERVICE') as <<NewColumnIns>>

The word ‘_SERVICE’ is inserted after the 4th position of the characters found in the field #occupation#.



Instr

Instr ( ): The function returns the position of the first occurrence of a string in another string

Syntax:

instr(#Column#, 'substr') as <<NewInstrColumn>>

New column name: NewInstrColumn

Example:

instr(#Name#, 'Mr.') as <<NewInstrColumn>>

We search for the string ‘Mr.’ in the name column and the function returns the position at which it was found. If the string is not present, the function returns ‘0’.



Inverse sign

Inverse_sign ( ): The function returns the negated values i.e. positive to negative and vice versa of given numerical column

Syntax:

inverse_sign(#Column#) as <<NewInvSignColumn>>

New column name: NewInvSignColumn

Example:

inverse_sign(#Temperature#) as <<NewInvSignColumn>>

If the selected ‘Temperature’ column has values like 20, 7, -6 and -18, after the transformation the function returns -20, -7, 6 and 18.



Is_holiday

is_holiday( ): Checks whether each date in the given date column is a holiday or not, based on country.

Syntax:

is_holiday(#column#, 'countrycode') as <<Newcol_isholiday>>

New column name: Newcol_isholiday

For the list of country codes click on the link below: List of country codes.

Example:

is_holiday(#DOB#, 'US') as <<isholiday>>

The above example shows whether each date in the given column is a holiday for the specified country



Is_Text

is_text( ): Conditional statement to check if each field in the given column is Text. Returns True or False. Condition: a string is considered text if it is longer than 50 characters.

Syntax:

is_text(#column#) AS <<Newcol_istext>>

New column name: Newcol_istext

Note

  • Applies for both Numerical and Categorical datatypes to check whether the value is Text or Not

Example:

is_text(#age#) as <<istext_col>>

Checks whether the string in the field is text, returns condition True or False



Is_weekend

is_weekend( ): Checks whether each date in the given date column falls on a weekend or not, based on country.

Syntax:

is_weekend(#column#, 'countrycode') as <<Newcol_isweekend>>

New column name: Newcol_isweekend

For the list of country codes click on the link below: List of country codes.

Note

  • The is_weekend function is only applied for date columns

  • Countrycode field should have valid codes

Example:

is_weekend(#DOB#, 'US') as <<isweekend>>

The above example shows whether each date in the given column falls on a weekend for the specified country
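A minimal Python sketch of the weekend check, assuming a Saturday/Sunday weekend (country-specific weekend definitions, which SEDGE handles via the country code, would need extra handling):

```python
from datetime import date

def is_weekend(d: date) -> bool:
    # weekday(): Monday is 0 ... Saturday is 5, Sunday is 6
    return d.weekday() >= 5

print(is_weekend(date(2021, 1, 2)))  # → True (a Saturday)
print(is_weekend(date(2021, 1, 4)))  # → False (a Monday)
```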



Isnan

Isnan ( ): The function determines whether a value is NaN (Not-a-Number). This function returns true if the value equates to NaN. Otherwise it returns false

Syntax:

isnan(#Column#) as <<NewIsNaNColumn>>

New column name: NewIsNaNColumn

Example:

isnan(#PasswordColumn#) as <<NewIsNaNColumn>>

We check whether a value in the selected “PasswordColumn” is Not a Number - the value ‘567’ returns ‘false’ because it is a number, while ‘blue’ returns ‘true’ because it is not.



Isnull

Isnull ( ): The function returns 1 or 0 depending on whether an expression is NULL.

If expression is NULL, this function returns 1 as true. If not, it returns 0 as false.

Syntax:

isnull(#Column#) as <<NewIsNullColumn>>

New column name: NewIsNullColumn

Example:

isnull(#Column#) as <<NewIsNullColumn>>

We check whether a value in the selected “Column” is NULL. If the value in the column is ‘1’, the function returns 0 (false), as the value is not NULL.



Is not null

Is not null ( ): The function returns 1 or 0 depending on whether an expression is NULL.

If expression is not NULL, this function returns 1 as true. If it is NULL, it returns 0 as false.

Syntax:

isnotnull(#Column#) as <<NewIsNotNullColumn>>

New column name: NewIsNotNullColumn

Example:

isnotnull(#Column#) as <<NewIsNotNullColumn>>

We check whether a value in the selected “Column” is NOT NULL. If the value in the column is ‘1’, the function returns 1 (true), as the value is not NULL.



Is Numeric

is_numeric( ): Conditional statement to check if the value in the field is numeric, returns True or False

Syntax:

is_numeric(#Age#) AS <<NewColumnIsNum>>

New column name: NewColumnIsNum

Note

Applies to Categorical data type and the new column which is created is Categorical type.

Example:

is_numeric(#Age#) AS <<NewColumnIsNum>>

Checks whether the text in the field is numerical, returns condition True or False.



Last_day

Last_day ( ): The function extracts the last day of the month for a given date

Syntax:

last_day(#Column#) as <<NewLDColumn>>

New column name: NewLDColumn

Example:

last_day(#DateColumn#) as <<NewLDColumn>>

If we have a date column with a value of ‘2021-01-01’, the function returns the last day of the respective month, ‘2021-01-31’.
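The same lookup can be sketched with Python's standard library (function name illustrative):

```python
import calendar
from datetime import date

def last_day(d: date) -> date:
    # monthrange returns (weekday_of_first_day, number_of_days_in_month)
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

print(last_day(date(2021, 1, 1)))   # → 2021-01-31
print(last_day(date(2020, 2, 10)))  # → 2020-02-29 (leap year)
```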



Least

Least ( ): The function returns the smallest value from the list of arguments

Syntax:

least(#Column1#, #Column2#, ..., #ColumnN#) as <<NewLeastColumn>>

New column name: NewLeastColumn

Example:

least(#Balance#, #EstimatedSalary#) as <<NewLeastColumn>>

If we have numerical columns with multiple values, the function selects the smallest. If one column contains 14,567 and the second 6,553, the function returns 6,553 as the least value.



Left

For video tutorial see: SEDGE YouTube channel - L functions.

left( ): Extracts the given number of characters from the left of the field.

Syntax:

left(#nationality#,3) as <<NewColumnLeft>>

New column name: NewColumnLeft

Note

Applies to Categorical, Text, Numerical and Date data type and the new column which is created is Categorical type.

Example:

left(#nationality#,3) as <<NewColumnLeft>>

Example above, function extracts the first 3 characters in the field from left



Length

length( ): Gives the number of characters in the field

Syntax:

length(#column1#) AS <<NewColumnLen>>

New column name: NewColumnLen

Note

Applies to Categorical, Numerical, Text and date data type and the new column which is created is Numerical type.

Example:

length(#column1#) AS <<NewColumnLen>>

Generates the length of the characters in the field.



Levenshtein

Levenshtein ( ): The function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2

Syntax:

levenshtein(#Column1#, #Column2#) as <<NewLevenColumn>>

New column name: NewLevenColumn

Example:

levenshtein(#Password1#, #Password2#) as <<NewLevenColumn>>

If one value is “tomato” and second one is “mate” the Levenshtein distance is 3 - as we need to replace, insert or delete 3 characters to transform “tomato” into “mate”.
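The classic dynamic-programming edit distance reproduces the example; a self-contained Python sketch:

```python
def levenshtein(s1: str, s2: str) -> int:
    # Minimum number of insertions, deletions and substitutions
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (c1 != c2)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("tomato", "mate"))  # → 3
```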



Ln

Ln ( ): The function returns the natural logarithm of a number

Syntax:

ln(#Column#) as <<NewLnColumn>>

New column name: NewLnColumn

Example:

ln(#NumericalColumn#) as <<NewLnColumn>>

If the value in a column is ‘3’ the Ln function will return 1.0986122886681098 as a result.



Locate

Locate ( ): The function identifies the position of the first occurrence of a substring in a string. If the substring is not found in the original string, the function returns 0. The function ignores upper or lower cases

Syntax:

locate('substr', #Column#) as <<NewLocColumn>>

New column name: NewLocColumn

Example:

locate('r', #Surname#) as <<NewLocColumn>>

We are searching for the expression “r” in the surname column. Surname “Nobe” returns 0 as it does not contain an “r”. Surname “Taylor” returns “6” as the expression “r” is at the 6th position.



Log10

Log10 ( ): The function returns the logarithm of a number to base 10

Syntax:

log10(#Column#) as <<NewLog10Column>>

New column name: NewLog10Column

Example:

log10(#NumericColumn#) as <<NewLog10Column>>

We are searching for a natural logarithm of a number to base 10. In case the value is 4 the function will return 0.602059991328.



Log1p

Log1p ( ): The function returns the natural logarithm (base e) of 1 + the given number, where e is Euler's number, approximately equal to 2.71828

Syntax:

log1p(#Column#) as <<NewLog1pColumn>>

New column name: NewLog1pColumn

Example:

log1p(#NumericColumn#) as <<NewLog1pColumn>>

We are searching for a natural logarithm (base e) of 1 + a given number. In case the value is 4 the function will return 1.6094379124341003.



Log2

Log2 ( ): The function returns the logarithm of a number to base 2

Syntax:

log2(#Column#) as <<NewLog2Column>>

New column name: NewLog2Column

Example:

log2(#NumericColumn#) as <<NewLog2Column>>

We are searching for a natural logarithm with base 2 of a given number. In case the value is 4 the function will return 2.



Lower

lower( ): converts characters to lower case.

Syntax:

lower(#column1#) AS <<NewColumnLow>>

New column name: NewColumnLow

Note

Applies to categorical and text data type and the new column which is created is Categorical type.

Example:

lower(#column1#) AS <<NewColumnLow>>

Converts characters to lower case.



Lpad

Lpad ( ): The function left-pads a string with another string, to a certain specified length

Syntax:

lpad(#Column#, length, 'pad') as <<NewLpadColumn>>

New column name: NewLpadColumn

Example:

lpad(#Hello#, 8, 'ABC') as <<NewLpadColumn>>

The function will return ABCHello.



Ltrim

Ltrim ( ): The function removes leading spaces from a string

Syntax:

ltrim(#Column#) as <<NewLtrimColumn>>

New column name: NewLtrimColumn

Example:

ltrim(#      hello#) as <<NewLtrimColumn>>

The function will return ‘hello’ without spaces on the left side.



Logarithm

log( ): Gives the logarithmic value of a number. The natural log is the logarithm to the base of the mathematical constant e, an irrational and transcendental number approximately equal to 2.71828. Similarly, the user can use any base, such as 2 or 10, to calculate the logarithmic value.

Syntax:

log(Base value , #column1#) AS <<NewColumnLog>>

New column name: NewColumnLog

Note

Applies to Numerical data type and the new column which is created is Numerical type.

Example:

log('e',#column1#) AS <<NewColumn1>>

Example:

log(2,#column1#) AS <<NewColumn2>>

Performing a logarithmic transformation helps with data normality, especially if the data is skewed. The log transformation can make highly skewed distributions less skewed, which makes patterns in the data more interpretable.




Max

max( ): This function returns the maximum value in the specified column.

Syntax:

max(#column#) as <<Newcol_max>>

New column name: Newcol_max

Note

The max function is only applied for numerical columns

Example:

max(#Age#) as <<max_age>>

The above example returns the maximum age.



Md5

Md5 ( ): The Message Digest Algorithm 5 (MD5) is a cryptographic hash algorithm that is used to create a 128-bit string value from an arbitrary length string

Syntax:

md5(#Column#) as <<NewMd5Column>>

New column name: NewMd5Column

Example:

md5(#France#) as <<NewMd5Column>>

The function will return 0309a6c666a7a803fdb9db95de71cf01 as a result.
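The same kind of digest can be reproduced with Python's hashlib; the 'abc' input below is the standard RFC 1321 test vector rather than a SEDGE example:

```python
import hashlib

def md5_hex(s: str) -> str:
    # 128-bit MD5 digest rendered as 32 hexadecimal characters
    return hashlib.md5(s.encode("utf-8")).hexdigest()

print(md5_hex("abc"))  # → "900150983cd24fb0d6963f7d28e17f72"
```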



Mean

mean( ): This function returns the average of values in the specified column.

Syntax:

mean(#column#) as <<Newcol_mean>>

New column name: Newcol_mean

Note

The mean function is only applied for numerical columns.

Example:

mean(#Age#) as <<mean_age>>

Above example will return the average value of age column.



Median

median( ): This function returns the median value of a range of values in the specified column.

Syntax:

median(#column#) as <<Newcol_median>>

New column name: Newcol_median

Note

The median function is only applied for numerical columns.

Example:

median(#Age#) as <<median_age>>

Above example will return the median value of age column.



Millisec_todate

Millisec_todate ( ): The function creates a timestamp from the number of milliseconds since the UTC epoch - the Unix epoch is 00:00:00 UTC on 1 January 1970

Syntax:

millisec_todate(#Column#) as <<NewMiliColumn>>

New column name: NewMiliColumn

Example:

millisec_todate(#MilliSecColumn#) as <<NewMiliColumn>>

A value of 1609511200000 milliseconds will be converted into the corresponding timestamp, 2021-01-01 15:26:40.



Min

min( ): This function returns the minimum value in the specified column.

Syntax:

min(#column#) as <<Newcol_min>>

New column name: Newcol_min

Note

The min function is only applied for numerical columns

Example:

min(#Age#) as <<min_age>>

Above example will return the minimum of age.



Minute

Minute ( ): The function returns minute part of a time/datetime (from 0 to 59)

Syntax:

minute(#Datecolumn#) as <<NewMinuteColumn>>

New column name: NewMinuteColumn

Example:

minute(#2021/01/01/15:26:40#) as <<NewMinuteColumn>>

The above mentioned date (1st January 2021, 15:26:40) will give results of ‘26’ as the extracted minute.



Mod

Mod ( ): The function helps find a remainder after a number (dividend) is divided by another number (divisor)

Syntax:

mod(#Column1#, #Column2#) as <<NewModColumn>>

New column name: NewModColumn

Example:

mod(#30#, #4#) as <<NewModColumn>>

The above numbers give a result of ‘2’: dividing 30 by 4 gives 7 in whole numbers, with 2 as the remainder.



Month

Month ( ): The function extracts the month from a given date as number between 1 and 12

Syntax:

month(#Datecolumn#) as <<NewMonthColumn>>

New column name: NewMonthColumn

Example:

month(#2021/01/01/15:26:40#) as <<NewMonthColumn>>

The above mentioned date (1st January 2021, 15:26:40) will give results of ‘1’ as the extracted month.



Months_between

Months_between ( ): The function returns number of months between date1 and date2. If date1 is later than date2, then the result is positive. If date1 and date2 are on the same day of month, or both are the last day of month, time of the day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits. If we mention ‘false’ in the syntax, the number will not be rounded.

Syntax:

months_between(#EndDatecolumn#, #StartDatecolumn#, false) as <<NewMBColumn>>

New column name: NewMBColumn

Example:

months_between(#2003/01/01#, #2003/03/14#, false) as <<NewMBColumn>>

The above mentioned date difference will be calculated as -2.41935483870968.



Multiple Replacement

For video tutorial see: SEDGE YouTube channel - M functions.

mul_replacement( ): Multiple replacement replaces multiple characters from defined position, with new characters.

Syntax:

mul_replacement(#marital_status#,4,7,'_OK') As <<NewColumnMulRepl>>

New column name: NewColumn

Note

Applies to Categorical data type and the new column which is created is Categorical type.

Example:

mul_replacement(#marital_status#,4,7,'_OK') As <<NewColumnMulRepl>>

Assuming the data in the column #marital_status# is ‘Married’, the characters in positions 4 to 7 (‘ried’) are replaced with ‘_OK’ and the output will be ‘Mar_OK’.




Negate_bool

Negate_bool(): The Negate_bool function returns the negation of the boolean value (True/False) in the column.

Syntax:

Negate_bool(#column#) as <<Newcol_negate>>

New column name: Newcol_negate

Note

The Negate_bool function is only applied to columns containing boolean values.

Example:

Negate_bool(#Text#) as <<new_negate>>

The above example returns the negated value of the Text column.



Negative

Negative ( ): The function returns the negative absolute value of the values in a given numerical column

Syntax:

negative(#Column#) as <<NewNegColumn>>

New column name: NewNegColumn

Note

Applies for Numerical column and the resulting column is numerical data type

Example:

negative(#data#) as <<negative_col>>

The function will first perform the absolute value and after that it will convert the value into a negative one



Next_day

Next_day ( ): The function returns the date of the first occurrence of the specified day of the week after the given date

Syntax:

next_day(#Datecolumn#, 'day_of_week') as <<NewNDColumn>>

New column name: NewNDColumn

Example:

next_day(#2020-08-10#, 'WED') as <<NewNDColumn>>

The function will return the next Wednesday after the mentioned date which is 2020-08-12.
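The look-ahead logic can be sketched in Python; the three-letter day codes are an assumption based on the 'WED' example above:

```python
from datetime import date, timedelta

DAYS = {"MON": 0, "TUE": 1, "WED": 2, "THU": 3, "FRI": 4, "SAT": 5, "SUN": 6}

def next_day(d: date, day_of_week: str) -> date:
    # Strictly after d: if d already falls on the target weekday, jump a full week
    ahead = (DAYS[day_of_week] - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=ahead)

print(next_day(date(2020, 8, 10), "WED"))  # → 2020-08-12
```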



Normalize

Data normalization is the process of scaling the data from the given numerical column down to an equivalent scale of 0 to 1.

Normalize( ): Creates a new normalized column

Syntax:

normalize(#column#) AS <<New_normalized_col>>

New column name: New_normalized_col

Note

The normalize() function is only applied for Numerical datatype columns

Example:

normalize(#age#) as <<normalized_col>>

Returns the normalized column for given numerical column.
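The min-max scaling described above can be sketched as follows (a constant column would divide by zero, so the sketch guards against it):

```python
def normalize(values):
    # Min-max scaling of a numerical column to the range [0, 1]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # fallback for constant columns
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([10, 20, 30]))  # → [0.0, 0.5, 1.0]
```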



Now

Now( ): The function returns the current date and time. It can be used to compare another date column with the current date.

Syntax:

now() as <<NewNowColumn>>

New column name: NewNowColumn

Example:

now() as <<NewNowColumn>>

The function returns the current date and time, for example 2020-05-30 15:20:43



NPER

nper( ):The NPER function returns the number of periods for an investment based on periodic, constant payments and a constant interest rate.

  • Annual_Interest_Rate - The interest rate per period

  • Amount_of_Payments - The payment made each period. Generally, it contains principal and interest but no other fees and taxes.

  • Loan_amount - The present value, or the lump-sum amount that a series of future payments is worth right now.

  • Final_balance - The future value or the cash balance

  • Type - Enter “end” or omit for investing at the end of each period, “begin” for the beginning of each period.

Syntax:

nper(#Annual_Interest_Rate#, #Amount_of_Payments#, #Loan_amount#, #Final_balance#,'end') as <<Newcol_nper>>

New column name: Newcol_nper

Example:

nper(#int_rate#, #open_acc#, #pv#) as <<new_nper>>

The above example returns the number of payment periods for the investment.



Nullif

Nullif ( ): The function returns null if two expressions are equal, otherwise it returns the first expression

Syntax:

nullif(#Column#, 'value_toNull') as <<NewNullifColumn>>

New column name: NewNullifColumn

Example:

nullif(#Age#, '55') as <<NewNullifColumn>>

With every occurrence of 55 in the “Age” column, the value will be replaced by NULL



Nvl2

Nvl2 ( ): The function allows you to substitute a value when a null value is encountered and also when a non-null value is encountered

Syntax:

nvl2(#Column#, 'value_ifnotNull', 'value_ifNull') as <<NewNvl2Column>>

New column name: NewNvl2Column



Octet_length

Octet_length ( ): The function returns an integer indicating the number of bytes in the input string.

Syntax:

octet_length(#Column#) as <<NewOctColumn>>

New column name: NewOctColumn

Example:

octet_length(#Geography#) as <<NewOctColumn>>

A variable ‘France’ will return a number 6 as the length. Germany will be 7 etc.



Offday_count

offday_count( ): Returns the number of off days (holidays and weekends) between two given date columns, based on the country

Syntax:

offday_count(#column1#,#column2#, 'countrycode') as <<Newcol_offdaycount>>

New column name: Newcol_offdaycount

For the list of country codes click on the link below: List of country codes.

Note

  • The offday_count function is only applied for date columns

  • Countrycode field should have valid codes

Example:

offday_count (#DOB#, #Date of Hire#, 'US') as <<offdaycount>>

The above example returns the number of off days (holidays and weekends) between the two given date columns for the specified country



One Replacement

For video tutorial see: SEDGE YouTube channel - O functions.

One_replacement( ): One replacement replaces a single character at the defined position with new characters.

Syntax:

one_replacement(#Occupation#,4,'_Yes') as <<NewOccupationColumn>>

New column name: NewOccupationColumn

Note

Applies to Categorical and text data type and the new column which is created is Categorical type.

Example:

one_replacement(#Occupation#,4,'_Yes') as <<NewOccupationColumn>>

Assuming the data in the column #Occupation# is ‘Prof’, it will replace ‘f’ with ‘_Yes’ and the output will be ‘Pro_Yes’.



Pmod

Pmod ( ): The Pmod function returns the positive modulus of a number

Syntax:

pmod(#Column1#, #Column2#) as <<NewPmodColumn>>

New column name: NewPmodColumn
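As an illustration (in Python, not SEDGE syntax), a positive modulus can be sketched as follows; the point is that the result is always non-negative for a positive divisor, unlike a truncating remainder:

```python
# Minimal sketch of positive modulus. For a positive divisor n the result
# is always in the range [0, n), even when a is negative.
def pmod(a, n):
    return ((a % n) + n) % n

print(pmod(-7, 5))  # 3  (a truncating remainder would give -2)
print(pmod(7, 5))   # 2
```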



PMT

pmt( ): The PMT Function returns the regular and constant repayments for a loan or mortgage required to reduce the balance to zero, or an amount you specify. It’s based on a constant interest rate.

Syntax:

pmt(#Annual_Interest_Rate#, #Number_of_Payments#, #Loan_amount#, #Final_balance#, 'end') as <<Newcol_pmt>>

New column name: Newcol_pmt

Example:

pmt(#int_rate#, #open_acc#, #pv#) as <<new_pmt>>

The above example returns the periodic payment required to amortize the loan amount at the given interest rate over the given number of payments.
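The underlying calculation is the standard annuity payment formula (as used by spreadsheet-style PMT functions). The Python sketch below is illustrative only; the argument names and sign convention (payments are negative) are assumptions, not SEDGE's exact behaviour:

```python
# Sketch of the standard annuity PMT formula: the constant payment that
# reduces a present value pv (plus an optional final balance fv) to zero
# over nper periods at a constant per-period rate.
def pmt(rate, nper, pv, fv=0.0, when='end'):
    if rate == 0:
        return -(pv + fv) / nper
    factor = (1 + rate) ** nper
    adjust = 1 + rate if when == 'begin' else 1  # payments at period start vs end
    return -(pv * factor + fv) * rate / ((factor - 1) * adjust)

# A 10,000 loan at 5% annual interest repaid in 60 monthly instalments:
print(round(pmt(0.05 / 12, 60, 10_000), 2))  # -188.71
```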



Positive

Positive ( ): Returns the positive (absolute) value of the entries in the given numerical column

Syntax:

positive(#Column#) as <<NewPosColumn>>

New column name: NewPosColumn

Note

Applies for Numerical column and the resulting column is numerical data type

Example:

positive(#data#) as <<positive_col>>

Returns the positive (absolute) value of each entry in the #data# column



Power

For video tutorial see: SEDGE YouTube channel - P functions.

pow( ): The power (or exponent) of a number is the number of times to use the number in a multiplication

Syntax:

pow(#age#,3) as <<power_age_column>>

New column name: power_age_column

Note

  • A Applies to Numerical data type and the new column which is created is Numerical type.

  • B The power transformation is used when the numbers are clustered too close to each other. It is used to make the data distribution more Gaussian (normal).

Example:

pow(#age#,3) as <<power_age_column>>

The power (or exponent) of a number is the number of times to use the number in a multiplication. In the example above, each value is raised to the power of 3, i.e. the value in the field is multiplied by itself three times.



Proper

Proper( ): The Proper Case converts categorical and text values to upper case for the first character of each word and to lower case for subsequent characters in the word.

Syntax:

Proper(#column1#) AS <<NewProperColumn>>

New column name: NewProperColumn

Note

  • A Applies to categorical and text data type and the new column which is created is Categorical type.

  • B Does not apply to Numerical and Data type columns

Example:

Proper(#column1#) AS <<NewProperColumn>>

Converts characters to proper case (Starting Alphabet is capitals, rest are lower case).



Quarter

Quarter( ): The function returns the quarter of the year for a given date value (a number from 1 to 4)

  • January-March will return 1

  • April-June will return 2

  • July-Sep will return 3

  • Oct-Dec will return 4

Syntax:

quarter(#Datecolumn#) as <<NewQuartColumn>>

New column name: NewQuartColumn

Example:

quarter(#2021-02-02#) as <<NewQuartColumn>>

The function will return 1 as the number of quarter



Radians

For video tutorial see: SEDGE YouTube channel - R functions.

radians( ): Converts angles in degrees to radians. 1 degree is equivalent to (π/180) radians. Multiply the number of degrees by π/180 to convert it to radians.

Syntax:

radians(#angles#) AS <<NewRadiansColumn>>

New column name: NewRadiansColumn

Note

Applies to Numerical data type and the new column which is created is Numerical type.

Example:

radians(#angles#) AS <<NewRadiansColumn>>

Converts angles in degrees to radians.

_images/angles2radians.png

image Angles to Radians



Rand

Rand ( ): The function returns a random number between 0 (inclusive) and 1 (exclusive). If seed is specified, it returns a repeatable sequence of random numbers. If seed is not specified, it returns a completely random number

Syntax:

rand(Seed) as <<NewRandColumn>>

New column name: NewRandColumn



Randn

Randn ( ): The function returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution

Syntax:

randn(Seed) as <<NewRandnColumn>>

New column name: NewRandnColumn



Reciprocal

reci( ): The reciprocal transformation is defined as the transformation of x to 1/x. The transformation reverses the order of values with the same sign.

Syntax:

reci(#column1#) AS <<NewReciprocalColumn>>

New column name: NewReciprocalColumn

Note

  • A Applies to Numerical data type and the new column which is created is Numerical type.

  • B The transformation can only be used for non-zero values and also values should not be missing.

  • C If the standard deviation is proportional to the mean squared, a reciprocal transformation can be performed.

  • D The transformation can improve interpretability. If the fuel economy of cars is being compared, there are two ways of looking at fuel consumption: (a) liters per kilometer or (b) kilometers per liter. The reciprocal of ‘liters per kilometer’ is ‘kilometers per liter’.

Example:

reci(#column1#) AS <<NewReciprocalColumn>>

The reciprocal transformation is defined as the transformation of x to 1/x.

_images/reciprocal.png

image MilesPerGallon to GallonsPerMile using reciprocal



Regexp_replace

Regexp_replace ( ): The function replaces the portions of a string that match a given regular expression with the contents of another string

  • #Column# - The string to perform the search-and-replace operation upon

  • Reg_exp - the JavaScript style regular expression to evaluate

  • Replace_str - the string with which the matched substrings are to be replaced

Syntax:

regexp_replace(#Column#, 'reg_exp', 'replace_str') as <<NewRegRepColumn>>

New column name: NewRegRepColumn
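As an illustration of what the function computes (using Python's `re.sub`, not SEDGE syntax), every substring matching the regular expression is replaced by the replacement string:

```python
import re

# Replace every run of digits with '#'. re.sub performs the same
# search-and-replace operation that regexp_replace describes.
print(re.sub(r'[0-9]+', '#', 'Flight BA123 to LHR456'))  # Flight BA# to LHR#
```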



Regexp_extract

Regexp_extract ( ): This string function is used in search operations for sophisticated pattern matching including repetition and alternation

  • #Column# - The string to search for strings matching the regular expression

  • Reg_exp - the JavaScript style regular expression to evaluate

  • Group_idx - an optional regular expression group number, defining which portion of the matching string will be returned

Syntax:

regexp_extract(#Column#, 'reg_exp', group_idx) as <<NewRegExtColumn>>

New column name: NewRegExtColumn
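As an illustration (in Python's `re` module, not SEDGE syntax), `group_idx` selects which parenthesised group of the match is returned:

```python
import re

# Return the group_idx-th parenthesised group of the first match,
# or an empty string if the pattern does not match at all.
def regexp_extract(text, pattern, group_idx):
    m = re.search(pattern, text)
    return m.group(group_idx) if m else ''

print(regexp_extract('order-4217-US', r'order-(\d+)-(\w+)', 1))  # 4217
print(regexp_extract('order-4217-US', r'order-(\d+)-(\w+)', 2))  # US
```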



Remove

remove( ): The remove function removes a certain string pattern from the text.

Syntax:

remove(#column1#, 'Pattern to be removed') AS <<NewColumnRemove>>

New column name: NewColumnRemove

Note

  • A Applies to string / text column

  • B The transformation cannot be applied to missing values.

Example:

remove(#column1#, 'USD') AS <<NewColumnRemove>>

In the above example all instances of string USD, will be removed from the text in the #column1#



Repeat

repeat( ): Repeats the value in the field the number of times as defined.

Syntax:

repeat(#MAKE#,2) as <<Make_repeat_NewCol>>

New column name: Make_repeat_NewCol

Note

Applies to Numerical, categorical, text data type and the new column which is created is Categorical type.

Example:

repeat(#MAKE#,2) as <<Make_repeat_NewCol>>

This will repeat the values in the field ‘Make’ twice.



Replace

replace( ): The replace function replaces a string pattern with another string pattern within a text / string column.

Syntax:

replace(#column1#, 'Existing string pattern','string to be replaced') AS <<NewColumnReplace>>

New column name: NewColumnReplace

Note

  • A Applies to string / text column

  • B The transformation cannot be applied to missing values.

Example:

replace(#column1#, 'EUR', 'USD') AS <<NewColumnReplace>>

In the above example all instances of the string EUR are replaced with the string USD in #column1#.



Reverse

reverse( ): Reverses the order of the data. If the value is ABCDE, a reverse function will reverse the value to EDCBA

Syntax:

reverse(#position#) as <<reverse_pos>>

New column name: reverse_pos

Note

Applies to Categorical, Text and Numerical data type and the new column which is created is Numerical type.

Example:

reverse(#position#) as <<reverse_pos>>

Reverses the order of the character

_images/Reversetrans.png

image Text Reversed



Right

right( ): Extracts the specified number of characters from the right side of the field.

Syntax:

right(#nationality#,3) as <<NewColumn>>

New column name: NewColumn

Note

Applies to Categorical, Text, Numerical and Date data type and the new column which is created is Categorical type.

Example:

right(#nationality#,3) as <<NewColumn>>

In the example above, the function extracts the last 3 characters of the field (counting from the right).



Rint

rint( ): The rint function returns a floating-point value that represents the nearest integer to x. Halfway values are rounded according to the current setting of the floating-point rounding mode

Syntax:

rint(#Column#) as <<NewRintColumn>>

New column name: NewRintColumn
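Under the common default rounding mode (round half to even), halfway values round to the nearest even integer. Python's built-in `round` behaves the same way, so it can illustrate what rint typically returns:

```python
# Round-half-to-even ("banker's rounding"): 2.5 and 3.5 do not both
# round upward; each rounds to its even neighbour.
print(round(2.3))  # 2
print(round(2.5))  # 2
print(round(3.5))  # 4
```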



Round

round( ): Rounds the value to the given number of decimal places.

Syntax:

round(#MPG#,1) as <<NewColRound>>

New column name: NewColRound

Note

Applies to Numerical data type and the new column which is created is Numerical type.

Example:

round(#MPG#,1) as <<NewColRound>>

Rounds the values in #MPG# to 1 decimal place.

_images/rounding.png

image rounding of Miles Per Gallon (MPG)



Row_index

row_index(): The function assigns a unique, sequential index number to each row, starting with 1

Syntax:

row_index() as <<NewRowInColumn>>

New column name: NewRowInColumn

Example:

row_index() as <<rowindex_col>>

Returns the index of each row



Rtrim

Rtrim( ): The function removes trailing spaces from a string

Syntax:

rtrim(#Column#) as <<NewRtrimColumn>>

New column name: NewRtrimColumn

Example:

rtrim(#      hello    #) as <<NewRtrimColumn>>

The function will return ‘      hello’ with the trailing spaces removed (leading spaces are kept).



Rpad

Rpad( ): The function right-pads a string with another string, to a certain length (when string1 is not null)

Syntax:

rpad(#Column#, length, 'pad') as <<NewRpadColumn>>

New column name: NewRpadColumn

Example:

rpad(#Hello#, 8, 'ABC') as <<NewRpadColumn>>

The function will return HelloABC.



Running total

running_total( ): This function returns a sequence of partial sums of the given column, i.e. the cumulative sum of the data as it accumulates row by row.

Syntax:

running_total(#column#) as <<Newcol_total>>

New column name: Newcol_total

Note

The running_total function is only applied for numerical columns.

Example:

running_total(#Age#) as <<total_age>>

The above example returns the running total of the Age column.



Second

second( ): The function returns the seconds part of a time/datetime (from 0 to 59).

Syntax:

second(#Datecolumn#) as <<NewSecColumn>>

New column name: NewSecColumn

Example:

second(#2021-09-16 11:56:33#) as <<NewSecColumn>>

The function will return 33.



Sha1

Sha1( ): The function calculates an SHA-1 160-bit checksum for a string and returns the value as a string of 40 hexadecimal digits. If the string supplied as the argument is NULL, the function returns NULL.

Syntax:

sha1(#Column#) as <<NewSha1Column>>

New column name: NewSha1Column

Example:

sha1(#Sedge#) as <<NewSha1Column>>

The function will return fca21fee4d117ec70a95dbee415c01e5c6e86df3



Sha2

Sha2( ): The function uses the SHA2 cryptographic hash function to convert a variable-length string into a character string. The character string is a text representation of the hexadecimal value of the checksum with the specified number of bits

Syntax:

sha2(#Column#, bitLength) as <<NewSha2Column>>

New column name: NewSha2Column

Note

  • String - A variable-length string

  • Integer - The number of bits in the hash functions. Valid values are 0 (same as 256), 224, 256, 384, and 512
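As an illustration of the checksum format (using Python's `hashlib`, not SEDGE syntax), SHA-256 (bitLength 256) returns 64 hexadecimal characters, i.e. 256 bits at 4 bits per hex digit:

```python
import hashlib

# SHA-256 digest of a string, rendered as hexadecimal text.
digest = hashlib.sha256('abc'.encode()).hexdigest()
print(len(digest))  # 64 hex characters = 256 bits
```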



Shiftleft

Shiftleft( ): The function performs a bitwise left shift

Syntax:

shiftleft(base, #Column#) as <<NewShLeftColumn>>

New column name: NewShLeftColumn



Shiftright

Shiftright( ): The function performs a bitwise right shift (signed)

Syntax:

shiftright(base, #Column#) as <<NewShRightColumn>>

New column name: NewShRightColumn



Shiftrightunsigned

Shiftrightunsigned( ): The function performs a bitwise right shift (unsigned)

Syntax:

shiftrightunsigned(base, #Column#) as <<NewShRiUnColumn>>

New column name: NewShRiUnColumn
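The difference between the three shift functions can be illustrated in Python. Python integers are unbounded, so a fixed width (32 bits here, an assumption for illustration) must be imposed explicitly to mimic signed versus unsigned shift semantics:

```python
MASK32 = (1 << 32) - 1  # keep only the low 32 bits

def shiftleft(x, n):
    return (x << n) & MASK32          # bits shifted past bit 31 are dropped

def shiftright(x, n):
    return x >> n                     # arithmetic shift: the sign is preserved

def shiftrightunsigned(x, n):
    return (x & MASK32) >> n          # logical shift: zero-fill from the left

print(shiftleft(1, 3))             # 8
print(shiftright(-8, 1))           # -4
print(shiftrightunsigned(-1, 28))  # 15
```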



Sign

For video tutorial see: SEDGE YouTube channel - S functions.

Sign( ): Returns 1 if the value in the column is greater than 0, -1 if the value is less than 0, and 0 if the value is 0

Syntax:

Sign(#MPG#) as <<NewColSign>>

New column name: NewColSign

Note

  • A Applies to Numerical data type and the new column which is created is Numerical type.

Example:

Sign(#temperature#) as <<NewColSign>>

If the value in the temperature column is 36, the new column will contain 1; if the value is -25, the new column will contain -1; and if the value is 0, the new column will contain 0.



Sin

Sin( ): The function returns the sine of a number

Syntax:

sin(#Column#) as <<NewSinColumn>>

New column name: NewSinColumn

Example:

sin(#1#) as <<NewSinColumn>>

The function will return the result of 0.8414709




Sinh

Sinh( ): The function returns the hyperbolic sine of a number

Syntax:

sinh(#Column#) as <<NewSinHColumn>>

New column name: NewSinHColumn

Example:

sinh(#1#) as <<NewSinHColumn>>

The function will return the result of 1.1752011



Soundex

Soundex( ): The function evaluates expression and returns the most significant letter in the input string followed by a phonetic code

Syntax:

soundex(#Column#) as <<NewSoundColumn>>

New column name: NewSoundColumn

Note

  • Characters that are not alphabetic are ignored

  • If expression evaluates to the null value, null is returned



Square Root

sqrt( ): Finds the square root of the number.

Syntax:

sqrt(#MPG#) as <<NewColSqrt>>

New column name: NewColSqrt

Note

  • A Applies to Numerical data type and the new column which is created is Numerical type.

  • B The function can be used for square root transformation of positive values.

  • C Do not apply the square root to negative values, as the system will return an error.

Example:

sqrt(#MPG#) as <<NewColSqrt>>

Finds the square root of the positive value.

_images/sqrt.png

image Square Root of Miles Per Gallon (MPG)



Step Delete

step_delete( ): Deletes a single character at the starting position, skips the given number of characters, then deletes the next single character, repeating over the given range.

Syntax:

step_delete(#Occupation#,1,10,2) as <<NewColSkipDelete>>

New column name: NewColSkipDelete

Note

  • A Applies to Categorical and text data type and the resulting column is categorical data type.

  • B Does not apply to numerical data type.

Example:

step_delete(#Occupation#,1,10,2) as <<NewColSkipDelete>>

In the above example, data in the field Occupation is selected. The step delete applies from the 1st character to the 10th character, skipping 2 characters between deletions. For example, if the text is ‘Life is beautiful’, the function works as shown in the image below.

_images/Step_Delete.png

image Step Delete function working explanation



Standardize

standardize( ): Data standardization is the process of rescaling the data from the given numerical column to have a mean of 0 and a standard deviation of 1 (unit variance). The function creates a new standardized column

Syntax:

standardize(#Column#) as <<NewStandColumn>>

New column name: NewStandColumn

Note

  • The standardize function is only applied for Numerical datatype columns

Example:

standardize(#age#) as <<NewStandColumn>>

Returns the standardized column for given numerical column
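The computation can be sketched in Python (an illustration, not SEDGE's implementation): subtract the column mean from each value and divide by the standard deviation, so the resulting column has mean 0 and standard deviation 1:

```python
from statistics import mean, pstdev

# Standardization (z-scores): rescale a numerical column to mean 0 and
# (population) standard deviation 1.
def standardize(values):
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

z = standardize([20, 30, 40, 50])
print(z)  # symmetric values around 0
```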



Stddev

stddev( ): This function returns the standard deviation of values in the specified column.

Syntax:

stddev(#column#) as <<Newcol_stddev>>

New column name: Newcol_stddev

Note

The stddev function is only applied for numerical columns.

Example:

stddev(#Age#) as <<stddev_age>>

The above example returns the standard deviation of the Age column.



String

String( ): The function converts the value expression to the string data type

Syntax:

string(#Column#) as <<NewStringColumn>>

New column name: NewStringColumn



Substring

Substring( ): The function extracts a substring from a string (starting at any position)

Syntax:

substring(#Column#, position, length) as <<NewSubColumn>>

New column name: NewSubColumn



Substring_index

Substring_index( ): The function returns the substring of a string before a specified number of occurrences of a delimiter

Syntax:

substring_index(#Column#, 'delimiter', count) as <<NewSubInColumn>>

New column name: NewSubInColumn
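The behaviour can be sketched in Python (an illustration following the common MySQL-style semantics, which is an assumption about SEDGE's exact behaviour): a positive count returns everything before the count-th occurrence of the delimiter from the left, a negative count everything after the count-th occurrence from the right:

```python
# Everything before (count > 0) or after (count < 0) the count-th
# occurrence of the delimiter.
def substring_index(s, delim, count):
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])

print(substring_index('www.example.com', '.', 2))   # www.example
print(substring_index('www.example.com', '.', -2))  # example.com
```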



Sum

sum( ): This function returns the sum of all values in the specified column.

Syntax:

sum(#column#) as <<Newcol_sum>>

New column name: Newcol_sum

Note

The sum function is only applied for numerical columns.

Example:

sum(#Age#) as <<sum_age>>

The above example returns the sum of all values in the Age column.



Tan

Tan( ): The function returns the tangent of a number

Syntax:

tan(#Column#) as <<NewTanColumn>>

New column name: NewTanColumn



Tanh

Tanh( ): The function returns the hyperbolic tangent of a number

Syntax:

tanh(#Column#) as <<NewTanHColumn>>

New column name: NewTanHColumn

Example:

tanh(#1#) as <<NewTanHColumn>>

The function will return the result of 0.7615941



To_date

To_date( ): The function converts a string value to DATE data type value using the specified format

Syntax:

to_date(#Column#, 'Date_format_pattern') as <<NewToDateColumn>>

New column name: NewToDateColumn

For Date Pattern Formats check: Date Format Examples.



To_timestamp

To_timestamp( ): The function converts a character datatype to a value in TIMESTAMP datatype format. The Timestamp_format_pattern specifies the format of the character string

Syntax:

to_timestamp(#Column#, 'Timestamp_format_pattern') as <<NewToTimeColumn>>

New column name: NewToTimeColumn

For Date Pattern Formats check: Date Format Examples.



To_unix_timestamp

To_unix_timestamp( ): The function returns the UNIX timestamp of the given time column. The Unix timestamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch, January 1st, 1970, 00:00:00 UTC

Syntax:

to_unix_timestamp(#Column#, 'Date/time_format_pattern') as <<NewUnixColumn>>

New column name: NewUnixColumn

For Date Pattern Formats check: Date Format Examples.



To_utc_timestamp

To_utc_timestamp( ): The function returns the UTC timestamp of the given time column.

Syntax:

to_utc_timestamp(#Datecolumn#, 'timezone') as <<NewUTCColumn>>

For the list of time zones click on the link below: List of time zones.

New column name: NewUTCColumn



Trim

For video tutorial see: SEDGE YouTube channel - T functions.

trim( ): Trim function returns a text value with the starting and ending spaces removed. It also removes unnecessary spaces between words in a string.

Syntax:

trim(#occupation#) as <<Newcoltrim>>

New column name: Newcoltrim

Note

  • A Applies to Categorical, Text data type and the new column which is created is Categorical type.

  • B Do not apply to Numerical or date type.

Example:

trim(#occupation#) as <<Newcoltrim>>

Trim returns the value with unnecessary spaces between words, and at the start and end of the string, removed.



Typeof

Typeof( ): The function allows the user to quickly check a variable’s data type — or whether it is “undefined” or “null”

Syntax:

typeof(#Column#) as <<NewTypeOfColumn>>

New column name: NewTypeOfColumn



Unbase64

unbase64( ): The function decodes a BASE64 encoded string column and returns it as a binary column

Syntax:

unbase64(#Column#) as <<NewUnbaseColumn>>

New column name: NewUnbaseColumn


Unhex

unhex( ): The function converts a hexadecimal value (base 16) to a decimal number (base 10). As a base-16 numeral system, hexadecimal uses 16 symbols.

Syntax:

unhex(#Column#) as <<NewUnhexColumn>>

New column name: NewUnhexColumn



Unix_timestamp

unix_timestamp( ): The function returns the UNIX timestamp of the current time. The Unix timestamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch, January 1st, 1970, 00:00:00 UTC

Syntax:

unix_timestamp() as <<NewUnixColumn>>

New column name: NewUnixColumn

Example:

unix_timestamp() as <<NewUnixColumn>>

The above example, performed on Mon Jun 07 2021 14:16:31 GMT, returns the Unix timestamp 1623075391.
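The example can be reproduced in Python (an illustration, not SEDGE syntax): the Unix timestamp is simply the number of seconds elapsed since the epoch at that instant:

```python
from datetime import datetime, timezone

# Mon Jun 07 2021 14:16:31 GMT, expressed as seconds since the Unix Epoch.
dt = datetime(2021, 6, 7, 14, 16, 31, tzinfo=timezone.utc)
print(int(dt.timestamp()))  # 1623075391
```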



Uuid

uuid( ): The function is used to return a Universal Unique Identifier (UUID) - this ID is random and unique for each row

Syntax:

uuid() as <<NewUuidColumn>>

New column name: NewUuidColumn

Example:

uuid() as <<NewUuidColumn>>

The above example returns a random 36-character string, for example 6521665d-ce8b-442a-b7c2-cfb4918f25fb.



Unit converter

For video tutorial see: SEDGE YouTube channel - U functions.

unit_converter( ): Converts a number from one unit to another. The available units for conversion are listed below.

Syntax:

unit_converter(#temperature#, 'from_unit', 'to_unit') AS <<Newconvertedcolumn>>

New column name: Newconvertedcolumn

Note

  • The from_unit and to_unit refer to units such as ‘C’ (Celsius) or ‘F’ (Fahrenheit).

  • The unit_converter function can only be applied to Numerical datatype columns.

  • The new column name should not contain special characters such as !@#$%^&*()-+= or spaces.

  • If there are missing values in the numerical column, those rows are ignored.

  • The from_unit and to_unit can be entered in upper case or lower case.

Temperature units

Converts temperature units from Celsius to Fahrenheit and from Fahrenheit to Celsius.

Syntax:

unit_converter(#temperature#, 'C', 'F') AS <<NewconvertedcolumnTemp>>

  • Celsius (C) to Fahrenheit (F)

  • Fahrenheit (F) to Celsius (C)
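The temperature conversion itself is the familiar linear formula. The Python sketch below is illustrative (the function name and structure are assumptions, not SEDGE's implementation); note that it accepts units in either case, matching the note above:

```python
# Celsius/Fahrenheit conversion: F = C * 9/5 + 32 and C = (F - 32) * 5/9.
def convert_temperature(value, from_unit, to_unit):
    units = (from_unit.upper(), to_unit.upper())
    if units == ('C', 'F'):
        return value * 9 / 5 + 32
    if units == ('F', 'C'):
        return (value - 32) * 5 / 9
    raise ValueError('unsupported unit pair')

print(convert_temperature(100, 'c', 'f'))  # 212.0
print(convert_temperature(32, 'F', 'C'))   # 0.0
```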

Quantity units

Converts quantity units from Tablespoon to Teaspoon and from Teaspoon to Tablespoon.

Syntax:

unit_converter(#QuantityColumn#, 'TSP', 'TBS') AS <<NewconvertedcolumnTemp>>

  • Teaspoon (TSP) to Tablespoon (TBS)

  • Tablespoon (TBS) to Teaspoon (TSP)

Volume units

Converts volume units from gallons to liters and from liters to gallons.

Syntax:

unit_converter(#QuantityColumn#, 'GAL', 'L') AS <<NewconvertedcolumnVol>>

  • Gallons (GAL) to Liters (L)

  • Liters (L) to Gallons (GAL)

Distance units

Converts distance units from miles to kilometers and from kilometers to miles.

Syntax:

unit_converter(#QuantityColumn#, 'MI', 'KM') AS <<NewconvertedcolumnVol>>

  • Miles (MI) to Kilometers (KM)

  • Kilometers (KM) to Miles (MI)

Length units

Converts length units from inches to feet or centimeters and from feet to inches and centimeters and from centimeters to inches and feet.

Syntax:

unit_converter(#QuantityColumn#, 'IN', 'FEET') AS <<NewconvertedcolumnLen>>

  • Inches (IN) to Feet (FEET)

  • Feet (FEET) to Inches (IN)

  • Inches (IN) to Centimeters (CM)

  • Feet (FEET) to Centimeters (CM)

  • Centimeters (CM) to Inches (IN)

  • Centimeters (CM) to Feet (FEET)

Weight Units

Converts weight units from kilograms to pounds and from pounds to kilograms.

Syntax:

unit_converter(#QuantityColumn#, 'KG', 'LBM') AS <<NewconvertedcolumnWei>>

  • Kilograms (KG) to Pounds (LBM)

  • Kilograms (KG) to Grams (G)

  • Pounds (LBM) to Kilograms (KG)

  • Pounds (LBM) to Grams (G)

  • Grams (G) to Kilograms (KG)

  • Grams (G) to Pounds (LBM)


Upper

upper( ): converts characters to upper case.

Syntax:

upper(#column1#) AS <<NewColumnUpp>>

New column name: NewColumnUpp

Note

Applies to categorical and text data type and the new column which is created is Categorical type.

Example:

upper(#column1#) AS <<NewColumnUpp>>

Converts characters to upper case.


Variance

variance( ): This function returns the variance of values in the specified column.

Syntax:

variance(#column#) as <<Newcol_variance>>

New column name: Newcol_variance

Note

The variance function is only applied for numerical columns.

Example:

variance(#Age#) as <<variance_age>>

The above example returns the variance of the Age column.



Weekday

weekday( ): The function returns the weekday number for a given date column

Syntax:

weekday(#Datecolumn#) as <<NewWeekdayColumn>>

New column name: NewWeekdayColumn

Note

0 = Monday, 1 = Tuesday, 2 = Wednesday, 3 = Thursday, 4 = Friday, 5 = Saturday, 6 = Sunday

Example:

weekday(#2021-06-06#) as <<NewWeekdayColumn>>

The function will return 6 for Sunday.



Weekend_count

weekend_count( ): Returns the number of weekend days between two given date columns, based on the country

Syntax:

weekend_count(#column1#,#column2#, 'countrycode') as <<Newcol_weekendcount>>

New column name: Newcol_weekendcount

For the list of country codes click on the link below: List of country codes.

Note

  • The weekend_count function is only applied for date columns.

  • The country code field should contain valid codes.

Example:

weekend_count (#DOB#, #Date of Hire#, 'US') as <<weekendcount>>

The above example returns the number of weekend days between the two date columns, based on the given country code
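The core of the computation can be sketched in Python. This illustration ignores the country code and assumes a Saturday/Sunday weekend, which is a simplification of what SEDGE does:

```python
from datetime import date, timedelta

# Count Saturday/Sunday dates in the inclusive range [start, end].
# weekday() returns 5 for Saturday and 6 for Sunday.
def weekend_days(start, end):
    return sum(1 for i in range((end - start).days + 1)
               if (start + timedelta(days=i)).weekday() >= 5)

print(weekend_days(date(2021, 6, 1), date(2021, 6, 7)))  # 2 (Sat 5th, Sun 6th)
```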



Weekofyear

Weekofyear( ): The function returns the week number for a given date (a number from 1 to 53)

Syntax:

weekofyear(#Datecolumn#) as <<NewWoYColumn>>

New column name: NewWoYColumn

Note

This function assumes that the first day of the week is Monday and the first week of the year has more than 3 days

Example:

weekofyear(#2020-01-23#) as <<NewWoYColumn>>

The above example returns 4, as the date falls in the 4th week of the year



Workday_count

workday_count( ): Count the number of workdays between two given date columns based on country

Syntax:

workday_count(#column1#,#column2#, 'countrycode') as <<Newcol_workdaycount>>

New column name: Newcol_workdaycount

For the list of country codes click on the link below: List of country codes.

Note

  • The workday_count function is only applied for date columns.

  • The country code field should contain valid codes.

Example:

workday_count (#DOB#, #Date of Hire#, 'US') as <<workdaycount>>

The above example returns the number of workdays between the two date columns, based on the given country code



XXhash64

XXhash64( ): The function returns a 64-bit hash value of the arguments/selected columns

Syntax:

xxhash64(#Column1#, #Column2#, …, #ColumnN#) as <<NewXXColumn>>

New column name: NewXXColumn

Example:

xxhash64(#30#, #7.8958#) as <<NewXXColumn>>

The above example returns -5523311690417587000 as a 64-bit hash value.



Year

Year( ): The function returns the year part for a given date (a number from 1000 to 9999)

Syntax:

year(#Datecolumn#) as <<NewYearColumn>>

New column name: NewYearColumn

Example:

year(#2020-01-23#) as <<NewYearColumn>>

The above example returns 2020 as the extracted year



Year_to_date

Year_to_date( ): The function converts a 4-digit year (yyyy) to the default date format (yyyy-MM-dd)

Syntax:

year_to_date(#Column#) as <<NewYTDColumn>>

New column name: NewYTDColumn

Visual Analytics

Cluster Analysis

Cluster analysis is a technique to group similar observations into a number of clusters based on the observed values of several variables for each individual. It is an exploratory data analysis technique that allows us to analyze multivariate data sets.

Clustering algorithms

K-means

It is an iterative algorithm that divides the unlabeled dataset into k different clusters.

_images/clusteranalysis1.png

image_K-means

Bisecting K-Means

Bisecting K-Means is a hybrid approach of the K-Means algorithm to produce partitional/hierarchical clustering. It splits one cluster into two sub-clusters at each bisecting step (using k-means) until k clusters are obtained.

_images/clusteranalysis2.png

image_Bisecting_K-Means

Gaussian Mixture

The Gaussian mixture model is defined as a clustering algorithm that is used to discover the underlying groups of data. Gaussian mixture models have a higher chance of finding the right number of clusters in the data compared to k-means.

_images/clusteranalysis3.png

image_Gaussian_Mixture

Method for Optimal Cluster

Elbow method

The Elbow Method is an empirical method to find the optimal number of clusters for a dataset. In this method, we pick a range of candidate values of k and apply K-Means clustering for each value. For each k, we find the average distance of each point in a cluster to its centroid and represent it in a plot. We then pick the value of k where the curve bends, i.e. where the average distance stops falling sharply (the "elbow").

_images/optimalcluster1.png

image Elbow method

Silhouette method

The Silhouette Method is also a method to find the optimal number of clusters, and to interpret and validate consistency within clusters of data. It computes a silhouette coefficient for each point, measuring how similar the point is to its own cluster compared to other clusters, and thereby provides a succinct graphical representation of how well each object has been classified.
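The per-point coefficient has a simple closed form, sketched below in Python (an illustration of the standard definition; in practice a library such as scikit-learn computes this over all points): with a the mean distance from a point to the other points in its own cluster and b the mean distance to points in the nearest other cluster, the coefficient is (b - a) / max(a, b), ranging from -1 (badly clustered) to 1 (well clustered):

```python
# Silhouette coefficient for a single point.
# a: mean intra-cluster distance; b: mean distance to the nearest other cluster.
def silhouette_coefficient(a, b):
    return (b - a) / max(a, b)

print(silhouette_coefficient(1.0, 3.0))  # ~0.667: the point sits comfortably in its cluster
print(silhouette_coefficient(2.0, 2.0))  # 0.0: the point lies between two clusters
```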

_images/optimalcluster2.png

image Silhouette method

Graphical Analysis

_images/GraphicalAnalysis.png

image GraphicalAnalysis

Graphs are used to display large amounts of numerical data and to represent the relationships between numerical values of different variables. They can also be used to derive quantitative relationships between variables.

_images/BarChartAge.png

image BarChartAge

A bar chart presents data with rectangular bars with heights or lengths proportional to the values that they represent.


We have chosen age as an example for our bar chart. We can point our mouse at any part of the bar and the system will tell us at which point we are currently standing - the X axis shows the location of our mouse and the Y axis shows the total count in the bar. We can also zoom in or out using the grey tool in the upper part of the graph (highlighted in green).

The “Group by” tool and “Aggregation” tool work with categorical values.

Note

We can also select the “Use Distinct” tool, which represents the percentage of each category irrespective of the count in each category.

Stack Bar

The stack bar chart extends the standard bar chart from looking at numeric values across one categorical variable to two, where each bar is divided into a number of sub-bars.

_images/StackBarSurEmb.png

image StackBarSurEmb

In this Stack Bar chart we have selected two factors - “port of embarkment” and “survived”. We can see that each port of embarkment - 0, 1 and 2 - is divided into two colours: 1 showing those who survived and 0 showing the opposite. By pointing the mouse at any section of the bar we can see the total number of cases in the group.

Box Plot

The Box Plot graph depicts groups of numerical data through their quartiles. Box plots also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles; these are the outlier values.

_images/BoxPlotAge.png

image BoxPlot

We used age as an example; we can see that the age of most passengers was between 22 and 35 years, while other values are represented as outliers. By pointing at the graph we can also see the Open, Low, High and Close values.

Pie Chart

Pie charts are beneficial in expressing data and information in terms of proportions. A pie chart is most effective when dealing with the collection of data. It is constructed by clustering all the required data into a circular constructed frame wherein the data are depicted in slices.


_images/PieChartMealNo.png

image PieChartMealNo

For our example we used the number of meals, and we can see the number was between 1 and 9. Each part of the pie is also labelled with its percentage, and by pointing at each section we can see the total number of cases.

Bubble Chart

A bubble chart is a type of chart that displays three dimensions of data. Each entity, with its triplet of associated values, is plotted as a disk that expresses two of the three values through the disk’s xy location and the third through its size.

_images/BubbleChartAgeFare.png

image BubbleChartAgeFare

We used Age and Fare for our example and we can now see how the values are spread in the graph.


We can also add another factor to our chart, which represents the “Size” - we selected “Fare”, and we can see that the bubble size grows with increasing fare.

_images/BubbleChartAgeFareSize.png

image BubbleChartAgeFareSize


We have another feature that can help us represent the values effectively: we can add another factor and represent it in colour. In our case we have added “Embarkment point”, so we can see the results also differentiated by the point of embarkment.

_images/BubbleChartAgeFareSizeEmb.png

image BubbleChartAgeFareSizeEmb

Correlation Chart

The chart depicts each numerical variable’s correlation against all other variables. The stronger the connection, the higher the correlation between the variables.

_images/Correlation.png

image Correlation

_images/Correlationtarget.png

image Correlationtarget
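The pairwise correlations behind such a chart can be reproduced with pandas. A sketch on a toy table (the column names are hypothetical, not SEDGE’s schema):

```python
import pandas as pd

# Toy passenger data with hypothetical columns
df = pd.DataFrame({
    "age":      [22, 38, 26, 35, 54, 2],
    "fare":     [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "survived": [0, 1, 1, 1, 0, 1],
})

# Pairwise Pearson correlation of every numeric column against all others
corr = df.corr()
```

Each cell of `corr` holds the correlation between one pair of variables; the diagonal is always 1 because every variable correlates perfectly with itself.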

Scatter Graph

A scatter graph uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates the values for an individual data point. Scatter plots are used to observe relationships between variables.

_images/Scatter.png

image Scatter

For our demonstration we have used three different numerical variables: Age as the target on the X axis, and Fare and Total meal cost on the Y axis. We can add even more numerical variables and all will be displayed in one chart.

Parallel Coordinates

Parallel coordinates are a common way of visualising and analysing high-dimensional datasets. It is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order.

_images/Parallel.png

image Parallel

We have used Meal Cost, Number of Meals and Total Meal cost to visualise the relation between these variables.

Scatter Matrix

In multivariate statistics and probability theory, the scatter matrix is a statistic that is used to make estimates of the covariance matrix, for instance of the multivariate normal distribution.

Time Series

A time series is a series of data points indexed (or listed or graphed) in time order.

_images/TimeSeries.png

image TimeSeries

On the X axis we need to use a date variable, in our case “Ticket purchase”, and on the Y axis we used “Survived”. For aggregation we selected the Average value (we also have the options of Count, Sum, Average, Minimum, Maximum, Median, Standard Deviation and Variance). From our graph we can see that everybody who purchased a ticket on April 1st survived.
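The “Average” aggregation over a date variable corresponds to a group-by-date mean. A small illustration with pandas, using made-up purchase dates rather than the actual dataset:

```python
import pandas as pd

# Hypothetical ticket purchases; grouping by the date column and
# averaging the target mirrors the "Average" aggregation option
df = pd.DataFrame({
    "ticket_purchase": pd.to_datetime(
        ["2022-03-30", "2022-03-30", "2022-03-31", "2022-04-01", "2022-04-01"]),
    "survived": [0, 1, 0, 1, 1],
})

series = df.groupby("ticket_purchase")["survived"].mean()
```

Swapping `.mean()` for `.count()`, `.sum()`, `.min()`, `.max()`, `.median()`, `.std()` or `.var()` gives the other aggregation choices listed above.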

Distribution

A distribution plot displays the distribution and range of a set of numeric values plotted against one or more target variables.

_images/Visualisationdistribution.png

Distribution

Advanced Analysis

Outlier analysis

Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as outlier detection, it’s an important step in data analysis, as it removes erroneous or inaccurate observations which might otherwise skew conclusions.

_images/Outlieranalysis.png

image Outlier analysis
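One common way to flag abnormal observations - not necessarily the rule SEDGE uses internally - is a z-score threshold. A sketch with NumPy on made-up values:

```python
import numpy as np

# Hypothetical measurements with one abnormal observation
values = np.array([10.0, 12.0, 11.5, 10.8, 11.2, 10.9, 55.0])

# Flag observations more than 2 standard deviations from the mean
# (a single extreme point inflates the std, so a strict 3-sigma
# rule can miss it; 2 sigma is used here for illustration)
z = (values - values.mean()) / values.std()
outlier_mask = np.abs(z) > 2
```

Removing the flagged points (here, the value 55.0) before modelling is exactly the step that prevents erroneous observations from skewing conclusions.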

Deviation analysis

Deviation analysis measures the absolute difference between any one number in a set and the mean of the set. It can be used to determine how a specification will behave in the face of such deviations.

_images/Deviationanalysis.png

image Deviation analysis
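The quantity described above is straightforward to compute. A minimal NumPy sketch on hypothetical values:

```python
import numpy as np

values = np.array([4.0, 8.0, 6.0, 10.0, 12.0])

# Absolute deviation of each observation from the mean of the set
mean = values.mean()
abs_dev = np.abs(values - mean)

# Averaging the per-observation deviations gives the mean absolute deviation
mad = abs_dev.mean()
```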

Decision Tree analysis

Decision tree analysis is the process of drawing a decision tree - a graphic representation of the various alternative solutions available to solve a given problem - in order to determine the most effective course of action. Decision trees are composed of nodes and branches: nodes represent a test on an attribute, and branches represent potential alternative outcomes.

_images/Decisiontreeanalysis.png

image Decision Tree analysis

Predictive Analytics

_images/Predict.png

image Predictive Analytics

After completing our data manipulation and graphical analysis, we can move to the last part, which is Predictive Analytics. In the “Statistics” part, you need to select all the columns that are important for your research and ignore those that have little or no relation to the target. From the beginning, our target for the research has been the “Survived” field. A very important step in this research was to convert the field “Survived” from the “Numerical” to the “Categorical” type, as keeping it “Numerical” would result in very low accuracy.

  • Balancing

In the following step we can choose the type of balancing - “None”, “Down sampling”, “SMOTE” or “Up sampling”; in our example we selected “None”. Sampling is used to resolve class imbalance, because an imbalance in the observed classes might negatively impact our model. Machine learning algorithms have trouble learning when one class dominates the other, so balancing the data helps us avoid these problems.

_images/Balancing.png

image Balancing

  • None - no sampling of data is performed; the data stays as it is in the dataset.

  • Down Sampling - the system will randomly subset all the classes in the data set so that their class frequencies match the least common class

  • SMOTE - Synthetic Minority Oversampling Technique - this technique synthesises new minority instances between already existing minority instances

  • Up Sampling - the system will randomly sample, with replacement, from the minority class until it is the same size as the majority class
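Down sampling and up sampling can be sketched with NumPy (SMOTE, which synthesises new points, needs more machinery and is omitted here). The labels below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy labels: 8 negatives, 2 positives
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
majority = np.where(y == 0)[0]
minority = np.where(y == 1)[0]

# Down sampling: randomly subset the majority class to the minority size
down_idx = np.concatenate(
    [rng.choice(majority, size=minority.size, replace=False), minority])

# Up sampling: sample the minority class with replacement
# until it matches the majority size
up_idx = np.concatenate(
    [majority, rng.choice(minority, size=majority.size, replace=True)])
```

After either step the two class frequencies are equal, which is the point of balancing.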

The next step is to select which algorithms should be used, the ratio (default is 80:20, meaning 80% of the sample is used for training and 20% for cross validation) and the cross validation number (how many groups the cross validation sample should be split into). Then click on the “Start learning” button and the system will run these algorithms.
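The split-and-validate workflow just described can be sketched with scikit-learn; the synthetic dataset and the choice of logistic regression here are illustrative, not SEDGE internals:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# 80:20 ratio - 80% of the sample for training, 20% held out
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cross validation number = 5: the training data is split into 5 groups
scores = cross_val_score(model, X_train, y_train, cv=5)
```

Each entry of `scores` is the accuracy on one held-out fold; their mean is the usual summary of cross-validated performance.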

Algorithms

_images/Algorithm1.png

image Algorithms

Linear regression

Linear regression is the statistical model that analyzes the linear relationship between a scalar response and one or more explanatory variables.

Logistic regression

Logistic regression is a supervised learning algorithm used to predict a dependent categorical target variable.

Decision Tree

Decision Trees are a type of supervised machine learning in which the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves.

Random Forest

The random forest is a classification algorithm consisting of many decision trees. It is an ensemble method that is better than a single decision tree, as it reduces over-fitting by averaging the result. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by the committee is more accurate than that of any individual tree.

Gradient Boosting Machines

A Gradient Boosting Machine or GBM combines the predictions from multiple decision trees to generate the final predictions.

eXtreme Gradient Boosting

eXtreme Gradient Boosting or XGBoost is another popular boosting algorithm. In fact, XGBoost is simply an improved version of the GBM algorithm; its working procedure is the same. The trees in XGBoost are built sequentially. XGBoost also includes a variety of regularization techniques that reduce overfitting and improve overall performance.

Naive Bayes

Naive Bayes is a classification algorithm suitable for binary and multiclass classification. It performs well with categorical input variables compared to numerical ones, and is useful for making predictions and forecasting data based on historical results.

K-Nearest Neighbors

KNN is a non-parametric method used for classification. It is one of the best-known classification algorithms: known data are arranged in a space defined by the selected features, and when new data is supplied, the algorithm compares the classes of the k closest data points to determine the class of the new data.

Support Vector Machine

The support vector machine algorithm finds a hyperplane in N-dimensional space (N being the number of features) that distinctly classifies the data points.

Multilayer Perceptron

A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers.

Light GBM

Light Gradient Boosted Machine, or LightGBM for short, is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm. There is no fixed threshold for deciding when to use LightGBM, but it is well suited to datasets with more than 10,000 rows and to large volumes of data in general, especially when one needs to achieve high accuracy.
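The “run several algorithms and rank them” behaviour can be approximated with scikit-learn. A hedged sketch - the dataset is synthetic and only four of the algorithms above are shown, with default hyperparameters that SEDGE may well not use:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

# Fit every model and sort the names by held-out accuracy, best first
accuracy = {name: m.fit(X_train, y_train).score(X_test, y_test)
            for name, m in models.items()}
ranked = sorted(accuracy, key=accuracy.get, reverse=True)
```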

We can see the status of each algorithm; some are more complicated and will take more time. Once all reach 100% completion, the algorithms are sorted by their accuracy.

In our case, we can see 7 algorithms were used.

  • XGBOOST - a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework - with this algorithm we got an accuracy of 85%.

  • Decision tree

  • GBM - Gradient boosting

  • Random forest

  • Logistic regression

  • Naive Bayes

  • KNN - K nearest neighbors

We can also see further details of each algorithm by clicking on the “Metrices” button.

Variable of Importance

_images/ModelDetailsVI.png

image ModelDetailsVI

In this chart we can see which data field has a higher correlation to our selected target field.

Metrics - Classification report

_images/ClassificationReport.png

image ClassificationReport

The classification report is used to measure the quality of predictions from a classification algorithm. The metrics are calculated using true and false positives, and true and false negatives.
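The same per-class precision, recall and F1 figures can be computed with scikit-learn; the labels and predictions below are made up for illustration:

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Precision, recall and F1 per class, derived from the
# true/false positive and true/false negative counts
report = classification_report(y_true, y_pred, output_dict=True)
```

Here class 1 has precision 0.75 (3 of the 4 predicted positives are real) and recall 0.75 (3 of the 4 actual positives were found).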

Metrics- Confusion Matrix

_images/ConfusionMatrix.png

image Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualisation of the performance of an algorithm. It shows the “True Positives”, “True Negatives”, “False Positives” and “False Negatives”.
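The four cells can be computed directly with scikit-learn on the same kind of (here made-up) labels and predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels, rows are actual classes and columns predicted:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```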

Metrics- Gain Charts

_images/Gain.png

image Gain Charts

Cumulative gain charts are used for visually measuring the effectiveness of a classification model and are a very popular metric in predictive analytics. A gain chart measures the performance for a proportion of the population data. It is calculated as the ratio between the results obtained without the model (the random line) and with the model (the model line). The greater the area between these two lines, the better the model. The Y axis shows the target data and the X axis shows the percentage of the population.

In the Gain Charts image, the black line is the random line, and the green and orange lines are results based on the model for the two classes. From the figure above it can be seen that the orange class model has performed better than the green one, and both lines have performed better than the random line (black).
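A cumulative gain curve can be sketched with NumPy: rank the population by model score and track what fraction of all positives has been captured. The scores and labels below are hypothetical:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.45, 0.3, 0.2, 0.1])

# Rank the population by model score, best first
order = np.argsort(-scores)
captured = np.cumsum(y_true[order])

# Gain: fraction of all positives captured within the top x% of the population
gain = captured / y_true.sum()
population = np.arange(1, len(y_true) + 1) / len(y_true)
# The random line is gain == population; a good model stays above it
```

Here the top 40% of the ranked population already captures 75% of the positives, which is why the model line sits above the random line.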

Metrics- Lift Chart

_images/Lift.png

image Lift Charts

Lift charts are used for visually measuring the effectiveness of a classification model and are also a popular metric in predictive analytics. A lift chart shows the actual lift. The lift curve is determined by measuring the ratio between the result predicted by the model and the result using no model. A lift chart graphically represents the improvement a model offers when compared against a random guess, measuring the change in terms of a lift score. The best model can be determined by comparing lift scores. A lift chart shows the points at which the model’s predictions are strong and where they become weak (less useful).
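The ratio described above - model result divided by the no-model baseline - can be computed directly. A NumPy sketch on hypothetical scores and labels:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.45, 0.3, 0.2, 0.1])

order = np.argsort(-scores)
gain = np.cumsum(y_true[order]) / y_true.sum()
population = np.arange(1, len(y_true) + 1) / len(y_true)

# Lift: ratio of the model's result to the no-model (random) baseline;
# lift > 1 means the model beats a random guess at that depth
lift = gain / population
```

At the top 10% of the ranked population the lift is 2.5 (the model finds positives 2.5 times faster than chance); at 100% it inevitably falls back to 1.0.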

Metrics- K-S Chart

_images/K-S.png

image K-S

K-S is a measure of the degree of separation between the positive and negative distributions.
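The K-S statistic is the maximum distance between the empirical CDFs of the scores for the two classes. A NumPy sketch on hypothetical model scores:

```python
import numpy as np

# Hypothetical model scores for the positive and negative classes
pos_scores = np.array([0.9, 0.8, 0.7, 0.65, 0.4])
neg_scores = np.array([0.6, 0.5, 0.35, 0.3, 0.2])

# Evaluate both empirical CDFs at every observed score
thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))
cdf_pos = np.searchsorted(np.sort(pos_scores), thresholds,
                          side="right") / pos_scores.size
cdf_neg = np.searchsorted(np.sort(neg_scores), thresholds,
                          side="right") / neg_scores.size

# K-S statistic: the maximum separation between the two distributions
ks = np.max(np.abs(cdf_pos - cdf_neg))
```

A K-S value near 1 means the model assigns clearly different score ranges to the two classes; near 0 means the distributions overlap almost completely.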

AUC- Area under the ROC Curve

_images/AUC.png

image AUC

On the Y axis we can see the True Positive Rate - samples that were evaluated correctly - and on the X axis the False Positive Rate - samples evaluated incorrectly. Plotting each of the results gives the ROC curve. We then evaluate the AUC, the area under the ROC curve: if the value is near zero, the selected model is not classifying correctly; if the value is near 1, it is. In our case the number is 0.92.
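The AUC can be computed with scikit-learn from the true labels and the model’s scores; the values below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

# AUC near 1.0: good separation of classes; near 0.5: no better than chance
auc = roc_auc_score(y_true, y_score)
```

Equivalently, the AUC is the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one (here 15 of the 16 positive/negative pairs are ranked correctly, giving 0.9375).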

Model comparison

The SEDGE application now allows data analysts to review and compare models by MAPE (mean absolute percentage error), RMSE (root mean square error), MSE (mean squared error), MAE (mean absolute error), and MLCH. This allows the analyst to save and test models for the greatest accuracy.

_images/Modelcompare.png

image Model Comparison

Trained models can be shared with users, who can then access them from any other environment.

_images/Modelsharing.png

image Modelsharing

Actual vs predicted

Actual vs. predicted plots are the visualization of actual vs. predicted values for all predictive models.

_images/actual.png

image Actual vs Predicted

The above graph compares the actual values against the predicted values.

Shapley importance

SHAP, or SHapley Additive exPlanations, is a visualization technique that can make a machine learning model more explainable by visualizing its output. It can explain the prediction of any model by computing the contribution of each feature to the prediction. If the SHAP value is close to zero, the data point contributes very little to predictions. If the SHAP value is strongly positive or strongly negative, the data point greatly contributes to predicting the positive or negative class.

_images/Predictionshapelyimportance.png

Shapley importance
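SEDGE’s exact implementation is not shown, but the Shapley computation underlying SHAP can be illustrated on a toy coalition game. The two “features” and their payoffs below are entirely hypothetical, standing in for a model’s prediction when only a subset of features is known:

```python
from itertools import permutations

# Hypothetical value of each feature coalition
payoff = {frozenset(): 0,
          frozenset({"age"}): 10,
          frozenset({"fare"}): 20,
          frozenset({"age", "fare"}): 40}

def shapley(features, v):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which features can be added to the coalition."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        seen = frozenset()
        for f in order:
            phi[f] += v[seen | {f}] - v[seen]
            seen = seen | {f}
    return {f: total / len(orderings) for f, total in phi.items()}

values = shapley(["age", "fare"], payoff)
```

The values always sum to the full-coalition payoff (the efficiency property), which is why SHAP values of all features add up to the difference between a prediction and the baseline.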