Data analytics

_images/z_data_analysis.png

Click on the DashPro icon, which will redirect you to the screen where a project can be created.

_images/data_analysis.png

Data Analysis

  • A - For project creation, click on “Create project” which will redirect you to your data sources (screen number 4)

  • B - Here you can see a list of all your current projects

  • C - By clicking on the tick box next to your project, you can select the project

  • D - Delete the selected project

_images/projects.png
Other possible actions with the selected project:
  • E - Load the data to perform analysis

  • F - Delete a file from the project

  • G - Download the data into the local system as CSV file

Data sources

Connecting to the file system

_images/z_file_sys.png

Data Sources- Screen number 4.1

  • A - For uploading a file from your computer, select file upload

  • B - Select file that you are planning to upload

  • C - Select the file type - CSV, JSON, Parquet, Avro, S3 or HDFS

_images/file_upload.png

file upload

  • D - Select your delimiter – Tab, Semicolon, Comma, Space, or Other

  • E - After completing previous steps click on Add files

Connecting to Database

_images/z_con_db.png

Data Sources- Screen number 4.2

  • A - For uploading a file from a database, select DB connection

  • B - If you wish to upload a new file, select New Connection tab

  • C - If you already have a connection, select Existing Connection tab

_images/exist_dbconnection.png
  • D - Within new connection, select the database type from the following options

_images/connection.png
  • E - Fill in your username and password

DB Schemas and Access

This section describes uploading data into SEDGE from different types of databases. SEDGE can connect to several databases, such as MySQL, MariaDB, Oracle, and PostgreSQL.

A future release of SEDGE will also cover other databases such as Cassandra, SQLite, Presto, Redshift, and Redis. If you need assistance connecting to these databases, talk to our support team, who can provide the functionality to connect to them as well.

Accessing Database

_images/DBSCHEMA%26ACCESS.png

DB Schemas and Access

  • A - For uploading a file from a database, select DB connection

  • B - If you wish to upload new file, select New Connection tab

  • C - If you already have a connection, select Existing Connection tab

Connecting to DB

_images/existing_connection_blank.png

existing connection 1

  • D - Within new connection, select the database type from the following options. We demonstrate the example using the MySQL database

_images/Database_type.png

database type

  • E - Fill in the class name of your database

  • F - Fill in the host name assigned to your database

  • G - Fill in the port number

  • H - Fill in the username

  • I - Fill in the password

  • J - Proceed with the connect button

Selecting existing DB Connections

_images/save_connection.png

save connection

After filling in the above fields, a new window appears:
  • A - Fill in the Database and Connection Name

  • B - Save connection (this will then appear in the Existing Connections tab)

  • C - Or proceed directly to SQL Editor

_images/existing_connections.png

existing connections 2

Your project was successfully uploaded and now appears among the Existing Connections.
  • A - You can open the project by clicking on the Connect button and you will be redirected to the SQL editor

  • B - You can erase the project by clicking on the Bin button

Working with SQL editor

_images/SQLUPD1.png

SQL editor

  • A - This section displays the tables available within the selected database in your SQL workspace. You can browse and query these tables as needed.

  • B – This is where you can write and manage your SQL queries. It supports all types of SQL operations such as retrieving data, filtering, sorting, joining tables, and performing calculations or data transformations.

  • C – After writing your SQL query, click this button to run it and view the results in the preview section below.

  • D – This button clears the SQL editor, allowing you to start fresh with a new query.

  • E – Automatically formats your SQL code for better readability and structure.

  • F – Click to expand and view a history of past queries, allowing easy access and reuse.

_images/QUERYLISTUPD.png
  • G – Allows you to control how many rows are fetched and displayed in the result preview pane after executing a query. You can choose from preset options such as:

    • Limit to 100 rows

    • Limit to 500 rows

    • Limit to 1000 rows

    • Limit to 10,000 rows

    • Limit to 50,000 rows

    • Limit to 100,000 rows

  • H – Saves the current result set. You must provide a project name and file name to save, and the saved output will be listed under your Projects page in SEDGE.

  • I – Allows you to export the query results into formats like CSV or Excel for further use or sharing.

NLP Integration with SQL

_images/NLPWSQL.png

NLP Integration with SQL

  • A - Turn this switch on to enable Natural Language to SQL conversion using AI.

  • B - Enter your query in plain English — then click the Run button to generate the corresponding SQL automatically.

  • C - The system converts your prompt into a valid SQL query, which you can review and run to get the output below.

Joining multiple tables in SQL query Editor

_images/SQL_join.png

SQL join

SQL Join query example

We can use the SQL Query Generator to combine data from multiple tables using joins, which help relate information across datasets. It supports various types of joins, including Inner Join, Left Join, Right Join, and Full Join — enabling flexible data merging based on matching fields such as IDs or keys.
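A minimal sketch of an inner join, using an in-memory SQLite database. The table and column names (`passengers`, `tickets`) are hypothetical, chosen only to illustrate how matching on a key column merges two datasets:

```python
import sqlite3

# In-memory database with two small illustrative tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE passengers (id INTEGER, name TEXT);
    CREATE TABLE tickets (passenger_id INTEGER, fare REAL);
    INSERT INTO passengers VALUES (1, 'Allen'), (2, 'Brown'), (3, 'Clark');
    INSERT INTO tickets VALUES (1, 7.25), (2, 71.28);
""")

# INNER JOIN keeps only rows with a match in both tables;
# a LEFT JOIN would also keep Clark, with a NULL fare.
rows = conn.execute("""
    SELECT p.name, t.fare
    FROM passengers p
    INNER JOIN tickets t ON t.passenger_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)  # [('Allen', 7.25), ('Brown', 71.28)]
```

Swapping `INNER JOIN` for `LEFT JOIN`, `RIGHT JOIN`, or `FULL JOIN` changes which unmatched rows survive the merge.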

Connecting to API

_images/z_con_api.png

Data Sources- Screen number 4.3

  • A - Select API Connection to fetch and upload data from external platforms.

  • B - To upload data through a new API, select the New Connection tab.

  • C - To reuse a previously connected API, select the Existing Connection tab.

_images/z_api_existtab.png

existing connection 1

  • D - Within new connection, select the type of API you want to connect to (e.g., JIRA).

    We demonstrate the example using the JIRA API

_images/z_api_newcon.png
  • A - Enter your JIRA Server URL

    (e.g., https://yourdomain.atlassian.net). This is the base URL of your JIRA instance.

  • B - Provide your JIRA Username.

  • C - Paste your JIRA API Token

    which you can generate from JIRA → Profile → Account Settings → Security → API Tokens.

  • D - Click the Connect button to establish the connection. If needed, use Reset to clear all fields.

    After a successful connection a new field appears

  • E - Enter a unique Connection Name and click “Save Connection” to reuse this setup in the future.

Your connection was successfully saved and is now listed under Existing Connections.

_images/z_api_exist_tab.png

existing connection 2

  • A - Click this icon to initiate a connection with the selected JIRA project. This allows you to fetch and analyze data from the linked source.

  • B - Use this icon to update connection details such as the JIRA URL, username, or API token.

  • C - Click the trash icon to permanently remove the connection from the system. This action is irreversible, so use it with caution.

File upload

_images/z_fileupload.png

upload file - Screen no 5

  • A - After selecting your file, you can enter a Project Name (optional). If left empty, the system will use the file name by default.

  • B - In the Action column:

    Use the ✏️ icon to rename the file.

    Use the ➖ icon to remove the file.

  • C - In the No of Rows box, enter how many rows you want to upload from the file. This helps when working with large files.

  • D - Use the SQL Editor to preview or modify the uploaded data using SQL queries before continuing.

  • E - After completing these steps, click the Upload button to proceed to the data preview page. The upload may take some time depending on the file size.

    Users can also zip the CSV file, so that large CSV files can be uploaded.

Multiple upload

It is also possible to upload multiple files simultaneously: SEDGE allows multiple files to be uploaded in one upload form.

_images/z_multipleupload.png

Multipleupload

File collaboration

File collaboration in SEDGE facilitates sending and receiving files and collaborating with other users. This is especially valuable for users in different geographic locations and time zones. It takes only a single click to send a file and grant another user access to it.

_images/fcollablist.png

Filecollaboration

Data preview

_images/Datacanva.png

Data Preview - Screen no 6

For our demonstration, we have chosen the “Titanic” data. In the following steps, we will forecast, based on the available data, the chance of survival on board the Titanic. The factors taken into consideration are: class, name, sex, age, number of family members, parch, ticket number, fare, cabin, and port of embarkation. Some of these factors might not be relevant for the predictive analysis, so we will also demonstrate how to erase or ignore some of the data.

Note

This page is only a preview of the full data; a small sample is displayed here to give a first impression.

  • A - This row displays all the variables in the current analysis. You can sort the data by clicking on the arrows next to the respective variable

  • B - This row identifies if the data in a column are categorical or numerical (C or N)

  • C - Monitor in which phase of the analysis you are currently located – identified by the green colour

  • D - Information about the total number of rows and columns in the uploaded file

  • E - Action buttons to move either to the next step or return to the previous step

Profile Dataset

_images/pfpdataset.png

Profile Dataset - Screen no 7

  • A - Dataset info – basic information about the respective data set, such as the number of variables, number of observations/rows, percentage of missing data, and warnings

  • B - List of all columns/variables in the data set

  • D - Information if the variable is categorical or numerical

  • E - How many unique values are there within each variable (sex – male or female – 2 possible values, name – unlimited number of values)

  • F - Mean – the average value of the respective numerical variable, found by adding all the values in the data set and then dividing by the number of observations

  • G - Median – the middle value in the data set – value that separates the higher half of the data sample from the lower half

  • H - SD – standard deviation represents how close or far the values are dispersed from the mean

  • I - Total number of missing values within the respective variable

  • J - In this tab, you have to select the target you are going to be predicting. It is one of the variables that are already part of the dataset. In our case, we are going to be predicting if the Titanic passenger did or did not survive

_images/profilingsedge.png

feature list
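The mean, median, and standard deviation shown in the profile (items F, G, and H) can be reproduced with Python's standard library. The ages below are a small hypothetical sample, not values from the Titanic file:

```python
import statistics

# Hypothetical sample of an "age" column
ages = [22, 38, 26, 35, 35, 54, 2, 27]

mean = statistics.mean(ages)      # sum of values / number of observations
median = statistics.median(ages)  # middle value of the sorted sample
sd = statistics.stdev(ages)       # dispersion of the values around the mean

print(mean, median)  # 29.875 31
```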

  • K - Sampling of data – you can base your research only on a limited part of the dataset (for example when the dataset is extremely large).

  • If you wish to sample your data, after clicking Yes, you have to select the sampling type

  • In the next step, select the size type – percentage, absolute value, or by the calculator

_images/samplingsedge.png
_images/samplingsedge2.png

sampling

  • Select the percentage of the sample you are going to base your analytics on

_images/Percentage.png

sampling percent

  • Select the variable

_images/variable_selection.png

variable selection

  • After selecting your preferences click on Apply button

  • L - GDPR - Data pseudonymization

  • Depending on the nature of your data, you can choose whether you wish to protect personal data. If you select Yes, the next step will take you to the GDPR page, where you will select which data you want to hide. If you do not need to hide any of the data, select the No button and the GDPR page will be skipped

_images/dataprotection.png

data protection

_images/dataprtnew.png

GDPR

Statistics

Statistics- Numerical Operations

_images/Numericaloperations.png

image Numerical Operations

Absolute Value

The operation converts a numerical value into its absolute value, exclusively accepting numerical inputs. For instance, temperature can be transformed into its absolute value to determine its distance from zero. Upon applying the operation, a success notification appears in the top-right corner, and the newly generated column is displayed at the bottom of the preview. After the column is added, the importance of the variables may change.

_images/absoperation.png

image showing message notification after ABS function (absolute)

Ceiling

Returns the closest integer greater than or equal to a given number, used for getting the nearest integer up

Floor

Returns the closest integer less than or equal to a given number, used for getting the nearest integer down
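The ceiling, floor, and absolute-value operations map directly onto Python built-ins, sketched here with arbitrary sample values:

```python
import math

up = math.ceil(4.2)     # 5, the nearest integer up
down = math.floor(4.8)  # 4, the nearest integer down
magnitude = abs(-17.5)  # 17.5, the distance from zero

print(up, down, magnitude)  # 5 4 17.5
```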

Unit converter

Simplifies converting measurements between different units, for example converting inches to centimeters or pounds to kilograms.

_images/stats-numunitconvertor.png

Numerical conversion to hexadecimal form

Converts a number from the standard decimal format to the hexadecimal system, which uses 16 symbols instead of the 10 used in the decimal system.
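The decimal-to-hexadecimal conversion can be sketched with Python built-ins; 255 is an arbitrary sample value:

```python
value = 255
hex_prefixed = hex(value)        # '0xff', with the conventional 0x prefix
hex_digits = format(value, 'X')  # 'FF', bare uppercase hex digits

print(hex_prefixed, hex_digits)  # 0xff FF
```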

Rounding of values

Round numbers to a specified number of digits, such as rounding housing measurements to 2 digits. The name of the new column can be selected automatically or, if required, chosen manually by the user.

_images/roundoff.png

image round
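Rounding to 2 digits can be sketched as follows; the measurement values are hypothetical:

```python
measurements = [2.34567, 10.998, 7.1]
rounded = [round(x, 2) for x in measurements]  # round each value to 2 digits
print(rounded)  # [2.35, 11.0, 7.1]
```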

Num –> Cat

Converting a numerical column to a categorical format. In case of missing values, they can be managed by either deletion or imputation, techniques elaborated upon in the forthcoming “Data” chapter.

Upon applying this function, the successful validation will appear.

_images/Numtocat.png

Num format conversion

Generates the textual form of numeric values in different languages. The value will be displayed in words in the chosen language.

_images/Numformat.png

Num format conversion

Note

Numerical operations can be performed only on numerical variables.

Statistics- String

_images/string.png

String

Substring

A Substring is a contiguous sequence of characters within a larger string. It is a portion of a string that is extracted or considered separately based on its starting position and length.

Steps to Perform
  • Select a column, then click on ‘Substring’

  • Choose position (Right, Middle, Left, Exclude String, or Split by Character/Pattern) based on your requirement.

  • Define length and choose ‘Auto’ or ‘User Defined’

  • Click on ‘Apply’

Middle

For the string “houstan”, the selected position is “Middle” with a start index 1 and length of 5. This operation will return the middle portion. For the middle position it is required to select the start index.

The output will be: houst.

_images/substring_middle.png

Example for substring for the position - Middle
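The Middle and Left positions correspond to simple slicing. Note the UI counts the start index from 1 while Python slices count from 0 (an assumption inferred from the documented output):

```python
s = "houstan"
middle = s[0:5]        # UI: start index 1, length 5
left = "New York"[:3]  # UI: leftmost 3 characters

print(middle, left)  # houst New
```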

Left

For the string “New York”, the selected position is “Left” with a length of 3. This operation will return the leftmost 3 characters.

The output will be: New.

_images/substringleft.png

Example for substring for the position - Left

Exclude String

For the string “New York”, the selected position is “Exclude String” with a length of 3 and a start index of 1. This operation will exclude the specified characters from the string.

The output will be: York.

_images/substringexclude.png

Example for substring for the position - Exclude String

Split by character/pattern

For the string “Los Angels” and the selected character “s” with the position “after”, the operation will split the string after each occurrence of the character “s”.

The output will be: Angel.

_images/substringsplit.png

Example for substring for the position - Split by Character/Pattern

Note

For the position “Middle” and “Exclude String” it is required to select the start index.

For the position “Split by character/pattern” it is required to choose either “Before” or “After”.

Replace

Replace one or more characters with another pattern/set of strings. Under ‘Pattern which exist,’ provide the pattern that needs to be changed. Then, under ‘Pattern to be replaced with,’ specify the pattern for replacement.

To replace “UK” with “United Kingdom,” provide the following:

_images/replaceimg.png
  • Pattern which exists: “UK”

  • Pattern to be replaced with: “United Kingdom”

The output will be: United Kingdom

_images/replace.png

Example for Replace
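The same replacement can be sketched with `str.replace`, applied to each value of a hypothetical column:

```python
column = ["UK", "UK and France", "Germany"]
# Replace every occurrence of the existing pattern with the new one
replaced = [v.replace("UK", "United Kingdom") for v in column]
print(replaced)  # ['United Kingdom', 'United Kingdom and France', 'Germany']
```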

Replace range

Replaces a set of characters based on index values. Specify the start and end indices of the substring to be replaced, along with the replacement text.

Example: Original word: “Chine”

To replace a range of characters from index 1 to index 5 with “India” provide the following,

  • Start index: 1

  • End index: 5

  • String to be replaced: “India”

_images/rr.png

After completing the replacement operation, the updated word will be: India

_images/replacerange.png

Example for ReplaceRange

Note

After performing the operation a new column will appear at the bottom.

Delete range

Deletes a set of characters based on the index values. To delete a range, specify the start and end indices, then click “Apply”.

Example: Original String: Los Angels

To remove ‘Ang’ from “Los Angels”, specify the indices as follows:

  • Start index: 4

  • End index: 8

_images/dr.png

After applying this deletion, the updated word will be Losels.

_images/deleterange.png

Example for DeleteRange
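The delete-range result can be reproduced with slicing. The documented output “Losels” implies the UI uses a 1-based start index and an exclusive end index (an inferred assumption):

```python
s = "Los Angels"  # spelling as it appears in the sample data
start, end = 4, 8  # UI indices: 1-based start, exclusive end

# Keep everything before the range and everything from the end onward
deleted = s[:start - 1] + s[end - 1:]
print(deleted)  # Losels
```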

Append

Append a single character or multiple characters to each row of the column.

Example:

  • Original String: “Japan”

  • To append “ese” to the end of the word, simply add “ese” to it.

_images/ap.png

After appending, the updated word will be: Japanese.

_images/append.png

Example for Append

Insert

Select the index or position where insertion is desired, then specify the value to be inserted.

Example: Original string: “Phoenix”

  • Index to be inserted: 8

  • Inserted value: “City”

_images/ins.png

After insertion, the updated string will be: PhoenixCity.

_images/insert.png

Example for Insert

Repeat

Choose the number of times the value in the selected column should be repeated.

_images/repeatsta.png

Example:

  • Original string: “UK”

  • To repeat “UK” two times, simply specify the repeat count as 2.

After repeating, the updated string will be: UKUK.

_images/repeat.png

Example for Repeat

Pop

Pop removes and returns the element at the specified index.

Example:

  • Original string: “Chicago”

  • To remove “o” from the string, specify the index as 7.

_images/popimg.png

After the operation, the updated string will be: Chicag.

_images/pop.png

Example for Pop
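Insert, Repeat, and Pop all reduce to slicing and the `*` operator, sketched here with the examples above (indices treated as 1-based, matching the documented outputs):

```python
s = "Phoenix"
inserted = s[:7] + "City" + s[7:]  # insert "City" at 1-based index 8
repeated = "UK" * 2                # repeat count of 2
t = "Chicago"
popped = t[:6] + t[7:]             # pop the character at 1-based index 7

print(inserted, repeated, popped)  # PhoenixCity UKUK Chicag
```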

Reverse item

Iterate through the string in reverse order.

Example:

  • Original string: “UK”

  • Reversed string: “KU”

_images/reverse.png

Example for Reverse

Upper

Convert a column to upper case.

“New York” will be converted into “NEW YORK”.

_images/upper.png

Example for Upper

Lower

Convert a column to lower case.

“New York” will be converted into “new york”.

_images/lower.png

Example for Lower

Find

To determine if an entry matches a searched string, each entry is compared against the specified string. If a match is found, the matched string is returned; if not, False is returned to indicate that the string is not present in the column.

_images/findsta.png
_images/find.png

Example for Find

Length

Count the number of characters in the string.

Example:

  • String: “Chicago”

  • Length of the string: 7

_images/length.png

Example for Length

Trim

Trimming/erasing of leading & trailing spaces from the string.

_images/trim.png

Example for Trim

CountOccurence

Determines the number of times a specific substring appears.

Steps to perform
  • Select Substring to specify a substring pattern.

  • Select Regex to provide a regular expression pattern.

_images/co.png

Example:

  • String: “Japan”

  • Substring pattern: “a”

_images/cos.png

The output will be 2.

_images/countoccurence.png

Example for Count occurence
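Both counting modes can be sketched in Python: a plain substring count and a regular-expression count, using the “Japan” example:

```python
import re

s = "Japan"
by_substring = s.count("a")          # plain substring pattern
by_regex = len(re.findall(r"a", s))  # regular-expression pattern

print(by_substring, by_regex)  # 2 2
```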

Extract

To extract specific information from a string.

Select the Method such as: First word, Last word, Numbers extract.

_images/ext.png

First word:

  • Original String: New York

  • The output will be: New

_images/extract%28first%29.png

Example for Extract - First Word

Last word:

  • Original String: New York

  • The output will be: York

_images/extract%28last%29.png

Example for Extract - Last Word

Numbers extract:

  • Original String: Hello, I am 22 years Old now turning to 23

  • The output will be: 2223

_images/numextract.png

Example for Numbers Extract
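The three extraction methods can be sketched with `split` and a digit regex, reproducing the outputs above:

```python
import re

first = "New York".split()[0]   # first word
last = "New York".split()[-1]   # last word

s = "Hello, I am 22 years Old now turning to 23"
digits = "".join(re.findall(r"\d+", s))  # all digit runs, concatenated

print(first, last, digits)  # New York 2223
```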

Split

Splitting partitions a string value into its component parts. Below are the available split methods.

  • HTTP string

  • Email

  • Invalid

  • URL

_images/split1.png

image Split

Split Http string

This processor splits the elements of an HTTP query string.

Split email

This processor splits an e-mail address into the local part and the domain.

Split invalid

This processor takes all values of a column that are invalid.

Split URL

This processor splits the elements of a URL into multiple columns.
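The e-mail and URL splits can be sketched with the standard library; the address and URL below are placeholders:

```python
from urllib.parse import urlsplit

# E-mail: split into local part and domain at the "@"
local, _, domain = "jane.doe@example.com".partition("@")

# URL: split into scheme, host, path, and query components
parts = urlsplit("https://example.com/search?q=sedge")

print(local, domain)                                   # jane.doe example.com
print(parts.scheme, parts.netloc, parts.path, parts.query)
```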

ExtractNum

Extracts numerical values from text. This function is useful for isolating and retrieving numbers embedded within text, simplifying data cleaning and preprocessing.

Cat -> Num

Convert categorical variables into numerical values only when the selected categorical column contains numbers exclusively. Upon applying this function, the successful validation will appear.

_images/cattonum.png

Note

String operations can be performed only on string variables.

Statistics- Dates

_images/z_dates.png

Dates Extraction

The Dates menu helps to strip out the day, month, year, etc., from a date column. These extracted values act as additional features for the model.

Example: Let's extract the month from the “Date of Hire” column.

_images/extracteg.png

_images/extractoutput.png

Dates Pattern Conversion

This function converts the date pattern into a different form, more suitable for the research. After selecting the input and desired output patterns, click on the Apply button.

Example: Let's convert the date pattern from “yyyy-MM-dd HH:mm:ss” to “yyyy-MM-dd”.

_images/dpconvert.png

This will convert the existing column to the selected date pattern.

_images/dpconvertoutput.png
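The same pattern conversion can be sketched with Python's `datetime`. Note that Python uses strptime/strftime codes (`%Y-%m-%d`) rather than the Java-style pattern letters (`yyyy-MM-dd`) used by SEDGE:

```python
from datetime import datetime

value = "1999-03-22 05:06:07"
# Parse with the input pattern, then format with the output pattern
parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
converted = parsed.strftime("%Y-%m-%d")
print(converted)  # 1999-03-22
```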

There are several types of date patterns and shortcuts that can be present in a date variable. You can design your own format patterns for dates and times from the list of symbols in the following table:

Date Format Examples

Date format   Description                                  Examples
EEEE          Day of the week                              Monday, Tuesday
EEE           Abbreviated day of the week                  Mon, Tue, Wed
MM            Month of the year in two-digit format        03, 07, 12
MMM           Abbreviated month of the year                Jan, Feb, Mar
MMMM          Month of the year                            January, February, March
dd            Day of the month                             01, 15, 30
D             Day in the year                              yyyy-DDD -> 1999-081
yy            Year in two-digit format                     99, 12, 20
yyyy          Year in four-digit format                    1999, 2012, 2020
YYYY          Week-based year                              2009; 09
HH            Hour of the day (0-23)                       00, 12, 23
hh            Clock hour in AM/PM (1-12) format            00, 05, 12
mm            Minute in hour                               00, 30, 59
ss            Second in minute                             00, 30, 59
SSS           Fraction of a second                         235, 512, 820
a             AM/PM marker                                 AM, PM
G             Era designator                               AD, BC
u             Day number of week (1 = Mon, ..., 7 = Sun)   1, 2, 7
ww            Week in year                                 02, 27, 52
W             Week in month                                1, 2, 4
z             General time zone                            GMT, PST, SGT, CET
Z             UTC time offset                              +0000, -0800, +0100
XXX           ISO 8601 time zone                           +01:00, -08:00
'             Escape for text                              H' h 'mm z -> 5 h 06 CET

Time pattern examples

Date and time pattern                        Example
yyyyMMddZ                                    19990322+0100
yyyyMMdd                                     19990322
yyyy/MM/dd H:mm                              1999/03/22 5:06
yyyy.MM.dd HH:mm:ss                          1999.03.22 05:06:07
yyyy.d.M HH:mm:ss                            1999.22.3 05:06:07
yyyy-MM-dd                                   1999-03-22
yyyy-MM-ddXXX                                1999-03-22+01:00
yyyy-MM-dd'T'HH:mm:ss.SSSXXX                 1999-03-22T05:06:07.000+01:00
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'                 1999-03-22T05:06:07.000Z
yyyy-MM-dd'T'HH:mm:ss.SSS                    1999-03-22T05:06:07.000
yyyy-MM-dd'T'HH:mm:ss                        1999-03-22T05:06:07
yyyy-MM-dd HH:mm:ss.S                        1999-03-22 05:06:07.0
yyyy-MM-dd h:mm:ss a                         1999-03-22 5:06:07 AM
yyyy-MM-dd H:mm:ss                           1999-03-22 5:06:07
yyyy-MM-dd G                                 1999-03-22 AD
yyyy-M-d HH:mm:ss                            1999-3-22 05:06:07
yyyy-M-d h:mm:ss a                           1999-3-22 5:06:07 AM
yyyy-DDDXXX                                  1999-081+01:00
yyyy MMMM dd E                               2019 July 24 Wed
yyyy-ww                                      1999-02
yyyy-ww-u                                    2001-27-3
yyyy-'W'ww-u                                 2001-W27-3
yy/MM/dd HH:mm                               99/03/22 05:06
yy/MM/dd H:mm:ss                             99/03/22 5:06:07
MMMM d, yyyy h:mm:ss z a                     March 22, 1999 5:06:07 CET AM
MMM.dd.yyyy                                  Mar.22.1999
MMM d, yyyy h:mm:ss a                        Mar 22, 1999 5:06:07 AM
MM/dd/yyyy HH:mm:ss                          03/22/1999 05:06:07
MM/dd/yyyy h:mm:ss a                         03/22/1999 5:06:07 AM
MM-dd-yyyy HH:mm:ss                          03-22-1999 05:06:07
MM-dd-yyyy h:mm:ss a                         03-22-1999 5:06:07 AM
HH:mm:ss dd/MM/yyyy                          05:06:07 22/03/1999
HH:mm:ss dd-MM-yyyy                          05:06:07 22-03-1999
EEEE, MMMM d, yyyy h:mm:ss a z               Monday, March 22, 1999 5:06:07 AM CET
EEEE, MMMM d, yyyy h:mm:ss 'o''clock' a z    Monday, March 22, 1999 5:06:07 o'clock AM CET
EEEE, MMMM d, yyyy                           Monday, March 22, 1999
EEEE, d MMMM yyyy HH:mm:ss 'o''clock' z      Monday, 22 March 1999 05:06:07 o'clock CET
EEEE, d MMMM yyyy                            Monday, 22 March 1999
EEE, d MMM yyyy HH:mm:ss Z                   Mon, 22 Mar 1999 05:06:07 +0100
EEE MMM dd HH:mm:ss z yyyy                   Mon Mar 22 05:06:07 CET 1999
dd/MMM/yy h:mm a                             22/Mar/99 5:06 AM
dd/MM/yyyy HH:mm:ss                          22/03/1999 05:06:07
dd/MM/yyyy h:mm:ss a                         22/03/1999 5:06:07 AM
dd/MM/yyyy H:mm                              22/03/1999 5:06
dd.MM.yyyy. HH.mm.ss z                       22.03.1999. 05.06.07 CET
dd-MMM-yyyy HH:mm:ss                         22-Mar-1999 05:06:07
dd-MM-yyyy HH:mm:ss                          22-03-1999 05:06:07
dd-MM-yy HH:mm                               22-03-99 05:06
dd MMMM yyyy HH:mm:ss z                      22 March 1999 05:06:07 CET

Characters that are not letters are treated as quoted text. That is, they will appear in the formatted text even if they are not enclosed within single quotes.

Delete Date Range

Removing or deleting all records or entries that fall within a specified range of dates from a dataset.

Steps to perform:

  • Select “From date” and “To date”

  • Select Keep range (will keep only the selected range of dates) or Delete range (will delete only the selected range of dates) based on your preferences.

Example: Let's keep only the records from “6/19/1961” to “8/17/1970”.

_images/deldatrange.png

This will alter the existing column and will keep only the selected range of values.

Flag Date Range

Marking or highlighting all records or entries that fall within a specified range of dates. This is often done to identify, review, or take specific actions on the data within that range.

Example: Let's flag the dates from “5/30/2011” to “2/19/2013”.

_images/flagdatrang.png

Date Difference

Calculation of the amount of time between two dates. This can be measured in various units such as days, months, or years.

Steps to perform:

  • Select the Minuend (the number from which another number, the subtrahend, is to be subtracted) and the Subtrahend (the number that is to be subtracted from the minuend).

  • Choose the output time unit from the dropdown menu.

_images/dropdown.png

Example: Let's compute the difference between the columns “Date of Hire” and “Date of Termination”.

_images/input.png

This will generate a new column at the bottom of the preview.

_images/dd.png
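A date difference in days can be sketched with Python's `datetime`; the two dates below are hypothetical hire and termination values, not taken from the demonstration dataset:

```python
from datetime import date

hire = date(2011, 5, 30)         # hypothetical "Date of Hire" value
termination = date(2013, 2, 19)  # hypothetical "Date of Termination" value

# Subtracting two dates yields a timedelta; .days gives the day count
difference_days = (termination - hire).days
print(difference_days)  # 631
```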

Flag Holiday

Flags records whose date falls on a holiday or weekend, so they can be identified and treated separately in the analysis.

Steps to perform:

  • Select flag type from the dropdown either holiday or weekend.

  • Select Country or Region.

  • Click on Apply

Example: Let's flag the holidays for Australia.

_images/fh.png

The output will be a boolean value (true or false).

_images/fgoutput.png

Statistics- Maths

_images/maths.png

Statistics - Maths

Sign conversion

Convert a positive to negative sign and vice versa. Choose between positive or negative conversion, and the result will be displayed in the existing column.

Before applying sign conversion,

_images/originalsign.png

Example for before applying sign conversion

After applying sign conversion,

_images/negativesign.png

Example for after applying negative sign conversion

Note

This function will not generate a new column for the output. It will modify the original column.

Binning

Select Binning Type such as: Interval binning, Equal width and Custom Binning.

Interval binning

Interval binning involves dividing the range of values into equal-sized intervals.

For example, if the data ranges from 11 to 100 and you want 5 bins:

_images/intervalbinning.png

Example for Interval Binning

Equal width

Equal width binning is a type of interval binning where each bin has the same width. For example, if the data represents the heights of students in a class, ranging from 120 cm to 180 cm, and you need to create 4 bins:

_images/equalwidthbinning.png

Example for Equal Width Binning
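The equal-width example (heights from 120 cm to 180 cm, 4 bins) can be sketched directly: each bin spans (180 − 120)/4 = 15 cm, and a value's bin is found by integer division:

```python
low, high, n_bins = 120, 180, 4
width = (high - low) / n_bins  # each bin spans 15 cm

# Bin edges: 120, 135, 150, 165, 180
edges = [low + i * width for i in range(n_bins + 1)]
print(edges)  # [120.0, 135.0, 150.0, 165.0, 180.0]

def bin_index(x):
    """0-based bin for height x, clamping the top edge into the last bin."""
    return min(int((x - low) // width), n_bins - 1)

print(bin_index(152))  # 2  (third bin, 150-165 cm)
```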

Custom Binning

Custom binning involves defining bins based on specific criteria or requirements.

Addition

Add multiple columns. At least two columns are required.

_images/add.png

Example for Addition of two columns

Subtraction

Subtract multiple columns.

_images/sub.png

Example for Subtraction of two columns

Multiplication

Multiply multiple columns.

_images/multiplication.png

Example for Multiplication of two columns

Division

Divide multiple columns. It is the inverse of multiplication.

_images/division.png

Example for Division

Note

The functions (D, E, F) are designed to operate sequentially from left to right.

Remainder

The remainder is the integer left over after division, indicating the part that remains unallocated.

_images/Remainder.png

Example for Remainder

Data transformation

This function includes several mathematical operations - Box cox, Yeo Johnson, Cubic root, Exponential, Log, Normalization, Power2, Power3, Reciprocal, Square root and Standardization. After selecting your chosen operation a new column is created at the bottom.

Box Cox

The Box-Cox transformation is a statistical method used to stabilize variance and make data more normally distributed (symmetric bell-shaped distribution where the mean, median, and mode are all equal)

Suppose a dataset of house prices is skewed to the right, indicating that most of the house prices are at the lower end.

To apply the Box-Cox transformation mathematically,

Choose a range of values for the λ (lambda) parameter to test the transformation. Let’s say we try λ values ranging from -2 to 2.

For each λ value, calculate the transformed values using the Box-Cox transformation formula:

y_i = \begin{cases} \frac{x_i^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \ln(x_i), & \text{if } \lambda = 0 \end{cases}

y_i represents the transformed value and x_i represents the original value.

Example:

Consider the dataset:

X={1200,2000,4500,800,3500,3000,1500,2500,4000,1000}

For λ ≠ 0, with λ = 0.2784475681948208:

y_1 = (1200^λ - 1)/λ ≈ 22.26977
y_2 = (2000^λ - 1)/λ ≈ 26.22267
y_3 = (4500^λ - 1)/λ ≈ 33.77536
y_4 = (800^λ - 1)/λ ≈ 19.50882
y_5 = (3500^λ - 1)/λ ≈ 31.24991
y_6 = (3000^λ - 1)/λ ≈ 29.78606
y_7 = (1500^λ - 1)/λ ≈ 23.92758
y_8 = (2500^λ - 1)/λ ≈ 28.13388
y_9 = (4000^λ - 1)/λ ≈ 32.56974
y_10 = (1000^λ - 1)/λ ≈ 20.98964

Note

The lambda (λ) varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; the "optimal value" is the one which results in the best approximation of a normal distribution curve.
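The same transformation can be reproduced outside SEDGE for reference. The sketch below uses scipy: `boxcox` fits λ by maximum likelihood when no λ is given, and applying the formula directly with the λ quoted in the example reproduces the y_i values listed above.

```python
import numpy as np
from scipy import stats

# House-price values from the worked example above.
prices = np.array([1200, 2000, 4500, 800, 3500, 3000, 1500, 2500, 4000, 1000],
                  dtype=float)

# Let scipy search for the lambda that best normalizes the data (MLE).
transformed, fitted_lambda = stats.boxcox(prices)

# Applying the formula directly with the lambda quoted in the example
# reproduces the y_i values listed above.
lam = 0.2784475681948208
manual = (prices ** lam - 1) / lam
```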

_images/boxcox.png

Example for Box Cox Transformation

Yeo Johnson

The Yeo-Johnson transformation extends the Box-Cox transformation to handle both positive and negative values, as well as zero values. It is useful for stabilizing variance and making data more normally distributed.

y_i = \begin{cases} \frac{(x_i + 1)^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0, x_i \geq 0 \\ \ln(x_i + 1), & \text{if } \lambda = 0, x_i \geq 0 \\ -\frac{(|x_i| + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \text{if } \lambda \neq 2, x_i < 0 \\ -\ln(|x_i| + 1), & \text{if } \lambda = 2, x_i < 0 \end{cases}

where:

  • x_i is the original value,

  • y_i is the transformed value,

  • λ is the transformation parameter, fitted separately for each feature.

Example:

Let’s consider x = 3 and lambda (λ) = 0.7995651776710038,

x'_1 = \frac{(3 + 1)^{0.7995651776710038} - 1}{0.7995651776710038} \approx 2.5384

For x = 6 with the same lambda,

x'_2 = \frac{(6 + 1)^{0.7995651776710038} - 1}{0.7995651776710038} \approx 4.6766

Note

Factors influencing the lambda value include the data’s skewness, distribution, variance, sensitivity to outliers and optimization method. The lambda value is empirically determined to maximize the normality and symmetry of the transformed data.
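As a reference sketch (not SEDGE's own code), scipy's `yeojohnson` applies the piecewise formula above when given a fixed λ; for positive x and λ ≠ 0 it matches the first branch exactly.

```python
import numpy as np
from scipy import stats

lam = 0.7995651776710038

# Transform two positive values with a fixed lambda.
y = stats.yeojohnson(np.array([3.0, 6.0]), lmbda=lam)

# For x >= 0 and lambda != 0 this matches the first branch of the formula.
manual_3 = ((3.0 + 1) ** lam - 1) / lam
manual_6 = ((6.0 + 1) ** lam - 1) / lam
```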


_images/yeojohnson.png

Example for Yeo Johnson Transformation

Cubic Root

The cubic root of a number is the value that, when multiplied by itself twice, yields the original number. For instance, the cubic root of 8 is 2.

y_i = \sqrt[3]{x_i}

where:

  • x_i is the original value,

  • y_i is the transformed value.

Example:

\sqrt[3]{27} = 27^{1/3} = 3.0

\sqrt[3]{64} = 64^{1/3} = 4.0
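For reference, numpy's `cbrt` performs this element-wise:

```python
import numpy as np

values = np.array([27.0, 64.0, 8.0])
roots = np.cbrt(values)  # element-wise cube root: 3.0, 4.0, 2.0
```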


_images/cubicroot.png

Example for Cubic Root

Exponential

Exponential refers to a mathematical operation or function where a constant (the base) is raised to the power of an exponent, resulting in a rapidly increasing or decreasing function. It is commonly represented as a^x, where a is the base and x is the exponent.

An example of an exponential function is f(x) = 2^x. Let’s evaluate this function for some values of x:

  • When x = 0, f(0) = 2^0 = 1.

  • When x = 1, f(1) = 2^1 = 2.

_images/exponential.png

Example for Exponential Transformation

Log

The log transformation is a mathematical operation that calculates the logarithm of a number. It is commonly used to transform skewed data to make it more symmetrically distributed.

The transformation formula is as follows:

y_i = \log(x_i)

where:

  • x_i is the original value,

  • y_i is the transformed value.

Example:

\log_{10}(100) = \frac{\ln(100)}{\ln(10)}

We can substitute \ln(100) and \ln(10) with their respective values:

\ln(100) = 2\ln(10) \approx 4.60517 \qquad \ln(10) \approx 2.30259

Finally:

\log_{10}(100) \approx \frac{4.60517}{2.30259} = 2

Therefore, \log_{10}(100) is exactly 2, since 10^2 = 100.

\log_2(8) = \frac{\ln(8)}{\ln(2)}

We can substitute \ln(8) and \ln(2) with their respective values:

\ln(8) = 3\ln(2) \approx 2.07944 \qquad \ln(2) \approx 0.69315

Finally:

\log_2(8) \approx \frac{2.07944}{0.69315}

This simplifies to:

\log_2(8) \approx 3

Therefore, \log_2(8) is equal to 3.
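The change-of-base rule used in both examples can be checked with Python's standard `math` module:

```python
import math

# Change of base: log_b(x) = ln(x) / ln(b)
log10_100 = math.log(100) / math.log(10)  # = 2
log2_8 = math.log(8) / math.log(2)        # = 3
```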

_images/log.png

Example for Log Transformation

Normalization

Normalization is a technique used to scale numeric features to a standard range, typically between 0 and 1 or between -1 and 1. It is used in statistical analyses to ensure that all features have the same scale and to prevent features with larger values from dominating those with smaller values.

For Example, Consider the following dataset,

{Data} = [2, 5, 8, 10, 15]

To normalize these values using min-max scaling, we’ll use the formula:

x_{\text{norm}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}

Where:

  • x_min= 2 is the minimum value of the dataset,

  • x_max = 15 is the maximum value of the dataset.

Now, we can calculate the normalized values using the formula:

For x = 2:

x_{\text{norm}} = \frac{2 - 2}{15 - 2} = 0

For x = 5:

x_{\text{norm}} = \frac{5 - 2}{15 - 2} = \frac{3}{13} \approx 0.2308

For x = 8:

x_{\text{norm}} = \frac{8 - 2}{15 - 2} = \frac{6}{13} \approx 0.4615

For x = 10:

x_{\text{norm}} = \frac{10 - 2}{15 - 2} = \frac{8}{13} \approx 0.6154

For x = 15:

x_{\text{norm}} = \frac{15 - 2}{15 - 2} = 1

So, after normalizing the data using min-max scaling, the normalized dataset would be:

Normalized Data = [0, 0.2308, 0.4615, 0.6154, 1]

These values are now scaled to the range between 0 and 1.
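The min-max formula above is a one-liner in numpy (a reference sketch):

```python
import numpy as np

data = np.array([2, 5, 8, 10, 15], dtype=float)

# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1.
normalized = (data - data.min()) / (data.max() - data.min())
```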

_images/normalization.png

Example for Normalization

Power2

The term “Power of 2” typically refers to the operation of exponentiation, where a number is raised to the second power. In mathematical notation, it is represented as "x^2" where, x is the base number. This operation involves multiplying the base number by itself.

For example:

2^{2} = 4

In general, raising a number to the power of 2 is known as squaring that number.

_images/power2.png

Example for Power 2

Power3

The term “power of 3” also refers to exponentiation, where a number is raised to the third power. In mathematical notation, it is represented as "x^3" where, x is the base number. This operation involves multiplying the base number by itself twice.

For example:

1200^{3} = 1,728,000,000

In general, raising a number to the power of 3 is known as cubing that number.

_images/power3.png

Example for Power 3

Reciprocal

The reciprocal of a number is calculated by dividing 1 by that number.

Mathematically, if we have a number represented as x, the reciprocal of x is represented as \frac{1}{x}.

For example:

x = 5, the reciprocal of 5 = 1/5.

Similarly, x = 0.5, the reciprocal of 0.5 is 1/0.5 = 2.

The reciprocal of a number essentially represents how many times that number fits into 1. For instance, the reciprocal of 5 is 1/5, meaning 5 fits into 1 exactly one-fifth of a time.

_images/reciprocal.png

Example for Reciprocal

Square Root

The square root of a number is a value that, when multiplied by itself, gives the original number. In mathematical notation, the square root of a number x is represented as √x.

For example:

\sqrt{25} = 5. Since 5 * 5 = 25

The square root operation can also be denoted using exponentiation: x^(1/2). This means raising x to the power of one-half, which is equivalent to taking the square root of x.

_images/squareroot.png

Example for Square Root

Standardization

In statistics, standardization is a process used to transform data so that it has a mean of zero and a standard deviation of one. This technique is applied to individual variables within a dataset, making them comparable and facilitating statistical analysis.

The formula for standardizing a variable x is:

z = \frac{x - \mu}{\sigma}

Where:
  • x is the original value of the variable,

  • \mu is the mean of the variable,

  • \sigma is the standard deviation of the variable, and

  • z is the standardized value.

Standardization transforms the distribution of the data to have a mean of zero and a standard deviation of one. This process does not change the shape of the distribution but places it on a common scale, allowing for easier comparison between variables and datasets.

Standardization is particularly useful in statistical modeling, machine learning, and data analysis, where it helps improve the interpretability and performance of models by ensuring that variables are on a consistent scale.

Example:

Consider the data x= [5, 20, 8, 3, 4, 6, 15, 12, 9, 10]

Mean \bar{x} = \frac{{5 + 20 + 8 + 3 + 4 + 6 + 15 + 12 + 9 + 10}}{{10}} = 9.2

Standard Deviation \sigma = \sqrt{\frac{{\sum_{i=1}^{10} (x_i - \bar{x})^2}}{{10}}} \approx 5.04

Standardized Data z_i = \frac{{x_i - \bar{x}}}{{\sigma}}

where x_i is each data point in the given data.

z_1 = \frac{5 - 9.2}{5.04} \approx -0.83
z_2 = \frac{20 - 9.2}{5.04} \approx 2.14
z_3 = \frac{8 - 9.2}{5.04} \approx -0.24
z_4 = \frac{3 - 9.2}{5.04} \approx -1.23
z_5 = \frac{4 - 9.2}{5.04} \approx -1.03
z_6 = \frac{6 - 9.2}{5.04} \approx -0.64
z_7 = \frac{15 - 9.2}{5.04} \approx 1.15
z_8 = \frac{12 - 9.2}{5.04} \approx 0.56
z_9 = \frac{9 - 9.2}{5.04} \approx -0.04
z_{10} = \frac{10 - 9.2}{5.04} \approx 0.16
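The example can be checked with numpy; note `np.std` defaults to the population standard deviation (ddof=0), matching the formula above:

```python
import numpy as np

x = np.array([5, 20, 8, 3, 4, 6, 15, 12, 9, 10], dtype=float)

# z-score: subtract the mean, divide by the (population) standard deviation.
z = (x - x.mean()) / x.std()

# By construction the standardized data has mean 0 and standard deviation 1.
```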

Note

Mathematical operations can be performed only on numerical variables.

Statistics - Text

Text Processing

Text processing is used to analyze the most frequently used words in a written text. It can also be used to identify spam emails by searching for specific keywords. Text processing produces the top 100 most used words as new columns.

Select the preferred N-grams from 1 to 5.

Select the Encode type: Boolean (0 or 1) or Count.

Select custom stop words if required.

Click on the Apply button.

Note

The output will appear at the bottom of the page. You can also preview the output on the preview page.


_images/textprocessing.png

Example for Text Processing

Text->Cat

Converts Text data type into Categorical data type.

Text data type before conversion,

_images/beforeconversion.png

Example for Text->Cat before conversion

After conversion,

_images/afterconversion.png

Example for Text->Cat after conversion

Note

This operation will not generate a new column for the output; instead, it will only modify the data type within the same column, altering the output accordingly.

_images/output.png

Note

Text functions can be performed only on text variables.

Statistics- Data

Label encoding

Label encoding is a technique used to convert categorical variables into numerical format.

Consider a dataset with a categorical variable “Temperature” and its corresponding values are “Hot,” “Cold,” and “Warm.”

Using label encoding, numerical labels are assigned to each unique category. The labels are assigned based on alphabetical order.
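A reference sketch of the same idea in pandas: `pd.Categorical` sorts the unique categories alphabetically, so "Cold" → 0, "Hot" → 1, "Warm" → 2.

```python
import pandas as pd

df = pd.DataFrame({"Temperature": ["Hot", "Cold", "Warm", "Hot", "Cold"]})

# Categories are ordered alphabetically: Cold=0, Hot=1, Warm=2.
df["Temperature_encoded"] = pd.Categorical(df["Temperature"]).codes
```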

Note

This function is applicable only to categorical columns.


_images/labelencoding.png

Example for Label Encoding

One Hot Encoding

One hot encoding is another method used to represent categorical variables in a format suitable for machine learning algorithms. Instead of assigning numerical labels like label encoding, one hot encoding creates binary vectors for each category.
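For reference, pandas' `get_dummies` produces the same binary-vector representation; the generated column names below are illustrative of the automatic naming described in the note that follows.

```python
import pandas as pd

df = pd.DataFrame({"Temperature": ["Hot", "Cold", "Warm"]})

# One binary column per category; exactly one is set in each row.
encoded = pd.get_dummies(df, columns=["Temperature"])
```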

Note

This function is applicable for both categorical and text columns. It does not require a user-defined output column name; instead, it will automatically generate a column name for the output.

_images/onehotencoding.png

Example for One Hot Encoding

Renaming Factor Level

Changing the labels or names of categories in a categorical variable to make them more descriptive or meaningful.

Note

Renaming is only available for text and categorical data types with cardinality of 100 or less.

_images/renamingfactor.png

Example for Renaming Factor

_images/validation.png

Ignore

Refers to excluding certain variables to filter out irrelevant or noisy data.

Note

Multiple columns can also be ignored.

_images/ignore.png

Example for Ignore

Unignore

The reverse of the Ignore function: previously ignored variables can be restored with this function.

_images/unignore.png

Example for Unignore

Impute Missing

Involves filling in missing data points in a dataset.

Techniques available for Numerical Data Type include:

  • Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)

  • Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)

  • Mode (Value that appears most frequently in a dataset. It’s the most common observation or value in the data.)

  • Default

  • Moving Average (Smooths out fluctuations in data by calculating the average of a subset of data points within a sliding window.)

  • Forward Filling (Copying the last observed value forward to fill in missing values.)

  • Backward Filling (Copying the next observed value backward to fill in missing values in a dataset.)

  • Custom Filling (Filling missing values using a user-defined method.)

  • Average of previous and next (Involves taking the average of the values immediately preceding and following the missing value to estimate its value.)

  • KNN (Missing values are replaced with the average or median value of the k nearest neighbors.)
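Several of these techniques map directly onto pandas operations; a sketch with a hypothetical numerical series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

mean_filled = s.fillna(s.mean())   # Mean: fill gaps with the average of known values
ffilled = s.ffill()                # Forward Filling: carry the last observed value forward
bfilled = s.bfill()                # Backward Filling: carry the next observed value backward
neighbor = s.interpolate()         # Average of previous and next (linear interpolation)
```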

The available techniques for “Categorical” variable are,

  • Mode

  • Forward Filling

  • Backward Filling

  • Custom Filling

  • KNN

_images/categorical.png

Techniques for categorical variable

The available techniques for the “Date” variable are,

  • Mode

  • Forward Filling

  • Backward Filling

_images/categorical.png

Techniques for Date variable

The available techniques for the “Text” variable are,

  • Mode

  • Forward Filling

  • Backward Filling

  • Custom Filling

  • KNN


_images/imputemissing.png

Example for Impute Missing

Delete Missing

Deleting rows with missing values can be a preferred approach when those values are critical for analysis and cannot be easily imputed.

This function removes all rows containing missing values across all columns in the dataset, ensuring that only complete observations are retained for analysis.

_images/deletemissing.png

Example for Delete Missing

Copy

Copying a column or creating a duplicate allows for performing two different operations on the same column, which can be useful. This process simply involves creating a copy of the column, and then proceeding with the desired operations. The new column is created at the bottom.

_images/copy.png

Example for Copy

Column Concat

The function is used to concatenate two or more columns into one column. For instance, if there are columns for “First name” and “Surname,” they can be merged into a single column.

For example, “Name” and “Sex” columns are being combined. A delimiter is manually selected to clearly mark the transition between the values of the first and second columns.
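In pandas terms (a sketch with made-up values), the concatenation is simple string addition with the chosen delimiter:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund", "Cumings"], "Sex": ["male", "female"]})

# "_" is the manually selected delimiter between the two column values.
df["Name_Sex"] = df["Name"] + "_" + df["Sex"]
```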

_images/columnconcat.png

Example for Column Concat

Cross Tab

Cross tabulation is a method to organize and summarize data from two categorical variables into a table.

It helps to see how these variables are related by counting the occurrences of each combination of categories.
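pandas' `crosstab` builds the same kind of table, and transposing it swaps the rows and columns (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": ["no", "yes", "yes", "no", "no"],
})

table = pd.crosstab(df["Sex"], df["Survived"])  # counts per category pair
inverse = table.T                               # swap rows and columns
```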

_images/crosstab.png

Example for Cross Tab

To transpose the output, click on the ‘inverse’ option

_images/crosstabinverse.png

Example for Cross Tab Inverse

Filter

Filter the data using multiple conditions. The available filters for each data type are:

Condition

Filter Operator

Description

Applicable Data Types

equal

Filters values exactly matching a specified value.

Text, Number, Date

not equal

Filters values not equal to the specified value.

Text, Number, Date

greater

Filters values greater than a specified threshold.

Number, Date

greater or equal

Filters values greater than or equal to a specified threshold.

Number, Date

less

Filters values less than a specified threshold.

Number, Date

less or equal

Filters values less than or equal to a specified threshold.

Number, Date

is null

Filters values that are blank or missing.

Text, Number, Date, Categorical

is not null

Filters values that are present (not blank).

Text, Number, Date, Categorical

between

Filters values within a defined range (e.g., 500–1000).

Number, Date

not between

Filters values outside a specified range.

Number, Date

in

Filters values that match any value in a specified list.

Text, Number, Date

not in

Filters values that do not match any value in a specified list.

Text, Number, Date

contains

Filters text values that include the specified substring.

Text, Categorical

not contains

Filters text values that do not include the specified substring.

Text, Categorical

doesn’t contain

Same as “not contains”; filters values that do not include substring.

Text, Categorical

begins with

Filters text values that start with the specified substring.

Text, Categorical

ends with

Filters text values that end with the specified substring.

Text, Categorical

_images/addrule.png

Example for Add Rule and Add Group

Steps to Apply Filter:

  • Add Rule – Click this to add a new filtering condition (e.g., priceUSD greater than 100).

  • Add Group – Allows you to create a nested group of filter rules with its own AND/OR logic.

  • AND / OR Toggle – Used to define the logical relationship between multiple filter rules. Choose AND to apply all conditions or OR to apply any one of them.

  • View – Shows a live preview of the filtered data based on current rules.

  • Apply – Applies all active filters and updates the dataset view accordingly.

  • Reset – Clears all defined filters and resets the dataset to its original, unfiltered state.

  • Delete – Removes an individual rule or an entire group of conditions.

  • Filter Rule Components
    • Column Selector: Choose the column to filter (e.g., priceUSD).

    • Operator Selector: Choose a condition type from the filter operators listed above.

Grouped Statistics

Grouped statistics organize data into groups and calculate statistical measures for each group, offering detailed insights into how variables behave within specific categories.

Note

This filter is only applicable to Numerical data types.

The available aggregations are:

  • Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)

  • Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)

  • Variance (Measures how spread out the data is; the average of the squared deviations from the mean.)

  • Standard Deviation (How the data is spread around the mean)

  • Unique (Number of unique values in the dataset)

  • Count (Total number of elements present in a particular variable or column)

  • Sum (Total value obtained by adding together all the individual values in a specific variable)

Example: To calculate the average booking amount for each destination city, select INR_Amount with the Mean aggregation, grouped by the destination-city column.
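The underlying computation is a group-by aggregation; a pandas sketch with hypothetical column names (`to_city`, `INR_Amount`):

```python
import pandas as pd

df = pd.DataFrame({
    "to_city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "INR_Amount": [1000.0, 2000.0, 3000.0, 4000.0],
})

# Mean booking amount per destination city.
grouped = df.groupby("to_city")["INR_Amount"].mean()
```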

_images/groupstat.png
_images/groupedstat.png

Example for Grouped Statistics

Transpose

Transpose refers to the operation of flipping or rotating a dataset, typically switching its rows and columns. This operation can be useful for various purposes, such as changing the orientation of data for easier analysis or visualization, or preparing data for specific computations or algorithms. When a dataset is transposed, the rows become columns and vice versa.

Transpose Methods:

  • Split & Transpose

  • Combine & Transpose

Split & Transpose

Dividing a text string based on a separator and then rearranging the resulting parts into columns or rows.

Note

This functionality will work only for Categorical variable(s).


_images/seperator.png
_images/split%26transpose.png

Example for Split & Transpose

Combine & Transpose

Merging multiple columns into one column, and then rearranging the data from rows to columns or vice versa.

_images/transpose.png
_images/combine%26transpose.png

Example for Combine & Transpose

Drop Duplicates(Row Wise)

Dropping duplicates refers to removing duplicate values within a column or removing entirely duplicated rows. This operation ensures that each row in the dataset is unique, which is important for many data analysis tasks.
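Both variants correspond to pandas' `drop_duplicates` (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "from_city": ["Chennai", "Chennai", "Delhi", "Chennai"],
    "amount": [100, 100, 200, 300],
})

by_column = df.drop_duplicates(subset=["from_city"])  # first row kept per city
by_row = df.drop_duplicates()                         # only fully identical rows removed
```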

The selected column (from_city) contains 2033 duplicate records.

_images/duplicatecount.png
_images/dropduplicates.png

After applying the function to the selected column, the validation results will be displayed.

_images/successvalidation.png

_images/result.png

Result

Statistics- Advanced

Pivot analysis

Analysis based on pivoting data - a Pivot Table is used to summarise, sort, reorganise, group, count, total or average data stored in a table. It allows us to transform columns into rows and rows into columns.

We have two options in this section - either a “Pivot mode” or a regular chart mode.

Regular chart mode

In the regular chart mode we can select all the rows that we are interested in, in our example we selected Age, Fare, Sex, Ticket purchase, Cabin Detail and Number of meals. We get a basic chart with the selected values.

_images/pivotanalysis.png

RegularModeSelect

In the next step, we can also divide our selected columns in groups, by dragging the preferred group type down to the “Row groups” section. We have selected “Gender” to differentiate the groups. We can see we have 3555 items in group “Male” and 3488 items in group “Female”. We can open each group by clicking on the group name.

_images/pivotgrprow.png

RegularModeRowGroups

In the next step, we can add other numerical values that we are interested in. In our case, we dragged “Partner” down to the “Values” section and clicked on the desired value in the bottom green box. We can also select the form in which the value should be displayed: average, count, first, last, max, min or sum. In our case it was the “average fare” value.

_images/pivotvalues.png

RegularModeValues

Pivot mode

First, it is necessary to switch to “Pivot mode” in the top right part of the screen.

_images/PivotModeOn.png

PivotModeOn

After enabling Pivot Mode, the fields were arranged to analyze the distribution of phone service subscriptions by gender. The gender field was added to the Row Groups, so the data is grouped row-wise into Female and Male categories. The PhoneService field was placed under Column Labels, splitting the table into two columns labeled No and Yes based on subscription status. Finally, count(PhoneService) was added to the Values section to display the count of records for each combination of gender and phone service status. This setup provides a clear comparison of how phone service is distributed between male and female customers.
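The same Row Groups / Column Labels / Values arrangement can be sketched in pandas with hypothetical data; `groupby` followed by `unstack` mirrors the pivot layout:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Female"],
    "PhoneService": ["Yes", "No", "Yes", "Yes", "Yes"],
})

# Rows grouped by gender, columns split by PhoneService, cells hold counts.
pivot = df.groupby(["gender", "PhoneService"]).size().unstack(fill_value=0)
```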

_images/pivotrowcolumn.png

PivotMode

More tools are hidden next to each value column, as shown in the image below.

_images/ToolsButton.png

ToolsButton

_images/MoreTools.png

You can also export or copy the chart by right-clicking on the table.

_images/exportpivot.png

Jupyter Notebook

Jupyter is an open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. Users can create customised code using a Jupyter notebook and import it back into SEDGE.

Regular Expression

Regular Expressions (REs) offer a mechanism to select specific strings from a set of character strings. They provide a context-independent syntax capable of representing a wide variety of character sets and their orderings, with interpretations based on the current locale. In addition to using Search and Replace for finding and replacing simple text strings, you can perform more precise searches using Regular Expressions (RegEx). RegEx allows you to accurately match a particular pattern in your search.

Select the Method such as :

  • Extract

  • Search

  • Replace
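The three methods map onto Python's `re` module (a sketch mirroring the examples below):

```python
import re

city = "Chennai_Central"

extracted = re.findall(r"_", city)                       # Extract: every match of the pattern
found = re.search(r"Chennai", city) is not None          # Search: does the pattern occur?
replaced = re.sub("Bālāpur", "Balapur", "Bālāpur East")  # Replace: substitute matches
```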

_images/extract.png

Extract

Example: Let’s try to extract ‘_’,

_images/regextract.png

Example for Extract

Example: Let’s try to search for ‘Chennai’,

_images/search.png
_images/regsearch.png

Example for Search

Example: Let’s replace ‘Bālāpur’ with ‘Balapur’

_images/regreplace.png

Example for Replace

Delta

Delta (Δ) is a generalized function that calculates the difference between each row value and the value at its corresponding subtrahend position in the given numerical column, grouped and sorted by the groupBy and orderBy columns respectively.
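With a subtrahend position of 1 (the previous row), Delta reduces to a grouped difference; a pandas sketch with hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B"],   # groupBy column
    "day": [1, 2, 3, 1, 2],              # orderBy column
    "sales": [100.0, 150.0, 130.0, 80.0, 95.0],
})

# Sort by the orderBy column, then difference each row against the
# previous row within its group; the first row of each group is NaN.
df = df.sort_values(["city", "day"])
df["delta"] = df.groupby("city")["sales"].diff()
```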

_images/delta.png
_images/egdelta.png

Example for Delta

Statistics- Boolean

Refers to a data type that represents one of two states: true or false.

Booleans are commonly used to filter, subset, or mask data based on certain conditions.

_images/stats_boolean.png

Is Null

Refers to missing or nonexistent values in a dataset.

When checking for null values in a specific data field or observation, the result is typically a boolean (true or false) indicating whether the value is null (true) or not null (false). This process of checking for null values is essential for data cleaning and quality assessment before analysis.

_images/null.png

Example for Is Null

Is Even

Determining whether a given number is divisible by 2 without any remainder, indicating that it is an even number.

It will return true if the number is even and false if it is not.

_images/even.png

Example for Is Even

Is Odd

Determining whether a given number is not divisible by 2 without any remainder, indicating that it is an odd number.

It will return true if the number is odd and false if it is not.

Note

This function is applicable only to numerical variables.


_images/odd.png

Example for Is Odd

Is Number

Checks whether a given value is a numerical type.

It will return true if the value is a number and false if it is not.

_images/number.png

Example for Is Number

Is Text

Checks whether a given value is a string or textual data type.

It will return true if the value is text and false if it is not.

_images/egtext.png

Example for Is Text

Negate Boolean

Means flipping its truth value. For example, if the original boolean value is true, its negation would be false, and vice versa. It’s a way to change a statement from being true to false, or vice versa.

Note

This function is applicable only to Boolean values.


When selecting a non-Boolean variable, it will show the validation message “Selected column has no boolean values.”

_images/validation_negate.png
_images/negate.png

Example for Negate Boolean

Frequently Used

The functions commonly used for analysis are:

  • Label Encoding

  • One Hot Encoding

  • Ignore

  • Unignore

  • Impute Missing

  • Extract Date Parts

  • Data Transform

  • Binning

Graphstat

This feature presents a bar chart for the selected column, displaying the value and its count. It supports numerical, categorical, and date data types for bar chart visualization. Numerical data types offer an option to adjust the number of Bins. Text data types are represented using a word cloud. Scatter plots can be enabled to visualize two numerical variables against each other, typically the selected variable and the target variable. Users can view charts in full-screen mode and utilize a print option. Chart download options include PNG, JPEG, PDF, and SVG formats. Data can be downloaded in CSV or XLS format.

Note

Categorical and date data types display only bar charts, while numerical data types offer both scatter plots and bar charts.

For categorical variables, the ‘sort by’ feature is available.

_images/sortby.png

Graphstat for Categorical Column

_images/cat.png

Graphstat for Date Column

_images/gsdatecol.png

Graphstat for Numerical Column

_images/numeric.png

If Scatter Plot is required, enable the Scatter with Target variable toggle.

_images/scatteroption.png

After enabling the toggle, the scatter plot will be displayed.

Graphstat for Numerical Column–Scatter Plot

_images/graphstatstack.png

The word cloud offers the N-grams feature, which ranges from 1 to 5.

_images/textoptions.png
_images/textwc.png

Graphstat for Text variable

Statistical Test

A one-sample t-test is used to determine if the mean of a single sample is significantly different from a known or hypothesized population mean.

One Sample T Test - One Tailed

A one-tailed version of the test checks whether the sample mean deviates from the hypothesized population mean in one specified direction only (e.g., strictly greater than, or strictly less than, the hypothesized value).

Steps to perform

  • Select One Sample T Test - One tailed under test name.

  • Under Numerical column select Numerical variable.

  • Under Categorical column, select a categorical variable if required.

  • Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.

  • The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.
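The computation behind these steps can be sketched with scipy (the `alternative` argument requires scipy ≥ 1.6; the heights data below is made up):

```python
import numpy as np
from scipy import stats

heights = np.array([172, 174, 169, 171, 175, 173, 170], dtype=float)
hypothesized_mean = 170.0
alpha = 0.05  # level of significance

# One-tailed test: is the sample mean significantly greater than 170?
t_stat, p_value = stats.ttest_1samp(heights, hypothesized_mean,
                                    alternative="greater")

reject_null = p_value < alpha
```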

_images/onetailed.png

Sampling:

  • Enable Sampling, if required, and choose the sampling method:

    • Random: Each member of the population is selected purely by chance. Every individual or item has an equal chance of being included in the sample, and selection is not biased by any personal judgment or preference.

    • Stratified: The population is divided into subgroups, or strata, based on characteristics relevant to the research question; samples are then randomly selected from each stratum independently. This approach ensures that each subgroup is represented in the sample, and it allows for more precise estimates and comparisons within each stratum.

Size:

_images/size.png
  • % of a Variable: Selecting a sample size that is a certain percentage of a particular variable. For instance, if the variable is the total population, and you choose 10%, your sample will include 10% of the total population. This approach is useful when you want the sample size to be proportional to the size of a certain variable.

_images/sizeinputs.png
  • Absolute Value: Selecting a sample size by specifying an absolute number. For example, if you decide that your sample size should be 5 individuals, you simply select 5 individuals from the population. This method is straightforward and is often used when the required sample size is predetermined by external factors such as budget or time constraints.

_images/absinputs.png
  • By Calculator: Using a calculator involves employing a statistical tool or software to determine the appropriate sample size based on certain parameters such as confidence level, margin of error, and population size. This method ensures that the sample size is statistically valid and sufficient to make reliable inferences about the population.

_images/bycalculatorinputs.png
  • Lazy Sampling: A less rigorous approach where the sample is selected based on convenience or ease of access rather than using a strict randomization process. This method is quick and cost-effective but may introduce bias and is not always representative of the population. It’s often used in exploratory research or when resources are limited.

_images/lazysamplinginputs.png
  • Systematic Sampling: Selecting every nth item from a list or population. For example, if you have a list of 1,000 individuals and you want a sample of 100, you would select every 10th individual. This method is easy to implement and ensures a spread across the population, but it assumes that the list is randomly ordered to avoid periodicity bias.

_images/ssinputs.png
  • Filtering: Enable Filtering and filter the data, if required.

Once the steps are done, Click on ‘Test’ and verify the result.

_images/resultos.png

Output

Histogram-Density

Provides a normalized representation of the distribution of data, making it easier to compare distributions and understand the relative proportions of data within different intervals.

_images/histogram.png

Histogram:

  • The histogram displays the frequency distribution of age data within specified intervals (bins).

  • The blue bars represent the number of data points (frequency) that fall within each age interval.

  • The height of each bar corresponds to the count of data points in that interval.

Density:

  • The black curve represents the density estimate, showing the distribution of the data in a continuous manner.

  • The area under the density curve represents the entire data set and sums to 1.

The histogram gives a direct count of data points within intervals, while the density plot provides a smoothed view, highlighting the shape of the distribution. Together, they offer a comprehensive view of the data, with the histogram showing discrete counts and the density plot illustrating the overall distribution pattern.
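The two views relate through a simple normalization: density = count / (n × bin width), so the area of the density bars sums to 1. A numpy sketch with synthetic age data (illustrative only, not SEDGE's plotting code):

```python
import numpy as np

ages = np.random.default_rng(0).normal(35, 10, 500)        # synthetic age data

counts, edges = np.histogram(ages, bins=10)                # frequencies per bin
density, _ = np.histogram(ages, bins=10, density=True)     # normalized heights

bin_widths = np.diff(edges)
print(counts.sum())                                  # 500: every point counted once
print(round((density * bin_widths).sum(), 6))        # 1.0: area under the density
```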

T distribution

Provides a visual representation of how the shape of the distribution changes with different sample sizes and illustrates the properties of the t-distribution relevant for statistical inference.

_images/t-distribution.png

Central Tendency of Age Data:

  • The peak of the density curve (red line) suggests that the most common or average age in the dataset is around 0 on a standardized scale. This means that the data is centered around this value.

Spread and Distribution of Age Data:

  • The shape of the density curve (red line) and the t-distribution curve (blue line) indicates how age values are distributed in the dataset.

  • The t-distribution curve (blue line) closely follows the density curve, but it’s slightly smoother and broader. This suggests that the data may have been fitted to a theoretical distribution, such as the t-distribution, often used in statistics for small sample sizes.

Significance of Tail Area:

  • The shaded area under the t-distribution curve (pink) represents the tail of the distribution.

  • This area is significant in statistical terms, especially for hypothesis testing, where it helps determine the probability of observing values beyond a certain threshold (e.g., ages above a certain point).
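The shaded tail corresponds to a tail probability of the t-distribution. A scipy sketch (the degrees of freedom and cutoff are chosen for illustration, not taken from the screenshot):

```python
from scipy import stats

df = 29                               # degrees of freedom, e.g. n - 1 for n = 30
t_crit = stats.t.ppf(0.95, df)        # cutoff with 5% of probability above it
tail_area = stats.t.sf(t_crit, df)    # P(T > t_crit): the shaded tail
print(round(t_crit, 3), round(tail_area, 2))   # 1.699 0.05
```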

Data Observation

The observation of the data is also displayed.

_images/obs.png

Hypothesis

Here the detailed hypothesis is displayed.

_images/hypothesis.png

Summary

Here the Summary Stats are displayed.

_images/summarystats.png

One Sample T Test - Two Tailed

In a one-sample t-test with a two-tailed hypothesis, one examines whether the sample mean significantly differs from a hypothesized population mean. This involves calculating the t-statistic based on the sample data and comparing it to critical values from a t-distribution. The aim is to determine if the mean significantly deviates from the hypothesized value in either direction, hence the term “two-tailed”.

Steps to perform

  • Select One Sample T Test – Two Tailed under test name.

  • Under Numerical column, select a numerical variable.

  • Under Categorical column, select a categorical variable, if required.

  • Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.

  • The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.

  • The two-tailed test checks whether the sample mean is significantly different from the hypothesized mean in either direction (greater or smaller).

  • To learn more: ‘Sampling’

  • To learn more: ‘Filtering’
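The same test can be reproduced outside SEDGE to sanity-check results; a scipy sketch with made-up height data against the hypothesized mean of 170 cm:

```python
from scipy import stats

heights = [168, 172, 169, 175, 171, 167, 173, 170, 174, 169]   # illustrative sample
mu0, alpha = 170, 0.05

t_stat, p_value = stats.ttest_1samp(heights, popmean=mu0)      # two-tailed by default
if p_value < alpha:
    print("Reject H0: the mean differs from", mu0)
else:
    print("Fail to reject H0 at alpha =", alpha)
```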

Histogram-Density

To learn more: ‘Histogram-Density’

T Distribution

To learn more: ‘T distribution’

To learn more: ‘Data Observation’

Shapiro-Wilk

The Shapiro-Wilk test is a useful tool for determining the normality of a data set, which is a crucial assumption in many statistical analyses. It provides a clear decision-making process based on the p-value derived from the T-statistic.

Steps to perform

  • Select Shapiro-Wilk test under test name.

  • Select Numerical variable under Numerical column.

To learn more: ‘Sampling’

To learn more: ‘Filtering’
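Equivalently, with scipy on synthetic data (the decision rule mirrors what the Hypothesis panel reports):

```python
import numpy as np
from scipy import stats

ages = np.random.default_rng(42).normal(40, 8, 200)   # synthetic, drawn from a normal

w_stat, p_value = stats.shapiro(ages)
# p > alpha: no evidence against normality; p <= alpha: sample looks non-normal
print(round(w_stat, 4), p_value > 0.05)
```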

Histogram-Density

To learn more: ‘Histogram-Density’

_images/shapirooutput.png

Output

Data Observation

The observation of the selected variable is displayed. Unlike the One Sample T Test (One Tailed and Two Tailed), this won’t display the Standard Error.

_images/shapiroobs.png

To learn more: ‘T distribution’

Hypothesis

Under Hypothesis, the result indicates whether the sample looks normally distributed or not.

_images/hypo.png

Summary

Under Summary, only the P Value and T-Statistic are displayed.

_images/sum.png

Paired Student T-Test

The paired Student’s t-test compares the means of two related groups to determine if there is a statistically significant difference. It is used when the same subjects are measured before and after a treatment or in matched pairs. The test calculates the difference between paired observations, then computes the mean and standard deviation of these differences to determine the t-statistic. If the p-value is less than a chosen significance level (e.g., 0.05), the null hypothesis (no difference) is rejected, indicating a significant difference between the means.

This test doesn’t display the Histogram-Density plot or the T-distribution.

Steps to perform

  • Select Paired Student T test under test name.

  • Select two Numerical variables under Numerical columns.

  • To learn more: ‘Sampling’

  • To learn more: ‘Filtering’
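A scipy sketch with made-up before/after measurements for the same subjects:

```python
from scipy import stats

before = [82, 90, 78, 85, 88, 76, 92, 81]   # e.g. measurements before treatment
after  = [78, 85, 75, 80, 86, 74, 88, 79]   # same subjects after treatment

t_stat, p_value = stats.ttest_rel(before, after)
if p_value < 0.05:
    print("Significant difference between the paired means")
else:
    print("No significant difference")
```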

Data Observation

The observation of the selected variable is displayed. Unlike the One Sample T Test (One Tailed and Two Tailed), this won’t display the Standard Error.

_images/pairedobs.png

Hypothesis

Under Hypothesis, the result shows only whether the difference between the two paired columns is significant.

_images/pairedhypo.png

Summary Stats

Under Summary, the summary statistics are displayed.

_images/pairedsum.png

Chi Square Test

The Chi Square test is a statistical method used to determine whether there is a significant association between categorical variables. It compares the observed frequencies of categories with the expected frequencies if there were no association.

This test doesn’t display the Histogram-Density plot or the T-distribution.

Steps to perform

  • Select Chi Square test under test name.

  • Select two Categorical variables under Categorical columns.

  • To learn more: ‘Sampling’

  • To learn more: ‘Filtering’
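A scipy sketch with an illustrative 2×2 contingency table built from two categorical variables:

```python
import numpy as np
from scipy import stats

# Observed frequencies: rows and columns are the two categorical variables
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(dof)                    # 1 degree of freedom for a 2x2 table
if p_value < 0.05:
    print("Significant association between the variables")
```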

Data Observation

The observation of the selected variable is displayed. Unlike the One Sample T Test (One Tailed and Two Tailed), this won’t display the Standard Error.

_images/chiobs.png

Hypothesis

Under Hypothesis, the result shows only whether the association between the two categorical variables is significant.

_images/chihypo.png

Summary Stats

Under Summary, the summary statistics are displayed.

_images/chisum.png

Note

To view the T distribution, click on ‘T Distribution’ next to ‘Histogram-Density’.

To view the Summary Stats, click on ‘Summary’ next to ‘Hypothesis’ in the ‘Hypothesis & Summary’ panel.

Transformations

_images/z_transformation.png

SEDGE provides 192 transformation functions organized into seven categories. Use transformations to create new columns, clean data, perform calculations, and reshape your dataset.

Category          Count   Highlights
----------------  ------  ----------------------------------------------------------------
Utility           7       copy, row_index, uuid, binning, impute, unit_converter, xxhash64
Mathematical      63      Arithmetic, trigonometric, logarithmic, statistical/aggregate, financial, bitwise, geospatial
String            55      Case conversion, padding, encoding, hashing, regex, phonetic, distance
Date/Time         48      Extraction, differences, construction, timezone, holidays
Type Conversion   7       int, float, bigint, double, string, cat2num, num2cat
Validation        7       Null/NaN checks, type checks, boolean negation
Conditional       5       if (with and/or), coalesce, ifnull, nvl2, nullif
----------------  ------  ----------------------------------------------------------------
Total             192

Note

Syntax Change in SEDGE 9.0:

  • Column names are now referenced directly (for example, amount).

  • Output column aliases use plain names (for example, as new_col).

  • The as keyword is lowercase throughout.

If you are upgrading from SEDGE 8.x, update your existing transformation expressions to use the new syntax.

Note

  • Function names are case-insensitive.

  • Most functions support nested function calls at arbitrary depth.

  • Column names in functions should match the actual column names in your dataset.

  • Use single quotes ('value') or double quotes ("value") for string literals.

  • Basic arithmetic (+, -, *, /) is supported in SELECT expressions, including nested function calls within arithmetic.

Utility Functions

Copy

copy(): Copy a column with a new name.

Syntax:

copy(original_col) as new_col

Example:

copy(original_col) as new_col

Creates an exact duplicate of the original column with a new name.


Row Index

row_index(): Generate a row index starting at 1.

Syntax:

row_index() as row_num

Example:

row_index() as row_num

Assigns a sequential number starting from 1 to each row in the dataset.


UUID

uuid(): Generate a universally unique identifier (UUID) per row.

Syntax:

uuid() as row_uuid

Example:

uuid() as row_uuid

Generates a unique identifier for each row. Useful for creating primary keys or tracking individual records.


Binning

binning(): Bin numeric values using equal_width, interval, or custom edges.

Syntax:

binning(column, 'method', param) as binned_column

Note

The binning function can only be applied to Numerical datatype columns. Supported methods are equal_width, interval, and custom.

Example (equal width):

binning(amount, 'equal_width', 5) as amount_bin

Example (interval):

binning(amount, 'interval', 10) as amount_bin

Example (custom):

binning(amount, 'custom', '0,50,100,200') as amount_bin

Groups numeric values into discrete bins. Use equal_width for evenly spaced bins, interval for fixed-width bins, or custom for manually defined edges.
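The bin boundaries behind the equal_width and custom methods can be sketched with numpy (bin labels here are plain indices; how SEDGE labels the bins is not specified):

```python
import numpy as np

amount = np.array([5, 12, 37, 48, 60, 75, 99, 150, 180, 199])

# equal_width with 5 bins: identical widths spanning min..max
edges = np.linspace(amount.min(), amount.max(), 5 + 1)
equal_width_bin = np.digitize(amount, edges[1:-1])

# custom edges, as in binning(amount, 'custom', '0,50,100,200')
custom_bin = np.digitize(amount, [50, 100])   # 0: <50, 1: 50-99, 2: >=100

print(equal_width_bin.tolist())   # [0, 0, 0, 1, 1, 1, 2, 3, 4, 4]
print(custom_bin.tolist())        # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```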


Impute

impute(): Impute missing values using various methods (mean, median, mode, movingavg, ffill, bfill, custom, knn). Default is auto.

Syntax:

impute(column) as imputed_column
impute(column, 'method') as imputed_column
impute(column, 'method', value) as imputed_column

Note

Supported methods: mean, median, mode, movingavg, ffill, bfill, custom, knn. If no method is specified, the function uses auto-detection.

Example (auto):

impute(amount) as amount_imputed

Example (mean):

impute(amount, 'mean') as amount_imputed

Example (moving average):

impute(amount, 'movingavg', 3) as amount_imputed

Example (custom value):

impute(status, 'custom', 'NA') as status_imputed

Fills in missing (null) values in a column using the selected imputation strategy.
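Two of the strategies are easy to sketch in plain Python (illustrative logic only, not SEDGE's implementation):

```python
def impute_mean(values):
    # Replace None with the mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_ffill(values):
    # Forward-fill: carry the last observed value forward.
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

print(impute_mean([10, None, 20, None, 30]))      # [10, 20.0, 20, 20.0, 30]
print(impute_ffill([10, None, 20, None, None]))   # [10, 10, 20, 20, 20]
```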


Unit Converter

unit_converter(): Convert values between units of temperature, volume, distance, and weight.

Syntax:

unit_converter(column, "from_unit", "to_unit") as converted_column

Note

Supported unit categories:

  • Temperature: C, F, K (aliases: celsius, fahrenheit, kelvin)

  • Volume: tsp, tbs/tbsp, gal, l, ml

  • Distance: mi, km, m, cm, mm, in, ft, yd

  • Weight: lb/lbm, kg, g

Example:

unit_converter(temp, "C", "F") as temp_f

Converts temperature values from Celsius to Fahrenheit. Useful for standardizing measurements across different unit systems.
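Temperature conversion pivots through a common base unit; a stdlib sketch of the C/F/K case (SEDGE's internal conversion tables are not documented here):

```python
def convert_temperature(value, from_unit, to_unit):
    # Convert to Celsius first, then from Celsius to the target unit.
    to_c = {"C": lambda v: v, "F": lambda v: (v - 32) * 5 / 9, "K": lambda v: v - 273.15}
    from_c = {"C": lambda v: v, "F": lambda v: v * 9 / 5 + 32, "K": lambda v: v + 273.15}
    return from_c[to_unit](to_c[from_unit](value))

print(convert_temperature(100, "C", "F"))   # 212.0
```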


xxHash64

xxhash64(): Generate a 64-bit hash of one or more arguments.

Syntax:

xxhash64(column1, column2) as hash_val

Example:

xxhash64(id, name) as hash_val

Produces a fast, non-cryptographic hash of the combined input values. Useful for deduplication or creating composite keys.

Mathematical Functions

All 63 mathematical functions follow the documentation pattern below:

Function Name

function_name(): Short description.

Syntax:

function_name(arg1, arg2) as output_column

Example:

function_name(arg1, arg2) as output_column

Short explanation of the result.

String Functions

Upper Case

upper(): Convert all characters in a string to uppercase.

Syntax:

upper(column) as upper_column

Example:

upper(name) as name_upper

Lower Case

lower(): Convert all characters in a string to lowercase.

Syntax:

lower(column) as lower_column

Example:

lower(email) as email_lower

InitCap

initcap(): Title case — first letter of each word uppercase, rest lowercase.

Syntax:

initcap(column) as initcap_column

Example:

initcap(full_name) as name_title

Proper

proper(): Proper case (alias of initcap). First letter of each word uppercase, rest lowercase.

Syntax:

proper(column) as proper_column

Example:

proper(full_name) as name_title

Length

For video tutorial see: SEDGE YouTube channel - L functions.

length(): Returns the character count (string length).

Syntax:

length(column) as len_column

Example:

length(description) as desc_length

Label Encode

labelencode(): Encode categorical/string values to sequential integers based on first appearance.

Syntax:

labelencode(column) as encoded_column

Example:

labelencode(category) as category_id

Each unique value is assigned a unique integer. The first encountered value receives 0, the second receives 1, and so on.
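The first-appearance encoding can be sketched in a few lines of Python:

```python
def label_encode(values):
    # Assign 0, 1, 2, ... to unique values in order of first appearance.
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

print(label_encode(["red", "blue", "red", "green", "blue"]))   # [0, 1, 0, 2, 1]
```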


Count Occurrences

count_occurrences(): Count substring or regex matches in text. Default mode is literal.

Syntax:

count_occurrences(column, 'pattern') as count_col
count_occurrences(column, 'pattern', 'regex') as count_col

Example (literal):

count_occurrences(text, 'foo') as foo_count

Example (regex):

count_occurrences(comment, '\\bERR\\d+\\b', 'regex') as err_count

Pop

pop(): Remove a character at a 1-based index from a string. Supports negative index from end.

Syntax:

pop(column, index) as pop_column

Example:

pop(code, 1) as code_no_first

Rename Factors

rename_factors(): Rename multiple categorical/string values using a mapping string.

Syntax:

rename_factors(column, 'mapping') as renamed_column

Note

Mapping syntax: old:new pairs separated by ;. Also supports old->new or old=new separators.

Example:

rename_factors(category, 'A:Group1;B:Group1;C:Group2') as category_new

Extract Number

extract_num(): Extract the first numeric token from text as a number. Supports sign and decimals.

Syntax:

extract_num(column) as num_column

Example:

extract_num(code) as code_num

Trim

For video tutorial see: SEDGE YouTube channel - T functions.

trim(): Remove whitespace from both ends of a string.

Syntax:

trim(column) as trim_column

Example:

trim(name) as trimmed_name

Left Trim

ltrim(): Remove whitespace from the left end of a string.

Syntax:

ltrim(column) as ltrim_column

Example:

ltrim(text) as left_trimmed

Right Trim

rtrim(): Remove whitespace from the right end of a string.

Syntax:

rtrim(column) as rtrim_column

Example:

rtrim(text) as right_trimmed

Reverse

reverse(): Reverse a string.

Syntax:

reverse(column) as rev_column

Example:

reverse(text) as reversed_text

Substring

substring(): Extract a substring from a string using 1-based indexing.

Syntax:

substring(column, start, length) as sub_column

Note

Start position is 1-based (first character is position 1).

Example:

substring(phone, 1, 3) as area_code

Extracts the first 3 characters from the phone column.


Replace

replace(): Replace all occurrences of a substring with another string.

Syntax:

replace(column, 'old', 'new') as replace_column

Example:

replace(phone, "-", "") as clean_phone

Removes all dashes from the phone number.


Concatenate

concat(): Concatenate strings with an optional separator.

Syntax:

concat(col1, col2) as concat_column
concat("separator", col1, col2) as concat_column

Example:

concat(first_name, last_name) as full_name
concat(",", first_name, last_name) as full_name_csv

Left

left(): Extract a specified number of characters from the left side of a string.

Syntax:

left(column, length) as left_column

Example:

left(name, 3) as prefix

Right

right(): Extract a specified number of characters from the right side of a string.

Syntax:

right(column, length) as right_column

Example:

right(name, 3) as suffix

Left Pad

lpad(): Left-pad a string to a specified length with a padding character.

Syntax:

lpad(column, length, "pad") as lpad_column

Example:

lpad(code, 5, "0") as padded_code

Pads the code column with leading zeros to ensure a width of 5 characters.


Right Pad

rpad(): Right-pad a string to a specified length with a padding character.

Syntax:

rpad(column, length, "pad") as rpad_column

Example:

rpad(code, 5, "0") as padded_code

Repeat

repeat(): Repeat a string N times.

Syntax:

repeat(column, count) as repeat_column

Example:

repeat(token, 3) as repeated

InStr

instr(): Returns the position of a substring within a string (1-based, 0 if not found).

Syntax:

instr(column, "substring") as instr_column

Example:

instr(name, "abc") as pos

Locate

locate(): Returns the position of a substring within a string (1-based, 0 if not found). Note the argument order is reversed compared to instr.

Syntax:

locate("substring", column) as locate_column

Example:

locate("abc", name) as pos

Substring Index

substring_index(): Returns the substring before or after the Nth delimiter.

Syntax:

substring_index(column, "delimiter", count) as subidx_column

Example:

substring_index(path, "/", 2) as root

Find

find(): Find whether a substring exists in a string (returns boolean).

Syntax:

find(column, "substring") as find_column

Example:

find(name, "abc") as has_abc

Append

append(): Append a suffix to a string.

Syntax:

append(column, 'suffix') as append_column

Note

Both Categorical and Numerical values are supported. The new column created is a Categorical type.

Example:

append(temperature, 'Deg C') as temp_with_unit

If the temperature column is numeric, appending the string Deg C converts the value to text: for example, 13.5 becomes 13.5 Deg C, and the new column is a string.


Remove

remove(): Remove all occurrences of a substring from a string.

Syntax:

remove(column, "substring") as remove_column

Example:

remove(name, "abc") as cleaned

Insert

insert(): Insert a substring at a specified position (1-based).

Syntax:

insert(column, position, "substring") as insert_column

Example:

insert(name, 2, "_") as name_ins

Octet Length

octet_length(): Returns the byte length of a string.

Syntax:

octet_length(column) as octet_column

Example:

octet_length(payload) as payload_bytes

Multiple Replacement

For video tutorial see: SEDGE YouTube channel - M functions.

mul_replacement(): Replace a range of characters in a string with replacement text.

Syntax:

mul_replacement(column, start, end, "replacement") as mulrep_column

Example:

mul_replacement(code, 2, 4, "_OK") as code_new

One Replacement

For video tutorial see: SEDGE YouTube channel - O functions.

one_replacement(): Replace a single character at a specified position.

Syntax:

one_replacement(column, position, "replacement") as onerep_column

Example:

one_replacement(code, 3, "X") as code_new

Step Delete

step_delete(): Delete characters by step within a range.

Syntax:

step_delete(column, start, end, step) as stepdel_column

Example:

step_delete(token, 1, 10, 2) as token_clean

Regex Replace

regexp_replace(): Replace substrings matching a regular expression pattern.

Syntax:

regexp_replace(column, "pattern", "replacement") as regrep_column

Example:

regexp_replace(text, "[0-9]+", "") as letters

Removes all numeric sequences from the text.


Regex Extract

regexp_extract(): Extract a group from a regular expression match.

Syntax:

regexp_extract(column, "pattern", group_idx) as regext_column

Example:

regexp_extract(text, "(\\d+)", 1) as digits

Starts With

startswith(): Check if a string starts with a specified prefix (returns boolean).

Syntax:

startswith(column, "prefix") as starts_column

Example:

startswith(email, "admin") as is_admin_email

Ends With

endswith(): Check if a string ends with a specified suffix (returns boolean).

Syntax:

endswith(column, "suffix") as ends_column

Example:

endswith(filename, ".pdf") as is_pdf

Contains

contains(): Check if a string contains a specified substring (returns boolean).

Syntax:

contains(column, "substring") as contains_column

Example:

contains(description, "urgent") as is_urgent

ASCII

ascii(): Returns the ASCII code of the first character of a string.

Syntax:

ascii(column) as ascii_column

Note

Works for categorical columns.

Example:

ascii(name) as first_code

For example, given a categorical column Geography with the values France, Spain, and Germany, the transformation returns 70, 83, and 71, which are the ASCII codes of the first characters F, S, and G.


Hexadecimal

For video tutorial see: SEDGE YouTube channel - H functions.

hex(): Convert an integer to a hexadecimal string.

Syntax:

hex(column) as hex_column

Example:

hex(id) as id_hex

Unhex

unhex(): Convert a hexadecimal string to an integer.

Syntax:

unhex(column) as unhex_column

Example:

unhex(hex_id) as id_num

Base64 Encode

base64(): Encode a string to Base64.

Syntax:

base64(column) as b64_column

Example:

base64(payload) as payload_b64

Base64 Decode

unbase64(): Decode a Base64 string.

Syntax:

unbase64(column) as unb64_column

Example:

unbase64(payload_b64) as payload

MD5 Hash

md5(): Generate an MD5 hash of a string (hex output).

Syntax:

md5(column) as md5_column

Example:

md5(email) as email_md5

SHA1 Hash

sha1(): Generate a SHA1 hash of a string (hex output).

Syntax:

sha1(column) as sha1_column

Example:

sha1(email) as email_sha1

SHA2 Hash

sha2(): Generate a SHA2 hash of a string (hex output). Supports 224, 256, 384, and 512 bits.

Syntax:

sha2(column, bits) as sha2_column

Example:

sha2(email, 256) as email_sha256

CRC32

crc32(): Generate a CRC32 checksum.

Syntax:

crc32(column) as crc_column

Example:

crc32(payload) as payload_crc

Soundex

soundex(): Generate a Soundex phonetic code for a string.

Syntax:

soundex(column) as soundex_column

Example:

soundex(name) as name_soundex

Soundex encodes strings by their sound when spoken in English, useful for matching names with different spellings.


Levenshtein Distance

levenshtein(): Calculate the Levenshtein edit distance between two strings.

Syntax:

levenshtein(col1, col2) as leven_column

Example:

levenshtein(a, b) as edit_dist

Returns the minimum number of single-character edits (insertions, deletions, substitutions) needed to change one string into the other.
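The distance is computed with the classic dynamic-programming recurrence; a stdlib sketch:

```python
def edit_distance(a, b):
    # prev[j] = distance from the current prefix of a to the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # 3
```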


Bit Length

bit_length(): Returns the bit length of a string.

Syntax:

bit_length(column) as bitlen_column

Example:

bit_length(token) as token_bits

Char

char(): Convert an ASCII code to its corresponding character.

Syntax:

char(column) as char_column

Example:

char(code) as char_val

Binary String

bin(): Convert an integer to its binary string representation.

Syntax:

bin(column) as bin_column

Example:

bin(id) as id_bin

Base Conversion

conv(): Convert a number between bases.

Syntax:

conv(column, from_base, to_base) as conv_column

Example:

conv("ff", 16, 10) as dec_val

Converts the hexadecimal value ff to decimal, returning 255.


Delete

For video tutorial see: SEDGE YouTube channel - D functions.

delete(): Delete a substring at a specified position and length.

Syntax:

delete(column, start, length) as delete_column

Example:

delete(name, 2, 3) as trimmed_name

Format Number

format_number(): Format a number with commas and a specified number of decimal places.

Syntax:

format_number(column, digits) as fmt_column

Example:

format_number(amount, 2) as amount_fmt

Formats 1234567.891 as 1,234,567.89.

Date/Time Functions

Year

year(): Extract the year from a date or datetime column.

Syntax:

year(date_column) as year_column

Note

The year function can only be applied to Date or Datetime datatype columns.

Example:

year(birth_date) as birth_year

Week Year

week_year(): Extract the ISO week-based year from a date or datetime.

Syntax:

week_year(date_column) as week_year_column

Example:

week_year(birth_date) as birth_week_year

Note

The ISO week-year may differ from the calendar year at year boundaries.


Month

month(): Extract the month from a date or datetime (1-12).

Syntax:

month(date_column) as month_column

Example:

month(order_date) as order_month

Day

day(): Extract the day from a date or datetime (1-31).

Syntax:

day(date_column) as day_column

Example:

day(order_date) as order_day

Hour

hour(): Extract the hour from a datetime (0-23).

Syntax:

hour(datetime_column) as hour_column

Example:

hour(timestamp) as hour_of_day

Minute

minute(): Extract the minute from a datetime (0-59).

Syntax:

minute(datetime_column) as minute_column

Example:

minute(timestamp) as minute_of_hour

Second

second(): Extract the second from a datetime (0-59).

Syntax:

second(datetime_column) as second_column

Example:

second(timestamp) as second_of_minute

Weekday

weekday(): Extract the weekday name (for example, Monday).

Syntax:

weekday(date_column) as weekday_column

Example:

weekday(date) as day_of_week

Day Number of Week

daynumofweek(): Extract the weekday number (1=Monday, 7=Sunday).

Syntax:

daynumofweek(date_column) as daynum_column

Example:

daynumofweek(date) as day_of_week_num

Month Abbreviation

month_abrv(): Extract the abbreviated month name (for example, Jan).

Syntax:

month_abrv(date_column) as mon_abrv_column

Example:

month_abrv(date) as month_short

Month Full Name

month_full(): Extract the full month name (for example, January).

Syntax:

month_full(date_column) as mon_full_column

Example:

month_full(date) as month_name

Week Number

week(): Extract the ISO week number from a date.

Syntax:

week(date_column) as week_column

Example:

week(date) as week_number

Quarter

quarter(): Extract the quarter (1-4) from a date.

Syntax:

quarter(date_column) as quarter_column

Example:

quarter(date) as fiscal_quarter

Day of Year

dayofyear(): Extract the day of year (1-366).

Syntax:

dayofyear(date_column) as doy_column

Example:

dayofyear(date) as day_of_year

Date Difference (Days)

datediff(): Calculate the difference between two dates in days.

Syntax:

datediff(end_date, start_date) as days_diff

Note

Returns end_date - start_date in days. Both columns must be Date datatype. Output type: Int64.

Example:

datediff(end_date, start_date) as days_diff

Date Difference (Months)

datediff_mon(): Calculate the difference between two dates in months.

Syntax:

datediff_mon(end_date, start_date) as months_diff

Note

Accounts for year and month differences. Output type: Float64.

Example:

datediff_mon(end_date, start_date) as months_diff

Date Difference (Years)

datediff_year(): Calculate the difference between two dates in years.

Syntax:

datediff_year(end_date, start_date) as years_diff

Note

Calculated as months between / 12.0. Output type: Float64.

Example:

datediff_year(end_date, start_date) as years_diff

Date Difference (Weeks)

datediff_weeks(): Calculate the difference between two dates in weeks.

Syntax:

datediff_weeks(end_date, start_date) as weeks_diff

Note

Calculated as total days / 7.0. Output type: Float64.

Example:

datediff_weeks(end_date, start_date) as weeks_diff

Date Difference (Milliseconds)

datediff_ms(): Calculate the difference between two datetimes in milliseconds.

Syntax:

datediff_ms(end_datetime, start_datetime) as ms_diff

Note

Calculated as total seconds * 1000.0. Output type: Float64.

Example:

datediff_ms(end_time, start_time) as ms_diff

Date Difference (Hours)

datediff_hours(): Calculate the difference between two datetimes in hours.

Syntax:

datediff_hours(end_datetime, start_datetime) as hours_diff

Note

Output type: Float64.

Example:

datediff_hours(end_time, start_time) as hours_diff

Date Difference (Minutes)

datediff_minutes(): Calculate the difference between two datetimes in minutes.

Syntax:

datediff_minutes(end_datetime, start_datetime) as mins_diff

Note

Output type: Float64.

Example:

datediff_minutes(end_time, start_time) as minutes_diff

Date Difference (Seconds)

datediff_seconds(): Calculate the difference between two datetimes in seconds.

Syntax:

datediff_seconds(end_datetime, start_datetime) as secs_diff

Note

Output type: Float64.

Example:

datediff_seconds(end_time, start_time) as seconds_diff

Date Add

date_add(): Add a specified number of days to a date.

Syntax:

date_add(date_column, days) as new_date_column

Note

The date_add function can only be applied to Date datatype columns.

Example:

date_add(date_column, 7) as next_week

Date Subtract

date_sub(): Subtract a specified number of days from a date.

Syntax:

date_sub(date_column, days) as new_date_column

Example:

date_sub(date_column, 7) as prev_week

Add Months

add_months(): Add a specified number of months to a date.

Syntax:

add_months(date_column, number_of_months) as new_date_column

Note

The add_months function can only be applied to Date datatype columns.

Example:

add_months(date_column, 2) as plus_two_months

Last Day

last_day(): Returns the last day of the month for a given date.

Syntax:

last_day(date_column) as last_day_column

Example:

last_day(date_column) as month_end

Date Truncate

date_trunc(): Truncate a date/time to a specified unit (year, month, day, hour, etc.).

Syntax:

date_trunc('unit', date_column) as trunc_column

Note

Returns a formatted string representation.

Example:

date_trunc('month', date_column) as month_start

To Date

to_date(): Parse a string to a date using a specified format.

Syntax:

to_date(column, 'format') as date_column

Example:

to_date(date_str, '%Y-%m-%d') as parsed_date

To Timestamp

to_timestamp(): Parse a string to a timestamp using a specified format.

Syntax:

to_timestamp(column, 'format') as ts_column

Example:

to_timestamp(ts_str, '%Y-%m-%d %H:%M:%S') as parsed_ts
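The format directives follow the standard strftime/strptime codes; a stdlib illustration of what '%Y-%m-%d %H:%M:%S' parses:

```python
from datetime import datetime

parsed = datetime.strptime("2024-03-15 08:30:00", "%Y-%m-%d %H:%M:%S")
print(parsed.year, parsed.month, parsed.day, parsed.hour)   # 2024 3 15 8
```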

From Unix Time

from_unixtime(): Convert a Unix timestamp (seconds since epoch) to a datetime.

Syntax:

from_unixtime(unix_column) as datetime_column

Example:

from_unixtime(unix_seconds) as dt

Milliseconds to Date

millisec_todate(): Convert milliseconds since epoch to a datetime.

Syntax:

millisec_todate(ms_column) as datetime_column

Example:

millisec_todate(ms) as dt

From UTC Timestamp

from_utc_timestamp(): Convert a UTC datetime to a specified timezone.

Syntax:

from_utc_timestamp(datetime_column, 'timezone') as local_column

Example:

from_utc_timestamp(ts, 'America/New_York') as local_ts

To UTC Timestamp

to_utc_timestamp(): Convert a timezone-aware datetime to UTC.

Syntax:

to_utc_timestamp(datetime_column, 'timezone') as utc_column

Example:

to_utc_timestamp(local_ts, 'America/New_York') as utc_ts

To Unix Timestamp

to_unix_timestamp(): Convert a datetime to a Unix timestamp (seconds since epoch).

Syntax:

to_unix_timestamp(datetime_column) as unix_column
to_unix_timestamp(datetime_column, 'format') as unix_column

Example:

to_unix_timestamp(ts, '%Y-%m-%d %H:%M:%S') as ts_seconds

Months Between

months_between(): Calculate the number of months between two dates (rounded to 8 decimals by default).

Syntax:

months_between(end_date, start_date) as months_between_col
months_between(end_date, start_date, false) as months_between_col

Example:

months_between(end_date, start_date, false) as months_diff

Pass false as the third argument to disable rounding.


Next Day

next_day(): Returns the first occurrence of a specified weekday after a given date.

Syntax:

next_day(date_column, 'weekday') as next_day_column

Example:

next_day(date_column, 'monday') as next_mon

Date Is In Range

date_isin(): Check if a date is within a specified range (inclusive).

Syntax:

date_isin(date_column, 'start_date', 'end_date') as in_range_column

Example:

date_isin(date_column, '2024-01-01', '2024-12-31') as in_range

Returns a boolean indicating whether the date falls within the specified range.


Is Holiday

is_holiday(): Check if a date is a holiday for a specified country.

Syntax:

is_holiday(date_column, 'country') as is_holiday_column

Note

Falls back to weekend-only detection if the country’s holiday calendar is not available.

Example:

is_holiday(date_column, 'US') as is_holiday

Holiday Count

holiday_count(): Count the number of holidays in a date range for a specified country.

Syntax:

holiday_count(start_date, end_date, 'country') as holiday_count_col

Example:

holiday_count(start_date, end_date, 'US') as holidays

Workday Count

workday_count(): Count the number of working days in a date range for a specified country.

Syntax:

workday_count(start_date, end_date, 'country') as workday_count_col

Example:

workday_count(start_date, end_date, 'US') as workdays

Weekend Count

weekend_count(): Count the number of weekend days in a date range.

Syntax:

weekend_count(start_date, end_date, 'country') as weekend_count_col

Example:

weekend_count(start_date, end_date, 'US') as weekends

Off-Day Count

offday_count(): Count the number of off-days (holidays + weekends) in a date range.

Syntax:

offday_count(start_date, end_date, 'country') as offday_count_col

Example:

offday_count(start_date, end_date, 'US') as offdays
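
The counting functions above can all be sketched by iterating over the inclusive date range. The snippet below shows the weekend case; a workday count would subtract weekends (and holidays) from the total number of days. This is an illustrative sketch, not SEDGE's implementation:

```python
from datetime import date, timedelta

def weekend_count(start, end):
    """Count Saturdays and Sundays in [start, end] inclusive (a sketch)."""
    n, d = 0, start
    while d <= end:
        if d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            n += 1
        d += timedelta(days=1)
    return n

print(weekend_count(date(2024, 1, 1), date(2024, 1, 14)))  # 4
```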

Unix Timestamp

unix_timestamp(): Return the current Unix timestamp or convert a datetime to Unix timestamp.

Syntax:

unix_timestamp() as now_ts
unix_timestamp(datetime_column) as ts_seconds

Example:

unix_timestamp() as now_ts
unix_timestamp(ts_column) as ts_seconds

Date Value

date_value(): Convert a date to a serial number (days since epoch).

Syntax:

date_value(date_column) as serial_column

Example:

date_value(date_column) as date_serial

Year to Date

year_to_date(): Convert a year to its start date (YYYY-01-01).

Syntax:

year_to_date(year) as year_start_column

Example:

year_to_date(2024) as year_start

Current Date

current_date(): Return the current date.

Syntax:

current_date() as today

Example:

current_date() as today

Current Timestamp

current_timestamp(): Return the current timestamp as a Unix timestamp.

Syntax:

current_timestamp() as now_ts

Example:

current_timestamp() as now_ts

Now

now(): Return the current datetime.

Syntax:

now() as current_datetime

Example:

now() as current_datetime

Type Conversion Functions

Integer

int(): Convert a column to integer (Int64).

Syntax:

int(column) as int_column

Example:

int(price_string) as price_int

Float

float(): Convert a column to float (Float64).

Syntax:

float(column) as float_column

Example:

float(age_string) as age_float

Big Integer

bigint(): Convert a column to 64-bit integer.

Syntax:

bigint(column) as bigint_column

Example:

bigint(id) as id_bigint

Double

double(): Convert a column to double precision float (Float64).

Syntax:

double(column) as double_column

Example:

double(score) as score_double

String

string(): Convert a column to string.

Syntax:

string(column) as string_column

Example:

string(id) as id_string

Categorical to Numerical

cat2num(): Convert categorical values to numerical representation.

Syntax:

cat2num(column) as num_column

Example:

cat2num(category) as category_num
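
Conceptually, the conversion assigns each distinct category an integer code. The sketch below assigns codes in sorted order - an assumption for illustration; SEDGE's actual code-assignment scheme is not documented here:

```python
def cat2num(values):
    """Map categories to integer codes, assigned in sorted order.
    The sorted-order assignment is an assumption for this sketch."""
    codes = {c: i for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

print(cat2num(["red", "blue", "red", "green"]))  # [2, 0, 2, 1]
```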

Numerical to Categorical

num2cat(): Convert numerical values to categorical (string) representation.

Syntax:

num2cat(column) as cat_column

Example:

num2cat(code) as code_string

Validation Functions

Is Null

isnull(): Check if a value is null (returns boolean).

Syntax:

isnull(column) as is_null_column

Example:

isnull(email) as email_is_null

Is Not Null

isnotnull(): Check if a value is not null (returns boolean).

Syntax:

isnotnull(column) as is_not_null_column

Example:

isnotnull(email) as email_is_valid

Is NaN

isnan(): Check if a value is NaN — Not a Number (returns boolean).

Syntax:

isnan(column) as is_nan_column

Example:

isnan(metric) as metric_is_nan

Negate Boolean

negate_bool(): Negate boolean values (true becomes false, false becomes true).

Syntax:

negate_bool(column) as neg_bool_column

Example:

negate_bool(flag) as flag_neg

Type Of

typeof(): Returns the type name of a column as a string.

Syntax:

typeof(column) as type_column

Example:

typeof(value) as value_type

Is Numeric

is_numeric(): Check if a value is numeric (returns boolean).

Syntax:

is_numeric(column) as is_num_column

Example:

is_numeric(value) as is_num

Is Text

is_text(): Check if a string length is greater than 50 (returns boolean).

Syntax:

is_text(column) as is_text_column

Example:

is_text(value) as is_text

Conditional Functions

If Else

For video tutorial see: SEDGE YouTube channel - I functions.

if(): Conditional if-then-else function. Supports logical operators (and, or), nested functions in conditions and values, and can be embedded inside arithmetic expressions.

Syntax:

if(condition, true_value, false_value) as if_column

Note

Condition format: column operator value

Supported operators: >=, <=, >, <, ==, !=, =

Logical operators: and / or to combine multiple conditions.

Nested support: Functions can be used inside conditions and values.

Arithmetic embedding: if() can be used inside arithmetic expressions.

Example:

if(score >= 50, 'pass', 'fail') as result

Example with logical operators:

if(age >= 18 and age <= 65, 'working_age', 'other') as group

Example with nested functions:

if(substring(EquipmentType, 1, 2) = '20', 1, 2) as Teus

Example inside arithmetic:

(allInAmt / if(RIGHT(eqpType, 2) = '20', 1, 2)) as adjusted_amt
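
The semantics of that embedded example can be expressed in plain Python: 20-foot equipment divides the amount by 1, everything else by 2 (the names below mirror the example's columns and are illustrative):

```python
def adjusted_amt(all_in_amt, eqp_type):
    """Python equivalent of
    (allInAmt / if(RIGHT(eqpType, 2) = '20', 1, 2))."""
    divisor = 1 if eqp_type[-2:] == "20" else 2
    return all_in_amt / divisor

print(adjusted_amt(1000.0, "DC20"))  # 1000.0
print(adjusted_amt(1000.0, "DC40"))  # 500.0
```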

Coalesce

coalesce(): Return the first non-null value from multiple columns.

Syntax:

coalesce(col1, col2, "default") as coalesce_column

Example:

coalesce(email, backup_email, "unknown") as contact_email
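
The first-non-null behavior is equivalent to this small Python sketch (using `None` to stand in for null):

```python
def coalesce(*values):
    """Return the first non-None value, or None if all are None."""
    return next((v for v in values if v is not None), None)

print(coalesce(None, None, "backup@example.com", "unknown"))
# backup@example.com
```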

If Null

ifnull(): Fill null values with a specified replacement value.

Syntax:

ifnull(column, 'replacement') as ifnull_column

Example:

ifnull(email, 'no-email@example.com') as safe_email

NVL2

nvl2(): Return one value if the column is not null, another if it is null.

Syntax:

nvl2(column, "value_if_not_null", "value_if_null") as nvl2_column

Example:

nvl2(email, "has", "missing") as status

Null If

nullif(): Set cells to null when they match a specified value.

Syntax:

nullif(column, 'value_to_null') as nullif_column

Example:

nullif(status, 'N/A') as clean_status

Replaces all occurrences of N/A with null, making it easier to handle missing data downstream.

Legacy Alias Support

The following legacy function names are accepted and mapped to their current equivalents:

Legacy Name        Maps To

bround             round
rint               round
pow                power
concat_ws          concat
char_length        length
dayofmonth         day
dayofweek          weekday
weekofyear         week
datediff_days      datediff
datediff_hrs       datediff_hours
datediff_mins      datediff_minutes
datediff_secs      datediff_seconds
positive           abs

Notes

  1. Function Names: All function names are case-insensitive.

  2. Nested Functions: Most functions support nested function calls in their arguments at arbitrary depth.

  3. Column References: Column names in functions should match the actual column names in your dataset.

  4. String Literals: Use single quotes ('value') or double quotes ("value") for string literals.

  5. Arithmetic Operations: Basic arithmetic (+, -, *, /) is supported in SELECT expressions.

  6. Logical Operators: and / or operators are supported in if conditions.

  7. Temporary Columns: Nested functions create temporary columns that are automatically cleaned up.

  8. Output Types: Some date difference functions return Float64; count returns Int64.

  9. Deep Validation: Use deep_validate: true to perform schema-level and sample data checks before execution.

  10. Legacy Aliases: Backward compatibility with Spark SQL-style function names is maintained through alias mapping.

Visual Analytics

Cluster Analysis

Cluster analysis is a technique for grouping similar observations into a number of clusters based on the observed values of several variables for each individual. It is an exploratory data analysis technique for analyzing multivariate data sets.

Clustering algorithms

K-means

K-means is an iterative algorithm that divides an unlabeled dataset into k different clusters.

_images/clusteranalysis1.png

image_K-means
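
The core loop of k-means - assign each point to its nearest centroid, then recompute centroids as cluster means - can be sketched in a few lines of plain Python. This is a teaching sketch, not SEDGE's implementation: it seeds centroids with the first k points, whereas production implementations use smarter initialization:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means: nearest-centroid assignment, then mean update."""
    centroids = points[:k]  # naive seeding - an assumption of this sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return [min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

labels = kmeans([(0, 0), (0.5, 0), (10, 10), (10.5, 10)], k=2)
print(labels)  # the two left points share one label, the two right another
```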

Bisecting K-Means

Bisecting K-Means is a hybrid approach of the K-Means algorithm to produce partitional/hierarchical clustering. It splits one cluster into two sub-clusters at each bisecting step (using k-means) until k clusters are obtained.

_images/clusteranalysis2.png

image_Bisecting_K-Means

Gaussian Mixture

A Gaussian mixture model is a probabilistic clustering algorithm used to discover the underlying groups in data. Gaussian mixture models have a higher chance of finding the right number of clusters in the data compared to k-means.

_images/clusteranalysis3.png

image_Gaussian_Mixture

Method for Optimal Cluster

Elbow method

The elbow method is an empirical method to find the optimal number of clusters for a dataset. We pick a range of candidate values of k, apply K-Means clustering for each value, compute the average distance of each point in a cluster to its centroid, and plot it against k. We then pick the value of k at the "elbow", where increasing k further yields only a small reduction in the average distance.

_images/optimalcluster1.png

image Elbow method
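
Given the per-k distances (often called inertias), one simple heuristic for locating the elbow is to find the sharpest bend via the second difference. This is only one of several possible knee-detection heuristics, shown here as a sketch over hypothetical precomputed values:

```python
def elbow_k(inertias):
    """Pick k at the sharpest bend in the inertia curve.
    inertias[i] is the within-cluster distance measure for k = i + 1."""
    bends = [inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
             for i in range(1, len(inertias) - 1)]
    return bends.index(max(bends)) + 2  # +2: interior points start at k = 2

print(elbow_k([100.0, 60.0, 20.0, 18.0, 17.0]))  # 3
```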

Silhouette method

The silhouette method is another way to find the optimal number of clusters, and to interpret and validate the consistency of the clustering. It computes a silhouette coefficient for each point, measuring how similar the point is to its own cluster compared to other clusters, and provides a succinct graphical representation of how well each object has been classified.

_images/optimalcluster2.png

image Silhouette method
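
The silhouette coefficient for a point is s = (b - a) / max(a, b), where a is the point's mean distance to its own cluster and b is its smallest mean distance to any other cluster. A minimal sketch for 1-D points (illustrative only - real implementations work on full distance matrices):

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient s = (b - a) / max(a, b) for 1-D data."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, lq) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(lq, []).append(abs(p - q))
        a = sum(by_cluster[lab]) / len(by_cluster[lab])
        b = min(sum(d) / len(d) for l, d in by_cluster.items() if l != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to 1.
print(silhouette_score([0, 1, 10, 11], [0, 0, 1, 1]))
```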

Graphical Analysis

_images/GraphicalAnalysis.png

image GraphicalAnalysis

Graphs are used to display large amounts of numerical data and to represent the relationships between numerical values of different variables. They can also be used to derive quantitative relationships between variables.

_images/BarChartAge.png

image BarChartAge

A bar chart presents data with rectangular bars whose heights or lengths are proportional to the values they represent.


We have chosen age as an example for our bar chart. We can point the mouse at any part of a bar and the system will tell us where we are: the X axis shows the position of the mouse and the Y axis shows the total count in the bar. We can also zoom in or out with the grey tool in the upper part of the graph (highlighted in green).

The “Group by” and “Aggregation” tools work with categorical values.

Note

We can also select the “Use Distinct” tool, which shows the percentage of each category irrespective of the count in each category.

Stack Bar

The stack bar chart extends the standard bar chart from looking at numeric values across one categorical variable to two, where each bar is divided into a number of sub-bars.

_images/StackBarSurEmb.png

image StackBarSurEmb

In this stack bar chart we have selected two factors - “port of embarkment” and “survived”. Each port of embarkment - 0, 1 and 2 - is divided into two colours: 1 showing those who survived and 0 showing the opposite. By pointing the mouse at any section of a bar we can see the total number of cases in the group.

Box Plot

A box plot depicts groups of numerical data through their quartiles. Box plots also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles; points beyond the whiskers are outlier values.

_images/BoxPlotAge.png

image BoxPlot

We used age as an example: we can see that the age of most passengers was between 22 and 35 years, while other values are represented as outliers. By pointing at the graph we can also see the Open, Low, High and Close values.

Pie Chart

Pie charts are beneficial for expressing data and information in terms of proportions. A pie chart is most effective when comparing the parts of a whole. It is constructed by gathering all the required data into a circular frame in which the data are depicted as slices.


_images/PieChartMealNo.png

image PieChartMealNo

For our example we used the number of meals, which ranges between 1 and 9. Each part of the pie is also labelled with its percentage, and by pointing at each section we can see the total number of cases.

Bubble Chart

A bubble chart is a type of chart that displays three dimensions of data. Each entity, with its triplet of associated values, is plotted as a disk that expresses two of the values through the disk’s xy location and the third through its size.

_images/BubbleChartAgeFare.png

image BubbleChartAgeFare

We used Age and Fare for our example and we can now see how the values are spread in the graph.


We can also add another factor to our chart, which controls the “Size” - we selected “Fare”, and we can see that the bubble size grows with increasing fare.

_images/BubbleChartAgeFareSize.png

image BubbleChartAgeFareSize


We have another feature that helps represent values effectively - we can add a further factor and encode it as colour. In our case we have added “Embarkment point”, so the results are also differentiated by the point of embarkment.

_images/BubbleChartAgeFareSizeEmb.png

image BubbleChartAgeFareSizeEmb

Correlation Chart

The chart depicts each numerical variable’s correlation against all other variables. The bigger the connection, the stronger the correlation between the variables.

_images/Correlation.png

image Correlation

_images/Correlationtarget.png

image Correlationtarget

Scatter Graph

A scatter graph uses dots to represent the values of two different numeric variables. The position of each dot on the horizontal and vertical axes indicates the values for an individual data point. Scatter plots are used to observe relationships between variables.

_images/Scatter.png

image Scatter

For our demonstration we have used three different numerical variables - Age as the target on X axis and Fare and Total meal cost on the Y axis. We can even add more numerical variables and all will be displayed in one chart.

Parallel Coordinates

Parallel coordinates are a common way of visualising and analysing high-dimensional datasets. It is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order.

_images/Parallel.png

image Parallel

We have used Meal Cost, Number of Meals and Total Meal cost to visualise the relation between these variables.

Scatter Matrix

In multivariate statistics and probability theory, the scatter matrix is a statistic that is used to make estimates of the covariance matrix, for instance of the multivariate normal distribution.

Time Series

A time series is a series of data points indexed (or listed or graphed) in time order.

_images/TimeSeries.png

image TimeSeries

On the X axis we need to use a date variable, in our case “Ticket purchase”, and on the Y axis we used “Survived”. For aggregation we selected the average value (the other options are Count, Sum, Minimum, Maximum, Median, Standard Deviation and Variance). From our graph we can see that everybody who purchased a ticket on April 1st survived.

Distribution

A distribution plot displays a distribution and range of a set of numeric values plotted against a target variable(s).

_images/Visualisationdistribution.png

Distribution

Advanced Analysis

Outlier analysis

Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as outlier detection, it’s an important step in data analysis, as it removes erroneous or inaccurate observations which might otherwise skew conclusions.

_images/Outlieranalysis.png

image Outlier analysis
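
One common detection rule - shown here as a sketch, since SEDGE's exact method is not specified - is the interquartile-range (IQR) rule, which flags points outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]:

```python
import statistics

def iqr_outliers(data):
    """Flag points outside the 1.5*IQR fences (standard Tukey rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]))  # [100]
```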

Deviation analysis

Deviation analysis measures the absolute difference between each number in a set and the mean of the set. It can be used to determine how a specification will behave in the face of such deviations.

_images/Deviationanalysis.png

image Deviation analysis

Decision Tree analysis

Decision tree analysis is the process of drawing a decision tree - a graphic representation of the alternative solutions available to solve a given problem - in order to determine the most effective course of action. Decision trees are composed of nodes and branches: nodes represent a test on an attribute and branches represent the potential outcomes.

_images/Decisiontreeanalysis.png

image Decision Tree analysis

Predictive Analytics

_images/Predict.png

image Predictive Analytics

After completing our data manipulation and graphical analysis, we can move to the last part: Predictive Analytics. In the “Statistics” section, select all the columns that are important for your research and ignore those that have little or no relation to the target. Our target for this research is the “Survived” field. A very important step was to convert the “Survived” field from “Numerical” to “Categorical” type, as keeping it “Numerical” would result in very low accuracy.

  • Balancing

In the following step we can choose the type of balancing - “None”, “Down sampling”, “SMOTE” or “Up sampling” - in our example we selected “None”. Sampling is used to resolve class imbalance, because an imbalance in the observed classes might negatively impact our model. Machine learning algorithms have trouble learning when one class dominates the others, so balancing the data helps us avoid these problems.

_images/Balancing.png

image Balancing

  • None - no sampling of the data is performed; the data stays as it is in the dataset.

  • Down Sampling - the system will randomly subset all the classes in the data set so that their class frequencies match the least common class

  • SMOTE - Synthetic Minority Oversampling Technique - this technique synthesises new minority instances between existing minority instances

  • Up Sampling - the system will randomly sample, with replacement, from the minority class until it is the same size as the majority class
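
The down-sampling option, for instance, can be sketched as follows: subset every class to the size of the least common class. This is an illustrative stand-in for SEDGE's internal behavior, seeded for reproducibility:

```python
import random

def down_sample(rows, label_key, seed=0):
    """Randomly subset every class to the least common class's size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n = min(len(v) for v in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n))
    return balanced

data = [{"survived": 0}] * 6 + [{"survived": 1}] * 2
print(len(down_sample(data, "survived")))  # 4 (two rows per class)
```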

The next step is to select which algorithms should be used, the split ratio (the default is 80:20, meaning 80% of the sample is used for training and 20% for cross validation) and the cross-validation number (how many groups the cross-validation sample should be split into). Then click on the “Start learning” button and the system will run these algorithms.

Algorithms

_images/Algorithm1.png

image Algorithms

Linear regression

Linear regression is the statistical model that analyzes the linear relationship between a scalar response and one or more explanatory variables.

Logistic regression

Logistic regression is a supervised learning algorithm used to predict a dependent categorical target variable.

Decision Tree

Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes, and leaves.

Random Forest

A random forest is a classification algorithm consisting of many decision trees. It is an ensemble method that performs better than a single decision tree because it reduces over-fitting by averaging the results. It uses bagging and feature randomness when building each individual tree, trying to create an uncorrelated forest of trees whose committee prediction is more accurate than that of any individual tree.

Gradient Boosting Machines

A Gradient Boosting Machine or GBM combines the predictions from multiple decision trees to generate the final predictions.

eXtreme Gradient Boosting

eXtreme Gradient Boosting, or XGBoost, is another popular boosting algorithm. XGBoost is an improved version of the GBM algorithm: its working procedure is the same, and the trees are built sequentially. XGBoost also includes a variety of regularization techniques that reduce overfitting and improve overall performance.

Naive Bayes

Naive Bayes is a classification algorithm suitable for binary and multiclass classification. It performs well with categorical input variables compared to numerical variables, and is useful for making predictions and forecasting based on historical data.

K-Nearest Neighbors

KNN is a non-parametric method used for classification, and one of the best-known classification algorithms. Known data points are arranged in a space defined by the selected features. When new data are supplied to the algorithm, it compares the classes of the k closest data points to determine the class of the new data.

Support Vector Machine

The support vector machine algorithm finds a hyperplane in N-dimensional space (where N is the number of features) that distinctly classifies the data points.

Multilayer Perceptron

A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers.

Light GBM

Light Gradient Boosted Machine, or LightGBM for short, is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm. It is typically used for data with more than 10,000 rows, although no fixed threshold decides when to use LightGBM. It suits large volumes of data, especially when high accuracy is needed.

We can see the status of each algorithm; some are more complicated and will take more time. Once all reach 100% completion, the algorithms are sorted by their accuracy.

In our case, we can see 7 algorithms were used.

  • XGBOOST - decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework - with this algorithm we got an accuracy of 85%.

  • Decision tree

  • GBM - Gradient boosting

  • Random forest

  • Logistic regression

  • Naive Bayes

  • KNN - K nearest neighbors

We can also see further details of the algorithm - after clicking on the “Metrices” button.

Variable of Importance

_images/ModelDetailsVI.png

image ModelDetailsVI

In this chart we can see which data field has a higher correlation to our selected target field.

Metrics - Classification report

_images/ClassificationReport.png

image ClassificationReport

Classification report is used to measure the quality of predictions from a classification algorithm. The metrics are calculated by using true and false positives, true and false negatives.
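
For a binary target, the reported metrics follow directly from the true/false positive and negative counts. A minimal sketch of precision, recall and F1 (the report itself may include more, such as per-class support):

```python
def classification_report(y_true, y_pred):
    """Precision, recall and F1 for a binary target (minimal sketch)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(classification_report([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```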

Metrics- Confusion Matrix

_images/ConfusionMatrix.png

image Confusion Matrix

Confusion matrix, also known as an error matrix, is a specific table layout that allows visualisation of the performance of an algorithm. It shows the “True Positives”, “True Negatives”, “False positives” and “False Negatives”.

Metrics- Gain Charts

_images/Gain.png

image Gain Charts

Cumulative gain charts are a popular metric in predictive analytics, used for visually measuring the effectiveness of a classification model. A gain chart measures the performance for a proportion of the population data, calculated as the ratio between the results obtained without the model (the random line) and with the model (the model line). The greater the area between these two lines, the better the model. The Y axis shows the target data and the X axis shows the percentage of the population.

In the gain chart image, the black line is the random line, and the green and orange lines are the results based on the model for the two classes. From the figure it can be seen that the orange class model has performed better than the green one, and both lines have performed better than the random line (black).

Metrics- Lift Chart

_images/Lift.png

image Lift Charts

Lift charts are another popular metric in predictive analytics for visually measuring the effectiveness of a classification model. A lift chart shows the actual lift: the lift curve is determined by measuring the ratio between the result predicted by the model and the result using no model. It graphically represents the improvement of a model compared against a random guess, measuring the change as a lift score. By comparing lift scores, we can determine which model is best. A lift chart also shows the point at which the model’s predictions are strong and where they become weak (less useful).

Metrics- K-S Chart

_images/K-S.png

image K-S

K-S is a measure of the degree of separation between the positive and negative distributions.

AUC- Area under the ROC Curve

_images/AUC.png

image AUC

On the Y axis we can see the True Positive Rate - samples that were evaluated correctly, on the X axis we can see the False Positive Rate - samples evaluated incorrectly. When we plot each of the results, we get the ROC curve. We evaluate the AUC - area under the ROC - if the value is near to zero, the selected model is not classifying correctly. If the value is near to 1, the selected model is classifying correctly. In our case the number is 0.92.
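
The AUC can equivalently be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties count half). A small pairwise sketch, shown for illustration:

```python
def auc(y_true, scores):
    """AUC via the pairwise rank formulation (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0 (perfect separation)
```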

Model comparison

The SEDGE application now allows data analysts to review and compare models by MAPE (mean absolute percentage error), RMSE (root mean square error), MSE (mean squared error), MAE (mean absolute error), and MLCH. This allows the analyst to save and test models for greatest accuracy.

_images/Modelcompare.png

image Model Comparison
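
The error metrics used for comparison can be sketched in a few lines (MLCH is omitted here, as its definition is not given in this document; MAPE assumes no zero actual values):

```python
import math

def regression_metrics(actual, predicted):
    """MAE, MSE, RMSE and MAPE for model comparison (a minimal sketch)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    rmse = math.sqrt(mse)
    mape = 100 * sum(abs(e) / abs(a)
                     for e, a in zip(errors, actual)) / len(errors)
    return mae, mse, rmse, mape

print(tuple(round(m, 2) for m in regression_metrics([100, 200], [110, 190])))
# (10.0, 100.0, 10.0, 7.5)
```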

Trained models can be shared with users, who can access them from any other environment.

_images/Modelsharing.png

image Modelsharing

Actual vs predicted

Actual vs. predicted plots visualize the actual versus the predicted values for all predictive models.

_images/actual.png

image Actual vs Predicted

The above is the graph between the actual and predicted values.

Shapley importance

SHAP, or SHapley Additive exPlanations, is a visualization technique that makes a machine learning model more explainable by visualizing its output. It can explain the prediction of any model by computing the contribution of each feature to the prediction. If the SHAP value is close to zero, the data point contributes very little to the prediction; if it is strongly positive or strongly negative, the data point greatly contributes to predicting the positive or negative class.

_images/Predictionshapelyimportance.png

Shapely importance