Click on the Data analytics icon; this will redirect you to the screen where a project can be created.
This section covers uploading data into SEDGE from different types of databases. SEDGE can connect to databases such as MySQL, MariaDB, Oracle and PostgreSQL.
Future releases of SEDGE will also cover other databases such as Cassandra, SQLite, Presto, Redshift and Redis. If you need assistance connecting to these databases, contact our support team, and they can provide the functionality to connect to them as well.
We can use the SQL Query Generator in many ways, for example to combine records from two or more tables. There are four types of joins – Inner, Full, Left and Right. In our example, we perform a left join – combining records from the first (left-most) table with matching records from the right table. As a result, we get a left join of the tables “cancer diagnostic” and “diabetes”,
using the patient ID as the join condition. This means we get one final table that combines the information on each patient with the same ID.
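As an illustration of the left-join semantics described above (this is not the SQL Query Generator itself, and the tiny patient tables below are hypothetical), a minimal pandas sketch:

import pandas as pd

cancer_diagnostic = pd.DataFrame({"patient_id": [1, 2, 3],
                                  "diagnosis": ["benign", "malignant", "benign"]})
diabetes = pd.DataFrame({"patient_id": [1, 3],
                         "hba1c": [6.1, 7.4]})

# Left join: keep every row of the left table and attach matching right-table rows by patient ID
combined = cancer_diagnostic.merge(diabetes, on="patient_id", how="left")
print(combined)   # patient 2 has no diabetes record, so hba1c is NaN in that row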
A - After adding your files, click on the upload button; the upload might take a while depending on the file size, and you will then be redirected to the data preview page.
B - Users also have the option of zipping the CSV file, so that large CSV files can be uploaded.
C - Users can enter a name for the project (optional). If no name is entered, the system will take the name of the file and set a default project number.
File collaboration in SEDGE facilitates the sending and receiving of files and collaboration among users. This functionality helps users collaborate and exchange opinions with other SEDGE users, which is especially valuable for users in different geographic locations and time zones. It only takes a single click to send a file and grant access to another user.
For our demonstration, we have chosen the “Titanic” data. In the following steps, we will forecast, based on the available data, the chance of survival on board the Titanic.
In this case, the factors taken into consideration are: class, name, sex, age, number of family members, parch, ticket number, fare, cabin, and port of embarkation. Some of these factors might not be relevant for the predictive analysis, so we will also demonstrate how to remove or ignore some of the data.
Note
This page is just a preview of the full data; only a small sample is displayed here to give a first impression.
A - This row displays all the variables in the current analysis. You can sort the data by clicking on the arrows next to the respective variable
B - This row identifies if the data in a column are categorical or numerical (C or N)
C - Monitor in which phase of the analysis you are currently located – identified by the green colour
D - Information about the total number of rows and columns in the uploaded file
E - Action buttons to move either to the next step or return to the previous step
A - Dataset info – basic information about the respective data set, such as number of variables, number of observations/rows, percentage of missing data, and warnings
B - List of all columns/variables in the data set
C - Information if the variable is categorical or numerical
D - How many unique values are there within each variable (sex – male or female – 2 possible values, name – unlimited number of values)
E - Total number of missing values within the respective variable
F - Number of missing values represented in percentage
G - Mean – the average value of the respective numerical variable, found by adding all values in the data set and dividing by the total number of values
H - Median – the middle value in the data set – value that separates the higher half of the data sample from the lower half
I - SD – standard deviation represents how close or far the values are dispersed from the mean
J - In this tab, you have to select the target you are going to be predicting. It is one of the variables that are already part of the dataset. In our case, we are going to be predicting if the Titanic passenger did or did not survive
After selecting your preferences, click on the Apply button.
L - GDPR - Data pseudonymization
Depending on the nature of your data, you can choose whether you wish to protect personal data. If you select Yes, the next step will take you to the GDPR page, where you select which data you want to hide. If you do not need to hide any data, select the No button and the GDPR page will be skipped.
After filling in the GDPR form, you can select whether to run an Automatic Machine Learning or a Custom Machine Learning.
M - Machine learning (ML)
If in the previous step you selected “Yes” for GDPR protection, you will find the ML option under the GDPR form. If you selected “No”, you can select this option under this tab.
Statistics is the branch of mathematics concerned with collecting, analyzing, interpreting, presenting, and organizing data. It encompasses a wide range of techniques and methodologies used to make sense of data, including summarizing data through measures based on samples.
On the statistics page, there are several top functions available, which offer:
Numerical Operations
Strings
Dates
Maths
Text
Data
Advanced
Boolean
Frequently used
To navigate to the Statistics page from the profiling page, click on the next button or click the statistics icon.
A: In the first row, all the possible data operations are displayed – sorted into eight different categories: Numerical, String, Dates, Maths, Text, Data, Advanced, Boolean, and Frequently used. Each tab will be discussed separately on the following pages.
B: The first window in the statistics display showcases the fundamental statistics of the dataset, with the target highlighted in green, as indicated by the legend.
C: Information we can get in this window is:
Column name: Shows the columns or variables that are present in the dataset.
Value type: Numerical, Categorical, Date, Text.
Unique Values: Number of unique values in the variable.
Mean: The average value of the variable.
Median: The middle value of the variable.
Standard Deviation: SD - Measures how the data is spread around the mean.
Missing Values: Number of missing values of the variable in the total count.
% of Missing Values: The number of missing values in the dataset, represented as a percentage. Upon clicking on a column name, the other charts dynamically adjust based on the selection. To perform operations with a specific column, simply click on the checkbox corresponding to that column and proceed with the desired functions.
D: Graphstat - The graphical display illustrates our chosen value – in this case, Sales is selected for demonstration purposes. In this tab, the distribution of sales can be observed. Additionally, options are available to view the display in full-screen mode, download the graphs in different formats, print the chart, or utilize data binning to divide the dataset into appropriate bins (which refers to grouping data into ranges of values). Furthermore, if a Scatter Plot is required, the option can be enabled. This will plot for the target variable and the selected variable.
Bins
E: Transformations - Under this category, various available transformations can be applied to modify the data as needed. To perform transformations, refer to the syntax provided; you can also clear, test, and apply the syntax, as well as view the Expression History.
F: Preview - Displays a preview of the whole dataset (top 50 rows)
G: Advanced statistics - In this tab, advanced statistics for the selected row are displayed; in this case, the ‘Sales’ row is selected.
H: In this line, corresponding information from the advanced statistics is displayed: Minimum value, 5th percentile, First Quartile (25th percentile), Third Quartile (75th percentile), Interquartile Range (IQR), 95th percentile, Maximum value, Variance (measure of data dispersion), Skewness (measure of data symmetry), Kurtosis (measure of data peakiness or tails), Outliers (data points significantly distant from the rest), and the option to delete the Outliers.
J: Variable of Importance - In this window, all the rows of the dataset are visible, sorted by their importance towards the selected target. In this case, Sales is selected as the most important factor, although it may not be accurate. The variable of importance can be refreshed using this icon (M). Additionally, a Chart Context Menu is available where you can view in full screen, print the chart, download the image as PNG, JPEG, PDF, SVG, and download the data as CSV and XLS.
The operation converts a numerical value into its absolute value, exclusively accepting numerical inputs. For instance, temperature can be transformed into its absolute value to determine its distance from zero. Upon applying an operation, a success notification appears in the top-right corner. The newly generated column is displayed at the bottom of the preview. After the column is added, the variable of importance may change; refresh it to observe the impact.
image showing message notification after ABS function (absolute)
Simplifies converting measurements between different units, for example converting inches to centimeters or pounds to kilograms using a unit converter.
Round numbers to a specified number of digits, such as rounding housing measurements to 2 digits. This function allows manual or automatic selection of the column name and, if required, lets the user manually choose the name of the new column.
Converting a numerical column to a categorical format. In case of missing values, they can be managed by either deletion or imputation, techniques elaborated upon in the forthcoming “Data” chapter.
Upon applying this function, the successful validation will appear.
A Substring is a contiguous sequence of characters within a larger string. It is a portion of a string that is extracted or considered separately based on its starting position and length.
For the string “houstan”, the selected position is “Middle” with a start index 1 and length of 5. This operation will return the middle portion. For the middle position it is required to select the start index.
For the string “New York”, the selected position is “Exclude String” with a length of 3 and a start index of 1. This operation will exclude the specified characters from the string.
The output will be: York.
Example for substring for the position - Exclude String
For the string “Los Angels” and the selected character “s” with the position “after”, the operation will split the string after each occurrence of the character “s”.
The output will be: Angel.
Example for substring for the position - Split by Character/Pattern
Note
For the position “Middle” and “Exclude String” it is required to select the start index.
For the position “Split by character/pattern” it is required to choose either “Before” or “After”.
Replace one or more characters with another pattern/set of strings. Under ‘Pattern which exist,’ provide the pattern that needs to be changed. Then, under ‘Pattern to be replaced with,’ specify the pattern for replacement.
To replace “UK” with “United Kingdom,” provide the following:
Replace a set of strings based on its occurrence index value. Specify the start and end indices of the substring to be replaced, along with the replacement text.
Example: Original word: “Chine”
To replace a range of characters from index 1 to index 5 with “India” provide the following,
Start index: 1
End index: 5
String to be replaced: “India”
After completing the replacement operation, the updated word will be: India
To determine if an entry matches a searched string, compare each entry against the specified string. If a match is found, return the matched string. If no match is found, return False to indicate that the string is not present in the column.
Extracts numerical values from text. This function is useful for isolating and retrieving numbers embedded within text, simplifying the process of data cleaning and preprocessing.
Convert categorical variables into numerical values only when the selected categorical column contains numbers exclusively. Upon applying this function, the successful validation will appear.
Note
String operations can be performed only on String variables.
The Dates function menu helps to strip out the day, month, year, etc., from a date column. These stripped-out values act as additional features for the model.
Example: Let's try to extract the Month from the “Date of Hire” column.
This function is used to convert the date pattern to a different format more suitable for the analysis. After selecting the input and desired output pattern, click on the apply button.
Example: Let's convert the date pattern from “yyyy-MM-dd HH:mm:ss” to “yyyy-MM-dd”.
This will convert the existing column to the selected date pattern.
There are several types of date patterns and shortcuts that can be present in a date variable. You can design your own format patterns for dates and times from the list of symbols in the following table:
Characters that are not letters are treated as quoted text. That is, they will appear in the formatted text even if they are not enclosed within single quotes.
Removing or deleting all records or entries that fall within a specified range of dates from a dataset.
Steps to perform:
Select “From date” and “To date”
Select Keep range (will keep only the selected range of dates) or Delete range (will delete only the selected range of dates) based on your preferences.
Example: Let's keep only the records from “6/19/1961” to “8/17/1970”.
This will alter the existing column and will keep only the selected range of values.
Marking or highlighting all records or entries that fall within a specified range of dates. This is often done to identify, review, or take specific actions on the data within that range.
Example: Let's flag the records from “5/30/2011” to “2/19/2013”.
Calculation of the amount of time between two dates. This can be measured in various units such as days, months, or years.
Steps to perform:
Select the Minuend (the number from which another number, the subtrahend, is to be subtracted) and the Subtrahend (the number that is to be subtracted from the minuend).
Choose the output time unit from the dropdown menu.
Example: Let's compute the difference between the columns “Date of Hire” and “Date of Termination”.
This will generate a new column at the bottom of the preview.
Convert a positive to negative sign and vice versa. Choose between positive or negative conversion, and the result will be displayed in the existing column.
Equal width binning is a type of interval binning where each bin has the same width.
For example, if the data represents the heights of students in a class ranging from 120 cm to 180 cm, and we need to create 4 bins:
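With these values, each bin covers an equal width:

\text{bin width} = \frac{180 - 120}{4} = 15 \text{ cm}

giving the bins [120, 135), [135, 150), [150, 165) and [165, 180].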
This function includes several mathematical operations - Box cox, Yeo Johnson, Cubic root, Exponential, Log, Normalization, Power2, Power3, Reciprocal, Square root and Standardization. After selecting your chosen operation a new column is created at the bottom.
The Box-Cox transformation is a statistical method used to stabilize variance and make data more normally distributed (symmetric bell-shaped distribution where the mean, median, and mode are all equal)
This dataset is skewed to the right, indicating that most of the house prices are lower.
To apply the Box-Cox transformation mathematically,
Choose a range of values for the λ (lambda) parameter to test the transformation. Let’s say we try λ values ranging from -2 to 2.
For each λ value, calculate the transformed values using the Box-Cox transformation formula:
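For reference, the standard Box-Cox formula (defined for positive values of x_i) is:

y_i^{(\lambda)} = \begin{cases} \dfrac{x_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x_i), & \lambda = 0 \end{cases}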
The lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; The “optimal value” is the one which results in the best approximation of a normal distribution curve.
The Yeo-Johnson transformation extends the Box-Cox transformation to handle both positive and negative values, as well as zero values. It is useful for stabilizing variance and making data more normally distributed.
Factors influencing the lambda value include the data’s skewness, distribution, variance, sensitivity to outliers and optimization method. The lambda value is empirically determined to maximize the normality and symmetry of the transformed data.
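For reference, the standard Yeo-Johnson transformation is defined piecewise as:

\psi(x_i, \lambda) = \begin{cases} \dfrac{(x_i + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x_i \geq 0 \\ \ln(x_i + 1), & \lambda = 0,\ x_i \geq 0 \\ -\dfrac{(1 - x_i)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ x_i < 0 \\ -\ln(1 - x_i), & \lambda = 2,\ x_i < 0 \end{cases}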
The cubic root of a number is the value that, when multiplied by itself twice, yields the original number. For instance, the cubic root of 8 is 2.
\sqrt[3]{8} = 2, \quad \text{since } 2 \times 2 \times 2 = 8
Exponential refers to a mathematical operation or function where a constant (the base) is raised to the power of an exponent, resulting in a rapidly increasing or decreasing function. It is commonly represented as a^x, where a is the base and x is the exponent.
An example of an exponential function is f(x) = 2^x. Let’s evaluate this function for some values of x:
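f(0) = 2^{0} = 1, \quad f(1) = 2^{1} = 2, \quad f(2) = 2^{2} = 4, \quad f(3) = 2^{3} = 8, \quad f(-1) = 2^{-1} = 0.5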
The log transformation is a mathematical operation that calculates the logarithm of a number. It is commonly used to transform skewed data to make it more symmetrically distributed.
The transformation formula is as follows:
y_i = \log(x_i)
where:
x_i is the original value,
y_i is the transformed value.
Example:
\log_{10}(100) = \frac{\ln(100)}{\ln(10)}
We can substitute \ln(100) and \ln(10) with their respective values:
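\ln(100) \approx 4.6052 \quad \text{and} \quad \ln(10) \approx 2.3026

\log_{10}(100) = \frac{4.6052}{2.3026} = 2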
Normalization is a technique used to scale numeric features to a standard range, typically between 0 and 1 or between -1 and 1. It is used in statistical analyses to ensure that all features have the same scale and to prevent features with larger values from dominating those with smaller values.
For Example, Consider the following dataset,
{Data} = [2, 5, 8, 10, 15]
To normalize these values using min-max scaling, we’ll use the formula:
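x' = \frac{x - \min(x)}{\max(x) - \min(x)}

With min = 2 and max = 15, the normalized data become:

[0, 0.23, 0.46, 0.62, 1.0]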
The term “Power of 2” typically refers to the operation of exponentiation, where a number is raised to the second power. In mathematical notation, it is represented as "x^2" where, x is the base number. This operation involves multiplying the base number by itself.
For example:
2^{2} = 4
In general, raising a number to the power of 2 is known as squaring that number.
The term “power of 3” also refers to exponentiation, where a number is raised to the third power. In mathematical notation, it is represented as "x^3" where, x is the base number. This operation involves multiplying the base number by itself twice.
For example:
1200^{3} = 1,728,000,000
In general, raising a number to the power of 3 is known as cubing that number.
The reciprocal of a number is calculated by dividing 1 by that number.
Mathematically, if we have a number represented as x, the reciprocal of x is represented as \frac{1}{x}.
For example:
If x = 5, the reciprocal of 5 is 1/5.
Similarly, if x = 0.5, the reciprocal of 0.5 is 1/0.5 = 2.
The reciprocal of a number essentially represents how many times that number fits into 1. For instance, the reciprocal of 5 being 1/5 means that 5 fits into 1 one-fifth of a time.
The square root of a number is a value that, when multiplied by itself, gives the original number. In mathematical notation, the square root of a number x is represented as √x.
For example:
\sqrt{25} = 5. Since 5 * 5 = 25
The square root operation can also be denoted using exponentiation: x^(1/2). This means raising x to the power of one-half, which is equivalent to taking the square root of x.
In statistics, standardization is a process used to transform data so that it has a mean of zero and a standard deviation of one. This technique is applied to individual variables within a dataset, making them comparable and facilitating statistical analysis.
The formula for standardizing a variable x is:
z = \frac{x - \mu}{\sigma}
Where:
x is the original value of the variable,
\mu is the mean of the variable,
\sigma is the standard deviation of the variable, and
z is the standardized value.
Standardization transforms the distribution of the data to have a mean of zero and a standard deviation of one. This process does not change the shape of the distribution but places it on a common scale, allowing for easier comparison between variables and datasets.
Standardization is particularly useful in statistical modeling, machine learning, and data analysis, where it helps improve the interpretability and performance of models by ensuring that variables are on a consistent scale.
Example:
Consider the data x= [5, 20, 8, 3, 4, 6, 15, 12, 9, 10]
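Using the population standard deviation (an assumption; a sample standard deviation would give slightly different numbers), the worked values are:

\mu = 9.2, \qquad \sigma = \sqrt{\frac{253.6}{10}} \approx 5.04

z(20) = \frac{20 - 9.2}{5.04} \approx 2.14, \qquad z(3) = \frac{3 - 9.2}{5.04} \approx -1.23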
Text processing is used to analyze the most used words in a written text. It can also be used to identify SPAM emails by searching for specific key words. By performing text processing you will get the top 100 used words as new columns (see the sketch after the steps below).
Select the preferred N grams from 1 to 5.
Select the Encode type: Boolean (0 or 1) or Count.
Select Custom stop words if required.
Click on apply button.
Note
The output will appear at the bottom of the page. You can also preview the output on the preview page.
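For readers who want to see the underlying idea, a minimal scikit-learn sketch of word-count encoding (illustrative only; SEDGE performs this internally, and the column name, sample texts and stop words below are hypothetical):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"email_text": ["win a free prize now",
                                  "meeting moved to friday",
                                  "free prize waiting"]})
vectorizer = CountVectorizer(
    max_features=100,         # keep the top 100 words as new columns
    ngram_range=(1, 1),       # N grams from 1 to 1 (unigrams)
    binary=False,             # Count encoding; set True for Boolean (0 or 1)
    stop_words=["to", "a"],   # custom stop words, if required
)
counts = vectorizer.fit_transform(df["email_text"])
word_columns = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
print(word_columns)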
This operation will not generate a new column for the output; instead, it will only modify the data type within the same column, altering the output accordingly.
Note
Text functions can be performed only on Text variables.
One hot encoding is another method used to represent categorical variables in a format suitable for machine learning algorithms. Instead of assigning numerical labels like label encoding, one hot encoding creates binary vectors for each category.
Note
This function is applicable for both categorical and text columns. It does not require a user-defined output column name; instead, it will automatically generate a column name for the output.
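A minimal pandas illustration of one hot encoding (the “Embarked” column and its values are hypothetical):

import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
encoded = pd.get_dummies(df["Embarked"], prefix="Embarked")
print(encoded)    # each category becomes its own binary column: Embarked_C, Embarked_Q, Embarked_S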
Imputation involves filling in missing data points in a dataset (a short illustrative pandas sketch follows this list).
Techniques available for Numerical Data Type include:
Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)
Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)
Mode (Value that appears most frequently in a dataset. It’s the most common observation or value in the data.)
Default
Moving Average (Smooths out fluctuations in data by calculating the average of a subset of data points within a sliding window.)
Forward Filling (Copying the last observed value forward to fill in missing values.)
Backward Filling (Copying the next observed value backward to fill in missing values in a dataset.)
Custom Filling (Filling missing values using a user-defined method.)
Average of previous and next (Involves taking the average of the values immediately preceding and following the missing value to estimate its value.)
KNN (Missing values are replaced with the average or median value of the k nearest neighbors.)
The available techniques for the “Categorical” variable are:
Deleting rows with missing values can be a preferred approach when those values are critical for analysis and cannot be easily imputed.
This function removes all rows containing missing values across all columns in the dataset, ensuring that only complete observations are retained for analysis.
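The sketch referenced above — a minimal pandas illustration of a few of these techniques plus row deletion (illustrative only; the column names and values are hypothetical):

import pandas as pd

fare = pd.Series([7.25, None, 8.05, None, 71.28], name="Fare")   # hypothetical column with gaps
mean_filled    = fare.fillna(fare.mean())     # Mean imputation
median_filled  = fare.fillna(fare.median())   # Median imputation
forward_fill   = fare.ffill()                 # Forward filling (copy the last observed value)
backward_fill  = fare.bfill()                 # Backward filling (copy the next observed value)

df = pd.DataFrame({"Fare": fare, "Age": [22, None, 26, 35, None]})
complete_rows = df.dropna()                   # delete every row that has a missing value in any column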
Copying a column or creating a duplicate allows for performing two different operations on the same column, which can be useful. This process simply involves creating a copy of the column, and then proceeding with the desired operations. The new column is created at the bottom.
The function is used to concatenate two or more columns into one column. For instance, if there are columns for “First name” and “Surname,” they can be merged into a single column.
For example, “Name” and “Sex” columns are being combined. A delimiter is manually selected to clearly mark the transition between the values of the first and second columns.
Equal to: Filters data where the numerical value is exactly equal to a specified value. For example, filtering sales data for transactions where the amount equals $1000.
Not Equal to: Filters data where the numerical value is not equal to a specified value. For instance, excluding transactions with a sales amount of $1000.
Is Greater Than: Filters data where the numerical value is greater than a specified threshold. For example, filtering sales data for transactions with amounts greater than $1000.
Is Greater Than or Equal to: Filters data where the numerical value is either greater than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts greater than or equal to $1000.
Is Less Than: Filters data where the numerical value is less than a specified threshold. For example, filtering sales data for transactions with amounts less than $1000.
Is Less Than or Equal to: Filters data where the numerical value is either less than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts less than or equal to $1000.
Is Null: Filters data where the numerical value is null or missing.
Is Not Null: Filters data where the numerical value is not null or missing.
Between: Filters data where the numerical value falls within a specified range. For example, filtering sales data for transactions with amounts between $500 and $1000.
Note
This filter is applicable to Categorical and Text data types.
For adding more rules and conditions, click on “Add a Group” and choose between an “and” or “or” option for each step. Once the “Apply” button is clicked, the entire dataset will be modified to include only those entries that meet the specified filter criteria.
Grouped statistics organize data into groups and calculate statistical measures for each group, offering detailed insights into how variables behave within specific categories.
The available aggregations are:
Mean (Often known as the average, represents the central tendency of a dataset. It’s computed by adding up all values and dividing by the total count of values.)
Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there’s an even number of values, it’s the average of the two middle values.)
Variance (Measures how far the values are spread out from the mean; it is the average of the squared deviations from the mean.)
Standard Deviation (How the data is spread around the mean)
Unique (Number of unique values in the dataset)
Count (Total number of elements present in a particular variable or column)
Sum (Total value obtained by adding together all the individual values in a specific variable)
Example: To calculate the average booking amount for each destination city, select INR_Amount with the Mean aggregation, grouped by the destination city column.
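For illustration only, an equivalent computation in pandas (the column names “to_city” and “INR_Amount” and the values are assumptions based on the example):

import pandas as pd

df = pd.DataFrame({
    "to_city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "INR_Amount": [4500, 7200, 5100, 6900],
})
avg_by_city = df.groupby("to_city")["INR_Amount"].mean()   # mean booking amount per destination city
print(avg_by_city)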
Transpose refers to the operation of flipping or rotating a dataset, typically switching its rows and columns. This operation can be useful for various purposes, such as changing the orientation of data for easier analysis or visualization, or preparing data for specific computations or algorithms. When a dataset is transposed, the rows become columns and vice versa.
Dropping duplicates refers to removing duplicate values in a column or removing entire duplicate rows. This operation ensures that each row in the dataset is unique, which can be important for various data analysis tasks.
The selected column (from_city) contains 2033 missing records.
After applying the function to the selected column, the validation results will be displayed.
Analysis based on pivoting data - a Pivot Table is used to summarise, sort, reorganise, group, count, total or average data stored in a table. It allows us to transform columns into rows and rows into columns.
We have two options in this section - either a “Pivot mode” or a regular chart mode.
In the regular chart mode we can select all the rows that we are interested in, in our example we selected Age, Fare, Sex, Ticket purchase, Cabin Detail and Number of meals. We get a basic chart with the selected values.
In the next step, we can also divide our selected columns in groups, by dragging the preferred group type down to the “Row groups” section. We have selected “Sex” to differentiate the groups. We can see we have 577 items in group “Male” and 314 items in group “Female”. We can open each group by clicking on the group name.
In the next step, we can add other numerical values that we are interested in. In our case, we dragged “Fare” down to the “Values” section and clicked on the desired value in the bottom green box. We can also select in which form the value should be displayed - average, count, first, last, max, min or sum. In our case it was the “average fare” value.
After that, select the fields you want to have in your Pivot table. For our demonstration we selected Survived, Age, Sex, Ticket purchase and Meal Cost. “Ticket purchase” was dragged to the “Row Groups”, so it was automatically shifted to the left side as the first column. We also shifted “Sex” to the “Column Labels” section, so our chart was split into two sections - one showing only the “Female” data and the other showing only the “Male” data. The rest of our selected fields (Survived, Age and Meal Cost) remain in the “Values” section, where we can again choose in which form each field should be displayed (average, count, first, last, max, min or sum; in our case it was the average).
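For readers who prefer code, a rough pandas equivalent of this kind of pivot (illustrative only; the tiny DataFrame and its values are made up, and the column names follow the text above):

import pandas as pd

df = pd.DataFrame({
    "Ticket purchase": ["Online", "Online", "Counter", "Counter"],
    "Sex":             ["male", "female", "male", "female"],
    "Survived":        [0, 1, 0, 1],
    "Age":             [22, 38, 54, 27],
})
pivot = pd.pivot_table(df,
                       index="Ticket purchase",      # Row Groups
                       columns="Sex",                # Column Labels
                       values=["Survived", "Age"],   # Values
                       aggfunc="mean")               # average, count, first, last, max, min or sum
print(pivot)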
Jupyter is an open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. Users can create a customised program/code using a Jupyter notebook and import it back into SEDGE.
Regular Expressions (REs) offer a mechanism to select specific strings from a set of character strings. They provide a context-independent syntax capable of representing a wide variety of character sets and their orderings, with interpretations based on the current locale. In addition to using Search and Replace for finding and replacing simple text strings, you can perform more precise searches using Regular Expressions (RegEx). RegEx allows you to accurately match a particular pattern in your search.
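A tiny Python illustration of the idea (the pattern and strings are hypothetical):

import re

text = "Order IDs: A-1023, B-8841, C-17"
ids = re.findall(r"[A-Z]-\d+", text)    # match a capital letter, a hyphen, then one or more digits
print(ids)                              # ['A-1023', 'B-8841', 'C-17']
print(re.sub(r"\d+", "#", text))        # replace every run of digits with '#'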
Delta (Δ) is a generalized function that calculates the difference between each row value and the value from its corresponding subtrahend position in the given numerical column grouped & sorted through the groupBy & orderBy columns respectively.
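A rough pandas sketch of the same idea (column names and values are hypothetical; SEDGE computes this through its groupBy and orderBy selections):

import pandas as pd

df = pd.DataFrame({
    "city":  ["A", "A", "A", "B", "B"],
    "month": [1, 2, 3, 1, 2],
    "sales": [100, 130, 125, 80, 95],
})
df = df.sort_values(["city", "month"])                    # orderBy column
df["sales_delta"] = df.groupby("city")["sales"].diff()    # difference from the previous row within each group
print(df)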
Refers to missing or nonexistent values in a dataset.
When checking for null values in a specific data field or observation, the result is typically a boolean (true or false) indicating whether the value is null (true) or not null (false). This process of checking for null values is essential for data cleaning and quality assessment before analysis.
Negation means flipping a boolean's truth value. For example, if the original boolean value is true, its negation would be false, and vice versa. It's a way to change a statement from being true to false, or vice versa.
Note
This function applies only to Boolean values.
When selecting a non-boolean data type variable, it will show the validation message “Selected column has no boolean values.”
This feature presents a bar chart for the selected column, displaying the value and its count. It supports numerical, categorical, and date data types for bar chart visualization. Numerical data types offer an option to adjust the number of Bins. Text data types are represented using a word cloud. Scatter plots can be enabled to visualize two numerical variables against each other, typically the selected variable and the target variable. Users can view charts in full-screen mode and utilize a print option. Chart download options include PNG, JPEG, PDF, and SVG formats. Data can be downloaded in CSV or XLS format.
Note
Categorical and date data types display only bar charts, while numerical data types offer both scatter plots and bar charts.
For categorical variables, the ‘sort by’ feature is available.
Select One Sample T Test - One tailed under test name.
Under Numerical column select Numerical variable.
Under Categorical column, select a categorical variable if required.
Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.
The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.
Sampling:
Enable Sampling, if required. Choose the sampling method: Random (each member of the population is selected purely by chance, so every individual or item has an equal chance of being included in the sample and selection is not biased by any personal judgment or preference) or Stratified (the population is divided into subgroups, or strata, based on characteristics relevant to the research question, and samples are then randomly selected from each stratum independently; this ensures that each subgroup is represented in the sample and allows for more precise estimates and comparisons within each stratum).
Size:
% of a Variable: Selecting a sample size that is a certain percentage of a particular variable. For instance, if the variable is the total population, and you choose 10%, your sample will include 10% of the total population. This approach is useful when you want the sample size to be proportional to the size of a certain variable.
Absolute Value: Selecting a sample size by specifying an absolute number. For example, if you decide that your sample size should be 5 individuals, you simply select 5 individuals from the population. This method is straightforward and is often used when the required sample size is predetermined by external factors such as budget or time constraints.
By Calculator: Using a calculator involves employing a statistical tool or software to determine the appropriate sample size based on certain parameters such as confidence level, margin of error, and population size. This method ensures that the sample size is statistically valid and sufficient to make reliable inferences about the population.
Lazy Sampling: A less rigorous approach where the sample is selected based on convenience or ease of access rather than using a strict randomization process. This method is quick and cost-effective but may introduce bias and is not always representative of the population. It’s often used in exploratory research or when resources are limited.
Systematic Sampling: Selecting every nth item from a list or population. For example, if you have a list of 1,000 individuals and you want a sample of 100, you would select every 10th individual. This method is easy to implement and ensures a spread across the population, but it assumes that the list is randomly ordered to avoid periodicity bias.
Filtering: Enable Filtering and filter the data, if required.
Once the steps are done, click on ‘Test’ and verify the result.
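For reference, a minimal SciPy sketch of a one-sample, one-tailed t-test (illustrative only; SEDGE runs the test for you, and the sample values below are made up):

import numpy as np
from scipy import stats

heights = np.array([168, 172, 169, 175, 171, 166, 174, 173])   # hypothetical sample
hypothesis_mean = 170                                           # hypothesised population mean
alpha = 0.05                                                    # level of significance

t_stat, p_value = stats.ttest_1samp(heights, popmean=hypothesis_mean, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis: the sample mean is significantly greater than 170 cm.")
else:
    print("Fail to reject the null hypothesis.")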
Provides a normalized representation of the distribution of data, making it easier to compare distributions and understand the relative proportions of data within different intervals.
Histogram:
The histogram displays the frequency distribution of age data within specified intervals (bins).
The blue bars represent the number of data points (frequency) that fall within each age interval.
The height of each bar corresponds to the count of data points in that interval.
Density:
The black curve represents the density estimate, showing the distribution of the data in a continuous manner.
The area under the density curve represents the entire data set and sums to 1.
The histogram gives a direct count of data points within intervals, while the density plot provides a smoothed view, highlighting the shape of the distribution.
Together, they offer a comprehensive view of the data, with the histogram showing discrete counts and the density plot illustrating the overall distribution pattern.
Provides a visual representation of how the shape of the distribution changes with different sample sizes and illustrates the properties of the t-distribution relevant for statistical inference.
Central Tendency of Age Data:
The peak of the density curve (red line) suggests that the most common or average age in the dataset is around 0 on a standardized scale. This means that the data is centered around this value.
Spread and Distribution of Age Data:
The shape of the density curve (red line) and the t-distribution curve (blue line) indicates how age values are distributed in the dataset.
The t-distribution curve (blue line) closely follows the density curve, but it’s slightly smoother and broader. This suggests that the data may have been fitted to a theoretical distribution, such as the t-distribution, often used in statistics for small sample sizes.
Significance of Tail Area:
The shaded area under the t-distribution curve (pink) represents the tail of the distribution.
This area is significant in statistical terms, especially for hypothesis testing, where it helps determine the probability of observing values beyond a certain threshold (e.g., ages above a certain point).
In a one-sample t-test with a two-tailed hypothesis, one examines whether the sample mean significantly differs from a hypothesized population mean. This involves calculating the t-statistic based on the sample data and comparing it to critical values from a t-distribution. The aim is to determine if the mean significantly deviates from the hypothesized value in either direction, hence the term “two-tailed”.
Select One Sample T Test - Two tailed under test name.
Under Numerical column select Numerical variable.
Under Categorical column, select a categorical variable if required.
Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm.
The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement.
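For the two-tailed variant, the same SciPy call can be used with its default two-sided alternative (again, the sample values are made up):

from scipy import stats

heights = [168, 172, 169, 175, 171, 166, 174, 173]          # hypothetical sample
t_stat, p_value = stats.ttest_1samp(heights, popmean=170)   # alternative="two-sided" is the default
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")               # compare p against the chosen alpha (e.g. 0.05)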
The Shapiro-Wilk test is a useful tool for determining the normality of a data set, which is a crucial assumption in many statistical analyses. It provides a clear decision-making process based on the p-value derived from the test statistic.
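A minimal SciPy sketch of the Shapiro-Wilk test (illustrative only; the sample is randomly generated):

import numpy as np
from scipy import stats

data = np.random.default_rng(0).normal(loc=170, scale=8, size=50)   # hypothetical, randomly generated sample
w_stat, p_value = stats.shapiro(data)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
# p > 0.05: no evidence against normality; p <= 0.05: the data deviate from a normal distribution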
The paired Student’s t-test compares the means of two related groups to determine if there is a statistically significant difference. It is used when the same subjects are measured before and after a treatment or in matched pairs. The test calculates the difference between paired observations, then computes the mean and standard deviation of these differences to determine the t-statistic. If the p-value is less than a chosen significance level (e.g., 0.05), the null hypothesis (no difference) is rejected, indicating a significant difference between the means.
This won’t display the Histogram-Density Plot and T-distribution.
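A minimal SciPy sketch of a paired t-test (illustrative only; the before/after values are made up):

from scipy import stats

before = [82, 75, 90, 68, 77, 85]    # measurements before treatment (hypothetical)
after  = [78, 70, 88, 65, 74, 80]    # the same subjects after treatment
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# p < 0.05 -> reject the null hypothesis of no difference between the paired means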
The Chi Square test is a statistical method used to determine whether there is a significant association between categorical variables. It compares the observed frequencies of categories with the expected frequencies if there were no association.
This won’t display the Histogram-Density Plot and T-distribution
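A minimal SciPy sketch of a Chi-Square test of independence (illustrative only; the contingency table below uses made-up counts in the spirit of the Titanic example):

import numpy as np
from scipy import stats

# Observed counts: rows = Sex (male, female), columns = Survived (no, yes)
observed = np.array([[468, 109],
                     [81, 233]])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# A small p-value indicates a significant association between the two categorical variables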
abs( ): Absolute value of a variable. The magnitude of a real number without regard to its sign. Numbers with negative values will be converted to positive ones.
Syntax:
abs(#column1#) AS <<Newabscolumn>>
New column name: Newabscolumn
Note
The absolute value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
Example:
abs(#temperature#)
Where (#temperature#) include following values -12, 6, 0.5, -6, ,25
After applying abs(#temperature#) the values will change to 12, 6, 0.5, 6, ,25
The column may contain negative temperature values, and this will convert all negative temperatures to positive.
acos( ): Arc cosine function result is the same as the inverse of cosine of a variable, the function returns angles in radians
Syntax:
acos(#column1#) AS <<Newacoscolumn>>
New column name: Newacoscolumn
Note
The acos function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
acos(#NumericalColumn#) AS <<Newacoscolumn>>
Where (#numericalColumn#) value is between -1 and 1
After applying acos, we will get an angle expressed in radians
Add months( ): The function returns a date with given number of months added (date + integer months)
Syntax:
add_months(#DateColumn#, number_of_months) as <<NewAddColumn>>
New column name: NewAddColumn
Note
The Add months function can only be applied to Date datatype columns
Example:
add_months(#DateOfHire#, 12) as <<NewAddColumn>>
Assuming new employees have a 12 month trial period, after adding 12 months to the Date of hire column we will get the date when the trial period terminates
append( ): Appending a variable with a pattern to create a new variable.
Syntax:
append(#column1#, 'Pattern to be appended') AS <<NewColumnApp>>
New column name: NewColumn
Note
Applies to both Categorical and Numerical values; the new field which is created is a Categorical type.
Example:
append(#temperature#, 'Deg C') AS <<AppCol1>>
Assuming that the temperature column is a numeric value and appending the string ‘Deg C’ will append the text to the numerical value. Example ‘13.5’ is the numerical value and the text ‘Deg C’ will be appended to read ‘13.5 Deg C’, and the new column AppCol1 which is formed will be string.
ascii( ): Function returns numerical value of the first character of a string
Syntax:
ascii(#Column#) as <<NewAsciiColumn>>
New column name: NewAsciiColumn
Note
Works for categorical columns
Example:
ascii(#CategoricalColumn#) AS <<NewAsciiColumn>>
We have a categorical column “Geography” and we want to transform the variables into numerical Ascii representation. We have variable “France”, “Spain” and “Germany” and after the transformation we get results of “70”, “83” and “71” as per the ASCII characters table.
asin( ): The arcsin function is the inverse of the sine function. It returns the angle whose sine is a given number. Use arcsin when you know the sine of an angle and want to know the actual angle
Syntax:
asin(#Column#) as <<NewAsinColumn>>
New column name: NewAsinColumn
Note
The asin function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
asin(#SineColumn#) as <<NewAsinColumn>>
Assuming we have a column with sine value (0 to 1) we can determine the Asin value.
In case our sine value is 0.85, we will get an asin value of 1.0159852938148251.
atan( ): The Atan function returns the arc tangent in radians (which is the inverse tangent), of its argument. The arctangent means the angle whose tangent is the argument.
Syntax:
atan(#Column#) as <<NewAtanColumn>>
New column name: NewAtanColumn
Note
The atan function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
atan(#NumericColumn#) as <<NewAtanColumn>>
Assuming we have a column with numerical value and a row has value of 37, the Atan function will return the arc tangent value of 1.5437758 - angle in radians.
atan2( ): The atan2 function calculates one unique arc tangent value from two variables x and y and returns angle in radians
Syntax:
atan2(#Column_Y#, #Column_X#) as <<NewAtan2Column>>
New column name: NewAtan2Column
Note
The atan2 function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
atan2(#NumericColumnY#, #NumericColumnX#) as <<NewAtan2Column>>
Assuming we have 2 columns with numerical values and a row has values of y=44 and x=10, the Atan2 function will return the value of 1.3473197256542635
base64( ): The base64 function is most commonly used to encode binary data and transform the argument from binary format to a base 64 string
Syntax:
base64(#Column#) as <<NewBaseColumn>>
New column name: NewBaseColumn
Example:
base64(#CategoricalColumn#) as <<NewBaseColumn>>
Assuming we have a categorical column with locations and one variable is France, the Base64 function will transform the value to RnJhbmNl as a base 64 string
Coalesce( ): The function returns the first non-null expression in the list. If all expressions evaluate to null, then the Coalesce function returns null
Syntax:
coalesce(#Column1#, #Column2#, ..., #ColumnN#) as <<NewCoalColumn>>
New column name: NewCoalColumn
Note
The function can be applied to Categorical and Numerical datatype columns
Example:
coalesce(#FinishDate#, #StartDate#) as <<NewCoalColumn>>
Assuming we have a column “StartDate” and “FinishDate”. Some of the variables in “FinishDate” are not filled in as a task was not completed yet. In this case the transformation will look at “FinishDate” first (listed first in the syntax); if the value is null it will return the “StartDate” value.
Concat( ): The function is used to concatenate two strings to make a single string, for example concatenate two or more columns in one
Syntax:
concat(#Column1#, #Column2#, ..., #ColumnN#) as <<NewConcColumn>>
New column name: NewConcColumn
Example:
concat(#Sex#, #Age#) as <<NewConcColumn>>
Assuming we have a column “Sex” and “Age” and we want both of these variables in one single column. After applying the concat function, we will get results like “Male24”, “Female43” etc.
Concat_ws( ): The function joins the input strings into a single string. It separates those concatenated strings with the separator specified in the first argument
Syntax:
concat_ws('separator',#Column1#, #Column2#, ..., #ColumnN#) as <<NewConcWSColumn>>
New column name: NewConcWSColumn
Note
The CONCAT_WS() requires at least two input strings
Example:
concat_ws('/',#Sex#, #Age#) as <<NewConcWSColumn>>
Assuming we have a column “Sex” and “Age” and we want both of these variables in one single column separated by “/”. After applying the concat function, we will get results like “Male/24”, “Female/43” etc.
Cos( ): The Cos function returns the cosine of a number
Syntax:
cos(#Column#) as <<NewCosColumn>>
New column name: NewCosColumn
Note
The Cos function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
cos(#NumericalColumn#) as <<NewCosColumn>>
Assuming we have a numerical column with angles and we want to convert it into cosine values, we use this function. A value of 2 will be converted into -0.416147.
Cosh( ): The Cosh function returns the hyperbolic cosine of a number
Syntax:
cosh(#Column#) as <<NewCoshColumn>>
New column name: NewCoshColumn
Note
The Cosh function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
cosh(#NumericalColumn#) as <<NewCoshColumn>>
Assuming we have a numerical column with angles and we want to convert it into hyperbolic cosine values, we use this function. A value of 3 will be converted into 10.067.
Cot( ): The Cot function returns the cotangent of a number
Syntax:
cot(#Column#) as <<NewCotColumn>>
New column name: NewCotColumn
Note
The Cot function can only be applied to Numerical datatype columns
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored
Example:
cot(#NumericalColumn#) as <<NewCotColumn>>
Assuming we have a numerical column with angles and we want to convert it into cotangent value we use this function. A value of 6 will be converted into -3.436353.
Crc32( ): The Crc32 function function calculates a 32-bit CRC (cyclic redundancy checksum) for a string. This function can be used to validate data integrity
Syntax:
crc32(#Column#) as <<NewCrcColumn>>
New column name: NewCrcColumn
Note
The Crc32 function can only be applied to Categorical datatype columns
Example:
crc32(#CategoricalColumn#) as <<NewCrcColumn>>
Assuming we have a categorical column and we want to convert it into 32-bit CRC value we use this function. A string word “France” will be converted into 1493835906.
ceil() : Ceiling value of a number returns the greatest integer, greater than or equal to the number. The ceil of number 2.5 will generate the value 3. The ceil of the value -2.5 will generate the value -2.
Syntax:
ceil(#column1#) AS <<NewCeilColumn>>
New column name: Newceilcolumn
Note
The ceil value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
Example:
ceil(#NewBornWeight#) AS <<NewCeilColumn>>
Where (#NewBornWeight #) include following values 2.2, 3.5, 2.8, , 2.7
After applying ceil(#NewBornWeight #) the values will change to 3, 4, 3, , 3
Date_trunc ( ): The function returns date with the time portion of the day truncated to the unit specified by the format model
Format should be one of [“YEAR”, “YYYY”, “YY”, “MON”, “MONTH”, “MM”, “DAY”, “DD”, “HOUR”, “MINUTE”, “SECOND”, “MILLISECOND”, “MICROSECOND”, “WEEK”, “QUARTER”]
Syntax:
date_trunc('format',#DateColumn#) as <<NewTruncColumn>>
delete( ): Deleting string character from the 2nd to 5th position in the Column
Syntax:
delete(#column1#, 2, 5) AS <<NewColumnDel>>
New column name: NewColumn
Note
Applies to Categorical data type and the new column which is created is also Categorical type.
Example:
delete(#column1#, 1, 6) AS <<NewColumnDel>>
The characters in column1 from the 1st to the 6th position are deleted. If the data in column1 is “Hello World” and “Great Day”, after applying the formula the new field will have the values “World” and “Day”.
effect( ): The EFFECT Function returns the effective annual interest rate, from the nominal annual interest rate (APR), and the number of compounding periods per year.
The arguments can be explained as:
Nominal - Annual interest rate, also known as APR (Annual Percentage Rate)
Npery - Number of periods that interest is calculated
Syntax:
EFFECT(#nominal_rate#, #npery#) AS <<Newcol_effect>>
exp( ): The exp() function allows users to calculate the exponential value with the base set to e.
e is a mathematical constant with a value approximately equal to 2.71828, also called Euler's number.
Syntax:
exp(#column1#) AS <<NewColumnExpo>>
New column name: NewColumn
Note
A - e is a Mathematical constant, with a value approximately equal to 2.71828
B - Applies to Numerical data type and the new column which is created is also Numerical type.
Example:
exp(#column1#) AS <<NewColumnExpo>>
Assuming the numerical value in the #column1# is 2 then exp(2) means 2.7182 to the power of 2
Expm1 ( ): The expm1 function, (e ^ x) - 1, returns e raised to the given power argument, minus 1.0, where e is the base of the natural logarithms (Euler's number, approximately 2.71828)
Syntax:
expm1(#Column#) as <<NewExpm1Column>>
New column name: NewExpm1Column
Example:
expm1(#NumericalColumn#) AS <<NewExpm1Column>>
Assuming the numerical value in the #column# is 3 then expm1 result will be 19.08554
A - Applies to Numerical data type and the new column which is created is also Numerical type.
B - If the value is non-integer, then the factorial will change the number to integer value and then apply factorial on integer.
Example:
factorial(#column1#) AS <<NewColumnFact>>
Assuming the numerical value in the #column1# is 5 then factorial(5) means 5 x 4 x 3 x 2 x 1 = 120
If the value in the column is non-integer then factorial(4.3) means 4 x 3 x 2 x 1 = 24
find( ): Finds the given word within the field, and returns True if it finds the word, and False if not.
Syntax:
find(#marital_status#,'Married') AS <<NewColumnFind>>
The first word is the field name within #. Example #marital_status#
The second is the text / word within quotes that needs to be found. Example ‘Married’
New column name: NewColumn
Note
A - Applies to Categorical data type; the new column which is created is Categorical type and the output is True or False.
B - If the word exists only partially, it will not return True.
C - The word being searched for is case sensitive.
Example:
find(#marital_status#,'Married') AS <<NewColumnFind>>
If the word ‘Married’ is found in the field, it returns True, else False.
floor( ) : Floor value of a number returns the largest integer value less than or equal to the specified number. The floor of number 2.5 will generate the value 2. The floor of the value -2.5 will generate the value -3.
Syntax:
floor(#column1#) AS <<Newfloorcolumn>>
New column name: Newfloorcolumn
Note
The floor value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
Example:
floor(#NewBornWeight#) AS <<Newfloorcolumn>>
Where (#NewBornWeight#) includes the following values 2.2, 3.5, 2.8, , 2.7
After applying floor(#NewBornWeight#) the values will change to 2, 3, 2, , 2
Format_number ( ): The Format number function formats a number to a format like “#,###,###.##”, rounded to a specified number of decimal places, then it returns the result as a string
Syntax:
format_number(#Column#, number of digits to be rounded to) as <<NewNumForColumn>>
New column name: NewNumForColumn
Example:
format_number(#Distance#, 2) as <<NewNumForColumn>>
Where distance is 153.4625 the result string will be 153.46.
From_utc_timestamp ( ): The function interprets the given date/time value as UTC and converts it to the specified time zone, returning the result as a timestamp.
Syntax:
from_utc_timestamp(#Datecolumn#, 'timezone') as <<NewTimeColumn>>
fv( ): The FV function is a financial function that returns the future value of an investment. We can use the FV function to get the future value of an investment assuming periodic, constant payments with a constant interest rate.
Rate - Interest rate per compound period
N - The total number of payment periods
Pmt - The payment per period
Pv - The present value of the investment
Type - When payments are deposited
end = End of each period
begin = Beginning of each period
Syntax:
fv(#Annual_Interest_Rate#, #Number_of_Periods#, #Amount_of_Payments#, #Present_value#, 'end') as <<Newcol_fv>>
New column name: Newcol_fv
fv(#int_rate#, #open_acc#, #Par#, #pv#) as <<new_fv>>
geo_dist_kms(#Lati_A#, #Lati_B#, #Longi_A#, #Longi_B#) as <<NewGDKColumn>>
Where the selected Latitude A is 38.631913, Longitude A is -121.434879, Latitude B is 38.502519 and Longitude B is -121.420769 the calculated distance is 14.44016 Kms
hex( ): Converts decimal numbers (base 10) to hexa decimal value (base 16). As a base-16 numeral system, it uses 16 symbols. These are the 10 decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and the first six letters of the English alphabet (A, B, C, D, E, F). The letters are used in order to represent the values 10, 11, 12, 13, 14 and 15 each in one single symbol.
Syntax:
hex(#column1#) AS <<NewColumnHex>>
New column name: NewColumn
Note
Applies to numerical data type and the new column which is created is Categorical type.
Example:
hex(#column1#) AS <<NewColumnHex>>
Converts the decimal numeral system to hexa decimal numeral system.
If Else And: The if/else statement executes a code if a specified condition is true. If the condition is false, another code is executed.
Syntax:
if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <<NewAgeColumn2If>>
New column name: NewAgeColumn2
Note
A Applies to Categorical, Numerical, Text or Date data type and the new column which is created is either Categorical or Numerical type.
B The column name is case sensitive and the existing columns should be defined within #.
C The And operator is also used to join multiple conditions.
Example:
if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <<NewAgeColumn2If>>
In the above example a new column NewAgeColumn2If is created from the age column: age values between 0 and 18 are classified as ‘Child’, ages from 19 to 50 as ‘Adult’, ages from 51 to 70 as ‘Mature’, and the rest as ‘Old’.
Initcap ( ): The function returns the expression with the first letter of each word in uppercase and the rest of the letters in lowercase. Words are delimited by white space.
Syntax:
initcap(#Column#) as <<NewInitCapColumn>>
New column name: NewInitCapColumn
Example:
initcap(#Gender#) as <<NewInitCapColumn>>
If values such as ‘male’ and ‘female’ start with a lowercase letter, the new column after applying the function will contain ‘Male’ and ‘Female’.
insert( ): Inserts a word at the specified position in the field.
Syntax:
insert(#column#, position,'_WordToInsert') as <<NewColumnIns>>
The first part is the field name within #, where the word needs to be inserted, for example #occupation#.
The second part is the number giving the position at which the word will be inserted, for example 4.
The third part is the text/word in single quotes that needs to be inserted, for example ‘_SERVICE’.
New column name: NewColumnIns
Note
A Applies to Categorical and Text data type and the new column which is created is Categorical type.
B If the field does not have any word in the field the insert function will not insert the word.
Example:
insert(#occupation#,4,'_SERVICE') as <<NewColumnIns>>
The word ‘_SERVICE’ is inserted after the 4th position of the characters found in the field #occupation#.
Instr ( ): The function returns the position of the first occurrence of a string in another string
Syntax:
instr(#Column#, 'substr') as <<NewInstrColumn>>
New column name: NewInstrColumn
Example:
instr(#Name#, 'Mr.') as <<NewInstrColumn>>
We are searching for the string ‘Mr.’ in the name column and the function returns the position where the string was found. If the string is not there, the function returns ‘0’.
Inverse_sign ( ): The function returns the negated values i.e. positive to negative and vice versa of given numerical column
Syntax:
inverse_sign(#Column#) as <<NewInvSignColumn>>
New column name: NewInvSignColumn
Example:
inverse_sign(#Temperature#) as <<NewInvSignColumn>>
In a selected ‘Temperature’ column we will have values like 20, 7, -6 and -18 and after the transformation the function will return values -20, -7, 6 and 18.
is_text( ): Conditional statement that checks whether each field in the given column is text. Returns True or False. Condition: a string is considered text if its length is greater than 50 characters.
Syntax:
is_text(#column#) AS <<Newcol_istext>>
New column name: Newcol_istext
Note
Applies for both Numerical and Categorical datatypes to check whether the value is Text or Not
Example:
is_text(#age#) as <<istext_col>>
Checks whether the string in the field is text and returns True or False.
Isnan ( ): The function determines whether a value is NaN (Not-a-Number). This function returns true if the value equates to NaN. Otherwise it returns false
Syntax:
isnan(#Column#) as <<NewIsNaNColumn>>
New column name: NewIsNaNColumn
Example:
isnan(#PasswordColumn#) as <<NewIsNaNColumn>>
We are checking if a value in a selected “PasswordColumn” is Not A Number - if we have value ‘567’ the function will return ‘false’, as the expression is a number, if we have value ‘blue’ then function will return ‘true’ as the expression is not a number.
Isnull ( ): The function returns 1 or 0 depending on whether an expression is NULL.
If expression is NULL, this function returns 1 as true. If not, it returns 0 as false.
Syntax:
isnull(#Column#) as <<NewIsNullColumn>>
New column name: NewIsNullColumn
Example:
isnull(#Column#) as <<NewIsNullColumn>>
We are checking if a value in the selected “Column” is NULL. If the value in the column is ‘1’ the function will return 0 (false), as the value is not NULL.
Is not null ( ): The function returns 1 or 0 depending on whether an expression is NULL.
If expression is not NULL, this function returns 1 as true. If it is NULL, it returns 0 as false.
Syntax:
isnotnull(#Column#) as <<NewIsNotNullColumn>>
New column name: NewIsNotNullColumn
Example:
isnotnull(#Column#) as <<NewIsNotNullColumn>>
We are checking if a value in the selected “Column” is NOT NULL. If the value in the column is ‘1’ the function will return 1 (true), as the value is not NULL.
Least ( ): The function returns the smallest value from the list of arguments
Syntax:
least(#Column1#, #Column2#, ..., #ColumnN#) as <<NewLeastColumn>>
New column name: NewLeastColumn
Example:
least(#Balance#, #EstimatedSalary#) as <<NewLeastColumn>>
If we have numerical columns with multiple values, the function will select the smallest value. If one column contains a value of 14,567 and the second column a value of 6,553, the function will return 6,553 as the least value.
Levenshtein ( ): The function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2
Syntax:
levenshtein(#Column1#, #Column2#) as <<NewLevenColumn>>
New column name: NewLevenColumn
Example:
levenshtein(#Password1#, #Password2#) as <<NewLevenColumn>>
If one value is “tomato” and second one is “mate” the Levenshtein distance is 3 - as we need to replace, insert or delete 3 characters to transform “tomato” into “mate”.
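For reference, a minimal sketch of the standard dynamic-programming Levenshtein distance in Python (not SEDGE's internal implementation) reproduces the example:

# Minimal dynamic-programming sketch of the Levenshtein distance.
def levenshtein(s1, s2):
    prev = list(range(len(s2) + 1))               # distances for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("tomato", "mate"))   # 3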
Locate ( ): The function identifies the position of the first occurrence of a substring in a string. If the substring is not found in the original string, the function returns 0. The function ignores upper or lower cases
Syntax:
locate('substr',#Column#) as <<NewLocColumn>>
New column name: NewLocColumn
Example:
locate('r',#Surname#) as <<NewLocColumn>>
We are searching for the expression “r” in the surname column. Surname “Nobe” returns 0 as it does not contain an “r”. Surname “Taylor” returns “6” as the expression “r” is at the 6th position.
log( ): Gives the logarithmic value of a number.
A natural log value is the logarithm to the base of the mathematical constant e, where e is an irrational and transcendental number approximately equal to 2.71828.
Similarly, the user can use any base, such as 2 or 10, to calculate the logarithmic value.
Syntax:
log(Basevalue,#column1#) AS <<NewColumnLog>>
New column name: NewColumnLog
Note
Applies to Numerical data type and the new column which is created is Numerical type.
Example:
log('e',#column1#) AS <<NewColumn1>>
Example:
log(2,#column1#) AS <<NewColumn2>>
Performing a logarithmic transformation helps with data normality, especially if the data is skewed. The log transformation can be used to make highly skewed distributions less skewed. Log transformation is important as it makes patterns in the data more interpretable.
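As a small illustration of this effect outside SEDGE (using NumPy and SciPy; the sample values below are made up):

import numpy as np
from scipy.stats import skew

# Made-up, right-skewed sample; the log compresses the large values.
x = np.array([1, 2, 2, 3, 4, 5, 8, 20, 60, 300], dtype=float)
print(round(skew(x), 2))            # strongly positive skew
print(round(skew(np.log(x)), 2))    # noticeably smaller skew after the log transform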
Md5 ( ): The Message Digest Algorithm 5 (MD5) is a cryptographic hash algorithm that is used to create a 128-bit string value from an arbitrary length string
Syntax:
md5(#Column#) as <<NewMd5Column>>
New column name: NewMd5Column
Example:
md5(#France#) as <<NewMd5Column>>
The function will return 0309a6c666a7a803fdb9db95de71cf01 as a result.
Months_between ( ): The function returns number of months between date1 and date2. If date1 is later than date2, then the result is positive. If date1 and date2 are on the same day of month, or both are the last day of month, time of the day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits. If we mention ‘false’ in the syntax, the number will not be rounded.
Syntax:
months_between(#EndDatecolumn#, #StartDatecolumn#, false) as <<NewMBColumn>>
New column name: NewMBColumn
Example:
months_between(#2003/01/01#, #2003/03/14#, false) as <<NewMBColumn>>
The above mentioned date difference will be calculated as -2.41935483870968.
offday_count( ): Returns the number of off days (holidays and weekends) between two given date columns, based on the country.
Syntax:
offday_count(#column1#,#column2#, 'countrycode') as <<Newcol_offdaycount>>
New column name: Newcol_offdaycount
For the list of country codes, click on the link below: List of country codes
Note
The offday_count function is only applied to date columns.
The Countrycode field should contain valid codes.
Example:
offday_count(#DOB#, #Date of Hire#, 'US') as <<offdaycount>>
The above example returns the number of off days (holidays and weekends) between the two given date columns, based on the country (US).
pmt( ): The PMT Function returns the regular and constant repayments for a loan or mortgage required to reduce the balance to zero, or an amount you specify. It’s based on a constant interest rate.
Syntax:
pmt(#Annual_Interest_Rate#, #Number_of_Payments#, #Loan_amount#, #Final_balance#, 'end') as <<Newcol_pmt>>
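As an illustration of the calculation behind this function, here is a minimal Python sketch of the standard annuity-payment formula. The sign convention used below is an assumption for the sketch, not SEDGE's documented behaviour.

# Minimal sketch of the standard loan-payment (annuity) formula.
def pmt(rate, nper, pv, fv=0.0, when='end'):
    """Constant per-period payment that brings pv down to fv after nper periods."""
    if rate == 0:
        return -(pv + fv) / nper
    factor = (1 + rate) ** nper
    adjust = 1 + rate if when == 'begin' else 1
    return -(pv * factor + fv) * rate / (adjust * (factor - 1))

# Example: 1% per period, 36 periods, a 10,000 loan repaid to zero
print(round(pmt(0.01, 36, 10000), 2))   # about -332.14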
pow( ): The power (or exponent) of a number is the number of times to use the number in a multiplication
Syntax:
pow(#age#,3) as <<power_age_column>>
New column name: power_age_column
Note
A Applies to Numerical data type and the new column which is created is Numerical type.
B The power transformation is used when the numbers are clustered too close to each other. Power Transformer is used to make the data distribution more-Gaussian (Normal).
Example:
pow(#age#,3) as <<power_age_column>>
The power (or exponent) of a number is the number of times to use the number in a multiplication. In the example above the power is raised to 3, that is multiplying the number in the field three times.
Proper( ): The Proper Case converts categorical and text values to upper case for the first character of each word and to lower case for subsequent characters in the word.
Syntax:
Proper(#column1#) AS <<NewProperColumn>>
New column name: NewProperColumn
Note
A Applies to categorical and text data type and the new column which is created is Categorical type.
B Does not apply to Numerical and Date data type columns
Example:
Proper(#column1#) AS <<NewProperColumn>>
Converts characters to proper case (the first letter is capitalised, the rest are lower case).
radians( ): Converts angles in degrees to radians. 1 degree is equivalent to (π/180) radians. Multiply the number of degrees by π/180 to convert it to radians.
Syntax:
radians(#angles#) AS <<NewRadiansColumn>>
New column name: NewRadiansColumn
Note
Applies to Numerical data type and the new column which is created is Numerical type.
Rand ( ): The function returns a random number between 0 (inclusive) and 1 (exclusive). If seed is specified, it returns a repeatable sequence of random numbers. If seed is not specified, it returns a completely random number
Randn ( ): The function returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution
reci( ): The reciprocal transformation is defined as the transformation of x to 1/x. The transformation reverses the order of values with the same sign.
Syntax:
reci(#column1#) AS <<NewReciprocalColumn>>
New column name: NewReciprocalColumn
Note
A Applies to Numerical data type and the new column which is created is Numerical type.
B The transformation can only be used for non-zero values and also values should not be missing.
C If the standard deviation is proportional to the mean squared, a reciprocal transformation can be performed.
D Transformation helps to improve interpretability. If the fuel economy of cars is being compared, there are two ways of looking at fuel consumption: (a) litres per kilometre or (b) kilometres per litre. The reciprocal of ‘litres per kilometre’ is ‘kilometres per litre’.
Example:
reci(#column1#) AS <<NewReciprocalColumn>>
The reciprocal transformation is defined as the transformation of x to 1/x.
(Image: converting MilesPerGallon to GallonsPerMile using the reciprocal transformation)
rint( ): The rint function returns a floating-point value that represents the nearest integer to x. Halfway values are rounded according to the current setting of the floating-point rounding mode.
running_total( ): This function returns a sequence of partial sums of given column which means displaying the summation of the data as it grows over time.
Syntax:
running_total(#column#) as <<Newcol_total>>
New column name: Newcol_total
Note
The running_total function is only applied for numerical columns.
Example:
running_total(#Age#) as <<total_age>>
The above example returns the running total of the Age column.
Sha1( ): The function calculates an SHA-1 160-bit checksum for a string. The function returns a value as a binary string of 40 hex digits. In case the string supplied as the argument is NULL, the function returns NULL.
Syntax:
sha1(#Column#) as <<NewSha1Column>>
New column name: NewSha1Column
Example:
sha1(#Sedge#) as <<NewSha1Column>>
The function will return fca21fee4d117ec70a95dbee415c01e5c6e86df3
Sha2( ): The function uses the SHA2 cryptographic hash function to convert a variable-length string into a character string. The character string is a text representation of the hexadecimal value of the checksum with the specified number of bits
Syntax:
sha2(#Column#, bitLength) as <<NewSha2Column>>
New column name: NewSha2Column
Note
String - A variable-length string
Integer - The number of bits in the hash functions. Valid values are 0 (same as 256), 224, 256, 384, and 512
Sign( ): If the value in the column is greater than 0 the function returns 1, if the value is less than 0 it returns -1, and if the value is 0 it returns 0.
Syntax:
Sign(#MPG#) as <<NewColSign>>
New column name: NewColSign
Note
A Applies to Numerical data type and the new column which is created is Numerical type.
Example:
Sign(#temperature#) as <<NewColSign>>
If the value in the column temperature is 36, the new column (NewColSign) will have the value 1; if the value is -25, the new column will have the value -1; and if the value is 0, the new column will have the value 0.
step_delete( ): Deletes single characters at a regular step interval between a start position and an end position in the field.
Syntax:
step_delete(#Occupation#,1,10,2) as <<NewColSkipDelete>>
New column name: NewColSkipDelete
Note
A Applies to Categorical and text data type and the resulting column is categorical data type.
B Does not apply to numerical data type.
Example:
step_delete(#Occupation#,1,10,2) as <<NewColSkipDelete>>
In the above example, data in the field Occupation is selected. The step delete is applied from the 1st character to the 10th character, skipping 2 characters between deletions.
For example, if the word is ‘Life is beautiful’, the function works as shown in the image below.
standardize( ): Data standardization is the process of rescaling the data from the given numerical column to have a mean of 0 and a standard deviation of 1 (unit variance). The function creates a new standardized column
Syntax:
standardize(#Column#) as <<NewStandColumn>>
New column name: NewStandColumn
Note
The standardize function is only applied to Numerical datatype columns
Example:
standardize(#age#) as <<NewStandColumn>>
Returns the standardized column for given numerical column
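Standardization is the usual z-score transform. As a minimal sketch outside SEDGE (assuming the population standard deviation is used):

import numpy as np

# Minimal z-score sketch: rescale to mean 0 and standard deviation 1.
def standardize(values):
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()

ages = [22, 35, 58, 26, 40]
print(np.round(standardize(ages), 2))   # the result has mean ~0 and std ~1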
To_timestamp( ): The function converts a character datatype to a value of TIMESTAMP datatype. The Timestamp_format_pattern specifies the format of the character value.
Syntax:
to_timestamp(#Column#, 'Timestamp_format_pattern') as <<NewToTimeColumn>>
To_unix_timestamp( ): The function returns the UNIX timestamp of the given time column. The unix time stamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch on January 1st, 1970 at UTC
Syntax:
to_unix_timestamp(#Column#, 'Date/time_format_pattern') as <<NewUnixColumn>>
unix_timestamp( ): The function returns the UNIX timestamp of current time. The unix time stamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch on January 1st, 1970 at UTC
Syntax:
unix_timestamp() as <<NewUnixColumn>>
New column name: NewUnixColumn
Example:
unix_timestamp() as <<NewUnixColumn>>
Above example performed on Mon Jun 07 2021 14:16:31 GMT will return Unix Timestamp 1623075391
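The same conversion can be reproduced outside SEDGE; a minimal Python sketch (the timestamp below matches the example above):

from datetime import datetime, timezone

# Seconds elapsed since the Unix Epoch (1970-01-01 00:00:00 UTC).
dt = datetime(2021, 6, 7, 14, 16, 31, tzinfo=timezone.utc)
print(int(dt.timestamp()))   # 1623075391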
unit_converter( ): Converts a number from one unit to another. The available units for conversion are listed below.
Syntax:
unit_converter(#temperature#, 'from_unit', 'to_unit') AS <<Newconvertedcolumn>>
New column name: Newconvertedcolumn
Note
The from_unit and the to_unit refer to the unit, such as ‘C’ for Celsius or ‘F’ for Fahrenheit.
The UNIT_CONVERTER value can only be applied to Numerical datatype columns.
The new column name should not contain any special characters such as !@#$%^&*()-+= space
If there are any missing rows in the numerical column, then the row will be ignored.
The user can enter the from_unit and the to_unit in upper case or lower case.
weekend_count( ): Returns the number of weekend days between two given date columns, based on the country.
Syntax:
weekend_count(#column1#,#column2#, 'countrycode') as <<Newcol_weekendcount>>
New column name: Newcol_weekendcount
For the list of country codes, click on the link below: List of country codes
Note
The weekend_count function is only applied to date columns.
The Countrycode field should contain valid codes.
Example:
weekend_count(#DOB#, #Date of Hire#, 'US') as <<weekendcount>>
The above example returns the number of weekend days between the two given date columns, based on the country (US).
workday_count( ): Counts the number of workdays between two given date columns, based on the country.
Syntax:
workday_count(#column1#,#column2#, 'countrycode') as <<Newcol_workdaycount>>
New column name: Newcol_workdaycount
For the list of country codes, click on the link below: List of country codes
Note
The workday_count function is only applied to date columns.
The Countrycode field should contain valid codes.
Example:
workday_count(#DOB#, #Date of Hire#, 'US') as <<workdaycount>>
The above example returns the number of workdays between the two given date columns, based on the country (US).
Cluster analysis is a technique to group similar observations into a number of clusters based on the observed values of several variables for each individual. It is an exploratory data analysis technique that allows us to analyze the multivariate data sets.
Bisecting K-Means is a hybrid approach of the K-Means algorithm to produce partitional/hierarchical clustering. It splits one cluster into two sub-clusters at each bisecting step (using k-means) until k clusters are obtained.
The Gaussian mixture model is defined as a clustering algorithm that is used to discover the underlying groups of data. Gaussian mixture models have a higher chance of finding the right number of clusters in the data compared to k-means.
Elbow Method is an empirical method to find the optimal number of clusters for a dataset. In this method, we pick a range of candidate values of k, then apply K-Means clustering using each of the values of k. Find the average distance of each point in a cluster to its centroid, and represent it in a plot. Pick the value of k, where the average distance falls suddenly.
The silhouette method is also used to find the optimal number of clusters and to interpret and validate the consistency within clusters of data. It computes a silhouette coefficient for each point, measuring how similar a point is to its own cluster compared to other clusters, and provides a succinct graphical representation of how well each object has been classified.
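As an illustration of both methods outside SEDGE, here is a minimal sketch using scikit-learn on made-up data (SEDGE performs the equivalent steps internally):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up 2-D data with three obvious groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 4, 8)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: the inertia (within-cluster sum of squares) drops sharply up to the true k.
    # Silhouette method: the silhouette score peaks for the best-separated clustering.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 2))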
Graphs are used to display large amounts of numerical data and to represent the relationships between numerical values of different variables. They can also be used to derive quantitative relationships between variables.
We have chosen age as an example for our bar chart. We can point the mouse at any part of the bar and the system will tell us which point we are currently standing on - the X axis tells us the location of the mouse and the Y axis tells us the total count in the bar. We can also zoom in or out using the grey tool in the upper part of the graph (highlighted in green).
The “Group by” tool and the “Aggregation” tool work with categorical values.
Note
We can also select the “Use Distinct” tool - which represents percentage of each category in the pie irrespective of the count in each category.
The stack bar chart extends the standard bar chart from looking at numeric values across one categorical variable to two, where each bar is divided into a number of sub-bars.
In this Stack Bar chart we have selected two factors - “port of embarkment” and “survived”. We can see that each port of embarkment - 0, 1 and 2 - is divided into two colours - 1 showing those who survived and 0 showing the opposite. By pointing the mouse at any section of the bar we can see the total number of cases in the group.
A Box Plot graph depicts groups of numerical data through their quartiles. Box plots also have lines extending from the boxes (whiskers), indicating variability outside the upper and lower quartiles; these are the outlier values.
We used age as an example; we can see that the age of passengers was mostly between 22 and 35 years, with other values represented as outliers. By pointing at the graph we can also see the Open, Low, High and Close values.
Pie charts are beneficial in expressing data and information in terms of proportions. A pie chart is most effective when dealing with the collection of data. It is constructed by clustering all the required data into a circular constructed frame wherein the data are depicted in slices.
For our example we used the number of meals and we can see the number was between 1 and 9, each part of the pie is also represented by the percentage and after pointing at each section we can also see the total number of cases
A bubble chart is a type of chart that displays three dimensions of data. Each entity with its triplet of associated values is plotted as a disk that expresses two of the values through the disk’s xy location and the third through its size.
We used Age and Fare for our example and we can now see how the values are spread in the graph.
We can also add another factor to our chart, which represents the “Size” - we selected “Fare”, and we can see that the bubble size corresponds to the increasing fare.
We have another feature that can help us represent the values effectively - we can add another factor and represent it in colour, in our case we have added “Embarkment point” so we can see the results also differentiated by the point of embarkment
The depiction addresses each numerical variable’s correlation against all other variables. The bigger the connection, the better the correlation between the variables.
Scatter graph uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.
For our demonstration we have used three different numerical variables - Age as the target on X axis and Fare and Total meal cost on the Y axis. We can even add more numerical variables and all will be displayed in one chart.
Parallel coordinates are a common way of visualising and analysing high-dimensional datasets. It is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order.
In multivariate statistics and probability theory, the scatter matrix is a statistic that is used to make estimates of the covariance matrix, for instance of the multivariate normal distribution.
On the X axis we need to use a date variable, in our case “Ticket purchase” and on the Y axis we used “Survived”, for aggregation we have selected Average value (we have also option of Count, Sum, Average, Minimum, Maximum, Median, Standard Deviation and Variance). From our graph we can see, that everybody who purchased the ticket on April 1st survived.
Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as outlier detection, it’s an important step in data analysis, as it removes erroneous or inaccurate observations which might otherwise skew conclusions.
Deviation analysis measures the absolute difference between any one number in a set and the mean of the set. It can be used to determine how a specification will behave in the face of such deviations.
Decision tree analysis is the process of drawing a decision tree, which is a graphic representation of various alternative solutions that are available to solve a given problem, in order to determine the most effective courses of action. Decision trees are comprised of nodes and branches. Nodes represent a test on an attribute and branches represent potential alternative outcomes.
After completing our data manipulation and graphical analysis, we can move to the last part, which is the Predictive Analytics. In the “Statistics” part, you need to select all the columns that are important for your research and ignore those that have little or no relation to the target. From the beginning, our target for the research is the “Survived” field. A very important step in this research was to convert the field “Survived” from “Numerical” to “Categorical” type, as keeping it as “Numerical” would result in very low accuracy.
Balancing
In the following step we can choose the type of balancing - “None”, “Down sampling”, “SMOTE” or “Up sampling” - in our example we selected “None”. Sampling is used to resolve class imbalance, because the imbalance in the observed classes might negatively impact our model. Machine learning algorithms have trouble learning when one class dominates the other, so balancing the data helps us avoid these problems; the options are listed below, followed by a short illustrative sketch.
None - no sampling of data performed, data stays as it is the dataset.
Down Sampling - the system will randomly subset all the classes in the data set so that their class frequencies match the least common class
SMOTE - Synthetic Minority Oversampling Technique - this technique synthesises new minority instances between already existing minority instances
Up Sampling - the system will randomly sample and replace the minority class to be the same size as the majority class
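A minimal sketch of these resampling options outside SEDGE, using scikit-learn and the imbalanced-learn package (the dataset below is a placeholder):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Placeholder imbalanced dataset (roughly 90% / 10% classes).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_down, y_down = RandomUnderSampler(random_state=0).fit_resample(X, y)   # down sampling
X_up, y_up = RandomOverSampler(random_state=0).fit_resample(X, y)        # up sampling
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)              # synthetic minority samples
print(len(y), len(y_down), len(y_up), len(y_smote))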
The next step is to select which algorithms should be used, the ratio (the default is 80:20, meaning 80% of the sample is used for training and 20% for cross validation) and the cross validation number (how many groups the cross validation sample should be split into). Then click on the “Start learning” button and the system will run these algorithms.
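Outside SEDGE, the same 80:20 split and k-fold cross validation can be sketched with scikit-learn (the dataset and the model below are placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data

# 80% of the sample for training, 20% held out for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
# 5-fold cross validation on the training portion.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())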
Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes, and leaves.
The random forest is a classification algorithm consisting of many decisions trees. It is an ensemble method that is better than a single decision tree as it reduces over-fitting by averaging the result. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by the committee is more accurate than that of any individual tree.
eXtreme Gradient Boosting, or XGBoost, is another popular boosting algorithm. In fact, XGBoost is simply an improved version of the GBM algorithm. The working procedure of XGBoost is the same as GBM: the trees in XGBoost are built sequentially. XGBoost also includes a variety of regularization techniques that reduce overfitting and improve overall performance.
The Naive Bayes is a classification algorithm that is suitable for binary and multiclass classification. Naïve Bayes performs well in cases of categorical input variables compared to numerical variables. It is useful for making predictions and forecasting data based on historical results.
KNN is a non-parametric method used for classification. It is one of the best-known classification algorithms in that known data are arranged in a space defined by the selected features. When new data is supplied to the algorithm, the algorithm will compare the classes of the k closest data to determine the class of the new data.
Support vector machine algorithm helps to find a hyperplane in N-dimensional space(N — the number of features) that distinctly classifies the data points.
A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers.
Light Gradient Boosted Machine, or LightGBM for short, is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm. It can be used for datasets with more than 10,000 rows, although there is no fixed threshold for deciding when to use LightGBM. It is suitable for large volumes of data, especially when one needs to achieve high accuracy.
We can see the status of each algorithm; some of these are more complicated and will take more time. After all reach 100% completion, the algorithms are sorted by their accuracy.
In our case, we can see 7 algorithms were used.
XGBOOST - decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework - with this algorithm we got an accuracy of 85%.
Decision tree
GBM - Gradient boosting
Random forest
Logistic regression
Naive Bayes
KNN - K nearest neighbors
We can also see further details of the algorithm - after clicking on the “Metrices” button.
Classification report is used to measure the quality of predictions from a classification algorithm. The metrics are calculated by using true and false positives, true and false negatives.
Confusion matrix, also known as an error matrix, is a specific table layout that allows visualisation of the performance of an algorithm. It shows the “True Positives”, “True Negatives”, “False positives” and “False Negatives”.
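For reference, the same two summaries can be produced outside SEDGE with scikit-learn (the label vectors below are made up):

from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and model predictions for a binary target (e.g. Survived).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))        # rows: actual class, columns: predicted class
print(classification_report(y_true, y_pred))   # precision, recall and f1-score per class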
Cumulative Gain charts are used for visually measuring the effectiveness of a classification model and are a very popular metric in predictive analytics. Gain charts measure the performance for a proportion of the population data, calculated as the ratio between the results obtained without the model (the random line) and with the model (the model line). The greater the area between these two lines, the better the model. The Y-axis shows the target data and the X-axis shows the percentage of the population.
In the Gain Chart image, the black line is the random line, and the green and orange lines are the results based on the model for the two classes. From the figure above it can be seen that the orange class model has performed better than the green one, and both lines have performed better than the random line (black).
Lift charts are used for visually measuring the effectiveness of a classification model and are also a popular metric in predictive analytics.
The Lift Chart shows the actual lift. The lift curve is determined by measuring the ratio between the result predicted by the model and the result using no model. A lift chart graphically represents the improvement that a model offers when compared against a random guess, and it measures the change in terms of a lift score. The best model can be determined by comparing lift scores. A lift chart shows the point at which the model’s predictions are strong and when they become weak (less useful).
On the Y axis we can see the True Positive Rate - samples that were evaluated correctly; on the X axis we can see the False Positive Rate - samples evaluated incorrectly. When we plot each of the results, we get the ROC curve. We evaluate the AUC - the area under the ROC curve: if the value is near zero, the selected model is not classifying correctly; if the value is near 1, the selected model is classifying correctly. In our case the number is 0.92.
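A minimal sketch of computing the ROC curve and the AUC outside SEDGE with scikit-learn (the labels and scores below are made up):

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under the curve; closer to 1 is better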
The SEDGE application now allows data analysts to review and compare models by MAPE (mean absolute percentage error), RMSE (root mean square error), MSE (mean squared error), MAE (mean absolute error), and MLCH. This allows the analyst to save and test models for greatest accuracy.
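The same error metrics (apart from MLCH) can be computed outside SEDGE with recent versions of scikit-learn; a minimal sketch on made-up values:

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

# Made-up actual and predicted values from a regression model.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)            # MAE
mse = mean_squared_error(y_true, y_pred)             # MSE
rmse = np.sqrt(mse)                                  # RMSE
mape = mean_absolute_percentage_error(y_true, y_pred)  # MAPE
print(mae, mse, rmse, mape)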
SHAP or Shapley Additive exPlanations is a visualization technique that can be used for making a machine learning model more explainable by visualizing its output. It can be used for explaining the prediction of any model by computing the contribution of each feature to the prediction. If the SHAP value is much closer to zero, we can say that the data point contributes very little to predictions. If the SHAP value is a strong positive or strong negative value, we can say that the data point greatly contributes to predicting the positive or negative class.
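Outside SEDGE, SHAP values for a tree-based model can be computed with the shap library; a minimal sketch (the model and data below are placeholders):

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data and model; SEDGE computes the equivalent internally.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # SHAP explainer for tree-based models
shap_values = explainer.shap_values(X)    # per-feature contribution to each prediction
shap.summary_plot(shap_values, X)         # visualise which features drive the predictions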