kn_example_python_graphic_scatterplot_diagonal_reference

KNIME & Python Graphics - Scatterplot with trend line and statistics

https://forum.knime.com/t/scatter-plot-diagonal-reference-line/73484/2?u=mlauber71
https://hub.knime.com/-/spaces/-/latest/~TQl4cycTtFLs4cnA/

Exploring the Power of Python Graphics with KNIME: A Collection of Examples
https://medium.com/p/841df87b5563

    import knime.scripting.io as knio

    from io import BytesIO  # in-memory buffer for the rendered image

    import seaborn as sns  # visualization
    from scipy import stats  # statistical analysis

    # Set the "whitegrid" style and the figure size in one call.
    # A later sns.set(rc=...) call would reset the style to seaborn's default.
    sns.set_theme(style="whitegrid", rc={"figure.figsize": (16, 9)})

    # Convert the KNIME input table to a pandas DataFrame
    input_table = knio.input_tables[0].to_pandas()

    # Retrieve flow variables for plot customization (title, footnote, labels, etc.)
    var_title = knio.flow_variables['title_graphic']
    var_footnote = knio.flow_variables['footnote_graphic']

    # X-axis variable and label
    var_x_variable = knio.flow_variables['variable_x']
    var_x_label = knio.flow_variables['label_x']

    # Y-axis variable and label
    var_y_a_variable = knio.flow_variables['variable_y_a']
    var_y_a_label = knio.flow_variables['label_y_a']

    # Color of the scatter plot points
    var_colour = knio.flow_variables['v_colour']

    ##################################################
    # Reference: https://seaborn.pydata.org/examples/large_distributions.html

    # Remove rows with missing data in the x or y column so that both
    # series stay aligned and every statistic uses the same rows
    input_table = input_table.dropna(subset=[var_x_variable, var_y_a_variable])
    x_values = input_table[var_x_variable]
    y_values = input_table[var_y_a_variable]

    # Pearson correlation: linear correlation between x and y
    pearson_coef, pearson_p = stats.pearsonr(x_values, y_values)

    # Spearman correlation: monotonic (rank-based) correlation between x and y
    spearman_coef, spearman_p = stats.spearmanr(x_values, y_values)

    # Linear regression: slope, intercept, r-value, p-value and standard error
    slope, intercept, r_value, p_value, std_err = stats.linregress(x_values, y_values)

    # Annotation text: regression equation (y = mx + c), R², Pearson's r, Spearman's ρ
    # - slope (m): change in y per unit change in x
    # - intercept (c): value of y when x = 0
    # - R²: goodness of fit of the regression model
    # - p: statistical significance of the regression
    line = (f"Regression: y = {slope:.2f}x + {intercept:.2f}"
            f"\nR² = {r_value**2:.2f}, p = {p_value:.3f}"
            f"\nPearson's r: {pearson_coef:.2f} (p = {pearson_p:.3f})"
            f"\nSpearman's ρ: {spearman_coef:.2f} (p = {spearman_p:.3f})")

    # Create the scatter plot (sns.scatterplot returns a matplotlib Axes)
    g = sns.scatterplot(x=var_x_variable, y=var_y_a_variable, data=input_table,
                        color=var_colour)

    # Add the regression line to the same Axes
    sns.regplot(x=var_x_variable, y=var_y_a_variable, data=input_table,
                scatter=False, color="black", line_kws={"lw": 2}, ci=None, ax=g)

    # Annotate the plot with the regression statistics
    g.annotate(line, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=10,
               verticalalignment='top',
               bbox=dict(boxstyle="round,pad=0.3", edgecolor="black",
                         facecolor="aliceblue"))

    # Set axis labels and the plot title
    g.set(xlabel=var_x_label, ylabel=var_y_a_label, title=var_title)

    ##################################################
    # Save the calculated regression statistics in a KNIME flow variable
    knio.flow_variables['line_stats'] = line

    # Retrieve the figure and set its size again for the final output
    fig_out = g.get_figure()
    fig_out.set_size_inches(16, 9)

    # Add a footnote at the bottom of the figure
    fig_out.text(0.1, 0.025, var_footnote, fontsize=10)

    # Render the plot as SVG into an in-memory buffer
    buffer = BytesIO()
    fig_out.savefig(buffer, format='svg')
    output_image = buffer.getvalue()

    # Output the image to KNIME and assign the interactive view
    knio.output_images[0] = output_image
    knio.output_view = knio.view(fig_out)

The diamonds dataset is a popular dataset in the Seaborn library. It contains information about a large number of diamonds. Here's a breakdown of the columns in the dataset:

1. **carat**: Weight of the diamond, measured in carats. A carat is equivalent to 0.2 grams.
2. **cut**: Quality of the diamond cut, an ordinal categorical variable. It has the values:
   - **Fair**: Worst quality
   - **Good**
   - **Very Good**
   - **Premium**
   - **Ideal**: Best quality
3. **color**: Color of the diamond, also an ordinal categorical variable. The values range from:
   - **J**: Worst color
   - ...
   - **D**: Best color
4. **clarity**: A measurement of how clear the diamond is, another ordinal categorical variable. The values are:
   - **I1**: Worst clarity (inclusions are obvious under 10× magnification)
   - **SI2**: Slightly included 2
   - **SI1**: Slightly included 1
   - **VS2**: Very slightly included 2
   - **VS1**: Very slightly included 1
   - **VVS2**: Very, very slightly included 2
   - **VVS1**: Very, very slightly included 1
   - **IF**: Internally flawless, best clarity
5. **depth**: Total depth percentage, calculated as `z / mean(x, y)` (expressed in percent), where z is the depth of the diamond and x and y are the length and width. This gives an idea of the shape and proportions of the diamond.
6. **table**: Width of the diamond's top relative to its widest point, represented as a percentage.
7. **price**: Price of the diamond in US dollars.
8. **x**: Length of the diamond in millimeters.
9. **y**: Width of the diamond in millimeters.
10. **z**: Depth of the diamond in millimeters.

The diamonds dataset is often used for regression and classification tasks, as well as for data visualization exercises, because it offers a mix of numeric and categorical attributes. If you plot attributes like carat against price in a scatter plot, you'll notice a positive correlation, with larger diamonds generally being more expensive.

y = mx + c: This is the equation of a straight line, where:

- {slope:.2f} (m) is the slope of the line. It indicates the change in y (dependent variable) for a unit change in x (independent variable).
- {intercept:.2f} (c) is the y-intercept. It's the value of y when x is 0.
- R² ({r_value**2:.2f}) is the coefficient of determination. It gives the proportion of the variance in the dependent variable that is predictable from the independent variable.
- p ({p_value:.3f}) is the p-value. If it's below 0.05, it generally indicates that the observed relationship is statistically significant.
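The statistics shown in the annotation box can be tried outside KNIME with a small standalone sketch. It uses synthetic data (an assumption for illustration, not the diamonds table), and the helper name `describe_relationship` is hypothetical. It also shows that, for a simple linear regression with one predictor, R² equals the square of Pearson's r.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption for illustration): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)


def describe_relationship(x, y):
    """Hypothetical helper: compute the statistics used in the plot annotation."""
    pearson_coef, pearson_p = stats.pearsonr(x, y)
    spearman_coef, spearman_p = stats.spearmanr(x, y)
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_value ** 2,
        "p_value": p_value,
        "pearson_r": pearson_coef,
        "spearman_rho": spearman_coef,
    }


summary = describe_relationship(x, y)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Because the regression has a single predictor, `r_value` from `linregress` and the Pearson coefficient are the same quantity, so the annotation's R² carries no information beyond Pearson's r. Spearman's ρ can differ, since it only looks at ranks.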
Workflow annotations and node labels:

- locate and create /data/ folder with absolute paths
- 1920 x 1080 PNG file
- from_knime_scatterplot.png
- diamonds.parquet
- Interactive VIEW with Python graphics (right click)
- Collect Local Metadata
- Image To Table
- Renderer to Image
- Table To Image
- Image Writer (Port)
- Parquet Reader
- Python graphics interactive - Scatterplot with trend line and statistics
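The workflow labels mention a PNG file of 1920 x 1080 pixels, which the workflow produces with KNIME nodes. As a rough alternative sketch, the same pixel size can be obtained directly from matplotlib: a 16 x 9 inch figure saved at 120 dpi yields 1920 x 1080 pixels. The `dpi=120` value and the tiny sample data are assumptions for illustration and are not taken from the workflow.

```python
import struct
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# 16 x 9 inches at 120 dpi gives 1920 x 1080 pixels (dpi is an assumption)
fig, ax = plt.subplots(figsize=(16, 9))
ax.scatter([1, 2, 3], [2, 4, 5])  # placeholder data for illustration

buffer = BytesIO()
fig.savefig(buffer, format="png", dpi=120)
png_bytes = buffer.getvalue()
plt.close(fig)

# In a PNG file, the IHDR chunk stores width and height as
# big-endian 32-bit integers at byte offsets 16 to 23
width, height = struct.unpack(">II", png_bytes[16:24])
print(width, height)
```

Note that passing `bbox_inches="tight"` to `savefig` would crop the canvas and change the pixel dimensions, so it is left out here.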
