Ludwig-Maximilians-Universität München
The generalized additive model is a well-established and powerful tool for modeling smooth effects of predictors on the response. However, if the link function, which is typically chosen as the canonical link, is misspecified, substantial bias is to be expected. A procedure is proposed that simultaneously estimates the form of the link function and the unknown form of the predictor functions, including selection of predictors. The procedure is based on boosting methodology, which obtains estimates by using a sequence of weak learners. It strongly dominates fitting procedures that are unable to modify a given link function if the true link function deviates from the fixed one. The performance of the procedure is shown in simulation studies and illustrated by a real-world example.
In this paper, individual differences scaling (INDSCAL) is revisited, considering INDSCAL as being embedded within a hierarchy of individual difference scaling models. We explore the members of this family, distinguishing (i) models, (ii) the role of identification and substantive constraints, (iii) criteria for fitting models and (iv) algorithms to optimise the criteria. Model formulations may be based either on data that are in the form of proximities or on configurational matrices. In its configurational version, individual difference scaling may be formulated as a form of generalized Procrustes analysis. Algorithms are introduced for fitting the new models. An application from sensory evaluation illustrates the performance of the methods and their solutions.
We discuss two-sample global permutation tests for sets of multivariate ordinal data in possibly high-dimensional setups, motivated by the analysis of data collected by means of the World Health Organisation's International Classification of Functioning, Disability and Health. The tests do not require any modelling of the multivariate dependence structure. Specifically, we consider testing for marginal inhomogeneity and direction-independent marginal order. Max-T test statistics are known to lead to good power against alternatives with few strong individual effects. We propose test statistics that can be seen as their counterparts for alternatives with many weak individual effects. Permutation tests are valid only if the two multivariate distributions are identical under the null hypothesis. By means of simulations, we examine the practical impact of violations of this exchangeability condition. Our simulations suggest that theoretically invalid permutation tests can still be 'practically valid'. In particular, they suggest that the degree of the permutation procedure's failure may be considered as a function of the difference in group-specific covariance matrices, the ratio of the group sizes, the number of variables in the set, the test statistic used, and the number of levels per variable.
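For orientation only (a generic illustration of the two aggregation schemes, not necessarily the exact statistics proposed in the paper): with standardized per-variable two-sample statistics $T_1, \dots, T_p$, a max-type statistic aggregates them by their maximum, whereas a sum-type counterpart accumulates evidence across all variables,
\[ T_{\max} = \max_{j = 1, \dots, p} |T_j|, \qquad T_{\mathrm{sum}} = \sum_{j = 1}^{p} T_j^2, \]
so that many small effects, none of which is individually striking, can still produce a large value of the sum-type statistic.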
In linear mixed models, the assumption of normally distributed random effects is often inappropriate and unnecessarily restrictive. The proposed approximate Dirichlet process mixture assumes a hierarchical Gaussian mixture that is based on the truncated version of the stick-breaking representation of the Dirichlet process. In addition to weakening the distributional assumptions, the specification makes it possible to identify clusters of observations with a similar random effects structure. An Expectation-Maximization algorithm is given that solves the estimation problem and that, in certain respects, may exhibit advantages over Markov chain Monte Carlo approaches when modelling with Dirichlet processes. The method is evaluated in a simulation study and applied to the dynamics of unemployment in Germany as well as lung function growth data.
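For reference, the truncated stick-breaking representation used for such approximations can be written as (notation assumed here; the paper's exact specification may differ in details)
\[ G = \sum_{k=1}^{K} \pi_k\, \delta_{\theta_k}, \qquad \pi_k = V_k \prod_{l < k} (1 - V_l), \qquad V_k \sim \mathrm{Beta}(1, \alpha) \ \text{for } k < K, \quad V_K = 1, \]
with atoms $\theta_k$ drawn independently from the base measure $G_0$; setting $V_K = 1$ guarantees that the truncated weights sum to one.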
Variable selection has been suggested for Random Forests to improve their efficiency in data prediction and interpretation. However, its basic element, the variable importance measure, cannot be computed straightforwardly when there is missing data. Therefore, an extensive simulation study has been conducted to explore possible solutions (multiple imputation, complete case analysis and a newly suggested importance measure) for several missing data generating processes. The ability to distinguish relevant from non-relevant variables has been investigated for these procedures in combination with two popular variable selection methods. Findings and recommendations: complete case analysis should not be applied, as it led to inaccurate variable selection and to models with the worst prediction accuracy. Multiple imputation is a good means of selecting variables that would be of relevance in fully observed data, and it produced the best prediction accuracy. By contrast, the application of the new importance measure yields a selection of variables that reflects the actual data situation, i.e. one that takes the occurrence of missing values into account. Its prediction error was only negligibly worse than that of imputation.
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21739/1/MA_Bitterlich.pdf Bitterlich, Manuela ddc:500, Ausgewählte Abschlussarbeiten, Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21729/1/BA_Zeis.pdf Zeis, Klara ddc:500, ddc:310, Ausgewählte Abschlussarbeiten, Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21731/1/BA_Huber_Cynt.pdf Huber, Cynthia ddc:500, Ausgewählte Abschlussarbeiten, Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21736/1/BA_Hoelzl.pdf Hölzl, Andreas ddc:500, Ausgewählte Abschlussarbeiten, Statistik, Mathematik, Informatik und Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21746/1/MA_Casalicchio.pdf Casalicchio, Giuseppe ddc:500, Ausgewählte Abschlussarbeiten, Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21742/1/MA_Ernst.pdf Ernst, Dominik ddc:500, Ausgewählte Abschlussarbeiten, Statistik, Mathematik, Informatik und Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21455/1/BA_Hummrich.pdf Hummrich, Katrin
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21732/1/MA_Berger.pdf Berger, Moritz ddc:500, Ausgewählte Abschlussarbeiten, Statistik, Mathematik, Informatik und Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21738/1/MA_Manuilova.pdf Manuilova, Ekaterina ddc:500, Ausgewählte Abschlussarbeiten
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21456/1/MA_Obst.pdf Obst, Ronert ddc:500, Ausgewählte Abschlussarbeiten, Statistik, Mathematik, Informatik und Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21730/1/MA_Meingast.pdf Meingast, Maximilian ddc:500, Ausgewählte Abschlussarbeiten, Statistik
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21735/1/BA_Wenzler.pdf Wenzler, Germaine ddc:500, Ausgewählte Abschlussarbeiten
Tue, 1 Jan 2013 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21452/1/MA_Poppe.pdf Poppe, Melanie ddc:500, Ausgewählte Abschlussarbeiten, Statistik, Mathematik, Informatik und Statistik
This short note contains an explicit proof of the Dirichlet distribution being the conjugate prior to the Multinomial sample distribution as resulting from the general construction method described, e.g., in Bernardo and Smith (2000). The well-known Dirichlet-Multinomial model is thus shown to fit into the framework of canonical conjugate analysis (Bernardo and Smith 2000, Prop.~5.6, p.~273), where the update step for the prior parameters to their posterior counterparts has an especially simple structure. This structure is used, e.g., in the Imprecise Dirichlet Model (IDM) by Walley (1996), a simple yet powerful model for imprecise Bayesian inference using sets of Dirichlet priors to model vague prior knowledge, and furthermore in other imprecise probability models for inference in exponential families where sets of priors are considered.
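For concreteness, the especially simple update structure is the familiar Dirichlet-Multinomial one: with a $\mathrm{Dir}(\alpha_1, \dots, \alpha_k)$ prior on the category probabilities $\theta_1, \dots, \theta_k$ and observed counts $n_1, \dots, n_k$ from a Multinomial sample,
\[ (\theta_1, \dots, \theta_k) \mid n_1, \dots, n_k \ \sim\ \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_k + n_k), \]
i.e. the posterior parameters are obtained by simply adding the observed counts to the prior parameters; in the IDM this update is applied to every Dirichlet prior in the set.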
Mon, 9 Jul 2012 12:00:00 +0100 http://scitation.aip.org/content/aip/journal/jmp/53/9/10.1063/1.4728982 https://epub.ub.uni-muenchen.de/16210/1/Siedentop_16210.pdf Siedentop, Heinz; Maier, Thomas ddc:530, ddc:510, Mathematik, Informatik und Statistik
The use of the multinomial logit model is typically restricted to applications with few predictors, because in high-dimensional settings maximum likelihood estimates tend to deteriorate. In this paper we propose a sparsity-inducing penalty that accounts for the special structure of multinomial models. In contrast to existing methods, it penalizes the parameters that are linked to one variable in a grouped way and thus yields variable selection instead of parameter selection. We develop a proximal gradient method that efficiently computes stable estimates. In addition, the penalization is extended to the important case of predictors that vary across response categories. We apply our estimator to the modeling of party choice of voters in Germany, including voter-specific variables such as age and gender as well as party-specific features such as stance on nuclear energy and immigration.
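A sketch of the grouping idea (notation assumed here, not the paper's exact penalty): with $K$ response categories, reference category $K$, and $\beta_{jr}$ the coefficient of predictor $j$ in category $r$, all parameters belonging to the same predictor are collected in one group and penalized jointly,
\[ J(\beta) = \sum_{j=1}^{p} \sqrt{\beta_{j1}^2 + \dots + \beta_{j,K-1}^2}, \]
possibly with group-size weights, so that the whole coefficient group of a predictor is either shrunk to zero jointly (the predictor is removed from the model) or retained, which yields variable rather than parameter selection.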
A method is proposed that aims at identifying clusters of individuals that show similar patterns when observed repeatedly. We consider linear mixed models, which are widely used for the modeling of longitudinal data. In contrast to the classical assumption of a normal distribution for the random effects, a finite mixture of normal distributions is assumed. Typically, the number of mixture components is unknown and has to be chosen, ideally by data-driven tools. For this purpose, an EM algorithm-based approach is considered that uses a penalized normal mixture as the random effects distribution. The penalty term shrinks the pairwise distances of cluster centers based on the group lasso and the fused lasso method. The effect is that individuals with similar time trends are merged into the same cluster. The strength of regularization is determined by one penalization parameter. For finding the optimal penalization parameter, a new model choice criterion is proposed.
A novel point process model, continuous in space-time, is proposed for quantifying the transmission dynamics of the two most common meningococcal antigenic sequence types observed in Germany in 2002-2008. Modelling is based on the conditional intensity function (CIF), which is described by a superposition of additive and multiplicative components. As an epidemiologically interesting finding, spread behaviour was shown to depend on type in addition to age: basic reproduction numbers were 0.25 (95% CI 0.19-0.34) and 0.11 (95% CI 0.07-0.17) for types B:P1.7-2,4:F1-5 and C:P1.5,2:F3-3, respectively. Altogether, the proposed methodology represents a comprehensive and universal regression framework for the modelling, simulation and inference of self-exciting spatio-temporal point processes based on the CIF. Usability of the modelling in biometric practice is promoted by an implementation in the R package surveillance.
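As a rough sketch of the model class (the notation here is illustrative and omits covariates and details of the paper's specification), the conditional intensity of such a self-exciting spatio-temporal point process combines an additive endemic component and a multiplicative epidemic component triggered by previous cases,
\[ \lambda(s, t) = \nu(s, t) + \sum_{j:\, t_j < t} \eta_j \, f(\lVert s - s_j \rVert) \, g(t - t_j), \]
where $\nu(s, t)$ is the endemic background rate, the sum runs over earlier cases $j$ with locations $s_j$ and times $t_j$, $\eta_j$ is a case-specific infectivity given by a log-linear predictor, and $f$ and $g$ are spatial and temporal interaction kernels.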
Sun, 1 Jan 2012 12:00:00 +0100 https://epub.ub.uni-muenchen.de/25521/1/MA_Fink_Paul.pdf Fink, Paul ddc:500, Ausgewählte Abschlussarbeiten, Statistik
Security frameworks are modular, initially abstract concepts that bundle coordinated technical and organizational measures for the prevention, detection and handling of information security incidents. Unlike assembling one's own security concepts from a multitude of isolated individual measures, the application of security frameworks aims at drawing on proven solution approaches for securing complex IT services and IT architectures with comparatively little effort. Putting a security framework into practice requires its scenario-specific adaptation and implementation, which in particular must ensure seamless integration into the existing infrastructure and lay the foundation for sustainable, efficient operation. This thesis deals with the integrated management of security frameworks. At the core of its considerations are therefore not individual framework concepts, but management methods, processes and tools for the parallel use of several framework instances in complex organization-wide and cross-organizational scenarios. Its focal points are motivated on the one hand by the currently very technical character of many security frameworks and on the other hand by the lack of consideration of their life cycle beyond the scenario-specific adaptation. So far, both aspects have had an inhibiting effect on practical deployment, since implementing security frameworks still requires considerable scenario-specific conceptual effort. After a discussion of the relevant foundations of security management and the positioning of security frameworks within information security management systems, more than 50 requirements for security frameworks are derived from the perspective of their management on the basis of selected concrete scenarios, and weighted with justification. The subsequent application of this catalogue of requirements to more than 75 current security frameworks reveals typical strengths and weaknesses and motivates, in addition to concrete suggestions for improving framework concepts, the management methods specific to security frameworks that are developed subsequently. A detailed analysis of the entire life cycle of security frameworks serves as the frame of reference for all of the thesis's own concepts; it is used for the fundamental specification of management tasks, responsibilities and interfaces to other management processes. Building on this, methods and processes adapted to the use of security frameworks are specified, among others for risk management and selected disciplines of operational security management; a security management architecture for security frameworks is designed; the process interfaces are elaborated comprehensively using ISO/IEC 27001 and ITIL v3 as examples; and the use of IT security metrics for assessing security frameworks is demonstrated. The practical application of these innovative methods requires dedicated management tools, which are then designed in detail, realized as prototypes or simulations, exemplified and evaluated. A comprehensive application example demonstrates the practical, parallel use of several security frameworks together with the specified concepts and tools.
Finally, all results achieved are critically assessed, and an outlook on possible further developments and open research questions in related areas is given.
The partial area under the receiver operating characteristic curve (PAUC) is a well-established performance measure to evaluate biomarker combinations for disease classification. Because the PAUC is defined as the area under the ROC curve within a restricted interval of false positive rates, it enables practitioners to quantify sensitivity rates within pre-specified specificity ranges. This issue is of considerable importance for the development of medical screening tests. Although many authors have highlighted the importance of the PAUC, there exist only a few methods that use the PAUC as an objective function for finding optimal combinations of biomarkers. In this paper, we introduce a boosting method for deriving marker combinations that is explicitly based on the PAUC criterion. The proposed method can be applied in high-dimensional settings where the number of biomarkers exceeds the number of observations. Additionally, the proposed method incorporates a recently proposed variable selection technique (stability selection) that results in sparse prediction rules incorporating only those biomarkers that make relevant contributions to predicting the outcome of interest. Using both simulated data and real data, we demonstrate that our method performs well with respect to both variable selection and prediction accuracy. Specifically, if the focus is on a limited range of specificity values, the new method results in better predictions than other established techniques for disease classification.
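For concreteness, writing $\mathrm{ROC}(u)$ for the true positive rate as a function of the false positive rate $u$, the partial area over a pre-specified range $[0, t_0]$ of false positive rates is
\[ \mathrm{PAUC}(t_0) = \int_0^{t_0} \mathrm{ROC}(u)\, du, \]
which is sometimes standardized by its maximal possible value $t_0$.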
Sun, 1 Jan 2012 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21743/1/MA_Stuckart.pdf Stuckart, Claudia ddc:500, Ausgewählte Abschlussarbeiten, Statistik, Mathematik, Informatik und Statistik
Sun, 1 Jan 2012 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21856/1/DA_Bothmann.pdf Bothmann, Ludwig ddc:500, Ausgewählte Abschlussarbeiten, Mathematik, Informatik und Statistik
Sun, 1 Jan 2012 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21744/1/MA_Lindenlaub.pdf Lindenlaub, Christian ddc:500, Ausgewählte Abschlussarbeiten, Statistik, Mathematik, Informatik und Statistik
Sat, 1 Jan 2011 12:00:00 +0100 https://epub.ub.uni-muenchen.de/21857/1/DA_Kuehnle.pdf Kühnle, Oliver ddc:500, Ausgewählte Abschlussarbeiten, Mathematik, Informatik und Statistik
Investigating differences between means of more than two groups or experimental conditions is a routine research question addressed in biology. In order to assess differences statistically, multiple comparison procedures are applied. The most prominent procedures of this type, the Dunnett and Tukey-Kramer tests, control the probability of reporting at least one false positive result when the data are normally distributed and when the sample sizes and variances do not differ between groups. All three assumptions are unrealistic in biological research, and any violation leads to an increased number of reported false positive results. Based on a general statistical framework for simultaneous inference and robust covariance estimators, we propose a new statistical multiple comparison procedure for assessing multiple means. In contrast to the Dunnett or Tukey-Kramer tests, no assumptions regarding the distribution, sample sizes or variance homogeneity are necessary. The performance of the new procedure is assessed by means of its familywise error rate and power under different distributions. The practical merits are demonstrated by a reanalysis of fatty acid phenotypes of the bacterium Bacillus simplex from the "Evolution Canyons" I and II in Israel. The simulation results show that even under severely varying variances, the procedure controls the number of false positive findings very well. Thus, the procedure presented here works well under biologically realistic scenarios of unbalanced group sizes, non-normality and heteroscedasticity.
The case of continuous effect modifiers in varying-coefficient models has been well investigated. Categorical effect modifiers, however, have been largely neglected. In this paper a regularization technique is proposed that allows for the selection of covariates and the fusion of categories of categorical effect modifiers in a linear model. A distinction is made between nominal and ordinal variables, since for the latter more economical parametrizations are warranted. The proposed methods are illustrated and investigated in simulation studies and real-world data evaluations. Moreover, some asymptotic properties are derived. The paper is a preprint of an article that has been accepted for publication in Statistica Sinica. Please use the journal version for citation.
'How can an application for learning with lecture recordings be designed so that it supports knowledge acquisition as effectively as possible?' This was the central question of this diploma thesis; to answer it, a new prototypical learning application was implemented, building on an existing system for providing recorded lectures. To this end, the development possibilities of the 'UnterrichtsMitschau' system at LMU München were worked out in line with moderate constructivist learning theory. To ensure that the application would be accepted by student users, their wishes and ideas were elicited in a focus group discussion and incorporated into the design concept. On this basis, an application was developed whose central innovations are the addition of annotations and the possibility of cooperative learning in two different modes. In the first cooperation mode, learners exchange views on the lecture content asynchronously, i.e. with a time delay, by means of annotations. In the second, the 'synchronous cooperative mode', learners are in direct contact with each other via an audio connection and work through the lecture recording synchronously. A follow-up study with 15 potential users showed, among other things, that both cooperative modes of the new system were rated better than the previous application. In particular, the users considered the process characteristics of learning from moderate constructivist learning theory to be better supported. Furthermore, the test persons would be more likely to use the new application in their studies than the previous one.
Variable selection and model choice are of major concern in many statistical applications, especially in high-dimensional regression models. Boosting is a convenient statistical method that combines model fitting with intrinsic model selection. We investigate the impact of base-learner specification on the performance of boosting as a model selection procedure. We show that variable selection may be biased if the covariates are of different nature. Important examples are models combining continuous and categorical covariates, especially if the number of categories is large. In this case, least squares base-learners offer increased flexibility for the categorical covariate and lead to its preferential selection even if it is non-informative. Similar difficulties arise when comparing linear and nonlinear base-learners for a continuous covariate. The additional flexibility in the nonlinear base-learner again yields a preference for the more complex modeling alternative. We investigate these problems from a theoretical perspective and suggest a framework for unbiased model selection based on a general class of penalized least squares base-learners. Making all base-learners comparable in terms of their degrees of freedom strongly reduces the selection bias observed in naive boosting specifications. The importance of unbiased model selection is demonstrated in simulations and in an application to forest health models.
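One common way to make base-learners comparable (assumed here as an illustration; the paper may use a refined definition) is to fix the degrees of freedom of each penalized least squares base-learner with design matrix $X$ and penalty matrix $K$ via the trace of its hat matrix,
\[ \mathrm{df}(\lambda) = \operatorname{trace}\!\left( X \left( X^\top X + \lambda K \right)^{-1} X^\top \right), \]
and to choose the smoothing parameter $\lambda$ of every base-learner such that all base-learners share the same, small value of $\mathrm{df}$.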
Many different cluster methods are frequently used in gene expression data analysis to find groups of co-expressed genes. However, cluster algorithms with the ability to visualize the resulting clusters are usually preferred. The visualization of gene clusters gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results. In this paper recent extensions of the R package gcExplorer are presented. gcExplorer is an interactive visualization toolbox for the investigation of the overall cluster structure as well as single clusters. The different visualization options, including arbitrary node and panel functions, are described in detail. Finally, the toolbox can be used to investigate the quality of a given clustering graphically as well as theoretically by testing the association between a partition and a functional group under study. It is shown that gcExplorer is a very helpful tool for a general exploration of microarray experiments. The identification of potentially interesting gene candidates or functional groups is substantially accelerated and eased. Inferential analysis on a cluster solution is used to judge its ability to provide insight into the underlying mechanistic biology of the experiment.
Finite mixture models are routinely applied to time course microarray data. Due to the complexity and size of this type of data the choice of good starting values plays an important role. So far initialization strategies have only been investigated for data from a mixture of multivariate normal distributions. In this work several initialization procedures are evaluated for mixtures of regression models with and without random effects in an extensive simulation study on different artificial datasets. Finally these procedures are also applied to a real dataset from E. coli.
We consider the following problem: estimate the size of a population marked with serial numbers after only a sample of the serial numbers has been observed. Its simplicity in formulation and the inviting possibilities of application make this estimation problem well suited for an undergraduate level probability course. Our contribution consists in a Bayesian treatment of the problem. For an improper uniform prior distribution, we show that the posterior mean and variance have nice closed form expressions and we demonstrate how to compute highest posterior density intervals. Maple and R code is provided on the authors’ web-page to allow students to verify the theoretical results and experiment with data.
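As a complement, a minimal R sketch of the Bayesian calculation (a numerical illustration under assumptions stated here, not the authors' published code): under an improper uniform prior on the population size N and a simple random sample of k serial numbers with maximum m, the posterior is proportional to 1/choose(N, k) for N >= m, and the well-known closed form of the posterior mean, (m - 1)(k - 1)/(k - 2) for k >= 3, can serve as a check.

# Illustrative data: k observed serial numbers
x <- c(47, 12, 133, 71, 58)
k <- length(x)                               # sample size
m <- max(x)                                  # largest observed serial number

# Posterior over N under an improper uniform prior:
# p(N | data) is proportional to 1 / choose(N, k) for N >= m
N_grid <- m:(200 * m)                        # numerical truncation of the support
post <- 1 / choose(N_grid, k)
post <- post / sum(post)                     # normalize

post_mean <- sum(N_grid * post)              # compare with (m - 1) * (k - 1) / (k - 2)
post_var  <- sum(N_grid^2 * post) - post_mean^2

# The posterior is decreasing in N, so the 95% HPD interval is [m, N_upper],
# with N_upper the smallest value whose cumulative posterior mass reaches 0.95
N_upper <- N_grid[which(cumsum(post) >= 0.95)[1]]
c(mean = post_mean, sd = sqrt(post_var), hpd_lower = m, hpd_upper = N_upper)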
In many applications it is known that the underlying smooth function is constrained to have a specific form. In the present paper, we propose an estimation method based on the regression spline approach, which makes it possible to include concavity or convexity constraints in an appealing way. Instead of using linear or quadratic programming routines, we handle the required inequality constraints on basis coefficients by boosting techniques. Therefore, recently developed componentwise boosting methods for regression purposes are applied, which make it possible to control the restrictions in each iteration. The proposed approach is compared to several competitors in a simulation study. We also consider a real world data set.
In general, risk of an extreme outcome in financial markets can be expressed as a function of the tail copula of a high-dimensional vector after standardizing marginals. Hence it is of importance to model and estimate tail copulas. Even for moderate dimension, nonparametrically estimating a tail copula is very inefficient and fitting a parametric model to tail copulas is not robust. In this paper we propose a semi-parametric model for tail copulas via an elliptical copula. Based on this model assumption, we propose a novel estimator for the tail copula, which proves favourable compared to the empirical tail copula, both theoretically and empirically.
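Recall the standard definition used in this context, shown here in the bivariate case for illustration (the paper works with higher-dimensional vectors): for a random vector $(X_1, X_2)$ with marginal distribution functions $F_1, F_2$, the (upper) tail copula is
\[ \Lambda(x, y) = \lim_{t \downarrow 0} \, t^{-1}\, P\big( 1 - F_1(X_1) \le t x, \; 1 - F_2(X_2) \le t y \big), \qquad x, y \ge 0, \]
provided the limit exists; it describes the joint behaviour of extremes after standardizing the marginals.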
For an AR(1) process with ARCH(1) errors, we propose empirical likelihood tests for testing whether the sequence is strictly stationary but has infinite variance, whether it is an ARCH(1) sequence, or whether it is an iid sequence. Moreover, an empirical likelihood based confidence interval for the parameter in the AR part is proposed. None of these results requires more than a finite second moment of the innovations. This includes the case of t-innovations for any degree of freedom larger than 2, which serves as a prominent model for real data.
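For reference, the model under consideration is typically written as (notation assumed here)
\[ X_t = \phi X_{t-1} + \varepsilon_t, \qquad \varepsilon_t = \eta_t \sqrt{\alpha_0 + \alpha_1 \varepsilon_{t-1}^2}, \]
with i.i.d. innovations $\eta_t$ of mean zero and unit variance; $\phi = 0$ reduces the model to an ARCH(1) sequence, and $\phi = 0$ together with $\alpha_1 = 0$ gives an iid sequence.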
Recently there has been an increasing interest in applying elliptical distributions to risk management. Under weak conditions, Hult and Lindskog (2002) showed that a random vector with an elliptical distribution is in the domain of attraction of a multivariate extreme value distribution. In this paper we study two estimators for the tail dependence function, which are based on extreme value theory and the structure of an elliptical distribution, respectively. After deriving second order regular variation estimates and proving asymptotic normality for both estimators, we show that the estimator based on the structure of an elliptical distribution is better than that based on extreme value theory in terms of both asymptotic variance and optimal asymptotic mean squared error. Our theoretical results are confirmed by a simulation study.
In this article we introduce a latent variable model (LVM) for mixed ordinal and continuous responses, where covariate effects on the continuous latent variables are modelled through a flexible semiparametric predictor. We extend existing LVM with simple linear covariate effects by including nonparametric components for nonlinear effects of continuous covariates and interactions with other covariates as well as spatial effects. Full Bayesian modelling is based on penalized spline and Markov random field priors and is performed by computationally efficient Markov chain Monte Carlo (MCMC) methods. We apply our approach to a large German social science survey which motivated our methodological development.
Microaggregation is one of the most important statistical disclosure control techniques for continuous data. The basic principle of microaggregation is to group the observations in a data set and to replace them by their corresponding group means. In this paper, we consider single-axis sorting, a frequently applied microaggregation technique where the formation of groups depends on the magnitude of a sorting variable related to the variables in the data set. The paper deals with the impact of this technique on a linear model in continuous variables. We show that parameter estimates are asymptotically biased if the sorting variable depends on the response variable of the linear model. Using this result, we develop a consistent estimator that removes the aggregation bias. Moreover, we derive the asymptotic covariance matrix of the corrected least squares estimator.
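A minimal R sketch of single-axis sorting (data, group size and variable names are illustrative, not taken from the paper): the observations are sorted by the sorting variable, partitioned into consecutive groups of approximately equal size, and each variable is replaced by its group mean; fitting the linear model to the aggregated data then illustrates the bias that arises when the sorting variable is related to the response.

set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                    # linear model of interest
s <- y + rnorm(n)                            # sorting variable related to the response
d <- data.frame(x, y, s)

k <- 3                                       # group size used for microaggregation
d_sorted <- d[order(d$s), ]                  # single-axis sorting by s
grp <- ceiling(seq_len(n) / k)               # consecutive groups of size k
d_sorted$x <- ave(d_sorted$x, grp)           # replace x by its group means
d_sorted$y <- ave(d_sorted$y, grp)           # replace y by its group means

coef(lm(y ~ x, data = d))                    # estimates from the original data
coef(lm(y ~ x, data = d_sorted))             # naive estimates from the aggregated data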
Most epidemiological studies suffer from misclassification in the response and/or the covariates. Since ignoring misclassification induces bias in the parameter estimates, correction for such errors is important. For measurement error, the continuous analogue of misclassification, a general approach for bias correction is the SIMEX (simulation extrapolation) method originally suggested by Cook and Stefanski (1994). This approach has recently been extended to regression models with a possibly misclassified categorical response and/or covariates by Küchenhoff et al. (2005), and is called the MC-SIMEX approach. To assess the importance of a regressor, not only its (corrected) estimate is needed, but also its standard error. For the original SIMEX approach, Carroll et al. (1996) developed a method for estimating the asymptotic variance. Here we derive the asymptotic variance estimators for the MC-SIMEX approach, extending the methodology of Carroll et al. (1996). We also include the case where the misclassification probabilities are estimated by a validation study. An extensive simulation study shows the good performance of our approach. The approach is illustrated using an example in caries research including a logistic regression model, where the response and a binary covariate are possibly misclassified.
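Schematically, the MC-SIMEX logic described in the cited literature works as follows (notation assumed here): given an estimated misclassification matrix $\Pi$, additional misclassification is simulated so that the data are effectively subject to $\Pi^{1+\lambda}$ for a grid of values $\lambda > 0$; naive estimates $\hat{\theta}(\lambda)$ are computed for each $\lambda$, an extrapolation function is fitted to them and evaluated at $\lambda = -1$, which corresponds to error-free data,
\[ \hat{\theta}(\lambda) \approx G(\lambda; \gamma), \qquad \hat{\theta}_{\mathrm{MC\text{-}SIMEX}} = G(-1; \hat{\gamma}), \]
where $G$ is, for example, a quadratic function of $\lambda$ fitted by least squares to the simulated estimates.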
Count data often exhibit overdispersion and/or require an adjustment for zero outcomes with respect to a Poisson model. Zero-modified Poisson (ZMP) and zero-modified generalized Poisson (ZMGP) regression models are useful classes of models for such data. In the literature, so far only score tests have been used for testing the necessity of this adjustment. For this testing problem, we show through a simulation study how poor the performance of the corresponding score test can be in comparison to that of the Wald and likelihood ratio (LR) tests. In particular, the score test in the ZMP case results in a power loss of 47% compared to the Wald test in the worst case, while in the ZMGP case the worst loss is 87%. Therefore, regardless of the computational advantage of score tests, the loss in power compared to the Wald and LR tests should not be neglected, and these much more powerful alternatives should be used instead. We also prove consistency and asymptotic normality of the maximum likelihood estimators in the above mentioned regression models to give a theoretical justification for the Wald and likelihood ratio tests.
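For reference, a zero-modified Poisson distribution adjusts the probability of a zero outcome relative to a Poisson($\mu$) baseline (notation assumed here; the ZMGP model replaces the Poisson kernel by the generalized Poisson one),
\[ P(Y = 0) = \omega + (1 - \omega)\, e^{-\mu}, \qquad P(Y = y) = (1 - \omega)\, \frac{e^{-\mu} \mu^{y}}{y!}, \quad y = 1, 2, \dots, \]
where $\omega = 0$ recovers the Poisson model, $\omega > 0$ corresponds to zero inflation and $\omega < 0$ (within its admissible range) to zero deflation; the tests discussed above concern the hypothesis $\omega = 0$.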
In this paper we consider regression models for count data allowing for overdispersion in a Bayesian framework. We account for unobserved heterogeneity in the data in two ways. On the one hand, we consider more flexible models than a common Poisson model allowing for overdispersion in different ways. In particular, the negative binomial and the generalized Poisson distribution are addressed where overdispersion is modelled by an additional model parameter. Further, zero-inflated models in which overdispersion is assumed to be caused by an excessive number of zeros are discussed. On the other hand, extra spatial variability in the data is taken into account by adding spatial random effects to the models. This approach allows for an underlying spatial dependency structure which is modelled using a conditional autoregressive prior based on Pettitt et al. (2002). In an application the presented models are used to analyse the number of invasive meningococcal disease cases in Germany in the year 2004. Models are compared according to the deviance information criterion (DIC) suggested by Spiegelhalter et al. (2002) and using proper scoring rules, see for example Gneiting and Raftery (2004). We observe a rather high degree of overdispersion in the data which is captured best by the GP model when spatial effects are neglected. While the addition of spatial effects to the models allowing for overdispersion gives no or only little improvement, a spatial Poisson model is to be preferred over all other models according to the considered criteria.
Binary outcomes that depend on an ordinal predictor in a non-monotonic way are common in medical data analysis. Such patterns can be addressed in terms of cutpoints: for example, one looks for two cutpoints that define an interval in the range of the ordinal predictor for which the probability of a positive outcome is particularly high (or low). A chi-square test may then be performed to compare the proportions of positive outcomes in and outside this interval. However, if the two cutpoints are chosen to maximize the chi-square statistic, referring the obtained chi-square statistic to the standard chi-square distribution is an inappropriate approach. It is then necessary to correct the p-value for multiple comparisons by considering the distribution of the maximally selected chi-square statistic instead of the nominal chi-square distribution. Here, we derive the exact distribution of the chi-square statistic obtained by the optimal two cutpoints. We suggest a combinatorial computation method and illustrate our approach by a simulation study and an application to varicella data.
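For illustration, a minimal R sketch of the maximally selected statistic itself (data and cutpoint search are illustrative; the exact null distribution derived in the paper is not reproduced here): for every pair of cutpoints defining an interval of the ordinal predictor, the 2x2 chi-square statistic comparing outcomes inside versus outside the interval is computed, and the maximum over all pairs is taken.

set.seed(1)
z <- sample(1:6, 200, replace = TRUE)              # ordinal predictor with levels 1..6
y <- rbinom(200, 1, ifelse(z %in% 3:4, 0.6, 0.3))  # binary outcome, elevated risk on [3, 4]

lev <- sort(unique(z))
max_chisq <- 0
best <- c(NA, NA)
for (a in seq_along(lev)) {
  for (b in a:length(lev)) {
    inside <- z >= lev[a] & z <= lev[b]
    if (all(inside) || !any(inside)) next           # interval must split the sample
    stat <- suppressWarnings(
      chisq.test(table(inside, y), correct = FALSE)$statistic
    )
    if (stat > max_chisq) { max_chisq <- stat; best <- c(lev[a], lev[b]) }
  }
}
max_chisq   # maximally selected chi-square statistic
best        # the two selected cutpoints (interval bounds)
# Referring max_chisq to a chi-square(1) distribution would ignore the selection;
# the p-value has to be based on the distribution of the maximum instead.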
We prove that the quasi-score estimator in a mean-variance model is optimal in the class of (unbiased) linear score estimators, in the sense that the difference of the asymptotic covariance matrices of the linear score and quasi-score estimator is positive semi-definite. We also give conditions under which this difference is zero or under which it is positive definite. This result can be applied to measurement error models where it implies that the quasi-score estimator is asymptotically more efficient than the corrected score estimator.
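For orientation, in a mean-variance model with mean function $\mu(x, \theta)$ and variance function $v(x, \theta)$, the quasi-score estimator solves the estimating equation (standard form, assumed here)
\[ \sum_{i=1}^{n} \frac{\partial \mu(x_i, \theta)}{\partial \theta}\, v(x_i, \theta)^{-1} \big( y_i - \mu(x_i, \theta) \big) = 0, \]
which is an unbiased estimating function that is linear in the response; the optimality result states that its asymptotic covariance matrix is smallest within this class.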