Javascript required
Skip to content Skip to sidebar Skip to footer

Influence Estimation and Maximization in Continuous time Diffusion Networks

  • Journal List
  • HHS Author Manuscripts
  • PMC4704803

Adv Neural Inf Process Syst. Author manuscript; available in PMC 2016 Jan 7.

Published in final edited form as:

Adv Neural Inf Process Syst. 2013; 26: 3147–3155.

PMCID: PMC4704803

NIHMSID: NIHMS578665

Scalable Influence Estimation in Continuous-Time Diffusion Networks

Nan Du

*Georgia Institute of Technology

Le Song

*Georgia Institute of Technology

Manuel Gomez-Rodriguez

MPI for Intelligent Systems

Hongyuan Zha

*Georgia Institute of Technology

Abstract

If a piece of information is released from a media site, can we predict whether it may spread to one million web pages, in a month ? This influence estimation problem is very challenging since both the time-sensitive nature of the task and the requirement of scalability need to be addressed simultaneously. In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks. Our algorithm can estimate the influence of every node in a network with |V| nodes and |ε| edges to an accuracy of ε using n = O(1/ε 2) randomizations and up to logarithmic factors O(n|ε|+n|V|) computations. When used as a subroutine in a greedy influence maximization approach, our proposed algorithm is guaranteed to find a set of C nodes with the influence of at least (1 − 1/e) OPT − 2Cε , where OPT is the optimal value. Experiments on both synthetic and real-world data show that the proposed algorithm can easily scale up to networks of millions of nodes while significantly improves over previous state-of-the-arts in terms of the accuracy of the estimated influence and the quality of the selected nodes in maximizing the influence.

1 Introduction

Motivated by applications in viral marketing [1], researchers have been studying the influence maximization problem: find a set of nodes whose initial adoptions of certain idea or product can trigger, in a time window, the largest expected number of follow-ups. For this purpose, it is essential to accurately and efficiently estimate the number of follow-ups of an arbitrary set of source nodes within the given time window. This is a challenging problem for that we need first accurately model the timing information in cascade data and then design a scalable algorithm to deal with large real-world networks. Most previous work in the literature tackled the influence estimation and maximization problems for infinite time window [1, 2, 3, 4, 5, 6]. However, in most cases, influence must be estimated or maximized up to a given time, i.e., a finite time window must be considered [7]. For example, a marketer would like to have her advertisement viewed by a million people in one month, rather than in one hundred years. Such time-sensitive requirement renders those algorithms which only consider static information, such as network topologies, inappropriate in this context.

A sequence of recent work has argued that modeling cascade data and information diffusion using continuous-time diffusion networks can provide significantly more accurate models than discrete-time models [8, 9, 10, 11, 12, 13, 14, 15]. There is a twofold rationale behind this modeling choice. First, since follow-ups occur asynchronously, continuous variables seem more appropriate to represent them. Artificially discretizing the time axis into bins introduces additional tuning parameters, like the bin size, which are not easy to choose optimally. Second, discrete time models can only describe transmission times which obey an exponential density, and hence can be too restricted to capture the rich temporal dynamics in the data. Extensive experimental comparisons on both synthetic and real world data showed that continuous-time models yield significant improvement in settings such as recovering hidden diffusion network structures from cascade data [8, 10] and predicting the timings of future events [9, 14].

However, estimating and maximizing influence based on continuous-time diffusion models also entail many challenges. First, the influence estimation problem in this setting is a difficult graphical model inference problem, i.e., computing the marginal density of continuous variables in loopy graphical models. The exact answer can be computed only for very special cases. For example, Gomez-Rodriguez et al. [7] have shown that the problem can be solved exactly when the transmission functions are exponential densities, by using continuous time Markov processes theory. However, the computational complexity of such approach, in general, scales exponentially with the size and density of the network. Moreover, extending the approach to deal with arbitrary transmission functions would require additional nontrivial approximations which would increase even more the computational complexity. Second, it is unclear how to scale up influence estimation and maximization algorithms based on continuous-time diffusion models to millions of nodes. Especially in the maximization case, even a naive sampling algorithm for approximate inference is not scalable: n sampling rounds need to be carried out for each node to estimate the influence, which results in an overall O(n|V||ε|) algorithm. Thus, our goal is to design a scalable algorithm which can perform influence estimation and maximization in this regime of networks with millions of nodes.

In particular, we propose ConTinEst (Continous-Time Influence Estimation), a scalable randomized algorithm for influence estimation in a continuous-time diffusion network with heterogeneous edge transmission functions. The key idea of the algorithm is to view the problem from the perspective of graphical model inference, and reduces the problem to a neighborhood estimation problem in graphs. Our algorithm can estimate the influence of every node in a network with |V| nodes and |ε| edges to an accuracy of ε using n = O(1/ε 2) randomizations and up to logarithmic factors O(n|ε| + n|V|) computations. When used as a subroutine in a greedy influence maximization algorithm, our proposed algorithm is guaranteed to find a set of nodes with an influence of at least (1 − 1/e) OPT − 2Cε , where OPT is the optimal value. Finally, we validate ConTinEst on both influence estimation and maximization problems over large synthetic and real world datasets. In terms of influence estimation, ConTinEst is much closer to the true influence and much faster than other state-of-the-art methods. With respect to the influence maximization, ConTinEst allows us to find a set of sources with greater influence than other state-of-the-art methods.

2 Continuous-Time Diffusion Networks

First, we revisit the continuous-time generative model for cascade data in social networks introduced in [10]. The model associates each edge ji with a transmission function, fji (τji ), a density over time, in contrast to previous discrete-time models which associate each edge with a fixed infection probability [1]. Moreover, it also differs from discrete-time models in the sense that events in a cascade are not generated iteratively in rounds, but event timings are sampled directly from the transmission function in the continuous-time model.

Continuous-Time Independent Cascade Model

Given a directed contact network, G = (V, ε), we use a continuous-time independent cascade model for modeling a diffusion process [10]. The process begins with a set of infected source nodes, A, initially adopting certain contagion (idea, meme or product) at time zero. The contagion is transmitted from the sources along their out-going edges to their direct neighbors. Each transmission through an edge entails a random spreading time, τ, drawn from a density over time, fji (τ). We assume transmission times are independent and possibly distributed differently across edges. Then, the infected neighbors transmit the contagion to their respective neighbors, and the process continues. We assume that an infected node remains infected for the entire diffusion process. Thus, if a node i is infected by multiple neighbors, only the neighbor that first infects node i will be the true parent. As a result, although the contact network can be an arbitrary directed network, each cascade (a vector of event timing information from the spread of a contagion) induces a Directed Acyclic Graph (DAG).

Heterogeneous Transmission Functions

Formally, the transmission function fji (ti tj ) for directed edge ji is the conditional density of node i getting infected at time ti given that node j was infected at time tj . We assume it is shift invariant: fji (ti tj ) = fji (τji ), where τji := ti tj , and nonnegative: fji (τji ) = 0 if τji < 0. Both parametric transmission functions, such as the exponential and Rayleigh function [10, 16], and nonparametric function [8] can be used and estimated from cascade data (see Appendix A for more details).

Shortest-Path property

The independent cascade model has a useful property we will use later: given a sample of transmission times of all edges, the time ti taken to infect a node i is the length of the shortest path in G from the sources to node i, where the edge weights correspond to the associated transmission times.

3 Graphical Model Perspectives for Continuous-Time Diffusion Networks

The continuous-time independent cascade model is essentially a directed graphical model for a set of dependent random variables, the infection times ti of the nodes, where the conditional independence structure is supported on the contact network G (see Appendix B for more details). More formally, the joint density of {ti } iV can be expressed as

p ({t i } iV ) = ∏ iV p (t i ∣{t j } jπ i ),

(1)

where πi denotes the set of parents of node i in a cascade-induced DAG, and p(ti ∣{tj } jπi ) is the conditional density of infection ti at node i given the infection times of its parents.

Instead of directly modeling the infection times ti , we can focus on the set of mutually independent random transmission times τji = ti tj . Interestingly, by switching from a node-centric view to an edge-centric view, we obtain a fully factorized joint density of the set of transmission times

p ({τ ji }(j,i)∈ε ) = ∏(j,i)∈ε f ji (τ ji ),

(2)

Based on the Shortest-Path property of the independent cascade model, each variable ti can be viewed as a transformation from the collection of variables {τji }(j,i)∈ε .

More specifically, let Qi be the collection of directed paths in G from the source nodes to node i, where each path qQi contains a sequence of directed edges (j, l). Assuming all source nodes are infected at zero time, then we obtain variable ti via

t i = g i ( { τ ji } ( j , i ) ε ) : = min q Q i ( j , l ) q τ jl ,

(3)

where the transformation gi (·) is the value of the shortest-path minimization. As a special case, we can now compute the probability of node i infected before T using a set of independent variables:

Pr{t i  ≤T} = Pr {g i ({τ ji }(j,i)∈ε ) ≤T}.

(4)

The significance of the relation is that it allows us to transform a problem involving a sequence of dependent variables {ti } iV to one with independent variables {τji }(j,i)∈ε . Furthermore, the two perspectives are connected via the shortest path algorithm in weighted directed graph, a standard well-studied operation in graph analysis.

4 Influence Estimation Problem in Continuous-Time Diffusion Networks

Intuitively, given a time window, the wider the spread of infection, the more influential the set of sources. We adopt the definition of influence as the average number of infected nodes given a set of source nodes and a time window, as in previous work [7]. More formally, consider a set of C source nodes AV which gets infected at time zero, then, given a time window T, a node i is infected in the time window if ti T. The expected number of infected nodes (or the influence) given the set of transmission functions {fji }(j,i)∈ε can be computed as

σ(A,T) = E [∑ iV I{t i  ≤T}] = ∑ iV E [I {t i  ≤T}] = ∑ iV Pr{t i  ≤T},

(5)

where I {·} is the indicator function and the expectation is taken over the the set of dependent variables {ti } iV .

Essentially, the influence estimation problem in Eq. (5) is an inference problem for graphical models, where the probability of event ti T given sources in A can be obtained by summing out the possible configuration of other variables {tj } ji . That is

Pr { t i T } = 0 t i = 0 T 0 ( j V p ( t j { t l } l π j ) ) ( j V d t j ) ,

(6)

which is, in general, a very challenging problem. First, the corresponding directed graphical models can contain nodes with high in-degree and high out-degree. For example, in Twitter, a user can follow dozens of other users, and another user can have hundreds of "followees". The tree-width corresponding to this directed graphical model can be very high, and we need to perform integration for functions involving many continuous variables. Second, the integral in general can not be evaluated analytically for heterogeneous transmission functions, which means that we need to resort to numerical integration by discretizing the domain [0, ∞). If we use N level of discretization for each variable, we would need to enumerate O(N |πi |) entries, exponential in the number of parents.

Only in very special cases, can one derive the closed-form equation for computing Pr{ti T} [7]. However, without further heuristic approximation, the computational complexity of the algorithm is exponential in the size and density of the network. The intrinsic complexity of the problem entails the utilization of approximation algorithms, such as mean field algorithms or message passing algorithms. We will design an efficient randomized (or sampling) algorithm in the next section.

5 Efficient Influence Estimation in Continuous-Time Diffusion Networks

Our first key observation is that we can transform the influence estimation problem in Eq. (5) into a problem with independent variables. Using relation in Eq. (4), we have

σ(A,T) = ∑ iV Pr {g i ({τ ij }(j,i)∈ε ) ≤T} = E [∑ iV I {g i ({τ ji }(j,i)∈ε ) ≤T}],

(7)

where the expectation is with respect to the set of independent variables {τji }(j,i)∈ε . This equivalent formulation suggests a naive sampling (NS) algorithm for approximating σ(A, T): draw n samples of {τji }(j,i)∈ε , run a shortest path algorithm for each sample, and finally average the results (see Appendix C for more details). However, this naive sampling approach has a computational complexity of O(nC|V||ε| + nC|V|2 log |V|) due to the repeated calling of the shortest path algorithm. This is quadratic to the network size, and hence not scalable to millions of nodes.

Our second key observation is that for each sample {τji }(j,i)∈ε , we are only interested in the neighborhood size of the source nodes, i.e., the summation Σ iV I {·} in Eq. (7), rather than in the individual shortest paths. Fortunately, the neighborhood size estimation problem has been studied in the theoretical computer science literature. Here, we adapt a very efficient randomized algorithm by Cohen [17] to our influence estimation problem. This randomized algorithm has a computational complexity of O(|ε| log |V| + |V| log2 |V|) and it estimates the neighborhood sizes for all possible single source node locations. Since it needs to run once for each sample of {τji }(j,i)∈ε , we obtain an overall influence estimation algorithm with O(n|ε| log |V| + n|V|log2 |V|) computation, nearly linear in network size. Next we will revisit Cohen's algorithm for neighborhood estimation.

5.1 Randomized Algorithm for Single-Source Neighborhood Size Estimation

Given a fixed set of edge transmission times {τji }(j,i)∈ε and a source node s, infected at time 0, the neighborhood N(s, T) of a source node s given a time window T is the set of nodes within distance T from s, i.e.,

N(s,T) = {ig i ({τ ji }(j,i)∈ε ) ≤T,i ∈V}.

(8)

Instead of estimating N(s, T) directly, the algorithm will assign an exponentially distributed random label ri to each network node i. Then, it makes use of the fact that the minimum of a set of exponential random variables {ri } iN(s,T) will also be a exponential random variable, but with its parameter equals to the number of variables. That is if each ri ~ exp(−ri ), then the smallest label within distance T from source s, r * := min iN(s,T) ri , will distribute as r * ~ exp {−|N (s, T)|r *}. Suppose we randomize over the labeling m times, and obtain m such least labels, { r u } u = 1 m . Then the neighborhood size can be estimated as

which is shown to be an unbiased estimator of |N(s, T)| [17]. This is an interesting relation since it allows us to transform the counting problem in (8) to a problem of finding the minimum random label r *. The key question is whether we can compute the least label r * efficiently, given random labels {ri } iV and any source node s.

Cohen [17] designed a modified Dijkstra's algorithm (Algorithm 1) to construct a data structure r *(s), called least label list, for each node s to support such query. Essentially, the algorithm starts with the node i with the smallest label ri , and then it traverses in breadth-first search fashion along the reverse direction of the graph edges to find all reachable nodes. For each reachable node s, the distance d * between i and s, and ri are added to the end of r *(s). Then the algorithm moves to the node i′ with the second smallest label ri′ , and similarly find all reachable nodes. For each reachable node s, the algorithm will compare the current distance d * between i′ and s with the last recorded distance in r *(s). If the current distance is smaller, then the current d * and ri′ are added to the end of r *(s). Then the algorithm move to the node with the third smallest label and so on. The algorithm is summarized in Algorithm 1 in Appendix D.

Algorithm 1 returns a list r *(s) per node sV, which contains information about distance to the smallest reachable labels from s. In particular, each list contains pairs of distance and random labels, (d, r), and these pairs are ordered as

 >d (1) >d (2) > … >d (|r (s)|) = 0

(10)

r (1) <r (2) < … <r (|r (s)|),

(11)

where {·}(l) denotes the l-th element in the list. (see Appendix D for an example). If we want to query the smallest reachable random label r * for a given source s and a time T, we only need to perform a binary search on the list for node s:

r =r (l), where d (l−1) >T ≥d (l).

(12)

Finally, to estimate |N(s, T)|, we generate m i.i.d. collections of random labels, run Algorithm 1 on each collection, and obtain m values { r u } u = 1 m , which we use on Eq. (9) to estimate |N (i, T)|.

The computational complexity of Algorithm 1 is O(|ε|log |V| + |V|log2 |V|), with expected size of each r *(s) being O(log |V|). Then the expected time for querying r * is O(log log |V|) using binary search. Since we need to generate m set of random labels and run Algorithm 1 m times, the overall computational complexity for estimating the single-source neighborhood size for all sV is O(m|ε| log |V| + m|V| log2 |V| + m|V| log log |V|). For large scale network, and when m ≪ min{|V|, |ε|}, this randomized algorithm can be much more efficient than approaches based on directly calculating the shortest paths.

5.2 Constructing Estimation for Multiple-Source Neighborhood Size

When we have a set of sources, A, its neighborhood is the union of the neighborhoods of its constituent sources

N (A,T) = ∪ iA N (i,T).

(13)

This is true because each source independently infects its downstream nodes. Furthermore, to calculate the least label list r * corresponding to N (A, T), we can simply reuse the least label list r *(i) of each individual source iA. More formally,

r = min iA min jN (i,T) r j ,

(14)

where the inner minimization can be carried out by querying r * (i). Similarly, after we obtain m samples of r *, we can estimate |N (A, T)| using Eq. (9). Importantly, very little additional work is needed when we want to calculate r * for a set of sources A, and we can reuse work done for a single source. This is very different from a naive sampling approach where the sampling process needs to be done completely anew if we increase the source set. In contrast, using the randomized algorithm, only an additional constant-time minimization over |A| numbers is needed.

5.3 Overall Algorithm

So far, we have achieved efficient neighborhood size estimation of |N(A, T)| with respect to a given set of transmission times {τji }(j,i)∈ε . Next, we will estimate the influence by averaging over multiple sets of samples for {τji }(j,i)∈ε . More specifically, the relation from (7)

σ ( A , T ) = E { τ ji } ( j , i ) ε [ | N ( A , T ) | ] = E { τ ji } E { r 1 , , r m } { τ ji } [ m 1 u = 1 m r u ] ,

(15)

suggests the following overall algorithm

Continuous-Time Influence Estimation (ConTinEst):

  1. Sample n sets of random transmission times { τ ij l } ( j , i ) ε ~ ( j , i ) ε f ji ( τ ji )

  2. Given a set of { τ ij l } ( j , i ) ε , sample m sets of random labels { r i u } i V ~ i V exp ( r i )

  3. Estimate σ(A, T) by sample averages σ ( A , T ) 1 n l = 1 n ( ( m 1 ) / u l = 1 m r u l )

Importantly, the number of random labels, m, does not need to be very large. Since the estimator for |N(A, T)| is unbiased [17], essentially the outer-loop of averaging over n samples of random transmission times further reduces the variance of the estimator in a rate of O(1/n). In practice, we can use a very small m (e.g., 5 or 10) and still achieve good results, which is also confirmed by our later experiments. Compared to [2], the novel application of Cohen's algorithm arises for estimating influence for multiple sources, which drastically reduces the computation by cleverly using the least-label list from single source. Moreover, we have the following theoretical guarantee (see Appendix E for proof)

Theorem 1

Draw the following number of samples for the set of random transmission times

where Λ := max A:|A| ≤C 2σ(A, T)2/(m − 2) + 2Var(|N(A, T)|)(m − 1)/(m − 2) + 2/3 and |N(A, T)| ≤ a, and for each set of random transmission times, draw m set of random labels. Then | σ ̂ (A, T) − σ(A, T)| ≤ ε uniformly for all A with |A| ≤ C, with probability at least 1 − δ.

The theorem indicates that the minimum number of samples, n, needed to achieve certain accuracy is related to the actual size of the influence σ(A, T), and the variance of the neighborhood size |N(A, T)| over the random draw of samples. The number of random labels, m, drawn in the inner loop of the algorithm will monotonically decrease the dependency of n on σ(A, T). It suffices to draw a small number of random labels, as long as the value of σ(A, T)2/(m − 2) matches that of Var(|N(A, T)|). Another implication is that influence at larger time window T is harder to estimate, since σ(A, T) will generally be larger and hence require more random labels.

6 Influence Maximization

Once we know how to estimate the influence σ(A, T) for any AV and time window T efficiently, we can use them in finding the optimal set of C source nodes A* ⊆ V such that the expected number of infected nodes in G is maximized at T. That is, we seek to solve,

A = argmax|A|≤C σ (A,T),

(17)

where set A is the variable. The above optimization problem is NP-hard in general. By construction, σ(A, T) is a non-negative, monotonic nondecreasing function in the set of source nodes, and it can be shown that σ(A, T) satisfies a diminishing returns property called submodularity [7].

A well-known approximation algorithm to maximize monotonic submodular functions is the greedy algorithm. It adds nodes to the source node set A sequentially. In step k, it adds the node i which maximizes the marginal gain σ(A k−1 ∪ {i}; T) − σ(A k−1; T). The greedy algorithm finds a source node set which achieves at least a constant fraction (1 − 1/e) of the optimal [18]. Moreover, lazy evaluation [5] can be employed to reduce the required number of marginal gains per iteration. By using our influence estimation algorithm in each iteration of the greedy algorithm, we gain the following additional benefits:

First, at each iteration k, we do not need to rerun the full influence estimation algorithm (section 5.2). We just need to store the least label list r *(i) for each node iV computed for a single source, which requires expected storage size of O(|V| log |V|) overall.

Second, our influence estimation algorithm can be easily parallelized. Its two nested sampling loops can be parallelized in a straightforward way since the variables are independent of each other. However, in practice, we use a small number of random labels, and mn. Thus we only need to parallelize the sampling for the set of random transmission times {τji }. The storage of the least element lists can also be distributed.

However, by using our randomized algorithm for influence estimation, we also introduce a sampling error to the greedy algorithm due to the approximation of the influence σ(A, T). Fortunately, the greedy algorithm is tolerant to such sampling noise, and a well-known result provides a guarantee for this case (following an argument in [19, Th. 7.9]):

Theorem 2

Suppose the influence σ(A, T) for all A with |A| ≤ C are estimated uniformly with error ε and confidence 1 − δ, the greedy algorithm returns a set of sources  such that σ(Â, T) ≥ (1 − 1/e)OPT − 2Cε with probability at least 1 − δ.

7 Experiments

We evaluate the accuracy of the estimated influence given by ConTinEst and investigate the performance of influence maximization on synthetic and real networks. We show that our approach significantly outperforms the state-of-the-art methods in terms of both speed and solution quality.

Synthetic network generation

We generate three types of Kronecker networks [20]: (i) coreperiphery networks (parameter matrix: [0.9 0.5; 0.5 0.3]), which mimic the information diffusion traces in real world networks [21], (ii) random networks ([0.5 0.5; 0.5 0.5]), typically used in physics and graph theory [22] and (iii) hierarchical networks ([0.9 0.1; 0.1 0.9]) [10]. Next, we assign a pairwise transmission function for every directed edge in each type of network and set its parameters at random. In our experiments, we use the Weibull distribution [16], f ( t ; α , β ) = β α ( t α ) β 1 e ( t / α ) β , t ≥ 0, where α > 0 is a scale parameter and β > 0 is a shape parameter. The Weibull distribution (Wbl) has often been used to model lifetime events in survival analysis, providing more flexibility than an exponential distribution [16]. We choose α and β from 0 to 10 uniformly at random for each edge in order to have heterogeneous temporal dynamics. Finally, for each type of Kronecker network, we generate 10 sample networks, each of which has different α and β chosen for every edge.

Accuracy of the estimated influence

To the best of our knowledge, there is no analytical solution to the influence estimation given Weibull transmission function. Therefore, we compare ConTinEst with Naive Sampling (NS) approach (see Appendix C) by considering the highest degree node in a network as the source, and draw 1,000,000 samples for NS to obtain near ground truth. Figures 1(a) compares ConTinEst with the ground truth provided by NS at different time window T, from 0.1 to 10 in corre-periphery networks. For ConTinEst, we generate up to 10,000 random samples (or set of random waiting times), and 5 random labels in the inner loop. In all three networks, estimation provided by ConTinEst fits accurately the ground truth, and the relative error decreases quickly as we increase the number of samples and labels (Figures 1(b) and 1(c)). For 10,000 random samples with 5 random labels, the relative error is smaller than 0.01. (see Appendix F for additional results on the random and hierarchal networks)

An external file that holds a picture, illustration, etc.  Object name is nihms578665f1.jpg

For core-periphery networks with 1,024 nodes and 2,048 edges, (a) estimated influence for increasing time window T, and (b) fixing T = 10, relative error for increasing number of samples with 5 random labels, and (c) for increasing number of random labels with 10,000 random samples.

Scalability

We compare ConTinEst to the state-of-the-art method Influmax [7] and the Naive Sampling (NS) method in terms of runtime for the continuous-time influence estimation and maximization. For ConTinEst, we draw 10,000 samples in the outer loop, each having 5 random labels in the inner loop. For NS, we also draw 10,000 samples. The first two experiments are carried out in a single 2.4GHz processor. First, we compare the performance of increasingly selecting sources (from 1 to 10) on small core-periphery networks (Figure 2(a)). When the number of selected sources is 1, different algorithms essentially spend time estimating the influence for each node. ConTinEst outperforms other methods by order of magnitude and for the number of sources larger than 1, it can efficiently reuse computations for estimating influence for individual nodes. Dashed lines mean that a method did not finish in 24 hours, and the estimated run time is plotted. Next, we compare the run time for selecting 10 sources on core-periphery networks of 128 nodes with increasing densities (or the number of edges) (Figure 2(a)). Again, Influmax and NS are order of magnitude slower due to their respective exponential and quadratic computational complexity in network density. In contrast, the run time of ConTinEst only increases slightly with the increasing density since its computational complexity is linear in the number of edges (see Appendix F for additional results on the random and hierarchal networks). Finally, we evaluate the speed on large core-periphery networks, ranging from 100 to 1,000,000 nodes with density 1.5 in Figure 2(c). We report the parallel run time only for ConTinEst and NS (both are implemented by MPI running on 192 cores of 2.4Ghz) since Influmax is not scalable. In contrast to NS, the performance of ConTinEst increases linearly with the network size and can easily scale up to one million nodes.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f2.jpg

For core-periphery networks with T = 10, runtime for (a) selecting increasing number of sources in networks of 128 nodes and 320 edges; for (b) selecting 10 sources in networks of 128 nodes with increasing density; and for (c) selecting 10 sources with increasing network size from 100 to 1,000,000 fixing 1.5 density.

Real-world data

We first quantify how well each method can estimate the true influence in a real-world dataset. Then, we evaluate the solution quality of the selected sources for influence maximization. We use the MemeTracker dataset [23] which has 10,967 hyperlink cascades among 600 media sites. We repeatedly split all cascades into a 80% training set and a 20% test set at random for five times. On each training set, we learn the continuous-time model using NetRate [10] with exponential transmission functions. For discrete-time model, we learn the infection probabilities using [24] for IC, SP1M and PMIA. Similarly, for LT, we follow the methodology by [1]. Let C(u) be the set of all cascades where u was the source node. Based on C(u), the total number of distinct nodes infected before T quantifies the real influence of node u up to time T. In Figure 3(a), we report the Mean Absolute Error (MAE) between the real and the estimated influence. Clearly, ConTinEst performs the best statistically. Because the length of real cascades empirically conforms to a power-law distribution where most cascades are very short (2-4 nodes), the gap of the estimation error is relatively not large. However, we emphasize that such accuracy improvement is critical for maximizing long-term influence. The estimation error for individuals will accumulate along the spreading paths. Hence, any consistent improvement in influence estimation can lead to significant improvement to the overall influence estimation and maximization task, which is further confirmed by Figures 3(b) and 3(c) where we evaluate the influence of the selected nodes in the same spirit as influence estimation: the true influence is calculated as the total number of distinct nodes infected before T based on C(u) of the selected nodes. The selected sources given by ConTinEst achieve the best performance as we vary the number of selected sources and the observation time window.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f3.jpg

In MemeTracker dataset, (a) comparison of the accuracy of the estimated influence in terms of mean absolute error, (b) comparison of the influence of the selected nodes by fixing the observation window T = 5 and varying the number sources, and (c) comparison of the influence of the selected nodes by by fixing the number of sources to 50 and varying the time window.

8 Conclusions

We propose a randomized influence estimation algorithm in continuous-time diffusion networks, which can scale up to networks of millions of nodes while significantly improves over previous state-of-the-arts in terms of the accuracy of the estimated influence and the quality of the selected nodes in maximizing the influence. In future work, it will be interesting to apply the current algorithm to other tasks like influence minimization and manipulation, and design scalable algorithms for continuous-time models other than the independent cascade model.

Acknowledgments

Our work is supported by NSF/NIH BIGDATA 1R01GM108341-01, NSF IIS1116886, NSF IIS1218749, NSFC 61129001, a DARPA Xdata grant and Raytheon Faculty Fellowship of Gatech.

A Heterogeneous Transmission Functions

We denote the waiting time distribution, or transmission function, along a directed edge of G as fji (ti tj ). Formally, the transmission function fji (ti tj ) for directed edge ji is the conditional density of node i getting infected at time ti given that node j was infected at time tj . We assume it is shift invariant, i.e., fji (ti tj ) = fji (ti tj ) = fji (τji ), where τji := ti tj , and it takes positive values when τji ≥ 0, and the value of zero otherwise.

In most previous work, simple parametric transmission functions such as the exponential distribution αji exp(−αjiτji ), and the Rayleigh distribution α ji τ exp ( α ji τ ji 2 / 2 ) have been used [16, 10]. However, in many real world scenarios, information transmission between pairs of nodes can be heterogeneous and the waiting times can obey distributions that dramatically differ from these simple models. For instance, in viral marketing, active consumers could update their status instantly, while an inactive user may just log in and respond once a day. As a result, the transmission function between an active user and his friends can be quite different from that between an inactive user and his friends. As an attempt to model these complex scenarios, nonparametric transmission functions have been recently considered [8]. In such approach, the relationship between the survival function, the conditional intensity function or hazard, and the transmission function is exploited. In particular, the survival function is defined as S ji ( τ ji ) : = 1 0 τ ji f ji ( τ ) d τ and the hazard function is defined as hji (τji ) := fji (τji )/Sji (τji ). Then, it is a well-known result in survival theory that S ji ( τ ji ) = exp ( 0 τ ji h ji ( τ ) d τ ) and fji (τji ) = hji (τji )Sji (τji ). The advantage of using the conditional intensity function is that we do not need to explicitly enforce "the integral equals 1" constraint for the conditional density fji . Instead, we just need to ensure hji ≥ 0. This facilitates nonparametric modeling of the transmission function. For instance, we can define the conditional intensity function as a positive combination of n positive kernel functions k,

h ji ( τ ) = l = 1 n α l k ( τ l , τ ) , if τ > 0 , and 0 otherwise .

A common choice of the kernel function is the Gaussian RBF kernel k(τ′, τ) = exp(−∥ττ′∥2/2s 2). Nonparametric transmission functions significantly improve modeling of real world diffusion, as is shown in [8].

B A Graphical Model Perspective

Now, we look at the independent cascade model from the perspective of graphical models, where the collection of random variables includes the infection times ti of the nodes. Although the original contact graph G can contain directed loops, each diffusion process (or a cascade) induces a directed acyclic graph (DAG). For those cascades consistent with a particular DAG, we can model the joint density of ti using a directed graphical model:

p({t i } iV ) = ∏ iV p(t i ∣{t j } jπ i ),

(18)

where each πi denotes the collection of parents of node i in the induced DAG, and each term p(ti ∣{tj } jπi ) corresponds to a conditional density of tj given the infection times of the parents of node i. This is true because given the infection times of node i's parents, ti is independent of other infection times, satisfying the local Markov property of a directed graphical model. We note that the independent cascade model only specifies explicitly the pairwise transmission functions for each directed edge, but does not directly define the conditional density p(ti ∣{tj } jπi ).

However, these conditional densities can be derived from the pairwise transmission functions based on the Independent-Infection property [10]:

p(t i ∣{t j } jπ i ) = ∑ jπ i h ji (t i t j ) ∏ lπ i S(t i t l ),

(19)

which is the sum of the likelihoods that node i is infected by each parent node j. More precisely, each term in the summation can be interpreted as the instantaneous risk of node i being infected at ti by node j given that it has survived the infection of all parent nodes until time ti .

Perhaps surprisingly, the factorization in Eq. (18) is the same factorization that can be used for an arbitrary induced DAG consistent with the contact network G. In this case, we only need to replace the definition of πi (the parent of node i in the DAG) to the set of neighbors of node i with an edge pointing to node i in G. This is not immediately obvious from Eq. (18), since the contact network G can contain directed loops which may be in conflict with the conditional independence semantics of directed graphical models. The reason it is possible to do so is as follows: Any fixed set of infection times, t 1, …, td , induces an ordering of the infection times. If ti tj for an edge ji in G, hji (ti tj ) = 0, and the corresponding term in Eq. (19) is zeroed out, making the conditional density consistent with the semantics of directed graphical models.

Based on the joint density of the infection times in Eq. (18), we can perform various inference and learning tasks. For instance, previous work has used Eq. (18) for learning the parameters of the independent cascade model [8, 10, 11]. However, this may not be the most convenient form for addressing other inference problems, including the influence estimation problem in the next section. To this end, we propose an alternative view.

Instead of directly modeling the infection times ti , we can focus on the collection of mutually independent random transmission times τji = ti tj . In this case, the joint density of the collection of transmission times τji is fully factorized

p ({τ ji }(j,i)∈ε ) = ∏(j,i)∈ε f ji (τ ji ),

where ε denotes the set of edges in the contact network G — switching from the earlier node-centric view to the now edge-centric view. Based on the Shortest-Path property of the independent cascade model, variable ti can be viewed as a transformation from the collection of variables {τji }(j,i)∈ε . More specifically, let Qi be the collection of directed paths in G from the source nodes to node i, where each path qQi contains a sequence of directed edges (j, l), and assuming all source nodes are infected at zero time, then we obtain variable ti via

t i = g i ( { τ ji } ( j , i ) ε ) : = min q Q i ( j , l ) q τ jl ,

(20)

where gi (·) is the transformation.

Importantly, we can now compute the probability of infection of node i at ti using the set of variables {τji }(j,i)∈ε :

Pr{t i  ≤T} = Pr {g i ({τ ji }(j,i)∈ε ) ≤T}.

(21)

The significance of the relation is that it allows us to transform a problem involving a sequence of dependent variables {ti } iV to one with independent variables {τji }(j,i)∈ε . Furthermore, the two problems are connected via the shortest path algorithm in weighted directed graph, a standard well studied operation in graph analysis.

C Naive Sampling Algorithm

The graphical model perspective described in Section 3 and Appendix B suggests a naive sampling (NS) algorithm for approximating σ(A, T):

  1. Draw n samples, { { τ ji l } ( j , i ) ε } l = 1 n , i.i.d. from the waiting time product distribution Π(j,i)∈ε fji (τji );

  2. For each sample { τ ji l } ( j , i ) ε and for each node i, find the shortest path from source nodes to node i; count the number of nodes with g i ( { τ ji l } ( j , i ) ε ) T ;

  3. Average the counts across n samples.

Although the naive sampling algorithm can handle arbitrary transmission function, it is not scalable to networks with millions of nodes. We need to compute the shortest path for each node and each sample, which results in a computational complexity of O(n|ε| + n|V| log |V|) for a single source node. The problem is even more pressing in the influence maximization problem, where we need to estimate the influence of source nodes at different location and with increasing number of source nodes. To do this, the algorithm needs to be repeated, adding a multiplicative factor of C|V| to the computational complexity (C is the number of nodes to select). Then, the algorithm becomes quadratic in the network size. When the network size is in the order of thousands and millions, typical in modern social network analysis, the naive sampling algorithm become prohibitively expensive. Additionally, we may need to draw thousands of samples (n is large), further making the algorithm impractical for large scale problems.

D Least Label List

The notation "argsort((r 1, …, r |V|), ascend)" in line 2 of Algorithm 1 means that we sort the collection of random labels in ascending order and return the argument of the sort as an ordered list.

An external file that holds a picture, illustration, etc.  Object name is nihms578665u1.jpg

Figure 4 shows an example of the Least-Label-List. The nodes from a to g are assigned to exponentially distributed labels with mean one shown in each parentheses. Given a query distance 0.8 for node c, we can binary-search its Least-label-list r *(c) to find that node a belongs to this range with the smallest label r(a) = 1.5.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f4.jpg

Graph G = (ν, ε), edge weights {τji }(j,i)∈ε , and node labeling {ri } iν with the associated output from Algorithm 1.

  • Node labeling :

    • e(0.2) < b(0.3) < d(0.4) < a(1.5) < c(1.8) < g(2.2) < f(3.7)

  • Neighborhoods:

    • N(c, 2) = {a, b, c, e}; N(c, 3) = {a, b, c, d, e, f};

  • Least-label list:

    • r *(c) : (2, 0.2), (1, 0.3), (0.5, 1.5), (0, 1.8)

  • Query: r *(c, 0.8) = r(a) = 1.5

E Theorem 1

Theorem 1

Sample the following number of sets of random transmission times

where Λ := max A:|A|≤C 2σ(A, T)2/(m − 2) + 2Var(|N(A, T)|)(m − 1)/(m − 2) + 2/3, |N(A, T)| ≤ a, and for each set of random transmission times, sample m set of random labels. Then we can guarantee that

simultaneously for all A with |A| ≥ C, with probability at least 1 − δ.

Proof

Let Sτ := |N (A, T)| for a fixed set of {τji } and then σ(A, T) = E τ [Sτ ]. The randomized algorithm with m randomizations produces an unbiased estimator S ^ τ = ( m 1 ) / ( u = 1 m r u ) for Sτ, i.e., E rτ [Ŝτ ] = Sτ , with variance E r τ [ ( S ^ τ S τ ) 2 ] = S τ 2 / ( m 2 ) .

Then Ŝτ is also an unbiased estimator for σ(A, T), since E τ, r [Ŝτ ] = E τ E rτ [Ŝτ ] = E τ [Sτ ] = σ(A, T). Its variance is

Var ( S ^ τ ) : = E τ , r [ ( S ^ τ σ ( A , T ) ) 2 ] = E τ , r [ ( S ^ τ S τ + S τ σ ( A , T ) ) 2 ] = E τ , r [ ( S ^ τ S τ ) 2 ] + 2 E τ , r [ ( S ^ τ S τ ) ( S τ σ ( A , T ) ) ] + E τ , r [ ( S τ σ ( A , T ) ) 2 ] = E τ [ S τ 2 / ( m 2 ) ] + 0 + Var ( S τ ) = σ ( A , T ) 2 / ( m 2 ) + Var ( S τ ) ( m 1 ) / ( m 2 )

Then using Bernstein's inequality, we have, for our final estimator σ ^ ( A , T ) = 1 n l = 1 n S ^ τ l , that

Pr { | σ ^ ( A , T ) σ ( A , T ) | ε } 2 exp ( n ε 2 2 Var ( S ^ τ ) + 2 / 3 )

(22)

where Ŝτ < a ≤ |V|.

Setting the right hand side of relation (22) to δ, we have that, with probability 1 − δ, sampling the following number sets of random transmission times

n 2 Var ( S ^ τ ) + 2 / 3 ε 2 log ( 2 δ ) = 2 σ ( A , T ) 2 / ( m 2 ) + 2 Var ( S τ ) ( m 1 ) / ( m 2 ) + 2 / 3 ε 2 log ( 2 δ )

we can guarantee that our estimator to have error | σ ̂ (A, T) − σ(A, T)| ≤ ε.

If we want to insure that | σ ̂ (A, T) − σ(A, T)| ≤ ε simultaneously hold for all A such that |A| ≤ C ≪ |V|, we can first use union bound with relation (22). In this case, we have that, with probability 1 − δ, sampling the following number sets of random transmission times

we can guarantee that our estimator to have error | σ ̂ (A, T) − σ(A, T)| ≤ ε for all A with |A| ≤ C. Note that we have define the constant Λ := max A:|A|≤C 2σ(A, T)2/(m − 2) + 2Var(Sτ )(m − 1)/(m − 2) + 2/3.

F Additional Experimental Results

In this section, we report additional experimental results on accuracy of influence estimation, continuous-time influence maximization and scalability for the synthetic networks.

F.1 Accuracy of Influence Estimation

Figure 5 evaluates the estimated scope of influence for different time windows and the relative errors with respective to different number of random samples and labels on the random kronecker networks with 1,024 nodes and 2,048 edges. Figure 6 further reports similar results on the hierarchical kronecker networks. In all cases, the errors decrease dramatically as we draw more samples and labels.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f5.jpg

On the random kronecker networks with 1,024 nodes and 2,048 edges, panels show (a) the estimated influence with increasing time window T; (b) the average relative error for different number of samples, each of which has 5 random labels for every node; and (c) the average relative error for varying number of random labels assigned to every node in each of 10,000 samples. For both (b) and (c), we set T = 10.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f6.jpg

On the hierarchical kronecker networks with 1,024 nodes and 2,048 edges, panels show (a) the estimated influence with increasing time window T; (b) the average relative error for different number of samples, each of which has 5 random labels for every node; and (c) the average relative error for varying number of random labels assigned to every node in each of 10,000 samples. For both (b) and (c), we set T = 10.

In addition, because Influmax can produce exact closed form influence on sparse small networks with exponential transmission functions, we compare ConTinEst with Influmax in Figure 7, where we chose the highest degree node in the network as the source. We have drawn 10,000 random samples, each of which has 5 random labels for each node. ConTinEst outputs values of influence which are very close to the exact values given by Influmax, with relative error less than 0.01 in all three types of networks.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f7.jpg

Infected neighborhood size over three different types of networks with the exponential transmission function associated with each edge. Each type of network consists of 128 nodes and 141 edges. For panels (d-i), we set the observation window T = 10.

F.2 Continuous-time Influence Maximization

We compare ConTinEst to other influence maximization methods based on discrete-time diffusion models: traditional greedy [1], with discrete-time Linear Threshold Model (LT) and Independent Cascade Model (IC) diffusion models, and the heuristic methods SP1M [2] and PMIA [25]. For Influmax, since it only supports exponential pairwise transmission functions, we fit an exponential distribution per edge. Furthermore, Influmax is not scalable; when the average network density of the synthetic networks is ~ 2.0, the run time for Influmax is more than 24 hours. Instead, we present the results of ConTinEst using fitted exponential distributions (Exp). For the discrete-time IC model, we learn the infection probability within time window T using Netrapalli's method [24]. The learned pairwise infection probabilities are also served for SP1M and PMIA, which essentially approximately calculate the influence based on the IC model. For the discrete-time LT model, we set the weight of each incoming edge to a node u to the inverse of its in-degree, as in previous work [1], and choose each node's threshold uniformly at random. Figure 8 compares the expected number of infected nodes against source set size for different methods. ConTinEst outperforms the rest, and the competitive advantage becomes more dramatic the larger the source set grows. Figure 9 shows the expected number of infected nodes against the time window for 50 selected sources. Again, ConTinEst performs the best for all three types of networks.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f8.jpg

Panels present the influence against the number of sources by T = 5 on the networks having 1,024 nodes and 2,048 edges with heterogeneous Weibull transmission functions.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f9.jpg

Panels present the influence against the time window T using 50 sources on the networks having 1,024 nodes and 2,048 edges with heterogeneous Weibull transmission functions.

F.3 Scalability

Figure 10 compares ConTinEst to Influmax and the Naive Simulation (NS) method in terms of running time for the continuous-time influence maximization problem over the random and hierarchal kronecker type of networks, respectively, with different densities and sizes on a single 2.4Ghz CPU core. For ConTinEst, we have drawn 10,000 samples, each of which has 5 random labels assigned to each node. For NS, we follow the work [1] to run 10,000 Monte Carlo simulations. For running times longer than 24 hours, we use dashed line to qualitatively indicate the estimated performance based on the time complexity of each method.

An external file that holds a picture, illustration, etc.  Object name is nihms578665f10.jpg

Panels(a-b) show the running time against the network density by fixing the number of sources at 10 on the random and hierarchal kronecker network with 128 nodes. Panels(c-d) present the running time as we increase the number of selected source nodes on the networks with 128 nodes and 256 edges.

References

1. Kempe David, Kleinberg Jon, Tardos Éva. Maximizing the spread of influence through a social network. KDD. 2003:137–146. [Google Scholar]

2. Chen Wei, Wang Yajun, Yang Siyu. Efficient influence maximization in social networks. KDD. 2009:199–208. [Google Scholar]

3. Chen Wei, Yuan Yifei, Zhang Li. Scalable influence maximization in social networks under the linear threshold model. ICDM. 2010:88–97. [Google Scholar]

4. Goyal Amit, Bonchi Francesco, Lakshmanan LaksVS. A data-based approach to social influence maximization. Proc VLDB Endow. 2011:5. [Google Scholar]

5. Leskovec Jure, Krause Andreas, Guestrin Carlos, Faloutsos Christos, VanBriesen JeanneM, Glance NatalieS. Cost-effective outbreak detection in networks. KDD. 2007:420–429. [Google Scholar]

6. Richardson Matthew, Domingos Pedro. Mining knowledge-sharing sites for viral marketing. KDD. 2002:61–70. [Google Scholar]

7. Gomez-Rodriguez Manuel, Schölkopf Bernhard. Influence maximization in continuous time diffusion networks. ICML '12. 2012 [Google Scholar]

8. Du Nan, Song Le, Smola AlexanderJ, Yuan Ming. Learning networks of heterogeneous influence. NIPS. 2012 [Google Scholar]

9. Du Nan, Song Le, Woo Hyenkyun, Zha Hongyuan. Uncover topic-sensitive information diffusion networks. AISTATS. 2013 [Google Scholar]

10. Gomez-Rodriguez Manuel, Balduzzi David, Schölkopf Bernhard. Uncovering the temporal dynamics of diffusion networks. ICML. 2011:561–568. [Google Scholar]

11. Gomez-Rodriguez Manuel, Leskovec Jure, Schölkopf Bernhard. Structure and Dynamics of Information Pathways in On-line Media. WSDM. 2013 [Google Scholar]

12. Zhou Ke, Song Le, Zha Hongyuan. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. Artificial Intelligence and Statistics (AISTATS) 2013 [Google Scholar]

13. Zhou Ke, Zha Hongyuan, Song Le. Learning triggering kernels for multi-dimensional hawkes processes. International Conference on Machine Learning(ICML); 2013. [Google Scholar]

14. Gomez-Rodriguez Manuel, Leskovec Jure, Schölkopf Bernhard. Modeling information propagation with survival theory. ICML. 2013 [Google Scholar]

15. Yang Shuanghong, Zha Hongyuan. Mixture of mutually exciting processes for viral diffusion. International Conference on Machine Learning(ICML); 2013. [Google Scholar]

16. Lawless JeraldF. Statistical Models and Methods for Lifetime Data. Wiley-Interscience; 2002. [Google Scholar]

17. Cohen Edith. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences. 1997;55(3):441–453. [Google Scholar]

18. Nemhauser GL, Wolsey LA, Fisher ML. An analysis of approximations for maximizing submodular set functions. Mathematical Programming. 1978;14(1) [Google Scholar]

19. Krause Andreas. Ph D Thesis. CMU; 2008. [Google Scholar]

20. Leskovec Jure, Chakrabarti Deepayan, Kleinberg JonM, Faloutsos Christos, Ghahramani Zoubin. Kronecker graphs: An approach to modeling networks. JMLR. 2010;11 [Google Scholar]

21. Gomez-Rodriguez Manuel, Leskovec Jure, Krause Andreas. Inferring networks of diffusion and influence. KDD. 2010 [Google Scholar]

22. Easley David, Kleinberg Jon. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press; 2010. [Google Scholar]

23. Leskovec Jure, Backstrom Lars, Kleinberg JonM. Meme-tracking and the dynamics of the news cycle. KDD. 2009 [Google Scholar]

24. Netrapalli Praneeth, Sanghavi Sujay. SIGMETRICS/PERFORMANCE. ACM; 2012. Learning the graph of epidemic cascades; pp. 211–222. [Google Scholar]

25. Chen Wei, Wang Chi, Wang Yajun. Scalable influence maximization for prevalent viral marketing in large-scale social networks. KDD '10. 2010:1029–1038. [Google Scholar]

flanneryforad1979.blogspot.com

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4704803/